discouraging Twitterbot?

Forum Moderators: phranque

Dan99

1:39 pm on Mar 26, 2015 (gmt 0)

OK, so someone did a Twitter post pointing to a file I have on the web. Not a big deal, but the file is a professional one, and I'd like to not encourage its access by social media.

So no sweat, I assume. In my robots.txt I have

User-agent: Twitterbot/1.0
Disallow: /

I also have Twitter IP addresses denied in my .htaccess file. As in

deny from 199.59.148.0/22

But Twitterbot keeps hammering me with

199.59.148.209 - - [26/Mar/2015:08:46:01 -0500] "GET /robots.txt HTTP/1.1" 200 454 "-" "Twitterbot/1.0"
199.59.148.209 - - [26/Mar/2015:08:46:01 -0500] "HEAD /myfile.htm HTTP/1.1" 403 - "-" "Twitterbot/1.0"

As in, I read your robots.txt file which tells me to go away, but I'll still poke at your file anyway, even if I can't get to it. Over and over and over and over. At least it's not demanding a lot of bandwidth, but it's getting kind of crazy. Go away already!

Is there a more compelling way to tell Twitterbot to go away?
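
One thing worth ruling out is the "/1.0" in the User-agent line. robots.txt groups are conventionally matched on the bare product token, without version information, and how a versioned token is treated depends on the parser. The sketch below compares both spellings using Python's standard urllib.robotparser. It only shows what that one parser does, not what Twitterbot does; the helper name `allowed` and the two sample strings are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Two spellings of the same rule. VERSIONED is what the post above uses;
# BARE drops the version from the product token.
VERSIONED = "User-agent: Twitterbot/1.0\nDisallow: /\n"
BARE = "User-agent: Twitterbot\nDisallow: /\n"


def allowed(robots_txt, user_agent, url):
    """Return True if this parser would let user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # Print what Python's parser decides for each spelling.
    for label, text in (("versioned", VERSIONED), ("bare", BARE)):
        print(label, allowed(text, "Twitterbot/1.0", "/myfile.htm"))
```

If the two spellings give different answers, the bare token is the safer one to put in robots.txt.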

not2easy

2:27 pm on Mar 26, 2015 (gmt 0)

Like any bot, you can prevent success but not attempt. :(

Dan99

2:32 pm on Mar 26, 2015 (gmt 0)

Well, Twitterbot is supposed to be a halfway responsible bot, and is at least rumored to respect robots.txt. It's bothering to look at robots.txt, but just not bothering to obey it. I just wanted to make sure I was asking right.

The Twitterbot developers forum makes it clear how to set up robots.txt so Twitterbot will feel welcome. That sort of implies that if it doesn't feel welcome, it'll go away.

not2easy

5:30 pm on Mar 26, 2015 (gmt 0)

Notice that the blocked page did not have a "GET" request.

Dan99

5:47 pm on Mar 26, 2015 (gmt 0)

Yes, that's correct. They keep hitting me with a HEAD request, which seems odd, considering that I have asked that they not GET it. Why would they be doing that? Do they want to know if what I'm asking them not to pay attention to is really there? That seems kinda pointless.

lucy24

10:28 pm on Mar 29, 2015 (gmt 0)

:: bump ::

I have nearly the identical question, and I don't see the answer: What, precisely, does the Twitterbot do?

My case is loosely analogous to Dan's: Some stray human arrived on my test site, which for posting purposes I will call google-plays-silly-buggers dot com, as a type-in. Although it's primarily a test site --it's where I do htaccess experiments that I'm not positive will not lead to a 500-class error-- and is therefore categorically roboted-out, I do have a few pages of content to amuse the type-ins. This one was apparently amused enough to tweet the URL, leading to a flurry of twitterbot visits; I make it twelve visits over two days. (I only run this site's logs every two weeks, so this is all in the past.)

Exactly as reported above, it gets robots.txt and then puts in a HEAD request-- only-- for the front page. I don't, of course, know whether what got tweeted was the front page or some inner URL. But I'm left wondering if a HEAD alone-- especially a HEAD for the front page-- isn't considered a robots.txt violation. They're not asking to see the page, just verifying that it exists.

Dan99

11:44 pm on Mar 29, 2015 (gmt 0)

I actually sent a message to Twitter (you know, they don't make it easy to do!), but I sent it to privacy@twitter.com, asking them that gee, I keep getting this c**p at my website from a URL that corresponds to you, with a user-agent that is you, and, if it's really you, would you PLEASE stop?!

All I got back was an automated response that it wasn't really a privacy-related question (though I could never find an e-mail to complain about what appeared to be their web spamming). Interestingly, the next day, the number of requests from them dropped dramatically, and then stopped entirely. This was after a week of them filling up my logs with junk.

But yes, I just don't understand what this website hammering is trying to accomplish, other than just being pissed off that I won't let them in.
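
For anyone who wants to measure this kind of traffic rather than eyeball it, a rough tally over the access log could look like the following. It assumes Apache combined log format, as in the two sample lines quoted earlier in the thread, and reuses the denied range from that post. The names `tally`, `LINE_RE`, and `DENIED` are illustrative only.

```python
import ipaddress
import re
from collections import Counter

# Sample lines copied from the first post in the thread.
LOG_LINES = [
    '199.59.148.209 - - [26/Mar/2015:08:46:01 -0500] "GET /robots.txt HTTP/1.1" 200 454 "-" "Twitterbot/1.0"',
    '199.59.148.209 - - [26/Mar/2015:08:46:01 -0500] "HEAD /myfile.htm HTTP/1.1" 403 - "-" "Twitterbot/1.0"',
]

# The range denied in .htaccess in that post.
DENIED = ipaddress.ip_network("199.59.148.0/22")

# Assumption: combined log format with quoted request, referer, and agent.
LINE_RE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"$'
)


def tally(lines, agent_substring="twitterbot"):
    """Count requests per (method, path, status, ip_in_denied_range)."""
    counts = Counter()
    for line in lines:
        m = LINE_RE.match(line.strip())
        if m is None or agent_substring not in m["agent"].lower():
            continue
        try:
            in_range = ipaddress.ip_address(m["ip"]) in DENIED
        except ValueError:
            # First field was a hostname rather than an IP address.
            in_range = False
        counts[(m["method"], m["path"], int(m["status"]), in_range)] += 1
    return counts


if __name__ == "__main__":
    for key, n in tally(LOG_LINES).items():
        print(n, key)
```

Feeding it the lines of a real log file (for example, `tally(open(path))` with your own path) gives a per-URL picture of how many requests were GETs, how many were HEADs, and which ones the deny rule actually turned away.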

Samizdata

12:14 am on Mar 30, 2015 (gmt 0)

I'm left wondering if a HEAD alone-- especially a HEAD for the front page-- isn't considered a robots.txt violation

Botrunners will argue that the Robots Exclusion Protocol only covers "recursively retrieving" files.

This one appears to be a link checker.
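
To make the HEAD-versus-GET distinction concrete: a HEAD request asks the server for the same status line and headers that a GET would produce, but with no body, which is why it suits link checking. The sketch below starts a throwaway server on localhost and probes it both ways. Everything in it (`DemoHandler`, `probe`, `run_demo`, the page body) is invented for illustration and says nothing about how Twitterbot is actually implemented.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

BODY = b"<html><body>hello</body></html>"


class DemoHandler(BaseHTTPRequestHandler):
    """Serves the same headers for HEAD and GET; only GET sends a body."""

    def _send_headers(self):
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.send_header("Content-Length", str(len(BODY)))
        self.end_headers()

    def do_GET(self):
        self._send_headers()
        self.wfile.write(BODY)

    def do_HEAD(self):
        self._send_headers()

    def log_message(self, format, *args):
        pass  # keep the demo quiet


def probe(method, port, path="/myfile.htm"):
    """Return (status, Content-Length header, bytes of body received)."""
    conn = http.client.HTTPConnection("127.0.0.1", port, timeout=5)
    try:
        conn.request(method, path, headers={"User-Agent": "Twitterbot/1.0"})
        resp = conn.getresponse()
        return resp.status, resp.getheader("Content-Length"), len(resp.read())
    finally:
        conn.close()


def run_demo():
    """Start a local server on a free port, probe it, then shut it down."""
    server = HTTPServer(("127.0.0.1", 0), DemoHandler)
    port = server.server_address[1]
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        return probe("HEAD", port), probe("GET", port)
    finally:
        server.shutdown()
        server.server_close()


if __name__ == "__main__":
    head, get = run_demo()
    print("HEAD:", head)
    print("GET: ", get)
```

Both probes see the same status and Content-Length, but only the GET transfers the page, so a HEAD-only visitor learns that the URL exists (or, as in the logs above, that it is forbidden) without retrieving any content.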

...