Forum Moderators: open
Planetlab says the project responsible is "umd_sidecar" which, judging by the description, is piggybacking on a browser session to download files from the server and in this case "emulating" a web-crawler.
Is anyone else being hit by this?
I posted the message below in this thread [webmasterworld.com] (robots.txt being requested through hundreds of open proxy servers) but nobody has confirmed a similar sighting until now.
Anyone seen this happen? Currently being hit by hundreds of requests for robots.txt from the same user-agent but different IPs. On researching the IPs, they all seem to be listed in open proxy server lists.
a.b.c.d - - [11/May/2006:08:42:00 +0000] "GET /robots.txt HTTP/1.1" 200 202 "-" "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.7.10) Gecko/20050720 Fedora/1.0.6-1.1.fc3 Firefox/1.0.6"
a.b.c.d - - [11/May/2006:08:42:00 +0000] "GET /robots.txt HTTP/1.1" 200 202 "-" "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.7.10) Gecko/20050720 Fedora/1.0.6-1.1.fc3 Firefox/1.0.6"
a.b.c.d - - [11/May/2006:08:42:07 +0000] "GET /robots.txt HTTP/1.1" 200 202 "-" "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.7.10) Gecko/20050720 Fedora/1.0.6-1.1.fc3 Firefox/1.0.6"
a.b.c.d - - [11/May/2006:08:42:07 +0000] "GET /robots.txt HTTP/1.1" 200 202 "-" "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.7.10) Gecko/20050720 Fedora/1.0.6-1.1.fc3 Firefox/1.0.6"
(where a.b.c.d is a different open proxy each time)
Been going on for about half an hour now!
Subsequently, I checked back through my logs and found it had been going on for a lot longer than I had first thought.
Saw this over the weekend, also this morning, and wondered
a) about the deeper sense of this operation, and
b) what will follow after it -- is this just a prelude to a parallel content-scraping attack from a distributed botnet of more than 200...300 nodes?
'"emulating" a web-crawler.' -- no, thanks!
Apparently nobody at Planetlab, which consists of a lot of 'white-hat' looking academic proxy nodes, is screening their 'researchers' enough for obvious nonsense or even abuse.
While -- from past experience -- they do seem to react to abuse complaints, that mostly comes too late, once the harm is already done. Their parallel capabilities carry a high potential for abuse and damage.
After watching this for a few days now, a hopefully comprehensive list of all their nodes' IP addresses has accumulated in the log files, so I will extract the list and feed it into the .htaccess later this evening.
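The "extract and block" step can be sketched roughly like this -- a minimal example assuming Apache's default combined log format, with placeholder file names and documentation-range IPs standing in for the real proxy addresses:

```shell
# Sample combined-format log lines (IPs are placeholders, not the real proxies):
cat > access.log <<'EOF'
192.0.2.10 - - [11/May/2006:08:42:00 +0000] "GET /robots.txt HTTP/1.1" 200 202 "-" "Mozilla/5.0 ... Firefox/1.0.6"
192.0.2.77 - - [11/May/2006:08:42:07 +0000] "GET /robots.txt HTTP/1.1" 200 202 "-" "Mozilla/5.0 ... Firefox/1.0.6"
192.0.2.10 - - [11/May/2006:08:42:30 +0000] "GET /robots.txt HTTP/1.1" 200 202 "-" "Mozilla/5.0 ... Firefox/1.0.6"
EOF

# Pull out the distinct client IPs that requested /robots.txt ($7 is the
# request path in the combined format) and turn them into Apache
# "deny from" lines ready to paste into .htaccess:
awk '$7 == "/robots.txt" {print $1}' access.log | sort -u | sed 's/^/deny from /'
```

Whether you filter on the path, the user-agent string, or both depends on how distinctive the crawler's fingerprint is in your own logs.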
Threat fixed. Case closed.
After all, they were wasting my time with this kind of stuff.
Kind regards,
R.
The Fasterfox extension will check for robots.txt. From the Fasterfox FAQ [fasterfox.mozdev.org]:
>>
Prior to generating any prefetching requests, Fasterfox checks for a file named "robots.txt" in your site's root directory (subdirectories are not checked). If this file contains the following 2 lines, no prefetching requests will be made to your domain:
User-agent: Fasterfox
Disallow: /
<<
This isn't news, but it still startles me to see Firefox requesting robots.txt (there's no indication in the UA string that any extension may be involved). But at least Firefox-with-whatever reads and heeds it!