Forum Moderators: open
Planetlab says the project responsible is "umd_sidecar" which, judging by the description, is piggybacking on a browser session to download files from the server and in this case "emulating" a web-crawler.
Is anyone else being hit by this?
I posted the message below in this thread [webmasterworld.com] (robots.txt being requested through hundreds of open proxy servers) but nobody has confirmed a similar sighting until now.
Anyone seen this happen? Currently being hit by hundreds of requests for robots.txt from the same user-agent but different IPs. On researching the IPs, they all seem to be listed in open proxy server lists.
a.b.c.d - - [11/May/2006:08:42:00 +0000] "GET /robots.txt HTTP/1.1" 200 202 "-" "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.7.10) Gecko/20050720 Fedora/1.0.6-1.1.fc3 Firefox/1.0.6"
a.b.c.d - - [11/May/2006:08:42:00 +0000] "GET /robots.txt HTTP/1.1" 200 202 "-" "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.7.10) Gecko/20050720 Fedora/1.0.6-1.1.fc3 Firefox/1.0.6"
a.b.c.d - - [11/May/2006:08:42:07 +0000] "GET /robots.txt HTTP/1.1" 200 202 "-" "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.7.10) Gecko/20050720 Fedora/1.0.6-1.1.fc3 Firefox/1.0.6"
a.b.c.d - - [11/May/2006:08:42:07 +0000] "GET /robots.txt HTTP/1.1" 200 202 "-" "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.7.10) Gecko/20050720 Fedora/1.0.6-1.1.fc3 Firefox/1.0.6"
(where a.b.c.d is a different open proxy each time)
Been going on for about half an hour now!
Subsequently, I checked back through my logs and found it had been going on for a lot longer than I had first thought.
Saw this over the weekend, also this morning, and wondered
a) about the deeper sense of this operation, and
b) what will follow after it -- is this just a prelude to a parallel content-scraping attack from a distributed botnet of more than 200...300 nodes?
'"emulating" a web-crawler.' -- no, thanks!
Apparently nobody at Planetlab, which consists of a lot of 'white-hat' looking academic proxy nodes, is screening their 'researchers' enough for obvious nonsense or even abuse.
While -- from past experience -- they do seem to react to abuse complaints, that mostly comes too late, once the harm is already done. Their parallel capabilities carry a high potential for abuse and damage.
After watching this for a few days now, a hopefully comprehensive list of all their nodes' IP addresses has accumulated in the log files, so I will extract the list and feed it into the .htaccess later this evening.
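The "extract and block" step can be sketched roughly like this -- a minimal example assuming Apache's default combined log format, with placeholder file names and documentation-range IPs standing in for the real proxy addresses:

```shell
# Sample combined-format log lines (IPs are placeholders, not the real proxies):
cat > access.log <<'EOF'
192.0.2.10 - - [11/May/2006:08:42:00 +0000] "GET /robots.txt HTTP/1.1" 200 202 "-" "Mozilla/5.0 ... Firefox/1.0.6"
192.0.2.77 - - [11/May/2006:08:42:07 +0000] "GET /robots.txt HTTP/1.1" 200 202 "-" "Mozilla/5.0 ... Firefox/1.0.6"
192.0.2.10 - - [11/May/2006:08:42:30 +0000] "GET /robots.txt HTTP/1.1" 200 202 "-" "Mozilla/5.0 ... Firefox/1.0.6"
EOF

# Pull out the distinct client IPs that requested /robots.txt ($7 is the
# request path in the combined format) and turn them into Apache
# "deny from" lines ready to paste into .htaccess:
awk '$7 == "/robots.txt" {print $1}' access.log | sort -u | sed 's/^/deny from /'
```

Whether you filter on the path, the user-agent string, or both depends on how distinctive the crawler's fingerprint is in your own logs.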
Threat fixed. Case closed.
After all, they were wasting my time with this kind of stuff.
Kind regards,
R.
The Fasterfox extension will check for robots.txt. From the Fasterfox FAQ [fasterfox.mozdev.org]:
>>
Prior to generating any prefetching requests, Fasterfox checks for a file named "robots.txt" in your site's root directory (subdirectories are not checked). If this file contains the following 2 lines, no prefetching requests will be made to your domain:
User-agent: Fasterfox
Disallow: /
<<
This isn't news, but it still startles me to see Firefox requesting robots.txt (there's no indication in the UA string that any extension may be involved). But at least Firefox-with-whatever reads and heeds it!