lucy24 - 10:27 pm on Apr 9, 2013 (gmt 0)
Gosh, I haven't met one of these in ages.
22.214.171.124 - - [08/Apr/2013:16:15:39 -0700] "GET /hovercraft/hovercraft HTTP/1.1" 404 1248 "http://www.example.com/hovercraft/hovercraft" "ip-web-crawler.com"
126.96.36.199 - - [08/Apr/2013:16:15:39 -0700] "GET /hovercraft/costofliving HTTP/1.1" 404 1248 "http://www.example.com/hovercraft/costofliving" "ip-web-crawler.com"
188.8.131.52 - - [08/Apr/2013:16:15:39 -0700] "GET /hovercraft/duct_tape HTTP/1.1" 404 1248 "http://www.example.com/hovercraft/duct_tape" "ip-web-crawler.com"
184.108.40.206 - - [08/Apr/2013:16:15:40 -0700] "GET /hovercraft/outside HTTP/1.1" 404 1248 "http://www.example.com/hovercraft/outside" "ip-web-crawler.com"
220.127.116.11 - - [08/Apr/2013:16:15:43 -0700] "GET /fonts/note3 HTTP/1.1" 404 1248 "http://www.example.com/fonts/note3" "ip-web-crawler.com"
<a name = "hovercraft" id = "hovercraft"
<a class = "outside"
<a href = "#note3"
Wait, I haven't got to the punch line.
18.104.22.168 - - [08/Apr/2013:16:15:52 -0700] "GET /fun/nofollow HTTP/1.1" 404 1248 "http://www.example.com/fun/nofollow" "ip-web-crawler.com"
As cosgan dot de would say:
Fortunately it knew enough to strip the # from the end of links, or we'd be here all day. The General Index to the Paston Letters has got 20,000 of them.
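For anyone wondering how a robot arrives at /hovercraft/hovercraft: one guess (only a guess, the crawler's code isn't public) is that it grabs every quoted attribute value inside an <a> tag, not just href, drops a leading #, and resolves the result against the current page. A minimal Python sketch of that hypothetical behavior follows. The function name and the page URLs are my own assumptions for illustration; the logs don't say which pages the values were found on.

```python
from urllib.parse import urljoin

def naive_targets(page_url, attr_values):
    """Hypothetical reconstruction of the bug: treat every attribute
    value as a relative link, strip a leading '#', resolve it against
    the page it was found on."""
    return [urljoin(page_url, value.lstrip("#")) for value in attr_values]

# Assumed page URLs; only the directory part is known from the logs.
print(naive_targets("http://www.example.com/hovercraft/", ["hovercraft", "outside"]))
print(naive_targets("http://www.example.com/fonts/", ["#note3"]))
```

Under those assumptions the results line up with the 404s in the log, which is as much as can be said for the theory.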
Obligatory detour to www page discloses:
The purpose of the IP-Web-Crawler is to identify web sites that host or link to copyrighted content such as torrents, movies, applications and other copyrighted works.
For whose benefit is not revealed.
Q. Does IP-Web-Crawler recognize and respect robots.txt files?
A. Yes. IP-Web-Crawler always checks the robots.txt file first.
Incredibly, this appears to be true. It not only started with robots.txt, it didn't go into any roboted-out directories, although it definitely met links that would have taken it that way. Utterly ignored the "crawl-delay" directive, making 643 requests in less than a minute-- at which point it must have heard a knock at the door, because it went off in mid-crawl and never came back-- but, er, so do some search engines one could name.
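If you want to put a number on "ignored the crawl-delay", here is a rough Python sketch that reads the delay out of a robots.txt and counts how many consecutive requests came in faster than that. The robots.txt content and the 10-second value are made up for the example (I'm not quoting my real file), and the timestamps are the five from the log lines above.

```python
from datetime import datetime
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt; the delay value is an assumption.
ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 10
Disallow: /private/
"""

def declared_delay(robots_txt, agent):
    """Return the Crawl-delay (seconds) that applies to this agent, or None."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.crawl_delay(agent)

def count_violations(stamps, delay):
    """Count gaps between consecutive requests that are shorter than delay."""
    times = sorted(datetime.strptime(s, "%d/%b/%Y:%H:%M:%S %z") for s in stamps)
    return sum(1 for a, b in zip(times, times[1:]) if (b - a).total_seconds() < delay)

stamps = [
    "08/Apr/2013:16:15:39 -0700",
    "08/Apr/2013:16:15:39 -0700",
    "08/Apr/2013:16:15:39 -0700",
    "08/Apr/2013:16:15:40 -0700",
    "08/Apr/2013:16:15:43 -0700",
]
delay = declared_delay(ROBOTS_TXT, "ip-web-crawler.com")
print(delay, count_violations(stamps, delay))
```

With a 10-second delay, every one of the four gaps in that sample counts as a violation.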
currently we only crawl textual content, and not rich media content such as images or videos
IP-Web-Crawler only looks at text and HTML. It does not crawl images, files or rich media.
Apparently the crawler, like some folks hereabouts, does not know what the .midi extension signifies ;) It picked up 32 of them. It ran off to answer the phone before it got into the /games/ directory, so I never got a chance to see if it recognizes .sit and .dmg files.
They say the crawler range is 22.214.171.124-12. But whois says that 50.31.0-127 all belongs to one entity (steadfast.net), and .128-255 is an outfit called Server Central. So I didn't see any great loss in slamming the door on 50.31 collectively.
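In CIDR terms, "50.31 collectively" is 50.31.0.0/16, and the two whois halves are its two /17 subnets. A small Python sketch with the standard ipaddress module, in case you want to check an address against the block before committing to it. The helper name and the sample addresses are mine, picked only for illustration.

```python
import ipaddress

# The whole 50.31.x.x range as a single network.
BLOCKED = ipaddress.ip_network("50.31.0.0/16")

# Splitting one bit further gives the two halves whois reports:
# third octet 0-127 and third octet 128-255.
lower_half, upper_half = BLOCKED.subnets(prefixlen_diff=1)

def is_blocked(addr):
    """True if addr falls anywhere inside 50.31.0.0/16."""
    return ipaddress.ip_address(addr) in BLOCKED

print(lower_half, upper_half)
print(is_blocked("50.31.5.9"), is_blocked("50.32.0.1"))
```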
Besides, it made me notice that I'd neglected to include /AlonzoMelissaFull.html in my package of auto-referer blocks. It's done in htaccess, so I only list the largest files-- the ones that run over 200K for the html alone. There is a particular type of robot that goes straight for the fattest files, though I don't understand how it identifies them ahead of time. I expect there's a simple explanation.
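This is not the htaccess rule itself, just a sketch of how you might spot auto-referers in the raw logs with Python: parse a combined-format line and check whether the referer's path is the same as the path being requested. The regex and function name are my own, and comparing paths only (ignoring the hostname) is a simplification.

```python
import re
from urllib.parse import urlsplit

# Apache combined log format, as in the excerpts above.
LOG_RE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) \S+" (?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"$'
)

def is_auto_referer(line):
    """True if the request claims to have been referred by the page it asks for."""
    match = LOG_RE.match(line.strip())
    if not match:
        return False
    referer_path = urlsplit(match.group("referer")).path
    return bool(referer_path) and referer_path == match.group("path")

sample = ('22.214.171.124 - - [08/Apr/2013:16:15:39 -0700] '
          '"GET /hovercraft/hovercraft HTTP/1.1" 404 1248 '
          '"http://www.example.com/hovercraft/hovercraft" "ip-web-crawler.com"')
print(is_auto_referer(sample))
```

Run over a whole log, this would at least tell you which of the big files are drawing that kind of visitor.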