
Search Engine Spider and User Agent Identification Forum

ip-web-crawler - stupidity never goes out of fashion

 10:27 pm on Apr 9, 2013 (gmt 0)

Gosh, I haven't met one of these in ages.

Exhibit A:

- - [08/Apr/2013:16:15:39 -0700] "GET /hovercraft/hovercraft HTTP/1.1" 404 1248 "http://www.example.com/hovercraft/hovercraft" "ip-web-crawler.com"
- - [08/Apr/2013:16:15:39 -0700] "GET /hovercraft/costofliving HTTP/1.1" 404 1248 "http://www.example.com/hovercraft/costofliving" "ip-web-crawler.com"
- - [08/Apr/2013:16:15:39 -0700] "GET /hovercraft/duct_tape HTTP/1.1" 404 1248 "http://www.example.com/hovercraft/duct_tape" "ip-web-crawler.com"

Exhibit B:

- - [08/Apr/2013:16:15:40 -0700] "GET /hovercraft/outside HTTP/1.1" 404 1248 "http://www.example.com/hovercraft/outside" "ip-web-crawler.com"

Exhibit C:

- - [08/Apr/2013:16:15:43 -0700] "GET /fonts/note3 HTTP/1.1" 404 1248 "http://www.example.com/fonts/note3" "ip-web-crawler.com"

Those 404s line up with markup like this:

<a name="hovercraft" id="hovercraft">
<a class="outside">
<a href="#note3">

Wait, I haven't got to the punch line:

- - [08/Apr/2013:16:15:52 -0700] "GET /fun/nofollow HTTP/1.1" 404 1248 "http://www.example.com/fun/nofollow" "ip-web-crawler.com"

As cosgan dot de would say: [smiley image]

Fortunately it knew enough to strip the # from the end of links, or we'd be here all day. The General Index to the Paston Letters has got 20,000 of them.

Obligatory detour to their www page discloses:
The purpose of the IP-Web-Crawler is to identify web sites that host or link to copyrighted content such as torrents, movies, applications and other copyrighted works.

For whose benefit is not revealed.

Q. Does IP-Web-Crawler recognize and respect robots.txt files?
A. Yes. IP-Web-Crawler always checks the robots.txt file first.

Incredibly, this appears to be true. It not only started with robots.txt, it didn't go into any roboted-out directories, although it definitely met links that would have taken it that way. Utterly ignored the "crawl-delay" directive, making 643 requests in less than a minute-- at which point it must have heard a knock at the door, because it went off in mid-crawl and never came back-- but, er, so do some search engines one could name.
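
For reference, "crawl-delay" is a non-standard robots.txt directive; a minimal sketch, with the delay value and the disallowed path invented for illustration:

User-agent: *
Crawl-delay: 10
Disallow: /private/

The Crawl-delay line is the one it ignored; Disallow lines like the last are the ones it honored.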

currently we only crawl textual content, and not rich media content such as images or videos
IP-Web-Crawler only looks at text and HTML. It does not crawl images, files or rich media.

Apparently the crawler, like some folks hereabouts, does not know what the .midi extension signifies ;) It picked up 32 of them. It ran off to answer the phone before it got into the /games/ directory, so I never got a chance to see if it recognizes .sit and .dmg files.

They say the crawler range is ... but whois says that 50.31.0-127 all belongs to one entity (steadfast.net), and .128-255 is an outfit called Server Central. So I didn't see any great loss in slamming the door on 50.31 collectively.
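
That collective door-slam is one short stanza of htaccess (a sketch in Apache 2.2 access-control syntax; 2.4 would use Require directives instead):

Order Allow,Deny
Allow from all
Deny from 50.31.0.0/16

The /16 takes in both the steadfast.net half and the Server Central half in one go.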

Besides, it made me notice that I'd neglected to include /AlonzoMelissaFull.html in my package of auto-referer blocks. It's done in htaccess, so I only list the largest files-- the ones that run over 200K for the html alone. There is a particular type of robot that goes straight for the fattest files, though I don't understand how it identifies them ahead of time. I expect there's a simple explanation.
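
For the curious: an auto-referer block refuses to serve a page whose claimed referer is the page itself, exactly the pattern in the log lines above. A mod_rewrite sketch, with the hostname a stand-in:

RewriteCond %{HTTP_REFERER} ^http://(www\.)?example\.com/AlonzoMelissaFull\.html$ [NC]
RewriteRule ^AlonzoMelissaFull\.html$ - [F]

One such pair per listed file, which is why only the largest files earn an entry.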



 3:09 am on Apr 10, 2013 (gmt 0)

FWIW, UA contains crawler [NC]
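
Spelled out as a full mod_rewrite rule, that condition would be something like this (a sketch; the blanket rule line is assumed):

RewriteCond %{HTTP_USER_AGENT} crawler [NC]
RewriteRule .* - [F]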

Steadfast Networks
"Chicago IT infrastructure provider focusing on dedicated servers, colocation,
cloud services, virtualization, and managed services."


 3:19 am on Apr 10, 2013 (gmt 0)

Besides having all those Steadfast ranges blocked, that UA would never have accessed anything on my servers except robots.txt and my 403 file.


 5:34 am on Apr 10, 2013 (gmt 0)

UA contains crawler

It would have got an automatic "robot" flag in log wrangling if it hadn't already caught my attention by the excess of 404s and huge number of page requests. Yes, I did detour to htaccess: "Huh? Don't I block those?" I have to do some more checking, because when I don't block something that obvious, it probably means there's a solid reason but I can't remember it at the moment :(

otoh I did just throw in the towel and lock out that new snippetbot after I caught it snuffling around my test site. Far as I can make out, that business on the www page about g+ users posting links is a complete fabrication; all it ever gets is the front page and favicon.

except robots.txt and my 403 file

With me it's robots.txt, assorted boilerplate, and the stylesheet that goes with error documents. Makes it easier to tell if I've accidentally locked out a human.
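
A sketch of that carve-out, with the boilerplate and stylesheet paths invented for illustration:

RewriteCond %{REQUEST_URI} !^/robots\.txt$
RewriteCond %{REQUEST_URI} !^/boilerplate/
RewriteCond %{REQUEST_URI} !^/styles/error\.css$
RewriteCond %{HTTP_USER_AGENT} crawler [NC]
RewriteRule .* - [F]

Requests matching the excluded paths go through even for blocked agents, so an accidentally locked-out human still gets a styled 403 page.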


 6:36 pm on Apr 10, 2013 (gmt 0)

> A. Yes. IP-Web-Crawler always checks the robots.txt file first.

So, a bot checking web sites for illegal content that sites with illegal content can block with a simple entry in robots.txt. Delicious!


 8:14 pm on Apr 10, 2013 (gmt 0)

For those not wanting to block corporate users, a heads up:

...

I have corporate users in both these Steadfast ranges.


 8:19 pm on Apr 10, 2013 (gmt 0)

Also have them in ... No corporate users in the other ranges; I have those blocked.


 8:54 pm on Apr 10, 2013 (gmt 0)

They've all been listed before unless there's been a new range acquired by Steadfast.
