
Forum Moderators: Ocean10000 & incrediBILL


ip-web-crawler - stupidity never goes out of fashion

     
10:27 pm on Apr 9, 2013 (gmt 0)

WebmasterWorld Senior Member lucy24 is a WebmasterWorld Top Contributor of All Time Top Contributors Of The Month



Gosh, I haven't met one of these in ages.

Exhibit A:

50.31.96.12 - - [08/Apr/2013:16:15:39 -0700] "GET /hovercraft/hovercraft HTTP/1.1" 404 1248 "http://www.example.com/hovercraft/hovercraft" "ip-web-crawler.com"
50.31.96.12 - - [08/Apr/2013:16:15:39 -0700] "GET /hovercraft/costofliving HTTP/1.1" 404 1248 "http://www.example.com/hovercraft/costofliving" "ip-web-crawler.com"
50.31.96.12 - - [08/Apr/2013:16:15:39 -0700] "GET /hovercraft/duct_tape HTTP/1.1" 404 1248 "http://www.example.com/hovercraft/duct_tape" "ip-web-crawler.com"

Exhibit B:

50.31.96.12 - - [08/Apr/2013:16:15:40 -0700] "GET /hovercraft/outside HTTP/1.1" 404 1248 "http://www.example.com/hovercraft/outside" "ip-web-crawler.com"

Exhibit C:

50.31.96.12 - - [08/Apr/2013:16:15:43 -0700] "GET /fonts/note3 HTTP/1.1" 404 1248 "http://www.example.com/fonts/note3" "ip-web-crawler.com"

That's:
<a name = "hovercraft" id = "hovercraft"
<a class = "outside"
<a href = "#note3"
respectively.

Wait, I haven't got to the punch line.

50.31.96.12 - - [08/Apr/2013:16:15:52 -0700] "GET /fun/nofollow HTTP/1.1" 404 1248 "http://www.example.com/fun/nofollow" "ip-web-crawler.com"

As cosgan dot de would say:

/images/smilie/froehlich/a065.gif


Fortunately it knew enough to strip the # from the end of links, or we'd be here all day. The General Index to the Paston Letters has got 20,000 of them.
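
For comparison, here is a minimal sketch of link extraction that would not have produced those requests. It assumes nothing about how ip-web-crawler is really written; the names LinkExtractor and extract_links are made up for illustration. The only point is that href is the one attribute treated as a link, and that a fragment-only href resolves back to the page itself and gets dropped.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag


class LinkExtractor(HTMLParser):
    """Collect crawlable URLs from the href attribute of <a> tags only."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        # The page's own URL without any fragment, used to skip self-links.
        self.page = urldefrag(base_url)[0]
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            # name=, id=, class= and rel= are never links; only href is.
            if name != "href" or not value:
                continue
            # Resolve against the page URL, then strip the #fragment.
            absolute = urldefrag(urljoin(self.base_url, value))[0]
            # A bare "#note3" resolves to the page itself: skip it.
            if absolute != self.page and absolute not in self.links:
                self.links.append(absolute)


def extract_links(html, base_url):
    """Return the de-duplicated absolute URLs linked from one HTML page."""
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.links
```

Fed a page holding the three anchors quoted above, this returns nothing for the name/id anchor, nothing for the #note3 footnote link, and only the real href (if there is one) for the class = "outside" anchor.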

Obligatory detour to www page discloses:
The purpose of the IP-Web-Crawler is to identify web sites that host or link to copyrighted content such as torrents, movies, applications and other copyrighted works.

For whose benefit is not revealed.

Q. Does IP-Web-Crawler recognize and respect robots.txt files?
A. Yes. IP-Web-Crawler always checks the robots.txt file first.

Incredibly, this appears to be true. It not only started with robots.txt, it also stayed out of every roboted-out directory, although it definitely met links that would have taken it that way. It utterly ignored the "crawl-delay" directive, making 643 requests in less than a minute-- at which point it must have heard a knock at the door, because it went off in mid-crawl and never came back-- but, er, so do some search engines one could name.
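
To put a number on the crawl-delay part, here is a rough sketch using Python's standard urllib.robotparser. The robots.txt content, the /private/ directory and the 3-second delay are all hypothetical (the real file is not shown in this thread); only the user-agent string and the 643 requests come from the logs.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt; the real one is not shown in this thread.
ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 3
Disallow: /private/
"""

UA = "ip-web-crawler.com"  # user-agent string as it appears in the logs


def load_rules(text):
    """Parse robots.txt text without making any network request."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


def min_seconds(request_count, delay):
    """Shortest polite run: one full delay between consecutive requests."""
    return (request_count - 1) * delay


rules = load_rules(ROBOTS_TXT)
allowed = rules.can_fetch(UA, "http://www.example.com/hovercraft/")
blocked = rules.can_fetch(UA, "http://www.example.com/private/page.html")
delay = rules.crawl_delay(UA)
polite = min_seconds(643, delay)
```

With that made-up 3-second delay, 643 requests should have taken at least 1926 seconds, a bit over half an hour rather than under a minute.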

currently we only crawl textual content, and not rich media content such as images or videos
...
IP-Web-Crawler only looks at text and HTML. It does not crawl images, files or rich media.

Apparently the crawler, like some folks hereabouts, does not know what the .midi extension signifies ;) It picked up 32 of them. It ran off to answer the phone before it got into the /games/ directory, so I never got a chance to see if it recognizes .sit and .dmg files.

They say the crawler range is 50.31.96.6-12. But whois says that 50.31.0-127 all belongs to one entity (steadfast.net), and .128-255 is an outfit called Server Central. So I didn't see any great loss in slamming the door on 50.31 collectively.
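
A quick way to sanity-check that kind of block before committing it, sketched with Python's ipaddress module. The two networks are the ones described above (the Steadfast half and all of 50.31); the helper name is_blocked and the second test address are made up for illustration.

```python
import ipaddress

# 50.31.0.0 - 50.31.127.255, the half that whois assigns to Steadfast.
STEADFAST_HALF = ipaddress.ip_network("50.31.0.0/17")
# Everything under 50.31, i.e. both halves together.
WHOLE_50_31 = ipaddress.ip_network("50.31.0.0/16")


def is_blocked(ip, networks):
    """True if the address falls inside any of the given networks."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in networks)


crawler_hit = is_blocked("50.31.96.12", [STEADFAST_HALF])
# Hypothetical address in the upper half, for comparison.
upper_half_only_17 = is_blocked("50.31.200.1", [STEADFAST_HALF])
upper_half_16 = is_blocked("50.31.200.1", [WHOLE_50_31])
```

The crawler's address is caught by the /17 alone; the /16 is what also takes out the upper half.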

Besides, it made me notice that I'd neglected to include /AlonzoMelissaFull.html in my package of auto-referer blocks. It's done in htaccess, so I only list the largest files-- the ones that run over 200K for the html alone. There is a particular type of robot that goes straight for the fattest files, though I don't understand how it identifies them ahead of time. I expect there's a simple explanation.
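
For anyone who would rather find auto-referers in the logs than list files by hand, a small sketch in Python. This is log analysis only, not the htaccess rule itself, and it assumes the combined log format shown in the exhibits above. The names parse_line and is_auto_referer are invented here, and the comparison looks at the path only, since the requested host is not part of the log line.

```python
import re
from urllib.parse import urlsplit

# One line of Apache "combined" log format, as in the exhibits above.
LOG_LINE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)


def parse_line(line):
    """Return the log fields as a dict, or None if the line does not match."""
    match = LOG_LINE.match(line)
    return match.groupdict() if match else None


def is_auto_referer(entry):
    """True when the referer names the very path being requested."""
    referer = urlsplit(entry["referer"])
    # An empty netloc covers "-" and other non-URL referers.
    return bool(referer.netloc) and referer.path == urlsplit(entry["path"]).path
```

Every ip-web-crawler line quoted above comes out as an auto-referer under this test; a request with "-" as referer does not.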
3:09 am on Apr 10, 2013 (gmt 0)

WebmasterWorld Senior Member wilderness is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month



FWIW, UA contains crawler [NC]


Steadfast Networks
"Chicago IT infrastructure provider focusing on dedicated servers, colocation,
cloud services, virtualization, and managed services."
STEADFAST-2         208.100.0.0  - 208.100.63.255   208.100.0.0/18
STEADFAST-5         208.117.0.0  - 208.117.63.255   208.117.0.0/18
STEADFAST-FASTROOT  208.66.168.0 - 208.66.175.255   208.66.168.0/21
STEADFAST-7         23.29.128.0  - 23.29.159.255    23.29.128.0/19
STEADFAST-1         216.86.144.0 - 216.86.159.255   216.86.144.0/20
STEADFAST-6         50.31.0.0    - 50.31.127.255    50.31.0.0/17
STEADFAST-3         67.202.64.0  - 67.202.127.255   67.202.64.0/18
STEADFAST-4         69.162.128.0 - 69.162.191.255   69.162.128.0/18
STEADFAST           2607:F128::  - 2607:F128:FFFF:FFFF:FFFF:FFFF:FFFF:FFFF
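
If you want to double-check that a start-end range and its CIDR really describe the same block before pasting it into a deny list, Python's ipaddress module can do the conversion. A minimal sketch; the helper name range_to_cidrs is made up for illustration.

```python
import ipaddress


def range_to_cidrs(first, last):
    """Convert an inclusive first-last address range to a list of CIDR strings."""
    start = ipaddress.ip_address(first)
    end = ipaddress.ip_address(last)
    return [str(net) for net in ipaddress.summarize_address_range(start, end)]


steadfast_6 = range_to_cidrs("50.31.0.0", "50.31.127.255")
fastroot = range_to_cidrs("208.66.168.0", "208.66.175.255")
```

A range that sits on a clean boundary collapses to a single entry, as the IPv4 rows above do; a range that does not (a handful of consecutive addresses, say) comes back as several smaller blocks.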
3:19 am on Apr 10, 2013 (gmt 0)

WebmasterWorld Senior Member keyplyr is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month







Besides having all those Steadfast ranges blocked, that UA would never have accessed anything on my servers except robots.txt and my 403 file.
5:34 am on Apr 10, 2013 (gmt 0)

WebmasterWorld Senior Member lucy24 is a WebmasterWorld Top Contributor of All Time Top Contributors Of The Month



> UA contains crawler

It would have got an automatic "robot" flag in log wrangling if it hadn't already caught my attention by the excess of 404s and huge number of page requests. Yes, I did detour to htaccess: "Huh? Don't I block those?" I have to do some more checking, because when I don't block something that obvious, it probably means there's a solid reason but I can't remember it at the moment :(
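
Roughly what that kind of flagging can look like, as a sketch rather than my actual log-wrangling code. The thresholds (100 requests, 10 404s) and the function name flag_robots are invented for illustration; the input is assumed to be already-parsed log entries with "ip", "status" and "agent" fields.

```python
import re
from collections import Counter

# Case-insensitive, like [NC]. Only "crawler" comes from this thread;
# more words could be added to the pattern.
ROBOT_WORDS = re.compile(r"crawler", re.IGNORECASE)


def flag_robots(entries, max_requests=100, max_404=10):
    """Return {ip: [reasons]} for addresses that look like robots.

    The default thresholds are arbitrary illustrations, not recommendations.
    """
    requests = Counter()
    not_found = Counter()
    agents = {}
    for entry in entries:
        ip = entry["ip"]
        requests[ip] += 1
        if entry["status"] == "404":
            not_found[ip] += 1
        agents.setdefault(ip, set()).add(entry["agent"])

    flags = {}
    for ip, total in requests.items():
        reasons = []
        if any(ROBOT_WORDS.search(agent) for agent in agents[ip]):
            reasons.append("ua")
        if total > max_requests:
            reasons.append("volume")
        if not_found[ip] > max_404:
            reasons.append("404s")
        if reasons:
            flags[ip] = reasons
    return flags
```

Any one reason is enough to get an address onto the list for a closer look; none of them is proof on its own.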

otoh I did just throw in the towel and lock out that new snippetbot after I caught it snuffling around my test site. Far as I can make out, that business on the www page about g+ users posting links is a complete fabrication; all it ever gets is the front page and favicon.

> except robots.txt and my 403 file

With me it's robots.txt, assorted boilerplate, and the stylesheet that goes with error documents. Makes it easier to tell if I've accidentally locked out a human.
6:36 pm on Apr 10, 2013 (gmt 0)

WebmasterWorld Senior Member dstiles is a WebmasterWorld Top Contributor of All Time 5+ Year Member



> A. Yes. IP-Web-Crawler always checks the robots.txt file first.

So, a bot checking web sites for illegal content that sites with illegal content can block with a simple entry in robots.txt. Delicious!
8:14 pm on Apr 10, 2013 (gmt 0)

10+ Year Member



For those not wanting to block corporate users, a heads up:

STEADFAST-2 208.100.0.0 - 208.100.63.255 208.100.0.0/18
STEADFAST-5 208.117.0.0 - 208.117.63.255 208.117.0.0/18

I have corporate users in both these Steadfast ranges.
8:19 pm on Apr 10, 2013 (gmt 0)

10+ Year Member



Also have them in 69.162.128.0/18. No corporate users in the other ranges; I have those blocked.
8:54 pm on Apr 10, 2013 (gmt 0)

WebmasterWorld Senior Member keyplyr is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month



They've all been listed before unless there's been a new range acquired by Steadfast.
 
