CrazyWebCrawler


lucy24

9:50 pm on Apr 14, 2015 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



The name has come up in a couple of discussions of distributed crawlers and similar
[webmasterworld.com...]
[webmasterworld.com...]
but they haven't got a thread of their own.

I've just spent some time poring over their page [crazywebcrawler.com], which says in part
If you'd like us to stop crawling your website,
the best thing to do is to block our web crawler using the robots.txt specification.
...
Blocking our web crawler by IP address will not work. Due to the distributed nature
of our infrastructure, we have thousands of constantly changing IP addresses. We
strongly recommend you don't try to block our web crawler by IP address, as you'll
most likely spend several hours of futile effort and be in a very bad mood at the end
of it. You really should just include us in your robots.txt or contact us directly.
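For what it's worth, the robots.txt entry their page is asking for would presumably look like the following. This is a sketch only: the exact User-agent token the crawler honours isn't documented anywhere in this thread, so the token below is assumed to match the UA name seen in logs.

```
# Assumed token -- matches the UA string "CrazyWebCrawler" seen in access logs
User-agent: CrazyWebCrawler
Disallow: /
```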

Well, I guess I will have to contact them directly, since I do not perfectly understand how they can heed robots.txt directives when they have never once asked for robots.txt.

I cross-checked the three IPs that this UA has used. One of their ranges (162.243, long blocked) is shared by seoprofiler and MJ12bot-- which do ask for robots.txt-- but I am inclined to doubt they share robots.txt information. Nobody from 192.241.128.0/17 (also long blocked) has ever asked for robots.txt; same for 128.199.

My goodness. What an astounding coincidence. All three are Digital Ocean. On second thought I won't bother about direct contact; I'll just block the IP and UA both. That should do it.
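Blocking both the UA and the ranges could be done in .htaccess along these lines. A minimal sketch, assuming Apache 2.2-style syntax; the CIDR ranges are the three named above, not the whole of Digital Ocean.

```apache
# Block by user-agent string (mod_rewrite)
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} CrazyWebCrawler [NC]
RewriteRule .* - [F]

# Block the three Digital Ocean ranges cross-checked above (mod_authz, 2.2 syntax)
Deny from 162.243.0.0/16 192.241.128.0/17 128.199.0.0/16
```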

trintragula

10:24 am on Apr 15, 2015 (gmt 0)

10+ Year Member Top Contributors Of The Month



I started to notice visits by crazywebcrawler about six months ago. I eventually realised that it was duplicating every request made by a human visitor - usually within a few seconds, and from IPs all over the world.
I think this was being caused by PrivDog, a third-party tool apparently bundled with web security products from Comodo, an internet security company. The Comodo browsers carry Dragon and IceDragon as keywords in the user agent string, but that does not necessarily indicate that PrivDog is installed, and it's also possible that PrivDog could be used with other browsers.
A search for PrivDog on the web raises a number of related issues.

I have 342 IP addresses for crazywebcrawler in my filter (since with a UA like that they were blocked by default). None has ever followed a link not visible to a human, or wandered into a roboted-out directory. So in spite of the useragent string, this thing is not actually behaving like a crawler (which makes sense for a web security product of this kind), and I would not expect it to read robots.txt. It's possible that Comodo/PrivDog have re-purposed either the software or the UA string for another product.

My list does all seem to come from Digital Ocean. Scrunching it all down, I see:

5.101.96.0/20
46.101.0.0/16
80.240.128.0/20
95.85.0.0/18
104.131.0.0/16
104.236.0.0/16
107.170.0.0/16
128.199.0.0/16
162.243.0.0/16
178.62.0.0/16
188.226.128.0/17
192.34.56.0/21
192.81.208.0/20
192.241.128.0/17
198.199.64.0/18
198.211.96.0/19
208.68.36.0/22

which I think includes some ranges I haven't identified before (this is not the whole of Digital Ocean - just the ones I've seen crazywebcrawler from).
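For anyone wanting to check their own logs against this list, here is a small sketch using Python's standard-library ipaddress module. The ranges are exactly those listed above; the function name is just for illustration.

```python
import ipaddress

# CIDR ranges crazywebcrawler has been seen from (all Digital Ocean,
# per the list above -- not the whole of Digital Ocean's space)
RANGES = [ipaddress.ip_network(c) for c in (
    "5.101.96.0/20", "46.101.0.0/16", "80.240.128.0/20", "95.85.0.0/18",
    "104.131.0.0/16", "104.236.0.0/16", "107.170.0.0/16", "128.199.0.0/16",
    "162.243.0.0/16", "178.62.0.0/16", "188.226.128.0/17", "192.34.56.0/21",
    "192.81.208.0/20", "192.241.128.0/17", "198.199.64.0/18",
    "198.211.96.0/19", "208.68.36.0/22",
)]

def in_listed_ranges(ip: str) -> bool:
    """Return True if the address falls inside any of the listed CIDR blocks."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in RANGES)

print(in_listed_ranges("162.243.1.2"))  # True: inside 162.243.0.0/16
print(in_listed_ranges("8.8.8.8"))      # False: outside all listed ranges
```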

keyplyr

1:32 am on Apr 18, 2015 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



RewriteCond %{HTTP_USER_AGENT} (analyz|crawl|seo|spider|walker) [NC]

...and poke holes for friendlies.
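On its own a RewriteCond does nothing; it needs a RewriteRule, and the exception conditions go first. A minimal sketch of the whole block, where "FriendlyCrawler" is a placeholder for whatever trusted bot would otherwise match the pattern:

```apache
RewriteEngine On
# Hole for a friendly: "FriendlyCrawler" is a hypothetical placeholder,
# substitute the UA token of any bot you want to let through
RewriteCond %{HTTP_USER_AGENT} !FriendlyCrawler [NC]
# Block anything else whose UA contains these crawler-ish substrings
RewriteCond %{HTTP_USER_AGENT} (analyz|crawl|seo|spider|walker) [NC]
RewriteRule .* - [F]
```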