"Mozilla/5.0 (compatible; zermelo +http://www.powerset.com) [email:paul@page-store.com,crawl@powerset.com]"
They crawled at a rate of 10 seconds or more per request
They came from 67.202.23.zzz (Amazon)
They started at and obeyed robots.txt
Their about us page does not reveal much
There is no search box to try it out
Background:
[webmasterworld.com...]
[webmasterworld.com...]
[webmasterworld.com...]
Given the above, it might be reasonable to let this new guy in and give it a chance, but all my primeval instincts are telling me to block it. It feels like another silent starter that will either get aborted or turn out to be optimists painting a bull's eye on their backs while M$ is on a shopping spree.
If I were a new SE developer, I would develop something first, get a huge investment, and then make sure the world wild web knew about me before I started crawling its pages at that rate. Shoestring startups are not welcome on my pages till they have something to show me. Am I being too paranoid or cruel?
[edited by: incrediBILL at 9:58 pm (utc) on April 8, 2008]
Where? Found nothing on WW and nothing relevant on Google for "67.202.23"
They have lots of IPs, so searching for that specific Class C may turn up less than searching for their rDNS, "amazonaws.com" for instance.
[webmasterworld.com...]
[webmasterworld.com...]
[webmasterworld.com...]
It's kind of like Nutch: something new will pop up daily.
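For anyone who wants to act on the rDNS rather than chase individual IP ranges, here is a rough sketch in Python. The function names and the suffix list are my own assumptions for illustration; only "amazonaws.com" comes from this thread.

```python
import socket

# Hypothetical block list; "amazonaws.com" is the rDNS suffix mentioned above.
BLOCKED_RDNS_SUFFIXES = ("amazonaws.com",)


def reverse_dns(ip):
    """Return the PTR hostname for ip, or None if the lookup fails."""
    try:
        return socket.gethostbyaddr(ip)[0]
    except OSError:  # covers socket.herror and socket.gaierror
        return None


def is_blocked_host(hostname, suffixes=BLOCKED_RDNS_SUFFIXES):
    """True if hostname equals, or is a subdomain of, a blocked suffix."""
    if not hostname:
        return False
    host = hostname.lower().rstrip(".")
    return any(host == s or host.endswith("." + s) for s in suffixes)
```

You would call `is_blocked_host(reverse_dns(remote_ip))` per request. Matching on a dot boundary avoids catching unrelated names that merely end in the same letters. Keep in mind this blocks everything hosted there, not just this one crawler.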
[edited by: incrediBILL at 10:12 pm (utc) on April 8, 2008]
Connection: close
From: crawl@powerset.com
User-Agent: zermelo Mozilla/5.0 compatible; heritrix/1.12.1 (+http://www.powerset.com) [email:crawl@powerset.com,email:paul@page-store.com]
Connection: close
From: paul@page-store.com
User-Agent: Mozilla/5.0 (compatible; heritrix/1.12.1 +http://www.page-store.com)
None of them made it past reading robots.txt; they have all obeyed it.
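If you want to check what a well-behaved crawler would conclude from your robots.txt, Python's standard `urllib.robotparser` can do it offline. The robots.txt content below is a made-up example, not the file from my site, and `allowed` is just a helper name I picked.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: shut out the two crawler tokens seen in the
# headers above and leave everyone else unrestricted.
ROBOTS_TXT = """\
User-agent: zermelo
Disallow: /

User-agent: heritrix
Disallow: /

User-agent: *
Disallow:
"""


def allowed(agent_token, path="/"):
    """Return True if a crawler identifying as agent_token may fetch path."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(agent_token, path)
```

Pass the short crawler token (for example "zermelo"), not the full User-Agent header. This parser only looks at the part of the string before the first "/", so a full "Mozilla/5.0 (compatible; ...)" header would fall through to the `*` rule.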
Thank you Ocean,
Looking for just heritrix revealed hits from Bombay, India with this UA:
"my-heritrix-crawler(+http://mywebsite.com)"
That UA got him a 403,
so he fired up his browser and came as:
"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)"
Then tried his luck as:
"Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)"
Which again gave him a 403
so the guy just gave up and moved on cursing my site no doubt :-)
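The kind of UA filtering that produced those 403s can be sketched in a few lines of Python. The token list and the function name are assumptions for illustration, not my actual rules, and treating the MSIE request as a 200 is also an assumption.

```python
# Hypothetical deny list of substrings, matched case-insensitively.
BLOCKED_UA_TOKENS = ("heritrix", "httrack")


def status_for(user_agent):
    """Return 403 if the User-Agent contains a blocked token, else 200."""
    ua = (user_agent or "").lower()
    if any(token in ua for token in BLOCKED_UA_TOKENS):
        return 403
    return 200
```

A plain browser UA passes straight through, which is why substring matching only stops tools that announce themselves. Anything that lies about its UA needs behavior-based or IP-based checks instead.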