Powerset

Who's adopting?

Hobbs

9:10 am on Apr 8, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



In 24 hours they went through 3k pages using this user agent:

"Mozilla/5.0 (compatible; zermelo +http://www.powerset.com) [email:paul@page-store.com,crawl@powerset.com]"

They crawled at a rate of one request every 10 seconds or more
They came from 67.202.23.zzz (Amazon)
They started at robots.txt and obeyed it
Their About Us page does not reveal much
There is no search box to try it out
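
For anyone who wants to check a crawl rate like the one above against their own logs, here is a minimal sketch. It assumes the Apache "combined" log format; the sample lines, IP suffixes and timestamps are made up for illustration, and `crawl_gaps` is a hypothetical helper name.

```python
import re
from datetime import datetime

# Assumes "combined" log format: host ident user [time] "request" status bytes "referer" "agent"
LOG_LINE = re.compile(
    r'^\S+ \S+ \S+ \[(?P<time>[^\]]+)\] "[^"]*" \d{3} \S+ "[^"]*" "(?P<ua>[^"]*)"\s*$'
)

def crawl_gaps(lines, ua_token):
    """Seconds between consecutive requests whose user agent contains ua_token."""
    times = []
    for line in lines:
        m = LOG_LINE.match(line)
        if m and ua_token.lower() in m.group("ua").lower():
            times.append(datetime.strptime(m.group("time"), "%d/%b/%Y:%H:%M:%S %z"))
    times.sort()
    return [(later - earlier).total_seconds() for earlier, later in zip(times, times[1:])]

# Made-up sample entries, not real log data.
BOT_UA = ("Mozilla/5.0 (compatible; zermelo +http://www.powerset.com) "
          "[email:paul@page-store.com,crawl@powerset.com]")
SAMPLE = [
    '67.202.23.1 - - [08/Apr/2008:09:00:00 +0000] "GET /robots.txt HTTP/1.1" 200 120 "-" "%s"' % BOT_UA,
    '192.0.2.7 - - [08/Apr/2008:09:00:05 +0000] "GET / HTTP/1.1" 200 5120 "-" "Mozilla/4.0 (compatible; MSIE 6.0)"',
    '67.202.23.1 - - [08/Apr/2008:09:00:12 +0000] "GET /a.html HTTP/1.1" 200 2048 "-" "%s"' % BOT_UA,
    '67.202.23.1 - - [08/Apr/2008:09:00:27 +0000] "GET /b.html HTTP/1.1" 200 2048 "-" "%s"' % BOT_UA,
]

if __name__ == "__main__":
    gaps = crawl_gaps(SAMPLE, "zermelo")
    print("shortest gap between bot requests:", min(gaps), "seconds")
```

A shortest gap of 10 seconds or more would match the behaviour described above.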

Background:
[webmasterworld.com...]
[webmasterworld.com...]
[webmasterworld.com...]

Given the above, it might be reasonable to allow this new guy in and give it a chance, but all my primeval instincts are telling me to block it. It feels like another silent starter that either gets aborted or turns out to be optimists painting a bull's-eye on their backs while M$ is on a shopping spree.

If I were a new SE developer, I would develop something first, get a huge investment, then make sure the world wild web knew about me before I started crawling its pages at that rate. Shoestring startups are not welcome on my pages till they have something to show me. Am I being too paranoid or cruel?

wilderness

9:27 pm on Apr 8, 2008 (gmt 0)

>They came from 67.202.23.zzz (Amazon)

There are oodles of threads on this host and Class C range.

Hobbs

9:44 pm on Apr 8, 2008 (gmt 0)

Where? Found nothing on WW and nothing relevant on Google for "67.202.23"

incrediBILL

9:53 pm on Apr 8, 2008 (gmt 0)

I don't even bother checking the user agents from that range anymore, as I block everything coming from the AWS IPs. There are just too many new bots-du-jour coming from their IPs to deal with, so it was an all-or-nothing decision for me, and I decided on nothing.

[edited by: incrediBILL at 9:58 pm (utc) on April 8, 2008]

incrediBILL

10:06 pm on Apr 8, 2008 (gmt 0)

>Where? Found nothing on WW and nothing relevant on Google for "67.202.23"

They have lots of IPs, so that specific Class C might be harder to find than, for instance, their rDNS of "amazonaws.com".

[webmasterworld.com...]
[webmasterworld.com...]
[webmasterworld.com...]

It's kind of like Nutch: something new will pop up daily.
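
A rough sketch of the rDNS approach mentioned above, for those who would rather match on the "amazonaws.com" hostname than chase individual Class C ranges. The function names are hypothetical, the example hostname is illustrative only, and the lookup is injectable so the logic can be tried without touching the network.

```python
import socket

def is_aws_hostname(hostname):
    """True if hostname is amazonaws.com or one of its subdomains."""
    host = hostname.rstrip(".").lower()
    return host == "amazonaws.com" or host.endswith(".amazonaws.com")

def reverse_dns(ip):
    """Return the PTR hostname for ip, or None if the lookup fails."""
    try:
        return socket.gethostbyaddr(ip)[0]
    except OSError:  # covers socket.herror and socket.gaierror
        return None

def is_aws_ip(ip, lookup=reverse_dns):
    """True if the reverse DNS name of ip falls under amazonaws.com."""
    hostname = lookup(ip)
    return hostname is not None and is_aws_hostname(hostname)

if __name__ == "__main__":
    # Illustrative hostname only; a real check would call is_aws_ip(ip) directly.
    print(is_aws_hostname("ec2-67-202-23-1.compute-1.amazonaws.com"))
```

Matching on the suffix with a leading dot avoids false hits on names that merely contain "amazonaws.com" somewhere in the middle.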

[edited by: incrediBILL at 10:12 pm (utc) on April 8, 2008]

Hobbs

11:48 pm on Apr 8, 2008 (gmt 0)

Hey, life is going to be pretty hard for bots hosted on:

67.202.0.0/18
72.21.192.0/19
72.44.32.0/19
216.182.224.0/20
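
As a minimal sketch of how the four ranges above could be checked in application code (rather than in a server config), assuming nothing beyond the Python standard library. `is_blocked` is a hypothetical helper, and the list reflects only what was posted in this thread, not a current AWS range list.

```python
import ipaddress

# The four CIDR ranges from the post above (as of April 2008).
BLOCKED_RANGES = [
    ipaddress.ip_network(cidr)
    for cidr in (
        "67.202.0.0/18",
        "72.21.192.0/19",
        "72.44.32.0/19",
        "216.182.224.0/20",
    )
]

def is_blocked(ip):
    """True if ip falls inside any of the listed ranges."""
    addr = ipaddress.ip_address(ip)
    return any(addr in network for network in BLOCKED_RANGES)

if __name__ == "__main__":
    # 67.202.23.x sits inside 67.202.0.0/18, which spans 67.202.0.0 to 67.202.63.255.
    print(is_blocked("67.202.23.5"))
```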

Ocean10000

2:00 am on Apr 9, 2008 (gmt 0)

The information I have on it is limited, but I have found another variation of this bot that others may not have connected with this one. I found it by searching for the same email address, "paul@page-store.com".


Connection: close
From: crawl@powerset.com
User-Agent: zermelo Mozilla/5.0 compatible; heritrix/1.12.1 (+http://www.powerset.com) [email:crawl@powerset.com,email:paul@page-store.com]

Connection: close
From: paul@page-store.com
User-Agent: Mozilla/5.0 (compatible; heritrix/1.12.1 +http://www.page-store.com)

None of them made it past reading robots.txt; they have all obeyed it.
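
The trick of linking variants through a shared contact address can be sketched roughly as follows. The header values are the ones quoted in this thread; the regex, the helper names and the dict layout are assumptions made for illustration.

```python
import re
from collections import defaultdict

# Loose email pattern; good enough for pulling contacts out of header text.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")

# Header values as quoted in the thread (the first sighting had no From header shown).
REQUESTS = [
    {"User-Agent": "Mozilla/5.0 (compatible; zermelo +http://www.powerset.com) "
                   "[email:paul@page-store.com,crawl@powerset.com]"},
    {"From": "crawl@powerset.com",
     "User-Agent": "zermelo Mozilla/5.0 compatible; heritrix/1.12.1 (+http://www.powerset.com) "
                   "[email:crawl@powerset.com,email:paul@page-store.com]"},
    {"From": "paul@page-store.com",
     "User-Agent": "Mozilla/5.0 (compatible; heritrix/1.12.1 +http://www.page-store.com)"},
]

def contacts(headers):
    """All email addresses found in the From and User-Agent headers."""
    text = " ".join((headers.get("From", ""), headers.get("User-Agent", "")))
    return {address.lower() for address in EMAIL.findall(text)}

def agents_by_contact(requests):
    """Map each contact address to the set of user agents that mention it."""
    index = defaultdict(set)
    for headers in requests:
        for address in contacts(headers):
            index[address].add(headers["User-Agent"])
    return index

if __name__ == "__main__":
    for address, agents in sorted(agents_by_contact(REQUESTS).items()):
        print(address, "->", len(agents), "user agent variants")
```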

incrediBILL

6:44 am on Apr 9, 2008 (gmt 0)

>None of them made it past reading robots.txt; they have all obeyed it.

On the site I use most for testing, I never tell bots in robots.txt that they are banned, because I'm always curious which pages new bots know about and how they found out about them.
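
For contrast with the silent ban described above, here is a small sketch of what an explicit robots.txt ban looks like from a well-behaved crawler's side, using Python's standard `urllib.robotparser`. The robots.txt content is hypothetical, and "zermelo" as the robots.txt token is an assumption taken from the user-agent string, not something the crawler's operators have confirmed here.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt; the "zermelo" token is assumed from the UA string.
ROBOTS_TXT = """\
User-agent: zermelo
Disallow: /

User-agent: *
Disallow:
"""

def allowed(robots_txt, agent, url):
    """True if robots_txt permits agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)

if __name__ == "__main__":
    print(allowed(ROBOTS_TXT, "zermelo", "http://example.com/page.html"))
    print(allowed(ROBOTS_TXT, "SomeOtherBot", "http://example.com/page.html"))
```

A bot that honors this stops after robots.txt, which is exactly why it reveals nothing about which pages the bot already knew.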

Bewenched

4:28 am on Apr 13, 2008 (gmt 0)

Here's what I ask myself: why on earth would a reputable new bot rent server space from Amazon? Then again, why does Amazon get its domains through Go Daddy?

Hobbs

8:38 am on Apr 13, 2008 (gmt 0)

>I have found another variation

Thank you, Ocean.
Searching my logs for just "heritrix" revealed hits from Bombay, India, with this UA:

"my-heritrix-crawler(+http://mywebsite.com)"

That UA got a 403,
so he fired up his browser and came back as:
"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)"

Then he tried his luck as:
"Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)"

which again got him a 403,
so the guy just gave up and moved on, cursing my site no doubt :-)
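
The behaviour in that little story can be approximated with a simple substring rule. This is only a sketch: the actual blocking rules behind those 403s are not given in the thread, so the token list and the `status_for` helper are assumptions.

```python
# Assumed token list; the real rules on the site are not shown in the thread.
BLOCKED_UA_TOKENS = ("heritrix", "httrack")

def status_for(user_agent):
    """Return 403 for user agents containing a blocked token, else 200."""
    ua = user_agent.lower()
    return 403 if any(token in ua for token in BLOCKED_UA_TOKENS) else 200

if __name__ == "__main__":
    for ua in (
        "my-heritrix-crawler(+http://mywebsite.com)",
        "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)",
        "Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)",
    ):
        print(status_for(ua), ua)
```

As the story shows, a rule like this only stops tools that announce themselves; the same visitor got through with a plain browser UA.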