Amazon Crawling

aws-alexa libwww-perl/5.65

The Contractor

5:50 pm on Sep 1, 2005 (gmt 0)

What, are they crawling now? This thing left only 1 second between requests.

165-38.amazon.com (207.171.165.38) was logged 1625 times,
starting at 12:38:38 PM on Thursday, September 1, 2005.
Browser/UA was aws-alexa libwww-perl/5.65.
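For anyone wanting to check request spacing like this in their own logs, here is a minimal Python sketch. It assumes the log has already been parsed into (timestamp, IP, user agent) tuples. The function name `request_gaps` is hypothetical, and the five sample entries are made up purely to illustrate a 1-second interval; they are not the real log lines.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def request_gaps(entries):
    """Return {(ip, ua): (hit_count, smallest_gap_in_seconds)}.

    `entries` is an iterable of (timestamp, ip, user_agent) tuples.
    The smallest gap is None when a client made only one request.
    """
    by_client = defaultdict(list)
    for ts, ip, ua in entries:
        by_client[(ip, ua)].append(ts)

    summary = {}
    for client, stamps in by_client.items():
        stamps.sort()
        # Seconds between each request and the one before it.
        gaps = [(b - a).total_seconds() for a, b in zip(stamps, stamps[1:])]
        summary[client] = (len(stamps), min(gaps) if gaps else None)
    return summary


# Illustrative data only: five requests, one second apart.
start = datetime(2005, 9, 1, 12, 38, 38)
sample = [
    (start + timedelta(seconds=i), "207.171.165.38", "aws-alexa libwww-perl/5.65")
    for i in range(5)
]

for client, (hits, min_gap) in request_gaps(sample).items():
    print(client, hits, min_gap)
```

Grouping by both IP and UA matters here, since one address can show up under more than one user agent.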

Whois on the IP returns it as:
OrgName: Amazon.com, Inc.
OrgID: AMAZON-4
Address: 605 5th Ave S
City: SEATTLE
StateProv: WA
PostalCode: 98104
Country: US

NetRange: 207.171.160.0 - 207.171.191.255
CIDR: 207.171.160.0/19
NetName: AMAZON-01
NetHandle: NET-207-171-160-0-1
Parent: NET-207-0-0-0-0
NetType: Direct Assignment
NameServer: NS-1.AMAZON.COM
NameServer: NS-2.AMAZON.COM
NameServer: NS-3.AMAZON.COM
NameServer: AUTH00.NS.UU.NET
Comment:
RegDate: 1999-09-23
Updated: 2002-03-19
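
The whois data above can be cross-checked with Python's standard `ipaddress` module. This is only a small sketch confirming that the logged address falls inside the listed CIDR block and that the block matches the stated NetRange; the variable names are my own.

```python
import ipaddress

# Values copied from the whois output above.
network = ipaddress.ip_network("207.171.160.0/19")
crawler_ip = ipaddress.ip_address("207.171.165.38")

in_range = crawler_ip in network
first, last = network[0], network[-1]

print(in_range)     # True
print(first, last)  # 207.171.160.0 207.171.191.255
```

The same membership test is handy for filtering or blocking a whole range rather than chasing individual addresses.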

MarkHutch

6:06 am on Sep 4, 2005 (gmt 0)

Nice find. I haven't noticed them yet, but I'll keep an eye out. I checked the IP you listed, too, and it is Amazon. Every week I notice more and more new crawlers in my logs. Many are just RSS feed crawlers, but others seem to be going after content the way Google and Yahoo do. I guess the game is on. :)

The Contractor

2:10 pm on Sep 4, 2005 (gmt 0)

What's funny is that the same IP came back to the site later with a UA of a9-pnews-crawler (+http://www.a9.com) and hit another 3279 pages of the same site. Contrary to the UA this site hasn't a thing to do with "news".

I believe they may be in test mode ...

Lord Majestic

2:20 pm on Sep 4, 2005 (gmt 0)

Contrary to the UA this site hasn't a thing to do with "news".

Ask yourself: how would a bot know about content before the data is crawled and analysed? Bots can't read minds, you know. It can be guessed for CNN, BBC and ABC, but there are thousands of obscure news sources that may not even be known beforehand. It might take a number of recrawls before there is enough data to make an automated guess whether your pages have anything to do with news.

It would be good if it were possible to give such hints in robots.txt, but it is not at the moment.
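
To illustrate what robots.txt can express, namely allow/deny rules per user agent rather than content-type hints, here is a small sketch using Python's standard `urllib.robotparser`. The rules and the example.com URL are hypothetical; only the `a9-pnews-crawler` token comes from this thread, and whether that crawler honours such a rule is not something this code can tell you.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: shut out one crawler by name, allow everyone else.
robots_txt = """\
User-agent: a9-pnews-crawler
Disallow: /

User-agent: *
Disallow:
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

url = "http://example.com/page.html"  # placeholder URL
allowed_news = parser.can_fetch("a9-pnews-crawler", url)
allowed_other = parser.can_fetch("SomeOtherBot", url)

print(allowed_news)   # False
print(allowed_other)  # True
```

Nothing in this format lets a site say "this is (or is not) a news site"; it can only say which paths a given bot may fetch.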

libwww-perl/5.65 implies that the crawler was written in Perl, and generally this is not the kind of language used to write very long-running processes like crawlers. I'd be very surprised if they use it to crawl lots (>100 million) of pages. If you accept this logic, then it's likely to be a prototype system.
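
As a rough sketch of how the library name and version can be pulled out of a UA string like this one, the Python below splits a user agent into product/version tokens and ignores parenthesised comments. The regex and the function name `product_tokens` are my own simplification, not a full user-agent grammar.

```python
import re

# A product token is a name optionally followed by "/version".
PRODUCT = re.compile(r"([^\s/()]+)(?:/([^\s()]+))?")


def product_tokens(user_agent):
    """Return a list of (name, version) pairs; version is None if absent."""
    # Drop "(...)" comments such as "(+http://www.a9.com)".
    without_comments = re.sub(r"\([^)]*\)", " ", user_agent)
    return [
        (name, version or None)
        for name, version in PRODUCT.findall(without_comments)
    ]


print(product_tokens("aws-alexa libwww-perl/5.65"))
# [('aws-alexa', None), ('libwww-perl', '5.65')]
print(product_tokens("a9-pnews-crawler (+http://www.a9.com)"))
# [('a9-pnews-crawler', None)]
```

Keep in mind that a UA string is whatever the client chooses to send, so a token like this is a hint about the tooling, not proof.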

The Contractor

4:09 pm on Sep 4, 2005 (gmt 0)

I'd be very surprised if they use it to crawl lots (>100 million) of pages. If you accept this logic, then it's likely to be a prototype system.

That is exactly what I wrote in my last message:

I believe they may be in test mode ...

Iguana

4:29 pm on Sep 4, 2005 (gmt 0)

Amazon (through Alexa) has always crawled the sites of associates looking for invalid links to Amazon products. I seem to remember they even ignore robots.txt when they do this.

Lord Majestic

4:48 pm on Sep 4, 2005 (gmt 0)

That is exactly what I wrote in my last message

And I concur. I merely made a comment about the somewhat optimistic expectation of bots knowing content before they have even crawled it. There is no way of knowing with 100% certainty whether a page is a news page before crawling it, and even then a few more crawls may be required.

The Contractor

4:59 pm on Sep 4, 2005 (gmt 0)

I merely made a comment about the somewhat optimistic expectation of bots knowing content before they have even crawled it.

Sorry about that ;)

I did not mean to imply that in my post. The only reason I added that comment is that I thought for sure someone would come along and post "I have a news site and it's not crawling it... what type of news do you have on your site?"

wilderness

5:37 pm on Sep 4, 2005 (gmt 0)

There's more than a bit in the archives on Amazon:

[google.com...]
