165-38.amazon.com (207.171.165.38) was logged 1625 times,
starting at 12:38:38 PM on Thursday, September 1, 2005.
Browser/UA was aws-alexa libwww-perl/5.65.
Whois on the IP returns it as:
OrgName: Amazon.com, Inc.
OrgID: AMAZON-4
Address: 605 5th Ave S
City: SEATTLE
StateProv: WA
PostalCode: 98104
Country: US
NetRange: 207.171.160.0 - 207.171.191.255
CIDR: 207.171.160.0/19
NetName: AMAZON-01
NetHandle: NET-207-171-160-0-1
Parent: NET-207-0-0-0-0
NetType: Direct Assignment
NameServer: NS-1.AMAZON.COM
NameServer: NS-2.AMAZON.COM
NameServer: NS-3.AMAZON.COM
NameServer: AUTH00.NS.UU.NET
Comment:
RegDate: 1999-09-23
Updated: 2002-03-19
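For anyone who wants to double-check that a logged address really belongs to the block whois reports, here is a minimal sketch using Python's standard ipaddress module. The IP and CIDR are copied from the log line and the whois record above; the helper name in_block is just made up for illustration.

```python
import ipaddress

# Values copied from the log entry and the whois record above.
logged_ip = ipaddress.ip_address("207.171.165.38")
amazon_net = ipaddress.ip_network("207.171.160.0/19")


def in_block(ip, net):
    """Return True if ip falls inside the given network (hypothetical helper)."""
    return ip in net


if __name__ == "__main__":
    # Is the logged address inside the CIDR block from whois?
    print(in_block(logged_ip, amazon_net))
    # First and last address of the block, to compare against NetRange.
    print(amazon_net[0], "-", amazon_net[-1])
```

The same check can be run against every distinct IP in the log to see whether all the hits come from the one netblock.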
Contrary to the UA this site hasn't a thing to do with "news".
Ask yourself - how would a bot know about content before the data is crawled and analysed? Bots can't read minds, you know. It can be guessed for CNN, BBC and ABC - but there are thousands of obscure news sources that may not even be known beforehand. A number of recrawls might be needed before there is enough data to make an automated guess whether your pages have anything to do with news.
It would be good if it were possible to give such hints in robots.txt, but it is not at the moment.
libwww-perl/5.65 implies that the crawler was written in Perl, and generally this is not the kind of language used to write very long-running processes like crawlers. I'd be very surprised if they use it to crawl lots (>100 million) of pages. If you accept this logic, then it's likely to be a prototype system.
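As a rough illustration of reading the library out of the UA string, here is a small Python sketch that splits a UA into product/version tokens and checks for libwww-perl. Both function names are hypothetical, and the parsing is a plain whitespace split that assumes simple UAs like the one logged here; it would not cope with parenthesised comments in browser-style UAs.

```python
def parse_user_agent(ua):
    """Split a UA string into (product, version) pairs.

    Assumes whitespace-separated tokens of the form name or name/version;
    version is None when the token carries no slash.
    """
    tokens = []
    for part in ua.split():
        name, _, version = part.partition("/")
        tokens.append((name, version or None))
    return tokens


def looks_like_lwp(ua):
    """Return True if any product token is libwww-perl (case-insensitive)."""
    return any(name.lower() == "libwww-perl" for name, _ in parse_user_agent(ua))


if __name__ == "__main__":
    ua = "aws-alexa libwww-perl/5.65"  # UA string from the log above
    print(parse_user_agent(ua))
    print(looks_like_lwp(ua))
```

Bear in mind that the UA only tells you what the client claims to be, so this is a hint about the tooling and nothing more.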
That is exactly what I wrote in my last message
And I concur - I merely made a comment about the somewhat optimistic expectation of bots knowing content before they have even crawled it. There is no way of knowing with 100% certainty whether a page is a news page before crawling it, and even then a few more crawls may be required.
I merely made a comment about the somewhat optimistic expectation of bots knowing content before they have even crawled it.
Sorry about that ;)
I did not mean to imply that in my post. The only reason I added that comment is that I thought for sure someone would come along and post "I have a news site and it's not crawling it... what type of news do you have on your site".
[google.com...]