Forum Moderators: bakedjake
I've got Google blocked from the directory in question (but not before their last crawl... yipe!), but Alexa's browser/OS comes up as 'undefined' in my stats program. Does anyone know what user-agent string to use for Alexa in robots.txt? Would just plain "User-agent: Alexa" work?
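For robots.txt, the User-agent line has to match the token the bot actually sends, not the company name, so "User-agent: Alexa" would likely be ignored. Assuming the UA string in the log lines below (ia_archiver) is what the crawler identifies itself as, a minimal rule would look like this (the directory path is just a placeholder for yours):

```
User-agent: ia_archiver
Disallow: /your-directory/
```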
host | IP | UA
arc24-public.alexa.com | 209.247.40.174 | ia_archiver
arc20-public.alexa.com | 209.247.40.170 | ia_archiver
arc22-public.alexa.com | 209.247.40.172 | ia_archiver
arc24-public.alexa.com | 209.247.40.174 | ia_archiver
arc22-public.alexa.com | 209.247.40.172 | ia_archiver
arc22-public.alexa.com | 209.247.40.172 | ia_archiver
arc18-public.alexa.com | 209.247.40.168 | ia_archiver
arc18-public.alexa.com | 209.247.40.168 | ia_archiver
arc24-public.alexa.com | 209.247.40.174 | ia_archiver
arc18-public.alexa.com | 209.247.40.168 | ia_archiver
arc24-public.alexa.com | 209.247.40.174 | ia_archiver
arc21-public.alexa.com | 209.247.40.171 | ia_archiver
arc18-public.alexa.com | 209.247.40.168 | ia_archiver
arc20-public.alexa.com | 209.247.40.170 | ia_archiver
arc21-public.alexa.com | 209.247.40.171 | ia_archiver
arc21-public.alexa.com | 209.247.40.171 | ia_archiver
arc21-public.alexa.com | 209.247.40.171 | ia_archiver
arc21-public.alexa.com | 209.247.40.171 | ia_archiver
I've been doing some research, and although I have no solid evidence, I have found some clues.
First clue:
Notice that Alexa lists The Internet Archive [archive.org].
Second clue:
Take a look at the fantastic paper NFFC posted a while back in one of the forums:
[research.compaq.com...]
Their spidering activity fits this quote:
""The Internet Archive also uses multiple machines to crawl the web [6,14]. Each crawler process is assigned up to 64 sites to crawl, and no site is assigned to more than one crawler. Each single-threaded crawler process reads a list of seed URLs for its assigned sites from disk into per-site queues, and then uses asynchronous I/O to fetch pages from these queues in parallel. Once a page is downloaded, the crawler extracts the links contained in it. If a link refers to the site of the page it was contained in, it is added to the appropriate site queue; otherwise it is logged to disk. Periodically, a batch process merges these logged ``cross-site'' URLs into the site-specific seed sets, filtering out duplicates in the process. ""
Third clue:
ia_archiver = InternetArchive_Archiver?
What do you think?
crawl1-public.alexa.com | 209.247.40.104
crawl2-public.alexa.com | 209.247.40.105
crawl3-public.alexa.com | 209.247.40.106
crawl4-public.alexa.com | 209.247.40.107
crawl5-public.alexa.com | 209.247.40.108
crawl6-public.alexa.com | 209.247.40.109
crawl7-public.alexa.com | 209.247.40.98
crawl8-public.alexa.com | 209.247.40.99
There doesn't appear to be a crawl9 or above.
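If you'd rather block by identity than trust the UA string alone, a reverse-DNS check against the hostnames above is one way to confirm a hit really came from Alexa. A rough sketch (the function names are mine, and in production you'd also resolve the hostname forward again, since PTR records can be spoofed):

```python
import socket

def hostname_matches(hostname: str) -> bool:
    """True if the hostname is alexa.com or a subdomain of it."""
    return hostname == "alexa.com" or hostname.endswith(".alexa.com")

def is_alexa_host(ip: str) -> bool:
    """Reverse-DNS check: does this IP resolve to an alexa.com host?"""
    try:
        hostname, _, _ = socket.gethostbyaddr(ip)
    except socket.herror:
        return False
    return hostname_matches(hostname)
```

For example, is_alexa_host("209.247.40.174") should come back True as long as that IP still reverses to arc24-public.alexa.com.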