A single URLserver serves lists of URLs to a number of crawlers (we typically ran about 3). Both the URLserver and the crawlers are implemented in Python. Each crawler keeps roughly 300 connections open at once. This is necessary to retrieve web pages at a fast enough pace. At peak speeds, the system can crawl over 100 web pages per second using four crawlers. This amounts to roughly 600K per second of data. A major performance stress is DNS lookup. Each crawler maintains its own DNS cache so it does not need to do a DNS lookup before crawling each document. Each of the hundreds of connections can be in a number of different states: looking up DNS, connecting to host, sending request, and receiving response.
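To make the DNS cache idea concrete, here is a rough sketch of how I picture it. This is not code from the paper: the names `ConnState` and `DnsCache` are made up for illustration, and the resolver defaults to the standard library's `socket.gethostbyname`. The point is only that the cache is keyed by hostname and the IP is a throwaway lookup result.

```python
import socket
from enum import Enum, auto
from typing import Callable, Dict


class ConnState(Enum):
    """The four connection states listed in the quoted passage."""
    LOOKING_UP_DNS = auto()
    CONNECTING = auto()
    SENDING_REQUEST = auto()
    RECEIVING_RESPONSE = auto()


class DnsCache:
    """Per-crawler hostname -> IP cache (hypothetical class, not from the paper)."""

    def __init__(self, resolver: Callable[[str], str] = socket.gethostbyname):
        # The resolver is injectable so the cache can be exercised offline.
        self._resolver = resolver
        self._cache: Dict[str, str] = {}
        self.misses = 0  # number of real lookups performed

    def resolve(self, host: str) -> str:
        host = host.lower()  # hostnames are case-insensitive
        if host not in self._cache:
            self.misses += 1
            self._cache[host] = self._resolver(host)
        return self._cache[host]
```

With something like this, a crawler only pays for one real lookup per hostname, no matter how many documents it fetches from that host. A real cache would also need expiry, which this sketch leaves out.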
There is a URLserver that sends lists of URLs to be fetched to the crawlers. The web pages that are fetched are then sent to the storeserver. The storeserver then compresses and stores the web pages into a repository. Every web page has an associated ID number called a docID which is assigned whenever a new URL is parsed out of a web page.
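A minimal sketch of that flow, again my own illustration and not the paper's code. `DocIdTable` and `Repository` are hypothetical names, the sequential numbering is an assumption, and `zlib` stands in for whatever compression the storeserver actually uses. Note that the docID is keyed on the URL, not on an IP address.

```python
import zlib
from typing import Dict


class DocIdTable:
    """Assigns a docID the first time a URL is seen (hypothetical helper)."""

    def __init__(self) -> None:
        self._ids: Dict[str, int] = {}

    def doc_id(self, url: str) -> int:
        # Assumption: IDs are handed out sequentially starting at 0.
        if url not in self._ids:
            self._ids[url] = len(self._ids)
        return self._ids[url]


class Repository:
    """Stands in for the storeserver: compress on store, decompress on load."""

    def __init__(self) -> None:
        self._pages: Dict[int, bytes] = {}

    def store(self, doc_id: int, html: bytes) -> None:
        self._pages[doc_id] = zlib.compress(html)

    def load(self, doc_id: int) -> bytes:
        return zlib.decompress(self._pages[doc_id])
```

Usage would be something like `repo.store(ids.doc_id(url), page_bytes)`: the same URL always maps to the same docID, and the IP address never appears in the stored record.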
Source: The Anatomy of a Large-Scale Hypertextual Web Search Engine [www7.scu.edu.au] by Sergey Brin and Lawrence Page, the Google guys
If I read this correctly, Google works from domain names, not IP addresses. Each crawler keeps its own DNS cache to speed up lookups, but what gets stored is URLs, not IPs. That makes sense if you think about it: part of the search result is based on the domain name, so why keep two IDs, one for the domain name and one for the IP address, when the IP has almost zero data or search value? If this is wrong, it would be good to explain why, and not just say it's wrong.