DNS Caching: Looking for Good Explanation

I know there is some doc out there that explains this...

         

Nick_W

6:36 am on Apr 28, 2004 (gmt 0)


Hi all,

I have to explain to someone that search engines do indeed crawl by IP address and not by domain name.

I know there is some Google doc out there, but I have no idea where or what it is. Is there anything (other than a discussion thread) that talks about this?

Many thanks...

Nick

isitreal

5:59 pm on May 8, 2004 (gmt 0)


A single URLserver serves lists of URLs to a number of crawlers (we typically ran about 3). Both the URLserver and the crawlers are implemented in Python. Each crawler keeps roughly 300 connections open at once. This is necessary to retrieve web pages at a fast enough pace. At peak speeds, the system can crawl over 100 web pages per second using four crawlers. This amounts to roughly 600K per second of data. A major performance stress is DNS lookup. Each crawler maintains its own DNS cache so it does not need to do a DNS lookup before crawling each document. Each of the hundreds of connections can be in a number of different states: looking up DNS, connecting to host, sending request, and receiving response.

There is a URLserver that sends lists of URLs to be fetched to the crawlers. The web pages that are fetched are then sent to the storeserver. The storeserver then compresses and stores the web pages into a repository. Every web page has an associated ID number called a docID which is assigned whenever a new URL is parsed out of a web page.

src: The Anatomy of a Large-Scale Hypertextual Web Search Engine [www7.scu.edu.au] by Sergey Brin and Lawrence Page, the Google guys
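To make the DNS cache part of that quote concrete, here is a minimal sketch in Python of what a per-crawler cache could look like. This is not code from the paper: the names `DNSCache` and `address_for`, the one-hour default TTL, and the injectable `resolver` and `clock` parameters are all my own assumptions for illustration. The point is only that the crawler is handed a URL, and the IP address is something it looks up (and remembers) along the way.

```python
import socket
import time
from urllib.parse import urlsplit


class DNSCache:
    """Hypothetical per-crawler cache mapping hostname -> IP address."""

    def __init__(self, resolver=socket.gethostbyname, ttl=3600.0,
                 clock=time.monotonic):
        self._resolver = resolver   # function doing the real DNS lookup
        self._ttl = ttl             # seconds an entry stays valid (assumed value)
        self._clock = clock         # injectable so expiry can be tested
        self._entries = {}          # hostname -> (ip, expiry_time)
        self.lookups = 0            # number of real DNS lookups performed

    def resolve(self, hostname):
        now = self._clock()
        cached = self._entries.get(hostname)
        if cached is not None and cached[1] > now:
            return cached[0]        # cache hit: no DNS traffic
        ip = self._resolver(hostname)
        self.lookups += 1
        self._entries[hostname] = (ip, now + self._ttl)
        return ip


def address_for(url, cache):
    """Take a URL from the crawl list and return the IP to connect to."""
    host = urlsplit(url).hostname
    return cache.resolve(host)
```

With something like this, fetching many pages from the same host costs one DNS lookup per TTL window rather than one per document, and the crawl list itself never has to contain an IP address.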

If I read this correctly, Google works from domain names, not IP addresses: each crawler keeps its own DNS cache to speed up lookups, but what gets stored is URLs, not IPs. This also makes logical sense if you think about it: part of the search result is based on the domain name, so why keep two IDs, one for the domain name and one for the IP address, when the IP has almost zero data or search value? If this is wrong, it would be good to explain why, and not just say it's wrong.
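As a rough illustration of the "URLs are stored, not IPs" reading, here is a small Python sketch of a docID table keyed by URL, following the quoted description that a docID is assigned whenever a new URL is parsed out of a page. The class name `DocIndex` and its methods are hypothetical, and the sequential integer IDs are my assumption; the paper does not say how the IDs are generated.

```python
class DocIndex:
    """Hypothetical docID table: one ID per distinct URL, no IP involved."""

    def __init__(self):
        self._ids = {}    # url -> docID
        self._urls = []   # docID -> url

    def doc_id(self, url):
        """Return the docID for a URL, assigning a new one on first sight."""
        if url not in self._ids:
            self._ids[url] = len(self._urls)
            self._urls.append(url)
        return self._ids[url]

    def url_for(self, doc_id):
        """Look the URL back up from its docID."""
        return self._urls[doc_id]


index = DocIndex()
first = index.doc_id("http://example.com/")
second = index.doc_id("http://example.com/about")
again = index.doc_id("http://example.com/")
```

Two pages on the same host (and so, normally, the same IP) still get separate IDs, while seeing the same URL again returns the ID it already has. Keyed this way, a site that moves to a new IP address keeps all of its docIDs.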