Spiders and referral information

theney

1:32 pm on Dec 14, 2007 (gmt 0)

10+ Year Member



In one of the old threads (from 2004) [webmasterworld.com], I found a quote from jdMorgan saying:

That only works if there is one and only one link to your page. Otherwise, they'd have to re-fetch your page every time they found a link to it in order to "give you a chance" to reject each and every incoming link referrer...

That's why spiders don't do this. They work from a database that may contain dozens to tens of thousands of link referrers to your one page. How would they know which one you won't like without trying all of them? :(

This is an interesting point; however, I am not sure I completely understand the logic behind it. Wouldn't it actually be beneficial to have the referrer info from spider visits? Googlebot will fetch my page every time it finds a link to my site (will it?). It will not index or cache it each time, but it will visit. What would be the logical problem with spiders passing referral information?

jdMorgan

1:39 pm on Dec 14, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



> Googlebot will fetch my page every time it finds a link to my site (will it?)

No, it almost certainly will not re-fetch your page every time... What if your site were CNN or Amazon.com, with hundreds of thousands of incoming links to your homepage? It would make no sense to re-fetch the homepage every time one of those links (many of them stale) was found.

Google and the others are not going to want to pay for that wasted bandwidth either, and you can be sure they de-duplicate their URL lists to save bandwidth, money, and time.
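To make the de-duplication step concrete, here is a minimal sketch (illustrative only, not Google's actual pipeline; the canonicalization rules here are simplified assumptions):

    from urllib.parse import urlsplit, urlunsplit

    def canonicalize(url):
        # Reduce each URL to one canonical form so duplicates collapse:
        # lower-case the scheme and host, default the path, and drop the
        # fragment (which never reaches the server anyway).
        parts = urlsplit(url)
        return urlunsplit((parts.scheme.lower(), parts.netloc.lower(),
                           parts.path or "/", parts.query, ""))

    discovered = [
        "http://Example.com/page",
        "http://example.com/page#section",
        "http://example.com/page",
    ]
    fetch_list = {canonicalize(u) for u in discovered}
    print(fetch_list)  # one entry, so one fetch instead of three

However many pages link to a URL, with whatever mix of capitalization and fragments, the crawler only needs to fetch it once per crawl cycle.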

Jim

theney

1:51 pm on Dec 14, 2007 (gmt 0)

10+ Year Member



So how do they decide which links they are going to use to visit my site? And where are the 20-30 (or 200-300) daily Googlebot visits to my site coming from? Every time from the same link?

g1smd

2:30 am on Dec 21, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



The Google system has many parts.

One part parses previously cached pages and adds newly discovered URLs found in those pages to the list of URLs to fetch.

Another part of the system uses that list to fetch the content of those URLs.

The system is NOT like a browser, directly following links from page to page and site to site.

The system fetches pages by working from a pre-compiled list of URLs to collect.

They are not interested in "using particular links to visit a site"; they are interested in fetching the content from as many URLs as possible.
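A toy sketch of that two-part split might look like this (the names and structure are illustrative assumptions; a real crawler also needs robots.txt handling, politeness delays, and scheduling, none of which are shown):

    from collections import deque
    from html.parser import HTMLParser
    from urllib.parse import urljoin
    from urllib.request import urlopen

    class LinkExtractor(HTMLParser):
        def __init__(self, base_url):
            super().__init__()
            self.base_url = base_url
            self.links = []

        def handle_starttag(self, tag, attrs):
            if tag == "a":
                for name, value in attrs:
                    if name == "href" and value:
                        self.links.append(urljoin(self.base_url, value))

    seen = set()                               # URLs already known
    frontier = deque(["http://example.com/"])  # pre-compiled list to fetch
    budget = 10                                # keep the sketch finite

    while frontier and budget > 0:
        url = frontier.popleft()
        if url in seen:
            continue
        seen.add(url)
        budget -= 1
        # Fetcher part: collect the content of a listed URL.
        html = urlopen(url).read().decode("utf-8", "replace")
        # Parser part: mine the fetched page for newly discovered URLs.
        extractor = LinkExtractor(url)
        extractor.feed(html)
        frontier.extend(u for u in extractor.links
                        if u.startswith("http") and u not in seen)

Note that the loop never "follows" a link directly; discovery (the parser) only ever appends to the list that the fetcher works through.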

jdMorgan

3:13 am on Dec 21, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



But to add to that, and to address one aspect of the question: Google uses thousands of hosts in possibly hundreds of datacenters for spidering, and these are not always in perfect synchronization. Therefore, you will indeed see multiple fetches, even though those machines may all be working from the same de-duplicated list of URLs. However, you will usually not see fetches for the same page from the same (or a similar) IP address, unless a crawl is restarted due to a problem.
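You can see this "many hosts" effect in your own logs by grouping Googlebot's requests for a single URL by client IP. A rough sketch, assuming an Apache/nginx combined-format log named access.log (both the file name and the format are assumptions about your setup, and a strict check would also reverse-resolve each IP to verify it really is Googlebot):

    import re
    from collections import Counter

    line_re = re.compile(r'^(\S+) \S+ \S+ \[[^\]]+\] "GET (\S+)')

    hits = Counter()
    with open("access.log") as log:
        for line in log:
            if "Googlebot" not in line:      # crude user-agent filter
                continue
            m = line_re.match(line)
            if m and m.group(2) == "/":      # fetches of the homepage
                hits[m.group(1)] += 1        # keyed by crawler IP

    for ip, count in hits.most_common():
        print(ip, count)  # typically many distinct IPs, few repeats each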

This "many-servers" aspect is also the reason why you may often get different search results from hour to hour or even from minute to minute; Google uses round-robin load-balancing DNS, so your first search may be handled in Chicago, and your second search by a server in San Diego. If these machines are not working from the same index, you may see different search results -- different pages listed in different order.

Jim