That only works if there is one and only one link to your page. Otherwise, they'd have to re-fetch your page every time they found a link to it in order to "give you a chance" to reject each and every incoming link referrer... That's why spiders don't do this. They work from a database that may contain anywhere from dozens to tens of thousands of link referrers to your one page. How would they know which one you won't like without trying all of them? :(
This is an interesting point, but I am not sure I completely understand the logic behind it. Wouldn't it actually be beneficial to have the referrer info from spider visits? Googlebot will fetch my page every time it finds a link to my site (or will it?). It will not index or cache the page each time, but it will visit. What would be the logical problem with spiders passing referrer information?
No, it almost certainly will not re-fetch your page every time... What if your site was CNN or Amazon.com, with hundreds of thousands of incoming links to your homepage? It would make no sense to re-fetch the homepage every time one of those links (many of them stale) was found.
Google and the others are not going to want to pay for that wasted bandwidth either, and you can be sure they de-duplicate their URL lists to save bandwidth, money, and time.
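To make the de-duplication point concrete, here is a rough sketch in Python. It is not how any real search engine stores its link data; the `discovered_links` records, the `build_fetch_list` helper, and the example URLs are all made up for illustration. It only shows that many link sources collapse into one entry per target URL, so there is no single "referrer" left to send.

```python
from collections import defaultdict

# Hypothetical link records: (page containing the link, link target).
discovered_links = [
    ("http://site-a.example/links.html", "http://www.example.com/"),
    ("http://site-b.example/blog/1", "http://www.example.com/"),
    ("http://site-c.example/", "http://www.example.com/"),
    ("http://site-a.example/links.html", "http://www.example.com/about"),
]

def build_fetch_list(links):
    """Collapse link records into one fetch entry per unique target URL."""
    referrers = defaultdict(set)
    for source, target in links:
        referrers[target].add(source)
    # Each target appears once, no matter how many pages link to it.
    fetch_list = sorted(referrers)
    return fetch_list, referrers

fetch_list, referrers = build_fetch_list(discovered_links)
for url in fetch_list:
    print(url, "is linked from", len(referrers[url]), "page(s)")
```

In this toy data the homepage has three incoming links but only one entry in the fetch list, so picking any one of the three as "the" referrer would be arbitrary.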
Jim
One part parses previously cached pages and adds newly discovered URLs found in those pages to the list of URLs to fetch.
Another part of the system uses that list to fetch the content of those URLs.
The system is NOT like a browser, directly following links from page to page and site to site.
The system fetches the pages working from a pre-compiled list of pages to collect.
They are not interested in "using particular links to visit a site"; they are interested in fetching the content from as many URLs as possible.
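The two-part arrangement described above can be sketched roughly as follows. This is a simplified illustration under my own assumptions, not the actual design of any crawler: `page_cache`, `parse_cached_pages`, `fetch_from_list`, and `fake_fetch` are hypothetical names, the link extraction is a crude regular expression, and the "fetch" is a stand-in that makes no network request.

```python
import re
from collections import deque

# Hypothetical cache of already-fetched pages: URL -> HTML.
page_cache = {
    "http://a.example/": '<a href="http://b.example/">B</a> <a href="http://c.example/">C</a>',
    "http://b.example/": '<a href="http://c.example/">C</a> <a href="http://a.example/">A</a>',
}

HREF = re.compile(r'href="([^"]+)"')  # crude link extraction, for illustration only

def parse_cached_pages(cache, known_urls, to_fetch):
    """Part one: scan cached pages and queue URLs not seen before."""
    for html in cache.values():
        for url in HREF.findall(html):
            if url not in known_urls:
                known_urls.add(url)
                to_fetch.append(url)

def fetch_from_list(to_fetch, fetch):
    """Part two: work through the pre-compiled list, one URL at a time."""
    fetched = {}
    while to_fetch:
        url = to_fetch.popleft()
        # Only the URL is passed along; where the link was found is not part of the request.
        fetched[url] = fetch(url)
    return fetched

def fake_fetch(url):
    """Stand-in for an HTTP request (assumption: no real network access here)."""
    return f"<html>content of {url}</html>"

known = set(page_cache)   # URLs already in the cache
queue = deque()           # the list of URLs to fetch
parse_cached_pages(page_cache, known, queue)
new_pages = fetch_from_list(queue, fake_fetch)
print(sorted(new_pages))
```

Note that `http://c.example/` is linked from both cached pages but is queued and fetched only once, and the fetching step never learns which page the link came from.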
This "many-servers" aspect is also the reason why you may often get different search results from hour to hour or even from minute to minute; Google uses round-robin load-balancing DNS, so your first search may be handled in Chicago, and your second search by a server in San Diego. If these machines are not working from the same index, you may see different search results -- different pages listed in different order.
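A toy model of that effect, with heavy assumptions: the data-center names, the `indexes` contents, and the `search` function are invented for illustration, and real round-robin DNS rotates address records rather than calling a Python iterator. The sketch only shows how alternating between replicas that hold different index versions gives different results for the same query.

```python
from itertools import cycle

# Hypothetical data centers, each holding its own copy of the index.
indexes = {
    "chicago": ["page-a", "page-b", "page-c"],
    "san-diego": ["page-b", "page-a", "page-d"],  # updated at a different time
}

# Round-robin: hand out data centers in turn, wrapping around at the end.
rotation = cycle(indexes)

def search(query):
    """Answer the query from whichever data center is next in the rotation."""
    datacenter = next(rotation)
    # The query is ignored in this toy model; each replica returns its own ranking.
    return datacenter, indexes[datacenter]

first = search("widgets")
second = search("widgets")
print(first)
print(second)
```

Two identical searches land on different replicas here, so the result lists differ in both content and order until the indexes are brought back in sync.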
Jim