Here's the short story on it:
Google is constantly crawling and updating selected pages that meet some predetermined criteria. Those criteria may involve last-modified dates and PR values.
Google has many data centers and runs a distributed load-sharing system across more than 10k PCs running Linux with 80 gig drives, at last report. Somehow, the copy of the index must get transferred to all those hard drives in all those data centers. You ever transfer 80 gig across the net? And then distribute that 80 gig down into thousands of hard drives?
All of that takes a great deal of time. It's a constant process for Google. More than likely, the daily updates only copy out those parts of the index that have actually been updated. That's yet another way new and old data could get mixed.
Load sharing works transparently. You do a search on Google and the request is routed via DNS magic to either the nearest data center or the nearest data center with the least load (we don't know their load distribution criteria on that).
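To make the "only copy what changed" idea concrete, here is a minimal sketch. Everything in it is an assumption for illustration: the shard names, the idea of comparing checksums, and the `shards_to_copy` helper are hypothetical and say nothing about how Google actually does it.

```python
import hashlib


def shard_digest(data):
    """Checksum of one piece (shard) of the index."""
    return hashlib.sha256(data).hexdigest()


def shards_to_copy(master, replica):
    """Names of shards that are missing or different on the replica."""
    return sorted(
        name
        for name, data in master.items()
        if name not in replica or shard_digest(replica[name]) != shard_digest(data)
    )


# Hypothetical data: the master has fresh data in two shards.
master = {"shard-a": b"crawl 2", "shard-b": b"crawl 1", "shard-c": b"crawl 2"}
replica = {"shard-a": b"crawl 1", "shard-b": b"crawl 1"}

changed = shards_to_copy(master, replica)
print(changed)
```

Until every changed shard has been copied, a replica like this one holds a mix of old and new data, which is the mixing described above.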
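The two routing possibilities can be sketched roughly like this. The data center names, distances, loads, and the 0.8 load cutoff are all made up, and "nearest with the least load" is interpreted here as "nearest one that is not overloaded", which is only a guess at one reasonable policy.

```python
from dataclasses import dataclass


@dataclass
class DataCenter:
    name: str
    distance_ms: float  # hypothetical round-trip time from the searcher
    load: float         # hypothetical utilisation, 0.0 to 1.0


def pick_nearest(centers):
    """Policy 1: always route to the closest data center."""
    return min(centers, key=lambda dc: dc.distance_ms)


def pick_nearest_with_capacity(centers, max_load=0.8):
    """Policy 2: closest data center whose load is under the cutoff.

    Falls back to all centers if every one of them is over the cutoff.
    """
    candidates = [dc for dc in centers if dc.load <= max_load] or centers
    return min(candidates, key=lambda dc: dc.distance_ms)


centers = [
    DataCenter("dc-east", 20.0, 0.95),
    DataCenter("dc-west", 70.0, 0.40),
    DataCenter("dc-europe", 120.0, 0.10),
]

print(pick_nearest(centers).name)
print(pick_nearest_with_capacity(centers).name)
```

With these made-up numbers the two policies send the same searcher to different data centers, so even the choice of policy affects which copy of the index you see.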
Lastly, they could be working on the index, rolling indexes back, switching parts of the index, backing up parts of the index, rewriting some offending part of the index, deleting parts of an index - or a multitude of other actions or problems that only Google could know about.
Take those combinations of not knowing which box you are going to connect to and which index it may have, and the possibility of daily updating going on at the same time, and results may be unpredictable. There could be dozens of different indexes floating around various data centers - we have no clue.
One minute you'll get one copy of an index during a search, and the next you'll get another. Sometimes that could be yesterday's crawl, or last month's crawl, or a crawl from four months ago.
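A toy simulation of that effect, assuming three hypothetical data centers that each hold a different copy of the index and a router that picks one at random per search. The names and the random choice are illustrative only.

```python
import random

# Hypothetical state: each data center currently serves a different index.
index_by_center = {
    "dc-1": "yesterday's crawl",
    "dc-2": "last month's crawl",
    "dc-3": "crawl from four months ago",
}


def search(rng):
    """Route one search to a random data center and report its index."""
    center = rng.choice(sorted(index_by_center))
    return center, index_by_center[center]


rng = random.Random(0)  # fixed seed so repeated runs behave the same
results = [search(rng) for _ in range(5)]
for center, version in results:
    print(center, "->", version)
```

Run the same query five times and you can land on different boxes, each answering from whatever index it happens to hold.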
Somehow, the copy of the index must get transferred to all those hard drives in all those data centers. You ever transfer 80gig across the net? And then distribute that 80 gig down into thousands of hard drives?
No, but I once downloaded a large image editing program over LimeWire with a 28k modem after 3 months on a computer. I can't imagine that amount of cussing times 400 mouths. ;)
Lastly, they could be working on the index, rolling indexes back, switching parts of the index, backing up parts of the index, rewriting some offending part of the index, deleting parts of an index - or a multitude of other problems that only Google could know about.
The site in question insists on using its own server, and the site is down frequently (I guess they're not very good at it). Would being down during a 'freshness' visit possibly cause the cache to revert to a state prior to the previous 'freshness' visit?
Also, as pertains to "rolling back part of the index", "switching parts of the index", and "backing up parts of the index": is there any way to isolate any of these to see if there is any correlation to the update, just for future updates?
Sorry about
1. I didn't see thread 6394, which was two doors down when I started this and..
B. I pretty much knew the answer before I posted, as I've read many similar posts. Doesn't quite sink in until it's your site, though.
On the flip side, I (and hopefully others) did gain from your post, Brett.
PS. Can we repost an update time once our previous one expires, or is there no hope of ever obtaining a mousepad? Wait a minute, I think I can use that image editing program and the picture of the mousepad and my wife's rounded corner cutter and the special paper and backing. I may not be able to duplicate Google, but I bet I can put out a black market mousepad without any venture capital at all.
I gather, Brett, that you've never run a Usenet server that carries binaries? With all that pr0n and copyright-challenged stuff on Usenet, the daily feed is 500 GB. Thus a Usenet server has to be able to handle taking in 80 GB in just a matter of hours.
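The arithmetic behind that, as a quick sketch. It assumes the 500 GB arrives evenly over 24 hours and uses decimal units (1 GB = 1000 MB), neither of which is stated above.

```python
DAILY_FEED_GB = 500  # Usenet feed per day, from the post
INDEX_GB = 80        # index size quoted earlier in the thread

# Assumption: the feed is spread evenly across the day.
rate_gb_per_hour = DAILY_FEED_GB / 24
hours_for_index = INDEX_GB / rate_gb_per_hour

# Sustained line rate in megabits per second (decimal units).
sustained_mbit = DAILY_FEED_GB * 8 * 1000 / 86400

print(f"{rate_gb_per_hour:.1f} GB/hour")
print(f"{hours_for_index:.2f} hours to take in {INDEX_GB} GB")
print(f"{sustained_mbit:.1f} Mbit/s sustained")
```

At that rate 80 GB comes in under four hours, though that is one server receiving one feed, not a copy being pushed out to thousands of drives.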
For the last two months, the majority of these pages have appeared for around 5 days and then disappeared.
How long does this go on for, before the pages stick in the index?
Also, the cache of my index page is over two months old, from before the links to these pages even existed!