Swanson - 12:39 am on May 4, 2006 (gmt 0)
It seems to me they are discarding data, and that is why people are seeing odd cache dates, pages being removed, and more aggressive duplicate content filters. But this is amplified by problems with canonical pages etc. The end result: if you take a harsh line on the data you want to retain, totally remove "duplicates" from your index, and get it wrong which pages you remove, you can't get them back. And you can't recrawl them, because you haven't got enough space, or you aren't sure which ones to recrawl because you aren't sure which ones are real anymore (once you find there is a bug in your handling of duplicate content and canonicalisation). End result, chaos.....
g1smd, that's what I was trying to say: Big Daddy is the attempt at removing data that allows Google to continue, at least in the short term. And that is why they can't correct the problems that GoogleGuy said a few months ago would take a "few weeks to correct by speaking to the crawl team". Let's be honest, we all know that these sorts of problems occurred a long time ago (years), and I believe this is the academic fix rather than the "throw more cash at the hardware" fix. It could have worked, if the fix didn't have bugs in it. I say that because when I worked in that type of industry, we found storage requirements started to explode when we added new data requirements while we were still using the original techniques. The solution was to rewrite the existing method of storage, as it had become uneconomical once the type of data being stored changed. But that became a false economy: in the long term you still hit the storage barrier, and it arrives faster than you think.