
How Google can (or does?) reduce the amount of spidering it has to do

Taken from another thread


Clark

10:03 am on May 14, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Someone posted an extremely interesting speculation, but the thread was only up for a minute or two before it disappeared. I wasn't surprised, since it related closely to this update, so it might have been merged. Unfortunately, I can't find it. So I'm leaving out any discussion of the update and sticking to a discussion of spidering algos.

The idea was to speculate that Google would start to combine 3 months of indexes into one big index. All versions of the same page would be analyzed for changes, so the freshbot would know which pages change frequently or considerably and where it should go first. It would also be less necessary to refresh certain pages again and again with each update.

This makes a lot of sense. If I were a "G" I would definitely add an extra field to each URL noting whether the file changed, how drastic the changes were, and how frequently it was updated. If it didn't change in 5 years, maybe only send the spider every few months and still keep the URL in the index.
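Just to make the idea concrete, here's a toy sketch of what such a per-URL field could look like: halve the revisit interval when the page has changed since the last crawl, double it when it hasn't. This is purely my own speculation about the mechanism, not anything Google has confirmed; the class name, the starting interval of 7 days, and the 1-90 day bounds are all made up for illustration.

```python
import hashlib

class CrawlScheduler:
    """Speculative adaptive revisit scheduler: tighten the interval
    for pages that keep changing, back off for pages that don't.
    (A sketch of the idea in this thread, not Google's actual algo.)"""

    MIN_DAYS, MAX_DAYS = 1, 90  # arbitrary illustrative bounds

    def __init__(self):
        # url -> (last content hash, current revisit interval in days)
        self.records = {}

    def observe(self, url, content):
        digest = hashlib.md5(content.encode()).hexdigest()
        old_hash, days = self.records.get(url, (None, 7))
        if old_hash is None or digest != old_hash:
            days = max(self.MIN_DAYS, days // 2)   # changed: come back sooner
        else:
            days = min(self.MAX_DAYS, days * 2)    # unchanged: back off
        self.records[url] = (digest, days)
        return days
```

A page that never changes drifts out toward the 90-day cap, while a page that changes on every crawl converges on daily revisits, which matches the "don't respider the 5-year-old page every update" intuition above.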

When you think about the processing required to achieve this, it's huge. With terabytes of data in one index, it would require one machine having access to the many versions of the same page. If it's only a 3-month comparison, it's more doable. But in an ideal world, it would be lovely if the engineers figured out a way to compare all versions of a page throughout Google's entire spidering history. Possible? I don't know. But if I were a Google engineer, I'd try to come up with a way.
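One way the whole-history comparison might become tractable (again, my own sketch, not anything documented): don't keep the old pages at all, just keep a small fingerprint of each crawl. Comparing a page against its entire history then means scanning a list of hashes rather than terabytes of HTML. The function names here are invented for illustration.

```python
import hashlib

def fingerprint(content: str) -> str:
    # A 16-byte digest stands in for the whole page,
    # so years of history for a URL fit in a few KB.
    return hashlib.md5(content.encode()).hexdigest()

def change_rate(history: list[str]) -> float:
    """Fraction of consecutive crawls in which the page changed,
    computed from stored fingerprints alone."""
    if len(history) < 2:
        return 0.0
    changes = sum(1 for a, b in zip(history, history[1:]) if a != b)
    return changes / (len(history) - 1)
```

The trade-off is that a hash only tells you *whether* the page changed, not how drastic the change was; judging the size of a change would still need something richer than a fingerprint.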

chris_f

12:31 pm on May 14, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Hi Clark,

The proposal sounds good in theory; however, I can see problems trying to put it into practice. Google is an excellent search engine, and one of its best features is the freshness of the data. I wouldn't want to jeopardise that, so I doubt Google will go along with this.

Chris