The crawl caching proxy

Forum Moderators: Robert Charlton & goodroi

Did I miss a thread?

pontifex

6:40 pm on May 9, 2006 (gmt 0)

Hi,

Matt talked about the "new" spider proxy (or cache) in Boston and wrote about it in his blog. Yet I have not seen any posting about it here... did I miss that thread?

I think that is an important change coming along with the Bigdaddy infrastructure, and it may have an impact on the algo itself.

If the Mediabot has fetched a page, that was normally triggered by a surfer looking at that page. I would say that this might fill the cache FIRST, preferentially, with pages that have AdSense on them. The question now is how the algo cruncher works with that data. If they just ADD pages to the cache whenever they find links to them from pages already in that proxy, the whole listing would shift (just a thought here)...

However, IMHO the strict implementation of such a proxy is worth a lot of thought regarding the algo in the future!

Cheers,
P!

ChadSEO

10:28 pm on May 9, 2006 (gmt 0)

I don't think this will affect the algo directly at all, in the sense that a page previously accessed only by Mediabot would now enter the main index. It's simply that if a page was previously getting fetched by both Mediabot and the main Googlebot, it will now only get fetched once. That's as I understand it, based solely on what Matt has said.
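To make the "fetched once" idea concrete, here is a toy sketch of a shared fetch cache. It is only an illustration of the general technique, not how Google implements it: the class name `CrawlCachingProxy`, the `fake_origin` function, and the example URL are all made up, and a real cache would also expire entries after some time.

```python
from typing import Callable, Dict, List, Tuple


class CrawlCachingProxy:
    """Toy shared cache sitting between several bots and a website (hypothetical)."""

    def __init__(self, fetch: Callable[[str], str]) -> None:
        self._fetch = fetch                # function that really hits the origin server
        self._cache: Dict[str, str] = {}   # url -> page body
        # (bot, url) pairs for requests that actually reached the site
        self.origin_fetches: List[Tuple[str, str]] = []

    def get(self, bot: str, url: str) -> str:
        if url not in self._cache:
            # Cache miss: only now does the site see a request.
            self._cache[url] = self._fetch(url)
            self.origin_fetches.append((bot, url))
        # Cache hit or freshly stored copy: every bot gets the same body.
        return self._cache[url]


def fake_origin(url: str) -> str:
    # Stand-in for a real HTTP request, so the sketch runs offline.
    return f"<html>content of {url}</html>"


proxy = CrawlCachingProxy(fake_origin)
proxy.get("Mediabot", "http://example.com/page")   # first request goes to the site
proxy.get("Googlebot", "http://example.com/page")  # served from the shared cache
print(proxy.origin_fetches)  # [('Mediabot', 'http://example.com/page')]
```

In this sketch the site sees one hit instead of two, while both bots still end up with the same copy of the page. What each bot then does with that copy (ads targeting vs. indexing) is a separate step the cache itself does not decide.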

Looking back over the last month+, Googlebot has indexed about the same number of pages/day, but the number of unique URLs it indexes has increased, so from that perspective, the changes all look good to me.

Chad

g1smd

12:23 am on May 10, 2006 (gmt 0)

I think there is a disconnect between what is spidered and what is indexed, or between what is indexed and the newer information that is available at the website but never seems to make it into the index.

However, that might just be on datacentres where that old version of the index is being phased out - I do see different actions and treatment of sites across the various DCs. Maybe soon, some obvious patterns may become apparent?

pontifex

12:33 am on May 10, 2006 (gmt 0)

I agree that the crawler cache is not the biggest news, but given that Google has majorly changed its infrastructure with Bigdaddy, I wonder whether that cache will have more impact on ranking in the long run than we all think.

Matt tried hard to point out that it is just a cache, but would YOU work differently on your computer if I removed all caches from your system? The hard drive cache, the processor cache, etc.?

A cache, if used right (which I suspect is the case with Google), has a tremendous impact on the way data is handled and processed.