Google Datacenters Watch: 2006-01-30

Observations, Analysis and Remarks

johnwards

3:55 pm on Jan 30, 2006 (gmt 0)

< continued from [webmasterworld.com...] >

This is just odd.

The 64.* DC's return about 300 pages from my site.

The 216.* DC's return about 46,000 pages from my site.

And the 66.* return 69,000 pages from my site.

Currently I have about 65,000 pages.

If I go to google.co.uk I get 46,000 pages. If I go to google.com from my US based server I get the same 46,000 results.

It is all very odd and confusing.

[edited by: tedster at 9:56 pm (utc) on Jan. 30, 2006]

Eazygoin

1:28 pm on Feb 2, 2006 (gmt 0)

Dayo, do you think that there is a single crawl for BL and PR updates? Or did I misunderstand?

I thought that BL and PR were under constant review, as sites are crawled, much as new pages are indexed on a daily basis.

Dayo_UK

1:32 pm on Feb 2, 2006 (gmt 0)

There are two different types of data out there at the moment, each using different crawled material.

Big Daddy - Own Serps, Own Cache with Mozilla Googlebot crawl.

Normal - Own Serps, Own Cache with Normal Googlebot crawl.

As there are two crawls, there should be two sets of BL, PR and ranking infrastructures (at least internally) for the two scenarios. Whether BD is using this new BL, PR and ranking infrastructure yet, I don't know. Both should be constantly updated/reviewed as each of the separate crawls occurs.

Eazygoin

2:03 pm on Feb 2, 2006 (gmt 0)

So, if BD and the classic SERPs have their own BL and PR values, then BD's must be hidden/internal, as TBPR values are the same across the board.

If that's the case, then they will probably remain hidden until they become spread throughout the SERPs. Otherwise, any BL and PR values would not have any bearing on the bulk or classic SERPs as viewed by everyone, but would rather be values for a few DCs that aren't on display much of the time.

Added: What I meant to mention is that, seeing as we can't expect BD to be across the board until mid-March, if there is a BL and PR update prior to that, then it would probably feed off the classic SERPs.

[edited by: Eazygoin at 2:08 pm (utc) on Feb. 2, 2006]

g1smd

2:04 pm on Feb 2, 2006 (gmt 0)

I keep seeing people mention that this isn't a data refresh. However, when I look at all the pages that show a modern cache and modern snippet for some queries, and an ancient snippet for other queries (when searching for words that are no longer on the real page, that is) I can see lots of problems.

The ancient snippet is sometimes served with a modern cache, and for other sites it is served with an ancient cache for that page. I still haven't worked out what triggers Google to keep an old cache, or to show a modern cache against an ancient snippet; but it is abundantly clear that the supplemental database is filling up with poisoned data.

However, seeing all the old cache data and old snippet data out there now, the most important thing I can think of to clean this up would be exactly that: a data refresh.

Matt Cutts mentioned a "Supplemental Googlebot" in passing a few weeks ago. He said that this would help clear up some of the problems with supplemental results. Has anyone seen any evidence of such a bot yet? Or, as I suspect, is Google still at the planning or building stages of this right now? It is funny that Supplemental Googlebot had never come up in previous postings either here or elsewhere.

Dayo_UK

2:28 pm on Feb 2, 2006 (gmt 0)

Eazygoin

Try to ignore displayed TBPR.

Normal crawl may have assigned a page a PR of 1, while Mozilla Googlebot crawl may assign the same page a PR of 4.

On the normal DCs we would be happy enough saying it is using the PR of 1 - on the BD DCs we would think that it would be using the PR of 4.....

However, who is to know that BD is using the PR that it has recalculated at this stage? Instead of using its local DC's calculated PR, there may be a central place, and the PR of 1 may take precedence at this stage while BD is still in development.

In recent experience, when internal PR is recalculated (or so it appears) it hits all the DCs at pretty much the same time, so maybe there is a central place where DCs query the PR for the SERP calculation. Probably unlikely, as it would slow the process down, but normal Googlebot PR might locally be overriding Mozilla Googlebot calculated PR.

This is all guesswork though.

But as BD has different crawl and indexing characteristics (e.g. largely Mozilla Googlebot), a PR recalculation will happen or is happening... I am trying to figure out if internally it has been applied to the BD DCs.

Judging by the SERPs and MC's comment that this is not about ranking changes, I would guess that internally it has not been applied. With MC also saying it is not pertinent at this stage, is that more evidence that it has not been applied?

OK, lots of people will say PR is pointless. But I don't just mean PR; I mean PR, BL, ranking and other external factors that affect a site's ranking.

[edited by: Dayo_UK at 2:34 pm (utc) on Feb. 2, 2006]

300m

2:31 pm on Feb 2, 2006 (gmt 0)

g1smd
Thank you for posting that. More attention needs to go to that subject, imho.

In my observations, the "data refresh" that occurred on December 27th is unrelated to Big Daddy. At least that is what Matt said in his blog. Interestingly enough, in his earliest post about Big Daddy (before the initial feedback Friday posts), he gave an IP to watch. I was looking good with the large index result increase, but then on the 27th something happened that changed all of that. I went from having content-rich web pages that were on target in the top 3 for the majority of the keywords I go after, to seeing subdomain spam, link spam and big directories and newspaper organizations.

The really weird thing about all of it is that a good number of the pages that are ranking well on Google after that data refresh have not been touched in a long time. Some pages even say in great big bold text (This page is obsolete).

Some are most certainly supp results.

At this point, BD is not what it is cracked up to be, unless they do another data refresh that addresses what occurred on December 27th.

frakilk

2:57 pm on Feb 2, 2006 (gmt 0)

I have an important question, how likely is it that we will see an algo update / data refresh on the BD datacenters while they are rolled out? Will the Google team only be concentrating on the rollout or will they be able to 'multitask'?

Judging by the rough guess that Matt made on his blog that a new datacenter will become part of BD every 10 days and if there is no algo update until the rollout is complete, could we be facing another 2 or so months of terrible SERPs?

300m - the phrases that you are watching, do they return a large number of results or are they low competition phrases? The large majority of my traffic has been lost due to poor rankings for low competition phrases.

needinfo

3:05 pm on Feb 2, 2006 (gmt 0)

frakilk,
I've been wondering about exactly the same thing.

300m

3:22 pm on Feb 2, 2006 (gmt 0)

Well, results and competition are always debated, from what I have read. From what I have noticed, yes, they are very competitive keywords in my area.

Example:
71,300,000 on a non bd dc

and

133,000,000 on bd

before the data refresh, it was steady at 49,000,000

Are the keywords competitive? For the above example, in my area yes very. I have always been in the top 3 for the above example and it was a big money maker.

However, I do target less competitive keywords and those dropped as well.

g1smd

3:42 pm on Feb 2, 2006 (gmt 0)

Anyone know why the "HTML Cache" of PDF files does NOT include the crawl date?

It is there for HTML content, but not for PDF files.
