
Forum Moderators: Robert Charlton & goodroi


Supplemental club: Big Daddy coming - Part 1

W'sup with google?

6:43 pm on Mar 2, 2006 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Mar 24, 2002
votes: 0

Carrying on from here:


and here


A lot of members are seeing huge sites going supplemental. One of our main sites lost all rankings, 200,000+ pages disappeared, and now we are left with 19k useless results. This could be a goof or it could be a new round of penalties. If you have had your site reduced to the 'sup index, let's hear about it and compare notes.

8:57 pm on Mar 4, 2006 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Nov 23, 2000
votes: 0

Thank you for your post!
I sent you mail...
9:20 pm on Mar 4, 2006 (gmt 0)

Senior Member from KZ 

WebmasterWorld Senior Member lammert is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Jan 10, 2005
votes: 24

The docid we see in the URL after cache: is a hash value and is not the same as what Scarecrow is referring to. He is talking about the internal unique binary value of each URL, running from 0 up to some limit, either 2^32, 2^40, 2^64 or whatever.

Being one of the WebmasterWorld members that caused Scarecrow to put on his flameproof suit about a year ago, I would like to refrain from comments until this whole supplemental issue becomes a little bit clearer. But Matt Cutts' words that the Big Daddy infrastructure is primarily there to solve canonicalization problems are not in contradiction with a merge from several separate 32-bit index systems into one large index.
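Purely as an illustration of that idea (this is not Google's actual scheme, just a sketch of how such a merge could work), several independent 32-bit docID spaces can be folded into one 64-bit space by packing a shard number into the high bits, so old IDs never collide:

```python
def merged_docid(shard: int, old_id: int) -> int:
    """Hypothetical merge: shard number in the high 32 bits,
    the original 32-bit docID preserved in the low 32 bits."""
    assert 0 <= old_id < 2**32 and shard >= 0
    return (shard << 32) | old_id

# Shard 0 keeps its old IDs unchanged; other shards move to disjoint ranges.
print(merged_docid(0, 123))  # -> 123
print(merged_docid(3, 123))  # -> 12884902011
```

Under a scheme like this, nothing in the old per-shard indexes has to be renumbered; only the high bits distinguish which index a document came from.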

9:40 pm on Mar 4, 2006 (gmt 0)

Full Member

10+ Year Member

joined:Jan 13, 2004
votes: 0

Well, the DocID appears in the URLs for the cache links, so do we look there for a longer string, or are they going to try to "hide" it?

I just looked up the DOCID on new and old datacentres for an indexed page. They are the same.

The docID I'm talking about is defined in The Anatomy of a Large-Scale Hypertextual Web Search Engine [www-db.stanford.edu]. It was originally 4 bytes. The ID in the URL is NOT the docID. That URL ID is about 12 bytes. It is some sort of look-up number. It has to be URL-compatible (7-bit ASCII) because it is used in the URL. If you tried to put the internal binary docID into a URL directly, the URL would break; it has to be converted to URL-acceptable characters. For all I know, maybe the docID is contained somewhere within it. Maybe the rest of it is additional locator information for speedier access.

The docID is 32-bits or 4 bytes, or at least it was originally. This gives you a maximum of 4.29 billion counts before you run out of unique combinations and roll over.

Best estimates are that, on average, each docID is stored twice per word per page. That's because they have two inverted indexes: one is "fancy" and the other is "plain."

The average number of words per web page is 300. Here are the space requirements for the docID if we assume 4 bytes, 12 bytes, and 20 bytes, for 4 billion web pages:

4 bytes: 300 × 4 billion × 8 = 9.6 × 10^12 bytes (about 10 terabytes)

12 bytes: 300 × 4 billion × 24 = 2.88 × 10^13 bytes (about 29 terabytes)

20 bytes: 300 × 4 billion × 40 = 4.8 × 10^13 bytes (about 48 terabytes)

If you were designing a search engine, how many bytes would you choose for your docID? Obviously, you'd go with the minimum number of bits you think you will ever need.
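The arithmetic above is easy to sanity-check. A quick sketch in Python, using the post's own assumptions (300 words per page, 4 billion pages, and two index copies per word):

```python
# Back-of-the-envelope check of the storage figures above: each docID is
# stored twice per word (once in the "fancy" and once in the "plain"
# inverted index), with 300 words per page across 4 billion pages.
WORDS_PER_PAGE = 300
PAGES = 4_000_000_000
COPIES = 2  # two inverted indexes

for docid_bytes in (4, 12, 20):
    total_bytes = WORDS_PER_PAGE * PAGES * COPIES * docid_bytes
    print(f"{docid_bytes:2d}-byte docID: {total_bytes / 1e12:.1f} TB")
# -> 4-byte: 9.6 TB, 12-byte: 28.8 TB, 20-byte: 48.0 TB
```

The jump from 4-byte to 12-byte IDs roughly triples the posting-list storage, which is why the choice of docID width matters so much at this scale.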

9:42 pm on Mar 4, 2006 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Dec 4, 2004
votes: 0

gee, why am I not surprised that googleguy suddenly reappears for a brief posting?

Anyway, if scarecrow isn't completely right, he's right enough that the details don't matter. Why? Because this explains basically everything google has done in the last 3 years: all the problems, the weirdnesses, updates that aren't updates, sandboxes that aren't sandboxes, and so on.

It would make a lot of sense, if you are a Google engineer figuring out what to do back in 2003, to stall on the docID problem until you can migrate to 64-bit processors. For one thing, Google got a lot richer and 64-bit processors got a lot cheaper at the same time. For another, there's a new trend toward more processing power per watt, and Google's huge electric bills are a source of concern to them.

If you actually are interested in having a somewhat long term understanding of what's going on, this is about as clear as it gets. And if you want to understand supplementals and all that stuff, you really don't need to go much further than this. Personally I stopped worrying about supplementals about the time I first heard about them, but I guess they interest some people enough to make it a topic worth continuing.

For those of you who don't follow such things, performance per watt is not a minor topic in very large datacenter design and implementation. Especially not for datacenters like the ones google runs.

<added>anyway, I just saw scarecrow posted again. Personally I'm not concerned with the details, since I can't know them unless googleguy wants to say something more revealing than he's allowed to. But the basic idea is simple: a system designed for 32 bits isn't going to just switch overnight to 64-bit stuff; it's hard to do that, lots of work. And datacenters aren't just going to switch overnight to 64-bit machines. If you want to know how long that takes, just look at the span from the first appearance of big daddy until it spread through all of google's networks.

Like scarecrow, I've lost all interest in arguing this stuff, it was obvious then, and now it's a fact.

I have to say though, this fits EXACTLY with what I thought google was doing for the last 6-8 months, including bourbon and jagger.

10:11 pm on Mar 4, 2006 (gmt 0)

Junior Member

10+ Year Member

joined:June 14, 2003
votes: 0


Are you saying that pages need to first be purged from the old 32-bit index in order to be assigned a new docID in the 64-bit architecture?

Just trying to figure out why pages would be deleted.

10:30 pm on Mar 4, 2006 (gmt 0)

New User

10+ Year Member

joined:Nov 3, 2005
votes: 0

Just try to figure out why Google does not delete 404 pages from its databases. Did anybody ask at SES?
11:18 pm on Mar 4, 2006 (gmt 0)

Full Member

10+ Year Member

joined:Jan 13, 2004
votes: 0

Are you saying that pages need to first be purged from the old 32-bit index in order to be assigned a new docID in the 64-bit architecture?

I wouldn't know. The only thing I can suggest is that if there is a major shift in infrastructure underway, there will be some churn. The best we can hope for is some evidence that the shift is rational in the way it progresses.

If I were a Google engineer I might start the migration with certain top-level domains: gov, edu, org. These are more manageable because they are many times smaller than the dot-coms. Also, the sort of people who normally aren't heard from in the press when it comes to Google quality-control, might suddenly start noticing if gov, edu, and org get turned upside-down. I would start looking for patterns about what sort of sites are affected.

I have a dot-org that has been stable for three years on my end, but has been like a roller-coaster for the last three years in terms of fully-indexed pages vs. URL-only listings. The ratio has gone from 3 to 1, to 1 to 3, to 2 to 1, to 1 to 2, and then back again, for the indexed pages compared to the URL-only pages. In the meantime, nothing important was changed on my end. The site has 130,000 pages.

A couple weeks ago, all of my URL-only pages disappeared completely. My Google referrals are up only slightly so far, but those URL-only pages are gone every time I check. It's very stable. That's good news for me, because those URL-only pages never drew any traffic.

11:53 pm on Mar 4, 2006 (gmt 0)

Junior Member

10+ Year Member

joined:Jan 19, 2003
votes: 0

Background: on the Big Daddy datacenters my homepage is good and all other pages are supplemental; the cache reference recently changed from June-Aug dates to a hash code like MMNJLTot6sYJ:www.mydomain.com; the default Google results are good.

Digging into my supplementals I found urls like www.mydomain.com//page.htm.

I have some code at the top of each page that redirects non-www to www. I checked that with a couple of header status checkers and it works. But I now have a problem with www.mydomain.com//page.htm showing up in Google's index instead of www.mydomain.com/page.htm.

I am not an ASP programmer, so I am a little concerned that maybe the code I was provided with to do the non-www to www redirect is causing the problem.

Here is the code

dim hostname
dim pathinfo
dim bRedirect
dim MainDomain

hostname = request.servervariables("HTTP_HOST")
pathinfo = request.servervariables("PATH_INFO")
MainDomain = "www.mydomain.com"

if pathinfo <> "" then
    if instr(lcase(pathinfo), "default.asp") > 0 or instr(lcase(pathinfo), "index.asp") > 0 or instr(lcase(pathinfo), "index.aspx") > 0 then
        MainDomain = MainDomain & "/" & mid(pathinfo, 2, instrrev(pathinfo, "/") - 1)
        MainDomain = MainDomain & pathinfo
    end if
    MainDomain = MainDomain & "/"
end if

if left(hostname, instr(hostname, ".")) <> "www." then bRedirect = true

if instr(lcase(hostname), "default.asp") > 0 or instr(lcase(hostname), "index.asp") > 0 or instr(lcase(hostname), "index.aspx") > 0 then bRedirect = true

if bRedirect then
    response.status = "301 Moved Permanently"
    response.addheader "Location", "http://" & MainDomain
end if

Anyone see something that may be causing the www.mydomain.com//page.htm problem? Or have an idea how I can add a redirect that sends both the www and non-www versions of www.mydomain.com//page.htm to www.mydomain.com/page.htm throughout the site?
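For what it's worth, the doubled slash can be reproduced by tracing the string concatenation in the posted code. A rough Python transliteration (the function name is mine; VB's 1-based mid/instrrev are converted to 0-based slicing) shows that a "/" is appended to MainDomain right before pathinfo, which itself already starts with "/":

```python
def build_redirect_target(pathinfo, main_domain="www.mydomain.com"):
    # Assumed-faithful transliteration of the VBScript concatenation above.
    default_pages = ("default.asp", "index.asp", "index.aspx")
    target = main_domain
    if pathinfo:
        if any(p in pathinfo.lower() for p in default_pages):
            last_slash_1based = pathinfo.rfind("/") + 1  # VB instrrev(pathinfo, "/")
            # VB: mid(pathinfo, 2, instrrev(pathinfo, "/") - 1)
            target += "/" + pathinfo[1:1 + last_slash_1based - 1]
            target += pathinfo  # pathinfo begins with "/", so this doubles the slash
        target += "/"
    return target

print(build_redirect_target("/default.asp"))
# -> www.mydomain.com//default.asp/  (doubled slash, plus a stray trailing slash)
```

So whenever the default-document branch fires, the "/" appended before the mid(...) fragment combines with the leading slash of PATH_INFO to produce the // seen in the index.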

1:17 am on Mar 5, 2006 (gmt 0)

New User

10+ Year Member

joined:Mar 3, 2006
votes: 0

Are you using Google Sitemaps, and what are the cache dates on your double-slashed URLs?
Here's a thread that discusses this issue - no resolution, but several webmasters pointing at an early problem with the Google Sitemaps bot actually creating the double-slash URLs.


2:23 am on Mar 5, 2006 (gmt 0)

Full Member

10+ Year Member

joined:July 18, 2004
votes: 0

Welcome to the 100,000-pages club?

Scanning this thread, I have noticed that all the affected sites seem to have a very large number of pages.

Is that so? Does everyone who was affected have, say, a site of more than 10,000 pages?

Personally I do not understand how it is possible to build a site with 100,000 pages that are *all* worth showing in the search results.

Of course there are large companies, but there are few of them and Google probably treats them manually.

What are the other large sites? Are they superstores? If so, their pages may be interesting only to local visitors. Does going supplemental depend on location?

Another example of a large site that comes to mind is a directory.

However, since the goal of a widget directory is to provide better search for widgets, it makes sense for Google to leave just the main directory page in the index. Those who search for a specific sort of widget will in general be more satisfied to find the site of the widget producer rather than the directory.

So is it possible that, in addition to the duplicate content filter, we now have a filter for sites with too large a number of pages? Let's call it the directory filter.


2:27 am on Mar 5, 2006 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Dec 19, 2004
votes: 0

No effects for me on any datacenters. All of my sites (about 10) are between 100-1000 pages.

Some older than 5 years, some newer than one year.

All had 301 redirect applied in September 2005.

No new supplemental listings for any of them that I can see, and about 10-15% increase in google traffic over the last 7 days.

Just my observations to help us get to the bottom of this.

3:17 am on Mar 5, 2006 (gmt 0)

Full Member

joined:Dec 1, 2003
votes: 0

It is a shame 301s aren't affected more in this DC update. In my experience most 301s are carried out because the page in question has gained a ranking through black hat SEO. As soon as they get the result they do a 301 to a "clean" page. It's bull#*$!.
I think unless it's a run-of-site 301, all 301s should be ignored.
3:48 am on Mar 5, 2006 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Mar 16, 2004
votes: 0

One of our sites which is affected has around 1,500 pages of handmade custom content, online for about 5 years with a PR7 and stable traffic for years, nothing fancy SEO-wise. Basically the entire site is supplemental besides the homepage, so no, this isn't only affecting 100k+ page sites.
4:00 am on Mar 5, 2006 (gmt 0)

Junior Member

10+ Year Member

joined:Jan 8, 2004
votes: 0

I am trying to find a link between the sites which have been placed in supplemental listings. If your page is supplemental, please email me the URL as well as the age of the site.

Thank you. Once the list is compiled I will email you back the entire list of sites that have this problem and any common occurrences I found.

email to edbri871 (at) gmail.com

4:06 am on Mar 5, 2006 (gmt 0)

Senior Member

WebmasterWorld Senior Member googleguy is a WebmasterWorld Top Contributor of All Time 10+ Year Member

joined:Oct 8, 2001
votes: 0

Wow, Scarecrow is around too. It's like old times. :)

I'm not eager to get flamed for the zillionth time on the 4-byte ID problem, but if GoogleGuy wants to deny it once again, for the record, that would be fine with me.

I'm fine to deny this, because docids and their size have nothing at all to do with what people have been describing on this thread. I've been reading through the feedback, and it backs up the theory that I had before I asked for feedback.

Based on the specifics everyone has sent (thank you, by the way), I'm pretty sure what the issue is. I'll check with the crawl/indexing team to be sure though. Folks don't need to send any more emails unless they really want to. It may take a week or so to sort this out and be sure, but I do expect these pages to come back to the main index.

This 233-message thread spans 16 pages.