A lot of members are seeing huge sites going supplemental. One of our main sites lost all rankings: 200,000+ pages disappeared, and now we are left with 19k useless results. This could be a goof or it could be a new round of penalties. If you have had your site reduced to the supplemental index, let's hear about it and compare notes.
Being one of the WebmasterWorld members who caused Scarecrow to put on his flame-proof suit about a year ago, I would like to refrain from comments until this whole supplemental issue becomes a little bit clearer. But Matt Cutts' statement that the Big Daddy infrastructure is primarily there to solve canonicalization problems does not contradict a merge from several separate 32-bit index systems into one large index.
Well, the DocID appears in the URLs for the cache links, so do we look there for a longer string, or are they going to try to "hide" it?
I just looked up the DOCID on new and old datacentres for an indexed page. They are the same.
The docID is 32 bits, or 4 bytes, or at least it was originally. This gives you a maximum of about 4.29 billion unique values (2^32) before you run out of combinations and roll over.
Best estimates are that on the average, each docID is used twice per word per page. That's because they have two inverted indexes. One is "fancy" and the other is "plain."
The average number of words per web page is 300. Here are the space requirements for the docID if we assume 4 bytes, 12 bytes, and 20 bytes, for 4 billion web pages:
4 bytes: 300 * 4 billion * 8 = 9.6 x 10^12 bytes (about 10 terabytes)
12 bytes: 300 * 4 billion * 24 = 2.88 x 10^13 bytes (about 29 terabytes)
20 bytes: 300 * 4 billion * 40 = 4.8 x 10^13 bytes (48 terabytes)
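For anyone who wants to check the arithmetic, here is a minimal Python sketch. It only restates the estimates from this post (300 words per page, 4 billion pages, each docID stored twice per word); the constant and function names are made up for illustration, and none of it is confirmed Google data.

```python
# Rough docID storage estimate, using the figures quoted in this post.
# All numbers are the poster's estimates, not confirmed values.
WORDS_PER_PAGE = 300          # assumed average words per web page
PAGES = 4_000_000_000         # assumed size of the index
USES_PER_WORD = 2             # one entry in each of the two inverted indexes

def docid_storage_bytes(docid_bytes):
    """Total bytes spent on docIDs for a given docID width in bytes."""
    return WORDS_PER_PAGE * PAGES * USES_PER_WORD * docid_bytes

# Number of distinct values a 32-bit (4-byte) docID can hold.
MAX_32BIT_DOCIDS = 2 ** 32

for width in (4, 12, 20):
    total = docid_storage_bytes(width)
    print(f"{width:>2}-byte docID: {total:.2e} bytes (~{total / 1e12:.1f} TB)")
```

The multipliers 8, 24, and 40 in the list above are just the docID width times the two uses per word.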
If you were designing a search engine, how many bytes would you choose for your docID? Obviously, you'd go with the minimum number of bits you think you will ever need.
Anyway, if Scarecrow isn't all the way right, he's right enough that the details don't matter. Why? Because this explains basically everything Google has done in the last 3 years: all the problems, the weirdnesses, updates that aren't updates, sandboxes that aren't sandboxes, and so on:
It would make a lot of sense, if you are a Google engineer figuring out what to do back in 2003, to stall on the docID problem until you can migrate to 64-bit processors. For one thing, Google got a lot richer and 64-bit processors got a lot cheaper at the same time. For another, there's a new trend toward more processing power per watt, and Google's huge electric bills are a source of concern to them.
If you actually are interested in having a somewhat long term understanding of what's going on, this is about as clear as it gets. And if you want to understand supplementals and all that stuff, you really don't need to go much further than this. Personally I stopped worrying about supplementals about the time I first heard about them, but I guess they interest some people enough to make it a topic worth continuing.
For those of you who don't follow such things, performance per watt is not a minor topic in very large datacenter design and implementation, especially for data centers like the ones Google runs.
<added>Anyway, I just saw Scarecrow posted again. Personally I'm not concerned with the details since I can't know them, unless GoogleGuy wants to say something more revealing than he's allowed to say. But the basic idea is simple: a system designed for 32 bits isn't going to switch overnight to 64-bit stuff; that's hard to do and a lot of work. And datacenters aren't going to switch overnight to 64-bit machines either. If you want to know how long that takes, just look at the time from the first appearance of Big Daddy until it has spread through all of Google's networks.
Like Scarecrow, I've lost all interest in arguing this stuff. It was obvious then, and now it's a fact.
I have to say though, this fits EXACTLY with what I thought google was doing for the last 6-8 months, including bourbon and jagger.
Are you saying that pages need to be first purged from the old 32 bit index in order to be assigned a new doc id in the 64 bit architecture?
If I were a Google engineer I might start the migration with certain top-level domains: gov, edu, org. These are more manageable because they are many times smaller than the dot-coms. Also, the sort of people who normally aren't heard from in the press when it comes to Google quality-control, might suddenly start noticing if gov, edu, and org get turned upside-down. I would start looking for patterns about what sort of sites are affected.
I have a dot-org that has been stable for three years on my end, but has been like a roller-coaster for the last three years in terms of fully-indexed pages vs. URL-only listings. The ratio has gone from 3 to 1, to 1 to 3, to 2 to 1, to 1 to 2, and then back again, for the indexed pages compared to the URL-only pages. In the meantime, nothing important was changed on my end. The site has 130,000 pages.
A couple weeks ago, all of my URL-only pages disappeared completely. My Google referrals are up only slightly so far, but those URL-only pages are gone every time I check. It's very stable. That's good news for me, because those URL-only pages never drew any traffic.
Digging into my supplementals I found URLs like www.mydomain.com//page.htm.
I have some code at the top of each page that redirects non-www to www. I checked it with a couple of header status checkers and it works. But I now have a problem with www.mydomain.com//page.htm showing up in Google's index instead of www.mydomain.com/page.htm.
I am not an ASP programmer, so I am a little concerned that the non-www to www redirect code I was given may be causing the problem.
Here is the code:
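' Canonical host that every request should end up on
hostname = request.servervariables("HTTP_HOST")
pathinfo = request.servervariables("PATH_INFO")
MainDomain = "www.mydomain.com"
bRedirect = false
if pathinfo <> "" then
    if instr(lcase(pathinfo),"default.asp") > 0 or instr(lcase(pathinfo),"index.asp") > 0 or instr(lcase(pathinfo),"index.aspx") > 0 then
        ' Default document: keep only the directory part of the path
        MainDomain = MainDomain & "/" & mid(pathinfo,2,instrrev(pathinfo,"/")-1)
    else
        MainDomain = MainDomain & pathinfo
    end if
else
    MainDomain = MainDomain & "/"
end if
' Redirect when the host does not start with "www."
if left(hostname,instr(hostname,".")) <> "www." then bRedirect = true
' Redirect default documents to their directory URL (the file name is in pathinfo, never in hostname)
if instr(lcase(pathinfo),"default.asp") > 0 or instr(lcase(pathinfo),"index.asp") > 0 or instr(lcase(pathinfo),"index.aspx") > 0 then bRedirect = true
if bRedirect then
    response.status = "301 Moved Permanently"
    response.addheader "Location", "http://" & MainDomain
    ' Stop processing so no page body is sent after the redirect headers
    response.end
end if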
Does anyone see something that may be causing the www.mydomain.com//page.htm problem? Or does anyone have an idea how I can add a redirect that would send both www and non-www requests for //page.htm to www.mydomain.com/page.htm throughout the site?
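One way to think about the double-slash question is as a normalization step: work out the canonical URL for each request and send a 301 only when the requested URL differs from it. The following is a language-neutral sketch in Python, not drop-in ASP. The function name `canonical_redirect_target` is hypothetical, the host is the placeholder from the post, and whether the server passes the doubled slash through to the script at all is an assumption that would need checking.

```python
import re
from urllib.parse import urlsplit, urlunsplit

# Placeholder canonical host taken from the post above.
CANONICAL_HOST = "www.mydomain.com"

def canonical_redirect_target(url):
    """Return the URL to 301-redirect to, or None if url is already canonical."""
    parts = urlsplit(url)
    # Collapse any run of slashes in the path: //page.htm -> /page.htm
    path = re.sub(r"/{2,}", "/", parts.path) or "/"
    # Force the canonical host; keep scheme and query string, drop any fragment.
    target = urlunsplit((parts.scheme, CANONICAL_HOST, path, parts.query, ""))
    return None if target == url else target

for requested in (
    "http://www.mydomain.com//page.htm",
    "http://mydomain.com//page.htm",
    "http://www.mydomain.com/page.htm",
):
    print(requested, "->", canonical_redirect_target(requested))
```

Handling the host and the path in a single step means a request for the non-www, double-slash form gets one redirect instead of a chain of two.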
Scanning this thread, I have noticed that all the affected sites seem to have a very large number of pages.
Is that so? Does everyone who was affected have a site with, say, more than 10,000 pages?
Personally I do not understand how it is possible to build a site with 100,000 pages that are *all* worth showing in the search results.
Of course there are large companies, but there are few of them and Google probably treats them manually.
What are the other large sites? Are they superstores? If so, their pages may be interesting only to local visitors. Does going supplemental depend on location?
Another type of large site that comes to mind is a directory.
However, since the goal of a widget directory is to provide a better search for widgets, it makes sense for Google to leave just one main directory page in the index. Those who search for a specific sort of widget will generally be more satisfied when they find the site of the widget producer and not the directory.
So is it possible that, in addition to the duplicate content filter, we now have a filter for sites with too many pages? Let's call it the directory filter.
Some older than 5 years, some newer than one year.
All had 301 redirect applied in September 2005.
No new supplemental listings for any of them that I can see, and about 10-15% increase in google traffic over the last 7 days.
Just my observations to help us get to the bottom of this.
Thank you. Once the list is compiled I will email you back the entire list of sites that have this problem and any common occurrences I found.
email to edbri871 (at) gmail.com
I'm not eager to get flamed for the zillionth time on the 4-byte ID problem, but if GoogleGuy wants to deny it once again, for the record, that would be fine with me.
I'm fine to deny this, because docids and their size have nothing at all to do with what people have been describing on this thread. I've been reading through the feedback, and it backs up the theory that I had before I asked for feedback.
Based on the specifics everyone has sent (thank you, by the way), I'm pretty sure what the issue is. I'll check with the crawl/indexing team to be sure though. Folks don't need to send any more emails unless they really want to. It may take a week or so to sort this out and be sure, but I do expect these pages to come back to the main index.