Forum Moderators: Robert Charlton & goodroi
[webmasterworld.com...]
and here
[webmasterworld.com...]
A lot of members are seeing huge sites going supplemental. One of our main sites lost all rankings, 200,000+ pages disappeared, and now we are left with 19k useless results. This could be a goof or it could be a new round of penalties. If you have had your site reduced to the supplemental index, let's hear about it and compare notes.
Being one of the WebmasterWorld members that caused Scarecrow to put on his flame-proof suit about a year ago, I would like to refrain from comments until this whole supplemental issue becomes a little bit clearer. But Matt Cutts' statement that the Big Daddy infrastructure is primarily there to solve canonicalization problems does not contradict a merge from several separate 32-bit index systems to one large index.
Well, the DocID appears in the URLs for the cache links, so do we look there for a longer string, or are they going to try to "hide" it?
I just looked up the DOCID on new and old datacentres for an indexed page. They are the same.
The docID is 32 bits, or 4 bytes, or at least it was originally. This gives you a maximum of about 4.29 billion unique values before you run out of combinations and roll over.
Best estimates are that on the average, each docID is used twice per word per page. That's because they have two inverted indexes. One is "fancy" and the other is "plain."
The average number of words per web page is 300. Here are the space requirements for the docID if we assume 4 bytes, 12 bytes, and 20 bytes, for 4 billion web pages:
4 bytes: 300 * 4 billion * 8 = 9.6 * 10^12 bytes (about 10 terabytes)
12 bytes: 300 * 4 billion * 24 = 2.88 * 10^13 bytes (about 29 terabytes)
20 bytes: 300 * 4 billion * 40 = 4.8 * 10^13 bytes (48 terabytes)
If you were designing a search engine, how many bytes would you choose for your docID? Obviously, you'd go with the minimum number of bits you think you will ever need.
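The arithmetic in this post can be checked with a few lines of Python. This is only a sketch of the poster's estimate: the 300 words per page, 4 billion pages, and two index entries per word are the figures from the post, while the constant and function names are made up for illustration.

```python
# Figures taken from the post above; the names are illustrative only.
WORDS_PER_PAGE = 300           # average words per web page
PAGES = 4_000_000_000          # 4 billion web pages
ENTRIES_PER_WORD = 2           # one entry each in the "fancy" and "plain" indexes


def docid_storage_bytes(docid_bytes):
    """Total bytes spent on docIDs for a given docID width in bytes."""
    return WORDS_PER_PAGE * PAGES * docid_bytes * ENTRIES_PER_WORD


for width in (4, 12, 20):
    total = docid_storage_bytes(width)
    # Terabytes here are decimal (10^12 bytes), matching the post.
    print(f"{width:>2}-byte docID: {total:.2e} bytes (about {total / 1e12:.1f} TB)")

# Number of distinct values a 32-bit docID can take before it rolls over.
MAX_32BIT_DOCIDS = 2 ** 32
print(f"32-bit docIDs available: {MAX_32BIT_DOCIDS:,}")
```

The multipliers 8, 24, and 40 in the post are simply the docID width doubled for the two indexes.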
Anyway, if Scarecrow isn't all the way right, he's right enough that the details don't matter. Why? Because this explains basically everything Google has done in the last 3 years, all the problems, the weirdnesses, updates that aren't updates, sandboxes that aren't sandboxes, and so on:
It would make a lot of sense, if you are a Google engineer figuring out what to do back in 2003, to stall on the docID problem until you can migrate to 64-bit processors. For one thing, Google got a lot richer and 64-bit processors got a lot cheaper at the same time. For another, there's a new trend toward more processing power per watt, and Google's huge electric bills are a source of concern to them.
If you actually are interested in having a somewhat long term understanding of what's going on, this is about as clear as it gets. And if you want to understand supplementals and all that stuff, you really don't need to go much further than this. Personally I stopped worrying about supplementals about the time I first heard about them, but I guess they interest some people enough to make it a topic worth continuing.
For those of you who don't follow such things, performance per watt is not a minor topic in very large datacenter design and implementation. Especially not for data centers like the ones Google runs.
<added>Anyway, just saw Scarecrow posted again. Personally I'm not concerned with the details since I can't know them, unless GoogleGuy wants to actually say something more revealing than he's allowed to say. But the basic idea is simple: a system designed for 32 bits isn't going to just switch overnight to 64-bit stuff. It's hard to do that, lots of work. And datacenters aren't just going to switch overnight to 64-bit machines. If you want to know how long it takes to do that, just look at the time from the first appearance of Big Daddy until it has spread through all of Google's networks.
Like Scarecrow, I've lost all interest in arguing this stuff. It was obvious then, and now it's a fact.
I have to say though, this fits EXACTLY with what I thought Google was doing for the last 6-8 months, including Bourbon and Jagger.
Are you saying that pages need to be first purged from the old 32-bit index in order to be assigned a new docID in the 64-bit architecture?
If I were a Google engineer I might start the migration with certain top-level domains: gov, edu, org. These are more manageable because they are many times smaller than the dot-coms. Also, the sort of people who normally aren't heard from in the press when it comes to Google quality control might suddenly start noticing if gov, edu, and org get turned upside-down. I would start looking for patterns in what sort of sites are affected.
I have a dot-org that has been stable for three years on my end, but has been like a roller-coaster for the last three years in terms of fully-indexed pages vs. URL-only listings. The ratio has gone from 3 to 1, to 1 to 3, to 2 to 1, to 1 to 2, and then back again, for the indexed pages compared to the URL-only pages. In the meantime, nothing important was changed on my end. The site has 130,000 pages.
A couple weeks ago, all of my URL-only pages disappeared completely. My Google referrals are up only slightly so far, but those URL-only pages are gone every time I check. It's very stable. That's good news for me, because those URL-only pages never drew any traffic.
Digging into my supplementals I found URLs like www.mydomain.com//page.htm.
I have some code at the top of each page that redirects non-www to www. I checked that with a couple of header status checkers and it works. But I now have a problem with www.mydomain.com//page.htm showing up in Google's index instead of www.mydomain.com/page.htm.
I am not an ASP programmer, so I am a little concerned that maybe the code I was provided with for the non-www to www redirect is causing the problem.
Here is the code:
<%
' Redirect non-www requests to the www host with a 301.
dim hostname
dim pathinfo
dim bRedirect
dim MainDomain

hostname = request.servervariables("HTTP_HOST")
pathinfo = request.servervariables("PATH_INFO")
MainDomain = "www.mydomain.com"

' Build the redirect target: canonical host plus the requested path.
if pathinfo <> "" then
    if instr(lcase(pathinfo),"default.asp") > 0 or instr(lcase(pathinfo),"index.asp") > 0 or instr(lcase(pathinfo),"index.aspx") > 0 then
        ' Default documents: keep only the directory part, dropping the file name.
        MainDomain = MainDomain & "/" & mid(pathinfo,2,instrrev(pathinfo,"/")-1)
    else
        MainDomain = MainDomain & pathinfo
    end if
else
    MainDomain = MainDomain & "/"
end if

' Redirect when the first label of the host name is not "www.".
if left(hostname,instr(hostname,".")) <> "www." then bRedirect = true
' Note: this test looks for a default document name in the host name, not in the path.
if instr(lcase(hostname),"default.asp") > 0 or instr(lcase(hostname),"index.asp") > 0 or instr(lcase(hostname),"index.aspx") > 0 then bRedirect = true

if bRedirect then
    response.status = "301 Moved Permanently"
    response.addheader "Location", "http://" & MainDomain
    response.end
end if
%>
Does anyone see something that may be causing the www.mydomain.com//page.htm problem? Or does anyone have an idea how I can add a redirect that would send both www and non-www requests for //page.htm to www.mydomain.com/page.htm throughout the site?
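One way to think about the redirect being asked for is to normalize the host and the path together and redirect only when either one changes. The sketch below shows that logic in Python rather than ASP, so it is not a drop-in replacement for the code above. The function name `redirect_target` is made up for illustration, and whether the server actually hands the doubled slash to the script is an assumption that would need checking.

```python
import re

# Placeholder host name taken from the post.
CANONICAL_HOST = "www.mydomain.com"


def redirect_target(host, path):
    """Return the URL to 301 to, or None if the request is already canonical."""
    path = path or "/"
    # Collapse any run of slashes ("//page.htm") into a single slash.
    clean_path = re.sub(r"/{2,}", "/", path)
    if not clean_path.startswith("/"):
        clean_path = "/" + clean_path
    # No redirect when both the host and the path are already in canonical form.
    if host.lower() == CANONICAL_HOST and clean_path == path:
        return None
    return "http://" + CANONICAL_HOST + clean_path


print(redirect_target("mydomain.com", "//page.htm"))
print(redirect_target("www.mydomain.com", "/page.htm"))
```

Because the function returns None for an already canonical request, a page that calls it on every hit does not redirect to itself in a loop.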
[groups.google.com...]
Scanning this thread, I have noticed that all the affected sites seem to have a very large number of pages.
Is that so? Do all who were affected have sites with, say, more than 10,000 pages?
Personally I do not understand how it is possible to build a site with 100,000 pages that are *all* worth showing in the search results.
Of course there are large companies, but there are few of them and Google probably handles them manually.
What are the other large sites? Are they superstores? If so, their pages may be interesting only to local visitors. Does supplemental status depend on location?
Another example of a large site that comes to mind is a directory.
However, since the goal of a widget directory is to provide a better search for widgets, it makes sense for Google to leave just the main directory page in the index. Those who search for a specific sort of widget will generally be more satisfied when they find the site of the widget producer rather than the directory.
So is it possible that, in addition to the duplicate content filter, we now have a filter for sites with too many pages? Let's call it the directory filter.
Vadim.
Some older than 5 years, some newer than one year.
All had 301 redirect applied in September 2005.
No new supplemental listings for any of them that I can see, and about a 10-15% increase in Google traffic over the last 7 days.
Just my observations to help us get to the bottom of this.
Thank you. Once the list is compiled I will email you back the entire list of sites that have this problem and any common occurrences I find.
email to edbri871 (at) gmail.com
I'm not eager to get flamed for the zillionth time on the 4-byte ID problem, but if GoogleGuy wants to deny it once again, for the record, that would be fine with me.
I'm fine to deny this, because docids and their size have nothing at all to do with what people have been describing on this thread. I've been reading through the feedback, and it backs up the theory that I had before I asked for feedback.
Based on the specifics everyone has sent (thank you, by the way), I'm pretty sure what the issue is. I'll check with the crawl/indexing team to be sure though. Folks don't need to send any more emails unless they really want to. It may take a week or so to sort this out and be sure, but I do expect these pages to come back to the main index.
We do have a few 301s, primarily the non-www to the www version of the homepage.
So glad that GG showed up, and it made me regret not jumping on that SES gmail address earlier. So be it. I love hearing that the "panic" switch has been flipped back to "chill" for many of us already. Definitely encouraging news...
However, I wonder how the distribution panned out, and how this magical theory works that relegated some of us to the supplementals while others (all except one, it seems) are making out atop the rankings. What a pain.
Let's hope it doesn't take that long--I'm tired of optimizing for Yahoo in the same way I'm tired of completing the elementary Sudoku puzzles... ;)
What you guys are seeing is what Google has cached in its memory of your site: stuff you did recently alongside stuff you did way back when that conflicts with the present. How does it conflict? Well, in Google's opinion some past pages it is showing as supplemental are conflicting with newer content. Now you need to decide which page is more relevant to your current content and 404 the one that ain't.
My largest site has been hit very hard. On one DC I have 109,000 indexed pages and on the current DC I have only 126, with most pages coming up as supplemental results....
I too have many pages coming back with a very old cache of around a year ago...
This site is custom template built and completely original. I started checking copyscape just to see if I was having any duplicate problems and I found nothing....
Some quick facts about our site:
Google PR: 6
Theme: Newspaper/Media/Homepage
Size: 109,000 Pages (2% HTML / 90% php)
Year Created: 2002
This is pretty much killing us, but the fact of the matter is that when search engines become this important to your business, it's time to rethink your marketing plan. I'd say we are at a 60% loss of traffic, with our main referrers being MSN and Yahoo.
GG & MT, I would really appreciate an opportunity to somehow have one of you look at this site.
The upper echelon of the Webmaster community appears to be eager to please Google, and give them exactly what they want.
I have looked at
[google.com...]
that lists in a general way, what Google is looking for.
To maintain the obvious leadership position that Google now has in the Search Engine World, is it now time, like any great leader should, to paint out a map, of where Google would like the Webmaster community to head?
This will make everyone's work more efficient, and hopefully create a net full of meaningful content, well indexed and easily searched, where users can find what they want quickly and efficiently.
Please comment on whatever you know of Google's plans to create more road maps of what is wanted by Google.
Please feel free to forward to higher ups at Google, (Perhaps you are one?)
Humbly submitted by a Google User
dk
I do expect these pages to come back to the main index (GoogleGuy)
But will they regain their previous position in the main index?
I am afraid that the effect is related to the attempt to find the original source of the information and give it the boost (remember the site signature that requires the sitemap?). If that is really so, many directories will continue to have just one main page with a high SERP position, and only the sites with original content will regain their SERP positions in the main index.
Supplemental status may be an unexpected side effect of the directory filter, or rather a derivative content filter, after all. That is, maybe Google now tries to filter not only literally duplicate content but also derivative content that adds nothing new to the original.
Vadim.
About 2 hours ago my site went supplemental except for the homepage.
Whew...I almost had a panic attack when I saw my site go into supplemental. I am the first to say I am not an SEO expert, so I had to figure out what supplemental even meant! Though it sucks that we are all having the same problem, at least I know/hope it's not something I did.
All my pages (215K pages) except for the homepage are listed as supplemental.
I have 0 duplicate content from any other site (can't speak for people who scraped me). The only thing I may have is duplicate links pointing to the same page, as it's forum software, and various links, like last post, new post, etc., point to the same topic. I do not see other forums having these same problems though, so I do not think it's that. Also, if these types of links were causing the problem, you would think only those links would be dropped, not an entire site.
I have also seen pages dropping from the index altogether. Pages that I was highly ranked on (spyware removal guides) suddenly do not exist in any Google index.
Reading through the posts I see that some people are finding if they search at 64.233.161.105 they are in the dog house, but if they search at 64.233.167.104 they are fine. Same exact thing is happening to me.
Huge drop of traffic from google for me.
But the main reason why pages go supplemental is that they are not getting crawled - not necessarily that the pages are of poor quality, duplicates, etc. - and that has always been the case.
I put this to the test. I looked at two of the pages in the supp results and then searched my access logs for when Google hit them. Both showed a few recent hits.
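The log check described above can be sketched roughly as follows. This assumes the Apache combined log format and identifies Googlebot only by a substring in the user-agent string, which anyone can spoof. The function name `googlebot_hits`, the file name in the usage comment, and the sample path are made up for illustration.

```python
import re
from collections import Counter

# Assumed layout: Apache "combined" log format.
LOG_LINE = re.compile(
    r'^\S+ \S+ \S+ \[(?P<day>[^:]+):[^\]]+\] '
    r'"(?:GET|HEAD) (?P<path>\S+) [^"]*" \d{3} \S+ '
    r'"[^"]*" "(?P<agent>[^"]*)"'
)


def googlebot_hits(lines, paths):
    """Count Googlebot requests per (path, day) for the given set of paths."""
    hits = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if not match:
            continue  # skip lines that are not in the assumed format
        # Substring match on the user agent only; this does not verify the crawler.
        if "Googlebot" in match.group("agent") and match.group("path") in paths:
            hits[(match.group("path"), match.group("day"))] += 1
    return hits


# Hypothetical usage against a real log file:
# with open("access_log") as log:
#     for (path, day), count in sorted(googlebot_hits(log, {"/page.htm"}).items()):
#         print(day, path, count)
```

A page that shows regular crawler hits here but still sits in the supplemental results would argue against the "not getting crawled" explanation for that page.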
Hits it pretty much every day.
One thing that I have found interesting is that all the pages that are in supp limbo are older pages that maybe should be in supp, as they may no longer exist. I do not see any newer links. On the other hand, if I check via 64.233.167.10, I see all the results that are now supped, but also all the other results that are newer. My guess...a bug happened and they supped a bulk of pages that were unreachable and didn't reinclude the good ones like they did for everyone else.
Some more info in case others have similar characteristics as we piece this together:
Hope this helps figure it out.
[webmasterworld.com...]
Problems handling canonicals, redirects, etc. result in dropped pages and problems in indexing.
Anyway - thanks GG for looking into this for us.