| This 233 message thread spans 8 pages: < < 233 ( 1 2 3 4 5 6  8 ) > > || |
|Supplemental club: Big Daddy coming - Part 1|
W'sup with google?
Carrying on from here:
A lot of members are seeing huge sites going supplemental. One of our main sites lost all rankings, 200,000+ pages disappeared, and now we are left with 19k useless results. This could be a goof or it could be a new round of penalties. If you have had your site reduced to the 'sup index, let's hear about it and compare notes.
Thank you for your post!
I sent you mail...
The docid we see in the url after cache: is a hash value and is not the same as what Scarecrow is referring to. He is talking about the internal unique binary value of each URL, starting at 0 and running up to some limit, either 2^32, 2^40, 2^64 or whatever.
Being one of the WebmasterWorld members that caused Scarecrow to put on his flame proof suit about a year ago, I would like to refrain from comments until this whole supplemental issue becomes a little bit clearer. But Matt Cutts' statement that the Big Daddy infrastructure is primarily there to solve canonicalization problems does not contradict a merge from several separate 32-bit index systems to one large index.
|Well, the DocID appears in the URLs for the cache links, so do we look there for a longer string, or are they going to try to "hide" it? |
|I just looked up the DOCID on new and old datacentres for an indexed page. They are the same. |
The docID I'm talking about is defined in The Anatomy of a Large-Scale Hypertextual Web Search Engine [www-db.stanford.edu]. It was originally 4 bytes. The ID in the URL is NOT the docID. That URL ID is about 12 bytes. It is some sort of look-up number. It has to be URL-compatible (7-bit ASCII) because it is used in the URL. If you tried to put the docID that is used internally as a binary number into a URL directly, the URL would crash. It has to be converted to URL-acceptable characters. For all I know, maybe the docID is contained somewhere within it. Maybe the rest of it is additional locator information for speedier access.
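To illustrate the point about converting a binary number into URL-acceptable characters, here is a minimal sketch. It assumes a 4-byte unsigned docID and uses URL-safe base64 purely as an example encoding; the function names are made up for illustration, and nothing here is known to be the scheme Google actually uses for its cache IDs.

```python
import base64
import struct


def encode_docid(docid: int) -> str:
    """Pack a 32-bit docID into 4 bytes and encode it with URL-safe characters.

    Illustrative only: the real cache ID format is not public.
    """
    raw = struct.pack(">I", docid)            # 4 raw bytes, not safe to paste into a URL
    token = base64.urlsafe_b64encode(raw)     # only A-Z, a-z, 0-9, '-', '_' and '=' padding
    return token.rstrip(b"=").decode("ascii")


def decode_docid(token: str) -> int:
    """Reverse encode_docid: restore the padding, decode, and unpack the integer."""
    padded = token + "=" * (-len(token) % 4)
    return struct.unpack(">I", base64.urlsafe_b64decode(padded))[0]


if __name__ == "__main__":
    for docid in (0, 12345, 2**32 - 1):
        token = encode_docid(docid)
        print(docid, "->", token, "->", decode_docid(token))
```

The encoded token is longer than the raw 4 bytes, which is consistent with the idea that whatever appears in a URL is a converted or look-up form rather than the internal binary value itself.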
The docID is 32 bits or 4 bytes, or at least it was originally. This gives you a maximum of 4.29 billion counts before you run out of unique combinations and roll over.
Best estimates are that on the average, each docID is used twice per word per page. That's because they have two inverted indexes. One is "fancy" and the other is "plain."
The average number of words per web page is 300. Here are the space requirements for the docID if we assume 4 bytes, 12 bytes, and 20 bytes, for 4 billion web pages:
4 bytes: 300 * 4 billion * 8 = 9.6 x 10^12 bytes (about 10 terabytes)
12 bytes: 300 * 4 billion * 24 = 2.88 x 10^13 bytes (about 29 terabytes)
20 bytes: 300 * 4 billion * 40 = 4.8 x 10^13 bytes (about 48 terabytes)
If you were designing a search engine, how many bytes would you choose for your docID? Obviously, you'd go with the minimum number of bits you think you will ever need.
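The arithmetic above can be checked with a short script. This is only a sketch of the estimate as stated in the post: 300 words per page, 4 billion pages, and each docID stored twice per word (once for each inverted index). The constant and function names are made up for illustration, and terabytes are taken as 10^12 bytes.

```python
WORDS_PER_PAGE = 300            # average words per page, as estimated in the post
PAGES = 4_000_000_000           # 4 billion web pages
USES_PER_WORD = 2               # one entry in the "fancy" index, one in the "plain" index

MAX_32BIT_IDS = 2 ** 32         # unique values available to a 4-byte docID


def docid_storage_bytes(docid_bytes: int) -> int:
    """Total bytes spent on docIDs alone for a given docID width."""
    return WORDS_PER_PAGE * PAGES * USES_PER_WORD * docid_bytes


if __name__ == "__main__":
    print(f"4-byte docID limit: {MAX_32BIT_IDS:,} documents")
    for width in (4, 12, 20):
        total = docid_storage_bytes(width)
        print(f"{width:>2}-byte docID: {total:.2e} bytes (~{total / 1e12:.1f} TB)")
```

The cost grows linearly with the width of the docID, which is the reason for picking the smallest width you think you will ever need.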
gee, why am I not surprised that googleguy suddenly reappears for a brief posting?
Anyway, if scarecrow isn't all the way right, he's enough right so that the details don't matter. Why? because this explains basically everything google has done in the last 3 years, all the problems, the weirdnesses, updates that aren't updates, sandboxes that aren't sandboxes, and so on:
|It would make a lot of sense, if you are a Google engineer figuring out what to do back in 2003, to stall on the docID problem until you can migrate to 64-bit processors. For one thing, Google got a lot richer and 64-bit processors got a lot cheaper at the same time. For another, there's a new trend toward more processing power per watt, and Google's huge electric bills are a source of concern to them. |
If you actually are interested in having a somewhat long term understanding of what's going on, this is about as clear as it gets. And if you want to understand supplementals and all that stuff, you really don't need to go much further than this. Personally I stopped worrying about supplementals about the time I first heard about them, but I guess they interest some people enough to make it a topic worth continuing.
For those of you who don't follow such things, performance per watt is not a minor topic in very large datacenter design and implementation. Especially not for data centers like google runs.
<added>anyway, just saw scarecrow posted again, personally I'm not concerned with the details since I can't know them, unless googleguy wants to actually say something more revealing than he's allowed to say. But the basic idea is simple: a system designed for 32 bits isn't going to just switch overnight to 64 bit stuff, it's hard to do that, lots of work. And datacenters aren't just going to switch overnight to 64 bit machines. If you want to know how long it takes to do that, just look at the time from the first appearance of big daddy until it has spread through all of google's networks.
Like scarecrow, I've lost all interest in arguing this stuff, it was obvious then, and now it's a fact.
I have to say though, this fits EXACTLY with what I thought google was doing for the last 6-8 months, including bourbon and jagger.
Are you saying that pages need to be first purged from the old 32 bit index in order to be assigned a new doc id in the 64 bit architecture?
Just trying to figure out why pages would be deleted.
Just try to figure out why Google does not delete 404 pages from its databases. Did anybody ask at SES?
|Are you saying that pages need to be first purged from the old 32 bit index in order to be assigned a new doc id in the 64 bit architecture? |
I wouldn't know. The only thing I can suggest is that if there is a major shift in infrastructure underway, there will be some churn. The best we can hope for is some evidence that the shift is rational in the way it progresses.
If I were a Google engineer I might start the migration with certain top-level domains: gov, edu, org. These are more manageable because they are many times smaller than the dot-coms. Also, the sort of people who normally aren't heard from in the press when it comes to Google quality-control, might suddenly start noticing if gov, edu, and org get turned upside-down. I would start looking for patterns about what sort of sites are affected.
I have a dot-org that has been stable for three years on my end, but has been like a roller-coaster for the last three years in terms of fully-indexed pages vs. URL-only listings. The ratio has gone from 3 to 1, to 1 to 3, to 2 to 1, to 1 to 2, and then back again, for the indexed pages compared to the URL-only pages. In the meantime, nothing important was changed on my end. The site has 130,000 pages.
A couple weeks ago, all of my URL-only pages disappeared completely. My Google referrals are up only slightly so far, but those URL-only pages are gone every time I check. It's very stable. That's good news for me, because those URL-only pages never drew any traffic.
Background: on Big Daddy (BD) my home page is good and all other pages are supplemental. The cache recently changed from June - Aug dates to a hash code like MMNJLTot6sYJ:www.mydomain.com. Default Google is good.
Digging into my supplementals I found urls like www.mydomain.com//page.htm.
I have some code at the top of each page that redirects non-www to www. I checked that with a couple of header status checkers and it works. But I now have a problem with www.mydomain.com//page.htm showing up in Google's index instead of www.mydomain.com/page.htm.
I am not an ASP programmer, so I am a little concerned that maybe the non-www to www redirect code I was provided with is causing the problem.
Here is the code
' Read the requested host name and path from the server variables
hostname = request.servervariables("HTTP_HOST")
pathinfo = request.servervariables("PATH_INFO")
MainDomain = "www.mydomain.com"
if pathinfo <> "" then
  if instr(lcase(pathinfo),"default.asp") > 0 or instr(lcase(pathinfo),"index.asp") > 0 or instr(lcase(pathinfo),"index.aspx") > 0 then
    ' Default document requested: keep only the directory part of the path
    MainDomain = MainDomain & "/" & mid(pathinfo,2,instrrev(pathinfo,"/")-1)
  else
    MainDomain = MainDomain & pathinfo
  end if
else
  MainDomain = MainDomain & "/"
end if
' Redirect when the host name does not start with "www."
if left(hostname,instr(hostname,".")) <> "www." then bRedirect = true
' As posted: also looks for a default document name in the host name
if instr(lcase(hostname),"default.asp") > 0 or instr(lcase(hostname),"index.asp") > 0 or instr(lcase(hostname),"index.aspx") > 0 then bRedirect = true
if bRedirect then
  response.status = "301 Moved Permanently"
  response.addheader "Location", "http://" & MainDomain
end if
Anyone see something that may be causing the www.mydomain.com//page.htm problem? Or have an idea how I can add a redirect that would send both www and non-www requests for www.mydomain.com//page.htm to www.mydomain.com/page.htm throughout the site?
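As a language-neutral illustration of the redirect being asked for, here is a minimal sketch in Python rather than ASP. It computes the single canonical URL for any request: www host, runs of slashes collapsed, default document names stripped. The function names are hypothetical, `www.mydomain.com` is the placeholder domain from the post, and the list of default documents is taken from the posted code; how this would be wired into a particular server is left open.

```python
import re
from urllib.parse import urlsplit, urlunsplit

CANONICAL_HOST = "www.mydomain.com"   # placeholder domain from the post

# Default document names taken from the posted ASP code
DEFAULT_DOC = re.compile(r"/(?:default\.asp|index\.aspx?)$", re.IGNORECASE)


def canonical_url(url: str) -> str:
    """Return the one URL that every variant of this page should resolve to."""
    parts = urlsplit(url)
    path = re.sub(r"/{2,}", "/", parts.path) or "/"   # "//page.htm" becomes "/page.htm"
    path = DEFAULT_DOC.sub("/", path)                  # "/dir/default.asp" becomes "/dir/"
    return urlunsplit(("http", CANONICAL_HOST, path, parts.query, ""))


def redirect_target(url: str):
    """Return the URL to send a 301 to, or None if the request is already canonical."""
    target = canonical_url(url)
    return None if target == url else target


if __name__ == "__main__":
    for requested in (
        "http://www.mydomain.com//page.htm",
        "http://mydomain.com/page.htm",
        "http://www.mydomain.com/page.htm",
    ):
        print(requested, "->", redirect_target(requested))
```

Comparing the canonical form with the requested URL, instead of testing each condition separately, means one 301 covers the double slash, the missing www, and the default document name at once, and an already canonical request is never redirected.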
Are you using Google Sitemaps, and what are the cache dates on your double-slashed URLs?
Here's a thread that discusses this issue - no resolution, but several webmasters point to an early problem with the Google Sitemaps bot that was actually creating the problem.
Welcome to 100,000 pages club?
Scanning this thread I have noticed that all the affected sites seem to have a very large number of pages.
Is that so? Does everyone who was affected have a site with, say, more than 10,000 pages?
Personally I do not understand how it is possible to build a site with 100,000 pages that are *all* worth showing in the search results.
Of course there are large companies, but there are few of them and Google probably treats them manually.
What are the other large sites? Are they superstores? If so, their pages may be interesting only to local visitors. Does supplemental status depend on the location?
Another example of a large site that comes to my mind is a directory.
However, since the goal of a widget directory is to provide a better search for widgets, it makes sense for Google to leave just one main directory page in the index. Those who search for a specific sort of widget will generally be more satisfied when they find the site of the widget producer and not the directory.
So is it possible that in addition to the duplicate content filter we now have a filter for sites with too many pages? Let's call it the directory filter.
No effects for me on any datacenters. All of my sites (about 10) are between 100-1000 pages.
Some older than 5 years, some newer than one year.
All had 301 redirect applied in September 2005.
No new supplemental listings for any of them that I can see, and about 10-15% increase in google traffic over the last 7 days.
Just my observations to help us get to the bottom of this.
It is a shame 301s aren't affected more in this DC update. In my experience most 301s are carried out because the page in question has gained a result through black hat SEO. As soon as they get the result they do a 301 to a "clean page". It's bull#*$!.
I think unless it's a run-of-site 301, all 301s should be ignored.
One of our sites that is affected has around 1,500 pages of handmade custom content. It has been online for about 5 years, with a PR7 and stable traffic for years, and nothing fancy SEO-wise. Basically the entire site is supplemental besides the homepage, so no, this isn't only affecting 100k-plus page sites.
I am trying to find a link between the sites which have been placed in supplemental listings. If your page is supplemental, please email me the URL as well as the age of the site.
Thank you. Once the list is compiled I will email you back the entire list of sites that have this problem and any common occurrences I find.
email to edbri871 (at) gmail.com
Wow, Scarecrow is around too. It's like old times. :)
|I'm not eager to get flamed for the zillionth time on the 4-byte ID problem, but if GoogleGuy wants to deny it once again, for the record, that would be fine with me. |
I'm fine to deny this, because docids and their size have nothing at all to do with what people have been describing on this thread. I've been reading through the feedback, and it backs up the theory that I had before I asked for feedback.
Based on the specifics everyone has sent (thank you, by the way), I'm pretty sure what the issue is. I'll check with the crawl/indexing team to be sure though. Folks don't need to send any more emails unless they really want to. It may take a week or so to sort this out and be sure, but I do expect these pages to come back to the main index.
One of my sites was smacked into the supplementals as well, but we do NOT have thousands of pages. We have only about 150 pages.
We do have a few 301s, primarily the non-www to the www version of the homepage.
So glad that GG showed up, and it made me regret not jumping on that SES gmail address earlier. So be it. I love hearing that the "panic" switch has been flipped back to "chill" for many of us already. Definitely encouraging news...
However, I wonder how the distribution panned out, and how this magical theory works that relegated some of us to the supplementals while others (all except one, it seems) are making out atop the rankings. What a pain.
Let's hope it doesn't take that long--I'm tired of optimizing for Yahoo in the same way I'm tired of completing the elementary Sudoku puzzles... ;)
OK google is basically good. I just wish it was better than good.
What you guys are seeing is what google has cached in its memory of your site: stuff you did now, alongside stuff you did way back when that conflicts with the present. How does it conflict? Well, in google's opinion some past pages it is showing as supplemental conflict with newer content. Now you need to make a decision about which page is more relevant to your current content and 404 the one that ain't.
Just relax, sit back and think before you 404 anything
I stand by what I have said.
No, in fact I won't just stand by what I said. There are people that will hijack your orphaned pages that you have long forgotten about to sell porn and drugs and loads of other stuff, so manage your content, people.
as I said before just calm down, mate
I just want to throw something in here for the record.... I don't have anything else to really bring to the table but I'll lay out what I see:
My largest site has been hit very hard. On one DC I have 109,000 indexed pages and on the current DC I have only 126, with most pages coming up as supplemental results....
I too have many pages coming back with a very old cache of around a year ago...
This site is custom template built and completely original. I started checking copyscape just to see if I was having any duplicate problems and I found nothing....
Some quick facts about our site:
Google PR: 6
Size: 109,000 Pages (2% HTML / 90% php)
Year Created: 2002
This is pretty much killing us, but the fact of the matter is that when search engines become this important to your business it's time to rethink your marketing plan. I'd say we are at a 60% loss of traffic, with our main referrers being MSN and Yahoo.
GG & MT, I would really appreciate an opportunity to somehow have one of you look at this site.
GG I'd love an answer to this if possible - is Google letting spam reports go by the wayside while the changes are happening?
I'm seeing duplicate content on different domains by the truckload in my area of activity and despite many spam reports nothing happens.
Will Google be cracking down on duplicate content again soon?
I completely redesigned one of my sites last year. I just checked and it appears that the only Supplemental Results are the old pages that went away after the redesign.
All of my new stuff seems to be there.
I have a question for Google Guy that expands even bigger than this thread.
The upper echelon of the Webmaster community appears to be eager to please Google, and give them exactly what they want.
I have looked at
that lists in a general way, what Google is looking for.
To maintain the obvious leadership position that Google now has in the Search Engine World, is it now time, as any great leader should, to lay out a map of where Google would like the Webmaster community to head?
This will make everyone's work more efficient, and hopefully create a net full of meaningful content, well indexed, and easily searched by users who can find what they want quickly and efficiently.
Please comment on whatever you know of Google's plans to create more road maps of what is wanted by Google.
Please feel free to forward to higher ups at Google, (Perhaps you are one?)
Humbly submitted by a Google User
|I do expect these pages to come back to the main index (GoogleGuy) |
But will they regain their previous position in the main index?
I am afraid that the effect is related to the attempt to find the original source of the information and give it a boost (remember the site signature that requires the sitemap?). If it is really so, many directories will continue to have just one main page with a high SERP position, and only the sites with the original content will regain their SERP positions in the main index.
Supplemental status may be an unexpected side effect of the directory filter, or rather a derivative content filter, after all. That is, maybe Google now tries to filter not only literally duplicate content but also derivative content that adds nothing new to the original.
|About 2 hours ago my site went supplemental except for the homepage. |
Whew...I almost had a panic attack when I saw my site go into supplemental. I am the first to say I am not an SEO expert, so I had to figure out what supplemental even meant! Though it sucks that we are all having the same problem, at least I know/hope it's not something I did.
All my pages (215K pages) except for the homepage are listed as supplemental.
I have 0 duplicate content from any other site (can't speak for people who scraped me). The only thing I may have is duplicate links pointing to the same page, as it's forum software, and various links, like last post, new post, etc., point to the same topic. I do not see other forums having these same problems though, so I do not think it's that. Also, if it was these types of links causing the problem, you would think only those links would be dropped, not an entire site.
I have also seen pages dropping from the index altogether. Pages that I was highly ranked on (spyware removal guides) suddenly do not exist in any google index.
Reading through the posts I see that some people are finding if they search at 184.108.40.206 they are in the dog house, but if they search at 220.127.116.11 they are fine. Same exact thing is happening to me.
Huge drop of traffic from google for me.
|But the main reason why pages go supplemental is that they are not getting crawled - not necessarily that the pages are of a poor quality, duplicates etc - always has been the case. |
I put this to the test. I looked at two of the pages in the supp results and then searched for that in my access_logs for when google hit it. Both showed a few recent hits in my access logs.
Google hits it pretty much every day.
One thing that I have found interesting is that all the pages that are in supp limbo are older pages that maybe should be in supp, as they may no longer exist. I do not see any newer links. On the other hand, if I check via 18.104.22.168, I see all the results that are now supped, but also all the other results that are newer. My guess...a bug happened and they supped a bulk of pages that were unreachable and didn't reinclude the good ones like they did for everyone else.
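For anyone who wants to repeat this test, here is a minimal sketch of the access log check. It assumes logs in the common Apache "combined" format and identifies Googlebot by its user agent string alone, which can be spoofed; the function name, the sample lines, and the paths are made up for illustration.

```python
import re

# One line of Apache "combined" log format (assumed; adjust for your server)
LOG_LINE = re.compile(
    r'\S+ \S+ \S+ \[(?P<time>[^\]]+)\] "(?:GET|HEAD) (?P<path>\S+)[^"]*" '
    r'(?P<status>\d{3}) \S+ "[^"]*" "(?P<agent>[^"]*)"'
)


def googlebot_hits(lines, path):
    """Return (timestamp, status) for each Googlebot request to the given path."""
    hits = []
    for line in lines:
        m = LOG_LINE.match(line)
        if m and m.group("path") == path and "Googlebot" in m.group("agent"):
            hits.append((m.group("time"), int(m.group("status"))))
    return hits


if __name__ == "__main__":
    # Made-up sample lines; in practice pass open("access_log") instead
    sample = [
        '192.0.2.10 - - [10/Feb/2006:06:25:14 -0500] "GET /guide.htm HTTP/1.1" '
        '200 5120 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
        '192.0.2.55 - - [10/Feb/2006:07:01:02 -0500] "GET /guide.htm HTTP/1.1" '
        '200 5120 "-" "Mozilla/4.0 (compatible; MSIE 6.0)"',
    ]
    for when, status in googlebot_hits(sample, "/guide.htm"):
        print(when, status)
```

Looking at the status codes as well as the dates is useful here: a page that Googlebot fetches daily with a 200 is being crawled, so lack of crawling would not explain its supplemental status.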
Some more info in case others have similar characteristics as we piece this together:
Use adsense on almost all pages
All original content - some forums topics could be duplicates..not 100% sure
Some pages use relative urls (mostly in the forum which I have not hacked apart too much)
No blackhat SEO as far as I know
Was not 301ing a lot of my older links that stayed in the index, so may have had duplicates (starting to fix those now). Was not 301ing domain.com to www.domain.com (am now).
Hope this helps figure it out.
Thanks a million Googleguy for getting back to us on this.
I wonder if this is similar to what happened back here:-
Problems handling canonicals, redirects etc results in dropped pages/problems in indexing.
Anyway - thanks GG for looking into this for us.
All the sites I am seeing going sup are large forums which use bb scripts which produce the same page in a variety of forms - dup content. If anybody has specific examples that are not forums and do not involve dup content, I would appreciate a private message so I can see for myself.
my site is not a forum and there is no dup. 220,000 pages.