|Lost Index Files|
Some people may not like this post, and criticism of it would not at all surprise me. However, I suggest that before reading it you ask yourself whether Google actually DESIRES the current fluctuating situation. Is it where they actually want to be? Is it what they want to project to webmasters and the public?
Against that background perhaps the following analysis and theory may fall into place more easily.
DATA ANALYSIS AND BACKGROUND
Last week I posted a message requesting members to sticky mail details of their own specific situations and sites with respect to the fluctuations.
After spending days analyzing, and watching the picture continue to change before my eyes, I eventually found a theory to hang my hat on. No doubt it will be challenged, but at least it currently fits the data bank I have (my own sites, plus a third party observation set I use, plus those that were submitted to me by the above).
Two general phenomena seem to be dominating the debate:
a) Index pages ranking lower than sub-pages for some sites on main keyword searches
b) Sites appearing much lower than they should on main keyword searches, yet ranking highly when the &filter=0 parameter is applied.
These problems are widespread and there is much confusion out there between the two (and some others).
The first has probably attracted most attention, no doubt because it is throwing up such obvious and glaring glitches in visible search returns (eg: contact pages appearing as the entry page to the site). The second is less visible to the searcher because it simply torpedoes the individual sites affected.
By the way, in case anyone is still unaware, the &filter=0 parameter reverses the filter which screens out duplicate content. Except the filter does more than that.... it is currently screening out many sites for no obvious reason (sites that are clearly clean and unique).
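For anyone who wants to compare the two result sets themselves, the parameter is simply appended to the normal search URL. A minimal sketch (the parameter name and behaviour are as described above; the exact URL format Google accepts may of course change, and the query string here is purely illustrative):

```python
from urllib.parse import urlencode

def google_search_url(query, unfiltered=False):
    """Build a Google search URL; filter=0 disables the result filter."""
    params = {"q": query}
    if unfiltered:
        params["filter"] = "0"
    return "http://www.google.com/search?" + urlencode(params)

# Compare the two result sets by hand in a browser:
print(google_search_url("blue widgets"))                   # normal, filtered results
print(google_search_url("blue widgets", unfiltered=True))  # same search with &filter=0
```

If a site ranks well on the second URL but is nowhere on the first, it is being caught by the filter rather than simply ranked down, which is exactly the symptom (b) describes.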
So why is all this happening? Is there a pattern, and is there a relationship between these two and the other problems?
Well at first I wrestled with all sorts of theories. Most were shot down because I could always find a site in the data set that didn't fit the particular proposition I had in mind. I checked the obvious stuff: onsite criteria, link patterns, WHOIS data... many affected sites were simply 'clean' on anyone's interpretation.
Throughout though, there was the one constant: none of the sites affected were old (eg: more than 2 years) or at least none had old LINK structures.
This seemed ridiculous. There would be no logic to Google treating newer sites in this manner and not older ones. It is hardly likely to check the date when crawling! But the above fact was still there.
I have been toying with all sorts of ideas to resolve it... and the only one that currently makes any sense is the following.
THE GOOGLE TWILIGHT ZONE
In addition to WebmasterWorld I read a number of search blogs and portals. On one of these (GoogleWatch) a guy called Daniel Brandt quotes GoogleGuy as stating: "That is, we wind down the crawl after fetching 2B+ URLs, and the URL in question might not have been in that set of documents".
Now, assuming that is true (and it's published on the website so I would imagine it isn't invented), or even partially true, all sorts of explanations emerge.
1) The 2BN+ Set
If you are in here, as most long standing and higher PR sites will be, it is likely to be business as usual. These sites will be treated as if they were crawled by the old GoogleBot DEEP crawler. They will be stable.
2) The Twilight Set
But what of the rest? It sounds like Google may only have partial data for these, because the crawlers 'wound down' before getting the full picture. Wouldn't THAT explain some of the above?
To answer this question we need to consider Google's crawling patterns. One assumes that they broadly crawl down from high PR sites. They could also crawl down from older sites, sites they know about and sites they know both exist and are stable. That too would make sense.
You can probably see where this is heading.
If your site or its link structure is relatively new, and/or say PR5 or below, you may well reside in the twilight zone. Google will not have all the data (or all the data AT ONCE) and you will be experiencing instability.
I have sites in my observation set that enter and exit both the problem sets above (a) and (b). It's as though Google is getting the requisite data for a period and then losing some of it again. As if the twilight zone is a temporary repository, perhaps populated and over-written by regular FreshBot data.
The data most affected by this is the link data (including anchor text) – it seems to retain the cache of the site itself and certain other data. This omission would also partially explain the predominance of sub-pages: with the loss of this link data there is nothing to support the index page above those sub-pages (Google is having to take each page on totally stand-alone value).
IS IT A PROBLEM?
I also wonder whether Google sees all of this as a problem. I certainly do. Problem (a) is clearly visible to the searching public. They DON'T want to be presented with the links page for example when they enter a site! That is a poor search experience.
Do they see (b) as a problem? Again, I do. Sites are being filtered out when they have no duplicate content. Something isn't right. Google is omitting some outstanding sites, which will be noticeable in some cases.
The combination of (a) and (b), and perhaps other less well publicized glitches, gives a clear impression of instability to anyone watching the SERPS closely (and that's a growing body of people). Together they are also disaffecting many webmasters who have slavishly followed Google's content-content-content philosophy. As I implied the other day, if following the Google content/link line gets them nowhere at all, they will seek other SEO avenues, which isn't good for Google in the long term.
WHY HAVE A TWILIGHT ZONE?
Some people speculate that there is a software flaw (the old 4 byte / 5 byte theory for URL IDs) and that consequently Google has a shortage of address space with which to store all the unique URL identifiers. Well... I guess that might explain why a temporary zone is appealing to Google. It could well be a device to get around that issue whilst it is being solved. Google though has denied this.
However, it may equally be a symptom of the algorithmic and crawler changes we have seen recently. Ditching the old DeepBot and trying to cover the web with FreshBot was a fundamental shift. It is possible that for the time being Google has given up the chase of trying to index the WHOLE web... or at least FULLY index it all at once. Possibly we are still in a transitional position, with FreshBot still evolving to fully take on DeepBot's responsibilities.
If the latter is correct, then the problems above may disappear as Freshbot cranks up its activity (certainly (a)). In the future the 'wind down' may occur after 3BN, and then 4BN.... problem solved... assuming the twilight zone theory is correct.
At present though those newer (eg: 12 months+) links may be subject to ‘news’ status, and require refreshing periodically to be taken account of. When they are not fresh, the target site will struggle, and will display symptoms like sub-pages ranking higher than the index page. When they are fresh, they will recover for a time.
Certainly evidence is mounting that we have a more temporary zone in play. Perhaps problem (b) is simply an overzealous filter (very overzealous indeed!). However, problem (a) and other issues suggest a range of instability that affects some sites and not others. Those affected all seem to have the right characteristics to support the theory: relatively new link structure and/or not high PR.
The question that many will no doubt ask is: if this is correct, how long will it last? Obviously I can't answer that. All I have put forward is a proposition based upon a reasonable amount of data and information.
I must admit, I do struggle to find any other explanation for what is currently happening. Brett’s ‘algo tweak’ suggestion just doesn’t stack up against the instability, the site selection for that instability, or the non-application to longer established sites.
The above theory addresses all those, but as ever…. if anyone has a better idea, which accounts for all the symptoms I have covered (and stands up against a volume of test data), I’m all ears. Maybe GoogleGuy wishes to comment and offer a guiding hand through these turbulent times.
|I'm curious, does this mean that you don't believe the answer above - or do you have a different interpretation of it? |
After reading Napoleon's paper, it appears to me that he, like me, supports GG's answer & your interpretation of it. :)
It's also apparent that Brett does not. Makes for interesting discussion IMO.
I still say for all the evidence I have seen this missing index page issue is just due to overall Google incompetence. Google is giving all around lousy SERPs. People are just noticing it more with index pages because those are the ones that for a lot of sites should be predictably doing well. With single topic sites where every other site links to the home page, that is the one that should rank the best with the appropriate search terms given PR and anchor text being algo factors. Thus, these index pages are the ones people are looking for to do well, and notice that they aren't. People just don't realize things are equally bad with rankings for internal pages.
Um, ok - you can poke fingers at Google all you want, (their mistake, etc) however from where I'm floating, it's never been easier to find it fast, find it relevant, and find it easily in their search results.
Especially if I'm shopping for something - gotta love the options they give the end user (me).
I've tried selling friends & family on other engines, but none (yet) compete with Google well enough to make me or people I know switch it up.
As long as they continue, doing what they do well, they'll keep their edge.
However, it could also be that I'm missing what the rest of the webmasters here are seeing - dropped pages, missing links, the wrong page ranking for "pet keyword here" etc.
Your theory sounds interesting but it doesn't fit with what I'm seeing with my site.
Site is PR6, getting about 4500 uniques/day before Dominic. Now getting between 2000 - 2500 uniques/day. (About half of the difference can be attributed to summer slowdown.) My site is almost 2 years old.
Since Dominic, I've been having trouble with:
"b) Sites appearing much lower than they should on main keyword searches"
(in my case, this is "subpages appearing much lower than usual on main keyword searches")
Have seen a little of "a) Index pages ranking lower than sub-pages" also. That doesn't bother me because most of my visitors enter via a subpage.
Also, some of my earliest backlinks are showing up and some of my earliest backlinks are missing. Some of my later backlinks are showing and some are missing. All of my backlinks for the past several months are missing.
I think that was an unnecessary post Jeremy.
You're entitled to think all is well with Google when it patently isn't, but the implication that you somehow work on a higher plane than anyone else is simply not needed. You may think that if you wish, but it helps no-one when you articulate it in a thread which has more analytical content (from many contributors) than most.
If you don't wish to analyze the current status, that's fine. But dipping in here with a Google-groveling post like that doesn't contribute one jot.
Could it be, jeremy, that you tend to do shopping searches quite a bit? Personally, I almost never do commercial searches. My comment about Google declining in quality is based strictly on informational searches. I should also probably qualify my previous comment: by lousy SERPs, I mean compared to what Google was like a few months back. Google isn't wretched like Wisenut now. They just aren't as good as Alltheweb is at the moment. The curious thing is that early in the Esmeralda dance, Google was showing some very good SERPs all around. Whatever they factored into the algo right at the end shot the quality to hell.
>> some of my earliest backlinks are showing up and some of my earliest backlinks are missing. Some of my later backlinks are showing and some are missing. All of my backlinks for the past several months are missing <<
There's a lot of factors in there Beth.
Firstly, there is always going to be a lot of fluctuation in the crawl itself. Even the old DeepBot didn't find all links. I'd be interested to hear what 'several months' constitutes for you as well.
One other point though... Google certainly HAS many of the missing links (if not all).... it's just that it seems to lack the ability to apply them all at the same time. It's the news/fresh phenomenon I mentioned above. It has them and then loses them again (from the repository) until it can refresh. That's what may explain the coming and going of sites to some degree.
I've recently started a process of checking for links on selected sites periodically as well, just to see how stable that part of Google is. The problem with this aspect is of course estimating a trust factor in the command and/or toolbar.
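One way to make that periodic link-checking systematic is to snapshot the set of backlinks Google reports each time and diff successive snapshots, so the "has them, then loses them" pattern shows up directly. A minimal sketch (the snapshot lists here are hand-gathered and the URLs purely illustrative; as noted above, the results are only as trustworthy as the link: command or toolbar that produced them):

```python
def diff_backlinks(previous, current):
    """Compare two snapshots of reported backlinks: what appeared, vanished, stayed."""
    prev, curr = set(previous), set(current)
    return {
        "gained": sorted(curr - prev),  # links Google now reports but didn't before
        "lost": sorted(prev - curr),    # links it reported before but has "lost"
        "stable": sorted(prev & curr),  # links present in both snapshots
    }

# Illustrative snapshots taken a few weeks apart
june = ["http://dir.example.com/widgets", "http://blog.example.org/review"]
july = ["http://blog.example.org/review", "http://news.example.net/story"]
print(diff_backlinks(june, july))
```

A site whose "lost" list keeps cycling back into "gained" on later snapshots would fit the temporary-repository idea; a site whose lists barely change would look like it sits in the stable 2B+ set.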
My apologies for what ya'll viewed as a 'flip' comment, building on what NFFC mentioned in his earlier post.
Most of the searches I do are not commercial, however, there is almost always two data sets for me to look at: Adwords, & the natural SERP's.
Didn't that recently concluded lawsuit with that Oklahoma company prove beyond a shadow of a doubt one very important thing that Google admitted to?
PageRank is now an opinion and NOT a mathematical algorithm. At least, in the embodiment of Google it is.
I've had sites "hand tweaked" into oblivion before, and after a reinclusion request, the site has been back for months BUT still doesn't 'rank' like it should. Imho, it still suffers from the hand adjusted opinion of low worth that Google has admitted they shell out to people.
Who would they target first in their arbitrage? SEO folks that use the toolbar, that they can track, with similar (or the same) whois data.
Sure. Great. The best engine in the world (imho) and they resort to this.
It sucks, but that, as they say, is life. My bad for offending ya'll, but if we are to do any analysis on something -> it has to be mathematical to be reverse engineered, and NOT a hand tweaked & filtered set of 'opinions' disguised as algorithmically delivered results.
Now that IS an interesting viewpoint Jeremy (and more like you). Jeez though.... I'm paranoid enough!
>> Maybe GoogleGuy wishes to comment and offer a guiding hand through these turbulent times. <<
Or maybe not. Worryingly quiet.
>a thread which has more analytical content
I think we are in dire need of a few facts;
random sampling - the process of selecting individuals from a population so that each member of the population has an equal chance of being included.
selection bias - the exclusion of certain members of a target population because of the way the selection process has been conducted.
I must admit I am astounded, nay gob smacked, by both the level of ignorance in this thread and also the tone towards certain members.
>> some of my earliest backlinks are showing up and some of my earliest backlinks are missing. Some of my later backlinks are showing and some are missing <<
I noted this on several sites. I noted that these backlinks do show up for other sites: the linking page is in the Google index.
Several backlinks do not show because the anchor text of the link has about the same keywords as the title of the page it links to.
>I must admit I am astounded, nay gob smacked, by both the level of ignorance in this thread and also the tone towards certain members.<
Administrators' input not excluded.
As a confession of my own ignorance, I don't know what, "nay gob smacked" means or how it relates to this thread or SEO.
Isn't it funny how two girls can look at the exact same thing and yet see something totally different? I have had the privilege of reading some of the most absurd, uninformed, over-reactive posts in this forum I have ever seen anywhere, and seen them supported by mods and admins. Then someone actually takes the time to LOOK at something, freely admitting it isn't scientific data, only for others to be "astounded" at the level of ignorance.
Personal ignorance already admitted, but it would seem to dumb little ole me that the mods and admins would ENCOURAGE members to look more and speak less instead of just smacking nay gobs.
I have been unable to provide any strong theory to support the exclusion of index pages by Google, so I have chosen to lurk rather than post. There is always a pattern in long threads. They start out with some pretty good discussion about an important point. After most of the leading theories have been spoken for, the info starts to decline in value.
My take on the missing index pages:
I'm pretty sure this will work itself out. You would expect to see the page totally gone from Google or PR zeroed if it was a penalty. I haven't found any pages totally gone. They just are totally unranked. I think deepfresh bot is still working out the kinks.
We just don't have a large enough sample of time to even produce a strong theory.
Also, since Brett and GoogleGuy have produced 2 seemingly conflicting views, it would be nice if one of them would come in here and clarify what they said. Or could this be some kind of cruel joke they are playing to see how we'll respond?
>As a confession of my own ignorance, I don't know what, "nay gob smacked" means
UK slang expression. It means to be completely and totally astonished by something. Shocking.
>Also, since Brett and GoogleGuy have produced 2 seemingly conflicting views, it would be nice if one of them would come in here and clarify what they said. Or could this be some kind of cruel joke they are playing to see how we'll respond?
I presume both were aware of what they said. Brett just doesn't believe what GG posted: it was inaccurate at best, and GoogleGuy dissembled at worst.
William Goldman wrote a book about his career as a screenwriter called "Adventures in the Screen Trade." In it, he wrote (over and over and over) that when it comes to filmmaking and making successful films:
"Nobody knows anything."
All the elements that worked in one place might end up failing in another. What we have here are many sites/pages ranking perfectly normally, as they have for a long time: absolutely zero effect on their main pages or any others. Then we have some perhaps random (perhaps not) domains having ludicrous results, with contact pages outranking content pages. What we do know is GoogleGuy has said this phenomenon is not an algo tweak. What is not hard to observe is that he is plainly right, because only a minority of sites are affected. What we also know, for a fact, is that the backlinks Google displays show a very large number of errors, even in its own Directory, where there should be no excuse for errors.
Nobody may know anything, but some parts of the puzzle are known. In spring, Google had failures in its crawling and link recording. What the ramifications of that are is just something we don't "know", but we sure should expect that there are ramifications. And dorkily ranking a minority of the pages on the web is one obvious possibility.
I don't "know" that they will fix it, but it is a pretty good assumption that they are trying.
|1. Similar <Title> <H1 Tags> <Internal Linking>? |
2. Filter of most optimized keyword?
3. Too much duplicate content?
4. Toolbar tracking?
5. Google Dance search tracking?
6. Non-underlined and/or colored hyperlink?
7. Too far ahead of 2nd place listing?
>What we also know, for a fact, is the backlinks that Google displays show a very large number of errors, even in its own Directory where there should be no excuse for errors.
Interesting. Can you cite specific examples?
Still doesn't explain why my index page has gone up from PR3 to PR6 and all inner pages have gone from PR3 to white bar.
Why don't you all just chill until it's all over and then discuss it in the full knowledge of all the info.
I'm not surprised that GG doesn't contribute to this particular bear pit.
<<What we also know, for a fact, is the backlinks that Google displays show a very large number of errors, even in its own Directory where there should be no excuse for errors. >>
It's funny how very few are considering just this. Google may very well be having difficulties with their "transition". It's not totally unexpected given the past couple of months. Maybe they felt it might be a problem and that is why they reverted back to an older, established index last month.
GoogleGuy said it was not an algo tweak to lower index pages but it may be a side effect of an algo tweak resulting in the changes.
Something I haven't seen before is going on. Different results for the allinanchor search on different datacenters. Even though the backlink totals are the same, the anchor search is quite different in some cases. Makes me think that they are either totally screwy or have some filtering that is still taking place.
|>> Maybe GoogleGuy wishes to comment and offer a guiding hand through these turbulent times. << |
Or maybe not. Worryingly quiet
Normally the case when updates start to go bad! I think this isn't an algo tweak, but a bodge repair job of the nightmare that occurred last month. I strongly agree with Napoleon's Twilight theory, as the worst affected sites have been those under e.g. 4-6 months old (except those with a PageRank of over 5).
|Why don't you all just chill until it's all over and then discuss it in the full knowledge of all the info. |
Been chillin as much as possible for going on three months now. I've always come here as a great place to get answers and I look forward to continuing to get answers. Of course, we haven't been getting straight answers for some time now. Won't be long before clients start asking some serious questions, I'd prefer to be ready to give as educated responses as possible :-)
In other words, lets keep exchanging information!
>Something I haven't seen before is going on. Different results for the allinanchor search on different datacenters. Even though the backlink totals are the same, the anchor search is quite different in some cases. Makes me think that they are either totally screwy or have some filtering that is still taking place.
Very interesting. I just figured out why one of my domains is doing so poorly on a specific search word. Google seems to have lost most of the anchor text for the home page of that site. With allinanchor: one of that site's internal pages comes up higher. Most of the links on this site are to the home page with that anchor text, so this is glaringly wrong. However, for other search terms allinanchor: seems reasonable for this site. I'll take this as evidence that Google is still seriously broken.
Yes, it is interesting. I have a page that ranks #1 allinanchor on 3 datacenters and #209 on the others. This isn't the typical dancing of SERPS, rather some strange discrepancy in the data.
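That kind of discrepancy is easy to quantify once you've collected the same allinanchor rank from each datacenter by hand. A small sketch (the datacenter labels and rank numbers below are illustrative, chosen to mirror the #1-vs-#209 example above):

```python
def rank_spread(ranks_by_dc):
    """Given {datacenter: rank} for one URL/query, report best, worst and spread."""
    ranks = ranks_by_dc.values()
    lo, hi = min(ranks), max(ranks)
    return {"best": lo, "worst": hi, "spread": hi - lo}

# Hand-collected ranks for one page on one allinanchor query (illustrative)
observed = {"dc-1": 1, "dc-2": 1, "dc-3": 1, "dc-4": 209, "dc-5": 209}
print(rank_spread(observed))
```

A spread of a few positions is ordinary dance noise; a spread of 200+ on identical backlink totals suggests the datacenters are working from genuinely different anchor data, which is the oddity being described here.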
My only advice is to put out as many sites/pages as possible. It makes it hard to be adversely affected by Google's whims.
Whilst I find some information within WW very interesting and informative, I must say I find it strange that most of the people here are arguing over conjecture and assumptions.
Well done to Napoleon for his input and all the hard work BUT it is as he says "Theory"
For my part (and my company's) I make the following observations.
1. Google may have had a few problems.
2. Google Guy has asked us to be patient.
3. GG advised us long ago to develop our web sites for the user with quality content.
4. Reading between the lines of his previous posts we have been advised by GG that once this update and maybe the next update have passed he expects things to settle down.
5. In the last 7 weeks I have gone from a banned site to a reasonable listing on SERP's for my main keyphrases (some even at no1) and hopefully with the ongoing work of quality links and more fresh pages I will be in the top 5 SERP's for all my main key phrases.
6. Most importantly after careful checking and re-checking, reconfiguring, increasing content and links and rectifying any mistakes to comply with "Google Guidelines" for ALL CLIENT sites, their rankings for ALL pages have gone up.
One has reached a level for a one word keyword which I never expected to attain.
7. Thanks to this "problem" update it has focused me into reviewing all client sites, which has lead to better listings and more correct traffic for my clients.
8. Google may have had a few problems over the last 7-8 weeks but I am confident all will settle down when this, and maybe the next update, is over.
REMEMBER: if it doesn't come out in the wash, Google will go the way of AV, and I am quite certain that, as a leader in the SE world, that is the last thing on their mind.
Until this update is over, less chatter and conjecture and more SEO, as advised by GG, might help some of the listings people here have been writing about.
|Firstly, there is always going to be a lot of fluctuation in the crawl itself. Even the old DeepBot didn't find all links. I'd be interested to hear what 'several months' constitutes for you as well. |
I'm comparing the links that were showing pre-Dominic to the links that were showing during Dominic and during this current update.
I'm not a big backlink watcher in the sense of keeping track of the numbers every single month - more like every few months. (Well, recently I've been paying a lot more attention.) But I do watch my logs like a hawk and keep track of when I get backlinks.
Of the 10 (yes, ten) off-site backlinks that are still showing in google for my index page, here are the approximate dates when they originally appeared on the referring site:
[uncertain about one of them, but it was long before 1/1/03]
So when I said "All of my backlinks for the past several months are missing," it looks like that means since Jan 30 of this year.
<<GG advised us long ago to develop our web sites for the user with quality content. >>
GoogleGuy doesn't pay the bills. Nor does he have any interest in advising people on how to obtain good rankings. There are thousands of tremendous, informative, well written pages that don't rank well.
Although I have not been adversely affected by the past 2 updates/backdates, I find it condescending when people make "develop good content for the users and all will be fine" type statements. Many here are having problems, and you are suggesting that they do not have quality content, which is often not the case.
I understand everyone here, as I am also suffering from the exact same "all the lights are on but nobody is home" twilight zone nose-bleed symptoms.
I can add this, and maybe someone can make sense of it. ~My opinion at bottom.~
1) I have a new page (PR 3) with strong incoming links to the (2nd level) main new service I offer. Links are strong and anchor text is just right. I am all over the first page results, positions 1-5, across the board; no reason to complain here. It never goes poof: it stays there with no flux.
2) My main company, with extremely hard to get to the first page phrases, is now on page 1 and page 2 for my primary targets. (PR 5) (85+ strong backlinks and anchor text.) What is so nose-bleedy is that I am getting major hits for my primary phrases, but then...... POOF, nothing: listings are off the top 40 pages of results (SERPS), and this lasts about 2-8 hours. (What's weird is that it consistently disappears for about 2-8 hours or so, +/- 1 hr.)
At the same time www2 - www3 - fi all show top listings but www does not as it is in the nose-bleed time zone.
Very odd - feel free to comment; I am very interested in what people think about this.
My opinion is this: something is off track, but there is a reason. I feel there is something going on, something new and something very unexpected. I think it will be good for most of us non-spammers, so I wait patiently, as if anyone can get this right, Google can.
In short, someone is doing something. (I coined this phrase at a past engineering position I had in a Quality Assurance dept at a LARGE top ten Forbes 50 corp.) I have learned that patience is key and that almost everything seems to go wrong almost all the time, so I think this is a glitch to some degree, but for a good reason, i.e. a new thing ready to go mainstream.
Looking for some good comments here...
|My only advice is to put out as many sites/pages as possible. It makes it hard to be adversely affected by Google whim's. |
Yep. The more pages, the merrier. My index, gone for 3 days, has been back since yesterday in 5 - 7 of the dc's, currently on www.com, might be gone again tomorrow. All my text-heavy subpages, the ones that actually bring in the traffic on a tremendous variety of kw's, have been doing fine throughout Esmeralda; I've been busy getting more text-heavy fieldnote pages up since this started, something that I was behind on getting done anyway, so yeah man, feed Google and all the SE's lots. But...
It sure seems to me that my index was getting hammered in the serps at first because of two directory links that have the URL www.mydomain.org/?theirdomain.com. They are my index page and they show in Google, although they are only easily found with &filter=0.
Currently, the #1 spot in Ink for my domain url kw's, the index, is www.mydomain.org/?theirdomain.com, same deal but hasn't disappeared at all.
Maybe Google fixed something, partially, to figure it out and un-bury my index kw's for the moment... I don't know. I've told the directory people that I need that URL gone (dynamically generated, is it? my pages get done by hand in WordPad, I have no idea), even if that means being taken out of the directories.
It sure isn't a filter or anything intentional, imho, it's just Google having problems.
|Why are some index pages missing? |
1. Google's Broke
2. Similar <Title> <H1 Tags> <Internal Linking>?
3. Filter of most optimized keyword?
4. Filter eliminating most used keyword on page?
5. Too much duplicate content?
6. Toolbar tracking?
7. Google Dance search tracking?
8. Non-underlined and/or colored hyperlink?
9. Too far ahead of 2nd place listing?
Could use some help here. Please feel free to sticky examples too. Thanks!