|Lost Index Files|
Some people may not like this post, and criticism of it would not at all surprise me. However, I suggest that before reading it you ask yourself whether Google actually DESIRES the current fluctuating situation. Is it where they actually want to be? Is it what they want to project to webmasters and the public?
Against that background perhaps the following analysis and theory may fall into place more easily.
DATA ANALYSIS AND BACKGROUND
Last week I posted a message requesting members to sticky mail details of their own specific situations and sites with respect to the fluctuations.
After spending days analyzing, and watching the picture continue to change before my eyes, I eventually found a theory to hang my hat on. No doubt it will be challenged, but at least it currently fits the data bank I have (my own sites, plus a third party observation set I use, plus those that were submitted to me by the above).
Two general phenomena seem to be dominating the debate:
a) Index pages ranking lower than sub-pages for some sites on main keyword searches
b) Sites appearing much lower than they should on main keyword searches, yet ranking highly when the &filter=0 parameter is applied.
These problems are widespread and there is much confusion out there between the two (and some others).
The first has probably attracted most attention, no doubt because it is throwing up such obvious and glaring glitches in visible search returns (eg: contact pages appearing as the entry page to the site). The second is less visible to the searcher because it simply torpedoes the individual sites affected.
By the way, in case anyone is still unaware, the &filter=0 parameter reverses the filter which screens out duplicate content. Except it does more than that.... it is currently screening out many sites for no obvious reason (sites that are clearly clean and unique).
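For anyone who wants to compare the two views side by side, here is a minimal sketch that builds the same search with and without the parameter. The helper name google_search_url and the base URL are only illustrative assumptions; the one piece taken from the discussion is the filter=0 parameter itself.

```python
from urllib.parse import urlencode

# Assumed base URL for illustration only.
BASE_URL = "https://www.google.com/search"


def google_search_url(query, unfiltered=False):
    """Build a search URL; unfiltered=True appends filter=0."""
    params = {"q": query}
    if unfiltered:
        # filter=0 switches off the duplicate-content screening.
        params["filter"] = "0"
    return BASE_URL + "?" + urlencode(params)


print(google_search_url("car widgets"))
print(google_search_url("car widgets", unfiltered=True))
```

If a site sits far down the first result set but ranks highly in the second, it is showing the symptom described in (b).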
So why is all this happening? Is there a pattern, and is there a relationship between these two and the other problems?
Well at first I wrestled with all sorts of theories. Most were shot down because I could always find a site in the data set that didn't fit the particular proposition I had in mind. I checked the obvious stuff: onsite criteria, link patterns, WHOIS data... many affected sites were simply 'clean' on anyone's interpretation.
Throughout though, there was the one constant: none of the sites affected were old (eg: more than 2 years) or at least none had old LINK structures.
This seemed ridiculous. There would be no logic to Google treating newer sites in this manner and not older ones. It is hardly likely to check the date when crawling! But the above fact was still there.
I have been toying with all sorts of ideas to resolve it... and the only one that currently makes any sense is the following.
THE GOOGLE TWILIGHT ZONE
In addition to WebmasterWorld I read a number of search blogs and portals. On one of these (GoogleWatch) a guy called Daniel Brandt quotes GoogleGuy as stating: "That is, we wind down the crawl after fetching 2B+ URLs, and the URL in question might not have been in that set of documents".
Now, assuming that is true (and it's published on the website so I would imagine it isn't invented), or even partially true, all sorts of explanations emerge.
1) The 2BN+ Set
If you are in here, as most long standing and higher PR sites will be, it is likely to be business as usual. These sites will be treated as if they were crawled by the old GoogleBot DEEP crawler. They will be stable.
2) The Twilight Set
But what of the rest? It sounds like Google may only have partial data for these, because the crawlers 'wound down' before getting the full picture. Wouldn't THAT explain some of the above?
To answer this question we need to consider Google's crawling patterns. One assumes that they broadly crawl down from high PR sites. They could also crawl down from older sites, sites they know about and sites they know both exist and are stable. That too would make sense.
You can probably see where this is heading.
If your site or its link structure is relatively new, and/or say PR5 or below, you may well reside in the twilight zone. Google will not have all the data (or all the data AT ONCE) and you will be experiencing instability.
I have sites in my observation set that enter and exit both the problem sets above (a) and (b). It's as though Google is getting the requisite data for a period and then losing some of it again. As if the twilight zone is a temporary repository, perhaps populated and over-written by regular FreshBot data.
The data most affected by this is the link data (including anchor text); it seems to retain the cache of the site itself and certain other data. This omission would also partially explain the predominance of sub-pages, as with the loss of this link data there is nothing to support the index above those sub-pages (Google is having to take each page on its stand-alone value).
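To make the proposition concrete, here is a toy sketch of a crawl that works down from the highest-priority sites and winds down once a fixed budget of URLs has been fetched. Everything in it is an assumption for illustration: the function name crawl_with_budget, the example hostnames, and the use of a single priority score as a stand-in for PR. It says nothing about how Google actually schedules its crawl.

```python
import heapq


def crawl_with_budget(pages, budget):
    """Fetch highest-priority pages first and stop when the budget is spent.

    pages:  dict mapping URL -> priority score (hypothetical stand-in for PR)
    budget: maximum number of URLs fetched before the crawl winds down
    Returns (crawled, missed) as two lists of URLs.
    """
    # heapq is a min-heap, so negate the score to pop the highest first.
    queue = [(-score, url) for url, score in pages.items()]
    heapq.heapify(queue)

    crawled = []
    while queue and len(crawled) < budget:
        _, url = heapq.heappop(queue)
        crawled.append(url)

    # Whatever is still queued never made it into this crawl.
    missed = sorted(url for _, url in queue)
    return crawled, missed


pages = {
    "old-pr7.example": 7,
    "old-pr6.example": 6,
    "new-pr5.example": 5,
    "new-pr4.example": 4,
}
crawled, missed = crawl_with_budget(pages, budget=2)
print("crawled:", crawled)
print("missed: ", missed)
```

With a budget of 2 the two lower-scored sites are left out entirely, which is the 'twilight' situation described above. Raising the budget pulls them back in.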
IS IT A PROBLEM?
I also wonder whether Google sees all of this as a problem. I certainly do. Problem (a) is clearly visible to the searching public. They DON'T want to be presented with the links page for example when they enter a site! That is a poor search experience.
Do they see (b) as a problem? Again, I do. Sites are being filtered out when they have no duplicate content. Something isn't right. Google is omitting some outstanding sites, which will be noticeable in some cases.
The combination of (a) and (b) and perhaps other less well publicized glitches gives a clear impression of instability to anyone watching the SERPS closely (and that's a growing body of people). Together they are also disaffecting many webmasters who have slavishly followed their content-content-content philosophy. As I implied the other day, if following the Google content/link line gets them nowhere at all, they will seek other SEO avenues, which isn't good for Google in the long term.
WHY HAVE A TWILIGHT ZONE?
Some people speculate that there is a software flaw (the old 4 byte / 5 byte theory for URL IDs) and that consequently Google has a shortage of address space with which to store all the unique URL identifiers. Well... I guess that might explain why a temporary zone is appealing to Google. It could well be a device to get around that issue whilst it is being solved. Google though has denied this.
However, it may equally be a symptom of the algorithmic and crawler changes we have seen recently. Ditching the old DeepBot and trying to cover the web with FreshBot was a fundamental shift. It is possible that for the time being Google has given up the chase of trying to index the WHOLE web... or at least FULLY index it at once. Possibly we are still in a transit position, with FreshBot still evolving to fully take on DeepBot responsibility.
If the latter is correct, then the problems above may disappear as Freshbot cranks up its activity (certainly (a)). In the future the 'wind down' may occur after 3BN, and then 4BN.... problem solved... assuming the twilight zone theory is correct.
At present though those newer (eg: 12 months+) links may be subject to ‘news’ status, and require refreshing periodically to be taken account of. When they are not fresh, the target site will struggle, and will display symptoms like sub-pages ranking higher than the index page. When they are fresh, they will recover for a time.
Certainly evidence is mounting that we have a more temporary zone in play. Perhaps problem (b) is simply an overzealous filter (very overzealous indeed!). However, problem (a) and other issues suggest a range of instability that affects some sites and not others. Those affected all seem to have the right characteristics to support the theory: relatively new link structure and/or not high PR.
The question many will no doubt ask is: if this is correct, how long will it last? Obviously I can’t answer that. All I have put forward is a proposition based upon a reasonable amount of data and information.
I must admit, I do struggle to find any other explanation for what is currently happening. Brett’s ‘algo tweak’ suggestion just doesn’t stack up against the instability, the site selection for that instability, or the non-application to longer established sites.
The above theory addresses all those, but as ever…. if anyone has a better idea, which accounts for all the symptoms I have covered (and stands up against a volume of test data), I’m all ears. Maybe GoogleGuy wishes to comment and offer a guiding hand through these turbulent times.
We can all throw up theories and try to figure out what's going on, but what if Google is actually trying to index all live webpages on the internet, have them stored on all datacentres, and then decide to apply the filters to block out spammy sites and so on?
You may ask then, what's the point of the algo? Well, it certainly has taken this update long enough to restructure its index, so that coincides with the algo at work.
Still, I believe google has still to unleash its big guns, so hopefully in the next few days we'll see those at work.
Any more theories =)
[edited by: spud01 at 9:43 am (utc) on June 24, 2003]
Do you think age of links could have anything to do with it or links from commonly owned sites?
We know from the expired domain filter that Google now uses age of link data and WhoIs info. I wonder if these now come into other parts of the algo or filters?
>> Are you sure this 'index effect' is happening on older sites? <<
Certainly. But I think the thing to bear in mind is that the issue isn't the age of the site itself, but the link structure.
Zafile: You make some very interesting comments there, especially the third, relating to link data at the end (plus ODP). Obviously a lot of people will hope you are correct, and that the missing link data will suddenly appear. The problem is that there is strong evidence to suggest that they have that data, but are just not using it or able to use it (all of the time). Still... I do recall him saying that as well, so we will see.
>> Do you think age of links could have anything to do with it or links from commonly owned sites? <<
Commonly owned sites may be an issue, but it doesn't explain the index problem... most of the sample I have comprises sites with links from sites that are not commonly owned.
The age of the links... yes, I believe that to be a key (see above).
From a reliable source, I understand ODP is switching ALL of its data to UTF-8.
Taking into account that Google uses Linux boxes, the switch of ODP data to UTF-8 will improve its share of information with Google.
And maybe, it will make it faster ...
|Are you sure this 'index effect' is happening on older sites? |
I'm seeing it on a site from 1999....
Could you sticky me the url please and the search term.
For those that were questioning the age of sites... the one I'm seeing this effect on is 4 years old, gets updated daily, lost 50% of its backlinks during April, gained different backlinks during May (fresh backlinks) and has been a stable high PR5 for some time.
One unusual effect is that for some single keywords the index page has actually jumped, but it is the targeted 2 and 3 word anchor-text phrases that are going in and out of the results. I do not have one specific anchor text; the links to the index are spread over 4 different phrases and have been top 5 for over 2 years.
Hope that info might help, Napoleon.
anyone else seeing new results on www this morning? We certainly are.... index page is back to #3 for main keyword. Also notice that SERPs are showing many, many interior page results...in fact, my link page comes up #1 for many keywords now...not sure I like that.
I am not sure if this is relevant here .. but the pages that I see gone down in rankings (for sites that I take care of) are freshly indexed ones. As if the fresh bot boost has turned into a fresh bot penalty ..
Great thread Napoleon, I agree with your age theory but I think one part of this may have been overlooked. I shared a website for your data set, I have the PR7 with a solid PR8 backlink but not quite a year old.
One issue I didn't see mentioned was the fact that these index pages that are "missing" may not be missing at all, for me at least, the index page is in, but it's duplicated 4 times.
Let me explain
all incoming links go to www.domain.com
all internal links point to /index.shtml
I find my index page is indexed, though dropped significantly in the SERPs. Several pages down from where it has always been, I see a double listing (indented) for the same page, where www.domain.com/index.shtml is listed at around 145 and domain.com at 146. As you already know, this is the same page, indexed two separate times; further searching shows the page is indexed a total of 4 times, each with a slightly different URL path.
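As a rough way to check whether a set of listed URLs are really one page, here is a small sketch that reduces each variant to a common key. It assumes that the www and non-www hosts serve the same content and that index.shtml is the directory default, which is true for my site but is not something you can assume in general. The function name canonical_key and the list of index filenames are my own inventions for illustration.

```python
from urllib.parse import urlsplit


def canonical_key(url, index_names=("index.shtml", "index.html")):
    """Reduce a URL to a key shared by the usual home-page variants."""
    parts = urlsplit(url)

    # Assumption: www.domain.com and domain.com serve the same content.
    host = parts.netloc.lower()
    if host.startswith("www."):
        host = host[4:]

    # An empty path and "/" are the same resource.
    path = parts.path or "/"

    # Treat /index.shtml (and similar) as the directory itself.
    head, _, tail = path.rpartition("/")
    if tail in index_names:
        path = head + "/"

    return host + path


variants = [
    "http://www.domain.com",
    "http://www.domain.com/index.shtml",
    "http://domain.com/",
    "http://domain.com/index.shtml",
]
keys = {canonical_key(u) for u in variants}
print(keys)
```

All four variants collapse to a single key, so four listings of this kind would be four copies of one page.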
Did you find this to be true on all of the sites you sampled?
Also, when you were studying these sites, did you notice that last week many of them returned to normal for about 2 days?
I still think that it will all work itself out, but the drastic loss of income for people that have played by the rules has got to hurt, and it's going to be even harder for SEOs to convince clients to play it safe and let their content do the talking...
I was wondering if, using the dataset you used for this research, you have tracked this same dataset for changes over the last few days. It seems as though a lot of people keep talking about radical changes in position on a daily (or hours in some cases) basis. It would be interesting, I think, to know what changes, if any, have taken place with the dataset you used. I don't mean specifics necessarily, but some sort of summary analysis. I realize how much work that would have entailed, so you may not have done this, but if so, it might be an interesting view of what is going on and where the process stands.
>> I'm seeing it on a site from 1999.... <<
It's the links that matter, nuts. How old are the ones with the relevant anchor text?
And thanks Marval "different backlinks during May (fresh backlinks)" sort of fits as well... the link aspect that is.
3cents: >> Did you find this to be true on all of the sites you sampled? <<
Not at all. I think there are several problems (at least 2) swilling around. The main one is the standard index page ranking below sub-pages. This seems to pertain to the age link theory. The other is that darned 'rogue' filter (negated with &filter=0) which may well be affecting you.
>> did you notice that last week many of them returned to normal for about 2 days? <<
Yup... that fits the proposition perfectly doesn't it. A twilight repository.
>> you have tracked this same dataset for changes over the last few days? <<
I have, and it certainly hasn't improved. No... it has got worse. Some of the sites that were hanging in there for grim life have lost their grip completely in the last 24 hours. For some of them I would say it's Google's loss - some of the omissions and replacements make Google look a little silly.
Is this dream (nightmare) ever going to end?
|It's the links that matter, nuts. How old are the ones with the relevant anchor text? |
Don't hang your hat on that. I've got sites that launched in March with plenty of recent, optimized links that are doing fine.
Hmm, it seems that with time they will make Google less popular, with all this testing of the ranks/SERPs.
I was afraid this would happen. In early May everything was great, good SERPs all over. OK, there will always be a few bad ones, but it was an almost perfect SE,
but as they began their testing, most of it went down to the level of AltaVista rankings.
In a way I hope they had/have problems and this is not the future of Google.
Never change a winning TEAM.
Could there be any relationship with duplicate content and missing index pages?
Ex. Affiliate site with duplicated product descriptions and titles is missing index
Or are there some sites that clearly are original that are missing index pages?
I still think it has something to do with the internal links. I have noticed over the last 3 months that they have cut down the backlinks from internal links even when there have been no changes;
it's like the PR flow has dropped throughout the site.
I have gone from 450 backlinks to 99 today in 3 months, with no changes made. That must tell us something.
I'm not the only one here, I think.
my internal pages show more backlinks than the index, even with the index being the most externally linked to and on every page internally.
Hmm I have not checked that, maybe its worth a try.
P.S. Not here; the index page still has the most links.
Why are some index pages missing?
1. Google's Broke Maybe, especially recent SEO - New and Old sites
2. Similar <Title> <H1 Tags> <Internal Linking>? No
3. Filter of most optimized keyword? Maybe
4. Filter eliminating most used keyword on page? Maybe
5. Too much duplicate content? Maybe
6. Toolbar tracking? Probably not
7. Google Dance search tracking? Probably not
8. Non-underlined and/or colored hyperlink? No
9. Too far ahead of 2nd place listing? No
10. index.php?uniquecontentstring=dan No
11. Failure to read 301 Perm Redirect correctly. Could be in my case, anyone else?
12. Fresh Bot Drop (interesting - although I've seen it actually help, so I'll say No)
|Don't hang your hat on that. I've got sites that launched in March with plenty of recent, optimized links that are doing fine. |
Perhaps that says something itself.
Napoleon, your question about the age of links is intriguing. PageRank as we (think we) know it gives no weight to the age of a link. But what would happen if Google decided to take into account the age of a link? Suppose there is a PR9 page with two outgoing links - one link has been on the page for 3 years and the other was added last month. Does each outgoing link deserve an equal 'vote'? If the answer is no, which one is worth more? The answer to the latter question is not clear cut. On the one hand, a new link might be more relevant and timely, thus deserving of greater weight. On the other hand, a newer link might be temporary in nature, such as a "site of the week" link - which would suggest a lower weight for the newer link. The latter case is particularly interesting when taken in the context of blogs. Newer links are 'temporary', rolling down and then off the home page within days or weeks. Blogs are also an interesting case because bloggers often link to each other using 'permalinks', which are actually links to sub-pages, not index pages. But 'blogrolls' are essentially static link-lists that sit on a blog home page and link to other bloggers' home pages. Might Google be trying to make some adjustments to account for these types of linking patterns?
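To put numbers on the two-link example, here is a toy sketch of how a page's vote might be divided. It is purely hypothetical: the function split_vote, the weighting rules and the one-year cut-off are invented for illustration, the figure 9.0 is just a stand-in for the page's score (not a real PageRank value), and the damping factor is ignored. Nothing here is known about what Google actually does.

```python
def split_vote(page_score, link_ages_days, weight=lambda age: 1.0):
    """Divide a page's vote among its outgoing links.

    weight maps a link's age in days to a relative weight. The default
    (every link weighs 1.0) reproduces the equal split of classic PageRank.
    """
    weights = [weight(age) for age in link_ages_days]
    total = sum(weights)
    return [page_score * w / total for w in weights]


# One link is 3 years old, the other was added last month.
ages = [3 * 365, 30]

equal = split_vote(9.0, ages)
# Hypothetical rule 1: links older than a year count double.
favour_old = split_vote(9.0, ages, weight=lambda age: 2.0 if age > 365 else 1.0)
# Hypothetical rule 2: links younger than a year count double.
favour_new = split_vote(9.0, ages, weight=lambda age: 1.0 if age > 365 else 2.0)

print("equal:     ", equal)
print("favour old:", favour_old)
print("favour new:", favour_new)
```

Under the equal split both targets receive the same share. Either age rule shifts a third of the vote one way or the other, which would be enough to move a site that depends on a handful of links.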
If Google ever started using the age of a page or link, it would certainly not be able to rank sites appropriately anymore.
If the Government comes out with a new site with a link from whitehouse.gov, shouldn't it receive high PR(importance)? It is silly to equate age with value, and I doubt that Google has gone down that road.
|brotherhood of LAN|
>>>one link has been on the page for 3 years and the other was added last month. Does each outgoing link deserve an equal 'vote'? If the answer is no, which one is worth more?
I guess you hit on my IMO here later in your post, to me the newer link would be "worth" more in the short term, i.e. the fresh effect gives the page linked to a boost in the short term.
Maybe it's not just a case of what is important but of how long any added bonus lasts.
I don't think the age of a link should have anything to do with the PR it transfers.
I'd say the chronological issues surrounding this problem are more focused on when backlinks were spidered and when affected sites came into existence.
Sorry if this has already been covered but I didn't have time to read through all of the posts... but does anybody know when the backlink weight factors in? I've sort of thought in the past that changes in backlinks (and presumably PR) affect you more in the NEXT update than the current one. Since at the time you were actually spidered you were at your previous PR, and the new PR would affect the deepcrawl that takes place after the update. Does that make sense? But I could be off base.
I have some more info to share...
I am seeing the occasional serp that has nothing at all to do with the search term, I mean, nothing at all, nothing in the title, nothing in the meta, text, etc.
I have also seen a couple of completely blank pages that have been top 20 since Dominic for some pretty popular terms.
The urls appear to be somewhat related to the search terms, for example: (not keyword domains)
search term: car widgets
Title: We Design Websites (or my favorite: New Page 1)
What does this have to do with the missing index pages and site age issue?
I believe that these sites used to be relevant to the search terms several months ago. When index pages were appearing properly last week for a couple of days, I noticed that these results were gone or had dropped considerably.
The pages are now showing a web design company that offers SEO, one even claimed to specialize in google seo. I doubt that google wants to promote these, I think it's just old data, with fresh serps showing.
The page has its old position, but a nice fresh and completely irrelevant title and description (if any).
GG had mentioned that this update will take place in steps, where ingredients are added, migrated, then calculated, then more ingredients added, etc.
So what's my point?
I'm taking a big loss on this index problem and it has had me stressed out, but I think we are not seeing what good is going to come from this "new technology" quite yet.
I expect that google can tell if it is listing the same page with four different url paths and can tell where backlinks point to, maybe they just can't calculate the latest results at the same time they are adding the most recent data?
I have hope that this will all work itself out, but I wonder what the benefit is of showing the public live results with completely irrelevant SERPs mixed in.
ps: I agree, this is a bad dream, let's hope it will be over soon...
>I still think it has something to do with the internal links. I have noticed over the last 3 months that they have cut down the backlinks from internal links even when there have been no changes; it's like the PR flow has dropped throughout the site.
I think you may have just invented the "prior hoc, propter hoc" fallacy.
Try the theory the other way. It's the EXTERNAL links! The site just isn't getting PR to flow. And what would happen then? Well, with less PR to recirculate, fewer internal pages would make the cut, ergo fewer internal links. Which bears an uncanny resemblance to the situation you describe.
But ... you say, none of the external sites changed either? But Google's "doorway filter" did! and _that_ has to be the critical piece of their anti-spam algorithm.
Can anyone explain these strange goings on with this update.
I was No 1 for widgets for children. During the last 24 hours my widgets for women and widgets for men pages have both shown up in the widgets for children search results, both times replacing the widgets for children page. Now my widgets for children page has vanished from the index. BTW there is no mention of men or women's widgets on the children's widget page.
>Commonly owned sites may be an issue<
I won't even pretend to have all the answers. There are some pretty smart people doing some pretty smart looking in this thread if you ask me, but I had come to the "commonly owned sites" conclusion almost two years ago when a discussion came up here about the toolbar and its purpose. I know it's been discussed over and over since then but this was one where GG gave a typical non-answer. Scared me a little and got me to thinking about what I was risking.
In my opinion, if you think all your domains can't be tracked by IP#, server ID, registration and linking strategies, you're nuts. Registering all your domains to the same person at the same address with the same class C IP, cross linking them all and then wondering what the connection is seems bizarre to me.
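If you want to see how obvious the class C connection is, here is a minimal sketch that checks whether two addresses share the same /24 block (the usual modern reading of "same class C"). The function name same_class_c and the sample addresses are only illustrative, and this is just one of the signals mentioned above; I have no idea how, or whether, Google weighs it.

```python
import ipaddress


def same_class_c(ip_a, ip_b):
    """Return True if both IPv4 addresses fall in the same /24 block."""
    # strict=False lets us pass a host address and get its enclosing network.
    block = ipaddress.ip_network(f"{ip_a}/24", strict=False)
    return ipaddress.ip_address(ip_b) in block


# Documentation-range addresses used purely as examples.
print(same_class_c("192.0.2.10", "192.0.2.200"))
print(same_class_c("192.0.2.10", "198.51.100.10"))
```

Two domains that pass this check and also cross link heavily are about as easy to connect as it gets.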
I have to spend as much on servers and IP#'s as probably some of you spend in a year on Adwords, but all I know is the domains that are old or I didn't pay attention to are having this problem. The ones I used my head on are doing fine with PR and placement. But maybe that is just me. It seems to me that just because I'm paranoid doesn't mean they aren't watching. I have kids to raise and I can't afford to go back to data entry jobs. I built those sites with a purpose and I never had any intention of you ending up on my contact page for me to start my pitch.
As an illustration, I have 4 domains covering a wide range of varied products and services that pay well. These are all domains that are over two years old and have consistently produced very well. Well enough that I got lazy and never "protected" them.
Now they have the weirdest pages, (not my index), coming up for the weirdest search terms. I actually have better placements and more "hits" than I did a couple of weeks ago, but they stopped converting sh**! Who needs hits?
It has always been a huge obstacle to overcome Google's creative choices of displaying descriptions for my sites, but now having goofy snippets of sentences leading to a contact page is having the exact reaction any logical person would expect. Looks to me like no one is getting what they thought they were going to get out of this deal.
My free advice is (and we all know what that is worth): if you are not in a position to gain control over your domain registrations, your IP#'s and your servers, at least look closely at your linking strategy.
|But ... you say, none of the external sites changed either? But Google's "doorway filter" did! and _that_ has to be the critical piece of their anti-spam algorithm. |
hutcheson, this may be the most plausible theory yet for the problem with sub-pages ranking higher than home pages!
What are index pages other than doorways? They are doorways! Yes, they are legitimate doorways, but it is true that many index pages serve as the doorway to the rest of your site. What if a new doorway filter was penalising index pages? If this were indeed occurring, it raises a second question: is it due to a faulty doorway filter (a transient issue that will be fixed) or is it by Google's design (a fundamental algorithm change)? ....which is one of the original questions asked in this thread.
I don't know the answers, but home pages getting caught in a doorway filter seems possible.
"They are doorways!"
No they aren't. They are first pages. They have content themselves and links to other content. That is the heart and soul of the Internet itself.
They can be content-less doorways sometimes, but that certainly doesn't need to be the case, and not a soul has posted anything that would suggest the index page problem is a lack of content problem. Pages being listed are the ones lacking content.
Just checked a couple of other forums to see what's being said about the current problems with Google and they seem resigned to this mess going on for some time yet.
The owner of one forum has suggested the word is that it won't be sorted until the end of July.