Against that background perhaps the following analysis and theory may fall into place more easily.
DATA ANALYSIS AND BACKGROUND
Last week I posted a message requesting members to sticky mail details of their own specific situations and sites with respect to the fluctuations.
After spending days analyzing, and watching the picture continue to change before my eyes, I eventually found a theory to hang my hat on. No doubt it will be challenged, but at least it currently fits the data bank I have (my own sites, plus a third party observation set I use, plus those that were submitted to me by the above).
Two general phenomena seem to be dominating the debate:
a) Index pages ranking lower than sub-pages for some sites on main keyword searches
b) Sites appearing much lower than they should on main keyword searches, yet ranking highly when the &filter=0 parameter is applied.
These problems are widespread and there is much confusion out there between the two (and some others).
The first has probably attracted most attention, no doubt because it is throwing up such obvious and glaring glitches in visible search returns (eg: contact pages appearing as the entry page to the site). The second is less visible to the searcher because it simply torpedoes the individual sites affected.
By the way, in case anyone is still unaware, the &filter=0 parameter reverses the filter which screens out duplicate content. Except it does more than that.... it is currently screening out many sites for no obvious reason (sites that are clearly clean and unique).
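For anyone who wants to check problem (b) on their own site, here is a rough Python sketch of the comparison. It only builds the two search URLs (with and without &filter=0) and compares where a domain sits in two result lists that you have noted down by hand. The base URL, the function names and the 20-place gap are my own assumptions for illustration, not anything Google has published.

```python
from urllib.parse import urlencode, urlparse

# Assumed base URL, for illustration only.
SEARCH_BASE = "http://www.google.com/search"


def build_search_url(query, unfiltered=False):
    """Build a search URL, optionally adding the &filter=0 parameter."""
    params = {"q": query}
    if unfiltered:
        params["filter"] = "0"
    return SEARCH_BASE + "?" + urlencode(params)


def rank_of(domain, result_urls):
    """Return the 1-based position of the first result on `domain`, or None."""
    for position, url in enumerate(result_urls, start=1):
        host = urlparse(url).hostname or ""
        if host == domain or host.endswith("." + domain):
            return position
    return None


def looks_filtered(domain, filtered_results, unfiltered_results, gap=20):
    """Guess whether `domain` shows the symptoms of problem (b).

    The site counts as affected when it ranks with &filter=0 but is either
    absent from the normal results or at least `gap` places lower there.
    The gap of 20 is an arbitrary threshold chosen for this sketch.
    """
    with_filter = rank_of(domain, filtered_results)
    without_filter = rank_of(domain, unfiltered_results)
    if without_filter is None:
        return False  # not ranking either way, so nothing to compare
    if with_filter is None:
        return True   # present only when the filter is switched off
    return with_filter - without_filter >= gap
```

Run the same keyword search twice, once with each URL from build_search_url, paste the result URLs into two lists, and pass them to looks_filtered along with your domain.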
So why is all this happening? Is there a pattern, and is there a relationship between these two and the other problems?
Well at first I wrestled with all sorts of theories. Most were shot down because I could always find a site in the data set that didn't fit the particular proposition I had in mind. I checked the obvious stuff: onsite criteria, link patterns, WHOIS data... many affected sites were simply 'clean' on anyone's interpretation.
Throughout though, there was the one constant: none of the sites affected were old (eg: more than 2 years) or at least none had old LINK structures.
This seemed ridiculous. There would be no logic to Google treating newer sites in this manner and not older ones. It is hardly likely to check the date when crawling! But the above fact was still there.
I have been toying with all sorts of ideas to resolve it... and the only one that currently makes any sense is the following.
THE GOOGLE TWILIGHT ZONE
In addition to WebmasterWorld I read a number of search blogs and portals. On one of these (GoogleWatch) a guy called Daniel Brandt quotes GoogleGuy as stating: "That is, we wind down the crawl after fetching 2B+ URLs, and the URL in question might not have been in that set of documents".
Now, assuming that is true (and it's published on the website so I would imagine it isn't invented), or even partially true, all sorts of explanations emerge.
1) The 2BN+ Set
If you are in here, as most long standing and higher PR sites will be, it is likely to be business as usual. These sites will be treated as if they were crawled by the old GoogleBot DEEP crawler. They will be stable.
2) The Twilight Set
But what of the rest? It sounds like Google may only have partial data for these, because the crawlers 'wound down' before getting the full picture. Wouldn't THAT explain some of the above?
To answer this question we need to consider Google's crawling patterns. One assumes that they broadly crawl down from high PR sites. They could also crawl down from older sites, sites they know about and sites they know both exist and are stable. That too would make sense.
You can probably see where this is heading.
If your site or its link structure is relatively new, and/or say PR5 or below, you may well reside in the twilight zone. Google will not have all the data (or all the data AT ONCE) and you will be experiencing instability.
I have sites in my observation set that enter and exit both the problem sets above (a) and (b). It's as though Google is getting the requisite data for a period and then losing some of it again. As if the twilight zone is a temporary repository, perhaps populated and over-written by regular FreshBot data.
The data most affected by this is the link data (including anchor text). Google seems to retain the cache of the site itself and certain other data. This omission would also partially explain the predominance of sub-pages, as with the loss of this link data there is nothing to support the index page above those sub-pages (Google is having to take each page on its stand-alone value).
IS IT A PROBLEM?
I also wonder whether Google sees all of this as a problem. I certainly do. Problem (a) is clearly visible to the searching public. They DON'T want to be presented with the links page for example when they enter a site! That is a poor search experience.
Do they see (b) as a problem? Again, I do. Sites are being filtered out when they have no duplicate content. Something isn't right. Google is omitting some outstanding sites, which will be noticeable in some cases.
The combination of (a) and (b) and perhaps other less well publicized glitches gives a clear impression of instability to anyone watching the SERPS closely (and that's a growing body of people). Together they are also disaffecting many webmasters who have slavishly followed Google's content-content-content philosophy. As I implied the other day, if following the Google content/link line gets them nowhere at all, they will seek other SEO avenues, which isn't good for Google in the long term.
WHY HAVE A TWILIGHT ZONE?
Some people speculate that there is a software flaw (the old 4 byte / 5 byte theory for URL IDs) and that consequently Google has a shortage of address space with which to store all the unique URL identifiers. Well... I guess that might explain why a temporary zone is appealing to Google. It could well be a device to get around that issue whilst it is being solved. Google though has denied this.
However, it may equally be a symptom of the algorithmic and crawler changes we have seen recently. Ditching the old DeepBot and trying to cover the web with FreshBot was a fundamental shift. It is possible that for the time being Google has given up the chase of trying to index the WHOLE web... or at least FULLY index it at once. Possibly we are still in a transitional position, with FreshBot still evolving to fully take on DeepBot's responsibility.
If the latter is correct, then the problems above may disappear as Freshbot cranks up its activity (certainly (a)). In the future the 'wind down' may occur after 3BN, and then 4BN.... problem solved... assuming the twilight zone theory is correct.
At present though those newer (eg: 12 months+) links may be subject to ‘news’ status, and require refreshing periodically to be taken account of. When they are not fresh, the target site will struggle, and will display symptoms like sub-pages ranking higher than the index page. When they are fresh, they will recover for a time.
VERDICT?
Certainly evidence is mounting that we have a more temporary zone in play. Perhaps problem (b) is simply an overzealous filter (very overzealous indeed!). However, problem (a) and other issues suggest a range of instability that affects some sites and not others. Those affected all seem to have the right characteristics to support the theory: relatively new link structure and/or not high PR.
The question that many will no doubt ask is: if this is correct, how long will it last? Obviously I can’t answer that. All I have put forward is a proposition based upon a reasonable amount of data and information.
I must admit, I do struggle to find any other explanation for what is currently happening. Brett’s ‘algo tweak’ suggestion just doesn’t stack up against the instability, the site selection for that instability, or the non-application to longer established sites.
The above theory addresses all those, but as ever…. if anyone has a better idea, which accounts for all the symptoms I have covered (and stands up against a volume of test data), I’m all ears. Maybe GoogleGuy wishes to comment and offer a guiding hand through these turbulent times.
Why are some index pages missing?
1. Google's Broke
2. Similar <Title> <H1 Tags> <Internal Linking>?
3. Filter of most optimized keyword?
4. Filter eliminating most used keyword on page?
5. Too much duplicate content?
6. Toolbar tracking?
7. Google Dance search tracking?
8. Non-underlined and/or colored hyperlink?
9. Too far ahead of 2nd place listing?
10. index.php?uniquecontentstring=dan
Could use some help here. Please feel free to sticky examples too. Thanks!
Why are some index pages missing?
1. Google's Broke - Brokish - for sites under 4-6 months old.
2. Similar <Title> <H1 Tags> <Internal Linking> - Disagree
3. Filter of most optimized keyword? Quite Possibly
4. Filter eliminating most used keyword on page? Again Quite Possibly
5. Too much duplicate content? - Disagree
6. Toolbar tracking? Disagree
7. Google Dance search tracking? Disagree
8. Non-underlined and/or colored hyperlink? Disagree
9. Too far ahead of 2nd place listing? Disagree
10. index.php?uniquecontentstring=dan No Idea
In the 100 to 200 results returned listings, -in is showing an extra 10 to 15 results compared to other datacentres.
In the 5000 results returned listings, -in is showing ~300 to ~400 more results when compared to other datacentres.
I think -in is having an injection of fresh data into the listings. I've seen a few entries go in from sites that started mentioning the particular search topic on their site only in the last week or so.
Could this be that we are watching the dripping in of fresh data, without fresh tags, a sort of rolling update?
[edited by: g1smd at 12:51 am (utc) on June 24, 2003]
It may be that I can throw something new into the mix because of this site (excuse my general ignorance on the subject - but maybe this will help).
During the recent update the site was recrawled by Google, as per normal (visible by browser tag - several sessions). The big difference here is that I had just completely rebuilt the site and most of the old pages had disappeared and been replaced by redirects to the home page. The old pages were, in the change to the site, orphaned.
Despite having what is virtually a brand new site, Google is still listing the links and pages from the old site. In other words, the index has not updated the site. The PR and general ranking look pretty much as before. It's as if nothing had changed at all....
At the same time I have built the same site & structure for several other domains, but for a different geographic market. Some of these have been crawled but still nothing showing as yet?
Hope this contributes something useful to the debate.
I have one particular (6 month old) site which is incredibly well backlinked, yet it is clear, although a link: shows a lot of this, that the benefit is not being applied right now. Whereas others (domains that are 2 plus years old) with fewer backlinks are doing much better.
Sure, there could be other factors. But I don't think so, as the older domains had an HTML remake around the same time the new domain was launched. Although they target different keywords / markets, the newer site should be storming ahead relative to the older domains. It ain't.
I'd be looking at something else if it were me (but then, I'm not a site newer than 1 year, so maybe that's the difference).
on my index page I'm not seeing any problem with same words in title and h1.
The phrases are not identical but everything in h1 is also in the title.
I had a brief index disappearance in April 2003 PD (pre-Dominic), but the index page came back prior to Dominic and has ridden well since Dominic, including after Esmeralda tossed on her dancing shoes.
These two claims are making a lot of sense with what I am seeing on www-in right now but it's showing terribly irrelevant results. I'm sure I rank high for "turquoise widgets" when it's mentioned once on my page but the page is not about turquoise widgets, it's about "umber widgets" for which it has horrible ranking.
They have definitely targeted my number one phrase.
To coin a new term: "Google Fatigue," the overwhelming frustration and weariness a Webmaster endures while waiting for Google to produce SERPS that are not discombobulated.
Is making sense for some of the results. I believe Google are running this on just one of the xx algos they use to place results.
I would have thought stickymail a better medium rather than show a need on your part.
Phrases that have me as #1 in anchor text are showing up much lower than sites that were #1 in their anchor text prior to last year. I was #1 in anchor text and SERP prior to this last update.
Sites with more older established links not having problems with index pages dropping.
For one search in particular, I dropped from #5 to #86 in the blink of an eye.
The results are bouncing back and forth between what they were last week and now.
Somehow, as speculated in this thread before, newer links and anchor text seem to have lost their luster.
Is this by design or by accident?
Who knows!
I'll check again in a few days and see how it looks then.
[edited by: mrguy at 2:39 am (utc) on June 24, 2003]
www-fi
www-sj
www-dc
www-ab
www-zu
www-in
www-ex
www-cw
www-va
?
I have noticed a big difference in my SERPS for every datacenter I check.
It's as if each data center is running a different algo or running off inconsistent data.
It constantly changes on an hourly basis.
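If you want to put a number on how far apart the datacenters are, a small Python sketch like the one below can help. You record by hand the position you see for one keyword on each datacenter, and it reports the best, the worst and the spread between them. The function name and the sample positions are made up for illustration; nothing here queries Google.

```python
def rank_spread(ranks_by_datacenter):
    """Summarise hand-recorded positions for one keyword across datacenters.

    `ranks_by_datacenter` maps a datacenter label (e.g. "www-in") to the
    position observed there, or None if the site was not found.
    """
    seen = {dc: rank for dc, rank in ranks_by_datacenter.items() if rank is not None}
    missing = sorted(dc for dc, rank in ranks_by_datacenter.items() if rank is None)
    if not seen:
        return None  # the site was not found on any datacenter

    best = min(seen, key=seen.get)    # lowest position number is the best rank
    worst = max(seen, key=seen.get)
    return {
        "best": (best, seen[best]),
        "worst": (worst, seen[worst]),
        "spread": seen[worst] - seen[best],
        "missing": missing,
    }


# Illustrative positions only, not real observations.
observed = {"www-in": 5, "www-fi": 86, "www-sj": None}
summary = rank_spread(observed)
```

Logging this every hour or so over a few days would show whether the datacenters are converging or still bouncing around.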