Against that background perhaps the following analysis and theory may fall into place more easily.
DATA ANALYSIS AND BACKGROUND
Last week I posted a message requesting members to sticky mail details of their own specific situations and sites with respect to the fluctuations.
After spending days analyzing, and watching the picture continue to change before my eyes, I eventually found a theory to hang my hat on. No doubt it will be challenged, but at least it currently fits the data bank I have (my own sites, plus a third party observation set I use, plus those that were submitted to me by the above).
Two general phenomena seem to be dominating the debate:
a) Index pages ranking lower than sub-pages for some sites on main keyword searches
b) Sites appearing much lower than they should on main keyword searches, yet ranking highly when the &filter=0 parameter is applied.
These problems are widespread and there is much confusion out there between the two (and some others).
The first has probably attracted most attention, no doubt because it is throwing up such obvious and glaring glitches in visible search returns (eg: contact pages appearing as the entry page to the site). The second is less visible to the searcher because it simply torpedoes the individual sites affected.
By the way, in case anyone is still unaware, the &filter=0 parameter switches off the filter which screens out duplicate content. Except that filter does more than that.... it is currently screening out many sites for no obvious reason (sites that are clearly clean and unique).
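For anyone who wants to check their own site against symptom (b), here is a minimal sketch (Python, purely illustrative; the example query is invented) that just builds the two URLs so the filtered and unfiltered result sets can be compared side by side in a browser. Fetching Google results with a script may well be blocked and could breach their terms, which is why this only constructs the URLs:

```python
# Builds the normal query URL and the same query with &filter=0 appended,
# so the two result sets can be compared manually in a browser.
from urllib.parse import urlencode

def google_query_urls(query):
    base = "http://www.google.com/search"
    filtered = f"{base}?{urlencode({'q': query})}"
    unfiltered = f"{base}?{urlencode({'q': query, 'filter': 0})}"
    return filtered, unfiltered

if __name__ == "__main__":
    normal, no_filter = google_query_urls("blue widget suppliers")  # made-up query
    print("Normal SERP:    ", normal)
    print("With &filter=0: ", no_filter)
    # If a site ranks well in the second set but is buried in the first,
    # it is being caught by the filter described above.
```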
So why is all this happening? Is there a pattern, and is there a relationship between these two and the other problems?
Well at first I wrestled with all sorts of theories. Most were shot down because I could always find a site in the data set that didn't fit the particular proposition I had in mind. I checked the obvious stuff: onsite criteria, link patterns, WHOIS data... many affected sites were simply 'clean' on anyone's interpretation.
Throughout, though, there was one constant: none of the sites affected were old (eg: more than 2 years), or at least none had old LINK structures.
This seemed ridiculous. There would be no logic to Google treating newer sites in this manner and not older ones. It is hardly likely to check the date when crawling! But the above fact was still there.
I have been toying with all sorts of ideas to resolve it... and the only one that currently makes any sense is the following.
THE GOOGLE TWILIGHT ZONE
In addition to WebmasterWorld I read a number of search blogs and portals. On one of these (GoogleWatch) a guy called Daniel Brandt quotes GoogleGuy as stating: "That is, we wind down the crawl after fetching 2B+ URLs, and the URL in question might not have been in that set of documents".
Now, assuming that is true (and it's published on the website so I would imagine it isn't invented), or even partially true, all sorts of explanations emerge.
1) The 2BN+ Set
If you are in here, as most long-standing and higher-PR sites will be, it is likely to be business as usual. These sites will be treated as if they were crawled by the old GoogleBot DEEP crawler. They will be stable.
2) The Twilight Set
But what of the rest? It sounds like Google may only have partial data for these, because the crawlers 'wound down' before getting the full picture. Wouldn't THAT explain some of the above?
To answer this question we need to consider Google's crawling patterns. One assumes that they broadly crawl down from high-PR sites. They could also crawl down from older sites: sites they already know about and know to exist and be stable. That too would make sense.
You can probably see where this is heading.
If your site or its link structure is relatively new, and/or say PR5 or below, you may well reside in the twilight zone. Google will not have all the data (or all the data AT ONCE) and you will be experiencing instability.
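To make the 'wind down' idea concrete, here is a toy sketch (Python). The PR figures, domain names and budget are all invented, and this is obviously nothing like Google's real crawler; it just shows how a priority-ordered crawl with a fixed fetch budget leaves the newer / lower-PR URLs with only partial data:

```python
# Toy model of a crawl that 'winds down' after a fixed budget: URLs are
# fetched in priority order (a made-up PR-like score) until the budget is
# spent, and everything left over gets only partial treatment.
import heapq

def budgeted_crawl(seed_urls, budget):
    # Max-heap on priority (negate because heapq is a min-heap).
    frontier = [(-pr, url) for url, pr in seed_urls]
    heapq.heapify(frontier)
    fetched, twilight = [], []

    while frontier:
        neg_pr, url = heapq.heappop(frontier)
        if len(fetched) < budget:
            fetched.append(url)      # full data: links, anchors, cache
        else:
            twilight.append(url)     # partial data only
    return fetched, twilight

seeds = [("old-pr7-site.com", 7), ("established-pr6.com", 6),
         ("new-pr4-site.com", 4), ("brand-new-pr2.com", 2)]
full, partial = budgeted_crawl(seeds, budget=2)
print("Fully crawled:", full)       # the '2BN+ set'
print("Twilight zone:", partial)    # newer / lower-PR sites
```

In this picture the '2BN+ set' is simply whatever fits inside the budget, and everything else sits in the twilight zone until the next pass.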
I have sites in my observation set that enter and exit both the problem sets above (a) and (b). It's as though Google is getting the requisite data for a period and then losing some of it again. As if the twilight zone is a temporary repository, perhaps populated and over-written by regular FreshBot data.
The data most affected by this is the link data (including anchor text) – it seems to retain the cache of the site itself and certain other data. This omission would also partially explain the predominance of sub-pages: with the loss of this link data there is nothing to support the index page above those sub-pages (Google is having to take each page on totally stand-alone value).
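As a purely hypothetical illustration of that last point (the pages, scores and weighting are invented, and this is not a claim about how Google actually ranks), dropping the anchor-text contribution is enough to flip the ordering between an index page and a sub-page:

```python
# Toy illustration only: if the anchor-text / link data for a site goes
# missing, the index page loses the support that normally lifts it above
# its own sub-pages, and each page is judged on stand-alone value.
def score(on_page_relevance, anchor_text_hits, link_data_available):
    link_boost = anchor_text_hits if link_data_available else 0
    return on_page_relevance + link_boost

pages = {
    "/":             {"on_page": 3, "anchors": 8},  # most inbound links point here
    "/contact.html": {"on_page": 4, "anchors": 0},  # slightly better on-page match
}

for has_links in (True, False):
    ranked = sorted(pages,
                    key=lambda p: score(pages[p]["on_page"],
                                        pages[p]["anchors"],
                                        has_links),
                    reverse=True)
    print("link data" if has_links else "no link data", "->", ranked)
# With link data the index page wins; without it the contact page
# outranks it on stand-alone value alone.
```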
IS IT A PROBLEM?
I also wonder whether Google sees all of this as a problem. I certainly do. Problem (a) is clearly visible to the searching public. They DON'T want to be presented with the links page for example when they enter a site! That is a poor search experience.
Do they see (b) as a problem? Again, I do. Sites are being filtered out when they have no duplicate content. Something isn't right. Google is omitting some outstanding sites, which will be noticeable in some cases.
The combination of (a) and (b) and perhaps other less well publicized glitches gives a clear impression of instability to anyone watching the SERPS closely (and that's a growing body of people). Together they are also disaffecting many webmasters who have slavishly followed Google's content-content-content philosophy. As I implied the other day, if following the Google content/link line gets them nowhere at all, they will seek other SEO avenues, which isn't good for Google in the long term.
WHY HAVE A TWILIGHT ZONE?
Some people speculate that there is a software flaw (the old 4 byte / 5 byte theory for URL IDs) and that consequently Google has a shortage of address space with which to store all the unique URL identifiers. Well... I guess that might explain why a temporary zone is appealing to Google. It could well be a device to get around that issue whilst it is being solved. Google though has denied this.
However, it may equally be a symptom of the algorithmic and crawler changes we have seen recently. Ditching the old DeepBot and trying to cover the web with FreshBot was a fundamental shift. It is possible that for the time being Google has given up the chase of trying to index the WHOLE web... or at least FULLY index it all at once. Possibly we are still in a transitional position, with FreshBot still evolving to fully take on DeepBot's responsibilities.
If the latter is correct, then the problems above may disappear as Freshbot cranks up its activity (certainly (a)). In the future the 'wind down' may occur after 3BN, and then 4BN.... problem solved... assuming the twilight zone theory is correct.
At present, though, those newer (eg: 12 months+) links may be subject to ‘news’ status, and require refreshing periodically to be taken account of. When they are not fresh, the target site will struggle, and will display symptoms like sub-pages ranking higher than the index page. When they are fresh, the site will recover for a time.
VERDICT?
Certainly evidence is mounting that we have some kind of temporary zone in play. Perhaps problem (b) is simply an overzealous filter (very overzealous indeed!). However, problem (a) and other issues suggest a range of instability that affects some sites and not others. Those affected all seem to have the right characteristics to support the theory: relatively new link structure and/or not high PR.
The question that many will no doubt ask is: if this is correct, how long will it last? Obviously I can’t answer that. All I have put forward is a proposition based upon a reasonable amount of data and information.
I must admit, I do struggle to find any other explanation for what is currently happening. Brett’s ‘algo tweak’ suggestion just doesn’t stack up against the instability, the site selection for that instability, or the non-application to longer established sites.
The above theory addresses all those, but as ever…. if anyone has a better idea, which accounts for all the symptoms I have covered (and stands up against a volume of test data), I’m all ears. Maybe GoogleGuy wishes to comment and offer a guiding hand through these turbulent times.
How many other people here are seeing good results but are scared of getting flamed? :)
I guess I will have to disagree that everything is solid.
No...I wouldn't say everything is solid. I will say that the first page (top 10) for the keyphrase I check most often is almost the same across all the DCs. To me that says that the PR calculations for established sites are finished/all current relative to Esmerelda, or depending on the mechanics of the system, the dissemination of the calculated PR data is complete.
That said, I do think we're in some kind of constant update mode. It may be that the updates of adding new sites/pages just aren't affecting things at the most competitive levels. I'm not sure yet whether the constant update concept will extend to continuous recalculations of PR and ever-changing SERPs, or just better/faster ways of adding new sites/pages and determining the relevance of modified pages.
Probably outside the top 10 or 20 results, the quality (and stability) is tapering off a bit, but then I don't expect to find great sites buried that deep, and it seems natural that less established sites would be moving around if the 'constant update' theory is correct.
Frankly, I like the results I'm seeing now. But then, I'm in the UK - things may well look different in the US.
This same thread has been running in SE circles for 5+ years. Every time an SE changes their algo, everyone thinks the sky is falling.
This is the same discussion we had about Infoseek in 97, Excite in 98, and Alta in 99, when they dropped keywords in titles, dashes in domain names, or h1 headers here and there for a few updates.
and of course a lot of us would not have been involved with websites back then - I did not even own a computer until 2001 :)
PR has much less relevance now, from what I can see. Content is being rewarded much more than PR. Looks like an improvement to me.
That explanation could maybe help others here.
zeus
Brett.. I don't see a big algo change from here; the only big difference this update is the rolling effect and the lack of a real deep crawl so far... a lot of bots looking for new pages but no reindexing of older pages that have changed.
They don't. You are assuming again that it is 100% certain this is an algo change. Has Google told you that?
Has it been proved?
I think it hasn't, and if you actually read the thread and check the centers you will find plenty of evidence to suggest it isn't. Changing data and swapping it in and out is NOT an algo change.
Surely an algo change would be consistent across the centers by now? One would also expect it to affect all sites the same, regardless of age. One would also expect GG's comments (suggesting it isn't an algo change) to closely reflect the real situation (quoted somewhere above).
I don't understand why you cannot see why there is doubt, and why you wish threads like this, which admittedly after the first 20 messages or so cover the symptoms, to cease.
Sure, it may be an algo change, and of course some webmasters will be making changes to some sites to explore that avenue. But the case is far from proved and just stating that it IS an algo change doesn't make it one and isn't convincing.
Why not post the detailed rationale of why you think it is an algo change (also defining what you mean by algo change)? Why not throw in your assessment of what that change actually is?
How about it?
The latest changes which began on -fi look better, but there are still significant index pages missing.
There's no consistent change to suggest a consistent change in the algorithm. It's like two or three things are happening and they're being switched in and out.
I think suggestions that index pages are being filtered out are highly premature, and of very doubtful value. Perhaps it looks that way to an overstressed SEO, but there are still lots of index pages out there that have been left alone.
This smells like a glitch.
I am seeing my index page go in and out of the results.
But, every time it goes in (at one point it was in 3 data centers) it seems to rank a bit better.
It's around #5 if the index page sticks across several DCs.
Usually GG tells us what the scoop is, but he has been pretty quiet lately.
To me that somewhat indicates things aren't done cooking and he'd rather not pop in here to speculate.
The DC's are playing index page Ping Pong. All I can say is:
PRINTSCREEN when you see a good result. ;)
AW
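In the spirit of that PRINTSCREEN advice, here is a small sketch (Python) that saves a timestamped copy of a results page so there is a record when the DCs flip again. The '-fi' hostname is an assumption based on the centre mentioned earlier in the thread, the query is made up, and automated requests may simply be blocked, in which case save the page from the browser instead:

```python
# Minimal sketch: save a timestamped copy of a results page so you have a
# record of a 'good' SERP before the data centres flip again.
# The host below is an assumption based on the '-fi' centre mentioned earlier;
# automated requests may be refused, in which case save the page from a browser.
import time
from urllib.parse import urlencode
from urllib.request import Request, urlopen

def snapshot_serp(query, host="www-fi.google.com"):
    url = f"http://{host}/search?{urlencode({'q': query})}"
    req = Request(url, headers={"User-Agent": "Mozilla/5.0"})
    html = urlopen(req).read()
    filename = f"serp_{host}_{time.strftime('%Y%m%d_%H%M%S')}.html"
    with open(filename, "wb") as f:
        f.write(html)
    return filename

if __name__ == "__main__":
    print("Saved:", snapshot_serp("example keyphrase"))  # query is made up
```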
...but I only disappear when i get the actual fresh tag. Are your experiences similar?
I am noticing something similar, but my site doesn't disappear completely. Rather, when I get the fresh tag, my rankings fall two or three pages for primary keyphrases. When the fresh tag disappears, my rankings go back to the top. Very strange. Incidentally, my site is very fresh: it changed significantly over the weekend, and today's fresh tag is June 25.
Brett: sometimes I can't tell the difference when you are throwing water on the fire or gasoline.... lol
I have seen some improvement over the last 24 hours in one respect: the completely blank pages and completely off-topic pages that were showing up in the top 10 are dropping towards 20-30 now.
The different paths to the index pages seem to be losing more ranking too though.
Napoleon, are you seeing this with your set?
Welp folks, Google has finally devalued the useless multiple keyword search, in addition to banning those useless index pages. I mean, really, what's the point of it anyway?
I've been saying for years that a good 'contact us' page is where it's at... :p