Against that background perhaps the following analysis and theory may fall into place more easily.
DATA ANALYSIS AND BACKGROUND
Last week I posted a message requesting members to sticky mail details of their own specific situations and sites with respect to the fluctuations.
After spending days analyzing, and watching the picture continue to change before my eyes, I eventually found a theory to hang my hat on. No doubt it will be challenged, but at least it currently fits the data bank I have (my own sites, plus a third party observation set I use, plus those that were submitted to me by the above).
Two general phenomena seem to be dominating the debate:
a) Index pages ranking lower than sub-pages for some sites on main keyword searches
b) Sites appearing much lower than they should on main keyword searches, yet ranking highly when the &filter=0 parameter is applied.
These problems are widespread and there is much confusion out there between the two (and some others).
The first has probably attracted most attention, no doubt because it is throwing up such obvious and glaring glitches in visible search returns (eg: contact pages appearing as the entry page to the site). The second is less visible to the searcher because it simply torpedoes the individual sites affected.
By the way, in case anyone is still unaware, the &filter=0 parameter reverses the filter which screens out duplicate content. Except it does more than that.... it is currently screening out many sites for no obvious reason (sites that are clearly clean and unique).
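For anyone who wants to check which side of this they are on, here is a rough sketch of the comparison I mean: run your main keyword once normally and once with &filter=0 appended, and see whether your domain only turns up in the second case. The keyword, the domain and the result-page format are just placeholders, and Google may well block or frown on automated queries, so treat it as the manual check written down rather than a tool.

# Rough sketch (not an official tool): compare your main keyword search with
# and without &filter=0 and see whether the domain only shows up when the
# filter is off. QUERY, DOMAIN and the result-page format are placeholders;
# Google may block or disallow automated queries.
import urllib.parse
import urllib.request

QUERY = "blue widgets"        # hypothetical main keyword
DOMAIN = "www.example.com"    # hypothetical affected site

def fetch_results(filtered):
    params = {"q": QUERY, "num": "100"}
    if not filtered:
        params["filter"] = "0"          # the parameter discussed above
    url = "http://www.google.com/search?" + urllib.parse.urlencode(params)
    req = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    return urllib.request.urlopen(req, timeout=10).read().decode("latin-1", "replace")

normal = DOMAIN in fetch_results(filtered=True)
unfiltered = DOMAIN in fetch_results(filtered=False)

print("Appears in normal results:    ", normal)
print("Appears with &filter=0 added: ", unfiltered)
if unfiltered and not normal:
    print("Symptom (b): the site only surfaces when the filter is switched off.")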
So why is all this happening? Is there a pattern, and is there a relationship between these two and the other problems?
Well at first I wrestled with all sorts of theories. Most were shot down because I could always find a site in the data set that didn't fit the particular proposition I had in mind. I checked the obvious stuff: onsite criteria, link patterns, WHOIS data... many affected sites were simply 'clean' on anyone's interpretation.
Throughout though, there was the one constant: none of the sites affected were old (eg: more than 2 years) or at least none had old LINK structures.
This seemed ridiculous. There would be no logic to Google treating newer sites in this manner and not older ones. It is hardly likely to check the date when crawling! But the above fact was still there.
I have been toying with all sorts of ideas to resolve it... and the only one that currently makes any sense is the following.
THE GOOGLE TWILIGHT ZONE
In addition to WebmasterWorld I read a number of search blogs and portals. On one of these (GoogleWatch) a guy called Daniel Brandt quotes GoogleGuy as stating: "That is, we wind down the crawl after fetching 2B+ URLs, and the URL in question might not have been in that set of documents".
Now, assuming that is true (and it's published on the website so I would imagine it isn't invented), or even partially true, all sorts of explanations emerge.
1) The 2BN+ Set
If you are in here, as most long standing and higher PR sites will be, it is likely to be business as usual. These sites will be treated as if they were crawled by the old GoogleBot DEEP crawler. They will be stable.
2) The Twilight Set
But what of the rest? It sounds like Google may only have partial data for these, because the crawlers 'wound down' before getting the full picture. Wouldn't THAT explain some of the above?
To answer this question we need to consider Google's crawling patterns. One assumes that they broadly crawl down from high PR sites. They could also crawl down from older sites, sites they know about and sites they know both exist and are stable. That too would make sense.
You can probably see where this is heading.
If your site or its link structure is relatively new, and/or say PR5 or below, you may well reside in the twilight zone. Google will not have all the data (or all the data AT ONCE) and you will be experiencing instability.
I have sites in my observation set that enter and exit both the problem sets above (a) and (b). It's as though Google is getting the requisite data for a period and then losing some of it again. As if the twilight zone is a temporary repository, perhaps populated and over-written by regular FreshBot data.
The data most affected by this is the link data (including anchor text); Google seems to retain the cache of the site itself and certain other data. This omission would also partially explain the predominance of sub-pages: with the loss of this link data there is nothing to support the index page above those sub-pages (Google has to take each page on its stand-alone value).
IS IT A PROBLEM?
I also wonder whether Google sees all of this as a problem. I certainly do. Problem (a) is clearly visible to the searching public. They DON'T want to be presented with the links page for example when they enter a site! That is a poor search experience.
Do they see (b) as a problem? Again, I do. Sites are being filtered out when they have no duplicate content. Something isn't right. Google is omitting some outstanding sites, which will be noticeable in some cases.
The combination of (a) and (b) and perhaps other less well publicized glitches gives a clear impression of instability to anyone watching the SERPs closely (and that's a growing body of people). Together they are also disaffecting many webmasters who have slavishly followed Google's content-content-content philosophy. As I implied the other day, if following the Google content/link line gets them nowhere at all, they will seek other SEO avenues, which isn't good for Google in the long term.
WHY HAVE A TWILIGHT ZONE?
Some people speculate that there is a software flaw (the old 4-byte / 5-byte theory for URL IDs) and that consequently Google has a shortage of address space with which to store all the unique URL identifiers; a 4-byte ID can only distinguish 2^32, roughly 4.3 billion, URLs. Well... I guess that might explain why a temporary zone is appealing to Google. It could well be a device to get around that issue whilst it is being solved. Google, though, has denied this.
However, it may equally be a symptom of the algorithmic and crawler changes we have seen recently. Ditching the old DeepBot and trying to cover the web with FreshBot was a fundamental shift. It is possible that for the time being Google has given up the chase of trying to index the WHOLE web... or at least FULLY index it at once. Possibly we are still in a transitional position, with FreshBot still evolving to fully take on DeepBot's responsibilities.
If the latter is correct, then the problems above may disappear as Freshbot cranks up its activity (certainly (a)). In the future the 'wind down' may occur after 3BN, and then 4BN.... problem solved... assuming the twilight zone theory is correct.
At present though those newer (eg: 12 months+) links may be subject to ‘news’ status, and require refreshing periodically to be taken account of. When they are not fresh, the target site will struggle, and will display symptoms like sub-pages ranking higher than the index page. When they are fresh, they will recover for a time.
Certainly evidence is mounting that we have a more temporary zone in play. Perhaps problem (b) is simply an overzealous filter (very overzealous indeed!). However, problem (a) and other issues suggest a range of instability that affects some sites and not others. Those affected all seem to have the right characteristics to support the theory: relatively new link structure and/or not high PR.
The question that many will no doubt ask is: if this is correct, how long will it last? Obviously I can't answer that. All I have put forward is a proposition based upon a reasonable amount of data and information.
I must admit, I do struggle to find any other explanation for what is currently happening. Brett’s ‘algo tweak’ suggestion just doesn’t stack up against the instability, the site selection for that instability, or the non-application to longer established sites.
The above theory addresses all those, but as ever…. if anyone has a better idea, which accounts for all the symptoms I have covered (and stands up against a volume of test data), I’m all ears. Maybe GoogleGuy wishes to comment and offer a guiding hand through these turbulent times.
If I were you I would forget all about trying to control which page visitors arrive at from Google or any other SE.
I would love to agree with you, except that instead of the sub-pages showing up where the index file would have, they are showing up in the 10+ results.
Besides, we are trying to figure out what is going on. The more we know, the better chance we have at #1 results. No matter what Google does, there has to be an algo to prioritize results; it's our job to pull together and figure out what that is, with or without their help.
For one second, maybe we can assume this is all still just a dance. I'd still like to be able to place higher during the dance too :-)
upside down google world :-(
I had the same problem, but I have seen this a few times now: if you were on page 1 for your keyword and had been there for some time (four months or more), you will be back, provided you have not made any big changes.
The problem is that you can't find your site for your major keyword; you used to have a good ranking, but now you are not within the first 200 pages.
You will be back later; that's what I have seen many times now.
I hope this helps with the dancing nerves.
They stabilized a little yesterday with a lot of index pages in, and now they have stabilized a bit with a ton of topical and index pages out.
It's like Google is suffering from senile dementia and can't remember from one day to the next where the car keys are.
P.S. Stefan & steveb, try stickying me your index site; I will see if the SERPs are different from where I am (Europe), because it looks great here.
Guess I'll have to hunt for those yellow pages, oh yeah, I can find it #1 on almost every other search engine, never mind...
GoogleGuy may have addressed Esmerelda. He has not, however, addressed this behavior where the index pages are shuffled out and back in, with seemingly identical cycles for 3 weeks in a row. Quite a different thing to address.
I'm glad others are seeing the same. I follow a ton of sites, and many index pages were showing on the first page of the datacenters starting last night and are now off the map.
I am starting to believe that the data is, in fact, missing at times.
Why are some index pages missing?
1. Google's Broke? Maybe, especially recent SEO
2. Filter of most optimized keyword? Maybe
3. Filter eliminating most used keyword on page? Maybe
4. Non-underlined and/or colored hyperlink? Maybe
5. Too far ahead of 2nd place listing? Maybe
6. Failure to read a 301 Perm Redirect correctly? Quite Possibly (see the sketch below this list)
7. Fresh Bot Drop? No
8. Filter of Mouseover Hyperlinks? No
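On number 6, before pinning it on Google it is worth confirming that the redirect really answers with a 301 and not a 302 or a meta refresh. A minimal sketch, assuming a hypothetical old URL; http.client does not follow redirects, so what it prints is the raw status a crawler would see.

# Minimal sketch: check the raw status code of a redirected URL before blaming
# Google for misreading it. Host and path are hypothetical placeholders.
import http.client

conn = http.client.HTTPConnection("www.example.com")
conn.request("GET", "/old-page.html", headers={"User-Agent": "Mozilla/5.0"})
resp = conn.getresponse()

print(resp.status, resp.reason)                 # expect: 301 Moved Permanently
print("Location:", resp.getheader("Location"))  # expect: the new URL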
Thanks for all the stickies! Couple more for you:
9. Dmoz update of which domain it's pointing to? Maybe, but I can't see this causing such a drop-off on only the most competitive terms and only for the index page.
10. Filter of most optimized keyword phrase? Maybe
11. Filter eliminating keyword phrases used more than x times on a page? Maybe
Take WebmasterWorld, for example.... I wonder how many of Brett's new SE visitors find the home page first.... my guess is very few; most probably arrive in a thread somewhere.
I have some indications that this may be rolling backwards in time. Another pair of barometer sites disappeared (for the first time) today. Anyone breathing a sigh of relief believing they have escaped may be a little premature.
I do now really wonder about the absence of GG. Why no comment at all for almost a week on this topic?
Obviously you can read a number of things into that: everything from 'the TZ theory is correct and he doesn't want to give it credibility' right through to 'he has nothing to say because the index will return to relative normality soon'.
In my opinion this is getting worse, and the index is deteriorating because of it.
The sites that are starting to disappear - would you put your reputation on the line and say they are definitely spam-free?
Is it not the case that spam filters are coming in?
Unfortunately I have not got your sample base; I assume that you have now built up a large sample?
Possibly because he said that we could expect about one datacenter to update per day, but things have gone much differently. I would assume that this has been more difficult than anticipated.
Certainly, it's not the traditional update that GG said we should expect.
Absolutely. Some of those missing are not SEO'd at all, but were selected for other reasons (e.g. people I know, interesting sites on a particular topic, etc.).
When this thing first started, it wasn't too hard to find affected sites just by using some of the sites that present multiple Google searches side by side.
This is the THIRD week in a row of exactly the same thing. They start shuffling us out with 2 datacenters first, then when they get up to 5, they start adding us back in to the ones they dropped us from. Then at that point, they continue on to the rest of the datacenters. The process seems to take several days and has been ALMOST IDENTICAL for the past 3 weeks.
This week (thankfully) seems to be going along a bit faster than the rest, pointing to a Thursday conclusion instead of Friday.
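For what it's worth, this kind of watching can be scripted. A rough sketch, purely for illustration: the datacenter hostnames, the keyword and the domain are all assumptions, and automated querying may be blocked or against Google's terms, so it is only the manual check written down.

# Rough sketch of the datacenter watching described above: run the same keyword
# against several Google hostnames and note which of them return the domain at
# all right now. Hostnames, keyword and domain are assumptions for illustration.
import urllib.parse
import urllib.request

DATACENTERS = ["www.google.com", "www2.google.com", "www3.google.com"]  # assumed hosts
QUERY = "blue widgets"        # hypothetical main keyword
DOMAIN = "www.example.com"    # hypothetical site being tracked

for host in DATACENTERS:
    url = f"http://{host}/search?" + urllib.parse.urlencode({"q": QUERY, "num": "100"})
    req = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    try:
        page = urllib.request.urlopen(req, timeout=10).read().decode("latin-1", "replace")
        status = "listed" if DOMAIN in page else "MISSING"
    except OSError as err:
        status = f"error ({err})"
    print(f"{host:<20} {status}")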
We're talking a 100% spam-free PR6 website with many backlinks here, just to get that out there...
One more reason to vanish (I wrote the same thing in another thread):
My index page vanished, but my index was actually still there. :-)
Much further down the SERPs I discovered my index.php; it was my domain.de/ that had disappeared.
- domain.de/ receives all the backlinks, but was dropped
- domain.de/index.php is still in the SERPs, just much further down
Yes, by mistake Google had the same page from my domain in twice (my own mistake).
But why is the better-ranked version the one that gets filtered out?
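That sounds like a textbook duplicate situation rather than the stranger cases above: if domain.de/ and domain.de/index.php return byte-identical documents, a duplicate filter simply sees two URLs for one page and keeps one of them. A small sketch of that check, using the domain.de placeholder from the post:

# Small sketch: do the bare domain and /index.php serve byte-identical content?
# domain.de is the placeholder used in the post above; substitute your own site.
import hashlib
import urllib.request

URLS = ["http://www.domain.de/", "http://www.domain.de/index.php"]

def fingerprint(url):
    req = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    body = urllib.request.urlopen(req, timeout=10).read()
    return hashlib.md5(body).hexdigest()

digests = [fingerprint(url) for url in URLS]
for url, digest in zip(URLS, digests):
    print(digest, url)
if len(set(digests)) == 1:
    print("Both URLs serve the same bytes - a classic duplicate-content case.")

The usual remedy (assuming a typical server setup, which is outside this sketch) is a permanent 301 redirect from /index.php to /, so the backlinks and the indexed URL end up on one address; which of the two copies Google keeps in the meantime is, as you say, the puzzling part.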
A number of sites in my collection haven't returned at all for days.
I'll give this until the weekend before going back to the drawing board and working out a fresh strategy to address the post-Dom/Esm world. You can only sit and hope for so long.
Seriously, if no stability or PR calculation occurs soon, I'd expect a lot of people to start spamming just to stay ahead.
I really don't think googleguy knows anything either. All he does is re-affirm our assumptions, I'd put little faith in his cryptic remarks.
I for one gave up on him during the "dominic" update. During that timeframe he continually said "expect the data to propagate to all the servers, then expect backlinks to come in, and then expect another traditional dance".
Well, guess what, people. The data propagated, but the backlinks never really came back (or at least haven't been calculated for PR), and we aren't even sure if this is a traditional dance or not. You put your faith in this? I think I could have told you more from a fortune cookie.
I am not saying they need to give their algo away, but punishing sites for something they have not stated is a violation of First Amendment rights. I've studied media law and believe a case could be made here. Winnable or not, it could cost everyone a crapload of money. Bad PR and bleeding money is the only end goal of such a case. Hurt everyone. Cast a dark cloud.
Lies upon lies have been told here. This fiasco has been affecting webmasters for months, not weeks, in direct contradiction to GG's posts.
GG has obviously been sidelined by the tech department or the VIP department for serious reasons unknown to us. GG is not quiet for no reason. He was told to shut up. When the "Director of Communications" is told to be quiet, something is seriously wrong at Google. GG's posts are nothing but a PR band-aid. That's what he was hired to do: PR.
Oh yeah, it's time to rent a bus and have face to face time at the GooglePlex.
This post will be eliminated shortly along with my membership. I have no doubt.
This to me is a *serious* concern. Before, the idea was that if you built a solid site and played by the rules, over time you would likely do well. At the moment, it looks to me like things are largely random. Nothing is predictable. Thus, best to toss out a lot of spam and hope some of it will always make it to the top.