| This 345 message thread spans 12 pages: 345 (  2 3 4 5 6 7 8 9 ... 12 ) > > || |
|Lost Index Files|
Some people may not like this post, and criticism of it would not at all surprise me. However, I suggest that before reading it you ask yourself whether Google actually DESIRES the current fluctuating situation. Is it where they actually want to be? Is it what they want to project to webmasters and the public?
Against that background perhaps the following analysis and theory may fall into place more easily.
DATA ANALYSIS AND BACKGROUND
Last week I posted a message requesting members to sticky mail details of their own specific situations and sites with respect to the fluctuations.
After spending days analyzing, and watching the picture continue to change before my eyes, I eventually found a theory to hang my hat on. No doubt it will be challenged, but at least it currently fits the data bank I have (my own sites, plus a third party observation set I use, plus those that were submitted to me by the above).
Two general phenomena seem to be dominating the debate:
a) Index pages ranking lower than sub-pages for some sites on main keyword searches
b) Sites appearing much lower than they should on main keyword searches, yet ranking highly when the &filter=0 parameter is applied.
These problems are widespread and there is much confusion out there between the two (and some others).
The first has probably attracted most attention, no doubt because it is throwing up such obvious and glaring glitches in visible search returns (eg: contact pages appearing as the entry page to the site). The second is less visible to the searcher because it simply torpedoes the individual sites affected.
By the way, in case anyone is still unaware, the &filter=0 parameter reverses the filter which screens out duplicate content. Except it does more than that.... it is currently screening out many sites for no obvious reason (sites that are clearly clean and unique).
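For anyone wanting to compare the two result sets side by side, here is a minimal sketch that builds the same search URL with and without the &filter=0 parameter. The base URL and the example query are assumptions for illustration; only the filter=0 parameter itself comes from the discussion above.

```python
from urllib.parse import urlencode

def search_urls(query, base="https://www.google.com/search"):
    """Return (filtered, unfiltered) search URLs for the same query.

    `base` is an assumed search endpoint; the only difference between
    the two URLs is the extra filter=0 parameter.
    """
    filtered = f"{base}?{urlencode({'q': query})}"
    unfiltered = f"{base}?{urlencode({'q': query, 'filter': 0})}"
    return filtered, unfiltered

# Hypothetical keyword phrase, for illustration only
filtered, unfiltered = search_urls("blue widgets")
print(filtered)
print(unfiltered)
```

Opening both URLs and noting where a site sits in each list is one way to tell whether it falls into problem (b).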
So why is all this happening? Is there a pattern, and is there a relationship between these two and the other problems?
Well at first I wrestled with all sorts of theories. Most were shot down because I could always find a site in the data set that didn't fit the particular proposition I had in mind. I checked the obvious stuff: onsite criteria, link patterns, WHOIS data... many affected sites were simply 'clean' on anyone's interpretation.
Throughout though, there was the one constant: none of the sites affected were old (eg: more than 2 years) or at least none had old LINK structures.
This seemed ridiculous. There would be no logic to Google treating newer sites in this manner and not older ones. It is hardly likely to check the date when crawling! But the above fact was still there.
I have been toying with all sorts of ideas to resolve it... and the only one that currently makes any sense is the following.
THE GOOGLE TWILIGHT ZONE
In addition to WebmasterWorld I read a number of search blogs and portals. On one of these (GoogleWatch) a guy called Daniel Brandt quotes GoogleGuy as stating: "That is, we wind down the crawl after fetching 2B+ URLs, and the URL in question might not have been in that set of documents".
Now, assuming that is true (and it's published on the website so I would imagine it isn't invented), or even partially true, all sorts of explanations emerge.
1) The 2BN+ Set
If you are in here, as most long standing and higher PR sites will be, it is likely to be business as usual. These sites will be treated as if they were crawled by the old GoogleBot DEEP crawler. They will be stable.
2) The Twilight Set
But what of the rest? It sounds like Google may only have partial data for these, because the crawlers 'wound down' before getting the full picture. Wouldn't THAT explain some of the above?
To answer this question we need to consider Google's crawling patterns. One assumes that they broadly crawl down from high PR sites. They could also crawl down from older sites, sites they know about and sites they know both exist and are stable. That too would make sense.
You can probably see where this is heading.
If your site or its link structure is relatively new, and/or say PR5 or below, you may well reside in the twilight zone. Google will not have all the data (or all the data AT ONCE) and you will be experiencing instability.
I have sites in my observation set that enter and exit both the problem sets above (a) and (b). It's as though Google is getting the requisite data for a period and then losing some of it again. As if the twilight zone is a temporary repository, perhaps populated and over-written by regular FreshBot data.
The data most affected by this is the link data (including anchor text) – it seems to retain the cache of the site itself and certain other data. This omission would also partially explain the predominance of sub-pages, as with the loss of this link data there is nothing to support the index above those sub-pages (Google is having to take each page on totally stand-alone value).
IS IT A PROBLEM?
I also wonder whether Google sees all of this as a problem. I certainly do. Problem (a) is clearly visible to the searching public. They DON'T want to be presented with the links page for example when they enter a site! That is a poor search experience.
Do they see (b) as a problem? Again, I do. Sites are being filtered out when they have no duplicate content. Something isn't right. Google is omitting some outstanding sites, which will be noticeable in some cases.
The combination of (a) and (b) and perhaps other less well publicized glitches gives a clear impression of instability to anyone watching the SERPS closely (and that's a growing body of people). Together they are also disaffecting many webmasters who have slavishly followed their content-content-content philosophy. As I implied the other day, if following the Google content/link line gets them nowhere at all, they will seek other SEO avenues, which isn't good for Google in the long term.
WHY HAVE A TWILIGHT ZONE?
Some people speculate that there is a software flaw (the old 4 byte / 5 byte theory for URL IDs) and that consequently Google has a shortage of address space with which to store all the unique URL identifiers. Well... I guess that might explain why a temporary zone is appealing to Google. It could well be a device to get around that issue whilst it is being solved. Google though has denied this.
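To put rough numbers on the 4 byte / 5 byte speculation, here is a small arithmetic sketch. It assumes each URL gets one unsigned fixed-width integer ID, which is purely an assumption of the theory and not something Google has confirmed (they have denied it).

```python
def max_ids(num_bytes):
    """Number of distinct values an unsigned ID of num_bytes bytes can hold."""
    return 2 ** (8 * num_bytes)

# Assumed ID widths from the speculated theory, not a known Google design
print(f"4-byte IDs: {max_ids(4):,}")  # 4,294,967,296
print(f"5-byte IDs: {max_ids(5):,}")  # 1,099,511,627,776
```

Under that assumption a 4 byte ID tops out at roughly 4.3 billion URLs, while a 5 byte ID allows about 1.1 trillion.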
However, it may equally be a symptom of the algorithmic and crawler changes we have seen recently. Ditching the old DeepBot and trying to cover the web with FreshBot was a fundamental shift. It is possible that for the time being Google has given up the chase of trying to index the WHOLE web... or at least FULLY index it at once. Possibly we are still in a transit position, with FreshBot still evolving to fully take on DeepBot responsibility.
If the latter is correct, then the problems above may disappear as Freshbot cranks up its activity (certainly (a)). In the future the 'wind down' may occur after 3BN, and then 4BN.... problem solved... assuming the twilight zone theory is correct.
At present though those newer (eg: 12 months+) links may be subject to ‘news’ status, and require refreshing periodically to be taken account of. When they are not fresh, the target site will struggle, and will display symptoms like sub-pages ranking higher than the index page. When they are fresh, they will recover for a time.
Certainly evidence is mounting that we have a more temporary zone in play. Perhaps problem (b) is simply an overzealous filter (very overzealous indeed!). However, problem (a) and other issues suggest a range of instability that affects some sites and not others. Those affected all seem to have the right characteristics to support the theory: relatively new link structure and/or not high PR.
The question that many will no doubt ask is that, if this is correct…. how long will it last? Obviously I can’t answer that. All I have put forward is a proposition based upon a reasonable amount of data and information.
I must admit, I do struggle to find any other explanation for what is currently happening. Brett’s ‘algo tweak’ suggestion just doesn’t stack up against the instability, the site selection for that instability, or the non-application to longer established sites.
The above theory addresses all those, but as ever…. if anyone has a better idea, which accounts for all the symptoms I have covered (and stands up against a volume of test data), I’m all ears. Maybe GoogleGuy wishes to comment and offer a guiding hand through these turbulent times.
A very good post, well thought out and I'm sure your armour is fitted... you'll need it :)
My view is much more simple, because that's all my brain will take ... I think Google tried a different 'algo' ..it didn't work, they lost a lot of data and can't fix it.
They have old data, and with new submissions and spider results they'll have very new data...the bit in the middle has gone. It won't come back.
I think it needs a fresh start. Get all those bots working, send em out, index everything ... back it up, filter and re index.
That's my opinion anyway.
Full marks for the research you have done and for your findings. I think it will be interesting to read GG's response.
Nice post and certainly a sound theory! :)
My problem with it is that due to the sheer volume of largely unknown factors to be taken into account, it is not possible for anyone to come up with a hypothesis that's even close to the actual problem.
|Problem (a) is clearly visible to the searching public. They DON'T want to be presented with the links page for example when they enter a site! That is a poor search experience. |
I don't think users are particularly concerned about what page they enter a site from.
Yeah, walking through the women's lingerie department to get to the gents' clothes is fun, but it wastes my time. But I've entered the store for a reason so I'll go and have a look around anyway.
I think most of us assume that visitors can hit our sites from any page, so we accommodate for this in our linking structure.
Perhaps we DON'T want people coming to our sites in the middle of our sales pitch, but hey - that's the nature of the web.
|Do they see (b) as a problem? Again, I do. Sites are being filtered out when they have no duplicate content. Something isn't right. Google is omitting some outstanding sites, which will be noticeable in some cases. |
Sorry, regular searchers won't miss something that isn't there.
Granted, those searching for brand name sites who aren't able to find them may be slightly confused, but the majority of the searching public looks for services, key phrases, etc - not branding.
The only people who will notice are the site owners (regular visitors will arrive at site via type ins in most cases).
|Together they are also disaffecting many webmasters who have slavishly followed their content-content-content philosophy. As I implied the other day, if following the Google content/link line gets them nowhere at all, they will seek other SEO avenues, which isn't good for Google in the long term. |
I agree wholeheartedly. Most people only play ball when it's in their own best interests! ;)
Overall I think many people are analysing these past few weeks (or months?) way too deeply - no one has nearly enough facts to base any reasonable theory on.
Even before the changes, when we "knew" what was going on, most people struggled to make heads or tails of the algo.
Taking a sample set to base an experiment on is all very well, but you can't control the environment under which the experiment is carried out. That, coupled with so many unknown variables, makes any conclusions guesses at best.
To be honest, I don't care too much about any changes - I've only briefly read up on some of the posts and a few other sites.
Any resulting new algo or change in the current algo will just open up the ballgame for the leaders to lead again and for the followers to trail back behind! ;)
Let the games commence!
>> I don't think users are particularly concerned about what page they enter a site from. <<
As a searcher, if I'm looking for widgets I don't want to enter the widgets site on the site stats page. I don't want to see a title of 'site stats' in the Google SERPS (and I did today on at least one search for a product). I don't want to guess, assuming I do click on that page, how to actually find widgets.
I DO want to hit the proper front door navigation, designed to get me to widgets in the appropriate way. I believe most people will agree with that.
>> The only people who will notice are the site owners <<
I did actually make the point that this was less visible. Not to worry.... it is STILL an issue. Writing off great sites (lots of them) because they can't get their filters right does Google no good at all.
That's another minority they will upset. Which minority tomorrow? They should remember that a bandwagon is easy to start and difficult to stop.
>> Most people only play ball when it's in their own best interests! <<
They should not underestimate that either. They have changed the way most webmasters think and driven them to create content as a priority (no problem there). If they are not careful from here though, they will change how webmasters think again.
Another disturbing aspect is how long this is lasting.... and will it ever stop? I suggest that's something else that supports the theory.
Napoleon, your theory is consistent with my observations.
It all sounds well-reasoned, but I'm left wondering if the clues match your conclusions.
For example, since we know of a number of sites that are experiencing these phenomena, why not check site logs to see if, when, and how deeply freshdeepbot is crawling them?
If a consistent pattern can be identified, that would seem to hold evidence on whether a particular site is a part of the initial 2 billion, or in the 2B+ set. If, as GG says, the crawl winds down after 2 billion, that would suggest at least a temporal difference in spidering if not a situation where sites might not be spidered very well at all.
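The log check suggested above could be sketched roughly as follows. This assumes the server writes the common "combined" access log format and that the crawler identifies itself with "Googlebot" in its user-agent string; the sample lines are invented for illustration and the user-agent alone can be spoofed, so treat the counts as indicative only.

```python
import re
from collections import Counter

# Assumed layout: Apache-style "combined" log format
LOG_LINE = re.compile(
    r'\S+ \S+ \S+ \[(?P<day>[^:]+):[^\]]+\] '
    r'"(?:GET|HEAD) (?P<path>\S+)[^"]*" \d{3} \S+ '
    r'"[^"]*" "(?P<agent>[^"]*)"'
)

def googlebot_hits(lines):
    """Count Googlebot requests per day and collect the paths it fetched."""
    per_day = Counter()
    paths = set()
    for line in lines:
        m = LOG_LINE.match(line)
        if m and "Googlebot" in m.group("agent"):
            per_day[m.group("day")] += 1
            paths.add(m.group("path"))
    return per_day, paths

# Invented sample lines, not real log data
sample = [
    '64.68.82.1 - - [23/Jun/2003:10:00:00 +0000] "GET / HTTP/1.0" 200 5120 "-" "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"',
    '64.68.82.1 - - [23/Jun/2003:10:00:05 +0000] "GET /widgets.html HTTP/1.0" 200 2048 "-" "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"',
    '10.0.0.7 - - [23/Jun/2003:10:01:00 +0000] "GET / HTTP/1.1" 200 5120 "-" "Mozilla/4.0"',
]
per_day, paths = googlebot_hits(sample)
print(dict(per_day))
print(sorted(paths))
```

Comparing the per-day counts and the set of fetched paths between an affected site and an unaffected one would show whether crawl depth or timing really differs.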
Also, one thing puzzles me about this. Index pages would seem like the most important pages to many sites, and no doubt Google is aware of this. So it would seem that if you had incomplete data and could only get part of a site spidered and indexed properly, you might want to do that with the index page.
Since you've looked at some of these sites...are you sure there aren't any consistent patterns that might have tipped off a poorly-conceived spam filter? Anything like a common link structure, heavy keyword density, or minimal content could do it.
Also, are there any trends in when these sites were launched beyond the < 2 years thing? It seems like 2 years is just about geological time on the internet, and if Google couldn't get an accurate picture of a site in that amount of time, then it's a crappier system than I would have ever thought.
As of last night our SERPs changed from #3 to somewhere back on page 6. Some of the new top SERPs sites are relevant to the keyword search, many are not. There are quite a few new sites; it seems that most of the new sites do not have whatever problem most of us that have been around are having.
I checked the meta tags of the top 10 sites, strangely none of the top ten used the following tag:
<META NAME="ROBOTS" CONTENT="all,index,follow">
|I checked the meta tags of the top 10 sites, strangely none of the top ten used the following tag: |
<META NAME="ROBOTS" CONTENT="all,index,follow">
Nothing strange about that...I think most people realize that it's superfluous compared to no tag at all.
>> Anything like a common link structure, heavy keyword density, or minimal content... <<
If you read the original post you will see I checked all the basic stuff for patterns.
>> are there any trends in when these sites were launched beyond the < 2 years thing? <<
It doesn't appear to be the age of the site itself, but the age of the external link structure. It is link data that is not being taken account of when sites fall. That also answers your point on the incomplete data - the site itself is being captured - the links are the problem.
It's also much less than two years. I can't give a precise fix, but any threshold is certainly less than that.
|Overall I think many people are analysing these past few weeks (or months?) way too deeply - no one has nearly enough facts to base any reasonable theory on. |
I would tend to agree with that ;)
Index page fluctuating or not ranking properly? I think the part where Dominic hurt most SEOs was because of the missing backlinks. Then we have Esmeralda, where backlinks are brought into the picture. My logic says that since index pages have the most backlinks (and we are coming from a phase where we had fewer links a month before), we will see the index page fluctuate a lot. These are just my initial thoughts and these may change when I see the final results. Your theories about the 2B limit are very interesting :)
This might be a little off-topic, but...
I have a site that was getting around 80 visits to the index page per day. It is a nice site, totally clean, in all the major directories (DMOZ etc.), PR 5. Around a week ago it climbed to 130+ per day for 2 days (just the index page) and then slowly came down. Now it is around 20, or even down to 10! That sucks. Quickly looking at the stats, I see that overall visits are roughly where they were before, but I don't know if this is just a new "thing" (more hits to other pages, not the index) or because 10 new pages are in the index now. Anyway, from an SEO point of view, I definitely don't like this new no-index "algo".
Also, I still don't see some pages in the index that are over 2 months old, but some pages that are 2 weeks old or so are there. What gives?
Also, this PR 0 on interior pages is slowly getting on my nerves now. The pages are totally clean; there is NO reason for any penalty, and I am sure it isn't one.
Also, this PR 0 on tons of pages that don't actually have PR 0 is making it really difficult to get quality backlinks. Sites ask you to link to them, but they show PR 0! Even my site has PR 0 on some interior pages while others are PR 5.
I am definitely not satisfied with Google over the last few months, as you can see. Not that I want something for nothing, but it is getting a little annoying now. I have this feeling that it is all falling apart and that they are applying some "new" fixes/algos/tweaks and who knows what, instead of totally renovating their algo.
That is just my personal view. :)
Napoleon, nice post, I tend to agree. It seems my site fell into this twilight zone category.
[edited by: JonB at 3:26 pm (utc) on June 23, 2003]
|are you sure there aren't any consistent patterns that might have tipped off a poorly-conceived spam filter? |
This brings to mind questions that I haven't seen asked in the index issue analysis:
-What's the only SE that has a blog clog issue in their SERPs that can only be blamed on their current, perhaps past, method of allocating relevancy?
-What page do most blogs use to obtain and pass on sometimes-false "relevancy", and rely upon for a majority of traffic, yet is a glorified link/sitemap page hybrid? This is where Brett could be right, although I can't see Google permanently settling this issue with one big swipe unless they make index pages now "prove" themselves with age to reach their proper status.
-Where does Google stand with the BLOG tab being added? Is this their premature attempt to eliminate blog clog without giving in to adding another tab? I mean what percentage of the real world reads this stuff? In my opinion, not enough to warrant a tab of precious Google real estate.
Napoleon, were there any blog sites in your sample? An analysis of older and newer blog sites might be worthwhile.
>> Napoleon, were there any blog sites in your sample? <<
None of the blogs I looked at were affected at all. Granted it was only a handful.
>> Is this their premature attempt to eliminate blog clog without giving in to adding another tab? <<
If it was, surely some sort of pattern would be evident (other than link structure age)? And before I'm asked, yes, I checked that those sites affected did not have almost exclusively the same anchor text for their links. Some did.... but here's the bummer... some didn't.
Are most of these comments related to the majority of the datacenters or all of them? I get a big SERP difference in the 3 datacenters: ex, in, & cw.
I wonder about a couple of things...why the Update Esmeralda Part 3 thread had several pages deleted and no more replies allowed...and will GG make any comments today, as he implied in his last post of that same thread. If so, I assume it will most likely be in this thread or the Index pages lost thread, since they seem to be the most active ones at the moment. Moderators? GG? Any comments on what happened to the Update thread? I'm fairly new here, so maybe it is standard practice to do this here, but it seems odd to me...
|Any comments on what happened to the Update thread? |
Please don't get this one nuked & locked! :)
Napoleon spent some time on this post.
i wonder where this sits in the theory:
site used to be #1 of 2,000,000 for single keyword. Now index dropped from www.co.uk with sub-page appearing 3 pages down, but a search on www.co.uk set to uk pages only brings it up at number one with the same sites that were showing on a webwide search now ranking below it.
Bit off topic but just noticed on some of the datacenters tonight Google has once again dropped my index page in favour of a dynamic off-site link to my index page. The robot seems to think that the linked-to page is the page itself.
my index page is optimised for keyword1keyword2..a search brings a page optimised for keyword2keyword1...a search for keyword2keyword1 brings up a page optimised for keyword3keyword2...the more i look at it the more it looks like a seo filter.
An SEO filter makes sense.
I offered my theory a while ago that Google being manipulated so easily by anybody with a little knowledge does not look good for a company thinking of going public.
Isn't it funny we start to see these changes after they bring in "their money oriented" person to run the company and let Sergey and Larry do what they like to do, which is code and run search theories.
The crown jewel of Google is the search engine. The fact that the crown jewel has spawned so many new SEO businesses does not look good. If a person who is just starting out can figure out in a short period of time what to do to manipulate a search result to bring up their desired pages, then it does not say much for the crown jewel.
It is in fact in Google's best interest if people can not manipulate it so easily.
The problem is there are so many entrenched sites with 1000's of links that applying the filter is really going to benefit the older sites.
They would really have to re-tool the whole link structure thing to change things. Something like only counting 1 link per site would have a huge impact on things. Doing so would change the whole PR thing.
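To illustrate how much "only counting 1 link per site" could change a link count, here is a minimal sketch. It treats each distinct hostname as one site, which is an assumption made for illustration and not a description of how Google counts links; the URLs are made up.

```python
from urllib.parse import urlparse

def links_per_site(backlinks):
    """Return (raw link count, count with one link per linking hostname)."""
    raw = len(backlinks)
    unique_hosts = {urlparse(url).hostname for url in backlinks}
    return raw, len(unique_hosts)

# Hypothetical backlink list: two links from one site, one from another
sample = [
    "http://a.example/page1",
    "http://a.example/page2",
    "http://b.example/",
]
raw, unique = links_per_site(sample)
print(raw, unique)
```

A site with thousands of links from a handful of hosts would shrink dramatically under such a rule, which is why it would shake up the whole PR picture.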
Of course, I hope I'm wrong and Google is just screwed up right now.
A note about the searcher and (b). There are a finite number of relevant quality results for every query. It is not the total number of results. It is much lower. In many cases there can be very few (10-20). Missing one or two relevant pages seriously downgrades the results. Does Joe know? Individually, probably not. In the aggregate, I think so. And of course there is the here today gone tomorrow issue with searches. Users do notice this.
Sure, an SEO filter makes some sense. But what in the SEO is causing this? Let's make a list of possibilities:
1. Similar <Title> <H1 Tags> <Internal Linking>?
2. Filter of most optimized keyword?
3. Too much duplicate content?
4. Toolbar tracking?
5. Google Dance search tracking?
Problem I have, is GoogleGuy has been here to help and has always given us quite a few tidbits of very good information. Why would they do all this, and then throw up an SEO filter and not tell us why these pages are being dropped? This would be a PR nightmare.
I'm still holding out hope these index pages will be back.
I haven't looked, so I cannot say. But has anyone noticed any relationship between the number of external links on an index page and the tendency of an index page to disappear?
I'm thinking this would be a good test if you were intent on going after blogs. The A-list blogrolls typically have dozens of external links on their main page.
I think, in times like this, it pays to fall back on the basics. Personally I like to think in general terms, in my experience they stand the test of time, rather than concentrate on a specific update. With that in mind why not address some specifics with generalities ;)
>wind down the crawl after fetching 2B+ URLs
We can focus on a specific number or a principle.
The number is 2B+, I'm a simple man and that means to me more than 2B, I'm unsure if that means less than 3B or 4B or 5B.
To get to the meat; do the SE's crawl the web? The answer is 100% no, they crawl a subset of the web and make decisions on what to crawl and what not to crawl. Our work must surely involve ensuring that our pages are included in that crawl, there is a "core" set of pages, we need to make sure we are part of that "core".
>Index pages ranking lower than sub-pages
It's not a new thing, maybe new to G!, the simple answer is too much focus not enough guile. I'd be thinking about inverse document frequency a lot and asking the simple question; is your site Rocky V or Raging Bull.
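For anyone unfamiliar with inverse document frequency, here is a minimal sketch of the textbook form, idf(t) = log(N / df(t)), where N is the number of documents and df(t) is how many of them contain the term. The tiny document set is invented, and nothing here is a claim about how any search engine actually weights terms.

```python
import math

def idf(term, documents):
    """Textbook inverse document frequency: log(N / number of docs containing term)."""
    n_docs = len(documents)
    df = sum(1 for doc in documents if term in doc)
    return math.log(n_docs / df) if df else 0.0

# Invented documents, each represented as a set of words
docs = [
    {"rocky", "boxing", "sequel"},
    {"raging", "bull", "boxing"},
    {"boxing", "gloves"},
]
print(idf("boxing", docs))  # appears in every document, so it carries no weight
print(idf("bull", docs))    # appears in only one document, so it scores higher
```

The point of the measure is that a term found on every page distinguishes nothing, while a rarer term says more about the page it appears on.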
>Sites appearing much lower than they should
Time to step away, once we start to make assumptions based on past performance then we are dead. This is a fluid game, it ain't 99 no more, it's not even March 03.
>‘algo tweak’ suggestion just doesn’t stack up
The game has changed.
Bring up your home page, view source. Take a step back, then another, then 20 more. Concentrate and then start to float, go through the roof, go higher and higher, go high enough until you can almost see the entire www.
Where are you? [mappamundi.net]
>> number of external links on an index page and the tendency of an index page to disappear? <<
I looked at that Kackle.... and I would say there is no relationship.
Some of the sites hammered actually have zero external links. At the other end of the spectrum, I have sites here with plenty external links which have also been hit.
Ditto the other way round too... plenty of sites with zero and plenty with stacks totally unaffected.
I'm pretty sure it isn't an SEO filter by the way (at least the index problem isn't). If it is, it's a hell of a sophisticated one - and it doesn't work well either, because some of the barometer sites I use have no SEO whatsoever applied.
My money is well and truly on technology and missing/unstable link data. Some of the points above also edge me further along that path - it would take a real front to edge webmasters to content-content-content/link-link-link and then batter them to death with a trap.
I can't see that at all. But then again, I am sometimes called naive.... we'll find out in time I guess.
The question posed by the poster in this thread is basically the same question I asked in condensed form to GG in his question and answer post a few weeks back. He did not answer that question, or ones that may have been like it in this regard, as far as I remember when reading his answers to a batch of the questions.
I would have thought it would be one that surely would have been addressed, considering the consternation unpredictable results have been causing webmasters since this all started.
Are constantly vacillating, fluctuating results in Google now a normative phenomenon or not? Does Google think this is better for Google users, or do they want optimizers to back off free listing effort? What is it that Google is trying to accomplish with its recent changes?
Maybe GG has answered this directly somewhere in the past. If so, I have missed it.
|Two general phenomena seem to be dominating the debate: |
a) Index pages ranking lower than sub-pages for some sites on main keyword searches
b) Sites appearing much lower than they should on main keyword searches, yet ranking highly when the &filter=0 parameter is applied.
First: I could not agree with you more, this is exactly what I have been seeing since the start of the last update in mid April.
Second: This is the best post I have ever read on WW.
Third: One other thing that I have noticed, unless the site has a high PR, 6 or greater, the only sites that remained in the top 20 (for my main KW) that are PR5 and below, are single page sites, no sub pages.
Please keep in mind that this is not a current dance issue, this has been going on for over 2 full months.
From [webmasterworld.com...] (message 5), GoogleGuy writes:
|Q: For many sites, the index page seems to be buried on search terms for which logic determines they should rank highly. Is this a transient feature, like some of the other recent issues, resulting from the changeover to newer data? Or is it due to a more fundamental algorithmic change? |
A: I don’t think it’s a fundamental algorithmic change. I don’t recall hearing about any changes would bring about long-term behavior like this. I’m pretty sure that it’s more of a transient issue, and I wouldn’t be concerned about this.
When I first read this, I interpreted it as a roundabout way of saying that it was due to a bug, and that it would be fixed. But now many are speculating that it is indeed due to an algorithm change. I'm curious, does this mean that you don't believe the answer above - or do you have a different interpretation of it?
lol, NFFC - last time I saw you post that link, I rewrote an 'SEO algo' that I originally dreamt up in 2001, almost exactly two years ago.
I'd forgotten about it -> but now, floating above the ether, and being one with the web, I can see it clearly.
Become a part of the web - integrate your sites, your mind, and your intuition.
Dreams are powerful, and floating inside the web, you too can prevent 'update chaos' from impacting your bottom line.
That image link NFFC posted is quite possibly the most important image you can think of while doing any kind of optimization. More powerful than keyword analysis, and more potent than a high PageRank. :)
keep calm, they are still experimenting. My index page has just disappeared leaving my contact page as no. 1.
Google, I don't mind if people directly go to the contact page, thinking about it, I probably prefer it :-))
"just" means just an hour ago, and for the first time all data centers agree!
I decided not to take anything seriously for a long time.