Forum Moderators: open


Lost Index Files

         

Napoleon

11:10 am on Jun 23, 2003 (gmt 0)



Some people may not like this post, and criticism of it would not at all surprise me. However, I suggest that before reading it you ask yourself whether Google actually DESIRES the current fluctuating situation. Is it where they actually want to be? Is it what they want to project to webmasters and the public?

Against that background perhaps the following analysis and theory may fall into place more easily.

DATA ANALYSIS AND BACKGROUND
Last week I posted a message requesting members to sticky mail details of their own specific situations and sites with respect to the fluctuations.

After spending days analyzing, and watching the picture continue to change before my eyes, I eventually found a theory to hang my hat on. No doubt it will be challenged, but at least it currently fits the data bank I have (my own sites, plus a third party observation set I use, plus those that were submitted to me by the above).

Two general phenomena seem to be dominating the debate:

a) Index pages ranking lower than sub-pages for some sites on main keyword searches

b) Sites appearing much lower than they should on main keyword searches, yet ranking highly when the &filter=0 parameter is applied.

These problems are widespread and there is much confusion out there between the two (and some others).

The first has probably attracted most attention, no doubt because it is throwing up such obvious and glaring glitches in visible search returns (eg: contact pages appearing as the entry page to the site). The second is less visible to the searcher because it simply torpedoes the individual sites affected.

By the way, in case anyone is still unaware, the &filter=0 parameter reverses the filter which screens out duplicate content. Except it does more than that.... it is currently screening out many sites for no obvious reason (sites that are clearly clean and unique).
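As an editor's aside: the mechanics of that check can be sketched in a few lines. This is purely illustrative — the `filter` parameter name is from the post above, while the host, path, and query here are placeholder assumptions, not a documented API:

```python
from urllib.parse import urlencode

def google_search_url(query, unfiltered=False):
    """Build a Google results URL for comparison.

    Passing unfiltered=True appends filter=0, which (per the discussion
    above) disables the duplicate-content screening on the results.
    """
    params = {"q": query}
    if unfiltered:
        params["filter"] = "0"  # reverses the duplicate-content filter
    return "http://www.google.com/search?" + urlencode(params)

# Hypothetical query; compare the two result sets by hand:
normal = google_search_url("blue widgets")
unfiltered = google_search_url("blue widgets", unfiltered=True)
```

Running the same query both ways and diffing where a given site lands is exactly the test posters here are describing informally.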

So why is all this happening? Is there a pattern, and is there a relationship between these two and the other problems?

Well at first I wrestled with all sorts of theories. Most were shot down because I could always find a site in the data set that didn't fit the particular proposition I had in mind. I checked the obvious stuff: onsite criteria, link patterns, WHOIS data... many affected sites were simply 'clean' on anyone's interpretation.

Throughout though, there was the one constant: none of the sites affected were old (eg: more than 2 years) or at least none had old LINK structures.

This seemed ridiculous. There would be no logic to Google treating newer sites in this manner and not older ones. It is hardly likely to check the date when crawling! But the above fact was still there.

I have been toying with all sorts of ideas to resolve it... and the only one that currently makes any sense is the following.

THE GOOGLE TWILIGHT ZONE
In addition to WebmasterWorld I read a number of search blogs and portals. On one of these (GoogleWatch) a guy called Daniel Brandt quotes GoogleGuy as stating: "That is, we wind down the crawl after fetching 2B+ URLs, and the URL in question might not have been in that set of documents".

Now, assuming that is true (and it's published on the website so I would imagine it isn't invented), or even partially true, all sorts of explanations emerge.

1) The 2BN+ Set
If you are in here, as most long standing and higher PR sites will be, it is likely to be business as usual. These sites will be treated as if they were crawled by the old GoogleBot DEEP crawler. They will be stable.

2) The Twilight Set
But what of the rest? It sounds like Google may only have partial data for these, because the crawlers 'wound down' before getting the full picture. Wouldn't THAT explain some of the above?

To answer this question we need to consider Google's crawling patterns. One assumes that they broadly crawl down from high PR sites. They could also crawl down from older sites, sites they know about and sites they know both exist and are stable. That too would make sense.

You can probably see where this is heading.

If your site or its link structure is relatively new, and/or say PR5 or below, you may well reside in the twilight zone. Google will not have all the data (or all the data AT ONCE) and you will be experiencing instability.

I have sites in my observation set that enter and exit both the problem sets above (a) and (b). It's as though Google is getting the requisite data for a period and then losing some of it again. As if the twilight zone is a temporary repository, perhaps populated and over-written by regular FreshBot data.

The data most affected by this is the link data (including anchor text) – it seems to retain the cache of the site itself and certain other data. This omission would also partially explain the predominance of sub-pages: with the loss of this link data there is nothing to support the index page above those sub-pages (Google is having to take each page on totally stand-alone value).

IS IT A PROBLEM?
I also wonder whether Google sees all of this as a problem. I certainly do. Problem (a) is clearly visible to the searching public. They DON'T want to be presented with the links page for example when they enter a site! That is a poor search experience.

Do they see (b) as a problem? Again, I do. Sites are being filtered out when they have no duplicate content. Something isn't right. Google is omitting some outstanding sites, which will be noticeable in some cases.

The combination of (a) and (b) and perhaps other less well publicized glitches gives a clear impression of instability to anyone watching the SERPS closely (and that's a growing body of people). Together they are also disaffecting many webmasters who have slavishly followed their content-content-content philosophy. As I implied the other day, if following the Google content/link line gets them nowhere at all, they will seek other SEO avenues, which isn't good for Google in the long term.

WHY HAVE A TWILIGHT ZONE?
Some people speculate that there is a software flaw (the old 4 byte / 5 byte theory for URL IDs) and that consequently Google has a shortage of address space with which to store all the unique URL identifiers. Well... I guess that might explain why a temporary zone is appealing to Google. It could well be a device to get around that issue whilst it is being solved. Google though has denied this.

However, it may equally be a symptom of the algorithmic and crawler changes we have seen recently. Ditching the old DeepBot and trying to cover the web with FreshBot was a fundamental shift. It is possible that for the time being Google has given up the chase of trying to index the WHOLE web... or at least FULLY index it at once. Possibly we are still in a transit position, with FreshBot still evolving to fully take on DeepBot responsibility.

If the latter is correct, then the problems above may disappear as Freshbot cranks up its activity (certainly (a)). In the future the 'wind down' may occur after 3BN, and then 4BN.... problem solved... assuming the twilight zone theory is correct.

At present though those newer (eg: 12 months+) links may be subject to ‘news’ status, and require refreshing periodically to be taken account of. When they are not fresh, the target site will struggle, and will display symptoms like sub-pages ranking higher than the index page. When they are fresh, they will recover for a time.

VERDICT?
Certainly evidence is mounting that we have a more temporary zone in play. Perhaps problem (b) is simply an overzealous filter (very overzealous indeed!). However, problem (a) and other issues suggest a range of instability that affects some sites and not others. Those affected all seem to have the right characteristics to support the theory: relatively new link structure and/or not high PR.

The question that many will no doubt ask is: if this is correct, how long will it last? Obviously I can't answer that. All I have put forward is a proposition based upon a reasonable amount of data and information.

I must admit, I do struggle to find any other explanation for what is currently happening. Brett’s ‘algo tweak’ suggestion just doesn’t stack up against the instability, the site selection for that instability, or the non-application to longer established sites.

The above theory addresses all those, but as ever…. if anyone has a better idea, which accounts for all the symptoms I have covered (and stands up against a volume of test data), I’m all ears. Maybe GoogleGuy wishes to comment and offer a guiding hand through these turbulent times.

Marval

10:43 am on Jun 26, 2003 (gmt 0)

10+ Year Member



Dolemite.. I guess I will have to disagree that everything is solid... I'm seeing completely different results on www2 and www3 than on www, and 4 datacenters are different from the others... as far as I can see, either we are in a rolling update mode, or the update has not completed. I am also seeing old results in the Directory.
I still think it's a little too early to call this one finished.
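The datacenter comparison Marval describes can be made concrete with a small diff routine. A minimal sketch — the two result lists below are invented placeholders standing in for the top results scraped from two datacenters (www vs. www2):

```python
def serp_diff(serp_a, serp_b):
    """Compare two ranked result lists; report unique entries and rank shifts."""
    set_a, set_b = set(serp_a), set(serp_b)
    overlap = set_a & set_b
    # Positive shift = the URL ranks lower (further down) on the second list.
    shifts = {url: serp_b.index(url) - serp_a.index(url) for url in overlap}
    return {
        "only_a": [u for u in serp_a if u not in set_b],
        "only_b": [u for u in serp_b if u not in set_a],
        "shifts": shifts,
    }

# Hypothetical top-5 results from two datacenters:
www  = ["a.com", "b.com", "c.com", "d.com", "e.com"]
www2 = ["b.com", "a.com", "c.com", "f.com", "g.com"]
diff = serp_diff(www, www2)
```

If the `only_a`/`only_b` sets stay large over repeated checks, that is the "rolling update / not finished" signal being debated in this thread.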

James_Dale

10:51 am on Jun 26, 2003 (gmt 0)

10+ Year Member



SERPS seem pretty good to me. Everything has evened out across the phrases I am checking. Top 10 sites are all high quality resources, usual market leaders in place with a couple of shiftovers in positions, but nothing dramatic. I think the update is finishing.

How many other people here are seeing good results but scared of getting flamed? :)

Dolemite

11:07 am on Jun 26, 2003 (gmt 0)

10+ Year Member



I guess I will have to disagree that everything is solid.

No...I wouldn't say everything is solid. I will say that the first page (top 10) for the keyphrase I check most often is almost the same across all the DCs. To me that says that the PR calculations for established sites are finished/all current relative to Esmerelda, or depending on the mechanics of the system, the dissemination of the calculated PR data is complete.

That said, I do think we're in some kind of constant update mode. It may be that the updates of adding new sites/pages just aren't affecting things at the most competitive levels. I'm not sure yet whether the constant update concept will extend to continuous recalculations of PR and ever-changing SERPs, or just better/faster ways of adding new sites/pages and determining the relevance of modified pages.

James_Dale

11:13 am on Jun 26, 2003 (gmt 0)

10+ Year Member



Dolemite, I agree. The top-listed sites seem to be stable, quality resources right now.

Probably outside the top 10 or 20 results, the quality (and stability) is tapering off a bit, but then I don't expect to find great sites buried that deep, and it seems natural that less established sites would be moving around if the 'constant update' theory is correct.

Frankly, I like the results I'm seeing now. But then, I'm in the UK - things may well look different in the US.

Brett_Tabke

11:24 am on Jun 26, 2003 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



Why does everyone always forget se history?

This same thread has been running in SE circles for 5+ years. Every time an SE changes its algo, everyone thinks the sky is falling.

This is the same discussion we had about Infoseek in 97, Excite 98, Alta 99 when they dropped keywords in titles, dashes in domain names, or h1 headers here and there for a few updates.

killroy

11:28 am on Jun 26, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Well said Brett.

SN

Dolemite

11:28 am on Jun 26, 2003 (gmt 0)

10+ Year Member



Brett,

Keep in mind a lot of us are new to this whole thing and we need folks like you to remind us what's par for the course.

OTOH, when you see a site that you've labored over and played fair with inexplicably drop off the map, the sky has fallen...there's no two ways about it.

Dayo_UK

11:28 am on Jun 26, 2003 (gmt 0)



Yes, Brett and where are they now? ;)

and of course a lot of us would not have been involved with websites back then - I did not even own a computer until 2001 :)

James_Dale

11:30 am on Jun 26, 2003 (gmt 0)

10+ Year Member



Brett, I agree. The sites I'm seeing towards the top of the SERPs are good quality sites. Those lower down are lower quality sites (regardless of sheer numbers of backlinks, high PR, multiple directory listings, etc)

PR has much less relevance now, from what I can see. Content is being rewarded much more than PR. Looks like an improvement to me.

merlin30

11:32 am on Jun 26, 2003 (gmt 0)

10+ Year Member



Because only through discussions, arguments and counters, will any sort of understanding of the unknown take place. No doubt this thread will run for another 5+ years.

Dolemite

11:33 am on Jun 26, 2003 (gmt 0)

10+ Year Member



Content is being rewarded much more than PR.

Now that I can't get behind...otherwise we wouldn't see these glorified site maps doing so well.

fancy list of links != content

James_Dale

11:43 am on Jun 26, 2003 (gmt 0)

10+ Year Member



Ag, well, it was only a guess :)

zeus

11:55 am on Jun 26, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Yesterday my index site was there on all the main keywords like it used to be; today it is gone on www3. But I don't panic, because it will be back for sure, and on www3 where I'm gone there is only a fresh 25 June date on the site, so I just think that my fresh site (25 June) has not been passed around yet.

That explanation could maybe help others here.

zeus

Marval

11:57 am on Jun 26, 2003 (gmt 0)

10+ Year Member



Dolemite... the different results I'm seeing for keyword phrases are in the top 10 and for very competitive searches.
There are some very good quality results in both versions of the indexes, sites that have good content and have been around for 6 or 7 years. They are very different though, and trying to figure out a reason, the only thing I can think is that the update is not over... algo different on some DCs, I believe.
Probably not filters, as I'd be willing to say all of the sites in the top 20 are comparatively clean.
The only anomaly I think I posted about yesterday is that the cache size of the pages is being reported incorrectly on one set of results, whereas the other set seems to have the cache sizes correct. The cache size that is incorrect on some of the sites (mine is only one of many) is a page size from 4 months ago. The cached pages on both versions are identical, so it's not a fresh-results effect.

Brett.. I don't see a big algo change from here; the only big differences this update are the rolling effect and the lack of a real deep crawl so far... a lot of bots looking for new pages but no reindexing of older pages that have changed.

Napoleon

12:34 pm on Jun 26, 2003 (gmt 0)



>> Why does everyone always forget se history? <<

They don't. You are assuming again that it is 100% certain this is an algo change. Has Google told you that?

Has it been proved?

I think it hasn't, and if you actually read the thread and check the centers you will find plenty of evidence to suggest it isn't. Changing data and swapping it in and out is NOT an algo change.

Surely an algo change would be consistent across the centers by now? One would also expect it to affect all sites the same, regardless of age. One would also expect GG's comments (suggesting it isn't an algo change) to closely reflect the real situation (quoted somewhere above).

I don't understand why you cannot see why there is doubt, and why you wish threads like this, which admittedly after the first 20 messages or so cover the symptoms, to cease.

Sure, it may be an algo change, and of course some webmasters will be making changes to some sites to explore that avenue. But the case is far from proved and just stating that it IS an algo change doesn't make it one and isn't convincing.

Why not post the detailed rationale of why you think it is an algo change (also defining what you mean by algo change)? Why not throw in your assessment of what that change actually is?

How about it?

skipfactor

1:00 pm on Jun 26, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



This same thread has been running in se circles for 5+ years...This is the same discussion we had about Infoseek in 97, Excite 98, Alta 99

You mean to tell me those SEs Lost Index Files for over 2 months before they went belly-up? Now we're getting somewhere.

Stefan

1:11 pm on Jun 26, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Oy, Dolemite and the others who gave me some feedback on my index page problem, many thanks.

Dolemite, you're right. I'm telling those two directories to get me unlisted today. I didn't even know the mirrors existed until I found about &filter=0.

S

Marval

1:32 pm on Jun 26, 2003 (gmt 0)

10+ Year Member



Just a thought...for those that are seeing the rock solid results...how competitive are the categories and what percentage of the results would be involved if "filters" were being tested?

<added> and do you see differences on www2 and 3?

Total Paranoia

1:43 pm on Jun 26, 2003 (gmt 0)

10+ Year Member



I am noticing more stable, solid and matching results throughout all datacenters for one word searches whilst multiple phrases still seem to be all over the place.

Any one else seeing this?

Ltribe

1:49 pm on Jun 26, 2003 (gmt 0)

10+ Year Member



I continue to agree with Napoleon.

The latest changes which began on -fi look better, but there are still significant index pages missing.

There's no consistent pattern to suggest a change in the algorithm. It's like two or three things are happening and they're being switched in and out.

I think suggestions that index pages are being filtered out are highly premature, and of very doubtful value. Perhaps it looks that way to an overstressed SEO, but there are still lots of index pages out there that have been left alone.

This smells like a glitch.

Alphawolf

2:56 pm on Jun 26, 2003 (gmt 0)

10+ Year Member



Didn't read the whole thread. :)

I am seeing my index page go in and out of the results.

But every time it goes in (at one point it was in 3 data centers), it seems to rank a bit better.

#5 if the index sticks on several DCs.

Usually GG tells us what the scoop is, but he has been pretty quiet lately.

To me that somewhat indicates things aren't done cooking and he'd rather not pop in here to speculate.

The DC's are playing index page Ping Pong. All I can say is:

PRINTSCREEN when you see a good result. ;)

AW

Marval

3:07 pm on Jun 26, 2003 (gmt 0)

10+ Year Member



Total Paranoia... I'm seeing single-word results doing the same movement... in results of 202 mil, I'm seeing a site bounce by between 4 and 8 pages, whereas a single keyword with really uncompetitive results (1 or 2 mil results) will only move one or two pages. The multiple-keyword results in the 5 mil SERP range are doing the same 4-8 page bounce, and again in the smaller results just doing the 1-2 pages.
And still seeing major differences between www2, www3 and the www results (of course those are dependent on the DC).
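The "bounce in pages" metric Marval uses is just arithmetic on observed rank positions (10 results per page). A small sketch, with invented observation data, shows how to turn a series of ranks into the page-bounce figure being quoted:

```python
def page_of(rank, per_page=10):
    """Result page a given rank appears on (ranks 1-10 -> page 1, etc.)."""
    return (rank - 1) // per_page + 1

def page_bounce(ranks, per_page=10):
    """Spread, in result pages, across a series of observed rankings."""
    pages = [page_of(r, per_page) for r in ranks]
    return max(pages) - min(pages)

# Hypothetical ranks for one site over several checks of the same query:
observed = [14, 52, 27, 88]
bounce = page_bounce(observed)  # pages 2, 6, 3, 9 -> a bounce of 7 pages
```

A site oscillating 4-8 pages on competitive terms versus 1-2 pages on uncompetitive ones, measured this way, is the asymmetry the post describes.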

swerve

3:11 pm on Jun 26, 2003 (gmt 0)

10+ Year Member



Powdork writes:
...but I only disappear when i get the actual fresh tag. Are your experiences similar?

I am seeing this too. One of my sites has improved compared to previous indices, from 50ish to 16. But, like many others, every once in a while it just disappears completely (I stopped looking after 10 pages deep). After reading Powdork's comment above, I have kept a closer eye on it. Sure enough, today the site has disappeared - but when I do a search for "domain.com", the site appears with a Fresh date of June 25. The site (home page) has not changed - could this be a 'non-fresh' demotion? Could this be related to the lost index issue? Does anyone have fresh dates on their buried index pages?

crobb305

3:16 pm on Jun 26, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



...but I only disappear when i get the actual fresh tag. Are your experiences similar?

I am noticing something similar, but my site doesn't disappear completely. Rather, when I get the fresh tag, my rankings fall two or three pages for primary keyphrases. When the fresh tag disappears, my rankings go back to the top. Very Strange. Incidentally, my site is very fresh. Changed significantly over the weekend, and today's fresh tag is June 25.

my3cents

3:20 pm on Jun 26, 2003 (gmt 0)

10+ Year Member



Nobody is getting flamed for noting that they are not having the same problem as those of us who seem to have mysteriously lost index rankings. This is a thread about lost index pages, index pages that are in the G index with several different paths and PR divided, filter issues, etc. While it's good to know that some areas were not affected, and we already know that most sites were not affected, do you think we're all wrong for wanting to try and find answers?

Brett: sometimes I can't tell the difference when you are throwing water on the fire or gasoline.... lol

I have seen some improvement over the last 24 hours in one aspect, the completely blank pages and completely off topic pages that were showing up in the top 10 are dropping towards 20-30 now.

The different paths to the index pages seem to be losing more ranking too though.

Napoleon, are you seeing this with your set?

drewls

3:24 pm on Jun 26, 2003 (gmt 0)

10+ Year Member



I'm noticing this too with the fresh tags. Only on multiple keyword searches. That's a large part of what leads me to believe this is a bug and not an algo change.

Welp folks, Google has finally devalued the useless multiple keyword search, in addition to banning those useless index pages. I mean, really, what's the point of it anyway?

I've been saying for years that a good 'contact us' page is where it's at... :p

my3cents

3:31 pm on Jun 26, 2003 (gmt 0)

10+ Year Member



"This is the same discussion we had about Infoseek in 97, Excite 98, Alta 99"

That's 4 years ago; it's 2003. I think there is some data in this thread that may be helpful and a little fresher.

: )

Anon27

3:48 pm on Jun 26, 2003 (gmt 0)

10+ Year Member



Napoleon:

As I stated very early on in this thread, I am in total agreement with you.

I would like to now add that my missing index page (gone since mid-April) is now dancing back in (and in a respectable position) on FI.

Let’s hope this is a move in the right direction.

dvduval

3:54 pm on Jun 26, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



One of my sites had the index page out of the SERPS for 2-3 weeks. As of today, it's back in on all datacenters and ranking as expected. Woohoo!
There is hope...

customdy

4:01 pm on Jun 26, 2003 (gmt 0)

10+ Year Member



Yes, but will it be there 5 minutes from now? : )