
Google News Archive Forum

This 345 message thread spans 12 pages; this is page 3.
Lost Index Files

 11:10 am on Jun 23, 2003 (gmt 0)

Some people may not like this post, and criticism of it would not at all surprise me. However, I suggest that before reading it you ask yourself whether Google actually DESIRES the current fluctuating situation. Is it where they actually want to be? Is it what they want to project to webmasters and the public?

Against that background perhaps the following analysis and theory may fall into place more easily.

Last week I posted a message requesting members to sticky mail details of their own specific situations and sites with respect to the fluctuations.

After spending days analyzing, and watching the picture continue to change before my eyes, I eventually found a theory to hang my hat on. No doubt it will be challenged, but at least it currently fits the data bank I have (my own sites, plus a third party observation set I use, plus those that were submitted to me by the above).

Two general phenomena seem to be dominating the debate:

a) Index pages ranking lower than sub-pages for some sites on main keyword searches

b) Sites appearing much lower than they should on main keyword searches, yet ranking highly when the &filter=0 parameter is applied.

These problems are widespread and there is much confusion out there between the two (and some others).

The first has probably attracted most attention, no doubt because it is throwing up such obvious and glaring glitches in visible search returns (eg: contact pages appearing as the entry page to the site). The second is less visible to the searcher because it simply torpedoes the individual sites affected.

By the way, in case anyone is still unaware, the &filter=0 parameter reverses the filter which screens out duplicate content. Except it does more than that.... it is currently screening out many sites for no obvious reason (sites that are clearly clean and unique).
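For anyone wanting to compare the two result sets systematically, here is a minimal sketch. The `filter=0` parameter and the 2003-era `google.com/search` endpoint are taken from this thread; the `search_urls` and `dropped_by_filter` helpers are purely illustrative, not any official API.

```python
from urllib.parse import urlencode

# 2003-era endpoint as discussed in this thread; an assumption, not a current API
BASE = "http://www.google.com/search"

def search_urls(query):
    """Return the default (filtered) and &filter=0 (unfiltered) search URLs."""
    filtered = BASE + "?" + urlencode({"q": query})
    unfiltered = BASE + "?" + urlencode({"q": query, "filter": "0"})
    return filtered, unfiltered

def dropped_by_filter(filtered_results, unfiltered_results):
    """URLs present in the unfiltered SERP but missing from the filtered one,
    kept in their unfiltered ranking order."""
    seen = set(filtered_results)
    return [url for url in unfiltered_results if url not in seen]
```

Feeding in the top URLs copied from each version of the SERP shows exactly which sites the filter is removing; for example, `dropped_by_filter(["a.example"], ["a.example", "b.example"])` returns `["b.example"]`.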

So why is all this happening? Is there a pattern, and is there a relationship between these two and the other problems?

Well at first I wrestled with all sorts of theories. Most were shot down because I could always find a site in the data set that didn't fit the particular proposition I had in mind. I checked the obvious stuff: onsite criteria, link patterns, WHOIS data... many affected sites were simply 'clean' on anyone's interpretation.

Throughout though, there was the one constant: none of the sites affected were old (eg: more than 2 years) or at least none had old LINK structures.

This seemed ridiculous. There would be no logic to Google treating newer sites in this manner and not older ones. It is hardly likely to check the date when crawling! But the above fact was still there.

I have been toying with all sorts of ideas to resolve it... and the only one that currently makes any sense is the following.

In addition to WebmasterWorld I read a number of search blogs and portals. On one of these (GoogleWatch) a guy called Daniel Brandt quotes GoogleGuy as stating: "That is, we wind down the crawl after fetching 2B+ URLs, and the URL in question might not have been in that set of documents".

Now, assuming that is true (and it's published on the website so I would imagine it isn't invented), or even partially true, all sorts of explanations emerge.

1) The 2BN+ Set
If you are in here, as most long standing and higher PR sites will be, it is likely to be business as usual. These sites will be treated as if they were crawled by the old GoogleBot DEEP crawler. They will be stable.

2) The Twilight Set
But what of the rest? It sounds like Google may only have partial data for these, because the crawlers 'wound down' before getting the full picture. Wouldn't THAT explain some of the above?

To answer this question we need to consider Google's crawling patterns. One assumes that they broadly crawl down from high PR sites. They could also crawl down from older sites, sites they know about and sites they know both exist and are stable. That too would make sense.

You can probably see where this is heading.

If your site or its link structure is relatively new, and/or say PR5 or below, you may well reside in the twilight zone. Google will not have all the data (or all the data AT ONCE) and you will be experiencing instability.

I have sites in my observation set that enter and exit both the problem sets above (a) and (b). It's as though Google is getting the requisite data for a period and then losing some of it again. As if the twilight zone is a temporary repository, perhaps populated and over-written by regular FreshBot data.

The data most affected by this is the link data (including anchor text) – it seems to retain the cache of the site itself and certain other data. This omission would also partially explain the predominance of sub-pages: with the loss of this link data there is nothing to support the index page above those sub-pages (Google is having to take each page on totally stand-alone value).

I also wonder whether Google sees all of this as a problem. I certainly do. Problem (a) is clearly visible to the searching public. They DON'T want to be presented with the links page for example when they enter a site! That is a poor search experience.

Do they see (b) as a problem? Again, I do. Sites are being filtered out when they have no duplicate content. Something isn't right. Google is omitting some outstanding sites, which will be noticeable in some cases.

The combination of (a) and (b) and perhaps other less well publicized glitches gives a clear impression of instability to anyone watching the SERPS closely (and that's a growing body of people). Together they are also disaffecting many webmasters who have slavishly followed their content-content-content philosophy. As I implied the other day, if following the Google content/link line gets them nowhere at all, they will seek other SEO avenues, which isn't good for Google in the long term.

Some people speculate that there is a software flaw (the old 4 byte / 5 byte theory for URL IDs) and that consequently Google has a shortage of address space with which to store all the unique URL identifiers. Well... I guess that might explain why a temporary zone is appealing to Google. It could well be a device to get around that issue whilst it is being solved. Google though has denied this.

However, it may equally be a symptom of the algorithmic and crawler changes we have seen recently. Ditching the old DeepBot and trying to cover the web with FreshBot was a fundamental shift. It is possible that for the time being Google has given up the chase of trying to index the WHOLE web... or at least FULLY index it at once. Possibly we are still in a transit position, with FreshBot still evolving to fully take on DeepBot responsibility.

If the latter is correct, then the problems above may disappear as Freshbot cranks up its activity (certainly (a)). In the future the 'wind down' may occur after 3BN, and then 4BN.... problem solved... assuming the twilight zone theory is correct.

At present though those newer (eg: 12 months+) links may be subject to ‘news’ status, and require refreshing periodically to be taken account of. When they are not fresh, the target site will struggle, and will display symptoms like sub-pages ranking higher than the index page. When they are fresh, they will recover for a time.

Certainly evidence is mounting that we have a more temporary zone in play. Perhaps problem (b) is simply an overzealous filter (very overzealous indeed!). However, problem (a) and other issues suggest a range of instability that affects some sites and not others. Those affected all seem to have the right characteristics to support the theory: relatively new link structure and/or not high PR.

The question that many will no doubt ask is: if this is correct, how long will it last? Obviously I can't answer that. All I have put forward is a proposition based upon a reasonable amount of data and information.

I must admit, I do struggle to find any other explanation for what is currently happening. Brett’s ‘algo tweak’ suggestion just doesn’t stack up against the instability, the site selection for that instability, or the non-application to longer established sites.

The above theory addresses all those, but as ever…. if anyone has a better idea, which accounts for all the symptoms I have covered (and stands up against a volume of test data), I’m all ears. Maybe GoogleGuy wishes to comment and offer a guiding hand through these turbulent times.



 11:39 pm on Jun 23, 2003 (gmt 0)

Why are some index pages missing?
1. Google's Broke
2. Similar <Title> <H1 Tags> <Internal Linking>?
3. Filter of most optimized keyword?
4. Filter eliminating most used keyword on page?
5. Too much duplicate content?
6. Toolbar tracking?
7. Google Dance search tracking?
8. Non-underlined and/or colored hyperlink?
9. Too far ahead of 2nd place listing?
10. index.php?uniquecontentstring=dan

Could use some help here. Please feel free to sticky examples too. Thanks!


 11:45 pm on Jun 23, 2003 (gmt 0)

Index pages that were lost last night are not back on www
Anyone else seeing changes (again)?


 11:49 pm on Jun 23, 2003 (gmt 0)

I'm seeing changes in the IN datacentre SERPS as of 1/2 hour ago.


 11:56 pm on Jun 23, 2003 (gmt 0)

IMHO - As I interpret it so far:

Why are some index pages missing?
1. Google's Broke - Broke-ish, for sites under 4-6 months old.
2. Similar <Title> <H1 Tags> <Internal Linking> - Disagree
3. Filter of most optimized keyword? Quite Possibly
4. Filter eliminating most used keyword on page? Again Quite Possibly
5. Too much duplicate content? - Disagree
6. Toolbar tracking? Disagree
7. Google Dance search tracking? Disagree
8. Non-underlined and/or colored hyperlink? Disagree
9. Too far ahead of 2nd place listing? Disagree
10. index.php?uniquecontentstring=dan No Idea


 11:59 pm on Jun 23, 2003 (gmt 0)

Subway....I agree 100% with your HO. Exactly what I have been leaning towards myself.


 12:05 am on Jun 24, 2003 (gmt 0)

Wow, just as fast as the index page reappeared for my #1 keyword, it is gone again. This is very strange...


 12:05 am on Jun 24, 2003 (gmt 0)

>I'm seeing changes in the IN datacentre SERPS as of 1/2 hour ago.

Confirmed. -in is on the move.


 12:13 am on Jun 24, 2003 (gmt 0)

>I'm seeing changes in the IN datacentre SERPS as of 1/2 hour ago.
Confirmed. -in is on the move

Is it me or is the IN index looking more like FI with Filter=0?


 12:16 am on Jun 24, 2003 (gmt 0)

<add>Nope, it's actually looking quite different</add>!


 12:46 am on Jun 24, 2003 (gmt 0)

I'm looking at some searches that return a lowish number of results.

For searches returning 100 to 200 results, -in is showing an extra 10 to 15 results compared to other datacentres.

For searches returning around 5000 results, -in is showing ~300 to ~400 more results than other datacentres.

I think -in is having an injection of fresh data into the listings. I've seen a few entries go in from sites that started mentioning the particular search topic on their site only in the last week or so.

Could it be that we are watching the dripping in of fresh data, without fresh tags, a sort of rolling update?
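A quick way to quantify that kind of datacentre divergence is to diff the reported result counts against a baseline. A rough sketch; the datacentre names follow the thread's shorthand, and the counts are invented placeholders in the spirit of the figures above:

```python
def count_deltas(counts, baseline):
    """Difference between each datacentre's reported result count and a baseline."""
    base = counts[baseline]
    return {dc: n - base for dc, n in counts.items() if dc != baseline}

# Hypothetical counts for one query, echoing the rough figures above
counts = {"www-fi": 100, "www-cw": 102, "www-in": 113}
print(count_deltas(counts, "www-fi"))  # -in showing the extra ~10-15 results
```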


[edited by: g1smd at 12:51 am (utc) on June 24, 2003]


 12:47 am on Jun 24, 2003 (gmt 0)

I did a backlink check today, and found a site linking to me that apparently has the date/time at the top of the page. When Google crawled the page, it captured the date. April 22! Definitely old data being used in backlinks. Of course this is not new information. My site is 2 years old, and has a PR=7. I am experiencing the same problems Napolean is talking about. Subpages ranking higher than index page. Lost rankings on major keyphrases (except on ONE datacenter). NONE of the datacenters have the same info/index! Google is so broke it's ridiculous.


 12:53 am on Jun 24, 2003 (gmt 0)

-in is on the move

I can confirm it has the latest cache of my site. Haven't seen any other DC yet with cache newer than 2 months apart from IN.


 12:56 am on Jun 24, 2003 (gmt 0)

IN definitely has new data and SERPs look much better!


 12:56 am on Jun 24, 2003 (gmt 0)

Depends which sites you look at. I've seen cache results on all datacentres that are only a couple of weeks old in some cases.


 12:59 am on Jun 24, 2003 (gmt 0)

we have lots of fresh tags showing June 22 on IN
including our Index Page


 1:01 am on Jun 24, 2003 (gmt 0)

>>> I've seen cache results on all datacentres that are only a couple of weeks old in some cases.

sorry, was talking specifically about one of my problem sites.


 1:09 am on Jun 24, 2003 (gmt 0)

Thanks for the excellent post at the start and the work involved. I am relatively new to Google watching but have had a site which has been listed 4/5 PR for some time.

It may be that I can throw something new into the mix because of this site (excuse my general ignorance on the subject - but maybe this will help).

During the recent update the site was recrawled by Google, as per normal (visible by browser tag - several sessions). The big difference here is that I had just completely rebuilt the site and most of the old pages had disappeared and been replaced by redirects to the home page. The old pages were, in the change to the site, orphaned.

Despite having what is virtually a brand new site, Google is still listing the links and pages of the old site. In other words, the index has not updated the site. The PR and general ranking look pretty much as before. It's as if nothing had changed at all...

At the same time I have built the same site & structure for several other domains, but for a different geographic market. Some of these have been crawled but still nothing showing as yet?

Hope this contributes something useful to the debate.


 1:21 am on Jun 24, 2003 (gmt 0)

I just checked some very competitive phrases.

My take is that -fi looks like the same data as -in, but a spam filter would appear to be the difference. Currently, -fi gives much better results than -in for a searcher in the typically spammy competitive categories I just checked.


 1:26 am on Jun 24, 2003 (gmt 0)

>>> Currently, -fi gives much better results than -in

FI is still showing old data; I would say that so far it looks like a lot of dropped (unspammy) sites are back in IN.


 1:29 am on Jun 24, 2003 (gmt 0)

If this is a type of spam filter, I would have to say that it is a little overactive. I don't think repeating the keyword phrase in the title and the H1 should be considered spam... and that is what it appears we are getting hit (filtered) for.

Title - Widgets
H1 - Custom Blue Widgets


 1:32 am on Jun 24, 2003 (gmt 0)

I agree that the results show older sites faring much better (without good / any other reason) than newer domains / sites.

I have one particular (6-month-old) site which is incredibly well backlinked, yet it is clear, although a link: check shows a lot of this, that the benefit is not being applied right now. Whereas others (domains that are 2-plus years old) with fewer backlinks are doing much better.

Sure, there could be other factors. But I don't think so, as the older domains had an HTML remake around the same time the new domain was launched. Although in different keywords / markets, the newer site should be storming ahead relative to the older domains. It ain't.


 1:35 am on Jun 24, 2003 (gmt 0)


I'd be looking at something else if it were me (but then, mine isn't a newer-than-1-year site, so maybe that's the difference).

on my index page I'm not seeing any problem with same words in title and h1.

The phrases are not identical but everything in h1 is also in the title.

I had a brief index disappearance in April 2003, PD (pre-Dominic), but the index page came back prior to Dominic and has ridden well since, including after Esmeralda tossed on her dancing shoes.


 1:36 am on Jun 24, 2003 (gmt 0)

"3. Filter of most optimized keyword?"
"4. Filter eliminating most used keyword on page?"

These two claims make a lot of sense with what I am seeing on www-in right now, but it's showing terribly irrelevant results. I'm sure I rank high for "turquoise widgets" when it's mentioned once on my page, but the page is not about turquoise widgets; it's about "umber widgets", for which it has a horrible ranking.

They have definitely targeted my number one phrase.

To coin a new term, "Google Fatigue": the overwhelming frustration and weariness a webmaster endures while waiting for Google to produce SERPs that are not discombobulated.


 1:38 am on Jun 24, 2003 (gmt 0)

"3. Filter of most optimized keyword?"
"4. Filter eliminating most used keyword on page?"

That makes sense for some of the results. I believe Google is running this on just one of the xx algos they use to place results.


 2:05 am on Jun 24, 2003 (gmt 0)

Brett, you might as well have left the Esmeralda thread alive and just cleaned out the dross every couple of days... it's going to keep stumbling back from the grave in one thread or another.


 2:26 am on Jun 24, 2003 (gmt 0)


I would have thought stickymail a better medium rather than show a need on your part.


 2:34 am on Jun 24, 2003 (gmt 0)

My apologies if I've insulted you somehow. I thought the Esmeralda thread served a purpose and was merely commenting on the obvious. Carry on. I'm as interested as all of you in the index problem; I've been experiencing it too.


 2:35 am on Jun 24, 2003 (gmt 0)

Here is an observation for SERPS I'm watching.

Phrases where I am #1 in anchor text are showing up much lower than sites that were #1 in their anchor text prior to last year. I was #1 in both anchor text and the SERPs prior to this last update.

Sites with more older, established links are not having problems with index pages dropping.

For one search in particular, I dropped from #5 to #86 in the blink of an eye.

The results are bouncing back and forth between what they were last week and now.

Somehow, as speculated in this thread before, newer links and anchor text seem to have lost their luster.

Is this by design or by accident?

Who knows!

I'll check again in a few days and see how it looks then.
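Checking again "in a few days" is easier with a log of observed positions. A small sketch of that kind of rank diary; the keyphrase, dates, and positions are invented examples (the #5 to #86 drop mirrors the one described above):

```python
from datetime import date

def rank_change(history):
    """Change between the earliest and latest observed rank per phrase
    (positive = dropped, negative = moved up)."""
    changes = {}
    for phrase, observations in history.items():
        observations = sorted(observations)  # sort (date, position) pairs by date
        changes[phrase] = observations[-1][1] - observations[0][1]
    return changes

# Hypothetical observations: (date, position) pairs per keyphrase
history = {
    "blue widgets": [(date(2003, 6, 17), 5), (date(2003, 6, 24), 86)],
}
print(rank_change(history))  # {'blue widgets': 81}
```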

[edited by: mrguy at 2:39 am (utc) on June 24, 2003]


 2:37 am on Jun 24, 2003 (gmt 0)

I have an index page ranking #2 for allinanchor, but the subpages are listed on the regular search. When searching for the top keyword with &filter=0, my site has far more listings in the top 100 than any other site. Where is the index? I just don't understand!


 3:12 am on Jun 24, 2003 (gmt 0)

Has everybody been checking all the Google datacenters when examining their SERPS?



I have noticed a big difference in my SERPS for every datacenter I check.

It's as if each datacenter is running a different algo or running off inconsistent data.

It constantly changes on an hourly basis.


 3:31 am on Jun 24, 2003 (gmt 0)


I am experiencing different results on all of the datacenters also. I am #2 on every datacenter except cw, ex, and in, where I am #3. Can anyone explain why they are fluctuating, and is this normal? Thank you for your replies.


WebmasterWorld is a Developer Shed Community owned by Jim Boykin.
© Webmaster World 1996-2014 all rights reserved