
Google News Archive Forum

This 345 message thread spans 12 pages; this is page 7.
Lost Index Files

 11:10 am on Jun 23, 2003 (gmt 0)

Some people may not like this post, and criticism of it would not at all surprise me. However, I suggest that before reading it you ask yourself whether Google actually DESIRES the current fluctuating situation. Is it where they actually want to be? Is it what they want to project to webmasters and the public?

Against that background perhaps the following analysis and theory may fall into place more easily.

Last week I posted a message requesting members to sticky mail details of their own specific situations and sites with respect to the fluctuations.

After spending days analyzing, and watching the picture continue to change before my eyes, I eventually found a theory to hang my hat on. No doubt it will be challenged, but at least it currently fits the data bank I have (my own sites, plus a third party observation set I use, plus those that were submitted to me by the above).

Two general phenomena seem to be dominating the debate:

a) Index pages ranking lower than sub-pages for some sites on main keyword searches

b) Sites appearing much lower than they should on main keyword searches, yet ranking highly when the &filter=0 parameter is applied.

These problems are widespread and there is much confusion out there between the two (and some others).

The first has probably attracted most attention, no doubt because it is throwing up such obvious and glaring glitches in visible search returns (eg: contact pages appearing as the entry page to the site). The second is less visible to the searcher because it simply torpedoes the individual sites affected.

By the way, in case anyone is still unaware, the &filter=0 parameter reverses the filter which screens out duplicate content. Except that filter does more than that.... it is currently screening out many sites for no obvious reason (sites that are clearly clean and unique).
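To make the parameter concrete, here is a small illustration of the two URLs being compared. This is purely illustrative; "widget" is a placeholder keyword, not from the original post, and only the &filter=0 parameter itself comes from the discussion above.

```python
# Illustrative only: the same 2003-era Google query with and without the
# duplicate-content filter disabled, as described above.
from urllib.parse import urlencode

base = "http://www.google.com/search"
query = {"q": "widget"}  # placeholder keyword

filtered_url = base + "?" + urlencode(query)                       # normal SERP
unfiltered_url = base + "?" + urlencode({**query, "filter": "0"})  # filter reversed

print(filtered_url)    # http://www.google.com/search?q=widget
print(unfiltered_url)  # http://www.google.com/search?q=widget&filter=0
```

Sites that rank well on the second URL but poorly on the first are the ones the poster describes as being "torpedoed" by the filter.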

So why is all this happening? Is there a pattern, and is there a relationship between these two and the other problems?

Well at first I wrestled with all sorts of theories. Most were shot down because I could always find a site in the data set that didn't fit the particular proposition I had in mind. I checked the obvious stuff: onsite criteria, link patterns, WHOIS data... many affected sites were simply 'clean' on anyone's interpretation.

Throughout though, there was the one constant: none of the sites affected were old (eg: more than 2 years) or at least none had old LINK structures.

This seemed ridiculous. There would be no logic to Google treating newer sites in this manner and not older ones. It is hardly likely to check the date when crawling! But the above fact was still there.

I have been toying with all sorts of ideas to resolve it... and the only one that currently makes any sense is the following.

In addition to WebmasterWorld I read a number of search blogs and portals. On one of these (GoogleWatch) a guy called Daniel Brandt quotes GoogleGuy as stating: "That is, we wind down the crawl after fetching 2B+ URLs, and the URL in question might not have been in that set of documents".

Now, assuming that is true (and it's published on the website so I would imagine it isn't invented), or even partially true, all sorts of explanations emerge.

1) The 2BN+ Set
If you are in here, as most long standing and higher PR sites will be, it is likely to be business as usual. These sites will be treated as if they were crawled by the old GoogleBot DEEP crawler. They will be stable.

2) The Twilight Set
But what of the rest? It sounds like Google may only have partial data for these, because the crawlers 'wound down' before getting the full picture. Wouldn't THAT explain some of the above?

To answer this question we need to consider Google's crawling patterns. One assumes that they broadly crawl down from high PR sites. They could also crawl down from older sites, sites they know about and sites they know both exist and are stable. That too would make sense.
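The "crawl down from high PR, then wind down" idea can be sketched as a priority-queue crawl with a fetch budget. This is my assumption of the shape of such a crawler, not Google's actual code; all names and the priority-inheritance rule are hypothetical.

```python
# A minimal sketch (an assumption, not Google's implementation) of a crawl
# that fetches high-PR pages first and "winds down" after a budget.
# Anything never fetched lands in the hypothesized "twilight" set.
import heapq

def budgeted_crawl(seeds, links, budget):
    """seeds: {url: pagerank}; links: {url: [outlinks]}; budget: max fetches."""
    # max-heap via negated PR: highest-PR URLs are fetched first
    frontier = [(-pr, url) for url, pr in seeds.items()]
    heapq.heapify(frontier)
    crawled, seen = [], set(u for _, u in frontier)
    while frontier and len(crawled) < budget:
        neg_pr, url = heapq.heappop(frontier)
        crawled.append(url)
        for out in links.get(url, []):
            if out not in seen:
                seen.add(out)
                # discovered pages inherit a reduced priority (assumption)
                heapq.heappush(frontier, (neg_pr / 2, out))
    return crawled  # URLs never fetched are in the "twilight" zone

pages = budgeted_crawl({"hub.example": 7.0, "new-site.example": 2.0},
                       {"hub.example": ["a.example", "b.example"]},
                       budget=3)
print(pages)  # the low-PR new-site.example never gets fetched
```

Under this sketch, a newer or lower-PR site misses the budget cutoff some crawls and makes it in others, which would produce exactly the in-and-out instability described in this thread.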

You can probably see where this is heading.

If your site or its link structure is relatively new, and/or say PR5 or below, you may well reside in the twilight zone. Google will not have all the data (or all the data AT ONCE) and you will be experiencing instability.

I have sites in my observation set that enter and exit both the problem sets above (a) and (b). It's as though Google is getting the requisite data for a period and then losing some of it again. As if the twilight zone is a temporary repository, perhaps populated and over-written by regular FreshBot data.

The data most affected by this is the link data (including anchor text); it seems to retain the cache of the site itself and certain other data. This omission would also partially explain the predominance of sub-pages: with the loss of this link data there is nothing to support the index page above those sub-pages (Google has to take each page on totally stand-alone value).

I also wonder whether Google sees all of this as a problem. I certainly do. Problem (a) is clearly visible to the searching public. They DON'T want to be presented with the links page for example when they enter a site! That is a poor search experience.

Do they see (b) as a problem? Again, I do. Sites are being filtered out when they have no duplicate content. Something isn't right. Google is omitting some outstanding sites, which will be noticeable in some cases.

The combination of (a) and (b) and perhaps other less well publicized glitches gives a clear impression of instability to anyone watching the SERPS closely (and that's a growing body of people). Together they are also disaffecting many webmasters who have slavishly followed their content-content-content philosophy. As I implied the other day, if following the Google content/link line gets them nowhere at all, they will seek other SEO avenues, which isn't good for Google in the long term.

Some people speculate that there is a software flaw (the old 4 byte / 5 byte theory for URL IDs) and that consequently Google has a shortage of address space with which to store all the unique URL identifiers. Well... I guess that might explain why a temporary zone is appealing to Google. It could well be a device to get around that issue whilst it is being solved. Google though has denied this.
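The "4 byte / 5 byte" speculation is easy to put in numbers. The arithmetic below is just an illustration of why a 4-byte document ID would be a plausible ceiling; the claim that Google actually used 32-bit IDs is the thread's speculation, not an established fact.

```python
# The "4 byte" theory in numbers: an unsigned 32-bit docid can address at
# most 2**32 URLs (~4.29BN), and a signed one only 2**31 (~2.15BN), which
# is close to the "2B+" crawl figure quoted earlier in the thread.
unsigned_32 = 2 ** 32
signed_32 = 2 ** 31

print(f"unsigned 32-bit IDs: {unsigned_32:,}")  # 4,294,967,296
print(f"signed 32-bit IDs:   {signed_32:,}")    # 2,147,483,648
```

If IDs really were signed 32-bit integers, a hard stop a little past 2 billion URLs would fall out of the data structure rather than any crawling policy, which is why the address-space theory keeps resurfacing.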

However, it may equally be a symptom of the algorithmic and crawler changes we have seen recently. Ditching the old DeepBot and trying to cover the web with FreshBot was a fundamental shift. It is possible that for the time being Google has given up the chase of trying to index the WHOLE web... or at least FULLY index it at once. Possibly we are still in a transit position, with FreshBot still evolving to fully take on DeepBot responsibility.

If the latter is correct, then the problems above may disappear as Freshbot cranks up its activity (certainly (a)). In the future the 'wind down' may occur after 3BN, and then 4BN.... problem solved... assuming the twilight zone theory is correct.

At present though those newer (eg: 12 months+) links may be subject to ‘news’ status, and require refreshing periodically to be taken account of. When they are not fresh, the target site will struggle, and will display symptoms like sub-pages ranking higher than the index page. When they are fresh, they will recover for a time.

Certainly evidence is mounting that we have a more temporary zone in play. Perhaps problem (b) is simply an overzealous filter (very overzealous indeed!). However, problem (a) and other issues suggest a range of instability that affects some sites and not others. Those affected all seem to have the right characteristics to support the theory: relatively new link structure and/or not high PR.

The question that many will no doubt ask is: if this is correct…. how long will it last? Obviously I can’t answer that. All I have put forward is a proposition based upon a reasonable amount of data and information.

I must admit, I do struggle to find any other explanation for what is currently happening. Brett’s ‘algo tweak’ suggestion just doesn’t stack up against the instability, the site selection for that instability, or the non-application to longer established sites.

The above theory addresses all those, but as ever…. if anyone has a better idea, which accounts for all the symptoms I have covered (and stands up against a volume of test data), I’m all ears. Maybe GoogleGuy wishes to comment and offer a guiding hand through these turbulent times.



 6:46 am on Jun 25, 2003 (gmt 0)

Percentages wrote:
If I were you I would forget all about trying to control where visitors arrive from Google or any other SE.

I would love to agree with you, except that instead of the sub-pages showing up where the index file would have, they are showing up in the 10+ results.

Besides, we are trying to figure out what is going on. The more we know, the better chance we have at #1 results. No matter what Google does, there has to be an algo to prioritize results - it's our job to pull together and figure out what that is, with or without their help.

For one second, maybe we can assume this is all still just a dance. I'd still like to be able to place higher during the dance too :-)


 7:04 am on Jun 25, 2003 (gmt 0)

Besides the problems talked about in this thread there is another big problem which is not solved yet:
Guestbook entries, guestbook spammers, and their continuing number one positions.
As time goes by more and more people realize that guestbook spamming works and do it. SERPS get worse and worse.
For a major keyword there is a site which has been at Nr. 2 for about 5 days and has 350 backlinks from guestbooks. The title and the Google cache of this domain still show the standard template of the provider (like: this is a new webpage, no contents have been uploaded).
This means that this guy has made it to Nr. 2 for a competitive search term within a few days, while those of us who use legal methods to optimize our domains fall down the SERPS more and more.

upside down google world :-(



 10:49 am on Jun 25, 2003 (gmt 0)

I have seen that many have said they were gone from their good stable rankings for some main keywords, where they used to be on page one between No. 1-8.

I had the same problem, but I have seen this a few times now: if you are placed on page 1 for your keyword and have been there for some time, 4 months or more, you will be back if you have not made any big changes.

The problem: you can't find your site for your major keyword, and you used to have some good rankings, but now you are not within the first 200 pages.
Then you will be back later; that's what I have seen many times now.

I hope this helps with the dance nerves.



 11:41 am on Jun 25, 2003 (gmt 0)

It looks like the results are a little more stable now on www, so I think we will see a completion of the update within 2 days.



 12:00 pm on Jun 25, 2003 (gmt 0)

They stabilized a little yesterday with a lot of index pages in, and now they have stabilized a bit with a ton of topical and index pages out. This is like the worst of the bad back at the beginning. I am expecting though that it will slop around again to something more sensible.... maybe even five minutes from now.


 12:11 pm on Jun 25, 2003 (gmt 0)

Zeus, good point. We've all witnessed the same behavior with Google over the past 7 days or so... and the www results do seem to be settling. Only the -sj results are way different than the other centers.


 12:26 pm on Jun 25, 2003 (gmt 0)

They stabilized a little yesterday with a lot of index pages in, and now they have stabilized a bit with a ton of topical and index pages out

Exactly my case. Yesterday my index was #2, after the main page at #1, right across all 9 dc's; today the index is buried somewhere again with my info page at #2, across all 9 dc's.

It's like Google is suffering from senile dementia and can't remember from one day to the next where the car keys are.


 12:28 pm on Jun 25, 2003 (gmt 0)

I do think we are seeing some of the best results now on www in a long time for different keywords and sections of the net. The index pages are also in place for many sites; it looks good right now.


P.s. Stefan & steveb, try sticky-mailing me your index site and I will see if there's a different SERP from where I am, because it looks great here. Europe


 1:38 pm on Jun 25, 2003 (gmt 0)

I am trying to find the website for a police station, <sarcasm>I know it was there before, but these results are so good, it's gone</sarcasm>. The top serps though bring me to an old domain that I think used to link to them.

Guess I'll have to hunt for those yellow pages, oh yeah, I can find it #1 on almost every other search engine, never mind...


 2:07 pm on Jun 25, 2003 (gmt 0)

I'm sorry, but I'm not seeing anything stable. They seem to have sped up the shuffling that they are doing, but it's still the same ole thing that's been going on for 3 weeks.

GoogleGuy may have addressed Esmerelda. He has not however addressed this behavior where the index pages are shuffled out and back in, with seemingly identical cycles for 3 weeks in a row. Quite a different thing to address.


 2:11 pm on Jun 25, 2003 (gmt 0)

<<Exactly my case. Yesterday my index was #2, after the main page at #1, right across all 9 dc's; today the index is buried somewhere again with my info page at #2, across all 9 dc's. >>

I'm glad others are seeing the same. I follow a ton of sites, and many index pages were showing on the first page of the datacenters starting last night and are now off the map.

I am starting to believe that the data is, in fact, missing at times.


 2:26 pm on Jun 25, 2003 (gmt 0)

The index on www seems to be relatively stable and has, in my opinion, good quality, although the indexes on www-fi and -ab are a bit better I think.


 2:36 pm on Jun 25, 2003 (gmt 0)

Grrrr...it looks like I'm in same boat with y'all on the missing index problem.

Only on -ex, though...and adding &filter=0 puts me right back where I should be.

As long as this index doesn't propagate, or if I magically reappear, I'll be happy.


 2:47 pm on Jun 25, 2003 (gmt 0)

Why are some index pages missing?
1. Google's Broke? Maybe, especially recent SEO
2. Filter of most optimized keyword? Maybe
3. Filter eliminating most used keyword on page? Maybe
4. Non-underlined and/or colored hyperlink? Maybe
5. Too far ahead of 2nd place listing? Maybe
6. Failure to read 301 Perm Redirect correctly? Quite Possibly
7. Fresh Bot Drop? No
8. Filter of Mouseover Hyperlinks? No

Thanks for all the stickies! Couple more for you:

9. Dmoz update of which domain its pointing to? Maybe, but I can't see this causing such a drop off of only most competitive terms and only for index page.
10. Filter of most optimized keyword phrase? maybe
11. Filter eliminating keyword phrases used more than x times on a page? maybe


 3:03 pm on Jun 25, 2003 (gmt 0)

Take WebmasterWorld for example....I wonder how many of Brett's new SE visitors find the home page first....my guess is very few, most probably arrive in a thread somewhere.

Very true. I discovered WebmasterWorld when I clicked on the number one search result for "google update": [webmasterworld.com...]


 3:41 pm on Jun 25, 2003 (gmt 0)

Anyone want the bad news?

I have some indications that this may be rolling backwards in time. Another pair of barometer sites disappeared (for the first time) today. Anyone breathing a sigh of relief believing they have escaped may be a little premature.

I do now really wonder about the absence of GG. Why no comment at all for almost a week on this topic?

Obviously you can read a number of things into that: everything from the TZ theory is correct and he doesn't want to give it credibility, right through to he has nothing to say because the index will return to relative normality soon.

In my opinion this is getting worse, and the index is deteriorating because of it.


 3:45 pm on Jun 25, 2003 (gmt 0)


The sites that are starting to disappear - you would put your reputation on the line and say they are definitely spam free?

Not the case that spam filters are coming in?

Unfortunately I have not got your sample base; I assume that you have now built up a large sample?


 3:47 pm on Jun 25, 2003 (gmt 0)

<<I do now really wonder about the absence GG. Why no comment at all for almost on a week on this topic? >>

Possibly because he said that we could expect about one datacenter to update per day, but things have gone much differently. I would assume that this has been more difficult than anticipated.

Certainly, it's not the traditional update that GG said we should expect.


 3:48 pm on Jun 25, 2003 (gmt 0)

>> you would put your reputation on the line and say they are definitely spam free? <<

Absolutely. Some of those missing are not SEO'd at all, but were selected for other reasons (eg: people I know, interesting sites on a particular topic, etc).

When this thing first started, it wasn't too hard to find sites that had some impact by just searching some of the sites that presented multiple Google searches.


 3:53 pm on Jun 25, 2003 (gmt 0)

This musical chairs game has been going on for us since BEFORE the last 'dance'.

This is the THIRD week in a row of exactly the same thing. They start shuffling us out with 2 datacenters first, then when they get up to 5, they start adding us back in to the ones they dropped us from. Then at that point, they continue on to the rest of the datacenters. The process seems to take several days and has been ALMOST IDENTICAL for the past 3 weeks.

This week (thankfully) seems to be going along a bit faster than the rest, pointing to a Thursday conclusion instead of Friday.

We're talking a 100% spam-free PR6 website with many backlinks here, just to get that out there...

[edited by: drewls at 4:00 pm (utc) on June 25, 2003]


 3:56 pm on Jun 25, 2003 (gmt 0)


one more reason to vanish:

I wrote the same just in another thread.

My index vanished, but my index was still there. :-)
Much further down the SERPS I discovered my index.php;
it was my domain.de/ that disappeared.

- domain.de receives all the backlinks, but was filtered out.
- domain.de/index.php is still in the SERPS, but much further down.

Yes, by mistake Google had the same page from my domain in twice (my own mistake).

But why is my better ranked version filtered out instead?

any ideas?


 3:58 pm on Jun 25, 2003 (gmt 0)

>> they start adding us back in to the ones they dropped us from.<<

If only....

A number of sites in my collection haven't returned at all for days.

I'll give this to the weekend before going back to the drawing board and working out a fresh strategy to address the post Dom/Esm world. You can only sit and hope for so long.


 4:35 pm on Jun 25, 2003 (gmt 0)

Napoleon, I'm in the exact same situation... my sites have also dropped for the main keywords and were in and out daily.

This time they have not shown up, at least over the past few days.


 4:37 pm on Jun 25, 2003 (gmt 0)

Suggestion: Do whatever the top sites in the search results are doing. If it's not what you are doing now, diversify.


 5:12 pm on Jun 25, 2003 (gmt 0)

Yeah, but what index would the angry webmasters want Google to use? The one on -fi? The one on -ex? Or the one on -va, which is much different than -fi and -ex? Perhaps they could at least get Google to make up their minds. With all these indexes out there, which is the real, official one?


 5:14 pm on Jun 25, 2003 (gmt 0)

I have a site that has appeared for a while in the top of the results for a keyword and then appears to be downgraded in the results. I have checked and the last time it was downgraded coincided with just after the latest freshbot crawl. I then checked another site that I have been watching and this site followed the same pattern. Is anyone who seems to be having their site downgraded seeing this as well?


 5:17 pm on Jun 25, 2003 (gmt 0)

I'd rather get half of us SEO's on a bus to Microsoft, and the other half to Yahoo. Maybe we can work for free and help them to roll out their engines quickly. Bye bye "G"!

Seriously, if no stability or PR calculation occurs soon, I'd expect a lot of people to start spamming just to stay ahead.

I really don't think googleguy knows anything either. All he does is re-affirm our assumptions, I'd put little faith in his cryptic remarks.

I for one gave up on him during the "dominic" update. During this timeframe he continually said "expect the data to propagate to all the servers, then expect backlinks to come in, and then expect another traditional dance".

Well guess what, people. The data propagated, but the backlinks never really came back (or at least haven't been calculated for PR), and we aren't even sure if this is a traditional dance or not. You put your faith in this? I think I could have told you more from a fortune cookie.


 5:18 pm on Jun 25, 2003 (gmt 0)

Pick either -fi, -ex or even -va.... but please not the -sj one.

If they pick -sj I will drive the bus. ;-)


 5:39 pm on Jun 25, 2003 (gmt 0)

If Google is going to change the rules of how they judge SEO or SEF (search engine friendly) pages, they need to post it on their webmaster guidelines as no-nos. They may not owe anyone anything but if they are going to single out index pages or websites (for violations) they need to post the no-no guidelines to follow every time they make an algo change.

I am not saying they need to give their algo away, but punishing sites for something they have not stated is a violation of first amendment rights. I've studied media law and believe a case could be made here. Winnable or not, it could cost everyone a crap load of money. Bad PR and bleeding money is the only end goal of such a case. Hurt everyone. Cast a dark cloud.

Lies upon lies have been told here. This fiasco has been affecting webmasters for months, not weeks, in direct contradiction to GG's posts.

GG has obviously been sidelined because of the tech department or VIP department for serious reasons unknown to us. GG is not quiet for no reason. He was told to shut up. When the "Director of Communications" is told to be quiet, something is seriously wrong at Google. GG's posts are nothing but a PR bandaid. That's what he was hired to do. PR.

Oh yeah, it's time to rent a bus and have face to face time at the GooglePlex.

This post will be eliminated shortly along with my membership. I have no doubt.



 5:49 pm on Jun 25, 2003 (gmt 0)

>Seriously, if no stability or PR calculation occurs soon, I'd expect a lot of people to start spamming just to stay ahead.

This to me is a *serious* concern. Before, the idea was that if you built a solid site, and played by the rules, over time likely you would do well. At the moment, it looks to me like things are largely random. Nothing is predictable. Thus, best to toss out a lot of spam, and hope some will always make it to the top.


 5:56 pm on Jun 25, 2003 (gmt 0)

Yeah, but anyone with imaginative thinking skills can run spam without Google ever figuring out where it comes from. I never bothered to because I wanted to play by the rules. I followed Brett's guidelines to a strong site in Google and got shot in the neck. Since they only deal with spam through their joke of an algo, one is almost bulletproof in running spam on Google.

Google is forcing career webmasters to "fly under the radar."

Google is so overrated. Webmasters have figured that out now and it's only a matter of time until the media has figured it out.
