
Google News Archive Forum

This 345 message thread spans 12 pages; this is page 4.
Lost Index Files

 11:10 am on Jun 23, 2003 (gmt 0)

Some people may not like this post, and criticism of it would not at all surprise me. However, I suggest that before reading it you ask yourself whether Google actually DESIRES the current fluctuating situation. Is it where they actually want to be? Is it what they want to project to webmasters and the public?

Against that background perhaps the following analysis and theory may fall into place more easily.

Last week I posted a message requesting members to sticky mail details of their own specific situations and sites with respect to the fluctuations.

After spending days analyzing, and watching the picture continue to change before my eyes, I eventually found a theory to hang my hat on. No doubt it will be challenged, but at least it currently fits the data bank I have (my own sites, plus a third party observation set I use, plus those that were submitted to me by the above).

Two general phenomena seem to be dominating the debate:

a) Index pages ranking lower than sub-pages for some sites on main keyword searches

b) Sites appearing much lower than they should on main keyword searches, yet ranking highly when the &filter=0 parameter is applied.

These problems are widespread and there is much confusion out there between the two (and some others).

The first has probably attracted most attention, no doubt because it is throwing up such obvious and glaring glitches in visible search returns (eg: contact pages appearing as the entry page to the site). The second is less visible to the searcher because it simply torpedoes the individual sites affected.

By the way, in case anyone is still unaware, the &filter=0 parameter reverses the filter which screens out duplicate content. Except it does more than that.... it is currently screening out many sites for no obvious reason (sites that are clearly clean and unique).

So why is all this happening? Is there a pattern, and is there a relationship between these two and the other problems?

Well at first I wrestled with all sorts of theories. Most were shot down because I could always find a site in the data set that didn't fit the particular proposition I had in mind. I checked the obvious stuff: onsite criteria, link patterns, WHOIS data... many affected sites were simply 'clean' on anyone's interpretation.

Throughout though, there was the one constant: none of the sites affected were old (eg: more than 2 years) or at least none had old LINK structures.

This seemed ridiculous. There would be no logic to Google treating newer sites in this manner and not older ones. It is hardly likely to check the date when crawling! But the above fact was still there.

I have been toying with all sorts of ideas to resolve it... and the only one that currently makes any sense is the following.

In addition to WebmasterWorld I read a number of search blogs and portals. On one of these (GoogleWatch) a guy called Daniel Brandt quotes GoogleGuy as stating: "That is, we wind down the crawl after fetching 2B+ URLs, and the URL in question might not have been in that set of documents".

Now, assuming that is true (and it's published on the website so I would imagine it isn't invented), or even partially true, all sorts of explanations emerge.

1) The 2BN+ Set
If you are in here, as most long standing and higher PR sites will be, it is likely to be business as usual. These sites will be treated as if they were crawled by the old GoogleBot DEEP crawler. They will be stable.

2) The Twilight Set
But what of the rest? It sounds like Google may only have partial data for these, because the crawlers 'wound down' before getting the full picture. Wouldn't THAT explain some of the above?

To answer this question we need to consider Google's crawling patterns. One assumes that they broadly crawl down from high PR sites. They could also crawl down from older sites, sites they know about and sites they know both exist and are stable. That too would make sense.
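The "crawl down from high PR" idea can be pictured as a budget-limited, priority-first crawl. To be clear, this is purely an illustration of the hypothesis, not Google's actual crawler; the URLs, PR values, and budget below are all made up:

```python
import heapq

def prioritized_crawl(seeds, links, budget):
    """Fetch highest-PR URLs first, stopping when the fetch budget runs out.

    seeds:  {url: pagerank} -- known starting points
    links:  {url: [(child_url, child_pagerank), ...]} -- discovered outlinks
    Returns the set of URLs actually fetched; everything else would be left
    with only partial data (the 'twilight zone' of the theory).
    """
    heap = [(-pr, url) for url, pr in seeds.items()]  # negate PR for a max-heap
    heapq.heapify(heap)
    fetched = set()
    while heap and len(fetched) < budget:
        _, url = heapq.heappop(heap)
        if url in fetched:
            continue
        fetched.add(url)
        for child, child_pr in links.get(url, []):
            if child not in fetched:
                heapq.heappush(heap, (-child_pr, child))
    return fetched

# With a budget smaller than the graph, the low-PR newcomer misses the cut:
crawled = prioritized_crawl(
    {"old-high-pr.example": 8},
    {"old-high-pr.example": [("established.example", 6), ("new-low-pr.example", 2)]},
    budget=2,
)
print(crawled)  # 'new-low-pr.example' is the one left out
```

Under this toy model, being new and/or low PR is exactly what leaves a site outside the fetched set when the crawl "winds down".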

You can probably see where this is heading.

If your site or its link structure is relatively new, and/or say PR5 or below, you may well reside in the twilight zone. Google will not have all the data (or all the data AT ONCE) and you will be experiencing instability.

I have sites in my observation set that enter and exit both the problem sets above (a) and (b). It's as though Google is getting the requisite data for a period and then losing some of it again. As if the twilight zone is a temporary repository, perhaps populated and over-written by regular FreshBot data.

The data most affected by this is the link data (including anchor text) – it seems to retain the cache of the site itself and certain other data. This omission would also partially explain the predominance of sub-pages: with the loss of this link data there is nothing to support the index page above those sub-pages (Google is having to take each page on totally stand-alone value).

I also wonder whether Google sees all of this as a problem. I certainly do. Problem (a) is clearly visible to the searching public. They DON'T want to be presented with the links page for example when they enter a site! That is a poor search experience.

Do they see (b) as a problem? Again, I do. Sites are being filtered out when they have no duplicate content. Something isn't right. Google is omitting some outstanding sites, which will be noticeable in some cases.

The combination of (a) and (b) and perhaps other less well publicized glitches gives a clear impression of instability to anyone watching the SERPS closely (and that's a growing body of people). Together they are also disaffecting many webmasters who have slavishly followed their content-content-content philosophy. As I implied the other day, if following the Google content/link line gets them nowhere at all, they will seek other SEO avenues, which isn't good for Google in the long term.

Some people speculate that there is a software flaw (the old 4 byte / 5 byte theory for URL IDs) and that consequently Google has a shortage of address space with which to store all the unique URL identifiers. Well... I guess that might explain why a temporary zone is appealing to Google. It could well be a device to get around that issue whilst it is being solved. Google though has denied this.

However, it may equally be a symptom of the algorithmic and crawler changes we have seen recently. Ditching the old DeepBot and trying to cover the web with FreshBot was a fundamental shift. It is possible that for the time being Google has given up the chase of trying to index the WHOLE web... or at least FULLY index it at once. Possibly we are still in a transit position, with FreshBot still evolving to fully take on DeepBot responsibility.

If the latter is correct, then the problems above may disappear as Freshbot cranks up its activity (certainly (a)). In the future the 'wind down' may occur after 3BN, and then 4BN.... problem solved... assuming the twilight zone theory is correct.

At present though those newer (eg: 12 months+) links may be subject to ‘news’ status, and require refreshing periodically to be taken account of. When they are not fresh, the target site will struggle, and will display symptoms like sub-pages ranking higher than the index page. When they are fresh, they will recover for a time.

Certainly evidence is mounting that we have a more temporary zone in play. Perhaps problem (b) is simply an overzealous filter (very overzealous indeed!). However, problem (a) and other issues suggest a range of instability that affects some sites and not others. Those affected all seem to have the right characteristics to support the theory: relatively new link structure and/or not high PR.

The question that many will no doubt ask is, if this is correct, how long will it last? Obviously I can’t answer that. All I have put forward is a proposition based upon a reasonable amount of data and information.

I must admit, I do struggle to find any other explanation for what is currently happening. Brett’s ‘algo tweak’ suggestion just doesn’t stack up against the instability, the site selection for that instability, or the non-application to longer established sites.

The above theory addresses all those, but as ever…. if anyone has a better idea, which accounts for all the symptoms I have covered (and stands up against a volume of test data), I’m all ears. Maybe GoogleGuy wishes to comment and offer a guiding hand through these turbulent times.



 3:31 am on Jun 24, 2003 (gmt 0)


I am experiencing different results on all of the datacenters also. I am #2 on every datacenter except cw, ex, and in, where I am #3. Can anyone explain why they are fluctuating, and is this normal? Thank you for your replies.



 3:32 am on Jun 24, 2003 (gmt 0)

Can someone sticky me the URL of that site that allows you to check all the data centers on Google... Can't remember what it is.



 3:44 am on Jun 24, 2003 (gmt 0)

>> IN definitely has new data and SERPs look much better! <<

Thank you for pointing that out. I'm going to bed with a little depression lifted off my shoulders. :)


 4:03 am on Jun 24, 2003 (gmt 0)

Please pardon my ignorance here, but how do you apply the filter (&filter=0) that Napoleon mentioned?



 4:39 am on Jun 24, 2003 (gmt 0)

Do a search on Google, and when you get the results, append &filter=0 to the end of the URL in your address bar.
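For anyone scripting this, appending the parameter is a one-liner. A minimal sketch in Python; the search URL shown is just illustrative of Google's results URLs:

```python
from urllib.parse import urlparse, parse_qs, urlencode, urlunparse

def add_filter_off(url):
    """Append filter=0 to a results URL so the duplicate filter is disabled."""
    parts = urlparse(url)
    query = parse_qs(parts.query)
    query["filter"] = ["0"]  # overrides any existing filter value
    return urlunparse(parts._replace(query=urlencode(query, doseq=True)))

print(add_filter_off("http://www.google.com/search?q=widget+reviews"))
# -> http://www.google.com/search?q=widget+reviews&filter=0
```

Pasting the rewritten URL back into the address bar has the same effect as typing &filter=0 by hand.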


 4:52 am on Jun 24, 2003 (gmt 0)

Thanks to all who sent dance url.



 5:06 am on Jun 24, 2003 (gmt 0)

There is a lot of good solid update information here; will this thread now replace the previous Google update thread?

Although I'm sure it was never intended to be that way ... it happened :}


 5:27 am on Jun 24, 2003 (gmt 0)

Hey, I read in places like Google blog that Google stopped crawling for a while, but I have friends who say the bot has visited their sites for months. What happened? I was on vacation.


 5:27 am on Jun 24, 2003 (gmt 0)

How many of you went straight to Google and searched for the math knobs algorithm when Googleguy mentioned it?

My guess is that each time the Google crew member called GoogleGuy posts here, they're able to tag some 20 new SEOs for life.

And you're wondering what is happening with the Eminem update?


 5:32 am on Jun 24, 2003 (gmt 0)

My guess is that each time the Google crew member called GoogleGuy posts here, they're able to tag some 20 new SEOs for life.

That could start a whole new batch of theories ... good thought though ;)


 5:52 am on Jun 24, 2003 (gmt 0)

What is happening to my site fits Napoleon's hypothesis:

1) this site is less than 1 year old (all clean)
2) had reached position #7 for targeted keywords before Dominic
3) suffered from mild to severe Dominitis, fluctuating from position 20 to 150
4) came back to position # 8 with Esmeralda, where it stayed until today
5) is now at #34
6) I have hardly seen fresh/deepbot lately

Comparing the previous results to the very last ones, I noticed that my index page ranked the same for allinanchor (#3), but sank for allintext (from #3 to #20) and allintitle (from #3 to #16). I have made essentially no changes to the site for several months.

It is as if Google had written down on a little piece of paper a note to try and remember the fact that my site was important for these keywords. However, it has now lost this unimportant little piece of paper when it started working on all of the fresh new data.

Another way to see it: Google is getting old, and has a hard time with recent memories. If not reinforced constantly, these recent memories get lost. However, Google still remembers perfectly everything about its childhood... and the web sites that existed then.... :)

brotherhood of LAN

 6:00 am on Jun 24, 2003 (gmt 0)

NFFC, I think I could see you on that map, next to the plough constellation, I believe it was the farm ;)

Interesting thread. I have to admit I haven't noticed the trends mentioned, but I have noticed SERP irregularities.

I just wonder what would happen to these "missing pages" if they had a few more inbound links.

In RE: to the % of the web crawled by Google, I remember reading something along the lines of 10 billion documents being on the web, so Google indexes around 1/3. It's understandable how they have to prioritize which URLs are crawled (and revisited for freshness)... I'd lend credit to a theory that says some of the index is being sacrificed for freshness.

Must be hard to keep the index comprehensive and fresh 24/7/365.

Napoleon, any relation between the dropped pages and the PR of their highest PR backlink?


 6:19 am on Jun 24, 2003 (gmt 0)

Oh great - I have been trying to stay clear of this thread - but for my site it seems to be getting worse and worse with each day :(

I also have a new site - launched February and the serps for the site are so unstable!

I don't seem to have any problem attracting freshbot - it's just that when the pages get freshed you can never tell where they are going to appear in the index (OK - I know about flux, but this is when the pages are fresh - not in and out).

So freshbot seems to find a page and can place me number 1 for the targeted keyphrase - all good - so I don't make any changes to this page, and then freshbot finds me again and places me number 78. This suggests to me that freshbot is really struggling to place the pages in the correct positions in the SERPs. As the current SERPs seem to be totally reliant on freshbot, this could be a reason for the unstable indexes.

And to cap it all, one of my main target keyphrases has a page from a large multi-national using some spam techniques (Javascript re-directs are still alive and strong!) - even if I spam reported them I doubt I would have any success.


 6:31 am on Jun 24, 2003 (gmt 0)

Just a quick interjection and light relief whilst everyone tears their hair out over Google. A link to an article on the BBC website - about Google becoming the ubiquitous term for searching the internet. :)


Just what you want to hear right now?


 6:51 am on Jun 24, 2003 (gmt 0)

Here's my latest theory. We may be seeing sites sacrificed in order to keep other sites fresh. But I don't think things are done yet. GoogleGuy mentioned (in the early dominicene epoch) how excited he was about the new technology. I doubt he was talking about downsizing the web and dropping index pages. I didn't get the idea he was referring to the algo at all. We have seen veryfreshbot, we have seen deepfreshbot, we have yet to see verydeepbot. I think we will see the new technology soon with the upcoming verydeepcrawl.
OTOH maybe 3 billion is enough and now it's time to clear out the clutter.
mytwocents (I'm gonna go broke soon;))


 6:56 am on Jun 24, 2003 (gmt 0)

Hey, I read in places like Google blog that Google stopped crawling for a while, but I have friends who say the bot has visited their sites for months. What happened? I was on vacation.

Hmm..., not much, you did not miss a thing.


 6:59 am on Jun 24, 2003 (gmt 0)

You are the first to mention, other than myself, that your pages are disappearing when getting fresh tagged. This has happened to mine, though only once in Dominic and once in Esmeererer... FreshBots visit regularly, but I only disappear when I get the actual fresh tag. Are your experiences similar? One strange thing I noticed was that usually the freshbot visits robots.txt, index.html and then off to play around. Last time I got dropped it was robots, go play around, then index.html. Perhaps it wasn't freshbot, it was spambot, or seobot, or linkstoonewbot (not to be confused with links2youbot).
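The fetch-order observation (whether robots.txt comes before index.html) can be checked directly in raw access logs. A minimal sketch, assuming combined log format and that "Googlebot" appears in the user-agent string; both are assumptions about your particular server setup:

```python
import re

# Matches the timestamp, request path, and user-agent in a combined-log-format line.
LOG_RE = re.compile(
    r'\[(?P<ts>[^\]]+)\] "GET (?P<path>\S+) [^"]*" \d+ \S+ "[^"]*" "(?P<ua>[^"]*)"'
)

def googlebot_fetch_order(lines):
    """Return the sequence of paths fetched by Googlebot, in log order."""
    order = []
    for line in lines:
        m = LOG_RE.search(line)
        if m and "Googlebot" in m.group("ua"):
            order.append(m.group("path"))
    return order

sample = [
    '1.2.3.4 - - [24/Jun/2003:07:30:00 +0000] "GET /robots.txt HTTP/1.0" 200 120 "-" "Googlebot/2.1"',
    '1.2.3.4 - - [24/Jun/2003:07:30:02 +0000] "GET /index.html HTTP/1.0" 200 5120 "-" "Googlebot/2.1"',
]
print(googlebot_fetch_order(sample))  # ['/robots.txt', '/index.html']
```

Run it over a few days of logs and compare the ordering on visits that preceded a drop against visits that didn't.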


 7:30 am on Jun 24, 2003 (gmt 0)


I have been experiencing this since the week before Esmeralda - I tried to start a new post but it did not gain approval :)

Anyway what seems to happen is freshbot comes along and grabs robots.txt and then normally the rest of the site (not really noticed if index.html is first - have not got the logs to hand)

Now the same page, which has not had any changes, can get picked up by freshbot and come in at number 1 in the SERPs; freshbot comes again, picks up the page, and at the next fresh update the page comes in at 70+; freshbot comes again, picks up the page, and it can come in again at number 1.

So what I am seeing is a page that has not been changed at all move from 1 to 70+ to 1 to 70+ (all with freshtags - not everflux).

I am really not sure what this means - it is hard to decide if I should change (optimize differently) the page, though, as for two to three days it is doing well and for two to three days it is not.

Also, I am really not sure what this says about the SERPs at the moment, but you can see that freshbot is inconsistent and therefore, I feel, does not know how to rank a page properly.


 7:48 am on Jun 24, 2003 (gmt 0)

I don't understand why Googleguy hasn't chimed in yet to say that things are still in flux and, don't worry, your sites WILL show up.

That window is starting to look awfully attractive again as my dominating site has VANISHED out of the top 1000 for its kw!

This is insanity - please fill us in while we go broke please GG!

*hmph* I could drink a gallon of whiskey right now, but Google done gone and stole all me gold and took me youngest daughter too - them darn rascalz ;/


 7:52 am on Jun 24, 2003 (gmt 0)

It would be interesting if Googleguy could confirm a couple of aspects that he has confirmed previously - just so we know what the current state of play is.

At one stage he likened the current changes to Google to a journey, and he said don't look at every bump, look at the whole journey - are we at the end of the journey yet, or are there a few more miles and bumps to go?

Secondly, during update Dominic someone asked if Google were happy with the serps - GG confirmed that they were - are Google still happy with the current serps?


 8:03 am on Jun 24, 2003 (gmt 0)

I think a bunch of us should rent a bus and take a roadtrip to the Googleplex.



 8:11 am on Jun 24, 2003 (gmt 0)

>> Napoleon, any relation between the dropped pages and the PR of their highest PR backlink? <<

Some of the sites dropped have backlinks of PR7 (and one 8 I think), which is interesting.... because you would imagine that this fact alone would make it an early(ish) crawl contender. It was another factor that led me to consider age as a prime driver for the problem (as much as I didn't want to).

By the way NFFC... as a graduate (UK) in computational and statistical science I don't need lessons in sampling theory.

Some very interesting stuff has been posted overnight. The basic tenet of the issue seems to be though... is the problem caused by missing data or an algo tweak?

I chose missing data because I believe that the evidence available suggests this very heavily indeed. For example:

a) GoogleGuy said there was no algo tweak with respect to index files
b) An algo tweak would normally be expected to affect all sites similarly constructed and placed, in the same manner (old and new)
c) The wild variations in SERPS suggest on-going and changing activity, rather than a change in the basis of calculation, which would tend to produce more static output.

A filter? Well, much of the substance of these points apply here as well, albeit not as strongly in some cases.

So if it IS instability with certain core data (notably link data), one would imagine that it is being worked upon. Visibly though, things seem to be getting worse, not better, certainly on the sites I have tracked (which is a growing number - thanks for sending them everyone).

As noted above, it is interesting that GoogleGuy still hasn't contributed with at least something on time frame. That itself can be taken in many ways.... so I won't get into it any deeper! It would be nice to have some sort of indication that there will be an end to THE TWILIGHT ZONE.


 8:19 am on Jun 24, 2003 (gmt 0)

Didn't he tell us he was gone fishin'? Wouldn't be back for several dayz.


 8:25 am on Jun 24, 2003 (gmt 0)

I am really not sure what this means - it is hard to decide if I should change (optimize differently) the page though as for two to three days it is doing well and for two to three days it is not.

This may be exactly the point.
Sorry Charlie. Google doesn't want tuna with good taste. They want tuna that tastes good.


 8:28 am on Jun 24, 2003 (gmt 0)

He is back. But not touching this one, it seems.
GG has made a couple of posts in the last few hours.
GG has made a couple of posts in the last few hours.


 8:33 am on Jun 24, 2003 (gmt 0)

Here's a dumb question, is the dance over with? GG said it should go faster this time around but with millions of indexed pages zapped out of existence in the last 24 hours, I can't imagine the dance is over.

Marketing Guy

 8:40 am on Jun 24, 2003 (gmt 0)

I've been interested to read all the posts looking to Googleguy for an answer - he doesn't owe any of us an answer - in fact, for all we know, he may not be privy to the information we seek.

In the same light, posting a message that suggests (or even insinuates) any member here "should" offer their thoughts is not the way to go. Every member here (including Brett, GG, Mods, Admin) offers their contributions voluntarily, and all are no doubt aware of this thread. I am sure they would contribute if they wanted to, but I guess they don't because dozens of you would jump on and analyse their every word, or they just don't have any new information to offer.



 8:58 am on Jun 24, 2003 (gmt 0)

Well, one of my sites has been suffering from this 'in and out of the index' syndrome for its main keyword... in fact I'm having record hits for this 5 month old site and sales are also up!?

I know a lot of people are in the same boat with their sites seemingly disappearing, but what have your stats been like?


 9:09 am on Jun 24, 2003 (gmt 0)


Interesting thread.

I'm tending to go with the theory that we're still seeing PR data from the February crawl being used at times. If not that then there's some kind of penalty relating to incoming links.

Are you sure this 'index effect' is happening on older sites? If so are you sure that the PR of the sites you have been looking at did not take a dip in the March index?

I'd love to see evidence that this is affecting older sites too.


 9:17 am on Jun 24, 2003 (gmt 0)

Remember what GoogleGuy wrote about the current update: "Here's what I would expect.":

1. "Probably about one data center per day will get switched to the Esmeralda index."

2. "You may see some improvements during the course of the switchover as ingredients get blended in as they're ready."

3. "I would expect another round of ingredient-adding after the index is switched over."

The ingredients include: spam snapshots and backlink information. I think it is important to mention a third one: Open Directory data.

After exchanging some stickys, I'm getting more convinced that Google's blues have to do with ODP data. Let's wait ...

[edited by: zafile at 9:37 am (utc) on June 24, 2003]


 9:34 am on Jun 24, 2003 (gmt 0)

We can all throw up theories and figure out what's going on, but what if Google is actually trying to index all live webpages on the internet, have them stored on all datacentres, and then decide to apply the filters to block out spammy sites and so on?

You may ask then, what's the point of the algo? Well, it certainly has taken this update long enough to re-structure its index, so that coincides with the algo at work.

Still, I believe google has still to unleash its big guns, so hopefully in the next few days we'll see those at work.

Any more theories =)

[edited by: spud01 at 9:43 am (utc) on June 24, 2003]
