
Google News Archive Forum

Lost Index Files
Napoleon




msg:190018
 11:10 am on Jun 23, 2003 (gmt 0)

Some people may not like this post, and criticism of it would not at all surprise me. However, I suggest that before reading it you ask yourself whether Google actually DESIRES the current fluctuating situation. Is it where they actually want to be? Is it what they want to project to webmasters and the public?

Against that background perhaps the following analysis and theory may fall into place more easily.

DATA ANALYSIS AND BACKGROUND
Last week I posted a message requesting members to sticky mail details of their own specific situations and sites with respect to the fluctuations.

After spending days analyzing, and watching the picture continue to change before my eyes, I eventually found a theory to hang my hat on. No doubt it will be challenged, but at least it currently fits the data bank I have (my own sites, plus a third party observation set I use, plus those that were submitted to me by the above).

Two general phenomena seem to be dominating the debate:

a) Index pages ranking lower than sub-pages for some sites on main keyword searches

b) Sites appearing much lower than they should on main keyword searches, yet ranking highly when the &filter=0 parameter is applied.

These problems are widespread and there is much confusion out there between the two (and some others).

The first has probably attracted most attention, no doubt because it is throwing up such obvious and glaring glitches in visible search returns (eg: contact pages appearing as the entry page to the site). The second is less visible to the searcher because it simply torpedoes the individual sites affected.

By the way, in case anyone is still unaware, the &filter=0 parameter switches off the filter that screens out duplicate content. Except that filter is doing more than that.... it is currently screening out many sites for no obvious reason (sites that are clearly clean and unique).
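Since the comparison keeps coming up, here is a minimal sketch of the check people are running by hand: the same query with and without &filter=0. The search phrase and the num parameter below are placeholders, not anything taken from the thread.

```python
# Build the two query URLs for a manual side-by-side comparison.
from urllib.parse import urlencode

query = "example keyword phrase"   # hypothetical search phrase
base = "http://www.google.com/search"

filtered = base + "?" + urlencode({"q": query, "num": 100})
unfiltered = base + "?" + urlencode({"q": query, "num": 100, "filter": 0})

print("filtered:  ", filtered)
print("unfiltered:", unfiltered)
# Open both in a browser and note where the site appears in each list;
# a large jump when filter=0 is added is the symptom described in (b).
```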

So why is all this happening? Is there a pattern, and is there a relationship between these two and the other problems?

Well at first I wrestled with all sorts of theories. Most were shot down because I could always find a site in the data set that didn't fit the particular proposition I had in mind. I checked the obvious stuff: onsite criteria, link patterns, WHOIS data... many affected sites were simply 'clean' on anyone's interpretation.

Throughout though, there was the one constant: none of the sites affected were old (eg: more than 2 years) or at least none had old LINK structures.

This seemed ridiculous. There would be no logic to Google treating newer sites in this manner and not older ones. It is hardly likely to check the date when crawling! But the above fact was still there.

I have been toying with all sorts of ideas to resolve it... and the only one that currently makes any sense is the following.

THE GOOGLE TWILIGHT ZONE
In addition to WebmasterWorld I read a number of search blogs and portals. On one of these (GoogleWatch) a guy called Daniel Brandt quotes GoogleGuy as stating: "That is, we wind down the crawl after fetching 2B+ URLs, and the URL in question might not have been in that set of documents".

Now, assuming that is true (and it's published on the website so I would imagine it isn't invented), or even partially true, all sorts of explanations emerge.

1) The 2BN+ Set
If you are in here, as most long standing and higher PR sites will be, it is likely to be business as usual. These sites will be treated as if they were crawled by the old GoogleBot DEEP crawler. They will be stable.

2) The Twilight Set
But what of the rest? It sounds like Google may only have partial data for these, because the crawlers 'wound down' before getting the full picture. Wouldn't THAT explain some of the above?

To answer this question we need to consider Google's crawling patterns. One assumes that they broadly crawl down from high PR sites. They could also crawl down from older sites, sites they know about and sites they know both exist and are stable. That too would make sense.

You can probably see where this is heading.

If your site or its link structure is relatively new, and/or say PR5 or below, you may well reside in the twilight zone. Google will not have all the data (or all the data AT ONCE) and you will be experiencing instability.
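To make the theory concrete, here is a purely illustrative toy model (my own sketch, not anything Google has described): a crawl frontier ordered by PR with a hard URL budget. Whatever is still sitting in the queue when the budget runs out ends up with partial or no link data, i.e. the twilight zone. All site names and numbers are invented.

```python
# Toy priority crawl: high-PR pages are fetched first, and the crawl
# "winds down" after a fixed budget, leaving the rest only partially known.
import heapq

CRAWL_BUDGET = 3              # stand-in for the "2B+ URLs" wind-down point

# (negative PR, url, outlinks) -- heapq pops the smallest tuple,
# so the highest-PR page comes out first.
seeds = [
    (-8, "old-high-pr-site.example", ["new-site-a.example"]),
    (-7, "established-site.example", ["new-site-b.example"]),
    (-3, "new-low-pr-site.example", []),
]

frontier = list(seeds)
heapq.heapify(frontier)
crawled = set()
fetched = 0

while frontier and fetched < CRAWL_BUDGET:
    neg_pr, url, outlinks = heapq.heappop(frontier)
    if url in crawled:
        continue
    crawled.add(url)              # full link/anchor data recorded for this URL
    fetched += 1
    for link in outlinks:         # newly discovered pages get a lower priority
        heapq.heappush(frontier, (neg_pr + 2, link, []))

twilight = {url for _, url, _ in frontier if url not in crawled}
print("fully crawled:", crawled)
print("twilight set: ", twilight)  # known about, but with partial data at best
```

Under this toy model the older, higher-PR sites behave exactly as before, while the sites discovered late drift in and out as fresh data arrives and is overwritten, which is roughly the instability pattern described above.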

I have sites in my observation set that enter and exit both the problem sets above (a) and (b). It's as though Google is getting the requisite data for a period and then losing some of it again. As if the twilight zone is a temporary repository, perhaps populated and over-written by regular FreshBot data.

The data most affected by this is the link data (including anchor text) – it seems to retain the cache of the site itself and certain other data. This omission would also partially explain the predominance of sub-pages: with the loss of this link data there is nothing to support the index page above those sub-pages (Google is having to take each page on totally stand-alone value).

IS IT A PROBLEM?
I also wonder whether Google sees all of this as a problem. I certainly do. Problem (a) is clearly visible to the searching public. They DON'T want to be presented with the links page for example when they enter a site! That is a poor search experience.

Do they see (b) as a problem? Again, I do. Sites are being filtered out when they have no duplicate content. Something isn't right. Google is omitting some outstanding sites, which will be noticeable in some cases.

The combination of (a) and (b) and perhaps other less well publicized glitches gives a clear impression of instability to anyone watching the SERPS closely (and that's a growing body of people). Together they are also disaffecting many webmasters who have slavishly followed the content-content-content philosophy. As I implied the other day, if following the Google content/link line gets them nowhere at all, they will seek other SEO avenues, which isn't good for Google in the long term.

WHY HAVE A TWILIGHT ZONE?
Some people speculate that there is a software flaw (the old 4 byte / 5 byte theory for URL IDs) and that consequently Google has a shortage of address space with which to store all the unique URL identifiers. Well... I guess that might explain why a temporary zone is appealing to Google. It could well be a device to get around that issue whilst it is being solved. Google though has denied this.

However, it may equally be a symptom of the algorithmic and crawler changes we have seen recently. Ditching the old DeepBot and trying to cover the web with FreshBot was a fundamental shift. It is possible that for the time being Google has given up the chase of trying to index the WHOLE web... or at least FULLY index it at once. Possibly we are still in a transit position, with FreshBot still evolving to fully take on DeepBot responsibility.

If the latter is correct, then the problems above may disappear as Freshbot cranks up its activity (certainly (a)). In the future the 'wind down' may occur after 3BN, and then 4BN.... problem solved... assuming the twilight zone theory is correct.

At present though, those newer (eg: 12 months+) links may be subject to 'news' status, and require refreshing periodically to be taken into account. When they are not fresh, the target site will struggle, and will display symptoms like sub-pages ranking higher than the index page. When they are fresh, they will recover for a time.

VERDICT?
Certainly evidence is mounting that we have a temporary zone in play. Perhaps problem (b) is simply an overzealous filter (very overzealous indeed!). However, problem (a) and other issues suggest a range of instability that affects some sites and not others. Those affected all seem to have the right characteristics to support the theory: relatively new link structures and/or not high PR.

The question that many will no doubt ask is: if this is correct, how long will it last? Obviously I can't answer that. All I have put forward is a proposition based upon a reasonable amount of data and information.

I must admit, I do struggle to find any other explanation for what is currently happening. Brett’s ‘algo tweak’ suggestion just doesn’t stack up against the instability, the site selection for that instability, or the non-application to longer established sites.

The above theory addresses all those, but as ever…. if anyone has a better idea, which accounts for all the symptoms I have covered (and stands up against a volume of test data), I’m all ears. Maybe GoogleGuy wishes to comment and offer a guiding hand through these turbulent times.

 

WebMistress




msg:190348
 3:19 am on Jun 27, 2003 (gmt 0)

I lost my homepage, AGAIN, in the SERPs... I was having this problem:

Both domain.com and www.domain.com are listed, with www.domain.com lower, and today it shows as an indented sub-page of my internal pages (the homepage has over 1,500 backlinks; the internal pages have only internal backlinks). It made a mess of me in the SERPs, and I figured it was Google's funky update. But I just checked Overture, and for my main two-word keyword phrase (after the sponsored listings), www.domain.com is #1 and domain.com is #3.

Can anyone think of a reason why I would be experiencing this other than Google's funky update? I was on two servers from May 5 until about a week ago while I transitioned from one to the other, making sure Google would not lose me in the transition. Could this conceivably cause the problem I am seeing on both search engines? If not, any other ideas? And any solutions? Thanks.
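For anyone wanting to rule the domain.com / www.domain.com duplicate in or out before blaming the update, below is a minimal sketch of the kind of check involved; the hostnames are placeholders, and nothing here is advice from the posters themselves.

```python
# Fetch both host variants and compare status, body hash, and final URL.
import hashlib
import urllib.request

def fingerprint(url):
    """Return (status, md5 of body, final URL after redirects) for a page."""
    with urllib.request.urlopen(url, timeout=10) as resp:
        body = resp.read()
        return resp.status, hashlib.md5(body).hexdigest(), resp.geturl()

for host in ("http://example.com/", "http://www.example.com/"):
    try:
        print(host, "->", fingerprint(host))
    except Exception as exc:
        print(host, "failed:", exc)

# Identical hashes with no redirect from one host to the other means both
# variants serve the same page and can both end up listed.
```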

Anon27




msg:190349
 3:24 am on Jun 27, 2003 (gmt 0)

Stefan:

If it is any comfort, I suggest that you correct the error and see how Google treats you.

Anon27




msg:190350
 3:27 am on Jun 27, 2003 (gmt 0)

Hey WebMistress;

In addition to the issues that have been openly discussed here, we have all had problems much like yours...

Remain calm, that's the best advice I can give.

Stefan




msg:190351
 3:30 am on Jun 27, 2003 (gmt 0)

Thanks, Anon27, I've been cleaning it up and making progress. The index page only ever brought in about 10% of traffic, but the dupes have been messing up my PR. My site used to be a PR6, and now it's a PR5 because of that nonsense. Most of the kw's I count on are still #1, or close to it, so not a problem. It's been an ordeal getting these guys to understand what they're doing though. So it goes.

ADDED: Not that it matters, but I've noticed my post count hasn't gone up the last few msg's. Brett, I don't want to be a Senior Member anyway, too much pressure... you could count all my msgs and keep the "Senior Member" count a few dozen ahead of whatever I have. That might work.

[edited by: Stefan at 3:38 am (utc) on June 27, 2003]

Spica




msg:190352
 3:31 am on Jun 27, 2003 (gmt 0)

For the query mysite.com, my index page does not come up. Two subpages are returned at positions #2 and #3.

If I search for www.mysite.com, it is worse: those same subpages appear at positions #8 and #9.

I am wondering whether this is a symptom common to all of the sites that currently suffer frequent up/downs or in/outs of the SERPs since Dominic. Napoleon, is this true also for the other "twilight zone sites" that you monitor?

There is no reason to believe that my site has any sort of penalty. Not only is it clean, but my index page comes up on the first page of the SERPs for my targeted keywords (at least right now- that may not be true tomorrow...).

Anon27




msg:190353
 3:46 am on Jun 27, 2003 (gmt 0)

Spica:

Start at the first post in this thread by Napoleon.

Read every post, all night long.

Reply to your own post in the morning.

Good night.

mrguy




msg:190354
 3:52 am on Jun 27, 2003 (gmt 0)

I think it is interesting to note that the searching public has taken notice of a problem with Google.

When a DJ for some radio show in the largest market in North America is making fun of it, I find it very hard to believe that is what Google wants. I've seen other comments, and not in webmaster or SEO circles. I myself have been asked "what is wrong with Google" by naive surfers who think I know everything about the web. It seems their biggest complaint is not being able to find something again after they found it the first time. Consistency is nonexistent at this point.

The Googleheads here will defend Google to the end, and at one time I was one of them, but don't forget that word of mouth made Google what it is.

All these posts on trying to find a pattern are a waste of time. There is no pattern, because Google is broken right now, plain and simple. Things did not go as planned and now they are trying to recover. I know it makes us all feel better, but does it really? I just get more frustrated every time I look and see the same thing happening day in and day out. I keep waiting for the day when things are all right again in Google Land, but that day just seems so far off.

Google did this prior to Yahoo switching to Ink in September so they could work the bugs out. Yes, Yahoo has said they won't dump Google, but come on, they will at least bury any results from Google under their first Ink results, much the way MSN buries Ink results under LookSmart's. Whether or not Google sends you traffic through Yahoo will depend on how many results are ahead of you in the Ink SERPS. Yahoo and Google are competitors; let there be no mistake about it.

So I think the timing is right for them to do this, because it is much easier to weather the storm when there really is no other viable alternative to switch to at the moment. Six months from now there probably will be an alternative, and one that advertises heavily. Do I like them doing this now? Of course not. It plain sucks.

So, for me, I'm done watching the SERPS and datacenters and getting my blood to a boil watching my listings play ping-pong. I will now go about doing what it takes to survive while they work out their problems.

Hopefully it will be soon:)

Spica




msg:190355
 4:09 am on Jun 27, 2003 (gmt 0)

Anon27:

I have read all of these posts. And all of the ones on the semi-penalty, and all of the ones on the Dominic update, and all of the ones on the Esmeralda update.

I am asking a question here, because I am not sure what the answer is. I am asking whether all the sites that are currently somehow misindexed in Google behave like mine for the query "mysite.com".

If I missed the post that answers this exact question, would you be so kind as to point it out?

customdy




msg:190356
 4:19 am on Jun 27, 2003 (gmt 0)

Spica, do they all behave like this? No, but some of them do. I know it is difficult, but try to wait this out and see what happens when things settle down. I truly believe Google is severely broken right now.

mrbrad




msg:190357
 4:23 am on Jun 27, 2003 (gmt 0)

MrGuy: My thoughts exactly.

Google's success has gone to their head.
I think they honestly believe these are still good SERPS and that nobody from the surfer community is complaining.
They are naive to the reality of the situation and its potential impact.

If Google thinks they have the SE market on lock, they are gravely mistaken.
Do the names Northern Light and Alta Vista ring a bell?

Yahoo/INK and Overture/FAST are only one step behind.

Stefan




msg:190358
 4:33 am on Jun 27, 2003 (gmt 0)

Spica, it might be (maybe not) a duplicate content problem. If your domain/index has multiple listings such as domain.org, www.domain.org, www.domain.org/, www.domain.org/index, or directories that have the form www.domain.org/?theirdomain.com, all getting found by Google and confusing the hell out of it, this could do it.

I got nailed by an incorrect incoming link on domain.org, and several directories with that BS link instead of the right one. None of this might apply to you but it affected me, for sure.
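As a rough illustration of checking the variants listed above, the sketch below requests each one without following redirects and reports whether it answers directly or points at a single canonical address. The domain and paths are placeholders, not anyone's real site.

```python
# Request each URL variant and report a direct answer vs. a redirect target.
import urllib.error
import urllib.request

VARIANTS = [
    "http://example.org/",
    "http://www.example.org/",
    "http://www.example.org/index.html",
    "http://www.example.org/?someotherdirectory.com",
]

class NoRedirect(urllib.request.HTTPRedirectHandler):
    def redirect_request(self, *args, **kwargs):
        return None               # surface the redirect instead of following it

opener = urllib.request.build_opener(NoRedirect())

for url in VARIANTS:
    try:
        resp = opener.open(url, timeout=10)
        print(url, "->", resp.status, "(served directly: a possible duplicate)")
    except urllib.error.HTTPError as err:
        print(url, "->", err.code, "redirects to", err.headers.get("Location"))
    except Exception as exc:
        print(url, "failed:", exc)
```

Anything that serves its own 200 instead of redirecting to the one address you want indexed is a candidate for the kind of duplicate listing described above, and a permanent redirect is the usual cleanup.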

steveb




msg:190359
 4:38 am on Jun 27, 2003 (gmt 0)

Topical/index pages still missing in action.

Fresh trash still being served to the public on a platter.

And for the past day and a half, Google sees fit to show [google.com...] in the language of the planet Zork.

Spica




msg:190360
 4:51 am on Jun 27, 2003 (gmt 0)

Stefan:
I don't see how it could be a duplicate content problem in my case. My domain name is fairly unique, and none of the other .net, .org, etc. have been registered by anyone. To the best of my knowledge, all of the links pointing to my site are in the proper form (http://www.mysite.com/).

Spica




msg:190361
 4:55 am on Jun 27, 2003 (gmt 0)

Steveb:
There is nothing wrong with this page from where I am. What did you see?

steveb




msg:190362
 5:07 am on Jun 27, 2003 (gmt 0)

(Google's add-URL page, currently being served in Dutch:)

Add or update a URL
We constantly add new sites to our index and we invite you to submit your own website to us. We do not add every site to our index, and we cannot predict or guarantee whether or when they will appear.

Enter your complete web address (URL), including the http:// prefix. Example: [google.nl...] You may also include a description of your site. It is used only for our own information and has no effect on how your site appears in the index.

--> Only the index page of a website or domain is needed; you do not have to submit every page separately. Our crawler,
