|Lost Index Files|
Some people may not like this post, and criticism of it would not at all surprise me. However, I suggest that before reading it you ask yourself whether Google actually DESIRES the current fluctuating situation. Is it where they actually want to be? Is it what they want to project to webmasters and the public?
Against that background perhaps the following analysis and theory may fall into place more easily.
DATA ANALYSIS AND BACKGROUND
Last week I posted a message requesting members to sticky mail details of their own specific situations and sites with respect to the fluctuations.
After spending days analyzing, and watching the picture continue to change before my eyes, I eventually found a theory to hang my hat on. No doubt it will be challenged, but at least it currently fits the data bank I have (my own sites, plus a third party observation set I use, plus those that were submitted to me by the above).
Two general phenomena seem to be dominating the debate:
a) Index pages ranking lower than sub-pages for some sites on main keyword searches
b) Sites appearing much lower than they should on main keyword searches, yet ranking highly when the &filter=0 parameter is applied.
These problems are widespread and there is much confusion out there between the two (and some others).
The first has probably attracted most attention, no doubt because it is throwing up such obvious and glaring glitches in visible search returns (eg: contact pages appearing as the entry page to the site). The second is less visible to the searcher because it simply torpedoes the individual sites affected.
By the way, in case anyone is still unaware, the &filter=0 parameter reverses the filter which screens out duplicate content. Except it does more than that.... it is currently screening out many sites for no obvious reason (sites that are clearly clean and unique).
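For anyone wanting to check their own site against problem (b), here is a minimal sketch that builds the two search URLs to compare, with and without the parameter. The helper name, the base URL and the example query are assumptions for illustration; only the filter=0 parameter itself comes from the discussion above.

```python
from urllib.parse import urlencode

def search_urls(query, base="https://www.google.com/search"):
    """Return (filtered, unfiltered) search URLs for the same query.

    `base` is an assumed search endpoint; adjust it to whatever you
    actually query. `filter=0` asks for the duplicate filter to be off.
    """
    filtered = f"{base}?{urlencode({'q': query})}"
    unfiltered = f"{base}?{urlencode({'q': query, 'filter': 0})}"
    return filtered, unfiltered

# Hypothetical main keyword search for an affected site
filtered, unfiltered = search_urls("blue widgets")
print(filtered)
print(unfiltered)
```

If your site ranks well on the second URL but is buried on the first, you are looking at problem (b) rather than problem (a).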
So why is all this happening? Is there a pattern, and is there a relationship between these two and the other problems?
Well at first I wrestled with all sorts of theories. Most were shot down because I could always find a site in the data set that didn't fit the particular proposition I had in mind. I checked the obvious stuff: onsite criteria, link patterns, WHOIS data... many affected sites were simply 'clean' on anyone's interpretation.
Throughout though, there was the one constant: none of the sites affected were old (eg: more than 2 years) or at least none had old LINK structures.
This seemed ridiculous. There would be no logic to Google treating newer sites in this manner and not older ones. It is hardly likely to check the date when crawling! But the above fact was still there.
I have been toying with all sorts of ideas to resolve it... and the only one that currently makes any sense is the following.
THE GOOGLE TWILIGHT ZONE
In addition to WebmasterWorld I read a number of search blogs and portals. On one of these (GoogleWatch) a guy called Daniel Brandt quotes GoogleGuy as stating: "That is, we wind down the crawl after fetching 2B+ URLs, and the URL in question might not have been in that set of documents".
Now, assuming that is true (and it's published on the website so I would imagine it isn't invented), or even partially true, all sorts of explanations emerge.
1) The 2BN+ Set
If you are in here, as most long standing and higher PR sites will be, it is likely to be business as usual. These sites will be treated as if they were crawled by the old GoogleBot DEEP crawler. They will be stable.
2) The Twilight Set
But what of the rest? It sounds like Google may only have partial data for these, because the crawlers 'wound down' before getting the full picture. Wouldn't THAT explain some of the above?
To answer this question we need to consider Google's crawling patterns. One assumes that they broadly crawl down from high PR sites. They could also crawl down from older sites, sites they know about and sites they know both exist and are stable. That too would make sense.
You can probably see where this is heading.
If your site or its link structure is relatively new, and/or say PR5 or below, you may well reside in the twilight zone. Google will not have all the data (or all the data AT ONCE) and you will be experiencing instability.
I have sites in my observation set that enter and exit both the problem sets above (a) and (b). It's as though Google is getting the requisite data for a period and then losing some of it again. As if the twilight zone is a temporary repository, perhaps populated and over-written by regular FreshBot data.
The data most affected by this is the link data (including anchor text) – it seems to retain the cache of the site itself and certain other data. This omission would also partially explain the predominance of sub-pages, as with the loss of this link data there is nothing to support the index above those sub-pages (Google is having to take each page on its totally stand-alone value).
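To make the theory concrete, here is a toy sketch of a crawler that works down from the highest-PR pages and winds down after a fixed number of fetches. Everything in it is an assumption for illustration (the site names, the PR values, the budget, the function name); it is not a description of how GoogleBot actually works.

```python
import heapq

def crawl(seeds, pr, links, budget):
    """Fetch pages in descending PR order, stopping after `budget` fetches.

    seeds:  urls the crawler already knows and trusts
    pr:     dict of url -> PR value (hypothetical numbers)
    links:  dict of url -> list of urls that page links to
    Returns (fetched, twilight): pages actually fetched, and pages that
    were discovered but never fetched before the wind-down.
    """
    # heapq is a min-heap, so store negative PR to pop the highest PR first
    frontier = [(-pr[url], url) for url in seeds]
    heapq.heapify(frontier)
    seen = set(seeds)
    fetched = []
    while frontier and len(fetched) < budget:
        _, url = heapq.heappop(frontier)
        fetched.append(url)
        # Outbound links are only read from pages that were fetched
        for target in links.get(url, []):
            if target not in seen:
                seen.add(target)
                heapq.heappush(frontier, (-pr[target], target))
    twilight = sorted(seen - set(fetched))
    return fetched, twilight

# Hypothetical mini-web: one old hub, one established site, one newer site
pr = {
    "old-hub.example": 8,
    "established.example": 6,
    "newer-site.example": 4,
    "newer-site.example/contact": 3,
}
links = {
    "old-hub.example": ["established.example", "newer-site.example"],
    "established.example": ["newer-site.example/contact"],
    "newer-site.example": ["newer-site.example/contact"],
}
fetched, twilight = crawl(["old-hub.example"], pr, links, budget=2)
print("fetched: ", fetched)
print("twilight:", twilight)
```

In this toy run the newer site's index page is known about but never fetched, so its own internal links are never read, while its contact page is still discovered through a link from elsewhere. That is the kind of partial picture the twilight zone theory supposes.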
IS IT A PROBLEM?
I also wonder whether Google sees all of this as a problem. I certainly do. Problem (a) is clearly visible to the searching public. They DON'T want to be presented with the links page for example when they enter a site! That is a poor search experience.
Do they see (b) as a problem? Again, I do. Sites are being filtered out when they have no duplicate content. Something isn't right. Google is omitting some outstanding sites, which will be noticeable in some cases.
The combination of (a) and (b) and perhaps other less well publicized glitches gives a clear impression of instability to anyone watching the SERPS closely (and that's a growing body of people). Together they are also disaffecting many webmasters who have slavishly followed Google's content-content-content philosophy. As I implied the other day, if following the Google content/link line gets them nowhere at all, they will seek other SEO avenues, which isn't good for Google in the long term.
WHY HAVE A TWILIGHT ZONE?
Some people speculate that there is a software flaw (the old 4 byte / 5 byte theory for URL IDs) and that consequently Google has a shortage of address space with which to store all the unique URL identifiers. Well... I guess that might explain why a temporary zone is appealing to Google. It could well be a device to get around that issue whilst it is being solved. Google though has denied this.
However, it may equally be a symptom of the algorithmic and crawler changes we have seen recently. Ditching the old DeepBot and trying to cover the web with FreshBot was a fundamental shift. It is possible that for the time being Google has given up the chase of trying to index the WHOLE web... or at least FULLY index it at once. Possibly we are still in a transit position, with FreshBot still evolving to fully take on DeepBot responsibility.
If the latter is correct, then the problems above may disappear as Freshbot cranks up its activity (certainly (a)). In the future the 'wind down' may occur after 3BN, and then 4BN.... problem solved... assuming the twilight zone theory is correct.
At present though those newer (eg: 12 months+) links may be subject to ‘news’ status, and require refreshing periodically to be taken account of. When they are not fresh, the target site will struggle, and will display symptoms like sub-pages ranking higher than the index page. When they are fresh, they will recover for a time.
Certainly evidence is mounting that we have a more temporary zone in play. Perhaps problem (b) is simply an overzealous filter (very overzealous indeed!). However, problem (a) and other issues suggest a range of instability that affects some sites and not others. Those affected all seem to have the right characteristics to support the theory: relatively new link structure and/or not high PR.
The question that many will no doubt ask is that, if this is correct…. how long will it last? Obviously I can’t answer that. All I have put forward is a proposition based upon a reasonable amount of data and information.
I must admit, I do struggle to find any other explanation for what is currently happening. Brett’s ‘algo tweak’ suggestion just doesn’t stack up against the instability, the site selection for that instability, or the non-application to longer established sites.
The above theory addresses all those, but as ever…. if anyone has a better idea, which accounts for all the symptoms I have covered (and stands up against a volume of test data), I’m all ears. Maybe GoogleGuy wishes to comment and offer a guiding hand through these turbulent times.
>>But google loves to brag about how freshbot adds new content, so I don't see it going away. <<
Google's fresh results are not the freshest!
I have a web site that just started getting some inbound links a week ago. Altavista already has 208 pages of this site indexed, and has also found two of the pages that link to it. Google only has the index page.
Is that why Google is now listing Altavista as the # 1 search engine? :)
Couldn't resist chiming in...my site had white bar, then grey bar for a couple months, but ranked super-well. Now, PR is back to 5, but I've either totally dropped in ranks and/or completely disappeared. Google just started to list an old URL that it is pulling from dmoz (a million update url requests didn't seem to work on dmoz), which may have something to do with the current "brokeness" of Google. Someone else touched on the role of dmoz in this horrific update earlier in this thread, but didn't clarify. Anyone have ideas on this? BTW, my site has more content for users than any others...so my disappearance from the index can't be user-friendly! (My site has been ranking very well for over 6 months, and I've been around for years.)
[edited by: needhelp at 2:07 am (utc) on June 26, 2003]
Yep, Spica, it's true. AV, Ink, ATW, they have all had fresher serps than Google for several months... but Google brings in 80% of the SE traffic to my site, just like everyone else's. Thus, we have to continue to work with Google.
This thread is supposed to be addressing the index page problem...
I still think the Esmeralda thread should have been left alive to allow a catharsis wrt Google, to give a place where people could get it out of their systems.
Meanwhile, back to the "one to none" index situation...
Spica, Altavista isn't so important, read [news.com.com...] and you'll know which are the Web's 4 bigger players:
The main problem depicted on this thread is the fact that Google currently powers Yahoo and AOL.
So don't put so much effort into Altavista. Instead, try to get a good listing on MSN in case you're also feeling the Google blues. Cheers!
How do I change my user name to GrimmacingGordon?
I did everything according to the book (Google's and here). I made #1 with what is an immensely content rich site. I am now on page 7, with a guest book (with a Spammer's entry on it from my industry who made page 2 for their efforts) ahead of me. Yes, that's right, a guest book ranks on page 3, I am on page 7.
I am sorry GG, but I am giving in. I am going to follow suit and wear a Hormel badge. I have had enough. And I bet genuine surfers have too.
Spam for breakfast anyone?
needhelp, I was told by a reliable source that ODP (dmoz.org) is upgrading equipment this week.
I was told also that ODP is getting prepared to inform users about the upgrade. In the meantime, some ODP features are turned off.
The upgrade will last about a week.
zeus and nova: I agree. All of our keywords are looking great.
In less than 3 days I have had 2 brand new sites indexed, and they are showing in the SERPs. Not sure if this is the "new" thing, but it seems much faster than in the past.
GrinninGordon be patient for one more week.
U.S. District Court Judge Vicki Miles-LaGrange ruled in late May 2003: "PageRanks are opinions--opinions of the significance of particular Web sites as they correspond to a search query." "Accordingly, the court concludes Google's PageRanks are entitled to full constitutional protection."
First Amendment rules even if Google results are sometimes blue.
|How do I change my user name to GrimmacingGordon? |
I believe the proper procedure is to do a 301 redirect from Grinninn to Grimmacing. Hormel badge?
Hormel are the company that make SPAM (all caps - a trademarked term) - the canned luncheon meat of processed pork and ham, which Monty Python sent up in the infamous SPAM skit - and which has more recently been used to describe unsolicited commercial email and crappy search engine results....
GrinninGordon is off to make his own tinned luncheon meat...
> new SE visitors find the home page first
I can find 12 se referrals, all from logged in members to the home page out of *cough* several thousand referrals ;-) today.
The home page is nothing but a glorified site map. (-me 1996...2003).
I've always wanted to experiment with the "homepageless" site where the root of the .com/ address returned a standard 404 from apache. hmmm might have to do that to prove the point.
>I did everything according to the book (Google's and here). I made #1 with what is an immensely content rich site. I am now on page 7, with a guest book (with a Spammer's entry on it from my industry who made page 2 for their efforts) ahead of me. Yes, that's right, a guest book ranks on page 3, I am on page 7.
ROFLMAO. Yeah, I know to you it isn't funny. For a while I was being beat out by an Ultimate Search-type site on a SERP until Google squished that. Of late at Google spam has been ranked better than quality content. Google is broke, but we can hope it will be fixed.
|1. MSN |
Looking at this list, IMO Ink is on the way up.
Brett, that works in your business.
The problem is for OTHER businesses in which users normally first use a key entry keyword phrase.
It happens that many US citizens who are looking to retire to other countries mainly use one term for information: real estate.
Then, those US citizens associate that specific term with the name of the country. Then you get 90% or more US citizens looking for "country" plus "real estate". It doesn't matter what kind of "real estate" they are looking for.
Nevertheless, there are some webmasters that build other pages to catch additional terms such as "homes for sale", "condos for sale", "land for sale", etc. I'm one of those webmasters.
My problem is 90% of my users employ "country" plus "real estate" and if my index.htm is nowhere to be found, those users end up with Web sites made of doorways, link farms, duplicate content, multiple domains and out-dated content.
To get back on topic...
Msg 217, this thread. I have been trying to get feedback on this for several days. I have two directories that have snapshots of my index listed, and have live URL's that show www.mydomain.com/?theirdomain.com. In Ink, that bloody URL is #1 on a search of the two kw's of my domain name, which should show my index. It's doing a number on my index in Google too, I'm sure of it.
Hormel badges aside, (which is ironic because I'm about to eat a can of Jamaican bully beef which is not much different), am I wrong with my theory on my index problems?
Since Brett seems to have inside knowledge when it comes to Google ;) and has been quite vocal with his opinion on the loss of index pages, and GoogleGuy has suddenly been muzzled on the subject, I think, based on Brett's statements and the absence of GG, it is safe to say that Google must feel that what is happening right now is good, and we can expect it to continue this way going forward. After all, only special sites get to keep their glorified site maps in the SERPS.
No more update, get links and hope they count some day, in one day out the next because after all what is relevant one day surely can't be relevant the next and therefore should be gone.
A lot of the old timers have so many sites that if a few go down, it doesn't matter since their others are up. Also a great way to get links for yourself too ;)
I guess it's time to open up a can of whoop "spam", cook it the same way as the more experienced chefs and quit trying to figure out why playing by the rules gets you nowhere.
|I guess it's time to open up a can of whoop "spam", |
The bully beef was ok. I will try one last time for feedback on my index problem.
Will directory sites that present snapshots of your index page that have the form www.mydomain.org/?theirdomain.com cause duplicate content problems with an index page? I think the answer is yes, yes Stefan, you might be right. If someone knows that I'm wrong, I really want to know about it too. My last attempt at this...
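To illustrate the worry, here is a minimal sketch of how a crawler that treats every distinct URL as a separate page would see two addresses for the same index page, and how dropping the query string collapses them into one. The function name and the choice to discard the query string entirely are assumptions for illustration, not known Google behaviour.

```python
from urllib.parse import urlsplit, urlunsplit

def strip_query(url):
    """Return the URL with its query string and fragment removed."""
    parts = urlsplit(url)
    path = parts.path or "/"
    return urlunsplit((parts.scheme, parts.netloc.lower(), path, "", ""))

urls = [
    "http://www.mydomain.org/",
    "http://www.mydomain.org/?theirdomain.com",
]

distinct_raw = set(urls)                               # as listed by the directory
distinct_canonical = {strip_query(u) for u in urls}    # after normalising

print(len(distinct_raw), "distinct URLs as crawled")
print(len(distinct_canonical), "distinct URL(s) after stripping the query string")
```

If the engine does not normalise the two addresses, it holds two URLs serving identical content, which is exactly the situation a duplicate content filter is built to act on.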
"Of late at Google spam has been ranked better than quality content. Google is broke, but we can hope it will be fixed."
I am glad you said that rfgdxm1. GG made a comment not too long ago about the arrogant necessity to "take with a grain of salt" the outrageous posts by members who were not more established or have such a respected history with WebmasterWorld.
But when a member with more than 2,900 posts says you are FUBAR, Google - you are FUBAR.
But like Kennedy and the Bay of Pigs, the Googleplex obviously has a foul groupthink problem. "Everything's great, nothing's wrong. - Let's change the rules and not tell webmasters. So what if we have crud serps? We're Google. Everybody loves us. Great idea!"
The only comfort I receive is knowing there are 100s if not 1,000s of similar fated webmasters out there.
Fast forward six months: I hope Yahoo has adopted Inktomi and AOL has decided to pay less for better serps from FAST.
I have about 18 sites that are important to me.
On every update, at least a couple of the sites take some kind of major hit. I've never been banned, but I've sure lost rankings for no logical reason.
Also on every update, I have more pages added to the Google index. I now have thousands of pages that are getting hits.
On the last update, my #1 site that has been #1 for 2½ years was overtaken by a site using a technique that goes against the Google Webmaster recommendations. This month, I can't seem to find that site. What a shame! Let this be a warning: go against Google's TOS in a big way, and it WILL catch up to you.
This month my #3 site's index page is not appearing in the SERPS. Fortunately, this site has over 600 pages and I'm still getting a fair amount of hits.
Moral of the story:
If you keep producing content, stay informed about SEO practices and be patient, you'll be happy in the end. Finally, by having several sites, you'll probably never see a complete loss, and most likely you'll always win overall.
My requests for feedback are being submerged by the reactions of those who are frustrated with Google. So it goes.
Because I now officially give up asking, I would like to comment on the term, "twilight zone", as I am, of all the members of WW, probably most knowledgeable on this subject.
The twilight zone was not a term invented by Rod Serling; it refers to the outer part of a cave where there is some illumination during the daytime hours. This part of a cave will be inhabited by species referred to as "troglophiles"; they are happy in the dark but don't mind a little light. Further into the cave, you enter the realm of permanent darkness; here, the species found are referred to as "troglobytes" and they require total darkness. If the cave dwelling species is human, whether philes or bytes, they are referred to as "troglodytes".
I leave it up to those posting in this thread to decide in which category they belong, but would strongly suggest that we have left the twilight zone behind several pages ago. Guidance.
Google might have devalued index pages on purpose, (I doubt it)
Think that will cause more problems than it will solve, i.e. look for split or mini 'index' pages, more fragmented and unstable serps etc.
Oh, I am diversified, fortunately. I learned that lesson in September. Unfortunately, one can never be too diversified. I'm building new helper sites (original, useful content) and adding more pages all the time. By the end of this month it will total 100 new pages. Not doorways or spam either. Good content pages.
Stefan, sorry you are not getting the reply to your question but I don't have the answer. It's late in the US, most people are in bed.
--Will directory sites that present snapshots of your index page that have the form www.mydomain.org/?theirdomain.com cause duplicate content problems with an index page?--
I don't think anyone has an answer to what you ask at this point or they would have answered. With the way Google is acting right now, I don't think any conclusions can be drawn about how any index page will be treated for their directory.
People are dumping their frustration here because the mods won't allow other threads of that nature to start.
As for being a philes or bytes, I couldn't tell you. I was hatched from an egg after the mother ship dropped me off;)
I don't think any of us are going to get an answer to the questions. This topic of index pages has been avoided by GG, why?
Maybe because it's too early to tell what's really happening? Maybe because this is the way it's going to be, cheat to win, just like in the real world where dishonest crooks make the laws? Maybe they don't know why so many perfectly legit sites are being crushed by B.S. pages? Maybe they just want to shake things up and lose contracts and part of their market share so that when they go public they can improve and show stockholders they're getting better? Or maybe they know that this is hurting a lot of people who worked hard to succeed and they don't want to take the abuse when they tell us what's happening?
I mean, so you play by the book and your company folds because you lose most of your traffic for 2-3 months. Don't worry, there are plenty of blank, spam, duplicate sites and off topic pages for Joe Surfer to go to. Maybe Joe will start using different key phrases to find what he wants, maybe he'll turn to another search engine?
I can personally vouch for a few businesses and dozens of people, their families and employees that need to make some tough decisions over the next couple of weeks. A little "heads up" would be nice. I'm not crying about this ruining businesses and people's lives; they will have to try and find other ways to survive.
Some of them will spam, I doubt google wants the wake of spam that will happen if things don't improve.
Some of them will find different jobs or change the focus of their business or just do something else.
It's just like real life, not everything is fair. I mean, who has more money? the small business owner who works his life away and sticks to his morals to raise his family and put his kids through school, or the crack dealer in the Hummer that shot two kids last month?
Ok Stefan - this is what I think.
1. Most people have higher pagerank on their homepage.
2. Many people's sites are just like a whitepages telephone listing - people who only get found for their own company name - and their one pet search term/ phrase - and the traffic comes in via the 'home' page.
3. What I think Brett was politely trying to tell you a page or so back was that most people who have been doing this for a while are getting 'yellow pages' traffic (ie surfers looking for a 'category' or 'topic' - rather than a specific company/ person) - which comes in 'every other way' into the site.
4. My proposition?
IF Google has played with the algorithm - which we all believe (although we don't know or agree on what has changed - but 'something' has) and lets say - Google actually devalued the scoring of pagerank (ie reduced the importance of Pagerank in the formula - not changed the numbering on the toolbar) - which page would that most affect on your site?
Your homepage? Where all your white pages and 'pet phrase' traffic came from?
And why did it make 'less' impact on more established sites? Maybe because their traffic was already being driven 'yellow pages' style?
I don't know - but that's my guess.
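A toy illustration of that proposition, assuming a ranking score that is a simple weighted blend of PageRank and on-page relevance. The formula, the weights, the page names and every number here are made up for illustration; nobody outside Google knows the real formula.

```python
def score(pagerank, relevance, pr_weight):
    """Hypothetical blended score: pr_weight controls how much PR counts."""
    return pr_weight * pagerank + (1 - pr_weight) * relevance

# Made-up values on a 0..1 scale: the homepage holds most of the PR,
# the sub-page is the better on-page match for the search phrase.
pages = {
    "index.html": {"pagerank": 0.9, "relevance": 0.4},
    "kidswidgets.html": {"pagerank": 0.3, "relevance": 0.8},
}

def ranking(pr_weight):
    """Return page names ordered best-first for a given PR weight."""
    return sorted(
        pages,
        key=lambda name: score(pr_weight=pr_weight, **pages[name]),
        reverse=True,
    )

print("PR weighted heavily:", ranking(0.6))
print("PR devalued:        ", ranking(0.2))
```

With these made-up numbers, lowering the weight on PageRank is enough to flip the sub-page above the homepage, without anything on the site itself changing.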
The whole issue of devaluing index pages and index pages being glorified sitemaps is not the case a lot of the time.
In this example, assume "widgets" is actually a 2 word, semi-competitive (1000 searches a day) key phrase in a niche market.
The widgets.com index page is made user friendly and search engine friendly for the keyword "widgets".
widgets.com is divided into several specialized sections of products and information - redwidgets.html - bluewidgets.html - greenwidgets.html - kidswidgets.html - menswidgets.html - widgetinfo.html - widgetcomparisions.html - widgetreviews.html - widgetfacts.html - contactus.html - etc.
The index page welcomes the visitor, explains briefly what the site has to offer and visually or literally shows them the different categories of widgets and widget information they offer as well as information about the widget company.
Here's what doesn't make sense. The site has been in the top 10 for widgets (in this case widgets is a 1000+ a day term, not extremely competitive, but substantial). A couple of months ago, the "new technology" decided to change the way it indexes widgets.com
Now searchers that type in widgets, get top 10 results including:
"Website Design & Search Engine Optimization"
with the word widgets, never appearing in the title or description, not on the page or in any form.
widgets.com is not completely gone, at #67 or so, he has widgets.com/index.html
for the search term: kids widgets
he has a completely different page of his site showing up, let's say menswidgets.html, even though he has a dedicated kidswidgets.html page with the exact phrase incorporated into the page, title, description, etc.
The thinking proposed earlier was that Joe Surfer would know how to find his way from the mens widgets section to the kids widgets section, thereby eliminating the need for this user friendly front page.
I read on another thread about how much Joe Surfer is paying attention when Brett pointed out that:
www.yahoo.com is the #1 alltime most popular phrase searched for at Yahoo!
So which one is it? Are index pages useless glorified sitemaps that should not show up for non specific search terms, like widgets? Because although Joe can't figure out where to type in a url, he can somehow assume that this website has the content he's looking for even though everything on the screen is slightly off topic.
Is Joe going to hit the back button and find some spam? oh look! another website's index page? One of four copies of it spread across multiple domains, all of which seem to be doing just fine with their index pages? huh?
Is there something freaking wrong here?
[edited by: my3cents at 10:02 am (utc) on June 26, 2003]
from what I watch I would say that the taking account of pagerank in the SERPS is still going on.
I think it is too early to discuss.
Agreed... something has slipped for sure, but (fingers crossed) Google will sort this out and we will all see the good old relevant SERPS once again. It is painful to watch and I do find it really annoying not knowing where I stand. If I am #120 - I analyse and re-optimise. If I am #1 - I leave well alone. It's the not knowing that gets me down.
Stefan, if you mean a link like this: /out.php3?ID=402 I have had so much trouble with G confusing one of these links with my index page that I wrote to the site owner and asked them to just take the link down.
|Will directory sites that present snapshots of your index page that have the form www.mydomain.org/?theirdomain.com cause duplicate content problems with an index page? |
I certainly wouldn't let anyone mirror my sites like this. Even if they were disallowed from being crawled, it's more of a risk than I'd be willing to take.
|from what I watch I would say that the taking account of pagerank in the SERPS is still going on. |
I think it is too early to discuss.
I don't agree. Top 10's seem relatively solid from here.
Getting back to the index page issue, I don't agree that index pages are glorified site maps...at least much of the time they are not. But where they are, they should be treated as such. Unfortunately, I see painfully bad examples of these glorified site maps ranking well on competitive terms. There are sites that rank in the top 10 for 2M+ keyphrases with homepages that are 95% anchor text. Google isn't clamping down on these, so I don't think we can attribute this "lost index page" phenomenon to any attempt to address this issue.
It seems to me Google should be able to discern the difference between a content-rich homepage and a site map homepage. Markup to plain visible text ratios would seem a dead giveaway, but then some bigtime corporate sites are seriously markup-heavy...and god forbid the likes of CNN and MSN should have to tackle their code-bloat to stay relevant. Remember, controversy is bad for IPOs. Still, a set of filters looking at visible text should be able to find a nice threshold at which point an index file has no usable content of its own. There's no reason the index file can't pass some nice PR/anchor text on to internal pages with real content (if there is any), but some sites are glorified site maps.
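As a rough sketch of the kind of filter I mean, the following measures two things for a page: how much of the raw HTML is visible text, and how much of that visible text sits inside links. The class and function names are made up, the sample pages are invented, and any threshold you pick would be a guess; this is only my illustration of the idea, not anything Google is known to do.

```python
from html.parser import HTMLParser

class TextStats(HTMLParser):
    """Count visible characters, and how many of them are anchor text."""

    def __init__(self):
        super().__init__()
        self.visible = 0      # non-whitespace characters a visitor can read
        self.anchor = 0       # the subset of those that sit inside <a> tags
        self._in_anchor = 0   # nesting depth inside <a>
        self._skip = 0        # nesting depth inside <script>/<style>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._in_anchor += 1
        elif tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag == "a" and self._in_anchor:
            self._in_anchor -= 1
        elif tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if self._skip:
            return  # script and style bodies are not visible text
        count = len("".join(data.split()))
        self.visible += count
        if self._in_anchor:
            self.anchor += count

def page_ratios(html):
    """Return (visible text / total markup, anchor text / visible text)."""
    parser = TextStats()
    parser.feed(html)
    parser.close()
    text_ratio = parser.visible / len(html) if html else 0.0
    anchor_share = parser.anchor / parser.visible if parser.visible else 0.0
    return text_ratio, anchor_share

# Invented examples: a link-only homepage and a page with some real copy
sitemap_page = '<html><body><a href="/red.html">Red</a><a href="/blue.html">Blue</a><p>Hi</p></body></html>'
content_page = "<html><body><p>Plenty of plain words about widgets here.</p></body></html>"

print(page_ratios(sitemap_page))
print(page_ratios(content_page))
```

A homepage where nearly all the visible text is anchor text is the "glorified site map" case; a markup-heavy corporate page would instead show up as a low text ratio with a normal anchor share, which is why the two numbers are kept separate.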
Dolemite... I guess I will have to disagree that everything is solid. I'm seeing completely different results on www2 and www3 than on www, and 4 datacenters are different than the others. As far as I can see, either we are in a rolling update mode, or the update has not completed. I am also seeing old results on the Directory.
I still think it's a little too early to call this one finished.