|Lost Index Files|
| 11:10 am on Jun 23, 2003 (gmt 0)|
Some people may not like this post, and criticism of it would not at all surprise me. However, I suggest that before reading it you ask yourself whether Google actually DESIRES the current fluctuating situation. Is it where they actually want to be? Is it what they want to project to webmasters and the public?
Against that background perhaps the following analysis and theory may fall into place more easily.
DATA ANALYSIS AND BACKGROUND
Last week I posted a message requesting members to sticky mail details of their own specific situations and sites with respect to the fluctuations.
After spending days analyzing, and watching the picture continue to change before my eyes, I eventually found a theory to hang my hat on. No doubt it will be challenged, but at least it currently fits the data bank I have (my own sites, plus a third party observation set I use, plus those that were submitted to me by the above).
Two general phenomena seem to be dominating the debate:
a) Index pages ranking lower than sub-pages for some sites on main keyword searches
b) Sites appearing much lower than they should on main keyword searches, yet ranking highly when the &filter=0 parameter is applied.
These problems are widespread and there is much confusion out there between the two (and some others).
The first has probably attracted most attention, no doubt because it is throwing up such obvious and glaring glitches in visible search returns (eg: contact pages appearing as the entry page to the site). The second is less visible to the searcher because it simply torpedoes the individual sites affected.
By the way, in case anyone is still unaware, the &filter=0 parameter reverses the filter which screens out duplicate content. Except it does more than that.... it is currently screening out many sites for no obvious reason (sites that are clearly clean and unique).
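For anyone who wants to compare the two result sets side by side, here is a minimal sketch of building the filtered and unfiltered search URLs. The only thing taken from this post is the &filter=0 parameter; the endpoint constant, the `search_url` helper and the `rank_of` helper are hypothetical names for illustration, and the result lists are assumed to be collected by hand.

```python
from urllib.parse import urlencode

# Assumed search endpoint; only the filter=0 parameter comes from the post.
BASE = "http://www.google.com/search"


def search_url(query, unfiltered=False):
    """Build a search URL, optionally appending filter=0."""
    params = {"q": query}
    if unfiltered:
        params["filter"] = "0"
    return BASE + "?" + urlencode(params)


def rank_of(site, results):
    """Return the 1-based position of the first result URL containing
    `site`, or None if the site does not appear in the list."""
    for position, url in enumerate(results, start=1):
        if site in url:
            return position
    return None


if __name__ == "__main__":
    print(search_url("blue widgets"))
    print(search_url("blue widgets", unfiltered=True))
```

A site that ranks far higher in the list gathered from the second URL than in the list from the first would be showing the symptom described in (b).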
So why is all this happening? Is there a pattern, and is there a relationship between these two and the other problems?
Well at first I wrestled with all sorts of theories. Most were shot down because I could always find a site in the data set that didn't fit the particular proposition I had in mind. I checked the obvious stuff: onsite criteria, link patterns, WHOIS data... many affected sites were simply 'clean' on anyone's interpretation.
Throughout though, there was the one constant: none of the sites affected were old (eg: more than 2 years) or at least none had old LINK structures.
This seemed ridiculous. There would be no logic to Google treating newer sites in this manner and not older ones. It is hardly likely to check the date when crawling! But the above fact was still there.
I have been toying with all sorts of ideas to resolve it... and the only one that currently makes any sense is the following.
THE GOOGLE TWILIGHT ZONE
In addition to WebmasterWorld I read a number of search blogs and portals. On one of these (GoogleWatch) a guy called Daniel Brandt quotes GoogleGuy as stating: "That is, we wind down the crawl after fetching 2B+ URLs, and the URL in question might not have been in that set of documents".
Now, assuming that is true (and it's published on the website so I would imagine it isn't invented), or even partially true, all sorts of explanations emerge.
1) The 2BN+ Set
If you are in here, as most long standing and higher PR sites will be, it is likely to be business as usual. These sites will be treated as if they were crawled by the old GoogleBot DEEP crawler. They will be stable.
2) The Twilight Set
But what of the rest? It sounds like Google may only have partial data for these, because the crawlers 'wound down' before getting the full picture. Wouldn't THAT explain some of the above?
To answer this question we need to consider Google's crawling patterns. One assumes that they broadly crawl down from high PR sites. They could also crawl down from older sites, sites they know about and sites they know both exist and are stable. That too would make sense.
You can probably see where this is heading.
If your site or its link structure is relatively new, and/or say PR5 or below, you may well reside in the twilight zone. Google will not have all the data (or all the data AT ONCE) and you will be experiencing instability.
I have sites in my observation set that enter and exit both the problem sets above (a) and (b). It's as though Google is getting the requisite data for a period and then losing some of it again. As if the twilight zone is a temporary repository, perhaps populated and over-written by regular FreshBot data.
The data most affected by this is the link data (including anchor text) – it seems to retain the cache of the site itself and certain other data. This omission would also partially explain the predominance of sub-pages, as with the loss of this link data there is nothing to support the index above those sub-pages (Google is having to take each page on totally stand alone value).
IS IT A PROBLEM?
I also wonder whether Google sees all of this as a problem. I certainly do. Problem (a) is clearly visible to the searching public. They DON'T want to be presented with the links page for example when they enter a site! That is a poor search experience.
Do they see (b) as a problem? Again, I do. Sites are being filtered out when they have no duplicate content. Something isn't right. Google is omitting some outstanding sites, which will be noticeable in some cases.
The combination of (a) and (b) and perhaps other less well publicized glitches gives a clear impression of instability to anyone watching the SERPS closely (and that's a growing body of people). Together they are also disaffecting many webmasters who have slavishly followed their content-content-content philosophy. As I implied the other day, if following the Google content/link line gets them nowhere at all, they will seek other SEO avenues, which isn't good for Google in the long term.
WHY HAVE A TWILIGHT ZONE?
Some people speculate that there is a software flaw (the old 4 byte / 5 byte theory for URL IDs) and that consequently Google has a shortage of address space with which to store all the unique URL identifiers. Well... I guess that might explain why a temporary zone is appealing to Google. It could well be a device to get around that issue whilst it is being solved. Google though has denied this.
However, it may equally be a symptom of the algorithmic and crawler changes we have seen recently. Ditching the old DeepBot and trying to cover the web with FreshBot was a fundamental shift. It is possible that for the time being Google has given up the chase of trying to index the WHOLE web... or at least FULLY index it at once. Possibly we are still in a transit position, with FreshBot still evolving to fully take on DeepBot responsibility.
If the latter is correct, then the problems above may disappear as Freshbot cranks up its activity (certainly (a)). In the future the 'wind down' may occur after 3BN, and then 4BN.... problem solved... assuming the twilight zone theory is correct.
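As a rough check on the address-space speculation, here is the arithmetic behind the 4 byte / 5 byte theory. It is only a sketch: it assumes URL IDs are unsigned fixed-width integers with every value usable, which is part of the speculation and not anything Google has confirmed. The function name `id_capacity` is made up for illustration.

```python
def id_capacity(num_bytes):
    """Number of distinct IDs an unsigned integer of num_bytes can hold
    (assumes every bit pattern is a usable ID)."""
    return 2 ** (8 * num_bytes)


if __name__ == "__main__":
    for width in (4, 5):
        print(f"{width}-byte IDs: {id_capacity(width):,}")

    # Crawl sizes mentioned in the post: 2BN now, perhaps 3BN and 4BN later.
    for crawl in (2_000_000_000, 3_000_000_000, 4_000_000_000):
        fits = crawl <= id_capacity(4)
        print(f"{crawl:,} URLs fit in 4-byte IDs: {fits}")
```

Under those assumptions a 4-byte ID covers about 4.29 billion URLs and a 5-byte ID about 1.1 trillion, so the theory only bites if the count of known URLs (not just crawled ones) is pushing past the 4-byte limit.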
At present though those newer (eg: 12 months+) links may be subject to ‘news’ status, and require refreshing periodically to be taken account of. When they are not fresh, the target site will struggle, and will display symptoms like sub-pages ranking higher than the index page. When they are fresh, they will recover for a time.
Certainly evidence is mounting that we have a more temporary zone in play. Perhaps problem (b) is simply an overzealous filter (very overzealous indeed!). However, problem (a) and other issues suggest a range of instability that affects some sites and not others. Those affected all seem to have the right characteristics to support the theory: relatively new link structure and/or not high PR.
The question that many will no doubt ask is that, if this is correct…. how long will it last? Obviously I can’t answer that. All I have put forward is a proposition based upon a reasonable amount of data and information.
I must admit, I do struggle to find any other explanation for what is currently happening. Brett’s ‘algo tweak’ suggestion just doesn’t stack up against the instability, the site selection for that instability, or the non-application to longer established sites.
The above theory addresses all those, but as ever…. if anyone has a better idea, which accounts for all the symptoms I have covered (and stands up against a volume of test data), I’m all ears. Maybe GoogleGuy wishes to comment and offer a guiding hand through these turbulent times.
| 5:56 pm on Jun 25, 2003 (gmt 0)|
Yeah, but anyone with imaginative thinking skills can run spam without Google ever figuring out where it comes from. I never bothered to because I wanted to play by the rules. I followed Brett's guidelines to a strong site in Google and got shot in the neck. Since they only deal with spam through their joke of an algo, one is almost bulletproof in running spam on Google.
Google is forcing career webmasters to "fly under the radar."
Google is so overrated. Webmasters have figured that out now and it's only a matter of time until the media has figured it out.
| 5:58 pm on Jun 25, 2003 (gmt 0)|
I'm sorry, but I don't see all those bad results, and I do see index pages ranking better than yesterday on www and www2.
| 5:58 pm on Jun 25, 2003 (gmt 0)|
|>Seriously, if no stability or PR calculation occurs soon, I'd expect alot of people to start spamming just to stay ahead. |
This to me is a *serious* concern. Before, the idea was that if you built a solid site, and played by the rules, over time likely you would do well. At the moment, it looks to me like things are largely random. Nothing is predictable. Thus, best to toss out a lot of spam, and hope some will always make it to the top.
|Yeah, but anyone with imaginative thinking skills can run spam without Google ever figuring out where it comes from. |
Exactly, people are getting to be better GoogleSpammers. It's not that hard to think of something that the algorithms won't catch and looks legit for a couple months.
The more of it that's out there, the worse the index is, the worse things get overall.
| 6:43 pm on Jun 25, 2003 (gmt 0)|
I think Zap is way off. You can talk to GG all you want. You just have to do it in a thread about adsense or the new Google toolbar or something else GG WANTS to talk about. He seems to be able to find time for those threads. AND Zap said:
>but punishing sites for something they have not stated is a violation of first amendment rights.<
I thought the court just said they were protected by the 1st amendment and they can screw anyone they want anyway they want for any reason they want and it's not unfair or restraint of free trade, it's merely their opinion. It's not to dictate behaviour of webmasters to give themselves an unfair advantage over competitors, or to encourage commercial webmasters to buy more adwords or to influence public opinion, it's just their opinion of the democratic nature of the web. They are just trying to provide a quality search experience and if it makes money fine but if it doesn't, well, it just doesn't because the users' search experience is all that really matters and not just Google's success.
| 6:49 pm on Jun 25, 2003 (gmt 0)|
My main index.htm page has been affected because of the current Google blues. Since the latest 24 Jun 2003 refresh started, the page is no. 58-59 in all data centers. The same page is nowhere to be found in SJ.
The only sin I committed was to repeat "term term" a few times because current site no. 1 uses doorways and current site no. 2 uses LinksToYou. Besides, a few sites that haven't changed content in two years show as well on the top ten.
However, I have to give credit to PageRank and respect what U.S. District Court Judge Vicki Miles-LaGrange ruled in late May 2003: "PageRanks are opinions--opinions of the significance of particular Web sites as they correspond to a search query." "Accordingly, the court concludes Google's PageRanks are entitled to full constitutional protection."
First Amendment rules even if Google results are wrong sometimes!
| 7:09 pm on Jun 25, 2003 (gmt 0)|
Am I really the only one here who thinks Google's rankings/SERPs today are good, or much better than the last two months? I don't see the problem with the index pages anymore. OK, there will always be a few.
| 7:27 pm on Jun 25, 2003 (gmt 0)|
I know exactly why my index page is having problems; it's because of two directories that have this www.mydomain.org/?theirdomain.com both listed and live. It is causing a problem on the other SE's too. Are these dynamically generated? I am attempting to get the situation rectified without having to ask to be removed entirely.
If anyone wants, I can sticky a URL for a meta-SE that shows the problem quite nicely.
Others might have different reasons, but I'm sure that's what my problem is. Fortunately, not much of my SE traffic has ever come in on that intro page.
[edited by: Stefan at 7:29 pm (utc) on June 25, 2003]
| 7:27 pm on Jun 25, 2003 (gmt 0)|
I must add something else ... (Google, I still like you!)
My Web site position has been affected for almost one month. I thank God that June is slow for my business search results.
However, all theories presented in the current thread could lead to one issue: Google's implementation of Linux.
As I said, I like Google a lot because Sergey Brin and Lawrence Page had a great vision in 1995. They met at Stanford University and by year's end, Brin and Page collaborated to develop great technology that would become the foundation for Google.
However, they chose a free operating system.
Now that Microsoft is working to provide competition, it will be interesting to see the results. Finally, the world will be the witness of which operating system implementation is better.
[edited by: zafile at 8:02 pm (utc) on June 25, 2003]
| 7:45 pm on Jun 25, 2003 (gmt 0)|
Troll elsewhere with that 'get what you pay for' crap.
If you actually did get what you pay for, all bloated microsoft junk would be given away free and linux would cost an arm and a leg.
That has nothing to do with the subject at hand here. So, perhaps you should start a linux bashing thread of your own, so people can properly educate you.
| 7:54 pm on Jun 25, 2003 (gmt 0)|
Troll elsewhere about Linux. Not viable as a mass market consumer OS. That requires a clean GUI. However, Google could not care less about that. As such, Linux is a very good choice for Google.
| 7:55 pm on Jun 25, 2003 (gmt 0)|
Napoleon wrote: "Some people may not like this post, and criticism of it would not at all surprise me. However, I suggest that before reading it you ask yourself whether Google actually DESIRES the current fluctuating situation. Is it where they actually want to be? Is it what they want to project to webmasters and the public?"
Then, a contributor posted the following alternatives:
Why are some index pages missing?
1. Google's Broke? Maybe, especially recent SEO
2. Filter of most optimized keyword? Maybe
3. Filter eliminating most used keyword on page? Maybe
4. Non-underlined and/or colored hyperlink? Maybe
5. Too far ahead of 2nd place listing? Maybe
6. Failure to read 301 Perm Redirect correctly? Quite Possibly
7. Fresh Bot Drop? No
8. Filter of Mouseover Hyperlinks? No
If option no. 1 is correct, one of the possible causes could be Google's implementation of Linux.
| 8:02 pm on Jun 25, 2003 (gmt 0)|
>If option no. 1 is correct, one of the possible causes could be Google's implementation of Linux.
I'm thinking more the code that Google is running, rather than the OS itself. It doesn't make a difference how good the OS is if you run buggy code on it. I definitely see clear signs that Google is partially broke. The question is whether things will get worse or better?
| 8:25 pm on Jun 25, 2003 (gmt 0)|
As a final perspective, I must say that my business can't afford one month of downtime. It's a commercial and serious business.
I will not consider paying for AdWords until ALL the spammy and outdated results are removed from the serps.
As soon as I see serps free of Web sites using doorways, link farms, duplicate content, multiple domains and others, only then I will suggest the owners of the Web sites I represent to pay for AdWords.
So Google, keep cleaning those results!
| 8:37 pm on Jun 25, 2003 (gmt 0)|
Dominic was a disaster - but by the end of that cycle some small recovery. Esmeralda was a slight improvement - better results hanging around on -fi. Today I check all datacenters & everything is back just as good as pre-Dominic on every datacenter - haven't seen that for a long while.
Hoping it lasts for at least a few days!
| 8:40 pm on Jun 25, 2003 (gmt 0)|
NovaW, I'm also seeing good results on www and www2 all day.
| 9:13 pm on Jun 25, 2003 (gmt 0)|
It's hard to judge the quality of the new index right now because of the flux.
Fi has very clean, relevant results, while some of the others seem a bit off.
| 9:25 pm on Jun 25, 2003 (gmt 0)|
At the moment mfishy, the real question is "which index?" There are 2 others floating around at the moment. -fi by far looks the best, at least for SERPs I check.
| 9:55 pm on Jun 25, 2003 (gmt 0)|
"As a final perspective, I must say that my business can't afford one month of downtime. It's a commercial and serious business.
I will not consider paying for AdWords until ALL the spammy and outdated results are removed from the serps."
Now THAT I really don't understand. If the serps are returning spammy, non relevant sites your AdWords will do great. We have them in a number of areas that are like that and are getting 4-5% CTR.
Sounds to me that if these are not your own sites you are a little embarrassed to go to your clients and tell them that you can't get them back on the first page. Bite the bullet and tell them Google is screwed up and start an AdWords campaign. If your business is "commercial and serious" stop depending on FREE results and join the rest of us who also have "commercial and serious" businesses with appropriate advertising and promotion budgets.
| 10:22 pm on Jun 25, 2003 (gmt 0)|
What exactly is Google doing?
I have noticed the quality of results fall over time, but supposedly Google has made improvements. News to me!
In fact I've never seen so much spam on Google, with what should be improved spam filters. All the while I'm seeing good quality pages drop from the results.
Google also has billions of webpages to crawl, yet my site still shows the description and title dating back to April despite being crawled many times since. What's the point in that?
I see people protecting Google, saying that they can still find what they are looking for. That's not entirely true: when you are getting dated results and pages not included (good pages at that), you are not getting the high quality results that you should.
Google is being very irresponsible. Many businesses rely on Google traffic, and for their site to disappear for no reason is wrong and a painful blow for site owners.
| 10:26 pm on Jun 25, 2003 (gmt 0)|
|At the moment mfishy, the real question is "which index?" There are 2 others floating around at the moment. -fi by far looks the best, at least for SERPs I check. |
fi looks good to me. :)
| 10:34 pm on Jun 25, 2003 (gmt 0)|
We've moved right off topic in this thread...
If anyone thinks I'm mistaken in msg 220 of this thread on why my index is having problems, I'd very much like to find out about it. It seems to explain my entire situation. All my other pages are doing great in Google; Esmeralda recovered my lost pages and added newer ones. I think the index page problem is on my end, (hope so anyway... easier to fix).
Zafile, ya can't count on any one SE for everything, man. The plex is controlling too many of the searches but this will change... then you'll have the same situation with the next monster that moves in, (i.e. the Vole). You have to use all the options to keep your head above water.
| 10:41 pm on Jun 25, 2003 (gmt 0)|
Tropical_Island, thank you for your input.
The point I want to stress is the following:
I will not pay to place AdWords in search results in which I have a page properly built with relevant content and without spam techniques. The reason that said page hasn't reached the Top Ten during the last month is the main topic of this thread.
Many theories have been posted: Google's broke, filter of most optimized keyword, filter eliminating most used keyword on page, etc.
I find it nonsensical at this point to pay for AdWords, because Google has probably failed in the implementation of its technology. Previous posts tend to confirm the failure.
Now, once I regain the previous ranking of the Web site, I might consider paying for AdWords to be placed in another category of search results. The business of one of the sites is real estate for which the pages are properly built. Why not pay for AdWords so the site will show under travel results?
Tropical_Island, what's important now is for Google to clean up its search results, apply the latest ODP data and provide excellent competition to Microsoft's upcoming search engine.
As I said before, the world will finally witness which technology implementation is better, Google's or Microsoft's. The time will tell. Cheers!
| 10:53 pm on Jun 25, 2003 (gmt 0)|
And then ...
Let the fraudulent Web sites that use doorways, link farms, duplicate content and multiple domains pay for the AdWords.
That makes sense!
| 11:12 pm on Jun 25, 2003 (gmt 0)|
Weeeeelllllll, okay, I'm sure Google Guy tried to be right, but this back to square one and hideously bad results with "fresh" spam everywhere and best/topical pages dropped to oblivion just can't be "adding data", unless you think dumping manure on the dinner table is adding good seasoning.
The results looked like they had improved drastically a couple days ago, spam sites disappearing, fresh garbage not absurdly rated, etc.
But now we are back to spam ruling (albeit not all spam types), and worse, fresh piffle ranking well. To me this has been the very worst aspect of everything. Google being broken is obviously not deliberate. They aren't *trying* to screw up their crawls or trying to rank contact pages above topical ones, but it sure does seem that "fresh" is a deliberate choice over quality and relevance. That to me is the important philosophy issue here. The brokenness is just a given now (except to those in truly serious denial).
Again, maybe showing this trash has a freaking point in terms of filtering it out (!), but at some point Google does have to fix itself, and actually seeing some progress on that front was nice, but we are going on two full months now of trash and that is pretty sad.
| 11:17 pm on Jun 25, 2003 (gmt 0)|
|But now we are back to spam ruling |
In the past when I reported spam to Google, they took care of the problem right away (within a day or two). I've made some additional spam reports about sites using hidden links AND tricky redirects AND the description in Google is different than what's found on the page. Google doesn't seem to have reviewed them yet.
I'm wondering if they are busy with other problems right now (like MISSING INDEX PAGES <--is this legible?) and don't have time to deal with the spam?
| 11:23 pm on Jun 25, 2003 (gmt 0)|
I agree that the majority of spam comes in with freshbot. But google loves to brag about how freshbot adds new content, so I don't see it going away. Whatever new spam filters google added, they removed good quality content index pages, but the freshbot spam is worse than ever. GOOGLEGUY, add your spam filters on fresh pages too! And put the index pages back!
| 11:27 pm on Jun 25, 2003 (gmt 0)|
"Q: What measures will be taken to combat Spam, will reporting help or will Google allow Spam to flourish?"
GoogleGuy: "I think shaadi asked this question. So shaadi, I think if you check the site that you reported for hidden text, you’ll find that it’s in the penalty box. Let me take this chance to talk a bit about the spam report system. From a webmaster point of view, it doesn’t take much time to file a report if you feel something unfair is going on. I would definitely read through our guidelines first to make sure it’s something that we would agree is spam: [google.com...] "
"Reporting a site that you feel is spamming certainly won’t hurt. Now let’s talk about what sort of actions Google might take. There are some blatant things that we may take immediate action on. For example, if off-topic porn shows up for a search on someone’s name, that’s often worth doing something on a short time-frame. I noticed that you also did some other reports of things like duplicate content and sites that may be mirrors. Those are the sort of things that we probably wouldn’t take manual action on; we would instead look at using that data to write better algorithms. Our ultimate goal is to improve quality using only automated algorithms. Those algorithms may take longer to get right, but the nice thing is that when they’re done, they can often shut down an entire type of spam. So it doesn’t hurt to do a spam report, it gives us feedback about how to improve our search, and many spam reports end up as data that we use when testing our new algorithms."
| 12:47 am on Jun 26, 2003 (gmt 0)|
My experience: the index page of one of my sites has been fluctuating wildly (page 2 to 6 to 1 to 6) over the last few updates for its most optimised keyword - widgets, but has been stable (page 3) for its secondary keyword - widget hire.
So I wondered if a filter of most optimised keyword might be the problem. But looking at the serps for widgets showed numbers 1,2 and 3 have 15%, 30% and 18% keyword density for widgets. I've got 10%. Why would they filter me and not them if there was a filter for most optimised keyword?
On the other hand, they are well-established sites in the index. Whereas most of my links are new - last few months. All of which seems to fit with Napoleon's original theory.
But then why is my secondary keyword not suffering?
It is hard to wait patiently when all this makes such a difference to business.
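For anyone wanting to reproduce the keyword density figures quoted above, here is a minimal sketch. It assumes density means occurrences of a single keyword divided by total words on the page, expressed as a percentage; tools differ on this definition, and `keyword_density` is just an illustrative name, not how Google measures anything.

```python
import re


def keyword_density(text, keyword):
    """Percentage of words in `text` that exactly match `keyword`
    (case-insensitive, single-word keywords only)."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    if not words:
        return 0.0
    hits = sum(1 for word in words if word == keyword.lower())
    return 100.0 * hits / len(words)


if __name__ == "__main__":
    sample = "Red widgets and blue widgets"
    print(keyword_density(sample, "widgets"))
```

Run over the visible text of each competing page, this gives comparable numbers for checking whether the top-ranked pages really are more keyword-heavy than the one that dropped.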
| 1:21 am on Jun 26, 2003 (gmt 0)|
Nope, the filtering makes no sense.
Perhaps the next GG question time we're allowed we can just ask the one question 'What the f*** is going on with your algo?'
Still no index file here. Not popping in and out, just out full stop. Other, older sites are fine, very little change, if anything doing slightly better. My only theory which is sticking is that sites that have only started to rank well in the last 6 months are losing their index pages.
This indicates a google problem and not a filter.
oh well. At least my own business site is #1 for primary keywords!
| 1:26 am on Jun 26, 2003 (gmt 0)|
|Perhaps the next GG question time we're allowed we can just ask the one question 'What the f*** is going on with your algo?' |
We're the advanced web professionals... we don't have to ask, we figure it out on our own. That's the point of WW.
ADDED: I meant "we" in the collective sense rather than myself :-)
| 1:51 am on Jun 26, 2003 (gmt 0)|
>>But google loves to brag about how freshbot adds new content, so I don't see it going away. <<
Google's fresh results are not the freshest!
I have a web site that just started getting some inbound links a week ago. Altavista already has 208 pages of this site indexed, and has also found two of the pages that link to it. Google only has the index page.
Is that why Google is now listing Altavista as the # 1 search engine? :)