Forum Moderators: Robert Charlton & goodroi
Seems to me that Matt's recent message confirms my theory. We're either all a bunch of moaning idiots with low quality sites with a few inappropriate, spammy links scattered here and there...or...
The more I think about it the more convinced I am that the missing pages problem is being caused by a Backlink/PR issue (see Msg #15).
Tying together all of the evidence from my own experience, and that of others gleaned from the forums, erroneous or out-of-date backlinks would explain all of the missing pages. The erroneous, or simply out-of-date, backlink information (which we cannot see) leads to insufficient PR (which we cannot see) and hence deep pages are not indexed.
We all know that a "link:www.mysite.com" does not show you the complete picture. But, since Big Daddy, it now shows just a tiny proportion of backlinks, far fewer than it used to show before Big Daddy. Why? Because either the backlink index hasn't been updated (and now dates back to mid 2005), or else because it has been updated, but the update process is buggy. Only a small handful of Google employees know which of these two possibilities is the case.
We know that the missing pages problem cannot be due to any kind of duplicate content filter, as some people are suggesting. If this were the case, then affected sites would see a proportion of their pages disappear. Some would lose 10%, some would lose 40%, and some would lose 95%. But that's not what we see. We see sites losing the vast majority of their pages or else losing no pages at all. The reason affected sites lose such high percentages of their pages is the hierarchical nature of a site. The number of pages increases with depth, and the artificially low PRs (based on inaccurate and/or out-of-date backlink data) prevent the deeper content from being indexed.
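To make the depth argument concrete, here is a toy sketch in Python. It is not Google's algorithm. The branching factor, the indexing threshold, and the rule that each page hands an equal, damped share of its score to its child pages are all assumptions picked purely for illustration.

```python
def indexed_fraction(home_score, branching=5, depth=4,
                     damping=0.85, threshold=0.01):
    """Fraction of a site's pages whose toy score stays at or above threshold.

    Hypothetical model: the site is a tree in which every page links to
    `branching` child pages, and each child receives an equal, damped
    share of its parent's score. Levels that fall below `threshold`
    are treated as "not indexed".
    """
    total = 0
    indexed = 0
    score = home_score
    for level in range(depth + 1):
        pages = branching ** level      # page count grows with depth
        total += pages
        if score >= threshold:
            indexed += pages
        score = damping * score / branching  # score passed down one level
    return indexed / total


if __name__ == "__main__":
    for home in (20.0, 2.5, 1.0):
        print(home, round(indexed_fraction(home), 3))
```

With these made-up numbers the site has 781 pages, 625 of them on the deepest level. If the home page score is high enough, every level clears the threshold and nothing is lost. Lower it just far enough for the deepest level to miss the cut and about 80% of the pages go at once; there is no setting in this model that loses only 10%.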
The fact that Big Daddy was kick-started from an index dating back to the middle of last year, not only explains why the backlink data might be stale, but it also explains why ancient pages keep popping up on various data centres.
As further evidence: try a "link:www.mysite.com" and compare it to a search for "www.mysite.com". In my case, the "link:" search shows just 6 results, only one of which is external to my site. The one external backlink probably pre-dates when Big Daddy's index was seeded. The "www.mysite.com" search, on the other hand, finds hundreds of results representing hundreds of internal and external backlinks. Why aren't these showing up in the "link:" search? Is it because "link:" searches are well known for not showing you the complete picture? Or, has that well-known fact simply been obscuring the true cause of all of the problems? Namely, that the backlinks are simply missing from Google's backlink index.
[edited by: tedster at 8:25 pm (utc) on May 17, 2006]
I started with Oracle. The first page I checked to see if it was indexed: nope, not there :-)
Home>TECHNOLOGY PRODUCTS>Technology Overview>Technology Home>Grid Computing (Not Indexed, PR8)
I didn't continue the test further. Checked out robots.txt to make sure that page isn't blocked for spiders. No robots or nofollow tags in the page header. I don't think Oracle could have copied the content. There is no reason why Google should deindex that page.
Guys, this is a serious matter. If the word goes out to the mainstream press, I shudder to think of Google's stock prospects. For whatever reason it was, Google please rein in BigDaddy. Your old infrastructure was just good enough.
I saw that several times both yesterday and today. For a search that has just one result, the "cached" and "similar pages" links were missing from the result.
I hit "reload" and they reappeared. I have seen it just three or four times now, and could not reproduce it immediately afterwards. It happened again many hours later. I did not identify whether it is one particular DC or not.
[edited by: g1smd at 7:09 pm (utc) on May 20, 2006]
hehehehe. Are you serious? You think the average investor will care if some pages aren't indexed?
I suggest you take some good site that has a deep navigation structure and check out a few deeper pages. I am sure you will find a few pages not indexed fairly easily.
So far, a common thread has been duplicate titles and meta descriptions, as well as low PR on internal pages, and very few inbound links to other than the main index page.
I also see a 160 page site with unique titles and meta descriptions, unique content per page, and a few incoming deep links to internal pages, sit there steady as a rock.
All the Best
Col :-)
Google's mission is to organize the world's information and make it universally accessible and useful.
Quantum Leap? Where's this quantum leap?
As I mentioned earlier, I was recently looking for "natural cures for specific cancer." Did I find it on G? nope. Y! and MSN? Yep.
Why would the public want "edited" results?
The public doesn't care if that page is PR1 on Bob's 5 page site with 3 incoming links.
They just want the answers.
Let me know when they index ALL the pages AND keep the spam on page 20, then we can talk....
Edited to add. The sad thing is, according to MC, I shouldn't link to that very useful site because none of MY sites are about "natural cures to cancer"... How ironic...
This suits Google just fine while they work this out, for they can't afford to let this go on as is.
Over the last few years many search engines came with really good concepts and USPs. But they couldn't compete effectively with Google, and one of the main reasons for that was Google's indexing power and keeping the index fresh, which is resource heavy. Now, by not doing the basic job of indexing the pages (the ones that have unique content), Google, I am afraid, is throwing away its USP.
I think Microsoft has come a long way in keeping their content fresh. All of my pages indexed in Microsoft have a cache date less than a week old.
Google has that whole supplemental index that is very old and outdated. Why have an index that is very old and outdated? That is a good question.
Today I found some of my site's pages cached in the supplemental index that have not been around for a year......
It must have been going on a long time!
Typical Google missing-the-point. Stop doing the important thing because something completely trivial is happening. It's interesting that they would admit doing such a blatantly stupid thing. Fix your index, don't worry (as much) about how pages display for a freaking site: search. Sheesh.
Instead, having the site: search working properly with or without a trailing slash should be our main concern...
There are a lot of newly created Supplemental Results with dates from 2005 July to 2006 March to be found now. These are for pages that have gone 404, or domain expired, since 2005 July and they represent the last known version that was online.
They are also for pages that have been edited since 2005 July, and the Supplemental Result shows the previous version of that page's content in the snippet, whereas a search for current content returns that same page as a normal result. In both cases you see a very recent cache of the page itself.
The Supplemental Results are there to let you see content that has recently been edited or removed from the web. They might also play a role in penalising certain types of domain-hopping spam sites that put the same content up in multiple places, and open new sites as soon as the old ones are penalised.
I'm still having the same problem everyone else is, though, with other sites. Pages dropping or not being listed at all. That's the main one I hope they fix ASAP. I've posted before, and I'll repeat, it's absolutely ridiculous that my page is a full detailed review of a product, and I can't get indexed at ALL, yet 6 out of the top 10 sites listed for the keyword are absolute garbage and unrelated, with just the keyword as one of the randomly generated words on the page.
That is not improving the Google search results, and the more situations like that where it happens, the more frustrated surfers will be, and the quicker they'll try other engines. It's why I went to Google in the first place, and it's why I'll use another engine soon.
It would be one thing if it's just me and I'm not optimized, but it's clear many are having the same issue, and if it's like Matt says where links to other sites play a big role, the downward spiral of Google listings will continue. The more sites drop, the fewer quality pages there will be linking to other quality pages, so more get dropped, and so on.
I really didn't think I'd see the day when MSN overtook Google for SE traffic to my site, but it's happened this week, and it's not really because MSN improved. It's because Google dropped so many pages, I'm only getting a handful of hits a day now. Please fix this soon!
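The downward spiral described in this post can be sketched as a small simulation. This is only an illustration under loud assumptions: the rule "a site stays indexed only while it has at least `min_inbound` links from other indexed sites" is hypothetical, as are the function name and the four-site example. Nothing here is taken from Google.

```python
def cascade(links, min_inbound, initially_dropped):
    """Return the set of sites still indexed once the drops stop spreading.

    links: dict mapping each site to the set of sites it links to.
    min_inbound: hypothetical minimum number of links from indexed
        sites that a site needs in order to stay indexed.
    initially_dropped: sites removed from the index at the start.
    """
    indexed = set(links) - set(initially_dropped)
    changed = True
    while changed:
        changed = False
        for site in sorted(indexed):
            # Count only links coming from sites that are still indexed.
            inbound = sum(1 for src in indexed if site in links[src])
            if inbound < min_inbound:
                indexed.discard(site)
                changed = True
    return indexed


if __name__ == "__main__":
    # Four sites linking in a ring: a -> b -> c -> d -> a
    ring = {"a": {"b"}, "b": {"c"}, "c": {"d"}, "d": {"a"}}
    print(cascade(ring, 1, []))     # nothing dropped, all four remain
    print(cascade(ring, 1, ["a"]))  # dropping one site empties the ring
```

In the ring example, removing a single site takes away the only link its neighbour had, which removes that neighbour's link to the next site, and so on until none are left.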
Google had a new spider update called BigDaddy that went around and collected info from about Dec ’05 to March ’06. Around March 28th, they switched over completely to their BigDaddy data. On that day, all hell broke loose as thousands of good sites disappeared from Google. BUT… <our competitor's Spam-rich> sites seemed to thrive.
Google has admitted that they had problems with storing all the data from the BD update. Which at first seems weird because so many sites were dropped.
You can do a search on most of the search engines to show all the pages of <a site> that they have in their indexes. On Google, like many others, you just type “site:http://www.yourdomain.com/” (without the quotes) into the search box. <Our site> was down to 3 pages from over 100, but they have been adding a couple a day for the past couple of days.
You can also do a “link:http://www.yourdomain.com/” search to find out who they have listed in their index that have links to our site. This troubled me because it would always say something like 1-8 out of about 24 matches. But there was no link down the bottom to either go to the second page or repeat the search with the omitted results included. It’s no secret that Google doesn’t let you know ALL the links they have toward you, but it was like they didn’t want us to see all the links for some reason. I tried this with other sites and found the same thing.
What some people don’t know is that Google has a special command (probably for their employees across the country/world) that does show ALL of the links to your site. It is “@:yourdomain.com” (without the quotes). I did this search a week ago. This search showed that we had 9,000 incoming links. That is just plain crazy. <Competitor's spammy> site had a normal amount: some directory listings, internal links, and his cross-linking between his different sites.
I looked through LOTS of the 9,000+ to look for patterns. This is important because if you have a sudden increase (nobody is able to define exactly how many links over what period of time constitutes a “sudden increase”), then Google can count that as a type of spamming their indexes.
Aside from the links I would expect to be there, there were thousands of pages that were nothing more than pages with PPC ads on them! To oversimplify a little, they broke down into roughly three categories:
1. Sites with Overture ads – more likely than not, these had a Google PR (pagerank) of ZERO. Of course, Overture is their competition.
2. Sites that were hosted/created by some company named Enom.com. A lot, but not enough to pursue a theory on those.
3. A HUGE number of sites hosted by GoDaddy.com that say “Coming Soon”. This is complete b.s.
Here is an example of one of the GoDaddy sites: [visibilitysystems.com...] <I replaced our keywords with some keywords unrelated to our business> Complete crap. Why would there even be such a site? Because as one fraudster put it “I can create thousands of websites automatically in just a few minutes” (Which really scares me because if they can create all these sites, I’m sure that they have the ability to make clicks once per hour in each topic and have them look like they come from different IP addresses.) This type of site is something that people have started to be able to automate the creation of.
Because 15% of web traffic is estimated to be generated by people actually typing urls into their address bar, there was $ to be made by people making sites like BestBy.com – basically you make a site close to a common well-known site with a typo. People come to your site by mistake, you choose keywords that will show the real BestBuy in the results through a PPC program – the customer gets where they want, but at the expense of BestBuy having to pay Google and the owner of the bogus site gets a commission. So, people have started to throw up thousands of sites at a time, if they make no money, they cancel them after three months. The ones that make money make so much money that it’s worth it to let them sit there.
So there are thousands of bogus sites like visibilitysystems listed above that point to our site. Surely Google doesn’t index these sites, right? My theory is that Google knows about them and may even be in on it. They made $6 billion last year. HALF of it was through “syndicated” PPC sites. So the real url I gave you above is even ranked by Google, as a PR 3. Almost all of these GoDaddy sites are PR 3-6 (most were 4’s). And if you change the part of the url that says “cell%20phones” to, say, some other high-paying term like “online%20gambling”, the url still works, so this will show as a backlink for the top 10 companies in all of the richest keywords in the Google PPC program.
Notice they even put a little directory at the bottom of every one of these pages in case you land there and see something else you were looking for. Google also acts as a registrar now (but not a hosting company) which also makes me suspicious that they may be in bed with GoDaddy. see [theregister.co.uk...]
So what are the options of what’s going on? Either Google did not know or they did. If they did, whether or not they were in on it, I believe that they did not realize that when they switched over to BigDaddy data at the end of March that it would affect all of their top customers in all the keywords that pay well. But it has certainly worked out for them either way, as sites with high-ranking keywords have had to rely on spending even more on Google’s PPC than before because so many of their (our) pages dropped out of the indexes.
You can either ratchet up the conspiracy theory or bring it down a few notches, but let me leave you with some of the kickers:
<our competitor's spammy> sites, which have not been dropped from the indexes, did not list these fake pages with urls to their sites. That would explain why it looks like <they're> Google’s darling right now.
I also found that there were 80+ GOOGLE blogs that are all identical that are similar to the generated GoDaddy pages above that point to our site.
If Google were really evil, this could be their way of trying to get us to stop using Overture (the PR-zero sites that point to BGF) but I doubt that’s the case; I think they just don’t count those at all – or else they’d have a PR.
As of today, a new group of <our> urls are in the Google index and within 2 days the # of urls linking to us has fallen from 9,000+ to under 900 (this has turned out to be a good thing for us)
I submitted info to Google 3 different ways, so I think they’re re-spidering the sites that have complained first - we're starting to see major results - finally.
The first was on the Google message boards; you need to list your complaint very politely under the post that says "We're listening".
The second was to do a "reinclusion request" on Google once we were down to only 3 urls listed.
The third was to send a detailed email to the email address provided in (Google's) Matt's blog.
9,000+ sites pointing to us? No wonder Google admitted to running out of room on their servers. This probably did catch them off guard.
I think they tried to do a quick fix by pulling some old data (the “supplemental results”) and putting it in temporarily so site owners would see something when they typed in the site: command. That’s exactly what happened to us.
The order that things happened:
1. Most of our pages were dropped
2. Supplemental results added
3. Site re-spidered after contacting Google
4. Supplemental results dropped
5. Real results being added
6. New additions to be assigned page ranks in near future.
7. Google decided to respider their supplemental results, so they added supplemental urls back in
8. The number of urls in our site jumped from 3 on Wed to 7 on Fri to over 90 today (Sat)
Basically, I think they know what they did, they need months to fix it, and they don’t want anyone to figure out what went wrong because any combination of the above scenarios implicates them in something that is probably worse than what they actually intended. I’ve tested other sites that said they’ve had problems and it all seems to fit.
There are also other pieces that make it an even more perfect fit, but this is the short version.
Be interested to find out if this scenario fits anyone else.....
This is what I don't get.
This is what you hear from Google employees: "we are working to reduce the spam links/spam sites getting indexed".
The problem is, the spam sites are doing just fine. Many of the rest of us who don't engage in their tactics aren't. I'm looking at keywords; I'm not even talking about the bug that Google acknowledged yesterday ("site:www.yourdomain.com" being broken).
In my niche, there are about 10 sites, 6 or 7 of which are spammers (one is borderline), and the other three are legit - we write our own content, we get links from large sites outside of the niche that find an occasional story of ours interesting, etc., etc.
Even though we compete, we exchange emails, occasionally link to stories on each other's sites, etc. The three of us had an IM discussion today, our first ever, because all three of us are having problems while the spammers in our niche aren't - we are dropping for our keywords fast.
In our conversation, one of the people suggested something that none of us would have even brought up months ago - all three of our sites are registered in our names or our company names - she suggested creating a site registered through a private proxy, no advertising, with some simple blog software and accounts for each of us, and we update it whenever we publish a story on our site.
It's very unsavory and uncomfortable and reeks of what the spammers are doing, but we actually considered it for a few minutes.
It's ridiculous that legitimate site owners might have to resort to such means, but apparently, according to Mr. Cutts, having a "fine" site is not enough, you have to have some kind of threshold of IBLs before you can be considered worthy enough to index. Guess Google will have to drop their claims about indexing the whole web, and instead just claim to index part of the web.