| This 213 message thread spans 8 pages: < < 213 ( 1 2 3 4 5 6  8 ) > > || |
|Pages Dropping Out of Big Daddy Index|
< continued from [webmasterworld.com...] >
Seems to me that Matt's recent message confirms my theory. We're either all a bunch of moaning idiots with low quality sites with a few inappropriate, spammy links scattered here and there...or...
|The more I think about it the more convinced I am that the missing pages problem is being caused by a Backlink/PR issue (see Msg #15). |
Tying together all of the evidence from my own experience, and that of others gleaned from the forums, erroneous or out-of-date backlinks would explain all of the missing pages.
The erroneous, or simply out-of-date, backlink information (which we cannot see) leads to insufficient PR (which we cannot see) and hence deep pages are not indexed.
We all know that a "link:www.mysite.com" does not show you the complete picture. But, since Big Daddy, it now shows just a tiny proportion of backlinks. Way fewer than it used to show before Big Daddy. Why? Because either the backlink index hasn't been updated (and now dates back to mid 2005), or else because it has been updated, but the update process is buggy. Only a small handful of Google employees know which of these two possibilities is the case.
We know that the missing pages problem cannot be due to any kind of duplicate content filter, as some people are suggesting. If this were the case, then affected sites would see a proportion of their pages disappear. Some would lose 10%, some would lose 40%, and some would lose 95%. But that's not what we see. We see sites losing the vast majority of their pages or else losing no pages at all. The reason affected sites lose such high percentages of their pages is because of the hierarchical nature of a site. The number of pages increases with depth, and the artificially low PRs (based on inaccurate and/or out-of-date backlink data) prevent the deeper content from being indexed.
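To make the depth argument concrete, here is a toy sketch in Python. It only illustrates the theory above and is not Google's algorithm: the branching factor, the damping value, the score formula and the indexing threshold are all made-up assumptions.

```python
# Toy model (all numbers are hypothetical): every page links to `branching`
# child pages, and a page's score shrinks by damping / branching per level.
# A page is treated as "indexed" only if its score reaches `threshold`.

def indexed_fraction(home_score, branching=5, max_depth=4,
                     damping=0.85, threshold=1.0):
    """Return the fraction of the site's pages whose toy score passes the threshold."""
    total = 0
    indexed = 0
    for depth in range(max_depth + 1):
        pages_here = branching ** depth  # page count grows with depth
        score = home_score * (damping / branching) ** depth
        total += pages_here
        if score >= threshold:
            indexed += pages_here
    return indexed / total


if __name__ == "__main__":
    for home_score in (2000, 1000, 100):
        print(home_score, round(indexed_fraction(home_score), 3))
```

With five links per page and five levels, the deepest level alone holds 625 of the 781 pages. In this toy model a site therefore either keeps everything or loses roughly 80% or more in one step, which is the all-or-nearly-nothing pattern described above.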
The fact that Big Daddy was kick-started from an index dating back to the middle of last year, not only explains why the backlink data might be stale, but it also explains why ancient pages keep popping up on various data centres.
As further evidence: try a "link:www.mysite.com" and compare it to a search for "www.mysite.com". In my case, the "link:" search shows just 6 results, only one of which is external to my site. The one external backlink probably pre-dates when Big Daddy's index was seeded. The "www.mysite.com" search, on the other hand, finds hundreds of results representing hundreds of internal and external backlinks. Why aren't these showing up in the "link:" search? Is it because "link:" searches are well known for not showing you the complete picture? Or, has that well-known fact simply been obscuring the true cause of all of the problems? Namely, that the backlinks are simply missing from Google's backlink index.
[edited by: tedster at 8:25 pm (utc) on May 17, 2006]
it does make you wonder about the timing of this break considering the state of things right now
>> bsaric wrote : My numbers going up again for older sites.
My pages are also going up. My site collapsed from 30000 pages to 10000, but since yesterday it's going up from 10000 to 12000 on nearly all the datacenters. Hope it won't fall again ...
|it does make you wonder about the timing of this break considering the state of things right now |
With so many older pages missing there must be a lot less for him to do and monitor.
No new sites or pages causing problems, a nice quiet time for MC to take a break
One thing we all can do, is citing examples of prominent sites that have lost pages.
I started with Oracle. First page I checked to see if it was indexed, nops not there :-)
Home>TECHNOLOGY PRODUCTS>Technology Overview>Technology Home>Grid Computing (Not Indexed, PR8)
I didn't continue the test further. I checked robots.txt to make sure that the page isn't blocked for spiders. There is no robots or nofollow meta tag in the page header. I don't think Oracle could have copied the content. There is no reason why Google should deindex that page.
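For anyone who wants to repeat this kind of check on their own pages, here is a minimal sketch using only the Python standard library. The robots.txt text, the HTML and the example.com URLs below are invented sample inputs, not Oracle's actual files, and the helper names are my own.

```python
from html.parser import HTMLParser
from urllib.robotparser import RobotFileParser


class RobotsMetaParser(HTMLParser):
    """Collects directives from <meta name="robots"> / <meta name="googlebot"> tags."""

    def __init__(self):
        super().__init__()
        self.directives = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attrs = dict(attrs)
        if (attrs.get("name") or "").lower() in ("robots", "googlebot"):
            content = attrs.get("content") or ""
            self.directives += [d.strip().lower() for d in content.split(",") if d.strip()]


def is_crawl_allowed(robots_txt, url, agent="Googlebot"):
    """True if the given robots.txt text lets `agent` fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


def blocking_meta_directives(html):
    """Return any noindex / nofollow / none directives found in the page."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return [d for d in parser.directives if d in ("noindex", "nofollow", "none")]


if __name__ == "__main__":
    sample_robots = "User-agent: *\nDisallow: /private/\n"  # hypothetical file
    sample_html = "<html><head><title>Grid</title></head><body></body></html>"
    print(is_crawl_allowed(sample_robots, "http://www.example.com/grid/index.html"))
    print(blocking_meta_directives(sample_html))
```

If the first call prints True and the second prints an empty list, the page is not being kept out by robots.txt or by a robots meta tag, so the cause has to be somewhere else.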
Guys this is serious matter. If the word goes out to the mainstream press, I shudder to think of Google's stock prospects. For whatever reason it was, Google please rein in BigDaddy. Your old infrastructure was just good enough.
>> Did anyone notice link to the cache is missing on the SERP's? Is it just me? <<
I saw that several times both yesterday and today. For a search that has just one result, the "cached" and "similar pages" links were missing from the result.
I hit "reload" and they reappeared. I have seen it just three or four times now, and could not reproduce it immediately afterwards. It happened again many hours later. I did not identify if it is one particular DC or not.
[edited by: g1smd at 7:09 pm (utc) on May 20, 2006]
>> Guys this is serious matter. If the word goes out to the mainstream press, I shudder to think of Google's stock prospects.
hehehehe. Are you serious? You think the average investor will care if some pages aren't indexed?
walkman, I am sure you know we aren't talking about just a handful of pages not indexed, but a big, very big portion of webpages not indexed. The average investor won't mind if pages from the sites of walkman and Mcmohan go missing. But if pages from Oracle are lost, for no apparent reason, that's a serious matter. The other day I did a couple of random searches on Dmoz and found that every deep-linked page isn't indexed, yet has a toolbar PR. Today, I started off with Oracle, and at the first instance I hit a PR8 page that is not indexed. If this sampling is any evidence, then we are staring at a major crisis at Google.
I suggest you take some good site that has deep navigation structure and check out few deeper pages. I am sure you will find few pages not indexed fairly easily.
> I started with Oracle. First page I checked to see if it was indexed, nops not there :-)
Good find McMohan!
I followed your example and did a search for the link and sure enough got a
"Sorry, no information is available for the URL www.oracle.com/technologies/grid/index.html"
Had 20 pages indexed yesterday, now I am down to 7..... Pages are still falling out.
|hehehehe. Are you serious? You think the average investor will care if some pages aren't indexed? |
"Google. No longer indexing all the web's information."
Umm, how could this NOT be earth-shattering. Isn't this their mission statement?
I see lots of small sites dropping multiple pages in the last few days. From 35 to 6. From 40 to 10. From 20 to 8.
So far, a common thread has been duplicate titles and meta descriptions, as well as low PR on internal pages, and very few inbound links to other than the main index page.
I also see a 160 page site with unique titles and meta descriptions, unique content per page, and a few incoming deep links to internal pages, sit there steady as a rock.
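If you want to check your own site for the duplicate title and meta description pattern, here is a rough sketch in Python. It assumes you already have the HTML of each page in a dict keyed by URL; the helper names and the sample pages are made up for illustration.

```python
from collections import defaultdict
from html.parser import HTMLParser


class HeadParser(HTMLParser):
    """Pulls the <title> text and the meta description out of a page."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.description = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            attrs = dict(attrs)
            if (attrs.get("name") or "").lower() == "description":
                self.description = (attrs.get("content") or "").strip()

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def find_duplicates(pages):
    """Group URLs that share the same (title, description) pair.

    `pages` maps URL -> HTML text. Returns a list of URL groups with 2+ members.
    """
    by_key = defaultdict(list)
    for url, html in pages.items():
        parser = HeadParser()
        parser.feed(html)
        key = (parser.title.strip().lower(), parser.description.lower())
        by_key[key].append(url)
    return [sorted(urls) for urls in by_key.values() if len(urls) > 1]


if __name__ == "__main__":
    sample = {  # hypothetical pages
        "/a": "<head><title>Widgets</title><meta name='description' content='Buy widgets'></head>",
        "/b": "<head><title>Widgets</title><meta name='description' content='Buy widgets'></head>",
        "/c": "<head><title>Blue widgets</title><meta name='description' content='Blue ones'></head>",
    }
    print(find_duplicates(sample))
```

Any group it prints is a set of pages that look identical to a crawler going only by title and description, which is the common thread among the sites that dropped.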
Is it possible that Google are working on a system which no longer lists backpages, instead opting to list index pages with the added attributes and keyphrases of their backpages? This may help do away with the canonical problems and also save some storage space on their servers.
All the Best
There are pages that are no longer cached that have page rank.... That is the baffling part... Googlebot has been crawling the crap out of the site this week.
[edited by: trinorthlighting at 7:53 pm (utc) on May 20, 2006]
"Umm, how could this NOT be earth-shattering. Isn't this their mission statement? "
No, Google's mission is to organize information, not just index every page that exists.
Big Daddy looks like a quantum leap in accomplishing this mission.
I also noticed today that I lost over 80 back links to pages. So the pages are dropping and so are the links.
First statement on G's About Us page
|Google's mission is to organize the world's information and make it universally accessible and useful. |
Quantum Leap? Where's this quantum leap?
As I mentioned earlier, I was recently looking for "natural cures for specific cancer." Did I find it on G? nope. Y! and MSN? Yep.
Why would the public want "edited" results?
The public doesn't care if that page is PR1 on Bob's 5 page site with 3 incoming links.
They just want the answers.
Let me know when they index ALL the pages AND keep the spam on page 20, then we can talk....
Edited to add. The sad thing is, according to MC, I shouldn't link to that very useful site because none of MY sites are about "natural cures to cancer"... How ironic...
Many of us are doing a great favor to Google by not taking time to see if pages of prominent sites are indexed or not, and then reporting if they indeed were missing. Chances are that there are many instances of pages missing from many good sites.
This suits Google just fine while they work this out, for they can't afford to let this go on as is.
Over the last few years many search engines came with really good concepts and USP. But they couldn't compete effectively with Google, and one of the main reasons for that was Google's indexing power and keeping the index fresh, which is resource heavy. Now, by not doing the basic job of indexing the pages (the ones that have unique content), Google, I am afraid, is throwing away its USP.
I think microsoft has come a long ways in keeping their content fresh. All of my pages indexed in microsoft have less than a week old cache date.
Google has that whole supplemental index that is very old and outdated. Why have an index that is very old and outdated is a good question.
Today I found some of my sites pages cached in the supplemental index that have not been around for a year......
I always thought the flaw with adsense was they let webmasters add it to any sites they wanted after initially being approved so that fed the scraper sites that brought down the whole deck of cards.
Vanessa Fox of Google Engineering has posted this declaration about a bug affecting page counts with the site: operator
It must have been going on a long time!
"We're freezing all refreshes of the supplemental results until these issues are fixed, and things should be back to normal in a few days."
Typical Google missing-the-point. Stop doing the important thing because something completely trivial is happening. It's interesting that they would admit doing such a blatantly stupid thing. Fix your index, don't worry (as much) about how pages display for a freaking site: search. Sheesh.
True, who cares about the site: search. The problem is that the pages not showing up in the site: search have gone altogether. So are the problem of the site: search and the problem of the dropped pages related / the same / independent?
Hopefully it is just one of those "bugs" mentioned by Vanessa Fox.
Yes, since we are to blame for all that's happening cos we are all spammers then it stands to reason that our concerns for no end of pages going supplemental / being dropped are unimportant.
Instead having the site: search working properly with or without a trailing slash should be our main concern...
I cannot find any Supplemental Results dated earlier than 2005 June now. It looks like those have all gone. For pages that went 404, or whose domain expired, before 2005 June they have simply been deleted. Previously there was stuff going back as far as 2004 January all over the place. There is none of that now.
There are a lot of newly created Supplemental Results with dates from 2005 July to 2006 March to be found now. These are for pages that have gone 404, or domain expired, since 2005 July and they represent the last known version that was online.
They are also for pages that have been edited since 2005 July, and the Supplemental Result shows the previous version of that page's content in the snippet, whereas a search for current content returns that same page as a normal result. In both cases you see a very recent cache of the page itself.
The Supplemental Results are there to let you see content that has recently been edited or removed from the web. They might also play a role in penalising certain types of domain-hopping spam sites that put the same content up in multiple places, and open new sites as soon as the old ones are penalised.
Thanks for the link Whitey
Re the site tool bug, it would be good to get some broader acknowledgement in a timely manner from Google of what they're up against.
It would show that the webmasters and Google are on common ground and genuine about being "open".
But at least these are bigger steps than before on communication.
Well, i'm glad to read about the bug. I was worried about my larger site that I checked yesterday and it was down to just the domain listed. Checking w/o the '/' at the end, it showed many more pages listed. Still far fewer than it had months ago, but much better than just the 1 listing.
I'm still having the same problem everyone else is though with other sites. Pages dropping or not being listed at all. That's the main one I hope they fix asap. I've posted before, and i'll repeat, it's absolutely ridiculous that my page is a full detailed review of a product, and I can't get indexed at ALL, yet 6 out of the top 10 sites listed for the keyword are absolute garbage and unrelated with just the keyword as one of the randomly generated words on the page.
That is not improving the google search results, and the more situations like that where it happens, the more frustrated surfers will be, and the quicker they'll try other engines. It's why I went to google in the first place, and it's why i'll use another engine soon.
It would be one thing if it's just me and i'm not optimized, but it's clear many are having the same issue, and if it's like Matt says where links to other sites play a big role, the downward spiral of google listings will continue. The more sites drop, there will be fewer quality pages linking to other quality pages, so more get dropped, and so on.
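Here is a small Python sketch of that downward spiral, just to show how it could feed on itself. It is a toy model and nothing more: the rule "a page stays indexed only if it has at least min_inbound links from pages that are still indexed" is my own assumption, not anything Google has published.

```python
def cascade_drop(links, min_inbound):
    """Repeatedly drop pages that have too few inbound links from indexed pages.

    `links` maps each page to the set of pages it links to.
    Returns (pages still indexed, number of rounds in which pages were dropped).
    """
    indexed = set(links)
    rounds = 0
    while True:
        # Count inbound links, but only from pages that are still indexed.
        inbound = {page: 0 for page in indexed}
        for src in indexed:
            for dst in links[src]:
                if dst in indexed and dst != src:
                    inbound[dst] += 1
        dropped = {page for page, count in inbound.items() if count < min_inbound}
        if not dropped:
            return indexed, rounds
        indexed -= dropped
        rounds += 1


if __name__ == "__main__":
    # Hypothetical chain of sites: a links to b, b to c, c to d.
    chain = {"a": {"b"}, "b": {"c"}, "c": {"d"}, "d": set()}
    print(cascade_drop(chain, min_inbound=1))
```

In the chain example, dropping "a" removes the only link to "b", which then drops and takes "c" with it, and so on until nothing is left. Sites that link to each other in a loop survive the same rule.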
I really didn't think i'd see the day when MSN took over google for SE traffic to my site, but it's happened this week, and it's not really because MSN improved. It's because google dropped so many pages, i'm only getting a handful of hits a day now. Please fix this soon!
Based on actual data, here is my theory on Google's BigDaddy Conspiracy. I believe that they will slowly disseminate different "fixes" for the most noticed issues, but will not come clean about the full extent of the problems because I believe that they were unintentionally caused by Google not knowing its own search algorithm well enough. Now for a rundown - data in < > brackets takes the place of actual data for the purpose of privacy for the real site owners.
Google had a new spider update called BigDaddy that went around and collected info from about Dec ’05 to March ’06. Around March 28th, they switched over completely to their BigDaddy data. On that day, all hell broke loose as thousands of good sites disappeared from Google. BUT… <our competitor's Spam-rich> sites seemed to thrive.
Google has admitted that they had problems with storing all the data from the BD update. Which at first seems weird because so many sites were dropped.
You can do a search on most of the search engines to show all the pages of <a site> that they have in their indexes. On Google, like many others, you just type “site:http://www.yourdomain.com/” (without the quotes) into the search box. <Our site> was down to 3 pages from over 100, but have been adding a couple a day for the past couple of days.
You can also do a “link:http://www.yourdomain.com/” search to find out who they have listed in their index that have links to our site. This troubled me because it would always say something like 1-8 out of about 24 matches. But there was no link down the bottom to either go to the second page or repeat the search with the omitted results included. It’s no secret that Google doesn’t let you know ALL the links they have toward you, but it was like they didn’t want us to see all the links for some reason. I tried this with other sites and found the same thing.
What some people don’t know is that Google has a special command (probably for their employees across the country/world) that does show ALL of the links to your site. It is “@:yourdomain.com” (without the quotes). I did this search a week ago. This search showed that we had 9,000 incoming links. That is just plain crazy. <Competitor's spammy> site had a normal amount: some directory listings, internal links, and his cross-linking between his different sites.
I looked through LOTS of the 9,000+ to look for patterns. This is important because if you have a sudden increase (nobody is able to define exactly how many links over what period of time constitutes a “sudden increase”), then Google can count that as a type of spamming their indexes.
Aside from the links I would expect to be there, there were thousands of pages that were nothing more than pages with PPC ads on them! To oversimplify a little, they broke down into roughly three categories:
1. Sites with Overture ads – more likely than not, these had a Google PR (pagerank) of ZERO. Of course, Overture is their competition.
2. Sites that were hosted/created by some company named Enom.com. A lot, but not enough to pursue a theory on those.
3. A HUGE number of sites hosted by GoDaddy.com that say “Coming Soon”. This is complete b.s.
Here is an example of one of the GoDaddy sites: [visibilitysystems.com...] <I replaced our keywords with some keywords unrelated to our business> Complete crap. Why would there even be such a site? Because as one fraudster put it “I can create thousands of websites automatically in just a few minutes” (Which really scares me because if they can create all these sites, I’m sure that they have the ability to make clicks once per hour in each topic and have them look like they come from different IP addresses.) This type of site is something that people have started to be able to automate the creation of.
Because 15% of web traffic is estimated to be generated by people actually typing urls into their address bar, there was $ to be made by people making sites like BestBy.com – basically you make a site close to a common well-known site with a typo. People come to your site by mistake, you choose keywords that will show the real BestBuy in the results through a PPC program – the customer gets where they want, but at the expense of BestBuy having to pay Google and the owner of the bogus site gets a commission. So, people have started to throw up thousands of sites at a time, if they make no money, they cancel them after three months. The ones that make money make so much money that it’s worth it to let them sit there.
So there are thousands of bogus sites like visibilitysystems listed above that point to our site. Surely Google doesn’t index these sites, right? My theory is that Google knows about them and may even be in on it. They made $6 billion last year. HALF of it was through “syndicated” PPC sites. So the real url I gave you above is even ranked by Google – as a PR 3. Almost all of these GoDaddy sites are PR 3-6 (most were 4’s). And if you change the part of the url that says “cell%20phones” to, say, some other high-paying term like “online%20gambling”, the url still works, so this will show as a backlink for the top 10 companies in all of the richest keywords in the Google PPC program.
Notice they even put a little directory at the bottom of every one of these pages in case you land there and see something else you were looking for. Google also acts as a registrar now (but not a hosting company) which also makes me suspicious that they may be in bed with GoDaddy. see [theregister.co.uk...]
So what are the options of what’s going on? Either Google did not know or they did. If they did, whether or not they were in on it, I believe that they did not realize that when they switched over to BigDaddy data at the end of March that it would affect all of their top customers in all the keywords that pay well. But it has certainly worked out for them either way, as sites with high-ranking keywords have had to rely on spending even more on Google’s PPC than before because so many of their (our) pages dropped out of the indexes.
You can either ratchet up the conspiracy theory or bring it down a few notches, but let me leave you with some of the kickers:
<our competitor's spammy> sites, which have not been dropped from the indexes, did not list these fake pages with urls to their sites. That would explain why it looks like <they're> Google’s darling right now.
I also found that there were 80+ GOOGLE blogs that are all identical that are similar to the generated GoDaddy pages above that point to our site.
If Google were really evil, this could be their way of trying to get us to stop using Overture (the PR-zero sites that point to BGF) but I doubt that’s the case; I think they just don’t count those at all – or else they’d have a PR.
As of today, a new group of <our> urls are in the Google index and within 2 days the # of urls linking to us has fallen from 9,000+ to under 900 (this has turned out to be a good thing for us)
I submitted info to Google 3 different ways, so I think they’re re-spidering the sites that have complained first - we're starting to see major results - finally.
The first was on the Google message boards; you need to list your complaint very politely under the post that says "We're listening".
The second was to do a "reinclusion request" on Google once we were down to only 3 urls listed.
The third was to send a detailed email to the email address provided in (Google's) Matt's blog.
9,000+ sites pointing to us? No wonder Google admitted to running out of room on their servers. This probably did catch them off guard.
I think they tried to do a quick fix by pulling some old data (the “supplemental results”) and putting it in temporarily so site owners would see something when they typed in the site: command. That’s exactly what happened to us.
The order that things happened:
1. Most of our pages were dropped
2. Supplemental results added
3. Site re-spidered after contacting Google
4. Supplemental results dropped
5. Real results being added
6. New additions to be assigned page ranks in near future.
7. Google decided to respider their supplemental results, so they added supplemental urls back in
8. The number of urls in our site jumped from 3 on Wed to 7 on Fri to over 90 today (Sat)
Basically, I think they know what they did, they need months to fix it, and they don’t want anyone to figure out what went wrong because any combination of the above scenarios implicates them in something that is probably worse than what they actually intended. I’ve tested other sites that said they’ve had problems and it all seems to fit.
There are also other pieces that make it an even more perfect fit, but this is the short version.
Be interested to find out if this scenario fits anyone else.....
>>all hell broke loose as thousands of good sites disappeared from Google. BUT… <our competitor's Spam-rich> sites seemed to thrive.
This is what I don't get.
This is what you hear from Google employees: "we are working to reduce the spam links/spam sites getting indexed".
The problem is, the spam sites are doing just fine. Many of the rest of us who don't engage in their tactics aren't. I'm looking at keywords, I'm not even talking about the bug that Google acknowledged yesterday ("site:www.yourdomain.com" being broken).
In my niche, there are about 10 sites, 6 or 7 of which are spammers (one is borderline), and the other three are legit - we write our own content, we get links from large sites outside of niche that find an occasional story of ours interesting, etc., etc.
Even though we compete, we exchange emails, occasionally link to stories on each others sites, etc. The three of us had an IM discussion today, our first ever, because all three of us are having problems while the spammers in our niche aren't - we are dropping in key words fast.
In our conversation, one of the people suggested something that none of us would have even brought up months ago - all three of our sites are registered in our names or our company names - she suggested creating a site registered through a private proxy, no advertising, with some simple blog software and accounts for each of us, and we update it whenever we publish a story on our site.
It's very unsavory and uncomfortable and reeks of what the spammers are doing, but we actually considered it for a few minutes.
It's ridiculous that legitimate site owners might have to resort to such means, but apparently, according to Mr. Cutts, having a "fine" site is not enough, you have to have some kind of threshold of IBLs before you can be considered worthy enough to index. Guess Google will have to drop their claims about indexing the whole web, and instead just claim to index part of the web.
"it's not really because MSN improved."
That's right, and neither has Yahoo. Google is and will stay the mainstream SE for many years ahead. The only thing to do is be patient and cross your fingers. After all, I believe Google will fix all the problems, if any, and the communication that Google has with webmasters is unique in the SE world. Do you really expect that MSN or Yahoo will put up a Google Guy or Matt or Vanessa, and lately Adam Lasnik, to communicate with the webmaster community?
Q:When did you see any postings or contacts lately from Yahoo Tim or MSNdude?
I do agree about how open and upfront G has been but on the other hand I still think a lot of these problems could have been stopped with more testing. After all we all know the respect some webmasters offer towards GG & MC so asking people to do testing on some DC's wouldn't really have been hard to do