I had 20,300 pages showing for a site:www.example.com search yesterday and for the past month. Today it dropped to 509, but my traffic is still pretty constant. I normally get around 4,500 - 5,000 visitors to that site per day, and today I've already had 4,000. So either Google doesn't account for even a small percentage of my traffic (which I doubt), or the way Google stores information about my site has changed, i.e. the 20,300 pages are still there, but Google will only tell me about 509 of them. As far as I can tell, the other pages have gone supplemental.
That resonated with something I was discussing with the crawl/index team. internetheaven, was that post about the site in your profile, or a different site? Your post aligns exactly with one thing I've seen in a couple of ways. It would align even more if you were talking about a different site than the one in your profile. :) If you were talking about a different site, would you mind sending the site name to bostonpubcon2006 [at] gmail.com with the subject line "crawlpages", plus the name of your site and the handle "internetheaven"? I'd like to check the theory.
Just to give folks an update, we've been going through the feedback and noticed one thing. We've been refreshing some (but not all) of the supplemental results. One part of the supplemental indexing system didn't return any results for [site:domain.com] (that is, a site: search with no additional terms). So that would match with fewer results being reported for site: queries but traffic not changing much. The pages are available for queries matching the supplemental results, but just adding a term or stopword to site: wouldn't automatically access those supplemental results.
I'm checking with the crawl/index folks on whether this might factor into what people are seeing, and I should hear back later today or tomorrow. In the meantime, interested folks might want to check if their search traffic has gone up/down by a major amount, and see if there are fewer/more supplemental results for a site: search on their domain. Since folks outside Google couldn't force the supplemental results to show up for site: queries, it needed a crawl/index person to notice that fact based on the feedback that we've gotten.
Anyone that wants to send more info along those lines to bostonpubcon2006 [at] gmail.com with the subject line "crawlpages" is welcome to. So you might send something like "I originally wrote about domain.com. I looked at my logs and haven't seen a major decrease in traffic; my traffic is about the same. I used to have about X% supplemental results, and now I hardly see any supplemental results with a site:domain.com query."
I've still got someone reading the bostonpubcon email alias, and I've worked with the Sitemaps team to exclude that as a factor. The crawl/index folks are reading portions of the feedback too; if there's more that I notice, I'll stop by to let you know.
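For anyone who wants to sanity-check the traffic side of this, a rough way to measure what share of daily visits arrive via Google is to tally referrers in the raw access log. This is only a minimal sketch: it assumes a combined-format log at a hypothetical path access.log, and identifies Google visits purely by a google.* referrer.

```python
import re
from collections import Counter

LOG_PATH = "access.log"  # hypothetical path - point this at your real log
GOOGLE_REF = re.compile(r"https?://(www\.)?google\.[a-z.]+/", re.I)

total = Counter()
from_google = Counter()

with open(LOG_PATH, encoding="utf-8", errors="replace") as log:
    for line in log:
        parts = line.split('"')
        if len(parts) < 6:          # not a combined-format line
            continue
        day = line.split("[", 1)[-1].split(":", 1)[0]   # e.g. 10/May/2006
        referrer = parts[3]
        total[day] += 1
        if GOOGLE_REF.search(referrer):
            from_google[day] += 1

for day in sorted(total):
    share = 100.0 * from_google[day] / total[day]
    print(f"{day}: {total[day]} hits, {from_google[day]} via Google ({share:.1f}%)")
```

If the Google share stays flat while the site: count collapses, that points at a reporting change rather than dropped pages, which is the pattern being described above.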
Not only are some sites having fewer pages appear in the index (these are the "experimental" and "cleanup" datacentres as far as I can tell)
Just to be crystal clear: the missing-pages problem is across all datacentres. That is, sites that are affected by the bug see 95%+ of their pages dropped from Google's index on all datacentres (obviously there are the usual slight variations from DC to DC).
Again, I ask... what has G. proposed in order to help these webmasters? I know they have announced an email address for webmasters to provide examples, but what else? Anything? Hello... is anyone there?
The simple answer is: Nothing, beyond the email address.
Google run a very tight ship when it comes to disseminating information. While this policy has many obvious advantages, it has some serious downsides as well. When a serious bug is introduced, the lack of communication, both within Google and with the outside world, can seriously hamper their ability to identify and fix the problem. Maintaining the high level of secrecy that they do requires a great deal of "need-to-know" segmentation. I'm certain that only a very small handful of Google employees have the full picture of exactly what is going on. How many of Google's employees have a bird's-eye view of all of the changes encompassed by "Big Daddy"? I don't know the answer, but I'd guess it is a tiny, tiny number. What chance, then, of identifying and sorting out the current problems?
One thing I did notice: I had a bunch of old 404 pages from last August dumped into the supplemental index, and suddenly my good pages disappeared.
Maybe I am getting hit with a duplicate penalty due to these old, outdated 404 page caches that all of a sudden showed up in the index.
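If the worry is stale copies of removed pages hanging around, it is worth confirming that the old URLs really answer 404 (or 410) rather than 200. A minimal sketch, where the URL list is purely hypothetical:

```python
from urllib.request import Request, urlopen
from urllib.error import HTTPError

OLD_URLS = [
    "http://www.example.com/old-page-1.html",   # hypothetical examples only
    "http://www.example.com/old-page-2.html",
]

for url in OLD_URLS:
    try:
        resp = urlopen(Request(url, method="HEAD"), timeout=10)
        print(f"{url} -> {resp.status}  (a 200 here keeps a stale copy looking alive)")
    except HTTPError as err:
        print(f"{url} -> {err.code}  (404/410 is what you want for a removed page)")
```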
Well, I'm shocked I got a reply to the email I sent in. It basically told me I didn't have a canonical problem and suggested I use Google Sitemaps! I'm not using Google Sitemaps on this site, but I do have my own sitemap, which has always done its job well in the past. I really think telling a webmaster just to use their sitemap is a little lame, considering that before they started dropping pages (which for me began about three weeks ago) and stopped crawling sites as much, I never had problems getting content crawled.
I do agree the email didn't really offer any answers other than "use Google Sitemaps!", but I've replied to it, so it will be interesting to see if I get a more detailed reply back.
Well, do keep us updated if they do.
As Arubicus said, any info on whether it's something we can "fix" or it's something on their end would be heaven sent.
In your original email to Google, did you describe your problem as a "crawling" problem or an "indexing" problem? From their suggestion to try using a sitemap, it seems they are assuming that your missing pages are the result of not being crawled.
This is not the usual symptom of the missing pages problem. The missing pages are crawled regularly, they just don't make it into the index.
PS: A Google sitemap won't help.
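One way to back up the "crawled but not indexed" point is to check the raw access log for Googlebot requests to the dropped URLs. A minimal sketch, assuming a combined-format log at a hypothetical access.log path, a hypothetical list of dropped paths, and identifying Googlebot by user-agent string alone (a stricter check would reverse-DNS the requesting IP):

```python
MISSING_PAGES = {"/some-dropped-page.html", "/another-dropped-page.html"}  # hypothetical
LOG_PATH = "access.log"   # hypothetical path

with open(LOG_PATH, encoding="utf-8", errors="replace") as log:
    for line in log:
        if "Googlebot" not in line:
            continue
        parts = line.split('"')
        if len(parts) < 2:
            continue
        request = parts[1].split()   # e.g. ['GET', '/page.html', 'HTTP/1.1']
        if len(request) >= 2 and request[1] in MISSING_PAGES:
            timestamp = line.split("[", 1)[-1].split("]", 1)[0]
            print(f"Googlebot fetched {request[1]} at {timestamp}")
```

Regular Googlebot fetches of pages that still aren't in the index would support the idea that this is an indexing problem, not a crawling one.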
Hmmmz - they appear to have changed something for your site to correct the possible canonical issue, though, so I am not sure why they said you didn't have the problem.
E.g. I am sure that internal non-www pages within the site had PR0, and now they have the same PR as the www versions... (not 100% sure, as I can't remember).
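For anyone unsure whether their www/non-www canonical setup is clean, a quick check is to request the bare domain and confirm it answers with a single 301 pointing at the www host. A minimal sketch, with example.com standing in for the real domain:

```python
from http.client import HTTPConnection

conn = HTTPConnection("example.com", timeout=10)   # placeholder domain
conn.request("HEAD", "/")
resp = conn.getresponse()
print(resp.status, resp.getheader("Location"))
# A clean canonical setup prints something like: 301 http://www.example.com/
conn.close()
```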
The subject line was "crawlpages", as GG said, and within my email I explained how pages that used to get crawled and ranked were no longer showing any cached info - so to be told to use the sitemap is a little annoying.
The crawl/index team checked into several reports and each time came up with other reasons why the site wouldn’t be crawled as much (e.g. the ‘next page’ url on one site wasn’t short; it was a total hairball with like 200 chars of params), and some supplemental results folks have been through the raw emails, which is how one of the site: changes was noticed. So far, about half of the feedback to the email isn’t about pages dropped. Of the other half, one factor is that several sites have spam penalties. Of the remaining feedback, the two site: changes were the only two that we noticed. We’re going to keep digging in, but people need to bear in mind that Bigdaddy does have different crawl priorities, so a site that had more pages indexed by the earlier Googlebot won’t necessarily have as many pages indexed in the future. But don’t get me wrong; we’re still going through the feedback to see if there’s anything else to be identified and improved.
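On the "hairball" point about long parameterised URLs, one simple self-check is to scan your internal URLs and flag any with unusually long query strings or many parameters. A minimal sketch, where urls.txt is a hypothetical file of internal URLs (one per line) and the thresholds are arbitrary:

```python
from urllib.parse import urlsplit, parse_qs

MAX_QUERY_CHARS = 100   # arbitrary thresholds, for illustration only
MAX_PARAMS = 5

with open("urls.txt", encoding="utf-8") as fh:   # hypothetical URL list
    for raw in fh:
        url = raw.strip()
        if not url:
            continue
        query = urlsplit(url).query
        n_params = len(parse_qs(query, keep_blank_values=True))
        if len(query) > MAX_QUERY_CHARS or n_params > MAX_PARAMS:
            print(f"{len(query):4d} chars, {n_params} params: {url}")
```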
If the old supplemental results are gone, then I'm happy. The supps for my site were of an early version of the site, and I've been wanting them to be recrawled or go away for months. All of the pages they point to have been 301'd to new pages for over a year, and in many cases the new pages have much more content than the old versions.
Hopefully the new crawl doesn't trip too many duplicate-content triggers on my good content. The site is database-generated with URL rewriting to create static URLs. The data involves geo-location, so a lot of names will appear on multiple pages but in different orders on those pages. This probably plays hell with Google's dup-content discovery process. Several hundred pages have additional content such as photos and commentary, which makes those pages highly unique, but several thousand just have the geo-data. I'm sure this is a large part of why a lot of my content ended up in the supps before. I'm not bothered by that; I just wish Google would figure out when those pages gain detail and bring them back to the active index. Sitemaps doesn't seem to be helping there as much as I would hope.
Anyway, enough rambling on. I'm not fretting too much, but it looks like this update is going to take some time to recover from. Good thing this isn't my day job, or I'd be eating beans and rice for a long time to come.
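For pages like those described above, which share the same geo-data in different orders, a rough way to gauge how alike two generated pages look to a naive comparison is word-shingle Jaccard similarity. This is only an illustration, not Google's actual duplicate detection, and the two sample strings are placeholders for real page text:

```python
def shingles(text, size=4):
    """Break text into overlapping word shingles of the given size."""
    words = text.lower().split()
    return {" ".join(words[i:i + size]) for i in range(max(len(words) - size + 1, 1))}

def jaccard(a, b):
    """Jaccard similarity between the shingle sets of two texts."""
    sa, sb = shingles(a), shingles(b)
    return len(sa & sb) / len(sa | sb) if (sa | sb) else 0.0

# Placeholder strings standing in for the text of two generated pages.
page_a = "Springfield Shelbyville Ogdenville North Haverbrook list of locations and distances"
page_b = "North Haverbrook Ogdenville Shelbyville Springfield list of locations and distances"
print(f"similarity: {jaccard(page_a, page_b):.2f}")   # a high score suggests a near-duplicate
```

Pages with extra photos and commentary should score noticeably lower against each other than the bare geo-data pages do, which matches the split between indexed and supplemental pages described above.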
I'm sure there is no justifiable reason why these pages have been removed. Every page validates, has no spam, no links to bad neighbourhoods, etc.
So basically, what Google are saying is that if our pages aren't in the index, it's not their fault - it's ours!
If that is the case, then they should provide some mechanism for informing us of the reason why they have dropped the pages. After all, it was a decision that was taken, either by a human or a machine; whichever it was, the machine or the human knows why, so is it too much to ask for them to add the status into Sitemaps?
This is so disappointing. I guess if they are not going to fix it, because they don't believe it is broken, we'll just have to forget about Google and rely on the other search engines. I for one will not continue to promote Google while I'm not in their index. I've already closed my AdWords account, and AdSense is next as soon as I have set up an alternative. It does not make sense to promote a search engine in which you can't get your own websites listed!
I don't see any point in maintaining the sitemap either. Besides, why should Google need a sitemap when Yahoo and MSN manage to find my deepest pages without one? Google know that the pages are there; if they refuse to include them, a sitemap is not going to help. In fact, it was only three weeks after I set up my sitemap that my site began to be dropped from the index!
One site is PR7 and one site is PR6. Both are about six years old, with thousands of IBLs from authority sites linking to our 100% custom content for each site's given verticals, which took five-plus years for our experts to write. Some conclusions I have come to:
- PR isn't important with regard to the issue; there is no difference in the problem between our PR7 and PR6 sites.
- No 301 issues on either site. They are on the same server as 20 other sites of ours, and no others have been affected, despite interlinking of the network.
- Authority seems to be irrelevant. We have scrapers by the tens of thousands weekly crawling and spamming our content out, and now these PR1 or PR0 sites with little more than one IBL are crushing our two sites in the SERPs for almost all positions and our company name. This despite our having PR7, thousands of IBLs, clean SEO, and no changes to the properties in years.
I have given up monitoring the subject. We have emailed them and been told we do not have any penalties against the properties; we have had our coders spend hundreds of hours poring over the properties to make sure this isn't a mistake on our part, and they have concluded it is not. This is some form of crazy manipulation on Google's part that makes no sense whatsoever.
This hasn't shattered my confidence in Google, but it is most definitely an eye-opener. I don't blame Google like others here; things happen, and after all it is their search engine and they can do what they want. But it is fairly upsetting for us to see properties we poured our hearts into, and "played by their rules" on, get destroyed in less than 30 days.
We’re going to keep digging in, but people need to bear in mind that Bigdaddy does have different crawl priorities, so a site that had more pages indexed by the earlier Googlebot won’t necessarily have as many pages indexed in the future. But don’t get me wrong; we’re still going through the feedback to see if there’s anything else to be identified and improved.
[mattcutts.com...]
After re-reading and re-reading MC's response to Donna, it seems pretty obvious that they simply don't know what the problem is. And while the sudden appearance of GG is the largest indicator that G acknowledges there MIGHT BE a problem, it sounds like they have no clue where to begin.
From the looks of things, the new Big Daddy "algo" (for lack of a better word) perceives almost every site as having a spam penalty. If we go by the MC/GG explanation, all of us are incurring some type of spam penalty which is resulting in our pages being dropped.
Does that make any sense?
Meanwhile, Yahoo, MSN and Ask all list all of my pages, as has Google in the past. If giving Google a sitemap is going to help, then they need to explain why.
Filing a reinclusion request is often suggested as well, but to do that you have to admit to being a spammer, claim to have reformed, and then beg for forgiveness.