Forum Moderators: Robert Charlton & goodroi
We have a website that has the following scenario:
Google Sitemaps (sitemap.xml) = 10,000 indexed URLs
site: = 4,000 pages found.
The indexed URLs in our sitemaps have been rapidly rising over the last week or so.
Doing my research, I have noticed many webmasters experiencing the opposite problem - MORE site: pages than indexed URLs in their sitemaps - but never the scenario where Sitemaps reports far more indexed URLs than site: shows.
Has anyone else had this scenario, and can anyone shed any light on where the missing 6,000 URLs might be? Are they waiting to go into the Google index? Are they already in Google and site: is just not showing them?
Any help would be appreciated.
Cheers,
Wesiwyg
The site: operator has been slowly developing more and more "issues". I find very low and very high "about" numbers all the time. I also see site:example.com giving one set of results, while site:example.com/directory/ can return URLs that don't appear in the results when the /directory/ is omitted.
One of the issues here seems to be canonicalization choices on Google's side - how they choose to combine near duplicates, or not. Another issue may be the "supplemental" partitions of the entire data set, and whether those URLs get counted in the "about" number estimates.
I just did one site search today that began at "about 2200" and clicked through to "197". That's over a 90% error in the original estimate - or is it really an error? This particular site has some nasty canonicalization issues, including session IDs in the URL, 302 redirects that should be 301s, and mixed-case URLs. It's very possible that the 197 is much closer to the number of unique pages, and the 2200 is more like the total of all URLs that are indexed.
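To see how session IDs and mixed case alone can inflate a URL count, here is a minimal sketch of that kind of normalization. The parameter names treated as session IDs are an assumption for illustration, not anything Google has published:

```python
from urllib.parse import urlsplit, parse_qsl, urlencode, urlunsplit

# Query parameters we treat as session IDs -- an assumption for illustration.
SESSION_PARAMS = {"sid", "sessionid", "phpsessid"}

def canonicalize(url):
    """Lowercase the scheme, host, and path, drop session-ID parameters,
    and strip fragments, so near-duplicate URLs collapse to one key."""
    scheme, netloc, path, query, _ = urlsplit(url)
    kept = [(k, v) for k, v in parse_qsl(query)
            if k.lower() not in SESSION_PARAMS]
    return urlunsplit((scheme.lower(), netloc.lower(), path.lower(),
                       urlencode(kept), ""))

urls = [
    "http://www.Example.com/Page1?sid=abc123",
    "http://www.example.com/page1?SID=zzz999",
    "http://www.example.com/page2",
]
unique = {canonicalize(u) for u in urls}
print(len(urls), "raw URLs ->", len(unique), "unique pages")
# -> 3 raw URLs -> 2 unique pages
```

Three indexed URLs collapsing to two unique pages is the same effect, in miniature, as "about 2200" indexed URLs versus 197 unique pages.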
Thanks for your comments. I have just done about 30 site: checks on recently "indexed" sitemap URLs and none of them are showing in the Google index, so I was hoping someone might know whether there is a delay between a URL being reported as indexed in Sitemaps and it actually appearing in the Google index.
So then I clicked the link to include the similar pages that had been filtered out, and the total WENT DOWN to 650. Similar borkedness is a regular diet from the site: operator. I'm used to it. It's a very imperfect analysis tool and we just can't take it as anything else.
As for the Sitemaps report in Webmaster Tools: it can tell you that a URL is in the index when it isn't. So it's borked too.
The total number of URLs in the index is a minor indicator for SEO these days, because it can't be anything but a ballpark guess. What's important is traffic - the visitors Google actually sends, on which searches and to which pages. If a page you intend as a good search target isn't getting any search traffic, then debug as best you can.
I wish I had some magic to make these Google reporting tools less borked. My best advice is as I mentioned above - run the site: operator across each of the top-level directories, and even down to the second or third sub-directory level for a major site, if the URL structure allows.
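A quick way to know which per-directory site: queries are worth running is to bucket your own sitemap URLs by top-level directory first. A minimal sketch, using a toy inline sitemap (in practice you'd load your real sitemap.xml; the example.com paths are placeholders):

```python
import xml.etree.ElementTree as ET
from collections import Counter
from urllib.parse import urlsplit

# A toy sitemap inline -- in practice, read your real sitemap.xml from disk.
SITEMAP = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://www.example.com/abc/page1</loc></url>
  <url><loc>http://www.example.com/abc/page2</loc></url>
  <url><loc>http://www.example.com/def/page1</loc></url>
</urlset>"""

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def urls_per_top_dir(sitemap_xml):
    """Count sitemap URLs per top-level directory, giving one
    site:www.example.com/<dir> query to run per bucket."""
    root = ET.fromstring(sitemap_xml)
    counts = Counter()
    for loc in root.findall(".//sm:loc", NS):
        path = urlsplit(loc.text.strip()).path
        parts = [p for p in path.split("/") if p]
        counts[parts[0] if parts else "/"] += 1
    return counts

for directory, n in sorted(urls_per_top_dir(SITEMAP).items()):
    print(f"site:www.example.com/{directory}  ->  {n} sitemap URLs")
```

Comparing each bucket's sitemap count against the "about" number for the matching site: query narrows down which section of the site is under-represented.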
Every once in a while the dust clears and for a while these Google reporting tools make sense. But because of the current transition to Caffeine, I don't expect to see another such moment of clarity any time soon.
If you want to see the indexed pages that Google considers most valuable, run the site: operator on AOL - their partner. Google only exports a subset of the entire index to AOL Search, and which URLs they export is interesting to note.
WMT:
URLs in sitemap = 45,000,000
Indexed = 200,000
site:example.com = 2,160,000
site:www.example.com = 1,720,000
site:example.com -site:www.example.com = 8,660
Broken down by state directory:
site:www.example.com/abc = 265,000
site:example.com/abc = 365,000
site:www.example.com/def = 230,000
site:example.com/def = 300,000
site:www.example.com/abc "unique string A" = 18,000
site:example.com/abc "unique string A" = 18,000
site:www.example.com/def "unique string A" = 11,000
site:example.com/def "unique string A" = 8,000
Inconsistent and unreliable.
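If those "about" numbers behaved like real set cardinalities, the first three queries would have to satisfy a simple identity. A quick sketch of the identity they violate, using the figures from the post above:

```python
# "About" counts reported by the three queries above.
site_all = 2_160_000    # site:example.com
site_www = 1_720_000    # site:www.example.com
site_diff = 8_660       # site:example.com -site:www.example.com

# For true set sizes, |all| = |www| + |all minus www| must hold,
# so the exclusion query should return |all| - |www|.
expected_diff = site_all - site_www
print("reported difference:", site_diff)      # 8,660
print("implied difference: ", expected_diff)  # 440,000 -- off by ~50x
```

The reported 8,660 versus the implied 440,000 is exactly the kind of inconsistency being described: the estimates are not drawn from one consistent set of indexed URLs.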