Indexed URLs - big difference between sitemap.xml report and site: operator

         

speedshopping

1:39 pm on Oct 5, 2009 (gmt 0)

10+ Year Member



Hi,

We have a website that has the following scenario:

Googlesitemap.xml = 10,000 Indexed URLs
site: = 4,000 pages found.

The indexed URLs in our sitemaps have been rapidly rising over the last week or so.

While researching this, I have noticed many webmasters reporting that they have MORE site: pages than indexed URLs in their sitemaps report, but never the scenario where the sitemaps report shows far more indexed URLs than site: does.

Has anyone else had this scenario, and can anyone shed any light on where the 6,000 URLs might be? Are they waiting to go into the Google index? Are they already in Google and the site: operator is just not showing them?

Any help would be appreciated.

Cheers,
Wesiwyg

tedster

3:23 pm on Oct 5, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



One thing you can do is sample some of the sitemap urls that WMT says are indexed but don't show in the site: operator results. Just paste some of those urls directly into the Google search box and see if they show up as a result. If they do show up, then they are in the index - and that seems to be the best way to get a final verdict.
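If the sitemap is large, picking the sample by hand gets tedious. Here is a minimal Python sketch of the sampling step only, assuming a standard sitemaps.org urlset file. The function name sample_sitemap_urls and the example urls are made up for illustration, and the check itself is still done by pasting each url into the search box by hand.

```python
import random
import xml.etree.ElementTree as ET

# Namespace used by standard sitemaps.org <urlset> files.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

def sample_sitemap_urls(sitemap_xml, k=30, seed=None):
    """Return up to k randomly chosen <loc> urls from a sitemap document."""
    root = ET.fromstring(sitemap_xml)
    urls = [loc.text.strip() for loc in root.iter(SITEMAP_NS + "loc") if loc.text]
    rng = random.Random(seed)
    return rng.sample(urls, min(k, len(urls)))

# Tiny stand-in sitemap (hypothetical urls) so the sketch runs on its own.
EXAMPLE_SITEMAP = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://www.example.com/abc/page1.html</loc></url>
  <url><loc>http://www.example.com/abc/page2.html</loc></url>
  <url><loc>http://www.example.com/def/page3.html</loc></url>
</urlset>"""

if __name__ == "__main__":
    # Print the sampled urls, one per line, ready to paste into a search box.
    for url in sample_sitemap_urls(EXAMPLE_SITEMAP, k=2, seed=1):
        print(url)
```

A random sample is less likely to be biased toward one section of the site than the first few urls in the file.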

The site: operator has been slowly developing more and more "issues". I find very low and very high "about" numbers all the time. I also see site:example.com giving one set of results, but site:example.com/directory/ can return urls that are not in the results with the /directory/ excluded.

One of the issues here seems to be canonicalization choices on Google's side - how they choose to combine near duplicates, or not. Another issue may be the "supplemental" partitions of the entire data-set, and whether those urls get counted in the "about" number estimates.

I just did one site search today that began at "about 2200" and clicked through to "197". That's over a 90% error in the original estimate - or is it really an error? This particular site has some nasty canonicalization issues, including session IDs in the url, 302 redirects that should be 301s, and mixed case urls. It's very possible that the 197 is much closer to the number of unique pages, and the 2200 is more like the total of all urls that are indexed.
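To get a feel for how session IDs and mixed case can inflate a url count, here is a rough Python sketch that collapses url variants into one key per page. It is only an illustration, not how Google canonicalizes. The session parameter names in SESSION_PARAMS are assumptions, the example urls are hypothetical, and lowercasing the path is only safe if your server treats paths as case-insensitive.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Assumed session parameter names; replace with whatever your site really uses.
SESSION_PARAMS = {"sid", "sessionid", "phpsessid"}

def canonical_key(url):
    """Lowercase the url and drop session parameters so variants collapse."""
    parts = urlsplit(url)
    query = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
             if k.lower() not in SESSION_PARAMS]
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(),
                       parts.path.lower(), urlencode(query), ""))

def count_unique_pages(urls):
    """Count distinct pages after collapsing session-ID and case variants."""
    return len({canonical_key(u) for u in urls})

# Hypothetical urls: three variants of one page plus one other page.
urls = [
    "http://www.example.com/Widgets/Blue.html",
    "http://www.example.com/widgets/blue.html",
    "http://www.example.com/widgets/blue.html?sid=abc123",
    "http://www.example.com/widgets/red.html?PHPSESSID=9f8e",
]
print(len(urls), "urls ->", count_unique_pages(urls), "unique pages")

# The drop from "about 2200" to 197 mentioned above, as a fraction of the estimate.
shrink = (2200 - 197) / 2200
print(f"estimate shrank by {shrink:.1%}")  # 91.0%
```

Run against a crawl or log export, the gap between the raw count and the collapsed count gives a ballpark for how much of an "about" number could be duplicates.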

speedshopping

5:30 pm on Oct 5, 2009 (gmt 0)

10+ Year Member



Hi Tedster,

Thanks for your comments - I have just done about 30 site: checks on recently "indexed sitemaps urls" and none of them are showing in the Google index, so I was hoping someone might know whether there is a delay between a url being reported as indexed in sitemaps and it being added to the Google index.

tedster

3:32 am on Oct 6, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



There can be a delay of a couple days - but the problem is the site: operator technology itself. It's just plain borked. Tonight I did a site: operator search that resulted in 778 urls exactly - clicking to the last page.

So now I clicked the link to include the similar pages that had been filtered out and the total WENT DOWN to 650. Similar borkedness is a regular diet from the site: operator. I'm used to it. It's a very imperfect analysis tool and we just can't take it as anything else.

So now you talk about the Sitemap in Webmaster Tools. It can tell you that a url is in the index when it isn't. So it's borked too.

The total number of urls in the index is a minor indicator for SEO these days, because it can't be anything but a ballpark guess. What's important is traffic - visitors that Google actually sends, on what searches and to what pages. If a page you intend as a good search target isn't getting any search traffic, then debug as best you can.

I wish I had some magic to make these Google reporting tools less borked. My best advice is as I mentioned above - run the site: operator across each of the top level directories, and even down to the second or third sub-directory level for a major site if the url structure allows.
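For a big site, the list of directory-level queries can be built from your own url list. This is a minimal Python sketch, assuming you already have the sitemap urls as plain strings. The function name site_queries and the example urls are hypothetical. It only builds the query strings and counts how many of your urls fall under each one, so you have a baseline to compare against when you run the queries by hand.

```python
from collections import Counter
from urllib.parse import urlsplit

def site_queries(urls, depth=1):
    """Map each site: directory query to the number of known urls under it."""
    counts = Counter()
    for url in urls:
        parts = urlsplit(url)
        # Drop the last path piece (the file name, or "" after a trailing slash).
        dirs = [s for s in parts.path.split("/")[:-1] if s]
        prefix = parts.netloc + "/" + "".join(d + "/" for d in dirs[:depth])
        counts["site:" + prefix] += 1
    return counts

# Hypothetical urls standing in for a real sitemap export.
urls = [
    "http://www.example.com/abc/page1.html",
    "http://www.example.com/abc/sub/page2.html",
    "http://www.example.com/def/page3.html",
    "http://www.example.com/index.html",
]

if __name__ == "__main__":
    for query, known in sorted(site_queries(urls, depth=2).items()):
        print(f"{query}  ({known} urls in sitemap)")
```

Raising depth to 2 or 3 gives the second and third sub-directory levels, if the url structure allows.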

Every once in a while the dust clears and for a while these Google reporting tools make sense. But because of the current transition to Caffeine, I don't expect to see another such moment of clarity any time soon.

If you want to see the indexed pages that Google considers most valuable, run the site: operator on AOL - their partner. Google only exports a subset of the entire index to AOL Search, and which urls they do export is interesting to note.

anallawalla

3:56 am on Oct 6, 2009 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



I second this observation. None of the numbers makes sense - you have to break down the search into very tiny chunks to prove it to yourself. Example (not our actual numbers):

WMT:
URLs in sitemap = 45,000,000
Indexed = 200,000

site:example.com = 2,160,000
site:www.example.com = 1,720,000
site:example.com -site:www.example.com = 8,660

Break up into states:
site:www.example.com/abc = 265,000
site:example.com/abc = 365,000

site:www.example.com/def = 230,000
site:example.com/def = 300,000

site:www.example.com/abc "unique string A" = 18,000
site:example.com/abc "unique string A" = 18,000

site:www.example.com/def "unique string A" = 11,000
site:example.com/def "unique string A" = 8,000

Inconsistent and unreliable.
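To put a number on the inconsistency, here is a small Python sketch that runs two sanity checks on the illustrative figures above. It assumes site:example.com is meant to cover every hostname under the domain, www included, so a www count should never exceed the matching bare-domain count. The function names are made up for this example.

```python
# The illustrative "about" counts from the example above (not real numbers).
counts = {
    'site:example.com': 2_160_000,
    'site:www.example.com': 1_720_000,
    'site:example.com -site:www.example.com': 8_660,
    'site:www.example.com/abc': 265_000,
    'site:example.com/abc': 365_000,
    'site:www.example.com/def': 230_000,
    'site:example.com/def': 300_000,
    'site:www.example.com/abc "unique string A"': 18_000,
    'site:example.com/abc "unique string A"': 18_000,
    'site:www.example.com/def "unique string A"': 11_000,
    'site:example.com/def "unique string A"': 8_000,
}

def implied_non_www(counts):
    """Non-www total implied by subtracting the www count from the domain count."""
    return counts['site:example.com'] - counts['site:www.example.com']

def subset_violations(counts):
    """Return www queries whose count exceeds the matching bare-domain query."""
    bad = []
    for query, n in counts.items():
        if query.startswith('site:www.'):
            broader = query.replace('site:www.', 'site:', 1)
            if broader in counts and n > counts[broader]:
                bad.append((query, n, counts[broader]))
    return bad

if __name__ == "__main__":
    implied = implied_non_www(counts)
    reported = counts['site:example.com -site:www.example.com']
    print(f"implied non-www: {implied:,}  reported: {reported:,}")
    for query, n, broader_n in subset_violations(counts):
        print(f"{query} = {n:,} exceeds bare-domain count {broader_n:,}")
```

On these figures the subtraction implies 440,000 non-www urls against 8,660 reported, and the /def "unique string A" query returns more for www than for the whole domain.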