
Can't explain number of indexed pages in WMT



3:10 am on Sep 7, 2013 (gmt 0)

5+ Year Member

The number of indexed pages in WMT has been significantly higher than expected for years.
One year ago we deleted some spam and created a sitemap using a crawler. This sitemap is still in place.
After that, we deleted some of our own pages. At that time 99% of the pages in our sitemap were indexed.

WMT stats today
Indexed pages: 130k (max 155k in February)
Sitemap: 135k
Indexed pages in sitemap: 115k
Crawling is more than healthy: Google crawls 10k pages every day.

That leaves me with 15k pages which are not in our sitemap and therefore couldn't have been found by our crawler a year ago.

New pages are part of this difference, but in the last year we created fewer than 5k pages, so there are at least 10k indexed pages I can't explain. It's a problem that has lasted for years anyway.

More Details:
One year ago, 10k spam pages were injected into our site and we removed them.
We also had a problem on Google's side with the URL parameter handling. It took them 1.5 years to discover and remove 500k (sic) URLs we used to track our ads, although the correct policy was always in place.
URL parameters account for only 2.5k pages according to WMT, and they are included in our sitemap anyway.
I can't find any traces of unwanted tracking URLs in Google search.

One year ago we tried to wipe out all traces of old URLs and spam. We used 410s and also applied the URL removal tool to all relevant folders, even removing good content to get rid of all remaining unwanted pages.
I can't find any traces of the old and spam pages we removed in Google search.
I can't find any new spam using site:example.com viagra etc.

Widgets on other websites account for only a handful of indexed pages in the SERPs (I should work on that). Even theoretically it can't be more than a few hundred.

Is this normal? What am I missing?


12:59 pm on Sep 8, 2013 (gmt 0)

WebmasterWorld Administrator 5+ Year Member Top Contributors Of The Month

I have seen this situation on many sites, and most of them did not have any ranking problem. Firstly, WMT data can be inaccurate. For example, I often see the number of indexed pages reported in WMT being quite different from the result returned by the site: command. You cannot even say which one is correct, or whether the right number is somewhere in between.

Sometimes new URLs are created because of scrapers. They scrape Google results, which often show URLs truncated with "...", and unless your server handles these perfectly and returns a 404 or redirects to a known URL, you may get additional URLs indexed. It is very difficult to find these cases (or rather, all of these cases).
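To illustrate, here is a minimal sketch of the kind of server-side check described above. The set of known paths is hypothetical; the point is simply that a URL carrying a SERP truncation marker should get a 404 rather than being served as a new page:

```python
# Hypothetical sketch: classify incoming paths so that scraper-mangled URLs
# (truncated with "..." or a literal ellipsis character) return 404 instead
# of being served and indexed as duplicates. KNOWN_PATHS is an assumption.
from urllib.parse import unquote

KNOWN_PATHS = {"/products/blue-widget", "/about"}

def classify_request(path: str) -> int:
    """Return the HTTP status we would send for this path."""
    decoded = unquote(path)
    # Scrapers copying Google SERPs often keep the visual truncation marker.
    if decoded.endswith("...") or "\u2026" in decoded:
        return 404
    if decoded in KNOWN_PATHS:
        return 200
    return 404
```

In practice the same rule would live in the web server or framework routing layer, not in a standalone function.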

If your site's traffic and ranking have not been affected, this discrepancy has been steady for a while, and you have done the sanity checks with URLs, then I would not be worried.

I have a case where a site's indexed URL count in WMT increased by 40%, stayed like that for a month, and then dropped back to where it was a month before, without any changes to the site. No adverse effects on traffic/ranking. I am treating this as a Google bug. During the 40% rise we checked but found nothing that could cause problems.


2:01 pm on Sep 8, 2013 (gmt 0)

WebmasterWorld Senior Member 5+ Year Member

I was thinking about this problem a couple of months ago.
I keep track of the number of indexed pages from week to week, and my normal amount was between a million and a million and a half. But two months ago it dropped to about 150,000 -- a huge fall.
So I looked into it and found that I only had about 350,000 pages on the entire site. The 200,000 that weren't indexed were all database-generated pages without any internal links on the site, which was easily fixed. But that still left a million "ghost" pages that I couldn't account for.

My actual traffic hasn't dropped at all, and Google is still crawling about 30,000 pages a day -- that hasn't changed.

I always check all my query strings for unrecognised ones, and I 404 everything that isn't legit. I even put all of the different query strings into a recognised order, and redirect the page if they are out of order. So I haven't got a clue where all these extra pages have come from.
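The approach described above can be sketched roughly like this. The parameter whitelist and its canonical order are invented for illustration; the logic is: unknown parameters get a 404, legitimate parameters in a non-canonical order get a 301 to the canonical ordering:

```python
# Sketch of query-string policing: whitelist parameters, 404 anything
# unrecognised, and 301-redirect legit parameters into one canonical order
# so each page has exactly one indexable URL. CANONICAL_ORDER is assumed.
from urllib.parse import parse_qsl, urlencode

CANONICAL_ORDER = ["category", "page", "sort"]  # hypothetical parameter set

def check_query(query: str):
    """Return (status, canonical_query_or_None) for a raw query string."""
    pairs = parse_qsl(query, keep_blank_values=True)
    if any(k not in CANONICAL_ORDER for k, _ in pairs):
        return (404, None)                        # unrecognised parameter
    ordered = sorted(pairs, key=lambda kv: CANONICAL_ORDER.index(kv[0]))
    if ordered != pairs:
        return (301, urlencode(ordered))          # redirect to canonical order
    return (200, query)                           # already canonical
```

The redirect step is what collapses `?sort=asc&category=books` and `?category=books&sort=asc` into a single indexed URL.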

But my traffic is fine, so I have decided not to worry about it.


7:03 am on Sep 12, 2013 (gmt 0)


From this I want to know one thing: if I submit a sitemap.xml file in GWT, it shows me the submitted URLs and the indexed URLs after the robots crawl. Is it possible to see the actual pages (URLs) that Google has crawled in GWT?


8:30 am on Sep 12, 2013 (gmt 0)

WebmasterWorld Senior Member lucy24 is a WebmasterWorld Top Contributor of All Time Top Contributors Of The Month

You mean, minute by minute? No, it's all aggregated. But you can easily get this information from your logs. The UA is always Googlebot -- sometimes with an extra, like Googlebot-Image or Googlebot-Mobile (three of 'em) -- and there's a narrow range of IPs.
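Pulling Googlebot requests out of an access log can be as simple as the sketch below, assuming the common combined log format; the sample lines in any real deployment would come from your own server logs (and for certainty you would also verify the IPs via reverse DNS, since the UA can be spoofed):

```python
# Minimal sketch: extract the URL paths requested by any Googlebot variant
# from combined-log-format lines. The regex assumes the standard Apache/nginx
# combined format; adjust it to match your own log format.
import re

LINE = re.compile(
    r'"(?:GET|POST|HEAD) (?P<path>\S+) [^"]*" \d{3} \S+ "[^"]*" "(?P<ua>[^"]*)"'
)

def googlebot_paths(lines):
    """Yield the paths requested by Googlebot, Googlebot-Image, etc."""
    for line in lines:
        m = LINE.search(line)
        if m and "Googlebot" in m.group("ua"):
            yield m.group("path")
```

Feeding your raw access log through this gives the per-URL crawl detail that WMT's aggregated charts do not.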


5:35 am on Sep 15, 2013 (gmt 0)

5+ Year Member

I guess rvkumarweb meant indexed pages. You can get this information as granular as you want it to be; you just have to create sitemaps for every category/folder/page. You can only include 400 sitemaps in a sitemap index in WMT, but you can have more than one sitemap index. I don't know if there is a limit for those, too. Of course, you should still be able to draw conclusions from the information provided.
I will have to create a second sitemap index soon, and I think this is one of the very few ways to get accurate information out of Google.
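Generating such a sitemap index is straightforward; here is a small sketch (the sitemap URLs are placeholders) that emits the index format from the sitemaps.org protocol, with one `<sitemap>` entry per category-level sitemap file:

```python
# Sketch: build a sitemap index pointing at per-category sitemaps, so WMT
# reports submitted/indexed counts per category. URLs below are placeholders.
from xml.sax.saxutils import escape

def sitemap_index(sitemap_urls):
    """Return a sitemaps.org-style sitemap index XML document as a string."""
    entries = "\n".join(
        f"  <sitemap><loc>{escape(u)}</loc></sitemap>" for u in sitemap_urls
    )
    return (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        '<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n'
        f"{entries}\n"
        "</sitemapindex>"
    )
```

Comparing each sitemap's submitted vs. indexed count in WMT then narrows down which section of the site the unexplained pages live in.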


6:18 am on Sep 15, 2013 (gmt 0)

5+ Year Member

Sometimes new URLs are created because of scrapers. They scrape Google results, which often show URLs truncated with "...", and unless your server handles these perfectly and returns a 404 or redirects to a known URL, you may get additional URLs indexed. It is very difficult to find these cases (or rather, all of these cases).

We invested a lot of effort to get rid of infinite URL spaces, and I am confident that we always deliver the correct HTTP status. At the moment, (permanent) redirects are the only cause for the number of indexed pages I can imagine. We also have a lot of natural redirects in place that get created when an entity changes its name.

While we don't want to avoid being listed in Google, we can take some preemptive measures to minimise the number of scraper links:
One preemptive measure is using relative URLs in your code.
Avoiding/deleting links from Wikipedia, DMOZ and similar sites will also reduce the number of links from scrapers.
I went after all unnatural links long before Panda. I call them, threaten to sue them, call their investors -- whatever helps to contain this menace before it starts spreading. I guess sites with tons of pages and outgoing links are more interesting to scrapers than mom-and-pop sites. After all, you have to systematically modify the data to get it to rank, and this process is more cost-efficient with huge sites.
