Forum Moderators: Robert Charlton & goodroi


New Index Status feature added in Webmaster Tools

         

sunnyujjawal

9:21 am on Jul 25, 2012 (gmt 0)

10+ Year Member



A GWT blog post says: [googlewebmastercentral.blogspot.in]
Since Googlebot was born, webmasters around the world have been asking one question: Google, oh, Google, are my pages in the index? Now is the time to answer that question using the new Index Status feature in Webmaster Tools. Whether one or one million, Index Status will show you how many pages from your site have been included in Google’s index.


For this, Google has added Index Status under the Health menu.
It shows how many pages are currently indexed. The legend shows the latest count and the graph shows up to one year of data.


There is also an Advanced tab:
The advanced section will show not only totals of indexed pages, but also the cumulative number of pages crawled, the number of pages that we know about which are not crawled because they are blocked by robots.txt, and also the number of pages that were not selected for inclusion in our results.
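
If you want to keep an eye on these figures over time, one simple approach is to copy them out of the report each week and sanity-check them in a script. Here is a minimal sketch in Python; the field names are my own, and the rough rule of thumb (indexed plus not-selected pages should not exceed the cumulative crawl count) is my reading of the blog post, not anything Google documents:

```python
# Minimal sketch: log the Index Status "advanced" counts by hand and
# flag weeks where the figures look inconsistent. Field names and the
# rule of thumb below are assumptions, not documented behaviour.
from dataclasses import dataclass


@dataclass
class IndexStatusSnapshot:
    date: str
    total_indexed: int
    ever_crawled: int       # cumulative, so it should only ever grow
    blocked_by_robots: int  # known about but never crawled
    not_selected: int       # crawled but left out of the results

    def looks_consistent(self) -> bool:
        # Everything indexed or crawled-but-skipped should be part of
        # the cumulative crawl total.
        return self.total_indexed + self.not_selected <= self.ever_crawled


# Hypothetical weekly readings copied from the report by hand.
history = [
    IndexStatusSnapshot("2012-07-18", 7600, 9800, 120, 450),
    IndexStatusSnapshot("2012-07-25", 96000, 9900, 120, 450),
]

for snap in history:
    note = "" if snap.looks_consistent() else "  <-- worth a closer look"
    print(snap.date, snap.total_indexed, "indexed /",
          snap.ever_crawled, "ever crawled", note)
```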

tedster

1:55 pm on Jul 25, 2012 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Note this related post:

Webmaster Tools - unrealistically high "Ever crawled" number [webmasterworld.com]

I like this report - and hope that over time the numbers make sense. I just looked at one site that shows 0 indexed and 0 crawled, but the site: operator returns over 1000 URLs and they've been getting Google traffic since the year 2000!

netmeg

2:07 pm on Jul 25, 2012 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Heh yea, I don't really get some of the numbers... one of my legacy sites says it went from 7600 URLs indexed to 96,000 during one week in October 2011. That musta been quite a week! (There's about 2500 URLs in the sitemap, and a whole ton of stuff that's blocked - and it's been that way for at least six years)

g1smd

2:25 pm on Jul 25, 2012 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



A site with about 3000 real pages and infinite Duplicate Content issues (unchecked wildcard parts of URLs), fixed at the beginning of the year, shows 100 000 URLs "ever crawled" (sounds reasonable), while the "indexed" count is just over 8000.

However, for the last 18 months, the site: search only ever returns between 700 and 750 URLs. Even during and after the mass redirecting of the old multiple-parameter URLs and old .html (rewritten) URLs to the new single extensionless URL per page, the numbers in the site: search did not go up.

In the WMT internal linking reports, many of the listed pages were showing upwards of 50 000 internal links at the start of the year, and that figure is now down around the 8000 mark. This figure is over-reported, and the simple reason is that Google hasn't yet crawled all of the redirects from old to new URLs, so it believes the old URLs still exist and is still counting the links that used to exist on those pages.

So it seems that the WMT "indexed" figure relates to "URLs that link to each other within the site".

However, with 8000 URLs reported as "indexed" in WMT there's still only 730-ish URLs showing for the site: search.


At site relaunch, the old 3000-page site had about 500 new pages added (extensionless URLs), and a few hundred old pages (parameterised or .html URLs) went 404. All old content pages (both .html and parameterised) that still exist have redirects from those old URLs to the new extensionless URL for the page. Requesting one of the now-retired pages returns 404 on the first request, and 410 Gone on every request after that.
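
A quick way to confirm the server side of a relaunch like this is behaving is to spot-check a sample of old URLs and make sure each one returns the expected status. A minimal sketch, assuming the third-party requests library and entirely made-up example URLs:

```python
# Minimal sketch: spot-check that old URLs 301 to the new extensionless
# pages and that retired URLs answer 404/410. Uses the third-party
# "requests" library; the URLs below are made-up examples.
import requests

checks = [
    # (old URL,                                  acceptable status codes)
    ("http://www.example.com/widgets.html",      {301}),
    ("http://www.example.com/page.php?id=42",    {301}),
    ("http://www.example.com/retired-page.html", {404, 410}),
]

for url, expected in checks:
    resp = requests.get(url, allow_redirects=False, timeout=10)
    ok = "OK " if resp.status_code in expected else "BAD"
    print(ok, resp.status_code, url, "->", resp.headers.get("Location", "-"))
```

Run over a sample of the redirected URLs, that at least confirms the 301s and 410s are being served correctly, whatever Googlebot then makes of them.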

With crawling at a very minimal number of pages per day (under 450), it looks like the site is under some sort of crawl-budget penalty. Crawl rate was over 3000 URLs per day for a while after the relaunch, but the graph flat-topped, suggesting some sort of enforced limit. Maybe 80 000 redirects from the old duplicate-content-riddled URL structure to the new extensionless structure was a bit much for the old Googlebot to handle? At one point the site went offline for a couple of hours on two occasions about a week apart, and immediately after the second outage the crawl rate dropped to under 500 URLs per day and has stayed there.

[edited by: g1smd at 2:45 pm (utc) on Jul 25, 2012]

indyank

2:25 pm on Jul 25, 2012 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Funny numbers! If they can't get these numbers right, how will they calculate the overall site quality scores for deciding whether to inflict an algorithmic site-wide action through their animals?

aristotle

2:29 pm on Jul 25, 2012 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I just checked one of my sites, and found a discrepancy between this new Index Status report and the Sitemaps report.

The Sitemaps report says
URLs submitted = 38
URLs indexed = 37

But the Index Status advanced report says:
Total indexed = 41
Ever crawled = 59
Not selected = 9
Blocked by robots.txt = 1

I believe that the Sitemaps report is correct. The information in the Index Status report can't be right. The only pages that aren't in the submitted sitemap either have noindex tags or are blocked by robots.txt. The "Not selected = 9" is puzzling, since the total number of pages of all types is only 43, and I've never deleted or redirected any pages.
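
For what it's worth, simply tallying the figures above shows why the two reports are hard to reconcile. A small sketch (the idea that indexed, not-selected and blocked URLs should roughly account for the pages that exist is my assumption, not documented behaviour):

```python
# Tally the reported figures and compare them with the number of pages
# the site actually has. How these counts "should" relate is an
# assumption, not documented Google behaviour.
sitemap = {"submitted": 38, "indexed": 37}
index_status = {"indexed": 41, "ever_crawled": 59,
                "not_selected": 9, "blocked_by_robots": 1}
pages_on_site = 43

accounted_for = (index_status["indexed"]
                 + index_status["not_selected"]
                 + index_status["blocked_by_robots"])

print("Sitemap says indexed:     ", sitemap["indexed"])
print("Index Status says indexed:", index_status["indexed"])
print("Indexed + not selected + blocked:", accounted_for,
      "vs pages on the site:", pages_on_site)   # 51 vs 43
```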

tedster

3:37 pm on Jul 25, 2012 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



If they can't get these numbers right, how will they calculate the overall site quality scores for deciding whether to inflict an algorithmic site-wide action

Webmaster Tools numbers have always seemed to be second-class citizens for Google. Because they are just reports, the data import doesn't seem to get the same level of quality control they use in the actual ranking algorithm. So we end up with transparency, but through a foggy window.

scooterdude

4:24 pm on Jul 25, 2012 (gmt 0)

10+ Year Member



I am curious: why would one assume that the data used in the ranking algorithm is any more accurate?

Given the sheer mass of data G handles, I am inclined to think it's the same data; otherwise the logistics of providing a free service like GWT might become... expensive.

Besides, insofar as the sites they love are on those loved-up exception lists, what does it matter how pinpoint-accurate such data is from G's point of view?

tedster

5:08 pm on Jul 25, 2012 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I don't just assume this; I'm sure it's proven. For example, Webmaster Tools tells me a site has zero URLs in the index, but the site actually gets Google Search traffic to over 1,000 URLs. That's a Q.E.D. in my book.

The issue here is one of scale. There are many copies of the data across Google's many server farms. Proposed new algorithms are run on private copies, for example, and load-sharing software sends users to various copies when they do a search. Tapping the ranking data directly (even one copy) would not be practical because it would generate too much demand.

Sometimes the data used for actual rankings gets a buggy or partial import to a live server farm that serves search results. Google has admitted this in the past - but it also fixes such bugs pretty fast.

To get some understanding of the scale at which Google operates, check out this discussion, which is already four years old!
The Google Search Query - a technical look [webmasterworld.com]

This stuff is WAY beyond the kind of database we work with ;)

lucy24

8:11 pm on Jul 25, 2012 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Whether one or one million, Index Status will show you how many pages from your site have been included in Google’s index.

Great. Now all they have to do is swipe an idea from Yandex and tell you which specific pages have been indexed -- and why the non-indexed ones aren't.