aakk9999 - 12:03 pm on Jan 9, 2011 (gmt 0)
I have also noticed that when using the site: command, Google has recently been reporting many more pages as indexed. On one of the domains we follow, the number of indexed pages has approximately doubled in the last few weeks.
Since the site is not very big (approx. 4,000 pages), I was able to inspect the results of the site: command more closely and noticed that the following URLs also appeared in the results:
- many pages that are blocked by robots.txt, and have been for a very long time, now appear in the results. These pages were not there before, and most of them are not linked from anywhere other than within the site
- a number of pages that have had a 301 redirect in place for over a year are now also listed in the site: results. Testing the redirect, the request still returns the correct HTTP 301 status. The page title reported in the SERPs for these pages is the link anchor, followed by a hyphen, followed by the home page title
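If anyone wants to repeat the redirect test, the key is to use a client that does not silently follow redirects, so you see the raw 301 itself. Here is a minimal self-contained sketch using Python's standard library; the local test server and the /old-page and /new-page paths are made up for illustration, so point the http.client request at your own URLs instead:

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

class RedirectHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Stand-in for the real site: every request gets a permanent redirect.
        self.send_response(301)
        self.send_header("Location", "/new-page")
        self.end_headers()

    def log_message(self, *args):
        pass  # keep the demo quiet

server = HTTPServer(("127.0.0.1", 0), RedirectHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

# http.client does NOT follow redirects, so we can inspect the 301 directly.
conn = http.client.HTTPConnection("127.0.0.1", server.server_port)
conn.request("GET", "/old-page")
resp = conn.getresponse()
print(resp.status, resp.getheader("Location"))  # 301 /new-page
server.shutdown()
```

The same check can of course be done with any client that lets you disable redirect-following; the point is to confirm the status code and Location header of the first response.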
To me it almost looks as though Google has "folded in" some historic information about site URLs it knew from way back.
In our case there is no impact on ranking or traffic, but it is always a concern when the number of "indexed pages" shows as double what you know it should be.
With regard to GWT crawl errors, whilst they give good guidance, I have noticed they are not always correct. For instance, I have some pages reported as "blocked by robots.txt", but when I copy the URL into the "Crawler access" section of GWT, it shows the page as "Allowed".

Another instance I noticed is that Google occasionally reports some pages as a 400 error. If you cut and paste the URL shown in GWT, the page renders correctly. If you hover over the URL Google shows in GWT, the correct URL appears in the browser's status bar. But if you CLICK on the URL directly from the GWT HTTP errors section, the URL that shows in the address bar, and which is actually requested, has all ?, & and = characters replaced by %3F, %26 and %3D, and this throws a server error. Why this is being reported for only a handful of URLs and not across the board for all dynamic URLs is a mystery to me.
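On the "blocked by robots.txt" vs "Allowed" discrepancy: it can help to test the URL against your own robots.txt rules locally instead of relying only on GWT's verdict. A minimal sketch with Python's standard library robots.txt parser; the Disallow rule and example.com URLs here are hypothetical, so paste in your own robots.txt lines and the URL GWT complains about:

```python
from urllib.robotparser import RobotFileParser

# Parse robots.txt rules from a local string rather than fetching them,
# so the check is independent of what the crawler last saw.
rp = RobotFileParser()
rp.parse("""
User-agent: *
Disallow: /private/
""".splitlines())

# can_fetch() answers: may this user-agent request this URL?
print(rp.can_fetch("Googlebot", "http://example.com/private/report.html"))  # False
print(rp.can_fetch("Googlebot", "http://example.com/index.html"))           # True
```

If your local check and GWT disagree, the likely explanations are a stale cached copy of robots.txt on Google's side, or a user-agent-specific rule group overriding the wildcard one.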
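The %3F/%3D/%26 substitution looks like classic over-escaping: those are the percent-encoded forms of ?, = and &, and once the query delimiters themselves are encoded, the server no longer sees a query string at all, which would explain the error on click-through. A small sketch (the URL is hypothetical) showing the difference:

```python
from urllib.parse import quote, parse_qs, urlsplit

url = "/search?cat=widgets&page=2"

# Over-encoding the whole path+query, delimiters included, is what the
# GWT link appears to be doing:
broken = quote(url, safe="/")
print(broken)  # /search%3Fcat%3Dwidgets%26page%3D2

# With the delimiters intact, the server can still split the query
# string into parameters; in the over-encoded form no query survives.
print(parse_qs(urlsplit(url).query))     # {'cat': ['widgets'], 'page': ['2']}
print(parse_qs(urlsplit(broken).query))  # {}
```

Only the parameter names and values should ever be percent-encoded, never the ?, & and = that separate them, which is why the requested URL falls over server-side.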