Sgt_Kickaxe - 2:16 am on Sep 12, 2012 (gmt 0) [edited by: Robert_Charlton at 3:26 am (utc) on Sep 12, 2012]
< moved from another location >
Mod's note: original post title:
Phantom pages as a result of Google ignoring robots.txt
Perplexed as to why one of my 500-page mini-sites suddenly began showing 30,000 pages indexed when I ran a site:example.com search, I did some digging. Here's what I found. Hope it helps others, especially if you run WordPress.
- Though Google reports 30,000 pages indexed, you cannot see them all in Google. If you click through to the last visible page of the results, there are suddenly only a handful of pages' worth of content indexed, not the 30,000 Google originally reported. Bug?
- By playing around with the site: command and adding some parameters (example query below), I managed to get Google to reveal that the 29,500 EXTRA pages indexed are in fact comment edit pages, which are supposed to be blocked by robots.txt.
The entry for all 29,500 of these is as follows...
A description for this result is not available because of this site's robots.txt – learn more
Notice how Google is indexing content that it says, right on the results page, is restricted by robots.txt?!?
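For anyone who wants to check their own site, the sort of query I mean is along these lines. The inurl: operator narrows the site: results down to the comment edit URLs; example.com is a placeholder and /wp-admin/comment.php is the default WordPress comment edit path, so adjust both for your own install:

site:example.com inurl:comment.php

Appending &filter=0 to the results URL also tells Google not to omit near-duplicate results, which helps surface more of these entries.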
- The new Webmaster Tools "Index Status" feature says that my site now has 30,000 known pages, of which 2,000+ are indexed, 8,000 are not selected and 28,000 are blocked by robots.txt. None of that is accurate; the site has 500 articles. The site: command was accurate 3 months ago and I've changed nothing since.
Questions: Should I remove the /wp-admin/ entry from robots.txt, since Google is ignoring it completely? How can I remove these phantom pages from the SERPs when Google is ignoring my directives? Should I be in touch with a lawyer, since they are crawling where they are explicitly banned? Other ideas?
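For reference, the /wp-admin/ entry I'm referring to is the standard WordPress one. I'm quoting its typical form here rather than my exact file, but it amounts to:

User-agent: *
Disallow: /wp-admin/

The comment edit pages all live under /wp-admin/, so by that directive they shouldn't be fetched at all.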