|The Submitted/Indexed ratio is still the same, which to me says Google is not obeying the robots.txt directive|
The rest of your post seems to say the opposite: that Google IS obeying robots.txt and therefore not crawling the blocked pages.
crawling and indexing are different things
This is an interesting post because others have suggested that the sitemap overrides robots.txt, such that anything in a sitemap will be crawled even if roboted-out. Your experience seems to say otherwise.
Anyway, the solution is straightforward. If you want a page to be neither crawled nor indexed, don't include it in a sitemap. If you want it to be crawled but not indexed, give each page a meta noindex tag. If Google knows that a resource exists, it will index it unless it has been explicitly told not to.
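For reference, the tag in question is a one-liner in each page's <head> (a minimal sketch):

    <meta name="robots" content="noindex">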
"it will index it unless it has been explicitly told not to."
...adding, robots.txt does NOT work as a method of saying "do not index". OP seems to be confused about this.
If I link to your robots.txt-excluded pages, Google will probably add your URLs to the index and rank them based on what it can know about them without crawling them directly. So not even robots.txt plus leaving them out of the sitemap will do what you want - you have to use a NOINDEX directive somewhere.
Amusingly, in the above example, a NOINDEX wouldn't work because Google has been instructed (by you) not to crawl the page, so it can't know what the META ROBOTS directives are. I.e., robots.txt actually works against you here.
There are a lot of pages to noindex! Is there a way to do this with a .htaccess file?
Look into the X-Robots-Tag header.
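A minimal .htaccess sketch, assuming Apache with mod_headers enabled (the .pdf pattern is just an example; match whatever set of files you need to noindex):

    # Send a noindex header with every PDF served from this directory down
    <FilesMatch "\.pdf$">
        Header set X-Robots-Tag "noindex"
    </FilesMatch>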
Sounds good...will be curious to see if robots.txt + x-robots can defeat sitemap.xml!
If it doesn't, I guess I'll just give up and delete those entries out of sitemap.xml.
(For what it's worth, I would never put them IN the sitemap to begin with. Ever.)
|will be curious to see if robots.txt + x-robots can defeat sitemap.xml |
Remove the robots.txt block.
If Googlebot can't crawl your pages, it will never see the "noindex", whether it's in an X-Robots-Tag response header or in a meta tag within the individual page.
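For instance, if robots.txt currently contains something like this (the path here is hypothetical), it's the Disallow line that has to go before the noindex can ever be seen:

    User-agent: *
    Disallow: /archive/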
|Sounds good...will be curious to see if robots.txt + x-robots can defeat sitemap.xml! |
To reiterate what Lucy24 said: You *cannot* block a page from being crawled in robots.txt *and* noindex the page to have it removed, because if Google finds links to the page it usually *will* be indexed based on the information in and surrounding those links, even though Googlebot cannot crawl the page itself.
If you want a page to be removed from the index you *must* allow GoogleBot to crawl it and either have noindex on the page, send noindex in an X-Robots-Tag header for the page, *or* serve GoogleBot an error code such as 403 when the page is requested.
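The 403 route can also be handled in .htaccess, assuming Apache with mod_rewrite (the /members/ path is made up; substitute your own):

    RewriteEngine On
    # Answer 403 Forbidden for anything under /members/ so it drops out of the index
    RewriteRule ^members/ - [F]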
Also, I'm really not sure I understand why you're explicitly telling GoogleBot how to find pages you don't want indexed. That seems to me like sending "conflicting signals", unless you're just trying to get them crawled so the noindex or 403 error is seen, and then you'll remove them from the XML Sitemap.
The "indexing system" is just that, a system [not a person], so I think it's always best to send the clearest message you can.