Forum Moderators: open
There could be a time delay in following a new robots.txt instruction due to caching of the old one. Have you checked the headers on your file to confirm that the expiration isn't set far out into the future, and that the syntax you're using is correct?
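As a quick sanity check, the caching headers can be inspected without any special tools. The sketch below (illustrative only; the 24-hour threshold is an assumption, not anything a search engine documents) reads the standard `Cache-Control` and `Expires`/`Date` headers and reports the advertised cache lifetime:

```python
# Sketch: estimate how long a cached robots.txt might be reused, from its
# HTTP caching headers. Header names are standard HTTP; the 24h "excessive"
# threshold is an assumption for illustration.
from email.utils import parsedate_to_datetime


def cache_lifetime_seconds(headers):
    """Return the advertised cache lifetime in seconds, or 0 if none."""
    cache_control = headers.get("Cache-Control", "")
    for directive in cache_control.split(","):
        directive = directive.strip()
        if directive.startswith("max-age="):
            return int(directive.split("=", 1)[1])
    # Fall back to the older Expires header, measured against Date.
    expires, date = headers.get("Expires"), headers.get("Date")
    if expires and date:
        delta = parsedate_to_datetime(expires) - parsedate_to_datetime(date)
        return max(0, int(delta.total_seconds()))
    return 0


def caching_looks_excessive(headers, limit=24 * 3600):
    """True if the file could be served stale for more than `limit` seconds."""
    return cache_lifetime_seconds(headers) > limit
```

Feed it the headers from any header-checking tool; a week-long `max-age` on robots.txt would explain a crawler honoring the old file for days.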
The prevailing wisdom on this forum was to use both robots.txt and NOINDEX to keep a page out of Google's index. TheDave's "Catch-22" scenario was never brought up before, to my knowledge: if robots.txt blocks the crawler from fetching the page, it can never see the NOINDEX tag on it. It does make perfect sense though.
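To make the Catch-22 concrete, here is a minimal pair of fragments (the `/private/` path is a made-up example, not from this thread). The robots.txt rule stops the crawler at the door, so the meta tag inside the page is never read:

```
# robots.txt -- blocks crawling of the directory
User-agent: *
Disallow: /private/

<!-- /private/page.html -- a NOINDEX the blocked crawler never fetches -->
<meta name="robots" content="noindex">
```

With both in place, a URL Google already knows about can linger in the index, since the only signal telling it to drop the page is unreachable.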
Actually, Google's cache of our site was a real mess. The number of pages in the cache changed from day to day, and hundreds of pages that were excluded by meta tags and robots.txt still persisted in Google's cache.
I recently discovered Google's URL removal login page (now there's a secret gold mine!) and applied the robots.txt again. It seems to be working this morning. Our cache is definitely stranger than usual. I'll be keeping tabs on its progress.
As an aside, we also have a customized error page. Is it possible that Google won't recognize the customized error page as an error 404? Could that prevent non-existing pages from being promptly removed from the index?
Is it possible that Google won't recognize the customized error page as an error 404? Could that prevent non-existing pages from being promptly removed from the index?
It would if the 404 page is not returning the proper status in the header. I've seen lots of 404 pages returning a 200 status, which is where the problems begin. Make sure your 404 page is returning a 404 status in the header...
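A do-it-yourself version of that check, in the spirit of the header checker linked below, can be sketched with the Python standard library. The two handler classes and the local test URLs are illustrative, not from this thread; they contrast a custom error page served with a proper 404 status against the same page served with 200 (a "soft 404"):

```python
# Sketch: verify the status code behind a custom error page.
# Proper404 sends the error page with a real 404 status; Soft404 sends
# the identical page with 200, which is the misconfiguration described above.
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

ERROR_HTML = b"<html><body>Sorry, that page was not found.</body></html>"


class Proper404(BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(404)  # correct: error status in the header
        self.send_header("Content-Type", "text/html")
        self.end_headers()
        self.wfile.write(ERROR_HTML)

    def log_message(self, *args):  # silence request logging
        pass


class Soft404(Proper404):
    def do_GET(self):
        self.send_response(200)  # wrong: crawler sees a normal page
        self.send_header("Content-Type", "text/html")
        self.end_headers()
        self.wfile.write(ERROR_HTML)


def check_status(handler):
    """Serve one request with `handler` and return the status a client sees."""
    server = HTTPServer(("127.0.0.1", 0), handler)
    threading.Thread(target=server.handle_request, daemon=True).start()
    conn = http.client.HTTPConnection("127.0.0.1", server.server_port)
    conn.request("GET", "/no-such-page")
    status = conn.getresponse().status
    conn.close()
    server.server_close()
    return status
```

Point the same kind of request at your own site's missing URLs: if the body says "not found" but the status line says 200, the index cleanup will stall.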
Server Header Checker [searchengineworld.com]