Google spider ignoring robots.txt and NOINDEX?


Umbra

7:09 pm on Sep 28, 2003 (gmt 0)

10+ Year Member



I found a page in Google's cache of our site that should not be there. This page contains NOINDEX, NOFOLLOW metatags, and our robots.txt excludes the entire directory containing it. How could that happen?

TheDave

3:37 am on Sep 29, 2003 (gmt 0)

10+ Year Member



In the search results, is it just a link, or does the result also have a snippet from the page? Google will list links to pages even if they are excluded from crawling by robots.txt. And if the page is excluded by robots.txt, Google has no way of knowing that it is in fact marked NOINDEX.
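To make that Catch-22 concrete, here is a sketch (the directory and file names are made up): with both mechanisms in place, robots.txt stops Googlebot from ever fetching the page, so the NOINDEX tag inside it is never read.

```
# robots.txt -- bars crawlers from the whole directory
User-agent: *
Disallow: /private/

<!-- /private/page.html -- because fetching is disallowed above,
     this tag is never downloaded, so it can never be honored -->
<meta name="robots" content="noindex, nofollow">
```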

BlueSky

4:40 am on Sep 29, 2003 (gmt 0)

10+ Year Member



I have had problems with Googlebot spidering pages that carry NOINDEX, NOFOLLOW metatags, but it has never strayed into disallowed areas listed in robots.txt.

There could be a delay in a new robots.txt instruction taking effect, because the old one may still be cached. Have you checked the headers on the file to make sure the expiration is not set far into the future, and that the syntax you're using is correct?
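That header check can be automated. Here's a minimal sketch — the helper name and the one-day threshold are my own choices, nothing Google documents — that parses an `Expires` header and flags one set far into the future:

```python
from datetime import datetime, timedelta, timezone
from email.utils import parsedate_to_datetime

def expires_too_far_out(expires_header, max_age=timedelta(days=1)):
    """Return True if the Expires header lies more than max_age in the future."""
    expires = parsedate_to_datetime(expires_header)  # RFC 2822 date -> aware datetime
    return expires - datetime.now(timezone.utc) > max_age

# A far-future expiration like this would keep a stale robots.txt cached:
print(expires_too_far_out("Thu, 01 Jan 2037 00:00:00 GMT"))  # True
```

You would feed this the `Expires` value reported for your robots.txt by any server header checker.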

GoogleGuy

4:43 am on Sep 29, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



My two guesses would be what TheDave suggested (see [searchengineshowdown.com...] for more detail), or maybe that the robots.txt page was changed recently and we haven't found the changes yet? Lemme know if neither of those applies..

Umbra

12:56 pm on Sep 29, 2003 (gmt 0)

10+ Year Member



TheDave, in the search results it was just a link, with no snippet. BlueSky, our server does not modify the headers with expiration dates. GoogleGuy, the robots.txt is pretty old.

The prevailing wisdom on this forum was to use both robots.txt and NOINDEX to keep a page out of Google's index. TheDave's "Catch-22" scenario was never brought up before, to my knowledge. It does make perfect sense, though.

Actually, Google's cache of our site was a real mess. The number of cached pages changed from day to day, and there were hundreds of pages excluded by both metatags and robots.txt that still persisted in Google's cache.

I recently discovered Google's URL removal login page (now there's a secret gold mine!) and applied the robots.txt again. It seems to be working this morning. Our cache is definitely stranger than usual. I'll be keeping tabs on its progress.

As an aside, we also have a customized error page. Is it possible that Google won't recognize the customized error page as a 404 error? Could that prevent non-existing pages from being immediately removed from the index?

pageoneresults

1:38 pm on Sep 29, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Is it possible that Google won't recognize the customized error page as a 404 error? Could that prevent non-existing pages from being immediately removed from the index?

It would if the 404 page is not returning the proper status in the header. I've seen lots of 404 pages returning a 200 status, which is where the problems begin. Make sure your 404 is returning a 404 status in the header...

Server Header Checker [searchengineworld.com]
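The same check can be scripted. This is only a sketch — `fetch_status` and `is_soft_404` are made-up helper names, and the URL would be one of your own known-missing pages:

```python
from urllib.error import HTTPError
from urllib.request import Request, urlopen

def fetch_status(url):
    """Return the HTTP status code for url, including error statuses."""
    try:
        return urlopen(Request(url, method="HEAD")).status
    except HTTPError as err:
        return err.code

def is_soft_404(status, page_exists):
    """A 'soft 404': a page that does not exist but is served with 200 OK."""
    return (not page_exists) and status == 200

# e.g. status = fetch_status("http://www.example.com/no-such-page")
print(is_soft_404(200, page_exists=False))  # True -- the custom error page masks the 404
```

If a known-missing URL comes back 200, the custom error page is being served without the proper 404 status, and Google has no way to tell the page is gone.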