Forum Moderators: goodroi
I have a very simple robots.txt file that is supposed to disallow everything for all robots, but it doesn't seem to be working: I have found my site indexed in several search engines.
My robots.txt file IS located in my public root directory i.e. [myserver.com...] and contains this:
User-agent: *
Disallow: /
Is there something I'm missing?
TIA.
How are your pages listed in those search engines? That is, are they full listings with a title and a description (or snippet) from your page, or are they listed by URL only, or by URL with link-text only?
Google, a few others, and most recently Yahoo have been listing any page they find a link to, even if robots.txt does not allow them to fetch that page. In other words, they interpret a robots.txt Disallow to mean "Do not fetch this page" rather than "Do not list this page." Other search engines treat a robots.txt Disallow as saying not to list the page at all, but the Standard for Robots Exclusion favors the "do not fetch" interpretation.
The solution -- if this is the case for your site -- is to allow the 'bots to fetch your publicly-linked pages, and mark each one with a robots meta-tag specifying "noindex,nofollow". Given that scenario, Google and Yahoo will not list the page. A problem arises here for non-HTML pages, such as .pdf and media-type files that cannot carry a robots meta-tag -- there is no way to keep those out of the SERPs except to make sure there are no spiderable links to them.
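To illustrate, each HTML page you want kept out of the listings would carry this tag inside its <head> section:

<meta name="robots" content="noindex,nofollow">

You would then remove (or relax) the blanket Disallow in robots.txt, since the crawler has to be able to fetch the page in order to see the tag. If you only want certain directories crawlable, you could disallow just those paths instead of "/".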
Jim