I am guessing that if I add Disallow: /robots.txt to the robots.txt file, then that will stop the file content from appearing in the SERPs in the future.
I further surmise that Google will still fetch the file to see what is in it, since it needs the rules to work out what it can and cannot spider on the site. I mean to say:
Disallow: /
does not stop Google from accessing the robots.txt file to see what is in it, so I don't see why Disallow: /robots.txt should stop it doing so either.
What do you think?
Additionally, robots.txt disallows _crawling_ of pages on a given site. URLs from that site can still be present in a search engine by virtue of anchor text on other sites that reference your site and did not disallow crawling of their own pages.
SERPs might contain URLs from your site that were referenced by other crawled sites; you should not expect to control this with your own site's robots.txt.
Sure, Google and Yahoo might still show it as a URL-only entry once their indexer knows that the file exists, and Yahoo might even craft a title for their entry in the SERPs based on any anchor text they might find that points to the URL, but that in itself would be better than the current situation, where the robots.txt file ranks for important keywords all on its own, and shows a snippet, due to incoming links that point at it.
The content of that file cannot appear in the SERPs if it has been disallowed.
If content was crawled BEFORE you disallowed it, then I am not sure anything in the robots.txt standard mandates that this content be removed.
I think your problem is one of ranking, and to fix it you probably need to get other pages to rank better for the site: command. This should be fairly easy, since in this case the pages only compete among themselves.
[edited by: Lord_Majestic at 7:15 pm (utc) on Jan. 13, 2008]
If so, then Disallow: /robots.txt should work to keep the contents of robots.txt out of the index. But as stated above, the major SEs now include "naked" URL-only listings for just about any link they find, whether or not they are allowed to fetch the page, and the option of including a <meta name="robots" content="noindex"> tag is not open to you, since robots.txt is not an HTML page.
However, you might try returning the new X-Robots-Tag header [webmasterworld.com], recently defined as a Google/Yahoo-supported HTTP server response header (and NOT Disallowing robots.txt in robots.txt, so that it can be fetched along with its X-Robots-Tag header).
On Apache (with mod_headers enabled), you could use the following in .htaccess:
<FilesMatch "^robots\.txt$">
Header set X-Robots-Tag "noindex"
</FilesMatch>
Jim
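As a rough illustration of what an indexer might do with that header, here is a minimal Python sketch. The function name is_noindex and the plain-dict headers are assumptions made for the example, not any search engine's actual code, and it deliberately ignores the optional per-bot prefix form of the header:

```python
def is_noindex(headers):
    """Return True if any X-Robots-Tag header value lists the noindex directive.

    `headers` is assumed to be a plain dict of response header name -> value.
    """
    for name, value in headers.items():
        # HTTP header names are case-insensitive.
        if name.lower() == "x-robots-tag":
            # Directives are comma-separated, e.g. "noarchive, noindex".
            directives = [part.strip().lower() for part in value.split(",")]
            if "noindex" in directives:
                return True
    return False


# Hypothetical response headers for /robots.txt with the .htaccess rule applied.
response_headers = {"Content-Type": "text/plain", "X-Robots-Tag": "noindex"}
print(is_noindex(response_headers))
```

The point is only that the file has to remain fetchable for the header to be seen at all, which is why robots.txt should not also be Disallowed.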
Hopefully, most robots separate the "Fetch robots.txt automatically to see what I can crawl" function from the "fetch a link if allowed by robots.txt and process that page for inclusion in my index (even when that link happens to be to a robots.txt file)" functions.
Wishful thinking. In most complex systems crawling is separated from the actual index update. What you want is going to be one of the last problems that a search engine builder will concern himself with: one has to be robots.txt compliant, but going much further is not as easy as you think (certainly for big indices with tens of billions of pages, where nothing is easy at all).
It has to be said, though, that I think Google acts as you suggest, and in that they set the gold standard of behavior.
To be clear, if /robots.txt is Disallowed in robots.txt, I would expect crawlers to fetch it and parse it in order to build the fetch/no-fetch map for the domain. But I would also expect robots.txt, when found as a link, not to be fetched, because it is included in the no-fetch map of the domain.
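A minimal sketch of that two-function split, using Python's standard urllib.robotparser. The names build_rules and may_fetch_link, the ExampleBot user agent, and the example.com URLs are all made up for illustration; real crawlers may well behave differently:

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Disallow: /robots.txt
Disallow: /private/
"""


def build_rules(robots_txt_body):
    # Function 1: the robots.txt body is always parsed to build the
    # fetch/no-fetch map, whatever the file says about itself.
    rules = RobotFileParser()
    rules.parse(robots_txt_body.splitlines())
    return rules


def may_fetch_link(rules, url, agent="ExampleBot"):
    # Function 2: a URL discovered as a link is checked against the map,
    # with no special case for a link that points at /robots.txt.
    return rules.can_fetch(agent, url)


rules = build_rules(ROBOTS_TXT)
print(may_fetch_link(rules, "http://example.com/robots.txt"))
print(may_fetch_link(rules, "http://example.com/index.html"))
```

Under these assumptions the rules are still loaded, but a link to /robots.txt is refused for indexing purposes while ordinary pages remain fetchable.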
Do the crawlers you know about ( ;) ) include such extra case-handlers in the robots.txt-parsing and check-link-against-robots.txt functions?
I'd be inclined to try it and see what happens...
Jim
Most search engines would fetch robots.txt and then clean up their URL lists. However, this is prone to issues arising from the time gap between the robots.txt fetch and the actual crawling of the cleaned URLs. A better approach is to check robots.txt just prior to the actual crawl of a fairly large batch of URLs.
The main thing here is that most people would assume that crawling/parsing/indexing of URLs is somehow synced in one nice database so you can easily do things. Yes, this is what naive implementations do, but they never scale to hundreds of billions of URLs. When you deal with that many, nothing is straightforward :(
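To make the batch idea concrete, here is a small Python sketch. filter_batch, the robots_bodies dict (host to freshly fetched robots.txt text) and the ExampleBot agent are hypothetical names for the example; a real crawler would fetch the files over the network and handle errors, caching and per-bot rules:

```python
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser


def filter_batch(urls, robots_bodies, agent="ExampleBot"):
    """Keep only the URLs that the just-fetched robots.txt rules allow.

    `robots_bodies` maps host -> robots.txt text fetched right before this
    batch is crawled; a missing host is treated as "no rules" (allow all).
    """
    parsers = {}
    allowed = []
    for url in urls:
        host = urlsplit(url).netloc
        if host not in parsers:
            # Parse each host's rules once per batch.
            parser = RobotFileParser()
            parser.parse(robots_bodies.get(host, "").splitlines())
            parsers[host] = parser
        if parsers[host].can_fetch(agent, url):
            allowed.append(url)
    return allowed


batch = [
    "http://a.example/x",
    "http://a.example/private/y",
    "http://b.example/z",
]
bodies = {"a.example": "User-agent: *\nDisallow: /private/\n"}
print(filter_batch(batch, bodies))
```

Checking at batch time keeps the gap between reading the rules and fetching the URLs short, at the cost of one extra robots.txt fetch per host per batch.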