Robots.txt disallowed.

Forum Moderators: goodroi

g1smd

12:32 am on Jan 13, 2008 (gmt 0)

I'm looking at a site where the robots.txt file for that site is listed in the SERPs in a site:example.com search and appears in the top ten results. That's primarily because some other site links to the robots.txt file and Google has indexed it as if it were any other normal text file.

I am guessing that if I add Disallow: /robots.txt to the robots.txt file, then that will stop the file content from appearing in the SERPs in the future.

I further surmise that Google will still fetch the file to read the rules about what it can and cannot spider on the site. After all:

Disallow: /

does not stop Google from accessing the robots.txt file to see what is in it, so I don't see why Disallow: /robots.txt should stop it doing so either.

What do you think?

Marcia

12:42 am on Jan 13, 2008 (gmt 0)

MSN Live just indexed the robots.txt from a couple of my sites.

Added: Oh yeah, they're disallowed from the sites in robots.txt - it hasn't helped yet.

[edited by: Marcia at 12:43 am (utc) on Jan. 13, 2008]

Lord Majestic

1:40 am on Jan 13, 2008 (gmt 0)

robots.txt is an exception - you can't disallow it using robots.txt.

Additionally, robots.txt disallows _crawling_ of pages on a given site - URLs from that site can still be present in a search engine by virtue of anchor text on other sites that did not disallow crawling of their own pages linking to yours.

g1smd

5:38 pm on Jan 13, 2008 (gmt 0)

I don't want to disallow the file from being fetched, as I want them to follow all of the directives within it without fail.

I do want to disallow it from appearing in the SERPs, so I am guessing that adding that directive will allow that to happen.

Lord Majestic

5:41 pm on Jan 13, 2008 (gmt 0)

robots.txt does not control what appears in the SERPs. Some search engines might take robots.txt directives as a hint to remove content from their databases and never show it in the SERPs, but that is totally beyond the intention of robots.txt itself.

SERPs might contain URLs from your site that were referenced by other crawled sites - you should not expect to have control over this using your own site's robots.txt.

g1smd

7:09 pm on Jan 13, 2008 (gmt 0)

The content of that file cannot appear in the SERPs if it has been disallowed.

Sure, Google and Yahoo might still show it as a URL-only entry once their indexer knows that the file exists, and Yahoo might even craft a title for their entry in the SERPs based on any anchor text they might find that points to the URL, but that in itself would be better than the current situation, where the robots.txt file ranks for important keywords all on its own, and shows a snippet, due to incoming links that point at it.

Lord Majestic

7:13 pm on Jan 13, 2008 (gmt 0)

Quoting g1smd: The content of that file cannot appear in the SERPs if it has been disallowed.

If content was crawled BEFORE you disallowed it, then I am not sure anything in the robots.txt standard mandates that this content will be removed.

I think your problem is one of ranking, and to fix it you probably need to get other pages to rank better for the site: command - this should be fairly easy, as in this case the pages only compete among themselves.

[edited by: Lord_Majestic at 7:15 pm (utc) on Jan. 13, 2008]

jdMorgan

1:56 am on Jan 14, 2008 (gmt 0)

Hopefully, most robots separate the "Fetch robots.txt automatically to see what I can crawl" function from the "fetch a link if allowed by robots.txt and process that page for inclusion in my index (even when that link happens to be to a robots.txt file)" functions.

If so, then Disallow: /robots.txt should work to keep the contents of robots.txt out of the index. But as stated above, the major SEs now include "naked" URL-only listings for just about any link they find, whether or not they are allowed to fetch the page, and the option of including a <meta name="robots" content="noindex"> tag is not open to you, since robots.txt is not an HTML page.

However, you might try returning the new X-Robots-Tag header [webmasterworld.com], recently defined as a Google/Yahoo-supported HTTP server response header (and NOT Disallowing robots.txt in robots.txt, so that it can be fetched along with its X-Robots-Tag header).

On Apache, you could use the following in .htaccess:


<FilesMatch "^robots\.txt$">
Header set X-Robots-Tag "noindex"
</FilesMatch>

It has yet to be determined if any other SEs but Google and Yahoo will support this HTTP protocol extension. Microsoft seems to be ignoring both robots.txt and on-page meta-tags these days, so I wouldn't count on them handling it properly.
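To confirm that the server really sends the header, something like the following rough Python sketch could be used. The function names `x_robots_values` and `has_noindex` are made up for illustration, and the check is deliberately simple: it only looks for a plain "noindex" token and ignores any bot-specific prefixes a search engine might support.

```python
import urllib.request


def x_robots_values(url):
    """Send a HEAD request and return every X-Robots-Tag header value (may be empty)."""
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request) as response:
        return response.headers.get_all("X-Robots-Tag") or []


def has_noindex(values):
    """Return True if any header value contains a plain 'noindex' directive."""
    for value in values:
        # Directives are comma-separated; compare case-insensitively.
        directives = [token.strip().lower() for token in value.split(",")]
        if "noindex" in directives:
            return True
    return False


# Example (needs network access, so it is left commented out):
# print(has_noindex(x_robots_values("http://example.com/robots.txt")))
```

If this reports False for the robots.txt URL, the FilesMatch section is probably not being applied, or mod_headers is not enabled.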

Jim

Lord Majestic

2:03 am on Jan 14, 2008 (gmt 0)

Quoting jdMorgan: Hopefully, most robots separate the "Fetch robots.txt automatically to see what I can crawl" function from the "fetch a link if allowed by robots.txt and process that page for inclusion in my index (even when that link happens to be to a robots.txt file)" functions.

Wishful thinking. In most complex systems, crawling is separated from the actual index update. What you want is going to be one of the last problems that a search engine builder will concern themselves with - one has to be robots.txt compliant, but going much further is not as easy as you think (certainly for big indices with tens of billions of pages - there is nothing easy there at all).

It has to be said that, I think, Google does act as you suggest, and in that they set the gold standard of behavior.

jdMorgan

2:40 am on Jan 14, 2008 (gmt 0)

On the other hand, it seems to me that when a link is found to /robots.txt, treating it differently than any other URL would require extra coding. The same would be true for the /robots.txt URL-prefix when found in a Disallow directive; treating it differently than any other URL-path-prefix would take extra work.

To be clear, if /robots.txt is Disallowed in robots.txt, I would expect crawlers to fetch it and parse it in order to build the fetch/no-fetch map for the domain. But I would also expect robots.txt --when found as a link-- not to be fetched, because it is included in the no-fetch map of the domain.
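The two functions described above might look something like this in Python. This is only a sketch of the idea, not how any real crawler is known to work: `parse_disallows`, `build_no_fetch_map` and `may_fetch_link` are hypothetical names, only the "User-agent: *" group is read, and matching is plain prefix matching with no wildcard or Allow support.

```python
from urllib.parse import urlsplit


def parse_disallows(robots_body):
    """Collect Disallow path prefixes from the 'User-agent: *' group only."""
    prefixes = []
    in_star_group = False
    for raw in robots_body.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            in_star_group = (value == "*")
        elif field == "disallow" and in_star_group and value:
            prefixes.append(value)
    return prefixes


def build_no_fetch_map(fetch, host):
    """Function 1: always fetch /robots.txt itself, without consulting any rules."""
    return parse_disallows(fetch("http://" + host + "/robots.txt"))


def may_fetch_link(no_fetch_map, url):
    """Function 2: a discovered link is checked like any other URL, no special case."""
    path = urlsplit(url).path or "/"
    return not any(path.startswith(prefix) for prefix in no_fetch_map)


# A stand-in for the network, so the sketch runs offline.
fake_site = {
    "http://example.com/robots.txt":
        "User-agent: *\nDisallow: /robots.txt\nDisallow: /private/\n",
}
rules = build_no_fetch_map(fake_site.__getitem__, "example.com")
print(may_fetch_link(rules, "http://example.com/robots.txt"))  # False
print(may_fetch_link(rules, "http://example.com/index.html"))  # True
```

Note that nothing in `may_fetch_link` knows that /robots.txt is special; keeping a linked robots.txt fetchable despite the Disallow is what would need the extra case-handler.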

Do the crawlers you know about ( ;) ) include such extra case-handlers in the robots.txt-parsing and check-link-against-robots.txt functions?

I'd be inclined to try it and see what happens...

Jim

g1smd

12:01 am on Jan 15, 2008 (gmt 0)

I'll let you know what happens.

I have added the disallow on one site, and the other site might also get it added later.

Lord Majestic

12:18 am on Jan 15, 2008 (gmt 0)

robots.txt is a special case that can't be disallowed.

Most search engines would fetch robots.txt and then clean up their url lists; however, this is prone to issues arising from the time gap between the robots.txt fetch and the actual crawling of the cleaned urls. A better approach is to check robots.txt just prior to the actual crawl of a fairly large batch of urls.
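As a toy illustration of the "check just before crawling the batch" idea, here is a small Python sketch using the standard library's urllib.robotparser. The `filter_batch` function, the `fetch_robots_body` callback and the "ExampleBot" agent name are assumptions made up for this example; a real system at scale would look nothing like this.

```python
from collections import defaultdict
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser


def filter_batch(urls, fetch_robots_body, agent="ExampleBot"):
    """Re-read each host's robots.txt right before the crawl and drop disallowed urls."""
    by_host = defaultdict(list)
    for url in urls:
        by_host[urlsplit(url).netloc].append(url)

    allowed = []
    for host, host_urls in by_host.items():
        parser = RobotFileParser()
        # Parse a freshly fetched copy, so the rules are as current as the crawl.
        parser.parse(fetch_robots_body(host).splitlines())
        allowed.extend(u for u in host_urls if parser.can_fetch(agent, u))
    return allowed


# Stand-in for the network: robots.txt bodies keyed by host.
bodies = {
    "a.example": "User-agent: *\nDisallow: /private/\n",
    "b.example": "User-agent: *\nDisallow:\n",
}
batch = [
    "http://a.example/private/x",
    "http://a.example/public",
    "http://b.example/anything",
]
print(filter_batch(batch, bodies.__getitem__))
# ['http://a.example/public', 'http://b.example/anything']
```

The point is only that the rules are applied to the batch at crawl time rather than to a url list cleaned up some time earlier.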

The main thing here is that most people would assume that crawling/parsing/indexing of urls is somehow synced in one nice database so you can easily do things. Yes, this is what naive implementations do, but they never scale to hundreds of billions of urls. When you deal with that many, nothing is straightforward :(