Forum Moderators: goodroi


Robots.txt Protocol Directive Order


WebOpz

8:46 pm on Jun 14, 2021 (gmt 0)

5+ Year Member Top Contributors Of The Month



Is there a required order for directives in a robots.txt file? Do I have to list Disallow before Allow directives?

A robot author is claiming that his bot can ignore the exclusions if the first line is an Allow and the Disallows are listed after it. Am I missing something?

lucy24

9:37 pm on Jun 14, 2021 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



As currently written, the robots.txt standard--which is very, very old--includes one and only one directive:
Disallow:
Everything else is an optional extra that a given robot may or may not understand.

Some robots will take this as an excuse to interpret rules in ways that nobody could possibly have intended. (At this point I was going to rattle off a list of examples, but decided not to, because why put ideas into robots' heads.)

That being said: If you quote the relevant part of your robots.txt, we might be able to take a guess at whether it's a reasonable misunderstanding, or the robot is being willfully obtuse. And if it's the latter, you'll know what to do. Heh, heh.

phranque

11:38 pm on Jun 14, 2021 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



it depends...

according to the way google determines the "Order of precedence for group-member lines" in their Robots.txt Specifications:
At a group-member level, in particular for allow and disallow directives, the most specific rule based on the length of the [path] entry trumps the less specific (shorter) rule. In case of conflicting rules, including those with wildcards, the least restrictive rule is used.

source: https://developers.google.com/search/docs/advanced/robots/robots_txt#order-of-precedence-for-group-member-lines
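
for illustration, here's a rough python sketch of that longest-match rule - my own simplification, not google's actual parser (the google_allowed helper is a name i made up, and it ignores wildcards, $ anchors, and percent-encoding):

# simplified model of google's precedence: the longest matching rule path
# wins, and on a tie the allow rule is used (no wildcard/$ handling here)
def google_allowed(rules, path):
    # rules is a list of ('allow' | 'disallow', rule_path) tuples
    best = None  # (path length, is_allow) of the most specific match so far
    for kind, rule_path in rules:
        if path.startswith(rule_path):  # plain octet-prefix match
            candidate = (len(rule_path), kind == 'allow')
            if best is None or candidate > best:  # ties sort allow above disallow
                best = candidate
    return True if best is None else best[1]  # no matching rule -> allowed

rules = [('disallow', '/wp-includes/'), ('allow', '/wp-includes/js/')]
print(google_allowed(rules, '/wp-includes/js/app.js'))  # True - allow path is longer
print(google_allowed(rules, '/wp-includes/other.php'))  # False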

more specifically, according to the latest (June 05, 2021) Internet Draft of the specification (which is still largely based on Martijn Koster's 1996 protocol):
To evaluate if access to a URI is allowed, a robot MUST match the paths in allow and disallow rules against the URI. The matching SHOULD be case sensitive. The most specific match found MUST be used. The most specific match is the match that has the most octets. If an allow and disallow rule is equivalent, the allow SHOULD be used. If no match is found amongst the rules in a group for a matching user-agent, or there are no rules in the group, the URI is allowed. The /robots.txt URI is implicitly allowed.

Octets in the URI and robots.txt paths outside the range of the US-ASCII coded character set, and those in the reserved range defined by RFC3986, MUST be percent-encoded as defined by RFC3986 prior to comparison.

If a percent-encoded US-ASCII octet is encountered in the URI, it MUST be unencoded prior to comparison, unless it is a reserved character in the URI as defined by RFC3986 or the character is outside the unreserved character range. The match evaluates positively if and only if the end of the path from the rule is reached before a difference in octets is encountered.

source: https://datatracker.ietf.org/doc/html/draft-koster-rep#section-2.2.2
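
a couple of worked examples of that tie-break, using the google_allowed sketch above:

# equivalent rules (5 octets each): the allow SHOULD be used
print(google_allowed([('allow', '/page'), ('disallow', '/page')], '/page'))     # True
# the disallow path is one octet longer, so it is more specific and wins
print(google_allowed([('allow', '/page'), ('disallow', '/page/')], '/page/x'))  # False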

if the robot author is really old school, he may want to refer to Koster's original Internet-Draft document A Method for Web Robots Control [robotstxt.org]
To evaluate if access to a URL is allowed, a robot must attempt to match the paths in Allow and Disallow lines against the URL, in the order they occur in the record. The first match found is used. If no match is found, the default assumption is that the URL is allowed.

The /robots.txt URL is always allowed, and must not appear in the Allow/Disallow rules.

The matching process compares every octet in the path portion of the URL and the path from the record. If a %xx encoded octet is encountered it is unencoded prior to comparison, unless it is the "/" character, which has special meaning in a path. The match evaluates positively if and only if the end of the path from the record is reached before a difference in octets is encountered.
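
for what it's worth, python's standard urllib.robotparser still appears to follow that original first-match-in-order rule (assuming i'm reading the stdlib behavior correctly), so with that parser the line order really does matter:

from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.parse(["User-agent: *",
          "Disallow: /wp-includes/",
          "Allow: /wp-includes/js/"])
# the first matching line wins, so the broader Disallow shadows the later
# Allow - a longest-match parser like google's would allow this URL instead
print(rp.can_fetch("*", "https://example.com/wp-includes/js/app.js"))  # False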

[edited by: phranque at 2:35 am (utc) on Jun 15, 2021]

not2easy

2:14 am on Jun 15, 2021 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



As far as I know, Googlebot is the only one that reliably uses the Allow syntax within a Disallow, and then the Allow must follow the Disallow. This is useful when you want to disallow the contents of an entire folder/directory, but Google then complains that it can't render a page because a .js file it needs is in that disallowed directory.

On a WordPress site, for example, you might have lines like:
User-agent: *
Disallow: /wp-includes/
Allow: /wp-includes/js/
Allow: /wp-includes/css/
Allow: /*.css
Allow: /*.js

But that does not mean that all robots will comply with your wishes. Google's practices are outlined here:
https://developers.google.com/search/docs/advanced/robots/robots_txt

phranque

2:34 am on Jun 15, 2021 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



Googlebot is the only one that reliably uses the Allow syntax within a Disallow, and then the Allow must follow the Disallow

not really.
as i quoted above (from the same document you referenced):
At a group-member level, in particular for allow and disallow directives, the most specific rule based on the length of the [path] entry trumps the less specific (shorter) rule. In case of conflicting rules, including those with wildcards, the least restrictive rule is used.
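
to make that concrete with the google_allowed sketch from my earlier post - swapping the two lines changes nothing under longest-match precedence:

# same result regardless of line order under longest-match precedence
for rules in ([('disallow', '/wp-includes/'), ('allow', '/wp-includes/js/')],
              [('allow', '/wp-includes/js/'), ('disallow', '/wp-includes/')]):
    print(google_allowed(rules, '/wp-includes/js/app.js'))  # True both times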