Forum Moderators: goodroi

Message Too Old, No Replies

Updates to Google's Work on Advanced Robots.txt

         

engine

4:08 pm on Apr 28, 2021 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



Google is working on updates to its advanced robots.txt documentation and, according to a recent tweet, "The latest additions include explicit details on the files for IDNs, IP-addresses, and port numbered hostnames."
[twitter.com...]

[developers.google.com...]

lammert

4:38 pm on Apr 28, 2021 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



That developer document is an interesting read.

  • Google follows at least five redirect hops, as defined by RFC 1945 (HTTP/1.0), and then stops and treats the robots.txt as a 404.
  • All 4xx errors are treated the same way and it's assumed that no valid robots.txt file exists.
  • A 503 (Service Unavailable) error results in fairly frequent retrying. If the robots.txt is unreachable for more than 30 days, the last cached copy of the robots.txt is used.
Are these rules only for robots.txt, or does Google apply them to crawling in general? I saw that 30-day period mentioned in a Google patent some 10 years ago, but I can't remember the exact context.
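
If I'm reading the document correctly, the status handling boils down to something like the sketch below. This is just my own interpretation written out as Python pseudocode, not anything from Google's actual crawler; the function and parameter names are made up.

from typing import Optional

MAX_REDIRECT_HOPS = 5      # per RFC 1945, as referenced in the document
CACHE_FALLBACK_DAYS = 30   # the 30-day window mentioned above

def robots_txt_decision(status: int, redirect_hops: int,
                        days_unreachable: int,
                        cached_copy: Optional[str]) -> str:
    # 3xx responses are followed first; this sketch sees only the final status.
    if redirect_hops > MAX_REDIRECT_HOPS:
        status = 404  # too many hops is treated like a 404
    if 200 <= status < 300:
        return "parse the fetched robots.txt"
    if 400 <= status < 500:
        return "full allow (assume no valid robots.txt exists)"
    if status >= 500:
        if days_unreachable <= CACHE_FALLBACK_DAYS:
            return "full disallow (temporarily unavailable, retry fairly often)"
        return "use the last cached copy" if cached_copy else "full allow"
    return "full allow"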

phranque

2:03 am on May 1, 2021 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



some additional interesting nuggets in there...

generally 4XX errors are treated as "full allow" (i.e. essentially assuming no robots.txt is available)

generally 5XX errors are treated as "full disallow" (i.e. essentially assuming robots.txt may be temporarily unavailable)

Handling of logical redirects for the robots.txt file based on HTML content that returns 2xx (frames, JavaScript, or meta refresh-type redirects) is discouraged and the content of the first page is used for finding applicable rules.

i think this means that the robots exclusion protocol is only applied to the originally requested url (and the urls of any subsequent 3XX redirects), but not to any urls that are redirected to by browser-based techniques.
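
as a hypothetical example of the discouraged setup (example.com and the file names are placeholders, not from the doc): a request for https://example.com/robots.txt returns 200 with an html body like

<html>
  <head>
    <meta http-equiv="refresh" content="0; url=https://example.com/actual-robots.txt">
  </head>
</html>

a browser would hop to /actual-robots.txt, but as i read the doc, google looks for rules in this html body itself (and finds none) rather than following the meta refresh.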

To temporarily suspend crawling, it is recommended to serve a 503 HTTP result code.

(i.e., "Service Unavailable")
sometimes we all need to be reminded of the obvious.
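
a minimal sketch of what that could look like, assuming python's standard http.server (the port and the Retry-After value are just illustrations):

from http.server import BaseHTTPRequestHandler, HTTPServer

class TemporarilyClosed(BaseHTTPRequestHandler):
    # answer every request, including /robots.txt, with 503 Service Unavailable
    def do_GET(self):
        self.send_response(503)
        self.send_header("Retry-After", "3600")  # hint: try again in an hour
        self.send_header("Content-Type", "text/plain")
        self.end_headers()
        self.wfile.write(b"temporarily unavailable\n")

if __name__ == "__main__":
    HTTPServer(("", 8080), TemporarilyClosed).serve_forever()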

The cached response may be shared by different crawlers

good to know in case you don't see one of the other google bots in your web server access logs.

If there's more than one group declared for a specific user agent, all the rules from the groups applicable to the specific user agent are combined into a single group.

not sure i knew that was the case before.
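
for example (made-up rules, not from the doc), if a robots.txt contains two groups for the same user agent:

User-agent: googlebot
Disallow: /private/

User-agent: googlebot
Disallow: /archive/

then, as i understand the doc, google treats it the same as one combined group:

User-agent: googlebot
Disallow: /private/
Disallow: /archive/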

also good to know that the Google Robots.txt Parser and Matcher Library [github.com] is available on github.

JorgeV

3:28 pm on May 2, 2021 (gmt 0)

WebmasterWorld Senior Member 5+ Year Member Top Contributors Of The Month



Hello,

Do we need an "Advanced Robots.txt"? To me it looks like adding complexity where there is no need.

phranque

8:28 pm on May 2, 2021 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



the current protocol is less complex than what was being supported 5 years ago, while providing more clarity and transparency.