some additional interesting nuggets in there...
generally 4XX errors are treated as "full allow" (i.e. essentially assuming no robots.txt is available)
generally 5XX errors are treated as "full disallow" (i.e. essentially assuming robots.txt may be temporarily unavailable)
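for illustration, a minimal python sketch of that status-code handling (the function name and the return convention are mine, not from google's docs):

```python
import urllib.error
import urllib.request

def robots_policy(robots_url):
    """rough sketch of the 4XX/5XX handling quoted above:
    4XX -> "full allow" (behave as if there is no robots.txt at all),
    5XX -> "full disallow" (assume robots.txt is only temporarily unreachable)."""
    try:
        with urllib.request.urlopen(robots_url) as resp:
            return "use-rules", resp.read().decode("utf-8", errors="replace")
    except urllib.error.HTTPError as err:
        if 400 <= err.code < 500:
            return "full-allow", ""     # crawl everything
        if err.code >= 500:
            return "full-disallow", ""  # crawl nothing for now
        raise
```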
Handling of logical redirects for the robots.txt file based on HTML content that returns 2xx (frames, JavaScript, or meta refresh-type redirects) is discouraged and the content of the first page is used for finding applicable rules.
i think this means the rules are taken from the originally requested robots.txt url (or wherever any 3XX redirects lead), but not from urls reached via browser-based techniques like meta refresh or javascript; the content of that first 2xx page is what counts.
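a quick sketch of how i read that (urllib follows the server-side 3XX hops on its own; the meta-refresh check is only there to show what gets ignored):

```python
import re
import urllib.request

def robots_body(robots_url):
    """fetch robots.txt, following normal HTTP 3XX redirects; any "logical"
    redirect inside the 2xx body (meta refresh, javascript, frames) is treated
    as plain content, and this first 2xx page is what gets scanned for rules."""
    with urllib.request.urlopen(robots_url) as resp:  # 3XX hops handled here
        body = resp.read().decode("utf-8", errors="replace")
    if re.search(r'http-equiv=["\']?refresh', body, re.IGNORECASE):
        # a browser would navigate away here; a robots.txt fetcher should not
        pass
    return body
```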
To temporarily suspend crawling, it is recommended to serve a 503 HTTP result code.
(i.e., "Service Unavailable")
sometimes we all need to be reminded of the obvious.
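if you wanted to do that at the app level rather than in the web server config, a throwaway sketch (the port and the retry-after value are made up):

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

class MaintenanceHandler(BaseHTTPRequestHandler):
    """answer every request with 503 so well-behaved crawlers back off
    temporarily instead of treating pages (or robots.txt) as gone."""
    def do_GET(self):
        self.send_response(503)                   # "Service Unavailable"
        self.send_header("Retry-After", "3600")   # optional hint: an hour
        self.send_header("Content-Type", "text/plain")
        self.end_headers()
        self.wfile.write(b"down for maintenance, please retry later\n")

if __name__ == "__main__":
    HTTPServer(("", 8080), MaintenanceHandler).serve_forever()
```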
The cached response may be shared by different crawlers
good to know in case you don't see one of the other google bots in your web server access logs.
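i'd picture the cache as being keyed by the robots.txt url rather than by which bot asked, something like this toy version (the 24-hour lifetime is my assumption, not a quote):

```python
import time

_CACHE = {}              # robots.txt url -> (fetched_at, body)
MAX_AGE = 24 * 60 * 60   # assumed lifetime, not from the spec

def cached_robots(robots_url, fetch):
    """toy shared cache: googlebot, googlebot-image, etc. would all reuse the
    same entry because the key is the url, not the crawler's identity."""
    hit = _CACHE.get(robots_url)
    if hit and time.time() - hit[0] < MAX_AGE:
        return hit[1]
    body = fetch(robots_url)
    _CACHE[robots_url] = (time.time(), body)
    return body
```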
If there's more than one group declared for a specific user agent, all the rules from the groups applicable to the specific user agent are combined into a single group.
not sure i knew that was the case before.
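in other words, something like this toy merge (definitely not google's actual parser, just the idea):

```python
def rules_for_agent(robots_txt, agent):
    """collect allow/disallow rules from every group whose header names
    `agent`, treating the scattered groups as one merged group."""
    merged, group_agents, in_header = [], set(), False
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()
        if not line or ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if not in_header:          # a new group header starts here
                group_agents, in_header = set(), True
            group_agents.add(value.lower())
        else:
            in_header = False
            if field in ("allow", "disallow") and agent.lower() in group_agents:
                merged.append((field, value))
    return merged

example = """\
User-agent: googlebot
Disallow: /private/

User-agent: googlebot
Allow: /private/annual-report.html
"""
print(rules_for_agent(example, "Googlebot"))
# [('disallow', '/private/'), ('allow', '/private/annual-report.html')]
```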
also good to know that the Google Robots.txt Parser and Matcher Library [github.com] is available on github.