Forum Moderators: goodroi


Disallowing a specific Bot using Robots.txt

         

lucy24

6:54 pm on May 15, 2018 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month




System: The following 4 messages were cut out of thread at: https://www.webmasterworld.com/search_engine_spiders/4896765.htm [webmasterworld.com] by keyplyr - 7:01 pm on May 15, 2018 (UTC -7)


:: bump ::

Since the only thing better than a blocked request is a request that is not made at all, I have been testing.

This does not work:
User-Agent: this means you
User-Agent: and you
User-Agent: Knowledge
Disallow: /
and neither does this (in a block by itself):
User-Agent: Knowledge
Disallow: /
but this does:
User-Agent: The Knowledge AI
Disallow: /
... because, apparently, there are so many robots with “Knowledge” in their names that they couldn’t possibly have guessed that I meant them. (If they had had a proper UA string with contact/www information and so on, would they have demanded that I match the whole thing to the letter?) There have been no page requests since about a week ago, when I made this final change.
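For contrast, a minimal sketch of the “liberal” matching recommended by the 1994 spec (case-insensitive substring match), next to the strict exact-name comparison this particular bot evidently performs; the function name is my own, not from any spec:

```python
def liberal_match(record_value: str, robot_name: str) -> bool:
    # Liberal reading of the 1994 recommendation: case-insensitive
    # substring match of the robots.txt value against the robot's name.
    return record_value.lower() in robot_name.lower()

# Under this reading, a bare "Knowledge" line ought to catch the bot:
print(liberal_match("Knowledge", "The Knowledge AI"))   # True
# ...but a strict bot only honours the exact name:
print("Knowledge" == "The Knowledge AI")                # False
```

A spec-compliant robot would have obeyed the shorter line; this one apparently only compares the full name.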

keyplyr

8:17 pm on May 15, 2018 (gmt 0)

but this does:
Of course you need to use the exact name. It's always been that way... too many bots share common name fragments.


[fix typo]

[edited by: keyplyr at 1:59 am (utc) on May 16, 2018]

lucy24

1:11 am on May 16, 2018 (gmt 0)

Of course you need to use the exact name
Googlebot. Bingbot. Yandex. et cetera.

TorontoBoy

1:25 am on May 16, 2018 (gmt 0)

5+ Year Member Top Contributors Of The Month



User-agent: Knowledge AI
Disallow: /

This does not work. It just keeps coming. I'll add "The".

phranque

5:00 am on May 16, 2018 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



Of course you need to use the exact name. Always been that way...

this is how it was from the beginning - "A Standard for Robot Exclusion" (1994):
User-agent
The value of this field is the name of the robot the record is describing access policy for.
...
The robot should be liberal in interpreting this field. A case insensitive substring match of the name without version information is recommended.

source: [robotstxt.org...]

the ietf internet draft "A Method for Web Robots Control" (1997) refers to a "name token" which is a subset of the user agent string:
3.2.1 The User-agent line

Name tokens are used to allow robots to identify themselves via a simple product token. Name tokens should be short and to the point. The name token a robot chooses for itself should be sent as part of the HTTP User-agent header, and must be well documented.
These name tokens are used in User-agent lines in /robots.txt to identify to which specific robots the record applies. The robot must obey the first record in /robots.txt that contains a User-Agent line whose value contains the name token of the robot as a substring.

source: [robotstxt.org...]

according to RFC7231 "Hypertext Transfer Protocol (HTTP/1.1): Semantics and Content":
5.5.3. User-Agent
...
User-Agent = product *( RWS ( product / comment ) )

The User-Agent field-value consists of one or more product identifiers, each followed by zero or more comments (Section 3.2 of [RFC7230]), which together identify the user agent software and its significant subproducts. By convention, the product identifiers are listed in decreasing order of their significance for identifying the user agent software. Each product identifier consists of a name and optional version.

product = token ["/" product-version]
product-version = token
...


i take this all to mean the following should be specified as the name token in the User-agent line:
- everything up to the first forward slash or open paren in the UA string
- the entire UA string otherwise
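That reading can be sketched in a few lines; a hypothetical helper (my own naming), assuming phranque's rule of cutting at the first forward slash or open paren:

```python
def name_token(ua: str) -> str:
    # Everything up to the first "/" or "(" in the UA string;
    # otherwise the entire UA string.
    for i, ch in enumerate(ua):
        if ch in "/(":
            return ua[:i].strip()
    return ua.strip()

print(name_token("Googlebot/2.1 (+http://www.google.com/bot.html)"))  # Googlebot
print(name_token("The Knowledge AI"))                                 # The Knowledge AI
```

Which is consistent with the thread: Googlebot's token is the product name before the version, while a bare name like "The Knowledge AI" yields the whole string.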

phranque

5:06 am on May 16, 2018 (gmt 0)

note that google clearly documents the name tokens for their various crawlers and uses dash-separated words:
Google crawlers [support.google.com]

this would be considered an extension of the early specifications:
If you want to block or allow all of Google's crawlers from accessing some of your content, you can do this by specifying Googlebot as the user-agent.

keyplyr

7:08 am on May 16, 2018 (gmt 0)

Sadly, the proposed "standard" of 1994 never achieved the support intended. A lot of interpretation has come and gone over the last 24 years. Only a few of the major search engines support robots.txt, and they do it very differently.

The Rise & Fall of Robots.txt [webmasterworld.com]

phranque

7:44 am on May 16, 2018 (gmt 0)

it was almost 10 years ago when the "big 3" search engines (at the time) had an agreement to support a common set of directives:
Yahoo!, Google, Microsoft Clarify Robots.txt Support [searchengineland.com]

lucy24

6:54 pm on May 16, 2018 (gmt 0)

everything up to the first forward slash
By this rule, ¾ of the world’s robots are now called Mozilla.

keyplyr

7:21 pm on May 16, 2018 (gmt 0)

it was almost 10 years ago when the "big 3" search engines (at the time) had an agreement to support a common set of directives
Then, as stated above and at the link I posted, they went their separate ways. Beyond a few commonly supported directives, those that support robots.txt do it differently. I work with this every day.

Differences in Top 3 Search Engines regarding Robots.txt
• Google supports wildcard (*) in URLs, Bing does not.
• Yandex supports cross domain, Google & Bing do not.
• Google supports different crawl delay parameters than Bing, and Yandex doesn't support it at all.
• Yandex will fault the robots.txt if mirror preference is not included, even if there is no mirror.
• Disallowing an image file will prompt Google & Yandex to remove it from Image Search, Bing will ignore it completely.

There are a few more discrepancies in how these SEs interpret robots.txt.
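As a hedged illustration of the first bullet, a path wildcard like the one below would be honoured by Google but, per the list above, not by Bing (the path is invented for the example; actual support may have changed since):

User-agent: Googlebot
Disallow: /*?sessionid=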