
Forum Moderators: goodroi

robots.txt: prevent new bots en masse

configuring robots.txt to block all bots but the ones you want

     
7:32 pm on Nov 26, 2017 (gmt 0)

New User from RU 

joined:Feb 17, 2016
posts:14
votes: 0


I have a new site and am using a robots.txt file of:


User-agent: Googlebot
Allow: /

User-agent: bingbot
Allow: /

User-agent: Twitterbot
Allow: /

User-agent: YandexBot
Allow: /

User-agent: *
Disallow: /


I did this because I was tired of adding to the dozens of specific bot disallows I had. (My list of disallows was growing too large to manage.)

It seems to work, i.e. allowed bots match their own entry and ignore the catch-all disallow at the end.

As I monitor the logs and see new bots, I visit their website and see if I'd be OK with what they do. If so, I add them to the allowed list.

This cuts down on a lot of work, but I'm interested in other people's views on this technique.

Thank You
8:44 pm on Nov 26, 2017 (gmt 0)

Moderator from US 

WebmasterWorld Administrator keyplyr is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Sept 26, 2001
posts:11474
votes: 692


This is a basic form of "whitelisting" and works very well for *simple* configurations. Just know that only bots that support robots.txt will follow it.

However, as you develop more explicit needs for controlling who/what has access to your files, you may well find a more comprehensive approach is needed.

Blocking Methods [webmasterworld.com]
9:18 pm on Nov 26, 2017 (gmt 0)

New User from RU 

joined:Feb 17, 2016
posts:14
votes: 0


Yes, of course. That is just about whitelisting specific bots one wants.

As most here should be aware, many bots do not adhere to robots.txt, but for those that do, my point was - if people perhaps agree - there may be no need to have dozens of specific bot disallow entries in one's robots.txt file.

If most people are aware of this, then I apologize for a dumb post.

Thank you for your reply.
9:25 pm on Nov 26, 2017 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member lucy24 is a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month

joined:Apr 9, 2011
posts:14709
votes: 613


Be aware that not all robots recognize the "Allow" directive, although Google does. To play it safe, replace
Allow: /
with
Disallow:
(i.e. nothing)
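
For illustration, here's the whitelist from the opening post rewritten that way - same four bots, only the allow directive changes. An empty Disallow means "nothing is disallowed" and is part of the original robots.txt standard, so every conforming bot understands it:

User-agent: Googlebot
Disallow:

User-agent: bingbot
Disallow:

User-agent: Twitterbot
Disallow:

User-agent: YandexBot
Disallow:

User-agent: *
Disallow: /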

Edit: It's very common to use various kinds of whitelisting in actual access controls--the rules that result in a 403. But I don't remember seeing it in robots.txt, and personally I think it's kind of ingenious.
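
For what it's worth, a minimal sketch of that kind of access-control whitelisting, assuming Apache with mod_rewrite (the UA patterns are illustrative only, not a vetted list):

RewriteEngine On
# 403 anything that self-identifies as a bot, crawler or spider...
RewriteCond %{HTTP_USER_AGENT} (bot|crawler|spider) [NC]
# ...unless it matches one of the whitelisted crawlers
RewriteCond %{HTTP_USER_AGENT} !(Googlebot|bingbot|Twitterbot|YandexBot) [NC]
RewriteRule .* - [F]

Ordinary browsers never match the first condition, so human visitors are unaffected; only self-identified robots outside the whitelist get the 403.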
5:10 am on Nov 27, 2017 (gmt 0)

Moderator from US 

WebmasterWorld Administrator keyplyr is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Sept 26, 2001
posts:11474
votes: 692


there may be no need to have dozens of specific bot disallow entries in one's robots.txt file
There's also something else to consider... if you block the other bots by default, you may not get their actual bot w/ the complete UA to research.

Some companies fetch text files (robots.txt) with a GET that doesn't carry the complete UA, so if they're blocked by the catch-all, the actual bot may never visit the site. There aren't too many of these, but there are a dozen or so, and depending on your business interests, you may want to allow them.
6:46 am on Nov 27, 2017 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member lucy24 is a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month

joined:Apr 9, 2011
posts:14709
votes: 613


if you block the other bots by default, you may not get their actual bot w/ the complete UA to research
Oh, that's a point. Another is that requests for robots.txt may not contain exactly the same headers as requests for “real” files, so you end up poking the wrong holes, or more holes than is strictly necessary. It probably isn't specific to robots.txt; I suspect you'd see the same variation in requests for any .txt file. As with the UA*, this is rare but it does happen.


* I know one otherwise well-behaved robot that uses its real name for page requests but a bare “robots” for robots.txt. Fortunately it's got a reliable IP.