
Forum Moderators: goodroi

robots.txt: prevent new bots en masse

configuring robots.txt to block all bots but the ones you want

     
7:32 pm on Nov 26, 2017 (gmt 0)

New User from RU 

joined:Feb 17, 2016
posts:14
votes: 0


I have a new site and am using a robots.txt file of:


User-agent: Googlebot
Allow: /

User-agent: bingbot
Allow: /

User-agent: Twitterbot
Allow: /

User-agent: YandexBot
Allow: /

User-agent: *
Disallow: /


I did this because I was tired of adding to the dozens of specific bot disallows I had. (My list of disallows was growing too large to manage.)

It seems to work, i.e. allowed bots match their own entry and ignore the catch-all disallow at the end.

As I monitor the logs and see new bots, I visit their website and see if I'd be OK with what they do. If so, I add them to the allowed list.

This cuts down on a lot of work, but I'm interested in other people's views on this technique.

Thank You
8:44 pm on Nov 26, 2017 (gmt 0)

Moderator from US 

WebmasterWorld Administrator keyplyr is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Sept 26, 2001
posts:11474
votes: 692


This is a basic form of "whitelisting" and works very well for *simple* configurations. Just know that only bots that support robots.txt will follow it.

However, as you develop more explicit needs for controlling who/what has access to your files, you may well find a more comprehensive approach is needed.

Blocking Methods [webmasterworld.com]
9:18 pm on Nov 26, 2017 (gmt 0)

New User from RU 

joined:Feb 17, 2016
posts:14
votes: 0


Yes, of course. That is just about whitelisting specific bots one wants.

As most here should be aware, many bots do not adhere to robots.txt, but for those that do, my point was - if people perhaps agree - there may be no need to have dozens of specific bot disallow entries in one's robots.txt file.

If most people are aware of this, then I apologize for a dumb post.

Thank you for your reply.
9:25 pm on Nov 26, 2017 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member lucy24 is a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month

joined:Apr 9, 2011
posts:14709
votes: 613


Be aware that not all robots recognize the "Allow" directive, although Google does. To play it safe, replace
Allow: /
with
Disallow:
(i.e. nothing)
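
For illustration, here's the whitelist from the opening post rewritten that way - same four bots, only the allow directive changes. An empty Disallow means "nothing is disallowed" and is part of the original robots.txt standard, so every conforming bot understands it:

User-agent: Googlebot
Disallow:

User-agent: bingbot
Disallow:

User-agent: Twitterbot
Disallow:

User-agent: YandexBot
Disallow:

User-agent: *
Disallow: /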

Edit: It's very common to use various kinds of whitelisting in actual access controls--the rules that result in a 403. But I don't remember seeing it in robots.txt, and personally I think it's kind of ingenious.
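
For what it's worth, a minimal sketch of that kind of access-control whitelisting, assuming Apache with mod_rewrite (the UA patterns are illustrative only, not a vetted list):

RewriteEngine On
# 403 anything that self-identifies as a bot, crawler or spider...
RewriteCond %{HTTP_USER_AGENT} (bot|crawler|spider) [NC]
# ...unless it matches one of the whitelisted crawlers
RewriteCond %{HTTP_USER_AGENT} !(Googlebot|bingbot|Twitterbot|YandexBot) [NC]
RewriteRule .* - [F]

Ordinary browsers never match the first condition, so human visitors are unaffected; only self-identified robots outside the whitelist get the 403.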
5:10 am on Nov 27, 2017 (gmt 0)

Moderator from US 

WebmasterWorld Administrator keyplyr is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Sept 26, 2001
posts:11474
votes: 692


there may be no need to have dozens of specific bot disallow entries in one's robots.txt file
There's also something else to consider... if you block the other bots by default, you may not get their actual bot w/ the complete UA to research.

Some companies fetch text files (robots.txt) with a GET that doesn't carry the complete UA, so if they're blocked by the catch-all, the actual bot may never visit the site. There aren't too many of these, but there are a dozen or so, and depending on your business interests, you may want to allow them.
6:46 am on Nov 27, 2017 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member lucy24 is a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month

joined:Apr 9, 2011
posts:14709
votes: 613


if you block the other bots by default, you may not get their actual bot w/ the complete UA to research
Oh, that's a point. Another is that requests for robots.txt may not contain exactly the same headers as requests for “real” files, so you end up poking the wrong holes, or more holes than is strictly necessary. It probably isn't specific to robots.txt; I suspect you'd see the same variation in requests for any .txt file. As with the UA*, this is rare but it does happen.


* I know one otherwise well-behaved robot that uses its real name for page requests but a bare “robots” for robots.txt. Fortunately it's got a reliable IP.