Forum Moderators: open

Evolving and Cloaking Your Robots.txt into an Active Bot Gatekeeper


Brett_Tabke

11:58 am on Nov 5, 2025 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



I spent a very long time on this technical article. It grew out of a WebmasterWorld thread from about 20 years ago. Feedback welcome!
[searchengineworld.com...]

lucy24

4:46 pm on Nov 5, 2025 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Heh. Some years back, I replaced my robots.txt with a concealed robots.php. The original reasons were so I could (a) append my logheaders function (letting me know which holes to poke if I subsequently decided to admit a given robot) and (b) include a single list of Disallows for multiple sites.

Only later did I proceed to (c) serve different versions of robots.txt depending on the original request. But so far it’s only two: the complete robots.txt, which names names, and a minimalist one that disallows everyone. (If you, yourself, are not allowed in, you do not need to know who is allowed in, or what assumed name to try on.)
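The two-variant approach lucy24 describes can be sketched roughly as follows. This is a minimal illustrative sketch in Python, not her actual robots.php; the agent names and the robots.txt bodies are assumptions, and a real deployment would verify crawlers by reverse DNS rather than trusting the User-Agent string.

```python
# Sketch: serve a different robots.txt body depending on who is asking.
# The allowlist and both bodies below are illustrative assumptions.

FULL_ROBOTS = """User-agent: Googlebot
Disallow: /private/

User-agent: *
Disallow: /
"""

MINIMAL_ROBOTS = """User-agent: *
Disallow: /
"""

ALLOWED_AGENTS = ("googlebot", "bingbot")  # hypothetical allowlist


def robots_body(user_agent: str) -> str:
    """Return the complete robots.txt only to recognized crawlers.

    Everyone else gets the deny-all version, so an outsider who is not
    allowed in never learns which names are allowed in, or what assumed
    name to try on.
    """
    ua = user_agent.lower()
    if any(agent in ua for agent in ALLOWED_AGENTS):
        return FULL_ROBOTS
    return MINIMAL_ROBOTS
```

In practice this would sit behind a rewrite rule mapping /robots.txt to the script, and would also log request headers (her "logheaders" step) so you know which holes to poke later.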

SumGuy

3:01 am on Nov 8, 2025 (gmt 0)

5+ Year Member Top Contributors Of The Month



My robots file is wide open.

Anything I want to block, I don't bother with robots.txt, I just IP-block and be done with it.
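The "just IP-block and be done with it" approach boils down to a CIDR membership test. A small sketch using Python's standard ipaddress module; the ranges here are reserved documentation ranges, purely illustrative, and real blocklists would come from the networks' published allocations (whois/ASN data):

```python
import ipaddress

# Illustrative ranges only (RFC 5737 documentation blocks), standing in
# for real allocations such as a cloud provider's published CIDR list.
BLOCKED_NETS = [
    ipaddress.ip_network("203.0.113.0/24"),
    ipaddress.ip_network("198.51.100.0/24"),
]


def is_blocked(addr: str) -> bool:
    """True if addr falls inside any blocked CIDR range."""
    ip = ipaddress.ip_address(addr)
    return any(ip in net for net in BLOCKED_NETS)
```

Blocking whole provider ranges is blunt by design: it catches the bot wherever it crawls from, at the cost of also catching anything else hosted in the same range.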

tangor

4:36 am on Nov 8, 2025 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I use robots.txt as a whitelist filter. Anything that ignores it gets nuked. Meanwhile, EVERYONE gets to see robots.txt (in my case there can be only one).
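The whitelist-plus-nuke pattern can be sketched as a log check: the public robots.txt disallows everything for `*` and opens the site only to named agents, so any other bot that still crawls past robots.txt has ignored the Disallow and is a ban candidate. A hedged sketch; the agent names and log format are assumptions, and a real implementation would first separate bot traffic from human visitors (e.g. by reverse DNS), since robots.txt does not apply to browsers:

```python
# Sketch: flag robots.txt violators in (already bot-filtered) traffic.
# WHITELISTED and the log entries below are illustrative assumptions.

WHITELISTED = {"googlebot"}  # everyone else is disallowed by robots.txt


def is_violator(user_agent: str, path: str) -> bool:
    """A non-whitelisted agent requesting anything but /robots.txt
    is ignoring the Disallow and is a candidate for an IP ban."""
    if path == "/robots.txt":
        return False  # everyone may read robots.txt itself
    ua = user_agent.lower()
    return not any(name in ua for name in WHITELISTED)


# hypothetical bot log entries: (user_agent, path)
log = [
    ("Mozilla/5.0 (compatible; Googlebot/2.1)", "/page.html"),
    ("BadBot/1.0", "/robots.txt"),
    ("BadBot/1.0", "/page.html"),
]
violators = [ua for ua, path in log if is_violator(ua, path)]
```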

lucy24

4:42 pm on Nov 8, 2025 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



> Anything I want to block, I don't bother with robots.txt

I feel like I've said this before, many times, but the only thing better than a blocked request is a request that isn't made in the first place. Although rare, there do exist robots.txt-compliant robots that a given site might nevertheless not want. And even legitimate search engines may be disallowed from certain directories.

SumGuy

2:18 am on Nov 9, 2025 (gmt 0)

5+ Year Member Top Contributors Of The Month



A question: can you give an example of a search engine or bot that obeys robots.txt but that you block?

Brett_Tabke

3:05 am on Nov 9, 2025 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



@SumGuy: Perplexity, Baidu, Alibaba, archive.org.

But that's not what this is about for me. I only want to allow Googlebot and OpenAI; everyone else should be banned. Obviously this is a limited solution, since the really bad ones don't obey it anyway.

SumGuy

1:43 pm on Nov 9, 2025 (gmt 0)

5+ Year Member Top Contributors Of The Month



Yes, I was wondering about the China bots like Baidu and Alibaba and others that might come from Tencent. Since I IP-block large chunks of them anyway for various reasons, I end up capturing their bot IPs. I allow archive.org, though I really don't see it that often. Perplexity I think I've seen in the logs, but either it's very rare or I've IP-blocked it by accident.

Since I block large chunks of AWS, I think (or rather, know) that I've also blocked Amazonbot, which I don't care about.

> I only want to allow Googlebot and OpenAI. Everyone else should be banned.

Including Bing? What about Yandex? I allow Yandex (I don't IP-block it).

I also IP-block Facebook, and that includes their bot (again, not through robots.txt but by IP). I know there is some interaction between FB, FB users, and external sites, and that blocking FB IPs disrupts it, but I don't care.

lucy24

5:37 pm on Nov 9, 2025 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



> an example of a search engine or bot that does obey robots.txt that you block

One that comes to mind is PetalBot. I've never bothered to find out what exactly it is, other than that it's Chinese. I also currently Disallow Awario (both variants), simply because I don't see why it has to request the same file dozens of times every day. There are others.

And, again, this isn't about blocking. It's a robots.txt Disallow; compliant robots don't need to be blocked (although they would be) because they don't request anything but robots.txt. Facebook is both blocked and disallowed, since it seems to change its mind from one week to the next about whether it's going to be compliant.