Fixing my htaccess file

Understanding regex and keeping bots at bay

         

Martin Potter

8:43 pm on Mar 8, 2024 (gmt 0)

5+ Year Member Top Contributors Of The Month



I noticed today that another bot has slipped through my htaccess file and am wondering what is wrong with my script. Must admit that I have never really understood all the rules and syntax of regex. Maybe someone can recognize my probably-common error.

I am trying to keep claudebot/ClaudeBot out of my site and have the following script in .htaccess:

<IfModule mod_rewrite.c>
RewriteCond %[HTTP_USER_AGENT] ClaudeBot [NC,OR]
... (others) ...
RewriteRule .* [R=404,L]
</IfModule>


This does not seem to be working, although I am sure that it did at one time. Maybe there is a simple typo, or maybe I should be using a completely different format and rule. Can someone tell me what I have done wrong and how to fix it?

Many thanks!

not2easy

9:43 pm on Mar 8, 2024 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



I skip the header line, "<IfModule mod_rewrite.c>", because it prompts some people to point out that "if it isn't on, that doesn't turn it on" or something similar.

My first line is formatted like
RewriteCond %{HTTP_USER_AGENT} (A6|Abadbot|AnotherBot|appid|Blog|nextone) [NC,OR]
and the next line continues like that with additional bot snippets until I have added them all. The last RewriteCond line does not get
[NC,OR]
just
[NC]

The rule that follows the conditions is like
RewriteRule .* - [F]
to kick them out with a Forbidden (403), instead of a 404 (not found) that keeps them trying.
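
Put together, the layout described above might look like the sketch below. The bot names are the placeholders from the example line plus ClaudeBot from the original question. The RewriteEngine On line is an assumption: it has to appear somewhere before the rules take effect and may already be in your file.

```apache
RewriteEngine On

# Every condition except the last one carries OR
RewriteCond %{HTTP_USER_AGENT} (A6|Abadbot|AnotherBot|appid|Blog|nextone) [NC,OR]
# Last condition: no OR, so the list of alternatives ends here
RewriteCond %{HTTP_USER_AGENT} ClaudeBot [NC]
# "-" means no substitution; [F] answers with 403 Forbidden
RewriteRule .* - [F]
```

If the final condition kept its OR, the list of alternatives would not be closed off properly, so the rule would not behave as intended.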

A custom 403 page can be added so that, if a 403 recipient has a complaint, they can contact you about it.

Keep in mind that it helps others to know which version of Apache they are offering suggestions for.

lucy24

11:08 pm on Mar 8, 2024 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



RewriteRule .* [R=404,L]
Is that really what the rule says, or did you miss something in cut-and-paste? It should be
RewriteRule . - [R=404]

With a 400-class (also 500-class) response, you don't need to use the [L] flag. It does no harm, but isn't necessary.

I don’t recommend using [NC] at all except in special circumstances, because it’s extra work for the server, flattening both the UA and the string-to-match (in this case, the RewriteCond) on every request. Instead, say either
ClaudeBot|claudebot
OR
[Cc]laude[Bb]ot

The <IfModule> envelope is also not necessary. If you didn't have mod_rewrite, you'd know.

It is entirely a matter of personal choice whether you want to return a 403 (Forbidden, slam the door in their face) or a manual 404 (no such file, except that the server doesn't even have to go look for it). Sometimes a 404 is satisfying because then you haven't given the robot any information.

It should be noted that mod_rewrite is not really the most efficient way to do basic access control; it's a fairly server-intensive module. But if you're comfortable with it, you may not want to change.
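
As one possible illustration of access control without mod_rewrite, the sketch below uses mod_setenvif and mod_authz_core. It assumes Apache 2.4 and a host that allows these directives in .htaccess; the variable name bad_bot is made up for the example. Note that this approach answers with a 403, not a 404.

```apache
# Flag matching user agents (BrowserMatch is case-sensitive;
# BrowserMatchNoCase is the case-insensitive variant)
BrowserMatch "ClaudeBot|claudebot" bad_bot

# Allow everyone except requests carrying the flag
<RequireAll>
    Require all granted
    Require not env bad_bot
</RequireAll>

# Keep robots.txt readable by everyone, flagged or not
<Files "robots.txt">
    Require all granted
</Files>
```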

Finally: I do hope you have an exemption at the very beginning of your RewriteRules saying something like
RewriteRule robots\.txt - [L]
because everyone should be allowed to see robots.txt. Especially when, like ClaudeBot/claudebot, it appears to be compliant. Sure, it asks for robots.txt dozens of times a day--but nothing else.
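
Taken together, the suggestions in this post could be sketched as follows. This is only an outline for a single bot, and the RewriteEngine On line is an assumption (it may already be present in your file).

```apache
RewriteEngine On

# Exemption first: anyone may fetch robots.txt, and [L] stops
# further rule processing for that request
RewriteRule robots\.txt - [L]

# Case is spelled out in the pattern, so [NC] is not needed
RewriteCond %{HTTP_USER_AGENT} [Cc]laude[Bb]ot
# "-" means no substitution; R=404 returns Not Found without [L]
RewriteRule . - [R=404]
```

Further bots would go in additional RewriteCond lines, each ending in [OR] except the last.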

Martin Potter

8:41 pm on Mar 9, 2024 (gmt 0)

5+ Year Member Top Contributors Of The Month



Thank you, @not2easy and @lucy24, for all those ideas and suggestions. Have to admit that Regex has always been a bit of a mystery to me, and now at my age it is hard to remember from one time to the next what I have learned about it.

Sorry, yes, the hosting company's server uses Apache v2.4.58. I suppose that is the latest version, or close to it.

I think both of you put your fingers on my critical error -- the missing space and "-" in my RewriteRule statement. I will fix that! Thank you for that! I imagine that it was there originally and just don't know how it got lost.

not2easy, thanks for the suggestion about leaving out the IfModule turn-on command. That will save a little space. Thanks also for pointing out the all-important lack of an OR at the end of the last RewriteCond statement. And, yes, I intentionally return a "404" error to those visitors. In fact, my custom "403" error message page actually delivers a "404 - File not found" text message, so that they don't know that I am "on to them", if you know what I mean. Sort of in line with Lucy's thought.

Lucy, thanks for the suggestion about leaving out "NC". I always try to capture the capitalization, or lack of it, used in the visitor's UA, so leaving out NC should not change anything. More importantly, I had completely forgotten about the impact on robots.txt, so will be sure to include your suggested exemption.

Thank you, again, to both of you. I'm sure the problem will be solved after I make these changes to htaccess. Just hope I don't introduce any new errors while making these repairs. ;-)