Allow Robots.txt and Deny all Others?

Brett_Tabke

1:55 pm on Jul 22, 2022 (gmt 0)

I'd like to allow an IP to grab robots.txt but still deny it everything else. (For example, I want to ban badbot.org by IP, but still allow it to read robots.txt so that it knows it is fully blocked.)

This does not appear to work:

RewriteCond %{REQUEST_FILENAME} ^robots\.txt$
RewriteRule ^(.*)$ - [END]
deny from 118.193.41.43


The deny works, but the robots.txt exemption does not.

Brett_Tabke

2:07 pm on Jul 22, 2022 (gmt 0)

I guess I was trying to overcomplicate it?

This seems to work:

RewriteRule ^robots\.txt - [L]
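
For reference, a sketch of where such a rule sits if the ban itself is also done in mod_rewrite (the IP is the example from the first post; a deny from an access-control module is a separate matter, as discussed below). Incidentally, one likely reason the first attempt never matched: in htaccess, %{REQUEST_FILENAME} expands to the full filesystem path, so ^robots\.txt$ can't match it.

RewriteEngine On

# Exempt robots.txt from all rewrite-based blocking below
RewriteRule ^robots\.txt$ - [L]

# Rewrite-based ban for the example IP; never reached for robots.txt
RewriteCond %{REMOTE_ADDR} ^118\.193\.41\.43$
RewriteRule ^ - [F]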

not2easy

3:35 pm on Jul 22, 2022 (gmt 0)

With the rule in the first post, the allow needs an Order directive so Apache knows whether to evaluate Allow or Deny first.

Something like
Order Deny,Allow
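
A full 2.2-style sketch of that idea (untested; on Apache 2.4 it needs mod_access_compat loaded):

Order Deny,Allow
Deny from 118.193.41.43
<Files robots.txt>
Allow from all
</Files>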

Brett_Tabke

4:03 pm on Jul 22, 2022 (gmt 0)

I haven't used Order/Deny since Apache 1.1.
The deny isn't really related to the RewriteRule - I just put it in there to show the precedence of the [END] statement: that the deny comes after the rewrite rule.

Using the [END] flag terminates not only the current round of rewrite processing (like [L]) but also prevents any subsequent rewrite processing from occurring in per-directory (htaccess) context.
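
A sketch of the difference (the two lines are alternatives, not meant to be combined):

# [L] ends this pass, but per-directory rewriting can restart and
# re-run the rule set against the rewritten URL
RewriteRule ^robots\.txt$ - [L]

# [END] ends this pass and suppresses any later restarts too
RewriteRule ^robots\.txt$ - [END]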


A simple deny doesn't require an Order directive.
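
For comparison, the 2.4 / mod_authz_core way to write the same deny, with no Order involved (a sketch using the example IP from the first post):

# Allow everyone except one IP
<RequireAll>
Require all granted
Require not ip 118.193.41.43
</RequireAll>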

lucy24

4:42 pm on Jul 22, 2022 (gmt 0)

Any module that can issue a 403 needs to poke a hole for two things: robots.txt, as in the present question, and any error documents. Separate exemptions for each mod, since each module is an island.

For mod_authz_core * this is best done with a Files envelope, like
<Files robots.txt>
Require all granted
</Files>
For mod_rewrite, put a rule at the very beginning of your rewrite section that says

RewriteRule robots\.txt - [L]
(The anchored form ^robots\.txt can and should be used if the rule is in a <Directory> section or htaccess.)

The latter will also exempt robots.txt from canonicalization redirects, which is desirable because some robots seem to get confused if a robots.txt request is redirected. And you don't want to give them any excuse whatsoever for not getting robots.txt.
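
For instance, with a typical canonical-host redirect (example.com standing in for the real domain), the early exemption keeps robots.txt out of the redirect entirely:

# Exempt robots.txt before any canonicalization redirects
RewriteRule ^robots\.txt$ - [L]

# Canonical-host redirect that robots.txt requests now never reach
RewriteCond %{HTTP_HOST} !^www\.example\.com$ [NC]
RewriteRule (.*) https://www.example.com/$1 [R=301,L]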


* I have only just looked up the name. By tomorrow I will have forgotten it again.

tangor

7:07 pm on Jul 22, 2022 (gmt 0)

I do it this way ... should I change it?

SetEnvIfNoCase Request_URI "(robots\.txt)$" pass

lucy24

7:24 pm on Jul 22, 2022 (gmt 0)

Neither the quotation marks nor the parentheses are necessary. (Quotation marks in mod_setenvif are mainly useful when the pattern contains a literal space; parentheses are only needed if there is a capture involved.) Nor, for that matter, is the closing anchor, unless the site mysteriously contains files called robots.txt/blahblah.

Since the environment variable "pass" is positive, i.e. DO let people in if it has been set, the whole thing depends on the structure of the Require section: for example, it could be one of several options in a <RequireAny> section.
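
For instance, a sketch in which the exemption is one of several ways in (the IP range is a placeholder for whatever else the section admits):

SetEnvIfNoCase Request_URI robots\.txt$ pass

# Grant access if ANY requirement matches: the robots.txt
# exemption or an otherwise-permitted client
<RequireAny>
Require env pass
Require ip 192.0.2.0/24
</RequireAny>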

tangor

8:20 am on Jul 23, 2022 (gmt 0)

Thanks! I got that line from either incrediBILL or JDMorgan years ago and just went with it. This thread suggests I might do some more study regarding .htaccess as I have no RequireAny sections at all!

So far, everyone gets robots.txt, even if denied later in .htaccess. :)

dstiles

8:48 am on Jul 23, 2022 (gmt 0)

I see no real benefit to allowing any old bot to access robots.txt UNLESS you have a denial for every one of the thousands of bots that hit (and then ignore it). I worry about letting the acceptable bots have access and don't worry about all the others, apart from a very few genuine but unwanted bots. Even the acceptable bots don't entirely obey directives in robots.txt (e.g. bingbot, googlebot...).

After all, robots.txt is a rubbish non-standard anyway, with no actual control over bots that don't care about it. It should long ago have been turned into a gateway device such as htaccess.

Well, that's my view, anyway. :(

lucy24

3:08 pm on Jul 23, 2022 (gmt 0)

I rewrite robots.txt to a robots.php that, among other things, selects which version to display. For certain types of robots that I already know will be categorically denied, all they get is a minimalist
User-Agent: *
Disallow: /

:: detour to check ::

Currently, this goes to: no agent; bad agent (from a list); bad range; humanoid UA* ... and any robots.txt request that includes a referer **.

If a robot can't get to robots.txt, it has no way of knowing you don't want it there.

* This means that nosy humans who snoop into my robots.txt will see a comprehensive Disallow, but it can't be helped.
** There is, I think, one and only one point on one site where I actually do link to my own robots.txt--but this link would only be followed by humans, who have already been told they won't see the real thing.
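
The plumbing for that rewrite is a one-liner; a sketch, with robots.php standing in for whatever the selection script is called:

# Serve robots.txt through a script that decides which version to show
RewriteRule ^robots\.txt$ /robots.php [L]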

Brett_Tabke

3:52 pm on Jul 23, 2022 (gmt 0)

>UNLESS you have a denial for every one of the thousands of bots that hit

If they send a valid agent name, most are covered. There are some that are very pesky (the evil Archive.org comes to mind).

Then there are bots like Apple's, which ignore robots.txt for 90 days. The only way to stop one is to ban the agent and then hope it reads robots.txt.

This catches a large percentage of them:


RewriteCond %{HTTP_USER_AGENT} ^.*(DataForSeoBot|AhrefsBot|wp_is_mobile|AppleBot|meerkatseo|LWP|AppleNewsBot|yacy|infotiger|amazonbot|YisouSpider|search.marginalia.nu|netvibes|Slurp|omgili|archive|AspiegelBot|AwarioSmartBot|axios|babbar|Baiduspider|basalsa|bbot|BcahiBot|BLEXBot|BluechipBacklinks|bsalsa|Bytespider|CCBot|Cliqzbot|curl|DaniBot|DataForSeoBot|daum.net|DotBot|EcoSearch|Exabot|eyemon|Flamingo_SearchEngine|GarlikCrawler|Gwene|Hatena|HTTrack|Jersey|Linespider|linkfluence|linkpad|magpie|mail.ru|MauiBot|MBCrawler|MJ12bot|naver|OpenNet.ru|PaperLiBot|PetalBot|picoFeed|Python|Qwantify|RyteBot|Seekport|SemrushBot|SentiBot|SerendeputyBot|serpstatbot|SeznamBot|Sogou|sogouspider|Studio|TkBot|trendiction|Wget|Yandex|zhanzhang|zoominfobot).*$ [NC]
RewriteRule .* - [F,L]

not2easy

5:31 pm on Jul 23, 2022 (gmt 0)

It should work without the {^.*} before the ( and the {.*$} after the closing ), and I see at least one duplicate UA. You can also use "contains" strings in place of the full name of a UA, so long as the string is not found in other UAs that you don't want to block. In other words, condense several UAs into strings common to multiple unwanted UAs, like (fetch|filter|flip|geni|gimme|hub|link|libww).

For example, meerk or atseo will block meerkatseo. Similarly, sogouspider contains Sogou, so "|Sogou|sogouspider|" is redundant. |link| could replace |linkfluence|linkpad|BluechipBacklinks|, and |eobot| or |data| blocks |DataForSeoBot| (which is duplicated).
|OpenNet.ru| should have the dot escaped, as in |OpenNet\.ru| - same for |daum.net|, |search.marginalia.nu| and |mail.ru|.

With the NC flag there's no need to match the exact casing. Yes, I should have used {code} tags, but the line might seem too long.

lucy24

6:42 pm on Jul 23, 2022 (gmt 0)

>It should work without the {^.*} before the ( and the {.*$} after the closing )
The one at the front is especially undesirable, because it means the server goes through the entire UA and only then stops and says “Oh, whoops, I guess I was supposed to pick up that string along the way.”

The [L] flag is never needed with [F] (or any other 400-class response). It does no harm, but it costs two bytes on every request, assuming htaccess. And if this is in htaccess, that's already a horrendous amount of work for the server.

Some of the robots on that long list are in fact compliant, so it's worth Disallowing them in robots.txt, because the only thing better than a blocked request is one that isn't made in the first place.
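
Pulling those corrections together, a trimmed sketch of the same rule (the alternation is cut to a few entries for illustration; the full list would follow the same pattern):

# No leading/trailing .*, duplicates removed, literal dots escaped,
# [L] dropped because [F] implies it
RewriteCond %{HTTP_USER_AGENT} (ahrefsbot|amazonbot|bytespider|mj12bot|opennet\.ru|petalbot|semrushbot|yandex) [NC]
RewriteRule ^ - [F]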