I wrote a little piece of code that pulls all 403s and writes them to a text file to examine.
So is there a simple way for me to let this particular bot through while still blocking other bots that include "nutch" in their UA strings?
BrowserMatch Nutch bad_nutch
BrowserMatch {some-exact-UA-containing-nutch} !bad_nutch
...
Deny from env=bad_nutch
while if it's just one site, you might do it as
RewriteCond %{HTTP_USER_AGENT} Nutch
RewriteCond %{HTTP_USER_AGENT} !{some-exact-UA-containing-nutch}
RewriteRule (^|/|html)$ - [F]
(YMMV, but robots asking for non-page files are so rare that it isn't worth making the server check on every single request.) If you're on Apache 2.4, you might do the same thing using assorted <If> envelopes.
something that gives an internal server error when you try to use it
It is no longer prudent to block server ranges without keeping a diligent watch on them.
"GET /robots.txt HTTP/1.0" 200 1420 "-" "bigfind/Nutch-1.7"This is an example of "who the hell is this?" And why we can't just allow nutch open access. Bot runners need to learn to include a simple info page to let webmasters know who they are and what they do. After all... it is our property they are asking for.
"GET /page.html HTTP/1.0" 403 946 "-" "bigfind/Nutch-1.7"
• Using start anchors (^) when applicable may save server resources.
RewriteCond %{HTTP_USER_AGENT} (generic|agent|attributes|including|nutch|spider|crawl|etc) [NC]
RewriteCond %{HTTP_USER_AGENT} !^(UAs|identified|by|specific|attribute|at|start|of|string)
RewriteCond %{HTTP_USER_AGENT} !(UAs|identified|with|specific|attributes|that|occur|other|than|at|start)
RewriteRule !^(forbidden\.html|robots\.txt)$ - [F]
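If the way those conditions combine is hard to picture, here is a rough Python model of the same logic. It is only a sketch: the rule fires when the requested path is not one of the exempt files, and the stacked RewriteConds are ANDed together. The bot names in the two allow-lists are made up for illustration and stand in for the placeholder words in the rules above.

```python
import re

# Generic attributes to block; case-insensitive like the [NC] condition.
GENERIC = re.compile(r"(nutch|spider|crawl)", re.IGNORECASE)
# Hypothetical allowed UAs identified at the start of the string (anchored).
ALLOWED_AT_START = re.compile(r"^(GoodBot|FriendlyCrawler)")
# Hypothetical allowed UAs identified by something elsewhere in the string.
ALLOWED_ANYWHERE = re.compile(r"(\+http://goodbot\.example/info)")
# Files that must stay reachable, as in the RewriteRule pattern.
EXEMPT_PATHS = re.compile(r"^(forbidden\.html|robots\.txt)$")

def is_forbidden(path, user_agent):
    """Model of: RewriteRule !exempt - [F] with three ANDed conditions."""
    if EXEMPT_PATHS.search(path):
        return False  # the negated rule pattern does not match, nothing happens
    return (
        GENERIC.search(user_agent) is not None
        and ALLOWED_AT_START.search(user_agent) is None
        and ALLOWED_ANYWHERE.search(user_agent) is None
    )

print(is_forbidden("page.html", "bigfind/Nutch-1.7"))
print(is_forbidden("robots.txt", "bigfind/Nutch-1.7"))
print(is_forbidden("page.html", "GoodBot/2.0 (Nutch-based)"))
```

The start-anchored list is checked with ^ so the regex engine can give up after the first few characters, which is the point of the note about start anchors.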
RewriteRule !^(forbidden\.html|robots\.txt)$ - [F]
RewriteRule forbidden\.html - [L]
listing any files that would otherwise match your access-control rules. (For example, I don't need to say anything about robots.txt, because my rules are all written for html files.) This rule goes at the very beginning of all RewriteRules, not near the end with the ordinary [L] rewrites.
# BLOCK USER AGENTS:
SetEnvIfNoCase User-Agent (a6corp|NerdyBot|nutch|spbot) ban
SetEnvIfNoCase User-Agent (aboundex|PHPCrawl|Dotbot) ban
SetEnvIfNoCase User-Agent (BLEXBot|genieo|Gigabot) ban
etc
.
.
.
SetEnvIf User-Agent "Windows 95" ban
SetEnvIf User-Agent "Windows 98" ban
SetEnvIf User-Agent "Mozilla/4.6" ban
etc
.
.
.
SetEnvIfNoCase User-Agent "SafeDNS" ! ban
Order Allow,Deny
Allow from all
Deny from env=ban
SetEnvIfNoCase User-Agent can be shortened to BrowserMatchNoCase, and ! ban should be written !ban, with no space after the exclamation mark.
SetEnvIf User-Agent "Windows 95" ban
SetEnvIf User-Agent "Windows 98" ban
can easily collapse to
SetEnvIf User-Agent "Windows 9[58]" ban
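A quick way to convince yourself that the collapsed pattern catches both versions is to try it outside the server first. This sketch uses Python's re module, which treats a simple character class like [58] the same way Apache's regex engine does; the sample UA strings are made up for the test.

```python
import re

# Same pattern as in the collapsed SetEnvIf line (case-sensitive, unanchored).
pattern = re.compile(r"Windows 9[58]")

samples = [
    "Mozilla/4.0 (compatible; MSIE 5.5; Windows 95)",
    "Mozilla/4.0 (compatible; MSIE 6.0; Windows 98)",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
]

# True where the UA would get the "ban" variable set.
matches = [bool(pattern.search(ua)) for ua in samples]
print(matches)  # [True, True, False]
```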
Host: Digital Ocean
178.62.0.0 - 178.62.255.255
178.62.0.0/16
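If you want to double-check that a start/end range and a CIDR block really describe the same addresses before pasting one into a Deny line, Python's standard ipaddress module can do the conversion. A small sketch using the range listed above; the test address 178.62.10.20 is just an arbitrary example inside it.

```python
import ipaddress

first = ipaddress.ip_address("178.62.0.0")
last = ipaddress.ip_address("178.62.255.255")

# Collapse the start/end pair into the smallest list of CIDR blocks.
networks = list(ipaddress.summarize_address_range(first, last))
print(networks)  # [IPv4Network('178.62.0.0/16')]

# Check whether a given visitor IP falls inside the block.
net = networks[0]
print(ipaddress.ip_address("178.62.10.20") in net)  # True
print(ipaddress.ip_address("178.63.0.1") in net)    # False
```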
Both of these checked robots.txt first.