SafeDNS

keyplyr

11:24 am on Oct 17, 2015 (gmt 0)

UA: SafeDNS search bot/Nutch-1.9 (https://www.safedns.com/searchbot; support [at] safedns [dot] com)
Protocol: HTTP/1.0
Robots.txt: Yes
Host: Digital Ocean
178.62.0.0 - 178.62.255.255
178.62.0.0/16
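(The /16 line is the same range in CIDR notation: 65,536 addresses.)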

Safe site service for parents, schools, ISPs, etc. Subscribers can "check this website."

aristotle

1:37 pm on Oct 18, 2015 (gmt 0)

I've started seeing this too, but it gets a 403 response because my .htaccess files block anything that includes "nutch" as part of the UA.

So should I allow it, and if so, is there a way to let it get through without removing "nutch" from the .htaccess files for all of my sites? Or should I just remove "nutch", since I can't remember why I included it originally? What is the best approach?

Also, if I remember correctly, people in other threads sometimes say that they block HTTP/1.0 requests.

keyplyr

2:20 pm on Oct 18, 2015 (gmt 0)

IMO it is dangerous to blindly allow nutch (or any publicly available tool like this) because you never know who is using it to fetch your files or what they will do with your data.

I filter nutch. I use mod_rewrite to list a dozen generic bot types by UA, then allow the beneficial ones through by either specific name or IP range.

I don't block HTTP/1.0. I tried it but realized it blocked too many users and beneficial agents. But everyone's site is different. If you do decide to block anything, always examine your daily logs thoroughly.
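
For reference, a minimal mod_rewrite sketch of that sort of protocol block (just a sketch, not a recommendation, for the reasons above):
# Refuse any request still made over HTTP/1.0 (test against your own logs first)
RewriteCond %{SERVER_PROTOCOL} ^HTTP/1\.0$
RewriteRule ^ - [F]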

I wrote a little piece of code that pulls all 403s and writes them to a text file to examine. But you can do it manually with a text editor by searching for each occurrence of 403.
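
If your host runs Apache 2.4 and you can edit the server config (CustomLog isn't available in .htaccess), a conditional log can do that filtering for you; an untested sketch, with a hypothetical log path:
# Apache 2.4+: copy every 403 response into its own log file
CustomLog logs/forbidden.log combined "expr=%{REQUEST_STATUS} -eq 403"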

IMO it's unwise to block anything unless you are willing to spend the effort & time to find out what/who is actually getting blocked.

aristotle

2:42 pm on Oct 18, 2015 (gmt 0)

keyplyr -- Thanks for your reply.

So is there a simple way for me to let this particular bot through while still blocking other bots that include "nutch" in their UA strings? I'm not very good at working out .htaccess code on my own, and have never had much luck finding what I want on the web in this regard. It's easy to spend two or three frustrating hours searching around and then end up with something that gives an internal server error when you try to use it.

lucy24

7:50 pm on Oct 18, 2015 (gmt 0)

I wrote a little piece of code that pulls all 403s and writes them to a text file to examine.

Aha, that answers that question. My default log-wrangling routine starts by ignoring all 403s, so they're out of sight out of mind except every year or two when it's "At Home with the Robots" time. Which, in turn, means that I don't really know who's using those blocked IP ranges, except in the rare case where they ask for robots.txt without immediately afterward asking for a page. (Another pattern I ignore.)

So is there a simple way for me to let this particular bot through while still blocking other bots that include "nutch" in their UA strings?

Depends on your exact physical configuration, and on your personal preferences, and on a bunch of other stuff. For example, if you've got several sites sharing the same htaccess you might say
BrowserMatch Nutch bad_nutch
BrowserMatch {some-exact-UA-containing-nutch} !bad_nutch
...
Deny from env=bad_nutch
while if it's just one site, you might do it as
RewriteCond %{HTTP_USER_AGENT} Nutch
RewriteCond %{HTTP_USER_AGENT} !{some-exact-UA-containing-nutch}
RewriteRule (^|/|html)$ - [F]
(ymmv, but robots asking for non-page files are so rare that it isn't worth making the server check on every single request.) And if you're on Apache 2.4 you might do the same thing using assorted <If> envelopes

... et cetera.
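
To make the single-site version concrete with the UA from the top of this thread (a sketch only; I'm assuming "SafeDNS" is a safe distinguishing substring, so check it against your own logs):
# Block Nutch-based agents but let the SafeDNS searchbot through
RewriteCond %{HTTP_USER_AGENT} Nutch
RewriteCond %{HTTP_USER_AGENT} !SafeDNS
RewriteRule (^|/|html)$ - [F]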

something that gives an internal server error when you try to use it

My current procedure goes like this:
#1 make changes to local copy of htaccess file (called "htaccess-shared" or "htaccess-sitename" or whatever it may be) and keep the file open
#2 upload this htaccess file to the appropriate location
#3 test-retrieve any random document such as robots.txt from the relevant site
#4a if successful, close local document
#4b if server error, ask text editor to highlight all changes, and fine-tooth-comb until I find the missing parenthesis or trailing space or similar embarrassing error

aristotle

9:14 pm on Oct 18, 2015 (gmt 0)

Thanks Lucy -- Your suggestions for the User Agent filtering look like exactly what I need. I should have time to work on that and get it implemented sometime in the next day or two.

As for the internal server errors that I mentioned, they've occurred a couple of times when I found a piece of code on the web that was supposed to do what I wanted, but when I inserted it into an existing .htaccess file, it caused the error. So I had to take it out and start looking for something else to try. That's why I don't like to look for solutions by searching on the web: a lot of the information isn't trustworthy, at least not on the servers that I've used.

keyplyr

10:39 pm on Oct 18, 2015 (gmt 0)

The rewrite Lucy exemplified is basically what I mentioned. I'm traveling with only a phablet for connectivity, so I could not be more explicit.

As you have discovered, it is a very bad idea to cut'n'paste code from unknown sources. Probably not a perfect solution to get code from WW either... everyone makes mistakes now and then, which is why it's essential to always back up the files you are editing.

@Lucy - IMO it is no longer prudent to block server ranges without diligent watch to see exactly who/what is actually being affected.

Many/most server farms now host cloud users. These can be humans or beneficial agents. Social media and mobile devices changed the way we now need to think about server ranges. Almost daily I find humans (desktop or mobile) coming from server ranges I have blocked. I am constantly poking holes to allow valid traffic through.

Ironic: for 13 years I spent most of my time blocking access. Now, for the last 3 years, that same time is spent allowing it. I also allow approx 30 bots I had previously blocked. Either they have been re-purposed or I have just changed my strategy with what they do.

lucy24

11:23 pm on Oct 18, 2015 (gmt 0)

it is no longer prudent to block server ranges without diligent watch

Just the other day I unblocked a clutch of Russian/Ukrainian ranges as well as one specific referer pattern. (It was never a full-out block, but the redirect was enough to make everyone lose interest.) I almost immediately discovered that one form of referer that seemed to scream out "robot!" is exactly what a human Yandex user brings. I mean an actual human, not an infected browser.

And I still keep track of any humanoid behavior from blocked requests-- in particular, piwik requests coming from the 403 page. (I put analytics code on the page for this very reason.) At least 99% of the time, the referer is a semalt-type thing, which absolutely deserved to be blocked, so that's reassuring.

As mentioned elsewhere, I put an IPv6 address on my test site. I wish they had called it something else. Took me ages to grasp that the address is not six but eight segments. One corollary is that incoming IPv6 requests-- which used to get converted by some mysterious means into IPv4-- now show up as-is in logs. That includes a fair number of humans-- myself among them.
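
(For example, a full address from the documentation range has eight 16-bit groups: 2001:0db8:85a3:0000:0000:8a2e:0370:7334.)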

keyplyr

12:29 am on Oct 19, 2015 (gmt 0)

@ aristotle - coincidentally in this hour's log:
"GET /robots.txt HTTP/1.0" 200 1420 "-" "bigfind/Nutch-1.7"
"GET /page.html HTTP/1.0" 403 946 "-" "bigfind/Nutch-1.7"
This is an example of "who the hell is this?" and why we can't just allow nutch open access. Bot runners need to learn to include a simple info page to let webmasters know who they are and what they do. After all... it is our property they are asking for.

keyplyr

3:19 am on Oct 19, 2015 (gmt 0)

(back at a desktop machine)
My version of the rewrite Lucy exemplified earlier:

RewriteCond %{HTTP_USER_AGENT} (generic|agent|attributes|including|nutch|spider|crawl|etc) [NC]
RewriteCond %{HTTP_USER_AGENT} !^(UAs|identified|by|specific|attribute|at|start|of|string)
RewriteCond %{HTTP_USER_AGENT} !(UAs|identified|with|specific|attributes|that|occur|other|than|at|start)
RewriteRule !^(forbidden\.html|robots\.txt)$ - [F]
• Using start anchors (^) when applicable may save server resources.
• I use a custom 403 page, so I must allow that page to those I block. I also believe all agents should have access to robots.txt. Rather than identifying which pages are blocked, the resolving rule forbids everything except those two files.
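
The custom page itself only needs a one-liner to wire up; a sketch using the filename that comes up later in this thread:
ErrorDocument 403 /forbidden.html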

lucy24

6:15 am on Oct 19, 2015 (gmt 0)

RewriteRule !^(forbidden\.html|robots\.txt)$ - [F]

The alternative is to make a preliminary rule that says
RewriteRule forbidden\.html - [L]
listing any files that would otherwise match your access-control rules. (For example I don't need to say anything about robots.txt, because my rules are all written for html files.) This rule goes at the very beginning of all RewriteRules, not near the end with the ordinary [L] rewrites.
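
If more than one file needs exempting, the same whitelist pattern extends; a sketch using the two filenames from keyplyr's rule above:
RewriteRule ^(forbidden\.html|robots\.txt)$ - [L]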

If your lockouts are issued via "Deny from..." lines, make a <Files> envelope for robots.txt that says Allow from all.
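
Something like this (Apache 2.2 syntax, an untested sketch):
<Files "robots.txt">
Order Allow,Deny
Allow from all
</Files>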

If you are on shared hosting and they've given you default names for error documents (it is no coincidence that keyplyr and I both use "forbidden.html" ;)) you do not need a <Files> envelope for those names, because there's already one in the config file. You do need a RewriteRule exemption, since those aren't inherited.

keyplyr

8:55 am on Oct 19, 2015 (gmt 0)

Thanks for that, but I don't use envelopes any longer. I use as few mods as possible. I know they all load regardless on shared hosting, but for my end I like to keep it simple, so aside from the Header & Handler controls I use mod_rewrite. For anything cross-site I use server-side scripting.

aristotle

6:17 pm on Oct 19, 2015 (gmt 0)

What about the following code, which I devised by taking what I already have (with "nutch" in the first line), then adding a new line near the bottom to allow UAs that contain the term "SafeDNS":
# BLOCK USER AGENTS:
SetEnvIfNoCase User-Agent (a6corp|NerdyBot|nutch|spbot) ban
SetEnvIfNoCase User-Agent (aboundex|PHPCrawl|Dotbot) ban
SetEnvIfNoCase User-Agent (BLEXBot|genieo|Gigabot) ban
etc
.
.
.
SetEnvIf User-Agent "Windows 95" ban
SetEnvIf User-Agent "Windows 98" ban
SetEnvIf User-Agent "Mozilla/4.6" ban
etc
.
.
.
SetEnvIfNoCase User-Agent "SafeDNS" ! ban
Order Allow,Deny
Allow from all
Deny from env=ban

Could this work?

lucy24

8:37 pm on Oct 19, 2015 (gmt 0)

SetEnvIfNoCase User-Agent

You may have forgotten the shorthand
BrowserMatchNoCase

which exists for exactly this situation.

I'd avoid NoCase matches (in any module) whenever possible: a NoCase match is equivalent to, for example, [Nn][Uu][Tt][Cc][Hh], and the odd casings will rarely occur, so it's just more work for the server.
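
For instance, the first line of the block quoted above could become one case-sensitive shorthand line (a sketch; I've kept the same spellings and only varied "nutch", the one fragment this thread has actually seen in both casings):
BrowserMatch (a6corp|NerdyBot|[Nn]utch|spbot) ban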

! ban

No space! I don't actually know the consequence of the space-- whether a server meltdown or just failure of the current rule-- but don't particularly want to find out.

mod_setenvif uses Regular Expressions, so groups like
SetEnvIf User-Agent "Windows 95" ban
SetEnvIf User-Agent "Windows 98" ban
can easily collapse to
SetEnvIf User-Agent "Windows 9[58]" ban

Note for future reference that in mod_setenvif, unlike mod_rewrite, quotation marks do obviate the need to escape spaces. (That is, you did it right, but I want to be sure you did it right on purpose, not by accident.) But they don't convert the stuff inside quotation marks to a literal string; it's still a Regular Expression. Yes, this may be confusing and counterintuitive.

aristotle

9:30 pm on Oct 19, 2015 (gmt 0)

Thanks Lucy -- Looks like I made some dumb errors. I'll get back to work on this tomorrow and try to fix them.

Oh, one quick question-- if I use quotation marks around a phrase, then it wouldn't make sense to use NoCase too, would it?

lucy24

10:38 pm on Oct 19, 2015 (gmt 0)

There's no relationship between quotation marks and NoCase, since the quotation marks in mod_setenvif don't mean literal text.

In general, save NoCase for situations where something really can occur in all possible casings. In a text editor I might search for, say, "finding" with case sensitivity disabled if I wanted to find (haha) both medial "finding" and sentence-initial "Finding". But when nanoseconds matter, you really mean "[Ff]inding", case-sensitive.

"NoCase" or [NC] -- or a "case sensitivity" clickbox* -- is one of those things that's a convenient shorthand for us, the human users, but it can create a lot more work behind the scenes.


* One of the two text editors I use regularly has "Ignore Case" as an optional extra, meaning that things are case sensitive by default. The other has a "Case Sensitive" checkbox, meaning it's the exact opposite. This can be exasperating.

dstiles

6:28 pm on Oct 20, 2015 (gmt 0)

Remember there have also been nutch UAs with hyphens and underscores! My puny Windows regex accepts...

n[-_]?u[-_]?t[-_]?c[-_]?h

(it is a case insensitive test)
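
In the mod_rewrite terms used earlier in the thread, that might look like this sketch:
RewriteCond %{HTTP_USER_AGENT} n[-_]?u[-_]?t[-_]?c[-_]?h [NC]
RewriteRule (^|/|html)$ - [F]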

lucy24

9:54 pm on Oct 20, 2015 (gmt 0)

Ugh, how nasty. Seems like it should be possible to construct a fairly generic RegEx to cover the case of "hyphen or lowline in the wrong place".

:: detour to play with random specimen of logs that's sitting on my desktop from last time ::

Legitimate strings with hyphens or lowlines turn up everywhere:
Googlebot-Something
OS X 10_\d_\d
nn-nn (that is, language+region)
Linux x86_64

... uhm, OK, maybe not. Darn those mobiles anyway.

aristotle

1:06 pm on Oct 22, 2015 (gmt 0)

Host: Digital Ocean
178.62.0.0 - 178.62.255.255
178.62.0.0/16

keyplyr -- Are you sure about that IP range? The two visits I've noticed came from 188.166.29.35 and 188.166.81.11

Anyway, I'm still trying to test my new code. I made the changes that Lucy suggested and implemented them on my least important site. I then went to the SafeDNS website and entered the URL for them to check. But that was two days ago and their bot still hasn't come to that site.

I don't want to implement it on my other sites until I see what happens on this less important site. So right now I'm just waiting.

keyplyr

2:11 pm on Oct 22, 2015 (gmt 0)

Yes, I am sure. Bots often come from more than one range, ya know. Your IPs are also Digital Ocean. With clouds the ranges are dynamic... they change often.

aristotle

6:33 pm on Oct 28, 2015 (gmt 0)

Well I still haven't seen this SafeDNS bot show up on my test site. I haven't noticed it on my other sites recently either, and have only seen it twice altogether.

But I think it's important to make sure that it isn't blocked, so on all my other sites I've removed "nutch" from the user agent block list, and also tried to unblock all the Digital Ocean IP ranges that I can identify. This is a temporary measure until I see what happens on my test site.

In the meantime just today I noticed two other nutch-type user agents showing up:

VeriCiteCrawler/Nutch-1.9 (from IP 198.30.168.58)

tbot-nutch/Nutch-1.10 (from IP 217.73.208.154)

Both of these checked robots.txt first.

lucy24

8:43 pm on Oct 28, 2015 (gmt 0)

Both of these checked robots.txt first.

I think keyplyr said in a different thread that the default "nutch" script includes a robots.txt request, so they'll always ask. Of course obeying robots.txt is a whole nother matter. If your front page happens to include links to roboted-out directories (mine does) --or, of course, if the entire site is roboted-out-- it will be easy to tell if you've got a compliant robot.

And then there are the ones that ask for robots.txt ... only after requesting the front page. (Are they verifying that the site exists at all before going to the arduous effort of studying a robots.txt file that could be as big as, oh, several hundred bytes?) "Um, sorry, robot, I don't think you quite understand how this thing works."

keyplyr

9:54 pm on Oct 28, 2015 (gmt 0)

It was always my opinion that any customized version of nutch should support explicit "nutch" directives in robots.txt, and if they didn't, they should change the entire UA and not just customize the prefix attribute. This has not been the case, which has irked me (strange word, but I try to use it at least once a month for vernacular diversity).

So in retaliation I blocked "nutch" in htaccess for years. There I have the power, and using just the attribute "nutch" blocked them all - ha ha! However, nowadays my webmaster techniques have switched from blocking everything to allowing as many beneficial agents as possible... so long and thanks for all the fish.

tangor

10:28 pm on Oct 28, 2015 (gmt 0)

This, too, is my norm: nuking "nutch" of any kind. I have yet to find any beneficial "nutch", though I do check from time to time.