
Banning by user agent

   
3:34 am on Nov 11, 2012 (gmt 0)



I have a problematic host that is consistently trying to scrape my site.

Which method of blocking is the most effective in terms of bandwidth/server consumption if the pest is persistent?

mod_security?
SecRule REQUEST_HEADERS:REMOTE_HOST "host-name-here" deny,status:403
- this doesn't seem to work.

htaccess?
 SetEnvIfNoCase Remote_Host "host-name-here" bad_bot
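
If I understand it right, the SetEnvIf line by itself only labels the request; the full block I was aiming for would be roughly this (host name and flag name are placeholders):

# flag requests whose resolved host name matches (this forces a reverse DNS lookup)
SetEnvIfNoCase Remote_Host "badhost\.example\.com$" bad_bot
# then refuse anything carrying the flag
Order Allow,Deny
Allow from all
Deny from env=bad_bot

On the mod_security side, REMOTE_HOST appears to be a standalone variable rather than a request header, which may be why the SecRule above never matches.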
3:41 am on Nov 11, 2012 (gmt 0)

Senior Member wilderness



This old thread is enough to get you started.

Close to Perfect htaccess [webmasterworld.com]
3:45 am on Nov 11, 2012 (gmt 0)

Administrator incredibill



I'm not sure which is the most efficient, but considering mod_security is yet another add-on layer, I'd assume it's slightly less efficient just having it there in the first place.

Overall, REMOTE_HOST is grossly inefficient as it forces a reverse DNS lookup, which can burn a lot of time as opposed to simply blocking by IP range.

I don't mind reverse DNS lookups if they're cached so the lookup isn't repeated every time the server sees the same IP within a short time period. I do that in my own code, but I can't vouch for REMOTE_HOST doing the same.
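
For comparison, a rough sketch of the IP-range version in 2.2-style htaccess (the ranges are placeholders):

# block the offending ranges outright; no DNS lookup involved
Order Allow,Deny
Allow from all
Deny from 192.0.2.0/24
Deny from 203.0.113.0/24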

This old thread is enough to get you started.


Yeah, but that's blacklisting. The number of user agents to block is in the thousands now, and the linear processing that adds to every Apache process is ridiculously inefficient.

Do the same basic code but as a whitelist and it's short, efficient, and permanently effective; you're good to go.
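
Something along these lines, strictly as a sketch (the allowed agents are illustrative, not a recommended list):

# flag the agents you actually want; everything else hits the deny below
SetEnvIfNoCase User-Agent "Googlebot|bingbot|Mozilla" good_bot
Order Deny,Allow
Deny from all
Allow from env=good_bot

Pair it with reverse DNS verification for the search engines, since anything can claim to be Googlebot.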
4:34 am on Nov 11, 2012 (gmt 0)

Senior Member wilderness



I have a problematic host


Looks singular to me ;)
5:00 am on Nov 11, 2012 (gmt 0)



Overall, REMOTE_HOST is grossly inefficient as it forces a reverse DNS lookup, which can burn a lot of time as opposed to simply blocking by IP range.

Good to know!

I've already got a whitelist reverse DNS checker up for the primary search engines, so if I rewrote that to boot the pest(s) I should be OK.

Looks singular to me ;)

One really bad one that's almost un-bannable at this point; every method I've tried thus far seems futile. As soon as I block a range, it comes back with another range... always has the same hostname, though.
5:17 am on Nov 11, 2012 (gmt 0)

Senior Member wilderness



One really bad one that's almost un-bannable at this point; every method I've tried thus far seems futile. As soon as I block a range, it comes back with another range... always has the same hostname, though.


Then you're NOT defining the range widely enough.
5:42 am on Nov 11, 2012 (gmt 0)

Administrator incredibill



it comes back with another range... always has the same hostname


I have a few I had to block by host name for the same reason.

Sometimes there just is no choice.
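
If it comes to that, plain mod_authz_host will take a host name directly; in 2.2 syntax it's roughly this (the name is a placeholder), with the caveat that Apache then does a double reverse DNS lookup for the check:

# deny anything whose verified host name ends in the offending domain
Order Allow,Deny
Allow from all
Deny from .badhost.example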
9:02 am on Nov 11, 2012 (gmt 0)

Senior Member lucy24



mod_security?
SecRule REQUEST_HEADERS:REMOTE_HOST "host-name-here" deny,status:403 - this doesn't seem to work.

htaccess?
SetEnvIfNoCase Remote_Host "host-name-here" bad_bot

By yawn-provoking coincidence I have just this minute come from an unrelated forum where someone had exactly the same kind of "Is it hotter in New York than in the summer?" question.*

You mean mod_security or mod_setenvif. Why bother with the host name at all? Somewhere behind the name is an IP address-- and it's less likely to be faked than anything else you could block. If you're in doubt about the full range, just make it bigger. If for example it claims to be
aa.bb.cc.0/19
but your raw logs don't turn up anything from the rest of aa.bb., just block the whole /16. Or /15 or /14 if you haven't met any humans from there either.

I don't know whether anyone has done a rigorous speed comparison on mod_rewrite using RegExes vs mod_auth-thingie using CIDR ranges. (This is assuming you're not lucky enough to be in Apache 2.4 yet.) Gut feeling says that anything in CIDR form will be faster. But unless you've got an absolutely enormous site-- which you've said you don't-- the difference isn't likely to be significant.


* Answer: Yes.
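
For anyone already on 2.4, the mod_authz_core equivalent of a CIDR deny would be roughly this, ranges again placeholders:

<RequireAll>
    # allow everyone except the listed ranges
    Require all granted
    Require not ip 192.0.2.0/24 203.0.113.0/24
</RequireAll>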
10:59 am on Nov 11, 2012 (gmt 0)

Senior Member wilderness



Why bother with the host name at all? Somewhere behind the name is an IP address-- and it's less likely to be faked than anything else you could block. If you're in doubt about the full range, just make it bigger. If for example it claims to be
aa.bb.cc.0/19
but your raw logs don't turn up anything from the rest of aa.bb., just block the whole /16. Or /15 or /14 if you haven't met any humans from there either.


I agree, and furthermore, if you're unable to pin down the IP range?
Simply start denying whole Class A's (/8) temporarily, and then refine them in follow-up.
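
As a sketch of that stop-gap (the /8 is a placeholder; substitute the offender's first octet):

# deliberately over-broad temporary block; refine once the logs show the real ranges
Deny from 10.0.0.0/8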
10:14 pm on Nov 14, 2012 (gmt 0)



Why bother with the host name at all?


Wouldn't this be beneficial if the host adds IP ranges to their existing ones? That way, you would be covered for future events.
10:32 pm on Nov 14, 2012 (gmt 0)

Senior Member lucy24



But only at the cost of a speed loss everywhere, especially if you're using an exclusion method that looks at all requests all the time. Since each request is an island, your htaccess doesn't know that those eighteen consecutive image requests come from a page the visitor has already got permission to visit, so it checks permissions* all over again even though there's rarely any point. (I'm talking here about generic allows and denies, not the separate issue of hotlinks.)

Hm. Would it save any time overall if the WHOLE list of Deny from... directives were placed inside a Files envelope that constrained it all to .html requests? Or would the envelope itself add even more time to request processing? No, I have no idea why this question never occurred to me before.


* "Permission" in the casual sense, not the 401 sense.