Filtering Bad Bots

most efficient way

         

jake66

8:20 am on Jun 5, 2007 (gmt 0)

10+ Year Member



I noticed a post:
..Scrapers spoof as Google to rip off the naive people depending on shoddy .htaccess files blocking bad user agents..
from another topic (unrelated to this one).

Sadly, I am one of those naive people! Until I came across this post, I thought htaccess was the best way. So what IS the best way to defeat the scrapers / bad bots?

(I'm sure this has been asked and answered many times here, but I cannot figure a reasonable search query that would come up with the most relevant results.. as all of the main keywords I can think of are used frequently here for many subjects)

Frank_Rizzo

8:25 am on Jun 5, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



IMO you can't beat mod_security. The amount of guff this thing blocks is amazing. The best thing about it is that if you have a site which is specifically targeted, you can create a custom rule.
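
To give a taste of what a custom rule looks like (a rough ModSecurity 2.x-style sketch with a made-up User-Agent string, not a drop-in rule -- the syntax differs between the 1.x and 2.x branches):

# Refuse requests whose User-Agent contains a known scraper signature.
SecRuleEngine On
SecRule REQUEST_HEADERS:User-Agent "EvilScraper" "deny,log,status:403"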

jake66

8:29 am on Jun 5, 2007 (gmt 0)

10+ Year Member



I have mod_security on my cPanel VPS; what's the best way for a newbie to learn how to use it?

I took a stab at the tutorials a few weeks ago because people were raving about it on other forums, but the tech talk is over my head. :)

bcc1234

8:54 am on Jun 5, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



A good way to combat scrapers is to create a hidden link to a special page, and disallow that page in robots.txt. Then have a script that analyzes all requests to that file and updates a database of IP addresses to ban in real time. If something hits that file, the script does a reverse DNS lookup to check the name, and a forward DNS lookup on that name. If it's not one of the major spiders you want to allow, it gets added to the list.

The whole site then sits behind a filter that checks that database of IP addresses.

That way, you don't have to do a reverse DNS lookup for each hit to your site, which really slows things down.

If you want to get really creative, set up a filter script that doesn't simply deny requests from a blocked IP, but returns some garbage (random words, etc.) for any request to a page on your site. That way, they won't even notice it right away while scraping.
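
A minimal sketch of the "filter" part in Apache configuration terms, for anyone who wants to see the shape of it. The map file name and path here are made up, a RewriteMap can only be defined in the server or virtual-host config (not in .htaccess), and the trap script that does the DNS checks and appends offenders to the file is assumed to exist separately:

# Hypothetical ban-list filter: the trap script (not shown) appends lines
# like "203.0.113.45 1" to banned-ips.txt, and Apache re-reads a txt: map
# whenever the file's modification time changes.
RewriteEngine On
RewriteMap banned txt:/usr/local/apache/conf/banned-ips.txt

# Look the client IP up in the map; default to "0" (not listed) if absent.
RewriteCond ${banned:%{REMOTE_ADDR}|0} !=0
RewriteRule .* - [F]

Lookups against the map are cheap compared to doing DNS work per request, which is the whole point of keeping the slow checks inside the trap script.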

jdMorgan

1:40 pm on Jun 5, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



The quoted phrase should have been written as "...people depending on shoddy .htaccess files blocking only bad user agents, without checking the reverse-DNS of user-agents that appear to be 'good' ones."

The basic foundation for the bad-bot script [webmasterworld.com] described above is available here at WebmasterWorld. Originally written by key_master, it has been modified and enhanced by several members.

Jim

[edited by: jdMorgan at 1:41 pm (utc) on June 5, 2007]

jake66

7:53 pm on Jun 5, 2007 (gmt 0)

10+ Year Member



I make use of the bad-bot .htaccess script you linked :)

But I've read around the web that .htaccess use can slow down the overall load of your website, so surely there must be a better method?

(I am currently experimenting with my mod_security setup)

jdMorgan

9:43 pm on Jun 5, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Well, poorly-coded and badly-thought-out .htaccess files *can* slow down a server, sure. So can that stuff called PHP and Perl scripts and those database thingies -- It's all relative.

Here's one of the worst offenders, an inefficient regex pattern:


RewriteRule ^(.*)/(.*)/(.*)$ /some-path

That can be sped up by a factor of ten simply by using a more-efficient pattern. And while we're at it, let's add an [L] flag, so that mod_rewrite processing stops if the rule is invoked:

RewriteRule ^([^/]+)/([^/]+)/([^/]+)/?$ /some-path [L]

Or how about this "real winner," present on almost every WordPress site:

RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
RewriteRule .* /index.php

What's wrong with that? Well, it calls the filesystem twice for every single request to the server to see if the requested URL-path exists as a file or as a directory. A bit of thought might lead to a vast improvement, simply by excluding things that WordPress doesn't handle, or things that we know will always exist, such as:

RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
RewriteRule ^([^/]+/)*[^.]*$ /index.php

Here, we prevent the RewriteConds from being processed simply by requiring that the final URL-path-part contain no period. If the Rule pattern does not match, then the RewriteConds won't be processed at all, and the filesystem calls won't be invoked (RewriteConds are processed only *after* their RewriteRule's pattern matches -- it's in the book).

Maybe we don't use extensionless files for our blog, so we might not be able to do the "no filetype" RewriteRule pattern trick, but we can at least stop those filesystem calls from being made most of the time:


RewriteCond $1 !\.(gif|jpe?g|png|css|js|pdf|mp3|mpe?g|avi|txt|xml|rdf)$
RewriteCond $1 !^index\.php$
RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
RewriteRule (.*) /index.php

Here, we skip the filesystem checks if the request is for a media, css, JavaScript, text, xml, or rdf file, or for the index.php file itself.

I've used that trick right there to 'save' several hopelessly-slow sites, and left them, well, downright zippy.

Another thing to avoid is doing unconditional rDNS lookups; like filesystem checks, they are very slow, and conditions should be added if possible. rDNS lookups are invoked by checking the %{REMOTE_HOST} variable, BTW.
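
For example, a hedged sketch of a conditional check (the hostname pattern is illustrative only, not a complete list of Google crawler hostnames): the slow %{REMOTE_HOST} lookup is only reached when the User-Agent already claims to be Googlebot, and a claimed Googlebot whose reverse DNS doesn't resolve into googlebot.com or google.com gets refused.

# Only pay for the rDNS lookup when the UA claims to be Googlebot;
# genuine Googlebot hosts reverse-resolve into googlebot.com / google.com.
RewriteCond %{HTTP_USER_AGENT} Googlebot [NC]
RewriteCond %{REMOTE_HOST} !\.google(bot)?\.com$ [NC]
RewriteRule .* - [F]

Note that this is only the reverse half of the check; the forward-confirmation step described earlier still needs a script outside mod_rewrite.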

Beyond programming details, there's also the issue of overall file structure. Put user-agent and IP address access-controls at the top -- there's no use processing a bunch of URL-path- and hostname- based redirects and rewrites if the requestor isn't welcome on the site. And look for other opportunities to quit mod_rewrite processing early, such as skipping internal page rewrites if the request is for an image -- If your site is like most, then you'll have quite a few image requests per page. And how about Expires and Cache-control headers? -- Are those configured reasonably?
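
As a rough ordering sketch only -- the user-agent names and expiry periods below are placeholders, not recommendations:

RewriteEngine On

# 1) Access controls first: unwelcome UAs never reach the rules below.
RewriteCond %{HTTP_USER_AGENT} (BadBot|EvilScraper) [NC]
RewriteRule .* - [F]

# 2) Quit rewrite processing early for images and other static assets.
RewriteRule \.(gif|jpe?g|png|ico|css|js)$ - [L]

# 3) Only now the site's own redirects and internal page rewrites.

# 4) And sensible caching for static content (assuming mod_expires is loaded).
ExpiresActive On
ExpiresByType image/gif "access plus 1 month"
ExpiresByType text/css "access plus 1 week"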

Using the .htaccess file is not the most efficient way to do many things -- directives placed in the main server config (httpd.conf) are parsed once at startup, while the same code in .htaccess is re-read and interpreted on-the-fly for every HTTP request. But that doesn't mean .htaccess can't be made powerful and much more efficient with a bit of study and work -- like most things in life...
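
For what it's worth, here's a hedged sketch of that move (the directory path is hypothetical): with the rules in the main server config and per-directory overrides switched off, Apache no longer has to look for and parse .htaccess on every request.

# In httpd.conf, instead of a per-directory .htaccess:
<Directory "/home/example/public_html">
    AllowOverride None
    # ...the access controls and rewrites formerly kept in .htaccess go here
    # (or in the matching <VirtualHost>, where rule patterns include the leading slash).
</Directory>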

Only each individual Webmaster can decide how much 'armor' to throw on their Web site. It depends on how much trouble you get from competitors, scrapers, and other malicious entities. For some sites, adding every bit of armor you can find is appropriate, while for others only a little is required. But like the real armor worn by European Knights, too much armor is too heavy and defeats its own purpose -- to improve the survivability of its wearer.

Well, you didn't say you wanted a short opinion, did you? :)

Jim

[edited by: jdMorgan at 9:46 pm (utc) on June 5, 2007]

jake66

7:31 am on Jun 6, 2007 (gmt 0)

10+ Year Member



Thanks for the fantastic tutorial; it reminded me of the .htaccess problem I'd been having a few weeks ago in another topic.

"And while we're at it, let's add an [L] flag, so that mod_rewrite processing stops if the rule is invoked."

The rule you reference there looks similar to an entry in my own .htaccess, only mine ends in [NC] instead of [L].

I did not originally write the rules that contain this, so I'm not entirely sure of their function... but is there any performance difference that the naked eye could see by switching those?

And while we're on the subject, is the IP deny via .htaccess:

Deny from 38.98.x.x

efficient enough when you have potential bad bots hammering your site? If these guys try to DDoS, is there anything else I can do to stop / slow them down?

jdMorgan

4:51 pm on Jun 6, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



[NC] Force case-insensitive pattern match
[L] Last Rule: Stop the rewrite engine if this rule matches and is invoked. (Use it every time, unless you have a good reason not to.)
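
Swapping one for the other isn't a performance tweak -- it changes what the rule does. If both behaviours are wanted, the flags combine; a trivial, made-up example:

# [NC] makes the pattern match regardless of case; [L] stops further
# rule processing for this request once the rule has been applied.
RewriteRule ^old-page\.html$ /new-page.html [NC,L]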

Don't use any code you don't understand -- Doing so leads to this kind of situation where your site gets broken and you've no idea where to start looking. So study up enough to understand each rule that you're using, or hire someone to maintain your server configuration... Those are the two viable choices. For more information, see the documents cited in our forum charter [webmasterworld.com] and the tutorials in the Apache forum section of the WebmasterWorld library [webmasterworld.com].

I'm not sure how to answer your "Deny from" efficiency question -- I've got more than a hundred of those directives on each of many sites.

For a real DDOS, you need to get your host to block the IP addresses and/or IP address ranges at their firewall. The good news is that on a shared server they have even more incentive to do so, so don't think that they won't cooperate just because you're on a relatively inexpensive hosting service -- They don't want *all* the sites on that server to be affected, and they don't want their internal networks overloaded, so they'll generally act if they can.

Jim