..Scrapers spoof as Google to rip off the naive people depending on shoddy .htaccess files blocking bad user agents.. -- quoted from another topic (unrelated to this one).
Sadly, I am one of those naive people! Until I came across this post, I thought htaccess was the best way. So what IS the best way to defeat the scrapers / bad bots?
(I'm sure this has been asked and answered many times here, but I cannot figure out a reasonable search query that would come up with the most relevant results, as all of the main keywords I can think of are used frequently here for many subjects.)
And the whole site is behind a filter that checks each request against that database of IP addresses.
That way, you don't have to do a reverse dns lookup for each hit to your site, which really slows things down.
If you want to get really creative, set up a filter script that doesn't simply deny requests from a blocked IP, but instead returns garbage (random words, etc.) for any page request on your site. That way, they won't even notice it right away while scraping.
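Sketched at the .htaccess level, the idea might look something like this -- 203.0.113. is just a documentation range standing in for whatever addresses your blocked-IP database has flagged, and /decoy.php is a hypothetical script that spits out random filler text:

# Route already-flagged IPs to a decoy script instead of denying them outright
RewriteCond %{REMOTE_ADDR} ^203\.0\.113\.
RewriteRule !^decoy\.php$ /decoy.php [L]

Excluding decoy.php itself in the rule pattern keeps the internal rewrite from looping.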
The basic foundation for the bad-bot script [webmasterworld.com] described above is available here at WebmasterWorld. Originally written by key_master, it has been modified and enhanced by several members.
Jim
[edited by: jdMorgan at 1:41 pm (utc) on June 5, 2007]
Here's one of the worst offenders -- inefficient regex patterns:
RewriteRule ^(.*)/(.*)/(.*)$ /some-path
The multiple .* subpatterns force the regex engine to backtrack all over the requested URL-path. A more specific pattern does the same job far more cheaply, and an [L] flag stops further rule processing when the rule is invoked:
RewriteRule ^([^/]+)/([^/]+)/([^/]+)/?$ /some-path [L]
Another common offender is the standard "route everything through index.php" block, which invokes filesystem checks on every single request:
RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
RewriteRule .* /index.php
Restricting the rule to extensionless URL-paths means those filesystem checks are only made when they could actually matter:
RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
RewriteRule ^([^/]+/)*[^.]*$ /index.php
Maybe we don't use extensionless files for our blog, so we might not be able to do the "no filetype" RewriteRule pattern trick, but we can at least stop those filesystem calls from being made most of the time:
RewriteCond $1 !\.(gif|jpe?g|png|css|js|pdf|mp3|mpe?g|avi|txt|xml|rdf)$
RewriteCond $1 !^index\.php$
RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
RewriteRule (.*) /index.php
I've used that trick right there to 'save' several hopelessly slow sites, leaving them, well, downright zippy.
Another thing to avoid is doing unconditional rDNS lookups; like filesystem checks, they are very slow, so put conditions in front of them wherever possible. rDNS lookups are invoked by checking the %{REMOTE_HOST} variable, BTW.
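To sketch what "conditional" might look like -- assuming the goal is to catch scrapers that spoof Googlebot, and noting that a proper verification would also do a forward lookup to confirm the hostname resolves back to the same IP -- the rDNS check can be limited to requests that actually claim to be Googlebot:

# Only requests whose User-Agent claims Googlebot trigger the slow %{REMOTE_HOST} lookup
RewriteCond %{HTTP_USER_AGENT} Googlebot [NC]
RewriteCond %{REMOTE_HOST} !\.googlebot\.com$ [NC]
RewriteRule .* - [F]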
Beyond programming details, there's also the issue of overall file structure. Put user-agent and IP-address access controls at the top -- there's no use processing a bunch of URL-path- and hostname-based redirects and rewrites if the requestor isn't welcome on the site. And look for other opportunities to quit mod_rewrite processing early, such as skipping internal page rewrites if the request is for an image -- if your site is like most, you'll have quite a few image requests per page. And how about Expires and Cache-Control headers -- are those configured reasonably?
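For instance -- just an illustrative sketch, with the filetypes and lifetimes picked arbitrarily rather than as recommendations -- an early pass-through rule near the top spares image requests from all the rewrite rules below it, and mod_expires (if the module is available) handles the cache lifetimes:

# Stop rewrite processing early for common image requests
RewriteRule \.(gif|jpe?g|png|ico)$ - [NC,L]

# Cache lifetimes via mod_expires -- adjust to taste
ExpiresActive On
ExpiresByType image/png "access plus 1 month"
ExpiresByType image/jpeg "access plus 1 month"
ExpiresByType text/css "access plus 1 week"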
Using the .htaccess file is not the most efficient way to do many things -- directives placed in the main httpd configuration are read and parsed once at server startup, while the same directives in .htaccess are re-read and re-interpreted on-the-fly for every HTTP request. But that doesn't mean .htaccess can't be made powerful and much more efficient with a bit of study and work -- like most things in life...
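To illustrate (the path below is a placeholder, and this assumes you have access to the main configuration at all -- on shared hosting you usually don't):

# httpd.conf / vhost config: the same directives, parsed once at server startup
<Directory "/var/www/example">
    AllowOverride None
    # Rewrite, Deny, and Expires directives can live here instead of in .htaccess
</Directory>

With AllowOverride None, Apache also stops checking the filesystem for .htaccess files on every request in that directory tree, which is a saving all by itself.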
Only each individual Webmaster can decide how much 'armor' to throw on their Web site. It depends on how much trouble you get from competitors, scrapers, and other malicious entities. For some sites, adding every bit of armor you can find is appropriate, while for others only a little is required. But like the real armor worn by European Knights, too much armor is too heavy and defeats its own purpose -- to improve the survivability of its wearer.
Well, you didn't say you wanted a short opinion, did you? :)
Jim
[edited by: jdMorgan at 9:46 pm (utc) on June 5, 2007]
And while we're at it, let's add an [L] flag, so that mod_rewrite processing stops if the rule is invoked:
I didn't write the rules that contain this in the first place, so I'm not entirely sure of their function... but is there any performance difference that the naked eye could see by switching those?
And while we're on the subject, is the IP deny via .htaccess:
Deny from 38.98.x.x
efficient enough when you have potential bad bots hammering your site? If these guys try to DDoS, is there anything else I can do to stop or slow them down?
Don't use any code you don't understand -- Doing so leads to this kind of situation where your site gets broken and you've no idea where to start looking. So study up enough to understand each rule that you're using, or hire someone to maintain your server configuration... Those are the two viable choices. For more information, see the documents cited in our forum charter [webmasterworld.com] and the tutorials in the Apache forum section of the WebmasterWorld library [webmasterworld.com].
I'm not sure how to answer your "Deny from" efficiency question -- I've got more than a hundred of those directives on each of many sites.
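For what it's worth, such a block is just a run of directives like the following sketch -- the address ranges shown are documentation placeholders, not real offenders, and listing whole CIDR ranges keeps the list shorter than hundreds of individual addresses:

# Apache 2.2-style access control
Order Allow,Deny
Allow from all
Deny from 192.0.2.0/24
Deny from 198.51.100.0/24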
For a real DDoS, you need to get your host to block the IP addresses and/or IP address ranges at their firewall. The good news is that on a shared server they have even more incentive to do so, so don't think they won't cooperate just because you're on a relatively inexpensive hosting service -- they don't want *all* the sites on that server to be affected, and they don't want their internal networks overloaded, so they'll generally act if they can.
Jim