Forum Moderators: phranque


Defeating scrapers, framers and hot-linkers with .htaccess

         

numnum

7:26 am on Aug 5, 2014 (gmt 0)

10+ Year Member



Aside from implementing SSIs, conditionals, and the like, I don't know what I'm doing when it comes to things Apache, so I wanted to double-check the following with the experts here. To prevent framing:

Header append X-FRAME-OPTIONS "DENY"

And to prevent hotlinking various and sundry file types:

RewriteEngine on
RewriteCond %{HTTP_REFERER} !^$
RewriteCond %{HTTP_REFERER} !^http://(www\.)myownsite.com/.*$ [NC]
RewriteRule \.(gif|jpg|jpeg|bmp|zip|rar|mp3|flv|swf|xml|php|png|css|pdf)$ - [F]


What I'm even less certain about is the simplest and most effective way to defeat bot scrapers. One suggestion from an old WW thread (I've lost track of it) was to add to the most likely entry page a small and invisible link ( display:none ? ) to a dummy page such that when a bot follows the link its IP address is stored and site access is denied. I don't know how to implement this method, and I assume I'd need to make exceptions for legitimate spiders (esp. search engines). Does this sound right? Isn't there another method that shuts down access to any visitor moving too quickly through too many pages at one's site? I'm totally out of my wheelhouse here.

Lastly, what about denying access to any request for a URL that appends a string to a page's actual URL? In checking my logs, I've noticed essentially the following:

actual url:
www.myownsite.com/page.html

URL request (file or folder obviously doesn't exist but the request returns the above URL):
www.myownsite.com/page.html/string

I'm not sure if what's happening there is framing or something else, but in any event there must be a simple way to prevent this. I can't figure it out from the documentation at the Apache site -- more accurately, I don't fully understand and can't apply what's there.

penders

11:09 am on Aug 5, 2014 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



RewriteCond %{HTTP_REFERER} !^http://(www\.)myownsite.com/.*$ [NC]


I assume the reason for the parentheses on "www" is to make the subdomain optional? In which case you need a "?" suffix, i.e. (www\.)?
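Something like this, for example (keeping the original example domain, and with the dot in ".com" escaped as well):

```apache
# Allow empty referers and referers from your own site, www now optional
RewriteCond %{HTTP_REFERER} !^$
RewriteCond %{HTTP_REFERER} !^http://(www\.)?myownsite\.com/ [NC]
```

The trailing .*$ in the original is redundant, by the way -- without an end anchor the regex already matches any referer that merely starts with your hostname.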

You could probably lose the NC flag as well? In most legitimate cases this is always going to be lowercase. (?)

...add to the most likely entry page a small and invisible link ( display:none ? ) to a dummy page such that when a bot follows the link its IP address is stored and site access is denied.


A "honeypot". The important thing here is that the target of the invisible link is blocked by robots.txt, so bots that obey robots.txt do not follow the link and do not get blocked. This does, however, require a certain amount of server-side scripting in your language of choice.
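Roughly, the moving parts look like this (the names here -- /trap/, the sample IP -- are made up for illustration; the trap page itself would be a script in your server-side language of choice that records the visitor's address):

```
# robots.txt (in the site root) -- compliant bots never request the trap
User-agent: *
Disallow: /trap/

# Hidden link on the entry page, invisible to human visitors
<div style="display:none"><a href="/trap/">trap</a></div>

# .htaccess -- deny the addresses the trap script has recorded,
# e.g. by having the script append "Deny from <ip>" lines here
Deny from 203.0.113.45
```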

However, even this method can result in false positives.

URL request (file or folder obviously doesn't exist but the request returns the above URL):
www.myownsite.com/page.html/string


Note that some systems do rely on this behaviour as a "cheap" way of implementing user-friendly URLs, for instance (if you don't have access to rewrite the URLs properly). This is controlled by the AcceptPathInfo directive [httpd.apache.org], for example:

AcceptPathInfo Off
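If set in .htaccess rather than the main server config, this directive needs AllowOverride FileInfo. The effect, roughly:

```apache
# Requires AllowOverride FileInfo when placed in .htaccess
AcceptPathInfo Off
# /page.html        -> served as normal
# /page.html/string -> 404 instead of silently serving /page.html
```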

wilderness

12:23 pm on Aug 5, 2014 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



RewriteRule \.(gif|jpg|jpeg|bmp|zip|rar|mp3|flv|swf|xml|php|png|css|pdf)$ - [F]


It's good practice to include only the file types that your site(s) actually contain. No sense in wasting server time/CPU checking for something that isn't there.
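So, for example, if a site only actually serves images and PDFs, the rule shrinks to something like (extension list purely illustrative):

```apache
# Only the file types this particular site hosts
RewriteRule \.(gif|jpe?g|png|pdf)$ - [F]
```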

not2easy

3:10 pm on Aug 5, 2014 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



I installed an old-fashioned honeypot years ago and it serves to let me know when I need to take a look at access logs between scheduled reviews. On non-WordPress sites I call the target file "wp-login.php" to kill two birds with one click - it catches scrapers ignoring robots.txt and hackers trying to log in to a non-existent WordPress install.

I think 2 or 3 times it has caught a residential ISP IP, but on examination the activity wasn't human, so I left those individually blocked for a few months.

I do not consider this a method to block scrapers, only an alert to new chunks of the net to check out. To block them, you need to start keeping records of the CIDR ranges of known scraper hosts and block entire CIDRs that can never send you good traffic. It is time-consuming to set up, but a minimal nuisance to maintain after that.

lucy24

6:51 pm on Aug 5, 2014 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I assume the reason for the parentheses on "www" are to make the subdomain optional? In which case you need a "?" suffix, ie. (www\.)?

In fact the www isn't optional. Assuming you've got your domain-name-canonicalization redirect in place, the referer will be either with or without www, depending on your site. That means you can slap down a global block on

RewriteCond %{HTTP_REFERER} ^http://example\.com


(no closing anchor) giving the "wrong" form of your sitename.
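A condition on its own does nothing, of course; a minimal sketch of the complete block (with example.com standing in for whichever form of your hostname is the non-canonical one):

```apache
# Referer claims the non-canonical hostname: no legitimate referer looks like this
RewriteCond %{HTTP_REFERER} ^http://example\.com
RewriteRule ^ - [F]
```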

www.myownsite.com/page.html/string


If you don't want to take the AcceptPathInfo route, there are alternatives. Depending on your particular circumstances:

Conditionless lockout:
RewriteRule \.html. - [F]


Forcible redirect:
RewriteCond %{REQUEST_URI} (.+\.html)
RewriteRule \.html. http://www.example.com%1 [R=301,L]

The sole purpose of the Condition is to save your server the work of capturing on the vast majority of requests -- 99% or more -- where the rule won't apply. Use this form if you're plagued with legitimate visitors -- up to and including Google -- appending spurious guff to the end of your html URLs. If your paths don't contain literal periods (legal, but not all that common unless your name is apache dot org), replace .+ in the Condition with [^.]+
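Spelled out, the no-literal-periods variant would read:

```apache
# %1 already carries the leading slash from REQUEST_URI,
# so no extra "/" goes before it in the target
RewriteCond %{REQUEST_URI} ([^.]+\.html)
RewriteRule \.html. http://www.example.com%1 [R=301,L]
```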

numnum

8:43 pm on Aug 5, 2014 (gmt 0)

10+ Year Member



You could probably lose the NC flag as well? In most legitimate cases this is always going to be lowercase. (?)

But I'm trying to block strings that could be either upper or lower case. Forgive me; there's something here I'm not understanding.

It's a good practice to only include file types that your site (s) contain. No since in wasting server time/CPU to check for something that isn't there.

Point taken.

On non- Wordpress sites I call the target file "wp-login.php" to kill two birds with one click - it catches scrapers ignoring robots.txt and hackers trying to log in to a non-existent wp install.

Sweet!

Thanks, all, for the input. I'm working on it.

lucy24

9:56 pm on Aug 5, 2014 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I'm trying to block strings that could be either upper or lower case.

In this case, they couldn't. There is only one canonical form of your sitename, and hence only one form that can occur in legitimate referers. If your site is example.com and the referer says EXAMPLE.COM, it's fake.

penders

10:18 pm on Aug 5, 2014 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



(EDIT: I didn't see lucy24's post before I posted...)

But I'm trying to block strings that could be either upper or lower case. Forgive me; there's something here I'm not understanding.


But you're not blocking, you're allowing "strings that could be either upper or lower case". The regex is negated (! prefix). So it is being blocked when it is not equal to ...

With the NC flag, the directive is allowing "yoursite.com", "yOuRsItE.cOm" and "YOURSITE.COM".

Without the NC flag, only "yoursite.com" will be able to access the files (with a little less processing involved). "yOuRsItE.cOm", etc. will be blocked (since it is not "yoursite.com").
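Putting the thread's suggestions together, the tightened-up condition pair would look something like this (the poster's domain kept as the example, dots escaped; keep or drop the www to match your canonical hostname):

```apache
RewriteCond %{HTTP_REFERER} !^$
RewriteCond %{HTTP_REFERER} !^http://www\.myownsite\.com/
RewriteRule \.(gif|jpe?g|png|pdf)$ - [F]
```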