
Defeating scrapers, framers and hot-linkers with .htaccess

     
7:26 am on Aug 5, 2014 (gmt 0)

Junior Member (5+ Year Member) | joined: May 17, 2011 | posts: 170 | votes: 0


Aside from implementing SSIs, conditionals, and the like, I don't know what I'm doing when it comes to Apache, so I wanted to double-check the following with the experts here. To prevent framing:

Header append X-FRAME-OPTIONS "DENY"

And to prevent hotlinking various and sundry file types:

RewriteEngine on
RewriteCond %{HTTP_REFERER} !^$
RewriteCond %{HTTP_REFERER} !^http://(www\.)myownsite.com/.*$ [NC]
RewriteRule \.(gif|jpg|jpeg|bmp|zip|rar|mp3|flv|swf|xml|php|png|css|pdf)$ - [F]


What I'm even less certain about is the simplest and most effective way to defeat bot scrapers. One suggestion from an old WW thread (I've lost track of it) was to add to the most likely entry page a small and invisible link ( display:none ? ) to a dummy page such that when a bot follows the link its IP address is stored and site access is denied. I don't know how to implement this method, and I assume I'd need to make exceptions for legitimate spiders (esp. search engines). Does this sound right? Isn't there another method that shuts down access to any visitor moving too quickly through too many pages at one's site? I'm totally out of my wheelhouse here.

Lastly, what about denying access to any request for a URL that appends a string to a page's actual URL? In checking my logs, I've noticed essentially the following:

actual url:
www.myownsite.com/page.html

URL request (file or folder obviously doesn't exist but the request returns the above URL):
www.myownsite.com/page.html/string

I'm not sure if what's happening there is framing or something else, but in any event there must be a simple way to prevent this. I can't figure it out from the documentation at the Apache site -- more accurately, I don't fully understand and can't apply what's there.
11:09 am on Aug 5, 2014 (gmt 0)

penders (Senior Member) | joined: July 3, 2006 | posts: 3127 | votes: 1


RewriteCond %{HTTP_REFERER} !^http://(www\.)myownsite.com/.*$ [NC]


I assume the reason for the parentheses around "www" is to make the subdomain optional? In which case you need a "?" suffix, i.e. (www\.)?

You could probably lose the NC flag as well? In most legitimate cases this is always going to be lowercase. (?)
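
For what it's worth, a sketch of the referer conditions with both of those changes applied - example.com stands in for your own domain, and the extension list is just illustrative:

RewriteEngine on
# allow blank referers and referers from your own site, with or without www;
# anything else requesting these file types gets a 403
RewriteCond %{HTTP_REFERER} !^$
RewriteCond %{HTTP_REFERER} !^http://(www\.)?example\.com/
RewriteRule \.(gif|jpe?g|png|css|pdf)$ - [F]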

...add to the most likely entry page a small and invisible link ( display:none ? ) to a dummy page such that when a bot follows the link its IP address is stored and site access is denied.


A "honeypot". The important thing here is that the target of the invisible link is blocked by robots.txt, so bots that obey robots.txt do not follow the link and do not get blocked. This does, however, require a certain amount of server-side scripting in your language of choice.

However, even this method can result in false positives.
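
To make that concrete, a rough sketch of the moving parts - the trap URL /bot-trap/ is a made-up name, and the IP-recording itself still needs that bit of server-side scripting:

# robots.txt at the site root tells compliant bots to stay out of the trap:
#   User-agent: *
#   Disallow: /bot-trap/
# The entry page carries an invisible link to /bot-trap/, and the script
# behind that URL records %{REMOTE_ADDR} for review.
# .htaccess then refuses further requests from any address the trap has
# caught (placeholder IP; Apache 2.2 syntax, 2.4 would use "Require not ip"):
Order Allow,Deny
Allow from all
Deny from 203.0.113.45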

URL request (file or folder obviously doesn't exist but the request returns the above URL):
www.myownsite.com/page.html/string


Note that some systems do rely on this behaviour as a "cheap" way of implementing user-friendly URLs (if you don't have access to rewrite the URLs properly). It is controlled by the AcceptPathInfo directive [httpd.apache.org], for example:

AcceptPathInfo Off
12:23 pm on Aug 5, 2014 (gmt 0)

wilderness (Senior Member) | joined: Nov 11, 2001 | posts: 5505 | votes: 5


RewriteRule \.(gif|jpg|jpeg|bmp|zip|rar|mp3|flv|swf|xml|php|png|css|pdf)$ - [F]


It's good practice to only include file types that your site(s) contain. No sense in wasting server time/CPU checking for something that isn't there.
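
For example, a site that only serves images and PDFs might trim the pattern to something like:

RewriteRule \.(gif|jpe?g|png|pdf)$ - [F]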
3:10 pm on Aug 5, 2014 (gmt 0)

not2easy (Administrator, US) | joined: Dec 27, 2006 | posts: 4343 | votes: 292


I installed an old-fashioned honeypot years ago, and it serves to let me know I need to take a look at access logs between scheduled reviews. On non-WordPress sites I call the target file "wp-login.php" to kill two birds with one click - it catches scrapers ignoring robots.txt and hackers trying to log in to a non-existent WP install.

I think two or three times it has caught a residential ISP IP, but on examination the activity was not human, so I left those addresses individually blocked for a few months.

I do not consider this a method to block scrapers, only an alert that there are new chunks of the net to check out. To block them, you need to start keeping records of CIDR ranges of known scraper hosts and block entire CIDRs that can never send you good traffic. It is time-consuming to set up, but afterwards a minimal nuisance to maintain.
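
For anyone who hasn't done it before, a minimal sketch of what such a block list might look like in .htaccess (Apache 2.2 syntax; the ranges below are documentation placeholders, not real scraper hosts):

# entire ranges that can never send good traffic, from your own records
Order Allow,Deny
Allow from all
Deny from 192.0.2.0/24
Deny from 198.51.100.0/24
Deny from 203.0.113.0/24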
6:51 pm on Aug 5, 2014 (gmt 0)

lucy24 (Senior Member, US) | joined: Apr 9, 2011 | posts: 15698 | votes: 810


I assume the reason for the parentheses around "www" is to make the subdomain optional? In which case you need a "?" suffix, i.e. (www\.)?

In fact the www isn't optional. Assuming you've got your domain-name canonicalization redirect in place, the referer will always be either with or without www, depending on your site. You can even slap down a global block on

RewriteCond %{HTTP_REFERER} ^http://example\.com


(no closing anchor) giving the "wrong" form of your sitename.
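
In full rule form, that global block might look something like this (a sketch, assuming www.example.com is the canonical hostname, so a bare example.com referer is the "wrong" form):

# a referer claiming the non-canonical hostname can only be fake
RewriteCond %{HTTP_REFERER} ^http://example\.com
RewriteRule ^ - [F]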

www.myownsite.com/page.html/string


If you don't want to take the AcceptPathInfo route, there are alternatives. Depending on your particular circumstances:

Conditionless lockout:
RewriteRule \.html. - [F]


Forcible redirect:
RewriteCond %{REQUEST_URI} (.+\.html)
RewriteRule \.html. http://www.example.com%1 [R=301,L]

The sole purpose of the Condition is to save your server the work of capturing when the rule won't apply (i.e. at least 99% of all requests). Use this form if you're plagued with legitimate visitors-- up to and including google-- appending spurious guff to the end of your html URLs. If your paths don't contain literal periods (legal but not all that common unless your name is apache dot org), replace .+ in the Condition with [^.]+
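
Put together, the no-literal-periods variant would read something like this (a sketch; note that %{REQUEST_URI} already starts with a slash, so the capture carries it into the redirect target):

# paths contain no period other than the one in ".html", so [^.]+ is enough
RewriteCond %{REQUEST_URI} ([^.]+\.html)
RewriteRule \.html. http://www.example.com%1 [R=301,L]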
8:43 pm on Aug 5, 2014 (gmt 0)

Junior Member (5+ Year Member) | joined: May 17, 2011 | posts: 170 | votes: 0


You could probably lose the NC flag as well? In most legitimate cases this is always going to be lowercase. (?)

But I'm trying to block strings that could be either upper or lower case. Forgive me; there's something here I'm not understanding.

It's good practice to only include file types that your site(s) contain. No sense in wasting server time/CPU checking for something that isn't there.

Point taken.

On non-WordPress sites I call the target file "wp-login.php" to kill two birds with one click - it catches scrapers ignoring robots.txt and hackers trying to log in to a non-existent WP install.

Sweet!

Thanks, all, for the input. I'm working on it.
9:56 pm on Aug 5, 2014 (gmt 0)

lucy24 (Senior Member, US) | joined: Apr 9, 2011 | posts: 15698 | votes: 810


I'm trying to block strings that could be either upper or lower case.

In this case, they couldn't. There is only one canonical form of your sitename, and hence only one form that can occur in legitimate referers. If your site is example.com and the referer says EXAMPLE.COM, it's fake.
10:18 pm on Aug 5, 2014 (gmt 0)

penders (Senior Member) | joined: July 3, 2006 | posts: 3127 | votes: 1


(EDIT: I didn't see lucy24's post before I posted...)

But I'm trying to block strings that could be either upper or lower case. Forgive me; there's something here I'm not understanding.


But you're not blocking, you're allowing "strings that could be either upper or lower case". The regex is negated (the ! prefix), so the request is blocked only when the referer does not match.

With the NC flag, the directive is allowing "yoursite.com", "yOuRsItE.cOm" and "YOURSITE.COM".

Without the NC flag, only "yoursite.com" referers will be able to access the files (with a little less processing); "yOuRsItE.cOm", etc. will be blocked (since it is not "yoursite.com").