Forum Moderators: phranque

Using .htaccess to prevent external scripts from accessing site?

Linda_A

5:07 pm on Oct 5, 2005 (gmt 0)

10+ Year Member



Lately I have noticed hits in my referral log identified only with 'libwww-perl/5.803'. From a brief googling of this, it seems some external script may be accessing my site?

If so, can I stop this via .htaccess _without_ also stopping my own, perl-based search engine from indexing my site?

jdMorgan

8:03 pm on Oct 5, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Using a directive to deny access to that particular user-agent, and adding an exception for your own IP address or machine name seems to be the way to go.

Look at your raw access logs (stats are pretty useless for this kind of project) and get the exact user-agent name that is scraping your site, the IP address(es) it's coming from, and any other info that may be useful in blocking it in a precise manner.
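For example, if your server writes Apache "combined"-format logs, a couple of quick command-line passes will pull out the user-agent names and the IP addresses behind them. This is only a sketch; the sample log file and its contents below are placeholders, so substitute the path to your own raw access log:

```shell
# Create a two-line sample in Apache "combined" log format (placeholder data,
# just so the commands below have something to chew on):
cat > /tmp/access_sample.log <<'EOF'
203.0.113.7 - - [05/Oct/2005:17:07:00 +0000] "GET / HTTP/1.1" 200 1234 "-" "libwww-perl/5.803"
198.51.100.2 - - [05/Oct/2005:17:08:00 +0000] "GET / HTTP/1.1" 200 2345 "-" "Mozilla/5.0"
EOF

# Count requests per user-agent string. Splitting each line on double quotes
# puts the user-agent in field 6 of a combined-format log:
awk -F'"' '{print $6}' /tmp/access_sample.log | sort | uniq -c | sort -rn

# List the unique IP addresses that sent a particular user-agent:
grep 'libwww-perl' /tmp/access_sample.log | awk '{print $1}' | sort -u
```

Run against your real log, the second command gives you the exact IP list to consider blocking alongside the user-agent name.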

Blocking by IP and user-agent name is covered extensively (some say "excessively") in the long-running four-part A Close to perfect .htaccess ban list [webmasterworld.com] threads.

Either mod_rewrite or a combination of mod_access and mod_setenvif can be used to solve this problem.
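To sketch the mod_setenvif/mod_access approach (the mod_rewrite version is shown later in this thread): mod_setenvif tags the matching requests with an environment variable, and mod_access denies them while letting your own address through. The IP address below is a placeholder; substitute your server's actual address:

```apache
# Tag any request whose User-Agent contains "libwww-perl":
SetEnvIfNoCase User-Agent "libwww-perl" bad_bot

# With "Order Deny,Allow", Allow overrides Deny, so your own machine
# (placeholder IP 192.0.2.10) gets in even if it sends that user-agent:
Order Deny,Allow
Deny from env=bad_bot
Allow from 192.0.2.10
```

Note the Order matters: with Deny,Allow the default is to allow, denials are applied first, and any matching Allow line wins, which is exactly what an "block this agent except from my own IP" policy needs.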

Jim

Linda_A

8:21 pm on Oct 5, 2005 (gmt 0)

10+ Year Member



The exact user-agent seems to be libwww-perl/5.803 judging by my raw access logs.

I think I can see how to block the specific agent, but not how to make an exception for the IP of the site itself.

jdMorgan

10:23 pm on Oct 5, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



That depends on how you coded the user-agent block -- whether you're using mod_rewrite or mod_access.

Please post your best-effort code, and we can discuss it. A review of our charter will explain this request.

Jim

Linda_A

6:00 pm on Oct 7, 2005 (gmt 0)

10+ Year Member



This is what I've got for the basic rewrite:

RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} libwww-perl/5\.803
RewriteRule ^ - [F]

Not sure precisely what each part does, though, or how to fit in the exception for my own site's IP.

Btw, will this conflict with my other rewrites, or does each rewrite rule finish its own section?

jdMorgan

9:19 pm on Oct 7, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Put your server's IP address into this added RewriteCond, with the literal periods escaped as shown:

RewriteEngine on
RewriteCond %{REMOTE_ADDR} !^192\.168\.0\.1$
RewriteCond %{HTTP_USER_AGENT} libwww-perl/5\.803
RewriteRule .* - [F]

The order of execution determines whether this rule will apply to all accesses. When using the [F], [G], [L], or [N] flags, the rewrite engine will stop processing for this pass through the mod_rewrite code if the rule matches and is invoked. However, if the code is in .htaccess, it may be re-processed after any URL is rewritten. This is because, after a URL is rewritten, it must be tested for newly-applicable access restrictions, so mod_rewrite is re-invoked. Therefore, it is best to explicitly prevent rewrite loops by making the rule specific or by adding exceptions to the rule.
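For the same reason, it is safest to place this ban block before your other rules, so a hostile request is refused before any URL is rewritten. A sketch of the ordering (the 301 rule at the bottom is only an illustrative placeholder for whatever rewrites you already have):

```apache
RewriteEngine On

# Ban block first: refuse libwww-perl requests, except from your own
# server's IP (192.168.0.1 here is the placeholder from above):
RewriteCond %{REMOTE_ADDR} !^192\.168\.0\.1$
RewriteCond %{HTTP_USER_AGENT} libwww-perl/5\.803
RewriteRule .* - [F]

# ... your existing rewrites follow, e.g.:
RewriteRule ^old-page\.html$ /new-page.html [R=301,L]
```

Because the [F] rule rewrites nothing (its substitution is "-"), it cannot itself trigger the re-processing loop described above.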

Jim