
Apache Web Server Forum

block blank referrer on specific page and by ip
stopping suspicious spidering

 3:27 pm on Feb 7, 2008 (gmt 0)

Here's the issue. I have someone constantly spidering my site. They take a few hundred pages, then hit the spider trap and are blocked. Within an hour they are back with a new IP and the same thing happens. The only way I can spot them is that they always come in on the same page and with a blank referrer. So I'm looking at using .htaccess to block anyone hitting this page with a blank referrer and from within a certain IP range. I have no idea of the syntax for this, and any pointers would be great.



 3:46 pm on Feb 7, 2008 (gmt 0)

The page URL-path goes into the RewriteRule pattern
The IP address range goes into a RewriteCond %{REMOTE_ADDR} pattern
The blank referrer would be RewriteCond %{HTTP_REFERER} ^$

That should get you started.
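
As a rough sketch of how those three pieces fit together, assuming the rules live in a .htaccess file in the site root and mod_rewrite is enabled. The page name "entry-page.html" and the address prefix "192.0.2." are placeholders, not values from this thread:

```apache
RewriteEngine On
# Condition 1: the Referer header is empty or absent
RewriteCond %{HTTP_REFERER} ^$
# Condition 2: the client address starts with the placeholder prefix 192.0.2.
RewriteCond %{REMOTE_ADDR} ^192\.0\.2\.
# Consecutive RewriteCond lines are ANDed by default.
# In .htaccess the URL-path is matched without its leading slash.
RewriteRule ^entry-page\.html$ - [F]
```

The "-" substitution leaves the URL unchanged, and [F] returns a 403 Forbidden response when both conditions match.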



 4:23 pm on Feb 7, 2008 (gmt 0)

I'm struggling a bit with this. I thought the URL would be a condition?

For example, if the URL was the index page, I was thinking of something like:

RewriteCond %{THE_REQUEST} !^http://www.domain.com/ [OR]
RewriteCond %{HTTP_USER_AGENT} ^-?$ [OR]
RewriteCond %{REMOTE_ADDR} ^x.x.x.x/x
RewriteRule .* - [F]

But I really struggle with this stuff. :-(


 4:48 pm on Feb 7, 2008 (gmt 0)

No, you can put it in the RewriteRule pattern, and should do so for efficiency's sake. You said they always access the same initial page, so using ".*" is inappropriate. I'd suggest:

# If user-agent is blank or "-"
RewriteCond %{HTTP_USER_AGENT} ^-?$
# and if remote address is 123.45.67.0 through 123.45.67.127
RewriteCond %{REMOTE_ADDR} ^123\.45\.67\.([1-9]?[0-9]¦1[01][0-9]¦12[0-7])$
RewriteRule ^always_the_same_page\.html$ - [F]

I don't know why you added the RewriteCond testing "THE_REQUEST".
You want to AND the conditions, so do not use the [OR] flag.
You cannot use CIDR notation in a RewriteCond; the RewriteCond examines a text string pattern, not a numerical value. The code snippet above shows a fairly ugly pattern for detecting an address range.

Replace the broken pipe "¦" characters in that pattern with solid pipes before use; posting on this forum modifies the pipe characters.
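
With solid pipes in place, the rule set would look roughly like the sketch below. The pipe positions are a reconstruction, on the assumption that the pattern is meant to match a last octet of 0 through 127; the address 123.45.67.x is only the example value used above, so adjust it to your actual range:

```apache
# If user-agent is blank or "-"
RewriteCond %{HTTP_USER_AGENT} ^-?$
# and if the last octet falls in one of three alternatives:
#   [1-9]?[0-9]  matches 0 through 99
#   1[01][0-9]   matches 100 through 119
#   12[0-7]      matches 120 through 127
RewriteCond %{REMOTE_ADDR} ^123\.45\.67\.([1-9]?[0-9]|1[01][0-9]|12[0-7])$
RewriteRule ^always_the_same_page\.html$ - [F]
```

To test the blank referrer instead of the user-agent, the first condition could be swapped for RewriteCond %{HTTP_REFERER} ^$ as mentioned earlier in the thread.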

Put a pattern matching the URL-path for the initially-requested page in the RewriteRule as shown.
You could also change the Rule so that it directly calls your bad-bot script instead of issuing a single 403-Forbidden response, for example:
RewriteRule ^always_the_same_page\.html$ /path_to_bad-bot.pl [L]

If the bad-bot script isn't being triggered soon enough, then consider whether your embedded 'trap' URLs are numerous enough, well-placed high in the source code and throughout, vary enough in URL appearance, and "look interesting enough" to a harvester. There are quite a few ways to embed invisible trap links, and you should use several, with all Disallowed in robots.txt, and all rewritten to your bad-bot script.
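
One possible way to send several trap URLs to the script with a single rule is sketched below. The three trap names are made up for illustration, and each would also need its own Disallow line in robots.txt so that well-behaved robots stay away from them:

```apache
# Hypothetical trap URL-paths, relative to the directory holding this .htaccess.
# Any request for one of them is handed to the bad-bot script.
RewriteRule ^(guestbook-archive|members/list|contact-emails)\.html$ /path_to_bad-bot.pl [L]
```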



 4:58 pm on Feb 7, 2008 (gmt 0)

Thanks for that. What seems to happen is that they start on the index page, then request a bunch of URLs they have already spidered until one happens to be the bot script; they don't actually follow the link to it from an embedded page.

