
block blank referrer on specific page and by ip

stopping suspicious spidering

   
3:27 pm on Feb 7, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Here's the issue. I have someone constantly spidering my site. They take a few hundred pages, then hit the spider trap and are blocked. Within an hour they are back with a new IP and the same thing happens. The only way I can spot them is that they always come in on the same page and with a blank referrer. So I'm looking at using .htaccess to block anyone who hits this page with a blank referrer from within a certain IP range. I have no idea of the syntax for this, and any pointers would be great.
3:46 pm on Feb 7, 2008 (gmt 0)

WebmasterWorld Senior Member jdmorgan is a WebmasterWorld Top Contributor of All Time 10+ Year Member



The page URL-path goes into the RewriteRule pattern
The IP address range goes into a RewriteCond %{REMOTE_ADDR} pattern
The blank referrer would be RewriteCond %{HTTP_REFERER} ^$

That should get you started.
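Put together, those three pieces might look like the sketch below. The page name and the IP prefix are placeholders, not values from this thread, and the rule assumes it lives in the .htaccess file of the site root with mod_rewrite available.

```apache
RewriteEngine On
# Condition 1: the Referer header is blank
RewriteCond %{HTTP_REFERER} ^$
# Condition 2 (ANDed by default): address starts with 192.0.2. (placeholder range)
RewriteCond %{REMOTE_ADDR} ^192\.0\.2\.
# Placeholder page name; in .htaccess the pattern has no leading slash
RewriteRule ^entry-page\.html$ - [F]
```

With no [OR] flag, both conditions must be true before the rule returns 403-Forbidden.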

Jim

4:23 pm on Feb 7, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I'm struggling a bit with this. I thought the URL would be a condition?

For example, if the URL was the index page, I was thinking something like:

RewriteCond %{THE_REQUEST} !^http://www.domain.com/ [OR]
RewriteCond %{HTTP_USER_AGENT} ^-?$ [OR]
RewriteCond %{REMOTE_ADDR} ^x.x.x.x/x
RewriteRule .* - [F]

but I really struggle with this stuff.. :-(

4:48 pm on Feb 7, 2008 (gmt 0)

WebmasterWorld Senior Member jdmorgan is a WebmasterWorld Top Contributor of All Time 10+ Year Member



No, you can put it in the RewriteRule pattern, and should do so for efficiency's sake. You said they always access the same initial page, so using ".*" is inappropriate. I'd suggest:

# If user-agent is blank or "-"
RewriteCond %{HTTP_USER_AGENT} ^-?$
# and if remote address is 123.45.67.0 through 123.45.67.127
RewriteCond %{REMOTE_ADDR} ^123\.45\.67\.([1-9]?[0-9]|1[01][0-9]|12[0-7])$
RewriteRule ^always_the_same_page\.html$ - [F]

I don't know why you added the RewriteCond testing "THE_REQUEST".
You want to AND the conditions, so do not use the [OR] flag.
You cannot use CIDR notation in a RewriteCond; the RewriteCond examines a text string pattern, not a numerical value. I give an example of a fairly ugly pattern to detect 123.45.67.0-127 in the code snippet above.

Make sure the "|" characters in that pattern are solid pipes before use; posting on this forum changes pipe characters into broken pipes ("¦").
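For readers on Apache 2.4 or later (an assumption; it postdates this thread), the same range can be written as a CIDR block through the expression syntax rather than a regex pattern. 123.45.67.0 through 123.45.67.127 is 123.45.67.0/25, so a sketch would be:

```apache
# Apache 2.4+ only: "expr" makes RewriteCond evaluate an ap_expr expression
RewriteCond %{HTTP_USER_AGENT} ^-?$
# -R tests the client address against a CIDR block (123.45.67.0 - 123.45.67.127)
RewriteCond expr "-R '123.45.67.0/25'"
RewriteRule ^always_the_same_page\.html$ - [F]
```

On older servers the text-pattern version above remains the way to do it.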

Put a pattern matching the URL-path for the initially-requested page in the RewriteRule as shown.
You could also change the Rule so that it directly calls your bad-bot script instead of issuing a single 403-Forbidden response, for example:

 RewriteRule ^always_the_same_page\.html$ /path_to_bad-bot.pl [L] 

If the bad-bot script isn't being triggered soon enough, then consider whether your embedded 'trap' URLs are numerous enough, well-placed high in the source code and throughout, vary enough in URL appearance, and "look interesting enough" to a harvester. There are quite a few ways to embed invisible trap links, and you should use several, with all Disallowed in robots.txt, and all rewritten to your bad-bot script.
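As a rough sketch of routing several trap links to the script, one rule can carry them all. The three URL-paths here are made-up examples, and each would also need its own Disallow line in robots.txt:

```apache
RewriteEngine On
# Hypothetical trap URL-paths, varied in appearance; all end up at the bad-bot script
RewriteRule ^(old-archive/index\.html|members/list\.html|data/export\.html)$ /path_to_bad-bot.pl [L]
```

Well-behaved robots that honor robots.txt never request these paths, so only harvesters that ignore it reach the script.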

Jim

4:58 pm on Feb 7, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Thanks for that. What seems to happen is that they start on the index page, then request a bunch of URLs they have already spidered until one happens to be the bot script. They don't actually follow the link to it from an embedded page.