block blank referrer on specific page and by ip

stopping suspicious spidering

     

soapystar

3:27 pm on Feb 7, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Here's the issue. I have someone constantly spidering my site. They take a few hundred pages, then hit the spider trap and are blocked. Within an hour they are back with a new IP and the same thing happens. The only way I can spot them is that they always come in on the same page and with a blank referrer. Therefore I'm looking at using htaccess to block anyone hitting this page with a blank referrer and within a certain IP range. I have no idea of the syntax for this and any pointers would be great.

jdMorgan

3:46 pm on Feb 7, 2008 (gmt 0)

WebmasterWorld Senior Member jdmorgan is a WebmasterWorld Top Contributor of All Time 10+ Year Member



The page URL-path goes into the RewriteRule pattern
The IP address range goes into a RewriteCond %{REMOTE_ADDR} pattern
The blank referrer would be RewriteCond %{HTTP_REFERER} ^$

That should get you started.
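
As a rough sketch of how those three pieces might fit together in .htaccess (the page name entry-page.html and the 192.0.2.x address prefix are placeholders, not values from this thread, and mod_rewrite is assumed to be enabled):

```apache
RewriteEngine On
# Referer header is blank or missing
RewriteCond %{HTTP_REFERER} ^$
# and remote address begins with 192.0.2. (placeholder range)
RewriteCond %{REMOTE_ADDR} ^192\.0\.2\.
# and the requested URL-path is the entry page: return 403 Forbidden
RewriteRule ^entry-page\.html$ - [F]
```

Consecutive RewriteConds are ANDed by default, so all three tests must match before the request is refused.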

Jim

soapystar

4:23 pm on Feb 7, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I'm struggling a bit with this. I thought the URL would be a condition?

For example, if the URL was the index page, I was thinking something like:

RewriteCond %{THE_REQUEST} !^http://www.domain.com/ [OR]
RewriteCond %{HTTP_USER_AGENT} ^-?$ [OR]
RewriteCond %{REMOTE_ADDR} ^x.x.x.x/x
RewriteRule .* - [F]

but I really struggle with this stuff.. :-(

jdMorgan

4:48 pm on Feb 7, 2008 (gmt 0)

WebmasterWorld Senior Member jdmorgan is a WebmasterWorld Top Contributor of All Time 10+ Year Member



No, you can put it in the RewriteRule pattern, and should do so for efficiency's sake. You said they always access the same initial page, so using ".*" is inappropriate. I'd suggest:

# If user-agent is blank or "-"
RewriteCond %{HTTP_USER_AGENT} ^-?$
# and if remote address is 123.45.67.0 through 123.45.67.127
RewriteCond %{REMOTE_ADDR} ^123\.45\.67\.([1-9]?[0-9]¦1[01][0-9]¦12[0-7])$
RewriteRule ^always_the_same_page\.html$ - [F]

I don't know why you added the RewriteCond testing "THE_REQUEST".
You want to AND the conditions, so do not use the [OR] flag.
You cannot use CIDR notation in a RewriteCond; the RewriteCond examines a text string pattern, not a numerical value. The code snippet above gives an example of a fairly ugly pattern to detect 123.45.67.0-127.

Replace the broken pipe "¦" characters in that pattern with solid pipes "|" before use; posting on this forum modifies the pipe characters.
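
For reference, with solid pipes in place the ruleset might look like the sketch below. It is the same rule as in the snippet above, nothing new; the 123.45.67.x range and the page name are still only example values:

```apache
# If user-agent is blank or "-"
RewriteCond %{HTTP_USER_AGENT} ^-?$
# and remote address is 123.45.67.0 through 123.45.67.127:
#   [1-9]?[0-9]  matches 0-99
#   1[01][0-9]   matches 100-119
#   12[0-7]      matches 120-127
RewriteCond %{REMOTE_ADDR} ^123\.45\.67\.([1-9]?[0-9]|1[01][0-9]|12[0-7])$
RewriteRule ^always_the_same_page\.html$ - [F]
```

To test for a blank referrer instead of a blank user-agent, the first condition would become RewriteCond %{HTTP_REFERER} ^$.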

Put a pattern matching the URL-path for the initially-requested page in the RewriteRule as shown.
You could also change the Rule so that it directly calls your bad-bot script instead of issuing a single 403-Forbidden response, for example:

 RewriteRule ^always_the_same_page\.html$ /path_to_bad-bot.pl [L] 

If the bad-bot script isn't being triggered soon enough, then consider whether your embedded 'trap' URLs are numerous enough, well-placed high in the source code and throughout, vary enough in URL appearance, and "look interesting enough" to a harvester. There are quite a few ways to embed invisible trap links, and you should use several, with all Disallowed in robots.txt, and all rewritten to your bad-bot script.
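
A sketch of what rewriting several traps to the same script could look like. The three trap paths here are made up for illustration, and /path_to_bad-bot.pl is the same placeholder used above:

```apache
# Hypothetical trap URLs; each should also be Disallowed in robots.txt
RewriteRule ^(old-members\.html|data/export\.cgi|img/spacer-index\.html)$ /path_to_bad-bot.pl [L]
```

Robots that honor robots.txt should never request those paths, so in principle only clients ignoring the Disallow lines reach the script.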

Jim

soapystar

4:58 pm on Feb 7, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Thanks for that. What seems to happen is that they start on the index page, then call a bunch of URLs they have already spidered until one happens to be the bot script; they don't actually follow the link to it from an embedded page.