blocking specific URIs from specific searchbots

I can't seem to get the logical OR working;~/

         

icpix

9:26 am on May 23, 2004 (gmt 0)

10+ Year Member



I am a photographer and run an image gallery (specifically with the Menalto site's 'Gallery' PHP software) on my own Apache server (static IP). My server is not particularly fancy and so I am obliged to attempt to maximise its CPU cycle efficiency.

Sometimes there are whole squads of bots concurrently munching. This is not a problem per se, it's more to do with the specific types of links that they munch. For instance I've opened up slideshow functionality. This isn't much good to a bot and, to be fair, the bots don't actually use the slideshow URIs for the purpose intended. However slideshow links shouldn't be catalogued.

I hope it's reasonably obvious what it is that I am trying to code but, briefly, the code needs to specifically and accurately identify the family of bots and block a number of inappropriate generic PHP calls...

# FORBID all Googlebots slideshow
RewriteCond %{HTTP_USER_AGENT} "Googlebot/2.1"
RewriteCond %{REMOTE_ADDR} ^64\.68\.(6[4-9]|[7-8][0-9]|9[0-5])\.
RewriteCond %{REQUEST_URI} (/slideshow.php)
RewriteRule .* - [F]

# Googlebots stuff
RewriteCond %{HTTP_USER_AGENT} "Googlebot/2.1"
RewriteCond %{REMOTE_ADDR} ^64\.68\.(6[4-9]|[7-8][0-9]|9[0-5])\.
######RewriteCond %{REQUEST_URI} (&?PHPSESSID=) [NC]
RewriteCond %{REQUEST_URI} (/search.php|/view_photo.php|/view_photo.php?id=|/view_photo_properties.php?set_albumName=|/do_command.php?set_fullOnly=|/&)
#RewriteCond %{REQUEST_URI} (/view_photo.php?id=) [OR]
#RewriteCond %{REQUEST_URI} (/view_photo_properties.php?set_albumName=) [OR]
#RewriteCond %{REQUEST_URI} (/do_command.php?set_fullOnly=) [OR]
#RewriteCond %{REQUEST_URI} (/&)
RewriteRule .* - [G]

# FORBID all MSN bot slideshows
RewriteCond %{HTTP_USER_AGENT} msnbot
RewriteCond %{REMOTE_ADDR} ^65\.5[2-5]\.
RewriteCond %{REQUEST_URI} (/slideshow.php)
RewriteRule .* - [F]

# MSN bot stuff
RewriteCond %{HTTP_USER_AGENT} msnbot
RewriteCond %{REMOTE_ADDR} ^65\.5[2-5]\.
RewriteCond %{REQUEST_URI} (/search.php|/view_photo.php|/view_photo.php?id=|/view_photo_properties.php?set_albumName=|/do_command.php?set_fullOnly=|/&)
#RewriteCond %{REQUEST_URI} (/view_photo.php?id=) [OR]
#RewriteCond %{REQUEST_URI} (/view_photo_properties.php?set_albumName=) [OR]
#RewriteCond %{REQUEST_URI} (/do_command.php?set_fullOnly=) [OR]
#RewriteCond %{REQUEST_URI} (/&)
RewriteRule .* - [G]

The individual lines of the [OR] don't seem to work so I had a go at trying for logical OR functionality in a single line using piping...(?)

The separate clause forbidding slideshows does work and it's the only thing I have been able to verify as being properly operational;~/

Would it be possible for you to assist me to implement an ergonomic and reliable solution please?

best wishes, Robert

gergoe

11:03 am on May 23, 2004 (gmt 0)

10+ Year Member



You've made one mistake consistently throughout your rules: regex metacharacters such as the dot and the question mark must be escaped when you want them matched literally. Additionally, the whole thing can be simplified and condensed. Here's another version you can try:

#
# FORBID all slideshow for google and msn
#
RewriteCond %{HTTP_USER_AGENT}<>%{REMOTE_ADDR} googlebot.*<>64\.68\.(6[4-9]|[7-8][0-9]|9[0-5])\. [NC,OR]
RewriteCond %{HTTP_USER_AGENT}<>%{REMOTE_ADDR} msnbot.*<>65\.5[2-5]\. [NC]
RewriteRule ^/slideshow\.php - [NC,F]
#
# Send back 410 Gone for google and msn for some pages
#
RewriteCond %{HTTP_USER_AGENT}<>%{REMOTE_ADDR} googlebot.*<>64\.68\.(6[4-9]|[7-8][0-9]|9[0-5])\. [NC,OR]
RewriteCond %{HTTP_USER_AGENT}<>%{REMOTE_ADDR} msnbot.*<>65\.5[2-5]\. [NC]
RewriteRule ^/(search\.php|view_photo\.php|view_photo_properties\.php\?set_albumName=|do_command\.php\?set_fullOnly=) - [NC,G]

This should do the same as your rules, but with a slightly different approach. Additionally, if you are going to place this in an .htaccess file, it needs some adjustments, because the leading slash is stripped off the URL before the pattern is matched.
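
As a rough illustration of that adjustment, here is how the slideshow rule might look in a per-directory context. This is only a sketch: it assumes the .htaccess file sits in the same directory as slideshow.php, that mod_rewrite is loaded, and that AllowOverride permits these directives there. None of those details are confirmed in this thread.

```apache
# Hypothetical .htaccess placed in the directory that contains slideshow.php
RewriteEngine On

# User-agent and remote address are joined with "<>" so both must match one bot
RewriteCond %{HTTP_USER_AGENT}<>%{REMOTE_ADDR} googlebot.*<>64\.68\.(6[4-9]|[7-8][0-9]|9[0-5])\. [NC,OR]
RewriteCond %{HTTP_USER_AGENT}<>%{REMOTE_ADDR} msnbot.*<>65\.5[2-5]\. [NC]

# No leading slash: in .htaccess the directory prefix has already been removed
RewriteRule ^slideshow\.php$ - [NC,F]
```

If the gallery lives in a subdirectory below the .htaccess file, the pattern would need that path in front of the file name (for example a hypothetical ^gallery/slideshow\.php$).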

jdMorgan

7:56 pm on May 23, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



... A couple of comments:

First, you might consider using robots.txt to *ask* Googlebot and msnbot to stay out of those subdirectories, instead of forbidding them. This is equivalent to posting a "Keep Out" sign on the door, instead of punching anyone who comes through the door without warning.

Secondly, I don't think you'll be able to test query strings using RewriteCond %{REQUEST_URI} or RewriteRule directly. Instead, use "RewriteCond %{QUERY_STRING} set_albumName=" to test for that query string.
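
A minimal sketch of that idea, reusing the msnbot conditions from the original post. It assumes server or virtual-host context (hence the slash in the pattern), that RewriteEngine is already on, and that the parameter appears either first in the query string or after an ampersand. Adjust the file name to match your install.

```apache
# Return 410 Gone to msnbot for view_photo_properties.php?set_albumName=...
RewriteCond %{HTTP_USER_AGENT} msnbot [NC]
RewriteCond %{REMOTE_ADDR} ^65\.5[2-5]\.
# The parameter is tested against the query string, not the URL-path
RewriteCond %{QUERY_STRING} (^|&)set_albumName=
# Unanchored at the start so the script matches in any directory
RewriteRule /view_photo_properties\.php$ - [G]
```

Consecutive RewriteCond lines without [OR] are ANDed, so the rule fires only when the bot checks and the query-string check all succeed.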

Jim

icpix

2:30 pm on May 24, 2004 (gmt 0)

10+ Year Member



Gergoe, looks promising, my thanks. Your code now being tested live.

Jim, I hear you:-) I'm not forbidding those bots, just using the GONE flag when the bots vector inappropriate links. The slideshow vector uses an awful lot of CPU cycles and, in any case, the output must be somewhat unusable to the bots. [later] I've disabled the slideshow vector altogether at the front entrance level - too much trouble for too little benefit. Slideshows remain available on lower subdirectories.

Unfortunately the gallery's software topology makes it impossible to ask the nice bots to do what you suggest (i.e. subdirectories et al.)

• First, my robots.txt files have been roundly ineffective at holding bad bots at bay... Nowadays I just monitor the robots.txt file for new activity and monitor the antics of any new names to decide whether to give them total freedom of access or site-wide blocking. A small PHP utility also keeps a beady eye on the 'bandwidth' use of (any) visitors... If it's a bot that triggers the bandwidth monitor... well, it doesn't do it again;~)

• Secondly, I -am- trying out your string code but still having difficulties... sometimes all the filtration works and sometimes, seemingly, it doesn't... baffling... I'm trying to diagnose some more in order to come back with more data;~/

best wishes, Robert