Sometimes there are whole squads of bots concurrently munching. This is not a problem per se; it has more to do with the specific types of links that they munch. For instance, I've opened up slideshow functionality. This isn't much good to a bot and, to be fair, the bots don't actually use the slideshow URIs for the purpose intended. However, slideshow links shouldn't be catalogued.
I hope it's reasonably obvious what it is that I am trying to code but, briefly, the code needs to specifically and accurately identify the family of bots and block a number of inappropriate generic PHP calls...
# FORBID all Googlebots slideshow
RewriteCond %{HTTP_USER_AGENT} "Googlebot/2.1"
RewriteCond %{REMOTE_ADDR} ^64\.68\.(6[4-9]|[7-8][0-9]|9[0-5])\.
RewriteCond %{REQUEST_URI} (/slideshow.php)
RewriteRule .* - [F]
# Googlebots stuff
RewriteCond %{HTTP_USER_AGENT} "Googlebot/2.1"
RewriteCond %{REMOTE_ADDR} ^64\.68\.(6[4-9]|[7-8][0-9]|9[0-5])\.
######RewriteCond %{REQUEST_URI} (&?PHPSESSID=) [NC]
RewriteCond %{REQUEST_URI} (/search.php|/view_photo.php|/view_photo.php?id=|/view_photo_properties.php?set_albumName=|/do_command.php?set_fullOnly=|/&)
#RewriteCond %{REQUEST_URI} (/view_photo.php?id=) [OR]
#RewriteCond %{REQUEST_URI} (/view_photo_properties.php?set_albumName=) [OR]
#RewriteCond %{REQUEST_URI} (/do_command.php?set_fullOnly=) [OR]
#RewriteCond %{REQUEST_URI} (/&)
RewriteRule .* - [G]
# FORBID all MSN bot slideshows
RewriteCond %{HTTP_USER_AGENT} msnbot
RewriteCond %{REMOTE_ADDR} ^65\.5[2-5]\.
RewriteCond %{REQUEST_URI} (/slideshow.php)
RewriteRule .* - [F]
# MSN bot stuff
RewriteCond %{HTTP_USER_AGENT} msnbot
RewriteCond %{REMOTE_ADDR} ^65\.5[2-5]\.
RewriteCond %{REQUEST_URI} (/search.php|/view_photo.php|/view_photo.php?id=|/view_photo_properties.php?set_albumName=|/do_command.php?set_fullOnly=|/&)
#RewriteCond %{REQUEST_URI} (/view_photo.php?id=) [OR]
#RewriteCond %{REQUEST_URI} (/view_photo_properties.php?set_albumName=) [OR]
#RewriteCond %{REQUEST_URI} (/do_command.php?set_fullOnly=) [OR]
#RewriteCond %{REQUEST_URI} (/&)
RewriteRule .* - [G]
The individual lines of the [OR] don't seem to work so I had a go at trying for logical OR functionality in a single line using piping...(?)
The separate clause forbidding slideshows does work and it's the only thing I have been able to verify as being properly operational;~/
Would it be possible for you to assist me to implement an ergonomic and reliable solution please?
best wishes, Robert
#
# FORBID all slideshows for google and msn
#
RewriteCond %{HTTP_USER_AGENT}<>%{REMOTE_ADDR} googlebot.*<>64\.68\.(6[4-9]|[7-8][0-9]|9[0-5])\. [NC,OR]
RewriteCond %{HTTP_USER_AGENT}<>%{REMOTE_ADDR} msnbot.*<>65\.5[2-5]\. [NC]
RewriteRule ^/slideshow.php - [NC,F]
#
# Send back 410 Gone for google and msn for some pages
#
RewriteCond %{HTTP_USER_AGENT}<>%{REMOTE_ADDR} googlebot.*<>64\.68\.(6[4-9]|[7-8][0-9]|9[0-5])\. [NC,OR]
RewriteCond %{HTTP_USER_AGENT}<>%{REMOTE_ADDR} msnbot.*<>65\.5[2-5]\. [NC]
RewriteRule ^/(search\.php|view_photo\.php|view_photo_properties\.php\?set_albumName=|do_command\.php\?set_fullOnly=) - [NC,G]
First, you might consider using robots.txt to *ask* Googlebot and msnbot to stay out of those subdirectories, instead of forbidding them. This is equivalent to posting a "Keep Out" sign on the door, instead of punching anyone who comes through the door without warning.
Secondly, I don't think you'll be able to test query strings using RewriteCond %{REQUEST_URI} or RewriteRule directly. Instead, use "RewriteCond %{QUERY_STRING} set_albumName=" to test for that query string.
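As a rough sketch of that second point (not a tested drop-in), the path and the query string can be checked separately. This assumes the rules live in a .htaccess file in the gallery's top directory, where the RewriteRule pattern sees the path without a leading slash; in httpd.conf the pattern would need the leading slash. The address ranges are simply the ones quoted earlier in this thread and may not cover every address the bots use.

```apache
# Sketch only: assumes per-directory (.htaccess) context.
RewriteEngine On

# Match either bot by User-Agent plus source address range
# (ranges copied from the posts above; treat them as assumptions).
RewriteCond %{HTTP_USER_AGENT}<>%{REMOTE_ADDR} googlebot.*<>64\.68\.(6[4-9]|[78][0-9]|9[0-5])\. [NC,OR]
RewriteCond %{HTTP_USER_AGENT}<>%{REMOTE_ADDR} msnbot.*<>65\.5[2-5]\. [NC]
# The query string is not part of what RewriteRule matches,
# so it gets its own condition.
RewriteCond %{QUERY_STRING} (^|&)set_albumName= [NC]
# Return 410 Gone for the matching script.
RewriteRule ^view_photo_properties\.php$ - [G]
```

The two bot conditions are joined with [OR], and that pair is then ANDed with the query-string condition. The same pattern could be repeated for do_command.php with set_fullOnly=.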
Jim
Jim, I hear you:-) I'm not forbidding those bots, just using the GONE flag when the bots vector inappropriate links. The slideshow vector uses an awful lot of CPU cycles and, in any case, the output must be somewhat unusable to the bots. [later] I've disabled the slideshow vector altogether at the front entrance level - too much trouble for too little benefit. Slideshows remain available on lower subdirectories.
Unfortunately the gallery's software topology makes it impossible to ask the nice bots to do what you suggest (i.e. subdirectories et al).
• First, my robots.txt files have been roundly ineffective at holding bad bots at bay... Nowadays I just monitor the robots.txt file for new activity and monitor the antics of any new names to decide whether to give them total freedom of access or site-wide blocking. A small PHP utility also keeps a beady eye on the 'bandwidth' use of (any) visitors... If it's a bot that triggers the bandwidth monitor... well, it doesn't do it again;~)
• Secondly, I -am- trying out your string code but still having difficulties... sometimes all the filtration works and sometimes, seemingly, it doesn't... baffling... I'm trying to diagnose some more in order to come back with more data;~/
best wishes, Robert