


Coding for generating cleanly escaped .htaccess files

     

incrediBILL

10:54 pm on Sep 4, 2013 (gmt 0)




Don't know if any of you write code to generate your .htaccess files, but I do, and it saves a ton of time when converting massive user agent lists and the like. I use PHP's preg_quote() to escape the strings automatically.

The only gotcha I've found so far is that it doesn't escape spaces, so I pass " " as the second argument to add the space to the list of escaped characters. The special characters it escapes by default are: . \ + * ? [ ^ ] $ ( ) { } = ! < > | : -

If that's not a complete list for Apache, including the space, let me know!

This sample code:
$s = "Bad Bot 1.0";
echo "RewriteCond %{HTTP_USER_AGENT} " . preg_quote($s, " ") . " [NC,OR]";


Outputs this .htaccess line:
RewriteCond %{HTTP_USER_AGENT} Bad\ Bot\ 1\.0 [NC,OR]


It's easy to write a quick routine to process an array, a posted form, or a file full of user agents, and because the escaping is automatic, there are no more 500 errors from a stray unescaped character.

Sample code to process an array of user agents:


$arr = array("bad bot 1.0","googlebot","bingbot");

// Escape each non-empty user agent, spaces included.
$conds = array();
foreach ($arr as $ua)
{
    $ua = trim($ua);
    if ($ua !== "")
    {
        $conds[] = "RewriteCond %{HTTP_USER_AGENT} " . preg_quote($ua, " ");
    }
}

$ht_output = "RewriteEngine on\n";
if ($conds)
{
    // Every condition gets [NC,OR] except the last, which gets plain [NC].
    $ht_output .= implode(" [NC,OR]\n", $conds) . " [NC]\n";
}
echo $ht_output;


The output should be:
RewriteEngine on
RewriteCond %{HTTP_USER_AGENT} bad\ bot\ 1\.0 [NC,OR]
RewriteCond %{HTTP_USER_AGENT} googlebot [NC,OR]
RewriteCond %{HTTP_USER_AGENT} bingbot [NC]
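
For the "file full of user agents" case, the same loop can be wrapped in a function. This is only a minimal sketch, assuming one user agent per line. The function name build_ua_block(), the file name bad-agents.txt, and the closing "RewriteRule ^ - [F]" line (which answers 403 Forbidden when any condition matches) are illustrative additions, not part of the code above.

```php
<?php
// Hypothetical helper: turn a list of raw user agent strings into a
// mod_rewrite block. Blank entries and exact duplicates are skipped.
function build_ua_block(array $agents)
{
    $conds = array();
    foreach (array_unique(array_map('trim', $agents)) as $ua) {
        if ($ua !== '') {
            // preg_quote() escapes regex metacharacters; passing " " as the
            // second argument makes it escape spaces as well.
            $conds[] = 'RewriteCond %{HTTP_USER_AGENT} ' . preg_quote($ua, ' ');
        }
    }
    if (!$conds) {
        return ''; // nothing to block, so emit no dangling flags
    }
    // [NC,OR] follows every condition except the last, which gets plain [NC].
    return "RewriteEngine on\n"
        . implode(" [NC,OR]\n", $conds) . " [NC]\n"
        . "RewriteRule ^ - [F]\n"; // assumed action: 403 for any match
}

// Usage with a file (assumed name), one user agent per line:
// echo build_ua_block(file('bad-agents.txt', FILE_IGNORE_NEW_LINES));
echo build_ua_block(array('bad bot 1.0', '', ' scraper/2 ', 'bad bot 1.0'));
```

Returning an empty string for an empty list avoids writing a lone " [NC]" line, which would itself trigger a 500 error.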

Hope that kick-starts some automation for the more novice coders and generates a lot more clean .htaccess files :)

lucy24

11:28 pm on Sep 4, 2013 (gmt 0)




If that's not a complete list for Apache

It depends on the module. I'm not sure the colon : needs to be escaped at all; I can only think of one place where it has syntactic meaning, and that's in a rewrite flag. In vanilla Regular Expressions it isn't escaped. Conversely, there are a handful of mods that require the slash / to be escaped. You said .htaccess, but did you really mean mod_rewrite specifically?
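
On the colon and slash points: an escaped colon is harmless in a PCRE-style pattern, and if a directive does need slashes escaped, that can be layered on top of preg_quote(). This is a rough sketch. The helper name htaccess_quote() and its $extra parameter are made up for illustration, and whether a given module needs the extra escaping is something to check against that module's documentation.

```php
<?php
// Hypothetical helper: preg_quote() plus a backslash in front of any extra
// characters the target directive treats as special (space and "/" here).
// $extra should list only characters that preg_quote() itself leaves alone.
function htaccess_quote($s, $extra = ' /')
{
    $quoted = preg_quote($s); // escapes . \ + * ? [ ^ ] $ ( ) { } = ! < > | : -
    foreach (str_split($extra) as $ch) {
        $quoted = str_replace($ch, '\\' . $ch, $quoted);
    }
    return $quoted;
}

echo htaccess_quote('Mozilla/5.0 bot:x'), "\n"; // Mozilla\/5\.0\ bot\:x

// The escaped colon still matches a literal colon, so escaping it costs nothing.
var_dump(preg_match('~^' . preg_quote('a:b') . '$~', 'a:b') === 1); // bool(true)
```

For mod_rewrite conditions, calling htaccess_quote($ua, ' ') gives the same result as preg_quote($ua, " ") in the first post.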

I think you may be too generous with [NC]. A BadBot is a badbot no matter how it's cased, but there's only one Googlebot. If it calls itself "googlebot" or "GoogleBot" it's fake.

incrediBILL

11:45 pm on Sep 4, 2013 (gmt 0)




I think you may be too generous with [NC]. A BadBot is a badbot no matter how it's cased, but there's only one Googlebot. If it calls itself "googlebot" or "GoogleBot" it's fake.


I didn't say it was one size fits all :)

It's true that any variation of Googlebot other than "Googlebot" is fake, but remember that I give my known bots a pass up front, so the real Googlebot would already be allowed. Catching all the other, fake variations requires [NC].

The problem with not using [NC] is that someone comes along as "bad bot 1.0" on Monday, by Tuesday it's "Bad bot 1.1", and on Wednesday it's "Bad Bot 1.2". That's why I would typically just put in "bad bot [NC]" and catch them all if I were doing user agent blocking the old-fashioned way.
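
The trade-off in this exchange can be checked offline. The sketch below approximates how an unanchored RewriteCond pattern behaves with and without [NC], using PHP's own preg_match(). The function name ua_matches() is invented here, and PHP's PCRE is only a stand-in for Apache's matching, so treat it as a sanity check rather than a faithful emulation.

```php
<?php
// Hypothetical check: does $fragment occur in $ua, optionally ignoring case?
// Mirrors an unanchored RewriteCond pattern with or without the [NC] flag.
function ua_matches($fragment, $ua, $nc)
{
    $pattern = '~' . preg_quote($fragment, '~') . '~' . ($nc ? 'i' : '');
    return preg_match($pattern, $ua) === 1;
}

$variants = array('bad bot 1.0', 'Bad bot 1.1', 'Bad Bot 1.2');
foreach ($variants as $ua) {
    // With [NC], the short fragment "bad bot" catches every variant;
    // without it, only the all-lowercase one matches.
    printf("%-12s NC:%s exact-case:%s\n", $ua,
        ua_matches('bad bot', $ua, true) ? 'yes' : 'no',
        ua_matches('bad bot', $ua, false) ? 'yes' : 'no');
}

// lucy24's point: a case-sensitive "Googlebot" test singles out the genuine spelling.
var_dump(ua_matches('Googlebot', 'googlebot/2.1', false)); // bool(false)
```

Leaving the version number out of the fragment is what lets one line cover 1.0, 1.1 and 1.2.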