Coding for generating cleanly escaped .htaccess files

     
10:54 pm on Sep 4, 2013 (gmt 0)

incrediBILL (WebmasterWorld Administrator)


I don't know if any of you write code to generate your .htaccess files, but I do. Using PHP and preg_quote() to escape the strings automatically saves a ton of time when converting massive user agent lists and the like.

The only gotcha I've found so far is that preg_quote() doesn't escape spaces on its own, so I pass " " as its second (delimiter) argument, which adds the space to the list of escaped characters. The special characters escaped by default are: . \ + * ? [ ^ ] $ ( ) { } = ! < > | : -

If that's not a complete list for Apache, including the space, let me know!

This sample code:
$s = "Bad Bot 1.0";
echo "RewriteCond %{HTTP_USER_AGENT} " . preg_quote($s, " ") . " [NC,OR]";


Outputs this .htaccess line:
RewriteCond %{HTTP_USER_AGENT} Bad\ Bot\ 1\.0 [NC,OR]
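
For a messier string, here is a minimal sketch of the same idea wrapped in a helper. The function name ua_condition() and the sample user agent are made up for illustration; only preg_quote() itself comes from PHP.

```php
<?php
// Hypothetical helper: builds one RewriteCond line for a literal user agent string.
function ua_condition($ua, $flags = "NC,OR")
{
    // The second argument of preg_quote() is the "delimiter"; its first
    // character is escaped in addition to the regex metacharacters.
    return "RewriteCond %{HTTP_USER_AGENT} " . preg_quote($ua, " ") . " [" . $flags . "]";
}

// Made-up user agent containing a dot, spaces, parentheses, a hyphen and a plus.
echo ua_condition("Mozilla/5.0 (compatible; Bad-Bot+1.0)"), "\n";
```

The line printed should be RewriteCond %{HTTP_USER_AGENT} Mozilla/5\.0\ \(compatible;\ Bad\-Bot\+1\.0\) [NC,OR]. Note that the slash and the semicolon are left alone.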


It's easy to write a quick routine to process an array, a posted form, or a file full of user agents, and because the escaping is automatic, no more 500 errors from a stray unescaped character.

Sample code to process an array of user agents:


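$arr = array("bad bot 1.0", "googlebot", "bingbot");

// Collect one condition per non-empty user agent
$conds = array();
foreach ($arr as $ua) {
    $ua = trim($ua);
    if ($ua !== "") {
        // Passing " " as the delimiter makes preg_quote() escape spaces too
        $conds[] = "RewriteCond %{HTTP_USER_AGENT} " . preg_quote($ua, " ");
    }
}

$ht_output = "RewriteEngine on\n";
if ($conds) {
    // Every condition except the last gets [NC,OR]; the last gets [NC]
    $ht_output .= implode(" [NC,OR]\n", $conds) . " [NC]\n";
}
echo $ht_output;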


The output should be:
RewriteEngine on
RewriteCond %{HTTP_USER_AGENT} bad\ bot\ 1\.0 [NC,OR]
RewriteCond %{HTTP_USER_AGENT} googlebot [NC,OR]
RewriteCond %{HTTP_USER_AGENT} bingbot [NC]
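
For the "file full of user agents" case, a rough sketch along the same lines might look like this. The function name build_ua_block() is hypothetical, and the closing RewriteRule ^ - [F] is an assumption about what should happen on a match (a 403 Forbidden), not something taken from the routine above.

```php
<?php
// Hypothetical helper: turns raw lines (e.g. from file()) into a rewrite block.
function build_ua_block(array $lines)
{
    $conds = array();
    // Trim every line and drop exact duplicates before building conditions.
    foreach (array_unique(array_map('trim', $lines)) as $ua) {
        if ($ua === "") {
            continue; // skip blank lines
        }
        $conds[] = "RewriteCond %{HTTP_USER_AGENT} " . preg_quote($ua, " ");
    }
    if (!$conds) {
        return ""; // nothing to block, so emit nothing rather than a dangling flag
    }
    // All conditions but the last are chained with OR.
    return "RewriteEngine on\n"
        . implode(" [NC,OR]\n", $conds) . " [NC]\n"
        . "RewriteRule ^ - [F]\n"; // assumption: answer matches with 403 Forbidden
}

echo build_ua_block(array("bad bot 1.0", "", "  evil scraper ", "bad bot 1.0"));
```

To feed it from a file, something like build_ua_block(file("agents.txt", FILE_IGNORE_NEW_LINES)) should work, where agents.txt is a made-up file name with one user agent per line.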

Hope that kick-starts some automation for the more novice coders and generates a lot more clean .htaccess files :)
11:28 pm on Sept 4, 2013 (gmt 0)

lucy24 (WebmasterWorld Senior Member)


"If that's not a complete list for Apache"

It depends on the module. I'm not sure the colon : needs to be escaped at all; the only place I can think of where it has syntactic meaning is in a rewrite flag, and in vanilla regular expressions it isn't escaped. Conversely, a handful of mods require escaping the slash /. You said .htaccess, but did you really mean mod_rewrite specifically?
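
To see what preg_quote() actually does with those two characters, here is a small sketch. The user agent string is made up, and the str_replace() step is just one possible way to get both the space and the slash escaped, since preg_quote() takes only a single delimiter character.

```php
<?php
// The colon is on preg_quote()'s list; the forward slash is not,
// unless it is passed as the delimiter argument.
$ua = "Feed Fetcher:lib/2.0"; // made-up user agent for illustration

$for_rewrite = preg_quote($ua, " "); // space and colon escaped, slash untouched
$with_slash  = preg_quote($ua, "/"); // slash and colon escaped, space untouched

// One possible way to escape both: quote with " " first, then escape "/" by hand.
$both = str_replace("/", "\\/", preg_quote($ua, " "));

echo $for_rewrite, "\n";
echo $with_slash, "\n";
echo $both, "\n";
```

The three lines printed should be Feed\ Fetcher\:lib/2\.0, then Feed Fetcher\:lib\/2\.0, then Feed\ Fetcher\:lib\/2\.0. An escaped colon is harmless in a PCRE pattern, because a backslash in front of a non-alphanumeric character just means that literal character.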

I think you may be too generous with [NC]. A BadBot is a badbot no matter how it's cased, but there's only one Googlebot. If it calls itself "googlebot" or "GoogleBot" it's fake.
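
A generator could make [NC] a per-entry choice instead of applying it everywhere. This is only a sketch: the function name build_mixed_conditions() and the true/false array layout are made up for illustration.

```php
<?php
// Hypothetical helper: $agents maps a literal user agent to true (ignore case)
// or false (match the exact casing only).
function build_mixed_conditions(array $agents)
{
    $lines = array();
    $last  = count($agents) - 1;
    $i     = 0;
    foreach ($agents as $ua => $nocase) {
        $flags = array();
        if ($nocase) {
            $flags[] = "NC";
        }
        if ($i < $last) {
            $flags[] = "OR"; // every condition except the last is OR'ed
        }
        $line = "RewriteCond %{HTTP_USER_AGENT} " . preg_quote($ua, " ");
        if ($flags) {
            $line .= " [" . implode(",", $flags) . "]";
        }
        $lines[] = $line;
        $i++;
    }
    return implode("\n", $lines) . "\n";
}

// "bad bot" in any casing, plus two exact mis-cased Googlebot spellings.
echo build_mixed_conditions(array("bad bot" => true, "GoogleBot" => false, "googlebot" => false));
```

With that input the first condition ends in [NC,OR], the second in [OR], and the last has no flags at all, so the correctly cased "Googlebot" is never matched.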
11:45 pm on Sept 4, 2013 (gmt 0)

incrediBILL (WebmasterWorld Administrator)


"I think you may be too generous with [NC]. A BadBot is a badbot no matter how it's cased, but there's only one Googlebot. If it calls itself "googlebot" or "GoogleBot" it's fake."


I didn't say it was one size fits all :)

It's true that any variation of Googlebot other than "Googlebot" is fake, but remember that I give my known bots a pass up front, so the real Googlebot would already be allowed. Catching all the fake variations would require the [NC].

The problem with not using [NC] is that someone comes along as "bad bot 1.0" on Monday, by Tuesday it's "Bad bot 1.1", and on Wednesday it's "Bad Bot 1.2". That's why I would typically just put in "bad bot [NC]" and catch them all if I were doing user agent blocking the old-fashioned way.
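
The effect of a short pattern with and without [NC] can be tried out in PHP before anything goes into .htaccess. In this sketch the function name ua_contains() is made up, and PHP's "i" modifier is assumed to be a fair stand-in for mod_rewrite's [NC] flag.

```php
<?php
// Hypothetical helper: does $ua contain the literal string $needle?
// Assumption: the "i" modifier plays the role of [NC] here.
function ua_contains($needle, $ua, $nocase = true)
{
    // "~" is used as the PHP regex delimiter, so it is what preg_quote() escapes.
    $pattern = "~" . preg_quote($needle, "~") . "~" . ($nocase ? "i" : "");
    return preg_match($pattern, $ua) === 1;
}

$visitors = array("bad bot 1.0", "Bad bot 1.1", "Bad Bot 1.2", "Googlebot/2.1");
foreach ($visitors as $ua) {
    echo $ua, " => ", ua_contains("bad bot", $ua) ? "blocked" : "allowed", "\n";
}
```

All three "bad bot" variants should come out as blocked and the Googlebot string as allowed. Passing false as the third argument shows the case-sensitive behaviour, where only the Monday spelling matches.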
 
