
Apache Web Server Forum

    
Blocking Bots via htaccess Question
davescottus



 
Msg#: 4556577 posted 9:02 pm on Mar 19, 2013 (gmt 0)

Hi, I have two questions (please don't laugh if they seem very basic). They're about alphabetical order in the .htaccess file and the difference between using ^ and leaving it out when it comes to blocking bots/user agents.

What is the difference between: RewriteCond %{HTTP_USER_AGENT} ^Zeus [OR] and just RewriteCond %{HTTP_USER_AGENT} Zeus?


Will the second one be more thorough at blocking a bot by name? What do the ^ and the [OR] mean, and are they necessary?



Question 2: I often read about many stating to post these blocks in alphabetical order within the .htaccess file. Is this necessary (will it cause problems if it's not in order)?

For example, these are some of the bots I'm blocking (the last 4 aren't in alphabetical order; will this cause a problem for the site if they're not listed in precise alphabetical order?):


RewriteCond %{HTTP_USER_AGENT} ^Xaldon\ WebSpider [OR]
RewriteCond %{HTTP_USER_AGENT} ^Xenu [OR]
RewriteCond %{HTTP_USER_AGENT} ^Xara [OR]
RewriteCond %{HTTP_USER_AGENT} ^Y!TunnelPro [OR]
RewriteCond %{HTTP_USER_AGENT} ^YahooYSMcm [OR]
RewriteCond %{HTTP_USER_AGENT} ^YandexBot [OR]
RewriteCond %{HTTP_USER_AGENT} ^Zade [OR]
RewriteCond %{HTTP_USER_AGENT} ^ZBot [OR]
RewriteCond %{HTTP_USER_AGENT} ^Zeus [OR]
RewriteCond %{HTTP_USER_AGENT} ^zerxbot
RewriteCond %{HTTP_USER_AGENT} MJ12bot
RewriteCond %{HTTP_USER_AGENT} Linguee
RewriteCond %{HTTP_USER_AGENT} SolomonoBot
RewriteCond %{HTTP_USER_AGENT} Lightspeedsystems


Note the last 4 aren't in order. Thanks to everyone who can help, but if at all possible, please answer here rather than redirecting me to another site with several pages I have to scrub through to find the answer. I'm hoping one of the gurus here already knows the answers to these.

 

topr8

WebmasterWorld Senior Member; Top Contributor of All Time; 10+ Year Member



 
Msg#: 4556577 posted 10:39 pm on Mar 19, 2013 (gmt 0)

Question 2

there is no relevance to alphabetical order ... i imagine whoever says that is required/necessary has got confused when copying someone else who just said it was easier to maintain if you kept it in alphabetical order.

fyi MJ12bot and yandex respect robots.txt so just block them there if you want to, don't waste processing power with apache on them.

lucy24

WebmasterWorld Senior Member; Top Contributor of All Time; Top Contributors Of The Month



 
Msg#: 4556577 posted 10:40 pm on Mar 19, 2013 (gmt 0)

What is the difference between: RewriteCond %{HTTP_USER_AGENT} ^Zeus [OR] and just RewriteCond %{HTTP_USER_AGENT} Zeus?

You have misunderstood what the ^ means. Unlike some things in mod_rewrite and in .htaccess-in-general, the ^ applies to all Regular Expressions everywhere. It is an opening anchor.

So in your case
"Zeus" = the user-agent string contains the four consecutive letters "Zeus"
while
"^Zeus" = the user-agent string begins with the four consecutive letters "Zeus"

The counterpart to ^ is $ for closing anchor.
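
To make the distinction concrete, here is a small sketch. The two user-agent strings in the comments are only illustrative samples, and the three conditions are alternatives shown side by side, not something to stack in one rule.

```apache
# Illustrative UA strings (samples for this sketch only):
#   UA-1: "Zeus 32297 Webster Pro V2.9 Win32"
#   UA-2: "Mozilla/5.0 (compatible; Zeus/1.0)"

# Unanchored: matches UA-1 and UA-2, because "Zeus" appears somewhere in both.
RewriteCond %{HTTP_USER_AGENT} Zeus

# Opening anchor: matches UA-1 only, because only UA-1 begins with "Zeus".
RewriteCond %{HTTP_USER_AGENT} ^Zeus

# Closing anchor: matches neither, because neither string ends with "Zeus".
RewriteCond %{HTTP_USER_AGENT} Zeus$
```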

I often read about many stating to post these blocks in alphabetical order within the .htaccess file. Is this necessary (will it cause problems if it's not in order)?

This has nothing to do with Regular Expressions or with htaccess or even mod_rewrite. It is about personal organization. When the list gets long, you need to be able to find things quickly.

Similarly, if you have a list of IP blocks like
Deny from 11.22.33.44
keep them in numerical order.

Finally: mod_rewrite is probably not the most efficient way to do user-agent blocks. I recommend mod_setenvif combined with mod_auth-something (in Apache 2.2 the Allow/Deny directives live in mod_authz_host) so you can run out a string of

BrowserMatch badbot keep_away

mod_setenvif may run before or after mod_rewrite, but even on shared hosting you can be confident that the mod_auth-thingie package runs immediately before the core, after everything else.
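
A minimal sketch of that approach, assuming Apache 2.2-style access control and a host that permits these directives in .htaccess (AllowOverride FileInfo and Limit). The variable name keep_away comes from the line above; the bot names are borrowed from the original question.

```apache
# Tag unwanted user agents with an environment variable (mod_setenvif).
# BrowserMatch is case-sensitive; BrowserMatchNoCase ignores case.
BrowserMatch ^Zeus keep_away
BrowserMatch SolomonoBot keep_away
BrowserMatchNoCase lightspeedsystems keep_away

# Apache 2.2-style access control: allow everyone, then refuse tagged requests.
Order Allow,Deny
Allow from all
Deny from env=keep_away
```

On Apache 2.4 the last three lines would normally be written with Require instead ("Require all granted" plus "Require not env keep_away" inside a RequireAll block), so check which version your host runs before copying this.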

btw, what have you got against Yandex? It currently behaves quite well. And their wmt has a feature I wish the Big Boys would steal: they list reasons for not indexing (in my case generally Unsupported Language ;)) with affected pages by name.

davescottus



 
Msg#: 4556577 posted 10:57 pm on Mar 19, 2013 (gmt 0)

May I ask what happens when you use just RewriteCond %{HTTP_USER_AGENT} zerxbot instead of RewriteCond %{HTTP_USER_AGENT} ^zerxbot?


I read somewhere that plain zerxbot, without an anchor, may be more effective for blocking bots. I'm not sure why, or whether it's true.

topr8




 
Msg#: 4556577 posted 11:12 pm on Mar 19, 2013 (gmt 0)

... ok here's a tip, get yourself the firefox extension: user agent switcher

and you can try out different variations and see how they work for yourself.
set the user agent to what you like and then test your .htaccess file and see what happens

lucy24




 
Msg#: 4556577 posted 1:41 am on Mar 20, 2013 (gmt 0)

May I ask what happens when you use just RewriteCond %{HTTP_USER_AGENT} zerxbot instead of RewriteCond %{HTTP_USER_AGENT} ^zerxbot?

Answer too long? OK, I'll repeat the relevant part, lightly edited.

... the ^ applies to all Regular Expressions everywhere. It is an opening anchor.

So in your case
"zerxbot" = the user-agent string contains the seven consecutive letters "zerxbot"
while
"^zerxbot" = the user-agent string begins with the seven consecutive letters "zerxbot"


This is assuming for the sake of discussion that you know what "user-agent string" means. It occurs to me belatedly that maybe you don't, and that this is the source of the problem.

davescottus



 
Msg#: 4556577 posted 1:52 am on Mar 20, 2013 (gmt 0)

Thanks Lucy, I think I'm with you now (sorry if I was a little slow to catch on..lol). So from what I understand:
^ means the user agent must begin with this name to match
$ means the user agent must end with this name to match

With neither the ^ nor the $ added, it can be either (it can start or end with this name to find a match).

So by omitting the ^ and the $ it'll block more aggressively, as it doesn't matter if the name begins or ends with that name, it'll still block it?

jlnaman



 
Msg#: 4556577 posted 6:09 am on Mar 20, 2013 (gmt 0)

> So by omitting the ^ and the $ it'll block more aggressively
... and take more resources. You are paying for searching the UA string.

lucy24




 
Msg#: 4556577 posted 8:02 am on Mar 20, 2013 (gmt 0)

If you know that a particular element always occurs at the beginning of the UA string, then by all means anchor it. Then, if it isn't the very first thing, the server can stop looking. Closing anchors don't save any work-- the server can only read in one direction-- so use them only if the element must come last.

But most robots are not thoughtful enough to make it easy on your server.

it can be either (it can start or end with this name to find a match)

Or neither: the part you want can be sitting somewhere in the middle. And it doesn't have to be an exact full word, unless you explicitly say so.
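
As a rough illustration of "unless you explicitly say so": Apache's regular expressions are PCRE, so a word boundary (\b) can be used to ask for a whole word. The user-agent strings in the comments are invented for the example, and the two conditions are alternatives, not a pair to stack.

```apache
# Invented UA strings for illustration:
#   UA-1: "Mozilla/5.0 (compatible; Zeus/1.0)"
#   UA-2: "SuperZeusBot/2.1"

# Plain substring: matches UA-1 and UA-2.
RewriteCond %{HTTP_USER_AGENT} Zeus

# Whole word only: matches UA-1 ("Zeus" sits between a space and a slash),
# but not UA-2, where "Zeus" is buried inside "SuperZeusBot".
RewriteCond %{HTTP_USER_AGENT} \bZeus\b
```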

davescottus



 
Msg#: 4556577 posted 4:50 pm on Mar 20, 2013 (gmt 0)

I haven't noticed any extra server load from leaving out the anchor yet, but I can see how doing this a lot could add up.

The [OR] just means it can be either an uppercase or lowercase match, correct? So adding an [OR] is normally better, or does that eat up more resources as well?

g1smd

WebmasterWorld Senior Member; Top Contributor of All Time; 10+ Year Member



 
Msg#: 4556577 posted 6:19 pm on Mar 20, 2013 (gmt 0)

It is [NC] that allows aNyCase.

The [OR] means "or". If you omit it, ALL of the conditions must be true - so the rule will NEVER run. There is no user-agent string that could possibly satisfy all of the conditions at once. Multiple conditions are separated by an implicit "and" unless you state otherwise.

The anchors are important where a pattern could be ambiguous. They allow you to match a shorter string while rejecting a longer string.
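
Putting those points together, a corrected version of part of the list from the first post might look like the sketch below. The RewriteEngine and RewriteRule lines are assumptions, since the original post showed only the conditions; [F] is the standard mod_rewrite flag for a 403 Forbidden response.

```apache
RewriteEngine On

# Every condition except the last carries [OR], so any single match is enough.
RewriteCond %{HTTP_USER_AGENT} ^Zeus [OR]
RewriteCond %{HTTP_USER_AGENT} ^zerxbot [OR]
RewriteCond %{HTTP_USER_AGENT} MJ12bot [OR]
RewriteCond %{HTTP_USER_AGENT} Linguee [OR]
# Flags combine with a comma: case-insensitive match, and keep the OR chain going.
RewriteCond %{HTTP_USER_AGENT} solomonobot [NC,OR]
# Last condition: no [OR], which closes the list.
RewriteCond %{HTTP_USER_AGENT} lightspeedsystems [NC]

# Any request that satisfied the chain gets 403 Forbidden; the URL is untouched.
RewriteRule ^ - [F]
```

If an [OR] is dropped from the middle of the chain, everything after the gap is ANDed with what came before, which is why the list as originally posted could never match.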
