Forum Moderators: phranque


Need clarification on http user agent and mod rewrite

         

Boston444

7:25 pm on Jan 19, 2012 (gmt 0)

10+ Year Member



Hi Everyone,
I have the following mod_rewrite setup to block a few bots I don't want taking up bandwidth on my site. The first two entries work fine, and I added a new entry for Baiduspider. I then did a graceful restart, but I am still seeing requests coming through from Baiduspider. In the Apache log entry below, the userAgent equals Mozilla/5.0. Is that my problem, and if so, is there a way to adjust the syntax to match Baiduspider/2.0? Basically, can I match what is inside the parentheses, or does HTTP_USER_AGENT only care about what comes right after the = sign?

# Apache Config Entry
RewriteEngine on
RewriteCond %{HTTP_USER_AGENT} ^Googlebot/Nutch-1.0 [OR]
RewriteCond %{HTTP_USER_AGENT} ^MJ12bot [OR]
RewriteCond %{HTTP_USER_AGENT} ^Baiduspider/2.0 [OR]
RewriteRule ^.* - [F,L]

# Apache log file example.
userAgent=Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)

Thank You!

g1smd

8:22 pm on Jan 19, 2012 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Do not use [OR] on the last condition.
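For example, the OP's snippet with the trailing [OR] dropped from the last condition (a sketch, not tested; periods escaped as a precaution):

```apache
RewriteEngine on
# [OR] chains a condition to the NEXT one; the last condition must not have it,
# or the rule is chained to whatever RewriteCond/RewriteRule follows.
RewriteCond %{HTTP_USER_AGENT} ^Googlebot/Nutch-1\.0 [OR]
RewriteCond %{HTTP_USER_AGENT} ^MJ12bot [OR]
RewriteCond %{HTTP_USER_AGENT} ^Baiduspider/2\.0
RewriteRule ^.* - [F,L]
```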

Boston444

8:37 pm on Jan 19, 2012 (gmt 0)

10+ Year Member



Sorry, please ignore the [OR]; I brought it over by accident in a copy/paste. I am wondering if I need to do a hard stop/start instead of a graceful restart to kill any current connections.

lucy24

8:45 pm on Jan 19, 2012 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I am still seeing requests coming through from Baiduspider.

The requests will never stop. Do you mean that you're still seeing successful requests?

On my system, putting [OR] after the last condition would lead to a 500 error, so consider yourself lucky ;)

If you start blocking user-agents you will be playing whack-a-mole until the cows come home. Sure, I write my own content. Want to make something out of it? It's generally easier to block by IP. Unless it's something like MJ12 that's distributed so there's no IP to block.

Boston444

8:59 pm on Jan 19, 2012 (gmt 0)

10+ Year Member



Hi,
I am looking at my web application's Tomcat logs. The request should die at the httpd service and not make it to Tomcat. I understand I will still see the requests in the httpd logs, but they should not make it through to the Tomcat logs.


httpd --> tomcat --> mysql

Boston444

9:31 pm on Jan 19, 2012 (gmt 0)

10+ Year Member



This is my full rewrite list. I don't believe there are any mistakes, and all the other rules work. I am seeing a bunch of posts around the web with the same problem of not being able to block this bot even with these rules in place.

RewriteEngine on
RewriteCond %{HTTP_USER_AGENT} ^Googlebot/Nutch-1.0 [OR]
RewriteCond %{HTTP_USER_AGENT} ^MJ12bot [OR]
RewriteCond %{HTTP_USER_AGENT} ^YandexBot [OR]
RewriteCond %{HTTP_USER_AGENT} ^Ezooms [OR]
RewriteCond %{HTTP_USER_AGENT} ^Baiduspider [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Baiduspider.* [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Sogou [OR]
RewriteCond %{HTTP_USER_AGENT} ^Sosospider+
RewriteRule ^.* - [F,L]

wilderness

9:56 pm on Jan 19, 2012 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



For the ka-zillionth time, these threads belong in the SSID forum. They've NEVER been discussed in the Apache Forum (not when Jim was or wasn't here).

You may simplify all your lines into the following:

RewriteEngine on
RewriteCond %{HTTP_USER_AGENT} (crawler|Ezooms|MJ12|Nutch|Sogou|spider|Yandex) [NC]
RewriteRule .* - [F]

The end result is that everything you previously wished to deny will be denied, as well as many common rogues (i.e., spider, crawler and Nutch).

Please note that I've also omitted the begins-with anchor and used "contains" instead.

A basic understanding of the anchors:
begins with (^)
ends with ($)
contains (no anchor)

is a minimum requirement.
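To illustrate the difference (a quick sketch using Python's re module as a stand-in for mod_rewrite's regex engine, with the UA string taken from the OP's log):

```python
import re

# The full User-Agent from the OP's log: "Baiduspider" sits inside the
# parentheses, not at the start of the string.
ua = "Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)"

# Begins-with anchor (^): no match, because the string starts with "Mozilla/5.0".
print(bool(re.search(r"^Baiduspider", ua)))       # False

# "Contains" (no anchor): matches anywhere in the string, like the
# simplified RewriteCond above. [NC] corresponds to re.I here.
print(bool(re.search(r"Baiduspider", ua, re.I)))  # True
```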

For future reference: when a RewriteCond pattern contains a period within the name, as in
Googlebot/Nutch-1.0

that period requires escaping:
Googlebot/Nutch-1\.0

That last explanation is all for naught here, because you don't need those phrases in the solution I provided.

Boston444

10:20 pm on Jan 19, 2012 (gmt 0)

10+ Year Member



Hi,
Thank you, that seems to be working. Just curious: could you explain a bit more why my syntax didn't work?

Thank You

wilderness

10:41 pm on Jan 19, 2012 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Have you looked at your error logs to see why they were failing?

All of your RewriteConds used the begins-with anchor (^) even though those UAs may not begin with those terms.

Your Google/Nutch line would fail because you did not have the period escaped.

You severely restrict your grasp of black-listing (that's what these denial methods are called) by not understanding some very basic skills (such as anchors) and the versatility of RegEx, which doesn't require complex formulas for wildcard expressions.
Unfortunately, strings utilized by PHP result in complex RegEx.

Basic User Agents and IP's are simple and require little knowledge of Apache or RegEx.

There are thousands of examples of these methods in the SSID archives prior to mid-2003. (the User Agents may be outdated, however the methods still work).

wilderness

10:46 pm on Jan 19, 2012 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



FWIW, the Googlebot/Nutch line is used by wiki software, and that "wiki" term also appears in their UA. I'd consider adding that term as well.

lucy24

11:41 pm on Jan 19, 2012 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Hey, wilderness, good to see you back.

Your Google/Nutch line would fail because you did not have the period escaped.

Unescaped periods rarely make rules fail. The danger is that they can yield false positives.

Nutch-1.0

will work on "Nutch-1.0" --but it will also work on

Nutch-100
Nutch-1s0
Nutch-1 0
Nutch-1/0

and so on.
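A quick check of the list above (Python's re used as a stand-in; mod_rewrite's PCRE treats the unescaped dot the same way):

```python
import re

uas = ["Nutch-1.0", "Nutch-100", "Nutch-1s0", "Nutch-1 0", "Nutch-1/0"]

# Unescaped dot matches ANY single character, so every variant matches...
print([bool(re.search(r"Nutch-1.0", ua)) for ua in uas])   # all True

# ...while the escaped dot matches only a literal period.
print([bool(re.search(r"Nutch-1\.0", ua)) for ua in uas])  # only the first is True
```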

wilderness

11:59 pm on Jan 19, 2012 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Many thanks Lucy.

With your help for Boston444, I'll make my exit.

penders

11:21 am on Jan 20, 2012 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



for the ka-zillionth time, these threads belong in the SSID forum.


SSID forum?

g1smd

8:42 pm on Jan 20, 2012 (gmt 0)

lucy24

9:33 pm on Jan 20, 2012 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Why? He's asking how to block, not whom to block.

Boston444

10:26 pm on Jan 20, 2012 (gmt 0)

10+ Year Member



Yes, sorry, I was not asking about certain bots or how to block them per se. I was just asking if my mod_rewrite syntax was correct, and it turns out it was not. Thank you for clearing up my mistake, much appreciated.