Apache Web Server Forum

    
.htaccess block and redirect
Blekfis




msg:4615282
 8:45 am on Oct 8, 2013 (gmt 0)

If .htaccess looks like this:

SetEnvIfNoCase User-Agent .*rogerbot.* bad_bot

<Limit GET POST HEAD>
Order Allow,Deny
Allow from all
Deny from env=bad_bot
</Limit>

Redirect 301 / http://domain.com


Will rogerbot read/see the whole file or stop at </Limit>?

 

wilderness




msg:4615316
 12:15 pm on Oct 8, 2013 (gmt 0)

The very basic understanding of SetEnvIf requires the use of anchors

Begins with (^)
Ends with ($)
Contains ( )
Exactly as ("")

Using wildcards will produce less than desired results and may not function as you intended at all.

You'll also need to add Error docs, or else a loop will be created.

Blekfis




msg:4615325
 12:55 pm on Oct 8, 2013 (gmt 0)

Just copied a bit of code I found = Apache-noob ;)

Would this better do what I'm looking for?


RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} ^.*rogerbot.*$ [NC,OR]
RewriteRule ^.*.* http://www.googlehammer.com/ [L]

Redirect 301 / http://domain.com



I'd like the bot(s) not to see the 301, is it even possible with just the htaccess..?

[edited by: phranque at 8:37 am (utc) on Oct 10, 2013]
[edit reason] unlinked urls [/edit]

penders




msg:4615340
 2:07 pm on Oct 8, 2013 (gmt 0)

What exactly are you trying to do? Are you wanting to redirect this bot or simply block it?

Regarding your regex, a lot of it is superfluous. This is all you need:
RewriteCond %{HTTP_USER_AGENT} rogerbot [NC]
RewriteRule .* http://example.com/ [R=301,L]


Assuming example.com is another domain then that will be an external redirect, as opposed to an internal rewrite (as suggested by your code).
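
To make the block-versus-redirect distinction concrete, here is a minimal sketch of the two alternatives in .htaccess. It assumes mod_rewrite is enabled and that example.com is a placeholder for a different host; you would pick one option, not both.

```apache
RewriteEngine On

# Option 1: block. Any User-Agent containing "rogerbot" (any case) gets a 403.
# The "-" means "no substitution"; [F] sends Forbidden and stops processing.
RewriteCond %{HTTP_USER_AGENT} rogerbot [NC]
RewriteRule ^ - [F]

# Option 2: redirect. The same bots are sent to another host with a 301.
# example.com is a placeholder. Without an explicit R=301, an absolute URL
# on a different host still produces an external redirect, but as a 302.
#RewriteCond %{HTTP_USER_AGENT} rogerbot [NC]
#RewriteRule ^ http://example.com/ [R=301,L]
```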

wilderness




msg:4615367
 3:43 pm on Oct 8, 2013 (gmt 0)

RewriteCond %{HTTP_USER_AGENT} ^.*rogerbot.*$ [NC,OR]


NO. Eliminate all .*

Begins with:
RewriteCond %{HTTP_USER_AGENT} ^rogerbot [NC,OR]

Ends with:
RewriteCond %{HTTP_USER_AGENT} rogerbot$ [NC,OR]

Contains:
RewriteCond %{HTTP_USER_AGENT} rogerbot [NC,OR]

Exactly as:
RewriteCond %{HTTP_USER_AGENT} "rogerbot" [NC,OR]

Fundamental anchors.

Same anchors used with SetEnvIf.

penders




msg:4615420
 6:49 pm on Oct 8, 2013 (gmt 0)

Exactly as:
RewriteCond %{HTTP_USER_AGENT} "rogerbot" [NC,OR]


Surrounding the CondPattern in double quotes does not result in an exact match, in this example the double quotes are superfluous and it will search for rogerbot anywhere in the string.

For an exact match, prefix the CondPattern with = (equals)
RewriteCond %{HTTP_USER_AGENT} =rogerbot [NC]

Or use start and end anchors (although the CondPattern is still a regex):
RewriteCond %{HTTP_USER_AGENT} ^rogerbot$ [NC]
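
As a summary of the forms discussed so far, the sketch below annotates each one with what it would match. The sample User-Agent strings in the comments are made up for illustration, and in practice you would keep only one of these conditions in front of a RewriteRule.

```apache
# Contains: matches "rogerbot/1.0" and "Mozilla/5.0 (compatible; rogerbot)"
RewriteCond %{HTTP_USER_AGENT} rogerbot [NC]

# Begins with: matches "rogerbot/1.0" but not "Mozilla/5.0 (compatible; rogerbot)"
RewriteCond %{HTTP_USER_AGENT} ^rogerbot [NC]

# Ends with: matches "somebot rogerbot" but not "rogerbot/1.0"
RewriteCond %{HTTP_USER_AGENT} rogerbot$ [NC]

# Exact, as a regex: matches only "rogerbot" (in any case, because of NC)
RewriteCond %{HTTP_USER_AGENT} ^rogerbot$ [NC]

# Exact, as a plain string comparison: same result, no regex involved
RewriteCond %{HTTP_USER_AGENT} =rogerbot [NC]

# Quoted: the quotes only delimit the argument, so this is "contains" again
RewriteCond %{HTTP_USER_AGENT} "rogerbot" [NC]
```
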
Blekfis




msg:4615442
 8:18 pm on Oct 8, 2013 (gmt 0)

I want to block certain bots so they don't see the 301 redirect.

Is it possible to do this with just htaccess or do I need to do a combo of htaccess to block bots and index.php to redirect visitors and allowed bots?

lucy24




msg:4615465
 9:45 pm on Oct 8, 2013 (gmt 0)

Will rogerbot read/see the whole file or stop at </Limit>?

Anything inside an envelope-- whether it's <Limit>, <Files(Match)> or (in config files) <Directory> --supersedes anything outside the envelope.

An htaccess file isn't read sequentially from top to bottom. Each module reads its own sections, followed by the core, and within those categories, anything inside an envelope is evaluated after anything lying around loose.

A common example:
<Files robots.txt>
Order allow,deny
Allow from all
</Files>

It doesn't matter whether you put this section before, after, or smack in the middle of other authorization directives. It will always override them.

In mod_setenvif, you can use quotation marks to "protect" literal spaces in a user-agent string. They're not useful or necessary for anything else I can think of.

When more than one directive could apply to a request-- for example a redirect issued by mod_rewrite followed by a flat-out denial issued by mod_authz-whatever --no response is sent out until all modules have had their chance. A 403 issued by one mod will override a 301 issued by another mod, regardless of which one is evaluated first.
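
Applied to the original question, that means a request denied by the authorization directives in .htaccess should get the 403 and never the 301. A minimal sketch, assuming Apache 2.2-style access control as used earlier in the thread and example.com as a placeholder target:

```apache
# Flag any request whose User-Agent contains "rogerbot" (case-insensitive).
SetEnvIfNoCase User-Agent rogerbot bad_bot

# No <Limit> wrapper, so the denial applies to every request method,
# not just GET, POST and HEAD.
Order Allow,Deny
Allow from all
Deny from env=bad_bot

# Everyone who is not denied gets the redirect. The trailing slash on the
# target matters: Redirect appends whatever follows "/" in the request,
# so /page.html becomes http://example.com/page.html.
Redirect 301 / http://example.com/
```

On Apache 2.4 the Order/Allow/Deny directives are only available through mod_access_compat; the Require directives are the native replacement.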

g1smd




msg:4615478
 10:35 pm on Oct 8, 2013 (gmt 0)

In blocking the bots they will see a 403 status code if you use the "deny from" syntax.

If you don't want them to "see" the redirect, what do you want them to see instead?

wilderness




msg:4615520
 2:14 am on Oct 9, 2013 (gmt 0)

Exactly as:
RewriteCond %{HTTP_USER_AGENT} "rogerbot" [NC,OR]


Surrounding the CondPattern in double quotes does not result in an exact match,


nonsense.

in this example the double quotes are superfluous and it will search for rogerbot anywhere in the string.


agreed

wilderness




msg:4615521
 2:18 am on Oct 9, 2013 (gmt 0)

In mod_setenvif, you can use quotation marks to "protect" literal spaces in a user-agent string. They're not useful or necessary for anything else I can think of.


"rogerbot 1.6.2"

or any other longer string

lucy24




msg:4615526
 3:41 am on Oct 9, 2013 (gmt 0)

Are you translating or disagreeing? ;)

Blekfis




msg:4615558
 6:40 am on Oct 9, 2013 (gmt 0)

If you don't want them to "see" the redirect, what do you want them to see instead?


...doesn't matter, I just don't want them to see the redirect...

This is done as a SEO-test where I need to have some bots not see the redirect, just not sure what the best way would be to do so...

penders




msg:4615574
 8:29 am on Oct 9, 2013 (gmt 0)

Surrounding the CondPattern in double quotes does not result in an exact match,


nonsense.


@wilderness: Why "nonsense"? You agreed to the second part, "it will search for rogerbot anywhere in the string" - which isn't an exact match. Agreeing to one and not the other would seem to be a contradiction?

wilderness




msg:4615578
 8:44 am on Oct 9, 2013 (gmt 0)

Agreeing to one and not the other would seem to be a contradiction?


Life is a contradiction, while Apache and regex are beyond multiple lives.
I gave some valid examples in the preliminary use of anchors and you chose to pick my examples apart, rather than assist the OP.

Are you translating or disagreeing?


Hey Lucy,
Disagreeing. There are applications of exactly as far beyond blank spaces. They're just not common.

lucy24




msg:4615585
 9:26 am on Oct 9, 2013 (gmt 0)

But, but, but

:: splutter ::

This
"rogerbot 1.6.2"

or any other longer string

seems to be illustrating exactly what I meant. The UA string contains literal spaces. If you don't want to escape them you have to put the whole thing into quotation marks to prevent the space from taking on semantic meaning. You should also, ahem, escape the literal periods. Quotation marks don't turn off Regular Expressions.

There are applications of exactly as far beyond blank spaces.

I think your cat stepped on the keyboard.

...doesn't matter, I just don't want them to see the redirect...

They have to see something when they request the page. If you don't like or trust them, why don't you simply block them?

Edit:
Back to OP:
SetEnvIfNoCase User-Agent .*rogerbot.* bad_bot

If you're neither capturing nor anchoring, the formulation .* is never necessary. It simply means "there may or may not be more stuff here".

^.*rogerbot.*$ = ^.*rogerbot = rogerbot.*$ = rogerbot
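
Written out for the directive in question, these four lines should all set bad_bot for exactly the same requests, so the last and shortest form is the one to keep. This is only an illustration of the equivalence; you would use just one of them.

```apache
SetEnvIfNoCase User-Agent ^.*rogerbot.*$ bad_bot
SetEnvIfNoCase User-Agent ^.*rogerbot bad_bot
SetEnvIfNoCase User-Agent rogerbot.*$ bad_bot
SetEnvIfNoCase User-Agent rogerbot bad_bot
```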

Would this better do what I'm looking for?

RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} ^.*rogerbot.*$ [NC,OR]
RewriteRule ^.*.* http://www.example.com/ [L]

Redirect 301 / http://example.com

NOOOO. If it weren't 2:30 AM, I would go into detail. Count your blessings.

Oh yes and: Read the Forums rules about using "example.com". Look at your own post and you'll understand why it's doubly important in the Apache subforum.

wilderness




msg:4615597
 9:46 am on Oct 9, 2013 (gmt 0)

seems to be illustrating exactly what I meant. The UA string contains literal spaces. If you don't want to escape them you have to put the whole thing into quotation marks to prevent the space from taking on semantic meaning. You should also, ahem, escape the literal periods. Quotation marks don't turn off Regular Expressions.


lucy,
Jim and I had this disagreement many times (i. e., the use of quotes and exactly as) in the last years of his participation here.
This goes all the way back to the earliest versions of Apache (I've still lines in place from those earlier days) and they remain functional in the most current versions.

Jim kept quoting the Apache Docs and I kept telling him that in this specific instance the Apache Docs were full of beans (of which there are a few other examples).

penders




msg:4615599
 10:00 am on Oct 9, 2013 (gmt 0)

Life is a contradiction...


What?! I was simply correcting a wholly incorrect statement you made in your example... which benefits the OP, you, and everyone else who happens to read this thread.

wilderness




msg:4615602
 10:07 am on Oct 9, 2013 (gmt 0)



I gave some valid examples in the preliminary use of anchors


I was simply correcting a wholly incorrect statement you made in your example...

lucy24




msg:4615739
 10:19 pm on Oct 9, 2013 (gmt 0)

Jim and I had this disagreement many times (i. e., the use of quotes and exactly as) in the last years of his participation here.
This goes all the way back to the earliest versions of Apache (I've still lines in place from those earlier days) and they remain functional in the most current versions.

Jim kept quoting the Apache Docs and I kept telling him that in this specific instance the Apache Docs were full of beans (of which there are a few other examples).

I'm sorry, Don, but I don't understand what you are saying. Specifically I don't understand what you're disagreeing with. And I don't see where "exactly as" enters into it at all, since I didn't say anything about that.

What _I_ said was: If the user-agent string-- or whatever other string you're testing in mod_setenvif-- contains literal spaces, one way to preserve those spaces is to put the test string in quotation marks. If you don't use quotation marks, the spaces acquire their usual semantic meaning.

BrowserMatch "rogerbot 1.6.2" keep_out
= If the UA string contains the element "rogerbot 1x6x2" then set the variable "keep_out" to its default value (1, or "true", or whatever it is)

BrowserMatch rogerbot 1.6.2 keep_out
= If the UA string contains the element "rogerbot" then set two variables, "1.6.2" and "keep_out"

Quotation marks don't cancel regular expressions and they don't create anchors.

BrowserMatch "Camino/2.1.2 (like"
= 500 error due to mismatched parenthesis

BrowserMatch "Camino/2.1.2 \(like"
= I am blocked

BrowserMatch "Camino/2...2"
= I am blocked
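
Putting those points together, here is a sketch of the same test with the periods escaped, so that only a literal "rogerbot 1.6.2" matches. The version number and the keep_out variable are just the illustrative values from above. BrowserMatchNoCase is shorthand for SetEnvIfNoCase User-Agent, so the two lines are interchangeable.

```apache
# Quotes keep the space inside a single argument; "\." matches a literal period.
BrowserMatchNoCase "rogerbot 1\.6\.2" keep_out

# The same test spelled out with SetEnvIfNoCase.
SetEnvIfNoCase User-Agent "rogerbot 1\.6\.2" keep_out
```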

wilderness




msg:4616138
 11:59 am on Oct 11, 2013 (gmt 0)

lucy,
Despite your prolific use of BrowserMatch, it's a lame tool and overall provides less than desired results.
Adding that example and/or application to this thread merely confuses matters more.

User-Agent
or
%{HTTP_USER_AGENT}

are much more effective and focused.

Once again, there are effective uses for exactly as ("")

In any event, I've violated my commitment to discontinue posting in the Apache Forum and fear that you're just egging me on to jerk my chain ;)

Don

Blekfis




msg:4618056
 6:23 am on Oct 21, 2013 (gmt 0)

Bumping this since I don't feel I really got an answer.

In short I want to 301 a page/site and block certain bots from seeing this 301. Can this be done with just htaccess or do I need to make it a combo with index.php?

lucy24




msg:4618074
 8:14 am on Oct 21, 2013 (gmt 0)

do I need to make it a combo with index.php

Say what now?

A visitor-- whether human or robot-- doesn't see any headers until all Apache mods have done their stuff. If any of those mods issues a lockout, the 403 is all the robot will ever see.

THIS >> "Sorry, you're not wanted here"

NOT THIS >> "Sorry, you're not wanted, but if I had let you in I would have sent you to otherpage.html instead".

Throughout this thread your wording has been a little bit odd. So the lack of an unambiguous answer is because it's not 100% clear that you are, in fact, blocking the robot. If you're blocking it by User-Agent or IP or simply because you don't like its face, your job is done. A blocked request will not see any redirects arising from the same request.
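
For reference, a minimal sketch of the whole thing in mod_rewrite alone: listed bots get a 403, everyone else gets the 301. The second bot name is made up, and example.com is a placeholder that is assumed to be a different host from the one serving this .htaccess. Note that [OR] goes on every condition except the last; as far as I know, a trailing [OR] on the final condition makes the rule fire for every request.

```apache
RewriteEngine On

# Block: a User-Agent containing either name gets 403 Forbidden.
# "otherbot" is a hypothetical second entry to show where [OR] belongs.
RewriteCond %{HTTP_USER_AGENT} rogerbot [NC,OR]
RewriteCond %{HTTP_USER_AGENT} otherbot [NC]
RewriteRule ^ - [F]

# Redirect: requests that were not blocked are sent to the new host,
# keeping the requested path.
RewriteRule ^(.*)$ http://example.com/$1 [R=301,L]
```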

Blekfis




msg:4618093
 9:57 am on Oct 21, 2013 (gmt 0)

Ok, then "mission accomplished" ;)

As I mentioned earlier, this is just a small SEO-test so no need for these bots to see a 403

Thanks!

g1smd




msg:4618484
 9:57 pm on Oct 22, 2013 (gmt 0)

They'll see "something".

What do you want that something to be?

Blekfis




msg:4618540
 4:05 am on Oct 23, 2013 (gmt 0)

Nothing ;)

I want to stop certain bots from crawling the page and seeing its outbound links

lucy24




msg:4618557
 7:40 am on Oct 23, 2013 (gmt 0)

Ah, we're quibbling over the definition of "something". A blocked robot will see the 403 response. It may-- if it so chooses-- see the content of the 403 page that your server obligingly sent out. That's assuming the 403 was issued by the server in the first place.

WebmasterWorld is a Developer Shed Community owned by Jim Boykin.
© Webmaster World 1996-2014 all rights reserved