
Home / Forums Index / Code, Content, and Presentation / Apache Web Server
Forum Library, Charter, Moderators: Ocean10000 & incrediBILL & phranque

Apache Web Server Forum

This 38 message thread spans 2 pages.
Is there anything wrong with this htaccess?
From an htaccess n00b
oddsod · msg:4390375 · 5:55 pm on Nov 23, 2011 (gmt 0)

I know there's tons of really useful htaccess threads and if I spend a few hours I can learn what works and what doesn't work, but I was hoping someone could have a quick look at this htaccess file from a site I bought recently and tell me if there's anything I need to change.

It's exactly as below (with just the domain name removed)

==============
Options +Includes

Redirect 301 /oldfolder1/ [mysite.com...]
Redirect 301 /oldfolder1 [mysite.com...]
Redirect 301 /oldfolder2/ [mysite.com...]
Redirect 301 /oldfolder2 [mysite.com...]
Redirect 301 /search [mysite.com...]

RewriteEngine On
RewriteCond %{HTTP_HOST} ^mysite.com
RewriteRule (.*) [mysite.com...] [R=301,L]

RewriteCond %{HTTP_USER_AGENT} ^BlackWidow [OR]
RewriteCond %{HTTP_USER_AGENT} ^BaiduSpider [NC, OR]
RewriteCond %{HTTP_USER_AGENT} ^Download\ Demon [OR]
RewriteCond %{HTTP_USER_AGENT} ^eCatch [OR]
RewriteCond %{HTTP_USER_AGENT} ^EirGrabber [OR]
RewriteCond %{HTTP_USER_AGENT} ^EmailSiphon [OR]
RewriteCond %{HTTP_USER_AGENT} ^EmailWolf [OR]
RewriteCond %{HTTP_USER_AGENT} ^Express\ WebPictures [OR]
RewriteCond %{HTTP_USER_AGENT} ^ExtractorPro [OR]
RewriteCond %{HTTP_USER_AGENT} ^EyeNetIE [OR]
RewriteCond %{HTTP_USER_AGENT} ^FlashGet [OR]
RewriteCond %{HTTP_USER_AGENT} ^GetRight [OR]
RewriteCond %{HTTP_USER_AGENT} ^GetWeb! [OR]
RewriteCond %{HTTP_USER_AGENT} ^Go!Zilla [OR]
RewriteCond %{HTTP_USER_AGENT} ^Go-Ahead-Got-It [OR]
RewriteCond %{HTTP_USER_AGENT} ^GrabNet [OR]
RewriteCond %{HTTP_USER_AGENT} ^Grafula [OR]
RewriteCond %{HTTP_USER_AGENT} ^HMView [OR]
RewriteCond %{HTTP_USER_AGENT} HTTrack [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Image\ Stripper [OR]
RewriteCond %{HTTP_USER_AGENT} ^Image\ Sucker [OR]
RewriteCond %{HTTP_USER_AGENT} ^Indy\ Library [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^InterGET [OR]
RewriteCond %{HTTP_USER_AGENT} ^Internet\ Ninja [OR]
RewriteCond %{HTTP_USER_AGENT} ^JetCar [OR]
RewriteCond %{HTTP_USER_AGENT} ^JOC\ Web\ Spider [OR]
RewriteCond %{HTTP_USER_AGENT} ^larbin [OR]
RewriteCond %{HTTP_USER_AGENT} ^LeechFTP [OR]
RewriteCond %{HTTP_USER_AGENT} ^Mass\ Downloader [OR]
RewriteCond %{HTTP_USER_AGENT} ^MIDown\ tool [OR]
RewriteCond %{HTTP_USER_AGENT} ^Mister\ PiX [OR]
RewriteCond %{HTTP_USER_AGENT} ^Navroad [OR]
RewriteCond %{HTTP_USER_AGENT} ^NearSite [OR]
RewriteCond %{HTTP_USER_AGENT} ^NetAnts [OR]
RewriteCond %{HTTP_USER_AGENT} ^NetSpider [OR]
RewriteCond %{HTTP_USER_AGENT} ^Net\ Vampire [OR]
RewriteCond %{HTTP_USER_AGENT} ^NetZIP [OR]
RewriteCond %{HTTP_USER_AGENT} ^Octopus [OR]
RewriteCond %{HTTP_USER_AGENT} ^Offline\ Explorer [OR]
RewriteCond %{HTTP_USER_AGENT} ^Offline\ Navigator [OR]
RewriteCond %{HTTP_USER_AGENT} ^PageGrabber [OR]
RewriteCond %{HTTP_USER_AGENT} ^Papa\ Foto [OR]
RewriteCond %{HTTP_USER_AGENT} ^pavuk [OR]
RewriteCond %{HTTP_USER_AGENT} ^pcBrowser [OR]
RewriteCond %{HTTP_USER_AGENT} ^RealDownload [OR]
RewriteCond %{HTTP_USER_AGENT} ^ReGet [OR]
RewriteCond %{HTTP_USER_AGENT} ^SiteSnagger [OR]
RewriteCond %{HTTP_USER_AGENT} ^SmartDownload [OR]
RewriteCond %{HTTP_USER_AGENT} ^SuperBot [OR]
RewriteCond %{HTTP_USER_AGENT} ^SuperHTTP [OR]
RewriteCond %{HTTP_USER_AGENT} ^Surfbot [OR]
RewriteCond %{HTTP_USER_AGENT} ^tAkeOut [OR]
RewriteCond %{HTTP_USER_AGENT} ^Teleport\ Pro [OR]
RewriteCond %{HTTP_USER_AGENT} ^VoidEYE [OR]
RewriteCond %{HTTP_USER_AGENT} ^Web\ Image\ Collector [OR]
RewriteCond %{HTTP_USER_AGENT} ^Web\ Sucker [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebAuto [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebCopier [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebFetch [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebGo\ IS [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebLeacher [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebReaper [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebSauger [OR]
RewriteCond %{HTTP_USER_AGENT} ^Website\ eXtractor [OR]
RewriteCond %{HTTP_USER_AGENT} ^Website\ Quester [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebStripper [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebWhacker [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebZIP [OR]
RewriteCond %{HTTP_USER_AGENT} ^Wget [OR]
RewriteCond %{HTTP_USER_AGENT} ^Widow [OR]
RewriteCond %{HTTP_USER_AGENT} ^WWWOFFLE [OR]
RewriteCond %{HTTP_USER_AGENT} ^Xaldon\ WebSpider [OR]
RewriteCond %{HTTP_USER_AGENT} ^YandexBot [OR]
RewriteCond %{HTTP_USER_AGENT} ^Zeus
RewriteRule ^.* - [F,L]
==========================

I need the htaccess to
1. redirect some pages/folders to new pages and folders (I think I've got this bit right, it's all working)
2. redirect all non-www requests to www (this is also working)
3. Block bots especially Baidu and Yandex (Baidu and Yandex don't seem to be getting blocked)

Thanks in advance for any help.

Any help much appreciated.

 

wilderness · msg:4390439 · 8:18 pm on Nov 23, 2011 (gmt 0)

this list of UA's is more than ten years old.
many are not even active anymore.

thirteen lines may be replaced with a single line:

# UA "begins" with web (no case)
RewriteCond %{HTTP_USER_AGENT} ^Web [NC,OR]

some of the other lines need changing from "begins with" to "contains" (simply omit caret).
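To make the difference concrete, here is a sketch of the two anchor styles. Note that Baidu's crawler actually identifies itself as something like "Mozilla/5.0 (compatible; Baiduspider/2.0; ...)" (and with a lowercase "s" in "spider", so the [NC] flag matters too), which is why a caret-anchored pattern never fires on it:

```apache
# "Begins with" (caret-anchored): only matches if the UA string STARTS
# with "Baiduspider". Baidu's real UA starts with "Mozilla/5.0", so this
# line never fires:
RewriteCond %{HTTP_USER_AGENT} ^Baiduspider [NC,OR]

# "Contains" (no caret): matches the token anywhere in the UA string:
RewriteCond %{HTTP_USER_AGENT} Baiduspider [NC,OR]
```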

oddsod · msg:4390454 · 8:52 pm on Nov 23, 2011 (gmt 0)

Thanks. I've spent the last hour looking through the library but couldn't find a more up-to-date list. Do you know of a thread that has a recent list?

In the meanwhile I've even added this to my htaccess but it's still not stopping Baidu :(


order allow,deny
deny from 180.76.
allow from all


I can only assume there's something wrong with my htaccess that I don't know enough to spot.

wilderness · msg:4390460 · 8:59 pm on Nov 23, 2011 (gmt 0)

Do you know of a thread that has a recent list?


NO.

There are many historical threads of this nature and unfortunately, participants are unable to comprehend simple examples and abide by forum charters (i.e., example.com and complete htaccess files), which results in very long threads and abandonment by the regulars, who refuse to read and sift through an entire file, or even the "malformed copied-and-pasted versions repeated over, and over, and over."

One such example of those abuses is the "Close to Perfect htaccess [webmasterworld.com]"

oddsod · msg:4390463 · 9:02 pm on Nov 23, 2011 (gmt 0)

It looks like it should have had asterisks in the IP like so:
deny from 180.76.*.*

I've tried that ... to no avail :(

lucy24 · msg:4390465 · 9:09 pm on Nov 23, 2011 (gmt 0)

I know there's tons of really useful htaccess threads and if I spend a few hours I can learn what works and what doesn't work, but I was hoping someone could

... do it for you instead? :-P C'mon now, you've been around long enough to know better. And you've been around long enough to know that you gotta say example.com so autolinking doesn't kick in. We need to see what you typed, not go to the site.

have a quick look at this htaccess file


Change everything using mod_alias (Redirect by that name) to mod_rewrite (flag [R=301,L]).

All those rows and rows of [OR] can be expressed as a single pipe-delimited list:

(aaa|bbb|ccc)

instead of

aaa [OR]
bbb [OR]
ccc

The with-or-without www redirect goes at the very end, after any and all specific redirects.

Deny from 180.76

without any trailing punctuation. Or if you want to be super-safe

Deny from 180.76.0.0/16
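Pulling those suggestions together, a sketch might look like the following (folder names and example.com are placeholders; verify against your own URLs, and note the blocks are ordered block-first per later advice in this thread):

```apache
RewriteEngine On

# Bad-bot block: one pipe-delimited condition replaces many [OR] lines
RewriteCond %{HTTP_USER_AGENT} (Baiduspider|HTTrack|Wget|YandexBot) [NC]
RewriteRule .* - [F]

# Specific redirects via mod_rewrite instead of mod_alias "Redirect";
# (/.*)? captures any trailing path so $1 can reattach it
RewriteRule ^oldfolder1(/.*)?$ http://www.example.com/newfolder1$1 [R=301,L]
RewriteRule ^oldfolder2(/.*)?$ http://www.example.com/newfolder2$1 [R=301,L]

# The with-or-without-www redirect goes at the very end
RewriteCond %{HTTP_HOST} ^example\.com [NC]
RewriteRule (.*) http://www.example.com/$1 [R=301,L]

# IP-level block for the Baidu range, no trailing punctuation
Order Allow,Deny
Allow from all
Deny from 180.76.0.0/16
```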

oddsod · msg:4390466 · 9:11 pm on Nov 23, 2011 (gmt 0)

I'm going to read that thread now wilderness, but it is a 2001 thread and even the sequels - #2 and #3 date back to 2003. So I'm thinking I'll get an idea of your "example of abuse", but I don't know if it'll solve my problem i.e. stopping Baidu!

oddsod · msg:4390471 · 9:19 pm on Nov 23, 2011 (gmt 0)

lucy24, please don't go by my reg date here, as the title admits, I'm a complete n00b with htaccess.

... do it for you instead? :-P C'mon now, you've been around long enough to know better. And you've been around long enough to know that you gotta say example.com so autolinking doesn't kick in.

Obviously, I didn't know then. I do now. Apologies for the inconvenience.

OK, I'll re-attempt using your advice and get back, thanks.

Change everything using mod_alias (Redirect by that name) to mod_rewrite (flag [R=301,L]).

I've no idea what you mean, but it's a pointer that I can go and research so cheers. I'll be back.

oddsod · msg:4390479 · 9:36 pm on Nov 23, 2011 (gmt 0)

Sample of the first two redirect rules converted to Rewrite:


RewriteRule ^/books/ /wiki/books [R=301,L]
RewriteRule ^/books /wiki/books [R=301,L]


Am I on the right track?

lucy24 · msg:4390499 · 11:09 pm on Nov 23, 2011 (gmt 0)

Leave off the leading slash, or the rule will fail except when you get malformed URLs with double //. mod_rewrite chops off the entire domain name, including its following slash, and reattaches it later.

If you are redirecting via mod_rewrite, include the complete protocol and domain name in the "target" half:

http://www.example.com/wiki/books

If you don't include this part, mod_rewrite will reattach whatever hostname it was originally given, so you risk getting Duplicate Content from with-and-without www forms and possibly even extra business involving ports.

You can collapse your two Rules into one:

^books/?

Are those filenames or directories? mod_rewrite, unlike mod_alias, doesn't reattach the rest of the path, so if there's anything after "books" you need to capture it:

RewriteRule ^books(.*) http://www.example.com/books$1 [R=301,L]

If on the other hand you're making old functional URLs into pretty new ones, you'll have something like

RewriteRule ^books/(index\.html)?$ http://www.example.com/books [R=301,L]

with a RewriteCond looking at THE_REQUEST, followed by a rewrite (not redirect) to turn the pretty URL back into something that actually exists. Details omitted because I don't want to alarm you if that isn't what you are trying to do ;)
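For the curious, the redirect-plus-rewrite pair alluded to above looks roughly like this (a sketch only, assuming /books/index.html is the real file behind the pretty URL /books/):

```apache
# Externally redirect anyone still requesting the old functional URL.
# THE_REQUEST holds the raw client request line, so this condition only
# fires on direct requests, never on our own internal rewrite below
# (which is what prevents a redirect loop):
RewriteCond %{THE_REQUEST} /books/index\.html
RewriteRule ^books/index\.html$ http://www.example.com/books/ [R=301,L]

# Internally rewrite (not redirect) the pretty URL back to the file
# that actually exists on disk:
RewriteRule ^books/?$ /books/index.html [L]
```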

wilderness · msg:4390578 · 3:25 am on Nov 24, 2011 (gmt 0)

but I don't know if it'll solve my problem i.e. stopping Baidu!


Three of those lines may be replaced with the following, which will also resolve your Baidu issue.

RewriteCond %{HTTP_USER_AGENT} spider [NC,OR]

There are also other common terms of abuse used by many bots that may be combined on the same line and in the manner lucy provided.

crawler, download, wget, nutch, lwp, larbin, php, python,
reaper, xenu, java, MJ12bot, Proxy, capture, win32, WinHttp, wordpress, and more (many of which are Synonyms or variations of these same words).

FWIW, I would suggest alphabetizing these words and keeping each line to 6-8 word-UA's
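Put together, that suggestion might look like the sketch below (an illustrative subset of the terms listed above; all are unanchored "contains" matches, and only the last condition line omits [OR]):

```apache
# Blocked UA substrings, alphabetized, 6-8 terms per line
RewriteCond %{HTTP_USER_AGENT} (capture|crawler|download|java|larbin|lwp) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} (MJ12bot|nutch|php|proxy|python|reaper) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} (spider|wget|win32|WinHttp|wordpress|xenu) [NC]
RewriteRule .* - [F]
```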

oddsod · msg:4390660 · 12:21 pm on Nov 24, 2011 (gmt 0)

Thanks for your input, guys. I must admit it looks very useful but this old man is having trouble understanding much of your tech speak so please bear with me.

I get the point that target needs to have the full URI. Thanks.

If on the other hand you're making old functional URLs into pretty new ones, you'll have something like...

I don't understand that bit, but maybe it doesn't apply to my case. With respect the redirect, I have two requirements:

1. I need to redirect both the "book" file name (the file name doesn't have a .htm or anything, it's just book) and I need to redirect the "book" folder to a new file. Explanation: example.com/book (old file) and all files in example.com/book/ (old folder) need to redirect to example.com/wiki/book (new file). So my understanding now is that this is done by

RewriteRule ^book/? http://www.example.com/wiki/book$1 [R=301,L]

2. I need to redirect both the "book2" file name and all files in the "book2" folder to the default index page of a new folder. Explanation: example.com/book2 (old file) and all files in example.com/book2/ (old folder) need to redirect to example.com/xyz/book2/ (home page of new folder). So, is this it? (I'm guessing on the $1 bit):

RewriteRule ^book2/? http://www.example.com/wiki/book2/$1 [R=301,L]

Three lines may be replaced with the following and will also resolve your Baidu issue.

RewriteCond %{HTTP_USER_AGENT} spider [NC,OR]

Which three lines are you suggesting I replace? Is it that the above RewriteCond line replaces all useragents in my list that have "spider" in their names ... or does this command block ALL spiders (including Googlebot)?

wilderness · msg:4390673 · 1:43 pm on Nov 24, 2011 (gmt 0)

all useragents in my list that have "spider" in their names

g1smd · msg:4390693 · 2:31 pm on Nov 24, 2011 (gmt 0)

You'll find your "book2" redirect doesn't work, because the previous "book" redirect pattern matched those requests and it also redirected the "book2" requests to the new "book" URL. Swapping the rule order will fix it. More-specific redirects always go first.

You'll also find that $1 is always blank, because you didn't capture anything from the pattern to re-use in the target. You'd use /book/(.*) or similar for that.
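A sketch applying both corrections (hypothetical example.com paths; note that adding an end anchor helps too, since the unanchored ^book/? also matches "book2"):

```apache
# More-specific redirect first, so "book2" requests are not swallowed
RewriteRule ^book2(/.*)?$ http://www.example.com/xyz/book2/ [R=301,L]
RewriteRule ^book(/.*)?$ http://www.example.com/wiki/book [R=301,L]

# $1 is only non-blank when the pattern captured something, e.g.:
# RewriteRule ^folder/(.*)$ http://www.example.com/newfolder/$1 [R=301,L]
```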

oddsod · msg:4390712 · 3:17 pm on Nov 24, 2011 (gmt 0)

Thanks, g1smd.

the previous "book" redirect pattern matched those requests...

Ah, yes, I see what you're saying.

So is this how my htaccess should look now?

Options +Includes
RewriteEngine On
RewriteRule ^book2/? http://www.example.com/wiki/book2/ [R=301,L]
RewriteRule ^book/? http://www.example.com/wiki/book [R=301,L]
#(If what I understand about $1 is correct)
RewriteCond %{HTTP_USER_AGENT} ^BlackWidow | ^BaiduSpider | ^Download\ Demon | ^eCatch | ^EirGrabber | ^EmailSiphon | ^EmailWolf |^Express\ WebPictures | ^ExtractorPro | ^EyeNetIE | ^FlashGet
RewriteRule ^.* - [F]
RewriteCond %{HTTP_HOST} ^example.com
RewriteRule (.*) http://www.example.com/$1 [R=301,L]


Sometimes I think everybody is trying to talk cryptically just to confuse me - or to make me demonstrate I'm willing to do some legwork! - but it may be rather that all of you are so used to the terminology that you deliver suggestions in the language you would use to other experts ;) I'm still trying to work out what "alphabetizing these words" means (from an earlier suggestion above)!

wilderness · msg:4390716 · 3:20 pm on Nov 24, 2011 (gmt 0)

Deny from 180.76

without any trailing punctuation. Or if you want to be super-safe

Deny from 180.76.0.0/16


lucy,
FWIW, in most instances the following lines will function either way; however, that depends entirely on the individual server and should be verified (as should all htaccess changes, after each modification).

180.76
180.76.

(Note: it's a bad idea to mix these methods; for the sake of consistency, one or the other should be used.)

g1smd · msg:4390718 · 3:33 pm on Nov 24, 2011 (gmt 0)

Bad user agents are redirected to your new site, and then the other site has to block them. Why make both sites have to do some work for bad requests?

Swap the rule order slightly.

Put rules which block access first, specific redirects next and more general (like non-www/www) redirects last.

For your long-term sanity, add a blank line after every RewriteRule and prefix each block of code with a # plain-English comment describing what it does.
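That ordering, sketched with the comments suggested (illustrative patterns on example.com):

```apache
# 1. Access blocks first: bad UAs never reach the redirect rules,
#    so the other site never has to deal with them
RewriteCond %{HTTP_USER_AGENT} (crawler|spider|wget) [NC]
RewriteRule .* - [F]

# 2. Specific redirects next
RewriteRule ^book2(/.*)?$ http://www.example.com/xyz/book2/ [R=301,L]

# 3. General non-www to www redirect last
RewriteCond %{HTTP_HOST} ^example\.com [NC]
RewriteRule (.*) http://www.example.com/$1 [R=301,L]
```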
oddsod · msg:4390720 · 3:36 pm on Nov 24, 2011 (gmt 0)

Aha! Yes. That makes sense. Other than that the syntax is ok?

Options +Includes
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} ^BlackWidow | ^BaiduSpider | ^Download\ Demon | ^eCatch | ^EirGrabber | ^EmailSiphon | ^EmailWolf |^Express\ WebPictures | ^ExtractorPro | ^EyeNetIE | ^FlashGet
RewriteRule ^.* - [F]
RewriteRule ^book2/? http://www.example.com/wiki/book2/ [R=301,L]
RewriteRule ^book/? http://www.example.com/wiki/book [R=301,L]
RewriteCond %{HTTP_HOST} ^example.com
RewriteRule (.*) http://www.example.com/$1 [R=301,L]

wilderness · msg:4390723 · 3:49 pm on Nov 24, 2011 (gmt 0)

This is WRONG.

1) They must be enclosed in parentheses, and I'd suggest simplicity, as well as omitting the caret ("begins with") anchor.
2) NO trailing spaces either.

Revise to (all on one line, with no stray spaces around the pipes):
RewriteCond %{HTTP_USER_AGENT} (BlackWidow|Spider|Download|Catch|Grabber|Email|Express|Pictures|Extractor|EyeNetIE|Flash|Get|crawler|Nutch|capture|wget|other-terms-previously-provided) [NC,OR]

KISS (Keep It Simple, Stupid).

A basic understanding of anchors is fundamental for any htaccess.

begins with
ends with
contains.

There is so much versatility in KISS, and just using these anchors effectively.

wilderness · msg:4390725 · 3:55 pm on Nov 24, 2011 (gmt 0)

BTW, I also suggested previously alphabetizing these UA's and condensing the line(s) to 6-8 UA's each, across multiple lines.

oddsod · msg:4390735 · 4:55 pm on Nov 24, 2011 (gmt 0)

Thanks for everyone's help so far. Anyway, this is what I've got now and it's completely not working :(

I've tried deleting section by section, but even the basic htaccess below makes all my pages go internal server error :(


Options +Includes
RewriteEngine On
#Block specific spiders and bots
RewriteCond %{HTTP_USER_AGENT} (BlackWidow|Baiduspider|ExtractorPro|EyeNetIE|FlashGet|Hatena|JikeSpider|VoilaBot|YodaoBot) [NC,OR]
RewriteRule ^.* - [F]


Help! What am I doing wrong?

[edited by: oddsod at 5:02 pm (utc) on Nov 24, 2011]

wilderness · msg:4390738 · 5:01 pm on Nov 24, 2011 (gmt 0)

Try removing the Options line.

AND, if you DO NOT have any other lines of UA's in these conditions, the OR will generate a 500.

The OR flag is only used when there are multiple lines of conditions, and never on the last line.
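Sketched (illustrative patterns; the point is where [OR] may and may not appear):

```apache
# Wrong: a lone (or final) condition must not carry a dangling [OR]
#   RewriteCond %{HTTP_USER_AGENT} (Baiduspider|spider) [NC,OR]
# Right: the only condition takes just [NC]
RewriteCond %{HTTP_USER_AGENT} (Baiduspider|spider) [NC]
RewriteRule .* - [F]

# With several condition lines, every line EXCEPT the last takes [OR]:
RewriteCond %{HTTP_USER_AGENT} (crawler|spider) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} (wget|WinHttp) [NC]
RewriteRule .* - [F]
```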

[edited by: wilderness at 5:08 pm (utc) on Nov 24, 2011]

oddsod · msg:4390747 · 5:08 pm on Nov 24, 2011 (gmt 0)

Sorted! :)

spider blocking working
www/http redirect working

Just the files and folders redirection now. Hey, progress! Thanks again guys. If you ever need help in my areas of expertise (site:webmasterworld.com oddsod) ... drop me a sticky.

wilderness · msg:4390749 · 5:10 pm on Nov 24, 2011 (gmt 0)

For the benefit of others (newbies)?

Please provide an explanation of the syntax error that was causing the 500?

Many thanks Don.

oddsod · msg:4390751 · 5:19 pm on Nov 24, 2011 (gmt 0)

Internal Server Error

The server encountered an internal error or misconfiguration and was unable to complete your request.

Please contact the server administrator, Postmaster@example.com and inform them of the time the error occurred, and anything you might have done that may have caused the error.

More information about this error may be available in the server error log.

wilderness · msg:4390752 · 5:21 pm on Nov 24, 2011 (gmt 0)

Not the actual 500 explanation; rather, the syntax error in your htaccess that caused it, and the correction you made.

Many thanks.

oddsod · msg:4390769 · 6:19 pm on Nov 24, 2011 (gmt 0)

Removing the Options line is what made the difference. What does that line do anyway?

wilderness · msg:4390783 · 7:28 pm on Nov 24, 2011 (gmt 0)

This is the Apache explanation:
This directive enables operating system specific optimizations for a listening socket by the Protocol type.
end of quote

Which I don't really understand.

What I do know is that some hosts have it on by default and adding the line when it exists by default causes the server error.

Other hosts that do NOT have it on by default require the line addition.

oddsod · msg:4390799 · 8:18 pm on Nov 24, 2011 (gmt 0)

OK, one more question. I have a vBulletin forum on this site and it's in a /forum/ folder. There is a htaccess file in this forum folder inserted by vBulletin. Do I need to put any specific commands in there? I didn't think so. But I still see those banned spiders getting files from the forum.

If blocked spiders are accessing this folder does it mean that my spider blocking isn't working?

lucy24 · msg:4390802 · 8:34 pm on Nov 24, 2011 (gmt 0)

Oh, that's odd. I assumed it was the misplaced OR leading to the server errors. 80,000 guesses how I know this.

You have probably figured out by now that "alphabetize" unlike many other words used in this thread means exactly what it means in real life ;)

The pipe character | is handled first, unless there are parentheses. (Just to confuse you, RegEx calls it "bottom priority".) One common mistake is

^blahblah1|blahblah2|blahblah3/$

This does not mean "the exact directory blahblah1 or blahblah2 or blahblah3". It means "begins with blahblah1, OR contains blahblah2, OR ends with directory whose name ends in blahblah3".

With parentheses

^(blahblah1|blahblah2|blahblah3)/$

it comes out meaning what you wanted it to mean.

And if those were literal names-- which I really really don't recommend because you will very quickly forget what the numbers were supposed to mean--

^blahblah[1-3]/$

Once you have got a grip on Regular Expressions you have probably got 90% of Apache.
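lucy24's pattern fragments as complete rules (illustrative [F] targets, just to show the anchoring):

```apache
# Unparenthesized: the anchors bind to the outer alternatives only, i.e.
# "starts with blahblah1" OR "contains blahblah2" OR "ends with blahblah3/"
#   RewriteRule ^blahblah1|blahblah2|blahblah3/$ - [F]

# Parenthesized: the whole group is anchored at both ends, which is
# almost always what was intended:
RewriteRule ^(blahblah1|blahblah2|blahblah3)/$ - [F]

# If the names really are sequentially numbered, a character class works:
RewriteRule ^blahblah[1-3]/$ - [F]
```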
