Mod Rewrite: Stop Writing & Start Reading. Please!

What you don't know can kill your site!

         

TheMadScientist

2:08 am on Nov 5, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I've read a few posts recently that prompted this post...
In one of them the poster basically said within the post:

The rules posted work, but are so confusing I don't understand them, so I'm going to use mine instead.

The preceding is the first and most glaring warning sign you should stop writing Mod_Rewrite and start reading about how to use it until you understand exactly what you are doing.

Most may not know this, because it's not reflected in my post count or join date, but I used to post almost exclusively here in the Apache Forum, then I stopped for a couple of years. (I opted for a user-name change in the interim.) Why did I stop posting? Because the redundancy of the questions asked in the Apache Forum is mind-boggling and, to me, very frustrating.

One of the biggest differences I've noticed is the change in the 'tone' of replies from the 'regulars' here, which have gone from 'generally friendly and explanatory' to 'shorter' and much closer to 'RTBM' (Read The Bleeping Manual!) and I think the reason is the same as why I just plain quit posting and honestly am still much happier to post in the PHP forum, or elsewhere. The Reason: People keep asking the same bleeping questions making it obvious they have not done their homework, and honestly, it gets old.

Mod_Rewrite is used either in the httpd.conf file, or an .htaccess file.
There's a reason they're hidden files.
There's a reason not everyone has access to the httpd.conf file on their server.
There's a reason Mod_Rewrite is not loaded on every server.

The Reason Is:
Mod_Rewrite is neither a 'toy' nor a 'scripting language for the masses', which is what PHP, JavaScript, and many other very forgiving languages happen to be.

Mod_Rewrite is a specific, highly powerful, regular expression based, URL manipulation tool, and it must reside in a file that is not only important, but critical to your website operating correctly.

The httpd.conf file, which is the most efficient location for Mod_Rewrite to be used, is read when the server starts, but the rules in it are evaluated for every request for every page (location, URL) on an entire website, even if none of the Mod_Rewrite rules affect the requested location (URL, Page).

It's processed before PHP is parsed.
It's processed before JavaScript is delivered.
It's processed before the CSS and HTML are sent to the browser for every request made by a browser for every URL on the website, including graphics, movies, etc. Every browser requested URL means: Every Browser Requested URL... All of them. Period.

You can use JS, PHP, and a number of other scripting languages without regular expressions...
You cannot do that with Mod_Rewrite, because regular expressions are all Mod_Rewrite is.

There are regular expression tutorials here:
Regular Expression Basics [webmasterworld.com] Apache Forum Library.
Regular Expression Basics [webmasterworld.com] PHP Forum Library.
Regular Expression Tutorial [etext.lib.virginia.edu] University of Virginia, Electronic Text Center.

* If you do not understand the tutorials, then read them again, try the ideas and concepts out on a testing server, and if you still 'just don't get it' you might need to realize Mod_Rewrite is a bit too advanced for you to use on your own. (I don't mean the preceding as a 'slam' or 'dig' on people who don't get it, because I think anyone can learn if they take the time, but those who don't risk breaking their website, wearing their server out early, and losing their search engine rankings, all of which seem like 'less than good' choices.)

Continuing On...
What 'processed before' means in the preceding context is:
If you simply have an HTML file which uses an external style sheet, 2 JavaScript files, and only 5 images, there are 9 URLs requested, and they are compared to every Mod_Rewrite rule in the file for every visitor...

IOW: 1000 visitors per day x 20 rules x 9 URLs = 180,000 comparisons for a single, simple, basic page to be displayed to 1000 visitors.

If you have an external redirect, which is executed 1/2 way through your Mod_Rewrite rules, you make 10 comparisons before the redirect is executed, then start the process over.

If you have canonicalization and it's at the end of the file to prevent 'chained' or 'stacked' redirects, then you make 20 comparisons for 9 URLs, redirect to the correct version of the domain (if necessary), and make a minimum of 20 comparisons again.
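For anyone who has not seen one, a canonicalization rule of the kind described above might look something like the following. This is only a sketch, assuming the rules live in the root .htaccess file and that www.example.com (a placeholder, not a hostname from this thread) is the canonical host.

```apache
# Sketch for a root .htaccess file; www.example.com is a placeholder hostname.
RewriteEngine On

# Do nothing when the Host header is empty or already canonical;
# any other hostname gets one external 301 redirect to the canonical one.
RewriteCond %{HTTP_HOST} !^(www\.example\.com)?$
RewriteRule ^(.*)$ http://www.example.com/$1 [R=301,L]
```

After the redirect the browser makes a brand new request, which is why every rule ahead of this one gets compared a second time.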

If you do not eliminate CSS, GIF, JPG, JS, ICO, TXT, SWF, WMV, etc. and have to canonicalize them also, you make 20 comparisons x 8 extra URLs, redirect to the correct version of the domain, then make 20 comparisons x 8 extra URLs again, which is 320 extra comparisons for a very simple page to load for one visitor to the page.

The following single line at the top of the Mod_Rewrite code stops the file types listed in the preceding paragraph from being compared to any following rules, so in the example only one URL (the HTML page) would be compared to all 20 rules, rather than all 9:

RewriteRule \.(css|js|ico|gif|jpe?g|txt|wmv|swf)$ - [L]

The root .htaccess file, which is a less efficient setting for Mod_Rewrite than the httpd.conf file, but also happens to be the most used, is again processed for every request for every URL on the entire website, just like the httpd.conf file, which means even if none of the rules affect the requested location (URL, Page), the requested location is still compared to every rule for a possible match before anything else is done.

Recently I looked at a portion of a file posted that contained multiple blocks of comments like this:

############################################
############################################
#################### Comment Here
############################################
############################################

People, the files necessary to edit so you can use Mod_Rewrite are hidden for a reason, and if you ever run something like YSlow [developer.yahoo.com] with Firebug [getfirebug.com] installed on Firefox, you'll see how much file space you can save by removing white space from a file... Forget about the text for a minute, you can make gains by eliminating white space.

The files you must edit to run Mod_Rewrite are generally processed for every request made for any file on your entire website.

This is not the setting for 'cutesy comments'.
This is not the setting for "I'll settle for 'ok', because I don't get it."
This is not the place for those who aren't willing to take the time to learn and understand.

I can understand why the regular posters here in the Apache Forum get (or come across as) frustrated at times, and I think it's because people think Mod_Rewrite is PHP or JavaScript or some other 'scripting language for the masses' and they can 'get by' with 'good enough' and it's not a big deal...

If the preceding were the case there would not have been a recent post about a 500 Internal Server Error being caused by inefficient Mod_Rewrite rules in an .htaccess file.

The files edited to use Mod_Rewrite control your website. They make your website work and can break your website, sometimes so silently or in such a way it can go unnoticed for months, until one day you wake up and wonder why your server crashed, why it keeps serving errors, or why your search engine rankings went away... Then it's too late.

These files communicate with visitors (including search engines) about your website by sending server header codes and information about the requested location, so browsers (and search engines) know what to do, where to go, and what is going on with your site and specific locations.

The files used are hidden, so you have to work to gain access to them, because they're important, fragile and Must Be Perfect to have the desired effect without unwanted and potentially devastating side-effects.

Please, people, I know this sounds blunt and harsh, but Mod_Rewrite is not a toy, so RTBM (Read The Bleeping Manual) and know what everything you put in an .htaccess file does before you try to use this URL Manipulation Tool on a live site, because if you 'get it wrong' or settle for 'good enough and seems to work' you can easily break your site, run your server into the ground, or ruin your search engine rankings, sometimes without even knowing you did it.

In the Charter [webmasterworld.com] of this forum, you will find links to:
The Apache Mod_Rewrite Documentation.
The Apache URL Rewriting Guide.
The University of Virginia Regular Expression Tutorial.
Other Apache Resources.

In the Library of this forum, you will find links to the following posts, and many other useful Apache Server Resources:
An Introduction to Redirecting URLs on an Apache Server [webmasterworld.com]
Beginning Mod_Rewrite [webmasterworld.com]
Mod_Rewrite and Regular Expressions [webmasterworld.com]
What's the difference between an external and an internal redirect? [webmasterworld.com]
Changing Dynamic URLs to Static URLs [webmasterworld.com]
A guide to fixing duplicate content & URL issues on Apache [webmasterworld.com]

Again, please, take the time to learn and know what you are doing before trying to use this Apache Module on a live site, because unlike many languages or modules, things can break 'silently' and have incredibly negative results, which may go unnoticed until it's too late to correct them without a large expenditure of time and/or money...

If your server breaks and it's yours, you're out some cash. If your rankings tank because of a mistake, you're not only out some revenue, you also have to wait until you find the issue, fix it, and search engines decide to rank you again (if they do), which can potentially be much more expensive in the short and long term than even having to replace a server.

Leosghost

3:18 am on Nov 5, 2009 (gmt 0)




Great post..now if it can be stuck at the top of each thread in the apache forum ..

"to be read before one posts"

The time that Jim and all the regulars put in ..in the apache forum is appreciated by many of us ..and is a reason why many many people read it ..even if they never post ..

tangor

4:22 am on Nov 5, 2009 (gmt 0)




Thanks for illuminating the IMPORTANCE of these tools! I fall into the "get it but know when to avoid" category.

g1smd

8:29 am on Nov 5, 2009 (gmt 0)




This forum is largely responsible for Google seemingly now having 88 000 results for the word 'Redirect' when used with my user name. :)

I'm not entirely sure that there's a change of tone. For myself, and I think I can also speak for jd on this one single point, we prefer to use very exact terminology when answering questions. This naturally appears terse. Indeed he posted a few days ago about not using 'it's' in posts but to repeat the article by name to make it clear exactly what was meant.

Terminology is key. The difference between an external redirect (include domain and [R=301,L]) and an internal (exclude domain and use just [L]) is a small coding difference with a very big effect on the outcome.
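To make that coding difference concrete, here is a minimal sketch of the two forms side by side. It assumes a root .htaccess file, and the hostname, page names, and widget.php script are all made up for illustration.

```apache
# Sketch for a root .htaccess file; all names below are hypothetical.
RewriteEngine On

# External redirect: the target includes the protocol and domain, and
# R=301 sends a "301 Moved Permanently" response, so the client makes
# a new request and the URL in the address bar changes.
RewriteRule ^old-page\.html$ http://www.example.com/new-page.html [R=301,L]

# Internal rewrite: the target is a local path with no domain and only [L].
# The client never sees /widget.php; the address bar keeps the requested URL.
RewriteRule ^widgets/([^/]+)$ /widget.php?name=$1 [L]
```

The redirect is listed first, which is the ordering usually recommended so that an internally rewritten path is never exposed by a later redirect.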

Planning is key. A lot of people start coding way before they know exactly what it is they want to do. And I mean 'exactly'. Once the exact requirements are known: users see these URL formats (list), redirect these other types of URLs (list) to some other URL formats, rewrite the URL formats that users see to these internal path structures (list), the coding step becomes a lot easier.

One thing that makes this forum a bit different to many of the others here at WebmasterWorld is that it is treated a lot more as a free helpdesk (I guess that also happens in PHP and CSS too), but as the forum charter says 'there's not enough volunteers to offer a free code writing service here' and jd is often at pains to point out 'we can help you write your code, but you must fully understand your code because you must be able to maintain and update your code'.

For a company to be relying on free help in a forum to keep their website online if something breaks is not a good thing; especially if when something breaks there's no-one around to answer questions. There are time zone differences to take into account, we do take 6 to 8 hours sleep daily (despite what our post counts might otherwise suggest) and we do have a lot of other things to be getting on with, to earn a crust.

That said, I think this forum is very successful in getting a lot of problems fixed, and we'll never know how many people find this forum in Google, find the answer they need and use it without ever posting. I always try to make my answers useful to those people, not just the OP.

[edited by: jdMorgan at 10:05 pm (utc) on Nov. 5, 2009]
[edit reason] Edited at member request. [/edit]

TheMadScientist

10:50 am on Nov 5, 2009 (gmt 0)




@ Leosghost

Great post..now if it can be stuck at the top of each thread in the apache forum ..

"to be read before one posts"

Thanks, and LOL.

The time that Jim and all the regulars put in ..in the apache forum is appreciated by many of us ..and is a reason why many many people read it ..even if they never post ..

Yeah, it's actually mind-numbing to me, because there are days when I've read posts and thought 'I've posted the exact answer to that question (or as close as you'll likely get to the exact answer) 4 times in the last 5 days and I'm not going to answer it again today... You can't make me. I'm not going to do it!' (Yeah, sorry people, but I get tired of typing the same answer over and over again, which prompted my absence here for quite a while. I actually refused to even visit the Apache Forum for quite a while.)

@ tangor

Thanks for illuminating the IMPORTANCE of these tools!

I'm glad you got that out of it. It was one of my main points and reasons for the post, because to me it seems too many people don't realize the possible ramifications of poor mod_rewrite implementation.

(See Below. I have a glaring example.)

@ g1smd

That said, I think this forum is very successful in getting a lot of problems fixed, and we'll never know how many people find this forum in Google, find the answer they need and use it without ever posting. I always try to make my answers useful to those people, not just the OP.

The forum is great for solving problems, helping people out, and you do a great job.

jd is often at pains to point out 'we can help you write your code, but you must fully understand your code because you must be able to maintain and update your code'

Agreed, and you both do a great job of this too.

I was actually just pointing out 'RTBM' and 'know what your file does' more bluntly than most have dared to, and for what I believe are good reasons, including... (HTTP/1.1 303 See Other :+:+: Location: Below)

Even if people here help others 'find the answer' or 'solve the problem' we don't always get to see all the code and there have been many times I've read where people just don't listen for some reason, as if a difference in code or efficiency you, or jdMorgan, or (rarely anymore) myself, or some others point out is the same as the difference in these:

../the-path-to/a-file.html
/dir/the-path-to/a-file.html
http://www.example.com/dir/the-path-to/a-file.html

<base href="http://www.example.com/">
dir/the-path-to/a-file.html

It seems to me too many people treat mod_rewrite like it's HTML or CSS or something and if it doesn't 'validate', well whatever... It works, so it must be right.

RewriteRule ^(.*)/(.*)/(.*)/(.*).html /stuffhere.php?1=$1&2=$2&3=$3&4=$4 [L]

The preceding rule really isn't too bad, is it? I get the right information to my file and it works and my site seems fine, so it must be okay... I don't understand what ([^/]+) does, and what I have works, so I'll keep using it, thanks.
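For comparison, a tighter version of that rule could be sketched as follows. It assumes the same four-segment URL format and the same stuffhere.php target as the rule above, and a root .htaccess file. Whether it fits a given site depends on the exact URLs that site uses.

```apache
# Sketch only: same four-segment URL and target as the rule above.
# ([^/]+) matches one or more characters that are NOT a slash, so each
# group stops at the next "/" instead of grabbing the rest of the URL
# and backtracking the way the greedy (.*) does.
# \. matches a literal dot (a bare . matches any character), and the
# trailing $ requires the URL to end in .html.
RewriteRule ^([^/]+)/([^/]+)/([^/]+)/([^/]+)\.html$ /stuffhere.php?1=$1&2=$2&3=$3&4=$4 [L]
```

Both rules hand the same values to the script for a well-formed URL. The difference is how much work the regular expression engine does, and how many malformed URLs the looser pattern also accepts.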

##### @ ##### @ ######
Glaring Example of a 'Tiny Error' or Two Below
##### @ ##### @ ######

I posted about this earlier, and it was one of the things that 'triggered' my little RTBM (Read the Bleeping Manual) rant:

One of the sites I used to work on and started back with about 3 months ago had two 'small' errors my successor (and now predecessor) made in the .htaccess file. One was 'tiny' and the other just added processing to the server, which is an issue but doesn't break anything.

The One Small Error happened to rewrite requests to the wrong directory, which happened to serve 'wadgeting-places' rather than 'widget-help', which happened to duplicate about 500 pages in two directories and omit 1000 pages from the site, and subsequently happened to tank the rankings. I fixed the issue 3 months ago, after it had been in place for at least 6 months without being noticed, and the site is just now starting to come back in the rankings and get sorted out by the search engines.

The error was in ONE rule. ONE line of code out of 320+ in the entire file was wrong. Just ONE Line. No joke, they made ONE little error in ONE LINE OF CODE, because they didn't pay quite enough attention to the details and maybe forgot to empty their browser cache when they tested, or maybe didn't read the page (the directories look the same 'at a glance', and the text is similar looking, yet unique) and a single, simple, what was probably a copy + paste, then edit, error tanked the site. Honestly, it was an easy mistake to make, and easy to not notice when checking at-a-glance to make sure the page loaded and looked right.

The 'non-critical' error the person made was they added:
(\.php)? to every rule in the file after rule 20 or so (it's a 320 line file) and line number 10 (rule number 3) is (and has been for years):

RewriteCond %{THE_REQUEST} \.php
RewriteRule \.php - [G]

The addition of (\.php)? did not do anything, could not have done anything, and had to be made completely out of ignorance and lack of understanding, because there was absolutely no reason for the addition to be made. A single request of a .php URL on the site would have shown you cannot access a .php URL with a direct request. It's not allowed, so trying to match .php in a requested location on the left side of the rule was absolutely unnecessary and futile.
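Here is a rough sketch of why the addition was dead weight. The first two lines are the rule quoted above. The widget-help rule and the help.php script are hypothetical stand-ins, because the real rules from that file are not shown in this thread.

```apache
# THE_REQUEST holds the request line exactly as the client sent it,
# for example "GET /widget-help/foo.php HTTP/1.1", and it is not changed
# by internal rewrites. Any direct client request containing .php is
# answered with 410 Gone right here.
RewriteCond %{THE_REQUEST} \.php
RewriteRule \.php - [G]

# Hypothetical later rule as it was edited. The optional (\.php)? group
# can never matter, because a client request containing .php never gets
# past the rule above:
# RewriteRule ^widget-help/([^/]+)(\.php)?$ /help.php?page=$1 [L]

# The same hypothetical rule without the pointless group:
RewriteRule ^widget-help/([^/]+)$ /help.php?page=$1 [L]
```

The internal rewrite to /help.php is not blocked by the first rule, because THE_REQUEST still holds the extensionless URL the client actually asked for.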

Again, those of you who take the time to read this, please, take the time to learn what you are doing, know what your rules do, know what mod_rewrite does, and don't use it until you understand what everything in your file does, because the example I just posted is true and did a huge amount of damage to a real, working, monetized website.

I'm actually ranting a bit for the good of the people who come here for help and take the time to read, because this is not a module to take lightly or 'play around' with. If I were making a boat analogy: CSS and HTML would be a rowboat and a dinghy respectively, JavaScript would be a ski boat, PHP a speed boat, and Mod_Rewrite, a hydroplane... Flip it and you're done.