Moderators: Ocean10000 & incrediBILL & phranque

Apache Web Server Forum

    
RewriteRules, non-absolute URLs in HTML, and trailing slashes
NotionCommotion




msg:4650729
 3:04 pm on Mar 3, 2014 (gmt 0)

Hello all!

For me, Apache + regex = confusion, and I'm hoping someone can help out.

I have the following rule:
RewriteRule ^(testit)/?$ test.php?p=$1 [L,QSA]

It is my understanding that it looks for "testit" plus 0 or 1 forward slashes and redirects to "test.php?p=testit". If there is anything else in the query, the QSA flag means it will be appended, and the L flag means don't try doing any more matches. Seem accurate?

So I created a file containing <!DOCTYPE html><html><body><img src="manual.png"></body></html>, uploaded it as /var/www/html/test.php, and also uploaded /var/www/html/manual.png.

I then put http://www.myDomain.com/testit in the browser, the page is rendered, and manual.png is displayed.

So far, so good. But then I put http://www.myDomain.com/testit/ (note the trailing slash) in the browser, the page is rendered, but manual.png is a broken link.

I have found that I could resolve the problem by making all URLs in my HTML absolute instead of relative, but I really don't want to do that. I believe the problem is based on the client thinking it is at
http://www.myDomain.com/testit, so it requests the image from http://www.myDomain.com/testit/manual.png which obviously doesn't exist. Why this only poses a problem when the main page has a trailing slash, I am uncertain.

Can anyone help? Thank you

 

lucy24




msg:4650739
 4:50 pm on Mar 3, 2014 (gmt 0)

> It is my understanding that it looks for "testit" plus 0 or 1 forward slashes and redirects to "test.php?p=testit".

Not "redirects". Only "rewrites". Crucial difference. A redirect means the browser makes a fresh request, as shown in your logs, and their address bar changes. A rewrite means that you quietly serve content from the target URL; neither your logs nor the user can see it happening.

Other than that, your translation is spot-on. Except, ahem, the target should start in / slash. (This is safer than using a RewriteBase, for reasons g1smd will explain.)
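
A minimal sketch of the two ways to write the target, assuming the rules live in a .htaccess file in the document root. With a relative target, mod_rewrite resolves the result against the per-directory base (RewriteBase, if set), so the outcome depends on that setting. With a leading slash, the target is stated as a URL-path from the site root and RewriteBase plays no part. This is only one plausible reading of why the leading slash is preferred, not necessarily the explanation alluded to above.

```apache
# .htaccess in the document root (assumed location).
# Use one variant, not both.

# Variant 1: relative target. The final URL depends on the per-directory
# base, so the rule only behaves as intended while RewriteBase matches
# the URL under which this directory is served.
RewriteBase /
RewriteRule ^(testit)$ test.php?p=$1 [L,QSA]

# Variant 2: target starting with a slash. The target is spelled out
# from the site root and does not rely on RewriteBase at all.
# RewriteRule ^(testit)$ /test.php?p=$1 [L,QSA]
```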

I assume the capture is just for illustration purposes, since there would normally be no point to capturing literal text with no variation.

Your interpretation is also spot-on. (Edit: Well, it's actually backward, but I think there was a typo.)

example.com/testit/
= browser thinks it is in /testit/ directory, and asks for any relative links based on that belief.

example.com/testit
= browser thinks it is in root / directory, and et cetera as above.

Fortunately the solution is simple, and it concurrently avoids the Duplicate Content issue you'd be creating with that optional end slash. Decide which URL you want: with or without. Personally I'd go "without", using an extensionless URL rather than a fake directory. Happily this agrees with the physical location of the target file, so all supporting files now display as intended.

You will see arguments in favor of using absolute URLs consistently. Personally I find the opposite is better in many cases. If there's a package, where a page and its supporting files always stay together but the whole package might move, it makes much more sense to use relative URLs. Save the absolute links for things that aren't part of the package, or that are in other directories.

g1smd




msg:4650757
 5:37 pm on Mar 3, 2014 (gmt 0)

I would also recommend adding some preceding rules:
- one that redirects requests for test.php?p=(something) to www extensionless URL
- one that redirects "with slash" to www without slash
- one that redirects all other non-www to www, preserving requested path
and ensure the site links to URLs without trailing slash. URLs with trailing slash denote a folder or index page of a folder.

The first two rules will each need a preceding RewriteCond testing THE_REQUEST and the third will need a preceding RewriteCond testing HTTP_HOST. These are to prevent an infinite redirect loop.
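
A rough sketch of how those three redirects might look in a root .htaccess file. The hostname www.example.com is a placeholder, the patterns are assumptions based on the test.php example in this thread, and the !-d condition is an extra that is not part of the advice above. Treat it as a starting point to test, not a finished ruleset.

```apache
RewriteEngine On

# 1. Direct client request for /test.php?p=name -> extensionless URL.
#    THE_REQUEST holds the original request line, for example
#    "GET /test.php?p=testit HTTP/1.1". Internal rewrites do not change
#    it, so this rule cannot loop with the rewrite that maps /testit
#    back to /test.php. The trailing "?" drops the old query string.
RewriteCond %{THE_REQUEST} ^[A-Z]+\s/test\.php\?p=([^&\s]+)
RewriteRule ^test\.php$ http://www.example.com/%1? [R=301,L]

# 2. "With slash" -> "without slash".
#    The !-d test leaves real directories alone.
RewriteCond %{THE_REQUEST} ^[A-Z]+\s/[^?\s]*/[?\s]
RewriteCond %{REQUEST_FILENAME} !-d
RewriteRule ^(.+)/$ http://www.example.com/$1 [R=301,L]

# 3. Any other hostname -> www, preserving the requested path.
RewriteCond %{HTTP_HOST} !^www\.example\.com$ [NC]
RewriteRule ^(.*)$ http://www.example.com/$1 [R=301,L]
```

The redirects would sit before the internal rewrites, so that the visible URL is settled first and the rewrite only ever sees the canonical form.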

Your original rule (which is a rewrite rather than a redirect) becomes:
RewriteRule ^(testit)$ /test.php?p=$1 [L,QSA]
(three changes).

Make sure that all links to css, js and images always begin with a slash and specify the full path to the file. This is a crucial step once you start using rewrites.

Your PHP script now becomes responsible for returning 404 responses for non-valid requests that happen to be rewritten to be handled by the script. At the beginning of the script, test for a valid page name; for any non-valid request, return a 404 HEADER and INCLUDE the 404 page.

Leave a blank line after every rule for clarity. Comment every rule in plain English describing what it does. With 90 000 threads in this forum, there's tons of example code to get ideas from.

NotionCommotion




msg:4650784
 7:05 pm on Mar 3, 2014 (gmt 0)

Thank you Lucy and g1smd!

> Not "redirects". Only "rewrites". Crucial difference.
Good point.

> Except, ahem, the target should start in / slash. (This is safer than using a RewriteBase, for reasons g1smd will explain.)
Why so? Should I never use RewriteBase?

> I assume the capture is just for illustration purposes, since there would normally be no point to capturing literal text with no variation.
My actual rules are shown below. My "intent" is to have the user see pretty URLs in their browser such as http://www.myDomain.com/html1, http://www.myDomain.com/html2, http://www.myDomain.com/get-started, and http://www.myDomain.com/contact-us. html1 and html2 are just normal HTML pages located in my root directory, and should be rewritten to have the .html extension. Get-started and contact-us are two special cases which need PHP support, and might also need to pass some special data when used like http://www.myDomain.com/get-started/edit/123. Does this make any sense?

> Fortunately the solution is simple, and it concurrently avoids the Duplicate Content issue you'd be creating with that optional end slash. Decide which URL you want: with or without. Personally I'd go "without", using an extensionless URL rather than a fake directory. Happily this agrees with the physical location of the target file, so all supporting files now display as intended.
I would like to go without end slashes.

> You will see arguments in favor of using absolute URLs consistently. Personally I find the opposite is better in many cases.
More on this in a bit.

> I would also recommend adding some preceding rules:
> - one that redirects requests for test.php?p=(something) to www extensionless URL
> - one that redirects "with slash" to www without slash
> - one that redirects all other non-www to www, preserving requested path
> and ensure the site links to URLs without trailing slash. URLs with trailing slash denote a folder or index page of a folder.
Please see my actual rules below. Can you provide some specific pointers?

> Make sure that all links to css, js and images always begin with a slash and specify the full path to the file. This is a crucial step once you start using rewrites.
This is contrary to Lucy's position. I would rather not have to do so. Is there a workaround?

> Your PHP script now becomes responsible for returning 404 responses for non-valid requests that happen to be rewritten to be handled by the script.
Good point. Will do. I only need to do this for get-started and contact-us, right?

Thanks again!

My actual rules:
<IfModule mod_rewrite.c>

RewriteEngine On

RewriteBase /

## If the request is for a valid directory, file, or link, don't do anything
RewriteCond %{REQUEST_FILENAME} -d [OR]
RewriteCond %{REQUEST_FILENAME} -f [OR]
RewriteCond %{REQUEST_FILENAME} -l
RewriteRule ^ - [L]

# Remove the trailing slash. I haven't really figured out this part.
RewriteRule (.+)/$ $1

# replace my-page/my-controller/data with corporate.php?p=my-page&c=my-controller&d=data
RewriteRule ^(contact-us|get-started)/([^/]+)/([^/]+)/?$ corporate.php?p=$1&c=$2&d=$3 [L,QSA]
# replace my-page/my-controller with corporate.php?p=my-page&c=my-controller
RewriteRule ^(contact-us|get-started)/([^/]+)/?$ corporate.php?p=$1&c=$2 [L,QSA]
# replace my-page with corporate.php?p=my-page
RewriteRule ^(contact-us|get-started)/?$ corporate.php?p=$1 [L,QSA]

# Replaces file if "." is not in the string (i.e. it will not replace file.html, but will replace file)
RewriteRule ^([^.]+)$ $1.html [L]

</IfModule>

lucy24




msg:4650806
 10:24 pm on Mar 3, 2014 (gmt 0)

> RewriteRule (.+)/$ $1

If your filepaths contain no literal periods, you can do it without the dreadful necessity of a -d test:
RewriteRule ^([^.]+[^./])/+$ /$1 [L]
If you do have literal periods in directory names (I do not even want to consider the possibility of filenames with extra periods) it becomes
RewriteRule ^(([^/]+/)*[^./]+)/+$ /$1 [L]

NotionCommotion




msg:4650841
 1:14 am on Mar 4, 2014 (gmt 0)

Thanks Lucy. No, the filepath will not include any literal periods. Could you describe what your first rule is doing? Thanks

lucy24




msg:4650898
 5:41 am on Mar 4, 2014 (gmt 0)

###

I don't know what I was thinking. This rule only works if you don't actually have any directories. (It then means: "Starting at the beginning of the requested URLpath and continuing to the end, capture all non-periods. Also omit the final slash. If you meet a period along the way, or if the request doesn't end in a slash, the rule fails.")

Time out while I figure out what I meant to say. Maybe it's best if phranque just quietly deletes these last few posts.

###

g1smd




msg:4650942
 8:33 am on Mar 4, 2014 (gmt 0)

> #remove the trailing slash. I haven't really figured out this part.
> RewriteRule (.+)/$ $1

You'll need protocol and hostname on the rule target and the [R=301,L] flags here to ensure this is a redirect.


The rule target in the final rewrites should begin with a slash.

The three rules beginning
RewriteRule ^(contact-us|get-started)/([^/]+)/([^/]+)/?$ corporate.php?p=$1&c=$2&d=$3 [L,QSA]
simplify to
RewriteRule ^(contact-us|get-started)(/([^/.]+)(/([^/.]+))?)?$ /corporate.php?p=$1&c=$3&d=$5 [L,QSA]

NotionCommotion




msg:4651089
 6:39 pm on Mar 4, 2014 (gmt 0)

> You'll need protocol and hostname on the rule target and the [R=301,L] flags here to ensure this is a redirect.
Please elaborate.

> The rule target in the final rewrites should begin with a slash.
>
> The three rules beginning
> RewriteRule ^(contact-us|get-started)/([^/]+)/([^/]+)/?$ corporate.php?p=$1&c=$2&d=$3 [L,QSA]
> simplify to
> RewriteRule ^(contact-us|get-started)(/([^/.]+)(/([^/.]+))?)?$ /corporate.php?p=$1&c=$3&d=$5 [L,QSA]

Is my interpretation correct?

  • If "contact-us" or "get-started" is found, store it in $1.
  • Gobble up any characters until a "/" is found, and store them in $3.
  • Store "/" plus $3 in $2. What if "/" isn't found?
  • Continue gobbling up any characters until a "/" is found, and store them in $5.
  • Store "/" plus $5 in $4.
  • Rewrite as you have shown.

g1smd




msg:4651101
 6:59 pm on Mar 4, 2014 (gmt 0)

The logic here is simply that a match has been found and something placed into $1, so why not continue rather than exit this rule and proceed to match the same stuff into $1 again in the very next rule?

The question marks at the end of the pattern cater for cases where $5 will be blank or $3 and $5 will both be blank. $2 and $4 are not used. The pattern could also be modified from ( to (?: in two places to suppress backreference generation, so that the target uses $1, $2 and $3 again:

RewriteRule ^(contact-us|get-started)(?:/([^/.]+)(?:/([^/.]+))?)?$ /corporate.php?p=$1&c=$2&d=$3 [L,QSA]
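
To make the backreferences concrete, here is how the non-capturing version would treat a few sample requests. The sample segments (edit, 123, v1.2) are illustrative, loosely based on the URLs mentioned earlier in the thread, and the resulting targets are worked out by hand from the pattern rather than taken from a live server.

```apache
# Request path              Internal target
# /contact-us               /corporate.php?p=contact-us&c=&d=
# /get-started/edit         /corporate.php?p=get-started&c=edit&d=
# /get-started/edit/123     /corporate.php?p=get-started&c=edit&d=123
# /get-started/edit/123/    no match (trailing slash), rule is skipped
# /get-started/v1.2         no match ([^/.]+ rejects the period)
RewriteRule ^(contact-us|get-started)(?:/([^/.]+)(?:/([^/.]+))?)?$ /corporate.php?p=$1&c=$2&d=$3 [L,QSA]
```

A group that did not take part in the match expands to an empty string, which is why c and d can arrive empty at the script.
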
lucy24




msg:4651150
 9:29 pm on Mar 4, 2014 (gmt 0)

> What if "/" isn't found?

If there is no / immediately after (contact-us|get-started) then parts 2 and, by necessity, 3 are empty.

Since parts 2 and 3 are identical, I don't think you need the additional nesting; you can ? each one separately. Unless mod_rewrite uses a very weird RegEx engine. (I do not exclude this possibility.)

RewriteRule ^(contact-us|get-started)(?:/([^/.]+))?(?:/([^/.]+))?$ /corporate.php?p=$1&c=$2&d=$3 [L,QSA]

Note that ?: "no-capture" doesn't mean "ignore this part". It simply means "don't assign it a separate number". You see it more often on internal groups, like the common
((?:[^/]+/)*)

Edit: Would these URLs even have a query string necessitating a QSA? I thought the whole point was to be "friendly"; if so, there's no need for the flag, though it won't do any harm.

g1smd




msg:4651182
 10:58 pm on Mar 4, 2014 (gmt 0)

I think it is better to nest, as this ensures correct operation. The final part can match only if all preceding parts have matched.

NotionCommotion




msg:4651198
 1:01 am on Mar 5, 2014 (gmt 0)

Thanks. Let me digest this...

> Edit: Would these URLs even have a query string necessitating a QSA? I thought the whole point was to be "friendly"; if so, there's no need for the flag, though it won't do any harm.
Only need to be friendly up to a point! I use a similar URL for ajax calls which might need additional information from the client.

g1smd




msg:4651254
 7:10 am on Mar 5, 2014 (gmt 0)

> Would these URLs even have a query string necessitating a QSA? I thought the whole point was to be "friendly"; if so, there's no need for the flag, though it won't do any harm.

If I have a page containing a sortable table of data, I might attach a ?sortorder= parameter to the URL. The canonical URL will be the one without this parameter attached.
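
As an illustration of what [QSA] changes, assume the combined rule from earlier in the thread and a hypothetical request for /get-started/edit?sortorder=asc. The value asc and the resulting targets are worked out by hand for this sketch.

```apache
# With [QSA], the client's query string is appended to the one in the target:
#   /get-started/edit?sortorder=asc
#     -> /corporate.php?p=get-started&c=edit&d=&sortorder=asc
RewriteRule ^(contact-us|get-started)(?:/([^/.]+)(?:/([^/.]+))?)?$ /corporate.php?p=$1&c=$2&d=$3 [L,QSA]

# Without [QSA], a target that carries its own query string replaces the
# client's, so sortorder=asc would be lost:
#     -> /corporate.php?p=get-started&c=edit&d=
# RewriteRule ^(contact-us|get-started)(?:/([^/.]+)(?:/([^/.]+))?)?$ /corporate.php?p=$1&c=$2&d=$3 [L]
```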