Rewriterule for subdirectory name to php argument

         

santrix

7:22 pm on Mar 2, 2009 (gmt 0)

10+ Year Member



Hi everyone... I normally find things out through reading, but this problem has me totally fed up - and I don't seem to be able to find a good reference anywhere that doesn't make a million assumptions about my knowledge (which is not great in this area).

I have the following

http://example.com/news/index.php?title=friendly-page-title

and I want to access it via:

http://example.com/news/friendly-page-title
or
http://example.com/news/friendly-page-title/

I have tried putting /news/.htaccess with the following:

Options +FollowSymLinks
RewriteEngine on
RewriteRule /(.*)/$ index.php?title=$1

Can someone put me out of my misery? I've wasted hours on this...

Many thanks for any guidance.

[edited by: jdMorgan at 4:05 pm (utc) on Mar. 3, 2009]
[edit reason] example.com [/edit]

g1smd

8:45 pm on Mar 2, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



This same question, or something very similar, has been asked several times in the last few days.

Check recent threads for a whole chunk of information and code.

For starters, remove the leading / from the pattern, and add [L] to the end.

However, there's a whole lot more you will need to do to finish the whole job.
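
A minimal sketch of that first fix, assuming the file stays in /news/.htaccess as in the original post. In a per-directory .htaccess file, the path that RewriteRule tests has the directory prefix (here /news/) stripped, so it never begins with a slash. That is why a pattern with a leading / cannot match there. This is only the starting point, not the finished job: it still handles only the trailing-slash form of the URL.

```apache
# /news/.htaccess (location assumed from the original post)
Options +FollowSymLinks
RewriteEngine on
# No leading slash: a request for /news/friendly-page-title/
# is tested here as "friendly-page-title/"
RewriteRule (.*)/$ index.php?title=$1 [L]
```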

santrix

8:59 pm on Mar 2, 2009 (gmt 0)

10+ Year Member



OK, I think I'm getting closer. I now have this in /news/.htaccess:

Options +FollowSymlinks
RewriteEngine on
RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
RewriteRule ^(.*)$ index.php?title=$1 [L]

/news/friendly-title

now works, and gets passed as

/news/index.php?title=friendly-title

but two things I can't work out:

1.) if there is a trailing slash, then it gets passed through to the title argument. Is there a way to remove it without php having to look for it?

2.) if I remove the RewriteCond statements then it all fails to work? This I don't understand, as all other examples I have seen don't seem to rely on a RewriteCond preceding the RewriteRule.

g1smd

9:24 pm on Mar 2, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



It is likely that the 'not work' is an infinite loop. There are several ways to solve that.

.

For the trailing slash problem set up some redirects before the rewrite.

1. Redirect any request with trailing slash for both www and non-www to remove the trailing slash and force the www to be added if it is missing. You now have a canonical URL for your content.

2. Redirect non-www to www for all other non-www requests.

3. Place the rewrite after these redirects.

santrix

9:32 pm on Mar 2, 2009 (gmt 0)

10+ Year Member



OK... I'm sorry for sounding a bit green. I can't see how I am causing a loop. I also don't understand your statements about modifying the hostname, as I thought this is not dealt with by RewriteRule?

I now have

Options +FollowSymlinks
RewriteEngine on
RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
RewriteRule ^([^./]+)$ /news/$1/ [L]
RewriteRule ^(.*)/$ /news/index.php?url=$1 [L]

And it appears to work fine... Although I have read up on regex, I still don't fully understand ^([^./]+)$

So, what actually happens is, if a trailing slash is missing then the first RewriteRule fires, ceases any further processing and passes the request (now with the trailing slash) back to apache?

g1smd

9:38 pm on Mar 2, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



You absolutely do not want two URLs to resolve to the same content.

If someone requests the wrong URL, you want to hard redirect them to the correct URL with a 301 redirect. That makes the user agent 'see' a new URL for the content.

Once they come back requesting the correct URL, and only then, fire the rewrite to get that content.
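
The difference between the two actions can be sketched as follows. This assumes the rules live in /news/.htaccess, that the slash-less form is chosen as the canonical URL, and that www.example.com is the canonical hostname. All three are assumptions for illustration.

```apache
# External redirect: the browser receives a 301 response and
# makes a second request for the new URL
RewriteRule ^([^./]+)/$ http://www.example.com/news/$1 [R=301,L]

# Internal rewrite: nothing is sent back to the browser yet;
# Apache maps the requested URL to the script behind the scenes
RewriteRule ^([^./]+)$ index.php?url=$1 [L]
```

Only the second rule serves content, and it matches only the canonical form, so each page is reachable at one URL.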

g1smd

9:42 pm on Mar 2, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



As for the loop when using this code:

RewriteRule ^(.*)$ index.php?title=$1 [L]

Ask for a URL like example.com/12345 and see what happens.

The rewrite will rewrite this to index.php?title=12345 - so far so good.

The index.php still matches (.*) so gets rewritten to index.php?title=index.php&title=12345

The index.php still matches (.*) so gets rewritten to index.php?title=index.php&title=index.php&title=12345

The index.php still matches (.*) so gets rewritten to index.php?title=index.php&title=index.php&title=index.php&title=12345

... and so on, forever.

That code would also feed requests for robots.txt through to index.php?title=robots.txt and I am quite sure your script would not deliver something that looked like a robots.txt file.

You need the (.*) to be more selective in what can actually trigger the rewrite.
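
One way to make the pattern more selective, sketched under the assumption that page titles never contain a dot or a slash and that the rule lives in /news/.htaccess:

```apache
# /news/.htaccess (assumed location)
RewriteEngine on
# [^./]+ means "one or more characters that are neither a dot nor a slash",
# so index.php and robots.txt cannot match and are left alone
RewriteRule ^([^./]+)$ index.php?title=$1 [L]
```

Because the rewritten target index.php contains a dot, it can no longer trigger the same rule on the next pass.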

santrix

9:52 pm on Mar 2, 2009 (gmt 0)

10+ Year Member



g1smd, thanks for your advice... I think I follow...

I thought the [L] would have stopped that but obviously not. I need to understand how the headers bounce back and forth from the browser to the server.

Once a rewrite occurs, and the [L] is encountered, then the modified URL goes where? direct to apache? And then it all happens again - i.e. the rewrite rules are all parsed again?

Sorry to sound thick... I see this is going to be one of linux's steeper learning curves...

I have thought about what you say, and now have this:

Options +FollowSymlinks
RewriteEngine on
RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
# make them go away and get the right URL?
RewriteRule ^([^./]+)$ /news/$1/ [R=301,L]
# only the trailing slash version gets served up!
RewriteRule ^(.*)/$ /news/index.php?url=$1 [L]

Does this pass the sanity test? It seems to work fine.

Steve

g1smd

9:58 pm on Mar 2, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



The .htaccess file is processed again and again, until nothing in there matches anything that needs doing, so it is very easy to inadvertently cause an infinite loop.

A RewriteCond or group of conditions applies only to the single rule which follows, so that code is not correct. Don't insert a redirect in the middle like that.

You need to place redirects before the pre-existing rewrite code that you already had. As I stated above you also need to add the domain name to the redirect to force www at the same time. As you have it now, both www and non-www will directly serve content. You need the redirect to force only one of the two to be the place to directly get the content from.

For extensionless URLs you should be redirecting to "without slash". A URL with a slash is supposed to be a folder.

You'll also need the other redirect that I mentioned. Pay attention to the order they need to appear in. See the 1-2-3 list above.
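
The point about condition scope can be sketched like this, assuming /news/.htaccess and www.example.com as the canonical hostname (both assumptions). It covers only the slash redirect and the rewrite, not the separate non-www redirect.

```apache
# These two conditions guard ONLY the rule immediately below them
RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
RewriteRule ^([^./]+)/$ http://www.example.com/news/$1 [R=301,L]

# This rule has no conditions of its own unless they are repeated here
RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
RewriteRule ^([^./]+)$ /news/index.php?url=$1 [L]
```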

santrix

8:20 am on Mar 3, 2009 (gmt 0)

10+ Year Member



g1smd, thanks again for the pointers... It's excruciating, trying to bone up on regex as well as understanding the idiosyncrasies of mod rewrite...

I now have the following in the /news/.htaccess

Options +FollowSymlinks
RewriteEngine on

RewriteCond %{HTTP_HOST} !^www\.example\.co\.uk [NC]
RewriteRule ^(.*)$ http://www.example.co.uk/news/$1 [R=301,L]

RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
RewriteRule ^([^./]+)/$ /news/$1 [R=301,L]

RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
RewriteRule ^(.*)$ /news/index.php?url=$1 [L]

It appears to be doing what I want... I actually have the canonical name redirection in the virtual root above the news directory. I tried using inherit, but the redirection always ended up pointing at the root, and putting in the physical path... I gave up and just put it in the /news/.htaccess as well.

Do I pass the noobie exam yet?

[edited by: jdMorgan at 4:20 pm (utc) on Mar. 3, 2009]
[edit reason] example.co.uk [/edit]

santrix

9:46 am on Mar 3, 2009 (gmt 0)

10+ Year Member



Update... It seemed silly to have a .htaccess file in the /news/ subdirectory, so I have now put it all together in the /.htaccess file in the root of the site:

Options All -Indexes
DirectoryIndex index.php index.htm
RewriteEngine On
RewriteCond %{HTTP_HOST} !^www\.example\.co\.uk [NC]
RewriteRule ^(.*)$ http://www.example.co.uk/$1 [R=301]
RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
RewriteRule ^news/([^./]+)/$ /news/$1 [R=301,L]
RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
RewriteRule ^news/(.*)$ /news/index.php?url=$1 [L]

Hopefully this is the (well, one at least) correct way of doing things - it appears to work fine.

Steve

[edited by: jdMorgan at 4:20 pm (utc) on Mar. 3, 2009]
[edit reason] example.co.uk [/edit]

jdMorgan

4:19 pm on Mar 3, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



It may work fine, but try requesting a virtual URL with a trailing slash from the "wrong" domain. You will see that you get two "chained" redirects, and that is not good, SEO-wise.

A good rule of thumb is to put your rules in order with external redirects first, ordered from most-specific pattern (fewest URLs affected) to least-specific pattern, followed by internal rewrites, again from most-specific to least-specific.


Options All -Indexes
DirectoryIndex index.php index.htm
RewriteEngine on
#
# Externally redirect to remove trailing slashes from virtual /news/ URLs
RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
RewriteRule ^news/([^./]+)/$ http://www.example.co.uk/news/$1 [R=301,L]
#
# Externally redirect to canonical hostname
RewriteCond %{HTTP_HOST} !^www\.example\.co\.uk$
RewriteRule (.*) http://www.example.co.uk/$1 [R=301,L]
#
# Internally rewrite /news/ URLs to index.php script with query string
RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
RewriteRule ^news/([^./]+)?$ /news/index.php?url=$1 [L]

I also suggest that you comment your code, so that you can go back to it next month or next year, and not wonder about its subtleties...

Note the minor tweaks involving anchoring, hostname casing, the [L] flag on the domain redirect, and the more-specific internal rewrite pattern, which mirrors the pattern used in the external redirect.

It is good to make RewriteRule patterns as specific as possible, especially when using slow, inefficient file- and directory-exists checking (which involve calling the filesystem, and may even require disk accesses). When possible, it is better to use characteristics of the requested URL itself to avoid the necessity of file- and directory-exists checking.
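
A sketch of that last idea, assuming no real extensionless files or subdirectories live directly under /news/ (an assumption about the site, not something stated in this thread). With that guarantee, the pattern alone identifies virtual URLs, and the -f and -d checks can be dropped from the internal rewrite:

```apache
# Internally rewrite /news/ URLs using the URL pattern alone
# (assumes every real file under /news/ has a dot in its name)
RewriteRule ^news/([^./]+)$ /news/index.php?url=$1 [L]
```

Real files such as index.php contain a dot, so they never match and are served normally.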

Jim

g1smd

11:22 pm on Mar 3, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I have been out all day, so jd got there first.

It's a good thing to remember - always state the full domain name when doing a 301 redirect and always add [L] to the end of every rule.

santrix

7:07 am on Mar 4, 2009 (gmt 0)

10+ Year Member



g1smd (old class c amateur radio callsign?) and jdMorgan, many thanks for all of your help and time - it's greatly appreciated. Sorry for being such a numbskull! I have spent a few hours reading up and now I think I "get" the script I have arrived at (well, the final version that jd proposed)...

The regex ^news/([^./]+)?$ - let me have a go at explaining this to see if I have it right...

^news/ # match anything starting with news/

( )? # everything in the parentheses either not at all, or only once. Also, whatever is in the parentheses is the pass back argument $1, used later in the replacement statement

[^./]+ # a character class that says match any character that isn't a dot or a slash. The following + sign means match it one or more times. It took me a while to realise the dot in the character class is a literal, and not a metacharacter!

Do I now get a gold (noob) star?

Steve (aged 41, and rapidly wondering if he's getting too old for this crap!)
once upon a time g1kad on 144 MHz

jdMorgan

1:54 pm on Mar 4, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



^news/ # match anything starting with news/

( )? # everything in the parentheses either not at all, or only once. Also, whatever is in the parentheses is the pass back argument $1, used later in the replacement statement

Correct. Note that "pass back" is more-formally termed "back-reference."

[^./]+ # a character class that says match any character that isn't a dot or a slash. The following + sign means match it one or more times.

Correct: "Match one or more characters not equal to a period or a slash."

It took me a while to realise the dot in the character class is a literal, and not a metacharacter!

Escaping rules vary within and outside [alternate-character groups], with fewer characters needing to be escaped within groups than without.
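
A short illustration of the difference, using archive.html as a purely hypothetical filename:

```apache
# Inside a character class the dot is literal:
# [^./] excludes "." and "/" with no backslash needed
RewriteRule ^news/([^./]+)$ /news/index.php?url=$1 [L]

# Outside a class an unescaped dot matches any single character,
# so a literal dot is written as \.
# (archive.html is a hypothetical name used only for illustration)
RewriteRule ^news/archive\.html$ /news/index.php?url=archive [L]
```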

Jim

santrix

2:18 pm on Mar 4, 2009 (gmt 0)

10+ Year Member



That deserves a pint... :) Thanks again for your help.

g1smd

8:24 pm on Mar 4, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Class B actually; class of '85. Too old at 41? I'm 42, so knock it off already. :)

santrix

8:37 pm on Mar 4, 2009 (gmt 0)

10+ Year Member



Got mine in '83... Never did the morse for that experimental license. I'm amazed. I wonder if I can get my old G1 license back now that it has lapsed? 42, eh? There's hope for me yet! Cheers again.

g1smd

8:49 pm on Mar 4, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Yeah, it's about 30 quid to reactivate, one-off payment for life.