I got something working using mod_rewrite:
Old link on index page
<li><a href="?act=6">Why they went</a></li>
New link
<li><a href="emigration-reasons.html">Why they went</a></li>
With associated rewrite rule in .htaccess in file directory:
RewriteRule ^emigration-reasons.html index.php?act=6 [NC]
Later I will add the step for the spiders and others still following the old ?act=xx links as suggested by jdMorgan [webmasterworld.com...]
There is a specific problem, though. Not all pages and links have been “rewritten” yet, and I am testing on a home server with Apache 2.2 and PHP 5.2.5.
The rule works for the pages that are rewritten, including going from one new URL to the next and back. However, once ANY rewritten-URL page is visited, trying to follow an old ?act=x link fails. The ?act=x is appended to the last rewritten URL, but the page stays on the rewritten-URL page, like www.example.org/emigration-reasons?act=4. From that moment on, I can only follow links between rewritten URLs; nothing else works. That problem will probably go away once everything is rewritten, but I want to understand how things work (and especially why they don't).
I tried some RewriteCond directives, but they did not work. I now realize that I somehow need to make the server start again from the basic www.example.org address with every new request or link clicked, and drop the “rewritten-nice-words.html” part first. I have no idea how to code that, or whether that is even the problem. Any ideas?
Relative-link resolution is done by the browser based on its current page location, so if the browser is displaying the page "/index.php" --that is, "/index.php" appears in the address bar-- then this relative link will work.
But if "/emigration-reasons.html" appears in the address bar, and the relative <a href> link is "?act=5", then the browser will construct the link as "emigration-reasons.html?act=5" and send that to your server. Once it gets to the server, mod_rewrite will match the "emigration-reasons.html" and replace the browser-specified "act=5" query with the one specified in the rule. So, the browser-specified query string essentially gets ignored, and you therefore seem to be "stuck on the same page".
You're right in that the problem will go away once all dynamic links are replaced. In the meantime, consider changing the remaining dynamic links to "<a href="index.php?act=5">Who they were</a>" since the browser will then recognize that a new page is being referenced (because "index.php" is not the same as "emigration-reasons.html") and will then build the relative links as "/index.php?act=5".
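The sequence described above can be traced against the posted rule. This is only an illustrative sketch: the comments describe what the browser and mod_rewrite do with a query-only link, and the last comment restates the interim fix suggested above.

```apache
# The browser is showing /emigration-reasons.html and the user follows <a href="?act=5">.
# A query-only relative link keeps the current path, so the browser requests:
#   /emigration-reasons.html?act=5
#
# The pattern is matched against the path only ("emigration-reasons.html"),
# so the rule fires:
RewriteRule ^emigration-reasons\.html$ index.php?act=6 [L]
#
# The substitution carries its own query string and the rule has no [QSA] flag,
# so the requested "act=5" is discarded and the server runs index.php?act=6.
# The same page is shown again.
#
# Interim fix: name the script in the link, e.g. href="index.php?act=5".
# The request then becomes /index.php?act=5, which this rule does not match.
```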
One note: Although it won't affect the problem you're asking about, be sure to escape literal periods in your patterns, and to start- and end-anchor them; this eliminates ambiguity, prevents duplicate-content problems, and promotes pattern-matching efficiency. Do not use [NC] for internal rewrites; if you have capitalization errors, they should be fixed with an external redirect in order to prevent duplicate-content problems.
Also, use the [L] flag, unless you know of a specific reason not to use it on a particular rule -- again, this is for efficiency. So, your posted rule becomes:
RewriteRule ^emigration-reasons\.html$ /index.php?act=6 [L]
An example capitalization-fix rule would be:
RewriteCond $1 !^emigration-issues\.html$
RewriteRule ^(emigration-issues\.html)$ http://www.example.org/emigration-issues.html [NC,R=301,L]
Jim
[edited by: jdMorgan at 4:36 pm (utc) on April 27, 2008]
However, I again ran into a problem that drove me crazy earlier this weekend. I tried several things when writing the rewrite rules, but thanks to your suggestions for unambiguous, strict, and simple syntax, I have now narrowed it down to one thing: the leading slash in the second part of the rule is what keeps throwing an error.
^word-word.html index.php?act=x [L] ...works
^word-word\.html$ index.php?act=20 [L] ...works
^emigration-passport.html$ /index.php?act=20 [L] ...does not work:
"Not Found, the requested URL /index.php was not found on this server". Removing the leading slash makes it work again.
I am running Apache 2.2 with PHP 5.2.5, with Apache in its strictest default settings. I just checked the 2.2 documentation on the Apache site, and they also do not use the leading slash (if I understand correctly):
Quote
# now the rewriting rules
RewriteRule ^oldstuff\.html$ newstuff.html
Unquote
Is this a difference between Apache 1.3 and 2.2?
Something related I am now also pondering.
1) For optimum SEF and user-friendliness, what is better as the new URL type, with or without .html? For example “emigration-reasons.html” or just “emigration-reasons”? I saw both promoted, but no arguments why one would be better than the other for search engines or for the public. Does anybody have any remarks on that?
2) This is not really an important question because I can make our site work; I am just interested. With a product site with clear product numbers and section IDs, I can imagine a general rule for rewriting hundreds of easy .html URLs back into real …php?act=x & id= etc. Actually, most topics on mod_rewrite here address how to do just that.
Our site, in contrast, only has new URLs like “easy-words.html” or at most “friendly-easy-words.html” which I have to link to a specific index.php?act=xx (only one variable). I cannot find a general rule how to link each friendly url to its ?act=xx number. As far as I can see, it boils down to making a unique rule for each page. We only have a few dozen pages, so it is doable.
How do big sites with hundreds of pages, without clear product or section IDs but only “nice-words.html”, solve that? Do they write a unique rule for each individual page as well?
For larger sites, one solution could be to keep the URLs reasonably friendly by adding the act number as a folder BEFORE the friendly part.
For example, www.example.org/6/emigration-reasons.html. Visitors would still see the nice url and probably miss the /6/ anyway. For spiders it does not matter at all.
You could then make a general rewrite rule to use the .../x/... as number for the index.php?act=x.
With only around 25 pages for this particular site and about the same for the associated sites, writing unique rules for each page is no problem and nicer. However, for our other sites that are larger this might be the solution to use.
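A minimal sketch of the general rule described above, assuming the number in the first path segment is the act value and that the rule lives in the .htaccess file in the same directory as index.php. The pattern is an illustration and has not been tested against this site.

```apache
# /6/emigration-reasons.html  ->  index.php?act=6
# $1 captures the digits in the first path segment. The words after it
# are matched by the pattern but never checked against anything.
RewriteRule ^([0-9]+)/[a-z0-9-]+\.html$ index.php?act=$1 [L]
```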
Given a choice, I'd avoid putting the 'act=' number into the URL. It looks appealing, but realize that it leaves your site open to massive duplicate-content problems, since example.org/6/<anything-whatsoever> will return the same page content.
The solution used by large sites depends on having access to httpd.conf, and so usually requires a VPS or a dedicated server.
A RewriteMap is defined in httpd.conf or conf.d that maps requests to an external program (e.g. a Perl script) that does a database lookup using the friendly URL and retrieves the index number (such as your 'act=6'). This value is then used by the RewriteRule to rewrite the request.
This RewriteMap solution is a bit complex, but it is most useful on sites where the back-end scripts cannot be re-coded to accept the friendly URLs directly. Also, the RewriteMap script must be robust; it is started once with the server, and must continue to run 'forever.' As a consequence, it must gracefully handle all possible errors internally, and never, ever, "die." This is not a typical requirement, and most programmers won't code like that unless you make it very clear at the outset that the script must never fail for any reason.
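The server-level setup described above might look roughly like the following sketch. The map name actmap and the script path /usr/local/bin/act-lookup.pl are hypothetical. The script itself (not shown) would read one lookup key per line on standard input and print one result line per key, printing NULL when there is no match.

```apache
# httpd.conf (server or virtual-host context; RewriteMap cannot be declared in .htaccess)
RewriteEngine On

# Start the lookup program once, when the server starts.
# "actmap" and the script path are hypothetical names for illustration.
RewriteMap actmap prg:/usr/local/bin/act-lookup.pl

# Look up the friendly name captured by the rule below ($1).
# A failed lookup yields an empty string, so this condition only passes
# when the script returned a number, which is then available as %1.
RewriteCond ${actmap:$1} ^([0-9]+)$
RewriteRule ^/([a-z0-9-]+)\.html$ /index.php?act=%1 [L]
```

Requests whose name is not found in the map fall through this rule unchanged, so they can still end in a normal 404 response.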
If your back-end script is proprietary, then of course you could just modify it to accept 'friendly' URLs directly. For off-the-shelf scripts, there may be a 'reverse SEF plugin' for use with SEF to allow you to do this.
Jim
It seems that two or at most three words separated by hyphens is still the most recommended way to do it, correct?
Is it possible, and does it make sense, to add to each rewrite rule a pattern that catches both the version without .html and the .html version if people type that? For example ^nice-words\.html?$ to catch all versions with and without .html?
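One detail about the example pattern: in a regular expression, ? applies only to the item immediately before it, so \.html?$ makes the final "l" optional rather than the whole extension. A hedged sketch of the difference follows, plus one way to accept both forms without serving the same content at two URLs. The value act=20 and the choice of the extensionless form as the canonical URL are assumptions for illustration.

```apache
# "\.html?$" matches nice-words.htm and nice-words.html, but NOT nice-words.
# "(\.html)?$" would make the whole extension optional, but then two URLs
# would serve the same content.

# Redirect the .html form to the extensionless form...
RewriteRule ^nice-words\.html$ http://www.example.org/nice-words [R=301,L]
# ...and rewrite only the extensionless form internally.
RewriteRule ^nice-words$ index.php?act=20 [L]
```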
I probably also have to write the case-fixing rule uniquely for each page, that is:
RewriteRule ^emigration-reasons$ index.php?act=20 [L]
RewriteCond $1 !^emigration-reasons$
RewriteRule ^(emigration-reasons)$ http://www.example.org/emigration-reasons [NC,R=301,L]
etc etc etc?
# if URL-path is not exactly "emigration-reasons", all-lowercase
RewriteCond $1 !^emigration-reasons$
# redirect any variants to the canonical URL (note that the rule pattern now has no end-anchor)
RewriteRule ^(emigration-reasons) http://www.example.org/emigration-reasons [NC,R=301,L]
#
RewriteRule ^emigration-reasons$ index.php?act=20 [L]
Jim
Jim, thanks, will do that in the coming days.
g1smd,
Normally maybe yes, but this is an existing site still running under its old stuff. I had to resolve dynamic unique page-titles, descriptions, and keywords first and do some other re-coding. On the actual pages, the only real changes were improving the H1-page-header and H2-sub-header.
This is all done on my home-server as testing ground. Doing the recoding, we also figured out what would be the best fitting urls to superimpose on those existing pages. I am now figuring out how to rewrite for the urls. Once that is solved -thanks Jim- we will create a sitemap and then upload the new structure and start pushing.