How to abolish old url after upgrading w/mod rewrite?

Can't get legacy links to us to go to the new format

         

MichaelBluejay

12:44 pm on Jan 18, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Our urls used to look like this: http://domain.com/topic.html

Recently I used mod_rewrite so now I link to: http://domain.com/topic

The same page loads, but the address bar doesn't show the ".html" part. So far so good.

But I have two problems:

1. People link to us using the old style urls. Those urls don't get converted to the new style in the address bar. We'd like them to, for various reasons, mostly because we're trying to promote easy ways to link to us and to find our stuff with easy urls.

2. Google already had the old urls, and now it's gonna see the new urls, but they have identical content, so we could get dinged for having duplicate content.

But I can't think of how to turn the legacy urls into the new sleek urls without creating a loop. If old style > new style, and new style > old style, then we go loopy-loop.

But since this seems like it would be a common thing that people need to do, I'm hopeful there could be a solution.

Thanks for your help, -MBJ-

jdMorgan

2:31 pm on Jan 18, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Typically the URL rewrite from /topic to /topic.html is an internal rewrite, and the "repair" rewrite you want to do -- from /topic.html to /topic -- is an external redirect (because you want the SEs to update to the non-.html URL).

So, yes there is a danger of looping, but it can be avoided. The key is to use the variable %{THE_REQUEST} to control the external redirect. %{THE_REQUEST} is the only variable that contains a copy of the originally-requested URL, and is not updated during an internal rewrite.


# Externally redirect direct browser requests for /topic.html to /topic
RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /topic\.html
RewriteRule ^topic\.html$ http://www.example.com/topic [R=301,L]
#
# Internally rewrite requests for /topic to /topic.html
RewriteRule ^topic$ /topic.html [L]

The external redirect will update the browser address bar and inform the search engines that they should use the new URL, whereas the internal redirect does neither; it just maps the requested URL to the correct file.
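
To see why the pair of rules above does not loop, here is a minimal Python sketch of the same logic. It is only a model: Python's re module stands in for Apache's regex engine, and the handle() function and its return values are hypothetical names invented for illustration. It assumes the rewritten path is run through the rules a second time, as happens in .htaccess context.

```python
import re

# Hypothetical stand-ins for the two rules above.
EXTERNAL_COND = re.compile(r"^[A-Z]{3,9}\ /topic\.html")  # tested against THE_REQUEST
EXTERNAL_RULE = re.compile(r"^topic\.html$")              # tested against the current path
INTERNAL_RULE = re.compile(r"^topic$")                    # tested against the current path


def handle(request_line):
    """Return ("redirect", url) or ("serve", file_path) for one request line."""
    # The request line never changes; only the working path may be rewritten.
    path = request_line.split(" ")[1].lstrip("/")

    # Rule 1: external 301, only if the client itself asked for /topic.html.
    if EXTERNAL_RULE.search(path) and EXTERNAL_COND.search(request_line):
        return ("redirect", "http://www.example.com/topic")

    # Rule 2: internal rewrite of /topic to /topic.html.
    if INTERNAL_RULE.search(path):
        path = "topic.html"
        # Second pass: rule 1's pattern now matches the rewritten path,
        # but its condition still sees "GET /topic ...", so no redirect.
        if EXTERNAL_RULE.search(path) and EXTERNAL_COND.search(request_line):
            return ("redirect", "http://www.example.com/topic")

    return ("serve", "/" + path)
```

A request line of "GET /topic.html HTTP/1.1" produces the redirect, while "GET /topic HTTP/1.1" is served from /topic.html without any redirect.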

Jim

[edited by: jdMorgan at 3:35 pm (utc) on Jan. 20, 2005]

MichaelBluejay

10:04 am on Jan 20, 2005 (gmt 0)

Thanks for your help. I couldn't get this to work as you wrote it but when I took out the (A-Z){3,6}\ bit it worked fine. I would like to understand what that code is supposed to do, though.

(1) I thought (parentheses) were for storing something to use later in a variable like $1, but I don't see such a variable used. It looks like A-Z could be a character class but I thought those were enclosed with [brackets], not parentheses.

(2) I've never seen the {braces} notation before, and I couldn't find it in any of the tutorials.

(3) The backslash \ looks like you're escaping a space. I don't see why that would be the case.

Also, in the Internal Redirect, as you call it, is the [L] necessary? I thought it was only needed if it was preceded by a RewriteCond.

Incidentally, I simplified my original post, but what I really need is for requests for </topic> to load </DIRECTORY/topic.html> while still showing </topic> in the address bar, and for requests for </DIRECTORY/topic.html> to load but show </topic> in the address bar. Not sure if this changes things with respect to that code syntax I didn't understand....

Thank you very much, -MBJ-

jdMorgan

3:46 pm on Jan 20, 2005 (gmt 0)

Yeah, I typo'ed the [A-Z] part, now corrected above.

THE_REQUEST looks very much like the entry in a standard-format log file:

GET /topic.html HTTP/1.1

[A-Z]{3,9} matches "GET", "DELETE", and the other HTTP "methods" that may be present in the request. Alternatively, you could list them out as alternates with "(GET|HEAD|POST|DELETE|CONNECT|COPY|LOCK|MKCOL|MOVE|OPTIONS|PATCH|PROPFIND|PROPPATCH|PUT|SEARCH|TRACE|UNLOCK)", or use [A-Z]+ or anything similar to skip over the method field.

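The braces give the minimum and maximum number of times the preceding character or character class may repeat, so [A-Z]{3,9} matches three to nine uppercase letters.

This is followed by a space (yes, escaped), then the local URL-path, and finally, the protocol. The space has to be escaped because an unescaped space would end the pattern argument in the directive.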
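
As a rough check of the syntax described above, the following Python sketch tries the corrected pattern and the typo'ed one against a few sample request lines. Python's re module is used in place of Apache's regex engine (an assumption; the two agree on these simple constructs), and the sample request lines are made up for illustration.

```python
import re

# Corrected pattern: a character class of uppercase letters, 3 to 9 long,
# then an escaped space, then the URL-path.
CORRECT = re.compile(r"^[A-Z]{3,9}\ /topic\.html")

# Typo'ed pattern: parentheses make "A-Z" a literal three-character group,
# so this would only match a request starting with "A-ZA-ZA-Z ...".
TYPOED = re.compile(r"^(A-Z){3,6}\ /topic\.html")

samples = [
    "GET /topic.html HTTP/1.1",         # 3-letter method
    "PROPPATCH /topic.html HTTP/1.1",   # 9-letter method
    "get /topic.html HTTP/1.1",         # lowercase is rejected by [A-Z]
    "GET /subdir/topic.html HTTP/1.1",  # different URL-path
]

correct_hits = [bool(CORRECT.search(s)) for s in samples]
typoed_hits = [bool(TYPOED.search(s)) for s in samples]

print(correct_hits)
print(typoed_hits)
```

The typo'ed version never matches a real request line, which is consistent with the rule only working once that part of the pattern was removed.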

You should use [L] any time you don't have a very good reason not to. That is, unless you want the output of the current rule to be processed by subsequent rules, use [L]. It is not required when using [F], [G], or [P], however, because these flags imply an [L] without you having to explicitly include it.

You can simply add the "/DIRECTORY" subdirectory into the substitution path -- this doesn't change the basic syntax. Be aware that changing the directory level may "break" relative links in your /topic.html page. Whenever specifying relative links, I suggest using the root-relative form, rather than the current-location-relative form. That is, use <img src="/images/image.gif"> rather than <img src="../images/image.gif"> -- basically, always use a leading slash on any include that is outside the current directory to make the path relative to DOCUMENT_ROOT, rather than to the current directory.
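
The relative-link problem can be illustrated with Python's urllib.parse.urljoin, which resolves links roughly the way a browser does. The URLs below are hypothetical, following the /topic and /DIRECTORY/topic.html example; this is a sketch of the resolution rule, not of anything Apache does.

```python
from urllib.parse import urljoin

# The URL the browser sees vs. where the file actually lives
# (both hypothetical).
visible_url = "http://www.example.com/topic"
file_url = "http://www.example.com/DIRECTORY/topic.html"

# Current-location-relative link: the result depends on the base URL,
# and the browser only knows the visible one.
rel_from_visible = urljoin(visible_url, "images/image.gif")
rel_from_file = urljoin(file_url, "images/image.gif")

# Root-relative link: the same result from either base.
root_from_visible = urljoin(visible_url, "/images/image.gif")
root_from_file = urljoin(file_url, "/images/image.gif")

print(rel_from_visible)   # resolved against /topic
print(rel_from_file)      # resolved against /DIRECTORY/topic.html
print(root_from_visible == root_from_file)
```

The relative link lands in a different directory depending on which URL it is resolved against, while the root-relative form is unaffected by the rewrite.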

Sorry for the typo and the resulting confusion. Some days it's hard to keep up, and I go too fast.

Jim

MichaelBluejay

9:56 am on Jan 21, 2005 (gmt 0)

No problems on the typo. I'm getting such useful, detailed, and personalized help here that I certainly can't complain! I just went and tried to answer some of the easier questions in the other forums to balance things out.

I understand now about all the syntax, thanks. The way I'd gotten it to work was to remove everything at the beginning, including the ^, so I had "%{THE_REQUEST} /dir/filename". I know that that puts me at risk for matching something I don't want to match (e.g., "dir" also matches "subdir"), but I don't think it will be a problem in my case. I made it a little safer by using "%{THE_REQUEST} \ /dir/filename", forcing the space after the METHOD request, since I never use spaces in directory names or filenames.

I wasted a good hour trying to get this to work with multiple subdirectories until I discovered that you can't use the $ to mark the end of a %{THE_REQUEST} line. That variable must have other stuff in it, like the user agent and the referrer.

I got everything working in the test case but I have a new wrinkle. I want the external redirect to catch </dir/dir2/index.html> but not </dir/dir2/appendix.html>. My RewriteCond matches on </dir/dir2(/index.html)?>, so that's not gonna work because it's too broad. Surely there has to be a regex which will match on an optional filename and only if that filename is <index.html>?

Ignoring that last issue I just mentioned for a moment, if we assumed I matched only what I wanted to match with RewriteCond, couldn't the "RewriteRule ^topic\.html$...", match simply on (.*)
?
I put my question mark on a new line so you wouldn't think I was including it in the syntax. :)

Thanks very much for your help!

jdMorgan

7:09 pm on Jan 21, 2005 (gmt 0)

> I wasted a good hour trying to get this to work with multiple subdirectories until I discovered that you can't use the $ to mark the end of a %{THE_REQUEST} line. That variable must have other stuff in it, like the user agent and the referrer.

Yes.

> I got everything working in the test case but I have a new wrinkle. I want the external redirect to catch </dir/dir2/index.html> but not </dir/dir2/appendix.html>. My RewriteCond matches on </dir/dir2(/index.html)?>, so that's not gonna work because it's too broad. Surely there has to be a regex which will match on an optional filename and only if that filename is <index.html>?

You can list the filenames you want, or exclude the ones you don't want using additional RewriteConds. But if you're just trying to match "index.html" or "", then that's easier.
RewriteCond %{THE_REQUEST} ^[A-Z]+\ /dir/dir2(/index.html|/)?$

Here, the major change is that the pattern is end-anchored. That prevents it from matching "/dir/dir2<anything or nothing>".

> Ignoring that last issue I just mentioned for a moment, if we assumed I matched only what I wanted to match with RewriteCond, couldn't the "RewriteRule ^topic\.html$...", match simply on (.*)

Yes, you can do that. But make the RewriteRule pattern as specific as possible without causing interference with what you want to do.

The reason you want this pattern to be specific is that RewriteConds are not evaluated until the RewriteRule matches (see the Apache docs). If you use ".*" then the Rule will always match, so the RewriteConds will always be evaluated. This is often a waste of CPU time, for example if the requested resource is an image instead of an HTML page. So, it's better to make the rule as specific as possible, as long as the pattern is simple (doesn't contain multiple wildcards, alternate matches, etc.).

This 'optimization' is something you need to balance between efficiency and simplicity. For example, if you only have one RewriteCond, you may not consider it worth the bother. But if you have multiple RewriteConds, or RewriteConds that test the filesystem (i.e. the -d, -f, -s, -l, -F, and -U special RewriteCond patterns), or especially if they generate reverse-DNS lookups (i.e. %{REMOTE_HOST}), then this can make a big difference on a busy server.

In those cases, make the Rule pattern as specific as possible, and put the RewriteConds in order from most-efficient and most-likely-to-fail-to-match to least-efficient and least-likely-to-fail-to-match. The idea being that if the whole ruleset is likely to fail (because requests that will match it are rare), then it is best if it fails and exits quickly. Generally, local back-references and server variables are most efficient, user-defined variables are in the middle, and the %{REMOTE_HOST} server variable is the one you want to avoid at all costs (it actually generates an outgoing request from your server to the DNS system to look up the hostname, and the request cannot proceed until it gets a response!).
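
The evaluation order described above (rule pattern first, then conditions in sequence, stopping at the first failure) can be modeled in a few lines of Python. This is only a sketch of the idea: cheap_cond, expensive_cond, and rule_applies are hypothetical names, and the "expensive" condition merely stands in for something costly such as a reverse-DNS lookup.

```python
import re

calls = []  # records which conditions actually ran


def cheap_cond(request_line):
    # Cheap and likely to fail: put this one first.
    calls.append("cheap")
    return request_line.startswith("GET ")


def expensive_cond(request_line):
    # Stand-in for a costly test such as a reverse-DNS lookup.
    calls.append("expensive")
    return True


RULE_PATTERN = re.compile(r"^topic\.html$")


def rule_applies(path, request_line):
    # The rule pattern is checked first; conditions run only if it matches.
    if not RULE_PATTERN.search(path):
        return False
    # "and" stops at the first failing condition, so the expensive
    # one is skipped whenever the cheap one fails.
    return cheap_cond(request_line) and expensive_cond(request_line)
```

With a specific pattern, a request for an image never reaches either condition; with a failing cheap condition, the expensive one is never called.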

Also, be aware that %{THE_REQUEST} contains the originally-requested URL, while the URL "seen" by RewriteRule will be changed by any other rewrites that have already taken place. So, make the pattern in the RewriteRule match the current URL, not the originally-requested one. For example, if you have already rewritten all requests for .htm to .html, then %{THE_REQUEST} will still contain the requested URL ending in ".htm", while the RewriteRule will see the newly-updated ".html" URL. I should also reiterate that RewriteCond %{REQUEST_URI} will also see the new URL, not the originally-requested one. This is not usually a concern, just a trap to be aware of.
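
The .htm/.html trap can be pictured with two plain Python variables, one that is fixed for the life of the request and one that earlier rules may change. The variable names and the sample request are hypothetical; this is a model of the behavior described above, not Apache code.

```python
import re

the_request = "GET /page.htm HTTP/1.1"  # fixed: what the client sent
current_path = "page.htm"               # what RewriteRule sees; may change

# An earlier (hypothetical) rule rewrites .htm to .html internally.
current_path = re.sub(r"\.htm$", ".html", current_path)

# A later rule pattern must be written against the updated path...
rule_sees_html = bool(re.search(r"\.html$", current_path))

# ...while a condition on the request line still sees ".htm".
request_sees_html = bool(re.search(r"\.html\ HTTP/", the_request))
request_sees_htm = bool(re.search(r"\.htm\ HTTP/", the_request))
```

After the rewrite, only the working path ends in ".html"; the request line still ends its URL-path in ".htm".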

Anyway, it's your call to make, balancing server efficiency versus simplicity as you see fit.

Jim

Peter

10:17 pm on Jan 22, 2005 (gmt 0)

10+ Year Member



Hello,

It's difficult to find clear documentation for THE_REQUEST, so please forgive me if my questions are naive.

J.D. Morgan first explained:
- THE_REQUEST looks very much like the entry in a standard-format log file: GET /topic.html HTTP/1.1
and then later proposed:
- RewriteCond %{THE_REQUEST} ^[A-Z]+\ /dir/dir2(/index.html|/)?$

Hence two questions:
1. Is it really useful to check ^[A-Z]+\ at the start, and why?
2. What is the effect of the ? before the end anchor; and is "HTTP/1.1" returned in THE_REQUEST?

Thanks for any enlightenment.

Peter.

jdMorgan

1:23 am on Jan 23, 2005 (gmt 0)

The simplest way to answer the question of what's in the %{THE_REQUEST} variable (or any other) is to try a test:

RewriteRule ^foo\.html$ [example.com...] [R=301,L]

Request "foo.html" with your browser, and then watch the address bar for the redirect. The contents of %{THE_REQUEST} will appear as a query string.

%{THE_REQUEST} is the entire browser request line, including method (e.g. "GET"), local URL-path (e.g. "/foo.html"), and protocol (e.g. "HTTP/1.1"). It is the first line you would type into a terminal emulator if you were sending a manual request to your server, for example using the Windows-bundled HyperTerminal program:

GET /test.html HTTP/1.0
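
For illustration, the three parts of such a request line can be pulled apart in Python. The split_request_line helper is a hypothetical name, and the sketch assumes a well-formed line with single spaces between the fields.

```python
def split_request_line(request_line):
    """Split 'METHOD /path PROTOCOL' into its three fields."""
    method, url_path, protocol = request_line.split(" ", 2)
    return method, url_path, protocol


method, url_path, protocol = split_request_line("GET /test.html HTTP/1.0")

# The line does not end with the URL-path; the protocol follows it.
ends_with_path = "GET /test.html HTTP/1.0".endswith(url_path)
```

Because the protocol comes last, a pattern tested against the whole line cannot place an end anchor directly after the URL-path.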

I addressed the [A-Z] stuff previously, but I like to use anchored, specific patterns. Personal style issue, basically, but my code usually runs faster than most.

In regular expressions, the "?" makes the preceding character, or parenthesized group of characters, optional. Thus, that subpattern matches requests for "/index.html", for "/", or for "". There's a link to a decent regex tutorial in our forum charter for more info.

Jim

Peter

10:32 pm on Jan 23, 2005 (gmt 0)

That is a very nice technique for displaying server variables, which I hadn't picked up before if you've already shown it to us. Thank you very much, Jim.

I'm probably being stupid (sorry), but I still can't see how one can use the end anchor here if " HTTP/1.*" will be in the string after "/index.html" or "/" or "".

Peter.

jdMorgan

3:07 am on Jan 24, 2005 (gmt 0)

Nope, you're right. The pattern should not be end-anchored. I seem to have filled this thread with typos. I'm a habitual "anchorer" and just slipped that end-anchor in there, somehow. :o

So, the RewriteCond in message #5 should read:


RewriteCond %{THE_REQUEST} ^[A-Z]+\ /dir/dir2(/index.html|/)?\ HTTP/

Here, we're trying to match only "/index.html" or "/", so the trailing backslash-space is needed to make sure that no additional URL-path elements follow the slash. And to make sure that the syntax was obvious in this post, I went ahead and appended the "HTTP/" part of the request as well, even though it's not likely to vary (any time soon).
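
A quick way to compare the end-anchored pattern with the corrected one is to run both against sample request lines. The sketch below uses Python's re module in place of Apache's regex engine (an assumption), writes the alternation with solid pipe characters, and uses made-up request lines.

```python
import re

# The earlier, end-anchored version and the corrected version.
END_ANCHORED = re.compile(r"^[A-Z]+\ /dir/dir2(/index.html|/)?$")
CORRECTED = re.compile(r"^[A-Z]+\ /dir/dir2(/index.html|/)?\ HTTP/")

requests = [
    "GET /dir/dir2/index.html HTTP/1.1",
    "GET /dir/dir2/ HTTP/1.1",
    "GET /dir/dir2 HTTP/1.1",
    "GET /dir/dir2/appendix.html HTTP/1.1",
]

anchored_hits = [bool(END_ANCHORED.search(r)) for r in requests]
corrected_hits = [bool(CORRECTED.search(r)) for r in requests]

print(anchored_hits)
print(corrected_hits)
```

The end-anchored version matches none of the four lines because " HTTP/1.1" always follows the URL-path, while the corrected version accepts the first three and rejects appendix.html.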

Jim

MichaelBluejay

4:13 pm on Jan 26, 2005 (gmt 0)

This thread alone was worth my WebmasterWorld subscription. Thanks again jdmorgan!

And here's another typo: You said message #5 when I think you meant message #6. :)

When I wrote %{THE_REQUEST} to the query string it started with GET%2520. I know that %20 is a space but what the heck is %2520?

I went ahead and made my RewriteRule matches specific instead of generic to increase performance. I didn't see any problem with being generic as long as there wasn't a downside, but now that I know there's a downside, specific it is.

I went ahead and decided to match everything in a subdirectory, not just the <index.html> file, so that makes everything a little simpler anyway.

And of course, I have one outstanding issue. I'm mapping requests for </topic> to </directory/topic/index.html>, and having requests for </directory/topic/(.*)> show in the browser as </topic/$1>. So far so good.

But if the $1 is "index.html", I'd like that to be omitted from the address bar. Not sure if this can be done, but you've had the solution to everything else, so I figured it was worth a shot!

jdMorgan

10:44 pm on Jan 26, 2005 (gmt 0)

> But if the $1 is "index.html", I'd like that to be omitted from the address bar. Not sure if this can be done, but you've had the solution to everything else, so I figured it was worth a shot!

Yes, it's possible (if I understand the question). However, I'm not sure where we are at this point code-wise. So, if you're not up to figuring this out on your own yet, you'd better post what you have.

Jim