Forum Moderators: phranque

Redirecting with ?'s and #'s in the URI

using RedirectMatch vs. mod_rewrite (or something else?)

         

zwhalen

8:25 pm on Mar 25, 2006 (gmt 0)

10+ Year Member



Hi.

I'm in the process of moving a site to a new domain, and what I thought would be an easy switchover is turning out to be more complicated than I expected. The problem stems from the fact that the site at olddomain.org is a blog running on Blosxom, and the way I've got it set up, it generates permanent links to entries that look like [olddomain.org...] .

The new domain is running on an entirely different setup (Drupal), so the new entries look like [newdomain.org...] .

I want to redirect each old URL specifically to its counterpart, since most of our inbound links use those URLs, but the problem comes in the matching. I can identify which 'YYYY/MM/DD#filename' goes with which 'node', but I'm having a hard time configuring .htaccess to match URIs with ? and # characters in them.

So far I've tried something like:

redirectmatch 301 /index\.shtml?YYYY/MM/DD\#filename newdomainlocation

(where newdomainlocation is the actual URL it goes to)

But that doesn't work (doesn't match). Should I instead try something with mod_rewrite? Can that still indicate a 301 redirect?

Thanks in advance.

zach

zwhalen

9:50 pm on Mar 25, 2006 (gmt 0)

10+ Year Member



Well, I've figured out how to deal with the query string and at least get it evaluated, but I still can't figure out how to match the #. Here's what I've got now:

RewriteCond %{QUERY_STRING} ^/2005/7/1[.+]filename65$
RewriteRule ^index\.shtml$ http://www.newdomain.org/node/751 [R=301,L]

Any ideas?

zwhalen

10:21 pm on Mar 25, 2006 (gmt 0)

10+ Year Member



After some more experimenting, apparently anything following a # in the URL is not part of the QUERY_STRING. It also does not appear in REQUEST_URI, which makes sense if it just passes that information to the browser to deal with.

There's gotta be some way to test for this, though. Anything come to mind, or am I SOL?

EDIT: I think I can live with it. Since there isn't usually more than one entry per day, I can just match the dates and redirect to a "best guess" page.

As a followup, though, the approach I'm going to be taking will produce what seems to me a rather large .htaccess file. Is this going to be a problem or create a performance issue if I have, say, 2500 RewriteRules?

jdMorgan

2:32 am on Mar 26, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



"#" characters aren't valid in URLs, so those characters will appear as hex-encoded entities, that is, as %23.
But RewriteRule, RewriteCond %{REQUEST_URI}, and RewriteCond %{QUERY_STRING} all have different un-encoding behaviours (I can't tell you the specifics), so you'll have to experiment with detecting %23 in those variables.

You can also get closest to the actual transmitted URL by using RewriteCond %{THE_REQUEST}. This variable contains the entire request line sent by the client (browser) including the method and the protocol, such as:

GET /index.shtml?/YYYY/MM/DD#filename HTTP/1.1

(This is what usually appears in standard server access logs in the Request field)
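A minimal sketch of the THE_REQUEST approach described above, not a tested rule. It assumes the client actually transmits the "#" percent-encoded as %23; a literal "#" normally starts a fragment, which browsers keep to themselves, so ordinary links may never trigger this. The date, filename65 and node/751 values are placeholders borrowed from the earlier post.

```apache
RewriteEngine On

# THE_REQUEST holds the raw request line as the client sent it, e.g.
#   GET /index.shtml?/2005/7/1%23filename65 HTTP/1.1
# It is not un-escaped, so %23 is matched here as three literal characters.
# Spaces in the pattern are escaped with a backslash.
RewriteCond %{THE_REQUEST} ^GET\ /index\.shtml\?/2005/7/1%23filename65\ HTTP/

# The trailing "?" on the target drops the old query string, so it is
# not appended to the new URL.
RewriteRule ^index\.shtml$ http://www.newdomain.org/node/751? [R=301,L]
```

If the access log shows the request line ending before the "#", as a literal fragment would, this condition has nothing to match against.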

Anyway, you ought to be able to catch that "#" character by experimenting with the info above. Because yes, 2500 lines is too many on a site getting traffic.

Jim

jt007superman

4:18 am on Mar 26, 2006 (gmt 0)

10+ Year Member



You're working entirely too hard. If there are no similarities between the file number and the node number, then all you have to do is this:

RewriteCond %{QUERY_STRING} filename65$
RewriteRule ^index\.shtml$ [newsite.org...] [R=301,L]

zwhalen

4:50 am on Mar 26, 2006 (gmt 0)

10+ Year Member



RewriteCond %{QUERY_STRING} filename65$
RewriteRule ^index\.shtml$ [newsite.org...] [R=301,L]

That would be nice, but no go. I hadn't actually thought to try that until just now, but again, I think the problem is that whatever comes after the # doesn't make it into QUERY_STRING at all.

jdMorgan - I couldn't get it to match THE_REQUEST either, but I checked my logs and found that my request fields look like "GET /index.shtml?/2005/10/17/ HTTP/1.1" when I know the actual request included the filename after a # (I know because it's my IP address in the log).

Is that something that might vary by server? I don't have much control over this one (outside of .htaccess, obviously) so I probably wouldn't be able to change much in the logging department.

jt007superman

5:47 am on Mar 26, 2006 (gmt 0)

10+ Year Member



Oops, you're right zwhalen, sorry. I tested the rule without using the # sign.

I'm afraid that, after seeing your log entry and running some tests myself, it looks like anything from the # sign onward is for the browser's use only.

The only use for the pound sign in a url that I have ever seen is for anchor linking within the same page.
It's probably carried by the browser from one page to the next, but most likely not available to the server.

jt007superman

3:04 am on Mar 27, 2006 (gmt 0)

10+ Year Member



The "#filename65" you are using is called a "fragment identifier". Here is what RFC 3986 has to say:

Fragment identifiers have a special role in information retrieval systems as the primary form of client-side indirect referencing, allowing an author to specifically identify aspects of an existing resource that are only indirectly provided by the resource owner.

Sorry, wish I could help you but this one is looking impossible.

zwhalen

9:03 pm on Mar 27, 2006 (gmt 0)

10+ Year Member



Well, here's what I've worked out. It's a bit of a kludge, but it hopefully won't have to kick in that much.

Since the entries on oldsite.org are organized by date, I can use that to get a pretty good idea of where the entry is on the new site. Most of the time, in fact, there's only one post on any given day, so that's easy:

RewriteCond %{QUERY_STRING} ^/2005/05/31$
RewriteRule ^index\.shtml$ http://www.newsite.org/node/393 [R=301,L]

On dates when there's more than one entry, I can determine that the intended file is one of a list, so I redirect to a page on the new site that expects that list as a parameter:

RewriteCond %{QUERY_STRING} ^/2005/11/15$
RewriteRule ^index\.shtml$ http://www.newsite.org/node/disambig?options=/644/642/660/643 [R=301,L]

So that page, 'disambig', says "Sorry, couldn't find what you were looking for, but it might be one of these four: 644, 642, 660, 643. By the way, check out the new site!"

So it's a big workaround, but hopefully it'll only come up on a few dates.
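For anyone worried about the size of the resulting .htaccess file, here is one possible way to collapse the per-date rules into a single rule. This is only a sketch under assumptions: the "legacy" path and "date" parameter are hypothetical names, and the new site would need a page that looks up the node (or list of nodes) for a given date, much like 'disambig' does.

```apache
RewriteEngine On

# Capture year, month and day from a query string such as "/2005/11/15".
# A trailing slash is allowed but not required.
RewriteCond %{QUERY_STRING} ^/([0-9]{4})/([0-9]{1,2})/([0-9]{1,2})/?$

# %1, %2 and %3 refer to the groups captured by the RewriteCond above.
# Because the target carries its own query string, the old one is replaced.
# "legacy" and "date" are hypothetical; substitute whatever the new site uses.
RewriteRule ^index\.shtml$ http://www.newsite.org/legacy?date=%1-%2-%3 [R=301,L]
```

The trade-off is that the date-to-node mapping moves out of .htaccess and into the new site, where it is only consulted for old-style URLs rather than on every request.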

Thanks for all the input.