

Apache Web Server Forum

    
301 redirecting list of URLs where one word is changed
minarets

5+ Year Member



 
Msg#: 4582238 posted 1:20 am on Jun 8, 2013 (gmt 0)

Hi, I've got a list of 150 pages out of approx 10,000 for which I need to create a 301 redirect, preferably using regex. All the URLs follow the same general, but not identical, structure.
Here's an example of the site's overall path structure for the directory I'm working in:

https://www.example.com/subdir/1500-COMPANY-Series1-_-Category1-Green-Blue-_-ABC002-01.html
https://www.example.com/subdir/1501-COMPANY-Series1-_-Category1-Green-Blue-_-ABC002-02.html
https://www.example.com/subdir/1502-COMPANY-Series1-_-Category2-Red-Black-_-DEF001-01.html
https://www.example.com/subdir/1503-COMPANY-Series1-_-Category2-Red-Black-_-DEF001-02.html
https://www.example.com/subdir/1504-COMPANY-Series1-_-Category3-White-Orange-_-GHI006-01.html
https://www.example.com/subdir/1505-COMPANY-Series1-_-Category3-White-Orange-_-GHI006-02.html

I need to redirect only the middle 2 URLs in the list above and ignore all others. So, I need to find Category2-Red-Black and redirect to Category2-Red-Purple while leaving the rest of the URL intact. For example,

this page:
https://www.example.com/subdir/1502-COMPANY-Series1-_-Category2-Red-Black-_-DEF001-01.html
becomes:
https://www.example.com/subdir/1502-COMPANY-Series1-_-Category2-Red-Purple-_-DEF001-01.html

and this page:
https://www.example.com/subdir/1503-COMPANY-Series1-_-Category2-Red-Black-_-DEF001-02.html
becomes:
https://www.example.com/subdir/1503-COMPANY-Series1-_-Category2-Red-Purple-_-DEF001-02.html

Any URLs which don't contain Category2-Red-Black should be ignored. I've been reading various threads on this and other forums but am not comfortable enough with regex to know how to approach this without bringing down the site. Can someone help me on this?

-min.

 

lucy24

WebmasterWorld Senior Member



 
Msg#: 4582238 posted 1:56 am on Jun 8, 2013 (gmt 0)

-_-
?
Really?
Literally?

Excuse me. I gotta go lie down and put a cold compress on my forehead.

Now then. Ordinarily we would read you the riot act about teaching a man to fish, et cetera, but if you've been reading WebmasterWorld posts since 2005 you know all that already.

When a rule is only supposed to work on 150 out of 10,000 pages, it's important to constrain it as closely as possible so the server can take a quick look and be outta there right away in the other 9,750 cases.

Do all your URLs follow the pattern you showed in your post? That is, not all of them everywhere, but all those that would potentially be affected by the rule? If so, you put that into the pattern. Your initial batch of six all fit into:

^subdir/\d\d\d\d-COMPANY-Series1-_-Category\d-\w+-\w+-_-[A-Z]{3}\d\d\d-\d\d\.html

The bit I give as \d\d\d\d is really 150\d but I'm assuming that's coincidental. If any of the four digits is always the same, give it explicitly. Conversely, if there can be more or fewer than four digits, express it as a range: \d+ or \d{3,5} depending on how wide the range is. The same applies to all your other clusters of alphanumerics. Note that \w includes _ (the lowline, which does occur in your URLs). I can't see from your examples whether there will be any ambiguity.
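For anyone who wants to poke at the pattern outside Apache first, here is a minimal sketch using Python's re module, which treats \d, \w, {3} and anchors the same way for patterns this simple. The pattern is the one quoted above, written with the dot before "html" escaped. The five-digit path and the unrelated path are made up for illustration.

```python
import re

# The pattern from the post, with the dot before "html" escaped.
STRICT = re.compile(
    r"^subdir/\d\d\d\d-COMPANY-Series1-_-Category\d-\w+-\w+-_-[A-Z]{3}\d\d\d-\d\d\.html"
)
# Same pattern, but allowing three to five leading digits.
RANGED = re.compile(
    r"^subdir/\d{3,5}-COMPANY-Series1-_-Category\d-\w+-\w+-_-[A-Z]{3}\d\d\d-\d\d\.html"
)

four_digits = "subdir/1502-COMPANY-Series1-_-Category2-Red-Black-_-DEF001-01.html"
five_digits = "subdir/15020-COMPANY-Series1-_-Category2-Red-Black-_-DEF001-01.html"  # made up
unrelated = "images/logo.png"  # made up

for path in (four_digits, five_digits, unrelated):
    print(path, bool(STRICT.match(path)), bool(RANGED.match(path)))

# \w also matches the lowline, so \w+ swallows a name like "Red_Black" whole.
underscore_is_word_char = re.fullmatch(r"\w+", "Red_Black") is not None
print(underscore_is_word_char)
```

The strict form rejects the five-digit path because exactly four digits must be followed by a hyphen; the ranged form accepts it.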

Once you've got your pattern-- as tightly constrained as possible, but be careful not to overconstrain-- you can pin down the constant part:

^(subdir/\d\d\d\d-COMPANY-Series1-_-)Category2-Red-Black(-_-[A-Z]{3}\d\d\d-\d\d\.html)

redirects to
https://www.example.com/$1Category2-Red-Purple$2

Now, does the "Category2-Red-" piece always stay the same? If so, you could include it in your capture, so the target becomes ... $1Purple$2. But there's no particular reason to capture and reuse things if they're always the same. The same applies to the beginning of the pattern: if it's always the same /subdir/ you can choose not to capture it, and instead put the literal text "/subdir/" on the target side.

These are personal decisions. The total number of captures (two) isn't affected, so I doubt it will have a significant effect on your server.
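To see the two captures at work, here is a small Python sketch. It is only a stand-in for what the server does: \g<1> and \g<2> play the part of mod_rewrite's $1 and $2, and redirect_target is a hypothetical helper name, not anything Apache provides.

```python
import re

# Capture everything before and after the text that changes.
RULE = re.compile(
    r"^(subdir/\d\d\d\d-COMPANY-Series1-_-)Category2-Red-Black(-_-[A-Z]{3}\d\d\d-\d\d\.html)"
)
# \g<1> and \g<2> stand in for $1 and $2 on the target side.
TARGET = r"https://www.example.com/\g<1>Category2-Red-Purple\g<2>"


def redirect_target(path):
    """Return the redirect URL for path, or None if the rule does not apply."""
    match = RULE.match(path)
    return match.expand(TARGET) if match else None


moved = redirect_target("subdir/1502-COMPANY-Series1-_-Category2-Red-Black-_-DEF001-01.html")
ignored = redirect_target("subdir/1500-COMPANY-Series1-_-Category1-Green-Blue-_-ABC002-01.html")
print(moved)
print(ignored)
```

The first call yields the Red-Purple URL with everything else intact; the second yields None because the literal text never appears.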

I've deliberately not written out rules in context, because you don't say how you want to achieve your redirects. Since a Regular Expression is required but the redirect affects only the path of the request, you can either use mod_alias in the RedirectMatch form, or mod_rewrite with the [R=301] flag. I'm sure you have come across posts about the danger of mixing modules; if you're already using one or the other, stick with it.

minarets

5+ Year Member



 
Msg#: 4582238 posted 4:10 am on Jun 8, 2013 (gmt 0)

Hi Lucy,

Thanks for replying. Not sure what you were pointing out regarding the -_- format. It's generated by the application in place of hyphens, so there isn't really anything I can do about that. IPTC/XMP metadata (such as image title or image subject) which is embedded in the image files on the filesystem is partly translated into URLs.. So "Image of Mount Holly Reservoir - New York" becomes "0001-Image-Mount-Holly-Reservoir-_-New-York.html" etc etc...

The actual range I need to redirect runs from 1744-1891, so the leading '1' would be explicit.

The /subdir/ is always the same. COMPANY-Series1 is likewise always the same for the series I need to capture.

Yes, the 150 pages that need to be redirected are all structured in exactly the same way. However, while the "Category2-Red-" piece is always the same, the "Black" from the original URL may exist in that spot in other unrelated URLs which are not being redirected. I've looked at it quite closely and I'm pretty sure the entire chain of 3 words "Category2-Red-Black" are the safest and most concise on which to capture.

So, with your suggestions, up to now it would be something like:

^(1\d\d\d-COMPANY-Series1-_-)Category2-Red-Black(-_-[A-Z]{3}\d\d\d-\d\d\.html)

If the alphanumeric at the end is always the same, would it be best to show it explicitly as well? (it's actually 4 letters). And there are two 0s in the same space in all instances...

^(1\d\d\d-COMPANY-Series1-_-)Category2-Red-Black(-_-ABCD\d0\d-0\d\.html)

And, since these 3 words will only exist in that order in that same series, I think the leading group can be expanded:

^(1\d\d\d-COMPANY-Series1-_-Category2-Red-)Black(-_-ABCD\d0\d-0\d\.html)

If I'm looking at this correctly, that would leave only the one word "Black" which is being changed to "Purple" in the new URL.

The existing site, as I inherited it, uses mod_rewrite (RewriteCond and RewriteRule), for example:

RewriteCond %{SERVER_PORT} 80
RewriteRule ^(.*)$ https://www.example.com/$1 [R=301,L]

and there are a small handful of straight "redirect 301" instances for individual pages which I guess they moved at an earlier date.

-min.

lucy24

WebmasterWorld Senior Member



 
Msg#: 4582238 posted 9:43 am on Jun 8, 2013 (gmt 0)

Urk. It's safer to grab those few remaining Redirect 301 rules-- the ones using mod_alias --and translate them to mod_rewrite. And then slot them into the appropriate places among your RewriteRules. Then you know exactly what's happening when.

If the alphanumeric at the end is always the same, would it be best to show it explicitly as well? (it's actually 4 letters). And there are two 0s in the same space in all instances...

At this point we really are getting into picosecond territory ;) On some abstract philosophical plane it's quicker for the server to check for "8", say, instead of \d or [0-9]-- but then you hit the page you'd forgotten all about that has "27" where all the other pages had "28".

^(1\d\d\d-COMPANY-Series1-_-Category2-Red-)Black(-_-ABCD\d0\d-0\d\.html)

If I'm looking at this correctly, that would leave only the one word "Black" which is being changed to "Purple" in the new URL.

That really should do handsomely.


Now, if the server had to take a large bucket and a small brush and had to hand-write every character you asked it to memorize, it would be more efficient to say
^1(\d\d\d)-COMPANY-Series1-_-Category2-Red-Black-_-ABCD(\d)0(\d)-0(\d)\.html
redirecting to
1$1-COMPANY-Series1-_-Category2-Red-Purple-_-ABCD$20$3-0$4.html

But in real life this would be considerably worse because the server would then have to hold four separate little bits of information in memory.


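Anyway, at this point we're getting into deep theoretical speculation, well beyond the realm of fine-tuning a Regular Expression.* What you've got is a conditionless rule looking something like

RewriteRule ^(1\d\d\d-COMPANY-Series1-_-Category2-Red-)Black(-_-ABCD\d0\d-0\d\.html) https://www.example.com/subdir/$1Purple$2 [R=301,L]

That form assumes the htaccess lives in /subdir/ itself, where the pattern sees the path minus the directory, so /subdir/ has to be written out in the target. In the site root's htaccess the pattern would instead begin ^(subdir/1\d\d\d- and the target would go back to https://www.example.com/$1Purple$2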

That was fun :)


* But, in my case, a welcome break from figuring out why Footnotes 1 through 272 only contains 271 items, and how did Note 156 vanish off the face of the earth, and why the bleep did Julius Zupitza think it was a bright idea to write his critical apparatus in German instead of sticking to Latin the way God intended? And other queries of a similar nature.

minarets

5+ Year Member



 
Msg#: 4582238 posted 4:23 pm on Jun 8, 2013 (gmt 0)

Hi Lucy,

Thanks again for your help on this. I'm going to give this a try overnight tonight when I can set aside a little time. Will follow up here if there are any issues.

Is there a way to test a redirect like this "prior to" changing the actual files on the server? I suppose I can just create a /test/ subdir with some dummy files in it, then rename those file to correspond with the structure of the redirect and see if it works, but just wondering if there's some kind of built-into-apache method of testing a redirect like this without actually using live pages.

-min.

lucy24

WebmasterWorld Senior Member



 
Msg#: 4582238 posted 7:06 pm on Jun 8, 2013 (gmt 0)

You can't really test it without doing it. In the case of a rewrite, you can test by temporarily setting it to R (explicit redirect) so your browser's address bar shows the new URL; if it didn't work as intended, at least you know where the server thinks it was supposed to go. But there's not much you can do for a redirect short of trying it.

Other than making a test directory, the two basic ways are:

-- maintain a test site (on shared hosting, should cost no more than domain-name registration)
-- keep a pseudo-server (MAMP, WAMP, X-thingie) on your HD with, if you like, a complete backup copy of the entire real site

Well, there is one other thing you can do. If you've got a text editor that does Regular Expressions, paste in a whole bunch of random URLs, one per line. Then tell it to do a global replace based on your RegEx pattern. If the pattern is right, things will change as intended. Won't work, of course, if there are Conditions attached, because your text editor can't deal with those!
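If no regex-capable text editor is handy, the same one-URL-per-line check can be sketched in a few lines of Python. The sample paths below are made up, and they are written the way the pattern expects to see them (no host, no /subdir/); re.MULTILINE makes ^ match at the start of every line, as a text editor would.

```python
import re

RULE = re.compile(
    r"^(1\d\d\d-COMPANY-Series1-_-Category2-Red-)Black(-_-ABCD\d0\d-0\d\.html)",
    re.MULTILINE,  # let ^ anchor at the start of each line
)

# Made-up test list, one path per line.
PASTED = """\
1745-COMPANY-Series1-_-Category2-Red-Black-_-ABCD101-01.html
1746-COMPANY-Series1-_-Category2-Red-Black-_-ABCD102-02.html
1500-COMPANY-Series1-_-Category1-Green-Blue-_-ABCD001-01.html
2745-COMPANY-Series1-_-Category2-Red-Black-_-ABCD101-01.html
"""

# Global replace: \g<1> and \g<2> correspond to $1 and $2.
replaced = RULE.sub(r"\g<1>Purple\g<2>", PASTED)
before = PASTED.splitlines()
after = replaced.splitlines()
for old, new in zip(before, after):
    print("CHANGED  " if old != new else "unchanged", new)
```

Only the first two lines should change. The third has no "Black" and the fourth starts with 2 rather than 1.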

minarets

5+ Year Member



 
Msg#: 4582238 posted 8:52 pm on Jun 8, 2013 (gmt 0)

I've got a copy of Notepad++ running under Wine and am testing it out, but it doesn't capture properly.. This regex:

^(1\d\d\d-COMPANY-Series1-_-Category2-Red-)Black(-_-ABCD\d0\d-0\d\.html)

doesn't find anything. I think it's because of the ^ which (correct me if I'm wrong) is searching for the '1\d\d\d' at the start of the line, so it's overlooking the entire http://www.example.com/subdir/ part of the URL.
If I remove the ^ from the start of the line, and replace with $1Purple$2 I get the correct new URL in place of the original.

However, in trying to figure this out, I think I found out that I can in fact shorten the capture further. Since the "Black" I need to capture in the original URL is restricted to a specific alphanumeric sequence at the end of the URL, this seems to work:

(.*)Black(-_-ABCD\d0\d-0\d\.html)

This capture now looks for the word "Black" only in those instances where the alphanumeric is ABCD, and since the letters in the alphanumeric *are* a unique "SKU" type of number, they'll only occur for that particular series. So this seems to be a more concise regex:

(.*)Black(-_-ABCD\d0\d-0\d\.html) $1Purple$2 [R=301,L]

Do you see any issue with this new version? And was I correct regarding the leading ^ skipping over the entire domain/subdir/ portion of the URL? It seems to have been what was happening...

-min.

lucy24

WebmasterWorld Senior Member



 
Msg#: 4582238 posted 9:24 pm on Jun 8, 2013 (gmt 0)

I think it's because of the ^ which (correct me if I'm wrong) is searching for the '1\d\d\d' at the start of the line, so it's overlooking the entire http://www.example.com/subdir/ part of the URL.
If I remove the ^ from the start of the line, and replace with $1Purple$2 I get the correct new URL in place of the original.

Oops, yes, your text editor doesn't know from paths :) (It also doesn't know from query strings and hosts, so those too can't be tested exactly as-is.)

You are correct: for text-editor purposes, the opening ^ would have to be replaced with http://www.example.com/subdir/ (assuming the htaccess sits in /subdir/, so that the pattern starts at the file name). Or, equally simply, DELETE that leading element from all URLs, and then your text editor will be left with the same information that mod_rewrite is working on.
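The effect of the anchor can be reproduced in a short Python sketch. It assumes the htaccess sits in /subdir/, so the pattern only ever sees what follows that directory; per_directory_path is a hypothetical helper written for this illustration, and the sample URL is made up.

```python
import re
from urllib.parse import urlsplit

RULE = re.compile(r"^(1\d\d\d-COMPANY-Series1-_-Category2-Red-)Black(-_-ABCD\d0\d-0\d\.html)")


def per_directory_path(url, htaccess_dir="/subdir/"):
    """Strip scheme, host and the htaccess directory, leaving what the pattern sees."""
    path = urlsplit(url).path
    if path.startswith(htaccess_dir):
        return path[len(htaccess_dir):]
    return path.lstrip("/")


URL = "http://www.example.com/subdir/1745-COMPANY-Series1-_-Category2-Red-Black-_-ABCD101-01.html"

full_url_matches = RULE.match(URL) is not None       # ^ is followed by "http", not "1"
stripped = per_directory_path(URL)
stripped_matches = RULE.match(stripped) is not None  # now the string starts with "1745-"
print(full_url_matches, stripped, stripped_matches)
```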

I met the same situation earlier today in a different context: "Look, SubEthaEdit, you're supposed to KNOW that when I say ^ I really mean <td>. Sheesh."

Do you see any issue with this new version?

YES. You have replaced a perfectly fine-tuned Regular Expression with a sloppy and inefficient one. The form .* will capture everything all the way to the end. It will then have to backtrack until it finds the literal text it was supposed to exclude. As a human, you can see with your eyeballs that there's a "Black-_-ABCD" et cetera coming up, and you can see where the test string ends, so you know to stop before then. The computer operates in one dimension. It doesn't know it has reached the end of the test string until it gets there-- at which point the "Whoops! I was supposed to leave room for 'Black' et cetera" kicks in.

If you are only matching, not capturing, then you can simply leave off the ^.* piece and only show the part you need to match. But when you ARE capturing, you need to be careful not to capture more than you need.

The second issue, of course, is that the generic .* will capture everything. And until it reaches-- or fails to reach-- the -Black-etcetera part, it won't know that it was never supposed to be here at all. So it will happily capture requests for

www.example.com/some-completely-random-other-stuff-here.jpg

and

www.example.com/other-unrelated-requests.css

and will then have to spit it all out again when it meets the "These aren't the droids you're looking for" punchline. If you constrain the beginning of the RegEx to only those characters that will actually occur in the pattern, the server can get out of there all the sooner. "First character isn't a numeral one? On to the next rule, then."
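A rough Python sketch of the difference between the two patterns follows. It cannot show the server's internal backtracking, and Python's regex engine is not Apache's, so treat it as an illustration of what each pattern accepts. The "lookalike" path is hypothetical: it stands for any unrelated URL that happens to end in the same text.

```python
import re

TIGHT = re.compile(r"^(1\d\d\d-COMPANY-Series1-_-Category2-Red-)Black(-_-ABCD\d0\d-0\d\.html)")
LOOSE = re.compile(r"(.*)Black(-_-ABCD\d0\d-0\d\.html)")

wanted = "1745-COMPANY-Series1-_-Category2-Red-Black-_-ABCD101-01.html"
unrelated = "some-completely-random-other-stuff-here.jpg"
lookalike = "9001-OTHERCO-Series9-_-Category7-Matte-Black-_-ABCD101-01.html"  # hypothetical

for label, path in (("wanted", wanted), ("unrelated", unrelated), ("lookalike", lookalike)):
    print(label, bool(TIGHT.match(path)), bool(LOOSE.match(path)))

# What .* is left holding once it has grabbed the whole string and backed up.
loose_capture = LOOSE.match(wanted).group(1)
print(loose_capture)
```

Both patterns end up with the same first capture on the wanted path, but the loose one gets there by first consuming the entire string, and it also accepts the lookalike that the tight pattern turns away at the first character.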

There's a difference between "it works" and "it works cleanly and efficiently" in the same way that there is a difference-- both in programming and in real life-- between "it is legal" and "it is good".

minarets

5+ Year Member



 
Msg#: 4582238 posted 1:50 am on Jun 9, 2013 (gmt 0)

Ok so testing directly on the server, the existing htaccess rules are apparently causing other issues. The application this site uses has a bunch of preinstalled rewrite rules for SEO purposes and for blocking out common exploits and, maybe as a result of that, this regex isn't doing anything at all. The "old" URL remains in the address bar and the response header is a 200 and not a 301.

Also, if I temporarily test with a straight "redirect 301" it appends the old URL to the new url with a ?arg= inbetween, like so:

https://www.example.com/subdir/4172-COMPANY-Series1-_-Category1-Red-Purple-_-ABCD001-01.html?arg=subdir/4172-COMPANY-Series1-_-Category1-Red-Black-_-ABCD001-01.html

Not quite sure how to approach it at this point but am still trying.

-min.

lucy24

WebmasterWorld Senior Member



 
Msg#: 4582238 posted 2:36 am on Jun 9, 2013 (gmt 0)

Ouch! mod_alias (Redirect by that name) doesn't do anything with query strings except pass them along, so most likely one of the application's internal rewrites has already attached its ?arg= query and the Redirect is carrying it over to the new URL. No point in disentangling it, though, because if you have significant mod_rewrite activity you definitely don't want to throw mod_alias into the mix.

If you're adding new RewriteRules to an existing htaccess, it's crucial to get them in the right order. Each module is an island, so you don't need to think about authorization and mod_setenvif and other stuff. Just your rewrites and redirects.

There are two layers of ordering. There can be exceptions, but those are for specific cases. Start with the default.

FIRST group your rules in order of severity. That means:

-- start with any RewriteRule that results in an unequivocal [F]. No point redirecting people if you're going to end up locking them out. You can't do anything about people getting locked out by other means, such as IP in mod_authz-whatsit. But you can be efficient within mods.

-- then any RewriteRule that results in a [G], for pages that used to exist and are now gone, but that don't have a new URL you're redirecting to.

-- then any RewriteRule that results in an external redirect (with [R] flag and/or protocol-plus-domain in the target)

-- finally any RewriteRule that creates only an internal rewrite. If you've got SEO rules made to create shorter or prettier URLs, pay special attention to these. Any external redirects have to happen before internal rewrites; that's why it's risky to have mod_alias (Redirect by that name) running alongside mod_rewrite.

-- and super-finally: any RewriteRule that doesn't directly affect the request at all, such as rules that merely set a cookie or create an environment variable for some other mod to use. You may not have any of these at all. Unlike most rules, these don't carry an [L] flag.

THEN, within each of these main groups, list rules from most specific to most general. If one thing happens to /directory/subdir/ and something else happens to the rest of /directory/, the rule for /directory/subdir/ obviously has to go first.

The specific-to-general pattern is most obvious in your redirects. Start with individual, named pages. Or groups of pages covered by a single rule, as in the present thread. Then move on to bigger groups. Most htaccess files will have two final redirects: one for requests ending in "index.html" (or whatever extension you physically use), and a last one for non-standard forms of the domain name.

Once the redirects are out of the way, you start the specific-to-general loop all over again with your rewrites (generally [L] flag alone).
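The first layer of ordering can be checked mechanically. What follows is a rough Python sketch, not an Apache tool: rule_rank and out_of_order are hypothetical helpers, the parsing is deliberately simplified (whitespace-separated fields, no conditions, no cookie or environment-only rules), and the second sample rule is made up.

```python
import re


def rule_rank(line):
    """Ordering group of one RewriteRule line: 0=[F], 1=[G], 2=redirect, 3=rewrite."""
    parts = line.split()
    target = parts[2] if len(parts) > 2 else ""
    flag_text = parts[3].strip("[]") if len(parts) > 3 else ""
    flags = {flag.split("=")[0].upper() for flag in flag_text.split(",") if flag}
    if "F" in flags:
        return 0
    if "G" in flags:
        return 1
    if "R" in flags or re.match(r"https?://", target):
        return 2
    return 3


def out_of_order(lines):
    """Return RewriteRule lines that come after a rule from a later group."""
    problems = []
    highest = 0
    for line in lines:
        if not line.strip().startswith("RewriteRule"):
            continue
        rank = rule_rank(line)
        if rank < highest:
            problems.append(line)
        highest = max(highest, rank)
    return problems


SAMPLE = [
    r"RewriteRule ^search/?$ /index.php?arg=search/index.html [NC,L]",
    r"RewriteRule ^old\.html$ https://www.example.com/new.html [R=301,L]",
]
misplaced = out_of_order(SAMPLE)
print(misplaced)
```

Here the redirect is flagged because it sits below an internal rewrite. Swapping the two lines clears the report.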


If you have an htaccess file cobbled together from different sources, you may have bits and pieces of RewriteRules scattered throughout. Collect them all in one place. This is for your own sanity; the server doesn't care if all the rules from all the different mods are in a big hodgepodge. But you can't keep them sorted if you can't find them.

If you have an htaccess that contains an element something like

<IfModule mod_rewrite.c>
...
RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
RewriteCond {more-boilerplate-which-I-haven't-memorized}
RewriteRule {blahblah} /index.php?{captured-stuff-here} [L]
</IfModule>

we will need to start over again from the very beginning. Let's hope it does not come to that :)

minarets

5+ Year Member



 
Msg#: 4582238 posted 10:18 pm on Jun 11, 2013 (gmt 0)

Had to put this on hold due to other projects... I took a look at the current htaccess against an untouched and inactive source copy that the software provided, and it actually doesn't appear to have been modified much at all, except for an http to https redirect, some custom error handler declarations, and a maintenance mode redirector block that is currently commented out.

The rest of the htaccess contains a block labeled as redirects to block common exploits, and then several blocks which follow this general structure:

RewriteRule ^(.*[\/])search[\/]?$ index.php?arg=$1search/index.html [NC,L]
RewriteRule ^(.*[\/])search?$ index.php?arg=$1search/index.html [NC,L]
RewriteRule ^search[\/]?$ index.php?arg=search/index.html [NC,L]
RewriteRule ^search?$ index.php?arg=search/index.html [NC,L]

I can't figure out which section is preventing or blocking the 301 redirects though.

-min.

lucy24

WebmasterWorld Senior Member



 
Msg#: 4582238 posted 11:52 pm on Jun 11, 2013 (gmt 0)

(.*[\/])

Excuse me. Did I leave that cold compress lying around somewhere? I'm going to need it again.

Directory slashes never need to be escaped in mod_rewrite; it's doubly unnecessary within grouping brackets. And triply unnecessary when the brackets only contain one thing so you don't even need the brackets!

You say "blocks which follow this general structure" so let's see if we can extrapolate from this particular example.

The quoted group of four rules all seem to be intended to do the same thing: If the request contains the element "/search" then rewrite to index.php?arg=search/index.html

The target should have a leading slash:
/index.php et cetera

Now, I'm not going to tell you how to write your php
:: pause here for massive sighs of relief from assorted quarters ::
but is that whole query string really necessary? I assume the php is doing something based on the premise that the requested page was /search/index.html ... but the requested page never IS /search/index.html because any request ending in "index.html" has already been redirected. It should have been the second-to-last redirect ([R=301,L] flag) within the RewriteRule group. External redirects come before internal rewrites.

And then the pattern:

search?$ means "the request ends in 'search' or in 'searc'". Possibly the person who wrote the RewriteRule thought this is how you say "there might be a query string here". Not so: the body of a rule sees only the path. So that's two of four rules out the window, leaving
RewriteRule ^(.*/)search/?$
and
RewriteRule ^search/?$

Do you have any directories whose name ends in "search" but contains other text? If not, the whole package boils down to
RewriteRule search/?$ /index.php et cetera.

If you do need to exclude longer constructs containing the literal text "search" you can constrain it to
RewriteRule (^|/)search/?$ et cetera
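The behaviour of these pattern variants is easy to confirm with a short Python sketch. The test paths are made up, and the [NC] flag is left out of the picture.

```python
import re

ORIGINAL = re.compile(r"^search?$")          # the "h" is optional, not the slash
SLASH_OPTIONAL = re.compile(r"^search/?$")   # what was probably intended
UNANCHORED = re.compile(r"search/?$")        # matches anything ending in "search"
CONSTRAINED = re.compile(r"(^|/)search/?$")  # "search" must be a whole path segment

PATHS = ["searc", "search", "search/", "widget/search/", "research/"]

results = {}
for path in PATHS:
    results[path] = (
        bool(ORIGINAL.search(path)),
        bool(SLASH_OPTIONAL.search(path)),
        bool(UNANCHORED.search(path)),
        bool(CONSTRAINED.search(path)),
    )
    print(path, results[path])
```

The original pattern accepts "searc" and rejects "search/", while the unanchored form accepts "research/" and the constrained form does not.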

But now there's another basic principle to bring up: When a rule creates a redirect, it is sometimes appropriate or at least acceptable to throw in variations, like an optional "index.html" or the [NC] flag. When you're rewriting alone, the pattern has to have exactly one form, because the URL will not change again. (This is assuming it's a simple rewrite and the php is not going to end up issuing a 301 redirect.) Even after eliminating the "searc" booboos, you've still got both
/search/
and
/other-directories-here/search/

The rule doesn't include any provision for testing whether those outer directories actually exist. The php can't do it, because it hasn't been given the information. (I have to assume the php page doesn't test where you were redirected from, because if so, what's the query string for?) Under what circumstances would a request for /foobar/search/ or /widget/search/ come in? If there are careless internal links, fix them. If there are external links, redirect them as
RewriteCond %{REQUEST_URI} !^/search/
RewriteRule /search/?$ https://www.example.com/search/ [R=301,L]

To answer the original question: There is nothing in this group of rules-- or any other similarly constructed group-- that would prevent a 301 redirect from happening. UNLESS #1 the rules creating the redirect come after the rules creating internal rewrites AND #2 the pattern for one of these RewriteRules happens to contain the same literal text as the URLs you want to redirect. Or, of course, something in that "common exploits" pattern matches your intended redirects. This does not seem likely :)


All trademarks and copyrights held by respective owners. Member comments are owned by the poster.
WebmasterWorld is a Developer Shed Community owned by Jim Boykin.
© Webmaster World 1996-2014 all rights reserved