
Apache Web Server Forum

    
Very tricky rewrite help needed.
Using two sets of regex per url?
416bc



 
Msg#: 4368285 posted 2:34 pm on Sep 28, 2011 (gmt 0)

The old website used some bad trickery in the search engines and created URLs on the fly. I have created two regex snippets, one to target the trickery URL and one to target the normal URL.

Normal:
([a-zA-Z0-9\/*]*)_
safety/story_Child+Proof+Your+<<city>>+Home.html
community/story_New+Year's+Resolutions!.html

Tricky:
([a-zA-Z0-9\/\-*]*),([A-Z*]*)-
safety/local-omaha,NE-Child_Proof_Your_<<city>>_Home.html
community/-orange,CA-New+Year's+Resolutions!.html

There are two samples of each URL type above. The regex matches the part of the URL I don't care about, which can be deleted; I'd like to keep the remainder and create a rule to forward those requests to the new site.

What I'm trying to get is this in regex code:
See if this matches anything: ([a-zA-Z0-9\/-]),([A-Z])-

If it does, then run this code:
redirect ([a-zA-Z0-9\/-]),([A-Z])-Child_Proof_Your<<city>>Home.html to http://www.newdomainandsite.com/Child-Proof-Your-Home [301,L]

Otherwise, if that regex doesn't match the URL, do this:
redirect ([a-zA-Z0-9\/])_Child+Proof+Your+<<city>>+Home.html to http://www.newdomainandsite.com/Child-Proof-Your-Home [301,L]

I know this is broken. Is there an easier way to 301 redirect the old articles to the new site?

 

g1smd

WebmasterWorld Senior Member, Top Contributor of All Time, 10+ Year Member



 
Msg#: 4368285 posted 4:25 pm on Sep 28, 2011 (gmt 0)

You don't need to "pre-match" then do the redirect.

Either the full-length RegEx pattern matches or it doesn't.

Where's the split supposed to be? It looks like it should be AFTER the first hyphen after the comma.

Slashes do NOT need to be escaped inside character groups, and neither does a hyphen as long as it is the first or last character in the group.

Do your URLs really have a * in them? If not, why are you defining * as a valid character in your character group?

For complex URL manipulations it might be better to detect all URL requests that contain a comma and rewrite (that's rewrite not redirect) those requests to a special PHP script. That PHP script will use preg_match, str_replace, and so on to generate the new URL and then use two header directives to send the 301 response and the new location back to the browser.
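As a rough sketch of the string handling such a script would do, here is the logic in Python rather than PHP, purely so the example is self-contained and easy to test (re.compile and str.replace stand in for preg_match and str_replace). The base URL and the Child-Proof-Your-Home target come from the first post. The function name new_url, the assumption that <<city>> appears literally in the request, and the rule of joining the remaining words with hyphens are guesses about the site, not something this thread confirms.

```python
import re

# Hypothetical base URL, copied from the example target in the first post.
NEW_BASE = "http://www.newdomainandsite.com/"

# "Tricky" prefix: path, comma, state code, hyphen (e.g. "safety/local-omaha,NE-").
TRICKY_PREFIX = re.compile(r"^[a-zA-Z0-9/-]*,[A-Z]+-")
# "Normal" prefix: everything up to and including the first underscore.
NORMAL_PREFIX = re.compile(r"^[a-zA-Z0-9/]*_")


def new_url(old_path):
    """Return the new URL for an old path, or None if neither prefix matches."""
    for prefix in (TRICKY_PREFIX, NORMAL_PREFIX):
        match = prefix.match(old_path)
        if match:
            title = old_path[match.end():]   # keep only the part after the prefix
            break
    else:
        return None
    title = re.sub(r"\.html$", "", title)    # drop the file extension
    title = title.replace("<<city>>", "")    # assumed literal placeholder
    words = [w for w in re.split(r"[+_]+", title) if w]
    return NEW_BASE + "-".join(words)


print(new_url("safety/story_Child+Proof+Your+<<city>>+Home.html"))
print(new_url("safety/local-omaha,NE-Child_Proof_Your_<<city>>_Home.html"))
```

Both sample URLs collapse to the same target, which is the point of doing the work in one script instead of two separate rules. A real script would then send the 301 status and the Location header with that value.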

[edited by: g1smd at 4:40 pm (utc) on Sep 28, 2011]

416bc



 
Msg#: 4368285 posted 4:39 pm on Sep 28, 2011 (gmt 0)

I really don't know regex that well. The * means to me any number of repeats. If I leave that out, it only matches a single character.

The split occurs in a different spot depending on the URL.

In this sample community/story_New+Year's+Resolutions!.html it should split after the first underscore.

In this sample community/-orange,CA-New+Year's+Resolutions!.html it should split after the second dash.

That's why I'm having trouble.

416bc



 
Msg#: 4368285 posted 4:40 pm on Sep 28, 2011 (gmt 0)

I like your idea of the PHP script. I may do that. I'd need to know how to get the two different regex types to forward to the new php file for each type though.

g1smd




 
Msg#: 4368285 posted 4:43 pm on Sep 28, 2011 (gmt 0)

Inside [ ] is a character group. It lists valid characters.

[a-z] matches lower case letters.

[ ] means "one of anything inside the box".

[ ]? means "one or zero of anything inside the box".

[ ]+ means "one or more of anything inside the box".

[ ]* means "zero or more of anything inside the box" (that is, one or more OR blank).

[ * ] means "a literal * character" will appear in the URL.

,([A-Z*]*)- Note that the [ ]* here allows a hyphen to directly follow a comma.
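To see those rules in action, here is a small check using Python's re module, which uses the same syntax as Apache's regex engine for these particular constructs. The helper name whole_match and the sample strings are made up for illustration.

```python
import re


def whole_match(pattern, text):
    """Return True when the pattern matches the entire text."""
    return re.fullmatch(pattern, text) is not None


examples = [
    (r"[a-z]", "q"),       # exactly one character from the group
    (r"[a-z]", "qq"),      # two characters: too many for a bare group
    (r"[a-z]?", ""),       # ? accepts zero characters
    (r"[a-z]+", ""),       # + needs at least one character
    (r"[a-z]*", ""),       # * accepts zero characters
    (r"[A-Z*]", "*"),      # * inside the brackets is a literal asterisk
    (r",[A-Z*]*-", ",-"),  # so the hyphen may directly follow the comma
    (r",[A-Z]+-", ",-"),   # with + at least one letter is required
]

for pattern, text in examples:
    print(f"{pattern!r} vs {text!r}: {whole_match(pattern, text)}")
```

The last two lines show the practical difference: swapping [A-Z*]* for [A-Z]+ stops a URL with nothing between the comma and the hyphen from matching.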

[edited by: g1smd at 5:22 pm (utc) on Sep 28, 2011]

416bc



 
Msg#: 4368285 posted 4:52 pm on Sep 28, 2011 (gmt 0)

Wow, you've helped me before, and you're doing it again, thanks! So I'm thinking I can forward any URL with a _ to a PHP page, and from there determine the full URL structure and create arrays of what to match and where to send them.

Does this code look right to you?
Redirect 301 [\_\<\>]* [website.com...]

g1smd




 
Msg#: 4368285 posted 4:59 pm on Sep 28, 2011 (gmt 0)

Not quite.

rewrite (that's rewrite not redirect) those requests to a special PHP script

If you redirect to the script, you'll have an unwanted double redirect for every request (a redirect chain).

You'll need a RewriteRule and the target must NOT contain the domain name or the [R] flag.

[edited by: g1smd at 5:02 pm (utc) on Sep 28, 2011]

416bc



 
Msg#: 4368285 posted 5:01 pm on Sep 28, 2011 (gmt 0)

Options +FollowSymlinks
RewriteEngine on
RewriteRule ^[\_\<\>]*$ [website.com...] [R=301,NC]

g1smd




 
Msg#: 4368285 posted 5:03 pm on Sep 28, 2011 (gmt 0)

RewriteRule can be configured to generate a redirect or to perform an internal rewrite.

You need to configure it as a rewrite.

416bc



 
Msg#: 4368285 posted 5:08 pm on Sep 28, 2011 (gmt 0)

So I take it that wasn't correct?

Options +FollowSymlinks
RewriteEngine on
RewriteRule ^[\_\<\>]*$ http://www.website.com/redirect.php [R=301,NC]


I want it 301'd though so I don't lose the link juice. I'm googling everything you tell me, but it's pieces.

g1smd




 
Msg#: 4368285 posted 5:16 pm on Sep 28, 2011 (gmt 0)

You do NOT want to 301 redirect to redirect.php and then have the PHP script issue a second 301 redirect to the new URL.

THAT is creating an unwanted multi-step redirection chain and would be a disaster.

You REWRITE the request to be handled by the PHP script, so that the user doesn't "see" the script filename as a URL. The PHP script then returns the redirect to the new URL.

Remove the domain name. Remove the R flag. Add the L flag.

Use the Live HTTP Headers extension for Firefox to investigate what happens for both ways of doing it. The single redirect method is the right one.

416bc



 
Msg#: 4368285 posted 5:31 pm on Sep 28, 2011 (gmt 0)


Options +FollowSymlinks
RewriteEngine on
RewriteRule ^[\_\<\>]*$ /redirect.php [L,NC]

I get it now; I was confused. I understand we aren't redirecting them with this, but rather pointing them to the file, which will redirect them via PHP. Is my above code correct now?

g1smd




 
Msg#: 4368285 posted 5:41 pm on Sep 28, 2011 (gmt 0)

Yes, the silent rewrite to a script which then does all the work can be a useful tool on many occasions.

The main benefit is that for requests that do not need to be processed by the script, there's just one line of code to be skipped in the htaccess file, so normal site operation is not slowed down in any way.

You're now free to experiment with the code inside the PHP file, safe in the knowledge that it cannot interfere with the normal operation of the rest of the site.

Be aware that a Location header sent with the PHP header() function returns a 302 response unless you specifically state that the status code should be 301.

You will need the Live HTTP Headers extension for Firefox to be absolutely sure the script returns the right responses.

The code looks OK, but the NC flag isn't needed as there are no letters in the pattern itself.

[edited by: g1smd at 5:52 pm (utc) on Sep 28, 2011]

416bc



 
Msg#: 4368285 posted 5:43 pm on Sep 28, 2011 (gmt 0)

Thanks again g1smd. I'll get to testing this now.

g1smd




 
Msg#: 4368285 posted 5:45 pm on Sep 28, 2011 (gmt 0)

When it all works as designed it is a thing of beauty to behold.

Technology as the highest form of art. :)

416bc



 
Msg#: 4368285 posted 5:51 pm on Sep 28, 2011 (gmt 0)

I never thought of technology as art. That's a good way to think of it though.

Anyway I tested it, and it didn't work, here is my code

Options +FollowSymlinks
RewriteEngine on
# Sends all traffic that contains < to home page
RewriteRule ^[\<]$ /fmg [L]


and here is the URL I used to test

http://www.website.com/fmg/title-of-a-node_d<<city>>

g1smd




 
Msg#: 4368285 posted 5:55 pm on Sep 28, 2011 (gmt 0)

^ and $ are "begins with" and "ends with" anchors.

^[\<]$ will match only one requested URL path:
example.com/<

Remove ^ and $ if you need "contains" rather than "matches exactly".

The < does not need to be escaped in a character group.
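A quick way to convince yourself is to try the patterns against the test URL. This sketch uses Python's re module, which treats ^, $ and [ ] the same way for this purpose. The path is written without the leading slash because that is how a RewriteRule in a root .htaccess file sees it; the variable names are only for illustration.

```python
import re

# Path as a RewriteRule in the root .htaccess would see it (no leading slash).
path = "fmg/title-of-a-node_d<<city>>"

anchored = re.search(r"^[<]$", path)    # whole path must be a single "<"
unanchored = re.search(r"[<]", path)    # a "<" anywhere in the path
escaped = re.search(r"[\<]", path)      # the backslash changes nothing

print("anchored pattern matched:", anchored is not None)
print("unanchored pattern matched:", unanchored is not None)
print("escaped and unescaped agree:", unanchored.start() == escaped.start())
```

The anchored version fails on the test URL and only succeeds when the entire path is the single character <.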

What is /fmg? If that is the name of your redirecting script file it should have .php on the end.

[edited by: g1smd at 6:06 pm (utc) on Sep 28, 2011]

416bc



 
Msg#: 4368285 posted 6:02 pm on Sep 28, 2011 (gmt 0)

No, that's the subdirectory the site is in for now. When I move it to a live site, it won't be in FMG. I'm finding help on Google too, so I'll let you know the final code that works for me. Thanks!

g1smd




 
Msg#: 4368285 posted 6:09 pm on Sep 28, 2011 (gmt 0)

With only the [L] flag (no [R] flag and no domain name in the target), that's a rewrite, not a redirect.

The server will be looking for a file called /fmg inside the server to serve the content at the originally requested URL. If it doesn't find that file it might return a 404 error or it might look for a folder called /fmg/ and then look for a file called index.php inside that folder.

416bc



 
Msg#: 4368285 posted 6:38 pm on Sep 28, 2011 (gmt 0)

RewriteRule ^(.*)_(.*)$ /fmg/redirect.php [L]

That's my final code which worked great. I decided to search for the underscore because more pages use that on the site.

lucy24

WebmasterWorld Senior Member, Top Contributor of All Time, Top Contributors Of The Month



 
Msg#: 4368285 posted 7:42 pm on Sep 28, 2011 (gmt 0)

no, that's the sub directory the site is in for now. When I move it to a live site, it won't be in FMG.

You may be thinking of Redirect, using mod_alias, where the rest of the path gets attached to the end of whatever you specify. In mod_rewrite you have to spell out the whole thing all the way to the end.

btw: Way back in your original post you had some addresses containing + signs. If you ever need to deal with those in a RegEx, note that the + has to be \+ escaped everywhere except in grouping brackets.
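For anyone who wants to see the difference, here is a short check in Python's re module, which handles + the same way in this respect. The path is one of the samples from the first post; the variable names are only illustrative.

```python
import re

path = "safety/story_Child+Proof+Your+<<city>>+Home.html"

unescaped = re.search(r"Child+Proof", path)      # "d+" means one or more "d"
escaped = re.search(r"Child\+Proof", path)       # \+ is a literal plus sign
in_brackets = re.search(r"Child[+]Proof", path)  # no escape needed in brackets

print("unescaped matched:", unescaped is not None)
print("escaped matched:", escaped is not None)
print("bracketed matched:", in_brackets is not None)
```

The unescaped pattern would happily match something like ChilddProof, but it never matches the real URL because it has no way to consume the + character.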

RewriteRule ^(.*)_(.*)$ /fmg/redirect.php [L]

It would do exactly the same thing and run faster if you simply said

RewriteRule _ /fmg/redirect.php [L]

Since you're not capturing the non-lowline parts you don't need them; you're just looking for anything that contains a lowline. Besides, g1 is about to read you the riot act about using .* anywhere other than after all matches ;)

416bc



 
Msg#: 4368285 posted 7:48 pm on Sep 28, 2011 (gmt 0)

Thanks Lucy, I have updated my code.

g1smd




 
Msg#: 4368285 posted 11:54 pm on Sep 28, 2011 (gmt 0)

The first (.*) pattern says "capture the entire URL". So when used as (.*)_ it says "after the end of 'everything' look for an underscore".

The parser then throws a fit and has to attempt hundreds of "back off and retry" trial matches to find the underscore inside the "everything". Of course, the underscore is not near the end; it is in the middle.
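The back-off can be pictured with a deliberately simplified model, sketched here in Python. The function greedy_attempts is hypothetical: it only counts the end positions a leading greedy .* would give back before a following underscore can match, and real regex engines apply optimisations that change the actual step count. The final line uses Python's re module to show the visible side effect, namely that the captured text runs up to the LAST underscore.

```python
import re


def greedy_attempts(text, literal="_"):
    """Simplified model of (.*)_ : count the end positions tried for .*"""
    attempts = 0
    # .* starts by taking the whole string, then gives back one character at a time.
    for end in range(len(text), -1, -1):
        attempts += 1
        if text.startswith(literal, end):
            break
    return attempts


path = "safety/story_Child+Proof+Your+<<city>>+Home.html"
print("trial positions in this model:", greedy_attempts(path))
print("captured by (.*)_ :", re.match(r"(.*)_", "a_b_c").group(1))
```

The further the underscore is from the end of the URL, the more positions are tried, and a URL with no underscore at all forces every position to be tried before the match is abandoned.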

NEVER use .* at the beginning or in the middle of a pattern.

Since you're not re-using the captured data as $1 or $2 the parentheses are also redundant.

So, all you need is

RewriteRule _ /fmg/redirect.php [L]
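As a sanity check that the short pattern selects the same requests as the long one, here is a comparison in Python's re module. The first two paths are samples from the first post and the third is a made-up path with no underscore; long_form and short_form are illustrative names.

```python
import re

paths = [
    "safety/story_Child+Proof+Your+<<city>>+Home.html",
    "community/-orange,CA-New+Year's+Resolutions!.html",
    "about/contact.html",  # hypothetical path with no underscore
]


def long_form(path):
    """The original pattern with two redundant captures."""
    return re.search(r"^(.*)_(.*)$", path) is not None


def short_form(path):
    """The reduced pattern: any path containing an underscore."""
    return re.search(r"_", path) is not None


for p in paths:
    print(p, long_form(p), short_form(p))
```

Both forms agree on every path. Note that the second sample, a comma-style URL with no underscore in it, is not caught by either form, so URLs of that shape would need their own rule.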


Placement of this code is crucial.

It goes before just about all of your other rules.

It even goes before your non-www to www redirect, because otherwise a non-www request with underscore would pass through a double redirect (one to add www, another to fix the path), and you do not want that. Your PHP script specifies www anyway, so rewrite underscore requests before anything and everything else.

All trademarks and copyrights held by respective owners. Member comments are owned by the poster.
© Webmaster World 1996-2014 all rights reserved