Forum Moderators: phranque


Replace + Character in Rewrite Rule


LionMedia

8:29 pm on Jul 25, 2010 (gmt 0)

10+ Year Member



Hi,
I've looked everywhere for a solution to this. I have a rewrite rule like this

RewriteRule ^(.*)-county/(.*)-md/(.*)-homes-for-sale-page([0-9]*)\.html index.php?action=searchresults&PageID=searchresultssef&County=$1&city=$2&Type=$3&ForSale=Y&cur_page=$4 [L]

It works fine, except some of the variables have one or more + signs that are spaces in the database. So the result might look like this:

anne+arundel-county/annapolis-md/garden+1-4+units-homes-for-sale-page.html

I've read the forum documentation and searched the internet, but I'm still stuck. Any assistance or suggestions on where to find the answer would be great.

jdMorgan

8:44 pm on Jul 25, 2010 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I'll bet that code makes your server really slow... You should specify a much-more-specific pattern for those URLs.

Replacing characters in .htaccess is also very slow because recursion can only be done at the 'entire file' level. That is, if you want to replace more than one character, then you either need multiple rules, or you will have to re-start processing of the entire set of rules all over again.

Be very sure that you cannot simply modify index.php to replace these characters -- It would be many times more efficient to do so. The preg_replace function is quite suitable for this.
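A minimal sketch of that idea, in Python purely for illustration (the function name is hypothetical; in index.php the equivalent would be str_replace('+', ' ', $value) or preg_replace('/\+/', ' ', $value)):

```python
import re

def normalize_param(value):
    """Convert the '+' separators in a URL segment back to the
    spaces stored in the database, before the value is used in
    the search query."""
    return re.sub(r"\+", " ", value)

print(normalize_param("anne+arundel"))      # anne arundel
print(normalize_param("garden+1-4+units"))  # garden 1-4 units
```

Doing this once in the script, just before the database lookup, avoids the per-character rule restarts that the same replacement would cost in .htaccess.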

Jim

LionMedia

10:33 pm on Jul 25, 2010 (gmt 0)

10+ Year Member



Thanks. I will research more specific patterns.

So just to clarify...I can find that part of index.php that is responsible for URL structure and add the preg_replace function to replace spaces/characters with a dash?

g1smd

11:22 pm on Jul 25, 2010 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



The more specific pattern will be something like "match to next hyphen" or "match to next slash" or similar.

jdMorgan

11:37 pm on Jul 25, 2010 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Yes, very likely just before the line where the query to the database is performed.

More-specific pattern:

RewriteRule ^([^-/]+(-[^-/]+)*)-county/([^-/]+(-[^-/]+)*)-md/([^-/]+(-[^-/]+)*)-homes-for-sale-page([0-9]+)\.html$ index.php?action=searchresults&PageID=searchresultssef&County=$1&city=$3&Type=$5&ForSale=Y&cur_page=$7 [L]

The subpatterns are made more complicated because you don't want the trailing hyphens to be included in the query-string parameters, but the rule will still execute much, much faster, because these subpatterns allow the requested URL-path to be parsed from left to right, rather than requiring thousands of back-off-and-retry matching attempts -- essentially proceeding one character at a time backwards from the end, and then invoking many repeats of that... Remember that ".*" means, "Match anything, everything, or nothing."

So, for example, on the first pattern-match attempt, the first ".*" would 'consume' the entire URL-path, try to match, find that no hyphen followed ".html", fail, try again by 'consuming' all characters in the URL-path except for the trailing "l" of ".html", find that "l" was not equal to a hyphen, fail, and try again. It would finally work its way back to the 'sale' in "homes-for-sale-page" and match on that hyphen, but then find that "page" did not equal "county".

It would then continue one character at a time, with several more 'disappointed partial matches' on "for-", "homes-", "custom", "gaithersburg-", etc., until it finally worked its way back to "Ann+Arundel", for example.

Having done that, it would then commence trying to match your second subpattern using all of the URL-path from the end back up to that point. The net effect is a very slow rule.

The proposed first subpattern now reads, "Match one or more characters not equal to a hyphen or a slash, followed by as many repetitions as you like (including zero) of {a hyphen followed by one or more characters except a hyphen or a slash}, followed by a hyphen, followed by 'county'." So there will be only one back-off-and-retry in this subpattern match -- occurring initially when the matching engine tries to put "county" into the matched second sub-subpattern and finds that it is not followed by a hyphen as required. However, this is certainly not as bad as matching all the way to the end of the entire requested URL-path and then working back from there several times (once per pass for each remaining un-matched subpattern).

Also, because the "not a slash" requirement is included in each sub-subpattern, the matching engine will never proceed past any slash while matching any of these subpatterns. So the subpattern processing occurs within these distinct "boundaries" within the requested URL-path.

If your server gets very busy you will likely notice that its performance is better with this more-specific pattern.

To resolve the contents of each back-reference $1 through $9, count left parentheses.
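A quick way to check that counting is to try the same subpatterns in Python, which numbers capture groups the same way (the page number "2" below is made up for the example):

```python
import re

# Same subpatterns as the RewriteRule above. Counting left parentheses
# gives the back-reference numbers: $1 = County, $3 = city, $5 = Type,
# $7 = page ($2, $4, $6 are the inner repeated "-word" sub-subpatterns
# and go unused in the substitution).
pattern = re.compile(
    r"^([^-/]+(-[^-/]+)*)-county/"
    r"([^-/]+(-[^-/]+)*)-md/"
    r"([^-/]+(-[^-/]+)*)-homes-for-sale-page"
    r"([0-9]+)\.html$"
)

m = pattern.match(
    "anne+arundel-county/annapolis-md/"
    "garden+1-4+units-homes-for-sale-page2.html"
)
print(m.group(1), m.group(3), m.group(5), m.group(7))
# anne+arundel annapolis garden+1-4+units 2
```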

Jim

LionMedia

12:18 am on Jul 26, 2010 (gmt 0)

10+ Year Member



Thank you! You're amazing.

SevenCubed

1:21 am on Jul 26, 2010 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Jim, surely you must think in binary; just reading your explanation makes me dizzy, let alone trying to figure out something like that in the first place ;)

g1smd

6:39 am on Jul 26, 2010 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



It all comes from getting the "requirements" part of the process right, and from the initial design of the URL format. Once it is clear what each element in the URL is, and how they are delimited, the coding bit is trivial.

jdMorgan

3:17 pm on Jul 26, 2010 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



It is far easier to write regular expressions than to read them. This is because, as you construct a regex pattern, you know what your intent is and can proceed one step at a time. Only when you've finished and look back at the result might there be any 'shock and awe' at the creation.

However, lacking really extensive comments in the code, it is often quite difficult to read someone else's regex patterns and infer their intent.

By posting questions here, members get experience solving one or a few problems. However, the members who answer questions get experience solving dozens, hundreds, or even thousands of problems. Both groups benefit. :)

I frequently answer 'glowing reviews' with the observation that I have simply written a lot more bad code than most people posting questions here, and have benefited from that experience... ;)

Think in binary... Yes, I can do that. Or hexadecimal. ASCII translation or CIDR/netmask charts? -- Rarely needed. However, all-in-all, I'd really rather be young again.

Thanks for the kind remarks. :)

Jim

LionMedia

3:42 pm on Jul 26, 2010 (gmt 0)

10+ Year Member



Given that I'm very limited in my PHP and don't want to make a total mess of things, is leaving the + in the URL that bad for SEO?

SevenCubed

4:00 pm on Jul 26, 2010 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I'll just add that I am constantly poring over posts in this Apache forum to get a better understanding of a branch of IT that I previously had no exposure to.

Bit by bit I'm extracting snippets that I can apply to my own server to tweak it for optimal performance and (better) security. I am quickly realizing that I cannot simply copy and paste what is shared here. Many times it just brings the server to its knees (which is why I am now testing on a local dev server using WAMP). I think my major stumbling block is that many times the examples here are for .htaccess files, whereas I've moved away from them to the root httpd.conf file for increased performance and disabled .htaccess completely in Apache so it doesn't go looking for those files, so they just don't work at times. But that's OK; nothing is critical for me, unlike some users who post here because they have a major urgent issue.

Once I finally understand this last bit of the grand IT scheme of the Internet, I'll feel complete :) It's been a fascinating 10 years of learning for me, though! I was a late bloomer into IT -- never even turned on a PC until middle age. Then an opportunity came up to go back to school for retraining, and here I am. So, as for wanting to be young again, well, keeping the mind sharp through constant analytical processes will keep it from getting rusty!

Now if I could just slow down my mind's single-core processor enough at night to rebuke the stormy waters (thoughts) -- I just may be able to turn myself into a 64-bit dual-core processor through meditation and neuron-path expansion! Tidbit of info: the number 64 is also the core building block of human DNA, as well as all that there is.

Oops I did it again, going on and on and on...

Cheers all :)

jdMorgan

7:51 pm on Jul 26, 2010 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I am quickly realizing that I cannot simply copy and paste what is shared here. Many times it just brings the server to its knees (which is why I am now testing on a local dev server using WAMP).

Very true. One must understand the entire 'meaning' of patterns, rules, and entire collections of rules before trying to use the code.

The critical questions are: "Does this do exactly what I want it to do, with my server, my site, my URLs, my directory-structure, and address the specific problem that I want to solve?" and "What effects will visitors and search engines see, and what will this do to the visitor-experience on my site and to my search listings?"

A rule might seem quite simple, but its effects can be quite far-reaching... Ask anyone who's ever unwittingly exposed internal script filepaths as URLs by redirecting after an internal rewrite, and has had to go clean up the mess that that creates! :o

I think my major stumbling block is that many times the examples here are for .htaccess files, whereas I've moved away from them to the root httpd.conf file for increased performance and disabled .htaccess completely in Apache so it doesn't go looking for those files, so they just don't work at times.

The major difference is that the URL-paths matched by patterns in RewriteRules located in .htaccess files will have been stripped of the path used to get to that .htaccess file.

So, for example, a rule located in example.com/.htaccess and intended to match a client request for "example.com/foo" will have a pattern of "^foo$" -- note that the leading slash will have been stripped from the requested URL-path, because that is part of the path to this .htaccess file. And a rule located in example.com/dir/.htaccess and intended to match a client request for "example.com/dir/bar" will have a pattern of just "^bar$", because again, the path to this .htaccess file will have been stripped from the requested URL-path.

Note that this does NOT apply to the requested URL-paths matched by RewriteConds examining either %{REQUEST_URI} or %{THE_REQUEST}. Those will always contain the full URL-path, starting with a slash.

The same is true in server config files such as httpd.conf, when the code is located inside a <Directory> container -- The URL-path seen by RewriteRules will not contain that part of the path specified in the <Directory> container.

For mod_rewrite code located outside any <Directory> containers, the full URL-path will always be seen by the RewriteRule.

Another difference is that code located in .htaccess is executed during the fix-up phase of the Apache API. Because of this, the mod_rewrite code behaves recursively. That is, if any rule is invoked, then processing of all RewriteRules is re-started. For this reason, it is necessary to explicitly prevent rewrite loops in .htaccess code -- by excluding the target address of the rewrite from the rule that rewrites it, so that the request won't get rewritten over and over again, leading to a 500 Server Error when Apache detects this looping.

So, the bottom line is that code for use in config files should actually be simpler than code for use in .htaccess files, and all that is usually required is to tweak the RewriteRule patterns slightly -- usually just by adding a leading slash. Trivial example:

example.com/.htaccess:

RewriteCond $1 !\.php
RewriteRule ^(.+)$ /$1.php [L]

example.com's httpd.conf file:

RewriteRule ^/(.+)$ /$1.php [L]
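The difference is easy to see if the two patterns are tried against what each context actually receives. A small Python illustration (only the regex behaviour is simulated here, not Apache itself; the path "somepage" is made up):

```python
import re

# In .htaccess, mod_rewrite strips the per-directory prefix first, so
# the rule's pattern sees "somepage", not "/somepage". In httpd.conf
# (outside any <Directory> container) the pattern sees the full
# URL-path, leading slash included.
htaccess_path = "somepage"    # prefix already stripped
conf_path = "/somepage"       # full URL-path

htaccess_rule = re.match(r"^(.+)$", htaccess_path)
conf_rule = re.match(r"^/(.+)$", conf_path)

# Both capture the same back-reference, so the substitution "/$1.php"
# produces "/somepage.php" in either context.
print(htaccess_rule.group(1))   # somepage
print(conf_rule.group(1))       # somepage
```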

And our three most often repeated bits of advice:

Taking into account all mod_rewrite code in all config and .htaccess files, put all external redirects first, in order from most-specific patterns and conditions (one or only a few requested URLs affected) to least-specific patterns and conditions (more or most URLs affected), followed by all internal rewrites, again in order from most- to least-specific. Access-control rules (if any) should precede the redirects where possible, because there is no use wasting server resources redirecting unwelcome visitors.

-and-

Always end each RewriteRule with an [L] flag, unless you have a very good reason not to. Exceptions are very rare.

-and-

Don't forget to delete your browser cache before testing any new server-side code -- config code, .htaccess code, or scripts. Otherwise, your browser may show you stale, previously-cached pages, objects, and server responses.

Jim

LionMedia

8:19 pm on Jul 26, 2010 (gmt 0)

10+ Year Member



I don't really want to try to change core files in my scripts so I'm left with figuring out the best way to handle the + in .htaccess.

So looking around, this is what I come up with as a start:
RewriteRule ^([^-/]+(-[^-/]+)*)-county/([^-/]+(-[^-/]+)*)-md/([^-/]+(-[^-/]+)*)-homes-for-sale-page([0-9]+)\.html$ index.php?action=searchresults&PageID=searchresultssef&County=$1&city=$3&Type=$5&ForSale=Y&cur_page=$7 [N]

RewriteRule ^([^+]*)\+([^+]*)\+([^+]*)\+([^+]*)\.html$ index.php?action=searchresults&PageID=searchresultssef&County=$1&city=$3&Type=$5&ForSale=Y&cur_page=$7 [L]

jdMorgan

8:56 pm on Jul 26, 2010 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



After getting into this, it's quite likely that you will reconsider.

For one thing, this code will not change the published URL unless you modify the script to put corrected URLs in the links on your pages, and for another thing, you will have to handle the plus-to-space conversion first, keeping the requested URL in a server variable, and then rewrite it in a following rule, using that server variable as the source of the URL.

If the published URLs are not corrected, the effect will be one of two bad things, depending on your .htaccess implementation:

Either all requests for the plus-sign URLs will get redirected, signaling a "low-quality" site to search engines, because they don't like it when a site links inconsistently to itself and requires redirects to work, and also slowing the user experience, because the browser will have to follow the redirect and make a second HTTP request to your server (which also destroys the validity and usefulness of your logs and stats).

Or, you can rewrite either space- or plus-padded URLs to your script. But that creates a duplicate-content problem, because the same content will be returned for multiple variants of the 'correct' URL. The multiple URLs will 'compete' with each other for attention and ranking in the search results. This is like paying a competitor to open up shop right next door...

Either way, it is not pretty, and I urge you to fix the problem at the source.

Otherwise, you're also going to encounter a very nasty Apache mod_rewrite bug -- and one that's complicated enough that it still has not been fixed after many years, and the effect of which is to make it almost impossible to use two rules to modify a requested URL (without using the server-variable trick).

However, for more information on this technique, see the Apache forum library post here: [webmasterworld.com...]

Jim