Trying to use RewriteCond to replace certain characters

I cannot figure out how to replace a string of characters

         

ForumKid1

4:51 pm on Aug 20, 2007 (gmt 0)

10+ Year Member



http://www.example.com/MyPage.jsp%3Fseq%3D230288

What I want is for Apache to convert %3F to ? and %3D to =

Somehow URLs are arriving at the site with these %-encoded characters, which results in an HTTP 404. My ultimate goal is to convert them to the corresponding question mark and equals sign. If that cannot be done, is there a way to at least test for the %3D and %3F characters in the URL, so that I can redirect to the home page?

[edited by: jdMorgan at 4:58 pm (utc) on Aug. 20, 2007]
[edit reason] example.com [/edit]

jdMorgan

4:59 pm on Aug 20, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Welcome to WebmasterWorld, ForumKid1!

Please post the code you've tried so far, as a basis for discussion.

Thanks,
Jim

ForumKid1

5:07 pm on Aug 20, 2007 (gmt 0)

10+ Year Member



I've literally tried a slew of options. These are my latest attempts.

#This would redirect to the home page
RewriteRule MyPage.jsp%$ http://www.example.com [R=301,L]

#This would replace the character
RewriteRule ^(.*)%3F(.*)$ $1?$2 [N,L]

[edited by: jdMorgan at 5:42 pm (utc) on Aug. 20, 2007]
[edit reason] example.com [/edit]

jdMorgan

5:41 pm on Aug 20, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



OK, well that won't work, because query strings are not included in the URL-path as seen by RewriteRule, and the "?" that normally delimits the query string is encoded in your case, so I doubt Apache can even parse that request.

You normally have to use a RewriteCond testing %{QUERY_STRING} to manipulate normal (unencoded) query strings. But that won't work here, because there's no "?" to tell Apache how to split the request into URL-path and query string. So, the solution is to go back to the source -- the original HTTP request line as received from the client.

Try something like this:


RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /MyPage\.jsp\%3[fF]seq\%3[dD]([^\ ]+)\ HTTP/
RewriteRule ^MyPage\.jsp$ http://www.example.com/MyPage.jsp?seq=%1 [R=301,L]

That is a simple problem-specific solution, and assumes that only the value for "seq" varies, and that the URL-path and the "seq" name are constants. It also assumes that the first "%3d" and "%3f" are the only "incorrect" character sequences in the query string.
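For contrast, here is a minimal sketch of the "normal" case mentioned above, where the "?" arrives unencoded and %{QUERY_STRING} can be tested directly. The page name OldPage.jsp is hypothetical and used only for illustration; the pattern style assumes .htaccess context, like the rule above.

```apache
# Hypothetical example: OldPage.jsp is an assumed name, not from this thread.
# The query string is tested in RewriteCond, never in the RewriteRule pattern.
RewriteCond %{QUERY_STRING} ^seq=([0-9]+)$
# %1 refers to the group captured by the RewriteCond above.
RewriteRule ^OldPage\.jsp$ http://www.example.com/MyPage.jsp?seq=%1 [R=301,L]
```

Because the substitution carries its own query string, the original one is replaced rather than appended.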

Jim

ForumKid1

5:44 pm on Aug 20, 2007 (gmt 0)

10+ Year Member



Thanks. I will give this a shot.

But just FYI,

MyPage.jsp%3D results in a 404. So Apache doesn't see it as a query string; it sees it as part of the actual page name. If it saw it as a query string, I would be able to convert the % values to the corresponding characters using Java.

ForumKid1

6:05 pm on Aug 20, 2007 (gmt 0)

10+ Year Member



OK, actually the code I came up with works, but it only works for example.com. It doesn't work with www.example.com. That was totally throwing me off.

Here is my VirtualHost entry. Can you see anything I did wrong that would make it work only for example.com and not for www.example.com?

<VirtualHost *>
ServerName www.example.com
DocumentRoot "/home/myapp"
ServerAlias example.com
DirectoryIndex index.jsp

#Turn rewrite engine on
RewriteEngine On


#example.com goes to www.example.com
RewriteCond %{HTTP_HOST} ^example.com$ [NC]
RewriteRule ^(.*)$ http://www.example.com$1 [R=301,L]

#Redirect http://www.example.com to http://www.example.com/myapp/index.jsp
RewriteCond %{REQUEST_URI} ^/$
RewriteRule ^(.*) http://www.example.com/myapp/index.jsp [R=301,L]

#Convert % values to the correct ? and = values
RewriteCond %{HTTP_HOST} ^example.com$ [NC]
RewriteRule ^(.*)%3F(.*)$ $1?$2 [N,L]
RewriteRule ^(.*)%3D(.*)$ $1=$2 [N,L]
</VirtualHost>

[edited by: jdMorgan at 6:30 pm (utc) on Aug. 20, 2007]
[edit reason] example.com [/edit]

jdMorgan

6:16 pm on Aug 20, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Your first two redirect rules are in the reverse of the order they should be in. Put the most specific redirect first, then the least specific. That is, the first rule should redirect the home page "/", followed by the generic domain redirect.

Then follow your external redirects with your internal rewrites.

I also strongly suggest using an external redirect to 'correct' the %3D urls, and if you do that, then that rule should be first. If you don't use an external redirect, then search engines will pick up and index the incorrect URLs, and you'll be dealing with this problem for a long long time.

I don't know if you even tested the code I posted, but I strongly recommend that you use that method for best portability across server versions...

Jim

ForumKid1

6:40 pm on Aug 20, 2007 (gmt 0)

10+ Year Member



Thanks. I wasn't aware of the differences between [N] and [R=301].

Even after moving my redirects around, and using either my code or yours, www.example.com still doesn't work. It still only works for example.com. I've cleared my cache, rebooted the server, etc. For some weird reason, Apache doesn't like something about that code.

ForumKid1

6:56 pm on Aug 20, 2007 (gmt 0)

10+ Year Member



OK, just some more information. This is really bizarre. It seems Apache will automatically do those conversions. I took out all the code that does the %3D, etc. conversion, rebooted the server, and cleared my cache. If I go to example.com, it automatically converts. If I go to www.example.com, it doesn't. Is there a reason why Apache wouldn't be accepting those redirect entries (both your code and mine)?

jdMorgan

8:59 pm on Aug 20, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



You'd have to ask your hosting company about that... It sounds like they have a redirect in place, and that would violate my "most specific rules first" recommendation. Or it may be that index.jsp is "aliased" and that mod_alias directives are being applied before mod_rewrite directives, essentially bypassing them.

Try putting a test redirect in place, like:

RewriteRule ^/foo\.html$ [google.com...] [R=301,L]

and test that with both domains.

This is the order of rules that I'd recommend:

<VirtualHost *>
ServerName www.example.com
DocumentRoot "/home/myapp"
ServerAlias example.com
DirectoryIndex index.jsp

#Turn rewrite engine on
RewriteEngine On

# Redirect to remove hex-encoded query string characters
RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /MyPage\.jsp\%3[fF]seq\%3[dD]([^\ ]+)\ HTTP/
RewriteRule ^/MyPage\.jsp$ http://www.example.com/MyPage.jsp?seq=%1 [R=301,L]

#Redirect http://www.example.com/ to http://www.example.com/myapp/index.jsp
RewriteRule ^/$ http://www.example.com/myapp/index.jsp [R=301,L]

#example.com goes to www.example.com
RewriteCond %{HTTP_HOST} ^example\.com [NC]
RewriteRule (.*) http://www.example.com$1 [R=301,L]

</VirtualHost>

Note several corrections and optimizations. One of them was to my own code (I added a leading slash to the rule pattern), because I forgot this was for httpd.conf and not for .htaccess.

Now about the [N] flag... Do you have a case with more than one %3d in the query string?
If so, you may want to use [N], which basically restarts processing from the top of the rule set. But it's slow and inefficient, and can also call a specific Apache mod_rewrite bug into play.

If you do have more than one %3d in the URL, we can discuss that in detail. It will require that you use [N] while internally rewriting all of the %-encoded characters, and then do an external redirect after all have been fixed. To do that, you'll also need to use the [E=envar] flag to "remember" that you've corrected at least one %-encoded character, so as to do an external redirect after the characters have all been fixed (and not before, and only if needed).

Jim

ForumKid1

12:03 am on Aug 21, 2007 (gmt 0)

10+ Year Member



Our ISP only hosts our gateway. We control and own the server, etc. I haven't done any aliasing.

There is only one %3D and %3f in the querystring.

If I put the following test code in, it works perfectly for both domains. Just FYI, my root is /myapp, so I had to add that to your test code.

RewriteRule ^/myapp/foo\.html$ [google.com...] [R=301,L]

Now, I guess, one more question. MyPage is located one directory in from the root, so it's located at example.com/myapp/news/MyPage.jsp. I'm not sure if that matters. I have tried your new VirtualHost block and still just one domain forwards. I have changed the redirect to the following, and still just one domain forwards:

#Redirect to remove hex-encoded query string characters
RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /myapp/news/MyPage\.jsp\%3[fF]seq\%3[dD]([^\ ]+)\ HTTP/
RewriteRule ^/myapp/news/MyPage\.jsp$ http://www.example.com/myapp/news/MyPage.jsp?seq=%1 [R=301,L]

And I just want to thank you very much for all the fast replies. I sincerely appreciate your help. I will keep banging on this, but if you have any additional ideas, I would greatly appreciate them.

[edited by: jdMorgan at 1:05 am (utc) on Aug. 21, 2007]
[edit reason] example.com [/edit]

jdMorgan

12:50 am on Aug 21, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



> Our ISP only hosts our gateway.

What, precisely, do you mean by "gateway?"

You're looking for something that can get control and interfere with your rewrite. If this "gateway" is anything other than a router, switch, or completely-transparent proxy, then it's a candidate for investigation.

To view this problem from a completely-different angle, and perhaps to thereby glean additional information, try testing your redirects using the "Live HTTP Headers" extension for Firefox/Mozilla browsers. Carefully watch for any kind of unexpected redirects or changes in the malformed query string that take place before you expect them to, for example, at this "gateway" or in other server config files.

Also, please define, in detail, what you mean by "still just one domain forwards." Not to be pedantic, but we cannot see over your shoulder here...

For both cases -- "works" and "does not work":

What complete url did you request?
What was the result?
How does that result differ from your expectations?

I suppose you could also define and enable RewriteLog if this continues to be problematic...
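As a rough sketch, enabling the rewrite log on Apache 2.0/2.2 takes two directives in httpd.conf (server or VirtualHost context; they are not allowed in .htaccess). The log path below is an assumption and should be adjusted for the system. On Apache 2.4 and later these directives no longer exist, and the equivalent is a LogLevel setting such as "LogLevel alert rewrite:trace3".

```apache
# Assumed path for illustration; use any location the server can write to.
RewriteLog "/var/log/httpd/rewrite.log"
# 0 disables logging; higher levels (up to 9) are more verbose and slow the server down.
RewriteLogLevel 3
```

The log shows the exact string each pattern is applied to, which is useful for confirming what the URL-path looks like by the time a rule sees it. Set the level back to 0 once debugging is done.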

And one further supposition... If there is some 'agent' in the way that is interfering, a prime candidate for the kind of interference that would break the rule is that perhaps the interferer is double-encoding the hex-encoded characters. In which case, modifying the rule like this would fix it:


RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /myapp/news/MyPage\.jsp\%(25)*3[fF]seq\%(25)*3[dD]([^\ ]+)\ HTTP/
RewriteRule ^/myapp/news/MyPage\.jsp$ http://www.example.com/myapp/news/MyPage.jsp?seq=%3 [R=301,L]

That should handle any case where the hex-encoded string is re-encoded zero or more times. The premise is that if the URL is rewritten or redirected or otherwise 'handled' by another agent in the network, that handler may re-encode %3f as %253f -- That is, it encodes the "%" to "25" and sticks another "%" on the front to tag it as encoded... if it passes through yet another handler, then that string will get re-encoded as %25253f, ad infinitum...

I guess this explains why the last programmer who tried to put percent-encoded characters in my URLs left with a black eye... :)

Jim

ForumKid1

12:11 pm on Aug 21, 2007 (gmt 0)

10+ Year Member



I will try your new code.

> What, precisely, do you mean by "gateway?"

I mean a physical router. Our equipment is in a datacenter, and they host the router. There are no rules on it; all they do is pass the traffic to our network.

> Also, please define, in detail, what you mean by "still just one domain forwards." Not to be pedantic, but we cannot see over your shoulder here...

If I go to http://example.com/myapp/news/MyPage.jsp%3Fseq it correctly adds the www and converts the %3F. If I go to http://www.example.com/myapp/news/MyPage.jsp%3Fseq, I get a 404, so it's not converting the %3F. What is so damn confusing is that if I do NOT add the www, it automatically adds the www and converts the % characters. If I add the www myself, I get a 404.

For both cases -- "works" and "does not work":

[edited by: jdMorgan at 12:24 pm (utc) on Aug. 21, 2007]
[edit reason] example.com [/edit]

ForumKid1

3:37 pm on Aug 21, 2007 (gmt 0)

10+ Year Member



So I tried that new code and I still get the same results. If I go to [domain.com...] it redirects to [domain.com...]

If I go to [domain.com...] I get a 404.

I'm not sure what to do at this point. It makes no sense.

jdMorgan

6:26 pm on Aug 21, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



See the other suggestions I posted above...

Jim

g1smd

9:16 pm on Aug 21, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



You need the Live HTTP Headers extension to see if there are any intermediate steps, and exactly what they are.

ForumKid1

5:33 pm on Aug 22, 2007 (gmt 0)

10+ Year Member



Thanks. I do appreciate all your help. But for me, it's just not worth the effort to continue trying to fix this.

g1smd

6:49 pm on Aug 22, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I can't understand why not.

I strive for 100% on every site.

ForumKid1

3:07 pm on Aug 23, 2007 (gmt 0)

10+ Year Member



I totally understand. But you have to understand that time is money. We see a few URLs like this on a daily basis. I want it fixed, but I'm not willing to spend hours and hours to get it done. Google isn't hurting our SEO results over these. The crawler simply states the obvious, yet the site still remains #4 for the keywords we are interested in. We do drop off here and there, but I can't see that being because of these % URLs.

I'm better off spending more time on other things that are going to bring in $$. If I had the staff, then yes, I'd strive for 100%. But right now this company is better off by me building additional models for revenue.
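
One possible explanation for the symptoms described in this thread, offered as an untested sketch rather than a confirmed diagnosis: in VirtualHost context, mod_rewrite applies the RewriteRule pattern to the %-decoded URL-path. A request for /myapp/news/MyPage.jsp%3Fseq%3D230288 would therefore reach the rule as the path /myapp/news/MyPage.jsp?seq=230288, which a pattern ending in MyPage\.jsp$ cannot match. That would also account for the bare example.com requests appearing to fix themselves: the generic host redirect copies the decoded path into the new URL, where the "?" then starts a real query string. Under that assumption, a rule that matches the decoded path might look like this:

```apache
# Assumes the pattern sees the decoded path, so "%3F" and "%3D" appear here as "?" and "=".
# $1 (not %1) is used because the capture is in the RewriteRule pattern itself.
RewriteRule ^/myapp/news/MyPage\.jsp\?seq=(.+)$ http://www.example.com/myapp/news/MyPage.jsp?seq=$1 [R=301,L]
```

After the redirect, the "?" is a genuine query-string delimiter and no longer part of the URL-path, so the rule should not match a second time. A RewriteLog trace would confirm or rule out this theory by showing the exact path each pattern is tested against.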