Removing every visible query string

including blank ones

         

Namjies

6:44 am on May 3, 2010 (gmt 0)

10+ Year Member



Hello,
I want to remove every query string from my website, since I use rewrite rules to serve clean URLs instead:

RewriteRule ^(.+)/(.+)$ index.php?p=$2&l=$1 [L]


So I tried a few things like this to cause a notfound response to appear:

RewriteCond %{QUERY_STRING} .
RewriteRule ^$ /oddquery? [R=301,L]


But whatever rule I create, I can't use index.php in the redirect rule's pattern: it results in a loop. Other patterns like ^$ or ^search.php$ are fine, though. I'm not sure why. /oddquery shouldn't be hitting the index.php RewriteRule used for clean URLs. I also tried redirecting to oddquery.php? but that creates a loop too whenever index.php is in the rules.

Is there a way to redirect, or return a 404/410 for, every page requested with a query string?

Thank you.

g1smd

7:06 am on May 3, 2010 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Use a preceding RewriteCond to test THE_REQUEST. This way the redirect only fires for direct client requests, not as the result of a previous internal rewrite.

There are hundreds of threads here showing these types of redirect/rewrite pairings, several in only the last few days.

Check those for example code.

Namjies

8:01 am on May 3, 2010 (gmt 0)

10+ Year Member



I searched for THE_REQUEST and QUERY_STRING and found this suggestion to remove query strings from files:

RewriteCond %{THE_REQUEST} ^GET\ /.*\;.*\ HTTP/
RewriteCond %{QUERY_STRING} !^$
RewriteRule .* [askapache.com%{REQUEST_URI}?...] [R=301,L]

But it doesn't redirect anything.

I tried a condition whose syntax I'd understand more easily and that would match an HTTP request:
RewriteCond %{THE_REQUEST} ^GET\ index\.php(.*)$
with the wildcard for the appended query, HTTP, or whatever follows. It doesn't work. I tried several variants with the full request instead of the wildcard, without success.

There must be something I'm not understanding right about this request value.

I also tried my luck with RewriteCond %{THE_REQUEST} ^(.*)\.(.*)$
wildcard, escaped dot, wildcard for any request with a file extension. Redirects to /oddquery but then shows loop error on that page. I really wonder how there can be a loop on /oddquery. It shouldn't trigger any rewrite rule.

Then I tried dot, escaped dot, dot. Worked perfectly. Then it suddenly stopped and won't start again.

RewriteCond %{THE_REQUEST} doesn't seem to work with wildcard?wildcard for redirecting anything with ? in it.

Sorry for my trouble understanding what I'm reading at 4 a.m.
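
A quick way to see what these conditions do is to run the patterns against the literal request line. The sketch below uses Python's `re` module as a stand-in for Apache's regex engine (an assumption; the two agree on patterns this simple), and the sample request lines are made up for illustration. It suggests two likely causes: THE_REQUEST contains a slash before index.php, and the "HTTP/1.1" at the end contains a dot, so a pattern looking for "anything, dot, anything" matches every request, including the one for /oddquery.

```python
import re

# Hypothetical request lines, as they would appear in THE_REQUEST.
with_query = "GET /index.php?p=1 HTTP/1.1"
redirected = "GET /oddquery HTTP/1.1"

# The attempt without a leading slash never matches: the path starts with "/".
no_slash = re.compile(r"^GET\ index\.php(.*)$")
with_slash = re.compile(r"^GET\ /index\.php(.*)$")
print(no_slash.search(with_query) is not None)    # False
print(with_slash.search(with_query) is not None)  # True

# "wildcard, escaped dot, wildcard" also matches the dot in "HTTP/1.1",
# so the redirect target itself satisfies the condition again.
any_dot = re.compile(r"^(.*)\.(.*)$")
print(any_dot.search(redirected) is not None)     # True
```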

g1smd

8:39 am on May 3, 2010 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Any time you feel like dumping (.*) into a rule, it is probably not the right pattern. It matches "something, anything, or nothing". If you have two (.*) elements in the same pattern, it is definitely incorrect.

This is a trivial single line of code once you break it down. You want to match the "GET /something?something HTTP/1.x" line coming from the browser request, where specifically there is something present after the question mark.

Break the RewriteCond pattern down to match:
- the GET or other request: "[A-Z]{3,9}"
- followed by a literal space: "\ "
- followed by a literal slash (the bare minimum path): "/"
- followed by an *optional* path part (not a question mark): "([^\?]+)?"
- followed by a literal question mark: "\?"
- followed by the non-optional query string characters (not a space): "([^\ ]+)"
- followed by a literal space: "\ "
- followed by the protocol: "HTTP/"
within the THE_REQUEST server variable.
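
The pieces above can be joined and tried out before going anywhere near the server. This is only a sketch: Python's `re` stands in for Apache's regex engine, the leading `^` anchor is an addition not listed in the breakdown, and the helper name `has_query` and the sample request lines are hypothetical.

```python
import re

# The pieces from the breakdown, in order.
pieces = [
    r"[A-Z]{3,9}",  # request method such as GET or POST
    r"\ ",          # literal space
    r"/",           # literal slash (the bare minimum path)
    r"([^\?]+)?",   # optional path part, anything but a question mark
    r"\?",          # literal question mark
    r"([^\ ]+)",    # one or more query string characters (not a space)
    r"\ ",          # literal space
    r"HTTP/",       # protocol
]

# Assumption: anchor at the start of the request line with "^".
pattern = re.compile("^" + "".join(pieces))

def has_query(request_line):
    """Return True when the request line carries a non-empty query string."""
    return pattern.search(request_line) is not None

for line in ("GET /index.php?p=1 HTTP/1.1",
             "GET /?query HTTP/1.1",
             "GET /en/about HTTP/1.1",
             "GET /page? HTTP/1.1"):
    print(line, "->", has_query(line))
```

Only the first two sample lines match. The last one has a question mark but nothing after it, and the query part of this pattern requires at least one character.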

[edited by: g1smd at 9:33 am (utc) on May 3, 2010]

Namjies

9:30 am on May 3, 2010 (gmt 0)

10+ Year Member



RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /([^\?]+)?\?([^\ ]+)\ HTTP/


works fine. I guess I didn't know how to build these things up.

It doesn't work for /?query, though.

For a URL like /?query, which works as index.php?query, I tried changing it to
RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /(.*)\?([^\ ]+)\ HTTP/


thinking (.*) would also match nothing before the ?, but no. It still works if there's something before the ?

RewriteCond %{THE_REQUEST} ^(.*)\?(.*)

seems to work just fine, but still nothing if the request is directly /?query

I thought using several (.*) would find whatever sits between them, like the /?, and match anything not written out explicitly to those wildcards. For example,
(.*)1(.*)2(.*)3(.*) > searched in the string > string12string3string would return value, empty, value, value

I tried adding an [OR] condition (not with RewriteCond %{THE_REQUEST} ^(.*)\?(.*) ) and placed this one first:
RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /\?([^\ ]+)\ HTTP/ [or]

with /\? directly as the start of the query. Not working. The others with files still do.

I can't seem to find the problem: a query directly on the root path is not being matched.

g1smd

9:37 am on May 3, 2010 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



If it doesn't match example.com/?query, that will be due to the pattern in your RewriteRule, not the one in the RewriteCond.

All your experiments with (.*) are going backwards. Always avoid (.*) if there is a better pattern.

The first (.*) matches the whole request string and then it has to "back off and retry" hundreds of times to find a match. It then does the same again for the next (.*) and so on. Code with (.*) runs hundreds to thousands of times slower than code using specific patterns like those I showed.
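
The first point, that the condition is not what blocks /?query, can be checked with a short sketch. Python's `re` is again only a stand-in for Apache's engine, and the three rule patterns tried here are hypothetical candidates, since the thread does not show which RewriteRule pattern was in use at this stage. The relevant detail is that in .htaccess the RewriteRule pattern is tested against the path with its leading slash removed, so a request for the site root presents an empty string.

```python
import re

# The condition pattern posted earlier in the thread.
cond = re.compile(r"^[A-Z]{3,9}\ /([^\?]+)?\?([^\ ]+)\ HTTP/")
print(cond.search("GET /?query HTTP/1.1") is not None)  # True

# What a per-directory RewriteRule sees for a request to "/".
rule_input = ""

# Hypothetical rule patterns: only those that accept an empty path can fire.
candidates = (r"^(.+)$", r"^(.*)$", r"^")
results = {p: re.search(p, rule_input) is not None for p in candidates}
for p, matched in results.items():
    print(p, "->", matched)
```

A pattern that demands at least one character, such as ^(.+)$, never fires for the root URL, while ^(.*)$ or a bare ^ does.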

Namjies

10:06 am on May 3, 2010 (gmt 0)

10+ Year Member



ah, good to know.

The only rule I've been using it in before is
RewriteCond %{http_host} ^example\.com [NC]
RewriteRule ^(.*)$ http://www.example.com/$1 [R=301,L]
which is fine then.

I've read some tutorials on redirects, which are better suited to older websites that used ?query URLs and want to switch to clean URLs. But I've had clean URLs from the start, so I prefer to hide all query URLs from potential visitors. Misconfigured queries with one or two missing variables can also mess things up a lot.

Now I've used RewriteRule ^(.*)$ for the new rule and it works too: no more query strings possible.

All perfect.

Big thanks to you.

[edited by: Namjies at 10:19 am (utc) on May 3, 2010]

g1smd

10:19 am on May 3, 2010 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Let's see the final code.

One question. Do you want to preserve the path part of the original request, or not?

Namjies

10:26 am on May 3, 2010 (gmt 0)

10+ Year Member



Final code looks so:

Options +FollowSymlinks
RewriteEngine on

RewriteCond %{http_host} ^example\.com [NC]
RewriteRule ^(.*)$ http://www.example.com/$1 [R=301,L]

RewriteRule ^(.+)/(.+)$ index.php?p=$2&l=$1 [L]

RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /([^\?]+)?\?([^\ ]+)\ HTTP/
RewriteCond %{QUERY_STRING} .
RewriteRule ^(.*)$ /oddquery? [R=301,L]


Preserve the original path for what? I only want to block URLs with queries, and I wouldn't know what to do with the path. I might switch to a warning page instead of a 404 not-found page, but that's it.

P.S.
I suppose you'll hate me for the double (.+) ...

Namjies

10:42 am on May 3, 2010 (gmt 0)

10+ Year Member



Changed (.+) to ([^/]+); it took me some time to figure out. That would be the best, I suppose? It looks for anything that is not a slash.

g1smd

10:44 am on May 3, 2010 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Your non-www to www redirect code fixes non-www requests, but it doesn't handle www requests with an appended port number. I would test for
!^(www\.example\.com)?$
which redirects if HTTP_HOST (note capitalisation) is not *exactly* "www.example.com".
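
A small sketch of how that negated test behaves, with Python's `re` standing in for Apache's engine. The function name `needs_redirect` and the sample host values are made up for illustration. The leading "!" in the RewriteCond means "redirect when the pattern does NOT match".

```python
import re

# Pattern from the suggested RewriteCond, without the leading "!".
canonical = re.compile(r"^(www\.example\.com)?$")

def needs_redirect(host):
    """Mimic the negated condition: True when the host is not canonical."""
    return canonical.search(host) is None

for host in ("www.example.com", "example.com", "www.example.com:80", ""):
    print(repr(host), "->", needs_redirect(host))
```

The empty host is deliberately left alone: the optional group lets a blank Host value (as an HTTP/1.0 client may send) pass through instead of being redirected.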

The (.+) patterns in the rewrite are as inefficient as (.*) is. I'd use
^([^/]+)/([^/.]+)$
if those are 'extensionless' URLs (translates to: 'not a slash one or more times, followed by slash, followed by not a period or slash one or more times'). This also avoids using any disk-intensive -f and -d "exists" checks.
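
To see which paths that pattern accepts, here is a rough model of the rewrite in Python (a stand-in for Apache's engine; the function name `rewrite` and the sample paths are hypothetical). Paths are given without a leading slash, which is how a RewriteRule in .htaccess sees them.

```python
import re

# Not a slash one or more times, a slash, then not a slash or period
# one or more times, anchored at both ends.
rule = re.compile(r"^([^/]+)/([^/.]+)$")

def rewrite(path):
    """Return the internal target for a matching path, else None."""
    m = rule.search(path)
    if m is None:
        return None
    return "index.php?p={}&l={}".format(m.group(2), m.group(1))

for path in ("en/about", "en/about.html", "a/b/c", "index.php"):
    print(path, "->", rewrite(path))
```

Only "en/about" is rewritten (to index.php?p=about&l=en). Anything with an extension or a second slash falls through, and so does index.php itself.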

Your "query" redirect must be listed before the non-www to www redirect, otherwise non-www requests with a query will be redirected twice. The "query" redirect should fix the domain name in the target of the redirect, but I'm not quite sure what you are trying to achieve here.

Should example.com/somepath?junk be redirected to just example.com/somepath (so that content can be served) or should it return an error message instead?

If you merely want to strip the query string, use a redirect like
RewriteRule (.*) http://www.example.com/$1? [R=301,L]
here. You'll still pair this with the RewriteCond detecting that QUERY_STRING is present.

If you want to simply return a 404 error, use a rewrite (not a redirect) like
RewriteRule (.*) /does-not-exist [L]
for the rule part. Make sure this rule is now the very last in your list as it is now a rewrite, not a redirect (and again, it will be paired with the existing RewriteCond). Make sure you also add the new line:
RewriteCond %{QUERY_STRING} !.
to the main content delivery rewrite (so that it only rewrites when there is no query string present).

Sanity check: the redirects must be listed before the rewrites. The redirects must be listed in order of most specific to least specific. The non-www to www redirect must be the last redirect. The rewrites must be listed in order of most specific to least specific.

Namjies

11:03 am on May 3, 2010 (gmt 0)

10+ Year Member




Options +FollowSymlinks
RewriteEngine on

# Just my way to say "don't try to mess" and to avoid a valid query to return a 200 OK status. I could use redirects to the correct format, but haven't been using the query form since website start, so instead of placing many redirect rules, prefer to return a warning for any queries.
RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /([^\?]+)?\?([^\ ]+)\ HTTP/
RewriteCond %{QUERY_STRING} .
RewriteRule ^(.*)$ http://www.example.com/noqueryplz? [R=301,L]

# I don't care about the port on this one. Only the page serving is important
RewriteCond %{HTTP_HOST} ^example\.com [NC]
RewriteRule ^(.*)$ http://www.example.com/$1 [R=301,L]

# I added ([^/.]+) with the dot, but my script already returns a 404 error header and load a special notfound page content with a link to homepage if it cannot serve a page correctly with the variables provided.
RewriteRule ^([^/]+)/([^/.]+)$ index.php?p=$2&l=$1 [L]


I think that's fine in this order.

Edit:
Yeah, for the rewrites, I know about ordering them carefully

[edited by: Namjies at 11:14 am (utc) on May 3, 2010]

g1smd

11:11 am on May 3, 2010 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



The
RewriteCond %{QUERY_STRING} .
line is redundant. The previous RewriteCond is only true if there is a query string present.

If I link to your site as
www.example.com:80/somepage
your site returns 200 OK for that URL, and Google would index it as Duplicate Content. You do need to redirect all non-canonical requests.

Namjies

11:20 am on May 3, 2010 (gmt 0)

10+ Year Member



that's right, RewriteCond %{QUERY_STRING} . is gone now.

www.example.com:80/somepage
It's redirecting to www.example.com
My PHP script looks at hostname. If it's not right, it redirects.
My PHP script looks at the 1st variable. If it's not a valid category, it returns 404 with custom content.
My PHP script tests the 2nd variable. If it can't find associated content, it returns 404 with custom content.

God bless PHP

But thanks a lot. The optimization wasn't necessary, but I appreciate it.

g1smd

11:38 am on May 3, 2010 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Always check there's no way for a double redirect to occur for any requests, or for the alternative URL to directly serve content as a Duplicate.

As you've discovered, rewriting isn't just about getting your rewrites to work with valid requests, it is also about the correct handling of non-valid requests. :)

Personally, I would fix the port thing at the .htaccess level, meaning the PHP script would never be invoked for those requests. It's a few milliseconds more efficient, but no major issue either way.

jdMorgan

1:16 pm on May 3, 2010 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



A couple of efficiency and completeness tweaks:

Options +FollowSymlinks
RewriteEngine on
#
# Externally redirect requests with any query string appended to "noquery" page
RewriteCond %{THE_REQUEST} ^[A-Z]+\ /[^?\ ]*\?[^\ ]*\ HTTP/
RewriteRule ^ http://www.example.com/noqueryplz? [R=301,L]
#
# Externally redirect requests for any non-blank non-canonical hostname to canonical hostname
RewriteCond %{HTTP_HOST} !^(www\.example\.com)?$
RewriteRule ^(.*)$ http://www.example.com/$1 [R=301,L]
#
# Internally rewrite extensionless single-directory-level-URL requests to script
RewriteRule ^([^/]+)/([^/.]+)$ index.php?p=$2&l=$1 [L]

Jim
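
One visible difference between this condition and the earlier one is the `*` after the question mark, which appears to be what covers blank query strings (a bare trailing "?"). A side-by-side sketch, using Python's `re` as a stand-in for Apache's engine; the names `earlier` and `revised` and the sample request lines are made up for illustration.

```python
import re

# Condition from earlier in the thread: query part needs one or more characters.
earlier = re.compile(r"^[A-Z]{3,9}\ /([^\?]+)?\?([^\ ]+)\ HTTP/")
# Condition from the tweaked version: query part may be empty.
revised = re.compile(r"^[A-Z]+\ /[^?\ ]*\?[^\ ]*\ HTTP/")

samples = (
    "GET /page?x=1 HTTP/1.1",  # ordinary query string
    "GET /page? HTTP/1.1",     # blank query string
    "GET /page HTTP/1.1",      # no query string at all
)
for line in samples:
    print(line, "| earlier:", earlier.search(line) is not None,
          "| revised:", revised.search(line) is not None)
```

Both patterns catch the ordinary query and ignore the clean URL; only the revised one also catches the request that ends in a bare question mark.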

Namjies

3:52 am on May 4, 2010 (gmt 0)

10+ Year Member



Awesome. Looks more than perfect. Could remove some PHP lines.

g1smd

10:44 pm on May 4, 2010 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Thanks for the additional clarifications Jim! :)