Forum Library, Charter, Moderators: Ocean10000 & incrediBILL & phranque

Apache Web Server Forum

    
Removing querystrings from Wordpress
Using the .htaccess
morpheus83

10+ Year Member



 
Msg#: 4559170 posted 11:59 am on Mar 28, 2013 (gmt 0)

I need to strip the query strings ?page= and ?p= from all incoming links.
For example, www.example.com/page/2?page=13 should be redirected to www.example.com/page/2
or
www.example.com/furniture?p=2 should be redirected to www.example.com/furniture

Basically, the query should be stripped off and the user redirected with a 301 to the stripped URL.

This is my current .htaccess -
<IfModule mod_rewrite.c>
RewriteEngine On
RewriteBase /

Redirect /atom.xml http://example.com/feed/atom/

RewriteRule ^index\.php$ - [L]
RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
RewriteRule . /index.php [L]
RewriteCond %{QUERY_STRING} ^(.*&)?page=
RewriteRule ^(.*)$ $1?%1 [R=301]


</IfModule>

The highlighted code (the final RewriteCond and RewriteRule pair) does the redirection, but it only works in the domain's root directory.
www.example.com/?page=2 is redirected to www.example.com/
but www.example.com/page/2?page=2 is not redirected to www.example.com/page/2

I need to make the redirection work in all the directories. Also would it be possible to include a wildcard ahead of "?" which would strip all the queries and redirect them?

[edited by: engine at 12:58 pm (utc) on Mar 28, 2013]
[edit reason] please use example.com [/edit]

 

g1smd

WebmasterWorld Senior Member, Top Contributor of All Time, 10+ Year Member



 
Msg#: 4559170 posted 12:11 pm on Mar 28, 2013 (gmt 0)

Every RewriteRule needs the L flag.

RewriteRules invoking a redirect must be listed before RewriteRules that invoke a rewrite.

Redirects must include the protocol and canonical hostname in the redirect target.

You should test THE_REQUEST rather than QUERY_STRING in the RewriteCond; otherwise, in certain circumstances, some requests may lead to an infinite loop.

Never mix Redirect and RewriteRule in the same site. Convert all Redirect directives to use RewriteRule with [R=301,L] flags.

Dump the IfModule container tags. They are not needed.

Never use (.*) at the beginning or in the middle of a RegEx pattern. (.*) means "match the rest of the string to the end", and can therefore only be used at the end of a RegEx pattern. Use a more specific match here.

Is the page= parameter always the ONLY parameter?

If other parameters are requested at the same time as the page= parameter, should they be stripped or retained?

Append a question mark to the rule target to prevent re-attachment of the originally requested parameters.

Use example.com in this forum to prevent URL auto-linking.
Tick the 'disable smilies in this post' option to make the code readable.
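
Applied to the .htaccess from the first post, the checklist above might come out roughly like the sketch below. It is untested and rests on assumptions not settled in the thread: that http://www.example.com is the canonical protocol and hostname, that WordPress lives in the site root, that the whole query string is dropped whenever page= or p= is present, and that /wp-admin/ should be exempt because the WordPress admin screens use page= themselves.

```apache
RewriteEngine On
RewriteBase /

# 1. External redirects come first.
# Converted from "Redirect /atom.xml ..." so Redirect and RewriteRule are not mixed.
# (Hostname is an assumption; use the site's canonical one.)
RewriteRule ^atom\.xml$ http://www.example.com/feed/atom/ [R=301,L]

# Strip the query string when the client's original request carried a
# page= or p= parameter. THE_REQUEST looks like "GET /page/2?page=13 HTTP/1.1"
# and is not altered by internal rewrites.
RewriteCond %{REQUEST_URI} !^/wp-admin/
RewriteCond %{THE_REQUEST} ^[A-Z]+\s[^?\s]*\?(?:[^&\s]*&)*p(?:age)?=
# The trailing "?" prevents the original query string from being re-attached.
RewriteRule ^ http://www.example.com%{REQUEST_URI}? [R=301,L]

# 2. Internal rewrites come last (the standard WordPress rules).
RewriteRule ^index\.php$ - [L]
RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
RewriteRule . /index.php [L]
```

One thing to weigh before using anything like this: WordPress itself uses ?p= for shortlinks and draft previews, so a blanket strip of p= will interfere with those.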

morpheus83




 
Msg#: 4559170 posted 12:24 pm on Mar 28, 2013 (gmt 0)

Two queries need to be stripped -
page=parameter and p=parameter.
Other queries can be retained.

Pardon my ignorance, but I am not at all acquainted with .htaccess.

# BEGIN WordPress
<IfModule mod_rewrite.c>
RewriteEngine On
RewriteBase /

RewriteRule ^index\.php$ - [L]
RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
RewriteRule . /index.php [L]

</IfModule>
# END WordPress

This is the standard .htaccess that is generated by WordPress.

According to you what would be the most efficient way of stripping the query? A redirection or rewriting.

How can I add the new rules to the file while keeping the current rules intact?

g1smd




 
Msg#: 4559170 posted 1:10 pm on Mar 28, 2013 (gmt 0)

See the list of changes in the post above.


"According to you what would be the most efficient way of stripping the query? A redirection or rewriting."

A redirect is needed. This tells a user asking for one URL to make a new request for a different URL. URLs are used "out there" on the web.

Rewriting has no effect on URLs. A rewrite merely alters the internal location used to service a particular request.
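
A minimal side-by-side sketch of the two may help. The paths old-page and new-page and the hostname are made up for illustration.

```apache
RewriteEngine On

# Redirect: the server answers with a 301 and a Location header, so the
# browser makes a second request and the address bar changes to /new-page.
RewriteRule ^old-page$ http://www.example.com/new-page [R=301,L]

# Rewrite: nothing is sent back to the browser at this point. The request
# for /new-page is served internally by /index.php, and the address bar
# still shows /new-page.
RewriteRule ^new-page$ /index.php [L]
```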

morpheus83




 
Msg#: 4559170 posted 1:33 pm on Mar 28, 2013 (gmt 0)

You mentioned
Never use (.*) at the beginning or in the middle of a RegEx pattern. (.*) means "match the rest of the string to the end", and can therefore only be used at the end of a RegEx pattern. Use a more specific match here.


But without using a .* how can I convey that it is a query.

Can you suggest the conditions I need to put in the .htaccess file?

lucy24

WebmasterWorld Senior Member, Top Contributor of All Time, Top Contributor of the Month



 
Msg#: 4559170 posted 10:49 pm on Mar 28, 2013 (gmt 0)

"But without using a .* how can I convey that it is a query."

?
There is no relationship between the random string .* and the literal character "?". I assume g1's comment was aimed at this specific line:

RewriteCond %{QUERY_STRING} ^(.*&)?page=

I think what you are trying to say here is: there might be other queries before the "page=" or "p=" query that you are trying to get rid of. But there will never be any further queries after it.

What you need is something closely analogous to the format you use when capturing nested directory names. Here it would look something like
^((?:[^&]+&)*)p(?:age)?=

The non-capture elements ?: are not strictly necessary, but it's a good habit when you are using multiple parentheses in a single line.

Do any of your other queries begin with p? If not, you can shave a further bit of time by expressing the inmost package as
[^p][^&]+&
Replace + with * if some of your other queries are only one letter. I don't know whether you need to code for malformed query strings that contain consecutive &&. There is almost no limit to the forms a bad URL can have, but you don't always need to code for all of them.
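
Dropped into a rule, the first pattern might be used as in the sketch below. It is untested and assumes that www.example.com is the canonical hostname, that the rule sits ahead of the WordPress rules, and that parameters in front of the matched p= or page= are kept while everything from that point on is discarded.

```apache
# %1 holds whatever parameters came before the matched p= or page=
# (possibly nothing). Because the group ends in "&", a kept part such as
# "a=1&" carries a trailing ampersand.
RewriteCond %{QUERY_STRING} ^((?:[^&]+&)*)p(?:age)?=
# Re-append only the captured part. When %1 is empty the target ends in a
# bare "?", which drops the query string altogether.
RewriteRule ^ http://www.example.com%{REQUEST_URI}?%1 [R=301,L]
```

Since the repeated group is greedy, a query string containing both p= and page= is cut at the later of the two first, and a second redirect then removes the earlier one.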

Will any given query string ever contain both "page" and "p" or are they mutually exclusive?

"www.example.com/?page=2 is redirected to www.example.com/"

Are you sure? Does the redirect still take place if you explicitly type in an URL containing "/index.php"?

morpheus83




 
Msg#: 4559170 posted 12:28 pm on Mar 29, 2013 (gmt 0)

The redirect does take place in the case of a filename as well, but it does not work for any directory outside the root. For example, it works on www.example.com/games/sony-psp.php?page=2 but not on www.example.com/games?page=2

The queries can also be mixed: some would include both, e.g. index.php?p=10&page=142, in which case I would need to strip both of them.

g1smd




 
Msg#: 4559170 posted 1:25 pm on Mar 29, 2013 (gmt 0)

When a request includes both page= and p=, do you want to strip both of those AND all others, or will you need to retain the others?

morpheus83




 
Msg#: 4559170 posted 2:35 pm on Mar 29, 2013 (gmt 0)

I would want to strip both of them. No query should be retained.

This should work in tandem with the rules set by WordPress as well.

lucy24




 
Msg#: 4559170 posted 8:49 pm on Mar 29, 2013 (gmt 0)

"I would want to strip both of them. No query should be retained."

Now you're saying two different things. Apart from the p(age) element: what about the other queries, if any?

example.com/directory/filename.php?a=123&page=456&b=789

There are potentially three separate captures:

{stuff before first p(age)=\d+}
p(age)=\d+
{stuff in between}
p(age)=\d+
{stuff after p(age)=\d+}

Let us not consider further malformed query strings that have duplicate occurrences of the same thing. Cue Tolstoy paraphrase here.

From first post:
RewriteRule ^index\.php$ - [L]

RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
RewriteRule . /index.php [L]

RewriteCond %{QUERY_STRING} ^(.*&)?page=
RewriteRule ^(.*)$ $1?%1 [R=301]

#1 any request for "index.php" is left unchanged, regardless of whether it was an external request (hence my earlier question about deliberately typing in "index.php") or an internal request resulting from a rewrite

#2 any request for a nonexistent file regardless of format* is rewritten to index.php, and then Apache cycles through all the mods again from the top

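#3 any request with "page=" in the query string is redirected to the same request minus the part of the query string beginning with the last occurrence of "page=", unless the requested page is called "index.php", in which case it never reaches this rule. Because of #1 and #2, that includes every request for something that is not a physical file or directory: it is rewritten to index.php and stops at the first rule, which is why /page/2?page=2 is not redirected while requests for real files and for the root are.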

Within each module, rules execute in order unless you do fancy footwork involving skips and repeats.


* This is standard CMS behavior and I can't for the life of me understand why it is supposed to be a good idea. But then, I don't speak Apache.

morpheus83




 
Msg#: 4559170 posted 8:16 am on Mar 30, 2013 (gmt 0)

OK, to simplify it, how about a rule which strips any and all queries?

lucy24, is it fine if I share the working URL with you?

lucy24




 
Msg#: 4559170 posted 7:06 pm on Mar 30, 2013 (gmt 0)

You can share anything you like so long as you express the domain name as example.com. Or .org or .uk or dot anything else if you're talking about multiple domains. Unfortunately it doesn't work with subdomains.

If you want to dump any query strings that contain the parameter
p(age)?
the whole thing definitely gets easier. It may be useful to backtrack a little and say where the queries are coming from. If it's some limited number of outdated links or bookmarks, you may be able to target the rule more narrowly.

morpheus83




 
Msg#: 4559170 posted 9:51 am on Apr 1, 2013 (gmt 0)

The queries are coming from cached Google pages from our old CMS, which used to paginate in the fashion of www.example.com/index.php?page=2 and so forth. I have tried hard to remove the pages from Google's cache through Webmaster Tools and with explicit noindex rules in robots.txt, yet they remain. So I think the best way would be to do a 301 using .htaccess.

I have PM'd you my website URL so you can get a better idea.


All trademarks and copyrights held by respective owners. Member comments are owned by the poster.
WebmasterWorld is a Developer Shed Community owned by Jim Boykin.
© Webmaster World 1996-2014 all rights reserved