Duplicate content: rewrite / remove query strings in wrong order

mod_rewrite query string leak duplicate content

         

maxv

8:18 pm on Aug 27, 2014 (gmt 0)

10+ Year Member



Hi everyone,

After a fair amount of hair pulling, I've discovered that there was some bad rewrite code on a site which leaked query string information, then was indexed by the Big G, causing some duplicate content penalties.

The worst of it is that there isn't a consistent order to the query vars, PLUS there are some bogus query strings that need to be removed.

I've looked through the forums here for an out-of-order query rewrite without much luck; hopefully I'm not asking the same question for the N^1000th time. And of course, there are sub-questions/features below to further muddy things up.

Here goes:

This is a worst case sample URL with a bunch of extraneous query vars, plus a stray ? character where it shouldn't be:


# query variables are showing up in mixed order, this is just one possibility
http://www.example.com/gallery.php?gallery_name=the_gallery_name&image_name=Feature_Image_aeIou121qX?sort=most_viewed&page=120&ipp=9&id=2374803&featuring=Old_MacDonald&loc=500&sort=added_last&floc=3&loc=7


No, I didn't (mostly) write the code that created this monstrosity. It's an amalgam of old indexed content, plus the more recent rewrite leak of query vars. Backlinks generated by other sites have compounded the issue by injecting junk into the links (e.g. featuring=Old_MacDonald), and there's some old pagination stuff like loc, floc which does nothing and needs dropping.

Ultimately, there are only four parameters to keep: gallery_name, image_name (optional), sort, and page.

Ideally, the canonical version would look like either:

http://www.example.com/The_Gallery_Name/feature_image_aeIou121qX?sort=most_viewed&page=120


or if there's no image_name query var,

http://www.example.com/The_Gallery_Name?sort=most_viewed&page=120


This URL would be what Google and users ultimately see in the address bar.

(Note the capitalization change from the_gallery_name to The_Gallery_Name: each word should be capitalized, and there's no fixed number of words in the name. I think I may know how to approach this with RewriteMap.)
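The capitalization step described above could be sketched roughly as follows. This is a Python illustration of the logic only, not the RewriteMap approach MaxV has in mind; the function name `name_case` is hypothetical, and it assumes words are separated by underscores and that only the first letter of each word changes.

```python
def name_case(name: str) -> str:
    """Upper-case the first letter of every underscore-separated word."""
    # w[:1] is safe for empty pieces (e.g. a doubled underscore),
    # and the rest of each word is left exactly as it arrived.
    return "_".join(w[:1].upper() + w[1:] for w in name.split("_"))


print(name_case("the_gallery_name"))  # The_Gallery_Name
```

Because the split is on every underscore, the number of words in the name does not matter.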

Is this something you can write a series of rules for, deleting each bogus query var with each pass? Or can it be done in a single pass?

Mock(able) code:

# How to grab name/value pairs out of order?
# How to fix stray ? character in middle of query string

RewriteMap uppercase int:toupper
RewriteMap lowercase int:tolower
# TODO: transform gallery_name to name case using RewriteMap

# Only matches when the four pairs arrive in exactly this order
RewriteCond %{QUERY_STRING} gallery_name=(\w+)[&]?image_name=(\w+)[&]?page=(\d+)[&]?sort=(\w+) [NC]
# TODO: append /%2 (image_name) only if present
RewriteRule ^/gallery\.php(.*) /%1/%2 [QSA]



Honestly, I'm not sure how to approach this. To help with the duplicate content issue, it appears to be appropriate to use the flag [R=301] to make sure that search engines treat the final rewritten result as canonical. After 'fixing' the URL, internally I'll continue to rewrite the canon URL back into its constituent parts (gallery_name=the_gallery_name, image_name=feature_image_aeIou121qX, sort=X, page=Y) for PHP processing (currently using this rule: RewriteRule ^/([^\.\?/]+)/([A-Za-z_0-9\-]+)$ /gallery.php?gallery_name=$1&image_name=$2 [QSA])
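To make the internal mapping concrete, here is a rough Python sketch of what that RewriteRule pattern does to a canonical path. The function name `split_canonical_path` is hypothetical, and one thing differs from the quoted rule: the image segment is made optional here so that a bare /The_Gallery_Name path also matches, which the rule as quoted does not cover.

```python
import re

# Same character classes as the RewriteRule quoted above; the second
# segment is wrapped in an optional group (an assumption, see lead-in).
CANONICAL_PATH = re.compile(r"^/([^.?/]+)(?:/([A-Za-z_0-9-]+))?$")


def split_canonical_path(path):
    """Return the query parameters implied by a canonical path, or None."""
    match = CANONICAL_PATH.match(path)
    if match is None:
        return None
    parts = {"gallery_name": match.group(1)}
    if match.group(2) is not None:
        parts["image_name"] = match.group(2)
    return parts


print(split_canonical_path("/The_Gallery_Name/feature_image_aeIou121qX"))
print(split_canonical_path("/gallery.php"))  # None: the "." is excluded
```

The values are passed through unchanged, as in the quoted rule; folding The_Gallery_Name back to the_gallery_name would be a separate step.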

But I'm at a loss on first steps with the initial URL cleanup.

Many thanks,

MaxV

lucy24

9:57 pm on Aug 27, 2014 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Just how many parameters are involved, and how many possible permutations? I've got a nasty feeling you're looking at a very simple rule:

RewriteCond %{QUERY_STRING} (badparam|otherbadparam|thirdbadparam)
RewriteRule \.php /fixup.php [L]


Despite carrying only the [L] flag, this rule should go near the beginning of your existing redirects, the ones with the [R=301,L] flag. If the problem is confined to a handful of specific URLs, include those in the pattern of the rule.

Since it's superficially a rewrite rather than a redirect, the originally requested URL and query string will remain accessible to the fixup.php page ... whose exact content will have to be hammered out next door in the php subforum. (Penders? You out there?) Translation: I could do it for my own site, but it would be grossly irresponsible for me to make suggestions for anyone else.

That's your worst-case scenario, pending more detailed information.
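For a sense of what the fixup step might involve, here is a language-neutral sketch written in Python rather than PHP. It only shows the parameter clean-up, not a finished fixup.php. The function name `keep_known_params` is hypothetical, the list of parameters to keep comes from the first post, and two behaviours are assumptions: a stray "?" is treated as just another "&", and when a parameter is repeated the first occurrence wins.

```python
from urllib.parse import parse_qsl

# The four parameters the original poster wants to keep, in output order.
KEEP = ("gallery_name", "image_name", "sort", "page")


def keep_known_params(query: str) -> dict:
    """Pick the known-good parameters out of a messy query string."""
    # Assumption: a stray "?" inside the query is a misplaced separator.
    pairs = parse_qsl(query.replace("?", "&"))
    kept = {}
    for key, value in pairs:
        # Assumption: the first occurrence of a repeated parameter wins.
        if key in KEEP and key not in kept:
            kept[key] = value
    # Emit in the fixed KEEP order, whatever order the request used.
    return {key: kept[key] for key in KEEP if key in kept}


messy = ("gallery_name=the_gallery_name&image_name=Feature_Image_aeIou121qX"
         "?sort=most_viewed&page=120&ipp=9&id=2374803&featuring=Old_MacDonald"
         "&loc=500&sort=added_last&floc=3&loc=7")
print(keep_known_params(messy))
```

Working from a list of parameters to keep, rather than a list of bad ones, means junk injected by future backlinks is dropped without further changes.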

phranque

12:42 am on Aug 28, 2014 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



Ideally, the canonical version would look like either:

http://www.example.com/The_Gallery_Name/feature_image_aeIou121qX?sort=most_viewed&page=120


or if there's no image_name query var,

http://www.example.com/The_Gallery_Name?sort=most_viewed&page=120

what happens when the sort parameter is excluded?

maxv

6:48 am on Aug 28, 2014 (gmt 0)

10+ Year Member



Thanks for the replies!

@lucy24:
There are 4 parameters we want to keep: gallery_name, image_name (optional), page (optional), and sort

The main 'bad' parameters are anything not in the list above. These represent 88% of the known 'bad' parameters: loc, floc, id, featuring.

The PHP script, gallery.php will ignore bad parameters, so it's not the end of the world if a few things are missed, but there are some very old holdovers (which are no longer used) that are creating a large duplicate content effect with search engines.

You wrote:

RewriteCond %{QUERY_STRING} (badparam|otherbadparam|thirdbadparam)
RewriteRule \.php /fixup.php [L]


Ah, so stripping out bad/unwanted query parameters is not possible with RewriteCond / RewriteRule?

What about the inverse? Some way to store the known good params and loop until they are all caught & stored in %1, %2, %3, %4, then append them in a final RewriteRule? (no idea if this is possible)

@phranque:
The php script will use a default sort order without the inclusion of the sort variable, so it's optional and ok if nonexistent.

Thanks all, really appreciate pointers/feedback.

MaxV

phranque

7:46 am on Aug 28, 2014 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



The php script will use a default sort order without the inclusion of the sort variable, so it's optional and ok if nonexistent.


so you're saying these 2 urls will generate the same response?
http://www.example.com/The_Gallery_Name?sort=most_viewed&page=120
http://www.example.com/The_Gallery_Name?page=120

in that case the canonical url should exclude the sort parameter since it has a default value.


The_Gallery_Name
most_viewed

uggh!
if it's not "too late", i would recommend folding that to lower case and using hyphens instead of underscores.

lucy24

8:54 am on Aug 28, 2014 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



so stripping out bad/unwanted query parameters is not possible with RewriteCond / RewriteRule?

All things are possible with mod_rewrite ;) but sometimes it's just more trouble than it's worth. That's why I asked how many bad parameters there are, and how many possible sort orders.

The one thing you can NOT do is:

RewriteCond %{QUERY_STRING} (?:bad1|bad2|bad3|bad4)
RewriteCond %{QUERY_STRING} (gallery_name=[^&]*)
RewriteCond %{QUERY_STRING} (sort=[^&]*)
RewriteCond %{QUERY_STRING} (image_name=[^&]*)?
RewriteCond %{QUERY_STRING} (page=[^&]*)?
RewriteRule ^(.+\.php) http://www.example.com/$1?%1&%2&%3&%4 [R=301,L]

Again, you can NOT do this, because mod_rewrite does not allow captures from more than one Condition. (Multiple captures from a single Condition, yes, provided it's the last one.) So you'd have to write a rule for every possible permutation, with a separate ruleset for each. Four parameters, variable order = 4! = 24 rulesets.

But you could easily do this with a few lines of php.
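The 4! = 24 figure can be checked by brute force. The short Python sketch below enumerates the orderings; the second count is an extra illustration that assumes gallery_name is always present and the other three parameters are each optional, which is how the thread describes them.

```python
from itertools import combinations, permutations

REQUIRED = ("gallery_name",)
OPTIONAL = ("image_name", "sort", "page")

# Every ordering of all four parameters: one ruleset per ordering.
all_four = list(permutations(REQUIRED + OPTIONAL))


def orderings_with_optional() -> int:
    """Count orderings when any subset of OPTIONAL may be missing."""
    total = 0
    for k in range(len(OPTIONAL) + 1):
        for subset in combinations(OPTIONAL, k):
            total += sum(1 for _ in permutations(REQUIRED + subset))
    return total


print(len(all_four))              # 24
print(orderings_with_optional())  # 49
```

Either way the count grows factorially with each added parameter, which is the practical argument for handling this outside mod_rewrite.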

maxv

5:07 pm on Aug 28, 2014 (gmt 0)

10+ Year Member



@phranque:

This would be the canonical default (implied sort order is newest content, descending, no need for the sort query var, so I'll strip it out):
http://www.example.com/The_Gallery_Name?page=120

This would be the variant, showing different content (most viewed, descending order)
http://www.example.com/The_Gallery_Name?sort=most_viewed&page=120
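The two cases above can be sketched as one URL-building step. This is a Python illustration only. The thread does not name the sort value that corresponds to the default order, so `DEFAULT_SORT = "newest"` is a placeholder, and the function name `canonical_url` is hypothetical. It also assumes the gallery and image names contain only URL-safe characters, as in the examples.

```python
from urllib.parse import urlencode

HOST = "http://www.example.com"
DEFAULT_SORT = "newest"  # placeholder: the real default value is not given


def canonical_url(gallery, image=None, sort=None, page=None):
    """Build the canonical URL, omitting sort when it is the default."""
    path = "/" + gallery
    if image:
        path += "/" + image
    query = []
    # A missing sort and an explicit default sort produce the same URL.
    if sort and sort != DEFAULT_SORT:
        query.append(("sort", sort))
    if page:
        query.append(("page", page))
    return HOST + path + ("?" + urlencode(query) if query else "")


print(canonical_url("The_Gallery_Name", page="120"))
print(canonical_url("The_Gallery_Name", sort="most_viewed", page="120"))
```

Emitting sort before page in a fixed order keeps every variant down to a single spelling.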

Unfortunately, I inherited the underscore naming convention; it's more than a decade old, so there's no going back without a severe SERP penalty. Agreed, lowercase and hyphens would be much more civilized.

Thanks for the feedback, good stuff.

maxv

8:50 pm on Aug 28, 2014 (gmt 0)

10+ Year Member



@lucy24:

Sometimes the hardest answer to get to the bottom of is "can I NOT do this?" (or is it infeasible with the current tool?).

Thanks for the time and wisdom shared, most appreciated.

Now to play with some PHP and throw out some header("HTTP/1.1 301 Moved Permanently")'s

MaxV
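One detail worth guarding against when sending those 301s: if the script also runs for requests that are already canonical, redirecting them to themselves would loop. A minimal Python sketch of the decision follows; the function name `redirect_for` and the (status, headers) return shape are illustrative choices rather than anything from the thread. In PHP the redirect itself would be issued with header() calls, as mentioned above.

```python
def redirect_for(requested: str, canonical: str):
    """Return (status, headers) for a redirect, or None to serve the page."""
    if requested == canonical:
        # Already canonical: no redirect, which also avoids a loop.
        return None
    return 301, {"Location": canonical}


canonical = "http://www.example.com/The_Gallery_Name?page=120"
print(redirect_for("http://www.example.com/gallery.php?loc=7&page=120"
                   "&gallery_name=the_gallery_name", canonical))
print(redirect_for(canonical, canonical))  # None
```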