Duplicate pages due to sorting - Google Search and SEO forum at WebmasterWorld - WebmasterWorld

Forum Moderators: Robert Charlton & goodroi

Duplicate pages due to sorting

RealtorAl

10:17 pm on Nov 29, 2013 (gmt 0)

10+ Year Member

I have a widget website. I have pages that contain a list of widgets for sale in a given town. For example:

http://www.example.com/town-st-1234-widget

There's a sort button on these pages that sorts the list but creates a new url when it does it such as:

http://www.example.com/town-st-1234-widget/page:1/sort:price/direction:desc#widget-listings

Any page created with the sort button is a dup. The result is I have 8 or 9 dups for each page of listings.

I have seen code on the net that allows you to get the url of the current page such as:

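<?php

// Build the full URL of the current request.
function curPageURL() {
    // $_SERVER["HTTPS"] may be unset on plain HTTP, so check before comparing.
    $isHttps = !empty($_SERVER["HTTPS"]) && $_SERVER["HTTPS"] !== "off";
    $pageURL = $isHttps ? "https://" : "http://";
    $pageURL .= $_SERVER["SERVER_NAME"];

    // Append the port only when it is not the default for the scheme.
    $defaultPort = $isHttps ? "443" : "80";
    if ($_SERVER["SERVER_PORT"] != $defaultPort) {
        $pageURL .= ":" . $_SERVER["SERVER_PORT"];
    }

    return $pageURL . $_SERVER["REQUEST_URI"];
}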

My questions are:

Could I use such code to detect the presence of 'sort:' in the url and, based on that, prevent indexing of the page by web crawlers?

If I can do the above, is it worth doing to prevent duplicate titles, descriptions, etc. (in other words, bad SEO)?
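
For illustration, the check asked about in the first question could be sketched roughly as follows. It is written in JavaScript rather than PHP, the helper name robotsMetaFor is hypothetical, and it assumes sorted listings always carry a "sort:" path segment as in the example URL.

```javascript
// Hypothetical helper: decide which robots meta tag (if any) a page should emit.
// Assumption: sorted listings always contain a "/sort:<field>" path segment.
function robotsMetaFor(url) {
  // Parse the URL so the fragment (#widget-listings) and any query are ignored.
  const path = new URL(url).pathname;
  const isSorted = path
    .split("/")
    .some((segment) => segment.startsWith("sort:"));
  // "noindex, follow" asks crawlers to skip indexing but still follow links.
  return isSorted ? '<meta name="robots" content="noindex, follow">' : "";
}

console.log(robotsMetaFor("http://www.example.com/town-st-1234-widget"));
console.log(
  robotsMetaFor(
    "http://www.example.com/town-st-1234-widget/page:1/sort:price/direction:desc#widget-listings"
  )
);
```

Matching on whole path segments rather than a plain substring search avoids false hits on a town or widget name that happens to contain "sort:".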

[edited by: aakk9999 at 12:46 am (utc) on Nov 30, 2013]
[edit reason] Examplified URL [/edit]

netmeg

1:09 am on Nov 30, 2013 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

Top Contributors Of The Month

You can go into GWT (Google Webmaster Tools) and tell Google what parameters you have and what they do - that's probably your best route.

JD_Toims

1:18 am on Nov 30, 2013 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

Top Contributors Of The Month

You can, RealtorAl, and imo it's best to do what you're asking about. Just add a canonical <link> to the <head> of the sorted pages, pointing to the "regular" equivalent page.

Unfortunately, if your URL example is correct, netmeg's suggestion won't work in this case. Even though the script uses info from the URL as parameters, there aren't any "technical" parameters [stuff following a ?] in the example you presented. As far as search engines are concerned, "sort", "page", etc. are "technical" directories, not parameters, so you can't just go into WMT and say "ignore the sort parameter" or anything along those lines.
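
The difference between path segments and "technical" parameters can be seen with a standard URL parser. This is only a sketch: queryParamNames is a hypothetical helper, and the second URL is an assumed query-string equivalent of the example, not something the site actually serves.

```javascript
// Hypothetical helper: list the query-string parameter names a URL carries.
function queryParamNames(url) {
  return [...new URL(url).searchParams.keys()];
}

// Colon-style URL from the example: everything after the host is path.
console.log(
  queryParamNames(
    "http://www.example.com/town-st-1234-widget/page:1/sort:price/direction:desc#widget-listings"
  )
); // no parameter names at all

// Assumed query-string equivalent: named parameters a tool could be told to ignore.
console.log(
  queryParamNames(
    "http://www.example.com/town-st-1234-widget/?page=1&sort=price&direction=desc"
  )
); // page, sort, direction
```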

BTW: Welcome to WebmasterWorld!

lucy24

1:58 am on Nov 30, 2013 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

Top Contributors Of The Month

Normally it's nice to take all your parameters and make them into a "friendly" query-less URL. But here it's the wrong approach. A form like

http://www.example.com/town-st-1234-widget/page:1/sort:price/direction:desc

really looks as if it began life as

http://www.example.com/town-st-1234-widget/?page=1&sort=price&direction=desc

If you leave the parameters as parameters instead of rewriting them into a semi-pretty URL (we won't talk about the literal colons, ugh) it will be a lot easier to tell search engines which parts don't matter.
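
As a rough sketch of the relationship between the two forms above, the colon-style segments can be mapped back to a query string mechanically. The function name segmentsToQuery is hypothetical, and the sketch assumes every "name:value" segment is a parameter and everything else is real path.

```javascript
// Hypothetical converter: turn "name:value" path segments back into a query string.
// Assumption: any segment containing a colon is a parameter, not a real directory.
function segmentsToQuery(url) {
  const parsed = new URL(url);
  const kept = [];
  for (const segment of parsed.pathname.split("/").filter(Boolean)) {
    const colon = segment.indexOf(":");
    if (colon > 0) {
      // "sort:price" becomes the parameter sort=price.
      parsed.searchParams.append(
        segment.slice(0, colon),
        segment.slice(colon + 1)
      );
    } else {
      kept.push(segment);
    }
  }
  // Rebuild the path from the remaining segments, with a trailing slash.
  parsed.pathname = "/" + kept.map((segment) => segment + "/").join("");
  return parsed.href;
}

console.log(
  segmentsToQuery(
    "http://www.example.com/town-st-1234-widget/page:1/sort:price/direction:desc"
  )
);
```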

ZydoSEO

2:15 am on Nov 30, 2013 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

This is a case where using the canonical link element is probably the best solution.

JD_Toims

3:18 am on Nov 30, 2013 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

Top Contributors Of The Month

"If you leave the parameters as parameters instead of rewriting them into a semi-pretty URL (we won't talk about the literal colons, ugh) it will be a lot easier to tell search engines which parts don't matter."

Can't quite tell what you mean. If you're talking about "next time, don't do it this way", then I absolutely agree, but if you're thinking reverting to a query_string would be easier/better, then I'm not sure about that.

If the URLs are reverted, everything currently in place would have to be redirected to URLs with parameters to get any/all bookmarks and inbound links to the right place. Those URLs would then have to be canonicalized anyway, rather than "ignored" in WMT, to ensure capturing inbound link weight from "the big 3" search engines rather than only Google. And if all that was taken into account but the fragment identifier on the end of the URL comes into play, it could really get, uh, fugly lol

I think now that the URLs are what they are, it's best, and likely easiest, to just modify the script producing the output to add a <link rel="canonical" href="http://www.example.com/the-path/the-page.ext"> to the <head> pointing to the "regular version", rather than taking any chances with changing to parameters, redirecting the current stuff, and still having to canonicalize the parameterized version of the URL to a "regular version" anyway.

Basically, changing now adds 3 steps where an "oops" could be ugly, and with Google as fickle as it's been lately, I think I'd stick with the 1 step I'd have to put in place either way and make sure I got that one step right.
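
The one step described above could be sketched like this. The helper name canonicalLinkFor is hypothetical, and two things are assumptions: that "sort:" and "direction:" never change which widgets are listed, and that "page:1" is the same page as the bare listing URL while later pages are kept as their own canonicals.

```javascript
// Hypothetical helper: build the canonical <link> for a listing URL.
// Assumptions: "sort:" and "direction:" only reorder the same widgets,
// and "page:1" is identical to the bare listing URL.
function canonicalLinkFor(url) {
  const parsed = new URL(url);
  const isDropped = (segment) =>
    segment.startsWith("sort:") ||
    segment.startsWith("direction:") ||
    segment === "page:1";
  parsed.pathname = parsed.pathname
    .split("/")
    .filter((segment) => !isDropped(segment))
    .join("/");
  // The fragment is never sent to the server, so it is left out of the canonical.
  parsed.hash = "";
  // Escape "&" so the URL is safe inside an HTML attribute.
  const href = parsed.href.replace(/&/g, "&amp;");
  return `<link rel="canonical" href="${href}">`;
}

console.log(
  canonicalLinkFor(
    "http://www.example.com/town-st-1234-widget/page:1/sort:price/direction:desc#widget-listings"
  )
);
```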

phranque

7:24 am on Nov 30, 2013 (gmt 0)

WebmasterWorld Administrator

10+ Year Member

Top Contributors Of The Month

welcome to WebmasterWorld, RealtorAl!

i'm not a fan of the link rel canonical and see it as a bandaid of last resort for technical deficiencies.
i'm going to add a 3rd vote here and say that in this case you need that bandaid.

RealtorAl

3:19 pm on Nov 30, 2013 (gmt 0)

10+ Year Member

Thanks All for your excellent replies and I'm happy to join the forum! One thing I should have made more clear: I'm the purchaser of the website and don't have full control of the implementation. The vendor provides a content management system so I can create a lot of what makes up the site and can also control most of the look and feel via CSS. But I don't control the widget list, the way it's created, the 'parameters', etc.

I guess the ideal way to handle this would be not to change the url when sorting the widgets.

RealtorAl

4:12 pm on Nov 30, 2013 (gmt 0)

10+ Year Member

One more question: Why not use content=noindex rather than go the canonical route? It seems simpler to detect the presence of 'sort:' in the url and then indicate the page should not be indexed. Can I do this?

JD_Toims

6:35 pm on Nov 30, 2013 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

Top Contributors Of The Month

You can use noindex, but you'll be forcing search engines to try to guess whether http://www.example.com/town-st-1234-widget/page:1/sort:price/direction:desc#widget-listings should be canonicalized to http://www.example.com/town-st-1234-widget for inbound link/ranking-factor consolidation. Making them guess at things can lead to very unexpected results, as we saw in the 410 Gone Page Indexed [webmasterworld.com] thread.

Personally, I would probably use mod_rewrite to rewrite requests to a file called /canonicalize.php, which would sit between the request and the CMS. I could then use that PHP file to manage and set the headers and deliver the contents for the request to the visitor.
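
One possible reading of "set the headers" is sending the canonical as an HTTP Link header instead of editing the page's <head>. That reading is an assumption, as is the helper name canonicalLinkHeader. The sketch below, in JavaScript rather than PHP, only shows the header value such a wrapper script would need to produce.

```javascript
// Hypothetical helper: express a canonical URL as an HTTP response header.
// Assumption: the wrapper script can add headers before passing the CMS
// output through unchanged.
function canonicalLinkHeader(canonicalUrl) {
  // new URL() throws on malformed input and normalizes the rest.
  const href = new URL(canonicalUrl).href;
  return { name: "Link", value: `<${href}>; rel="canonical"` };
}

const header = canonicalLinkHeader("http://www.example.com/town-st-1234-widget");
console.log(`${header.name}: ${header.value}`);
```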

phranque

8:46 pm on Nov 30, 2013 (gmt 0)

WebmasterWorld Administrator

10+ Year Member

Top Contributors Of The Month

"I guess the ideal way to handle this would be not to change the url when sorting the widgets"

this would be ideal, but more difficult to implement without control of the CMS implementation.

"Why not use content=noindex rather than go the canonical route?"

PR (PageRank) dilution

RealtorAl

4:36 pm on Jan 24, 2014 (gmt 0)

10+ Year Member

I did put in a JavaScript function to detect the presence of 'sort:' and then add a canonical tag pointing to the correct page. The code is below. It works per Firebug, but I can't tell whether the JavaScript is being run by Googlebot, though I've read that Googlebot executes JavaScript.

I think the surest way to go is to add a disallow: line to robots.txt as follows:

disallow: /*sort:*

Does this make sense?

function SetCanon()
{
  var theUrl = document.URL;
  // Position of the "sort:" segment; -1 means this is not a sorted view.
  var iPos = theUrl.indexOf("sort:");

  if (iPos !== -1)
  {
    var theHead = document.getElementsByTagName("head")[0];
    // Keep everything before the "/" that precedes "sort:".
    var strTmp = theUrl.slice(0, iPos - 1);
    var eLink = document.createElement("link");
    eLink.setAttribute("id", "Canon");
    eLink.setAttribute("rel", "canonical");
    eLink.setAttribute("href", strTmp);
    theHead.appendChild(eLink);
  }
}
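
To see what the proposed disallow: /*sort:* pattern would cover, here is a minimal sketch of wildcard matching in the style Google documents for Googlebot: "*" matches any run of characters, a trailing "$" anchors the end, and a rule is otherwise a prefix match on the path. The function name robotsRuleMatches is hypothetical, and other crawlers may treat wildcards differently.

```javascript
// Sketch of Googlebot-style robots.txt rule matching (other crawlers may differ).
function robotsRuleMatches(pattern, path) {
  const anchored = pattern.endsWith("$");
  const body = anchored ? pattern.slice(0, -1) : pattern;
  // Escape regex metacharacters in each literal piece, then join pieces with ".*".
  const source = body
    .split("*")
    .map((piece) => piece.replace(/[.*+?^${}()|[\]\\]/g, "\\$&"))
    .join(".*");
  // Rules are prefix matches unless "$" anchors the end of the path.
  return new RegExp("^" + source + (anchored ? "$" : "")).test(path);
}

const sortedPath = "/town-st-1234-widget/page:1/sort:price/direction:desc";
console.log(robotsRuleMatches("/*sort:*", sortedPath));
console.log(robotsRuleMatches("/*sort:", sortedPath));
console.log(robotsRuleMatches("/*sort:*", "/town-st-1234-widget"));
```

Because rules are prefix matches, the trailing * adds nothing: /*sort: covers the same URLs. Also note that a crawler blocked by robots.txt does not fetch the page, so it would not see a canonical link placed on it.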