Forum Moderators: Robert Charlton & goodroi


Is it worthwhile to handle URL parameters to avoid duplicate content?


guarriman3

8:06 am on Jun 11, 2021 (gmt 0)

10+ Year Member Top Contributors Of The Month



Hi,

For years, I've been seeing thousands of weird URLs in my Apache logs, since some external websites/apps add lots of parameters to the URLs:
https://my_url?viewType=Print&viewClass=Print
https://my_url?iframe=true&width=95%&height=95%
https://my_url?fbclid=xxxx


I was very scared of duplicate content, so I tried to 'catch' all these parameters and 301-redirect the weird URLs to the original 'http://my_url'. I've been using PHP code and .htaccess conditionals.

However, I'm wondering if all this effort is worth it, because it harms server performance somehow. I'm considering whether simply relying on the canonical metatag with the definitive URL is enough to keep Google from treating the different URLs as duplicate content. The canonical metatag is already well defined for all of my URLs.
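For reference, the canonical metatag in question is a single line in each page's <head> - a sketch with a made-up URL:

```html
<!-- Hypothetical example: every parameterized variant of the page carries
     the same canonical element, pointing at the clean URL. -->
<link rel="canonical" href="https://www.example.com/my-page">
```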

I've been browsing Google's documentation (https://developers.google.com/search/docs/advanced/crawling/consolidate-duplicate-urls), but I don't find its solution for coping with dynamic parameters clear.

I would like to know your experiences. Should I rely just on the canonical metatag to avoid duplicate content? Thank you.

lucy24

4:59 pm on Jun 11, 2021 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



because it harms the performance of the server somehow
For a given definition of “somehow”, at least. Pausing to evaluate a condition on all page requests, and then issuing the occasional redirect, shouldn't add that much to your server overhead if you consider how many other things get checked on every page request. I emphasize: page requests. If possible, make sure your RewriteRules (assuming Apache) are narrowly constrained to pages only. It's fairly rare for hanky-panky to involve supporting files.
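To illustrate constraining the rules to pages only, here is a hypothetical .htaccess sketch that skips supporting files and only acts on the known junk parameters (names taken from the examples in this thread - adjust extensions and parameter names to your own site):

```apache
RewriteEngine On
# Leave supporting files (images, CSS, JS) alone.
RewriteCond %{REQUEST_URI} !\.(css|js|png|jpe?g|gif|ico|svg)$ [NC]
# Only act on the known junk parameters from the examples above.
RewriteCond %{QUERY_STRING} (^|&)(viewType|viewClass|iframe|fbclid)= [NC]
# Redirect to the same path with the query string stripped.
RewriteRule ^(.*)$ /$1? [R=301,L]
```

The trailing `?` in the substitution is what drops the query string from the redirect target.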

Personally I block all requests with queries, except fbclid--a legitimate part of Facebook referrals--which gets redirected. (Sometimes I've made other exemptions, depending on where human referrals are coming from.)
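As a concrete (hypothetical) .htaccess sketch of that policy - redirect fbclid, refuse everything else that arrives with a query - assuming the site itself never uses query strings on visible page URLs:

```apache
RewriteEngine On
# fbclid is a human following a Facebook link: redirect to the clean URL.
RewriteCond %{QUERY_STRING} ^fbclid= [NC]
RewriteRule ^(.*)$ /$1? [R=301,L]
# Any other query string on a page request gets a 403.
RewriteCond %{REQUEST_URI} !\.(css|js|png|jpe?g|gif|ico|svg)$ [NC]
RewriteCond %{QUERY_STRING} .
RewriteRule ^ - [F]
```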

Does G itself make a habit of requesting URLs with spurious query strings? If not, you needn't worry about Duplicate Content.

FranticFish

7:45 pm on Jun 12, 2021 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I defer to others with way more experience than me in interpreting server logs and blocking bots, but I have found canonical to be very useful. Google say it's a hint and not a directive (as a 301 is), but looking at the examples you've posted of URLs that concern you, I can't say they would concern me - at least not for the SME sites I manage.

If a URL is linked to, then there is a public record of that 'proposed' URL. If a 301 then comes into being, that proposed URL has been claimed on behalf of the domain. In other words, a 301 means you accept responsibility for that URL - you legitimise it. I've seen meds links to non-existent pages show up in Analytics as if they were real URLs, because 301 nets were spread too wide and too trustingly. Perhaps this could present a problem if done enough.

Blocking this traffic does indeed protect, but it's a generic response which is basically a 'no'.

Does canonical legitimise the url in the same way that a 301 does? Looking at the examples we have...

1) A print format
Seems legit, maybe it preformats your page according to some need that you are not aware of.

2) Opening within a large <iframe>
This one is perhaps more borderline. A site or app that wants to retain some sort of navigation control over the user with your page as a panel in their system?

3) Facebook ID
Innocent enough: well, as innocent as tracking you all over the web and selling your personal information gets :)

Personally I would lean towards canonical - I have found it handy - but I do not manage large websites, and I don't have experience of what it might imply in practice at scale.

NickMNS

8:34 pm on Jun 12, 2021 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



except fbclid--a legitimate part of Facebook referrals

It depends on your definition of "legitimate".
It is a spying mechanism used by Facebook to keep tabs on its users as they attempt to navigate away from the walled garden. By adding the parameter, FB can track users on external websites without the use of cookies. It requires that the external website use an FB Like button or other widget: the widget's JS code can read the external site's URL, which includes the personally identifying parameter, and relay it back to the mother-ship. (Note: I'm speculating about this, but to me it is the only logical explanation for this tracking parameter.)

There is also Safari, from within the Google search app, which adds parameters to the URLs. In this case it actually provides useful information - specifically "real" keyword data. Real in the sense that Google provides filtered/biased keyword data: Google's reports attempt to show the widest array of keywords possible, meaning that if you get 100 searches for "blue widget", you will likely only see it appear once or twice, but if you then get one search for, say, "periwinkle blue widget", that term will also appear in the report. This makes it look as though "periwinkle blue widget" is almost as common as "blue widget". The data from Safari confirms that this is not the case: data from Safari is a near-random sample of keywords, in which "blue widget" appears often and "periwinkle blue widget" rarely if ever. Anyway, I digress. Very few people are aware of this odd side effect.

All that to say that there may well be valuable information passed in these 3rd party parameters.

RE dupe-content, what Lucy24 says
Does G itself make a habit of requesting URLs with spurious query strings? If not, you needn't worry about Duplicate Content.

An extension to this is, are users linking to your content using these parameters in those links? Probably not.

@FranticFish
In other words, a 301 means you accept responsibility for that url - you legitimise it.

This seems very unlikely, because of Lucy24's comment above.

FranticFish

4:36 am on Jun 13, 2021 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



@ NickMNS

Yes, of course, Lucy is correct. Analytics is not the same as Googlebot, so unless the URL shows up as having been requested by Googlebot during a crawl, what you're seeing in Analytics is a little like referer spam: it's not a true picture of what is actually being attributed to your domain by Google the search engine.

Whenever I see these 'bad urls via 301' in Analytics, I'll get the server-side rules changed so that 404s are served instead. Next time I see one, I'll have to remember to pull the server logs to compare.
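For what it's worth, switching from a redirect to an error response is a small change in Apache's mod_rewrite - a hypothetical sketch, with 'badparam' standing in for whatever parameter was previously being 301'd:

```apache
# Sketch only - 'badparam' is a placeholder for the offending parameter.
# An R= flag outside the 3xx range makes Apache return that status directly.
RewriteCond %{QUERY_STRING} (^|&)badparam= [NC]
RewriteRule ^ - [R=404,L]
```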

lucy24

5:55 pm on Jun 13, 2021 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



About fbclid: I guess I could have said “used by legitimate humans” ;) I don't have any FB-related widgets, and the people coming in with this query do appear to be actual humans following links from FB, so redirecting rather than blocking seems a decent compromise. I used to globally redirect anything with a query (because I don't use them at all in visible URLs), but on closer inspection it turned out most of them were from malign robots anyway, so why not proceed directly to 403. And then you have to turn around and poke a hole because they don't know they're doing anything wrong.