Forum Moderators: goodroi


Block crawling of parameterized duplicate content

Help with properly setting up crawl blocking for specific URL parameters


Nosuchthing

2:44 pm on Jan 6, 2021 (gmt 0)

5+ Year Member Top Contributors Of The Month



Hello,

A website we've been working on for a client has an issue with parameterized duplicate content.

They have a large portfolio, which can be narrowed using various filters. For the sake of simplicity, we can take the parameter "capacity" as an example.

This means that a page example.com/product?capacity=1 is indexed when ideally, only the page example.com/product should be.

In order to fix this, I used the URL parameter tool in Search console
>add parameter
>capacity
>Narrows
>No URLs should be crawled

I made this change on the 15th of December, and since then we have not observed a significant difference. Some pages I meant to exclude have actually received more impressions than before.

My questions are:
Should I add the ? in front of capacity for this to work (?capacity)?
Is there something I missed or did not set up properly?
Do you have any insight on how to solve this issue?

Thanks in advance,

phranque

3:00 pm on Jan 6, 2021 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



I think the proper technical solution to this would be a link rel canonical element.
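As a sketch of what that looks like (using the example.com URLs from the question, which are assumed here), each parameterized variant would carry a canonical link in its <head> pointing back to the clean URL:

```html
<!-- In the <head> of example.com/product?capacity=1
     (and every other filtered variant of that page): -->
<link rel="canonical" href="https://example.com/product">
```

This tells Google to treat the filtered URLs as alternates of /product and consolidate signals there, though the tag is a hint rather than a directive.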

lucy24

6:12 pm on Jan 6, 2021 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



No URLs should be crawled
Doesn't this option mean “Do not crawl URLs that contain this parameter”? (A common real-life example would be "print" versions of pages.) Most of the time you'd select “ignore this parameter” instead.

Nosuchthing

6:02 pm on Jan 7, 2021 (gmt 0)

5+ Year Member Top Contributors Of The Month



Hello,

Thank you for your answers.

Lucy24, that is indeed what it means. Below are the different options I have:

Which URLs with this parameter should Googlebot crawl?
Let Googlebot decide (Default)
Every URL (the page content changes for each value)
Only URLs with value (may hide content from Googlebot)
No URLs (may hide content from Googlebot; overrides settings for other parameters)

@phranque

This is what I thought as well, until I saw this advice in Google's documentation:

"If you have many such URL parameters in your site, then you might benefit by using the URL Parameters tool to reduce crawling of duplicate URLs.

Important: If your site serves duplicate content to different URLs without using parameters, you should define a canonical page rather than block crawling, as described in this page."

I took it to mean that a canonical page should be defined only if there were no parameters.

If you think I made no mistake in setting it up, I guess I'll take that route anyway.

NickMNS

7:09 pm on Jan 7, 2021 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I made this change on the 15th of December, and since then, we have not observed a significant difference. Some pages I meant to exclude actually got more impressions than before since then.

Crawling, indexing, and ranking are different things. The fact that you have asked Google not to crawl a page does not tell Google to remove the page from its index (if already indexed) or to stop ranking that page. Eventually those pages should fall out of the index and their rankings should drop as Google receives fewer signals from not crawling them, but in the interim the pages are likely still going to be included.

Adding a canonical tag as phranque suggests will tell Google that you prefer that another page be shown in its place. But it is simply a notification of your preference; Google can, and at times does, ignore it. Moreover, if you add the tag and at the same time ask Google not to crawl the page, then Google will never see the tag.

Then there is also the fact that a page with a parameter may in fact be significantly different from its "canonical" page, meaning that Google may choose to ignore the canonical tag because the pages are not similar enough. Remember that a blue t-shirt page (i.e. ?color=blue) is very similar to a black t-shirt page (i.e. ?color=black), but will be only marginally similar to the main t-shirt page (the likely canonical of both the blue and black pages), as that page will likely show all t-shirt variations.

The goal in blocking this content from being crawled is generally to save on crawl budget. Blocking using the URL parameter tool achieves this goal. If a page, after being blocked, continues to appear in search and to bring traffic, there is no real harm in that. Over time the traffic should shift toward what you expect.

Nosuchthing

7:41 pm on Jan 7, 2021 (gmt 0)

5+ Year Member Top Contributors Of The Month



@NickMNS,

Thank you for this very detailed and insightful answer, I truly appreciate it!

Crawl budget was also an issue we were trying to tackle with this move. We just hoped to kill two birds with one stone, and I started to doubt myself after a few weeks, seeing little change.

I'll leave it be for a while and reassess the need for a rel canonical down the line.

not2easy

7:51 pm on Jan 7, 2021 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



The use of rel canonical is very important to help Google understand what to index. If you have several different URLs that present different facets of the same item/product/topic it is important to use canonical and not try to index each page separately.

Be sure you aren't submitting all versions in sitemaps as well. If the parameter settings don't help, there is always robots.txt to let Google know you don't want certain strings crawled.
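A robots.txt sketch for that last option, assuming the "capacity" parameter from the original question (Google supports the * wildcard in robots.txt rules, though not all crawlers do):

```
# Ask Googlebot not to fetch any URL containing the capacity parameter,
# whether it is the first parameter or follows another one:
User-agent: Googlebot
Disallow: /*?capacity=
Disallow: /*&capacity=
```

Keep in mind robots.txt blocks crawling, not indexing: a blocked URL can still appear in results via external links, and Googlebot will no longer be able to see any canonical tag on it.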

Just be sure that meta robots tags, canonical tags, sitemaps, and robots.txt all tell the same story. Not always easy.

Nosuchthing

8:38 pm on Jan 7, 2021 (gmt 0)

5+ Year Member Top Contributors Of The Month



@not2easy Thanks for these additional comments, I'll make sure to check everything tomorrow.

The website is more or less a comparison website for paint.

The product pages themselves do not change much when selecting a different color or size for example.

They do change quite a bit when it comes to the broader landing pages where you can select brands, types, etc.

But let's say you were looking for the brand x of wall paint:

We don't want you to land on example.com/wall-paint?brand=x

Rather, we want the page example.com/wall-paint/x

Both the product page and the selector page present different facets of a similar product in the end. Do you think this is a case where using canonical is necessary?

not2easy

9:15 pm on Jan 7, 2021 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



Related content isn't the same as duplicate content. On the product description page, it sounds like the only difference would be the color. In that case, you would probably prefer people to land on the selector page, and prefer not to have a product page competing with itself via /duplicate-brown/, /duplicate-yellow/, /duplicate-tan/, and /duplicate-blue/ versions that differ only by color. The pages that differ only in minor details should have the rel=canonical tag pointing to the page they are linked from. That lets Google 'see' the color pages and understand that they expand the selector page, but that the selector page is where you want traffic to land.

That answer may differ if the relationship differs significantly from my understanding of it, which was based on your question.