|Crawl allocation and duplicate content|
I'm working on a big brand that has hundreds of thousands of URL combinations and a duplicate content problem. I want to noindex or canonical a bunch of the duplicate content, but my concern is that much of the existing duplicated content doesn't appear to have been crawled.
Looks like we're running into a crawl allocation wall.
My question is: whether I noindex a bunch of these URL combinations or canonical them, Googlebot still has to crawl them, so it's using up the allocation either way. I can't block them in robots.txt; it won't work with our structure.
BTW, I'm leaning toward noindex rather than canonical, as these pages aren't true canonicals (they're filtered pages).
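For clarity, by noindex I mean the standard robots meta tag, with "follow" so link equity still passes through the filtered pages. A minimal example (this is the generic tag, not our actual markup):

```html
<!-- On each filtered page: drop it from the index but keep following links -->
<meta name="robots" content="noindex, follow" />
```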
Our overall rankings are good, but I believe they could be better, and I am hoping that fixing this might improve site quality and give us a few points with Panda.
What do you guys recommend? What has worked for you?
BTW, I have the always lovely "Googlebot found an extremely high number of URLs on your site" message as well.
Are these "URL combinations" variations of the same set of URLs?
Kind of sounds like a hellish architecture.
As these are filtered pages, is it really not possible to block these duplicates in robots.txt by listing some combination of the URL parameters that the filter uses?
Could you share a few sample URLs?
So if you have a set of widgets, you can have a URL for blue widgets, red widgets, red and blue widgets, red and blue and yellow widgets, yellow and blue widgets, red and blue widgets in small, black and red widget in large, large widgets, blue large widgets, blue small and large widgets, etc. x 1,000,000,000,000 :)
I can't block them in robots.txt because the filter combinations create unique URLs with nothing distinct I can grab hold of.
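Just to illustrate the scale of the problem (the filter names here are made up, but the math is the point): with n independent filter values, every non-empty subset can become its own URL, so the URL count grows as 2^n before you even add sizes or sort orders.

```python
from itertools import combinations

# Hypothetical filter values; the real site has far more.
colors = ["red", "blue", "yellow", "black", "green"]

# Every non-empty subset of filter values produces its own URL.
urls = []
for r in range(1, len(colors) + 1):
    for combo in combinations(colors, r):
        urls.append("/widgets?color=" + ",".join(combo))

print(len(urls))  # 2^5 - 1 = 31 URLs from just 5 colors
```

Five colors already give 31 crawlable URLs; add a size facet and a sort order and you multiply that again.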
I'm a bit stuck... I'm afraid noindex won't help with the crawl allocation, though it should help with any negative Panda signals?
Is it correct to assume that there is maybe a checkbox selector that the users can check off as many - or as few - types of widgets as they please and then view them on some sort of a results page?
and if so, is that the ONLY way that a user could view the widgets? Or is there any sort of a "hard coded" page that would display, say, all the red widgets, or all the blue widgets?
Really, an example of the URL structure of one of these pages would be a big help.
|afraid a noindex wouldn't help with the crawl allocation |
I track all Googlebot activity on my site. On my noindexed pages, Google slows down its crawls over time. Once it picks up the noindex tag and removes the page from the index, it starts to spider that page less and less frequently (once a day, then once a week, then once a month, etc.). It will still eat some of the crawl budget, but Google seems to be good at reducing how much effort it puts into those pages. It's probably also why, once you noindex a page, it can be a long time before you can get it reindexed. That's been my experience anyway. I just noindexed 65K pages and am seeing the crawl of those pages reduce quickly once the tags are picked up.
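If you want to track this yourself, a rough sketch of the kind of log analysis I mean (the sample log lines, the /widgets prefix, and the function name are all hypothetical; field positions will vary with your server's log format):

```python
import re
from collections import Counter

def googlebot_hits_per_day(lines, prefix):
    """Count Googlebot requests per day to URLs under a given prefix,
    from combined-format access log lines."""
    hits = Counter()
    for line in lines:
        if "Googlebot" not in line:
            continue
        # Pull the date out of the [16/Jul/2013:10:00:00 +0000] field
        m = re.search(r'\[(\d{2}/\w{3}/\d{4})', line)
        try:
            # The request field looks like "GET /path HTTP/1.1"
            path = line.split('"')[1].split()[1]
        except IndexError:
            continue
        if m and path.startswith(prefix):
            hits[m.group(1)] += 1
    return hits

# Made-up sample lines: one Googlebot hit on a filtered URL, one
# Googlebot hit elsewhere, one non-Googlebot hit on a filtered URL.
sample = [
    '66.249.66.1 - - [16/Jul/2013:10:00:00 +0000] "GET /widgets?color=red HTTP/1.1" 200 512 "-" "Mozilla/5.0 (compatible; Googlebot/2.1)"',
    '66.249.66.1 - - [17/Jul/2013:10:00:00 +0000] "GET /about HTTP/1.1" 200 512 "-" "Mozilla/5.0 (compatible; Googlebot/2.1)"',
    '10.0.0.5 - - [16/Jul/2013:10:05:00 +0000] "GET /widgets?color=red HTTP/1.1" 200 512 "-" "Mozilla/5.0"',
]
print(googlebot_hits_per_day(sample, "/widgets"))
```

Run that over logs day by day and you can watch the crawl rate on the noindexed section taper off after the tags are picked up.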
Planet13 - yep, the checkbox is how it's done. Nothing is hard coded, it's all dynamic.
Here's an example of the URL structure:
getcooking - good point, I didn't think of that...
Are these URLs likely to be bookmarked by visitors?
Is this problem likely to get worse over time (more and more URLs)?
Just a thought.. you could bite the bullet and:
- If possible, change the script to insert a "dummy folder" before you start adding your filters, or (if easier) add a predictable query string at the end of the URLs that show filtered results
- block the URLs in robots.txt based on the pattern that now repeats in every filter URL
- return 410 Gone for all requests that come in with the old URL format, without the new pattern (i.e. the existing filter URLs)
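To make that concrete (the folder name and parameter name below are hypothetical, purely to show the pattern), the robots.txt side could look something like:

```
# Option 1: all filter URLs now live under a dummy folder
User-agent: *
Disallow: /filter/

# Option 2: all filter URLs carry a marker query parameter
User-agent: *
Disallow: /*?*filtered=
```

Major search engines support the `*` wildcard in Disallow rules, which is what makes the single predictable marker so useful: one rule covers every filter combination, no matter how many there are.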
aakk9999 - thanks. That's the best solution, but the problem is that we won't get the engineering resources required to put it in place. I need an "easier" solution....
Any thoughts? Is noindex the best bet?
They won't be bookmarked by visitors.
Noindex with a canonical tag back to the unfiltered page? Why can't you do both?
That's not the intended use of the canonical, since these aren't true duplicates... wouldn't noindex be the better alternative?
I use both on category pages that have filtered results and haven't had any issues, although maybe my setup is different.
It's still about "red widgets", but the filtered page is about "red widgets over 6 inches tall", and the filtered page contains some of the elements that are on the unfiltered page.
At the very least I would assume noindexing should take care of your issues, especially since you seem to indicate this isn't a Panda problem, given that rankings are doing OK?
Also, I could have sworn I read about this very issue from someone at Google. Looking for the reference now. It was about how to handle pagination and filtered results.
This way of using canonical is usually used to collapse query string filters into a canonical page, but I think it should do fine here as well.
As with the query strings, though, just make sure you're not using it to collapse a page you actually want to be indexed.
|Noindex with a canonical tag back to the unfiltered page? Why can't you do both? |
Strikes me that noindex might in a way conflict with the canonical tag, particularly if, as described above, noindex causes the noindexed page (the one that would contain the canonical link) to be crawled less frequently over time.
I did some searching, and found that Google's John Mueller covers this question here, as Google's crawling process enters into it, and there is indeed a kind of conflict if you use both...
Canonical conflicts with noindex?
|... Before the rel=canonical link element was announced, using noindex robots meta tags was one way that webmasters were directing us towards canonicals, so this is certainly something we know and understand. However, with the coming of the rel=canonical link element, the optimal way of specifying a canonical (apart from using a 301 redirect to the preferred URL) is to only use the rel=canonical link element. |
One reason for this is that we sometimes find a non-canonical URL first. If this URL has a noindex robots meta tag, we might decide not to index anything until we crawl and index the canonical URL. Without the noindex robots meta tag (with the rel=canonical link element) we can start by indexing that URL and show it to users in search results. As soon as we crawl the canonical URL, we can change to the canonical URL instead. It's also much safer because you don't have to worry about serving different versions of the content depending on the exact URL :-).
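In other words, following John's advice, the filtered page would carry only the canonical link element and no noindex tag. A sketch of what the head of a filtered page might contain (the URLs here are hypothetical):

```html
<!-- On a filtered page such as /widgets?color=red (hypothetical URL) -->
<head>
  <link rel="canonical" href="http://www.example.com/widgets" />
  <!-- No noindex meta tag: Google can index this URL if it finds it
       first, then consolidate to the canonical once that is crawled -->
</head>
```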
There's a bit more in John's two posts in the discussion worth looking at.
There's also a Matt Cutts video on this page...
At about 7:04 into the video, Matt talks about different sort orders in ecommerce, but not much about filtering (I haven't watched the whole video in quite some time, so he might).
Much more to be said about keeping it simple... and also making sure that you're not filtering out significant amounts of information with your filters. Make sure that you are distinguishing between attributes (things like colors) and categories (things like brands that you want indexed on their own). So, the solutions we're discussing, IMO, probably wouldn't work for filtering brands... but I'd love to hear otherwise if they will.
That said, there is a video on the canonical and pagination by Maile Ohye, not quite the same thing as what's being discussed here, that I've linked to several times and you can find by site search... and it occurs to me that conceivably a paginated approach, with the canonical, might work as a way to both sort and then filter by brands... if someone wants to work that one out ;). (I haven't thought it through, but it could be a remote possibility.)
Also, there are other ways of filtering faceted searches, but all probably more complex than aakk9999's elegant suggestion of the dummy folder.
Thank you for posting that John Mueller explanation.