Forum Moderators: Robert Charlton & goodroi

Duplicate Content - Search vs. Brand

sftriman

11:02 pm on Mar 14, 2012 (gmt 0)

10+ Year Member



I have a long-established site with about 28,000 pages in the Google index. Of those, 7,000 are my search results pages. I also have about 1,000 brand pages, and the problem is that some of the search results page queries are essentially identical to the brand page name. What to do?

I always thought of search as a dynamic activity that is user-driven. I never did anything in the way of canonical or noindex or robots.txt, etc. But now, I think Google must see:

search for brand Gizmo (perhaps with another keyword sometimes)
brand page for brand Gizmo

and, though the pages are different in many ways, by and large the list of Gizmo products on the pages is about the same.

So what's the best way to handle it? I don't even know if I'm being Pandalized for this scenario, but my site is sure down 80% from 2 years ago, so something is wrong!

Of the 7,000 search pages, I could selectively add a googlebot noindex meta tag to the 1,000 that overlap with my brand pages. That's my thought right now. But what if my search results pages are seen as "better" and rank better in Google? Should I then noindex the brand pages instead? What if it's 500 of one, 500 of the other - do I selectively mix and match which pages to keep and which to noindex?
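The tag I'm picturing in the head of those overlapping search pages would be something like this (scoped to googlebot rather than all robots, since Google is where my problem is):

    <meta name="googlebot" content="noindex">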

Is there another solution?

tedster

4:16 am on Mar 15, 2012 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



The standard recommendation is not to allow site search results to be indexed. You should be able to find a disallow rule you can use in robots.txt to do that.

g1smd

7:58 am on Mar 15, 2012 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



My next project faces this same unnerving scenario, along with a host of other duplicate content issues.

It also has a mix of badly formed friendly URLs and URLs with a ton of parameters, in several variations of mix and order. It's currently only at the planning stage: migrating to a new, more logical friendly URL format, coupled with redirects from all the old URLs. Biggest spreadsheet and headache ever.

sftriman

5:19 pm on Mar 15, 2012 (gmt 0)

10+ Year Member



OK, so I'm going to noindex or 301 my search results. The question now is: how exactly do I handle it?

Let's say the pair of pages is search.php?q=gizmo and /brand/gizmo.html, for example. I could 301 redirect the search to the flat page, but for some reason that strikes me as odd: someone has typed a search for "gizmo" into the search box, and they end up on a flat, directory-type page. Not to mention, the display is totally different: the gizmo brand page has a writeup on gizmos and other elements, while the search results page has drilldown by category, display options, and so on. Basically, they are different pages. I'd say the brand page is more SEO friendly, as it should be, while the search results page is clearly aimed at helping the user refine his/her search.

If I noindex the search page, then it simply disappears. Net net, I end up with 21,000 pages indexed instead of 28,000. I'd rather somehow pass the search page's juice to the brand page.
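(One thing from my earlier list I keep coming back to is rel=canonical - something like this in the head of the search page, with example.com standing in for my real domain, though I don't know how Google treats a canonical between two pages this different:

    <link rel="canonical" href="http://www.example.com/brand/gizmo.html">

That would at least be an alternative to redirecting.)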

One thought is to 301 only for Googlebot. I mean, in the end, that's what I want: Google learns that my search page really should be my brand page, while my users get the usual expected search results. Would it be wise to handle Googlebot differently from everyone else in this case?
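Purely as a sketch of what I mean, something like this at the top of search.php (I realize serving Googlebot something different may count as cloaking, which is partly why I'm asking; the brand/{$q}.html mapping is also an assumption about how my queries line up with brand pages):

    <?php
    // Hypothetical sketch, not production code: send Googlebot (and only
    // Googlebot) a 301 from the search URL to the matching brand page,
    // while ordinary visitors still get the normal search results.
    $q  = isset($_GET['q']) ? strtolower(trim($_GET['q'])) : '';
    $ua = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';

    if ($q !== '' && stripos($ua, 'Googlebot') !== false
            && file_exists("brand/{$q}.html")) {
        header('HTTP/1.1 301 Moved Permanently');
        header('Location: /brand/' . rawurlencode($q) . '.html');
        exit;
    }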

tedster

6:41 pm on Mar 15, 2012 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I don't think you need to worry about noindex, preserving link juice, or 301 redirects here. I think what you need is a robots.txt disallow rule for the search.php pattern. It's a very direct approach, and it refocuses googlebot's crawl budget on your important pages.
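In your case that's a one-line rule. Disallow matches by URL prefix, so this also blocks every ?q= variation:

    User-agent: *
    Disallow: /search.php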

What matters here is your total search traffic, not the number of URLs in the index. You may find you need to work on the internal linking and structure of your site after removing the site search results from Google's index - and if so, that's all for the best.

I would much rather have my visitors browse the site than use Site Search anyway. That way they can get familiar with more of what the site offers them.

SM_Commerce

9:11 am on Aug 1, 2012 (gmt 0)

10+ Year Member



I feel I should reply on this, as I don't completely agree with tedster.

I have the same scenario: lots of duplicate content/titles because bots are indexing search pages.

Now, setting up the robots.txt block WILL stop bots crawling your search pages, but will NOT stop those pages from appearing in the index. If an external site links to one of your internal SERPs, Google can still accumulate PageRank for that URL and possibly rank the page - what the page can't do, and this is the important factor, is pass any of that PageRank on, because the bot never crawls it.

By using the robots method you're effectively going to suck in any PageRank and lose it completely.

My preferred method (and I'm not saying it's perfect) is to NOINDEX, FOLLOW the page, so that the bot can still pass PageRank around.
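On the search templates, that's simply:

    <meta name="robots" content="noindex, follow">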

The downside, of course, as mentioned, is that you don't preserve crawl budget - but your site isn't that huge at 28k pages, in comparison to mid- and large-scale sites.

You can always set crawl priorities in your sitemap, giving a lower priority to your search pages. At least that way you're passing link juice along, and you can guarantee those pages do not appear in the SERPs or flag up a duplicate content warning.
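For example (hypothetical URLs; priority runs from 0.0 to 1.0 and only hints at relative importance within your own site):

    <?xml version="1.0" encoding="UTF-8"?>
    <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
      <url>
        <loc>http://www.example.com/brand/gizmo.html</loc>
        <priority>0.8</priority>
      </url>
      <url>
        <loc>http://www.example.com/search.php?q=gizmo</loc>
        <priority>0.1</priority>
      </url>
    </urlset>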

SM_Commerce

9:13 am on Aug 1, 2012 (gmt 0)

10+ Year Member



Also, I've found this article from Aaron useful in the past: [tools.seobook.com...]

tedster

12:39 pm on Aug 1, 2012 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Welcome to the forums, SM_Commerce. We could probably have a lively debate on this topic. I do appreciate your point of view - it was mine until recent times!

There is an underlying assumption here, however - that PageRank distribution still works the way it did in the original academic paper. It seems clear to me that it no longer does. The "reasonable surfer" model that Google now uses has, by itself, introduced a lot of complexity to the PageRank picture. In addition, PageRank is not nearly as big a part of the ranking picture as it used to be.

My point of view is based on experience with sites that stopped Google from crawling search results via robots.txt. Their search traffic increased - and search traffic, not PageRank, is the goal. Their regular pages were crawled more frequently, too.

I also agree that noindex is the easier way to remove site search pages that are already indexed. I have used that approach myself, introducing the robots.txt rule only later, after the number of indexed search pages dwindled - and that was successful. And if you do get a couple of site search results stuck in the SERPs, URL removal is still an option once the robots.txt rule is in place.

The bottom line for me: it depends on the individual site's situation. For most sites, I do think robots.txt is the right approach, especially if it's in place from launch. Other factors to consider: how much search traffic is already coming in, and how rapidly new content pages are being crawled, indexed, and ranked.

SM_Commerce

1:26 pm on Aug 1, 2012 (gmt 0)

10+ Year Member



Thanks for the response and feedback, tedster.

As an avid reader of WebmasterWorld, I felt it was time to finally join in after soaking up lots of good advice!

Back on topic: I agree a good approach would be NOINDEX followed by the robots.txt block - in our own case we don't have any backlinks into our internal SERPs anyway, so this would probably work fine for us.

In any case, you can remove individual pages via GWT if a few do actually pop up, so there's no big harm there.

In sftriman's case, though, I think NOINDEX is the best option, along with monitoring his traffic and any key movement in rankings (as long as it's a trend, not an isolated data point, given the various factors that affect rank for individual queries).

Then, based on that - and, as you say, to preserve crawl budget - you can introduce the robots.txt solution and keep measuring traffic and rankings over time. At least this way you can measure the impact of the robots.txt change after some time AND make sure your first priority, getting internal SERPs out of the external SERPs, is completed.

One other thing I'll throw out here - how will Google react to a large number of pages suddenly being de-indexed? Is this a negative signal? In sftriman's case, losing 7k pages is fairly significant, especially in one single sweep - would you phase this, or do you not think it matters? I've heard SEO specialists on both sides of the fence. I've personally not seen evidence that it's detrimental, but it's something worth considering.

netmeg

2:41 pm on Aug 1, 2012 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



If I am working with a site that already has search results pages in the index, I noindex them.

If I am working with a new site (or migration to a new platform) I use robots.txt to try to keep them out going forward.