legit duplicate content

how bad is this really?


sprock

12:30 pm on Apr 6, 2007 (gmt 0)

10+ Year Member



I just recently became aware of Google's duplicate content penalty.
The fact is that my website has most of its pages duplicated: one instance of each page is in its corresponding section, like:
scheme://site/section/subsection/item

And then, another instance accessible through a keyword, like:
scheme://site/keyword

The site gets some type-in traffic from the keywords, and I just like the concept, no matter what Google thinks, so I'm going to keep it anyway; I don't work for Google and I'm doing fine. But since Google owns internet traffic nowadays, I'm just wondering: how bad is this?

Robert Charlton

9:51 pm on Apr 7, 2007 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



sprock - Welcome to WebmasterWorld.

One instance of each page is in its corresponding section, like:
scheme://site/section/subsection/item

And then, another instance accessible through a keyword, like:
scheme://site/keyword

You've done what I caution all developers I work with not to do: never have a page referenced under multiple URLs.

You might block one of the variants of the page from spidering, using the robots meta tag, but even then you'd likely have problems. You wouldn't be able to control which page received external links, and that could end up costing you inbound linking credits, particularly if you do this frequently on your site, as it sounds like you do.
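
For illustration only (a rough, untested sketch; the keyword list and URL layout are invented, not sprock's actual setup), a PHP page could emit that meta tag just on the duplicate "keyword" variant:

// Emit a robots meta tag only on the duplicate "keyword" variant,
// so that just the section URL gets indexed
$keywords = array("widgets", "gadgets"); // illustrative list only
$segment = trim(parse_url($_SERVER['REQUEST_URI'], PHP_URL_PATH), "/");

if (in_array($segment, $keywords)) {
    // Inside <head>: keep this copy out of the index, but let
    // the spider follow its links
    echo '<meta name="robots" content="noindex,follow">';
}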

Q: What's worse than a page that isn't worth linking to?
A: A page that attracts a lot of links and is blocked from spidering. ;)

One additional thought regarding what you might be doing.... Are you manually generating these new URLs, or are you somehow generating them via site search and linking to them so they're indexed? If the latter, Matt Cutts has recently indicated that Google frowns upon that particular tactic.

Keniki

10:09 pm on Apr 7, 2007 (gmt 0)



I think it's important to remember that many CMS systems, especially those using mod_rewrite, may serve the same content several ways, and duplicate content penalties may result. Defaults in the Apache setup may also cause canonicalisation issues; i.e., www.example.com/subdirectory and www.example.com/subdirectory/ would probably serve the same page on Apache unless configured not to.
Likewise, www.example.com/subdirectory// would show the same page as www.example.com/subdirectory/.

Could anyone call this genuine duplicate content? Of course not, but if links are thrown at both URLs, can it cause an issue? You bet!
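
For what it's worth, here is one rough way to collapse the slash variants in PHP (an untested sketch; example.com and the no-trailing-slash policy are just assumptions, and mod_rewrite in .htaccess could do the same job):

// 301 the duplicate-slash and trailing-slash variants onto one URL
// (query strings ignored for brevity)
$path = parse_url($_SERVER['REQUEST_URI'], PHP_URL_PATH);
$canonical = preg_replace('#/{2,}#', '/', $path); // "//" -> "/"
if ($canonical !== '/') {
    $canonical = rtrim($canonical, '/'); // arbitrary policy: no trailing slash
}
if ($canonical !== $path) {
    header('HTTP/1.1 301 Moved Permanently');
    header('Location: http://www.example.com' . $canonical);
    exit();
}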

Keniki

10:14 pm on Apr 7, 2007 (gmt 0)



A page that attracts a lot of links and is blocked from spidering. ;)

Interesting. So on this basis, I could take any website handling hijacking or canonical issues through robots.txt, and create loads of external links to the page kept out of the SERPs, to cause that site problems?

If true, it sounds like a spammers' paradise to me....

Robert Charlton

10:48 pm on Apr 7, 2007 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



A page that attracts a lot of links and is blocked from spidering.

Interesting. So on this basis, I could take any website handling hijacking or canonical issues through robots.txt, and create loads of external links to the page kept out of the SERPs, to cause that site problems?

If true, it sounds like a spammers' paradise to me....

Keniki - I was thinking of this more from the standpoint of "a link is a terrible thing to waste."

Re tons of external links causing SERP problems, that's not what I had in mind. I'm thinking that, at worst, under the scenario you suggest, the blocked page would rank as a URL-only listing. With the new anti-Googlebombing aspect of the algo, I don't know whether external links could drive up a page whose content wasn't indexed.

But if you don't want the page to be referenced... yes, using the meta robots tag to block a page is a better way to go than using robots.txt. (Note that using both will cause the meta robots tag not to be read: a page blocked by robots.txt is never fetched, so the spider never sees the tag.)

jd01

12:29 am on Apr 8, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Looking at the URLs, it appears the site is dynamic and uses mod_rewrite.

If it is written in PHP, adding:

// Sections and keyword equivalents, as "section,keyword" pairs
$sections = "section_1,section_1_keyword_equiv;section_2,section_2_keyword_equiv;section_3,section_3_keyword_equiv";

// Take the URI apart (REQUEST_URI begins with "/", so the first
// path segment is at index 1, not 0)
$redirect = explode("/", $_SERVER['REQUEST_URI']);

// Check to see if the URI should be redirected
// (strstr() returns the matched substring or FALSE, never TRUE,
// so test against FALSE)
if ($redirect[1] !== "" && strstr($sections, $redirect[1]) !== FALSE) {

    // If the section should be redirected, separate the section var
    // into section/keyword pairs
    $section_to_redirect = explode(";", $sections);

    // Loop through the sections
    foreach ($section_to_redirect as $find_keyword) {

        // Find the right section/keyword pair to redirect
        if (strstr($find_keyword, $redirect[1]) !== FALSE) {

            // Take the pair apart: [0] => section, [1] => keyword equiv.
            $send_to_page = explode(",", $find_keyword);

            // Send the 301 header and pass the visitor on
            header("HTTP/1.0 301 Moved Permanently");
            header("Location: http://www.example.com/" . $send_to_page[1]);
            exit();

        }
    }
}

…or something to that effect at the top of the page 'creating the content' might be a solution.

(It might need some 'edits', since it has not been tested, and you could probably use an array instead. If there is a huge set of section/keyword pairs there might be a more efficient solution, but it's the first idea that popped into my head, and it might give you some ideas on how to resolve the duplication issue without losing the inbound link weight.)

Hope this helps.
Justin

sprock

9:16 am on Apr 8, 2007 (gmt 0)

10+ Year Member



The fact is that I don't do SEO in any way. I just build a website and add valuable content to it as I please and as I think my visitors will like. I've never done anything specific to target Google; only visitors. I've never done anything specifically to please, displease or trick Google; I just ignored it... my understanding was that good search engines had to adapt to rank pages with good content highly, and not the other way around. So far that has worked great.

But the truth is that these days Google owns the traffic on the internet, so I guess my assumptions are not that realistic anymore... Anyway, I'm still not too convinced about optimizing for Google, but if there are specific penalties, I can try to avoid them.

In this case, the pages are real pages created by a human; they simply can be reached through two URLs: one in their section and one directly via a "keyword". It's not actually a technical glitch; it's a concept I like and an option I coded on purpose.

I can see this as a problem because incoming links get split between the two (which can be considered a "penalty" in itself), but I can live with that. Now I gather from what has been said that Google may apply an extra penalty, the so-called duplicate content penalty. But I don't know how bad that is.

Now, if I hide one of the sets of URLs with robots.txt, I lose the inbound links to that set. To do this I would have to assume the duplicate content penalty is really severe, since I'm giving up those links.

Another option, as jd01 explains, is to redirect one set to the other, which I think would be a cleaner solution. Actually, my PHP now acts more or less in the opposite way, showing the content of another page when a "keyword" in the DB is requested, so there's no problem reversing it. This way the incoming links to both would be preserved.

But for this I would also have to assume the penalty is fairly important, since it breaks the concept I like. Does anyone have a hint about how the page is penalized? Are both copies removed from the results? Is the site banned? Do the individual pages rank worse, and if so, a lot worse? Has anyone suffered from this duplicate content penalty in a similar situation? Note that I'm not referring to junk pages with tons of duplicated content, or things like that...

jd01

5:32 pm on Apr 8, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Hi sprock,

I think if you step back and look at the situation from a search-results perspective ("Where does a search engine send the person seeking information on a specific query, so they will return and use the search engine again?"), you might see things a little differently.

Whether two copies of the same information are actually a benefit to your visitors or not does not really matter, because, like it or not, search engines must make a determination based upon patterns.

The questions I would ask are:
How does a search engine determine which copy of the resources on your site is the most 'important' or 'informative' resource?

Should a search engine return both URLs so visitors can pick the URL they like the best, or should it pick one of your URLs by making an algorithmic (or heuristic) determination of which one people are most likely to enjoy/remember/use and 'suggest' it in the results, so people will think they are a 'good' search engine and return the next time they want to find a resource?

How does a search engine algorithmically (or heuristically) determine the difference in intent between someone like yourself, who has two copies of a resource at different locations because you believe them to be a benefit to your visitors, and the other webmaster who uploads a second (or third, or fourth) copy of the same resource at unique locations within a site (or domain, or subdomain(s)) to increase their page count, or the size of their site, or their 'inbound links' to a specific resource?

I'm not trying to sound harsh, just to point out some of the complexity of the 'pick a page, any page' decision a search engine must deal with, so please don't be offended. I think you are the exception, rather than the rule…

<added>
Sometimes SEO is about being 'clear' in what you have to offer visitors.
</added>

Justin

Have a great Easter everyone.

Keniki

8:48 pm on Apr 8, 2007 (gmt 0)



Does anyone have a hint about how the page is penalized?

I would expect one or both versions of the page to go into the supplemental index.

AlexK

10:32 pm on Apr 8, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



sprock:
To do this I would have to assume the duplicate content penalty is really severe

Google appear to have upped the ante recently. Two personal case histories, both extremely recent:

1 Test site doubling all pages

I use a 'test' site to check out proposed site changes before going live. Sometime last autumn, a single post of mine on the site forums mistakenly used the test address in a link rather than the live address. That single link was not noticed until a couple of weeks ago. All through March, Google referrals were just 22% of the early-January figure.

My host's DNS mis-configuration prevents me from removing the test site from the DNS. In mid-March, robots.txt was converted to exclude everyone, Google was asked to remove the test site from its index, and the test link was converted to a live link. On March 31, two things occurred simultaneously:

  1. test-accesses dried up
  2. live-accesses returned to January levels

2 Site transfer errors cause many URLs to show the same content

This is how to destroy your site in 2 days.

A site update was made live on April 5. Amongst other presentational changes, the site location was transferred from .com to .co.uk. Insufficient checking meant that the 301 transfer gave a wrong URI for an entire section of the site. A double-whammy error meant that this wrong URI did not throw a 404, but served the home-page content instead.

Saturday brought 10% of Friday's visitors, and, as far as I can tell, Sunday shows a 97% loss of traffic. The only thing occurring on the site is an eerie silence.

So, a site 9 years in development has, in 2 days, gone from 3,000+ visitors per day to 3 visitors in 4 hours. From the SE point of view, that is entirely due to duplicate content.

The site is ~100,000 pages. Googlebot got 3,063 x 301s on Apr 5/6 on the .com site, and followed up on 1,154 pages on the .co.uk site.

Comment:
I began like yourself. I wanted to make it as easy as possible for visitors to find the content that they wanted. My focus was on them, and I ignored the SEs.

Google has forced me to concentrate my attention on it, rather than my visitors. If I do not get the site right for Google, then nobody will ever see what I produce. That's why, quite some time ago now, I began to detest, then hate, Google. Not a totally sensible emotion, I guess, but thoroughly understandable.

jd01

12:30 am on Apr 9, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Can someone please explain to me how having two versions of the same content makes things easier for visitors?

Maybe I am missing something, but I would think that if you have links to two different URLs with exactly the same information, a visitor might become more confused than if you only use one well-formed URL which follows a 'site-standard' naming convention of some kind.

(Similar to Blog Sites, where I run into the same stinking page 3 times by clicking on different links to different URLs when what I am really looking for is more, different, or expansive information… It really just irritates me.)

Sorry if I am being naive, but I'm not sure I understand exactly how two pages with the same information make things easier for visitors…

Justin

(I do understand the 'test-site' scenario, just not two copies of the same information on a live site or sites.)

<added>
In some cases, when serving a large directory of some type, I can see listing four vendors for 'example product a' and the same four vendors for 'example product b' if they are separate and distinct products that happen to be provided by the same vendors. But if the site is structured correctly, there should be no reason to have both pages included in the index; or, by writing full, accurate descriptions of each product, the shared vendor name/address listings should not be enough to count as 'duplicate content', and both pages should be included without issue.

Further, if there is not enough information to write separate pages, then I cannot see why both products would not be listed on the same page. I can't personally think of a way to justify the exact same information at two different URLs to cut down on visitor confusion.

I understand that eliminating visitor confusion, or adding ease of use, was the intent, so again, maybe I am missing something?
</added>

Robert Charlton

1:06 am on Apr 9, 2007 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



Can someone please explain to me how having two versions of the same content makes things easier for visitors?

From the original post, I'm thinking that what sprock may be attached to is "breadcrumb" navigation, which might be tied into a folder setup that gives two different pathnames for the same page....

scheme://site/section/subsection/item

And then, another instance accessible through a keyword, like:
scheme://site/keyword

If this interface is what's driving the two-URL situation, it's actually not necessary that the navigation display and the directory setup be tied together.

I can see both of the following breadcrumb trails leading to the same site/products/electronics/widgets.html in the directory structure, even though they display differently...

scheme://site/products/electronics/widgets
scheme://site/widgets

I can understand why you'd want to preserve the breadcrumb trails for the visitor. I don't think that visitors generally pay much attention to the page URLs.
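
As an illustration (a minimal sketch with invented labels and paths, not sprock's actual code), the displayed trail can simply be data, with every crumb pointing at the one canonical URL:

// Two different breadcrumb displays; the "Widgets" crumb in both
// links to the same canonical URL, so no duplicate page is needed
$crumbs_long = array(
    'Products'    => '/products',
    'Electronics' => '/products/electronics',
    'Widgets'     => '/products/electronics/widgets',
);
$crumbs_short = array(
    'Widgets' => '/products/electronics/widgets',
);

function render_breadcrumb($crumbs) {
    $links = array();
    foreach ($crumbs as $label => $url) {
        $links[] = '<a href="' . $url . '">' . $label . '</a>';
    }
    return implode(' &gt; ', $links);
}

echo render_breadcrumb($crumbs_long);  // Products &gt; Electronics &gt; Widgets
echo render_breadcrumb($crumbs_short); // Widgets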

jd01

2:44 am on Apr 9, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Thanks!

I don't think that visitors generally pay much attention to the page urls.

I agree, and I keep thinking the intent was correct (it's why I thought I was missing something), but I keep coming back to: if you can get them to one URL, send them back to the same place, because the only difference is what shows in the address bar, and who really cares about the address bar anyway? (Except I like shorter URLs better; it always makes me pause when they scroll a mile to the right. =)

I really thought about this thread for a long time this AM before my first post today, because I tried to think of a way you would 'need' two (or more) URLs for the same information, and how a Search Engine should (could) deal with identical information on two different URLs, which (I believe, in this case) were actually intended to help visitors in some way.

I can see the point, but I can't see the application from the Search Engine end, so I wanted to try to understand what detriment there would be in redirecting one set.

I'm not suggesting changing the linking convention at all. For ease of coding, I link (breadcrumb) to quite a few pages that would be duplicates, then use a little mod_rewrite or PHP to ensure only one URL with the information opens, and the other(s) 301.

About the only other option is to 'noindex' or 'disallow' one set, and if you are going to do that, you have to choose one set of URLs to remove from the index, so why not just redirect that set?

Sorry, I'm starting to sound like a broken record.
I'll bow out of this one. Best of luck.

Justin

sprock

9:29 am on Apr 9, 2007 (gmt 0)

10+ Year Member



Well, my case is the one explained by Robert. Internal links actually use, with a few exceptions, the traditional "breadcrumb" structure (at the top of the page, the breadcrumb path is shown with a link on each step). But since the site can be considered a reference site for certain widgets uniquely identified by a keyword, I also set up the duplicated "keyword" URLs for easy type-in traffic, and made users aware of this.

It sounded like a good idea to me. But I can't really tell how popular the keywords are as type-ins, since I found that some pages are linked by Google under the breadcrumb address and some others under the keyword address (in an apparently random pick, but maybe related to inbound links). I have a high overall percentage of type-in traffic, though. Anyway, I have to agree that nobody pays attention to URLs.

So after reading about AlexK's experiences and the rest of this small and very helpful conversation (thanks, guys!), I'm going to follow Justin's suggestions and take Google into account as part of the user experience, helping its crawler a little by 301-redirecting the keyword URLs to their respective breadcrumb URLs.

jd01

1:09 pm on Apr 9, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



OK, one more:

I also set up the duplicated "keyword" URLs for easy type-in traffic, and made users aware of this.

You got me: I completely understand the above, even more than the breadcrumbs.

If I am reading correctly:
1. You have a script which generates all pages.
2. Certain widget pages can be used as references.
3. It's easier to type in a keyword for the references, so you let visitors know they can…

It sounds like a good idea to me also, and if I did not make it a point to try and pay attention to search engines, I might have done the same thing for the same reason.

Justin

Robert Charlton

7:05 pm on Apr 9, 2007 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



But since the site can be considered a reference site for certain widgets uniquely identified by a keyword, I also set up the duplicated "keyword" URLs for easy type-in traffic, and made users aware of this.

Assuming the instructions are clear enough that users always type in the correct URLs, I'd handle this entirely by redirects....

I wouldn't have links within the site to the alternative URLs at all. Just set up 301s in the .htaccess that redirect these short type-ins (within your site) to the actual longer URLs. This would also consolidate the "link love" that has thus far gone to the dupe pages.

I'd think, btw, that you could also create extra redirects for anticipated typos or variants... e.g., in addition to the "widgets.html" type-in, you might also redirect a "widget.html" and maybe a "widjets.html."
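
Since the site is script-generated, the same idea could also be sketched in PHP instead of .htaccess (untested; all paths here are invented examples, not sprock's actual URLs):

// Map short type-ins, including likely typos, straight onto the
// one canonical URL with a 301
$typeins = array(
    '/widgets' => '/products/electronics/widgets',
    '/widget'  => '/products/electronics/widgets', // singular typo
    '/widjets' => '/products/electronics/widgets', // misspelling
);

$path = parse_url($_SERVER['REQUEST_URI'], PHP_URL_PATH);
if (isset($typeins[$path])) {
    header('HTTP/1.1 301 Moved Permanently');
    header('Location: http://www.example.com' . $typeins[$path]);
    exit();
}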

I would remove all the current dupe pages, though, as well as all nav links within the site to these alternative URLs. In addition, set your .htaccess to return a 410 Gone [webmasterworld.com] response for the removed pages.

I'm not a programmer or a server expert by any means, but I have it on pretty good authority that it's not good practice to keep live links within a site to redirected URLs.

Robert Charlton

7:12 pm on Apr 9, 2007 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



PS...

I'm not suggesting changing the linking convention at all. For ease of coding, I link (breadcrumb) to quite a few pages that would be duplicates, then use a little mod_rewrite or PHP to ensure only one URL with the information opens, and the other(s) 301.

Based on advice I've received, I'm thinking it would be best to change the linking convention first, then also apply the rewrites.

jd01

7:44 pm on Apr 9, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Interesting, regarding the information you have indicating that you should not keep live links running through a redirect.

In addition, set your .htaccess to return a 410 Gone response for the removed pages.

You can't return a 410 for a page which is 301-redirected; it has to be one or the other. By setting the 410 rather than the 301, you defeat the purpose of the redirect, which is to consolidate the links/pages. The redirect makes the 410 unnecessary.

Justin

twicker

8:04 pm on Apr 9, 2007 (gmt 0)

10+ Year Member



I have it on pretty good authority that it's not good practice to keep live links within a site to redirected URLs.

Oh, wow! I'm beginning to be amazed at how pesky these search engines really are, hehehe... Good thing we have this forum to learn from!

I also like the idea of redirecting similar type-in URLs... thanks, Robert.

BTW, jd01, indeed everything is script-generated, and the way I actually make people aware of the existence of the shortcut URL for a widget reference is when somebody uses the integrated search box to search for exactly one of the keywords (which accounts for most searches on the site, and approx. 10% of visits). I assume Google's bots do not perform searches (do I assume too much?), so they should not see any of the "keyword" links, at least within the website.

I will keep this hint that suggests the keyword URL in search results, since those URLs are now 301-redirected. Or do you think this may affect the quality of possible external links? I mean, is a link to a 301'd page worth the same to Google as a link to a regular page?

jd01

8:16 pm on Apr 9, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I assume Google's bots do not perform searches (do I assume too much?)

No, you are not assuming too much.
They do not complete forms on your site; they leave those for the e-mail spammers to play with. =)

I mean, is a link to a 301'd page worth the same to Google as a link to a regular page?

As far as I know, yes, they count the same through a *single* redirect, but will be lost through 'stacked' (2 or more) redirects.
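
One way to guard against stacking (just a sketch; the map entries are invented): if the redirects live in a map like the earlier examples, flatten any chains so each entry points straight at its final URL:

// Flatten chained entries so every redirect is a single hop
$map = array(
    '/old-widgets' => '/widgets',                      // chains...
    '/widgets'     => '/products/electronics/widgets', // ...into this
);

foreach ($map as $from => $to) {
    $hops = 0;
    while (isset($map[$to]) && $hops++ < 10) { // hop limit guards against loops
        $to = $map[$to];
    }
    $map[$from] = $to;
}
// $map['/old-widgets'] now points directly at '/products/electronics/widgets'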

Justin

<added>
Sorry twicker, forgot:
Welcome to WebmasterWorld!
</added>

Robert Charlton

9:01 pm on Apr 9, 2007 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



You can't return a 410 for a page which is 301-redirected; it has to be one or the other.

Yes, of course. In this case, it should be the 301 only.

AlexK

1:46 am on Apr 10, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I need to correct a wrong statement in my earlier message [webmasterworld.com]:

This is how to destroy your site in 2 days

My host's three name-servers all suffered 'system failure' at the start of the Easter break. Even though my own NS is marked as the authority on my own sites, there is a mis-configuration somewhere, and the DNS did not fail over to my own server. Hence, my site(s) fell off the net due to these DNS failures.

I would note the following:

  1. There will still be a penalty for some days/weeks/months due to the error made on my part that was active for 40-odd hours.
  2. It is telling that I was quite prepared to believe that Google would act in that way. Google now enforce practices that are not part of any RFC, with neither redress, nor challenge, nor authority.

AlexK

2:27 am on Apr 10, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



jd01:
Can someone please explain to me how having two versions of the same content makes things easier for visitors?

1 Some years back I introduced a compression module into my PHP site. Although the algorithm attempted to take known buggy browsers into account, I wanted to have a backstop available, so I put a "nocompress=1" parameter in place which would auto-propagate into all URLs if used (no longer used, for obvious reasons).

2 The electronic hardware widgets that my site deals in are often manufactured on an OEM basis (very common, yes?). Thus, the identical widget can often be found under a variety of retail names, and from different retailers or manufacturers. My site attempts to locate and store the info for each same-widget in just one place. That place becomes a repository for info and drivers from many places.

For the sake of easy discovery by, and confidence from, searchers, the site structure is first-level by retailer-manufacturer, since that is how the end-user knows it. Thus:

    www.my-site.co.uk/retailer/widget.html

If instead it is
    www.my-site.co.uk/oem/widget.html
...the user says "that's not my widget".

The point here is that the above is a site designed for users, and not for Google (which is what G righteously says you should do). Doing that, however, brings a penalty from Google, so the webmaster ends up designing the site for Google: the exact opposite of what should be the case.

I've attempted to produce a site design which works for both, but the point remains - I spend vast amounts of my time designing the site for the demands of Google, and not for the convenience of my users. Which is the precise opposite of my desire.

jd01

3:36 am on Apr 11, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



1 Some years back I introduced a compression module into my PHP site. Although the algorithm attempted to take known buggy browsers into account, I wanted to have a backstop available, so I put a "nocompress=1" parameter in place which would auto-propagate into all URLs if used (no longer used, for obvious reasons).

Hmmmm... Maybe, but I would think either PHP or Apache could detect the known browsers and respond accordingly, which would mean there is only one set of URLs for every user-agent.

(To me the above would equal one set of URLs, much the same as a Flash site is often also available in HTML format. I'm sure I wasn't quite clear above: I meant two sets of "working", "linked", "accessible to all users" *identical* pages at unique URLs. IOW, *exact* duplicate pages; obviously compressed !== not-compressed, but access could still be (or have been) regulated so not all UAs see both sets.)
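
A sketch of what I mean (untested; the buggy-browser check is an invented example, not AlexK's actual list): decide about compression per request on the server, so every user-agent shares the single set of URLs:

// Choose compression per request instead of via a URL parameter
$ua = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';
$buggy = (strpos($ua, 'MSIE 4') !== false); // illustrative entry only

if ($buggy) {
    ob_start();               // plain buffering, no compression
} else {
    ob_start('ob_gzhandler'); // honours Accept-Encoding by itself
}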

For the sake of easy discovery by, and confidence from, searchers, the site structure is first-level by retailer-manufacturer, since that is how the end-user knows it.

I don't understand the second quite as well.

I would think /oem/retailer/widget.html would be just as appropriate, and would possibly even instill more confidence in searchers, who would know not only that a retailer has their part, but that it comes from the Original Equipment Manufacturer, which I believe is the correct meaning of the acronym.

Also, why would an OEM set of pages be necessary if visitors only use, need, remember /retailer?
(Rhetorical)

Justin

AlexK

4:05 am on Apr 11, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



jd01:
Also, why would an OEM set of pages also be necessary if visitors only use, need, remember /retailer?

AlexK:
My site attempts to locate and store the info for each same-widget in just one place. That place becomes a repository for info and drivers from many places.

That is the value for users - one set of info/drivers from one retailer (often) works with the same widget from other retailers. Users come with a retail name, but it is often the OEM info that they need.

Anyway - move away from that detail, because the fundamental point remains the same:

I spend vast amounts of my time designing the site for the demands of Google, and not for the convenience of my users.

Please note, I'm not saying that Google's demands are not often useful. In the same way that I am most interested in your views on good organisation of the info--because that ultimately helps me to give a better service for the users of my site--so am I most interested in Google's views on good organisation of info generally. The enormous difference is that if I ignore you, users of my site may suffer delays in finding what they want. If I ignore Google, they will never find my site, period.

Google's reply is (probably) that 'their demands' and 'the convenience of my users' are one and the same. My response is, "who instituted you as the Internet law-makers, policemen and court?"