How to deal with user-generated *partially* duplicate content?

2:04 am on Jul 26, 2012 (gmt 0)

Here is another instance of "Bing gets it right, why can't Google?"

I run a helpdesk-like system where users can submit questions about a certain type of device and get good answers, both researched and from experience, provided by yours truly and other people. The site overall is moderately well ranked for related terms, but it isn't the only way people can get help with these devices on the Net. The most obvious alternative: they can ask around on forums.

Some of those are extremely well-established and old forums that deserve to rank higher than my site. However, what happens is: if a person asks the exact same question (which then becomes the title of the page on all sites where he asked) on a forum and then on my site (I think sequence matters - forum first, my site second) - my site's page never comes up for this particular question, despite the fact that beyond the first post and the title, everything else on the page is different.

I guess it should be open for debate which of them is better but the end result is that my page never sees the light of day even though both pages deserve to be listed because they both have serious and useful (and different!) info.

Needless to say, Bing lists both right next to each other, although the old forum's page, despite having fewer responses and less info, is above mine - they are #1, I'm #2. No question about that - the old forum has an enormous number of IBLs and other site quality signals, regardless of what the particular page says.

Google, on the other hand, lists the old forum's page at #1 and a *less relevant* page from my site at #2, and only because my less relevant page has a link to the correct page with the title of that question as the anchor text.

Anyhow, guys, sorry for the long-winded question, but it is not the first occurrence and I'm afraid this may become a common situation: a person asks a question in a forum. Gets few or unsatisfactory answers. Then (in a couple of days) copies and pastes the title and the actual question into my helpdesk hoping to get a better answer.

He'll get the answer, but now everyone else searching for the same question will only see the forum's page, the one with less info on it. My page will never come up for that search because there's a page out there that is two days older, with the same title and the same first few paragraphs of content.

This is not a question of indexing: both sites get their new pages indexed in minutes. It is probably more of a question regarding Google's approach to identifying the original source of content, and then sticking to their guns despite the fact that most of that content is different except for the title and first couple paragraphs. As they re-crawl, the pages get more and more different but their definition of original source seems to be set in stone?
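
To put a rough number on "partially duplicate", one could compare the two pages with word shingles and Jaccard similarity. This is only a sketch: the shingle size `k`, the function names and the sample texts are assumptions made up for illustration, and nothing here is a claim about how Google actually decides which page is the original.

```python
import re

def shingles(text, k=3):
    """Return the set of k-word shingles in text (lowercased, punctuation ignored)."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    if len(words) < k:
        # Too short for a full shingle: treat the whole text as one shingle.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(a, b, k=3):
    """Jaccard similarity of two texts: shared shingles / all distinct shingles."""
    sa, sb = shingles(a, k), shingles(b, k)
    if not sa and not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)

# Hypothetical sample data: same title and question, different answers.
question = "Need help with a broken XX-XX-01. It stopped powering on yesterday."
forum_page = question + " Have you tried a different power cable?"
helpdesk_page = question + " Open the back cover and check the fuse near the switch first."

print(f"similarity: {jaccard(forum_page, helpdesk_page):.2f}")
```

A score near 1.0 means the pages are almost the same text; the more unique answers a page accumulates below the copied question, the lower the score falls.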

Any creative ideas about dealing with this kind of user-generated duplication? Seems detrimental even when duplication is only partial.
4:15 am on Jul 26, 2012 (gmt 0)

User Generated Content is a pain in the ass. I've been dealing with it for 3 years now and there is no way to stop it.
7:43 am on Jul 26, 2012 (gmt 0)

Only Google has the right to make money from UGC without any responsibility for users' actions. Others had to strictly moderate their content...
1:59 pm on Jul 26, 2012 (gmt 0)

Others had to strictly moderate their content...
Do you mean it as in editing the UGC to conform to a certain (which?) rule, or simply not allowing a question that has already been asked elsewhere? The latter would seem a really bad way to run a helpdesk. On the other hand, if no one else ever sees the results of my work, I might as well have turned the person down and not wasted my time...

Anyhow, does anyone actually resort to editing UGC to make it look, well, unique (uniquer? :) )
3:19 pm on Jul 26, 2012 (gmt 0)

Copy-pasting text from another site into the forum is definitely one of the things I hate... Most of these forum users answer questions by copy-pasting stuff from elsewhere...

But I am assuming that in your case, it is only the questions that are identical but the answers come from you and they are supposedly unique as you claim.

How about other forum users? Are you sure they don't copy-paste answers from another site?
8:10 pm on Jul 26, 2012 (gmt 0)

Have you tried adding a link from your page on the topic to the one that ranks highest? Not only might it be useful for the user but also the fact it's ranking first makes it an authority. Can't do you any harm and would potentially also be very useful for your viewers as they get several opinions.

Obviously there is an argument that you are sending traffic to a competitor but firstly, they will probably find it anyway if it's ranking first and secondly, doing this actually means your site potentially offers more options.
8:43 pm on Jul 26, 2012 (gmt 0)

Quotes, snippets, and citations are the downfall of Panda. What Panda should do is allow for linked citations and disregard copied text. Instead, Panda slams your website for collating and commenting on others' content. Apparently the whole academic model of stylized writing was forgotten the moment Google engineers got their degrees.

Anyhow, the answer seems to be parse and replace for non-registered and bots, till Google gets its groove back. (Oops, did I write that out loud?)
1:57 pm on Jul 27, 2012 (gmt 0)

Thank you for your replies, guys. Some interesting ideas here.

As far as uniqueness of the page on my site (except for title and the question) - I am pretty sure about it because in this particular example I was the only one answering and none of the answers were copied from anywhere. When other users are also answering, indyank is right, I cannot be 100% sure the answers are not also copy/pasted just like the question was. However, it stands to reason that even in that case the answers would be copied from different places on the Net and the resulting page on my site would still be different from the one that Google chose to show instead of mine.

@Simsi: I could have linked to the other page, but the funny thing is: the person asking the question has completely switched to using my site and he has posted more information about his issues and even posted some pictures that were not available to the people answering him on the other page. So, other than "Google gaming" there really is no reason to link to that other page: mine has more info both on the asking and on the answering side. That's what irks me here: even the original poster has completely switched to using my page (I don't believe he's even been back to that other forum page) and yet Google still gives the preference to the page that was posted about 48 hours earlier, has the same title and the same first couple paragraphs.

@ascensions: what would you parse/replace? I can think of adding something to the title (I've already added more than enough to the content), although I'm not sure what would be best to add here. However, I can't think of replacing anything without making this look really, really weird. It's a very simple title, too. He just typed "Need help with a broken XX-XX-01". Not much room for change...
3:04 pm on Jul 27, 2012 (gmt 0)

yet Google still gives the preference to the page that was posted about 48 hours earlier,

OK, here's a question - I don't work so much with news/UGC, but does Google assume content is the original if its spiders get to it first, or are the document/RDFa dates a deciding factor? And following on, does a document that gets updated assume a more recent stamp?
3:32 pm on Jul 27, 2012 (gmt 0)

does a document that gets updated assume a more recent stamp?
I guess it would be logical to expect it, but it does not seem to affect the rankings anyway. The last update on the page that actually ranks predates the posting of the same question on my site, so by definition each of the 11 or 12 updates my page has had since then is more recent. Yet the other page comes up in search and not mine.

These two pages are not yet a month old. I wonder if Google will eventually re-evaluate them on the basis of their content and not "who posted first". I don't know if it's actually the case, but it would make sense if "posted first" only mattered for the first couple of weeks or so. After that, the news (if it were a news piece) would have percolated down the system and some detailed commentaries would have emerged that are worth listing in their own right. Once again, my example is not a news piece, but perhaps the same approach also applies to other types of content?
3:36 pm on Jul 27, 2012 (gmt 0)

I've been parsing out quotes for the last two weeks. Luckily my CMS has a quote function, so the tags are there to do so.

I was hoping to have another week before Panda, but I guess I'll wait another month to see how well it does.
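
For illustration, parsing out quoted blocks could look roughly like the sketch below. The BBCode-style `[quote]...[/quote]` tags are an assumption: the post does not say which CMS or tag syntax is in use, and the function name is made up for this example.

```python
import re

# Matches an innermost quote block: [quote] or [quote=name], then any text
# that does not open another quote, then [/quote]. Tag syntax is assumed.
INNERMOST_QUOTE = re.compile(
    r"\[quote(?:=[^\]]*)?\](?:(?!\[quote[=\]]).)*?\[/quote\]",
    re.IGNORECASE | re.DOTALL,
)

def strip_quotes(post):
    """Remove quote blocks, including nested ones, and return what is left."""
    previous = None
    # Each pass removes the innermost blocks; repeat until nothing changes.
    while previous != post:
        previous = post
        post = INNERMOST_QUOTE.sub("", post)
    return post.strip()

# Hypothetical post with a nested quote.
sample = "[quote=bob]Earlier text [quote]older text[/quote] more[/quote]My own reply."
print(strip_quotes(sample))
```

As noted in the reply that follows, this only catches text that users actually wrapped in quote tags; anything pasted in as plain text passes through untouched.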
3:43 pm on Jul 27, 2012 (gmt 0)

Ah, if only those cut-n-pasters used the quote tags every time! It sounds like a darn good idea, but I have a feeling you can only catch a small percentage of quotations in UGC this way. My CMS also has quotes, but I don't recall anyone besides myself using them. I guess it depends on the niche you play in...

