
Has Google made a big change in identifying duplicate content?

indyank

1:23 pm on Jun 21, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Has Google made a big change in identifying unique content (or duplicate content)?

This is one big question I have had ever since Panda was rolled out, as I saw pages on sites go down even when they had only a few lines of text from the manufacturer's description of a product.

I stumbled upon two interesting threads on the Google forums answered by John Mu, and his replies do suggest that things have changed.

[google.com...]

In the above thread, John Mu picked out a couple of sentences from a page on the questioner's site, compared them with a page on Wikipedia, and responded with this message:

For instance, on your page (link removed) you mention:

"Personal Life

Mukesh Ambani married Nita Ambani and they have three children. Akash, Isha and Anant are the names of his children. Mukesh even owes an IPL team with the name Mumbai Indians. Currently he lives in Mumbai in a 27 story building which is named as “Antilia”. The value of his home is about US$1 which is the most expensive homes."

On the same topic, Wikipedia mentions:

"Personal life

He is married to Nita Ambani and has three children, Akash, Anant and Isha. He owns the Indian Premier League team, the Mumbai Indians.They live in a private 27 story building in Mumbai named Antilia. It is estimated to be valued at over US$1 Billion to build. It is claimed to be the most expensive home in history."

The similarity (skipping over the typos) is quite striking. Are you sure that this is the kind of work that should be associated with MBAs?



This recent response does seem to suggest that Google might consider pages not unique even if they only quote a few sentences from the manufacturer's description of a product or its features.

I am not going into the merits of the site in question on that thread, but John Mu's response does hint at how Google might be running their quality algo (here, manually) to determine uniqueness.

Personal details of the kind described in the above example cannot change, and that kind of similarity is to be expected. The same holds true for a product's description of its features. But Google's quality algo seems to have been designed otherwise.
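For what it's worth, here is a minimal sketch of one classic way such overlap gets flagged: word "shingling" plus Jaccard similarity. This is purely illustrative (nobody outside Google knows what they actually run), with the passages trimmed from the example above:

import re

def shingles(text, n=3):
    # Set of n-word shingles in a text, punctuation stripped.
    words = re.findall(r"[a-z0-9]+", text.lower())
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def jaccard(a, b):
    # Jaccard similarity of two shingle sets: |intersection| / |union|.
    return len(a & b) / len(a | b) if a | b else 0.0

site = ("Mukesh Ambani married Nita Ambani and they have three children. "
        "Currently he lives in Mumbai in a 27 story building named Antilia.")
wiki = ("He is married to Nita Ambani and has three children. They live in "
        "a private 27 story building in Mumbai named Antilia.")

# Even two honest restatements of the same facts share some shingles;
# a spun or lightly edited copy shares far more.
print(round(jaccard(shingles(site), shingles(wiki)), 2))

Even at this toy scale, the shared phrases ("Nita Ambani and", "27 story building") surface immediately; at web scale the same idea makes a few copied or closely paraphrased sentences easy to spot.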

I then stumbled upon this post [seroundtable.com...] pointing to another thread where John Mu had responded with a few options at the time - [google.com...]

Adding a sentence or two is one way to do this, even better would be to make it completely unique.


It does look like Google is now expecting more complete uniqueness than the option John had suggested earlier (pre-Panda).

Any thoughts?

PS: What makes the whole research interesting is that when I searched Google for "making content unique", I did not find the post on seroundtable.com itself, but instead found another blog that had reproduced the whole post with a link back to seroundtable.com. Any thoughts on this?

suggy

2:58 pm on Jun 21, 2011 (gmt 0)

10+ Year Member



Hi Indyank

I have my suspicions that this is what pulled us down in Panda: that, and Google's inability to attribute us correctly as the originator. Scrapers who take just a sentence or two from every page in a SERP set have been rife in scraping bits of our content. So our experience would tend to chime with this.

I also think they are better at spotting what I call derivative content - you know, where someone just rewrites someone else's piece or rehashes the same old same old.

Maybe they are using n-grams to do both? That would tend to shine the spotlight on content that is only superficially different.

kellyman

3:04 pm on Jun 21, 2011 (gmt 0)

10+ Year Member



I personally think there is a lot of testing going on at Google at the moment, and this is possibly something they have tested.

However, I cannot see them being that strict on duplicate content. Most eCommerce sites will show some of the manufacturer's description somewhere down the line, so it would be nearly impossible for a site to stay within the boundary of originality - how many ways can you describe a specific car, for instance?

I still tend to think Google has an issue with quality and is trying different methods to resolve it, and lots more testing will be done.

potentialgeek

3:09 pm on Jun 21, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Unique Content is King?

Last week I read an article on SEORT which was similar. A webmaster had complained about his site being penalized/vandalized ("pandalized"), and the Google response was similar. The Google employee was able to find the original source of the webmaster's content, i.e., from where he'd stolen it.

Site #1: "United States vs Argentina will execute battle since the U.S. returns to New Meadowlands Stadium to engage in recreation a planet power just the once more."

Site #2: "The United States and Argentina will do battle as the U.S. returns to New Meadowlands Stadium to play a world power once more."

I don't know how he found it, because it wasn't an exact copy. It is not difficult to write code that compares exact phrases or snippets of text between sites; Google must go much further than this. It probably compares keywords and synonyms to detect plagiarism. There are two kinds of plagiarism: 1) exact copies; and 2) rewritten text.

Similar keyword proximity is probably used to detect the rewritten kind, although Google may simply have looked at how existing plagiarism-detection software works and copied the approach.

In the above example there are several identical/similar words in the same sentence:

"United States vs Argentina will execute battle since the U.S. returns to New Meadowlands Stadium to engage in recreation a planet power just the once more."

I was recently doing some medical research online. I saw so many sites, one after another, with copied text and/or rewritten, plagiarized text. The content was neither unique nor valuable. Plagiarism is everywhere.

Google has had a plagiarism dial for years; now it feels as if they have turned the dial way up. Is there now a UniquenessRanking (UR)?

I don't copy other sites, but what if my content just happens to be similar to the content on other websites? There's no way for Google to know that I didn't copy the other site.

Now I have to know what my competitors have written and change my text to be "unique"?

One of my biggest complaints about the new algo is its failure to respect sites based on age and trust. I wish they had started by targeting the new sites and then gradually worked towards the older ones. Then they could have ironed out all the collateral damage issues by the time they got to the sites that are 10+ years old - sites that have been around since long before AdSense.

I would have applied Panda 1.0 to 2.2 to post-AdSense-era sites first.

[edited by: potentialgeek at 3:20 pm (utc) on Jun 21, 2011]

indyank

3:12 pm on Jun 21, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



suggy, you are right.

PS: What makes the whole research interesting is that when I searched Google for "making content unique", I did not find the post on seroundtable.com itself, but instead found another blog that had reproduced the whole post with a link back to seroundtable.com. Any thoughts on this?


The fact that Google is showing the copied content ahead of the original for that search does suggest that Google's algo is yet to sort out incorrect content attribution. They are even ignoring the link back to the original - and most scrapers don't even link back.

But what is the rationale behind requiring a page to be completely unique? The above example (the two lines on a person's personal life) states facts that cannot change. At most you can rewrite them differently, but does that serve any purpose?

If this kind of expectation on uniqueness is real and has been coded into the algo, any reproduction, even of a product's features, will bring you down. And if they make small mistakes in identifying the source, the scrapers will flourish ahead of the originals.
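If attribution really is this naive, the failure mode is easy to picture. A speculative sketch (the "first crawled wins" rule is entirely my guess at the failure, not anything Google has confirmed, and the URLs are made up):

from datetime import datetime

# Hypothetical crawl log: the scraper happened to be crawled first.
crawl_dates = {
    "original-blog.example/post": datetime(2011, 6, 20, 14, 0),
    "scraper.example/copy": datetime(2011, 6, 20, 9, 0),
}

def presumed_original(urls):
    # Naive attribution: whichever URL was crawled first is "the original".
    return min(urls, key=crawl_dates.__getitem__)

print(presumed_original(crawl_dates))  # the scraper wins attribution

Under a rule like that, any correction - the link back to the source, the age of the domain - would have to be a deliberate extra signal, which may be why they still get it wrong.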

londrum

3:24 pm on Jun 21, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I think the whole "duplicate text should be dumped from the SERPs" argument is flawed anyway.

The idea that Google's users want the person who said it first to be at the top is complete and utter nonsense. There isn't a single walk of life that I can think of that operates in that manner.

I listen to the Beatles' "Twist and Shout". If someone said they were going to delete that in favour of the Isley Brothers' version, I'd wonder what was going on. The Beatles' version has identical words, music and tune, but so what?

People go and watch movie remakes at the pictures. If they said we're not allowed any more because we've got to go and watch the old black-and-white versions, we'd have a good laugh at that too.

Newspapers: do people seek out the paper that breaks a story first, or do they carry on buying their favourite one? Every paper has the same stories, and often the same pictures too, but nobody cares about that.

Google would do much better if they stuck the best site at the top, regardless of where the information cropped up first. I actually think that's what Panda is about: giving "trusted" sites a boost up the SERPs to get them past the piddly ones which 99% of users don't want.

Simsi

3:30 pm on Jun 21, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Although John Mu says "skipping over the typos", it's quite clear that the Wikipedia article is streets ahead in quality just from the way the data is presented.

Major SERPs degradation on the basis of a few lines of dupe content is not logical. As Kellyman says, a lot of verticals will legitimately use duplicate content, and after all, how many ways are there to write that some guy has 3 kids and a nice house while presenting the facts?

You could argue that because it's already on the web it doesn't need to be reiterated over and over, but then you could say an article is incomplete without the full facts.

Then you have the scraper effect. I can't see Google handing scrapers an advantage unless it is also looking at other indicators of quality. It doesn't help anyone.

I'm convinced that there are several quality factors at play here and that dupe content is just one of them: combined, they deliver a "content quality" score which in turn is combined with other indicators. Put another way, dupe content on this scale is only a problem if it's surrounded by (in Google's eyes) rubbish.

[edited by: Simsi at 3:34 pm (utc) on Jun 21, 2011]

walkman

3:33 pm on Jun 21, 2011 (gmt 0)



@indyank

Knowing that the complaining site was an obvious scraper, the Google employee could very well have searched for "US Argentina match Meadowlands Stadium", considering that one is much more likely to steal from the top results for soccer news stories. All his 'articles' were spun, badly.

The other site you mentioned has zero value as a site. If you ask yourself why he created it, nothing other than MFA comes to mind. I really doubt that Google has gone as far as comparing sentences, since that would wreak havoc on many mainstream sites.

I don't copy other sites, but what if my content just happens to be similar to the content on other websites? There's no way for Google to know that I didn't copy the other site.

Get 15 mainstream stories about "Navy SEALs kill Bin Laden" and see how unique they are once you strip them down to keywords.

That said, I have noticed a major crackdown on 'tech news' sites that just rehash a story in two cheesy paragraphs.

Major SERPs degradation on the basis of a few lines of dupe content is not logical. As Kellyman says, a lot of verticals will legitimately use duplicate content, and after all, how many ways are there to write that some guy has 3 kids and a nice house while presenting the facts?

100% Agree. Imagine Reuters: "We were going to write a brief bio of Bin Laden to enhance our story but Wikipedia and CNN have already done so" :)

[edited by: walkman at 3:39 pm (utc) on Jun 21, 2011]

netmeg

3:33 pm on Jun 21, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I don't think there's a finite duplicate content threshold that in and of itself causes penalties or filters. I think it's one factor, combined with other factors - but if you're on the tipping point with some of the other factors, then the duplicate content (i.e. similar descriptions) can put you over.

Most of my clients are doing ecommerce, and many of them have a lot of duplicate content - we don't use manufacturer descriptions, but if you have a product that comes in ten colors, and you have to offer them as separate SKUs because that's how your inventory system and your vendors account for them, then you're gonna have duplicate content. But for whatever reason, these sites have not (so far) been hit by any of the Pandas.

Those pages with the color changes don't particularly rank well, but they never did. I don't even try to get them to rank; I tart up the category page and figure if I can get traffic to that, the user can make his way from there. Maybe it's because the ratio of this type of content to other, more clearly unique content across the entire site is good. Maybe there are other factors as well (most of the domains are at least ten years old, everything is pretty much written the way we speak, we obviously are doing a lot of stuff unique to the company besides selling some products that other people sell AND selling multiple versions/colors/sizes/etc.)

You obviously want to minimize the duplicate stuff (internally and externally) as much as you can, because you have to sell on what makes you different. But I really don't think it's duplicate content alone that can get you socked, I think it's a duplicate content along with all the other factors.
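To put that tipping-point idea in concrete terms, here is a toy model. Every factor name, weight and threshold below is invented for illustration; none of it is anything Google has published.

# Purely a toy model of the "tipping point" idea above.
def panda_toy_score(signals):
    # Return True if the page would be "pandalized" in this toy model.
    weights = {
        "duplicate_ratio": 0.4,   # share of content seen elsewhere
        "thin_content": 0.3,      # little substance beyond boilerplate
        "ad_density": 0.2,        # ads crowding out content
        "poor_engagement": 0.1,   # users bouncing straight back
    }
    score = sum(weights[k] * signals.get(k, 0.0) for k in weights)
    return score > 0.5  # hypothetical threshold

# A page that is fine on everything else survives some duplication...
print(panda_toy_score({"duplicate_ratio": 0.6}))                      # False
# ...but the same duplication tips over a page already weak elsewhere.
print(panda_toy_score({"duplicate_ratio": 0.6, "thin_content": 0.8,
                       "ad_density": 0.7}))                           # True

The point is only the shape of the mechanism: the same duplicate ratio is harmless or fatal depending on what else it lands on top of.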

Planet13

4:06 pm on Jun 21, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Two things come to mind that may help us determine the scope:

1) tedster pointed out in a separate thread Google's recently increased ability to understand and interpret synonyms.

2) Matt Cutts' recent public acknowledgment that Google is working on something to stop scrapers outranking the original.

My opinion is that ecommerce sites, for the reasons pointed out above, will of course get a lot more slack when it comes to duplicate content, while "news" sites are going to get a lot more scrutiny.

BeeDeeDubbleU

4:19 pm on Jun 21, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Why does G show almost 600K results for this search for a single string of text?

[google.co.uk...]

Atomic

4:53 pm on Jun 21, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



how many ways are there to write that some guy has 3 kids and a nice house while presenting the facts?

A lot, actually. The examples on this page are clearly plagiarism. When you decide to write a bio about someone, you could include all kinds of details from that person's life. Sure, some are more important than others. But when you copy another source and do little more than change the sentence structure (a little bit) and whip out your thesaurus to switch out a few choice words (often poorly), you shouldn't be allowed to rank.

There may only be a few ways to describe a single fact, but there is a tremendous number of facts you could choose to include in something like a bio. When you choose to go the easy route and just swipe someone's original work, as so many are doing, then you should be slammed. I'm glad Google's addressing this. I bet a lot of site owners are hastily rewriting content in the hope of fooling Google, but my hunch is some of it will end up crappier than what they started with, since many of them really don't know what they're writing about. Their "craft" is just juggling words around trying to fool Google.

mhansen

4:58 pm on Jun 21, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Based on John Mu's response...

The similarity (skipping over the typos) is quite striking.


What would happen if some studious person decided to write a Wikipedia page (or update an existing one) based on information they found on MY website? In other words, what if the other guy's content came before the wiki page?

Does MY content suddenly have a 30-60 day life based on the next Panda update, once it's labeled as non-unique? It appears that way...

indyank

5:10 pm on Jun 21, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Atomic, I don't think anyone will disagree with your view on plagiarism in general. The examples were not cited to show any merit in them; the quality is obviously bad, with typos, and the author may well have plagiarized. But that is not the point of this thread.

1) How is Google actually determining duplicates? Is it based on similarities in a few lines?
2) Is there an underlying assumption that whatever is on sites like Wikipedia is original, and that others with similar content are the duplicates?

From here on I am deviating from the thread.

As an example, say this fact had been added to Wikipedia by someone who took it from a book, and this guy happened to add the same two lines from the same book later on. Is Wikipedia the only one allowed to have them, because it added them first?

There may only be a few ways to describe a single fact, but there is a tremendous number of facts you could choose to include in something like a bio.


Are you saying that all the facts you include in a bio must be unique and not written elsewhere?

Atomic

5:17 pm on Jun 21, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Are you saying that all the facts you include in a bio must be unique and not written elsewhere?

Not at all, but when only those same facts are used, with the same structure, then no, it should not be written elsewhere, because that's plagiarism. Many of the examples I see are clearly plagiarism. With so many interesting facts, why does a site use only what Wikipedia has, presented in the same order? I mean really.

indyank

5:28 pm on Jun 21, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Not at all, but when only those same facts are used, with the same structure


No doubt it is plagiarism, and I thought Google always determined it that way before. But the question is: has that changed, and have they now turned this dial way up?

Atomic

6:14 pm on Jun 21, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Why does it have to be turned way up? Either way, sites that have been scraping my content for years are disappearing from the SERPs.

Despite what you say, I can't help but think some of what I've read here is a defense of plagiarism and a frustration that it doesn't work as well as it used to. Not all, but some. I'm talking about them.

walkman

8:38 pm on Jun 21, 2011 (gmt 0)



Despite what you say, I can't help but think some of what I've read here is a defense of plagiarism and a frustration that it doesn't work as well as it used to. Not all, but some. I'm talking about them.

First, the site mentioned is the worst example: they started from Wikipedia and changed a thing here and there. No sane person should defend them.

The rest is about fear of Google. It may sound paranoid, but when 70% of your (or someone's) income goes away and you don't know why, your mind starts to wander. The problem with things like biographies is that once you break them down into words, you see a lot of similarities. What happens all depends on Google and how far they turn the dial. I understand how someone can say "that will never apply to me"... but I am just going to laugh at that statement.

Atomic

9:03 pm on Jun 21, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



No sane person should defend them.

Perhaps not. But my gut feeling is that they're on this board defending themselves in denial over what they do.

dibbern2

10:27 pm on Jun 21, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Can I add another twist without breaking the thread?

My Panda-whacked site used actual quotations - two-line blurbs - from web pages, PDFs, and other documents to present a list of resources about certain health care topics. There are thousands of these across about 120 pages.

My purpose was to use the blurb as a description for the resource listing, NOT to steal content from the pages I was recommending. When I look at it now, I can't blame G (or any search engine) for seeing lots of borrowed, illegitimate content.

I had enjoyed years of very high SERP rankings for prime keywords. That's gone now. Sad, but not mad. My bad, I think.

austtr

11:10 pm on Jun 21, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



If you have to use words from another source, whatever happened to the use of quotation marks? I'm sure Google knows that quotation marks mean "what follows is not unique; these words are the work of another".

Also... I'm sure I've read somewhere on a Matt Cutts blog (maybe a video) that a page is not seen as duplicate content until obviously copied/reworked content reaches a certain point. A single 20-word sentence in a 1000-word unique page should not get the whole page slapped.
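A back-of-envelope sketch of that "certain point" idea (the 30% threshold is invented for illustration; Google has never published such a number):

THRESHOLD = 0.30  # hypothetical share of copied text that trips the filter

def looks_duplicate(copied_words, total_words):
    # True if the copied share of a page passes the toy threshold.
    return copied_words / total_words > THRESHOLD

print(looks_duplicate(20, 1000))   # False: one 20-word sentence is 2% of the page
print(looks_duplicate(600, 1000))  # True: a page that is mostly lifted text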

dibbern2

11:16 pm on Jun 21, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



If you have to use words from another source, whatever happened to the use of quotation marks? I'm sure Google knows that quotation marks mean "what follows is not unique; these words are the work of another".


Didn't work for me. Quote marks (hah!), blockquote tags, <q> tags, even <cite> and cite attributes on blockquotes.

I've searched for an acceptable way to display blurbs that would not get classed as lifted content. I will not consider text GIFs.

walkman

12:01 am on Jun 22, 2011 (gmt 0)



dibbern2, the health sector was a prime target of Panda. They aimed to push UP major hospitals, the CDC and other government sites seen as more reliable for life-or-death info. By pushing them up, others went down.

Maybe the blurbs weren't it? Just saying, I don't know.

indyank

1:13 am on Jun 22, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Why does it have to be turned way up? Either way, sites that have been scraping my content for years are disappearing from the SERPs.


Scrapers aren't disappearing, and they rank ahead of the original for a query like "making content unique" (not even an entire sentence).

The site being pushed down has been around for a good number of years and is quite popular in its niche.

Also... I'm sure I've read somewhere on a Matt Cutts blog (maybe a video) that a page is not seen as duplicate content until obviously copied/reworked content reaches a certain point. A single 20-word sentence in a 1000-word unique page should not get the whole page slapped.


What they said yesterday may not hold good today, whether it's Matt Cutts or John Mu. Things keep changing at a fast pace, and we are behind in identifying the changes.

[edited by: tedster at 1:35 am (utc) on Jun 22, 2011]

indyank

1:19 am on Jun 22, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I think it's one factor, combined with other factors - but if you're on the tipping point with some of the other factors, then the duplicate content (i.e. similar descriptions) can put you over.


netmeg has a good point. There is probably a set of factors or signals used to classify sites like Wikipedia on one side and the rest on the other. The sites that fell on the wrong side but came back immediately after the rollout might have been hit by this unique-content check, since, being popular, they are scraped all over. But Google might have tilted the balance back in their favor by adjusting something else.