"Similar" Sites Vs Duplicate Content

Forum Moderators: Robert Charlton & goodroi

what can cause trouble....

janejanejane

9:07 pm on Jun 24, 2005 (gmt 0)

Duplicate content can cause a problem of course, but what about sites that look "similar" and are structured alike (where content is not duplicated) but where inventory is the same (all operated by the same vendor)?

Is that considered spam, or worthy of a (manual) penalty in the new era of Google? Curious if anyone had any thoughts on the technicalities between the "duplicate" and the "similar"...

Clint

9:33 am on Jun 25, 2005 (gmt 0)

[bump] I'd like to know this as well. Since the inventory is the same, if it's the same text for the inventory then I would say yes, it's duplicate.

larryhatch

10:49 am on Jun 25, 2005 (gmt 0)

Agreed, kind of a tough call.

Let's say you and a competitor are selling the same line of goods.
My feeling is that you can sell all the same items, but do NOT copy his pages!
If I were trying this, I might use the competitor's materials as a starting point (at most)
but completely re-write each page, product descriptions, just about everything.
I would even arrange the product pages differently, different color scheme etc.
and make damned sure my domain name isn't too similar to his/hers.

This avoids duplicate penalties for both sites, not to mention some _really_ bad blood
and possible legal difficulties (putting it mildly). -Larry

Clint

12:14 pm on Jun 25, 2005 (gmt 0)

Jane, isn't "inventory" part of "content"? Or is content considered something other than inventory?

janejanejane

5:18 pm on Jun 25, 2005 (gmt 0)

Allow me to clarify - and I bet I am not the only one in this situation...

Essentially, we developed 3 sites one after the other through 2001-2003. Each site was designed to be an improvement on the previous one, in terms of content, design, structure. Now we focus all marketing/content development on the latest site.

Here is the big question: We still left the first two sites up and they still account for about 10% of the revenue. Inventory is essentially the same, fundamental structure and feel is similar, but content (while again similar) is not duplicate and the most recent site has many more pages and much more content.

Just wondering if we can get in trouble having these other sites lingering around...

Widestrides

5:48 pm on Jun 25, 2005 (gmt 0)

Someone once posted, and I'm sure it was just a guess, that there may be a "penalty" if, say, 70% or more of the content is the same. That may just be based on a simple scan of the text by Google and maybe other elements of the page.

Do you think some of your pages might be 70% or more alike? If so, it may be best to change or subtract some content from one or the other sites.

ownerrim

6:16 pm on Jun 25, 2005 (gmt 0)

I think the duplicate content "filter" is pretty loose and this is why: if you have a 10,000 page site (hard to do, of course) on the brown tail squirrel, all of the pages are going to have VERY similar word usage, I.E. content. And that makes sense: you're talking about the brown tail squirrel on every stinking page.

I think if every page is written differently and distinctly (let's say each page is an answer to a question and each page has its own distinct title and h1 tags), then Google is going to treat each page as unique and dupe content issues won't enter into the equation.

So....I doubt if dupe content issues would apply to sites that simply look similar or have a similar setup. Duplicate content IMO refers to simply that: duplicated and identical content.

janejanejane

6:29 pm on Jun 25, 2005 (gmt 0)

In the scenario described, I am less worried about duplicate content than I am about a manual penalty based on a complaint from a competitor - and having all of our websites banned/penalized. I guess my question is whether Google would find it a transgression to have similar websites all selling the same inventory....

ownerrim

7:36 pm on Jun 25, 2005 (gmt 0)

"I guess my question is whether Google would find it a transgression to have similar websites all selling the same inventory...."

I can't see why myself. Look up a competitive term and you may see a horde of competing sites all offering what is essentially the same product inventory, prescription drugs for example.

JudgeJeffries

11:30 pm on Jun 25, 2005 (gmt 0)

Not sure about Google, but Skaffe recently ditched a load of sites that did not have the owner's details (i.e. address and phone number) emblazoned on them, on the basis that they believed it was possible for a single firm with multiple sites to dominate the SERPs. Seems sensible enough and I suspect Google will do the same.

prairie

12:20 am on Jun 26, 2005 (gmt 0)

"I guess my question is whether Google would find it a transgression to have similar websites all selling the same inventory"

Does more than one of the sites show up in any given page of 10 results anyway?

Their guidelines for webmasters recommend you not create multiple sites with essentially the same content, so code's probably already handled the greater part of the issue.

I don't think you'd be banned for something like this.... surely that's reserved for people who flagrantly attempt to manipulate the search space.

larryhatch

9:53 pm on Jun 26, 2005 (gmt 0)

I take it you have three sites with very similar content and/or inventory.
How about taking the weakest one, with the least revenue, and redirecting that to your best site?
You could do this page by page as a test.
If all goes well, do more pages until the whole shebang is redirected.

I'm talking 301 redirects of course. I avoid 302s like the black plague. -Larry
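
To make the page-by-page idea concrete, here is a minimal sketch in Python of an old site that answers with a 301 only for the pages already moved and keeps serving everything else. The domain, the paths, and the names `NEW_SITE`, `REDIRECTS`, `redirect_target` and `OldSiteHandler` are all hypothetical, made up for illustration; a real site would more likely do this in its web server configuration.

```python
from http.server import BaseHTTPRequestHandler

# Hypothetical address of the newest site (assumption, not from the thread).
NEW_SITE = "https://www.example.com"

# Hypothetical map of old paths to new paths. Add entries a few at a time
# to migrate page by page, as suggested in the post above.
REDIRECTS = {
    "/widgets/blue.html": "/products/blue-widgets",
    "/widgets/red.html": "/products/red-widgets",
}


def redirect_target(path):
    """Return the full new URL for a migrated path, or None if not migrated."""
    new_path = REDIRECTS.get(path)
    return NEW_SITE + new_path if new_path else None


class OldSiteHandler(BaseHTTPRequestHandler):
    """Answers migrated pages with a permanent redirect, others with a stub."""

    def do_GET(self):
        # Note: self.path includes any query string, so only exact matches move.
        target = redirect_target(self.path)
        if target:
            self.send_response(301)  # 301 = Moved Permanently (not a 302)
            self.send_header("Location", target)
            self.end_headers()
        else:
            self.send_response(200)
            self.send_header("Content-Type", "text/plain; charset=utf-8")
            self.end_headers()
            self.wfile.write(b"Old page, not migrated yet.\n")
```

To try it locally, one could run `http.server.HTTPServer(("", 8000), OldSiteHandler).serve_forever()` and request one of the mapped paths; growing `REDIRECTS` over time is the "do more pages until the whole shebang is redirected" step.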

joeduck

12:20 am on Jun 27, 2005 (gmt 0)

janejane -

ask yourself exactly what Google asks itself about what constitutes a quality search:

Are you enhancing or detracting from the user's experience at your site? If the extra pages are just fluff to have more pages and are only modified slightly you may be doing more harm than good. However if the "similar" content reflects real differences in the type of users viewing that section it "should" be considered OK by search engines.

Rollo

1:06 am on Jun 27, 2005 (gmt 0)

I think similar content is ok on different sites on different IPs not engaged in a link scheme with each other. In travel, site after site has the same photos, descriptions (typically written by hotels) etc... similarly, look at book reviews, or product specs... all the same and no penalties to speak of. Still, to be on the safe side, I'd always write 100% original content or put borrowed content in quotes. What's true in Google today might not be true tomorrow.

Reid

2:07 am on Jun 27, 2005 (gmt 0)

Since you see each site as an improvement upon the others - A and B should both have a 301 re-direct to C.
There is no need for duplicate sites when you can just re-direct your customers to the 'latest update'.

My thoughts are that if an SE rep looked at it, they would see this as a spammy move and penalize it. If you are bypassing the filters now, then welcome to 'blackhat' or 'on the edge' territory, which could be penalized at any time without warning or notice - maybe.

webdevfv

12:05 pm on Jun 27, 2005 (gmt 0)

When we talk about a duplicate content penalty does it apply to a whole site or just the page(s) concerned?

i.e. will your whole site be penalised because a couple of pages are similar?

janejanejane

3:46 pm on Jun 27, 2005 (gmt 0)

I appreciate the feedback to my question - it's all great.

As opposed to a 301 redirect, I'm considering removing the e-commerce/inventory of the older sites and turning them into authentic content/info sites that can link directly to our current main site.

I think this would provide a useful solution for the older sites (while avoiding any issues that would get us penalized).

If I have learned anything watching the SEs over the past 3 years, it seems that playing it safe is probably the best idea.

MikeNoLastName

1:46 am on Jun 29, 2005 (gmt 0)

After apparently being dupe penalized in Bourbon, what I'm wondering is: does it matter if the duplicate content is even all on the SAME OTHER page? I'm getting a little worried, since in our case, I just finished trying searches for a half dozen different sentences (at least 15 words each) from our home page and EVERY SINGLE ONE OF THEM has been copied by at least 70-100 scrapers! Not the same sites for each line or vice versa, but mostly our site title, link and 1 or 2 random lines each. It's absolutely mind-boggling! I can search for any single sentence on our home page in quotes and close to 100 other sites come up AHEAD of OURS. No wonder G thinks we're the duplicate!
Guess I need to go rewrite our home page from scratch and start all over. This c$*%* is OUT OF CONTROL!

frfvr

3:52 am on Jun 29, 2005 (gmt 0)

JaneJaneJane

"If I have learned anything watching the SEs over the past 3 years, it seems that playing it safe is probably the best idea."

Agreed and I want to be perfectly safe. A Shopping directory has one direct link to my site that describes it so well that G bot has had it on the same SERP as mine for several months now. I would like to leave it there, but not if I will be penalized or removed because of it. To play safe, I could trash the link, but how can I tell if it could remain as a legitimate link or not? Comments much appreciated.

ownerrim

12:42 pm on Jun 29, 2005 (gmt 0)

"Their guidelines for webmasters recommend you not create multiple sites with essentially the same content, so code's probably already handled the greater part of the issue."

Now what do you mean by "essentially the same content"? Are you talking about two sites with largely duplicated content? What about two sites that each deal with the same niche, but where the content has been separately and originally written for each? For example, two sites that deal with hiking. Each site offers tips on hiking, but the content is original on each.

bbcarter

6:03 pm on Jun 29, 2005 (gmt 0)

I would think, as has been mentioned, that to whatever extent G devotes computing power to finding duplicate content, they must use a % similarity.

It would be dumb to penalize a site that only has 5% the same content as a scraper site (which stole that 5% from you).

For an article to be duplicate, the % is going to be much higher than simple plagiarism-

plagiarism could probably be detected at about 30-50% similarity- no original content should contain that much the same as another article.

but when you do a research paper, you end up quoting from many sources- much like scrapers do, but for a different purpose. So having 2-3% same content as 10 other sites might look like 'research'... the kind of article that summarizes and synthesizes, which would be useful if it weren't a scraper.

If I were G, I'd put the duplicate threshold at around 80%... and I'd use the reverse to determine 'freshness'. I think duplicates and freshness must be two sides of the same coin- when has an article been updated? when it's at least 15% different from their last cached copy, e.g.

NOTE: I made up all these percentages just as estimates.
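
A "% similarity" test of the kind guessed at above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it compares overlapping 4-word windows (word shingles) with Jaccard similarity, which is one common way to measure text overlap and is not claimed to be what any search engine actually does. The 80% and 15% thresholds are the made-up figures from the post, and every function and constant name here is hypothetical.

```python
import re

# Made-up thresholds taken from the post above, not real search engine values.
DUPLICATE_THRESHOLD = 0.80
FRESHNESS_THRESHOLD = 0.15


def shingles(text, k=4):
    """Return the set of k-word shingles (overlapping word windows) in text."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    if len(words) < k:
        # Very short texts become a single shingle (or nothing if empty).
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def similarity(a, b, k=4):
    """Jaccard similarity of the two shingle sets, from 0.0 to 1.0."""
    sa, sb = shingles(a, k), shingles(b, k)
    if not sa and not sb:
        return 1.0  # two empty texts are treated as identical
    return len(sa & sb) / len(sa | sb)


def is_duplicate(page, other):
    """True if the two pages are at least 80% similar."""
    return similarity(page, other) >= DUPLICATE_THRESHOLD


def is_fresh(current, cached):
    """True if the page differs from its cached copy by at least 15%."""
    return 1.0 - similarity(current, cached) >= FRESHNESS_THRESHOLD
```

With this framing, "duplicate" and "fresh" really are two readings of the same number: `is_duplicate` asks whether the similarity is high enough, and `is_fresh` asks whether one minus the similarity is high enough.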

joeduck

6:09 pm on Jun 29, 2005 (gmt 0)

bb - "12%" duplicated has often been cited, I think because it was the percentage used in an influential dissertation on search functions and duplicates.

Scrapers are taking advantage of the fact that "fair use" allows small snips. Taking these from many sites means they have created a page that Google sees as "unique and not copied" but that has NO original content at all - it's all duplicated from others.

I actually think Google wants to fight this because usually these sites diminish the user's experience and G's credibility, but it's not an easy task to algorithmically determine scraped content.

MikeNoLastName

11:34 pm on Jun 29, 2005 (gmt 0)

>It would be dumb to penalize a site that only has 5% the same content as a scraper site (which stole that 5% from you).

>>Taking these from many sites means they have created a page that Google sees as "unique and not copied" but that has NO original content at all - it's all duplicated from others.

Yeah, so what about, as I mentioned yesterday, YOU write a unique page/website from scratch of say 100 lines of text. And it's so good that 100 scraper sites each take 2 random lines as "excerpts". Say Scraper Site 1 (SS-1) takes lines 1 & 2 from your site, SS 2 takes lines 2&3 and so on.
You now have a page/website, which YOU wrote which ENTIRELY exists elsewhere on the net (albeit in 50 pieces)! You now have 100% duplication, when YOU didn't copy anything. YOU are now INDISTINGUISHABLE from a scraper because, just like them (in fact WORSE than them) you have NOTHING other than 2 lines of excerpts from SS-1 and 2 lines of excerpts from SS-2, etc.
Gee, I bet I could kill 10 competitors within days simply by creating a single throw-away site and systematically creating 100 or fewer pages doing just what I described above, with 2 line excerpts from 10 of them on each page, TOTALLY ENCOMPASSING every snippet possible from all their sites, all interlinked, and get them all indexed quick using sitemaps.
There has to be a better way to determine who is the scraper and who is NOT. You can't necessarily use page date, because a lot of people move their pages or have to change servers. If G could PROPERLY maintain a monthly history and properly trace 301 redirects through it, they MIGHT be able to determine WHO had the content first. Of course then the first site to quote a line from Shakespeare automatically becomes the official ORIGIN of that text.
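
The scenario in this post is easy to reproduce as a toy model. The Python sketch below is an assumption-laden illustration, not a description of any real filter: it treats a page as a set of lines, builds a hypothetical 100-line original, and lets 100 hypothetical scraper pages each take two adjacent lines. The names `lines_of`, `coverage` and `max_pairwise_overlap` are made up for this example.

```python
def lines_of(page):
    """Split a page into a set of normalized, non-empty lines."""
    return {line.strip().lower() for line in page.splitlines() if line.strip()}


def coverage(page, others):
    """Fraction of this page's lines that appear on at least one other page."""
    own = lines_of(page)
    if not own:
        return 0.0
    seen_elsewhere = set()
    for other in others:
        seen_elsewhere |= lines_of(other)
    return len(own & seen_elsewhere) / len(own)


def max_pairwise_overlap(page, others):
    """Largest fraction of this page's lines found on any single other page."""
    own = lines_of(page)
    if not own:
        return 0.0
    return max((len(own & lines_of(o)) / len(own) for o in others), default=0.0)


# Hypothetical data: a 100-line original written from scratch.
original_lines = [f"original sentence number {i}" for i in range(100)]
original = "\n".join(original_lines)

# Hypothetical scrapers: site i takes line i and the line after it
# (wrapping around), so the excerpts overlap as in the post.
scrapers = [
    "\n".join([original_lines[i], original_lines[(i + 1) % 100]])
    for i in range(100)
]

total = coverage(original, scrapers)             # share of the original found elsewhere
worst = max_pairwise_overlap(original, scrapers)  # share found on any one scraper
```

In this toy model `total` comes out as 1.0 while `worst` is only 0.02: no single scraper page looks much like the original, yet every line of the original exists somewhere else. Each scraper page is also fully covered by the original, so a content-only check like this cannot say who copied whom, which is exactly the complaint above.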

joeduck

12:01 am on Jun 30, 2005 (gmt 0)

"There has to be a better way to determine who is the scraper and who is NOT"

I sure hope so, and I'm pretty sure there is a job at a major SE of their choice for the person who figures out a scalable and robust solution. I think you have identified the problem in your post. People ARE building sites from snippets and they ARE killing off legit sites and Google is failing to identify/penalize them.

bbcarter

5:36 am on Jun 30, 2005 (gmt 0)

This is the major reason I think one of two things will happen:

1. backlinks will be king, because only a human eye may be able to tell crap from good content

2. the huge search engine model will die, and we'll return to directories... I still use yahoo directory for important things, for reliable companies, referrals, etc.

larryhatch

5:48 am on Jun 30, 2005 (gmt 0)

If a content scraper just takes a sentence or two from each victim, he better be a good writer himself.
Just pasting them together at random will produce nothing but 'word salad'.
Did I say 'good writer'? Shame on me. -Larry

joeduck

6:29 am on Jun 30, 2005 (gmt 0)

larry -

no, it's not word salad because they are taking, for example, a paragraph from YOUR page about Texas, one from mine, and one from wiki. Each is good quality and may even link back to the respective sites. Extremist scraper advocates could even make a case (I would NOT) that this is a reasonable form of content in line with what a search engine does - bring many sites' info into a quickly perusable page.

larryhatch

6:48 am on Jun 30, 2005 (gmt 0)

OK Joeduck, I see what you mean.
Even using paragraphs, each from a different source, will seem somehow disjointed
at least to me as a reader. I doubt an SE would pick up on that though.
If the process is automated, I would expect a lot of laughable results.
If it's done by hand, I have to ask if the perp wouldn't do better writing his own content.
But, I keep forgetting. Looks like most scrapers can't write much at all. -Larry

econman

1:49 pm on Jun 30, 2005 (gmt 0)

The scraper/automated page creation process is clearly a problem for Google, since it detracts from the user experience -- particularly if the user stumbles on one of the autogenerated pages (rather than one of the target pages, which the autogenerated pages are pointing to).

Google doesn't want to engage in human verification itself (contrary to their corporate culture, plus it's expensive), so they are probably attempting to achieve the verification algorithmically.

Google has been talking about introducing new "signals of quality"; perhaps some of these new "signals" are attempting to detect pages that result from an automated snippet assembly process.

Google can't solve the problem simply by demanding more and more backlinks; too many high quality sites have relatively few backlinks (e.g. government and educational sites which don't engage in SEO). Plus, it's actually easier for spammers to create thousands of (not really legitimate) inbound links than it is for the typical webmaster to attract hundreds of (relevant) inbound links.

Since it can't simply rely on the quantity of inbound links (with or without page rank weighting), Google is probably looking for better ways to detect quality.

One possibility is to engage in sophisticated statistical techniques, looking for subtle patterns along the lines of the "trust rank" concept. In general, an obvious solution would be to attempt to weight inward links with respect to the likelihood that the link has been created and verified by a human being (not affiliated with the recipient of the link).

Interestingly, even traditional reciprocal links can be useful for this purpose, provided both parties to the link swap try to avoid linking into a "bad neighborhood", because they can't afford the risk of having their site identified as being part of a spamming scheme.

MikeNoLastName

1:38 am on Jul 1, 2005 (gmt 0)

Larry,
Most scraper sites simply search on a high-paying keyword that they want to attract, say "blue widgets", on say Y!, and then capture/copy the output on that screen (which generally includes snippets from 10 different sites) to a new page on their own site (usually links to your site included), add some Adsense or affiliates and then move on to a new keyword. The page looks very similar to a search engine result page. If they (or someone else) then search on "red widgets" they may get the SAME page from your site in the top ten results again, but this time with a slightly different snippet (some search engines conveniently change the snippet to emphasize the text you searched on), which adds yet MORE of your content to a different page on their site. This is overly simplified, but THIS is what I'm talking about.

If your site shows up (OR FORMERLY SHOWED UP BEFORE IT WAS DUMPED BY THE SEARCH ENGINE FOR DUPLICATE CONTENT) in the top 10 results consistently enough, YOUR page would be on most of those copied directory pages.

Even if you then let them HAVE that content and change your entire website, you're still screwed with no traffic for months until the search engines re-spider, re-index and de-penalize (de-sandbox) your site to start the whole process all over again. It SUX!
