For example, if two sites have identical content, what determines which content is the original? Is it the first site to be indexed? That seems somewhat arbitrary to me.
Also, what if I have a site that provides info about widgets, and the site has a different page for purchasing widgets in each state? Each page would have identical content (the widget product description), and the only differences would be the state name and the contact info. From what I've read, many of these pages would either be penalized or dropped from Google's index due to the high degree of duplicate content. But in reality each page is unique and would be useful to people in different geographic locations.
In the end, I'm hoping someone can explain to me why I'm wrong, or, if I'm not, why Google would be so arbitrary.
If this is true, I think it really encourages 'bad behavior', so to speak, because it allows people to copy a competitor's content that hasn't been indexed yet and present it in a more search-engine-friendly way. I assume you can imagine all the complications that might arise from this.
I originally wanted a specific ".com" domain name, but it was not available. I purchased the ".net" variety and began to build a web site while waiting for the ".com" to become available...
During this time I knew nothing about search engine submissions, so the site went un-submitted... (8 years ago)
(there is a point to all this...bear with me...) 8-)
All of the back end web issues were handled by the ".net" domain (mail, forms, etc).
I purchased the ".com" domain when it became available and configured it to point to my current (.net) web site. By this time I was starting to submit my site everywhere... I decided that the ".com" was what I really wanted and began to submit it...
so far...everything is great...my listings are happening....(I'm on my way to fame and fortune...lol).
(the point is coming......) 8-)
My Error: hardcoding the links to my forms so that users were always sent to the ".net" domain... (my forms only worked if they were submitted from my original ".net" domain name...) (this was before I knew about mods and such)
The Point: When I got indexed as ".com", the SE located my internal link to my form page on the ".net" domain and began to index the ".net" web site along with the ".com" site... IDENTICAL CONTENT...
My current situation is as follows:
Google shows the ".com" results... If I search for the ".net" I show up as a result, but only if I go looking for it... Google decided the duplicate content fell on the ".com" side and only shows ".com" results...
Yahoo shows the ".net" results... likewise on a search for the ".com", but it only displays the ".net" in searches... Yahoo decided that the duplicate content fell on the ".net" side.
It's definitely something to consider when linking the two sites together. You may not achieve what you're looking for, and you may just kill one of the domains...
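For what it's worth, the usual fix is to pick one hostname as canonical and permanently redirect the other to it, so the engines only ever see one copy instead of splitting the domains the way Google and Yahoo did here. A minimal sketch of the decision logic, with placeholder domain names:

```python
# Sketch only: the domain names are placeholders, not the poster's site.
# Pick one hostname as canonical and send every request that arrives on
# any other host to its twin URL on the canonical domain.

def canonical_url(host, path, canonical_host="www.example.com"):
    """Return the 301 redirect target for a request on the wrong host,
    or None if the request is already on the canonical host."""
    if host.lower() == canonical_host:
        return None  # already canonical: serve the page normally
    return f"http://{canonical_host}{path}"
```

In practice you'd implement this in the web server itself as a permanent (301) redirect, e.g. an Apache mod_rewrite rule, rather than in application code; a 301 tells the crawlers the move is permanent so they consolidate everything onto one domain.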
Just my 2-cents...
I'm just surprised they don't have some more sophisticated way of dealing with it.
But the current setup suits the SEs' purposes -- to present non-duplicate results in response to a searcher's query.
The SE doesn't really care which domain the page is on, just that it (at least somewhat) answered the user's question.
How does blogging software avoid this issue of duplicate content?
Supposedly, Google is able to first strip out all the shared information between pages. The duplicate content comparison is only for the 'content' portion of a page.
But then, this might be more Google hot air. I have a site with hundreds of totally unique articles, but most of them do not show up in a site: search unless you select "repeat the search with the omitted results included". I'm guessing that the shared page template confuses Google into thinking that the content is similar. If so, then Google's shared-element-removal algo is really weak, and many innocent pages are going to be eliminated as similar content.
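To make the idea concrete, here is a toy sketch of how such a filter *might* work (my guess at the general approach, not Google's actual algorithm): discard the shingles contributed by the shared template, then compare what's left using Jaccard similarity over word shingles.

```python
# Toy duplicate-content check: strip template shingles, then compare
# the remaining "content" shingles of two pages.

def shingles(text, k=4):
    """Return the set of k-word shingles in a piece of text."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def similarity(page_a, page_b, template=""):
    """Jaccard similarity of two pages' shingle sets, after discarding
    any shingles that also occur in the shared template."""
    common = shingles(template)
    a = shingles(page_a) - common
    b = shingles(page_b) - common
    if not a and not b:
        return 1.0  # nothing left to compare: treat as identical
    return len(a & b) / len(a | b)
```

If the template-stripping step is weak, two pages that share a heavy template but carry unique articles can still score as near-duplicates, which would explain unique pages getting filtered out of a site: search.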
Perhaps this is why all those auto-generated spam sites ranking high in the SERPs have virtually no page structure: purely unique "content".
>>Supposedly, Google is able to first strip out all the shared information between pages. The duplicate content comparison is only for the 'content' portion of a page.
But that's my point about blogs: it is the content that is duplicated, on the main page and in the archives. It should be identical. And this is standard on every blog, so how does that not trigger the duplicate content penalty?
It is nearly impossible for a dupe to beat the original content just because the original hasn't been indexed yet. Almost all new pages are found within 0-48 hrs of being linked to.
Even if the 2nd page is indexed first, there is still the opportunity for the original to win based upon significantly higher PR.
Using the web archive I could prove that a site ripped off a 300-word paragraph six months after I first published it, but because I amend the templates of my site often, it was my page that got dropped by Google. My page had higher PR, and my site had been running for longer.
I sent a cease-and-desist letter to the offending site, and they have deleted the duplicate content, but my page is still not in the SERPs two weeks later.
Surely Google could run an automatic query against the web archive, check the older incarnation, and penalise the newer pages with a dupe-dump.
For example -- if a variety of sites have the same basic product information (e.g. technical specifications), but they all present a very different commentary surrounding the same products, are they still clumped together, with the first indexed or higher PR version coming out on top?
And if significant duplicate content *does* effectively penalize ranking of an entire site, what might the threshold be at which point you don't warrant a rank?
I'm just trying to establish that whichever content is indexed first is what Google considers original, and that they don't check the date the content was actually published...
I have a question on the duplicate content that I just discovered.
My site (www.mysite.com) is indexed by Google, but I have dropped from #2-3 to about 9, and down to 12, on my two-word keyword phrase. I held the #2-3 position for years. I'm still 1-2 in MSN, Yahoo, Ask Jeeves, etc.
I found out today there are pages named www.mysite.net and www2.mysite.com that bring up my exact home-page content but still show their own URL in the address bar (mysite.net, etc.).
Neither of these duplicate pages has any PR or BLs.
Am I being penalized by Google?
They filed/saved our ENTIRE website. They then changed our company name to a fictitious name. They left everything else the same! They left our phone numbers, address, etc. The WHOIS for the domain was fake. It was pretty clear that someone was trying to get US penalized, as, even to me, it did look like we were trying to put another site up and monopolize the rankings.
QUESTION: CAN THIS REALLY HURT US? WHAT IF WE HADN'T FOUND THIS SITE? Could our site, which has had top-5 organic rankings for years, really be banned due to a bad person's attempt to defraud us?
Needless to say, we did traceroutes and found it to be a local hosting company (SURPRISE!) and the site was shut down within 24 hrs.
>>It is nearly impossible for a dupe to beat the original content because the original hasn't been indexed. Almost all new pages are found within 0-48 hrs of being linked to.
So only if the person who created the site, or who wrote the content on a "homepages" site, understands SEO and the importance of getting linked to quickly to avoid duplicate-content filters will they "qualify" for first consideration? Seems a bit of a hare-brained detection scheme to me, or is "unrealistic expectations" a better phrase?
>>Even if the 2nd page is indexed first, there is still the opportunity for the original to win based upon significantly higher PR.
Again, this means that the person with the better understanding of SEO (which of course would be the spammer 99% of the time) wins.
>>But the current setup suits the SEs' purposes -- to present non-duplicate results in response to a searcher's query.
It "suits" them? Since when can they make decisions based on what "suits" them? How about "law-suits", do those suit them? Google lost many cases where they allowed competitors to advertise in AdWords using trademarked names. If they promote competitors who have stolen copyrighted content, or who simply kept some trademarked names on the page, by "penalising" the real owners, then surely Google will be held accountable? If they want to go down this route, they either have to make sure it works 100% of the time or just forget it and let the results fight it out among themselves; then it's not their problem.
I have learnt the hard way not to update pages.
My post has already gotten too long, can anyone else field this one?
>>Every blog has content on their blog home page. That exact content is duplicated in the archives pages.
Not totally exact, but close enough. It depends on the software. Some entries are archived on individual pages and some in groups of posts.
It's not handled; it's just sort of dealt with. I put some pages up on a blog and let them roll naturally, as-is, to see what happens. It's a total mess with the duplicates; a lot go URL-only.
It isn't a penalty as far as I'm concerned, it's just purging the index and not fully indexing extraneous pages that add nothing to the value of the index. With a blog, I don't see how they'd know which are the correct pages to fully index - they can't unless something is indicated server-side.
I have somehow managed to create a number of links that contain at least one upper-case character. When one of these is clicked, all the links begin displaying the upper-case character. I don't know for sure, but I assume this is a ColdFusion issue.
I am concerned that Google thinks my site is saturated with duplicate content because of this. If so, that might be why it was sandboxed.
Should I change all the links to lower-case, or would that be a waste of time? Do I run the risk of losing PR by halving my (incorrectly perceived) number of pages?
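If it helps, here's a small sketch of what "change all links to lower-case" amounts to, assuming you fix the link-generation code rather than editing pages one by one: normalize every internal URL to a single lower-case form, so the same page can only ever be linked one way. (The function and URLs are illustrative, not taken from the poster's site.)

```python
# Sketch: canonicalize internal link URLs to lower-case. ColdFusion
# serves /Page.cfm and /page.cfm as the same file, but to a crawler
# they are two distinct URLs, i.e. potential duplicate pages.
from urllib.parse import urlsplit, urlunsplit

def normalize(url):
    """Lower-case the scheme, host, and path of a URL, leaving the
    query string untouched (query values can be case-sensitive)."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(),
                       parts.path.lower(), parts.query, parts.fragment))
```

Running every generated link through a normalizer like this means the engines only ever see one spelling of each URL, so the (incorrectly perceived) duplicates stop accumulating.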