This 32 message thread spans 2 pages.
What exactly is "duplicate content"?
What does Google consider duplicate?
I know this question may sound stupid, but I read somewhere that duplicate content is based on the title and snippet and not on the whole page.
Let's see. When Googlebot spiders pages, it can raise flags for certain spamming issues, but as far as I understand, the bots can't compare the content with other pages. So duplicate pages can be indexed.
Then, I suppose that the duplicate content is checked at the moment a visitor makes a search on Google's site. So it makes sense that duplicate content is mainly observed by Google in the title and snippet, which is the one thing made visible to the user prior to the click.
There can be duplicate content (datafeeds for example) but with different titles or text surrounding it. The same way you can have similar titles all over the site and similar surrounding text, but the content is quite different from page to page.
So, what exactly does Google mean by duplicate content? Is it a matter of showing clean search results (titles and snippets) to visitors, or is it a more in-depth analysis of the page?
Thanks in advance
Unfortunately, it seems rather subjective to us mere mortals. I simply use the hell out of Copyscape. We're doing OK now, but Copyscape flags even page footers as dupes. For instance, many companies have something similar to what I have below on all their web pages, and I worry that Google may penalize for it:
"Copyright 2005, Company X. All rights reserved. Reproduction is prohibited without permission of Company X. Contact us at 100322 Main Street, Anytown Maine 00211 Phone 555-111-3333 Fax 555-111-3344 E-mail firstname.lastname@example.org"
Does Google penalize for this?
>Does Google penalize for this?
Quite simply no, well, not in my experience. Nearly all the best sites are template driven, to keep uniformity, with this kind of information on each page.
My sites have 99+% identical pages according to Copyscape; however, Google manages to recognise the differences and place them #1, and I am talking about very minor differences in title bars, meta tags, alt tags, etc., such as:
widget widget widget type A
widget widget widget type AB
widget widget widget type B
Hope this helps.
OptiRex - so would you say that G seems to consider "substantially duplicate content" as an identical title tag, desc tag, navigation and on-page copy when comparing two pages? It'd be great to know whether just one change-up could help you avoid the dupe filters - such as changing the title tag. Especially concerning data feeds...
siteseo - I have no idea how Google's algo tells the minute difference between my pages so well. It's only through sheer doggedness and hard work and experience over the years that I've been able to come to these conclusions.
>consider "substantially duplicate content" as an identical title tag, desc tag, navigation and on-page copy when comparing two pages?
I don't think so. There seems to be more to it than that, since Google recognised a test site I had as having duplicate content even though the pages were substantially different in page weight, navigation, etc.; however, the titlebars, metatags, descriptions and image were all identical. G did not want to know them even when I tried those same pages again with different descriptions... Honestly, I modify pages by the seat of my pants, just watching what happens when I make specific alterations. I think that's what most good SEOs do.
I do know that I can add almost any new trade widget to my sites and they will go straight to the top 3 regardless of who else is there. I'm about to add another 140 pages of widget products from a new Turkish supplier after he saw the ten test pages I created for him go in at #1 within a very short time and, obviously, in front of all his competitors.
Maybe I do well since the sites have been online for 10+ years and their algo recognises the similarity of these pages over the years.
They have all been substantially changed over this period of time from plain html to CSS with new images, titlebars, metatags, design etc so it's not as though they were uploaded and just left there.
Sorry, I have no experience at all with data feeds so not able to comment.
For some reason I've always had it in my mind that you had to change a bit more than 10% of the page in order to avoid getting dinged for duplicate content. When I took an entire tutorial section from one website and added it to another, I deleted ALL metatags, then added new title metas for the new site, and changed 2 or 3 image names & alt text, as well as changing the names & link text for the 2 standard links on the bottom.
I also changed some of the page names and a little bit of the structure of the overall tutorial, as you really can't be too careful when it comes to avoiding the dreaded PR zero that duplicate content could potentially bring.
>you had to change a bit more than 10%
I have to say I've never heard of a figure that low before. Certainly 50% used to be bandied about, and some said 70-75%. However, we don't always give Google's algo "creators" the plaudits they deserve for being able to tell such minute differences so well; well, in my case anyway.
There are supposedly (if I remember correctly) 100-110 differently weighted factors which make up the overall algo, and I guess that's locked up in a safe somewhere.
I did have that list but I lost it when I had to install a new HD!
Just search for - 100 google algo factors [google.com]
Sorry, but I'd have to disagree! If you had to have 75 percent of the page different, tons of websites listing Amazon products, etc. would all be yanked from Google. And as has already been mentioned, nearly all major websites are template based, which means thousands of pages that are very similar.
This topic was covered at length a few years ago here on the forum. I can't remember where the 10% figure came from, but I've had no PR zeros with any of my pages altered around this number.
I have been making pages right, left, and center.
I have been using my main page to create templates, and some of the pages were a little similar whilst I got round to viewing them.
I never got penalized, so I would like to know how they work out the thresholds. Any clues, everyone?
|If you had to have 75 percent of the page different - tons of websites listing amazon products, etc would all be yanked from Google. |
Or downranked. IMHO, it's only reasonable to assume that Google would like to push boilerplate content farther down in its search results, and that it will become more proficient at doing so.
|And as has already been mentioned nearly all major websites are template based - which means thousands of pages that are very similar. |
Why would Google rely on a simple-minded brute-force approach like: "Hmmm....these two NEW YORK TIMES pages have the same page layout, so they must be duplicates"? They didn't hire all their PhDs for nothing. As time goes by, Google's algorithms should become much better at identifying word patterns--whether those word patterns represent duplicate content or are made up of machine-altered text, machine-translated and retranslated text, etc. that is designed to circumvent duplicate-content filters.
Remember, too, that a Google duplicate-content filter doesn't have to be perfect. If it's imperfect, it can be applied to some types of content more aggressively than to others without compromising Google's mission of "organizing the world's information and making it universally accessible and useful."
My theory as to how it works is
- Google Indexes Your Entire Site
- Google Notices Identical Code on every page (or several pages) of your Site
- Google Identifies this as a template and ignores it (they have already followed all links)
- Finally, Google Uses the remaining content to determine what searches the page should come up for
Again, this is just my theory, but it would explain why sites that have separate pages for blue, green, white, etc. widgets are able to be indexed and rank highly for all when the only real difference in the pages is maybe 10% (price, pic, short description).
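The theory above can be sketched in a few lines of Python. This is only an illustration of the idea, not a description of what Google actually does: the function name `strip_template`, the `min_share` threshold and the sample pages are all made up for the example, and each page is assumed to be already split into text blocks.

```python
from collections import Counter

def strip_template(pages, min_share=0.8):
    """Drop blocks that repeat across most pages (the assumed 'template').

    pages: dict mapping a page name to its list of text blocks.
    min_share: hypothetical threshold; a block seen on at least this
    share of pages is treated as template and ignored.
    """
    counts = Counter()
    for blocks in pages.values():
        counts.update(set(blocks))  # count each block once per page
    threshold = min_share * len(pages)
    template = {block for block, n in counts.items() if n >= threshold}
    return {
        name: [block for block in blocks if block not in template]
        for name, blocks in pages.items()
    }

# Made-up pages: same header, navigation and footer, one unique line each.
pages = {
    "blue.html": ["Acme Widgets", "Home | Products | Contact",
                  "Blue widget, $10", "Copyright 2005 Acme"],
    "green.html": ["Acme Widgets", "Home | Products | Contact",
                   "Green widget, $12", "Copyright 2005 Acme"],
    "white.html": ["Acme Widgets", "Home | Products | Contact",
                   "White widget, $11", "Copyright 2005 Acme"],
}
unique = strip_template(pages)
print(unique["blue.html"])  # ['Blue widget, $10']
```

Under these assumptions, only the price and description line survives on each page, which is the "remaining content" the theory says would be used for ranking.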
I think there is a separate penalty for content that is duplicated between domain names. For this it seems Google penalizes the page and possibly the whole site. I posted an example where it seems that, at least on some sites, Google has penalized the entire site for having dup content.
The thread where I talked about this is [webmasterworld.com...]
Look up the patent information Google has with regard to dup content. There is quite a bit of information on how they take and analyze thumbprints. You'll skip the speculation and get to draw your own conclusions based on actual facts and data. Enjoy..
Just because Google has patented some dup content technology, it doesn't mean they are using it at present, or ever, as part of their algo!
The data (link below) is there for those intelligent enough to research and run tests against. Matt - as always, you are welcome to base your actions on speculation too..
Duplicate Content Data [patft.uspto.gov]
I think you've misunderstood me:
>sorry but I'd have to disagree! If you had to have 75 percent of the page different
I didn't say that, I said:
>My sites have 99+% identical pages according to Copyscape
I think you've been reading my figures the wrong way round. However, I must disagree with nzmatt since, in my experience, I have had test pages penalised for duplication.
Fantastic - Thanks for this link...I have been looking for this information for ages.
This has just explained to me how Google views my "seemingly identical" pages since I had one page (!) that I was having a problem with and couldn't get it moved up from the second page.
I know why now:-)
This is an area that fascinates me and just last week I had a response from a SENIOR Google employee about identifying duplicates. Not in regards to how Google actually does it but how it can be done in general (for a dupe content algo I am writing).
This employee has done tons of research on dupe content, and I think the answers to Google's dupe algo are known by him.
One thing to remember is that dupe and specifically near-dupe checking is very resource hungry if you implement many of the known techniques. Google CANNOT be using all known techniques, as it would be impossible to calculate on a database of 8 billion URLs. Scalability is a major factor which impedes Google, and you should think of this when considering the techniques that Google can actually use rather than those that it knows about. All those PhDs can figure out very smart ways of catching things, but it's useless unless they can make it near-linearly scalable.
Techniques are varied and include:
shingling - taking sequences of words in sets of x, so if you had 'mary had a little lamb, its fleece...' and set x to be 4 you would get 'mary had a little' through to 'little lamb its fleece'. All sets are recorded for every page, and a percentage score can be given for duplication based on word sequencing. For example: doc1 has 1000 sets, doc2 has 500 sets, 200 match, so 40% of the content on doc2 is duplicate. There are many factors that influence this, such as stop sequences (like stop words) and the percentage at which a flag is set.
tree matching (see mirror identification) - take a 1000 page site and compare it to another site with 1200 pages. Each has to have a structure for the content, not just the content itself. The way that you would structure a site with 20 categories and 50 sub categories will be different from almost everyone else. Say you wanted to mirror the site: this would give an exact match that is easy to spot. But if you had the aim of creating slightly different content on every page, this method would still catch you by seeing that the structure and page naming are too similar (giving you another thing to worry about when creating new sites based on a similar theme).
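The shingling arithmetic described above (sets of 4 words, then the share of one document's sets that match another's) can be sketched as follows. This is a minimal illustration only: the function names, the word-splitting rule and the sample sentences are assumptions, and it ignores the stop sequences and flag thresholds mentioned in the post.

```python
import re

def shingles(text, size=4):
    """Return the set of overlapping word sequences ("shingles") of length `size`."""
    # Assumed tokenisation: lowercase, punctuation dropped.
    words = re.findall(r"[a-z0-9']+", text.lower())
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}

def duplicate_share(doc, other, size=4):
    """Fraction of doc's shingles that also occur in other (0.0 to 1.0)."""
    own = shingles(doc, size)
    if not own:
        return 0.0
    return len(own & shingles(other, size)) / len(own)

# The example from the post: 7 words give 4 shingles, from
# ('mary', 'had', 'a', 'little') to ('little', 'lamb', 'its', 'fleece').
print(len(shingles("mary had a little lamb, its fleece")))  # 4

original = "Mary had a little lamb, its fleece was white as snow"
variant = "Mary had a little dog, its fleece was white as snow"
print(duplicate_share(variant, original))  # 0.5
```

Changing a single word breaks every shingle that contains it, which is why one altered word in an 11-word sentence already drops the score to 50%. Because shingles are stored in a set, sequences repeated within one page are counted once.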
I cannot give more specifics here (the thread is a bit long as it is) but anyone really interested should have a good look at the research by Andrei Broder (and others that work with him) and Moses Charikar. Looking for research done by people on the Google team is obviously sensible too, many people in the field in the late 90's work at Google so a scan of techniques from then may lead you to people on their staff.
The fact remains, I think, that the duplicate content filter is not always applied.
I've got the following scenario right now...
I had a page 6th in the SERPS for one keyword.
Its URL was www.mysite.com/directory/keyword.html
I needed to expand on that page and put up related pages.
I moved it into a subdirectory so its Google listing (without the index.html) is now
I put a 301 on the old address, as Google recommends.
Right now, the old page (which was being visited daily) hasn't attracted any more visits because there's no active link to it.
The new page is being spidered daily and has a new fresh date each day.
The NEW page is an indented listing under the old page, rather than (as I'd hoped) the other way round.
The two page contents (the old cache from the now moved page and the new cache in its new location) are absolutely identical.
I've every reason to believe Googlebot came to the old page on one last daily visit and saw the 301 (and followed it) but hasn't deleted the old page.
I therefore have two duplicate pages in the index. (OK, to be pedantic, I have one duplicate page!)
I know Google says it takes 6-8 weeks for the 301 to percolate through, but I think this example shows that the duplicate content filter isn't applied ALL the time...
The last post illustrates the problem of actually using the knowledge that Google has. It would be simple to see that the pages are identical, but Google can't run the algo all of the time, my guess is that a dupe check is performed periodically.
|Google can't run the algo all of the time, my guess is that a dupe check is performed periodically. |
Couldn't they just flag the top 20 or 50 or 100 results from certain types of searches (i.e., "money" keyphrases) for more detailed analysis by a duplicate-content filter at their convenience? In the real world, most users aren't going to notice if there are 500 boilerplate pages for the Hotel Whatsis or Elbonian home mortgages in the bottom 5,000,000 rankings for a competitive search phrase.
I think your suggestion just leads to further problems. Try reading some articles on large-scale dupe checking to see why picking out search phrases as the starting point for dupe checking does not work. Dupe checking is about finding similar structures and content, not checking for duplicates in the dataset that results from any search algo.
What is dupe content?
a) Strip duplicate headers, menus, footers (eg: the template)
This is quite easy to do mathematically. You just look for string patterns that match on more than a few pages.
b) Content is what is left after the template is removed.
Comparing content is done the same way with pattern matching. The core is the same type of routines that make up compression algos like Lempel-Ziv (lz).
This type of pattern matching is sometimes referred to as a sliding dictionary lookup. You build an index of a page (dictionary) based on (most probably) words. You then start with the lowest denominator and try to match it against other words in other pages.
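As a rough illustration of matching word patterns between two pages, here is a sketch that uses Python's standard `difflib.SequenceMatcher` as a stand-in. It is not Lempel-Ziv and not a claim about Google's routines; the function name `shared_runs`, the `min_words` cut-off and the sample sentences are assumptions made for the example.

```python
from difflib import SequenceMatcher

def shared_runs(text_a, text_b, min_words=3):
    """List runs of at least `min_words` consecutive words found in both texts."""
    a = text_a.lower().split()
    b = text_b.lower().split()
    # Compare word lists rather than characters; autojunk is disabled so
    # frequent words are not silently ignored on longer pages.
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    return [
        " ".join(a[m.a:m.a + m.size])
        for m in matcher.get_matching_blocks()
        if m.size >= min_words
    ]

page_a = "our blue widget ships free within two days"
page_b = "the green widget ships free within two days worldwide"
print(shared_runs(page_a, page_b))  # ['widget ships free within two days']
```

The longer the shared runs relative to the page, the closer the two pages are to being duplicates once the template has been removed.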
How close is duplicate content?
A few years ago, an intern (*not* Pugh) who helped work on the dupe content routines (2000?), wrote a paper (now removed). The figure 12% was used. Even after studying, we are left to ask how that 12% is arrived at.
Cause for concern with some sites?
Absolutely. People that should worry:
a) repetitive content for language purposes.
b) those that do auto generated content with slightly different pages (such as weather sites, news sites, travel sites).
c) geo targeted pages on different domains.
d) multiple top level domains.
Can I get around it with random text within my template?
Debatable. I have heard some say that if a site of any size (more than 20 pages) does not have a detectable template, you are subject to another quasi penalty.
When is dupe content checked?
I feel it is checked as a background routine. It is a routine that could easily run 24x7 on hundreds of machines if they wanted to crank it up that high. I am almost certain there is a granularity setting to it where they can dial up or dial down how closely they check for dupe content. When you think about it, this is not a routine that would actually have to be run all the time, because once they flag a page as a dupe, that would take care of it for a few months until they came back to check again. So I agree with those that say it isn't a set pattern.
Additionally, we also agree that G's indexing isn't as static as it used to be. We are into the "update all the time" era where the days of GG pressing the button are done because it is pressed all the time. The tweaks are on-the-fly now - it's pot luck.
What does Google do if it detects duplicate content?
Penalizes the second one found (with caveats). (As with almost every Google penalty, there are exceptions we will get to in a minute).
What generally happens is the first page found is considered to be the original prime page. The second page will get buried deep in the results.
The exception (as always), we believe, is high Page Rank. It is generally believed by some that mid-PR7 is considered the "white list" where penalties are dropped on a page and, quite possibly, an entire site. This is why it is confusing to SEOs when someone says they absolutely know the truth about a penalty or algo nuance. The PR7/Whitelist exception takes the arguments and washes them.
Who is best at detecting dupe content?
Inktomi used to be the undisputed king, but since G changed their routines (late 2003/Florida?), G has detected the tiny page to the large duplicate page without fail.
On the other hand, I think we have all seen some classic dupe content that has slipped by the filters with no explanation apparent.
For example, these two pages:
The 10,000 unauthorized rips: (10k is best count, but probably higher):
Successful Site in 12 Months with Google Alone [google.com]
All in all, I think the dupe content issue is far overrated and easy to avoid with quality original content. If anything, it is a good way to watch a competitor penalized.
>> easy to avoid with quality original content. If anything
Brett ... you had me going until that line! Come on... secret sauce please. :)
G may have 8 billion pages indexed but many will never be downloaded by humans. Google doesn't have to dup check every page, just the "hotlist", the top 100 million, or whatever, that are popular.
Has anyone written or been replied to from G concerning this dupe content filter? Are they even aware of the problems it is causing on honest websites that have been hijacked? I just wonder if they even are aware of these problems or is it just an "interesting discussion" on messageboards? Maybe if they were made aware of such disconcerting problems they would strive to block the offending dupe/ hijack sites and return good websites to prior positions in SERPS. I say again...does Google even know about this? Has anyone written them? Have they replied?
[edited by: engine at 6:20 pm (utc) on Jan. 18, 2005]
[edit reason] formatting [/edit]
|does Google even know about this? Has anyone written them? Have they replied? |
2. yes, many times, many people.
3. yes, in a generic, thanks we'll look into it sort of way.
Just expanding this a little further: when testing various aspects of duplicates, the IP plays a role in this, otherwise you'd easily be able to waste a competitor's site.
There is a quicker method to having stolen content removed than writing Google.
I had a client's site whose content was stolen by a competitor, who took several original lines and even original pictures and used them on a multitude of freebie sites he had set up that linked to his main site. The main site was a valid competitor to my client and didn't have stolen content; we considered it worthy competition, but this guy was using the stolen content on his freebie sites, linked to his valid site, to try to boost its rank.
We redesigned the client's site with new and improved content, and it showed up on the competitor's multiple freebie sites within 24 hours. Then the client wrote the culprit and asked him to remove the stolen content. After several days with no reply, the client wrote the hosting company (a freebie host with very strict rules about copyright), and the site and all its copies were removed within 24 hours (except for the valid competition site).
I am a real estate agent who webmasters my own sites, so I am in no way an expert but have a deep interest in all of this so I read up all the time, test things for myself and try to get a basic handle on this as it evolves.
I usually just lurk, but here goes...
For the entire year (I'm pretty sure) of 2004 I was #2 on Google for my main KWP "city real estate", and the #1 site was/is hosted by the same company, with many of the same info pages, many of the same words on the home page (some paragraphs were even identical), and the exact same IP address. One major difference was that my competitor's Meta title was "city real estate" and mine was/is "city real estate - city homes for sale".
Then came our last update. I got spanked, and had done nothing to the site in the weeks before the update, but still I got bumped down from 2 to 14.
This of course baffled me, and I began searching high and low for some rhyme or reason. One thing that struck me about a week later was that my competitor had changed their Meta title to match mine. Now they are "city real estate and city homes for sale" while mine remains "city real estate - city homes for sale", with the only difference being they have "and" and I have "-". Now, I know copying Meta tags is not copyright infringement, but it does make me wonder what effect all of these factors have had on my site vs. the #1 site.
Now with their new Meta title, we are very close to being the same, at least as far as our home pages were concerned. So, I changed some wording around on my HP and have moved up to #11 now, but that is not good enough... I need help.
Does anyone see an issue with the above mentioned? Or did I simply slip, or did 12 of my competitors all of a sudden get better?
Any input would be greatly appreciated.