or base the decision on how old the page is.
oldest page wins, younger pages ranked lower/not at all?
seems fairly straightforward to me.
Google Patent: Detecting duplicate and near-duplicate files [patft.uspto.gov]
[edited by: ciml at 4:26 pm (utc) on Mar. 11, 2004]
[edit reason] Shortened for sideways scrolling. [/edit]
If they do ban them, Webmasters can cause them problems. If they don't, Webmasters can cause them problems. I don't think that they can write the perfect algo because they just try to figure out what mix produces the best results without worrying too much about specific sites.
As I said in another thread, keeping the oldest page is sensible, but only if the definition of oldest is appropriate.
For instance, if someone rips off your pages and you then update your pages, you would not want Google to ban the newer pages. Therefore, "oldest" must mean "first page to be indexed".
It seems likely that Google would keep a record of when pages were first indexed, but if they don't, then there is no way for them to decide, based on age, which page of several duplicates is the original.
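To make the distinction above concrete, here is a minimal sketch of the two definitions of "oldest". The record layout, the URLs and both function names are hypothetical, invented for illustration; nothing here is known to reflect how Google stores or compares pages.

```python
from datetime import date

# Hypothetical records: (url, first_indexed, last_modified).
# The original was indexed first but has since been updated;
# the copy was indexed later and never touched again.
pages = [
    ("http://original.example/article", date(2003, 5, 1), date(2004, 3, 1)),
    ("http://copy.example/article", date(2003, 9, 12), date(2003, 9, 12)),
]


def pick_by_first_indexed(duplicates):
    """Return the URL of the duplicate that was indexed first."""
    return min(duplicates, key=lambda page: page[1])[0]


def pick_by_last_modified(duplicates):
    """Naive alternative: treat the least recently modified page as the oldest."""
    return min(duplicates, key=lambda page: page[2])[0]
```

With these sample records the first function keeps the original, while the naive one keeps the copy, which is exactly the failure described above when the original gets updated.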
The idea of a duplicate content penalty always seemed odd to me.
I think that Google's near-duplicate filtering is rather good; pick the top listed URL for a page and bury the others unless "repeat the search with the omitted results included" is clicked ( &filter=0 )
stripey - "seems fairly straightforward to me."
then you've never tried to build a search engine then?
there are a million and one variations on why there is a duplicate page of content. Like kaled said, if you update your content the page is newer. Also, if you move the page to a different directory of your site then that page will not be the first indexed page of the copy even though you were the original. An example of my own recently was a move from .co.uk to .com with all my pages. I was the original writer of some articles that have been ripped off by other webdesigners and now I ranked lower as I have no inbound links to the new site.
power-iq: I didn't see anything in that document that pointed to banning sites, just identifying them. Can you be more specific?
slowmove: I really started this thread to get a discussion on whether or not Google actually bans pages, not to discuss the effects of whether they do or not. There are already several threads on that.
I do believe that Google have a good algorithm to deal with duplicates, but I don't think it can penalise them as it would be hit or miss, and I definitely do not believe that they ban them.
I feel this is important as it is another thing that webmasters have been spouting for months on end - 'duplicate pages get banned.' - I don't believe it's true and I think it's something we should be worried about as other people could get ranked above us for duplicate content if we aren't careful. (like my recent web site change losing me rankings for my own articles)
|then you've never tried to build a search engine then? |
|there are a million and one variations on why there is a duplicate page of content. Like kaled said, if you update your content the page is newer. |
it's also likely to be different as well
|Also, if you move the page to a different directory of your site then that page will not be the first indexed page of the copy even though you were the original. |
redirect redirect redirect
|An example of my own recently was a move from .co.uk to .com with all my pages. I was the original writer of some articles that have been ripped off by other webdesigners and now I ranked lower as I have no inbound links to the new site. |
again, redirects would have told G that the articles were moved, not gone. as for building up links, well if you move the site, you gotta take the consequences IMO.
stripey, you're not really contributing to the discussion, just trying to point out flaws in my poor site migration planning.
You 'just have to take the consequences'. What in the world does that have to do with the discussion?
The topic is does Google ban duplicate pages. If you want to look big and clever go and join a discussion on redirecting and link building and tell everyone else the best way to do everything.
Anyway, for those of you who are interested in the actual topic at hand, I've just been checking on various search engines and it looks as though Yahoo have the same procedure. Duplicates are not in fact deleted, just pushed below the 'repeat the search with the omitted results included' link.
Unfortunately, for some of the articles I was researching, the authors themselves had been pushed below this line. Comparing Google and Yahoo results for the same terms, though, does show that Google seem to be better at picking out the authors.
My guess, this is due to their recent 'Authority Site' ranking tweaks that everyone is complaining about.
i have a problem. :-)
I think Google is filtering out my site because a web directory has listed it.
My index page has vanished from Google, and instead a page of the directory is still at the top
with my title, my description and a redirect/meta refresh to my site.
Could it be that Google thinks it is a duplicate content site?
Some days ago I changed the title of my site, and today the new title appears
in Google search results with the URL of the directory.
I tried writing to the webmaster to have this entry deleted,
but there has been no response so far.
Can somebody give me advice on how to solve the problem?
Google has always been able to detect exact duplicate pages and "merge" them so that only one lists (the one with the higher PR). I have several carbon copies of my main site, each with its own unique collection of backlinks, and only one can be found.
But now, who knows? It now seems to be filtering near-exact copies of pages, but not always very reliably.
landmark is absolutely right. I have a site with over 40 mirrors that only exist for the purpose of off-loading the main site. The mirror sites' pages are also in Google's index and have their own PR, but when I search for something, only the pages of the main site appear in the results. Neither the main site nor any of the mirror pages are banned.
This is something that Google does right while other search engines mess up everything in the SERPs. This just reminds me how far ahead of other search engines Google is, which makes me wonder if Yahoo dropping Google's search engine results was a step forward or a step backwards for Yahoo.
mlemos, I don't think it was a step forward in relevant search results but it was certainly a step forward in earning cash! Yahoo should have messed up their Alltheweb search engine first to test things out instead of just jumping in at the deep end with their highest trafficked site.
Which brings me to another point, seems like Alltheweb is even worse at filtering out duplicates (at this point I am assuming that everyone here is now agreed that the 'duplicate penalising and banning' warnings are just another webmaster myth?). Large numbers of academic studies are having their copies ranked higher than them.
What's bothering me is that all of you seem to be in agreement that duplicate pages are not banned, yet there are threads every single day in webmaster forums with people spouting the 'you'll be banned' doomsday threats. I'm not just talking about new webmasters either; a lot of senior members keep repeating the slogan 'Google bans duplicate pages'. It worries me about the rest of what is being said in here ...
It may sound like a stupid question, but why would anyone want duplicate pages?
There are two reasons for duplicate pages. One is an attempt to fool search engines. The other is to mirror content. Mirroring means the same information is available in more than one place, and works very well when demand exceeds what one particular site can handle.
>> I am assuming that everyone here is now agreed that the 'duplicate penalising and banning' warnings are just another webmaster myth?
I don't get it. Half the posts above give examples of where duplicate content has been removed from the SERPs.
|There are two reasons for duplicate pages. |
There are a few more reasons - content syndication springs immediately to mind, as do press releases, newswires and affiliate sites using the same content as their affiliate providers.
I've always wondered how Google deals with content syndication - especially on a large scale, such as travel content etc. I've seen the same travel content used over and over on different sites and never any penalties... I guess the structure of the page is looked at as well.
In answer to the poster questioning Google myths:
Dupe pages, when spotted, are simply filtered out - it's not a penalty as such. On mega sites like the BBC, there are bound to be dupes all over the place in different folders, designed to be served up to different locations etc. It would be unfair to penalise.
The trouble with dupes is that which one is filtered out is determined by Google and its increasingly rickety algo.
Thus, whilst the entire universe might point to a particular page, Google has a habit of filtering this one out in favour of an old orphaned dupe, whose content is ancient, with zip PR.
Keeps us all amused though, although the smiles are starting to look a bit thin these days :)
|Dupe pages, when spotted, are simply filtered out - it's not a penalty as such. |
My site (80% of it) has been ripped off by some ******* in the Philippines. Google's index does not block duplicate pages. I found this site by way of a Google search.
SyntheticUpper - thank you for pointing out that important distinction between a site being banned (negative association) and filtered (neutral).
slyolddog - "I don't get it. Half the posts above give examples of where duplicate content has been removed from the SERPs."
Not one of them did. Try reading them again and please copy and paste any statement that says they were removed.
I'd like to make a distinction. I've re-read some of the threads on WebmasterWorld and noticed that some of the people saying they had been removed for duplicate content were also giving the impression that it was their own mirrors that were being removed. I'm assuming that they were probably running several mirror sites with spam content to try and trick the search engines. (That's the impression they give from the type of questions they are asking.) I believe that they have not been removed for duplicate content but for spamming the search engine. This may be where the confusion has arisen. Again, I feel it is impossible for a search engine to have a policy of banning duplicate content if they want to provide relevant results. Look at what happened with PageRank and S***hKing. The same would happen if duplicate page removal was true.
A couple of good points in here so far, but I think we need a clearer picture of how Google might 'filter' duplicate pages. Choosing the oldest is no good as a guideline because even archived news articles in a site generally get updated with new sitemaps, internal linking structure, maybe add a small article at the bottom directing to more recent news articles on the subject. Any good site will never leave a page exactly as it is for any length of time, that's not what the internet is about - newspaper clippings go in libraries.
They can't pick the 'first one they index' as being the authoritative site either, as this assumes that Googlebot knows all the web and can pick up new content within seconds. A link was posted in a thread last week about latent semantic indexing; on visiting it, it was not the original article but a copy that mentioned where the original article came from. Sure enough, the one linked to by WebmasterWorld was indexed within a couple of days (probably everyone's Toolbars!) but I don't even know if the original has been picked up yet.
Any ideas on how Google can actually filter duplicates? Using PageRank could also be a problem as copyright infringers are also normally PageRank thieves too.
I'm using the idea of articles as this is the most common form of copyright infringement. The main problem is whole sites being copied, which has happened to a lot of people in here, me included on several occasions. I have to work hard on a daily basis just to ensure that the duplicate doesn't rank higher than me. THAT is annoying!
FWIW and IMHO I think that there is a duplicate site issue. This can be particularly stark in a situation where [domain.com...] and [domain.com...] resolve to the same pages with a 200 OK server response. If you can't switch off one of these "subdomains" then you can try putting in a permanent redirect, but some folks say that these can cause more harm than good. In any case you should use absolute URLs in your site navigation so that Googlebot only finds the odd page duplicated, not the whole site. Assuming most of your important backlinks come into your index page (a big assumption, I know), absolute URLs will definitely not do harm and should do you good.
With regard to dupe pages, IMHO certain elements of the pages being exact dupes is more harmful than others. I think that titles, descriptions, <h> tagged text and anchor text are critical in this. But hey, that's just my opinion based on anecdotal evidence and supposition.
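The permanent-redirect idea above can be sketched as a small helper that decides where a request should be sent. This is only an illustration under stated assumptions: `CANONICAL_HOST`, the example.com hostnames and the `redirect_target` function are hypothetical, and a real site would normally do this in the web server configuration instead.

```python
from urllib.parse import urlsplit, urlunsplit

# Assumption: the single hostname you want search engines to index.
CANONICAL_HOST = "www.example.com"


def redirect_target(url):
    """Return the URL to answer with a 301 redirect, or None if no redirect is needed."""
    parts = urlsplit(url)
    if parts.hostname == CANONICAL_HOST:
        return None  # already on the canonical host, serve the page with 200 OK
    # Keep scheme, path, query and fragment; swap only the host.
    # Note: any port or credentials in the original URL are dropped here.
    return urlunsplit(
        (parts.scheme, CANONICAL_HOST, parts.path, parts.query, parts.fragment)
    )
```

The same canonical hostname would then be the one used in the absolute URLs of the site navigation, so that a crawler following internal links stays on one host.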
They could of course be using semantic finger printing ;)
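For readers wondering what fingerprint-style near-duplicate detection can look like, one widely known textbook approach is to compare sets of overlapping word sequences ("shingles") using Jaccard similarity. The sketch below shows that generic technique only; the function names, the shingle size `k` and the `threshold` are assumptions chosen for illustration, and nothing in this thread confirms that Google works this way.

```python
def shingles(text, k=3):
    """Return the set of k-word shingles (overlapping word windows) in text."""
    words = text.lower().split()
    if len(words) < k:
        # Too short for a full window: treat the whole text as one shingle.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets: size of intersection over size of union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def near_duplicates(text_a, text_b, threshold=0.8, k=3):
    """True if the two texts share at least `threshold` of their shingles."""
    return jaccard(shingles(text_a, k), shingles(text_b, k)) >= threshold
```

A comparison like this only says that two pages are near-copies. It says nothing about which one is the original, which is the harder problem discussed in this thread.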
I guess I am missing the point but if you want to see duplicate content which Google has filtered out you just need to add &filter=0 to the google query string.
That shows all pages, duplicates and all. Without it you just see the filtered results.
So if you want to see how Google finds duplicates, run an unfiltered query on some unique content on a site which has duplicates in the index.
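As a small sketch of the tip above, the helper below builds a search URL with and without the `filter=0` parameter mentioned in this thread. The function name and the `include_omitted` flag are hypothetical, and whether the parameter still behaves as described here is an assumption.

```python
from urllib.parse import urlencode


def google_search_url(query, include_omitted=False):
    """Build a Google search URL, optionally appending filter=0."""
    params = {"q": query}
    if include_omitted:
        # filter=0 is the parameter named in this thread for showing omitted results.
        params["filter"] = "0"
    return "https://www.google.com/search?" + urlencode(params)
```

Running the same quoted sentence from your page through both URLs and comparing the result lists shows which copies are being held back.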
I found a website just 10 minutes ago which is duplicate content: 2 websites completely identical, one a .com domain and the other hosted with a free web hosting company!?
both identical in every way, both registering a PR4!
where is the love?
|The only way around it is to judge the duplications purely on inbound links. |
I think this old thread should help :-
Duplicates and the challenges search engines face
They did to me.
|both identical in every way, both registering a PR4! |
If you check the Google cache for both URLs, you may find that they have been merged.
Taking two URLs, merging them and crediting both sets of backlinks and PR to the remaining one isn't a penalty, it's a bonus.
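The merging behaviour described above (keep one URL, credit it with all the backlinks) can be sketched as follows. This is a toy illustration, not Google's implementation: the page dictionaries, the `pr` field and the function name are hypothetical, and exact duplicates are detected here simply by hashing the page content.

```python
import hashlib
from collections import defaultdict


def merge_exact_duplicates(pages):
    """Group pages with identical content and keep one URL per group.

    pages: list of dicts with 'url', 'content', 'pr' and 'backlinks' keys.
    Returns a list of dicts with the surviving 'url' and the combined 'backlinks'.
    """
    groups = defaultdict(list)
    for page in pages:
        digest = hashlib.sha256(page["content"].encode("utf-8")).hexdigest()
        groups[digest].append(page)

    merged = []
    for group in groups.values():
        # Keep the copy with the highest PR, as described earlier in the thread.
        keeper = max(group, key=lambda p: p["pr"])
        # Credit the surviving URL with the backlinks of every copy.
        backlinks = set()
        for page in group:
            backlinks |= set(page["backlinks"])
        merged.append({"url": keeper["url"], "backlinks": backlinks})
    return merged
```

Under this model a mirror never hurts the surviving URL; it only adds its backlinks, which is why the post above calls it a bonus and not a penalty.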
"both identical in every way"
Could also be that the hosting IS on the free site, and the webmaster just registered a domain name and has it point to that place. Most of my websites are available through their domain names and also through their account number and the name of the hosting company.