If Google banned duplicate pages then a devious webmaster (which I generally tend to think like) would simply create several carbon copies of their competitors' pages, which would get those competitors either banned or seriously dropped in ranking.
I checked and there are many terms for which there are hundreds of pages of almost identical content (the only differences being a short intro and a couple of links). You can check yourself; the easiest way is to search for tutorials, as there are thousands of 'copy and paste' duplicates of each.
All those who have been bashing on about duplicate pages being banned for so long, have I misunderstood you? As far as I can see, Google can't remove pages that are duplicated as they can't always be sure that they are not removing the original. The only way around it is to judge the duplications purely on inbound links.
Google Patent: Detecting duplicate and near-duplicate files [patft.uspto.gov]
[edited by: ciml at 4:26 pm (utc) on Mar. 11, 2004]
[edit reason] Shortened for sideways scrolling. [/edit]
For instance, if someone rips off your pages and you then update your pages, you would not want Google to ban the newer pages. Therefore, "oldest" must mean "first page to be indexed".
It seems likely that Google would keep a record of when pages were first indexed, but if they don't, then there is no way for them to decide, based on age, which page of several duplicates is the original.
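To make the "first page to be indexed" idea concrete, here is a minimal Python sketch. It assumes the engine keeps a first-indexed date for every URL; the `Page` record, its field names and the example URLs are all made up for illustration, and nothing here is confirmed to be how Google works.

```python
from dataclasses import dataclass
from datetime import date
from typing import List


@dataclass
class Page:
    url: str
    first_indexed: date  # assumption: the date the crawler first saw this URL


def pick_original(duplicates: List[Page]) -> Page:
    """Return the page that was indexed first among a set of duplicates."""
    return min(duplicates, key=lambda p: p.first_indexed)


# Hypothetical duplicate cluster: the copy was indexed later than the author's page.
pages = [
    Page("http://copycat.example/article.html", date(2004, 2, 1)),
    Page("http://author.example/article.html", date(2003, 6, 15)),
]
print(pick_original(pages).url)
```

Updating a page would not change its first-indexed date under this scheme, but the result is only as good as the crawl: if the copy happens to be crawled before the original, the copy wins.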
I think that Google's near-duplicate filtering is rather good; pick the top listed URL for a page and bury the others unless "repeat the search with the omitted results included" is clicked ( &filter=0 )
then you've never tried to build a search engine then?
there are a million and one variations on why there is a duplicate page of content. Like kaled said, if you update your content the page is newer. Also, if you move the page to a different directory of your site then that page will not be the first indexed page of the copy even though you were the original. An example of my own recently was a move from .co.uk to .com with all my pages. I was the original writer of some articles that have been ripped off by other webdesigners and now I rank lower as I have no inbound links to the new site.
power-iq: I didn't see anything in that document that pointed to banning sites, just identifying them. Can you be more specific?
slowmove: I really started this thread to get a discussion on whether or not Google actually bans pages, not to discuss the effects of whether they do or not. There are already several threads on that.
I do believe that Google has a good algorithm to deal with duplicates, but I don't think it can penalise them as it would be hit or miss, and I definitely do not believe that they ban them.
I feel this is important as it is another thing that webmasters have been spouting for months on end - 'duplicate pages get banned.' - I don't believe it's true and I think it's something we should be worried about as other people could get ranked above us for duplicate content if we aren't careful. (like my recent web site change losing me rankings for my own articles)
then you've never tried to build a search engine then?
there are a million and one variations on why there is a duplicate page of content. Like kaled said, if you update your content the page is newer.
it's also likely to be different as well
Also, if you move the page to a different directory of your site then that page will not be the first indexed page of the copy even though you were the original.
redirect redirect redirect
An example of my own recently was a move from .co.uk to .com with all my pages. I was the original writer of some articles that have been ripped off by other webdesigners and now I rank lower as I have no inbound links to the new site.
again, redirects would have told G that the articles were moved, not gone. as for building up links, well if you move the site, you gotta take the consequences IMO.
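For anyone unsure what such a redirect looks like, here is a rough Python sketch of a permanent (301) redirect from every path on an old domain to the same path on a new one. In practice this is normally done in the web server's own configuration; the `NEW_SITE` value, the handler name and the port are placeholders, not anything taken from the site discussed above.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

NEW_SITE = "http://www.example.com"  # hypothetical new domain


def new_location(path: str) -> str:
    """Map a path on the old site to the same path on the new site."""
    return NEW_SITE + path


class MovedHandler(BaseHTTPRequestHandler):
    """Answer every GET or HEAD request with a permanent redirect."""

    def do_GET(self):
        self.send_response(301)  # 301 = Moved Permanently
        self.send_header("Location", new_location(self.path))
        self.end_headers()

    do_HEAD = do_GET


def run(port: int = 8080) -> None:
    """Start the redirecting server (blocks until interrupted)."""
    HTTPServer(("", port), MovedHandler).serve_forever()
```

Calling `run()` on the old host would answer a request for `/articles/page.html` with a 301 pointing at the same path on the new domain.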
You 'just have to take the consequences'. What in the world does that have to do with the discussion?
The topic is does Google ban duplicate pages. If you want to look big and clever go and join a discussion on redirecting and link building and tell everyone else the best way to do everything.
Anyway, for those of you who are interested in the actual topic at hand, I've just been checking on various search engines and it looks as though Yahoo has the same procedure. Duplicates are not in fact deleted, just pushed below the 'repeat the search with the omitted results included' link.
Unfortunately, for some of the articles I was researching, the authors themselves had been pushed below this line. Comparing Google and Yahoo results for the same terms, though, does show that Google seems to be better at picking out the authors.
My guess, this is due to their recent 'Authority Site' ranking tweaks that everyone is complaining about.
i have a problem. :-)
I think Google is filtering out my site because a web directory has listed my site.
My index page has vanished from G. and instead a page of the directory is still at the top
with my title, my description and a redirect/meta refresh to my site.
Could it be that Google thinks it is a duplicate content site?
Some days ago I changed the title of my site, and today the new title with the
URL of the directory appears in Google search results.
I tried writing to the webmaster to ask them to delete this entry,
but there has been no response so far.
Can somebody give me some advice on how to solve the problem?
But now, who knows? It now seems to be filtering near-exact copies of pages, but not always very reliably.
This is something that Google does right while other search engines mess up everything in the SERPs. This just reminds me that Google is so far ahead of other search engines that it makes me wonder if Yahoo dropping Google search engine results was a step forward or a step backwards for Yahoo.
Which brings me to another point, seems like Alltheweb is even worse at filtering out duplicates (at this point I am assuming that everyone here is now agreed that the 'duplicate penalising and banning' warnings are just another webmaster myth?). Large numbers of academic studies are having their copies ranked higher than them.
What's bothering me is that all of you seem to be in agreement that duplicate pages are not banned, yet there are threads every single day in webmaster forums with people spouting the 'you'll be banned' doomsday threats. I'm not just talking about new webmasters either; a lot of senior members keep repeating the slogan 'Google bans duplicate pages'. It worries me about the rest of what is being said in here ...
I don't get it. Half the posts above give examples of where duplicate content has been removed from the SERPs.
There are two reasons for duplicate pages.
There are a few more reasons - content syndication springs immediately to mind, as do press releases, newswires and affiliate sites using the same content as their affiliate providers.
I've always wondered how Google deals with content syndication - especially on a large scale, such as travel content etc. I've seen the same travel content used over and over on different sites and never any penalties... I guess the structure of the page is looked at as well.
Dupe pages, when spotted, are simply filtered out - it's not a penalty as such. On mega sites like the BBC, there are bound to be dupes all over the place in different folders, designed to be served up to different locations etc. It would be unfair to penalise.
The trouble with dupes is that which one is filtered out is determined by Google and its increasingly rickety algo.
Thus, whilst the entire universe might point to a particular page, Google has a habit of filtering this one out in favour of an old orphaned dupe, whose content is ancient, with zip PR.
Keeps us all amused though, although the smiles are starting to look a bit thin these days :)
Not one of them did. Try reading them again and please copy and paste any statement that says they were removed.
I'd like to make a distinction. I've re-read some of the threads on WebmasterWorld and noticed that some of the people saying they had been removed for duplicate content also gave the impression that it was their own mirrors that were being removed. I'm assuming that they were probably running several mirror sites with spam content to try and trick the servers. (That's the impression they give from the type of questions they are asking.) I believe that they have been removed not for duplicate content but for spamming the search engine. This may be where the confusion has arisen. Again, I feel it is impossible for a search engine to have a policy of banning duplicate content if they want to provide relevant results. Look at what happened with PageRank and S***hKing. The same would happen if duplicate page removal was true.
A couple of good points in here so far, but I think we need a clearer picture of how Google might 'filter' duplicate pages. Choosing the oldest is no good as a guideline because even archived news articles in a site generally get updated with new sitemaps, internal linking structure, maybe add a small article at the bottom directing to more recent news articles on the subject. Any good site will never leave a page exactly as it is for any length of time, that's not what the internet is about - newspaper clippings go in libraries.
They can't pick the 'first one they index' as being the authoritative site either, as this assumes that Googlebot knows all the web and can pick up new content within seconds. A link was posted in a thread last week about latent semantic indexing; on visiting it, it was not the original article but a copy that mentioned where the original article came from. Sure enough, the one linked to by WebmasterWorld was indexed within a couple of days (probably everyone's Toolbars!) but I don't even know if the original has been picked up yet.
Any ideas on how Google can actually filter duplicates? Using PageRank could also be a problem as copyright infringers are also normally PageRank thieves too.
I'm using the idea of articles as this is the most common form of copyright infringement. The main problem is whole sites being copied, which has happened to a lot of people in here, me included on several occasions. I have to work hard on a daily basis just to ensure that the duplicate doesn't rank higher than me, THAT is annoying!
With regard to dupe pages, IMHO certain elements of the pages being exact dupes are more harmful than others. I think that titles, descriptions, <h> tagged text and anchor text are critical in this. But hey, that's just my opinion based on anecdotal evidence and supposition.
They could of course be using semantic fingerprinting ;)
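As a rough illustration of what fingerprinting near-duplicates can look like, here is a small Python sketch using word shingles and Jaccard similarity. This is a common textbook technique swapped in for the sake of example, not "semantic" fingerprinting and not necessarily anything Google does; the function names, the shingle size `k=3` and the sample texts are all assumptions.

```python
import re
from typing import Set, Tuple


def shingles(text: str, k: int = 3) -> Set[Tuple[str, ...]]:
    """Return the set of k-word sequences (shingles) found in the text."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < k:
        # Too short for a full shingle: treat the whole text as one.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a: str, b: str, k: int = 3) -> float:
    """Similarity between 0.0 and 1.0: shared shingles over all shingles."""
    sa, sb = shingles(a, k), shingles(b, k)
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


# Made-up sample texts: a tutorial, a copy with a short intro added, and an unrelated page.
original = "how to build a simple web page with html and css step by step"
copy = "welcome to my site " + original
unrelated = "the quick brown fox jumps over the lazy dog near the river bank"

print(jaccard(original, copy))       # high score: near-duplicate
print(jaccard(original, unrelated))  # no shingles in common
```

A copy that only adds a short intro or a couple of links still shares most of its shingles with the original, so it scores high. Note that a score like this only says two pages are near-duplicates; it says nothing about which one is the original.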
I guess I am missing the point but if you want to see duplicate content which Google has filtered out you just need to add &filter=0 to the google query string.
That shows all pages, duplicates and all. Without it you just see the filtered results.
So if you want to see how Google finds duplicates, run an unfiltered query on some unique content on a site which has duplicates in the index.
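If you want to build such a query without editing the address bar by hand, something like the following Python sketch would do it. The `filter=0` parameter is the one mentioned above; the helper name `google_search_url` is made up, and the base URL and `q` parameter are the usual Google search ones.

```python
from urllib.parse import urlencode


def google_search_url(query: str, include_omitted: bool = False) -> str:
    """Build a Google search URL, optionally with duplicate filtering switched off."""
    params = {"q": query}
    if include_omitted:
        params["filter"] = "0"  # show the results normally omitted as duplicates
    return "https://www.google.com/search?" + urlencode(params)


print(google_search_url("css tutorial", include_omitted=True))
# prints https://www.google.com/search?q=css+tutorial&filter=0
```

Comparing the results of the two URLs for a phrase taken from a copied page shows which copies are being filtered out.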
both identical in every way, both registering a PR4!
If you check the Google cache for both URLs, you may find that they have been merged.
Taking two URLs, merging them and crediting both sets of backlinks and PR to the remaining one isn't a penalty, it's a bonus.
Could also be that the hosting IS on the free site, and the webmaster just registered a domain name and has it point to that place. Most of my websites are available through their domain names and also through their account number and the name of the hosting company.