Forum Moderators: Robert Charlton & goodroi
Obviously if a page is identical to another, i.e. a 100% duplicate, then you would probably get some sort of penalty.
But surely there is a basic format for websites, i.e. menus down the side and a banner on top, so pages are going to be similar to a certain degree.
I was just wondering what sort of percentage Google would allow so as not to trigger the penalty, e.g. is 60% similarity to another web page too much?
As you point out, there are parts common to most sites, such as the navigation and footers. Search engines can easily distinguish what is content and what is navigation; whether they take any notice of this is another matter.
One big area of duplicate-recognition technology that I suspect has been tested in the wild is directory structure similarity. The way a person lays out the navigation and directory structure (including file naming) is almost like a fingerprint when you look at larger sites. From this it is easy to establish that two sites may share a common author, even if the content of each page has been rewritten to avoid penalties.
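To make the fingerprint idea concrete, here is a minimal sketch, assuming you already have a list of known URLs for each site (the example-a.com and example-b.net URLs below are made up for illustration). It reduces each site to the set of directory paths it uses and measures how much of one layout reappears in the other.

	# Hedged sketch of the "directory fingerprint" idea: strip the page
	# names from each URL and compare the remaining directory layouts.
	# The URL lists are hypothetical, not real crawl data.
	from urllib.parse import urlparse

	def structure(urls):
	    """Set of directory paths (file names dropped) used by a site."""
	    dirs = set()
	    for url in urls:
	        path = urlparse(url).path
	        dirs.add(path.rsplit("/", 1)[0] or "/")
	    return dirs

	site_a = ["http://example-a.com/widgets/blue/index.html",
	          "http://example-a.com/widgets/red/index.html",
	          "http://example-a.com/articles/2005/review.html"]
	site_b = ["http://example-b.net/widgets/blue/index.html",
	          "http://example-b.net/widgets/red/index.html",
	          "http://example-b.net/articles/2005/review.html"]

	shared = structure(site_a) & structure(site_b)
	print(len(shared) / len(structure(site_a)))  # 1.0 here: identical layout

Two unrelated sites will rarely score high on this, but a rewritten copy of an existing site very often keeps its directory tree intact.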
Duplicate content is usually measured in small sets of consecutive words. This means a page with 100 words would yield 97 four-word phrases. If you split each page up in this manner it becomes simple to spot re-used sentences, especially if they use unusual words or phrasing.
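Here is a minimal sketch of that phrase-overlap technique, assuming naive whitespace tokenisation and two made-up page texts. Each page is split into overlapping four-word phrases ("shingles") and the two sets are compared.

	# Hedged sketch of shingle-based duplicate detection: overlapping
	# four-word phrases per page, then the fraction of shared phrases.
	# Real systems would normalise case, punctuation and markup first.
	def shingles(text, size=4):
	    words = text.lower().split()
	    return {" ".join(words[i:i + size]) for i in range(len(words) - size + 1)}

	def jaccard(a, b):
	    """Fraction of shingles the two pages share (0.0 to 1.0)."""
	    if not a or not b:
	        return 0.0
	    return len(a & b) / len(a | b)

	page_a = "the quick brown fox jumps over the lazy dog near the river bank"
	page_b = "a quick brown fox jumps over the lazy dog by the river bank"

	print(jaccard(shingles(page_a), shingles(page_b)))

Note the arithmetic matches the figure above: a 100-word page gives 100 - 4 + 1 = 97 four-word shingles.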
Again, it is questionable whether the search engines are fully deploying the arsenal of tools they have, given the processing requirements. If they are not (as I suspect), I can see a whole load of upset webmasters when they do start implementing them.
I have recently been working on a big latent semantic indexing project and know that there are further ways to identify documents that are suspiciously similar. In academia there are companies that concentrate on automatically spotting copied or suspicious essays and theses. Looking at the economics of search engines, it has not made sense to spend that much processing power on every document, and it used to be a pipe dream even for academia. Is it just around the corner? Given that an everyday object like the upcoming PlayStation 3 will handle over a teraflop, it is entirely conceivable that the cost of computing over such vast amounts of data will soon be within reach of Yahoo, Google and Microsoft; after all, they are multi-billion-dollar companies.
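For anyone curious what the LSI side of this looks like, here is a minimal sketch, assuming a toy term-document count matrix and numpy: take a truncated SVD and compare documents by cosine similarity in the reduced "latent" space. The three one-line documents are invented examples, not anything from a real index.

	# Hedged sketch of the latent semantic indexing idea: term-document
	# counts, truncated SVD, cosine similarity in the reduced space.
	import numpy as np

	docs = ["search engines rank web pages",
	        "engines rank pages for search users",
	        "the playstation runs game software"]

	terms = sorted({w for d in docs for w in d.split()})
	# Rows = terms, columns = documents, entries = raw counts.
	A = np.array([[d.split().count(t) for d in docs] for t in terms], dtype=float)

	# Keep only the top k singular values/vectors (the latent dimensions).
	U, s, Vt = np.linalg.svd(A, full_matrices=False)
	k = 2
	doc_vecs = (np.diag(s[:k]) @ Vt[:k]).T   # one k-dim vector per document

	def cosine(a, b):
	    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

	print(cosine(doc_vecs[0], doc_vecs[1]))   # high: near-duplicate topics
	print(cosine(doc_vecs[0], doc_vecs[2]))   # low: unrelated document

Doing that across billions of pages rather than three toy sentences is exactly the processing-cost question raised above.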
To summarise, I would warn people off writing pages that merely comply with the fairly basic algorithms currently in use. It won't be long before such techniques are obvious to the search engines, and the first engine to clear out the majority of the garbage that clogs up its index will reap huge rewards in search result quality, and hence in user migration.