Forum Moderators: open
I was wondering how different a page has to be so that it's not seen as a duplicate.
1. Does the TEXT have to be different? If so, approx what percentage?
2. Does the CODE/TEMPLATE have to be different? If so, how much?
Are both of these considered together, or does Google check each separately?
I guess my question is: if I have a site whose content is relevant for another site, can I take the same text and put it on the other template? Should I change the text a bit?
And if I copy a template and write new text for it, is that enough to make it appear unique, or should I also change the template?
Sorry for all the questions, but this issue has been bothering me for days and days. Any suggestions would be appreciated.
I have always assumed that we were overreacting a bit on duplicate content issues, but as 2_much said, Google has certainly made some ground with Google News, in that there are very few pages with the same or even similar content.
Is this something that is going to migrate into the web search algo?
1. Recognise the template and remove it. I do not expect them to totally remove the template in web indexing, but I expect they would devalue it.
2. Find all the same news pages (AP, Reuters, etc.) and clump them together. That is the easy part.
3. Use statistical analysis of the words used in the articles to find similarities. Certain word combinations will only apply to specific stories.
All three of these would be much easier to do on news sites than on random web pages.
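Step 2 (clumping identical wire stories) could be as simple as hashing the normalized text of each page and grouping pages whose hashes collide. A toy sketch, obviously not what Google actually runs:

```python
import hashlib
from collections import defaultdict

def normalize(text):
    # Lowercase and collapse whitespace so cosmetic differences don't matter.
    return " ".join(text.lower().split())

def clump_exact_duplicates(pages):
    """Group page IDs whose normalized text is identical.
    pages: {page_id: text}. Returns a list of duplicate groups."""
    groups = defaultdict(list)
    for page_id, text in pages.items():
        digest = hashlib.md5(normalize(text).encode("utf-8")).hexdigest()
        groups[digest].append(page_id)
    return [ids for ids in groups.values() if len(ids) > 1]

# Hypothetical example data:
pages = {
    "site-a/story": "Reuters: Markets rose sharply on Tuesday.",
    "site-b/story": "reuters: markets rose   sharply on Tuesday.",
    "site-c/other": "A completely different article.",
}
print(clump_exact_duplicates(pages))
```

The hard parts (steps 1 and 3) need fuzzier matching, since real syndicated copies usually differ by a headline, a byline, or a trailing paragraph.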
Can you guys refer me to some reading about this technology?
You ask a dangerous question.
For recognizing the template parts, it would just be simple parsing and comparisons. HTML makes this easy, since each file has well-defined sections. Navigation elements tend to be fairly repetitive and verbose.
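As a toy illustration of that idea: lines that repeat across most of a site's pages are probably template, and whatever survives is the unique content. This is just my guess at the approach, not a known implementation:

```python
from collections import Counter

def strip_template(pages, threshold=0.8):
    """Drop any line that appears on at least `threshold` of the site's
    pages; return the remaining (content) lines of each page.
    pages: list of page texts from one site. Threshold is arbitrary."""
    line_counts = Counter()
    for text in pages:
        # Use set() so a line repeated within one page counts once.
        for line in set(text.splitlines()):
            line_counts[line] += 1
    cutoff = threshold * len(pages)
    return [
        [ln for ln in text.splitlines() if line_counts[ln] < cutoff]
        for text in pages
    ]

# Hypothetical pages sharing one navigation line:
pages = [
    "Home | About | Contact\nWelcome to our widget review.",
    "Home | About | Contact\nOur second article is about gadgets.",
    "Home | About | Contact\nA third page on something else.",
]
for content in strip_template(pages):
    print(content)
```

Real HTML would want the same trick applied to parsed tag blocks rather than raw lines, but the principle is identical: repetition across pages marks boilerplate.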
I would recommend reading the parsing section of any good text on compiler design.
For finding duplicate content, you would then look for similar chunks of text. I think a lot of people make this more difficult than it has to be. I'm guessing that they only go after pages that are almost exact duplicates.
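One simple way to catch "almost exact duplicates" is word shingling: break each page into overlapping word sequences and measure the overlap between the two sets. A sketch of how I imagine it; the shingle size and threshold here are arbitrary:

```python
def shingles(text, k=4):
    """All overlapping k-word sequences in the text, as a set."""
    words = text.lower().split()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(a, b):
    """Set overlap: |intersection| / |union|, between 0.0 and 1.0."""
    return len(a & b) / len(a | b) if a | b else 0.0

def near_duplicate(text_a, text_b, threshold=0.8):
    return jaccard(shingles(text_a), shingles(text_b)) >= threshold

# Two texts differing in one word score high; unrelated text scores 0.
a = "the quick brown fox jumps over the lazy dog near the river bank today"
b = "the quick brown fox jumps over the lazy dog near the river bank now"
c = "an entirely different paragraph about something else with no overlap"
print(jaccard(shingles(a), shingles(b)))
```

Changing a few words barely moves the score, which is why trivially "spun" copies of a page are still easy to spot this way.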
For finding similar news stories, and such things as theming, you might want to do some reading on linguistics and Bayesian analysis.
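I don't know what they actually do, but here is a toy naive-Bayes-flavoured example of the word-statistics idea: build per-story word probabilities, then match a new snippet to the story whose distinctive vocabulary it shares. The story texts are made up for illustration:

```python
import math
from collections import Counter

def train(stories):
    """stories: {label: text}. Build per-story word log-probabilities
    with add-one smoothing (a toy naive-Bayes model)."""
    counts = {label: Counter(text.lower().split())
              for label, text in stories.items()}
    vocab = set().union(*counts.values())
    models = {}
    for label, c in counts.items():
        total = sum(c.values())
        models[label] = {w: math.log((c[w] + 1) / (total + len(vocab)))
                         for w in vocab}
        # Fallback probability for words never seen in training.
        models[label]["__unk__"] = math.log(1 / (total + len(vocab)))
    return models

def classify(models, snippet):
    """Return the story label whose model gives the snippet the
    highest total log-probability."""
    def score(m):
        return sum(m.get(w, m["__unk__"]) for w in snippet.lower().split())
    return max(models, key=lambda label: score(models[label]))

stories = {
    "election": "the election results show the candidate won most votes",
    "earthquake": "a strong earthquake shook the city damaging many buildings",
}
models = train(stories)
print(classify(models, "votes were counted after the election"))
```

Words like "election" or "earthquake" dominate the score, which matches the earlier point that certain word combinations only apply to specific stories.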