The question is: how similar can our pages be before they are considered duplicates? Figures of a 20% or 10kb difference are often quoted. I don't believe the 10kb figure, since pages would then need to be relatively large, possibly too large.
Is software available that will perform this analysis? (other than Google's algorithms of course :), though if anyone has a copy I'd love to hear from you!).
Bottom line, we customise information on websites that are presenting information about similar topics -- and want to ensure these pages are not considered too similar and excluded from the Google DB.
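For anyone who wants a rough home-made check of the "20% difference" figure, here is a minimal sketch using Python's standard difflib module. It compares two texts word by word and reports how different they are as a percentage. This is only an illustration: the function name percent_different and the sample page texts are made up, and nothing here is claimed to match how Google measures similarity.

```python
from difflib import SequenceMatcher


def percent_different(text_a: str, text_b: str) -> float:
    """Return a word-level difference between two texts, as a percentage."""
    words_a = text_a.lower().split()
    words_b = text_b.lower().split()
    # ratio() is 2*M/T: M = number of matched words, T = total words in both texts.
    similarity = SequenceMatcher(None, words_a, words_b, autojunk=False).ratio()
    return round((1.0 - similarity) * 100, 1)


# Hypothetical page texts that differ only in the region name.
page_a = "Blue fuzzy widgets for sale in London with free delivery"
page_b = "Blue fuzzy widgets for sale in Sydney with free delivery"
print(percent_different(page_a, page_b))
```

Run on the visible text of two pages (tags stripped), this gives a quick number to compare against whatever difference figure you choose to trust.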
bcc1234 - My feeling is that within-site content may be analysed for duplicates, however archive sites would still have different linking structures to other archive domains.
Perhaps Google is concentrating their algorithms on easily detectable spam and not going through permutations to detect duplicate content?
This is some comfort for regionally affiliated sites whose content is different but similar.
"Focus on common filename structures"
For some reason I have never thought of this. I am, however, 100% confident that they do have a duplicate content detection mechanism which is more advanced than simply matching link structures like AltaVista or checking common filename structures.
I would like to know how they do it - still hunting the relevant paper ;)
HTML file names and linking structures
Image names (if you have just the same image names all over your sites or pages, and you do not have other images)
Page title
Text in the site (If they go over the first paragraph and it is exactly the same)
If you have a site with exactly the same information in all of the above cases, it will be penalized. In the case mentioned by John5, for example, the text and meta tags are probably different. I guess you may also edit your text so it says the same thing in a different way.
I posted a similar question in the forum Cloaking And Gateways [webmasterworld.com...]. I would appreciate any suggestions.
[edited by: WebGuerrilla at 4:40 pm (utc) on Aug. 23, 2002]
[edit reason] made url clickable [/edit]
There isn't a specific type of file name structure that is considered spam. What we are referring to is having the same file name structure across two different domains.
The majority of duplicate content that a search engine has to deal with comes from mirrored sites, or sites that have multiple domains resolving to them. In those cases, the two domains always have identical file structures.
www.domain1.com/widgets/fuzzy/blue.html
www.domain2.com/widgets/fuzzy/blue.html
Dupe content detection does go beyond just link and structure analysis, but the majority can be caught and dumped without getting into actually comparing the content.
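To make the file-structure point concrete, here is a minimal sketch of how two domains could be compared on paths alone, without looking at page content. It is an assumption about how such a check might work, not a description of any search engine's actual filter. The function names (path_set, structure_overlap) and the renamed example paths are hypothetical.

```python
from urllib.parse import urlsplit


def path_set(urls):
    """Collect the path part of each URL, ignoring the host name."""
    # urlsplit needs a scheme to separate host from path, so add one if missing.
    return {urlsplit(u if "://" in u else "http://" + u).path for u in urls}


def structure_overlap(urls_a, urls_b):
    """Jaccard overlap of two sites' path sets; 1.0 means identical structure."""
    a, b = path_set(urls_a), path_set(urls_b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


site1 = ["www.domain1.com/widgets/fuzzy/blue.html",
         "www.domain1.com/widgets/fuzzy/red.html"]
mirror = ["www.domain2.com/widgets/fuzzy/blue.html",
          "www.domain2.com/widgets/fuzzy/red.html"]
# Hypothetical second site with the same content under different file names.
renamed = ["www.domain2.com/products/soft-blue.htm",
           "www.domain2.com/products/soft-red.htm"]

print(structure_overlap(site1, mirror))   # identical structure
print(structure_overlap(site1, renamed))  # no shared paths
```

A mirror scores 1.0 and the renamed site scores 0.0, which is why a cheap check like this could catch most mirrors before any text comparison is needed.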
When I've come across situations where near-duplicate content was necessary, I've found that changing the structure and file names of the second site dramatically reduces the risk of getting dumped.
To echo John5, this is quite a common structure with translated sites. I have never seen any problems with that.
So if Google's duplicate filter gets triggered by this there must be a second step, where cached text elements get compared.
See this paper [www-db.stanford.edu] for a brief description of one type of similarity computation. The title of the paper is "Efficient Crawling Through URL Ordering" by Junghoo Cho, Hector Garcia-Molina, Lawrence Page. Go to Section 2, "Importance Metrics," subsection 1, "Similarity to a Driving Query Q."
I recall someone from Google about a year ago stating that by using vector math, it is a trivial matter to locate duplicate or near-duplicate pages. Anchor text and structure analysis might not even be required as a separate calculation. What you need is the formula, you need to understand the formula, and you need to know what the threshold is for a penalty. Good luck.
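For readers wondering what "vector math" could look like in practice, here is a minimal sketch of cosine similarity between two pages treated as word-count vectors. This is plain term-count cosine similarity, which is a standard technique; it is not necessarily the exact formula in the paper above or anything Google runs. The function names and sample texts are made up for illustration, and no real penalty threshold is known or implied.

```python
import math
from collections import Counter


def term_vector(text):
    """Turn a text into a sparse vector of word counts."""
    return Counter(text.lower().split())


def cosine_similarity(text_a, text_b):
    """Cosine of the angle between two word-count vectors (0.0 to 1.0)."""
    va, vb = term_vector(text_a), term_vector(text_b)
    # Dot product only needs the words that appear in both texts.
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical page texts.
original = "fuzzy blue widgets for children and adults"
near_dupe = "fuzzy blue widgets for children and parents"
unrelated = "cheap flights to sunny destinations"

print(cosine_similarity(original, near_dupe))
print(cosine_similarity(original, unrelated))
```

Near-duplicate pages score close to 1.0 and unrelated pages close to 0.0. Where a search engine would draw the line between the two is exactly the unknown part.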
I am convinced that Google has such vector math in operation. I am SO convinced that I have even taken down an old website of mine and incorporated it into my main site. This second site had been listed in Yahoo (since 98 and free) and gave details of the children's version of my widget invention.
Even though the content on each page was different from my main site and none of the GIFs or JPEGs were the same, I felt that the professional reports, testimonials, University references etc. on the main site were most important to the overall impression given to a potential customer, and that cross-linking to achieve this would CERTAINLY trigger some filter in Google. Accordingly, I made the big decision and ultimate sacrifice yesterday, sending up a meta-refresh to the children's site as well as a robots.txt < user agent disallow >. I then incorporated the children's widget into the main site.
A major factor influencing my decision, of course, is my continuing struggle to get a 10-month penalty lifted from my main site (PR7 down to a lousy PR2). I had also become aware of a penalty being imposed on the children's site by Google when it showed a line of code for links to the children's site when there are actually about 6 links to that site.
So my advice, from bitter experience, is to keep just the one site, set up subdirectories for each of your local content sections and show them clearly on your main menu.
Google - are you out there??
ANSWER ME OH MY LOVE,
TELL ME WHAT IT IS THAT I'VE BEEN GUILTY OF.
DAH DAH DAH DAH DAH DAH DAH
PLEASE LISTEN TO MY PRAYER.! :(