|Is there software to analyze page similarity?|
The need for similar pages on different websites that don't upset Google
We need similar (but not identical) content on different websites, primarily because people conducting *regional* searches require different/customised information, and many will limit their search to their country suffix, e.g. site:com.au
The question is: how similar is too similar before our pages are considered duplicates? Figures of 20%, or a 10kb difference, are often quoted. I don't believe in the 10kb figure, since pages would then need to be relatively large, possibly too large.
Is software available that will perform this analysis? (other than Google's algorithms of course :), though if anyone has a copy I'd love to hear from you!).
Bottom line: we customise information on websites that present information about similar topics, and want to ensure these pages are not considered too similar and excluded from the Google DB.
I don't know of any software, but I would advise you to focus your efforts on file names and linking structures first.
I don't think Google really does a comparative analysis on the actual content. I think they primarily focus on common file name structures.
Very good observation.
|I don't think Google really does a comparative analysis on the actual content. I think they primarily focus on common file name structures. |
What about archive sites that have thousands of pages like:
Thank you, this has helped a great deal. I guess when you think about it, this is logical: there is no way Google would be able to compare *near* duplicate content between all permutations of domains.
bcc1234 - My feeling is that within-site content may be analysed for duplicates; however, archive sites would still have different linking structures from other archive domains.
Perhaps Google is concentrating their algorithms on easily detectable spam and not going through permutations to detect duplicate content?
This is some comfort for sites that are regional affiliated sites, different content, but similar.
"Focus on common filename structures"
For some reason I have never thought of this. I am, however, 100% confident that they have a duplicate content detection mechanism which is more advanced than simply matching link structures like AltaVista or checking common filename structures.
I would like to know how they do it - still hunting the relevant paper ;)
Can someone say what file name structures are considered okay vs spam? e.g. www.abc.com/cameras/model1.html
Is that good or a problem???
What is a good file name structure for different model numbers or names?
I have two rather large web sites - one in English and a German version. They have the same filename structure to 99.99%, and I have no trouble with Google. Therefore I think it cannot be file name structure alone. They must compare contents, or they recognise that these are two different language versions (I also use the meta tag for language).
I am speculating; however, my guess is Google's algorithm checks:
- HTML file names and linking structures
- Image names (if you have the same image names all over your sites or pages, and you do not have other images)
- Text in the site (if they go over the first paragraph and it is exactly the same)
If you have a site with the exact same information in all of the above cases, it will be penalised. For example, in the case mentioned by John5, the text and meta tags are probably different. I guess you may also edit your text so it says the same thing in a different way.
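To make the "text in the site" check concrete: a common technique for catching near-duplicate (not just identical) text is word shingling, where each page is broken into overlapping word windows and the overlap between the two sets is measured. This is only a sketch of the general idea; nobody here knows what Google actually runs, and the k=4 window size is an arbitrary choice for illustration.

```python
def shingles(text, k=4):
    """Break text into overlapping k-word shingles (sliding windows)."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def overlap(a, b, k=4):
    """Jaccard similarity of the two texts' shingle sets (0.0 to 1.0)."""
    sa, sb = shingles(a, k), shingles(b, k)
    if not sa or not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)
```

Two paragraphs that say the same thing in different words share few shingles and score low; a copy with only a couple of words swapped still shares most of its shingles and scores high, which is exactly why rewording each page (rather than find-and-replace edits) helps.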
I posted a similar question in the forum Cloaking And Gateways [webmasterworld.com...] I would appreciate any suggestions.
>>Can someone say what file name structures are considered okay vs spam?
There isn't a specific type of file name structure that is considered spam. What we are referring to is having the same file name structure across two different domains.
The majority of duplicate content that a search engine has to deal with comes from mirrored sites, or sites that have multiple domains resolving to them. In those cases, the two domains always have identical file structures.
Dupe content detection does go beyond just link and structure analysis, but the majority can be caught and dumped without getting into actually comparing the content.
When I've come across situations where near-duplicate content was necessary, I've found that changing the structure and file names of the second site dramatically reduces the risk of getting dumped.
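The structure check described above can be sketched as a simple set comparison of URL paths (ignoring the domain) across two sites. The domain-free paths and the scores below are made-up illustrations, not anything Google has published: mirrored sites score near 1.0, while a restructured second site scores much lower.

```python
def path_structure_similarity(paths_a, paths_b):
    """Jaccard overlap of the URL paths (domain stripped) on two sites."""
    sa, sb = set(paths_a), set(paths_b)
    if not sa or not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)

# Hypothetical sites sharing part of their file structure.
site_a = {"/cameras/model1.html", "/cameras/model2.html", "/about.html"}
site_b = {"/cameras/model1.html", "/cameras/model2.html", "/contact.html"}
```

A check like this is cheap because it never fetches page content, which fits the observation that most mirrors can be caught and dumped on structure alone.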
To echo John5, this is quite a common structure with translated sites. I have never seen any problems with that.
So if Google's duplicate filter gets triggered by this, there must be a second step where cached text elements get compared.
Some more details from the experience of my two language sites. Both sites use a common image pool organised on a separate web site. The images (thousands) are accessed as HTTP references, which are therefore identical for both sites. Titles, meta tags and text are different, of course. During the process of translating the English site into German, I had quite a few non-translated or half-translated pages without getting into problems. Some content that comes out of a common database is still in English on the German site (a low percentage, however). My feeling is that it must be something as Mosio points out: several criteria which are looked at, and possibly a rather "generous" filter.
Google uses vector math to compute similarity. Remember, they already have data on the words in every document, the position of each word (from which proximity is known), and the frequency.
See this paper [www-db.stanford.edu] for a brief description of one type of similarity computation. The title of the paper is "Efficient Crawling Through URL Ordering" by Junghoo Cho, Hector Garcia-Molina, Lawrence Page. Go to Section 2, "Importance Metrics," subsection 1, "Similarity to a Driving Query Q."
I recall someone from Google about a year ago stating that by using vector math, it is a trivial matter to locate duplicate or near-duplicate pages. Anchor text and structure analysis might not even be required as a separate calculation. What you need is the formula, you need to understand the formula, and you need to know what the threshold is for a penalty. Good luck.
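The vector-space approach Everyman describes can be sketched with term-frequency vectors and cosine similarity: each document becomes a vector of word counts, and the cosine of the angle between two vectors measures how alike they are (1.0 = identical word distributions, 0.0 = no words in common). This is a toy illustration of the general technique from the literature, not Google's actual formula, and any penalty threshold would be a guess.

```python
import math
from collections import Counter

def cosine_similarity(doc_a, doc_b):
    """Cosine of the angle between the term-frequency vectors of two docs."""
    va, vb = Counter(doc_a.lower().split()), Counter(doc_b.lower().split())
    # Dot product over the terms the documents share.
    dot = sum(va[t] * vb[t] for t in set(va) & set(vb))
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)
```

Because the engine already stores word frequencies per document at index time, computing this between two candidate pages really is cheap, which supports the "trivial matter" claim.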
Thanks for starting this thread Flash and for your considered reply Everyman.
I am convinced that Google has such vector math in operation. I am SO convinced that I have even taken down an old website of mine and incorporated it into my main site. This second site had been listed in Yahoo (since '98, and free) and gave details of the children's version of my widget invention.
Even though the content on each page was different from my main site and none of the GIFs or JPEGs were the same, I felt that the professional reports, testimonials, university references etc. on the main site were most important to the overall impression given to a potential customer, and cross-linking to achieve this would CERTAINLY trigger some filter in Google. Accordingly, I made the big decision and ultimate sacrifice yesterday, sending up a meta refresh on the children's site as well as a robots.txt < user agent disallow >. I then incorporated the children's widget into the main site.
A major factor influencing my decision, of course, is my continuing struggle to get a 10-month penalty lifted from my main site (PR7 down to a lousy PR2). I had also become aware of a penalty being imposed on the children's site by Google, when it showed one line of code for links to the children's site when there are actually about 6 links to that site.
So my advice from bitter experience is keep just the one site, set up subdirectories for each of your local content sections and show them clearly on your main menu.
Google - are you out there??
ANSWER ME OH MY LOVE,
TELL ME WHAT IT IS THAT I'VE BEEN GUILTY OF.
DAH DAH DAH DAH DAH DAH DAH
PLEASE LISTEN TO MY PRAYER! :(