inbound - 3:49 pm on Jan 16, 2005 (gmt 0)
This is an area that fascinates me, and just last week I had a response from a SENIOR Google employee about identifying duplicates. Not in regard to how Google actually does it, but how it can be done in general (for a dupe content algo I am writing).

This employee has done tons of research on dupe content, and I think he knows the answers to Google's dupe algo. One thing to remember is that dupe checking, and near-dupe checking in particular, is very resource hungry if you implement many of the known techniques. Google CANNOT be using all known techniques, as that would be impossible to calculate on a database of 8 billion URLs. Scalability is a major constraint on Google, and you should keep it in mind when considering the techniques that Google can actually use rather than those that it merely knows about. All those PhDs can figure out very smart ways of catching things, but it's useless unless they can make it near-linearly scalable.

Techniques are varied and include:

shingling - taking sequences of words in sets of x. So if you had 'mary had a little lamb, its fleece...' and set x to 4, you would get 'mary had a little' through to 'little lamb its fleece'. All sets are recorded for every page, and a percentage score can be given for duplication based on word sequencing. For example: doc1 has 1000 sets, doc2 has 500 sets, 200 match, so 40% of the content on doc2 is duplicate. Many factors influence this, such as stop sequences (like stop words) and the percentage at which the duplicate flag is set.

tree matching (see mirror identification) - take a 1000 page site and compare it to another site with 1200 pages. Each has a structure for its content, not just the content itself. The way that you would structure a site with 20 categories and 50 sub categories will be different from almost everyone else's. Say you wanted to mirror the site: this would give an exact match that is easy to spot. But if you aimed to create slightly different content on every page, this method would still catch you by seeing that the structure and page naming are too similar (giving you another thing to worry about when creating new sites based on a similar theme).

I cannot give more specifics here (the thread is a bit long as it is), but anyone really interested should have a good look at the research by Andrei Broder (and others that work with him) and Moses Charikar. Looking for research done by people on the Google team is obviously sensible too. Many people who were in the field in the late 90s now work at Google, so a scan of techniques from then may lead you to people on their staff.
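To make the shingling description concrete, here is a rough Python sketch of the idea. It is only an illustration, not how Google or anyone else actually does it. The function names (shingles, percent_duplicate) and the tokenising rule are my own assumptions, and it counts distinct shingles per page, which may differ slightly from counting every occurrence.

```python
import re

def shingles(text, x=4):
    """Return the set of distinct x-word shingles in text.

    Assumption: words are lowercased and punctuation is dropped,
    so 'lamb,' and 'lamb' count as the same word.
    """
    words = re.findall(r"[a-z0-9']+", text.lower())
    # Slide a window of x words across the text; texts shorter than
    # x words produce no shingles at all.
    return {" ".join(words[i:i + x]) for i in range(len(words) - x + 1)}

def percent_duplicate(doc, other, x=4):
    """Percentage of doc's shingles that also appear in other."""
    doc_sets = shingles(doc, x)
    other_sets = shingles(other, x)
    if not doc_sets:
        return 0.0
    return 100.0 * len(doc_sets & other_sets) / len(doc_sets)
```

Calling shingles("mary had a little lamb, its fleece") gives the four sets from 'mary had a little' through to 'little lamb its fleece'. The score is measured against the first document, in the same way as the 200 out of 500 = 40% example. Note that comparing every page against every other page like this is exactly the kind of thing that will not scale to billions of URLs, so treat it as a way to understand the measure rather than a workable algo.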