
Another new algo possibility


sergi

11:22 am on Mar 10, 2004 (gmt 0)

10+ Year Member



Since the changes at Google, many of us have looked for patents and academic papers on searching technologies to see if they could fit with the changes in the algorithm. I propose another one:

Method and apparatus for preventing topic drift in queries in hyperlinked environments (Altavista patent) [patft.uspto.gov]

There are several bits that I find very interesting:

After the n-graph is built and optionally pruned in steps 208-212, the nodes in the n-graph are ranked in step 214. Processing continues at step 216 where the equivalence components are identified using the edges of the n-graph in an undirected manner. An equivalence component is a maximum set of nodes that are equivalent. Two nodes may be defined equivalent in many ways. For example, two nodes may be equivalent if they are connected by a path in the undirected version of the neighborhood graph. Another example where two nodes may be equivalent is if they are connected by a path that alternates between forward and backward edges. Such a path is called a "zig-zag" path.
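If it helps to picture that step, here's a rough Python sketch of the simplest equivalence definition the patent mentions (nodes connected by a path in the undirected version of the neighborhood graph). Function and variable names are my own, not from the patent:

```python
from collections import defaultdict

def equivalence_components(edges):
    """Group nodes into equivalence components by treating every directed
    edge as undirected and collecting the connected components."""
    adj = defaultdict(set)
    for src, dst in edges:
        adj[src].add(dst)
        adj[dst].add(src)

    seen, components = set(), []
    for start in adj:
        if start in seen:
            continue
        # Depth-first walk to gather everything reachable from `start`.
        stack, comp = [start], set()
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            comp.add(node)
            stack.extend(adj[node] - seen)
        components.append(comp)
    return components
```

The "zig-zag" path variant would be stricter (alternating forward and backward edges), so it would generally split the graph into more, smaller components.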

Each equivalence component that is created in step 216 is then examined in step 218. For each equivalence component, a determination is made as to whether that component is on topic or not. A component may be off topic if the optional pruning step is omitted or because the pruning step, if chosen, only examines a subset of the nodes in the n-graph.

Various techniques may be employed to examine the nodes contained in each equivalence component and to determine whether they are on topic. For example, the nodes may be randomly sampled. Alternatively, the N highest ranked pages in the equivalence component may be tested. In either case, the proportion of on topic pages to total pages tested is calculated. If the proportion is below a specified threshold level, for example 50%, then all of the nodes in the entire component would be eliminated from the neighborhood graph. The reason for eliminating the entire component is the assumption that too many of the nodes contained in the component were off topic and that its authority ranking results are therefore suspect.
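In code, that pruning step might look something like this (again my own sketch, using the "N highest ranked pages" sampling option and the 50% example threshold; the predicate and parameters are placeholders):

```python
def prune_off_topic_components(components, on_topic, rank, n_highest=5, threshold=0.5):
    """Keep a component only if enough of its sampled pages are on topic.

    components: iterable of sets of node ids
    on_topic:   predicate deciding whether a single page is on topic
    rank:       function giving each node's authority score
    """
    kept = []
    for comp in components:
        # Sample the N highest-ranked nodes of the component.
        sample = sorted(comp, key=rank, reverse=True)[:n_highest]
        proportion = sum(1 for node in sample if on_topic(node)) / len(sample)
        # Below the threshold, the whole component is dropped.
        if proportion >= threshold:
            kept.append(comp)
    return kept
```

Note that the all-or-nothing behaviour is the interesting part: one suspect component takes every one of its pages down with it.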

(…)

In step 222, the results are ordered by taking the top few rated pages from each component and using them to construct a final ordered list of ranked pages. By choosing nodes that have a lower absolute authority score but are the best ranked pages in a component, the problem of choosing all the highest ranked nodes from a single component simply because that component was the largest component on the graph is avoided.
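That final ordering step is easy to sketch too (my own illustration, with an arbitrary "top few" of 2 per component):

```python
def final_ranking(components, rank, per_component=2):
    """Take the best pages from each surviving component, then order those
    picks by score, so one large component cannot monopolize the results."""
    picks = []
    for comp in components:
        picks.extend(sorted(comp, key=rank, reverse=True)[:per_component])
    return sorted(picks, key=rank, reverse=True)
```

So even a page with a modest absolute authority score can surface, as long as it is among the best in its own component.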

Well, I find that interesting because, since Florida and the later developments, I think no satisfactory explanation has been offered for sites disappearing from certain searches. Intuitively, many people think that a shift in the weights of the different factors the Google algorithm takes into account could reasonably drop a page from #1 to, let's say, anywhere within the first 30 or 50 results; but if it falls out of the first 1000 results while still ranking for other searches, I think we can assume that new factors are at play.

The implications of the highlighted fragments could be that:

- It would be detrimental to link to / be linked by off-topic pages ("the equivalence components are identified using the edges of the n-graph in an undirected manner").
- It would be detrimental to link to / be linked by on-topic pages that have a stronger authority score than you (i.e., pages on the receiving end of the linking structure for that component).

Of course, although the authors of this patent all now work for Google, the patent was issued to Altavista, so if Google is using something similar to this, it has to be sufficiently different. And most of the effects of these changes would depend on the specifics of the implementation (the percentage of nodes chosen as a sample, the proportion of off-topic documents needed to consider a component suspect, the weight this would have relative to the other factors of the algorithm, etc.). Also, the highlighted fragments were in fact secondary to the main method for avoiding topic drift, and if something like this is in use in the new Google, it can only be an additional ingredient to the main course, be it topic-sensitive PageRank, Hilltop, semantics, etc.

Could something along the lines of the cited text be what we are seeing for the pages that fall beyond the first 1000 results?

Marcia

7:52 am on Mar 11, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Thank you for posting this for us, sergi!

if something is in use in the new Google, it can be only an additional ingredient to the main course, be it topic-sensitive PageRank, Hilltop, semantics, etc.

What's fascinating, and seems relevant to two recent updates, is the concept of a two-step process, with one step consisting of what appeared to be processing a data set of limited scope based on certain pre-defined criteria.