patent search [164.195.100.11]
The first patent is for:
"METHOD FOR DETERMINING THE RESEMBLANCE OF DOCUMENTS"
meaning they can detect duplicate documents.
What is claimed is:
1. A method of comparing a plurality of documents stored on a computer comprising the steps of:
loading a first document into a random access memory (RAM);
loading a second document into the RAM;
reducing the first document into a first sequence of tokens;
reducing the second document into a second sequence of tokens;
converting the first sequence of tokens to a first (multi)set of shingles;
converting the second sequence of tokens to a second (multi)set of shingles;
determining a first sketch of the first (multi)set of shingles;
determining a second sketch of the second (multi)set of shingles; and
comparing the first sketch and the second sketch.
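For anyone who wants to see the steps of claim 1 in concrete form, here is a rough Python sketch. The claim does not say how the tokens, shingles, or sketches are computed, so everything below is an assumption made for illustration: word tokens, 4-word shingles, and a min-hash style sketch built from salted SHA-1 hashes. The function names are made up and this is not presented as AltaVista's implementation.

```python
import hashlib
import re


def tokenize(text):
    """Reduce a document to a sequence of lowercase word tokens (assumed tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def shingles(tokens, w=4):
    """Return the set of contiguous w-token subsequences (w=4 is an arbitrary choice)."""
    if len(tokens) < w:
        return {tuple(tokens)} if tokens else set()
    return {tuple(tokens[i:i + w]) for i in range(len(tokens) - w + 1)}


def _hash(shingle, seed):
    """Hash one shingle to a 64-bit integer, salted by seed."""
    data = (str(seed) + "\x00" + " ".join(shingle)).encode("utf-8")
    return int.from_bytes(hashlib.sha1(data).digest()[:8], "big")


def sketch(shingle_set, num_hashes=64):
    """Assumed sketch: the minimum hash value under each of num_hashes salted hashes."""
    if not shingle_set:
        return [None] * num_hashes
    return [min(_hash(s, seed) for s in shingle_set) for seed in range(num_hashes)]


def resemblance(sketch_a, sketch_b):
    """Estimate resemblance as the fraction of sketch positions that agree."""
    matches = sum(1 for a, b in zip(sketch_a, sketch_b) if a == b)
    return matches / len(sketch_a)
```

The point of comparing sketches rather than full text is that each sketch has a fixed, small size (64 numbers here) no matter how long the document is. Because the tokenizer drops case and punctuation, two copies that differ only in that kind of formatting end up with identical sketches.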
Oh yes! They are dumping out some future secrets now!
"What is needed is a method to determine whether two documents have the same content except for modifications such as formatting, minor corrections, web-master signature, logo, etc., using small sketches of the document, rather than the full text. "
Quote from: US Pat. 6,230,155
Full List of Altavista Approved Patents [164.195.100.11]
Wow, why hasn't Alta made noise about this since they were approved? I remember all the fuss when they just applied for them, you'd think they would crow about getting them approved. Or am I missing something here?
So are they referring to some of their methods here? If so, this explains why older pages do well. So if I created a website two years ago and filled all the pages with a high frequency of keywords I would basically be knocking out anyone in the present for getting listed well?
If there are millions of pages out there, many of them are going to have the same types of small phrases that were not created intentionally. I wonder how many quality sites cannot be listed because of that.
Quote from: US Pat. 6,230,155
Exactly. The last time they made public announcements [webmasterworld.com] about patents, a great many people and organizations started talking about challenging the patents based on prior art. A lot of articles were written about how weak many of their claims are.
I know if I was shopping the company, I wouldn't want the press to point out to any potential buyer the fact that the patents might not end up being worth much.
<cough>
[webmasterworld.com...]
[webmasterworld.com...]
[webmasterworld.com...]
[webmasterworld.com...]
</cough> :)
They sure are. I've read a few of them in the past but missed the one on:
Method for determining the resemblance of documents
They might not give a great amount of info on how to rank really well but they sure clear things up on What Not To Do. As well as why this or that happened.
....i just wish i paid more attention in those math classes.....
Basically states how AV takes a bunch of URL's off a crawled page and then assigns priority levels to each link to determine which ones have the most importance.
During the course of processing a downloaded document, various data can be collected about it. Examples include the date and time of the download, how long it took to perform the download, whether the download was successful, the document's size, its MIME type, the date and time it was last modified, its expiration date and time, and a checksum of its contents. These data can be used for a variety of purposes, including, but not limited to:
passing information from one processing module to a later processing module in a processing pipeline;
collecting statistics about the downloaded documents; and
in the context of a continuous web crawler, the collected data can be used as a basis for determining when a document should next be downloaded (refreshed).
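To make the list of collected data a bit more tangible, here is a small Python sketch of a per-download record holding the fields the patent text names. The field names, the use of SHA-1 for the checksum, and especially the refresh rule in next_download_time are my own assumptions; the quoted text only says the data can be used to decide when to refresh, not how.

```python
import hashlib
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class DownloadRecord:
    url: str
    downloaded_at: datetime            # date and time of the download
    duration_seconds: float            # how long the download took
    success: bool                      # whether the download succeeded
    size_bytes: int                    # document size
    mime_type: str
    last_modified: Optional[datetime]
    expires: Optional[datetime]
    checksum: str                      # checksum of the contents (SHA-1 is an assumption)


def make_record(url, body, started, finished, mime_type,
                last_modified=None, expires=None, success=True):
    """Build a record from raw download results (hypothetical helper)."""
    return DownloadRecord(
        url=url,
        downloaded_at=finished,
        duration_seconds=(finished - started).total_seconds(),
        success=success,
        size_bytes=len(body),
        mime_type=mime_type,
        last_modified=last_modified,
        expires=expires,
        checksum=hashlib.sha1(body).hexdigest(),
    )


def next_download_time(record, previous_checksum, default_interval=timedelta(days=7)):
    """Hypothetical refresh rule, not taken from the patent.

    Use the expiration time if one is known and still in the future; otherwise
    revisit after default_interval, halved if the checksum changed since last time.
    """
    if record.expires is not None and record.expires > record.downloaded_at:
        return record.expires
    changed = previous_checksum is not None and previous_checksum != record.checksum
    interval = default_interval / 2 if changed else default_interval
    return record.downloaded_at + interval
```

The same record could be handed from one processing module to the next or aggregated for statistics, which covers the other two uses in the list.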
Every web crawler must maintain a data structure or set of data structures reflecting the set of URL's that still must be downloaded. In this document, that set of data structures is called "the Frontier." The crawler repeatedly selects a URL from the Frontier, downloads the corresponding document, processes the downloaded document, and then either removes the URL from the Frontier or reschedules it for downloading again at a later time. The latter scheme is used for so-called "continuous" web crawlers.
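The select / download / process / reschedule loop described above could look roughly like this in Python. This is only a sketch under my own assumptions: the Frontier is a heap ordered by a logical "due" time, download and process are caller-supplied stand-ins, and the fixed refresh_interval is a placeholder for whatever rescheduling policy a real crawler would use.

```python
import heapq
import itertools


class Frontier:
    """URLs still to be downloaded, ordered by the time they become due."""

    def __init__(self):
        self._heap = []
        self._counter = itertools.count()  # tie-breaker keeps insertion order

    def add(self, url, due=0.0):
        heapq.heappush(self._heap, (due, next(self._counter), url))

    def pop(self):
        due, _, url = heapq.heappop(self._heap)
        return due, url

    def __len__(self):
        return len(self._heap)


def crawl(frontier, download, process, refresh_interval=None, max_steps=100):
    """Run the select/download/process loop for at most max_steps URLs.

    refresh_interval=None removes each URL after processing (one-shot crawl);
    a number reschedules it that far past its due time (continuous crawl).
    """
    steps = 0
    while len(frontier) and steps < max_steps:
        due, url = frontier.pop()            # select a URL from the Frontier
        document = download(url)             # download the document
        process(url, document)               # hand it to later processing
        if refresh_interval is not None:
            frontier.add(url, due + refresh_interval)  # reschedule for later
        steps += 1
    return steps
```

With refresh_interval left as None the Frontier eventually empties; with a number the same URLs keep cycling back, which is the "continuous" case.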
When selecting a URL from the Frontier, the inventors have determined that it would often be desirable for the crawler to preferentially select certain URL's over others so as to maximize the quality of the information processed by the other applications to which the web crawler passes downloaded documents. For instance, the web crawler may pass downloaded pages to a document indexer. An index of documents on an Intranet or the Internet will be more accurate or higher quality if the documents of most interest to the users of the index have been preferentially updated so as to make sure that those documents are accurately represented in the index. To accomplish this, the web crawler might preferentially select URL's on web servers with known high quality content. Alternately, heuristics might be used to gauge page quality. For instance, shorter URL's might be considered to be better candidates than longer URL's.
In the context of a continuous web crawler, it may be desirable to prefer URL's on web servers whose content is known to change rapidly, such as news sites. It may be desirable to prefer newly-discovered URL's over those that have been previously processed. Among the previously processed URL's, it may be advantageous to prefer URL's whose content has changed between the previous two downloads over URL's whose content has not changed, and to prefer URL's with shorter expiration dates over those with longer expiration dates.
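The preferences listed in those last two paragraphs can be folded into a single score per URL. Here is a rough Python sketch; the UrlInfo fields, the HIGH_QUALITY_HOSTS set, and every weight are invented for illustration, since the patent text names the preferences but gives no numbers.

```python
from dataclasses import dataclass
from typing import Optional
from urllib.parse import urlsplit


@dataclass
class UrlInfo:
    url: str
    previously_processed: bool = False
    changed_last_time: bool = False              # content differed between the previous two downloads
    seconds_until_expiry: Optional[float] = None


# Hypothetical list of servers with known high-quality or fast-changing content.
HIGH_QUALITY_HOSTS = {"news.example.com"}


def priority(info):
    """Higher score means select sooner. All weights are arbitrary illustration values."""
    score = 0.0
    host = urlsplit(info.url).hostname or ""
    if host in HIGH_QUALITY_HOSTS:
        score += 5.0                              # prefer known good servers
    score += max(0, 100 - len(info.url)) / 100.0  # prefer shorter URLs
    if not info.previously_processed:
        score += 3.0                              # prefer newly discovered URLs
    elif info.changed_last_time:
        score += 2.0                              # prefer URLs whose content changed
    if info.seconds_until_expiry is not None:
        hours = max(info.seconds_until_expiry, 0.0) / 3600.0
        score += 1.0 / (1.0 + hours)              # prefer shorter expiration times
    return score


def order_frontier(infos):
    """Return the URLs sorted from most to least preferred."""
    return sorted(infos, key=priority, reverse=True)
```

A real crawler would presumably tune these weights against whatever "quality" means for its index; the sketch only shows that each stated preference maps to one term in the score.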