Altavista Search Engine Patents Approved

very good reading.

msgraph

4:02 pm on May 10, 2001 (gmt 0)

Looks like the Patent Office finally issued these patents to AV. They have been posted here before they were issued but might be hard for some to find. Well here you go :)

patent search [164.195.100.11]

Brett_Tabke

4:20 pm on May 10, 2001 (gmt 0)

Wow. Nice find. Was there a story somewhere?

The first patent is for:
"METHOD FOR DETERMINING THE RESEMBLANCE OF DOCUMENTS"

meaning they can detect duplicate documents.

What is claimed is:
1. A method of comparing a plurality of documents stored on a computer comprising the steps of:
loading a first document into a random access memory (RAM);
loading a second document into the RAM;
reducing the first document into a first sequence of tokens;
reducing the second document into a second sequence of tokens;
converting the first sequence of tokens to a first (multi)set of shingles;
converting the second sequence of tokens to a second (multi)set of shingles;
determining a first sketch of the first (multi)set of shingles;
determining a second sketch of the second (multi)set of shingles; and
comparing the first sketch and the second sketch.
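
For anyone trying to picture the token -> shingle -> sketch steps in that claim, here is a rough Python sketch. It is not the patent's exact procedure: it uses plain sets rather than (multi)sets, treats words as tokens, and builds the sketch from the smallest hash values (a common min-hash variant). The function names, the shingle width `w=4`, and the sketch size of 64 are all assumptions made for illustration.

```python
import hashlib
import re


def tokenize(text):
    """Reduce a document to a sequence of lowercase word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def shingles(tokens, w=4):
    """Return the set of contiguous w-token shingles (assumed width)."""
    if len(tokens) < w:
        return {tuple(tokens)} if tokens else set()
    return {tuple(tokens[i:i + w]) for i in range(len(tokens) - w + 1)}


def sketch(shingle_set, size=64):
    """Keep only the `size` smallest shingle hashes as a compact sketch."""
    hashes = sorted(
        int.from_bytes(hashlib.sha1(" ".join(s).encode("utf-8")).digest()[:8], "big")
        for s in shingle_set
    )
    return set(hashes[:size])


def resemblance(sketch_a, sketch_b, size=64):
    """Estimate set overlap (Jaccard) from two smallest-hash sketches."""
    smallest = set(sorted(sketch_a | sketch_b)[:size])
    if not smallest:
        return 0.0
    return len(smallest & sketch_a & sketch_b) / len(smallest)


doc1 = "The quick brown fox jumps over the lazy dog"
doc2 = "the quick brown fox jumps over the lazy dog!"
s1 = sketch(shingles(tokenize(doc1)))
s2 = sketch(shingles(tokenize(doc2)))
print(resemblance(s1, s2))
```

The point is that only the small sketches are compared, not the full text, so formatting and punctuation changes that do not alter the token sequence leave the score untouched.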

msgraph

4:25 pm on May 10, 2001 (gmt 0)

Nope not yet. Was pondering AV today and was like Hey! I forgot to check on their patent updates.

Oh yes! They are dumping out some future secrets now!

"What is needed is a method to determine whether two documents have the same content except for modifications such as formatting, minor corrections, web-master signature, logo, etc., using small sketches of the document, rather than the full text. "

Quote from: US Pat. 6,230,155

Brett_Tabke

4:34 pm on May 10, 2001 (gmt 0)

There are a staggering 50 of them that have been approved since the first of the year.

Full List of Altavista Approved Patents [164.195.100.11]

Wow, why hasn't Alta made noise about this since they were approved? I remember all the fuss when they just applied for them, you'd think they would crow about getting them approved. Or am I missing something here?

seth_wilde

4:47 pm on May 10, 2001 (gmt 0)

"why hasn't Alta made noise about this since they were approved?"

My guess is that the next time we publicly hear about the patents will be because AV will be suing someone for violating them.... (keep quiet until they strike and then rake in the dough)

msgraph

5:17 pm on May 10, 2001 (gmt 0)

"The method detects copies based on comparing word frequency occurrences of the new document against those of registered documents."

So are they referring to some of their methods here? If so, this explains why older pages do well. So if I created a website two years ago and filled all the pages with a high frequency of keywords I would basically be knocking out anyone in the present for getting listed well?

If there are millions of pages out there, many of them are going to have the same types of small phrases that were not created intentionally. I wonder how many quality sites cannot be listed because of that.

Quote from: US Pat. 6,230,155

rcjordan

5:33 pm on May 10, 2001 (gmt 0)

>the next time we publicly hear about the patents

When CMGI sells AV.

WebGuerrilla

7:30 pm on May 10, 2001 (gmt 0)

>>the next time we publicly hear about the patents
>When CMGI sells AV.

Exactly. The last time they made public announcements [webmasterworld.com] about patents, a great many people and organizations started talking about challenging the patents based on prior art. A lot of articles were written about how weak many of their claims are.

I know if I were shopping the company, I wouldn't want the press to point out to any potential buyer the fact that the patents might not end up being worth much.

Brett_Tabke

8:42 pm on May 10, 2001 (gmt 0)

We are bound to miss a story from time to time. We completely missed this one. Danny had it clear back in Feb:

[searchenginewatch.com...]

NFFC

8:48 pm on May 10, 2001 (gmt 0)

>back in Feb

<cough>
[webmasterworld.com...]

[webmasterworld.com...]

[webmasterworld.com...]
</cough> :)

Brett_Tabke

9:08 pm on May 10, 2001 (gmt 0)

Ya, I knew we talked about it quite a bit and that applications had been filed, but I didn't know any of the patents had been approved. The core ones on se algo's like the duplicates are the critical ones.

msgraph

9:29 pm on May 10, 2001 (gmt 0)

>>The core ones on se algo's like the duplicates are the critical ones.

They sure are. I've read a few of them in the past but missed the one on:
Method for determining the resemblance of documents

They might not give a great amount of info on how to rank really well but they sure clear things up on What Not To Do. As well as why this or that happened.

....i just wish i paid more attention in those math classes.....

msgraph

2:41 pm on Jul 17, 2001 (gmt 0)

Issued Today July 17, 2001 -- Filed on November 2, 1999

Basically states how AV takes a bunch of URL's off a crawled page and then assigns priority levels to each link to determine which ones have the most importance.

During the course of processing a downloaded document, various data can be collected about it. Examples include the date and time of the download, how long it took to perform the download, whether the download was successful, the document's size, its MIME type, the date and time it was last modified, its expiration date and time, and a checksum of its contents. These data can be used for a variety of purposes, including, but not limited to:
passing information from one processing module to a later processing module in a processing pipeline;
collecting statistics about the downloaded documents; and
in the context of a continuous web crawler, the collected data can be used as a basis for determining when a document should next be downloaded (refreshed).
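
To make that list of per-document data concrete, here is a hypothetical record type in Python. The field names, the `next_download` helper, and the 7-day base interval are all assumptions for illustration; the quoted text only says such data "can be used" to decide when to refresh, not how.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class CrawlRecord:
    """Data collected while processing one downloaded document."""
    url: str
    downloaded_at: datetime
    download_seconds: float
    success: bool
    size_bytes: int
    mime_type: str
    last_modified: Optional[datetime]
    expires_at: Optional[datetime]
    checksum: str


def next_download(record, previous_checksum, base=timedelta(days=7)):
    """Hypothetical refresh rule: honor expiry, else revisit changed pages sooner."""
    if record.expires_at is not None and record.expires_at > record.downloaded_at:
        return record.expires_at
    if previous_checksum is not None and previous_checksum != record.checksum:
        # Content changed since the last visit: halve the wait.
        return record.downloaded_at + base / 2
    return record.downloaded_at + base
```

A continuous crawler could store one such record per URL and compare the new checksum against the previous one on each visit.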

Every web crawler must maintain a data structure or set of data structures reflecting the set of URL's that still must be downloaded. In this document, that set of data structures is called "the Frontier." The crawler repeatedly selects a URL from the Frontier, downloads the corresponding document, processes the downloaded document, and then either removes the URL from the Frontier or reschedules it for downloading again at a later time. The latter scheme is used for so-called "continuous" web crawlers.
When selecting a URL from the Frontier, the inventors have determined that it would often be desirable for the crawler to preferentially select certain URL's over others so as to maximize the quality of the information processed by the other applications to which the web crawler passes downloaded documents. For instance, the web crawler may pass downloaded pages to a document indexer. An index of documents on an Intranet or the Internet will be more accurate or higher quality if the documents of most interest to the users of the index have been preferentially updated so as to make sure that those documents are accurately represented in the index. To accomplish this, the web crawler might preferentially select URL's on web servers with known high quality content. Alternately, heuristics might be used to gauge page quality. For instance, shorter URL's might be considered to be better candidates than longer URL's.
In the context of a continuous web crawler, it may be desirable to prefer URL's on web servers whose content is known to change rapidly, such as news sites. It may be desirable to prefer newly-discovered URL's over those that have been previously processed. Among the previously processed URL's, it may be advantageous to prefer URL's whose content has changed between the previous two downloads over URL's whose content has not changed, and to prefer URL's with shorter expiration dates over those with longer expiration dates.
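
A toy Python version of a Frontier with several priority-level queues might look like the following. This is only a sketch under assumptions: the three levels, the scoring in `priority_for`, the 40-character "short URL" cutoff, and the rule of always draining the highest non-empty level are invented here for illustration and are not taken from the patent.

```python
from collections import deque


class Frontier:
    """Toy frontier: one FIFO queue per priority level (0 = lowest)."""

    def __init__(self, levels=3):
        self.queues = [deque() for _ in range(levels)]

    def priority_for(self, url, is_new, changed_last_time):
        # Hypothetical scoring loosely based on the heuristics quoted above.
        score = 0
        if is_new or changed_last_time:
            score += 1  # prefer new URLs and pages that recently changed
        if len(url) <= 40:
            score += 1  # arbitrary cutoff for a "short" URL
        return min(score, len(self.queues) - 1)

    def add(self, url, is_new=True, changed_last_time=False):
        level = self.priority_for(url, is_new, changed_last_time)
        self.queues[level].append(url)

    def next_url(self):
        # Take from the highest non-empty level first.
        for queue in reversed(self.queues):
            if queue:
                return queue.popleft()
        return None


frontier = Frontier()
frontier.add("http://example.com/archive/2001/05/10/some-very-long-page-name.html", is_new=False)
frontier.add("http://news.example.com/", is_new=False, changed_last_time=True)
frontier.add("http://example.com/new")
print(frontier.next_url())
```

Strictly draining the top level can starve the lower queues, so a real crawler would presumably mix levels in some weighted fashion.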

Web crawler system using plurality of parallel priority level queues having distinct associated download priority levels for prioritizing document downloading and maintaining document freshness [164.195.100.11]