Inktomi: Identifying spoof documents

If only this document had been released earlier.

msgraph

2:17 pm on Aug 27, 2002 (gmt 0)

This patent was filed about 3 years ago. Although some of its techniques are probably ignored now, it points out some key issues:

Limiting The Number Of Metawords To Consider
Discounting Stop Words
Removing Duplicate Metawords
Comparing Metawords With The Text Of The Document
Bonusing Certain Words
Invisible Text
Repeated Keyword Test
Title Length Test
Indexing Web Pages
List Of Stop-Words

Inktomi Corporation: Method and apparatus for identifying spoof documents [164.195.100.11]

msgraph

2:54 am on Aug 28, 2002 (gmt 0)

Having read this document, what do you feel remains of it in how Inktomi handles documents today? Are most things ignored now? Improved now? Never seemed to have existed?

Even if most of this is NOT in effect, it really gives an insight into how Inktomi tried to put down "spamdexing."

Please, do not be afraid to read this patent, thinking that it is all tech jargon. There are a lot of tidbits within that are very interesting.

Meta Keywords
Only the first fifteen words, with stop-words removed, are indexed.

Removing Duplicate MetaWords
They state that if a word is repeated (say, three times) in the meta keywords tag, only one occurrence is considered.
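Here is a rough sketch of how I read the two meta keyword rules above. It is only my interpretation, not Inktomi's code: the stop-word set is a made-up placeholder (the patent has its own list), and I am assuming stop-words and duplicates are dropped before the fifteen-word cutoff is applied, which the summary does not spell out.

```python
# Hypothetical stop-word list; the patent supplies its own.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in"}
MAX_METAWORDS = 15  # the "exemplary" limit quoted above


def select_metawords(meta_keywords: str) -> list[str]:
    """Return the metawords that would be kept for indexing."""
    words = meta_keywords.lower().replace(",", " ").split()
    seen = set()
    kept = []
    for word in words:
        # Skip stop-words and any word we have already kept once.
        if word in STOP_WORDS or word in seen:
            continue
        seen.add(word)
        kept.append(word)
        if len(kept) == MAX_METAWORDS:
            break
    return kept


print(select_metawords("cheap, cheap, cheap flights to the moon"))
```

Repeating a keyword gains nothing under this reading, and padding the tag past fifteen distinct words gains nothing either.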

Comparing Metawords With The Text Of The Document
They will not pay attention to ALL the keywords if fewer than 50% of them appear in the visible on-page text. In that case, only those that match the on-page words will be considered. If more than 50% match, then ALL will be considered.
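A minimal sketch of that comparison, as I understand it. The function name and the simple whitespace tokenizing are my own assumptions, and the summary does not say what happens at exactly 50%, so I arbitrarily treat that case as a pass.

```python
def filter_metawords(metawords, body_text, threshold=0.5):
    """Keep all metawords if enough of them appear on-page,
    otherwise keep only the ones that do appear."""
    if not metawords:
        return []
    body_words = set(body_text.lower().split())
    matching = [w for w in metawords if w.lower() in body_words]
    # Assumption: a match rate of exactly 50% counts as passing.
    if len(matching) / len(metawords) >= threshold:
        return list(metawords)
    return matching


page = "cheap hotel in paris"
print(filter_metawords(["hotel", "paris", "cheap", "casino"], page))
print(filter_metawords(["hotel", "casino", "poker", "loans"], page))
```

In the first call three of four keywords are on the page, so all four survive. In the second only one of four matches, so only that one is kept.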

Of course there are certain variables applied depending on what words are in certain meta tags or title tags.

Bonusing Certain Words
Words that are found in certain tags are given special "bonus" weight depending on the percentage of the terms on-page.

Invisible Text

Text does not have to be exactly the same color as its background to be flagged. For example, if a document includes text that is of the color navy blue and the background color of the text is black, even though the colors are not an exact match, the text may still be identified as invisible text.
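To illustrate the "close but not exact" idea, here is a sketch that measures the distance between two colors in RGB space. The distance measure and the cutoff of 140 are purely my assumptions for illustration; nothing in the summary above says how Inktomi compares colors.

```python
import math


def parse_hex(color):
    """Turn '#rrggbb' into an (r, g, b) tuple of ints."""
    color = color.lstrip("#")
    return tuple(int(color[i:i + 2], 16) for i in (0, 2, 4))


def looks_invisible(text_color, background_color, max_distance=140.0):
    """Flag text whose color is near its background color.

    max_distance is a hypothetical cutoff, not a figure from the patent.
    """
    distance = math.dist(parse_hex(text_color), parse_hex(background_color))
    return distance <= max_distance


print(looks_invisible("#000080", "#000000"))  # navy text on black
print(looks_invisible("#ffffff", "#000000"))  # white text on black
```

With this cutoff, navy on black is flagged and white on black is not.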

Repeated Keyword Test

If a word appears more than 24 times, or accounts for more than 18% of the words on the page, then the page is considered spam.
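A quick sketch of how such a test could work. The 24 and 18% figures are the ones quoted above; the tokenizing, the function name, and the choice to count every word (stop-words included) are my assumptions.

```python
from collections import Counter


def fails_repeated_keyword_test(text, max_count=24, max_share=0.18):
    """Return True if the most frequent word is repeated too often."""
    words = text.lower().split()
    if not words:
        return False
    top_count = Counter(words).most_common(1)[0][1]
    return top_count > max_count or top_count / len(words) > max_share


stuffed = " ".join(["widgets"] * 25 + [f"word{i}" for i in range(200)])
normal = " ".join(f"word{i}" for i in range(100))
print(fails_repeated_keyword_test(stuffed))
print(fails_repeated_keyword_test(normal))
```

Note that with a naive count like this, very short pages trip the percentage rule easily, so a real implementation would presumably need a minimum page length or some other guard.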

Title Length Test
If the title has more than 50 words then it is considered spam.
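In code the title check is about as simple as it gets. This is my own sketch; I made the limit a parameter because fifty is only the figure given in the patent's example.

```python
def title_too_long(title, max_words=50):
    """Return True if the title has more words than the limit."""
    return len(title.split()) > max_words


print(title_too_long("Inktomi: Identifying spoof documents"))
print(title_too_long(" ".join(["keyword"] * 51)))
```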

Indexing Web Pages
Brief details on how they handle a page from start-to-finish.

Robert Charlton

5:31 am on Aug 28, 2002 (gmt 0)

>>If the title has more than 50 words then it is considered spam.<<

I'm not a lawyer, but I don't think that any of the figures in these examples is to be taken literally. They use the phrase, "in an exemplary embodiment," I think as a way of describing the mechanisms in the algos without revealing the actual thresholds.

In an exemplary embodiment, any document that contains a title that includes more than fifty (50) words is identified as a spoof document and thus no words within the document are indexed.

That's my emphasis in the quote. I think the "exemplary embodiment" might just as well be 20 words, or 100, and they'd be making the same patent claim without revealing their actual criteria. Ditto for the other areas. Lawman, do I have this right?

Great find, by the way. Thanks....