

New Patent Application - Spam Detection Based on Phrase Indexing



3:19 pm on Dec 29, 2006 (gmt 0)

WebmasterWorld Senior Member tedster is a WebmasterWorld Top Contributor of All Time 10+ Year Member

Googler Anna Lynn Patterson is credited as the inventor on this new patent application, Detecting spam documents in a phrase based information retrieval system [appft1.uspto.gov], which was filed Jun 28, 2006 and published Dec 28, 2006.

So who is Anna Lynn Patterson? She came to Google from her previous job at archive.org where they reportedly handle 55 billion documents in the index, so she's no stranger to large scale information retrieval. She's also the author of a short article that many may find interesting: Why Writing Your Own Search Engine is Hard [acmqueue.com].

The abstract for the application describes a bird's eye view of the patent:

Phrases are identified that predict the presence of other phrases in documents. Documents are indexed according to their included phrases. A spam document is identified based on the number of related phrases included in a document.
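As a rough sketch of what the abstract describes (not the patent's actual algorithm; the phrase table, function names, and threshold here are all invented for illustration), counting related phrases to flag a document might look like:

```python
# Toy model of a phrase index: each phrase maps to the set of phrases
# that statistically co-occur with it in normal documents. In the real
# system these sets would be learned from the document collection.
RELATED_PHRASES = {
    "cheap widgets": {"widget sale", "discount widgets", "buy widgets",
                      "widget deals", "free widgets", "widget coupon"},
    "widget repair": {"fix widgets", "widget parts"},
}

SPAM_THRESHOLD = 5  # invented cutoff: too many related phrases looks stuffed

def count_related(document_phrases, anchor):
    """Count how many phrases related to `anchor` the document contains."""
    related = RELATED_PHRASES.get(anchor, set())
    return len(related & document_phrases)

def looks_like_spam(document_phrases):
    """Flag the document if any phrase drags in too many of its relatives."""
    return any(count_related(document_phrases, p) > SPAM_THRESHOLD
               for p in document_phrases)

doc = {"cheap widgets", "widget sale", "discount widgets", "buy widgets",
       "widget deals", "free widgets", "widget coupon"}
print(looks_like_spam(doc))  # True: six related phrases exceed the cutoff
```

The intuition being illustrated: an honest page uses some related phrases naturally, while a stuffed page crams in nearly all of them, so a simple count against an expected range can separate the two.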

Now it's time to study before I comment more - but I wanted to post the news so any interested members also get a chance to read up.

[edited by: tedster at 3:59 pm (utc) on Dec. 29, 2006]


2:26 am on Jan 6, 2007 (gmt 0)

10+ Year Member

Too many words in the article.

Can anyone who managed to finish write a logical equation?


11:43 pm on Jan 19, 2007 (gmt 0)

5+ Year Member

First, if they have gone this route to detect spam, then I guess the only people they will kick out are genuine ones with high-ranking, relevant information that users want. I think this whole paper is junk, and if they followed it to penalize websites or apply the -950 ranking, then they made a blunder. So I guess this is the time for Yahoo and MSN to step in and create a better search engine, because this type of paper and its implementation will harm Google's user experience. Hopefully Google will revert their junk implementation of spam filtering.


12:55 am on Jan 20, 2007 (gmt 0)

5+ Year Member

I wonder what Google would do if every webmaster and web site out there got so tired of Google's seventy-to-eighty percent hold on the search market, and its constant gaming of us, that they decided to add

User-agent: Googlebot
Disallow: /

to their robots.txt file.

Of course it would probably be almost as hard as blackmailing the oil companies. But how sweet it would be to turn the tables.


7:27 pm on Jan 22, 2007 (gmt 0)

WebmasterWorld Senior Member marcia is a WebmasterWorld Top Contributor of All Time 10+ Year Member

There are some parts of this patent that make me suspect that it might have something to do, at least in part, with what's now referred to as the "950 penalty," which seems to be hitting sites that really aren't pulling any fancy tricks at all.


10:45 pm on Jan 22, 2007 (gmt 0)

10+ Year Member

I have a commerce site. I offer probably 15 product lines, and each one fits a different need. For example:
- Long widgets
- Short widgets
- Tall widgets
- Wide widgets
- Metal widgets
- Wood widgets


I understand the theory of over-optimization, and I would also say that's possible for my site. However, we also learned that anchor text is very important if we are to be found in the engines. If I were to list on my page:

- Long
- Short
- Tall
- Wide
- Metal
- Wood

My anchor text would not be worth much. So where do you draw the line? Is this script/patent going to know that all of those sizes were widgets?

What the problem could be is this: if your site is dynamic and you drill down through the products, and only a few items happen to be listed on one page, that's one thing. Your term may only appear a few times, and I would actually think that was good. However, if in that category I had a large page with 25 items to list, and the word "widgets" appeared each time, then I could see it getting caught in stuffing filters. But not the whole site, or the section.

I guess I could write a script that says: if a term is already included on my page 10 times, replace further occurrences with another term. But what the heck, what next?
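A literal toy version of that replace-after-10 idea (purely hypothetical; the cap and the synonym table are invented, and whether this would actually help is an open question) might look like:

```python
import re

# Hypothetical cap-and-substitute script: after a term has appeared
# MAX_USES times on a page, later occurrences are swapped for a synonym.
MAX_USES = 10
SYNONYMS = {"widgets": "gadgets"}  # invented substitution table

def cap_term(text, term):
    """Replace occurrences of `term` beyond MAX_USES with its synonym."""
    seen = 0
    def swap(match):
        nonlocal seen
        seen += 1
        return match.group(0) if seen <= MAX_USES else SYNONYMS[term]
    return re.sub(re.escape(term), swap, text)

page = " ".join(["widgets"] * 12)
print(cap_term(page, "widgets"))
# first 10 occurrences stay "widgets", the last 2 become "gadgets"
```

Note this only caps raw repetition; if the patent really scores co-occurring related phrases rather than single-term frequency, a substitution script like this would not address it.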


11:04 pm on Jan 22, 2007 (gmt 0)

WebmasterWorld Senior Member marcia is a WebmasterWorld Top Contributor of All Time 10+ Year Member

I'm not sure it's the number of occurrences; it seems to be focusing more on the co-occurrence of related phrases. Of course, if there are a number of products with different modifiers and the same keyword, that might mean more phrases, but it will take dissecting what the patent says about how phrases are being identified.

It's just some intuitive speculation on my part, but it makes sense and a few of the things mentioned seem to be a tangible reality so it can't hurt to try a thing or two to overcome what's apparently a penalty that's quite possibly phrase specific.

I've done exactly that for a page that's without doubt got that penalty - as white hat as can possibly be, with no tricks or games. I've noted the cache dates and will be watching over the next couple of weeks to see if the "remedy" applied has any effect.


9:50 pm on Jan 24, 2007 (gmt 0)

WebmasterWorld Senior Member annej is a WebmasterWorld Top Contributor of All Time 10+ Year Member

I don't claim to understand it, but I did read the patent over. It seems to me there is a very fine line between pages that rank well and pages that are penalized. The very involved phrase calculations are made and a line is drawn: above it, the page is fine; below it, the page is penalized.


8:41 am on Jan 25, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member

So we now have to rewrite all our spam... sigh.

Why does Google bother? Can they get any bigger? Joe Public has loved them for years with our spam dominating the SERPs. I think they are taking a big risk trying to find naturally written, unique content, which has an as yet unproven effect on their popularity. I hope they reconsider...


11:56 pm on Jan 25, 2007 (gmt 0)

10+ Year Member

I dispute that these people thought of this first. I posted here about searching for random phrases to suss out scraping and spam many years ago!

OK... well... it looks like I will only be able to surmise about this, as the proof of those postings seems NOT to be in the Google site search for this site.

Now there's a surprise!


2:36 pm on Feb 8, 2007 (gmt 0)

WebmasterWorld Senior Member zeus is a WebmasterWorld Top Contributor of All Time 10+ Year Member

outland88: "attacking spam by quantity and number of domains" owned by one person. I don't like that; I've got about 20 domains, but that is a must when you have to earn a steady income with everything that's going on on the internet (Google). If Google didn't have so much power, maybe we could get by with fewer. I do agree that if a person has 1000 domains live on the net, all with one-year registrations, that could be spam.


2:43 pm on Feb 8, 2007 (gmt 0)

WebmasterWorld Senior Member zeus is a WebmasterWorld Top Contributor of All Time 10+ Year Member

If it comes down to it, we are talking about keyword density, nothing else, if you look at it.


9:14 pm on Feb 10, 2007 (gmt 0)

WebmasterWorld Senior Member marcia is a WebmasterWorld Top Contributor of All Time 10+ Year Member

We're talking about keyword co-occurrence, IDF (Inverse Document Frequency), and thresholds of acceptability in document collections.
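For anyone unfamiliar with the term, IDF just measures how rare a phrase is across a collection. A minimal toy computation using the standard log(N/df) formula (the collection here is invented):

```python
import math

# Toy document collection to illustrate IDF (inverse document frequency):
# rare terms score high, ubiquitous terms score near zero.
docs = [
    {"cheap", "widgets", "sale"},
    {"widgets", "repair", "guide"},
    {"widgets", "history"},
]

def idf(term, collection):
    """Standard IDF: log(N / df), where df = documents containing the term."""
    df = sum(1 for d in collection if term in d)
    return math.log(len(collection) / df) if df else 0.0

print(round(idf("widgets", docs), 3))  # 0.0: appears in every document
print(round(idf("repair", docs), 3))   # 1.099: appears in one of three
```

The connection to the thread: a term that appears everywhere carries little information for ranking or spam scoring, which is one reason raw keyword density alone is a poor signal.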
