
Google SEO News and Discussion Forum

New Patent Application - Spam Detection Based on Phrase Indexing
tedster

WebmasterWorld Senior Member tedster us a WebmasterWorld Top Contributor of All Time 10+ Year Member



 
Msg#: 3202878 posted 3:19 pm on Dec 29, 2006 (gmt 0)

Googler Anna Lynn Patterson is credited as the inventor on this new patent application, Detecting spam documents in a phrase based information retrieval system [appft1.uspto.gov], which was filed Jun 28, 2006 and published Dec 28, 2006.

So who is Anna Lynn Patterson? She came to Google from her previous job at archive.org where they reportedly handle 55 billion documents in the index, so she's no stranger to large scale information retrieval. She's also the author of a short article that many may find interesting: Why Writing Your Own Search Engine is Hard [acmqueue.com].

The abstract for the application describes a bird's eye view of the patent:

Phrases are identified that predict the presence of other phrases in documents. Documents are then indexed according to their included phrases. A spam document is identified based on the number of related phrases included in a document.
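Taken at face value, that describes something along these lines (a minimal Python sketch; the related-phrase table, names and threshold are invented here purely for illustration, not taken from the patent):

# Rough sketch of the abstract's idea, not Google's actual code.
# RELATED is an invented stand-in for a precomputed table mapping each
# phrase to the phrases it "predicts".
RELATED = {
    "australian shepherd": {"herding dog", "border collie", "agility training"},
    "president of the usa": {"white house", "executive branch"},
}

def related_phrase_count(doc_phrases):
    """Count how many related phrases co-occur in one document (a set of phrases)."""
    count = 0
    for phrase in doc_phrases:
        count += len(RELATED.get(phrase, set()) & doc_phrases)
    return count

def looks_like_spam(doc_phrases, expected, allowed_excess):
    """Flag a document whose related-phrase count is far above what is expected."""
    return related_phrase_count(doc_phrases) > expected + allowed_excess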

Now it's time to study before I comment more - but I wanted to post the news so any interested members also get a chance to read up.

[edited by: tedster at 3:59 pm (utc) on Dec. 29, 2006]

 

justageek

WebmasterWorld Senior Member 10+ Year Member



 
Msg#: 3202878 posted 3:49 pm on Dec 29, 2006 (gmt 0)

Maybe I'm missing something but what's the difference between this and LSI on a smaller scale? Phrases rather than across pages?

JAG

tedster

WebmasterWorld Senior Member tedster us a WebmasterWorld Top Contributor of All Time 10+ Year Member



 
Msg#: 3202878 posted 3:56 pm on Dec 29, 2006 (gmt 0)

The first thing I see is that this patent resolves a very gnarly problem:

For example, on the assumption that any five words could constitute a phrase, and a large corpus would have at least 200,000 unique terms, there would be approximately 3.2 x 10^26 possible phrases, clearly more than any existing system could store in memory or otherwise programmatically manipulate.
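The arithmetic behind that figure checks out: with a 200,000-term vocabulary and five-word candidate phrases, the naive count is

200,000^5 = (2 x 10^5)^5 = 2^5 x 10^25 = 3.2 x 10^26

which is presumably why so much of the patent is about whittling candidates down to the "good phrase list" that comes up later in this thread.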

digitalghost

WebmasterWorld Senior Member digitalghost us a WebmasterWorld Top Contributor of All Time 10+ Year Member



 
Msg#: 3202878 posted 4:07 pm on Dec 29, 2006 (gmt 0)

Word dependency as spam prediction. Nothing to do with LSI really, other than that word dependency has a role to play in document analysis. There have been a lot of patent applications lately that seem to be slight modifications of prior art.

Web_Savvy

10+ Year Member



 
Msg#: 3202878 posted 4:17 pm on Dec 29, 2006 (gmt 0)

Good and quick find tedster, thanks for posting it.

It may not relate to the actual invention, but I could not resist commenting on this, from the 'Background of the Invention':

For example, in a typical Boolean system, a search on "Australian Shepherds" would not return documents about other herding dogs such as Border Collies that do not have the exact query terms. Rather, such a system is likely to also retrieve and highly rank documents that are about Australia (and have nothing to do with dogs), and documents about "shepherds" generally.

Well, 'such a system' (as described in the 2nd sentence above) would have to be pretty primitive in 2006, wouldn't it? ;-)

[edited by: Web_Savvy at 4:21 pm (utc) on Dec. 29, 2006]

SlyOldDog

WebmasterWorld Senior Member 10+ Year Member



 
Msg#: 3202878 posted 7:57 pm on Dec 29, 2006 (gmt 0)

I think this would be more useful for identifying blogspam links than spam pages per se.

What does a spam page look like anyway? The best spam is indistinguishable from the real thing, so on-page factors are irrelevant.

mattg3

WebmasterWorld Senior Member 5+ Year Member



 
Msg#: 3202878 posted 8:09 pm on Dec 29, 2006 (gmt 0)

What's the difference between this and a Bayesian, SpamAssassin-style approach?

This seems to be just a clear-out for automatically generated text.

So if you use "President of the USA" on your page and don't use "White House", your page might get nuked ..
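For comparison, a Bayesian filter of the SpamAssassin sort scores individual tokens against spam probabilities learned from labelled examples, roughly like the toy sketch below (the probabilities are invented); the phrase-based patent instead compares counts of co-occurring related phrases against a corpus baseline, so it needs no per-token spam probabilities at all.

import math

# Toy Bayesian token scoring in the general spirit of the filters being
# compared here. Real filters learn these probabilities from labelled
# spam and ham; the values below are made up.
spam_prob = {"viagra": 0.99, "mortgage": 0.85, "meeting": 0.10}

def spam_score(tokens):
    """Combine per-token spam probabilities into a log-odds score."""
    score = 0.0
    for t in tokens:
        p = spam_prob.get(t, 0.5)           # unknown tokens are neutral
        score += math.log(p / (1.0 - p))    # positive pushes toward spam
    return score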

[edited by: mattg3 at 8:21 pm (utc) on Dec. 29, 2006]

SullySEO

5+ Year Member



 
Msg#: 3202878 posted 8:12 pm on Dec 29, 2006 (gmt 0)

Those Google folks...do they ever think about anything but spam?

I think a better and more positive patent and title would be Detecting Quality Documents In A Phrase Based Information Retrieval System

To propose that a document is spam because it doesn't contain 26 possible phrases, or whatever, is ridiculous. If that's the case, most of the web is spam.

mattg3

WebmasterWorld Senior Member 5+ Year Member



 
Msg#: 3202878 posted 8:24 pm on Dec 29, 2006 (gmt 0)

To propose that a document is spam because it doesn't contain 26 possible phrases, or whatever, is ridiculous. If that's the case, most of the web is spam.

Unless this "spam phrase index" is huge it will lead to near duplicate content as only the folks that use President of the USA and White House in a document will get through..

[edited by: mattg3 at 8:41 pm (utc) on Dec. 29, 2006]

plasma

10+ Year Member



 
Msg#: 3202878 posted 8:30 pm on Dec 29, 2006 (gmt 0)

That patent wouldn't work with obfuscated texts. Only the stupidest spammers would be caught.
How do they define spam btw?

tedster

WebmasterWorld Senior Member tedster us a WebmasterWorld Top Contributor of All Time 10+ Year Member



 
Msg#: 3202878 posted 8:30 pm on Dec 29, 2006 (gmt 0)

I read it as looking for pages with too many phrases, not as having too few -- we might call it concept stuffing.

The patent is looking at significant deviations that are too high -- certain phrases make that clear: "identifying as spam documents those documents that have a statistically significant deviation in the number of related phrases relative to an expected number", "where it is at least some multiple number of standard deviations greater than E, for example, more than five standard deviations", or "at least one phrase exceeds predetermined maximum expected number."

This approach seems to me to be aimed at autogenerated pages, constructed from scraped bits and pieces to attract a long tail search to a page with ads. Of course, it does all hang on the base measures of assumed non-spam documents, but I assume Google has enough data to take a decent baseline measure.
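A toy version of that five-standard-deviation test, assuming the per-document related-phrase counts and a baseline drawn from presumed non-spam documents are already available (all numbers below are placeholders):

import statistics

# Placeholder baseline: related-phrase counts from documents presumed
# not to be spam. In practice this would come from the whole indexed
# corpus, not a handful of numbers.
baseline_counts = [4, 7, 5, 6, 8, 5, 7, 6]

E = statistics.mean(baseline_counts)        # expected number of related phrases
sigma = statistics.pstdev(baseline_counts)  # standard deviation of that count

def is_spam_candidate(related_count, multiple=5):
    """Flag documents more than `multiple` standard deviations above E."""
    return related_count > E + multiple * sigma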

I wish them the best in this. If I never again click through to an autogen junk page, I'll be very happy.

Oliver Henniges

WebmasterWorld Senior Member 10+ Year Member



 
Msg#: 3202878 posted 8:38 pm on Dec 29, 2006 (gmt 0)

I got stuck between 0036 and 0039:

[0036] In each phrase window 302, each candidate phrase is checked in turn to determine if it is already present in the good phrase list 208 or the possible phrase list 206. If the candidate phrase is not present in either the good phrase list 208 or the possible phrase list 206, then the candidate has already been determined to be "bad" and is skipped.
...
[0039] If the candidate phrase is not in the good phrase list 208 then it is added to the possible phrase list 206, unless it is already present therein. Each entry p on the possible phrase list 206 has three associated counts:

So which is it: skipped or added? As I understand it, the list of possible phrases would remain empty. What did I get wrong?
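One way to read [0036] and [0039] together, and it is only a guess, is that [0036] describes a later pass: by then a pruning step has already thrown out weak candidates, so a phrase on neither list must be one of those rejects. A sketch of that reading (list names invented):

# Only one possible reading of [0036]/[0039]; not necessarily how the patent works.
good_phrases = set()       # the "good phrase list" (208)
possible_phrases = {}      # the "possible phrase list" (206): phrase -> its counts
pruned_phrases = set()     # candidates rejected by an earlier pruning pass

def process_candidate(phrase):
    # [0036], read as applying to later passes: a candidate on neither list
    # can only be one that a previous pruning step already rejected.
    if phrase in pruned_phrases:
        return
    # [0039]: anything not yet "good" is added to the possible list
    # (unless already present), where its counts accumulate.
    if phrase not in good_phrases and phrase not in possible_phrases:
        possible_phrases[phrase] = [0, 0, 0]   # the three counts mentioned in [0039]

On that reading the possible list does get populated on the first pass, and the skip branch in [0036] only starts to matter once pruning has run, but the sequencing could well be different.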

tedster

WebmasterWorld Senior Member tedster us a WebmasterWorld Top Contributor of All Time 10+ Year Member



 
Msg#: 3202878 posted 8:40 pm on Dec 29, 2006 (gmt 0)

How do they define spam btw?

I was amused to see the word in a patent application with no formal definition. I wonder if the management at Hormel are also amused.

The closest thing I can see to a definition is this:

[0006] Some spam pages are documents that have little if any meaningful content, but instead comprise collections of popular words and phrases, often hundreds or even thousands of them; these pages are sometime called "keyword stuffing pages." Others include specific words and phrases known to be of interest to advertisers. These types of documents (often called "honeypots") are created to cause search engines to retrieve such documents for display along with paid advertisements.

outland88

WebmasterWorld Senior Member 10+ Year Member



 
Msg#: 3202878 posted 9:02 pm on Dec 29, 2006 (gmt 0)

This is one thing I was searching for related to the Dec 20th problem. It's almost like Google was using a new approach to attack spam in my areas. This doesn't bode well for commerce sites.

> constructed from scraped bits and pieces to attract a long tail search to a page with ads.<

Interestingly, tedster, this is what was popping up all over my areas: scrapers using considerably more differing bits and pieces than were ever there before. Article sites also saw big jumps in their rankings in my areas.

I've got to agree with SullySEO: an awful lot is going to be detected as spam. It's almost an ultra-sophisticated brand of censorship. Google needs to concentrate on indexing the web, attacking spam by quantity and by the number of domains with similar ownership, and let people do business once again. I'm tired of all this collateral damage.

Web_Savvy

10+ Year Member



 
Msg#: 3202878 posted 9:04 pm on Dec 29, 2006 (gmt 0)

Within the 11 claims of this patent application, the term 'spam' occurs some 20 times, about twice per claim. Also, it occurs some 50 times in total across the whole patent document.

I wonder if the URL of the patent would trigger some 'spam' filter/s in the Google algo? ;-)

But more seriously, I too would have thought that a term so frequently used in the Claims should have been somewhat more substantially defined.

Also, am I right that prior art references are completely absent, or did I miss something here? Isn't this a bit surprising?

mattg3

WebmasterWorld Senior Member 5+ Year Member



 
Msg#: 3202878 posted 9:12 pm on Dec 29, 2006 (gmt 0)

Also, am I right that prior art references are completely absent, or did I miss something here? Isn't this a bit surprising?

Shouldn't a patent at least claim it's a unique idea? [Wasn't a patent once a claim to be the first to publish something, or something like that?] .. A bit like going into your viva and saying, btw, I copied that guy's PhD ...

atlrus

10+ Year Member



 
Msg#: 3202878 posted 10:07 pm on Dec 29, 2006 (gmt 0)

Oh, that would be a nightmare for us, as we run an industry-specific news portal, our home page is mostly headline links to internal pages, and our keyword repeats A LOT...

Really, I have to agree - Google is in the business of providing good results, not constantly looking for ways to find and ban spam. You have 1000 results - find a way to identify the best 1000, hell, the best 20, results, and the problem is solved.

[edited by: tedster at 11:15 pm (utc) on Dec. 29, 2006]
[edit reason] fix typo [/edit]

Swanson

10+ Year Member



 
Msg#: 3202878 posted 10:21 pm on Dec 29, 2006 (gmt 0)

I wouldn't be surprised if this algo is already in use on the landing page quality score "adbot" for adwords.

Anything "page" or "phrase" based is easy to implement in the adbot as it is much less to do with inbound links for example.

Halfdeck

5+ Year Member



 
Msg#: 3202878 posted 8:16 am on Dec 30, 2006 (gmt 0)

I believe the supplemental index is based on one of Anna's patents.

pontifex

WebmasterWorld Senior Member 10+ Year Member



 
Msg#: 3202878 posted 12:48 pm on Dec 30, 2006 (gmt 0)

If that's the case, most of the web is spam.

SullySEO: pssst ... don't tell anyone until they take the red pill :)

I see the approach in the patent, but I wonder whether it is worth patenting. Some spam filters work on a similar basis, and patents are supposed to cover NEW technology, aren't they?

my 2 pennies,
P!

SullySEO

5+ Year Member



 
Msg#: 3202878 posted 5:50 pm on Dec 30, 2006 (gmt 0)

I read it as looking for pages with too many phrases, not as having too few -- we might call it concept stuffing.

Yes, I read your earlier quote, which could be taken out of context if you don't read the entire paragraph it came from. My bad.

So basically, if you exceed the number of phrases that have been determined to be "valid" or "good", you have "bad", spammy phrases.

Clark

WebmasterWorld Senior Member 10+ Year Member



 
Msg#: 3202878 posted 6:40 pm on Dec 30, 2006 (gmt 0)

What is LSI?

Re: the patents, unless anyone has reports of Google going around suing other SEs for "violating" their patents, I think they probably have fun filing these patents to waste the time of SEOs.

I mean, how exactly would you sue a search engine for breaking a patent? The SEs keep their algos secret anyway, so you can't even "know" they are using "your" patented technology.

Web_Savvy

10+ Year Member



 
Msg#: 3202878 posted 7:14 pm on Dec 30, 2006 (gmt 0)

At times it could all boil down to corporate strategy.

Sometimes IP rights, including patents, can be used as a means of defense rather than offense.

A competitor sues you for something, you sue them back under the pretext of some patent violation or other (there could be many grey areas). Then settle out of court and it's back to business as usual. ;-)

Weird though it may sound, sh^t happens. :-)

centime

5+ Year Member



 
Msg#: 3202878 posted 7:30 pm on Dec 30, 2006 (gmt 0)

Thanks for the article on building a search engine. It sure confirms a lot of things I've kinda deduced, and offers so many more insights.

Fantastic

arpecop

5+ Year Member



 
Msg#: 3202878 posted 9:42 pm on Dec 30, 2006 (gmt 0)

So this patent is spam itself, because there are too many occurrences of "phrase" and "phrases". I just counted them using find and replace in Word ... there are 667 instances of "phrase".
LOL

gibbergibber

10+ Year Member



 
Msg#: 3202878 posted 2:56 am on Dec 31, 2006 (gmt 0)

The word "spam" is sort of like the word "evil", everyone can come up with clear extreme examples but no one can come up with a way of distinguishing between more ambiguous cases.

There are so many websites (especially those made up entirely of syndicated material) where there's seemingly little or no original content to justify their advertising revenue, yet they're very popular and would be much missed if they disappeared, so they're arguably legitimate and not spam. Google News, for example.

In other cases sites may have original content, but it occupies only a tiny proportion of the page compared to advertising and affiliate links, yet the sites are seen as legitimate sources of original material. IGN or IMDb, for example, or the many high-profile news sites that cover themselves in buttons, banners and floating objects.

Then there are online stores where most of the products are provided by affiliates rather than the site itself, seemingly a tell-tale sign of a spam site, yet Amazon.com and many other large stores habitually contract out sales of some or most of their items to third parties.

Not being able to define spam isn't an everyday problem for individuals because we all have our own standards, just as we do for distinguishing between good and evil. The problem comes in single blanket rules that are supposed to apply to absolutely everything and everyone, and there you run into the same problem that the courts do every day: it's wrong to kill someone, it's wrong to burgle someone, so is it wrong for a homeowner to kill a burglar? Whatever side a court comes down on, there will be significant numbers of people who disagree, and will feel victimised by what the court decides. Google's pronouncements about spam, if they ever became public, might have the same polarising effect on users of the internet.

In this sense, it's impossible to absolutely 100% totally define spam and non-spam, just shades of spam. :-)

The problem with Google is that we never know how exactly they're punishing people, and for what. It's like a Kafka-esque court that meets in secret, never reveals its deliberations and never reveals its decisions. As many people have pointed out, sooner or later someone at Google is likely to abuse that position to favour sites that somehow favour Google.

BillyS

WebmasterWorld Senior Member billys us a WebmasterWorld Top Contributor of All Time 10+ Year Member



 
Msg#: 3202878 posted 3:05 am on Dec 31, 2006 (gmt 0)

This is one of those documents I'm going to have to print out and read. About all I've figured out is what Tedster already mentioned.

Google knows what types of phrases should appear together, and by looking for certain combinations of phrases it can paint a better picture of what a page is all about. However, if certain phrases are used together too frequently, then a spam filter is tripped.
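The "phrases that should appear together" part can be pictured with a simple co-occurrence lift: how much more often one phrase turns up in documents containing another phrase than it does across the corpus. The patent describes its own predictive measure; the lift ratio below is only a stand-in for illustration.

def lift(docs, phrase_a, phrase_b):
    """How much more likely phrase_b is in documents containing phrase_a
    than in the corpus overall; values well above 1.0 suggest phrase_b is
    'predicted by' phrase_a."""
    docs_with_a = [d for d in docs if phrase_a in d]
    if not docs_with_a:
        return 0.0
    p_b = sum(phrase_b in d for d in docs) / len(docs)
    p_b_given_a = sum(phrase_b in d for d in docs_with_a) / len(docs_with_a)
    return p_b_given_a / p_b if p_b else 0.0

# Example: each document is just its set of phrases.
docs = [
    {"white house", "president of the usa"},
    {"white house", "rose garden"},
    {"real estate", "mortgage rates"},
]
print(lift(docs, "president of the usa", "white house"))   # prints 1.5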

Pirates



 
Msg#: 3202878 posted 5:10 am on Dec 31, 2006 (gmt 0)

Lots of Google patents are rubbish that comes to nothing. This one, I think, is already active in the SERPs.

Marcia

WebmasterWorld Senior Member marcia us a WebmasterWorld Top Contributor of All Time 10+ Year Member



 
Msg#: 3202878 posted 6:08 am on Dec 31, 2006 (gmt 0)

[0006] Some spam pages are documents that have little if any meaningful content, but instead comprise collections of popular words and phrases, often hundreds or even thousands of them; these pages are sometime called "keyword stuffing pages." Others include specific words and phrases known to be of interest to advertisers. These types of documents (often called "honeypots") are created to cause search engines to retrieve such documents for display along with paid advertisements.

(Note: my bolding to separate two concepts)

I can see the first, unbolded part referring to keyword dumps that are sometimes put into a hidden <div> positioned off-page, out of user view, or into hidden (or almost hidden) text.

That second, bolded part, is the fanciest description I've seen of MFA pages/sites.

Martin40

5+ Year Member



 
Msg#: 3202878 posted 7:06 pm on Jan 1, 2007 (gmt 0)

Isn't the "patent" just about on-page over-optimisation?

It concludes with:

Finally, it should be noted that the language used in the specification has been principally selected for readability and instructional purposes, and may not have been selected to delineate or circumscribe the inventive subject matter. Accordingly, the disclosure of the present invention is intended to be illustrative, but not limiting, of the scope of the invention, which is set forth in the following claims.

..which says that this paper is for entertainment only and tells us nothing until new papers arrive.
