Using Semantic Analysis to Classify Search Engine Spam

With some useful tools of the trade available

         

Marcia

9:02 am on Jun 14, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



There's been a lot of recent interest in and discussion about Latent Semantic Indexing. This Stanford paper, although the research was conducted for different reasons, presents some interesting concepts that could prove very useful to us when constructing and/or analyzing web pages.

The researchers concluded that the method has limitations for its originally intended purpose, but held out promise for other possibilities.

Due to the similarities between spam and non-spam our original semantic analyzers are not an effective method to classify spam content. Since spam and non-spam documents are so similar, it is sometimes very difficult for a human to differentiate between the two. Because of these similarities, it is unlikely that any natural language analysis method will be successful in differentiating between spam and non-spam.

However, using semantic analyzers to determine the usefulness of information on a webpage had much more promising results. Assuming the user is more interested in finding a quick answer to their query, a page with more textual information should have a higher rank. Our analyzers could help to determine this rank.
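To make the "more textual information" idea a bit more concrete, here is a rough sketch that measures how much of a page is visible text. It is not the analyzer from the paper, which isn't described in enough detail here to reproduce; the names `VisibleTextExtractor` and `text_share` are made up for illustration, and the ratio is only an assumed stand-in for "amount of textual information".

```python
from html.parser import HTMLParser


class VisibleTextExtractor(HTMLParser):
    """Collects text that appears outside <script> and <style> elements."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Only keep text that a visitor would actually read.
        if self._skip_depth == 0:
            self.chunks.append(data)


def text_share(html):
    """Return (word_count, visible-text characters / total characters).

    Hypothetical proxy for "textual information"; not the paper's metric.
    """
    parser = VisibleTextExtractor()
    parser.feed(html)
    parser.close()
    # Collapse runs of whitespace so layout indentation is not counted as text.
    text = " ".join(" ".join(parser.chunks).split())
    words = len(text.split())
    share = len(text) / len(html) if html else 0.0
    return words, share
```

Comparing `text_share(page_a)` with `text_share(page_b)` gives a crude sense of which page carries more readable content relative to its markup. It says nothing about whether that content is any good.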

Stanford Semantic Analysis Paper - PDF Document [stanford.edu]

HTML Version from the Google Cache [64.233.167.104]

This is the open source software at SourceForge, which uses a variant of LSA

Infomap NLP Software [infomap-nlp.sourceforge.net]

And here's the demo & search engine at the Stanford project site

Infomap Demo & Search Engine [infomap.stanford.edu]

Plus some other semantics related toys to play with there.

For the really dedicated fans of semantic analysis, there's another resource available at the Princeton University Cognitive Science Laboratory:

WordNet, a Lexical Database for the English Language [cogsci.princeton.edu]

Now the task is to clarify and classify the principles for purposes of practical application.
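As a starting point for the practical side, here is a toy sketch of the word-space idea that LSA-style tools build on: each word is represented by counts of the words that appear near it, and two words are compared by the cosine of their count vectors. This is an assumption-laden simplification, not Infomap's code. It leaves out the dimension-reduction (SVD) step entirely, and the function names and the `window` parameter are invented for the example.

```python
import math
from collections import Counter, defaultdict


def cooccurrence_vectors(sentences, window=2):
    """Map each word to a Counter of the words seen within `window` positions of it."""
    vectors = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[word][tokens[j]] += 1
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Tiny made-up corpus, purely for illustration.
sentences = [
    "the cat drinks milk",
    "the dog drinks water",
    "the car needs fuel",
]
vectors = cooccurrence_vectors(sentences)
cat_dog = cosine(vectors["cat"], vectors["dog"])
cat_car = cosine(vectors["cat"], vectors["car"])
```

On this tiny corpus "cat" comes out closer to "dog" than to "car", because the first two share more neighbouring words. Real systems work from far larger corpora and reduce the vectors before comparing them.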

marin

2:12 pm on Jun 14, 2004 (gmt 0)

10+ Year Member



<There's been a lot of recent interest in and discussion about Latent Semantic Indexing >

Where?

trillianjedi

2:29 pm on Jun 14, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Where?

All over the Google News forum since about February.

Excellent post, Marcia. Some good reading in there.

Many thanks.

TJ

Wail

2:30 pm on Jun 14, 2004 (gmt 0)

10+ Year Member



Here: [webmasterworld.com...]

It is a hot topic. Fair enough, it might be whispered rather than shouted, but all the best bits of SEO are.

Marcia

2:45 pm on Jun 14, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



And here also - Microsoft Research
[webmasterworld.com...]

Notice the forum name here: Toolbar & Desktop Applications
[webmasterworld.com...]

They're very relevant concepts for static search in non-hyperlinked environments, which is being actively pursued by some major players. For our practical purposes, it isn't so much about theoretical dissection as about using whatever information is available to learn how to apply the concepts to websites. Whether or not LSI is actually in place, there are still elements that it certainly can't hurt to use.