I wonder if Google is using some "dumb" static filters or more sophisticated filters based on probabilities. What do you guess?
I think the majority of those who blamed their PR0s on cross-linking problems were really just caught by hard filters on strings. That's why, out of 30 interlinked sites, 20 got zapped and 10 were left standing.
One of the most interesting factoids from the article appears to be this one:
The cost is enormous for the recipients, about 5 man-weeks for each million recipients who spend a second to delete the spam, but the spammer doesn't have to pay that.
Used by itself, it is an arbitrary and unfair/inaccurate method for spam detection (assuming you want to get rid of 100% of spam and 0% of non-spam).
I believe google uses a sliding scale based on PR and other factors. I could be wrong, but my opinion is google is more likely to tag you if you have low PR and fail other tests than just tag people with a link to qksrv.net.
I think their thinking would be - eh - it is a PR1 anyway - who is going to miss it.
It appears to me that higher-quality sites don't get caught up in these filters. In other words, linking to a bad neighborhood isn't going to get Yahoo banned any time soon, but it may get your PR1 site banned if it also fails more tests.
I think string searching could be an accurate measure of spam, but only when used with other criteria. It smacks of "profiling" and I don't think google would ever use it by itself.
I also don't think it would be done in an arbitrary manner, but a more balanced measure based on other factors. For example, distribution of links across PR1 pages vs higher PR pages. So that pages that have a significantly higher number of links to certain sites from low PR pages (across a broad spectrum) are more likely to be tagged as bad links.
Remember that google likes to avoid human intervention. Using a method such as above could result in a similar penalty, but without human biases. It would probably end up being more accurate as well.
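For illustration only, a heuristic like that could be sketched as follows. This is entirely hypothetical: the PR cutoff and the threshold are invented, and nothing here is a documented Google test.

```python
def low_pr_link_ratio(inbound_prs, low_pr_cutoff=1):
    """Fraction of a site's inbound links that come from low-PR pages.

    inbound_prs: list of PageRank values, one per linking page.
    """
    if not inbound_prs:
        return 0.0
    low = sum(1 for pr in inbound_prs if pr <= low_pr_cutoff)
    return low / len(inbound_prs)

def looks_like_bad_links(inbound_prs, threshold=0.9):
    """Flag a site whose inbound links are overwhelmingly from low-PR pages."""
    return low_pr_link_ratio(inbound_prs) >= threshold
```

A site linked almost exclusively from PR0/PR1 pages would trip the filter, while one with a broad spread of PR values would not - which is the "distribution across PR levels" idea, not a hard string match.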
I started a project a while back (which was shut down due to funding problems, ie a dot com gone under) that used ANN to generalize a piece of information.
The idea was:
You set up the simplest possible feed-forward network, with a GA as your learning mechanism, and with the number of inputs equal to the average word count of the files in the test pool.
(Numbers are made up)
So you have 50k files with 100 words in a file on average.
You create a network with 100 inputs and 1 output. The hidden node count is determined during the experimentation.
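A minimal version of such a network might look like this. It's a sketch only: the hidden-node count and the activation functions are my assumptions, since the post says the hidden count was found by experimentation.

```python
import math
import random

class FeedForwardNet:
    """Minimal one-hidden-layer feed-forward net. The 100-inputs /
    1-output shape is the made-up example from the post."""

    def __init__(self, n_in=100, n_hidden=10, rng=None):
        rng = rng or random.Random(0)
        # Plain weight lists, so a GA can mutate/crossover them directly.
        self.w_hidden = [[rng.uniform(-1, 1) for _ in range(n_in)]
                         for _ in range(n_hidden)]
        self.w_out = [rng.uniform(-1, 1) for _ in range(n_hidden)]

    def forward(self, inputs):
        """Return a value in (0, 1): the spam probability for one file."""
        hidden = [math.tanh(sum(w * x for w, x in zip(row, inputs)))
                  for row in self.w_hidden]
        out = sum(w * h for w, h in zip(self.w_out, hidden))
        return 1.0 / (1.0 + math.exp(-out))  # sigmoid squashes to (0, 1)
```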
After that, you use something like MD5 to convert each word (with a variable number of characters) into a hash (a fixed-length bit pattern).
MD5 is a good candidate because there is a very small probability of getting the same hash for two different words.
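The hashing step might look like this. How the 128 MD5 bits are mapped onto net inputs is not specified in the post; truncating to 32 bits and normalizing to [0, 1] is just one possible choice.

```python
import hashlib

def word_to_input(word, n_bits=32):
    """Hash a variable-length word into a fixed-length bit pattern with
    MD5, then scale it to a single float in [0, 1] for the net."""
    digest = hashlib.md5(word.encode("utf-8")).digest()
    value = int.from_bytes(digest[: n_bits // 8], "big")
    return value / float(2 ** n_bits - 1)
```

The same word always maps to the same input, and distinct words almost never collide - which is all the net needs.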
Next, you start training your net on the test set. Passing the hashes of the words in each file into the net and telling it if that was actually spam or not.
If a particular file has more than 100 words - you take 100 random words from it. If less - you take all the words it has and then pick random words from it until you get 100. (not the best approach but it worked)
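The truncate-or-pad step can be sketched like this (a rough illustration; the original project's sampling code isn't shown in the post):

```python
import random

def sample_words(words, n=100, rng=None):
    """Fix the input size at n words, per the post: truncate long files
    by random sampling, pad short ones by re-picking random words."""
    rng = rng or random.Random(42)
    if len(words) >= n:
        return rng.sample(words, n)
    padded = list(words)
    while len(padded) < n:
        padded.append(rng.choice(words))
    return padded
```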
The idea behind it is to create a population of nets with random initial weights. After that, you start feeding it the input and compare the output.
The output is a double (actually a bit mask converted to double) which represents the probability of a file being a piece of spam.
Something like 0.0 - not spam, 1.0 - spam.
For any particular test case, you have:
- input hash
- output of the net
- the correct output that matches the test file
The negative of the absolute value of the difference between the output of the net and the correct output is the fitness function for the GA.
So if the net output is 1.0 and the correct answer is 1.0, the net guessed right. The same with 0.0 and 0.0. (Though an exact match should almost never happen.)
A more realistic pair would be: 1.0 and 0.9245345343, or 0.0 and 0.2534534534.
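As a sketch, the fitness function described above is just:

```python
def fitness(net_output, correct_output):
    """Negative absolute error: 0.0 is a perfect guess, and the more
    negative the value, the worse the net did on that file."""
    return -abs(net_output - correct_output)
```

The GA then simply prefers nets whose fitness is closest to zero across the test set.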
You set GA with the properties close to: mutation 90%, crossover 10%, reproduction 0% (much more efficient to use elitism).
So after many, many iterations, you get a few nets that can generalize a file and produce the probability of it being spam.
Training takes a lot of resources, but once you get a good net - the pass-through does not take much resources.
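Putting the GA pieces together, one generation might look like this. It's a sketch under the post's rough settings (~10% crossover, the rest mutation, no plain reproduction, elitism); the mutation sigma, elite count, and fitter-half parent selection are my assumptions.

```python
import random

def next_generation(population, fitness_of, rng=None,
                    p_crossover=0.1, sigma=0.1, elite=2):
    """One GA generation over flat weight vectors (each genome is a
    list of floats, e.g. a net's weights laid out end to end)."""
    rng = rng or random.Random(0)
    ranked = sorted(population, key=fitness_of, reverse=True)
    parents = ranked[: max(2, len(ranked) // 2)]       # fitter half breeds
    new_pop = [list(g) for g in ranked[:elite]]        # elitism: keep the best
    while len(new_pop) < len(population):
        mother = rng.choice(parents)
        if rng.random() < p_crossover:                 # ~10%: one-point crossover
            father = rng.choice(parents)
            cut = rng.randrange(1, len(mother))
            child = mother[:cut] + list(father[cut:])
        else:                                          # ~90%: Gaussian mutation
            child = [w + rng.gauss(0.0, sigma) for w in mother]
        new_pop.append(child)
    return new_pop
```

Run this in a loop, re-scoring fitness against the labeled test files each generation, and the best genomes drift toward nets that separate spam from non-spam.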
And if you keep training the nets on separate hardware as new test files are added, you get a net that evolves with spammers' habits.
Of course, this is an oversimplified description, but with lots of experimentation it gets better and better.
And you can always set your filters, like drop all files with probability higher than 0.85. If that's too much - then try 0.75, etc...
The system was being developed for e-mail spam and did not work out mainly because there was never a guarantee that a valid file would not get dropped, and that's unacceptable.
But for Google - if it loses your site along with 1000 spammers, I don't think it would hurt them.
What kind of spam could you try and detect with that kind of set up?
As I see it, there are a number of different types of web 'spam':
1. pages designed to appear for irrelevant results.
2. duplicated content across different domains.
3. pages using dubious techniques to boost positioning (hidden text etc).
4. sites using dubious external techniques (link farms, Zeus etc.)
I would think that each of these would require different detection mechanisms.
It would do nothing against #2 and #4, and would be most effective against #1 and #3, because it deals with one page at a time. It does not analyze links and relationships, just the contents of the page (any file, for that matter).
Although the same principle could be applied to #2 and #4, I have no data to back this up.
I might try to implement it later on as a personal project, but it's not going to be any time soon.
It is quite possible to create completely transparent spam, the detection of which would require the full attention of a human and his/her investigation for considerably longer than just a couple of minutes.
If there is a way to quantify and normalize the input - it can be filtered with some degree of certainty.
But I honestly believe a site that would require the full attention of a human for more than a few minutes is not the problem.
If it takes more than a few minutes to review, it would take even more time to create, and that limits the total number of such sites on the net (and in Google's index).