I wonder if Google is using some "dumb" static filters or more sophisticated filters based on probabilities. What do you guess?
I think the majority of those who blamed their PR0s on cross-linking problems were really just caught by hard filters on strings. That's why, out of 30 interlinked sites, 20 got zapped and 10 were left standing.
One of the most interesting factoids from the article appears to be this one:
The cost is enormous for the recipients, about 5 man-weeks for each million recipients who spend a second to delete the spam, but the spammer doesn't have to pay that.
Used by itself, it is an arbitrary and unfair/inaccurate method for spam detection (assuming you want to get rid of 100% of spam and 0% of non-spam).
I believe google uses a sliding scale based on PR and other factors. I could be wrong, but my opinion is google is more likely to tag you if you have low PR and fail other tests than just tag people with a link to qksrv.net.
I think their thinking would be - eh - it is a PR1 anyway - who is going to miss it.
It appears to me that higher-quality sites don't get caught up in these filters. In other words, linking to a bad neighborhood isn't going to get Yahoo banned any time soon, but it may get your PR1 site banned if it also fails more tests.
I think string searching could be an accurate measure of spam, but only when used with other criteria. It smacks of "profiling" and I don't think google would ever use it by itself.
I also don't think it would be done in an arbitrary manner, but a more balanced measure based on other factors. For example, distribution of links across PR1 pages vs higher PR pages. So that pages that have a significantly higher number of links to certain sites from low PR pages (across a broad spectrum) are more likely to be tagged as bad links.
Remember that google likes to avoid human intervention. Using a method such as above could result in a similar penalty, but without human biases. It would probably end up being more accurate as well.
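For illustration only, a heuristic like that could be sketched as follows. This is entirely hypothetical: the PR cutoff and the threshold are invented, and nothing here is a documented Google test.

```python
def low_pr_link_ratio(inbound_prs, low_pr_cutoff=1):
    """Fraction of a site's inbound links that come from low-PR pages.

    inbound_prs: list of PageRank values, one per linking page.
    """
    if not inbound_prs:
        return 0.0
    low = sum(1 for pr in inbound_prs if pr <= low_pr_cutoff)
    return low / len(inbound_prs)

def looks_like_bad_links(inbound_prs, threshold=0.9):
    """Flag a site whose inbound links are overwhelmingly from low-PR pages."""
    return low_pr_link_ratio(inbound_prs) >= threshold
```

A site linked almost exclusively from PR0/PR1 pages would trip the filter, while one with a broad spread of PR values would not - which is the "distribution across PR levels" idea, not a hard string match.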
I started a project a while back (which was shut down due to funding problems, ie a dot com gone under) that used ANN to generalize a piece of information.
The idea was:
You set up the simplest possible feed-forward network, with a GA as your learning mechanism, and with the number of inputs equal to the average word count of the files in the test pool.
(Numbers are made up)
So you have 50k files with 100 words in a file on average.
You create a network with 100 inputs and 1 output. The hidden node count is determined during the experimentation.
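A minimal version of such a network might look like this. It's a sketch only: the hidden-node count and the activation functions are my assumptions, since the post says the hidden count was found by experimentation.

```python
import math
import random

class FeedForwardNet:
    """Minimal one-hidden-layer feed-forward net. The 100-inputs /
    1-output shape is the made-up example from the post."""

    def __init__(self, n_in=100, n_hidden=10, rng=None):
        rng = rng or random.Random(0)
        # Plain weight lists, so a GA can mutate/crossover them directly.
        self.w_hidden = [[rng.uniform(-1, 1) for _ in range(n_in)]
                         for _ in range(n_hidden)]
        self.w_out = [rng.uniform(-1, 1) for _ in range(n_hidden)]

    def forward(self, inputs):
        """Return a value in (0, 1): the spam probability for one file."""
        hidden = [math.tanh(sum(w * x for w, x in zip(row, inputs)))
                  for row in self.w_hidden]
        out = sum(w * h for w, h in zip(self.w_out, hidden))
        return 1.0 / (1.0 + math.exp(-out))  # sigmoid squashes to (0, 1)
```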
After that, you use something like MD5 to convert each word (with a variable number of characters) into a hash (a fixed-length bit pattern).
MD5 is a good candidate because there is a very small probability of getting the same hash for two different words.
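The hashing step might look like this. How the 128 MD5 bits are mapped onto net inputs is not specified in the post; truncating to 32 bits and normalizing to [0, 1] is just one possible choice.

```python
import hashlib

def word_to_input(word, n_bits=32):
    """Hash a variable-length word into a fixed-length bit pattern with
    MD5, then scale it to a single float in [0, 1] for the net."""
    digest = hashlib.md5(word.encode("utf-8")).digest()
    value = int.from_bytes(digest[: n_bits // 8], "big")
    return value / float(2 ** n_bits - 1)
```

The same word always maps to the same input, and distinct words almost never collide - which is all the net needs.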
Next, you start training your net on the test set. Passing the hashes of the words in each file into the net and telling it if that was actually spam or not.
If a particular file has more than 100 words - you take 100 random words from it. If less - you take all the words it has and then pick random words from it until you get 100. (not the best approach but it worked)
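The truncate-or-pad step can be sketched like this (a rough illustration; the original project's sampling code isn't shown in the post):

```python
import random

def sample_words(words, n=100, rng=None):
    """Fix the input size at n words, per the post: truncate long files
    by random sampling, pad short ones by re-picking random words."""
    rng = rng or random.Random(42)
    if len(words) >= n:
        return rng.sample(words, n)
    padded = list(words)
    while len(padded) < n:
        padded.append(rng.choice(words))
    return padded
```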
The idea behind it is to create a population of nets with random initial weights. After that, you start feeding it the input and compare the output.
The output is a double (actually a bit mask converted to double) which represents the probability of a file being a piece of spam.
Something like 0.0 - not spam, 1.0 - spam.
For any particular test case, you have:
- input hash
- output of the net
- the correct output that matches the test file
The negative of the absolute value of the difference between the output of the net and the correct output is the fitness function for the GA.
So if the net output is 1.0 and the correct answer is 1.0, the net guessed right. The same with 0.0 and 0.0. (Though an exact match should almost never happen.)
A more realistic pair would be: 1.0 and 0.9245345343, or 0.0 and 0.2534534534.
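As a sketch, the fitness function described above is just:

```python
def fitness(net_output, correct_output):
    """Negative absolute error: 0.0 is a perfect guess, and the more
    negative the value, the worse the net did on that file."""
    return -abs(net_output - correct_output)
```

The GA then simply prefers nets whose fitness is closest to zero across the test set.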
You set GA with the properties close to: mutation 90%, crossover 10%, reproduction 0% (much more efficient to use elitism).
So after many, many iterations, you get a few nets that can generalize a file and produce the probability of it being spam.
Training takes a lot of resources, but once you get a good net - the pass-through does not take much resources.
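Putting the GA pieces together, one generation might look like this. It's a sketch under the post's rough settings (~10% crossover, the rest mutation, no plain reproduction, elitism); the mutation sigma, elite count, and fitter-half parent selection are my assumptions.

```python
import random

def next_generation(population, fitness_of, rng=None,
                    p_crossover=0.1, sigma=0.1, elite=2):
    """One GA generation over flat weight vectors (each genome is a
    list of floats, e.g. a net's weights laid out end to end)."""
    rng = rng or random.Random(0)
    ranked = sorted(population, key=fitness_of, reverse=True)
    parents = ranked[: max(2, len(ranked) // 2)]       # fitter half breeds
    new_pop = [list(g) for g in ranked[:elite]]        # elitism: keep the best
    while len(new_pop) < len(population):
        mother = rng.choice(parents)
        if rng.random() < p_crossover:                 # ~10%: one-point crossover
            father = rng.choice(parents)
            cut = rng.randrange(1, len(mother))
            child = mother[:cut] + list(father[cut:])
        else:                                          # ~90%: Gaussian mutation
            child = [w + rng.gauss(0.0, sigma) for w in mother]
        new_pop.append(child)
    return new_pop
```

Run this in a loop, re-scoring fitness against the labeled test files each generation, and the best genomes drift toward nets that separate spam from non-spam.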
And if you keep training the nets on separate hardware as new test files are added, you get a net that evolves with spammers' habits.
Of course, this is an oversimplified description, but with lots of experimentation it gets better and better.
And you can always set your filters, like drop all files with probability higher than 0.85. If that's too much - then try 0.75, etc...
The system was being developed for e-mail spam and did not work out mainly because there was never a guarantee that a valid file would not get dropped, and that's unacceptable.
But for Google - if it loses your site along with 1000 spammers, I don't think it would hurt them.
What kind of spam could you try and detect with that kind of set up?
As I see it, there are a number of different types of web 'spam':
1. pages designed to appear for irrelevant results.
2. duplicated content across different domains.
3. pages using dubious techniques to boost positioning (hidden text etc).
4. sites using dubious external techniques (link farms, Zeus etc.)
I would think that each of these would require different detection mechanisms.
It would do nothing against #2 and #4, and would be most effective against #1 and #3, because it deals with one page at a time. It does not analyze links and relationships, just the contents of the page (any file, for that matter).
Although the same principle could be applied to #2 and #4, I have no data to back this up.
I might try to implement it later on as a personal project, but it's not going to be any time soon.
It is quite possible to create completely transparent spam, the detection of which would require the full attention of a human and his/her investigation for considerably longer than just a couple of minutes.
If there is a way to quantify and normalize the input - it can be filtered with some degree of certainty.
But I honestly believe a site that would require the full attention of a human for more than a few minutes is not the problem.
If it takes more than a few minutes to review, it would take even more time to create, and that limits the total number of such sites on the net (and in Google's index).