Forum Moderators: Robert Charlton & goodroi
What EXACTLY is the Penguin Algorithm?
[edited by: Robert_Charlton at 7:51 pm (utc) on Mar 17, 2016]
[edit reason] Moved description line to body of post. [/edit]
In my humble opinion, Penguin is not a modification engine.
So the question remains: why haven't they been able to integrate it into the normal ranking engine pipeline?
Gary: I haven't checked with the team for a while. We do definitely check in with them and ask them, "What's up with Penguin?" but I think, as any human, they have a threshold for nagging.... They are running the experiments, but we will also not launch something that we are not happy with.

While Gary won't comment about whether machine learning is being used, IMO it's completely clear from the context of his remarks that this is machine learning, and that the team is having problems eliminating false positives.
My own speculations here: I'm thinking that the algorithm may be highly "recursive"... with the same or related processes repeated on the results of the previous operations, giving us results that are increasingly refined. There's likely a pause to check results at every step, so Google can gauge whether the algorithm is working as anticipated and decide what to do next. Perhaps this will eventually lead to a procedure that can be maintained on a more continuous basis.
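A minimal sketch of that refine-and-check pattern. The filtering rule, the threshold, and the function names here are all invented for illustration; this just shows "run the same process on the previous results, pause to check, repeat until stable":

```python
def refine(scores, threshold):
    """One pass of a hypothetical refinement step: keep only
    results that clear the threshold."""
    return [s for s in scores if s >= threshold]

def refine_until_stable(scores, threshold, max_rounds=10):
    """Repeat the same process on the previous round's output,
    pausing after each step to check the results, until they
    stop changing (or a round limit is hit)."""
    for _ in range(max_rounds):
        refined = refine(scores, threshold)
        if refined == scores:  # results stable: nothing left to refine
            return refined
        scores = refined       # feed the results back into the next pass
    return scores
```

For example, `refine_until_stable([1, 5, 9, 3, 7], 4)` drops 1 and 3 on the first pass, then finds the second pass changes nothing and stops.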
For recursion to terminate, each recursive call must be made with a slightly simpler version of the original problem, so that the sequence of smaller and smaller problems converges on the base case.
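A textbook illustration of that termination condition (nothing Penguin-specific, just the general pattern):

```python
def sum_digits(n):
    """Each call receives a strictly smaller problem (n // 10),
    so the sequence of shrinking problems converges on the
    base case n == 0."""
    if n == 0:       # base case: the recursion terminates here
        return 0
    return n % 10 + sum_digits(n // 10)
```

`sum_digits(1234)` works through 123, 12, 1, and finally 0, at which point the calls unwind and return 10.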
Gary: ...first there's lots of brute tuning going on, and after a while, you reach a phase where you have to actually do really, really tiny fine-tuning on Penguin and algorithms in general. And sometimes that fine-tuning can actually take way more time than the brute tuning. We are working hard to launch it as soon as possible. I can't say more than that.
It looks like a machine learning process that is slow in producing acceptable results, maybe overfitting or producing too many false positives. We can't really know the cause from the outside, but the huge data size and complexity are factors that can produce such problems. I assume that the team is about as good as they get.
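To see why false positives are such a hard problem at Google's scale: even a classifier with good headline numbers can wrongly flag a large share of innocent sites when spam is relatively rare. A back-of-the-envelope sketch, with all numbers invented for illustration:

```python
# Hypothetical numbers: 5% of sites are spam; the classifier
# catches 90% of spam but also wrongly flags 2% of clean sites.
sites = 1_000_000
spam = int(sites * 0.05)             # 50,000 spam sites
clean = sites - spam                 # 950,000 clean sites

true_positives = int(spam * 0.90)    # 45,000 spam sites correctly flagged
false_positives = int(clean * 0.02)  # 19,000 clean sites wrongly flagged

# Of everything flagged, almost a third is a false positive --
# an innocent site penalized by the algorithm.
precision = true_positives / (true_positives + false_positives)
```

With these made-up rates, a seemingly small 2% false-positive rate on clean sites still means roughly 19,000 innocent sites penalized, which is the kind of outcome Google would not be "happy with."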
There was much talk about on-page spam, etc., which might be visible in whatever web-graph analysis Google is doing in addition to measuring and tracking linking properties.
Some sites have content about a lot of subjects, so identification is probably a little harder.
Now just imagine that 65% of those up to 40,000 clickworkers downvoted some 500 sites (out of the 5,000 websites they had to rate).
A machine learning algorithm would then try to find a set of feature vectors that helps determine, algorithmically, which features those bad sites have in common.
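A toy sketch of that idea, using the 500-of-5,000 numbers from above. The features, their names, and the learner (a simple perceptron) are all invented stand-ins for whatever Google actually uses; the point is only that rater downvotes become labels, and the learner finds weights that separate the downvoted sites from the rest:

```python
import random

random.seed(0)  # deterministic toy data

# Hypothetical setup: each site is reduced to two numeric features
# (say, a spammy-link ratio and a keyword-stuffing score -- invented
# names), and the clickworkers' downvotes provide the labels.
def make_site(downvoted):
    if downvoted:  # downvoted sites drawn with high spam-like values
        return [random.uniform(0.6, 1.0), random.uniform(0.6, 1.0)], 1
    return [random.uniform(0.0, 0.4), random.uniform(0.0, 0.4)], 0

# 500 downvoted sites out of 5,000, matching the numbers in the post
data = [make_site(i < 500) for i in range(5000)]
random.shuffle(data)

# A perceptron: one classic learner that finds feature weights the
# downvoted sites have in common.
w, b = [0.0, 0.0], 0.0
for _ in range(20):  # passes over the rated sites
    for x, y in data:
        pred = 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0
        if pred != y:  # mistake: nudge the weights toward the label
            step = 1 if y == 1 else -1
            w = [wi + step * xi for wi, xi in zip(w, x)]
            b += step

# How well the learned weights recover the raters' judgments
accuracy = sum(
    (1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0) == y
    for x, y in data
) / len(data)
```

Because the toy data is cleanly separable, the perceptron recovers the raters' labels almost perfectly; real web data would be far noisier, which is exactly where overfitting and false positives creep in.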
The search engines are way beyond what you think they're capable of.