Forum Moderators: Robert Charlton & goodroi


What EXACTLY is the Penguin Algorithm?

         

martinibuster

4:02 am on Mar 17, 2016 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



Read an article last month that asked a dozen "Internet Experts" what their opinion of Penguin was. Many of the responses were clearly about on-page Panda issues.

Funny thing. Nobody discusses what the algorithmic foundations of Penguin are. Have you noticed? Nobody says it's link analysis and points to a patent. In fact, speculation about what the Penguin Algorithm actually is seems to be totally missing. So please, throw your two cents into this discussion. Three if you have it.

I have my ideas about what Penguin is. But I'm interested in yours.

(Note: Facts and speculation only. Jokes and complaints are Off Topic)


[edited by: Robert_Charlton at 7:51 pm (utc) on Mar 17, 2016]
[edit reason] Moved description line to body of post. [/edit]

Andy Langton

11:20 pm on Apr 1, 2016 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



"Machine learning" is flavour of the month and is not the only explanation for slow processing times. Admittedly, it's likely that this type of approach is permeating most Google processes, but is the theory that Penguin started as "machine learning". Surely AI is a way to do things more efficiently, not an explanation in itself?

martinibuster

12:36 am on Apr 2, 2016 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



In my humble opinion Penguin is not a modification engine


I tend to agree with you.

Remember when it was announced that Penguin would be a Real-Time algorithm? [seroundtable.com] The chatter focused on the real-time aspect of it. But the really interesting part to me was the implication in relation to Penguin being inside or outside of the Ranking Engine.

To me, what that statement meant was that Penguin was previously a feature that existed outside of the ranking engine, and that the announcement that Penguin was going to be real-time was a less technical way of communicating that Penguin was being integrated into the Ranking Engine itself.
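
If it helps to picture that distinction, here's a toy sketch in Python (invented pages, scores and penalties, nothing to do with Google's actual systems) of a penalty applied from a periodically refreshed batch table versus one recomputed from current signals at query time:

```python
# Toy sketch: a penalty from a stale batch table ("outside the engine")
# versus one recomputed from current signals at query time ("inside").
# All pages, scores and penalties are invented.

base_scores = {"pageA": 0.80, "pageB": 0.75, "pageC": 0.60}

# Batch table computed at the last refresh; it doesn't yet know that
# pageC has picked up spammy links since then.
batch_penalties = {"pageB": 0.20}

def live_penalty(page):
    # Placeholder lookup of current link signals at query time.
    return {"pageB": 0.20, "pageC": 0.30}.get(page, 0.0)

def rank(pages, penalty_for):
    return sorted(pages, key=lambda p: base_scores[p] - penalty_for(p), reverse=True)

pages = ["pageA", "pageB", "pageC"]
print("batch table :", rank(pages, lambda p: batch_penalties.get(p, 0.0)))
print("query time  :", rank(pages, live_penalty))
```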

Spiekerooger

7:14 am on Apr 2, 2016 (gmt 0)

10+ Year Member Top Contributors Of The Month



Yes, I thought the same about that chatter. So the question remains why they haven't been able to integrate it into the normal ranking engine pipeline. Either it takes too much processing time or power, or the results are not convincing.

Whitey

8:32 am on Apr 2, 2016 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Either it takes too much processing time or power or the results are not convincing.

and/or it wasn't a mission-critical priority to push out quickly from G's point of view.

Andy Langton

10:45 am on Apr 2, 2016 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



So the question remains why they haven't been able to integrate it into the normal ranking engine pipeline.


My assumption is that it's the data size that's the biggest issue, not the processing of it. I.e., the algorithm is fine and dandy, but requires a significant amount of data to deliver good results. Given that this is about links, a likely issue is that operating on partial data does not give reliable output, so "speeding up" and "real time processing" don't work very well.

Such a large amount of link data suggests to me that either a current snapshot of all links to a site (or all linking pages) needs to be assessed, or (worse still in terms of processing) the output needs to be compared to everyone else's.
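
To illustrate why partial link data can be a problem (toy graph, made-up pages, just a sketch of the general idea, not a claim about what Google runs): a simple power-iteration PageRank computed on a partial snapshot can order the same pages quite differently than on the full graph.

```python
# Toy illustration: a link score (simple PageRank) computed on partial
# link data can rank pages differently than the same score on full data.
# All pages and links here are made up.

def pagerank(graph, damping=0.85, iterations=50):
    """graph: dict mapping page -> list of pages it links to."""
    pages = list(graph)
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / len(pages) for p in pages}
        for page, outlinks in graph.items():
            if not outlinks:  # dangling page: spread its rank evenly
                share = damping * rank[page] / len(pages)
                for p in pages:
                    new_rank[p] += share
            else:
                share = damping * rank[page] / len(outlinks)
                for target in outlinks:
                    new_rank[target] += share
        rank = new_rank
    return rank

full_graph = {
    "hub":   ["siteA", "siteB", "siteC"],
    "siteA": ["siteC"],
    "siteB": ["siteC"],
    "siteC": ["hub"],
}

# A "partial snapshot" that is missing some of siteC's inbound links.
partial_graph = {
    "hub":   ["siteA", "siteB"],
    "siteA": [],
    "siteB": [],
    "siteC": ["hub"],
}

for label, g in [("full", full_graph), ("partial", partial_graph)]:
    ranks = pagerank(g)
    print(label, sorted(ranks, key=ranks.get, reverse=True))
```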

Robert Charlton

11:24 am on Apr 2, 2016 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



Regarding the delays in the update... at the risk of redundancy, it should be mentioned that Gary Illyes has commented on this for months now in numerous interviews, really all over the SEO blogosphere at this point, as well as in the cited Eric Enge interview which sits several posts above. Gary's comments in the interview reinforce various statements he's made before, that there have been so many false predictions that he's obviously reluctant to explain yet again that the team is having difficulties...

Google's Gary Illyes on Emerging Search Trends
3/3/2016 by Eric Enge
[stonetemple.com...]

Here's part of Gary's response to Eric's question about Penguin...
I haven't checked with the team for a while. We do definitely check in with them and ask them, "What's up with Penguin?" but I think, as any human, they have a threshold for nagging.... // They are running the experiments, but we will also not launch something that we are not happy with.
While Gary won't comment about whether machine learning is being used, IMO it's completely clear from the context of his remarks that this is machine learning, and that the team is having problems eliminating false positives.

Gary essentially confirms what I posited earlier (and several others here appear to concur), that Penguin is a recursive algorithm, which is very computational and time intensive... going through similar and increasingly refined operations over and over, stopping for evaluation and refinement at each stage.

In the seventh post of this current discussion, I summarized some of my thoughts about the nature of the algorithm and why it's taking so long, thoughts that I'd had fifteen months ago, in a post about Penguin 3. Here's the original thread...

According to Google: Penguin 3.0 is continuing
Dec, 2014
https://www.webmasterworld.com/google/4719313.htm [webmasterworld.com]

My own speculations here: I'm thinking that the algorithm may be highly "recursive"... with the same or related processes repeated on the results of the previous operations, giving us results that are increasingly refined. There's likely a pause to check results at every step, so Google can gauge whether the algorithm is working as anticipated and decide what to do next. Perhaps this will eventually lead to a procedure that can be maintained on a more continuous basis.

I also expanded on recursiveness in a quote in a later post in the Penguin 3.0 thread. Here's part of it, and I should restate the caveat that I'm not a mathematician...
For recursion to terminate, each time the recursion method calls itself with a slightly simpler version of the original problem, the sequence of smaller and smaller problems must converge on the base case.
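
As a toy illustration of that convergence idea only (definitely not a claim about Google's actual procedure), a refinement step can call itself until the change between passes drops below a tolerance; reaching that tolerance is the base case.

```python
# Toy sketch of a recursive refinement converging on a base case.
# Each pass nudges the scores, and recursion stops when the largest
# change between passes is below a tolerance (the "base case").

def refine(scores):
    """One refinement pass: pull each score toward the overall mean."""
    mean = sum(scores.values()) / len(scores)
    return {k: 0.5 * (v + mean) for k, v in scores.items()}

def refine_until_stable(scores, tolerance=1e-3, depth=0):
    new_scores = refine(scores)
    max_change = max(abs(new_scores[k] - scores[k]) for k in scores)
    if max_change < tolerance:  # base case: converged
        return new_scores, depth
    return refine_until_stable(new_scores, tolerance, depth + 1)

initial = {"pageA": 0.9, "pageB": 0.2, "pageC": 0.6}
final, passes = refine_until_stable(initial)
print(passes, final)
```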

As Gary describes it in Eric's interview...
Gary: ...first there's lots of brute tuning going on, and after a while, you reach a phase where you have to actually do really, really tiny fine-tuning on Penguin and algorithms in general. And sometimes that fine-tuning can actually take way more time than the brute tuning. We are working hard to launch it as soon as possible. I can't say more than that.

As Spiekerooger summarizes it in the vernacular of the discipline...
it looks like a machine learning process that is slow in producing acceptable results, maybe overfitting or producing too many false positives.
We can't really know the cause from the outside, but the huge data-size and complexity are factors that can produce such problems. I assume that the team is about as good as they get.
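
To put the false-positive trade-off in concrete terms (made-up scores, purely illustrative): wherever you set the cut-off on a spam score, some clean sites land above it and some spam slips below it, and tightening one side worsens the other.

```python
# Toy illustration of the false-positive trade-off when thresholding
# a spam score. Sites and scores are made up.

sites = [
    ("clean1", 0.10, False), ("clean2", 0.35, False), ("clean3", 0.62, False),
    ("spam1",  0.55, True),  ("spam2",  0.80, True),  ("spam3",  0.91, True),
]

for threshold in (0.5, 0.6, 0.7):
    flagged = [(name, is_spam) for name, score, is_spam in sites if score >= threshold]
    false_positives = [name for name, is_spam in flagged if not is_spam]
    missed_spam = [name for name, score, is_spam in sites
                   if is_spam and score < threshold]
    print(f"threshold {threshold}: flagged={len(flagged)}, "
          f"false positives={false_positives}, missed spam={missed_spam}")
```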

Re machine learning, I assume that's obligatory at this stage. I can't imagine that Penguin could run as part of the core algorithm without machine learning. "Core algorithm", as I understand it, means that it's enough a part of the regular algorithm routine that parts of it are, in fact, routine... but that doesn't mean that it runs by itself or that there are no further changes. Someone somewhere said recently that it becomes part of the core algo when the engineers start forgetting how it works. ;)

Just guessing about what "real time" means for Penguin, which might necessarily be a series of frozen snap-shots, which in themselves might be difficult to synchronize... as Google has many databases.

I'm thinking that when Penguin is "real time", before we see ranking changes there's still got to be enough of a delay or distraction in its roll-outs that the results can't be easily reverse-engineered. I'm also still not convinced, btw, that it's about link-spam only. There was much talk about onpage spam etc... which possibly might be visible in whatever web-graph analysis Google might be doing in addition to measuring and tracking linking properties.

Whitey

1:18 am on Apr 3, 2016 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



There was much talk about onpage spam etc... which possibly might be visible in whatever web-graph analysis Google might be doing in addition to measuring and tracking linking properties.


I'm wondering if that's handled separately, even though it is a consideration. My thinking is that both inbound and outbound links need to show patterns that give a strong indication of matching relevance. The pattern of content surrounding the links might be part of that consideration. In that type of context, I'd be open to thinking Penguin is not just about links.

Some sites have content about a lot of subjects - so the identification is probably a little harder. One clear technique SEOs used in the past was to insert link text into content that closely matched the surrounding text, which was difficult to weed out as being "manipulative". Do it enough and you have a pattern that is likely identifiable.
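
Just to illustrate that pattern with a toy example (invented snippets, not how anyone actually detects this): a crude check is the share of anchor words that also appear in the surrounding sentence.

```python
# Toy sketch: crude overlap between anchor text and its surrounding
# sentence, the kind of "link text matches nearby content" pattern
# described above. All snippets are invented.

def word_overlap(anchor, context):
    a = set(anchor.lower().split())
    c = set(context.lower().split())
    return len(a & c) / len(a) if a else 0.0

examples = [
    ("blue widget reviews", "Read our blue widget reviews before buying any widget."),
    ("cheap payday loans", "Our gardening blog covers roses, tulips and compost tips."),
]

for anchor, context in examples:
    print(f"{anchor!r}: {word_overlap(anchor, context):.0%} of anchor words appear nearby")
```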

Long ago, back in 2007 actually, I was stunned to see Google ranking us for keywords that surrounded the link text on links, both internal and external referring links. But heck, that was 9 years ago.

If you were to list all the things that G wanted to eradicate from influencing ranking with links, what would they be? I think that's easy to work out. Or am I wrong?

(Still, I'm unconvinced of Google's commercial priorities for this "search quality team department" to perfect Penguin - I just don't think the head of Google would be giving it much thought at all - the SERPs are sufficiently stable, and controlling the content layer with Google assets in the SERPs is more likely the area of priority interest.)

aristotle

12:40 pm on Apr 3, 2016 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I haven't checked with the team for a while. We do definitely check in with them and ask them, "What's up with Penguin?" but I think, as any human, they have a threshold for nagging....

Well at least somebody in the company must be wondering what's going on.

martinibuster

2:38 pm on Apr 4, 2016 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



Some sites have content about a lot of subjects - so the identification is probably a little harder.


Not harder. The search engines are way beyond what you think they're capable of.

A web graph is a map of the web

  • The web graph can be created with sites as the nodes, showing site-to-site linking patterns for the entire web. Thus search engines can see site relevance at the community level and how communities form by niche topic.

  • A more detailed web graph can be created with individual web pages as the nodes, showing the linking patterns from web page to web page for the entire Internet. Thus search engines can see how pages on specific topics form distinct communities arranged by niche topic.

  • A web graph can be created at the page-section level, with each section of a page becoming a node, showing section-to-section linking patterns. This recognizes that many web pages are composed of multiple topics. The relevance of one section of a page to a section of a page somewhere else becomes even sharper. Now you can do things like map out all the interconnections from the footer. What kinds of link relationship fun can you do with that? But that's not all you can do with a section-level web graph!
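
To illustrate those three granularities with a toy example (all URLs invented; this is only a sketch of the idea, not how any search engine actually stores its graph), the same raw link data can be rolled up from section level to page level to site level:

```python
from collections import defaultdict
from urllib.parse import urlsplit

# Raw link data at the finest granularity: (source page, source section,
# target page). All URLs and sections are invented for the example.
raw_links = [
    ("http://siteA.example/post1", "body",   "http://siteB.example/guide"),
    ("http://siteA.example/post1", "footer", "http://siteC.example/"),
    ("http://siteA.example/post2", "footer", "http://siteC.example/"),
    ("http://siteB.example/guide", "body",   "http://siteA.example/post1"),
]

# Section-level graph: nodes are (page, section) pairs.
section_graph = defaultdict(set)
for src_page, src_section, dst_page in raw_links:
    section_graph[(src_page, src_section)].add(dst_page)

# Page-level graph: collapse sections, nodes are pages.
page_graph = defaultdict(set)
for src_page, _, dst_page in raw_links:
    page_graph[src_page].add(dst_page)

# Site-level graph: collapse pages to hosts, nodes are sites.
site_graph = defaultdict(set)
for src_page, _, dst_page in raw_links:
    src_site = urlsplit(src_page).netloc
    dst_site = urlsplit(dst_page).netloc
    if src_site != dst_site:
        site_graph[src_site].add(dst_site)

print("footer links only:",
      {k: v for k, v in section_graph.items() if k[1] == "footer"})
print("site-level edges:", dict(site_graph))
```

The footer-only view at the end is the kind of "map out all the interconnections from the footer" slice mentioned above.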

martinibuster

2:51 pm on Apr 4, 2016 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



Here's an interesting fact.

Now just imagine 65% out of those up to 40000 clickworkers downvoted some 500 sites (out of 5000 websites they had to rate).

A machine learning algorithm would try to find a set of many vectors that would help determine algorithmically the features that those bad sites have in common.


It happens that human quality raters make mistakes. Additionally, human quality raters judge sites differently. These aren't flaws in the training system. These are actually rich data points, and studying them results in better training. In these cases the machine is more accurate than the human quality raters.

In some cases the instructions were not specific enough and the quality raters had to re-do the task with better instructions.
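
To put the "features that bad sites have in common" idea in concrete, heavily simplified terms (made-up feature values, nothing resembling a real training pipeline), here's a toy pass over rater labels:

```python
# Toy sketch: given human-rater labels, see which (made-up) features
# separate downvoted sites from approved ones. A real system would use a
# proper learning algorithm over far more features and examples.

sites = [
    {"anchor_repeat": 0.9, "footer_links": 0.7, "velocity": 0.8, "downvoted": True},
    {"anchor_repeat": 0.8, "footer_links": 0.6, "velocity": 0.9, "downvoted": True},
    {"anchor_repeat": 0.2, "footer_links": 0.1, "velocity": 0.3, "downvoted": False},
    {"anchor_repeat": 0.3, "footer_links": 0.2, "velocity": 0.2, "downvoted": False},
    {"anchor_repeat": 0.4, "footer_links": 0.6, "velocity": 0.3, "downvoted": False},
]

features = ["anchor_repeat", "footer_links", "velocity"]
bad = [s for s in sites if s["downvoted"]]
good = [s for s in sites if not s["downvoted"]]

for f in features:
    bad_mean = sum(s[f] for s in bad) / len(bad)
    good_mean = sum(s[f] for s in good) / len(good)
    print(f"{f}: downvoted mean {bad_mean:.2f} vs approved mean {good_mean:.2f}")
```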

Storiale

3:06 pm on Apr 4, 2016 (gmt 0)

10+ Year Member



The evidence points to a lot of activity this weekend - I'm assuming Penguin. We have a web ranking tool that runs queries on over 800 keyword phrases. We test Brand keywords, Category keywords and Brand + Category. The report runs on Sunday nights and there was a lot of fluctuation in the Brand + Category set of keyword phrases.

Because of the changes, I compared those results with manual rankings using incognito mode this morning. Manual rankings compared to the tool are usually accurate for 95% of the queries, with only a few minor changes averaging one position up or down (+/- 1).

Today's results (Monday 4/4/16), checked during the manual confirmation process, show more than 60% differentiation compared to the Sunday report (just 9 hours ago), with an average of +/- 3 positions for BRAND + CATEGORY phrases.

To reiterate: the manual rankings are commensurate with last week's numbers, but the report that ran Sunday night at 11pm shows MUCH different results... period. I am assuming they are testing Penguin.

BTW - The brand keyword phrases and Category keyword phrases showed no change at all.
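
For anyone who wants to run the same kind of comparison on their own data, the arithmetic is just the share of queries whose position moved and the average absolute move. The phrases and positions below are placeholders:

```python
# Toy sketch: compare two ranking snapshots for the same keyword set and
# report how many positions changed and by how much on average.
# Keyword phrases and positions are placeholders.

sunday_report = {"brand widget": 3, "widget category": 7, "brand widget sale": 12}
monday_manual = {"brand widget": 3, "widget category": 10, "brand widget sale": 9}

deltas = {kw: monday_manual[kw] - sunday_report[kw] for kw in sunday_report}
changed = [kw for kw, d in deltas.items() if d != 0]

pct_changed = 100 * len(changed) / len(deltas)
avg_abs_move = sum(abs(deltas[kw]) for kw in changed) / max(len(changed), 1)

print(f"{pct_changed:.0f}% of queries moved, average move +/- {avg_abs_move:.1f} positions")
```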

Walt Hartwell

12:49 am on Apr 5, 2016 (gmt 0)

10+ Year Member Top Contributors Of The Month



The search engines are way beyond what you think they're capable of.

Back in the mid-life TLA days, it was possible to obtain links that had relevance from pages that had very few outbound links. As time went on, people jumped on the "get links" bandwagon with very indiscriminate link placement.

What a person would often end up with was a link among other links from a site structured like:
Crappy site A
Your wonderful site
Crappy site B

Repeat that across a few sites and it is a very obvious pattern that can be expanded to put some "suspicion" on any site that links to site A and/or the wonderful site and/or site B. Penguin is certainly a bit more complex than that; I've always mentally pictured it as a ratio of the percentage of quality links vs the percentage of "suspect" links.
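
Purely to illustrate that ratio idea (made-up link profiles, not a claim about how Penguin actually scores anything):

```python
# Toy sketch of a quality-vs-suspect link ratio per site.
# Which linking pages count as "suspect" is invented for the example.

link_profiles = {
    "wonderful-site.example": {"quality": 40, "suspect": 160},
    "boring-but-clean.example": {"quality": 55, "suspect": 5},
}

for site, links in link_profiles.items():
    total = links["quality"] + links["suspect"]
    suspect_share = links["suspect"] / total
    flag = "worth a closer look" if suspect_share > 0.5 else "looks ok"
    print(f"{site}: {suspect_share:.0%} suspect links -> {flag}")
```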