
What EXACTLY is the Penguin Algorithm?

     
4:02 am on Mar 17, 2016 (gmt 0)

martinibuster (Moderator from US)


Read an article last month that asked a dozen "Internet Experts" what their opinion of Penguin was. Many of the responses were clearly about on-page Panda issues.

Funny thing. Nobody discusses what the algorithmic foundations of Penguin are. Have you noticed? Nobody says it's link analysis and points to a patent. In fact, speculation about what the Penguin Algorithm actually is seems to be totally missing. So please, throw your two cents into this discussion. Three if you have it.

I have my ideas about what Penguin is. But I'm interested in yours.

(Note: Facts and speculation only. Jokes and complaints are Off Topic)


[edited by: Robert_Charlton at 7:51 pm (utc) on Mar 17, 2016]
[edit reason] Moved description line to body of post. [/edit]

8:23 pm on Mar 17, 2016 (gmt 0)

martinibuster (Moderator from US)


Everyone has ideas about what Panda is. But what about Penguin?
9:24 pm on Mar 17, 2016 (gmt 0)

andy_langton (Forum Moderator, GB)


This is a very good question, and I won't pretend I can give a categorical answer.

My take is this:

- Google uses a variety of techniques to identify 'poor quality' pages, from textual analysis to pattern analysis, site history, links, and everything in between
- Links from these pages are bad links
- Certain types of links from 'good' pages can also be bad links. These will be links in suspicious contexts - anchor text, placement etc.
- Over a certain threshold, bad links are actively negative, rather than ignored. This threshold depends upon context, at both the keyword level and the site level
- If there are too many bad links, your relevance for perceived target keywords will be decreased, and your rankings will be lower than would normally be expected

The processing of Penguin seems to be CPU-intensive - hence the long delays and the constant speculation about updates and whether the algorithm is "real time". This is reminiscent of the period before Google made PageRank a constant calculation rather than a periodic process. It implies Penguin is testing things above and beyond normal ranking procedures.
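To make the threshold idea concrete, here is a minimal Python sketch; the features, weights, and cutoff are all invented for illustration, not anything Google has confirmed:

```python
# Minimal sketch of threshold-based bad-link scoring. Feature names,
# weights, and the threshold value are hypothetical.

def link_badness(link):
    """Score one inbound link; higher means more suspicious."""
    score = 0.0
    if link["source_quality"] < 0.3:    # link from a 'poor quality' page
        score += 1.0
    if link["anchor_is_money_term"]:    # suspicious anchor text context
        score += 0.5
    if link["placement"] == "footer":   # suspicious placement
        score += 0.25
    return score

def penguin_effect(links, threshold=10.0):
    """Below the threshold, bad links are ignored; above it,
    they become actively negative rather than merely discounted."""
    total_badness = sum(link_badness(l) for l in links)
    if total_badness <= threshold:
        return 0.0                        # ignored
    return -(total_badness - threshold)   # demotion grows past the threshold

# A small pile of bad links is ignored; a large one flips to a penalty.
links = [{"source_quality": 0.1, "anchor_is_money_term": True,
          "placement": "footer"}] * 5
print(penguin_effect(links))       # 0.0  (under the threshold: ignored)
print(penguin_effect(links * 3))   # -16.25 (over the threshold: negative)
```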
9:24 pm on Mar 17, 2016 (gmt 0)

aristotle (Senior Member)


The Penguin part of the algorithm is generally thought to be a counter-measure against artificial link building. It looks for unnatural patterns in a site's backlink profile, such as an over-occurrence of a particular anchor text, or a high percentage of backlinks of a certain type.
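As a toy illustration of the over-occurrence signal: one could flag a profile where a single anchor text takes an implausibly large share of all backlinks. The 30% cutoff below is an invented number:

```python
# Sketch of anchor-text over-occurrence detection; the cutoff is invented.
from collections import Counter

def top_anchor_share(anchors):
    """Return the most common anchor text and its share of the profile."""
    counts = Counter(anchors)
    anchor, count = counts.most_common(1)[0]
    return anchor, count / len(anchors)

anchors = ["cheap widgets"] * 60 + ["example.com"] * 25 + ["click here"] * 15
anchor, share = top_anchor_share(anchors)
if share > 0.30:
    print(f"unnatural profile? {anchor!r} is {share:.0%} of all anchors")
```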
10:22 pm on Mar 17, 2016 (gmt 0)

Shepherd (Senior Member, US)


The Penguin Algorithm is... a tool used by Google to "adjust" the search results, mainly for commercial keyword searches.
10:45 pm on Mar 17, 2016 (gmt 0)

Wilburforce (Senior Member, GB)


What aristotle said succinctly summarises my own view.

It would also be interesting to know what EXACTLY the disavow tool does. I'm pretty sure it has helped Google in refining Penguin to reduce false positives, but I don't know of any specific effect it has had on any specific site.
10:57 pm on Mar 17, 2016 (gmt 0)

robert_charlton (Forum Moderator, US)


Some general thoughts....

Penguin was/is a "webspam" penalty, and not confined to links, though manipulative inbound links have been seen as the most common cause.

Beyond inbound links that fall into statistical patterns which Google interprets as manipulative (stuffed anchor text being the most obvious signal), there are also outbound links that do the same, including excessive link exchanges. I'm sure that Google used as much of the historical ranking patent as it could make work. Not sure about nav linking within a site.

Additionally, as I remember from Matt Cutts comments somewhere, some types of keyword stuffed onpage text content (eg, large blocks of keywords, possibly hidden text), scraped content, and also redirected pages, etc are part of the mix. In other words, why not all spam factors that could be looked at statistically?

Note that Penguin and Panda are often related... as sites with, say, excessively thin or shallow content, which couldn't get "freely given natural links", have often relied on unnatural links or spammy techniques to rank. That said, the two algos look at different factors, though they may work in somewhat similar ways mathematically.

I've felt that both Panda and Penguin are statistically "recursive" algorithms, and posted that thought about Penguin in this thread...

According to Google: Penguin 3.0 is continuing
Dec, 2014
https://www.webmasterworld.com/google/4719313.htm

My own speculations here: I'm thinking that the algorithm may be highly "recursive"... with the same or related processes repeated on the results of the previous operations, giving us results that are increasingly refined. There's likely a pause to check results at every step, so Google can gauge whether the algorithm is working as anticipated and decide what to do next. Perhaps this will eventually lead to a procedure that can be maintained on a more continuous basis.
In the context of this current discussion, that means that Penguin is about more than spidering old backlinks. Beyond just the backlinks, it's got to be about evaluating heuristic signals of some sort and evaluating patterns, perhaps in the web-graph, much the same as Panda might look, say, at patterns of content on a page or within a site.
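A toy sketch of what such a recursive process might look like structurally: each pass rescores pages using the previous pass's results, with a checkpoint between passes. The graph, starting scores, pass count, and damping factor are all invented:

```python
# Hypothetical 'recursive' refinement: repeated passes over the link
# graph, each feeding on the last pass's output, with a pause to check
# results between passes. Structure only, not Google's actual procedure.

def refine_spam_scores(graph, scores, passes=3, damping=0.5):
    """graph maps each page to the pages linking to it;
    scores are spam suspicions in [0, 1]."""
    for i in range(passes):
        new_scores = {}
        for page, inlinks in graph.items():
            if inlinks:
                # a page linked mostly from suspicious pages inherits suspicion
                inherited = sum(scores[src] for src in inlinks) / len(inlinks)
            else:
                inherited = 0.0
            new_scores[page] = (1 - damping) * scores[page] + damping * inherited
        scores = new_scores
        # checkpoint: gauge whether the pass behaved as anticipated
        rounded = {p: round(s, 2) for p, s in scores.items()}
        print(f"pass {i + 1}: {rounded}")
    return scores

graph = {"a": [], "b": ["a"], "c": ["a", "b"]}
refine_spam_scores(graph, {"a": 0.9, "b": 0.1, "c": 0.0})
```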
11:34 pm on Mar 17, 2016 (gmt 0)

robert_charlton (Forum Moderator, US)


PS: Speculation... some types of spam might also have "vector" signatures that might be identifiable, or suggest characteristics that might lead to the next stage in the algo. Each successful recursion in this type of algorithm would raise the bar a little higher.
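For illustration, a "vector signature" check could represent a link profile as a feature vector and measure its similarity to the profiles of known spam. The features and the 0.9 cutoff below are invented:

```python
# Sketch of a vector-signature comparison; feature choice and the
# similarity cutoff are hypothetical.
import math

def cosine(u, v):
    """Cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# invented features: [exact-match anchor %, footer-link %, link velocity]
known_spam_signature = [0.7, 0.5, 0.9]
site_profile = [0.65, 0.55, 0.85]

if cosine(site_profile, known_spam_signature) > 0.9:
    print("profile resembles a known spam signature")
```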
2:23 am on Mar 18, 2016 (gmt 0)

iamlost (Senior Member, CA)


What EXACTLY is the Penguin Algorithm?
Let me look deep into the Google black box; past the third, or is it the fourth, sequential event horizon, where but a few select acolytes chant weird and arcane strictures to a creature of purest yin and purest yang with a preference for oily piscine treats...
My apologies, but due to quite profound ignorance I can only offer my quite non-exact opinion :)

First, a prologue:
That after each update, of whatever sort, one can still find well-ranking egregious examples of what Google says it is targeting, while other sites that seem to check every box Google claims to want are nuked, is a puzzle... the first would indicate that thresholds are rather loose, the latter that they are too tight. The one possible glimmer of sense in that dichotomy is that the examples I've seen contrasted are most often in different niches/verticals; thus Google may well, in Penguin as it does elsewhere, be differentiating by market/search segment, some much stricter than others.

And now, my opinion:
I tend to believe that Penguin is targeting links (their component parts, velocity, neighbourhoods, position relative to niche/vertical graph, etc.) and what flows from those links.
Why?
Because if content is metaphorically king and Panda targets that, then Penguin, logically, should be targeting the metaphoric queen: links; and all who sail through her.

That Penguin is a serious computational load is seen in its infrequency. Robert Charlton's suggestion of a recursive algorithm seems sensible, especially if the target is links and all that they are (to the SE); if I'm correct then there is serious computational parallelisation (list all those link values that are thought to be considered as inputs), definitely requiring check stops for quality control.

That Google has tended to answer broadly ('follow the webmaster guidelines' et al, almost as a mantra) indicates that Penguin's inputs are fairly constrained, certainly more so than Panda's 'high quality site'.

Perhaps the most surprising aspect of Penguin is not the algorithm itself, nor even its effect, but the response of so many webdevs anguishing: well, I cleaned house not just once but n times and I still haven't recovered after n Penguin updates.
Ummm, you clear-cut your backlink profile, thereby removing some significant percentage (even unto most) of your externally attributed values... and you expect to still rank above competitors who retain theirs? Why?

The logical consequential overlap between the two P's is quite pronounced (a toy sketch follows this list):
* as Panda blots pages for various content flaws, the values available to flow from their external link-outs diminish, and pages downstream topple or tremble in turn.
* as Penguin blots links for various transgressions, those links stop flowing values from their page (the pipe is plugged/capped), and pages downstream topple or tremble in turn.
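A toy illustration of those two blotting effects, with invented page names and values: a Panda-blotted page passes nothing through its outbound links, and a Penguin-plugged link passes nothing regardless of its source:

```python
# Hypothetical value-flow model: all names, values, and the flow rule
# are invented for illustration.

def flow_value(pages, links):
    """pages: {name: (outgoing_value, panda_blotted)}
    links: list of (source, target, penguin_plugged) tuples."""
    received = {name: 0.0 for name in pages}
    for source, target, plugged in links:
        value, blotted = pages[source]
        if blotted or plugged:
            continue                      # the pipe is plugged: nothing flows
        outdegree = sum(1 for s, _, _ in links if s == source)
        received[target] += value / outdegree
    return received

pages = {"hub": (1.0, False), "spam_hub": (1.0, True), "mine": (0.2, False)}
links = [
    ("hub", "mine", False),       # healthy link: value flows downstream
    ("spam_hub", "mine", False),  # Panda-blotted source page: no flow
    ("hub", "mine", True),        # Penguin-plugged link: no flow
]
print(flow_value(pages, links))   # 'mine' receives only the healthy share
```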

Given that many who go for low quality (in Google's eyes) pages also tend to go for low quality links (auto-magical success) there is an inherent feedback loop between the two such that improving one alone may not be sufficient and that once both are 'corrected' the new or improved pages are laps behind those of unaffected competitors.

And so many webdevs are in serious denial on the quality of their content and back links. Most sites are junk, except for yours and mine and I'm beginning to have concerns about yours...
7:27 am on Mar 18, 2016 (gmt 0)

tangor (Senior Member, US)


Do we know how many Penguins have been run? Other than the above speculation, which I think is spot on and very much mirrors my own thoughts, do we actually know what Penguin does? What we do know is that G rolls out an animal name, says it does "something", and the result is a growing wail from webmaster upon webmaster of "I've been hit, what do I do to fix it?", and they don't know why (or won't admit it).

Personally I have not been "hit", and few of the sites I manage for others have suffered... then again no magic links, scam, spam or thin...

How many penguins have there been?
12:08 pm on Mar 18, 2016 (gmt 0)

Senior Member (GB)


Following what iamlost said about false positives, perhaps these could be because telling the difference between spam and certain types of viral is very hard indeed, even when you have access to the full data set.

I did a link audit using a popular tool that includes a 'suspect links' check, and a proportion of the links were flagged by this tool. They were link drops, but they weren't complete drive-bys from brand-new posters: people citing stuff in forums and blogs. However, the only reason I can be sure they weren't unnatural is because the business has never engaged anyone to do this sort of thing.

Google is (we would hope) more sophisticated than the tool I was using. Nevertheless, if I planned it carefully, I could certainly simulate this kind of activity - and the more businesses I was doing it for, the easier it would be for me to do it and the harder to tell why I was doing it.

Given that, the decision to penalise rather than ignore is one I've always struggled to understand.
12:38 pm on Mar 18, 2016 (gmt 0)

martinibuster (Moderator from US)


Some really great answers! Any more?

telling the difference between spam and certain types of viral is very hard indeed


I did a site review panel for SMX East last fall. A guy stood up and gave the details of his wildly successful viral campaign then asked why it didn't move the needle on rankings or sales. Certain kinds of viral campaigns simply don't have an effect. The links are effectively ignored. This, I think, says something about how Google is processing links. Ignoring over penalizing.

There have been a number of manual actions against certain link selling networks, but it doesn't seem like there have been all that many. Few and far between. That too might say something about how Google is processing links.
1:01 pm on Mar 18, 2016 (gmt 0)

Shepherd (Senior Member, US)


the decision to penalise rather than ignore is one I've always struggled to understand.


To ignore would not result in the desired serps.

What if there is no answer? Everyone is looking for the way forward, the way to get their site back to the top of the serps. What if google simply decided that your page being listed on the first page of the serps cost them too much money and that there is nothing you can do, no endorsement, not even a link from <insert your supreme being here> can cause google to list your page on the first page of the serps?

Here's a question: has anyone been affected by penguin for a keyword/search term that is NOT commercial?
2:35 pm on Mar 18, 2016 (gmt 0)

andy_langton (Forum Moderator, GB)


Ignoring over penalizing.


I suspect that Google ignores far more links than most people expect. That said, ignored links do not explain the Penguin effect, which is actively negative. You can see this easily in non-competitive areas where sites are penalised, but a new site can rank immediately with next to no links.

the decision to penalise rather than ignore is one I've always struggled to understand.


It ups the stakes for anyone considering aggressive SEO. If the risk was that it might not work, more people would go for a scattergun approach until they happened upon something that works. Making it a high-risk proposition will deter many.
4:26 pm on Mar 18, 2016 (gmt 0)

Wilburforce (Senior Member, GB)


has anyone been affected by penguin for a keyword/search term that is NOT commercial?


Yes.

I have a purely informational section on my (business) website that is well-regarded, and has unsolicited links from a wide variety of sources.

One particular page (mysite.com/technical-term.htm) was at the top of Google SERPs for <technical-term> (also for variants like <the technical-term> or <technical-terms>) prior to P1. The majority of the large number of unsolicited links - of which most were from forums - used one of the three variants as anchor-text.

The page also has a fairly high key-term density, as e.g. "Technical-term Values" is a more natural and obvious sub-heading than some contrived description that avoids using the term, or e.g. "Values". This could well affect any interaction there might be between Panda and Penguin.

When P1 hit, that page went from #1/#2 to somewhere below page 10. Singular and plural were - and still are - affected differently, but I don't know whether this is a function of the algorithm itself, of singular/plural proportions in anchor-text, or of some other factor.

I have taken no action at all about links that are quite properly using that or similar terms as anchor-text (I haven't disavowed them, or asked site owners to remove them), and keep an eye on it as a barometer. The singular is currently at #97, the plural at #47, and <the technical-term> at #27.

Other things have changed since P1, so I'm not saying that the page is still suffering as a result of Penguin alone, but it certainly did when it first rolled out.
7:01 pm on Mar 18, 2016 (gmt 0)

martinibuster (Moderator from US)


Wilburforce, what kind of sites are now in the SERPs for that non-commercial term?
7:05 pm on Mar 18, 2016 (gmt 0)

Shepherd (Senior Member, US)


I have a purely informational section
Not really what I asked: is <technical-term> a commercial keyword? When you search for <technical-term>, do any AdWords ads come up?
7:28 pm on Mar 18, 2016 (gmt 0)

Wilburforce (Senior Member, GB)


@martinibuster

Generally still informational, but there are now several product pages on page 1, and

@Shepherd

No ads come up (Google's Knowledge Graph is top of page, now followed by Wikipedia).
7:37 pm on Mar 18, 2016 (gmt 0)

Shepherd (Senior Member, US)


OK Wilburforce, that's interesting. Next question: are any of the sites listed on page one of the serps for <technical-term> also listed here: [gv.com...]
10:48 pm on Mar 18, 2016 (gmt 0)

Wilburforce (Senior Member, GB)


@Shepherd

No, none of those.
11:16 pm on Mar 18, 2016 (gmt 0)

Shepherd (Senior Member, US)


That's interesting, Wilburforce; it seems like an outlier from what I've seen. You've got everything except the commercial aspect: no AdWords, and not competing with a Google-backed company. Wikipedia #1 + infobox (I see this a lot with penguin-ized keywords). That's an odd one for sure. I'd be curious to see the search results.
1:00 am on Mar 19, 2016 (gmt 0)

martinibuster (Moderator from US)


Wilburforce, that doesn't sound like a Penguin issue. More like the web page isn't good enough or appropriate anymore. It's an informational query, and the best results are going to tend to be from informational sites that are comprehensive. No offense intended, but that page sounds like classic keyword flypaper created with search traffic in mind, not necessarily created to be a comprehensive resource. That's not a Penguin issue.

Back to Penguin
I don't believe traditional statistical analysis of anchor text percentages, link velocity, and other similar old-school signals has anything to do with Penguin. Have a great weekend! :)

[edited by: martinibuster at 1:29 am (utc) on Mar 19, 2016]

1:20 am on Mar 19, 2016 (gmt 0)

Senior Member


It [penalizing] ups the stakes for anyone considering aggressive SEO. If the risk was that it might not work, more people would go for a scattergun approach until they happened upon something that works. Making it a high-risk proposition will deter many.

As a bonus, it keeps the offending sites off the streets for a while. For Google and searchers, it's a win-win.
6:07 am on Mar 19, 2016 (gmt 0)

Senior Member


I doubt we will ever know "What EXACTLY is the Penguin Algorithm?" The mindset of most of our members seems to focus on finding a technical reason, explanation, or solution that we can use to solve the problem. IMO if you do that with Penguin you are looking in absolutely the wrong place. All that follows falls squarely in the speculation basket. No facts, just observations and "best guess interpretations", and it's hard to walk that thin line without overbalancing into Google bashing.

But here goes….

To understand Penguin you have to accept that Google's priority is profit and appeasing the demands of Wall Street. It is a company run by accountants, not search support teams. IMO Penguin was a blunt-force instrument intended to clear the playing field of categories of sites competing for money that Google was/is intent on dominating, e.g. shopping, travel etc. The wipe-out of affiliates was a precursor to this.

We have all seen the posts by site owners who laboured for years to develop top quality sites, enjoying deserved top rankings, and the traffic that flowed from that because Google considered those sites to be better than the competition. Then bang…. instant oblivion! So what happened?

Did a whole bunch of better sites suddenly appear? No.
Did the site owner stray from the Google Guidelines? No.
Was the site hacked? No.
Did some other major disaster befall thousands of quality sites all at the same time? No.
Did Google need a different set of SERPs? Yes…. enter Penguin, stage left!

Penguin was not an accident; its impact was intended and the reason was commercial. I can point at local holiday niches that, pre-Penguin, had many very competitive quality sites operated by some knowledgeable people…. not fools by any stretch of the imagination. Penguin took out every single site…. and despite the very best efforts of those competent operators, not a single one has ever returned. We have seen similar stories in all sorts of niches for the best part of four years. And how many sites recovered? You can count them on the fingers of a person with poor power saw skills.

If Google was not satisfied with the search results brought about by Penguin, they have had close on four years to fix things. They haven’t made any changes so presumably they achieved their objective and the arrival of the “authority sites” SERP domination was intentional.

As I said in the beginning, with a conversation such as this, it is hard to “tell it like it is” without Google bashing, but the real point I’m trying to get across is don’t blindly assume there is a technical solution to every problem. Sometimes you need to also follow the money and see where it leads.
6:54 am on Mar 19, 2016 (gmt 0)

martinibuster (Moderator from US)


No facts, just observations and “best guess interpretations” and it’s hard to walk that thin line without overbalancing into Google bashing.


You're right, there are a LOT of best guesses; I call them garbage SEO speculation. Even worse, there are those who promote ideas that are based on "reasonable" assumptions. But in the end both of those are just opinions pulled out of someone's rear end.

Here's an example of garbage SEO speculation. A few years ago the idea circulated that Facebook Likes might have an effect on ranking. Most SEOs countered that correlation is not causation. While that's true, there's a better way to cut through the B.S.

The TRUE measure of whether something is possible is to cite actual scientific research or patents showing that this method has actually been researched. Period.

If someone can cite research or patents, then (at the very least) the speculation is within the realm of possibilities. Everything else is baseless speculation, just air pulled out of someone's rear end.

The best speculation is based on known scientific research and patents. Those are two things that give you the authority to say that something is within the realm of possibilities.

To say that Google's algorithm is a black box is a cop out. The specifics of Google's algorithm can't be known. But the science of Information Retrieval, all the ways it can be done, that's 100% within your reach to know. That's why I say that the idea of the Black Box is an intellectual cop out and a myth. The scientific knowledge is publicly available. It's taught at universities.

There are many facts regarding Penguin and many algorithms that can be Penguin. No need for baseless speculation or guessing.

Penguin is like any problem that needs study. You can settle for the observation that the world is flat. Or you can take the initiative and find the facts. Penguin is exactly the same. Do not wait for the facts to be published on a blog. I gave a hint in my post previous to this one. That's all I'm saying in a public forum.
9:32 am on Mar 19, 2016 (gmt 0)

andy_langton (Forum Moderator, GB)


...don’t blindly assume there is a technical solution to every problem. Sometimes you need to also follow the money and see where it leads.


Even assuming that your theory is correct, Google still needs to implement this, and the way they implemented it is guaranteed to be technical in nature.

I don't agree with your theory, incidentally. Google was at serious risk of embarrassing search results dominated by anyone with a credit card and a "pay per post" budget. I've seen a handful of sites that appear to be genuine collateral damage, and many, many sites that (knowingly or otherwise) were gaming the system in an obvious way. I think you're right in a way, though - Google doesn't care about your "mom and pop" or small business site - because their users don't care either. Without a proper solution to the link problem, they went for a fairly drastic solution that favours "safe" sites.

I don't believe traditional statistical analysis of anchor text percentages, of link velocity, and other similar old school statistical analysis has anything to do with Penguin


I believe the confounding variable is the idea of site and page quality. A simplification of how I see the concept:

Anchor text to an authority site (from any site) => all sins forgiven = ranking

Anchor text to a mid-range site => depends who's linking

Anchor text to a low-quality site => negative impact in almost all cases

Of course, sites are not categorised into just those three, so everything is relative (a rough sketch follows). Assuming that having bad links lowers site quality, then a site can be "trapped", in that it's too low quality to receive any link value at all.
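A rough sketch of those three tiers in Python; the quality scores, tier boundaries, and effect sizes are invented (and real quality is of course continuous, not three buckets):

```python
# Hypothetical tiered valuation of an exact-match ('money') anchor link,
# keyed on the quality of the site receiving it. All numbers invented.

def money_anchor_effect(source_quality, target_quality):
    """Effect of a money-anchor link on the target's ranking."""
    if target_quality > 0.8:
        return +1.0     # authority target: all sins forgiven
    if target_quality < 0.3:
        return -1.0     # low-quality target: negative in almost all cases
    # mid-range target: depends who's linking
    return +0.5 if source_quality > 0.7 else -0.5

print(money_anchor_effect(0.9, 0.2))   # -1.0: low-quality site hurt regardless
print(money_anchor_effect(0.9, 0.5))   # +0.5: mid-range site, strong source
print(money_anchor_effect(0.2, 0.5))   # -0.5: mid-range site, weak source
```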
8:45 pm on Mar 19, 2016 (gmt 0)

martinibuster (Moderator from US)


So Wilburforce showed me his site, and I have to agree it's not a flypaper site. It's a high quality web page. However, it's also been determined why the site no longer ranks for the generic phrase. The short version is that the site is ranking where it should rank.

After reviewing his backlinks, reviewing the context of his site, then identifying what his site is relevant for, it's abundantly clear and beyond a doubt that the page still ranks for the phrase, but only when the query intent matches the topicality of his site. Thus, his site isn't penalized. The web page he was concerned about is simply ranking where it should rank.

For the generic phrase, there is no explicit user intent tied to it. Thus most of the presently ranking sites for the generic phrase are informational. Wilburforce agrees with this assessment.

The Wilburforce question is settled. Further discussion should probably be considered off topic. Let's please return to the topic of the thread, identifying what the Penguin algorithm is.

Thanks. ;)

[edited by: martinibuster at 9:29 pm (utc) on Mar 19, 2016]

9:26 pm on Mar 19, 2016 (gmt 0)

Shepherd (Senior Member, US)


Let's please return to the topic of the thread, identifying what the Penguin algorithm is.
So MB, just so I'm clear: it seems to me that the Wilburforce tangent was looking into the types of keywords/searches/results (transactional, informational, navigational) that are affected by penguin. Would that not be the base information we would need to understand what penguin is/how penguin works? Maybe I'm not on the same page in the book of what you're looking for.
9:33 pm on Mar 19, 2016 (gmt 0)

martinibuster (Moderator from US)


...was looking into the types of keywords/searches/results (transactional, informational, navigational) that are affected by penguin, would that not be the base information we would need to understand what penguin is/how penguin works?


Thanks for asking the question. I can understand how one can arrive at that question but the answer is no, not really, not at all. Wilburforce's issue had to do with user intent and how the algorithm modifies the ranking scores to match that user intent. That has nothing to do with Penguin. The Wilburforce tangent is tied to click log mining, machine learning and learning to rank algorithms that revolve around understanding user intent and showing results that satisfy the most users. That has nothing to do with link spam. Absolutely not Penguin in any way.
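As a loose illustration of how click-log-derived intent could reshuffle rankings independently of link spam: the page types, click shares, and scoring rule below are invented, not how Google actually does it.

```python
# Hypothetical intent re-ranking: aggregate click behaviour suggests which
# result types satisfy a query, and scores are weighted accordingly.

def intent_adjusted_score(base_score, page_type, click_share_by_type):
    """click_share_by_type: fraction of historical clicks for this query
    that went to each page type (as might be mined from click logs)."""
    return base_score * click_share_by_type.get(page_type, 0.0)

# For a generic informational query, informational pages dominate clicks,
# so a strong commercial page still lands below a weaker informational one.
click_share = {"informational": 0.8, "commercial": 0.2}
print(intent_adjusted_score(10.0, "commercial", click_share))     # 2.0
print(intent_adjusted_score(8.0, "informational", click_share))   # 6.4
```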
9:51 pm on Mar 19, 2016 (gmt 0)

Shepherd (Senior Member, US)


Yes, after review your opinion is that Wilburforce's page was not affected by penguin; got that. But we got to that tangent by my asking if anyone has seen a non-commercial (non-transactional) keyword/search affected by penguin. Wilburforce's specific situation aside, don't we need to know what type of keyword/search is affected by penguin in order to understand what it is?

It is my belief/opinion that only commercial keywords/searches are affected by penguin, I'm looking for evidence to the contrary.