Forum Moderators: Robert Charlton & goodroi

Message Too Old, No Replies

What EXACTLY is the Penguin Algorithm?

         

martinibuster

4:02 am on Mar 17, 2016 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



Read an article last month that asked a dozen "Internet Experts" for their opinions of what Penguin was. Many of the responses were clearly about on-page Panda issues.

Funny thing. Nobody discusses what the algorithmic foundations of Penguin are. Have you noticed? Nobody says it's link analysis and points to a patent. In fact, speculation about what the Penguin algorithm actually is seems to be missing entirely. So please, throw your two cents into this discussion. Three if you have it.

I have my ideas about what Penguin is. But I'm interested in yours.

(Note: Facts and speculation only. Jokes and complaints are Off Topic)


[edited by: Robert_Charlton at 7:51 pm (utc) on Mar 17, 2016]
[edit reason] Moved description line to body of post. [/edit]

EditorialGuy

10:16 pm on Mar 19, 2016 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



It is my belief/opinion that only commercial keywords/searches are affected by penguin

Even if that's true, couldn't the explanation be that commercial sites are the ones most likely to spend time and money on SEO (including link building)? The ROI from optimizing for "[commercial keyword]" is almost certainly a lot better than the ROI from optimizing for "St. Catherine of Siena" or "Australian marsupials."

martinibuster

10:20 pm on Mar 19, 2016 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



Thanks EG, that's exactly it.

Shepherd

10:59 pm on Mar 19, 2016 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Yes EG, absolutely, but it is a chicken and egg situation. So, if we want to analyze correctly we need to start at the beginning.

If we assume that all keyword/search types are affected by penguin and that turns out to be incorrect it would be difficult to come to a scientific conclusion as to what penguin is because the data would be flawed.

martinibuster

2:13 am on Mar 20, 2016 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



...don't we need to know what type of keyword/search is affected by penguin in order to understand what it is?


Penguin affects all links within the web graph. "Sectors" have never been a part of any link spam algorithm. Spammers don't care about the "sector" of a niche or topic. Every page, every TLD, every topic, everything is up for grabs when it comes to spam. Even .edu links are spammed out, thus it would not make sense to ignore links from the .edu sector or any other sector, whether that sector is a niche or an entire TLD. All link analysis algorithms have always focused on the entire link graph. So you see, there is no chicken and there is no egg. The entire barnyard is up for scrutiny. ;)

Don't take my word for it. I encourage you to research it by searching for "Link analysis" (in quotes). You will see that link analysis has nothing to do with "sectors." ;)

I'm rather surprised that the idea that algorithms specifically target things like sectors, affiliate links and buy buttons is still around. This is an old, old rumor that began at the very beginning of the search marketing industry, without foundation. For example, there used to be a rumor that Google was using Optical Character Recognition to read the word "BUY" written on the buy button, in order to target ecommerce sites. Never mind that the notion is ridiculous: the presence of shopping cart code and shopping content is a tip-off that a site is ecommerce, and employing OCR to confirm it is laughably redundant.

The fact is, there is simply no foundation for the idea that Google algorithmically targets sectors. There is no research, patent or any other foundation for that belief, only things like speeches that are taken out of context. Do a little research, read a few articles about at least five or more algorithms, if not the algorithms themselves, and you will see that there is no foundation for the belief that Google creates algorithms to target specific sectors.

Yes, certain link networks have been the targets of manual actions. But a manual action is not an algorithm.

So, it is understood that Penguin, like every previous link algorithm, is an algorithm that affects the entire link graph, measuring both the normal and the spam sites.

tangor

5:53 am on Mar 20, 2016 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



So, it is understood that Penguin, like every previous link algorithm, is an algorithm that affects the entire link graph, measuring both the normal and the spam sites.

I think we all agree on that, but to what specific purpose and to what extent? That's the part that remains elusive. If there is something that triggers Penguin, then if one does not "do that thing", theoretically one would not be injured. For all the known computer science and statistical know-how out there, there does remain a mystery to the workings of the "black box" (and to deny that is just as silly as relying on it).

That's why they are called "trade secrets". :)

Shepherd

11:49 am on Mar 20, 2016 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Penguin affects all links within the web graph. "Sectors" have never been a part of any link spam algorithm.


OK, first, I mentioned nothing about "sectors". I said search types (transactional, informational, navigational).

Now, Penguin "affects" links? Really? What does it do to them? In my view, Penguin is a filter that affects the search results. Data and scoring are gathered prior to a search; however, the filter is only applied after a search is made.

martinibuster

6:00 pm on Mar 20, 2016 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



The answer is still the same. Commercial keywords/searches are a part of the commerce sector. It makes no difference if I substitute "commercial keyword" or "commercial sector" for the word "sector"; it's still the same. The Penguin algorithm is a LINK algorithm, not a keyword algorithm. Link algorithms have always focused on the entire web graph, regardless of commercial intent.

Same quote, substituting terms as above...
Penguin affects all links within the web graph. "Commercial keywords" have never been a part of any link spam algorithm. Spammers don't care about the "commercial keywords/search" of a niche or topic. Every page, every TLD, every topic, everything is up for grabs when it comes to spam. Even .edu links are spammed out, thus it would not make sense to ignore links from informational .edu sites, sites with commercial intent or any other sector, whether that sector is a niche or an entire TLD. All link analysis algorithms have always focused on the entire link graph. So you see, there is no chicken and there is no egg. The entire barnyard is up for scrutiny. ;)

Don't take my word for it. I encourage you to research it by searching for "Link analysis" (in quotes). You will see that link analysis has nothing to do with "commercial keywords."

I'm rather surprised that the idea that algorithms specifically target things like commercial keywords, affiliate links and buy buttons is still around. This is an old, old rumor that began at the very beginning of the search marketing industry, without foundation. For example, there used to be a rumor that Google was using Optical Character Recognition to read the word "BUY" written on the buy button, in order to target ecommerce sites. Never mind that the notion is ridiculous: the presence of shopping cart code and shopping content is a tip-off that a site is ecommerce, and employing OCR to confirm it is laughably redundant.

The fact is, there is simply no foundation for the idea that Google algorithmically targets commercial keywords. There is no research, patent or any other foundation for that belief, only things like speeches that are taken out of context. Do a little research, read a few articles about at least five or more algorithms, if not the algorithms themselves, and you will see that there is no foundation for the belief that Google creates algorithms to target specific commercial keywords.

Yes, certain link networks have been the targets of manual actions. But a manual action is not an algorithm.

So, it is understood that Penguin, like every previous link algorithm, is an algorithm that affects the entire link graph, measuring both the normal and the spam sites.



[edited by: Robert_Charlton at 1:28 am (utc) on Mar 21, 2016]
[edit reason] Added quote formatting for clarity [/edit]

aristotle

7:55 pm on Mar 20, 2016 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



There are a number of ways for a site's backlink profile to look un-natural:

-- Most of the backlinks have the same anchor text.
-- Most of the backlinks come from general directories ( or some other particular type of site).
-- Nearly all of the backlinks point to the home page.
-- A high percentage of the backlinks are dofollow.
-- Nearly all of the backlinks were created in a short time period.
-- A site that only gets about 10 visitors per day has tens of thousands of backlinks.

I think that google especially likes sites that have backlinks pointing to a lot of different pages, rather than just to the home page. For example, if most of the individual articles are able to attract some backlinks of their own, then google might give the whole site an extra boost.
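Aristotle's checklist lends itself to a simple illustration. The sketch below (Python, with invented thresholds and field names; nothing here is Google's actual logic) shows how a few of those profile-level heuristics could be computed over a list of backlinks:

```python
from collections import Counter
from dataclasses import dataclass

@dataclass
class Backlink:
    anchor: str
    source_type: str   # e.g. "directory", "blog", "news"
    target_path: str   # path on the receiving site
    dofollow: bool
    created_day: int   # days since the first observed link

def profile_flags(links, threshold=0.8):
    """Return which simple 'unnatural profile' heuristics a set of links trips.
    Thresholds are made up for illustration only."""
    n = len(links)
    flags = {}
    # Most of the backlinks share the same anchor text
    top_anchor = Counter(l.anchor for l in links).most_common(1)[0][1]
    flags["same_anchor"] = top_anchor / n > threshold
    # Most of the backlinks come from one type of site (e.g. directories)
    top_source = Counter(l.source_type for l in links).most_common(1)[0][1]
    flags["one_source_type"] = top_source / n > threshold
    # Nearly all of the backlinks point at the home page
    home = sum(1 for l in links if l.target_path == "/")
    flags["homepage_heavy"] = home / n > threshold
    # Nearly all of the links appeared within a short window
    days = sorted(l.created_day for l in links)
    flags["burst"] = (days[-1] - days[0]) < 30
    return flags
```

As the replies below note, several of these heuristics also fire on perfectly legitimate fast-growing sites, which is one reason simple counts like these may correlate poorly with actual penalties.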

Andy Langton

9:04 pm on Mar 20, 2016 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



@aristotle, I think we could discuss some of these specifically. At least, I have some data on a number of them, and some opinions on others.

Most of the backlinks have the same anchor text

This doesn't correlate very well with Penguin, or with link penalties in general. A fast-growing site has these characteristics. I don't doubt that anchor text has a part to play, but I don't think frequency (or percentage of overall links) is a factor. I think anchor text from dubious pages or sites is a strong signal.

Most of the backlinks come from general directories ( or some other particular type of site).

This correlates pretty well with small businesses hit with Penguin. I tend to believe this is because these are the only sorts of links your average business owner looks to acquire. They're one Google search away from thinking this is a good SEO tactic. However, I don't believe Google has a "directory filter" in particular. I think it's an issue with sites that link out in abnormal patterns.

Nearly all of the backlinks point to the home page.

I would discount this one entirely, as it applies to an extremely large percentage of sites.

A high percentage of the backlinks are dofollow

Nofollow is a niche thing. A typical site has an overwhelming majority of dofollow links. I doubt this is a signal at all. I'm sceptical on a lot of what Google's reps put out there, but I believe nofollow is the same as robots-exclusion for links. I can't see that counting nofollow can be meaningful.

Nearly all of the backlinks were created in a short time period.

A spike from news mentions or social activity causes this, and there are many well-ranked sites with huge link spikes and little in the way of future acquisition. I don't see it as a factor.

A site that only gets about 10 visitors per day has tens of thousands of backlinks

The citation model pretty much demands that this can occur. Well-referenced does not mean well-read. If anyone remembers Google Trends for Websites, it was decent enough, but wildly inaccurate for most sites. I don't think Google knows how much traffic you get. Or at least, doesn't know it accurately enough to base fundamental ranking decisions on it.

I think Martinibuster had it right earlier in this thread when he dismissed the common "link stats" as a way to correlate with Penguin. Personally, I think it is because it's not just about the link - it's about the site and the linking page. I think links from bad sites and bad pages cause Penguin, not bad links.

aristotle

9:40 pm on Mar 20, 2016 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Andy Langton -- I read your post and understand your arguments, but didn't find them to be very convincing. I still believe that most of the points I made are valid.

EditorialGuy

10:01 pm on Mar 20, 2016 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Personally, I think it is because it's not just about the link - it's about the site and the linking page. I think links from bad sites and bad pages cause Penguin, not bad links.

The recipient site may come into play, too (as someone mentioned earlier in the context of "informational." "transactional," and "navigational").

It might be "natural" for a news site or blog to attract large numbers of links quickly, or for a reference site to attract large numbers of links over a period of time, but when an e-commerce site that isn't Amazon or Target gets a thousand links from out of nowhere (suddenly or otherwise), is that likely to fit a normal pattern?

Kelowna

3:33 pm on Mar 21, 2016 (gmt 0)

10+ Year Member Top Contributors Of The Month



I think sometimes people try to get too technical and confuse the simple issues. Nobody but Google employees knows all the details for sure, but this is my simple explanation of what the running of the Penguin is about.

The net is constantly bombarded with low-level content and links made solely to help (money) pages rank. These links help build up and rank pages all across the web, and there are many millions being built every day. It seems that the normal algo does not do a good enough job of cleaning them out, so the gorg runs the Penguin, as it is called, to clean all the crap links out of their data, and of course many sites see a drop from it.

So the penguin is the flushing of the toilet or taking out the trash when it comes to cleaning up the database(s). A fine tuned spam link cleaner of sorts.

tangor

8:20 am on Mar 22, 2016 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



So the penguin is the flushing of the toilet or taking out the trash when it comes to cleaning up the database(s). A fine tuned spam link cleaner of sorts.

I think there's a bit of truth in that observation. Fits many of the "visible" pieces to penguin.

martinibuster

12:05 pm on Mar 22, 2016 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



Andy Langton makes some great observations about what Penguin is as well as what Penguin is not. :)

...the penguin is the flushing of the toilet or taking out the trash...


That's your opinion of what you believe Penguin does. And while we may gain insight from observing what something does, that is not a description of what Penguin is. This discussion is about discussing what Penguin actually is.

This discussion is not about black boxes. It's not about how inscrutable Google's secret sauce is. Yet somehow this discussion keeps limping back to those excuses.

I remember when Update Florida happened, the entire SEO industry was at a loss for months to understand what was happening. Penguin hit us a few years ago and the industry is still at a loss to describe what Penguin is. Does anyone else find that curious?

In a way, this discussion is as much about the SEO industry's unwillingness to understand how search engines work as about what the Penguin algorithm actually is.

There was a blog post published a few weeks ago that asked so-called experts their opinion of what Penguin was. About half of the SEO "experts" (whom I'd never heard of) responded with answers describing the symptoms of Panda. What does one say about an industry that can't even distinguish between a link algorithm and an on-page quality algorithm?

The industry limps along on babbling about Facebook likes and clicking on SERPs. The SEO industry hobbles along on one crutch called "the black box" and another crutch called "Google's secret sauce."

Wilburforce

12:56 pm on Mar 22, 2016 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



And while we may gain insight from observing what something does, that is not a description of what Penguin is.


I'm not sure how useful it is trying to separate form from function here: what it is is code. What distinguishes it from other code is what it does, and for us what it does is its only observable characteristic.

If there is a useful way to describe it that doesn't address its function, perhaps you can give us an idea of what you mean.

martinibuster

1:32 pm on Mar 22, 2016 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



You are of course correct Wilburforce. I can't argue with that.

I was being diplomatic. Perhaps living in New England has gotten me into the habit of leaving things to be read between the lines. What I really meant to say is that descriptions such as "taking out the trash" are not particularly accurate.

Andy Langton's suggestion is intriguing and quite descriptive.

...it's not just about the link - it's about the site and the linking page. I think links from bad sites and bad pages cause Penguin, not bad links.

randle

1:53 pm on Mar 22, 2016 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



This is a great thread - Penguin and Panda clearly changed so much; the very look of the SERPs can be divided into before and after the introduction of these changes. It's tough to move up when you can't understand what pushed you down.

What exactly was "Penguin"? I'm not smart enough to know beyond what's been stated above: a resetting of how backlinks factor into rankings, and I do believe there is an anchor text element at play, as the downgrading of keyword-rich domains somewhat coincided with all this. (Speculation, I admit; no proof.)

Proving what you "think" is incredibly challenging, and really, when it comes down to it, if you can't prove it then you really can't be sure.

I will offer that in working through this, and in trying to embrace "proof", a good first step is eliminating all the noise and elements that "have to be connected", and patents are one thing I feel should be put right into the speculation bin. Google files about 10 patents per day, year after year after year (granted over 3,000 in 2015 alone, not counting ones they bought). They do this for lots of reasons, but a big one is just plain old disinformation: if your competitors, or the rabid SEO community, think something you're filing a patent on is important (it's gotta be it! I mean, they filed a patent, for Pete's sake), they will focus on that, when the real answers lie elsewhere.

Wilburforce

2:43 pm on Mar 22, 2016 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



patents are one thing I feel should be put right into the speculation bin


Patents are a complicated issue, but an important characteristic of them is that you cannot patent something that is already in use. You have to apply for the patent before putting it into use.

Google might file a lot of patent applications, but that doesn't mean they are doing anything unusual, or that any particular patent will (or will not) be put into use. Their entire business is founded on IP, so everything will have as much legal protection as they can possibly give it, and because the patent has to precede use I can say with a high degree of certainty that everything they are using today already has a patent in place.

Because there are so many of them, and because any function (including Penguin) may stem from more than one of them, there are probably more effective ways to locate the needle in this particular haystack than by trawling through their patent pile, but don't believe for one moment that it isn't in there.

Spiekerooger

10:47 pm on Mar 22, 2016 (gmt 0)

10+ Year Member Top Contributors Of The Month



As far as I know, first Panda and then Penguin came with machine learning, or even deep learning, entering the scene of information retrieval in the open web graph.

Several years ago I saw a lot of staffing service companies all over the world putting out ads for manual webpage raters (and by reading the job offer you knew it was Google behind the scenes). Such clickworkers are needed in machine learning to produce classified data for a training set, which is then used by deep learning algorithms as a blueprint to categorize a much larger set: the indexed web.

Now imagine that 65% of those up to 40,000 clickworkers downvoted some 500 sites (out of the 5,000 websites they had to rate).

A machine learning algorithm would try to find a set of many vectors that would help determine algorithmically the features those bad sites have in common. For Penguin they trained the algorithm on external vectors. Correlations found by the algorithm through feature extraction could be:

- domain/brand never mentioned without being linked.
- domain/brand never mentioned in books/newspapers
- no backlink or mention from site that is ranking highly in topic area of site
- high amount of backlinks are colored red (with regard to CSS). Such a vector could have a high correlation and could cause false positives in identifying bad links; as far as I understand machine learning (esp. unsupervised), you use statistics/correlation as a substitute for cause.
- high percentage of backlinks from sites without about/contact pages
- low amount of backlinks from websites bound to a business/institution/etc. in the "real" world
- high percentage of backlinks from topicless directories
- high percentage of links from domains registered by persons with the initial A in their last name (see red-colored links and false positives above)
- high percentage of links from outside the main content area of the linking page
- ...

So, as far as I understand it, Penguin is not trying to define some kind of link as bad; it tries to define what bad pages are, and to find bad pages through off-page correlation.

As was said here before: bad websites attract bad links. Penguin finds the off-page correlations of humanly classified bad websites through machine learning and applies them to the whole search index.
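The pipeline described here (human raters label a small sample of sites, a model learns which off-page features correlate with the "bad" label, and the model is then applied to the wider index) can be sketched in miniature. This is a toy logistic regression with invented feature names and data; it illustrates the supervised-learning idea only, not anything Google actually runs:

```python
import math

# Hypothetical off-page features, as fractions per site (all invented)
FEATURES = [
    "frac_links_from_topicless_directories",
    "frac_links_from_sites_without_contact_page",
    "frac_links_outside_main_content",
]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train(samples, labels, epochs=500, lr=0.5):
    """Plain logistic regression via per-sample gradient descent.
    samples: list of feature vectors; labels: 1 = rated bad, 0 = rated good."""
    w = [0.0] * len(FEATURES)
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
            err = p - y
            w = [wi - lr * err * xi for wi, xi in zip(w, x)]
            b -= lr * err
    return w, b

def score(w, b, x):
    """Probability that a site matches the human-rated 'bad' profile."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
```

Once trained on the rated sample, `score` can be run over every site in an index, which is the "apply the blueprint to a much larger set" step in the post above.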

Atomic

3:37 am on Mar 23, 2016 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



@Spiekerooger

Great post! Thank you.

Ebuzz

5:02 am on Mar 23, 2016 (gmt 0)

10+ Year Member Top Contributors Of The Month



Here's a question: has anyone been affected by penguin for a keyword/search term that is NOT commercial?


I have a site affected by Penguin. It is otherwise a site well regarded by Google. My content is stellar, unique, and mostly original. People like my site and even email me to tell me that.

Well guess what? After Penguin, ALL my commercial keyword rankings dropped. The penalized keywords all get demoted a few pages down from the first SERP page, and they are forbidden from ever going to the first page. The non commercial ones still rank well.

To me, this speaks volumes about Google's thinking on how it orders the SERPs. They have A LOT of data, they know what is commercial and what are worthless keywords.

[edited by: aakk9999 at 1:05 pm (utc) on Mar 24, 2016]
[edit reason] Please keep on topic, no Google bashing [/edit]

Wilburforce

8:02 am on Mar 23, 2016 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



The penalized keywords all get demoted a few pages down from the first SERP page, and they are forbidden from ever going to the first page.


What has taken their place?

Also (see martinibuster's earlier comments on how the algorithm modifies the ranking scores to match user intent), are they still ranking for searches that clarify user intent, or that contain other context-information (e.g. location)?

aristotle

12:35 pm on Mar 23, 2016 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Spiekerooger -- Excellent post. You make some really good points.

Regarding the idea that penguin doesn't like backlinks from "bad sites and bad pages", I'm not sure that "bad" is the right description. For example, pinterest is a fairly respectable website, at least to some eyes, but a lot of link-builders found it easy to get backlinks from it. (I believe that pinterest eventually started putting nofollow tags on outgoing links, but penguin could conceivably still count the ones that were originally dofollow when they were created.)

Spiekerooger

7:17 pm on Mar 23, 2016 (gmt 0)

10+ Year Member Top Contributors Of The Month



@aristotle:

That looks like a misunderstanding (probably caused by my not-so-good English skills).

I wasn't talking about bad pages as link givers but about bad pages as link targets. That is: an algorithm had a predefined set of websites deemed below quality and looked at the backlinks pointing to those sites. If all of them had a backlink from Pinterest (a 1.0 vector) but none of the high-quality sites had one (a 0.0 vector), having a backlink from Pinterest could be a factor in Penguin kicking in. If both had a vector of 0.5, having a backlink from Pinterest would be neutral or positive with regard to Penguin.
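The point about 1.0 / 0.0 / 0.5 vectors reduces to a difference in rates between the two labeled sets. A minimal sketch (Python, with hypothetical data; not any known implementation) of how discriminative any single link source is:

```python
def feature_rate(sites, feature):
    """Fraction of sites in the set exhibiting the feature
    (here, each site is modeled as a set of link-source names)."""
    return sum(1 for s in sites if feature in s) / len(sites)

def discrimination(bad_sites, good_sites, feature):
    """Positive -> feature correlates with the 'bad' label;
    near zero -> the feature is uninformative (the 0.5 vs 0.5 case)."""
    return feature_rate(bad_sites, feature) - feature_rate(good_sites, feature)
```

So a Pinterest backlink is only a negative signal if the rated-bad set has it far more often than the rated-good set; if both sets have it at similar rates, a learner ignores it.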

aristotle

9:07 pm on Mar 23, 2016 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Spiekerooger -- Actually the quote "bad sites and bad pages" came from earlier in the thread, and several members discussed it before I did. In any case, I agree with what you just wrote.

Andy Langton

9:59 pm on Mar 23, 2016 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



For example, pinterest is a fairly respectable website, at least to some eyes, but a lot of link-builders found it easy to get backlinks from it.


I'm not quite sure I follow. If we assume that Pinterest is a "good" site, the fact that anybody could create a profile and add a bunch of links doesn't make those links good links - quite the reverse. Social sites add nofollow to deter spam, not to prevent Google being manipulated.

Another example - a whole bunch of people have catalogued all of the .gov and .edu sites that use a (dofollow) script to redirect external links, with a page that says "you are now leaving example.gov" or something similar. Many of these scripts don't validate the links, so you can "create a link to your own site" just by adding a URL (e.g. example.gov/redirect-warning?url=yoursite.tld). Create a page linking to all of your new links, and, hey presto! dozens of links from government and educational websites. Link tools will love you! But these are obviously bad links. They don't become good because they're associated with a 'trustworthy' website.
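Incidentally, the redirect-script pattern is mechanically easy to spot, because the "link" is really just another URL stuffed into someone else's query string. A rough illustration (Python; the heuristic is an assumption for illustration, not a known Google check, and it assumes the embedded URL carries a scheme):

```python
from urllib.parse import urlparse, parse_qs

def looks_like_open_redirect(link_url):
    """True if the URL carries a full absolute URL in its query string,
    as in example.gov/redirect-warning?url=https://yoursite.tld"""
    query = parse_qs(urlparse(link_url).query)
    return any(value.startswith(("http://", "https://"))
               for values in query.values() for value in values)
```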

aristotle

10:54 pm on Mar 23, 2016 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



If we assume that Pinterest is a "good" site, the fact that anybody could create a profile and add a bunch of links doesn't make those links good links - quite the reverse.

Actually that's the point I was trying to make. The earlier discussions seemed to imply that you can only get bad backlinks from "bad sites and bad pages". I don't agree, and tried to use Pinterest as a counter example. I'm sorry for the confusion. Spiekerooger explained it better than I did.

Andy Langton

11:02 pm on Mar 23, 2016 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



The earlier discussions seemed to imply that you can only get bad backlinks from "bad sites and bad pages".


I'm not sure if it was one of my earlier posts that you're referring to, but certainly, I absolutely meant to imply that (although "only" is too far). But it may be a problem of definition.

For example, a sponsored post on a newspaper site is a bad page. A social profile purely for backlinks is a bad page. An "SEO directory" is a bad site, and probably all of its pages are bad. Of course, you can slip an anchor text link into a "good" page - perhaps that makes it a bad page or a bad link - or both - or neither, and Penguin-type algorithms won't notice at all. But from all the examples I've seen, the source of links is much more significant than the links themselves.

FranticFish

1:40 pm on Mar 24, 2016 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



the source of links is much more significant than the links themselves

Makes sense when you remember that before Penguin came along there were already algorithmic penalties concerned with abuse of anchor text. Penguin could have modified this: perhaps a 'good' page (or site) can become a 'bad' page if it crosses a threshold regarding anchor text in its OBLs.

I seem to remember reading that Penguin was effective at stopping sites that were built on 'link laundering' pyramids (an operation where the money site is at the top of a pyramid, linked to by sites that look fairly trustworthy, but which have links from less and less trustworthy sites at each level down, with more and more self-generated or even bot-placed links towards the bottom).

Maybe Penguin looked further back down the food chain than previous link scoring algorithms?
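One way to picture "looking further back down the food chain": rather than scoring only a site's direct backlinks, walk the graph of who links to the linkers and let distrust flow upward from known-spam leaves. The sketch below is purely illustrative (invented scoring and decay, hypothetical graph shape), not a description of Penguin's actual mechanics:

```python
def propagated_distrust(graph, spam_seeds, target, decay=0.5, max_depth=3):
    """graph maps each page to the set of pages linking to it (its backlinks).
    Sum a decayed penalty for every known-spam page reachable by walking
    backwards through the link graph from `target`."""
    distrust = 0.0
    frontier = {target}
    seen = {target}
    for depth in range(1, max_depth + 1):
        nxt = set()
        for page in frontier:
            for linker in graph.get(page, ()):
                if linker in seen:
                    continue
                seen.add(linker)
                if linker in spam_seeds:
                    distrust += decay ** depth  # deeper spam counts for less
                nxt.add(linker)
        frontier = nxt
    return distrust
```

On the pyramid FranticFish describes, a money site whose "trustworthy" linkers are themselves propped up by bot-placed links accumulates distrust at depth two and three, even though its direct link profile looks clean.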

aristotle

3:00 pm on Mar 24, 2016 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



FranticFish -- So the link-builder puts his or her site at the apex of the pyramid. But if penguin determines that the foundation is weak, the whole thing will collapse.