|Google Panda and thin content threshold|
I would like to get some opinions on what constitutes thin content.
I was hit by Panda and decided to code a script that basically pulls every thread on my forum and its word count.
I'm surprised to see nearly 4,000 out of 20,000 have 100 words or less.
That to me is a pretty high ratio of thin content.
What are your thoughts? Is 100 too low? Would a 200-word minimum be more in line?
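For anyone wanting to run the same kind of check, here is a minimal sketch of the word-count audit described above. It is not the original script: the `threads` mapping (thread id to the combined text of all its posts) and the sample data are hypothetical, and words are simply whitespace-separated tokens.

```python
def word_count(text):
    """Count whitespace-separated tokens in a piece of text."""
    return len(text.split())


def share_below(threads, threshold):
    """Fraction of threads whose total word count is at or below the threshold."""
    if not threads:
        return 0.0
    thin = sum(1 for body in threads.values() if word_count(body) <= threshold)
    return thin / len(threads)


# Hypothetical sample: thread id -> all post text in that thread, concatenated.
threads = {
    101: "Help",
    102: "Does anyone know how to fix this error on my site",
    103: " ".join(["word"] * 120),
    104: " ".join(["word"] * 45),
}

# Compare how the "thin" ratio moves as the cut-off changes.
for threshold in (40, 50, 100, 200):
    print(threshold, f"{share_below(threads, threshold):.0%}")
```

Trying several cut-offs side by side shows how sensitive the ratio is to the threshold you pick, which is really the question being asked here.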
I suspect the algo for measuring whether content is thin is a lot more sophisticated than just word count.
I'm sure from all the communication that "shallow" content (note - it's not "thin" content) is being measured quite differently than word count. However, there certainly is a chance that a forum post of few words might be seen as shallow.
For forum posts, I'd say 100 words is too high a threshold - go for a cut off of 40 words and see what that reveals. Also see if there are a lot of "fluff" posts on the forum.
Agreed, it's not going to be the only metric, but it's gotta be one. 0 words is as shallow as it gets, so it's in the sauce for sure.
If I go down to 50 the ratio drops considerably. This looks more accurate; a thread of 100 words versus one of 50 or fewer looks a lot different in the browser than as numbers in a table.
The posts below 50 words are no more than a few sentences, so not very 'deep'. They don't appear to be getting any traffic and are years old too, so I might as well make them vanish and see what happens.
You say it's not about words, but I skipped all the Pandas until I stupidly reduced forum posts per thread page from 25 to 10 to see if page speed increased. The unintended side effect was spreading my content 'thinner'. Five days later, after a deep crawl, Panda 2.4 was released and I got dumped.
Thanks for both your input.
I'll chime in here with a thought or two.
Panda wise, I've been clobbered, then beaten to a pulp by every new instance.
OK fine, I'm not especially surprised considering what I've read about "Shallow Content" here at WW.
Why? Because I made a deliberate choice years ago to publish mostly what appears to be the very definition of "Shallow Content".
Just the facts!
That means that most of my pages contain text that simply recites known facts, facts that are available on a lot of sites. Doesn't seem to matter how many facts or words are involved, just that they are available in a lot of places on the net.
Those facts only play a supporting role on the pages. They support the unique content on the page.
Unique content on my pages is in the form of images that I personally took at various events over the years. Doesn't seem to matter how many unique images are on a page. Seems the text is more important at the moment.
I could probably easily have come up with a few paragraphs of opinion about the subject of any of those images, but my opinion isn't what the site is about.
The result is that from a more technical point of view there isn't anything "unique" about my site, no "added value" so to speak.
I'm not especially happy with being Pandalized to a pulp. But it was a good ride up to that point and it's still doing ok due to Bing/hoo and direct traffic. And there's a distinct limit to how much I'm willing to change to accommodate G.
Here's my view......
Shallow Content = duplicate (from one source) OR rewritten (from one source) OR lacking any real detail on the subject.
In my vertical I see even small affiliate sites ranking top because they've pulled together a huge amount of detail on a specific subject which reflects a real specialist interest in that subject. All the information is available elsewhere on the web but their site is the only place you can find ALL that information in one place, presented well and in an easy to digest manner.
Usually they have written the content themselves based on the information gathered. Sometimes it's mostly duplicate content but from multiple sources.
The more successful ecommerce sites (despite selling the same thing as everyone else) are the ones with genuine customer reviews or information on the products that show they have hands on experience of those products. Some, however, are still doing okay without anything more than duplicate product descriptions because they are associated with huge brands.
Several of my sites were hit by Panda in April. Two were not - one was 100% duplicate content but from multiple sources, one was a 6 page site focusing on one specific topic in great depth based on my own personal experience (with a few affiliate links to multiple merchants).
Our most important ecommerce site has not yet recovered. We regard ourselves as specialists in our field but despite a lot of effort in content the site still doesn't reflect the level of in depth knowledge we've built up over 8 years, which is what we're working to address so customers and google can see we really know about the products we sell.
It's tough finding the time to put all that knowledge into writing and it's detracting from the day to day operation of the business, but we either have to spend money marketing ourselves in other ways or invest time in conveying our years of experience via our website if we want Google to continue to be our main source of business.
I don't think word count matters, it's what you're saying that is most important. What makes your site different or special generally?
It's very subjective and totally relative to the competition you're up against but if you think you have something special and your site reflects that I believe Google won't consider your content shallow.
Claaarky, that's one of the most insightful posts on Panda I've read.
|That means that most of my pages contain text that simply recites known facts, facts that are available on a lot of sites. Doesn't seem to matter how many facts or words are involved, just that they are available in a lot of places on the net. |
Those facts only play a supporting role on the pages. They support the unique content on the page.
Let's analyze this line of reasoning that comes from the Google people, to justify Panda.
First, this rule doesn't seem to apply to big brands. Surprising, right?
Second, if you were writing about "Frog habitats in planet Jupiter" you'd be #1 of course, Panda or no Panda. So if your site has info that is found nowhere else, Panda does nothing to you since there's no site to outrank you. Google forgot to mention this.
In real life though, BMW's latest car has four tires, Obama is the US President, Germany lost WWII, Google supports big brands stifling small businesses, the sun rises in the east, Canada is north of the USA and so on. Obviously more than one site will have the same facts or write about today's news.
Even if you sell jeans and place Levi 590XCSE on sale for an unheard of price of $15 for example, under that reasoning you'd be penalized. Why? All the words are found somewhere else and Google cannot tell the difference in price. You can argue that maybe you'll get links and this and that but we don't know the details, I'm talking content wise.
But since the major sites and the famous ones have been pretty much whitelisted for Panda, they can afford to throw all these nonsensical theories at us. They have zero pressure, since mostly small sites that don't have the money to be brands or as big brands as those liked by Panda, are hurt. Those sites also have very little voice in Twitter and media. If a tree falls in a forest and no one hears it...
Walkman, if lots of people were writing about "Frog habitats in planet Jupiter" you'd only be #1 if you were offering a more insightful view or a more granular level of detail about Jupiter, the frogs in question and the challenges those frogs faced getting to and inhabiting the planet.
If you used content someone else produced first without adding anything to it to show your greater knowledge and understanding of the issues frogs face on Jupiter, you'll probably be Panda'd once the search volume and number of relevant sites becomes significant enough for Google to regulate it.
Do you think Google has the ability (and will) to differentiate between shallow and good/organized info? I am skeptical.
It is stupid software.
Many sites with useful content having thousands of followers and top presence in social media sites found themselves nose diving these months. Just have a look at compete dot com.
Is there any logic to which of claaarky's sites (or anyone else's) were affected by Panda? Quality/uniqueness is good in theory.
They blindly go for - Competition plus brand name (Y/N).
[edited by: Zivush at 1:51 pm (utc) on Sep 22, 2011]
|Do you think Google has the ability (and will) to differentiate between shallow and good/organized content? I am skeptical. |
I'd normally be skeptical too - but Panda seems to have actually done this to some degree (fallout aside).
My guess is, they looked at some pages that they considered 'good' content, compared them to some pages they considered thin, and some of their language PhDs came up with some variables to test for. Then all they did was integrate those variables into the algorithm. We just haven't figured out what those variables are yet.
I'm confident it's possible though. Certainly any one of us can look at a page from the content farms and it's immediately obvious it was basically a regurgitation of a 5 minute drive by on the web. Why we can tell that by reading isn't clear to me - but if you can define why that is I think you'll be a lot closer to determining what factors are being used by Panda.
|Walkman, if lots of people were writing about "Frog habitats in planet Jupiter" you'd only be #1 if you were offering a more insightful view or a more granular level of detail about Jupiter, the frogs in question and the challenges those frogs faced getting to and inhabiting the planet. |
Let's grab an article from cdc.gov on how to treat an ear or a viral infection. Let's leave it as it is, but where it says "...and drink lots of water," we'll replace it with "don't drink water, but drink lots of bleach instead." Can Google, as it is right now, tell the difference, other than maybe sending a few keywords with 'bleach' in them?
|My guess is, they looked at some pages that they considered 'good' content, compared it to some pages they considered thin and some of their language PhD's came up with some variables to test for |
Panda has hurt even sites that have no content you can analyze. Think of product specifications for example: Height: 40.12cm etc, while not touching their competitors.
Their self-serving brand bia$ is the root of this. Let's see Macys and Wikipedia held to the same standards; Panda would not last a day due to the uproar. How many people sincerely believe that Google has not doctored/cooked Panda to allow some high-profile cases to escape it?
We too were heavily hit by the first version of Panda released in the UK in April. We are an ecommerce site (20K pages) with a wide and varied product base; many products have manufacturer-supplied descriptions and many have low-word-count descriptions.
We decided to 'noindex,follow' all the pages that had a low word count, and set about writing unique descriptions for the duplicates.
This is a massive task for us, and we have many months of work left.
Beginning to wonder whether it was such a good idea to No Index the ones with low word count, since it had zero effect.
Look at the traffic of Q/A sites. Most of their pages are thin content.
• Brands (Panda winners): askville.amazon.com, aolanswers.com, answers.yahoo.com
• Have no brand name (Panda losers): fluther.com, chacha.com, answers.com, answerbag.com, grupthink.com
Who are the people answering on Yahoo or AOL Answers and on Answerbag? Do those who write on Yahoo/AOL have PhDs?
What is the difference in quality of the answers? None.
I rest my case.
While Google's algorithm is rather impressive, I'm highly skeptical that it can objectively judge the completeness or correctness of a site's information about any given subject.
Going with the "frog habitats on Jupiter" theme, Google's algorithm is not an expert in planetary science. It can't determine if that section about frog habitats on your website about Jupiter makes your site a more complete and reliable source of information on Jupiter than other sites about Jupiter, or if it's just pure fantasy.
The only way it can make even a rough judgement along those lines is if it compared the information on your site with information it's gathered from other sites, and... whoops... see the problem?
Unique information isn't necessarily valuable information, and valuable information can not possibly stay unique (as in "existing in a single place"). That is, in my opinion, a major problem with one of the suggestions Google had made to avoid getting eaten by the Panda update.
My own experience (across many sites) pretty closely supports what claaarky is saying. Here's an example.
One of my ecommerce clients (the one I oversaw into a new platform last year) has soared with every Panda update. He's a brand, but he's not a big brand by a long shot. In fact his most direct competitor is the big brand in this niche, but only in catalog - we've been eating their lunch online for almost fifteen years.
This site does have a significant percentage of thin and shallow product pages (not a question of duplicate MFR content as much as a lot of specs and measurements.) But where that couldn't be helped, we built up the category pages, and added a lot of supporting CMS pages with text and videos and whitepapers to talk about applications and usage, maintenance, how to pick the right product, etc. In the cases where we have to go up against, for example, Amazon, for some products, we've built several related products into bundles and kits and all-in-one solutions that Amazon can't do. And we post *why* we've bundled products a certain way too, and we give the product line a branded name of some sort, and whip up a logo for it.
I don't pretend to know how Google does it, but I think Panda pays attention to this sort of thing. And I've been applying the same practices not only to other client sites but also to some of my own, and as far as I can tell, it works.
You don't have to be big, but you do have to look like you know what you're doing.
Brands get away with stuff the rest of us can't because they've already established themselves in the world by doing something significant (e.g. Amazon, AOL, Yahoo) and I think Google rightly in most cases assumes they'll put some serious resources behind anything they do to ensure their brand name remains unblemished.
If they don't look after their brand and let standards slide, people will eventually turn away from them which google will detect and then any sub-standard web presence associated with them will be more vulnerable to Panda. But big brands don't tend to let that happen.
Lots of real world businesses tap into the trust associated with a known brand (e.g. franchises) and that brand ensures the franchisee maintains their standards so customers continue to trust the brand.
The reality is that google is no longer the free for all it once was because there are too many people trying to cash in. It's back to the real world now - old rules apply. You need a USP and you need to constantly ensure you still have your USP because your competitors will be watching, copying and innovating in an attempt to overtake you.
The bar is now very high and a much more professional approach is needed to make serious money from Google. Once a single person could produce a website that could compete with the big boys who have thousands of staff and huge overheads. Is that fair? If that single person is doing something really different to the big boys then yes, but how likely is that.
I don't think everyone has to be a brand to make it in Google (not quite yet anyway!) but you definitely need to think more about what makes you different. If you're trying to cut corners Google will be able to tell. Now, if I ever say "there's no way Google can know that" I stop myself and think about why I'm saying that. It's usually because I'm trying to cut a corner. Google knows. Whatever you think they can't possibly know, they do. Sometimes I think they've bugged my office!
|This site does have a significant percentage of thin and shallow product pages (not a question of duplicate MFR content as much as a lot of specs and measurements.) But where that couldn't be helped, we built up the category pages, and added a lot of supporting CMS pages with text and videos and whitepapers to talk about applications and usage, maintenance, how to pick the right product, etc. In the cases where we have to go up against, for example, Amazon, for some products, we've built several related products into bundles and kits and all-in-one solutions that Amazon can't do. And we post *why* we've bundled products a certain way too, and we give the product line a branded name of some sort, and whip up a logo for it. |
Every single item we sell can be bought on Amazon, quite often more cheaply. The items we sell are all bought from Distributors (who are warranted to ONLY sell to trade), and the price we buy at can be had by any 2-bit reseller operation. The features and specs of every product are widely disseminated.
Yet we are successful, where most are not. We do re-write most copy text, but not from scratch. As stated, we do not compete on price. What we do have is real people offering real advice on the phone. Online, we have in depth advice, calculators, selectors and comparisons.
The key as I see it is differentiation. And not ad-stuffing. But mostly differentiation.
Oh, and when Google moves your traffic around, it helps if that traffic does not (statistically) have a higher satisfaction elsewhere. That's about long-clicks, in the main.
I don't think Google's algorithm is so smart. They couldn't even tell which is the scraper and which is the original. Recently, they created a tool asking people to report scrapers.
With the Panda hit, what Google is actually telling us is: just don't play in the big brands' yard, take the side street.
Find a niche that gets little to no attention and flourish.
If you want to make any difference, make it elsewhere.
|If they don't look after their brand and let standards slide, people will eventually turn away from them which google will detect and then any sub-standard web presence associated with them will be more vulnerable to Panda. But big brands don't tend to let that happen |
Brands are relative, in some niches it takes little to become one.
But to play devil's advocate, eHow was (is it still?) a brand, and so were Mahalo and Goldman Sachs. Now banks don't even care because "every bank does it, where are you going to go." Even Google was once trusted as being fair and beyond any suspicion of wrongdoing. Maybe 5 years from now even Amazon will get greedy and try to squeeze as much as possible [finance.yahoo.com...], not happy with making many times less than Google does now. The real truth is that they push their luck, little by little. There's also the degree of trust placed in a brand, especially since it becomes almost winner-takes-all and ruins small competitors.
|The reality is that google is no longer the free for all it once was because there are too many people trying to cash in. It's back to the real world now - old rules apply. |
Ding, ding, ding, and that's because it's beneficial to Google. See the G business section and anti-trust hearings.
I suspect user engagement metrics too, but they vary depending on the keyword Google sends you. To use them as a sitewide metric when the SEs can screw up is insanity. If Google sends users here after searching for "Walkman" and they meant to search for Sony's walkman, how is it Brett's fault? With small traffic sites this can make a huge difference, especially since the plum keywords are being sent to the top brands.
Not to mention that a certain site brought back from panda has an 80% bounce rate and very flimsy time on site.
Zivush, they have a tool for reporting spam, paid links, scrapers and anything else that people might do to game the system so if you are or were (pre-Panda) anything approaching a major player in your vertical, between the algo and the vast array of information and resources at google's disposal, they know how you operate.
If you have something to hide they will find it. I don't think it's possible to flourish in a niche that gets little attention simply to avoid having to produce something of real value. If a low quality site is too successful it'll get noticed and Panda'd eventually.
Well after developing my script some more it revealed some 1,000 pages of my forum had less than 30 words in the full thread and a whopping 4,000 had 2 or less words in the title. This equates to 25% of the total site.
Rewriting 5,000 pages was not an option. Google wasn't sending any traffic to them and, on browsing them, they do nothing for the user, so they have gone. Time will tell.
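The two checks described above (fewer than 30 words in the full thread, or 2 or fewer words in the title) can be sketched roughly like this. The `Thread` class, its field names and the sample threads are made up for illustration; a real forum would pull these from its own database.

```python
from dataclasses import dataclass


@dataclass
class Thread:
    title: str
    body: str  # all posts in the thread, concatenated (hypothetical layout)


def is_thin(thread, min_body_words=30, min_title_words=3):
    """Flag a thread with fewer than 30 body words or 2 or fewer title words."""
    body_words = len(thread.body.split())
    title_words = len(thread.title.split())
    return body_words < min_body_words or title_words < min_title_words


# Hypothetical sample threads.
sample = [
    Thread("Help", " ".join(["word"] * 200)),               # title too short
    Thread("How do I migrate phpBB", " ".join(["word"] * 29)),  # body too short
    Thread("How do I migrate phpBB", " ".join(["word"] * 30)),  # passes both
]

flagged = [t for t in sample if is_thin(t)]
print(f"{len(flagged)} of {len(sample)} threads flagged as thin")
```

Both limits are parameters, so the same check can be rerun with a different cut-off before anything is actually removed.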
dave_hybrid: you could just write a little script that checks whether the thread has received a reply yet, and if it hasn't then slap a noindex metatag on it.
that's what i do with my forum (phpbb). that would probably bring the number down to an acceptable level. and you wouldn't have to keep managing it.
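A rough sketch of that reply check, written in Python rather than phpBB's own PHP. The function name and the `reply_count` parameter are hypothetical, and using `noindex,follow` (rather than plain `noindex`) is an assumption.

```python
def robots_meta(reply_count):
    """Return a robots meta tag: noindex until the thread has at least one reply."""
    if reply_count < 1:
        # Assumption: keep 'follow' so links on the page can still be crawled.
        return '<meta name="robots" content="noindex,follow">'
    return '<meta name="robots" content="index,follow">'


print(robots_meta(0))  # unanswered thread
print(robots_meta(3))  # thread with replies
```

Because the tag is decided each time the page is rendered, a thread becomes indexable again as soon as the first reply lands, with nothing to manage by hand.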
I do that too already; 0-reply threads get noindex until the first reply is posted. It's a simple if conditional.
I do this mostly because a thread with no replies is not a good user experience and carries a high bounce rate, imo.
But that doesn't cater for a user posting a one-word thread and then a second user doing the same.
A 2-word/2-post thread bypasses the noindex trap, as does someone posting a thread with just the word 'Help' in the title, which is common.
These issues are rarer these days as I am wiser and edit bits and pieces, but that doesn't stop there being a bit of rot, imo, that was hiding in a dark corner of my site.
I bet most forum owners would be shocked if they ran the same script on their site. Panda aside, content like the above is a horrible user experience and wastes crawl bandwidth, server load, PageRank flow and loads of other stuff.
It's not like it's getting search referrals to warrant it.
You may want to check the other active thread on thin content. [webmasterworld.com...] There's new info that suggests a 500-word limit could be valid.
|You say it's not about words, but I skipped all the Pandas until I stupidly reduced forum posts per thread page from 25 to 10 to see if page speed increased. The unintended side effect was spreading my content 'thinner'. Five days later, after a deep crawl, Panda 2.4 was released and I got dumped. |
For my main site I have too many thin pages (less than 250 words), which I want to merge into pages with at least 500 words. My concern is that since my content was copied by other sites, if I merge pages, Google could compare them to existing content on older pages and conclude the other site(s) created it. Then I get hit with a dupe content/Panda penalty.