| This 386 message thread spans 13 pages |
|Matt Cutts and Amit Singhal Share Insider Detail on Panda Update|
| 10:54 pm on Mar 3, 2011 (gmt 0)|
Senior member g1smd pointed out this link in another thread - and it's a juicy one. The Panda That Hates Farms [wired.com]
Wired Magazine interviewed both Matt Cutts and Amit Singhal and in the process got some helpful insight into the Farm Update. I note that some of the speculation we've had at WebmasterWorld is confirmed:
Outside quality raters were involved at the beginning
|...we used our standard evaluation system that we've developed, where we basically sent out documents to outside testers. Then we asked the raters questions like: "Would you be comfortable giving this site your credit card? Would you be comfortable giving medicine prescribed by this site to your kids?" |
Excessive ads were part of the early definition
|There was an engineer who came up with a rigorous set of questions, everything from. "Do you consider this site to be authoritative? Would it be okay if this was in a magazine? Does this site have excessive ads?" |
The update is algorithmic, not manual
|...we actually came up with a classifier to say, okay, IRS or Wikipedia or New York Times is over on this side, and the low-quality sites are over on this side. And you can really see mathematical reasons. |
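The classifier idea in that quote can be sketched in miniature. Below is a toy nearest-centroid classifier in Python: rater-labeled example sites are averaged per label, and a new page gets the label of the closest centroid. The feature names (ad density, originality, authority) are invented purely for illustration; nothing here reflects Google's actual signals.

```python
def train_centroids(examples, labels):
    """Average the feature vectors for each label (nearest-centroid)."""
    sums, counts = {}, {}
    for vec, lab in zip(examples, labels):
        sums.setdefault(lab, [0.0] * len(vec))
        counts[lab] = counts.get(lab, 0) + 1
        for i, v in enumerate(vec):
            sums[lab][i] += v
    return {lab: [s / counts[lab] for s in sums[lab]] for lab in sums}

def classify(vec, centroids):
    """Assign the label whose centroid is closest (squared distance)."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(centroids, key=lambda lab: sq_dist(vec, centroids[lab]))

# Features per site: [ad_density, originality, authority] -- all invented.
examples = [
    [0.05, 0.9, 0.95],  # e.g. a reference site raters trusted
    [0.10, 0.8, 0.90],
    [0.60, 0.1, 0.05],  # e.g. a thin, ad-heavy page raters rejected
    [0.55, 0.2, 0.10],
]
labels = ["high", "high", "low", "low"]
centroids = train_centroids(examples, labels)
print(classify([0.07, 0.85, 0.90], centroids))  # -> high
```

The point of the toy version is only that human ratings become training labels, and the model then separates unseen documents "for mathematical reasons," as Singhal puts it.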
| 4:56 pm on Mar 29, 2011 (gmt 0)|
Hello Aaron, and welcome to the forums.
My take on it is that the algorithm is much more complex than a simple "doing X will hurt a page by N%." To share more than they have already done (by describing the training set in this interview) would give away not only the specifics about what is being measured right now (and that will evolve) but also exactly how their processes work.
Google already gave webmasters a lot more detail than they did with other major updates, and more than any other search engine ever has. Also, even though there is a focus on what can hurt a site, the document classifier algorithm is also designed to classify some sites as high quality or mixed quality.
|"we actually came up with a classifier to say, okay, IRS or Wikipedia or New York Times is over on this side, and the low-quality sites are over on this side. And you can really see mathematical reasons... |
| 6:10 pm on Mar 29, 2011 (gmt 0)|
|what would be the consequence if everyone knew Google's definition of quality? |
As a result of Panda, and the ambiguities concerning what is "low quality" and what isn't, publishers are (for the first time) being discouraged from filling their sites with low quality pages (because enough "bad" pages could sink the entire site).
Before Panda, the system was skewed in favor of quantity over quality. Many publishers concluded that the most profitable strategy was to publish as many pages as possible, without worrying about the quality of those pages.
Since low quality pages tend to be cheaper to produce than good quality pages, publishers responded to that incentive by cranking out millions of pages of garbage every day, in the hope that some of those pages would rank, or at least provide some economic benefit by enabling them to send more links to pages which would rank.
The system encouraged site owners to publish as many pages as possible, even if those pages were unreadable, or were virtual duplicates of existing pages.
The unintended consequence of its exclusive focus on "relevance" was that Google wasted resources trying to keep up with crawling, analyzing, and indexing millions of low quality pages that were produced specifically to "game" Google's system.
One predictable consequence if Google were to tell us exactly how they are defining "quality" is that the incentive to crank out millions of pages to game Google's system would resume, with the minor proviso that everyone would make sure their pages are just barely over the minimum quality threshold.
In other words, instead of creating an incentive to only publish "good" pages, incentives would be similar to what they have been for years -- publish as many pages as possible -- but subject to the constraint of just barely qualifying as "not low quality."
| 6:26 pm on Mar 29, 2011 (gmt 0)|
You write that before Panda the system rewarded quantity over quality. That simply isn't true; it's post-Panda that the system is rewarding quantity over quality. It's known as the social web.
| 6:55 pm on Mar 29, 2011 (gmt 0)|
|That simply isn't true, it's post Panda that the system is rewarding quantity over quality |
It does seem that scraped copies are increasing faster than ever (indicative of a 'quantity' goal: spider and index every duplicate out there, even though near-zero cache memory would suffice for a simple cache comparison to identify the original). As I have mentioned, I can search sentences from my homepage (written in January) and find over 11,000 scraped copies (it was 7,000 last week). This is very discouraging. I have rewritten portions of the page, but it won't be long until the lazy leeches scrape it up so Google can spider more junk.
| 7:04 pm on Mar 29, 2011 (gmt 0)|
I have found that when adding pages, redoing my dated XML sitemap and submitting it to Google as a new sitemap gets the pages spidered within minutes. It could be considered pinging, I guess, but this has been working for me for the last several years. If you're adding pages daily then your sitemap is changing daily, and IMO this is a good thing to be doing anyway.
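The sitemap-refresh approach described above is easy to automate. Here is a minimal sketch in Python, assuming all you need is a fresh `<lastmod>` date on each entry (the URL is a placeholder):

```python
# Regenerate a dated XML sitemap whenever pages are added, so new URLs
# carry a fresh <lastmod> and the file can be resubmitted to Google.
from datetime import date
from xml.sax.saxutils import escape

def build_sitemap(urls):
    today = date.today().isoformat()
    entries = "\n".join(
        f"  <url><loc>{escape(u)}</loc><lastmod>{today}</lastmod></url>"
        for u in urls
    )
    return (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n'
        f"{entries}\n</urlset>"
    )

print(build_sitemap(["https://www.example.com/new-page"]))
```

Writing the result to your sitemap path and resubmitting it in Webmaster Tools is the "ping" the poster describes.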
| 7:09 pm on Mar 29, 2011 (gmt 0)|
Do you see this for existing pages also or only for new pages?
| 7:16 pm on Mar 29, 2011 (gmt 0)|
I am really only interested in the new pages getting hit, but yes, it does help the older pages get re-dated as well. I will add this, and I don't have an answer as to why, but none of the sites I manage got hit with this change. I have been beating my brains out trying to figure out why, since we are copied by other sites, but for some unknown reason I don't have an answer right now. I do know I am building a new sitemap for the sites where we are adding new content, and reading through all the posts, this hasn't been brought up. I doubt this is the answer, but heck, when you're drawing straws you might as well get them all in the pot.
| 5:22 am on Mar 31, 2011 (gmt 0)|
Tedster, thanks for the welcome!
I understand that the algorithm is much more complicated than my simple example, and I don't expect Google to roll out every detail for our inspection. I just wish they'd be honest about the reason: that they don't want to give their competition any help. That's a perfectly good reason for keeping their details secret, and much more honest than pretending they do it to protect the web from unscrupulous SEOs.
Still, they could share some general pointers. If too high an ad/content ratio is a problem, they could say so. Instead, we have to read interviews where someone mentions offhand that they used focus groups where one of the questions asked was, "Do you think this site has too many ads?" and extrapolate from that to a theory that maybe there's a new ad ratio penalty. (And this despite the fact that AdSense is still sending us emails telling us to add more ad units.)
I'm not looking for hand-holding, just some general indication of what they want -- beyond "make quality sites with useful content" -- since they've set themselves up as the arbiters of what can and can't be done on a worthy web site.
As far as Farmer goes, my impression is that it's simply broken in some respects. Something went wrong, maybe an unexpected conflict between the Farmer update and the recent supposed anti-scraper update, that has messed up the ranking of original content over scraped copies, especially if the scraping site isn't as ad-heavy as the original. That's probably not what they intended, so I suppose there's hope they'll fix it.
| 5:44 am on Mar 31, 2011 (gmt 0)|
|If too high an ad/content ratio is a problem, they could say so. |
Glad you joined and I hope you don't think I'm picking on you too much, but this is where the complexity of Google and the 'situationality' (I make up so many more new words these days when I'm talking about Google it's not even funny ... could be defined as 'situational relativity' + 'situational application') of everything they do makes it so that would actually be an incorrect answer (<-- That's important) ... They can't accurately say, 'Oh, well a high ad to content ratio is bad...', because it's not in all situations...
The preceding is one of the things I think people totally miss, often ... It's ALL situational, and it's ALL relative, and imo it's entirely possible ad to content ratio is not high enough for some sites or pages.
To give a 'single answer' like you're wishing for or thinking they can do is not possible with the complexities of the system ... In a 'transaction' or 'shopping' query, a high 'ad to content' ratio would be expected, so it would be completely wrong to tell a shopping site, e.g. eBay, they need to add more content and have fewer 'pure sales' pages and pages of links to those sales pages ... People keep looking for some simple answer these days and with the complexity of Google, imo, all answers are site, and even page, specific...
Sorry if I'm sounding ranty, but if people want to understand Google better, imo, they have to understand what's right for my site(s) and page(s) is not necessarily right for all sites or pages, so there's no 'one size fits all answer' for everyone to focus on.
In some cases a high ad to content ratio may be bad, but in others it may be good ... A person doesn't look at FleaBay and RankiPedia the same way, and neither do they...
[edited by: TheMadScientist at 5:59 am (utc) on Mar 31, 2011]
| 5:48 am on Mar 31, 2011 (gmt 0)|
|One predictable consequence if Google were to tell us exactly how they are defining "quality" is that the incentive to crank out millions of pages to game Google's system would resume, with the minor proviso that everyone would make sure their pages are just barely over the minimum quality threshold. |
Granting that's true, you seem to assume that this "minimum quality threshold" would be "just barely better than crappy." Why? Why couldn't Google release a threshold definition that's "pretty darn good"? What if they said, "Look, if you expect a page to rank well for anything competitive, you'd better make sure it's got 500+ words (or equivalent of other media) of relevant, original, quality content; no more than 10% ad space above the fold; no annoying sales-pitch popups; no participation in shady link schemes that we can detect; and a design that doesn't send a significant number of searchers scurrying back to us in five seconds to click on the next site." If someone manages to crank out millions of pages that meet those requirements, isn't that a good thing?
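Purely as illustration, the hypothetical published threshold above could be expressed as a checklist. Every field name and limit below comes from the post itself and is invented for the example; none of it is a real Google criterion:

```python
def meets_threshold(page):
    """Check a page against the hypothetical published quality bar."""
    checks = [
        page["word_count"] >= 500,                # 500+ words of real content
        page["ad_fraction_above_fold"] <= 0.10,   # <=10% ad space above fold
        not page["has_salespitch_popup"],         # no annoying popups
        not page["in_link_scheme"],               # no detectable link schemes
        page["quick_bounce_rate"] < 0.5,          # searchers aren't fleeing
    ]
    return all(checks)
```

The poster's argument is that if millions of pages passed a gate like this, that would be a feature of the web, not a bug.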
| 5:55 am on Mar 31, 2011 (gmt 0)|
I think we were posting at the same time ... See above ... They won't because that would be an inaccurate answer...
| 6:31 am on Mar 31, 2011 (gmt 0)|
I think one key card that Google let us see this year is their concept of a "document classifier". It sounds to me like this all important step might occur early in the automated evaluation. Then depending on what assignment is made, very different metrics can apply. Just my current operating theory of course - but if it's true, then I'd love to know is what classifications are currently possible.
| 6:40 am on Mar 31, 2011 (gmt 0)|
|Just my current operating theory of course... |
My guess is you should run with it, and then some ... I think most people underestimate what they are doing and the way they are doing it, and 'get stuck' somewhere in the last decade when trying to figure out what's 'right' and what's 'not right' for a site.
| 9:02 pm on Mar 31, 2011 (gmt 0)|
I think it lets Google off the hook too easy to say the algorithm is too complex to draw any simple rules from it. Yes, it's very complex, but any algorithm ultimately boils down to a bunch of if/then type logic gates. There may be a huge number of them, and they may be interdependent on each other in many ways, but it can still be reduced to simple tests and results.
As an analogy, consider a rules-based spam filter like SpamAssassin. It's surely not as complex as Google's algorithm, but it has hundreds or thousands of rules against which each message is checked, some of them situationally dependent on each other. So it's true that you can't say, "Messages with these specific attributes will be blocked and messages with those won't," because there are too many different factors. But it is possible to say, "If your message is 90% URLs, there's a good chance it's going to get blocked. If your message uses the word 'v1agra' numerous times, it's probably going to get blocked." The fact that it's not possible to give simple, exacting rules doesn't mean you can't give general guidelines covering the clearer cases. And if they added a new rule that seriously downgraded messages with a particular attribute (or combination of attributes), they could say so.
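The SpamAssassin analogy can be made concrete with a toy rules-based scorer: each rule contributes points, and a message over a threshold gets blocked. The rules and weights here are invented for illustration, not SpamAssassin's actual ruleset:

```python
import re

# Each rule: (description, predicate, points). All illustrative.
RULES = [
    ("obfuscated drug name",
     lambda m: bool(re.search(r"v[i1]agra", m, re.I)), 3.0),
    ("five or more URLs",
     lambda m: len(re.findall(r"https?://\S+", m)) >= 5, 2.5),
    ("classic spam phrase",
     lambda m: "100% free" in m.lower(), 1.0),
]
THRESHOLD = 5.0

def score(message):
    """Sum the points of every rule the message trips."""
    return sum(pts for _, test, pts in RULES if test(message))

def is_blocked(message):
    return score(message) >= THRESHOLD
```

No single rule blocks a message on its own, which mirrors the poster's point: you can still publish general guidance ("lots of URLs plus drug spellings is risky") without disclosing the exact scoring.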
In Google's case, I'm certain that their engineers have a tool that lets them input a page's URL and a keyword, and see a rundown of exactly how the algorithm treats that page and what "scores" it gives it for that search, both negative and positive. (If they don't have such a tool, they're stupid.) If they wanted to, they could make that info available through Webmaster Tools. Not necessarily in great detail, but they could let you inspect one of your URLs and report, "this page has these two major strikes against it."
I've always been a fan of Google, and I still use several of their products. (Reader and Chrome are too good not to use.) But they're making me tired. I've got a client whose site got hammered by Farmer. To be honest, this site was pretty much MFA at one time, but they've put a lot of work into adding content and making everything relevant over the past several years -- much of it on specific instructions from Google techs, including adding more AdSense when they recommended it. Now they're asking me how to fix it, and when I go to Google's support forum, all I get from their volunteers and interviews with engineers is, "It's not us, it's you. You're obviously doing something spammy/black hat. And if you don't know what it is, that just shows how corrupt you are that you can't see the difference between spam and quality." It's getting very old.
| 7:28 pm on Apr 1, 2011 (gmt 0)|
|Granting that's true, you seem to assume that this "minimum quality threshold" would be "just barely better than crappy." Why? Why couldn't Google release a threshold definition that's "pretty darn good"? |
Google says it is attempting to detect and downrank "low quality" pages/sites. They've said nothing about below-average or high quality.
Consider what would happen if they were to set a higher standard, where any page or site that is below par, (where "par" is set at "pretty darn good") would be pushed down the SERPs: Many more sites would be affected and the tradeoff between quality and relevance would be much more severe. As well, the collateral damage from flaws in the system would be far more severe than anything we've seen with Panda.
Aside from that, there is also the problem that their system is probably a lot weaker than they would like it to be, and the more they tell us about their criteria for "quality" the more obvious the flaws would be. Can you imagine how bad the PR would be if they articulated a clear and unambiguous definition of poor quality? Everyone would start noticing all of the instances in which a poor quality page ranks highly, and a high quality page ranks poorly.
| 7:38 pm on Apr 1, 2011 (gmt 0)|
|If your message uses the word "v1agra" numerous times, it's probably going to get blocked." |
And as soon as you say that, what does a spammer who's any good at spamming do?
They immediately work around it.
|If they wanted to, they could make that info available through Webmaster Tools. |
It's something they would probably do if spammers weren't an issue, but unfortunately, spammers are in the mix with the non-spammers too, and the more information Google gives, the easier it is to game. In fact, my guess is your idea would probably backfire completely for the 'little guys': the 'little guys' who need 'more information to rank' are way behind the professionals who already do, so the more Google gives away, the further ahead the professionals get, simply based on time, interpretation of the information, and resources ... The more they give, the more they help the people who already know what they're doing, because then those people can micro-tune their approaches.
They give plenty of general guidelines on their webmaster guidelines page, imo, and more info helps the pros more, not the Mom & Pop sites ... Do you have any idea how much more focused on manipulation people could be if they didn't have to spend time researching approaches? Really, it currently takes resources and time to test ideas and methodologies, and the more of those that can be 'eliminated' or 'confirmed' through information given away, the more time there is to fine tune approaches.
| 8:54 pm on Apr 1, 2011 (gmt 0)|
|Google says it is attempting to detect and downrank "low quality" pages/sites. They've said nothing about below-average or high quality. |
I think they did. Here's a quote from the article that started this thread:
|...the key is, you also have your experience of the sorts of sites that are going to be adding value for users versus not adding value for users. And we actually came up with a classifier to say, okay, IRS or Wikipedia or New York Times is over on this side, and the low-quality sites are over on this side. And you can really see mathematical reasons |
| 8:57 pm on Apr 1, 2011 (gmt 0)|
People are posting some interesting examples of poor SERPS on the official Google/Panda forum. A handful of sites seem to be reappearing at the top for some queries.
| 8:40 pm on Apr 13, 2011 (gmt 0)|
|Google has been using visual page simulation for a while - their "reasonable surfer" model leans on it, and it has modified the way PageRank is calculated. |
Last year someone had a penalty reversed because an iframe generated a false positive for their "too much white space" metric. It was documented on Google's own forum, and JohnMu got involved to place a flag on the site, in case it ever triggered that penalty again.
I have been looking around for more information about this case, but haven't been able to find it. I'm sure I will eventually, but I wanted to ask you if this was only about visible white space, or whether messy HTML with a lot of breaks in the code (which looks like excessive whitespace in the html) or inflated html/text ratios can cause this too. I have read that the optimal html-to-text ratio is 42%, but I have a page hard hit by Panda sitting at 60% (the highest on my site). That can't be good. Not saying that's the cause of its Pandalization, but it's definitely worth cleaning up. I do see A LOT of messy code on my site, whereas some well-ranking sites (including another one of mine) have cleaner code with fewer line breaks in the html.
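As an aside, measuring the markup share of a page is straightforward with the Python standard library. The "42%" figure is SEO folklore, not a confirmed Google signal; this sketch just strips tags and reports what fraction of the page is markup rather than visible text:

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect only the visible text nodes from an HTML document."""
    def __init__(self):
        super().__init__()
        self.parts = []
    def handle_data(self, data):
        self.parts.append(data)

def markup_ratio(html):
    """Fraction of the document that is markup rather than visible text."""
    extractor = TextExtractor()
    extractor.feed(html)
    text = "".join(extractor.parts)
    return 1 - len(text) / len(html) if html else 0.0

print(round(markup_ratio("<p>hello</p>"), 2))  # -> 0.58
```

A real cleanup pass would also account for scripts, styles, and comments, but even this rough number makes before/after comparisons easy.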
| 9:01 pm on Apr 13, 2011 (gmt 0)|
The details are not public - at least not down to the kind of detail it would take to answer definitively. However, there's no doubt in my mind that HTML tangles could easily wreak havoc with any visual simulation model.
| 9:25 pm on Apr 13, 2011 (gmt 0)|
|The details are not public |
Oh ok, thought maybe it was something on Google Webmaster Central. Thanks for the reply. I am going to jump on this html cleanup today because I had no idea how messy it had become.
| 9:29 pm on Apr 13, 2011 (gmt 0)|
It was on Webmaster Central - but all we know is that the iframe triggered a false positive in Google's page layout algo and JohnMu had to put it on the exception list. We don't know "why" it triggered the false positive.
| 9:38 pm on Apr 13, 2011 (gmt 0)|
|It was on Webmaster Central - but all we know is that the iframe triggered a false positive |
Gotcha. Yeah, I can't find it on GWC, but it doesn't matter if there aren't a lot of public details. I think you're right about the html tangles and my 60% html/text ratio on a hard-hit page needs to be cleaned up immediately. We'll see if I can get a boost. The content on the page is good otherwise.
| 1:12 am on Apr 14, 2011 (gmt 0)|
For me, the big mystery is why there are so few reports of any recoveries to date - even partial recoveries. Many webmasters are trying all the suggested repairs; you'd expect to see something move for some of them.
| 1:19 am on Apr 14, 2011 (gmt 0)|
My opinion is Panda set a new ground zero for any site [domain name]. Page by page improvements can improve rankings, but up to this date there is no global fix.
| 1:44 am on Apr 14, 2011 (gmt 0)|
Has anybody tried ripping back a large 50,000-page site to just a few good quality pages?
Surely this has got to work, especially on domains that are old and have been trusted in the past.