
Panda 4.2 Used 18 Months of Data? Is This Why So Few Recovered?

     
5:50 pm on Oct 1, 2015 (gmt 0)

New User

joined:Aug 20, 2015
posts: 8
votes: 5


Just heard that one of the reasons Panda 4.2 did not produce any notable "winners" or "losers" like it normally does is that Google used 18 months of historical data, not fresh data, in the update. What this means is that if your site got hit by Panda and you made the expected changes, you are unlikely to see a full (or potentially even a partial) recovery.

Google employees are not to speak about this externally.

Anyone get hit by 4.1 and expect a full recovery but didn't see it?
8:38 pm on Oct 1, 2015 (gmt 0)

Senior Member from GB 

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month

joined:Apr 30, 2008
posts:2630
votes: 191


How reliable is your source on Panda 4.2 using the last 18 months of data?

Thinking about it, it does not make sense to me. Are you saying that the data snapshot is 18 months old, or that the data is cumulative data from the last 18 months?

Cumulative data would be odd, as it means there would be multiple versions of the same URL with different content as it changed historically, including different robots directives and different redirects. I cannot imagine how this would all be evaluated together and then a single page scored.

The only thing I can think of for cumulative data is that if the page does not exist any more, they looked back 18 months to find the last point at which it did exist. Either that or a snapshot. Anything else does not really make sense.
8:47 pm on Oct 1, 2015 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member tangor is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Nov 29, 2005
posts:9038
votes: 752


I would question the source of this statement. I can't imagine G using anything less than CURRENT data for any decisions (which would include all previous archived data, too). Just does not make sense. In any event, Panda does not produce winners and losers, merely sites accepted via compliance.

The moment G becomes involved in picking winners and losers is the day they've made a stupid business decision!
9:03 pm on Oct 1, 2015 (gmt 0)

Junior Member

Top Contributors Of The Month

joined:Mar 24, 2014
posts:119
votes: 6


I doubt it. That would include data from at least two other Panda refreshes.
9:30 pm on Oct 1, 2015 (gmt 0)

New User

joined:Aug 20, 2015
posts: 8
votes: 5


To clarify, my understanding is that 4.2 used cumulative data extending back 18 months from the refresh date, not data that was 18 months old at the refresh. @aakk9999 - not sure I fully understand your thinking on the URLs. Panda is a site-wide, not a page-specific, classifier, which looks cumulatively at your site(s) "quality" across a large sample of urls. Google has always looked at data over a period of time to make sure it has enough data to make an accurate classification - a one-day snapshot, for example, would not be enough data to decide on a penalty, especially for a smaller site. And whether the window is 1 week or 1 month, the issue of changing URLs and redirects is something Google has always had to deal with.

I do not know how long a time window Google normally uses, but given that historically sites that were hit could be released in the next update, presumably it is pretty short (a site I worked on was hit in 2012 and recovered 90 days later). This update was different.
12:51 am on Oct 2, 2015 (gmt 0)

Senior Member

WebmasterWorld Senior Member aristotle is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Aug 4, 2008
posts:3524
votes: 324


This possibility reminds me of the "moving average" commonly shown on stock market charts. It averages a stock's price over some pre-defined preceding period, such as six months, up to the current day, advancing by one day per day. It smooths out the day-to-day jitters in the price and is often a good indicator of the long-term trend, but it tends to lag behind the most recent movements. It can't predict future movements, only show past trends, although it is well known that long-term trends tend to persist in general (but don't bet your life savings that this will happen in any particular case). More sophisticated methods give more weight in the calculations to recent prices, but the same basic concepts apply.

So using 18 months of data would bring out the underlying long-term trends in the evolution of the site, and give less importance to the most recent changes.
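
To make the analogy concrete, here is a toy sketch in Python (made-up prices, nothing Google has confirmed) of how a trailing average lags a recent change, while a recency-weighted variant catches up faster:

```
# Illustrative only: a toy moving average over daily prices, to show the lag
# described above. All numbers are made up.

def simple_moving_average(prices, window):
    """Trailing average of the last `window` prices at each day."""
    return [sum(prices[max(0, i - window + 1):i + 1]) /
            len(prices[max(0, i - window + 1):i + 1])
            for i in range(len(prices))]

def exponential_moving_average(prices, alpha=0.2):
    """Variant giving more weight to recent prices."""
    ema = [prices[0]]
    for p in prices[1:]:
        ema.append(alpha * p + (1 - alpha) * ema[-1])
    return ema

prices = [100] * 30 + [150] * 10               # flat, then a sharp recent rise
print(simple_moving_average(prices, 30)[-1])   # ~116.7: still dominated by the old level
print(exponential_moving_average(prices)[-1])  # ~144.6: catches up much faster
```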
8:52 am on Oct 2, 2015 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Dec 19, 2003
posts:865
votes: 3


Sounds to me like they are trying to re-build something that went horribly wrong!
3:12 pm on Oct 2, 2015 (gmt 0)

Senior Member from GB 

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month

joined:Apr 30, 2008
posts:2630
votes: 191


Panda is a site-wide, not a page-specific, classifier, which looks cumulatively at your site(s) "quality" across a large sample of urls.

Yes, I know that, but if you are looking at an 18-month window, then the same page would have a number of instances, with the content changing over that time. The question is: which instance would Panda use? Or would it do something along the lines of the "moving averages" aristotle described?

E.g. I have a page (the same URL) that has content "A", which improved to content "AB", which then improved to content "ABCD".
What would Panda use? Treat this as four separate pages? Average it out? Or take the last one, "ABCD"? What if that page then gets deleted and does not exist any more (returns 404)? Will it still be scored? The last instance of it that existed within the 18-month window? Or all instances of it that existed within the 18-month window?
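
Just to frame those alternatives, a purely hypothetical Python sketch (the function, the scores and the structure are all invented for illustration, not a claim about how Panda actually works):

```
# Hypothetical ways an 18-month window could treat multiple snapshots of one URL.
# 'snapshots' is an invented structure: quality scores for "A", "AB", "ABCD" over time.

def score_url_history(snapshots, mode="latest"):
    """snapshots: list of (timestamp, quality_score) within the window, oldest first."""
    scores = [q for _, q in snapshots]
    if mode == "latest":            # take the last instance seen in the window
        return scores[-1]
    if mode == "average":           # average every instance in the window
        return sum(scores) / len(scores)
    if mode == "separate":          # treat each instance as its own "page"
        return scores               # caller would feed each into the site aggregate
    raise ValueError(mode)

history = [("2014-05", 0.2), ("2014-11", 0.4), ("2015-06", 0.9)]  # "A" -> "AB" -> "ABCD"
print(score_url_history(history, "latest"))    # 0.9
print(score_url_history(history, "average"))   # 0.5
print(score_url_history(history, "separate"))  # [0.2, 0.4, 0.9]
```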
4:06 pm on Oct 2, 2015 (gmt 0)

New User

joined:Aug 20, 2015
posts: 8
votes: 5


@aakk9999 I don't think Panda looks at individual page scores. I think it's looking at overall site quality metrics (e.g. bounce rates, % duplicate content, whether people skip your site in search results, etc.) across all pages, and then gives your site(s) an aggregate score. I have recovered sites by taking a lot of their thin/long-tail pages out of the index and keeping just their highest-quality pages in. This helps the overall site score.
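
As a rough illustration of that idea, here is a Python sketch. Every metric name, weight and threshold below is invented; this is not Panda's actual formula, just the general shape of "aggregate score from site-wide metrics":

```
# Invented example of a site-level aggregate; the metrics, weights and numbers
# are all made up for illustration.

def site_quality_score(pages):
    """pages: list of dicts with per-page engagement/content metrics."""
    n = len(pages)
    avg_bounce = sum(p["bounce_rate"] for p in pages) / n
    pct_thin   = sum(p["is_thin"] for p in pages) / n
    avg_dwell  = sum(p["dwell_seconds"] for p in pages) / n
    # Higher is better; penalise bounces and thin pages, reward dwell time.
    return (1 - avg_bounce) * 0.4 + (1 - pct_thin) * 0.4 + min(avg_dwell / 120, 1) * 0.2

pages = (
    [{"bounce_rate": 0.9, "is_thin": True,  "dwell_seconds": 10}] * 400   # thin/long-tail pages
  + [{"bounce_rate": 0.4, "is_thin": False, "dwell_seconds": 90}] * 100   # strong pages
)
print(site_quality_score(pages))        # ~0.20: thin pages drag the aggregate down
print(site_quality_score(pages[400:]))  # ~0.79: same site with the thin pages removed
```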
6:43 pm on Oct 2, 2015 (gmt 0)

Senior Member from GB 

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month

joined:Apr 30, 2008
posts:2630
votes: 191


My impression was that yes, it is overall site quality metrics, but somehow it must be made of individual page scores. I am not saying what is being scored, just that pages have to be looked at (crawled, processed, their user engagement examined, etc.).

Otherwise how would noindexing many thin pages, or adding content to thin pages, improve the overall site? The only way is to somehow look at what is on the page. Even to detect duplicate content, you need to look at what is on the page.

Or are you saying that if you had 500 thin pages 12 months ago that had bad user metrics, and you added really good content to these 500 pages making them very useful to visitors, which then greatly improved the user metrics, the average is calculated across all 18 months and the old figures will hold the new, improved figures down?
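
A quick sketch of exactly that scenario, with purely hypothetical numbers (whether Panda averages this way at all is precisely what is being speculated about here):

```
# Hypothetical: a monthly engagement score for the same 500 pages across an
# 18-month window. The first 12 months are pre-fix, the last 6 post-fix.

pre_fix  = [0.25] * 12   # thin pages, bad user metrics
post_fix = [0.85] * 6    # same URLs after good content was added

window = pre_fix + post_fix

window_average = sum(window) / len(window)
current_only   = sum(post_fix) / len(post_fix)

print(round(window_average, 2))  # 0.45 - old figures hold the improved figures down
print(round(current_only, 2))    # 0.85 - what a fresh snapshot would see
```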
7:24 pm on Oct 2, 2015 (gmt 0)

Senior Member

WebmasterWorld Senior Member aristotle is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Aug 4, 2008
posts:3524
votes: 324


My impression was that yes, it is overall site quality metrics, but somehow it must be made of individual page scores

I agree with Sandra -- the quality of the individual pages determines the overall quality of the site.

As for the supposed use of 18 months of data, if it is true, it would allow Google to take account of a site's history. For example, a site that repeatedly undergoes drastic changes in its content would likely be considered less trustworthy than a site that is built slowly and gradually and always stays true to its original purpose and theme.
9:32 pm on Oct 5, 2015 (gmt 0)

New User

joined:Aug 20, 2015
posts: 8
votes: 5


@aakk9999 Yes, that is what I think is happening. I think with classifiers like Panda, it is based on an overall site score over some period of time. So if you had 500 thin pages, you would be in the penalty. If you added good-quality content to those, and Google's algo agreed they were good, then you would be out (at least in prior releases).

In the current one, though, they seem to be weighting the 12 months of data from before you added content to those pages into their calculation. So even if you fixed the issue, you may not be out.
9:46 pm on Oct 5, 2015 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member 5+ Year Member Top Contributors Of The Month

joined:Oct 5, 2012
posts:911
votes: 171


Not commenting on the validity of the information leading to this topic, but if I wanted to see how an algo update compares to a previous version of the algo, I would have the new algo evaluate the same data as the previous algo.

*Might* lay a foundation as to why this roll-out is taking so long: first make sure the new algo is what you want, and then feed it new data, idk?
12:21 am on Oct 6, 2015 (gmt 0)

Senior Member from GB 

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month

joined:Apr 30, 2008
posts:2630
votes: 191


In the current one, though, they seem to be weighting the 12 months of data from before you added content to those pages into their calculation. So even if you fixed the issue, you may not be out.

@Globetrosky, interesting! This could also happen if, for example, the site's user engagement metrics are given greater weighting and the last 18 months are used for those engagement metrics. Then, although each page would have to be scored as I theorised above, the historical data would impact the overall score.

In that case, even if a bunch of pages are deleted from the site in order to escape Panda, and even if Googlebot crawled them and saw a 410 Gone response, the historical user metrics from the deleted pages would still count towards the overall site score. And if overall user engagement is looked at, then noindexed pages would also contribute to things like pages per session, time on page, etc., despite not being shown in the SERPs.
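
A minimal Python sketch of that point, again with invented data and an invented aggregation (speculation only): engagement from pages that have since been deleted or noindexed still sits inside the window, so it still moves the site-wide figure.

```
# Hypothetical: 18 months of per-URL engagement, including a page that was
# deleted (410) nine months ago and a page that is noindexed. Their historical
# visits still sit inside the window, so they still affect the site average.

page_history = {
    "/good-guide":    {"months_live": 18, "avg_time_on_page": 180, "deleted": False},
    "/thin-page-1":   {"months_live": 9,  "avg_time_on_page": 12,  "deleted": True},   # 410 since month 9
    "/noindexed-tag": {"months_live": 18, "avg_time_on_page": 15,  "deleted": False},  # noindex, still visited
}

# Weight each page by how many months of the window it contributed data for.
total_months = sum(p["months_live"] for p in page_history.values())
site_avg = sum(p["avg_time_on_page"] * p["months_live"]
               for p in page_history.values()) / total_months

print(round(site_avg, 1))  # ~80.4s: the deleted and noindexed pages still pull the average down
```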

At this stage we can only speculate on this, but it would be interesting to hear more from site owners who are either unable to recover, or even from ones that recovered in the previous Panda and then, without any further (major) changes to the site, got Pandalised again (to a degree) - which is what would happen if the historical data weighed the site back down.
3:45 pm on Oct 21, 2015 (gmt 0)

Preferred Member

5+ Year Member

joined:Jan 10, 2012
posts:488
votes: 29


Where is this information coming from?
6:03 am on Oct 22, 2015 (gmt 0)

Junior Member

Top Contributors Of The Month

joined:Oct 3, 2015
posts:132
votes: 64


Globetrosky

I'd guess that's a shell account that sucked people into responding.
Good analysis, people.
6:45 am on Oct 22, 2015 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member tangor is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Nov 29, 2005
posts:9038
votes: 752


I'm getting a chuckle out of this one, kiddies. I generally check g commentary on Bing and guess what: the only place you can find "Panda" and "18 months" together is WebmasterWorld. Not enough source info for this thread to have any value, though what has been said is good fun and great speculation.

Else, the rest of the web (including Bing) would have something to say on page 1.

I can buy into the Panda crawl coming every 18 months; after all, it is expensive, even for g, to crawl the world wide mess. What Panda truly does, and how to recover from it, is still a minor mystery. As far as I know there are no true success stories of recovery out there, only corrections for things that never should have been done in the first place. (And it probably dropped 70 billion thin pages from the crawl and disappeared just as many MFA sites, too.)

But, kiddies, any UPDATE will be a final result of SOMETHING, even if using 18 months of data, that is EFFECTIVE the day it rolls out, thus becoming the new BENCHMARK. Else we're talking circle something or other that a group of techies in Silliconehead Valley (sic) without a clue have named THE NEXT BEST THING IN SEARCH.

Without true parameters this is all speculation and the black box remains an enigma. NOT COMPLAINING! Just reality.
12:08 pm on Oct 23, 2015 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month

joined:Nov 6, 2003
posts:1447
votes: 115


It's a very interesting theory and it surely matches my situation.
In the last year I've deleted thousands of pages and compacted a lot of content.
Huge site modifications and metrics alterations (for the better), but the only result is that I'm slowly going into oblivion.

At the same time, my direct competitors are thriving by ignoring the webmaster guidelines and good sense.
In my niche there is something like a top 5 of sites that seem to be untouchable no matter what they do!

[math.ucr.edu...] - so you might be right!


The simplest explanation for some phenomenon is more likely to be accurate than more complicated explanations.
9:33 pm on Oct 25, 2015 (gmt 0)

Junior Member from US 

10+ Year Member

joined:Oct 31, 2005
posts: 196
votes: 11


It's a very interesting theory and it surely matches my situation.
In the last year I've deleted thousands of pages and compacted a lot of content.
Huge site modifications and metrics alterations (for the better), but the only result is that I'm slowly going into oblivion.


I could have written the exact same thing myself. We too have deleted thousands of pages (a little over half actually), compacted, improved metrics, but the only result is that this site is slowly going into oblivion as well.
8:39 pm on Oct 27, 2015 (gmt 0)

Full Member

Top Contributors Of The Month

joined:Sept 28, 2015
posts: 273
votes: 171


+1 to the theories here. Again, it's all speculation, but it seems to fit what we're seeing too and it makes sense. If a site has been spamming Google for a long time, Google wouldn't want to remove a Panda hit simply because the site was clean as of a threshold date. It would want to ensure that site had been clean for a much longer period of time.

This may also explain why sites like seroundtable that were hit in 4.1 seem to be recovering in 4.2 despite making no changes. The offending content may have been around long before 4.1, such that it counted towards the 4.1 hit, but it is now sufficiently long ago that 4.2 is ignoring it.
8:52 pm on Oct 27, 2015 (gmt 0)

Senior Member

WebmasterWorld Senior Member 5+ Year Member Top Contributors Of The Month

joined:Sept 14, 2011
posts:1045
votes: 132


This may also explain why sites like seroundtable that were hit in 4.1 seem to be recovering in 4.2 despite making no changes.


I think some changes were made on seroundtable regarding comments, an iframe or something...
9:18 pm on Oct 27, 2015 (gmt 0)

Full Member

Top Contributors Of The Month

joined:Sept 28, 2015
posts: 273
votes: 171


@seoskunk Yes, I believe the Disqus comments were originally in the source HTML but were then moved out into an iframe or JS instead. But there was absolutely no visible change for the user, so this shouldn't have impacted Panda. And if that really was the reason for the recovery, then, hey, anyone can recover from Panda simply by sticking their spammy content in an iframe or generating it in JS. I hope it's that simple but I doubt it!
 
