Forgive the long post in advance.
I doubt the homepage hijack is possible without it already being penalised.
Not necessarily penalized, but we're thinking along similar lines. The big question for me has been why the hijacked page disappeared. I'm very impressed by Andy's sleuthing, and if he says it's not a proxy I trust that. But the diagnosis that "Google has totally screwed up" and that this is a canonical error leaves me not fully satisfied. Following up on his description of what he did see, here's some speculation about what might be happening....
First, just to emphasize... this isn't simply a case of scraped content replacing the original. This is churn and burn spam doing something outside the range of normal algo. Even though the original hasn't been touched, there's a network of hacked and cloaked sites carrying the results and then redirecting searchers to nefarious targets. There might also be a bunch of hacked sites supplying the link juice to augment these rankings, as that's how it's usually done, albeit in this case maybe not all that extra link juice was necessary.
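To make the cloak-and-redirect pattern described above concrete, here's a minimal sketch of how such a hacked page might behave. This is purely illustrative; the function, user-agent tokens, and target URL are all hypothetical, and real hacks vary in how they fingerprint crawlers:

```python
# Hypothetical sketch of the cloaking pattern: the hacked page serves the
# scraped copy to Googlebot but bounces human searchers to the spam target.
# All names and URLs here are illustrative, not from any real hack.

GOOGLEBOT_TOKENS = ("Googlebot", "Googlebot-Image")

def handle_request(user_agent, referer):
    """Return (status, body/location) a cloaked page might serve."""
    if any(token in user_agent for token in GOOGLEBOT_TOKENS):
        # The crawler sees the scraped copy, so the dupe gets indexed
        # and can rank in place of the original.
        return ("200 OK", "scraped copy of original page")
    if "google." in referer:
        # A searcher clicking through from Google is redirected
        # to the nefarious target instead.
        return ("302 Found", "http://spam-target.example/")
    # Anyone else (including the hacked site's owner) sees a normal
    # page, which helps the hack stay unnoticed.
    return ("200 OK", "normal page")
```

The point of the three-way split is that each audience sees something different, which is why these hacks can persist: the owner sees nothing wrong, Google sees indexable content, and only searchers get redirected.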
Here's a thread that's a good overview of Google and hacking...
Understanding hacked sites that rank in Google April, 2013 https://www.webmasterworld.com/google/4561487.htm [webmasterworld.com]
So, I wouldn't say that Google is screwing up when it's got to deal with a bunch of hacked sites. If Google used the normal algo to deal with this general type of spam, which throws a lot of link juice at target pages for very competitive queries, we would see rankings getting distorted in ways that we don't like.
Here, I think that the hacker/spammer is doing something ingeniously different from the normal hacked site approach, possibly accidentally so. In this case, the nature of the queries themselves suggests that on vordmeister's site, it's the rarity of these pages that's being gamed.
A query for [product + model number] taken from a manuals site isn't likely to be very competitive. As vordmeister describes his page, it was ranking #2 without much in the way of backlinks. All this suggests that it's not in a very competitive area. I don't know exactly what he was targeting, but apparently it fits what the scraper is targeting.
In the case of dupe content, the page with the most PageRank usually wins. The scraped directories we often see generally rank for highly focused searches or for exact quotes. Most of the directories like this I've seen have been essentially scavengers, going after what's already been killed off... like penalized sites, Pandalized sites, or sites with dupe content. And yes, the rankings probably are mostly random.
But the particular difference here, which intrigued me a lot, is that it sounds like vordmeister's site had been replaced precisely, as if it were a proxy hijacking or some kind of redirect hijack, which we now know it isn't.
What I now think might be enabling the apparent page by page replacement is roughly as follows...
- the likely queries are very specifically focused on the content of the scraped pages...
- chances are that competing relevant pages are comparatively rare...
- the original pages generally don't have much in the way of inbound links...
- and the resulting scraped sites are highly focused.
I'm theorizing that when Google chooses between the original page and the duplicate pages, these scraped pages in a highly focused site (ie, a dupe of vordmeister's but with a bit more link juice) are probably the only decently linked pages that come close to matching the queries. That's why it may seem that they're replacing the originating site on a page by page basis. There may not be enough competition to offer other good alternative choices, so it can look like a page by page replacement even though, mechanically, it isn't one in the way a canonical replacement would be.
I don't know how competitive vordmeister's target phrases are, so this is all a guess, but in this case, from what's been described, this is how I think it might be working.
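The selection step I'm speculating about can be sketched as a toy model. To be clear, this is not Google's algorithm; the threshold, the scoring tuple, and the tie-break rule are all invented for illustration. The point is only that when the candidate pool is thin, a dupe with slightly more link juice can edge out the original:

```python
# Toy model (NOT Google's actual algorithm): among pages that closely
# match a narrow query, assume the best-linked one wins. The 0.9
# relevance threshold and the scores below are made up for illustration.

def pick_result(candidates):
    """candidates: list of (name, relevance, link_juice). Returns the winner's name."""
    # Keep only pages that closely match the focused query...
    close_matches = [c for c in candidates if c[1] >= 0.9]
    # ...then, among those, pick the one with the most link juice.
    return max(close_matches, key=lambda c: c[2])[0]

serp_candidates = [
    ("original page", 0.95, 1.0),   # few inbound links, per vordmeister
    ("scraped dupe",  0.95, 1.5),   # same content, a bit more link juice
    ("weak match",    0.30, 9.0),   # well linked but barely relevant
]
print(pick_result(serp_candidates))  # prints "scraped dupe"
```

In a competitive niche the candidate list would be long and the dupe would get lost in the crowd; in a rarity niche like this one, the only close matches are the original and its copy, so the copy's small link-juice edge decides everything.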
I think the hacker's overall approach, as Andy describes it, is actually fairly clever, as it systematically identifies targets that will rank easily and might survive for a while. In very competitive areas, like payday loans, Google was motivated enough to eventually come up with a special algo that went after high profile spammy queries and spammy links.
What I'm thinking spammers might be doing here is turning that spam pattern around... going after very non-competitive queries that Google isn't looking for, and winning on the basis of easy relevance without a huge amount of link juice.
Lengthy speculation... but more satisfying to me than a canonical screw-up.
[edited by: Robert_Charlton at 10:00 am (utc) on Apr 23, 2016]