I don't think "proxy" is a good term to describe this. There's no problem with OP's site, and I don't think it is targeted in particular.
Andy, we may be seeing different parts of the same problem, but I do believe that a proxy is involved. The "attacked" site, though, isn't what is getting proxied. I believe in this case that it's Googlebot that's getting proxied. The attacked site when accessed directly and not via Googlebot should appear to be perfectly normal.
First... here are parts of tedster's and jdMorgan's descriptions from the above-cited 2007 proxy hijack thread. I'm leaving a lot out, and my emphasis is added below. I recommend reading the whole thread... Proxy Server URLs Can Hijack Your Google Ranking... https://www.webmasterworld.com/google/3378200.htm
First thing I want to clarify -- as I understand your post, you are not talking about someone directly hijacking your traffic through some kind of DNS exploit. You're also not talking about someone stealing your content, although that can play into this picture at times.
Instead, you are talking about a proxy domain that points to yours taking over your position in the SERPs, sometimes a position that your url has held for a long time.
jdMorgan...
Given an understanding of what a proxy *is* and how it works, the only step really needed is to verify that user-agents claiming to be Googlebot are in fact coming from Google IP addresses, and to deny access to requests that fail this test.
If the purported-Googlebot requests are not coming from Google IP addresses, then one of two things is likely happening:
1) It is a spoofed user-agent, and not really Googlebot.
2) It *is* Googlebot, but it is crawling your site through a proxy.
The latter is how sites get 'proxy hijacked' in the Google SERPs -- Googlebot will see your content on the proxy's domain.
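jdMorgan's test -- verify that anything claiming to be Googlebot actually resolves back to Google -- can be sketched like this. This is a minimal Python sketch of the reverse-then-forward DNS check, not code from the thread; the function name is mine:

```python
import socket

def is_real_googlebot(ip):
    """Reverse-then-forward DNS check for a purported-Googlebot IP:
    the PTR name must be under googlebot.com or google.com, and that
    name must resolve back to the same IP."""
    try:
        host, _, _ = socket.gethostbyaddr(ip)      # reverse DNS (PTR lookup)
    except OSError:
        return False                               # no PTR record -- fail the test
    if not host.endswith((".googlebot.com", ".google.com")):
        return False                               # wrong domain: spoofed or proxied
    try:
        _, _, forward_ips = socket.gethostbyname_ex(host)  # forward-confirm the name
    except OSError:
        return False
    return ip in forward_ips                       # must resolve back to the same IP
```

A request whose user-agent says Googlebot but which fails this check is either a spoofer or Googlebot arriving through a proxy -- either way, a candidate for a 403.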
Here's the translated description of the problem quoted from the Search Engine Roundtable article that I cite in our February 2016 thread.... My site's being de-indexed and replaced by others Feb, 2016 https://www.webmasterworld.com/google/4790240.htm
[webmasterworld.com] The quoted Polish webmaster reporting this problem said...
We (owners of "website B") are not hacked by him; he just copies the code of our website and puts it in an iframe on "website A". The problem is that Google's algorithm in many cases considers the malicious copy put by the hacker on website A as THE ORIGINAL, and website B (our website, which is the original) disappears from Google results.
The situations we're seeing aren't defined by the payload or by the collateral damage in the serps. The payloads are varied, IMO to obfuscate the methodology and the motive. In the above Feb thread, for instance, there were no iframes -- just some bizarre tricks to throw us off the scent. I suspect that hidden among these, among other things, are pr0n and malware... whatever is opportunistic for the hijacker.
Several new users posted specifics, which unfortunately we needed to delete, because we don't out sites and because in some cases the serps were dangerous. User joncmac, who had also been posting in the Google threads, mentioned "proxy hack", which rang a big bell for me and saved a lot of further speculation.
That was consistent with reports on the cited SE Roundtable discussion and with what I saw of the sites that had been brought to the mods' attention here, and with our 2007 proxy hijacking discussion noted above. What was particularly consistent was...
a) that the original website appeared to be intact when you navigated to it directly
b) that the new results replaced the old website in Google serps
c) that a DNS report (which I ran on one set of results) showed no problems, so it wasn't DNS hijacking
d) that viewing the duplicated page as Googlebot showed the hijacked page's content
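Point (d) is easy to reproduce yourself: fetch the suspect URL once with a normal browser user-agent and once claiming to be Googlebot, and compare the responses. A minimal sketch (the URL is a placeholder, and a site doing proper rDNS verification will of course treat this spoofed UA as a fake):

```python
import urllib.request

GOOGLEBOT_UA = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

def fetch_as(url, user_agent):
    """Fetch `url` presenting the given User-Agent header."""
    req = urllib.request.Request(url, headers={"User-Agent": user_agent})
    with urllib.request.urlopen(req, timeout=10) as resp:
        return resp.read()

# Usage (network required):
#   normal = fetch_as("https://example.com/page", "Mozilla/5.0")
#   as_bot = fetch_as("https://example.com/page", GOOGLEBOT_UA)
#   print("different -- possible cloaking" if normal != as_bot else "identical")
```

If the two responses differ, the page is serving Googlebot something other than what visitors see -- which is exactly what the reports above describe.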
aakk9999 reported on both the original hijacked url and the obfuscation of the page receiving the content. Again, a referrer was necessary for all this to happen, and there was no consistency in payload. I did a quick and dirty check to determine that this only happened on Google, and saw that the page's ranking was normal on Bing.
One poster, glutimax, did report that blocking the hijacker's domain/IP in their .htaccess fixed the problem.
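For anyone wanting to try glutimax's fix, a blunt .htaccess deny looks like this (Apache 2.4 syntax; the IP is a hypothetical stand-in for the rogue proxy's address from your access logs):

```apacheconf
# Block the rogue proxy outright. 203.0.113.45 is a placeholder --
# substitute the IP(s) you see fetching pages with a Googlebot UA
# from non-Google addresses.
<RequireAll>
    Require all granted
    Require not ip 203.0.113.45
</RequireAll>
```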
So I'm assuming that, whatever else is done with the hijack, it originates with a proxy crawler spoofing Googlebot and intercepting the content and ranking signals of the site.
With regard to canonical hijacking... I think I know the types of domain "directory" sites described... where they "review" the site and will often rank above penalized sites for the domain name, as well as for meta description, title, and some primary content from the site.
Up till now, I'd assumed that these pretty much preyed on penalized sites, including sites hit by Panda and Penguin. This episode, I'm thinking, suggests that they might also be hitting proxy hijacked sites, and that may or may not be coincidental. It could well be coordinated with the Googlebot crawl through a proxy: the intercepting site would have all the scraped content, and the hijackers certainly could make concurrent use of it. I don't think I'd attribute this to an algorithm weakness, though. If Google isn't seeing a page because Googlebot has been intercepted, it's not clear what the regular algorithm could do to sort this out.
It's also possible that what appear to be miscanonicalized listings simply could be random rather than coordinated, with replaced pages getting hit. A crawler spoofing Googlebot, btw, would be targeting a broad range of sites, so no one site would look targeted. Not all pages on a site would even get hijacked at once... they'd be subject to the vicissitudes of a natural crawl. One example site in our Feb thread got nibbled away over time.
Also, not all of the sites listed in these "directories" could be hijacked. Sites that were verifying Googlebot and blocking the rogue IPs would be immune to this kind of replacement. The sites getting hijacked could look like big sites with lots of PageRank, but IMO that would be a coincidental correlation... there's no reason PageRank should have anything to do with this. Amazon, for instance, is likely to be using rDNS.
As for motive, I was originally perplexed... as, from smb111's example, I wondered why a hijacker would crawl a page and return it only as a 404. My guess is that they might be making multiple uses of the hijacked pages over time... but that the essence of the hijack was being able to replace a popular page in the serp with a desired page load at a given time.
It would be interesting to see what happens if the OP on this thread blocks the hijacking IPs by whitelisting Googlebot. If that fixes it, that would seem to me to be a clearcut diagnosis.
Anyway, these are top of my head thoughts. I'm not really an IT guy or a security expert, but the above is bringing together reports from several sources... the SER thread, this current thread, our Feb thread, and several of our old proxy hijacked threads referenced above.
Again, I'd love to see some follow-up on this by members who routinely deal with bots and hijacking.