
Homepage Being Copied - Google Cache Showing Different URL Than Ours

Looking for advice to combat copying and cache redirection

     
6:11 pm on June 8, 2017 (gmt 0)

Preferred Member from US 

5+ Year Member

joined:June 14, 2010
posts: 612
votes: 8


In a nutshell - Several of our 6-8yr old popular (but small in size, 8-10 pages) consumer information sites have started being copied. Exact copies. Someone is grabbing the source code of the homepage, copying it in full, replacing ONLY the canonical URL, pasting the code into directories on hacked websites, and pinging those sites to get them indexed in search engines.

Since we block images/CSS, etc. from displaying elsewhere, the copied pages only show the text and placeholders. It is, however, enough to have a negative effect on us. Every few days I notice a large drop in traffic, check Google indexing, and find that the cached copy in Google is a different URL holding a full copy of our homepage. The end result is that we lose a few days of traffic/income while Google sorts through the mess.

A few facts:

- We have a registered LLC, established in 2010, under whose umbrella all our sites operate.

- 2-3 months ago, the person/company doing this was using newly purchased domains (with whois privacy enabled) to do the same. We tried hard to obtain the ownership info (name/address, etc.) to no avail. Since then, they have moved to using hacked domains.
- The root domains the copies are being placed on have no idea of the hack within their systems, nor do they know the content has been placed there. The title of the copied page is the exact same as our domain (minus the .com).
- There is no common type of CMS. Some have been WordPress, others straight HTML sites, others Joomla, etc.
- Whoever is doing this is copying our source code and pasting it into a static HTML doc: a full and exact copy of our homepage.
- Google is indexing OUR site as normal, but the cached copy shows the URL of the hacked site.
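For reference, a minimal sketch of how that comparison can be automated: it fetches your own homepage and the suspect URL and prints only the lines that differ, which on an exact scrape should be little more than the canonical tag. Both URLs below are hypothetical placeholders.

import difflib
import urllib.request

OUR_HOMEPAGE = "https://www.example.com/"                  # hypothetical: your real homepage
SUSPECT_COPY = "https://hackedsite.example/example.html"   # hypothetical: the scraped copy

def fetch(url):
    # fetch the raw HTML and split into lines for diffing
    req = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urllib.request.urlopen(req, timeout=15) as resp:
        return resp.read().decode("utf-8", errors="replace").splitlines()

ours = fetch(OUR_HOMEPAGE)
theirs = fetch(SUSPECT_COPY)

# Print only differing lines; on an exact scrape you would expect the
# canonical <link> (and little else) to show up here.
for line in difflib.unified_diff(ours, theirs, fromfile="ours", tofile="copy", lineterm=""):
    if line.startswith(("+", "-")) and not line.startswith(("+++", "---")):
        print(line)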

Every time I find a copy, I go through the DMCA steps and get it removed. I also use GWMT to "Fetch and Render" the homepage and request indexing to speed up the process. Sometimes it works within minutes; other times it can take a few days.

Has anyone experienced this and if so, how did you deal with it?
7:53 pm on June 8, 2017 (gmt 0)

Administrator from US 

WebmasterWorld Administrator not2easy is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Dec 27, 2006
posts:4157
votes: 262


There's a related discussion here [webmasterworld.com] about this kind of problem; it might offer more insight. At this time we don't know whether any of the suggestions helped resolve it.
8:00 pm on June 8, 2017 (gmt 0)

Preferred Member from US 

5+ Year Member

joined:June 14, 2010
posts: 612
votes: 8


Thanks for the link. Will review.
5:42 am on June 9, 2017 (gmt 0)

Moderator from US 

WebmasterWorld Administrator robert_charlton is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Nov 11, 2000
posts:12225
votes: 361


mhansen, I've posted an answer to the part of your post regarding getting anonymous registration information on the other thread (the one that not2easy recommended you look at). I trust you've looked through various suggestions that members posting on that thread have made. Extended discussion about your site, though, should stay here... as it's really impossible to maintain the same discussion on two threads.

Please forgive a rushed reply in advance. I'm going to speculate a bit, prompted by what you've posted, and then post several threads for reference.

From what you've posted in the OP on this thread, I'm guessing that you've been scraped by hackers who are hacking networks of vulnerable sites they find, and are then copying your scraped content, cloaked for Googlebot, into various pages on those hacked sites. What you see in the Google cache is what Googlebot sees.


One of the things that's not clear, but I'm going to guess at from your post above...
Several of our 6-8yr old popular (but small in size, 8-10 pages) consumer information sites have started being copied.

You don't say so precisely, but I'm guessing... and this could well be wrong... that the common point of vulnerability for these sites is that they're on the same hosting account. Is this the case? If it is the case, then it's very possible that your host has been hacked... but not necessarily so, as static content from the sites (if we are talking about static content) could have been managed without server access.

If not by common hosting, how else might your several sites have been associated? From your description, it appears that this is simply static content that's been taken, but I'm not precisely clear about this.

Have you viewed your sites using fetch as Googlebot? When you do, are the canonical tags rewritten?

What happens when you search for text strings on your home page(s) in quotes? What about your other pages?

What I'm trying to establish here is where the content is coming from.

Essentially, a hacker/hijacker needs to get access to your content, and, if dynamic, to maintain that access. Additionally, a hijacker needs a mechanism by which to replace your domain. That can be DNS hijacking, proxy hijacking, or simply pumping so much link juice into the hijacked content that it outranks your normal site.
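As a rough first-pass check on whether anything on or in front of a server is rewriting canonicals for crawlers, a sketch like the following compares what a normal browser User-Agent and a Googlebot User-Agent are served. It is only an approximation - real cloaking often keys off Googlebot's IP ranges rather than the User-Agent string, so Fetch as Google in Search Console remains the authoritative test - and the URL is a hypothetical placeholder.

import re
import urllib.request

URL = "https://www.example.com/"   # hypothetical: the homepage to test
UAS = {
    "browser": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "googlebot": "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)",
}

def canonical_for(ua):
    # fetch the page with the given User-Agent and pull out the canonical href
    req = urllib.request.Request(URL, headers={"User-Agent": ua})
    html = urllib.request.urlopen(req, timeout=15).read().decode("utf-8", errors="replace")
    for tag in re.findall(r"<link\b[^>]*>", html, re.I):
        if re.search(r'rel=["\']canonical["\']', tag, re.I):
            m = re.search(r'href=["\']([^"\']+)["\']', tag, re.I)
            return m.group(1) if m else None
    return None

# If the two results differ, something is serving crawlers different markup
# than it serves visitors.
for name, ua in UAS.items():
    print(name, "->", canonical_for(ua))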

Here are two prior threads to start you off, each with some useful references to other threads and articles. The threads themselves give you some idea of the intricacy involved, and also describe expected behaviors for hijack scenarios. Please note that we do not allow public site reviews, and I ask you now not to post urls of any of the sites.

I'm probably not going to be available for more detailed discussion, but I hope others will jump in....

My site's being de-indexed and replaced by others
Feb, 2016
https://www.webmasterworld.com/google/4790240.htm [webmasterworld.com]

Google Result Hijacking
April, 2015
https://www.webmasterworld.com/google/4800812.htm [webmasterworld.com]

2:34 pm on June 9, 2017 (gmt 0)

Preferred Member from US 

5+ Year Member

joined:June 14, 2010
posts: 612
votes: 8


Hi Robert - Thanks for your lengthy reply, as well as moving the discussion back over here. I agree about keeping them separate. Much easier.

From what you've posted in the OP on this thread, I'm guessing that you've been scraped by hackers who are hacking networks of vulnerable sites they find, and are then copying your scraped content, cloaked for Googlebot, into various pages on those hacked sites. What you see in the Google cache is what Googlebot sees.


Correct. From what I can tell, it's a basic source-code scrape and copy/paste job. I've previously tested changes to the site, refreshed the cache, and seen no changes on the hacked copy, so I concluded that there was nothing "live" about the copy. When I later found other copies, I determined they were copies of older versions of the pages previously found and DMCA'd off the web.

You don't say so precisely, but I'm guessing... and this could well be wrong... that the common point of vulnerability for these sites is that they're on the same hosting account. Is this the case?


No. We maintain several separate hosting accounts for the sites, though they are with the same hosting company on different servers. (There are mainly 2 sites affected by this, and only their homepages.) They are, however, similar in topic and can easily be connected through our parent company privacy statement. We do not use privacy on our whois, and knowing how easy it is to track down other sites owned by the same person, it would take the offender only a few minutes to do so.

Have you viewed your sites using fetch as Googlebot? When you do, are the canonical tags rewritten?

What happens when you search for text strings on your home page(s) in quotes? What about your other pages?


Re: Canonical - No. The Fetch and Render view of the source code shows our own canonical tag, as we'd expect.

Re: Text string search - Of course, we find many sites with snippets of our content, but rarely have we found a full copy.

Several of your other questions are answered in the preceding paragraphs as well. To summarize:

- The host/server are secure. The copies we've found are EXACT HTML/source code copies, with the ONLY exception being the replacement of our canonical URL with their own.
- The sites are easily connected through our "about" type pages, as well as whois/owner information; however, they are not linked to each other.
- They are not overly strong, high-authority domains. This is likely our main point of weakness.
- Using a "site:domainDOTcom" search in Google, we see our site and our URLs in the index. However, when clicking the small arrow to the right of the SERP URL on the HOMEPAGE link only, the Google cached copy URL shows the domain/URL the copy resides on.
- In previous occurrences, we were able to simply "Fetch & Render" in GWMT, then request indexing. The SERPs would update the cache of the homepage and all would be well until the next time. (It's happened a few times, but it's not a daily occurrence.) This time, however, for whatever reason Google is not re-crawling immediately. I assume I'll just have to wait for Googlebot to come back, reindex, and update the cached URL.
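For anyone who wants to spot-check what Google's cached copy is carrying without clicking through the SERP arrow each time, a sketch like the one below pulls the cache URL and reports the canonical inside it. This assumes the webcache.googleusercontent.com "cache:" URL format in use when this thread was written; Google throttles or blocks automated requests to it, so treat this strictly as a manual helper, and the domain is a hypothetical placeholder.

import re
import urllib.request

PAGE = "www.example.com/"   # hypothetical: your homepage, scheme omitted
# cache URL format as it existed at the time of this thread
CACHE_URL = "https://webcache.googleusercontent.com/search?q=cache:" + PAGE

req = urllib.request.Request(CACHE_URL, headers={"User-Agent": "Mozilla/5.0"})
html = urllib.request.urlopen(req, timeout=20).read().decode("utf-8", errors="replace")

# Look for the canonical tag inside the cached markup; if it is not your own
# URL, the cache is still pointing at the scraped copy.
for tag in re.findall(r"<link\b[^>]*>", html, re.I):
    if re.search(r'rel=["\']canonical["\']', tag, re.I):
        m = re.search(r'href=["\']([^"\']+)["\']', tag, re.I)
        print("canonical in cached copy:", m.group(1) if m else "(none)")
        break
else:
    print("no canonical tag found (or the request was blocked)")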

In my opinion - and I've been around web dev for almost 20 years now - this is a simple scrape > copy > paste hack, where the person or company links to the scraped copy, thus confusing Googlebot into replacing our cached URL with their own. What's worse is that some of the backlinks we've chased down to these sites are nothing but poor-quality forum-profile and comment-spam links. While we have a historically fairly solid link profile and have seen these poor-quality links to OUR sites in the past, they've never affected us until now.

I'll follow the other links you shared and read further, and again, thank you for taking the time to respond to both threads.
8:53 pm on June 14, 2017 (gmt 0)

Preferred Member from US 

5+ Year Member

joined:June 14, 2010
posts: 612
votes: 8


I thought I would post a follow-up for anyone finding themselves in a similar situation.

Symptoms of a Google Cache Hijack (Homepage only, in my case) -

- Sudden/immediate traffic drop (in my case, from thousands of daily Google referrals to hundreds)
- Referrals from an obscure domain in the logs.
- Lost the Answer Box result for a previously shown phrase, as well as our 2nd/3rd-place organic result. Not found on the first few pages of the SERPs.

What I checked / result -

- Webmaster Console (WMT) for issues/errors. None found.
- Google index status, using the "site:domain.tld" query in Google search to see all pages indexed. Nothing seemed awry; links clicked through to the site as expected, and all of the site was indexed.
- Checked the cached version and date of the home page. Found that the cached version showed a different domain URL, formed as: ahackedwebsite.tld/mydomainname.html. The cache date was close to the timestamp the traffic dropped.
- Clicked on the cache link, and it redirected back to Google with a 302 status code.
- Copied the link into a browser to view. The page was an exact "HTML source code" copy of my site's homepage. Someone pretty much just visited the site, right-clicked, copied the source code, and saved it as mydomainname.html. The absolute ONLY change in the source code was the "canonical", which was changed to their URL.
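Since the earliest visible symptom was referrals from an obscure domain in the logs, a small log-scanning sketch like the one below can surface that sooner: it tallies referrer hosts that are not on a known list, from a combined-format access log. The log path and the allowlist are hypothetical placeholders.

import re
from collections import Counter
from urllib.parse import urlparse

LOG_PATH = "/var/log/apache2/access.log"   # hypothetical path to a combined-format log
KNOWN_HOSTS = {"www.example.com", "www.google.com", "www.bing.com"}   # hypothetical allowlist

referrers = Counter()
with open(LOG_PATH, encoding="utf-8", errors="replace") as fh:
    for line in fh:
        # in the combined log format the referrer is the second-to-last quoted field
        fields = re.findall(r'"([^"]*)"', line)
        if len(fields) >= 2 and fields[-2] not in ("-", ""):
            host = urlparse(fields[-2]).netloc.lower()
            if host and host not in KNOWN_HOSTS:
                referrers[host] += 1

for host, hits in referrers.most_common(20):
    print(f"{hits:6d}  {host}")
# A sudden spike from a single unfamiliar host is worth a manual look - it may
# be serving a scraped copy of your page.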

What I did to fix -

- Sent the webmaster of the hacked site an email, and submitted a DMCA notice to the host for the affected URL.
- Back in GWT, I used the "Fetch and Render" tool to generate a fetch of my page.
- Requested indexing. My own experience with "requesting indexing" is that on some pages/sites it can be an instant cache refresh. On other pages or sites, it can take a few days. My understanding is that this depends on the crawl frequency Googlebot uses for that particular page or site.
- No change within 1 hour - the incorrect cached URL was still in the index.
- Went back to my own site, updated the content as I normally would, and saved.
- Repeated the Fetch & Render as well as Request Indexing.
- Pinged/fetched from a few ping tools.

Waited... and wrote this post here looking for help.

The next day, the webmaster removed the copied page from his website. Unfortunately, he used a 301 to his homepage, and didn't reply when I asked him to switch to a 404.
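To confirm how the removed copy now responds (a clean 404 or 410 being preferable to the 301 described above, since a redirect keeps the location "alive" for crawlers), a quick status check along these lines works. The URL is a hypothetical placeholder and is assumed to be https.

import http.client
from urllib.parse import urlparse

COPY_URL = "https://hackedsite.example/example.html"   # hypothetical placeholder

parts = urlparse(COPY_URL)
conn = http.client.HTTPSConnection(parts.netloc, timeout=15)
conn.request("GET", parts.path or "/")   # http.client never follows redirects
resp = conn.getresponse()
# 404/410: gone for crawlers; 301/302: still redirecting (check the Location header)
print("status:", resp.status, "location:", resp.getheader("Location"))
conn.close()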

I visited the Google URL removal tool [google.com] and requested that the cached URL be removed. Be sure to include the full Google link they ask for.

- Repeated the Fetch & Render 2-3x per day, as well as Request Indexing.

2 days later (3-4 days total), the offending page was dropped from the search cache, and our site and its cached copy were indexed as expected. The Answer Box result was back immediately, as were the organic results in the same positions.

Now... How to prevent this kind of thing?
 
