
Forum Moderators: not2easy

Need Guidance on Site Copying

Site being copied - Need help to make it stop

     
7:33 pm on Aug 16, 2017 (gmt 0)

Preferred Member from US 

5+ Year Member

joined:June 14, 2010
posts: 606
votes: 4


I had a previous post [webmasterworld.com] about this a month or two ago, and I thought it had been corrected, but to no avail.

Basically, we have a website which is small (7-10 pages max) but commands a presence in a moderately popular market. The site has ranked (and still ranks) in the top 3 results for many highly competitive phrases, for more than 5-7 years. We have a strong presence in the Google instant answer box and throughout the SERPs.

Recently, I started finding backlinks from randomly hacked pages on live domains. After reviewing them, I found (assumed) that someone had copied our site from the source code (right-click, view source, copy), pasted it into a blank HTML document, changed ONLY the canonical URL to their URL, and let it fly. They are pointing thousands of low quality links at the new pages, which are being indexed with our exact content. Again, the only change is the canonical URL, which bit us a month or two ago when Google used their URL as the cached copy in the SERPs.

Skip ahead a month or so, and we've found about 30 more copies of our homepage on new domains.

A few facts:

- ONLY the homepage is being copied. No subpages.
- They are naming the URL after our website domain, which is a close EMD for the market we're in.
-- Ex: Ours: thiswebsitedomain.tld
-- Ex: Theirs: hackeddomain.TLD/thiswebsitedomain.html
- The ONLY change they make to the source code is the canonical URL, which they point to the hackeddomain.TLD/thiswebsitedomain.html page.
- I was able to get all previous versions shut down with DMCA notices. Several of the unhappy webmasters made it clear they thought it was us hacking their sites for backlinks. /sigh
- They have a JS referrer script in the header (hosted on a Google property no less) that redirects any search visitors to the search engine itself.

Some of the domains they previously used almost made it seem like a personal vendetta against the site and its rankings. Domains were named with abstract phrases like "youllneverrankagainmywebsitename.tld", "thiswillknockyououtofserpsmydomainname.tld", etc. Now they are using obviously compromised domains to host the page postings, which tells me it's more than a script kiddie at work. All the domains they previously used were behind privacy registration, so no information could be found about who the owners were.

As mentioned earlier, I assumed they were doing it by hand: copy our HTML source > paste HTML source > change canonical URL > publish. After this recent round, I decided to try to learn more, since it's obviously more than that. The hacked pages update as our content updates.

I updated our site yesterday after finding a new page again, and after viewing the hacked page a second time, its content had updated itself to reflect what we currently have on our site. In other words, it's updating automatically. Chasing down the logs, I found that the referrer is the hacked domain, along with a PHP version string (PHP 4.5, etc.), so a PHP script is likely just scraping the site and re-posting to the hacked page.

I've been around hosting/servers enough in the last 15+ years to spot hacks. I have checked my site for vulnerabilities and as far as I can tell, there are none. It's hosted on a shared VPS, and aside from a minimum standard load of WP plugins I have known and trusted for years, there appear to be no vulnerabilities. No changes to WP core files, nothing in the file system that looks out of place, etc.

I have a list of roughly 30 or so domains that currently have the copy on a hacked page. It's getting to be too much to rely on DMCA notices, which are just not scalable IMO.

Now... for questions and hopefully, good opinions.

1 - How do I quickly get our content off these sites?

I've considered temporarily setting my homepage to a simple 301 redirect, i.e.:
header( "HTTP/1.1 301 Moved Permanently" );
header( "Location: http://mydomainname.tld/" );

Then visit each of the 30 pages and refresh them so they grab the redirect, and move along. This should work as a temporary fix and will only take 2-3 minutes for all the pages. After the pages have indexed the fresh redirect, I revert my page back to the original.

2 - How do I trap the script and stop it from copying the homepage going forward?

Any help is appreciated... We actually have this issue on 2 websites that are similar in nature but hosted on different servers, etc. It's not WPMU or any other connecting factor, other than that they are registered under our parent company LLC.

Thanks in advance.
7:55 pm on Aug 16, 2017 (gmt 0)

Administrator from US 

WebmasterWorld Administrator not2easy is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Dec 27, 2006
posts:3555
votes: 196


When you say it is updating automatically, it sounds more as if they are displaying the page as iframe content, though if the entire page is available via rss that could also instantly refresh remotely. Have you viewed their source code to look for any sign of those tactics? One way to see the script (if that is what they're doing) would be to save the file and view with a text editor. I would also examine access logs, though that may or may not be helpful, depending on what they are doing.
8:27 pm on Aug 16, 2017 (gmt 0)

Preferred Member from US 

5+ Year Member

joined:June 14, 2010
posts: 606
votes: 4


Thanks @not2easy

I have viewed the source code of the page, and don't see any signs of an iframe, or framing at all. I have an htaccess directive to stop framing as well.

When I access the URL in a browser, the page takes several seconds to load. What I **thought** was happening is that their script refreshes via RSS/XML-RPC, but XML-RPC is blocked through a firewall and RSS is disabled on our site as well. So I *think* they have some PHP script written to scrape the page as theirs loads, providing a current copy.

In the logs - the only thing that shows is the domain name accessing the site, and the php version.

I could block all these domains from accessing the site, but that would still leave the current copied page live. In an ideal world, I would first force a blank page into the index on their site, or the 301 header refresh, and then block them completely from future attempts.

Thanks
9:34 pm on Aug 16, 2017 (gmt 0)

New User

joined:Jan 14, 2014
posts:11
votes: 3


In the logs - the only thing that shows is the domain name accessing the site, and the php version.

I could block all these domains from accessing the site, but that would still leave the current copied page live. In an ideal world, I would first force a blank page into the index on their site, or the 301 header refresh, and then block them completely from future attempts.


Since that domain automatically duplicates your content, YOU decide what's on their page(s).

I believe you could serve that domain different content (via a rewrite in .htaccess). That content could be very similar to yours (so the thieves won't easily notice the difference) and serve your interests, or be anything else. For instance, you could keep your content and just add "yourdomain.tld" (just the text, no link) in a couple of inconspicuous places. With that on their page(s), you could even go DMCA on them.
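A minimal .htaccess sketch of that idea, assuming the scraper sends its hacked domain as the HTTP referrer (all domain and file names here are placeholders from this thread, not real values):

```apache
RewriteEngine On
# Requests referred by the scraper's hacked domain...
RewriteCond %{HTTP_REFERER} hackeddomain\.tld [NC]
# ...get a decoy page instead of the real homepage
RewriteRule ^(index\.html)?$ /decoy.html [L]
```

The decoy can look just like the original but carry "yourdomain.tld" in plain text, as described above.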

Been there :(

Take care!
2:28 pm on Aug 17, 2017 (gmt 0)

Preferred Member from US 

5+ Year Member

joined:June 14, 2010
posts: 606
votes: 4


@asterickx -

Since that domain automatically duplicates your content, YOU decide what's on their page(s).


That's exactly what I had in mind, to eliminate the pages from those sites.

I believe you could serve that domain different content (via a rewrite in .htaccess)


That is appealing to me... and it eliminates the need to pull content off my site, even temporarily. I could put up a page, redirect the scrape to that page, and decide from there how to notify the site owners that their websites were compromised.

Thanks for the feedback.
3:03 pm on Aug 17, 2017 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month

joined:Feb 12, 2006
posts:2678
votes: 105


If you're going to do that, why not just try inserting an old-fashioned meta redirect in the head?
4:56 pm on Aug 17, 2017 (gmt 0)

Preferred Member from US 

5+ Year Member

joined:June 14, 2010
posts: 606
votes: 4


@Londrum

If you're going to do that why not just try inserting an old-fashioned meta redirect in the head.


Can you elaborate a bit?

I assume you mean something to the effect of redirecting based on source? <if> from thehackeddomain.tld <redirect>?

Would this be any different than using htaccess? Beneficial one way or another?

Thanks
5:06 pm on Aug 17, 2017 (gmt 0)

Preferred Member from US 

5+ Year Member

joined:June 14, 2010
posts: 606
votes: 4


For what it's worth - I just applied the htaccess redirect based on referrer, and it worked to erase the page on the hacked site.

Now it shows no content, but responds with a 302 Found... and I redirected to Google. So... opinions again:

Should I redirect to Google, or somewhere else? I worry about redirecting to my own site, since they have pointed thousands of comment spam links at these hacked pages. If I redirect to my own site, it may look as suspect as the original issue.

In .htaccess, I used:
RewriteEngine On
RewriteCond %{HTTP_REFERER} hackeddomain\.tld [NC,OR]
RewriteCond %{HTTP_REFERER} anotherhackeddomain\.tld [NC]
RewriteRule ^(.*)$ [google.com...]
6:12 pm on Aug 17, 2017 (gmt 0)

Administrator from US 

WebmasterWorld Administrator not2easy is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Dec 27, 2006
posts:3555
votes: 196


If you are seeing these referrers in your logs, you can easily block them using domain names. That could be a little more efficient (using only the distinct portions of the referrers that you see in your logs):
RewriteEngine On
RewriteCond %{HTTP_REFERER} (hackeddomain|anotherhackeddomain|hackdomain)

Rather than sending them to someone else who might not appreciate your efforts, you can send them to your Forbidden page, or whatever you have at your 403 error destination:
RewriteRule .* - [F]
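Put together, a sketch of that approach (placeholder domain fragments, as above):

```apache
RewriteEngine On
# Match any of the known scraper referrers by their distinct substrings
RewriteCond %{HTTP_REFERER} (hackeddomain|anotherhackeddomain|hackdomain) [NC]
# Serve the 403 Forbidden page instead of redirecting anywhere
RewriteRule .* - [F]
```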


6:31 pm on Aug 17, 2017 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month

joined:Feb 12, 2006
posts:2678
votes: 105


I wouldn't redirect to my own site either... might be a bit risky.
If the page is now showing as blank, then there's no harm to your site anymore.

I would steer clear of the temptation to redirect to another site, or to type something horrible on the page, because the hacked site might be totally innocent and not have a clue what has been going on. Maybe they just had an old WordPress plugin that they hadn't updated, and it got hacked.
You might just make an extra enemy for yourself -- why risk it?
Maybe send them to the actual example.com -- http://example.com

PS: It doesn't sound like you need it anymore, but the meta redirect was just this in the <head>:
<meta http-equiv="refresh" content="0; url=http://example.com/" />
6:48 pm on Aug 17, 2017 (gmt 0)

Preferred Member from US 

5+ Year Member

joined:June 14, 2010
posts: 606
votes: 4


@not2easy -

The hackedpage.html document now just shows a blank page. It does report a 302 in the header, but the link shown is google-com.

The page shows: 302 Found, "Document Moved Here" ("Here" is anchored with google-com). There is no actual "redirect" so to speak, just a flat page with a link on it.

My ultimate goal was to remove the copy of our site from the hacked pages, which worked perfectly. The 302 was unexpected, and the link to Google was the only place I could think of sending them to keep everyone happy.

Of course, I immediately thought of pointing all the pages to a competitor. :/ - Even I'm not that evil though!

@londrum -

I agree. I did not want to show anything related to my own domain. All the toxic links pointing to the pages would raise more flags and possibly pass over to us.

Thanks again for the suggestions, all.
8:25 pm on Aug 17, 2017 (gmt 0)

Administrator from US 

WebmasterWorld Administrator not2easy is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Dec 27, 2006
posts:3555
votes: 196


Glad to hear your goal was accomplished. The 302 should be dealt with so it is not seen as a temporary situation. A redirect via rewrite defaults to 302 unless you add the flag [R=301] or [L,R=301], depending on what/where the rewrite is.
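For example, a hedged sketch with placeholder domains:

```apache
RewriteEngine On
RewriteCond %{HTTP_REFERER} hackeddomain\.tld [NC]
# Without R=301, mod_rewrite's external redirect defaults to 302 Found
RewriteRule ^ http://example.com/ [R=301,L]
```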
8:32 pm on Aug 17, 2017 (gmt 0)

Preferred Member from US 

5+ Year Member

joined:June 14, 2010
posts: 606
votes: 4


Redirect via rewrite defaults to 302 unless you add on the flag [R=301] or [L,R=301] depending on what/where the rewrite is.


Thanks for that info! Much appreciated.
8:34 pm on Aug 17, 2017 (gmt 0)

Moderator from US 

WebmasterWorld Administrator keyplyr is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Sept 26, 2001
posts:10629
votes: 630


I would recommend to any site the following safeguards against hijacking:

# Stop Google from offering a cached version of your pages in the SERPs:
Header set X-Robots-Tag "noarchive"


# 2 methods to block iframes:

A script:

<script type="text/javascript">
if (parent.frames.length > 0) {
parent.location.href = location.href;
}
</script>

and the header directive in .htaccess:

Header set X-Frame-Options "DENY"


I would also advise blocking the Internet Archive bot from scraping your content and displaying all your pages on their servers.
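A sketch of that last suggestion via robots.txt; the Internet Archive's crawler has historically identified itself as ia_archiver:

```
User-agent: ia_archiver
Disallow: /
```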
12:20 am on Aug 18, 2017 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member lucy24 is a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month

joined:Apr 9, 2011
posts:14426
votes: 576


Incidentally...
In the logs - the only thing that shows is the domain name accessing the site, and the php version.
Is it in your power to change this? In general, logs are much more useful if you set them up to show the actual requesting IP, not the looked-up hostname.
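On Apache, for example, that is controlled by HostnameLookups and the access log format (a sketch assuming httpd.conf access; with lookups off, %h logs the raw client IP):

```apache
# Do not resolve client IPs to hostnames for logging
HostnameLookups Off
# Common Log Format: %h is the remote host (the raw IP when lookups are off)
LogFormat "%h %l %u %t \"%r\" %>s %b" common
CustomLog "logs/access_log" common
```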
3:30 pm on Aug 18, 2017 (gmt 0)

Preferred Member from US 

5+ Year Member

joined:June 14, 2010
posts: 606
votes: 4


@Keyplyr - Thanks for the tips on this. I DO run the .htaccess X-Frame deny directive, but not the others.

@Lucy24 - I'll check and see. I'm on a self-managed VPS, and just about all entries do have an IP address. To clarify, I didn't get this from the raw logs; I grabbed it through the standard cPanel "recent log file" view.
5:30 pm on Aug 18, 2017 (gmt 0)

Administrator from US 

WebmasterWorld Administrator not2easy is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Dec 27, 2006
posts:3555
votes: 196


While I agree that the .htaccess frame-deny header is something for all sites, anyone using WP will find some of its functions return a 403 if they're using the .js version in their WP pages. That's because the editing panel for CSS/appearance and some plugins' settings are shown in an iframe. It's on their own domain, but in an iframe interface.
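One possible middle ground for that WordPress case: X-Frame-Options also accepts SAMEORIGIN, which blocks third-party framing but still lets a site frame its own pages (a sketch, not a drop-in fix for every WP setup):

```apache
# Allow same-domain iframes (e.g. WP admin panels) while blocking external framing
Header set X-Frame-Options "SAMEORIGIN"
```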
6:30 pm on Aug 19, 2017 (gmt 0)

Junior Member from US 

5+ Year Member

joined:Dec 23, 2008
posts:153
votes: 4


Instead of a redirect to Google (as was mentioned),
you could redirect to 127.0.0.1 .....
7:20 pm on Aug 19, 2017 (gmt 0)

Moderator from US 

WebmasterWorld Administrator keyplyr is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Sept 26, 2001
posts:10629
votes: 630


Instead of a redirect to Google (as was mentioned),
you could redirect to 127.0.0.1
Never, ever do that. That's a kid's trick that has been passed around for years and is sure to get you a malicious site penalty.
11:49 pm on Aug 19, 2017 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member lucy24 is a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month

joined:Apr 9, 2011
posts:14426
votes: 576


Never, ever do that. That's a kid's trick that has been passed around for years and is sure to get you a malicious site penalty.

How would a search engine even know, if you're applying the redirect only to malign agents -- who, to top it off, are rarely dumb enough to follow the redirect? (The same goes for its close relative: redirecting to the requesting IP.)
12:21 am on Aug 20, 2017 (gmt 0)

Moderator from US 

WebmasterWorld Administrator keyplyr is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Sept 26, 2001
posts:10629
votes: 630


@lucy24 - I personally don't understand much of how Googlebot, Bingbot, Yandexbot, etc. process data, but they do recognise those redirects.

Besides, it's unethical to redirect requests anywhere other than to an updated location (file moved) or a corrected protocol (HTTP to HTTPS), for example.
3:11 pm on Aug 21, 2017 (gmt 0)

Preferred Member from US 

5+ Year Member

joined:June 14, 2010
posts: 606
votes: 4


One of my concerns over "where to redirect" is that I don't want to cause any harm, even indirectly, to the sites hosting the hacked pages. I figured a link to Google was the easiest and least harmful option.

Thanks again for all the advice and help on this. For now it seems to be under control, but I'm sure new pages will show up again.