Forum Moderators: Robert Charlton & goodroi

Message Too Old, No Replies

Our website is being scraped. Does it affect our rankings?

         

goodbyedee

12:31 pm on Mar 2, 2018 (gmt 0)

10+ Year Member



Hello!

Another site is scraping our entire site (content, images, URLs), changing only the company name and replacing our brand with theirs.
The scraping site is not accessible to visitors (I suspect it has a robots.txt file that is blocking agents) but only to bots.
For many long-tail keywords it ranks first on Google, while our site is nowhere to be found.
We filed reports with the domain registrar, the hosting company, and Google, with no response yet.
Looking further into it, we have discovered dozens of other (competitor) sites that have copied entire paragraphs from our site.

I wonder how harmful this is for our site. Are there ways to prevent other sites from scraping our content?

Thank you!

keyplyr

7:53 pm on Mar 2, 2018 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Hello goodbyedee,

Scraping is indeed a huge problem that affects nearly everyone's digital properties. It devalues our branding and unique content, steals our traffic, takes our rankings, and cuts into our income.

If you've followed the proper procedure for filing Cease & Desist notifications with the respective agencies, all you can do is wait and see how that progresses. Keep after them.

Proactively, there are things you can do, but it takes consistent diligence:

• If the scraping is done manually (someone with a browser simply downloading and cutting and pasting), there's no way to block them beforehand, but if you can identify their IP address, it can be blocked.

• If the scraping is done by automated methods (bots, scripts, etc.), then there are several things you can do to stop it from happening again.
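For the manual-scraper case, blocking an identified IP address can be done in an Apache 2.4 `.htaccess` file. A minimal sketch (the addresses below are documentation placeholders; substitute the offending IPs from your logs):

```apache
# Allow everyone except the listed scraper addresses/ranges
<RequireAll>
    Require all granted
    Require not ip 203.0.113.0/24
    Require not ip 198.51.100.7
</RequireAll>
```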

Here are a few helpful links:

Search Engine Spider & User Agent ID Forum [webmasterworld.com]

Server Farm IP Ranges [webmasterworld.com]

Blocking Methods [webmasterworld.com]

- - -

lucy24

9:29 pm on Mar 2, 2018 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I suspect it has a robots.txt file that is blocking agents

That's not how robots.txt works.

levo

11:05 pm on Mar 2, 2018 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I recently fought with such a scraper. They were 'proxying' our website in real time, changing only the brand, the logo and the domain (in the content). They were using the Tor network, in addition to some datacenters, so it was impossible to block their crawler. Their server was behind Cloudflare, so I couldn't take it down, and Cloudflare didn't take any action. And since their website is only accessible to search engine crawlers, Google etc. didn't take any action either -- they only saw a blank page.

First of all, create a new disposable Google account, add their domain to your new Google Search Console account, and verify it using the 'HTML file' method (upload the file to your domain, verify from theirs). If successful, you can remove the infringing domain from search results.

Second, and more technical: you can create a honeypot folder, block it in robots.txt, and ping a random file in that folder from the infringing domain so you can catch every single bot IP they are using. For me it was more than 3K IPs, continuously changing.
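A minimal sketch of that trap, assuming PHP on the server (the folder, file, and function names here are my own invention): disallow a folder such as `/trap/` in robots.txt, put a logging script inside it, then request that script through the infringing domain. Well-behaved crawlers never fetch a disallowed path, so every IP that hits it via the proxy is one of theirs.

```php
<?php
// trap.php - hypothetical logger placed in a robots.txt-disallowed folder.
// Fetching it through the infringing domain makes their proxy hit it,
// exposing one of their crawler IPs per request.

// Build one log line: timestamp, client IP, user agent
function format_trap_line(string $when, string $ip, string $ua): string {
    return $when . ' ' . $ip . ' ' . $ua . "\n";
}

// Only log when actually hit over the web (not when run from the CLI)
if (isset($_SERVER['REMOTE_ADDR'])) {
    $line = format_trap_line(
        date('c'),
        $_SERVER['REMOTE_ADDR'],
        $_SERVER['HTTP_USER_AGENT'] ?? '-'
    );
    file_put_contents(__DIR__ . '/trap.log', $line, FILE_APPEND | LOCK_EX);
}
```

The collected IPs can then be fed into your firewall or `.htaccess` deny rules.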

seoskunk

10:30 pm on Mar 3, 2018 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



First of all, create a new disposable Google account, add their domain to your new Google Search Console account, and verify it using the 'HTML file' method (upload the file to your domain, verify from theirs). If successful, you can remove the infringing domain from search results.


Awesome!

seoskunk

10:34 pm on Mar 3, 2018 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Second, and more technical: you can create a honeypot folder, block it in robots.txt, and ping a random file in that folder from the infringing domain so you can catch every single bot IP they are using. For me it was more than 3K IPs, continuously changing.


Blocking it in robots.txt kinda gives away the honeypot, but you can do a reverse/forward DNS check on Googlebot and show it a robots.txt containing the honeypot, while showing the attacker a robots.txt without it.
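A sketch of that reverse/forward DNS check in PHP (the helper name is my own): a genuine Googlebot IP has a PTR record ending in googlebot.com or google.com, and that hostname resolves forward to the same IP. A request that passes gets the robots.txt with the honeypot entry; everything else gets the clean one.

```php
<?php
// Verify a client claiming to be Googlebot via reverse, then forward, DNS.
function is_real_googlebot(string $ip): bool {
    $host = gethostbyaddr($ip);                      // reverse lookup (PTR)
    if ($host === false || $host === $ip) {
        return false;                                // no PTR record at all
    }
    if (!preg_match('/\.(googlebot|google)\.com$/i', $host)) {
        return false;                                // PTR points elsewhere
    }
    $ips = gethostbynamel($host);                    // forward lookup
    return $ips !== false && in_array($ip, $ips, true);
}
```

The double lookup matters: anyone can fake a User-Agent, and PTR records alone can be spoofed by whoever controls the scraper's reverse DNS zone; only the forward confirmation closes the loop.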

seoskunk

10:38 pm on Mar 3, 2018 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Oh, in answer to the OP's question: if you are affected by scrapers, it's normally an indication of more serious problems on the site; in most cases, just ignore them.

levo

2:01 am on Mar 4, 2018 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Blocking it in robots.txt kinda gives away the honeypot, but you can do a reverse/forward DNS check on Googlebot and show it a robots.txt containing the honeypot, while showing the attacker a robots.txt without it.


Maybe honeypot is not the right word to describe it; it's just a script that no one else would visit, so you can trigger it yourself from the infringing domain continuously to detect and block their crawler IPs. It's a trap, but you have to poke it yourself.

keyplyr

3:05 am on Mar 4, 2018 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



It is important to determine *how* the scraping is being done. It may not actually be scraping at all, but hijacking.

• If the hijacking is done by using a proxy (A record hijack), you can block direct access with a PHP snippet in your site-wide header:

<?php
// Serve the page only when the request is addressed to your own hostnames;
// refuse anything that reaches the server under another name or a bare IP
$servername = $_SERVER['SERVER_NAME'];
if ($servername == 'your-domain.com') {
    // ok
} elseif ($servername == 'www.your-domain.com') {
    // ok
} else {
    die("Direct IP access not allowed!");
}
?>
More info: [serverfault.com...]


• If the hijacking is done by iframing your content, I recommend both of these methods to block iframes:

1.) a simple frame-busting script:

<script type="text/javascript">
// If this page is not the top-level window, break out of the framing page
if (window.top !== window.self) {
    window.top.location.href = window.location.href;
}
</script>

2.) and the header directive in .htaccess:

Header always set X-Frame-Options "DENY"

TravisDGarrett

8:51 am on Mar 4, 2018 (gmt 0)



If the pages of your site are dynamically generated, you can embed, somewhere, the IP address from which the page was requested. Then, if you find a scraped copy of your site, you may find out from which IP(s) the scraper operated. Of course they will likely use different IPs, or hide behind Cloudflare, AWS, etc., but it might still help a bit.

I wouldn't display the IP in plain text, because scrapers may search for and replace such text, but you can, for example, remove the dots, or use a function like ip2long in PHP.
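For example, a minimal sketch of the ip2long approach (the function names are my own): encode the visitor's address as a bare integer that can be hidden in the markup, and decode it again when you spot it on a scraped copy.

```php
<?php
// Encode a dotted-quad IPv4 address as a plain integer "watermark"
function ip_watermark(string $ip): string {
    return sprintf('%u', ip2long($ip)); // %u keeps it unsigned on 32-bit builds
}

// Recover the original address from a watermark found on a scraped page
function ip_from_watermark(string $mark): string {
    return long2ip((int) $mark);
}

echo ip_watermark('203.0.113.9') . "\n";     // prints 3405803785
echo ip_from_watermark('3405803785') . "\n"; // prints 203.0.113.9
```

Note this only covers IPv4; for IPv6 you would need `inet_pton`/`inet_ntop` and a longer encoding.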