
Forum Moderators: Robert Charlton & goodroi

Our website is being scraped. Does it affect our rankings?

     
12:31 pm on Mar 2, 2018 (gmt 0)

Junior Member

10+ Year Member

joined:Sept 14, 2004
posts: 45
votes: 0


Hello!

Another site is scraping our entire site (content, images, URLs), changing only the company name and replacing our brand with theirs.
The scraping site is not accessible to visitors, only to bots (I suspect it has a robots.txt file that is blocking agents).
For many long-tail keywords it ranks first on Google, while our site is nowhere to be found.
We filed reports with the domain registrar, the hosting company, and Google, but have had no response yet.
Looking further into it, we have discovered dozens of other (competitor) sites that have copied entire paragraphs from our site.

I wonder how harmful this is to our site. Are there ways to prevent other sites from scraping our content?

Thank you!
7:53 pm on Mar 2, 2018 (gmt 0)

Moderator from US 

WebmasterWorld Administrator keyplyr is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Sept 26, 2001
posts:11815
votes: 746


Hello goodbyedee,

Scraping is indeed a huge problem that affects almost everyone's digital properties. It devalues our branding and unique content, steals our traffic, takes our rankings, and cuts into our revenue.

If you've followed the proper procedure in filing Cease & Desist notifications with the respective agencies, all you can do is wait and see how that progresses. Keep after them.

Proactively, there are things you can do, but it takes consistent diligence:

If the scraping is done manually (someone with a browser simply downloading and copy-pasting), there's no way to block them beforehand, but if you can identify their IP address, they can be blocked.

If the scraping is done by automated methods (bots, scripts, etc.), then there are several things you can do to stop the scraping from happening again.
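For example, once you've identified the offending addresses, a minimal site-wide block in PHP might look like this (a sketch only; the IPs below are placeholders from documentation ranges, so substitute the ones you actually find in your logs):

<?php
// Deny requests from known scraper IPs.
$blocked_ips = array('203.0.113.45', '198.51.100.7'); // placeholders

if (in_array($_SERVER['REMOTE_ADDR'], $blocked_ips, true)) {
    header('HTTP/1.1 403 Forbidden');
    exit('Access denied.');
}
?>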

Here are a few helpful links:

Search Engine Spider & User Agent ID Forum [webmasterworld.com]

Server Farm IP Ranges [webmasterworld.com]

Blocking Methods [webmasterworld.com]

- - -
9:29 pm on Mar 2, 2018 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member lucy24 is a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month

joined:Apr 9, 2011
posts:14803
votes: 640


I suspect it has a robots.txt file that is blocking agents

That's not how robots.txt works.
11:05 pm on Mar 2, 2018 (gmt 0)

Preferred Member

10+ Year Member Top Contributors Of The Month

joined:Dec 12, 2004
posts:634
votes: 7


I recently fought with such a scraper. They were 'proxying' our website in real time, changing only the brand, the logo and the domain (in the content). They were using the Tor network, in addition to some datacenters, so it was impossible to block their crawler. Their server was behind Cloudflare, so I couldn't take it down, and Cloudflare didn't take any action. And since their website is only accessible to search engine crawlers, Google etc. didn't take any action either -- they only saw a blank page.

First of all, create a new disposable Google account, add their domain to your new Google Search Console account, and verify it using the 'HTML file' method (upload the verification file to your own domain; since they mirror everything, it will be served from theirs too). If successful, you can remove the infringing domain from search results.

Second, and more technical: you can create a honeypot folder, block it in robots.txt, and ping a random file in that folder via the infringing domain so you can catch every single bot IP they are using. For me, it was more than 3,000 IPs, continuously changing.
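As a rough sketch of the trap script (assuming PHP; the folder and file names here are hypothetical):

<?php
// /honeypot/trap.php - a URL no ordinary visitor would ever request.
// Request it yourself *through the infringing domain*; whatever IP
// then hits this script is one of the scraper's bots.
$line = date('c') . ' ' . $_SERVER['REMOTE_ADDR'] . ' '
      . (isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '-')
      . "\n";
file_put_contents(__DIR__ . '/trap.log', $line, FILE_APPEND | LOCK_EX);
?>

Each address collected in trap.log can then go straight into your block list.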
10:30 pm on Mar 3, 2018 (gmt 0)

Senior Member

WebmasterWorld Senior Member 5+ Year Member Top Contributors Of The Month

joined:Sept 14, 2011
posts:1045
votes: 132


First of all, create a new disposable Google account, add their domain to your new Google Search Console account, and verify it using the 'HTML file' method (upload the verification file to your own domain; since they mirror everything, it will be served from theirs too). If successful, you can remove the infringing domain from search results.


Awesome!
10:34 pm on Mar 3, 2018 (gmt 0)

Senior Member

WebmasterWorld Senior Member 5+ Year Member Top Contributors Of The Month

joined:Sept 14, 2011
posts:1045
votes: 132


Second, and more technical: you can create a honeypot folder, block it in robots.txt, and ping a random file in that folder via the infringing domain so you can catch every single bot IP they are using. For me, it was more than 3,000 IPs, continuously changing.


Blocking it in robots.txt kind of gives away the honeypot, but you can reverse/forward-DNS verify Googlebot and serve it a different robots.txt that contains the honeypot, while the attacker gets a robots.txt without it.
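A sketch of that verification in PHP (the robots.txt file names are hypothetical, and this assumes a rewrite rule routes robots.txt requests to the script):

<?php
// Verify a claimed Googlebot with reverse + forward DNS:
// 1) reverse-resolve the IP; real Googlebot hosts end in
//    googlebot.com or google.com;
// 2) forward-resolve that host and confirm it maps back to the
//    same IP (reverse DNS alone can be spoofed).
function is_real_googlebot($ip) {
    $host = gethostbyaddr($ip);
    if ($host === false || !preg_match('/\.(googlebot|google)\.com$/', $host)) {
        return false;
    }
    return gethostbyname($host) === $ip;
}

header('Content-Type: text/plain');
if (is_real_googlebot($_SERVER['REMOTE_ADDR'])) {
    readfile('robots-with-honeypot.txt');  // hypothetical file name
} else {
    readfile('robots-public.txt');         // hypothetical file name
}
?>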
10:38 pm on Mar 3, 2018 (gmt 0)

Senior Member

WebmasterWorld Senior Member 5+ Year Member Top Contributors Of The Month

joined:Sept 14, 2011
posts:1045
votes: 132


Oh, in answer to the OP's question: if you are affected by scrapers, it's normally an indication of more serious problems on the site; in most cases, just ignore them.
2:01 am on Mar 4, 2018 (gmt 0)

Preferred Member

10+ Year Member Top Contributors Of The Month

joined:Dec 12, 2004
posts:634
votes: 7


Blocking it in robots.txt kind of gives away the honeypot, but you can reverse/forward-DNS verify Googlebot and serve it a different robots.txt that contains the honeypot, while the attacker gets a robots.txt without it.


Maybe honeypot is not the right word to describe it. It's just a script that no one else would visit, so you can trigger it yourself via the infringing domain continuously to detect and block their crawler IPs. It's a trap, but you have to poke it yourself.
3:05 am on Mar 4, 2018 (gmt 0)

Moderator from US 

WebmasterWorld Administrator keyplyr is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Sept 26, 2001
posts:11815
votes: 746


It is important to determine *how* the scraping is being done. It may not actually be scraping at all, but hijacking.

If the hijacking is done by using a proxy (an A record hijack), you can block direct access with a PHP snippet in your site-wide header:

<?php
// Serve the page only when the request's host name matches one of
// our own domains; proxy (A record) hijacks arrive under another name.
$servername = $_SERVER['SERVER_NAME'];
if (!in_array($servername, array('your-domain.com', 'www.your-domain.com'))) {
    die("Direct IP access not allowed!");
}
?>
More info: [serverfault.com...]


If the hijacking is done by iframing your content, I recommend both of these methods to block iframes:

1.) a simple script:

<script type="text/javascript">
// Frame-buster: if this page has been loaded inside a frame,
// replace the framing page with this page itself.
if (window.top !== window.self) {
    window.top.location.href = window.location.href;
}
</script>

2.) and the header directive in .htaccess:
 
Header always set X-Frame-Options "DENY"
8:51 am on Mar 4, 2018 (gmt 0)

Junior Member

joined:Feb 22, 2018
posts:146
votes: 22


If the pages of your site are dynamically generated, you can embed, somewhere in the page, the IP address from which it was requested. Then, if you find a scraped copy of your site, you may discover from which IP(s) the scraper operated. Of course, they may well use different IPs, or hide behind Cloudflare, AWS, etc., but it can still help.

I wouldn't display the IP in plain text, because scrapers may search-and-replace such text; instead, for example, you can remove the dots, or use a function like ip2long() in PHP.
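A minimal sketch of that idea (the HTML-comment disguise is just one option; note that ip2long() handles IPv4 only):

<?php
// Embed the requesting IP in the page as an integer so a scraper's
// search-and-replace won't spot it. long2ip() recovers the address
// later if you find a scraped copy.
$tag = ip2long($_SERVER['REMOTE_ADDR']);  // IPv4 only; false for IPv6
if ($tag !== false) {
    echo "<!-- build " . $tag . " -->\n"; // looks like an innocuous comment
}
?>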
 
