Forum Moderators: phranque


Blocking Unwanted Traffic


jc2021

6:23 am on Mar 21, 2021 (gmt 0)

5+ Year Member Top Contributors Of The Month



This discussion was split off from a Content Management discussion and moved to the Apache forum, where actual blocking methods are welcome.


I manage a news website, and someone running an RSS scraper site is copying all of my articles without giving credit, while crediting everyone else he copies.

So clearly a big hater.

Aside from contacting the host, what else can be done?



[edited by: not2easy at 2:42 am (utc) on Mar 22, 2021]
[edit reason] Cleanup discussion split, moved thread [/edit]

not2easy

1:01 pm on Mar 21, 2021 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



You could block their access to your site. An RSS feed is generally intended for sharing partial articles. If you are not creating the RSS feed they are using, then check your logs to find the UA and block it.
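To illustrate what "check your logs to find the UA" can look like in practice, here is a minimal shell sketch. The log file name, IP addresses, and the "FeedScraper/1.0" name are all invented for demonstration; run the last command against your real access log (Apache "combined" format assumed):

```shell
# Sketch only: ua_sample.log stands in for your real access log, and
# "FeedScraper/1.0" is an invented scraper name.
cat > ua_sample.log <<'EOF'
203.0.113.7 - - [21/Mar/2021:06:23:01 +0000] "GET /feed/ HTTP/1.1" 200 5120 "-" "FeedScraper/1.0"
203.0.113.7 - - [21/Mar/2021:06:23:02 +0000] "GET /article-1 HTTP/1.1" 200 9000 "-" "FeedScraper/1.0"
198.51.100.9 - - [21/Mar/2021:06:25:10 +0000] "GET / HTTP/1.1" 200 7000 "-" "Mozilla/5.0 (Windows NT 10.0)"
EOF

# In the combined log format the user agent is the 6th
# double-quote-delimited field; count requests per UA.
awk -F'"' '{print $6}' ua_sample.log | sort | uniq -c | sort -rn
```

The highest-count unfamiliar UA at the top of that list is a good first candidate for blocking.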

jc2021

7:54 pm on Mar 21, 2021 (gmt 0)

5+ Year Member Top Contributors Of The Month



If you are not creating the RSS feed they are using, then check your logs to find the UA and block it.


How can this be done?

not2easy

8:32 pm on Mar 21, 2021 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



Here's an oooold example to get the general idea: [webmasterworld.com...] - warning - don't copy that old stuff, please. It is just an example and most of those old UAs are long gone.

This method assumes you are on an Apache server, though that example goes back to Apache 2, so you would want to read up on more current information. How do you know what the UA is? Become familiar with your access logs and the various ways to determine "who did what, from where and when".

You can look for newer discussions by searching (upper right corner for desktop, or mobile menu) on WebmasterWorld for a common element of the method:
RewriteCond %{HTTP_USER_AGENT}


jc2021

8:48 pm on Mar 21, 2021 (gmt 0)

5+ Year Member Top Contributors Of The Month



Thank you sir.

JorgeV

10:58 pm on Mar 21, 2021 (gmt 0)

WebmasterWorld Senior Member 5+ Year Member Top Contributors Of The Month



Hello,

Blocking the IP ranges of ASNs owned by datacenters will already get rid of most of the problem. Of course, they can still scrape content from their own home or office IP, but that is less convenient and harder to automate.
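As a sketch of what that can look like on Apache 2.4 with mod_authz_core: the CIDR blocks below are documentation ranges, not real datacenter ranges, so substitute the ranges you look up for the offending ASN (the regional registries publish ASN-to-range data).

```apache
<RequireAll>
    # Allow everyone by default...
    Require all granted
    # ...except these (placeholder) datacenter ranges.
    Require not ip 192.0.2.0/24
    Require not ip 2001:db8::/32
</RequireAll>
```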

getcooking

1:40 am on Mar 22, 2021 (gmt 0)

10+ Year Member



I routinely deal with this (weekly?). Here are the things I recommend:

1) if you are not already tracking every IP address that accesses your site, you need to start. It's ridiculously easy to see natural patterns of IP traffic versus unnatural. Even easier to block that unnatural traffic so they can't automatically steal your content.
2) do not publish your content via rss or social media until Google has crawled it.
3) if you do find stolen content, file a DMCA with Google first (assuming it's actually stolen verbatim - simply rewritten is a different monster). 99% of the time if you've done #2 above, they will follow through accordingly. This removes any organic traffic YOUR content is receiving on the offender's website.
4) file a DMCA with the webhost. 80% of the time they will comply without question. If they do question it, I provide additional proof of ownership as needed (receipts for paid writers, copies of images in photoshoot, etc) and remind them that they are party to a lawsuit should it be required. 99% of the time, they do the right thing if you follow through.
5) I almost never contact the website owner unless I'm sending them a bill. My website has a published policy about using our content and the costs and penalties involved. About 10% of them pay from just an invoice. Depending on what it is, sometimes I send it to a collections company. The remaining times I do steps 3 and 4 if I think I can't collect on something substantial.
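Step 1 above, tracking every IP that accesses your site, can be sketched with standard shell tools. The log file and addresses here are invented for demonstration; run the last command against your own access log:

```shell
# Sketch only: ip_sample.log stands in for your real access log.
cat > ip_sample.log <<'EOF'
203.0.113.7 - - [22/Mar/2021:01:40:01 +0000] "GET /a HTTP/1.1" 200 1000 "-" "Scraper/1.0"
203.0.113.7 - - [22/Mar/2021:01:40:02 +0000] "GET /b HTTP/1.1" 200 1000 "-" "Scraper/1.0"
203.0.113.7 - - [22/Mar/2021:01:40:03 +0000] "GET /c HTTP/1.1" 200 1000 "-" "Scraper/1.0"
198.51.100.9 - - [22/Mar/2021:01:41:00 +0000] "GET / HTTP/1.1" 200 1000 "-" "Mozilla/5.0"
EOF

# Field 1 of each combined-format line is the client IP; count hits per IP.
awk '{print $1}' ip_sample.log | sort | uniq -c | sort -rn
```

An IP hammering every article in sequence stands out immediately against normal browsing patterns.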

not2easy

3:22 am on Mar 22, 2021 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



Since this discussion is now in the Apache forum, I am adding a more current example for those on an Apache server who don't mind editing their .htaccess file. The format may differ slightly on newer versions of Apache; I have used it for well over 15 years on various Apache versions.

This section should go with any rewrite rules, before your canonical rewrite rules. If you use WordPress, the WP snippet of code goes at the end of the other rewrite rules.

RewriteCond %{HTTP_USER_AGENT} (A6|access|appid|blog) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} (capture|client|crawl|curl) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} (wget|win32|wotbox|wsr) [NC]
RewriteRule .* - [F]

This is not intended to be used as-is; it is a format. As you can see, you can add in 'parts' of UAs, since the entire name is not needed, so long as you do not use portions of UAs that actual human visitors might have, such as
(chr|moz|pera|safa)
within the parentheses.

Copying and pasting this into your .htaccess file will NOT fix any specific problem you are seeing on your site; it is intended only as a helper 'template', and you will need to replace these UA bits with your own for best results.

The text within parentheses is separated by the pipe symbol: | which in this context means 'or'. That [NC,OR] flag at the end of each line means [either upper or lower case] and [see the next line for more]. Note that the last line of UAs does not include the [OR] flag. That [F] flag at the end means 'forbidden access' so those folks will get a 403 response. It does not prevent them from visiting to ask again but over time they may become discouraged at failures.
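One way to sanity-check a line of the template before deploying it: the [NC] alternation behaves like a case-insensitive regex, so you can emulate it locally with grep -Ei. A sketch, using one line from the template above and made-up UA strings:

```shell
# Emulate RewriteCond %{HTTP_USER_AGENT} (capture|client|crawl|curl) [NC]
# by testing sample UA strings against the same case-insensitive pattern.
PATTERN='capture|client|crawl|curl'
for ua in "curl/7.68.0" "Mozilla/5.0 (Windows NT 10.0)" "WebCrawler/2.1"; do
  if printf '%s' "$ua" | grep -Eiq "$PATTERN"; then
    echo "BLOCK $ua"
  else
    echo "ALLOW $ua"
  fi
done
```

Here "curl/7.68.0" and "WebCrawler/2.1" match (on curl and crawl) and would get the 403, while the Mozilla string passes through untouched.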

I'm sure I've left out something, so take it, adapt it, ask questions.

wilderness

4:37 pm on Mar 22, 2021 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



jc2021,
check your logs to find the UA


They're actually named 'Raw Access Logs'.
Check your host's cPanel for options to view them, or to turn them on.
If you're not aware of what these are, then you've been unable to confirm either the IP identity or the UA (user agent) identity of the person copying your feeds.
Interpreting raw logs is a learned process.

Kendo

10:17 pm on Mar 22, 2021 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I'm sure I've left out something

A good article that hits the nail on the head. My own user-agent checker lists about 300 offenders that include everything from Adobe PDF to Xenu and SEO spies.

TorontoBoy

5:15 pm on Mar 23, 2021 (gmt 0)

5+ Year Member Top Contributors Of The Month



I like RSS as a mechanism to keep current with a specific site. But some rare RSS bots scrape absolutely everything, sometimes multiple times a day. You can see their traffic in your raw access log; then ban them by their IP range or by the bot's user agent name. I had to ban a lot, and you should as well.
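Banning by IP range and by bot name, as described above, can be combined in one mod_rewrite block. A sketch only: the 203.0.113.x range and the "feedscraper" fragment are placeholders for whatever your raw access log actually shows.

```apache
# Match the offending /24 by regex (mod_rewrite has no CIDR syntax)...
RewriteCond %{REMOTE_ADDR} ^203\.0\.113\. [OR]
# ...or the bot's UA fragment, case-insensitively.
RewriteCond %{HTTP_USER_AGENT} feedscraper [NC]
RewriteRule .* - [F]
```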