
Forum Moderators: phranque

Are scrapers a fact of life?

No way to completely prevent them

11:01 pm on Nov 29, 2018 (gmt 0)

Full Member

10+ Year Member Top Contributors Of The Month

joined:Dec 7, 2005
posts:281
votes: 47


It seems that the more popular your website gets, the more it will be under attack by scrapers.

I've come to the realization that it's impossible and just not worth my time to completely prevent scrapers from crawling my site.

I've decided that as long as the server is running smoothly, I'd rather spend my time improving the site than stopping each and every one of them which, as I said, is near-impossible to do.

The only problem I have is when some extremely inconsiderate, impatient and greedy scraper decides to hammer the site with multi-threaded requests, using multiple IPs, thereby slowing the server down for everyone else, almost akin to a DoS attack. Those are the scrapers that literally make me lose sleep, and I have to divert my time from improving the site to dealing with these unscrupulous cretins.

In closing, I would like to say to the scrapers out there: if you're going to scrape, at least do it in a somewhat "ethical" way (if you can call scraping ethical) by not overloading your victim's website.
11:21 pm on Nov 29, 2018 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member lucy24 is a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month

joined:Apr 9, 2011
posts:15305
votes: 703


I've denied a couple user-agents by name because they are ultimately triggered by humans who are too stupid to understand that a single click on their end translates to thousands of actions on my--that is, my site's--end. If, instead, they slowly visited page-by-page and at each step told the browser to save the whole thing ... I'd never even know they were scraping.
2:19 am on Nov 30, 2018 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member tangor is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Nov 29, 2005
posts:8773
votes: 706


Then there's the other side of proactive ... whitelisting. Allowing ONLY those you select and banning all others. Less stress.

YOU MIGHT inadvertently deny some which might be of benefit, but it is easier to add to your whitelist than it is to play whack-a-mole with a blacklist.

YMMV
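
To make the whitelist idea concrete, here is a minimal sketch as Python WSGI middleware. The agent strings, port and responses are illustrative assumptions rather than a recommended setup, and a serious whitelist would also verify source IPs or reverse-DNS hostnames instead of trusting the user-agent string alone.

from wsgiref.simple_server import make_server

# Example entries only: build your own list from your logs. User-agents are
# trivially forged, so real whitelisting also checks the requesting IP/hostname.
ALLOWED_AGENTS = ("Googlebot", "bingbot")

def whitelist_middleware(app):
    def wrapper(environ, start_response):
        ua = environ.get("HTTP_USER_AGENT", "")
        if not any(token in ua for token in ALLOWED_AGENTS):
            # Not on the whitelist: refuse the request outright.
            start_response("403 Forbidden", [("Content-Type", "text/plain")])
            return [b"Forbidden\n"]
        return app(environ, start_response)
    return wrapper

def site(environ, start_response):
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"Hello\n"]

if __name__ == "__main__":
    make_server("", 8000, whitelist_middleware(site)).serve_forever()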
2:31 am on Nov 30, 2018 (gmt 0)

Administrator

WebmasterWorld Administrator phranque is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Aug 10, 2004
posts:11465
votes: 174


there are several effective Blocking Methods [webmasterworld.com] that don't involve maintaining a list.
2:34 am on Nov 30, 2018 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member lucy24 is a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month

joined:Apr 9, 2011
posts:15305
votes: 703


that don't involve maintaining a list.
There's always a list. It might be a list of IP addresses, or it might be a list of User Agents, or it might be a list of header fields and values, or it might be a list of behaviors. Any time you're matching something against something else, there's ultimately a list involved.
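
As a toy illustration of that point (the threshold, window and IP are made-up values), even a purely behavioural rule boils down to keeping a list, in this case a table of recent request timestamps per IP:

import time
from collections import defaultdict, deque

WINDOW_SECONDS = 60    # look-back window (arbitrary example)
MAX_REQUESTS = 120     # allowed hits per window (arbitrary example)

recent = defaultdict(deque)    # the "list": ip -> timestamps of recent hits

def allow(ip, now=None):
    now = time.time() if now is None else now
    hits = recent[ip]
    while hits and now - hits[0] > WINDOW_SECONDS:
        hits.popleft()               # forget hits outside the window
    hits.append(now)
    return len(hits) <= MAX_REQUESTS # over the threshold -> block

# The 121st request inside one minute gets refused.
for i in range(121):
    ok = allow("203.0.113.9", now=1000.0 + i * 0.1)
print(ok)    # False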
2:49 am on Nov 30, 2018 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member tangor is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Nov 29, 2005
posts:8773
votes: 706


Too true! Only decision involved is "long list" or "short list". :)
2:58 am on Nov 30, 2018 (gmt 0)

Senior Member from GB 

WebmasterWorld Senior Member brotherhood_of_lan is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Jan 30, 2002
posts:4959
votes: 38


Steady with the whitelisting, as there's quite a resurgence of alternative search engines nowadays, especially in Europe, after a bit of a lull since Yahoo bought up the alternatives. It's probably the easier proactive option, as most (all?) of the bots you'd consider whitelisting come from definite hostnames and ranges.

I agree with your conclusion, OP. The guys on this site will surely help you eliminate the 'background noise' of hack attempts, general-purpose stuff and more; they're on top of the subject.

If anyone is determined to scrape your site, they can. Let's put it in simple terms: you can get one IPv4 address for $1/month. They rent 30 and scrape you at one page per second while rotating them, which works out to one fetch per IP every half minute. Ban it? They can easily look like human visitors with a headless browser, and you get into a recursive nightmare trying to decide otherwise. Nowadays it's simply not expensive or difficult to get a huge amount of IP diversity, and combating it involves a lot of maintenance, or trusting someone else to maintain it for you.
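
A back-of-envelope sketch of that arithmetic, taking the post's own $1/month figure as given:

# Back-of-envelope numbers for the rotation example above; the $1/month price
# is the figure quoted in the post, everything else follows from it.
ips = 30                 # rented IPv4 addresses
pages_per_second = 1     # total crawl rate against the target site
cost_per_ip = 1          # dollars per month, as quoted above

seconds_between_hits_per_ip = ips / pages_per_second
print(f"each IP fetches one page every {seconds_between_hits_per_ip:.0f} s "
      f"(~${ips * cost_per_ip}/month total)")   # 30 s per IP, ~$30/month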

There are sophisticated anti-scraping solutions, but they become counter-productive after a while. My own view is to focus on what you intend to build.
3:38 am on Nov 30, 2018 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member tangor is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Nov 29, 2005
posts:8773
votes: 706


As with anything we do ... test, test, test! And test for at LEAST 90 days, if not 180 days. It takes time to see whether any given change actually makes a difference.

Cultivate patience!
11:09 am on Nov 30, 2018 (gmt 0)

Senior Member from GB 

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month

joined:Nov 16, 2005
posts:2830
votes: 143


There are ways of throttling anyone who hammers your server too hard. Hiawatha ( [hiawatha-webserver.org...] ) has per-IP throttling, or you can use fail2ban with any Linux server.

You might also find that a honeypot link, one that is disallowed in robots.txt and bans any IP that follows it, may work.
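
A rough sketch of that honeypot idea as Python WSGI middleware. The /secret-trap path and the banned_ips.txt file are made-up names; the same path would be disallowed in robots.txt, so only rule-ignoring clients ever request it.

# Hypothetical honeypot sketch: /secret-trap is disallowed in robots.txt and
# linked invisibly from pages; any client that requests it anyway gets its IP
# recorded for blocking. Path and file names are illustrative only.
BANNED = set()

def honeypot_middleware(app, trap_path="/secret-trap"):
    def wrapper(environ, start_response):
        ip = environ.get("REMOTE_ADDR", "")
        if environ.get("PATH_INFO") == trap_path:
            BANNED.add(ip)                       # robots.txt was ignored: ban
            with open("banned_ips.txt", "a") as fh:
                fh.write(ip + "\n")              # feed this to fail2ban/firewall
        if ip in BANNED:
            start_response("403 Forbidden", [("Content-Type", "text/plain")])
            return [b"Forbidden\n"]
        return app(environ, start_response)
    return wrapper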
11:43 am on Nov 30, 2018 (gmt 0)

Administrator from GB 

WebmasterWorld Administrator engine is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:May 9, 2000
posts:25722
votes: 821


Banning and blocking are just one way. A DMCA notice is another, and it requires less intensive effort.
The other thing to bear in mind is to use the scraper as a way of branding your own site: take advantage where you can.
3:57 pm on Nov 30, 2018 (gmt 0)

Junior Member

Top Contributors Of The Month

joined:May 1, 2018
posts: 82
votes: 10


Hi,

I hear you on that; I have been trying to stop them for almost 1.5 years.

Have you looked into the type of IPs they are using?

Are they using hacked machines (less likely) or cloud services?

There are hundreds of cloud servers these days, like quickweb, kvchosting.com, ioflood.com, servint.net, 24shells, amanah.com, yesup.com, pair, datayard.us, hostdrive, king-servers.com, greenhousedata.com, hostrocket, inmotionhosting.com, onr.com, versaweb, a2hosting, gogrid, servepath, us.net, forked.net, joesdatacenter, vividhosting, interserver and many many more.

I thought the cloud providers had some type of third-party anti-abuse mechanism to monitor abusive traffic? (X outbound requests to destination Y * the number of their IPs requesting, over a period of Z = BAN!)

I think the easiest way to battle this is to make a server-side stats script that shows only the IPs that are not on your whitelist. I have since created one, and I can now block all of the previous day's IPs in a few minutes.
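
One possible shape for such a report, assuming an Apache-style access log; the log file name and whitelist entries are placeholders:

# Sketch of a "yesterday's strangers" report: hits per IP from an Apache-style
# access log, skipping whitelisted addresses.
from collections import Counter

WHITELIST = {"127.0.0.1", "203.0.113.50"}      # known-good IPs (examples)

def unknown_ips(log_path="access.log"):
    hits = Counter()
    with open(log_path) as fh:
        for line in fh:
            ip = line.split(" ", 1)[0]          # first field of common log format
            if ip not in WHITELIST:
                hits[ip] += 1
    return hits

if __name__ == "__main__":
    for ip, count in unknown_ips().most_common(20):
        print(f"{count:6d}  {ip}")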

When will places like Google finally block cloud servers from actions like performing searches? If they weren't able to make their main chunk of money, maybe they wouldn't have the resources to bot small users like us on a daily basis. It's one of the key selling points of these cloud proxy providers these days. Here are the two I've found that are botting me:

www.proxymillion.com
www.cloudproxies.com
4:46 pm on Nov 30, 2018 (gmt 0)

Administrator from GB 

WebmasterWorld Administrator engine is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:May 9, 2000
posts:25722
votes: 821


@Steven29
I agree, there are so many cloud servers out there.
Not all of it gets anywhere, and anything that does could get a DMCA notice, or, as I mentioned, you can use the scraped site for your own branding.
2:24 am on Dec 1, 2018 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member tangor is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Nov 29, 2005
posts:8773
votes: 706


use the scraped site for your own branding.


That works if they KEEP scraping ... However, most times one scrape is all they want, and they won't UPDATE it with your "branded" stuff.

For that to work, you have to BRAND it BEFORE they strip it for lazy nefarious reasons...

A whitelist keeps 'em out before they get started ... but RESEARCH that, as whitelisting means "these and only these I will let in", and to do that correctly you have to ALSO look at agents, etc.

Lock it down ... or play whack-a-mole. Forever.
1:29 am on Dec 3, 2018 (gmt 0)

Junior Member

Top Contributors Of The Month

joined:Sept 8, 2016
posts:59
votes: 0


I'm blocking about 400 million IPv4 addresses (in my er3-lite router) from being able to access my company website. I block Chinese IP networks by the /16, and Amazon's entire 54.0.0.0/8. I block based on the AS number, grabbing all the listed subnets when specific countries are involved, or specific entities (OVH, Digital Ocean, etc). I'm doing this big time for India and Brazil, plus Russia, Ukraine and Eastern Europe, Malaysia, the Philippines, and pretty much all of South America.

We sell scientific research devices, so our market ranges over the US/Canada, the UK and western Europe, Scandinavia, Australia, Japan and Korea. We would like to sell to Russia, but that market is a black box for us (maybe they do physics and chemistry research, but probably not much pharma/biomed stuff). Same with India. We have sold to China, but it's very rare, so I have no problem blocking Chinese IPs, including Baidu bots. I'm also blocking Yandex bots.

I see lots of Tor exit nodes, cloud and other hosters, and VPNs (I block them all upon discovery, no matter what country they're in). I see questionable hits from European and US/Canada residential ISPs, but I'm not touching those at the moment. Naturally I don't touch IPs/networks assigned to EDUs (except for blocking /24s that do IP scans for "educational" reasons).
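
For anyone wanting to experiment with range-based blocking at the application layer rather than in the router, here is a small sketch using Python's standard ipaddress module; the networks listed are just examples echoing the ones mentioned above, not a recommendation.

# Range-based blocking with only the standard library.
import ipaddress

BLOCKED_NETS = [ipaddress.ip_network(n) for n in (
    "54.0.0.0/8",          # the Amazon /8 mentioned above
    "198.51.100.0/24",     # placeholder for a country/AS-derived range
)]

def is_blocked(ip):
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in BLOCKED_NETS)

print(is_blocked("54.1.2.3"))    # True
print(is_blocked("8.8.8.8"))     # False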

Edit: For the past few years I've been blocking upwards of 95% of all in-use routable IPv4 on our mail server. This has really cut down the spam. Our "contact us" web page explains that we do heavy mail filtering and that, if they haven't contacted us before, they should try our Gmail address first. We switched our main phone and fax numbers to VoIP (voip.ms) a year ago, and I just love being able to filter entire area codes (and VoIP is SO CHEAP! We spend more on ground coffee and cream in our office kitchen now than on telco service!). So yea, lots of filtering and blocking going on.
4:42 pm on Dec 5, 2018 (gmt 0)

Junior Member from US 

5+ Year Member

joined:Dec 23, 2008
posts:159
votes: 7


I block .... Amazon's entire 54.0.0.0/8
... We sell scientific research devices...

Really? Your company is blocking Merck?

NetRange: 54.0.0.0 - 54.35.255.255
CIDR: 54.0.0.0/11, 54.32.0.0/14
Organization: Merck and Co., Inc.
6:06 pm on Dec 5, 2018 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member lucy24 is a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month

joined:Apr 9, 2011
posts:15305
votes: 703


Your company is blocking Merck?
Have you ever had a visit from Merck? Or Halliburton or DuPont or Eli Lilly or any of the other companies that started out controlling a whole /8 before discovering they hardly need any of it--and certainly not externally--so it's more profitable to sell to AWS?

No, I don't block /8 slabs wholesale either. But there's a reason they're all getting sold off.
6:20 pm on Dec 5, 2018 (gmt 0)

Junior Member

Top Contributors Of The Month

joined:May 1, 2018
posts: 82
votes: 10


I also block many of those networks. Places like Semrush and lots of others. Everybody says they can benefit you, but I fail to see how. I'm not going to let someone send thousands of requests to me every single day to ... benefit me? It seems the only ones benefiting are them and the competitor that now has insight into my data. No thanks.
7:44 pm on Dec 5, 2018 (gmt 0)

Senior Member from CA 

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month

joined:Nov 25, 2003
posts:1193
votes: 331


As implied above, the very first step is to decide which bots (by category or individually) are considered beneficial, neutral, or harmful given your business model and requirements.

The second step is to decide which to allow or deny in theory. I say 'in theory' because the third step depends on just how deep one wants to go down this particular rabbit hole.

In broad strokes the bot war is rather like each step only getting one half of the remaining way there - one is always closer but never all the way. The first 50% is generally easy and simple; however, each subsequent 50% of the remainder is exponentially harder.

Because I find it fun, and because I can, I have quite extensive real-time bot defences beyond the interest and/or capabilities of most webdevs - and I know, because of content identifiers, that my stuff is still successfully scraped, albeit far less than otherwise.
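
As an aside, here is one hedged guess at what a 'content identifier' could look like in practice (not necessarily the poster's method): a per-request token hidden in the served HTML so that a copy which surfaces elsewhere can be traced back to the visit that fetched it.

# Hypothetical content identifier: a per-request HMAC token hidden in the HTML.
# Secret, comment format and the idea of logging (stamp, token) pairs
# server-side are all illustrative choices.
import hashlib
import hmac
import time

SECRET = b"change-me"    # placeholder secret

def tag_page(html, client_ip):
    stamp = f"{client_ip}|{int(time.time())}"
    token = hmac.new(SECRET, stamp.encode(), hashlib.sha256).hexdigest()[:16]
    # In practice you would also log (stamp, token) so a match can be proven later.
    return html.replace("</body>", f"<!-- cid:{token} --></body>")

print(tag_page("<html><body>widget specs</body></html>", "198.51.100.7"))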

Which is where, as previously mentioned, one can utilise various regulatory weapons such as lawyers' letters, DMCA notices, small claims court, etc., depending on circumstances. Do be aware that these options can be time-consuming and expensive.

It seriously helps one's case when going the legal route if one has registered one's site/content rather than relying solely on simple publication copyright, as registered copyright typically allows going for damages, not just a takedown.

A complex subject. As, it seems, is just about everything about doing business online.
2:25 pm on Dec 6, 2018 (gmt 0)

Junior Member

Top Contributors Of The Month

joined:Sept 8, 2016
posts:59
votes: 0


> Really? Your company is blocking Merck?

Merck has hit us, and they've bought from us. Same with Eli Lilly. A lot of those companies have merged over the past 10 or 15 years. The thing is, even if they still have huge CIDR assignments, their hits never seem to come from those CIDRs. Same with email. (Well, maybe back during 2000-2005 their hits and emails came from there, well before I started doing IP-based web blocking. Not any more.)