Forum Moderators: phranque

Huge (bad) bots activity

     
5:20 pm on Aug 2, 2019 (gmt 0)

Senior Member

WebmasterWorld Senior Member Top Contributors Of The Month

joined:Nov 13, 2016
posts:1194
votes: 285


Hi,

I wasn't sure where to post this, but I am seeing an extremely high level of bot activity today. I mean the bots that pretend to be humans and send referer URLs (usually google.com).

I am used to dealing with this kind of bad bot, and I think I am pretty good at blocking them. But I always keep an eye on what my script is blocking, and today there is a real peak of activity, from IP ranges I've never seen before, but still caught by my script (because of wrong or missing header fields, awkward reverse DNS, "unnatural" patterns, etc.).

I wonder if scrapers are using July / August, when a lot of people worldwide are on vacation, to run their bots and take all they can before getting caught ...

edit: by peak of activity, I mean 20 times more than usual.
5:54 pm on Aug 2, 2019 (gmt 0)

Administrator from US 

WebmasterWorld Administrator not2easy is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Dec 27, 2006
posts:4504
votes: 347


Are these typical human UAs and IPs? Do they request all resources for a requested page?
but still caught by my script

Does your script show them a 403 page? Does your 403 page offer a means of human contact?
If so, do you see them repeating the request rather than making contact? It would mean pulling a copy of the access logs for some review, but then you can see whether/what action to take.
6:04 pm on Aug 2, 2019 (gmt 0)

Senior Member

WebmasterWorld Senior Member Top Contributors Of The Month

joined:Nov 13, 2016
posts:1194
votes: 285


Yes, human-looking UAs, but often with missing or malformed header fields.

As for IP ranges, the reverse DNS points to hosting companies that I didn't know yet. A lot of requests have no reverse at all, which is the kind of thing that upsets my script :). I looked up some of these IPs without reverse, and they might be compromised servers; several are associated with e-commerce sites worldwide.

When my script suspects a non human visitor, it displays a CAPTCHA-like challenge that I created myself, so real humans can still pass. This lets me constantly evaluate the performance of my script and, if needed, refine my filtering rules too. See, if I were a start-up, I would claim it feeds my artificial intelligence learning system, but it just feeds my own human and natural intelligence :)
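
For anyone who wants to try the same idea, here is a minimal sketch of score-then-challenge filtering. The header names, weights and threshold are all hypothetical choices made for illustration; none of them are the actual rules of the script described above.

```python
# Hypothetical weights and threshold; not the values of any real script.
EXPECTED_HEADERS = ("accept", "accept-language", "accept-encoding")
CHALLENGE_THRESHOLD = 3


def suspicion_score(headers, has_reverse_dns):
    """Return a score for one request; higher means more bot-like."""
    present = {name.lower() for name in headers}
    # One point for every typical browser header that is absent.
    score = sum(1 for name in EXPECTED_HEADERS if name not in present)
    # A missing reverse DNS entry is treated as a stronger signal.
    if not has_reverse_dns:
        score += 2
    return score


def should_challenge(headers, has_reverse_dns):
    """True when the visitor should be shown the challenge page."""
    return suspicion_score(headers, has_reverse_dns) >= CHALLENGE_THRESHOLD
```

Showing a challenge instead of a flat block keeps false positives recoverable, which is what makes it possible to review and tune the rules afterwards.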
6:05 pm on Aug 2, 2019 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member lucy24 is a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month

joined:Apr 9, 2011
posts:15867
votes: 869


I wonder if scrapers are using July / August
Option B is that they've got a long list of sites that they work through, top to bottom, and today it is your turn.

As a crude but useful metric, I glanced at the size of the last few days' log files. Yesterday or the day before (depending on site) did seem to be unusually plump--especially on the http side, on those sites that have both.

Tangential but useful to know: Most malign robots still seem to use http by default. It's especially noticeable on my test site, where http logs are consistently bigger than https logs, up to twice as big, even though https is what gets the human visits--including myself--with all their supporting files.

Are many of your fake referers claiming to come from google.com.hk? I don't generally pay much attention to blocked requests--after all, they're already blocked, so what more do I need to do?--but on closer inspection, almost half of recent blocked requests claiming a google referer* claim to be coming in from Hong Kong.

:: detour to headers for closer look ::

Real: https://google.com.hk/ (https, closing slash)
Fake: http://google.com.hk (http, no slash)

But I digress.


* Thanks to that officially sanctioned misspelling, my fingers got as far as “refererer” before I stopped them.
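
The real/fake difference above is easy to test mechanically. A minimal sketch, assuming the only two signals are the ones observed here (scheme and closing slash); the function name is made up, and the rule is just this one observation, not a general test for Google referers.

```python
from urllib.parse import urlsplit


def looks_like_fake_google_hk_referer(referer):
    """Flag google.com.hk referers that use http or lack the closing slash."""
    parts = urlsplit(referer)
    host = parts.hostname or ""
    if not (host == "google.com.hk" or host.endswith(".google.com.hk")):
        return False  # not a google.com.hk referer, so this rule does not apply
    # The real referer seen above was https with a path of "/".
    return parts.scheme != "https" or parts.path == ""
```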
6:14 pm on Aug 2, 2019 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member lucy24 is a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month

joined:Apr 9, 2011
posts:15867
votes: 869


When my script suspects a non human visitor, it displays a CAPTCHA-like challenge
I take an even simpler approach: forward to a page that says in effect “Whoops! You’ve accidentally replicated the behavior of an undesirable robot” and then there’s a link to the originally requested page on the slim chance that it really was a human. At any given time, there will be a couple of botnets with a predictable pattern of requests ... and also bona fide humans from {country} requesting pages in {directory} where the redirect is an alternative to saying “Past experience tells me you’re too stupid to glance at a search snippet or even at the title of the page, [ they can’t all be using the I Feel Lucky option, can they? ] and I really don't see why I should put my server to the work of sending you up to a hundred image files for a page you'll never even look at.” None of them ever follow the link.
9:36 pm on Aug 2, 2019 (gmt 0)

Junior Member

Top Contributors Of The Month

joined:May 1, 2018
posts: 102
votes: 17


Hi, it doesn't sound like you are asking a question, so I'll throw my 2 cents in on the situation.

This is exactly what I'm experiencing with the fake bots whose reverse IP points to a hosting company. In fact, there are so many of them that I now have over 200 million IPs on my "special" list.

The strange part is... these bots are somehow connected to my competitors.

The reason I know they are connected is that my competitors repost content in the niche immediately after I blog about it. I can wait hours or even days, but rest assured, the second I add something they will be right along to reblog it and share it to Facebook pages with a reach of 200,000.

Collecting these hosting IP addresses and denying them access has made a huge improvement in what is happening, and I can even get some articles past them now.

I would be interested to know whether your Alexa time on site has dropped 80% and whether your other analytics are all showing poorly.

We can block them or captcha them, but even that is probably hurting user experience.

The people doing it in my niche stick out like a sore thumb, and even so it's "ok". That's "SEO": whoever has no morals, can repost content from everybody in the niche the fastest, and bot Facebook. Yeaaa
9:48 pm on Aug 2, 2019 (gmt 0)

Senior Member

WebmasterWorld Senior Member Top Contributors Of The Month

joined:Nov 13, 2016
posts:1194
votes: 285


Most malign robots still seem to use http by default

When you restrict access to the https protocol with TLS 1.2 and 1.3, you are already getting rid of plenty of requests. Sure, there may be some collateral damage for the few real humans still living in the prehistoric era (before year 2000); on the other hand, with the known exploits of SSL and TLS < 1.2, it's safe to reject these requests.
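
As an illustration only (most sites would set this in the web server configuration, and nothing here describes a specific setup), this is roughly what a TLS 1.2 floor looks like with Python's standard `ssl` module. Loading the certificate and wrapping the listening socket are left out.

```python
import ssl

# Server-side context; certificate loading is omitted in this sketch.
context = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)

# Handshakes offering only SSLv3, TLS 1.0 or TLS 1.1 will fail.
context.minimum_version = ssl.TLSVersion.TLSv1_2
```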
11:05 pm on Aug 2, 2019 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member tangor is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Nov 29, 2005
posts:10450
votes: 1087


I have a "If you are a human and think you have been blocked in error, click here to gain access."

Pretty sure only five in the last year have ever done so. So I don't lose any sleep on blocking humans by accident. Either that, or they are so ticked off they wouldn't say boo ... goes both ways, I suppose.
2:42 am on Aug 3, 2019 (gmt 0)

Junior Member

Top Contributors Of The Month

joined:Sept 8, 2016
posts:97
votes: 0


My IP blocking list covers 598 million IPs, of which 595 million were added solely based on hits to my single http site. My https site (an exact mirror of the http site) has been up since last November, and Google started showing https search results for it this past month. I do not redirect from http to https. For the new bots that get through (there are always some every day; I block their entire AS when I discover them), I have been seeing some requests for new files over the past couple of weeks. E.g., if my domain is "mydomain.com", I've seen requests for "mydomain.zip" and "mydomain_backup.zip", plus several other files (database and script files?), as if they're targeting a specific server platform. Just today I ran through the past 6 months of my https logs and added about 3 million IPs (about two dozen AS networks) to my blocking list. So there are bots and junk hitting my https site directly that never showed up in my http logs.
3:46 am on Aug 3, 2019 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member tangor is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Nov 29, 2005
posts:10450
votes: 1087


Aside: I recently embarked on COUNTRY level blocking rather than relying on all the other indicators. There are bad actors everywhere, but some regions seem to be more prone to bad actions than others. On the road to taking out 3 billion if necessary ... my time is too short ... and even if hosting is "kinda sorta cheap" I don't have time for the NOISE of those with NO INTEREST.

Life is too short.

YMMV.
8:25 am on Aug 3, 2019 (gmt 0)

Senior Member

WebmasterWorld Senior Member Top Contributors Of The Month

joined:Nov 13, 2016
posts:1194
votes: 285


Yes, there are countries I am blocking totally. It's too bad, because "maybe" there are legitimate visitors there who are "punished" as collateral damage.

I am also blocking all VPNs (as far as I can detect them). VPNs are not bad in themselves, but they are abused far too often... in any event, this is what the Internet is made of nowadays: good things being abused.
10:43 am on Aug 3, 2019 (gmt 0)

Senior Member

WebmasterWorld Senior Member Top Contributors Of The Month

joined:Nov 13, 2016
posts:1194
votes: 285


Also, I am not sure what to think of requests from IPs with no reverse. As I said, that's a big warning for my script, and it carries a lot of "negative score". Sometimes I wonder whether legitimate requests could come from IPs with no reverse. I assume that all ISPs have reverses. "Good" crawlers have a reverse (except half of Bing's requests). So I am considering simply banning all IPs without a reverse.

edit: by "reverse" in fact, I mean FCrDNS.
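
For reference, a minimal sketch of an FCrDNS check using Python's standard `socket` module. The function name is made up, and the two resolver parameters exist only so the check can be tried without live DNS; a real filter would also want caching and timeouts.

```python
import socket


def fcrdns_ok(ip, reverse=socket.gethostbyaddr, forward=socket.getaddrinfo):
    """Forward-confirmed reverse DNS: the PTR name must resolve back to ip."""
    try:
        hostname = reverse(ip)[0]        # PTR lookup
        infos = forward(hostname, None)  # A/AAAA lookup of that name
    except OSError:
        return False  # no PTR record, or the name does not resolve
    # Each entry is (family, type, proto, canonname, sockaddr);
    # sockaddr[0] is the address as text. IPv6 addresses may come back
    # in a different textual form, which this plain comparison ignores.
    return any(info[4][0] == ip for info in infos)
```

An IP fails either because it has no reverse at all or because the reverse name does not point back to it; both cases return False here.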
3:34 pm on Aug 3, 2019 (gmt 0)

Senior Member from GB 

WebmasterWorld Senior Member graeme_p is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Nov 16, 2005
posts: 3001
votes: 205


Either that, or they are so ticked off they wouldn't say boo ... goes both ways, I suppose.


Most people are used to reCAPTCHA (I think it's a horrible solution), so having to click one link is hardly noticeable.
3:34 am on Aug 5, 2019 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member tangor is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Nov 29, 2005
posts:10450
votes: 1087


I give the humans a shot ... bots not so much. :)

The IPs that are ticking me off at the moment are from the bad-actor countries that send 706 wp or php requests (it is usually the same list) at one time ... obviously looking to break in.

I don't do wp or php, and I 403 'em all. Still ... it makes my logs look funny, as 403, 404, 406 and 408 responses make up about 80% of the log these days... That noise is just increasing.
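
A quick way to put a number on that share, as a minimal sketch: it assumes the Apache/nginx "combined" log format, where the status code follows the quoted request line, and the function name is made up.

```python
import re
from collections import Counter

# Assumes lines like: ... "GET /path HTTP/1.1" 403 512 "-" "UA"
STATUS_RE = re.compile(r'" (\d{3}) ')


def status_share(lines, statuses=("403", "404", "406", "408")):
    """Return the fraction of parsed log lines whose status is in statuses."""
    counts = Counter()
    for line in lines:
        match = STATUS_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    total = sum(counts.values())
    return sum(counts[s] for s in statuses) / total if total else 0.0
```

Pass it an open log file, e.g. `status_share(open("access.log"))`, where the file name is only a placeholder.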
4:59 am on Aug 7, 2019 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Nov 5, 2005
posts: 2064
votes: 2


tangor, that massive exploit barrage is up to a record 812 over here, so you've got more to come:) What a bizarre collection o' pestilences. And to think each represents a hole in a known, typically PHP script.

I got so sick of scrolling through the hits that I reworked my 'quick glimpse' log script to exclude all but "manager/html" (for looking up and later kill-filing as desired).

Back to Dimitri... I don't think the time of year matters to bot-runners. SysAdmins and webmasters and log-addicts and site-tenders and such are either paying attention 24/7 or they're not. Too many aren't.
5:59 am on Aug 7, 2019 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member tangor is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Nov 29, 2005
posts:10450
votes: 1087


@Pfui ... thanks so much (not!) for the heads-up. ;)

Just for fun I have been tracking these attempts for the last month or so, to see how much of it is cut and paste between bad actors and how much shows SOME effort expended on their evil deeds.

Fortunately, the majority are dumb as posts, while the more inventive have devolved into creating gibberish requests ending in .php. :)
6:01 am on Aug 7, 2019 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member lucy24 is a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month

joined:Apr 9, 2011
posts:15867
votes: 869


It just occurred to me that robots may be more visible in summer because they're always the same, while there are fewer humans, so the robots become a higher proportion. It does seem to work that way for me: although it isn't really an academic site, human visitors do drop off in the summer.

:: idly wondering how this works in sites targeted to Japan, which has an entirely different academic calendar from the rest of the northern hemisphere ::
11:58 pm on Aug 7, 2019 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member tangor is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Nov 29, 2005
posts:10450
votes: 1087


Interesting observation! I suspect the noise is probably constant, just more obvious when the humans are out playing in the sunshine. :)
4:19 am on Aug 18, 2019 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month

joined:July 29, 2007
posts:2014
votes: 215


Instead of country level blocking consider blocking all of the Amazon AWS IP ranges. I get very, very few actual visitors from those ranges but it accounts for almost all of the cruft.
6:18 am on Aug 18, 2019 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member tangor is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Nov 29, 2005
posts:10450
votes: 1087


@JS_Harris ... thanks ... I have looked at AWS in the past ... oddly enough I don't get much grief from those IPs ... bad actors are really a country thing for me for some reason. The web is a strange place these days. :)
2:43 pm on Aug 24, 2019 (gmt 0)

Junior Member from US 

10+ Year Member

joined:Dec 23, 2008
posts:167
votes: 10


Now we find this:

[krebsonsecurity.com...]

Jonesy
6:10 pm on Aug 25, 2019 (gmt 0)

Preferred Member from CA 

Top Contributors Of The Month

joined:Feb 7, 2017
posts:575
votes: 59


@Jonesy Interesting article. That explains a lot about the huge bot activity coming from residential host providers. Unfortunately I cannot differentiate between legit residential IPs and these resold IPs.
7:13 pm on Aug 25, 2019 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member lucy24 is a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month

joined:Apr 9, 2011
posts:15867
votes: 869


an impressive stable of IP blocks — totaling almost 70,000 IPv4 addresses
That sound you hear in the distance is European network administrators sobbing brokenly at the thought of, let's say, sixty-eight /22 sectors available for the grabbing.
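
The arithmetic behind that figure, spelled out as a small sketch (only the /22 size and the count of sixty-eight come from the sentence above; the variable names are made up):

```python
# A /22 leaves 32 - 22 = 10 host bits.
addresses_per_22 = 2 ** (32 - 22)

# Sixty-eight such blocks.
total = 68 * addresses_per_22

print(addresses_per_22, total)  # prints: 1024 69632
```

So 69,632 addresses, which is the "almost 70,000" in the quote.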
9:53 am on Aug 26, 2019 (gmt 0)

Senior Member from GB 

WebmasterWorld Senior Member graeme_p is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Nov 16, 2005
posts:3001
votes: 205


I was wondering where the IPs used by scraping/crawling services came from, it had to be something like this.

For those not familiar with that murky world, there are people who sell proxies that rotate a scrape through multiple addresses so the target does not see a lot of requests from one IP.
4:46 pm on Aug 26, 2019 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member tangor is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Nov 29, 2005
posts:10450
votes: 1087


Sad thing is, even if identified by behavior, it's a lose-lose proposition taking these out by IP, since these pools do change over time ... though bad actor versions can be identified AFTER a lot of study.
5:15 pm on Aug 26, 2019 (gmt 0)

Junior Member

Top Contributors Of The Month

joined:Sept 8, 2016
posts:97
votes: 0


> Now we find this:
> [krebsonsecurity.com...]

I don't have AS396153 in my web-blocking list. I have zero hits to my website going back to at least 2015 from 198.228.0.0/16. I have 3 email spams from 198.228.215.x and 198.228.212.x that happened in August 2013.

AS396153 is currently showing ZERO associated IP blocks. No peers, zero originating IP's. It seems to have been moved to AS46723 (which I'm also not blocking) which has the following:

65.210.64.0/21
63.69.72.0/22
63.68.140.0/22
63.38.224.0/20
63.38.192.0/20
2604:3540::/32
23.129.160.0/24
208.255.136.0/21

I have no web-hits from any of those networks in my logs during the past 4 years. Whoever was using RESNET wasn't doing much crawling from what I can see...
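
For anyone who wants to run the same check against their own logs, a minimal sketch using Python's standard `ipaddress` module. The prefixes are the ones listed above; the function name is made up, and pulling the client IPs out of the log is left to the reader.

```python
import ipaddress

# Prefixes copied from the list above.
PREFIXES = [ipaddress.ip_network(p) for p in (
    "65.210.64.0/21", "63.69.72.0/22", "63.68.140.0/22", "63.38.224.0/20",
    "63.38.192.0/20", "2604:3540::/32", "23.129.160.0/24", "208.255.136.0/21",
)]


def in_listed_prefixes(ip):
    """True if ip (IPv4 or IPv6 text) falls inside any listed prefix."""
    addr = ipaddress.ip_address(ip)
    # An IPv4 address is never "in" an IPv6 network, and vice versa.
    return any(addr in net for net in PREFIXES)
```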
9:07 pm on Aug 26, 2019 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member lucy24 is a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month

joined:Apr 9, 2011
posts:15867
votes: 869


I have no web-hits from any of those networks in my logs during the past 4 years. Whoever was using RESNET wasn't doing much crawling from what I can see.
That sounds like a twist on the age-old issue of telling victims of {insert type of crime ad lib} that they should be flattered because obviously they’ve made themselves attractive to {insert type of offender ad lib}. Should I feel sad that my website hasn’t drawn the attention of some particular variety of malign robot?