Forum Moderators: phranque

Huge (bad) bots activity

     
5:20 pm on Aug 2, 2019 (gmt 0)

Senior Member

WebmasterWorld Senior Member Top Contributors Of The Month

joined:Nov 13, 2016
posts:1193
votes: 280


Hi,

I didn't know where to post this, but I am seeing an extremely high level of bot activity today. I mean the bots that pretend to be humans and send referer URLs (usually google.com).

I am used to dealing with this kind of bad bot, and I think I am pretty good at blocking them. But I always keep an eye on what my script is blocking, and today there is a real peak of activity from IP ranges I've never seen before, though still caught by my script (because of wrong or missing header fields, awkward reverse DNS, an "unnatural" pattern, etc.).

I wonder if scrapers are using July and August, when a lot of people worldwide are on vacation, to run their bots and take all they can before getting caught...

edit: by peak of activity, I mean 20 times more than usual.
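
For readers wondering what header-based filtering of this kind could look like, here is a minimal sketch in Python. It is not the script described above: the list of expected headers, the weights, and the threshold are all assumptions chosen only for illustration.

```python
# Hypothetical scoring sketch. The header list, weights, and threshold
# are illustrative assumptions, not the poster's actual rules.
EXPECTED_HEADERS = {
    "user-agent": 3,
    "accept": 2,
    "accept-language": 2,
    "accept-encoding": 1,
}
CHALLENGE_THRESHOLD = 3


def suspicion_score(headers: dict) -> int:
    """Add a penalty for each expected browser header that is missing or empty."""
    present = {name.lower(): value.strip() for name, value in headers.items()}
    return sum(
        weight
        for name, weight in EXPECTED_HEADERS.items()
        if not present.get(name)
    )


def should_challenge(headers: dict) -> bool:
    """True when the score reaches the (assumed) challenge threshold."""
    return suspicion_score(headers) >= CHALLENGE_THRESHOLD
```

A real filter would combine this with other signals (reverse DNS, request pattern), and the weights would need tuning against actual logs.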
5:54 pm on Aug 2, 2019 (gmt 0)

Administrator from US 

WebmasterWorld Administrator not2easy is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Dec 27, 2006
posts:4396
votes: 314


Are these typical human UAs and IPs? Do they request all resources for a requested page?
but still caught by my script

Does your script show them a 403 page? Does your 403 page offer a means of human contact?
If so, do you see them repeating the request rather than making contact? It would mean pulling a copy of the access logs for some review, but then you can see whether/what action to take.
6:04 pm on Aug 2, 2019 (gmt 0)

Senior Member

WebmasterWorld Senior Member Top Contributors Of The Month

joined:Nov 13, 2016
posts:1193
votes: 280


Yes, human-lookalike UAs, but often with missing or malformed header fields.

As for the IP ranges, the reverse DNS points to hosting companies that I didn't know until now. There are a lot of requests without reverse DNS, which is the kind of thing that upsets my script :). I looked up some of these IPs without reverse DNS, and they might be compromised servers; several are associated with e-commerce sites worldwide.

When my script suspects a non-human visitor, it displays a CAPTCHA-like challenge that I created myself, so real humans can still pass. This lets me constantly evaluate the performance of my script and, if needed, refine my filtering rules. See, if I were a startup, I would claim it feeds my artificial intelligence learning system, but it just feeds my own human and natural intelligence :)
6:05 pm on Aug 2, 2019 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member lucy24 is a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month

joined:Apr 9, 2011
posts:15756
votes: 828


I wonder if, scrapers are using the July / August
Option B is that they've got a long list of sites that they work through, top to bottom, and today it is your turn.

As a crude but useful metric, I glanced at the size of the last few days' log files. Yesterday or the day before (depending on site) did seem to be unusually plump--especially on the http side, on those sites that have both.

Tangential but useful to know: Most malign robots still seem to use http by default. It's especially noticeable on my test site, where http logs are consistently bigger than https logs, up to twice as big, even though https is what gets the human visits--including myself--with all their supporting files.

Are many of your fake referers claiming to come from google.com.hk? I don't generally pay much attention to blocked requests--after all, they're already blocked, so what more do I need to do?--but on closer inspection, almost half of recent blocked requests claiming a google referer* claim to be coming in from Hong Kong.

:: detour to headers for closer look ::

Real: https://google.com.hk/ (https, closing slash)
Fake: http://google.com.hk (http, no slash)
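
The real/fake difference above can be turned into a small check. This is only a sketch of that one observation: the function name and the host-matching rule are assumptions, and the pattern comes from one site's logs, so it may not hold elsewhere.

```python
from urllib.parse import urlsplit


def looks_like_fake_google_referer(referer: str) -> bool:
    """Flag referers that claim a Google host but use plain http or have no path.

    Assumption: a host counts as "claiming Google" when one of its
    dot-separated labels is exactly "google".
    """
    parts = urlsplit(referer)
    host = (parts.hostname or "").lower()
    if "google" not in host.split("."):
        return False
    # Genuine examples above use https and end with a slash (path "/").
    return parts.scheme != "https" or parts.path == ""
```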

But I digress.


* Thanks to that officially sanctioned misspelling, my fingers got as far as “refererer” before I stopped them.
6:14 pm on Aug 2, 2019 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member lucy24 is a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month

joined:Apr 9, 2011
posts:15756
votes: 828


When my script suspects a non human visitor, it displays a CAPTCHA-like challenge
I take an even simpler approach: forward to a page that says in effect “Whoops! You’ve accidentally replicated the behavior of an undesirable robot” and then there’s a link to the originally requested page on the slim chance that it really was a human. At any given time, there will be a couple of botnets with a predictable pattern of requests ... and also bona fide humans from {country} requesting pages in {directory} where the redirect is an alternative to saying “Past experience tells me you’re too stupid to glance at a search snippet or even at the title of the page, [ they can’t all be using the I Feel Lucky option, can they? ] and I really don't see why I should put my server to the work of sending you up to a hundred image files for a page you'll never even look at.” None of them ever follow the link.
9:36 pm on Aug 2, 2019 (gmt 0)

Junior Member

Top Contributors Of The Month

joined:May 1, 2018
posts: 102
votes: 16


Hi, it doesn't sound like you are asking a question, so I'll throw my 2 cents in on the situation.

This is exactly what I'm experiencing with the fake bots whose reverse IP points to a hosting company. In fact, there are so many of them that I now have over 200 million IPs on my "special" list.

The strange part is... these bots are somehow connected to my competitors.

The reason I know they are connected is that they repost content in my niche immediately after I blog about it. I can wait hours or even days, but rest assured, the second I add something they will be right along to reblog it and share it to Facebook pages with a reach of 200,000.

Collecting these hosting IP addresses and denying them access has made a huge improvement, and I can even get some articles past them now.

I would be interested to see whether your Alexa time on site has dropped 80% and whether other analytics figures are all showing poorly.

We can block them or captcha them, but even that is probably hurting user experience.

The people doing it in my niche stand out like a sore thumb, and even so it's "ok". That's "SEO": whoever has no morals and can repost everybody's content in the niche the fastest and bot Facebook. Yeaaa
9:48 pm on Aug 2, 2019 (gmt 0)

Senior Member

WebmasterWorld Senior Member Top Contributors Of The Month

joined:Nov 13, 2016
posts:1193
votes: 280


Most malign robots still seem to use http by default

When you restrict access to the https protocol with TLS 1.2 and 1.3, you are already getting rid of plenty of requests. Sure, there may be some collateral damage for a few real humans still living in the prehistoric era (before the year 2000); on the other hand, given the known exploits of SSL and TLS < 1.2, it's safe to reject these requests.
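
In practice this restriction is usually set in the web server configuration. Purely as an illustration of the idea, here is a minimal sketch using Python's standard `ssl` module; the function name is an assumption, and no certificate is loaded, so this shows only the version floor.

```python
import ssl


def make_strict_tls_context() -> ssl.SSLContext:
    """Build a server-side context that refuses anything older than TLS 1.2."""
    context = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
    # SSLv3, TLS 1.0 and TLS 1.1 handshakes are rejected; 1.2 and 1.3 remain.
    context.minimum_version = ssl.TLSVersion.TLSv1_2
    # A real server would also call context.load_cert_chain(...) with its
    # own certificate and key before wrapping a listening socket.
    return context
```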
11:05 pm on Aug 2, 2019 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member tangor is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Nov 29, 2005
posts:10136
votes: 1010


I have a "If you are a human and think you have been blocked in error, click here to gain access."

Pretty sure only five in the last year have ever done so. So I don't lose any sleep on blocking humans by accident. Either that, or they are so ticked off they wouldn't say boo ... goes both ways, I suppose.
2:42 am on Aug 3, 2019 (gmt 0)

Junior Member

Top Contributors Of The Month

joined:Sept 8, 2016
posts:95
votes: 0


My IP blocking list covers 598 million IPs, of which 595 million were added solely based on hits to my single http site. My https site (an exact mirror of the http site) has been up since last November, and Google has been offering https search results to me starting this past month. I do not redirect from http to https. For the new bots that get through (there are always some every day; I block their entire AS when I discover them), I have been seeing some requests for new files over the past couple of weeks. E.g., if my domain is "mydomain.com", I've seen requests for "mydomain.zip" and "mydomain_backup.zip". That and several other files (database and script files?), as if they're targeting a specific server platform. Just today I ran through the past 6 months of my https logs and added about 3 million IPs (about two dozen AS networks) to my blocking list. So there are bots and junk hitting my https site directly that never showed up in my http logs.
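
Blocking a whole AS generally comes down to blocking the CIDR ranges it announces. A minimal sketch of a range-based blocklist with Python's standard `ipaddress` module follows; the ranges are reserved documentation blocks used as placeholders, not real offenders, and the function names are assumptions.

```python
import ipaddress

# Placeholder ranges (reserved documentation blocks), not a real blocklist.
BLOCKED_NETWORKS = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("198.51.100.0/24"),
    ipaddress.ip_network("2001:db8::/32"),
]


def is_blocked(client_ip: str) -> bool:
    """True when the address falls inside any blocked range."""
    address = ipaddress.ip_address(client_ip)
    return any(address in network for network in BLOCKED_NETWORKS)


def blocked_ipv4_count() -> int:
    """Number of IPv4 addresses covered, assuming the ranges do not overlap."""
    return sum(
        network.num_addresses
        for network in BLOCKED_NETWORKS
        if network.version == 4
    )
```

With lists covering hundreds of millions of addresses, a linear scan like this would be slow; a sorted structure or the firewall's own set type would be the more realistic choice.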
3:46 am on Aug 3, 2019 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member tangor is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Nov 29, 2005
posts:10136
votes: 1010


Aside: I recently embarked on COUNTRY level blocking rather than relying on all the other indicators. There's bad actors everywhere, but some regions seem to be more prone to bad actions than others. On the road to taking out 3 billion if necessary ... my time is too short... and even if hosting is "kinda sorta cheap" I don't have time for the NOISE of those with NO INTEREST.

Life is too short.

YMMV.
8:25 am on Aug 3, 2019 (gmt 0)

Senior Member

WebmasterWorld Senior Member Top Contributors Of The Month

joined:Nov 13, 2016
posts:1193
votes: 280


Yes, there are countries I am totally blocking. It's too bad, because maybe there are legitimate visitors there who are "punished" as collateral damage.

I am also blocking all VPNs (as far as I can detect them). VPNs are not bad in themselves, but there is way too much abuse of them... In any event, this is what the Internet is made of nowadays: good things being abused.
10:43 am on Aug 3, 2019 (gmt 0)

Senior Member

WebmasterWorld Senior Member Top Contributors Of The Month

joined:Nov 13, 2016
posts:1193
votes: 280


Also, I am unsure what to think of requests from IPs with no reverse DNS. As I said, that's a big warning sign for my script and carries a lot of "negative score". Sometimes I wonder whether legitimate requests could come from IPs with no reverse. I assume that all ISPs have reverses. "Good" crawlers have reverses (except half of Bing requests). So I am considering simply banning all IPs without reverse.

edit: by "reverse", I in fact mean FCrDNS (forward-confirmed reverse DNS).
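
For anyone unfamiliar with the term: FCrDNS means the IP's reverse (PTR) lookup yields a hostname, and a forward lookup of that hostname returns the original IP. Here is a minimal sketch with Python's standard library. The function names are assumptions, and the lookups are passed in as parameters only so the logic can be tried without touching the network.

```python
import ipaddress
import socket


def reverse_lookup(ip: str) -> str:
    """PTR lookup: IP address to hostname."""
    return socket.gethostbyaddr(ip)[0]


def forward_lookup(hostname: str) -> set:
    """A/AAAA lookup: hostname to the set of addresses it resolves to."""
    return {info[4][0] for info in socket.getaddrinfo(hostname, None)}


def passes_fcrdns(ip: str, reverse=reverse_lookup, forward=forward_lookup) -> bool:
    """True only if the reverse name resolves back to the same IP."""
    try:
        target = ipaddress.ip_address(ip)
        hostname = reverse(ip)
        resolved = {ipaddress.ip_address(addr) for addr in forward(hostname)}
    except (OSError, ValueError):
        # No PTR record, failed forward lookup, or an unparsable address.
        return False
    return target in resolved
```

DNS lookups are slow, so a live filter would normally cache the result per IP rather than resolve on every request.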
3:34 pm on Aug 3, 2019 (gmt 0)

Senior Member from GB 

WebmasterWorld Senior Member graeme_p is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Nov 16, 2005
posts: 2980
votes: 201


Either that, or they are so ticked off they wouldn't say boo ... goes both ways, I suppose.


Most people are used to reCAPTCHA (I think it's a horrible solution), so having to click one link is hardly noticeable.
3:34 am on Aug 5, 2019 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member tangor is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Nov 29, 2005
posts:10136
votes: 1010


I give the humans a shot ... bots not so much. :)

The IPs that are ticking me off at the moment are the bad actor countries that send 706 wp or php requests (it is usually the same list) at one time ... obviously looking to break in.

I don't do wp or php, and I 403 'em all. Still ... it makes my logs look funny, as 403, 404, 406 and 408 responses are about 80% of the log these days... That noise is just increasing.
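
A share like that 80% figure can be measured with a few lines of Python. This sketch assumes the access log uses the Common or Combined Log Format, where the status code follows the quoted request line; the function names and the pattern are assumptions and would need adapting to other log formats.

```python
import re
from collections import Counter

# Status code right after the quoted request line,
# e.g.: "GET /x.php HTTP/1.1" 403 199
STATUS_PATTERN = re.compile(r'" (\d{3}) ')


def status_counts(log_lines) -> Counter:
    """Count responses per status code; lines that do not match are skipped."""
    counts = Counter()
    for line in log_lines:
        match = STATUS_PATTERN.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


def error_share(log_lines, statuses=("403", "404", "406", "408")) -> float:
    """Fraction of parsed requests answered with one of the given statuses."""
    counts = status_counts(log_lines)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return sum(counts[status] for status in statuses) / total
```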
4:59 am on Aug 7, 2019 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Nov 5, 2005
posts: 2047
votes: 1


tangor, that massive exploit barrage is up to a record 812 over here, so you've got more to come:) What a bizarre collection o' pestilences. And to think each represents a hole in a known, typically PHP script.

I got so sick of scrolling through the hits that I reworked my 'quick glimpse' log script to exclude all but "manager/html" (for looking up and later kill-filing as desired).

Back to Dimitri... I don't think the time of year matters to bot-runners. SysAdmins and webmasters and log-addicts and site-tenders and such are either paying attention 24/7 or they're not. Too many aren't.
5:59 am on Aug 7, 2019 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member tangor is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Nov 29, 2005
posts:10136
votes: 1010


@Pfui ... thanks so much! (not!) for the heads-up. ;)

Just for fun I have been tracking these attempts for the last month or so to see how much of it is cut and paste between bad actors and how much shows SOME effort expended on their evil deeds.

Fortunately, the majority are dumb as posts, while the more inventive have devolved into creating gibberish requests ending in .php. :)
6:01 am on Aug 7, 2019 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member lucy24 is a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month

joined:Apr 9, 2011
posts:15756
votes: 828


It just occurred to me that robots may be more visible in summer because they're always the same, while there are fewer humans, so the robots become a higher proportion. It does seem to work that way for me: although it isn't really an academic site, human visitors do drop off in the summer.

:: idly wondering how this works in sites targeted to Japan, which has an entirely different academic calendar from the rest of the northern hemisphere ::
11:58 pm on Aug 7, 2019 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member tangor is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Nov 29, 2005
posts:10136
votes: 1010


Interesting observation! I suspect the noise is probably constant, just more obvious when the humans are out playing in the sunshine. :)
4:19 am on Aug 18, 2019 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month

joined:July 29, 2007
posts:2011
votes: 211


Instead of country-level blocking, consider blocking all of the Amazon AWS IP ranges. I get very, very few actual visitors from those ranges, but they account for almost all of the cruft.
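
Amazon publishes its current ranges as a JSON file at https://ip-ranges.amazonaws.com/ip-ranges.json, with IPv4 entries under `prefixes` (key `ip_prefix`) and IPv6 entries under `ipv6_prefixes` (key `ipv6_prefix`). Here is a minimal sketch of turning such a document into a lookup. The sample document uses reserved documentation ranges rather than real AWS prefixes, the function names are assumptions, and downloading the file is left out.

```python
import ipaddress
import json

# Tiny stand-in for the published file; the ranges are placeholders.
SAMPLE_DOCUMENT = """
{
  "prefixes": [
    {"ip_prefix": "192.0.2.0/24", "region": "us-east-1", "service": "EC2"}
  ],
  "ipv6_prefixes": [
    {"ipv6_prefix": "2001:db8::/32", "region": "us-east-1", "service": "EC2"}
  ]
}
"""


def load_aws_networks(document: str) -> list:
    """Parse the JSON text into a list of IPv4 and IPv6 network objects."""
    data = json.loads(document)
    networks = [
        ipaddress.ip_network(entry["ip_prefix"])
        for entry in data.get("prefixes", [])
    ]
    networks += [
        ipaddress.ip_network(entry["ipv6_prefix"])
        for entry in data.get("ipv6_prefixes", [])
    ]
    return networks


def is_in_networks(client_ip: str, networks: list) -> bool:
    """True when the address belongs to any of the loaded ranges."""
    address = ipaddress.ip_address(client_ip)
    return any(address in network for network in networks)
```

The published list changes over time, so it would need to be refreshed periodically rather than loaded once.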
6:18 am on Aug 18, 2019 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member tangor is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Nov 29, 2005
posts:10136
votes: 1010


@JS_Harris ... thanks ... I have looked at AWS in the past ... oddly enough I don't get much grief from those IPs ... bad actors are really a country thing for me for some reason. The web is a strange place these days. :)
 
