Bots slamming my site

timchuma

11:08 am on Jan 1, 2021 (gmt 0)

10+ Year Member Top Contributors Of The Month



The hosting tech support is not helping.

Bot                                  | Hits         | Bandwidth | Last visit
AhrefsBot                            | 290,387+56   | 45.04 GB  | 30 Dec 2020 - 12:05
SemrushBot                           | 168,034+2491 | 2.30 GB   | 30 Dec 2020 - 12:04
BLEXBot                              | 136,350+1320 | 1.73 GB   | 30 Dec 2020 - 05:48
Unknown robot (identified by 'bot*') | 48,961+166   | 1.11 GB   | 30 Dec 2020 - 12:06

engine

11:14 am on Jan 1, 2021 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



I sympathise, and you might want to have a read of this and some of the references in the thread.

[webmasterworld.com...]

JorgeV

11:16 am on Jan 1, 2021 (gmt 0)

WebmasterWorld Senior Member 5+ Year Member Top Contributors Of The Month



Hello,

The hosting tech support is not helping.

What kind of help are you looking for? To block these bots?

timchuma

11:48 am on Jan 1, 2021 (gmt 0)

10+ Year Member Top Contributors Of The Month



Some assistance to block the bots. My WordPress install was hacked even though it was updating automatically, and I didn't even know, as the site looked fine to me.

JorgeV

1:18 pm on Jan 1, 2021 (gmt 0)

WebmasterWorld Senior Member 5+ Year Member Top Contributors Of The Month



Hello again,


Most of these bots obey the robots.txt file, so you should be able to get rid of most of them that way.

Then, if you are using Apache, you can block requests based on the user agent.
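For completeness, all three of the heavy hitters in the stats above document that they obey Disallow rules, so a minimal robots.txt sketch along these lines should shed most of that traffic:

```
User-agent: AhrefsBot
Disallow: /

User-agent: SemrushBot
Disallow: /

User-agent: BLEXBot
Disallow: /
```

Bear in mind that bots cache robots.txt between visits, so it can take a day or two for the change to show up in the logs.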

TorontoBoy

4:57 pm on Jan 1, 2021 (gmt 0)

5+ Year Member Top Contributors Of The Month



I believe all three of these bots do read the robots.txt and do no malicious harm, but with their daily scans they can also overwhelm your site. I've long banned all three. Your host provider will not block them, as you may, for whatever reason, want them to read your site.

WP is a great platform for writing, but really deficient in security. You should consider two-factor authentication (2FA) for your WP install. This is critical for WP site security.

lucy24

6:06 pm on Jan 1, 2021 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



with their daily scans they also can overwhelm your site
A few requests for robots.txt really should not overwhelm any site, unless you're running a micro-server out of your garage.

timchuma

4:23 am on Jan 2, 2021 (gmt 0)

10+ Year Member Top Contributors Of The Month



One bot is doing 99 GB per month worth of requests.

jmccormac

6:04 am on Jan 2, 2021 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Unless a bot is being run by a search engine that provides traffic, or is from some service that you use, deep six it or restrict it. First try the robots.txt approach and, if that fails, deep six the IP ranges associated with the requests. You are paying for the bandwidth. I regularly see scraped blog content, often complete clones of the original blog sites, in web usage surveys. It is quite common in the new gTLDs aimed at the Chinese market.

Some of these bots are well-behaved but others have to be banned at an IP level. These are separate from scrapers which may masquerade as bots from search engines or other services. Scrapers simply take your content. Some of them are from data centres, others from VPN privacy services and others are from compromised PCs or scraper networks using mobile phone apps or iffy browser plugins.

Some bots from search engines (especially the Chinese ones) can hit a site hard and the SEs may not provide enough traffic.

You need to do some log analysis (not analytics) to see what bots are major problems. You need to check which IPs are hitting the site most each day and whether they are genuine search engine bots from IPs associated with the search engine, genuine human users from ISP ranges or requests from data centres.
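As a sketch of that kind of log analysis, a few lines of Python are enough to rank client IPs by request count from a standard Apache access log (the file name in the usage comment is only an example; adjust for your setup):

```python
from collections import Counter

def top_ips(log_lines, n=10):
    """Return the n most frequent client IPs in access-log lines.

    Assumes the IP is the first whitespace-separated field, as in
    Apache's default common/combined log formats.
    """
    counts = Counter()
    for line in log_lines:
        fields = line.split()
        if fields:
            counts[fields[0]] += 1
    return counts.most_common(n)

# Usage against a real log file (path is an example):
# with open("access_log") as f:
#     for ip, hits in top_ips(f):
#         print(f"{hits:8d}  {ip}")
```

The top entries can then be checked against whois/rDNS to see whether they are genuine search engine ranges, ISP ranges, or data centres.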

Regards...jmcc

lucy24

4:36 pm on Jan 2, 2021 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



One bot is doing 99 GB per month worth of requests.
If they're disregarding a robots.txt Disallow, then yeah, you'll have to block 'em. (I am not prepared to believe that any robot in the world, including bingbot, eats 99 GB worth of robots.txt in a month.)

The bad news is that blocking a request doesn't prevent the request from being made, and very few robots respond to repeated 403s by going away. At best, your real pages are presumably heavier than your 403 page, so the server is sending out less content. A consistent 403 may also prevent the robot from learning about all your links, so it can't make as many different requests overall.

If it's your own server, some type of firewall may be appropriate, but this doesn't seem to be the case here.

JorgeV

10:29 pm on Jan 2, 2021 (gmt 0)

WebmasterWorld Senior Member 5+ Year Member Top Contributors Of The Month



Hello,

The problem with error codes is that they don't prevent robots from continuing to knock at the door. This is why you have to favor robots.txt (for good bots).

timchuma

2:06 am on Jan 3, 2021 (gmt 0)

10+ Year Member Top Contributors Of The Month



This seems to have worked for now in .htaccess:
RewriteCond %{HTTP_USER_AGENT} AhrefsBot [NC,OR]
RewriteCond %{HTTP_USER_AGENT} SemrushBot [NC,OR]
RewriteCond %{HTTP_USER_AGENT} BLEXBot [NC]
RewriteRule ^ - [F,L]

I have informed the server admin, as my website is only a virtual host on a shared server, so they would be having the same issue with all of the sites they host.

timchuma

2:07 am on Jan 3, 2021 (gmt 0)

10+ Year Member Top Contributors Of The Month



The main issue was that they were requesting photos off my site which sucked a huge amount of bandwidth.

jmccormac

2:35 am on Jan 3, 2021 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



That's strange activity for Ahrefs and Semrush. They are generally looking for links to evaluate a site for SEO purposes. Using .htaccess is one solution, but it creates more of a load on a busy server. While it is a shared server, those rewrite rules are better placed in the httpd.conf file. The problem with shared servers is that they are shared, and some of the other customers might be using those services.

Others might have a different view on making the photos downloadable by search engines and bots but it is not a good idea in terms of bandwidth unless they bring in useful traffic to the site. If the images are in separate directories beneath the web root directory, it might be possible to disallow those directories in the robots.txt. The bots that read and observe robots.txt rules will not check these directories. It is also possible to use rules in .htaccess to restrict access on a per-user agent basis.
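As a sketch of that idea, assuming the photos live under a hypothetical /photos/ directory beneath the web root:

```
User-agent: *
Disallow: /photos/
```

Bots that observe robots.txt will then skip the images entirely, while normal page requests are unaffected.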

Just checked BLEXBot and it is a link crawler too. It might be best to check the logs as, to paraphrase Obi-Wan Kenobi, these might not be the bots you are looking for. :)

Bing can be a problem on large sites. Google does have an image search option and Google might be one of the problem bots. It is possible to slow down Bing with a rule in robots.txt but I am not sure about Google. One possible solution might be a thumbnails directory for legitimate search engines. That way you would still feed them images but they'd be thumbnails and the bandwidth used would be much lower.
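For reference, the Bing slowdown mentioned above is done with a Crawl-delay line in robots.txt (Bing honours it; Google ignores it and takes its crawl rate from Search Console instead). The 10-second value is only an example:

```
User-agent: bingbot
Crawl-delay: 10
```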

Regards...jmcc

not2easy

2:54 am on Jan 3, 2021 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



For bots that ignore robots.txt, there are a lot (really, thousands) of discussions here on how to block them in our Search Engine Spider and User Agent Identification forum [webmasterworld.com]

This is a common method that can be customized to suit your needs:
RewriteCond %{HTTP_USER_AGENT} (Ahrefs|Access|appid|Blex) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} (Capture|Client|Copy|crawl|curl) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} (Data|devSoft|Domain|download) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} (Engine|fetch|filter|genieo) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} (Jakarta|Java|Library|link|libww) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} (MJ12bot|nutch|Preview|Proxy|Publish) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} (scraper|Semrush|spider) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} (Wget|Win32|WinHttp) [NC]
RewriteRule .* - [F]

I found this example with the site search, at: [webmasterworld.com...]

It is not new (it is from 2013), but I am using something similar on several sites, and the logs tell me the rules do what they are intended to do. If it is for use within a VPS setting it may differ; this format is common for use in .htaccess.

jmccormac

3:00 am on Jan 3, 2021 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



The python-requests user agent might be another one that could be problematic.
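If so, it can be folded into a pattern list like the one above; a sketch (a bare "python" pattern would be too broad, so the fuller tokens are safer, and as with the other conditions the last line in the chain must drop the OR flag):

```
RewriteCond %{HTTP_USER_AGENT} (python-requests|Python-urllib) [NC,OR]
```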

Regards...jmcc

tangor

3:52 am on Jan 3, 2021 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



For those odd situations where rewrite is not available, this in .htaccess also works a charm:

SetEnvIfNoCase User-Agent "blexbot" ban

Meanwhile, these three bots do honor robots.txt. I also allow everyone to read robots.txt, even the denied ones.

From a more drastic viewpoint, one can deny IPs in the firewall...

Good luck!

phranque

5:56 am on Jan 3, 2021 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



this in .htaccess also works a charm:

SetEnvIfNoCase User-Agent "blexbot" ban

this is whispering in a forest without some mod_access_compat directives.
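In other words, the SetEnvIfNoCase line only tags the request; something has to act on the tag. A sketch of the full pairing, assuming Apache 2.4 with mod_access_compat loaded (on Apache 2.2 these directives are native):

```
SetEnvIfNoCase User-Agent "blexbot" ban

<IfModule mod_access_compat.c>
    Order Allow,Deny
    Allow from all
    Deny from env=ban
</IfModule>
```

On a plain 2.4 setup without the compat module, the modern equivalent is "Require all granted" plus "Require not env ban" inside a RequireAll block.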

timchuma

6:00 am on Jan 3, 2021 (gmt 0)

10+ Year Member Top Contributors Of The Month



Answer from the server support:
The bots are not actually hitting the server, they are hitting the websites on it.

This is the reason why website firewall services like Wordfence and Sucuri exist so that they can nullify any traffic to the sites that is not of the "good kind".

The server has a firewall as well, however, it protects the server, not the websites on it and by that, we mean that it does not protect them from any attacks that happen to them through the web.

We do not offer any website security on our packages, however, we do offer Sucuri Security packages that come with a Sucuri Firewall feature.

===

They want more money to fix it. The site is on a virtual server at a legacy rate, as it is so old now.

timchuma

6:14 am on Jan 3, 2021 (gmt 0)

10+ Year Member Top Contributors Of The Month



Someone from France?

IP address      | Pages | Hits  | Bandwidth   | Last visit
62.210.204.15   | 1,657 | 1,657 | 5.44 GB     | 31 Dec 2020 - 23:57
62.210.180.164  | 2,609 | 2,609 | 8.56 GB     | 31 Dec 2020 - 23:56
62.210.83.206   | 2,454 | 2,454 | 8.04 GB     | 31 Dec 2020 - 23:55
62.210.139.12   | 2,715 | 2,715 | 8.89 GB     | 31 Dec 2020 - 23:53
118.169.83.248  | 4     | 4     | 7.25 KB     | 31 Dec 2020 - 23:51
162.244.34.148  | 74    | 74    | 249.00 MB   | 31 Dec 2020 - 23:47
178.159.37.153  | 9     | 9     | 30.34 MB    | 31 Dec 2020 - 23:46
62.210.180.146  | 2,556 | 2,556 | 8.36 GB     | 31 Dec 2020 - 23:45
195.154.242.89  | 1,421 | 1,421 | 4.65 GB     | 31 Dec 2020 - 23:42
62.210.215.11   | 232   | 232   | 2,788.09 KB | 31 Dec 2020 - 23:39
62.210.122.241  | 1,675 | 1,675 | 5.49 GB     | 31 Dec 2020 - 23:34
195.154.222.29  | 1,429 | 1,429 | 4.68 GB     | 31 Dec 2020 - 23:33
195.154.222.31  | 1,578 | 1,578 | 5.17 GB     | 31 Dec 2020 - 23:28
195.154.242.189 | 1,413 | 1,413 | 4.62 GB     | 31 Dec 2020 - 23:28
203.221.102.128

jmccormac

6:47 am on Jan 3, 2021 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



62.210.128.0 - 62.210.255.255 range is from Online SAS / Scaleway in France. Probably not human.

118.169.83.xxx is from Hinet in Taiwan. Maybe human.
162.244.34.0/24 is from King Servers US. Probably not human.
178.159.37.xxx is from SBY-Telecom Ukraine. Maybe human.

195.154.128.0 - 195.154.255.255 FR-ILIAD-ENTREPRISES-CUSTOMERS/Online SAS Probably not human.

203.221.102.xxx is from TGP Internet Pty in Australia.

Online SAS/Scaleway generally pops up on lists here. The King Servers one might also be problematic. Given the distribution of the IPs and the GBs downloaded, they may be part of the problem and may need a block at IP level.
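A .htaccess sketch blocking those ranges on Apache 2.4 (the CIDR masks are my reading of the ranges above; verify them before deploying):

```
<RequireAll>
    Require all granted
    Require not ip 62.210.128.0/17
    Require not ip 195.154.128.0/17
    Require not ip 162.244.34.0/24
</RequireAll>
```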

Regards,...jmcc

wilderness

8:01 pm on Jan 3, 2021 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



We do not offer any website security on our packages


1) Did you look and see if each site has an ".htaccess" file?
a) Or whether your main server has a "config" file?
2) Are Euro and/or RIPE IPs beneficial to your sites?

lucy24

8:37 pm on Jan 3, 2021 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



The bots are not actually hitting the server, they are hitting the websites on it.
An interesting way of putting it, but technically correct. What your hosts are essentially saying is: It's not our problem, it's yours. Some sites may welcome visits from some of the named robots, so it would be wildly inappropriate for the entire server to deny access.

Some hosts do offer addons such as ModSecurity (technically I think it's third-party, but made to work with Apache), which can be set up to block patterns--request, UA, what-have-you--that are exclusively associated with malign robots. But even then, it's an optional extra: perhaps your site gets a lot of human visitors using MSIE 6, so it would not do to slam the door in their faces.

If this question was ever answered explicitly in this thread, I missed it: Does the site's robots.txt file include Disallow: lines for each of the named robots?

jmccormac

8:48 pm on Jan 3, 2021 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



The problem, from looking at the IP ranges posted, seems to be a scraper one rather than an SE or service one. Service bots don't generally crawl images/photos. The main SEs (Google and Bing) do but tend to respect 304s. Shared hosting does limit the range of available options. Deep sixing (blocking ranges at IP level) is not generally an option on shared hosting. Putting the IP ranges in a .htaccess is probably, short of modifying the code itself to reject by IP, the easiest way to do things.

Regards...jmcc

timchuma

10:01 pm on Jan 3, 2021 (gmt 0)

10+ Year Member Top Contributors Of The Month



Set up Cloudflare for my site, as was recommended for a large review board. Also blocking the bots in .htaccess.

timchuma

10:04 pm on Jan 3, 2021 (gmt 0)

10+ Year Member Top Contributors Of The Month



I have full access behind the scenes of each of the websites, as the server admin is usually hands-off and I do most things on the website (I have had it for 18 years). Things like the WordPress hack I had to get help with, as I had no idea it was hacked until someone else said it was redirecting, and Google said my site was returning blog posts such as "gay amputee #*$!".

phranque

1:54 am on Jan 4, 2021 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



The bots are not actually hitting the server, they are hitting the websites on it.
An interesting way of putting it, but technically correct. What your hosts are essentially saying is: It's not our problem, it's yours.

while i thought the same thing, that response sure made my skin crawl.

timchuma

9:14 am on Jan 4, 2021 (gmt 0)

10+ Year Member Top Contributors Of The Month



The company in question were a lot better 4 owners ago. That's what I get for choosing a host based in Slough.

timchuma

6:14 am on Jan 12, 2021 (gmt 0)

10+ Year Member Top Contributors Of The Month



I closed the ticket with them.

The results speak for themselves:
Day | Bandwidth
1   | 2.76 GB
2   | 2.96 GB
3   | 1.11 GB (set up Cloudflare & blocked bots)
4   | 440.31 MB
5   | 461.73 MB
6   | 527.08 MB
7   | 450.14 MB
8   | 513.75 MB
9   | 472.98 MB
10  | 494.8 MB
11  | 525.71 MB
12  | 119.12 MB

99% of my traffic is to the one subdomain and 99% was from Blexbot downloading my photos.

lucy24

7:30 am on Jan 12, 2021 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



99% was from Blexbot downloading my photos.
How very odd--both because in my experience they're robots.txt compliant, and because I’ve never seen them request an image.

Could it be a faker? I realize it's not so easy to tell with distributed robots, as BLEXbot is. (Note casing. If they really call themselves "Blexbot" like that, it's an impersonator and can be blocked. In fact I'm surprised you haven't got their most common IPs blocked already.)