Bots slamming my site

         

timchuma

11:08 am on Jan 1, 2021 (gmt 0)

10+ Year Member Top Contributors Of The Month



The hosting tech support is not helping.

Robot                               Hits           Bandwidth   Last visit
AhrefsBot                           290,387+56     45.04 GB    30 Dec 2020 - 12:05
SemrushBot                          168,034+2491   2.30 GB     30 Dec 2020 - 12:04
BLEXBot                             136,350+1320   1.73 GB     30 Dec 2020 - 05:48
Unknown robot identified by bot*    48,961+166     1.11 GB     30 Dec 2020 - 12:06

jmccormac

9:16 am on Jan 12, 2021 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



From looking at the Blexbot web page (http://webmeup-crawler.com/), it seems that they have reverse DNS on their crawlers, so they should show up as *.webmeup.com in the logs if lookups are on. The IP ranges used for the crawlers seem to be concentrated on Hetzner's German ranges. The problem ranges, according to the IPs above, are the French Iliad/Scaleway ones. It may well be a faker using the Blexbot user agent. Contact Blexbot and verify, but deep six the problem ranges first.

Regards...jmcc

[edited by: jmccormac at 9:36 am (utc) on Jan 12, 2021]

jmccormac

9:33 am on Jan 12, 2021 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



It is possible to block IP ranges in .htaccess (I think -- zero mugs of coffee at the moment) and this might be the best way of dealing with fakers that rotate user agents to scrape your site. The next step up would be to block in httpd.conf or to use iptables for a server-wide block. Blexbot is a link follower rather than an image scraper. So are the others. This is why the image requests are odd.

Regards...jmcc

lucy24

6:09 pm on Jan 12, 2021 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



“It is possible to block IP ranges in .htaccess (I think -- zero mugs of coffee at the moment)”
Why, certainly it is. In 2.2 and earlier the line begins with “Deny from...”; in 2.4 it would be “Require...” (either “Require not blahblah”, or more often “Require blahblah” inside a <RequireNone> envelope).
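
For anyone who wants to see the two syntaxes side by side, here is a rough sketch. The range 203.0.113.0/24 is a reserved documentation range used as a stand-in, not an address taken from anyone's logs, and the three forms are alternatives: pick the one that matches your Apache version, don't paste all of them.

```apache
# --- Apache 2.2 and earlier ---
# Allow rules are checked first, then Deny rules; a match on Deny wins.
Order Allow,Deny
Allow from all
Deny from 203.0.113.0/24

# --- Apache 2.4, using "Require not" ---
# A negated Require cannot grant access on its own, so it sits next to
# "Require all granted" inside <RequireAll>.
<RequireAll>
    Require all granted
    Require not ip 203.0.113.0/24
</RequireAll>

# --- Apache 2.4, using a <RequireNone> envelope ---
# Anything that matches a line inside <RequireNone> is refused.
<RequireAll>
    Require all granted
    <RequireNone>
        Require ip 203.0.113.0/24
    </RequireNone>
</RequireAll>
```

In .htaccess these only take effect if the host's AllowOverride setting permits them, which is worth checking before concluding that a block "doesn't work".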

In the specific case of BLEXbot and a handful of other distributed robots, an alternative is
SetEnvIf Remote_Addr {IP in RegEx form here} bad_range
followed by
BrowserMatch {name of acceptable robot} !bad_range
leading up to
Require env bad_range (inside the <RequireNone> envelope, so that flagged requests are the ones refused)
so you can poke holes in a range that’s used by some acceptable robots.
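
Put together, the pieces might look something like the sketch below. Both the range (203.0.113.x, a reserved documentation range) and the robot name GoodBot are made-up placeholders; substitute whatever your own logs show.

```apache
# Step 1: flag every request whose address falls in the placeholder range.
SetEnvIf Remote_Addr ^203\.0\.113\. bad_range

# Step 2: un-flag requests whose User-Agent contains the (hypothetical)
# acceptable robot name. The leading "!" removes the variable.
BrowserMatch GoodBot !bad_range

# Step 3: refuse whatever is still flagged, admit everything else.
<RequireAll>
    Require all granted
    <RequireNone>
        Require env bad_range
    </RequireNone>
</RequireAll>
```

The order matters: the SetEnvIf line has to come before the BrowserMatch line, since the directives are processed in the order they appear. Bear in mind that the hole is only as trustworthy as the User-Agent string, which anyone can fake.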

In 2.4 you can also get more granular, like saying “block all of this large range, except this smaller sub-range which is used by a known quantity”, but that's a matter for the Apache subforum.
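
Purely as an illustration of the idea, and one of several ways to write it, a 2.4 block of that shape could look like this. The large range 198.51.100.0/24 and the sub-range 198.51.100.64/26 are placeholder documentation addresses, not real crawler ranges.

```apache
# Admit a request if EITHER condition holds:
#   a) it comes from the trusted sub-range, or
#   b) it does not come from the large blocked range at all.
<RequireAny>
    # a) the smaller sub-range used by the known quantity
    Require ip 198.51.100.64/26

    # b) everyone outside the large range
    <RequireAll>
        Require all granted
        Require not ip 198.51.100.0/24
    </RequireAll>
</RequireAny>
```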


[edited by: not2easy at 10:14 pm (utc) on Jan 12, 2021]
[edit reason] typo [/edit]
