BLEXBot

new UA, range

         

keyplyr

8:42 pm on May 4, 2017 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month




UA: Mozilla/5.0 (compatible; BLEXBot/1.0; +http://webmeup-crawler.com/)
Protocol: HTTP/1.1
Robots.txt: Yes
Host: webmeup.com
5.9.18.0 - 5.9.18.31
5.9.18.0/27
Parent: hetzner.de
5.9.0.0 - 5.9.255.255
5.9.0.0/16

New UA, new dedicated crawl range within hetzner.de.
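For anyone who wants to test log IPs against the ranges listed above, here is a minimal sketch using Python's standard `ipaddress` module. The two CIDR blocks are copied from this post; the function name `in_blexbot_range` and the sample address `5.9.18.7` are made up for illustration and are not taken from any log.

```python
import ipaddress

# Ranges exactly as listed in the post above.
BLEXBOT_RANGE = ipaddress.ip_network("5.9.18.0/27")
HETZNER_PARENT = ipaddress.ip_network("5.9.0.0/16")


def in_blexbot_range(ip: str) -> bool:
    """Return True if ip falls inside the dedicated /27 crawl range."""
    return ipaddress.ip_address(ip) in BLEXBOT_RANGE


if __name__ == "__main__":
    # First and last address of the /27, to compare with the dotted range.
    print(BLEXBOT_RANGE[0], "-", BLEXBOT_RANGE[-1])
    # Confirm the /27 really sits inside the hetzner.de /16.
    print(BLEXBOT_RANGE.subnet_of(HETZNER_PARENT))
    # Hypothetical address, for illustration only.
    print(in_blexbot_range("5.9.18.7"))
```

A /27 holds 32 addresses, which agrees with the 5.9.18.0 - 5.9.18.31 span given above.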

Previous discussion: [webmasterworld.com...]

lucy24

1:41 am on May 5, 2017 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



The previous thread said
hope they get their crawl tactics in better order

I, for one, hope they get their ### database in order. Every day's logs show a fresh batch of BLEXBot 404s where they have clearly appended someone else's URLs onto my paths--not just once but dozens or hundreds of times, all different. It's gotten to where I start by globally deleting the pattern ".+?404 .+?BLEXBot.+\n" so I can pick out any real 404s. Grr.
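The same global delete can be sketched in Python for anyone who cleans logs with a script rather than a text editor. The pattern is the one quoted above; the function name `strip_blexbot_404s` is an assumption for illustration. Two caveats: the pattern matches any line where "404 " appears before "BLEXBot", so a BLEXBot line with a 404-byte response would also be removed, and a final line with no trailing newline will not match.

```python
import re

# Pattern as quoted in the post. "." does not cross newlines by default,
# so each match stays within a single log line.
BLEXBOT_404 = re.compile(r".+?404 .+?BLEXBot.+\n")


def strip_blexbot_404s(log_text: str) -> str:
    """Return log_text with every BLEXBot 404 line deleted."""
    return BLEXBOT_404.sub("", log_text)
```

Something like `strip_blexbot_404s(open("access.log").read())` would then leave only the other entries, where `access.log` is a placeholder file name.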

Edit: 5.9? Really? I'm seeing them from 148.251.244.204 (exactly). Earlier in the year it was 144.76.176.195; I think they changed around the beginning of April. I've met other robots that seemed to use the identical IP for every visit. And then if you've got more than one site, you see more than one IP.

[edited by: lucy24 at 1:46 am (utc) on May 5, 2017]

keyplyr

1:44 am on May 5, 2017 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Yes, I am seeing similar shenanigans...

keyplyr

5:48 am on May 5, 2017 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



The ranges you list are all hetzner.de. They have always used hetzner.de. The numeric range makes little difference; it has more to do with how many server nodes the company is paying for, which can be spread across different machines at the host.

However, I'm assuming that now they have their own crawl range assigned at hetzner.de (see 1st post) they will likely come only from there after the others run out.

TorontoBoy

12:12 am on May 6, 2017 (gmt 0)

5+ Year Member Top Contributors Of The Month



They visit me daily. Yes, they read Robots.txt, but then ignore it. Blexbot hits me with force.
SetEnvIf User-Agent BLEXBot keep_out

lucy24

2:52 am on May 6, 2017 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Yes, they read Robots.txt, but then ignore it.

Huh, that's interesting. They've never requested anything in a roboted-out directory of mine. Or did you mean that you deny them by name and they keep making requests? Unless there was a glitch in my record-keeping, they understood when they found their name in the middle of a User-Agent: list. (Very rarely, robots only understand if they get a paragraph to themselves.)

You have to concede it does not currently appear to be the most intelligent of robots ...

keyplyr

3:06 am on May 6, 2017 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Sent them an email concerning their bot's erratic behavior. Yesterday I had close to 400 404s from BLEXBot on one site alone. Unless they clean up their act, the next step will be blocking them. I hate to do it because they offer a product that indirectly benefits my interests, but running a bot that stupid should have ramifications :)

TorontoBoy

3:10 am on May 6, 2017 (gmt 0)

5+ Year Member Top Contributors Of The Month



148.251.244.204[29/Apr/2017:15:58:27GET /robots.txt HTTP/1.1 200 709 - Mozilla/5.0 (compatible; BLEXBot/1.0; +http://webmeup-crawler.com/)
148.251.244.204[29/Apr/2017:15:58:28GET /wp/?p=4176&buy-cephalexin-no-prescription HTTP/1.1 403 13 - Mozilla/5.0 (compatible; BLEXBot/1.0; +http://webmeup-crawler.com/)...

I allow anyone to read my robots.txt, but then send them 403s. Still, they visit me every day, like a dear friend. Oh BLEXbot...We are just not right for each other.

Out of its 120 server requests/day, every day, BLEXy will read my robots.txt 3-4 times. And ignore it. This has been going on for over a year.

keyplyr

3:20 am on May 6, 2017 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



TorontoBoy - You're saying that your robots.txt disallows BLEXbot...
User-agent: BLEXBot
Disallow: /
...but the bot goes on to make requests anyway?
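To see what a rule like that tells a compliant parser, here is a minimal sketch with Python's standard `urllib.robotparser`. It only shows how the file is interpreted; it says nothing about what BLEXBot itself does. The helper name `is_allowed`, the `example.com` URL and the name `SomeOtherBot` are placeholders for illustration.

```python
from urllib.robotparser import RobotFileParser

# The two-line rule quoted above.
ROBOTS_TXT = """\
User-agent: BLEXBot
Disallow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Parse robots_txt and report whether user_agent may fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # Pass the bot's name token rather than the full UA string.
    print(is_allowed(ROBOTS_TXT, "BLEXBot", "http://example.com/page.html"))
    print(is_allowed(ROBOTS_TXT, "SomeOtherBot", "http://example.com/page.html"))
```

With only this rule in the file, the named bot is refused everywhere and any robot not named is still allowed.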

They say:
we of course take any request to desist crawling any site... If this is the case for you please don't hesitate to contact us at customercare@webmeup.com
So if you don't want them to crawl your site, try emailing them. They will need your site's IP address in addition to your domain name.

Give it 30 days and let us know if that worked, eh?

lucy24

5:37 am on May 6, 2017 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I allow anyone to read my robots.txt, but then send them 403s.

Yes, fine, but what does your robots.txt say about BLEXBot?

running a bot that stupid should have ramifications

That belongs on a bumper sticker :)

TorontoBoy

12:24 pm on May 6, 2017 (gmt 0)

5+ Year Member Top Contributors Of The Month



I really find robots.txt a near useless, totally ineffective and outdated guideline. Most bots either ignore it, or read it to find out where they are not supposed to go and then purposely go there. Robots.txt is the stop sign that everyone does a rolling stop through.

I'll add them to my robots.txt and in 30 days we'll see.

Contacting each rogue bot owner or host is a lot of work and I find it rarely works. Bots should be smart enough to know that if you are receiving a mouthful of 403s you really should go elsewhere. There are too many rogue bots to try to contact, way too many. We cannot even get bot owners to self-identify; most prefer to hide behind the anonymity of Mozilla. Most don't read the robots.txt, so I don't usually bother synching my robots.txt to my ban list.

lucy24

7:49 pm on May 6, 2017 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I'll add them to my robots.txt

So they were guilty of ignoring a directive that doesn’t exist? Imagine the brazen nerve of them.

keyplyr

7:58 pm on May 8, 2017 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Since they haven't taken any measures to correct their crawl, I just disallowed BLEXBot via robots.txt. We'll try that and see what happens.

I really find robots.txt a near useless, totally ineffective and outdated guideline. Most bots either ignore it, or read it to find out where they are not supposed to go and then purposely go there. Robots.txt is the stop sign that everyone does a rolling stop through.

Well that's pretty much always been the case. That's why it works so well for filtering. Good bots support it, bad bots don't. It's just that there were fewer bad bots a few years ago.

lucy24

10:07 pm on May 8, 2017 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Since they haven't taken any measures to correct their crawl,

Aw, phooey. I thought they'd finally got it fixed, based on yesterday's log wrangling, but I guess they were just taking the weekend off. I ran over to take a look, and today's logs are filled with
148.251.244.204 - - [08/May/2017:04:10:27 -0700] "GET /ebooks/hhtravel/023183.html HTTP/1.1" 404 1462 "-" "Mozilla/5.0 (compatible; BLEXBot/1.0; +http://webmeup-crawler.com/)" 
148.251.244.204 - - [08/May/2017:04:11:06 -0700] "GET /ebooks/hhtravel/023201.html HTTP/1.1" 404 1463 "-" "Mozilla/5.0 (compatible; BLEXBot/1.0; +http://webmeup-crawler.com/)"
148.251.244.204 - - [08/May/2017:04:11:28 -0700] "GET /ebooks/hhtravel/023216.html HTTP/1.1" 404 1463 "-" "Mozilla/5.0 (compatible; BLEXBot/1.0; +http://webmeup-crawler.com/)"
148.251.244.204 - - [08/May/2017:04:11:36 -0700] "GET /ebooks/hhtravel/023217.html HTTP/1.1" 404 1463 "-" "Mozilla/5.0 (compatible; BLEXBot/1.0; +http://webmeup-crawler.com/)"
148.251.244.204 - - [08/May/2017:04:11:40 -0700] "GET /ebooks/hhtravel/023234.html HTTP/1.1" 404 1463 "-" "Mozilla/5.0 (compatible; BLEXBot/1.0; +http://webmeup-crawler.com/)"
148.251.244.204 - - [08/May/2017:04:11:44 -0700] "GET /ebooks/hhtravel/023236.html HTTP/1.1" 404 1463 "-" "Mozilla/5.0 (compatible; BLEXBot/1.0; +http://webmeup-crawler.com/)"
... which, incidentally, suggests that somebody out there has some really, really useless URLs. (They're not dates, which would have been the only possible justification.)
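For a per-day tally of these rather than eyeballing the log, a rough sketch follows. It assumes the log uses the same combined format as the lines above; the regular expression `LOG_LINE` and the function `count_bot_404s` are hypothetical names for illustration, and lines in any other layout are simply skipped.

```python
import re
from collections import Counter

# Loose pattern for combined-format lines like those quoted above.
LOG_LINE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<day>[^:]+):[^\]]+\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) \S+ "[^"]*" "(?P<agent>[^"]*)"'
)


def count_bot_404s(lines, bot="BLEXBot"):
    """Count 404 responses per day for user agents containing bot."""
    counts = Counter()
    for line in lines:
        m = LOG_LINE.match(line)
        if m and m["status"] == "404" and bot in m["agent"]:
            counts[m["day"]] += 1
    return counts
```

Feeding it an open log file, e.g. `count_bot_404s(open("access.log"))` with a placeholder file name, returns a `Counter` keyed by date strings such as `08/May/2017`.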