MauiBot

keyplyr

9:37 pm on Mar 30, 2018 (gmt 0)

UA: MauiBot (crawler.feedback+wc@gmail.com)
Protocol: HTTP/1.1
Robots.txt: Yes
Host: AWS
54.160.0.0 - 54.175.255.255
54.160.0.0/12
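
For anyone who wants to test a logged address against the block quoted above, here is a minimal sketch using Python's standard `ipaddress` module. The helper name `in_quoted_range` is made up for illustration, and the check only covers the single range given in this post, not every address the bot might use.

```python
import ipaddress

# The CIDR block quoted in the post above.
QUOTED_RANGE = ipaddress.ip_network("54.160.0.0/12")


def in_quoted_range(ip: str) -> bool:
    """Return True if the address falls inside 54.160.0.0/12 (hypothetical helper)."""
    return ipaddress.ip_address(ip) in QUOTED_RANGE


# Cross-check that the first-last span in the post collapses to the same CIDR.
span = list(
    ipaddress.summarize_address_range(
        ipaddress.ip_address("54.160.0.0"),
        ipaddress.ip_address("54.175.255.255"),
    )
)

if __name__ == "__main__":
    print(span)
    print(in_quoted_range("54.170.10.20"))
```

The second part is only a sanity check that the dotted range and the /12 notation describe the same block.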

MitchNginx

10:09 am on Mar 31, 2018 (gmt 0)

Hi keyplyr, I also picked this up in my logs this morning. This one may soon warrant blocking their IP addresses/ranges too, but my blocker kicked them off anyway. Will keep monitoring these guys.

35.153.*.* - - [31/Mar/2018:09:26:51 +0200] "GET /robots.txt HTTP/1.1" 444 0 "-" "MauiBot (crawler.feedback+wc@gmail.com)" "-"PORT:80 0.000 - . "GZIP:-"

lucy24

5:45 pm on Mar 31, 2018 (gmt 0)

The name seemed naggingly familiar, but it took a case-insensitive search before I remembered “MAUI WAP browser”. No relation, I suppose.

And that’s why NoCase or [NC] needs to be used with extreme caution.
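
A small illustration of the point about case-insensitive matching, written in Python rather than Apache syntax. The pattern `Maui` and the second user-agent string are assumptions chosen for the example; the only UA taken from this thread is the MauiBot one.

```python
import re

user_agents = [
    "MauiBot (crawler.feedback+wc@gmail.com)",  # from this thread
    "MAUI WAP Browser",                         # illustrative string, assumed
]

# Same pattern, with and without the case-insensitive flag
# (the flag plays the role of NoCase / [NC] here).
case_sensitive = re.compile(r"Maui")
case_insensitive = re.compile(r"Maui", re.IGNORECASE)


def matching(pattern, agents):
    """Return the user agents that the pattern matches anywhere in the string."""
    return [ua for ua in agents if pattern.search(ua)]


if __name__ == "__main__":
    print(matching(case_sensitive, user_agents))
    print(matching(case_insensitive, user_agents))
```

With the flag set, the unrelated browser is caught along with the robot, which is the kind of collateral match being warned about.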

lucy24

9:15 pm on Apr 1, 2018 (gmt 0)

Follow-up: Based on its attested behavior in the last few days' logs, this may intend to be a compliant robot. (Loads of requests, but nothing in a roboted-out directory.) I'll see what happens after I Disallow.

keyplyr

9:25 pm on Apr 1, 2018 (gmt 0)

It requested robots.txt 60 times yesterday at one of my sites (the only site to see it.) IMO this means nothing regarding compliance. We don't know what they plan to do with our files, and actually I don't care if an agent respects robots.txt or not.

My criteria for allowing remote actors to use my property is benefit. If they are not benefitting my interests in some way, they can't have access. There is too much activity on the net not to be idiocentric.

If they're not benefiting you, they're benefiting themselves or someone else.

keyplyr

3:59 am on Apr 4, 2018 (gmt 0)

[Update]
Since I disallowed MauiBot in robots.txt, it hasn't requested other files.
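
The exact robots.txt rule was not posted, so the one below is an assumption: a blanket Disallow for MauiBot with everyone else left open. The sketch feeds it to Python's standard `urllib.robotparser` to show how a compliant crawler would read such a rule.

```python
from urllib.robotparser import RobotFileParser

# Assumed robots.txt content; the real file was not shown in the thread.
ROBOTS_TXT = """\
User-agent: MauiBot
Disallow: /

User-agent: *
Disallow:
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# A compliant MauiBot should fetch nothing except robots.txt itself.
mauibot_allowed = parser.can_fetch("MauiBot", "/index.html")

# Other agents fall through to the catch-all record ("SomeOtherBot" is made up).
other_allowed = parser.can_fetch("SomeOtherBot", "/index.html")

if __name__ == "__main__":
    print(mauibot_allowed, other_allowed)
```

This only models what a well-behaved parser concludes from the file; it says nothing about whether a given robot actually honors it.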

lucy24

12:39 am on Apr 6, 2018 (gmt 0)

"My criteria for allowing remote actors to use my property is benefit."

If I come home to find that someone has been in my house, and I know this because they have vacuumed the rug, done the laundry, washed the dishes and cooked me a gourmet dinner ... they’re still housebreakers. (One of the Discworld books has a great riff on this theme. I can’t remember the nice technical term they came up with.)

I, too, have seen a whole lot of MauiBot requests for robots.txt, and nothing else since they were disallowed.

Do you suppose the MauiBot is “crawler” by a new name?

keyplyr

12:49 am on Apr 6, 2018 (gmt 0)

Say what?

jehoshua

5:16 am on Apr 6, 2018 (gmt 0)

Most of the requests for the past few days have been from MauiBot (crawler.feedback+wc@gmail.com). Not sure whether to ban the IP or not.

keyplyr

5:40 am on Apr 6, 2018 (gmt 0)

jehoshua - well that's the problem when the bot owner does not include a link to an info page describing who they are and what they do with our data.

Personally, I block all Amazon (AWS) IP ranges, but allow beneficial agents through. So if they don't provide info they are beneficial, I don't allow them.

jehoshua

6:06 am on Apr 6, 2018 (gmt 0)

"jehoshua - well that's the problem when the bot owner does not include a link to an info page describing who they are and what they do with our data."

Thanks, I have disallowed that one. :)

keyplyr

7:51 am on Apr 6, 2018 (gmt 0)

Amazon (AWS) IP ranges [webmasterworld.com]

lucy24

10:49 pm on Apr 17, 2018 (gmt 0)

More about this robot's behavior.

I disallowed them in robots.txt at the beginning of the month. Normally I reassess access controls once a month; this time I felt so sorry for them, I removed the disallow and poked the appropriate holes after 10 days. Well, they were just so polite ...

Site 1:
IP: exactly 54.234.aa.bb for all requests, beginning about 2 days before I authorized them.

Requests: top-to-bottom spidering, although earlier requests (when they were blocked but not disallowed) suggest they knew about certain interior pages already.

Crawl frequency: clumps of 3-6 requests in a single second, followed by a gap of pretty exactly 30 seconds.

Site 2:
IP: different from Site 1, but same pattern: about 2 days before I authorized them, they settled on a single IP all the time. (I've noticed the same thing in some European search engines: for any given site, they always crawl from the identical IP.)

Requests: proceeded directly to selected interior pages, although earlier requests were only for top-level directories (linked from front page and 403 page). In other words, the exact opposite of their Site 1 behavior.

Final quirk: Site 2 is HTTPS. Requests came through on HTTP and were redirected to HTTPS, so nothing except robots.txt got a 200. To date they have not followed up the redirects; I waited before posting to see if they'd be back, but it's been almost a week.
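
One way to check the Site 1 crawl rhythm described above (clumps of requests in one second, then a gap of about 30 seconds) against your own logs is to group request timestamps into bursts. This is a rough sketch: the function names and the one-second grouping threshold are assumptions, and the timestamps in the usage part are invented, not taken from any real log.

```python
from datetime import datetime, timedelta


def group_bursts(timestamps, max_gap=timedelta(seconds=1)):
    """Group timestamps into bursts; a new burst starts when the gap exceeds max_gap."""
    bursts = []
    for ts in sorted(timestamps):
        if bursts and ts - bursts[-1][-1] <= max_gap:
            bursts[-1].append(ts)
        else:
            bursts.append([ts])
    return bursts


def gaps_between(bursts):
    """Seconds from the last request of each burst to the first of the next."""
    return [(b[0] - a[-1]).total_seconds() for a, b in zip(bursts, bursts[1:])]


if __name__ == "__main__":
    # Invented sample: three clumps, 30 seconds apart.
    start = datetime(2018, 4, 15, 12, 0, 0)
    sample = (
        [start] * 4
        + [start + timedelta(seconds=30)] * 3
        + [start + timedelta(seconds=60)] * 5
    )
    bursts = group_bursts(sample)
    print([len(b) for b in bursts], gaps_between(bursts))
```

Feeding it the parsed timestamps for a single user agent should make a fixed crawl delay like this one easy to spot.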
