MauiBot

         

keyplyr

9:37 pm on Mar 30, 2018 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month




UA: MauiBot (crawler.feedback+wc@gmail.com)
Protocol: HTTP/1.1
Robots.txt: Yes
Host: AWS
54.160.0.0 - 54.175.255.255
54.160.0.0/12
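
As a quick sanity check, the dotted range and the CIDR block above describe the same set of addresses. Here is a minimal sketch using Python's standard-library `ipaddress` module. The helper name `in_block` and the sample addresses are only for illustration and do not come from anyone's logs.

```python
import ipaddress

# The AWS block quoted above.
BLOCK = ipaddress.ip_network("54.160.0.0/12")


def in_block(addr: str, block=BLOCK) -> bool:
    """Return True if addr falls inside the given CIDR block."""
    return ipaddress.ip_address(addr) in block


# First and last addresses of the /12, for comparison with the dotted range.
print(BLOCK[0], "-", BLOCK[-1])

# Illustrative addresses (assumptions, not taken from any log).
print(in_block("54.167.10.20"))  # inside the block
print(in_block("35.153.0.1"))    # outside the block
```

A request from outside 54.160.0.0/12 would not be caught by a rule for this block alone, so a range-based block is only as good as the list of ranges behind it.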

MitchNginx

10:09 am on Mar 31, 2018 (gmt 0)

5+ Year Member



Hi keyplyr, I also picked this up in my logs this morning. This one may soon warrant blocking their IP addresses/ranges too, but my blocker kicked them off anyway. I will keep monitoring these guys.

35.153.*.* - - [31/Mar/2018:09:26:51 +0200] "GET /robots.txt HTTP/1.1" 444 0 "-" "MauiBot (crawler.feedback+wc@gmail.com)" "-"PORT:80 0.000 - . "GZIP:-"

lucy24

5:45 pm on Mar 31, 2018 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



The name seemed naggingly familiar, but it took a case-insensitive search before I remembered “MAUI WAP browser”. No relation, I suppose.

And that’s why NoCase or [NC] needs to be used with extreme caution.
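
To illustrate the point about case-insensitive matching, here is a small Python sketch. The regex with `re.IGNORECASE` stands in for an Apache pattern with [NC]; it is not Apache syntax. The second user-agent string is a made-up example containing "MAUI WAP Browser", since the exact real-world string is not quoted in this thread.

```python
import re

UAS = [
    "MauiBot (crawler.feedback+wc@gmail.com)",
    "ExamplePhone/1.0 MAUI WAP Browser",  # hypothetical UA, for illustration only
]

# Loose pattern: case-insensitive, roughly what an [NC] flag would give you.
loose = re.compile(r"maui", re.IGNORECASE)

# Strict pattern: case-sensitive and anchored on the full robot name.
strict = re.compile(r"MauiBot")


def matches(pattern, uas=UAS):
    """Return the user-agent strings that the pattern matches."""
    return [ua for ua in uas if pattern.search(ua)]


print(matches(loose))   # catches both strings
print(matches(strict))  # catches only the robot
```

The loose pattern would block the browser along with the robot, which is the kind of collateral damage the caution is about.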

lucy24

9:15 pm on Apr 1, 2018 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Follow-up: Based on its attested behavior in the last few days' logs, this may intend to be a compliant robot. (Loads of requests, but nothing in a roboted-out directory.) I'll see what happens after I Disallow.

keyplyr

9:25 pm on Apr 1, 2018 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



It requested robots.txt 60 times yesterday at one of my sites (the only site to see it). IMO this means nothing regarding compliance. We don't know what they plan to do with our files, and actually I don't care if an agent respects robots.txt or not.

My criteria for allowing remote actors to use my property is benefit. If they are not benefiting my interests in some way, they can't have access. There is too much activity on the net not to be idiocentric.

If they're not benefiting you, they're benefiting themselves or someone else.

keyplyr

3:59 am on Apr 4, 2018 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month




[Update]
Since I disallowed MauiBot in robots.txt, it hasn't requested other files.
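
For anyone who wants to test a rule like this before deploying it, here is a minimal sketch using Python's standard-library `urllib.robotparser`. The robots.txt content is an assumed example (the actual file was not posted), and `SomeOtherBot` and the `/private/` rule are placeholders.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: shut out MauiBot entirely and leave other
# robots subject only to a sample /private/ rule.
ROBOTS_TXT = """\
User-agent: MauiBot
Disallow: /

User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# How a compliant parser would read the rules for each agent.
print(parser.can_fetch("MauiBot", "/index.html"))               # False
print(parser.can_fetch("SomeOtherBot", "/index.html"))          # True
print(parser.can_fetch("SomeOtherBot", "/private/notes.html"))  # False
```

This only shows how a compliant parser reads the rules. Whether a given robot obeys them still has to be confirmed from the logs.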

lucy24

12:39 am on Apr 6, 2018 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



“My criteria for allowing remote actors to use my property is benefit.”
If I come home to find that someone has been in my house, and I know this because they have vacuumed the rug, done the laundry, washed the dishes and cooked me a gourmet dinner ... they’re still housebreakers. (One of the Discworld books has a great riff on this theme. I can’t remember the nice technical term they came up with.)

I, too, have seen a whole lot of MauiBot requests for robots.txt, and nothing else since they were disallowed.

Do you suppose the MauiBot is “crawler” by a new name?

keyplyr

12:49 am on Apr 6, 2018 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Say what?

jehoshua

5:16 am on Apr 6, 2018 (gmt 0)

10+ Year Member Top Contributors Of The Month



Most of the requests for the past few days have been from MauiBot (crawler.feedback+wc@gmail.com). Not sure whether to ban the IP or not.

keyplyr

5:40 am on Apr 6, 2018 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



jehoshua - well that's the problem when the bot owner does not include a link to an info page describing who they are and what they do with our data.

Personally, I block all Amazon (AWS) IP ranges, but allow beneficial agents through. So if they don't provide info showing they are beneficial, I don't allow them.

jehoshua

6:06 am on Apr 6, 2018 (gmt 0)

10+ Year Member Top Contributors Of The Month



"jehoshua - well that's the problem when the bot owner does not include a link to an info page describing who they are and what they do with our data."


Thanks, I have disallowed that one. :)

keyplyr

7:51 am on Apr 6, 2018 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Amazon (AWS) IP ranges [webmasterworld.com]

lucy24

10:49 pm on Apr 17, 2018 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



More about this robot's behavior.

I disallowed them in robots.txt at the beginning of the month. Normally I reassess access controls once a month; this time I felt so sorry for them, I removed the disallow and poked the appropriate holes after 10 days. Well, they were just so polite ...

Site 1:
IP: exactly 54.234.aa.bb for all requests, beginning about 2 days before I authorized them.

Requests: top-to-bottom spidering, although earlier requests (when they were blocked but not disallowed) suggest they knew about certain interior pages already.

Crawl frequency: clumps of 3-6 requests in a single second, followed by a gap of pretty exactly 30 seconds.
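
A crawl pattern like this can be pulled out of log timestamps with a few lines of Python. This is a rough sketch: the function names are invented, and the sample timestamps are made up to mimic the described shape rather than taken from any real log.

```python
from datetime import datetime, timedelta


def group_bursts(timestamps, max_gap=timedelta(seconds=1)):
    """Group timestamps into bursts; a new burst starts whenever the gap
    to the previous request exceeds max_gap."""
    bursts = []
    for ts in sorted(timestamps):
        if bursts and ts - bursts[-1][-1] <= max_gap:
            bursts[-1].append(ts)
        else:
            bursts.append([ts])
    return bursts


def gaps_between(bursts):
    """Seconds from the last request of each burst to the first of the next."""
    return [(b[0] - a[-1]).total_seconds() for a, b in zip(bursts, bursts[1:])]


# Made-up timestamps (one-second log resolution), for illustration only.
start = datetime(2018, 4, 15, 12, 0, 0)
sample = (
    [start] * 4
    + [start + timedelta(seconds=30)] * 3
    + [start + timedelta(seconds=60)] * 6
)

bursts = group_bursts(sample)
print([len(b) for b in bursts])  # [4, 3, 6]
print(gaps_between(bursts))      # [30.0, 30.0]
```

The one-second `max_gap` is an arbitrary choice that tolerates a clump straddling a second boundary. Tighten it to zero to require identical timestamps.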

Site 2:
IP: different from Site 1, but same pattern: about 2 days before I authorized them, they settled on a single IP all the time. (I've noticed the same thing in some European search engines: for any given site, they always crawl from the identical IP.)

Requests: proceeded directly to selected interior pages, although earlier requests were only for top-level directories (linked from front page and 403 page). In other words, the exact opposite of their Site 1 behavior.

Final quirk: Site 2 is HTTPS. Requests came through on HTTP and were redirected to HTTPS, so nothing except robots.txt got a 200. To date they have not followed up the redirects; I waited before posting to see if they'd be back, but it's been almost a week.