BuzzSumo

keyplyr

10:13 pm on Oct 20, 2015 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



UA: Mozilla/5.0 (compatible; BuzzSumo; +http://www.buzzsumo.com/bot.html)
Protocol: HTTP/1.1
Robots.txt: No
Host: Nobis Technology Group (a hosting server farm)
23.104.0.0 - 23.111.255.255
23.104.0.0/13

Their bot page simply says...
Please email us at help@buzzsumo.com with the subject "Stop Crawling" and your domain name/website if you want us to stop crawling your site.
So, a stupid bot that doesn't support robots.txt.

keyplyr

4:25 am on Oct 21, 2015 (gmt 0)

This thing is insatiable - beware!

Just in the last few hours it has been eating a lot of 403s at 3 of my sites, sometimes across several sub-ranges, from the following:

- Nobis -
23.80.0.0 - 23.83.255.255
23.80.0.0/14
23.104.0.0 - 23.111.255.255
23.104.0.0/13
64.120.0.0 - 64.120.127.255
64.120.0.0/17

- ColoCrossing -
23.94.0.0 - 23.95.255.255
23.94.0.0/15
192.3.0.0 - 192.3.255.255
192.3.0.0/16
198.23.128.0 - 198.23.255.255
198.23.128.0/17

- Google Cloud -
23.236.48.0 - 23.236.63.255
23.236.48.0/20

- braveway.com -
104.218.195.0 - 104.218.195.255
104.218.195.0/24

- Quadranet -
155.94.128.0 - 155.94.255.255
155.94.128.0/17
204.44.64.0 - 204.44.127.255
204.44.64.0/18

- OVH -
158.69.0.0 - 158.69.255.255
158.69.0.0/16
167.114.0.0 - 167.114.255.255
167.114.0.0/16
192.99.0.0 - 192.99.255.255
192.99.0.0/16

So either the agent is distributed, or it has leased all these accounts, or it has infected them. I did in fact send that email to them. I hope that wasn't a mistake :)
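For anyone who wants to check log entries against ranges like these, Python's stdlib `ipaddress` module makes the membership test trivial. A minimal sketch using the CIDRs listed above (the helper name is mine, not from any tool):

```python
import ipaddress

# CIDR ranges reported in this thread
BLOCKED = [ipaddress.ip_network(c) for c in (
    "23.80.0.0/14", "23.104.0.0/13", "64.120.0.0/17",    # Nobis
    "23.94.0.0/15", "192.3.0.0/16", "198.23.128.0/17",   # ColoCrossing
    "23.236.48.0/20",                                     # Google Cloud
    "104.218.195.0/24",                                   # braveway.com
    "155.94.128.0/17", "204.44.64.0/18",                  # Quadranet
    "158.69.0.0/16", "167.114.0.0/16", "192.99.0.0/16",   # OVH
)]

def is_blocked(ip: str) -> bool:
    """Return True if ip falls inside any of the listed CIDR ranges."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in BLOCKED)
```

For example, `is_blocked("23.105.1.1")` is True because 23.104.0.0/13 covers 23.104.x.x through 23.111.x.x.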

keyplyr

8:00 pm on Oct 21, 2015 (gmt 0)

Received email reply:
I just wanted to confirm that your domain has been removed. Let me know if you need anything else.
The "anything else" would be to support the rights of webmasters by supporting robots.txt. That should be the very first file requested by any bot.
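A compliant crawler would do exactly that: fetch robots.txt first and honor it. A sketch of what honoring it looks like, using Python's stdlib `urllib.robotparser` (the rules here are a hypothetical deny-by-name, not BuzzSumo's actual file):

```python
from urllib import robotparser

# Hypothetical robots.txt content denying the bot by name
rules = """
User-agent: BuzzSumo
Disallow: /
"""

rp = robotparser.RobotFileParser()
rp.parse(rules.splitlines())

# A well-behaved "BuzzSumo" bot would run this check before every fetch
rp.can_fetch("BuzzSumo", "/any/page.html")       # False: denied by name
rp.can_fetch("SomeOtherBot", "/any/page.html")   # True: no rule applies
```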

lucy24

9:48 pm on Oct 21, 2015 (gmt 0)

I just wanted to confirm that your domain has been removed.

You're a braver man than I. I would worry that the next thing you'll see is a slew of referer spam now that they've established that someone at your site reads logs.

keyplyr

10:20 pm on Oct 21, 2015 (gmt 0)

You're a braver man than I [Gunga Din]

eeek

8:53 pm on Nov 12, 2015 (gmt 0)

The large number of IP addresses certainly doesn't make me feel good about this bot. Neither does its continuing to eat 403 responses.

lucy24

10:21 pm on Nov 12, 2015 (gmt 0)

The large number of IP addresses

There do exist law-abiding distributed crawlers [webmasterworld.com]. MJ12 comes to mind*.

continuing to eat 403 responses

Given the choice, I'd rather have a robot blindly working through a shopping list than one which receives a handful of 403s and promptly goes away. Robots with rudimentary intelligence are scary.


* ... although maybe it should stop coming to mind. I checked raw logs and found a surprising number of requests for a roboted-out directory, most recently May 2014 on one site. Far as I can remember, the directory has always been roboted-out. Did they change their ways after that?

eeek

10:55 pm on Nov 12, 2015 (gmt 0)

Since when is MJ12 legit?

keyplyr

11:54 pm on Nov 12, 2015 (gmt 0)

Since when is MJ12 legit?
"Legit" may be a subjective term, but any company that builds its business model on aggregating my intellectual property into a product it sells does not qualify as a beneficial agent, IMO.

As I remember, it kept hitting my server for weeks even after I installed their suggested robots.txt deny code; a real PITA, so I eventually found it necessary to block MJ12.

eeek

12:00 pm on Nov 13, 2015 (gmt 0)

Given the choice, I'd rather have a robot blindly working through a shopping list than one which receives a handful of 403s and promptly goes away.


I'm not talking about a "handful" of 403 responses. I have millions of pages; they should get the hint and I shouldn't need to firewall huge address ranges.

keyplyr

12:34 pm on Nov 13, 2015 (gmt 0)

"firewall huge address ranges"

Can't you block the UA? IMO it seems futile to block IPs if a bot is distributed.
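UA blocking is normally a server-config job (.htaccess, SetEnvIf, etc.), but the underlying test is just a case-insensitive substring match. A sketch of the idea as it might sit at the front of a request handler (the token list is illustrative, not exhaustive):

```python
# Known bad-bot tokens to match against the User-Agent header
# (illustrative list; extend as your logs dictate)
BAD_UA_TOKENS = ("buzzsumo", "mj12bot")

def should_block(user_agent: str) -> bool:
    """Return True (i.e. respond 403) when the UA contains a listed token."""
    ua = user_agent.lower()
    return any(token in ua for token in BAD_UA_TOKENS)
```

The UA string from the first post, `Mozilla/5.0 (compatible; BuzzSumo; +http://www.buzzsumo.com/bot.html)`, would match the `buzzsumo` token and get a 403.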

lucy24

7:06 pm on Nov 13, 2015 (gmt 0)

they should get the hint

That was actually my point. Robots that take hints are more intelligent, and therefore scarier, than robots that plow through a whole shopping list regardless of response.

This doesn't, of course, apply to law-abiding robots-- the ones that read robots.txt, find themselves excluded by name, and then refrain from making further requests. (It's admittedly rare to meet these living at blocked ranges, but it's worth trying robots.txt, since the only thing better than a blocked request is no request at all.)

eeek

10:41 pm on Nov 13, 2015 (gmt 0)

Can't you block the UA?


Of course I can and I have. But I prefer not to have thousands of requests to process.