crawler4j

         

keyplyr

3:10 am on Jul 24, 2018 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



UA: crawler4j (https://github.com/yasserg/crawler4j/)
Protocol: HTTP/1.1
Robots.txt: No
Host: softlayer.com
198.23.64.0 - 198.23.127.255
198.23.64.0/18
Open Source Web Crawler for Java

Previous mentions:
[webmasterworld.com...]
[webmasterworld.com...]
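
For anyone who wants to test a logged IP against the range quoted above, here is a minimal sketch using Python's standard ipaddress module. The function name and the sample addresses are made up for illustration; only the /18 itself comes from the post.

```python
import ipaddress

# The /18 quoted in the post above; everything else here is illustrative.
REPORTED_RANGE = ipaddress.ip_network("198.23.64.0/18")


def in_reported_range(ip: str) -> bool:
    """Return True if the given IPv4 address falls inside the reported /18."""
    return ipaddress.ip_address(ip) in REPORTED_RANGE


if __name__ == "__main__":
    # First and last address of the network, to compare with the dotted range.
    print(REPORTED_RANGE[0], "-", REPORTED_RANGE[-1])
    print(in_reported_range("198.23.100.7"))   # hypothetical address inside the range
    print(in_reported_range("198.23.128.1"))   # hypothetical address just outside it
```

The first and last addresses of the network should match the dotted range listed in the post, which is a quick way to confirm the CIDR notation and the range describe the same block.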

lucy24

3:48 am on Jul 24, 2018 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Well, ###. I didn't notice that they'd stopped asking for robots.txt. I authorized them a while back after establishing that they only understand Disallow: lines if they get a block to themselves. It may be time for a reassessment.

Within the past year or so, seen from:
158.69 (various from 240 on up)
192.99.101
160.16.241

Yup. Time for reassessment.
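
To illustrate what "a block to themselves" looks like, here is a rough sketch of a robots.txt with a dedicated crawler4j group, checked with Python's standard urllib.robotparser. The paths are invented, and this only shows how a standards-following parser reads such a file; it says nothing about how crawler4j itself parses robots.txt.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: the paths are made up for illustration.
# crawler4j gets its own User-agent group instead of sharing one.
ROBOTS_TXT = """\
User-agent: crawler4j
Disallow: /private/

User-agent: *
Disallow: /cgi-bin/
"""


def allowed(user_agent: str, path: str) -> bool:
    """Ask Python's parser whether user_agent may fetch path under ROBOTS_TXT."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, path)


if __name__ == "__main__":
    print(allowed("crawler4j", "/private/page.html"))
    print(allowed("crawler4j", "/index.html"))
    print(allowed("SomeOtherBot", "/cgi-bin/test"))
```

Of course, none of this matters for a visitor that never requests robots.txt in the first place.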

keyplyr

3:54 am on Jul 24, 2018 (gmt 0)




Quote: "I authorized them a while back"
You authorized them? Who's them?

You do understand this is just a tool anyone can use for anything they want, like stealing all your web property, right?

(just playing the Devil's Advocate)

lucy24

6:35 am on Jul 24, 2018 (gmt 0)




Until I went back to take a closer look, I hadn't realized just how widely used it is. In fact there are two entirely different UA strings containing the “crawler4j” element:

crawler4j (http://code.google.com/p/crawler4j/)
(that one dates back to at least 2013, seen as recently as late 2017, but I'll bet if it were up-to-date it would say https)

crawler4j (https://github.com/yasserg/crawler4j/)
(isolated sighting late 2015, but otherwise not until late 2017)

In any case, I forgot that somewhere along the line I'd added the "github" element to my bad_agent list, so they were getting blocked regardless. Heh.
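
As a rough sketch of why a "github" entry catches one of the two UA strings and not the other, here is a case-insensitive substring check in Python. The function and the token tuples are hypothetical stand-ins; the actual bad_agent list lives in server configuration whose format isn't shown in this thread.

```python
# The two UA strings quoted in this thread.
UA_GOOGLE_CODE = "crawler4j (http://code.google.com/p/crawler4j/)"
UA_GITHUB = "crawler4j (https://github.com/yasserg/crawler4j/)"


def is_blocked(user_agent: str, bad_tokens: tuple) -> bool:
    """Return True if any token appears in the UA string, ignoring case."""
    ua = user_agent.lower()
    return any(token.lower() in ua for token in bad_tokens)


if __name__ == "__main__":
    # A list containing only "github" catches the newer UA but not the older one.
    print(is_blocked(UA_GITHUB, ("github",)))
    print(is_blocked(UA_GOOGLE_CODE, ("github",)))
    # Adding "crawler4j" itself covers both variants.
    print(is_blocked(UA_GOOGLE_CODE, ("github", "crawler4j")))
```

Matching on the shared "crawler4j" element covers both variants, whereas "github" alone only catches the newer string.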

keyplyr

6:49 am on Jul 24, 2018 (gmt 0)




The github UA is the default when the user downloads the crawler from the repository.

The Google UA appears when the bot is run from the open source platform of the Google Developers group.

Same bot, different botrunners. Heh.