CCBot

another Nutch Variant

         

Ocean10000

1:50 pm on Mar 27, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



CCBot/1.0 (+http://www.commoncrawl.org/bot.html)

Just saw this bot come by today. It obeyed robots.txt well enough: it fetched the file, read it, and left without a problem. The link in the User-Agent doesn't work, but they do have a FAQ page that covers the basics. They currently claim to be seeking 501(c)(3) status as a non-profit, the "CommonCrawl Foundation". They use the same line about creating a new wave of search engine but do not state anything beyond that. After reading their FAQ, I found that this web crawler is based on the Nutch crawler (this is noted in the FAQ).

The current IP range listed on the website for this bot is:
38.103.63.16 through 38.103.63.18
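
For anyone who wants to test log entries against that range, here is a minimal sketch using Python's standard `ipaddress` module. The function name `in_listed_range` is made up for illustration, and the two endpoints are simply the addresses quoted above, so treat them as a snapshot rather than a maintained list.

```python
import ipaddress

# Endpoints copied from the range quoted in the post (inclusive).
RANGE_START = ipaddress.ip_address("38.103.63.16")
RANGE_END = ipaddress.ip_address("38.103.63.18")


def in_listed_range(ip: str) -> bool:
    """Return True if `ip` lies inside the inclusive range quoted above."""
    addr = ipaddress.ip_address(ip)
    if addr.version != 4:
        # The quoted range is IPv4 only; IPv6 addresses never match.
        return False
    return RANGE_START <= addr <= RANGE_END
```

Something like `in_listed_range("38.103.63.17")` would then report whether a visitor's address falls inside the published range.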

keyplyr

11:46 pm on Mar 27, 2008 (gmt 0)


I just ban "nutch" across the board.

IMO there are just way too many of these, most of which disobey robots.txt. If these companies end up succeeding with their business models and they also contribute to mine, then I'll reevaluate them on a case by case basis once they start using their own unique UA.

incrediBILL

11:47 pm on Mar 27, 2008 (gmt 0)


This is a pretty bleeding-edge find, as I just saw it hit on 3/26 for the first time and return today, with no other past bot activity tracked for that IP range.

38.103.63.* "CCBot/1.0 (+http://www.commoncrawl.org/bot.html)"

incrediBILL

11:51 pm on Mar 27, 2008 (gmt 0)


I just ban "nutch" across the board.

That won't help you in this case since it uses "ccbot" as its robot name.

Now you know why I whitelist ;)

OK, actually the crawlers using gibberish strings like "qwewyeowueouweoiu wieuwoie" caused me to whitelist but nutch and heritrix variations would've pushed me in the same direction inevitably so the end result would've been the same.
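
To make the blacklist versus whitelist point concrete, here is a rough sketch in Python. It is not anyone's actual filter: the token lists and function names are hypothetical, and the matching is a plain case-insensitive substring test on the User-Agent string quoted earlier in the thread.

```python
# User-Agent string as posted in this thread.
CCBOT_UA = "CCBot/1.0 (+http://www.commoncrawl.org/bot.html)"

# Hypothetical token lists, for illustration only.
BLOCKED_TOKENS = ("nutch",)                # blacklist: deny if any token matches
ALLOWED_TOKENS = ("googlebot", "mozilla")  # whitelist: deny unless a token matches


def blocked_by_blacklist(user_agent: str, tokens=BLOCKED_TOKENS) -> bool:
    """Return True if the UA contains any blacklisted token."""
    ua = user_agent.lower()
    return any(token in ua for token in tokens)


def allowed_by_whitelist(user_agent: str, tokens=ALLOWED_TOKENS) -> bool:
    """Return True only if the UA contains a whitelisted token."""
    ua = user_agent.lower()
    return any(token in ua for token in tokens)
```

With these lists, `blocked_by_blacklist(CCBOT_UA)` is false because "nutch" never appears in the string, while `allowed_by_whitelist(CCBOT_UA)` is also false, so the whitelist turns the bot away without knowing its name in advance. The same goes for a gibberish UA like "qwewyeowueouweoiu wieuwoie".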

keyplyr

1:32 am on Mar 28, 2008 (gmt 0)


I just ban "nutch" across the board.

That won't help you in this case since it uses "ccbot" as its robot name.

Well, I didn't know the UA string since the OP didn't post it. The thread title is:


CCBot
another Nutch Variant

...so I assumed it had "nutch" somewhere in the UA string. As it turns out, I have "ccbot" banned already, so it can't be that "bleeding edge." LOL

I agree that white listing is the way to go for some sites. I do it on a small scale for a dozen UAs. At some point I may expand white listing across the board to see how well it works for me.

blend27

11:56 am on Mar 28, 2008 (gmt 0)


Isn't that from the PSI range (38.0.0.0/8) anyway?

Ocean10000

1:39 pm on Mar 28, 2008 (gmt 0)


Yes, it is part of that range, and that is one of the things that tripped my bot filter, which is why I even noticed it in the reports.
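
A filter keyed on the whole /8 mentioned above could look something like the sketch below. This is only an illustration with Python's standard `ipaddress` module, not the actual filter described in this post; `in_psi_block` is a made-up name, and the CIDR block is the one quoted in the thread.

```python
import ipaddress

# CIDR block as quoted in the thread.
PSI_BLOCK = ipaddress.ip_network("38.0.0.0/8")


def in_psi_block(ip: str) -> bool:
    """Return True if `ip` belongs to the 38.0.0.0/8 block."""
    return ipaddress.ip_address(ip) in PSI_BLOCK
```

Flagging a whole /8 is a very coarse rule, so it probably works better as a trigger for closer inspection in reports than as an outright ban.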