CommonCrawl.org

         

Brett_Tabke

6:03 am on Nov 23, 2009 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



What system of spidering do they use? Straight-up ccBot, as they say? Or is it a distributed crawler still using Nutch?

After reading this over the weekend, I really don't want to have anything to do with them:

[radar.oreilly.com...]

keyplyr

10:56 am on Nov 23, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



IP: 38.107.191.**
rDNS: auth1.dns.cogentco.com
UA: CCBot/1.0 (+http://www.commoncrawl.org/bot.html)

All I can tell is that it did not request robots.txt, and their bot info page returns a 404.
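
For anyone who wants to run the same check on their own logs, here is a rough Python sketch. It assumes the Apache combined log format; the function name, the sample log lines, and the 192.0.2.x addresses are made up for illustration and are not taken from real traffic.

```python
import re

# Assumed layout (Apache combined log format):
# IP identd user [time] "METHOD PATH PROTO" status bytes "referer" "user-agent"
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[[^\]]*\] "(?P<method>\S+) (?P<path>\S+)[^"]*" '
    r'\d{3} \S+ "[^"]*" "(?P<ua>[^"]*)"'
)

def fetched_robots_txt(lines, ua_fragment):
    """Return True if any request whose UA contains ua_fragment asked for /robots.txt."""
    needle = ua_fragment.lower()
    for line in lines:
        m = LOG_LINE.match(line)
        if not m:
            continue  # skip lines that are not in the assumed format
        path = m.group("path").split("?")[0]  # ignore any query string
        if needle in m.group("ua").lower() and path == "/robots.txt":
            return True
    return False

# Hypothetical sample lines, for illustration only.
sample = [
    '192.0.2.10 - - [23/Nov/2009:10:50:00 +0000] "GET /index.html HTTP/1.1" '
    '200 5120 "-" "CCBot/1.0 (+http://www.commoncrawl.org/bot.html)"',
    '192.0.2.7 - - [23/Nov/2009:10:51:00 +0000] "GET /robots.txt HTTP/1.1" '
    '200 120 "-" "ExampleBot/2.0"',
]

print(fetched_robots_txt(sample, "CCBot"))
print(fetched_robots_txt(sample, "ExampleBot"))
```

A bot that shows up in the log but never makes fetched_robots_txt return True has, as far as that log goes, not asked for robots.txt. It says nothing about requests made from a different UA or before the log starts.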

Staffa

12:19 pm on Nov 23, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



If you go to the home page, there's a link to the bot page:

"The ccBot crawler is a distributed crawling infrastructure that makes use of the Apache Hadoop and Nutch projects."

To put it politely, the guy who wrote that article is a pretty arrogant chap. Never mind, he won't have the chance to ignore my robots.txt file, since 38.nnn.nnn.nnn has been banned for years; nothing much good comes from there.
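
As a rough illustration of that kind of range ban, here is a minimal Python sketch. It assumes that "38.nnn.nnn.nnn" means the whole 38.0.0.0/8 block; the BLOCKED list and the is_blocked name are hypothetical and not anyone's actual server config.

```python
import ipaddress

# Assumption: the ban covers every address starting with 38., i.e. 38.0.0.0/8.
BLOCKED = [ipaddress.ip_network("38.0.0.0/8")]

def is_blocked(ip):
    """Return True if the given IPv4 address string falls inside a blocked network."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in BLOCKED)

print(is_blocked("38.0.0.1"))
print(is_blocked("192.0.2.1"))
```

In practice the same rule would normally live in the web server or firewall config rather than in application code.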

blend27

11:46 pm on Nov 23, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Slightly OT, but FROM THE COMMENTS on that page: -- should government webmasters be allowed to dictate to organizations..... --

Nothing for nothing, but most government webmasters would not know what to do with a robots.txt file to start with. The job is usually held by an ex-mainframe programmer who is stuck in meetings and who learned how to use FrontPage 3.0 back in the day. And such...

On the other hand, on the last project I did for that sector, the webmaster had a poster (homemade, with the crown) that said: "CONTENT IS KING", I swear...

Pfui

5:07 am on Nov 24, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Saw this for the first time two days ago -- note the "CC":

static-71-160-113-nnn.lsanca.dsl-w.verizon.net
CC-rget/5.818 libwww-perl/5.805

robots.txt? NO

A variation on a theme, perhaps?