cpollett - 5:08 pm on Mar 3, 2013 (gmt 0)
Some quick remarks on YioopBot: by default it makes range requests for 50000 bytes to conserve hard drive space, since it is running off some Mac minis in my guest room with 4 TB drives attached. It might scarf down more if the data is chunked, but it then does a post-download chop to 50K. In November I was doing single-day test crawls. From Dec 17 to the present I have been doing a longer-term crawl, about 240 million pages so far. Periodically this has been stopped for brain transplants, and I have also been testing some other kinds of indexing operations on Wikipedia dumps and the UT Zoo Usenet archives.
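For anyone wondering what the range request plus post-download chop looks like in practice, here is a rough sketch in Python. It is not YioopBot's actual code: the names `PAGE_LIMIT`, `range_header`, `chop`, and `fetch_capped` are made up for illustration, and it assumes the 50000-byte cap covers bytes 0 through 49999.

```python
import urllib.request

PAGE_LIMIT = 50000  # bytes; the default cap mentioned above (hypothetical name)


def range_header(limit=PAGE_LIMIT):
    """Build a Range header asking for the first `limit` bytes only."""
    # HTTP byte ranges are inclusive, so the last byte index is limit - 1.
    return {"Range": f"bytes=0-{limit - 1}"}


def chop(body, limit=PAGE_LIMIT):
    """Post-download chop: keep at most `limit` bytes of what arrived."""
    return body[:limit]


def fetch_capped(url, limit=PAGE_LIMIT, timeout=10):
    """Fetch a URL with a range request, then truncate the result.

    A server may ignore the Range header and send the whole page
    (for example when the response is chunked), so the chop is
    still applied after the download.
    """
    request = urllib.request.Request(url, headers=range_header(limit))
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = response.read()
    return chop(body, limit)
```

The chop matters because a range request is only a request: a server that honors it answers 206 Partial Content with the bytes asked for, while one that ignores it answers 200 with the full page.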