My name is Terry and I'm the founder/creator of the bot (and the company). What we're doing is tracking conversations around links: we follow RSS feeds for blogs and then try to find any conversations about those posts by following links on Twitter, Facebook, and other blogs.
If you guys could provide me with some access logs (here or e-mail in profile), I'd love to see which URLs it was hitting. This is my first serious bot (written in C/C++; I tried a bunch of more standard tools from Apache and the like but am not proficient enough in Java to make the changes I needed) so it's still a little rough around the edges.
I apologize for hitting your servers so hard. As for the fishing, I've been working to normalize the URLs it looks for regarding the trailing slash so it's likely there was some overlap in fetching pages.
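To illustrate the trailing-slash normalization mentioned above, here is a minimal C++ sketch. The helper name normalize_url is hypothetical, and treating "/post" and "/post/" as the same page is an assumption that does not hold on every server, so this is only one way to build a de-duplication key, not the bot's actual code.

```cpp
#include <string>

// Hypothetical helper: builds a canonical key so that URLs differing only by
// a trailing slash (or a fragment) are fetched once instead of twice.
std::string normalize_url(const std::string& url) {
    // Drop the fragment; it is never sent to the server.
    std::string out = url.substr(0, url.find('#'));

    // Set the query string aside so only the path is touched.
    std::string query;
    std::string::size_type q = out.find('?');
    if (q != std::string::npos) {
        query = out.substr(q);
        out.erase(q);
    }

    // Locate the start of the path (first '/' after "scheme://host").
    std::string::size_type scheme = out.find("://");
    std::string::size_type host_start =
        (scheme == std::string::npos) ? 0 : scheme + 3;
    std::string::size_type path_start = out.find('/', host_start);

    if (path_start == std::string::npos) {
        out += '/';  // bare host: "http://example.com" becomes ".../"
    } else {
        // Strip trailing slashes but keep the root path "/" itself.
        while (out.size() > path_start + 1 && out.back() == '/') {
            out.pop_back();
        }
    }
    return out + query;
}
```

With this sketch, "http://example.com/post/" and "http://example.com/post" map to the same key, so a set of already-fetched keys would catch the overlap.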
What is the normal rate at which a crawler accesses your site? I originally had it set to 5 requests/second max, but I usually only have 5 or so URLs per blog to check so it seemed useless; I can add it back in with a little more direction.
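One common way to bring a limit back is a per-host minimum gap between requests rather than a global requests/second cap. The sketch below assumes that approach; the class name HostThrottle and the idea of passing the gap in as a parameter are illustrative choices, and the right gap for any given site is not something this thread settles.

```cpp
#include <chrono>
#include <string>
#include <unordered_map>

// Hypothetical per-host throttle: enforces a minimum gap between two
// requests to the same host, independent of traffic to other hosts.
class HostThrottle {
public:
    using Clock = std::chrono::steady_clock;

    explicit HostThrottle(std::chrono::milliseconds min_gap)
        : min_gap_(min_gap) {}

    // Returns how long the caller should wait before requesting from `host`
    // and reserves that slot, so back-to-back calls queue up behind it.
    std::chrono::milliseconds reserve(const std::string& host,
                                      Clock::time_point now = Clock::now()) {
        Clock::time_point when = now;
        auto it = next_allowed_.find(host);
        if (it != next_allowed_.end() && it->second > now) {
            when = it->second;  // host was hit recently: wait for its slot
        }
        next_allowed_[host] = when + min_gap_;
        return std::chrono::duration_cast<std::chrono::milliseconds>(when - now);
    }

private:
    std::chrono::milliseconds min_gap_;
    std::unordered_map<std::string, Clock::time_point> next_allowed_;
};
```

The fetch loop would call reserve(host), sleep for the returned duration, and then issue the request. Even with only a handful of URLs per blog, this spreads them out instead of sending them in a burst.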
Sorry about any confusion; I haven't had time to properly document the crawler with such heavy development. Anyway, any access logs I can compare to my own logs would be hugely appreciated.
Bots should ALWAYS respect robots.txt. Many (myself included) whitelist (allow only some bots and disallow all others). Any bot that fails that simple request is generally frowned upon... and nuked from accessing the site altogether.
I should note that the RSS Check Bot, RSS Bot and Crawler are all different bots with different purposes (they should never hit the same pages or overlap). I thought it would be best to differentiate for debugging purposes, but I could consolidate them if that's a more standard practice.
RSS Check Bot - Checks your site for an RSS feed we can follow.
RSS Bot - Periodically checks your RSS feed for new posts.
Crawler - Crawls the new posts found in the feed.
I have been working very hard to get robots.txt working properly, caching the results (and its existence), making sure I don't hit that file repeatedly, etc. Since it's all written in custom C/C++, text parsing is a bitch; but I assure you that it is coming soon (I have not found a decent robots.txt parsing library).
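The caching described here, including remembering that a host has no robots.txt at all, could look roughly like the following C++ sketch. The names RobotsEntry and RobotsCache and the time-to-live parameter are assumptions made for illustration; the actual fetch (for example via libcurl) is left out.

```cpp
#include <chrono>
#include <string>
#include <unordered_map>
#include <utility>

// Hypothetical cache record for one host's robots.txt.
struct RobotsEntry {
    bool exists;       // false: the host had no robots.txt (e.g. a 404)
    std::string body;  // raw file contents when exists is true
    std::chrono::steady_clock::time_point fetched_at;
};

// Hypothetical cache: stores both positive and negative results so the
// crawler does not request robots.txt again until the entry expires.
class RobotsCache {
public:
    using Clock = std::chrono::steady_clock;

    explicit RobotsCache(std::chrono::seconds ttl) : ttl_(ttl) {}

    void store(const std::string& host, bool exists, std::string body,
               Clock::time_point now = Clock::now()) {
        entries_[host] = RobotsEntry{exists, std::move(body), now};
    }

    // Returns the cached entry while it is fresh; nullptr means the caller
    // should fetch robots.txt again and call store().
    const RobotsEntry* lookup(const std::string& host,
                              Clock::time_point now = Clock::now()) const {
        auto it = entries_.find(host);
        if (it == entries_.end() || now - it->second.fetched_at >= ttl_) {
            return nullptr;
        }
        return &it->second;
    }

private:
    std::chrono::seconds ttl_;
    std::unordered_map<std::string, RobotsEntry> entries_;
};
```

Storing the "does not exist" case is what keeps the bot from re-requesting a missing file before every page fetch.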
I have consolidated the bot user-agents (all now "Jaxified Bot") and added robots.txt checking to all except one bot which follows links from Twitter using HEAD requests and never downloads any page content; I am using libcurl to get pages and it simply follows any 301 and 302 redirects itself.
You can now add "User-agent: Jaxified" to your robots.txt file to disallow our bot specifically, though it will also follow rules for the wildcard user-agent.
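As a rough illustration of the matching described above, the following C++ sketch collects rules from every group that names the bot's token or the wildcard "*", then checks a path against them. The function names parse_robots and is_allowed are hypothetical, only plain prefix rules are handled (no "*" or "$" inside paths, no Crawl-delay), and the longest-prefix tie-breaking is an assumption rather than the bot's confirmed behavior.

```cpp
#include <algorithm>
#include <cctype>
#include <sstream>
#include <string>
#include <vector>

struct Rule { bool allow; std::string prefix; };

static std::string trim(const std::string& s) {
    const char* ws = " \t\r\n";
    std::string::size_type b = s.find_first_not_of(ws);
    if (b == std::string::npos) return "";
    return s.substr(b, s.find_last_not_of(ws) - b + 1);
}

static std::string lower(std::string s) {
    std::transform(s.begin(), s.end(), s.begin(), [](unsigned char c) {
        return static_cast<char>(std::tolower(c));
    });
    return s;
}

// Collects Allow/Disallow rules from every group naming `agent` or "*".
std::vector<Rule> parse_robots(const std::string& text, const std::string& agent) {
    std::vector<Rule> rules;
    std::istringstream in(text);
    std::string line;
    bool in_header = false;  // reading consecutive User-agent lines
    bool applies = false;    // current group applies to this bot
    while (std::getline(in, line)) {
        line = trim(line.substr(0, line.find('#')));  // strip comments
        std::string::size_type colon = line.find(':');
        if (colon == std::string::npos) continue;
        std::string key = lower(trim(line.substr(0, colon)));
        std::string value = trim(line.substr(colon + 1));
        if (key == "user-agent") {
            if (!in_header) { applies = false; in_header = true; }  // new group
            std::string v = lower(value);
            if (v == "*" || v == lower(agent)) applies = true;
        } else {
            in_header = false;
            if (!applies || value.empty()) continue;  // empty Disallow blocks nothing
            if (key == "disallow") rules.push_back({false, value});
            else if (key == "allow") rules.push_back({true, value});
        }
    }
    return rules;
}

// Longest matching prefix wins; Allow wins ties; no match means allowed.
bool is_allowed(const std::vector<Rule>& rules, const std::string& path) {
    bool allowed = true;
    std::string::size_type best = 0;
    for (const Rule& r : rules) {
        if (path.compare(0, r.prefix.size(), r.prefix) != 0) continue;
        if (r.prefix.size() > best || (r.prefix.size() == best && r.allow)) {
            best = r.prefix.size();
            allowed = r.allow;
        }
    }
    return allowed;
}
```

Many crawlers apply only the most specific matching group and ignore "*" when their own token is listed. Honoring both groups, as described above, is the more conservative choice for site owners.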
I have taken the RSS Bot and Crawler daemons offline until I can gather more information on the duplicate requests. I appreciate any follow-ups.