Hi, I have recently written a new robot for our search engine. It was written in C++ and we hope it will speed up the indexing of web sites. The problem we have had is that our present indexing system takes around five minutes per web site (depending on the number of pages found), but it slows down heaps when it tries to get to grips with Flash content. It was a choice between running multiple robots or a rewrite, and in the end we opted for a complete rewrite. It seems to be taking an age for the robots registration web site to get back to us; the last email address we had was one in the UK, "greenhills" or a name like that. I was wondering how long it has taken for others to register a robot?
Likewise, and I've never heard of registering your robot. Sounds like someone's pet project.
Five minutes to index a site is fairly short if it has several hundred pages. The time you take to index one site shouldn't be an issue if your crawlers are running in parallel, which I assume is why you did the rewrite.
If you are respecting the new Crawl-delay directive in robots.txt (which I highly recommend), you may find you have to rearchitect again. Ten seconds seems to be a common delay that webmasters would like between robot hits, so a single thread that sits waiting ten seconds between each request to the one site it is crawling is very inefficient. Instead, try queuing up URLs belonging to a range of web sites and alternating which site you're hitting, making sure you respect the crawl-delay for every site. Not a trivial piece of code, but IMHO it's the best way to do things. That way each thread of your crawler can fire off 100 requests per second without overwhelming any single site.
There are a few places on the web where you can let people know what your bot is used for; two such places are www.robotstxt.org and www.botspotter.net. That way people get to find out what robots are loose out there and what they are used for! Some come into web sites and just collect email addresses; others just suck bandwidth. Some read robots.txt and some don't, so the thing appears to have been set up to let people see what is coming into their web sites and for what reason. Too many people download a spider and sit there sucking at people's bandwidth for no known or valid reason. So when you guys and girls look at your log files and note a new user agent that has been rooting around your web site, you should at least be able to find out what the score is.