I have noticed a marked size increase in my robots.txt files as more and more automated spidering software becomes available. So, I started looking for ways to reduce the size of this file without losing any of its functionality. I found three methods which, used together, can achieve a significant size reduction in robots.txt, making the file faster to load, and easier to maintain and keep organized.
Step 1: Remove useless entries

The first step is to go through and pull out any lines disallowing robots which never respect robots.txt. These lines are a waste of space, so get rid of them. In the first list below, Larbin would be a good candidate, as I've never seen it check or obey robots.txt. Such rogue user-agents should be actively blocked by other means, such as .htaccess or httpd.conf directives on an Apache server, or global.asa, ISAPI filters, or .asp scripting on IIS.
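For example, on an Apache server with mod_rewrite available, a minimal sketch along these lines will refuse such requests outright (the "larbin" pattern is only an illustration; match whatever user-agents actually misbehave in your own logs):

    # Return 403 Forbidden to any request whose User-Agent contains "larbin"
    RewriteEngine On
    RewriteCond %{HTTP_USER_AGENT} larbin [NC]
    RewriteRule .* - [F]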
Step 2: Compress multiple records

The second step is to notice two statements in the Standard for Robots Exclusion [robotstxt.org]: a record may begin with more than one User-agent line, and when more than one User-agent field is present, the record describes an identical access policy for every robot named in it.
That means that you can take a typical robots.txt and cut its size by more than half. For example, the first few lines of the file might read:
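(The user-agent names below are just stand-ins for whatever your own file happens to list.)

    User-agent: WebCopier
    Disallow: /

    User-agent: Larbin
    Disallow: /

    User-agent: WebZIP
    Disallow: /

    User-agent: Wget
    Disallow: /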
All of that can be compressed to:
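    User-agent: WebCopier
    User-agent: WebZIP
    User-agent: Wget
    Disallow: /

(Larbin does not appear at all, since it was already dropped in Step 1; the remaining three records collapse into one because they all carried the same Disallow policy.)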
Step 3: Remove redundant user-agents
Now, let's look a little farther down in the file, where we find something like this:
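    User-agent: ExampleBot
    Disallow: /

    User-agent: ExampleBot/1.2
    Disallow: /

(ExampleBot is a made-up name, standing in for any robot that appears in the file both with and without its version string.)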
The second entry should not be necessary, since good robots are supposed to use a substring-matching compare.
Again, the Standard for Robots Exclusion [robotstxt.org] recommends that robots be liberal in interpreting this field, using a case-insensitive substring match of the name without version information.
One notable exception to the above is our friends at Nutch.org [nutch.org]. They have painted themselves into a corner recently because of the way they have specified their crawler's user-agent names. The user-agent for their development team is "NutchOrg" and the user-agent for others who wish to use their crawler is just "Nutch." We can be reasonably sure that the development team is not going to allow their crawler to be abusive or use it for untoward purposes, but it's yet to be seen whether they can and will enforce requirements for "good behaviour" on the part of their licensees. Therefore, if we wish to err on the side of caution, and to allow NutchOrg but disallow unknown licensees, we have to use:
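    User-agent: NutchOrg
    Disallow:

    User-agent: Nutch
    Disallow: /

(The empty Disallow value in the first record means nothing is disallowed, so NutchOrg gets full access, while any other user-agent containing "Nutch" falls through to the second record and is shut out - assuming the crawler honours the more specific record when both match.)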
I have suggested to the authors that they require a licensing agreement for use and specify in that licensing agreement that users must properly identify themselves in the user-agent string as seen in our log files. At a minimum, the licensed version should carry a user-agent of something like NutchUsr to prevent the substring-matching problem cited above. I wish them luck, but they have several loose strings to tie up in order to avoid having their robot become the next Indy Library (spambot).
Even though "compressed" robots.txt files are valid, and will pass validation [searchengineworld.com], some robots may not understand robots.txt files written as suggested above. If that is the case, then they are not following the recommendations of the Standard for Robots Exclusion, and may be candidates for stronger measures, such as outright banning of their user-agent from your site - for example, by blocking them as mentioned above. With luck, their operators will notice the blocks and correct their robots' behaviour as a result (I don't mean to be exclusive, but it's the only "vote" we've got).
If you make any changes to your robots.txt, take the time to validate it before putting it on your site. This may save you from disaster [webmasterworld.com]. The easiest way to do this is to upload your new robots.txt file to your server using a unique filename such as "robots.tst" and then check it with the Search Engine World Robots.txt Validator [searchengineworld.com].
New! Improved! - robots.txt Lite!
Using these three tricks on your robots.txt can significantly shrink its size and make it easier to maintain. I've had no problems using my new robots.txt Lite file, but of course, your mileage may vary.