Put your robots.txt on a diet
Three easy ways to reduce bloat in robots.txt
jdMorgan
4:37 pm on Jun 4, 2003 (gmt 0)

robots.txt bloat
I have noticed a marked size increase in my robots.txt files as more and more automated spidering software becomes available. So, I started looking for ways to reduce the size of this file without losing any of its functionality. I found three methods which, used together, can achieve a significant size reduction in robots.txt, making the file faster to load, and easier to maintain and keep organized.

Step 1: Remove useless entries
The first step is to go through and pull out any lines disallowing robots which never respect robots.txt. These lines are a waste of space, so get rid of them. In the first list below, Larbin would be a good candidate, as I've never seen it check or obey robots.txt. Such rogue user-agents should be actively blocked by other means, such as .htaccess or httpd.conf directives on an Apache server, or global.asa, ISAPI filters, or .asp scripting on IIS.
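For example, on an Apache server a minimal .htaccess sketch (an illustration only - it assumes mod_setenvif and mod_access are enabled, and "bad_bot" is just an arbitrary environment-variable name) might look like:

# Block a robot that never reads robots.txt; add one SetEnvIfNoCase line per rogue agent
SetEnvIfNoCase User-Agent "larbin" bad_bot
Order Allow,Deny
Allow from all
Deny from env=bad_bot

Any request whose user-agent contains "larbin" then receives a 403 Forbidden, whether or not the robot ever asks for robots.txt.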

Step 2: Compress multiple records
The second step is to notice two statements in the Standard for Robots Exclusion [robotstxt.org]:

The record starts with one or more User-agent lines, followed by one or more Disallow lines, as detailed below.
-and-
If more than one User-agent field is present, the record describes an identical access policy for more than one robot.
(emphasis added)

That means that you can take a typical robots.txt and cut its size by more than half. For example, the first few lines of the file might read:

User-agent: grub-client
Disallow: /

User-agent: grub
Disallow: /

User-agent: looksmart
Disallow: /

User-agent: WebZip
Disallow: /

User-agent: larbin
Disallow: /

User-agent: b2w/0.1
Disallow: /

User-agent: Copernic
Disallow: /

All of that can be compressed to:

User-agent: grub-client
User-agent: grub
User-agent: looksmart
User-agent: WebZip
User-agent: larbin
User-agent: b2w/0.1
User-agent: Copernic
Disallow: /

Step 3: Remove redundant user-agents
Now, let's look a little farther down in the file, where we find this:

User-agent: ia_archiver
Disallow: /

User-agent: ia_archiver/1.6
Disallow: /

The second entry should not be necessary, since good robots are supposed to perform a substring-matching comparison.
Again, from the Standard for Robots Exclusion [robotstxt.org]:

The robot should be liberal in interpreting this field. A case-insensitive substring match of the name without version information is recommended.

A robot should consider a User-agent line to apply to itself if the string in robots.txt matches its own user-agent name either exactly or as a substring. Therefore, the second record above is redundant; individual version numbers need not be included in robots.txt. Also note that the "grub-client" entry in the preceding example is likewise redundant - "grub" is a substring of "grub-client", so just "grub" should do.
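To see where this leads, here is the earlier example with all three steps applied: larbin drops out per Step 1 (it is handled outside robots.txt), the remaining user-agents share one record per Step 2, and grub-client folds into grub per Step 3:

User-agent: grub
User-agent: looksmart
User-agent: WebZip
User-agent: b2w/0.1
User-agent: Copernic
Disallow: /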

An aside:
One notable exception to the above is our friends at Nutch.org [nutch.org]. They have painted themselves into a corner recently because of the way they have specified their crawler's user-agent names. The user-agent for their development team is "NutchOrg" and the user-agent for others who wish to use their crawler is just "Nutch." We can be reasonably sure that the development team is not going to allow their crawler to be abusive or use it for untoward purposes, but it's yet to be seen whether they can and will enforce requirements for "good behaviour" on the part of their licensees. Therefore, if we wish to err on the side of caution, and to allow NutchOrg but disallow unknown licensees, we have to use:

User-agent: NutchOrg
Disallow:


User-agent: Nutch
Disallow: /

Note also that the example given on their Webmaster information page is technically incorrect, since a robot should obey the first User-agent line it matches in robots.txt.

I have suggested to the authors that they require a licensing agreement for use and specify in that licensing agreement that users must properly identify themselves in the user-agent string as seen in our log files. At a minimum, the licensed version should carry a user-agent of something like NutchUsr to prevent the substring-matching problem cited above. I wish them luck, but they have several loose strings to tie up in order to avoid having their robot become the next Indy Library (spambot).

Caveats:
Even though "compressed" robots.txt files are valid, and will pass validation [searchengineworld.com], it is possible that some robots may not understand robots.txt files written as suggested above. If that is the case, then they are not following the recommendations of the Standard for Robots Exclusion, and may be candidates for stronger measures, such as outright banning of their user-agent from your site - for example, by blocking them as mentioned above. Hopefully, they may notice the blocks and correct their robots' behaviour as a result (I don't mean to be exclusive, but it's the only "vote" we've got).

If you make any changes to your robots.txt, take the time to validate it before putting it on your site. This may save you from disaster [webmasterworld.com]. The easiest way to do this is to upload your new robots.txt file to your server using a unique filename such as "robots.tst" and then check it with the Search Engine World Robots.txt Validator [searchengineworld.com].

New! Improved! - robots.txt Lite!
Using these three tricks on your robots.txt can significantly shrink its size and make it easier to maintain. I've had no problems using my new robots.txt Lite file, but of course, your mileage may vary.

HTH,
Jim

 

spud01
11:02 am on Jun 13, 2003 (gmt 0)

I disallow spiders via the .htaccess file.

Is using the robots.txt file a better alternative with respect to robots finding the file and agreeing to the rules placed in it?

jbinbpt
12:05 pm on Jun 13, 2003 (gmt 0)

Excellent info…
I never run out of things to do after I stop in here
Thanks for the post

bird
12:20 pm on Jun 13, 2003 (gmt 0)

User-agent: NutchOrg
Disallow:
User-agent: Nutch
Disallow: /

I think you need an empty line before the second User-agent line to get valid syntax.
Like this:

User-agent: NutchOrg
Disallow:

User-agent: Nutch
Disallow: /

brotherhood of LAN
1:23 pm on Jun 13, 2003 (gmt 0)

Great post Jim!

I was looking at WW's robots.txt last night, and trying to make a spider obey robots.txt.

>>>>The record starts with one or more User-agent lines

Cheers for that, I think you've saved me an hour's worth of mistakes :)

cabowabo
1:42 pm on Jun 13, 2003 (gmt 0)

Great info. As a newbie, I have just one question: Why would you want to disallow Grub? Don't they feed Wisenut, which is owned by LookSmart?

bcolflesh
1:51 pm on Jun 13, 2003 (gmt 0)

A lot of people download the grub client and use it for less than altruistic purposes...

Regards,
Brent

Yidaki
2:24 pm on Jun 13, 2003 (gmt 0)

Excellent, really excellent, jd!

>The record starts with one or more User-agent lines, followed by one or more Disallow lines

Ts, ts, ts ... rtfm ... that'll reduce my robots.txt files by more than 30%! Thanks!

Are these hints tested with google and other major crawlers, jd?

jimmykav
2:56 pm on Jun 13, 2003 (gmt 0)

Good post jdMorgan

I am not so sure about the practicality of rolling up multiple agents into one section though.

Many of them have trouble reading the file in the more traditional format, and probably do not cater for this functionality.

Have you any figures for those that support this rollup, and also how many support matches on the leftmost(n) chars of the agent name?

richardb
3:07 pm on Jun 13, 2003 (gmt 0)

Thank you JD

For cabowabo you might find this useful.

http*//www.pgts.com.au/pgtsj/pgtsj0208d.html

Rich

jdMorgan
3:17 pm on Jun 13, 2003 (gmt 0)

All,
Thanks for the positive comments!

spud01,

> I disallow spiders via the .htaccess file. Is using the robots.txt file a better alternative with respect to robots finding the file and agreeing to the rules placed in it?

Use robots.txt to tell good spiders where you want them to go and not go. Use .htaccess on Apache (or similar means on MS servers) to block, ban, or trap robots which do not check or do not comply with robots.txt.
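For the Apache half of that, one rough sketch (assuming mod_rewrite is enabled; "larbin" is just the rogue agent from the example above, so substitute your own list) would be:

RewriteEngine On
# Return 403 Forbidden for any request whose user-agent contains "larbin"
RewriteCond %{HTTP_USER_AGENT} larbin [NC]
RewriteRule .* - [F]

This is an alternative to the SetEnvIf/Deny approach sketched earlier; either works, and both act regardless of whether the robot ever requests robots.txt.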

bird,
Well-spotted! I'll plead a formatting problem with pre and code. Each record of robots.txt must be followed by a blank line, including the last one.

cabowabo,
> Why would you want to disallow Grub?

An excellent question! Actually, the robots.txt file used for examples here is not mine, but I've disallowed Grub as well. As bcolflesh notes, the spider has been misused a lot.

For me, this is an open question. If I can identify IP address ranges used by "good" organizations using Grub, I can lift the robots.txt disallow and the .htaccess trap that backs it up. However, the distributed nature of this crawler may make that impossible; there are many good guys and many bad guys. The disallow of Grub in my robots.txt is being used as a compliance test. So far, I've seen very few instances of grub checking robots.txt.

Writers of open-source spiders should bear this situation in mind and create strong licensing agreements for their works, with enforceable standards for robots.txt compliance and for inclusion of correct contact information and using-organization identification in the user-agent string. (This was included as an aside to the main theme of the post, trying to warn our friends at Nutch - who are headed for a similar problem.)

Yidaki,

>Are these hints tested with google and other major crawlers, jd?

I have run this way for a long time with no problems regarding major 'bots. The information presented here is based on the Standard for Robots Exclusion, as cited at the outset. It works well for me. However, if you are the webmaster of a large corporation making millions in on-line sales, then I'd advise you to not "fix" anything that is not broken!

jimmykav,
> Have you any figures for those that support this rollup, and also how many support matches on the leftmost(n) chars of the agent name?

Figures? No, I have only my view from the sites I've administered. My perspective is from sites where the major U.S. search engines are allowed, and almost everything else is disallowed (or blocked by other means as noted). The "process" I use is to disallow unknown/suspect agents first. If they comply, that's all I do. If they don't comply, I remove the robots.txt disallow and install a block in .htaccess. It's a matter of an initial premise: I allow well-behaved, correctly-implemented robots to access my sites.

Jim

SEOMike
5:36 pm on Jun 13, 2003 (gmt 0)

Great messages! I am only a year old in the SEO world, and info like this is priceless when you are just starting. It makes me look that much more proficient in my employer's eyes! Keep up the great threads!

Yidaki
6:49 pm on Jun 13, 2003 (gmt 0)

>It works well for me. However, if you are the webmaster of a
>large corporation making millions in on-line sales, then I'd advise you to
>not "fix" anything that is not broken!

Ok, then i'd better leave my site's robots.txt unchanged and try it on my friend's site first. :P

nowhere
9:09 pm on Jun 13, 2003 (gmt 0)

Why does WW disallow Scooter?

jdMorgan
9:22 pm on Jun 13, 2003 (gmt 0)

For the purposes of this thread, let's just take it as given that any webmaster is entitled to disallow any robot they please, for any reason. Some will allow Grub and Scooter. Some will decide they are not worth the hassle, given traffic levels on their site from those robots' associated engines. Grub and Scooter/1.1 (specifically) have misbehaved in the past, so they may turn up in many sites' disallow directives.

I commend the Search Engine Spider Identification forum [webmasterworld.com] to all interested in researching robots compliance issues.

So let's not argue about this and that user-agent; just take the user-agents in the code above as examples.

Yidaki - spoken like a true conservative! Always test before you deploy!

However, the point still stands; all techniques in the original post are in full compliance with the Standard. I prefer to design to standards, and then find work-arounds for the specific releases/robots/interpreters/computers/compilers/browsers/etc. that are not standards-compliant.

Jim

Yidaki
1:49 pm on Jun 14, 2003 (gmt 0)

>Yidaki - spoken like a true conservative! Always test before you deploy!

jd, i was joking about your million-dollar example. :) I know that your great guidelines conform absolutely to the standard. I was just wondering how closely the major bots follow it (see Nutch). Anyways, i'll give it a try - excellent hints!

scareduck
11:47 pm on Jun 24, 2003 (gmt 0)

Of course, if you're using PHP, you could tie robots.txt to a robots.php file, and do something like


<?php
// Return a "Disallow: /" record only to the listed annoybots;
// every other visitor sees an empty (allow-everything) robots.txt.
header('Content-Type: text/plain');
$ua = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';
if (preg_match('/Bozobot|Unruly Crawler v1\.0|Etc/i', $ua)) {
    print "User-agent: $ua\n";
    print "Disallow: /\n";
}
?>

This way, the annoybots see themselves as being restricted (assuming they care, which they probably won't), and everyone else sees a blank robots.txt. Simple, no?
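The missing piece in that sketch is making the server run the script when /robots.txt is requested. One way to do that on Apache (again a sketch, assuming mod_rewrite is available and the script really is named robots.php) is an .htaccess rule like:

RewriteEngine On
# Serve robots.php whenever robots.txt is requested
RewriteRule ^robots\.txt$ /robots.php [L]

IIS or other servers would need their own equivalent URL mapping.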

jdMorgan
11:44 pm on Jun 30, 2003 (gmt 0)

In response to this post, several members have told me that they modified their robots.txt files. I now have one reliable report that ia_archiver is not in compliance with the Standard for Robots Exclusion, and won't accept multiple user-agents in a single shared-policy record. For those who wish to exclude ia_archiver, it must have its own record, e.g.

User-agent: ia_archiver
Disallow: /user_files/


(each record must be followed by a blank line.)

I'd like to hear reports of any more non-compliant user-agents this technique turns up.

Jim

keyplyr
7:00 pm on Jul 1, 2003 (gmt 0)

My email to Wayback (ia_archiver) informing them that their crawler was not in compliance with the Standard for Robots Exclusion resulted in the reply "we'll look into it." I also invited them to join in this discussion ;)

Brett_Tabke
8:36 pm on Oct 28, 2003 (gmt 0)

One thing a few people have mentioned about this great post by JD concerns step 3.

I have a lot of near-duplicate lines, because there are imposter robots out there that try to make you think they are something else. Foobot may not be the same as Foobots.
