The robots.txt file in question has almost 2,000 lines, all of them "disallow" directives. Most robots.txt files I've seen have only a handful of directives. Why would it be necessary to have so many separate directives?
Some of the disallow directives block access to pages that have been removed from the site. If a page has been removed, is there any reason to disallow robots from crawling it?
Out of curiosity, is there any way to find the longest robots.txt file on the Internet?
Any insight or help unraveling this mystery would be appreciated. Thanks.
Dave
The main reason would be that the site is poorly organized for robots.txt control. In most cases, robots control should be taken into account when architecting a site's directory structure, so that robots can be Disallowed from entire directory branches rather than disallowing pages and files one at a time.
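For example, assuming the blocked pages all live under a common branch (the directory name here is hypothetical), a single rule covers the whole branch:

User-agent: *
Disallow: /old-catalog/

That one line blocks /old-catalog/page1.html, /old-catalog/page2.html, and everything else under that directory, with no need for a separate Disallow per page.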
> Some of the disallow directives block access to pages that have been removed from the site. If a page has been removed, is there any reason to disallow robots from crawling it?
A Disallow tells the robot not to try to fetch the page, so this will save the site bandwidth on 404 responses, and cut the number of 404 errors in the log file.
Jim
In the case of a huge robots.txt file, who knows what a given robot will do? It might accept it or it might give up and declare it invalid after reaching a certain limit. The only way to find the limit is to research and experiment or to buy a lot of drinks for a crawler engineer from the search company you are most interested in... :)
[added] Interesting to note that the file you refer to could be deemed technically invalid, since it uses tab characters rather than the space characters specified in the Standard. I wouldn't worry about a 76kB file, but I might worry about a 768kB file... [/added]
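For reference, the record format the Standard describes is a field name, a colon, and an optional space before the value, e.g. (hypothetical path):

Disallow: /private/

A parser written strictly to that format may or may not tolerate a tab after the colon, which is why I'd only call the file "technically" invalid rather than certainly broken.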
Jim
From what I've seen so far, any one of my spiders should take a robots.txt of any size and deal with it, as long as each line is valid.
At that point, file size only comes into consideration when the file might be larger than the available disk space (i.e., 100+ GB).
Is that completely off base? You'd think spider developers would have considered this one, or perhaps I'm missing a key aspect of the whole development side.
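As a quick sanity check, here is a minimal sketch using Python's standard-library urllib.robotparser module; it builds a made-up 2,000-line file of Disallow rules in memory (the paths are invented for illustration) and feeds it to the parser:

# Build a hypothetical robots.txt with ~2,000 Disallow lines and parse it,
# then test one listed path and one unlisted path.
from urllib.robotparser import RobotFileParser

lines = ["User-agent: *"]
lines += ["Disallow: /removed-page-%d.html" % i for i in range(2000)]

parser = RobotFileParser()
parser.parse(lines)

print(parser.can_fetch("*", "/removed-page-42.html"))  # False - a listed path is blocked
print(parser.can_fetch("*", "/current-page.html"))     # True  - everything else is crawlable

A well-formed file of that size is no trouble for this parser; what a given commercial crawler does with it is, as Jim says, another question.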
Simply stated, can I set up the following and hope and pray that it works for our massive site? I guess my question boils down to: "is there such a field as 'Allow'?"
User-agent: *
Disallow: /
Allow: /some-specific-directory
Allow: /another-specific-directory
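Here is a small sketch, using Python's standard-library urllib.robotparser, that feeds that exact block to a parser (the directory names are the placeholders from the example above, and the test URLs are made up) so you can see how one parser resolves the blanket Disallow against the later Allow lines:

# Parse the rules above and ask about a path under an Allowed directory
# versus a path that is not listed at all.
from urllib.robotparser import RobotFileParser

rules = [
    "User-agent: *",
    "Disallow: /",
    "Allow: /some-specific-directory",
    "Allow: /another-specific-directory",
]

parser = RobotFileParser()
parser.parse(rules)

# Whether these print True or False depends on how the parser resolves the
# conflict between "Disallow: /" and the more specific Allow lines.
print(parser.can_fetch("*", "/some-specific-directory/page.html"))
print(parser.can_fetch("*", "/not-listed/page.html"))

Be aware that crawlers differ on exactly this point: some take the first matching rule in file order (in which case the blanket Disallow wins), while Google documents longest-match precedence (in which case the Allow lines win for their directories). So test against the crawlers you actually care about.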