|Oddly long robots.txt file|
Seeking explanation for weird robots.txt file
I've come across what seems like a highly unusual robots.txt file on a highly visible public web site. I'm puzzled by it and was wondering if any experts in the Robots Exclusion Protocol could explain why this site's robots.txt file is the way it is.
The robots.txt file in question has almost 2,000 lines, all of them "disallow" directives. Most robots.txt files I've seen have only a handful of directives. Why would it be necessary to have so many separate directives?
Some of the disallow directives block access to pages that have been removed from the site. If a page has been removed, is there any reason to disallow robots from fetching it?
Out of curiosity, is there any way to find the longest robots.txt file on the Internet?
Any insight or help unraveling this mystery would be appreciated. Thanks.
> The robots.txt file in question has almost 2,000 lines, all of them "disallow" directives. Most robots.txt files I've seen have only a handful of directives. Why would it be necessary to have so many separate directives?
The main reason would be that the site is poorly organized for robots.txt control. In most cases, robots control should be taken into account when architecting a site's directory structure, so that robots can be Disallowed from entire directory branches rather than one page or file at a time.
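To illustrate (with hypothetical paths), a handful of directory-level rules can cover what would otherwise take thousands of per-file lines:

```
# Enumerating files one at a time (what a 2,000-line file ends up doing):
User-agent: *
Disallow: /old-page-1.html
Disallow: /old-page-2.html
# ...one line per removed page...

# Directory-level control (what a well-structured site allows):
User-agent: *
Disallow: /archive/
Disallow: /private/
```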
> Some of the disallow directives block access to pages that have been removed from the site. If a page has been removed, is there any reason to disallow robots from fetching it?
A Disallow tells the robot not to try to fetch the page, so this will save the site bandwidth on 404 responses, and cut the number of 404 errors in the log file.
Searching Google for robots.txt will turn up a site with a very long file. What happens when the file gets larger than what a robot will usually fetch?
The robots.txt file can be parsed on-the-fly to extract the record that applies to the robot doing the spidering. This information can then be used immediately, or can be put into storage for later use. It might be stored as plain-text, or it might be tokenized to save space. So, it is not necessarily the case that all of a huge robots.txt file needs to be stored.
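As a sketch of that parse-and-query idea, Python's standard library ships a robots.txt parser that reads the file line by line and keeps only the extracted rules, so a crawler can answer fetch questions without retaining the raw text (the bot name and URLs here are hypothetical):

```python
from urllib.robotparser import RobotFileParser

# Parse a (hypothetical) robots.txt on the fly; parse() accepts the
# file's lines, so the raw text need not be kept after this call.
rp = RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /private/",
    "Disallow: /old-page.html",
])

# The parsed rules are all a crawler needs for later decisions.
print(rp.can_fetch("ExampleBot", "http://example.com/private/x.html"))  # False
print(rp.can_fetch("ExampleBot", "http://example.com/public/x.html"))   # True
```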
In the case of a huge robots.txt file, who knows what a given robot will do? It might accept it or it might give up and declare it invalid after reaching a certain limit. The only way to find the limit is to research and experiment or to buy a lot of drinks for a crawler engineer from the search company you are most interested in... :)
[added] It's interesting to note that the file you refer to could be deemed technically invalid, since it uses tab characters where the Standard specifies space characters. I wouldn't worry about a 76 kB file, but I might worry about a 768 kB file... [/added]
Where is this Robots.txt?
From what I've seen so far, any one of my spiders should accept a robots.txt of any size and deal with it, as long as each line is valid.
At that point, file size only comes into consideration when the file might be larger than available disk space (i.e., 100 GB+).
Someone mentioned that having numerous disallow directives is the result of a very poor design scheme, so what do you do when there are just too many directories to list individually and you need to reverse the process... say, use an "allow" instead of just "disallow"? Don't suggest redesigning the site... that's not an option for me right now.
Is that completely not possible? You'd think spider developers would have considered this one, or perhaps I'm missing a key aspect of the whole development side.
Simply stated, can I set up the following and hope and pray that it works for our massive site? I guess my question boils down to, "is there such a field as 'allow'"?
|Simply stated, can I set up the following and hope and pray that it works for our massive site? |
Yes you can. (Hope and pray, that is.)
|I guess my question boils down to, "is there such a field as 'allow'"? |
No, there isn't. Disallow is the only rule field the Standard defines.
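Under the original Standard, the only way to get "allow" behavior is to invert it: disallow every sibling branch you want excluded and leave unmentioned the one branch you want crawled. A sketch with hypothetical directories:

```
User-agent: *
Disallow: /cgi-bin/
Disallow: /archive/
Disallow: /private/
# /public/ is not listed, so robots may crawl it.
```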