How long can a robots.txt be?
| 12:54 pm on Oct 21, 2003 (gmt 0)|
I'm looking at a report that suggests FAST is being refused access to certain directories by the robots.txt file.
I'm looking at the robots.txt file.
I'm not looking at a robots.txt file that refuses FAST access to these directories, nor am I looking at an incorrectly formatted robots.txt.
I am, however, looking at a huge robots.txt file. It's not small. It's large. Very.
Could this be a problem? Do you think FAST might just cut off the robots.txt after X hundred or Y thousand bytes? A sudden cut might just annoyingly leave a trailing / in the command and block the entire directory that way.
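To make the worry concrete, here is a minimal sketch of what a mid-line cut could do. It assumes a hypothetical crawler that reads only the first N bytes and parses whatever is left; nothing here is known FAST behaviour, and the paths and helper names are invented for illustration.

```python
def truncate_robots(text: str, limit: int) -> str:
    """Simulate a (hypothetical) crawler that keeps only the first `limit` bytes."""
    return text.encode("utf-8")[:limit].decode("utf-8", errors="ignore")


def disallow_paths(text: str) -> list[str]:
    """Collect the values of every Disallow line, ignoring all other fields."""
    paths = []
    for line in text.splitlines():
        field, _, value = line.partition(":")
        if field.strip().lower() == "disallow":
            paths.append(value.strip())
    return paths


# Made-up robots.txt with two specific exclusions.
robots = (
    "User-agent: *\n"
    "Disallow: /private/reports/2003/\n"
    "Disallow: /tmp/cache/\n"
)

# Assume the cut happens to fall right after "/private/".
limit = robots.index("/private/") + len("/private/")

full = disallow_paths(robots)
cut = disallow_paths(truncate_robots(robots, limit))

print(full)  # the rules as written
print(cut)   # the rules a truncating reader would see
```

With the cut in that spot, the narrow rule for /private/reports/2003/ turns into a rule for all of /private/, which is the kind of accidental block described above.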
| 3:19 pm on Oct 21, 2003 (gmt 0)|
Welcome to WebmasterWorld [webmasterworld.com]!
> I am, however, looking at a huge robots.txt file. It's not small. It's large. Very.
How large? I used to have one that was about 16kB in size, and had no problems.
I'd suggest e-mailing FAST and asking them about any specific cut-off size.
Just to be sure: Robots.txt Validator [searchengineworld.com]
A few suggestions: Put your robots.txt on a diet [webmasterworld.com]
| 9:14 am on Oct 22, 2003 (gmt 0)|
Thanks. I'm a long time lurker, first time poster. :)
The robots.txt size is... um, erm, ha... a mere 38,357 bytes. So that's about 37M. Did I mention it was huge?
It passes the validation tests. I found your post on how to shorten robots.txt extremely useful. This particular file is silly in length because it excludes so many different URLs explicitly.
I've never considered emailing a search engine (or the people who work there) directly before. I might just get over my "I'm not worthy!" shtick and give it a go. I've also told the mastermind behind this particular robots.txt that meta tagging the specific pages in question and using robots.txt to protect entire directories might be the way to go in this case.
| 9:47 pm on Oct 22, 2003 (gmt 0)|
Well, it can't hurt to try contacting them. I have had good luck -- even with surprisingly-major players -- in reporting problems. As long as you write it up thoroughly and succinctly, and describe the problem from their point of view, good results can often be had... even a direct response from someone who can/will fix the problem, occasionally. I write up problem reports as factually as I can, provide links to relevant references and URLs, and invite them to look at my files (e.g. to verify that my robots.txt is valid), and generally go with the attitude of "Heads up - I think you have a problem and here is what the problem is."
There is a practical difference between using robots.txt and using on-page meta robots tags. If only the on-page robots tag is used, then the robot has to actually fetch the page to read it. If a page is disallowed by robots.txt, then it (usually) won't be fetched at all. So the choice between the two should be made with bandwidth in mind.
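As a rough illustration of the robots.txt side of that difference, the sketch below uses Python's standard urllib.robotparser to decide whether a URL may be fetched before any request for the page is made. The rules, the "ExampleBot" user-agent and the example.com URLs are made up for the example.

```python
from urllib.robotparser import RobotFileParser

# Made-up rules: one whole directory is excluded for every robot.
rules = """User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# A well-behaved robot checks the rules first, so the disallowed page
# is never requested and costs no bandwidth.
blocked = parser.can_fetch("ExampleBot", "http://www.example.com/private/page.html")
allowed = parser.can_fetch("ExampleBot", "http://www.example.com/public/page.html")

print(blocked)  # False
print(allowed)  # True

# By contrast, an on-page tag such as
#   <meta name="robots" content="noindex">
# can only be seen after the page itself has been downloaded.
```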
Long-term, the site should be modified to group pages into directories and subdirectories based upon whether you want them indexed. It is just one of many factors that determine the directory architecture of a site, but it can be important.
I took your initial post "X hundred or Y thousand bytes" to hint that your robots.txt was anywhere from 500 bytes to maybe 99kB in size, but big is a relative term. So, is it 38,357 bytes, or 38,357k bytes? Either way, it's big, but 38,357k bytes (37.5M) is huge!
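For reference, the arithmetic behind the two readings, as a quick sketch assuming 1k = 1,024 bytes:

```python
size = 38_357

# Reading 1: the file is 38,357 bytes.
as_kb = size / 1024

# Reading 2: the file is 38,357k bytes.
as_mb = size * 1024 / 1024**2

print(f"{as_kb:.1f}k")  # 37.5k
print(f"{as_mb:.1f}M")  # 37.5M
```

The same digits come out either way, just three orders of magnitude apart, which is why the unit matters here.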