
Forum Moderators: goodroi


Oddly long robots.txt file

Seeking explanation for weird robots.txt file

     

bendede

3:15 pm on Oct 29, 2004 (gmt 0)

Inactive Member
Account Expired

 
 


I've come across what seems like a highly unusual robots.txt file on a highly visible public web site. I'm puzzled by it and was wondering whether any experts in the Robots Exclusion Protocol could explain why this site's robots.txt file is the way it is.

The robots.txt file in question has almost 2,000 lines, all of them "disallow" directives. Most robots.txt files I've seen have only a handful of directives. Why would it be necessary to have so many separate directives?

Some of the disallow directives disallow access to pages that have been removed from the site. If a page has been removed, is there any reason to disallow robots from it?

Out of curiosity, is there any way to find the longest robots.txt file on the Internet?

Any insight or help unraveling this mystery would be appreciated. Thanks.

Dave

jdmorgan

11:08 pm on Oct 29, 2004 (gmt 0)

Senior Member

WebmasterWorld Senior Member jdmorgan is a WebmasterWorld Top Contributor of All Time 10+ Year Member

joined:Mar 31, 2002
posts:25430
votes: 0


> The robots.txt file in question has almost 2,000 lines, all of them "disallow" directives. Most robots.txt files I've seen have only a handful of directives. Why would it be necessary to have so many separate directives?

The main reason would be that the site is poorly organized for robots.txt control. In most cases, robots control should be taken into account when architecting a site's directory structure, so that robots can be Disallowed from entire directory branches rather than disallowing pages/files one at a time.
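For example (the directory names here are just illustrative), a site that groups its off-limits content under a few branches needs only one rule per branch:

User-agent: *
Disallow: /private/
Disallow: /archive/

rather than nearly 2,000 lines naming individual files.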

> Some of the disallow directives disallow access to pages that have been removed from the site. If a page has been removed, is there any reason to disallow robots from it?

A Disallow tells the robot not to try to fetch the page, so this will save the site bandwidth on 404 responses, and cut the number of 404 errors in the log file.
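For instance (the path is made up), keeping a line like

Disallow: /old-catalog/discontinued-widgets.html

in robots.txt tells a compliant robot not to request that URL at all, so the server never has to answer a 404 for it.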

Jim

powdork

5:19 am on Nov 8, 2004 (gmt 0)

Senior Member

WebmasterWorld Senior Member powdork is a WebmasterWorld Top Contributor of All Time 10+ Year Member

joined:Sept 13, 2002
posts:3346
votes: 0


Searching on Google for robots.txt will give a site with a very long file. What happens when the file gets larger than what the robot usually fetches?

jdmorgan

5:52 am on Nov 10, 2004 (gmt 0)

Senior Member

WebmasterWorld Senior Member jdmorgan is a WebmasterWorld Top Contributor of All Time 10+ Year Member

joined:Mar 31, 2002
posts:25430
votes: 0


The robots.txt file can be parsed on the fly to extract the record that applies to the robot doing the spidering. This information can then be used immediately, or it can be put into storage for later use. It might be stored as plain text, or it might be tokenized to save space. So it is not necessarily the case that all of a huge robots.txt file needs to be stored.
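As a rough illustration (the site name and user-agent below are made up), a crawler written in Python could do exactly this with the standard library's robotparser: fetch the file once, keep only the parsed rules, and consult them before each request:

# Minimal sketch: fetch robots.txt once, keep only the parsed rules,
# and check each URL against them. Site and user-agent are hypothetical.
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("https://www.example.com/robots.txt")
rp.read()  # fetch and parse; only the parsed rules stay in memory

# can_fetch() applies the record matching this user-agent, or the "*" record
if rp.can_fetch("ExampleBot", "https://www.example.com/private/page.html"):
    print("allowed")
else:
    print("disallowed by robots.txt")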

In the case of a huge robots.txt file, who knows what a given robot will do? It might accept it or it might give up and declare it invalid after reaching a certain limit. The only way to find the limit is to research and experiment or to buy a lot of drinks for a crawler engineer from the search company you are most interested in... :)

[added] It is interesting to note that the file you refer to could be deemed technically invalid, since it uses tab characters where the Standard specifies space characters. I wouldn't worry about a 76kB file, but I might worry about a 768kB file... [/added]

Jim

7:19 am on Nov 28, 2004 (gmt 0)

New User

10+ Year Member

joined:Nov 27, 2004
posts:6
votes: 0


Where is this robots.txt?

From what I've seen so far, any one of my spiders should take a robots.txt of any size and deal with it, as long as each line is a valid directive.

At that point, file size only comes into consideration when the file might be larger than the available hard drive space (i.e., 100+ GB).

mike_fwt

9:36 pm on Dec 2, 2004 (gmt 0)

Inactive Member
Account Expired

 
 


Noting that someone mentioned that having numerous disallow directives is the result of a very poor design scheme, what do you do when there are just too many directories to list individually and you need to reverse the process... say, have an "allow" instead of just "disallow"? Don't suggest redesigning the site... that's not an option for me right now.

Is that simply not possible? You'd think spider developers would have considered this one, or perhaps I'm missing a key aspect of the whole development side.

Simply stated, can I set up the following and hope and pray that it works for our massive site? I guess my question boils down to, "is there such a field as 'allow'"?

User-agent: *
Disallow: /
Allow: /some-specific-directory
Allow: /another-specific-directory

whoisgregg

9:31 pm on Dec 9, 2004 (gmt 0)

Senior Member

WebmasterWorld Senior Member whoisgregg is a WebmasterWorld Top Contributor of All Time 10+ Year Member

joined:Dec 9, 2003
posts:3416
votes: 0


> Simply stated, can I set up the following and hope and pray that it works for our massive site?

Yes you can. (Hope and pray, that is.)

> I guess my question boils down to, "is there such a field as 'allow'"?

No, there isn't. Disallow is the only exclusion field defined in the Standard.

 
