Forum Moderators: goodroi


Character Limit

Has anyone heard of a character limit in robots.txt?

         

bumpaw

10:54 pm on Feb 6, 2006 (gmt 0)

10+ Year Member



I was looking through the information available on my sitemaps registered with Google Sitemaps and found something interesting.

For verified sitemaps, Google will show you stats and information on each sitemap plus some other info on your site. It checked my robots.txt and flagged an error that amounted to the file being "over 2000 characters".

Has anyone heard of this? I searched a while and came up empty.

sitelynx

12:23 pm on Feb 8, 2006 (gmt 0)

10+ Year Member



Yes, I have just picked up on this too with the new Google tool - I can't find any supporting documentation that suggests a character limit.

Brett_Tabke

12:31 pm on Feb 8, 2006 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



I have addresses of over 200 robots.txt files that are over 1 meg in length. I just checked 3 at random, and Google is properly obeying the exceptions - even one that is 6 meg long.

There is no character limit.

2.5 meg:
[lld.dk...]

800k
[vm.ibm.com...]

600k
[lifesite.net...]

500k
[chop.edu...]

bumpaw

2:22 pm on Feb 8, 2006 (gmt 0)

10+ Year Member



I found Google's documentation on this new feature in their sitemaps area. The character limit isn't mentioned, but the other problems one might have with a robots.txt file are.

I used their error check on my file, which is basically the old WebmasterWorld robots.txt with some additions, and it came up in red with "over 2000 characters". You can play with it there and retest. When I chopped it down under their size, the message was gone.

Why have something like this without documentation?

Dijkgraaf

9:10 pm on Feb 8, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Possibly this limit only exists in respect to the sitemap information.

Pfui

1:50 am on Feb 9, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I'm interested in everyone's thoughts about Brett's whopper examples --

IBM's lists directory contents down to the individual files. It would be a LOT shorter if Disallows 'ended' at the directory level. Is there some advantage to the full-path format?
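There isn't really an advantage: robots.txt rules match by prefix, so one directory-level Disallow already covers every file beneath that directory. A quick sketch with Python's standard robotparser and a hypothetical path:

```python
from urllib import robotparser

# Sketch with hypothetical paths: under standard prefix matching,
# a single directory-level Disallow blocks all files under it,
# so listing the individual files adds nothing.
rp = robotparser.RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /private/",
])

print(rp.can_fetch("*", "https://example.com/private/file.html"))  # False
print(rp.can_fetch("*", "https://example.com/public/page.html"))   # True
```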

LLD's, the really, REALLY big one, is the same way -- a gazillion directories, a gazillion contents -- like an inverse SiteMap up to six levels deep! Benefit?

CHoP's includes a mix of instructions, some of which are Disallows, but they're missing spaces (e.g.: Disallow:http://www...). It also includes directory content down to the individual files, plus a ton of Allows (albeit dynamic) in full-path format. Wouldn't the latter be redundant? Or do you think they're included because they're dynamic and, in the old days at least, didn't get spidered?
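Beyond the missing space, the bigger problem with lines like that is the value itself: a Disallow value is supposed to be a root-relative path, not a full URL. A contrast with hypothetical paths:

```
User-agent: *
# Non-standard: a full URL as the rule value
Disallow:http://www.example.com/private/page.html
# Standard: a root-relative path; one rule covers the whole directory
Disallow: /private/
```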

LifeSite's contains no robots instructions whatsoever, 'just' 14,561 URLs. Wouldn't that have the opposite effect -- to specifically offer-up those URLs for spidering? (Interesting idea, that...)
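That matches how parsers actually behave: a robots.txt with no User-agent/Disallow records imposes no restrictions at all. A minimal sketch with Python's standard robotparser and a hypothetical URL:

```python
from urllib import robotparser

# A robots.txt containing no records restricts nothing,
# so every URL is considered fetchable.
rp = robotparser.RobotFileParser()
rp.parse([])  # no directives at all

print(rp.can_fetch("*", "https://example.com/any/page.html"))  # True
```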

Dijkgraaf

8:39 am on Feb 9, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Well, having robots.txt files that large is asking for trouble with bots like Yahoo that request robots.txt frequently, as it increases your bandwidth costs dramatically.

I can't see any advantage to doing such extensive listing. I'd say some of them failed to grasp how the Disallows work.
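The bandwidth point can be made concrete with back-of-envelope arithmetic (both figures below are assumptions for illustration, not measurements):

```python
# Rough, illustrative arithmetic for the bandwidth concern above.
robots_size_mb = 6        # roughly the largest file cited in this thread
fetches_per_day = 200     # hypothetical: several crawlers re-requesting robots.txt

daily_mb = robots_size_mb * fetches_per_day   # 1200 MB per day
monthly_gb = daily_mb * 30 / 1024             # ~35 GB per month

print(f"{daily_mb} MB/day, roughly {monthly_gb:.0f} GB/month")
```

Even at a modest assumed fetch rate, a multi-megabyte robots.txt adds tens of gigabytes a month for a file crawlers re-request constantly.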