Google 5000 character limit for robots.txt?

physics

5:22 am on Jun 29, 2006 (gmt 0)

I'm using Sitemaps on one of my established sites and had a gander at the pages returning 404 errors. Wouldn't you know it, some of them are forbidden by robots.txt and have been for years. So I tried the Google robots.txt checker tool, and when I tried to see whether one of those pages was blocked, it complained that the robots.txt file is longer than 5,000 characters and therefore too long for the tool to deal with. Well, that set off a red flag. I read my robots.txt (which, like the old WebmasterWorld one, is full of records blocking robots), and I had put all of my page rules at the end, after the records that block bots (probably a bad idea in the first place). So my theory is that Google isn't reading past 5,000 characters, but not everyone agrees:
[webmasterworld.com...]
In any case, I've moved my blocked-page rules to the top and shortened the whole thing to 5,000 characters... better safe than sorry.
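For anyone who wants to check their own file, here's a quick sketch in Python (the hostname is a placeholder, and the 5,000 figure is just what Google's checker tool reported -- nobody has confirmed that Googlebot itself stops there):

    import urllib.request

    LIMIT = 5000  # character count Google's robots.txt checker tool complains about

    def check_robots(host):
        # Fetch the live robots.txt and see how it measures up against the limit.
        url = "http://" + host + "/robots.txt"
        body = urllib.request.urlopen(url).read().decode("utf-8", "replace")
        print("%s: %d characters" % (url, len(body)))
        if len(body) > LIMIT:
            print("Over %d characters -- consider trimming, or moving the important rules to the top." % LIMIT)

    check_robots("example.com")  # placeholder hostname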

Pfui

6:53 am on Jun 29, 2006 (gmt 0)

A 5,000-character limit will require me to serve individual robots.txt files: one for the Bad Guys (nothing shows but the minimum generic stuff), one for Google (to meet the limit), then one for MSN and Yahoo and Ask combined, I guess, and then maybe another for those requiring their own specs and -- and --

And it all reminds me of the browser code wars -- à la JavaScript vs. JScript -- just with a LOT more players.

Anyone else getting tired of keeping up with each SE's peculiarities?

Heck, if I don't want 'em, I'd rather 403 'em. Quick and painless. (Now don't start, you Standards-and-Protocols folks. We've been around that track before:)

physics

8:15 am on Jun 29, 2006 (gmt 0)

> A 5,000-character limit will require me to serve individual robots.txt files: one for the Bad Guys (nothing shows but the minimum generic stuff),

If you want to get that sophisticated, you can rewrite requests for robots.txt to a script and serve a robots.txt listing just the pages to block to Google et al., while serving a disallow-all robots.txt to everyone else; see:
[webmasterworld.com...]
and
[webmasterworld.com...]

That way you don't have to keep track of all of the 'other' or 'bad' bots out there, just the good bots.
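Here's a rough sketch of the idea as a CGI script in Python (the user-agent substrings and the rules themselves are placeholders, not a vetted whitelist):

    #!/usr/bin/env python3
    # Dynamic robots.txt: whitelisted 'good' bots get the real rules;
    # everything else gets a blanket disallow.
    import os

    GOOD_BOTS = ("googlebot", "slurp", "msnbot", "teoma")  # placeholder substrings

    REAL_RULES = "User-agent: *\nDisallow: /cgi-bin/\nDisallow: /private/\n"
    BLOCK_ALL = "User-agent: *\nDisallow: /\n"

    agent = os.environ.get("HTTP_USER_AGENT", "").lower()
    body = REAL_RULES if any(bot in agent for bot in GOOD_BOTS) else BLOCK_ALL

    print("Content-Type: text/plain")
    print()                # blank line ends the CGI headers
    print(body, end="")

On Apache you'd still need a rewrite rule (or similar) so that requests for /robots.txt actually reach the script.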

Pfui

12:35 pm on Jun 29, 2006 (gmt 0)

Thanks for the links! And here's the "if ($agent" part --

[webmasterworld.com...]

(I think that's all of the parts:)

Actually, serving up two different versions -- one for allowed bots and one for all the rest -- has been on my to-do list since last week, and I've been meaning to look up some posts by Jim Morgan on how to control everything with mod_rewrite. I just hadn't planned on making, updating, and serving a third version just for Google (allowed). Oh, well!

jdMorgan

2:43 pm on Jun 29, 2006 (gmt 0)

> I read my robots.txt (which, like the old WebmasterWorld one, is full of records blocking robots), and I had put all of my page rules at the end, after the records that block bots ...

This quote sets off a red flag for me: a given robot obeys exactly one record in your robots.txt -- either the first one it finds whose User-agent: token matches its user-agent name, or the one that matches its name most exactly. So, technically, there are no such things as 'rules for robots' and 'rules for pages' -- the control of pages should be part and parcel of the robot-control records.
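To illustrate (the path is made up): if Googlebot matches the first record below, it stops there and never sees the page rule in the second one --

    User-agent: Googlebot
    Disallow:

    User-agent: *
    Disallow: /private/

Here /private/ stays wide open to Googlebot; to block it, the Disallow line has to be repeated inside the Googlebot record.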

The Google robots.txt checker tool may have a 5,000-character limit, but that does not necessarily apply to Googlebot itself. Just FYI, I've got sites with robots.txt files ranging from 3kB to 16kB, and *none* of them has any problems. I would strongly suggest looking at other aspects of this problem rather than at file size.

That said, a large file is somewhat wasteful of bandwidth over time. To reduce the file size, go through your robots.txt and remove all records pertaining to robots that you do not want on your site, and also those pertaining to robots known to ignore robots.txt.

Then add a wild-card record at the bottom to Disallow any robot whose User-agent token does not appear in the preceding records. After you've done so, any 'recognized' and 'good' robot will find the record that applies to it and obey that one. Any robot you don't care about will not find a record specific to it, and so will obey the last record -- the wild-card record added above -- and recognize that it should not spider your site. And bad bots don't heed (or even read) robots.txt at all, so why list them?
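Structurally, the trimmed-down file ends up looking something like this (robot names and paths are just examples):

    # One record per robot you actually want, page rules included:
    User-agent: Googlebot
    Disallow: /cgi-bin/
    Disallow: /private/

    User-agent: Slurp
    Disallow: /cgi-bin/
    Disallow: /private/

    # Wild-card record at the bottom: any robot not named above obeys this one.
    User-agent: *
    Disallow: /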

If you are strong on server-side scripting, then I'll second physics' recommendation above. But test thoroughly!

Jim

g1smd

8:53 pm on Jul 1, 2006 (gmt 0)

5,000 characters?

I remember seeing a limit of 50 lines stated on the Google URL Console (URL Removal Tool) at some time in the past...