Forum Moderators: goodroi

Robots.txt Validator

8:45 pm on May 22, 2001 (gmt 0)

Full Member

joined:Jan 24, 2001
votes: 0

Got hit from this one today:

IP: (omitted)
Robots.txt Validator [searchengineworld.com...]

Did it validate? Oh, that's right, I don't have a robots.txt file. Actually, I do; I call it a .htaccess file. :)
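The quip above is about blocking crawlers at the server level instead of asking nicely. A hypothetical Apache .htaccess sketch (the "BadBot" user-agent name is illustrative, not a real crawler): unlike robots.txt, which is purely advisory, this actually refuses the request.

```apache
# Hypothetical sketch: refuse requests whose User-Agent starts with "BadBot".
# robots.txt politely asks; mod_rewrite enforces.
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} ^BadBot [NC]
RewriteRule .* - [F]
```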

8:49 pm on May 22, 2001 (gmt 0)

Senior Member

joined:Nov 25, 2000
votes: 0

It really is amazing how many people don't use a robots.txt file. I think it's a worthwhile asset just to reduce the 404s.

9:00 pm on May 22, 2001 (gmt 0)

Full Member

joined:Jan 24, 2001
votes: 0

Reduce the 404s? Come on now, Toolman. When a spider hits my nonexistent robots.txt file, it is redirected to my custom 404 error page, which contains lots and lots of links to other pages on my site. In fact, I haven't resubmitted to a search engine once, and I am very well indexed by all of them.
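For reference, the custom error page described above is typically wired up in Apache with an ErrorDocument directive; a minimal sketch, where the page path is an assumption:

```apache
# Hypothetical sketch: hand any missing URL (including a missing robots.txt)
# to a custom, link-rich error page. /errors/404.html is illustrative.
ErrorDocument 404 /errors/404.html
```

Note that if the server answers with an external redirect or a 200 status instead of a true 404, spiders see a valid page where none exists.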
9:17 am on May 26, 2001 (gmt 0)

Administrator from US 

brett_tabke

joined:Sept 21, 1999
votes: 112

The data I gathered from that crawl is very interesting to me. OK, I'm an SEO wonk, and anything related is interesting to me.

I hadn't collated the last batch of data when I started writing this; it takes a very long time to study that many files. That's done now, and I ended up with 180-210k robots.txt files (not precisely counted yet). Those are just the semi-valid ones. Of the ones I have collated data on, here are some highlights:

About half of all robots.txt files are in MS-DOS format (they should use Unix line endings).
About 60% of all requests for robots.txt ended up as redirects to an HTML page. That is poor server configuration; search engines do have to deal with it, but it is really bad style.
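The line-ending point above is easy to check mechanically. A minimal sketch, assuming the file is read as raw bytes:

```python
# Sketch: classify a robots.txt file's line endings and normalize them.
def line_ending_style(data: bytes) -> str:
    """Return 'dos', 'unix', or 'mixed' for the file's line-ending style."""
    crlf = data.count(b"\r\n")
    bare_lf = data.count(b"\n") - crlf  # LFs not preceded by CR
    if crlf and bare_lf:
        return "mixed"
    return "dos" if crlf else "unix"

def to_unix(data: bytes) -> bytes:
    """Normalize CRLF (and stray CR) to plain LF."""
    return data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")

print(line_ending_style(b"User-agent: *\r\nDisallow: /\r\n"))  # prints "dos"
```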

About 6% of all robots.txt files are not valid, and many search engines will ignore them.

Common fatal and near-fatal errors (ones that would give a spider cause to doubt the validity of the file):

Multiple disallows per line. Only one Disallow is acceptable per line; you cannot combine disallows.
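For reference, the correct form under the robots exclusion standard is one Disallow directive per line, grouped under a User-agent line (the paths here are illustrative):

```
User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /private/
```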

Use of an ALLOW tag. There is no ALLOW in the robots exclusion standard.

Wild gyrations in formatting. From structured formatting with spaces at the beginning of the line, to attempts at multi-line comments, there are so many variations that it makes me wonder what search engines do with them.

Size. There were hundreds of robots.txt files near or over a megabyte in size. I simply can't imagine a search engine treating a file that size as valid. When you consider the overhead involved in parsing a file that size, some search engine boxes would literally run out of memory. I don't know if there is a "safe" size, but a meg is in the questionable range. It represents bad server/directory setup; e.g., ban whole directories, not 10k individual files in the directory.

Doc format. Yes, we ran into 50+ robots.txt files that were in Microsoft Word format. No kidding; we loaded some of them up in Word, and there was a very pretty-looking robots.txt.

HTTP redirects. We ran into many robots.txt files that were valid, but parked behind an HTTP redirect. It's questionable whether search engines would treat those as valid. (e.g., foo.com/robots.txt redirected to foo.com/bar/robots.txt or foo.com/robots2.txt)
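Whether a site serves robots.txt directly, without the redirect described above, can be probed with a client that refuses to follow redirects. A minimal sketch using Python's standard library; the URL handling is deliberately simple:

```python
# Sketch: check whether GET /robots.txt answers 200 text/plain directly,
# rather than bouncing through an HTTP redirect.
import urllib.error
import urllib.request

class NoRedirect(urllib.request.HTTPRedirectHandler):
    def redirect_request(self, req, fp, code, msg, headers, newurl):
        return None  # returning None makes urllib raise instead of following

def robots_served_directly(base_url: str) -> bool:
    """True if /robots.txt is served directly as plain text."""
    opener = urllib.request.build_opener(NoRedirect())
    try:
        resp = opener.open(base_url.rstrip("/") + "/robots.txt", timeout=10)
    except urllib.error.HTTPError:
        return False  # a 3xx redirect or an error status
    ctype = resp.headers.get("Content-Type", "")
    return resp.status == 200 and ctype.startswith("text/plain")
```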

Bogus txt files: we hit a huge server farm that was loading robots.txt with keyword lists. Why? Who knows.
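Taken together, the errors listed above could be caught by a crude sanity check. A sketch in Python; the size threshold and the list of known fields are assumptions for illustration, not any engine's actual rules:

```python
# Sketch of a crude robots.txt sanity check for the errors noted above.
MAX_SIZE = 100 * 1024  # assumption: flag anything approaching the megabyte range
KNOWN_FIELDS = {"user-agent", "disallow"}  # the original 1994 standard's fields

def check_robots(text: str) -> list[str]:
    """Return a list of human-readable problems found in a robots.txt body."""
    problems = []
    if len(text.encode()) > MAX_SIZE:
        problems.append("file suspiciously large")
    for n, line in enumerate(text.splitlines(), 1):
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        if ":" not in line:
            problems.append(f"line {n}: no field:value pair")
            continue
        field, _, value = line.partition(":")
        if field.strip().lower() not in KNOWN_FIELDS:
            problems.append(f"line {n}: unknown field {field.strip()!r}")
        if "disallow:" in value.lower():
            problems.append(f"line {n}: multiple disallows on one line")
    return problems
```

For example, `check_robots("Disallow: /a Disallow: /b")` flags the combined-disallow error, while a well-formed file returns an empty list.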

Red Flags:
We identified over 200 "server farms" or "domain farms" simply by the identical nature of their robots.txt files (keep that in mind, cloakers). The largest we found was a robots.txt duplicated across over 800 domains.

Early Phase One: