|Content of Robots.txt files|
You wouldn't believe what some hosts think should be in a Robots.txt file!
As a spider writer I'm very aware of the Robots.txt file, its syntax, how to parse it, and how to follow it. So I might have a better grasp on it than most. But would you believe:
No Robots.txt at all on 25% of all hosts.
30% of the time, when a Robots.txt file was found, it contained nothing but HTML code. No valid Robots.txt syntax at all.
Nearly 40% of those that had the file, and whose file didn't contain HTML, were still so badly written they couldn't be followed.
This is over a few years now, thousands of hosts. So please, make and check your Robots.txt file.
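Checking for the first two failure modes is easy to script. Here's a rough sketch (the function name and directive list are my own, not from any standard library) that flags a robots.txt body that is really an HTML page or contains no recognizable records:

```python
def looks_like_robots_txt(body: str) -> bool:
    """Return True if the body contains at least one recognizable directive."""
    lowered = body.lower()
    # An HTML error page served in place of robots.txt is a common failure.
    if "<html" in lowered or "<!doctype" in lowered:
        return False
    # These are the directives a well-formed file is built from.
    directives = ("user-agent:", "disallow:", "allow:", "sitemap:")
    for line in body.splitlines():
        if line.strip().lower().startswith(directives):
            return True
    return False
```

A spider can run this on anything fetched from /robots.txt before bothering to parse it.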
Are you referring to 'hosts' (ISPs) or webmasters?
The webmaster determines the contents of robots.txt, and the host simply stores whatever is sent up, if I have that right.
Before I learned much of anything, I had a robots.txt
file that read something like
flakes, weirdos, new-age types ..
I've since put up a better one thanks to the good advice on this forum.
I was referring to the webmasters more than the physical hosts, really. I just can NOT believe how many people, servers, hosts, or webmasters think full HTML belongs in the file.
I never use them myself unless I want to exclude a folder. If in doubt, leave it out; it can do more harm than good if you get it wrong.
PS: AlucardSpyderWriter, like the handle; just keep your eye out for rellikeripmaveht lurking around here.
My new favorite is also people using something like -
Can we all agree that the asterisk (*) never belongs in this file other than in a "User-agent: *" line?
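For reference, a conventional file needs no wildcards anywhere else. A minimal example that excludes every robot from a single folder (the folder name here is just an illustration):

```
# Keep all robots out of /private/, allow everything else.
User-agent: *
Disallow: /private/
```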
Also, I feel like I'm getting picky here, what with everything else webmasters have to worry about. But this file is just about all a good spider (and good spider writer) has to go by, other than the robot METAs.
|Also, I feel like I'm getting picky here, what with everything else webmasters have to worry about |
Yes, you are being picky. You may understand how it works, but a lot of others don't, especially the newbies, and maybe that's why you have seen so many mistakes made with the coding in this file.
You don't have to have the file, and you can have just a blank robots.txt file. The reason for the blank file is to eliminate the 404s you would otherwise get in your log file whenever the file is requested but not available to be served.
The robots.txt file only works on well-behaved spiders; rogue spiders just walk all over it and don't obey, so you have to implement other means to stop them.
Umm, yes, I know, I said that, and that I got.
It's not the lack of it I don't get, and I understand all kinds of simple mistakes..
But when my spiders get more links out of the HTML in a Robots.txt than the HTML of a page itself, I just remember how silly I first thought it was to include code to check Robots.txt for HREFs.
I get fairly annoyed at some spider writers, too. It's unbelievable how many spiders from well-known organizations can't handle the multiple-user-agent policy records clearly described by the Standard. Most just pack up and go home, failing to find their user-agent string, despite the fact that it is included. Others decide they can do what they like, since they deem the file invalid.
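For what it's worth, even the stock Python parser handles this case correctly. A small sketch (the bot names are invented) showing one record whose two User-agent lines both receive the same rules:

```python
from urllib.robotparser import RobotFileParser

# One record, two user-agents: per the Standard, the Disallow line
# below applies to both GoodBot and OtherBot.
record = """\
User-agent: GoodBot
User-agent: OtherBot
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(record.splitlines())

print(parser.can_fetch("GoodBot", "/private/page.html"))   # False
print(parser.can_fetch("OtherBot", "/private/page.html"))  # False
print(parser.can_fetch("GoodBot", "/public/page.html"))    # True
```

A spider that scans for its own name line by line, instead of grouping User-agent lines into records like this, is exactly the kind that "packs up and goes home."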
So, there is frustration on both ends, and the solution is not to complain, but to inform.
Standard for Robot Exclusion [robotstxt.org]
robots.txt syntax checker [searchengineworld.com]
|30% of the time, when a Robots.txt file was found, it contained nothing but HTML code. |
Possibly some of these are caused by missing robots.txt files. If the file is missing, some servers return their customized 404 error page, which is HTML, but with a 200 status code, so the spider sees a "successful" fetch of an HTML document.
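One cheap defense on the spider side (a sketch; this helper is hypothetical, not part of any library) is to distrust a 200 response whose Content-Type isn't text/plain, since a customized error page almost always comes back as text/html:

```python
def is_real_robots_response(status: int, content_type: str) -> bool:
    """A 200 serving text/plain is trustworthy; a 200 serving text/html
    is probably a customized error page standing in for a missing file."""
    return status == 200 and content_type.lower().startswith("text/plain")
```

A body-level check (looking for HTML tags in the content) catches the servers that misreport Content-Type as well.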