Forum Moderators: goodroi
robots.txt validator:
[searchengineworld.com...]
I was tired of seeing all those 404 errors in my logs for the robots.txt file. So, I went on a quest over a year ago to learn everything I could. Now, one of the first things I do is set up the robots.txt and disallow directories that contain working files, css, javascript and any other content that I don't want indexed.
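A minimal sketch of that kind of robots.txt. The directory names here are only hypothetical stand-ins for wherever you keep working files, css and javascript; the file itself must sit in the site root as /robots.txt:

```
User-agent: *
Disallow: /work/
Disallow: /css/
Disallow: /js/
```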
There have been many conversations on this topic. I've seen comments to the effect that a spider called the robots.txt file, didn't find it, and left without grabbing anything else. Follow-up comments stated that the spider had not been back since that first call for the robots.txt file. What does this mean? I'm not really sure. Still, I'm one to play it safe. If that robots.txt contains nothing other than...
User-agent: *
Disallow:
...which tells all spiders that they are welcome to index the entire site, then so be it! I kind of looked at it this way...
They came a knockin' and no one was home (no robots.txt file), so they left. They didn't say when they would return so I missed them, that first time (bummer). I've now put the robots.txt in place. They came a knockin' again one month later, I was home and let them in. They got what they came for!
How fond am I of the robots.txt file? Do a search in Google for robots text or robots text file!
It isn't very common, but I have been seeing it quite a lot over the last few months. I don't know whether this is done by server admins trying to be more secure or if it's the default for some kind of Apache setup.
The solution is easy, just upload a blank /robots.txt as DrOliver suggests.
I realize I don't have a robots.txt on my site. Is it important to have that file in the root directory?
As others have mentioned, no. A robots.txt is for blocking robots that obey the standard.
Some leave it intentionally missing so that requests for it 404 and show up in the error logs. That way you can identify obedient spiders easily enough.
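For what it's worth, you can see how a standard-obeying spider reads these rules with Python's built-in urllib.robotparser. The directory names below are just examples, and the rules are fed in directly rather than fetched over HTTP:

```python
from urllib.robotparser import RobotFileParser

# A compliant crawler parses robots.txt before fetching anything else.
rules = """\
User-agent: *
Disallow: /cgi-bin/
Disallow: /css/
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# Paths outside the disallowed directories are fetchable; paths inside are not.
print(parser.can_fetch("*", "/index.html"))    # True
print(parser.can_fetch("*", "/cgi-bin/form"))  # False
```

A missing robots.txt (the 404 case discussed above) is treated by compliant spiders as "everything allowed", same as an empty Disallow line.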
Although this is common and acceptable per the standard:
User-agent: *
Disallow:
I wouldn't recommend it. There are some spiders that will incorrectly interpret that as blocking all content.
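For contrast, the only difference between "allow everything" and "block everything" is a single slash, which may be why some parsers get it wrong. Blocking the entire site looks like this:

```
User-agent: *
Disallow: /
```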
Ya jady, block anything you think is sensitive. I think cgi-bin and java would qualify for that.