Forum Moderators: goodroi
robots.txt validator:
[searchengineworld.com...]
I was tired of seeing all those 404 errors in my logs for the robots.txt file. So, I went on a quest over a year ago to learn everything I could. Now, one of the first things I do is set up the robots.txt and disallow directories that contain working files, css, javascript and any other content that I don't want indexed.
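A minimal sketch of that kind of robots.txt. The directory names here are only hypothetical stand-ins for wherever you keep working files, css and javascript; the file itself must sit in the site root as /robots.txt:

```
User-agent: *
Disallow: /work/
Disallow: /css/
Disallow: /js/
```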
There have been many conversations on this topic. I've seen comments to the effect that a spider called the robots.txt file, didn't find it, and left without grabbing anything else. Follow-up comments stated that the spider had not been back since that first call for the robots.txt file. What does this mean? I'm not really sure. Still, I'm one to play it safe. If that robots.txt contains nothing other than...
User-agent: *
Disallow:
...which tells all spiders that they are welcome to index the entire site, then so be it! I kind of looked at it this way...
They came a knockin' and no one was home (no robots.txt file), so they left. They didn't say when they would return so I missed them, that first time (bummer). I've now put the robots.txt in place. They came a knockin' again one month later, I was home and let them in. They got what they came for!
How fond am I of the robots.txt file? Do a search in Google for robots text or robots text file!
It isn't very common, but I have been seeing it quite a lot over the last few months. I don't know whether this is done by server admins trying to be more secure or if it's the default for some kind of Apache setup.
The solution is easy, just upload a blank /robots.txt as DrOliver suggests.
I realize I don't have a robots.txt on my site. Is it important to have that file in the root directory?
As others have mentioned, no. A robots.txt is for blocking robots that obey the standard.
Some leave it intentionally missing so that requests for it 404 and show up in the error logs. That way you can identify obedient spiders easily enough.
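For what it's worth, you can see how a standard-obeying spider reads these rules with Python's built-in urllib.robotparser. The directory names below are just examples, and the rules are fed in directly rather than fetched over HTTP:

```python
from urllib.robotparser import RobotFileParser

# A compliant crawler parses robots.txt before fetching anything else.
rules = """\
User-agent: *
Disallow: /cgi-bin/
Disallow: /css/
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# Paths outside the disallowed directories are fetchable; paths inside are not.
print(parser.can_fetch("*", "/index.html"))    # True
print(parser.can_fetch("*", "/cgi-bin/form"))  # False
```

A missing robots.txt (the 404 case discussed above) is treated by compliant spiders as "everything allowed", same as an empty Disallow line.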
Although this is common and acceptable per the standard:
User-agent: *
Disallow:
I wouldn't recommend it. There are some spiders that will incorrectly interpret that as blocking all content.
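For contrast, the only difference between "allow everything" and "block everything" is a single slash, which may be why some parsers get it wrong. Blocking the entire site looks like this:

```
User-agent: *
Disallow: /
```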
Ya jady, block anything you think is sensitive. I think cgi-bin and java would qualify for that.