Forum Library, Charter, Moderators: goodroi

Sitemaps, Meta Data, and robots.txt Forum

    
Do we need robots.txt?
irock




msg:1526151
 9:54 pm on Jun 5, 2002 (gmt 0)

I realize I don't have a robots.txt on my site. Is it important to have that file in the root directory?

 

brotherhood of LAN




msg:1526152
 9:56 pm on Jun 5, 2002 (gmt 0)

Hi irock,

the short answer is no.

Beachboy




msg:1526153
 9:57 pm on Jun 5, 2002 (gmt 0)

Not unless you need to issue a specific directive to a spider, usually having to do with excluding certain areas of the site from spidering. If not, then don't worry.

hurlimann




msg:1526154
 9:58 pm on Jun 5, 2002 (gmt 0)

No. The file is only important if you want to tell the spiders and others to do something out of the ordinary.

buckworks




msg:1526155
 10:13 pm on Jun 5, 2002 (gmt 0)

On the other hand, if spiders look for robots.txt and don't find it, that would trigger your custom error page if you had one. You might save bandwidth (and also take a few errors out of your logs) by having a really basic robots.txt for them to find.

MHes




msg:1526156
 11:37 pm on Jun 5, 2002 (gmt 0)

I disagree with all of you! I think it is important...
Any feature within a site that makes that site look more complete and competent to a spider has got to be good news. Too many sites have errors and poor HTML, and having a robots.txt file, even if it has no specific function, will make your site stand out from the crowd and might just give you a bonus point, however small that bonus may be.

ferrari360




msg:1526157
 1:31 am on Jun 6, 2002 (gmt 0)

Having a robots.txt file doesn't give you any ranking boost. However, if you do have a robots.txt file, make sure that it is properly configured; otherwise you might find yourself blocking spiders that you want to crawl your site.

robots.txt validator:
[searchengineworld.com...]
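
To make "properly configured" a little more concrete, here is a rough sketch of the kind of checks a validator might run. It is not the validator linked above; the function name `lint_robots_txt`, the set of accepted fields, and the messages are all assumptions made for illustration.

```python
# Fields this sketch accepts. Real spiders may recognise others; this short
# list is an assumption made to keep the example small.
KNOWN_FIELDS = {"user-agent", "disallow", "allow"}


def lint_robots_txt(text):
    """Return a list of (line_number, message) problems found in robots.txt text."""
    problems = []
    seen_user_agent = False
    for number, raw in enumerate(text.splitlines(), start=1):
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        if ":" not in line:
            problems.append((number, "missing ':' separator"))
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field not in KNOWN_FIELDS:
            problems.append((number, "unknown field '" + field + "'"))
        elif field == "user-agent":
            seen_user_agent = True
            if not value:
                problems.append((number, "empty User-agent value"))
        elif not seen_user_agent:
            problems.append((number, "rule appears before any User-agent line"))
        elif field == "disallow" and value == "/":
            problems.append((number, "blocks the entire site"))
    return problems


print(lint_robots_txt("User-agent: *\nDisalow: /private/\n"))
```

The last check is only a warning: blocking the whole site is valid, but it is the sort of accidental lockout described above.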

pageoneresults




msg:1526158
 2:42 am on Jun 6, 2002 (gmt 0)

I look at the robots.txt file as an essential part of the package. It's like the keywords meta tag, do ya? or don't ya?

I was tired of seeing all those 404 errors in my logs for the robots.txt file. So, I went on a quest over a year ago to learn everything I could. Now, one of the first things I do is set up the robots.txt and disallow directories that contain working files, CSS, JavaScript and any other content that I don't want indexed.
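
As a rough illustration of that kind of setup, the sketch below builds a robots.txt that disallows a few directories for all spiders and checks the result with Python's standard `urllib.robotparser`. The helper `build_robots_txt` and the directory names (`working`, `css`, `js`) are assumptions for the example, not the actual directories used on any real site.

```python
from urllib.robotparser import RobotFileParser


def build_robots_txt(disallowed_dirs):
    """Return robots.txt text that blocks the given directories for every spider."""
    lines = ["User-agent: *"]
    for directory in disallowed_dirs:
        # Normalise to "/name/" so only that directory tree is matched.
        lines.append("Disallow: /" + directory.strip("/") + "/")
    return "\n".join(lines) + "\n"


# Hypothetical directory names, chosen only for this example.
robots_txt = build_robots_txt(["working", "css", "js"])

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

css_ok = parser.can_fetch("*", "/css/site.css")    # inside a blocked directory
home_ok = parser.can_fetch("*", "/index.html")     # not covered by any rule

print(robots_txt)
print("css fetchable:", css_ok, "| home page fetchable:", home_ok)
```

Keeping the trailing slash on each rule limits the match to the directory itself, so a page such as /css-guide.html would not be caught by the /css/ rule.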

There have been many conversations on this topic. I've seen comments to the effect that the spider called the robots.txt file, didn't find it, and left without grabbing anything else. Followup comments stated that the spider had not been back since the first call for the robots.txt file. What does this mean? I'm not really sure. Although, I'm one to play it safe. If that robots.txt contains nothing other than...

User-agent: *
Disallow:

...which tells all spiders that they are welcome to index the entire site, then so be it! I kind of looked at it this way...

They came a knockin' and no one was home (no robots.txt file), so they left. They didn't say when they would return so I missed them, that first time (bummer). I've now put the robots.txt in place. They came a knockin' again one month later, I was home and let them in. They got what they came for!

How fond am I of the robots.txt file? Do a search in Google for robots text or robots text file!

ciml




msg:1526159
 12:06 pm on Jun 6, 2002 (gmt 0)

If you have no /robots.txt file, and your server is configured to send 403 "Forbidden" for files that don't exist (a bad move IMO), then you will not be spidered by Google. In that case, you need to fix the server or upload a /robots.txt file to allow spidering (even just an empty one).

chris_f




msg:1526160
 12:10 pm on Jun 6, 2002 (gmt 0)

I'm in the must-have boat. Every one of my sites has a robots.txt file, even though it just tells the spiders that the entire site is open to them. For instance, see below:

User-agent: *
Disallow:

conor




msg:1526161
 3:42 pm on Jun 6, 2002 (gmt 0)

I do the same as Chris_F: always, always, always have a robots.txt.

If you have spent time and money building a site, why not spend the extra ten seconds to add a robots.txt file, even if all it does is allow all!

willtell




msg:1526162
 9:21 pm on Jun 6, 2002 (gmt 0)

Ciml, could you elaborate on your comment about sending 403s? We are going to stop certain spam sites at the network card level.

DrOliver




msg:1526163
 7:57 am on Jun 7, 2002 (gmt 0)

You can even create a completely empty text file, call it robots.txt and make sure it's in the root (as www.yoursite.com/robots.txt). Spiders will treat an empty file as permission to spider the whole site. What's said above is right too, and will also invite spiders to spider your whole site. No more 404s.
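
A quick way to see that an empty file and the two-line "allow all" file behave the same is to feed both to a standards-following parser. This is only a sketch using Python's `urllib.robotparser`; the helper `allows_everything`, the agent name `ExampleBot`, and the sample paths are made up for the example, and other spiders may behave differently.

```python
from urllib.robotparser import RobotFileParser


def allows_everything(robots_txt, sample_paths):
    """Return True if every sample path may be fetched by a generic spider."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return all(parser.can_fetch("ExampleBot", path) for path in sample_paths)


# Hypothetical paths, used only to probe the rules.
paths = ["/", "/index.html", "/cgi-bin/form.pl"]

empty_file = ""
allow_all_file = "User-agent: *\nDisallow:\n"

print(allows_everything(empty_file, paths))
print(allows_everything(allow_all_file, paths))
```

Under this parser both calls report that nothing is blocked, which matches the advice in this thread.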

NotNervous




msg:1526164
 4:44 pm on Jun 7, 2002 (gmt 0)

We've never used robots.txt files on any of our sites and have never had a problem. Our pages that don't have incoming links are never listed anyway and we have no complaints about being overlooked during regular updates. Frankly, I can't see the use of these files, except maybe in very unusual circumstances.

ciml




msg:1526165
 1:57 pm on Jun 8, 2002 (gmt 0)

willtell, the 403 problem happens when the server default is set to use 403 forbidden for files that are not found, instead of 404 not found.

It isn't very common, but I have been seeing it quite a lot over the last few months. I don't know whether this is done by server admins trying to be more secure or if it's the default for some kind of Apache set-up.

The solution is easy, just upload a blank /robots.txt as DrOliver suggests.
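
To check which case a site falls into, one option is to look at the status code the server returns for /robots.txt. The sketch below is not Google's logic. The 401/403 and other-4xx branches copy the rule used by Python's own `urllib.robotparser`, which happens to line up with the behaviour described here; the remaining branches and both function names are assumptions.

```python
import urllib.error
import urllib.request


def crawl_policy_for_status(status):
    """Map the HTTP status of /robots.txt to a crawl decision.

    401 and 403 are read as "stay out"; any other 4xx is read as
    "no robots.txt, crawl freely" (the rule urllib.robotparser applies).
    """
    if status in (401, 403):
        return "disallow all"
    if 400 <= status < 500:
        return "allow all"
    if 200 <= status < 300:
        return "read the rules in the file"  # assumption: normal case
    return "undetermined"                    # assumption: 3xx/5xx not covered


def robots_status(site):
    """Fetch <site>/robots.txt and return the HTTP status code (needs network)."""
    try:
        with urllib.request.urlopen(site.rstrip("/") + "/robots.txt") as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code


print(crawl_policy_for_status(403))
print(crawl_policy_for_status(404))
# Against a live server, for example:
# print(crawl_policy_for_status(robots_status("http://www.yoursite.com")))
```

If the first function reports "disallow all" for your own site and you never meant to block anyone, that is the 403 misconfiguration; uploading even an empty /robots.txt turns the response into a 200.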

GoogleGuy




msg:1526166
 8:30 pm on Jun 9, 2002 (gmt 0)

What ciml said. Best not to have a 403 returned if someone tries to fetch your robots.txt file.

jady




msg:1526167
 1:26 am on Jun 10, 2002 (gmt 0)

Is it advisable to have your robots.txt block /java and /cgi-bin? This is what I do and I haven't noticed any problems.

Brett_Tabke




msg:1526168
 8:30 am on Jun 10, 2002 (gmt 0)

> I realize I don't have a robots.txt on my site. Is it important to have that file in the root directory?

As others have mentioned, no. A robots.txt is for the blocking of robots that obey the standard.

Some leave it intentionally missing so that requests for it return a 404 and show up in the error logs. That way, you can identify the spiders that obey the standard easily enough.
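
A minimal sketch of that log trick, assuming the server writes Apache-style "combined" access logs. The function name `robots_requesters`, the regular expression, and the two sample log lines are all made up for illustration; they are not taken from any real log.

```python
import re

# Matches the request, status and user-agent fields of a combined-format log line.
LOG_PATTERN = re.compile(
    r'"(?:GET|HEAD) (?P<path>\S+)[^"]*" (?P<status>\d{3}) \S+ "[^"]*" "(?P<agent>[^"]*)"'
)


def robots_requesters(log_lines):
    """Return the set of user agents that asked for /robots.txt."""
    agents = set()
    for line in log_lines:
        match = LOG_PATTERN.search(line)
        if match and match.group("path") == "/robots.txt":
            agents.add(match.group("agent"))
    return agents


# Invented sample lines, for demonstration only.
sample_log = [
    '192.0.2.1 - - [10/Jun/2002:08:30:00 +0000] "GET /robots.txt HTTP/1.0" 404 208 "-" "ExampleBot/1.0"',
    '192.0.2.2 - - [10/Jun/2002:08:31:00 +0000] "GET /index.html HTTP/1.0" 200 5120 "-" "OtherAgent/2.0"',
]

print(robots_requesters(sample_log))
```

An agent that shows up in this set at least asked for the file; whether it then respects the rules is a separate question the logs can only hint at.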

Although this is common and acceptable under the standard:

User-agent: *
Disallow:

I wouldn't recommend it. There are some spiders that will incorrectly interpret that as blocking all content.

Ya jady, block anything you think is sensitive. I think cgi-bin and java would qualify for that.

pageoneresults




msg:1526169
 8:40 am on Jun 10, 2002 (gmt 0)

User-agent: *
Disallow:

> I wouldn't recommend it. There are some spiders that will incorrectly interpret that as blocking all content.

Ouch! I guess I need to rewrite all my instructions on creating a robots text file. Brett, do you know which spiders misinterpret the above?
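
For reference, this is how a parser that follows the standard reads the two forms, which differ by a single slash. The sketch uses Python's `urllib.robotparser`; it says nothing about which spiders get it wrong. The helper `can_crawl` and the agent name `ExampleBot` are assumptions for the example.

```python
from urllib.robotparser import RobotFileParser


def can_crawl(robots_txt, path, agent="ExampleBot"):
    """Return True if the agent may fetch the path under the given rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, path)


allow_all = "User-agent: *\nDisallow:\n"    # empty value: nothing is blocked
block_all = "User-agent: *\nDisallow: /\n"  # "/" matches every path

print(can_crawl(allow_all, "/index.html"))
print(can_crawl(block_all, "/index.html"))
```

A spider that treats the empty value as if it were "/" would lock itself out of the whole site, which is presumably the misreading being warned about.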

rrl




msg:1526170
 11:48 am on Jun 10, 2002 (gmt 0)

Funny, I tried that robots.txt validator and it just told me every line of my site is invalid. I haven't found one of these validators to work yet.

jady




msg:1526171
 12:17 am on Jun 11, 2002 (gmt 0)

Thanks (yet again) Brett.. :)

For the other guys - I go with what people are posting in the forum. If you can avoid a 404 error page, you will be better off.


All trademarks and copyrights held by respective owners. Member comments are owned by the poster.
WebmasterWorld is a Developer Shed Community owned by Jim Boykin.
© Webmaster World 1996-2014 all rights reserved