importance of a robots.txt file

Forum Moderators: goodroi

Message Too Old, No Replies

importance of a robots.txt file

hughie

12:34 pm on May 13, 2004 (gmt 0)

I have never placed a robots file in any of my websites yet most get listed and some do very well. Is it a help/hinderance to search engines or just something that doesn't really matter.

ta,
hughie

trillianjedi

12:46 pm on May 13, 2004 (gmt 0)

It's there for you to tell the spiders which pages of your site you *don't* want indexed. If you have pages you don't want indexed, use it to stop that happening. That saves bandwidth. Lack of a robots.txt file will mean the spiders will try and index everything.

Similarly, if there are search engines or other automated processes that you don't want indexing your site (like the way back machine for example) then robots.txt is useful.

Have a look at the robots.txt file for WebMasterWorld as an example : www.webmasterworld.com/robots.txt

Personally, I use it to preserve bandwidth and stop the search engines and other bots that I don't want indexing my site as they're a waste of time and waste of my money.

There are a few that ignore your instructions, and they go in the htaccess ban list.

hughie

1:53 pm on May 13, 2004 (gmt 0)

Cheers, very useful info indeed,

Hughie

Daily Sparring

1:51 pm on May 20, 2004 (gmt 0)

Where do you place that information? In an alternate file or within the head tags of each page?

jdMorgan

2:18 pm on May 20, 2004 (gmt 0)

> Where do you place that information? In an alternate file or within the head tags of each page?

In a plain-text file called "robots.txt [robotstxt.org]" in the root directory of your site - along with your "home page." The on-page robots tag is useful for people writing html pages to put on a server where they do not have access to robots.txt, and so can't change it.

> something that doesn't really matter?

One very good reason to put up a robots.txt file even if it is blank is to prevent hundreds of 404-Not Found errors in your server log, caused by robots trying to check your robots.txt file. This reduces error log file clutter, and makes it more useful for finding real problems.

Jim

Daily Sparring

3:47 pm on May 20, 2004 (gmt 0)

Thanks JD, one other question then, do you link to that text file from within the head tags? I am assuming so otherwise, how would the spider no to look for the robot file first? Again, sorry for my lack of knowledge here.

sem4u

3:49 pm on May 20, 2004 (gmt 0)

Just put the robots.txt file in your root directory. There is no need to link to it anywhere. It is usually the first file that spiders read when accessing your site.

jdMorgan

4:05 pm on May 20, 2004 (gmt 0)

Check out the link I cited above. All robots assume that you will have a plain-text file named robots.txt in your document root directory, and look for it there. If it is not found, you'll see a 404-Not Found error in your logs, and the robots will assume that they are allowed to index your entire site.

There are two ways to control robots: robots.txt, and on-page html meta tags. The robots.txt method can be seen as a "server-wide" approach, and the on-page tags as a "piecemeal" approach.

The on-page method was invented later, to allow page authors who did not have access to the server configuration to control indexing of their pages. It is less effiecient than using robots.txt, because the html page must be fetched in order to read the meta-tags. Therefore, using robots.txt will save bandwidth. Also, the meta tag method can only work for html pages, and not for images, scripts, css files, etc. Since the meta tag is an html tag, it can only be interpreted if it appears on an html page.

Unfortunately, as noted, some search engines have taken to listing pages if they find links to those pages anywhere on the Web, regardless of whether the page is Disallowed in robots.txt. In that case you must allow (in robots.txt) the spider to fetch the page, and place the <meta name="robots" content="noindex"> tag on each individual page.

Some spiders interpret a robots.txt Disallow as "Do no list this page in your index", while others treat it as meaning "Do not fetch this page." That is the root of the difference in behaviour.

Google, Ask Jeeves/Teoma, and now Yahoo (Slurp) will list a page if they find a link to it anywhere. In the case of Google and AJ, the page will be listed as a URL only, with no title and no description. Yahoo Slurp's new behaviour is to list the page using whatever link text it found on the link to the page.

Jim

caspita

4:21 pm on May 20, 2004 (gmt 0)

What if the robots.txt file contains:

User-agent: *
Disallow:

That in theory will allow all the spiders to index right? or is it better to keep it empty?

I use this method just because I like to validate all of my files .. html, css and this robots.txt and when it is empty it does not validate, so I put those two lines.. does it hurt in some way?

Carlos.

jdMorgan

4:31 pm on May 20, 2004 (gmt 0)

caspita,

Your code is entirely equivalent to an empty robots.txt file, so the choice is up to you. On small, simple sites, I personally use code like yours as a "stub" to indicate to others who may come later that I intentionally allowed robots to access all files.

Jim

caspita

12:27 am on May 21, 2004 (gmt 0)

Thanks Jim,

So .. looks like it is also a good idea at the begining when your are trying to catch up the spiders .. telling them ..'hey .. here a new site .. you are welcome to spider me :-) '

Thanks again...
Carlos.