Similarly, if there are search engines or other automated processes that you don't want indexing your site (the Wayback Machine, for example), then robots.txt is useful.
Have a look at the robots.txt file for WebMasterWorld as an example: www.webmasterworld.com/robots.txt
Personally, I use it to preserve bandwidth and stop the search engines and other bots that I don't want indexing my site, since they're a waste of time and a waste of my money.
There are a few that ignore your instructions, and those go in the .htaccess ban list.
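(Just to illustrate the ban-list idea: a minimal .htaccess sketch, assuming Apache with mod_rewrite enabled; "BadBot" is a hypothetical placeholder for whatever user-agent string the offending spider sends.)

```apache
# Block a bot that ignores robots.txt by matching its User-Agent string.
# "BadBot" is a made-up name - substitute the real offender's string.
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} BadBot [NC]
RewriteRule .* - [F,L]
```

The [F] flag returns a 403 Forbidden, so the bot gets nothing but a short error response instead of your pages.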
In a plain-text file called "robots.txt [robotstxt.org]" in the root directory of your site - along with your "home page." The on-page robots tag is useful for people writing html pages to put on a server where they do not have access to robots.txt, and so can't change it.
> something that doesn't really matter?
One very good reason to put up a robots.txt file even if it is blank is to prevent hundreds of 404-Not Found errors in your server log, caused by robots trying to check your robots.txt file. This reduces error log file clutter, and makes it more useful for finding real problems.
There are two ways to control robots: robots.txt, and on-page html meta tags. The robots.txt method can be seen as a "server-wide" approach, and the on-page tags as a "piecemeal" approach.
The on-page method was invented later, to allow page authors who did not have access to the server configuration to control indexing of their pages. It is less efficient than using robots.txt, because the html page must be fetched in order to read the meta tags. Therefore, using robots.txt will save bandwidth. The meta tag method also only works for html pages, and not for images, scripts, css files, etc.: since the meta tag is an html tag, it can only be interpreted if it appears on an html page.
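For example, a single robots.txt rule can cover a whole directory full of mixed file types, something the meta tag cannot do (the /private/ path here is made up for illustration):

```
# robots.txt - one rule covers html, images, css, scripts, everything
User-agent: *
Disallow: /private/
```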
Unfortunately, as noted, some search engines have taken to listing pages if they find links to those pages anywhere on the Web, regardless of whether the page is Disallowed in robots.txt. In that case you must allow (in robots.txt) the spider to fetch the page, and place the <meta name="robots" content="noindex"> tag on each individual page.
Some spiders interpret a robots.txt Disallow as "Do not list this page in your index", while others treat it as meaning "Do not fetch this page." That is the root of the difference in behaviour.
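The well-behaved "do not fetch" interpretation can be sketched with Python's standard urllib.robotparser, which is what a compliant spider effectively does before requesting a page. The example.com URLs and the /private/ rule are made up for illustration:

```python
from urllib import robotparser

# Parse a hypothetical robots.txt that disallows /private/ for all agents.
rp = robotparser.RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /private/",
])

# A compliant spider checks can_fetch() before requesting each URL.
print(rp.can_fetch("*", "http://example.com/private/page.html"))  # False
print(rp.can_fetch("*", "http://example.com/index.html"))         # True
```

A spider that follows the "do not list" interpretation instead may still show the disallowed URL in results, which is why the on-page noindex tag exists.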
Google, Ask Jeeves/Teoma, and now Yahoo (Slurp) will list a page if they find a link to it anywhere. In the case of Google and AJ, the page will be listed as a URL only, with no title and no description. Yahoo Slurp's new behaviour is to list the page using whatever link text it found on the link to the page.
That, in theory, will allow all the spiders to index, right? Or is it better to keep it empty?
I use this method just because I like to validate all of my files (html, css, and this robots.txt), and when the robots.txt is empty it does not validate, so I put in those two lines. Does it hurt in some way?
Your code is entirely equivalent to an empty robots.txt file, so the choice is up to you. On small, simple sites, I personally use code like yours as a "stub" to indicate to others who may come later that I intentionally allowed robots to access all files.
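(For anyone else reading the thread: the two-line "stub" being discussed is presumably the standard allow-all form.)

```
User-agent: *
Disallow:
```

An empty Disallow value means "nothing is disallowed", which is why this behaves exactly the same as an empty robots.txt file.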