| 2:40 pm on Nov 18, 2002 (gmt 0)|
I'd like to get this syntax exactly right. I want to stop all search engine spiders visiting ONE directory on my website.
What's the syntax for this, please? Also, where do I put the robots.txt file?
| 2:45 pm on Nov 18, 2002 (gmt 0)|
Here is a great tutorial on robots.txt
This bars all robots from the cgi-bin folder.
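For the record, a robots.txt that bars all compliant robots from /cgi-bin/ looks like this:

```txt
# Applies to every robot that honors robots.txt
User-agent: *
# Block the entire /cgi-bin/ directory
Disallow: /cgi-bin/
```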
Place it on the site's root.
You can also validate it at [searchengineworld.com...]
| 2:50 pm on Nov 18, 2002 (gmt 0)|
Thanks. Is this an acceptable alternative to an .htaccess block?
| 3:06 pm on Nov 18, 2002 (gmt 0)|
If I want spiders to search all my files and directories, is there any benefit to having a robots.txt file? And if so, what is the format, as all the examples I've seen only show how to disallow.
| 3:09 pm on Nov 18, 2002 (gmt 0)|
Andrew_Thomas, it will stop the 404 errors logged when spiders request a robots.txt that isn't there. Also, GoogleGuy has said that a custom 404 page combined with no robots.txt file can confuse the spider and cause your site to not be spidered correctly.
syntax to allow all:
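The allow-all form is simply an empty Disallow, applied to all user agents:

```txt
# Allow every robot to crawl everything
User-agent: *
# An empty Disallow value means nothing is blocked
Disallow:
```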
jamesf4218, sorry, I don't know about .htaccess. I am an NT developer.
| 4:01 pm on Nov 18, 2002 (gmt 0)|
|I want to stop all search engine spiders visiting ONE directory |
|Is [using robots.txt] an acceptable alternative to an .htaccess [block]? |
Yes and no... A valid robots.txt will stop all spiders which request and obey robots.txt from requesting files from your Disallowed directories. However, there are two problems: Some "bad-bots" don't request and/or don't obey robots.txt
Some search engines will list a link to a disallowed file if they find such a link
In the former case, you may need to block these bad-bots using .htaccess, browscap.ini, a firewall, or whatever is available to you.
In the latter case, even though the SE spider does not request the Disallowed file, it may still find the URL in links on your site, links on other sites, or even in a server log or collection of user bookmarks unintentionally left on-line somewhere. It will therefore list the URL without a title or description, but sometimes with the link text found on the page that links to it.
I have previously argued that this flies in the face of the intent of the robots exclusion standard, if not its literal wording. However, it depends on whether you define the word "index" to mean, "fetch a page and parse it" or "include it in our index". Some SE's won't mention a file that's disallowed with robots.txt, but some will - So I've learned, "that's life, get over it, find a work-around, and move on..."
About the only thing I know of to stop these search engines from listing the URL of a private page (without cloaking) is to link to your "private" pages only through another "linking page" that meets the following criteria:
- The linking page must be Allowed in robots.txt
- The linking page must contain a <meta name="robots" content="noindex,nofollow">
- All accesses to the "private" files must go through this page.
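A minimal sketch of such a linking page (the filenames here are hypothetical):

```html
<html>
<head>
  <title>Links</title>
  <!-- Tell compliant spiders: don't index this page, don't follow its links -->
  <meta name="robots" content="noindex,nofollow">
</head>
<body>
  <!-- /private/report.html stands in for a file Disallowed in robots.txt -->
  <a href="/private/report.html">Private report</a>
</body>
</html>
```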
Then you have the problem of 'bots which don't read and respect robots.txt. These need to be blocked. I use a combination of .htaccess and an automatic bad-bot banning script [webmasterworld.com] that was posted here on WebmasterWorld by Key_Master. I have tweaked it to handle multiple simultaneous requests (by adding file-locking to it) and I'm "evaluating" it now to make sure the tweaks didn't break it. On low-to-moderate-traffic sites, it should work just fine as originally posted.
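For the .htaccess side of this, one common Apache pattern (requires mod_setenvif; the User-Agent string "EvilScraper" is only a placeholder) looks like:

```apache
# Flag any request whose User-Agent matches a known bad-bot
SetEnvIfNoCase User-Agent "EvilScraper" bad_bot
# Deny flagged requests, allow everyone else
Order Allow,Deny
Allow from all
Deny from env=bad_bot
```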
| 9:48 am on Nov 25, 2002 (gmt 0)|
Hi... I'm just setting up a robots.txt file and have followed the tutorial and validated it so far.
However, a (probably dumb) question:
What about those _vti_* directories and files? I don't use them, but I need them to stay on the server for other users that may follow.
Am I right in thinking that they're private anyway, or should I include them in the Disallow?
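If you do decide to Disallow them, the entries follow the same pattern as any other directory (assuming the standard FrontPage server-extension directory names):

```txt
User-agent: *
# FrontPage server-extension directories
Disallow: /_vti_bin/
Disallow: /_vti_cnf/
Disallow: /_vti_log/
Disallow: /_vti_pvt/
```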