Forum Moderators: goodroi

Message Too Old, No Replies

How to disallow spiders from indexing one of my subdomain's files?


peony

8:09 am on Apr 4, 2006 (gmt 0)

10+ Year Member



Where should I put the robots.txt — in the root directory, or the subdomain's root directory?
And how do I write robots.txt to stop search engines from indexing one of my subdomain's files?

Pfui

7:14 pm on Apr 4, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Your root directory.

In a nutshell...

Make a new file in a text editor, place the following two lines in it, and save it as "robots.txt". Just these two lines, just like this:

User-agent: *
Disallow: /

That tells robots heeding robots.txt: "Keep out."

Upload your "robots.txt" file as plain text and put it in your top level or root directory (/public_html or whatever).

I'm not quite sure what you mean by "stop search engine to index one of my subdomain files" but here are examples of completely disallowed directories:

User-agent: *
Disallow: /cgi-bin
Disallow: /includes
Disallow: /private

And here's an example of a disallowed file in a directory:

User-agent: *
Disallow: /messageboard/welcome.html
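One thing worth knowing for the subdomain part of your question: crawlers fetch robots.txt separately for each host, so a robots.txt at www.example.com won't cover sub.example.com. The subdomain needs its own robots.txt at its own document root. A sketch, with a hypothetical host and filename:

# Served as http://sub.example.com/robots.txt
User-agent: *
Disallow: /private-file.html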

One reminder: A LOT of 'reputable' search engines do NOT heed robots.txt, so if the file you don't want SEs to read is more private than not, robots.txt is not the best way to protect it from prying eyes.

peony

9:53 am on Apr 5, 2006 (gmt 0)

10+ Year Member



But what are good methods to stop that? My site has duplicate content.

One reminder: A LOT of 'reputable' search engines do NOT heed robots.txt so if the file you don't want SEs to read is more private than not, robots.txt is not the best way to protect it from prying eyes.

Pfui

7:44 pm on Apr 5, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I wish the answer to 'how to stop that' -- how to stop intrusive/aggressive bots, spiders, crawlers, whatevers -- was a simple one, and easy to implement. But repelling rogue robots can be both time-consuming and tedious... and best suited for the obsessive-compulsive amongst us. :)

That said, deflecting unwanted automatons initially depends on your server software (and what your ISP allows). For example, if your server runs Apache, there are things you can do (with .htaccess, with mod_rewrite, etc.) based on User-agent, Host name and/or IP address. Check Jim Morgan's superb help/how-tos in his Apache Web Server [webmasterworld.com] forum. The details can be tricky as heck, but extremely effective.
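To give you the flavor of the .htaccess approach: here's a minimal sketch for an Apache server with mod_rewrite enabled. "BadBot" and "ScraperBot" are placeholder User-agent strings -- substitute whatever agents actually show up in your access logs. [F] sends a 403 Forbidden.

# .htaccess -- deny two hypothetical bots by User-agent
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} ^BadBot [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^ScraperBot [NC]
RewriteRule .* - [F,L]

The Apache forum threads cover the trickier cases (matching by IP range, Host name lookups, and so on).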

(I run Apache so if you're on a Windows box, I reckon there's a forum around here for that, too.)

As far as duplicate content goes, Googlebot heeds robots.txt, as does msnbot. Their compliance won't completely protect you against duplicate-content problems, but it might help you decide where to place your robots.txt disallows. For specific Google info, see the numerous forums in The Google World [webmasterworld.com] category.
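For the duplicate-content case specifically, one common approach -- assuming the duplicate copies live under their own directory, here a hypothetical /print/ -- is to disallow just that copy and leave the originals crawlable:

User-agent: *
Disallow: /print/

The trailing slash keeps the rule scoped to the directory; without it you'd also block any file or directory whose name merely begins with "print".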