Forum Moderators: goodroi
For instance, my real homepage is located here on the server: /www/index.asp
And my development file is here: /dev/index.asp
I'm wondering if each directory needs its own robots.txt file? Or is it as simple as this:
# Google
User-agent: googlebot
Disallow: /dev/
Thanks for your help in advance...
Joe Bray
User-agent: *
Disallow: /dev/
this page has the robots.txt standard [robotstxt.org].
I have it inside the production website directory:
/www/robots.txt
But if I'm understanding you correctly, it should be located here instead:
/robots.txt
Does that sound right? Is there any way to test this sort of thing?
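One way to sanity-check a robots.txt locally is Python's standard-library robots.txt parser. This is just a sketch; the hostname below is a placeholder, and it only checks rule matching, not what Google has actually cached:

```python
from urllib.robotparser import RobotFileParser

# The rules from the root-level /robots.txt being tested
rules = [
    "User-agent: *",
    "Disallow: /dev/",
]

rp = RobotFileParser()
rp.parse(rules)

# /dev/ paths are blocked for all bots; everything else is allowed
print(rp.can_fetch("*", "http://www.example.com/dev/index.asp"))  # False
print(rp.can_fetch("*", "http://www.example.com/index.asp"))      # True
```

Note that crawlers only ever request /robots.txt at the root of the host, which is why the copy at /www/robots.txt mapped to the web root is the one that matters.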
I put the modified robots.txt file at the root level: /robots.txt
And I also left the old one where it was: /www/robots.txt
I'll check back tomorrow in Google Webmaster Tools and see which robots.txt Google has cached for the website. Hopefully it will be the modified one, so I can delete the other.
Joe
So, what I need to do is create a second robots.txt and place it into the other directory - the root of the development page.
Thanks for helping me work through this...
Joe
therefore use the following files...
/www/robots.txt:
User-agent: *
Disallow:
/dev/robots.txt:
User-agent: *
Disallow: /
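Assuming /www/ and /dev/ are each served as the root of their own host (the www.example.com and dev.example.com names below are placeholders), a quick check with Python's standard-library parser confirms the intended effect of those two files:

```python
from urllib.robotparser import RobotFileParser

# /www/robots.txt -- an empty Disallow allows everything
www = RobotFileParser()
www.parse(["User-agent: *", "Disallow:"])

# /dev/robots.txt -- "Disallow: /" blocks the entire site
dev = RobotFileParser()
dev.parse(["User-agent: *", "Disallow: /"])

print(www.can_fetch("Googlebot", "http://www.example.com/index.asp"))  # True
print(dev.can_fetch("Googlebot", "http://dev.example.com/index.asp"))  # False
```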
you can use the robots.txt tool in Google Webmaster Tools to verify which URLs are allowed and disallowed for Googlebot.
you can tweak the rules from the cached version in the form, then update the file on your site with the final version.
not sure how often they refresh the cache with a new file...