
robots.txt config for multiple domains.

guarding against duplicate content indexing

   
chiyo

5:46 pm on Jan 12, 2002 (gmt 0)

WebmasterWorld Senior Member chiyo is a WebmasterWorld Top Contributor of All Time 10+ Year Member

I have a question whose answer may be blindingly obvious to the more experienced here... hopefully.

I really would appreciate help here, as I am quite concerned about duplicate content being indexed by SE spiders.

I have just transferred our domains to a new host. The directories are set up so that the first publicly visible level holds no files, only a few directories containing various administrative programs, plus four directories which host each of our four domains on different unique IPs, via the directory-pointing method.

Normally I would put a robots.txt in each of the domain directories. So far so good.
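
To make that concrete, what I have in mind for each of dir1/ through dir4/ is a plain allow-everything file along these lines (the dirN names stand in for our real directories; an empty Disallow means nothing is off limits):

User-agent: *
Disallow: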

However, the top level directory is also accessible by a URL such as [www27.ispprovidername.com...] And that URL, if extended, would also reach our actual registered domains - [www27.ispprovidername.com...] [www27.ispprovidername.com...] through to dir4/.

Now I want the search engine spiders to crawl all of our domain directories (the second level down), but I don't want them to crawl the top level directory or the non-domain-pointed admin directories. My concern is that they will index duplicate content, once from [www27.ispprovidername.com...] through dir4 and once from [ourdomain.com...] resulting in two listings of exactly the same content.

However, if I put a robots.txt in there disallowing access to that directory itself and to the admin directories, will it prohibit spiders from visiting the four domain-pointed directories deeper down, e.g. [www27.ispprovidername.com...] [www27.ispprovidername.com...] etc.?

Sorry if this is garbled, but I've tried to make it as clear as I can.

Air

6:25 pm on Jan 12, 2002 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member

The way to test it is to create your robots.txt file and place it in the root of your top level, then navigate to each of your domains with robots.txt as the requested page, i.e.

www.ourdomain1.com/robots.txt
www.ourdomain2.com/robots.txt
www.ourdomain3.com/robots.txt
www.ourdomain4.com/robots.txt

As long as you get a 404 Page Not Found for each of those, you'll know that the search engines will not be able to find a robots.txt file for any of them either.
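
That is, for each of those requests you would hope to see something like the following (responses shown purely for illustration):

www.ourdomain1.com/robots.txt        -> 404 Not Found (no robots.txt at that domain's root)
www.ourdomain2.com/robots.txt        -> 404 Not Found
www.ourdomain3.com/robots.txt        -> 404 Not Found
www.ourdomain4.com/robots.txt        -> 404 Not Found
www27.ispprovidername.com/robots.txt -> 200 OK (the file you placed in your top level)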

chiyo

6:54 pm on Jan 12, 2002 (gmt 0)

WebmasterWorld Senior Member chiyo is a WebmasterWorld Top Contributor of All Time 10+ Year Member

Thanks Air,

So I created a simple robots.txt for the top level:

User-agent: *
Disallow: /

Then I created a robots.txt for each of the virtual domains at the second level, pointed a browser at each of them, and got each domain's own robots.txt back.
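
So the picture looks roughly like this (host names abbreviated the same way as in my first post):

www.ourdomain1.com/robots.txt        -> the file I put in dir1/
www.ourdomain2.com/robots.txt        -> the file I put in dir2/ (likewise for dir3/ and dir4/)
www27.ispprovidername.com/robots.txt -> the Disallow: / file above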

I guess this means that spiders will still index our virtual domains even if they find our robots.txt for the top level, www.hostprovider/ourusername/ ?

I guess it comes down to the definition of "root" - for virtual domains such as these on a shared server, I'm assuming that the "root" is the top level of whatever directory the domain is pointed at and resolves to.

Air

9:32 pm on Jan 12, 2002 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member

>I guess this means that spiders will still index our virtual domains even if they
>find our robots.txt for the top level, www.hostprovider/ourusername/ ?

Exactly, so anything that would have to be retrieved through [www27.ispprovidername.com...] will be disallowed, but your other domains each have their own root as far as a browser or search engine is concerned.
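
Put another way, a spider asks each host name for exactly one file, /robots.txt at that host's root, and one file never speaks for another host (host names abbreviated as above):

www27.ispprovidername.com/robots.txt -> applies only to URLs requested via www27.ispprovidername.com
www.ourdomain1.com/robots.txt        -> applies only to URLs requested via www.ourdomain1.com (and likewise for the other three)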

 
