robots.txt config for multiple domains: guarding against duplicate content indexing
chiyo
msg:666049 - 5:46 pm on Jan 12, 2002 (gmt 0)

I have a question for which the answer may be blindingly obvious to the more experienced here... hopefully.

I really would appreciate help here, as I am quite concerned about duplicate content being indexed by SE spiders.

I have just transferred our domains to a new host. The directories are set up so that the first publicly viewed level holds no files, only a few administrative directories containing various programs, plus four directories that each host one of our four domains on its own unique IP, via the directory-pointing method.
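
Roughly, the structure looks like this (the directory names are just placeholders I'm using for illustration):

/            <- top level, the publicly viewed directory
  admin1/    <- administrative programs, not domain-pointed
  admin2/
  dir1/      <- document root for www.ourdomain1.com
  dir2/      <- document root for www.ourdomain2.com
  dir3/      <- document root for www.ourdomain3.com
  dir4/      <- document root for www.ourdomain4.com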

Normally I would put a robots.txt in each of the domain directories. So far so good.

However, the top-level directory is also accessible by a URL such as [www27.ispprovidername.com...], and that URL, if extended, would also reach our actual registered domains: [www27.ispprovidername.com...] [www27.ispprovidername.com...] through to dir4/.

Now I want the search engine spiders to crawl all of our domain directories (the second level down), but I don't want them to crawl the top-level directory or the non-domain-pointed admin directories. My concern is that they will index duplicate content: once via [www27.ispprovidername.com...] through to dir4 and once via [ourdomain.com...], resulting in two listings of exactly the same content.

However, if I put a robots.txt in there disallowing access to that directory itself and to the admin directories, will it prohibit spiders from visiting the four domain-pointed directories deeper down, e.g. [www27.ispprovidername.com...] [www27.ispprovidername.com...] etc.?

Sorry if this is garbled, but I've tried to make it as clear as I can.

 

Air
msg:666050 - 6:25 pm on Jan 12, 2002 (gmt 0)

The way to test it is to create your robots.txt file and place it in the root of your top level, then navigate to each of your domains, specifying robots.txt as the page you want, i.e.:

www.ourdomain1.com/robots.txt
www.ourdomain2.com/robots.txt
www.ourdomain3.com/robots.txt
www.ourdomain4.com/robots.txt

As long as you get a 404 Page Not Found for each of those, you'll know that the search engines will not be able to find a robots.txt file for any of them either.
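
If you'd rather script the check than click through a browser, a quick sketch along these lines does the same thing (the hostnames are the placeholder ones from the list above; swap in your real domains):

# Fetch /robots.txt from each hostname and report the HTTP status.
# 404 means spiders will find no robots.txt for that hostname.
import urllib.error
import urllib.request

DOMAINS = [
    "www.ourdomain1.com",
    "www.ourdomain2.com",
    "www.ourdomain3.com",
    "www.ourdomain4.com",
]

for host in DOMAINS:
    url = "http://%s/robots.txt" % host
    try:
        with urllib.request.urlopen(url) as resp:
            # 200 here means a robots.txt *was* served for this hostname
            print("%s -> %d" % (url, resp.getcode()))
    except urllib.error.HTTPError as err:
        print("%s -> %d" % (url, err.code))
    except urllib.error.URLError as err:
        # hostname didn't resolve or connection failed
        print("%s -> %s" % (url, err.reason))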

chiyo
msg:666051 - 6:54 pm on Jan 12, 2002 (gmt 0)

Thanks Air,

So I created a simple robots.txt for the top-level domain:

User-agent: *
Disallow: /

Then I created a robots.txt for each of the virtual domains at the second level, pointed a browser at each of them, and got a printout of each domain's own robots.txt.
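
Each of those second-level files is just the permissive form (if I have the syntax right, an empty Disallow means "allow everything"):

User-agent: *
Disallow: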

I guess this means that spiders will still index our virtual domains even if they find our robots.txt for the top-level domain www.hostprovider/ourusername/ ?

I guess it comes down to the definition of "root": for virtual domains such as those on a shared server, I'm assuming that "root" is the top of whatever directory the domain is pointed at and resolves to.


Air
msg:666052 - 9:32 pm on Jan 12, 2002 (gmt 0)

>I guess this means that spiders will still index our virtual domains even if they
>find our robots.txt for the top level domain www.hostprovider/ourusername/ ?

Exactly, so anything that would have to be retrieved through [www27.ispprovidername.com...] will be disallowed, but your other domains each have their own root as far as a browser or search engine is concerned.
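
So, schematically (hostnames truncated the same way as above), each hostname serves its own file:

www27.ispprovidername.com/.../robots.txt  ->  the top-level Disallow: / file (blocks that hostname only)
www.ourdomain1.com/robots.txt             ->  dir1/robots.txt (governs only that domain, and likewise for the other three)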
