Forum Moderators: phranque
I really would appreciate help here, as I am quite concerned about duplicate content being indexed by SE spiders.
I have just transferred our domains to a new host. The directories are set up so that the first publicly viewable level contains no files, only a few directories holding various administrative programs, plus four directories that host each of our four domains on different unique IPs via the directory-pointing method.
Normally I would put a robots.txt in each of the domain directories. So far so good.
However, the top-level directory is also accessible by a URL such as [www27.ispprovidername.com...] and that URL, if extended, would also point to our actual registered domains: [www27.ispprovidername.com...] [www27.ispprovidername.com...] through dir4/.
Now I want the search engine spiders to crawl all of our domain directories (the second level down), but I don't want them to crawl the top-level directory or the non-domain-pointed admin directories. My concern is that they will index duplicate content: once from [www27.ispprovidername.com...] through dir4 and once from [ourdomain.com...], resulting in two listings of the exact same content.
However, if I put a robots.txt up there disallowing access to that directory itself and the admin directories, will it prohibit spiders from visiting the four domain-pointed directories deeper down, e.g. [www27.ispprovidername.com...] [www27.ispprovidername.com...] etc.?
Sorry if this is garbled, but I've tried to make it as clear as I can.
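To make the question concrete, the kind of top-level robots.txt being described might look like this (a sketch only; a blanket disallow that would cover the admin directories too, since the actual directory names aren't given here):

```
# robots.txt placed at the top level, i.e. www27.ispprovidername.com/robots.txt
# Blocks all spiders from everything reached via this hostname.
# The question is whether this also blocks the four domain-pointed
# directories when they are reached through their own domain names.
User-agent: *
Disallow: /
```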
As long as you get a 404 Not Found for each of those, you'll know that the search engines won't be able to find a robots.txt file for them either.
So I created a simple robots.txt for the top-level domain, then created a robots.txt for each of the virtual domains at the second level. I used a browser to request each one and got back each domain's own robots.txt.
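For reference, a wide-open robots.txt of the kind that would go in each domain directory can be as simple as this (a sketch; adjust per domain as needed):

```
# robots.txt at each domain's own root, e.g. ourdomain.com/robots.txt
# An empty Disallow line permits compliant spiders to crawl everything.
User-agent: *
Disallow:
```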
I guess this means that spiders will still index our virtual domains even if they find our robots.txt for the top-level domain www.hostprovider/ourusername/?
I guess it comes down to the definition of "root": for virtual domains such as those on a shared server, I'm assuming that "root" is the top level of whatever directory the domain points to and resolves from.
Exactly. Anything that would have to be retrieved through [www27.ispprovidername.com...] will be disallowed, but your other domains each have their own root as far as a browser or search engine is concerned.
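A quick way to sanity-check that reading is Python's standard-library robots.txt parser. The rules and URLs below are hypothetical stand-ins for the setup discussed above: a blanket disallow at the ISP hostname, and a wide-open robots.txt at one pointed domain's own root.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt at the ISP hostname's root: block everything
isp_rules = """User-agent: *
Disallow: /
"""

# Hypothetical robots.txt at one pointed domain's own root: allow everything
domain_rules = """User-agent: *
Disallow:
"""

isp = RobotFileParser()
isp.parse(isp_rules.splitlines())

domain = RobotFileParser()
domain.parse(domain_rules.splitlines())

# A page reached via the ISP hostname is blocked by the top-level rules...
print(isp.can_fetch("*", "http://www27.ispprovidername.com/dir1/page.html"))  # False

# ...but the same content reached via the domain's own hostname is governed
# by that domain's own robots.txt, which allows it.
print(domain.can_fetch("*", "http://ourdomain.com/page.html"))  # True
```

Each hostname's robots.txt is evaluated independently, which is why disallowing everything on the ISP URL doesn't stop the pointed domains from being crawled under their own names.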