We are thinking of setting up several subdomains like this:
http://keyword.mysite.com
But we would also like to maintain the same file in the root directory of our site:
http://www.mysite.com/keyword/index.html
Some internal links would point to the subdomain and others to the folder
Would this be considered spam, or is it all right?
If it is spam, is there any way to avoid it?
A subdomain such as [keyword.mysite.com...] is considered a separate and unique domain by Google and the other SEs. Exposing the same content by also letting Google index the subdirectory will look to Google like a separate page with duplicate content.
Pick one, and let Google index that.
More content is being added to all 3 sub-domains and I want to avoid duplicate content penalties from Google.
I'm wondering how best to do this. Can robots.txt be used to exclude sub-domains?
Thanks
It's a robots.txt validator tool, and underneath it on the page you'll find all you need to know about robots.txt and how to use it, in easily comprehensible form.
It's great - check it out!
Thanks for the prompt replies. Had looked over the robots.txt info and understand it can be used to exclude spiders from a site in its entirety or just specified directories or files. What I'm unclear on is how to use it to exclude spiders from sub-domains such as:
www.abc.mydomain.com but allow access to www.xyz.mydomain.com and www.mydomain.com
If I want to keep all agents out of www.abc.mydomain.com but allow access to www.xyz.mydomain.com & www.mydomain.com, how do I set it up please? Sorry - this reveals my ignorance about sub-domains - do they share the same root? If I place the following robots.txt in the root, I presume it would keep spiders out of all the sub-domains.
User-agent: *
Disallow: /
Thanks
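On the "same root" question: crawlers look up robots.txt separately for each hostname, so a file served at www.mydomain.com/robots.txt is generally not consulted for www.abc.mydomain.com. A minimal Python sketch of where a crawler would look for each of the hosts in the post (the helper name robots_url and the page paths are made up for illustration):

```python
from urllib.parse import urlsplit


def robots_url(page_url):
    """Return the robots.txt URL a crawler would consult for page_url.

    Hypothetical helper: robots.txt is looked up per scheme and host,
    always at the path /robots.txt on that host.
    """
    parts = urlsplit(page_url)
    return f"{parts.scheme}://{parts.netloc}/robots.txt"


# Hostnames come from the post; the page paths are illustrative only.
for page in (
    "http://www.abc.mydomain.com/products/index.html",
    "http://www.xyz.mydomain.com/products/index.html",
    "http://www.mydomain.com/index.html",
):
    print(page, "->", robots_url(page))
```

So to keep all agents out of www.abc.mydomain.com only, the "Disallow: /" file would have to be the one served at http://www.abc.mydomain.com/robots.txt, with the other two hosts serving a permissive file (or none). Whether each host can serve a different file depends on how the web host maps sub-domains to directories.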
It really was a pretty funny result set though. The company has the top 10 results for a keyword phrase and 18 of the first 20. Each result goes to the same page but the url is a different 3rd level domain.
Moreover, it looks like this company is just an affiliate and not even a true site on its own.
This may annoy some people, but I have to give this guy (or girl) a little respect. He seems to know how to beat Google right now. How many of us (if we wanted to) could create a set of sites that hold 18 of the top 20 spots in a fairly competitive set of keywords? Sure, Google will probably catch on pretty soon, but as an affiliate, does this person even care? Probably not. Just find another high-paying affiliate system and try to do the same thing before Google changes its algo.
Not defending, just playing a little devil's advocate. For those of us with real sites, people like this are just plain annoying, and they cost us customers.
What is the purpose of this setup? And would a better solution perhaps be to do independent sites?
Heini - I agree it would make more sense to have 3 separate sites, with 1 for each country. However, the client has used sub-domains and that's what I have to work with. I think their reasoning for going this way is probably that the site uses Vignette, which has horrific license fees.
The content and structure of each sub-domain is very similar - basically they've created 1 site then replicated it for each different country with applicable minor content differences. They have legit business reasons for this approach but have inadvertently created mirror sites.
I'm concerned about the site being hit with duplicate content penalties, which is why I want to instruct spiders to index only 1 sub-domain. As noted above, Google has already chosen to index only one of the sub-domains - abc.mydomain.com. PR is just 1, so I'm assuming a penalty has been applied. FAST has not indexed any of the sub-domains.
Given this, any suggestions as to how I use robots.txt to exclude certain sub-domains please?
Thanks for your help - greatly appreciated.
I have implemented subdomains using one webhosting plan under a single unique IP. I mention this because it is possible to set your subdomains up with individual IP addresses - I did not do that.
I use mod_rewrite in my .htaccess file to implement my subdomains. The actual content of my subdomains resides in subdirectories. I use the robots file to block Googlebot from the subdirectories. That has kept Googlebot from spidering the subdomain content while it is in development.
My subdomains look like this:
[this is what the user sees]
[word1.mydomain.com...]
[word2.mydomain.com...]
The contents of the subdomains reside on my webhost as follows:
[users never see this]
[mydomain.com...]
[mydomain.com...]
My .htaccess file implements the subdomain using mod_rewrite:
RewriteEngine On
Options +FollowSymlinks
RewriteBase /
# Rewrite Rule for word1.mydomain.com
RewriteCond %{HTTP_HOST} word1.mydomain.com$
RewriteCond %{REQUEST_URI} !/word1/
RewriteRule ^(.*)$ /word1/$1
And I block googlebot as follows:
User-agent: Googlebot
Disallow: /word1/*
Disallow: /word2/*
Bottom line is that if you block a bot from the specific subdirectory containing the content, then it will not go there.
-egomaniac
And I block googlebot as follows:
User-agent: Googlebot
Disallow: /word1/*
Disallow: /word2/*
Not sure why you've included an asterisk. Doesn't the following exclude all content from a directory?
Disallow: /word1/
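On the asterisk: in the original robots.txt convention a Disallow value is matched as a path prefix, so the trailing-slash form already covers everything beneath the directory. A quick check using Python's standard-library urllib.robotparser (the URLs are illustrative, and this says nothing about how any particular search engine treats a trailing *):

```python
from urllib.robotparser import RobotFileParser

# The rules from the post, minus the trailing asterisks.
rules = """\
User-agent: Googlebot
Disallow: /word1/
Disallow: /word2/
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# Illustrative URLs: two inside a blocked directory, one outside.
for url in (
    "http://mydomain.com/word1/index.html",
    "http://mydomain.com/word1/deep/page.html",
    "http://mydomain.com/index.html",
):
    print(url, parser.can_fetch("Googlebot", url))
```

Both /word1/ URLs come back as not fetchable and the root page as fetchable, which is the prefix behaviour the question assumes.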
By the way, I ran your robots.txt file thru the validator at [searchengineworld.com] and it reports an error on line 72.
Cheers
It is an internal URL rewrite, as opposed to an external 301- or 302-type redirect, so a 'bot is not aware of the subdirectory redirection. Therefore, I think this would work properly only if each subdomain-subdirectory had its own robots.txt - Which I think is what biggles wants.
The 'bots don't care about your directory structure, they only care about URLs. So, having a number of robots.txt files, each in a subdirectory representing a subdomain, is not a problem.
Or did ya lose me somewhere?
Jim
The * is a wildcard for everything in that subdirectory. I read somewhere that the * was proper syntax for the robots file. Leaving the * out may work, but I haven't tried it. I do know that what I have has worked for at least 6 mos or more since I first set it up.
PS - Thanks for the error tip on my robots file (it's fixed now).
Hi jdMorgan,
If you followed my rewrite rules, then you understand it better than I do. I am a total hack when it comes to my webserver. I got this code from some support docs my webhost gave me. I think they took it from the Apache user guide.
Only one robots file is needed. I can't debate with you why. I just know that my setup works on my webhost. I think that each webhost can be different on how subdomains are implemented, which is one of the obstacles to using them.
-egomaniac
Not sure that it does work - I just used [wannabrowser.com...] to mimic Googlebot, and I'm afraid to say it is able to access the sub-domains/sub-directories you've tried to block it from. :(
Jim
Clearly from other postings you know a lot about .htaccess mod_rewrite. Would you be kind enough to spell out how you'd tackle this, please?
Thanks
With this in .htaccess in the top-level directory of the hosting account:
RewriteEngine On
Options +FollowSymlinks
RewriteBase /
# Rewrite Rule for word1.mydomain.com
RewriteCond %{HTTP_HOST} word1.mydomain.com$
RewriteCond %{REQUEST_URI} !/word1/
RewriteRule ^(.*)$ /word1/$1
Any robot that fetches [word1.mydomain.com...] that does not have /word1/ as part of the file path (URI, not URL) will get internally redirected to word1.mydomain.com/word1/(anything). So, a request for [word1.mydomain.com...] ends up at [word1.mydomain.com...]
Therefore, each subdirectory which "represents" a subdomain must have its own robots.txt
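To make that reasoning concrete, here is a rough Python model of the two RewriteCond lines and the RewriteRule. It is a simplified sketch, not Apache itself: the function name rewrite and the sample requests are invented for illustration, and only the host and path tests from the rules are mimicked.

```python
import re


def rewrite(host, uri, subdomain="word1"):
    """Mimic the RewriteCond/RewriteRule trio for one subdomain.

    Simplified model for illustration only, not Apache's real engine.
    """
    # RewriteCond %{HTTP_HOST} word1.mydomain.com$
    host_matches = re.search(rf"{re.escape(subdomain)}\.mydomain\.com$", host)
    # RewriteCond %{REQUEST_URI} !/word1/
    already_inside = re.search(rf"/{re.escape(subdomain)}/", uri)
    if host_matches and not already_inside:
        # RewriteRule ^(.*)$ /word1/$1
        return f"/{subdomain}/{uri.lstrip('/')}"
    return uri


# Sample requests (illustrative): note what happens to robots.txt.
print(rewrite("word1.mydomain.com", "/robots.txt"))
print(rewrite("word1.mydomain.com", "/index.html"))
print(rewrite("www.mydomain.com", "/robots.txt"))
```

Under this model a request for robots.txt on the subdomain is served from inside the /word1/ subdirectory, while the same request on the main host is left alone, which is why a separate robots.txt per subdirectory is being suggested.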
You'll also need to be careful to use only absolute URLs of the form <a href="http://word1.mydomain.com/(anything)">yadda yadda</a> in the on-page links of pages which are not disallowed in robots.txt, or you will expose the fact that a redirect is being used.
Jim
Final question - if you're excluding several sub-domains would the htaccess read like this?
RewriteEngine On
Options +FollowSymlinks
RewriteBase /
# Rewrite Rule for word1.mydomain.com
RewriteCond %{HTTP_HOST} word1.mydomain.com$
RewriteCond %{REQUEST_URI} !/word1/
RewriteRule ^(.*)$ /word1/$1
# Rewrite Rule for word2.mydomain.com
RewriteCond %{HTTP_HOST} word2.mydomain.com$
RewriteCond %{REQUEST_URI} !/word2/
RewriteRule ^(.*)$ /word2/$1
Just watch out folks - the webmaster world software eats the spaces preceding exclamation points, and you MUST have one in those RewriteCond directives! I've been playing with using the [pre] and [code] bbtags, but even if you use them, you still have to put TWO spaces in, and then it eats one of them.
# Rewrite Rule for word1.mydomain.com
RewriteCond %{HTTP_HOST} word1.mydomain.com$
RewriteCond %{REQUEST_URI} !/word1/
RewriteRule (.*) /word1/$1
Also, note that the ^ and $ (start and end) anchors in the RewriteRule are superfluous, and can be removed, as above.
Disclaimer: I have not tried out any of this myself. I'm working with theory here. Please be careful. And let me know if it works! TIA!
Jim
I have one, and only one robots.txt file - it is in my main root directory. I do not need multiple robots files for each individual subdomain, as suggested above.
Google has spidered my "exposed" subdomains correctly for over 6 months. If you want, check the backlinks on the site in my profile. You will see the links from my tips subdomain. All links are listed in Google correctly. You will never see any incorrectly spidered links in Google.
All of the major search engines are spidering my subdomains correctly, EXCEPT for INK. INK sporadically screws up my URLs. Some of them it lists correctly, some it does not. There doesn't seem to be any pattern as to why. I've emailed PositionTech a number of times with no resolution. I consider it a minor problem that I haven't had the time to attend to. Google, FAST, and AltaVista spider all of my subdomain URLs correctly though. Check out the following two threads to read more on this problem with INK:
[webmasterworld.com...]
[webmasterworld.com...]
That said, the main intent of my comments in this thread was towards subdomains and robots files.
I won't argue with you that it works for you, but it's very strange, unless there is a rule like:
RewriteRule ^robots.txt$ - [L]
A rule like the one above, placed before the other rewrites, would allow you to have one "unified" robots.txt for all subdomain-subdirectories, assuming subdomain-subdirectory names were included in the Disallow statements (as in your original post).
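A rough Python model of what such an exemption would change, under the assumption that it sits ahead of the per-subdomain rules. This is a sketch only: the function name rewrite, the exempt_robots switch and the sample requests are invented for illustration, and the host and path checks are simplified string tests rather than Apache's regex engine.

```python
import re

SUBDOMAINS = ("word1", "word2")  # subdomain names used in the thread


def rewrite(host, uri, exempt_robots=True):
    """Simplified model of the .htaccess rules, in order."""
    # In .htaccess the RewriteRule pattern sees the path without its
    # leading slash, which is why ^robots.txt$ can match at all.
    path = uri.lstrip("/")
    if exempt_robots and re.search(r"^robots.txt$", path):
        # "-" means no substitution; [L] stops further rule processing.
        return uri
    for name in SUBDOMAINS:
        if host.endswith(f"{name}.mydomain.com") and f"/{name}/" not in uri:
            return f"/{name}/{path}"
    return uri


print(rewrite("word1.mydomain.com", "/robots.txt"))                       # exemption on
print(rewrite("word1.mydomain.com", "/robots.txt", exempt_robots=False))  # exemption off
print(rewrite("word2.mydomain.com", "/tips.html"))
```

With the exemption, every subdomain serves the single robots.txt from the top-level directory; without it, each subdomain's robots.txt request lands in its own subdirectory.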
If you have any other RewriteRules that would cause your .htaccess to bypass the rewrite on requests for robots.txt or any class of files that would include robots.txt, please post! Otherwise, this is quite a big mystery!
Your post started a very interesting discussion - Thanks!
Jim
The fact that you have excluded the sub directory and the content of that directory hasn't been crawled doesn't have anything to do with the robots.txt file. It hasn't been crawled because there are no links to that sub directory.
If you were to put a link to the unfinished subdomains on a page in your site, that content would get crawled and indexed even if the actual location was excluded. The spider has no way of knowing that [content.mydomain.com...] is the same as [mydomain.com...]
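This point can be checked with Python's standard-library urllib.robotparser. The sketch assumes the single root robots.txt is also what the subdomain serves, and it uses the word1 names from earlier in the thread as stand-ins; both URLs are illustrative.

```python
from urllib.robotparser import RobotFileParser

parser = RobotFileParser()
parser.parse([
    "User-agent: Googlebot",
    "Disallow: /word1/",
])

# Illustrative URLs: the address a link would use, and the directory
# on the main host where the same file actually lives.
public_url = "http://word1.mydomain.com/index.html"
internal_url = "http://mydomain.com/word1/index.html"

print("public:", parser.can_fetch("Googlebot", public_url))
print("internal:", parser.can_fetch("Googlebot", internal_url))
```

The rule only matches paths that begin with /word1/, and the path a spider sees on the subdomain is just /index.html, so the linked subdomain URL is not excluded even though the underlying directory is.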