3rd level domains - duplicate issues and how to avoid them with robots.txt - General Search Engine Marketing Issues forum at WebmasterWorld

Are subdomains automatically considered spam?

Digimon

10:01 am on Jun 12, 2002 (gmt 0)

Hello all,

We are thinking of getting several subdomains like this:

http://keyword.mysite.com

But we would also like to maintain the same file in the root directory of my site.

http://www.mysite.com/keyword/index.html

Some internal links would point to the subdomain and others to the folder

Would it be considered spam, or is it all right?

If it is, is there any way to avoid it?

SmallTime

10:06 am on Jun 12, 2002 (gmt 0)

Hi, welcome to WWM.
Google does not like duplicate content.

egomaniac

4:02 pm on Jun 12, 2002 (gmt 0)

Ditto what smalltime said. Google will penalize you pretty fast for this. Last year I accidentally caused two near duplicate pages on my site to be indexed by Google. On the subsequent update, my site dropped from a top 2 to somewhere in the 20s.

A subdomain such as [keyword.mysite.com...] is considered a separate and unique domain by Google and the other SEs. If you also expose the same content by letting Google index the subdirectory, it will appear to Google as a separate page with duplicate content.

Pick one, and let Google index that.

biggles

3:41 am on Nov 4, 2002 (gmt 0)

I have a site that is split into 3 country sub-sites using sub-domains. The content and structure of each sub-domain is very similar, i.e. effectively duplicate, which I'm sure is the reason Google's only indexed one of them.

More content is being added to all 3 sub-domains and I want to avoid duplicate content penalties from Google.

I'm wondering how best to do this. Can robots.txt be used to exclude sub-domains?

Thanks

biggles

1:27 pm on Nov 4, 2002 (gmt 0)

<bump>

Can robots.txt be used to exclude sub-domains?

Anyone able to give an opinion on this please?

Thanks

heini

1:31 pm on Nov 4, 2002 (gmt 0)

Biggles, a greatly rewarding way to find this out is checking out this site:
[searchengineworld.com...]

It's a robots.txt validator tool, and further down the page you'll find all you need to know about robots.txt and how to use it, in easily comprehensible form.

It's great - check it out!

Brett_Tabke

1:34 pm on Nov 4, 2002 (gmt 0)

What will happen is either the subdomain or the keyword directory will suffer a minor penalty.

As others suggested, block one of them with a robots.txt.

biggles

1:53 pm on Nov 4, 2002 (gmt 0)

Heini / Brett

Thanks for the prompt replies. Had looked over the robots.txt info and understand it can be used to exclude spiders from a site in its entirety or just specified directories or files. What I'm unclear on is how to use it to exclude spiders from sub-domains such as:

www.abc.mydomain.com but allow access to www.xyz.mydomain.com and www.mydomain.com

If I want to keep all agents out of www.abc.mydomain.com but allow access to www.xyz.mydomain.com & www.mydomain.com, how do I set it up please? Sorry - this reveals my ignorance about sub-domains - do they have the same root? If I place the following robots.txt in the root, I presume it would keep spiders out of all sub-domains.

User-agent: *
Disallow: /

Thanks

bjseiler

2:08 pm on Nov 4, 2002 (gmt 0)

I would like to agree with saying that 3rd level domains lead to problems......but if someone wants a quick smile (this is not my site) go to google and type in -

[kw1 kw2 kw3]

Hmmmmmm?

[edited by: heini at 2:23 pm (utc) on Nov. 4, 2002]

heini

2:25 pm on Nov 4, 2002 (gmt 0)

Bjseiler, there are definitely sites out there utilizing this in a hmm less than innocent way.
But then we could say this about every other way to set up huge sites also.

Brett_Tabke

2:25 pm on Nov 4, 2002 (gmt 0)

Biggles, you can't block on a "subdomain" level, but since you are feeding different content from the subdomain, you must have a different site root for the subdomains? That's where you would place the robots.txt at.
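
A minimal sketch of why the file has to sit in each subdomain's own root: a crawler fetches robots.txt separately for every hostname and applies it only to that host. The host names are the placeholders from biggles's question, and the `ROBOTS_BY_HOST` table and `may_crawl` helper are hypothetical stand-ins for what each site root would serve. Python's standard `urllib.robotparser` does the rule matching.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical: the robots.txt text each host would serve from its own root.
ROBOTS_BY_HOST = {
    "www.abc.mydomain.com": "User-agent: *\nDisallow: /\n",   # keep everyone out
    "www.xyz.mydomain.com": "User-agent: *\nDisallow:\n",     # allow everything
    "www.mydomain.com": "User-agent: *\nDisallow:\n",         # allow everything
}

def may_crawl(host: str, path: str, agent: str = "Googlebot") -> bool:
    """Apply only the robots.txt that belongs to the requested host."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_BY_HOST[host].splitlines())
    return parser.can_fetch(agent, f"http://{host}{path}")
```

For example, `may_crawl("www.abc.mydomain.com", "/index.html")` consults only the first entry, so the other two hosts stay open to spiders.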

bjseiler

2:39 pm on Nov 4, 2002 (gmt 0)

Sorry all for putting actual keywords in my post. I have been corrected and now I know, sorry.

It really was a pretty funny result set though. The company has the top 10 results for a keyword phrase and 18 of the first 20. Each result goes to the same page but the url is a different 3rd level domain.

Moreover, it looks like this company is just an affiliate and not even a true site on its own.

This may annoy some people, but I have to give this guy (or girl) a little respect. He seems to know how to beat google right now. How many of us (if we wanted to) could create a set of sites that hold 18 of the top 20 spots in a fairly competitive set of keywords? Sure, google will probably catch on pretty soon, but as an affiliate, does this person probably even care? No. Just find another high paying affiliate system and try to do the same thing before google changes its algo.

Macguru

2:44 pm on Nov 4, 2002 (gmt 0)

>>create a set of sites that hold 18 of the top 20 spots in a fairly competitive set of keywords?

The last one I saw like this held positions for about 15 days. I don't think it is worth the effort.

bjseiler

2:48 pm on Nov 4, 2002 (gmt 0)

Hey, trust me, this really annoys me, but if you held the top 18 of 20 spots on certain keyword phrases for 15 days, you might not have to work for a year.........think p or n, gambling, mortgage sites, etc. They have some REALLY big affiliate payoffs.

Not defending, just playing a little devil's advocate. For those of us with real sites, people like this are just plain annoying and it costs us customers.

heini

2:54 pm on Nov 4, 2002 (gmt 0)

>I have a site that is split into 3 country sub-sites using sub-domains. The content and structure of each sub-domain is very similar, i.e. effectively duplicate

Biggles, what is the purpose of this setup?

And would a better solution perhaps be to do independent sites?

biggles

8:26 pm on Nov 4, 2002 (gmt 0)

>what is the purpose of this setup?
>And would a better solution perhaps be to do independent sites?

Heini - I agree it would make more sense to have 3 separate sites with 1 for each country. However the client has used sub-domains and that's what I have to work with. I think their reasoning for going this way is probably that it uses Vignette, which has horrific license fees.

The content and structure of each sub-domain is very similar - basically they've created 1 site then replicated it for each different country with applicable minor content differences. They have legit business reasons for this approach but have inadvertently created mirror sites.

I'm concerned about the site being hit with duplicate content penalties, which is why I want to instruct spiders to index only 1 sub-domain. As noted above, Google's already chosen to index only one of the sub-domains - abc.mydomain.com. PR is just 1, so I'm assuming a penalty has been applied. FAST has not indexed any of the sub-domains.

Given this, any suggestions as to how I use robots.txt to exclude certain sub-domains please?

Thanks for your help - greatly appreciated.

egomaniac

10:20 pm on Nov 4, 2002 (gmt 0)

Hi biggles. I've done this, so I can tell you how my robots file excludes bots from subdomains. I have a number of subdomains in my site. One is active, and three are in development. I have kept google out from the ones in development until they are complete. I am paranoid about any duplicate page penalties until my content is complete.

I have implemented subdomains using one webhosting plan under a single unique IP. I mention this because it is possible to set your subdomains up with individual IPs - I did not do that.

I use mod_rewrite in my .htaccess file to implement my subdomains. The actual content of my subdomains resides in subdirectories. I use the robots file to block Googlebot from the subdirectories. That has kept googlebot from spidering the subdomain content while it is in development.

My subdomains look like this:
[this is what the user sees]

[word1.mydomain.com...]
[word2.mydomain.com...]

The contents of the subdomains reside on my webhost as follows:
[users never see this]

[mydomain.com...]
[mydomain.com...]

My .htaccess file implements the subdomain using mod_rewrite:

RewriteEngine On
Options +FollowSymlinks
RewriteBase /
# Rewrite Rule for word1.mydomain.com
RewriteCond %{HTTP_HOST} word1.mydomain.com$
RewriteCond %{REQUEST_URI} !/word1/
RewriteRule ^(.*)$ /word1/$1

And I block googlebot as follows:

User-agent: Googlebot
Disallow: /word1/*
Disallow: /word2/*

Bottom line is that if you block a bot from the specific subdirectory containing the content, then it will not go there.
-egomaniac

espeed

10:35 pm on Nov 4, 2002 (gmt 0)

> What will happen is either the subdomain or the keyword directory
> will suffer a minor penalty.

Brett -- Are you saying that if a site mirrors the Jargon File or some other set of documents that it will receive a penalty? If so, how do you know this?

biggles

11:00 pm on Nov 4, 2002 (gmt 0)

Hey egomaniac

What you've posted is ideal - thanks for sharing! :)

biggles

11:10 pm on Nov 4, 2002 (gmt 0)

> And I block googlebot as follows:
> User-agent: Googlebot
> Disallow: /word1/*
> Disallow: /word2/*

Not sure why you've included an asterisk. Doesn't the following exclude all content from a directory?

Disallow: /word1/
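
A quick way to check the plain form: under the original robots.txt convention a Disallow value is a path prefix, so `/word1/` on its own covers everything beneath that directory. The sketch below uses Python's standard `urllib.robotparser`; the `is_blocked` helper and the sample paths are made up for illustration, and individual crawlers may treat a trailing `*` differently.

```python
from urllib.robotparser import RobotFileParser

RULES = """\
User-agent: Googlebot
Disallow: /word1/
"""

def is_blocked(path: str, agent: str = "Googlebot") -> bool:
    """True if the rules above forbid `agent` from fetching `path`."""
    parser = RobotFileParser()
    parser.parse(RULES.splitlines())
    return not parser.can_fetch(agent, path)
```

With these rules, any path that begins with `/word1/` is refused at every depth, while paths outside that directory are unaffected.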

By the way, I ran your robots.txt file thru the validator at [searchengineworld.com ] and it reports an error on line 72.

Cheers

jdMorgan

11:26 pm on Nov 4, 2002 (gmt 0)

I followed that subdomains-implemented-by-htaccess-redirect-to-subdirectory methodology right up to the final rewrite rule.

It is an internal URL rewrite, as opposed to an external 301- or 302-type redirect, so a 'bot is not aware of the subdirectory redirection. Therefore, I think this would work properly only if each subdomain-subdirectory had its own robots.txt - Which I think is what biggles wants.

The 'bots don't care about your directory structure, they only care about URLs. So, having a number of robots.txt files, each in a subdirectory representing a subdomain, is not a problem.

Or did ya lose me somewhere?

Jim

egomaniac

11:47 pm on Nov 4, 2002 (gmt 0)

Hi biggles,

The * is a wildcard for everything in that subdirectory. I read somewhere that the * was proper syntax for the robots file. Leaving the * out may work, but I haven't tried it. I do know that what I have has worked for at least 6 mos or more since I first set it up.

PS-Thanks for the error tip on my robots file (it's fixed now).

Hi jdMorgan,

If you followed my rewrite rules, then you understand it better than I do. I am a total hack when it comes to my webserver. I got this code from some support docs my webhost gave me. I think they took it from the Apache user guide.

Only one robots file is needed. I can't debate with you why. I just know that my setup works on my webhost. I think that each webhost can be different on how subdomains are implemented, which is one of the obstacles to using them.
-egomaniac

biggles

12:01 am on Nov 5, 2002 (gmt 0)

Egomaniac

Not sure if it does work - just used [wannabrowser.com...] to mimic googlebot and afraid to say it is able to access the sub-domains/sub-directories you've tried to block it from. :(

Jim

Clearly from other postings you know a lot about .htaccess mod-rewrites. Would you be kind enough to spell out how you'd tackle this please.

Thanks

jdMorgan

12:41 am on Nov 5, 2002 (gmt 0)

OK,

With this in .htaccess in the top-level directory of the hosting account:

RewriteEngine On
Options +FollowSymlinks
RewriteBase /
# Rewrite Rule for word1.mydomain.com
RewriteCond %{HTTP_HOST} word1.mydomain.com$
RewriteCond %{REQUEST_URI} !/word1/
RewriteRule ^(.*)$ /word1/$1

Any request to [word1.mydomain.com...] whose file path (URI, not URL) does not contain /word1/ is internally rewritten to word1.mydomain.com/word1/(anything). So, a request for [word1.mydomain.com...] ends up at [word1.mydomain.com...]

Therefore, each subdirectory which "represents" a subdomain must have its own robots.txt

You'll also need to be careful to use only absolute URLs of the form <a href="http://word1.mydomain.com/(anything)">yadda yadda</a> in the on-page links of pages which are not disallowed in robots.txt, or you will expose the fact that a redirect is being used.
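
To make the mapping concrete, here is a rough simulation of those three directives in Python. It is only a sketch of the logic, not of Apache's actual rewrite engine: the `internal_rewrite` function is hypothetical, and unlike the RewriteCond above it escapes the dots in the host pattern.

```python
import re

def internal_rewrite(host: str, uri: str, name: str = "word1") -> str:
    """Return the path the server would actually serve for this request."""
    # RewriteCond %{HTTP_HOST} word1.mydomain.com$
    host_matches = re.search(rf"{re.escape(name)}\.mydomain\.com$", host) is not None
    # RewriteCond %{REQUEST_URI} !/word1/   (guards against rewriting twice)
    already_inside = f"/{name}/" in uri
    if host_matches and not already_inside:
        # RewriteRule ^(.*)$ /word1/$1   (invisible to the requesting 'bot)
        return f"/{name}/{uri.lstrip('/')}"
    return uri
```

Under these assumptions a 'bot asking word1.mydomain.com for `/robots.txt` is handed the file stored at `/word1/robots.txt`, which is why the file in the top-level directory is never the one it reads on that host.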

Jim

biggles

1:08 am on Nov 5, 2002 (gmt 0)

Thanks very much Jim.

Final question - if you're excluding several sub-domains would the htaccess read like this?

# Rewrite Rule for word2.mydomain.com
RewriteCond %{HTTP_HOST} word2.mydomain.com$
RewriteCond %{REQUEST_URI} !/word2/
RewriteRule ^(.*)$ /word2/$1

jdMorgan

1:20 am on Nov 5, 2002 (gmt 0)

Yes, that's it.

Just watch out folks - the WebmasterWorld software eats the spaces preceding exclamation points, and you MUST have one in those RewriteCond directives! I've been playing with using the

[pre]

and

[code]

bbtags, but even if you use them, you still have to put TWO spaces in, and then it eats one of them.

# Rewrite Rule for word1.mydomain.com
RewriteCond %{HTTP_HOST} word1.mydomain.com$
RewriteCond %{REQUEST_URI} !/word1/
RewriteRule (.*) /word1/$1

Also, note that the ^ and $ (start and end) anchors in the RewriteRule are superfluous, and can be removed, as above.

Disclaimer: I have not tried out any of this myself. I'm working with theory here. Please be careful. And let me know if it works! TIA!

Jim

biggles

1:54 am on Nov 5, 2002 (gmt 0)

Will do. Thanks again Jim.

egomaniac

7:38 am on Nov 5, 2002 (gmt 0)

I am not an .htaccess or mod_rewrite expert by any means. All I can tell you is what works on my server.

I have one, and only one robots.txt file - it is in my main root directory. I do not need multiple robots files for each individual subdomain, as suggested above.

Google has spidered my "exposed" subdomains correctly for over 6 months. If you want, check the backlinks on the site in my profile. You will see the subdomain links from my tips subdomain. All links are listed in Google correctly. You will never see any incorrectly spidered links in Google.

All of the major search engines are spidering my subdomains correctly, EXCEPT for INK. INK sporadically screws up my urls. Some of them it lists correctly, some it does not. Doesn't seem to be any pattern as to why. I've emailed PositionTech a number of times with no resolution. I consider it a minor problem that I haven't had the time to attend to. Google, FAST, and AltaVista spider all of my subdomain urls correctly though. Check out the following two threads to read more on this problem with INK:

[webmasterworld.com...]

That said, the main intent of my comments in this thread was towards subdomains and robots files.

jdMorgan

8:02 am on Nov 5, 2002 (gmt 0)

egomaniac,

I won't argue with you that it works for you, but it's very strange, unless there is a rule like:

RewriteRule ^robots.txt$ - [L]

preceding the rulesets you quoted originally. Otherwise, how does the 'bot "know" it should not follow a rewrite to the subdomain-directory that it is spidering? It really doesn't have that option - it doesn't know the rewrite is happening, since it's all internal to the server.

A rule like the one above preceding the other rewrites would allow you to have one "unified" robots.txt for all subdomain-subdirectories, assuming subdomain-subdirectory names were included in the Disallow statements (as in your original post).

If you have any other RewriteRules that would cause your .htaccess to bypass the rewrite on requests for robots.txt or any class of files that would include robots.txt, please post! Otherwise, this is quite a big mystery!

Your post started a very interesting discussion - Thanks!

Jim

WebGuerrilla

8:40 am on Nov 5, 2002 (gmt 0)

I'm with Jim on this one. Adding exclusions for subdirectories containing content that is served to a spider as a subdomain will do nothing to prevent a spider from crawling that subdomain.

The fact that you have excluded the sub directory and the content of that directory hasn't been crawled doesn't have anything to do with the robots.txt file. It hasn't been crawled because there are no links to that sub directory.

If you were to put a link to the unfinished subdomains on a page in your site, that content would get crawled and indexed even if the actual location was excluded. The spider has no way of knowing that [content.mydomain.com...] is the same as [mydomain.com...]
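
The point can be checked with a small sketch. A robots.txt rule is matched against the path in the URL the spider requests, not against the directory the server rewrites to internally. This uses Python's standard `urllib.robotparser` with the Disallow line in its plain prefix form and the word1 naming from earlier in the thread; the page name `page.html` is invented for illustration.

```python
from urllib.robotparser import RobotFileParser

parser = RobotFileParser()
parser.parse([
    "User-agent: Googlebot",
    "Disallow: /word1/",
])

# URL as the spider sees it on the subdomain: the path is just /page.html.
via_subdomain = parser.can_fetch("Googlebot", "http://word1.mydomain.com/page.html")

# Same file addressed through the real subdirectory: the path starts with /word1/.
via_directory = parser.can_fetch("Googlebot", "http://mydomain.com/word1/page.html")
```

Only the second URL is caught by the rule, so a link to the subdomain form would still be open to crawling.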
