The URLs point to what are essentially co-branded mirrors of the client's main site, which have begun to appear on various newspaper sites. The URLs take the form www.newspapername.clientdomain/pagename.
The client... bless them... didn't mention this to me, even though we'd just spent a bunch of time removing mirror sites they'd had up on the .com, .net, and .org variants of their domain and straightening out the confused linking situation. They are a non-profit organization with an online education directory, and they want maximum exposure for their material. The newspaper co-branding brings in lots of visitors.
So, they're not going to take these mirror sites down. The question is, how best to handle the search engine aspect of this? One approach that's been discussed would be to block the spidering of these subdomains. This seems clean and straightforward. My tendency would be to block spider access and let the index entries that Teoma, Google, etc. hold for the subdomains die off naturally. Does anyone see any complications from this, or reasons not to block spider access?
User-agent: *
Disallow: /
Although for Google, I think I might do something a bit more creative with mod_rewrite. (Of course, I'm assuming you're on Unix/Apache.)
As the number of co-branded subdomains grows, Googlebot will stumble across them on a regular basis. Instead of just disallowing Googlebot, you could send him back to the main site. It could help you get crawled a bit more, and it would be more reliable.
I'm seeing on the new Google results (which are still a little unstable) that an important interior page on the main domain no longer shows any links to it... they've all been credited to one of the subdomained co-branded sites.
As I start exploring the links, I see that some of them are to this page on the main domain, and some of them are to the page on other subdomains... so I think Google is looking at the links to all the subdomains and assigning them to the one that it deems to be most important. If we ban spidering these subdomains, what happens to the links to the subdomains? There are a lot of them.
And, in general, in trying to keep just the main domain indexed by using robots.txt, will we be throwing away links to the subdomains?
>>Is this a problem with search engines? If so, why?<<
Chris - Engines generally don't like dupe content and will drop you because of it. I'm concerned about that with this site, but I'm also concerned about the kind of linking confusion I've just described.
At the moment, I'm working on a project very similar to the one Chris mentioned. In those circumstances, we absolutely do not want the co-branded versions showing up in search engines.
Rather than use robots.txt, we are setting up rewrite rules that will check the user agent and the referring URL. If Googlebot shows up requesting subdomain.domain.com, access will be denied and Googlebot will be given domain.com instead.
You could do something similar with the referrer, so anyone requesting subdomain.domain.com would end up getting just domain.com.
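For what it's worth, here is a rough sketch of the kind of rule being described, not anyone's actual production config. It assumes Apache with mod_rewrite enabled, uses example.org as a stand-in for the client's domain, and treats any host other than example.org or www.example.org as a co-branded mirror. Reading "given domain.com" as a permanent redirect, and matching on the string "Googlebot" in the user agent, are both assumptions on my part.

```apache
RewriteEngine On

# Only act on requests whose user agent contains "Googlebot" (case-insensitive).
RewriteCond %{HTTP_USER_AGENT} Googlebot [NC]

# Only act when the requested host is NOT the main site.
# example.org is a placeholder for the client's real domain.
RewriteCond %{HTTP_HOST} !^(www\.)?example\.org$ [NC]

# Send the spider to the same path on the main site with a 301.
RewriteRule ^ http://www.example.org%{REQUEST_URI} [R=301,L]
```

The bare ^ pattern combined with %{REQUEST_URI} lets the same rule work in either the server config or an .htaccess file. Dropping the user-agent condition would send every visitor to the main site, which is probably not what you want if the newspapers expect their co-branded pages to stay up.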
You can find some great Rewrite examples here [webmasterworld.com].
Word from the system administrator on this site is that there's a potentially big performance hit with mod_rewrite, because the server would be checking every page request to see if it comes from Googlebot. How big a factor is this?
Consensus for now is to disallow everything in the robots.txt for each subdomain... and then I'd submit the subdomain URLs that have already been indexed to Google's 'remove sites,' etc. We'd then see how the main site fares.
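One practical wrinkle, if the co-branded hosts all share a single document root with the main site (an assumption; nothing above confirms it): a single robots.txt file would then be served for every host, main site included. A possible sketch for handing the blocking file only to the subdomains follows. The name robots-cobrand.txt is hypothetical, and example.org again stands in for the client's domain.

```apache
RewriteEngine On

# Any host other than the main site gets the blocking robots file.
RewriteCond %{HTTP_HOST} !^(www\.)?example\.org$ [NC]

# Internally serve robots-cobrand.txt (hypothetical file containing
# "User-agent: *" and "Disallow: /") in place of robots.txt.
RewriteRule ^/?robots\.txt$ /robots-cobrand.txt [L]
```

This rule only fires on requests for robots.txt, so the main site keeps its normal file and ordinary page requests are served as before.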
We would be losing a few very high PageRank links (like one from a major newspaper's home page) to the co-branded site... but I think in the long run that keeping the mirrored sites off the engines is important. Is there any way to do this and still confer the link boost to the main domain?