Exclude all robots on alias domains?

Good idea or not?

jtara

5:52 pm on Sep 15, 2006 (gmt 0)




Yes, this is related to our favorite subject of all time: 301 redirects, and "alias" domains! :)

I've found a great redirect solution on another site. It solves the (perhaps non-) problem of pre-HTTP/1.1 browsers.

The idea is to use TWO IP addresses.

One IP address is used for the primary site. Pick one: www.example.tld or example.tld.

The other IP address is used for all of the aliases. (www or non, depending on how you go for primary, plus www.example2.tld, example2.tld, www.example3.tld, etc.)

The "other" IP address has a server that redirects to the primary. Typically, of course, this is done using a single server, using both virtual-by-IP and virtual-by-name.
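For what it's worth, here's a rough sketch of how that two-IP setup might look in Apache. The IP addresses, paths, and domain names are placeholders, not anything from a real config:

# Primary site, bound to its own IP
<VirtualHost 192.0.2.1:80>
    ServerName www.example.tld
    DocumentRoot /var/www/example
</VirtualHost>

# Second IP: selected by address, not by Host header, so every request
# that arrives here - even one with no Host header at all - gets a
# permanent (301) redirect to the primary
<VirtualHost 192.0.2.2:80>
    ServerName www.example2.tld
    Redirect permanent / http://www.example.tld/
</VirtualHost>

Because the second VirtualHost is chosen by IP address rather than by name, even a pre-HTTP/1.1 request lands on it and still gets the redirect - which is the whole point of burning a second address.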

If you use a single IP address, there's a problem with pre-HTTP/1.1 user agents. They don't send a Host header, so the server can't tell which domain name was requested; they never get a redirect, and so (if a browser) won't have their URL bar updated. The user agent is left browsing a duplicate site, rather than being redirected to the primary.
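To illustrate what's missing (the hostname here is just a placeholder), an HTTP/1.0 request carries only the path:

GET /somepage.html HTTP/1.0

while an HTTP/1.1 request also names the host, which is what name-based virtual hosting and the redirect logic key on:

GET /somepage.html HTTP/1.1
Host: www.example2.tld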

Perhaps this is a non-problem - how many pre-HTTP/1.1 user agents are still in use? They would all break on millions of virtual-by-name servers anyway...

Anyway, one variant I have seen on this is a recommendation to have a robots.txt on the 2nd IP address that excludes all robots. This is the only file on the 2nd IP.
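The deny-all robots.txt served from that 2nd IP is just the standard two lines:

User-agent: *
Disallow: /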

Yet another variant, if one prefers not to use two IP addresses, is to use mod_rewrite to serve up different robots.txt files (e.g. robots.txt vs. robots_secondary.txt) depending on the domain name; robots_secondary.txt would exclude all robots (a mod_rewrite sketch follows below). Oh, yes, one more little detail - of course, you don't redirect robots.txt on the secondary to the primary. That is, on the secondary:

robots.txt -> robots_secondary.txt
all other -> primary domain
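
Here's a rough .htaccess-style sketch of that single-IP variant, assuming www.example.tld is the primary (the domain name is a placeholder):

RewriteEngine On

# On any hostname other than the primary, answer robots.txt requests
# with the deny-all file instead of redirecting them
RewriteCond %{HTTP_HOST} !^www\.example\.tld$ [NC]
RewriteRule ^robots\.txt$ /robots_secondary.txt [L]

# Everything else on a non-primary hostname gets a 301 to the primary.
# The empty-Host check skips pre-HTTP/1.1 requests, which can't be
# matched by hostname anyway (the gap discussed above).
RewriteCond %{HTTP_HOST} !^www\.example\.tld$ [NC]
RewriteCond %{HTTP_HOST} !^$
RewriteRule ^(.*)$ http://www.example.tld/$1 [R=301,L]

Order matters here: the robots.txt rule has to come first, otherwise the blanket redirect would catch it and the aliases would never serve their own robots file.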

Of course, the one-IP solution doesn't deal with pre-HTTP/1.1 user agents. But the issue here is the handling of robots.txt.

(Note that if you are using a "DNS redirect" you are already using a two-address solution. In fact, the DNS provider is simply running a web server that does the redirects. In this case, you can serve up a different robots.txt for the secondaries if your "DNS redirect" service includes the capability of redirecting a specific URL.)

This is seen by advocates as "search engine friendly", since you ensure that they won't waste their time indexing a duplicate site.

Is this a good idea or not? At first glance, it is "robot friendly". On the other hand, other sites may link to a secondary domain name, or even an invalid one. (Let's say you are using wild-card DNS, to catch errors like ww.example.tld, wwww.example.tld, etc.) If you have a robots.txt that excludes all robots, will they then ignore links-in to secondary domain names?

I'd think they shouldn't, since enumerating links in != crawling.

That raises the question: just what DOES "excluding" in robots.txt mean? Are you simply excluding them from crawling? Or from visiting at all, even if they followed a link in?

Ideally, you'd like to prevent them from crawling the secondaries, and wasting your bandwidth. (Which they SHOULD be able to figure out on their own, anyway - but perhaps not right away?) At the same time, you want them to follow links in, get the redirect, and give credit to the primary domain.