Forum Moderators: goodroi
I have a small problem, and was hoping that someone might be able to assist me with it. Many thanks in advance :)
I have 3 domains pointing to my website, and I need to disallow robots from crawling the site via 2 of the domains. With there only ever being one robots.txt in the root, how do I manage to disallow robots on two of the domains?
KR,
-gs
I am on apache and currently have the following in place:
Options +FollowSymlinks
RewriteEngine on
RewriteCond %{HTTP_HOST} !^www\.domain\.net
RewriteCond %{HTTP_HOST} !^0\.0\.0\.1
RewriteRule ^(.*)$ http://www.domain.net/$1 [R=permanent,L]
But the main part of my problem is that Slurp (Inktomi) is having difficulties with the 301. Can you see anything wrong with the above .htaccess or have any other suitable alternatives?
Many thanks George
[edited by: engine at 1:47 pm (utc) on Sep. 23, 2003]
[edit reason] de-linked [/edit]
You could add a RewriteRule to silently redirect to a secondary robots.txt for requests to the alternate domain:
Options +FollowSymlinks
RewriteEngine on
RewriteCond %{HTTP_HOST} !^www\.domain\.net
RewriteCond %{HTTP_HOST} !^0\.0\.0\.1
RewriteRule ^robots\.txt$ /alternate_robots.txt [L]
#
RewriteCond %{HTTP_HOST} !^www\.domain\.net
RewriteCond %{HTTP_HOST} !^0\.0\.0\.1
RewriteRule ^(.*)$ http://www.domain.net/$1 [R=permanent,L]
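For reference, here is a sketch of what the alternate_robots.txt file might contain, assuming the goal is to block all compliant robots on the secondary domains (the filename itself is arbitrary; it just has to match the one in the RewriteRule):

```text
# alternate_robots.txt -- served as /robots.txt on the secondary domains.
# Blocks all compliant robots from the entire site.
User-agent: *
Disallow: /
```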
You could also handle the Slurp problem specifically:
Options +FollowSymlinks
RewriteEngine on
RewriteCond %{HTTP_HOST} !^www\.domain\.net
RewriteCond %{HTTP_HOST} !^0\.0\.0\.1
RewriteCond %{HTTP_USER_AGENT} ^Mozilla/.*\ \(Slurp/
RewriteRule ^robots\.txt$ /alternate_robots.txt [L]
#
RewriteCond %{HTTP_HOST} !^www\.domain\.net
RewriteCond %{HTTP_HOST} !^0\.0\.0\.1
RewriteRule ^(.*)$ http://www.domain.net/$1 [R=permanent,L]
P.S. You can also combine the two host conditions into a single line:
RewriteCond %{HTTP_HOST} !^(www\.domain\.net|0\.0\.0\.1)
Jim
Having a few problems with the options available to me:
Option 1 - displays the alternate_robots.txt regardless of the UA.
Option 2 - works slightly better: when Googlebot is the UA it displays my robots.txt, but when Inktomi requests robots.txt from either of the domains it displays alternate_robots.txt.
Here is a copy of option 2 :
Options +FollowSymlinks
RewriteEngine on
RewriteCond %{HTTP_HOST} !^www\.domain\.info
RewriteCond %{HTTP_HOST} !^0\.0\.0\.1
RewriteCond %{HTTP_USER_AGENT} ^Mozilla/.*\ \(Slurp/
RewriteRule ^robots\.txt$ /alternate_robots.txt [L]
RewriteCond %{HTTP_HOST} !^www\.domain\.net
RewriteCond %{HTTP_HOST} !^0\.0\.0\.1
RewriteRule ^(.*)$ http://www.domain.net/$1 [R=permanent,L]
Can you see anything that I might have done wrong?
PS am using WB for validation purposes.
-gs
The domains in each part must be the same. The "!" means NOT, so the domain in the RewriteConds should be the one you want to "standardize" on. For the purposes of this example, I'll assume you want to "keep" the .net domain, serve a special robots.txt for robots.txt requests to the .info domain, and 301-redirect requests for all other files from .info to .net. To help you figure out any other problems, here is the code with comments:
# Enable FollowSymLinks, retain all other option settings (Enabling FollowSymLinks is often required to
# allow mod_rewrite if the server isn't already set up with FollowSymLinks enabled for client accounts).
Options +FollowSymlinks
#
# Turn on the rewriting engine
RewriteEngine on
# IF the requested domain is NOT our "standard" domain
RewriteCond %{HTTP_HOST} !^www\.domain\.net
#
# AND IF the requested domain is NOT our server's IP address (This allows your site to work
# without DNS if and only if you have a unique IP address; otherwise delete this line)
RewriteCond %{HTTP_HOST} !^0\.0\.0\.1
#
# AND the user-agent is Mozilla/<anything><space>(Slurp/<anything>
RewriteCond %{HTTP_USER_AGENT} ^Mozilla/.*\ \(Slurp/
#
# THEN if robots.txt is requested, serve alternate_robots.txt instead & exit mod_rewrite
# (because of [L]). Your logs WILL NOT show the new URL, only a byte count difference may
# indicate that the rewrite happened.
RewriteRule ^robots\.txt$ /alternate_robots.txt [L]
#
#
# We will never get this far if the UA was Slurp and it was requesting robots.txt from
# the "wrong" domain. So the following code handles all other requests except that case.
#
# IF the requested domain is NOT our "standard" domain
RewriteCond %{HTTP_HOST} !^www\.domain\.net
#
# AND IF the requested domain is NOT our server's IP address
RewriteCond %{HTTP_HOST} !^0\.0\.0\.1
#
# THEN if any file is requested, tell the user-agent to re-request the same file using
# our standard domain name by sending a 301-Moved Permanently server response with the
# new URL, and then quit mod_rewrite.
RewriteRule ^(.*)$ http://www.domain.net/$1 [R=permanent,L]
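If the rules still don't behave as expected, mod_rewrite's own logging will show exactly which conditions matched on each request. A minimal sketch (note these directives are only valid in the main server config or a VirtualHost, not in .htaccess, so you need server-config access; the log path here is just an example):

```apache
# httpd.conf (server or virtual-host context only -- not .htaccess)
RewriteLog /var/log/apache/rewrite.log
# 0 = logging off, 9 = extremely verbose; keep it low on a busy server
RewriteLogLevel 3
```

Then request robots.txt from each domain with each user-agent and read the log to see which RewriteConds matched and which RewriteRule fired.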
Make sure you cut and paste a "real" Slurp user-agent string into WB. All slashes, spaces, and parentheses are expected to be present in the first part of the string. Here's one hot off the press:
Mozilla/5.0 (Slurp/cat; slurp@inktomi.com; [inktomi.com...]
Jim
Many thanks for the commented example I can now make better sense of it :)
I have basically copied and pasted the example on to my server and run the tests, it works perfectly.
Would I be correct in saying that the error lay at this point:
RewriteCond %{HTTP_HOST} !^www\.domain\.info where the exclamation point acted as the NOT? Therefore the condition was checking that the domain was NOT the .info extension, moving onwards and falling into the second condition:
RewriteCond %{HTTP_USER_AGENT} ^Mozilla/.*\ \(Slurp/
which of course came out true each time I tested it while replicating the Slurp UA?
I cannot thank you enough again Jim.