Forum Moderators: goodroi
I have a small problem, and was hoping that someone might be able to assist me with it. Many thanks in advance :)
I have 3 domains pointing to my website, and I need to disallow robots from crawling the site via 2 of the domains. With there only ever being one robots.txt in the root, how do I manage to disallow robots on two of the domains?
KR,
-gs
I am on apache and currently have the following in place:
Options +FollowSymlinks
RewriteEngine on
RewriteCond %{HTTP_HOST} !^www\.domain\.net
RewriteCond %{HTTP_HOST} !^0\.0\.0\.1
RewriteRule ^(.*)$ http://www.domain.net/$1 [R=permanent,L]
But the main part of my problem is that Slurp (Inktomi) is having difficulties with the 301. Can you see anything wrong with the above .htaccess or have any other suitable alternatives?
Many thanks George
[edited by: engine at 1:47 pm (utc) on Sep. 23, 2003]
[edit reason] de-linked [/edit]
You could add a RewriteRule to silently redirect to a secondary robots.txt for requests to the alternate domain:
Options +FollowSymlinks
RewriteEngine on
RewriteCond %{HTTP_HOST} !^www\.domain\.net
RewriteCond %{HTTP_HOST} !^0\.0\.0\.1
RewriteRule ^robots\.txt$ /alternate_robots.txt [L]
#
RewriteCond %{HTTP_HOST} !^www\.domain\.net
RewriteCond %{HTTP_HOST} !^0\.0\.0\.1
RewriteRule ^(.*)$ http://www.domain.net/$1 [R=permanent,L]
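For reference, here is a sketch of what the alternate_robots.txt file might contain, assuming the goal is to block all compliant robots on the secondary domains (the filename itself is arbitrary; it just has to match the one in the RewriteRule):

```text
# alternate_robots.txt -- served as /robots.txt on the secondary domains.
# Blocks all compliant robots from the entire site.
User-agent: *
Disallow: /
```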
You could also handle the Slurp problem specifically:
Options +FollowSymlinks
RewriteEngine on
RewriteCond %{HTTP_HOST} !^www\.domain\.net
RewriteCond %{HTTP_HOST} !^0\.0\.0\.1
RewriteCond %{HTTP_USER_AGENT} ^Mozilla/.*\ \(Slurp/
RewriteRule ^robots\.txt$ /alternate_robots.txt [L]
#
RewriteCond %{HTTP_HOST} !^www\.domain\.net
RewriteCond %{HTTP_HOST} !^0\.0\.0\.1
RewriteRule ^(.*)$ http://www.domain.net/$1 [R=permanent,L]
P.S. You can also combine the two host conditions into a single line:
RewriteCond %{HTTP_HOST} !^(www\.domain\.net|0\.0\.0\.1)
Jim
Having a few problems with the options available to me:
Option 1 - displays the alternate_robots.txt regardless of the UA.
Option 2 - works slightly better: when Googlebot is the UA it displays my robots.txt, but when Inktomi requests robots.txt from either of the domains it displays alternate_robots.txt.
Here is a copy of option 2 :
Options +FollowSymlinks
RewriteEngine on
RewriteCond %{HTTP_HOST} !^www\.domain\.info
RewriteCond %{HTTP_HOST} !^0\.0\.0\.1
RewriteCond %{HTTP_USER_AGENT} ^Mozilla/.*\ \(Slurp/
RewriteRule ^robots\.txt$ /alternate_robots.txt [L]
RewriteCond %{HTTP_HOST} !^www\.domain\.net
RewriteCond %{HTTP_HOST} !^0\.0\.0\.1
RewriteRule ^(.*)$ http://www.domain.net/$1 [R=permanent,L]
Can you see anything that I might have done wrong?
PS am using WB for validation purposes.
-gs
The domains in each part must be the same. The "!" means NOT, so the domain in the RewriteConds should be the one you want to "standardize" on. For the purposes of this example, I'll assume you want to "keep" the .net domain, serve a special robots.txt for robots.txt requests to the .info domain, and 301-redirect requests for all other files from .info to .net. To help you figure out any other problems, here is the code with comments:
# Enable FollowSymLinks, retain all other option settings (Enabling FollowSymLinks is often required to
# allow mod_rewrite if the server isn't already set up with FollowSymLinks enabled for client accounts).
Options +FollowSymlinks
#
# Turn on the rewriting engine
RewriteEngine on
# IF the requested domain is NOT our "standard" domain
RewriteCond %{HTTP_HOST} !^www\.domain\.net
#
# AND IF the requested domain is NOT our server's IP address (This allows your site to work
# without DNS if and only if you have a unique IP address; otherwise delete this line)
RewriteCond %{HTTP_HOST} !^0\.0\.0\.1
#
# AND the user-agent is Mozilla/<anything><space>(Slurp/<anything>
RewriteCond %{HTTP_USER_AGENT} ^Mozilla/.*\ \(Slurp/
#
# THEN if robots.txt is requested, serve alternate_robots.txt instead & exit mod_rewrite
# (because of [L]). Your logs WILL NOT show the new URL, only a byte count difference may
# indicate that the rewrite happened.
RewriteRule ^robots\.txt$ /alternate_robots.txt [L]
#
#
# We will never get this far if the UA was Slurp and it was requesting robots.txt from
# the "wrong" domain. So the following code handles all other requests except that case.
#
# IF the requested domain is NOT our "standard" domain
RewriteCond %{HTTP_HOST} !^www\.domain\.net
#
# AND IF the requested domain is NOT our server's IP address
RewriteCond %{HTTP_HOST} !^0\.0\.0\.1
#
# THEN if any file is requested, tell the user-agent to re-request the same file using
# our standard domain name by sending a 301-Moved Permanently server response with the
# new URL, and then quit mod_rewrite.
RewriteRule ^(.*)$ http://www.domain.net/$1 [R=permanent,L]
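If the rules still don't behave as expected, mod_rewrite's own logging will show exactly which conditions matched on each request. A minimal sketch (note these directives are only valid in the main server config or a VirtualHost, not in .htaccess, so you need server-config access; the log path here is just an example):

```apache
# httpd.conf (server or virtual-host context only -- not .htaccess)
RewriteLog /var/log/apache/rewrite.log
# 0 = logging off, 9 = extremely verbose; keep it low on a busy server
RewriteLogLevel 3
```

Then request robots.txt from each domain with each user-agent and read the log to see which RewriteConds matched and which RewriteRule fired.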
Make sure you cut and paste a "real" Slurp user-agent string into WB. All slashes, spaces, and parentheses are expected to be present in the first part of the string. Here's one hot off the press:
Mozilla/5.0 (Slurp/cat; slurp@inktomi.com; [inktomi.com...]
Jim
Many thanks for the commented example I can now make better sense of it :)
I have basically copied and pasted the example on to my server and run the tests, it works perfectly.
Would I be correct in saying that the error lay at this point:
RewriteCond %{HTTP_HOST} !^www\.domain\.info where the exclamation point acted as the NOT? Therefore the condition was checking that the domain was NOT the .info extension, moving onwards and falling into the second condition:
RewriteCond %{HTTP_USER_AGENT} ^Mozilla/.*\ \(Slurp/
which of course came out true each time I tested it while replicating the Slurp UA?
I cannot thank you enough again Jim.