Basically, it seems that when I access anything via https:// I get an error saying that mod_rewrite is not there. I checked the httpd.conf file located in the /etc/ folder, but only saw virtual host information. Is there another file I should check for the SSL config options? Also, is there anything I can add to the virtual host for my domain?
Thanks
You're really going to have to work with your host to resolve this. If they can't handle it, then it's time for a new host, especially if they won't at least send you a copy of all the config files that affect your account.
Occasionally, we get some 'weird' problems reported here that are inexplicable in the context of the modules that are directly related, and it turns out that the host has misconfigured the LoadModule list, so that the module that doesn't seem to be working correctly is not even being invoked. This can happen if, for example, PHP is placed into the load list after mod_rewrite. Since the modules are processed in reverse order, PHP will run first, so mod_rewrite can't affect any PHP requests. You may have the same problem, but with the SSL module instead of PHP.
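For illustration only, a hypothetical Apache 1.3-style configuration showing the kind of misordering Jim describes might look like this (module names and paths are invented, not taken from anyone's actual server):

LoadModule rewrite_module libexec/mod_rewrite.so
LoadModule php4_module    libexec/libphp4.so
ClearModuleList
AddModule mod_rewrite.c
# mod_php4 is listed after mod_rewrite, and since the list is processed in
# reverse order, PHP gets each request first and mod_rewrite never sees it.
# Listing mod_rewrite *after* mod_php4 instead would let mod_rewrite act first.
AddModule mod_php4.c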
Given what you've said about the config files you have access to, only your host can check this for you, so beat on them for some support.
Jim
OK, this seems to be working; at least it ignores the https requests until I can get them to fix it.
Questions:
Is this the right way to serve robots_noindex.txt for all of the domains I do not want indexed by robots?
Also, am I doing the redirect to a subdirectory from another domain pointing to my server correctly? Do you know how Google will handle this? Does there need to be any special redirect so that Google looks at www.somedomain.com and considers it its own website? By the way, I want it to keep the otherdomain.com domain name (an internal redirect, I guess?).
Does all the syntax look right?
I want otherdomain.com to have its own robots.txt. Any special way to do this?
This is what I have now:
Options +FollowSymLinks
<IfModule mod_rewrite.c>
RewriteEngine on
RewriteCond %{HTTP_HOST} ^(ww+\.)?otherdomain\.com
RewriteCond %{REQUEST_URI} !/somedirectory/
RewriteRule ^(.*)$ /subdirectory/$1 [L]
RewriteCond %{HTTP_HOST} !^www\.accomplishhosting\.com [NC]
RewriteRule (.*) [domain.com...] [R=301,L]
RewriteCond %{HTTP_HOST} !^domain\.com
RewriteRule ^robots\.txt /robots_noindex.txt [R,L]
</IfModule>
[edited by: jdMorgan at 11:15 pm (utc) on Aug. 4, 2004]
<IfModule mod_rewrite.c>
RewriteEngine on
RewriteCond %{HTTP_HOST} !^www\.example\.com [NC]
RewriteRule /(.*) http://www.example.com/$1 [R=301,L]
RewriteCond %{HTTP_HOST} !^example\.com
RewriteRule ^/robots\.txt /robots_noindex.txt [L]
</IfModule>
The first rule will externally redirect all requests for domains other than www.example.com to www.example.com. Because of that, the second rule's condition will always succeed, since the domain name will always be www.example.com by the time that rule is processed. So *all* requests for robots.txt will be internally rewritten to robots_noindex.txt. The code is syntactically correct and that's how it works now, but I doubt that's what you intended.
Jim
OK, no, that is not exactly what I wanted. Sorry for not being clearer.
I have the <IfModule> in there because it seems to fix the https problem I had. I received this tip from someone.
Basically, I want the user/robot to be redirected to robots_noindex.txt on any request for robots.txt that is not for my main domain. Therefore a robots.txt request for www.anyotherdomain.com should be redirected to robots_noindex.txt. I now see why the robots rule is always going to fire. Am I going to have to list each domain one by one?
Then I have some rewrite code set up so that a subdirectory acts as the root for one of my domain names. If I want robots to look at this as being its own website, can I still do an internal redirect, so that www.myotherdomain.com points to /www/myotherdomain/ on my server and uses that directory as its root, while still retaining the name in the browser address bar? Do I have to add [R=301]?
This is what I have now:
Options +FollowSymLinks
<IfModule mod_rewrite.c>
RewriteEngine on
RewriteCond %{HTTP_HOST} ^(ww+\.)?otherdomain\.com
RewriteCond %{REQUEST_URI} !/somedirectory/
RewriteRule ^(.*)$ /subdirectory/$1 [L]
RewriteCond %{HTTP_HOST} !^www\.example\.com [NC]
RewriteRule (.*) [domain.com...] [R=301,L]
RewriteCond %{HTTP_HOST} !^domain\.com
RewriteRule ^/robots\.txt /robots_noindex.txt [L]
</IfModule>
[edited by: jdMorgan at 2:14 am (utc) on Aug. 5, 2004]
[edit reason] Removed specifics per TOS [/edit]
You need to take a look at the order of your rules and make a plan. For example, if you redirect all non-standard domain name requests to your www.example.com domain, then that will also redirect requests for the robots.txt file of each of those non-standard domains, and therefore no separate robots.txt (rewrite or file) is needed. No domain other than the standard domain will be reachable, and the search engines will figure it out pretty quickly. In short, if you always redirect non-www to www, then non-www pretty much ceases to exist, so it doesn't need a robots.txt.
However, if for some reason you want to have a separate robots_noindex.txt anyway, then that rewrite must be done before the wholesale domain redirect.
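A minimal sketch of that ordering, reusing the example.com placeholder names from above (the second REQUEST_URI condition is an addition here, needed so that the freshly rewritten robots_noindex.txt is not itself caught by the wholesale redirect on the next pass):

<IfModule mod_rewrite.c>
RewriteEngine on
# Hand out robots_noindex.txt first, while the original hostname is still visible
RewriteCond %{HTTP_HOST} !^(www\.)?example\.com [NC]
RewriteRule ^robots\.txt$ /robots_noindex.txt [L]
# Then redirect everything else to the canonical hostname
RewriteCond %{HTTP_HOST} !^www\.example\.com [NC]
RewriteCond %{REQUEST_URI} !^/robots_noindex\.txt
RewriteRule (.*) http://www.example.com/$1 [R=301,L]
</IfModule>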
Take a look at the regular expressions tutorial (link in Apache forum charter) and deconstruct the second ruleset above. That will demonstrate why you won't need a separate rule for each non-standard domain. You can match a specific domain name pattern to take action in that domain only, or you can stick a "!" (logical NOT) in front of a domain name pattern to take action for any *but* that domain.
If you rewrite a subdomain to a subdirectory, then that subdirectory will appear as the web root directory for that subdomain -- and therefore, it should contain a robots.txt file for that subdomain... Unless you want to add something else to your mod_rewrite rules to change that. :)
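For instance, if test.example.com is rewritten into /test/, a robots.txt file placed in /test/ will be served as test.example.com/robots.txt automatically. A sketch of the alternative Jim hints at, serving a robots file stored outside the subdirectory (all names hypothetical):

RewriteEngine on
# Serve a shared robots file for the subdomain instead of one stored in /test/
RewriteCond %{HTTP_HOST} ^test\.example\.com [NC]
RewriteRule ^robots\.txt$ /robots_test.txt [L]
# Map everything else on the subdomain into its subdirectory; the REQUEST_URI
# conditions keep already-rewritten URLs from being remapped a second time
RewriteCond %{HTTP_HOST} ^test\.example\.com [NC]
RewriteCond %{REQUEST_URI} !^/test/
RewriteCond %{REQUEST_URI} !^/robots_test\.txt
RewriteRule ^(.*)$ /test/$1 [L]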
Jim
Here is my setup:
I have a main domain. Call it www.maindomain.com.
I have other domain names (call one of them www.somedomain.com) that I simply want permanently redirected to www.maindomain.com. What happens when a robot tries to access the robots.txt file for www.somedomain.com? Will the redirect take place even before this happens?
If this is the case, then I do not even need to worry about the robots_noindex.txt file, because the domain redirect will send the request to the right robots.txt anyway?
One other question: should the subdirectory redirect be an R=301 (permanent)?
Thanks again, Jim.
Yes.
> If this is the case, then I do not even need to worry about the robots_noindex.txt file, because the domain redirect will send the request to the right robots.txt anyway?
Yes, correct; the entire domain -- all resources (files) -- will be redirected, unless you code it to make exceptions.
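In other words, nothing special is needed for robots.txt (a minimal sketch with the poster's placeholder names):

# A request for http://www.somedomain.com/robots.txt matches (.*) like any
# other URL, so the spider receives a 301 pointing at
# http://www.maindomain.com/robots.txt and fetches that file instead
RewriteCond %{HTTP_HOST} !^www\.maindomain\.com [NC]
RewriteRule (.*) http://www.maindomain.com/$1 [R=301,L]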
> One other question: should the subdirectory redirect be an R=301 (permanent)?
No, not unless you want the client (user/spider) to see the subdirectory path and to update their links to include that subdirectory path. So generally, no, 99% of the time you simply want to "re-map" the subdomain URL to the subdirectory filespace.
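The difference boils down to one flag. A sketch, with the usual HTTP_HOST conditions omitted for brevity (you would use one rule or the other, not both):

# With R=301, the client is told the new URL and the address bar changes to
# show /subdirectory/ in the path
RewriteRule ^(.*)$ /subdirectory/$1 [R=301,L]
# Without the R flag, the server quietly serves the file from the subdirectory
# and the client keeps seeing the original URL
RewriteRule ^(.*)$ /subdirectory/$1 [L]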
Further explanation that may help -- If it doesn't help, ignore it:
You can view mod_rewrite as a URL-to-URL remapper when using [R] redirects, and as a URL-to-filename remapper when using internal rewrites. mod_rewrite "lives" at the boundary point where the server translates URLs to filepaths, and because of mod_rewrite these two methods of locating resources can use entirely different names for the same resource. You might suppose that everything after the ".com" in a URL is a "filepath", and that might appear to be true in most cases, but it is not so. It is a URL-path that, on many servers, just happens to be the same as the filesystem path below document_root. But it's not a filename; it's part of a URL, a completely different naming system. So I use the term "re-map" to mean changing how the URLs relate to the filepaths.

This distinction becomes clearer when considering a dynamic site like WebmasterWorld. If you look at the address bar, you see "forum92/1883.htm". That is a valid URL, but not a valid filename because, in fact, no such file exists. That URL invokes a script that looks up this "page" in a database and serves it. So not only are the URL-path and filepath different, the filepath does not even exist.
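A hypothetical sketch of how such a dynamic site could do it (the script name and parameters are invented; WebmasterWorld's actual rules are not shown here):

# Internally rewrite the "static-looking" URL-path to a lookup script;
# no file named forum92/1883.htm ever needs to exist
RewriteRule ^forum(\d+)/(\d+)\.htm$ /showthread.php?forum=$1&thread=$2 [L]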
Another thing to note is that a redirect (301 or 302) requires communication with the client. If you implement a redirect from "foo" to "bar", then a client will request "foo", and your server will respond with a message that says, "The URL 'foo' has been relocated, please ask me for 'bar' instead." So the client will issue a new -- and from your server's point of view, completely separate -- request for "bar", and your server will serve the contents of that resource. If you specify a 301-Moved Permanently redirect, then clients which are able (such as search engine spiders) will update their databases to use the new URL from then on. For some reason, browsers have never included a function to do this automatically or with a user prompt, but they could.
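On the wire, that conversation looks roughly like this (placeholder URLs):

Client:  GET /foo HTTP/1.1
         Host: www.example.com
Server:  HTTP/1.1 301 Moved Permanently
         Location: http://www.example.com/bar
Client:  GET /bar HTTP/1.1          (a brand-new, separate request)
         Host: www.example.com
Server:  HTTP/1.1 200 OK            (followed by the contents of "bar")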
Jim
Or what if I submit www.somedomain.com to search engines? Will they frown upon this? I am confused about how something like this will be listed in, say, Google.
I use test.mydomain.com on several sites to thoroughly test changes before I publish them. These subdomains are actually hosted in filespace at mydomain.com/test. But I did the rewrites correctly, so as far as anyone can tell from the outside via HTTP, test.mydomain.com is a fully-functional Web site completely separate from mydomain.com.
Because the rewrite of the URL subdomain.example.com to the filepath example.com/subdomain is completely transparent if implemented correctly, the search engines will treat it just like any other subdomain, say, search.yahoo.com for example.
Your only concern is to make sure that subdomain.example.com does not contain duplicate content from example.com or any other subdomain of example.com. If it does, you need to worry that example.com will be dropped from the listings in favor of subdomain.example.com if that is the only duplicate-content domain, or that you may get a penalty and be dropped entirely if you have dozens of duplicate subdomains.
Search engines frown upon attempts to mislead searchers by the use of duplicate sites, doorway pages, link farms full of pages of unrelated and irrelevant links that exist for no other reason than to increase "link popularity" or PageRank, cloaking with intent to mislead search engines, and other dishonest techniques. They cannot and do not dictate restrictions on the use of technology to make your site more organized or easier to use.
(Honestly, there is a lot of fear and unreasonable doubt engendered by Webmasters whose sites have been dropped for the reasons cited above, or because they made a serious mistake implementing redirects or published an invalid robots.txt that banned the search engine spiders unintentionally. These situations result in all kinds of myths being spread about the search engines' draconian policies. But the fact remains that the fault often lies with the Webmaster. The trick is to thoroughly plan, implement, and test your code -- whether it's mod_rewrite, robots.txt, or your latest shopping cart script.)
Jim