ModRewrite Fails Under HTTP1.0

sends 200 ok no matter what

phish

6:05 am on Jan 5, 2007 (gmt 0)

I'm having a small problem with modrewrite in .htaccess.

Here is what I used below..

RewriteCond %{HTTP_HOST} .
RewriteCond %{HTTP_HOST} !^www\.example\.com
RewriteRule (.*) http://www.example.com/$1 [R=301,L]

The first line should be checking for blank host because http 0.9, or 1.0 send host headers. The rewrite works properly for HTTP/1.1, and 0.9 I don't really care about, but some still use HTTP/1.0, and I need to find a fix. I have also tried the standard rewrite without the negative pattern; it does the same thing.

phish

jdMorgan

6:29 am on Jan 5, 2007 (gmt 0)

> The first line should be checking for blank host because http 0.9, or 1.0 send host headers.

I assume that's a typo, since HTTP/0.9 and HTTP/1.0 *do not* send host headers, and therefore cannot be used to access name-based virtual servers. That means that for a true HTTP/0.9 or HTTP/1.0 client, your server must be on a unique (non-shared) IP address to be accessible, and the distinction between hostnames does not exist, except at the DNS level. For that reason, the RewriteRule is disabled if the hostname is blank, and no redirection can (or should) take place.

I should note that I used the "%{HTTP_HOST} ." in code that I posted here years ago in preference to using the protocol field at the end of the client request header (available in %{THE_REQUEST}) simply because Googlebot and others "advertise" HTTP/1.0 in that field, but in actuality, they support "extended" HTTP/1.0, which *does* support sending a Host: header in the request. As a result, they *can* access name-based hosts on shared IP addresses, and can handle a hostname-based domain redirect, so we want to give it to them.

Just trying to make sure I understand the question here...

Jim

phish

7:06 am on Jan 5, 2007 (gmt 0)

Hi jd...

Yes, it was a typo, sorry.

And yes, supposedly the site is set up on a static IP (by itself), which is why I'm also confused.

And yes again, Google is the reason for this.

This is a brand new site, so I want to make sure I have my ducks in a row before releasing it. This is actually for a client who wanted to use their own hosting over ours, which is why I'm having this issue. On our own dedicated server we do all redirects in httpd.conf, whereas I'm trying to do this through .htaccess, and I know some of the rewrite rules differ between the two. I was making sure I didn't have a typo, miss escaping a character, or get some regex wrong.

jdMorgan

7:31 am on Jan 5, 2007 (gmt 0)

Well, the rule can't 'fail' with HTTP/1.0 because it can't do anything with an HTTP/1.0 request. Again, a redirect from example.com to www.example.com is meaningless in HTTP/1.0 if the two domains resolve to the same IP address. HTTP/1.0 doesn't support hostnames/domain names at all; they're only used by the client (browser) to look up the server IP address in DNS. After that, the HTTP/1.0 request is sent to your IP address, not to your domain name.

If you have the DNS set up to point those two domains to different servers, then both RewriteCond lines can be omitted, and of course all content should be removed and put on the other server.

Another way to put this is that in the HTTP/1.0 world domain names mean nothing to a server; all it knows/uses/cares about is its own IP address. And the domain name is translated at the DNS level to an IP address before any HTTP/1.0 request is sent by the client, and is then essentially discarded for the remaining duration of the transaction.

So in HTTP/1.0, the rule should not redirect based on domain names, and if it did, it would have to be to a different server at a different IP address. If it were allowed to redirect to itself, it would continue doing so, in an 'infinite' loop, which is why we include the RewriteConds to prevent this should a 'real' HTTP/1.0 request actually arrive at your server.
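To illustrate the loop described above, here is a rough Python sketch (not Apache code). It assumes a client that simply re-requests whatever the Location: header says, and the function names are made up for the illustration. The unguarded rule keeps redirecting, while the guarded one settles after at most one hop.

```python
def unconditional_rule(host, path):
    """Rule with no RewriteCond: always redirect to the canonical URL."""
    return 301, "http://www.example.com" + path


def guarded_rule(host, path):
    """Rule guarded as in this thread: redirect only when a Host value
    is present and is not the canonical hostname."""
    if host and not host.startswith("www.example.com"):
        return 301, "http://www.example.com" + path
    return 200, None


def follow(rule, host, path="/", max_hops=5):
    """Follow redirects like a client would; return (final_status, hops_taken)."""
    hops = 0
    while hops < max_hops:
        status, location = rule(host, path)
        if status != 301:
            return status, hops
        # A true HTTP/1.0 client sends no Host header, so host stays None.
        # A Host-sending client would send the hostname from the Location URL.
        if host is not None:
            host = location.split("/")[2]
        hops += 1
    return 301, hops  # still redirecting after max_hops: treat as a loop


print(follow(unconditional_rule, None))    # never stops redirecting
print(follow(guarded_rule, None))          # served directly, no redirect
print(follow(guarded_rule, "example.com")) # one redirect, then served
```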

HTTP/1.1 was released primarily so that name-based virtual servers could be used, because the rate of IP address consumption was projected to exceed the highest possible number of IP addresses very soon, demanding that servers be enabled to share IP addresses. So the only way to do that was to bolt a "Host:" header into the HTTP request, so that a server could tell which of many sites at the same IP address to send the request to. Thus, "name-based virtual hosting" entered our lexicon, and we haven't run out of IPv4 addresses quite yet. But that "Host:" header is not used to route HTTP requests over TCP/IP; that still works in essentially the same manner as it did in the HTTP/1.0 days.
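As a rough picture of what a server on a shared IP address does with that header, here is a minimal Python sketch. The site table, the paths, and the choice of default site are all assumptions made for the illustration, not taken from any real configuration.

```python
# Hypothetical site table for one shared IP address; names and paths are illustrative.
VHOSTS = {
    "www.example.com": "/var/www/example",
    "www.example.org": "/var/www/other",
}
DEFAULT_SITE = "/var/www/example"  # assumption: served when no Host value matches


def select_site(host_header):
    """Pick a document root from the Host: header value.

    host_header is None (or empty) for a true HTTP/1.0 request, in which
    case the server can only fall back to its default site.
    """
    if not host_header:
        return DEFAULT_SITE
    # Drop an optional :port suffix and compare case-insensitively.
    hostname = host_header.split(":")[0].lower()
    return VHOSTS.get(hostname, DEFAULT_SITE)


print(select_site("www.example.org"))  # picked by name
print(select_site(None))               # no Host header: default site
```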

Does this help?

Jim

jdMorgan

8:29 am on Jan 5, 2007 (gmt 0)

Or maybe another approach:

Googlebot lies. It is really capable of both HTTP/1.0 and HTTP/1.1.

It sends requests claiming that it's using HTTP/1.0 because older servers can't work properly with HTTP/1.1

But it really sends "Host:" headers in its purported HTTP/1.0 requests, making them "extended HTTP/1.0" requests.

True HTTP/1.0 servers will ignore that header.

HTTP/1.1 servers don't care if a request is HTTP/1.0 as long as it contains the Host: header, if that header is required for the server to work properly (it usually is on any shared server, so the host can limit your domain names).

So I think maybe your test plan is flawed.

Here is the correct behaviour of your server, and I believe you'll find the code complies with this:

If the request contains a "Host:" header, and it *does not* match the canonical domain, 301-redirect to that canonical domain.

If the request contains a "Host:" header, and it *does* match the canonical domain, serve the requested content (presumably with a 200-OK response if the URL resolves to an existing resource).

If the request contains no "Host:" header, serve the requested content (presumably with a 200-OK response as long as the URL resolves to an existing resource).

If the request contains a blank "Host:" header, serve the requested content (presumably with a 200-OK response as long as the URL resolves to an existing resource).
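The four cases above can be sketched in Python as a sanity check on the logic. This is only a model of the two RewriteCond lines, not mod_rewrite itself; the function name is hypothetical, and it assumes a missing header and a blank header both leave HTTP_HOST empty.

```python
import re

# Same pattern as the second RewriteCond (which negates it with "!").
CANONICAL = re.compile(r"^www\.example\.com")


def respond(host_header, path="/"):
    """Return (status, location) for a request.

    host_header is None when the request carries no Host: header at all.
    """
    host = host_header or ""  # assumption: missing and blank both give an empty value
    has_host = re.search(r".", host) is not None     # RewriteCond %{HTTP_HOST} .
    not_canonical = CANONICAL.search(host) is None   # RewriteCond %{HTTP_HOST} !^www\.example\.com
    if has_host and not_canonical:
        return 301, "http://www.example.com" + path  # RewriteRule ... [R=301,L]
    return 200, None  # rule skipped; content is served (assuming the URL exists)


for header in ("example.com", "www.example.com", None, ""):
    print(repr(header), respond(header))
```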

You can verify this operation manually using Telnet and connecting to port 80, if you find this necessary to bypass any inconsistencies in whatever tool you're using. The HyperTerminal ("hyperterm") program bundled with Windows can be used for Telnet. (I'm not actually sure if they still ship it with newer Windows releases like XP, so search for it on your machine and see.) It requires you to literally type in the entire HTTP request, the process is entirely case-sensitive, and there is no support for editing what you type, so it can be a real pain. If you're a poor typist like me, you can write text "scripts" as shown below into plain-text files and tell Telnet to send them.

But just for example, you would type:

GET / HTTP/1.0
User-agent: Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)

Note the blank line after the User-agent line: hit Enter twice here, once to end the User-agent line and once to send the empty line that ends the request. Nothing happens until you do.
For this request, there is no Host header, so the code should do nothing; the server should just serve the home page. You will see the HTML code of the page scroll by, assuming it worked.

Now here's what to type to emulate what Googlebot might actually send:

GET / HTTP/1.0
User-agent: Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
Host: example.com

Your code's response should be a 301 response, redirecting Googlebot to www.example.com
You will see the response with the correct URL given in the Location: header.

Next, here's what Googlebot would probably send in response to that redirect:

GET / HTTP/1.1
User-agent: Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
Host: www.example.com

Note that it switched to HTTP/1.1 because it got a redirect and it already knows (based on prior DNS lookups) that the IP addresses for your two domains are the same. Since the domain is correct, you should again see the HTML code of the requested page scroll by.

Now just to verify that I spoke the truth above, try this:

GET / HTTP/1.0
User-agent: Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
Host: www.example.com

Despite the fact that this request states that it's an HTTP/1.0 request, it contains the Host header, so your server should examine it, find that it's the canonical domain, and serve the requested page with a 200-OK response.
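If typing these requests into Telnet by hand is too painful, the same raw requests can be sent from a short script. The following Python sketch is one way to do it; the helper names are invented here, and it assumes a plain HTTP server listening on port 80. It builds exactly the bytes shown in the examples above and reports only the status line of the response.

```python
import socket

GOOGLEBOT_UA = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"


def build_request(path="/", version="1.0", host=None, user_agent=None):
    """Build the raw bytes of a GET request.

    host=None omits the Host: header entirely (a true HTTP/1.0 request).
    """
    lines = [f"GET {path} HTTP/{version}"]
    if user_agent:
        lines.append(f"User-agent: {user_agent}")
    if host is not None:
        lines.append(f"Host: {host}")
    # Each line ends with CRLF; an empty line terminates the request headers.
    return ("\r\n".join(lines) + "\r\n\r\n").encode("ascii")


def send_request(server, request, port=80, timeout=10):
    """Send raw request bytes to server:port and return the response status line."""
    with socket.create_connection((server, port), timeout=timeout) as sock:
        sock.sendall(request)
        data = b""
        while b"\r\n" not in data:
            chunk = sock.recv(4096)
            if not chunk:
                break
            data += chunk
    return data.split(b"\r\n", 1)[0].decode("latin-1")


# Show the request that emulates Googlebot asking for the non-www host.
print(build_request(host="example.com", user_agent=GOOGLEBOT_UA).decode("ascii"))
```

To run a test against a real server, pass its address or hostname as the first argument, for example `send_request("www.example.com", build_request(host="example.com"))`, where the server name is a placeholder for your own. Passing the IP address there while varying only the `host` argument keeps DNS out of the picture.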

Jim

phish

12:36 am on Jan 6, 2007 (gmt 0)

Hi Jim,
Well the results are in..

#1 GET / HTTP/1.0
User-agent: Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)

Sends a 301 redirect to www.example.com

-------------------------------------------------------
#2 GET / HTTP/1.0
User-agent: Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
Host: example.com

Sends a 301 redirect to www.example.com

---------------------------------------------------------
#3 GET / HTTP/1.1
User-agent: Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
Host: www.example.com

Sends a 200 displays page..

-----------------------------------------------------
#4 GET / HTTP/1.0
User-agent: Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
Host: www.example.com

Sends a 200 displays page..

----------------------------------------

If I post an exemplified copy of my virtual host container, can you take a look at it? Obviously I've been doing something wrong, which is probably why my rankings tanked. The way they set this up is confusing me....

phish