

Rewrite rule troubles

   
1:44 pm on Jul 29, 2008 (gmt 0)

10+ Year Member



I have the following set of rules:

***********************
RewriteCond %{HTTP_USER_AGENT} googlebot|Msnbot|Slurp [NC]
RewriteCond %{HTTP_USER_AGENT} !AdsBot-Google [NC]
RewriteCond %{HTTP_HOST} !^www\.mydomain\.com [NC]
RewriteCond %{HTTP_HOST} !^mydomain\.com [NC]
RewriteCond %{HTTP_HOST} !^www\.mydm\.com [NC]
RewriteCond %{HTTP_HOST} !^mydm\.com [NC]
RewriteCond %{HTTP_HOST} .
RewriteRule ^.*$ [mydomain.com...] [L]
***********************

They work as I would expect, except that they go into an infinite loop and I can't figure out why. Once the URL gets rewritten, the host is www.mydomain.com, so the rules should be bypassed: the third condition would then fail to match, and the rest shouldn't happen.

I'm sure I'm missing something obvious, but as usual I need your help, Jim. FYI, [mydomain.com...] is a non-existent page. I want to give the spiders a 404 if they request anything from a domain other than the four I'm testing for.

Thanks,

Mark

4:25 pm on Jul 29, 2008 (gmt 0)

10+ Year Member



OK, I figured it out, but I still don't understand why it wasn't working. I now have this:

RewriteCond %{HTTP_USER_AGENT} googlebot|Msnbot|Slurp [NC]
RewriteCond %{HTTP_USER_AGENT} !AdsBot-Google [NC]
RewriteCond %{HTTP_HOST} !^www\.mydomain\.com [NC]
RewriteCond %{HTTP_HOST} !^mydomain\.com [NC]
RewriteCond %{HTTP_HOST} !^www\.mydm\.com [NC]
RewriteCond %{HTTP_HOST} !^mydm\.com [NC]
RewriteRule ^.*$ bad_domain.html [L]

Obviously the "RewriteCond %{HTTP_HOST} . " line was not needed. I guess I don't get why taking off the domain name on the rewrite would fix the problem, but it now works.

Mark

6:07 am on Jul 30, 2008 (gmt 0)

WebmasterWorld Senior Member jdmorgan is a WebmasterWorld Top Contributor of All Time 10+ Year Member



Obviously the "RewriteCond %{HTTP_HOST} . " line was not needed.

If your site is accessible via HTTP/1.0, and you get an HTTP/1.0 request, then your server will go into an infinite loop without that line.

Your code was looping because bad_domain.html matches your ".*" RewriteRule pattern, and so it could be externally redirected or internally rewritten to itself. That, combined with a configuration setting of UseCanonicalName On and a ServerName defined as anything other than exactly "www.mydomain.com", would set up a loop. You can use a server-headers checker to test your original code and see whether this was likely the case. If so, ask your host to turn off UseCanonicalName (see the Apache core documentation for more info).
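For reference, a hypothetical server-config sketch of the setting involved; UseCanonicalName and ServerName are real Apache core directives, but the values here are assumptions:

# With UseCanonicalName On, Apache builds self-referential URLs from
# ServerName rather than from the request's Host header, so a ServerName
# other than exactly www.mydomain.com can re-enter the ruleset on a
# "foreign" hostname. Turning it off makes Apache honor the client's Host:
UseCanonicalName Off
ServerName www.mydomain.com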

Aside from that, you could save some code space and CPU time with your most-recently-posted version:


RewriteCond %{HTTP_USER_AGENT} googlebot|Msnbot|Slurp [NC]
RewriteCond %{HTTP_USER_AGENT} !AdsBot-Google [NC]
RewriteCond %{HTTP_HOST} !^(www\.)?mydomain\.com [NC]
RewriteCond %{HTTP_HOST} !^(www\.)?mydm\.com [NC]
RewriteRule !bad_domain\.html$ bad_domain.html [L]

This eliminates two redundant RewriteConds by making the remaining patterns match your hostnames with or without the leading "www.", and prevents the potential looping problem.

Backing up to take an overview here, I'm wondering why you'd want to create massive duplicate content on your non-canonical domains: every request for *any* URL on those domains will now return the same "bad_domain.html" page, which creates an infinite URL-space of duplicates on each of them.

Why not just 301-redirect requests for any of your non-canonical hostnames to the correct page(s) on your main domain instead?
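A minimal sketch of that alternative, assuming www.mydomain.com (the placeholder used throughout this thread) is the canonical host:

# Anything requested on a non-blank, non-canonical hostname is
# 301-redirected to the same path on the main domain.
RewriteCond %{HTTP_HOST} .
RewriteCond %{HTTP_HOST} !^www\.mydomain\.com [NC]
RewriteRule ^(.*)$ http://www.mydomain.com/$1 [R=301,L]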

Jim

1:34 pm on Jul 30, 2008 (gmt 0)

10+ Year Member



Thanks Jim for the clarification. You are a great help as always.

Perhaps I should explain why I'm doing this.

I'm using my mydm.com domain in my AdWords ads. I have the DNS set up on that domain so that I can create ads that use "keyword-keyword.mydm.com" as the display URL and landing page. I'm just playing around to see how that affects the click-through rate on some ads I'm testing. Obviously I don't want Google, Yahoo, or MSN crawling those URLs.
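For context, the usual way to make arbitrary keyword-keyword.mydm.com hostnames resolve is a wildcard DNS record, with Apache picking the requests up via a matching ServerAlias *.mydm.com (or as the default vhost). A hypothetical zone-file sketch; the IP is a documentation placeholder:

; Wildcard record: any subdomain of mydm.com resolves to the web server.
*.mydm.com.    IN    A    192.0.2.1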

The full rules that I have are as follows:

RewriteCond %{HTTP_USER_AGENT} googlebot|Msnbot|Slurp [NC]
RewriteCond %{HTTP_USER_AGENT} !AdsBot-Google [NC]
RewriteCond %{HTTP_HOST} !^(www\.)?mydomain\.com [NC]
RewriteCond %{HTTP_HOST} !^(www\.)?mydm\.com [NC]
RewriteRule !bad_domain\.html$ bad_domain.html [L]

RewriteCond %{QUERY_STRING} !AW=
RewriteCond %{HTTP_HOST} .
RewriteCond %{HTTP_HOST} !^www\.mydomain\.com [NC]
RewriteRule ^(.*)$ [mydomain.com...] [R=301,L]

The ads all have "AW=" as part of the query string so that they will be allowed through, and I also need to let Google's AdsBot through. Once the user hits the landing page, any further requests get redirected.
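A hypothetical walk-through of a visitor under the two rule blocks above; "blue-widgets.mydm.com" and "AW=test1" are made-up values:

# 1) Ad click: GET http://blue-widgets.mydm.com/landing.html?AW=test1
#    - First block: the user-agent is a normal browser, so the
#      googlebot|Msnbot|Slurp condition fails and the block is skipped.
#    - Second block: QUERY_STRING contains "AW=", so the !AW= condition
#      fails. The landing page is served as-is.
#
# 2) Next click: GET http://blue-widgets.mydm.com/page2.html
#    - First block: skipped again (not a spider user-agent).
#    - Second block: no "AW=" in the query string, the host is non-blank
#      and is not www.mydomain.com, so the visitor is 301-redirected to
#      the main domain.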

I already have "Disallow: /*?" in my robots.txt file so that none of these URLs get crawled, but if the spiders come looking to crawl those *.mydm.com domains, I want to return a 404 so that they don't think the pages exist. The file "bad_domain.html" doesn't exist, so requests for it return a 404.
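For reference, the robots.txt entry described above looks like this; the "*" and "?" wildcard syntax is an extension honored by Google, Yahoo, and MSN rather than part of the original robots.txt standard:

User-agent: *
Disallow: /*?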

If there is a better way to handle this, or if you think I'm going to create problems for myself, please let me know what you think. I highly value your opinion. Like I said, I'm just experimenting and could easily turn this all off.

Thanks,

Mark

4:14 pm on Jul 30, 2008 (gmt 0)

10+ Year Member



As a follow-up: using your improved version of the rules, do I still need to worry about your comment below?

****************
Obviously the "RewriteCond %{HTTP_HOST} . " line was not needed.

If your site is accessible via HTTP/1.0, and you get an HTTP/1.0 request, then your server will go into an infinite loop without that line.
****************

Thanks,

Mark

5:35 pm on Jul 30, 2008 (gmt 0)

WebmasterWorld Senior Member jdmorgan is a WebmasterWorld Top Contributor of All Time 10+ Year Member



I recommend using the "HTTP_HOST not blank" check whenever a negative match is used on HTTP_HOST (or on any other HTTP header variable that is optional and can therefore be blank). Without that check, a blank Host header satisfies NOT(example.com OR other-example.com), since a blank hostname matches neither pattern, and the rule is invoked when you didn't intend it to be.
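Applied to the improved ruleset from earlier in the thread, the guard would look like this (a sketch, not a drop-in ruleset):

RewriteCond %{HTTP_USER_AGENT} googlebot|Msnbot|Slurp [NC]
RewriteCond %{HTTP_USER_AGENT} !AdsBot-Google [NC]
# Guard: fail fast on a blank Host header (e.g. an HTTP/1.0 request,
# which may omit Host entirely), so the negative matches below can
# never succeed against an empty hostname.
RewriteCond %{HTTP_HOST} .
RewriteCond %{HTTP_HOST} !^(www\.)?mydomain\.com [NC]
RewriteCond %{HTTP_HOST} !^(www\.)?mydm\.com [NC]
RewriteRule !bad_domain\.html$ bad_domain.html [L]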

Personally, I always 301-redirect my equivalents of your "AW=blah" URLs to the main domain immediately, since this original request will already be logged. I've never tried your 404 approach. As a result, all I can say is "Let us know how it works out." :)
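A hypothetical sketch of that approach, reusing Mark's "AW" parameter name from above; the trailing "?" in the substitution strips the query string from the redirect target:

# Log-and-forget: the tagged request is already in the access log,
# so bounce it straight to the clean canonical URL.
RewriteCond %{QUERY_STRING} (^|&)AW=
RewriteRule ^(.*)$ http://www.mydomain.com/$1? [R=301,L]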

Jim

 
