
Apache Web Server Forum

    
Rewrite rule troubles
phpmaven




msg:3710221
 1:44 pm on Jul 29, 2008 (gmt 0)

I have the following set of rules:

***********************
RewriteCond %{HTTP_user_agent} googlebot|Msnbot|Slurp [NC]
RewriteCond %{HTTP_user_agent} !AdsBot-Google [NC]
RewriteCond %{HTTP_HOST} !^www\.mydomain\.com [NC]
RewriteCond %{HTTP_HOST} !^mydomain\.com [NC]
RewriteCond %{HTTP_HOST} !^www\.mydm\.com [NC]
RewriteCond %{HTTP_HOST} !^mydm\.com [NC]
RewriteCond %{HTTP_HOST} .
RewriteRule ^.*$ [mydomain.com...] [L]
***********************

They work as I would expect except that they go into an infinite loop and I can't figure out why. Once the url gets rewritten the domain is www.mydomain.com and then the rules should be bypassed because the 3rd line would then be true and the rest shouldn't happen.

I'm sure I'm missing something obvious, but as usual I need your help, Jim. FYI, [mydomain.com...] is a non-existent page. I want to give the spiders a 404 if they request anything from a domain other than the 4 I'm testing for.

Thanks,

Mark

 

phpmaven




msg:3710387
 4:25 pm on Jul 29, 2008 (gmt 0)

OK, I figured it out, but I still don't understand why it wasn't working. I now have this:

RewriteCond %{HTTP_user_agent} googlebot|Msnbot|Slurp [NC]
RewriteCond %{HTTP_user_agent} !AdsBot-Google [NC]
RewriteCond %{HTTP_HOST} !^www\.mydomain\.com [NC]
RewriteCond %{HTTP_HOST} !^mydomain\.com [NC]
RewriteCond %{HTTP_HOST} !^www\.mydm\.com [NC]
RewriteCond %{HTTP_HOST} !^mydm\.com [NC]
RewriteRule ^.*$ bad_domain.html [L]

Obviously the "RewriteCond %{HTTP_HOST} . " line was not needed. I guess I don't get why taking off the domain name on the rewrite would fix the problem, but it now works.

Mark

jdMorgan




msg:3710881
 6:07 am on Jul 30, 2008 (gmt 0)

Obviously the "RewriteCond %{HTTP_HOST} . " line was not needed.

If your site is accessible via HTTP/1.0, and you get an HTTP/1.0 request, then your server will go into an infinite loop without that line.

Your code was looping because bad_domain.html matches your ".*" RewriteRule pattern, and so can be externally redirected or internally rewritten to itself. That, combined with a configuration setting of UseCanonicalName on with a ServerName defined as anything other than exactly "www.mydomain.com" would set up a loop. You can use a server headers checker to test your original code to see if this was likely the case. If so, ask your host to turn off UseCanonicalName (See Apache core for more info).

Aside from that, you could save some code space and CPU time with your most-recently-posted version:

RewriteCond %{HTTP_USER_AGENT} googlebot|Msnbot|Slurp [NC]
RewriteCond %{HTTP_USER_AGENT} !AdsBot-Google [NC]
RewriteCond %{HTTP_HOST} !^(www\.)?mydomain\.com [NC]
RewriteCond %{HTTP_HOST} !^(www\.)?mydm\.com [NC]
RewriteRule !bad_domain\.html$ bad_domain.html [L]

This eliminates two redundant RewriteConds by making the remaining patterns match your hostnames with or without the leading "www.", and prevents the potential looping problem.

Backing up to take an overview here, I'm wondering why you'd want to create massive duplicate content on your non-canonical domains. All requests for *any* URL in those domains will now return the same "bad_domain.html" page, which creates an infinite URL-space on each of those domains.

Why not just 301-redirect requests for any of your non-canonical hostnames to the correct page(s) on your main domain instead?

Jim
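
A minimal sketch of the 301 approach suggested above. It assumes the rules live in a per-directory (.htaccess) context, where the leading slash is stripped from the path before matching, and it uses www.example.com as a hypothetical canonical hostname; neither the hostname nor the target URL comes from this thread.

```apache
RewriteEngine On
# Skip requests that carry no Host header (e.g. plain HTTP/1.0 clients),
# so the negative match below cannot fire on a blank hostname.
RewriteCond %{HTTP_HOST} .
# Redirect only when the request is not already on the canonical hostname.
RewriteCond %{HTTP_HOST} !^www\.example\.com [NC]
# Hypothetical target: the same path on the canonical host.
RewriteRule ^(.*)$ http://www.example.com/$1 [R=301,L]
```

Because the redirected request arrives with the canonical hostname, the second condition fails on the next pass and the rule cannot loop. In server-config or virtual-host context the matched path keeps its leading slash, so the substitution would need adjusting there.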

phpmaven




msg:3711141
 1:34 pm on Jul 30, 2008 (gmt 0)

Thanks Jim for the clarification. You are a great help as always.

Perhaps I should explain why I'm doing this.

I'm using my mydm.com domain in my AdWords ads. I have the DNS set up on that domain so that I can create ads that have "keyword-keyword.mydm.com" as the display URL and landing page. I'm just playing around to see how that affects my click-through rate on some ads I'm testing. Obviously I don't want Google or Yahoo or MSN crawling those URLs.

The full rules that I have are as follows:

RewriteCond %{HTTP_USER_AGENT} googlebot|Msnbot|Slurp [NC]
RewriteCond %{HTTP_USER_AGENT} !AdsBot-Google [NC]
RewriteCond %{HTTP_HOST} !^(www\.)?mydomain\.com [NC]
RewriteCond %{HTTP_HOST} !^(www\.)?mydm\.com [NC]
RewriteRule !bad_domain\.html$ bad_domain.html [L]

RewriteCond %{query_string} !AW=
RewriteCond %{HTTP_HOST} .
RewriteCond %{HTTP_HOST} !^www\.mydomain\.com [NC]
RewriteRule ^(.*)$ [mydomain.com...] [R=301,L]

The ads all have "AW=" as part of the query string so that they will be allowed through. I also need to allow Google's AdsBot through. Once the user hits the landing page, any further requests get redirected.

I already have "Disallow: /*?" in my robots.txt file so that none of these urls get crawled, but if they come looking to crawl those *.mydm.com domains, I want to return a 404 so that they don't think they exist. The file "bad_domain.html" doesn't exist so it returns a 404.

If there is a better way to handle this, or if you think I'm going to create problems for myself, please let me know what you think. I highly value your opinion. Like I said, I'm just experimenting and could easily turn this all off.

Thanks,

Mark
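
One possible alternative to rewriting to a file that doesn't exist is to ask mod_rewrite for the 404 status directly. This is only a sketch, assuming an Apache version whose R flag accepts status codes outside the 300-399 range (the Apache 2.4 mod_rewrite documentation describes this behavior). Folding the two hostnames into one alternation is an illustrative choice here, not something posted in this thread.

```apache
RewriteCond %{HTTP_USER_AGENT} googlebot|Msnbot|Slurp [NC]
RewriteCond %{HTTP_USER_AGENT} !AdsBot-Google [NC]
# Ignore requests with a blank Host header.
RewriteCond %{HTTP_HOST} .
# Neither mydomain.com nor mydm.com, with or without the leading "www."
RewriteCond %{HTTP_HOST} !^(www\.)?(mydomain|mydm)\.com [NC]
# "-" means no substitution; R=404 stops rewriting and returns 404 Not Found.
RewriteRule ^ - [R=404,L]
```

Since no URL is substituted, there is nothing for the rule to match again, so the exclusion for bad_domain.html is not needed in this variant.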

phpmaven




msg:3711304
 4:14 pm on Jul 30, 2008 (gmt 0)

As a follow up, do I still need to worry about your comment below using your improved version of the rules?:

****************
Obviously the "RewriteCond %{HTTP_HOST} . " line was not needed.

If your site is accessible via HTTP/1.0, and you get an HTTP/1.0 request, then your server will go into an infinite loop without that line.
****************

Thanks,

Mark

jdMorgan




msg:3711378
 5:35 pm on Jul 30, 2008 (gmt 0)

I recommend using the "HTTP_HOST not blank" check whenever a negative match is used on the HTTP_HOST (or any other HTTP headers variable that is optional and can therefore be blank). This prevents the case of a blank HTTP Host header causing a match with NOT(example.com OR other-example.com), which a blank hostname would otherwise match, resulting in the rule being invoked.

Personally, I always 301-redirect my equivalents of your "AW=blah" URLs to the main domain immediately, since this original request will already be logged. I've never tried your 404 approach. As a result, all I can say is "Let us know how it works out." :)

Jim
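
For reference, the "HTTP_HOST not blank" check described above, applied to the spider rule from earlier in the thread, might look like the sketch below. The only line added to the posted rule is the single-dot condition; where it sits among the other conditions is an assumption, since they are all ANDed together.

```apache
RewriteCond %{HTTP_USER_AGENT} googlebot|Msnbot|Slurp [NC]
RewriteCond %{HTTP_USER_AGENT} !AdsBot-Google [NC]
# Require at least one character in the Host header. Without this, a blank
# Host satisfies both negative matches below and the rule is invoked.
RewriteCond %{HTTP_HOST} .
RewriteCond %{HTTP_HOST} !^(www\.)?mydomain\.com [NC]
RewriteCond %{HTTP_HOST} !^(www\.)?mydm\.com [NC]
RewriteRule !bad_domain\.html$ bad_domain.html [L]
```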
