Mod rewrite URLs from all TLD's to .com -- do you know what's wrong?

         

nutkenz

10:54 am on Oct 22, 2008 (gmt 0)

10+ Year Member



This is a rule in my .htaccess file for a Drupal website:

RewriteCond %{HTTP_HOST} ^ex\-ample\.com$ [NC]
RewriteRule ^(.*)$ http://www.ex-ample.com/$1 [L,R=301]

RewriteCond %{HTTP_HOST} ^(www\.)?ex(\-)?ample\.(nl¦be¦eu)(.*)$ [NC]
RewriteRule ^(.*)$ http://www.ex-ample.com$4 [L,R=302]

The intention is to 302-redirect all hits on the country-specific TLDs to their counterparts on the .com TLD while retaining the path after the FQDN. For example:

http://ex-ample.nl/ => http://www.ex-ample.com/
http://www.ex-ample.be/ => http://www.ex-ample.com/
http://ex-ample.be/vastgoed-development/lotus-breeze => http://www.ex-ample.com/vastgoed-development/lotus-breeze

However, the path (for instance /vastgoed-development/lotus-breeze) always seems to be lost during the redirect. Does anyone know why?

[edited by: jdMorgan at 6:46 pm (utc) on Oct. 22, 2008]
[edit reason] example.com [/edit]

jdMorgan

12:47 pm on Oct 22, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



The main problem is that back-references to matches in RewriteConds are named %1-%9, and back-references to matches in RewriteRules are named $1-$9. So the back-reference name ($4) was incorrect if you wanted to preserve the requested local URL-path. This code also fails to strip trailing periods and port numbers from the FQDN/hostname. I'd suggest:

RewriteCond %{HTTP_HOST} ^ex-ample\.com [NC,OR]
RewriteCond %{HTTP_HOST} ^www\.ex-ample\.com(\.¦\.?:[0-9]+)$ [NC,OR]
RewriteCond %{HTTP_HOST} ^(www\.)?ex-?ample\.(nl¦be¦eu) [NC]
RewriteRule (.*) http://www.ex-ample.com/$1 [R=301,L]

This one rule replaces both of your previous rules, and corrects several other problems and inefficiencies as well. All changes are intentional.
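To illustrate the back-reference naming, here is a minimal sketch that uses both kinds in one rule. It is not part of the suggested fix: the "from" query parameter is purely hypothetical and only there to make %2 visible, and the alternation is written with solid pipes.

```apache
# Trace of the original second rule for /vastgoed-development/lotus-breeze
# requested on ex-ample.be:
#   %{HTTP_HOST} is "ex-ample.be"; the path is not part of it, so the
#   trailing (.*) in the RewriteCond captures an empty string.
#   $4 looks for a 4th group in the RewriteRule pattern, which has only
#   one group, so $4 expands to nothing and the path is dropped.
#
# Sketch using both back-reference types (hypothetical "from" parameter):
#   $1 = 1st group of the RewriteRule pattern (the requested path)
#   %2 = 2nd group of the matched RewriteCond pattern (nl, be or eu)
# QSA keeps any query string that was on the original request.
RewriteCond %{HTTP_HOST} ^(www\.)?ex-?ample\.(nl|be|eu) [NC]
RewriteRule (.*) http://www.ex-ample.com/$1?from=%2 [R=301,L,QSA]
```

With this sketch, a request for /vastgoed-development/lotus-breeze on ex-ample.be would be sent to http://www.ex-ample.com/vastgoed-development/lotus-breeze?from=be.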

Note that a 302 is no longer used, so as to avoid damaging the search engine ranking of www.ex-ample.com; using 302s would cause the search engine to "302-hijack" the URLs from www.ex-ample.com, and that is something you do not want to happen.

If it is the case that no other domains or subdomains exist in this .htaccess file's filespace on this server, then this simpler code can be used:


RewriteCond %{HTTP_HOST} !^(www\.ex-ample\.com)?$
RewriteRule (.*) http://www.ex-ample.com/$1 [R=301,L]

This simpler version redirects any request where the hostname is not *exactly* www.ex-ample.com (or blank) to www.ex-ample.com. The allowance for blank is intended to prevent an infinite redirection loop should an HTTP/1.0 request arrive at the server. HTTP/1.0 requests will not include a Host header, so the value will be blank.
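As a rough walk-through, the same two lines are repeated here with comments showing how a few sample Host values would be treated. The sample values are illustrative assumptions, not taken from a server log.

```apache
# Host value sent by client      Result
# www.ex-ample.com               pattern matches, "!" negates it: no redirect
# (no Host header, HTTP/1.0)     empty string matches: no redirect
# ex-ample.com                   301 to www.ex-ample.com
# www.ex-ample.be                301 to www.ex-ample.com
# www.ex-ample.com:80            301 (the port is part of the Host value)
# www.ex-ample.com.              301 (trailing period)
# WWW.EX-AMPLE.COM               301 (no [NC] flag, so the match is case-sensitive)
RewriteCond %{HTTP_HOST} !^(www\.ex-ample\.com)?$
RewriteRule (.*) http://www.ex-ample.com/$1 [R=301,L]
```

In each redirected case the requested path is carried over through $1.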

Replace the broken pipe "¦" characters in the code above with solid pipes before use; posting on this forum modifies the pipe characters.

Jim

[edited by: jdMorgan at 12:48 pm (utc) on Oct. 22, 2008]

nutkenz

6:22 pm on Oct 22, 2008 (gmt 0)

10+ Year Member



Hi Jim

Thank you very much, this seems to work, although I don't really understand which "problems and inefficiencies" you're talking about. Could you please clarify how these new rules work and why they're more efficient than the old rules? Perhaps walk me through the steps of what happens when someone loads [ex-ample.be...] using these rules.

Also, why wouldn't I want the .com URLs to be hijacked by my ccTLD URLs? That's exactly what I do want to happen, as described in an article I read: "Question: I have an existing .com site and I just bought a ccTLD.
Answer 1: Use a 302 redirect from the ccTLD to the .com. This tells the search engine that your ccTLD is the "real" domain and that it's being temporarily redirected to the .com. The search engine will index the .com, but keep the ccTLD as the "original" domain. In short, the .com won't be considered."

Thanks again

jdMorgan

7:24 pm on Oct 22, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member




RewriteCond %{HTTP_HOST} ^ex\-ample\.com$ [NC]
RewriteRule ^(.*)$ http://www.ex-ample.com/$1 [L,R=301]
#
RewriteCond %{HTTP_HOST} ^(www\.)?ex(\-)?ample\.(nl¦be¦eu)(.*)$ [NC]
RewriteRule ^(.*)$ http://www.ex-ample.com$4 [L,R=302]

Inefficiencies:
  • Two rules instead of one.
  • Unnecessary escaping of "-".
  • Unnecessary use of parentheses in "(\-)?". Considering the previous point, "-?" is sufficient.
  • Unnecessary anchors on (.*) pattern standing alone in both RewriteRules. ".*" is "greedy" and by default, will match the entire local URL-path.
  • Unnecessary use of back-reference (parentheses) and ".*" pattern following hostname in 2nd rule. Simply omit the end-anchor (see also "problems" below).

Problems:
  • End anchor on "ex-ample\.com" prevents correction/redirection of "ex-ample.com." (FQDN), "ex-ample.com:80" (port number), and "ex-ample.com.:80" (FQDN + port number), resulting in duplicate content at four URLs (including the canonical ex-ample.com itself).
  • Trailing "(.*)" pattern on RewriteCond of second rule captures only trailing period (FQDN) and/or port number; the local URL-path does not appear in the %{HTTP_HOST} variable. Trailing "(.*)$" can be omitted with no change to function once the back-reference in the RewriteRule is corrected (next point).
  • Incorrect back-reference in second RewriteRule. $4 would refer to the 4th parenthesized sub-pattern in the RewriteRule itself. Use %1-%9 to back-reference parenthesized sub-patterns in RewriteCond, $1-$9 to back-reference parenthesized sub-patterns in RewriteRule.
  • Use of 302 redirect instead of 301 means search engines will keep original non-canonical-domain URLs, and will not index the canonical-domain URL. This will lead to duplicate content in the search results.

If only a few duplicates exist, the search engines' de-duplication filters will pick one (they choose, not your choice) and discard the rest (these may appear in Google's supplemental index). If many duplicate domains exist, your site/sites may invoke a penalty, especially if reported by a competitor.

Best practice is to choose a single domain for a given Web site, and 301-redirect any and all other domains purchased for "marketing" or "brand-protection" reasons to that single canonical domain. The "302 ploy" trying to get multiple domains listed with the same content won't fool the search engines for a minute, and there is no upside to using it. If language- or location-based content-negotiation is used (in other words, if these domains are not duplicates because the language varies), then this should be a server-internal "transparent" function; a redirect should not be involved.

For more information, search WebmasterWorld for "duplicate content" -- One particularly good title is "Duplicate Content - Get it right or perish." Another good subject to research would be "302 Hijacking."

A recent thread [webmasterworld.com] in the Google Search News forum enumerates many other non-canonical URL variations that you should also address. That forum is also more appropriate for the SEO aspects of your question. See the "Google Hot Topics" thread pinned at the top of that forum's thread list for more useful information.

References and tutorials for mod_rewrite and regular expressions are available in the Apache Forum Charter (link at top left of this page).

Jim

nutkenz

8:06 pm on Oct 22, 2008 (gmt 0)

10+ Year Member



OK, thanks. On a side note, I'm unable to access my website on all my computers/browsers now, even though it works for everyone else. I've tried:

- Clearing browser cache
- Running ipconfig /flushdns
- Alternating computers and browsers
- Using the other domains
- Adding ?randomstring to the URL

Nothing seems to be helping; I always get an "Internet Explorer cannot display the webpage" or similar message. Any idea what could be causing this?

nutkenz

8:19 pm on Oct 22, 2008 (gmt 0)

10+ Year Member



Ok, I found a solution: delete the .htaccess file on the server and re-upload it. Must have something to do with the timestamp on the modification date (which was in the future for some reason)?

g1smd

9:56 am on Oct 23, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Whenever you make changes like this, use Live HTTP Headers for Mozilla Firefox to check the HTTP responses for a variety of different URL requests (covering both expected and unexpected values). Don't just rely on looking at a browser screen and seeing your page appear.