Help With Catch-All Redirect + Exclusions

         

kernmedia

5:17 pm on Feb 20, 2015 (gmt 0)

10+ Year Member



Hey everyone, new poster here! I have a client that is migrating a bunch of sites to a single domain (consolidating brands into one single brand) and I'm helping with redirects.

We want to implement a "catch-all redirect" to ensure that every URL on the site is redirected, not just the list of URLs we know about, since Google sometimes knows about other URLs, legit or bogus.

But we also want to exclude the /robots.txt and /sitemap.xml files from this "catch-all redirect" so that Google (and other search engines) can fetch /robots.txt, follow its link to /sitemap.xml, and then crawl the URLs listed in /sitemap.xml to pick up the redirects. We're having difficulty with this.

Here is the Catch-All Redirect the Client is Using:

RewriteCond %{REQUEST_URI} .*\/
RewriteRule ^.*$ http://www.example.com/specific-url-here.htm [L,R=301]


Here are the URLs we want to exclude from the Catch-all Redirect:
  • http://www.example.com/robots.txt
  • http://www.example.com/sitemap.xml


Here is the exclusion rule we tried, which didn't work:
#Sitemap/Robots: exclude anything that includes /sitemap.xml, /robots.txt
RewriteCond %{REQUEST_URI} !^/(sitemap.xml|robots.txt) [NC]


Note: We also tried removing ".xml" and ".txt" from the exclusion rule code, but that didn't work either.

Here is a sample of the .htaccess file to show how it's set up:
NOTE: I swapped out the client's domains and URL slugs/strings with "example" and "specific-url-here" text for privacy reasons.

# Redirect rule to get sitemap.htm to map to example.com sitemap.htm
RedirectMatch 301 /sitemap.htm.* http://www.example.com/sitemap.htm

# Enable the rewrite engine
RewriteEngine On

#Sitemap/Robots: exclude anything that includes /sitemap.xml, /robots.txt
RewriteCond %{REQUEST_URI} !^/(sitemap.xml|robots.txt) [NC]

# Redirect index.html to domain root - check to see if extension might be .htm
RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /(([^/]+/)*)index\.htm\ HTTP
RewriteRule index\.html$ http://www.example.us/%1 [R=301,L]

RewriteCond %{REQUEST_FILENAME}.htm -f
RewriteRule !.*\.html$ %{REQUEST_FILENAME}.htm [L]


RewriteCond %{REQUEST_URI} .*\/specific-url-here-1.htm
RewriteRule ^.*$ http://www.example.com/specific-url-here-1-new-site.htm [L,R=301]
RewriteCond %{REQUEST_URI} .*\/specific-url-here-2.htm
RewriteRule ^.*$ http://www.example.com/specific-url-here-2-new-site.htm [L,R=301]
RewriteCond %{REQUEST_URI} .*\/
RewriteRule ^.*$ http://www.example.com/specific-url-here-3-new-site.htm [L,R=301]

RewriteCond %{REQUEST_FILENAME} !-d
RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-l


RewriteRule ^(.+)$ index.php?url=$1 [QSA,L]

<FilesMatch "\.pdf$">
header set x-robots-tag: "noindex"
</FilesMatch>

Can anyone help?

[edited by: phranque at 12:04 am (utc) on Feb 21, 2015]
[edit reason] Please Use example.com [webmasterworld.com] [/edit]

phranque

12:17 am on Feb 21, 2015 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



welcome to WebmasterWorld, kernmedia!


> Here is the exclusion rule we tried, which didn't work:
> #Sitemap/Robots: exclude anything that includes /sitemap.xml, /robots.txt
> RewriteCond %{REQUEST_URI} !^/(sitemap.xml|robots.txt) [NC]

that's not an exclusion rule.
a RewriteCond directive only applies to the following RewriteRule directive.
from which RewriteRule are you trying to exclude robots.txt and sitemap.xml requests?
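
as a rough sketch of that point, assuming a root-level .htaccess and assuming the catch-all is the rule the exclusion was meant for: the condition would sit directly above the catch-all RewriteRule. in the posted file it sits above the index.html rule, so it only restricts that rule. the escaped dots and the trailing $ are additions, not part of the original attempt.

```apache
# Assumption: root-level .htaccess on the old site
RewriteEngine On

# This condition is evaluated only for the RewriteRule immediately below it
RewriteCond %{REQUEST_URI} !^/(sitemap\.xml|robots\.txt)$ [NC]
RewriteRule ^ http://www.example.com/specific-url-here.htm [R=301,L]
```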

> RedirectMatch 301 /sitemap.htm.* http://www.example.com/sitemap.htm

you should not mix mod_alias and mod_rewrite directives in the same server configuration.

> We're having difficulty with this.

for any given request that is causing difficulty, you must describe the request, what response you expected, and what response you actually got.

lucy24

2:13 am on Feb 21, 2015 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Are you redirecting everything on the site, or just pages? It may work better if you put things the other way around:

RewriteRule (.+\.html) http://www.example.com/$1 [R=301,L]

replacing "html" with whatever extensions you currently use for pages. If there's more than one, you can group them:

(.+\.(html|css|js))
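
Put together, the grouped form might look like the sketch below. It assumes a root-level .htaccess and that paths stay the same on the new domain; the anchors are an addition. Because robots.txt and sitemap.xml don't match any listed extension, they are simply never redirected.

```apache
RewriteEngine On

# Redirect only the listed file types, keeping the same path on the new domain
RewriteRule ^(.+\.(html|css|js))$ http://www.example.com/$1 [R=301,L]
```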

In general, the "don't mix mod_rewrite and mod_alias" rule is a matter of it's-just-a-good-idea-on-principle. But in your quoted rules, you can pinpoint the exact problem. mod_rewrite executes before mod_alias. So by the time a request reaches your RedirectMatch rule, it has already been redirected by mod_rewrite; in fact it will never see mod_alias.

Another approach is to start like this:

RewriteRule \.(txt|xml) - [L]

Put this line before the RewriteRules that will be redirecting everything else. There are several other possible approaches; it really depends on your URL structure at the old site.
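
A narrower variant of that line, as a sketch: it assumes a root-level .htaccess (where the leading slash is stripped before the pattern is matched) and names only the two files from the question, so any other .txt or .xml URL still hits the catch-all.

```apache
RewriteEngine On

# "-" means no substitution; [L] stops processing the remaining rules
RewriteRule ^(robots\.txt|sitemap\.xml)$ - [L]

# Catch-all from the question, now reached only by other requests
RewriteRule ^ http://www.example.com/specific-url-here.htm [R=301,L]
```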

Incidentally, you don't need to escape / slashes in mod_rewrite. So change \/ to / wherever you see it.

kernmedia

7:35 pm on Feb 23, 2015 (gmt 0)

10+ Year Member



Thanks lucy24. It looks like your code would work only if the URLs are staying the same on the new domain. That's not 100% the case.

Does the RewriteRule \.(txt|xml) - [L] code exclude files ending in .txt and .xml from being redirected? If so, if we just put that before all the redirect rules, should that work?

lucy24

10:05 pm on Feb 23, 2015 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Yes, exactly.

phranque

11:28 pm on Feb 23, 2015 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



> In general, the "don't mix mod_rewrite and mod_alias" rule is a matter of it's-just-a-good-idea-on-principle. But in your quoted rules, you can pinpoint the exact problem. mod_rewrite executes before mod_alias. So by the time a request reaches your RedirectMatch rule, it has already been redirected by mod_rewrite; in fact it will never see mod_alias.

in my experience it is more likely to cause problems than not.

http://httpd.apache.org/docs/current/rewrite/avoid.html [httpd.apache.org]:
when there are Redirect and RewriteRule directives in the same scope, the RewriteRule directives will run first, regardless of the order of appearance in the configuration file.
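
one way to sidestep the ordering issue is to express the sitemap redirect in mod_rewrite as well and place it above the catch-all. a minimal sketch, assuming a root-level .htaccess on the old host (on www.example.com itself this rule would loop) and assuming only the root-level sitemap page needs it:

```apache
RewriteEngine On

# Hypothetical mod_rewrite replacement for:
#   RedirectMatch 301 /sitemap.htm.* http://www.example.com/sitemap.htm
# Matches sitemap.htm, sitemap.html, etc. at the site root only
RewriteRule ^sitemap\.htm http://www.example.com/sitemap.htm [R=301,L]

# ...specific redirects and the catch-all follow below
```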

in your configuration, you appear to be missing a general hostname canonicalization redirect.
assuming you add this (and you should), a request for http://example.com/sitemap.html will first get a 301 to http://www.example.com/sitemap.html, and that request will in turn get a second 301 to http://www.example.com/sitemap.htm. you should avoid chained redirects like this.
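
a sketch of one ordering that keeps this to a single hop, assuming the file serves both example.com and www.example.com from a root-level .htaccess: the specific redirect goes first and points straight at the final URL, and the general hostname redirect comes last.

```apache
RewriteEngine On

# Specific redirect first, pointing directly at the final URL on the canonical host
RewriteRule ^sitemap\.html$ http://www.example.com/sitemap.htm [R=301,L]

# General hostname canonicalization last (an empty Host header is let through)
RewriteCond %{HTTP_HOST} !^(www\.example\.com)?$ [NC]
RewriteRule ^(.*)$ http://www.example.com/$1 [R=301,L]
```

with this ordering, a request for http://example.com/sitemap.html matches the first rule and gets one 301 to http://www.example.com/sitemap.htm.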