Forum Moderators: phranque


Tracking down what's causing a Redirect instead of just rewriting

         

csdude55

6:26 am on Apr 6, 2020 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Well, this'n's a head scratcher for me, too.

My .htaccess is now 130 lines long, with 37 RewriteRules. 14 of those rules include [R=301], and 36 of them include [L] (the one that doesn't is [F]).

At one point (around line 83) I begin a section where I rewrite "foo" to "bar". It's supposed to be invisible, but somewhere I've messed something up and it's doing an external redirect instead. Meaning, I go to example.com/foo, and the address bar immediately changes to example.com/bar.

Can you guys and gals suggest a way that I might track down exactly which line is causing this unexpected redirect?

This is the only section that rewrites to "bar":

RewriteRule ^foo/(?:(lorem|ipsum)/)?[^/]+/([0-9]+)/? /bar/view/?topic=$1&id=$2 [NC,QSA,NE,L]
RewriteRule ^foo/(lorem|ipsum)/?$ /bar/?topic=$1 [NC,QSA,L]

# I broke this next one into 2 lines so that I ONLY match text after foo if there's a / separating it
RewriteRule ^foo/([\w\.-]+)/?$ /bar/$1 [NC,L]
RewriteRule ^foo/?$ /bar [NC,L]

phranque

7:41 am on Apr 6, 2020 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



Can you guys and gals suggest a way that I might track down exactly which line is causing this unexpected redirect?


in your server config file:
LogLevel alert rewrite:trace3


source: https://httpd.apache.org/docs/current/mod/mod_rewrite.html#logging
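
A minimal sketch of where that directive can live. LogLevel is accepted in the main server config, a <VirtualHost>, or a <Directory> block, but not in .htaccess, where it triggers a 500 error. The host name and document root below are placeholders, not values from this thread.

```apache
<VirtualHost *:80>
    ServerName www.example.com
    DocumentRoot /var/www/example

    # Keep everything else at "alert" but trace mod_rewrite's decisions.
    # trace3 is verbose, so remove it once the culprit is found.
    LogLevel alert rewrite:trace3
</VirtualHost>
```

The trace lines land in the regular error log, tagged with [rewrite:traceN], so they can be filtered out with grep.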

lucy24

6:05 pm on Apr 6, 2020 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Do you have a canonicalization redirect? If so, did you inadvertently place it after rather than before the internal rewrites?
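
A sketch of that failure mode, assuming the rules live in .htaccess, where [L] ends the current pass and the rewritten path is then run through the file again. The foo/bar paths follow the opening post; the ordering shown is hypothetical.

```apache
# Problematic order: the internal rewrite runs first.
RewriteRule ^foo/?$ /bar/ [NC,L]

# On the second pass the request is already /bar/, so a plain-HTTP
# visitor is redirected to https://.../bar/ and the internal path leaks.
RewriteCond %{HTTPS} off
RewriteRule ^ https://%{HTTP_HOST}%{REQUEST_URI} [R=301,L]
```

With the redirect moved above the rewrite, the visible URL is settled before any internal path exists.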

csdude55

8:09 pm on Apr 6, 2020 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



LogLevel alert rewrite:trace3

Awesome, thanks!

canonicalization redirect

@lucy24, I honestly have no idea what that is... :O I do have <link rel="canonical" href="blahblahblah"> in my PHP, but I don't think that actually does anything.

It wasn't redirecting before, and I haven't changed any other scripts since 3/28, so I'm positive it's in the .htaccess. I just don't know where or why :'-(

lucy24

9:37 pm on Apr 6, 2020 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I honestly have no idea what that is
Yes, you do, you've just blanked-out on what it is called ;) It’s the redirect that captures all the
http://www.example.com/
http://example.com/
https://example.com/
and sends them to your preferred
https://www.example.com/
(mutatis mutandis).

I always have this unintended-redirect possibility at the back of my mind, because the same thing happens if you neglect to put an early [L] rule involving your error documents, so that a blocked request for
http://example.com/pagename.html
leads to a very much unwanted request for
https://example.com/forbidden.html

The [NS] flag is sometimes useful, but it does not apply to requests arising from mod_rewrite itself.

csdude55

7:25 pm on Apr 7, 2020 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Ooooooh, I gotcha... I didn't know it had a name! LOL

That part is definitely at the top, though. Before any of the "real" stuff happens, I have what I think are pretty common rules:

RewriteEngine on

# this is new, I haven't uploaded it yet for fear of crashing the server again
#
# no way to set its own log, I just have to grep it from the general error log
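# to read log in SSH, this is the command on Apache's site; it never completes
# because "tail -f" keeps following the log until you press Ctrl+C:
# tail -f /var/log/apache2/error_log|fgrep '[rewrite:'
#
# I wrote this one and it works; I use "head" to limit to 10 results
# grep "\[rewrite:" /var/log/apache2/error_log | head
#
# (LogLevel is only valid in the server config, a <VirtualHost>, or a <Directory>
# block, not in .htaccess, so the next line has to move there)
# LogLevel alert rewrite:trace3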

# Force https://www
RewriteCond %{HTTP_HOST} !^(www|ww2|images)\. [NC]
RewriteRule ^ https://ww2.%{HTTP_HOST}%{REQUEST_URI} [R=301,L]

RewriteCond %{HTTPS} off
RewriteRule ^ https://%{HTTP_HOST}%{REQUEST_URI} [R=301,L]

# Cache for 1 year; 1 month = 2628000
<FilesMatch "load_subnav\.php|\.(ico|css|js|jpg|jpeg|png|gif)$">
Header set Cache-Control "max-age=31536000, public"
</FilesMatch>

# Hack / Exploit Attempts
RewriteCond %{HTTP_REFERER} service.dropdowndeals.com [NC,OR]
RewriteCond %{REQUEST_URI} ^/(?:crossdomain|wp-|administrator)|(?:a|b|shell|tiki-register|who|xmlrpc)\.php [NC,OR]
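# "+" is a regex quantifier, so the separator is matched literally as "+" or "%20"
RewriteCond %{QUERY_STRING} (?:(?:information|table)_schema|example_dbname|union(?:\+|%20)all(?:\+|%20)select) [NC]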
RewriteRule .* - [F]

RewriteRule ^(?:post|view)\.php /index.php [L]

# Stub files; not sure if they're still relevant, I set them up a few years ago due to Adsense redirects
RewriteRule ^(ad(?:form|motion|rime|tech)|bonzai|contobox|doubleclick|exponential|eyeblaster|eyewonder|flashtalking|flite|ipinyou|jivox|knorex|kpsule|linkstorm|liquidus|mediaplex|mixpo|pointroll|predicta|revjet|rockabox|sociomantic|spongecell|unicast)/(.+) /stubs/$1/$2 [L]

# Touch icons
# I used to have a ton of these in the error log, so this just makes sure they all find something
RewriteRule ^apple-touch-icon.+\.png$ /apple-touch-icon.png [L]



I always have this unintended-redirect possibility at the back of my mind, because the same thing happens if you neglect to put an early [L] rule involving your error documents...

I actually didn't have [L] on all of my rules before (like the one for apple-touch-icon.png), but I changed it and it didn't help. Now I have every RewriteRule with [L].


The [NS] flag is sometimes useful, but it does not apply to requests arising from mod_rewrite itself.

I checked, I don't have [NS] anywhere, either. I have 12 rules set to [R=301,L]: the 2 above for https and www, and 10 that redirect "old" links to the new format. They're all the same in my live .htaccess, though, so I don't think they could be the problem...?

Do I need [L] on my error documents? Eg,

ErrorDocument 400 /404.php

? These are the last rules in my script, though, so I don't think they would have any impact on the rules before them... but who knows anymore? LOL

lucy24

9:36 pm on Apr 7, 2020 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



The ErrorDocument directive uses a different module so it doesn't matter where it is located relative to the RewriteRules; each module is an island. (And so is the Core, which technically is what handles ErrorDocument.) Ordering of rules only matters within the same module.

What I do recommend is a rule at the very beginning of your RewriteRules saying something like
RewriteRule ^403\.php - [L]
In addition to the unwanted-canonicalization issue (which makes your error-document names visible to the outside world), this also prevents infinite loops if you have RewriteRules that issue a 403 on their own. Note that this is an exception to the usual “List rules in order of severity” principle. In fact I’ve got a whole clutch of them, involving things like robots.txt and hotlink.png that are largely independent of ordinary access controls.

phranque

11:52 pm on Apr 7, 2020 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



not relevant to your problem statement, but:
# Force https://www
RewriteCond %{HTTP_HOST} !^(www|ww2|images)\. [NC]
RewriteRule ^ https://ww2.%{HTTP_HOST}%{REQUEST_URI} [R=301,L]

RewriteCond %{HTTPS} off
RewriteRule ^ https://%{HTTP_HOST}%{REQUEST_URI} [R=301,L]

these rulesets should be combined so a request for http://example.com/ doesn't result in chained redirects.
(unless you are implementing HSTS)
# Hack / Exploit Attempts

this may as well precede the canonicalization redirects.

lucy24

12:47 am on Apr 8, 2020 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



this may as well precede the canonicalization redirects
Yes indeed, and it ties in with what I said about grouping RewriteRules by order of severity. With rare exceptions, anything with [F] flag should go before anything with [R] flag. (And if you've got anything with [G] flag--most common in older sites where you’ve removed things over the years--that generally belongs between the two.)
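
A skeleton of that ordering, as a sketch only. The wp- and old-section paths are hypothetical stand-ins; the 403.php exemption and the foo/bar rewrite come from earlier in the thread.

```apache
RewriteEngine On

# 0. Exempt error documents first so later rules never touch them.
RewriteRule ^403\.php - [L]

# 1. Access control: anything that answers 403 Forbidden.
RewriteCond %{REQUEST_URI} ^/wp- [NC]
RewriteRule ^ - [F]

# 2. Removed content: answer 410 Gone.
RewriteRule ^old-section/ - [G]

# 3. External redirects, including canonicalization.
RewriteCond %{HTTPS} off
RewriteRule ^ https://%{HTTP_HOST}%{REQUEST_URI} [R=301,L]

# 4. Internal rewrites come last.
RewriteRule ^foo/?$ /bar/ [NC,L]
```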

csdude55

3:35 am on Apr 8, 2020 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I'll have to come back to this thread after I get the other big issue fixed, but for now...

these rulesets should be combined so a request for http://example.com/ doesn't result in chained redirects.

Do you mean a third rule to precede these two, that would match if it's HTTP and missing the (www|ww2) ? I tried to figure out a way to mash the two rules into one, but I'm working with numerous parked domains so I can't just use RewriteRule ^ https://www.example.com%{REQUEST_URI} [R=301,L]; I have to use %{HTTP_HOST}, and it will include the www|ww2.

If you can suggest a method to mash them together, I'm all ears :-)

lucy24

4:34 pm on Apr 8, 2020 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Yes, canonicalization redirects can be a mess if you have a bunch of sites sharing the same RewriteRules. (But why do they? Why isn’t each one in its own VirtualHost section?) For without-www sites, the ordinary rule--if, as phranque pointed out above, you don’t have to deal with the HSTS complication--goes
RewriteCond %{REQUEST_URI} !^/robots\.txt
RewriteCond %{HTTPS} !on [OR]
RewriteCond %{HTTP_HOST} !^example\.com$
RewriteRule (.*) https://example.com/$1 [R=301,L]
You can leave out the first Condition if it worries you; I’ve found it useful. Note the [OR] separating the two essential conditions. If you want to be superefficient, list the two in most-likely-to-succeed order. (This is a general principle for anything OR-delimited or pipe|delimited. Conversely, anything AND-delimited gets listed in most-likely-to-fail order.)

Technically of course “!on” means exactly the same thing as “off” since it should be an absolute binary toggle; it’s just another way of protecting yourself against weirdness.

Edit: I actually don’t know if there is any measurable difference between the (.*) $1 capture approach, or the %{REQUEST_URI} approach. As so often, though, it may be more a matter of personal coding preference.

csdude55

8:28 pm on Apr 8, 2020 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



But why do they? Why isn’t each one in its own VirtualHost section?

The way I set up my sites, it's really one site with nearly 100 domains parked on top of it. Then I read the domain name and query MySQL for information pertinent to that domain... the logo, titles, everything is relevant for that domain. So to the end user, I have 100 websites. But in reality, I have one website.

The best part for me is that when I want to add a new website, I just buy a domain, park it to the main one, and then plug in a few variables to a MySQL table. It takes 5 minutes to launch a new site :-) And when I want to update a feature or something, I update it once and it's changed on every site.

But then, yeah, it does complicate some things.

I really have been debating on whether it would be better (read, more profitable) to change all of my sites to actually redirect from www.example.com to foo.com/example. We've talked about this before, though... pros would be that a single site with more traffic might be worth more (to Adsense) than 100 sites with lower traffic, and it would be cheaper to market foo.com on a larger level rather than 100 different examples.com on smaller levels. But cons would be that search engines might penalize me, and it might be confusing to the long-time subscribers. So it's a 50/50 gamble, I think, and I haven't decided yet.

If I do that then I could set up a redirect like you gave; otherwise, using %{HTTP_HOST} as the rewrite just ends up keeping the www|ww2 if they have it. I don't think there's an environment variable for JUST the domain (minus the www. and .com), is there?
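
mod_rewrite has no built-in variable for that, but a RewriteCond can capture it and the [E] flag can store it. A minimal sketch: SITE_NAME is a made-up variable name, and the pattern assumes www, ww2 and images are the only subdomain prefixes in use.

```apache
# Capture the first label after an optional www./ww2./images. prefix.
# www.example.com -> example, example.com -> example
RewriteCond %{HTTP_HOST} ^(?:(?:www|ww2|images)\.)?([^.]+)\. [NC]
RewriteRule ^ - [E=SITE_NAME:%1]
```

The "-" substitution leaves the URL alone, so later rules still run. After an internal rewrite the variable may show up to scripts with a REDIRECT_ prefix, so check both names.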


I actually don’t know if there is any measurable difference between the (.*) $1 capture approach, or the %{REQUEST_URI} approach. As so often, though, it may be more a matter of personal coding preference.

Just a thought here, but I think that $1 would require a hard-coded / before $1, where %{REQUEST_URI} would not. So www.example.com would become www.example.com/

lucy24

9:17 pm on Apr 8, 2020 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



:: scrolling back to look at your current rule ::

Well, darn that www/ww2/images three-way complication. If the only permitted hostname were www.etcetera, then you would be able to merge them all. And a hundred-plus domains is clearly far too many for a

RewriteCond %{HTTP_HOST} !^www\.(example1|example2|example3|example4)\.com$

Oh well.
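
For the simpler case where www. is the only permitted prefix, a capture avoids listing the domains by name. This is a sketch under that assumption and does not handle the ww2/images hosts in this thread.

```apache
# Redirect when the request is not HTTPS, or the host lacks www.
RewriteCond %{HTTPS} !on [OR]
RewriteCond %{HTTP_HOST} !^www\. [NC]
# Always matches; its only job is to capture the bare host into %1.
RewriteCond %{HTTP_HOST} ^(?:www\.)?(.+)$ [NC]
RewriteRule ^ https://www.%1%{REQUEST_URI} [R=301,L]
```

The first two conditions are OR-ed and the third is AND-ed with the result, so %1 always comes from the third line.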

but I think that $1 would require a hard-coded / before $1, where %{REQUEST_URI} would not
Yes, certainly. But I was wondering which way, if any, makes significantly less work for the server. The (.*) business is already a teeny bit of superfluous work, since 9 requests out of 10 are already correct and then the capture just gets thrown away. (This is why I use a %1 arrangement, two steps forward and one back, with the index redirect--where it's more like 999 out of 1000 requests are correct to start with.)

csdude55

12:14 am on Apr 9, 2020 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



This is why I use a %1 arrangement...

I read this recently, but I don't recall if it was on Apache's site or just some blog post so don't stake your life on this... but their explanation was that htaccess actually reads the RewriteRule first, and if it matches then it jumps to the first RewriteCond and reads downward for matches.

If that's the case, then I think that using $1 to match against itself would be faster than %1 to match a condition that hasn't actually been read yet.

Or it could be the opposite... that it reads the "condition" part of RewriteRule without reading the match to see if it actually uses $1, then has to save $1 through all of the RewriteCond before determining whether to use it.

The difference would have to be tiny, though, so I have no idea how you could ever benchmark that to find out.

phranque

12:29 am on Apr 9, 2020 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



you are correct - mod_rewrite checks the pattern of the RewriteRule before processing any preceding RewriteCond directives.
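
a small illustration of that order, using hypothetical old/new paths:

```apache
# Order of evaluation for this ruleset:
#   1. The RewriteRule pattern ^old/(.*)$ is tested against the URL-path.
#   2. Only if it matched are the RewriteCond lines tested, top to bottom.
#   3. If every condition holds, the substitution is applied.
RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
RewriteRule ^old/(.*)$ /new/$1 [L]
```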

csdude55

12:39 am on Apr 9, 2020 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Hey, I'm finally right about something! There's a first time for everything, I guess :-D

lucy24

1:56 am on Apr 9, 2020 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I'm finally right about something!
Yes, and that’s precisely why I use the %1 approach in the specific and narrowly delimited case of an index redirect. Capturing in this case is moderately complex--in my case
^((\w+/)*)index\.html
and it could be a lot worse if your URLpaths include non-word characters. But at best, every single request would be captured all the way through, and then either backtrack “Oh, whoops, I was supposed to leave room for ‘index.html’” or throw it all away when the request turns out not to contain “index.html”. So, instead, the body of the rule says
RewriteRule index\.html$ https://example.com/%1 [R=301,NS,L]
where the [NS] flag is to exclude mod_dir activity. And then--only in the exceedingly rare case where the pattern matches--we go to a RewriteCond whose sole purpose is to do the actual capture.
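
Put together, the pair might look like the sketch below. The RewriteCond line is an assumption reconstructed from the pattern quoted above; the actual condition is not shown in this post.

```apache
# The rule's cheap test runs first: does the URL-path end in index.html?
# Only then is the condition evaluated, and its capture becomes %1.
RewriteCond %{REQUEST_URI} ^/((?:\w+/)*)index\.html$
RewriteRule index\.html$ https://example.com/%1 [R=301,NS,L]
```

A request for /dir/index.html would capture "dir/" and redirect to https://example.com/dir/.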

As it happens, the only time /index.html (for any directory) is requested at all on my main site is when it’s me checking out a new page by clicking in Fetch--where the physical file of course is named index.html, but the URL isn’t. On my personal site, there are rare requests for /index.html because I didn’t remove it from URLs until, oh, 2011 or so, and search engines have long memories. (Also one isolated link from an ancient message board that I still see occasionally.)

csdude55

5:50 am on Apr 10, 2020 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Just a nice little update to the opening question... the problem appears to have been with the last rule:

RewriteRule ^foo/?$ /bar [NC,L]

After eliminating everything else, I narrowed it down to this specific line. It looked fine, there were no errors in the log, and the tester site said it was fine, but I discovered that adding a / to the end fixed it:

RewriteRule ^foo/?$ /bar/ [NC,L]

I can't honestly say WHY this was causing a problem, though. I have other rules without a / and they work just fine! I looked through the coding for /bar/index.php and there's nothing there that would have caused it, so... I dunno. But that fixed it.

I sorta hate this type of "fix", though. If I don't know why it worked then I'll never be sure that it REALLY fixed it and isn't just a band-aid.
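
One likely explanation, assuming /bar is a real directory on disk (the thread never confirms it): when a request maps to a directory but the path lacks a trailing slash, mod_dir's DirectorySlash (on by default) answers with a 301 to the slashed URL. Rewriting to /bar hands mod_dir a slashless directory path, and it redirects to /bar/. Rewriting straight to /bar/ leaves it nothing to fix.

```apache
# Target is a directory without its slash: mod_dir issues a 301 to /bar/,
# which the visitor sees as a redirect.
# RewriteRule ^foo/?$ /bar [NC,L]

# Target already ends in a slash: the rewrite stays internal.
RewriteRule ^foo/?$ /bar/ [NC,L]
```

That would also be consistent with the other slashless rules behaving normally, if they point at files or at further rewrites rather than at bare directories.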

w3dk

12:29 am on Apr 12, 2020 (gmt 0)

10+ Year Member Top Contributors Of The Month



Aside:

these rulesets should be combined so a request for http://example.com/ doesn't result in chained redirects.


It doesn't look like the rules would result in a "chained redirect" for such a request? The first rule would redirect to "https://ww2.example.com/" and the second rule would fail, since it's already on HTTPS. (It would result in a "chained redirect" if the rules were written the other way round - which would be required for HSTS.)
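
For comparison, a sketch of the reversed order. It chains deliberately: a request for http://example.com/ goes first to https://example.com/ and then to https://ww2.example.com/, so the bare host gets to answer over HTTPS (and send a Strict-Transport-Security header, assuming one is configured elsewhere).

```apache
# 1. HTTP -> HTTPS on the same host.
RewriteCond %{HTTPS} off
RewriteRule ^ https://%{HTTP_HOST}%{REQUEST_URI} [R=301,L]

# 2. Then add the ww2. prefix where no permitted prefix is present.
RewriteCond %{HTTP_HOST} !^(www|ww2|images)\. [NC]
RewriteRule ^ https://ww2.%{HTTP_HOST}%{REQUEST_URI} [R=301,L]
```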