check for file . without extension?

         

deesto

7:47 pm on Apr 8, 2010 (gmt 0)

10+ Year Member



I have a complex set of rules on an Apache proxy host that add missing trailing slashes to URLs and then send all subsequent requests to a back-end Apache server. One problem I've run into is a request for a file that has no extension in its name. The following condition presumably checks only the file name for an extension:
RewriteCond %{REQUEST_FILENAME} !-f


So the case I mentioned above slips by, and "file" (which should be "file.txt") gets a slash appended ("file/"), which makes the URL invalid. Is there a way to fix this?

g1smd

12:34 am on Apr 9, 2010 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Add another RewriteCond that tests for a pattern with no period in it.

deesto

1:53 pm on Apr 9, 2010 (gmt 0)

10+ Year Member



Thanks g1smd. Interesting idea, especially since I thought I needed to check for the .ext before doing the rewrite ... here's what I have, and what I believe the result should be; please comment as necessary:
# first, add a trailing slash if needed...
# ... if the requested path does not include a file name extension:
RewriteCond %{REQUEST_URI} !(\.[^./]+)$
# ... and if the request is not a directory
RewriteCond %{REQUEST_FILENAME} !-d
# ... and if the request is not a file (but I assume this is not doing a file check,
# just checking the name like the one above?):
RewriteCond %{REQUEST_FILENAME} !-f
# then take it and add a trailing slash:
RewriteRule ^(.+[^/])$ $1/ [R]
# finally, take all requests and send them to the back-end server:
RewriteRule ^/(.*)$ https://back-end.hostname.com/$1 [P]

jdMorgan

3:49 pm on Apr 10, 2010 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Use the RewriteRule pattern to its maximum capabilities (as shown) -- RewriteConds are not processed at all unless the RewriteRule pattern matches (See Rule Processing in Apache mod_rewrite doc for more details).

The -d and -f functions of RewriteCond are very 'expensive' in terms of CPU and server resources, because they invoke a call to the operating system's file manager to check whether a file or directory exists. If the current filesystem 'map' has been swapped out to disk or is stale, they may in fact invoke a physical disk read, which must complete before HTTP request handling can proceed any further. As a result, these functions, along with the rDNS lookup invoked by RewriteCond %{REMOTE_HOST}, must be avoided except when absolutely necessary. For the exists checks, a good-sized pile of additional "known exclusions" for non-proxied directories and files may be added as preceding RewriteConds to eliminate unnecessary checks; you can keep adding them until the time spent evaluating the exclusions starts to approach the comparatively huge time required to do a disk check.
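As a rough sketch of the "known exclusions" idea, assuming server-config context and a few hypothetical directories (/images, /css, /js) that are known never to need the slash fix, the cheap string tests can be placed ahead of the exists check. mod_rewrite evaluates RewriteConds in order and gives up at the first one that fails, so the filesystem is consulted only for requests that survive the exclusions:

```apache
# Sketch only: the excluded directory names are hypothetical examples.
# Cheap string comparison first; requests for these paths never reach the -f test.
RewriteCond %{REQUEST_URI} !^/(images|css|js)/
# Expensive filesystem check last. In server-config context REQUEST_FILENAME is not
# yet mapped to a disk path, so the path is built from DOCUMENT_ROOT instead
# (this assumes the files live under the DocumentRoot, with no Alias involved).
RewriteCond %{DOCUMENT_ROOT}%{REQUEST_URI} !-f
RewriteRule ^/(([^/]+/)*[^./]+)$ http://www.example.com/$1/ [R=301,L]
```

The RewriteRule pattern itself is still the first filter: none of the conditions are evaluated unless it matches.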

I do not believe that you need (or want) the check for "directory exists" and I have therefore commented it out in the code below. Even if a slashless URL-path request resolves to an existing directory, you should still redirect that request to the correct, canonical trailing-slash directory URL. This avoids having two URLs (slashed and unslashed) resolve to the same content, known as "a duplicate-content issue" and potentially harmful to search engine rankings.

External redirects should include the protocol and domain, and a 301 or 302 redirect should be stated explicitly.

Use the [L] flag on every rule. The addition of an [L] flag to the redirect below prevents your back-end proxy's existence and URL from being 'exposed' to the client. Similar problems can be avoided by using [L] on all external redirects and placing all redirects before any internal rewrites.

Because you included a leading slash in the pattern of your last rule, I presume that this code goes into your server configuration file, outside of any <Directory> containers. If the code goes into a config file and is enclosed in a <Directory> container, or if it goes into .htaccess, then remove the leading slashes from both RewriteRules, because the leading slash will not be present in the requested URL-path 'seen' by RewriteRule.

# Add missing trailing slashes on extensionless URL requests
#
# If the requested URL does not resolve to an existing file
RewriteCond %{REQUEST_FILENAME} !-f
# and if the requested URL does not resolve to an existing directory
# RewriteCond %{REQUEST_FILENAME} !-d
# externally redirect requests without a trailing slash or file extension to add a trailing slash:
RewriteRule ^/(([^/]+/)*[^./]+)$ http://www.example.com/$1/ [R=301,L]
#
# Reverse-proxy all requests to the back-end server
RewriteRule ^/(.*)$ https://back-end.hostname.com/$1 [P]

I modified the comments to reflect what the rules actually do, and to (hopefully) clarify their function.

Jim

deesto

4:08 pm on Apr 16, 2010 (gmt 0)

10+ Year Member



Thanks very much, Jim. This seems to work perfectly. Any hints on a good resource to read and be able to figure out problems like these? I've read through the Apache docs and tutorials, but I think your post is actually more informative than those.

Also, I have all this within a <VirtualHost /> in ssl.conf, at the very end of the file. I would think I'd be able to break this out into a separate VH file, but when I do, without changing any of the directives (even with a direct cut and paste of the VH), things break, and I have no idea why.

jdMorgan

4:59 pm on Apr 16, 2010 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



If the code is going into a <Directory> container, or if it resides in a .htaccess file, then the path to 'this directory' will be removed from the URL-path 'seen' by RewriteRule. So, that would usually mean that you would need to remove the leading slash from your patterns in order for the rule to match in that context.
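As a sketch of that adjustment, assuming the two rules posted earlier in the thread were moved into a .htaccess file in the document root (an assumption; the setup described in this thread is a <VirtualHost> in ssl.conf), the patterns would lose their leading slashes:

```apache
# Sketch for a .htaccess file in the document root (assumed location).
# Per-directory context strips the leading "/" before RewriteRule sees the path,
# so neither pattern starts with one.
RewriteEngine On
RewriteCond %{REQUEST_FILENAME} !-f
RewriteRule ^(([^/]+/)*[^./]+)$ http://www.example.com/$1/ [R=301,L]
# The [P] flag still requires mod_proxy to be loaded in the main server configuration.
RewriteRule ^(.*)$ https://back-end.hostname.com/$1 [P]
```

The substitutions are unchanged; only the patterns differ between the two contexts.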

I learned by reading (and re-reading, and re-reading) the Apache server documentation (all of it) over many years. So I'm not a good person to answer your question about "good resources" other than to point out the resources cited in our Forum Charter and the threads and tutorials in our Forum Library and available through a search on this site. Unfortunately, a lot of "expert information" has been posted on the Web and written in books by people who've never read the Apache documentation and/or have limited practical experience, and much of it is wrong, very inefficiently-implemented, or both... :(

That's why contributing members here may seem to be "too picky" about details at times -- Those details often make a huge difference in proper server function or search engine ranking side-effects, or both.

For a simple example, often seen on "expert sites" :
 RewriteRule ^(.*)/(.*)/(.*)/$ /index.php?arg1=$1&arg2=$2&arg3=$3 

can easily execute a thousand times more slowly than
 RewriteRule ^([^/]+)/([^/]+)/([^/]+)/$ /index.php?arg1=$1&arg2=$2&arg3=$3 [L] 

depending on the length of the requested URL-path.

The reason for this is hidden in how regular-expression pattern matching works, and in the fact that ".*" is the greediest and most promiscuous pattern: it will match anything, everything, or nothing, which leads to many, many 'back-off-and-retry' pattern-matching attempts. The much more specific negative-match pattern in the second line can be parsed in a single left-to-right pass, regardless of the length of the input URL-path.

Jim

deesto

7:23 pm on Apr 16, 2010 (gmt 0)

10+ Year Member



Thanks again Jim, and thanks for being picky. ;) By the way, the code you'd suggested doesn't seem to do what I'd hoped; requests to files without file name extensions still fail as they get a trailing slash added. Again, I'm guessing this is because the rewrite is being performed on the proxy machine, which doesn't have any real way to determine whether a URI on the back-end server is a file or not. But maybe I'm wrong about this?

jdMorgan

2:02 am on Apr 17, 2010 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Yes, you're right. The code on the front-end server cannot test for 'file-exists' on the back-end server... :(

Note the wording I used on the 'file-exists' check RewriteCond above: "If the requested URL does not resolve to an existing file." This RewriteCond invokes a call to the OS to go check the physical disk's filesystem, so in this case, the wrong server's disk is being checked.

Consider proxying either form (slash or no-slash) to the back-end, and then doing the slash-fixing back there.
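One way that split might look, as an untested sketch: the front end does nothing but proxy, and the back end, whose disk actually holds the files, does the exists check and the redirect. The host names are the placeholders already used in this thread; treating www.example.com as the public front-end name is an assumption.

```apache
# --- Front-end (proxy) virtual host: no filesystem checks, pass everything through ---
RewriteEngine On
# Needed because the back-end is reached over https.
SSLProxyEngine On
RewriteRule ^/(.*)$ https://back-end.hostname.com/$1 [P]
# Rewrite Location headers in back-end responses that name the back-end host,
# so redirects issued there are reported to the client under the front-end URL.
ProxyPassReverse / https://back-end.hostname.com/

# --- Back-end virtual host: the file check now runs against the right disk ---
RewriteEngine On
# Path built from DOCUMENT_ROOT because REQUEST_FILENAME is not yet mapped
# in server-config context (assumes no Alias is involved).
RewriteCond %{DOCUMENT_ROOT}%{REQUEST_URI} !-f
RewriteRule ^/(([^/]+/)*[^./]+)$ http://www.example.com/$1/ [R=301,L]
```

Because the back-end redirect names the public host explicitly, the client follows it through the front end again, and the second request (now with its slash) is proxied unchanged.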

Jim

deesto

6:39 pm on Apr 28, 2010 (gmt 0)

10+ Year Member



I thought so. :( So in this case, is the file check condition purely superfluous overhead, since it can never match anything locally? Maybe it's worth removing this check, then, especially if it is as resource-intensive as you'd mentioned.

I've also tried moving all these to the back-end server and just doing a generic rewrite on the proxy, but I haven't been able to find a combination of rewrite rules for both sides that seem to play nice with one another.

g1smd

7:39 pm on Apr 28, 2010 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



The 'file exists' checks might be needed when you have a site where, for example,
- robots.txt is going to be served from a file in the filesystem, and
- robot-toy-for-kids is going to be served by a script pulling content from a database.

In many cases you can avoid those checks by careful design of both the URL format, and the matching pattern for the RewriteRule. In the example above one has an extension and the other does not. Easy to design a pattern.
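A minimal sketch of such a pattern, assuming per-directory (.htaccess) context and a hypothetical handler script /page.php with a hypothetical "name" parameter:

```apache
# Hypothetical script and parameter name; adjust to the real application.
# Matches "robot-toy-for-kids" (lowercase words joined by hyphens, no period),
# but not "robots.txt", because the character classes exclude the period.
RewriteRule ^([a-z0-9]+(-[a-z0-9]+)+)$ /page.php?name=$1 [L]
```

No -f check is needed: the URL format alone separates the real file from the script-driven page.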

I treat "exists" checks as a last ditch method, to be avoided if at all possible.

jdMorgan

11:22 pm on Apr 28, 2010 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



> I treat "exists" checks as a last ditch method, to be avoided if at all possible.

... and even then to be extensively qualified with as many preceding request-based conditions as possible to reduce the number of invocations to a minimum.

Jim

deesto

1:28 pm on Apr 29, 2010 (gmt 0)

10+ Year Member



>> In the example above one has an extension and the other does not.
>> Easy to design a pattern.
Ah, you would think so! As did I. The reason for this exercise on my part was the complaint that a file that did not have a file name extension was being treated as a directory (getting a trailing slash) and was thus unviewable. I tried to address the extension bit in my original rules, but I've removed that condition on Jim's advice, and that wasn't helping in this situation anyway.

jdMorgan

1:40 pm on Apr 29, 2010 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Really, the most efficient way out of this kind of problem lies in "designing your URLs" as g1smd points out above.

For example, with a new site, it's easy to 'tag' all URLs to be passed to the back-end by starting the URL-path with "/apps" (or similar) and then telling mod_proxy to pass all such URL-paths to the back-end server.

Another method is to define and use a subdomain, like "apps.example.com". Either way, the point is that putting an explicit 'tag' in the URL usually provides for the fastest, most efficient, and easiest way to determine what does and does not need to be reverse-proxied to the back-end server.
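For the path-prefix variant, a sketch using mod_proxy directly might look like the following. The "/apps" tag comes from the suggestion above; keeping the same prefix on the back end is an assumption.

```apache
# Only URL-paths that start with the "/apps/" tag are reverse-proxied;
# everything else is served locally by the front-end server.
SSLProxyEngine On
ProxyPass        /apps/ https://back-end.hostname.com/apps/
ProxyPassReverse /apps/ https://back-end.hostname.com/apps/
```

The decision is a plain prefix comparison on the requested URL-path, with no filesystem lookups and no regular-expression backtracking.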

I'm not sure this will help with the current problem, but that is how to prevent trouble "the next time."

Jim

g1smd

6:39 pm on Apr 29, 2010 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Several methods have proved useful on previous sites, where the extensionless URLs that are rewritten to scripts use path-part formats like:
- two words separated by a hyphen
- exactly six digits
- two letters, five digits, hyphen, then some hyphenated words
- prefix c- for category URLs and p- for product URLs, followed by the respective category or product ID, followed by a hyphen then some hyphenated descriptive words.

It's fairly easy to design a pattern to 'filter' those as being the ones to rewrite, and to not rewrite any others.
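As an illustration only, patterns for the four formats listed above could be sketched like this, assuming per-directory (.htaccess) context and lowercase URLs. Every script name and query parameter is hypothetical:

```apache
# Two words separated by a hyphen, e.g. "blue-widgets"
RewriteRule ^([a-z]+-[a-z]+)$ /page.php?name=$1 [L]
# Exactly six digits, e.g. "123456"
RewriteRule ^([0-9]{6})$ /item.php?id=$1 [L]
# Two letters, five digits, a hyphen, then hyphenated words, e.g. "ab12345-red-wool-hat"
RewriteRule ^([a-z]{2}[0-9]{5})-[a-z0-9]+(-[a-z0-9]+)*$ /item.php?code=$1 [L]
# "c-" or "p-" prefix, numeric ID, a hyphen, then descriptive words, e.g. "p-1234-red-wool-hat"
RewriteRule ^c-([0-9]+)-[a-z0-9]+(-[a-z0-9]+)*$ /category.php?id=$1 [L]
RewriteRule ^p-([0-9]+)-[a-z0-9]+(-[a-z0-9]+)*$ /product.php?id=$1 [L]
```

None of these character classes admits a period, so requests for real files such as robots.txt fall through untouched.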

jdMorgan

1:46 am on Apr 30, 2010 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Yes, in fact your "c-" and "p-" qualify as "tags" in the sense that I intended above -- even without precisely-defining the rest of the URL-path. And your method would be easy to "retrofit" into many shopping sites, because "c-" and "p-" are fairly commonly employed by "SEF" plug-ins for popular carts.

Jim