Working Rewrite/RewriteMap Hangs Apache When RewriteLock Added

Vhost Apache won't restart if RewriteLock is added to a working Rewrite

         

kidcobra

9:01 pm on Jun 20, 2012 (gmt 0)

10+ Year Member



Thanks to Jim Morgan's guidance to both StupidScript (http://www.webmasterworld.com/apache/4171842.htm) and myself (http://www.webmasterworld.com/apache/4289161.htm), I have a working set-up for rewriting incoming URLs. The problem is that when I add RewriteLock to the code, it hangs Apache on restart. Without RewriteLock it works fine (but see below, as it won't work well without RewriteLock preventing collisions and confusion in a real life environment). I have had a couple other similar working set-ups to accomplish the same thing that this set-up does, and they all hang when RewriteLock is added. Here are the details:

I am on a Network Solutions VPS; four domain names share the IP. I have a Rewrite / RewriteMap set-up that works. The Rewrite is in the file for the example.com web address at /var/www/vhosts/example.com/conf/vhost.conf, and it is the only thing in that vhost.conf file. It would not work in the main httpd.conf file for the server. I have turned off PHP output buffering.

The RewriteMap looks at any URL typed in by the user that ends in -aa (case-insensitive, thanks to the (?i)), for example http://example.com/bb-aa or even http://example.com/dfdf-sfsfs-aa. It uses everything after the / in that URL to check a slug column in the database for a match, and from the record that matches the slug bb-aa (or dfdf-sfsfs-aa) it gets a third piece of info (cc), which is the index number of the matching row. It then uses cc as the query string to load a file, and leaves the originally typed-in URL in the address bar while showing the file based on the constructed query string. So if the typed-in URL is example.com/collie-dog, it will find the index number associated with the match of collie-dog in the slug column of the database table and use that row index number to load the right file, while leaving example.com/collie-dog in the address bar for the user, search engines, etc.

Here is the Rewrite:

Options +FollowSymlinks
RewriteEngine on
RewriteMap newurl prg://var/www/cgi-bin/cleanup.php
RewriteRule ^/((?i).*-aa$) ${newurl:$1} [L]


When I add the following either above or below the RewriteMap line:

RewriteLock /var/lock/mapexample.lock


and try to restart Apache, it hangs and will not restart. I have tried different file paths (thinking it might be a permissions issue), taking away the initial /, putting the path in quotes, different file types (e.g. .txt at the end), different file names, changing users and groups, creating an empty file to point to with 777 permissions, and just about anything else, and every time Apache hangs on restart if the RewriteLock line is included. The Rewrite / RewriteMap works, but I have read a lot on the importance of RewriteLock to prevent collisions and confusion, and without RewriteLock PHP is issuing warnings in the log ending in DANGEROUS. The reasons for the warnings are well founded, so I don't want to put this live on the site until we have a working RewriteLock.

Here is the map (located where the Rewrite says). It basically will listen at all times, and when handed anything (only things that meet the RewriteRule of ending in -aa will be handed off), it does the database look-up and returns the right file while leaving the user input in the address bar. If there is no match in the database, it leaves the address bar alone, but does a few funky things as discussed below in the questions at the end.

And about the preg_match: the reason I did it the "anything" way is that down the road we will have other URLs that end in -kk, for example, which the same map will be able to process once they are added to the RewriteRule.

#!/usr/bin/php
<?php
include '/var/www/vhosts/example.com/path-to-database-connection-file';
set_time_limit(0);
$keyboard = fopen("php://stdin","r");
while (1) {
$line = fgets($keyboard);
if (preg_match('/(.*)/', $line, $igot)) {
$getalias = mysql_query("select cc FROM `database`.`table` WHERE slug = '$igot[1]'");
while($row=mysql_fetch_array($getalias)) {
$arid = $row['cc'];
}
print "/asdfasdf/asdfasdf.php?cc=$arid\n";
}
else {
print "$line\n";
}
}
?>
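A side note on the map program above, for anyone reusing it. As far as the mod_rewrite documentation describes the prg: protocol, Apache writes one lookup key per line to the program's stdin and expects exactly one line back on stdout, unbuffered, with the four-character string NULL meaning "no match". In the PHP above, $arid keeps its previous value when the query returns no rows, so a failed lookup prints the previous request's target. The sketch below shows the same loop shape in Python rather than PHP; the dictionary SLUGS and the function names are hypothetical stand-ins for the database query, and the target path is the placeholder from the post.

```python
import sys

# Hypothetical stand-in for the database: slug -> row index (cc).
SLUGS = {"bb-aa": "17", "dfdf-sfsfs-aa": "42"}


def lookup(key, table=SLUGS):
    """Return the rewrite target for one key, or "NULL" if nothing matches."""
    slug = key.strip()            # drop the trailing newline Apache sends
    cc = table.get(slug)          # computed fresh for every request
    if cc is None:
        return "NULL"             # tells mod_rewrite the lookup failed
    return "/asdfasdf/asdfasdf.php?cc=" + cc


def serve(stdin=sys.stdin, stdout=sys.stdout):
    """Answer one line per key until stdin is closed."""
    for line in stdin:
        stdout.write(lookup(line) + "\n")
        stdout.flush()            # the reply must not sit in a buffer

# To use this as a map program, call serve() at the bottom of the script.
```

Because a dictionary lookup takes the place of the SQL string, the key never ends up inside a query. A real version would still need to escape or parameterise the slug before querying.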


My questions:

The hanging of Apache when RewriteLock is added..., do you guys have any ideas about why this would work without it, and not even be able to restart Apache with it?

While the big headache and most important concern is the RewriteLock issue, the following are other questions that I am considering and I figured to ask about them for any independent thought that might be helpful while we are here on the big issue:

Is there a more efficient way to process everything that comes to the map than our "if preg_match" line, since the RewriteRule is set up to send the map only what should be processed? Basically, whatever comes will be a match (in a perfect world :), so do we need to match it at all?

The last bit of code (else, print "$line\n";) leaves the address bar alone. However, it doesn't throw a 404; if there is no match, it leaves whatever is in the browser from the last run-through. So, for example, say the incoming URL has the bb part misspelled. It goes to the map, which does nothing to the address bar and, if the browser page is blank, loads whatever the map last loaded for the previous request, even if that came from a different browser or window! Knowing that the RewriteRule is only sending what should be valid requests, does it seem sensible to change that last "else" line to toss a 404, or to return the request past the .conf file to be processed (not sure we can do that)? Or am I not thinking of an unintended consequence?

Finally, all our slugs are lowercase. Since the RewriteRule ignores capitalization, we do get requests for Collie-DOG or collie-Dog, for example, if the user typed it in that way. Are we better off folding qualifying URLs to lowercase through a 301 before they hit the current RewriteRule? The plan now is to handle any duplicate-content issues caused by capitalization with canonical tags while only using lowercase on the site, but we thought that first converting anything ending in aa to lowercase via a 301, then processing it, might be more efficient.

Any help will be greatly appreciated. Greg

g1smd

9:26 pm on Jun 20, 2012 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Something should redirect to fix capitalisation issues. Whether it's a previous mod_rewrite rule (and there are several ways to do it), or a PHP or other script is up to you.

Non-valid requests should return 404 status. You do not want infinite URL space, nor warnings for "soft 404 errors" in WMT.

Can't help with the RewriteLock problem.
[httpd.apache.org...]
[httpd.apache.org...]

kidcobra

8:23 pm on Jun 28, 2012 (gmt 0)

10+ Year Member



I have not yet solved the RewriteLock problem, but here is the working solution for the 301 that redirects all these URLs to lowercase before they are rewritten by the RewriteMap. This is the first thing in vhost.conf, above the rewrite to the map.

Options +FollowSymlinks
RewriteEngine on
#use the built in apache function for converting to lowercase
RewriteMap lc int:tolower
#look only at requests that end in -aa
RewriteCond %{REQUEST_URI} (.*)-(?i)aa$
#look only at requests that have at least one capital letter
RewriteCond %{REQUEST_URI} [A-Z]
#look only at requests that are not in a directory (so only the root folder,
#knowing the form of all the URL's as shown above, and again,
#this is the request from the browser, not the file we want to appear.
RewriteCond %{REQUEST_URI} !^(.+?)/
#take anything that meets all of the above, and send it to the built in
#apache lowercase map as a 301 redirect.
RewriteRule ^/(.*) ${lc:$1} [R=301]


This is processed before the rewrite as shown below, so anything coming to the rewrite is lowercase. I added one line to the rewrite to conform (the not in a directory line) so the new rewrite code is:

RewriteMap newurl prg://var/www/cgi-bin/cleanup.php
RewriteCond %{REQUEST_URI} !^(.+?)/
RewriteRule ^/((?i).*-aa$) ${newurl:$1}

Three questions:

One thing popped up when working on this. We have the following canonical code in htaccess to redirect all requests away from www and away from index.php (so our address is http://example.com):

Options -Indexes +FollowSymlinks 
RewriteEngine on
RewriteBase /
RewriteCond %{THE_REQUEST} ^.*\/index\.php\ HTTP/
RewriteRule ^(.*)index\.php$ /$1 [R=301,L]
RewriteCond %{http_host} !^example$ [nc]
RewriteRule ^(.*)$ http://example/$1 [r=301,nc,L]

In testing our set-up for vhost.conf, when a www.example.com/bb-aa request is made, these www requests are not handled right, presumably because the vhost.conf file is processed before the .htaccess file. They are redirected to lowercase and are processed by the RewriteMap, but the address bar then shows the actual, un-rewritten file name that we don't want to expose (the file printed near the end of the map program above), replacing what the user typed in, which we want to remain as the URL. So they do load the right file, but they also change the address bar to reflect the actual file path. I assume this is because the redirect away from www has not yet been applied to these requests (big assumption, I know). The logical move is to take the canonical code above out of .htaccess and put it in the vhost.conf file, but as this involves the entire site, I figured I would ask first: can or should that canonical code just be moved to the top of the vhost.conf file, above the other two entries (leaving out the RewriteBase and adding the / after the caret in each of the two RewriteRules, so that everything conforms in vhost.conf)? Or, even if this is the cause, is there a simpler solution? And, of course, is this the likely reason at all?

Second, with all the different things we tried, you'll notice we ended up without [L] flags after the RewriteRule in the redirect or after the RewriteRule in the rewrite. All "seems" well, but should one or both of these be added to prevent endless-loop issues?

Third, in terms of efficiency, I assume that in the redirect to lowercase we should start with the RewriteCond that eliminates the most requests, to reduce wasted processing as much as possible. If that is true, should the "not in a directory" condition be first?
Thanks Greg

g1smd

8:30 pm on Jun 28, 2012 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



There are a LOT of syntax and loophole logic errors in that code.

There are some non-canonical requests that will blow up the server.

lucy24

10:02 pm on Jun 28, 2012 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Overlapping g1 as usual ...

Every rule should be followed by the [L] flag unless you explicitly and unambiguously want to continue to another rule. (Was going to say "Unless your name is jdmorgan" but I think you've got that already.) You'd expect a redirect to have an implied [L] but it doesn't.

Now, about all those non-final (.*) forms ...

#look only at requests that end in -aa
RewriteCond %{REQUEST_URI} (.*)-(?i)aa$

Since you're not capturing, you simply don't need the part before the hyphen. I'm leery about (?i) but here it doesn't matter since you can equally well say [NC] for the whole condition.

#look only at requests that have at least one capital letter
RewriteCond %{REQUEST_URI} [A-Z]

Yup, that's all you need.

#look only at requests that are not in a directory (so only the root folder,
#knowing the form of all the URL's as shown above, and again,
#this is the request from the browser, not the file we want to appear.
RewriteCond %{REQUEST_URI} !^(.+?)/

This doesn't "feel" right-- even aside from the (.+?)/ wording. I kinda think it's safer to do something with %{THE_REQUEST}

#take anything that meets all of the above, and send it to the built in
#apache lowercase map as a 301 redirect.
RewriteRule ^/(.*) ${lc:$1} [R=301]


This is in your config file, right, not .htaccess? The leading / is correct if so, but you can put a lot more into the Rule itself and save your server some trouble.

:: thinking ::

RewriteRule ^/([^/.-]+-aa)$ et cetera

That's if your filenames contain only one hyphen. Otherwise you have to go to

^/(([^/.-]+-)+aa)$

Note the [^.] element. Anything with an extension, like images or stylesheets, should bypass the rule entirely. If you're using extensionless URLs-- which is what the "ends in 'aa'" implies-- you'd have a separate rule to redirect requests that come in with final .php. It would come after the "index.php" redirect (more specific) and before the "example.com" redirect (less specific). And then there are the requests that contain capital letters and end in .php. Make sure they don't get redirected twice.
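A quick way to see which requests the multi-hyphen pattern lets through is to try it outside Apache. The sketch below uses Python's re module, which is not mod_rewrite's regex engine but treats this particular pattern the same way as far as I know; the helper name captured and the sample paths are made up for illustration.

```python
import re

# The pattern suggested above for root-level, extensionless names ending in -aa.
PATTERN = re.compile(r"^/(([^/.-]+-)+aa)$")


def captured(path):
    """Return what $1 would hold, or None if the rule would not match."""
    match = PATTERN.match(path)
    return match.group(1) if match else None


for path in ["/bb-aa", "/dfdf-sfsfs-aa", "/images/bb-aa", "/bb-aa.css", "/BB-AA"]:
    print(path, "->", captured(path))
```

Only the first two paths match. Anything in a subdirectory or carrying an extension is skipped, and so is the uppercase form, because the pattern is case-sensitive unless a flag such as [NC] (re.IGNORECASE in Python) is added.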

RewriteCond %{THE_REQUEST} ^.*\/index\.php\ HTTP/
RewriteRule ^(.*)index\.php$ /$1 [R=301,L]

You don't need to escape slashes in mod_rewrite, though you do need to escape literal spaces. But again, since you're not capturing, the Condition can be simplified to

RewriteCond %{THE_REQUEST} index\.php\ HTTP/
RewriteRule ^/(([^/.]+/)*)index\.php$ /$1 [R=301,L]

RewriteCond %{http_host} !^example$ [nc]
RewriteRule ^(.*)$ http://example/$1 [r=301,nc,L]

Probably the most common Rewrite ever. Make sure this rule comes at the very end of your Redirects, right after the index.php rule and before any Rewrites. The most robust form is

!^(example\.com)?$

where the part inside parentheses is the form you do want. Anchors at both ends, so you have to include the tld. People have occasionally got into fights about whether [NC] with the domain name is required, necessary, unnecessary or wrong. But you can search for those posts yourself ;)
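To make the effect of that negated host pattern concrete, here is a small Python model of it. The function name needs_redirect is hypothetical, and re.IGNORECASE is an assumption standing in for the [nc] flag on the original condition, which, as noted, people disagree about.

```python
import re

# The condition !^(example\.com)?$ fires when this pattern does NOT match.
CANONICAL = re.compile(r"^(example\.com)?$", re.IGNORECASE)


def needs_redirect(host):
    """True if a request with this Host value would be sent to the 301."""
    return CANONICAL.match(host) is None
```

With this model, needs_redirect("www.example.com") is True, while "example.com" and the empty string both give False. The optional group is what lets a request with no Host value pass through instead of being redirected, and the escaped dot is what stops a name like "examplexcom" from counting as canonical.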

kidcobra

6:07 pm on Jul 3, 2012 (gmt 0)

10+ Year Member



For a variety of reasons, including stability issues, the inability to solve the RewriteLock issues, and the fact that our initial program of preg-matching multiple variables was simplified by the creation of a slug column in the database to complete the URLs, we switched from a program-based (prg) map to a text (txt) file, which is converted to a dbm hash file that is the actual map. This is our current code. We also solved the 301 redirect issues and the lack of a 404 for bad inputs described earlier in this thread, and made most of the changes in the code suggested by the gracious commenters.

So briefly, in vhost.conf we take incoming requests such as example.com/aa-aaaff-cdddc-bb, use the fact that they end in -bb to identify them, and match the URI against a dbm hash file created from a txt file that was itself created by a stored procedure in a MySQL database. Conceptually, the hash file just lists two pieces of info per line: the contents of the database column named ItemID for one database row, and the contents of the database column named slug for that row, with one line for every unique slug in the database. The rule appends the ItemID whose slug matches an incoming URI as a query string, to display a file based on that ItemID while leaving the clean URL in the address bar. So for users, search engines, etc., the clean URL is in the address bar, but what they see is a file whose address is selected via the map and the last line of code in the vhost.conf file. This would also work using the text file directly as a txt type map, but converting it to a dbm hash file is supposed to make lookups much quicker. The hash file is updated every day via some cron jobs detailed below.

In vhost.conf, the first code group uses Apache's internal lowercase function to take incoming requests that meet the conditions described in the code and lowercase the URI part of the request (anything after the /) with a 301 redirect.

Next is the canonical code to redirect all www and index.php requests. This had been in .htaccess, but we moved it, leaving the existing RewriteBase command in .htaccess and adding a slash after the ^ in each of the two RewriteRules, which was not needed when they were located in .htaccess.

Finally, the slug map evaluates each incoming URI in two ways. First, is it in the root directory? Second, does it match a slug in the map (the dbm hash file)? This slug match is a forward-looking condition: if the URI matches something in the map, the request is sent to the map. If not, the rule ignores it and the request is processed as any other request on the site would be, which solves the 404 problem. Any request that qualifies to be sent to the map will already have had its URI converted to lowercase by the first code group, which solves the redirect problem.

Options +FollowSymlinks
RewriteEngine on
RewriteMap lc int:tolower
#must be in root, not a directory
RewriteCond %{REQUEST_URI} !^(.+?)/
#end in -bb (case no matter)
RewriteCond %{REQUEST_URI} -(?i)bb$
#has a capital letter
RewriteCond %{REQUEST_URI} [A-Z]
#anything that survives the three conditions, send it to the lowercase map as a 301 to a lowercase URI
RewriteRule ^/(.*) ${lc:$1} [R=301,L]

#four lines to deal with removing index.php and www via 301
RewriteCond %{THE_REQUEST} ^.*\/index\.php\ HTTP/
RewriteRule ^/(.*)index\.php$ /$1 [R=301,L]
RewriteCond %{http_host} !^example\.com$ [nc]
RewriteRule ^/(.*)$ http://example.com/$1 [r=301,nc,L]

#define map name and type and pinpoint the location of the file (the name of the map is mapfest, the file is map.dbm)
RewriteMap mapfest dbm:/folder2/on-server/map.dbm
#incoming URL must be in root, not a directory
RewriteCond %{REQUEST_URI} !^(.+?)/
#NOT_FOUND is something that does not match anything in the map.
#So if !NOT_FOUND (not NOT_FOUND, or more simply if it's found in the map), send it to the map.
#If it is NOT_FOUND, don't send it to the map. This looks forward into the map and only sends requests that will match.
RewriteCond ${mapfest:$1|NOT_FOUND} !NOT_FOUND [NC]
#anything gets here, pull the match from the map and append it to the target path as shown.
#Show that file, but leave the URL in the address bar alone.
RewriteRule ^/(.*) /folder/file.php?Item_Num=${mapfest:$1|NOT_FOUND} [L]
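
For readers who find the map-with-default trick hard to follow, the last rule group can be modelled in a few lines of Python. This is only a simplified sketch of the decision logic, not of how mod_rewrite evaluates it: the dictionary MAPFEST and its single entry are invented, and the real condition is a regex test on the looked-up value, not an equality check.

```python
# Hypothetical map contents: slug -> ItemID, as stored in the dbm file.
MAPFEST = {"aa-aaaff-cdddc-bb": "1234"}


def rewrite(uri, table=MAPFEST):
    """Return the internal target, or None to let the request pass through."""
    if not uri.startswith("/"):
        return None
    key = uri[1:]                        # what ^/(.*) captures as $1
    if "/" in key:                       # the "must be in root" condition
        return None
    item = table.get(key, "NOT_FOUND")   # ${mapfest:$1|NOT_FOUND}
    if item == "NOT_FOUND":              # the !NOT_FOUND condition fails
        return None
    return "/folder/file.php?Item_Num=" + item
```

A known slug gives the file.php target; an unknown slug or anything in a subdirectory gives None, which corresponds to the request falling through to normal processing (and a normal 404 if nothing else serves it).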


A few comments. We have an old version of MySQL which does not have event scheduling, so we created a stored procedure which is called by a cron job and deposits a file map.txt on the server. This file is then converted to the dbm hash file using httxt2dbm, the program that ships with Apache. That hash file (there are actually two files created) is copied into another folder, where it overwrites the version created the day before. It is that second folder that is the location of the map (the hash file). Then both the originally created txt and hash files are deleted from the original folder. The reason for this is that MySQL will not overwrite an existing txt file from the stored procedure, so the txt file cannot be there when the next scheduled file is created, or else the stored procedure will not make a new file. As for the hash file, we found that when creating a hash file in a folder that already has one, a new hash file is still created from the txt (it ignores the existing ones and does not overwrite them), so the files just pile up in the folder. So, humorously, one procedure will do nothing (txt) and the other will do too much (dbm). So again, the hash file is created where we do not need it and then copied to the right place, where the copy, unlike a newly created hash file, overwrites the previous file.

Apache caches this map.dbm file and updates the cache when the file is updated or when the server is restarted. You could likely do this procedure with a database trigger on update for example, but we don't need that kind of instant updating to map and for our purposes, once a day is sufficient for an update.

One other point, the httxt2dbm actually creates two files, both of which we copy and paste. They have names like map.dbm.pag and map.dbm.dir, but you only need refer to them as map.dbm in your RewriteMap code.

Here is the stored procedure to make the text file from the database and have it placed in a folder on the server. Each line in the output file is a whitespace-separated pair of the slug and the ItemID (MySQL separates the fields with a tab by default). The GROUP BY instruction gives us only unique slugs and no duplicates. If there were duplicate slugs, they would be selected by incoming requests at random (only among the duplicates, not among the entire list).

DELIMITER //
CREATE PROCEDURE nameofstoredprocedure()
BEGIN
SELECT slug,ItemID FROM TABLE WHERE various like this and not like thats GROUP BY slug into outfile '/folder1/on-server/map.txt' Lines terminated by '\n';
END //
DELIMITER ;


Here are the cron commands:
mysql -h example.com -u yourusername -pyourpassword -D nameofdatabase -e 'CALL nameofstoredprocedure()' #call the stored procedure from cron, no script needed, just this command. There is no space after the -p (you need the -p, just no space after it), but you do need a space after the -u, the -D, the -e and the word CALL

/usr/sbin/httxt2dbm -i /folder1/on-server/map.txt -o /folder1/on-server/map.dbm #convert the created txt file to a dbm hash file - well, it actually makes two files as shown next

cp /folder1/on-server/map.dbm.dir /folder2/on-server/map.dbm.dir #copy hash file from folder1 to folder2
cp /folder1/on-server/map.dbm.pag /folder2/on-server/map.dbm.pag #copy hash file from folder1 to folder2

rm -f /folder1/on-server/map.txt #delete txt file

rm -f /folder1/on-server/map.dbm.dir #delete hash file
rm -f /folder1/on-server/map.dbm.pag #delete hash file
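
An aside on the copy step: cp rewrites the destination file in place, so a lookup arriving in the middle of the copy could, in principle, see a partly written map. A common alternative is to copy to a temporary name in the destination folder and then rename it, since a rename within one filesystem is atomic on POSIX systems. The sketch below shows that idea in Python; the helper name publish is hypothetical, and it is not what the cron commands above do.

```python
import os
import shutil
import tempfile


def publish(src, dst):
    """Copy src over dst so readers see either the old file or the new one."""
    dst_dir = os.path.dirname(os.path.abspath(dst))
    fd, tmp = tempfile.mkstemp(dir=dst_dir)   # temp file on the same filesystem
    os.close(fd)
    try:
        shutil.copyfile(src, tmp)
        shutil.copymode(src, tmp)             # mkstemp files start out private
        os.replace(tmp, dst)                  # atomic rename on POSIX
    except BaseException:
        if os.path.exists(tmp):
            os.remove(tmp)
        raise
```

It would be called once per file, for example publish("/folder1/on-server/map.dbm.dir", "/folder2/on-server/map.dbm.dir") and again for the .pag file. Each file is swapped atomically, but the pair is not, so there is still a brief moment when the two files come from different runs.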


If we are doing anything scary with the code, we'd be glad to know, of course. We did not change (but will do so, time permitting) the canonical code that deals with index.php, as what was there worked and we are just running out of time. The same goes for the code that says "if not in a directory". It will take us some time to figure out a cleaner/better way.

Thanks again to Jim Morgan, g1smd, stupidscript, and lucy24.

parochy

6:12 pm on Oct 4, 2012 (gmt 0)

10+ Year Member



Apache hangs if you define more than one RewriteLock directive or if you use it in a VHOST config.

The RewriteLock should be specified at server config level and ONLY ONCE. This lock file will be used by all prg type maps. So if you want to use multiple prg maps, I suggest using an internal locking mechanism (in PHP, for example, there is the flock function) and simply ignoring the warning Apache writes in the error log.

See here for more info:
[books.google.com ]
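
For anyone wanting to try the flock suggestion above: PHP's flock has a close counterpart in Python's fcntl.flock, and the sketch below shows the general shape of an advisory lock that a map program could hold around its critical section. The helper name exclusive_lock and the lock-file path in the usage note are hypothetical, and whether a lock taken inside the map program is a full substitute for RewriteLock is not something this thread establishes.

```python
import fcntl
from contextlib import contextmanager


@contextmanager
def exclusive_lock(path):
    """Hold an exclusive advisory lock on path for the duration of the block."""
    with open(path, "a") as handle:           # creates the lock file if needed
        fcntl.flock(handle, fcntl.LOCK_EX)    # blocks until the lock is free
        try:
            yield handle
        finally:
            fcntl.flock(handle, fcntl.LOCK_UN)
```

Usage would look like "with exclusive_lock('/path/to/map.lock'): ..." wrapped around the read-lookup-reply step. The lock is advisory, so it only helps if every process touching the shared resource takes it.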