WebmasterWorld: Apache Web Server Forum

    
mod rewrite + MySQL = clean URIs
Having trouble getting mod_rewrite to process query string
StupidScript




msg:4171844
 9:51 pm on Jul 16, 2010 (gmt 0)

I'm having difficulty using Apache's mod_rewrite and a simple MySQL table to generate clean URL aliases, such as is done by many CMSs (Drupal, Joomla, etc.).

I have a MySQL database table holding clean aliases for various database-stored articles.

Here's the alias table, FYI:

id int(11) auto_increment primary key
rid int(11)
type varchar(10)
alias varchar(128)

I'm running Fedora 13, Apache 2.2 with mod_rewrite, PHP 5.3

I can see mod_rewrite working from its log file, but it's not grabbing the query string, which is the only important part of the rewrite.

Here's what I want:

Request: http://example.com/index.php?param=4
After rewrite: http://example.com/article/tidy-article-title

Here's what I am getting:

Request: http://example.com/index.php?param=4
After rewrite: http://example.com/index.php?param=4

Apache's rewrite documentation indicates that setting a "RewriteCond %{QUERY_STRING} ..." is the only way to include the query string in the rewrite pattern matching, and that RewriteCond definitely limits which files are processed by mod_rewrite. However, I just cannot get Apache to include the query string in its processing. Every attempt ONLY processes the file requested, never the query.

Here's the output from mod_rewrite's log file:

- WITH RewriteCond (processes file requested with query, linked files have no query string):

init rewrite engine with requested uri /index.php
applying pattern '^[/index.php]+\?([a-zA-Z0-9]+)=([a-zA-Z0-9]+)$' to uri '/index.php'
pass through /index.php
[perdir /var/www/html/] pass through /var/www/html/index.php

- WITHOUT RewriteCond (processes file requested plus linked files):

init rewrite engine with requested uri /index.php
applying pattern '^[/index.php]+\?([a-zA-Z0-9]+)=([a-zA-Z0-9]+)$' to uri '/index.php'
pass through /index.php
[perdir /var/www/html/] pass through /var/www/html/index.php
init rewrite engine with requested uri /style.css
applying pattern '^[/index.php]+\?([a-zA-Z0-9]+)=([a-zA-Z0-9]+)$' to uri '/style.css'
pass through /style.css
[perdir /var/www/html/] pass through /var/www/html/style.css
init rewrite engine with requested uri /scripts.js
applying pattern '^[/index.php]+\?([a-zA-Z0-9]+)=([a-zA-Z0-9]+)$' to uri '/scripts.js'
pass through /scripts.js
[perdir /var/www/html/] pass through /var/www/html/scripts.js

It seems to me that "init rewrite engine with requested uri /index.php" is showing that the query string is being ignored.

It seems to me that it should report something like: "init rewrite engine with requested uri param=4" (I have never seen the report for successful processing of the query string, so I don't know what it will look like.)

For the record, here are my relevant server directives in httpd.conf (not limited by Directory):

RewriteEngine on
RewriteLogLevel 9
RewriteLog "/var/log/rewrite.log"
RewriteMap newurl prg://var/www/cgi-bin/cleanurls.php
RewriteCond %{QUERY_STRING} ^([a-zA-z0-9]+)=([a-zA-Z0-9]+)$
RewriteRule ^[/index.php]+\?([a-zA-Z0-9]+)=([a-zA-Z0-9]+)$ $(newurl:$1/$2)? [L]

Here's cleanurls.php (in ScriptAlias directory, chmod 755):

#!/usr/bin/php
<?php
// Connect (as root) to MySQL database containing aliases
include 'mysql_database_connect.php';
// So it doesn't crap out prematurely
set_time_limit(0);
// Grab STDIN (request being made of the server)
$keyboard = fopen("php://stdin","r");
while (1) {
// Read STDIN to variable
$line = fgets($keyboard);
// Match elements to use for db query
if (preg_match('/\?([a-zA-Z0-9]+)=([a-zA-Z0-9]+)$/', $line, $igot)) {
// Grab the alias (i.e. 'tidy-article-title')
$getalias = mysql_query("select alias, type from url_alias where rid = '$igot[2]'");
while($row=mysql_fetch_array($getalias)) {
// Print clean alias to STDOUT (back to the "address bar")
print $row['type'] . "/" . $row['alias'] . "\n";
}
}
else {
// Catchall ... should never get here because of the RewriteCond ...
print "$line\n";
}
}

When I include the "cleanurls.php" code in a test page (i.e. where $line = $_SERVER['REQUEST_URI'] instead of STDIN), it shows a result just as expected, grabbing the right entry from the database and substituting it for whatever was in the URI.

Also, I had to allow "root" access to the database, as the "apache" user was not making the call during server initialization, when the map file was being read into memory.

I very much appreciate any thoughts on (a) how I can obtain even more detailed debugging info and (b) something in my code or setup that might be a problem.

Thanks!

 

StupidScript




msg:4171872
 11:29 pm on Jul 16, 2010 (gmt 0)

Correction: My rewrite.log entries all include the same linked files ... I had thought that the linked files were not compared when the RewriteCond was included, but that was the result of my emptying the log file erratically. In fact, the requested file and its linked files are all three processed each time, regardless of the RewriteCond, so the second log example given above is consistent across all attempts.

And I want to add that I think the process shown above is incomplete ... I believe this first step will accomplish the goal of re-mapping the query-heavy URI to an EXTERNAL clean URI (visible in the browser), but I believe that I will need a second INTERNAL process to remap the alias from the db back to the actual content, and that second process is not included here.

1: query => (visible to user) clean (but invalid)
2: clean => (hidden from user) valid

Without the second mapping, the alias won't mean anything to the server.

But I am less concerned with that, right now, and confident that I will be able to make that work. At this point, my primary concern is getting the query string parsed by the rewrite functions. When I get everything running, I'll post the whole shebang.

jdMorgan




msg:4171883
 12:06 am on Jul 17, 2010 (gmt 0)

RewriteRule cannot see query strings, and your use of [index.php] was incorrect. [A-z] also probably won't work (uppercase A, lowercase z). Try something like this instead.

RewriteEngine on
RewriteLogLevel 9
RewriteLog "/var/log/rewrite.log"
RewriteMap newurl prg://var/www/cgi-bin/cleanurls.php
RewriteCond %{QUERY_STRING} ^([a-z0-9]+)=([a-z0-9]+)$ [NC]
RewriteRule ^/(index\.php)?$ ${newurl:%1/%2}? [L]

Note that using the [NC] flag with the [a-z] pattern makes the comparison case-insensitive and is 50% faster.

The rule pattern now matches requests for either example.com/index.php or just example.com/ (with appropriate query string appended). The back-references are now to the RewriteCond pattern matches.
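
For anyone following along, the pattern problems Jim names can be checked outside Apache. The sketch below uses PHP's preg_match (PCRE, which as far as I know is also the regex library behind mod_rewrite in Apache 2.x). It is only an illustration: the test strings and variable names are made up, and the last two lines use parse_url merely to show that the URL-path and the query string are separate pieces.

```php
<?php
// Square brackets form a character class, so [/index.php]+ means
// "one or more of the characters / i n d e x . p h", in any order.
$class = '#^[/index.php]+$#';
var_dump(preg_match($class, '/index.php'));    // int(1)
var_dump(preg_match($class, '/hidden.pipe'));  // int(1), not what was intended

// Jim's replacement pattern matches only "/" or "/index.php".
$fixed = '#^/(index\.php)?$#';
var_dump(preg_match($fixed, '/index.php'));    // int(1)
var_dump(preg_match($fixed, '/hidden.pipe'));  // int(0)

// A-z spans ASCII 65-122, which also covers [ \ ] ^ _ and the backtick.
var_dump(preg_match('/^[A-z]+$/', 'foo_bar'));   // int(1)
// a-z with the case-insensitive flag (the PCRE analogue of [NC]) does not.
var_dump(preg_match('/^[a-z]+$/i', 'foo_bar'));  // int(0)

// The query string is not part of the URL-path that RewriteRule is matched against.
var_dump(parse_url('/index.php?param=4', PHP_URL_PATH));   // string(10) "/index.php"
var_dump(parse_url('/index.php?param=4', PHP_URL_QUERY));  // string(7) "param=4"
```

So the original pattern could never match: the "?param=4" part is simply not in the string that RewriteRule sees, which is why the test has to move into RewriteCond %{QUERY_STRING}.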

Jim

jdMorgan




msg:4171886
 12:20 am on Jul 17, 2010 (gmt 0)

Note also that your script will be started when the server starts, and it should run forever. The script must handle all errors itself, and do so gracefully. In other words, this script must never be allowed to 'die' under any circumstances. Also, consider whether you want that database connection to be persistent -- Either it will be connected 'forever' or you should open it, do the URL lookup, and then close it immediately.
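
A minimal sketch of a map-program loop that follows this advice is shown below. It is an assumption-laden illustration, not the poster's script: serve_map and the $lookup callable are hypothetical names, and the database query is deliberately left out so the loop itself can be read on its own. The one-line-in, one-line-out protocol and the literal string NULL for a failed lookup are taken from the mod_rewrite documentation for prg: maps.

```php
<?php
// Hypothetical helper: answer one mod_rewrite map request per input line.
// $in and $out are stream resources; $lookup is any callable that takes the
// lookup key and returns the rewritten value, or null when nothing matches.
function serve_map($in, $out, $lookup)
{
    // fgets() returns false at end of input, i.e. when Apache closes the pipe.
    while (($line = fgets($in)) !== false) {
        // The key arrives with a trailing newline; strip it before use.
        $key = rtrim($line, "\r\n");
        try {
            $value = call_user_func($lookup, $key);
        } catch (Exception $e) {
            // A failed lookup must not kill the long-running process.
            $value = null;
        }
        // Exactly one line of output per line of input; "NULL" means "no match".
        $reply = ($value === null || $value === '') ? 'NULL' : $value;
        fwrite($out, $reply . "\n");
        fflush($out);
    }
}
```

Run as the real map program, it would be called as serve_map(STDIN, STDOUT, $lookup), with $lookup wrapping the alias query and its own reconnect logic. Note that this only catches exceptions; fatal PHP errors would still end the process.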

Also consider whether you're doing this 'right side up' or backwards... The URL is defined by what appears on your published HTML pages as a link. This should be the 'pretty' URL. It would be more usual to have the mod_rewrite code accept the pretty URLs, look up the correct script-calling parameters, and then call the script. Or to simply invoke the main script, and let it get the appropriate 'query' parameters to serve requests for that pretty URL from the database.

The usual usage for the construct you've described here is to speed up re-indexing of your site. That is, it would be more usual to take the query-string URL, look up the pretty URL, and then 301 redirect to it -- but only if that query-string URL is being requested by a client and not as the result of a previously-executed internal rewrite. However, this is a third -and optional- clean-up step, not useful until all of the URLs published as links on your HTML pages are 'pretty'.

This may be confusing, so I'll summarize:
  • Modify your script to publish pretty URLs as links on your pages.
  • Create mod_rewrite code to internally rewrite all pretty URL requests to your script (The script can look up what were originally query string parameters in the database, using the requested pretty URL as the index).
  • Optionally, externally redirect all client requests for old query string URLs to the new pretty URLs, using your code, as modified above, but changing the RewriteRule to specify a permanent redirect -- e.g. RewriteRule ^/(index\.php)?$ http://www.example.com/${newurl:%1/%2}? [R=301,L]
    This last step is only useful to speed up the change-over from old to new URLs listed in search results. It's useless (and even counter-productive) without the first two steps.
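
The first two steps can be sketched on the PHP side as follows, assuming the rewrite simply hands every /type/alias request to the main script and the script does the lookup itself. The function names pretty_url and parse_pretty_path are hypothetical, the sample values are invented, and the database lookup is left out.

```php
<?php
// Step 1: build the pretty link that gets published in the HTML.
function pretty_url($type, $alias)
{
    return '/' . rawurlencode($type) . '/' . rawurlencode($alias);
}

// Step 2: split a requested pretty URL-path back into array(type, alias).
// Returns null when the path does not consist of exactly two segments.
function parse_pretty_path($requestUri)
{
    // Drop any query string; only the URL-path is of interest here.
    $path = parse_url($requestUri, PHP_URL_PATH);
    if (!is_string($path) || !preg_match('#^/([^/]+)/([^/]+)/?$#', $path, $m)) {
        return null;
    }
    return array(rawurldecode($m[1]), rawurldecode($m[2]));
}

echo pretty_url('article', 'this-is-an-article'), "\n";
// prints /article/this-is-an-article

var_dump(parse_pretty_path('/article/this-is-an-article'));
// array with "article" and "this-is-an-article"

var_dump(parse_pretty_path('/index.php?article=11'));
// NULL, because "/index.php" is not a two-segment pretty path
```

In the main script, parse_pretty_path($_SERVER['REQUEST_URI']) would supply the type and alias used as the key for the database lookup.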

Jim

StupidScript




    msg:4172107
     5:29 pm on Jul 17, 2010 (gmt 0)

    Excellent! I am most grateful for your response, Jim. I see how {QUERY_STRING} is used, now.

    Thanks, also for the summary. I now see that one of my goals is to prevent the query string from becoming part of the public record in the first place. I should only need to map from the alias to the dynamic address, and not back and forth, as I will be showing only the alias to the public.

    I'll be back with the working code.

    StupidScript




    msg:4173520
     6:05 pm on Jul 20, 2010 (gmt 0)

    Thanks to Jim Morgan, here's code that works for me on Fedora 13 Linux, running Apache 2.2, PHP 5.3 and MySQL 5.1.47:

    Using Apache's RewriteEngine, PHP and MySQL to produce clean URIs from dynamic content.

    Concept:
    - Articles are stored in a database
    - Page request is an alias of that resource
    - Requested alias is checked against database table
    - Leaving the alias in the browser address bar, display dynamic resource

    This takes an aliased URI like:
    http://www.example.com/article/this-is-an-article

    And grabs this actual URI from the server:
    http://www.example.com/index.php?article=11

    It leaves the original URI in the browser's address bar, so the visitor/search engine doesn't know any different.

    Note: All visible links should be aliases. In other words, there should be NO links with queries in them. Every request is looked up in the alias table. Static pages are not re-mapped.

    Table 'url_alias':

    id int(11) primary key auto_increment
    rid int(11)
    type varchar(10)
    alias varchar(128)

    Sample table data:

    id = 11;
    rid = 11;
    type = 'article';
    alias = 'this-is-an-article';

    httpd.conf (server config, not Directory or Virtual):

    # start mod_rewrite
    RewriteEngine on
    # find processing script here
    RewriteMap newurl prg://var/www/cgi-bin/cleanurls.php
    # enable lock file while rewriting (protection against collisions)
    RewriteLock /var/lock/map.newurl.lock
    # if RewriteRule matches, check for specific case
    RewriteCond %{REQUEST_URI} ^/article.*
    # any URI matches, and is then tested by the RewriteCond, above
    RewriteRule ^/(.*) ${newurl:$1} [L]

    /var/www/cgi-bin/cleanurls.php (chmod 755, apache user) (first line is NOT a comment! It's a 'shebang'):

    #!/usr/bin/php
    <?php
    # PHP/MySQL db connection
    include '/path/to/db_connect.php';
    # this program cannot die ... TO DO: graceful error handling needed
    set_time_limit(0);
    # assign STDIN to handler
    $keyboard = fopen("php://stdin","r");
    # loop until Apache closes STDIN (fgets returns false at end of input)
    while (($line = fgets($keyboard)) !== false) {
        # strip the trailing newline so it never reaches the query or the reply
        $line = rtrim($line, "\r\n");
        # default reply: just use the original URI (i.e. 'about.html')
        $reply = $line;
        # check for string 'chars/chars' in URI
        if (preg_match('/(.*)\/(.*)/', $line, $igot)) {
            # escape both pieces before they go into the SQL statement
            $type = mysql_real_escape_string($igot[1]);
            $alias = mysql_real_escape_string($igot[2]);
            # grab pieces for resolving dynamic resource from db table ('article' and '11')
            # matches 'alias' string, so that must be unique!
            $getalias = mysql_query("select type, rid from url_alias where type = '$type' and alias = '$alias'");
            # only build the dynamic reference when the alias was actually found
            if ($getalias && ($row = mysql_fetch_array($getalias))) {
                $reply = "/index.php?" . $row['type'] . "=" . $row['rid'];
            }
        }
        # print exactly one line per request to STDOUT (does not refresh URI in address bar)
        print "$reply\n";
    }
    ?>

    StupidScript




    msg:4173590
     8:11 pm on Jul 20, 2010 (gmt 0)

    One quick note:

    Originally in my index.php that displays the dynamic content, I had the following style and script references:

    <link rel="stylesheet" type="text/css" href="style.css" />
    <script type="text/javascript" src="scripts.js"></script>

    Using Apache's mod_rewrite as I have done, above, broke those references.

    I noticed that the PHP include files I was using on that page were coming in just fine. I also noticed that in their invocation, the paths were less-relative ... actually direct filesystem references (/var/ etc.) instead of web-directory references (/images etc.) Of course, that is how PHP include paths go ... but what if Apache was having a hard time resolving the relative paths to the style and script files? Maybe Apache needed a little more help than usual ...

    So I modified the style and script references to add a little extra help for Apache:

    <link rel="stylesheet" type="text/css" href="/style.css" />
    <script type="text/javascript" src="/scripts.js"></script>

    Both paths now include a root web directory reference ("/"), and not just the file names using paths relative to index.php.

    These changes fixed it right up.

    Just incidentally, I had originally been working on this with the target directory as an Apache alias, but I saw that the rewrite processes were appending the DocumentRoot path to the start of all the resolved paths. I moved everything into the DocumentRoot, added the leading slashes to the style and script references, and now everything works as expected.

    I think this little fix is an indication that if you are having trouble getting the rewrite paths ironed out, try removing the extra path info while you troubleshoot, and do everything from your DocumentRoot. Better to remove that particular issue than to let it tie you up while you're figuring everything out.

    g1smd




    msg:4173715
     11:37 pm on Jul 20, 2010 (gmt 0)

    Apache doesn't "have a hard time resolving the relative URLs".

    It is the browser that resolves relative URLs based on the folder level of the currently requested HTML page using the following method: take the current page's URL, strip off the page part of that URL, back to the final slash in the URL and then append the relative reference on the end.

    The cure, as you have found, is to only use references that begin with a slash when referring to images, and CSS and JS files. That's because URLs are 'used on the web'. They have no meaning inside the server. Inside the server there's only internal file paths.

    StupidScript




    msg:4173731
     12:10 am on Jul 21, 2010 (gmt 0)

    Thank you for that, g1smd.

    I still don't quite have it ironed out, though:

    /var/www/html = DocumentRoot

    /var/www/html/index.php = index

    /var/www/html/style.css = stylesheet

    "index.php" links to "style.css"

    Without the rewrite, this works fine.

    With the rewrite, additional path info required, so:

    "index.php" links to "/style.css"

    Note this is certainly not a filesystem path, but a "web" path, relative to DocumentRoot.

    How does that fit with your response, "inside the server"?

    Where is the browser's frame of reference that it can't find a file in the same directory as the requesting file? As far as the browser goes, isn't it still receiving index.php? And so doesn't the "web" relative path still apply?

    Thanks, again. I do see how it's a browser issue, despite my questions. Here's where I get confused with your post:

    (1) take the current page's URL, (2) strip off the page part of that URL, back to the final slash in the URL and then (3) append the relative reference on the end.

    (1) http://www.example.com/index.php
    (2) http://www.example.com/
    (3) http://www.example.com/style.css

    Or are you saying

    (2) http://www.example.com with no trailing slash,

    which is where the relative 'style.css' got lost (http://www.example.comstyle.css)?

    jdMorgan




    msg:4175193
     6:28 am on Jul 23, 2010 (gmt 0)

    You are greatly confusing URLs (seen in on-page links and includes by browsers and sent by browsers to servers in requests) and filepaths (used only inside servers, and totally unknown to browsers). The specific point of confusion is your impression that a browser could have any idea of DocumentRoot... It cannot, barring any catastrophic SEO/mod_rewrite/scripting errors.

    In essence, you have added a subdirectory called "/article" to all of your URLs, and put the /article-name after that. So the browser sees that "page" (resource) as "/article-name" in "/article" directory.

    Therefore, if you use a relative link on the "/article/article-name" page such as <img src="img.gif"> then as g1smd has explained, the browser will look at the current page's URL "http://example.com/article/article-name" as shown in its address bar, remove "article-name", and then request that image using the URL "http://example.com/article/img.gif".

    By using a server-relative link <img src="/img.src">, you tell the browser to remove both the page name and all subdirectory path-info from the page URL, and then append the "/img.src", making the image URL "example.com/img.src". You could of course also use a canonical URL and specify <img src="http://example.com/img.src">, in which case the browser does not refer to the currently-displayed page's URL at all.
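
The three cases described here (page-relative, server-relative, and fully qualified links) can be sketched as a small PHP function. This is a simplified illustration, not how any real browser is implemented: resolve_link is a hypothetical name, and it ignores "../", query strings, fragments and ports.

```php
<?php
// Simplified sketch of how a browser resolves a link against the page URL.
function resolve_link($pageUrl, $link)
{
    // Fully qualified URL: the page URL is not consulted at all.
    if (preg_match('#^[a-z][a-z0-9+.-]*://#i', $link)) {
        return $link;
    }
    $parts = parse_url($pageUrl);
    $origin = $parts['scheme'] . '://' . $parts['host'];
    // Server-relative link: drop the whole path of the page URL.
    if (substr($link, 0, 1) === '/') {
        return $origin . $link;
    }
    // Page-relative link: keep the page path up to and including its final slash.
    $path = isset($parts['path']) ? $parts['path'] : '/';
    $dir = substr($path, 0, strrpos($path, '/') + 1);
    return $origin . $dir . $link;
}

echo resolve_link('http://example.com/index.php', 'style.css'), "\n";
// prints http://example.com/style.css

echo resolve_link('http://example.com/article/article-name', 'style.css'), "\n";
// prints http://example.com/article/style.css

echo resolve_link('http://example.com/article/article-name', '/style.css'), "\n";
// prints http://example.com/style.css
```

The first two calls show why the same relative reference worked when the page was requested as /index.php and broke once it was requested as /article/article-name: the final slash in the page URL moved.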

    URLs are used "out on the Web" and filepaths are used only inside servers. Browsers request objects from servers by sending a URL to the server. The server 'translates' that to a request for a static file and returns that file's contents. Or it translates the requested URL to a script filepath, and the script generates the 'content' to be sent back to the browser. Or the server finds that no static file exists, and no directive is present that can be used to map the requested URL to any script, and in that case, the server generates an error response 404-Not Found.

    Nothing too difficult at all (which is why you can run a pretty decent server on a very old PC), but you've got to keep the URL-domain and the file-domain separate when thinking about things...

    Mod_rewrite lives right at the boundary of the server, just past the entrance where requests come in. It can intercept a URL request and modify the way that the server translates it to a filepath access.

    Seriously, install the "Live HTTP Headers" add-on for Firefox and play with it. Look at the browser requests followed by the server responses, using simple non-rewritten URLs on a few simple pages of a site that you understand well. And keep in mind that URLs are not files and that files are not URLs, and that in fact, they need have no correspondence at all -- It is only the action of a server that associates an incoming URL request with a filepath.

    Compare what you see in the Live HTTP Headers window to what you see in your server access and error logs. Note that the access log shows URLs, while the error log shows filepaths.

    That should take a lot of 'mystery' and apparent inconsistency out of this exercise for you, and save you tons of time and frustration... This is the 'deception' of high-level scripting languages and CMS packages: They lead people to believe that they can create Web sites without learning about servers and the HTTP protocol -- all those annoying little details. :) While that is partially true, as soon as you get into anything the least bit complicated, it turns false -- and fast. In fact, if you were to suspend all current projects for two weeks and go study the server documentation and the HTTP protocol specification --and I mean read *all* of it, I'd wager that by the end of this year, you would be *far* ahead of schedule. No more time or money wasted because of misconceptions...

    Jim

    StupidScript




    msg:4175694
     6:35 am on Jul 24, 2010 (gmt 0)

    You're probably right about the reading all of the Apache docs and HTTP specs, Jim. Honestly, though, with you and g1smd passing out the free advice, the learning is proceeding quite quickly. ;)

    FYI, the way I figured out the /style.css and /script.js issue was through the use of LiveHeaders, which have been an occasionally-useful tool for quite some time ... up to a point.

    The "light bulb" in the latest (and last) issue in this exercise came on when I read 'you have added a subdirectory called "/article" to all of your URLs', which helped me to recognize that the rewriting is more than simply finding the location of the resource.

    It was not a matter of not having the information, it was an *interpretation* of the information that was hanging me up, conceptually. No amount of book larnin' can help with that. ;) Sometimes discussing the data you have been studying to exhaustion is the best idea. You provided the interpretation that made sense of the data.

    The rewritten path BECOMES the web path, for the browser. The name, location and other identifying features of the resource being displayed are irrelevant to the browser ... it eats what the server feeds it.

    The server fed it "/article/blah-blah", not "/index.php?p=123", therefore a request for "style.css" would be relative to the "/article" directory the browser "thought" it was in.

    Simple!

    (Please forgive the anthropomorphisms. It's a mechanism I sometimes use to simplify behaviors to understand them better ... to give them a familiar context. I want to assure you that I don't *really* believe that a web browser "thinks" anything, nor that there are any actual "mysteries" involved with computer programming.)

    So .. THIS "mystery" is solved, and everything is now consistent. I may still take a few weeks to study the HTTP specs, as it's been a long, long while since I did so ... like somewhere around 20 years. Believe me, though, when I tell you that with this issue the Apache docs have been pored over again, again, and yet again ... and look where I came out! :) Not too far from the truth. Just needed some Apache doc translation, which y'all have been kind enough to pass along.

    Thanks again, guys. As far as I'm concerned, I'm done.

    g1smd




    msg:4175709
     8:34 am on Jul 24, 2010 (gmt 0)

    The rewritten path BECOMES the web path, for the browser. The name, location and other identifying features of the resource being displayed are irrelevant to the browser ... it eats what the server feeds it.

    The browser makes a URL request and the server returns content for that URL. Internally, the server might have pulled the content from a different internal path other than that implied by the URL. That is, the rewrite changes the default internal URL-path to server-filepath mapping inside the server.

    The server fed it "/article/blah-blah", not "/index.php?p=123", therefore a request for "style.css" would be relative to the "/article" directory the browser "thought" it was in.

    The browser requested "example.com/article/blah-blah" and a request for "style.css" would be relative to the "www.example.com/article/" URL the browser requested. The browser works only with URLs.

    jdMorgan




    msg:4175765
     1:41 pm on Jul 24, 2010 (gmt 0)

    And to add to that, the *rewriting* doesn't change the URL, *you* changed the URL that is published on your pages. URLs are defined and they "exist" the very moment they appear on a published Web page. This is where the "/article" path-part got added to the URL -- in the HTML on your pages.

    When analyzing how all this works, it's best (easiest) to start by looking at the HTML on your Web page. That is "where the process starts" as regards the "action" of the Web. The simplest way to get a handle on all this is to assume that the "action" starts with a click on a link on a Web page. What URL or URL-path-part is in that link? What canonical URL will the browser request (hover over it to find out) when it sees that link? And proceed from that point.

    So, a click on an /article link, or an object-include of a page-relative-linked image/js/css file based on the current article URL being displayed in the browser's address bar results in the browser sending a request to your server with that "/article" path-part in the requested URL.

    Your rule in Mod_rewrite then recognizes that as an "article" link, and rewrites that incoming request to the proper script filepath to generate a new page with that article on it and send that back to the browser...

    Really no magic here, but as I've said, it's important to recognize URLs and filepaths as two entirely-different things, and to understand the "realms" in which they are used: URLs "out there" on the Web, and filepaths "in here" -- inside the server itself.

    Once that concept "clicks," almost all of the confusion goes away... And I only say "almost" because I've just finished up on a site that had three "generations" of old shopping-cart-and-plugins URLs, and was transitioning to a fourth; Trying to keep all of those URL-types and old filepath-in-URL errors in mind while coding... well, no amount of book-learnin' can help with that! :o

    Jim

    wildbest




    msg:4175786
     2:46 pm on Jul 24, 2010 (gmt 0)

    By using a server-relative link <img src="/img.src">, you tell the browser to remove both the page name and all subdirectory path-info from the page URL, and then append the "/img.src", making the image URL "example.com/img.src".

    Hmmm... to achieve this effect I always thought relative link <img src="../img.src"> should be used?

    g1smd




    msg:4175790
     2:54 pm on Jul 24, 2010 (gmt 0)

    God no. Never use that ../ syntax. You'll drive yourself nuts trying to figure out what is relative to what.

    If the images for a particular page are located in the /var/www/yoursite/media/images/ folder on your server, just call that image using "/media/images/thatimage.png" with a leading slash when you link to it - then it doesn't matter "where" you are linking from.

    wildbest




    msg:4175792
     3:01 pm on Jul 24, 2010 (gmt 0)

    Telling the browser to remove both the page name and all subdirectory path-info from the page URL, means file image.src is in the root. There is nothing complicated here.

    g1smd




    msg:4175827
     4:48 pm on Jul 24, 2010 (gmt 0)

    Yep. So for an image in the root, simply link to "/thatimage.png" with a leading slash.

    However, for a simple life organising your files on your server, I'd suggest having the images in a folder.
