mod rewrite help

   
6:31 pm on Feb 16, 2009 (gmt 0)

5+ Year Member



Hello,

I'm trying to create an htaccess file using mod_rewrite for the following scenario:

Visitor clicks on:
http://www.example.com/somefolder/product-name.type/

Note: the '-' and '.' are important - I do have them in the url.

Once he/she clicks I want a script that is in the "somefolder" folder to get the product-name.type and generate a page:

http://www.example.com/somefolder/script.cgi?product-name.type

Note: you don't need any "id=" parameter to invoke the script (I could change that of course but I suppose it's irrelevant), just the "product-name.type" part.

So can anyone give me an example of a mod_rewrite syntax to use in this scenario...

I've toyed around for some time but I just can't get it to work.

[edited by: jdMorgan at 3:02 pm (utc) on Feb. 17, 2009]
[edit reason] example.com [/edit]

3:05 pm on Feb 17, 2009 (gmt 0)

WebmasterWorld Senior Member jdmorgan is a WebmasterWorld Top Contributor of All Time 10+ Year Member



Hi westman7, and welcome to WebmasterWorld!

Please see this thread in our forum library to get started: Changing Dynamic URLs to Static URLs [webmasterworld.com]

Our Apache Forum Charter [webmasterworld.com] also provides links to useful documents and information on how to get the most from this forum.

Jim

9:24 pm on Feb 17, 2009 (gmt 0)

5+ Year Member



Here's what I've done so far:

Options +FollowSymLinks
RewriteEngine on

RewriteRule ^([^/]+)/?$ script.cgi?f=$1 [L]

I've set the script to print the parameter it gets, and for some odd reason it prints out its name 'script.cgi'... I really have no idea what am I not doing right in the syntax?

11:46 pm on Feb 17, 2009 (gmt 0)

WebmasterWorld Senior Member jdmorgan is a WebmasterWorld Top Contributor of All Time 10+ Year Member



Your regular-expression pattern isn't specific enough. An initial request gets rewritten to /script.cgi, and then that request for /script.cgi gets rewritten to /script.cgi again (rewrites in .htaccess are recursive). On that second pass, $1 is "script.cgi", which is why your script prints its own name.

The easiest solution is to either make the pattern more specific -- for example "^([^/.]+)/?$", or to explicitly exclude /script.cgi requests from being re-rewritten by using a RewriteCond on the rule.
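
As a minimal sketch of the second option, assuming the .htaccess file sits in the same folder as script.cgi and that the script reads its value from "f=" (both taken from the code posted above), the exclusion might look like this:

```apache
Options +FollowSymLinks
RewriteEngine on

# Skip the rule when the requested URL-path already ends in /script.cgi,
# so the rewritten request is not rewritten a second time.
RewriteCond %{REQUEST_URI} !/script\.cgi$
RewriteRule ^([^/]+)/?$ script.cgi?f=$1 [L]
```

With this in place, the first pass rewrites product-name.type/ to script.cgi?f=product-name.type, and the second pass sees a URL-path ending in /script.cgi and leaves it alone.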

Note also that your pattern accepts a request_URI with or without a trailing slash. I suggest that you pick one or the other form, externally 301-redirect the non-preferred form to the preferred form, internally rewrite only the preferred form to your script, and so avoid duplicate content issues.

Jim

9:47 pm on Feb 19, 2009 (gmt 0)

5+ Year Member



Should there be any change to the syntax if, instead of script.cgi?f=$1, I make it work like script.cgi?$1? It doesn't work that way, and I don't know whether the reason is in the htaccess file.

10:35 pm on Feb 19, 2009 (gmt 0)

WebmasterWorld Senior Member jdmorgan is a WebmasterWorld Top Contributor of All Time 10+ Year Member



Whatever your script requires is what you should pass to it. If it wants a name/value pair of "f=<value>" then that's what you should use when rewriting. There's no magic here -- and the behavior will be predictable if your syntax is correct.

You didn't reply to the issue that I wrote about above. If you are using a RewriteCond testing %{REQUEST_URI} to prevent looping, then be aware that query strings will NOT be included in %{REQUEST_URI}, and can only be tested using a RewriteCond examining %{QUERY_STRING} or %{THE_REQUEST}.
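
A small sketch of the difference, assuming the script is called with a query string of the form "f=<value>" as in the code posted earlier. The rule itself is only an illustration of where each variable looks, not something the setup necessarily needs:

```apache
# %{REQUEST_URI}  holds only the URL-path, e.g. /somefolder/script.cgi
# %{QUERY_STRING} holds only the part after the "?", e.g. f=product-name.type
# %{THE_REQUEST}  holds the client's original request line, e.g.
#                 GET /somefolder/script.cgi?f=product-name.type HTTP/1.1
#
# Stop processing for script.cgi requests that already carry an f= parameter.
RewriteCond %{QUERY_STRING} (^|&)f=
RewriteRule ^script\.cgi$ - [L]
```

The "-" substitution means "leave the URL unchanged", so this rule only ends rewriting for requests that match both the pattern and the condition.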

Jim

12:05 am on Feb 20, 2009 (gmt 0)

5+ Year Member



Hello Jim,

Thanks for helping me out.

I used the following: ^([^/.]+)/?$

The RewriteCond is like totally obscure for me right now.

I think the problem is that I also pass a dot '.' in my <value> string. I'm trying to figure out a way to explicitly allow this symbol in the syntax.

Also about the trailing slash I didn't quite get it - it does work either way, so why change it? What's the logic behind the 301 redirect?

Please excuse me if my questions are totally off beat, but this is REALLY new material for me.

Thank you again for all your help!

5:40 am on Feb 20, 2009 (gmt 0)

WebmasterWorld Senior Member jdmorgan is a WebmasterWorld Top Contributor of All Time 10+ Year Member



The *problem* is that it would work either way. The result is that the same content would be accessible from the Web at two different URLs -- xyz/ and xyz, both returning the content generated by a script call for "f=xyz". Returning the same content in response to requests for two different URLs is what we call "duplicate content", and if you or anyone else links to both variants, those two URLs end up "competing" for search ranking.

Major search engines run de-duplication processes in their back-ends, but do you want to depend on them to "pick" the right URL? What if they screw up and pick the wrong one -- the one with the currently-worst ranking? What if they set the threshold incorrectly, and decide that only two URLs with the same content qualify to mark the domain as "spammy"?

I don't know about you, but I never "rely on the kindness of strangers" when my sites' ranking/revenue is at stake. It is an almost-trivial matter to pick either xyz or xyz/ as the preferred "canonical" URL, link only to that one, and redirect the non-canonical variant to the canonical URL.

If you pass a dot in the URL, which then gets rewritten into the "f=" query for use by your script, then you must use a RewriteCond here. You will also have problems because you may have to exclude "well-known files" and "non-page" (e.g. media file) URLs from being rewritten as well, because their URLs contain dots: robots.txt, sitemap.xml, labels.rdf, favicon.ico, w3c/p3p.xml, etc.


# Do not rewrite robots.txt
RewriteCond $1 !^robots\.txt$
# Do not rewrite images, stylesheets, scripts, and other static files
RewriteCond $1 !\.(gif|jpe?g|ico|css|js|pdf)$
# Do not rewrite the script itself (prevents the loop described earlier)
RewriteCond $1 !^script\.cgi$
RewriteRule ^([^/]+)/?$ script.cgi?f=$1 [L]

Jim

[edited by: jdMorgan at 5:43 am (utc) on Feb. 20, 2009]

3:39 pm on Feb 20, 2009 (gmt 0)

5+ Year Member



I was thinking, if I originally know exactly what kind of values can be called (they are static) wouldn't it be better to just set those in the htaccess file instead of trying to come up with every possible scenario as this seems more prone to bugs?

So for example in the RewriteCond I set rules to match all the possible values? Like:

RewriteCond =[dir.name1]$
RewriteCond =[dir.name2]$
etc.

Another question, partially related to the trailing slash as well.

This whole thing is located within a subfolder on my domain, e.g. example.com/somefolder/script.cgi. Should I place the htaccess in that subfolder, or use the top-level htaccess to handle things globally? In the latter case I could use just one 301 rule for the whole domain, i.e. all URIs that do not end with a slash would be redirected to ones that do:

domain.com -> domain.com/
domain.com/subfolder -> domain.com/subfolder/
domain.com/subfolder2 -> domain.com/subfolder2/
domain.com/subfolder3/subfolder4 -> domain.com/subfolder3/subfolder4/

Again, sorry if all this seems off beat. I'm not a programmer and know just some very basic stuff, so it's quite a challenge for me to get this working. I'm slowly starting to get it, but considering I got into the whole matter just a few days ago, I'll probably ask a few more dumb questions...

Thanks again for your time and help!

8:37 pm on Feb 20, 2009 (gmt 0)

WebmasterWorld Senior Member jdmorgan is a WebmasterWorld Top Contributor of All Time 10+ Year Member



The answer to both questions is that it is largely a matter of style -- and to a somewhat lesser extent, of code efficiency versus ease of administration. Consider what happens if your site is a huge success, and the number of subdirectories doubles, or quadruples, or increases by a factor of 10... or more.
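
To make the trade-off concrete, here is a rough sketch of the "list every value" style. The two product names are made up for illustration, and the rule assumes the same script.cgi?f= interface used earlier in this thread:

```apache
# Hypothetical product names, listed explicitly.
# Every new product means another entry in this alternation.
RewriteRule ^(widget\.basic|gadget-pro\.deluxe)/?$ script.cgi?f=$1 [L]
```

A list like this cannot accidentally match script.cgi or robots.txt, which is its appeal, but it has to be edited every time the catalogue changes.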

Be careful to consider all possibilities when thinking up URL-handling strategies. For example, you very likely do not want to redirect "robots.txt" to "robots.txt/" just because it does not end with a slash!
(Hint: it does not end with a slash, but it *does* contain a period in the final URL-path-part. Use both factors to decide if you want to redirect or not.)

Always look at these ideas from multiple angles, and ask yourself, "What if...?"

Jim

4:13 pm on Feb 22, 2009 (gmt 0)

5+ Year Member



OK, I've decided to go with a separate htaccess for the subfolder, and a general rule instead of listing each value separately. But I just can't find a way to create a rule for the dot. To be frank, I can't even come up with the logic for it. Can you advise?

9:04 pm on Feb 22, 2009 (gmt 0)

WebmasterWorld Senior Member jdmorgan is a WebmasterWorld Top Contributor of All Time 10+ Year Member




# Redirect to add a missing slash only if no filetype in final URL path-part
RewriteRule ^(([^/]+/)*[^/.]+)$ http://www.example.com/$1/ [R=301,L]

The pattern reads "Match anything but a slash followed by a slash, and as many of those subpatterns as possible (zero or more), followed by anything except slashes or periods." If there is a match, we take the whole matched URL-path, prefix it with the protocol and domain, add a trailing slash, and redirect.
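
As a rough illustration of that reading, here is how the pattern would treat a few URL-paths as seen from the document-root .htaccess. The path names are hypothetical, chosen only for this sketch:

```apache
# Redirect to add a missing slash only if no filetype in final URL path-part
RewriteRule ^(([^/]+/)*[^/.]+)$ http://www.example.com/$1/ [R=301,L]
#
# Hypothetical requested URL-paths and what the pattern does with them:
#   products              match     -> redirected to /products/
#   products/widget       match     -> redirected to /products/widget/
#   products/widget/      no match  (already ends with a slash)
#   products/widget.blue  no match  (period in the final path-part)
#   robots.txt            no match  (period in the final path-part)
```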

Note how the second subpattern here matches the one you are already using for the internal rewrite rule discussed above. After adding this redirect, the changes to that internal rewrite rule are to remove the question mark after the final slash in the pattern, making the slash non-optional, and to remove the RewriteConds, since they are no longer needed:


RewriteRule ^([^/]+)/$ script.cgi?f=$1 [L]

However, if you add any subdirectories at all, then they will need to be excluded from this rule, and your script/CMS must not allow real subdirectory names to be used as path names that will later become "f=" values.

That can be problematic, so you might actually want to consider either NOT using trailing slashes, or perhaps even 'tagging' all URL-path names that will become "f=" values by putting a unique directory path in front of them. This largely eliminates the possibility of "collisions" between real (sub)directory paths, and the virtual ones used to feed "f=" into your script. For example, instead of using a URL of example.com/<f=value_here>, use example.com/products/<f=value_here>. The "products" path can now be used to unambiguously determine that the "subdirectory" that follows it is an "f=" value, and not a real subdirectory.
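
A minimal sketch of that tagging idea, written for the document-root .htaccess. It assumes the script lives at /products/script.cgi and that the trailing-slash form is the one being kept; both are assumptions made for illustration:

```apache
RewriteEngine on

# Only URL-paths that start with the "products/" tag are treated as "f=" values.
# A request for the script itself (products/script.cgi) has no trailing slash,
# so it does not match this pattern and is not rewritten again.
RewriteRule ^products/([^/]+)/$ /products/script.cgi?f=$1 [L]
```

Real top-level directories such as /images/ never match, because the pattern requires the literal "products/" prefix.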

Jim

11:33 pm on Feb 22, 2009 (gmt 0)

5+ Year Member



Thanks Jim!

Yes, that's exactly how I'm doing it - with a "products" subdirectory, so these rules will apply only there (I have a separate htaccess there with these rules).

6:22 pm on Feb 25, 2009 (gmt 0)

5+ Year Member



Sorry to bother you again, but I can't figure out why it all works when I call it like example.com/products/someproduct/, i.e. with a trailing slash, but not when I leave the slash off?

6:27 pm on Feb 25, 2009 (gmt 0)

WebmasterWorld Senior Member jdmorgan is a WebmasterWorld Top Contributor of All Time 10+ Year Member



Did you add the additional rule I suggested two posts up?

Jim

8:24 pm on Feb 25, 2009 (gmt 0)

5+ Year Member



Yes.

When I try it without a dot it works both ways - with and without a trailing slash, and when it's without it does redirect to the trailing slash version.

However, when I use a dot, it'll only work if I go to the trailing slash version. Otherwise it gives a 404 error as if it were actually looking for such a file/path on the server...

8:52 pm on Feb 25, 2009 (gmt 0)

WebmasterWorld Senior Member jdmorgan is a WebmasterWorld Top Contributor of All Time 10+ Year Member



Yeah, there's a logical inconsistency with putting a dot before a slash in a URL, trying to make this code work, and canonicalize the products URLs as well.

I suggest you remove the trailing slashes from the URLs on your pages which are to be rewritten to your script, and then replace both rules with


# If requested "products" URL with trailing slash does not resolve to an existing subdirectory
# of this /products directory, externally redirect to remove the trailing slash
RewriteCond %{DOCUMENT_ROOT}/products/$1/ !-d
RewriteRule ^([^/]+)/$ http://www.example.com/products/$1 [R=301,L]
#
# Internally rewrite slashless "products" URLs which do not resolve to existing files to script.cgi
RewriteCond %{REQUEST_FILENAME} !-f
RewriteRule ^([^/]+)$ /products/script.cgi?f=$1 [L]

This will redirect any non-existent trailing-slash "products" URLs to a non-trailing-slash URL to fix up existing bookmarks and links out on the Web, and then rewrite any non-trailing-slash "products" URL that does not resolve to a physical file to your script.

I don't normally like to use file- or directory-exists checks, but since this code is only applied in one subdirectory, the performance impact shouldn't be so bad, even on a busy server.

Jim

[edit] Corrected as noted below. [/edit]

[edited by: jdMorgan at 12:10 am (utc) on Feb. 26, 2009]

10:07 pm on Feb 25, 2009 (gmt 0)

5+ Year Member



Hmm, it gives a 500 Internal Error and I'm sure it's not the script itself...

11:57 pm on Feb 25, 2009 (gmt 0)

WebmasterWorld Senior Member g1smd is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month




..
12:12 am on Feb 26, 2009 (gmt 0)

WebmasterWorld Senior Member jdmorgan is a WebmasterWorld Top Contributor of All Time 10+ Year Member



Sorry, the first line in the second rule should be a RewriteCond, not a RewriteRule. See correction to code in the post above.

When you get a 500 Server error, please report the relevant contents of your server error log.

Jim

1:32 pm on Feb 26, 2009 (gmt 0)

5+ Year Member



Thank you Jim!

It all works perfectly nice now!

I even understand the logic of the syntax now for the most part; just the ^([^/]+)/$ and ^([^/]+)$ are still kind of obscure. I mean, I get what the symbols mean, but I still need some more reading to actually understand what the whole expression means. Maybe I need to practice with a text file or something.

Anyways, thanks again buddy! You've been a true blessing!

P.S. Sorry for the server error log - I'll know better next time.

1:44 pm on Feb 26, 2009 (gmt 0)

WebmasterWorld Senior Member jdmorgan is a WebmasterWorld Top Contributor of All Time 10+ Year Member



The pattern "^([^/]+)/$" means "Start at the beginning of the requested URL-path and match one or more characters that are not slashes, save that matched sub-string in $1, and require a slash at the end."

Or equivalently, "Starting at the beginning of the requested URL-path, match all characters up to but not including the first slash found, save that matched sub-string in $1, then require a final slash."
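
For practice, here is a small hand-worked table of how the two patterns would treat a few URL-paths as they are seen in /products/.htaccess. The path names are hypothetical, and the block is comments only, so it changes nothing if pasted into a file:

```apache
# URL-path seen by the rule   ^([^/]+)/$                ^([^/]+)$
#   someproduct/              match, $1=someproduct     no match (ends with a slash)
#   someproduct               no match (no final slash) match, $1=someproduct
#   some.product/             match, $1=some.product    no match (ends with a slash)
#   some/product/             no match (slash inside)   no match (slash inside)
```

The last row fails both patterns because "[^/]+" cannot cross the slash in the middle, and nothing else in either pattern can consume it.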

Jim

7:28 pm on Feb 26, 2009 (gmt 0)

5+ Year Member



But there is one thing I don't get about "^([^/]+)/$"...

The URL-path would be http://www.example.com/products/someproduct

So why does it match only the "someproduct" part? That would make more sense to me if the rule said to start at the END. I mean, there are characters other than a slash in the part before "someproduct"... shouldn't it match only "http:"?

Another thing.

In "RewriteCond %{DOCUMENT_ROOT}/products/$1/ !-d" why is there a "$1" ? There hasn't been anything saved in "$1" yet as this is the first line, so why is "$1" there? Or does it mean "if the requested URL that has "products" in it, has a part after "products" eg. the $1, see if that part, $1, matches a directory, if it doesn't proceed further". So in this case "$1" instead of being a previously saved string is more of a variable that would later be checked if it matches a directory?

7:41 pm on Feb 26, 2009 (gmt 0)

WebmasterWorld Senior Member jdmorgan is a WebmasterWorld Top Contributor of All Time 10+ Year Member



RewriteRule patterns are matched before any RewriteConds are evaluated, and therefore, $1 *has* been populated, and can be used by any of the rule's RewriteConds... See the Apache mod_rewrite documentation "Rule Processing", please.
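
The first rule from the earlier post, annotated with the order in which mod_rewrite works through it, may make this easier to see. The step numbers are added here only for explanation:

```apache
# Step 2: evaluated only after the pattern below has matched,
#         so $1 already holds the captured path-part.
RewriteCond %{DOCUMENT_ROOT}/products/$1/ !-d
# Step 1: the pattern ^([^/]+)/$ is tested first and fills $1.
# Step 3: the substitution is applied only if the condition above held.
RewriteRule ^([^/]+)/$ http://www.example.com/products/$1 [R=301,L]
```

So although the RewriteCond is written above the RewriteRule, it is read as "if the rule's pattern matches, and this condition is true, then apply the substitution".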

URL-paths "seen" by RewriteRules in an .htaccess file are localized to that .htaccess file's directory. Therefore, the entire URL-path seen by a RewriteRule in /products/.htaccess when "/products/someproduct/" is requested by the client will be "someproduct/" -- again, this is a documented behavior.

This is the reason that RewriteRule patterns in code for use in httpd.conf or conf.d --at the server config level, that is-- must include the leading slash, whereas in "/.htaccess" the rule pattern cannot include it... That's because the URL-path has been localized for use in /.htaccess, and the path to the current htaccess file's directory *is* "/". So the leading URL-path slash is removed by this localization process.
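
A sketch of the same rewrite written for each of the three locations. The name "widget" is hypothetical, the script location /products/script.cgi is taken from the code earlier in this thread, and "RewriteEngine on" is assumed to be set already. Each rule belongs in a different file, so only one of them would be used:

```apache
# httpd.conf, server or <VirtualHost> context:
# the pattern sees the full URL-path, so it includes the leading slash.
RewriteRule ^/products/widget$ /products/script.cgi?f=widget [L]

# /.htaccess in the document root:
# the leading slash has been removed by localization.
RewriteRule ^products/widget$ /products/script.cgi?f=widget [L]

# /products/.htaccess:
# the "products/" prefix has been removed as well.
RewriteRule ^widget$ /products/script.cgi?f=widget [L]
```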

Jim