confused with examples of rules for seo friendly urls

         

Skhan00

6:18 pm on Jul 1, 2009 (gmt 0)

10+ Year Member



OK, so I am in the process of learning rewrite rules to make SEO-friendly URLs.

While reading many different sites found through Google to get an overall understanding, I am trying to grasp the following about leading slashes.

I have seen the following example:

RewriteRule ^articles/(.*)\.html$ /articles?$1

Wouldn't this cause a double slash in the URL once it's redirected?

Let's say someone requests
www.domain.com/articles/123.html
it would then go to
www.domain.com//articles?123

I have run across a few sites that do this.

some do it this way:
RewriteRule ^/articles/(.*)\.html$ /articles?$1

and some do it this way:
RewriteRule ^articles/(.*)\.html$ articles?$1

Which one is correct? What are the advantages/disadvantages of each of the above?

[edited by: Skhan00 at 6:24 pm (utc) on July 1, 2009]

g1smd

6:40 pm on Jul 1, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Wouldn't this cause a double slash in the URL once it's redirected?

No, because that's an internal rewrite, not an external redirect. It takes a URL-path that matches the pattern on the left and fetches the content from the internal filepath given on the right. The target is an internal filepath, not a URL.

The rule needs an [L] flag to be added at the end.

Using the leading slash means the file will be looked for in the web root.

The (.*) pattern is very inefficient, needing multiple backoff-and-retry operations. Use ([^.]+) or similar instead.
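
A rough way to see the difference between the two capture patterns, sketched with Python's `re` module as a stand-in for the regex engine Apache uses (an assumption; the engines differ in details, but these simple patterns behave the same way). The sample paths other than articles/123.html are made up for illustration.

```python
import re

# Patterns written for a per-directory (.htaccess) context,
# where the leading slash has already been stripped.
greedy = re.compile(r"^articles/(.*)\.html$")
negated = re.compile(r"^articles/([^.]+)\.html$")

def capture(pattern, path):
    """Return the first captured group, or None if the path does not match."""
    m = pattern.match(path)
    return m.group(1) if m else None

samples = ["articles/123.html", "articles/a.b.html", "articles/123.htm"]
for path in samples:
    # Print what each pattern captures for the same input.
    print(path, capture(greedy, path), capture(negated, path))
```

The negated class stops at the first dot, so the engine does not have to swallow the whole string and then back off to find ".html". As a side effect it also rejects names that contain extra dots, which the greedy version accepts.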

jdMorgan

6:47 pm on Jul 1, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Separate subject:

some do it this way:
RewriteRule ^/articles/(.*)\.html$ /articles?$1
and some do it this way:
RewriteRule ^articles/(.*)\.html$ articles?$1

The required form depends on where the code is to be located. In .htaccess and in <Directory> containers in server config files, the URL-path examined by RewriteRule is "localized" to the directory. So, for example, in /.htaccess, the leading slash is stripped, and the rule sees URL-paths which do not start with a slash.

If the code is located in /directory1/.htaccess, then "/directory1/" will be stripped off by the time the RewriteRule attempts to match it with a pattern.

Jim
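
A small sketch of the "localized" matching described above, again using Python's `re` as a stand-in. The `localize` helper is hypothetical and only imitates the prefix stripping; it is not how Apache implements it.

```python
import re

def localize(url_path, htaccess_dir="/"):
    """Imitate per-directory stripping: drop the directory prefix of the
    .htaccess file from the requested URL-path before matching."""
    if not htaccess_dir.endswith("/"):
        htaccess_dir += "/"
    if url_path.startswith(htaccess_dir):
        return url_path[len(htaccess_dir):]
    return url_path

server_pattern = re.compile(r"^/articles/(.*)\.html$")  # server config / <VirtualHost>
perdir_pattern = re.compile(r"^articles/(.*)\.html$")   # /.htaccess

request = "/articles/123.html"
print(bool(server_pattern.match(request)))               # server context sees the full URL-path
print(bool(perdir_pattern.match(localize(request))))     # /.htaccess sees the path without the slash
print(localize("/directory1/page.html", "/directory1"))  # what /directory1/.htaccess would see
```

So the pattern with the leading slash belongs in the server config, and the one without it belongs in .htaccess; neither form matches in the other context.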

Skhan00

7:00 pm on Jul 1, 2009 (gmt 0)

10+ Year Member



Thanks for the replies, guys. I think I understand it.

Jim, all my config files for redirects and rewrites go into my httpd subfolders that get loaded by Apache (not in .htaccess files).

So here is what I am trying to accomplish for my purposes; let me know if my method is good or bad.

Friendly URL:
http://www.example.com/cars/11010052000-P/8636/bmw-e46-m3.html

Unfriendly URL (actual dynamic url):
http://www.example.com/cars/control/prod/~pid=11010052000-P/~model=8636

RewriteRule ^/cars/([^.]+)/([^.]+)/([^.]+)\.html$ /cars/control/prod/~pid=$1/~model=$2 [NC,L]

[edited by: Skhan00 at 7:01 pm (utc) on July 1, 2009]

[edited by: jdMorgan at 7:22 pm (utc) on July 1, 2009]
[edit reason] example.com [/edit]

jdMorgan

7:21 pm on Jul 1, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Friendly URL:
http://www.example.com/cars/11010052000-P/8636/bmw-e46-m3.html

Unfriendly URL (actual dynamic url):
http://www.example.com/cars/control/prod/~pid=11010052000-P/~model=8636


While the latter *was* your former "unfriendly" dynamic URL, it is not a URL any more. Only the "/cars/control/prod/~pid=11010052000-P/~model=8636" part now exists, and it is now only a server filepath relative to DocumentRoot. It's important to understand this, as massive confusion can ensue otherwise.

Your negative-matches need some tweaking:


RewriteRule ^/cars/([^/]+)/([^/]+)/[^./]+\.html$ /cars/control/prod/~pid=$1/~model=$2 [NC,L]

Also, be aware that since you are 'discarding' the "bmw-e46-m3" part of the requested URL, an 'unfriendly' third party could link to example.com/cars/11010052000-P/8636/worst-lemon-car.html and example.com/cars/11010052000-P/8636/junk-car.html and example.com/cars/11010052000-P/8636/previously-owned-and-wrecked.html and cause you all manner of grief with duplicate-content and 'unwelcome' search results.

Best practice is to rewrite *all* variable parts of the URL-path to your script, and to validate all of them against your database. If nothing can be found using the 'required' variables, then the script must return a 404. If an entry can be found, then validate the non-required URL-path elements against what's in that database entry, and if they do not match, generate a 301-Moved Permanently redirect to the corrected URL. So in this case, if the car in the <car>.html URL-path-part is not *exactly* "bmw-e46-m3", then you'd want to redirect the request. This will prevent both careless and malicious creation of bogus URLs, and the resulting duplicate-content problems.

You will also probably want your script to check that the dynamic path was requested as a result of your rewriterule, and not as a direct client request for the old dynamic URL. If a client directly requests the old unfriendly dynamic URL, then your script should generate a 301 redirect to the corresponding new friendly static URL.

The server variable %{THE_REQUEST} can be used to check the client's HTTP request line for this purpose (you may simply want to pass it as a variable to your script).

Jim
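
The validation logic described above could look roughly like this on the script side. This is only a sketch in Python: the `PRODUCTS` table and the `resolve` function are hypothetical stand-ins for a real database lookup, and the actual application here is not written in Python.

```python
# Hypothetical product table: (pid, model) -> canonical descriptive slug.
PRODUCTS = {("11010052000-P", "8636"): "bmw-e46-m3"}

def resolve(pid, model, name):
    """Return (status, location) for a rewritten request.

    404 if the required keys are unknown, 301 to the canonical URL if the
    descriptive part does not match exactly, otherwise 200.
    """
    slug = PRODUCTS.get((pid, model))
    if slug is None:
        return 404, None
    canonical = f"/cars/{pid}/{model}/{slug}.html"
    if name != slug:
        return 301, canonical
    return 200, canonical

print(resolve("11010052000-P", "8636", "bmw-e46-m3"))       # exact match is served
print(resolve("11010052000-P", "8636", "worst-lemon-car"))  # bogus name is redirected
print(resolve("no-such-pid", "8636", "bmw-e46-m3"))         # unknown product is a 404
```

For this to work, the rewrite rule has to capture the descriptive part and pass it to the script too, rather than discarding it.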

[edited by: jdMorgan at 7:23 pm (utc) on July 1, 2009]

g1smd

7:24 pm on Jul 1, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



The rule will work, but has a major flaw. The 'bmw-e46-m3' part is not being captured by the rule and is not being passed to your script for verification.

In simple terms, if there is an incoming link to /cars/11010052000-P/8636/this-product-is-over-priced-junk.html, your site would return the same content: duplicate content. Instead, it should 301 redirect to the correct URL.

Skhan00

1:47 pm on Jul 2, 2009 (gmt 0)

10+ Year Member



Thanks for the info. I'm not sure I completely understand the %{THE_REQUEST} thing; I've been trying to learn about it through Google searches.

Is this how it should look?

Friendly URL: http://www.example.com/cars/11010052000-P/8636/bmw-e46-m3.html

RewriteRule ^/cars/([^/]+)/([^/]+)/([^./]+)\.html$ /cars/control/prod/~pid=$1/~model=$2/~name=$3 [NC,L]

RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /cars/control/prod/? [NC]
RewriteRule ^/cars/control/prod/~pid=([^/]+)/~model=([^/]+)/~name=([^/]+)$ /cars/$1/$2/$3\.html [R=301,NC,L]

jdMorgan

4:06 pm on Jul 2, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member




RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /cars/control/prod/~pid=([^/]+)/~model=([^/]+)/~name=([^/\ ]+)\ HTTP/
RewriteRule ^/cars/control/prod/~pid=[^/]+/~model=[^/]+/~name=[^/]+$ http://www.example.com/cars/%1/%2/%3\.html [NC,R=301,L]

The two lines appear to be quite redundant, and this is intentional. We are trying to detect direct client requests for the "unfriendly" URL, and to redirect only those.

I am not sure that the "~name=" field will be present in your "unfriendly" URL, because you did not show it in your rule or in your unfriendly-URL example in a previous post. If it is not present, then this rule will not work. In fact, it won't be possible to do the redirect using mod_rewrite alone, and you will have to do it in your script -- by looking up the correct "~name=" value in your database using the pid, and then doing a 301 redirect from within your script itself. However, you will have to check THE_REQUEST in your script just as shown in the RewriteCond here, in order to prevent an 'infinite' loop resulting from interaction with your internal rewrite rule.

Jim
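
To illustrate why the check is done against the request line rather than the current URL-path, here is a sketch in Python. The `redirect_target` function is hypothetical, and the pattern is slightly tightened compared with the RewriteCond above (spaces are excluded from every field). The point is that an internal rewrite changes the path the server works on but never changes the line the client originally sent.

```python
import re

# The request line as sent by the client, e.g. "GET /path HTTP/1.1".
DIRECT = re.compile(
    r"^[A-Z]{3,9} /cars/control/prod/~pid=([^/ ]+)/~model=([^/ ]+)/~name=([^/ ]+) HTTP/"
)

def redirect_target(request_line):
    """Return the friendly URL for a direct client request for the old
    dynamic URL, or None when the client asked for something else."""
    m = DIRECT.match(request_line)
    if m is None:
        return None
    pid, model, name = m.groups()
    return f"http://www.example.com/cars/{pid}/{model}/{name}.html"

# A client asking for the old dynamic URL directly gets a redirect target.
print(redirect_target(
    "GET /cars/control/prod/~pid=11010052000-P/~model=8636/~name=bmw-e46-m3 HTTP/1.1"
))
# A client asking for the friendly URL does not, even after an internal rewrite.
print(redirect_target("GET /cars/11010052000-P/8636/bmw-e46-m3.html HTTP/1.1"))
```

Because the second request line still shows the friendly URL after the internal rewrite, the redirect never fires for it, and the two rules cannot chase each other.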

[edited by: jdMorgan at 4:07 pm (utc) on July 2, 2009]

Skhan00

5:44 pm on Jul 2, 2009 (gmt 0)

10+ Year Member



Thanks for the help, g1smd & jdMorgan.

I will be off to test this out.

Skhan00

1:26 pm on Jul 14, 2009 (gmt 0)

10+ Year Member



Hey, I just got my apache server set up locally to test this, but have not been able to get it to work:

this is what i have in my vhosts:

RewriteRule ^/cars/([^/]+)/([^/]+)/([^./]+)\.html$ /cars/control/prod/~pid=$2/~model=$3/~name=$1 [NC,L]

This gives me a 404 error. I figured maybe there is something wrong with the rule, but if I change the [NC,L] to [R=301,NC,L] it does do a 301 redirect.

Skhan00

3:04 pm on Jul 14, 2009 (gmt 0)

10+ Year Member



Hm, when I look at the error log, it tells me the file does not exist:

C:/www/beta.example.com/htdocs/cars/control/prod

The URL that it's internally rewriting to is not a folder structure, but rather dynamic pages created by the web app that's running.

When I specify [R] it works fine, but without it, it does not.

jdMorgan

5:57 pm on Jul 14, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Ok, so this filepath is wrong:
C:/www/beta.example.com/htdocs/cars/control/prod

What should the filepath be to point directly to your web app? Correct the substitution path in the RewriteRule accordingly, and it will work.

You've likely got another RewriteRule, Alias, or ScriptAlias directive that would map "/cars/control/prod/~pid=11010052000-P/~model=8636" to the actual script or application path, but that rule/alias is being executed *before* your rewriterule, and not after. Therefore, it won't apply to these requests. So the trick is to reduce the process from a two-step process to a single-step process by rewriting straight to the actual script or application path.

Jim

Skhan00

6:30 pm on Jul 14, 2009 (gmt 0)

10+ Year Member



Hmm you just lost me.

Here is the virtual host conf:


<VirtualHost *:80>
ServerAdmin admin@example.com
DocumentRoot "C:/www/beta.example.com/htdocs"
ServerName beta.example.com
ServerAlias www.beta.example.com
ErrorLog "C:/www/beta.example.com/logs/error.log"
CustomLog "C:/www/beta.example.com/logs/access.log" common

DirectoryIndex index.cfm index.htm index.html
<Directory />
Options Indexes FollowSymLinks
AllowOverride All
Order allow,deny
Allow from all
</Directory>

ProxyPreserveHost On
ProxyPass / ajp://localhost:8009/
ProxyPassReverse / ajp://localhost:8009/

RewriteEngine On

RewriteRule ^/cars/([^/]+)/([^/]+)/([^/]+)\.html$ /cars/control/prod/~pid=$3/~color=$2/~name=$1 [NC,L]
</VirtualHost>

Unfriendly URL: [beta.example.com...]
In my browser, I can get this to work; the web app picks it up just fine (this is the unfriendly URL).

Now when i put in the friendly URL:
[beta.example.com...]

I get the above apache log error with the folder /cars/control/prod not existing.

if i modify the rewriterule to look like this

RewriteRule ^/cars/([^/]+)/([^/]+)/([^/]+)\.html$ [beta.example.com...] [P,NC,L]

With the proxy parameter and the full URL path, it works fine, but I'm not sure if that's a good or bad way to do it. I am confused by what you said in your last post (I read it over and over and couldn't grasp what it meant).

also on this code here

RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /cars/control/prod/? [NC]
RewriteRule ^/cars/control/prod/~pid=([^/]+)/~color=([^/]+)/~name=([^/]+)$ [beta.example.com...] [R=301,NC,L]

If I put that in, it causes an infinite loop. How can I properly test the condition to make sure it doesn't loop, so that the unfriendly URL gets 301'd to the friendly one? I've been trying to find a good tutorial on how to test this, with no luck.

[edited by: Skhan00 at 6:49 pm (utc) on July 14, 2009]

Caterham

6:40 pm on Jul 14, 2009 (gmt 0)

10+ Year Member



Change the L flag to PT, or turn your rule into a reverse-proxy rule (P flag).

with the proxy parameter with the full url path,

Why do you want to use the HTTP protocol there when your ProxyPass uses AJP?

RewriteRule ^/cars/([^/]+)/([^/]+)/([^/]+)\.html$ ajp://localhost:8009/cars/control/prod/~pid=$3/~color=$2/~name=$1 [NC,P]

Skhan00

6:48 pm on Jul 14, 2009 (gmt 0)

10+ Year Member



Hey Caterham, the web application that is running is Apache OFBiz.

In order to get ofbiz running on an already configured Apache server, I followed a tutorial to add

ProxyPreserveHost On
ProxyPass / ajp://localhost:8009/
ProxyPassReverse / ajp://localhost:8009/

Which allows me to use it on port 80, without any conflicts.
----

I tried the following:

RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /cars/? [NC]
RewriteRule ^/cars/([^/]+)/([^/]+)/([^/]+)\.html$ ajp://localhost:8009/cars/control/prod/~pid=$3/~color=$2/~name=$1 [NC,L,P]

RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /cars/control/prod/? [NC]
RewriteRule ^/cars/control/prod/~pid=([^/]+)/~color=([^/]+)/~name=([^/]+)$ /cars/$3/$2/$1\.html [R=301,NC,L]

This seems to work correctly. I can go from the friendly URL to the unfriendly one with an internal request,

and I can also 301 from the unfriendly to the friendly without it looping.

Is this method good or bad?

Caterham

7:46 pm on Jul 14, 2009 (gmt 0)

10+ Year Member



that worked without using the P parameter. thanks

That is _not_ a good idea, because you won't have a reverse proxy in that case.

now to solve the RewriteCond to prevent loop... I am still lost there.

There can't be a loop if you set everything up correctly (no need to check THE_REQUEST via a condition), unless you're reverse proxying to the same location, here beta.example.com, port 80 (which is silly and wastes resources).

jdMorgan

7:50 pm on Jul 14, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Looks OK to me, but [L] used with [P] is redundant, so not needed.

Also, using [NC] in the first rule means that you can have multiple URLs resolving to the same content. This is "duplicate content" and not recommended, SEO-wise. If you really think you might get an incorrectly-cased request, then I'd suggest that you detect it and 301-redirect it to the properly-cased URL using a separate rule.

Jim
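
One way to picture the "detect it and redirect" idea, sketched in Python with a hypothetical `route` function. It only normalizes the fixed parts of the URL ("/cars/" and ".html"); the case of the variable parts would still have to be checked by the script against the database.

```python
import re

FRIENDLY = re.compile(r"^/cars/([^/]+)/([^/]+)/([^/]+)\.html$")       # case-sensitive
FRIENDLY_NC = re.compile(r"^/cars/([^/]+/[^/]+/[^/]+)\.html$", re.I)  # behaves like [NC]

def route(path):
    """Serve only the exact-case URL; 301 any other casing of the fixed
    parts to it instead of serving the same content twice."""
    if FRIENDLY.match(path):
        return 200, path
    m = FRIENDLY_NC.match(path)
    if m:
        return 301, f"/cars/{m.group(1)}.html"
    return 404, None

print(route("/cars/11010052000-P/8636/bmw-e46-m3.html"))
print(route("/Cars/11010052000-P/8636/bmw-e46-m3.HTML"))
```

The case-insensitive pattern is used only to decide where to redirect, so every variant ends up at a single URL that actually serves content.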

Skhan00

7:52 pm on Jul 14, 2009 (gmt 0)

10+ Year Member



Caterham: I edited my post above.

As far as not looping forever goes: how will it not loop forever without a condition?

If there is no condition, it will go from unfriendly to friendly, then friendly to unfriendly, and so on...?

jdMorgan: thanks for the update. So it should look like this?

RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /cars/? [NC]
RewriteRule ^/cars/([^/]+)/([^/]+)/([^/]+)\.html$ ajp://localhost:8009/cars/control/prod/~pid=$3/~color=$2/~name=$1 [P]

RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /cars/control/prod/? [NC]
RewriteRule ^/cars/control/prod/~pid=([^/]+)/~color=([^/]+)/~name=([^/]+)$ /cars/$3/$2/$1\.html [R=301,NC,L]

jdMorgan

8:03 pm on Jul 14, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Beyond the case that Caterham describes above, looping is mostly a problem with .htaccess code, due to the way that it is re-executed until no more rules match. In a server config file, you need only be concerned about explicit loops.

Jim

Skhan00

6:11 pm on Aug 26, 2009 (gmt 0)

10+ Year Member



Bumping this because I now have an issue with JSESSIONID sessions being lost.
On every request to the product pages using the SEO-friendly URLs, a new session is created, and I am having trouble fixing it.

BTW, the application running the frontend is Apache OFBiz.

jdMorgan

7:03 pm on Aug 26, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



How are the sessions "remembered"?
Do you use cookies, or pass the sessionID as a query string appended to the URL?

Jim

Skhan00

3:09 pm on Sep 19, 2009 (gmt 0)

10+ Year Member



Hey Jim,

I had deployed this on a live environment inside a subdomain for testing.
Session IDs are set as a hidden attribute between the server and client. We had removed them from the URL for cleaner-looking URLs.

The problem here is that every time a friendly URL is triggered, the old session is not found, so a new session is created. This does not occur when browsing the site through unfriendly URLs.

here is the rule:

#Rewrite for cars product
RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /cars/? [NC]
RewriteRule ^/cars/([^/]+)/([^/]+)/([^/]+)\.html$ /carsusa/control/prod/~pid=$3/~color=$2/~name=$1 [L]

When this is done, i get a 404 error. I asked my host to take a look and this was their reply
----
The rewrite rule you wrote is correctly formed, and works. However,
the problem appears to be an issue regarding redirecting into the ofbiz
container. Here is what we see in the log:

[Fri Sep 18 21:45:04 2009] [error] [client IP.ADDRESS.XX] File does not
exist: /var/www/domains/example.com/beta/htdocs/carsusa

I suspect that apache is trying to find a file by the name generated by
the rewrite rule, disregarding the JkMount. I suspect that rewrite
rules are evaluated after JkMounts are checked, and the presence of a
JkMount is not checked afterward.
----

I then changed the rule to
#Rewrite for cars product
RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /cars/? [NC]
RewriteRule ^/cars/([^/]+)/([^/]+)/([^/]+)\.html$ /carsusa/control/prod/~pid=$3/~color=$2/~name=$1 [L,PT]

and this worked but the issue with the sessions being lost started.

jdMorgan

3:57 pm on Sep 19, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Again, how is the session number/information maintained? "Session IDs are set as a hidden attribute between the server and client" is not a technical description that has any specific meaning here. How is that "hidden attribute" passed? I could assume that it's passed in a cookie, but assumptions very often lead to a lot of wasted effort here...

Note that your host identified the same problem that I did above: the execution order of mod_proxy and mod_rewrite.

Jim

Skhan00

1:32 pm on Sep 21, 2009 (gmt 0)

10+ Year Member



Sessions are stored within cookies with JSESSIONID values that are associated with all of the variables that contain the user's information, including the cart, login information, etc. Upon redirect from Apache, users get a brand new session and lose all of their previous values because they are assigned a new JSESSIONID.

Also, I tried putting the rewrite rules before and after the JkMount info; it didn't make a difference.

jdMorgan

1:39 pm on Sep 21, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Neither a redirect, a proxy pass-through, nor any other 'action' of a rule will modify a cookie, since it is stored client-side. I'd suggest you look into your scripts' cookie handling and, in particular, make sure that the declared 'realm' of the cookie (its domain and path) still matches after any redirects or proxy pass-throughs are done. Cookies, when defined, specify the domain and URL-paths for which the client will send them back to the server, and if the paths don't match, the cookie won't be sent.

Carefully examining your client-server HTTP transactions using the "Live HTTP Headers" add-on for Firefox/Mozilla browsers (or a similar tool) may prove quite revealing in this matter.

Jim
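
The path rule that browsers apply can be sketched as follows, in Python, following the path-match definition in RFC 6265. The cookie path used here is a hypothetical value chosen for illustration; the thread does not say what path the application actually set.

```python
def path_matches(cookie_path, request_path):
    """RFC 6265 path-match: is a cookie with this Path sent for this
    request path?"""
    if request_path == cookie_path:
        return True
    if request_path.startswith(cookie_path):
        # The prefix must end on a path-segment boundary.
        return cookie_path.endswith("/") or request_path[len(cookie_path)] == "/"
    return False

# Hypothetical: a session cookie scoped to the application's own path.
cookie_path = "/cars/control"
print(path_matches(cookie_path, "/cars/control/prod/~pid=11010052000-P/~model=8636"))
print(path_matches(cookie_path, "/cars/11010052000-P/8636/bmw-e46-m3.html"))
```

Under that assumption, the browser sends the session cookie for the unfriendly URL but not for the friendly one, so the application sees no session and starts a new one.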

Skhan00

2:37 pm on Sep 21, 2009 (gmt 0)

10+ Year Member



Hey Jim, thanks for the add-on. I see where the issue was: the app was setting the cookie path to a specific directory, and the SEO-friendly URL was not under it.

a small edit to the seo friendly url made it work.

Caterham

5:22 pm on Sep 21, 2009 (gmt 0)

10+ Year Member



I suspect that rewrite
rules are evaluated after JkMounts are checked, and the presence of a
JkMount is not checked afterward.

Nope, translate_name is not a RUN_ALL. The first module returning OK wins, others will never see the request. With PT, mod_rewrite returns DECLINED and so mod_jk has a chance to see the request.
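
A toy model of that "first module returning OK wins" behaviour, in Python. Everything here is a simplification for illustration: the hook functions, their order, and the fixed substitution path are assumptions, not Apache's actual code.

```python
OK, DECLINED = "OK", "DECLINED"

def run_first(hooks, request):
    """Call hooks in order and stop at the first one that returns OK."""
    for name, hook in hooks:
        if hook(request) == OK:
            return name
    return None

def make_rewrite(passthrough):
    def rewrite(request):
        uri = request["uri"]
        if uri.startswith("/cars/") and uri.endswith(".html"):
            request["uri"] = "/carsusa/control/prod"  # simplified substitution
            # With PT the rewritten URI is handed on to later modules;
            # without it, the rewrite step claims the request for itself.
            return DECLINED if passthrough else OK
        return DECLINED
    return rewrite

def jk_mount(request):
    # Stand-in for a JkMount covering /carsusa/*.
    return OK if request["uri"].startswith("/carsusa/") else DECLINED

for passthrough in (False, True):
    hooks = [("mod_rewrite", make_rewrite(passthrough)), ("mod_jk", jk_mount)]
    print(passthrough, run_first(hooks, {"uri": "/cars/1/2/3.html"}))
```

Without PT the rewrite step wins and the result is treated as a file path, which is where the "File does not exist" error comes from; with PT the connector gets to see the rewritten URL.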