Replacing old mod re-write rules and redirecting without slash urls

         

msalman

7:47 pm on Nov 17, 2008 (gmt 0)

10+ Year Member



Hey everyone,

I'm very new to mod_rewrite, and after reading some articles here and there I wrote two simple rules for my website. Now I'm planning to change them a bit, and I also need some help redirecting URLs without a trailing slash to URLs with one.

This is what my website's current URLs look like:
www.mywebsite.com/categoryA-1/
www.mywebsite.com/categorA/subcategoryA-3/
www.mywebsite.com/categoryA-1/articleA-1.htm

Note: the numbers in the URLs correspond to their IDs in the database

My htaccess code:

RewriteRule ([^/]+)-([0-9]+)/$ category\.php?url=$1&cid=$2
RewriteRule ([^/]+)-([0-9]+)\.htm$ article\.php?url=$1&aid=$2

I'm planning to drop the IDs from the URLs, and also the .htm extension. So the new URLs should look like:
www.mywebsite.com/categoryA/
www.mywebsite.com/categorA/subcategoryA/
www.mywebsite.com/categoryA/articleA/

I also want to redirect any url in the form
www.mywebsite.com/categoryA
to
www.mywebsite.com/categoryA/

I'm not too sure how to accomplish this. I was wondering if someone could help me; I would really appreciate it. Thanks in advance!

jdMorgan

8:57 pm on Nov 17, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Is "/CategoryA" a page, or is it a directory index listing or directory index file?

If it is a page, then the HTTP protocol says its URL should not end with a slash, and adding one is both contrary to standards and a waste of time (and bandwidth)...

There's a common myth that adding a trailing slash is somehow good, but it's wrong (and wasteful, and confusing to people trying to type in a URL).

Also, how will your script determine which database-ID entry to retrieve if the ID number is not provided in the URL?

Sorry, I'm just trying to avoid giving you the right answers to the wrong (or potentially-dangerous) questions...

Jim

msalman

12:10 am on Nov 18, 2008 (gmt 0)

10+ Year Member



^thanks for replying Jim

> Is "/CategoryA" a page, or is it a directory index listing or directory index file?
It is a dynamically created page.

> If it is a page, then the HTTP protocol says its URL should not end with a slash
I wasn't aware of this. The whole point is to remain consistent and to avoid 404 errors, and I don't want to run into duplicate-content issues either. I can have the URLs without a trailing slash, but what would happen if a user types a URL with a trailing slash? Would they be redirected correctly?

> Also, how will your script determine which database-ID entry to retrieve if the ID number is not provided in the URL?
The URLs are stored in the database and they're unique, so I'll query against them. For example, following the earlier example:

URL: www.mywebsite.com/categoryA-1/
URL Stored in the Database: categoryA

URL: www.mywebsite.com/categorA/subcategoryA-3/
URL Stored in the Database: categorA/subcategoryA

URL: www.mywebsite.com/categoryA-1/articleA-1.htm
URL Stored in the Database: articleA

Let's forget about the trailing slash for now. How would I write the other rules? Again, thanks in advance for the help!

jdMorgan

1:27 am on Nov 18, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



OK, I was just making sure you were not expecting mod_rewrite to "magically know" how to associate a text URL with a numerical database entry. Unfortunately, this is a common misconception among many posters here. As you can appreciate, this often results in the poster having to re-think his or her entire approach, so posting detailed answers is often a poor use of time if there is a big misconception at the outset.

> how would i write the other rules?

You can "accept" the page URLs without slashes by removing the trailing slash from the pattern in your rule above that rewrites them to your script:


RewriteRule ([^/]+)-([0-9]+)$ category\.php?url=$1&cid=$2 [L]

As for the redirects, the general format is:

RewriteRule ^pattern-matching-requested-URL-path$ http://www.example.com/new-URL-path [R=301,L]

You've already got examples of back-references and how to use them posted above, so I just showed the external redirect syntax in contrast to the internal rewrite syntax you've already used.

Also, note that each and every rule should have an [L] flag on it, so that rule processing will stop if the patterns match and the rule is invoked. This can be a huge CPU-time-saver, and the cases when [L] should not be used are usually extremely rare compared to those when it should.

Please have a go at coding your solutions and post your best effort, so we can stay within our charter [webmasterworld.com] as a discussion forum, and address specific questions.

If you have trouble, please post specific test URL-paths, using example.com (only) if a domain is needed, and tell us what filepath or URL you want to rewrite or redirect to, so we have a clear input-output description.

Thanks,
Jim

msalman

2:11 am on Nov 18, 2008 (gmt 0)

10+ Year Member



^thanks Jim

I understand your concerns; you didn't have to post the redirect syntax because i'm aware of that but thanks anyway.

Removing the ids and .htm extension from the urls

This is my poor attempt:

#Redirect to article.php
#Note: any category URL in this format will also be redirected to article.php i.e. www.example.com/test-abc
RewriteRule ([^/]+)-([^/]+)$ article\.php?url=$1&url2=$2 [L]

#Redirect to category.php
#this will redirect any url without a '.' to category.php
RewriteRule ^([^.]+)$ category\.php?url=$1 [L]

I'm aware of the problems in these re-write rules but I simply don't know how to differentiate them to redirect them to appropriate file without making any changes in the URLs i.e. adding the word "category" in front of category URLs and "article" in front of article URLs.

Thanking in advance for help

jdMorgan

2:26 am on Nov 18, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



OK well, you will have to provide *some* difference in the requested URLs, either by "tagging" them with "cat" or "article" or by using a different taxonomy, for example, hyphens versus commas, or *something* -- This is the "magic" problem I mentioned earlier -- The only thing that mod_rewrite has to work with is the requested URL.

Of course, you could pass all requests to the *same* script, and do all this using the database lookup if you wanted to.

Note that with the code you've got, the second rule only runs if the first rule was not invoked, and perhaps you can use that to advantage as well.

However, at this point you need to make those decisions, and settle on a workable URL scheme for today and for the future. Only after nailing down a set of URLs that can be correctly identified and forwarded to the correct script can you proceed to coding.

Now a comment in a wider scope: The rules you've got don't remove ids or extensions. *You* remove the ids and the extensions when you remove them from the links on your pages. Then the code you've posted looks at those id-less, extension-less URLs when the client (browser or robot) requests them, and internally rewrites them to the filepath of one of your scripts. So be clear on where the "removal" is taking place: It is the link on the page that defines a URL, and server-side code can only be used to send URL requests to files inside the server, or to redirect the client by sending it a redirect response and giving it a new URL. In this case, the client has to take the new URL from the redirect response, and ask a second time for what it wanted.

This is slow, and only useful for two things: to "fix" old URLs in links from other sites beyond your control and in search engine indexes, and to prevent direct client access to your scripts, forcing clients to use the 'SEO-friendly' URLs. We haven't yet gotten to either of those functions, and can't really, until the underlying problem of a workable URL *system design* is addressed.

So think about your new URL system, and the problem of allowing mod_rewrite to unambiguously identify which URLs go to which script, and then let's start over with a clean slate. Once a good URL system is in place, writing the rules is fairly trivial.

Jim

g1smd

2:50 am on Nov 18, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



When you are thinking about your URL design, be aware that Mod_Rewrite can be set to look for an exact number of characters or digits as a match. That is what happened when I recently worked with a site to change it over from long, horrible, garbage-filled, parameter-driven URLs to much simpler ones, something like:

www.example.com/
www.example.com/12/1234567
www.example.com/123/1234

It was a doddle to spot the 2/7 and 3/4 URL requests and rewrite them to the internal filepath where the content really resides.
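
As a rough sketch of that kind of fixed-length matching: the script names and query parameters below are made up for illustration, and the patterns assume a per-directory .htaccess context, where the matched path has no leading slash.

```apache
RewriteEngine On

# Two digits, a slash, then exactly seven digits, e.g. 12/1234567
# (item.php, "group" and "id" are hypothetical names)
RewriteRule ^([0-9]{2})/([0-9]{7})$ /item.php?group=$1&id=$2 [L]

# Three digits, a slash, then exactly four digits, e.g. 123/1234
# (page.php, "section" and "id" are hypothetical names)
RewriteRule ^([0-9]{3})/([0-9]{4})$ /page.php?section=$1&id=$2 [L]
```

Because each pattern is anchored at both ends and fixes the digit counts, a request such as 12/123456 matches neither rule and falls through to whatever comes next.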

.

Oh, and be aware; in your comments on top of the code examples in #:3788425 you said "redirect" each time - but the code examples you provided are both for a rewrite.

In general:
Rewrites contain two paths, no domain names, and usually end in [L].
Redirects contain one path and one full URL with domain, and end in [R=301,L].
That's a gross oversimplification, but is the quickest way to spot them.
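
To make the contrast concrete, here is a minimal sketch with one rule of each kind. The paths old-page.htm, old-page and new-page.php are placeholders, not taken from the site discussed in this thread.

```apache
RewriteEngine On

# External redirect: one path pattern and one full URL with the domain.
# The client receives a 301 response and has to make a second request.
RewriteRule ^old-page\.htm$ http://www.example.com/old-page [R=301,L]

# Internal rewrite: two paths, no domain name.
# The client never sees /new-page.php; the URL it requested stays as it was.
RewriteRule ^old-page$ /new-page.php [L]
```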

msalman

3:07 am on Nov 18, 2008 (gmt 0)

10+ Year Member



Just a quick question: if I add the IDs to the article URLs, then that would solve the ambiguity problem, right? I don't feel like embedding 'cat' or 'article' in the URLs. If adding IDs to the article URLs doesn't make the rewrite rules unambiguous or 'pretty', then I would prefer to request the same script file and do some extra work at the back end.

thanks for the help guys, i'll post my final decision tomorrow morning.

jdMorgan

4:08 am on Nov 18, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Absolutely *any* unambiguous characteristic of the requested URL can be used to 'tag' articles versus categories. That characteristic can be character-set-based, as in [0-9] for numeric, [a-z] for lowercase, [A-Z] for uppercase, or combinations of these. It can be the length of any particular path-part (for example, the first subdirectory level in the URL). It can be that the first character of the final path-part is uppercase, or the second character is a hyphen, or that it contains a tilde. It does not matter, as long as you can devise a rule that you will never have to break for as long as your site remains on-line, and one that isn't too hard to code efficiently. It also helps if it doesn't look too funky... :)

If it were my site, I wouldn't hesitate to use "article" and "category" in the URLs if those are the most accurate words for what they are, and if there aren't any synonyms more appropriate for or appealing to the audience of the site. Or if I wanted to keep the URLs shorter, and if selling articles in categories wasn't my focus, I might shorten them to "art" and "cat" or even just "a" and "c" -- There's nothing wrong with that at all. It depends on whether you want to rank for the words "article" or "category," or if they are only organizational elements of the site, basically.

Jim

msalman

2:54 pm on Nov 18, 2008 (gmt 0)

10+ Year Member



thanks Jim for your helpful comments; here's my final decision and rewrite rules:

#this rule will only request the file when it finds 'article' in the last bit of the URL i.e. /categoryA/article-abc
#stop if this request is successful, otherwise it is category so go to next rule
RewriteRule article-([^/]+)$ article\.php?url=$1 [L]

#the URLs without the word 'article' will be considered category URLs
RewriteRule ^([^.]+)$ category\.php?url=$1 [L]

I've tested these rules and they seem to be working just fine. Now, the next step is to redirect the old URLs to new URLs. What would be the best approach: adding a redirect line for each URL manually or grouping them together using some rules? Thanks

msalman

3:06 pm on Nov 18, 2008 (gmt 0)

10+ Year Member



sorry for double post, i forgot to mention that this rule

RewriteRule article-([^/]+)$ article\.php?url=$1 [L]

will not work if a user types "/test-abc/article-def/"

Because the URL has a trailing slash, the rule above will not match and the category.php file will be requested instead. With a small change, the rule will request the appropriate file:

RewriteRule article-([^.]+)$ article\.php?url=$1 [L]

I don't want to add any extra rules to redirect trailing slash URLs to non-trailing slash URLs. I'll simply strip out the trailing slash in my script.

If there is a better way to write these rules, please do help me, thank you!

g1smd

4:40 pm on Nov 18, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Just a quick note to say that you only need the \. notation for a literal dot when you are looking at the left side of any rules.

You can just put a dot "as is" on the right side.

.

*** I don't want to add any extra rules to redirect trailing slash URLs to non-trailing slash URLs. I'll simply strip out the trailing slash in my script. ***

Oh yes you do.

If someone requests a URL without the www you redirect to make them re-request it with the www included.

In the same way, if someone requests a URL with a trailing slash, you redirect to make them ask for the URL with the slash omitted.

If you fail to do these things then both versions of the URL could be indexed by search engines. They will choose which one to list in SERPs (often not the one that you would choose) and/or they could split or dilute your PageRank across the alternative URLs.

Fixing all these issues such that "close but not exactly correct" URL requests are redirected is called canonicalisation and it is an important topic to get your head around.

Have a look at this thread, where a whole load of canonicalisation rules kick in ahead of the actual rewrite: [webmasterworld.com...]

You could issue the 301 redirect from within your script by using a HEADER command, but it would probably be a lot more efficient to fix it in the .htaccess or httpd.conf files.

msalman

5:20 pm on Nov 18, 2008 (gmt 0)

10+ Year Member



^ok! I've made a few changes and here are my new rules:

Options +FollowSymlinks
RewriteEngine on

RewriteRule ^([^.]+)/$ http://www.example.com/$1 [R=301,L]
RewriteRule article-([^/]+)$ article.php?url=$1 [L]
RewriteRule ^([^.]+)$ category.php?url=$1 [L]

RewriteCond %{HTTP_HOST} ^example\.com [NC]
RewriteRule (.*) http://www.example.com/$1 [R=301,L]

Of course, I'll replace 'example' with my domain name. So, how do these look? Does the order seem OK, or does the order not matter?

g1smd

6:03 pm on Nov 18, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month




The general non-www to www redirect should always be the last redirect.

The redirects all need to be placed ahead of the rewrites.

jdMorgan

6:57 pm on Nov 18, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Clean-up and add comments:

Options +FollowSymlinks
RewriteEngine on
#
# Redirect URL requests with trailing slash to remove the slash
RewriteRule ^(.+)/$ http://www.example.com/$1 [R=301,L]
#
# Redirect non-canonical hostname requests to canonical domain
RewriteCond %{HTTP_HOST} ^example\.com [NC]
RewriteRule (.*) http://www.example.com/$1 [R=301,L]
#
# Rewrite 'article' URLs to article.php
RewriteRule ^article-(.+)$ article.php?url=$1 [L]
#
# Rewrite all other URLs which do not have a filetype to the category script
RewriteRule ^([^.]+)$ category.php?url=$1 [L]

Jim

msalman

1:22 am on Nov 19, 2008 (gmt 0)

10+ Year Member



Thank you guys for your help; I really appreciated your help and insightful comments!

maxed

2:28 am on Nov 19, 2008 (gmt 0)

10+ Year Member



You should probably also remember to redirect to a 404 from the category page all the values that get passed that do not match with your database records.

g1smd

2:51 am on Nov 19, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Absolutely! ModRewrite simply gets URL requests that are the right format fed into the script. It is then the job of the script to sanity check those requests and only send content for valid requests and to send HEADER 404 and the error message for requests that have no content to return.

jdMorgan

3:11 am on Nov 19, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



> You should probably also remember to redirect to a 404 from the category page all the values that get passed that do not match with your database records.

The basic idea is good, but let's be very careful with the terminology, here: You never want to redirect to a 404 (page) under any circumstances.

Rather, if a requested URL does not resolve to an existing resource, or if a page cannot be built and served in response to the query string parameters appended to a requested URL, then your server or script should output a 404-Not Found Status response along with a friendly, informative error page that tells human visitors how best to find what they wanted on your site.

This page can contain a link to your HTML site map (i.e. table of contents), your 'categories' page, your site search page, and your home page, as applicable and as desired.

But the critical thing is that you send a 404-Not Found or 410-Gone response code with that page, and not a 30x redirect response code.

You can make a real mess of your search engine rankings with a redirect response instead of an error response. Here's a great way to commit search-engine suicide using .htaccess :


ErrorDocument 404 http://www.example.com/error-page.html

What's wrong with that? Well, as documented, if the URL parameter of an ErrorDocument directive specifies a canonical URL (by including the protocol and the domain name) instead of a local URL-path, then the server will generate a redirect response instead of the error response that caused the ErrorDocument to be served.

Although this behaviour was intentional (so that error documents could be served from different domains if needed) and is documented, people make this error all the time.
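
For contrast, a sketch of the local URL-path form, which lets the server keep the original 404 status. The file name /error-page.html is carried over from the example above and is assumed to exist under the document root.

```apache
# Local URL-path (starts with a slash, no protocol or domain name):
# the error page is served with the 404 status intact, not via a redirect.
ErrorDocument 404 /error-page.html
```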

What's bad about it? Try a search here on WebmasterWorld for "302 Hijacking" and find out... :(

I should note that the design of ErrorDocument isn't flawed: It was intended that if a 404 error document needed to be served from a different domain, that ErrorDocument should refer to a non-existent document on that other domain, so that a 404 response would be served from that domain following the initial ErrorDocument 302 redirect from the original domain. The problem is that all of this was "invented" before search engines as we know them existed, and the search engine robots really just don't like multiple, stacked responses. It makes them dizzy or something, and they get confused.

We now return to our regularly-scheduled programming...

Jim

maxed

5:49 am on Nov 19, 2008 (gmt 0)

10+ Year Member



I just read about 302 hijacking, and I have to say it sounds very scary: it can be done by anyone to anyone, and there is nothing that can be done about it...

I would have thought all major search engines would have been able to correct something that can have such detrimental effects by now.

jdMorgan

2:49 pm on Nov 19, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Such "corrections" must be done by the search engines as a back-end process not part of spidering the site, and in addition to the straightforward analysis-and-ranking process for the pages. Therefore, as time goes on, with more and more pages added to the Web (some 7 million per day, last I heard), this approach is not scalable: At some point in the not-too-distant future, the search engines will simply "run out of time" to clean up our sites for us.

Therefore, it is not a good idea to depend on the search engines --or any outside entity-- to do your clean-up for you, and it is a really bad idea to create a mess --or allow one to be created-- if you know that a vulnerability exists and can be fixed on your server, but do nothing.

Run a tight ship, and you'll have few problems. Take a "someone else will fix the leaks for me" attitude, and your ship may sink with all hands lost.

Jim

maxed

9:22 pm on Nov 19, 2008 (gmt 0)

10+ Year Member



But how can you prevent someone 302ing/meta refreshing to your site? Isn't it out of your control?

jdMorgan

10:56 pm on Nov 19, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Yes, it's out of your control, but what someone else may or may not do is not the subject of this thread, and we won't get into it here, since it is primarily an SEO-type subject and belongs in one of the search engine forums.

My point was that you don't want your very own error pages to "302 hijack" the URLs on your site.

Jim

msalman

7:16 pm on Dec 8, 2008 (gmt 0)

10+ Year Member



hey guys,

Replying after a while. First, thank you everyone for your helpful input. I've run into an issue: it seems my host's server doesn't rewrite extensionless URLs that have no trailing slash. It shows a blank page. So I had to add a trailing slash to my rules to fix the problem. Now I'm wondering how I can redirect the URLs without a trailing slash to URLs with one. I tried the following rule:

# Redirect URL requests without trailing slash to add the slash
RewriteCond %{REQUEST_URI} !^[^.]*/$
RewriteRule (.+) /$1/ [R=301,L]

but it doesn't do the trick; or, to be precise, this rule doesn't do what I'm looking for. I want to redirect to the trailing-slash URL first and then do the rewrite, but it does the opposite, as I expected it would. I would appreciate your help. Thanks in advance!

msalman

7:39 pm on Dec 8, 2008 (gmt 0)

10+ Year Member



^never mind, i figured it out myself. Here are the new rules:

#
# Redirect non-trailing slash 'article' urls to trailing slash
RewriteRule ^([^.]+)/article-([^/]+)$ /$1/article-$2/ [R=301,L]

#
# Rewrite 'article' URLs to article.php
RewriteRule article-(.+)/$ article.php?url=$1 [L]

#
# Redirect non-trailing slash 'author' urls to trailing slash
RewriteRule ^author-([^/]+)/([^/]+)$ /author-$1/articles/ [R=301,L]

#
# Rewrite 'author' URLs to category.php
RewriteRule ^author-(.+)/articles/$ category.php?type=author&url=$1 [L]

#
# Redirect non-trailing slash 'others' urls to trailing slash
# (inside a character class only a leading ^ negates, so [^./] is enough)
RewriteRule ^([^./]+)$ /$1/ [R=301,L]

#
# Rewrite all other URLs which do not have a filetype to the category script
RewriteRule ^([^.]+)/$ category.php?url=$1 [L]

Please help me to clean it up if required, thanks.

g1smd

11:36 pm on Dec 8, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Make sure that all redirects are listed first, before all the rewrites.

Ensure that the redirects do state the domain name in the target URL, otherwise non-www requests don't get redirected to the www version in this rule. That then causes either a Duplicate Content issue, or, if another rule does that redirect then it causes a redirection chain instead.

jdMorgan

4:26 pm on Dec 9, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



This redirect-to-wrong domain problem can happen on any server when the domain is not given in a redirect directive, when ServerName is set to example.com, when UseCanonicalName is set to "on", and when www.example.com is the canonical domain you have chosen, or vice-versa.

Since UseCanonicalName and the ServerName are often set by your hosting company and may not agree with the canonical domain you have chosen, it is best practice to always state it explicitly in any redirect directives.
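
As a small illustration, here is a variant of the 'others' trailing-slash redirect from earlier in the thread, written with the hostname stated explicitly. www.example.com stands in for whichever canonical hostname you have chosen.

```apache
# Host-less form (commented out): the server fills in the hostname itself,
# which may not be the canonical one you picked.
# RewriteRule ^([^./]+)$ /$1/ [R=301,L]

# Explicit form: the redirect target always names the canonical hostname,
# whatever ServerName and UseCanonicalName are set to.
RewriteRule ^([^./]+)$ http://www.example.com/$1/ [R=301,L]
```

With the explicit form, a request for example.com/categoryA goes to http://www.example.com/categoryA/ in a single redirect instead of a chain of two.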

Jim