A very basic question about mod_rewrite

         

Patrick Taylor

10:05 am on Apr 8, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I am not a programmer, so apologies in advance...

I have a site on a domain: eg ht*p://www.mydomain.com/

Within the homepage is a link that returns the user to the top of the page. When the link is clicked, in the address bar, instead of ht*p://www.mydomain.com/ I see ht*p://www.mydomain.com/#top.

There is also a pager in the homepage - a link that takes the user to further 'pages' of dynamic content (technically still the same page but with a URL like ht*p://www.mydomain.com/index.php?s=7&np=2). They can use the pager to return to the start, at which point the homepage URL becomes ht*p://www.mydomain.com/index.php?s=0&np=2.

I am concerned that search engines will index all three URLs for the homepage, when I only want ht*p://www.mydomain.com/ to be indexed. So is the solution a mod_rewrite? Is this some code that goes in an .htaccess file?

As my programming and Apache-related knowledge is very limited, how would I go about solving the problem without spending a long time learning something I only need to do once (and possibly failing)? Is there such a thing as a total layperson's tutorial on mod_rewrite?

Regards,

Patrick

jdMorgan

4:12 pm on Apr 8, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



This all boils down to presenting static URLs on your PHP pages, where they can be clicked on by visitors and picked up by search engines, and then using mod_rewrite to translate those static links, when requested from your server, back to the dynamic URLs needed to call your script.

This would involve modifying your 'pager' and any other PHP code on your pages to present a static link, with a special case for your 'home page', and then a simple RewriteRule to change them back.

For example, using preg_replace in php, change the URL
http://www.example.com/index.php?s=7&np=2
to
http://www.example.com/s7/np2
and then use mod_rewrite to change that back again when it is requested from your server.

For the special case of http://www.example.com/index.php?s=0&np=2, you'd simply remove all the parameters in your php code, and it would access your home page without any action required by mod_rewrite.

I don't know of any beginner's guide to mod_rewrite, although there may indeed be one. The documents cited in our forum charter [webmasterworld.com] have plenty of examples of mod_rewrite code and the regular expressions needed to make them useful.

Jim

Patrick Taylor

8:41 pm on Apr 8, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Thanks for the reply. I've looked at the guides in the forum charter. Although this was, until today, uncharted ground for me, I was able to use:

Options +FollowSymLinks
RewriteEngine on
RewriteBase /
RewriteRule ^old\.html$ / [R]

... to convert old.html into ht*p://www.mydomain.com/ - which is what I'm attempting to do but with the real URLs I have, eg ht*p://www.mydomain.com/index.php?s=0&np=2

My URLs with parameters, such as the one above, are actually being crawled by search engines and their content indexed. The issue is that I want to avoid duplicate content, ie I want the starting content on the homepage to only be indexed as ht*p://www.mydomain.com/ - not:

ht*p://www.mydomain.com/index.php
ht*p://www.mydomain.com/index.php?s=0&np=2
ht*p://www.mydomain.com/#top

The other 'versions' of the homepage that the pager points to have URLs like:

ht*p://www.mydomain.com/index.php?s=7&np=2

... but this doesn't matter, as there is only one URL in existence for that piece of content - ie no duplicates - and the content is being indexed by search engines.

Quite why

RewriteRule ^index\.php\?s=0&np=2$ / [R]

doesn't convert the URL with parameters to ht*p://www.mydomain.com/ I don't know.

Patrick

Patrick Taylor

9:02 pm on Apr 8, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Actually, I'm starting to see what you mean. When the user mouses over a link in the pager, the status bar needs to show a static URL, like ht*p://www.mydomain.com, so I need to use PHP to convert the URL with parameters into a static one within the page. And then the mod_rewrite changes it back to the parametered one.

Would that deal with the duplicate content issue?

deizu

10:08 pm on Apr 8, 2005 (gmt 0)

10+ Year Member



I read up on rewrites today. RewriteRule apparently doesn't match against the query string, so that part has to be tested with a RewriteCond.

A simple version of the rule should be the following:

RewriteCond %{QUERY_STRING} ^(s=0&np=2)?$
RewriteRule ^index\.php$ / [R]
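
One detail worth illustrating here: when the substitution has no query string of its own, mod_rewrite carries the original query string over to the target. A minimal sketch, assuming the aim is simply to send /index.php?s=0&np=2 to the bare home page and that nothing else in the .htaccess rewrites / back to that parametered URL, ends the substitution with a lone question mark to drop the parameters:

```apache
# Sketch only: assumes this is the .htaccess in the site root.
RewriteEngine on

# RewriteRule sees only the path, so the query string is tested here.
RewriteCond %{QUERY_STRING} ^s=0&np=2$
# The trailing "?" clears the query string, so the redirect goes to "/"
# rather than to "/?s=0&np=2".
RewriteRule ^index\.php$ /? [R=301,L]
```

If an internal rewrite from / to index.php?s=0&np=2 is also in place, this rule would catch the rewritten request as well and the two would bounce back and forth; in that setup the condition needs to test the original request line instead.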

As far as "host.com/#top" and "host.com/" are concerned, the search robot *shouldn't* see any difference between the two, so there's no reason to have a rewrite rule for that possibility.

EDIT: Changed the RewriteCond Regex slightly in case you're using apache1.3.*

jdMorgan

3:27 am on Apr 9, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I need to php/convert the URL with parameters into a static one within the page. And then the mod_rewrite replaces it back to the parametered one.

Yes, precisely. This is how you implement "search engine friendly URLs."

Would that deal with the duplicate content issue?

Indirectly... They will fade out over time. However, once you get the stuff above working, you can remove the important ones faster using a special mod_rewrite variable: %{THE_REQUEST}

The problem with using a straight RewriteRule is that the local URL-path examined by RewriteRule is updated on any and all passes through the code. After you do any rewrite in .htaccess, control is passed back up to httpd.conf, and then back down through all .htaccess files in the new filepath, in order to check for rewrites or access restrictions on the new URL-path. This makes mod_rewrite in .htaccess appear to be recursive, and can lead to infinite-loop problems. For example, it makes it impossible to rewrite indexa.html to index.php?val=a for purposes of calling the script with a friendly URL and also redirect index.php?val=a to indexa.html in order to list a friendly URL in the search engines. If you tried to do that using only the local URL-path in RewriteRule, you'd get an 'infinite' loop.

So, the trick is to examine only the originally-requested URL-path and not the rewritten one. This can be done by examining %{THE_REQUEST}, which is the entire client request line, and looks something like what you see in your raw log files:

GET /index.php?s=0&np=7 HTTP/1.1

Assuming that your friendly URL is in the form /page/s0/np2, and bundling up all of the friendly URL-handling stuff, you'd get:


# Rewrite index page - Special case
RewriteCond %{QUERY_STRING} ^$
RewriteRule ^$|^index\.php$ /index.php?s=0&np=2 [L]
#
# Rewrite friendly URLs to index.php with query strings
RewriteCond %{QUERY_STRING} ^$
RewriteRule ^page/s([0-9]+)/np([0-9]+)$ /index.php?s=$1&np=$2 [L]
#
# Clean up search engine listings by redirecting unfriendly URLs
# (the trailing "?" on the target drops the old query string)
# Index page:
RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /index\.php\?s=0&np=2\ HTTP
RewriteRule ^index\.php$ http://www.example.com/? [R=301,L]
#
# Other pages:
RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /index\.php\?s=([0-9]+)&np=([0-9]+)\ HTTP
RewriteRule ^index\.php$ http://www.example.com/page/s%1/np%2? [R=301,L]

Jim

jdMorgan

3:32 am on Apr 9, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



And to add: Don't worry about #top

These are local anchors, and are handled only on the client (browser) side. They may appear in your logs, but only badly-broken robots will list them in search indexes. It is in fact impossible in a normal search engine to link to a URL with a local anchor in it and get that anchor indexed.

Jim

Patrick Taylor

9:39 am on Apr 9, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Thank you for the helpful replies, which I will study closely.

Regards,

Patrick

Patrick Taylor

10:45 am on Apr 9, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I've altered my pager so that links are now in the form of:

ht*p://www.mydomain.com/index.php/s7/np2

The .htaccess file is now:

#
Options +FollowSymLinks
RewriteEngine on
RewriteCond %{QUERY_STRING} ^$
RewriteRule ^$¦^index\.php$ /index.php?s=0&np=2 [L]
RewriteCond %{QUERY_STRING} ^$
RewriteRule ^s([0-9]+)/np([0-9]+)$ /index.php?s=$1&np=$2 [L]
#

(no hashes)

But when I click a link, the URL remains the same, and I see the page but with all styling and images gone.

Patrick

jdMorgan

3:36 pm on Apr 9, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



The browser probably now thinks that styles and images are located at http://www.mydomain.com/index.php/s7/image.gif. This is because you've used directory-relative links on the pages, such as <img src="image.gif">.

It is the client (browser or spider) that resolves relative links like this to absolute URLs, and it resolves them relative to what it thinks is the "current directory."

The solution is to use <img src="/image.gif"> or <img src="/images/image.gif"> (note the leading slash), which will make the browser resolve the image path relative to the root of the site. The same goes for CSS, external JavaScript, etc.

The URL in the title bar should never change from "friendly" to "unfriendly" format -- that would defeat the entire purpose of this exercise. The rewrite takes place only internal to the server, which is all that is needed.

On the other hand, the code that redirects from unfriendly to friendly uses an external 301-Moved Permanently redirect, which will show in the address bar. This is the difference between an internal rewrite and an external redirect; the external redirect is needed to "notify" browsers and SE spiders that the requested resource is now at a new URL.
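
To make the distinction concrete, here is a minimal side-by-side sketch. The file names (friendly, script.php, old-page.html, new-page.html) are hypothetical and only serve the illustration; it assumes an .htaccess file in the site root.

```apache
RewriteEngine on

# Internal rewrite: no [R] flag. The server quietly runs /script.php?page=1
# for a request to /friendly; the address bar keeps showing /friendly.
RewriteRule ^friendly$ /script.php?page=1 [L]

# External redirect: [R=301]. The server answers "301 Moved Permanently"
# and the client makes a second request for the new URL, which then
# appears in the address bar (or in the spider's URL database).
RewriteRule ^old-page\.html$ http://www.example.com/new-page.html [R=301,L]
```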

Jim

Patrick Taylor

4:22 pm on Apr 9, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I'm way out of my depth with this. All kinds of strange things are happening. My images are already ../image.jpg because my stylesheets are in a /css/ directory and if I don't do them that way, I don't see them on my local machine.

For me this is a useful introduction to mod_rewrite as a concept, but when one is bad at programming it gets out of hand.

Thank you for your patient assistance. The thing is, my 'search-engine-unfriendly' URLs are actually being indexed by search engines, and the real point of the exercise - what I'm fundamentally trying to achieve - is for this URL:

ht*p://www.mydomain.com/index.php?s=0&np=2

to be rewritten to:

ht*p://www.mydomain.com/

... so that I don't have any duplicate content. On first entry to the homepage the parameters aren't required because the database query does its work correctly. It's only when "next" is clicked in the pager that the parameters are required explicitly (for the next bunch of content). It is at this point that I need the pager link back to the initial-entry homepage to be mod_rewritten to ht*p://www.mydomain.com/ - otherwise I have different URLs pointing to the same content. I never actually require ht*p://www.mydomain.com/index.php?s=0&np=2

I can't see why "RewriteRule ^index\.php\?s=0&np=2$ / [R]" doesn't rewrite the parametered URL into the simple domain, but of course this is because I don't have a proper grasp of the topic.

Patrick

jdMorgan

5:04 pm on Apr 9, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



what I'm fundamentally trying to achieve - is for this URL:

http://www.mydomain.com/index.php?s=0&np=2

to be rewritten to:

http://www.mydomain.com/

But this is where the misunderstanding is: You want to redirect http://www.mydomain.com/index.php?s=0&np=2 to http://www.mydomain.com/ in order to clean up the search engine listings and avoid dup content, and you must rewrite http://www.mydomain.com/ to http://www.mydomain.com/index.php?s=0&np=2 in order to call your script.

A rewrite and a redirect are two utterly different things, although you can do either using mod_rewrite.

Let's walk through a request referred from Google with all code in-place and working, using one of the pager URLs as an example to see how all this works. We'll assume that the searcher finds an old URL that you'd like to change.

  • Request from user's browser: GET /index.php?s=7&np=2 HTTP/1.1, Host: www.mydomain.com
  • mod_rewrite action: Generate redirect response, redirecting from http://www.mydomain.com/index.php?s=7&np=2 to http://www.mydomain.com/s7/np2
  • User's browser: Note 301 redirect, update address bar and re-request content from new URL.
  • New request from user's browser: GET /s7/np2 HTTP/1.1, Host: www.mydomain.com
  • mod_rewrite action: Rewrite /s7/np2 to /index.php?s=7&np=2, invoking script.
  • index.php script action: Serve requested content with friendly links on the page given to the browser.
  • User's browser: Display correct content for originally-requested URL.

    Now let's look at a search engine spider doing the same thing:

  • Request from spider: GET /index.php?s=7&np=2 HTTP/1.1, Host: www.mydomain.com
  • mod_rewrite action: Generate redirect response, redirecting from http://www.mydomain.com/index.php?s=7&np=2 to http://www.mydomain.com/s7/np2
  • Spider: Note 301 redirect, update URL database and then re-request content from new URL.
  • New request from spider: GET /s7/np2 HTTP/1.1, Host: www.mydomain.com
  • mod_rewrite action: Rewrite /s7/np2 to /index.php?s=7&np=2, invoking script.
  • index.php script action: Serve requested content with friendly links on the page given to the spider.
  • Spider: Insert correct content into database for analysis and inclusion in search results.
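
The two walkthroughs above could be implemented roughly as follows. This is a sketch, assuming the friendly form /s7/np2 used in the walkthrough, an .htaccess file in the site root, and no special case for the home page:

```apache
RewriteEngine on

# Steps 1-3: external redirect, applied only when the client itself asked
# for the old dynamic URL. THE_REQUEST holds the original request line and
# is not changed by internal rewrites, so the rule further down cannot
# trigger this one. The trailing "?" drops the old query string.
RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /index\.php\?s=([0-9]+)&np=([0-9]+)\ HTTP
RewriteRule ^index\.php$ http://www.mydomain.com/s%1/np%2? [R=301,L]

# Steps 4-5: internal rewrite of the friendly URL back to the script.
RewriteRule ^s([0-9]+)/np([0-9]+)$ /index.php?s=$1&np=$2 [L]
```

Because the redirect looks at the original request line while the rewrite looks at the current URL-path, a request for /s7/np2 is rewritten once and then left alone.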

    If you only need to correct the index page, then you can use the two "special case for index page" sections of code above, and omit the two intended for all other pages.

    Note that you cannot use <img src="images/image.gif"> or <img src="../images/image.gif"> if you want this to work. You must use either <img src="/full_path_to_images_from_root/image.gif"> or the canonical <img src="http://www.example.com/full_path_to_images_from_root/image.gif">. Yes, this will break the images on your local machine, because it isn't doing the rewrite if it's not set up as a server.

    Alternately, you can set up rewrites to direct the image and script fetches to the proper "real" location, using the same techniques as used for the pages, but in reverse. This is easiest if all images and css files are located in a central place just below root. In that case, something like:


    RewriteRule /s[0-9]+/images/([^.]+)\.(gif|jpg)$ /images/$1.$2 [L]
    RewriteRule /s[0-9]+/css/([^.]+)\.css$ /css/$1.css [L]

    would take care of the problem by discarding the "friendly" subdirectory and page path "/s7/np2"

    I know this must be frustrating for you, but nothing comes free. Once you get this working --having put a lot of work into it-- you will get better search results, a cleaner-looking site, more "memorable" URLs for type-ins, and the experience of successfully implementing a new technique in order to achieve those benefits. And you will find that the next time something comes up where rewriting a URL might help, it will all be a whole lot easier. Let me tell you about my feelings the first few months I was cleaning up my sites: I'd get many, many server errors. The page would display "500-Server Error" and I'd think to myself, "Well, I guess I've only got 499 more crashes to go before I understand this stuff"... ;)

    Jim

    Patrick Taylor

    6:53 pm on Apr 9, 2005 (gmt 0)

    WebmasterWorld Senior Member 10+ Year Member



    If there wasn't a good "mod_rewrite introduction for Dummies" previously, there is now. I was surprised how little there was to find in Google searches on this - I mean for real "Dummies". It seems it's hard for gifted programmers to get into the mind of ungifted ones and stay with the very basics, even when they believe they're keeping it simple. So thanks again for a very clear explanation.

    When above I referred to:

    Options +FollowSymLinks
    RewriteEngine on
    RewriteBase /
    RewriteRule ^old\.html$ / [R]

    ... the aim was to 'convert' one URL into another, in my case a parametered URL into the homepage URL. In fact I've now achieved this with php conditionals in the pager script. The unparametered homepage URL is fine for the initial content, and it's only further content accessible via the pager that still requires the parameters.

    Your walkthroughs are very illuminating. As an exercise I will see if I can put all this into effect.

    Thanks again,

    Patrick

    Patrick Taylor

    8:23 am on Apr 20, 2005 (gmt 0)

    WebmasterWorld Senior Member 10+ Year Member



    I'm still working on this, with only partial success. My current .htaccess file is:

    AddHandler x-httpd-php .html .htm
    ErrorDocument 404 /error404.php
    Options +FollowSymLinks
    RewriteEngine on
    RewriteCond %{QUERY_STRING} ^$
    RewriteRule ^index-test\.php/s([0-9]+)/np([0-9]+)$ /index-test.php?s=$1&np=$2 [L]
    RewriteRule ^images/([^.]+)\.(gif¦jpg)$ /images/$1.$2 [L]

    (I have a "test" index page.)

    Within the page, I've re-written the non-se-friendly links like:
    ht*p://www.mydomain.com/index-test.php?s=5&np=2

    into:
    ht*p://www.mydomain.com/index-test.php/s5/np2

    In Firefox the link goes to the right page, but in IE I get a window that says "Problems with this Web page... etc" (syntax error) and have to close this error window a few times before the page is fully displayed. This is not a problem with the normal index page (which doesn't have the non-se-friendly URLs converted).

    Also, on the server I still see no images, which are in a sub folder "images". The path to my images is "images/myimage.jpg". The page is correctly styled, but that's only because I've added a forward slash to its path (in the root directory), but of course on my local machine I now see no styling.

    A further nudge in the right direction would be appreciated.

    Patrick

    jd01

    9:57 am on Apr 20, 2005 (gmt 0)

    WebmasterWorld Senior Member 10+ Year Member



    Hi Patrick,

    I have read most of this thread and I have to say, your rules look good!

    For the images problem, if you are using the "images/image.jpg" form that you have in your example, you will get better results (with or without the rule) if you use a leading / and the full path to the directory.

    It's easier to explain with an example, so:
    images/image.jpg
    looks for "images" starting from the directory you are in. E.g. if your URL is http://yoursite.com/stuff/page.html and this page has images on it, by default you will be trying to find images at http://yoursite.com/stuff/images/image.jpg

    /images/image.jpg looks for "images" starting from the root of your domain. In the same example, if your URL is http://yoursite.com/stuff/page.html, by default you will be trying to find images at
    http://yoursite.com/images/image.jpg

    Most of the time the relative form fails, because your rule has to match exactly, and /stuff/images/image.jpg is not /images/image.jpg

    When you are rewriting you know a user is not physically at ht*p://www.mydomain.com/index-test.php/s5/np2, but the browser does not, so it will begin looking for the images in the current directory, and of course it can't find them, because the images are not at ht*p://www.mydomain.com/index-test.php/s5/images/image.jpg

    Hope this makes sense...

    As far as the browser error goes, I can't see anything wrong with your rule. (I tried for about 5 minutes, and I just don't see it.) You might want to make sure you empty your cache, and try a third browser if you have one... Maybe someone else can see something wrong, but it looks good to me.

    Justin

    Patrick Taylor

    12:28 pm on Apr 20, 2005 (gmt 0)

    WebmasterWorld Senior Member 10+ Year Member



    Justin, thanks for the explanation. My .htaccess file is now:

    AddHandler x-httpd-php .html .htm
    ErrorDocument 404 /error404.php
    Options +FollowSymLinks
    RewriteEngine on
    RewriteCond %{QUERY_STRING} ^$
    RewriteRule ^index-test\.php/s([0-9]+)/np([0-9]+)$ /index-test.php?s=$1&np=$2 [L]
    RewriteRule /s[0-9]+/images/([^.]+)\.(gif¦jpg)$ /images/$1.$2 [L]
    RewriteRule /s[0-9]+/css/([^.]+)\.css$ /css/$1.css [L]

    I looked in my logs to see what the browser was asking for, but still no dice in terms of getting the styles and images to show - and the "syntax" error persists even though the test page is exactly as the real one - except for the alteration of the pager links.

    One day I will set my machine up as a server. In the meantime, when the user is in (for example) ht*p://www.mydomain.com/pages/ I am using a path to images as ../images/myimage.jpg and so forth.

    Patrick

    jd01

    4:23 pm on Apr 20, 2005 (gmt 0)

    WebmasterWorld Senior Member 10+ Year Member



    Patrick,

    I kept looking at the page rule, to try to see what you were doing wrong... The image rule is the problem! duh, my bad.

    RewriteRule ^images/([^.]+)\.(gif¦jpg)$ /images/$1.$2 [L]

    This rule is an infinite loop, because it writes "images" anything to "images" anything, then starts over... remember, anytime you rewrite, the new URL is processed again, so if you rewrite to where you came from you will continually rewrite the same condition over and over and over and...

    You really don't need the image or the css rule; just make sure they are referenced on your pages as /images/image.jpg
    and /css.css (or whatever you use). This will always send the request to http://yoursite/images/image.jpg and http://yoursite/css.css respectively.

    (In other words if they are on your page right and there is no infinite loop, you should have no need to rewrite them at all.)

    Very sorry I didn't see the loop earlier.

    Justin

    Added: My advice is get the page rule working, then you can make any additions. EG pictures, css, etc. It can be just plain tough to account for every picture in every directory, without creating a loop.

    Patrick Taylor

    7:11 pm on Apr 20, 2005 (gmt 0)

    WebmasterWorld Senior Member 10+ Year Member



    Thanks for that. One thing at a time - sound advice. Absolute URLs for a while, I think.

    Patrick

    Patrick Taylor

    10:13 am on Apr 21, 2005 (gmt 0)

    WebmasterWorld Senior Member 10+ Year Member



    Persevering with this (signs of progress)...

    My .htaccess file is now:

    AddHandler x-httpd-php .html .htm
    ErrorDocument 404 /error404.php
    Options +FollowSymLinks
    RewriteEngine on
    RewriteCond %{QUERY_STRING} ^$
    RewriteRule ^index-test\.php/s([0-9]+)/np([0-9]+)$ /index-test.php?s=$1&np=$2 [L]
    RewriteRule /s[0-9]+/images/([^.]+)\.(gif¦jpg)$ /images/$1.$2 [L]
    RewriteRule /s[0-9]+/([^.]+)\.css$ /$1.css [L]
    RewriteRule /s[0-9]+/([^.]+)\.js$ /$1.js [L]

    The se-friendly URL now displays without any errors. I think that is because previously I had not included a rule for .js files (which are currently in the root folder, as are the .css files).

    It also picks up the styling correctly, but not the actual images themselves, which are in a folder named "images" in the root folder. I suspect this problem is something to do with the $1 and $2.

    I'm not sure how the $ works. Also, is there still a looping issue with the way the .htaccess file is now written?

    Patrick

    jdMorgan

    3:52 am on Apr 22, 2005 (gmt 0)

    WebmasterWorld Senior Member 10+ Year Member



    There are no looping issues that I can see. Looping is a concern if the rewritten URL matches the pattern of the RewriteRule itself. Taking an extremely simple case,

    RewriteRule ^index\.html$ /index.html?quux=foo [L]

    will loop, because the substitution URL (index.html?quux=foo) matches the pattern (^index\.html$) due to the fact that RewriteRule does not 'see' the query string (?quux=foo) part of the requested URL. So the substitution (newly-rewritten) URL matches the pattern, and this rule will loop. When necessary, RewriteCond can be used to test the query string value, but RewriteRule does not have access to it.
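
One way to keep such a rule from matching its own output is to test the query string in a RewriteCond. A minimal sketch, reusing the placeholder names from the rule above:

```apache
RewriteEngine on

# Apply the rule only while the query string does not yet contain quux=foo.
# After the rewrite the condition fails, so the rule is not applied again.
RewriteCond %{QUERY_STRING} !quux=foo
RewriteRule ^index\.html$ /index.html?quux=foo [L]
```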

    > I'm not sure how the $ works.

    The "$" followed by the numerals 1 to 9 are called "back-references." They refer to the parenthesized sub-patterns in the RewriteRule. The numbers are assigned to the parenthesized expressions in order from left to right. In your rule:


    > RewriteRule /s[0-9]+/images/([^.]+)\.(gif¦jpg)$ /images/$1.$2 [L]

    $1 is assigned the value of whatever matches the ([^.]+) part of the requested URL, and $2 is assigned the value of either "gif" or "jpg", whichever was requested (the dot sits outside the parentheses, so it is not captured).

    You may also back-reference values in a preceding matched RewriteCond pattern using %1 to %9.
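
A small sketch of both kinds of back-reference side by side. The paths and script names (photos, gallery.php, search.php, find) are hypothetical and chosen only to show where each value comes from:

```apache
RewriteEngine on

# $1 and $2 come from the parentheses in the RewriteRule pattern,
# numbered left to right. For a request to /photos/2005/sunset.jpg:
#   $1 = 2005, $2 = sunset
RewriteRule ^photos/([0-9]+)/([^./]+)\.jpg$ /gallery.php?year=$1&name=$2 [L]

# %1 comes from the parentheses in the matched RewriteCond pattern.
# For a request to /search.php?q=apache:  %1 = apache
RewriteCond %{QUERY_STRING} ^q=([^&]+)$
RewriteRule ^search\.php$ /find/%1? [R=301,L]
```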

    See the mod_rewrite documentation [webmasterworld.com] -- There's not much hope in getting through this without understanding mod_rewrite's use of regular expressions and back-references.

    Jim

    Patrick Taylor

    10:43 am on Apr 22, 2005 (gmt 0)

    WebmasterWorld Senior Member 10+ Year Member



    There's not much hope in getting through this without understanding mod_rewrite's use of regular expressions and back-references.

    I've come to that conclusion. In fact I've found that a few hours of reading has been very informative, and regular expressions aren't quite as intimidating now as they seemed at the outset.

    I can also see the importance of mod_rewrite in web construction. Thanks again for the excellent lead-in.

    Quite why my images aren't showing up is unclear. Everything seems to be in order in the .htaccess I posted above.

    Patrick

    Patrick Taylor

    11:58 am on Apr 22, 2005 (gmt 0)

    WebmasterWorld Senior Member 10+ Year Member



    Success at last. In the images RewriteRule I replaced the pipe symbol from ¦ to one with no gap in the middle (it shows up with a gap in the post preview) and the images now show.

    I've learned a lot. The only slight downside to all this is that now, the new page (which is the re-written one) doesn't have the first page's images in its cache, so they have to re-load.

    Thanks for your help!

    Patrick

    jdMorgan

    6:48 pm on Apr 22, 2005 (gmt 0)

    WebmasterWorld Senior Member 10+ Year Member



    Again, this (caching) problem can be avoided if you use server-relative URLs for your images, rather than page-relative URLs. That is, the URLs should start with a slash as in <img src="/images/log.gif">. The browser then resolves these by adding only your domain name to the request and therefore requests the images from a 'fixed location.'

    Jim