I have a site on a domain: eg ht*p://www.mydomain.com/
Within the homepage is a link that returns the user to the top of the page. When the link is clicked, in the address bar, instead of ht*p://www.mydomain.com/ I see ht*p://www.mydomain.com/#top.
There is also a pager in the homepage - a link that takes the user to further 'pages' of dynamic content (technically still the same page but with an url like ht*p://www.mydomain.com/index.php?s=7&np=2). They can use the pager to return to the start, at which point the homepage URL becomes ht*p://www.mydomain.com/index.php?s=0&np=2.
I am concerned that search engines will index all three URLs for the homepage, when I only want ht*p://www.mydomain.com/ to be indexed. So is the solution a mod_rewrite? Is this some code that goes in an .htaccess file?
As my programming and Apache-related knowledge is very limited, how would I go about solving the problem without spending a long time learning something I only need to do once (and possibly failing)? Is there such a thing as a total layperson's tutorial on mod_rewrite?
Regards,
Patrick
This would involve modifying your 'pager' and any other php code on your pages to present a static link, with a special case for your 'home page', and then a simple RewriteRule to change them back.
For example, using preg_replace in php, change the URL
http://www.example.com/index.php?s=7&np=2
to
http://www.example.com/s7/np2
and then use mod_rewrite to change that back again when it is requested from your server.
For the special case of http://www.example.com/index.php?s=0&np=2, you'd simply remove all the parameters in your php code, and it would access your home page without any action required by mod_rewrite.
I don't know of any beginner's guide to mod_rewrite, although there may indeed be one. The documents cited in our forum charter [webmasterworld.com] have plenty of examples of mod_rewrite code and the regular expressions needed to make them useful.
Jim
Options +FollowSymLinks
RewriteEngine on
RewriteBase /
RewriteRule ^old\.html$ / [R]
... to convert old.html into ht*p://www.mydomain.com/ - which is what I'm attempting to do but with the real URLs I have, eg ht*p://www.mydomain.com/index.php?s=0&np=2
My URLs with parameters, such as the one above, are actually being crawled by search engines and their content indexed. The issue is that I want to avoid duplicate content, ie I want the starting content on the homepage to only be indexed as ht*p://www.mydomain.com/ - not:
ht*p://www.mydomain.com/index.php
ht*p://www.mydomain.com/index.php?s=0&np=2
ht*p://www.mydomain.com/#top
The other 'versions' of the homepage, that the pager points to have URLs like:
ht*p://www.mydomain.com/index.php?s=7&np=2
... but this doesn't matter, as there is only one URL in existence for that piece of content - ie no duplicates - and the content is being indexed by search engines.
Quite why
RewriteRule ^index\.php\?s=0&np=2$ / [R]
doesn't convert the URL with parameters to ht*p://www.mydomain.com/ I don't know.
Patrick
Would that deal with the duplicate content issue?
A simple version of the rule should be the following
RewriteCond %{QUERY_STRING} ^(s=0&np=2)?$
RewriteRule ^index\.php$ / [R]
As far as "host.com/#top" and "host.com/" are concerned, the search robot *shouldn't* see any difference between the two, so there's no reason to have a rewrite rule for that possibility.
EDIT: Changed the RewriteCond Regex slightly in case you're using apache1.3.*
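For reference, the reason "RewriteRule ^index\.php\?s=0&np=2$ / [R]" never fires is that the RewriteRule pattern is only matched against the URL-path; the query string has to be tested separately through %{QUERY_STRING}. A minimal sketch of that idea follows. It assumes the .htaccess file sits in the document root, that the home page works without parameters, and www.example.com is only a placeholder host name.

```apache
# Sketch only: assumes this .htaccess is in the document root and
# mod_rewrite is available. www.example.com is a placeholder.
RewriteEngine on

# The RewriteRule pattern never sees the query string, so test it here.
RewriteCond %{QUERY_STRING} ^s=0&np=2$
# The trailing "?" discards the original query string on the redirect;
# without it, "s=0&np=2" would be re-appended to the target URL.
# R=301 marks the move as permanent (a bare [R] sends a 302).
RewriteRule ^index\.php$ http://www.example.com/? [R=301,L]
```

Matching only the exact query string, rather than also allowing an empty one, should keep a plain request for / from being redirected back to itself.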
I need to php/convert the URL with parameters into a static one within the page. And then the mod_rewrite replaces it back to the parametered one.
Would that deal with the duplicate content issue?
Indirectly... They will fade out over time. However, once you get the stuff above working, you can remove the important ones faster using a special mod_rewrite variable: %{THE_REQUEST}
The problem using a straight RewriteRule is that the local URL_path examined by RewriteRule is updated on any and all passes through the code. And after you do any rewrite in .htaccess, control is passed back up to httpd.conf, and then back down through all .htaccess files in the new filepath, in order to check for rewrites or access restrictions on the new URL-path. This makes mod_rewrite in .htaccess appear to be recursive, and can lead to deadloop problems. For example, it makes it impossible to rewrite indexa to index.php?val=a for purposes of calling the script with a friendly URL and also redirect index.php?val=a to indexa.html in order to list a friendly URL in the search engines. If you tried to do that using only the local URL-path in RewriteRule, you'd get an 'infinite' loop.
So, the trick is to examine only the originally-requested URL-path and not the rewritten one. This can be done by examining %{THE_REQUEST}, which is the entire client request line, and looks something like what you see in your raw log files:
GET /index.php?s=0&np=7 HTTP/1.1
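To make the loop problem concrete, here is a rough sketch using the indexa.html / index.php?val=a pair from the explanation above. The single-letter pattern and the host name www.example.com are assumptions made purely for illustration.

```apache
# Sketch only: indexa.html and val=a are the hypothetical URLs from the
# explanation above; www.example.com is a placeholder host name.
RewriteEngine on

# Internal rewrite: friendly URL -> script (address bar is unchanged).
RewriteRule ^index([a-z])\.html$ /index.php?val=$1 [L]

# External redirect: unfriendly URL -> friendly URL.
# THE_REQUEST holds the client's original request line and is not
# altered by the internal rewrite above, so a request that arrived as
# /indexa.html never matches this condition and no loop occurs.
RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /index\.php\?val=([a-z])\ HTTP
RewriteRule ^index\.php$ http://www.example.com/index%1.html? [R=301,L]
```

If the condition tested %{QUERY_STRING} instead, the already-rewritten request for index.php?val=a would match on the next pass and be redirected straight back to indexa.html, over and over.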
Assuming that your friendly URL is in the form /page/s0/np2, and bundling up all of the friendly URL-handling stuff, you'd get:
# Rewrite index page - Special case
RewriteCond %{QUERY_STRING} ^$
RewriteRule ^$|^index\.php$ /index.php?s=0&np=2 [L]
#
# Rewrite friendly URLs to index.php with query strings
RewriteCond %{QUERY_STRING} ^$
RewriteRule ^page/s([0-9]+)/np([0-9]+)$ /index.php?s=$1&np=$2 [L]
#
# Clean up search engine listings by redirecting unfriendly URLs
# Index page:
RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /index\.php\?s=0&np=2\ HTTP
RewriteRule ^index\.php$ http://www.example.com/? [R=301,L]
#
# Other pages:
RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /index\.php\?s=([0-9]+)&np=([0-9]+)\ HTTP
RewriteRule ^index\.php$ http://www.example.com/page/s%1/np%2? [R=301,L]
These are local anchors, and are handled only on the client (browser) side. They may appear in your logs, but only badly-broken robots will list them in search indexes. It is in fact impossible in a normal search engine to link to a URL with a local anchor in it and get that anchor indexed.
Jim
ht*p://www.mydomain.com/index.php/s7/np2
The .htaccess file is now:
#
Options +FollowSymLinks
RewriteEngine on
RewriteCond %{QUERY_STRING} ^$
RewriteRule ^$|^index\.php$ /index.php?s=0&np=2 [L]
RewriteCond %{QUERY_STRING} ^$
RewriteRule ^s([0-9]+)/np([0-9]+)$ /index.php?s=$1&np=$2 [L]
#
(no hashes)
But when I click a link, the URL remains the same, and I see the page but with all styling and images gone.
Patrick
It is the client (browser or spider) that resolves relative links like this to absolute URLs, and it resolves them relative to what it thinks is the "current directory."
The solution is to use <img src="/image.gif"> or <img src="/images/image.gif"> (note the leading slash), which will make the browser resolve the image path relative to the root of the site. The same goes for CSS, external JavaScript, etc.
The URL in the title bar should never change from "friendly" to "unfriendly" format -- that would defeat the entire purpose of this exercise. The rewrite takes place only internal to the server, which is all that is needed.
On the other hand, the code that redirects from unfriendly to friendly uses an external 301-Moved Permanently redirect, which will show in the address bar. This is the difference between an internal rewrite and an external redirect; the external redirect is needed to "notify" browsers and SE spiders that the requested resource is now at a new URL.
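Side by side, the two kinds of rule look roughly like this. It is only a sketch: the /page/s7/np2 form and old.html are taken from earlier in this thread, and www.example.com is a placeholder.

```apache
# Sketch only: paths come from earlier examples in this thread and
# www.example.com is a placeholder host name.
RewriteEngine on

# Internal rewrite: no [R] flag and a local path as the target.
# The server quietly serves index.php; the browser still shows
# /page/s7/np2 in the address bar.
RewriteRule ^page/s([0-9]+)/np([0-9]+)$ /index.php?s=$1&np=$2 [L]

# External redirect: [R=301] makes the server answer with a
# "301 Moved Permanently" response. The client then issues a second
# request for the new URL, and the address bar changes.
RewriteRule ^old\.html$ http://www.example.com/ [R=301,L]
```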
Jim
For me this is a useful introduction to mod_rewrite as a concept, but when one is bad at programming it gets out of hand.
Thank you for your patient assistance. The thing is, my 'search-engine-unfriendly' URLs are actually being indexed by search engines, and the real point of the exercise - what I'm fundamentally trying to achieve - is for this URL:
ht*p://www.mydomain.com/index.php?s=0&np=2
to be rewritten to:
ht*p://www.mydomain.com/
... so that I don't have any duplicate content. On first entry to the homepage the parameters aren't required because the database query does its work correctly. It's only when "next" is clicked in the pager that the parameters are required explicitly (for the next bunch of content). It is at this point that I need the pager link back to the initial-entry homepage to be mod_rewritten to ht*p://www.mydomain.com/ - otherwise I have different URLs pointing to the same content. I never actually require ht*p://www.mydomain.com/index.php?s=0&np=2
I can't see why "RewriteRule ^index\.php\?s=0&np=2$ / [R]" doesn't rewrite the parametered URL into the simple domain, but of course this is because I don't have a proper grasp of the topic.
Patrick
what I'm fundamentally trying to achieve - is for this URL: http://www.mydomain.com/index.php?s=0&np=2
to be rewritten to:
http://www.mydomain.com/
But this is where the misunderstanding is: You want to redirect http://www.mydomain.com/index.php?s=0&np=2 to http://www.mydomain.com/ in order to clean up the search engine listings and avoid dup content, and you must rewrite http://www.mydomain.com/ to http://www.mydomain.com/index.php?s=0&np=2 in order to call your script.
A rewrite and a redirect are two utterly different things, although you can do either using mod_rewrite.
Let's walk through a request referred from Google with all code in-place and working, using the index page as an example to see how all this works. We'll assume that the searcher finds an old URL that you'd like to change.
Now let's look at a search engine spider doing the same thing:
If you only need to correct the index page, then you can use the two "special case for index page" sections of code above, and omit the two intended for all other pages.
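As a rough sketch of that index-page-only setup, with the request flow noted in the comments. It assumes an .htaccess file in the document root, and www.example.com is a placeholder; the redirect is listed first here, which is a common ordering rather than a requirement.

```apache
# Sketch only: index-page rules on their own, assuming an .htaccess in
# the document root. www.example.com is a placeholder host name.
RewriteEngine on

# 1. A visitor or spider follows an old link:
#      GET /index.php?s=0&np=2 HTTP/1.1
#    The original request line matches, so the server answers with a
#    301 pointing at the bare home page URL.
RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /index\.php\?s=0&np=2\ HTTP
RewriteRule ^index\.php$ http://www.example.com/? [R=301,L]

# 2. The client then requests the new URL:
#      GET / HTTP/1.1
#    The query string is empty, so the script is called internally with
#    its starting parameters; the address bar keeps showing "/".
RewriteCond %{QUERY_STRING} ^$
RewriteRule ^(index\.php)?$ /index.php?s=0&np=2 [L]
```

After step 2 the rules are evaluated again for the rewritten request, but neither fires: the original request line was "GET / HTTP/1.1", and the query string is no longer empty.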
Note that you cannot use <img src="images/image.gif"> or <img src="../images/image.gif"> if you want this to work. You must use either <img src="/full_path_to_images_from_root/image.gif"> or the canonical <img src="http://www.example.com/full_path_to_images_from_root/image.gif">. Yes, this will break the images on your local machine, because it isn't doing the rewrite if it's not set up as a server.
Alternately, you can set up redirects to properly direct the image and script fetches to the proper "real" location, using the same techniques as used for the pages, but in reverse. This is easiest if all images and css files are located in a central place just below root. In that case, something like:
RewriteRule /s[0-9]+/images/([^.]+)\.(gif|jpg)$ /images/$1.$2 [L]
RewriteRule /s[0-9]+/css/([^.]+)\.css$ /css/$1.css [L]
I know this must be frustrating for you, but nothing comes free. Once you get this working --having put a lot of work into it-- you will get better search results, a cleaner-looking site, more "memorable" URLs for type-ins, and the experience of successfully implementing a new technique in order to achieve those benefits. And you will find that the next time something comes up where rewriting a URL might help, it will all be a whole lot easier. Let me tell you about my feelings the first few months I was cleaning up my sites: I'd get many, many server errors. The page would display "500-Server Error" and I'd think to myself, "Well, I guess I've only got 499 more crashes to go before I understand this stuff"... ;)
Jim
When above I referred to:
Options +FollowSymLinks
RewriteEngine on
RewriteBase /
RewriteRule ^old\.html$ / [R]
... the aim was to 'convert' one URL into another, in my case a parametered URL into the homepage URL. In fact I've now achieved this with php conditionals in the pager script. The unparametered homepage URL is fine for the initial content, and it's only further content accessible via the pager that still requires the parameters.
Your walkthroughs are very illuminating. As an exercise I will see if I can put all this into effect.
Thanks again,
Patrick
AddHandler x-httpd-php .html .htm
ErrorDocument 404 /error404.php
Options +FollowSymLinks
RewriteEngine on
RewriteCond %{QUERY_STRING} ^$
RewriteRule ^index-test\.php/s([0-9]+)/np([0-9]+)$ /index-test.php?s=$1&np=$2 [L]
RewriteRule ^images/([^.]+)\.(gif|jpg)$ /images/$1.$2 [L]
(I have a "test" index page.)
Within the page, I've re-written the non-se-friendly links like:
ht*p://www.mydomain.com/index-test.php?s=5&np=2
into:
ht*p://www.mydomain.com/index-test.php/s5/np2
In Firefox the link goes to the right page, but in IE I get a window that says "Problems with this Web page... etc" (syntax error) and have to close this error window a few times before the page is fully displayed. This is not a problem with the normal index page (which doesn't have the non-se-friendly URLs converted).
Also, on the server I still see no images, which are in a sub folder "images". The path to my images is "images/myimage.jpg". The page is correctly styled, but that's only because I've added a forward slash to its path (in the root directory), but of course on my local machine I now see no styling.
A further nudge in the right direction would be appreciated.
Patrick
I have read most of this thread and I have to say, your rules look good!
For the images problem, if you are using this "images/image.jpg" that you have in your example, you will get better results (with or without the rule) if you use a leading / and the full path to the directory.
It's easier to explain with an example so:
images/image.jpg
looks for "images" starting from the directory you are in EG if your URL is http//yoursite.com/stuff/page.html and this page has images on it, by default you will be trying to find images at http//yoursite.com/stuff/images/image.jpg
/images/image.jpg looks for "images" starting from the root of your domain in the same example if your URL is http//yoursite.com/stuff/page.html, by default you will be trying to find images at
http//yoursite.com/images/image.jpg
Most of the time this fails, because your rule has to match exactly, and /stuff/images/image.jpg is not /images/image.jpg
When you are rewriting you know a user is not physically at ht*p://www.mydomain.com/index-test.php/s5/np2, but the browser does not, so it will begin looking for the images in the current directory, and of course it can't find them, because the images are not at ht*p://www.mydomain.com/index-test.php/s5/images/image.jpg
Hope this makes sense...
As far as the browser, I can't see anything wrong with your rule. (I tried for about 5 minutes, and I just don't see it.) You might want to make sure you empty your cache, and try a third browser if you have one... Maybe someone else can see something wrong, but it looks good to me.
Justin
AddHandler x-httpd-php .html .htm
ErrorDocument 404 /error404.php
Options +FollowSymLinks
RewriteEngine on
RewriteCond %{QUERY_STRING} ^$
RewriteRule ^index-test\.php/s([0-9]+)/np([0-9]+)$ /index-test.php?s=$1&np=$2 [L]
RewriteRule /s[0-9]+/images/([^.]+)\.(gif|jpg)$ /images/$1.$2 [L]
RewriteRule /s[0-9]+/css/([^.]+)\.css$ /css/$1.css [L]
I looked in my logs to see what the browser was asking for, but still no dice in terms of getting the styles and images to show - and the "syntax" error persists even though the test page is exactly as the real one - except for the alteration of the pager links.
One day I will set my machine up as a server. In the meantime, when the user is in (for example) ht*p//www.mydomain.com/pages/ I am using a path to images as ../images/myimage.jpg and so forth.
Patrick
I kept looking at the page rule, to try to see what you were doing wrong... The image rule is the problem! duh, my bad.
RewriteRule ^images/([^.]+)\.(gif|jpg)$ /images/$1.$2 [L]
This rule is an infinite loop, because it writes "images" anything to "images" anything, then starts over... remember, anytime you rewrite, the new rule is processed again, so if you rewrite to where you came from you will continually rewrite the same condition over and over and over and...
You really don't need the page or the css rule, just make sure they are on your pages with /images/image.jpg
and /css.css (or whatever you use) this will always send the request to http//yoursite/images/image.jpg and http//yoursite/css.css respectively.
(In other words if they are on your page right and there is no infinite loop, you should have no need to rewrite them at all.)
Very sorry I didn't see the loop earlier.
Justin
Added: My advice is get the page rule working, then you can make any additions. EG pictures, css, etc. It can be just plain tough to account for every picture in every directory, without creating a loop.
My .htaccess file is now:
AddHandler x-httpd-php .html .htm
ErrorDocument 404 /error404.php
Options +FollowSymLinks
RewriteEngine on
RewriteCond %{QUERY_STRING} ^$
RewriteRule ^index-test\.php/s([0-9]+)/np([0-9]+)$ /index-test.php?s=$1&np=$2 [L]
RewriteRule /s[0-9]+/images/([^.]+)\.(gif|jpg)$ /images/$1.$2 [L]
RewriteRule /s[0-9]+/([^.]+)\.css$ /$1.css [L]
RewriteRule /s[0-9]+/([^.]+)\.js$ /$1.js [L]
The se-friendly URL now displays without any errors. I think that is because previously I had not included a rule for .js files (which are currently in the root folder, as are the .css files).
It also picks up the styling correctly, but not the actual images themselves, which are in a folder named "images" in the root folder. I suspect this problem is something to do with the $1 and $2.
I'm not sure how the $ works. Also, is there still a looping issue with the way the .htaccess file is now written?
Patrick
RewriteRule ^index\.html$ /index.html?quux=foo [L]
> I'm not sure how the $ works.
The "$" followed by the numerals 1 to 9 are called "back-references." They refer to the parenthesized sub-patterns in the RewriteRule. The numbers are assigned to the parenthesized expressions in order from left to right. In your rule:
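> RewriteRule /s[0-9]+/images/([^.]+)\.(gif|jpg)$ /images/$1.$2 [L]
$1 holds whatever matched ([^.]+), that is, the image filename without its extension, and $2 holds whatever matched (gif|jpg), the extension.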
You may also back-reference values in a preceding matched RewriteCond pattern using %1 to %9.
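A small sketch showing both kinds of back-reference next to each other. The photos/ path, gallery.php and its year and name parameters are invented for illustration, and www.example.com is a placeholder.

```apache
# Sketch only: photos/, gallery.php, "year" and "name" are hypothetical;
# www.example.com is a placeholder host name.
RewriteEngine on

# $N refers to parenthesized groups in the RewriteRule pattern, counted
# left to right. For the URL-path "photos/2004/beach.jpg":
#   $1 = "2004"    $2 = "beach"
RewriteRule ^photos/([0-9]+)/([^./]+)\.jpg$ /gallery.php?year=$1&name=$2 [L]

# %N refers to groups in the last matched RewriteCond pattern.
# For the request line "GET /index.php?s=7&np=2 HTTP/1.1":
#   %1 = "7"       %2 = "2"
RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /index\.php\?s=([0-9]+)&np=([0-9]+)\ HTTP
RewriteRule ^index\.php$ http://www.example.com/page/s%1/np%2? [R=301,L]
```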
See the mod_rewrite documentation [webmasterworld.com] -- There's not much hope in getting through this without understanding mod_rewrite's use of regular expressions and back-references.
Jim
There's not much hope in getting through this without understanding mod_rewrite's use of regular expressions and back-references.
I've come to that conclusion. In fact I've found that a few hours of reading has been very informative, and regular expressions aren't quite as intimidating now as they seemed at the outset.
I can also see the importance of mod_rewrite in web construction. Thanks again for the excellent lead-in.
Quite why my images aren't showing up is unclear. Everything seems to be in order in the .htaccess I posted above.
Patrick
I've learned a lot. The only slight downside to all this is that now, the new page (which is the re-written one) doesn't have the first page's images in its cache, so they have to re-load.
Thanks for your help!
Patrick
Jim