Mod Rewrite Not Working Properly. Why?
url rewrite not working
Tehuti - msg:4612285 - 1:15 pm on Sep 24, 2013 (gmt 0)

I'm very new to rewriting URLs. I've only been learning about it for two days!

I'm trying to rewrite my URLs, but it's not working properly. I have changed the URLs on my website to look like this:

<a href="/id_number/hyphenated_article_title/">Anchor_text</a>

E.g.,

<a href="/12/how-to-lose-weight-fast/">How to Lose Weight Fast</a>

My .htaccess file looks like this:

RewriteEngine on
RewriteRule ^/([0-9]+)/[A-Za-z0-9-]+/?$ article.php?id=$1 [NC,L]


The rewrite is working, but article.php is loading slowly and without CSS styling.

Anyone know why?

 

Tehuti - msg:4612315 - 3:06 pm on Sep 24, 2013 (gmt 0)

Whoa! Every link on my website now begins with this:

www.example.com/12/how-to-lose-weight-fast/

So, for example, my contact page URL looks like this:

www.example.com/12/how-to-lose-weight-fast/contact.html

I'm in way over my head here! I'm guessing that I need more rules in my .htaccess file to make sure that the current rule only applies to URLs that begin with a number ID.

lucy24 - msg:4612398 - 7:51 pm on Sep 24, 2013 (gmt 0)

RewriteEngine on
RewriteRule ^/

I assume your RewriteRules are loose in the config file, or this pattern wouldn't work. (In htaccess or any <Directory> section, leave off the leading / slash.)

The rewrite is working, but article.php is loading slowly and without CSS styling.

I can't speak to the "loading slowly" but the explanation for the missing css is simple. Crystal ball says the pages contain relative links to the css: either
"directory/styles.css"
or
"../directory/styles.css"

The user's browser is asking for the stylesheet based on where it thinks it is: the originally requested location, not the rewrite target. The whole idea of a rewrite is that the browser doesn't know it's been rewritten.

You will need to change the page itself to use root-relative (or fully absolute) URLs for stylesheets -- and anything else, such as images or scripts:

"/directory/styles.css"

where the leading / means the site root.
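
For example (a hypothetical snippet; the file names are made up), the head of each page would reference its assets like this:

<link rel="stylesheet" href="/css/styles.css">
<script src="/js/main.js"></script>

Those paths resolve from the site root no matter which URL the visitor's browser thinks it is on.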

Every link on my website now begins with this:

www.example.com/

Where's the www.example.com coming from? If you're staying on the same site, links should begin with / as above. Part of the problem may be php, but that's for a different subforum.

Tehuti - msg:4612469 - 1:28 am on Sep 25, 2013 (gmt 0)

Thank you, Lucy! Got it fixed.

You were right: the issue was with relative URLs. I changed all of the links and paths and everything is now fixed.

I had no idea that, after rewriting the URLs, relative URLs would stop working. Learnt a lot today.

Just to clarify, my rewrite rules were written in my .htaccess file. I think I added the slash that you noticed at the beginning of the rule (i.e., RewriteRule ^/) when writing the forum post.

Another clarification: www.example.com was supposed to represent my domain name. ;-)

Thanks again!

g1smd - msg:4612524 - 8:07 am on Sep 25, 2013 (gmt 0)

For various reasons you would be better off changing from
www.example.com/12/how-to-lose-weight-fast/
to
www.example.com/12-how-to-lose-weight-fast
without the fake numbered folder level and without a trailing slash.


One other thing that should go near the start of your PHP script is a few lines of code such that if I request
www.example.com/7362-a-page-about-stuff
and page "7362" does not exist in the database then the PHP returns a 404 HEADER and "includes" the HTML code and content of your 404 page. Failure to do this will lead to Google reporting "soft 404" errors for your site.


Additionally, you should capture the "page name" from the requested URL and pass this $2 value as an extra parameter to the PHP script. The PHP script should check that the requested $2 page name is valid for the requested $1 id and redirect to the correct URL if it does not exactly match.

If I request
www.example.com/12-how-to-
or
www.example.com/12-this-product-is-rubbish
or
www.example.com/12/how-to-lose-weight-fast,,,,,.......-------
your site should redirect me to the correct URL for the page "12" content.

As you have it now, all of those requests will directly display the page "12" content with "200 OK" status leading to serious Duplicate Content problems.


Both of the suggested additional functions are a very small number of lines of code.

When you generate the page title part of the URL it can also be helpful to limit exactly what characters are used, swapping everything to lower case, changing spaces to hyphens, removing punctuation other than hyphens and so on.
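
To make that concrete, here is a rough, untested sketch of those two checks in PHP. The get_article() function is a made-up stand-in for your own database lookup:

$id = isset($_GET['id']) ? (int) $_GET['id'] : 0;
$slug = isset($_GET['slug']) ? $_GET['slug'] : '';

// hypothetical lookup: returns an array with 'slug' and 'article', or false if the id is unknown
$row = get_article($id);

if (!$row) {
    // unknown id: send a real 404 header plus the 404 page content (no soft 404)
    header("HTTP/1.1 404 Not Found");
    include($_SERVER['DOCUMENT_ROOT'] . '/404.php');
    exit;
}

if ($slug !== $row['slug']) {
    // valid id but wrong or mangled page name: one 301 to the canonical URL
    header("HTTP/1.1 301 Moved Permanently");
    header("Location: http://www.example.com/" . $id . "-" . $row['slug']);
    exit;
}

// URL is exactly right: carry on and output the article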

lucy24 - msg:4612552 - 10:43 am on Sep 25, 2013 (gmt 0)

###. I entirely overlooked that part.

From first post:
RewriteRule ^/([0-9]+)/[A-Za-z0-9-]+/?$ article.php?id=$1 [NC,L]
Since only the numerical bit at the beginning is being captured, you could enter any old garbage for the second pseudo-directory. So what's it even there for?

Incidentally you don't need [NC] when all you're capturing is numbers-- AND the alphabetic part already says explicitly [A-Za-z]

[A-Za-z0-9-] = [^/]
for a savings of eight bytes. Unless your URLs potentially contain other characters like _ or ~ that you specifically need to exclude from rewriting.

Tehuti - msg:4612690 - 8:30 pm on Sep 25, 2013 (gmt 0)

G1smd, thank you very much for that high-quality advice! I have implemented everything that you suggested. It took all day, and I learnt loads!

Lucy24, thank you, too, for some brilliant advice!

g1smd - msg:4612705 - 8:52 pm on Sep 25, 2013 (gmt 0)

You're welcome - and thanks for following it.

You're welcome to post your updated code here in case there are other glitches to fix.

There's no one right way of doing this stuff, but there are a very large number of wrong ways!

I assume you already have a non-www/www canonical redirect as the last of your external redirects and before the internal rewrites.

What happens if someone requests
example.com/article.php?id=12 or similar URL? Have those types of URL ever been live on the web? If they have, you need a few more lines of RewriteRule code to redirect those requests to the new format.

Were your
www.example.com/12/how-to-lose-weight-fast/ style URLs ever live on the web? A simple RewriteRule can redirect all of those requests to the new URL format.

Tehuti - msg:4612729 - 9:32 pm on Sep 25, 2013 (gmt 0)

I assume you already have a non-www/www canonical redirect as the last of your external redirects and before the internal rewrites.


That's way over my head! I don't even understand what you mean!

This is what my .htaccess file looks like:

# Always use www in the domain
RewriteEngine on
RewriteCond %{HTTP_HOST} ^([a-z.]+)?example\.com$ [NC]
RewriteCond %{HTTP_HOST} !^www\. [NC]
RewriteRule .? http://www.%1example.com%{REQUEST_URI} [R=301,L]

RewriteRule ^([0-9]+)-([A-Za-z0-9-]+)/?$ article.php?id=$1&slug=$2 [L]

RewriteRule ^videos/health/([0-9]+)/?$ videos.php?cat=1&page=$1 [L]

ErrorDocument 404 /404.php


What happens if someone requests example.com/article.php?id=12 or similar URL? Have those types of URL ever been live on the web?


No. My website is only a few days old.

Note that I initially tried to thank you via a personal message, but I got an error message saying that your folder is full!

g1smd - msg:4612741 - 10:10 pm on Sep 25, 2013 (gmt 0)

Directly after RewriteEngine on, add two new rules:

RewriteRule ^([0-9]+)-([A-Za-z0-9-]+)/$ http://www.example.com/$1-$2 [R=301,L]

RewriteRule ^videos/health/([0-9]+)/$ http://www.example.com/videos/health/$1 [R=301,L]

If valid URLs are all lower case, then remove the A-Z from the pattern in the first of those two rules.

------

Alter the last two rules (from your code example in the previous post) to this (three changes in each rule):

RewriteRule ^([0-9]+)-([A-Za-z0-9-]+)$ /article.php?id=$1&slug=$2 [L]

RewriteRule ^videos/health/([0-9]+)$ /videos.php?cat=1&page=$1 [L]

If valid URLs are all lower case, then remove the A-Z from the pattern in the first of those two rules.

------

Make sure your internal navigation links within the site link to URLs without a trailing slash.

Make sure there is a blank line after every RewriteRule; it makes the code more readable.

Make sure that each rule has a code comment explaining in plain English what it does.

lucy24 - msg:4612767 - 12:51 am on Sep 26, 2013 (gmt 0)

I assume you already have a non-www/www canonical redirect as the last of your external redirects and before the internal rewrites.

That's way over my head! I don't even understand what you mean!

It means
# Always use www in the domain

;)

But as g1 said, this should be your last redirect. Not the last RewriteRule, just the last of all rules that include the [R] flag. And the optimal form is a single condition:

^(www\.example\.com)?$
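
Written out in full (a sketch; www.example.com stands in for the real hostname), that single-condition form is:

RewriteCond %{HTTP_HOST} !^(www\.example\.com)?$
RewriteRule ^(.*)$ http://www.example.com/$1 [R=301,L]

The negated pattern matches any hostname that is neither www.example.com nor empty, so the rare HTTP/1.0 request that arrives with no Host header at all is left alone instead of being redirected round in circles.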

Tehuti - msg:4612773 - 1:17 am on Sep 26, 2013 (gmt 0)

I thank you both! Learning loads here.

Okay, I think I have implemented everything correctly, except I didn't add the canonical rewrites that g1smd suggested. The reason is that they have a slash (/) at the end of the pattern, thus:

RewriteRule ^([0-9]+)-([a-z0-9-]+)/$ http://www.example.com/$1-$2 [R=301,L]

Is the slash meant to be there? My URLs don't have a slash at the end. They look like this:

www.example.com/123-hyphenated-lowercase-article-title

Also, I might be totally wrong, but doesn't my very first rewrite rule already make all links canonical?

Below is my complete .htaccess file. If you guys ever need any advice concerning writing, I'm your man!

RewriteEngine on
RewriteCond %{HTTP_HOST} ^([a-z.]+)?example\.com$ [NC]
RewriteCond %{HTTP_HOST} !^www\. [NC]
RewriteRule .? http://www.%1example.com%{REQUEST_URI} [R=301,L]

RewriteRule ^([0-9]+)-([a-z0-9-]+)$ /article.php?id=$1&slug=$2 [L]

RewriteRule ^browse/([0-9]+)$ /browse.php?page=$1 [L]

RewriteRule ^videos/health/([0-9]+)$ /videocategory.php?cat=1&page=$1 [L]
RewriteRule ^videos/fitness/([0-9]+)$ /videocategory.php?cat=2&page=$1 [L]
RewriteRule ^videos/weight-loss/([0-9]+)$ /videocategory.php?cat=3&page=$1 [L]
RewriteRule ^videos/self-esteem/([0-9]+)$ /videocategory.php?cat=4&page=$1 [L]

RewriteRule ^about$ /about-us.php [L]

RewriteRule ^contact$ /contact-us.php [L]

RewriteRule ^privacy$ /privacy-policy.php [L]

RewriteRule ^terms$ /terms-of-use.php [L]

ErrorDocument 404 /404.php

Tehuti - msg:4612777 - 1:46 am on Sep 26, 2013 (gmt 0)

Last question. It concerns this rule:

RewriteRule ^([0-9]+)-([A-Za-z0-9-]+)$ /article.php?id=$1&slug=$2 [L]

G1 said that, if my valid URLs are all lowercase, then I should remove A-Z from the pattern.

The problem is that, if someone then uses an upper case letter, they will get a 404 error.

If I keep A-Z, they will not get a 404 error because my PHP code redirects to the right, lowercase URL.

Shall I remove A-Z even though my script redirects to the right, lowercase URL?

lucy24 - msg:4612811 - 4:00 am on Sep 26, 2013 (gmt 0)

Is the slash meant to be there? My URLs don't have a slash at the end.

That was probably a typo on g1's part. But since this specific rule is for a 301 redirect, it is OK to be flexible. You can omit the slash and also the ending anchor, and keep an [NC] if you like.

but doesn't my very first rewrite rule already make all links canonical

Yes, but it's in the wrong place. If someone asks for
example.com/wrong-file-name

your existing code will FIRST redirect them to
www.example.com/wrong-file-name

before they ever get to
www.example.com/right-file-name

If instead you put the domain-name-canonicalization rule* last, then most requests only get redirected once:

example.com/wrong-file-name
goes straight to
www.example.com/right-file-name

That's why redirects always include the full protocol plus hostname: so people who make two separate mistakes will only get redirected once.

The problem is that, if someone then uses an upper case letter, they will get a 404 error.

If I keep A-Z, they will not get a 404 error because my PHP code redirects to the right, lowercase URL.

Does your php redirect or does it simply rewrite case-insensitively? When a rule creates a rewrite, you have to make sure it only accepts one form of the URL. Otherwise you get Infinite URL Space. Well, not literally infinite, but

wrongname = WRONGNAME = WrongName = WrOnGnAmE
and so on. Count the letters and you can see that the number of possibilities is 2^(len(pathname)) -- in this example 2^9 = 512 -- and only one of them is correct.

If your php checks casing and issues a true 301 redirect for mis-cased requests, that is fine and you should keep it. But don't let the php be case-insensitive.

Below is my complete .htaccess file.

But where's the rest? Don't you also have a bunch of rules that redirect requests from old ugly URL to new pretty URL?


* "Domain name canonicalization" = fancy way of saying "with/without www". There's more to it, but www is the part most people have to deal with.

g1smd - msg:4612845 - 8:05 am on Sep 26, 2013 (gmt 0)

The htaccess file looks good.

The fact that your PHP code redirects to a correctly cased URL is EXCELLENT. Make sure that if you request example.com/45-THIS-PAGE that it returns a 301 status (302 or 200 would be a disaster) and that it tells the browser to make a new request for www.example.com/45-this-page. Leave the A-Z part in the patterns.

Add these extra rules directly after
RewriteEngine On

# Change first slash to hyphen and strip optional trailing slash if requested.
# Redirect those requests to canonical hostname at the same time.
RewriteRule ^([0-9]+)/([A-Za-z0-9-]+)/?$ http://www.example.com/$1-$2 [R=301,L]

# Strip trailing slash and redirect those requests to canonical hostname at the same time.
RewriteRule ^([0-9]+)-([A-Za-z0-9-]+)/$ http://www.example.com/$1-$2 [R=301,L]

RewriteRule ^browse/([0-9]+)/$ http://www.example.com/browse/$1 [R=301,L]

RewriteRule ^videos/(health|fitness|weight-loss|self-esteem)/([0-9]+)/$ http://www.example.com/videos/$1/$2 [R=301,L]

RewriteRule ^(about|contact|privacy|terms)/$ http://www.example.com/$1 [R=301,L]

# Non-www to www hostname canonicalisation redirect (your existing rule stays here, unchanged).


The above rules redirect people to the correct URL if they accidentally add a slash on the end. You said the site had been online for a few days, so it's quite possible Google has already indexed URLs in that format. Without the redirect, visitors see your 404 page. With the redirect, they are bounced to the correct URL and see the content.

it's in the wrong place

The non-www redirect IS in the right place. I'm not sure the code used for it is optimum, but we can come back to that later.

Sometimes you have to accept that a two-step redirect is inevitable for some requests. Although slightly less efficient, if the PHP code also checked what hostname was requested as well as checking casing and naming and redirected those to the correct URL, the non-www to www canonicalisation redirect could perhaps be completely omitted from the htaccess file. However, that would leave image files open to double indexing. Alternatively, the non-www to www canonicalisation redirect could remain in the htaccess file, but with a whole bunch of preceding RewriteCond patterns such that any URL request that is dealt with by PHP is ignored by this one rule in htaccess.
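
Purely as an illustration of that last option (untested, and assuming the PHP really does check hostname, casing and page name for the article URLs), the htaccess rule might become:

# Hostname canonicalisation for everything the PHP script does not already handle
RewriteCond %{HTTP_HOST} !^(www\.example\.com)?$
RewriteCond %{REQUEST_URI} !^/[0-9]+-[A-Za-z0-9-]+$
RewriteRule ^(.*)$ http://www.example.com/$1 [R=301,L]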

That was probably a typo on g1's part.

Not this time. :) See above.

You do need to be aware that requests for example.com/about.php or for example.com/article.php?id=1234&slug=this-page-name currently return "200 OK" status. If those URLs should ever be indexed that's a serious Duplicate Content problem.

When implementing the type of hyphenated URL scheme you have now, it is usual to see a bunch of redirects from URLs with parameters to the new format. If the site has never been live with those types of URLs you can safely omit that stuff. However, do keep an eye on site logs, analytics and webmastertools for those types of URLs being requested. If it ever happens, you must add the additional code to redirect those requests.

Don't you also have a bunch of rules that redirect requests from old ugly URL to new pretty URL?

See above. The site has only been live for a few days, so it might not be necessary.

g1smd - msg:4612861 - 9:28 am on Sep 26, 2013 (gmt 0)

Make sure the 301 redirects generated by the PHP code begin with protocol and hostname.

Tehuti - msg:4613044 - 12:47 am on Sep 27, 2013 (gmt 0)

What a couple of days! I fully understand what's going on now!

I have implemented everything that you two have suggested.

First, if someone misspells the URL, here's the PHP code that redirects to the correct article:

<?php
$id = $_GET['id'];
$slug = $_GET['slug'];
$conn = mysql_connect("", "", "") or die(mysql_error());
mysql_select_db("");
$res = mysql_query("SELECT id, slug FROM `table` WHERE id = $id", $conn) or die(mysql_error());
$row = mysql_fetch_assoc($res);

if($row['id']) {

    // id exists: if the requested slug doesn't match the stored one, 301 to the canonical URL
    if($slug !== $row['slug']) {
        header("HTTP/1.1 301 Moved Permanently");
        header("Location: http://www.example.com/$id-{$row['slug']}");
        die();
    }
    else {
        // slug matches: fetch the article for output further down the page
        $sql = "SELECT article FROM `table` WHERE id = $id";
        $results = mysql_query($sql, $conn) or die(mysql_error());
        $row = mysql_fetch_assoc($results);
    }

} // End first if

else {
    // id not found: return a 404
    header("HTTP/1.1 404 Not Found");
    include('http://www.example.com/404.php');
    die();
}
?>


The code sits at the top of the page, above the doctype. It's right, right?

I checked if any of my old (ugly) URLs have been indexed by Google and found loads of them! Testing my newly-learnt mod_rewrite skills, as taught by you two, I tried to redirect the ugly indexed URLs to the new addresses and hit a problem.

Apparently, you can't have

RewriteRule ^(contact-us)\.php$ http://www.example.com/$1 [R=301,L]

and

RewriteRule ^contact-us$ /contact-us.php [L]

in a .htaccess file without causing an infinite loop!

I browsed the Web for an answer and discovered this:

[webmasterworld.com...]

It's G1 and JD helping someone else with the same problem.

I understand what I have to do, but I don't understand exactly how to do it.

Can anyone explain, please, the following line and how I have to implement it in my personal situation?

RewriteCond %{THE_REQUEST} ^[A-Z]+\ /ugly\.html\ HTTP/

Am I supposed to do the following for contact-us.php?


# At top of file, in the external redirects section:
RewriteCond %{THE_REQUEST} ^[A-Z]+\ /contact-us\.php\ HTTP/
RewriteRule ^(contact-us)\.php$ http://www.example.com/$1 [R=301,L]

# At bottom of file, in the internal rewrites section:
RewriteRule ^contact-us$ /contact-us.php [L]

g1smd - msg:4613050 - 1:15 am on Sep 27, 2013 (gmt 0)

Wow. That is some solid research, and useful questions.

I would prefer to see the bit of PHP code that returns 404 (because the requested ID does not exist) sit above the bit of PHP code that returns 301 (because the URL request is malformed in some way but is for a valid ID). I just find it a bit clearer to organise it that way:

- 404 if ID doesn't exist
- 301 if ID exists but requested URL wasn't quite correct
- send page out if URL request completely correct

One thing that I do directly after the 301 header is send out a very basic HTML page that says the page has moved, and where, with a href style link. I find that useful for debugging sometimes. If the 301 line is temporarily commented out, the HTML page and message can be viewed.

Another thing that I sometimes do is have the 404 or the 301 process write to a custom log file recording the date and time, requested URL, user agent, referrer and other things.
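
Something along these lines (a bare-bones sketch; the log location and the fields recorded are only examples):

// hypothetical helper: append one line per 301/404 event to a log kept outside the web root
function log_url_event($status, $target = '') {
    $line = date('Y-m-d H:i:s') . "\t" . $status
        . "\t" . $_SERVER['REQUEST_URI']
        . "\t" . (isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '-')
        . "\t" . (isset($_SERVER['HTTP_REFERER']) ? $_SERVER['HTTP_REFERER'] : '-')
        . "\t" . $target . "\n";
    file_put_contents('/path/to/logs/url-events.log', $line, FILE_APPEND | LOCK_EX);
}

// called just before the die() in the 301 branch:
// log_url_event('301', "http://www.example.com/$id-{$row['slug']}");
// and in the 404 branch:
// log_url_event('404');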

A word of warning. The "include" must be internal to the server. The page should not be fetched by the server from the web. The "include" should be done internally, fetching the file from the filesystem (remove protocol and hostname from the reference). Only then should it be sent out over the web.

You've run into the redirect-rewrite infinite loop problem! Everyone does eventually.

The solution is fairly simple. The RewriteRule that redirects needs a preceding RewriteCond looking at THE_REQUEST to ensure that when .php or when parameters are requested that the request came from out there on the web and not from here inside the server as the result of a prior internal rewrite. Only the former should be redirected.

THE_REQUEST looks at the literal request:
GET /about.php HTTP/1.1
GET /article.php?id=543&name=this-page-of-stuff HTTP/1.1
as sent by the browser, so that is what your pattern needs to match.

However, rather than ^[A-Z]+\ I usually use ^[A-Z]{3,9}\ at the start. It's a really minor point.

Yes, your example code looks about right, except perhaps the redirect target should be "contact" and not $1 as you captured "contact-us" (unless, of course, you have altered the URL since the earlier example you posted). If you have altered the URL, don't forget to add a redirect from old to new!

When the matching is such that it's only the addition or removal of ".php" do use the power of ^(about|contact|blah|this|that)... to combine rules.

You might also want to think about what happens if someone requests example.com/about.php?param=junk where about.php never processes parameters. It would be wise to either ensure 404 is returned or redirect to the correct URL. If you do nothing, this request is "200 OK" and more Duplicate Content. Every little improvement like this that you can implement makes the site ever more bomb proof. In this case, it's a small change to the RegEx pattern in the redirecting RewriteRule.
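
For example (a sketch only, reusing the about-us.php and contact-us.php filenames and the /about and /contact URLs from your earlier post), one pair of lines can handle both of those pages and also catch stray query strings:

# Redirect direct .php requests, with or without junk parameters, to the extensionless URL
RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /(about|contact)-us\.php[\ ?]
RewriteRule ^(about|contact)-us\.php$ http://www.example.com/$1? [R=301,L]

The [\ ?] class lets the condition match whether the literal request follows ".php" with a space or with the start of a query string, and the bare ? on the end of the target strips any query string from the redirect.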

Rule order is very important. The redirect goes in the redirects section of your file and the rewrite goes in the rewrites section of your file.

General order of RewriteRules:
-Blocks
-Redirects
-Rewrites

Within each group:
-most specific first,
-most general last.

Tehuti - msg:4613242 - 4:41 pm on Sep 27, 2013 (gmt 0)

More great stuff!

Okay, here's my revised PHP code:

<?php
$id = $_GET['id'];
$slug = $_GET['slug'];
$conn = mysql_connect("", "", "") or die(mysql_error());
mysql_select_db("");
$res = mysql_query("SELECT id, slug FROM `table` WHERE id = $id", $conn) or die(mysql_error());
$row = mysql_fetch_assoc($res);

if(!$row['id']) {
    // id not found: return a 404 and include the error page from the filesystem
    header("HTTP/1.1 404 Not Found");
    include('404.php');
    die();
}
else {

    // id exists: if the requested slug doesn't match the stored one, 301 to the canonical URL
    if($slug !== $row['slug']) {
        header("HTTP/1.1 301 Moved Permanently");
        header("Location: $id-{$row['slug']}");
        die();
    }
    else {
        // slug matches: fetch the article for output further down the page
        $sql = "SELECT article FROM `table` WHERE id = $id";
        $results = mysql_query($sql, $conn) or die(mysql_error());
        $row = mysql_fetch_assoc($results);
    }

}
?>


I thought I redirected all indexed ugly URLs last night, but I was wrong. I found many today that didn't seem to be redirecting. After taking a closer look, I noticed that they all had parameters (e.g., ?catid=12). A bit of research taught me that I need to use a different condition and make slight adjustments to the rewrite rule:

RewriteCond %{QUERY_STRING} ^catid=([0-9]+)$
RewriteRule ^category\.php$ http://www.example.com/%1? [R=301,L]


It worked, so I was happy.

Then, I found some indexed URLs that look like this:

www.mysite.com/?page=22

These must have been indexed when I tried pagination on my home page.

I can't seem to redirect these links to the home page. I tried the following, but it didn't work:

RewriteCond %{QUERY_STRING} ^page=[0-9]+$
RewriteRule ^index\.php? http://www.example.com? [R=301,L]


I'm guessing that the pattern in the rewrite rule is to blame. Am I right?

I also found that my image folder has been indexed:

www.example.com/images/

Shall I redirect it to the home page or something?

Another thing that I sometimes do is have the 404 or the 301 process write to a custom log file recording the date and time, requested URL, user agent, referrer and other things.


Give me your brain!

Tehuti - msg:4613248 - 4:55 pm on Sep 27, 2013 (gmt 0)

Out of curiosity, when we match the request variable in the rewrite condition, why don't we match the "1.1" at the end?

We match it up to here:

HTTP/

Never here:

HTTP/1.1

I'm assuming that it's not necessary.

Another curiosity ...

G1, you said that you use this:

^[A-Z]{3,9}\

If it's always just "GET", why not use this:

^GET\

lucy24 - msg:4613317 - 8:30 pm on Sep 27, 2013 (gmt 0)

RewriteRule ^index\.php? http://www.example.com? [R=301,L]

This rule will only work on requests that explicitly name "index.php" or "index.ph". If you ever used URLs in the form
example.com/?page=123

you will need to change the pattern to
^(index\.php)?

Frankly the ? in your example makes me a little uneasy because it suggests you've misunderstood what it's doing. In a pattern it doesn't mean "the request contains a query". It means "the preceding element (either a single letter, or a group in brackets or parentheses) is optional".

We match it up to here:

HTTP/

Never here

You only need to match up to the point where you're sure you have reached the end of the requested URI. In fact the "HTTP" itself doesn't matter; you only need it because you can't end a line with a literal space (escaped or not). For that matter you could say
^[A-Z]{3,9}\ \S+\s
since a request never contains literal spaces.

If it's always just "GET", why not use this:

^GET\

If you're absolutely sure that it is always GET, you can. But you forgot about that one php page that uses POST ;) Some user-agents, especially search engines, use HEAD.

[edited by: engine at 8:59 am (utc) on Sep 28, 2013]
[edit reason] fixed typo [/edit]

g1smd - msg:4613358 - 9:59 pm on Sep 27, 2013 (gmt 0)

RewriteCond %{QUERY_STRING} ^catid=([0-9]+)$
and later
RewriteCond %{QUERY_STRING} ^page=[0-9]+$

The code will be much more robust if you use
RewriteCond %{QUERY_STRING} (^|&)catid=([0-9]+)(&|$)
with the "capture" in %2, not %1, and
RewriteCond %{QUERY_STRING} (^|&)page=[0-9]+(&|$)

I am a little uneasy that you're looking at QUERY_STRING and not at THE_REQUEST. You need to be absolutely certain that NO request can ever generate an infinite loop.

If you still use catid= or page= internally within the site (within rules that internally rewrite), you must look at THE_REQUEST rather than QUERY_STRING in the above conditions.
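
For the catid example, that would look something like this (sketch only):

RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /category\.php\?catid=([0-9]+)[&\ ]
RewriteRule ^category\.php$ http://www.example.com/%1? [R=301,L]

Because THE_REQUEST only ever holds the request exactly as it arrived from the browser, an internal rewrite back to category.php can never re-trigger this redirect.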

----

The pattern
^index\.php? matches index.php and index.ph

You'll need
^(index\.php)?$ and that "$" on the end is very important.

In the rule target you have
http://www.example.com?
but you should have
http://www.example.com/?
with that extra slash.

----

Conditions looking at THE_REQUEST match only as far as HTTP/ so that the rule works for HTTP/1.0 and for HTTP/1.1 requests. There are a few rare occasions when you'll want the rule to work for a specific HTTP version, but mostly you want it to work for all. Additionally, one day there may well be a HTTP/1.2 or HTTP/2.0 that your site will have to deal with. You don't really want to have to go back and alter the code on every site you have ever worked on...

Tehuti - msg:4613387 - 2:08 am on Sep 28, 2013 (gmt 0)

G1 and Lucy, thank you for a very educational few days! My site is much better now, thanks to you two. I appreciate the effort very much. ;-)

Learning this stuff has actually been quite fun. It's made me obsess over URL-related SEO subtleties the significance of which I never used to appreciate. I even installed the Live HTTP Headers add-on so that I can be sure that the right headers are being called!

Note that I made a typo when I included the question mark in the rewrite pattern (i.e., ^index.php?). It was supposed to be a dollar sign.

I've got two last, nagging questions ...

1. In many of the mod_rewrite tutorials that I've come across, "RewriteBase /" is used at the top of the file to avoid having to include "^/" before each rewrite rule pattern. How is it that we have used neither?

2. I have pagination on my site. The URL looks like this:

www.example.com/articles/1

The "1" is the page number. At the moment, the pages only go as high as 15.

If someone manually changes the page number to, say, 100, no results appear.

Shall I adjust the pagination script such that, when there are no results, a 404 is given?

g1smd - msg:4613420 - 8:58 am on Sep 28, 2013 (gmt 0)

I prefer to specify the actual rewrite location. I've almost never used RewriteBase.

For pagination URLs you must return 404 for page numbers that do not exist. Failing to do so will see those URLs flagged by Google as "soft 404 errors". You don't want that.

Live HTTP Headers is very useful. Another useful tool is Xenu LinkSleuth.

I can recommend making a text file that lists several examples of every type of malformed or invalid URL that has ever been requested of your site. When parameters are involved make several extra test URLs by adding a junk parameter before the real one, a junk parameter after the real one and junk parameters both before and after the real one.

Add other URLs that you think of even if they haven't yet been requested. Take every URL in the list and add a capitalisation error to it. You want a list with every possible error and combination of errors that you can think of.

When the list is complete, presumably all with www URLs, duplicate every URL as non-www. When you add new entries in the future don't forget to add both www and non-www versions.

Feed this list to Xenu LinkSleuth using the import feature (or whatever it is called). Make sure you set the scan depth to the minimum number so that Xenu LinkSleuth requests only the URLs in the list and does not go on to spider the entire site. Let it scan your list of test URLs.

Carefully check the on-screen table and the generated report to make sure that every URL does the right thing: redirect, return 404, etc.

Some of your requests (e.g. non-www request with capitalisation error) are going to result in a double redirect. That's unavoidable without quite a bit more work. I would be very worried if any requests generate a triple redirect.

Go through your htaccess file and make sure that every rule has a code comment before it that explains in plain English what it is supposed to do. You'll thank yourself when you come back to the file next year.

I also number the rules, starting at 11 for rules that block access (something we haven't discussed, yet), 21 for redirects and 31 for rewrites. I use 21.a and 21.a.1 style numbering to group multiple similar rules. Additionally, very often, rewrite 32.b is the partner to redirects 22.b.1 and 22.b.2 and so on.

Numbering makes it much easier to edit a file with a lot of similar rules. Without the numbering it is very easy to scroll down the page to look at something else then forget which rule you were originally looking at.
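
As a purely hypothetical skeleton (the rule names are invented for illustration), the comments in such a file might read:

# [11] BLOCK: deny access to known bad bots
# [21.a.1] REDIRECT: old /id/page-name/ URLs to /id-page-name
# [21.a.2] REDIRECT: /id-page-name/ with trailing slash to /id-page-name
# [22] REDIRECT: non-www hostname to www (last of the redirects)
# [31.a] REWRITE: /id-page-name to /article.php?id=...&slug=...
# [31.b] REWRITE: /videos/... to /videocategory.php?cat=...&page=...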

[edited by: g1smd at 9:30 am (utc) on Sep 28, 2013]

lucy24 - msg:4613423 - 9:07 am on Sep 28, 2013 (gmt 0)

Note that I made a typo

So did I. But with any luck a moderator will come along and fix it for me :)

when I included the question mark in the rewrite pattern (i.e., ^index.php?). It was supposed to be a dollar sign.

Whew! Good to know.

"RewriteBase /" is used at the top of the file to avoid having to include "^/" before each rewrite rule pattern.

Not the pattern. The RewriteBase refers to the target. In any case / is the default. (The apache docs say confusingly "Default: none" but that doesn't mean there is no default, it means the default value is the empty string.)

This directive is required when you use a relative path in a substitution

I think it's identical between 2.2 and 2.4; they just put in a prettier picture. Note the words "relative path". That's the part you would never use anyway.

Shall I adjust the pagination script such that, when there are no results, a 404 is given?

Yes, absolutely, good idea. Make it as generic as possible: "If the parameters in the request don't result in a valid page, return a 404." This is one of the simplest things you can do in php. (Translation: even I can do it ;))
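
A minimal sketch (assuming a page parameter and some existing way of counting the total pages):

$page = isset($_GET['page']) ? (int) $_GET['page'] : 1;

// hypothetical: however the script already works out how many pages exist
$total_pages = get_total_pages();

if ($page < 1 || $page > $total_pages) {
    header("HTTP/1.1 404 Not Found");
    include($_SERVER['DOCUMENT_ROOT'] . '/404.php');
    exit;
}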

g1smd - msg:4613428 - 10:22 am on Sep 28, 2013 (gmt 0)

I have to say I have really enjoyed this thread.

When URL rewriting has been discussed here previous threads have often got tangled up, often for days on end, with discussion about very basic mod_rewrite syntax and the differences between redirects and rewrites. While I have no problem explaining those things again (and again and again) it's great to have a thread that immediately gets right into the details of the original question.

While you're new to URL rewriting, it's obvious that you're not new to RegEx or to programming in general. That has led to a much faster implementation. Pending your tests with LinkSleuth you're probably good to go. Feel free to ask more questions though.

Tehuti - msg:4613441 - 1:33 pm on Sep 28, 2013 (gmt 0)

Xenu looks very worthwhile! I have to check it out. I also like the idea of having a numbering system.

I will spend the next couple of weeks digesting and implementing the information that you two have extended my way. I'm actually looking forward to it. Making a site secure is satisfying.

Once again, thank you both for your assistance. WebmasterWorld doesn't look as presentable as other webmaster forums, but its community is on a different level.

And a special thank you for the extra goodie that you sent me, G1, via private message. Although it's way above my head at the moment, going through it will be both interesting and educational.

Beware, I may be back with more questions soon!

g1smd - msg:4613456 - 3:26 pm on Sep 28, 2013 (gmt 0)

I just want to check one thing with you.

In the example PHP code you posted earlier, the 301 redirect part had both the protocol and hostname missing from the target URL.

I mentioned this in msg:4612861 but didn't see a specific acknowledgement that you had fixed it. It's important. Add it to the list of stuff to tick off. :)

More questions are good! They've all been intelligent and well-phrased.

Tehuti - msg:4613563 - 3:08 am on Sep 29, 2013 (gmt 0)

Mate, I didn't notice that. Thanks for the heads-up!

When you asked me to remove the protocol and the hostname from the PHP include statement, I went and removed it from the 301 as well! Not very smart.

Butt-kissing aside, your powers of observation are truly remarkable. Your brain's obviously well-trained at analysing and dissecting details.

g1smd - msg:4613589 - 6:45 am on Sep 29, 2013 (gmt 0)

Thanks. Yes, it's all in the details... but I always miss a few because there can be so many. My brand of SEO is all about the technical bits of the site workings, and partly the actual content. On the other hand, my last link-building campaign was in April 2004. You won't find me discussing linking schemes and article directories except to warn people off. :)
