Apache Web Server Forum

This 32 message thread spans 2 pages.
A Few Rewrite Questions
Correcting spaces, duplicates. And initiate big SEO change?

 11:41 am on Feb 21, 2013 (gmt 0)

I have two duplicate issues in WMT. One is due to spaces the other due to a single apostrophe.

Space, %20, %25%20, +
The site displays info on sportspersons. URLs for each sportsperson page are like this:


there are duplicates for




Which should I be using, and how can I rewrite it?

Single Apostrophe
SQL injection is often targeted on the FAQ section:

/faq/index.php?id=200' and ' union .....

This is blocked but in the generated sitemap this is recorded:


which will duplicate with the genuine request:


How do I remove the single apostrophe from the end? Note that this only needs to happen in the /faq/ directory.

Total Rewrite
I appreciate that when setting up the site I could have made it more SEO friendly by not using get vars in the string. Would it now be too late to rewrite these:


to these:


In total there are over 100,000 names. Would implementing this change be too much for

a) the search engines - would 100,000 301s cause a penalty?
b) the server - what is the impact on performance?



 12:02 pm on Feb 21, 2013 (gmt 0)

Moving to an extensionless format reaps a lot of rewards. It is highly recommended. You have much better control over URL formats and can reject all requests with appended parameters.

It's quite a lot of work to set up, but it means you will have a LOT less maintenance going forward.

Setting up the system so that everything correctly redirects to the new URL is crucial.

If every part of the new URL can be gleaned from the old URL, then you can do it in just a few lines of code in the htaccess file.

If the "translation" is more complicated you need another method. Rewrite incoming requests for the old URL so that they are served by a special PHP script that looks up the new URL in a PHP array or in a database. The PHP script sends the correct 301 HEADER pointing to the new URL. If the old URL request cannot be fulfilled with a new URL, it is vital that the PHP script returns 404 and a "not found" error message - it must not ever return 200 OK. It's a couple of dozen lines of code for the whole thing.

Your proposed URL format is more complicated than it need be. You do not need the .php part at all. Replace /sportsperson.php/ with /sp- or similar.


 12:50 pm on Feb 21, 2013 (gmt 0)

Hm, I think g1 and I are editing in tandem

Which should I be using

None of them. You simply cannot win if you have a literal space anywhere in an URL, whether in the body of the path or in the query string.

%2520 is especially lethal because it means the URL has gone through two rounds of encoding: the first one to change the space into %20 and then a second one to change that leading % into %25. In real life you'll find this pattern in logs if you've got a locally based analytics program like piwik so there are double-nested query strings.

It doesn't have to stop there either. In my logs from a few months ago I find one building up to
%2525252525252525252525252525 252525252525252525252525252525 252525252525252525252520
with added real spaces for posting purposes

It wasn't an evil robot, it was me testing a rule. Looks like Firefox let it go through 40 (forty!) iterations before it put its foot down. This strikes me as over-optimistic. (I have no idea what I was testing, but I kinda doubt I was seeing how many redirects the browser would permit.)

How do I remove the single apostrophe from the end?

Is it the end of the entire query string or just the end of that particular parameter? It's easier when it's at the very end because there aren't as many captures to juggle, but

:: don't cut & paste, I'm making this up as I go along ::

RewriteCond %{QUERY_STRING} ^([^']+)(?:'|%27)(&.+)?
RewriteRule (blahblah) http://www.example.com/$1?%1%2 [R=301,L]

The %27 may not be necessary. I can never remember what gets disencoded when :(


to these:


The significant part is

which optimally would become


assuming all your players have the goodness to have exactly two names ;)

If your query string uses abbreviations like "hky" or "fball" there will be more rules involved. Whether you can still do it in mod_rewrite alone will depend on just how many sports you cover. Three or four or six sports, no problem. But if you're talking about curling, jai alai and Australian Rules Football, it may be php-script time.


 1:05 pm on Feb 21, 2013 (gmt 0)

The apostrophe is always at the end of the id no.


I filter out the full sql injection attempts but there are many requests for just the id with an apostrophe at the end.

The log file looks like this:

GET /faq/index.php?id=200%27 HTTP/1.1" 200

So a 200 is being returned and of course stored in the sitemap (whereas the full attempts are 403d and not sitemapped)


I got four weeks to decide / go ahead with this big change or not. Ranking well for the site 'theme' but not as good for the sportspersons profiles.


 1:09 pm on Feb 21, 2013 (gmt 0)

There will be three sports.

Oh and of course. There will be some names with apostrophes in them such as

Ed O'Neill

So that needs consideration I guess.


 1:56 pm on Feb 21, 2013 (gmt 0)

Irrespective of the actual name, only put [a-z0-9+.-] in the URL for the page.

I posted a short routine for this a few months back. It's one line of code.


 2:36 pm on Feb 21, 2013 (gmt 0)

GET /faq/index.php?id=200%27

Ah, so it is %27 then. I did say I can never remember which way it goes. But either way you should be able to grab it in mod_rewrite and forcibly redirect to the apostrophe-less version.
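
A minimal sketch of that redirect for the /faq/ case, assuming the id is always numeric and the stray apostrophe (literal or encoded as %27) is the very last thing in the query string. The www.example.com hostname is a placeholder and the rule is untested:

```apache
# Only fires for /faq/index.php with a numeric id followed by ' or %27.
RewriteCond %{QUERY_STRING} ^id=([0-9]+)(?:'|%27)$
# Redirect to the same page with the apostrophe dropped; %1 is the bare id.
RewriteRule ^/?faq/index\.php$ http://www.example.com/faq/index.php?id=%1 [R=301,L]
```

Because the condition is anchored at both ends, a full injection attempt such as id=200' and ' union ... does not match, so it is left for whatever rule already returns the 403.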

Those apostrophes really have it in for you don't they?

Assuming for the sake of discussion that you will not have one athlete named O'Neill with apostrophe and a second one named ONeill without apostrophe, you can simply leave it out of the URL. The tricky part will be the other direction: somehow your php page has to know that the apostrophe is there.

I suppose it is too much to hope that nobody's name starts with O unless it's O'Something. How many of them do you already have? You could always shunt all the O's -- or possibly just the On's and Om's and select other initials --to a preliminary lookup.

You've also got the problem of double-barreled names. You can easily go from ?name=Joe+Smith-Wesson to /joe-smith-wesson or from ?name=Jean-Pierre+Fou to /jean-pierre-fou ... But going back again you need to distinguish between the hyphens that are part of the name and the hyphens that are part of the new pretty URL. (Just recently someone had a similar issue with city names that went into URLs as san-francisco-hotels and las-vegas-restaurants and so on. Which was fine until it came time to extract the right pieces for each element of the query string. Can't remember how he ended up dealing with it.)


 2:57 pm on Feb 21, 2013 (gmt 0)

I have ' in the get string at the moment. This works fine and retrieves the right page from the database:


If that translates to


then surely the database is not going to find that exact match if I extract the name from after the second -

namesearch = "Ed ONeill"

Would it not be wise to include the apostrophe


or at least a token, such as ~ which can be identified and transformed before database search?



Yes there will be some strange names in there with double barrel, multiple initials such as Lord R W Fortescue-Smythe

As in the above example I could use a token, which should be URL friendly?


Then it's a case of transforming _ to - before the database search.


 3:24 pm on Feb 21, 2013 (gmt 0)

Apostrophe is not a valid character to use in a URL.

See the HTTP specification.

You might need an extra piece of data in the database for each page: the URL path.

I use the record number as a unique key. There was a detailed discussion about this just a week or so ago.


 6:31 pm on Feb 21, 2013 (gmt 0)

OK read those similar posts.

Having the database ID on the end will get around the name aliasing problem. The correct name will always be found because it will search the id rather than on the name, which may have missing apostrophes. Calling this:


I can rewrite that to call this:


RewriteRule ^hockey\-profile\-.*/([0-9]+)$ php/sportsperson.php?sport=hky&id=$1 [NC,L]

That's fine. Now I need to rewrite the other way, so that any existing links to the old pages will 301 to the new format?

Andy Langton

 6:37 pm on Feb 21, 2013 (gmt 0)

Spot on in terms of the redirect. But I would take a little more time to ensure your URLs are as ideal as possible, because you are talking about (essentially) creating new content as far as Google is concerned, and then attempting to pass back the value from your old content. Anticipate an amount of initial loss of value, even with a perfect implementation.

So, I would start with what you might consider 'perfect' URLs and work back from there. The technical side is, in some respects, the easiest part :)


 7:14 pm on Feb 21, 2013 (gmt 0)

Please read the recent thread about having the ID first followed by a hyphen then the slug text.

There are disadvantages in having the ID last, and it's a very bad idea to create a "folder level" for it.


 9:28 pm on Feb 21, 2013 (gmt 0)

Found it. Understand about the id being at the start.

I think I'll keep the repeat words down so dropped 'profile'.

RewriteRule ^([0-9]+)\-hockey\-.*$ php/sportsperson.php?sport=hky&id=$1 [NC,L]

This is now the output:



 10:07 pm on Feb 21, 2013 (gmt 0)

Now looking at rewriting the old urls (which are in search engines and on messageboards)

I can not rewrite that in .htaccess. I have to allow those old pages to exist then redirect in php, correct?

If that's so is it worth doing this at all? Isn't this going to double the number of requests, increase bandwidth, cpu overhead etc.

Google requests /sportsperson.php?sport=hky&name=Joe%20Dude

The script detects that the page is in the old format (and there is no ID), searches the database for the name and gets the ID. Now it issues a PHP redirect

header('Location: http://www.example.com/2419-hockey-Joe-Dude');

Hmm. This seems a lot of hassle just to get around an apostrophe problem.

I could drop the id and save a small part of the planet in wasted resources.


 10:34 pm on Feb 21, 2013 (gmt 0)

Yes, you have the right idea for the code. Implementing this makes it almost impossible for there to be duplicate content issues on the site. The most resource-intensive bit is the database lookup. On sites where I have used this, all the old URLs had the ID number in too, so it was a very quick lookup.

Moving to the hyphenated extensionless URLs fixes a whole load of problems that you never knew could exist on a site.

RewriteRule ^([0-9]+)\-hockey\-.*$ php/sportsperson.php?sport=hky&id=$1 [NC,L]
There's a couple of problems with your rewrite. Try:
RewriteRule ^([0-9]+)-hockey-(.*) /php/sportsperson.php?sport=hky&id=$1&text=$2 [L]
The hyphens should not be escaped.
Don't allow aNyCase otherwise you'll have infinite Duplicate Content.
If $1 is not a valid ID make sure the script returns the correct 404 response.
Make sure that .* is captured and the value is checked it is the right text for this ID, otherwise you'll have infinite Duplicate Content.
If the $2 text is not right for this ID, make sure the PHP script redirects the user to the correct URL.

For an easy life, go all lower case for the URLs. Here's the simpler version of the normalisation code: [webmasterworld.com...]
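
One possible way to enforce lower case at the server itself, sketched on the assumption that the URLs look like 2419-hockey-joe-dude and that you can edit the server config (RewriteMap is not available in htaccess). The hostname is a placeholder:

```apache
# RewriteMap must be declared at server or virtual-host level.
RewriteMap lc int:tolower

# If the requested path contains any capital letter, redirect to the
# lower-cased form so that only one spelling of each URL ever returns 200.
# NC lets the pattern match the mixed-case request; the condition stops
# the rule from firing again once the URL is all lower case.
RewriteCond %{REQUEST_URI} [A-Z]
RewriteRule ^/?([0-9]+-hockey-.*)$ http://www.example.com/${lc:$1} [NC,R=301,L]
```

This goes before the internal rewrite, so the script only ever sees lower-case requests.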


 1:32 am on Feb 22, 2013 (gmt 0)

With possibility of trailing - in the URL? Can't say I like the looks of that. If it's the whole element that is optional, it would have to be


If your url is in the form


then I honestly don't see where you've gained anything by changing formats. Especially for the increasing number of users who never even look at the URL and have no idea what you're talking about when you refer to the browser's address bar.

I don't always care for the "If it ain't broke don't fix it" line of thinking* but sometimes you gotta stop and consider it.

* Implying that there are only two possible states, "broke" and "not broke".


 2:34 am on Feb 22, 2013 (gmt 0)

If the PHP script can redirect truncated URL requests then .* is OK. If not, then .+ will have to be used.

I am wary of using friendly URLs with fake folder levels. Get the whole lot hyphenated, with the ID first.


 9:59 am on Feb 22, 2013 (gmt 0)

3% of names have an apostrophe in them. 97% don't! It seems a bit over the top to perform php redirects and two database searches just for a 3% problem.

Why not the token idea? Rather than display ' use _ instead


the sportsperson.php script splits on the first - to get

$split1 = 'hockey';
$split2 = 'Ed-O_Neill';

$name = str_replace("-", " ", $split2);
$name = str_replace("_", "'", $name);

$name = "Ed O'Neill" which is the valid name in the database

So is this a go or not?

I don't mind if there is an initial drop because I am not getting much traffic from the current setup. Article pages are fine, the profile pages are not.

I don't mind if this change causes a 100% drop in searches for profiles but I am wary that such a big change would cause a sitewide penalty.


 10:12 am on Feb 22, 2013 (gmt 0)

Never use underscores or spaces in URLs.

If you need to split and use something other than a hyphen, go read the HTTP/1.1 specification and look specifically at the section that lists the valid characters that can be used in the path part of the URL in an unencoded form. There's at least comma, tilde, plus, and a few others to choose from.

[edited by: g1smd at 10:25 am (utc) on Feb 22, 2013]


 10:21 am on Feb 22, 2013 (gmt 0)

Overlapping g1 and unfortunately saying the exact opposite, which was bound to happen sooner or later :(

Cool. That's your apostrophes taken care of. And you can deal with hyphenated names in the same way:

$1'blahblah (apostrophe, also covers your D'Souzas and l'Enfants and so on)

$1-blahblah (hyphen)

Now you're fine as long as you don't get athletes with glottal stops in mid-name. Are any of your sports popular in Hawai'i? Conversely if someone is using a professional name like T-Bone I don't want to hear about it.

Further edit:
Tilde. Yeah, that would work. Even though I don't ::cough-cough:: share some people's ineradicable loathing of lowlines in URLs. It's hyphens that set my teeth on edge ;)

plus and a few others

I thought that was a typo until I remembered the literal + sign. If your name is apache dot org you can even use literal periods . in mid-url, but I think most people would not recommend this.


 10:27 am on Feb 22, 2013 (gmt 0)

Underscores visually disappear in underlined links and are indistinguishable from spaces.

Underscores are not fully treated as word separators by search engines.

Spaces and many others should be encoded, and that makes%20the%20URL%20unreadable.

I used to use periods as separators, but I now usually use them only as the separator between filename and extension.


 1:22 pm on Feb 24, 2013 (gmt 0)

This is what I have now:


RewriteMap lc int:tolower

<Directory "/home/example/public_html">
RewriteEngine on

#This rewrites the new style URL in a format the script can handle
RewriteRule ^(.*)-hockey-(.*)$ /php/sportsperson.php?sport=hky&code=$1&name=${lc:$2} [L]

#This redirects any existing links / SEO pages into the new format
RewriteCond %{QUERY_STRING} ^sport=(.*)&code=(.*)&name=(.*)&valid=(.)$
RewriteRule ^sportsperson.php$ http://www.example.com/${lc:%2}-hockey-${lc:%3}? [R=301,L]



I dropped the id in the URL. The code is a flag for, say, major, minor, college. A typical URL will now look like:


which will be rewritten as


For Ed O'Neill I will use underscore (as I said only a small percentage have apostrophes)



Is this good to go?


 3:24 pm on Feb 24, 2013 (gmt 0)

Two problems.

(.*) means "read all of the request to the very end and capture it". When it sits at the beginning or in the middle of a pattern, the regex engine first swallows everything and then has to back off one character at a time until the rest of the pattern fits. Keep (.*) for the end of a pattern only. Elsewhere, replace it with a more specific element, such as ([^&]+) or similar. Failure to fix this will cause the server to perform many needless "back off and retry" trial match attempts per request.

The redirecting rule will be triggered again after the internal rewrite as the internally rewritten path now matches the redirecting rule again. It is vital that the redirecting rule tests THE_REQUEST to be sure the rule redirects only on user request for parameters and not internal request for parameters after a rewrite. Failure to follow this step leads to an infinite loop.
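
A sketch of what such a THE_REQUEST test can look like. It assumes the old URLs have the form /sportsperson.php?sport=hky&code=...&name=...&valid=... and handles only names that are already free of spaces. The hostname is a placeholder and the character classes are guesses at what the data contains:

```apache
# THE_REQUEST holds the original request line sent by the client, e.g.
#   GET /sportsperson.php?sport=hky&code=min&name=joe-dude&valid=1 HTTP/1.1
# It is not altered by internal rewrites, so this redirect can only be
# triggered by a real client request for the old URL and cannot loop.
RewriteCond %{THE_REQUEST} ^[A-Z]+\ /sportsperson\.php\?sport=hky&code=([a-z]+)&name=([a-z_-]+)&valid=.\ HTTP/
# The trailing ? on the target discards the old query string.
RewriteRule ^/?sportsperson\.php$ http://www.example.com/%1-hockey-%2? [R=301,L]
```

Each capture uses a narrow character class instead of (.*), so it stops at the end of its own parameter.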


 9:20 pm on Feb 24, 2013 (gmt 0)

How about this:

RewriteRule ^([^&]+)-hockey-(.*)$ /php/sportsperson.php?sport=hky&code=$1&name=$2 [L]

What about the spaces being converted to %2520 from the old links?

This is what search engines have previously called:

/sportsperson.php?sport=hky&code=min&name=Joe Dude&valid=1

if an SE visitor clicks on that link they then see this:


The database finds the right profile but the URL box is showing that, rather than


Note this is only for SE and links from other sites. The new format script is generating the correct format.


 10:17 pm on Feb 24, 2013 (gmt 0)

([^&]+) goes in the Condition testing parameters. It reads only the rest of this parameter rather than all of them in one go.

In the Rule, you'll need something else, something like
([^-]+) or similar.
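
As an illustration, the internal rewrite could be written as below. This is a sketch that assumes the code part is letters only and that names use just lower-case letters, digits, hyphens and underscores. Adjust the classes to whatever the real data needs:

```apache
# code: letters only, so the capture stops at the first hyphen.
# name: everything after "-hockey-", limited to characters the site
# actually puts in its URLs. Anything else fails to match and falls
# through to a 404.
RewriteRule ^/?([a-z]+)-hockey-([a-z0-9_-]+)$ /php/sportsperson.php?sport=hky&code=$1&name=$2 [L]
```

Using ([^-]+) for the first capture would behave much the same here; the narrower [a-z]+ also rejects digits and punctuation in the code part.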

You'll need a separate ruleset to redirect parameter requests with spaces. As I said at the beginning, "Setting up the system so that everything correctly redirects to the new URL is crucial". URLs with spaces shouldn't be a surprise at this stage, and your plans should include dealing with those right from the beginning.

Your first step should be to make a list of all URL formats that the site has ever used, both intentional and unintentional, as well as the proposed new formats. Documentation first, coding second.


 3:01 am on Feb 25, 2013 (gmt 0)

In the pretty URL is it just one word before the name of the sport? If so you don't have to get fancy; a simple ([a-z]+) will do.

Conversely does the person's name in the pretty URL come after the name of the sport, with nothing else after that? That makes the whole thing a ### of a lot easier.

<Directory "/home/example/public_html">

Oh, oops, you're in a config file. I'd been assuming htaccess. In config outside a <Directory> section the pattern needs to start with /; inside <Directory> the leading path is stripped just as in htaccess. Either way it makes no difference here since you won't capture it.


 10:47 am on Feb 25, 2013 (gmt 0)

Yes yes and yes.


code will only ever be one of three words consisting of letters. sport one word. name could be


This is indeed in httpd.conf

I pretty much got this sorted from a new format point of view. I have the database php script correctly generating the new format links.

A search for Neill and hockey in the database search script returns a list:


The only issue I have is that SEs and links from other sites have spaces in them, which the SEs have converted to %20. If I google for Joe Dude hockey stats I will see something like this:

Joe Dude Stats for Major League Hockey
http://www.mysite.com/sportsperson.php?stat=hky&code=maj&name=Joe%20Dude&valid=1

and when I click on that link the correct script will run but the browser URL now shows:

http://www.mysite.com/aw-jockey-joe%2520dude

which will surely cause duplicates.


 11:07 am on Feb 25, 2013 (gmt 0)

You need another rule before all of the others. This one will convert spaces to hyphens as well as do whatever the existing redirect does.

Requests with spaces will be processed by Rule 1.

Requests without spaces will not match the RegEx pattern in Rule 1 and so Rule 1 will be skipped. Rule 2 (your existing redirect) will process those requests.
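
A rough sketch of what Rule 1 might look like for the simplest case, a name with exactly one space in it. It assumes the old URL layout shown earlier in the thread and a server-level "RewriteMap lc int:tolower". The hostname is a placeholder. Names with more than one space would need further rules, or the script approach:

```apache
# Rule 1: old URL whose name holds exactly one encoded space.
# %(?:25)*20 matches %20 as well as the multiply-encoded %2520, %252520 ...
# and \+ matches a space sent as a plus sign.
RewriteCond %{THE_REQUEST} ^[A-Z]+\ /sportsperson\.php\?sport=hky&code=([a-z]+)&name=([A-Za-z_-]+)(?:%(?:25)*20|\+)([A-Za-z_-]+)&valid=.\ HTTP/
# Join the two halves of the name with a hyphen and lower-case them.
RewriteRule ^/?sportsperson\.php$ http://www.example.com/%1-hockey-${lc:%2}-${lc:%3}? [R=301,L]
```

A request whose name has no space fails the condition, so it drops through to Rule 2.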

It's at this point that instead of trying to do complex URL format manipulations in htaccess I instead elect to rewrite the requests to a special PHP script that does all the fancy stuff and which also sends the 301 status too.

In this way, I end up with just one rule in htaccess for URL requests with parameters. The PHP script then sorts out the correct response, 404 or 301, and the correct URL if it's a 301. The PHP script then sends the response to the browser.
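
The single rule might look something like this. The script name /php/redirect-old.php is made up for the example; the script itself would do the lookup and send the 301 or 404:

```apache
# Match only real client requests for the old URL with a query string;
# internal rewrites do not change THE_REQUEST, so there is no loop.
RewriteCond %{THE_REQUEST} ^[A-Z]+\ /sportsperson\.php\?
# Internal rewrite, not a redirect. The original query string is passed
# along unchanged because the target contains no "?".
RewriteRule ^/?sportsperson\.php$ /php/redirect-old.php [L]
```

All the awkward cases (spaces, %2520, apostrophes, unknown names) are then handled in one place in PHP instead of in a growing pile of rules.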


 6:53 pm on Feb 26, 2013 (gmt 0)

OK I have gone the php redirect route. This should actually not waste load because only requests for the sportsperson script will be redirected - the rest of the site, blog, forum, articles won't need to see all that.

1. httpd.conf has just one task: rewrites new format to existing script

2a. Existing script detects spaces, apostrophes and str replaces
2b. Detects if accessed via old format and issues a 301 to the new format if necessary.

I'm happy with the php script but can you do a final flight check on the httpd.conf before I launch it?


<Directory "/home/example/public_html">
RewriteEngine on
RewriteCond %{HTTP_HOST} ^example\.co\.uk
RewriteRule (.*) http://www.example.co.uk/$1 [R=301,L]

RewriteCond %{REQUEST_URI} ^(/[^/]+/)/+(.*)$ [OR]
RewriteCond %{REQUEST_URI} ^(/)/+(.*)$
RewriteRule ^. http://www.example.co.uk%1%2 [R=301,L]

#fix issue with mail. duplication
RewriteCond %{HTTP_HOST} ^mail\.example\.co\.uk$ [NC]
RewriteRule ^(.*)$ http://www.example.co.uk/$1 [R=301,L]

#build string for script based on new format
RewriteRule ^([a-zA-Z]+)-hockey-(.*)$ /php/sportsperson.php?sport=hky&code=$1&name=$2 [L]



Here's the start of the sportsperson script:


// read get vars issued either by new format or via SE links
$r_sport = $r_code = $r_name = '';
if (isset($_GET['sport'])) $r_sport = substr(filter_var($_GET['sport'], FILTER_SANITIZE_STRING), 0, 10);
if (isset($_GET['code'])) $r_code = substr(filter_var($_GET['code'], FILTER_SANITIZE_STRING), 0, 3);
// FILTER_FLAG_NO_ENCODE_QUOTES keeps ' as a literal character; without it the
// filter turns it into &#39; and the apostrophe test below never fires
if (isset($_GET['name'])) $r_name = urldecode(substr(filter_var($_GET['name'], FILTER_SANITIZE_STRING, FILTER_FLAG_NO_ENCODE_QUOTES), 0, 100));

$redirect_req = 0;
// if name has a space in it, replace it and flag up redirect required
if (strpos($r_name, ' ') !== false) {
    $redirect_req = 1;
    $r_name = str_replace(' ', '-', $r_name);
}
// if name has an apostrophe in it, replace it and flag up redirect required
if (strpos($r_name, "'") !== false) {
    $redirect_req = 1;
    $r_name = str_replace("'", '_', $r_name);
}
if ($redirect_req == 1) {
    $redirect_url = strtolower('http://www.example.co.uk/' . $r_code . '-' . $r_sport . '-' . $r_name);
    header("HTTP/1.1 301 Moved Permanently");
    header("Location: $redirect_url");
    exit; // stop here so no page body is sent after the redirect
}



 10:25 pm on Feb 26, 2013 (gmt 0)

RewriteCond %{HTTP_HOST} ^example\.co\.uk
RewriteRule (.*) http://www.example.co.uk/$1 [R=301,L]

Urk. This should come at the very end of all redirects-- after the generic /index.php redirect which comes second-to-last (and which, ahem, seems to be missing ;)) --and the condition should be expressed as


meaning "If the requested host is anything other than the ONE acceptable form of my domain name".
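
A common way to express that kind of condition, shown here as an illustrative sketch rather than the poster's exact line. It assumes www.example.co.uk is the one canonical hostname:

```apache
# Placed after all other external redirects.
# Redirect unless the Host header is exactly www.example.co.uk. The
# optional group also lets an empty Host header (HTTP/1.0 clients)
# through without redirecting.
RewriteCond %{HTTP_HOST} !^(www\.example\.co\.uk)?$
RewriteRule ^/?(.*)$ http://www.example.co.uk/$1 [R=301,L]
```

With a negative test like this, the bare domain, the mail. subdomain and any other stray hostname are all caught by one rule, so a separate mail. redirect is no longer needed.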
