.htaccess again - replacing a string

Using .htaccess to remove strings

         

zapotex

8:22 pm on Nov 28, 2011 (gmt 0)

10+ Year Member



Dear All,

Congratulations on your amazing forum! I have ended up here a ton of times while googling for solutions to my problems. Now I have decided to register as well, and this is my first post!

I'm using a CMS that does a really poor job regarding SEO: there is a ton of duplicate content!

I'm planning to fix this with .htaccess. Here's what I would like to do:

1) For the page "displayimage.php" with parameters, I would like to remove these strings:
a) "album=ANYTHING&", where ANYTHING can be any number or string
b) "cat=ANYTHING&", where ANYTHING can be any number or string
c) "&uid=ANYTHING", where ANYTHING can be any number or string

2) For the page "thumbnails.php" with parameters, I would like to remove
a) "&page=1"

I have looked for examples of this and could not find much. Unfortunately I'm not familiar with regex, so I have a hard time figuring out how to do things in .htaccess unless I find an example of exactly what I'm looking for. Here is my current .htaccess:

ErrorDocument 401 /errorpages/error-401.php
ErrorDocument 404 /errorpages/error-404.php
ErrorDocument 500 /errorpages/error-500.php
RewriteEngine On
RewriteBase /
Redirect 301 /galleries/index.php?cat=0 http://www.example.com/galleries/index.php
RewriteCond %{HTTP_HOST} \.us$
RewriteRule ^(.*)$ http://www.example.com/$1 [R=301,L]
RewriteCond %{HTTP_HOST} \.org$
RewriteRule ^(.*)$ http://www.example.com/$1 [R=301,L]
RewriteCond %{HTTP_HOST} \.net$
RewriteRule ^(.*)$ http://www.example.com/$1 [R=301,L]
RewriteRule ^index\.html?$ / [NC,R,L]
RewriteCond %{HTTP_HOST} !^www\. [NC]
RewriteRule .* http://www.%{HTTP_HOST}%{REQUEST_URI} [R=301,L]
RewriteCond %{HTTP_HOST} ^www\.example\.com$
RewriteRule ^$ http://www.example.com/index.php [L,R=301]
RewriteCond %{REQUEST_URI} /galleries/$
RewriteRule ^(.*) http://www.example.com/galleries/index.php [L,R=301]


Thanks a lot in advance to everyone!

[edited by: eelixduppy at 1:33 am (utc) on Dec 1, 2011]
[edit reason] exemplified [/edit]

g1smd

9:37 pm on Nov 28, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Don't use Redirect here. Use RewriteRule for all of the rules.

R gives a 302 redirect. Use R=301 here.

Add a blank line after every RewriteRule to make the code more readable.

Note that the rules in htaccess cannot change your URLs. You must alter the links on the page to point to the correct URL. That is where the change occurs.
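
For illustration, the first two points above could look roughly like this. It is only a sketch, assuming the .htaccess file sits in the site root and using the example.com hostname from the original post. mod_alias's Redirect matches the URL path only and never sees the query string, so the "cat=0" case needs a RewriteCond on QUERY_STRING instead.

```apache
RewriteEngine On

# Replaces "Redirect 301 /galleries/index.php?cat=0 ...".
# The query string is tested in a RewriteCond; the bare "?" at the end of
# the target drops it from the redirected URL.
RewriteCond %{QUERY_STRING} ^cat=0$
RewriteRule ^galleries/index\.php$ http://www.example.com/galleries/index.php? [R=301,L]

# A bare R sends a 302; R=301 makes the redirect permanent.
RewriteRule ^index\.html?$ http://www.example.com/ [NC,R=301,L]
```

Once redirected, the request no longer carries "cat=0", so the first rule does not fire a second time.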

lucy24

11:40 pm on Nov 28, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Also note that if your main concern is SEO, you can start by telling g### and the rest of them to disregard the named parameters. If you actually don't use them, you can then simply let them die a natural death.

Do all of your domains pass through this same .htaccess? You can collapse the "host" group of rules into

%{HTTP_HOST} \.(us|org|net)$

or simply

%{HTTP_HOST} !^(www\.example\.com)?$

Put this rule after all the specific redirects. It only needs to pick up the remaining requests that haven't already been redirected by other means.
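
As a sketch of the collapsed version, assuming every domain is served from the same .htaccess in the site root, the three per-extension rules and the non-www rule from the original post could shrink to a single pair of lines:

```apache
RewriteEngine On

# Redirect any hostname other than exactly www.example.com.
# The trailing "?" after the group also lets an empty Host header through,
# so clients that send no Host at all are not redirected.
RewriteCond %{HTTP_HOST} !^(www\.example\.com)?$
RewriteRule ^(.*)$ http://www.example.com/$1 [R=301,L]
```

The requested path is carried over in $1, and because the target has no query string of its own, any existing query string is passed along unchanged.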

zapotex

11:38 am on Nov 30, 2011 (gmt 0)

10+ Year Member



Hi g1smd and lucy24, thanks a lot for your kind help! :-)

I'll definitely edit the htaccess file and make it more readable as you suggest. Concerning your point:
Note that the rules in htaccess cannot change your URLs. You must alter the links on the page to point to the correct URL. That is where the change occurs.

I'm not sure I understand. If I instruct htaccess to rewrite a URL, the page with the new address will be sent to the browser. It is the same page as the old one, simply without the parameters I do not want in the title. Am I getting the meaning of htaccess completely wrong?

Also note that if your main concern is SEO, you can start by telling g### and the rest of them to disregard the named parameters. If you actually don't use them, you can then simply let them die a natural death.

I actually DO use them. displayimage.php?pid=4 is not the same page as displayimage.php?pid=5. The parameters that create duplicate content are the others (displayimage.php?pid=4&album=XX&cat=YY is the same as simply displayimage.php?pid=4)

In other words, I only want to remove some of the parameters.

Do all of your domains pass through this same .htaccess? You can collapse the "host" group of rules into

%{HTTP_HOST} \.(us|org|net)$

or simply

%{HTTP_HOST} !^(www\.example\.com)?$

Put this rule after all the specific redirects. It only needs to pick up the remaining requests that haven't already been redirected by other means.

Will do! I'm a noob with htaccess and I wrote a different rule for each extension because I had no idea how to put them together! Thanks for your help!

lucy24

9:36 pm on Nov 30, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



If I instruct htaccess to rewrite an url, the page with the new address will be sent to the browser. It is the same page as the old one, simply without the parameters I do not want in the title. Am I getting the meaning of htaccess completely wrong?

The parameters, unlike # fragments, are part of the URL that the server receives. Make sure you've got a solid grip on the difference between redirect and rewrite. From the user's side: if the browser's address bar is different from what they initially typed or clicked, it's a redirect. If the address stays the same, it's a rewrite.

Are you saying here that you want to use certain parameters, but not have them show up in the address bar? If they're not anywhere in the URL, they don't exist as far as htaccess is concerned. But it doesn't have to be all or nothing. You can delete selected parameters, or tell g### to ignore only those specific ones.
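
To make the redirect/rewrite distinction above concrete, here is a minimal sketch. All four file names are hypothetical and stand in for whatever the site actually uses; the hostname is the example.com from the original post.

```apache
RewriteEngine On

# External redirect: the server answers with a 301 and a new location, the
# browser makes a second request, and the address bar changes.
RewriteRule ^old-page\.html$ http://www.example.com/new-page.html [R=301,L]

# Internal rewrite: no R flag and no hostname in the target. The server
# quietly serves a different file and the address bar stays the same.
RewriteRule ^gallery$ /gallery.php [L]
```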

zapotex

11:02 pm on Nov 30, 2011 (gmt 0)

10+ Year Member



Make sure you've got a solid grip on the difference between redirect and rewrite.

Oohh... OK, I really need to study, I guess. Thanks for that. I did not know rewrite vs. redirect was an important point.

if the browser's address bar is different from what they initially typed or clicked, it's a redirect. If the address stays the same, it's a rewrite.

Then I DEFINITELY want a redirect. A 301 one. I can also achieve it with a RewriteRule, if I specify R=301, right?

Are you saying here that you want to use certain parameters, but not have them show up in the address bar? If they're not anywhere in the URL, they don't exist as far as htaccess is concerned

Actually no. I'm using only 1 parameter, "pid". I want to delete all the others and redirect the user (and the bots) to the page with the "clean" URL that contains only the PID parameter. It all began like this: my CMS (coppermine gallery, a real SEO nightmare) generates links to a given photo from the album the photo is in, from the "most viewed" page, from the "last updated" page, etc... All the links are different, but lead to the exact same photo, so I have a duplicate content issue :-( I would like to fix exactly that.

You can delete selected parameters

EXACTLY what I'm trying to do!

tell g### to ignore only those specific ones

I had no idea I could... Could you please point me to a link that explains how? I thought of robots.txt, but it only prevents crawling, not indexing :-( I could put nofollow tags, but I would have to get my hands VERY dirty with the PHP code of my CMS... That's why I decided to do it with htaccess...

THANKS for all your precious help! Best

lucy24

11:32 pm on Nov 30, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



It's one of those google paradoxes. In order to tell them not to do such-and-such, you have to get chummier with them. Sign up for Google Webmaster Tools, and also for Bing and anyone else who has an equivalent. Somewhere among the menus there will be something called-- if you're lucky-- "parameters". If you're not lucky, you'll have to grub through all the menus until you find it.

:: shuffling papers ::

In GWT it's under Site configuration >> URL parameters
Only use this feature if you feel confident about how parameters work for your site. Telling Googlebot to exclude URLs with certain parameters could result in large numbers of your pages disappearing from our index.

Well, that's exactly what you want them to do, but thanks for the warning anyway :)

Can't say I really like the wording, though. I hope what they mean is "exclude certain parameters from URLs", because as written it makes it sound as if any URL that even contains the parameter will be excluded.

zapotex

4:37 pm on Dec 1, 2011 (gmt 0)

10+ Year Member



Hi everyone! Thanks to your advice I made progress that just yesterday I did not even think possible! :-)

Here is what I have now:


RewriteCond %{THE_REQUEST} /galleries/displayimage.php(.*)&pid=([0-9]+)
RewriteRule ^(.*) http://www.example.com/galleries/displayimage.php?pid=%2? [L,R=301]

THIS DOES ALMOST EVERYTHING I WANTED! It is really great, but there is one more thing.

My CMS generates links such as this:

http://www.example.com/galleries/displayimage.php?album=45&pid=917#top_display_media


The RewriteRule I created thanks to your advice removes the parameters that are not "pid", just like I wanted, but it leaves the bookmark. For the example above, the user is redirected to:

http://www.example.com/galleries/displayimage.php?pid=917#top_display_media


The "album" parameter has been correctly removed, but the #top_display_media has not!

Can you help me fix this issue too?

Thank you so much!

Davide

zapotex

4:40 pm on Dec 1, 2011 (gmt 0)

10+ Year Member



And about Google: THANKS a lot to Lucy for the thorough explanation! I also found the wording pretty scary and I think that, for the moment, I'll stick to redirection and see how it goes :-)

But I'll investigate about Google too.

Thanks a lot everyone!

zapotex

4:54 pm on Dec 1, 2011 (gmt 0)

10+ Year Member



One more problem: I'm trying to remove "&page=1" from the URL. Here is what I have:

RewriteCond %{THE_REQUEST} /galleries/thumbnails.php?album=([0-9]+)&page=1$
RewriteRule ^(.*) http://www.example.com/galleries/thumbnails.php?album=%1? [L,R=301]

Unfortunately nothing happens :-(

I don't see why the pattern does not match. It is exactly like the URL that my CMS generates, for example:
http://www.example.com/galleries/thumbnails.php?album=45&page=1

and I would like to remove &page=1 in order to remove the duplicate content.

Could you please help me on this!

Thanks a lot!

zapotex

7:37 pm on Dec 1, 2011 (gmt 0)

10+ Year Member



I also tried removing the $ after "&page=1", but it still does not work :-(

wilderness

7:41 pm on Dec 1, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



search the archives for "?" "string" and "QSA"

lucy24

8:38 pm on Dec 1, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



/galleries/thumbnails.php?

Bzzt! Bzzt! Bzzt!
Someone hereabouts just went digging in the Forums Library and found an elderly post that gives a quick rundown on Regular Expressions.

The bit I've quoted contains two mistakes. One of them will generally not make a difference. The other is lethal. That is, it won't crash your server, it will just make the rule fail every time.

Can you really search for ? alone? Surely every single thread in every single Forum contains a question mark :)

zapotex

8:42 pm on Dec 1, 2011 (gmt 0)

10+ Year Member



Awesome! It works! So the question mark needs to be escaped in the pattern. OK, got it.

On the other hand, I still don't get when you're supposed to put it at the end of the rule. Sometimes you have to, otherwise it appends a whole bunch of things; other times you get a %3f if you use a ?.

Thanks a lot! Problem solved!

g1smd

8:48 pm on Dec 1, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Since THE_REQUEST is the literal GET request sent by the browser, you should begin your RewriteCond pattern with
^[A-Z]{3,9}\ /
to match the literal GET or POST at the start.

There's also no need to capture the RewriteRule pattern using brackets on the (.*) if you don't re-use $1.

Additionally, using .* means the whole RewriteCond has to be evaluated for every request for images, stylesheets, and pages in other folders. This slows the site down.

Use
^galleries/thumbnails\.php$
for the RewriteRule pattern instead. This ensures the condition is evaluated only if the rule pattern is a match.

It's detailed in a diagram in the Apache manual, and I did not understand the significance of it for at least a year, but the Rule pattern is evaluated first. The RewriteCond lines are evaluated only if the rule pattern was a match.
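
Putting these suggestions together with the escaping fix, the "&page=1" redirect might be sketched as follows. This assumes the .htaccess lives in the site root and that the query string is exactly "album=N&page=1", as in the example URL earlier in the thread.

```apache
RewriteEngine On

# The rule pattern is tested first, so the condition is evaluated only for
# requests to galleries/thumbnails.php.
# "\." and "\?" match a literal dot and question mark. "\ " matches the
# literal spaces in the request line, which looks like
#   GET /galleries/thumbnails.php?album=45&page=1 HTTP/1.1
RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /galleries/thumbnails\.php\?album=([0-9]+)&page=1\ HTTP/
RewriteRule ^galleries/thumbnails\.php$ http://www.example.com/galleries/thumbnails.php?album=%1 [R=301,L]
```

Note that THE_REQUEST ends with the protocol version, not with the query string. That is why a "$" placed directly after "&page=1" can never match, and why the pattern here closes with "\ HTTP/" instead. Because the target carries its own query string, the original one is replaced without needing a trailing "?".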

zapotex

9:12 pm on Dec 1, 2011 (gmt 0)

10+ Year Member



The bit I've quoted contains two mistakes.
Found them! I have to escape the . AND the ? :-) Done and now it works!

Can you really search for ? alone?
I searched for QSA and I found what I needed!

Thanks Lucy!

g1smd: thanks a lot to you too!

Since THE_REQUEST is the literal GET request sent by the browser, you should begin your RewriteCond pattern with ^[A-Z]{3,9}\ / to match the literal GET or POST at the start.

I will do as you suggest! I actually thought that I could get away without the [A-Z]{3,9} if I just did not put the ^ before the pattern and apparently it seems to work... I mean, the pattern is matched and the rule is executed...

The only problem I still have is that I can't get rid of "#top_display_media"

It's detailed in a diagram in the Apache manual, and I did not understand the significance of it for at least a year, but the Rule pattern is evaluated first. The RewriteCond lines are evaluated only if the rule pattern was a match.

That's great advice, especially for someone on a not-so-fast shared hosting :-) I will do as you suggest! And I believe I can also simplify my file by making better use of the pattern part of "RewriteRule" instead of making an obsessive use of "RewriteCond"

Thanks a lot everyone!

g1smd

9:55 pm on Dec 1, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



You will need RewriteCond whenever you look at anything other than the straight PATH part of the request.

You need it for looking at http_host, server_port, query_string, the_request and others.
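
As an example of a rule that needs RewriteCond because it looks at the query string, here is one possible sketch of the "keep only pid" redirect for displayimage.php. It is an assumption-laden variant, not the rule posted earlier: it tests QUERY_STRING rather than THE_REQUEST, assumes a root-level .htaccess, and drops every parameter except pid.

```apache
RewriteEngine On

# Fire only when at least one of the unwanted parameters is present...
RewriteCond %{QUERY_STRING} (^|&)(album|cat|uid)=
# ...and capture the numeric pid. %N back-references come from the last
# matched RewriteCond, so the pid value is %2 here (%1 is the "^ or &" group).
RewriteCond %{QUERY_STRING} (^|&)pid=([0-9]+)(&|$)
RewriteRule ^galleries/displayimage\.php$ http://www.example.com/galleries/displayimage.php?pid=%2 [R=301,L]
```

After the redirect the query string is just "pid=N", so the first condition fails and the rule cannot loop.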

lucy24

10:04 pm on Dec 1, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I can't get rid of "#top_display_media"

You won't be able to. Anything after # is a fragment: the browser keeps it to itself and never sends it to the server, so no htaccess rule can see or remove it. If the page no longer has an anchor for this fragment, it is not a problem, because the browser will simply send the user to the top of the page. This is a basic HTML requirement and you can trust all browsers to do it, no matter how capricious they are about other "compliant user-agents" rules ;)

I still don't get when you're supposed to put it at the end of the rule

? by itself at the end of the target means "delete the entire existing query string, if any". If there was no query, it has no effect. You will sometimes see a pattern ending in html? if you want to catch both "htm" and "html" extensions. For ? after parentheses, study Regular Expressions.

I actually thought that I could get away without the [A-Z]{3,9} if I just did not put the ^ before the pattern and apparently it seems to work... I mean, the pattern is matched and the rule is executed...

It depends on exactly what your Rule and Condition are looking for. For example, I don't use php, so any request for a php file will be a robot up to no good:

RewriteCond %{THE_REQUEST} \.php
RewriteRule \.php$ - [F]

It needs the %{THE_REQUEST} condition rather than being a conditionless rule, because mod_autoindex uses php (or, at least, my host's implementation of it does) for auto-indexing.
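
The different ways a "?" in the target affects the query string can be summarised in a short sketch. The file names a.php through h.php are hypothetical placeholders; only the handling of the query string matters.

```apache
RewriteEngine On

# Target ends in a bare "?": the incoming query string is discarded.
RewriteRule ^a\.php$ http://www.example.com/b.php? [R=301,L]

# No "?" in the target: the incoming query string is passed through as is.
RewriteRule ^c\.php$ http://www.example.com/d.php [R=301,L]

# Target has its own query string: it replaces the incoming one.
RewriteRule ^e\.php$ http://www.example.com/f.php?x=1 [R=301,L]

# Same, but QSA appends the incoming query string after "x=1".
RewriteRule ^g\.php$ http://www.example.com/h.php?x=1 [QSA,R=301,L]
```

Adding a second "?" after a target that already has a query string, as in "?album=%1?", makes that extra character part of the new query string, which is a likely source of the stray %3f mentioned earlier.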

zapotex

10:52 pm on Dec 1, 2011 (gmt 0)

10+ Year Member



Thank you very much for the real htaccess crash course you gave me over this forum!

You will need RewriteCond whenever you look at anything other than the straight PATH part of the request.

You need it for looking at http_host, server_port, query_string, the_request and others.

Great synthesis! It would have taken me months of experience before I got there on my own!



I can't get rid of "#top_display_media"

You won't be able to. Anything after # is a fragment: the browser keeps it to itself and never sends it to the server, so no htaccess rule can see or remove it. If the page no longer has an anchor for this fragment, it is not a problem, because the browser will simply send the user to the top of the page. This is a basic HTML requirement and you can trust all browsers to do it, no matter how capricious they are about other "compliant user-agents" rules ;)

Thanks for that too! I'm not worried about browser compliance. My only concern is duplicate content & SEO. But I hope bookmarks are not considered by Google.

? by itself at the end of the target means "delete the entire existing query string, if any". If there was no query, it has no effect. You will sometimes see a pattern ending in html? if you want to catch both "htm" and "html" extensions. For ? after parentheses, study Regular Expressions.

I will! And thanks for clarifying, now my own htaccess file makes a lot more sense to me than it did before.

RewriteCond %{THE_REQUEST} \.php
RewriteRule \.php$ - [F]

Very interesting... I don't get why the second line is not sufficient... Isn't the first line simply a repetition of the second? Actually the second is more specific because it says that the path&filename must END with .php...

Thanks everyone again!

g1smd

11:04 pm on Dec 1, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



The RewriteCond looking at THE_REQUEST checks that an actual browser "out there" on the web asked for something ending in .php as a URL request.

Without it, any request for a file ending in .php, whether from "outside" or as a result of a prior internal rewrite, will be blocked.

To be totally clear, the two-liner blocks only external requests for .php as a URL and the one-liner would block a request for example.com/foo if that request resulted in a rewrite to fetch the file foo.php from the server.

To understand that you need to be absolutely clear that URLs are a reference system used "out there" on the web and server filepaths are something used only "here" inside the server. They are not at all the same thing, merely "associated" by the action of the server software.

The default action is that given a request for example.com/robots.txt the server fetches the robots.txt from the server filesystem.

With a rewrite in place, that request for example.com/robots.txt could be rewritten such that the internal file pointer is told to fetch and process the file /site-security.php?output=robots instead. The user "out there" on the web would simply see content at the URL they requested served with a 200 OK status code.

[edited by: g1smd at 11:17 pm (utc) on Dec 1, 2011]
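
The interaction described above can be sketched by combining the two-liner quoted earlier with the robots.txt example. The script name and its "output" parameter come from that example and are purely illustrative; a root-level .htaccess is assumed.

```apache
RewriteEngine On

# THE_REQUEST holds what the client actually asked for, so this blocks
# only requests where the client itself named something containing ".php".
RewriteCond %{THE_REQUEST} \.php
RewriteRule \.php$ - [F]

# Internal rewrite: the client asks for /robots.txt and receives the
# script's output with a 200 status. THE_REQUEST still reads
# "GET /robots.txt ...", so the block above leaves this request alone.
RewriteRule ^robots\.txt$ /site-security.php?output=robots [L]
```

Without the RewriteCond, the rewritten request for site-security.php would match "\.php$" on the next pass through the rules and be refused, even though the visitor only ever asked for robots.txt.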

zapotex

11:06 pm on Dec 1, 2011 (gmt 0)

10+ Year Member



I get it... The difference between the first and the second line is not the pattern, it is the variable: THE_REQUEST on the first line and the path on the second line and the path kind of behaves like REQUEST_URI...

Thanks again! Thanks to your advice, now studying the documentation will be a breeze! :-)