Forum Moderators: phranque

Redirecting ugly, long URLs using pattern matching and htaccess

         

robintel

2:11 pm on Mar 30, 2009 (gmt 0)

10+ Year Member



Hello!

First of all allow me to thank you for the good job you do by posting answers on this world class forum. Keep up the good work!

Now, the reason I'm writing this is because I have very serious problems with the links on my site. I use Joomla! CMS and the TinyMCE content editor creates relative URLs and many of them look like this:

example.com/category/section/index.php/other_category/other_section/article.html

While the correct links look like this:

example.com/category/section/article.html

The correct links do exist, but due to the relative links generated I now have many duplicates (actually more than double) that I want to get rid of.

I believe that there is some .htaccess trick to redirect such links to 404 (so that search engines remove duplicate links from their DBs) using some sort of pattern matching.

I got rid of the relative URLs by using the find + replace MySQL syntax, so the cause has been removed but I really need assistance with removing the faulty URLs.

Any ideas?

Thank you very much!

solutionc

2:34 pm on Mar 30, 2009 (gmt 0)

10+ Year Member



If you know the page URLs, then a simple redirect in your .htaccess will work.

Simple .htaccess redirect code

Redirect /pagetoredirect.html http://www.example.com/redirect.html

robintel

2:41 pm on Mar 30, 2009 (gmt 0)

10+ Year Member



I'm afraid that we're talking about thousands of URLs. The other problem, besides the number of URLs, is that the content is dynamic, so that if Google indexes the wrong URLs it will still find content (and create another batch of duplicates) but no 404s.

g1smd

3:05 pm on Mar 30, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



If there is a simple link between the pattern of the 'wrong' URL and the pattern of the 'right' URL, then a small number of rules each using RewriteRule might be able to cover all of the possible variations.

robintel

3:16 pm on Mar 30, 2009 (gmt 0)

10+ Year Member



Well, the rule should, in theory, be like this: if the requested URL is like:

example.com/[something]/index.php/[something else]/article.html

then the error code should be 404.

Since I am a n00b I must ask: how do I code this?

robintel

3:33 pm on Mar 30, 2009 (gmt 0)

10+ Year Member



P.S. The correct URL is:

example.com/index.php/category/section/article.html

So, I rely on the position of the index.php part to distinguish between valid and invalid URLs. If it's next to the TLD then it's OK. If not, then 404.

jdMorgan

5:49 pm on Mar 30, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



These examples may be too specific or perhaps not specific enough, but you can either force a 404 or invoke a 301 redirect to both correct duplicate search engine listings *and* recover the traffic from those listings.

# Return 404 for duplicate URLs
RewriteRule ^[^/]+/index\.php/([^/]+/)+article\.html$ /path-to-non-existent-file [L]

-or-

# Externally redirect duplicate URLs to canonical URLs
RewriteRule ^[^/]+/index\.php/(([^/]+/)+[^.]+\.html)$ http://www.example.com/index.php/$1 [R=301,L]

On Apache 2.x, you can use the [R=404] flag to generate a 404 response instead of rewriting to a non-existent filepath, but the code above works on any Apache server 1.3 or later.
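A minimal sketch of that Apache 2.x variant, assuming the rule lives in the document-root .htaccess and mod_rewrite is available. The `-` substitution means "do not rewrite", and the `[^/]+\.html` file-name part is an assumption standing in for the literal `article\.html`. Check the mod_rewrite documentation for your exact server version before relying on it:

```apache
# Sketch only: answer 404 for any URL that has one directory level
# before /index.php/ (adjust the pattern to your real URL layout).
RewriteEngine On
RewriteRule ^[^/]+/index\.php/([^/]+/)+[^/]+\.html$ - [R=404,L]
```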

Jim

robintel

5:59 pm on Mar 30, 2009 (gmt 0)

10+ Year Member



Thank you for your effort. Unfortunately, this code is not specific enough, and I lack the skill to modify it.

I need to redirect pages like:

example.com/[<b>something wrong</b>]/index.php/[something else]/article.html

to a 404 error. The idea is that if there is [<b>something wrong</b>] before the /index.php/ part, other than the website URL, then the 404 should occur. Sorry for the bold, I need to emphasize that if there is some string before the /index.php/ part of the SEF URL then the 404 should occur.

This is way beyond my .htaccess skill...

[edited by: robintel at 6:23 pm (utc) on Mar. 30, 2009]

robintel

6:02 pm on Mar 30, 2009 (gmt 0)

10+ Year Member



@jdMorgan

I think the code will work if I can properly redirect to 404. Testing...

[edited by: robintel at 6:24 pm (utc) on Mar. 30, 2009]

robintel

6:14 pm on Mar 30, 2009 (gmt 0)

10+ Year Member



Nope, unfortunately it's not working. I hope it's all right to post a more specific URL. If not, I humbly apologize.

Specific example of problem URL:

example.com/[**this_is_wrong_and_should_be_gone**opinii/metastaza-comunismului]/index.php/blog/tech/avem-sitelinks.html

This should redirect to 404, because it's a duplicate of:

example.com/index.php/blog/tech/avem-sitelinks.html

These links are driving me crazy!

I also must apologize for the typos above. English is not my mother tongue.

g1smd

6:34 pm on Mar 30, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



If it is a duplicate, don't send a 404, send the user to the right URL via a 301 redirect.

That helps the user see the right page, and preserves search engine listings for only the correct URLs.

robintel

6:37 pm on Mar 30, 2009 (gmt 0)

10+ Year Member



OK, but how do I redirect? I mean there are multiple combinations of the same URL, so hard coding the URLs is unreasonable.

jdMorgan

6:43 pm on Mar 30, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



You need to better-define the URLs. We (and mod_rewrite) need to know *exactly* which parts of the URL are fixed, which can vary, and the amount of variation -- for example, how many directory levels might be present in a variable field. We (and mod_rewrite) also need to know which parts are to be discarded, and which are to be re-used in the new URL.

If this is difficult, then provide multiple examples.

Example:
example.com/<one or more directory levels here: remove all>/index.php/<zero or more directory levels here: keep all>/<any-page-name-here: keep>.html

That example may or may not be what you want, but the description needs to be exact and complete. When the code is written to match that description, it will do exactly what is described, and not necessarily what you wanted, unless the description is perfect and comprehensive.
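As a hedged illustration of how such a description maps to code, here is a sketch for the example wording above only (one or more levels removed, zero or more levels kept, any .html page name kept). It assumes the rule sits in the document-root .htaccess, and the host name is the placeholder used elsewhere in this thread:

```apache
RewriteEngine On
# ([^/]+/)+               one or more directory levels before index.php: discarded
# (([^/]+/)*[^/]+\.html)  zero or more levels plus the page name: kept as $2
RewriteRule ^([^/]+/)+index\.php/(([^/]+/)*[^/]+\.html)$ http://www.example.com/index.php/$2 [R=301,L]
```

If the real requirement differs (for example, exactly three levels to remove), the quantifiers must change to match it.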

It is a mistake (and a waste of time) to proceed to coding if the requirements have not been fully and correctly defined.

Also, note that by including the alternative code in my previous post above, I was implicitly recommending doing a 301 redirect instead of a 404. By using a 301, you can recover the traffic and the linking power of the incorrect links in addition to correcting the search engine listings, instead of throwing away that traffic and link-power. Part of this is technical (mod_rewrite code), and part of it is an SEO problem. The SEO problem should be addressed first, and a 301 is likely a better solution.

Jim

robintel

6:58 pm on Mar 30, 2009 (gmt 0)

10+ Year Member



Thank you Jim!

To answer your very good set of questions, the URLs are like this:

example.com/<three directories:remove all>/index.php/<two other directories:keep all>/<title_of_the_article.html:keep>

So, there are two fixed parts, and three others that vary. If we could drop the first <three directories> and one of the slashes the URL would be correct.

I understood your proposal regarding the 301 code and I am considering using it, if we somehow manage to point to the correct URL.

Once again, I appreciate the precious time you have taken to help me.

g1smd

7:12 pm on Mar 30, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



This is quite easy then.

Pattern:

^[^/]+/[^/]+/[^/]+/index\.php/([^/]+/[^/]+/[^.]+\.html)$

Target:

http://www.example.com/index.php/$1

Use a RewriteRule with [R=301,L].
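Put together, and assuming the rule goes in the document-root .htaccess with the rewrite engine switched on, the pattern and target could be combined roughly as follows. Note the escaped dot in `index\.php` and the `+` after `[^.]`, which lets the page name be longer than one character:

```apache
RewriteEngine On
# Three directory levels are dropped; two levels plus the page name are kept in $1.
RewriteRule ^[^/]+/[^/]+/[^/]+/index\.php/([^/]+/[^/]+/[^.]+\.html)$ http://www.example.com/index.php/$1 [R=301,L]
```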

jdMorgan

7:29 pm on Mar 30, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Or use a quantifier:

RewriteRule ^([^/]+/){3}index\.php/([^/]+/[^/]+/[^.]+\.html)$ http://www.example.com/index.php/$2 [R=301,L]

Jim

robintel

7:53 pm on Mar 30, 2009 (gmt 0)

10+ Year Member



I've been playing around with these proposals and Apache did not return a 500, but it did not redirect either. Is there something wrong with my .htaccess file?

<snip>

[edited by: jdMorgan at 3:04 am (utc) on Mar. 31, 2009]
[edit reason] Copyrighted code deleted. [/edit]

g1smd

7:57 pm on Mar 30, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



When you have redirects and rewrites in the same .htaccess file you need to list all of the redirects (all those with [R=301,L] within) before you list any of the rewrites (all those with just [L] within).

Within each of those two groups, you need to list most specific first and most general last.
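A sketch of that ordering, using the redirect from this thread followed by two internal rewrites. The `special-page` and `handler.php` names and the final catch-all are hypothetical and only there to show where such rules would sit:

```apache
RewriteEngine On

# 1. External redirects first, most specific pattern first.
RewriteRule ^([^/]+/){3}index\.php/([^/]+/[^/]+/[^.]+\.html)$ http://www.example.com/index.php/$2 [R=301,L]

# 2. Internal rewrites last, most specific first.
# Hypothetical: one specific page handled by a hypothetical script.
RewriteRule ^special-page\.html$ /handler.php?page=special [L]

# Hypothetical: general catch-all for requests that are not real files or directories.
RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
RewriteRule . /index.php [L]
```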

robintel

8:10 pm on Mar 30, 2009 (gmt 0)

10+ Year Member



I'm afraid I just don't know how to order them and not receive a 500. You can see from the comments in the file that I like to keep everything tidy, but I am not that proficient. I know it's not an excuse, but I am still learning.

robintel

9:57 pm on Mar 30, 2009 (gmt 0)

10+ Year Member



Sadly nothing seems to work, and I tried everything I could think of.

g1smd

10:10 pm on Mar 30, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Rearrange your code using this pattern:

When you have redirects and rewrites in the same .htaccess file you need to list all of the redirects (all those with [R=301,L] within) before you list any of the rewrites (all those with just [L] or [F] within).

Within each of those two groups, you need to list most specific first and most general last. Look at the pattern to see whether it matches one file (i.e. is most specific), or lots of files (i.e. is least specific).

If I said "rearrange this so that all the numbers were first and the letters were last, and they go smallest to biggest within each of the two groups ... D 5 F 3 E A 6 C 1 4 B 2", then you could do it.

This is no more difficult than that. Post your tidied up code below...

[edited by: g1smd at 10:43 pm (utc) on Mar. 30, 2009]

robintel

10:32 pm on Mar 30, 2009 (gmt 0)

10+ Year Member



I don't have any redirect ending just in [L] (last). I rearranged the contents of the file to the best of my htaccess knowledge and it looks like this:

<snip>

I think we're getting somewhere since normal, valid URLs work while the invalid ones generate a 500.

Thank you for your patience.

[edited by: robintel at 11:08 pm (utc) on Mar. 30, 2009]

[edited by: jdMorgan at 3:05 am (utc) on Mar. 31, 2009]
[edit reason] Copyrighted code deleted. [/edit]

robintel

10:37 pm on Mar 30, 2009 (gmt 0)

10+ Year Member



Erratum: I get a 404.

g1smd

10:46 pm on Mar 30, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



*** I don't have any redirect ending just in [L] (last). ***

That's right. The items ending in just [L] are rewrites (not redirects), as are those that end in just [F]. They go last.

Any rewrites without an [L] need to have one added on the end.

robintel

11:07 pm on Mar 30, 2009 (gmt 0)

10+ Year Member



OK. This is the latest version I could come up with:

<snip>

########## Begin - Rewrite rules to block out some common exploits
## If you experience problems on your site block out the operations listed below
## This attempts to block the most common type of exploit `attempts` to Joomla!
#
#
#added by robintel
#IF the URI contains a "http:" or "ftp:" or "https"
RewriteCond %{QUERY_STRING} http\: [OR]
RewriteCond %{QUERY_STRING} ftp\: [OR]
RewriteCond %{QUERY_STRING} https\: [OR]
#OR if the URI contains a "["
RewriteCond %{QUERY_STRING} \[ [OR]
#OR if the URI contains a "]"
RewriteCond %{QUERY_STRING} \] [OR]
RewriteCond %{QUERY_STRING} scanhttp\: [OR]
RewriteCond %{QUERY_STRING} link [OR]
RewriteCond %{QUERY_STRING} @rfi [OR]
RewriteCond %{QUERY_STRING} rfi [OR]
RewriteCond %{QUERY_STRING} q=cache [OR]
RewriteCond %{QUERY_STRING} path_escape=http\: [OR]
RewriteCond %{QUERY_STRING} page=http\: [OR]
RewriteCond %{QUERY_STRING} error=http\: [OR]
RewriteCond %{QUERY_STRING} page [OR]
RewriteCond %{QUERY_STRING} evil_root [OR]
RewriteCond %{QUERY_STRING} %3A%2F%2F [OR]
RewriteCond %{QUERY_STRING} main_path [OR]
RewriteCond %{QUERY_STRING} CONFIG [OR]
RewriteCond %{QUERY_STRING} GLOBALS [OR]
#end added
#Begin anti SQL injection protection 08.02.2008
RewriteCond %{QUERY_STRING} (\;|\'|\"|\%22).*(union|select|insert|drop|update|md5|benchmark|or|and|if).* [NC,OR]
# Block out any script trying to set a mosConfig value through the URL
RewriteCond %{QUERY_STRING} mosConfig_[a-zA-Z_]{1,21}(=|\%3D) [OR]
# Block out any script trying to base64_encode crap to send via URL
RewriteCond %{QUERY_STRING} base64_encode.*\(.*\) [OR]
RewriteCond %{QUERY_STRING} ("|%22).*(>|%3E|<|%3C).* [NC,OR]
RewriteCond %{QUERY_STRING} (\<|%3C).*iframe.*(\>|%3E) [NC,OR]
# Block out any script that includes a <script> tag in URL
RewriteCond %{QUERY_STRING} (\<|%3C).*script.*(\>|%3E) [NC,OR]
# Block out any script trying to set a PHP GLOBALS variable via URL
RewriteCond %{QUERY_STRING} GLOBALS(=|\[|\%[0-9A-Z]{0,2}) [OR]
RewriteCond %{QUERY_STRING} error=[a-zA-Z_]{1,21}(=|\%3D) [OR]
# Block out any script trying to modify a _REQUEST variable via URL
RewriteCond %{QUERY_STRING} _REQUEST(=|\[|\%[0-9A-Z]{0,2})
# Send all blocked request to homepage with 403 Forbidden error!
RewriteRule ^(.*)$ index.php [F]
#
########## End - Rewrite rules to block out some common exploits

RewriteCond %{QUERY_STRING} DOCUMENT_ROOT [OR]
RewriteCond %{QUERY_STRING} .*=http.+ [NC,OR]
RewriteCond %{REQUEST_URI} %3C/scripts/.+\.php%3E [OR]
RewriteCond %{HTTP_REFERER} ^<script>window\.open.+$ [NC]
RewriteRule .* - [F]

I still get the 404.

[edited by: jdMorgan at 3:07 am (utc) on Mar. 31, 2009]
[edit reason] Copyrighted code deleted. [/edit]

jdMorgan

3:01 am on Mar 31, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Please do not post copyrighted code on WebmasterWorld!

I have deleted all but part of the final code dump, as it's too much work to clean up all of the others.

I apologize for any confusion or difficulty this may cause, but we cannot have clearly copyrighted code posted here. I'm sorry, but it is a violation of international copyright law, and cannot be allowed.

For the sake of continuing the discussion, please remove all copyrighted code lines before posting.

Thanks,
Jim

[edited by: jdMorgan at 3:10 am (utc) on Mar. 31, 2009]

robintel

6:18 am on Mar 31, 2009 (gmt 0)

10+ Year Member



Thank you Jim for bringing this to my attention. I apologize, of course. I am content, though, because I got the 404s so that search engines can now remove the duplicate links.

Thank you Jim and g1smd.