Forum Moderators: phranque
First of all allow me to thank you for the good job you do by posting answers on this world class forum. Keep up the good work!
Now, the reason I'm writing this is because I have very serious problems with the links on my site. I use Joomla! CMS and the TinyMCE content editor creates relative URLs and many of them look like this:
example.com/category/section/index.php/other_category/other_section/article.html
While the correct links look like this:
example.com/category/section/article.html
The correct links do exist, but due to the relative links generated I now have many duplicates (actually more than double) that I want to get rid of.
I believe that there is some .htaccess trick to redirect such links to 404 (so that search engines remove duplicate links from their DBs) using some sort of pattern matching.
I got rid of the relative URLs by using the find + replace MySQL syntax, so the cause has been removed but I really need assistance with removing the faulty URLs.
Any ideas?
Thank you very much!
# Return 404 for duplicate URLs
RewriteRule ^[^/]+/index\.php/([^/]+/)+article\.html$ /path-to-non-existent-file [L]
# Externally redirect duplicate URLs to canonical URLs
RewriteRule ^[^/]+/index\.php/(([^/]+/)+[^.]+\.html)$ http://www.example.com/index.php/$1 [R=301,L]
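# Possible variation on the 404 rule, offered only as a sketch: if the aim is
# just to have the duplicates dropped from search engine indexes, the [G] flag
# returns 410 Gone directly, so no non-existent target path is needed.
# Assumptions: the same URL shape as the 404 rule above, with "article\.html"
# standing in for the real page name, and a 410 being acceptable in place of a 404.

```apache
# "-" means "no substitution"; [G] sends 410 Gone and stops rewriting.
RewriteRule ^[^/]+/index\.php/([^/]+/)+article\.html$ - [G]
```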
Jim
I need to redirect pages like:
example.com/[**something wrong**]/index.php/[something else]/article.html
to a 404 error. The idea is that if there is [**something wrong**] before the /index.php/ part, other than the website URL, then the 404 should occur. Sorry for the bold, I need to emphasize that if there is some string before the /index.php/ part of the SEF URL then the 404 should occur.
This is way beyond my .htaccess skill...
[edited by: robintel at 6:23 pm (utc) on Mar. 30, 2009]
Specific example of problem URL:
example.com/[**this_is_wrong_and_should_be_gone**opinii/metastaza-comunismului]/index.php/blog/tech/avem-sitelinks.html
This should redirect to 404, because it's a duplicate of:
example.com/index.php/blog/tech/avem-sitelinks.html
These links are driving me crazy!
I also must apologize for the typos above. English is not my mother tongue.
If this is difficult, then provide multiple examples.
Example:
example.com/<one or more directory levels here: remove all>/index.php/<zero or more directory levels here: keep all>/<any-page-name-here: keep>.html
That example may or may not be what you want, but the description needs to be exact and complete. When the code is written to match that description, it will do exactly what is described -- and not necessarily what you wanted, unless the description is perfect and comprehensive.
It is a mistake (and a waste of time) to proceed to coding if the requirements have not been fully and correctly defined.
Also, note that by including the alternative code in my previous post above, I was implicitly recommending doing a 301 redirect instead of a 404. By using a 301, you can recover the traffic and the linking power of the incorrect links in addition to correcting the search engine listings, instead of throwing away that traffic and link-power. Part of this is technical (mod_rewrite code), and part of it is an SEO problem. The SEO problem should be addressed first, and a 301 is likely a better solution.
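To illustrate how a description maps onto a pattern, here is a minimal sketch that follows the example description above literally: one or more directory levels before index.php are dropped, and everything after index.php is kept. It assumes the rule lives in the .htaccess file in the document root, that RewriteEngine is already on, and that www.example.com is the canonical hostname. It has not been tested against your site.

```apache
# ([^/]+/)+      one or more directory levels before index.php: removed
# ([^/]+/)*      zero or more directory levels after index.php: kept
# [^/]+\.html    any page name ending in .html: kept
# $2 is the second opening parenthesis, i.e. everything after "index.php/".
RewriteRule ^([^/]+/)+index\.php/(([^/]+/)*[^/]+\.html)$ http://www.example.com/index.php/$2 [R=301,L]
```

Under those assumptions, a request for example.com/a/b/index.php/blog/tech/page.html would be redirected to www.example.com/index.php/blog/tech/page.html, while example.com/index.php/blog/tech/page.html is left alone because nothing precedes index.php.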
Jim
To answer your very good set of questions, the URLs are like this:
example.com/<three directories:remove all>/index.php/<two other directories:keep all>/<title_of_the_article.html:keep>
So, there are two fixed parts, and three others that vary. If we could drop the first <three directories> and one of the slashes the URL would be correct.
I understood your proposal regarding the 301 code and I am considering redirecting to it, if we somehow manage to point to the correct URL.
Once again, I appreciate the precious time you have taken to help me.
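Taken literally, that description (exactly three directories dropped, exactly two kept) could be written with fixed repeat counts in place of "one or more". This is only a sketch, assuming the rule sits in the root .htaccess file and that www.example.com is the canonical host. The counts have to match every real bad URL, because a URL with a different number of directories simply will not match and will not be redirected.

```apache
# ([^/]+/){3}   exactly three directory levels before index.php: removed
# ([^/]+/){2}   exactly two directory levels after index.php: kept
# $2 holds the two kept directories plus the article file name.
RewriteRule ^([^/]+/){3}index\.php/(([^/]+/){2}[^/]+\.html)$ http://www.example.com/index.php/$2 [R=301,L]
```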
When you have redirects and rewrites in the same .htaccess file you need to list all of the redirects (all those with [R=301,L] within) before you list any of the rewrites (all those with just [L] or [F] within).
Within each of those two groups, you need to list most specific first and most general last. Look at the pattern to see whether it matches one file (i.e. is most specific), or lots of files (i.e. is least specific).
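As a rough illustration of that ordering, here is a skeleton layout. All of the file names and paths in it (old-page.html, archive/, private.html, show.php) are hypothetical placeholders, not rules from this thread; only the grouping and the order within each group are the point.

```apache
RewriteEngine On

# Group 1: external redirects, all with [R=301,L]
# Most specific first: this pattern matches exactly one URL.
RewriteRule ^old-page\.html$ http://www.example.com/new-page.html [R=301,L]
# More general: this pattern matches every .html page under archive/.
RewriteRule ^archive/(.+\.html)$ http://www.example.com/articles/$1 [R=301,L]

# Group 2: internal rewrites and blocks, with just [L] or [F]
# Most specific first: one file.
RewriteRule ^private\.html$ - [F]
# Most general last: many URLs handed to one script.
RewriteRule ^articles/(.+)\.html$ /show.php?name=$1 [L]
```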
If I said "rearrange this so that all the numbers are first and the letters are last, and they go smallest to biggest within each of the two groups ... D 5 F 3 E A 6 C 1 4 B 2", then you could do it.
This is no more difficult than that. Post your tidied up code below...
[edited by: g1smd at 10:43 pm (utc) on Mar. 30, 2009]
<snip>
I think we're getting somewhere since normal, valid URLs work while the invalid ones generate a 500.
Thank you for your patience.
[edited by: robintel at 11:08 pm (utc) on Mar. 30, 2009]
[edited by: jdMorgan at 3:05 am (utc) on Mar. 31, 2009]
[edit reason] Copyrighted code deleted. [/edit]
<snip>
########## Begin - Rewrite rules to block out some common exploits
## If you experience problems on your site block out the operations listed below
## This attempts to block the most common type of exploit `attempts` to Joomla!
#
#
#added by robintel
#IF the query string contains "http:" or "ftp:" or "https:"
RewriteCond %{QUERY_STRING} http\: [OR]
RewriteCond %{QUERY_STRING} ftp\: [OR]
RewriteCond %{QUERY_STRING} https\: [OR]
#OR if the query string contains a "["
RewriteCond %{QUERY_STRING} \[ [OR]
#OR if the query string contains a "]"
RewriteCond %{QUERY_STRING} \] [OR]
RewriteCond %{QUERY_STRING} scanhttp\: [OR]
RewriteCond %{QUERY_STRING} link [OR]
RewriteCond %{QUERY_STRING} @rfi [OR]
RewriteCond %{QUERY_STRING} rfi [OR]
RewriteCond %{QUERY_STRING} q=cache [OR]
RewriteCond %{QUERY_STRING} path_escape=http\: [OR]
RewriteCond %{QUERY_STRING} page=http\: [OR]
RewriteCond %{QUERY_STRING} error=http\: [OR]
RewriteCond %{QUERY_STRING} page [OR]
RewriteCond %{QUERY_STRING} evil_root [OR]
RewriteCond %{QUERY_STRING} %3A%2F%2F [OR]
RewriteCond %{QUERY_STRING} main_path [OR]
RewriteCond %{QUERY_STRING} CONFIG [OR]
RewriteCond %{QUERY_STRING} GLOBALS [OR]
#end added
#Begin anti SQL injection protection 08.02.2008
RewriteCond %{QUERY_STRING} (\;|\'|\"|\%22).*(union|select|insert|drop|update|md5|benchmark|or|and|if).* [NC,OR]
# Block out any script trying to set a mosConfig value through the URL
RewriteCond %{QUERY_STRING} mosConfig_[a-zA-Z_]{1,21}(=|\%3D) [OR]
# Block out any script trying to base64_encode crap to send via URL
RewriteCond %{QUERY_STRING} base64_encode.*\(.*\) [OR]
RewriteCond %{QUERY_STRING} ("|%22).*(>|%3E|<|%3C).* [NC,OR]
RewriteCond %{QUERY_STRING} (\<|%3C).*iframe.*(\>|%3E) [NC,OR]
# Block out any script that includes a <script> tag in URL
RewriteCond %{QUERY_STRING} (\<|%3C).*script.*(\>|%3E) [NC,OR]
# Block out any script trying to set a PHP GLOBALS variable via URL
RewriteCond %{QUERY_STRING} GLOBALS(=|\[|\%[0-9A-Z]{0,2}) [OR]
RewriteCond %{QUERY_STRING} error=[a-zA-Z_]{1,21}(=|\%3D) [OR]
# Block out any script trying to modify a _REQUEST variable via URL
RewriteCond %{QUERY_STRING} _REQUEST(=|\[|\%[0-9A-Z]{0,2})
# Send all blocked request to homepage with 403 Forbidden error!
RewriteRule ^(.*)$ index.php [F]
#
########## End - Rewrite rules to block out some common exploits
RewriteCond %{QUERY_STRING} DOCUMENT_ROOT [OR]
RewriteCond %{QUERY_STRING} .*=http.+ [NC,OR]
RewriteCond %{REQUEST_URI} %3C/scripts/.+\.php%3E [OR]
RewriteCond %{HTTP_REFERER} ^<script>window\.open.+$ [NC]
RewriteRule .* - [F]
I still get the 404.
[edited by: jdMorgan at 3:07 am (utc) on Mar. 31, 2009]
[edit reason] Copyrighted code deleted. [/edit]
I have deleted all but part of the final code dump, as it's too much work to clean up all of the others.
I apologize for any confusion or difficulty this may cause, but we cannot have clearly copyrighted code posted here. I'm sorry, but it is a violation of international copyright law, and cannot be allowed.
For the sake of continuing the discussion, please remove all copyrighted code lines before posting.
Thanks,
Jim
[edited by: jdMorgan at 3:10 am (utc) on Mar. 31, 2009]