By default, the software program gives you really ugly URLs:
/index.php?title=This_is_the_file_name
But with a little editing to an internal file, and the .htaccess file, you can turn that into:
/This_is_the_file_name
The .htaccess code looks like this:
RewriteRule ^[^:]*\. - [L]
RewriteRule ^[^:]*\/ - [L]
RewriteRule ^(.+)$ /index.php?title=$1 [L,QSA]
I've never seen a regex that looks like that, but the code worked great.
My problem is, I'm noticing some page titles result in the wrong file being pulled or an error.
Pages with a period (.) in the title result in an error.
/...Baby_One_More_Time
Pages with a plus sign (+) redirect to a different title:
/1+1=2
takes you to
/1 2=3
There might be more issues, but I've only come across those two errors.
If somebody could help me modify the regex, it would be much appreciated.
In the first case, the code is specifically "skipping" the rewrite if a period or slash is found in the requested (pretty) URL and no colon precedes those characters (whether immediately-preceding or not). The intent is apparently to avoid rewriting URLs that have filetypes appended, such as "robots.txt" or "index.php" itself.
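To see why a title with a leading period is skipped, here is a minimal sketch that replays the three original rules in Python. Python's `re` module stands in for Apache's regex engine, and `is_rewritten` is a hypothetical helper name, not part of any wiki or Apache API. It assumes .htaccess context, where the leading slash is stripped before matching.

```python
import re

# Patterns copied from the first two RewriteRule lines of the original .htaccess.
SKIP_PERIOD = re.compile(r"^[^:]*\.")  # a period with no colon anywhere before it
SKIP_SLASH = re.compile(r"^[^:]*/")    # a slash with no colon anywhere before it
CATCH_ALL = re.compile(r"^(.+)$")      # third rule: everything else goes to index.php


def is_rewritten(url_path: str) -> bool:
    """Return True if the request would reach the final rewrite to index.php."""
    # Assumption: per-directory (.htaccess) matching, so drop the leading slash.
    path = url_path.lstrip("/")
    if SKIP_PERIOD.search(path) or SKIP_SLASH.search(path):
        return False  # the "-" substitution with [L] leaves the URL untouched
    return CATCH_ALL.search(path) is not None


for p in ["/This_is_the_file_name", "/robots.txt",
          "/...Baby_One_More_Time", "/Talk:Version_1.5"]:
    print(p, is_rewritten(p))
```

The title that starts with periods is caught by the same skip rule that protects "robots.txt", while a period that follows a colon slips past it.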
In the second case, it is likely the script itself changing plus signs to spaces. A plus sign is the conventional encoding for a space in a query string, as you can see in the URLs of most search engines.
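The plus-to-space conversion can be reproduced outside the wiki. The sketch below uses Python's `urllib.parse.unquote_plus`, which applies form-style query decoding. It assumes the script decodes its `title` parameter the same way and is not taken from the wiki's code.

```python
from urllib.parse import unquote_plus

# The value that ends up after "title=" once the pretty URL has been rewritten.
raw_title = "1+1=2"

# Form-style decoding treats "+" as an encoded space and leaves "=" alone.
decoded_title = unquote_plus(raw_title)

print(repr(decoded_title))
```

The decoded value no longer contains the plus sign, so a lookup by title lands on a different page name than the one requested.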
So the first question to ask is, "Can you do without periods and plus signs in titles?" Given that these are titles, such as those used in newspapers, periods are never used, and I doubt that plus signs are ever used, either.
If the answer is "No," then you'll probably have to go with a rewriting solution that is a lot less efficient. By that I mean "very inefficient," to the point that it may not be suitable for a high-traffic site. You will also have to modify the Wiki script to fix the plus sign problem, as that can't be fixed in mod_rewrite.
Modification for "period" problem:
# Rewrite URL-path requests with no periods or slashes (unless preceded by a colon) to script
RewriteRule ^([^./]+|[^:]*:[^./]*[./].*)$ /index.php?title=$1 [QSA,L]
#
# Else rewrite only if requested URL-path does not end with any of the listed
# filetypes and does not resolve to a physically-existing directory or file
RewriteCond $1 !\.(gif|jpe?g|png|php[345]?|s?html?|xml|js|css|txt|rdf|pdf|flv|swf|wmv|mpe?g[234]?|avi|mp3)$
RewriteCond %{REQUEST_FILENAME} !-d
RewriteCond %{REQUEST_FILENAME} !-f
RewriteRule ^(.+)$ /index.php?title=$1 [QSA,L]
Also, order the filetype-exclusion list based on your 'hit-count' stats, with the most frequently-requested filetypes at the beginning of the list.
Again, it's a balance between executing more mod_rewrite code, titling freedom, and disk-check time.
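As a rough check of which requests take which path through the modified rules, the two patterns can be exercised in Python. This is only a sketch: the patterns are written with solid pipe characters, `classify` is a hypothetical helper, and the `-d` / `-f` disk checks are not simulated, only named in the result.

```python
import re

# Pattern of the first RewriteRule (solid pipes).
DIRECT = re.compile(r"^([^./]+|[^:]*:[^./]*[./].*)$")

# Filetype exclusion from the RewriteCond that guards the second RewriteRule.
FILETYPE = re.compile(
    r"\.(gif|jpe?g|png|php[345]?|s?html?|xml|js|css|txt|rdf|pdf|flv|swf|wmv"
    r"|mpe?g[234]?|avi|mp3)$"
)


def classify(path: str) -> str:
    """Say which branch a URL-path (leading slash already removed) would take."""
    if DIRECT.search(path):
        return "rewrite"                  # first rule, no disk checks needed
    if FILETYPE.search(path):
        return "skip"                     # listed filetype, left alone
    return "rewrite-if-not-on-disk"       # second rule, subject to -d / -f checks


for p in ["This_is_the_file_name", "URL:Google.com", "Google.com",
          "...Baby_One_More_Time", "robots.txt", "index.php"]:
    print(p, "->", classify(p))
```

Only titles containing a period or slash without the colon form reach the slower branch, which is where the disk checks come into play.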
And again, the problem with plus signs is not resolvable by mod_rewrite. I'd suggest you either avoid the use of that character, or use preg_replace in a "patch" in the script to substitute either a different character or the word "plus" for all occurrences of that character in your title links. Look for the "link printing" or "get title from database" code, and add the preg_replace in or near those lines.
I should also say that I don't know why the original code looked for colons. It may have something to do with the URLs produced by the Wiki script; perhaps you can include periods and slashes in Wiki titles if you precede them with a colon (check the documentation, as you may not even need to modify this code). But I attempted to copy that functional behaviour into the new code above, and will leave that issue for you to investigate.
Jim
I can deal with the articles with a plus very easily. With 80,000 pages, I've only found one with a plus sign in the title.
The period, however, is a bigger issue, because I really want to have titles with periods in them. Say, for example, an article on Google.com, which would be different from Google.org. I wouldn't just want to call it Google, and Google_com and Google_org look sloppy.
I was thinking maybe a decent solution would be to have a 404 error page that first looks to see if there is a page under the ugly URL with the same name before finally resolving into a 404 error.
Could this be an acceptable solution for a site aspiring to heavy traffic, with only an occasional title containing a period?
I think I'll just do the sensible thing and prevent titles from being created with periods in them, unless they also contain a colon. For URLs, I'm going to block them unless they're in the format URL:domainname.com
I'm trying to fix up a regex statement to do this:
[^URL:]*\. seems close, but not quite there.
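One reason the attempt falls short is that square brackets define a character class, so `[^URL:]` means "any single character other than U, R, L or a colon", not "anything other than the string URL:". The sketch below shows this in Python. The lookahead variant is an assumption on my part: it needs a PCRE-style engine (Apache 2.x), would not work on an older POSIX-regex build, and is only one reading of the stated goal.

```python
import re

# The attempt as posted: unanchored, with a character class.
posted = re.compile(r"[^URL:]*\.")

# Anchoring it does not help: any path starting with U, R or L stops matching.
anchored_class = re.compile(r"^[^URL:]*\.")

# Assumed alternative: a negative lookahead for the literal prefix "URL:".
lookahead = re.compile(r"^(?!URL:).*\.")

for path in ["URL:Google.com", "Google.com", "UNIX.txt", "This_is_the_file_name"]:
    print(
        path,
        bool(posted.search(path)),
        bool(anchored_class.search(path)),
        bool(lookahead.search(path)),
    )
```

With the lookahead, a path containing a period matches (and would be skipped) unless it begins with the literal prefix "URL:".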