.htaccess rewrite, EE, and Geofinder problems. Creative solution?

Having trouble removing index.php and playing nice with Geofinder queries

         

Benek

1:17 am on Aug 25, 2010 (gmt 0)

10+ Year Member



Hello,

I'm building a site in ExpressionEngine 1.6.9. I'm using an add-on called Geofinder to power my location-specific search function. I'm also using an add-on called LG .htaccess Generator to do rewrites to remove index.php from the URLs as well as a few other things.

The .htaccess file it generates works perfectly fine, but it seems to conflict with what's required to get Geofinder to work properly. When I type in a search for a location, the resulting URL looks something like this:
http://dogwalker.co.nz/dogwalkers/search/Auckland/1000/
("Auckland" is the search word), but when I type in an address that is more complex, with a comma for example ("Greenlane, Auckland"), the resulting URL of
http://dogwalker.co.nz/dogwalkers/search/Greenlane%2C+Auckland/1000/
gives me this error on a blank page:

"Disallowed Key Characters"

I've contacted the author of the Geofinder add-on and he says that error is the result of a .htaccess rewrite rule that's not quite right. As background, here's the entire .htaccess script I have as generated by the default code from LG .htaccess Generator:


# -- LG .htaccess Generator Start --

# .htaccess generated by LG .htaccess Generator v1.0.0
# http://leevigraham.com/cms-customisation/expressionengine/addon/lg-htaccess-generator/

# secure .htaccess file
<Files .htaccess>
order allow,deny
deny from all
</Files>

# Dont list files in index pages
IndexIgnore *

# EE 404 page for missing pages
ErrorDocument 404 /index.php?/

# Simple 404 for missing files
<FilesMatch "(\.jpe?g|gif|png|bmp)$">
ErrorDocument 404 "File Not Found"
</FilesMatch>

RewriteEngine On

RewriteBase /

# remove the www
RewriteCond %{HTTP_HOST} ^(www\.$) [NC]
RewriteRule ^ http://%{HTTP_HOST}%{REQUEST_URI} [L,R=301]

# Add a trailing slash to paths without an extension
RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_URI} !(\.[a-zA-Z0-9]{1,5}|/)$
RewriteRule ^(.*)$ $1/ [L,R=301]

# Remove index.php
# Uses the "include method"
# http://expressionengine.com/wiki/Remove_index.php_From_URLs/#Include_List_Method
RewriteCond %{REQUEST_URI} !(\.[a-zA-Z0-9]{1,5})$
RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_URI} ^/(includes|pages|dogwalkers|parks||members|P[0-9]{2,8}) [NC]
RewriteRule ^(.*)$ /index.php?/$1 [L]

# Remove IE image toolbar
<FilesMatch "\.(html|htm|php)$">
Header set imagetoolbar "no"
</FilesMatch>

# -- LG .htaccess Generator End --
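
One detail in the generated include list is worth a quick check: `parks||members` contains an empty alternative, and an empty alternative matches the empty string. The sketch below uses Python's `re` module as a stand-in for Apache's regex engine (an assumption, though the two should agree on this point). `matches_include_list` is a hypothetical helper, not part of EE or the generator.

```python
import re

# Pattern copied from the generated RewriteCond, including the "||".
GENERATED = re.compile(
    r"^/(includes|pages|dogwalkers|parks||members|P[0-9]{2,8})", re.IGNORECASE)
# The same list with the empty alternative removed.
TIGHTENED = re.compile(
    r"^/(includes|pages|dogwalkers|parks|members|P[0-9]{2,8})", re.IGNORECASE)

def matches_include_list(pattern, uri):
    """Return True if the URI would satisfy the include-list condition."""
    return pattern.search(uri) is not None

for uri in ("/dogwalkers/search/Auckland/1000/", "/not-a-template-group/"):
    print(uri, matches_include_list(GENERATED, uri), matches_include_list(TIGHTENED, uri))
```

If this holds on the server too, the generated condition accepts every extensionless path rather than only the listed template groups.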


According to the author of Geofinder the line...
RewriteRule ^(.*)$ /index.php?/$1 [L]

Needs to be replaced with...
RewriteRule ^(.*)$ /index.php/$1 [L]

(no question mark)

His reasoning: "The problem with the first one is it turns everything after the ? to a huge GET parameter. It would cause issues with EE search as well."
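
A rough sketch of what "a huge GET parameter" means in practice: after the rewrite, the whole path sits behind the `?` with no `=`, so a script sees it as a single parameter name. The whitelist below is an assumption modeled on the kind of check that would print "Disallowed Key Characters"; it is not copied from EE's source. The rewrite step is also simplified, since Apache decodes the path before matching. `rewrite`, `get_keys`, and `disallowed_keys` are hypothetical names.

```python
import re
from urllib.parse import parse_qsl

# Assumed whitelist for parameter names (not taken from EE's code).
ALLOWED_KEY = re.compile(r"^[a-z0-9:_/-]+$", re.IGNORECASE)

def rewrite(url_path):
    """Mimic: RewriteRule ^(.*)$ /index.php?/$1 [L]"""
    return "/index.php?/" + url_path.lstrip("/")

def get_keys(rewritten):
    """Return the GET parameter names a script would see after decoding."""
    query = rewritten.split("?", 1)[1]
    return [key for key, _ in parse_qsl(query, keep_blank_values=True)]

def disallowed_keys(url_path):
    """Return the parameter names that fail the assumed whitelist."""
    return [k for k in get_keys(rewrite(url_path)) if not ALLOWED_KEY.match(k)]

print(disallowed_keys("/dogwalkers/search/Auckland/1000/"))
print(disallowed_keys("/dogwalkers/search/Greenlane%2C+Auckland/1000/"))
```

The plain "Auckland" path passes because it contains only letters, digits, and slashes. The second path decodes to a name containing a comma and a space, which is the sort of key such a filter would reject.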

So of course my next step was to remove the ? and resave the .htaccess file. BAM, the whole site is dead. Trying to access any page gives blank screen with this error:

"No input file specified."

Back to the Geofinder guy, who continues to be very helpful, and he says I need to ask my host why the rewrite won't work without the question mark. He says it must be something to do with my server environment. He recommended one or two settings they could change that might fix it.

My host's tech guy, also very helpful, changed those settings (like AcceptPathInfo On) and it still didn't work. He went as far as to make a new test site and try all sorts of things to get it to work without the ? and he couldn't do it.

So now here I am. I have an add-on developer who has been nothing but helpful with suggestions for what my host or I can change to get it to work, but nothing has solved it. I've got a host tech who has gone the extra mile to try to get it functioning, but he also can't solve it (with his admittedly limited rewrite knowledge).

So I'm turning to you pros for some desperately needed help. I need to find a way to modify my .htaccess rules to continue to successfully remove index.php from the CMS URLs, but also allow characters like commas and other punctuation in the search results URLs so the search functions properly.

This geo search is the backbone of the site and there are no other add-ons that I know of for EE that accomplish the same thing. So I'm desperate to get this working.

Would greatly appreciate any help you can give.

jdMorgan

1:07 pm on Aug 25, 2010 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



He's right about the query string problem, but it is not your host or GeoFinder you need to be talking to about this, it is the EE folks. Apparently, EE requires that the URL-path be put into a query string, so that it can be passed into EE's index.php script for processing.

Since this is apparently an EE requirement, I'm going to guess that the EE script is not properly decoding multiply-encoded characters in the query string. The "rules" about which characters are allowed in URL-paths differ from those which are allowed in query strings, and this encoding/decoding problem comes up fairly often because no-one bothers to read the HTTP/1.1 specification, and they do not realize that we Webmasters are NOT free to use "any characters, anywhere that we wish" in URLs and query strings.
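
To make "multiply-encoded" concrete, here is a small sketch using Python's `urllib.parse`. It only illustrates the layering of encodings on the search term from the original post; it does not claim to reproduce what Apache or EE actually do with the request.

```python
from urllib.parse import quote, quote_plus, unquote_plus

term = "Greenlane, Auckland"

once = quote_plus(term)   # form-style encoding, as seen in the search URL
twice = quote(once)       # the already-encoded text encoded a second time

print(once)
print(twice)
print(unquote_plus(twice))                         # one pass removes only the outer layer
print(unquote_plus(unquote_plus(twice)) == term)   # two passes recover the term
```

A script that decodes only once after the text has been encoded twice still sees `%2C` and `+` instead of a comma and a space.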

So, pending your discussion with the folks at EE, I'll suggest that you disable URL-encoding in any RewriteRule that can affect URLs passed to EE, and see if that helps. With a correction to rule order to avoid stacked/chained/multiple redirects, two major performance improvement mods, a security tweak, and several more minor performance tweaks, your code would look like this:

# -- LG .htaccess Generator Start --
# .htaccess generated by LG .htaccess Generator v1.0.0
# http://leevigraham.com/cms-customisation/expressionengine/addon/lg-htaccess-generator/
# Tuned and tweaked by the good folks at webmasterworld.com
#
# secure .htaccess, htpasswd, and htgroup file
<FilesMatch "\.(htaccess|htpasswd|htgroup)$">
Order allow,deny
Deny from all
</FilesMatch>
#
# Don't list files in directory indexes
IndexIgnore *
#
# Declare EE 404 error document for missing pages
ErrorDocument 404 /index.php?/
#
# Declare simple/short 404 error response for missing image files
<FilesMatch "\.(jpe?g|gif|png|bmp|ico)$">
ErrorDocument 404 "File Not Found"
</FilesMatch>
#
RewriteEngine On
RewriteBase /
#
# Externally redirect to add a trailing slash to paths without an extension which do not resolve to an
# existing file. (Note major performance improvement by skipping 'exists' check when not necessary,
# and prevention of an infinite redirect loop if the "home page" accidentally goes missing)
RewriteCond $1 !(\.[a-z0-9]{1,8}|/)$ [NC]
RewriteCond %{REQUEST_FILENAME} !-f
RewriteRule ^(.+)$ http://example.com/$1/ [NE,R=301,L]
#
# Externally redirect non-canonical "www" hostname requests, non-www FQDN hostname requests,
# and non-www hostname requests with appended port numbers to the canonical non-www hostname
RewriteCond %{HTTP_HOST} ^(www\.example\.com|example\.com(\.|\.?:[0-9]+)$) [NC]
RewriteRule ^(.*)$ http://example.com/$1 [NE,R=301,L]
#
# Internally rewrite specified extensionless URL-paths to index.php if
# they do not resolve to existing files. Uses the "include method"
# http://expressionengine.com/wiki/Remove_index.php_From_URLs/#Include_List_Method
# (Modified for performance and to prevent rewriting htaccess, htpasswd, and htgroup requests)
RewriteCond $1 !\.[a-z0-9]{1,8}$ [NC]
RewriteCond %{REQUEST_FILENAME} !-f
RewriteRule ^((includes|pages|dogwalkers|parks|members|P[0-9]{2,8}).*)$ /index.php?/$1 [NC,NE,L]
#
# Send HTTP response header to disable IE image toolbar
<FilesMatch "\.(html?|php)$">
Header set imagetoolbar "no"
</FilesMatch>
#
# -- LG .htaccess Generator End --

As you can see, just about every routine here has been modified. I suggest that you learn more about Apache directives and, until you no longer need them, use "htaccess generators" only to give you a rough idea of what is needed, then tweak and tune the output. It is obvious that this generator produces code that is far from optimized, and your site's performance likely suffers as a result. To be fair, however, this auto-generated code is at least "not truly awful" like most we see here; it is an almost-impossible task to auto-generate optimized code.

The "exists" checks result in resource-intensive and slow calls to the OS to check the filesystem to see if a file or directory exists. If the current state of the filesystem is not cached in memory, or if that cache is marked as stale, then the result will be a read operation on the physical disk. This delays the request, executes a lot of OS and filesystem code, and can result in the premature demise of your hard drive.

Therefore, these 'exists' checks should be skipped whenever possible by fully-qualifying them; The RewriteConds checking -f, -d, -s, and -l should always be the last RewriteConds in any rule-set. The same is true for RewriteConds checking %{REMOTE_HOST}, but that's another subject...
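
The ordering argument can be sketched outside Apache as ordinary short-circuit evaluation: run the cheap pattern test first and touch the filesystem only when it passes. This is an illustration in Python, not a model of mod_rewrite internals; `should_rewrite` and the `stats` counter are hypothetical names.

```python
import os
import re

HAS_EXTENSION = re.compile(r"\.[a-z0-9]{1,8}$", re.IGNORECASE)

def should_rewrite(url_path, docroot, stats):
    """Cheap pattern test first; check the filesystem only when it passes."""
    if HAS_EXTENSION.search(url_path):
        return False                  # e.g. /images/logo.png: no lookup needed
    stats["calls"] += 1               # count the expensive filesystem lookups
    return not os.path.isfile(os.path.join(docroot, url_path.lstrip("/")))

stats = {"calls": 0}
for path in ("/images/logo.png", "/css/site.css", "/dogwalkers/search/Auckland/1000/"):
    print(path, should_rewrite(path, os.getcwd(), stats))
print("filesystem checks:", stats["calls"])
```

Of the three sample requests, only the extensionless one reaches the filesystem check.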

The code change that is relevant to your main question is the use of the [NE] flag on all rules which may handle requests for your extensionless URLs that will get passed to EE. This includes the "add-a-slash" and hostname canonicalization rules.

Note the use of the [NC] flag to shorten the [a-zA-Z0-9] patterns down to [a-z0-9], yielding at least a 30% performance increase...

Note that when possible, "RewriteCond %{REQUEST_URI}" patterns have been moved to the RewriteRule and that RewriteRule patterns have been tweaked to improve performance.

If the [NE] flag helps with your original problem, then the only difference between this code and your original should be a noticeable improvement in your page-loading time.

If the [NE] flag fix does not help, I'd suggest that you eliminate spaces and any other characters which must be encoded for use in either the URL-path or the query string. The HTTP/1.1 protocol specification [w3.org] defines these characters for both cases. If you don't want to dig into that, the short answer is to use only letters, numbers, underscores, and hyphens in your URLs and query strings.
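
As a minimal sketch of that "short answer", the hypothetical helper below reduces arbitrary text to letters, numbers, underscores, and hyphens. Note that it is lossy: the original punctuation cannot be recovered from the result.

```python
import re

def url_safe(text):
    """Collapse every run of other characters into a single hyphen."""
    return re.sub(r"[^A-Za-z0-9_-]+", "-", text).strip("-")

print(url_safe("Greenlane, Auckland"))
```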

Jim

Benek

10:11 pm on Aug 25, 2010 (gmt 0)

10+ Year Member



Thanks Jim for the amazing reply. Probably the most thorough and thoughtful reply I've ever received on a forum.

I've tried implementing your code through LG .htaccess generator. The reason for this rather than writing the .htaccess file directly is that the generator automatically updates the file when new EE template groups or pages are added. The beauty of it is that it's dynamic like that. Unfortunately it would not work.

I pasted your revised code into LG .htaccess generator's code window, replaced the sections that list template groups and pages with the variables for those things, and then tried to save and I got a screen with this error:

Not Implemented
The page you are looking for cannot be displayed because a header value in the request does not match certain configuration settings on the Web server.
Web Server at dogwalker.co.nz


So, next I tried the direct route, forgetting about LG .htaccess generator and pasting it directly to my .htaccess file. This does work to some degree. It successfully removed index.php from the URLs, however it does not solve the problem of the "Disallowed Key Characters" error I get when trying searches that contain punctuation like commas.

So, while I appreciate very much that this is more tuned and efficient than the default code from LG .htaccess Generator, it seems to be somehow incompatible with the add-on, or with EE, or just with my web host. If I can't use it through the LG add-on it means I would have to update the .htaccess file manually if I create a new template group or page, which isn't a realistic possibility.

I've already managed to get my host to change a number of settings to try to get this working and no success yet, but they seem willing to help. Are there any other changes I can ask them to make that might make this code compatible with the hosting environment?

Secondly, now that it's been determined that this revised code does not solve the original problem, what's my next step? Are you suggesting I contact EE and ask them to rewrite their code to fix this? Some of this is over my head so I'm not sure I followed everything but it sounded like you're saying the cause of the original problem is that EE is written incorrectly by requiring everything after index.php to be passed as a query string.

Given that it's unrealistic that I can get EE to alter their core code just for my problem, what other options am I left with? Is there any modification that can be made to Geofinder that might fix this? Or am I back to my host to see if there's anything more they can do? Or is there something else I can try in .htaccess?

You've proven to be extremely knowledgeable already and I'm hoping you can help me determine the next step to take to solve this.

Thanks for the great help,
Benek

Benek

11:10 pm on Aug 25, 2010 (gmt 0)

10+ Year Member



The Geofinder developer has noted that I should mention the host is running mod_fcgid, in case that helps pinpoint the issue.

He also pointed out an article that has this to say about the "No input file specified" issue:

Which brings me to the last problem I was getting on my PHP files: No input file specified. This started another round of fruitless internet searches. The bottom line is that this meant that PHP couldn't execute the file. Well, another stat command on the file in question showed:
Access: (0660/-rw-rw----) Uid: (1000/ someuser) Gid: ( 2000/ somegroup)
but more importantly neither the owner nor group for the file was associated with the suexec user that was being used for mod_fcgid. A quick chown -R suexecuser:suexecgroup command later on the folder holding my http files (-R makes it recursive) and my PHP file was working like a charm. Just make sure you replace suexecuser and suexecgroup with your actual suexec user and group (this is specified in my /var/www/vhosts/yourdomain.com/conf/vhost.conf file).


I guess that's something I should bring up with my host.

Do either of these new clues help solve this? It's a bit over my head.
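
The ownership check described in the quoted article can be sketched as follows. It assumes a Unix-like host; `describe_access` and `owned_by` are hypothetical helpers, and the real suexec user and group ids would have to come from the host.

```python
import os
import stat

def describe_access(path):
    """Return (octal mode, symbolic mode, uid, gid), similar to a stat line."""
    info = os.stat(path)
    return (format(stat.S_IMODE(info.st_mode), "04o"),
            stat.filemode(info.st_mode), info.st_uid, info.st_gid)

def owned_by(path, uid, gid):
    """True if the file's owner or group matches the given ids."""
    info = os.stat(path)
    return info.st_uid == uid or info.st_gid == gid

print(describe_access(os.curdir))
```

If `owned_by` returns False for the ids the PHP process runs under, and the mode gives no access to others, the situation would match the one the article describes.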

jdMorgan

1:16 pm on Aug 26, 2010 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



"Since this is apparently an EE requirement, I'm going to guess that the EE script is not properly decoding multiply-encoded characters in the query string. The "rules" about which characters are allowed in URL-paths differ from those which are allowed in query strings, and this encoding/decoding problem comes up fairly often because no-one bothers to read the HTTP/1.1 specification, and they do not realize that we Webmasters are NOT free to use "any characters, anywhere that we wish" in URLs and query strings."


This is the issue you should take up with EE -- At least to inquire about it, ask for support, or at least report it as a potential problem. Software vendors may not modify their code to fix one Webmaster's problems, but they certainly won't consider modifying it at all if no-one reports problems...

I don't know EE or GeoFinder at all, and I'm fairly shaky on FastCGI, so I'm at the limit of my experience here. Maybe someone else will come along here after a while, but I'd suggest that you try to get support from all three vendors -- at least each one will have expert knowledge of his "piece" of the puzzle.

Jim

Benek

9:17 pm on Aug 26, 2010 (gmt 0)

10+ Year Member



Oh darn, I was hoping you'd have some other ideas up your sleeve!

Unfortunately I've already exhausted help from other vendors--that's why I turned here.

The host is now saying that the level of help and customization I'm asking for is starting to get into the realm of virtual/dedicated server and the client is not going to want to make that upgrade just to have URL rewrites. So I'm afraid I might not get any more help from the host.

The Geofinder developer has been extremely helpful in pointing out things to look for and settings to try, but none of them has led to a solution, and we're getting way beyond the realm of his support.

I have contacted the LG .htaccess Generator author and sent him a link to the site but I have not got any kind of help or suggestion from him and I may never--it's a free add-on and I don't know how much support he offers.

So unless someone else chimes in here I may be out of options. I'll give it until Monday and then I may just forget the .htaccess stuff entirely and not try to rewrite the URLs. I can live with index.php being there if it means this nightmare of a problem is over. I hate to say it, but this might be one of the technical problems I've run into that I actually have to give up on.

jdMorgan

9:49 pm on Aug 26, 2010 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Well, the simplest solution may be to simply eliminate the use of the problematic characters in article titles. As stated, these characters are noted as "prohibited" or "unwise" in the HTTP/1.1 protocol specification (cited above) that underpins the Web -- and such warnings should be taken quite seriously, especially when proven true.

Jim

Benek

10:06 pm on Aug 26, 2010 (gmt 0)

10+ Year Member



The problem is that they aren't article titles. They are search queries. People could potentially enter anything in the search field and I have no control over what characters they might use.

It's a location search, so things like commas are extremely likely since they are so common in addresses.