SteveWh - 6:09 pm on Mar 22, 2012 (gmt 0)
Thank you for posting the update with the obviously correct explanation. That no doubt took some time to track down.
Even in situations where you can't change this behavior (because PHP runs as CGI), you can still intercept the incoming request in .htaccess and return 403 Forbidden (with [F]), 410 Gone (with [G]), or a 404. 404 doesn't have its own RewriteRule flag, but a rule like the one below returns 404 without changing the URL in the requester's browser. You might have to refine the regex for your situation to eliminate false positives:
RewriteRule \.html?/ /NonexistentPageName.htm [L]
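For the [F] and [G] forms mentioned above, a dash as the substitution leaves the URL untouched (the `\.html?/` pattern here is only illustrative and would need tuning for your site). As a side note, newer Apache versions (2.2+) also accept non-3xx status codes on the R flag, which gives a direct 404:

```apache
# 403 Forbidden for any request with a trailing path after .htm/.html
RewriteRule \.html?/ - [F]

# ...or 410 Gone instead
RewriteRule \.html?/ - [G]

# On Apache 2.2+, a non-3xx code on the R flag sends that status directly
RewriteRule \.html?/ - [R=404,L]
```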
The best solution for your particular situation depends on these two "mysteries":
For quite some time I've been seeing 404 errors in Google WMT that would come from existing pages being artificially put into existing folders they don't belong to.
How is WMT getting the idea that these pages "should" exist? Is it from links on other websites that you have no control over, or from links on your own site that have been constructed this way for some reason?
WMT reports back 404 for /sub/page1.html based on the link from /sub/page2.html/a-b-c where a-b-c is a string created from the title of the page.
How (and where) is the title-of-the-page string being created and appended to the URL? If your own code is doing it (a "search engine friendly" URL add-on, for example), it might be better if it didn't.
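If those malformed URLs are already out in the wild and you'd rather redirect them back to the real page than 404 them, a rule along these lines would do it (the pattern is a sketch and assumes the trailing junk always follows ".html"; adjust for your URL scheme):

```apache
# Redirect /sub/page2.html/a-b-c back to /sub/page2.html
RewriteRule ^(.+\.html)/ /$1 [R=301,L]
```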
Don't use relative links to images, CSS files, or JS files. Start the href (or src) with a leading slash and include the full path to the file. Your site is then impervious to crawl errors like this one when rewrites or AcceptPathInfo are in play.
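To illustrate the quoted advice, here is the difference (file paths are just examples): root-relative references resolve the same way no matter what the requested URL looks like, while relative ones resolve against the directory of the requested URL, bogus trailing path included.

```html
<!-- Root-relative: works even when the page is requested
     as /sub/page.html/a-b-c -->
<link rel="stylesheet" href="/css/site.css">
<img src="/images/logo.png" alt="Logo">

<!-- Relative: resolves to /sub/page.html/css/site.css
     for that same request, and breaks -->
<link rel="stylesheet" href="css/site.css">
```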
I almost made that change a few months ago, but then decided not to. The reasons were:
The only advantage I could see is that it fixes up pages that have been requested with a wrong URI carrying a trailing path. Apart from the weird robots (I don't care if they get broken CSS, images, etc.), that is a rare occurrence for my site.
Its disadvantages are also minor, but they can inconvenience me occasionally:
If I load a page directly from my computer's file system, without requesting it through Apache, the absolute CSS and image links break, because the leading "/" is the top of the filesystem, not the web root.
If I put the site in a subdirectory of Apache's htdocs without creating a virtual host for it, the same problem occurs: "/" maps to htdocs itself, not the site's subdirectory, whereas relative links correctly point to whatever folder is effectively the top of the site.
It's just personal preference, but these are a couple of things someone contemplating that change might want to know about in advance.