|parsing as PHP|
parse as PHP affects how page request is returned
| 6:25 pm on Mar 13, 2012 (gmt 0)|
I use PHP within HTML so I have this line in .htaccess:
AddType application/x-httpd-php .php .htm .html
For quite some time I've been seeing 404 errors in Google WMT that would come from existing pages being artificially put into existing folders they don't belong to.
For example, page1.html is in the root, and page2.html is in the /sub/ subfolder.
WMT reports back 404 for /sub/page1.html based on the link from /sub/page2.html/a-b-c where a-b-c is a string created from the title of the page.
So if the title of the page2 is "Page 2 Buy Now" the non-existing URL will be /sub/page2.html/Page-2-Buy-Now, and instead of returning 404 the site actually returns a page, but broken with no CSS and images applied.
If I comment out the line for PHP parsing, those non-existing pages return 404.
The only PHP within the code is one "include" for addthis.
Why would this happen?
| 8:34 am on Mar 14, 2012 (gmt 0)|
Does the server return /sub/page2.html? That's what my Apache servers would return for /sub/page2.html/Page-2-Buy-Now . It seems to ignore the trailing garbage after the legitimate page URL, as long as it can turn the initial portion of the page request into something legitimate. (BTW, this is with mod_negotiation not enabled).
There's at least one forum thread here (maybe others) discussing something similar.
It never occurred to me that Apache returning a page instead of a 404 might have to do with parsing .htm and .html as PHP. I haven't experimented with a situation where Apache wasn't configured to do that.
I suspect the reason your CSS and images are broken is that although Apache returns a page, the requesting browser considers /sub/page2.html/ to be a subdirectory (from its format), and requests relatively-linked objects from that subdir. Since it doesn't exist on the server, all those requests get 404. I discovered from a similar situation that Firefox 10 does this.
In other words, Apache and Firefox disagree about exactly what page was requested. Apache serves a page where it technically should have sent a 404 instead (the whole URL doesn't actually exist), and the browser is misled about what dir the page came from.
| 8:53 am on Mar 14, 2012 (gmt 0)|
Don't use relative linking to images, CSS files or JS files.
Start the href with a leading slash and include the full path to the file.
Your site is then impervious to crawl errors such as this when using rewrites or AcceptPathInfo.
| 5:01 am on Mar 15, 2012 (gmt 0)|
Thanks for replies.
Well, I checked this with other sites I have with other providers.
They all do this.
Then I created a page with a few letters of text, no template, nothing. It still reacts in the same way.
As soon as I comment this line out from .htaccess
AddType application/x-httpd-php .php .htm .html
it returns 404.
Why would this parsing command affect this?
| 11:16 am on Mar 15, 2012 (gmt 0)|
I can confirm your results. With the AddHandler line in effect, I get a page, but when I comment it out, I get 404. I did the test in Windows with Apache 2.2 with PHP running as an Apache module.
The only insight I can try to offer is that Apache might be using a different subroutine to figure out what to serve (if anything) for the requested URI, depending on whether it's going to serve the page itself immediately or run it through PHP first. Maybe there are two different URI resolution and file-find routines.
Your discovery is something I haven't seen discussed previously.
This could be a minor security issue, since it provides a way for an outsider to determine if .htm files are being passed through the PHP interpreter, and thus whether .htm/.html pages likely contain PHP code upon which PHP exploits could be attempted. (On the other hand, it's more efficient to just send the exploit requests without bothering to test for underlying PHP first.)
Offhand, I can't think of a good reason to create these two different behaviors on purpose, but it's always possible there is one.
| 10:41 pm on Mar 17, 2012 (gmt 0)|
Well, I guess the best thing for me at the moment is to stop using this option. If I need PHP, I can use the .php extension.
| 7:52 am on Mar 18, 2012 (gmt 0)|
smallcompany, is PHP configured on your server to run as an apache module or in CGI mode?
| 3:27 pm on Mar 18, 2012 (gmt 0)|
|is PHP configured on your server to run as an apache module or in CGI mode? |
Server API: Apache 2.0 Handler
BTW, I get the same behavior on servers where it's set as CGI.
That slash after file extension has always puzzled me.
For example, if I request a page like this:
I get 404 as it should be. But this request
returns /sub/page.html, just broken, with no images or CSS. I understand why it's broken (because of paths) and that does not bother me. But all the outgoing links are broken as well, which creates hundreds and thousands of 404s that Googlebot then keeps requesting forever.
Why does that slash cause the server to respond with "200 OK"?
| 8:09 am on Mar 22, 2012 (gmt 0)|
An update on the behavior. Apparently, trailing path information is supposed to be handled as determined by the handler responsible for the request.
When I turned this off on the server that runs PHP as an Apache module, it worked. I did this in .htaccess:
But that same code on the server running PHP as CGI did not help. The server was still returning 200 and the content of the page when a request had a trailing slash, whether or not anything followed it.
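The .htaccess line referred to above is not shown in the post. Since the thread later names AcceptPathInfo, a minimal sketch of that kind of setting might look like the following. It assumes AcceptPathInfo is the directive that was meant, and that the host allows it in .htaccess (it falls under AllowOverride FileInfo):

```apache
# Assumption: AcceptPathInfo is the directive the post refers to.
# Off: a request is served only if it maps to a literal path that exists,
# so /sub/page2.html/Page-2-Buy-Now gets 404 instead of /sub/page2.html.
AcceptPathInfo Off
```

When the directive is left at its Default value, the handler responsible for the request decides: the core handler for plain files rejects trailing path information, while handlers that serve scripts generally accept it. That would fit the 404-versus-200 difference seen when PHP parsing of .html is switched off and on.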
So the good news is that things work as they're supposed to. The thing is that more schooling is needed.
Cheers and thanks for replies.
| 6:09 pm on Mar 22, 2012 (gmt 0)|
Thank you for posting the update with the obviously correct explanation. That no doubt took some time to track down.
Even in situations where you can't change this behavior (due to PHP as CGI), you can still intercept the incoming request with .htaccess and rewrite the request to return 403 Forbidden (with [F]) or 410 Gone (with [G]) or a 404. 404 doesn't have its own dedicated RewriteRule flag, but code like this, which rewrites the request to a page that does not exist, will return 404 without changing the URL in the requester's browser. You might have to refine the regex for your situation to eliminate false positives:
RewriteRule \.htm/ NonexistentPageName.htm
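A slightly fuller variant of the rule above, as a sketch only. The `\.html?/` pattern (covering both .htm and .html) is an assumption about the extensions in use, and the `[R=404]` form relies on mod_rewrite accepting a non-3xx status code in the R flag, which is worth checking against the Apache version on the host:

```apache
# Sketch for .htaccess; assumes mod_rewrite is available on the host.
RewriteEngine On

# Match any request path containing ".htm/" or ".html/",
# e.g. sub/page2.html/Page-2-Buy-Now, and answer 404 directly.
# With a non-redirect status the substitution is not used, hence "-".
RewriteRule \.html?/ - [R=404,L]
```

Like the original rule, this would also catch a real directory whose name ends in .htm or .html, so the pattern may need tightening for a given site.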
The best solution for your particular situation depends on these two "mysteries":
|For quite some time I've been seeing 404 errors in Google WMT that would come from existing pages being artificially put into existing folders they don't belong to. |
How is WMT getting the idea that these pages "should" exist? Is it from links on other websites that you have no control over, or from links on your own site that have been constructed this way for some reason?
|WMT reports back 404 for /sub/page1.html based on the link from /sub/page2.html/a-b-c where a-b-c is a string created from the title of the page. |
How (and where) is the title-of-the-page string being created and appended to the URL? If your own code is doing it (such as a Search Engine Friendly add-in or something like that), it might be better if it didn't.
|Don't use relative linking to images, CSS files or JS files. Start the href with a leading slash and include the full path to the file. Your site is then impervious to crawl errors such as this when using rewrites or AcceptPathInfo. |
I almost made that change a few months ago, but then decided not to. The reasons were:
Its only advantage that I could see is to fix up pages that have been requested using a wrong URI with a trailing path. Except for the weird robots (I don't care if they get broken CSS, images, etc.), that is a rare occurrence for my site.
Its disadvantages are also minor, but they can inconvenience me occasionally:
If I load a page directly from my computer's file system without requesting it through Apache, the absolute CSS and image links are broken because the leading "/" is the top of the filesystem, not the web root.
If I put the site in a subdirectory of Apache htdocs without creating a virtual host for it, the same problem occurs: "/" maps to htdocs itself, not the site's subdir, whereas relative links correctly point to whatever is the "effective" top folder the site is in.
It's just personal preference, but it's a couple of things someone contemplating that change might want to know about in advance.
| 6:57 pm on Mar 22, 2012 (gmt 0)|
|How is WMT getting the idea that these pages "should" exist? |
Don't know yet. I started looking into raw log files. Google WMT does not help as I can only see the dates when the server started returning 404 for those links. I actually need to see the initial request, whenever it happened. The rest was just an effect of broken linking.
|How (and where) is the title-of-the-page string being created and appended to the URL? |
No idea. This may take another investigative effort like this AcceptPathInfo one, which BTW has been mentioned here many times. I'll be searching more on this through old WebmasterWorld posts.
|Don't use relative linking to images, CSS files or JS files. |
This does not matter, as such pages do not exist, nor should they return 200, at least in this case. These pages have to return 404.
I'm still puzzled by Apache's explanation of AcceptPathInfo, as it does not mention that a server in CGI mode would not accept the directive. Yet I'm coming across references saying that folks had to switch their server configuration to Apache module mode in order to get this to work.
In my case, I can't do that, as I have quite a few hosting plans that are plain shared plans. There has to be a way to shut this feature off.
| 3:48 pm on Mar 24, 2012 (gmt 0)|
On the server configured in CGI mode, a change to the PHP handler directive produced the desired result. The old one was:
AddHandler application/x-httpd-php5 .html .htm
The new one is:
AddHandler application/x-httpd-php .html .htm
The support said this:
As the server is already using php5, it was making redundant calls that did not work.
So viva the support that resolved it in their first attempt!