
Forum Moderators: Ocean10000 & phranque


addhandler plus trailing slash havoc

     
2:36 am on Dec 4, 2018 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:June 2, 2006
posts:2241
votes: 8


While looking into the site's config for an unrelated problem, I found in the server logs that the server returns 200 for page requests with a trailing slash. Example:

example.com/page.html/

The requests came from Googlebot, and the slash was followed by a bunch of subfolders from the site. All of them sit in the root, yet they were chained together in those requests as /sub1/sub3/sub7/sub10/...

While investigating, I figured out that no matter what came after the trailing slash, or even if nothing did, the result was the same: the HTML code of the page was returned with a 200 response, with all internal paths broken (images, css, etc.)

Now, I don't really care about the broken paths, as I know why that happens. What worries me is that 200 is returned instead of 404.

I dive into it, and figure out that the problem is resolved by deleting .htaccess. So I go line by line, and I finally pinpoint it:

<IfModule !mod_php7.c>
AddHandler application/x-httpd-php .html .htm
</IfModule>


I've already put in a request with support, but I'm definitely eager to hear from people here as well, since the replies here come from people on the front line of this type of work.

I do need to parse HTML files as PHP as I use PHP to return variables that I maintain in separate files. I can replace some stuff with the plain HTML code, but not all. I know one solution would be to switch extensions to PHP, but...

Why in the first place would this AddHandler directive cause the server to return 200 instead of 404?

Thank you

P.S.
Years ago I had a similar issue that was resolved by adding AcceptPathInfo Off, on a server that was configured as Apache. Now, since most servers are configured as FastCGI, this is not an option.
3:46 am on Dec 4, 2018 (gmt 0)

Administrator from US 

WebmasterWorld Administrator not2easy is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Dec 27, 2006
posts:4562
votes: 364


That really needs to come from your host. They know how it is set up, and hosts offer PHP in different ways. I have seen several different AddHandler lines required at different hosts. I have never seen one within an Apache <IfModule> block before, but that does not mean it is wrong. You will need to know for certain which version of PHP is being used. If you have a phpinfo.php file, it can usually tell you the version. You may also find it in CP if your site has CP.

Usually the line for AddHandler includes the version number as in this old php5 example:
AddHandler application/x-httpd-php5 .html .htm
If your site is running php7 the line would be
AddHandler application/x-httpd-php7 .html .htm
or for php7.1 use
AddHandler application/x-httpd-php71 .html .htm
- BUT your php may be configured in some other way. If they use EasyApache CP module it would be
AddHandler application/x-httpd-ea-php71 .html .php
I would wait to hear from your host or consider it trial and error work.
5:58 pm on Dec 4, 2018 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:June 2, 2006
posts:2241
votes: 8


Thanks.

None of that really changes anything. I deleted the IfModule wrapper and used the simple AddHandler line without even stating which version of PHP is being called. I don't think it's necessary, as the VPS is running 7.1 only.

PHP is working fine, as it always has.

The AddHandler directive is causing requests with a trailing slash to return 200 instead of 404, and I hoped to hear why that would be, as it does not sound normal. Removing the AddHandler directive fixes it.


Thanks
7:37 pm on Dec 4, 2018 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:June 2, 2006
posts:2241
votes: 8


Well, the support said that slash after HTML extension returning 200 is expected behavior. Isn't that creating duplicated content?
Phew, now I have to rethink a lot.

Thanks
8:48 pm on Dec 4, 2018 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member lucy24 is a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month

joined:Apr 9, 2011
posts:15937
votes: 889


Isn't that creating duplicated content?
Well, only if the search engine discovers it. Putting a slash after html/ isn't something they try routinely, like requesting directory names with and without slash. In the requests that you posted about at the outset, are those genuine requests from legitimate search engines and/or humans, or are they things that you yourself tried experimentally?
8:55 pm on Dec 4, 2018 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:June 2, 2006
posts:2241
votes: 8


Requests are coming from Googlebot (server logs). All of them have /RK=2/ in them plus other subfolder garbage. Something somewhere referenced a page in that way, and this beast picked it up. The problem is that it's enough for a single page to return 200 like this, and off you go with a bunch of other scrambled URLs returning 200.
To resolve it from the 200 standpoint, I started rewriting all .html/ requests to .html. I don't really like it, but since the URL up to the .html extension exists and the rest does not, I find redirecting the least harmful option. Or I may switch to 404ing them; I'm not sure at this moment.
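
For comparison, a minimal sketch of the 404 alternative, assuming mod_rewrite is enabled and the rule lives in the site's root .htaccess. Per the mod_rewrite documentation, when the R flag carries a status outside the 3xx range, the substitution is dropped and rewriting stops, so "-" is used as a placeholder:

```apache
RewriteEngine On
# Any request with a slash right after ".html" gets a 404 response.
# With a non-3xx status the "-" substitution is ignored.
RewriteRule \.html/ - [R=404,L]
```

Unlike the redirect, this does not hand the crawler a working URL, so which one is preferable depends on whether the scrambled links are worth salvaging.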

Thanks
10:59 pm on Dec 4, 2018 (gmt 0)

Administrator

WebmasterWorld Administrator phranque is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Aug 10, 2004
posts:11874
votes: 245


I started rewriting all .html/ requests to .html.

"rewriting" or "redirecting"?
11:05 pm on Dec 4, 2018 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:June 2, 2006
posts:2241
votes: 8


Thanks for asking. Honestly, I know those are different things yet I don't know how. Here is the code:

#Remove slash after .html extension
RewriteCond %{REQUEST_FILENAME} !-d
# For testing; in production use [L,R=301]
RewriteRule ^(.*)\.html/ /$1.html [L,R]


Being 301, I guess it's redirecting. BTW, do I even need that RewriteCond line?
12:02 am on Dec 5, 2018 (gmt 0)

Administrator

WebmasterWorld Administrator phranque is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Aug 10, 2004
posts:11874
votes: 245


Being 301, I guess it's redirecting. BTW, do I even need that RewriteCond line?

the [R=301] flag will make it a (301) redirect.

the ruleset won't fire unless the requested path ends in ".html/".
therefore if you don't have any directories thus named, the RewriteCond is superfluous.
12:14 am on Dec 5, 2018 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:June 2, 2006
posts:2241
votes: 8


Thank you.

One last thing: years ago I used the AcceptPathInfo Off directive in .htaccess to simply turn off that slash behavior, and all was good. Now that the server is set up as FastCGI, that entry makes no difference.

Looking here:

[httpd.apache.org...]

I have a feeling it should work. But it does not. Can this be used in a FastCGI environment at all?

Thanks
12:59 am on Dec 5, 2018 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member lucy24 is a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month

joined:Apr 9, 2011
posts:15937
votes: 889


I know those are different things yet I don’t know how.
A redirect means you are telling the visitor (browser or robot) to make a fresh request. You will see the new request in your access logs immediately after the first request. A rewrite means stuff happens behind the scenes that the visitor (again, human or robot) doesn’t know about. CMS such as WP are built entirely around rewrites. In the interest of double markedness, you may find it useful to use the phrases “external redirect” and “internal rewrite”, so long as you understand that “external” doesn’t necessarily mean go to some other site; it means “this isn’t just happening inside the server”.
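
To make the distinction concrete, here is a small sketch of the two kinds of rule side by side, assuming they sit in a root .htaccess. The file names (old-page.html, new-page.html, pretty-name, real-script.php) are made up for illustration and do not come from this thread:

```apache
RewriteEngine On

# External redirect: the R flag sends a 301 response with a Location header,
# so the client makes a second request that shows up in the access log.
RewriteRule ^old-page\.html$ /new-page.html [R=301,L]

# Internal rewrite: no R flag, so the server quietly serves different content
# while the client still sees the URL it originally asked for.
RewriteRule ^pretty-name$ /real-script.php [L]
```

The only syntactic difference is the R flag; everything else about the two rules is the same machinery.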

That being said, I'm not crazy about the pattern
^(.*)\.html/
since it means the server will capture all the way to the end before saying “oh, oops, I need to allow room for .html/ at the end”. If your URLs never contain literal periods, I’d suggest
^([^.]+\.html)/
instead. Note that you may as well capture the ".html" too, since you'll be reusing it.

As phranque observed, the -d test is superfluous, since you will obviously never have a directory with "blablah.html/" in its name. Sure, it's possible and legal, but if you had had such directories, this whole thread would have involved a whole different set of questions. Possibly you meant to say -f instead, but don't, because this too is superfluous. The whole point is that you already know the file doesn’t and can’t exist.

the ruleset won't fire unless the requested path ends in ".html/"
You meant to say “contains ".html/"” since there was no closing anchor (and rightly so). There may or may not be additional garbage after the /

Now, here’s option B. This is one of the rare cases where two steps forward, one back, may be the right thing to do, as it saves the server the extra work of capturing on the 99% of html requests that don't have extraneous garbage:
RewriteCond %{REQUEST_URI} ^/([^.]+\.html)
RewriteRule \.html. https://www.example.com/%1 [R=301,L]
(Note exact position or non-position of anchors.) But now we’re getting into personal-coding-style territory.

Even in your existing rule, you may choose to replace / in the pattern with . meaning any-old-garbage-whatsoever. And it can’t hurt to put a ? at the end of the target, in case these same bum requests also come with an equally bum query string.
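
Putting those last two suggestions together, the single-rule version might look like the sketch below. It assumes the rule sits in the root .htaccess, that no legitimate URL has a period before ".html", and that nothing legitimate ever follows ".html" in a path:

```apache
RewriteEngine On
# Match ".html" followed by at least one more character (a slash or any other
# garbage), capturing everything up to and including ".html".
# The trailing "?" on the target discards any query string from the request.
RewriteRule ^([^.]+\.html). /$1? [R=301,L]
```

A clean request such as /page.html has nothing after the extension, so the pattern does not match and there is no redirect loop.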