URLs with missing extensions Still Served

         

JohnRuskin

9:23 am on Dec 8, 2010 (gmt 0)

10+ Year Member



Through an inadvertent coding error on my part, Google picked up a web page as
www.mydomain.com/my-page
even though the true document is
www.mydomain.com/my-page.html

Now both are in the index. Ever the tinkerer, I tried to write a rewrite ruleset that would add the missing extension back when it is absent. To my surprise, none of my versions, nor any others I found, did the trick: the page is served from the file with the extension, but the browser's address bar stays on the extensionless URL.

No matter what I do, rewrite or redirect, my domain serves up "my-page.html" when the request is for "my-page". As a last-ditch attempt, I created a file without the extension that uses an HTML redirect, just so I could keep Google from following my error. I'm not sure whether that would work, but I do know that the simple redirecting page shows up as plain text in my browser [the file's content is valid HTML, including a doctype].

The whole absent-extension thing behaves as if something on the GoDaddy server, before anything reaches my root .htaccess, accepts the extensionless request as valid and serves the file with the extension. I don't have access to the server root to modify .htaccess etc.; I only have access to my domain root.

My entire .htaccess, in its current failed form, follows. The code between the #! markers contains the variations I've added and tried to address this situation. Any hints? Thanks

=========
.htaccess
---------

#########
# Root
#########

AddHandler server-parsed .html
AddHandler server-parsed .shtml
AddHandler server-parsed .htm

RewriteEngine On

#!Attempt Fix, here...!
# Redirect missing HTML to include it.
##Redirect 301 property-claim [complianceofficer.com...]

##RewriteCond %{REQUEST_FILENAME} !-d
#RewriteCond %{REQUEST_FILENAME} !-f
#RewriteRule ^([^.]+)$ [complianceofficer.com...] [R=301,L]

## If no trailing slash or .html, and no "." in URL
RewriteCond $1 !(/$|\.html$|\.)
# and if URL with appended slash does not exist as a directory
RewriteCond %{DOCUMENT_ROOT}/$1/ !-d
# and if page does exist with ".html" appended
RewriteCond %{DOCUMENT_ROOT}/$1.html -f
# then append ".html" to URL
RewriteRule (.*) http://www.example.com/$1.html [R=301,L]
#!


# Redirect HTM and SHTMx to HTML
RewriteCond %{HTTP_HOST} ^www\.complianceofficer\.com
RewriteRule ^(.*)\.(shtm|shtml|htm)$ [complianceofficer.com...] [R=301,L]

# Externally redirect all non-canonical hostname requests to canonical "www.domain.com" hostname
RewriteCond %{HTTP_HOST} !^(schoolboard|www|lighterside)\.complianceofficer\.com$
RewriteRule ^(.*)$ [complianceofficer.com...] [R=301,L]

JohnRuskin

9:34 am on Dec 8, 2010 (gmt 0)

10+ Year Member



Noticing my potential lack of wisdom, using:
^(.*)
Instead of:
^([^/]+)
in my two rewrites that do work... Am I correct that this change would save processing time? And while I understand the purpose of the change, I can't quite 'translate' the longer expression. Can someone?
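
For anyone else wanting a translation of the two patterns: ^(.*) reads as "from the start, capture any run of characters, including none", while ^([^/]+) reads as "from the start, capture one or more characters that are not a slash", so it stops at the first "/". A rough way to see the difference is to try both in Python, on the assumption that its re module agrees with Apache's regex engine for patterns this simple. The sample paths are made up for illustration, and the sketch says nothing about speed, only about what matches.

```python
import re

# Hypothetical URL-paths as a root .htaccess would see them (no leading slash).
paths = ["page.htm", "dir/page.htm"]

any_chars = re.compile(r"^(.*)\.(shtml?|htm)$")     # anything, then the extension
no_slash = re.compile(r"^([^/]+)\.(shtml?|htm)$")   # a single path segment only

def captured(pattern, path):
    """Return the text captured by group 1, or None if there is no match."""
    m = pattern.match(path)
    return m.group(1) if m else None

for p in paths:
    print(p, "->", captured(any_chars, p), "|", captured(no_slash, p))
```

The second pattern no longer matches pages inside subdirectories, so swapping it in would change which URLs get redirected, not just how quickly they are matched.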

jdMorgan

12:47 pm on Dec 8, 2010 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



> The ... thing is behaving as if something is happening on the GoDaddy Server, before anything reaches my ROOT .htaccess ...

Very likely so...

Try this in your root .htaccess:

AddHandler server-parsed .html .shtml .htm .shtm
#
# !Attempt Fix, here...!
Options +FollowSymLinks -Indexes -MultiViews
#
RewriteEngine On
#
# If no trailing slash or "." in URL
RewriteCond $1 !(/$|\.)
# and if URL with appended slash does NOT exist as a directory
RewriteCond %{DOCUMENT_ROOT}/$1/ !-d
# and if page DOES exist with ".html" appended
RewriteCond %{DOCUMENT_ROOT}/$1.html -f
# then append ".html" to URL
RewriteRule ^(.+)$ http://www.example.com/$1.html [R=301,L]
#
# Redirect .shtm, .shtml, and .htm URL requests in www.example.com to .html
RewriteCond %{HTTP_HOST} ^www\.example\.com
RewriteRule ^(.+)\.(shtml?|htm)$ http://www.example.com/$1.html [R=301,L]
#
# Externally redirect all non-canonical hostname requests to canonical "www.example.com" hostname
RewriteCond %{HTTP_HOST} !^(www|schoolboard|lighterside)\.example\.com$
RewriteRule ^(.*)$ http://www.example.com/$1 [R=301,L]

The -MultiViews option disables content negotiation which, although very useful in some cases, can severely interfere with mod_rewrite. Take a look at the mod_negotiation documentation for an eye-opener...

Other candidates (for problems similar to this) are mod_speling and AcceptPathInfo, which can also interfere with mod_rewrite.

Note that I modified or removed several redundant patterns and functions. This changes nothing about what the rules do; it is only a tiny performance tweak.

There is nothing fundamentally wrong with using ".*" if you *really* mean that you want to match "anything at all, absolutely everything, or even nothing." In the code above, I changed one pattern from ".*" to ".+" because the rule should not accept "nothing."
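
The difference between ".*" and ".+" only shows up when the text being matched is empty, which in a root .htaccess happens for a request to the bare site root (the leading slash is stripped before the pattern is applied). A minimal sketch in Python, assuming its re module behaves like Apache's regex engine for these two patterns:

```python
import re

star = re.compile(r"^(.*)$")   # zero or more characters
plus = re.compile(r"^(.+)$")   # one or more characters

# "" stands for a request to http://www.example.com/ as seen in a root .htaccess.
for path in ["", "my-page"]:
    print(repr(path), bool(star.match(path)), bool(plus.match(path)))
```

Only the empty path is treated differently: ".*" accepts it and ".+" does not.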

Jim

[edit] Corrected as noted below. [/edit]

[edited by: jdMorgan at 11:51 pm (utc) on Dec 8, 2010]

JohnRuskin

2:58 pm on Dec 8, 2010 (gmt 0)

10+ Year Member



Thanks for your solution, Jim. I think the -multiview fixed it.

1.
I think you meant to delete the "/$1.html" in the final rewrite rule you suggested...?

2.
Now that I think on it, in my world either a trailing dot or no extension at all should trigger adding "html" or ".html", respectively. In my version, that leaves two rule sets; my solution is below. I'm curious whether a single rule could handle both cases...

Interestingly, testing yours now, with a trailing slash leads to all kinds of weird production
  • www.example.com/index.html/
    [/*]
    produces a css-free page in Fox
    and on that produced page clicking on a URL ["morestuff.html"]
    yields again "index.html" w/o css,
    and this in the Fox AddressBar:
  • www.example.com/index.html/morestuff.html
    [/*]
    What happens...? I can't divine this behavior from the rule set. Another GoDaddy behavior, not producing a FileType sense, or is this a Fox Weirdness combined with GoDaddy Quirk?

    3.
    Do I have these right:

    As I read the documentation, the -multiview is a quirk's fix that solves original problem, and it did [Thanks, Jim]. Am I correct that there is a server level variant of that command that GoDaddy can/did enable? Am I correct that +MultiView solves confusing URLs with generally confusing solutions...?

    -indexes, I read as a security issue; a good idea.

    +FollowSymLinks I don't quite get. As I'm tempted to understand it, it would be useful if file names in my directories pointed to other locations, and is not something internal to mod_rewrite. Am I right?

    ---John

    ===========.htaccess================
    #########
    # Root
    #########

    AddHandler server-parsed .html .htm .shtml .shtm
    Options +FollowSymLinks -Indexes -MultiViews
    RewriteEngine On

    ## If not (either: trailing slash; or ".") in URL
    RewriteCond $1 !(/$|\.)
    # and if URL with appended slash does not exist as a directory
    RewriteCond %{DOCUMENT_ROOT}/$1/ !-d
    # and if page does exist with ".html" appended
    RewriteCond %{DOCUMENT_ROOT}/$1.html -f
    # then externally redirect
    # then append ".html" to URL
    RewriteRule ^(.+)$ http://www.example.com/$1.html [R=301,L]

    ## If (trailing ".") in URL
    RewriteCond $1 (\.$)
    # and if requested URL does not exist
    RewriteCond %{DOCUMENT_ROOT}/$1 !-f
    # then externally redirect
    # then append "html" to URL [which already has the dot]
    RewriteRule ^(.+)$ http://www.example.com/$1html [R=301,L]

    ## Externally Redirect .htm .shtm .shtml to .html
    RewriteCond %{HTTP_HOST} ^www\.example\.com
    RewriteRule ^(.*)\.(shtml?|htm)$ http://www.example.com/$1.html [R=301,L]

    ## Externally redirect all non-canonical hostname requests to canonical "www.domain.com" hostname
    RewriteCond %{HTTP_HOST} !^(subDomOne|SubDomTwo|SubDomThree)\.example\.com$
    RewriteRule ^(.*)$ http://www.example.com/$1 [R=301,L]
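
The first two rule sets above can be traced by hand with a small model. The following Python sketch is not mod_rewrite; it only mirrors the conditions as written. The function name redirect_target and the sample file and directory sets are invented for illustration.

```python
import re

BASE = "http://www.example.com/"

def redirect_target(path, files, dirs):
    """Mimic the two extension-fixing rule sets for a URL-path without leading slash.

    files and dirs are sets of paths relative to the document root.
    Returns the redirect URL, or None if neither rule set fires.
    """
    if not path:                      # ^(.+)$ requires at least one character
        return None
    # Rule set 1: no trailing slash, no "." anywhere, not a directory,
    # and a file with ".html" appended exists.
    if (not re.search(r"(/$|\.)", path)
            and path not in dirs
            and path + ".html" in files):
        return BASE + path + ".html"
    # Rule set 2: trailing ".", and the requested path is not an existing file.
    if re.search(r"\.$", path) and path not in files:
        return BASE + path + "html"
    return None

files = {"my-page.html"}
dirs = {"images"}
for p in ["my-page", "my-page.", "my-page.html", "images"]:
    print(p, "->", redirect_target(p, files, dirs))
```

In this model both "my-page" and "my-page." end up at my-page.html, while an existing page or directory is left alone.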

    JohnRuskin

    3:22 pm on Dec 8, 2010 (gmt 0)

    10+ Year Member



    Placing:
    AcceptPathInfo Off

    into the root .htaccess
    creates a server error when I try
  • "www.example.com/valid-page.html/more"


    GoDaddy reports no Apache version in the header, so I can't tell if this is a version issue.

    jdMorgan

    11:06 pm on Dec 8, 2010 (gmt 0)

    WebmasterWorld Senior Member 10+ Year Member



    I don't feel confident that I understand the majority of your previous post... Let's take it one question at a time...

    1. Did I intend to delete the trailing $1? No, or I would have done so.


    2. No extension is already taken care of. Trailing punctuation (from bad forum autolinking routines, for example) will have to be a separate rule. That separate rule need not check for file-exists or directory-exists because a trailing period wouldn't form a valid filename, and you're unlikely to want to name directories with trailing periods...


    3. [If I request] "www.example.com/index.html/[/*]" [from my server], it produces a css-free page in Fox

    {Please write directly, grammatically, and specifically, here... it is difficult enough to figure out the technical aspects without having to guess at the basic language. I may well have guessed wrong with my 'edit' of your statement. Contributors here cannot be blamed for ignoring posts that they cannot understand...}

    Since both /index.html/ and /index.html/[/ are directory-path specifications, and your included-object links are likely 'relative' links, that does not surprise me. Also, URL-paths containing [, /, *, or ] will be encoded, since all of those characters are either 'reserved' or 'unwise'. If you really want to bullet-proof your site against malformed URLs like that, use server-relative or canonical links for included objects.


    4. Options +/-MultiViews enables or disables content-negotiation. See mod_negotiation documentation.

    AcceptPathInfo is not available on older Apache versions, and will indeed cause a 500-Server Error if not supported.

    +FollowSymLinks is required by mod_rewrite... as documented.

    Hope that's useful... If not, please restate one or two questions to keep it short.

    Thanks,
    Jim

    JohnRuskin

    11:35 pm on Dec 8, 2010 (gmt 0)

    10+ Year Member



    Jim: My apologies, and my thanks for your response:

    1.
    Your suggested use of /$1.html/$1 ends up creating this returned result:
  • www.example.com/page.html.html/page.html
    What am I missing here? I would have struck the "/$1.html", leaving only the "/$1". That's why I thought it was a typo.

    2.
    My apologies: Twice, I manually typed [/*] as a list item closing delimiter, and see now it is not required in this forum.

    The question, rephrased, posits this URL
  • www.example.com/index.html/

    which, when submitted, returns and displays a page free of CSS formatting in the Fox browser, even though the page normally displays correctly. Further, clicking on a link on that page [e.g. "morestuff.html"]
    returns that same "index.html" file without CSS formatting, and this in the Fox address bar:
  • www.example.com/index.html/morestuff.html


    What happened...? I can't divine this behavior from the rule set. Is this a Fox Weirdness combined with a GoDaddy Quirk? Or am I doing something wrong?

    My thanks, again, in advance

    John

    JohnRuskin

    11:41 pm on Dec 8, 2010 (gmt 0)

    10+ Year Member



    Ahh. Now I see a point I missed, re #2 -- That the trailing slash turns the prior text into a directory-path. This intellectual curiosity still stands: I still don't see why there is a return of a CSS-free index.html page from the root, instead of a 404 missing.

    Thanks, again

    John

    jdMorgan

    12:03 am on Dec 9, 2010 (gmt 0)

    WebmasterWorld Senior Member 10+ Year Member



    I missed the rogue "$1" because I was looking at the "final" rule -- meaning to me, the last rule of the four. I corrected the code above to prevent further confusion and propagation of bad code.

    Download and install the "Live HTTP Headers" add-on for Firefox, and look at your CSS problem. If your .css files are relatively-linked as described above, you will see that your CSS file is being loaded from /index.html/[/cssfile.css. Use server-relative or canonical links for css, images, and JavaScript files instead of page-relative links if you want to avoid this problem. It is the browser which constructs canonical URLs from page- or server-relative links, based on the URL in its address bar. Since the 'directory' of the URL in the address bar is "/index.html/[/", a page-relative link (e.g. <link rel="stylesheet" type="text/css" href="cssfile.css">) will be canonicalized to http://example.com/index.html/[/cssfile.css, and that is what it will request. The css file request will go 404, but that certainly won't show a 404 message on the HTML page itself, because the HTML page file fetch was successful...

    Using <link rel="stylesheet" type="text/css" href="/cssfile.css"> or <link rel="stylesheet" type="text/css" href="http://example.com/cssfile.css"> will prevent that problem.
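
The browser-side resolution described here can be reproduced with Python's standard urllib.parse.urljoin, which implements the standard relative-reference resolution that browsers also perform for simple cases like these. This is only a sketch; the host and file names are the placeholder ones used in this thread.

```python
from urllib.parse import urljoin

# The address-bar URL with the stray trailing slash.
base = "http://www.example.com/index.html/"

page_relative = urljoin(base, "cssfile.css")      # resolved against the "directory"
server_relative = urljoin(base, "/cssfile.css")   # resolved against the site root
absolute = urljoin(base, "http://www.example.com/cssfile.css")

print(page_relative)    # http://www.example.com/index.html/cssfile.css
print(server_relative)  # http://www.example.com/cssfile.css
print(absolute)         # http://www.example.com/cssfile.css
```

Only the page-relative form inherits the bogus "/index.html/" directory; the other two resolve to the same URL no matter what the address bar shows.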

    You'll likely learn a lot about this and many other aspects of HTTP simply by observing your browser requests and your server responses with Live HTTP Headers -- It is basic Webmaster kit, and well worth the time to experiment a bit with it...

    Jim

    JohnRuskin

    2:01 am on Dec 9, 2010 (gmt 0)

    10+ Year Member



    Thanks, Jim for your response...your patience is appreciated...

    I have the HttpFox which, I suspect, serves the same function as the addon you describe -- I had looked in on what was going on, but as mentioned earlier missed recognizing that for my browser, "/index.html/" became a directory by that confused name...!

    Been using HttpFox to follow success/failure of Rulesets for the Rewrites, and it doesn't help the confusion, below -- between the way my browser and the server treat the consequence of the trailing slash.

    To explain my confusion:
    On my site, "mypage.html" is a valid page.

    When I type in "www.example.com/mypage.html/", the server does NOT return the index.html for the root, nor does it return a 404. Rather, it returns the valid, existing "mypage.html", stripping the trailing slash, as it were, and serving "mypage.html" from the domain root.

    Concurrently, the reference to the "main.css" file fails, and I now understand why: my browser takes "mypage.html" to be part of the directory path of the served page [i.e., it regards the trailing slash as part of the document's path, and so treats "mypage.html" as a directory in the base URL for links]. Therefore the reference to
    www.example.com/mypage.html/main.css
    FAILS.

    So there is a distinction between how the server treats the trailing slash, and how my browser does.

    What I don't understand is what it is about the server (in the .htaccess file, or somewhere else) that makes it drop the slash and serve the page, rather than treating "mypage.html" as a directory name and returning a 404.

    It is the mere fact that if/when there is no Directory by the same name, Apache snags a file -if- it can find it? Is this merely native behavior, or is it something in or missing from my .htaccess, or is it the result of something else -- something sort of like the way multiview might work (except I have turned multiview off).

    jdMorgan

    12:26 am on Dec 14, 2010 (gmt 0)

    WebmasterWorld Senior Member 10+ Year Member



    > It is the mere fact that if/when there is no Directory by the same name, Apache snags a file -if- it can find it? Is this merely native behavior, or is it something in or missing from my .htaccess, or is it the result of something else -- something sort of like the way multiview might work (except I have turned multiview off).

    Do be sure you turned off MultiViews (plural). MultiViews and AcceptPathInfo are the usual culprits when this "automagically find a file" behaviour is observed. Mod_speling can also make minor corrections to misspelled URLs.

    Aside from your own rewriterules or scripts "modifying the URLs to make them work," those three server modules/functions are the primary ones that can cause this kind of unexpected behaviour.

    Jim
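
To picture the path-info behaviour mentioned above: when a request such as /mypage.html/ names no directory, the server can still find the file /mypage.html as a prefix of the path and treat the remainder as trailing path information. The Python sketch below only models that split. The names split_path_info and existing_files are hypothetical, and whether such a request is then served or rejected depends on the server's configuration and version.

```python
def split_path_info(url_path, existing_files):
    """Split a URL-path into (file_path, path_info).

    Tries successively longer prefixes of the path and stops at the first
    one that names a known file. Returns None if no prefix is a file.
    """
    parts = url_path.split("/")            # "/mypage.html/" -> ["", "mypage.html", ""]
    for i in range(2, len(parts) + 1):
        candidate = "/".join(parts[:i])    # e.g. "/mypage.html"
        if candidate in existing_files:
            return candidate, url_path[len(candidate):]
    return None

# Hypothetical document root contents.
existing_files = {"/mypage.html", "/main.css"}

for path in ["/mypage.html", "/mypage.html/", "/mypage.html/more", "/missing.html/"]:
    print(path, "->", split_path_info(path, existing_files))
```

In this model the trailing "/" or "/more" is left over as path information while the file itself is still found, which matches the "page is served anyway" symptom described in this thread.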