Validating XHTML documents

Are they valid?

         

encyclo

12:30 am on Feb 26, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I made a comment in this thread [webmasterworld.com] about the difference between having a valid document and a document which validates. I want to play devil's advocate here, and I'm looking for opinions. Here are three pages authored with XHTML:

1) h**p://www.alistapart.com/about/

A very popular web magazine. Uses XHTML 1.0 Transitional. (I selected the "About" page as it will change less frequently than the front page).

2) h**p://www.w3.org/TR/xhtml11/

W3C Recommendation for XHTML 1.1. Uses (surprise surprise!) XHTML 1.1.

3) h**p://www.webstandards.org/about/

Group advocating (guess what?!) web standards. Uses XHTML 1.0 Strict. (Again I chose a fairly static "About" page.)

My question is this: are these three pages valid XHTML?

DrDoc

12:55 am on Feb 26, 2004 (gmt 0)

Before answering this question, keep in mind that the validator only checks the structure of a page. And in that sense, yes, they are all valid XHTML. However, they all also have minor flaws:

ALA
• The XML prologue is missing. All XHTML pages are supposed to have the XML prologue unless the encoding is UTF-8 or UTF-16.

W3C
• XHTML 1.1 pages should not be served as text/html (even though they may). Page is served as text/html, not application/xhtml+xml (which is of course because of lacking support)

WASP
• The XML prologue is missing. All XHTML pages are supposed to have the XML prologue unless the encoding is UTF-8 or UTF-16.
• HTML style comments are used around inline style sheets. This should not be done since XML parsers are allowed to silently remove the contents of comments. *

* Note that I did not check any external style sheets, since it was only the markup that was in question.

encyclo

1:20 am on Feb 26, 2004 (gmt 0)

I knew I could count on you, DrDoc!

ALA
  • The XML prologue is missing. All XHTML pages are supposed to have the XML prologue unless the encoding is UTF-8 or UTF-16.

I agree. If you aren't using UTF-8 or UTF-16, there should either be an XML prolog, or the charset should be defined in the HTTP header (before the page is served). This page (and in fact the whole ALA site) is invalid XHTML 1.0 Transitional. However, it validates.

W3C
  • XHTML 1.1 pages should not be served as text/html (even though they may). Page is served as text/html, not application/xhtml+xml (which is of course because of lacking support)

Close, but no cigar ;) XHTML 1.0 should not be served as text/html, but you may do so. XHTML 1.1 must not be served as text/html. This page is invalid XHTML 1.1. However, it validates.

WASP
  • The XML prologue is missing. All XHTML pages are supposed to have the XML prologue unless the encoding is UTF-8 or UTF-16.

They don't need an XML prolog, as the charset is defined as ISO-8859-1 in the HTTP header. No problem there.

  • HTML style comments are used around inline style sheets. This should not be done since XML parsers are allowed to silently remove the contents of comments.

This doesn't invalidate the XHTML - OK, I agree the inline stylesheets should have no effect, but the comments are valid. So, no problem there either.

They're missing something else - I still haven't found the reference in the specs (the W3C site and documentation are a real nightmare to navigate!), but I think the page is invalid. I'm the least sure about this one, though... ;)

    DrDoc

    1:32 am on Feb 26, 2004 (gmt 0)

    Close, but no cigar

    Yeah, I didn't feel like looking it up in the specs :)
    But, now that I did... I noticed that I am right after all ;) XHTML Media Types [w3.org]

    As for WASP... I didn't bother checking the headers for any of these pages (too much work ;))... And I didn't say that all the things I brought up were errors... just "minor flaws", since I figured that would better describe it :)

    I couldn't find any other problems with the WASP site though... Care to enlighten us as to what the problem might be?

    inline stylesheets should have no effect

    Useragents may silently remove the contents of comments -- nothing more, nothing less. May, not should ;)
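To illustrate the point about comments, here is a small Python sketch using the standard library's xml.etree.ElementTree, whose default parser discards comments. The sample markup and the function name are made up for the illustration; they are not taken from any of the three sites.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: a style rule hidden inside an HTML-style comment.
FRAGMENT = "<style><!-- p { color: red; } --></style>"


def style_text_seen_by_xml_parser(fragment):
    """Return the character data an XML parser reports for the element."""
    element = ET.fromstring(fragment)
    # The default ElementTree parser drops comments, so a rule that was
    # wrapped in a comment never reaches the caller.
    return element.text or ""


if __name__ == "__main__":
    print(repr(style_text_seen_by_xml_parser(FRAGMENT)))
```

A browser that treats the page as text/html behaves differently, which is why the same markup can work when served one way and lose its styles when served the other way.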

    DrDoc

    1:41 am on Feb 26, 2004 (gmt 0)

    I just need to add one thing -- all three pages are still valid XHTML. The validator only checks well-formedness and structure against the XHTML DTD, nothing else. So, the flaws we've brought up in here do not by any means suggest that there are flaws in the W3C validator. No, they merely suggest that the Web site authors (of the three sites submitted for comparison) are flirting with the grey zone and walking the fine line between valid and invalid. Still... they are on the right side of that line. :)

    BarkerJr

    3:33 am on Feb 26, 2004 (gmt 0)

    10+ Year Member



    XHTML 1.1 must not be served as text/html.

    Actually, it's "should not," so it is still valid.

    Of course, with a little PHP, it's easy enough to have the best of both.

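    // Serve as application/xhtml+xml only to clients that say they accept it.
    if (isset($_SERVER['HTTP_ACCEPT']) && strpos($_SERVER['HTTP_ACCEPT'], 'application/xhtml+xml') !== false) {
        header('Content-Type: application/xhtml+xml; charset=iso-8859-15');
    }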

    Replacing the charset with your page's charset, of course.
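The same idea can be sketched in Python, going one step further than the PHP snippet above by respecting an explicit q=0 on the media type. This is only a sketch under assumptions: choose_content_type is a hypothetical helper, and how you read the Accept header and send the response header depends on your server setup.

```python
def choose_content_type(accept_header, charset="iso-8859-15"):
    """Pick a Content-Type value from the request's Accept header.

    accept_header may be None when the client sent no Accept header.
    """
    xhtml = "application/xhtml+xml"
    for item in (accept_header or "").split(","):
        media_type, _, params = item.strip().partition(";")
        if media_type.strip().lower() != xhtml:
            continue
        # Honour an explicit q=0, which means "do not send me this type".
        quality = 1.0
        for param in params.split(";"):
            name, _, value = param.strip().partition("=")
            if name.strip().lower() == "q":
                try:
                    quality = float(value)
                except ValueError:
                    quality = 0.0
        if quality > 0:
            return f"{xhtml}; charset={charset}"
    # Anything else, including a bare */*, falls back to text/html.
    return f"text/html; charset={charset}"
```

A client that only sends */* gets text/html here, which matches the behaviour of the substring test in the PHP version.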

    grahamstewart

    8:50 am on Feb 26, 2004 (gmt 0)

    Well, the W3C validator is open source, so if you would like to offer any bug reports or code fixes, I am sure the guy running it, Gerald Oskoboiny, would be glad of your help.

    It would be nice to have an advisory text, something like "this page is valid xhtml but you should really have an xml prolog" with a link to the discussion on the pros and cons of the prolog.

    Interesting to note that both non-compliances are caused by problems with browsers. Even the W3C recognises these practices...

    About the xml prolog...

    Because XHTML is based on XML, it is common to add an XML declaration at the beginning of the markup...

    With Internet Explorer, however, if anything appears before the DOCTYPE declaration the page is rendered in quirks mode...

    Also, some user agents interpret the XML declaration to mean that the document is unrecognized XML rather than HTML, and therefore may not render the document as expected.

    We assume that, because of its tendency to cause Internet Explorer to render in quirks mode, some people prefer not to use the XML declaration for XHTML served as text/html.

    About the content type...

    We recommend the use of XHTML wherever possible; and if you serve XHTML as text/html we assume that you are conforming to the compatibility guidelines in Appendix C of the XHTML 1.0 specification.

    We recognize that XHTML served as XML is still not widely supported, and that therefore many XHTML 1.0 pages will be served as text/html.

    There is a lot of good information about practical xhtml and character encodings in that article [w3.org] so it is well worth a read.

    Personally I tend to use HTML4 Strict to avoid these issues. It is close enough to xhtml that I can convert fairly easily in the future but it lets me get on with developing pages now.

    pageoneresults

    9:27 am on Feb 26, 2004 (gmt 0)

    WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



    Great discussion. I didn't know this one...

    Declare encoding for your CSS style sheets too

    It is a good idea to always declare the encoding of external CSS stylesheets. (It is not necessary for CSS embedded in a document.) This is done by adding a statement to the top of the file such as:

    @charset "utf-8";
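A quick way to see whether a stylesheet carries such a declaration is sketched below in Python. The helper name is hypothetical, and the check is deliberately simple: it only accepts the rule when it is the very first thing in the file and ignores byte-order marks.

```python
import re

# An @charset rule only counts when it is the very first thing in the file.
_CHARSET_RULE = re.compile(rb'^@charset "([A-Za-z0-9._:-]+)";')


def css_declared_charset(css_bytes):
    """Return the encoding named by a leading @charset rule, or None."""
    match = _CHARSET_RULE.match(css_bytes)
    return match.group(1).decode("ascii") if match else None
```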

    grahamstewart

    9:41 am on Feb 26, 2004 (gmt 0)

    Yep, didn't know that one either.

    I doubt it would really matter at the moment, since CSS tends to be basic ASCII, but in the future, when more browsers support CSS generated content, I guess it will be more important.

    pageoneresults

    9:54 am on Feb 26, 2004 (gmt 0)

    After watching this discussion progress, I've been making some changes to a site that I converted to XHTML about two months ago. I was originally using just this...

    <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">

    ...without an XML prolog and without declaring anything in the <html> tag. I was declaring charset and language using metadata.

    I've now converted everything over to...

    <?xml version="1.0" encoding="ISO-8859-1"?>
    <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">

    <html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en">

    Hopefully I've done the right thing. I did not see any adverse effects when using just the XHTML 1.1 DOCTYPE without the prolog and without declaring anything in the <html> tag. Am I on the right track here?

    grahamstewart

    9:59 am on Feb 26, 2004 (gmt 0)

    The prolog will put IE into Quirks mode. Hasn't that broken your layout?
    If your site is currently in the ISO charset then it may as well be in UTF-8 (since every ISO-8859-1 character is available in UTF-8, although any non-ASCII characters would need re-encoding), and if it's in UTF-8 then as far as I can tell you don't need the prolog.
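Whether an ISO-8859-1 page can simply be relabelled as UTF-8 depends on its bytes. A minimal Python sketch (the function name is hypothetical): text limited to ASCII encodes identically in both encodings, while accented characters do not.

```python
def same_bytes_in_both(text):
    """True if text encodes to identical bytes in ISO-8859-1 and UTF-8."""
    return text.encode("iso-8859-1") == text.encode("utf-8")


if __name__ == "__main__":
    print(same_bytes_in_both("plain ASCII"))  # ASCII is shared by both
    print("é".encode("iso-8859-1"))           # one byte
    print("é".encode("utf-8"))                # two bytes
```

So a page containing anything beyond ASCII has to be converted, not just relabelled, before it is declared as UTF-8.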

    pageoneresults

    10:02 am on Feb 26, 2004 (gmt 0)

    The prolog will put IE into Quirks mode. Hasn't that broken your layout?

    I tested a few pages before I converted 300+ pages over. No problems whatsoever with layout and I'm testing in IE, Opera and Moz. I see no difference whatsoever. What exactly would I be looking for? Everything validates just fine too.

    grahamstewart

    11:01 am on Feb 26, 2004 (gmt 0)

    hmm.. interesting. It could be that you are fortunate enough to have a layout that is unaffected.

    Basically Quirks-mode would mean that IE should revert to its old broken box model, so I would expect to see elements appearing as the wrong width and possibly in the wrong place.

    However I guess if you have already coded a layout that looks okay in IE5 (which always uses the broken model) then it won't really matter which mode the browser is in.
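The Quirks-mode trigger discussed here comes down to whether anything precedes the DOCTYPE. A rough Python sketch of that check follows; the helper name is hypothetical, and treating leading whitespace as harmless is an assumption that the thread does not confirm.

```python
def starts_with_doctype(markup):
    """True if nothing but whitespace precedes the DOCTYPE declaration."""
    return markup.lstrip().lower().startswith("<!doctype")


DOCTYPE = (
    '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" '
    '"http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">'
)
WITH_PROLOG = '<?xml version="1.0" encoding="ISO-8859-1"?>\n' + DOCTYPE

if __name__ == "__main__":
    print(starts_with_doctype(DOCTYPE))      # DOCTYPE comes first
    print(starts_with_doctype(WITH_PROLOG))  # the XML prolog comes first
```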

    pageoneresults

    11:04 am on Feb 26, 2004 (gmt 0)

    Would it be suggested to remove the prolog and continue to use the metadata for declaring charset?

    encyclo

    12:01 pm on Feb 26, 2004 (gmt 0)

    Re: XHTML 1.1 Specification page - my case is crumbling here, so I'll concede on this one - for the moment! Valid XHTML 1.1.

    Re: WASP. The problem with their site is that they do not define the language of the page. While this obviously means that the page is not WCAG Priority 3 compliant (and makes for poor usability), until I can prove otherwise, I say the page is tentatively valid. I suspect that it's a simple oversight in their case.

    For ALA, however, I won't concede. I reckon ALA is invalid, and this is the point I want to make about the validator: it can't check everything - not because it's buggy, but because it only looks at document structure compared with the DTD. ALA is structurally valid, but the XHTML 1.0 spec is more than just that. The validator doesn't check for MIME type, and it doesn't check for character encoding - but that doesn't mean that these two issues don't matter.

    Would it be suggested to remove the prolog and continue to use the metadata for declaring charset?

    If the prolog isn't causing trouble, you can keep it - however, the best way of defining the charset is with an HTTP header - so the browser knows what charset to use before it starts to render the document. If you do that, you don't need either the XML prolog or the meta tag in the document itself.
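Since the validator's structural check does not tell you where a page declares its charset, a small script can report it. The Python sketch below is illustrative only: declared_charsets is a hypothetical helper, and its regular expressions are rough approximations, not a full HTTP or XML parser.

```python
import re


def declared_charsets(content_type_header, document_start):
    """Report where a page declares its character encoding.

    content_type_header: value of the HTTP Content-Type header, or None.
    document_start: the first few hundred characters of the markup.
    """
    found = {}
    header = re.search(r'charset="?([\w.:-]+)', content_type_header or "", re.I)
    if header:
        found["http"] = header.group(1).lower()
    prolog = re.match(r'\s*<\?xml[^>]*encoding=["\']([\w.:-]+)["\']', document_start)
    if prolog:
        found["xml"] = prolog.group(1).lower()
    meta = re.search(r"<meta[^>]*charset=([\w.:-]+)", document_start, re.I)
    if meta:
        found["meta"] = meta.group(1).lower()
    return found
```

An empty result means none of the three places names a charset; more than one entry is worth comparing, since conflicting declarations are a problem of their own.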

    Hagstrom

    1:38 pm on Feb 26, 2004 (gmt 0)

    W3C Recommendation for XHTML 1.1. Uses (surprise surprise!) XHTML 1.1.

    Well, that is surprising - considering that the HTML 4.01 specs [w3.org] are written with a Transitional DTD and include align="center" and stuff ;)

    DrDoc

    5:12 pm on Feb 26, 2004 (gmt 0)

    WASP

    I just remembered the third issue (which I in fact noticed before... just didn't remember it when I posted about it initially)...

    They are using lang="iso-8859-1" as an attribute. iso-8859-1 is not a valid language code. It should be something like "en" or "sv" or so...
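A shape check like the Python sketch below would catch a charset name used as a lang value. It is a simplification with a hypothetical helper name: it only tests the form of the value (real language tags allow more subtags) and does not look anything up in a language registry, so an unregistered three-letter value would still pass.

```python
import re

# Simplified shape: a 2-3 letter language, optionally followed by a
# 2-letter or 3-digit region. Real language tags allow more than this.
_SIMPLE_LANG = re.compile(r"^[A-Za-z]{2,3}(-([A-Za-z]{2}|[0-9]{3}))?$")


def looks_like_language_code(value):
    """True if value has the shape of a simple language code such as "en"."""
    return bool(_SIMPLE_LANG.match(value))
```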

    g1smd

    5:53 pm on Mar 1, 2004 (gmt 0)

    >> the validator doesn't check for MIME type, and it doesn't check for character encoding <<

    It could easily report the MIME type. A lot of other spiders already do this. Might want to suggest it as a feature.

    The validator does check for character encoding. It complains if the charset isn't declared, and fails to parse the page at all if there are invalid characters in it.