
HTML Forum

    
Polyglot Markup: HTML-Compatible XHTML Documents
swa66




msg:4461448
 12:39 am on Jun 5, 2012 (gmt 0)

When I first read the HTML 5 specs I was very disappointed about the rampant promotion of tag soup in there. It felt almost criminal to go back to that.
Yet XHTML 5 doesn't seem to be a solution either, thanks to a combination of:
- browsers like IE8 and older, which don't do the right thing
- the "thou shalt not serve XHTML as text/html" mantra
It would force one to keep 2 versions of the same content - something I'm not even considering.

Back when I started with XHTML 1.1, I did it in such a way that existing browsers would deal with it properly, even though it meant I had to serve it as text/html due to the browsers' lack of foresight. Over the years I've continued this, but I'm slowly discovering an alternative that could allow me to move to XHTML 5 without having to tell a lot of my visitors that their browser is too far below par, and without having to keep 2 versions.

It seems you can make xhtml 5 that is also valid HTML 5. It even has a name: "Polyglot (X)HTML (5)".

There's a w3c draft on it: [w3.org ]

You essentially limit yourself in what you generate in your "(x)html" to make it so that it is both xhtml5 and HTML5 at the same time (they also focus a lot on having the same DOM, but that's only interesting if you mess with javascript - not something I'm all that interested in myself).

You need to combine this with something that detects if the browser is ready to accept application/xhtml+xml content ... luckily the browser tells you via the accept header it sends in the request.

This can be done in your Apache configuration with rewrite rules, e.g. (shamelessly stolen from the web):

RewriteCond %{HTTP_ACCEPT} application/xhtml\+xml
RewriteCond %{HTTP_ACCEPT} !application/xhtml\+xml\s*;\s*q=0\.?0*(\s|,|$)
RewriteCond %{REQUEST_URI} \.html$
RewriteRule .* - [T=application/xhtml+xml;charset=utf-8]
RewriteCond %{HTTP_ACCEPT} application/xhtml\+xml
RewriteCond %{HTTP_ACCEPT} !application/xhtml\+xml\s*;\s*q=0\.?0*(\s|,|$)
RewriteCond %{REQUEST_URI} !\.
RewriteRule .* - [T=application/xhtml+xml;charset=utf-8]


Or in any server side programming/scripting language by setting the header
Content-type: application/xhtml+xml; charset=utf-8
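
As a rough illustration of the "any server side language" route, here is a minimal Python sketch of the same decision the rewrite rules make: serve application/xhtml+xml only when the Accept header names it and does not refuse it with q=0. The function names are made up for this example, and it deliberately does not treat a bare */* as acceptance, on the assumption that older browsers send */* without being able to handle XHTML.

```python
def accepts_xhtml(accept_header: str) -> bool:
    """Return True if the Accept header lists application/xhtml+xml with q > 0."""
    for item in accept_header.split(","):
        parts = [p.strip() for p in item.split(";")]
        if parts[0].lower() != "application/xhtml+xml":
            continue
        q = 1.0  # a media range without a q parameter defaults to q=1
        for param in parts[1:]:
            name, _, value = param.partition("=")
            if name.strip().lower() == "q":
                try:
                    q = float(value.strip())
                except ValueError:
                    q = 0.0  # treat a malformed q-value as a refusal
        return q > 0
    return False


def content_type_for(accept_header: str) -> str:
    """Pick the Content-Type header value for a polyglot document."""
    if accepts_xhtml(accept_header):
        return "application/xhtml+xml; charset=utf-8"
    return "text/html; charset=utf-8"
```

The same document body is sent in both cases; only the Content-Type header changes.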


Essentially: take care to follow some simple rules and you get all the benefits of proper XML instead of a big mess. You get strict checking by delivering your XML to browsers that apply draconian error handling (they refuse to render code that is not well-formed), and you get backward compatibility with older browsers by delivering it to them as if it were HTML 5, all without having to keep 2 copies.

My questions:
- has anybody out here played with it already?
- you need to be careful with HTML entities, as it seems only a very few are still allowed (per the w3 draft) - e.g. the output of PHP's htmlentities() would need to change from what I see it do.
- any experience in integrating it with 3rd party stuff like adsense, google maps, ... (scripts face some serious limits)
- other pitfalls?

Please: I'm not trying to start an XHTML vs HTML discussion - been there, done that, nobody got a T-shirt. I'm just interested in experiences with doing this and in how far we can push it - for those convinced to do it regardless.

 

DrDoc




msg:4462619
 7:56 pm on Jun 7, 2012 (gmt 0)

I approve. I mean, this is the most correct way of doing things ...

swa66




msg:4472197
 4:22 pm on Jul 3, 2012 (gmt 0)

Started playing with this on a test site - just static (x)html so far.

The apache config can be tweaked to do indexes better than the above:

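RewriteEngine On
# Client lists application/xhtml+xml ...
RewriteCond %{HTTP_ACCEPT} application/xhtml\+xml
# ... and does not refuse it with q=0 (q=0.5 and the like still qualify)
RewriteCond %{HTTP_ACCEPT} !application/xhtml\+xml\s*;\s*q=0\.?0*(\s|,|$)
RewriteCond %{THE_REQUEST} HTTP/1\.1
RewriteRule \.html$ - [T=application/xhtml+xml;charset=utf-8]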

It also handles indexes served from an index.html that way, while keeping its paws off an index.php.

It's nice to have the browser tell you that your code is not well formed anymore! [I dropped a closing quote on an attribute during an edit and it was found instantly]

Also the validator at [html5.validator.nu...] feels solid for this type of content [it fetches it in xml mode], in conjunction with the one of w3.org at [validator.w3.org...] [which parses as if it were mere html5].
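
The well-formedness half of that check can also be run locally before uploading. The following is only a sketch using the XML parser from Python's standard library; the helper name and the sample page are invented for the example, and it checks well-formedness only, not HTML5 validity.

```python
import xml.etree.ElementTree as ET


def well_formedness_error(markup: str):
    """Return None if the markup is well-formed XML, else the parser's message."""
    try:
        ET.fromstring(markup)
    except ET.ParseError as exc:
        return str(exc)
    return None


GOOD = (
    '<!DOCTYPE html>\n'
    '<html xmlns="http://www.w3.org/1999/xhtml">\n'
    '<head><title>Test</title></head>\n'
    '<body><p class="intro">Hello</p></body>\n'
    '</html>'
)
# The same page with the closing quote of the class attribute dropped.
BAD = GOOD.replace('class="intro"', 'class="intro')

print(well_formedness_error(GOOD))
print(well_formedness_error(BAD))
```

The first call reports no error; the second returns the parser's complaint, including the line and column where parsing failed.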

mattur




msg:4472287
 8:49 pm on Jul 3, 2012 (gmt 0)

You essentially limit yourself in what you generate in your "(x)html" to make it so that it is both xhtml5 and HTML5 at the same time (they also focus a lot on having the same DOM, but that's only interesting if you mess with javascript - not something I'm all that interested in myself).


Polyglot markup is defined as the subset of (X)HTML5 which produces the same DOM when parsed as HTML or XML (ignoring namespace differences). So if parsing a document as XML gives you a different DOM than parsing it as HTML, the document is by definition not Polyglot markup.

For example, the code below is well-formed XML and valid HTML5:


<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml">
<head><title></title></head>
<body>
<table>
<tr><td></td></tr>
</table>
</body>
</html>


But it is *not* Polyglot markup because HTML parsers insert an implicit tbody element into the DOM, and XML parsers don't. In this case the Polyglot markup would be:


<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml">
<head><title></title></head>
<body>
<table>
<tbody>
<tr><td></td></tr>
</tbody>
</table>
</body>
</html>
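
This particular case is easy to test for mechanically. The sketch below parses the document as XML with Python's standard library and counts tables that have a tr as a direct child, which is where an HTML parser would insert the implicit tbody. The function name is hypothetical and this covers only the tbody case, not the other Polyglot rules.

```python
import xml.etree.ElementTree as ET

XHTML = "{http://www.w3.org/1999/xhtml}"


def tables_missing_tbody(markup: str) -> int:
    """Count <table> elements that have a <tr> as a direct child."""
    root = ET.fromstring(markup)
    count = 0
    for table in root.iter(XHTML + "table"):
        # find() with a plain tag name looks at direct children only
        if table.find(XHTML + "tr") is not None:
            count += 1
    return count


page = (
    '<!DOCTYPE html>\n'
    '<html xmlns="http://www.w3.org/1999/xhtml">\n'
    '<head><title></title></head>\n'
    '<body><table><tr><td></td></tr></table></body>\n'
    '</html>'
)
print(tables_missing_tbody(page))
```

A result above zero means the XML DOM and the HTML DOM of that page will differ.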


(I would of course strongly advise people against using Polyglot markup :) )

swa66




msg:4472297
 9:11 pm on Jul 3, 2012 (gmt 0)

It's not just the javascript DOM they try to keep the same. Depending on the mode, (some) browsers actually insert fictional tbody elements in your first example - some even show the fictional elements if you view the source code. Aside from problems in javascript, such things could wreak havoc in CSS just as well, e.g. with direct child selectors.

IMHO the entire concept of adding optional elements back in is just too crazy for words. But I guess it shows why I dislike HTML5 so badly.

mattur




msg:4472311
 10:01 pm on Jul 3, 2012 (gmt 0)

All conformant HTML parsers create an implicit tbody element in the DOM when parsing the first example. This tbody element is then available to JS and CSS because they work on the DOM, not the source code.

The whole point of Polyglot markup is to produce the same DOM when parsed as HTML or as XML, to avoid the problem of JS and CSS working differently depending on how the document is parsed. This is why the second example has an explicit tbody element, and why I chose this case as an example :)

swa66




msg:4473788
 12:01 pm on Jul 9, 2012 (gmt 0)

I've actually used this by now on a very small site (3 pages). It has gone smoothly so far, and for the first time ever I've even gotten away with not having to fight with IE at all.

I guess dropping IE6 as "supported" helped a lot. IE7 and IE8 didn't handle some of the CSS3 or some of the SVG, but the fallback was already present and worked for them too, which was nice.

Anyway I was expecting a huge fight with IE and got none. Nice for a change.

Next I'll do something a bit more difficult (something with advertising on it), wondering what the scripts of adsense, Amazon etc. will start to do.

swa66




msg:4479277
 11:31 pm on Jul 26, 2012 (gmt 0)

I've been writing some PHP scripts that output polyglot XHTML5.

Some interesting bits others might find useful:
NOTE: I'm not the best PHP programmer ever - feel free to correct/add

Output the content type if the browser accepts it - obviously do this before outputting anything:

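// Must run before any output is sent. The Accept header may be absent.
$accept = isset($_SERVER['HTTP_ACCEPT']) ? $_SERVER['HTTP_ACCEPT'] : '';
if (stripos($accept, 'application/xhtml+xml') !== false) {
    header('Content-Type: application/xhtml+xml; charset=UTF-8');
}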


A function to use instead of htmlentities(), so that only the 5 allowed named entities are produced:

function myxmlencode($str) {
    // '&' must come first so the ampersands of the other entities are not re-encoded
    $in  = array('&', '<', '>', '"', "'");
    $out = array('&amp;', '&lt;', '&gt;', '&quot;', '&apos;');
    return str_replace($in, $out, $str);
}

swa66




msg:4539574
 10:49 pm on Jan 26, 2013 (gmt 0)

I've been using polyglot html5 for a while now. I like it a lot.

I've also run into an obstacle when dealing with adsense - I feared that one.

It turns out adsense scripts use the document.write() method - which browsers do not support when they are dealing with well-structured content such as HTML served as XML...
Ads simply do not show.

I've started a separate thread over in the adsense forum: [webmasterworld.com...] for those interested - it includes a workaround that might or might not be all that "legal" - hard to judge at this point.

WebmasterWorld is a Developer Shed Community owned by Jim Boykin.
© Webmaster World 1996-2014 all rights reserved