I was concerned late last week when I read that the XHTML 2 Working Group will not be allowed to finish its work. The announcement says this is being done to speed up work on HTML5 – but it also raises important questions about how XML will work with data on the Web.
2009-07-02: Today the Director announces that when the XHTML 2 Working Group charter expires as scheduled at the end of 2009, the charter will not be renewed. By doing so, and by increasing resources in the HTML Working Group, W3C hopes to accelerate the progress of HTML 5 and clarify W3C’s position regarding the future of HTML. A FAQ answers questions about the future of deliverables of the XHTML 2 Working Group, and the status of various discussions related to HTML. Learn more about the HTML Activity.
Originally, HTML was based on SGML. XML was designed to make SGML easier to parse, so that it would be more accessible to software. From the beginning, XML was designed to support documents meant for humans to read, and data designed for programs. In modern software, there is no firm boundary between the two – data is often transformed on the fly for humans to read, and software often mixes human-readable content with data.
By representing HTML as an XML vocabulary, XHTML addresses HTML parsing issues, difficulty combining HTML with XML data, and extensibility issues:
- An XML parser cannot parse arbitrary HTML, and parsers that can are extremely complex. Tools like TagSoup parse arbitrary HTML and convert it into XHTML, so that even legacy content in odd dialects of HTML can be processed by XML tools.
- XML vocabularies like SVG (a graphics format) and MathML (a mathematical notation format) need to play nicely with HTML. XHTML lets them be combined with HTML using the same mechanisms all XML tools use to combine vocabularies.
- HTML has become quite large, and XHTML provides a modularization of HTML, using XML mechanisms, that makes it easier to maintain and extend.
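The first two points can be illustrated with a small sketch using Python's standard-library XML parser. The sample markup and the helper names (`is_well_formed`, `namespaces_used`) are made up for illustration; only the XHTML and SVG namespace URIs are the real ones.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"
SVG_NS = "http://www.w3.org/2000/svg"

# Typical legacy HTML: <br> is never closed and <b> is still open at </body>.
TAG_SOUP = "<html><body><p>Unclosed paragraph<br><b>bold</body></html>"

# The same kind of page as XHTML, with an SVG fragment mixed in via namespaces.
XHTML_WITH_SVG = """<html xmlns="http://www.w3.org/1999/xhtml">
  <body>
    <p>A circle:</p>
    <svg xmlns="http://www.w3.org/2000/svg" width="20" height="20">
      <circle cx="10" cy="10" r="8"/>
    </svg>
  </body>
</html>"""


def is_well_formed(markup):
    """Return True if a plain XML parser accepts the markup."""
    try:
        ET.fromstring(markup)
    except ET.ParseError:
        return False
    return True


def namespaces_used(markup):
    """Collect the namespace URI of every element in the document."""
    found = set()
    for element in ET.fromstring(markup).iter():
        # ElementTree spells qualified names as "{namespace-uri}local-name".
        if element.tag.startswith("{"):
            found.add(element.tag[1:].split("}", 1)[0])
    return found


if __name__ == "__main__":
    print(is_well_formed(TAG_SOUP))
    print(is_well_formed(XHTML_WITH_SVG))
    print(sorted(namespaces_used(XHTML_WITH_SVG)))
```

The XML parser rejects the legacy markup outright, while the XHTML document parses with no HTML-specific code, and the SVG elements remain distinguishable from the XHTML ones purely by namespace.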
Typically, Working Groups are rechartered if there is active interest in their work and they are close to their goal, as the XHTML Working Group is. Another year of effort would probably be sufficient to finish the XHTML specifications. The XHTML people are also intimately familiar with XML and namespace issues, and with the issues that arise when XML and HTML software meet.
The FAQ that the announcement points to concerns me. Some of the work of the XHTML Working Group will not be finished, and there is even a suggestion that HTML5 will not handle namespaces correctly, and that XML vocabularies will not be able to mix freely with HTML5!
Consider this issue, to which the FAQ points:
The HTML5 specification does not have a mechanism to allow decentralized parties to create their own languages, typically XML languages, and exchange them in HTML5 text/html serializations. This would allow languages such as SVG, MathML, FBML and a host of others to be included. At one point, an editors version of the HTML5 specification contained a subset and reformulation of SVG and MathML. Tim Berners-Lee described this incorporation of SVG and MathML without namespaces as horrific and the issue raiser [Dave Orchard] completely concurs with the him.
This issue limits the ability of non-HTML5 working groups to define languages as the languages must be “brought into” the HTML5 language. This dramatically increases the scope of HTML5 and decreases the ability to modularize development of orthogonal languages.
In the end, the problem could result in the text/html serialization rules becoming the standard serialization rules for XML languages, replacing XML itself. This could occur if every decentralized language has a choice between the XML serialization, the text/html serialization or both. In many cases, the language may choose the text/html serialization.
I find this last paragraph alarming. XML was designed specifically to allow decentralized parties to create their own languages, and to combine them cleanly using namespaces. By using the mechanisms defined in XML, XHTML inherits this ability, an ability that HTML5 lacks. One proposed solution is to absorb XML entirely into the HTML5 specification. I agree with Tim and Dave that incorporating SVG and MathML without namespaces is horrific, but the obvious solution is for HTML5 to support namespaces and XML cleanly – just as XHTML does.
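To make the namespace point concrete, here is a minimal sketch in Python. The recipe vocabulary and its `http://example.org/ns/recipe` URI are hypothetical, standing in for any language defined by an independent party; `titles_by_namespace` is likewise just an illustrative helper.

```python
import xml.etree.ElementTree as ET

XHTML = "http://www.w3.org/1999/xhtml"
RECIPE = "http://example.org/ns/recipe"  # hypothetical third-party vocabulary

# Both vocabularies define an element called "title". Nobody had to
# coordinate with the HTML authors to avoid the clash.
DOC = f"""<html xmlns="{XHTML}" xmlns:r="{RECIPE}">
  <head><title>Dinner</title></head>
  <body>
    <r:recipe>
      <r:title>Lentil soup</r:title>
    </r:recipe>
  </body>
</html>"""


def titles_by_namespace(markup):
    """Map each namespace URI to the text of its <title> element."""
    result = {}
    for element in ET.fromstring(markup).iter():
        if not element.tag.startswith("{"):
            continue  # element is in no namespace
        namespace, _, local_name = element.tag[1:].partition("}")
        if local_name == "title":
            result[namespace] = element.text
    return result


if __name__ == "__main__":
    for namespace, text in titles_by_namespace(DOC).items():
        print(namespace, "->", text)
```

The two `title` elements never collide, because each is identified by its namespace URI plus its local name. This is the decentralized extensibility that the quoted issue says the text/html serialization lacks.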
I would like to see the XHTML Working Group finish its work, and I would like to see the result used as the XML serialization of HTML5. Legacy HTML ensures that there will always be large amounts of HTML content that is difficult to parse without tools like TagSoup. But XHTML gives us a very sane XML vocabulary to convert such content into, and makes HTML accessible to the wide variety of XML tools that make up much of the infrastructure of the Web.
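As a rough illustration of what such a conversion involves, the toy sketch below uses Python's lenient `html.parser` to read sloppy HTML and emit well-formed XHTML. It is not how the real conversion tools work and handles only unclosed and void elements; the `ToyTidy` class, the `tidy` function, and the short `VOID` list are all assumptions made for this example.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

XHTML_NS = "http://www.w3.org/1999/xhtml"
VOID = {"br", "hr", "img", "input", "link", "meta"}  # partial list, for the demo


def qualified(tag):
    """Put an HTML tag name into the XHTML namespace."""
    return "{%s}%s" % (XHTML_NS, tag)


class ToyTidy(HTMLParser):
    """Build an XML tree from lenient HTML, closing elements as needed."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.builder = ET.TreeBuilder()
        self.open_tags = []  # names of elements that are still open

    def handle_starttag(self, tag, attrs):
        # Valueless attributes (e.g. "disabled") get their own name as value.
        attributes = {name: (name if value is None else value)
                      for name, value in attrs}
        self.builder.start(qualified(tag), attributes)
        if tag in VOID:
            self.builder.end(qualified(tag))  # void elements never have content
        else:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag not in self.open_tags:
            return  # stray end tag: ignore it
        # Close everything opened since the matching start tag.
        while self.open_tags:
            current = self.open_tags.pop()
            self.builder.end(qualified(current))
            if current == tag:
                break

    def handle_data(self, data):
        self.builder.data(data)

    def result(self):
        self.close()  # flush any text still buffered by HTMLParser
        while self.open_tags:
            self.builder.end(qualified(self.open_tags.pop()))
        return self.builder.close()


def tidy(html_text):
    """Return a well-formed XHTML serialization of lenient HTML input."""
    parser = ToyTidy()
    parser.feed(html_text)
    return ET.tostring(parser.result(), encoding="unicode")


if __name__ == "__main__":
    print(tidy("<html><body><p class=x>One<br><b>bold</body></html>"))
```

Once the content is in this form, any XML tool can read it. A real converter also has to deal with misnested inline elements, implied `html`, `head`, and `body` elements, character encodings, and much more, which is why such tools are far larger than this sketch.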