Subject: xml-dev-Jan-2000 Call for unifying and clarifying XML 1.0, DOM

From: Nikola Stojanovic <nstojano@cjds.com>
To: ebXML-Architecture@lists.oasis-open.org
Date: Tue, 25 Jan 2000 10:54:51 -0500

Here is a recent article from the xml-dev list. Please let me know if you find sharing this kind of info in this manner inappropriate.

Nikola Stojanovic
Lead Architect
Columbine JDS Systems, Inc. (www.cjds.com)
Office: 607-2732224
Cell: 303-8875804

Call for unifying and clarifying XML 1.0, DOM, XPATH, and XML Infoset

Nils Klarlund (klarlund@research.att.com)
24 Jan 2000 10:49:07 -0500

XML should be about a universal and simple model of trees based on the linear syntax of XML 1.0, right? Well, it's not. I hope to generate a discussion of how the current multitude of models can be unified. This message is long, reflecting the enormity of the confusion that's being sown. And, I want to convince everybody who's interested that it'll be a conspicuous failure not to unify terminology and models; conversely, I believe that at a little price, involving a small amount of back-pedaling, XML could get an attractive and universal model. Time is running out, however.

Consider the following five W3C contributions to the question of what tree an XML document represents:

- LOST CHILDREN (DOM2):

    "Attr nodes are not actually child nodes of the element they describe, the DOM does not consider them part of the document tree"

  So, attribute nodes are attached to their element node, but the element node is not a parent. The document tree doesn't represent the document (but presumably, the "document hierarchy" does).

- LOSING YOUR CHILDHOOD (XPATH):

    "The element is the parent of each of these attribute nodes, but an attribute node is not a child of its parent element"

  So, I'm not a child of my parent (if I am an attribute node). Please, don't say you meant that!

- THE TONGUE TWISTER (INFOSET):

    Tree = XML Information Set
    Node = XML Information Item

  This abstract model is the promising one. Modulo the unhelpful terminology, it's exactly the simple tree model that's so needed. Its nodes are not called nodes and its children are not called children, because apparently the authors believe that would make the confusion even more explicit.

- THE NARROW VIEW (XML 1.0):

    "for each non-root element C in the document, there is one other element P in the document such that C is in the content of P, but is not in the content of any other element that is in the content of P. P is referred to as the parent of C, and C as a child of P."
  So, according to XML 1.0, only children that are elements are in fact children.

- THE PC VIEW (XML Schema):

    "The shipDate element daughter of PurchaseOrderType is..."

  The term daughter is not defined in the Schema draft.

This is a big mess. I'll outline a modest simplification that affects several of the (draft) recommendations with the result that there is one model and one terminology. But before that, I'll give more examples to show some adverse effects of the lack of consistency among the tree models. At the end, I'll show how some of the examples appear after the simplification. My simplification is certainly not the only way to go about these fundamental problems, but I hope that it'll show that they are solvable.

1. TEXT AND MARKUP CONFUSION

In XPATH and DOM, text denotes a maximum continuous sequence of characters (with no tags), but in XML 1.0 a very different explanation is provided:

    "Text consists of intermingled character data and markup",

where markup is defined as

    "Markup takes the form of start-tags, end-tags, empty-element tags, entity references, character references, comments, CDATA section delimiters, document type declarations, and processing instructions."

INFOSET does not take a position, but introduces a finer-grained model, where individual characters are nodes in the tree representation.

2. XPATH/XSLT NODE CONFUSION

- Apparently, node means node, but not quite in XSLT:

    "node() matches any node other than an attribute node and the root node"   (2)

- and the contrary opinion is offered in XPATH:

    "A node test node() is true for any node of any type whatsoever"

In fact, there is not a technical inconsistency. There is an intricate explanation: when node() is used as a pattern, it is assumed that the pattern applies to children (the ones that are not attributes), since "child" is the default axis. So to include attribute nodes one has to write "@* | node()". The "@" overrides the default axis, but node() doesn't. This is pretty wild.
I wrote a long XSLT program in October, and in January I can't understand even the patterns I used without first spending 20 minutes re-reading XPATH and XSLT.

3. MORE MARKUP AND TEXT CONFUSION

In DOM, we read

    "If there is no markup inside an element's content, the text is contained in a single object implementing the Text interface that is the only child of the element."   (3)

But, the sentence that follows says:

    "If there is markup, it is parsed into the information items (elements, comments, etc.) and Text nodes that form the list of children of the element."   (3')

So, in this sentence, markup now means markup in the XML 1.0 sense + character data. Also, since information items, in fact, include the character data, the sentence says that both the fine-grained character information items and some corresponding Text nodes somehow are included in the list of children.

4. ROOT OR DOCUMENT CONFUSION

Let's look at XML 1.0 again:

    "There is exactly one element, called the root, or document element, no part of which appears in the content of any other element."

Now, the root element node is a child of the root node according to (DOM, XPATH)! This consequence would be formulated in INFOSET speak as the prose:

    A reference to the document element information item is contained in the children list of the document information item.   (4)

(John, if you read this, please correct me if I'm wrong.)

5. THE TRANSLATION OF INFOSET INTO CONVENTIONAL TERMINOLOGY

In XPATH, a whole section is dedicated to describing a natural data model, even though it substantially replicates INFOSET. Since the authors of XPATH wisely use familiar concepts, they've been obliged to include tautologies:

    "An element node comes from an element information item. The children of an element node come from the children and children - comments properties. The attributes of an element node come from the attributes property."

And who does that help?
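
The DOM behaviour quoted under LOST CHILDREN and in item 4 can be checked directly. What follows is a minimal sketch, assuming Python's standard xml.dom.minidom as the DOM implementation; that choice is an assumption of this illustration (the message itself names no implementation), and the sample document and variable names are hypothetical.

```python
from xml.dom.minidom import parseString

# Hypothetical sample document: one attribute, one element child, one text child.
doc = parseString('<order id="42"><item/>text</order>')
order = doc.documentElement
attr = order.getAttributeNode("id")

# The attribute is attached to the element but is not among its children.
child_names = [node.nodeName for node in order.childNodes]
print(child_names)                 # ['item', '#text']

# The attribute has no parent; the element is reachable only via ownerElement.
print(attr.parentNode)             # None
print(attr.ownerElement is order)  # True

# Item 4: the document element is itself a child of the Document (root) node.
print(order.parentNode is doc)     # True
```

So in this implementation the element "owns" the attribute without being its parent, which is the asymmetry described above.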
THE PROPOSAL

There are three main kinds of nodes: root nodes, property nodes, and content nodes. They form a hierarchy of node concepts as follows:

    root node
    property nodes:
        attribute node
        notation node
        namespace declaration node
    content nodes:
        cdata nodes:
            CDSect node (for CDATA sections)
            text node
        markup nodes:
            element node
            comment node
            entity node
            PI node

This terminology seems to be rather consistent with XML 1.0 except that we use "text" in the sense found in DOM and XPATH and that "child" is not just applied to elements, but all nodes that are immediate descendants. By an official resolution, this difference should be made clear.

A root node is the document information item of INFOSET or the Document interface of DOM or the root node of XPATH. The root node has exactly one element child, which is called the document node, since it corresponds to the document element of XML 1.0. By resolution, the term "root element" in XML 1.0 is banished.

Now, define the *text view* of the XML tree as the tree obtained by grouping together maximum consecutive sequences of text and CDSect nodes into one text node.

That's all. (I am omitting document declarations from this discussion; they are less important, although they need a model, too.)

WHAT ARE THE REPERCUSSIONS?

INFOSET will become XML-TREE, and it will be the enjoyable gold standard that defines the XPATH data model and for which DOM is the API, all without notational and conceptual confusion. For example, (4) becomes

    "The document node is a child of the root node."

The XPATH data model *is* the text view of the XML tree. But now XPATH and XSLT can make use of additional predicates:

    content() is the pattern that matches any content node

That solves (2). In particular, an erratum could be issued that would get rid of the node() pattern puzzle. (Even without one, future good practice would dictate that content() be used in most situations where node() is now used.)
The erratum would further specify that the "child" axis will now be called the "content" axis.

For DOM, there will be some changes that I hope people would find entirely innocent: for example, the introduction of the DOM structure model in section 1.1.1:

    "The DOM presents documents as a hierarchy of Node objects that also implement other, more specialized interfaces. Some types of nodes may have child nodes of various types, and others are leaf nodes that cannot have anything below them in the document structure. The node types, and which node types they may have as children, are as follows: "

could be recast:

    "The DOM presents documents as Node objects organized according to the XML Tree model. Some nodes also implement other, more specialized interfaces. An element node may have child nodes of various types that represent content, attributes, and namespace declarations. The node types, and which kinds of node types their content children may have, are as follows:"

So this is not a revolution! There would be very minor changes to the IDL specification as well:

    interface Node {
      // NodeType
      ...
      readonly attribute Node      parentNode;
      readonly attribute NodeList  childNodes;
      readonly attribute Node      firstChild;
      readonly attribute Node      lastChild;
      ...}

becomes

    interface Node {
      // NodeType
      ...
      readonly attribute Node      parentNode;
      readonly attribute NodeList  contentNodes;
      readonly attribute Node      firstContentChild;
      readonly attribute Node      lastContentChild;
      ...},

and the ownerElement can now be removed from

    interface Attr : Node {
      readonly attribute DOMString  name;
      readonly attribute boolean    specified;
               attribute DOMString  value;
                                    // raises(DOMException) on setting
      // Introduced in DOM Level 2:
      readonly attribute Element    ownerElement;
    };

since an attribute node has a parent, which the ownerElement was supposed to denote.

And (3) and (3') would simply become

    "The contentNodes list contains the content nodes of the element."
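
The *text view* defined in THE PROPOSAL can be sketched on top of an existing DOM. This is an illustration only, again assuming Python's xml.dom.minidom; text_view is a hypothetical helper and not part of any DOM specification, and for brevity it returns merged strings instead of building new Text nodes.

```python
from xml.dom import Node
from xml.dom.minidom import parseString

# Node types that the proposal groups under "cdata nodes".
TEXTUAL = (Node.TEXT_NODE, Node.CDATA_SECTION_NODE)

def text_view(element):
    """Return the children of `element`, with every maximal run of adjacent
    text and CDATA-section nodes merged into a single string."""
    view, run = [], []
    for child in element.childNodes:
        if child.nodeType in TEXTUAL:
            run.append(child.data)          # extend the current run
        else:
            if run:                          # a markup node ends the run
                view.append("".join(run))
                run = []
            view.append(child)
    if run:                                  # flush a trailing run
        view.append("".join(run))
    return view

# Hypothetical sample: text, a CDATA section, more text, an element, text.
doc = parseString("<a>one<![CDATA[ <two> ]]>three<b/>four</a>")
view = text_view(doc.documentElement)
summary = [v if isinstance(v, str) else "<%s>" % v.nodeName for v in view]
print(summary)   # ['one <two> three', '<b>', 'four']
```

The three separate character-data nodes before the b element collapse into one entry, which is the grouping the text view asks for.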

For XML Schema, there would be significant simplifications in terminology.

Simpletons, are you still there?

/Nils

xml-dev: A list for W3C XML Developers. To post, mailto:xml-dev@ic.ac.uk
Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/ or CD-ROM/ISBN 981-02-3594-1
Unsubscribe by posting to majordom@ic.ac.uk the message
unsubscribe xml-dev (or) unsubscribe xml-dev
unsubscribe xml-dev your-subscribed-email@your-subscribed-address
Please note: New list subscriptions now closed in preparation for transfer to OASIS.



Follow-Ups:
- Re: xml-dev-Jan-2000 Call for unifying and clarifying XML 1.0, DOM
  - From: "William J. Kammerer" <wkammerer@foresightcorp.com>