ebxml-transport message

Subject: Versioning

From: George Smith <GeorgeS@Highwire.com>
To: ebxml-transport@lists.ebxml.org
Date: Tue, 10 Oct 2000 16:02:38 -0700


Hello all,

I know I am new to the list, but I have been "forced" to get involved due to
the OTA's decision to accept whatever comes out of ebXML.

I have been pushing for the OTA to accept "parsing friendly" versioning.  I
believe they have.  However, some keep referring to the "way that ebXML does
it".  So...

Please find attached my analysis of different versioning options and their
impact on different parsing techniques.  Also note that, IMO, the use of
namespaces for messaging is a BAD idea!  It does not bring sufficient value
for the pain it inflicts (bad ROI).


Thanks,
George Smith
Highwire
206-812-4614 x 228

Validation.gif


                              XML Parsing Techniques/Tools
                                          by
                                     George Smith
                                      28 Sept 00


The primary XML Parsing Techniques/Tools:

    Custom 100% Application based,
    SAX, or XML event driven,
    DOM based with referenced schema validation, &
    Schema Compiler.


Details for each of the above:


    Custom          - A program or class custom developed to "deal"
                      with a particular XML document.

                      Pros: Can be VERY fast.

                            Supports direct conversion to "internal"
                            desired representation.

                      Cons: Development is incredibly labor intensive.
                            Therefore, the need to support many messages
                            usually "dissuades" this approach.

                            All validation is application's responsibility.

                            Any default or fixed values, or enumerations
                            specified in the schema must be duplicated in
                            the code.


    SAX             - A parsing system that parses "tags", and generates
                      events on XML "parts".

                      Pros: Very Fast.

                            Supports direct conversion to "internal"
                            desired representation.

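                      Cons: Only checks tag structure (well-formedness),
                            and reports a problem only when the parser
                            reaches it, so events for the earlier part of a
                            broken message may already have been delivered.
                            Does not validate against a schema unless a
                            validating parser is used.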

                            Development for complex messages is labor
                            intensive.

                            Almost all validation is application's
                            responsibility.

                            Any default or fixed values, or enumerations
                            specified in the schema must be duplicated in
                            the code.

                      Note: Since "most" XML messaging is currently
                            "trusted", this approach is currently the
                            most popular.
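
As a rough illustration of the event-driven style described above, here is a
minimal sketch using Python's standard xml.sax module.  The element names
(ProfileCreate_v1, Name, Phone) are made up for the example and are not taken
from any OTA schema.  The handler only sees "start element" events; anything
beyond tag structure would have to be checked by the application.

```python
import xml.sax


class RootNameHandler(xml.sax.ContentHandler):
    """Records the first element name seen and counts start-element events."""

    def __init__(self):
        super().__init__()
        self.root_name = None
        self.element_count = 0

    def startElement(self, name, attrs):
        # The first start-element event is always the document's root element.
        if self.root_name is None:
            self.root_name = name
        self.element_count += 1


def scan(xml_bytes):
    """Parse the message and return (root element name, number of elements)."""
    handler = RootNameHandler()
    xml.sax.parseString(xml_bytes, handler)
    return handler.root_name, handler.element_count


# Hypothetical message; the names are illustrative only.
SAMPLE = (
    b"<ProfileCreate_v1><Name>Jane</Name>"
    b"<Phone>555-0100</Phone></ProfileCreate_v1>"
)
```

Calling scan(SAMPLE) returns ("ProfileCreate_v1", 3).  Note that nothing here
checks data types, required fields, or enumerations.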


    DOM             - A parsing system that validates that the XML
                      document "conforms" to the schema indicated
                      "within" (at the top of) the XML message.

                      Pros: Creates a DOM tree, that can be "queried".

                            If you trust the sender then the XML is
                            guaranteed to be well-formed, and properly
                            structured to the schema.

                            Any default or fixed values, or enumerations
                            specified in the schema are automatically
                            "handled".

                      Cons: The DOM access API is "painful" to use.

                            DOM is "claimed" to be memory heavy.  This is
                            IMO, usually due to the fact that the DOM tree
                            is converted to a desired "internal"
                            representation.  This means that almost every
                            node/data element is in memory twice (once for
                            the DOM supporting Object and once for the
                            developer's supporting Object).

                            Since the schema is "interpreted", this form of
                            validation is considered "too slow" by many
                            development shops.

                            Since "most" DOM validating parsers only support
                            DTDs, the "level" of "automatic" validation is
                            limited to XML structure.  Therefore, all data
                            validation is the application's responsibility.

                            If the sender is NOT "trusted", then how can you
                            trust that they have indicated to validate
                            against the correct DTD?  To solve this problem,
                            the DTD reference MUST be either at a "neutral"
                            public location, or at the sender's site.  This
                            would allow the receiver to validate that the
                            validation is to the correct DTD.  This presumes
                            that the parser provides access to the DTD
                            reference.
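
The "untrusted sender" check described above could look roughly like the
sketch below.  It assumes the receiver keeps an allow-list of DTD locations
(the URL shown is a placeholder) and that the parser exposes the DOCTYPE
reference, which Python's xml.dom.minidom does.  Note that minidom does not
itself validate against the DTD; this only shows the "is it the right DTD?"
step.

```python
from xml.dom import minidom

# Hypothetical allow-list of DTD locations the receiver is willing to accept.
TRUSTED_DTDS = {"http://example.org/dtd/profile_v1.dtd"}


def doctype_is_trusted(xml_text):
    """Return True only if the message names a DTD from the allow-list."""
    doc = minidom.parseString(xml_text)
    doctype = doc.doctype  # None when the message carries no !DOCTYPE
    return doctype is not None and doctype.systemId in TRUSTED_DTDS
```

A message with no !DOCTYPE at all is rejected by this check, which matches
the concern that many non-DOM senders never include one.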


    Schema Compiler - A system that takes a "public" schema (and possibly
                      an augmentation file) and generates a program, class,
                      or class tree to parse (and validate?) an XML message.

                      There are a number of these systems "out there".
                      Some of the companies using, developing, or offering
                      these are (or the product names):

                                  Oracle,
                                  SUN,
                                  ConXtra,
                                  DXML (product name), &
                                  jDOM (product name).

                      Some of these "solve" the DOM tree "access" problem.
                      Some solve the "untrusted" validation problem.
                      Some solve BOTH problems.

                      Pros: Creates a "DOM like" tree, that can be "queried".

                            May solve the DOM tree "access" problem.

                            May solve the "untrusted" validation problem.
                            If it does, then it does it "fast".

                            Some either support augmented schema files, or
                            schema augmentation files.  So..., these can
                            raise the validation level to include data
                            typing, and possibly some biz rules (like no
                            past dates).  This "minimizes" the remaining
                            validation that is the application's
                            responsibility.

                            Any default or fixed values, or enumerations
                            specified in the schema are automatically
                            "handled".

                      Cons: If they don't solve the DOM tree "access"
                            problem, the DOM access API is "painful" to use.

                            If the resulting "DOM like" tree needs to be
                            converted to a desired "internal"
                            representation, then almost every node/data
                            element is in memory twice (once for the "DOM
                            like" supporting Object and once for the
                            developer's supporting Object).  This can be
                            memory heavy.

                            If the receiver expects to receive multiple
                            messages (or versions), then it must be "easy"
                            to incorporate the "switch" into the resulting
                            generated code/program.  This is often done by
                            creating a "master" schema, that indicates the
                            "switch" via element names.  This requires that
                            the "switch" NOT be based on data content!

                            With some of these products, validation is
                            problematical, and hence these should NOT be
                            used for "untrusted" message processing.

                      Note: Picking (or creating) the right Schema Compiler
                            can dramatically reduce the effort to support
                            multiple messages with multiple versions.  And
                            all this comes with little to NO performance
                            penalties!


Message differentiation & Versioning:

    Any form of versioning of messages should not favor, and more
    importantly, not preclude any of the above tool sets.  The ONLY form of
    versioning that is guaranteed to work with ALL of the above tool sets
    is element "name" based versions.  This same logic applies to major
    functional differences (actions) which might be represented by different
    schema.  Furthermore, the closer that this versioned element or "action"
    id is to the root of an XML document, the "friendlier" the multiple
    message specification is to the Schema Compiler option.
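
To make the element "name" based switch concrete, here is a small sketch in
Python.  It assumes a naming convention of Action_vN at the root (for example
ProfileCreate_v1); that convention, the child element names, and the handler
functions are all invented for the example.

```python
import xml.etree.ElementTree as ET


def handle_create_v1(root):
    # Version 1 of the hypothetical message carries a <Name> child.
    return ("create", 1, root.findtext("Name"))


def handle_create_v2(root):
    # Version 2 renamed the child to <FullName> in this made-up example.
    return ("create", 2, root.findtext("FullName"))


# The "switch": one entry per (action, version), keyed by root element name.
HANDLERS = {
    "ProfileCreate_v1": handle_create_v1,
    "ProfileCreate_v2": handle_create_v2,
}


def dispatch(xml_text):
    """Pick the handler from the root element name alone."""
    root = ET.fromstring(xml_text)
    try:
        handler = HANDLERS[root.tag]
    except KeyError:
        raise ValueError(f"unsupported message: {root.tag}") from None
    return handler(root)
```

Because the decision depends only on the root element name, the same switch
can be made from a SAX start-element event, a DOM root node, or a schema
"choice" in a master schema.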


Note: For a graphical perspective on the Two Stages of XML Message
      Validation, please print the Validation.gif (or view the
      Validation.vsd) file.

-The End-


                      Pros & Cons of different versioning options...
                                          by
                                     George Smith
                                      29 Sept 00


Intro:

    If there is a need to implement a server to handle "just" the OTA
    profile messages, then there are currently four messages that must be
    supported.  These are:

        Create,
        Read,
        Update, &
        Delete.

    To both validate, and "do something", someplace in the XML, there must
    be an indication of the action desired (and also the "data" to perform
    this action on).  The OTA currently does this with its "action" verb
    elements (tag names).

    Now if in addition: this server needs to support more than one client,
    AND, it needs to support these clients for "years", AND the
    specification of "what" makes up a profile changes, AND the clients
    can NOT be expected to change to use the new specifications
    "simultaneously", THEN the server must support the "simultaneous"
    (more or less) receipt of multiple versions.

    If the server only supports three versions "simultaneously" (at any
    one time), then the four "simple" messages, generate the need to
    actually support twelve (12) messages.  Now most likely, this server
    will need to support both: more than three versions "simultaneously",
    and more than just the four OTA profile messages.  For example, if
    there are fifteen messages, and four versions, then the server is
    really supporting sixty (60) messages.
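
    The arithmetic above is just actions times versions.  A short Python
    sketch that enumerates the combinations (the version numbers are
    placeholders):

```python
from itertools import product

OTA_ACTIONS = ["Create", "Read", "Update", "Delete"]


def message_variants(actions, versions):
    """Every (action, version) pair the server has to recognise."""
    return list(product(actions, versions))


# Four actions, three live versions: twelve distinct messages.
twelve = message_variants(OTA_ACTIONS, [1, 2, 3])

# Fifteen (placeholder) messages, four live versions: sixty distinct messages.
sixty = message_variants([f"Message{i}" for i in range(15)], [1, 2, 3, 4])
```

    Each pair in the resulting list needs its own validation and its own
    handling, which is why the count grows so quickly.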

    To select both the appropriate validation and action from these 10s to
    1000s of messages, a combination of version identification and action
    identification needs to be present.  This version identification is
    currently present in the OTA's root element name.  The exact location
    of the (or more than one) versioned element name is flexible, provided
    that it is above (closer to the root than) any change related to the
    version change.

    The question has been raised of NOT using versioning in the element
    names.  Where else could versioning be, and what are the consequences
    (Pros & Cons)?


Background:

    There are four basic XML parsing techniques (please see
    XMLparsingTechniques.txt for the details).  These are:

        Custom (CUSTOM) 100% Application based,
        SAX (SAX), or XML event driven,
        DOM (DOM) based with referenced schema validation, &
        Schema Compiler (SCHEMA-COMP).

    Assumption: The version identification method should NOT preclude the
    use of any of these techniques.  In addition, it would be "nice" if
    the version identification method was "easy" to use with all of these
    techniques.


Version Options:

    The version MUST be available as part of the message, or "in
    something" attached to the message.  Some of the options are:

        1) In the schema file (referred to by the !DOCTYPE SYSTEM field).
        2) In the schema file name (from the !DOCTYPE SYSTEM field).
        3) In the schema file "path" (from the !DOCTYPE SYSTEM field).
        4) In the !DOCTYPE PUBLIC field.
        5) In an Attribute, as a value.
        6) In an Element, as a value.
        7) As part of a NameSpace URI.
        8) As part of an Attribute's name.
        9) As part of an Element's name.


Pros, Cons, & Opinion:

    Option 1) In the schema file.

        Pros:
              There is NO version "dirtiness" in the XML.

        Cons:
              Only the "DOM" parser "bothers" to fetch the schema.

        Opin:
              This option is a non-starter!


    Option 2 & 3 & 4) In the !DOCTYPE.

        Pros:
              There is NO version "dirtiness" in the XML except in the
              !DOCTYPE.

        Cons:
              Many communicating systems that do not use a "DOM" parser do
              not "bother" to send the !DOCTYPE field.
              It becomes difficult to specify, if a private schema
              reference is desired.

        Opin:
              These options should be non-starters!


    Option 5) In an Attribute, as a value.

        Pros:
              This is a common practice.
              It is VERY easy to use with the CUSTOM, "SAX", & "DOM"
              parsers.
              It can be specified as a REQUIRED attribute with an
              enumeration list of "one" option.  (It should NOT be FIXED,
              as that technically makes it optional)

        Cons:
              This option prohibits anyone from creating a "master" (or
              complete) schema by combining "all" the individual version's
              schema.  This is because, "choices" in schema can ONLY be
              made against element names!  This "master" schema is then
              used with the SCHEMA-COMP to produce a single "complete"
              validation & parsing sub-system.
              If this option is used with the SCHEMA-COMP option, then the
              resulting "compiled" code must be "adjusted" to handle the
              version "switch" outside of normal schema "choice"
              processes.  Some SCHEMA-COMPs may not allow this
              "adjustment".

        Opin:
              This option, due to its popularity, would be my second
              choice!
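
    A sketch of what Option 5 looks like on the receiving side, in Python.
    The attribute name "Version" and the supported values are assumptions
    for the example.  Reading the attribute is trivial; the point is that
    the version "switch" lives in application code, outside of any schema
    "choice".

```python
import xml.etree.ElementTree as ET

# Hypothetical set of versions this receiver still supports.
SUPPORTED_VERSIONS = {"1", "2"}


def read_version(xml_text, attr="Version"):
    """Return (root element name, version) taken from an attribute value."""
    root = ET.fromstring(xml_text)
    version = root.get(attr)
    if version is None:
        raise ValueError(f"missing required attribute: {attr}")
    if version not in SUPPORTED_VERSIONS:
        raise ValueError(f"unsupported version: {version}")
    return root.tag, version
```

    For example, read_version('<ProfileCreate Version="2"/>') returns
    ("ProfileCreate", "2").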


    Option 6) In an Element, as a value.

        Pros:
              It is VERY easy to use with the CUSTOM, "SAX", & "DOM"
              parsers.

        Cons:
              In addition to all the Cons of Option 5, if the schema is a
              DTD, then it is NOT possible to specify/enforce the version.

        Opin:
              This option should be a non-starter!


    Option 7) As part of a NameSpace URI.

        Pros:
              The NameSpace as a URI, "can be" both globally explicit, and
              descriptive.

        Cons:
              Very few parsers of ANY type are currently NameSpace
              "friendly"; this could eliminate many "DOM" options.
              DTDs and NameSpaces are (if you follow the rules for
              NameSpaces) basically incompatible!
              Only some SCHEMA-COMP parsers "could" handle this option
              with the same concerns as those of Option 5.

        Opin:
              Since NameSpace use for messaging is "EVIL", this option
              should be a non-starter!
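
    For comparison, this is roughly what a receiver would have to do under
    Option 7, sketched in Python with a NameSpace aware parser.  The URI
    layout (version as the last path segment) is an assumption made up for
    the example.  ElementTree reports a qualified name as "{uri}local", so
    the version has to be dug out of the URI string by the application.

```python
import xml.etree.ElementTree as ET


def split_qualified_tag(tag):
    """Split ElementTree's '{uri}local' form into (uri, local)."""
    if tag.startswith("{"):
        uri, _, local = tag[1:].partition("}")
        return uri, local
    return None, tag


def namespace_version(xml_text):
    """Return (local element name, last path segment of the namespace URI)."""
    root = ET.fromstring(xml_text)
    uri, local = split_qualified_tag(root.tag)
    if uri is None:
        raise ValueError("root element is not in a namespace")
    return local, uri.rstrip("/").rsplit("/", 1)[-1]
```

    For example, a root of <p:ProfileCreate> bound to the hypothetical URI
    http://example.org/ota/profile/v2 yields ("ProfileCreate", "v2").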


    Option 8) As part of an Attribute's name.

        Pros:
              It is VERY easy to use with the CUSTOM, "SAX", & "DOM"
              parsers.

        Cons:
              What would be the value of the attribute?
              No one (that I am aware of) uses this option.

        Opin:
              This option should be a non-starter!


    Option 9) As part of an Element's name.

        Pros:
              This is a common practice (including the OTA ver 1).
              It is VERY easy to use with ALL the parser techniques.
              It allows the creation of "master" schema.

        Cons:
              NONE!

        Opin:
              This option is my first choice!
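
    A sketch of Option 9 from the receiver's point of view, in Python.  The
    Action_vN naming pattern is an assumption for the example, not the OTA's
    actual convention.  Splitting the element name gives both the action and
    the version in one step, with any parsing technique that can report an
    element name.

```python
import re

# Hypothetical convention: letters, then "_v", then the version number.
VERSIONED_NAME = re.compile(r"^(?P<action>[A-Za-z]+)_v(?P<version>\d+)$")


def parse_versioned_name(element_name):
    """Split a versioned element name into (action, version number)."""
    match = VERSIONED_NAME.match(element_name)
    if match is None:
        raise ValueError(f"not a versioned element name: {element_name}")
    return match["action"], int(match["version"])
```

    For example, parse_versioned_name("ProfileCreate_v2") returns
    ("ProfileCreate", 2).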


Conclusions:

    The only "reasonable" options are the two in current common practice:

        Option 5) In an Attribute, as a value.    &
        Option 9) As part of an Element's name.

    Option 9) As part of an Element's name, gets my vote for two reasons:

        1) It is what the OTA is already "using" (and agreed to), &
        2) It is "friendly" to ALL parsing techniques.

-The End-