ebxml-transport message

Subject: RE: Words for charset
From: Dick Brooks <dick@8760.com>
To: Dick Brooks <dick@8760.com>, ian.c.jones@bt.com, ebxml-transport@lists.ebxml.org
Date: Thu, 31 Aug 2000 10:39:13 -0500

The text below, on XML character encoding, is extracted from Appendix F of
the XML spec (http://www.w3.org/TR/REC-xml):

"Because each XML entity not in UTF-8 or UTF-16 format must begin with an
XML encoding declaration, in which the first characters must be '<?xml', any
conforming processor can detect, after two to four octets of input, which of
the following cases apply. In reading this list, it may help to know that in
UCS-4, '<' is "#x0000003C" and '?' is "#x0000003F", and the Byte Order Mark
required of UTF-16 data streams is "#xFEFF".


00 00 00 3C: UCS-4, big-endian machine (1234 order)
3C 00 00 00: UCS-4, little-endian machine (4321 order)
00 00 3C 00: UCS-4, unusual octet order (2143)
00 3C 00 00: UCS-4, unusual octet order (3412)
FE FF: UTF-16, big-endian
FF FE: UTF-16, little-endian
00 3C 00 3F: UTF-16, big-endian, no Byte Order Mark (and thus, strictly
    speaking, in error)
3C 00 3F 00: UTF-16, little-endian, no Byte Order Mark (and thus, strictly
    speaking, in error)
3C 3F 78 6D: UTF-8, ISO 646, ASCII, some part of ISO 8859, Shift-JIS, EUC,
    or any other 7-bit, 8-bit, or mixed-width encoding which ensures that the
    characters of ASCII have their normal positions, width, and values; the
    actual encoding declaration must be read to detect which of these applies,
    but since all of these encodings use the same bit patterns for the ASCII
    characters, the encoding declaration itself may be read reliably
4C 6F A7 94: EBCDIC (in some flavor; the full encoding declaration must be
    read to tell which code page is in use)
other: UTF-8 without an encoding declaration, or else the data stream is
    corrupt, fragmentary, or enclosed in a wrapper of some kind

This level of autodetection is enough to read the XML encoding declaration
and parse the character-encoding identifier, which is still necessary to
distinguish the individual members of each family of encodings (e.g. to tell
UTF-8 from 8859, and the parts of 8859 from each other, or to distinguish
the specific EBCDIC code page in use, and so on).

Because the contents of the encoding declaration are restricted to ASCII
characters, a processor can reliably read the entire encoding declaration as
soon as it has detected which family of encodings is in use. Since in
practice, all widely used character encodings fall into one of the
categories above, the XML encoding declaration allows reasonably reliable
in-band labeling of character encodings, even when external sources of
information at the operating-system or transport-protocol level are
unreliable.

Once the processor has detected the character encoding in use, it can act
appropriately, whether by invoking a separate input routine for each case,
or by calling the proper conversion function on each character of input."
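
To make the quoted procedure concrete, here is a minimal Python sketch of the
autodetection step followed by reading the encoding declaration. It is not
taken from the XML spec or from any parser. The names SIGNATURES,
detect_family and declared_encoding are hypothetical, and the codecs are
stand-ins chosen only because they decode the ASCII-range characters of the
declaration correctly: "latin-1" for the whole ASCII-compatible family and
"cp037" for EBCDIC. The two unusual UCS-4 octet orders are detected but not
decoded.

```python
import re

# Byte signatures from the list quoted above, checked in the same order.
# The third field is a Python codec able to decode the declaration for that
# family, or None where no standard codec matches (unusual UCS-4 orders).
SIGNATURES = [
    (b"\x00\x00\x00\x3c", "UCS-4 big-endian", "utf-32-be"),
    (b"\x3c\x00\x00\x00", "UCS-4 little-endian", "utf-32-le"),
    (b"\x00\x00\x3c\x00", "UCS-4 unusual order (2143)", None),
    (b"\x00\x3c\x00\x00", "UCS-4 unusual order (3412)", None),
    (b"\xfe\xff", "UTF-16 big-endian", "utf-16-be"),
    (b"\xff\xfe", "UTF-16 little-endian", "utf-16-le"),
    (b"\x00\x3c\x00\x3f", "UTF-16 big-endian, no BOM", "utf-16-be"),
    (b"\x3c\x00\x3f\x00", "UTF-16 little-endian, no BOM", "utf-16-le"),
    (b"\x3c\x3f\x78\x6d", "ASCII-compatible", "latin-1"),  # stand-in codec
    (b"\x4c\x6f\xa7\x94", "EBCDIC", "cp037"),              # stand-in code page
]

# Matches the encoding name inside an XML declaration.
DECL_RE = re.compile(
    r'<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)


def detect_family(data):
    """Return (family, codec) based on the first two to four octets."""
    for signature, family, codec in SIGNATURES:
        if data.startswith(signature):
            return family, codec
    # The "other" case from the list: UTF-8 without a declaration, or bad data.
    return "other (UTF-8 without declaration, or corrupt/wrapped)", "utf-8"


def declared_encoding(data, limit=1024):
    """Return (family, declared encoding name or None)."""
    family, codec = detect_family(data)
    if codec is None:
        return family, None
    # Only the start of the entity is decoded; errors are replaced because
    # the stand-in codec may not match the real encoding past the ASCII range.
    head = data[:limit].decode(codec, errors="replace")
    match = DECL_RE.search(head)
    return family, (match.group(1) if match else None)
```

For example, declared_encoding(b'<?xml version="1.0" encoding="ISO-8859-1"?><doc/>')
returns ('ASCII-compatible', 'ISO-8859-1'). A real processor would then switch
to the declared encoding for the rest of the entity.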


Dick Brooks
Group 8760
110 12th Street North
Birmingham, AL 35203
dick@8760.com
205-250-8053
Fax: 205-250-8057
http://www.8760.com/

InsideAgent - Empowering e-commerce solutions

> -----Original Message-----
> From: Dick Brooks [mailto:dick@8760.com]
> Sent: Thursday, August 31, 2000 8:03 AM
> To: ian.c.jones@bt.com; ebxml-transport@lists.ebxml.org
> Subject: Re: Words for charset
>
> Ian/Chris,
>
> IMO, the character encoding of the data within a MIME body part must follow
> the encoding specified in the charset parameter of the Content-Type header.
> In other words, it wouldn't be good to "change" encodings from UTF-16 to
> UTF-8 in the middle of a body part: the two encoding schemes are VERY
> different, and I believe the MIME parser would choke on this. I suggest we
> use the XML prolog to identify the character set encoding for XML documents
> (including the ebXML header), instead of the charset parameter in the MIME
> Content-Type header.
> This will accomplish two things:
>
> 1. Eliminates possible conflict between the charset parameter in the MIME
>    header and the encoding parameter in the XML prolog.
> 2. Provides the character set encoding to the XML parser when an XML
>    document is loaded.
>
>
>
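
The conflict described in point 1 of the quoted message can be checked
mechanically. The sketch below is only an illustration using Python's standard
email package, not part of any ebXML proposal. The helper names charset_labels
and labels_conflict are hypothetical. It assumes the body is in an
ASCII-compatible encoding, so the declaration can be matched as raw bytes, and
it compares the two labels as plain lowercase strings, so aliases of the same
encoding (for example "latin1" and "iso-8859-1") would be reported as a
conflict.

```python
import re
from email import message_from_bytes

# Matches the encoding name in an XML declaration, working on raw bytes.
# Assumption: the body uses an ASCII-compatible encoding.
XML_ENCODING_RE = re.compile(
    rb'<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)


def charset_labels(raw_part):
    """Return (MIME charset, XML declared encoding), each lowercase or None."""
    part = message_from_bytes(raw_part)
    mime_charset = part.get_content_charset()  # lowercased, None if absent
    body = part.get_payload(decode=True) or b""
    match = XML_ENCODING_RE.search(body[:1024])
    xml_encoding = match.group(1).decode("ascii").lower() if match else None
    return mime_charset, xml_encoding


def labels_conflict(raw_part):
    """True only when both labels are present and they differ."""
    mime_charset, xml_encoding = charset_labels(raw_part)
    if mime_charset is None or xml_encoding is None:
        return False
    return mime_charset != xml_encoding
```

When the charset parameter is left out of the Content-Type header, as the
quoted message suggests, charset_labels returns None for the MIME side and
labels_conflict reports no conflict.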