Subject: RE: Words for charset
The text below, on XML character encoding, is taken from Appendix F of the XML spec (http://www.w3.org/TR/REC-xml):

"Because each XML entity not in UTF-8 or UTF-16 format must begin with an XML encoding declaration, in which the first characters must be '<?xml', any conforming processor can detect, after two to four octets of input, which of the following cases apply. In reading this list, it may help to know that in UCS-4, '<' is "#x0000003C" and '?' is "#x0000003F", and the Byte Order Mark required of UTF-16 data streams is "#xFEFF".

  00 00 00 3C: UCS-4, big-endian machine (1234 order)
  3C 00 00 00: UCS-4, little-endian machine (4321 order)
  00 00 3C 00: UCS-4, unusual octet order (2143)
  00 3C 00 00: UCS-4, unusual octet order (3412)
  FE FF: UTF-16, big-endian
  FF FE: UTF-16, little-endian
  00 3C 00 3F: UTF-16, big-endian, no Byte Order Mark (and thus, strictly speaking, in error)
  3C 00 3F 00: UTF-16, little-endian, no Byte Order Mark (and thus, strictly speaking, in error)
  3C 3F 78 6D: UTF-8, ISO 646, ASCII, some part of ISO 8859, Shift-JIS, EUC, or any other 7-bit, 8-bit, or mixed-width encoding which ensures that the characters of ASCII have their normal positions, width, and values; the actual encoding declaration must be read to detect which of these applies, but since all of these encodings use the same bit patterns for the ASCII characters, the encoding declaration itself may be read reliably
  4C 6F A7 94: EBCDIC (in some flavor; the full encoding declaration must be read to tell which code page is in use)
  other: UTF-8 without an encoding declaration, or else the data stream is corrupt, fragmentary, or enclosed in a wrapper of some kind

This level of autodetection is enough to read the XML encoding declaration and parse the character-encoding identifier, which is still necessary to distinguish the individual members of each family of encodings (e.g. to tell UTF-8 from 8859, and the parts of 8859 from each other, or to distinguish the specific EBCDIC code page in use, and so on).

Because the contents of the encoding declaration are restricted to ASCII characters, a processor can reliably read the entire encoding declaration as soon as it has detected which family of encodings is in use. Since in practice, all widely used character encodings fall into one of the categories above, the XML encoding declaration allows reasonably reliable in-band labeling of character encodings, even when external sources of information at the operating-system or transport-protocol level are unreliable.

Once the processor has detected the character encoding in use, it can act appropriately, whether by invoking a separate input routine for each case, or by calling the proper conversion function on each character of input."

Dick Brooks
Group 8760
110 12th Street North
Birmingham, AL 35203
dick@8760.com
205-250-8053
Fax: 205-250-8057
http://www.8760.com/
InsideAgent - Empowering e-commerce solutions

> -----Original Message-----
> From: Dick Brooks [mailto:dick@8760.com]
> Sent: Thursday, August 31, 2000 8:03 AM
> To: ian.c.jones@bt.com; ebxml-transport@lists.ebxml.org
> Subject: Re: Words for charset
>
> List-Unsubscribe: <mailto:ebxml-transport-request@lists.ebxml.org?body=unsubscribe>
> List-Archive: <http://lists.ebxml.org/archives/ebxml-transport>
> List-Help: <http://lists.ebxml.org/doc/email-manage.html>,
>   <mailto:ebxml-transport-request@lists.ebxml.org?body=help>
>
> Ian/Chris,
>
> IMO, the character encoding of the data within a MIME body part must
> follow the encoding specified in the charset parameter associated with
> the Content-type header. In other words, it wouldn't be good to "change"
> encodings from UTF-16 to UTF-8 in the middle of a body part; the two
> encoding schemes are VERY different and I believe the MIME parser would
> choke on this.
>
> I suggest we use the XML prolog to identify the character set encoding
> for XML documents (including the ebXML header), instead of the charset
> parameter in the MIME Content-type header. This will accomplish two
> things:
>
> 1. Eliminates possible conflict between the charset parameter in the
>    MIME header and the encoding parameter in the XML prolog.
> 2. Provides the character set encoding to the XML parser when an XML
>    document is loaded.
>
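For anyone who wants to see what this looks like in practice, here is a rough Python sketch of the Appendix F autodetection quoted at the top of this message. It is only an illustration, not taken from the spec or from any ebXML code: the function names detect_encoding_family and read_declared_encoding and the family labels they return are my own inventions, and the second function handles only the ASCII-compatible family (the 3C 3F 78 6D case).

```python
import re
from typing import Optional

# Four-octet patterns for '<' in UCS-4, as listed in the quoted table.
_UCS4_PATTERNS = {
    b"\x00\x00\x00\x3c": "ucs4-be",    # 1234 order
    b"\x3c\x00\x00\x00": "ucs4-le",    # 4321 order
    b"\x00\x00\x3c\x00": "ucs4-2143",  # unusual octet order
    b"\x00\x3c\x00\x00": "ucs4-3412",  # unusual octet order
}

# Four-octet patterns for the first characters of '<?xml' in other families.
_OTHER_PATTERNS = {
    b"\x00\x3c\x00\x3f": "utf16-be-no-bom",
    b"\x3c\x00\x3f\x00": "utf16-le-no-bom",
    b"\x3c\x3f\x78\x6d": "ascii-compatible",  # UTF-8, 8859-x, Shift-JIS, ...
    b"\x4c\x6f\xa7\x94": "ebcdic",
}


def detect_encoding_family(data: bytes) -> str:
    """Classify the first two to four octets per the quoted table.

    The labels returned are hypothetical names chosen for this sketch.
    """
    head = data[:4]
    if head in _UCS4_PATTERNS:
        return _UCS4_PATTERNS[head]
    # Byte Order Mark #xFEFF, checked on the first two octets only.
    if head[:2] == b"\xfe\xff":
        return "utf16-be"
    if head[:2] == b"\xff\xfe":
        return "utf16-le"
    if head in _OTHER_PATTERNS:
        return _OTHER_PATTERNS[head]
    # "other": UTF-8 without a declaration, or corrupt/fragmentary/wrapped.
    return "utf8-or-unknown"


# The declaration is restricted to ASCII, so for the ASCII-compatible
# family it can be matched directly on the raw bytes.
_ENCODING_DECL = re.compile(
    rb"^<\?xml[^>]*?encoding\s*=\s*([\"'])([A-Za-z][A-Za-z0-9._\-]*)\1"
)


def read_declared_encoding(data: bytes) -> Optional[str]:
    """Return the declared encoding name, or None if there is none.

    Assumes the ASCII-compatible family; other families would need to be
    transcoded first.
    """
    match = _ENCODING_DECL.match(data)
    return match.group(2).decode("ascii") if match else None


if __name__ == "__main__":
    doc = b'<?xml version="1.0" encoding="ISO-8859-1"?><msg/>'
    print(detect_encoding_family(doc))
    print(read_declared_encoding(doc))
```

The point of the two-step shape is the one made in the quoted text: the first few octets only identify the family, and the declaration in the XML prolog then names the specific encoding, which is why the prolog can carry this information on its own without relying on the MIME charset parameter.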