Subject: Comments on Reliable Messaging Specification, Aug. 11, 2000
From: mwsachs@us.ibm.com
To: ebxml-transport@lists.ebxml.org
Date: Sun, 13 Aug 2000 23:46:55 -0400

1.1  Purpose and Scope

Line 55:  This paragraph should state whether all implementers shall
provide reliable messaging or whether it is optional.  This statement will
be an important clarification if the reliable messaging spec is moved into
the ebXML messaging spec.

2.1  Base Concepts

Line 105, Editor Note 5:  The reliable messaging protocol details should be
transparent to the sending and receiving parties.  Therefore, the service
interface should not be concerned with the window width.  The service
interface might provide an abstract quality of service parameter, but the
details of window size etc. should be determined by the message service
handler based on the requested quality of service and the details of the
selected low-level transport.  The low-level transport details are
invisible to the two parties other than the information which must be
stated in the TPA.  If there are any details which must be agreed to
between the message service handlers, these might be stated in the TPA.
However, since they don't directly concern the parties, it would be
preferable for the message service handlers to exchange initialization
messages in order to reach the agreement.

Line 112, Editor Note 7:  If there is an implementation limit to the window
size, this has to be agreed to by the two message service handlers and
perhaps by the two parties.  The agreement is stated in the TPA, if it has
to be visible to the parties, or it is arrived at by means of an exchange
of initialization messages between the message service handlers when they
first make contact with each other.  There are three interacting variables
related to the window size:  (1) maximum buffer size for the window; (2)
desired number of messages in the window; (3) maximum message size which
the application requires.  Note that (3) could be very large.  (1) is
probably an implementation limit that the parties need not know, but the
message service handlers must set it to the smaller of their two
capabilities.  Given limit (1), the window negotiation can be based on (2)
or (3), each of which sets a limit on the other because the product of (2)
and (3) cannot exceed (1).
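
To illustrate how the three variables constrain one another, here is a
minimal sketch.  The function name, the parameter names, and the use of
byte counts are my own assumptions for illustration; none of them comes
from the specification.

```python
def negotiate_window(buffer_a, buffer_b, desired_count=None, max_message_size=None):
    """Return (buffer, message_count, max_message_size) for one window.

    buffer_a and buffer_b are the window buffer capacities (in bytes) of the
    two message service handlers; variable (1) is the smaller of the two.
    Exactly one of desired_count (2) or max_message_size (3) is supplied,
    and the other is derived from it.
    """
    if (desired_count is None) == (max_message_size is None):
        raise ValueError("supply exactly one of desired_count or max_message_size")

    buffer = min(buffer_a, buffer_b)  # variable (1)

    if desired_count is not None:
        if desired_count < 1:
            raise ValueError("desired_count must be at least 1")
        # Fixing (2) caps the message size (3).
        return buffer, desired_count, buffer // desired_count

    if max_message_size < 1 or max_message_size > buffer:
        raise ValueError("max_message_size does not fit in the window buffer")
    # Fixing (3) caps the number of messages (2).
    return buffer, buffer // max_message_size, max_message_size
```

With handlers offering 1,000,000 and 400,000 bytes, asking for 8 messages
per window caps each message at 50,000 bytes, while requiring 150,000-byte
messages leaves room for only 2 messages per window.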

NOTE WELL:  Because each item in the window is a complete
application-level message, any implementation limit on the window size sets
a limit on the maximum application-level message size, which may be
unacceptable.  We must be very careful about imposing message size limits
on the application.  The application design may prevent splitting one
message into smaller messages; hence window size limits could prevent
support of some applications.  Reliable transport protocols deal with this
issue by segmenting the messages underneath the application and windowing
the segments.  Think about IP underneath TCP and the sliding window
protocols in HDLC and the LLC layer of the LANs.

If we really need a windowing protocol in the message service handler, the
windowing protocol should segment the messages in order to avoid
restricting the application message sizes.  The segmentation could be
accomplished by enveloping the message header inside the routing header and
adding to the routing information whatever identifiers and other
information are needed for the windowing protocol.  The windowing is done
with these segments rather than with complete application-level messages.
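
As a rough sketch of what segmenting underneath the application could look
like, the following splits one message into fixed-size segments and
reassembles them.  The `Segment` fields (`message_id`, `index`, `total`)
are hypothetical stand-ins for "whatever identifiers and other information
are needed"; they are not taken from the routing header definition.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    message_id: str  # hypothetical identifier of the application-level message
    index: int       # position of this segment within the message
    total: int       # number of segments making up the message
    data: bytes


def segment_message(message_id, payload, segment_size):
    """Split one application-level message into segments of bounded size."""
    if segment_size < 1:
        raise ValueError("segment_size must be at least 1")
    chunks = [payload[i:i + segment_size]
              for i in range(0, len(payload), segment_size)] or [b""]
    return [Segment(message_id, i, len(chunks), chunk)
            for i, chunk in enumerate(chunks)]


def reassemble(segments):
    """Rebuild the message from its segments, which may arrive in any order."""
    ordered = sorted(segments, key=lambda s: s.index)
    if not ordered:
        raise ValueError("no segments supplied")
    if [s.index for s in ordered] != list(range(ordered[0].total)):
        raise ValueError("missing or duplicate segment")
    return b"".join(s.data for s in ordered)
```

The window then counts segments of at most `segment_size` bytes, so the
buffer limit no longer bounds the size of the application message.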

If I am right in the foregoing, then we may have reached the point where
reliable messaging is adding complexity which may not be needed, given that
most transport protocols are inherently reliable, being built on TCP.  The
major exception for us is SMTP.  See section 2.6.7.3 of the IBM tpaML
proposal for a discussion of SMTP and a suggested means of layering an
end-to-end ACK on top of SMTP to achieve end-to-end at-most-once delivery.
Please also note the suggestion in section 2.6.6 that a received message be
hardened (written to persistent storage) before returning the
transport-level ACK.  This appears to be sufficient to assure guaranteed
delivery and failure recovery even with HTTP.

We should look at what reliability gaps currently exist at the message
service level and see if we can deal with them in a much simpler way. The
discussions in the tpaML proposal (section 2.6) may provide guidance.

Line 123, item 7:  Observation: The usual sliding window protocols are full
duplex with regard to messages and ACKs, and there is a pause only on
detection of a lost message.  The protocol specified in this document is
not a sliding window at all;  it is more like a "jumping window"
protocol - it is half duplex and there is a pause on every window.  That is
a serious degradation of message latency and throughput compared to sliding
window protocols.
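
A back-of-the-envelope model shows the size of the effect.  It is a
deliberately simplified sketch of my own: no messages are lost, every
message takes the same time to send, `ack_delay` is the time from the end
of a transmission until its ACK arrives, and the sliding window is assumed
wide enough that the sender never stalls waiting for an ACK.

```python
import math


def jumping_window_time(n_messages, window, send_time, ack_delay):
    """Half duplex: the sender pauses for an ACK after every full window."""
    windows = math.ceil(n_messages / window)
    return n_messages * send_time + windows * ack_delay


def sliding_window_time(n_messages, send_time, ack_delay):
    """Full duplex: ACKs overlap with sending, so only the last ACK is waited for."""
    return n_messages * send_time + ack_delay
```

For 100 messages, a window of 10, a send time of 5 units and an ACK delay
of 20 units, the jumping window needs 700 units against 520 for the
sliding window; the gap grows with the number of windows.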

Line 134, Editor Note 9: The persistent store used for reliable messaging
is (conceptually) independent of the long-lived persistent stores needed at
the message service level for managing conversation state and long-term
logging. An implementation may choose to use the same store for both
purposes or use separate stores.  If an implementation uses the same store,
then statements in this specification about discarding messages from the
persistent store must not be normative.  An implementation which uses the
same store for both purposes may need to mark messages as "windowing
processing complete" but it cannot actually erase the messages.
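
A minimal sketch of such a shared store follows.  An in-memory dict stands
in for real persistent storage, and the class and method names are
hypothetical; the point is only that completing window processing sets a
mark and never erases the message.

```python
class SharedMessageStore:
    """One store serving both reliable messaging and long-term logging."""

    def __init__(self):
        self._messages = {}         # message_id -> payload, kept for logging
        self._window_done = set()   # ids marked "windowing processing complete"

    def persist(self, message_id, payload):
        self._messages[message_id] = payload

    def mark_window_complete(self, message_id):
        # Mark the message instead of deleting it; the long-term log still needs it.
        if message_id not in self._messages:
            raise KeyError(message_id)
        self._window_done.add(message_id)

    def pending_for_window(self):
        """Ids the reliable messaging layer still has to deal with."""
        return [m for m in self._messages if m not in self._window_done]

    def logged(self):
        """Every id ever persisted, whether or not windowing is finished."""
        return list(self._messages)
```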

Line 136, item 9:  Please replace "For only the last message..." by "To
detect loss of the last message..."  The statement in the specification is
an implementation statement.  For example, the sender could choose to set a
deadline for each message and slide the deadline forward until the last
message of the window.  This would enable early detection of "hard"
failures.  My suggested change avoids stating a requirement that the
timeout may only be set on the last message.
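
One possible reading of the sliding deadline, sketched under my own
assumptions (the class name, the numeric clock, and the single timeout
value are all hypothetical): each send pushes the deadline forward, so a
window that stalls part way through is detected without waiting for its
last message.

```python
class SlidingDeadline:
    """Track one deadline that moves forward with every message sent."""

    def __init__(self, timeout):
        self.timeout = timeout
        self.deadline = None  # None means no ACK is currently awaited

    def on_send(self, now):
        # Each message sent slides the deadline forward.
        self.deadline = now + self.timeout

    def on_window_ack(self):
        # The window was acknowledged; nothing is outstanding.
        self.deadline = None

    def expired(self, now):
        return self.deadline is not None and now >= self.deadline
```

If the sender stops after the third message of a ten-message window,
`expired()` becomes true one timeout after that third send, which is the
early detection of a "hard" failure.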

Line 137, item 9  ("information from the TPA"):  It is not obvious that a
separate timeout is needed for reliable messaging.  The existing
transport-level timeout as defined in tpaML section 2.6.4 may serve the
purpose.  However, this point requires considerably more thought. As it
stands, it is not clear to me that the complexity of the window timeout is
worth the value added.  A much simpler solution for this 1-out-of-N case
(loss of the last message) is to rely on the normal transport-level timeout
(e.g. the time to the HTTP response).  Simply terminate the window; the
messaging service will then time out at the transport level and re-send
the message, starting a new window.  This, however, leads to the following
considerations:

In this protocol, there seem to be two possibilities regarding the timeout:

   (1) The normal per-message transport-level timeout is not used with
       reliable messaging - but this extends the time to retry a lost
       message to the time to fill the window.
   (2) The per-message transport-level timeout is still used on top of the
       reliable messaging protocol.  In this case, the reliable messaging
       protocol must NEVER retransmit a message in the window if it was
       successfully received since the upper level already knows that the
       message was successfully received. (Perhaps discarding the
       duplicate is sufficient; I am not certain of this.)

It is essential that we understand what additional reliability is provided
by this protocol over the much simpler one described in the tpaML proposal:
persist each message and then ACK it.  Note that with the exception of
SMTP, the transport-level ACKs are present whether or not reliable
messaging is used, so for transports which have their own ACKs, reliable
messaging seems only to delay the retry of a missing message until the end
of the window.  In addition to increasing latency, the retry causes the
retried message to be out of order, which may cause trouble higher up in
the system.
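
For comparison, the simpler scheme can be sketched in a few lines.  This is
my own illustration of "persist, then ACK" with duplicate discard, not a
reproduction of the tpaML text; the dict stands in for a persistent store
and all names are hypothetical.

```python
class PersistThenAckReceiver:
    """Harden each message before acknowledging it; drop duplicates."""

    def __init__(self):
        self.store = {}       # stands in for a persistent store
        self.delivered = []   # message ids passed upward, in arrival order

    def receive(self, message_id, payload):
        if message_id not in self.store:
            # Persist first, so the message survives a failure after the ACK.
            self.store[message_id] = payload
            self.delivered.append(message_id)
        # A retried message is ACKed again but not delivered a second time.
        return ("ACK", message_id)
```

Each message is passed upward as soon as it is persisted, so there is no
wait for a window to fill, and a sender that times out and retries cannot
cause a second delivery.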

Aside from retries, the protocol in this specification increases latency
by preventing a message from being passed upward in the receiving system
until the window is filled.  This protocol may have some value for SMTP
but, as mentioned earlier, the tpaML proposal suggests a much simpler means
of adding reliability to SMTP.

2.2  Features

Line 161 (High Performance):  As mentioned earlier, the protocol in this
specification is not a sliding window.  It is a batching protocol which
increases the latency for all messages except the last one in each window.
See the above discussion.

2.3  Message Envelope Elements

Line 167, title:  Shouldn't this be "Message Header Elements"?

2.3.2  Message Header - Reliable Messaging Info Element

Line 173, editor note 12:  As discussed earlier, the window count should
not be visible to the parties.  It must be established and managed by the
message service handlers.

2.3.3  Routing Header

Line 179, Editor Note 13:  If it is intended that the messages in a single
window can be from various TPAs and various conversations, then the message
service instance must be identified.  Be careful, however, because the
latency created by such a window affects all TPAs and conversations,
especially when retries are performed.  If there is a separate message
service instance for each conversation, then the window can be smaller and
retries in one window need not delay other conversations.  In this case,
the conversation ID is sufficient to identify the message service instance.

2.4  Message Transfer Sequence

Line 212, Editor Note 14:  So far, the only payload in the message is the
application payload.  The error message should be expressed using elements
in the routing header.

Line 213, Item 5:  It should be made clear that the persistent store
described in this specification is logically distinct from any persistent
storage used to store message state and logging information.

2.8  Garbage Collection

Line 254 and following:  Non-normative implementation text is useful when
it helps to explain the protocol.  I believe that this section just
describes a storage management algorithm.  The basic rule that should be
described is that messages MAY be eliminated from the conceptual persistent
store after they are acknowledged.  It should be made clear that the store
used for reliable messaging is logically distinct from the higher-level
long-term persistent store, but nothing prevents an implementation from
using one store for both purposes.

5. References

Lines 296 and 298:  Please replace these two references by a reference to
the combined specification.

Regards,
Marty
*************************************************************************************

IBM T. J. Watson Research Center
P. O. B. 704
Yorktown Hts, NY 10598
914-784-7287;  IBM tie line 863-7287
Notes address:  Martin W Sachs/Watson/IBM
Internet address:  mwsachs @ us.ibm.com
*************************************************************************************