ebxml-transport message

Subject: Re: Comments on Reliable Messaging Specification, Aug. 11, 2000
From: Jim Hughes <jfh@fs.fujitsu.com>
To: mwsachs@us.ibm.com
Date: Tue, 15 Aug 2000 22:02:36 -0700

Marty,

Inserted below are my comments on your email, especially how I resolved 
them in the latest version of the RM spec. Thanks for the comments...

Jim

At 11:46 PM 8/13/00 -0400, mwsachs@us.ibm.com wrote:
>1.1  Purpose and Scope
>
>Line 55:  This paragraph should state whether all implementers shall
>provide reliable messaging or it is optional.  This statement will be an
>important clarification if the reliable messaging spec is moved into the
>ebXML messaging spec.

Inserted requirement that all implementers SHALL support RM functions.

>2.1  Base Concepts
>
>line 105, Editor note 5:  The reliable messaging protocol details should be
>transparent to the sending and receiving parties.  Therefore, the service
>interface should not be concerned with the window width.  The service
>interface might provide an abstract quality of service parameter but the
>details of window size etc. should be determined by the message service
>handler based on  the requested quality of service and the details of the
>selected  low-level transport. The low-level transport details are
>invisible to the two parties other than the information which must be
>stated in the TPA.  If there are any details which must be agreed to
>between the message service handlers, these might be stated in the TPA
>although since they don't directly concern the parties, it would be
>preferable to exchange initialization messages between message service
>handlers in order to reach the agreement.

Rephrased this Editor Note. The "From" MSH (Messaging Service Handler) 
needs to figure out the RM-Group size (I changed "Window" to "RM-Group" 
because it really is just a group of messages, and everyone is getting 
confused by the term "Window"). The "To" MSH doesn't need to negotiate to 
learn the number of possible messages in the RM-Group, except to possibly 
set an upper bound. The actual number of messages is denoted by setting 
RM-Group Count >0 in the last message.
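
As a rough illustration of the RM-Group Count convention (this is not text 
from the spec), a "From" MSH might stamp a group along the following lines. 
The field names, the build_rm_group helper, the use of 0 to mean "not the 
last message", and the optional max_group_size upper bound are all 
hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RMMessage:
    message_id: str      # hypothetical identifier format
    rm_group_id: str     # hypothetical field name
    rm_group_count: int  # assumed: 0 except in the last message of the group
    payload: bytes


def build_rm_group(group_id: str, payloads: List[bytes],
                   max_group_size: int = 0) -> List[RMMessage]:
    """Stamp each payload as a member of one RM-Group.

    Only the last message carries RM-Group Count > 0, and that count is
    the number of messages in the group. max_group_size models an
    optional upper bound set by the "To" MSH (0 means no bound).
    """
    if not payloads:
        raise ValueError("an RM-Group needs at least one message")
    if max_group_size and len(payloads) > max_group_size:
        raise ValueError("RM-Group exceeds the agreed upper bound")
    last = len(payloads) - 1
    return [
        RMMessage(
            message_id=f"{group_id}-{i + 1}",
            rm_group_id=group_id,
            rm_group_count=len(payloads) if i == last else 0,
            payload=p,
        )
        for i, p in enumerate(payloads)
    ]
```

With three payloads, the first two messages carry a count of 0 and the 
third carries 3, so the "To" MSH learns the group size only when the last 
message arrives.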

>line 112, Editor Note 7: If there is an implementation limit to the window
>size, this has to be agreed to by the two message service handlers and
>perhaps by the two parties. The agreement is stated in the TPA, if it has
>to be visible to the parties, or it is arrived at by means of an exchange
>of initialization messages between the message service handlers when they
>first make contact with each other.  There are three interacting variables
>related to the window size:  (1) maximum buffer size for the window; (2)
>desired number of messages in the window; (3) maximum message size which
>the application requires.  Note that (3) could be very large.  (1) is
>probably an implementation limit that the parties need not know but the
>message service handlers must set it to the smaller of their two
>capabilities.  Given limit (1), the window negotiation can be based on (2)
>or (3), each of which sets a limit on the other.

"RM-Group Size" means the *number* of messages in the RM-Group, not the 
bytes occupied by the message. The MSHs have to keep track of the number of 
messages and their respective identifiers. The (byte/octet) size of a 
message is not discussed in this RM spec... Note 7 was rephrased.

>  NOTE WELL:  because each item in the window is a complete
>application-level message, any implementation limit on the window size sets
>a limit on the maximum application-level message size, which may be
>unacceptable.  We must be very careful about imposing message size limits
>on the application.  The application design may prevent splitting one
>message into smaller messages; hence window size limits could prevent
>support of some applications.  Reliable transport protocols deal with this
>issue by segmenting the messages underneath the application and windowing
>the segments.  Think about IP underneath TCP and the sliding window
>protocols in HDLC and the LLC layer of the LANs.

Again, we are not covering logical message splitting in this RM spec.

>If we really need a windowing protocol in the message service handler, the
>windowing protocol should segment the messages in order to avoid
>restricting the application message sizes.  The segmentation could be
>accomplished by enveloping the message header inside the routing header and
>adding to the routing information whatever identifiers and other
>information are needed for the windowing protocol.  The windowing is done
>with these segments rather than with complete application-level messages.
>
>If I am right in the foregoing, then we may have reached the point where
>reliable messaging is adding complexity which may not be needed, given that
>most transport protocols are inherently reliable, being built on TCP.  The
>major exception for us is SMTP.  See section 2.6.7.3 of the IBM tpaML
>proposal for a discussion of SMTP and a suggested means of layering an end
>to end ACK on top of SMTP to achieve end to end at-most-once delivery.
>Please also note the suggestion in section 2.6.6 that a received message be
>hardened before returning the transport-level ACK. This appears to be
>sufficient to assure guaranteed delivery and failure recovery even with
>HTTP.
>
>We should look at what reliability gaps currently exist at the message
>service level and see if we can deal with them in a much simpler way. The
>discussions in the tpaML proposal (section 2.6) may provide guidance.
>
>Line 123, item 7:  Observation: The usual sliding window protocols are full
>duplex with regards to messages and ACKs, and there is a pause only on
>detection of a lost message.  The protocol specified in this document is
>not a sliding window at all;  it is more like a "jumping window"
>protocol - it is half duplex and there is a pause on every window.  That is
>a serious degradation of message latency and throughput compared to sliding
>window protocols.

Another reason why I changed the name to "RM-Group".

>Line 134, Editor Note 9: The persistent store used for reliable messaging
>is (conceptually) independent of the long-lived persistent stores needed at
>the message service level for managing conversation state and long-term
>logging. An implementation may choose to use the same store for both
>purposes or use separate stores.  If an implementation uses the same store,
>then statements in this specification about discarding messages from the
>persistent store must not be normative.  An implementation which uses the
>same store for both purposes may need to mark messages as "windowing
>processing complete" but it cannot actually erase the messages.

Good point. Change made to section 2.4(5). Deleted Garbage Collection, 
section 2.8.

>Line 136, item 9:  Please replace "For only the last message..." by "To
>detect loss of the last message..."  The statement in the specification is
>an implementation statement.  For example, the sender could choose to set a
>deadline for each message and slide the deadline forward until the last
>message of the window.  This would enable early detection of "hard"
>failures.  My suggested change avoids stating a requirement that the
>timeout may only be set on the last message.

Change made at beginning of the sentence.

The reason for saying that a timeout is specified for *only* the last 
message of an RM-Group is to avoid having timeouts for *all* messages in 
the RM-Group. The Sender finds out that messages (other than the last) in 
an RM-Group never arrived by getting an error message in response to the 
last message. The Sender recovers from non-delivery of the last message by 
using the timeout.
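
A minimal sketch of that recovery logic, under stated assumptions, might 
look like this. Per-group sequence numbers 1..N, the ("ACK"/"ERROR", 
missing) response shape, and the rule that the Sender resends the last 
message after the missing ones to prompt a fresh response are all 
hypothetical; the spec draft does not define them here.

```python
from typing import Dict, List, Optional, Set, Tuple

Response = Tuple[str, List[int]]


class GroupReceiver:
    """Receiving MSH: responds only to the last message of an RM-Group."""

    def __init__(self) -> None:
        self._seen: Dict[str, Set[int]] = {}

    def receive(self, group_id: str, seq: int,
                group_count: int) -> Optional[Response]:
        seen = self._seen.setdefault(group_id, set())
        seen.add(seq)  # duplicates are absorbed by the set
        if group_count == 0:
            return None  # not the last message: no response is sent
        missing = sorted(set(range(1, group_count + 1)) - seen)
        return ("ERROR", missing) if missing else ("ACK", [])


def sender_next_action(response: Optional[Response], timed_out: bool,
                       last_seq: int) -> List[int]:
    """Return the sequence numbers the Sender should (re)send next."""
    if response is None:
        # No response yet: only the last message has a timeout.
        return [last_seq] if timed_out else []
    kind, missing = response
    if kind == "ERROR":
        # Assumed policy: resend what was lost, then the last message
        # again so the Receiver re-checks the group.
        return missing + [last_seq]
    return []  # ACK: the whole group arrived
```

For a group of three where message 2 is lost, the Receiver stays silent 
until message 3 arrives, reports 2 as missing, and returns an ACK once 
messages 2 and 3 are resent.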

If the Sender wants to use timeouts for each message, it could do so by 
just sending "unreliable messages" (DeliverySemantics = Unspecified) and 
monitoring timeouts, *if* there was a way to force MSH-level ACKs for every 
message. Or, the Sender could just make each message its own RM-Group, and 
this spec would require an MSH-ACK back for each group of 1 message.

In an earlier version of this spec, we had defined this kind of sliding 
timeout for use in garbage collection. What is proposed here is 
essentially the same solution, since the latest timeout is actually the one 
assigned to the final message in the RM-Group.

>Line 137, item 9  ("information from the TPA"):  It is not obvious that a
>separate timeout is needed for reliable messaging.  The existing
>transport-level timeout as defined in tpaML section 2.6.4 may serve the
>purpose.  However, this point requires considerably more thought. As it
>stands, it is not clear to me that the complexity of the window timeout is
>worth the value added.  A much simpler solution for this 1-out-of-N case
>(loss of the last message) is to rely on the normal transport-level timeout
>(e.g. the time to the HTTP response).  Simply terminate the window.  The
>messaging service will simply time out at the transport level and re-send
>the message, starting a new window. This, however, leads to the following
>considerations:

One of the major rationales of this proposal is to make *no* assumptions on 
the underlying transport (the "carrier pigeon model"). Thus, we don't 
introduce the concept of a "normal transport-level timeout". If we lift 
this assumption, then obviously other solutions are possible...

>In this protocol, there seem to be two possibilities regarding the timeout:
>
>    The normal per-message transport-level timeout is not used with reliable
>    messaging - but this extends the time to retry a lost message to the
>    time to fill the window.

Yes, you are correct. However, the Sender MSH can minimize the number of 
messages in an RM-Group if this is a problem (or even turn off RM functions 
in the MSH layer if he wants to just use known transport layer functions 
and not expect any kind of RM-layer ACK/error message from the receiving 
MSH). I would expect that scenario if the transport is inherently reliable.

>    The per-message transport-level timeout is still used on top of the
>    reliable messaging protocol.  In this case, the reliable messaging
>    protocol must NEVER retransmit a message in the window if it was
>    successfully received since the upper level already knows that the
>    message was successfully received. (Perhaps discarding the duplicate is
>    sufficient; I am not certain of this.)
>
>It is essential that we understand what additional reliability is provided
>by this protocol over the much simpler one described in the tpaML proposal
>- persist each message and then ACK it.  Note that with the exception of
>SMTP, the transport-level ACKs are present whether or not reliable
>messaging is used, so for transports which have their own ACKs, reliable
>messaging seems only to delay the retry of a missing message until the end
>of the window.  In addition to increasing latency, the retry causes the
>retried message to be out of order, which may cause trouble higher up in
>the system.
>
>Aside from retries, the protocol in this specification increases latency
>by preventing a message from being passed upward in the receiving system
>until the window is filled.  This protocol may have some value for SMTP
>but, as mentioned earlier, the tpaML proposal suggests a much simpler means
>of adding reliability to SMTP.

I haven't formed a firm opinion on the TPA and its use in ebXML 
transactions, but I am troubled by its size and complexity. How do we 
implement things such as "it is strongly recommended that the framework 
implement and end-to-end acknowledgment" (Note, section 2.6.7.3)? 
Especially, it seems to me that the TPA is present to describe the profiles 
of two parties, and there is no TPA mandate that the parties SHALL 
implement some kind of reliability function or other protocol... that's the 
function of other documents.

If both MSHs operate on a "persist and ACK each message" basis, as you 
describe, then you just need to define whether the ACK is a transport-ACK 
or an RM-ACK. In 
the latter case, we would use RM functions and set the RM-Group size to 1. 
Does this make sense?
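
To illustrate the "persist and ACK" case with an RM-Group size of 1, here 
is a small sketch of what a receiving MSH could do. It is only an 
assumption-laden example: the file-per-message store, the 
persist_then_ack name, the shape of the ACK, and the choice to re-ACK a 
duplicate without storing it again are all hypothetical, and message_id is 
assumed to be safe to use as a file name.

```python
import os
from pathlib import Path


def persist_then_ack(store_dir: Path, message_id: str,
                     payload: bytes) -> dict:
    """Harden the message on disk first, and only then build the ACK."""
    store_dir.mkdir(parents=True, exist_ok=True)
    path = store_dir / f"{message_id}.msg"
    if not path.exists():  # a duplicate is acknowledged but not re-stored
        tmp = path.with_suffix(".tmp")
        with open(tmp, "wb") as f:
            f.write(payload)
            f.flush()
            os.fsync(f.fileno())  # force the bytes to stable storage
        os.replace(tmp, path)     # atomic rename into place
    # Every message is the last (and only) one in its RM-Group,
    # so the count is always 1.
    return {"type": "RM-ACK", "message_id": message_id, "rm_group_count": 1}
```

Because the ACK is built only after the write has been flushed and 
renamed into place, a crash before that point leaves the Sender without an 
ACK, and it retries.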

>2.2  Features
>
>Line 161 (High Performance):  As mentioned earlier, the protocol in this
>specification is not a sliding window.  It is a batching protocol which
>increases the latency for all messages except the last one in each window.
>See the above discussion.

Yes. As the header for this section says, "edit once the other parts are done".

>2.3  Message Envelope Elements
>
>line 167, title:  Shouldn't this be "Message Header Elements"?

No, because the RM functions do not alter the contents of the Message 
Header. The Routing Header must be added to the Envelope, not the Message 
Header. We need to make this change to the Messaging Specification.

>2.3.2  Message Header - Reliable Messaging Info Element
>
>Line 173, editor note 12:  As discussed earlier, the window count should
>not be visible to the parties.  It must be established and managed by the
>message service handlers.

This is not entirely true. The From-Party (see Figure 1) may have valid 
reasons to tell the Sending MSH that a group of messages must be sent 
reliably, and it would have nothing to do with the characteristics of the 
underlying transport. Quite possibly the From-Party is interested to know 
only when the group of messages was reliably sent. We need to define the 
interface to the From-Party to lock this down.

>2.3.3  Routing Header
>
>Line 179, Editor Note 13:  If it is intended that the messages in a single
>window can be from various TPAs and various conversations, then the message
>service instance must be identified.  Be careful, however, because the
>latency created by such a window affects all TPAs and conversations,
>especially when retries are performed.  If there is a separate message
>service instance for each conversation, then the window can be smaller and
>retries in one window need not delay other conversations.  In this case,
>the conversation ID is sufficient to identify the message service instance.

I'm not sure what you are proposing here. RM doesn't know about 
conversations and other items identified in the Header.

>2.4  Message Transfer Sequence
>
>Line 212, Editor Note 14:  So far, the only payload in the message is the
>application payload.  The error message should be expressed using elements
>in the routing header.

The first pass at the error/acknowledgement message format is in the new 
draft. Routing Headers are meant to be used for each sequence of 
Sender-Receiver MSHs that a message passes through between the From-Party 
and the To-Party. I don't think there is a need for Routing Headers in RM 
error messages, since these errors need to be sent only directly between 
two adjacent MSHs (assuming an eventual multi-MSH node network is possible 
later). Thus, it seems easier to me to just put the error information into 
the payload of the error message... but I'm open to discussion!

>Line 213, Item 5:  It should be made clear that the persistent store
>described in this specification is logically distinct from any persistent
>storage used to store message state and logging information.

Corrected in the text.

>2.8  Garbage Collection
>
>Line 254 and following:  Non-normative implementation text is useful when
>it helps to explain the protocol.  I believe that this section  just
>describes a storage management algorithm.  The basic rule that should be
>described is that messages MAY be eliminated from the conceptual persistent
>store after they are acknowledged.  It should be made clear that the store
>used for reliable messaging is logically distinct from the higher level
>long term persistent store but there is nothing preventing an
>implementation from using one store for both purposes.

Eliminated from the specification.

>5. References
>
>Lines 296 and 298:  Please replace these two references by a reference to
>the combined specification.

Done!

>Regards,
>Marty
>*************************************************************************************
>
>IBM T. J. Watson Research Center
>P. O. B. 704
>Yorktown Hts, NY 10598
>914-784-7287;  IBM tie line 863-7287
>Notes address:  Martin W Sachs/Watson/IBM
>Internet address:  mwsachs @ us.ibm.com
>*************************************************************************************