ebxml-transport message

Subject: Raised by DB's RM proposal but unrelated...

From: Doug Bunting <Doug@ariba.com>
To: "Routing & Packaging ebXML Transport (E-mail)"<ebxml-transport@lists.ebxml.org>
Date: Wed, 14 Feb 2001 00:42:02 -0800

First, a big issue that's been ignored through multiple revisions to our document: The semantics of the "timeout" parameter make little sense. It essentially defines an override for the retryInterval parameter for the first delivery failure. I'm not sure of the whole history, but the text hasn't changed in a while. David may have made the last change when he correctly realised the description of this parameter was ambiguous (around or before Tokyo). At this point, we have two choices:

1. Work to restore the old text and make it unambiguous.
2. Remove the timeout parameter from our specification completely.

On Rik and others' recommendations, we're leaning towards the second approach. The concept of timeout applies either to business processes (when should an application discard and not process a received message?) or connection failures (when should a transmission be considered to have failed in the absence of a transport error?). The first case is slightly beyond the scope of our specification (above our layer, though we've included the timeToLive parameter in support of that layer). The second case was originally the target of our timeout parameter. However, most connection protocols either handle timeout internally or provide no way to control the interval from higher layers. In either case, the timeout parameter does not have a place in our specification. A timeout should be handled no differently than any other connection failure. (We may need to define something for an SMTP binding since its timeout is something like 4 days.)

Recommended changes (using the 0.93 specification as a starting point):

Section 10.2.1.3, lines 1411-1417, change

1) The Sending MSH MUST resend the original message if an Acknowledgment Message has not been received from the Receiving MSH and either of the following are true:
   a) The message has not yet been resent and at least the time specified in the timeout parameter has passed since the first message was sent, or
   b) The message has been resent, and the following are both true:
      i) At least the time specified in the retryInterval has passed since the last time the message was resent, and

to

1) The Sending MSH MUST resend the original message if an Acknowledgment Message has not been received from the Receiving MSH and the following are both true:
   a) At least the time specified in the retryInterval has passed since the message was last sent, and

(renumbering the next point accordingly)
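The proposed wording can be read as a single check on the time since the last send. Here is a minimal sketch of that reading, not spec text. The names PendingMessage, must_resend and remaining_condition are hypothetical; remaining_condition stands in for the second, renumbered condition, which is not quoted above and is deliberately left opaque.

```python
from dataclasses import dataclass


@dataclass
class PendingMessage:
    """Hypothetical bookkeeping record kept by a Sending MSH."""
    last_sent_at: float          # time of the most recent send or resend, in seconds
    acknowledged: bool = False   # True once an Acknowledgment Message has arrived


def must_resend(msg: PendingMessage, now: float, retry_interval: float,
                remaining_condition: bool = True) -> bool:
    """Apply the proposed rule: no acknowledgment, and both conditions hold.

    Condition (a): at least retry_interval has passed since the last send.
    remaining_condition: placeholder for the renumbered second condition,
    which the quoted text does not show.
    """
    if msg.acknowledged:
        return False
    interval_elapsed = (now - msg.last_sent_at) >= retry_interval
    return interval_elapsed and remaining_condition
```

Note that the first send and every later resend are treated the same way, which is the point of dropping the separate timeout parameter.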

Section 10.6.1, line 1638, remove "timeout" line in the table.
Section 10.6.4.3, lines 1703-1706, remove entire section.
Section 10.6.4.6, lines 1715-1721, same change as for section 10.2.1.3 or remove repeated text and reference 10.2.1.3 (my preference).

I'm not sure where this fits: We don't currently define what causes retry very well. In the current text, an originating MS (or is it MSH?) just notices that it hasn't received an acknowledgement. That sounds as if timeout is the only failure that can lead to retry. Realistically, the originator will receive a transport failure (host not found, host unreachable, proxy overload, HTTP 404, timeout, et cetera) and initiate retry semantics based on any such failure. Some ebXML warnings may also cause the originator to retry (temporary server failures, though we don't have that concept right now). Do the various errors which lead to retry need better description? If so, where should that description appear?

Similarly, we don't describe which of the ebXML errors might indicate permanent failure. While all lower level transport failures should be considered transient, some (or all) of the errors we define may cause the originating application to immediately give up. This should be explicit in our specification.
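One way to picture the distinction being asked for is a small classification of failures into those that lead to retry and those that do not. This is only an illustrative sketch: the category names are hypothetical, and which ebXML errors (if any) belong in each category is exactly the open question above.

```python
from enum import Enum, auto


class FailureKind(Enum):
    """Hypothetical failure categories; none of these names come from the spec."""
    TRANSPORT = auto()        # host not found, host unreachable, proxy overload, HTTP 404, timeout
    EBXML_TRANSIENT = auto()  # e.g. a temporary server failure (a concept not yet defined)
    EBXML_PERMANENT = auto()  # an error after which retrying cannot succeed


# Assumption: all lower-level transport failures are treated as transient.
RETRYABLE = frozenset({FailureKind.TRANSPORT, FailureKind.EBXML_TRANSIENT})


def should_retry(kind: FailureKind) -> bool:
    """Return True if the originator should enter retry semantics."""
    return kind in RETRYABLE
```

With a table like this in the specification, a timeout becomes one transport failure among many instead of the only trigger for retry.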

Somewhere, we may want to include a timeline something like (or completely unlike and much better looking than) the following:

|==--------|==--------|*---------|

where the message is sent at a consistent interval indicated by the vertical bar whether the MS encounters an immediate delivery error (*) or the lower layers time out (==).
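The timeline can be reproduced with a few lines of code, which makes the rule explicit: each window has the same width and starts at the send, regardless of how the attempt failed. The function name, the outcome labels and the widths below are assumptions chosen to match the drawing.

```python
def render_timeline(outcomes, window=10, timeout_width=2):
    """Draw one fixed-width window per send attempt.

    outcomes: sequence of "timeout" (lower layers time out, drawn as '=')
              or "error" (immediate delivery error, drawn as '*').
    """
    parts = []
    for outcome in outcomes:
        marker = "=" * timeout_width if outcome == "timeout" else "*"
        # Pad with '-' so every window is the same width, whatever the failure.
        parts.append("|" + marker + "-" * (window - len(marker)))
    return "".join(parts) + "|"


print(render_timeline(["timeout", "timeout", "error"]))
# |==--------|==--------|*---------|
```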

As David has separately mentioned, the retryInterval should actually be the minimum size for this window. Retry semantics including backoff algorithms or acceleration rules should never go faster than that minimum. (Nobody is going to tell an originating application how often and when they're allowed to retry. They can choose their own semantics with a few minimums and maximums to avoid a deluge at the recipient.)
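Treating retryInterval as a floor might look like the sketch below. The exponential backoff, the backoff_factor and max_interval parameters, and the function name are all assumptions for illustration; the only rule taken from the text is that the next send never comes sooner than retryInterval after the previous one.

```python
def next_send_time(last_sent_at, attempt, retry_interval,
                   backoff_factor=2.0, max_interval=None):
    """Return the earliest time at which the message may be sent again.

    attempt: number of resends already performed (0 before the first retry).
    The sender picks its own backoff or acceleration, but the delay is
    clamped so it is never shorter than retry_interval.
    """
    delay = retry_interval * (backoff_factor ** attempt)
    if max_interval is not None:
        delay = min(delay, max_interval)
    # The floor is applied last, so it wins over any acceleration or cap.
    delay = max(delay, retry_interval)
    return last_sent_at + delay
```

For example, with a retry_interval of 30 seconds, a sender that tries to accelerate with backoff_factor=0.5 still waits the full 30 seconds between sends.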

thanx,

doug