RSVP Working Group Interim Meeting
Golden State Conference Room, Cisco Systems
San Jose, CA
April 29, 1999

OVERVIEW

The RSVP Working Group held an interim meeting on April 29, 1999 to
consider proposed changes to RSVP's state maintenance machinery, to
reduce overhead and/or reduce the worst-case state convergence time.
The following Internet Drafts were under consideration:

(a) "Staged Refresh Timers for RSVP", Ping Pan, Henning Schulzrinne,
    and Roch Guerin; draft-pan-rsvp-timer-00.txt.

(b) "RSVP Refresh Reduction Extensions", Lou Berger, Der-Hwa Gan,
    George Swallow; draft-berger-rsvp-refresh-reduct-01.txt.

(c) "A Proposal for reducing RSVP Refresh Overhead using State
    Compression", Lan Wang, Andreas Terzis, and Lixia Zhang;
    draft-wang-rsvp-state-compression-00.txt.

(d) "RSVP Extension for ID-based Refreshes", M. Yuhara and
    M. Tomikawa; draft-yuhara-rsvp-refresh-00.txt.

Attendees are listed at the end. The following minutes were taken by
Fred Baker.

MINUTES

A. Opening remarks by Bob Braden

Two distinct problems to be solved:

1.) Setup Performance

    a.) worst-case time to install state
    b.) worst-case time to remove state

    The RSVP spec says RSVP messages should get essentially lossless
    service, but this now appears to be too optimistic. If there are
    losses, there can be 30 second delays in creating or removing
    state, which is an unacceptable lack of robustness.

2.) Setup Cost

    The primary cost concern is the overhead due to soft-state
    refresh messages. If the design point is 10K sessions, there are
    ~300 refresh messages/second. Issues are CPU time and bandwidth
    used - O(300 KBPS) (aggregate across all interfaces).

The RSVP spec talks about fixing these problems by adjusting the
refresh timeout time R, but Lou Berger pointed out that these two
problems require R to be changed in opposite directions.
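
The arithmetic behind the setup-cost figures above can be reproduced
with a small sketch. It assumes one refresh message per session per
refresh period, the RFC 2205 default of R = 30 seconds, and a message
size of 120 bytes; the message size and the function name are
illustrative assumptions, not figures from the meeting.

```python
def refresh_load(sessions, refresh_period_s, msg_bytes):
    """Return (messages/second, kilobits/second) for periodic refreshes.

    Assumes one refresh message per session per refresh period.
    """
    msgs_per_s = sessions / refresh_period_s
    kbits_per_s = msgs_per_s * msg_bytes * 8 / 1000
    return msgs_per_s, kbits_per_s


# 10K sessions is the design point from the minutes; R = 30 s is the
# RFC 2205 default; 120 bytes per message is an assumed size.
for period in (10, 30, 90):
    msgs, kbps = refresh_load(10_000, period, 120)
    print(f"R={period:>3}s  {msgs:7.1f} msg/s  {kbps:7.1f} kbit/s")
```

Shrinking R shortens the time to repair a lost message but raises the
message rate in proportion, while growing R does the reverse. That is
the tension noted above: the two problems pull R in opposite
directions.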

Design constraints:

    backwards compatible
    handle multicast sessions
    handle route changes
    able to be used across non-RSVP cloud
    auto-configuration

Common Themes Among Proposals:

One common theme among proposals has been to distinguish "refresh"
from "trigger" messages. Trigger messages are used to create, modify,
or tear down state, while refresh messages are simply sent
periodically to assure consistency and to keep state from timing out.
Standard RSVP does not distinguish these two cases, resulting in
great simplicity but inadequate performance.

The proposals all depend upon route change notification to invoke
trigger messages along a new path. This may not work if a route
change within a non-RSVP cloud changes the egress router (next RSVP
hop).

Another common theme among proposals is to add an "ACK-requested"
flag to RSVP messages. This option provides reliable delivery of
trigger messages: any RSVP messages the sender cares about are
retransmitted until acknowledged, perhaps with an adaptive backoff.
This can be a problem in the case of multicast Path and PathTear
messages, where you may not know the number of next hops, and ACK
implosion can be an issue.

There are various proposals to reduce refresh overhead by somehow
compressing refresh messages. However, all proposals continue to
send individual trigger messages (but perhaps packed -- see Lou
Berger talk) when state changes.

Design parameters (issues of scale):

    Number of sessions
    Average inter-arrival time of new sessions
    Average duration of a session

_________________________________________________________________

B. Refresh Reduction proposals

The plan for the meeting was to have a terse presentation of each of
the four Drafts, followed by a discussion. However, there was a
lively discussion during each of the presentations, giving the group
a very good grasp of the overlaps and differences among them.
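
The "ACK-requested" theme from the opening remarks (retransmit a
trigger message until it is acknowledged, with an adaptive backoff)
can be sketched as follows. This is a minimal illustration, not the
mechanism of any one draft: the initial delay, backoff factor, cap,
retry limit, and all function and class names are assumptions.

```python
def backoff_delays(initial_s, factor, cap_s, max_tries):
    """Yield the ACK wait time for each transmission, growing up to a cap."""
    delay = initial_s
    for _ in range(max_tries):
        yield delay
        delay = min(delay * factor, cap_s)


def send_reliably(send, wait_for_ack, delays):
    """Transmit, then wait for an ACK; repeat with the next delay on timeout.

    Returns the number of transmissions used, or None if every attempt
    timed out (ordinary soft-state refresh would then be the fallback).
    """
    for attempt, delay in enumerate(delays, start=1):
        send()
        if wait_for_ack(delay):
            return attempt
    return None


class LossyHop:
    """Test double: drops the first `losses` transmissions, then ACKs."""

    def __init__(self, losses):
        self.losses = losses
        self.sent = 0

    def send(self):
        self.sent += 1

    def wait_for_ack(self, timeout_s):
        # No real waiting here; an ACK exists only once a copy got through.
        return self.sent > self.losses


hop = LossyHop(losses=2)
attempts = send_reliably(hop.send, hop.wait_for_ack,
                         backoff_delays(0.5, 2, 8, 6))
print(attempts)
```

A unicast next hop fits this pattern directly. For multicast Path and
PathTear messages the sender may not know how many ACKs to expect,
which is the difficulty noted above.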

B.1 Ping Pan - Staged Refresh Timers for RSVP

Ping Pan described a proposal developed at IBM to solve the
performance problem.

Background:

RSVP's soft state assures that unused state will be discarded and
resources recovered. However, it does so via an unreliable mechanism
across links that may have 1-2% message loss normally, with bursts
up to 20% loss. If the first RESV or PATH is lost at any point
(imagine it being lost at every point), the worst case installation
interval is the sum of the cleanup intervals of the routers en route.

Rationale for fast installation:

    application may strongly require fast installation
    billing and accounting may be skewed by incorrect or
    inconsistent information

Proposal:

    Send trigger messages with "echo request" (i.e., ACK request)
    bit.

    Retransmission interval governed by a "staged" refresh timer
    (unacknowledged retransmissions back off over time).

    When ACK is received, increase refresh interval (e.g., to 15
    mins).

    Systems that forward the message (have an appropriate route and
    therefore don't discard it) acknowledge its receipt.

    Use this on both installation and teardown, but not error or
    confirm.

For a multicast session where you don't know the number of next hops
(the reliable multicast problem referred to above): you have to use
exponential backoff rather than suddenly slowing down. On PATH Tear,
it is sufficient to retransmit until all known next hops have
acknowledged. On RESV messages, the previous hop is known.

_________________________________________________________________

B.2 Lou Berger - RSVP Refresh Reduction Extensions

Motivation: MPLS scalability and other reliability issues.

Wished to:

    Address key RSVP implications.
    Leverage RSVP as a signaling protocol.
    Allow for extension implementation on a feature-by-feature
    basis.

Proposed Approach:

    Start with RSVP 2205.
    Incrementally reduce raw refresh rate.
    Bound state-change propagation time.
    Refresh all state via a single message.
    Enable arbitrary failure detection interval.
    Enable implementations to only implement needed extensions.

Three New Features:

1. RSVP message packing

   Some routers have a significant per-message cost to pass a
   message across the kernel boundary to an RSVP daemon. Packing
   small RSVP messages into a single larger message will therefore
   improve RSVP setup performance on such systems. [There was some
   disagreement from idealists in the meeting, who suggested that we
   should not let such implementation-specific issues determine the
   protocol.]

   Lixia Zhang pointed out that the term "aggregation" used in the
   draft (and in the talk) is a misleading term; it would be more
   accurately called "packing" (and these minutes use the latter
   term).

2. RSVP Message acknowledgement

   This feature can retransmit any message until it is acknowledged.

3. State refresh via a single message

   The intent is to limit refresh (message and CPU) overhead. Other
   than sending messages that contain new information, neighbor
   nodes only need to periodically probe each other's liveness, and
   these probes may equally well be thought of as periodic
   collective state refreshes.

   This mechanism was earlier characterized as "hard state". It has
   been observed that even with "hard state", there is a requirement
   for periodically probing each neighbor with an "up-down
   protocol", e.g., the Hello messages suggested in the draft.

Proposed Extensions:

Hello Extension

    Exchange "Hello" messages that indicate the current process
    instance ("reboot number" or "epoch"). This number changes when
    a system or process restarts, telling its neighbors to
    retransmit their entire database.

Message ID Extension

    Attaches a unique (per sender) id to an RSVP message, to be used
    for acknowledging that message. In his proposal, these
    identifiers are not ordered.

    An Acknowledgement object carrying the same Message Id can be
    piggy-backed on any message in the opposite direction.
    Discussed several issues for multicast: ACK implosion, cases
    where the number of next hops is not known, etc. Also, some
    discussion on format and size of identifier - doesn't much
    matter what it is, but we should agree on what it is.

Packing Extension

    Pack multiple RSVP messages of any type (except packed messages)
    into one container packet for transmission to a next/previous
    hop.

    Auto-configurable by setting a "packing-capable" bit in the
    common message header, and only packing to nodes from which this
    bit has been received.

    The container message is sent unicast, so the individual
    messages are effectively being tunneled (RSVP-in-RSVP) to the
    neighbor. If some neighbor is not packing-capable, multicast
    sessions cannot be packed.

    A cheap implementation of packing functionality is to insert a
    small latency in the output buffer (e.g. 100 msec), to collect
    together messages bound to a particular next/previous hop.

_________________________________________________________________

B.3 Lan Wang and Andreas Terzis - Proposal for reducing RSVP Refresh
    Overhead using State Compression

Goals:

    Reduce refresh overhead
    Minimize state resynchronization delay
    Retain self-healing nature of RSVP's soft state

Their proposal is to group state by the neighbor it is sent to, and
keep a tree of MD5 hashes of this state. A node periodically sends a
single refresh message containing the MD5 digest of its state to
that neighbor and requests an ACK. If the neighbor has a different
digest, a specified algorithm is used to isolate what is different
and repair the bad state.

As an additional optimization, recently sent trigger messages may be
cached; upon detection of state inconsistency, a node can first
replay the cache in hopes that this will fix the error, before
invoking the MD5 partial recovery algorithm. The cache is flushed
each time the MD5 is ACKed (implying that the two neighbor nodes
have consistent state).

This proposal adopted Berger's extension of an ACK flag option.
Timestamps are used to generate ordered unique message identifiers
for this acknowledgment. One or more of these identifiers may be
packed into a single ACK message back to the sender.

There was an extended discussion of the MD5 digest mechanism. It was
realized that the efficiency of the mechanism depends critically
upon the data structure technique ("B+ trees") used to maintain the
MD5 hash trees, and that neighbor RSVP nodes must keep consistent
MD5 tree structures when sessions are added or deleted. The protocol
spec needs to define them with sufficient precision to ensure
interoperable implementations.

Vern Paxson pointed out that the proposed n-ary tree data structure
is not the only possibility, and other choices -- e.g., a hash table
or simply XOR -- could provide much greater simplicity and nearly as
good performance.

It was also pointed out that MD5 is not necessarily the only, or
even the best, choice of hash function. A computationally less
expensive hash might be found.

_________________________________________________________________

B.4 Lixia Zhang - RSVP Extension for ID-based Refreshes

(Since neither of the authors of this proposal could attend, Lixia
Zhang presented the material.)

This draft builds upon the Lou Berger draft (see earlier), to reduce
the bandwidth overhead when individual RSVP refresh messages must be
sent. Individual PATH refreshes need to be sent across non-RSVP
clouds, since RSVP routers may be unaware of route changes inside
the clouds.

They propose that once a Path trigger message has been acknowledged
using the Berger mechanism, subsequent Path refresh messages can be
sent using the Message Id as a shorthand.

The group opinion was that this shorthand refresh message was a good
idea. Combined with the message packing idea in Berger's proposal,
one can achieve significant reduction in refresh message overhead.
[However, the Draft notes that these shorthand messages cannot be
packed using the Berger mechanism.]

_________________________________________________________________

C. DISCUSSION

What are the objectives?

Agreement was reached that we need an acknowledgment mechanism to
bound the time of state change propagation. This needs to cover the
delivery of RESV/PATH and RESV/PATH Tear at least; including other
(arbitrary) messages would be a plus.

Agreement was reached that we would like to reduce per-message and
per-session CPU overhead, and maybe bandwidth. Storage is not a real
issue, although dramatically increasing storage requirements would
not be well received.

An attempt was made to define the "threats" for which the proposed
new mechanisms should provide reliability. The following threats
were discussed:

    RSVP message loss
    Route change
    Node failure
    Link failure
    Message reordering and race conditions
    Undetected message error

The assumption has been in the past that soft state resolves all of
these problems, plus potentially others, eventually. If the
probability of a threat is sufficiently low and soft state fixes it,
refreshing is a sufficient solution. If we take a solution that
somehow stops soft state refresh, the issues it was claimed to solve
must be re-examined.

We do not believe that any of our issues result in new security
threats.

Proposed Mechanisms

The group extracted the following list of mechanisms from the
various proposals.

1) Message Packing

   A useful option to adopt.
   May need implementation guidelines.
   Are there applicability (requirements) issues?

2) Message ID/Timestamp

   Carries an "ACK-needed" flag. Permits refresh reduction if used
   to provide acknowledgement.

   Suggested using two fields to reliably identify a message: a
   32-bit ordered unique message ID AND a 24-bit epoch number. This
   epoch number should be randomly generated each time a node
   reboots or a link goes down.

   Agreed that this ID is for a message, not for atoms of state.
   Thus, a single FF Resv message would carry a single Message ID,
   not an ID per flow descriptor. In case of message packing, each
   sub-message in the pack may have an ID, and the container message
   should carry its own ID for acknowledgment.

   However, it was unclear how to apply this to Confirm messages;
   this needs further thought to get the definitions precisely
   right.

3) Message ACK Message or Object

   The consensus was to adopt the approach in the Berger proposal,
   defining an ACK object that can be piggy-backed on messages
   flowing in the opposite direction. In the absence of such
   messages, the Berger proposal also defines an ACK message to
   return acknowledgment promptly.

4) Message ID as shorthand for (Path) refresh

   Needs a NAK object for when an unknown routing change happens.

5) State Compression

   This is a way to keep soft state with a constant overhead by
   periodically exchanging a state signature.

There was general consensus that mechanisms 1) - 4) would be useful.
However, there was less agreement on state compression. The Berger
et al scheme, using a Hello message, is essentially hard state; once
the state has been set up and acknowledged, it is assumed to be
reliably in place. If the two ends get out of sync, the only
recourse is to retransmit all the state, assuming that the problem
is detected. The Wang/Terzis/Zhang MD5 hash scheme periodically
checks for state synchronization between neighbors, and it provides
partial state recovery when the check fails.

There was much discussion about whether the additional assurance
provided by the MD5 hash scheme is worth its cost. The group
discussed various threats to state consistency (see earlier), and it
appeared that the obvious threats have been covered by the Berger
scheme. Still, bad and mysterious things do happen in real
large-scale network nodes, and people seemed to agree that the
stronger assurance of the MD5 scheme would be desirable if it can be
done with reasonable cost in complexity and performance.
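
The periodic digest check discussed above can be illustrated with a
short sketch. It is not the algorithm of the Wang/Terzis/Zhang draft:
the hash tree is deliberately left out and replaced by a flat
per-session comparison, and the state encoding and all function names
are assumptions made for illustration. In a real exchange only the
digests would cross the link; here both state tables are local so the
example can run on its own.

```python
import hashlib


def session_digest(session_id, state):
    """MD5 signature of one session's state (the encoding is an assumption)."""
    return hashlib.md5(f"{session_id}={state}".encode()).hexdigest()


def per_session_digests(sessions):
    """Map each session id to its signature."""
    return {sid: session_digest(sid, st) for sid, st in sessions.items()}


def neighbor_digest(sessions):
    """Single digest over all state held for one neighbor, in a fixed order."""
    h = hashlib.md5()
    for sid in sorted(sessions):
        h.update(session_digest(sid, sessions[sid]).encode())
    return h.hexdigest()


def sessions_to_repair(local, remote):
    """Return the session ids whose state differs between two neighbors.

    The cheap summary digest is compared first; only on a mismatch are
    the per-session signatures compared to isolate the bad state.
    """
    if neighbor_digest(local) == neighbor_digest(remote):
        return set()
    mine, theirs = per_session_digests(local), per_session_digests(remote)
    return {sid for sid in set(mine) | set(theirs)
            if mine.get(sid) != theirs.get(sid)}


sender = {"s1": "resv 1Mb/s", "s2": "resv 2Mb/s", "s3": "resv 5Mb/s"}
receiver = {"s1": "resv 1Mb/s", "s2": "resv 4Mb/s"}  # s2 stale, s3 lost
print(sorted(sessions_to_repair(sender, receiver)))
```

When the two sides agree, the cost per refresh period is one digest
comparison regardless of the number of sessions. The flat fallback
costs one signature per session on a mismatch, which is the overhead
the tree structure in the draft is meant to reduce.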

Vern Paxson's suggestion (see earlier) was brought up again. One
specific suggestion for reducing the complexity was to maintain a
signature for each session state but without the tree structure of
the UCLA proposal, and simply exchange this sequence of (potentially
a very large number of) state signatures between neighbor nodes for
either refresh or partial rollback.

D. DECISION AND/OR FUTURE DIRECTION

1) There was group consensus to start with the Lou Berger et al
   draft, merge in the better ideas from UCLA, Ping Pan, and
   Fujitsu, and work that into a short-term consensus document
   incorporating the mechanisms 1) - 4) above. There were
   suggestions that this set of mechanisms by itself might go a long
   way towards solving both the performance problem and the cost
   problem. The consensus was a bit muddier on the state compression
   mechanism 5).

2) The group saw some value in partial rollback, and it was asserted
   that there may be problems that are not adequately solved by the
   reworked Berger draft. The UCLA folks agreed to re-work the MD5
   digest proposal within two weeks, based on the feedback, and
   bring a revision for further consideration by the Working Group
   or its replacement. Some felt that the Working Group should adopt
   the Berger Hello mechanism in the meantime; others thought that
   this might not be necessary, and it might be better to await a
   revised (and hopefully simplified) version of the UCLA MD5 work.