Network Working Group                                        Jerry Ash
Internet Draft                                      Gagan L. Choudhury
<draft-ash-ospf-isis-congestion-control-02.txt>               Jeff Han
Expiration Date:  December 2002                     Mahmood Noorchashm
                                                  Vera D. Sapozhnikova
                                                        Mostafa Sherif
                                                                  AT&T

                                                        Anurag Maunder
                                                        Sanera Systems

                                                        Vishwas Manral
                                                                NetPlane

 
                                                            June, 2002


             Proposed Mechanisms for Congestion Control/
               Failure Recovery in OSPF & ISIS Networks

            <draft-ash-ospf-isis-congestion-control-02.txt>

Status of this Memo

   This document is an Internet-Draft and is in full conformance with
   all provisions of Section 10 of RFC2026.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups. Note that other
   groups may also distribute working documents as Internet-Drafts.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time. It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/ietf/1id-abstracts.txt

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html.


ABSTRACT:

Earlier papers and contributions identified issues of congestion control and
failure recovery for link-state protocol networks, such as OSPF, ISIS, and
PNNI networks [maunder, choudhury, pappalardo1, pappalardo2, atm01-0101].
The problem addressed is to enable link-state protocols to a) gracefully
recover from massive loss of topology database information, and b) respond
gracefully to network overloads and failures. This contribution proposes
specific additional considerations for network congestion control/failure
recovery.  Candidate mechanisms are proposed for control of network
congestion and failure recovery, in particular we initially propose to
investigate the following mechanisms:
a)      throttle new connection setups, topology-state updates, and Hello
updates based on automatic congestion control mechanisms, 
b)      special marking of critical control messages (e.g., Hello and
topology-state-update Ack) so that they may receive prioritized processing,
c)      database backup, in which a topology database could be automatically
recovered from loss based on local backup mechanisms, and 
d)      hitless restart, which allows routes to continue to be used if there
is an uninterrupted data path, even if the control path is interrupted due
to a failure.
We propose that the OSPF and ISIS working groups include the candidate
mechanisms in an evaluation of congestion control/failure recovery for OSPF
and ISIS networks.  

Table of Contents

   1. Introduction                                             
   2. Failure Experience and Analysis
      2.1 Failure Experience
      2.2 Vendor Analysis of Product Performance
      2.3 Analytic Modeling, Simulation Analysis, and Emulation Analysis
   3. Proposed Solution Methods
   4. Candidate Mechanisms for Congestion Control/Failure Recovery
      4.1 Automatic Congestion Control
      4.2 Special Marking of Critical Control Messages
      4.3 Database Backup
      4.4 Hitless Restart
   5. Proposals
   6. Security Considerations                                  
   7. Acknowledgments                                          
   8. References                                               
   9. Authors' Addresses                                       
   10. Copyright Statement
 
1. Introduction

Congestion can arise in data networks for many different reasons. There is
evidence based on previous failures that link state (LS) routing protocols,
such as OSPF and ISIS, currently cannot recover from large failures that
result in widespread loss of topology database information (especially when
areas/peer-groups get "too large").  LS protocols typically use
topology-state update (TSU) mechanisms to build the topology database at
each node, typically conveying the topology status through flooding of TSU
messages containing link, node, and reachable-address information between
nodes.  In OSPF, the TSU mechanism uses the link state advertisement (LSA);
in PNNI, it uses the PNNI topology state element (PTSE); frame-relay and
proprietary-routing networks may use other TSU mechanisms to exchange
topology status information and build the topology database at each node.  In
this contribution we use the generic term TSU to cover LS protocols in
general.

Earlier papers and contributions identified issues of congestion control and
failure recovery for LS protocol networks, such as OSPF, ISIS, and PNNI
networks [maunder, choudhury, pappalardo1, pappalardo2, atm01-0101].  The
problem addressed is to enable LS protocols to a) gracefully recover from
massive loss of topology database information, and b) respond gracefully to
network overloads and failures. This contribution proposes specific
additional considerations for network congestion control/failure recovery.

As discussed in Section 2, there have been data network outages in which
these problems have manifested themselves.  As a result of these failures, a
number of congestion control/failure recovery mechanisms were recommended,
and these are reviewed in Section 3 of this contribution.  

In Section 4, we discuss how these candidate mechanisms for control of
network congestion and failure recovery can be used.  In particular we
discuss:

a)      how a node might detect congestion or failure conditions based on
various criteria, such as CPU real-time use, memory consumption, and others,
b)      how a node might control its workload once congestion is detected,
such as through 
*       the throttling of new connection setups, topology-status updates,
and Hello updates with the use of automatic congestion control mechanisms,
*       special marking of critical control messages (e.g., Hello and
topology-state-update Ack) so that they may receive prioritized processing,
c)      how a node might recover from failure, such as through 
*       database backup, in which a topology database could be automatically
recovered from loss based on local backup mechanisms,
*       hitless restart, which allows routes to continue to be used if there
is an uninterrupted data path, even if the control path is interrupted due
to a failure.

We propose that the OSPF and ISIS working groups include the candidate
mechanisms in an evaluation of congestion control/failure recovery for OSPF
and ISIS networks.

1.1 Related Work Underway in Standards Bodies

There is much work already underway in standards bodies, namely the IETF,
ATM Forum, and ITU-T, to address issues of congestion control and failure
recovery in ATM- and IP-based packet networks.  Numerous references are
cited and are further explained in the document [maunder, moy1, moy2, moy3,
murphy, whitehead, zinin, atm01-0101, btd-cs-congestion-02.00].

2. Failure Experience and Analysis

In this section we present evidence of the current problems LS protocols
have in recovering from various failure conditions, based on a) failure
experience, b) vendor analysis of product performance, and c) analytic
modeling, simulation analysis, and emulation analysis.

2.1 Failure Experience

AT&T has experienced serious data network outages in which recovery of the
underlying LS protocols was inadequate.  For example, in the failure in the
AT&T Frame Relay Network on April 13, 1998 [att], an initial procedural
error triggered two undetected software bugs, leading to a huge overload of
control messages in the network.  The result of this control overload was
the loss of all topology database information, and the LS protocol then
attempted to recover the database with the usual Hello and TSU updates. 

Analysis has shown that several problems then occurred to prevent the
network from recovering properly:

*       A very large number of TSUs sent to every node for processing,
causing general processor overload

*       Route computation based on incomplete topology recovery, causing
routes to be generated from transient, asynchronous topology information
and then to need frequent re-computation

*       Inadequate work queue management to allow processes to complete
before more work is put into the process queue

*       Inability to segment the network (into smaller "peer groups") to aid
in the LS protocol recovery

*       Inability to access the node processors with network management
commands due to lack of necessary priority for these messages
A more recent failure occurred on February 20, 2001 in the AT&T ATM Network,
which resulted in a large overload of TSUs, and a lengthy network outage
[pappalardo1, pappalardo2].  Manual procedures were put in place to reduce
TSU flooding, which worked to stabilize the network.  It is desirable that
such TSU flooding reduction be automatic under overload.

In general, there have been a number of major outages reported by most major
carriers, and routing protocol issues have generally been involved.  Other
relevant LS-network failures are reported in [cholewka, jander].

Various networks employing LS protocols use various control messages and
mechanisms to update the LS database, not necessarily LSAs, PTSEs, or
flooding mechanisms.  Based on experience, however, the LS protocols are
found to be vulnerable to loss of database information, control overload to
re-sync databases, and other failure/overload scenarios which make such
networks more vulnerable in the absence of adequate protection mechanisms.
Hence we are addressing a generic problem of LS protocols across a variety
of implementations, and the basic problem is prevalent in LS protocol
networks employing frame-relay, ATM, and IP based technologies.

2.2 Vendor Analysis of Product Performance

Various vendors and service providers were asked to analyze their product
with reference to how their LS protocol recovers from a total network
failure, that is, loss of all database information in the specified
scenario.  The specification of the failure presented to the vendors made
the following assumptions:

1. A 400-node network is considered
	a) 100 backbone nodes
	b) 3 edge nodes per backbone node (Edge single homed)
	c) Each backbone node is connected to no more than 10 nodes (maximum
	   node adjacency is 13, which defines a sparse network)
2. There are 101 areas
	a) 1 backbone area with 100 backbone nodes
	b) 100 edge areas, each with 3 nodes, all homed on the backbone peer
	   group
3. 1,000,000 addresses are advertised in the network
4. maximum node adjacency is 13
	a) sparse network

Assumptions on packet sizes and processing times varied according to product
capabilities associated with the individual estimates given below.

The results for each projected recovery time from three
vendor/service-provider analyses of the above scenario are as follows:

Recovery Time Estimate A - 3.5 hours
Recovery Time Estimate B - 5-15 minutes
Recovery Time Estimate C - 5.5 hours

Clearly there is large variability in the estimates; however, the
expectation is that vendor equipment recovery is not adequate under a large
failure scenario such as the one analyzed.

2.3 Analytic Modeling, Simulation Analysis, and Emulation Analysis

Some analysis models have been published which reveal problematic
performance of LS protocols under various failure conditions [maunder,
atmf00-0249].

There are various simulation capabilities [NS, wandl, opnet,
scalable-networks, makesystems] for analyzing LS protocol behavior under
failure.  We report below an event simulation study [choudhury] on overloads
in large-scale networks and the corresponding recovery times.

Emulation of large failure performance could also be available through
laboratory testing of actual protocol performance of vendor products under
failure conditions. However to date no studies have been published on
failures in large scale networks and the corresponding recovery times.

The results from such simulation and emulation studies are specific to the
type of equipment and network configuration modeled, and therefore are not
completely general.  However, the results to date lend support for the
existence of problems with LS protocols in the face of congestion and
network failures. 

A network-wide event simulation model is reported in [choudhury] to study
the impact of a TSU storm.  It captures the actual congestion seen at
various nodes and accounts for propagation delay between nodes,
retransmissions in case a TSU is not acknowledged, failure of links for TSUs
delayed beyond the node-dead interval, and link recovery following database
synchronization and TSU flooding once the TSU is processed. It approximates
a real network implementation and uses processing times that are roughly in
the same order of magnitude as measured in the real network (of the order of
milliseconds).

There are two categories of IGP messages processed at each node in the
simulation. Category 1 messages are triggered by a timer and include the
Hello refresh, TSU refresh and retransmission packets. Category 2 messages
are not triggered by a timer and include received Hello, received TSU and
received acknowledgements. Timer-triggered messages are given non-preemptive
priority over the other type and are queued in the timer queue.  As a
result, the received Hello packets and the received acknowledgement packets
may see long queuing delays under intense CPU overload.  Table 1 below
shows sample results of the simulation study when applied to a network with
about 300 nodes and 800 links.  The node-adjacency varies from node-to-node
and the maximum node-adjacency is 30.  The Hello interval is assumed to be 5
seconds, the minimum interval between successive SPF (Shortest-Path-First)
calculations is 1 second, and the Node-Dead Interval ("Router-Dead
Interval") is 15 seconds, i.e., a link is declared down if no Hello packet
is received for three successive hello intervals.   During the study, a TSU
storm of size X is created at instant of time 100 seconds where storm-size
is defined as the number of TSUs generated during a storm.  Three cases are
considered with X = 300, 600 and 900 respectively.  Besides the storm, there
are also the normal once-in-thirty-minutes TSU refreshes.  At any given
point of time we define a quantity "dispersion" that is the number of
control packets already generated in the network but not received and
processed in at least one node (each control packet is assumed to carry
three TSUs).
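
To make the dispersion metric concrete, the following minimal Python
sketch computes it from per-packet processing state; the data structures
and names (generated_packets, processed_by) are hypothetical
illustrations, not part of the simulation in [choudhury].

   # Minimal sketch of the "dispersion" metric used in the study.
   # Data structures and names are hypothetical illustrations.

   def dispersion(generated_packets, processed_by, all_nodes):
       """Count control packets already generated in the network but
       not yet received and processed by at least one node."""
       count = 0
       for pkt in generated_packets:
           # processed_by[pkt]: set of nodes that have processed pkt
           if processed_by.get(pkt, set()) != all_nodes:
               count += 1
       return count

   # Example: 3 nodes; packet "p2" not yet processed at node "C"
   nodes = {"A", "B", "C"}
   done = {"p1": {"A", "B", "C"}, "p2": {"A", "B"}}
   print(dispersion(["p1", "p2"], done, nodes))   # -> 1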

Table 1 plots dispersion as a function of time and thereby illustrates the 
impact of a TSU storm on network stability.  Before the TSU storm, the
dispersion due to normal TSU refreshes remains small.  We expect the
dispersion to jump to a high value right after the storm and then come down
to the pre-storm level after some period of time (this happens with X=300
and X=600 but not X=900).  In Table 1, with a TSU storm of size 300, the "heavy
dispersion period" lasted about 11 seconds and no link losses were observed.
With a TSU storm of size 600, the "heavy dispersion period" lasted about 40
seconds.  Some link losses were observed a little after 15 seconds within
the "heavy dispersion period" but eventually all links recovered and the
dispersion came down to the pre-storm level. With a TSU storm of size 900,
the "heavy dispersion period" lasted throughout the simulation period (6
minutes).

======|=====================================================================
      |            Table 1: DISPERSION as a FUNCTION of TIME (in sec)       
      |                     for different TSU Storm Sizes                   
STORM |=====================================================================
SIZE  |100s     106s    110s    115s    140s    170s    230s    330s    370s
======|=====================================================================
 300  | 0        39      3       1       0       1       0       0       0  
------|---------------------------------------------------------------------
 600  | 0       133     120     100      12      1       0       0       0  
------|---------------------------------------------------------------------
 900  | 0       230     215     196     101     119     224     428     488 
======|=====================================================================

The generic observations are as follows:

*       If the initial TSU storm size (e.g., X=300) is such that the delays
experienced by Hello packets are not big enough to cause any link failures
anywhere in the network, the network remains stable and quickly gets back to
a period of "low dispersion".  These types of LSA storms are observed quite
frequently in operational networks, from which the network easily recovers.
*       If the initial TSU storm size (e.g., X=600) is such that the delays
experienced by a few Hello packets in a few nodes cause link failures, then
some secondary TSU storms are generated.  However, the secondary storms do
not keep growing indefinitely and the network remains stable and eventually
gets back to a period of "low dispersion".  This type of LSA storm was
observed in an operational network triggered by a network upgrade, from
which the network recovered but with some difficulty.
*       If the initial TSU storm size (e.g., X=900) is such that the delays
experienced by many Hello packets in many nodes cause link failures, then a
wave of secondary TSU storms is generated.  The network enters an unstable
state and the secondary storms are sustained indefinitely or for a very long
period of time. This type of LSA storm was observed in an operational
network triggered by a network failure (2/20/01 AT&T ATM network failure
discussed in Section 2.1) from which the network did not recover without
corrective steps (manual procedures were used to reduce TSU flooding and
stabilize the network).

The results show that there is a TSU storm threshold above which the network
shows unstable behavior.  It is desirable that TSU flooding reduction be
automatic under overload, and the model could be used to assess the benefits
of various control mechanisms, such as those discussed in the next Section.

3. Proposed Solution Methods

Generally, we wish to find generic LS protocol enhancements such that
networks can control congestion arising from any stimulus. There are general
congestion control/failure recovery methods that have been applied in many
networks, and these should be considered for enhancement to the OSPF and
ISIS protocols. It would be desirable for TSU congestion not to be the
primary factor limiting the size of peer groups. Rather, we would like to be
able to grow a peer group to a very large size in order to reduce
complexity, and also to recover from large failures resulting in widespread
loss of topology database information.

Some candidate protocol mechanisms to allow recovery from large failures are
these:

a)      throttle new connection setups, topology-state updates, and Hello
updates based on automatic congestion control mechanisms, triggered by
congestion at nodes. Congestion may be either due to the processing of
routing update or connection setup messages and is manifested by sustained
CPU busy time.  This is similar to automatic congestion control (ACC)
[Q.764, whitehead, hosein1, hosein2], or dynamic overload control (DOC)
mechanisms used in various networks in which a node can signal other nodes
to limit the rate of control messages sent to it [mummert],
b)      special marking of critical control messages (e.g., Hello and
topology-state-update Ack) so that they may receive prioritized processing
[maunder],
c)      database backup, in which a topology database could be automatically
recovered from loss based on local backup mechanisms,
d)      hitless restart, which allows routes to continue to be used if there
is an uninterrupted data path, even if the control path is interrupted due
to a failure [moy1],
e)      node processor CPU real-time and memory control, which would provide
work process queuing mechanisms to allow lower priority processes to run and
not starve CPU memory and real-time to the point where no processes run,
f)      give necessary priority to network management control messages so
they are not locked out during times of failure
g)      avoid duplicate control messages where possible (e.g., duplicates
sent out on parallel links to the same node).  PNNI already addresses
flooding on parallel links; this is currently proposed for OSPF [moy2].
h)      mechanisms to be able to automatically segment a large peer group
into smaller peer groups, allow these segments to recover their topology and
routing, then combine the smaller peer groups into the original large peer
group and allow that to recover.  Such mechanisms are implemented in
operational networks, such as through network management scripts to segment
the network under severe overload.  Currently these procedures are triggered
manually, however it would be desirable to automate such mechanisms within
the LS protocol.
i)      provide bandwidth allocation mechanisms to limit the amount of link
bandwidth allocation to control traffic versus data traffic.  This
capability exists in PNNI: RCCs are provisioned with traffic parameters
(default parameters are specified in [af-pnni-0055.000]) and traffic is
shaped into the RCC VC at the configured traffic rate.
j)      avoid acknowledging messages that overflow the work queue.  The
point is not to ack a message that is eventually discarded and never
processed; a node should ack only when it knows the message will be
processed.

4. Candidate Mechanisms for Congestion Control/Failure Recovery

In the following sections, we describe in more detail the first four of
these mechanisms in Section 3, and propose that more detailed evaluation be
undertaken for inclusion into OSPF and ISIS. 

4.1 Automatic Congestion Control

Mechanisms are proposed to throttle new connection setups, topology-state
updates, and Hello updates based on automatic congestion control, triggered
by congestion at nodes.

The objective of throttling new connection setups applies directly in the
case of link-state protocols such as PNNI that support connection setup,
and indirectly in the case of OSPF or ISIS, where the concept could well
apply to MPLS setup of label switched paths (LSPs) [mpls-architecture].
That is, in both establishing and modifying LSPs, throttling connection
setups would help reduce the associated control load, and the methods
described below might then apply.

Automatic congestion control (ACC) [Q.764, whitehead, hosein1, hosein2] is
used during node overloads to maintain network throughput.  Such mechanisms
have been used in large-scale voice networks for many years, such as the
AT&T voice network [mummert, ash], which is a large dynamic routing network.
ACC therefore is applicable to both dynamic routing networks and fixed
routing networks.  

In ITU-T Recommendation Q.764, ACC can distinguish different levels of
congestion, for example, a less severe congestion condition (level 1) and
more severe condition (level 2).  If either congestion level is reached, an
ACC information message may be sent to the adjacent nodes indicating the
level of congestion.  The adjacent nodes, when receiving an ACC information
message, should reduce their traffic to the affected overloaded node.  The
ACC level is typically based on the real time utilization and queue length
thresholds in the overloaded node.  The ACC levels can be passed back to the
adjacent nodes, for example, in release messages.  When an adjacent node
finds out that a node is overloaded, it blocks connection requests to the
overloaded node.  For example, if the ACC is set to level 2, then the
adjacent node could block a percentage of traffic and adjust the blocking
rate until ACC congestion level 1 is reached.
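
As a concrete illustration of this control loop, the sketch below maps
local load measurements to an ACC level and shows how an adjacent node
might adapt its blocking percentage; the thresholds and step sizes are
assumptions for illustration, not values specified in Q.764.

   # Illustrative ACC sketch; thresholds and steps are assumptions.

   def acc_level(cpu_util, queue_len):
       """Map local load measurements to an ACC congestion level."""
       if cpu_util > 0.95 or queue_len > 1000:
           return 2            # more severe congestion
       if cpu_util > 0.80 or queue_len > 500:
           return 1            # less severe congestion
       return 0                # no congestion

   def adjust_blocking(current_block_pct, received_level):
       """Adjacent node adapts the fraction of new setups it blocks
       toward the overloaded node, per the advertised ACC level."""
       if received_level == 2:
           return min(current_block_pct + 10, 100)  # shed more
       if received_level == 1:
           return current_block_pct                 # hold steady
       return max(current_block_pct - 10, 0)        # ease off

   pct = 0
   for level in [2, 2, 1, 0, 0]:   # levels reported over time
       pct = adjust_blocking(pct, level)
   print(pct)   # -> 0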

The mechanism described in ITU-T Recommendation Q.764, however, was found to
over-control in response to overload  [hosein1, hosein2].  Various improved
algorithms have been investigated [whitehead, hosein1], and provide good
performance and are relatively simple to implement. The exact mechanisms for
automatic congestion control need to be specified in the follow-on study of
the mechanism.

In the case of applying ACC to new connection setups during congestion,
traffic can be shed by rejecting newly-arriving connection requests back to
the originating node which then either routes around the congestion or
denies the new setup.  This feature is currently being provided within the
signaling congestion control upgrades for PNNI [atm00-0257].

For controlling the level of TSU flooding during congestion, a difficulty is
that with distributed routing, flooding happens from all nodes, which must
continue flooding to their neighbors the information they receive.  There is
no single source at which to flow-control the TSUs flooded to a node.  A
mechanism could be devised in which nodes "advertise" some level of
congestion, and if enough nodes are sufficiently "congested" then all nodes
in the peer group slow down the rate at which they emit TSUs.  As an
example, suppose
the maximum rate at which trunk bandwidth-change TSUs can be generated is
one per five seconds per trunk.  Under a TSU overload, as evidenced by long
CPU busy times at some number of nodes, a mechanism can automatically change
the "five" to say "ten".

The TSU congestion state could be inferred, for example, from not receiving
Hello updates within, say, twice the Hello interval.  We could also have an
explicit way to notify other nodes of congestion, such as a mechanism
similar to the grace-LSA mechanism, where the information is sent in Opaque
LSAs [moy1].
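
A minimal sketch of the peer-group-wide slowdown described above follows;
the trigger threshold is an assumption, while the five-to-ten-second
stretch follows the example in the text.

   # Sketch of peer-group-wide TSU rate slowdown; the congestion
   # signalling encoding (e.g., in Opaque LSAs) is assumed here.

   NORMAL_MIN_INTERVAL = 5.0     # secs between bandwidth-change TSUs
   OVERLOAD_MIN_INTERVAL = 10.0  # stretched interval under overload
   CONGESTED_FRACTION = 0.25     # assumed trigger threshold

   def tsu_min_interval(congested_peers, total_peers):
       """Return the minimum interval between TSU originations,
       given how many peers currently advertise congestion."""
       if total_peers and \
               congested_peers / total_peers >= CONGESTED_FRACTION:
           return OVERLOAD_MIN_INTERVAL
       return NORMAL_MIN_INTERVAL

   print(tsu_min_interval(congested_peers=30, total_peers=100))  # 10.0
   print(tsu_min_interval(congested_peers=5, total_peers=100))   # 5.0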

We could envision an overload bit which is set under LSU storm detection,
and which results in the following actions:

1. The LSU retransmit interval is automatically increased, thus throttling
LSU's as called for above.  The LSU retransmit interval, Rxmt, refers to
three things:
a)	In the sending and receiving of LS database packets two nodes form a
master/slave relationship.  Each LS database packet has a sequence number,
and each database packet sent by the master is acknowledged by the slave
through echoing of the sequence number.  Both packets sent by the master and
their responses contain summaries of LS data.  The master is the only one
allowed to retransmit LS database packets.  It does so only at fixed
intervals, the length of which is configured per interface and denoted as
the Rxmt interval.
b)	The link state request packets that are not satisfied are resent at
fixed intervals (the Rxmt interval),
c)	The time interval between LSU retransmissions, for adjacencies
belonging to this interface. 
Increasing Rxmt helps slow the rate of LSUs, including retransmits, thus
reducing congestion.

2. Define an interval greater than the node dead interval, called the
adjacency breaking interval. When the actual node dead interval expires, we
do not put the link down; instead we update the node LSU.  If Hellos are
received before the adjacency breaking interval is reached, the adjacency is
not broken. This would prevent adjacencies from breaking and causing a
secondary flooding storm, while we still send out the correct LSU
information for the SPF calculation. This would also result in less data
traffic going through the link in case of problems on that link, as the node
LSU will not show the link as having transit capability.

3. Increase the Hello interval and the node dead interval.  However we need
to ensure that adjacencies are not broken in case of a Hello or node dead
interval mismatch.  After the network stabilizes, we decrease the intervals
and then set the overload bit off. 

4. Do not do any incremental SPF and, if required, increase the SPF
interval.  A sketch combining these overload-bit actions follows.
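
The following sketch gathers the four actions above into a single
overload-bit handler; the class, field names, and scaling factors are
illustrative assumptions, not values from OSPF or ISIS.

   # Sketch of the overload-bit actions (items 1-4 above).
   # All scaling factors are illustrative assumptions.

   class TimerConfig:
       def __init__(self):
           self.rxmt_interval = 5.0        # LSU retransmit interval (s)
           self.hello_interval = 5.0       # Hello interval (s)
           self.node_dead_interval = 15.0  # 3 x hello_interval
           self.spf_hold_interval = 1.0    # min gap between SPF runs
           self.incremental_spf = True

   def apply_overload_bit(cfg, overloaded):
       if overloaded:
           cfg.rxmt_interval *= 2          # 1. throttle retransmissions
           # 2. keep adjacencies past node-dead; only update node LSU
           cfg.adjacency_breaking_interval = 2 * cfg.node_dead_interval
           cfg.hello_interval *= 2         # 3. slow Hellos, dead timer
           cfg.node_dead_interval *= 2
           cfg.incremental_spf = False     # 4. no incremental SPF
           cfg.spf_hold_interval *= 2      #    and longer SPF interval
       return cfg

   cfg = apply_overload_bit(TimerConfig(), overloaded=True)
   print(cfg.rxmt_interval, cfg.node_dead_interval)  # -> 10.0 30.0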

To prevent LSU flooding storms in the stable state, we could use optimized
refreshment, in which not all LSUs are refreshed at the same time, thus
preventing a flooding storm; a method for this is given in [refresh-guide].
To reduce the rate of flooding during database synchronization, the master
node should not immediately send a new database description packet if either
the master node or the slave node is congested.
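
A minimal sketch of such optimized refreshment follows, assuming a simple
random spread of refresh instants across the 30-minute refresh interval;
this is one possible realization of the guidance in [refresh-guide].

   # Sketch: spread LSU refresh instants over the refresh interval so
   # that refreshes do not all fire at once (cf. [refresh-guide]).
   import random

   REFRESH_INTERVAL = 30 * 60   # 30 minutes, in seconds

   def schedule_refreshes(lsu_ids, now):
       """Assign each LSU a refresh time jittered over the interval."""
       return {lsu: now + random.uniform(0, REFRESH_INTERVAL)
               for lsu in lsu_ids}

   times = schedule_refreshes(["lsu-%d" % i for i in range(5)], now=0.0)
   for lsu, t in sorted(times.items()):
       print(lsu, round(t, 1))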

Generally, when a node detects that either it or a connected node is
congested, it should reduce the rate of flooding. The mechanism could be
something like a sliding window control, in which the size of the window is
equal to the approximate number of LSUs that an adjacent node can process.
This helps in the case when a node has to flood thousands of LSUs, which at
present it would do immediately, perhaps causing the other end to become
overloaded and resulting in further congestion through innumerable
retransmits.  With a sliding window control, we would limit the rate of LSU
flooding, thus avoiding overload at the other end.
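
A minimal sketch of the sliding window control, assuming a hypothetical
per-neighbor window size:

   # Sketch of window-limited LSU flooding toward one neighbor.
   # The window size is an assumed per-neighbor capacity estimate.
   from collections import deque

   class FloodingWindow:
       def __init__(self, window=50):
           self.window = window         # max unacknowledged LSUs
           self.pending = deque()       # LSUs waiting to be flooded
           self.unacked = set()         # LSUs sent but not yet acked

       def enqueue(self, lsu_id):
           self.pending.append(lsu_id)
           self._send_more()

       def ack(self, lsu_id):
           self.unacked.discard(lsu_id)
           self._send_more()

       def _send_more(self):
           while self.pending and len(self.unacked) < self.window:
               lsu = self.pending.popleft()
               self.unacked.add(lsu)
               print("flood", lsu)     # stand-in for the actual send

   fw = FloodingWindow(window=2)
   for i in range(4):
       fw.enqueue("lsu-%d" % i)        # only lsu-0, lsu-1 go out now
   fw.ack("lsu-0")                     # frees a slot; lsu-2 goes out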

The exact mechanisms for automatic congestion control need to be specified
in the follow-on study of the mechanism.

4.2 Special Marking of Critical Control Messages 

Special marking is proposed to quickly identify critical control messages
(e.g., Hello and TSU Ack) so that they may receive prioritized processing.
This mechanism has been proposed for OSPF [maunder].  The benefit and use of
this capability has been illustrated in Section 2.3.  It is useful to mark
control messages, such as TSU Acks and Hello packets, with a priority
marking so that the work-load management process within a node can easily
distinguish high priority work from lower priority work [maunder].  In the
case of ATM networks, this can perhaps be accomplished through the use of a
separate VCI.  

Database description packets from a slave node could be prioritized in the
master node, as each is equivalent to an Ack for the corresponding database
description packet from the master.  In addition, packets could also be
prioritized based on:
a)	the state of the sending neighbor.  The higher the state of the
neighbor interaction, the higher the priority of packets from it.  In OSPF,
for example, packets from neighbors in a state greater than Exchange could
be given higher priority than those in a lower state,
b)	whether the packets convey "bad" information,
c)	LSUs that carry a change versus LSUs that carry no change.  LSUs
generated by LSU refresh generally do not carry any change, so they should
be given lower priority than LSUs transmitted because of a change. Delay in
sending a refresh will not cause any harm, as the previous version of the
same LSU is already in the network. However, packets with a change could
change the routing table, and hence need to be prioritized.  One possible
classification is sketched below.
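
A minimal sketch of such a classification function, assuming hypothetical
packet fields (kind, from_slave, changed, bad_news, neighbor_state):

   # Sketch: classify received IGP packets into priority classes per
   # the criteria above. Field names are hypothetical illustrations.

   HIGH, MEDIUM, LOW = 0, 1, 2   # smaller value = served first

   def classify(pkt):
       if pkt["kind"] in ("hello", "tsu_ack"):
           return HIGH                     # critical control messages
       if pkt["kind"] == "db_description" and pkt.get("from_slave"):
           return HIGH                     # acts as an implicit Ack
       if pkt["kind"] == "tsu":
           if pkt.get("bad_news") or pkt.get("changed"):
               return MEDIUM               # changed/"bad" info first
           return LOW                      # pure refresh, no change
       # packets from neighbors in higher adjacency states first
       return MEDIUM if pkt.get("neighbor_state", 0) > 4 else LOW

   pkts = [{"kind": "tsu", "changed": False},
           {"kind": "hello"},
           {"kind": "tsu", "changed": True}]
   for p in sorted(pkts, key=classify):
       print(p["kind"], p.get("changed"))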

In addition, it is necessary to provide a flexible scheduling mechanism for
CPU real-time and memory that gives each process a certain minimal share
even during overload and does not starve any process.  CPU overload, if not
managed properly, can result in poor utilization of CPU real-time and
memory.  For example, during an overload in the receipt of control messages
such as TSUs and Hello messages, it is useful for node processors to manage
CPU real-time and memory so that node resources are not exhausted and useful
work is always accomplished.  Various means of workload management can be
employed. 

The exact mechanisms for special marking of critical control messages, and
for processor real-time and memory management, need to be specified in the
follow-on study of the mechanism.

4.3 Database Backup

A common practice in large networks is to provide a method of locally
backing up the LS database information.  Then in the case of a failure
condition in which database information is lost, the backup database is
accessed and used to reinitialize the LS database.  A mechanism to recover
any lost updates of database information, that is, those updates lost during
the interval from the time of failure to the recovery time, would be needed.

The concept of a database backup therefore is not new, however, it is new in
the context of using database backup together with LS protocols.  Database
backup has worked in practice for about 25 years in the AT&T core voice
network, which is a large-scale distributed, dynamic routing network (uses
RTNR, with class-of-service capabilities [ash]).  The intuition behind the
database backup concept is simple: if you lose your whole database (just to
take that example), it is better to reinitialize with a local copy, which
can be expected to have a large percentage of correct, up-to-date
information.  The alternative of flooding the entire database from other
nodes is far more time consuming and less efficient.  

Getting the database back to 100% accuracy after restoration from the local
backup is an issue that needs to be addressed.  To further define the
concept, we illustrate two possible implementations below; there are many
possible implementations, and these two are not necessarily being proposed.

One way to view the LS database at each node is to consider a decomposition
into the slow information database (SIDB) and the fast information database
(FIDB).  Information in the SIDB changes slowly, for example:

*       max link bandwidth
*       metric
*       reachable addresses

Information in the FIDB changes rapidly, for example:
*       available link bandwidth (e.g., real-time updates)

Because the data is backed up, SIDB and FIDB information would only be
updated when changes occurred, and need not be periodically refreshed.
Under a total LS database failure scenario, as discussed in Section 2, a
failed node restores its most recent database and advertises the time-stamp
of the most recent SIDB update received.  Other nodes would then know which
SIDB updates to restore to the node.  FIDB information is flooded as normal
and not restored through the most-recent time-stamp mechanism; eventually
the FIDB would re-sync.
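
A minimal sketch of the SIDB restore sequence described above follows; the
update records, message structures, and helper names are hypothetical
illustrations.

   # Sketch of SIDB restore after total database loss. The update
   # and message structures are hypothetical illustrations.

   def restore_from_backup(backup_sidb):
       """Reload the slow information database from local backup and
       return the time-stamp of its most recent update."""
       last_ts = max((u["ts"] for u in backup_sidb.values()), default=0)
       return dict(backup_sidb), last_ts

   def updates_newer_than(peer_sidb_log, ts):
       """Peer-side: select SIDB updates the restarting node missed."""
       return [u for u in peer_sidb_log if u["ts"] > ts]

   backup = {"linkA": {"ts": 100, "metric": 10},
             "linkB": {"ts": 250, "metric": 20}}
   sidb, last_ts = restore_from_backup(backup)

   peer_log = [{"ts": 200, "key": "linkA", "metric": 15},
               {"ts": 300, "key": "linkB", "metric": 25}]
   for u in updates_newer_than(peer_log, last_ts):   # only ts=300 sent
       sidb[u["key"]] = {"ts": u["ts"], "metric": u["metric"]}
   print(sidb["linkB"])   # -> {'ts': 300, 'metric': 25}

The FIDB is not restored this way; it re-syncs through normal flooding, as
noted above.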

After a failure, a node bases its routing on the current database, derived
as above.  Some of the database information will be somewhat out of sync,
and routing will not be optimal; however, the routing would suffice until the
database re-syncs completely.  Since the SIDB information changes slowly,
restoring that data should require a minimal amount of flooding.  Since the
FIDB information is not restored, the total amount of flooding after a
database failure would be vastly reduced.  Since the TSU flooding is slowed
down after the failure event, for a while information from the old copy of
the database would be used for routing purposes but as TSUs trickle in, the
database should be gradually populated with new information.  This would
mean somewhat inefficient routing and increased crankbacks for a while but
would improve stability due to slowed-down TSU flooding.

Another alternative is to use the stable information as soon as a node comes
back up, as suggested above, but without attempting to reduce flooding in
the network; that is, the stable information is still re-flooded.  This can
be achieved by
each node maintaining a shadow copy of the stable information, as described
above.  When a node comes back up, it routes based on the backed up shadow
copy, and even if the current information collected through the routing
protocol is not complete, the node can still use the old information for
routing.  In this approach, each node could explicitly specify what
information is stable, and this may be easier than globally agreeing upon
the information that is stable versus the information that is unstable.  In
this approach, the overall control load may be reduced by fewer instances of
failed routing which result from incomplete database information.

Another mechanism for database backup would be to have a redundant backup
completely in sync with the primary database.  Upon database failure, the
node would automatically switch to the backup.

Another mechanism is to back up all the non-self-originated LSUs, as well
as the states of interfaces.  Then, when the network is recovering, a node
can regenerate all of its own LSUs (node, network, and summary LSUs).

The above are just illustrative examples, which are not necessarily
proposed.  The exact mechanisms for using LS backup and database backup
need to be specified in the follow-on study of the mechanism.

4.4 Hitless Restart

Nodes normally implement a separation of control and forwarding functions,
wherein certain processors are dedicated to control and management tasks
such as routing, while other processors perform the data forwarding tasks.
This separation creates the possibility of maintaining a node's data
forwarding capability while the control software is restarted/reloaded,
which is known as "hitless restart".

Under normal operation, LS protocols intentionally route around a restarting
node while it rebuilds its LS database.  Avoidance is accomplished by having
the node's neighbors reissue their TSUs, omitting links to the restarting
node.  However, if the network topology remains stable and the restarting
node is able to keep its forwarding table(s) across the restart, it would be
possible to keep the restarting node on the forwarding path.  
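
As a sketch of the neighbor-side decision, the helper below keeps
announcing the link to a restarting node for a bounded grace period,
provided the topology stays stable; this loosely follows the grace-period
model of [moy1], with the parameter values and stability test assumed.

   # Sketch of a neighbor's decision during a peer's hitless restart,
   # loosely following the grace-period model of [moy1]. Parameter
   # values and the stability test are assumptions.

   GRACE_PERIOD = 120.0   # seconds the neighbor will "help" a restart

   def keep_announcing_link(restart_start, now, topology_changed):
       """Keep the link to the restarting node in our TSUs only while
       the grace period holds and the topology has stayed stable."""
       if topology_changed:
           return False     # fall back: reissue TSUs omitting links
                            # to the restarting node
       return (now - restart_start) <= GRACE_PERIOD

   print(keep_announcing_link(0.0, 60.0, topology_changed=False))  # True
   print(keep_announcing_link(0.0, 60.0, topology_changed=True))   # False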

Mechanisms have been proposed for implementing hitless restart within LS
protocols [moy1].  [zinin] in addition enables recovery from a node crash in
an active/standby configuration, or other multiprocessor model.  The exact
mechanisms for hitless restart need to be specified in the follow-on study
of the mechanism.

5. Proposals

Proposal A: 

We propose that the OSPF and ISIS working groups start a new work item on
OSPF and ISIS routing congestion.

Proposal B: 

We propose that the OSPF and ISIS working groups include the four candidate
mechanisms in a more detailed evaluation of their inclusion in the OSPF and
ISIS protocols for congestion control/failure recovery for OSPF and ISIS
networks.  We suggest that the OSPF and ISIS working groups explore the
suggested solution mechanisms in Sections 3-4: 

a)      automatic congestion control mechanisms to throttle new connection
setups, topology-state updates, and Hello updates, triggered by congestion
at nodes,
b)      special marking of critical control messages (e.g., Hello and
topology-state-update Ack) so that they may receive prioritized processing,
c)      database backup, in which a topology database could be automatically
recovered from loss based on local backup mechanisms, and 
d)      hitless restart, which allows routes to continue to be used if there
is an uninterrupted data path, even if the control path is interrupted due
to a failure.

6. Security Considerations

There are no new security considerations based on proposals in this draft.

7. Acknowledgments

The authors gratefully acknowledge the comments and suggestions from many
people.  At AT&T we thank Margaret Chiosi, Enrique Cuevas, Tom Helstern,
Steve Holmgren, Aswatnarayan Raghuram, and Mike Shapiro.  We also thank
Shawn McAllister at Alcatel, Marty Albright at Worldcom, Sohel Khan at
Sprint, Ghassem Koleni at Nortel Networks, Thomas Cornely and Greg Wetzel
at Covad, Bob Dianda at Sprint, and Jeff Parker at Axiowave.

8. References

[af-pnni-0055.000]  "Private Network-Network Interface Specification Version
1.0 (PNNI 1.0)," March 1996.

[ash]  Ash, G. R., "Dynamic Routing in Telecommunications Networks," McGraw
Hill.

[atmf00-0249] "Scalability and Reliability of large ATM networks."

[atm00-0257] "Signaling Congestion Control in PNNI Networks: The Need and
Proposed Solution Outline."

[atm00-0480] "Congestion Control/Failure Recovery in PNNI Networks."

[atm01-0101] "Proposed Mechanisms for Congestion Control/Failure Recovery in
PNNI Networks."

[att]  "AT&T announces cause of frame-relay network outage," AT&T Press
Release, April 22, 1998.

[btd-cs-congestion-02.00] "Signaling Congestion Control Version 1.0",
Baseline Text

[cholewka]  Cholewka, K., "MCI Outage Has Domino Effect," Inter@ctive Week,
August 20, 1999.

[choudhury]  Choudhury, G., Maunder, A. S., Sapozhnikova, V., "Faster
Link-State Convergence and Improving Network Scalability and Stability,"
presentation at LCN 2001.

[hosein1] Hosein, P., "An Improved ACC Algorithm for Telecommunication
Networks," Telecommunication Systems 0, 1998.

[hosein2] Hosein, P., "Overload Control for Real-Time Telecommunication
Databases," International Teletraffic Congress - 16, Edinburgh, Scotland,
June 1999.

[jander]  Jander, M., "In Qwest Outage, ATM Takes Some Heat," Light Reading,
April 6, 2001. 

[maunder] Maunder, A. S., et al., "Explicit Marking and Prioritized
Treatment of Specific IGP Packets for Faster IGP Convergence and Improved
Network Scalability and Stability," work in progress.

[mpls-architecture] Rosen, E., Viswanathan, A., Callon, R., "Multiprotocol
Label Switching Architecture," RFC 3031, January 2001.

[mummert] Mummert, V. S., "Network Management and its Implementation on the
No. 4ESS," International Switching Symposium, Japan, 1976.

[moy1] Moy, J., "Hitless OSPF Restart", work in progress.

[moy2]  Moy, J., "Flooding over parallel point-to-point links," work in
progress.

[moy3]  Moy, J., "Flooding Over a Subset Topology," work in progress.

[murphy]  Murphy, P., "OSPF Floodgates," work in progress.

[pappalardo1] Pappalardo, D., "Can one rogue switch buckle AT&T's network?,"
Network World Fusion, February 23, 2001.

[pappalardo2]  Pappalardo, D., "AT&T, customers grapple with ATM net
outage," Network World, February 26, 2001.

[Q.764]  "Signalling System No. 7 - ISDN user part signalling procedures,"
December 1999.

[refresh-guide]  Zinin, A., "Guidelines for Efficient LSA Refreshment in
OSPF," work in progress.

[whitehead]  Whitehead, Martin, "A class of overload controls based on
controlling call reject rates," ITU-T contribution D.19, February 2001.

[zinin]  Zinin, A., et al., "OSPF Restart Signaling," work in progress.

9. Authors' Addresses

Jerry Ash
AT&T
Room MT D5-2A01
200 Laurel Avenue
Middletown, NJ 07748, USA
Phone: 732-420-4578
Email: gash@att.com

Gagan L. Choudhury
AT&T
Room D5-3C21
200 S.Laurel Avenue
Middletown, NJ 07748
Phone: 732 420-3721
E-mail: gchoudhury@att.com

Jeff Han
AT&T
Room C5-2C34
200 S. Laurel Avenue
Middletown, NJ 07748
Phone: 732 420-2438
E-mail: jhan@att.com

Vishwas Manral
NetPlane
189, Prashasan Nagar,
Road number 72
Jubilee Hills
Hyderabad
India
E-mail: vishwasm@netplane.com

Anurag S. Maunder
Sanera Systems,
370 San Aleso Avenue
Sunnyvale, CA 94085
Ph: 408-734-6123
E-mail: amaunder@sanera.net

Mahmood Noorchashm
AT&T
Room C2-2A22
200 S. Laurel Avenue
Middletown, NJ 07748
Phone: 732-420-4185
E-mail: noorchashm@att.com

Vera D. Sapozhnikova
AT&T 
Room C5-2C29
200 S. Laurel Avenue
Middletown, NJ 07748
Phone: 732-420-2653
E-mail: sapozhnikova@att.com

Mostafa Hashem Sherif 
AT&T 
Room C5-2D18
200 S. Laurel Avenue
Middletown, NJ 07748
Phone: 732-420-2448
E-mail: mhs@att.com

10. Copyright Statement

Copyright (C) The Internet Society (1998). All Rights Reserved.

This document and translations of it may be copied and furnished to 
others, and derivative works that comment on or otherwise explain it or 
assist in its implementation may be prepared, copied, published and 
distributed, in whole or in part, without restriction of any kind, 
provided that the above copyright notice and this paragraph are included 
on all such copies and derivative works.

However, this document itself may not be modified in any way, such as by 
removing the copyright notice or references to the Internet Society or 
other Internet organizations, except as needed for the purpose of 
developing Internet standards in which case the procedures  for 
copyrights defined in the Internet Standards process must be followed, 
or as required to translate it into languages other than English.

The limited permissions granted above are perpetual and will not be 
revoked by the Internet Society or its successors or assigns.

This document and the information contained herein is provided on an "AS 
IS" basis and THE INTERNET SOCIETY AND THE INTERNET ENGINEERING TASK 
FORCE DISCLAIMS ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING BUT NOT 
LIMITED TO ANY WARRANTY THAT THE USE OF THE INFORMATION HEREIN WILL NOT 
INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR 
FITNESS FOR A PARTICULAR PURPOSE.