Network Working Group X. Xu
Internet-Draft China Mobile
Intended status: Standards Track 1 July 2024
Expires: 2 January 2025
Fully Adaptive Routing Ethernet using BGP
draft-xu-idr-fare-00
Abstract
Large language models (LLMs) like ChatGPT have become increasingly
popular in recent years due to their impressive performance in
various natural language processing tasks. These models are built by
training deep neural networks on massive amounts of text data, often
consisting of billions or even trillions of parameters. However, the
training process for these models can be extremely resource-
intensive, requiring the deployment of thousands or even tens of
thousands of GPUs in a single AI training cluster. Therefore, three-
stage or even five-stage CLOS networks are commonly adopted for AI
networks. The non-blocking nature of the network becomes increasingly
critical for large-scale AI models. Therefore, adaptive routing is
necessary to dynamically load balance traffic to the same destination
over multiple ECMP paths, based on network capacity and even
congestion information along those paths.
Requirements Language
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
"SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
document are to be interpreted as described in RFC 2119 [RFC2119].
Status of This Memo
This Internet-Draft is submitted in full conformance with the
provisions of BCP 78 and BCP 79.
Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF). Note that other groups may also distribute
working documents as Internet-Drafts. The list of current Internet-
Drafts is at https://datatracker.ietf.org/drafts/current/.
Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use Internet-Drafts as reference
material or to cite them other than as "work in progress."
This Internet-Draft will expire on 2 January 2025.
Xu Expires 2 January 2025 [Page 1]
Internet-Draft FARE using BGP July 2024
Copyright Notice
Copyright (c) 2024 IETF Trust and the persons identified as the
document authors. All rights reserved.
This document is subject to BCP 78 and the IETF Trust's Legal
Provisions Relating to IETF Documents (https://trustee.ietf.org/
license-info) in effect on the date of publication of this document.
Please review these documents carefully, as they describe your rights
and restrictions with respect to this document. Code Components
extracted from this document must include Revised BSD License text as
described in Section 4.e of the Trust Legal Provisions and are
provided without warranty as described in the Revised BSD License.
Table of Contents
1. Introduction
2. Terminology
3. Path Bandwidth Extended Community
4. Solution Description
4.1. Adaptive Routing in 3-stage CLOS
4.2. Adaptive Routing in 5-stage CLOS
5. Acknowledgements
6. IANA Considerations
7. Security Considerations
8. References
8.1. Normative References
8.2. Informative References
Author's Address
1. Introduction
Large language models (LLMs) like ChatGPT have become increasingly
popular in recent years due to their impressive performance in
various natural language processing tasks. These models are built by
training deep neural networks on massive amounts of text data, often
consisting of billions or even trillions of parameters. However, the
training process for these models can be extremely resource-
intensive, requiring the deployment of thousands or even tens of
thousands of GPUs in a single AI training cluster. Therefore, three-
stage or even five-stage CLOS networks are commonly adopted for AI
networks. Furthermore, in rail-optimized CLOS topologies with
standard GPU servers (HB domain of eight GPUs), the Nth GPU of each
server in a group of servers is connected to the Nth leaf switch,
which provides higher bandwidth and non-blocking connectivity between
the GPUs in the same rail. In a rail-optimized topology, most traffic
between GPU servers traverses the intra-rail networks rather
than the inter-rail networks.
The non-blocking nature of the network, especially the network for
intra-rail communication, becomes increasingly critical for large-
scale AI models. AI workloads tend to be extremely bandwidth-hungry,
and they usually generate a few elephant flows simultaneously. If
traditional hash-based ECMP load balancing is used without any
optimization, multiple elephant flows are likely to be hashed onto
the same link, causing serious congestion and high latency in the
network. Since the job completion time depends on worst-case
performance, serious congestion makes model training take longer
than expected. Therefore, adaptive routing is necessary
to dynamically load balance traffic to the same destination over
multiple ECMP paths, based on network capacity and even congestion
information along those paths. In other words, adaptive routing is a
capacity-aware and even congestion-aware path selection algorithm.
Furthermore, to minimize the congestion risk, the
routing should be as granular as possible. Flow-granular adaptive
routing still has a certain statistical possibility of congestion.
Therefore, packet-granular adaptive routing is more desirable,
although packet spraying causes out-of-order delivery. A
flexible reordering mechanism must therefore be put in place (e.g.,
at egress ToRs or the receiving servers). Recent optimizations for
RoCE and newly invented transport protocols as alternatives to RoCE
no longer require handling out-of-order delivery at the network layer.
Instead, the message processing layer is used to address it.
To enable adaptive routing, whether flow-granular or
packet-granular, it is necessary to propagate
network topology information, including link capacity and/or
available link capacity (i.e., link capacity minus link load), across
the CLOS network. Therefore, it seems straightforward to use link-
state protocols such as OSPF or IS-IS as the underlay routing protocol
in the CLOS network, instead of BGP, for propagating link capacity
information and/or available link capacity information. How to
leverage OSPF or IS-IS to achieve adaptive routing has been described
in [I-D.xu-lsr-fare]. However, some data center network operators
are accustomed to using BGP as the underlay routing protocol of
data center networks [RFC7938]. Therefore, there is a need to
leverage BGP to achieve adaptive routing as well.
[I-D.ietf-idr-link-bandwidth] has specified a way to perform weighted
ECMP based on link bandwidths conveyed in the non-transitive link
bandwidth extended community. However, it is impractical to enable
adaptive routing by directly using the non-transitive link bandwidth
extended community due to the following constraints stated in
[I-D.ietf-idr-link-bandwidth]:
"No more than one link bandwidth extended community SHALL be attached
to a route. Additionally, if a route is received with link bandwidth
extended community and the BGP speaker sets itself as next-hop while
announcing that route to other peers, the link bandwidth extended
community should be removed. The extended community is optional non-
transitive."
Hence, this document defines a new extended community referred to as
Path Bandwidth Extended Community and describes how to use this newly
defined path bandwidth extended community to achieve adaptive
routing.
Note that while adaptive routing, especially at the packet-granular
level, can help reduce congestion between switches in the network,
thereby achieving a non-blocking fabric, it does not address the
incast congestion issue which is commonly experienced in last-hop
switches that are connected to the receivers in many-to-one
communication patterns. Therefore, a congestion control mechanism is
always necessary between the sending and receiving servers to
mitigate such congestion.
2. Terminology
This memo makes use of the terms defined in [RFC4360].
3. Path Bandwidth Extended Community
The Path Bandwidth Extended Community is used to indicate the minimum
bandwidth of the path towards the destination. It is a new IPv4
Address Specific Extended Community that can be transitive or non-
transitive.
The value of the high-order octet of this extended type is either
0x01 or 0x41. The low-order octet of this extended type is TBD.
The Value field consists of two sub-fields:
Global Administrator sub-field: This sub-field contains the router
ID of the advertising router that appends the path bandwidth
extended community or updates the path bandwidth value of the
existing path bandwidth extended community.
Local Administrator sub-field: This sub-field contains the path
bandwidth value in IEEE floating point format with units of
Gigabytes per second (GB/s).
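As an illustration only, the layout described above can be sketched
in Python. The sketch assumes the 8-octet IPv4 Address Specific
Extended Community layout of [RFC4360] (a 2-octet type, a 4-octet
Global Administrator and a 2-octet Local Administrator), and
therefore assumes that the bandwidth value is carried as a 2-octet
IEEE 754 half-precision number; this document does not state the
floating point width. The low-order type octet is TBD, so it is a
required parameter here. The function names are hypothetical.

```python
import ipaddress
import struct

# High-order type octets given in this section.
TRANSITIVE_IPV4_TYPE = 0x01
NON_TRANSITIVE_IPV4_TYPE = 0x41


def encode_path_bandwidth(router_id, bandwidth, subtype, transitive=True):
    """Build the 8 octets of a path bandwidth extended community.

    router_id: router ID of the node that appends or updates the community.
    bandwidth: path bandwidth in the units defined in this section.
    subtype:   low-order type octet (TBD in this document, so the caller
               must supply it).
    ASSUMPTION: the 2-octet Local Administrator sub-field holds an
    IEEE 754 half-precision value (struct format "e").
    """
    type_high = TRANSITIVE_IPV4_TYPE if transitive else NON_TRANSITIVE_IPV4_TYPE
    global_admin = ipaddress.IPv4Address(router_id).packed
    # "!" selects network byte order with no padding.
    return struct.pack("!BB4se", type_high, subtype, global_admin, bandwidth)


def decode_path_bandwidth(data):
    """Return (type_high, subtype, router_id, bandwidth) from 8 octets."""
    type_high, subtype, global_admin, bandwidth = struct.unpack("!BB4se", data)
    return type_high, subtype, str(ipaddress.IPv4Address(global_admin)), bandwidth
```

Under the half-precision assumption the largest finite value that can
be carried is 65504, and larger inputs make struct.pack raise
OverflowError.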
4. Solution Description
4.1. Adaptive Routing in 3-stage CLOS
+----+ +----+ +----+ +----+
| S1 | | S2 | | S3 | | S4 | (Spine)
+----+ +----+ +----+ +----+
+----+ +----+ +----+ +----+ +----+ +----+ +----+ +----+
| L1 | | L2 | | L3 | | L4 | | L5 | | L6 | | L7 | | L8 | (Leaf)
+----+ +----+ +----+ +----+ +----+ +----+ +----+ +----+
Figure 1
(Note that the diagram above does not include the connections between
nodes. However, it can be assumed that leaf nodes are connected to
every spine node in their CLOS topology.)
In a three-stage CLOS network as shown in Figure 1, also known as a
leaf-spine network, each leaf node would establish eBGP sessions with
all spine nodes.
All nodes are enabled for adaptive routing.
When a leaf node, such as L1, advertises the route to a specific IP
prefix that it originates, it will attach a transitive path bandwidth
extended community filled with a maximum bandwidth value.
Upon receiving the above advertisement, a spine node, such as S1,
SHOULD determine the minimum value between the bandwidth of the link
towards the advertising node (e.g., L1) and the value of the path
bandwidth extended community carried in the received route, and then
update the path bandwidth extended community with the above minimum
value before readvertising that route to remote eBGP peers. Once S1
receives multiple equal-cost routes for a given prefix from multiple
leaf nodes (e.g., L1 and L2 in the server multi-homing scenario), for
each route, it SHOULD determine the minimum value between the
bandwidth of the link towards the advertising node and the value of
the path bandwidth extended community carried in the received route,
and then use that minimum bandwidth value as a weight value for that
route when performing weighted ECMP. When readvertising the route
for that prefix to remote eBGP peers, the path bandwidth extended
community would be updated with the sum of the per-route minimum
bandwidth values.
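The spine-node processing described above can be sketched as follows.
This is a minimal illustration, not a definitive implementation: the
ReceivedRoute type and the function names are hypothetical, and
bandwidth values are plain floats in the units of Section 3.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class ReceivedRoute:
    neighbor: str          # advertising node, e.g. "L1"
    link_bandwidth: float  # bandwidth of the local link towards that node
    path_bandwidth: float  # value carried in the received community


def effective_bandwidth(route: ReceivedRoute) -> float:
    """Minimum of the local link bandwidth and the received path bandwidth."""
    return min(route.link_bandwidth, route.path_bandwidth)


def spine_process(routes: List[ReceivedRoute]) -> Tuple[Dict[str, float], float]:
    """Return (per-neighbor ECMP weights, path bandwidth to readvertise).

    With a single route the readvertised value is simply the minimum;
    with several equal-cost routes it is the sum of the per-route minima.
    """
    weights = {r.neighbor: effective_bandwidth(r) for r in routes}
    return weights, sum(weights.values())
```

For example, if S1 receives the prefix from L1 over a link of
bandwidth 50 and from L2 over a link of bandwidth 25, and both routes
carry a path bandwidth of 100, the weights are 50 and 25 and the
readvertised path bandwidth is 75.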
When a leaf node, such as L8, receives multiple equal-cost routes for
that prefix from spine nodes (e.g., S1, S2, S3 and S4), for each
route, it will determine the minimum value between the bandwidth of
the link towards the advertising node and the value of the path
bandwidth extended community carried in the received route, and then
use that minimum bandwidth value as a weight value for that route
when performing weighted ECMP.
Note that weighted ECMP according to path bandwidth SHOULD NOT be
performed unless all equal-cost routes for a given prefix carry the
path bandwidth extended community.
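A receiving leaf node's weight computation, including the rule that
weighted ECMP is not performed unless every equal-cost route carries
the community, might look like the sketch below. The function names
are hypothetical. Falling back to equal weights when a community is
missing, or when all minima are zero, is an assumption of this sketch
and is not specified by this document. The weighted random choice in
pick_next_hop is only one possible way to apply the weights per
packet or per flow.

```python
import random
from typing import Dict, Optional


def ecmp_weights(link_bw: Dict[str, float],
                 path_bw: Dict[str, Optional[float]]) -> Dict[str, float]:
    """Return normalized weights keyed by next hop.

    link_bw: bandwidth of the local link towards each advertising node.
    path_bw: received path bandwidth per advertising node, or None when
             the route does not carry the community.
    """
    equal = {nh: 1.0 / len(link_bw) for nh in link_bw}
    if any(path_bw.get(nh) is None for nh in link_bw):
        # ASSUMPTION: fall back to plain (unweighted) ECMP.
        return equal
    raw = {nh: min(link_bw[nh], path_bw[nh]) for nh in link_bw}
    total = sum(raw.values())
    if total <= 0:
        # ASSUMPTION: no usable bandwidth information, use equal weights.
        return equal
    return {nh: w / total for nh, w in raw.items()}


def pick_next_hop(weights: Dict[str, float], rng: random.Random) -> str:
    """Choose one next hop with probability proportional to its weight."""
    hops = list(weights)
    return rng.choices(hops, weights=[weights[h] for h in hops], k=1)[0]
```

With four spine nodes reached over links of bandwidth 50 and received
path bandwidths of 50, 50, 25 and 25, the resulting weights are 1/3,
1/3, 1/6 and 1/6.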
4.2. Adaptive Routing in 5-stage CLOS
=========================================
# +----+ +----+ +----+ +----+ #
# | L1 | | L2 | | L3 | | L4 | (Leaf) #
# +----+ +----+ +----+ +----+ #
# PoD-1 #
# +----+ +----+ +----+ +----+ #
# | S1 | | S2 | | S3 | | S4 | (Spine) #
# +----+ +----+ +----+ +----+ #
=========================================
=============================== ===============================
# +----+ +----+ +----+ +----+ # # +----+ +----+ +----+ +----+ #
# |SS1 | |SS2 | |SS3 | |SS4 | # # |SS1 | |SS2 | |SS3 | |SS4 | #
# +----+ +----+ +----+ +----+ # # +----+ +----+ +----+ +----+ #
# (Super-Spine@Plane-1) # # (Super-Spine@Plane-4) #
=============================== ... ===============================
=========================================
# +----+ +----+ +----+ +----+ #
# | S1 | | S2 | | S3 | | S4 | (Spine) #
# +----+ +----+ +----+ +----+ #
# PoD-8 #
# +----+ +----+ +----+ +----+ #
# | L1 | | L2 | | L3 | | L4 | (Leaf) #
# +----+ +----+ +----+ +----+ #
=========================================
Figure 2
(Note that the diagram above does not include the connections between
nodes. However, it can be assumed that the leaf nodes in a given PoD
are connected to every spine node in that PoD. Similarly, each spine
node (e.g., S1) is connected to all super-spine nodes in the
corresponding PoD-interconnect plane (e.g., Plane-1).)
For a five-stage CLOS network as illustrated in Figure 2, each leaf
node would establish eBGP sessions with all spine nodes of the same
PoD while each spine node would establish eBGP sessions with all
super-spine nodes in the corresponding PoD-interconnect plane.
In a rail-optimized topology, intra-rail communication with high
bandwidth requirements would be restricted to a single PoD. Inter-
rail communication with relatively lower bandwidth requirements needs
to travel across PoDs through PoD-interconnect planes. Therefore,
enabling adaptive routing only in PoD networks is sufficient. It is
optional to perform adaptive routing for cross-PoD traffic.
When a leaf node, such as L1 in PoD-1, advertises the route for a
specific IP prefix that it originates, it will attach a transitive
path bandwidth extended community filled with a maximum bandwidth
value.
Upon receiving the above route advertisement, a spine node, such as
S1 in PoD-1, will determine the minimum value between the bandwidth
of the link towards the advertising node (e.g., L1 in PoD-1) and the
value of the path bandwidth extended community carried in the route,
and then update the path bandwidth extended community with the above
minimum value before readvertising that route to remote eBGP peers.
Once S1 in PoD-1 receives multiple equal-cost routes for a given
prefix from multiple leaf nodes (e.g., L1 and L2 in PoD-1 in the
server multi-homing scenario), for each route, it will determine the
minimum value between the bandwidth of the link towards the
advertising node and the bandwidth value of the path bandwidth
extended community carried in the route, and then use that minimum
bandwidth value as a weight value for that route when performing
weighted ECMP. When readvertising the route for that prefix to
remote eBGP peers, the path bandwidth extended community would be
updated with the sum of the per-route minimum bandwidth values.
When a given super-spine node, such as SS1 in Plane-1, receives the
route for that prefix from S1 in PoD-1, it will not update the
transitive path bandwidth extended community when readvertising that
route. It MAY attach another path bandwidth extended
community, which is non-transitive, to indicate the bandwidth of the
link towards the advertising router.
When a given spine node in another PoD, such as S1 in PoD-8, receives
multiple equal-cost routes for a given prefix from super-spine nodes
in Plane-1 (e.g., SS1, SS2, SS3 and SS4 in Plane-1), it will not
update the value of the transitive path bandwidth extended community
when readvertising that route towards remote peers. (Note that the
transitive path bandwidth extended communities of those multiple
equal-cost routes carry the same value that was set by S1 in PoD-1.)
Meanwhile, if every route contains a non-transitive path bandwidth
extended community, then for each route it will determine the minimum
value between the bandwidth of the link towards the advertising node
and the bandwidth value of the non-transitive path bandwidth extended
community carried in the route, and then use that minimum bandwidth
value as a weight value for that route when performing weighted ECMP.
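The cross-PoD handling of the two communities can be sketched as
follows. The Route type and function names are hypothetical. The
sketch assumes that the non-transitive community is not carried
further when the remote spine node readvertises the route, and that
no weights are produced unless every route carries the non-transitive
community; neither detail is spelled out in this document.

```python
from dataclasses import dataclass, replace
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Route:
    prefix: str
    advertiser: str                           # node the route was received from
    transitive_bw: float                      # set inside the originating PoD
    non_transitive_bw: Optional[float] = None


def super_spine_readvertise(route: Route, self_name: str,
                            link_bw_to_advertiser: float,
                            attach: bool = True) -> Route:
    """Leave the transitive value untouched; optionally attach a
    non-transitive value equal to the link bandwidth towards the advertiser."""
    extra = link_bw_to_advertiser if attach else None
    return replace(route, advertiser=self_name, non_transitive_bw=extra)


def remote_spine_weights(routes: List[Route],
                         link_bw: Dict[str, float]) -> Optional[Dict[str, float]]:
    """Weights from the non-transitive community, or None if any route lacks it."""
    if any(r.non_transitive_bw is None for r in routes):
        return None
    return {r.advertiser: min(link_bw[r.advertiser], r.non_transitive_bw)
            for r in routes}


def remote_spine_readvertise(routes: List[Route], self_name: str) -> Route:
    """Readvertise with the transitive value unchanged (identical on all
    routes). ASSUMPTION: the non-transitive community is dropped here."""
    first = routes[0]
    return Route(first.prefix, self_name, first.transitive_bw, None)
```

In this sketch the transitive value set in the originating PoD reaches
the leaf nodes of the remote PoD unchanged, while the non-transitive
value only influences the weights on the spine node that receives it.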
When a leaf node, such as L8 in PoD-8, receives multiple equal-cost
routes for that prefix from multiple spine nodes (e.g., S1, S2, S3
and S4 in PoD-8), for each route, it will determine the minimum value
between the bandwidth of the link towards the advertising node and
the value of the path bandwidth extended community carried in the
route, and then use that minimum bandwidth value as a weight value
for that route when performing weighted ECMP.
Note that weighted ECMP according to path bandwidth SHOULD NOT be
performed unless all equal-cost routes for a given prefix carry the
path bandwidth extended community.
5. Acknowledgements
TBD.
6. IANA Considerations
TBD.
7. Security Considerations
TBD.
8. References
8.1. Normative References
[RFC2119] Bradner, S., "Key words for use in RFCs to Indicate
Requirement Levels", BCP 14, RFC 2119,
DOI 10.17487/RFC2119, March 1997,
<https://www.rfc-editor.org/info/rfc2119>.
[RFC4360] Sangli, S., Tappan, D., and Y. Rekhter, "BGP Extended
Communities Attribute", RFC 4360, DOI 10.17487/RFC4360,
February 2006, <https://www.rfc-editor.org/info/rfc4360>.
8.2. Informative References
[I-D.ietf-idr-link-bandwidth]
Mohapatra, P. and R. Fernando, "BGP Link Bandwidth
Extended Community", Work in Progress, Internet-Draft,
draft-ietf-idr-link-bandwidth-07, 5 March 2018,
<https://datatracker.ietf.org/doc/html/draft-ietf-idr-
link-bandwidth-07>.
[I-D.xu-lsr-fare]
Xu, X., He, Z., Wang, J., Huang, H., Zhang, Q., Wu, H.,
Liu, Y., Xia, Y., Wang, P., and S. Hegde, "Fully Adaptive
Routing Ethernet", Work in Progress, Internet-Draft,
draft-xu-lsr-fare-02, 25 February 2024,
<https://datatracker.ietf.org/doc/html/draft-xu-lsr-fare-
02>.
[RFC7938] Lapukhov, P., Premji, A., and J. Mitchell, Ed., "Use of
BGP for Routing in Large-Scale Data Centers", RFC 7938,
DOI 10.17487/RFC7938, August 2016,
<https://www.rfc-editor.org/info/rfc7938>.
Author's Address
Xiaohu Xu
China Mobile
Email: xuxiaohu_ietf@hotmail.com