The following terms are used throughout this document. Terms defined in the
companion training document are referenced but not redefined unless the inference
context introduces substantive differences.
- TTFT:
  Time to First Token. The elapsed time from receipt of an inference request by
  the serving system to emission of the first output token. Includes prompt
  processing (prefill), KV cache generation, optional KV cache transfer (in
  disaggregated architectures), and the initial decode step. Target: < 500 ms
  for interactive serving.
- ITL:
  Inter-Token Latency. The elapsed time between successive output tokens during
  the autoregressive decode phase. Measured at the P50, P95, P99, and P99.9
  percentiles. Target: < 50 ms P99 for interactive serving.
- TPS:
  Tokens Per Second. Aggregate throughput of the serving system, measured as
  the total number of tokens processed per second across all concurrent
  requests. Reported separately for input (prefill) TPS and output (decode)
  TPS.
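The three metrics above can all be derived from per-token emission timestamps. The following is a minimal sketch of that arithmetic; the timestamp values and the nearest-rank percentile helper are illustrative assumptions, not part of this document's methodology.

```python
# Sketch: deriving TTFT, ITL percentiles, and output TPS from
# per-token emission timestamps. All timestamp values below are
# hypothetical, chosen only to illustrate the arithmetic.

def percentile(sorted_vals, p):
    """Nearest-rank percentile on a pre-sorted list (illustrative)."""
    k = max(0, min(len(sorted_vals) - 1,
                   round(p / 100 * (len(sorted_vals) - 1))))
    return sorted_vals[k]

request_arrival = 0.000                              # seconds
token_times = [0.180, 0.215, 0.252, 0.290, 0.331]    # emission time per output token

ttft = token_times[0] - request_arrival              # time to first token
itls = sorted(b - a for a, b in zip(token_times, token_times[1:]))
output_tps = len(token_times) / (token_times[-1] - request_arrival)

print(f"TTFT: {ttft * 1000:.0f} ms")                           # 180 ms
print(f"ITL P50: {percentile(itls, 50) * 1000:.0f} ms")        # 38 ms
print(f"Output TPS: {output_tps:.1f} tok/s")
```

In a real harness the percentile set would be P50/P95/P99/P99.9 over many requests; a single request is shown only to keep the example short.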
- KV Cache:
  Key-Value Cache. The intermediate attention state (key and value projection
  matrices) computed during the prefill phase and reused during each decode
  step. Size scales with model dimension, number of layers, number of attention
  heads, sequence length, and numerical precision. For a 70B-parameter model at
  FP16 with 4K context: approximately 1.34 GB per request.
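The 1.34 GB figure can be reproduced with the standard per-request KV cache size formula (two tensors per layer, one each for keys and values). The Llama-70B-style shape parameters below (80 layers, 8 grouped-query KV heads, head dimension 128) are assumptions for illustration, not values stated in this document.

```python
# Sketch: per-request KV cache size.
# The factor 2 accounts for the separate key and value tensors per layer.

def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bytes_per_elem):
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_elem

# Assumed Llama-70B-style shape: 80 layers, 8 GQA KV heads, head_dim 128.
size = kv_cache_bytes(layers=80, kv_heads=8, head_dim=128,
                      seq_len=4096, bytes_per_elem=2)  # FP16 = 2 bytes
print(f"{size / 1e9:.2f} GB per request")  # 1.34 GB per request
```

Models using multi-head attention without grouped-query compression would have proportionally more KV heads and a correspondingly larger cache.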
- Prefill Phase:
  The compute-bound phase of inference in which the entire input prompt is
  processed in parallel to generate the KV cache and the first output token.
  Characterized by high arithmetic intensity (200-400 ops/byte), high GPU
  utilization (90-95%), and large activation tensors.
- Decode Phase:
  The memory-bound phase of inference in which output tokens are generated
  autoregressively, one token per forward pass. Characterized by low arithmetic
  intensity (60-80 ops/byte), lower GPU utilization (20-40%), and
  memory-bandwidth-limited KV cache reads.
- Disaggregated Serving:
  An inference serving architecture in which prefill and decode computations
  are executed on physically separate groups of accelerators (workers)
  connected by a network fabric. The KV cache generated by prefill workers is
  transferred over the fabric to decode workers.
- xPyD Ratio:
  The allocation ratio of prefill (x) to decode (y) resources in a
  disaggregated serving cluster. For example, 3P9D indicates 3 prefill nodes
  and 9 decode nodes. The optimal ratio depends on model size, prompt length
  distribution, output length distribution, and SLO targets.
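A first-order way to pick x:y is to balance the aggregate request throughput of the two pools, so that prefill workers feed requests exactly as fast as decode workers drain them. The per-node rates below are hypothetical measurements; real sizing must also account for SLO headroom and the length distributions noted above.

```python
# Sketch: first-order xPyD sizing by throughput balancing.
# Balance condition: x * prefill_rps_per_node == y * decode_rps_per_node.
# Both per-node rates are assumed, illustrative measurements.

prefill_rps_per_node = 12.0   # requests/s one prefill node sustains (assumed)
decode_rps_per_node = 4.0     # requests/s one decode node sustains (assumed)

# One prefill node keeps prefill_rps / decode_rps decode nodes busy.
x = 1
y = round(prefill_rps_per_node / decode_rps_per_node)
print(f"{x}P{y}D (scaled x3: {3 * x}P{3 * y}D)")  # 1P3D (scaled x3: 3P9D)
```

With these assumed rates the minimal balanced ratio is 1P3D, which scales to the 3P9D configuration used as the example above.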
- EP:
  Expert Parallelism. A parallelism strategy for Mixture-of-Experts (MoE)
  models in which expert sub-networks are distributed across multiple GPUs.
  Routing tokens to the appropriate experts requires AllToAll communication.
- Wide EP:
  Expert Parallelism spanning many GPUs (e.g., 96-way EP across 12 nodes),
  requiring inter-node AllToAll communication for every MoE layer forward pass.
- DP Attention:
  Data Parallelism applied to the attention computation, where the KV cache is
  partitioned across data-parallel ranks. Each rank holds 1/DP_SIZE of the KV
  cache, and AllToAll communication is used to exchange attention outputs.
- MoE:
  Mixture of Experts. A model architecture that activates only a subset of
  expert sub-networks for each token, enabling larger model capacity with
  sub-linear compute scaling.
- Normal Dispatch:
  A communication mode for AllToAll MoE dispatch optimized for the prefill
  phase. Maximizes throughput for long input sequences but generates dynamic
  (symbolic) shapes incompatible with CUDA Graph capture.
- Low-Latency Dispatch:
  A communication mode for AllToAll MoE dispatch optimized for the decode
  phase. Uses fixed input shapes compatible with CUDA Graph capture, reducing
  kernel launch overhead at the cost of slightly lower peak throughput.
- RDMA:
  Remote Direct Memory Access. A transport mechanism enabling direct
  memory-to-memory data transfer between hosts without CPU involvement.
  Implementations include InfiniBand Verbs and RoCEv2 (RDMA over Converged
  Ethernet v2).
- RoCEv2:
  RDMA over Converged Ethernet version 2. An RDMA transport that encapsulates
  the InfiniBand transport layer over UDP/IP, enabling RDMA semantics on
  standard Ethernet fabrics.
- UET:
  Ultra Ethernet Transport. A transport protocol defined by the Ultra Ethernet
  Consortium (UEC) Specification 1.0, offering ordered/unordered reliable
  delivery, multipath packet spraying, and integrated congestion control for
  AI/HPC workloads.
- KVCXL:
  KV Cache Transfer Library. A library providing standardized point-to-point
  data transfer primitives (register, transfer, notify) for inference engines,
  abstracting underlying transports (intra-node interconnect, RDMA, PCIe, and
  storage interfaces). Multiple open-source and vendor implementations exist.
- GIN:
  GPU-Initiated Networking. A communication paradigm in which GPU threads
  directly initiate network operations (RDMA sends, one-sided puts) without
  CPU involvement, reducing latency by eliminating CPU-GPU synchronization.
- PagedAttention:
  A memory management technique for KV caches that stores attention keys and
  values in fixed-size pages (typically 16-64 KB), enabling non-contiguous
  allocation and reducing memory fragmentation.
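The mechanism can be sketched as a page pool plus a per-request page table mapping logical KV blocks to physical pages. This is a minimal illustration in the spirit of PagedAttention, not any engine's actual allocator; the pool size and request names are made up.

```python
# Sketch: a minimal paged KV cache allocator. Pages are fixed-size and
# need not be contiguous; each request's page table records which
# physical pages hold its KV blocks.

class PagedKVAllocator:
    def __init__(self, num_pages):
        self.free_pages = list(range(num_pages))   # physical page pool
        self.page_tables = {}                      # request_id -> [page ids]

    def alloc(self, request_id, pages_needed):
        if pages_needed > len(self.free_pages):
            raise MemoryError("KV cache pool exhausted")
        pages = [self.free_pages.pop() for _ in range(pages_needed)]
        self.page_tables[request_id] = pages       # non-contiguous is fine
        return pages

    def free(self, request_id):
        # Retired request's pages return to the pool for reuse.
        self.free_pages.extend(self.page_tables.pop(request_id))

pool = PagedKVAllocator(num_pages=8)
pool.alloc("req-a", 3)
pool.alloc("req-b", 2)
pool.free("req-a")
print(len(pool.free_pages))  # 6
```

Because allocation is page-granular, a request's KV memory grows one page at a time during decode instead of reserving worst-case contiguous space up front.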
- Continuous Batching:
  A scheduling technique that dynamically adds new requests to an active
  inference batch as decode slots become available, improving GPU utilization
  compared to static batching.
- Prefix Caching:
  Reuse of previously computed KV cache segments for prompts that share a
  common prefix (e.g., a shared system prompt), avoiding redundant prefill
  computation.
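One common realization is block-level lookup: prompt token blocks are hashed cumulatively, so a block's key encodes its entire prefix, and a hit means the KV cache for that block already exists. The block size and token ids below are illustrative, and real engines track cached KV pages rather than a bare hash set.

```python
# Sketch: block-level prefix cache lookup via cumulative hashing.
# A block's hash covers all tokens up to and including that block, so
# equal hashes imply an identical prefix.

import hashlib

BLOCK = 4  # tokens per cached block (assumed)

def block_hashes(tokens):
    hashes, h = [], hashlib.sha256()
    for i in range(0, len(tokens) - len(tokens) % BLOCK, BLOCK):
        h.update(repr(tokens[i:i + BLOCK]).encode())
        hashes.append(h.copy().hexdigest())
    return hashes

cache = set(block_hashes([1, 2, 3, 4, 5, 6, 7, 8]))          # earlier request
new = block_hashes([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12])  # shared 8-token prefix
reused = sum(1 for h in new if h in cache)
print(f"{reused * BLOCK} prompt tokens skip prefill")  # 8 prompt tokens skip prefill
```

Only the final 4-token block of the new prompt needs prefill; the first two blocks are served from the cached KV state.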
- DUT:
  Device Under Test. In this document, the DUT is one or more network fabric
  elements (switches, NICs, or the complete fabric) whose performance impact
  on inference serving is being characterized.
- SUT:
  System Under Test. The complete inference serving system, including
  accelerators, NICs, fabric, and serving software, when end-to-end metrics
  are being measured.
- RT:
  Router Tester / Traffic Generator. Test equipment capable of generating and
  receiving network traffic at specified rates, with timestamping accuracy
  sufficient for the measurements defined herein.