Microsoft and NVIDIA have announced that Azure is now running the first large-scale production cluster built on NVIDIA's GB300 NVL72, purpose-built to serve larger, more capable reasoning models and to speed their rollout. The system combines rack-level shared memory, fifth-generation NVLink, and 800 Gb/s networking so that dozens of cabinets behave as a single massive accelerator.
The GB300 NVL72 is NVIDIA's rack-scale platform for inference-heavy, multi-step workloads, where models spend far more compute at serving time working through a task. Each rack links 72 Blackwell Ultra GPUs and 36 Grace CPUs into a single 72-GPU NVLink domain, giving them a shared pool of fast memory and 130 TB/s of intra-rack bandwidth to support massive contexts and extended chain-of-thought reasoning.
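As a sanity check on the quoted intra-rack figure, the arithmetic below multiplies the rack's 72 GPUs by NVIDIA's published per-GPU NVLink 5 rate of 1.8 TB/s; that per-GPU rate is the only number assumed beyond what the paragraph states.

```python
# Back-of-the-envelope check of the quoted 130 TB/s intra-rack figure,
# assuming NVIDIA's published NVLink 5 rate of 1.8 TB/s per GPU.
GPUS_PER_RACK = 72
NVLINK5_TBPS_PER_GPU = 1.8  # TB/s of NVLink bandwidth per Blackwell Ultra GPU

aggregate_tbps = GPUS_PER_RACK * NVLINK5_TBPS_PER_GPU
print(f"Aggregate NVLink bandwidth: {aggregate_tbps:.0f} TB/s")  # ~130 TB/s
```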
Azure's cluster interconnects racks via NVIDIA's Quantum-X800 InfiniBand, offering up to 800 Gb/s of network throughput per GPU. This lets OpenAI scale prefill and decode across racks while keeping latency low, which is crucial for interactive agents and long-context workloads.
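To make the prefill/decode split concrete, here is a minimal, purely illustrative sketch of disaggregated serving: one worker runs the compute-heavy prefill pass over the prompt and hands off a KV cache, another streams tokens against it. The worker functions and KVCache type are hypothetical and stand in for whatever scheduler Azure and OpenAI actually run.

```python
# Conceptual sketch of disaggregated serving: prefill runs once per request to
# build the KV cache, decode then streams tokens from that cache. The names
# and the hand-off step are illustrative, not the production scheduler.
from dataclasses import dataclass

@dataclass
class KVCache:
    request_id: str
    num_tokens: int  # prompt tokens whose keys/values are cached

def prefill_worker(request_id: str, prompt_tokens: list[int]) -> KVCache:
    """Attend over the full prompt once (the compute-bound phase)."""
    return KVCache(request_id=request_id, num_tokens=len(prompt_tokens))

def decode_worker(cache: KVCache, max_new_tokens: int) -> list[int]:
    """Generate tokens one at a time against the cache (the bandwidth-bound phase)."""
    generated = []
    for step in range(max_new_tokens):
        # A real system reads the KV cache over NVLink/InfiniBand each step and
        # appends one new entry; here we just emit a placeholder token id.
        generated.append(step)
        cache.num_tokens += 1
    return generated

cache = prefill_worker("req-1", prompt_tokens=list(range(4096)))
tokens = decode_worker(cache, max_new_tokens=8)
print(len(tokens), cache.num_tokens)
```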
A single NVL72 rack aggregates roughly 21 TB of HBM3e across its 72 GPUs and, counting CPU-GPU coherent memory, reaches around 40 TB of "fast" memory. That capacity lets larger models and longer prompts stay resident without frequent offloading, which translates into more tokens per second and fewer stalls.
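A rough reconciliation of those memory figures, assuming the commonly cited capacities of about 288 GB of HBM3e per Blackwell Ultra GPU and about 480 GB of LPDDR5X per Grace CPU (both assumptions, not numbers from the announcement):

```python
# Rough reconciliation of the per-rack memory figures, assuming ~288 GB HBM3e
# per Blackwell Ultra GPU and ~480 GB LPDDR5X per Grace CPU (both assumed).
GPUS, CPUS = 72, 36
HBM3E_GB_PER_GPU = 288
LPDDR_GB_PER_CPU = 480

hbm_tb = GPUS * HBM3E_GB_PER_GPU / 1000             # ~20.7 TB, i.e. "approximately 21 TB"
fast_tb = hbm_tb + CPUS * LPDDR_GB_PER_CPU / 1000   # ~38 TB, i.e. "around 40 TB"
print(f"HBM3e: {hbm_tb:.1f} TB, CPU+GPU coherent fast memory: {fast_tb:.1f} TB")
```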
In benchmark tests, NVIDIA's GB300 NVL72 recently set records on the new benchmarks added in MLPerf Inference v5.1, including higher DeepSeek-R1 throughput than clusters built on the previous Blackwell generation (GB200). The platform is also tuned for FP4 formats such as NVFP4 and for Dynamo-style disaggregated serving, choices aimed at reducing inference costs for large-scale models.
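The appeal of 4-bit formats is easy to see from a footprint estimate: weight bytes scale linearly with bits per parameter, so moving from FP8 to FP4 roughly halves the memory and bandwidth needed just to hold and stream weights (ignoring the small overhead of block scale factors). The 671B parameter count used below is DeepSeek-R1's commonly cited size and serves only as an illustration.

```python
# Why 4-bit formats such as NVFP4 matter at this scale: weight footprint scales
# linearly with bits per parameter. The 671B figure is DeepSeek-R1's commonly
# cited parameter count, used here only as an illustration; per-block scale
# factors in NVFP4 add a small overhead not counted below.
def weight_footprint_tb(params: float, bits_per_param: int) -> float:
    return params * bits_per_param / 8 / 1e12  # bytes -> TB

PARAMS = 671e9
for name, bits in [("FP16", 16), ("FP8", 8), ("FP4/NVFP4", 4)]:
    print(f"{name:10s}: {weight_footprint_tb(PARAMS, bits):.2f} TB of weights")
```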
At this scale, power delivery and grid stability become first-order engineering concerns. The GB300 introduces rack-level power units with onboard energy storage and coordinated power smoothing, cutting peak grid demand by up to 30%, which matters most when thousands of GPUs ramp up or down in sync. More grid-aware designs like this are expected in hyperscale AI builds.
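The toy model below illustrates the idea; it is not NVIDIA's actual control scheme, and every number in it is made up. Grid draw is capped at a fixed level, onboard storage discharges to cover synchronized peaks and recharges in the troughs, and the peak pulled from the grid drops by roughly the advertised margin.

```python
# Toy model of rack-level power smoothing: instantaneous draw swings as
# synchronized GPU work ramps, while onboard energy storage caps what is
# pulled from the grid and recharges in the troughs. All numbers illustrative.
synced_load_kw = [80, 80, 140, 140, 80, 140, 140, 80]  # hypothetical rack draw per interval
GRID_CAP_KW = 100            # cap on grid draw enforced by the power unit (assumed)
stored_energy = 200          # onboard storage, in kW-intervals, for simplicity

grid_draw = []
for load in synced_load_kw:
    if load > GRID_CAP_KW and stored_energy > 0:
        # Discharge storage to shave the peak down to the cap.
        discharge = min(load - GRID_CAP_KW, stored_energy)
        stored_energy -= discharge
        grid_draw.append(load - discharge)
    else:
        # Recharge with headroom below the cap when the load dips.
        recharge = min(GRID_CAP_KW - load, 20) if load < GRID_CAP_KW else 0
        stored_energy += recharge
        grid_draw.append(load + recharge)

peak_without, peak_with = max(synced_load_kw), max(grid_draw)
print(f"Peak grid draw: {peak_without} kW -> {peak_with} kW "
      f"({(1 - peak_with / peak_without):.0%} lower)")
```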
This initiative fits a broader capacity strategy. Microsoft has been securing supply, both within its own data centers and through "neocloud" partners, to keep GPU-intensive projects fed even as it prepares to roll out the new GB300 globally. That procurement approach, combined with Azure's existing GB200 deployments, reflects Microsoft's plan to shorten training cycles and bring larger models to market faster.