At CoreWeave, innovation never rests. Last week, we unveiled an exclusive first look at one of the first groundbreaking NVIDIA GB200 NVL72 racks to be brought up by a major cloud provider. Today, we’re taking it even further. We’re thrilled to present one of the first live demonstrations of the GB200 NVL72 by a cloud provider, showcasing its remarkable performance, advanced cooling technology, and energy efficiency.
The demo gives you an inside look at how this state-of-the-art system performs under real-world conditions. It starts with the NCCL AllReduce Test, a standardized benchmark designed to demonstrate the high-speed NVIDIA NVLink™ interconnectivity of the rack’s 72 GPUs. This test verifies that all GPUs communicate seamlessly, showcasing their ability to handle the distributed workloads critical for large foundation models. Building on that foundation, the rack moves into the GPUBlaze Test, where its raw computational power truly shines. In this test, the GPUs are tasked with complex matrix multiplication workloads, mimicking the types of operations used in AI training, scientific simulations, and advanced data processing. As the GPUs ramp up their activity, the GB200 NVL72 proves it’s more than capable of tackling even the most demanding computational challenges.
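If you want a feel for what these two tests exercise, the sketch below reproduces the same two patterns with PyTorch: an NCCL all-reduce timing loop followed by a bf16 matrix-multiplication stress loop. It is not the tooling used in the demo (the NCCL test suite’s all_reduce_perf and the GPUBlaze workload are separate binaries); the sizes, iteration counts, and torchrun launch are illustrative assumptions.

```python
# Illustrative sketch (not the demo's benchmark binaries): time NCCL
# all-reduce over the GPUs in a node, then stress them with bf16 GEMMs.
# Launch with, e.g.:  torchrun --nproc_per_node=4 bench_sketch.py
import time
import torch
import torch.distributed as dist

def main():
    # One process per GPU; NCCL carries traffic over the NVLink fabric.
    dist.init_process_group(backend="nccl")
    rank = dist.get_rank()
    torch.cuda.set_device(rank % torch.cuda.device_count())

    # --- All-reduce timing (analogous to the NCCL AllReduce Test) ---
    numel = 256 * 1024 * 1024                     # 1 GiB of fp32 per rank
    buf = torch.ones(numel, device="cuda")
    for _ in range(5):                            # warm-up
        dist.all_reduce(buf)
    torch.cuda.synchronize()
    iters = 20
    start = time.time()
    for _ in range(iters):
        dist.all_reduce(buf)
    torch.cuda.synchronize()
    gbytes = numel * 4 * iters / 1e9
    if rank == 0:
        print(f"all-reduce: {gbytes / (time.time() - start):.1f} GB/s per rank")

    # --- Matmul stress (analogous to the GPUBlaze-style GEMM workload) ---
    a = torch.randn(8192, 8192, device="cuda", dtype=torch.bfloat16)
    b = torch.randn(8192, 8192, device="cuda", dtype=torch.bfloat16)
    torch.cuda.synchronize()
    start = time.time()
    for _ in range(100):
        a @ b
    torch.cuda.synchronize()
    tflops = 2 * 8192**3 * 100 / (time.time() - start) / 1e12
    if rank == 0:
        print(f"matmul: ~{tflops:.0f} TFLOP/s per GPU (bf16)")

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```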
CoreWeave has been working tirelessly to support the GB200 NVL72 across our managed Kubernetes portfolio. To further validate the GB200 NVL72’s capabilities, we conducted a Megatron training run using CoreWeave Kubernetes Service (CKS) and Slurm on Kubernetes (SUNK). The Megatron training job simulated the massive computational and memory demands of large foundation models, demonstrating the GB200 NVL72’s ability to run complex, distributed workloads efficiently. CKS and SUNK are fully optimized for the NVIDIA GB200 NVL72, providing seamless integration and enabling high-performance distributed training across the rack’s 72 GPUs. CKS delivers a hypervisor-free Kubernetes solution tailored for AI workloads, while SUNK extends this flexibility to burst and batch job scheduling. The training run also showcased CoreWeave Observability Services, which offered real-time visibility into every aspect of system performance. Through intuitive dashboards, we tracked GPU utilization, memory and VRAM allocation, power consumption, and temperature, providing actionable insights into how the hardware managed the immense computational demands of the workload.
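To make the Kubernetes side of this workflow concrete, here is a minimal sketch of submitting a containerized training job with the official Kubernetes Python client. The image tag, entrypoint, namespace, and GPU count are hypothetical placeholders, and in the demo the run was actually driven through SUNK’s Slurm interface on CKS rather than a raw Job object like this.

```python
# Hypothetical sketch: submit a containerized training job to a Kubernetes
# cluster with the official Python client. The image, command, namespace,
# and GPU count are placeholders, not the configuration used in the demo.
from kubernetes import client, config

def submit_training_job():
    config.load_kube_config()                      # or load_incluster_config()
    batch = client.BatchV1Api()

    container = client.V1Container(
        name="megatron-trainer",
        image="nvcr.io/nvidia/pytorch:24.09-py3",  # placeholder image tag
        command=["torchrun", "--nproc_per_node=4",
                 "pretrain_gpt.py", "--bf16"],     # placeholder entrypoint
        resources=client.V1ResourceRequirements(
            limits={"nvidia.com/gpu": "4"},        # GPUs requested per pod
        ),
    )

    job = client.V1Job(
        metadata=client.V1ObjectMeta(name="megatron-demo"),
        spec=client.V1JobSpec(
            backoff_limit=0,
            template=client.V1PodTemplateSpec(
                spec=client.V1PodSpec(
                    restart_policy="Never",
                    containers=[container],
                ),
            ),
        ),
    )

    batch.create_namespaced_job(namespace="default", body=job)
    print("Submitted job:", job.metadata.name)

if __name__ == "__main__":
    submit_training_job()
```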
But performance is only part of the story. As the workload intensifies, the rack’s Cooling Distribution Unit (CDU) takes center stage. This advanced system dynamically adjusts cooling output in response to GPU activity, helping to ensure the hardware stays at optimal temperatures without compromising performance. Real-time data from the CDU dashboard provides fascinating insights, showing how the return coolant temperature rises in sync with the computational load. At the same time, the rack’s power dashboard tracks energy consumption, offering a live view of how the system’s energy demands scale with its performance. The ability to monitor power usage in real time highlights the GB200 NVL72’s efficiency, providing transparency and control over energy management.
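The CDU and rack power views in the demo come from CoreWeave Observability Services and facility-level instrumentation, but the GPU-level slice of that picture can be reproduced on any node through NVML. The sketch below polls utilization, VRAM, power draw, and temperature with the pynvml bindings; it is a generic illustration and deliberately does not touch rack-level metrics such as coolant temperature, which only the CDU reports.

```python
# Generic sketch of GPU-level telemetry polling via NVML
# (pip install nvidia-ml-py). Rack-level CDU and power data come from
# facility systems and are not available through this interface.
import time
import pynvml

pynvml.nvmlInit()
try:
    handles = [pynvml.nvmlDeviceGetHandleByIndex(i)
               for i in range(pynvml.nvmlDeviceGetCount())]
    for _ in range(10):                                          # ~10 s of samples
        for i, h in enumerate(handles):
            util = pynvml.nvmlDeviceGetUtilizationRates(h)       # percent busy
            mem = pynvml.nvmlDeviceGetMemoryInfo(h)              # bytes
            power = pynvml.nvmlDeviceGetPowerUsage(h) / 1000.0   # watts
            temp = pynvml.nvmlDeviceGetTemperature(
                h, pynvml.NVML_TEMPERATURE_GPU)                  # degrees C
            print(f"GPU{i}: util={util.gpu}% "
                  f"vram={mem.used / 2**30:.1f}/{mem.total / 2**30:.1f} GiB "
                  f"power={power:.0f} W temp={temp} C")
        time.sleep(1)
finally:
    pynvml.nvmlShutdown()
```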
The GB200 NVL72 racks are also connected by NVIDIA Quantum-2 InfiniBand networking, delivering 400Gb/s of bandwidth per GPU through a rail-optimized topology. NVIDIA Quantum-2’s SHARP In-Network Computing technology offloads collective communication into the network fabric, resulting in ultra-low latency and accelerated training speeds. Looking ahead, we are excited to deploy the NVIDIA Quantum-X800 800Gb/s InfiniBand platform, doubling the bandwidth to each GPU and opening up new possibilities for AI innovation.
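As a rough illustration, SHARP offload for NCCL collectives is typically enabled through environment variables set before the communicator is created; the exact settings depend on the NCCL version and SHARP plugin in the cluster image, so treat the snippet below as indicative rather than prescriptive.

```python
# Indicative only: enabling NCCL's CollNet/SHARP path is an environment-level
# setting whose availability depends on the NCCL build and SHARP plugin.
import os
import torch.distributed as dist

# Allow NCCL to offload collectives to the InfiniBand switches (SHARP)
# when the fabric and plugin support it.
os.environ.setdefault("NCCL_COLLNET_ENABLE", "1")
# Optional: log NCCL's topology and algorithm choices for verification.
os.environ.setdefault("NCCL_DEBUG", "INFO")

# Must be set before the NCCL communicator is created (run under torchrun
# so the rank and world-size environment variables are present).
dist.init_process_group(backend="nccl")
```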
Before we wrap up, we want to extend a heartfelt shoutout to Switch, our trusted data center partner. Their world-class facilities and commitment to excellence have been instrumental in making this project a success. The collaboration with Switch has been invaluable in pushing the boundaries of what’s possible in accelerated computing.
This demo builds on the visuals we shared last week, bringing the GB200 NVL72 story to life. While the photos gave a glimpse of the hardware’s impressive physical design, this demonstration underscores its cutting-edge capabilities in performance, cooling, and energy efficiency. The NVIDIA GB200 NVL72, combined with CoreWeave’s proven product portfolio and engineering expertise, is more than a technical marvel—it’s a complete solution built to push the boundaries of accelerated computing. Whether you’re focused on AI innovation, scientific research, or massive data workloads, CoreWeave is ready to meet the demands of the most challenging environments.