MLPerf Results: CoreWeave and NVIDIA Showcase Record-Breaking, Cloud Native AI Supercomputer

Brian Venturo

Published on

June 27, 2023

CoreWeave and NVIDIA MLPerf Submission: In Summary

CoreWeave, in a joint submission with partner NVIDIA, delivered record-breaking performance on MLPerf workloads, including the new GPT-3 LLM benchmark test, which trained in under 11 minutes on over 3,500 NVIDIA H100 Tensor Core GPUs on a CoreWeave H100 Cloud Supercomputer.
These record-breaking results were achieved on a production cluster built with NVIDIA Quantum-2 InfiniBand networking for Inflection AI, a leading AI lab.
CoreWeave was among the first cloud providers to go live with NVIDIA HGX H100 instances, which are being used today to train some of the largest and most ambitious models.

In a joint MLPerf Training submission, NVIDIA and CoreWeave delivered record-breaking results on the MLPerf™ benchmark suite, which is overseen by an unbiased and reputable third-party benchmarking consortium. Using more than 3,500 NVIDIA H100 Tensor Core GPUs, CoreWeave’s publicly available supercomputing infrastructure completed the new MLPerf GPT-3 175B large language model (LLM) training benchmark in under 11 minutes.

This result was more than 29x faster than the next best competitor’s and, at over 3,500 GPUs, the submission was also 4x larger in scale than the next best competitor’s. Making up one of the largest NVIDIA HGX clusters in the world, CoreWeave’s supercomputer instances feature the latest HGX servers with NVIDIA H100 Tensor Core GPUs, Intel 4th Generation Xeon Scalable Processors, and NVIDIA ConnectX-7 400Gb/s InfiniBand and BlueField-2 DPUs.

MLPerf is the industry-standard benchmark suite for both model training and inference, providing fair and useful insights into workloads that represent the state of the art in AI. Akin to the "0 to 60" benchmark for cars, these benchmarks are peer-reviewed by AI leaders in academia, research labs, and industry, and cover hardware, software, services, and more.

Reflecting the latest updates in the industry, MLPerf Training 3.0 added GPT-3 175B, a large language model that is based on the Transformer network architecture and powers services like OpenAI’s ChatGPT.

What This Means for AI

Unmatched in speed and scale, this record-breaking result defines what’s possible for machine learning (ML) at the enterprise level—and sets a new standard for cutting-edge AI infrastructure.

These results demonstrate not the potential but the reality of ML performance for the world’s most powerful GPUs, the NVIDIA H100s, when run in CoreWeave Cloud. CoreWeave allows ML research teams to train large models at unprecedented speed and efficiency by enabling parallel workloads to run across more NVIDIA GPUs. We deliver this infrastructure at scale, faster than anyone thought possible.

"These results prove that CoreWeave can deliver what we say we can deliver: best-in-class AI/ML focused solutions that meet the scale and pace required by today’s most demanding and ambitious AI labs"

— Mike Intrator, CEO and co-founder at CoreWeave

The new wave of generative AI applications and LLMs requires an enormous amount of computing power to manage and analyze large amounts of data. Today, scale and access are critical determining factors for the success of AI startups. CoreWeave has spent years preparing for this, and this MLPerf result shows where the industry needs to go next in order to meet the rising demand for ultra-performant compute at scale.

"What we’re seeing now is a redesign of the data center and the high-powered infrastructure required to power the AI revolution. Every decision and optimization we’ve made to our infrastructure has been purposeful in order to deliver the speed, efficiency, and performance that AI teams need to bring their products to market faster."

— Brian Venturo, CTO and co-founder at CoreWeave

The Record-Breaking H100 Cluster in the Cloud

In today’s AI race, being first to market matters. CoreWeave is committed to helping companies get access to the compute resources they need to go to market quickly.

CoreWeave was among the first providers to offer cloud instances with NVIDIA H100 GPUs, which became generally available to clients during the NVIDIA GTC event in March. Today, these clusters power some of the largest and most ambitious LLMs being built.

The lack of infrastructure required to power the boom in generative AI is the industry’s most pressing challenge. That’s due in large part to the hyperscalers not being built to provide this type of compute on a contiguous scale.

CoreWeave was built to directly address the market’s need for advanced compute at scale. Unlike generalized cloud providers, CoreWeave’s specialized infrastructure provides fast bare-metal performance and the supporting storage, networking, and software solutions to match. Teams that use CoreWeave Cloud can access a wider variety of NVIDIA GPUs and have the flexibility to ‘right-size’ their workloads to best match their demands and business needs. Importantly, CoreWeave’s compute solutions are optimized for highly parallelized workloads.

"Historically you didn’t get more than 2,000-3,000 GPUs from a single cloud provider in a single location. CoreWeave is building installations that are 20,000+ NVIDIA GPUs in one location. We are the new utility provider. AI is the new electricity and CoreWeave is building the grid."

— Brian Venturo, CTO and co-founder at CoreWeave

Meet Pi

The specific cluster used for the MLPerf submission is currently in use by Inflection AI, who generously donated compute for the MLPerf tests. Inflection AI has created Pi (“personal AI”), an AI designed to be a kind and supportive companion offering conversations, friendly advice, and concise information in a natural, flowing style.

One of the world's most sophisticated and advanced LLMs, Pi was trained using CoreWeave’s NVIDIA H100 instances to achieve a new level of simplicity and natural interactions.

"Anyone can experience the power of a personal AI today based on our state-of-the-art large language model that was trained on CoreWeave’s powerful network of H100 GPUs."

— Mustafa Suleyman, CEO and co-founder of Inflection AI

Inflection AI’s cluster used for the MLPerf submission includes over 40,000 cables and 500 miles of InfiniBand fiber. The NVIDIA Quantum InfiniBand networking technology, which CoreWeave uses in all its NVIDIA H100 and A100 instances, allows GPUs to communicate directly with each other with low latency at scale.

What’s Next for ML Performance?

There’s no question that LLMs and the services they power are fundamentally reshaping the computing landscape. AI has the potential to solve massive global problems and introduce efficiencies to nearly every business vertical, from always-on, customized advertising to developing cancer treatments.

But for AI to have that impact, the industry needs better solutions that can deliver world-class performance at scale.

"Right now, much of the industry is focused on model training. But those models will soon need to be served and translated into products and services people can use. We know that AI inference demand is about to explode, and we’ve spent years preparing our infrastructure and culture to scale for this moment."

— Mike Intrator, CEO and co-founder at CoreWeave

The industry has already come a long way in making it faster, more efficient, and more cost-effective to train models and serve inference. This record-breaking MLPerf result is a testament to that. The better we get at this, the more likely we are to see AI change the world.

At CoreWeave, we see this incredible result as raising the bar for what’s possible when it comes to ML performance in the public cloud. We are excited to continue supporting amazing AI teams like Inflection AI in building these exciting new applications and markets.
