Stragglers, Stalls, and Restarts: Why AI Training Throughput Breaks Down at Scale

Event details

Willy Markuske

Senior Field Engineer

CoreWeave

Tara Madhyastha

Senior Field Engineer

CoreWeave

30 minutes

Can your training infrastructure actually deliver?

AI training roadmaps don’t usually stall for the reasons teams expect. What looks like a capacity, cost, or iteration-speed problem is often an infrastructure issue underneath: the stack can’t sustain real model progress as training scales.

Built for AI platform leaders and infrastructure teams evaluating training infrastructure at scale, this 30-minute Training Tuesdays session unpacked the architectural decisions that shape training outcomes and shared a practical framework for evaluating whether infrastructure translates allocated compute into results.

The session closed with a look at CoreWeave ARENA, our production-ready AI lab for validating real models and pipelines before you go live, including how teams can evaluate throughput visibility, recovery behavior, and the signals that production-scale validation actually surfaces.

In this webinar, we covered:

Why AI training roadmaps stall even when teams have GPUs, budget, and models ready
Which signals reveal whether infrastructure can sustain model progress as runs get longer and more distributed
Why small tests and synthetic benchmarks miss the failure modes that matter at scale
How production-like validation helps teams assess throughput, resilience, and forward progress
How CoreWeave ARENA helps teams validate real workloads before making a broader infrastructure decision

Learn how to evaluate AI training infrastructure for real model progress, not just allocated capacity. Watch the recording.
