Stragglers, Stalls, and Restarts: Why AI Training Throughput Breaks Down at Scale
CoreWeave

Event details

Willy Markuske, Senior Field Engineer, CoreWeave
Tara Madhyastha, Senior Field Engineer, CoreWeave

Schedule

Jun 23, 2026, 11:00 am ET
30 minutes

Can your training infrastructure actually deliver?

AI training roadmaps don’t usually stall for the reasons teams expect. What looks like a capacity, cost, or iteration-speed problem is often an infrastructure issue underneath: the stack can’t sustain real model progress as training scales. 

Built for AI Platform Leaders and infrastructure teams evaluating training infrastructure at scale, this 30-minute Training Tuesdays session will unpack the architectural decisions that shape training outcomes and share a practical framework for evaluating whether infrastructure is translating allocated compute into results. 

We’ll close with a look at CoreWeave ARENA, our production-ready AI lab for validating real models and pipelines before you go live. You’ll see how teams can evaluate throughput visibility, recovery behavior, and the signals that production-scale validation actually surfaces.

In this webinar, we’ll cover: 

  • Why AI training roadmaps stall even when teams have GPUs, budget, and models ready
  • Which signals reveal whether infrastructure can sustain model progress as runs get longer and more distributed
  • Why small tests and synthetic benchmarks miss the failure modes that matter at scale
  • How production-like validation helps teams assess throughput, resilience, and forward progress
  • How CoreWeave ARENA helps teams validate real workloads before making a broader infrastructure decision

Learn how to evaluate AI training infrastructure for real model progress, not just allocated capacity. Register now.

Speakers

Willy Markuske
Senior Field Engineer, CoreWeave
Tara Madhyastha
Senior Field Engineer, CoreWeave
