Inference that performs at scale

Dedicated Inference

The performance, infrastructure clarity, and explicit control you need to scale

Why dedicated inference

Maximize control without owning the cluster

With Dedicated Inference from CoreWeave, you bring the model and make the architectural choices that matter: GPU class, runtime, scaling, routing. CoreWeave runs the cluster, manages availability, and keeps performance and cost legible as you scale.

GPU choice that matches the workload

Pick the GPU class that fits your latency, throughput, and cost targets. Single-node or distributed multi-node serving, billed per GPU-hour.

Bring your own weights, stored in CoreWeave Object Storage

Point deployments at fine-tuned checkpoints, custom architectures, or OSS weights in CoreWeave Object Storage.

Open runtimes, OpenAI-compatible endpoints

vLLM and SGLang runtimes with OpenAI-compatible endpoints out of the box. Swap models, runtimes, or GPU classes without rebuilding the serving stack.

Gateway-managed routing and traffic control

A tenant-isolated gateway runs authentication, load balancing, and request routing across replicas. Optimize for latency, data locality, or compliance.

Cost that maps to infrastructure

Per GPU-hour billing against your chosen GPU class and capacity model. No egress fees, no ingress fees, no service markup.

Who it’s built for

For teams that need execution visibility without the ops overhead

The teams that get the most value from Dedicated Inference sit between "just use an API" and "run our own Kubernetes cluster."

AI Teams

Building custom and fine-tuned models

Teams deploying domain-specific or fine-tuned models who need custom weights support, GPU selection, and a path from training to production on the same infrastructure, without taking on cluster operations.

Enterprise

Running compliance and isolation-sensitive workloads

Organizations that need single-tenant GPU nodes, Availability Zone placement for data sovereignty, and predictable SLA targets for customer-facing AI systems without assembling bespoke infrastructure.

Platform Teams

Scaling from single-node to multi-node

ML platform teams who need to grow into distributed inference while preserving GPU choice, runtime flexibility, and per-GPU-hour cost visibility as workloads expand.

CoreWeave inference paths

Inference on your terms

Three inference paths built on the award-winning CoreWeave Cloud. Move between them as your workloads evolve—without replatforming—so you always get predictable performance and infrastructure-aligned economics.

Serverless inference

Pay-per-token inference on a curated OSS model catalog. No clusters to manage. Built-in tracing, evals, and observability for AI applications and agents.

Best for

Rapid iteration and AI app development

infrastructure management

Low — API only

Model support

Curated OSS + LoRAs

Pricing

Pay-per-token

Dedicated inference

Deploy custom weights on explicitly chosen GPUs. CoreWeave operates routing, scaling, and lifecycle. Full execution transparency, no cluster management.

Best for

Custom model serving at production scale

infrastructure management

Medium — GPU, Zone, Runtime

Model support

Open-source or custom weights

Pricing

Pay-per-GPU-hour

Inference on CKS

Full self-managed inference on CoreWeave Kubernetes Service. Own the entire serving stack—runtimes, scheduling, autoscaling, multi-node topology—on dedicated bare-metal GPU nodes.

Best for

Full infrastructure ownership and deep tuning

infrastructure management

High — full Kubernetes control

Model support

Any

Pricing

Pay-per-GPU-hour (Reserved, On-demand, Spot, Flex)

How it helps

What can you do with Dedicated Inference?

Serve custom models in production without cluster ownership

Bring your own weights, select your GPU, and ship to a live endpoint. CoreWeave operates everything between your model artifact and your users. No cluster setup, no Kubernetes expertise, no infra team required to maintain the serving layer.

Keep cost transparent and predictable as inference scales

Pay-per-GPU-hour billing against explicitly chosen GPU classes means cost stays tied to infrastructure decisions, not abstract consumption units. Existing CoreWeave reserved nodes can be directed to Dedicated Inference at contracted rates without incremental fees.

Move from fine-tuning to serving without replatforming

Weights stored in CoreWeave AI Object Storage deploy directly without data movement, whether you trained them on CoreWeave or brought them in. Just bring your weights to Dedicated Inference for production on the same infrastructure and runtimes.

How it works

Four steps to a live endpoint

Provision a gateway, configure your deployment, send requests, observe. You make the architectural choices; CoreWeave runs the cluster.

Create a gateway
Pick your CoreWeave Availability Zone. CoreWeave provides a tenant-isolated gateway that handles authentication, load balancing, and external routing.
Create a deployment
Point to model weights in CoreWeave AI Object Storage. Select your GPU type and inference runtime. Set min/max replica counts.
Send an inference request
The gateway exposes an OpenAI-compatible API. Your endpoint is ready to go. CoreWeave schedules, serves, and scales.
Monitor and iterate
Track performance, error rates, and GPU utilization in Grafana. Update configuration or swap model weights without taking the endpoint down.

Frequently asked questions

How is Dedicated Inference different from Inference on CKS?

CoreWeave runs the cluster for Dedicated Inference; you run it for CKS. With Dedicated, you configure GPU, runtime, scaling, and routing, and CoreWeave handles cluster operations, autoscaling, and availability. With CKS, you own the full Kubernetes stack: runtimes, scheduling, multi-node topology, and operational responsibility. Choose Dedicated when you want managed execution without giving up GPU and runtime choice. Choose CKS when you need to own and tune the entire stack.

Can I redirect existing CoreWeave reserved capacity to Dedicated Inference?

Yes. Customers with existing reserved nodes can allocate a portion of that capacity to Dedicated Inference at their contracted rates with no additional platform fee. On-demand GPU-hour billing is also available for workloads that don't require reserved capacity.

Can I update a deployed model without downtime?

Yes. Send a PATCH request with the updated deployment specification (model weights, GPU type, runtime, autoscaling parameters), and CoreWeave manages the rollout without taking down the serving endpoint. Production traffic continues uninterrupted during updates.

What is a gateway, and do I manage it?

A gateway is a tenant-isolated endpoint provisioned in your chosen Availability Zone. It handles authentication, load balancing, and request routing across all your model deployments. CoreWeave manages the gateway; you choose where it lives. You can deploy multiple gateways across zones for latency optimization or compliance, and configure which deployments receive traffic.

Where do my model weights need to live?

Model artifacts must be stored in CoreWeave AI Object Storage. If you're already training or fine-tuning on CoreWeave, your checkpoints are already in the right place, and no data movement is required. For models trained or stored elsewhere, you'll upload weights to CoreWeave AI Object Storage before deploying.

Which inference runtimes are supported?

Dedicated Inference supports vLLM and SGLang, with CoreWeave AI Object Storage and LOTA caching for accelerated model loading and KV reuse under sustained traffic. All deployments expose OpenAI-compatible endpoints. If your stack uses a custom runtime, talk to us about your architecture.