Machine Learning & AI

Can It Scale? How Autoscaling Impacts Compute Costs for Inference

In Summary:

  • Effective autoscaling (faster model loading and pod spin-up times) can reduce the cost of running inference and optimize your compute usage. 
  • It’s important to consider both latency and cost when comparing autoscaling scenarios; an unoptimized autoscaling solution can lead to performance degradation, resulting in a poor user experience and potentially more costs.
  • A better autoscaler makes for a more cost-effective solution without hindering your application’s performance.


When evaluating a cloud inference service, companies have to choose how they will define success. This means prioritizing no more than two of the three: accuracy, low latency, and low cost.

Whichever two you prioritize, there is a drawback. A low-latency, highly accurate service often means you’re paying for lots of idle compute. Of course, the ideal scenario would be to have all three, but it currently doesn’t exist for the majority of the market. 

Evaluating an inference service involves weighing accuracy, low latency, and low cost.

The reality is that cost is always a consideration. So, how can companies run inference more efficiently? An important factor is autoscaling.

Effective autoscaling (faster model loading and pod spin-up times) can drastically reduce the cost of running inference and optimize your compute usage. 

Autoscaling: What It Is and How It Works

What is autoscaling? 

Autoscaling is the process that enables an application in a cloud service to automatically scale capacity in response to changing demand. 

Typically, the user sets the minimum and maximum number of pod replicas for their inference service, and autoscaling is enabled any time the value of minReplicas differs from the value of maxReplicas. 

During periods of high usage, the autoscaler will spin up more pods to meet demand and scale pods down as usage declines. Some cloud inference services, like CoreWeave’s, can automatically scale to zero during long periods of idle time, consuming no compute resources and incurring no charges.
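The core scaling decision can be sketched in a few lines. This is an illustrative simplification with hypothetical names, not the actual logic of any particular autoscaler; real systems use richer signals such as request concurrency and resource metrics.

```python
def desired_replicas(pending_requests: int,
                     requests_per_pod: int,
                     min_replicas: int = 0,
                     max_replicas: int = 25) -> int:
    """Scale pods to match demand, clamped to the configured bounds."""
    # Ceiling division: how many pods are needed to cover the queue.
    needed = -(-pending_requests // requests_per_pod)
    return max(min_replicas, min(max_replicas, needed))

print(desired_replicas(0, 4))    # scales to zero when idle
print(desired_replicas(100, 4))  # capped at maxReplicas
```

With minReplicas set to 0, the service consumes nothing while idle; with minReplicas equal to maxReplicas, autoscaling is effectively disabled, as described above.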

This is a simple illustration of the CoreWeave Inference Service.

Why is autoscaling important?

Autoscaling aims to optimize your compute usage and keep costs low by helping to avoid under- or over-provisioning GPU resources. This makes it a vital component for most modern workloads, especially those with unpredictable usage like inference. 

Let’s say your app gets an unexpected promotion on social media (yay!) like this brand did. Autoscaling helps ensure your app can access enough resources needed to serve the spike in usage without completely slowing down the inference process.

With autoscaling, you don’t have to manually scale your infrastructure or frantically find more GPU resources to handle more requests. The autoscaler handles that for you. 

For inference services in which you pay only for the compute you use, the quality of the autoscaling can have a massive impact on your bottom line.

What factors impact autoscaling?

The main factors that impact the quality of autoscaling include:

  • Time to spin-up pods: The faster the cloud can spin up and spin down new pods, the faster it can react to spikes in usage. This leads to faster autoscaling, which can help keep the compute resources you’re using more aligned with what your service actually needs.
  • Time to load a model: After spinning up a new pod, the inference service must load the model onto a GPU. This can be as quick as five seconds for smaller models or much longer for larger models, which ultimately impacts the speed of the autoscaler. 
  • Quotas/access to more compute: Some cloud inference services will set a quota for the amount of compute you can access over a given time period. If your application exceeds this, then the autoscaler will not be able to reach more hardware and will likely drop pods (aka, a user’s request is lost). 
  • Virtualization/Hypervisor layer: Generalized cloud providers have a hypervisor layer that sits between the user’s control plane and the hardware, which spins up each new pod in a virtual machine. These added layers and steps slow down the process of inference autoscaling.
  • GPU type: GPUs with more modern architecture have better performance on a one-to-one GPU comparison.
  • Networking: Faster networking means faster transfer speed (aka, loading the model from storage to the GPU to run inference).
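The model-loading and networking factors above combine into a simple back-of-envelope estimate: load time is roughly model size divided by effective transfer bandwidth. The sizes and bandwidth figures below are assumptions for illustration only.

```python
def load_time_seconds(model_size_gb: float, bandwidth_gb_per_s: float) -> float:
    """Rough seconds to move model weights from storage to GPU memory
    at a given effective bandwidth (GB/s)."""
    return model_size_gb / bandwidth_gb_per_s

# A small ~5 GB model at 1 GB/s loads in about five seconds,
# consistent with the figure mentioned above; a large model takes
# proportionally longer at the same bandwidth.
print(load_time_seconds(5, 1))
print(load_time_seconds(140, 1))
```

Doubling effective bandwidth halves the load time, which is why faster networking directly speeds up autoscaling.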

Does autoscaling impact performance?

Infrastructure built for modern AI workloads should not experience a lag in performance when autoscaling. Effective autoscaling (faster model loading time + pod spin-up times) can help keep down the costs of running inference and optimize your compute usage without impacting performance.

However, most cloud providers were built for general-purpose workloads; they run virtual machines, are not bare metal, and may not use sophisticated networking solutions. In this setup, autoscaling can cause performance degradation when starting new instances. This would happen during a large spike in usage and could result in a bad customer experience for many users.

Check out our benchmark report to see how CoreWeave’s inference service scaled new instances 8-10x faster than a popular generalized cloud. 

Autoscaling & Cost Analysis: How Autoscaling Impacts Compute Costs

A more responsive autoscaling service leads to faster inference, a better user experience, and lower overall costs for compute. 

To demonstrate how, we’ll show three scenarios with different levels of autoscaling: no autoscaling, slow autoscaling, and fast autoscaling. For all scenarios, we’ll show how long it takes an inference service to respond to a spike of 100 requests as well as how quickly the autoscaler can make more compute available. 

We’ll also show an approximated cost for each inference service and how we calculated that based on the compute usage.

Scenario 1: Idle Compute/No Autoscaling 

This first scenario demonstrates what an inference service looks like with no autoscaling capabilities and running ~25 pods at all times. During a spike to 100 requests, the inference service can only handle 25 requests at a time because its maxReplicas is set to 25. 

Pre-computing how many resources are required to handle bursts can be very risky since an unforeseen spike can result in a bad user experience. 

This is a very uncommon scenario for inference because it relies on continuous requests without much fluctuation in usage. You could see this setup for non-consumer services where the end user is flexible on time, such as biology or chemistry simulations like predicting protein structures.

Cost Analysis

Cost is based on how many GPUs you have running. Let's say for each unit of time (each bar on the x-axis), a single machine has a cost of P. The total cost is the sum of each timestep cost, where the timestep cost is P * (total compute available). 

Therefore, the cost of the no-autoscaling scenario is 25 * P * 6 = 150P.
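The cost formula above can be written out directly. This is just the article’s arithmetic in code form, with P set to 1 arbitrary cost unit.

```python
P = 1  # cost of one machine for one unit of time (arbitrary unit)

def total_cost(pods_per_timestep: list[int], p: float = P) -> float:
    """Sum of per-timestep cost, where each timestep costs
    p * (total compute available at that timestep)."""
    return sum(p * pods for pods in pods_per_timestep)

# No autoscaling: 25 pods held for all 6 timesteps -> 150P.
print(total_cost([25] * 6))
```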

Comparing on cost alone isn’t quite fair, though, because the fast-autoscaling scenario also serves the 100 requests the fastest.

Because this service type is always running 25 pods, you can find yourself paying for unused compute resources. This makes the whole service less cost-efficient unless you are constantly and consistently serving an expected amount of requests.

Overall benefits:

  • Reduce end-to-end latency
  • Simple to set up if managed yourself

Overall drawbacks:

  • Raises costs due to large overhead of idle compute
  • Requires perfectly planning for expected traffic

Scenario 2: Slow Autoscaling 

This scenario demonstrates how slow autoscaling can lead to performance degradation. Of all three scenarios, this one takes the most time to complete all 100 requests. 

The autoscaler is slow to spin up new pods when there’s a large spike in traffic because it starts with zero pods. This could be due to virtualization, in which a service relies on virtual machines to spin up new pods, or slower networking solutions. These additional layers between the job you want to run and the hardware can slow down your service, especially its ability to respond quickly to changes in demand.

Unfortunately, this scenario is not uncommon. A cloud provider with infrastructure and autoscaling that’s not specifically built for AI can see performance degradation with these large spikes. In recent benchmarks our team conducted, we found that it could take nearly ten minutes to scale a cluster at a generalized cloud provider.
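A toy discrete-time simulation makes the effect concrete. Assume (hypothetically) that each pod completes one request per timestep and that a requested pod takes `spinup_delay` timesteps to become ready; the only difference between a fast and a slow autoscaler is that delay.

```python
def timesteps_to_drain(requests: int, spinup_delay: int,
                       max_pods: int = 25) -> int:
    """Timesteps needed to serve a burst of requests, given pod spin-up delay."""
    pods, t = 0, 0
    pending = []  # timestep at which each requested pod becomes ready
    while requests > 0:
        t += 1
        pods += sum(1 for ready in pending if ready == t)
        pending = [r for r in pending if r > t]
        # Ask for more pods if running + in-flight pods are under capacity.
        shortfall = max_pods - pods - len(pending)
        if shortfall > 0:
            pending += [t + spinup_delay] * shortfall
        requests -= min(pods, requests)
    return t

print(timesteps_to_drain(100, spinup_delay=1))  # fast autoscaler: 5 steps
print(timesteps_to_drain(100, spinup_delay=5))  # slow autoscaler: 9 steps
```

Every extra timestep of spin-up delay is a timestep in which requests sit in the queue, which is exactly the degradation described above.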

Cost Analysis

Unlike the first scenario, in which you pay for idle compute to be able to respond quicker to new requests, here you don’t run any resources when there are no requests. This removes the cost of idle compute but can also lead to slower application performance.

It can be a challenge to determine how much the slower performance impacts your bottom line. If the user experience is bad enough that people prefer another, faster solution, the lag in performance can be very costly.

Overall benefits:

  • Reduces idle compute and low resource utilization
  • Helps reduce costs because there is no idle compute

Overall drawbacks:

  • Requests will wait in a queue
  • High latency can make for a bad user experience

Scenario 3: Optimal Autoscaling

This last inference service shows what optimal autoscaling could look like. Starting from zero pods, the service can quickly respond to the 100 requests and complete them all in record time. The service also scales down as fewer requests wait in the queue, scaling all the way back down to zero.

The problem: "Optimal" autoscaling isn’t achievable in practice. This scenario is meant to provide an upper bound on performance; even if the autoscaling is perfect, the requests will still take some time to process since the ML inference itself takes time. 

Cost Analysis

There are two main metrics to look at here: cost and latency. You could achieve the best possible latency by paying to run 100k GPUs all the time, so whenever a request comes in, a GPU is ready for it (similar to "optimal autoscaling," except you pay for everything to run 24/7).

Similar to scenario 2, you don’t pay for idle compute outside of the requests, so that immediately helps keep the overall costs lower. You still only pay for the compute resources you use (in blue), which are equivalent to the amount of resources required to complete the 100 requests. 

With efficient autoscaling, you can reduce the buffer of idle GPUs you need to maintain a given level of latency. So: same latency, lower cost.

Alternatively, you can keep the buffer the same to keep latency more steady when there is a burst of requests.
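Putting the two cost analyses side by side: using the same per-timestep cost model as before (P per pod per timestep, one request served per pod per timestep, all figures illustrative), the always-on fleet pays for idle timesteps while the autoscaled one pays only for the work done.

```python
P = 1  # cost of one machine for one unit of time (arbitrary unit)

# 25 pods held for 6 timesteps, whether or not requests are waiting.
always_on = [25, 25, 25, 25, 25, 25]
# 100 requests drain in 4 timesteps at 25 pods, then scale to zero.
autoscaled = [25, 25, 25, 25, 0, 0]

cost_always_on = sum(P * n for n in always_on)    # 150P, as in scenario 1
cost_autoscaled = sum(P * n for n in autoscaled)  # 100P: only compute used
print(cost_always_on, cost_autoscaled)
```

The 100P figure is exactly the compute required to complete the 100 requests, matching the "pay only for the compute you use" description above.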

Overall benefits:

  • Don't pay for GPU time not being used by an inference request
  • Different ways to keep latency low while maintaining lower costs

Overall drawbacks:

  • Unrealistic, especially for any consumer-facing application
  • Paying for "perfect" autoscaling 24/7 will still be expensive

Autoscaling and Inference: Where to Go from Here

Since perfect autoscaling is not achievable—and cost will always be a consideration for your inference service—the best solution you will find is somewhere between scenarios 2 and 3. You want a solution that can quickly autoscale more pods without hindering your application’s performance, and thankfully, a better quality autoscaler can mean a more cost-effective solution.

Autoscaling is just one factor that can impact the overall cost of serving inference. As you start to plan compute capacity for serving inference from your application, it’s important to keep in mind how much you need—and how quickly that can grow. 

The costs for popular AI applications, such as ChatGPT, to serve inference are astronomically high, even compared to model training. 

That’s why every optimization for your cloud infrastructure must be purpose-built to support AI workloads, including autoscaling. If you’d like to learn more about autoscaling, capacity planning, and serving inference for your AI application, please contact us to schedule a meeting. 
