AI compute market signals and learning

When to Use Spot GPUs vs Reserved Capacity

Spot GPUs suit flexible work; reserved capacity suits predictable or critical AI workloads that need dependable access.

Learning path: Buyers & Operators

One concept connected to AI compute market decisions.

Read time: 5-8 minutes

A practical introduction designed to be completed in one sitting.

Tags: Spot GPUs / Reserved Capacity / Buyers

Useful for founders, ML engineers, product managers, procurement teams, and analysts.

Plain-English definition

Use spot GPUs for flexible, fault-tolerant AI workloads that can pause, checkpoint, retry, or wait; use reserved capacity for predictable or mission-critical workloads that need reliable access at a planned time. Choosing between spot GPUs and reserved capacity is a risk decision as well as a price decision.

Why it matters

Access terms affect the effective cost of compute. Spot capacity can lower the bill for a flexible experiment, but interruptions can erase those savings for a time-sensitive training run or production service. Reservations can protect access when supply is tight, but buyers may pay for capacity they do not use.

  • Batch experiments, offline inference, and restartable evaluation jobs can often trade certainty for lower cost.
  • Launch-critical training windows and live serving capacity usually place more value on availability and recovery terms.
  • The right mix can combine reserved baseload with spot overflow rather than apply one purchase model to every workload.
  • Spot availability and reservation pressure are useful signals about spare capacity and buyer urgency.
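
The reserved-baseload-plus-spot-overflow idea can be sketched as simple arithmetic. This is a minimal illustration, not a pricing model: the function name, the hourly demand profile, and both rates are hypothetical, and it assumes spot overflow is always available and never interrupted, which real spot capacity does not guarantee.

```python
def blended_cost(hourly_demand, reserved_gpus, reserved_rate, spot_rate):
    """Cost of covering an hourly GPU demand profile with a reserved
    baseload plus spot overflow. Rates are hypothetical, per GPU-hour."""
    # Reserved capacity is paid for every hour, whether it is used or not.
    reserved_cost = reserved_gpus * reserved_rate * len(hourly_demand)
    # Demand above the reserved baseload spills onto spot.
    overflow_gpu_hours = sum(max(0, d - reserved_gpus) for d in hourly_demand)
    # Reserved GPU-hours that were paid for but sat idle.
    idle_gpu_hours = sum(max(0, reserved_gpus - d) for d in hourly_demand)
    spot_cost = overflow_gpu_hours * spot_rate
    return {
        "reserved_cost": reserved_cost,
        "spot_cost": spot_cost,
        "total_cost": reserved_cost + spot_cost,
        "idle_reserved_gpu_hours": idle_gpu_hours,
    }


# Illustrative inputs only: GPUs needed in each of six hours.
demand = [4, 4, 8, 16, 8, 4]
plan = blended_cost(demand, reserved_gpus=4, reserved_rate=2.0, spot_rate=1.0)
print(plan)
```

Raising `reserved_gpus` toward the peak removes the spot exposure but increases `idle_reserved_gpu_hours`, which is the unused-capacity cost described above.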

Simple example

Assume a restartable batch job would cost an illustrative $10,000 on reliable capacity. A spot offer discounted by 60% would cost $4,000 if the job finishes without interruption. If repeated interruptions force the work to be run three times in full, the spot compute expense becomes $12,000, which exceeds the $10,000 reliable-capacity cost, and the completion deadline may slip.

  • The same discount can be valuable for checkpointed work and unacceptable for a production endpoint.
  • Reserved capacity might be justified for a product launch if unavailable GPUs cause lost users or missed commitments.
  • A blended plan could reserve predictable baseload and use spot only for surplus flexible work.
  • Illustrative savings are scenario math, not a current quote or a guaranteed interruption pattern.

Example figures are illustrative calculations, not current quoted market prices.
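
The scenario math above can be reproduced in a few lines. This is a sketch under a simplifying assumption: every interruption forces a complete rerun, so cost scales with the number of full runs. Checkpointed jobs usually lose less than a full run per interruption. The function names are hypothetical, and the discount is passed as a whole-number percentage to keep the arithmetic exact.

```python
def spot_cost(reliable_cost, discount_pct, full_runs):
    """Spot expense when the whole job must be run `full_runs` times."""
    return reliable_cost * (100 - discount_pct) * full_runs / 100


def break_even_runs(discount_pct):
    """Number of full runs at which spot costs the same as reliable capacity."""
    return 100 / (100 - discount_pct)


print(spot_cost(10_000, 60, 1))   # uninterrupted spot run
print(spot_cost(10_000, 60, 3))   # work repeated three times
print(break_even_runs(60))        # runs at which the discount is used up
```

At a 60% discount the break-even is 2.5 full runs, so under this assumption any interruption pattern that costs more than that removes the saving before deadline risk is even counted.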

Market signal

How to read the market signal

Watch the relationship between spot availability, on-demand terms, and reservation access. When spot offers become scarce or priced nearer to reliable capacity, spare supply may be narrowing. When buyers cannot obtain reservations at a needed date or configuration, anticipated demand may be absorbing future capacity.

  • Compare the same accelerator class, region, cluster quality, delivery timing, and workload suitability before reading price spreads.
  • A spot discount accompanied by very limited quantities is not the same signal as abundant flexible supply.
  • Reservation demand can rise because buyers need reliability even if standard list prices appear stable.
  • Record price source, observation timestamp, capacity terms, and interruption caveat before using an offer as evidence.

Market read: spot measures spare flexible access; reservations measure the value of certainty. The spread between them matters only when both can serve comparable work.
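
One way to keep observations usable as evidence is to record them in a fixed structure and refuse to compute a spread for offers that are not comparable. The sketch below is illustrative: the field names, the `accel-A` label, the sources, the prices, and the timestamp are all placeholders, and comparability is reduced to accelerator and region, whereas a real comparison would also cover cluster quality, delivery timing, and workload suitability.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class OfferObservation:
    source: str                # where the price was seen
    observed_at: datetime      # observation timestamp
    access_type: str           # "spot", "on_demand", or "reserved"
    accelerator: str
    region: str
    price_per_gpu_hour: float
    quantity_available: int    # depth behind the quoted price
    interruption_caveat: str   # capacity and interruption terms


def comparable(a: OfferObservation, b: OfferObservation) -> bool:
    """Simplified check: same accelerator class and region."""
    return a.accelerator == b.accelerator and a.region == b.region


def spot_discount_pct(spot: OfferObservation,
                      reliable: OfferObservation) -> Optional[float]:
    """Spot discount relative to reliable capacity, or None if not comparable."""
    if not comparable(spot, reliable):
        return None
    gap = reliable.price_per_gpu_hour - spot.price_per_gpu_hour
    return 100 * gap / reliable.price_per_gpu_hour


seen = datetime(2024, 1, 1, tzinfo=timezone.utc)  # placeholder timestamp
spot = OfferObservation("provider-x listing", seen, "spot", "accel-A",
                        "region-1", 1.0, 8, "may be reclaimed at short notice")
on_demand = OfferObservation("provider-x listing", seen, "on_demand", "accel-A",
                             "region-1", 2.5, 64, "none stated")
other_region = OfferObservation("provider-y listing", seen, "on_demand", "accel-A",
                                "region-2", 2.0, 64, "none stated")

print(spot_discount_pct(spot, on_demand))
print(spot_discount_pct(spot, other_region))
```

Keeping `quantity_available` next to the price makes it harder to mistake a deep discount on a handful of GPUs for abundant flexible supply.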

Common mistake

Do not place a workload on spot simply because the price is lower. If checkpointing is missing, the dataset is costly to restage, the training deadline is fixed, or users depend on the service, interruption can create more cost and harm than the discount saves. Likewise, do not reserve capacity for uncertain demand without estimating the risk of paying for idle GPUs.

Practical takeaway

What you can do with this

Classify workloads before selecting access: flexible, deadline-driven, or mission-critical. Estimate both compute expense and the consequence of unavailable or interrupted capacity, then decide what portion warrants certainty.

  • ML teams: verify checkpoint, restart, scheduling, and data-restaging behavior before moving flexible work to spot.
  • Product managers: define the reliability and latency requirement for serving workloads before committing capacity.
  • Procurement teams: quote spot, on-demand, and reserved alternatives on consistent hardware and service assumptions.
  • Analysts: follow spot depth and reserved availability separately because they expose different market conditions.

Decision check: use spot only when the job can survive its interruption scenario, and reserve only when the cost of uncertain access or missed timing justifies commitment.
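
The classify-then-decide step can be sketched as a small rule set. The rules below are a deliberately coarse assumption for illustration, not a procurement policy: the `Workload` fields, the category logic, and the suggestion labels are hypothetical, and a real decision would also weigh the cost figures and idle risk discussed above.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    name: str
    can_checkpoint: bool   # restart and checkpoint behavior has been verified
    fixed_deadline: bool   # a missed completion date has a real cost
    serves_users: bool     # live traffic depends on this capacity


def classify(w: Workload) -> str:
    """Place a workload in one of the three categories named in the text."""
    if w.serves_users:
        return "mission-critical"
    if w.fixed_deadline:
        return "deadline-driven"
    return "flexible"


def suggested_access(w: Workload) -> str:
    """First-pass suggestion only; the final choice depends on cost estimates."""
    category = classify(w)
    if category == "flexible" and w.can_checkpoint:
        return "spot candidate"
    if category == "flexible":
        return "verify checkpointing before spot"
    return "reliable capacity candidate"


jobs = [
    Workload("nightly-eval", can_checkpoint=True, fixed_deadline=False, serves_users=False),
    Workload("launch-training", can_checkpoint=True, fixed_deadline=True, serves_users=False),
    Workload("chat-endpoint", can_checkpoint=False, fixed_deadline=False, serves_users=True),
]
for job in jobs:
    print(job.name, classify(job), suggested_access(job))
```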

Helpful memory trick

Spot is cheap when you can wait or restart; reserved is valuable when you cannot miss.