AI compute market signals and learning
Compute College

Tokens per second explained

Learn what tokens per second means, how model throughput affects AI applications, and why throughput matters for AI compute capacity planning.

Learning path: Compute & Pricing Lessons

One concept connected to AI compute market decisions.

Read time: 5-8 minutes

A practical introduction designed to be completed in one sitting.

Tags: Throughput / Tokens / Serving

Useful for developers, founders, procurement teams, and analysts tracking model-serving economics.

Plain-English definition

Tokens per second is the rate at which a model generates output tokens after response generation begins, making it a useful throughput measure for model serving.

Why it matters

Output throughput influences user wait time and how much demand a serving stack can handle. Faster useful output may let the same infrastructure serve more work, although other bottlenecks still matter.

  • Capability changes matter economically only when they affect deployed workloads or buyer choices.
  • Token volume, latency, retries, and throughput together determine what a useful result costs to serve.
  • A ComputeTape reader should connect model evidence to inference demand and required AI compute capacity.

Simple example

At an illustrative 50 generated tokens per second, an output of 1,000 tokens takes about 20 seconds after generation starts, before accounting for queueing or first-token delay.

  • Use the example to compare workload economics, not as a current market quote.
  • Record the task type, evaluation or workload conditions, and the cost inputs before comparing results.
  • A successful result is valuable only if its latency and cost fit the intended production use.

Example figures are illustrative calculations, not current quoted market prices.
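
The arithmetic in this example can be written as a short Python sketch. The function name generation_seconds is hypothetical, and the inputs are the illustrative figures from the example (1,000 output tokens at 50 tokens per second), not measurements. It covers only the time after generation starts, so queueing and first-token delay are left out.

```python
def generation_seconds(output_tokens: int, tokens_per_second: float) -> float:
    """Seconds spent generating output once generation has started.

    Excludes queueing and first-token delay, as in the example text.
    """
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return output_tokens / tokens_per_second


# Illustrative figures from the lesson: 1,000 output tokens at 50 tokens/second.
example_seconds = generation_seconds(1_000, 50.0)
print(f"{example_seconds:.0f} seconds")  # prints "20 seconds"
```

Changing either input shows how the estimate scales: doubling the output length doubles the generation time, and doubling the rate halves it.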

Market signal

How to read the market signal

Higher usable throughput can reduce effective serving cost or expand capacity; falling throughput under load can reveal demand pressure on serving systems.

  • Look for adoption, routing, usage-volume, or capacity signals rather than a headline score alone.
  • Compare input tokens, output tokens, latency, tool rounds, retries, and completion quality together.
  • Keep sourced capability facts separate from interpretation about future AI compute demand.

Market read: this metric becomes an AI compute signal only when it changes serving volume, effective workload cost, or the capacity buyers require.
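
One way to see the link between throughput and effective serving cost is a rough cost-per-token calculation. The sketch below is an assumption-laden illustration, not a pricing model: the hourly cost of 2.00 is a made-up figure in arbitrary cost units, the throughput values are hypothetical, and "aggregate" here means total output tokens per second across all requests a deployment is serving at once.

```python
def cost_per_million_output_tokens(
    hourly_cost: float, aggregate_tokens_per_second: float
) -> float:
    """Effective cost of one million output tokens for a deployment.

    hourly_cost: what the deployment costs per hour (hypothetical units).
    aggregate_tokens_per_second: total output tokens per second across
    all concurrent requests (an assumption of this sketch).
    """
    if aggregate_tokens_per_second <= 0:
        raise ValueError("aggregate_tokens_per_second must be positive")
    tokens_per_hour = aggregate_tokens_per_second * 3600
    return hourly_cost / tokens_per_hour * 1_000_000


HYPOTHETICAL_HOURLY_COST = 2.00  # illustrative only, not a market quote

higher_throughput = cost_per_million_output_tokens(HYPOTHETICAL_HOURLY_COST, 1_000.0)
lower_throughput = cost_per_million_output_tokens(HYPOTHETICAL_HOURLY_COST, 500.0)
print(f"{higher_throughput:.2f} vs {lower_throughput:.2f}")  # prints "0.56 vs 1.11"
```

With the hourly cost held fixed, halving usable throughput doubles the effective cost per token, which is why throughput changes can show up as a cost or capacity signal.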

Common mistake

Do not confuse tokens per second with the full user experience: first-token latency, output length, quality, and batching also matter.
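
A small sketch can show why the generation rate alone may mislead. It assumes a simple additive model (queue time plus first-token latency plus generation time), and every number in it is hypothetical. The function name response_seconds and its parameters are illustrative, not part of any serving API.

```python
def response_seconds(
    output_tokens: int,
    tokens_per_second: float,
    first_token_seconds: float = 0.0,
    queue_seconds: float = 0.0,
) -> float:
    """Approximate wait for a full response under a simple additive model."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return queue_seconds + first_token_seconds + output_tokens / tokens_per_second


# Hypothetical comparison for a 200-token answer:
# a faster generator with a slow first token vs. a slower one that starts quickly.
faster_rate = response_seconds(200, 100.0, first_token_seconds=3.0)
slower_rate = response_seconds(200, 50.0, first_token_seconds=0.5)
print(faster_rate, slower_rate)  # prints "5.0 4.5"
```

In this made-up case the option with twice the tokens per second still finishes later, because first-token latency dominates a short output.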

Practical takeaway

What you can do with this

Use measured tokens per second alongside expected output length and concurrent demand to estimate response time and required serving capacity.

  • Buyers: test the metric on tasks close to the workload you will pay to serve.
  • Builders: measure tokens, latency, retries, completion rate, and model price on each test run.
  • Analysts: require a source and an adoption mechanism before treating a model result as demand evidence.

Decision check: identify the capability measured, the serving cost driver it affects, and the buyer behavior that would make capacity demand change.
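
The capacity estimate described in this section can be sketched as follows. This is a back-of-the-envelope calculation under stated assumptions: demand is expressed as requests per second times expected output tokens per request, each serving replica is assumed to sustain a measured aggregate output rate, and target_utilization is a hypothetical headroom setting. All figures are illustrative, and input-token processing is ignored.

```python
import math


def required_replicas(
    requests_per_second: float,
    output_tokens_per_request: int,
    replica_tokens_per_second: float,
    target_utilization: float = 1.0,
) -> int:
    """Estimate how many serving replicas a steady output-token demand needs."""
    if replica_tokens_per_second <= 0:
        raise ValueError("replica_tokens_per_second must be positive")
    if not 0 < target_utilization <= 1:
        raise ValueError("target_utilization must be in (0, 1]")
    demand_tokens_per_second = requests_per_second * output_tokens_per_request
    usable_tokens_per_second = replica_tokens_per_second * target_utilization
    return math.ceil(demand_tokens_per_second / usable_tokens_per_second)


# Hypothetical workload: 2 requests/second, 500 output tokens each,
# replicas that each sustain 400 output tokens/second in aggregate.
no_headroom = required_replicas(2.0, 500, 400.0)
with_headroom = required_replicas(2.0, 500, 400.0, target_utilization=0.8)
print(no_headroom, with_headroom)  # prints "3 4"
```

Rounding up matters here: 2.5 replicas of demand still requires 3 deployed, and leaving 20 percent headroom pushes the same workload to 4.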

Helpful memory trick

Tokens per second is the model's output speedometer.

