Previous lesson
What is frontier model serving cost
Continue the Model Costs track.
Compute College
Reasoning models generate long chains of thought before answering, multiplying output tokens — and output tokens drive inference cost.
One concept connected to AI compute market decisions.
A practical introduction designed to be completed in one sitting.
Useful for founders, product managers, analysts, and investors.
Interactive calculator
5,000,000 requests/month → estimated $42,000 in monthly serving cost, or about $0.0084 per request.
The calculator's starting values are illustrative defaults, not live ComputeTape benchmark prices. Replace them with a real quote.
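Here is a minimal sketch of the arithmetic behind a calculator like this. It assumes input and output tokens are billed per million tokens. The function name and every token count and price below are hypothetical placeholders. They are not the calculator's actual defaults, which the page does not show, and they are not market quotes. The only figures taken from the lesson are 5,000,000 requests and $42,000.

```python
def monthly_serving_cost(requests_per_month: int,
                         input_tokens: int,
                         output_tokens: int,
                         input_price_per_m: float,
                         output_price_per_m: float) -> float:
    """Estimated monthly serving cost in dollars.

    Assumes per-million-token pricing for input and output tokens;
    every argument is a hypothetical input to replace with real data.
    """
    cost_per_request = (input_tokens * input_price_per_m
                        + output_tokens * output_price_per_m) / 1_000_000
    return requests_per_month * cost_per_request


# Per-request cost implied by the example figures above.
implied_per_request = 42_000 / 5_000_000

# Placeholder inputs for illustration only; these are not real prices.
estimate = monthly_serving_cost(
    requests_per_month=5_000_000,
    input_tokens=800,
    output_tokens=1_200,
    input_price_per_m=1.00,
    output_price_per_m=5.00,
)
output_share = (1_200 * 5.00) / (800 * 1.00 + 1_200 * 5.00)

print(f"implied cost per request: ${implied_per_request:.4f}")
print(f"placeholder monthly estimate: ${estimate:,.0f}")
print(f"output share of per-request cost: {output_share:.0%}")
```

With these placeholder inputs, output tokens account for roughly 88% of each request's cost. As a result, the estimate moves almost in proportion to expected output length.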
Plain-English definition
A reasoning model spends extra compute at the time you ask the question — generating a long internal chain of thought before it produces a final answer. That extra thinking is made of output tokens, and output tokens are the costly part of serving a model, so the same question answered by a reasoning model can consume several times the compute of a standard model. This use of extra inference-time compute to improve answers is often called test-time compute.
Why it matters
Serving cost scales with the tokens a model generates, not just the prompt you send in — and reasoning models generate far more of them. That shifts compute demand from one-time training toward recurring inference and changes how buyers plan capacity.
Simple example
Suppose a standard model answers a question in 500 output tokens, and a reasoning model answers the same question in 5,000 output tokens: 500 tokens of answer plus 4,500 tokens of "thinking." That is 5,000 ÷ 500 = 10× the output tokens for one answer. If output tokens dominate the bill, the reasoning query costs roughly 10× as much to serve.
Example figures are illustrative calculations, not current quoted market prices.
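The example's arithmetic can be checked in a few lines of Python. The token counts come from the example. The 300-token prompt and the 4:1 output-to-input price ratio are hypothetical assumptions. They are added to show what "if output tokens dominate the bill" means in practice: once the prompt is counted, the cost ratio comes out somewhat below 10×.

```python
# Token counts from the simple example.
STANDARD_OUTPUT_TOKENS = 500
REASONING_ANSWER_TOKENS = 500
REASONING_THINKING_TOKENS = 4_500

reasoning_output_tokens = REASONING_ANSWER_TOKENS + REASONING_THINKING_TOKENS
output_ratio = reasoning_output_tokens / STANDARD_OUTPUT_TOKENS

# Hypothetical assumptions, not from the lesson: a 300-token prompt and
# output tokens priced at 4x the per-token price of input tokens.
PROMPT_TOKENS = 300
INPUT_PRICE = 1.0   # relative price units per input token
OUTPUT_PRICE = 4.0  # relative price units per output token


def query_cost(prompt_tokens: int, output_tokens: int) -> float:
    """Relative cost of one query: prompt tokens plus generated tokens."""
    return prompt_tokens * INPUT_PRICE + output_tokens * OUTPUT_PRICE


cost_ratio = (query_cost(PROMPT_TOKENS, reasoning_output_tokens)
              / query_cost(PROMPT_TOKENS, STANDARD_OUTPUT_TOKENS))

print(f"output-token ratio: {output_ratio:.1f}x")
print(f"total cost ratio with the prompt included: {cost_ratio:.2f}x")
```

The total ratio gets closer to the 10× output-token ratio as the output grows longer relative to the prompt. The same happens as output tokens get more expensive relative to input tokens.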
Market signal
A jump in reasoning-model usage is an inference-demand signal, not a training-demand one. It draws on serving capacity, high-bandwidth memory, and per-token economics rather than on one-off training clusters. Watch whether providers price reasoning or output tokens at a premium, and whether usage of long-output features is rising. Both point to inference becoming the larger share of compute spend.
Market read: reasoning usage turns better answers into recurring inference demand. Evidence discipline: token-cost ratios and "thinking length" are model- and task-specific, so record the model, the date, and the source for any cost-per-token figure, and keep illustrative ratios separate from observed prices.
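One way to enforce that discipline is to store every cost-per-token figure together with its model, date, and source, plus a flag for illustrative values. This is a minimal sketch. The class, function, and field names are hypothetical and not part of any ComputeTape tool, and the sample entries are placeholders.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class TokenPriceRecord:
    """One cost-per-token figure with the context needed to judge it."""
    model: str
    as_of: date
    source: str
    usd_per_million_output_tokens: float
    illustrative: bool  # True for worked examples, False for observed quotes


def observed_only(records: list[TokenPriceRecord]) -> list[TokenPriceRecord]:
    """Drop illustrative figures so they never mix with observed prices."""
    return [r for r in records if not r.illustrative]


# Placeholder entries: the model names, dates, sources, and prices are made up.
records = [
    TokenPriceRecord("example-reasoning-model", date(2025, 1, 15),
                     "lesson worked example", 10.0, illustrative=True),
    TokenPriceRecord("example-standard-model", date(2025, 1, 15),
                     "provider price page (placeholder)", 2.0, illustrative=False),
]

for r in observed_only(records):
    print(r.model, r.as_of.isoformat(), r.source, r.usd_per_million_output_tokens)
```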
Common mistake
Treating a reasoning model's cost like a standard model's because the prompt is the same size. The prompt (input) is processed in parallel and is comparatively cheap; the cost lives in the output, which is generated one token at a time and grows the memory (KV cache) held per request. A short question can still produce a long, expensive answer.
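The memory side of this mistake can be estimated with the usual transformer decoder KV-cache formula. Each token in the sequence stores one key vector and one value vector per layer and per KV head. The sketch below is minimal. The model configuration and the 100-token prompt are hypothetical, and real serving stacks can shrink the cache with techniques such as quantization.

```python
def kv_cache_bytes_per_token(num_layers: int, num_kv_heads: int,
                             head_dim: int, bytes_per_value: int) -> int:
    """Bytes of KV cache one token occupies: a key and a value per layer per KV head."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value


# Hypothetical decoder configuration (not any specific model), 16-bit values.
CONFIG = {"num_layers": 32, "num_kv_heads": 8, "head_dim": 128, "bytes_per_value": 2}
PROMPT_TOKENS = 100  # the same short question in both cases (hypothetical)
MIB = 1024 ** 2

per_token = kv_cache_bytes_per_token(**CONFIG)
for label, output_tokens in [("standard", 500), ("reasoning", 5_000)]:
    # By the last generated token, the cache covers the prompt plus every output token.
    total_tokens = PROMPT_TOKENS + output_tokens
    print(f"{label}: {total_tokens} tokens -> {total_tokens * per_token / MIB:.1f} MiB per request")
```

The question is identical in both cases, yet under these assumptions the reasoning answer ends up holding about 8.5× more KV-cache memory for the request. That memory limits how many requests a GPU can serve at once.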
Practical takeaway
Estimate reasoning-model cost from expected output length, not prompt size, and decide per task whether the extra reasoning is worth it.
Decision check: more reasoning is worth paying for only when the better answer changes the outcome enough to justify the extra tokens.
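The decision check can be framed as an expected-value comparison. This minimal sketch assumes you can estimate, per task, two things: how likely the better answer is to change the outcome, and what that change is worth in dollars. Those inputs and the $10-per-million output price are hypothetical. The 4,500 extra tokens reuse the thinking length from the simple example.

```python
def reasoning_worth_it(extra_output_tokens: int,
                       usd_per_million_output_tokens: float,
                       p_outcome_changes: float,
                       value_if_changed_usd: float) -> tuple[bool, float, float]:
    """Expected-value check: pay for extra reasoning only if the benefit exceeds its cost."""
    extra_cost = extra_output_tokens * usd_per_million_output_tokens / 1_000_000
    expected_benefit = p_outcome_changes * value_if_changed_usd
    return expected_benefit > extra_cost, extra_cost, expected_benefit


# 4,500 extra "thinking" tokens, as in the simple example; the price,
# probabilities, and values below are hypothetical.
high_stakes = reasoning_worth_it(4_500, 10.0, p_outcome_changes=0.02, value_if_changed_usd=5.0)
low_stakes = reasoning_worth_it(4_500, 10.0, p_outcome_changes=0.001, value_if_changed_usd=1.0)

print("high-stakes task:", high_stakes)
print("low-stakes task:", low_stakes)
```

The same 4,500 tokens pass the check on the high-stakes task and fail it on the low-stakes one. That is the per-task decision the takeaway describes.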
Helpful memory trick
Thinking out loud isn't free — every word of a model's reasoning is an output token, and output tokens are what you pay for.