Estimating in AI Tokens

As AI writes more of the code, some teams are sizing work in tokens instead of story points

Story points have measured one thing for two decades: the relative effort a human team spends on a piece of work. But when an AI agent drafts the implementation, runs the tests, and iterates on the diff, the scarce resource starts to shift. A popular prediction making the rounds is that teams will soon estimate work not in points or hours, but in the number of AI tokens a task will burn. It's partly tongue-in-cheek, but it's also a useful reframe, and it fits planning poker neatly.

WHAT TOKEN ESTIMATION MEANS

A token is the unit in which large language models read and write text: very roughly, three-quarters of an English word. Every prompt, every file an agent reads, and every revision it generates consumes tokens. When you estimate a task in tokens, you are answering a familiar question in a new currency: "how much work is this?", where work now means the volume of AI back-and-forth needed to get to a merged change. Like story points, a token estimate is deliberately approximate. Nobody can predict an exact token count for a feature any more than they can predict exact hours. The value is in the relative sizing: a one-line config tweak might take a few thousand tokens, while a cross-cutting refactor can run to millions.
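The three-quarters-of-a-word rule of thumb can be turned into a quick back-of-envelope converter. This is a minimal sketch, assuming English prose; the names WORDS_PER_TOKEN and approx_tokens are illustrative, and real counts depend on the model's tokenizer and on how much of the text is code.

```python
# Rule of thumb from the text: one token is very roughly three-quarters of a word.
# Assumption: English prose. Code, other languages, and other tokenizers will differ.
WORDS_PER_TOKEN = 0.75


def approx_tokens(word_count: int) -> int:
    """Return a rough token count for a given number of words."""
    if word_count < 0:
        raise ValueError("word_count must be non-negative")
    return round(word_count / WORDS_PER_TOKEN)


if __name__ == "__main__":
    # A 750-word spec comes out at about 1,000 tokens under this rule of thumb.
    print(approx_tokens(750))
```

Treat the result as a sizing aid only; for anything that matters, count with the tokenizer of the model you actually run.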

WHY TOKENS, NOT POINTS

Token estimation has one property story points famously lack: tokens add up. Story points are intentionally non-linear. A 13 is not "five more than an 8" of anything, which is why summing them across a sprint is statistically dubious. Tokens are a real, countable quantity, so a sprint total of "3.2 million tokens" is a meaningful number. There's a practical payoff, too: tokens convert to money. Multiply by a blended rate, expressed in dollars per million tokens, and a sprint backlog turns into an approximate compute cost. An estimate can then answer "what will this sprint cost to build?", counted not in salaried hours but in the actual spend of running the agents that do the work.
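One plausible way to arrive at a blended rate is to weight separate input and output prices by the share of tokens that are output. The sketch below assumes that approach; the function name blended_rate, the $3 and $15 prices, and the 25% output share are hypothetical placeholders, not any vendor's actual pricing.

```python
def blended_rate(input_rate: float, output_rate: float, output_share: float) -> float:
    """Blend input and output prices into one rate, in dollars per million tokens.

    output_share is the fraction of all tokens that are generated output (0 to 1).
    """
    if not 0.0 <= output_share <= 1.0:
        raise ValueError("output_share must be between 0 and 1")
    return input_rate * (1.0 - output_share) + output_rate * output_share


if __name__ == "__main__":
    # Hypothetical prices: $3 per million input tokens, $15 per million output tokens,
    # with a quarter of all tokens being output.
    print(blended_rate(3.0, 15.0, 0.25))
```

Agents usually read far more than they write, so the blended figure tends to sit closer to the input price than the output price.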

THE AI TOKEN DECK

A token deck works like any other planning-poker scale, just with token-sized buckets that climb exponentially so the team estimates by order of magnitude rather than with false precision:

1K · 5K · 25K · 100K · 500K · 2M

Each step is four to five times the last. It is the same kind of widening spacing that makes the Fibonacci scale work, only steeper, so the cognitive habit carries straight over. A "?" card still means "I don't have enough information to estimate," and instead of the traditional coffee cup for a break, a token deck uses an oil drum, the robot's equivalent of needing a top-up.
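To make the bucket idea concrete, here is a minimal sketch that snaps a raw token guess to the closest card. It assumes "closest" is measured on a logarithmic scale, which suits an order-of-magnitude deck; the names DECK, nearest_card, and label are illustrative rather than part of any planning-poker tool.

```python
import math

# The deck from the text, in tokens.
DECK = [1_000, 5_000, 25_000, 100_000, 500_000, 2_000_000]


def nearest_card(tokens: int) -> int:
    """Snap a raw token guess to the closest card, comparing on a log scale."""
    if tokens <= 0:
        raise ValueError("tokens must be positive")
    return min(DECK, key=lambda card: abs(math.log(card) - math.log(tokens)))


def label(card: int) -> str:
    """Format a card value the way it appears on the deck, e.g. 500K or 2M."""
    if card >= 1_000_000:
        return f"{card // 1_000_000}M"
    return f"{card // 1_000}K"


if __name__ == "__main__":
    # Ratio between neighbouring cards: every step is 4x or 5x.
    print([later // earlier for earlier, later in zip(DECK, DECK[1:])])
    print(label(nearest_card(60_000)))
```

Comparing logarithms rather than raw differences means a 60K guess lands on the 100K card instead of 25K, because it is a smaller multiple away from 100K.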

TURNING TOKENS INTO COST

The optional piece is a single rate: dollars per one million tokens. Set it to whatever reflects your own blended cost across the models you run, and each agreed estimate picks up an approximate dollar figure beside it. A story sized at 500K tokens at a $15/1M rate shows as roughly $7.50. Across a backlog, those figures add up to a running total: "this sprint is about 3.2M tokens, roughly $48 in compute, with two stories still unestimated." It is an order-of-magnitude sanity check, not an invoice: real token use depends on the model, the split between input and output, retries, and how much context each task drags in. Treat the number as a directional signal, exactly as you would a story-point total.
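The arithmetic behind those figures is small enough to sketch. The example below assumes the $15 per million rate used above; the story names and the way the backlog is split into individual stories are made up for illustration, and None stands for a story the team has not yet estimated.

```python
from typing import Optional

RATE_PER_MILLION = 15.0  # dollars per one million tokens, the example rate from the text


def cost_usd(tokens: int, rate_per_million: float = RATE_PER_MILLION) -> float:
    """Approximate compute cost in dollars for a token estimate."""
    return tokens * rate_per_million / 1_000_000


def summarize(
    backlog: dict[str, Optional[int]], rate_per_million: float = RATE_PER_MILLION
) -> tuple[int, float, int]:
    """Return (total tokens, approximate dollars, number of unestimated stories)."""
    estimated = [tokens for tokens in backlog.values() if tokens is not None]
    total = sum(estimated)
    return total, cost_usd(total, rate_per_million), len(backlog) - len(estimated)


if __name__ == "__main__":
    # Hypothetical backlog; None marks a story with no agreed estimate yet.
    backlog = {
        "story-a": 2_000_000,
        "story-b": 500_000,
        "story-c": 500_000,
        "story-d": 100_000,
        "story-e": 100_000,
        "story-f": None,
        "story-g": None,
    }
    total, dollars, unestimated = summarize(backlog)
    print(f"{total / 1_000_000:.1f}M tokens, ${dollars:.2f}, {unestimated} unestimated")
```

Keeping unestimated stories out of the sum, and reporting them as a separate count, stops the total from quietly understating the sprint.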

WHEN THIS MAKES SENSE

Token estimation is not a replacement for story points everywhere. It earns its keep when AI agents are genuinely doing a large share of the implementation, when your team wants visibility into compute spend, or simply when a planning session needs a bit of fun. For teams where humans still write most of the code, classic story points remain the better model of effort. Really, estimating in tokens is an experiment in what "effort" means when a model, not a person, does the work. You don't have to believe it replaces points to get something from it: once a sprint's work shows up as a dollar figure, it's easier to ask what's actually worth building.