
AI
The AI margin cliff
Anh-Tho Chuong • 4 min read
Jun 19
/4 min read
For twenty years, SaaS ran on one reflex: love your power users.
The marginal cost of serving one more was close to zero. A heavy user cost about the same as a light one and paid you more. Engagement and gross margin pointed the same way. Every retention model, every expansion motion, every land-and-expand deck assumed it.
AI breaks the assumption.
In an AI product the marginal cost of a heavy user is real, variable, and it grows with engagement. The customer who runs the most agents, writes the longest outputs, and routes the hardest problems to your best model is creating the most value and burning the most provider cost. Same contract. Same subscription. Different gross margin.
That gap has a name now. The AI margin cliff: the point where a customer's usage cost outgrows the value of their contract, and your best account quietly becomes your worst-margin one.
The obvious rebuttal: models keep getting cheaper. Per-token prices have fallen hard, year over year. Costs should be going down.
Per token, they are. The bill is going up anyway.
William Stanley Jevons noticed this in 1865. As steam engines got more coal-efficient, England burned more coal, not less. Cheaper-per-use made coal worth using for far more things, so total consumption climbed. Efficiency expanded demand instead of shrinking the bill.
Tokens are coal. Cheaper ones don't make teams spend less. They make heavier usage economical, so teams do far more of it. Reasoning models emit ten to fifty times more tokens per answer. Agents loop where a single call used to do. Capability unlocks workloads that didn't exist: the moment a model is good enough to run a multi-step agent, customers start running multi-step agents. The unit got cheaper. The number of units exploded.
Token pricing reads as linear. It isn't, and four things move your cost independently of what the customer pays you.
Output dominates. Most providers charge several times more for output than input. A model that thinks in a long hidden chain before answering can multiply the cost of one request while the user experience barely changes.
Model routing is the steepest lever. Move a workflow from a base model to a premium one and the per-token rate jumps several times over. Product ships the upgrade because it's better. Nobody re-prices the plan.
Tool calls compound. An agent loop re-sends its growing context on every step. Five tool calls is six round trips, each carrying more input than the last.
Retries are invisible. A failed attempt still costs a full call. A ten percent retry rate is a ten percent COGS line nobody put on a slide.
We built a stress test so you can put your own numbers in and watch where the margin breaks:
The problem is not that cost always rises. Costs can fall. A model gets cheaper faster than usage grows for a quarter, and your margin widens.
That is not safe either. If cost drops and your price is fixed, you are overcharging, blind to your real margin, and exposed to a competitor who passes the savings through.
The problem is not direction. It's decoupling. Your price is frozen. Your cost floats. The gap between them is your margin, and you are not watching it move.
The next reflex: this is solved, add usage-based pricing, meter the tokens.
Necessary. Not sufficient.
Metering tokens tells you how many tokens moved. It doesn't tell you what they cost you. The cost of a token depends on the model that produced it, whether it was input, output, or cached, and how many loops the agent took to get there. Bill a flat rate per million tokens and you've frozen a price while your cost floats underneath it. The next model release moves your cost and not your price. You're back on the cliff with extra steps.
This is the difference between token mode and price mode. Token mode meters what was consumed. Price mode meters what it cost you.
The honest objection to billing on real cost: it makes the customer's bill unpredictable, and buyers hate a bill that moves.
It doesn't have to. What you bill on and what the customer sees are two different things. You still design the experience: a flat tier, an allowance, a cap, prepaid credits. Predictability for the customer is a choice you keep.
What changes is that you make that choice knowing your cost. Someone absorbs the variance either way. Give the customer a hard cap and you carry the tail above it. Pass it through and they carry it. The variance doesn't disappear. It stops being a surprise on your own P&L after a model upgrade. And it doesn't have to be free.
This is what the Lago Agent SDK now does. It reads the real dollar cost of every LLM call from live provider prices, OpenRouter and AWS Bedrock, with no keys to manage. It applies your markup and turns the result into billable usage. Token mode if you want to price the dimensions yourself. Price mode if you want cost-plus with the margin built in.
On top of that sits the policy: markup, margin floors, included allowances, overages, credits, per-customer discounts. The customer sees one clean charge. Your cost structure and token math stay on your side. It's open-source, so you can read exactly how cost becomes a billable amount instead of trusting a black box.
Three ways off the cliff, whichever stack you run.
Separate premium usage from the base plan. Reasoning-model traffic is a different cost class. Meter it on its own and price it through, so the customers who route to the expensive models are the ones who pay for them.
Set a margin floor, not a price. Fix the margin and let the billable amount track cost upward, so your worst case is a known number instead of a surprise.
Bill cost-plus on real spend. Turn the actual dollar cost of each call into a billable unit, and keep the customer-specific policy in the billing layer instead of a spreadsheet someone updates after every release.
Cheaper tokens were supposed to be good news.
They are, if you can see what they cost you.
Jevons would have told you to check the bill.