Managing AI Spend for Startups: A Founder's Playbook for AI Infrastructure Costs

The setup

What changed in the last sixty days

For two years, every big-tech CEO who had something to sell told founders to spend with abandon on AI. The pitch was that tokens were the new electricity, that any company not maxing out consumption was falling behind, and that the right posture was to put a credit card on file and let the agents cook.

In the last sixty days that pitch has started to come apart in public.

Uber's CTO Praveen Neppalli Naga told The Information in April that the company had blown through its entire 2026 AI budget on Claude Code and Cursor in four months. He called it a head-exploding moment. Engineers were reporting personal API costs between $500 and $2,000 a month, the tools were too good to take away, and the math no longer worked at the scale Uber needed.

Uber's COO Andrew Macdonald followed up a few weeks later and said the quiet part out loud. Based on conversations with senior engineering leaders, higher token consumption was not producing a proportional increase in useful consumer features. The bill kept climbing. The output curve flattened. The COO is the person who has to defend that gap to a board, and he said the gap is real.

Microsoft, the other half of this week's story, started canceling internal Claude Code licenses for thousands of engineers in mid-May with a hard cutover to GitHub Copilot CLI by June 30. The official line is "toolchain unification." The reality is that Microsoft's own engineers, by multiple accounts, preferred Claude. The licenses got pulled anyway, the timing lined up with fiscal year-end, and the most honest read is that the unit economics of giving every engineer an unconstrained Claude seat did not survive contact with the bill.

Meta, for its part, had an employee build an internal leaderboard ranking people by tokens processed. The top user reportedly averaged 281 billion tokens. Sam Altman is offering every YC company in the current batch $2 million worth of OpenAI credits, which is being framed as compute-for-equity and is functionally a customer acquisition cost paid in advance to lock startups into a sales funnel they will not feel until they try to leave.

None of this is anti-AI. The tools work. That is the entire point. The tools work so well that engineers cannot stop using them, and the cost of unconstrained use has finally collided with the spreadsheet.

The diagnosis

Tokenmaxxing is a posture, not a strategy

The word doing the work in 2026 is tokenmaxxing, two x's, originally a Silicon Valley joke that has become a serious posture inside companies. At its most defensible, it means a willingness to spend aggressively on AI because the marginal token produces more value than the marginal hire. At its loudest and most expensive, it means defaulting every task to the most capable model, encouraging engineers to compete on consumption volume, and treating spend growth as a marketing line rather than a finance one.

The founders running into trouble are not the ones using a lot of AI. They are the ones who adopted the posture without the instrumentation. A startup CEO reads that Uber's engineers are spending $2,000 a month per person on Claude Code and concludes that not matching that number means falling behind. The CEO sends a Slack message telling the team to stop worrying about cost. Six weeks later finance sends a screenshot of the Anthropic invoice and the room goes quiet.

The herd dynamic is the dangerous part. The CEO of a big-tech vendor stands on stage and says the right answer is more tokens. Five other CEOs of vendors that sell tokens or seats nod along. The founder of a fifteen-person company reads the transcript, infers a consensus from people whose incentives are to sell more of the thing, and budgets accordingly. Nobody in the chain has accountability for the output.

The right question is not "are we spending enough on AI." It is "what are we getting per dollar, and is that number improving."

Tactical

Managing the spend

This section is the part founders should read with the Anthropic and OpenAI bills open in another tab. Six moves, in priority order, that meaningfully cut AI infrastructure cost without making the product or the engineering team worse.

1. Route the right model to the right task

This is the biggest lever and most teams never pull it. Anthropic's pricing as of May 2026 puts a fivefold spread between the cheapest and most expensive model, on both input and output:

Claude API pricing, per million tokens (May 2026)

Model	Input	Output
Haiku 4.5	$1.00	$5.00
Sonnet 4.6	$3.00	$15.00
Opus 4.7	$5.00	$25.00

Every task routed to Opus when Sonnet would have done the job is a 1.67x overspend on both input and output. Every task routed to Sonnet when Haiku would have handled it is a 3x overspend. A workload running entirely on Opus that could have been split 70 percent Haiku, 25 percent Sonnet, 5 percent Opus by token volume will cost roughly three times as much as the routed version (a blended $1.70 per million input tokens against $5.00), with no measurable quality difference on the easy work.
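The blended-price arithmetic behind that comparison can be checked in a few lines. This is a sketch under one assumption not stated in the text: the 70/25/5 split is measured in tokens rather than in tasks. The prices are the per-million figures from the table above; the function and variable names are illustrative.

```python
# Per-million-token prices in USD, copied from the pricing table above.
INPUT_PRICE = {"haiku": 1.00, "sonnet": 3.00, "opus": 5.00}
OUTPUT_PRICE = {"haiku": 5.00, "sonnet": 15.00, "opus": 25.00}


def blended_price(mix, prices):
    """Weighted-average price per million tokens for a traffic mix.

    `mix` maps model name -> share of total tokens; shares must sum to 1.
    """
    if abs(sum(mix.values()) - 1.0) > 1e-9:
        raise ValueError("mix shares must sum to 1")
    return sum(share * prices[model] for model, share in mix.items())


# Assumption: shares are by token volume, not by task count.
routed_mix = {"haiku": 0.70, "sonnet": 0.25, "opus": 0.05}
all_opus = {"opus": 1.0}

routed_in = blended_price(routed_mix, INPUT_PRICE)
opus_in = blended_price(all_opus, INPUT_PRICE)
overspend_in = opus_in / routed_in

routed_out = blended_price(routed_mix, OUTPUT_PRICE)
opus_out = blended_price(all_opus, OUTPUT_PRICE)
overspend_out = opus_out / routed_out

print(f"input:  ${routed_in:.2f} routed vs ${opus_in:.2f} all-Opus ({overspend_in:.2f}x)")
print(f"output: ${routed_out:.2f} routed vs ${opus_out:.2f} all-Opus ({overspend_out:.2f}x)")
```

Because output is priced at five times input on every tier, the multiple is the same on both sides. A mix that sends the long-context work to the cheap tier and only short prompts to Opus would show a larger gap, so it is worth running this with your own token shares.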

The simple rule: Haiku for classification, routing, extraction, and short-context summarization. Sonnet for most reasoning, code edits, and agent tool use. Opus only when the last mile of quality matters and Sonnet has been measured to fall short. Build a router. Make the router observable. Audit it monthly.
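A router in this sense can start as a lookup table with a counter attached. The sketch below is one minimal way to do it, not a reference design: the task-type labels, the tier names, and the `Router` class are all hypothetical, and the Opus allowlist is assumed to be filled in from your own evals.

```python
from collections import Counter
from dataclasses import dataclass, field

# Hypothetical task labels mapped to model tiers, following the rule above.
TIER_BY_TASK = {
    "classification": "haiku",
    "routing": "haiku",
    "extraction": "haiku",
    "short_summary": "haiku",
    "reasoning": "sonnet",
    "code_edit": "sonnet",
    "agent_tool_use": "sonnet",
}


@dataclass
class Router:
    # Task types where Sonnet has been measured to fall short.
    opus_allowlist: set = field(default_factory=set)
    # Count of (task_type, tier) decisions, kept for the audit.
    decisions: Counter = field(default_factory=Counter)

    def pick(self, task_type: str) -> str:
        if task_type in self.opus_allowlist:
            tier = "opus"
        else:
            # Unknown task types default to the middle tier, not the top one.
            tier = TIER_BY_TASK.get(task_type, "sonnet")
        self.decisions[(task_type, tier)] += 1
        return tier

    def audit(self) -> dict:
        """Share of routed calls per tier, for the monthly review."""
        total = sum(self.decisions.values())
        if total == 0:
            return {}
        by_tier = Counter()
        for (_, tier), count in self.decisions.items():
            by_tier[tier] += count
        return {tier: count / total for tier, count in by_tier.items()}
```

The design choice that matters is the allowlist: a task type only reaches Opus after someone has recorded evidence that Sonnet was not good enough, which keeps the expensive tier opt-in rather than the default.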

2. Stack prompt caching and the batch API

Anthropic and OpenAI both offer two cost levers that compound. Cached input on Claude reads at roughly 10 percent of the standard input price, a 90 percent discount on any repeated context like a system prompt, a long document, or a fixed knowledge base. The batch API charges 50 percent of standard pricing for any workload you can tolerate a 24-hour turnaround on, which covers most evaluation, backfill, enrichment, and reporting jobs.

Stacked together on the right workload, the discount math gets ridiculous. A practitioner write-up making the rounds this spring described a document-processing job that dropped from $45 to $4.28 on 5,000 reviews, a reduction of roughly 90 percent, just from combining batch and caching. Another engineer documented a Claude API bill that went from $720 to $72 monthly by turning on caching alone.
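To see how the two discounts compound on your own workload, a rough estimator is enough. This sketch assumes the discounts multiply, uses the 10 percent cached-read and 50 percent batch figures quoted above, and ignores any surcharge for writing to the cache. All three are assumptions to confirm against the vendor's current pricing page, and the function name and parameters are illustrative.

```python
def effective_input_cost(tokens_m, price_per_m, cached_share=0.0, batch=False,
                         cache_read_factor=0.10, batch_factor=0.50):
    """Estimated cost in USD of `tokens_m` million input tokens.

    cached_share: fraction of input tokens read from cache (0.0 to 1.0).
    Assumes cache and batch discounts multiply and ignores cache-write costs.
    """
    if not 0.0 <= cached_share <= 1.0:
        raise ValueError("cached_share must be between 0 and 1")
    fresh = tokens_m * (1 - cached_share) * price_per_m
    cached = tokens_m * cached_share * price_per_m * cache_read_factor
    total = fresh + cached
    return total * batch_factor if batch else total


# Hypothetical job: 100M input tokens at $3.00 per million, 80% of them repeated context.
baseline = effective_input_cost(100, 3.00)
tuned = effective_input_cost(100, 3.00, cached_share=0.8, batch=True)
saving = 1 - tuned / baseline
print(f"${baseline:.2f} -> ${tuned:.2f} ({saving:.0%} lower)")
```

The share of tokens that is actually cacheable drives the result far more than the headline discount does, which is why the tuning week starts with measuring how much of each prompt repeats.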

Most startups have not turned either of these on. The default settings on the SDK assume you want fresh tokens at full price for every call. Treat the first week after onboarding any AI feature as a tuning week. The bill the second week should be a fraction of the bill the first week, or someone did the integration wrong.

3. Audit the per-seat stack the way you audit SaaS

The per-seat AI stack is the other half of the bill, and it is usually less visible than the API line because it gets paid on a corporate card. The current Q2 2026 enterprise rack rates that matter:

AI coding tool per-seat pricing, enterprise (May 2026)

Tool	Price
Claude Code Teams	$25/seat/mo ($20 annual)
GitHub Copilot Enterprise	$39/seat/mo
Cursor Business	$40/seat/mo
Claude Max	$100–$200/mo per heavy user

Anthropic's own enterprise data, surfaced in multiple write-ups, puts the average Claude Code customer at roughly $13 per developer per active day, $150 to $250 per developer per month, with 90 percent of users staying under $30 per active day. That is the benchmark to measure your own stack against. If your team is significantly above it, something is misconfigured or someone is leaving an agent running in a loop at night.
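Checking a team against that benchmark is a short script once seat-level usage is exported. The sketch below assumes a hypothetical export shaped as developer name mapped to a list of daily spend in USD, and uses the $30 per active day figure above as the flag threshold.

```python
def flag_outliers(daily_costs, per_day_limit=30.0):
    """Return developers whose average spend per active day exceeds the limit.

    daily_costs: dict of developer -> list of daily spend in USD.
    A day with zero spend is treated as inactive and excluded from the average.
    """
    flagged = {}
    for dev, costs in daily_costs.items():
        active_days = [c for c in costs if c > 0]
        if not active_days:
            continue
        average = sum(active_days) / len(active_days)
        if average > per_day_limit:
            flagged[dev] = round(average, 2)
    return flagged


# Hypothetical week of usage for three seats.
usage = {
    "ana": [12.0, 0.0, 14.0],
    "bo": [90.0, 0.0, 70.0],
    "cy": [0.0, 0.0, 0.0],
}
print(flag_outliers(usage))
```

A flag is a prompt for a conversation, not a verdict: a high average can be a legitimate heavy workload or an agent left looping overnight, and only the owner of the seat can say which.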

The audit move is mundane: monthly seat reconciliation, real-time cost dashboards by engineer or by team, deactivation of dormant seats, and an explicit owner for the API key bill. None of this is glamorous and all of it works.

4. Build a unit economics model before you scale anything

The FinOps Foundation made the point well this year: AI cost is variable in a way cloud spend never was, because it scales with usage at runtime instead of with infrastructure at deploy time. The implication for founders is that you cannot manage AI cost the way you manage AWS. You have to manage it the way a SaaS CFO manages payment processing fees: as a per-transaction cost that needs to live inside the unit economics.

The number to compute is AI cost per active user, or per transaction, or per ticket resolved, whichever maps to your revenue event. If you charge $20 a month per seat and your AI cost per active user is $1.40, you have a 93 percent gross margin on the AI line and a story to tell. If your AI cost per active user is $9 and you charge $10, you have a problem that compounds with every signup. Most founders cannot tell you which one they are because nobody built the model.

Tag every AI call with a customer ID, a feature ID, and a workload class. Roll it up weekly. Put the result in the same dashboard as MRR. The number that matters is the trend, not the absolute level. A cost per outcome that drops 15 percent quarter over quarter is a learning curve. A cost per outcome that rises 15 percent quarter over quarter is a runaway car.
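The tagging and roll-up described here fit in a small data model. This is a sketch, assuming each call is logged with its cost already computed; the `AICall` record, its field names, and the definition of an active user as any customer with at least one call in the period are illustrative choices, not a standard.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class AICall:
    customer_id: str
    feature_id: str
    workload_class: str
    cost_usd: float


def cost_per_active_user(calls):
    """Total AI cost divided by the number of distinct customers with calls."""
    customers = {call.customer_id for call in calls}
    if not customers:
        return 0.0
    return sum(call.cost_usd for call in calls) / len(customers)


def cost_by_feature(calls):
    """Total AI cost per feature, for the weekly report."""
    totals = defaultdict(float)
    for call in calls:
        totals[call.feature_id] += call.cost_usd
    return dict(totals)


def trend(previous, current):
    """Fractional change between two periods; negative means the cost fell."""
    if previous <= 0:
        raise ValueError("previous must be positive")
    return (current - previous) / previous


# Hypothetical week of tagged calls.
week = [
    AICall("cust_a", "search", "classification", 0.50),
    AICall("cust_a", "chat", "reasoning", 1.50),
    AICall("cust_b", "chat", "reasoning", 1.00),
]
print(cost_per_active_user(week), cost_by_feature(week))
```

Comparing `cost_per_active_user` across consecutive periods with `trend` gives the single number the paragraph above asks for: negative is the learning curve, positive is the runaway car.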

5. Put hard guardrails on the bill before you need them

The Uber story should be a forcing function on every founder reading it: have a kill switch. The minimum guardrail set in 2026:

Hard monthly budget caps per API key, set at the vendor, not in your own code. If you wrote the limiter yourself, the limiter has bugs.
Per-engineer daily soft caps on coding agent seats, with an explicit override path. The cap is not punishment. It is the prompt that makes someone notice when a script is in a loop.
Real-time alerting on spend anomalies at 50 percent, 75 percent, and 100 percent of the monthly cap. Route alerts to a human, not a Slack channel nobody reads.
Weekly cost-by-feature reports distributed automatically to the heads of engineering and finance. If finance only sees the bill at month-end close, the conversation is always reactive.
An owner. One person, named, accountable for the AI cost line. Not the CTO and not the CFO. A specific person whose job includes this number.
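The alerting piece of that list is simple to prototype. The sketch below covers only the threshold logic at 50, 75, and 100 percent of a monthly cap and is meant to sit alongside the vendor-level hard cap, not replace it. The function name and the idea of comparing two consecutive spend readings are assumptions made for illustration.

```python
THRESHOLDS = (0.50, 0.75, 1.00)


def crossed_thresholds(spend_before, spend_now, monthly_cap, thresholds=THRESHOLDS):
    """Thresholds crossed between two spend readings.

    Comparing consecutive readings means each alert fires once, at the moment
    spend passes the line, instead of on every check afterwards.
    """
    if monthly_cap <= 0:
        raise ValueError("monthly_cap must be positive")
    return [t for t in thresholds
            if spend_before < t * monthly_cap <= spend_now]


# Hypothetical readings against a $1,000 monthly cap.
for fraction in crossed_thresholds(400.0, 800.0, 1000.0):
    print(f"ALERT: AI spend passed {fraction:.0%} of the monthly cap")
```

Returning every threshold crossed in one jump matters: a script stuck in a loop can take spend from 40 to 80 percent between two checks, and the person on the receiving end should see both alerts.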

6. Apply the "is this replacing a person" test honestly

The original case for aggressive AI spend was that a $2,000-a-month tool that gave an engineer 5x output was cheaper than hiring a second engineer. That math is real on certain workloads. It is also being applied to a lot of workloads where it does not hold.

A useful audit script: for each meaningful AI line item, write one sentence that completes the phrase "this spend exists instead of." If the sentence is "instead of hiring an engineer to write boilerplate," the spend is probably defensible. If the sentence is "instead of asking the team to think harder before they kick off the agent," the spend is doing the job of clarity, badly. Jackie Lunger at Panorama, a YC-portfolio founder, put it cleanly in a16z speedrun: spending more tokens usually means the clarity of thought was not there when the task started. More tokens, she said, signal less precision.

Strategic

Storytelling the spend

The tactical side keeps the bill from running away. The strategic side decides whether the bill is read as an investment or as a mistake. Founders underweight this. The same number described two different ways lands very differently in a board meeting, and the difference compounds across fundraising rounds.

The three-number narrative every board deck needs

Stop reporting consumption. Report outcomes. The three numbers that turn AI spend into a defensible story:

AI cost per active user, per transaction, or per ticket resolved. Pick the unit that maps to revenue. Show the absolute number and the trend over the last six to eight weeks. Down and to the right is the story. Flat is acceptable if volume is growing fast. Up means you owe the room a diagnosis.
AI gross margin contribution. This is the AI-driven portion of revenue minus the AI cost to deliver it. Put it on the same chart as your overall gross margin. If the AI margin is higher than the company margin, AI is pulling the business toward better economics. If it is lower, you are subsidizing AI features with healthier ones and should know that.
Headcount avoided, output multiplied, or revenue enabled. The third number is the storytelling one. It is the answer to "what did the spend buy us that we could not have bought with people." Be specific. "We ship 2.4x more product per engineer than the same team did in Q1 2025" is a sentence. "We are leveraging AI to accelerate velocity" is not.
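The second number reduces to a two-line formula, shown here as a sketch. How revenue gets attributed to AI features is left to you, and the figures in the example are hypothetical.

```python
def margin(revenue, cost):
    """Gross margin as a fraction of revenue."""
    if revenue <= 0:
        raise ValueError("revenue must be positive")
    return (revenue - cost) / revenue


def ai_margin_report(ai_revenue, ai_cost, total_revenue, total_cost_of_revenue):
    """Compare the margin on AI-driven revenue with the company-wide margin."""
    ai = margin(ai_revenue, ai_cost)
    company = margin(total_revenue, total_cost_of_revenue)
    return {
        "ai_margin": ai,
        "company_margin": company,
        "ai_pulls_margin_up": ai > company,
    }


# Hypothetical quarter: $40k of AI-attributed revenue costing $6k to serve,
# inside a business with $200k revenue and $50k total cost of revenue.
report = ai_margin_report(40_000, 6_000, 200_000, 50_000)
print(report)
```

When `ai_pulls_margin_up` is false, the AI features are being subsidized by the rest of the product, which is the case the board needs to hear about first.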

Investors and boards are not against AI spend. They are against AI spend they cannot evaluate. Three numbers, presented every cycle, give them an evaluation framework and give you a feedback loop.

The script for walking back tokenmaxxing without losing face

A material number of founders are about to have a conversation with their board where the AI bill is on the table and the easy positioning of the last twelve months no longer holds. The instinct is to defend. The better move is to lead with the diagnosis and the plan.

A version that works, lifted from how the most credible operators are framing this now:

Our AI spend ran ahead of our measurement infrastructure last quarter. We had the right instinct and the wrong instrumentation. Here is what we found when we tagged it, here is what we are routing differently, here is the per-unit cost we expect to land at by the end of Q3, and here is the trigger that gets us to rethink the budget again.

That paragraph does four things at once. It owns the gap without panicking about it. It signals that the team has the discipline to look at the numbers. It commits to a specific future number with a specific date. And it tells the board what would cause another reset, which is the part that makes the commitment credible. The boards that respond badly to that script are the boards you wanted to know about anyway.

Reframe consumption as investment, where it is one

Not every AI line item is operating expense. Some of it is buying a learning curve that compounds. The narrative move is to separate them on the page.

Run-rate AI cost belongs in the gross margin conversation, line-itemized like AWS or Stripe fees. Experimentation spend, the cost of running an evaluation harness, training a router, building a guardrail, exploring a new workload, belongs in the R&D conversation. Founders who let the two get blended end up with one big scary number that nobody can defend. Founders who separate them get to talk about a defensible cost of revenue and a separate, smaller, time-boxed investment in capability.

When the experimentation line drops to zero, that is a signal the team has stopped learning, not a win. The most credible reporting cadence treats the experimentation budget as a deliberate quarterly allocation with an explicit decision at the end of each cycle: graduate the workload into production, kill it, or fund another quarter.

What to say about the news

When a board member or an investor brings up the Uber or Microsoft story, they are usually testing one of two things. Either they want to know if you are paying attention, or they want to know if you have the discipline to avoid the same trap. Both readings reward a specific answer over a defensive one.

A working response template:

We watched the Uber number with interest. Our equivalent figure is X dollars per active user per month, down from Y dollars in Q1, driven by Z change in model routing and caching. Our hard ceiling for this category is N percent of monthly revenue, and we have the kill switch wired in at M. The trigger that gets us to revisit the ceiling is a specific, named milestone.

That answer ends the conversation in the right way. It demonstrates instrumentation, commits to a constraint, and shows where the constraint would loosen. The wrong answer is to either dismiss the news or to overcorrect into austerity theater. Both signal that the founder is reacting to narrative instead of running a system.

Decide what to publish and what to keep in

Some founders are starting to publish their AI cost numbers publicly as a recruiting and credibility move. It works for a narrow set of companies, mostly AI-native startups whose product is the workload. For most post-raise companies, the right posture is to keep absolute numbers private and to publish frameworks, principles, and learnings. The Uber and Microsoft stories are useful precisely because they are specific. Your contribution to the public conversation does not need to be a dollar figure to be valuable; it needs to be a sentence that other founders can use.

Side-by-side

Two postures, two outcomes

Same tools. Same models. Same vendors. The difference between a company that compounds on AI and a company that gets compounded by it is the operating posture, not the technology stack.

Operating decision	Tokenmaxxing posture	Disciplined posture
Headline metric	Tokens consumed per engineer per month	AI cost per active user, with a trend line
Default model	The most capable available	The cheapest model that passes the eval
Caching and batching	Off, or on for some workloads no one audits	On by default, audited weekly
Budget controls	Soft guidance from the CEO	Hard vendor-level caps with named owners
Board narrative	"We are investing aggressively in AI."	Three numbers, with a unit, a trend, and a trigger
What happens at fiscal year-end	Surprise invoice, license cuts, public retreat	Same plan, recalibrated to the new pricing

Where Headroom fits

One operator. Built for handoff. Powered by the same tools the herd is paying full price for.

Headroom builds the financial close, payroll, spend controls, accounting systems, HR ops, and compliance infrastructure that post-raise companies need after a round. The work is fixed-scope, time-boxed, and ends with a clean handoff to your team. There is no retainer and no ongoing pod.

Part of what makes that model work is that the back office, including the reporting layer that surfaces numbers like AI cost per active user, is built with the same discipline this article describes. Right model, right task. Caching on. Hard caps wired in. Three numbers on the dashboard, not thirty. The boring posture, applied to the new line item.

The salary burn is the obvious cost. The real cost is your attention. The same is true for AI spend. The dollar figure is the part everyone fixates on. The attention drain of a founder who has to defend an uninstrumented number to a board is the part that actually slows the company down.

Founder FAQ

Questions we hear

What is tokenmaxxing?

Tokenmaxxing is the practice of maximizing AI token consumption inside a company and using that consumption as a proxy for productivity or AI adoption. In its loudest form it means defaulting every task to the most capable and expensive model, encouraging engineers to compete on token volume, and treating consumption growth as the headline metric. The term went mainstream in 2026 after Uber's CTO told The Information the company had burned its entire 2026 AI budget on Claude Code and Cursor in four months.

How much should a seed or Series A startup spend on AI infrastructure?

There is no universal answer, but the working benchmark for engineering tools in 2026 is roughly $150 to $250 per developer per month, based on Anthropic's published Claude Code enterprise data. Product API spend is a separate line and should be modeled per active user or per transaction. A useful default is to cap combined AI spend at 20 percent of monthly engineering payroll until you have a unit economics model that justifies more.

What is the fastest way to cut an AI bill without hurting product quality?

Three moves stack: route the right model to the right task (Haiku for classification and routing, Sonnet for most reasoning, Opus only for the hard last mile), turn on prompt caching for any repeated system prompt or document context (about 90 percent off cached input), and move offline workloads to the batch API (a flat 50 percent discount). In combination these can cut spend by 90 to 95 percent on the workloads where they apply.

How do I explain AI spend to investors and the board?

Stop reporting consumption. Report three numbers: AI cost per active user (or per transaction), AI gross margin contribution, and headcount avoided or output multiplied. If the cost per outcome is going down quarter over quarter, you are running a learning curve and the spend is an investment. If it is going up, you are tokenmaxxing and need to fix that before the next deck.

Is tokenmaxxing always a bad idea?

No. There are workloads where spraying tokens at a problem is the right answer, like overnight security agents attacking your own architecture, or running thousands of variations against a strong test suite. The problem is not high token use. The problem is high token use with no thesis, no measurement, and no accountability for the outcome. Disciplined AI use can be very expensive. Tokenmaxxing as a posture is something else.

What should I do if my AI spend is already out of control?

Two-week sprint. Week one: tag every API call by feature, customer, and workload class; pull seat-level usage from every vendor; identify the top three line items by absolute dollars. Week two: route the cheapest competent model to the heavy workloads, enable caching and batching where applicable, set hard caps at the vendor level, and stand up a weekly report. Most teams cut 40 to 60 percent of the bill in those two weeks without any product change.

Get in touch

Tell us where things stand

If your AI bill has gotten ahead of your reporting, or your back office is still the founder and a spreadsheet, email [email protected] or use the form on the home page. We'll follow up within a couple of days.