Cut Your LLM Bill Without Cutting Quality
A practical, ordered playbook for lowering production LLM costs. Prompt hygiene, caching, model routing, retrieval, budgets, and the measurement that ties it all together.
Most LLM bills are not expensive because the work is hard. They are expensive because nobody looked. Tokens leak in a hundred small ways: a system prompt that grew over six months, a whole knowledge base shoved into every call, the biggest model answering "yes" to a yes-or-no question. The good news is that the same sloppiness that runs up the bill is usually easy to find and cheap to fix.
This is the order we use when a client asks us to bring their inference spend down without breaking the product. Start at the top. Each step pays for the next.
Start With Prompt And Context Hygiene
Read your actual prompts before you change anything. Pull ten real requests from production and look at what you are sending. Most teams are surprised. The system prompt has accumulated instructions nobody remembers writing. The same boilerplate ships on every call. Worse, the whole knowledge base, or a fat chunk of it, gets pasted into context whether the question needs it or not.
You pay for every token in, on every single call. A bloated context is a tax you pay forever.
Concrete things to cut:
- Long preambles and example sets that no longer change the output. Test by removing them and checking quality.
- The "just in case" documents stuffed into context that are irrelevant to most questions.
- Repeated instructions that say the same thing three ways.
- Verbose output formats when a short one would do. You pay for output tokens too, and they usually cost more than input.
This step alone often takes a meaningful slice off the bill, and it requires no new infrastructure. It is the highest return work you will do.
Cache What You Can
Once your prompts are lean, stop paying for the same tokens twice.
Prompt caching covers the parts that do not change. Your system prompt, your tool definitions, and any fixed instructions are identical on every call. Most providers let you mark that stable prefix as cached, so you pay full price once and a steep discount after that. For chat-style features with a long fixed setup, the savings add up fast. Put the stable content at the front and the variable content at the back, since caching works on a shared prefix.
Semantic caching covers repeated answers. Users ask the same things in different words. If you store past answers and match new questions by meaning, not exact text, you can serve a saved response and skip the model entirely. This helps a lot, and it is also the place to be careful. Set a similarity threshold high enough that "how do I cancel" and "how do I upgrade" never collide, and do not cache anything that depends on a specific user, the time, or live data. Cache the stable, factual stuff. Leave the personal stuff alone.
Route To The Right Model
You do not need your strongest model for most turns. You need it for the hard ones.
Send easy turns to a small, cheap model. Classification, short rewrites, simple extraction, routine chat: a smaller model handles these well, often at a fraction of the cost per token. Reserve the large model for reasoning, long synthesis, or anything where a wrong answer is costly.
A simple router goes a long way:
- Use a small model by default.
- Escalate to the large model on signals like length, low confidence, or a failed validation check.
- Let the small model attempt first and retry on the large one only when the output fails a quick test.
Be honest about the tradeoff. Routing adds a moving part, and a bad routing rule sends hard questions to a weak model and produces quiet quality drops. Log every route decision so you can audit it. Done well, this is frequently the single biggest lever after context hygiene.
Use Retrieval To Shrink Context
Retrieval is often sold as an accuracy feature. It is also a cost feature.
Fetch the few passages that matter instead of shipping everything. If you index your knowledge base and pull the top handful of relevant chunks per question, your context shrinks from thousands of tokens to a few hundred, and the answers usually get better because the model is not distracted by noise. You trade a little retrieval infrastructure for a large and permanent drop in input tokens. For any feature grounded in a document set, this pairs naturally with the hygiene work above.
Batch, Stream, And Set Hard Budgets
A few mechanical habits keep spend predictable.
Batch the work that is not interactive. Overnight summaries, bulk tagging, and offline evaluations do not need an instant reply. Many providers offer a batch tier at a real discount for jobs you can wait on. Move every non-urgent job there.
Stream user-facing responses. Streaming does not lower token cost, but it lets you cap output mid-flight and lets users stop a wrong answer early, which avoids paying for tokens nobody wanted. It also makes a cheaper, slower model feel fast enough to use.
Set token budgets and per-feature cost ceilings. Cap max output tokens per call so one runaway response cannot balloon. Give each feature a cost ceiling and alert when it drifts. Budgets turn a surprise invoice into an early warning.
Measure Cost Per Request And Per User
None of the above sticks without numbers.
Track cost per request and cost per active user, broken down by feature. Once you can see which feature and which cohort drive spend, the next fix becomes obvious instead of a guess. Watch for the small group of heavy users who quietly consume most of the budget, and decide on purpose whether that is fine, worth a limit, or worth a pricing change. Measurement is also how you prove a change helped rather than just moved cost somewhere else.
A sensible order: clean the prompts, cache the stable parts, route by difficulty, add retrieval, then batch, stream, and cap. Measure throughout so each step is honest.
If you are running LLM features in production and the bill is growing faster than the value, we are happy to help. At 1 Degree Solutions we build and ship custom Alexa skills and AI products, and a lot of that work is making them fast, accurate, and affordable to run.
More on ai
What AI Agents Can Actually Do for Your Business in 2026
AI agents are real, useful, and easy to oversell. Here is a plain-English look at what they do well today, where they still need a human, and how to start.
Do You Actually Need Custom AI, or Is an Off-the-Shelf Tool Enough?
A straight, vendor-neutral answer to the question every founder is asking right now. Most teams do not need custom AI yet. Here is how to tell when you do.
Building Neutral AI: How We Ship Production Systems Without the Hype
Hype-free engineering principles for AI products that serve users, not nudge them. Grounding, refusal, evals, cost-bounding, the boring decisions that actually ship.