Tokenization and model routing: when “dumb” LLMs beat frontier models—and when they don’t
How tokenization shapes bills, why small fast models work for formatting and classification, and when you still need frontier reasoning—plus how to prove routing decisions in production.
#Why tokenization is still the first line item
Every production LLM bill eventually decomposes into tokens: chunks of text (often subwords) that encoders feed to the model. Whether you buy by input, output, cached context, or provider-specific discounts, tokenization is the shared ruler—misunderstanding it means mispricing entire workflows.
For a grounded definition of subword tokenization and why it dominates modern LLMs, see Wikimedia’s overview of byte-pair encoding and related methods. Provider tooling such as OpenAI’s tokenizer guide remains the practical way to sanity-check prompt length before you ship.
Frontier models are dazzling on benchmarks, but most agent steps are not benchmark moments. A huge fraction of production calls are schema shaping, summarization for humans, entity extraction, routing decisions, and guardrail checks—tasks where latency and per-token cost matter more than marginal reasoning depth.
#The honest split: cheap cognition vs high-stakes cognition
Good fits for smaller or “dumber” models (relative to your stack’s flagship):
- Structured transforms: JSON repair, field normalization, turning bullets into tables.
- Binary or low-arity classification: “Does this ticket belong to billing?” with a confidence score.
- Retrieval hygiene: rewriting a query, stripping PII categories you’ve already classified.
- First-pass drafting before a stronger model or a human edits.
Reserve frontier models for steps where errors compound: multi-hop reasoning, ambiguous policy interpretation, safety-critical summarization of medical or legal text, or any step where a wrong output propagates into customer-visible artifacts without a gate.
Routing is not laziness—it is capital allocation. The trick is making routing auditable: if finance asks why April’s bill spiked, “we used GPT-4 everywhere” is not an answer; “step 7 used model family X because policy P fired” is.
#How LLM Workbench makes routing defensible
LLM Workbench centers run bundles and typed trace facts—not just dashboards—so you can show which step invoked which model, with model_io receipts alongside the DAG. That is the difference between “we think we routed cheaply” and “here is the frozen graph, the policy checkpoint, and the exact token-bearing call.”
If you are new to the vocabulary—model_io, gates, workflow snapshots—start with What LLM Workbench solves in production LLM stacks and the protocol overview. Teams already on the Vercel AI SDK can wrap generateText-style calls with @llm-workbench/ai-sdk so those receipts land without rewriting orchestration (see also Who benefits—and how to start).
#Anti-patterns that erase your savings
- Same prompt, stronger model “just in case.” You pay frontier rates for formatting.
- Routing in application code only. Without trace linkage, nobody can reconstruct why a cheap path was skipped during an incident.
- Ignoring cache and KV patterns. Provider docs (for example Anthropic’s prompt caching documentation) are evolving quickly; your observability layer should capture what actually hit cache, not what you assumed.
#Next step: prove it in one workflow
Pick a single DAG with at least three LLM steps. Route the obviously mechanical step to a compact model, keep the reasoning step on your flagship, and ensure every call emits model_io tied to step ids. Export a bundle, diff two runs, and ask whether a skeptical finance partner could follow the story without Slack archaeology.
For mission-level context on why replay beats folklore, read Why LLM Workbench exists—and subscribe via /feed.xml if you want protocol- and economics-oriented posts as we publish them.