Tokenization and model routing: when “dumb” LLMs beat frontier models—and when they don’t

#Why tokenization is still the first line item

Every production LLM bill eventually decomposes into tokens: chunks of text (often subwords) that encoders feed to the model. Whether you buy by input, output, cached context, or provider-specific discounts, tokenization is the shared ruler—misunderstanding it means mispricing entire workflows.

For a grounded definition of subword tokenization and why it dominates modern LLMs, see Wikimedia’s overview of byte-pair encoding and related methods. Provider tooling such as OpenAI’s tokenizer guide remains the practical way to sanity-check prompt length before you ship.

Frontier models are dazzling on benchmarks, but most agent steps are not benchmark moments. A huge fraction of production calls are schema shaping, summarization for humans, entity extraction, routing decisions, and guardrail checks—tasks where latency and per-token cost matter more than marginal reasoning depth.

#The honest split: cheap cognition vs high-stakes cognition

Good fits for smaller or “dumber” models (relative to your stack’s flagship):

Structured transforms: JSON repair, field normalization, turning bullets into tables.
Binary or low-arity classification: “Does this ticket belong to billing?” with a confidence score.
Retrieval hygiene: rewriting a query, stripping PII categories you’ve already classified.
First-pass drafting before a stronger model or a human edits.

Reserve frontier models for steps where errors compound: multi-hop reasoning, ambiguous policy interpretation, safety-critical summarization of medical or legal text, or any step where a wrong output propagates into customer-visible artifacts without a gate.

Routing is not laziness—it is capital allocation. The trick is making routing auditable: if finance asks why April’s bill spiked, “we used GPT-4 everywhere” is not an answer; “step 7 used model family X because policy P fired” is.

#How LLM Workbench makes routing defensible

LLM Workbench centers run bundles and typed trace facts—not just dashboards—so you can show which step invoked which model, with model_io receipts alongside the DAG. That is the difference between “we think we routed cheaply” and “here is the frozen graph, the policy checkpoint, and the exact token-bearing call.”

If you are new to the vocabulary—model_io, gates, workflow snapshots—start with What LLM Workbench solves in production LLM stacks and the protocol overview. Teams already on the Vercel AI SDK can wrap generateText-style calls with @llm-workbench/ai-sdk so those receipts land without rewriting orchestration (see also Who benefits—and how to start).

#Anti-patterns that erase your savings

Same prompt, stronger model “just in case.” You pay frontier rates for formatting.
Routing in application code only. Without trace linkage, nobody can reconstruct why a cheap path was skipped during an incident.
Ignoring cache and KV patterns. Provider docs (for example Anthropic’s prompt caching documentation) are evolving quickly; your observability layer should capture what actually hit cache, not what you assumed.

#Next step: prove it in one workflow

Pick a single DAG with at least three LLM steps. Route the obviously mechanical step to a compact model, keep the reasoning step on your flagship, and ensure every call emits model_io tied to step ids. Export a bundle, diff two runs, and ask whether a skeptical finance partner could follow the story without Slack archaeology.

For mission-level context on why replay beats folklore, read Why LLM Workbench exists—and subscribe via /feed.xml if you want protocol- and economics-oriented posts as we publish them.