This post considers 3 ways of using extremely cheap and fast Cerebras-style smaller-model “subagents” to generate output tokens significantly more cheaply and quickly than the primary model can.
Utilize a small model as a “CPU branch predictor” to predict which terminal command or tool a coding model will use next
Output tokens are an astonishing 3-8x more expensive than input tokens. What if we could use a small model to provide a set of predicted outputs for a large model to approve, replacing 10-50 large-model output tokens with 30-150 extremely cheap small-model prediction tokens and 1-2 large-model approval tokens?

A prime use case would be predicting the next command a coding agent will execute. During an agentic conversation, terminal commands are extremely repetitive and predictable, as they often involve exploring files and systems over and over. That makes them prime candidates for a lower-intelligence model to correctly predict. Instead of the coding model typing the command itself, it calls a predictCmd() tool that sends the conversation history to a smaller model like Llama 3.1 8B running on Cerebras, which generates 1-3 predictions of the terminal command the coding agent might want to execute next. The coding agent then responds with a single-token number indicating which choice it would like to execute, or it can reject the suggestions and output the command itself. Just like a CPU’s branch predictor, if the small model can achieve a high enough “hit rate” (rate of accepted predictions) to beat the cost of the primary model doing it alone, we save on token costs and potentially generate commands faster.

You might be concerned that these extra 30-150 tokens get sent to the parent model. That’s true, but those are input tokens, which are 3-8x cheaper than output tokens. As long as the cost of those added input tokens is less than what the primary model would have spent generating the command itself, you come out ahead. There could even be measurable improvements in the primary model’s performance if it has a set of options to choose from rather than writing the answer itself – just like how humans are better at multiple-choice questions than fill-in-the-blank.
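The break-even point in that cost argument can be written down directly. A short sketch, where the function name and the specific per-token prices are mine and purely illustrative, not from any provider's price sheet:

```python
def breakeven_hit_rate(cmd_tokens, pred_tokens, approval_tokens,
                       in_price, out_price, small_price):
    """Minimum hit rate at which predicting commands saves money.

    Baseline: the primary model emits cmd_tokens output tokens itself.
    With prediction: the small model emits pred_tokens (at small_price),
    those same tokens enter the primary model's context as *input*
    tokens, and the primary model emits approval_tokens output tokens.
    On a miss, the primary model still writes the full command.
    """
    baseline = cmd_tokens * out_price
    overhead = (pred_tokens * small_price       # small model generates predictions
                + pred_tokens * in_price        # predictions enter primary context
                + approval_tokens * out_price)  # primary's 1-2 approval tokens
    # Expected cost = overhead + (1 - hit_rate) * baseline;
    # setting that equal to baseline and solving for hit_rate gives:
    return overhead / baseline

# E.g. a 30-token command, 90 prediction tokens, 2 approval tokens,
# $3/M input, $15/M output, $0.10/M for the small model (made-up prices):
# breakeven_hit_rate(30, 90, 2, 3, 15, 0.10) ≈ 0.69
```

With these made-up numbers the predictor has to be right roughly 69% of the time before it pays for itself; longer commands or shorter predictions lower that bar.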
This could be extended to predicting other types of tool calls like those from MCP servers too.
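A minimal sketch of the selection loop described above, with both model calls stubbed out as injected callables (predictCmd's internals, and the `small_predict` / `primary_choose` names, are hypothetical stand-ins, not a real framework API):

```python
def predict_cmd(history, small_predict, primary_choose):
    """Small model proposes 1-3 commands; the primary model answers with
    a single-token choice number, or rejects by writing its own command."""
    candidates = small_predict(history)
    choice = primary_choose(history, candidates).strip()
    valid = {str(i + 1) for i in range(len(candidates))}
    if choice in valid:                  # hit: one cheap approval token
        return candidates[int(choice) - 1]
    return choice                        # miss: the primary typed it itself

# Demo with trivial stubs in place of real model calls:
hit = predict_cmd(
    history=["$ git add ."],
    small_predict=lambda h: ["ls -la", "git status"],
    primary_choose=lambda h, c: "2",
)
miss = predict_cmd(
    history=["$ git add ."],
    small_predict=lambda h: ["ls -la", "git status"],
    primary_choose=lambda h, c: "git commit -m 'wip'",
)
```

The same shape works for any tool call, not just terminal commands: swap the candidate list for predicted MCP tool invocations.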
Utilize a small model to summarize terminal outputs
This is likely already implemented by many agentic coding frameworks, but the idea is: pass the terminal output of a command (and perhaps the relevant parts of the conversation history) to a Cerebras-hosted model, and have it extract what is important – the result – to send back to the primary model. This has the potential to significantly cut input token costs and context lengths, as some command outputs, even when trimmed or filtered by the primary agent, still contain a lot of unnecessary context / tokens.
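As a sketch, the routing is just a prompt wrapper around a small-model call. The `summarize` callable is injected here (with a trivial stub) so the example runs without any API; the interface is hypothetical, not a real framework's:

```python
def summarize_terminal_output(command, output, summarize):
    """Route a (possibly huge) terminal output through a small model
    before it reaches the primary model's context."""
    prompt = (f"Command: {command}\n"
              f"Output:\n{output}\n"
              "Extract only what matters for the coding agent.")
    return summarize(prompt)

# Demo stub in place of a real small-model call: keep error lines,
# or fall back to the last line of output.
def stub_summarize(prompt):
    body = prompt.split("Output:\n", 1)[1].rsplit("\nExtract", 1)[0]
    lines = body.splitlines()
    keep = [l for l in lines if "error" in l.lower()] or lines[-1:]
    return "\n".join(keep)

summary = summarize_terminal_output(
    "make",
    "building...\nstep 1 ok\nERROR: missing dep foo\ndone",
    stub_summarize,
)
```

Only `summary` (one line here instead of four) ever reaches the primary model as input tokens.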
Utilize a small model to write code instead of the larger one
Code is extremely token-heavy and likely constitutes 5-30% of a coding model’s generated output tokens (remember, these are expensive!). Instead of having the larger model write all of the code tokens, what if the larger model defined what code needs to be edited and described the edits in less token-heavy language (or barely any at all)? The functionality of code can often be described with far fewer English tokens, logic statements, pseudocode, or even a different, more token-efficient programming language than the one the agent is working in. The small model then acts as a translator, simply turning the specification the parent model generated into an actual code edit.

This saves considerable output token cost, and the resulting code is forwarded back to the primary model as cheaper input tokens – not output tokens. Additionally, the code can be written significantly faster (potentially 1000+ TPS). The full conversation history likely doesn’t even need to be provided: just a file and instructions about what needs to be changed. Here we’re leveraging the fact that certain types of tokens (like English) can be much more information-dense than what the model actually needs to output. The smaller model acts kind of like a compiler. The parent and child models could even be trained as a pair, where the parent model wouldn’t need to describe the edits at all; the child model could infer what needs to change directly from the parent model’s thinking.
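A toy version of that spec-to-code split, with the small model stubbed out as a lambda. The spec format and `translate` interface are hypothetical; the point is just that the primary model only pays output-token cost for the short English instruction, not the code:

```python
def translate(spec, file_contents, small_model):
    """Small model turns a terse English spec into the actual code edit."""
    return small_model(spec, file_contents)

# The primary model's entire "output" is this compact spec:
spec = {
    "file": "utils.py",
    "instruction": "guard against `user` being None in get_name",
}

# Stub standing in for a fast small model (e.g. Llama 3.1 8B on Cerebras):
edited = translate(
    spec,
    "def get_name(user):\n    return user.name\n",
    lambda s, f: ("def get_name(user):\n"
                  "    if user is None:\n"
                  "        return None\n"
                  "    return user.name\n"),
)
# The edited code then flows back to the primary model as input tokens,
# which are 3-8x cheaper than the output tokens it would have spent.
```

Even in this tiny example the instruction is shorter than the code it produces; the gap widens as edits grow.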
Agents offloading their thinking
All 3 of these ideas explore the primary agent “offloading its thinking” to smaller, faster models. This parallels how human software engineers are themselves offloading thinking and token generation to models right now. Just as a human engineer is offered tab complete when coding, we’re offering tab completion and compression to the models themselves. Essentially, we’re establishing an intelligence hierarchy: delegating each task to the most efficient intelligent system for it, all working together toward a larger common goal. Obviously this paradigm has already been explored with subagents, and the ideas here act like subagents, but I’m sure we could push further and keep finding efficiency gains like these.
If you’d like to discuss agentic coding ideas with me like this one, I’m available at [email protected].