Everyone wants more and cheaper LLMs. To get there, model inference providers should offer pricing closer to the actual cost of running the LLM, allowing for more competitive pricing and letting engineers better optimize how and when they use LLMs. I propose two ideas: token prices should better reflect their quadratic scaling cost, and LLM usage should be priced based on the cost of electricity (which usually depends on the time of day).
Note that this is a hypothesis based mostly on my existing knowledge of energy markets, and more evidence-based research is needed on the actual costs of running an LLM (these costs are difficult to measure and find, and vary greatly with the deployed infrastructure). Also, if inference companies could make a buck with better pricing, there’s a pretty good chance they would have done it by now. More likely, as LLM usage continues to grow and pricing at economies of scale becomes more relevant, these ideas will come to fruition.
Token pricing
Right now, model inference providers offer pricing tiers at different input/output token lengths. For example, Claude Sonnet 4.5 costs $3 per million input tokens for requests under 200k tokens, but $6 per million for requests over 200k. This is because the compute required scales quadratically with the number of tokens.
I think inference providers should take this a step further and establish even more fine-grained pricing: more pricing bins, or even a true quadratic-scaling pricing formula.
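As a sketch, a quadratic-scaling formula could make the effective per-million-token rate grow linearly with context length, so total cost grows quadratically, roughly mirroring attention compute. The base rate and reference length below are illustrative assumptions, not any provider's actual pricing:

```python
BASE_RATE = 3.00         # assumed $/million tokens at short contexts
REFERENCE_LEN = 200_000  # assumed context length where the rate doubles

def input_price_usd(n_tokens: int) -> float:
    """Total input-token price under a hypothetical quadratic formula.

    The effective $/Mtok rate rises linearly with context length,
    so the total cost scales quadratically with the number of tokens.
    """
    rate_per_mtok = BASE_RATE * (1 + n_tokens / REFERENCE_LEN)
    return rate_per_mtok * n_tokens / 1_000_000
```

At 100k tokens this charges an effective $4.50/Mtok, and at 200k tokens $6/Mtok, recovering today's two-tier prices as points on a smooth curve.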
This would help engineers better design agent workflows. Current coding agents take in a significant amount of prior conversation history, much of which is irrelevant to the question the user is currently asking. For example, if a user has an ongoing chat adding multiple features to a codebase across multiple prompts, and those prompts are not relevant to each other, the conversation history will carry many unneeded tokens and the context window will often fill to the max, driving up LLM runtime costs. If coding agents could summarize what is important for answering a prompt into a smaller set of tokens, engineers would save on input token costs when running them.
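To make the savings concrete, here is a toy calculation at an assumed flat $3 per million input tokens; the token counts are invented for illustration:

```python
RATE_PER_TOKEN = 3.00 / 1_000_000  # assumed $3/Mtok input pricing

full_history_tokens = 180_000  # hypothetical accumulated conversation
summarized_tokens = 12_000     # hypothetical relevant-context summary

# Per-call saving from sending the summary instead of the full history
savings_per_call = (full_history_tokens - summarized_tokens) * RATE_PER_TOKEN
```

That is roughly $0.50 saved per call here, and under quadratic pricing the gap would be even larger, since the long-context request would also pay a higher effective rate.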
Dynamic electricity-dependent pricing
Electricity is the largest ongoing expense for datacenter operators. It is required not only to run the servers but also to cool them. Electricity will likely make up a growing share of the total operational cost of running models, for the following reasons:
- Electricity prices continue to climb due to data center electricity demand growth outpacing generation and transmission growth
- The depreciation cost of GPUs is decreasing as older-generation GPUs enjoy a longer useful life due to current high inference demand (we just don’t have enough GPUs!). A Google Cloud VP noted that 7-year-old TPUs were still seeing 100% utilization in Google data centers.
Given that electricity can be ~3 times cheaper at night, inference providers should offer LLM pricing that reflects this time-of-day cost difference. Dynamic electricity pricing is already an option for residential consumers whose utilities support it. LLM consumers should likewise be able to exploit this efficiency opportunity.
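A minimal sketch of what time-of-day pricing could look like, applying a multiplier to the base token rate depending on the hour the request is served; the windows and multipliers below are assumptions, not real tariffs:

```python
def price_multiplier(hour: int) -> float:
    """Hypothetical time-of-day multiplier on the base token rate."""
    if 0 <= hour < 6:      # overnight off-peak: electricity ~3x cheaper
        return 0.5
    if 16 <= hour < 21:    # evening demand peak
        return 1.5
    return 1.0             # shoulder hours

def dynamic_rate_per_mtok(base_rate: float, hour: int) -> float:
    # e.g. a $3/Mtok base rate becomes $1.50/Mtok overnight
    return base_rate * price_multiplier(hour)
```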
Considering global request routing and regional energy costs
One may point out that, unlike general electricity use, LLM requests can be routed to infrastructure globally, allowing for more stabilized model pricing. LLM electricity use has an advantage here: you can run a model in a country or region where electricity is much cheaper than where the LLM user is located. Inference providers can then use the global internet to forward tokens, a cost that is insignificant compared to the compute cost of running an LLM. This can help stabilize costs throughout the day, as different countries wake up and wind down with different peak and off-peak periods. Google already does this with their global Vertex AI endpoint.
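A toy version of such routing: given each region's UTC offset and a shared hourly electricity price curve, serve the request from wherever electricity is currently cheapest. Region names and prices are invented for illustration:

```python
# Assumed hourly electricity prices ($/kWh) indexed by local hour:
# cheap overnight (hours 0-5), pricier the rest of the day.
HOURLY_PRICE = [0.05] * 6 + [0.12] * 18

REGION_UTC_OFFSET = {  # hypothetical serving regions
    "us-east": -5,
    "europe-west": 1,
    "asia-east": 8,
}

def cheapest_region(utc_hour: int) -> str:
    """Pick the region whose local electricity price is lowest right now."""
    def local_price(region: str) -> float:
        local_hour = (utc_hour + REGION_UTC_OFFSET[region]) % 24
        return HOURLY_PRICE[local_hour]
    return min(REGION_UTC_OFFSET, key=local_price)
```

At 20:00 UTC, for instance, it is 4 a.m. in the hypothetical asia-east region, so requests would route there for overnight rates.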
There are many more factors at play, though, that likely mean LLM costs are not stabilized even with global LLM routing as an option. These include:
- Enterprise customers, which constitute and will continue to constitute the significant majority of LLM usage, have data sovereignty concerns and want all LLM requests processed in specific countries. This of course undermines the benefit of multi-region routing and can put more pressure and higher costs on a specific region.
- Data center physical infrastructure does not match where electricity prices are cheapest, and deploying infrastructure to those areas can cost more than it benefits. Permitting overhead, local construction supply, available grid hookups, and similar factors all weigh on where datacenters can be placed.
- Nation states have technology security concerns and do not allow advanced GPUs to be exported to some countries. For example, the US has banned advanced Nvidia GPUs from being exported to China.
The result of these effects can be seen in the amount of GPU compute each country currently has. Despite the US having higher average electricity costs than China or the Gulf states, the majority (and a growing share) of all new GPU installations are inside the country.
If LLM prices are going to depend on US electricity prices, which vary heavily with the time of day, then inference providers should offer pricing that varies with electricity prices too.
As of November 2025, companies have for a while offered batch pricing that lets users save by processing requests at delayed times. They should take the next step and provide dynamic pricing for on-demand requests as well.
This would let engineers schedule LLM agents, such as coding agents, during off-peak hours, then review the work and iterate during daytime hours. Note that such a system is not possible with currently available batch requests, because agentic systems rely on realtime feedback loops.
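Under dynamic on-demand pricing, an agent scheduler could simply hold deferrable jobs until the next off-peak window; the 23:00–06:00 window below is an illustrative assumption:

```python
from datetime import datetime

OFF_PEAK_START = 23  # assumed local hour when off-peak rates begin
OFF_PEAK_END = 6     # assumed local hour when they end

def run_at(now: datetime) -> datetime:
    """Return when a deferrable agent job should start."""
    if now.hour >= OFF_PEAK_START or now.hour < OFF_PEAK_END:
        return now  # already off-peak: run immediately
    # Otherwise wait for tonight's off-peak window
    return now.replace(hour=OFF_PEAK_START, minute=0, second=0, microsecond=0)
```

An interactive coding session would skip this check entirely and pay the going rate, which is exactly the flexibility batch pricing cannot offer.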
More
I love platforms like OpenRouter because they provide an accessible marketplace for LLM inference. Integrating dynamic pricing into such a marketplace would be a game changer for market efficiency and competition.
Additionally, inference providers could find even more ways to price LLMs efficiently, such as charging higher prices for higher throughput or lower latency (factors determined by the hardware running the model). This would be useful for delegating different workloads to different agents. For example, low-priced, low-throughput models could serve time-insensitive cloud agentic flows like data processing, whereas high-throughput models could serve human-in-the-loop, quick-iteration agentic coding. Provisioned throughput is already an option on cloud providers like GCP.
Extremes will at some point become the norm as large models become useful for things like high-frequency trading.