Google has introduced a new feature in its Gemini API, claiming it will reduce the cost for third-party developers using its latest AI models.
The feature, called "implicit caching," is said to cut costs by 75% for "repeated context" passed to the model via the Gemini API. It supports Google's Gemini 2.5 Pro and 2.5 Flash models.
This could be good news for developers, as the cost of using cutting-edge models has been rising.
Caching is a widely adopted practice in the AI industry: it cuts computational demands and costs by reusing frequently accessed or precomputed data. For instance, a cache can store answers to questions users frequently ask a model, eliminating the need to regenerate responses to the same queries.
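To make the general idea concrete (this is purely illustrative, not how Gemini implements caching internally), here is a toy Python sketch in which identical prompts are answered from memory instead of triggering a new generation; the model call is a hypothetical stand-in:

```python
import time
from functools import lru_cache

def expensive_model_call(prompt: str) -> str:
    """Stand-in for a costly model invocation (hypothetical)."""
    time.sleep(1)  # simulate generation latency
    return f"answer to: {prompt}"

@lru_cache(maxsize=1024)
def answer(prompt: str) -> str:
    # Identical prompts are served from memory; only the first call pays.
    return expensive_model_call(prompt.strip())

answer("What is caching?")  # slow path: generates and stores the response
answer("What is caching?")  # fast path: returned straight from the cache
```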
Previously, Google offered only explicit prompt caching, meaning developers had to define their most frequently used prompts themselves. Cost savings were guaranteed, but explicit prompt caching often required significant manual work.
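For context, explicit caching in the google-genai Python SDK looks roughly like the sketch below. The API key, model string, document, and TTL are placeholders, and the exact parameters may differ by SDK version; the point is that the developer must decide up front which content to cache:

```python
from google import genai
from google.genai import types

client = genai.Client(api_key="YOUR_API_KEY")

long_document = open("reference_doc.txt").read()  # hypothetical shared context

# Explicit caching: the developer picks the content to cache ahead of time.
cache = client.caches.create(
    model="gemini-2.5-pro",
    config=types.CreateCachedContentConfig(
        display_name="shared-reference-context",
        contents=[long_document],
        ttl="3600s",  # cached content expires after an hour
    ),
)

# Subsequent requests reference the cache by name to get the discount.
response = client.models.generate_content(
    model="gemini-2.5-pro",
    contents="Summarize section 2 of the document.",
    config=types.GenerateContentConfig(cached_content=cache.name),
)
print(response.text)
```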
Some developers were dissatisfied with how Google implemented explicit caching on Gemini 2.5 Pro, stating it could lead to unexpected high API bills. Last week, complaints peaked, prompting the Gemini team to apologize and commit to making changes.
In contrast to explicit caching, implicit caching is automatic. It is enabled by default for Gemini 2.5 models, and Google passes the cost savings on whenever a Gemini API request hits the cache.
"When you send a request to one of the Gemini 2.5 models, if the request shares a common prefix with a previous request, then it qualifies to hit the cache," Google explained in a blog post. "We will dynamically pass the cost savings back to you."
According to Google's developer documentation, the minimum prompt size for implicit caching is 1,024 tokens on 2.5 Flash and 2,048 tokens on 2.5 Pro, which isn't particularly large, so triggering these automatic savings shouldn't require much effort. Tokens are the raw bits of data that models process, with a thousand tokens equivalent to about 750 words.

Given that Google's previous claims about cost savings from caching drew complaints, there are some considerations buyers should keep in mind. First, Google advises developers to keep repeated context at the beginning of requests to maximize the chance of an implicit cache hit. Context that may vary from request to request should be appended at the end, the company said.
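As a rough illustration of that guidance, the sketch below structures requests with the stable, repeated context first and the per-request question last. It assumes the google-genai Python SDK; the API key, file, and questions are placeholders, and the usage-metadata field is how the SDK currently reports cache hits:

```python
from google import genai

client = genai.Client(api_key="YOUR_API_KEY")

# Must exceed the minimum prompt size (2,048 tokens on 2.5 Pro) to qualify.
stable_context = open("product_manual.txt").read()

def ask(question: str) -> str:
    response = client.models.generate_content(
        model="gemini-2.5-pro",
        # Shared prefix first, variable question last, per Google's guidance.
        contents=[stable_context, question],
    )
    # usage_metadata reports how many prompt tokens were served from cache.
    print(response.usage_metadata.cached_content_token_count)
    return response.text

ask("What does section 3 cover?")
ask("Summarize the warranty terms.")  # shares the prefix; may hit the cache
```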
Moreover, Google hasn't provided any third-party verification that the new implicit caching system delivers the promised automatic savings, so we'll have to wait for feedback from early adopters.