DeepSeek has introduced a new experimental model called V3.2-exp aimed at significantly reducing inference costs during long-context operations. The announcement was made via a post on Hugging Face, where the team also shared a link to the corresponding academic paper hosted on GitHub.
The central feature of the new model is DeepSeek Sparse Attention. At its core, a component called the "Lightning Indexer" prioritizes specific excerpts within the context window. A second mechanism, the "Fine-Grained Token Selection System," then picks individual tokens from those excerpts to load into the model's limited attention window. Together, the two stages let the sparse attention model operate over long contexts with comparatively low server overhead; a simplified sketch of the idea follows.
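To make the mechanism concrete, here is a minimal, hypothetical sketch of indexer-guided sparse attention in PyTorch. The function and parameter names (sparse_attention_sketch, index_q, index_k, top_k) are illustrative assumptions rather than DeepSeek's actual implementation: a cheap indexer scores every cached token against each query, and full attention is then computed only over the top-scoring tokens.

```python
import torch

def sparse_attention_sketch(q, k, v, index_q, index_k, top_k=256):
    """
    Illustrative sketch of indexer-guided sparse attention (not DeepSeek's code).
    A lightweight indexer scores every context token per query, and only the
    top-k highest-scoring tokens participate in full attention. Causal masking
    is omitted for brevity.

    q, k, v:           [seq_len, d_model]  full-precision attention inputs
    index_q, index_k:  [seq_len, d_index]  cheap, low-dimensional indexer projections
    """
    seq_len, d_model = k.shape
    top_k = min(top_k, seq_len)

    # 1) Lightweight indexer pass: score each context token against each query token.
    #    This stands in for the role the "Lightning Indexer" plays in the paper.
    index_scores = index_q @ index_k.T                    # [seq_len, seq_len]

    # 2) Fine-grained token selection: keep only the top-k tokens per query.
    selected = index_scores.topk(top_k, dim=-1).indices   # [seq_len, top_k]

    # 3) Full attention restricted to the selected tokens.
    k_sel = k[selected]                                   # [seq_len, top_k, d_model]
    v_sel = v[selected]                                   # [seq_len, top_k, d_model]
    attn = (q.unsqueeze(1) @ k_sel.transpose(1, 2)) / d_model**0.5
    weights = attn.softmax(dim=-1)                        # [seq_len, 1, top_k]
    return (weights @ v_sel).squeeze(1)                   # [seq_len, d_model]

# Toy usage: an 8k-token context, with attention restricted to 256 tokens per query.
q = k = v = torch.randn(8192, 64)
iq = ik = torch.randn(8192, 16)
out = sparse_attention_sketch(q, k, v, iq, ik, top_k=256)
```

Under these assumptions, the expensive attention step scales with top_k rather than with the full context length, which is broadly the kind of long-context saving the preliminary tests describe.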
The benefits of this system are particularly evident in long-context scenarios. Early tests by DeepSeek revealed that the cost of a simple API call could be reduced by up to 50%. While further validation is needed to confirm these results, the model’s open-weight release on Hugging Face means third-party evaluations of the paper’s claims should emerge soon.
This latest advancement is part of a broader wave of innovations targeting inference costs—the server expenses associated with running pre-trained AI models, distinct from training costs. In DeepSeek’s case, researchers have been exploring ways to make the fundamental transformer architecture more efficient, achieving notable success.
Based in China, DeepSeek has played an unconventional role in the AI boom, especially for those viewing AI research as a key front in the U.S.-China tech rivalry. Earlier this year, the company gained attention for its R1 model, which was primarily trained using reinforcement learning at a fraction of the cost of U.S. counterparts. However, the model did not spark the widespread revolution in AI training that some had anticipated, and the company has since stepped back from the spotlight.
While the new “sparse attention” approach may not generate the same level of excitement as R1, it could still offer valuable insights to U.S.-based providers looking to keep inference costs under control.