Deepseek has unveiled a new visual encoder capable of reorganizing image information based on semantic meaning, rather than processing it in a strict top-down, left-to-right sequence.
Conventional vision-language models segment images into patches and process them in a fixed order, typically starting at the top-left corner and moving towards the bottom-right. According to Deepseek researchers, this does not match how human vision works: our eyes follow paths dictated by the content. When tracing a spiral, for instance, we do not scan the image line by line; we follow the shape.
Deepseek OCR 2 aims to emulate this behavior. The new DeepEncoder V2 first processes visual tokens based on their content, rearranging them contextually, before a language model interprets the resulting sequence. The underlying idea is that this two-stage processing workflow enables a more genuine comprehension of two-dimensional image content.
Language Model Replaces Traditional Visual Encoder
At the core of DeepEncoder V2 is the replacement of a typical CLIP component with a compact language model architecture based on Alibaba's Qwen2 0.5B. The researchers introduce what they term causal flow tokens: learnable query tokens appended to the visual tokens, each of which can attend to the entire image as well as to the queries that precede it.
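The paper does not publish this as code, but the described attention pattern can be sketched as a simple mask: flow tokens see every visual token plus the flow tokens before them, while visual tokens attend only among themselves. The function and tensor layout below are illustrative assumptions, not Deepseek's implementation.

```python
import torch

def causal_flow_attention_mask(num_visual: int, num_flow: int) -> torch.Tensor:
    """Mask for a sequence laid out as [visual tokens | flow tokens].

    Assumption (not from the paper): flow tokens attend to all visual tokens
    and causally to earlier flow tokens; visual tokens attend among themselves.
    True means attention is allowed.
    """
    total = num_visual + num_flow
    mask = torch.zeros(total, total, dtype=torch.bool)

    # Visual tokens: full self-attention over the image.
    mask[:num_visual, :num_visual] = True

    # Flow tokens: access to all image information ...
    mask[num_visual:, :num_visual] = True

    # ... and causal access to the flow tokens that came before them.
    mask[num_visual:, num_visual:] = torch.tril(
        torch.ones(num_flow, num_flow, dtype=torch.bool)
    )
    return mask

# Example: 16 visual tokens followed by 4 causal flow queries.
print(causal_flow_attention_mask(16, 4).int())
```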
The paper describes this as forming a two-phase process. First, the encoder reorganizes visual information based on content. Then, a downstream LLM decoder performs reasoning on the now-ordered sequence. Only the rearranged causal flow tokens are passed to the decoder, not the original visual tokens.
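A rough sketch of that two-phase flow is shown below, using placeholder module names and toy dimensions rather than Deepseek's actual components (the real decoder is a full LLM, and the causal mask among the flow queries from the sketch above is omitted here for brevity). The key point it illustrates is that the decoder only ever sees the rearranged flow tokens, never the raw visual tokens.

```python
import torch
import torch.nn as nn

class CausalFlowEncoder(nn.Module):
    """Hypothetical phase 1: learnable flow queries reorganize visual content."""

    def __init__(self, dim: int = 256, num_flow: int = 8, heads: int = 4):
        super().__init__()
        self.flow_queries = nn.Parameter(torch.randn(num_flow, dim))
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, visual_tokens: torch.Tensor) -> torch.Tensor:
        # visual_tokens: (batch, num_visual, dim)
        b = visual_tokens.size(0)
        queries = self.flow_queries.unsqueeze(0).expand(b, -1, -1)
        flow_tokens, _ = self.attn(queries, visual_tokens, visual_tokens)
        return flow_tokens  # only these are handed to the decoder

class ToyPipeline(nn.Module):
    """Two-phase sketch: reorder first, then reason over the ordered sequence."""

    def __init__(self, dim: int = 256, num_flow: int = 8, vocab: int = 1000):
        super().__init__()
        self.encoder = CausalFlowEncoder(dim, num_flow)
        # Stand-in for the downstream LLM decoder.
        self.decoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True),
            num_layers=2,
        )
        self.lm_head = nn.Linear(dim, vocab)

    def forward(self, visual_tokens: torch.Tensor) -> torch.Tensor:
        flow_tokens = self.encoder(visual_tokens)  # phase 1: reorganize
        hidden = self.decoder(flow_tokens)         # phase 2: reason
        return self.lm_head(hidden)                # token logits

logits = ToyPipeline()(torch.randn(2, 64, 256))  # 2 images, 64 visual tokens each
print(logits.shape)  # torch.Size([2, 8, 1000])
```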
Fewer Tokens, Better Performance
Deepseek OCR 2 uses between 256 and 1,120 visual tokens per image, depending on the content; comparable models often need 6,000 to 7,000 tokens or more. On the OmniDocBench v1.5 benchmark, which covers 1,355 pages across nine categories, the model achieved an overall score of 91.09%, the researchers reported.
This represents an improvement of 3.73 percentage points over its predecessor, Deepseek OCR. Gains were particularly notable in correctly identifying reading order. In document parsing tasks, Deepseek OCR 2 also outperformed Gemini 3 Pro under a similar token budget.
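To give a feel for why the visual-token budget matters downstream, the back-of-the-envelope sketch below estimates the per-image key-value cache a decoder would need at different token counts. The layer, head, and precision figures are illustrative assumptions, not Deepseek's or Gemini's actual decoder configurations.

```python
def kv_cache_bytes(num_tokens: int, num_layers: int = 24, num_heads: int = 16,
                   head_dim: int = 64, bytes_per_val: int = 2) -> int:
    """Memory for keys and values cached per image (fp16 assumed)."""
    # Per token and layer: one key and one value vector of num_heads * head_dim.
    return num_tokens * num_layers * num_heads * head_dim * 2 * bytes_per_val

for n in (256, 1120, 7000):
    print(f"{n:>5} visual tokens -> {kv_cache_bytes(n) / 1e6:.1f} MB of KV cache")
```

Under these assumed dimensions, an image budget of 7,000 tokens costs roughly 27 times as much cache memory as 256 tokens, which is why the lower budget translates directly into cheaper inference.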
Performance improved in practical metrics as well. The repetition rate, which measures how often the model gets stuck in loops of redundant text, decreased. When serving as the OCR backend for Deepseek's language models, this rate dropped from 6.25% to 4.17%. For batch processing of PDFs used as training data, the rate fell from 3.69% to 2.88%.
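The paper does not spell out exactly how the repetition rate is computed; a simple, assumed heuristic for flagging degenerate repetition loops in OCR output could look like the sketch below, where the n-gram size and threshold are arbitrary illustrative choices rather than Deepseek's.

```python
from collections import Counter

def looks_stuck(text: str, n: int = 8, threshold: int = 5) -> bool:
    """Heuristic: flag output in which a single word n-gram repeats many times,
    the typical signature of a model stuck in a generation loop."""
    words = text.split()
    if len(words) < n:
        return False
    ngrams = Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))
    return max(ngrams.values()) >= threshold

def repetition_rate(outputs: list[str]) -> float:
    """Share of documents whose OCR output is flagged as repetitive."""
    flagged = sum(looks_stuck(text) for text in outputs)
    return flagged / len(outputs) if outputs else 0.0

# Example: one looping output out of three -> rate of about 33%.
samples = ["normal page text", "total " * 50, "another clean page"]
print(f"{repetition_rate(samples):.2%}")
```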
However, the model does have weaknesses. Its performance on newspapers, for example, is worse than the previous version's. The researchers cite two factors: the lower token limit may be a problem for text-dense newspaper pages, and the training data contained only 250,000 newspaper pages, too little material for this category.
A Step Towards Unified Multimodal Processing
The researchers view DeepEncoder V2 as a step towards unified multimodal processing. In the future, encoder architectures might process text, speech, and images within the same fundamental framework, adapting only the query tokens to the modality. The paper suggests this approach holds promise for ultimately achieving true understanding of two-dimensional content.
The code and model weights are publicly available on GitHub and Hugging Face.
Deepseek only released the first-generation Deepseek OCR in October of last year. That system processes text documents as images and reduces memory requirements by a factor of ten, allowing language models to retain significantly more context, which helps with long chat histories or large document volumes. According to Deepseek, the system can process up to 33 million pages daily, making it particularly suitable for generating large training datasets.