Deepseek has unveiled a new visual encoder capable of reorganizing image information based on semantic meaning, rather than processing it in a strict top-down, left-to-right sequence.
Conventional vision-language models segment images into patches and process them in a fixed order, typically starting at the top-left corner and moving towards the bottom-right. According to Deepseek researchers, this does not match how human vision works: our eyes follow paths dictated by the content. When tracing a spiral, for instance, we do not scan the image line by line; we follow the shape.
Deepseek OCR 2 aims to emulate this behavior. The new DeepEncoder V2 first processes visual tokens based on their content, rearranging them contextually, before a language model interprets the resulting sequence. The underlying idea is that this two-stage processing workflow enables a more genuine comprehension of two-dimensional image content.
Language Model Replaces Traditional Visual Encoder
At the core of DeepEncoder V2 is the replacement of a typical CLIP component with a compact language model architecture based on Alibaba's Qwen2 0.5B. The researchers introduce what they term causal flow tokens: learnable query tokens appended to the visual tokens, each of which can attend to the entire image as well as to the queries that precede it.
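The paper does not publish this as code, but the described attention pattern can be sketched as a simple mask: flow tokens see every visual token plus the flow tokens before them, while visual tokens attend only among themselves. The function and tensor layout below are illustrative assumptions, not Deepseek's implementation.

```python
import torch

def causal_flow_attention_mask(num_visual: int, num_flow: int) -> torch.Tensor:
    """Mask for a sequence laid out as [visual tokens | flow tokens].

    Assumption (not from the paper): flow tokens attend to all visual tokens
    and causally to earlier flow tokens; visual tokens attend among themselves.
    True means attention is allowed.
    """
    total = num_visual + num_flow
    mask = torch.zeros(total, total, dtype=torch.bool)

    # Visual tokens: full self-attention over the image.
    mask[:num_visual, :num_visual] = True

    # Flow tokens: access to all image information ...
    mask[num_visual:, :num_visual] = True

    # ... and causal access to the flow tokens that came before them.
    mask[num_visual:, num_visual:] = torch.tril(
        torch.ones(num_flow, num_flow, dtype=torch.bool)
    )
    return mask

# Example: 16 visual tokens followed by 4 causal flow queries.
print(causal_flow_attention_mask(16, 4).int())
```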
The paper describes this as forming a two-phase process. First, the encoder reorganizes visual information based on content. Then, a downstream LLM decoder performs reasoning on the now-ordered sequence. Only the rearranged causal flow tokens are passed to the decoder, not the original visual tokens.
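A rough sketch of that two-phase flow is shown below, using placeholder module names and toy dimensions rather than Deepseek's actual components (the real decoder is a full LLM, and the causal mask among the flow queries from the sketch above is omitted here for brevity). The key point it illustrates is that the decoder only ever sees the rearranged flow tokens, never the raw visual tokens.

```python
import torch
import torch.nn as nn

class CausalFlowEncoder(nn.Module):
    """Hypothetical phase 1: learnable flow queries reorganize visual content."""

    def __init__(self, dim: int = 256, num_flow: int = 8, heads: int = 4):
        super().__init__()
        self.flow_queries = nn.Parameter(torch.randn(num_flow, dim))
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, visual_tokens: torch.Tensor) -> torch.Tensor:
        # visual_tokens: (batch, num_visual, dim)
        b = visual_tokens.size(0)
        queries = self.flow_queries.unsqueeze(0).expand(b, -1, -1)
        flow_tokens, _ = self.attn(queries, visual_tokens, visual_tokens)
        return flow_tokens  # only these are handed to the decoder

class ToyPipeline(nn.Module):
    """Two-phase sketch: reorder first, then reason over the ordered sequence."""

    def __init__(self, dim: int = 256, num_flow: int = 8, vocab: int = 1000):
        super().__init__()
        self.encoder = CausalFlowEncoder(dim, num_flow)
        # Stand-in for the downstream LLM decoder.
        self.decoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True),
            num_layers=2,
        )
        self.lm_head = nn.Linear(dim, vocab)

    def forward(self, visual_tokens: torch.Tensor) -> torch.Tensor:
        flow_tokens = self.encoder(visual_tokens)  # phase 1: reorganize
        hidden = self.decoder(flow_tokens)         # phase 2: reason
        return self.lm_head(hidden)                # token logits

logits = ToyPipeline()(torch.randn(2, 64, 256))  # 2 images, 64 visual tokens each
print(logits.shape)  # torch.Size([2, 8, 1000])
```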
Fewer Tokens, Better Performance
Deepseek OCR 2 uses between 256 and 1,120 visual tokens per image, depending on the content; comparable models often need 6,000 to 7,000 tokens or more. On the OmniDocBench v1.5 benchmark, which covers 1,355 pages across nine categories, the model achieved an overall score of 91.09%, the researchers reported.
This represents an improvement of 3.73 percentage points over its predecessor, Deepseek OCR. Gains were particularly notable in correctly identifying reading order. In document parsing tasks, Deepseek OCR 2 also outperformed Gemini 3 Pro under a similar token budget.
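To give a feel for why the visual-token budget matters downstream, the back-of-the-envelope sketch below estimates the per-image key-value cache a decoder would need at different token counts. The layer, head, and precision figures are illustrative assumptions, not Deepseek's or Gemini's actual decoder configurations.

```python
def kv_cache_bytes(num_tokens: int, num_layers: int = 24, num_heads: int = 16,
                   head_dim: int = 64, bytes_per_val: int = 2) -> int:
    """Memory for keys and values cached per image (fp16 assumed)."""
    # Per token and layer: one key and one value vector of num_heads * head_dim.
    return num_tokens * num_layers * num_heads * head_dim * 2 * bytes_per_val

for n in (256, 1120, 7000):
    print(f"{n:>5} visual tokens -> {kv_cache_bytes(n) / 1e6:.1f} MB of KV cache")
```

Under these assumed dimensions, an image budget of 7,000 tokens costs roughly 27 times as much cache memory as 256 tokens, which is why the lower budget translates directly into cheaper inference.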
Performance improved in practical metrics as well. The repetition rate, which measures how often the model gets stuck in loops of redundant text, decreased. When serving as the OCR backend for Deepseek's language models, this rate dropped from 6.25% to 4.17%. For batch processing of PDFs used as training data, the rate fell from 3.69% to 2.88%.
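The paper does not spell out exactly how the repetition rate is computed; a simple, assumed heuristic for flagging degenerate repetition loops in OCR output could look like the sketch below, where the n-gram size and threshold are arbitrary illustrative choices rather than Deepseek's.

```python
from collections import Counter

def looks_stuck(text: str, n: int = 8, threshold: int = 5) -> bool:
    """Heuristic: flag output in which a single word n-gram repeats many times,
    the typical signature of a model stuck in a generation loop."""
    words = text.split()
    if len(words) < n:
        return False
    ngrams = Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))
    return max(ngrams.values()) >= threshold

def repetition_rate(outputs: list[str]) -> float:
    """Share of documents whose OCR output is flagged as repetitive."""
    flagged = sum(looks_stuck(text) for text in outputs)
    return flagged / len(outputs) if outputs else 0.0

# Example: one looping output out of three -> rate of about 33%.
samples = ["normal page text", "total " * 50, "another clean page"]
print(f"{repetition_rate(samples):.2%}")
```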
However, the model does have weaknesses. Its performance on newspapers, for example, is worse than the previous version's. The researchers cite two factors: the lower token limit may be a problem for text-dense newspaper pages, and the training data contained only 250,000 newspaper pages, too little material for this category.
A Step Towards Unified Multimodal Processing
The researchers view DeepEncoder V2 as a step towards unified multimodal processing. In the future, encoder architectures might process text, speech, and images within the same fundamental framework, adapting only the query tokens to the modality. The paper suggests this approach holds promise for ultimately achieving true understanding of two-dimensional content.
The code and model weights are publicly available on GitHub and Hugging Face.
Deepseek only released the first-generation Deepseek OCR in October of last year. That system processes text documents as images and reduces memory requirements by a factor of ten, allowing language models to retain significantly more context, which helps with long chat histories or large document volumes. According to Deepseek, the system can process up to 33 million pages daily, making it particularly suitable for generating large training datasets.