Chinese tech giant Baidu has unveiled ERNIE-4.5-VL-28B-A3B-Thinking, a new multimodal AI model capable of incorporating image processing directly into its reasoning workflow.
The company asserts that the model outperforms leading commercial systems such as Google's Gemini 2.5 Pro and OpenAI's GPT-5 High across multiple multimodal benchmarks. Thanks to its routed mixture-of-experts architecture, only 3 billion of the model's 28 billion total parameters are active per token, so it delivers robust performance while still running on a single 80 GB GPU such as the Nvidia A100.
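The active-versus-total parameter gap comes from routing: every expert stays in memory, but each token passes through only a few of them. A minimal sketch of top-k expert routing (the sizes and router here are illustrative assumptions, not ERNIE's actual configuration):

```python
import numpy as np

# Illustrative routed mixture-of-experts step. All experts exist in memory,
# but each token is processed by only top_k of them, so the "active"
# parameter count is far below the total. Sizes are hypothetical.

rng = np.random.default_rng(0)

d_model, n_experts, top_k = 16, 8, 2
experts = [rng.standard_normal((d_model, d_model)) for _ in range(n_experts)]
router = rng.standard_normal((d_model, n_experts))

def moe_forward(x):
    """Route one token vector through its top-k experts."""
    logits = x @ router                      # router score per expert
    chosen = np.argsort(logits)[-top_k:]     # indices of the top-k experts
    weights = np.exp(logits[chosen])
    weights /= weights.sum()                 # softmax over the chosen experts
    # Only the chosen experts' weight matrices are touched for this token.
    return sum(w * (x @ experts[i]) for w, i in zip(weights, chosen)), chosen

out, used = moe_forward(rng.standard_normal(d_model))

total_params = n_experts * d_model * d_model
active_params = top_k * d_model * d_model
print(f"experts used for this token: {sorted(used.tolist())}")
print(f"active/total expert params: {active_params}/{total_params}")
```

With these toy numbers, only a quarter of the expert parameters do work per token; the same principle, at scale, is how a 28B-parameter model can behave like a 3B-parameter one at inference time.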
Released under the Apache 2.0 license, ERNIE-4.5-VL-28B-A3B-Thinking is freely available for commercial use, although its reported capabilities have not yet been independently verified.
A standout feature of the model is its “thinking with images” capability, which enables dynamic image cropping to focus on critical visual details. In one demonstration, the system automatically zoomed in on a blue sign and accurately extracted its text content.
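The crop-and-zoom step in such a demo is conceptually simple: the model proposes a region of interest, and the pipeline crops and enlarges it for a second, closer look. A self-contained sketch using NumPy (the helper, box format, and synthetic image are assumptions for illustration, not Baidu's API):

```python
import numpy as np

def zoom_on_region(image, box, scale=4):
    """Crop a region of interest and upsample it for a closer 'look'.

    image: H x W x 3 uint8 array; box: (left, top, right, bottom) pixels --
    the kind of region a visual-reasoning model might emit before re-reading
    fine detail. Nearest-neighbour repetition keeps this dependency-free.
    """
    left, top, right, bottom = box
    crop = image[top:bottom, left:right]
    return crop.repeat(scale, axis=0).repeat(scale, axis=1)

# Synthetic 480x640 "photo" containing a small 40x20 blue "sign".
img = np.full((480, 640, 3), 255, dtype=np.uint8)
img[200:220, 300:340] = (0, 0, 255)

zoomed = zoom_on_region(img, box=(300, 200, 340, 220), scale=4)
print(zoomed.shape)  # (80, 160, 3)
```

After the zoom, the enlarged crop would be fed back to the model, which can then read text that was too small in the full frame.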
Additional evaluations show that the model can precisely locate individuals within images and return their bounding box coordinates, solve mathematical problems by analyzing circuit diagrams, and recommend optimal visiting times based on data visualizations. For video inputs, it extracts subtitles and aligns scenes with specific timestamps. Moreover, it can leverage external tools like web-based image search to identify unfamiliar objects.
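When a model returns bounding boxes as part of a text reply, downstream code has to pull the coordinates back out. A hedged sketch of that parsing step (the reply format shown is an assumption; many vision-language models emit boxes as bracketed integer quadruples, but ERNIE's exact format may differ):

```python
import re

# Assumed output convention: boxes appear as "[x1, y1, x2, y2]" in the reply.
BOX_RE = re.compile(r"\[\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)\s*\]")

def parse_boxes(reply: str):
    """Return every (x1, y1, x2, y2) box found in a model's text reply."""
    return [tuple(map(int, m.groups())) for m in BOX_RE.finditer(reply)]

# Hypothetical reply for a "locate the people" prompt.
reply = "Person 1: [120, 40, 260, 400]. Person 2: [300, 52, 430, 396]."
print(parse_boxes(reply))  # [(120, 40, 260, 400), (300, 52, 430, 396)]
```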
While Baidu highlights the model's ability to crop and process images during inference, the approach is not entirely novel. In April 2025, OpenAI introduced similar functionality in its o3 and o4-mini models, which integrate images natively into their internal reasoning chains and apply built-in operations such as zooming, cropping, and rotating to visual tasks, setting a new standard for agent-like visual reasoning and problem-solving.
What is particularly noteworthy is the pace: advanced visual reasoning capabilities that were exclusive to proprietary Western models only months ago are now appearing in open-source Chinese alternatives.