Meituan Officially Open-Sources LongCat-Image Model: Editing Capability Reaches Open-Source SOTA

2025-12-08

Today, Meituan's LongCat team officially released and open-sourced the LongCat-Image generative model. Leveraging a high-performance architecture, systematic training strategies, and advanced data engineering, this 6-billion-parameter model achieves performance in text-to-image generation and image editing that closely rivals much larger models.

LongCat-Image employs a unified architecture for both text-to-image synthesis and image editing, combined with a progressive learning strategy. Despite its compact 6B-parameter footprint, it delivers a strong balance of instruction fidelity, image quality, and text rendering, and it particularly excels at controllable single-image editing and comprehensive coverage of Chinese characters.
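
To make the design concrete, here is a minimal sketch, not the official implementation, of how a single diffusion-transformer backbone can serve both tasks: editing is treated as generation conditioned on an extra stream of reference-image tokens. All module names, layer counts, and dimensions below are illustrative assumptions.

```python
import torch
import torch.nn as nn

class UnifiedGenEditBackbone(nn.Module):
    """Hypothetical unified backbone: one set of weights for both tasks."""
    def __init__(self, dim=1024, depth=4, heads=16):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=depth)
        self.out = nn.Linear(dim, dim)  # per-token denoising prediction

    def forward(self, noisy_latents, text_tokens, ref_tokens=None):
        # Text-to-image: sequence = [text | noisy latents].
        # Editing: sequence = [text | reference image | noisy latents],
        # so the same weights attend to the source image when present.
        parts = [text_tokens]
        if ref_tokens is not None:
            parts.append(ref_tokens)
        parts.append(noisy_latents)
        hidden = self.blocks(torch.cat(parts, dim=1))
        # Only the positions holding the noisy latents are denoised.
        return self.out(hidden[:, -noisy_latents.shape[1]:])

model = UnifiedGenEditBackbone()
text = torch.randn(2, 77, 1024)      # encoded prompt
latents = torch.randn(2, 256, 1024)  # noisy image latents
pred_t2i = model(latents, text)                               # generation
pred_edit = model(latents, text, torch.randn(2, 256, 1024))   # editing
```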

Key Strength #1: Highly Controllable Image Editing

LongCat-Image achieves state-of-the-art (SOTA) results among open-source models on major image editing benchmarks such as GEdit-Bench and ImgEdit-Bench. This breakthrough stems from a tightly integrated training paradigm and data strategy. To preserve the aesthetic knowledge and generative capability of its base text-to-image model while avoiding the narrowing of the output space that typically accompanies post-training, LongCat-Image initializes from a mid-training checkpoint and adopts a multi-task joint learning framework that optimizes text-to-image generation and instruction-based editing together, deepening the model's understanding of complex and diverse editing instructions. Furthermore, by combining multi-source pretraining data with instruction-rewriting techniques and introducing human-curated, high-quality data during supervised fine-tuning (SFT), the model achieves simultaneous gains in instruction adherence, generalization, and visual consistency between the source and edited images.
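
As an illustration of this joint objective, the following hedged sketch interleaves text-to-image and editing batches so that one set of weights receives both losses. It reuses the UnifiedGenEditBackbone from the earlier sketch; the batch layout, the MSE denoising loss, and the optimizer settings are simplifying assumptions, not LongCat-Image's actual recipe.

```python
import torch
import torch.nn.functional as F

def joint_training_step(model, batch, optimizer):
    """One joint-objective step: a batch carries either a text-to-image
    sample (no reference tokens) or an editing sample (reference tokens
    present); both losses flow into the same weights."""
    pred = model(batch["noisy_latents"], batch["text_tokens"],
                 ref_tokens=batch.get("ref_tokens"))
    loss = F.mse_loss(pred, batch["target"])  # standard denoising loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Demo with the backbone sketched above and random stand-in data.
model = UnifiedGenEditBackbone()
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)
t2i = {"noisy_latents": torch.randn(2, 256, 1024),
       "text_tokens": torch.randn(2, 77, 1024),
       "target": torch.randn(2, 256, 1024)}
edit = dict(t2i, ref_tokens=torch.randn(2, 256, 1024))
for batch in [t2i, edit, t2i]:  # interleave the two tasks
    joint_training_step(model, batch, opt)
```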

Key Strength #2: Precise and Comprehensive Chinese Text Rendering

Addressing a longstanding industry challenge, accurate Chinese text generation, LongCat-Image implements a curriculum learning strategy to improve character coverage and rendering precision. During pretraining, it learns glyph representations from tens of millions of synthetic samples covering all 8,105 characters in the General Standard Chinese Character Set. In the SFT phase, real-world text-in-image data improves generalization across fonts and layout styles. Reinforcement learning (RL) further refines output quality by integrating dual reward models based on OCR accuracy and aesthetic appeal, improving both textual correctness and natural integration with background visuals. Additionally, by encoding the specified prompt text at the character level, the model significantly reduces memory overhead and substantially improves learning efficiency for text rendering. This capability robustly supports demanding use cases such as poster design, commercial advertising, classical poetry illustrations, couplets, storefront signage, and logo creation, even for rare or complex Chinese characters.
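
The dual-reward design lends itself to a compact sketch: an OCR-based correctness score and an aesthetic score are folded into one scalar reward. The character-level similarity metric, the weighting, and the aesthetic scorer below are placeholders; the post does not disclose the actual reward models.

```python
from difflib import SequenceMatcher

def ocr_accuracy(target_text: str, ocr_text: str) -> float:
    """Character-level similarity between intended and OCR-recognized text."""
    return SequenceMatcher(None, target_text, ocr_text).ratio()

def dual_reward(target_text: str, ocr_text: str,
                aesthetic_score: float,          # assumed to lie in [0, 1]
                w_ocr: float = 0.7, w_aes: float = 0.3) -> float:
    # Weighted sum of textual correctness and visual appeal.
    return w_ocr * ocr_accuracy(target_text, ocr_text) + w_aes * aesthetic_score

# Example: a render where OCR recovers three of four intended characters.
print(dual_reward("春夏秋冬", "春夏秋", aesthetic_score=0.8))
```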

Beyond text rendering and editing, LongCat-Image also enhances texture realism and visual fidelity through systematic data curation and an adversarial training framework. During pretraining and mid-training, AI-generated content (AIGC) is rigorously filtered out of the data to keep the model from converging on the local optimum of "plastic-looking" textures. All SFT data undergoes manual quality screening to align with mainstream aesthetic standards. In the RL stage, an AIGC detector serves as a reward model, providing an adversarial signal that pushes the model toward the authentic physical textures, lighting, and material qualities of the real world.
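
The adversarial reward can be sketched in a few lines: the generator earns a high reward when the AIGC detector judges its output to be real. The tiny classifier below is a stand-in; the actual detector's architecture and training are not described in this post.

```python
import torch
import torch.nn as nn

class AIGCDetector(nn.Module):
    """Stand-in binary classifier: outputs P(image is AI-generated)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, 1))

    def forward(self, images):
        return torch.sigmoid(self.net(images)).squeeze(-1)

def realism_reward(detector: AIGCDetector, images: torch.Tensor) -> torch.Tensor:
    # High reward when the detector is fooled, i.e. low P(AI-generated),
    # steering the generator toward real-world textures and lighting.
    with torch.no_grad():
        return 1.0 - detector(images)

detector = AIGCDetector()
fake_batch = torch.rand(4, 3, 64, 64)  # stand-in generated images
print(realism_reward(detector, fake_batch))
```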

Objective Benchmark Evaluation

Comprehensive benchmarking confirms LongCat-Image’s competitive edge: on image editing tasks, it scores 4.50 on ImgEdit-Bench and 7.60/7.64 on GEdit-Bench (Chinese/English), matching open-source SOTA and approaching leading closed-source models. For Chinese text rendering, it achieves a remarkable 90.7 on the ChineseWord benchmark—significantly outperforming all other models and ensuring accurate rendering of both common and rare characters. In text-to-image generation, it scores 0.87 on GenEval and 86.8 on DPG-Bench, demonstrating strong foundational capabilities that remain competitive against both top open-source and proprietary models.

Subjective User-Centric Evaluation

Prioritizing real-world user experience, we conducted rigorous subjective evaluations using industry-standard methodologies across the two core functionalities: text-to-image generation and image editing.

For text-to-image, large-scale Mean Opinion Score (MOS) assessments were performed across four dimensions: text-image alignment, visual plausibility, realism, and aesthetic quality. LongCat-Image demonstrates superior realism compared with leading open- and closed-source models, while achieving open-source SOTA in alignment and plausibility. For image editing, side-by-side (SBS) comparisons focused on overall editing quality and visual consistency, two key indicators of user satisfaction. The results show that although LongCat-Image still trails commercial systems such as Nano Banana and Seedream 4.0, it significantly outperforms all other open-source alternatives.
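
For readers unfamiliar with MOS protocols, the sketch below shows how such an evaluation aggregates ratings: each rater scores an image on the four dimensions above, and scores are averaged per dimension. The 1-5 scale and the sample ratings are assumptions for illustration only.

```python
from statistics import mean

DIMENSIONS = ("alignment", "plausibility", "realism", "aesthetics")

def mos(ratings: list[dict[str, int]]) -> dict[str, float]:
    """ratings: one dict of per-dimension scores (1-5) per rater."""
    return {d: mean(r[d] for r in ratings) for d in DIMENSIONS}

raters = [
    {"alignment": 5, "plausibility": 4, "realism": 5, "aesthetics": 4},
    {"alignment": 4, "plausibility": 4, "realism": 5, "aesthetics": 5},
    {"alignment": 5, "plausibility": 3, "realism": 4, "aesthetics": 4},
]
print(mos(raters))  # per-dimension MOS for one generated image
```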

To foster a transparent, open, and collaborative ecosystem, we are fully open-sourcing the multi-stage text-to-image models (including mid-training and post-training checkpoints) alongside the dedicated image editing model—enabling seamless support from cutting-edge research to real-world commercial deployment. We believe true innovation thrives through community collaboration. Developers are warmly invited to explore, experiment with, and contribute to LongCat-Image as we jointly push the boundaries of visual generative AI.
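
As a closing pointer, here is a minimal usage sketch that assumes the released checkpoints ship with a diffusers-compatible pipeline; the repository id and prompt are hypothetical and should be replaced with the identifiers from the official release page.

```python
import torch
from diffusers import DiffusionPipeline

# Hypothetical repo id; substitute the official one from the release page.
pipe = DiffusionPipeline.from_pretrained(
    "meituan-longcat/LongCat-Image",
    torch_dtype=torch.bfloat16,
).to("cuda")

# Chinese text rendering example: a storefront sign with specified characters.
image = pipe(prompt='A photo of a storefront sign that reads "龙猫咖啡"').images[0]
image.save("longcat_sample.png")
```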