Today, Xiaohongshu has partnered with Fudan University to unveil InstanceAssemble, a groundbreaking solution in the field of layout-to-image generation. By introducing an innovative "instance assembly attention" mechanism, this approach enables precise image synthesis from simple to complex and sparse to dense layouts. The research has been accepted by NeurIPS 2025.
In recent years, AI-driven image generation has evolved rapidly, moving from early text-to-image models toward layout-controlled methods that generate images according to spatial constraints supplied by the user, such as bounding boxes, segmentation masks, or pose skeletons.
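For concreteness, a layout condition of this kind can be expressed as a list of instances, each pairing a bounding box with a short text description. The schema below is purely illustrative and is not the data format of InstanceAssemble or any particular model:

```python
# Illustrative layout specification: each instance pairs a normalized
# bounding box (x_min, y_min, x_max, y_max) with a text description.
# This schema is an assumption for exposition only.
layout = [
    {"bbox": (0.05, 0.55, 0.40, 0.95), "prompt": "a golden retriever lying on grass"},
    {"bbox": (0.45, 0.10, 0.90, 0.60), "prompt": "a red kite flying in the sky"},
    {"bbox": (0.00, 0.00, 1.00, 0.35), "prompt": "cloudy blue sky"},
]
```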
A major challenge in layout-to-image generation is getting the model to place each object in its designated region while rendering content that actually matches its description. Common failure modes include misaligned object positioning, semantic inconsistencies between prompts and rendered objects, and excessive computational cost.
The newly introduced InstanceAssemble framework, developed jointly by Fudan University and Xiaohongshu, successfully achieves fine-grained control over object placement, marking a significant leap toward "precise composition" in AI-generated imagery.
Built upon the state-of-the-art diffusion transformer architecture, InstanceAssemble introduces a novel "instance assembly attention" module. Users simply input bounding box coordinates and textual descriptions for each object, and the model generates semantically appropriate content at the specified locations. Whether handling scenes with just a few elements or highly cluttered environments, InstanceAssemble maintains high fidelity in both layout alignment and semantic coherence.
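While the paper's exact formulation is not reproduced here, the core routing idea, letting each image region attend only to the text tokens of the instances whose boxes cover it, can be sketched as a masked cross-attention. Everything below (function names, shapes, the single-head simplification) is an assumption for illustration, not the released implementation:

```python
import torch
import torch.nn.functional as F

def box_mask(bbox, grid=16):
    """Boolean mask over a flattened grid x grid latent map,
    True at positions whose centers fall inside the normalized box."""
    centers = (torch.arange(grid) + 0.5) / grid
    yy, xx = torch.meshgrid(centers, centers, indexing="ij")
    x0, y0, x1, y1 = bbox
    return ((xx >= x0) & (xx < x1) & (yy >= y0) & (yy < y1)).flatten()

def instance_masked_attention(img_tokens, inst_tokens, inst_bboxes, grid=16):
    """Single-head cross-attention in which each image token (query) may
    attend only to the text tokens of instances whose boxes cover it.
    img_tokens: (grid*grid, d); inst_tokens: list of (T_i, d) tensors."""
    d = img_tokens.shape[-1]
    keys = torch.cat(inst_tokens, dim=0)            # (sum_i T_i, d)
    cols = [box_mask(b, grid).unsqueeze(1).expand(-1, t.shape[0])
            for t, b in zip(inst_tokens, inst_bboxes)]
    mask = torch.cat(cols, dim=1)                   # (grid*grid, sum_i T_i)
    scores = (img_tokens @ keys.T) / d ** 0.5
    scores = scores.masked_fill(~mask, float("-inf"))
    scores[~mask.any(dim=1)] = 0.0                  # uncovered: uniform attention
    return F.softmax(scores, dim=-1) @ keys         # (grid*grid, d)

# Example: two instances, a 16x16 latent grid, feature dimension 32.
img = torch.randn(16 * 16, 32)
texts = [torch.randn(4, 32), torch.randn(3, 32)]
boxes = [(0.0, 0.0, 0.5, 0.5), (0.5, 0.5, 1.0, 1.0)]
out = instance_masked_attention(img, texts, boxes)  # (256, 32)
```

The actual module naturally involves learned projections, multiple heads, and integration with the diffusion transformer's blocks; the sketch conveys only the per-instance routing idea.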
Notably, the method adopts a lightweight adaptation strategy that significantly lowers deployment barriers. No full-model retraining is required: adapting Stable Diffusion 3-Medium adds only about 71 million parameters (roughly 3.46% extra), while adapting Flux.1 adds merely 0.84%.
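Overhead figures like these are easiest to read in adapter terms: freeze the pretrained backbone and train only small residual modules. The sketch below uses a generic bottleneck adapter, an assumption rather than the paper's actual module design, and shows how such a percentage can be measured:

```python
import torch.nn as nn
import torch.nn.functional as F

class BottleneckAdapter(nn.Module):
    """Generic residual adapter (down-project, GELU, up-project).
    Illustrative assumption; the paper's adaptation modules may differ."""
    def __init__(self, dim, bottleneck=64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)
        nn.init.zeros_(self.up.weight)  # adapter starts as the identity map
        nn.init.zeros_(self.up.bias)

    def forward(self, x):
        return x + self.up(F.gelu(self.down(x)))

def parameter_overhead(base: nn.Module, added: nn.Module) -> float:
    """Extra trainable parameters as a percentage of the frozen base."""
    n_base = sum(p.numel() for p in base.parameters())
    n_added = sum(p.numel() for p in added.parameters())
    return 100.0 * n_added / n_base

# Typical usage (the loader name is hypothetical):
#   base = load_pretrained_backbone()
#   for p in base.parameters():
#       p.requires_grad_(False)          # freeze the backbone
#   adapters = nn.ModuleList(BottleneckAdapter(1536) for _ in range(24))
#   print(f"{parameter_overhead(base, adapters):.2f}% extra parameters")
```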
To better assess layout-image alignment, the research team also introduced DenseLayout, a new benchmark of 5,000 images with 90,000 annotated instances (an average of 18 per image), along with a proposed evaluation metric, the Layout Grounding Score (LGS). On this densely packed benchmark, InstanceAssemble outperformed existing approaches by a large margin.
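The article does not spell out how LGS is computed. One plausible shape for such a grounding metric, offered strictly as an assumption, is to run a detector on each generated image and score every layout instance (using the layout schema from the earlier sketch) by how well a detection with a matching label overlaps its box:

```python
def iou(a, b):
    """Intersection-over-union of two (x0, y0, x1, y1) boxes."""
    x0, y0 = max(a[0], b[0]), max(a[1], b[1])
    x1, y1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x1 - x0) * max(0.0, y1 - y0)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def layout_grounding_score(layout, detections):
    """Hypothetical grounding score: for each layout instance, take the
    best-IoU detection whose label appears in the instance prompt, then
    average over instances. An illustrative stand-in, not the paper's LGS.
    layout:     [{"bbox": ..., "prompt": ...}, ...]
    detections: [{"bbox": ..., "label": ...}, ...] from some detector.
    """
    scores = []
    for inst in layout:
        matches = [iou(inst["bbox"], d["bbox"])
                   for d in detections if d["label"] in inst["prompt"]]
        scores.append(max(matches, default=0.0))
    return sum(scores) / len(scores) if scores else 0.0
```

A real metric would likely use a stronger matching rule than substring lookup (for example, embedding similarity between detection labels and instance prompts) and aggregate scores over the whole benchmark.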
Experiments show that InstanceAssemble excels across diverse layout conditions. Even when trained exclusively on sparse layouts (≤10 instances), it maintains robust performance on dense layouts (>10 instances).
The technology is now open-sourced, with code and pre-trained models publicly available on GitHub, offering strong support for applications in design, advertising, and digital content creation.