Diffusion Models: Unleashing Creativity, Control, and Consistency in AI
Latest 50 papers on diffusion models: Dec. 21, 2025
Diffusion models are rapidly evolving, transforming how we generate, edit, and understand complex data, from stunning images and realistic videos to robust robotic actions and even medical diagnoses. Recent breakthroughs, highlighted by a surge of innovative research, are pushing the boundaries of what these generative powerhouses can achieve, focusing on improved control, efficiency, and real-world applicability.
The Big Idea(s) & Core Innovations
The core challenge many of these papers address is achieving granular control and consistency in generative tasks without sacrificing quality or speed. Traditional generative models often struggle with semantic fidelity, temporal coherence, or the ability to follow intricate prompts. A unifying theme emerges: decomposing complex generation tasks into more manageable, controllable components and leveraging existing robust models (like 2D diffusion models or large language models) as powerful priors.
For instance, the ability to generate hyper-realistic, animatable 3D avatars from a single image is a significant leap. Researchers from the University of California, San Diego and NVIDIA introduce Instant Expressive Gaussian Head Avatar via 3D-Aware Expression Distillation, which distills 3D-aware expression knowledge from 2D diffusion models into a feed-forward encoder. Their key insight is deforming Gaussians in a high-dimensional feature space, enabling expressive details like wrinkles while maintaining 3D consistency and fast inference at 107 FPS.
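To give a feel for what "deforming Gaussians in a feature space" can look like, here is a minimal PyTorch sketch assuming a per-Gaussian feature vector and a global expression code; the module names, dimensions, and the 13-channel delta layout are illustrative assumptions, not the authors' implementation.

```python
# Hedged sketch (not the paper's code): deform Gaussians in a learned feature
# space and only then decode explicit Gaussian attribute deltas. All names,
# sizes, and the 13-channel layout below are assumptions for illustration.
import torch
import torch.nn as nn

class FeatureSpaceDeformer(nn.Module):
    def __init__(self, feat_dim: int = 128, expr_dim: int = 64):
        super().__init__()
        # Predicts a feature-space offset from the expression code.
        self.offset_mlp = nn.Sequential(
            nn.Linear(feat_dim + expr_dim, 256), nn.SiLU(),
            nn.Linear(256, feat_dim),
        )
        # Decodes deformed features into explicit Gaussian deltas:
        # 3 (position) + 4 (rotation quaternion) + 3 (scale) + 3 (color) = 13.
        self.gaussian_head = nn.Linear(feat_dim, 13)

    def forward(self, feats: torch.Tensor, expr: torch.Tensor) -> torch.Tensor:
        # feats: (N, feat_dim) per-Gaussian features; expr: (expr_dim,) expression code.
        expr = expr.unsqueeze(0).expand(feats.shape[0], -1)
        deformed = feats + self.offset_mlp(torch.cat([feats, expr], dim=-1))
        return self.gaussian_head(deformed)  # (N, 13) deltas applied to canonical Gaussians

deltas = FeatureSpaceDeformer()(torch.randn(1000, 128), torch.randn(64))
```

The appeal of working in feature space is that a small expression change can produce rich, correlated changes across position, scale, and color once decoded, which is harder to achieve by regressing xyz offsets directly.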
In the realm of image synthesis, IIT, National Centre for Scientific Research “Demokritos” and University of West Attica present REGLUE Your Latents with Global and Local Semantics for Entangled Diffusion. REGLUE enhances latent diffusion models by incorporating both global and local semantic information from Vision Foundation Models, significantly improving image quality and accelerating training convergence. This highlights the power of enriched contextual understanding in the latent space.
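As a rough illustration of the idea (not REGLUE's actual architecture), the sketch below fuses pooled global semantics and per-patch local semantics from a Vision Foundation Model into VAE latents before they reach the diffusion backbone; the projection sizes, single-layer VFM features, and additive fusion are all simplifying assumptions.

```python
# Hedged sketch: one way to entangle global and local VFM semantics with VAE
# latents. Dimensions and the fusion-by-addition choice are illustrative.
import torch
import torch.nn as nn

class SemanticFusion(nn.Module):
    def __init__(self, vfm_dim: int = 768, latent_ch: int = 4, grid: int = 32):
        super().__init__()
        self.grid = grid
        self.local_proj = nn.Linear(vfm_dim, latent_ch)   # per-patch (local) semantics
        self.global_proj = nn.Linear(vfm_dim, latent_ch)  # pooled (global) semantics

    def forward(self, z: torch.Tensor, vfm_tokens: torch.Tensor) -> torch.Tensor:
        # z: (B, latent_ch, grid, grid) VAE latents; vfm_tokens: (B, P, vfm_dim) patch tokens.
        B, P, _ = vfm_tokens.shape
        local = self.local_proj(vfm_tokens)                       # (B, P, latent_ch)
        side = int(P ** 0.5)
        local = local.transpose(1, 2).reshape(B, -1, side, side)  # back to a spatial map
        local = nn.functional.interpolate(local, size=(self.grid, self.grid), mode="bilinear")
        glob = self.global_proj(vfm_tokens.mean(dim=1))           # (B, latent_ch)
        return z + local + glob[:, :, None, None]                 # semantics-enriched latents

fused = SemanticFusion()(torch.randn(2, 4, 32, 32), torch.randn(2, 256, 768))
```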
For long-form content like video, the challenge of maintaining temporal and geometric consistency is paramount. ´Ecole Polytechnique F´ed´erale de Lausanne (EPFL) tackles this with Factorized Video Generation: Decoupling Scene Construction and Temporal Synthesis in Text-to-Video Diffusion Models. They show that by factoring video generation into reasoning, composition, and animation stages, they achieve superior results on benchmarks and dramatically reduce sampling steps through visual anchoring. Similarly, The Chinese University of Hong Kong and ByteDance introduce Resampling Forcing for End-to-End Training for Autoregressive Video Diffusion via Self-Resampling. This teacher-free framework reduces exposure bias and improves temporal consistency in long videos by simulating inference-time errors and using a novel history routing mechanism.
Beyond visual generation, diffusion models are enhancing practical applications. For robotics, Örebro University (ORU) leverages them for Single-View Shape Completion for Robotic Grasping in Clutter, improving grasp success rates by reconstructing complete object shapes from partial observations. In a similar vein, researchers from the Institute of Robotics and Artificial Intelligence, University X introduce CoVAR: Co-generation of Video and Action for Robotic Manipulation via Multi-Modal Diffusion, a multi-modal diffusion model that co-generates video and action sequences, leading to more natural and effective robot control. Protecting intellectual property and ensuring safe use are also critical. The Hong Kong Polytechnic University proposes DeContext as Defense: Safe Image Editing in Diffusion Transformers, a novel defense mechanism that prevents unauthorized image editing by disrupting attention flow without compromising visual quality.
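For a sense of how protective perturbations of this kind are typically computed, here is a generic PGD-style sketch, not DeContext's exact objective; the surrogate score function is a stand-in for reading multi-modal attention maps out of a DiT editor.

```python
# Hedged sketch: optimize a small, bounded image perturbation that suppresses a
# surrogate "attention" score, so edit instructions lose their grip on the image.
import torch

def protect(image: torch.Tensor, attn_score_fn, eps=8/255, step=1/255, iters=50):
    delta = torch.zeros_like(image, requires_grad=True)
    for _ in range(iters):
        loss = attn_score_fn(image + delta)   # scalar score to be pushed down
        loss.backward()
        with torch.no_grad():
            delta -= step * delta.grad.sign()                   # descend on the score
            delta.clamp_(-eps, eps)                             # imperceptibility budget
            delta.copy_((image + delta).clamp(0, 1) - image)    # keep pixels valid
        delta.grad.zero_()
    return (image + delta).detach()

# Toy usage: in practice attn_score_fn would come from the editing model itself.
img = torch.rand(1, 3, 64, 64)
protected = protect(img, lambda x: x.mean(), iters=10)
```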
Under the Hood: Models, Datasets, & Benchmarks
These advancements are underpinned by novel architectural designs, clever training strategies, and new benchmarks:
- REGLUE (https://arxiv.org/pdf/2512.16636): Introduces a lightweight semantic compressor to integrate multi-layer VFM features with VAE latents, leveraging the SiT backbone for improved image synthesis. Code is available at https://github.com/giorgospets/reglue.
- Yuan-TecSwin (https://arxiv.org/pdf/2512.16586): A text-conditioned diffusion model from Google Research that incorporates Swin-Transformer blocks for enhanced text-to-image generation and contextual understanding.
- DeContext (https://arxiv.org/pdf/2512.16625): Targets multi-modal attention mechanisms in DiT-based models (like Flux Kontext and Step1X-Edit) with subtle perturbations for robust defense. Code is available at https://github.com/LinghuiiShen/DeContext.
- GMODiff (https://arxiv.org/pdf/2512.16357): Reframes HDR reconstruction as a gain map refinement task, leveraging pre-trained Latent Diffusion Models (LDMs) to achieve superior visual quality and inference efficiency. Authored by researchers from Northwestern Polytechnical University and Xi’an University of Architecture and Technology.
- FOD-Diff (https://arxiv.org/pdf/2512.16075): The first diffusion model for Fiber Orientation Distribution (FOD) prediction, employing a 3D multi-channel patch learning strategy and a spherical harmonic attention module for medical imaging. From H. Tang et al.
- OUSAC (https://arxiv.org/pdf/2512.14096): From the University of Georgia, this framework optimizes guidance scheduling and adaptively caches features for DiT acceleration, achieving up to 60% cost savings and improved FID scores. It uses evolutionary optimization over a hybrid discrete-continuous space; a generic sketch of the scheduling and caching ingredients follows this list.
- Qwen-Image-Layered (https://arxiv.org/pdf/2512.15603): An end-to-end diffusion model from HKUST(GZ) and Alibaba that decomposes RGB images into semantically disentangled RGBA layers using an RGBA-VAE, VLD-MMDiT architecture, and multi-stage training. Code is available at https://github.com/QwenLM/Qwen-Image-Layered.
- GRAN-TED (https://arxiv.org/pdf/2512.15560): Introduced by Peking University and Kuaishou Technology, this work proposes a two-stage training paradigm for diffusion-model text encoders, along with TED-6K, a text-only benchmark for efficient encoder evaluation. Code is at https://anonymous.4open.science/r/GRAN-TED-4FCC/.
- StructDiff (https://arxiv.org/pdf/2503.09560): A structure-aware diffusion model from Southeast University and Shanghai AI Laboratory for 3D fine-grained medical image synthesis, incorporating paired image–mask templates and Confidence-aware Adaptive Learning.
- DDMS (https://arxiv.org/pdf/2404.10512): A deep diffusion model from Harbin Institute of Technology for four-hour thunderstorm nowcasting, utilizing geostationary satellite brightness temperature data. Code is available at https://github.com/bigfeetsmalltone/DDMS.
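As a generic illustration of the two ingredients behind OUSAC-style acceleration, the sketch below combines a time-varying classifier-free-guidance scale with a block wrapper that reuses its cached output when its input barely changes between denoising steps; the schedule shape and the caching threshold are illustrative choices, not the paper's optimized result.

```python
# Hedged sketch: (1) a guidance scale that varies over the sampling trajectory
# instead of staying constant, and (2) naive feature caching for a DiT block.
import math
import torch

def guidance_scale(t: float, w_min: float = 1.5, w_max: float = 7.5) -> float:
    # t in [0, 1], with 1 = start of sampling (high noise). One illustrative
    # schedule that ramps guidance up as denoising proceeds.
    # Used as: eps = eps_uncond + w * (eps_cond - eps_uncond).
    return w_min + (w_max - w_min) * 0.5 * (1 - math.cos(math.pi * (1 - t)))

class CachedBlock(torch.nn.Module):
    def __init__(self, block: torch.nn.Module, tol: float = 1e-2):
        super().__init__()
        self.block, self.tol = block, tol
        self._in, self._out = None, None

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # If the input moved less than `tol` (relative L2) since the last call,
        # skip the block and return the cached output.
        if self._in is not None and (x - self._in).norm() / x.norm() < self.tol:
            return self._out
        self._in, self._out = x.detach(), self.block(x).detach()
        return self._out

# Usage with a stand-in block:
blk = CachedBlock(torch.nn.Linear(16, 16))
x = torch.randn(1, 16)
y = blk(x)                  # computed and cached
y_again = blk(x + 1e-4)     # nearly identical input: cached output is reused
w = guidance_scale(t=0.3)   # guidance weight for a later denoising step
```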
Impact & The Road Ahead
The advancements highlighted in these papers signify a pivotal moment for diffusion models. We are moving beyond mere image generation towards highly controllable, efficient, and robust AI systems capable of tackling complex, real-world problems. The ability to disentangle components like pose and expression in avatars or scene construction and temporal synthesis in videos opens doors for vastly more intuitive and powerful creative tools.
From enhanced privacy protection in sensitive data like infant videos with methods like BLANKET (https://arxiv.org/pdf/2512.15542, from the Czech Technical University in Prague), to automated drug discovery with agentic reasoning systems like StructBioReasoner (https://arxiv.org/pdf/2512.15930, by Argonne National Laboratory and the University of Chicago), diffusion models are broadening their impact. Their application in control engineering, as seen in Generative design of stabilizing controllers with diffusion models: the Youla approach (https://arxiv.org/pdf/2512.15725, from the University of Technology Sydney), and in multi-objective optimization with Preference-Guided Diffusion (https://arxiv.org/pdf/2503.17299, from TU Munich and Stanford University), hints at their potential to revolutionize scientific and engineering disciplines.
Even foundational theoretical work, like A Unification of Discrete, Gaussian, and Simplicial Diffusion (https://arxiv.org/pdf/2512.15923, from New York University), promises to streamline future research by providing a unified framework for diverse diffusion methods. As models become more efficient, robust, and controllable, we can expect to see them integrated into a wider array of applications, from safer AI systems and interactive virtual worlds with WorldPlay (https://arxiv.org/pdf/2512.14614, by the Hong Kong University of Science and Technology), to better medical diagnostics and sustainable urban planning with Generative Urban Flow Modeling (https://arxiv.org/pdf/2512.14725, from Universidad Politécnica de Madrid). The future of AI, powered by increasingly sophisticated diffusion models, looks incredibly bright and full of creative possibilities!