Diffusion Models: Pioneering the Next Generation of AI Perception and Creation
Latest 100 papers on diffusion models: May 9, 2026
Diffusion models have rapidly ascended to the forefront of AI research, transforming how we approach generative tasks and pushing the boundaries of what’s possible in image, video, and even scientific data synthesis. This wave of innovation addresses long-standing challenges in fidelity, control, efficiency, and real-world applicability. Let’s dive into some of the latest breakthroughs that are shaping the future of AI/ML.
The Big Idea(s) & Core Innovations
The latest research showcases a clear trend: moving beyond raw generative power to achieving precise control, improving efficiency, and expanding applicability across diverse domains, often by rethinking fundamental assumptions of diffusion. For instance, the challenge of generating rare but valid compositions, where models tend to default to common patterns, is tackled by DCR: Counterfactual Attractor Guidance for Rare Compositional Generation from University of Maryland at College Park. This training-free method uses counterfactual attractor guidance to suppress default biases, ensuring rare but semantically correct outputs. Similarly, in video generation, preserving temporal consistency and extending video length without quality degradation is critical. FreeSpec: Training-Free Long Video Generation via Singular-Spectrum Reconstruction by National University of Defense Technology addresses spectral concentration in self-attention windows that causes blurring and repetitive motion, using SVD to preserve high-rank local variations. Complementing this, Eulerian Motion Guidance: Robust Image Animation via Bidirectional Geometric Consistency from National University of Singapore replaces traditional Lagrangian optical flow with an Eulerian alternative and Bidirectional Geometric Consistency to bound error accumulation, leading to more stable long-horizon video generation.
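The singular-spectrum idea behind FreeSpec can be illustrated generically: when energy concentrates in the top singular directions of a local feature window, detail collapses, so the tail of the spectrum is preserved or re-amplified. The sketch below does this with a plain NumPy SVD; the function name and `tail_gain` parameter are hypothetical, and the actual method operates inside self-attention windows rather than on raw matrices.

```python
import numpy as np

def spectral_rebalance(window, k, tail_gain=1.0):
    """Rebalance the singular spectrum of a local feature window.

    Spectral concentration (energy collapsing into the top-k singular
    directions) is associated with blur and repetitive motion; here we
    re-amplify the tail of the spectrum. A hypothetical illustration of
    the general idea, not the FreeSpec algorithm.
    """
    U, s, Vt = np.linalg.svd(window, full_matrices=False)
    s = s.copy()
    s[k:] *= tail_gain          # boost (or keep) high-rank local variations
    return U @ np.diag(s) @ Vt

# With tail_gain=1.0 the SVD round-trip reconstructs the window exactly.
W = np.arange(12, dtype=float).reshape(3, 4)
assert np.allclose(spectral_rebalance(W, 1, 1.0), W)
```

Because the rebalancing is a post-hoc operation on existing features, sketches like this stay training-free, which is the property FreeSpec emphasizes.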
Beyond visual aesthetics, control and reliability are paramount for real-world integration. In multi-reward reinforcement learning for diffusion models, MARBLE: Multi-Aspect Reward Balance for Diffusion RL by Zhejiang University introduces a gradient-space optimization framework. It tackles the ‘specialist sample phenomenon’ where scalar reward aggregation leads to conflicting gradients, achieving simultaneous improvements across multiple reward dimensions without manual tuning. For robotics, the debate on what makes a useful latent space is settled by Reconstruction or Semantics? What Makes a Latent Space Useful for Robotic World Models from Mila – Quebec AI Institute. It demonstrates that semantic latent spaces (e.g., V-JEPA) consistently outperform reconstruction-aligned ones for policy-relevant tasks, even if pixel metrics are lower, by better preserving action-relevant structure. This emphasis on semantic control is echoed in EA-WM: Event-Aware Generative World Model with Structured Kinematic-to-Visual Action Fields by Fudan University, which projects robot actions into camera-aligned visual action fields (KVAFs) to guide video generation, effectively bridging the domain gap between abstract actions and video synthesis.
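The "specialist sample" failure mode MARBLE targets arises when reward gradients point in conflicting directions and a scalar sum lets one reward dominate. A well-known generic remedy in gradient space is PCGrad-style projection: drop the component of one gradient that opposes another. The sketch below is that generic technique, labeled as such; MARBLE's actual objective may differ.

```python
import numpy as np

def project_conflicts(grads):
    """PCGrad-style pairwise projection: when two reward gradients
    conflict (negative dot product), remove the conflicting component
    before averaging. A generic sketch of gradient-space multi-reward
    balancing, not the MARBLE algorithm itself."""
    out = [g.astype(float).copy() for g in grads]
    for gi in out:
        for gj in grads:
            dot = gi @ gj
            if dot < 0:                      # conflicting pair
                gi -= dot / (gj @ gj) * gj   # project out the conflict
    return np.mean(out, axis=0)              # balanced update direction

g1 = np.array([1.0, 0.0])    # gradient of reward A
g2 = np.array([-1.0, 1.0])   # gradient of reward B (conflicts with A)
update = project_conflicts([g1, g2])
# the combined update no longer opposes either reward
assert update @ g1 >= 0 and update @ g2 >= 0
```

The point of working in gradient space, rather than tuning scalar reward weights, is exactly this: the update direction can be made non-opposing to every reward simultaneously, without manual per-reward coefficients.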
The push for efficiency is profound. Continuous-Time Distribution Matching for Few-Step Diffusion Distillation from Nankai University redefines distillation by migrating discrete-time DMD to continuous optimization, achieving state-of-the-art 4-step image generation without adversarial training. Similarly, CM3D-AD: Two Steps Are All You Need by R.V. College of Engineering reformulates 3D point cloud anomaly detection as a single-step manifold projection problem using consistency models, achieving 80x faster inference than diffusion-based methods. SOWing Information: Cultivating Contextual Coherence with MLLMs in Image Generation from Wuhan University leverages MLLMs for Selective One-Way Diffusion, controlling information flow in a training-free manner to prevent undesired interference between image regions, dramatically improving condition consistency and speed. WaDiGAN-SR: A Wavelet Diffusion GAN for Image Super-Resolution by Sapienza University of Rome combines Discrete Wavelet Transform with Diffusion GANs for real-time super-resolution in just 2 timesteps.
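One reason wavelet-space pipelines like WaDiGAN-SR are fast is structural: a single Haar decomposition splits an image into four sub-bands at a quarter of the spatial resolution each, so every generative step operates on much smaller tensors. The from-scratch sketch below implements one level of the 2D Haar transform and its exact inverse; it illustrates the sub-band split only and is not the WaDiGAN-SR code.

```python
import numpy as np

def haar_dwt2(x):
    """One level of a 2D Haar transform: image -> (LL, LH, HL, HH),
    each sub-band at half the resolution per axis."""
    a, b = x[0::2, 0::2], x[0::2, 1::2]
    c, d = x[1::2, 0::2], x[1::2, 1::2]
    return ((a + b + c + d) / 2, (a + b - c - d) / 2,
            (a - b + c - d) / 2, (a - b - c + d) / 2)

def haar_idwt2(ll, lh, hl, hh):
    """Exact inverse of haar_dwt2."""
    h, w = ll.shape
    x = np.empty((2 * h, 2 * w))
    x[0::2, 0::2] = (ll + lh + hl + hh) / 2
    x[0::2, 1::2] = (ll + lh - hl - hh) / 2
    x[1::2, 0::2] = (ll - lh + hl - hh) / 2
    x[1::2, 1::2] = (ll - lh - hl + hh) / 2
    return x

# the transform is lossless: analysis followed by synthesis is identity
img = np.arange(16, dtype=float).reshape(4, 4)
assert np.allclose(haar_idwt2(*haar_dwt2(img)), img)
```

Because the transform is invertible and lossless, a generator can denoise in sub-band space and recover full-resolution output with a single inverse transform at the end.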
Theoretical underpinnings are also advancing. The Interplay of Data Structure and Imbalance in the Learning Dynamics of Diffusion Models from Chalmers University of Technology reveals class variance as the primary determinant of learning order, offering insights into generalization. Expressivity of Bi-Lipschitz Normalizing Flows: A Score-Based Diffusion Perspective by University of Bremen theoretically connects bi-Lipschitz flows and diffusion models, showing that expressivity limitations stem from uniform Lipschitz bounds, not bi-Lipschitz regularity itself.
New applications are emerging in scientific and industrial domains. Diffusion model for SU(N) gauge theories by ETH Zurich applies score-matching to lattice gauge theories, demonstrating successful sampling competitive with Hybrid Monte Carlo. Bayesian Rain Field Reconstruction using Commercial Microwave Links and Diffusion Model Priors from Ecole Polytechnique utilizes diffusion models as expressive spatial priors for reconstructing rain fields, outperforming traditional methods. GRIFDIR: Graph Resolution-Invariant FEM Diffusion Models in Function Spaces over Irregular Domains by University of Cambridge introduces FEM convolutions for resolution-invariant graph diffusion models, handling unstructured meshes and complex geometries, crucial for scientific machine learning. PODiff: Latent Diffusion in Proper Orthogonal Decomposition Space for Scientific Super-Resolution from University of Western Australia performs diffusion in POD coefficient space for probabilistic super-resolution, achieving comparable accuracy with 165x fewer parameters and analytic uncertainty propagation. Stochastic Schrödinger Diffusion Models for Pure-State Ensemble Generation from RIKEN iTHEMS adapts score-based diffusion to quantum pure-state ensembles on complex projective manifolds, a groundbreaking step for quantum machine learning.
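The parameter savings PODiff reports come from where the diffusion runs: a rank-r POD basis lets r coefficients stand in for an entire discretized field, so the generative model works in an r-dimensional space instead of pixel space. The sketch below builds such a basis with a plain SVD and projects snapshots into coefficient space; it is a generic POD illustration under that assumption, not the PODiff implementation.

```python
import numpy as np

def pod_project(snapshots, r):
    """Build a rank-r POD basis from an (n_points x n_snapshots) matrix
    and return the basis plus each snapshot's r coefficients. A generic
    sketch of the POD step, not the PODiff code: diffusion would then
    model the coefficient vectors instead of full fields."""
    basis, _, _ = np.linalg.svd(snapshots, full_matrices=False)
    basis = basis[:, :r]                # dominant spatial modes
    coeffs = basis.T @ snapshots        # r coefficients per snapshot
    return basis, coeffs

# fields lying in an r-dimensional subspace are recovered exactly
rng = np.random.default_rng(0)
fields = rng.standard_normal((100, 2)) @ rng.standard_normal((2, 10))
basis, coeffs = pod_project(fields, 2)
assert np.allclose(basis @ coeffs, fields)
```

Generating in coefficient space also makes uncertainty propagation tractable: a distribution over r coefficients maps linearly through the basis to a distribution over the full field.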
Under the Hood: Models, Datasets, & Benchmarks
The papers introduce or heavily rely on a rich ecosystem of models, datasets, and benchmarks to validate their innovations:
- Relit-LiVE: Uses MIT multi-illumination, DL3DV, SpatialVID-HQ, RemovalBench, SOBAv2, Pexels datasets. Code: https://github.com/zhuxing0/Relit-LiVE
- DCR: Evaluated on Mochi, HunyuanVideo, CogVideoX backbones. Mochi preview: https://huggingface.co/genmo/mochi-1-preview
- FreeSpec: Uses Wan2.1 and LTX-Video. Introduces VBench-Long dataset. Project demo: https://fdchen24.github.io/FreeSpec-Website/
- MARBLE: Stable Diffusion 3.5 Medium, PickScore, HPSv2, CLIPScore, GenEval, Aesthetic Score, ImageReward, UniReward. Code: https://github.com/canyu-zhao/marble
- Reconstruction or Semantics?: Uses Bridge V2, SOAR datasets. HuggingFace model: https://huggingface.co/Nilaksh404/semantic-wm. Project: https://hskalin.github.io/semantic-wm
- Continuous-Time Distribution Matching: SD3-Medium, Longcat-Image. Project: https://byliutao.github.io/cdm_page/. Code: https://github.com/byliutao/cdm
- The Interplay of Data Structure and Imbalance: Fashion MNIST. Dataset: https://github.com/zalandoresearch/fashion-mnist
- Eulerian Motion Guidance: Stable Video Diffusion (SVD), RAFT, WebVid-10M.
- EA-WM: WorldArena benchmark, RoboTwin dataset, Wan2.2. Code: https://github.com/Wan-Video/Wan2.2
- Diffusion model for SU(N) gauge theories: SU(3) lattice gauge configurations. Code: https://github.com/jkomijani/lattice_ml, https://github.com/jkomijani/normflow
- Free Decompression: Open-source software freealg.
- Bayesian Rain Field Reconstruction: OpenMRG dataset. Code: https://github.com/lifalb/rainfield-dm, https://github.com/haihabi/PyNNcml
- Conditional Diffusion Under Linear Constraints: CelebA-HQ, LSUN Church, ImageNet. Code: https://github.com/ahmad-aghapour/lcdm
- CM3D-AD: Anomaly-ShapeNet, Real3D-AD, ShapeNetCore.v2.
- MidSteer: Alpaca, LAION-5B.
- The Illusion of Forgetting: Stable Diffusion v1.4, NudeNet, DeepSeek v3.2.3. Code to be released.
- Posterior Inference in Latent Space: Rastrigin, Ackley, Rosenbrock (200D), RoverPlanning (60D), HalfCheetah (102D), Mopta (124D), LassoBench DNA (180D). Code: https://github.com/GFNOrg/gfn-diffusion, https://github.com/LeoIV/BAxUS, https://github.com/ksehic/LassoBench, https://github.com/rtqichen/torchdiffeq.
- SOWing Information: Aesthetic-4K. Code: https://github.com/ShivamShrirao/diffusers/tree/main/examples/dreambooth, project page: https://pyh-129.github.io/SOW/
- WaDiGAN-SR: CelebA-HQ, Shipspotting. Code: https://www.github.com/aloilor/WaDiGAN-SR
- D-OPSD: DreamBooth, in-house anime dataset. Project page: https://vvvvvjdy.github.io/d-opsd
- Computer-Aided Design Generation: DeepCAD dataset. Code to be released.
- Local Intrinsic Dimension Unveils Hallucinations: EDM, 11kHands, AFHQV2, FFHQ, RSNA, GaussianGrid, SimpleShapes.
- Concurrence of Symmetry Breaking and Nonlocality Phase Transitions: DiT-XL, Stable Diffusion 3 Medium. Code: https://github.com/zsxqblz/symmetry_nonlocality_transition
- Advancing Aesthetic Image Generation via Composition Transfer: Text-to-image-2M. Code not provided.
- From Diffusion to Rectified Flow: Stable Diffusion v1.5, SAM, CLIP ViT-L/14. Code not provided.
- SAMIC: LSDIR, Kodak, CLIC2020, UHD. Code: https://github.com/Jasmine-aiq/SAMIC
- Efficient Geometry-Controlled High-Resolution Satellite Image Synthesis: GeoSynth, Open Street Map. Code: https://github.com/Vladimirescu/EfficientGeometrySatelliteSynthesis
- Stage-adaptive audio diffusion modeling: AudioSet, FreeSound, VCTK, AudioCaps, USAD encoder.
- Towards General Preference Alignment: Stable Diffusion 1.5, SDXL, Pick-a-Pic, Parti-Prompts, HPSV2.
- Stream-T1: Wan2.1-T2V-14B. Project: https://stream-t1.github.io/
- Leveraging Pretrained Language Models as Energy Functions: OpenWebText, FLAN-T5, UL2, T5-Gemma 2B-2B.
- ArtiFixer: DL3DV-10K, Mip-NeRF 360, Nerfbusters, Wan 2.1 T2V-14B. Code: https://research.nvidia.com/labs/sil/projects/artifixer
- Large Language Models are Universal Reasoners: SANA, SigLIP 2, Qwen LLM, GenEval, DPG-Bench. Code not provided.
- Flow Sampling: SPICE, GEOM-DRUGS, eSEN force field. PyTorch pseudocode provided.
- Stream-R1: Wan2.1-T2V-1.3B, Wan2.1-T2V-14B, VBench, VidProM, MovieGen Video Bench. Project: https://stream-r1.github.io
- Towards accurate extreme event likelihoods: cBottle, ERA5. Code: https://github.com/NVlabs/cBottle/tree/main
- GeoTopoDiff: PTFE, Fontainebleau sandstone datasets.
- AHPA: SD-VAE. Code: https://github.com/UIOSN/AHPA
- Information Theory and Statistical Learning: DDPM-CIFAR10-32.
- AsymK-Talker: HDTF, VFHQ, AVSpeech, OpenHumanVid, TalkVid, Wan2.1-T2V-1.3B.
- Memorization In Stable Diffusion: Webster 500, Membench, LAION, MS COCO, Lexica Art.
- Metadata, Wavelet, and Time Aware Diffusion Models: fMoW, Sentinel2-fMoW. Code: https://github.com/LuigiSigillo/MWT-Diff
- Quaternion Wavelet-Conditioned Diffusion Models: DIV2K, Flickr2K, OST, RealSR, DRealSR, ShipSpotting, Stable Diffusion. Code: https://www.github.com/Fascetta/ResQu
- Active Sampling for Ultra-Low-Bit-Rate Video Compression: UVG, MCL-JCV. Code: DaS diffusion model (https://github.com/guzekai/DaS).
- TOC-SR: DIV2K, LS-DIR, FFHQ, RealSR, DRealSR, FLUX model family VAE.
- Anon: ImageNet, CIFAR-10, GPT. Code: https://anonymous.4open.science/r/Anon-6511/
- CBV: MSCOCO, VQA v2, CLIP, SD v1.4, LLaVA-v1.5-7B, MiniGPT-v2, InstructBLIP-7B, Qwen-2.5-VL-7B.
- SlimDiffSR: AID, NWPU-RESISC45, DOTA, DIOR, Real-RefRSSRD. Code: https://github.com/wwangcece/SlimDiffSR
- Exploring Data-Free LoRA Transferability: Wan2.1-T2V-1.3B, FastWan2.1-T2V-1.3B, Rolling Forcing, HunyuanVideo-1.5, Steamboat-Willie-1.3B, Jinx-v2, Film-Noir, Retro-Anime LoRAs. Code: https://github.com/Noahwangyuchen/CASA
- Skipping the Zeros: Calorimeter images, Tabula Muris scRNA, Human Lung Pulmonary Fibrosis scRNA, MNIST, Fashion-MNIST. Code: https://github.com/PhilSid/sparsity-exploring-diffusion
- SteeringDiffusion: ArtBench-10, Stable Diffusion 1.5, SDXL, CLIP ViT-L/14.
- Unifying Deep Stochastic Processes: FFHQ, DIV2K, LOL, ImageNet, Rain1400. ItoVision PyTorch library.
- SwiftPie: DreamBench, DreamBench++, DreamCache.
- Action Agent: RECON, Unitree G1 episodes. Code not provided.
- SRGAN-CKAN: DIV2K. Code: https://github.com/RINavarro/SRGAN-CKAN
- SIFT-VTON: VITON-HD. Code: https://github.com/takesukeDS/SIFT-VTON
- Local Intrinsic Dimension Unveils Hallucinations: IDR benchmark, Fashion-MNIST, CIFAR-10, SVHN, LAION-Aesthetics, Stable Diffusion v1.5.
- Visual Implicit Autoregressive Modeling: ImageNet 256×256. Code: https://github.com/mobiushy/VIAR
- Phase-map synthesis from magnitude-only MR images: fastMRI. Code: https://github.com/BerkMSahin/phase-map-synthesis-with-SBDM
- Disciplined Diffusion: Stable Diffusion V1.4, I2P, SneakyPrompt, NSFW-56k, COCO-25k, NudeNet.
- Watch Your Step: CIFAR-10, MNIST, Fashion-MNIST, CelebA-HQ, COCO, Stable Diffusion v1.5.
- Towards High Fidelity Face Swapping: CASIA FaceSwapping. Code: https://github.com/CASIA-NLPRAI/face-swapping-survey
- DissolveStereo: DynamiCrafter, DepthCrafter. Code: https://github.com/shijianjian/DissolveStereo
- Repurposing Image Diffusion Models for Adversarial Synthetic Structured Data: UCI Adult Income, Stable Diffusion v1.5 U-Net, SDMetrics.
- UniVidX: Wan2.1-T2V-14B, VideoMatte240K, InteriorVid. Project: https://houyuanchen111.github.io/UniVidX.github.io/
- Colorful-Noise: Aesthetic-4K. Code: https://github.com/nadavc220/colorful-noise, project: https://nadavc220.github.io/colorful-noise/
- Trees to Flows and Back: No specific datasets/code mentioned but focuses on gradient boosting and flow matching.
- When Do Diffusion Models learn to Generate Multiple Objects?: MOSAIC dataset (newly introduced), LAION-2B, GenEval, SPEC, COMFORT. Code: https://github.com/eugene6923/MOSAIC.git
- Information-geometric adaptive sampling: QM9, ZINC250k, Planar, SBM, Ego-small. GRUM, GDSS models.
- Consistent Diffusion Language Models: OpenWebText (Gokaslan & Cohen).
- Diffusion Models are Secretly Zero-Shot 3DGS Harmonizers: Stable Diffusion 2.1, IC-Light, DN-Splatter, splatfacto. Code: based on the nerfstudio framework.
- CollaFuse: CelebA, CIFAR-10, Animals-with-Attributes2 (Awa2). Code: https://github.com/SimeonAllmendinger/collafuse.
- PhyCo: Physics-IQ benchmark, Kubric, Qwen2.5-VL-3B. Project: https://phyco-video.github.io.
- AdvDMD: DPG-Bench, GenEval, Qwen-Image, SD3.5-medium, SD3-medium. Code: https://github.com/SJTU-DENG-Lab/AdvDMD.
- From LLM-Driven Trading Card Generation: Pokémon TCG Developer Portal API, Pokécardmaker, Magic Set Editor, Niji LoRA, Pokémon Ken Sugimori style LoRA. Code: https://github.com/JohannesPfau/generativePokemonTCG.
- Noise2Map: SpaceNet7, WHU Building Dataset, xView2, AID. Code: https://github.com/alishibli97/noise2map.
- ABC: CelebV-HQ, Sky-Timelapse, SEVIR VIL. Project: https://abc-diffusion.github.io/. Code: https://github.com/gabeguo/abc_diffusion.
- Simple Self-Conditioning Adaptation: OpenWebText, QM9, CIFAR-10, Species10 genome.
- VIPaint: EDM, Stable Diffusion 3.5, Latent Diffusion Models, LSUN-Church, ImageNet64, ImageNet256.
- Language Diffusion Models are Associative Memories: LM1B, GPT-2 tokenizer.
- Unified 4D World Action Modeling: Wan2.2-TI2V-5B, RoboCasa, RoboTwin 2.0. Project: https://sharinka0715.github.io/X-WAM/.
- Delta Score Matters!: Stable Diffusion 1.5, SDXL, SD3.5 Medium, CogVideoX-2B, ModelScope-1.7B, Pick-a-Pic, DrawBench, GenEval, MS-COCO 2017, ChronoMagic-Bench-150.
- ACPO: CIFAR-10, Anime-faces, LSUN Church, Visual Genome, AGIQA-1K, AGIQA-3K, AIGCIQA2023, AIGCQA-30K-Image, DiffusionDB, DrawBench, PartiPrompts.
- FlowS: Waymo Open Motion Dataset (WOMD).
- Data Balancing Strategies: Extensive review of methods.
- A Systematic Post-Train Framework: HPSv3, HPDv3++, Qwen3.5.
- Benchmarking Layout-Guided Diffusion Models: Flickr30k Entities. Code: https://github.com/lparolari/cobench.
- Edge-Cloud Collaborative Reconstruction: MSCM, UCMerced.
- The Thinking Pixel: ImageNet, GenEval, DPG benchmark, FrozenLake.
- Exploring Time Conditioning in Diffusion Generative Models: AFHQ-Cat, CelebA, CIFAR10, ImageNet. Code: https://github.com/liuzhuozheng-LI/time-uncond-diffusion.
- ResetEdit: Stable Diffusion SD-v2.1, PIE-Bench, ImageNet-R-TI2I, SD-Prompt.
- Generative diffusion models for spatiotemporal influenza forecasting: Influpaint. Code: https://github.com/ACCIDDA/Influpaint, https://github.com/cdcepi/Flusight-forecast-data, https://github.com/cdcepi/FluSight-forecast-hub, project: https://accidda.github.io/Influpaint/.
- VibeToken: ImageNet 256×256. Code: https://github.com/SonyResearch/VibeToken.
Impact & The Road Ahead
These advancements herald a new era for AI. The enhanced control over generative models, exemplified by MidSteer: Optimal Affine Framework for Steering Generative Models from Queen Mary University of London, which unifies concept erasure and switching, will lead to more aligned, safer, and user-friendly AI tools. The efficiency breakthroughs, like those in TOC-SR: Task-Optimal Compact Diffusion for Image Super Resolution by Samsung Research Institute Bangalore and VibeToken: Scaling 1D Image Tokenizers and Autoregressive Models from SonyAI, are making high-quality generation accessible for real-time applications and edge devices, democratizing powerful AI capabilities.
Critically, the research also highlights vulnerabilities and areas for deeper understanding. The Illusion of Forgetting: Attack Unlearned Diffusion via Initial Latent Variable Optimization from Chinese Academy of Sciences reveals that unlearning methods for diffusion models can be circumvented, prompting a re-evaluation of AI safety and privacy. Similarly, Memorization In Stable Diffusion Is Unexpectedly Driven by CLIP Embeddings by Yonsei University uncovers a surprising mechanism behind memorization, opening new avenues for robust model design. The findings from Understanding diffusion models requires rethinking (again) generalization from Inria call for new theoretical frameworks to truly grasp what these models learn before memorization sets in, especially in multi-object generation, as explored by When Do Diffusion Models learn to Generate Multiple Objects? from TU Darmstadt.
Looking forward, the integration of diffusion models with other modalities and paradigms, such as LLMs in Large Language Models are Universal Reasoners for Visual Generation by Johns Hopkins University, promises to unlock multimodal reasoning and more intelligent content creation. The application in scientific domains, from climate modeling in Towards accurate extreme event likelihoods from diffusion model climate emulators by NVIDIA to quantum machine learning in SSDMs, underscores their versatility. As researchers continue to unravel the theoretical underpinnings and develop practical, robust solutions, diffusion models are set to be a cornerstone for intelligent systems that can not only generate astonishing content but also understand, reason, and interact with the world in unprecedented ways.