Diffusion Models: Sculpting Reality from Pixels to Proteins, with Unprecedented Control and Efficiency
Latest 100 papers on diffusion models: Jul. 4, 2026
Diffusion models are rapidly evolving from remarkable image generators into versatile tools that work across diverse domains, from high-fidelity 3D scenes to complex molecular structures and even human interactions. Recent research shows a surge of innovations focused on refining control, boosting efficiency, and extending what these models can achieve, often without extensive retraining.
The Big Idea(s) & Core Innovations
One of the most profound shifts is the move towards explicit 3D and 4D generation with unprecedented fidelity. Papers like “Align4D: Alignment Is All You Need For X-to-4D Generation” by Qiaowei Miao and colleagues from Zhejiang University introduce a unified framework for arbitrary modality-to-4D synthesis, decoupling 3D geometry from temporal motion via novel alignment techniques. Similarly, Qualcomm AI Research’s “PixGS: Pixel-Space Diffusion for Direct 3D Gaussian Splat Generation” bypasses lossy latent compression, achieving high-quality 3D Gaussian Splatting (3DGS) generation in ~1 second. Further pushing this frontier, “Pano2World: End-to-End 3D Generation via Unified Multi-View Sequences” from Ke Holdings Inc. converts single indoor panoramas into explorable 3DGS scenes in minutes, using joint multi-view generation. For animated 3D humans, “PSHuman: Photorealistic Single-image 3D Human Reconstruction using Cross-Scale Multiview Diffusion and Explicit Remeshing” from HKUST reconstructs relightable avatars from a single image in under a minute.
Beyond 3D, the papers showcase remarkable advancements in control and safety. Imperial College London’s “EquiSteer: Cross-Attention Steering Towards a Fairer Text-Guided Image Generation” debiases text-to-image models in a training-free manner by steering cross-attention activations. Complementary work, “Concept Removal Guidance: Evidence-Calibrated Negative Guidance for Safe Diffusion Sampling” by Yoonseok Choi and colleagues from KAIST, adaptively calibrates negative guidance to suppress unsafe content. “ViDiT: Learn Once, Edit Anywhere: Visual Direction Transfer for Diffusion Models” by Virginia Tech learns continuous editing directions from visual examples, enabling zero-shot transfer of fine-grained semantic edits across domains. For critical applications, “No Prompt, No Leaks: A Robust Generative Steganography Framework via Prompt-Free Diffusion” from Xiangtan University eliminates text prompt vulnerabilities in steganography, using semantic priors for robust secret embedding.
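To make the idea of negative guidance concrete, here is a minimal sketch of one common formulation, in which a third noise prediction for the unwanted concept is subtracted from the usual classifier-free guidance update. It is a generic illustration, not the evidence-calibrated scheme from the Concept Removal Guidance paper; the function name `guided_noise` and the fixed weights `w_pos` and `w_neg` are assumptions made for this example (an adaptive method would vary the negative weight per step or per sample).

```python
import numpy as np

def guided_noise(eps_uncond, eps_cond, eps_neg, w_pos=7.5, w_neg=2.0):
    """Combine three noise predictions into one guided prediction.

    eps_uncond: prediction with an empty prompt
    eps_cond:   prediction with the user prompt
    eps_neg:    prediction with the concept to suppress
    w_pos, w_neg: hypothetical fixed guidance weights (illustrative only)
    """
    positive = w_pos * (eps_cond - eps_uncond)  # pull toward the prompt
    negative = w_neg * (eps_neg - eps_uncond)   # push away from the unwanted concept
    return eps_uncond + positive - negative

# Toy usage with random arrays standing in for denoiser outputs.
rng = np.random.default_rng(0)
e_u, e_c, e_n = (rng.normal(size=(4, 8, 8)) for _ in range(3))
eps = guided_noise(e_u, e_c, e_n)
```

With `w_neg=0` this reduces to ordinary classifier-free guidance, which is a convenient sanity check when experimenting with the negative term.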
Efficiency and utility are also paramount. “Accelerated Likelihood Maximization for Diffusion-based Versatile Content Generation” from Seoul National University introduces ALM, a training-free sampling strategy for inpainting, outpainting, and 3D texturing that achieves a 185x speedup. “Improved Immiscible Diffusion: Accelerate Diffusion Training by Reducing Its Miscibility” by UC Berkeley and UT Austin demonstrates up to 4x faster training across various models and tasks. Cornell University’s “Set Diffusion: Interpolating Token Orderings Between Autoregression and Diffusion for Fast and Flexible Decoding” achieves better speed-quality tradeoffs for language models and enables arbitrary-position infilling. In the medical domain, “Discrete Diffusion Language Models for Interactive Radiology Report Drafting” by Stanford and Ghent University shows that diffusion models can be 3.5-4.4x faster than autoregressive (AR) models for medical VQA while supporting interactive ‘any-order infill’.
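The "any-order infill" idea can be illustrated with a toy masked-diffusion decoding loop: tokens the user has already written stay fixed, and the masked positions are filled in over a few parallel steps, most confident first. This is only a sketch under stated assumptions. `MASK`, `toy_predict`, and `any_order_infill` are hypothetical names, the "denoiser" returns random probabilities rather than a trained model's output, and the confidence-based unmasking schedule is one common heuristic rather than the procedure of any paper listed here.

```python
import numpy as np

MASK = -1  # hypothetical id marking a position that still needs a token

def toy_predict(tokens, vocab_size, rng):
    """Stand-in for a trained denoiser: per-position probabilities over the vocabulary."""
    logits = rng.normal(size=(len(tokens), vocab_size))
    exp = np.exp(logits - logits.max(axis=1, keepdims=True))
    return exp / exp.sum(axis=1, keepdims=True)

def any_order_infill(tokens, vocab_size, steps=4, seed=0):
    """Fill masked positions in a few parallel steps, leaving known tokens untouched."""
    rng = np.random.default_rng(seed)
    tokens = np.array(tokens, dtype=int)
    for step in range(steps):
        masked = np.flatnonzero(tokens == MASK)
        if masked.size == 0:
            break
        probs = toy_predict(tokens, vocab_size, rng)
        confidence = probs[masked].max(axis=1)
        # Unmask an equal share of the remaining positions at each step.
        k = int(np.ceil(masked.size / (steps - step)))
        chosen = masked[np.argsort(-confidence)[:k]]
        tokens[chosen] = probs[chosen].argmax(axis=1)
    return tokens

# A draft with fixed tokens at positions 0 and 3 and gaps everywhere else.
draft = [5, MASK, MASK, 7, MASK]
filled = any_order_infill(draft, vocab_size=10)
```

Because several positions are committed per step, the number of model calls is bounded by `steps` rather than by sequence length, which is the intuition behind the speedups over token-by-token autoregressive decoding.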
Under the Hood: Models, Datasets, & Benchmarks
The innovations above are built upon or contribute to a rich ecosystem of models, datasets, and benchmarks:
- Align4D: Introduces the X4D dataset, the first quadruple dataset (prompt, image, video, 3D) for X-to-4D benchmarking, alongside the Consistent4D dataset. Project page: https://miaoqiaowei.github.io/Align4D/
- PointDiT: A minimalist pixel-space diffusion model directly on raw point maps, conditioned on DINOv3 features. Project page: https://haofeixu.github.io/pointdit
- Hierarchical Anti-Aesthetics (HAA): Evaluated on the CelebA-HQ and VGGFace2 datasets with SD v1.4, v1.5, v2.1, v3.0, and FLUX models.
- QWERTY: Leverages Wan 2.2 TI2V-5B and CogVideoX-I2V-5B backbones for training-free motion control. Uses VIPSeg and DL3DV datasets for evaluation. Code to be released.
- PixGS: Utilizes G-Objaverse and G-Objaverse-XL Alignment datasets, evaluated on T3Bench. Fast inference (~1s on A100).
- Set Diffusion: Achieves SOTA on GSM8K and ImageNet benchmarks. Code: https://github.com/kuleshov-group/setdlms
- ProSAC-CT: Evaluated on Mayo-2016, Mayo-2020, QIN-LUNG-CT, LoDoPaB datasets for low-dose CT denoising.
- InterCMDM: Uses Dual-Stream Causal Diffusion Transformer on InterHuman and Inter-X datasets.
- Discrete Diffusion Language Models for Radiology: Adapts DiffusionGemma-26B for medical VQA and introduces ‘any-order infill’. Code: https://github.com/mxvp/discrete_diffusion_RRG
- MapDreamer: Latent diffusion for lane-level map generation from aerial imagery, using UrbanLaneGraph and DINOv3 features.
- Generative AI and Federated Learning for IDS (Survey): Reviews generative models for intrusion detection, covering the NSL-KDD, UNSW-NB15, and CICIDS2017 datasets.
- ADMC: Attention-based diffusion model for missing modality completion, tested on IEMOCAP and MIntRec datasets.
- High-dimensional Embedding Prior (HEP): Plug-and-play framework for MRI reconstruction using fastMRI and SIAT datasets. Code: https://github.com/yqx7150/HEP-MRIRec
- Diffusion-GR2: Converts AR reasoning re-rankers to block-diffusion, achieving speedup on Amazon Review Beauty dataset.
- AVSR-Diff: Decoupled framework for arbitrary-scale video super-resolution, uses Temporally-Gated Feature Recurrence (TGFR) and Scale-Aware Fourier Refinement (SAFR). Project page: https://kaist-viclab.github.io/AVSR-Diff/
- TRCGL-Net: Conditional diffusion model for long-tailed multi-label chest X-ray classification, using PadChest dataset. Code: https://github.com/November-1113/TRCGL-Net
- Valdi: Combines latent diffusion with value-based Model Predictive Control for online planning, evaluated on CarRacing.
- Training-Free Debiasing (TES): Optimizes text embeddings with CLIP-based feedback, compatible with Stable Diffusion v1.5 and v2.1.
- Accelerating Discrete Diffusion Models: First parallel-in-time algorithm for τ-leaping methods. https://arxiv.org/pdf/2607.00773
- Not All Prediction Targets: Reveals x-prediction as key for manifold preservation in guidance, using a 143-species bird benchmark. Code: https://github.com/ManLuML/on-manifold-tfg
- AnF-DiffPET: CT-conditioned diffusion for low-dose PET denoising, using HECKTOR, PSMA, NaF, RIDER datasets.
- PAPA: Online personalized active preference alignment without reward models. Code: https://github.com/NasikNafi/papa
- The Illusion of High Utility in Safety Alignment: Introduces SAGE to mitigate ‘semantic collapse’, using TIFA benchmark for evaluation. Project page: https://adeelyousaf.github.io/SAGE_ECCV26_Project_Page/
- Vitality-Aware Compression: First compression for image-to-shape Diffusion Transformers, achieving 66% size reduction on Step1X-3D, Hunyuan3D 2.0, TRELLIS.
- RetailSMV: Introduces RetailSMV corpus (32,105 synchronized egocentric/exocentric retail videos) to study video world models trained on Cosmos3-Nano.
- DriftScope: A prompt-level diagnostic for detecting concept damage in adapted diffusion models. Project page: https://hyping111.github.io/DriftScope/
- Vertigo Vertigo: Experimental AI reconstruction of Hitchcock’s Vertigo using Wan 2.2 video diffusion model. Project page: https://www.adamcole.studio/work/vertigo-vertigo
- Multi-Embodiment Robotic Retargeting: Unified transformer-based diffusion for transferring human motions to heterogeneous robots like Unitree G1/H1, LimX Oli/Tron1, Boston Dynamics Atlas.
- Cross-Space Distillation: Introduces Bridge (Bφ) module for distilling knowledge from modern teachers (e.g., SD3.5, Flux) to compact students (SD1.5) with different latent spaces. https://arxiv.org/pdf/2606.32020
- Mesh BDF: Barycentric Dominance Field for native 3D mesh generation, integrated into VAE/Diffusion frameworks. Project page: https://gaochao-s.github.io/pages/MeshBDF/
- Look But Don’t Touch with Sparse Autoencoders: Proposes Patch Embedding Replacement (PER) for concept erasure in diffusion models, achieves 95.33% on UnlearnCanvas. Project page: https://eidoslab.github.io/PER/
- Histogram-constrained Image Generation (HIG): Training-free control for diffusion models (e.g., SDXL, FLUX.1[dev]). Project page: https://maps-research.github.io/hig/
- Introduction to SDEs for Generative ML: Theoretical framework unifying diffusion, score, and flow matching. Code: https://github.com/olewinther/generative-ode-sde
- Wavelet-Optimized Pseudo-3D Accelerated Diffusion: For truncated Computed Laminography, using ASTRA Toolbox for forward/backward projections.
- Patch-PODiff-ViT: Structured latent diffusion for super-resolution and uncertainty quantification on ROMS SST, NIH ChestX-ray14, FFHQ datasets.
- AC3S: Adaptive Conditioning for 3D-Aware Synthetic Data Generation, using ShapeNet, Objaverse, OmniObject3D and a multi-agent VLM framework. Project page: https://ac3s.cvmlgroup.web.illinois.edu/
- InfiniVerse: Unified pipeline for unbounded scene generation for autonomous driving, using Waymo Open Dataset, nuScenes, and CogvideoX1.5-5B. https://arxiv.org/pdf/2606.31109
- Efficient Sim-to-Real Transfer: Uses Cosmos Policy (video diffusion) with AnyTask and Isaac Gym for robotic manipulation on Franka Research 3 robot. https://arxiv.org/pdf/2606.31101
- Structure-Preserving Mean Flow (SPMF): For vascular image translation between NIRII and 2PF modalities, using NIR2PF and Fundus datasets. https://arxiv.org/pdf/2606.31095
- Omni-Flow: Distributed scheduling for multimodal inference built on SGLang, supporting LongCat-Next and HunyuanImage-3. Code: https://github.com/meituan-longcat/omni-flow.git
- Diffusion-Based Material Regularization: For physics-based inverse rendering, using Stanford-ORB, Synthetic4Relight, DTC-Synthetic datasets. https://arxiv.org/pdf/2606.31065
- OTCache: Optimal Transport for Geometry-Aware Caching in Diffusion Models, accelerating FLUX.1 [dev], Qwen-Image, HunyuanVideo. Code: https://github.com/UnicomAI/OTCache
- WarpI2I: Saliency-guided image warping for I2I translation with FLUX-based synthetic data on human relighting, driving scenes. Project page: https://shenzheng2000.github.io/WarpI2I.github.io/
- PhotoQuilt: Training-free arbitrary-resolution photomosaics, compatible with SD2.1, FLUX.1 & FLUX.2. Project page: https://kooroshrh.github.io/photo-quilt/
- Quality-Aware Modulation (QRM): Lightweight transformer module for SD3.5, trained with Reward Feedback Learning. https://arxiv.org/pdf/2606.30934
- Gradient Smoothing: Optimization paradigm for various architectures including diffusion models. Code: https://github.com/sugolov/gradient-smoothing
- Unsupervised Thermodynamics of Molecular Diffusion Models: Theoretical framework for molecular diffusion models, applied to protein-ligand mutations. https://arxiv.org/pdf/2606.30687
- µFlow: One-class deepfake detector trained on FFHQ (real images), tested on WILD dataset (19 unseen generators). https://arxiv.org/pdf/2606.30528
- Non-parametric recovery of causal diffusion mechanisms: Theoretical work for learning causal dynamics from steady-state observations. arXiv:2606.30467
- Beyond Point Estimates for Glaucoma Visual Field Forecasting: Diffusion models for probabilistic VF forecasting on UWHVF and ICVF datasets. https://arxiv.org/pdf/2606.30417
- Diffusion Fine-tuning with Rewarded Moment Matching Distillation (RMMD): Applied to GenCast weather forecasting (ERA5 dataset) and ImageNet. https://arxiv.org/pdf/2606.30414
- MUSE: Repurposes timestep embedding for multi-task dense prediction (depth/normal estimation) on Stable Diffusion v2.1 and diverse benchmarks. https://arxiv.org/pdf/2606.30370
- UniGP: Unified generation and perception using MMDiT backbone for depth/normal estimation and controllable generation. Project page: https://guoqincode.github.io/UniGP
- The Surprising Effectiveness of Video Diffusion Models for Hand Motion Reconstruction (ViDiHand): Leverages video diffusion models for 4D two-hand motion from egocentric video, SOTA on ARCTIC, HOT3D, HOI4D. Project page: https://vidihand.github.io
- Intermediate Text Representation Guided T2I Generation: Uses intermediate text representations for concept-association bias, introduces OAO-AttackBench. Project page: https://soyoun-won.github.io/one-and-only-ir-guidance/
- Your Data Manifold is Secretly a Reward Model (Shell-LCC): Uses intrinsic manifold geometry of SFT data as a cost-free reward for text-to-video generation, tested on UltraVideo. Project page: https://needylove.github.io/Shell-LCC/
- SkelEM: Self-supervised axial super-resolution in volume microscopy, introduces BRAVE-ASR benchmark. https://arxiv.org/pdf/2606.30012
- Variance Reduction on the Camera Axis (MV-SDI): Training-free framework for text-to-3D, uses SD 2.1 prior. Code: https://github.com/threestudio-project/threestudio
- LEOSTP: Diffusion model for LEO satellite network traffic prediction, integrates geographic semantic information. https://arxiv.org/pdf/2606.29856
- HomeDiffusion: Zero-shot object customization for indoor scenes, uses HD visual encoder and multi-view representation learning, with ZOC-Indoor-Eval benchmark. https://arxiv.org/pdf/2606.29828
- Nemotron-Labs-Diffusion-Image: 8B parameter masked discrete diffusion model for high-resolution T2I, uses LAION-2B, COYO-700M. arXiv:2606.29814
- Rethinking Forgery Attacks on Semantic Watermarks: Formalizes black-box forgery attacks, evaluates on SD 2.1, SDXL, PixArt-Σ, FLUX.1, SD3.
- Lie Group Diffusion Models for Quantum Circuit Synthesis: Uses diffusion on SU(2) manifold for quantum gates, evaluated on TFIM, Heisenberg-XXZ Hamiltonians. Code: https://github.com/joesingh-ai/su2-diffusion
- Dipole Diffusion Error in Thin Geometry: Theoretical work on subsurface scattering. https://arxiv.org/pdf/2606.29387
- FDM-MFVT: Few-step diffusion for mask-free virtual try-on, introduces MFVT dataset (30,000 pairs). https://arxiv.org/pdf/2606.29319
- ASTAD: Asymmetric Style Transfer for Autonomous Driving, uses DINO for semantic priors. Code: https://github.com/Dingyi-Yao/ASTAD
- DTI: Dynamic Trajectory Initialization for Face Video Super-Resolution, uses DINOv3 and Wan2.1-1.3B T2V DiT. Code: https://github.com/MediaX-SJTU/DTI
- Evidence-Based Text-Conditioned 3D CT Synthesis: First 3D CT generation for ovarian cancer, using MAISI VAE and Qwen2.5-14B-Instruct. Code: https://github.com/francescapia/OvESyn
- Stochastic Optimal Control Sampling (SOCS): For diffusion inverse problems, compatible with VP/VE-SDEs, LDMs, SD3. Code: https://github.com/zjqwq01/SOCS-DIP
- SATB-VR: Few-step video restoration, uses SNR-Aware Trajectory Blending. Code: https://github.com/chenxx89/SATB-VR
- Constrained Tabular Diffusion for Finance (CTDF): Training-free constraint enforcement for tabular data on Airbnb and Lending Club datasets.
- JuZhou 1.0: First edge-native T2I foundation model (0.387B params) trained on China-developed AI accelerators, achieves GenEval 0.69. https://arxiv.org/pdf/2606.28421
- DiffRGD: Inference-time guidance via Riemannian Gradient Descent, preserving Gaussian latent distribution. Code: https://github.com/jwliao1209/DiffRGD
- LoRAShield: Data-free editing for securing personalized LoRA sharing on platforms like Civitai. https://arxiv.org/pdf/2507.07056
- Stochastic and Non-local Closure Modeling: Latent score-based generative AI for turbulent flows. https://arxiv.org/pdf/2506.20771
- Physics-Informed Distillation of Diffusion Models (PIDDM): Enforces PDE constraints on final samples, achieves one-step generation. https://arxiv.org/pdf/2505.22391
- Multi-objective Low-altitude IRS-assisted ISAC Optimization: GenAI-enhanced DRL for 6G wireless networks. https://arxiv.org/pdf/2502.10687
- Seed-to-Seed (StS): Unpaired image translation in diffusion seed space, using Stable Diffusion 2.1, BDD100k, DENSE. https://arxiv.org/pdf/2409.00654
- SSM Meets Video Diffusion Models: Replaces temporal attention with bidirectional Mamba blocks for long-term video generation. Code: https://github.com/shim0114/SSM-Meets-Video-Diffusion-Models
- VGB for Masked Diffusion Model (MDM-VGB): Reward-guided sampler for discrete diffusion, applied to Sudoku, QM9, DNA, protein design. Code: https://github.com/KraitGit/MDM-VGB
- Unleashing Infinite Motion (Uni-Mo): Generates expressive quadruped motions from language prompts using video diffusion, introduces Quad-Imaginarium dataset. Code: https://github.com/GaoLii/Quad-Imaginarium.git
- Monocular Avatar Reconstruction: Cascaded diffusion priors for 3D avatar reconstruction with fewer than 100 scans. Project page: https://marcus-avatar.github.io
- Beyond Sparse Supervision (ADC-GNN): Diffusion-guided learning for few-shot graph fraud detection on Amazon review-spam, YelpChi, T-Finance. Code: https://github.com/llmllmllm/ADC-GNN
- OSOR: One-step diffusion inpainting for effect-aware object removal, introduces CORNE dataset (280K pairs). Code: https://github.com/Zhouqm-Git/osor
- Diffusion Model Attribution via Spectral Coupling: Spectral Denoising Signatures (SDS) for non-invasive model attribution. Code: https://github.com/Pragati-Meshram/SGS
- TempAct: Planner-Executor RL for temporal plausibility in AR video generation, uses Self-Forcing, LongLive backbones. Code: https://github.com/jingw193/TempAct
- Controllable Histopathology Image Synthesis (CHIS): Training-free framework for histopathology image synthesis. Code: https://github.com/IBIL-Code/CHIS
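Several entries above refer to prediction targets (for example, x-prediction in "Not All Prediction Targets"). As a reminder of how the targets relate, the sketch below converts between a noise prediction and a clean-sample prediction under the standard forward process x_t = alpha_t * x_0 + sigma_t * eps. The function names and the variance-preserving schedule value are assumptions for this example; the code shows only the algebraic identity, not any paper's guidance method.

```python
import numpy as np

def eps_to_x0(x_t, eps_hat, alpha_t, sigma_t):
    """Recover the clean-sample estimate implied by a noise prediction."""
    return (x_t - sigma_t * eps_hat) / alpha_t

def x0_to_eps(x_t, x0_hat, alpha_t, sigma_t):
    """Recover the noise estimate implied by a clean-sample prediction."""
    return (x_t - alpha_t * x0_hat) / sigma_t

# Toy check with a variance-preserving pair (alpha_t^2 + sigma_t^2 = 1).
rng = np.random.default_rng(0)
x0 = rng.normal(size=(3, 3))
noise = rng.normal(size=(3, 3))
alpha_t = 0.8
sigma_t = np.sqrt(1.0 - alpha_t**2)
x_t = alpha_t * x0 + sigma_t * noise

x0_rec = eps_to_x0(x_t, noise, alpha_t, sigma_t)
eps_rec = x0_to_eps(x_t, x0, alpha_t, sigma_t)
```

The two parameterizations carry the same information at a given timestep, but errors are scaled differently: dividing by a small `alpha_t` at high noise levels amplifies any error in the noise prediction, which is one reason the choice of target can matter in practice.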
Impact & The Road Ahead
The collective impact of this research is profound. We are witnessing diffusion models evolve into indispensable tools for precision creation, robust control, and ethical deployment. The ability to generate complex 3D and 4D content from diverse inputs, debias generative outputs, accelerate training and inference, and enable interactive applications promises to revolutionize fields from autonomous driving and medical imaging to robotic control and content creation. The emergence of specialized architectures like Diffusion Transformers (DiTs) and hybrid approaches combining diffusion with GANs or State Space Models (SSMs) points to an exciting future where these models are not just powerful, but also efficient, controllable, and adaptable to novel challenges.
Challenges remain, such as ensuring perceptual fidelity at scale, bridging the gap between theoretical guarantees and practical implementation, and rigorously auditing models for hidden biases and vulnerabilities. However, the rapid pace of innovation, particularly in training-free methods and principled theoretical frameworks, indicates a clear path forward. Diffusion models are not just generating images; they are generating a future where AI-powered creativity and problem-solving are more accessible, nuanced, and aligned with human needs than ever before.