Model Compression: Shrinking AI Models for a Smarter, Faster Future
Latest 50 papers on model compression: Sep. 8, 2025
The relentless growth of AI models, particularly Large Language Models (LLMs) and Vision Transformers, has brought unprecedented capabilities but also formidable challenges. These models demand immense computational resources and memory, making their deployment on edge devices or in real-time applications incredibly difficult. This necessitates efficient model compression, a vibrant area of research focused on reducing model size and computational footprint without sacrificing performance. Recent breakthroughs, as highlighted in a collection of cutting-edge research, are pushing the boundaries of what's possible, enabling a future where powerful AI can run almost anywhere.

### The Big Ideas & Core Innovations

A central theme across these papers is the innovative combination and refinement of established compression techniques like pruning, quantization, and knowledge distillation, often with novel architectural insights. The goal is clear: smaller, faster models that are just as capable, if not more so.

A significant innovation comes from Pivoting Factorization: A Compact Meta Low-Rank Representation of Sparsity for Efficient Inference in Large Language Models by Jialin Zhao, Yingtao Zhang, and Carlo Vittorio Cannistraci from Tsinghua University. They introduce PIFA, a lossless meta low-rank representation that compresses redundant information in weight matrices, achieving impressive memory and computational savings. Coupled with an Online Error-Accumulation-Minimization Reconstruction (MPIFA) method, this approach significantly boosts performance, even achieving a 2.1× speedup over dense layers at 55% density. Similarly, CALR: Corrective Adaptive Low-Rank Decomposition for Efficient Large Language Model Layer Compression by Muchammad Daniyal Kautsar et al. (affiliated with IEEE, Meta, Google Research, and top universities) presents CALR, an adaptive and corrective low-rank decomposition for LLMs that significantly reduces parameters while preserving performance, showcasing its effectiveness in both one-shot and general language understanding scenarios. Complementing this, Mohammad Mozaffari et al. from the University of Toronto, Google DeepMind, and NVIDIA Research, in their paper SLiM: One-shot Quantization and Sparsity with Low-rank Approximation for LLM Weight Compression, introduce SLIM, a unified one-shot compression framework that integrates quantization, sparsity, and low-rank approximation. It achieves up to a 5.66% accuracy improvement on LLaMA-2-7B and layer-wise speedups of up to 4.3×.

Quantization, the art of reducing numerical precision, is another major frontier. The paper ABQ-LLM: Arbitrary-Bit Quantized Inference Acceleration for Large Language Models by Chao Zeng et al. from ByteDance Inc. introduces a framework for arbitrary-precision inference, achieving a 1.6× speedup and 2.7× memory compression on LLaMA-7B. Relatedly, Enhancing Ultra-Low-Bit Quantization of Large Language Models Through Saliency-Aware Partial Retraining by D. Cao and S. Aref tackles the extreme end of quantization, showing that saliency-aware partial retraining can significantly reduce accuracy degradation in ultra-low-bit LLMs. The ethical implications of compression are also being considered: "How Quantization Impacts Privacy Risk on LLMs for Code?" by Md Nazmul Haque et al. from North Carolina State University and the University of Alberta highlights that 8-bit static quantization can reduce privacy risk while maintaining performance, although a trade-off exists at lower bit-widths.
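To ground the low-rank theme that runs through PIFA, CALR, and SLIM, here is a minimal, generic sketch of truncated-SVD weight factorization. This is only the plain baseline those papers build far beyond (with pivoting, corrective updates, sparsity, and quantization); the matrix shape, rank, and synthetic weights below are arbitrary stand-ins for illustration.

```python
# Toy illustration only: truncated-SVD low-rank factorization of a weight
# matrix. This is the plain baseline; methods like PIFA, CALR, and SLIM
# go well beyond it with pivoting, corrective terms, sparsity, and quantization.
import numpy as np

def low_rank_factorize(W: np.ndarray, rank: int):
    """Approximate W (d_out x d_in) as A @ B with A: (d_out, rank), B: (rank, d_in)."""
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    A = U[:, :rank] * S[:rank]   # fold the singular values into A
    B = Vt[:rank, :]
    return A, B

# Synthetic stand-in for a layer weight: mostly low-rank structure plus noise.
rng = np.random.default_rng(0)
d, true_rank = 1024, 64
W = rng.standard_normal((d, true_rank)) @ rng.standard_normal((true_rank, d))
W += 0.01 * rng.standard_normal((d, d))

A, B = low_rank_factorize(W, rank=64)
rel_err = np.linalg.norm(W - A @ B) / np.linalg.norm(W)
kept = (A.size + B.size) / W.size
print(f"parameters kept: {kept:.1%}, relative reconstruction error: {rel_err:.4f}")
# A rank-r factorization stores r * (d_out + d_in) values instead of d_out * d_in.
```

The payoff is purely in parameter count and matmul cost: real LLM weight matrices are not exactly low-rank, which is precisely why the papers above add corrective and error-minimizing machinery on top of this baseline.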
Knowledge distillation, where a smaller "student" model learns from a larger "teacher," continues to evolve. Knowledge Distillation with Refined Logits by Wujie Sun et al. from Zhejiang University introduces Refined Logit Distillation (RLD), a method that dynamically refines teacher logits to preserve crucial class correlations while eliminating misleading information, improving student performance. An Empirical Study of Knowledge Distillation for Code Understanding Tasks by Ruiqi Wang et al. from Harbin Institute of Technology demonstrates that feature-based KD enables student models to retain 98% of teacher performance with only 5% of the parameters on code understanding tasks. A novel twist comes from Synthetic Adaptive Guided Embeddings (SAGE): A Novel Knowledge Distillation Method by Suleyman O. Polat et al. from the University of North Texas, which dynamically generates synthetic data in high-loss regions of the embedding space to improve student model performance.

Hybrid approaches are also gaining traction. Integrating Pruning with Quantization for Efficient Deep Neural Networks Compression by Authors A, B, and C explores the synergy of these two techniques for DNN compression, demonstrating improved efficiency. For specialized applications, S2Q-VDiT: Accurate Quantized Video Diffusion Transformer with Salient Data and Sparse Token Distillation by Weilun Feng et al. from the Chinese Academy of Sciences and ETH Zürich presents a post-training quantization method for video diffusion transformers that achieves 3.9× model compression and 1.3× inference acceleration without compromising visual quality, leveraging salient data selection and sparse token distillation.
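To make the pruning-plus-quantization combination concrete, here is a minimal, generic sketch, not the method of any paper above: global magnitude pruning followed by symmetric per-tensor int8 quantization, applied to a random stand-in weight matrix. The sparsity level and bit-width are illustrative choices.

```python
# Toy illustration only: magnitude pruning followed by symmetric per-tensor
# int8 quantization -- the basic pipeline that hybrid pruning + quantization
# methods refine with saliency scores, calibration data, and retraining.
import numpy as np

def magnitude_prune(W: np.ndarray, sparsity: float) -> np.ndarray:
    """Zero out the smallest-magnitude weights so roughly `sparsity` of them are zero."""
    threshold = np.quantile(np.abs(W), sparsity)
    return np.where(np.abs(W) >= threshold, W, 0.0)

def quantize_int8(W: np.ndarray):
    """Symmetric per-tensor quantization to int8; returns integer codes and the scale."""
    scale = np.abs(W).max() / 127.0
    q = np.clip(np.round(W / scale), -127, 127).astype(np.int8)
    return q, scale

rng = np.random.default_rng(0)
W = rng.standard_normal((1024, 1024)).astype(np.float32)

W_pruned = magnitude_prune(W, sparsity=0.5)   # keep the largest 50% of weights
q, scale = quantize_int8(W_pruned)            # 4 bytes/weight -> 1 byte/weight
W_hat = q.astype(np.float32) * scale          # dequantize to measure the error

err = np.abs(W_pruned - W_hat).max()
print(f"nonzeros kept: {np.count_nonzero(W_pruned) / W.size:.0%}, "
      f"max dequantization error: {err:.4f}")
```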
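Likewise, the distillation variants surveyed above all start from the same vanilla soft-target objective. The sketch below shows that baseline loss, temperature-scaled KL against the teacher plus hard-label cross-entropy, on dummy logits; it assumes a plain classification setup and is not specific to RLD, feature-based KD, or SAGE.

```python
# Toy illustration only: the vanilla temperature-scaled distillation loss
# that refinements such as RLD and feature-based KD build on.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    """Blend the soft-target KL term against the teacher with the usual hard-label CE."""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)                       # rescale so gradients match the CE term
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

# Dummy batch: 8 examples, 10 classes.
student_logits = torch.randn(8, 10, requires_grad=True)
teacher_logits = torch.randn(8, 10)
labels = torch.randint(0, 10, (8,))

loss = distillation_loss(student_logits, teacher_logits, labels)
loss.backward()
print(f"distillation loss: {loss.item():.4f}")
```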
Beyond these core techniques, new paradigms are emerging. Forget the Data and Fine-Tuning! Just Fold the Network to Compress by Dong Wang et al. (Graz University of Technology, ETH Zurich) introduces model folding, a data-free compression method that merges structurally similar neurons, outperforming existing data-free methods, especially at high sparsity levels. And looking to the future, Is Quantum Optimization Ready? An Effort Towards Neural Network Compression using Adiabatic Quantum Computing by Zhehui Wang et al. (A*STAR Singapore, Huadian Coal Industry Group, Chinese Academy of Sciences) explores adiabatic quantum computing for fine-grained pruning-quantization, showing it can outperform classical algorithms in complex optimization spaces.

### Under the Hood: Models, Datasets, & Benchmarks

These innovations are often demonstrated and validated using a range of critical resources:

- LLMs for Compression: Papers heavily feature prominent LLMs like LLaMA-2-7B, LLaMA-2-13B, LLaMA, Qwen, Phi, Pythia, CodeGen, GPT-Neo, and DeepSeek-R1, showcasing the broad applicability of compression techniques.
- Code-Specific Models: CodeBERT, CodeGPT, and PLBART are used to study adversarial robustness and knowledge distillation in code understanding contexts.
- Image/Video Models: Vision Transformers (ViT), Video Diffusion Models (VDMs), YOLOv5n for object detection, and Graph Neural Networks (GNNs) illustrate applications in computer vision.
- Novel Architectures & Frameworks:
  - DaMoC (Data and Model Compression Framework) for efficient LLM selection in fine-tuning.
  - FR-KAN+ and SA-DSD (Self-Attention Dynamic Sampling Distillation) for GNN-to-KAN knowledge distillation, from Dalian Jiaotong University.
  - SLIM (One-shot Quantization and Sparsity with Low-rank Approximation) for LLM weight compression (Code: https://github.com/Mohammad-Mozaffari/slim).
  - CognitiveArm for real-time EEG-controlled prosthetic arms (Code).
  - Mix-LN for improved Layer Normalization in deep LLMs (Code).
  - MoBE (Mixture-of-Basis-Experts) for compressing MoE-based LLMs (Code).
  - LINR-PCGC (Lossless Implicit Neural Representations for Point Cloud Geometry Compression) (Code).
  - VDMini (Individual Content and Motion Dynamics Preserved Pruning for Video Diffusion Models) (Code).
  - OWLed (Outlier-Weighed Layerwise Pruning for Efficient Autonomous Driving) (Code).
  - FAIR-Pruner (Flexible Automatic Identification and Removal Pruner) (Code).
  - S2Q-VDiT for quantized video diffusion transformers (Code).
  - MOR-ViT (Efficient Vision Transformer with Mixture-of-Recursions) (Code).
  - ABQ-LLM for arbitrary-bit quantization (Code).
  - RLD for refined logit distillation (Code).
  - ULB-SAPR for ultra-low-bit quantization with saliency-aware partial retraining (Code).
  - PIFA for lossless meta low-rank representation (Code).
  - SLTN-GA (Scalable Lottery Ticket Networks using Genetic Algorithms) (Code).
  - Model Folding for data-free compression (Code).
  - DLR for deep diagonal state space models (Code).
  - CompLeak for privacy risk evaluation (https://arxiv.org/pdf/2507.16872).
  - INSIGHT, a survey of in-network systems covering frameworks like Planter and Quark (https://github.com/planter-ml/planter, https://github.com/quark-ai/quark).
- Key Datasets/Benchmarks: ImageNet-1K, CIFAR-100, MS MARCO, BoolQ, GSM8K, USMLE benchmarks for medical LLMs, HSI-Drive v2.0 for autonomous driving, ArabicMMLU, EnglishMMLU, and Kannada-ARC-C-2.5K for multilingual evaluation, GLUE benchmarks for NLP, and various code understanding tasks.
- Hardware Focus: Intel Loihi 2 for neuromorphic computing and STM32H7 microcontroller units for lightweight object detection showcase the push towards specialized, efficient hardware deployment.

### Impact & The Road Ahead

The implications of this research are profound. Efficient model compression is not merely an optimization; it is a gateway to ubiquitous, powerful AI. Imagine highly capable LLMs running directly on your smartphone, real-time perception systems in autonomous vehicles with minimal latency, or personalized medical AI assistants on local devices, ensuring privacy and responsiveness. This research makes such scenarios increasingly feasible. For instance, the DaMoC: Efficiently Selecting the Optimal Large Language Model for Fine-tuning Domain Tasks Based on Data and Model Compression framework from Ant Group, China, can reduce training time 20-fold, dramatically accelerating LLM fine-tuning for specific domains like medical Q&A.
However, the road ahead is not without challenges. Model Compression vs. Adversarial Robustness: An Empirical Study on Language Models for Code by Md. Abdul Awal et al. from the University of Saskatchewan reveals a critical trade-off: compressed models can be significantly more vulnerable to adversarial attacks, especially knowledge-distilled ones. This highlights the need for compression techniques that are "robustness-aware." Furthermore, CompLeak: Deep Learning Model Compression Exacerbates Privacy Leakage, from anonymous authors, points out that compression can inadvertently increase privacy leakage, a crucial consideration for deployment in sensitive applications.

Future work will likely focus on developing integrated, holistic compression strategies that simultaneously optimize for efficiency, accuracy, robustness, and privacy. The exploration of quantum optimization, as presented by Zhehui Wang et al., and of principled approximation methods by Pedro Savarese from the Toyota Technological Institute at Chicago, suggests entirely new paradigms for model design and optimization. Benchmarking frameworks like LLMC+ from Nanyang Technological University and SenseTime Research are crucial for systematically evaluating these complex trade-offs, particularly for multimodal models. The push towards Edge General Intelligence with Agentic AI and Agentification by Zhang, Y. et al. (Tsinghua University, Stanford University, UC Berkeley, MIT) underscores the necessity of model compression for deploying memory-enabled, context-aware AI on resource-constrained edge devices.

An era of truly pervasive and intelligent AI is dawning, driven by these remarkable advancements in model compression. The excitement is palpable as researchers continue to shrink the digital giants, making them nimble enough to reshape our world.