Class Imbalance: Navigating the AI Frontier with Novel Solutions and Trustworthy AI
Latest 100 papers on class imbalance: Aug. 25, 2025
Class imbalance remains a pervasive and critical challenge across diverse domains in AI and Machine Learning. From medical diagnoses where rare diseases are often overlooked to cybersecurity systems battling infrequent, yet devastating, attacks, models trained on imbalanced datasets frequently achieve misleadingly high accuracy while failing to adequately detect the minority class. This phenomenon can have severe real-world consequences, undermining trust and efficacy. Fortunately, recent research is pushing the boundaries, introducing innovative solutions that not only tackle class imbalance head-on but also emphasize interpretability and robustness, paving the way for more reliable AI systems.
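The "misleadingly high accuracy" trap described above is easy to reproduce. The following stdlib-only Python sketch uses a hypothetical dataset with a 1% minority class and a degenerate classifier that always predicts the majority class (the data and metrics here are illustrative, not drawn from any of the papers):

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def recall(y_true, y_pred, positive=1):
    """Fraction of actual positives the model detects."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    actual = sum(t == positive for t in y_true)
    return tp / actual if actual else 0.0

# 1% positive (minority) class: 99 negatives, 1 positive.
y_true = [0] * 99 + [1]
# A degenerate classifier that always predicts the majority class.
y_pred = [0] * 100

print(accuracy(y_true, y_pred))  # 0.99 -- looks excellent
print(recall(y_true, y_pred))    # 0.0  -- never detects the minority class
```

Ninety-nine percent accuracy, zero minority-class recall: exactly the failure mode that metrics such as recall, F1, or balanced accuracy are meant to expose.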
The Big Idea(s) & Core Innovations
One prominent theme in recent breakthroughs is the strategic use of generative models and advanced augmentation techniques to create synthetic, yet realistic, data for underrepresented classes. The paper, GFlowNets for Learning Better Drug-Drug Interaction Representations, by A. T. Wasi et al. from the Information Sciences Institute, University of Southern California, proposes an innovative framework combining Generative Flow Networks (GFlowNets) with Variational Graph Autoencoders (VGAE) to generate synthetic DDI samples, critically improving predictions for rare drug interactions. Similarly, SMOGAN: Synthetic Minority Oversampling with GAN Refinement for Imbalanced Regression, by Shayan Alahyari and Mike Domaratzki from Western University, introduces a two-step framework using DistGAN to refine synthetic samples, ensuring they align with true feature-target distributions for imbalanced regression tasks. In medical imaging, Diffusion-Based User-Guided Data Augmentation for Coronary Stenosis Detection by S. Seo et al. from MediPixel Inc. leverages diffusion models and ControlNet to generate synthetic coronary angiograms with controlled stenosis, significantly improving lesion detection and severity classification without additional labeling costs. The work, SkinDualGen: Prompt-Driven Diffusion for Simultaneous Image-Mask Generation in Skin Lesions, further extends this, using Stable Diffusion for simultaneous image-mask generation in skin lesions, directly addressing data scarcity and class imbalance.
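The generative approaches above refine a much older baseline: SMOTE-style interpolation, which creates synthetic minority samples on line segments between a real sample and one of its nearest neighbours. A minimal stdlib-only Python sketch of that baseline follows (it is not the GAN or diffusion refinement the papers propose, and the toy data and `k` are illustrative assumptions):

```python
import random

def smote_like_oversample(minority, n_new, k=2, seed=0):
    """Generate synthetic minority samples by interpolating between a
    sample and one of its k nearest neighbours (SMOTE-style)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        a = rng.choice(minority)
        # k nearest neighbours of a (excluding a itself), by squared distance
        neighbours = sorted(
            (p for p in minority if p is not a),
            key=lambda p: sum((x - y) ** 2 for x, y in zip(a, p)),
        )[:k]
        b = rng.choice(neighbours)
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(tuple(x + gap * (y - x) for x, y in zip(a, b)))
    return synthetic

minority = [(1.0, 1.0), (1.2, 0.9), (0.9, 1.3)]
new = smote_like_oversample(minority, n_new=4)
print(len(new))  # 4 synthetic points, each on a segment between two real ones
```

Methods like SMOGAN start from samples of this kind and then use an adversarial refinement stage to pull them toward the true joint distribution, addressing the known weakness that naive interpolation can produce implausible points.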
Another significant area of innovation is the development of specialized network architectures and loss functions that are inherently robust to class imbalance. For vision-language tasks, LLM-empowered Dynamic Prompt Routing for Vision-Language Models Tuning under Long-Tailed Distributions by Yongju Jia et al. from Shandong University proposes MDPR, a plug-and-play framework with dynamic prompt routing and multi-dimensional semantic knowledge, offering instance-adaptive class representation. In medical imaging, VasoMIM: Vascular Anatomy-Aware Masked Image Modeling for Vessel Segmentation from De-Xing Huang et al. from the Chinese Academy of Sciences uses anatomy-guided masking and a consistency loss to enhance vascular representation, tackling class imbalance in vessel segmentation. In cybersecurity, CRoC: Context Refactoring Contrast for Graph Anomaly Detection with Limited Supervision by Xiao Langley and Jiawei Li from The Chinese University of Hong Kong leverages context refactoring and contrastive learning to boost Graph Neural Network (GNN) robustness against camouflage in imbalanced scenarios. The paper CLIMD: A Curriculum Learning Framework for Imbalanced Multimodal Diagnosis by Kai Han et al. from the University of Jinan introduces a curriculum learning approach for multimodal medical diagnosis, progressively guiding training without relying on data augmentation. A crucial insight from Class Unbiasing for Generalization in Medical Diagnosis by Lishi Zuo et al. from The Hong Kong Polytechnic University is the introduction of a class-wise inequality loss combined with Group Distributionally Robust Optimization (G-DRO) to effectively mitigate both class imbalance and class-feature bias, leading to better generalization in medical diagnosis. Lastly, the theoretical work Class Imbalance in Anomaly Detection: Learning from an Exactly Solvable Model by F.S. Pezzicoli et al. from Université Paris-Saclay challenges conventional wisdom, demonstrating that a perfectly balanced training set is not always optimal and revealing distinct regimes of noise sensitivity that affect anomaly detection performance. This highlights the nuanced understanding required beyond simple balancing acts.
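The simplest member of the imbalance-aware loss family that these papers build on is inverse-frequency class weighting: scale each example's loss so that mistakes on rare classes cost more. A minimal stdlib-only Python sketch (the weighting scheme and toy batch are illustrative, not taken from any of the papers above):

```python
import math
from collections import Counter

def class_weights(labels):
    """Inverse-frequency weights: rare classes get proportionally larger weight."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {c: n / (k * counts[c]) for c in counts}

def weighted_cross_entropy(probs, labels, weights):
    """Mean of -w_y * log p(y) over a batch; probs[i] maps class -> probability."""
    return sum(-weights[y] * math.log(p[y]) for p, y in zip(probs, labels)) / len(labels)

labels = [0] * 9 + [1]            # 9:1 imbalanced toy batch
w = class_weights(labels)
print(w)                          # minority class weighted ~9x heavier

probs = [{0: 0.8, 1: 0.2}] * 10   # the same mediocre prediction everywhere
loss = weighted_cross_entropy(probs, labels, w)
print(loss)                       # the single minority mistake dominates the loss
```

Approaches such as the class-wise inequality loss with G-DRO go further, optimizing worst-group performance rather than a fixed reweighted average.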
Under the Hood: Models, Datasets, & Benchmarks
These advancements are underpinned by a rich ecosystem of models, specialized datasets, and rigorous benchmarking frameworks. Key resources enabling these innovations include:
- MDPR Framework: A plug-and-play architecture for Vision-Language Models (VLMs) that uses a multi-dimensional semantic library for dynamic prompt routing. Code available at https://anonymous.4open.science/r/MDPR-328C/README.md.
- Edge Selector Model: A hybrid ML/metaheuristic model using tabular binary classifiers and Graph Neural Networks (GNNs) for Vehicle Routing Problems. Code at https://github.com/bachtiarherdianto/MH-Edge-Selector.
- Subjective Logic Methodology: A formal framework for modeling dataset trustworthiness and bias in AI training datasets. Code available at https://github.com/Ouatt-Isma/-Trustworthiness-of-AI-Training-Dataset.
- CRoC Framework: Enhances GNNs for graph anomaly detection using context refactoring and contrastive learning, evaluated across seven real-world datasets. Code at https://github.com/XsLangley/CRoC_ECAI2025.
- VasoMIM Framework: Masked Image Modeling integrated with vascular anatomy for vessel segmentation in X-ray angiograms. Project page: https://dxhuang-casia.github.io/VasoMIM.
- SAM with Cross-Entropy Masking (CEM): Adapts the Segment Anything Model for remote sensing change detection, showing a 2.5% F1-score improvement on the S2Looking dataset. Code at https://github.com/humza909/SAM-CEM-CD.
- GraphFedMIG Framework: A federated graph learning paradigm that treats class imbalance as a generative data augmentation task, with code at https://github.com/NovaFoxjet/GraphFedMIG.
- InceptoFormer: A multi-signal neural framework combining Inception1D and Transformers for Parkinson’s disease severity evaluation. Code at https://github.com/SafwenNaimi/InceptoFormer.
- DBIF-AUNet: A Dual-Branch Interactive Fusion Attention model for pleural effusion segmentation, featuring nested deep supervision and hierarchical adaptive hybrid loss. Paper available at https://arxiv.org/pdf/2508.06191.
- SMOGAN Framework: Utilizes DistGAN for refining synthetic samples in imbalanced regression tasks, demonstrated on 23 benchmark datasets. Paper at https://arxiv.org/pdf/2504.21152.
- F2PASeg Architecture: An efficient architecture for pituitary anatomy segmentation during endoscopic surgery, with a large-scale PAS dataset. Code at https://github.com/paulili08/F2PASeg.
- MetroPT Dataset: Used in An Explainable Machine Learning Framework for Railway Predictive Maintenance, enabling real-time fault prediction with Explainable AI. The implementation builds on the River package and the Highcharts library.
- ALScope Toolkit: A unified platform for Deep Active Learning (DAL) algorithms, supporting open-set recognition and data imbalance. Code at https://github.com/WuXixiong/DALBenchmark.
- KACQ-DCNN: A hybrid classical-quantum neural network for heart disease detection, achieving state-of-the-art performance with interpretability and uncertainty quantification. Paper: https://arxiv.org/pdf/2410.07446.
- Multi-VQC: A Quantum Machine Learning (QML) approach for healthcare classification, specifically for imbalanced datasets. Code at https://github.com/quantum-ml-research/Multi-VQC.
- GraphALP Framework: Combines LLMs and pseudo-labeling for node classification in class-imbalanced graphs with noisy labels. Paper at https://arxiv.org/pdf/2507.18153.
- SynthCTI Framework: LLM-driven synthetic CTI generation for underrepresented MITRE ATT&CK techniques. Code at https://github.com/dessertlab/cti-to-mitre-with-nlp.
- Kolmogorov Arnold Networks (KANs): Evaluated for imbalanced data classification, showing unique performance characteristics compared to MLPs. Paper at https://arxiv.org/pdf/2507.14121.
- SMOTETomek and FedProx Pipeline: Addresses class imbalance and non-IID data in federated learning for clinical applications, specifically tested on the Stroke Prediction Dataset. Paper at https://arxiv.org/pdf/2508.10017.
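As a companion to the SMOTETomek entry above: Tomek links are pairs of mutually nearest neighbours with opposite labels, and the Tomek step removes them (typically the majority-class member) to clean the class boundary after SMOTE oversampling. A brute-force stdlib-only Python sketch with illustrative 1-D data (a real pipeline would use imbalanced-learn's SMOTETomek rather than this O(n²) search):

```python
def nearest(i, X):
    """Index of the nearest other point to X[i] (squared Euclidean distance)."""
    return min(
        (j for j in range(len(X)) if j != i),
        key=lambda j: sum((a - b) ** 2 for a, b in zip(X[i], X[j])),
    )

def tomek_links(X, y):
    """Pairs (i, j) of mutually nearest neighbours with opposite labels."""
    links = []
    for i in range(len(X)):
        j = nearest(i, X)
        if y[i] != y[j] and nearest(j, X) == i and i < j:
            links.append((i, j))
    return links

# A majority cluster near 0, plus a majority point intruding next to the
# lone minority point at 5.1 -- that borderline pair forms a Tomek link.
X = [(0.0,), (0.1,), (0.2,), (5.0,), (5.1,)]
y = [0, 0, 0, 0, 1]
print(tomek_links(X, y))  # [(3, 4)]
```

In a SMOTETomek pipeline, the majority member of each detected link (index 3 here) would then be dropped, sharpening the decision boundary around the minority class.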
Impact & The Road Ahead
The impact of this research is profound, extending far beyond academic benchmarks. In medical imaging and diagnosis, these advancements mean earlier detection of diseases like cancer, Parkinson’s, and diabetic retinopathy, with models that are not only more accurate but also interpretable, fostering greater trust among clinicians. The development of frameworks like AutoML-Med streamlines ML applications in healthcare by minimizing manual intervention, crucial for scaling AI solutions. For cybersecurity, novel approaches to anomaly detection in cyber-physical systems and federated graph learning for fraud detection bolster defense mechanisms against evolving threats, ensuring robust and privacy-preserving solutions.
Critically, the push for Explainable AI (XAI), seen in papers like An Enhanced Focal Loss Function to Mitigate Class Imbalance in Auto Insurance Fraud Detection with Explainable AI and Explainable Vulnerability Detection in C/C++ Using Edge-Aware Graph Attention Networks, is central to building trustworthy AI. This transparency is vital in high-stakes applications where understanding why a model makes a particular prediction is as important as the prediction itself. Furthermore, the emerging use of quantum-enhanced machine learning (Multi-VQC) demonstrates a frontier where even more robust solutions to class imbalance might lie.
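For reference, the standard binary focal loss of Lin et al. (2017), which the enhanced variant above builds on, down-weights easy examples via a (1 - p_t)^gamma modulating factor so that training concentrates on hard, often minority-class, cases. A minimal stdlib-only Python sketch (this is the textbook formulation, not the paper's enhanced version; the probabilities are illustrative):

```python
import math

def focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Binary focal loss for one example: -alpha_t * (1 - p_t)^gamma * log(p_t),
    where p is the predicted probability of class 1 and y is the true label."""
    p_t = p if y == 1 else 1.0 - p
    a_t = alpha if y == 1 else 1.0 - alpha
    return -a_t * (1.0 - p_t) ** gamma * math.log(p_t)

# Confident, correct majority-class prediction: loss is nearly zero.
easy = focal_loss(p=0.05, y=0)
# Misclassified minority example: loss stays large.
hard = focal_loss(p=0.05, y=1)
print(easy, hard)
```

With gamma = 0 and alpha = 0.5 this reduces (up to a constant) to plain cross-entropy; larger gamma suppresses the flood of easy majority-class examples that would otherwise dominate the gradient.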
The road ahead involves further integrating these solutions, creating truly adaptive and ethical AI systems. Challenges such as temporal data imbalance and concept drift (as highlighted in Empirical Evaluation of Concept Drift in ML-Based Android Malware Detection) remain, underscoring the need for continuous learning and adaptation. As we continue to refine generative models, develop more intelligent architectures, and prioritize interpretability and fairness, we are moving closer to a future where AI’s benefits are equitably distributed, and its decisions are understood and trusted by all.