Data Privacy in AI/ML: From Unlearning to Unseen Domains and Efficient Edge Intelligence
The latest 21 papers on data privacy: May 9, 2026
Data privacy remains a paramount concern in the rapidly evolving landscape of AI and Machine Learning. As models grow in complexity and data becomes increasingly distributed, ensuring confidentiality, robustness against attacks, and compliance with regulations like GDPR presents multifaceted challenges. Fortunately, recent research offers promising breakthroughs, pushing the boundaries of what’s possible in privacy-preserving AI. This post dives into a selection of these advancements, highlighting innovations that span unlearning, federated learning, secure inference, and domain adaptation.
The Big Idea(s) & Core Innovations
At the heart of recent developments lies a drive to build more resilient and ethical AI systems. One critical area is machine unlearning, the ability to remove specific data’s influence from a trained model. However, “Revisiting Privacy Leakage in Machine Unlearning: Membership Inference Beyond the Forgotten Set” by Stevens Institute of Technology and University of Connecticut researchers reveals a nuanced challenge: unlearning can paradoxically increase privacy risks for retained data. Their novel Tri-Class Unlearning Membership Inference Attack (TC-UMIA) effectively distinguishes forgotten, retained, and unseen data by leveraging model predictions before and after unlearning, achieving up to 95.6% accuracy. This highlights that true privacy requires holistic strategies, not just data deletion.
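The tri-class intuition behind TC-UMIA can be illustrated with a toy sketch: treat each sample’s model confidence before and after unlearning as a two-dimensional feature, then classify it as forgotten, retained, or unseen. The confidence distributions and the nearest-centroid classifier below are illustrative assumptions, not the paper’s actual attack.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-sample confidences (before, after) unlearning:
# forgotten data drops sharply after unlearning, retained data stays
# confident, and unseen data was never confident. All parameters are made up.
n = 200
forgotten = np.column_stack([rng.normal(0.90, 0.05, n), rng.normal(0.30, 0.10, n)])
retained  = np.column_stack([rng.normal(0.90, 0.05, n), rng.normal(0.85, 0.05, n)])
unseen    = np.column_stack([rng.normal(0.40, 0.10, n), rng.normal(0.40, 0.10, n)])

X = np.vstack([forgotten, retained, unseen])
y = np.repeat([0, 1, 2], n)  # 0 = forgotten, 1 = retained, 2 = unseen

# Nearest-centroid "tri-class" attack: assign each sample to the closest
# class centroid in (confidence_before, confidence_after) space.
centroids = np.stack([X[y == c].mean(axis=0) for c in range(3)])
pred = np.argmin(((X[:, None, :] - centroids[None]) ** 2).sum(axis=-1), axis=1)
accuracy = (pred == y).mean()
```

On cleanly separated toy distributions like these, the three classes are almost perfectly distinguishable, mirroring the paper’s core observation that pre/post-unlearning behavior leaks membership information about data the unlearning was never meant to touch.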
Complementing this, Brac University researchers in “Machine Unlearning for Class Removal through SISA-based Deep Neural Network Architectures” propose a modified SISA (Sharded, Isolated, Sliced, and Aggregated) framework for efficient class-level unlearning in CNNs. By using sequential class-level slicing, a reinforced replay mechanism, and a gating network, they achieve exact unlearning for specific classes while preserving overall model performance and drastically reducing retraining costs. This provides a practical path towards GDPR compliance.
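The efficiency argument in SISA-style unlearning hinges on bookkeeping: if each class’s data lands in known shards, removing a class only requires retraining those shards rather than the whole ensemble. The `SISAEnsemble` class and its modulo sharding rule below are hypothetical simplifications for illustration, not the authors’ architecture.

```python
from collections import defaultdict

class SISAEnsemble:
    """Minimal SISA-style shard bookkeeping (illustrative sketch only)."""

    def __init__(self, num_shards):
        self.num_shards = num_shards
        self.shard_classes = defaultdict(set)  # shard id -> classes it trained on

    def assign(self, label, idx):
        # Placeholder sharding rule; a real system would use class-level
        # slicing so each class concentrates in as few shards as possible.
        shard = idx % self.num_shards
        self.shard_classes[shard].add(label)
        return shard

    def shards_to_retrain(self, forgotten_class):
        # Exact unlearning of a class touches only shards that saw it.
        return [s for s, classes in self.shard_classes.items()
                if forgotten_class in classes]

ens = SISAEnsemble(num_shards=4)
for i, label in enumerate([0, 1, 2, 0, 1, 2, 3, 3]):
    ens.assign(label, i)

affected = ens.shards_to_retrain(3)  # only shards holding class 3
```

The fewer shards a class spans, the cheaper its removal, which is the motivation for the paper’s sequential class-level slicing.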
Federated Learning (FL) continues to be a cornerstone for privacy-preserving distributed AI. The University of Texas at Arlington and Mississippi State University introduce “PERFECT: Personalized Federated Learning for CBRS Radar Detection”, a framework enabling geographically dispersed Environmental Sensing Capability (ESC) sensors to collaboratively train radar detection models without sharing raw spectral data. Their innovation lies in federated personalization (shared base layers with private, personalized heads) to maintain an FCC-mandated 99% recall even with non-IID (not independent and identically distributed) data. This concept of hybrid global-local models is further explored by Avignon University in “FedPLT: Scalable, Resource-Efficient, and Heterogeneity-Aware Federated Learning via Partial Layer Training”. FedPLT lets clients train only a subset of layers based on their available resources, reducing trainable parameters by 71–82% while achieving performance comparable to full-model training, which is crucial for resource-constrained IoT environments.
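The “shared base, private head” split used by PERFECT can be sketched in a few lines: the server averages only the base weights, and each client’s head never leaves the device. The layer shapes and the single FedAvg round below are illustrative assumptions, not the paper’s model.

```python
import numpy as np

rng = np.random.default_rng(42)

# Each client holds a shared "base" matrix and a private "head" vector.
clients = [{"base": rng.normal(size=(4, 4)), "head": rng.normal(size=4)}
           for _ in range(3)]

# Server round: FedAvg over the base layers only.
global_base = np.mean([c["base"] for c in clients], axis=0)
for c in clients:
    c["base"] = global_base.copy()  # broadcast the averaged shared layers
    # c["head"] stays local, preserving per-client personalization.

bases_equal = all(np.allclose(c["base"], global_base) for c in clients)
heads_distinct = not np.allclose(clients[0]["head"], clients[1]["head"])
```

After the round, every client shares the same base while retaining a distinct head, which is exactly what lets each ESC sensor adapt to its local radar environment without exposing local data.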
Addressing data heterogeneity and security in FL, “Sample Selection Using Multi-Task Autoencoders in Federated Learning with Non-IID Data” by Gebze Technical University proposes multi-task autoencoders combined with unsupervised outlier detection (like One-Class SVM) to filter noisy or malicious samples on client devices, boosting accuracy by up to 7.02% under 40% noise. Meanwhile, Southeast University introduces “CO-EVO: Co-evolving Semantic Anchoring and Style Diversification for Federated DG-ReID”, a novel federated framework for domain generalization in person re-identification that tackles semantic-style conflict. CO-EVO combines camera-invariant semantic anchoring with global style diversification to learn robust features across diverse camera styles, outperforming prior methods by +2.0% mAP.
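Client-side sample selection of this kind can be approximated with any reconstruction-based outlier detector. The sketch below uses a linear (PCA) autoencoder and an error-percentile cutoff as stand-ins for the paper’s multi-task autoencoders and One-Class SVM; the data, rank, and threshold are synthetic choices, not the paper’s settings.

```python
import numpy as np

rng = np.random.default_rng(1)
clean = rng.normal(0, 1, size=(100, 8))
noisy = rng.normal(0, 5, size=(10, 8))  # simulated corrupted/malicious samples
X = np.vstack([clean, noisy])

# Linear autoencoder via truncated SVD: reconstruct from the top-k components.
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
k = 4
recon = Xc @ Vt[:k].T @ Vt[:k]
err = np.linalg.norm(Xc - recon, axis=1)  # per-sample reconstruction error

# Keep the lowest-error samples; flag the rest as outliers before training.
threshold = np.percentile(err, 90)
keep = err <= threshold
```

Corrupted samples reconstruct poorly and concentrate above the cutoff, so the client trains on a cleaner local set before its update ever reaches the server.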
Beyond federated training, privacy in inference and data utility is also evolving. Seoul National University presents “A Target-Free Harmonization Method for MRI”, TgtFreeHarmony, which harmonizes MRI data across different scanners without requiring access to target domain data, a critical breakthrough for multi-center medical studies with stringent privacy rules. It uses disentanglement-based generators and Bayesian optimization to estimate and synthesize target styles. For clinical decision support, Anurag University’s “Federated Semi-Supervised Graph Neural Networks with Prototype-Guided Pseudo-Labeling for Privacy-Preserving Gestational Diabetes Mellitus Prediction” (FedTGNN-SS) offers a framework for GDM prediction, combining prototype-guided pseudo-labeling with privacy-safe prototype sharing. This allows hospitals to collaborate and improve models even with limited labeled data, without sharing sensitive patient EHRs.
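Prototype sharing is privacy-friendly because only per-class mean embeddings cross institutional boundaries. A minimal sketch of the pseudo-labeling step, with toy embeddings and two hypothetical hospitals (the dimensions and cluster centers are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(7)

def local_prototypes(X, y, num_classes=2):
    # Per-class mean embeddings: the only thing a hospital shares.
    return np.stack([X[y == c].mean(axis=0) for c in range(num_classes)])

# Two hospitals with 2-class toy embeddings centered at -2 and +2.
hospital_protos = []
for _ in range(2):
    X0 = rng.normal(-2, 0.5, size=(20, 3))
    X1 = rng.normal(+2, 0.5, size=(20, 3))
    X = np.vstack([X0, X1])
    y = np.repeat([0, 1], 20)
    hospital_protos.append(local_prototypes(X, y))

# Server aggregates prototypes; raw patient records never leave the sites.
global_protos = np.mean(hospital_protos, axis=0)

# A hospital pseudo-labels unlabeled samples by the nearest global prototype.
unlabeled = rng.normal(+2, 0.5, size=(5, 3))
dists = np.linalg.norm(unlabeled[:, None] - global_protos[None], axis=2)
pseudo_labels = dists.argmin(axis=1)
```

Because the unlabeled samples sit near the class-1 center, they all receive pseudo-label 1, letting scarce labels propagate across hospitals without sharing EHRs.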
Finally, the survey “Split and Aggregation Learning for Foundation Models Over Mobile Embodied AI Network (MEAN): A Comprehensive Survey” by Qianzhou Chen et al. provides a holistic view of Split Learning (SL) and Aggregation Learning (AL) for large foundation models in 6G networks. It highlights how SL enhances privacy by transmitting intermediate activations instead of raw data, while AL optimizes model updates, paving the way for ubiquitous, privacy-preserving AI.
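The privacy argument for split learning is mechanical: only intermediate activations traverse the network, never the raw input. A minimal numpy sketch of a two-party forward pass (layer sizes and the ReLU cut point are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(3)

W_client = rng.normal(size=(16, 8))  # on-device layers
W_server = rng.normal(size=(8, 4))   # server-side layers

def client_forward(x):
    # Only these activations are transmitted upstream.
    return np.maximum(x @ W_client, 0)

def server_forward(h):
    # The server completes the forward pass from activations alone.
    return h @ W_server

x = rng.normal(size=(2, 16))  # private raw data: stays on the device
h = client_forward(x)          # the only tensor that crosses the network
out = server_forward(h)
```

The same split applies to foundation models, where the client-side slice can be a few embedding layers and the server hosts the heavy transformer blocks.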
Under the Hood: Models, Datasets, & Benchmarks
These innovations are often built upon robust models, diverse datasets, and rigorous benchmarks:
- Privacy-Preserving ML Framework: “A Privacy-Preserving Machine Learning Framework for Edge Intelligence: An Empirical Analysis” evaluates Differential Privacy (DP), Secure Multi-party Computation (SMC), and Fully Homomorphic Encryption (FHE) on real implementations and trace-based simulations. It utilizes TensorFlow Privacy, CrypTen, and Concrete-ML (TFHE) with AlexNet and LeNet-5 models, and the UEA & UCR Time Series Classification Archive and EdgeSimPy simulation toolkit.
- Federated Learning for Security: “DeTrigger: A Gradient-Centric Approach to Backdoor Attack Mitigation in Federated Learning” analyzes backdoor attacks using CIFAR-10, CIFAR-100, GTSRB, STL-10 datasets. “Toward Resilient 5G Networks: Comparative Analysis of Federated and Centralized Learning for RF Jamming Detection” uses 1DCNN models on real-world 5G RF signals (band n71) captured via ADALM-PLUTO SDR.
- Bioacoustics & Model Merging: “Ecologically-Constrained Task Arithmetic for Multi-Taxa Bioacoustic Classifiers Without Shared Data” leverages the BEATs iter3+ AS2M pretrained audio encoder and datasets like BirdCLEF, Watkins Marine Mammal Sound Database, and AnuraSet.
- Clinical NLP & LLMs: “BIT.UA-AAUBS at ArchEHR-QA 2026: Evaluating Open-Source and Proprietary LLMs via Prompting in Low-Resource QA” benchmarked GPT-4.1, Claude Sonnet, Gemini series, MedGemma 3 27B, Qwen3, and Llama 3.1 on the ArchEHR-QA 2026 shared task. Code is available at https://github.com/bioinformatics-ua/ArchEHR-QA-2026.
- Text-to-SQL for SLMs: “FINER-SQL: Boosting Small Language Models for Text-to-SQL” utilizes Group Relative Policy Optimization with fine-grained rewards on BIRD and Spider benchmarks. Code is open-sourced at https://github.com/thanhdath/finer-sql.
- Multi-Camera Surveillance: “Heterogeneous Model Fusion for Privacy-Aware Multi-Camera Surveillance via Synthetic Domain Adaptation” uses Probabilistic Faster R-CNN with a diffusion-based generation module on Cityscapes, BDD100K, KITTI, Sim10k, and Foggy Cityscapes datasets, built on the Detectron2 framework.
- MRI Harmonization: TgtFreeHarmony (https://arxiv.org/pdf/2605.01282) used OASIS-3 and SRPBS datasets. Code is at https://github.com/SNU-LIST/TgtFreeHarmony.git.
- Edge LLM Benchmarking: “Cloud to Edge: Benchmarking LLM Inference On Hardware-Accelerated Single-Board Computers” provides a multi-dimensional evaluation framework for LLMs on Raspberry Pi 5 + Hailo-10H, Jetson Orin Nano, and M5Stack AX630c platforms. The benchmarking suite is at https://github.com/SquidyBallinx11011/LLM-Edge-Benchmarking-Suite.
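Among the primitives benchmarked in the edge-intelligence study above, Differential Privacy via DP-SGD (the mechanism behind TensorFlow Privacy) is the lightest-weight. Its core step, per-example gradient clipping plus calibrated Gaussian noise, can be sketched as follows; the hyperparameters are illustrative, not taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

def dp_sgd_step(per_example_grads, clip_norm=1.0, noise_mult=1.1):
    # Clip each example's gradient to bound its influence on the update.
    norms = np.linalg.norm(per_example_grads, axis=1, keepdims=True)
    clipped = per_example_grads * np.minimum(1.0, clip_norm / np.maximum(norms, 1e-12))
    # Sum, add Gaussian noise scaled to the clipping bound, then average.
    summed = clipped.sum(axis=0)
    noise = rng.normal(0, noise_mult * clip_norm, size=summed.shape)
    return (summed + noise) / len(per_example_grads)

grads = rng.normal(size=(32, 10))  # stand-in per-example gradients
update = dp_sgd_step(grads)
```

Clipping bounds each sample’s sensitivity and the noise multiplier governs the privacy budget, which is why the study can trade accuracy against epsilon on the same AlexNet and LeNet-5 workloads.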
Impact & The Road Ahead
These advancements herald a new era for privacy-preserving AI. The nuanced understanding of unlearning’s impact on retained data, as revealed by TC-UMIA, is crucial for developing truly secure deletion mechanisms. Practical frameworks like modified SISA and FedPLT demonstrate that privacy and efficiency don’t have to be mutually exclusive, enabling real-world deployments in areas like autonomous systems, surveillance, and healthcare.
The rise of personalized federated learning (PERFECT), adaptive split learning for LLMs (SplitFT), and target-free domain adaptation (TgtFreeHarmony) addresses the persistent challenges of data heterogeneity and multi-institutional collaboration. These solutions promise more robust, equitable, and widely adoptable AI systems. Furthermore, initiatives like DeRelayL (https://arxiv.org/pdf/2605.02935), a blockchain-based decentralized relay learning paradigm from Shenzhen MSU-BIT University, aim to democratize AI training, allowing common users to contribute to and even own parts of large models. This vision, alongside the integration of distributed AI with emerging 6G technologies, promises a future where powerful AI is accessible, collaborative, and inherently privacy-aware.