Data Privacy in the Age of AI: Safeguarding Information in Federated Learning, Language Models, and Beyond
Latest 24 papers on data privacy: Mar. 21, 2026
Data privacy has become a paramount concern in the rapidly evolving landscape of AI and Machine Learning. As models become more powerful and data-hungry, ensuring sensitive information remains protected is a critical challenge. Recent research showcases exciting breakthroughs, offering innovative solutions across diverse domains, from securing distributed learning systems to enhancing the privacy and fairness of large language models (LLMs).
The Big Idea(s) & Core Innovations
The central theme uniting many of these advancements is the quest for robust, scalable, and privacy-preserving AI systems. Federated Learning (FL) stands out as a key paradigm, enabling collaborative model training without centralizing raw data. However, FL itself presents vulnerabilities, such as data poisoning. In “A Model Consistency-Based Countermeasure to GAN-Based Data Poisoning Attack in Federated Learning”, researchers from University of Example and Institute of Advanced Research leverage model consistency to detect and mitigate GAN-based data poisoning, offering a practical defense for securing distributed learning.
Building on FL’s promise, several papers tackle its efficiency and personalization challenges. Eman M. AbouNassar, Amr Elshall, and Sameh Abdulah from Menoufia University introduce “FedPBS: Proximal-Balanced Scaling Federated Learning Model for Robust Personalized Training for Non-IID Data”. FedPBS combines batch-size-aware scaling with selective proximal regularization to handle non-IID (non-independent and identically distributed) data and client heterogeneity, ensuring robust performance across diverse environments. Similarly, Ping Guo et al. from City University of Hong Kong and Hong Kong Metropolitan University present “Few-for-Many Personalized Federated Learning” (FedFew), which reformulates personalized federated learning as a few-for-many optimization problem, dramatically improving scalability while maintaining personalization with a small number of shared server models.
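To make the proximal-plus-scaling idea concrete, here is a toy sketch: a FedProx-style local step that pulls client weights toward the global model, plus batch-size-weighted aggregation. The scaling rule and hyperparameters are assumptions for illustration, not FedPBS's published algorithm:

```python
import numpy as np

def local_step(w, w_global, grad, lr=0.1, mu=0.01):
    """One FedProx-style local update: a gradient step plus a proximal
    pull toward the global model, i.e. min_w f(w) + (mu/2)||w - w_global||^2."""
    return w - lr * (grad + mu * (w - w_global))

def aggregate(client_models, batch_sizes):
    """Batch-size-aware weighted averaging (this weighting is an
    assumption; FedPBS's actual scaling rule may differ)."""
    weights = np.asarray(batch_sizes, dtype=float)
    weights /= weights.sum()
    return sum(w_i * m for w_i, m in zip(weights, client_models))

clients = [np.array([1.0, 0.0, 0.0]), np.array([0.0, 2.0, 0.0])]
print(aggregate(clients, batch_sizes=[32, 96]))  # larger batch → larger weight
```

The proximal term `mu * (w - w_global)` is what keeps heterogeneous clients from drifting too far apart between aggregation rounds.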
Efficiency in FL is further enhanced by adaptive mechanisms. In “Adaptive Deadline and Batch Layered Synchronized Federated Learning”, authors propose a dynamic approach to deadlines and batch sizes, significantly reducing communication overhead without sacrificing model accuracy. Addressing the need for data removal in FL, Minh-Duong Nguyen et al. from VinUniversity, University of Sydney, and Trinity College Dublin introduce “Computation and Communication Efficient Federated Unlearning via On-server Gradient Conflict Mitigation and Expression” (FOUL). FOUL enables efficient unlearning of specific client data from FL models without costly retraining, preserving privacy and communication efficiency.
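Gradient conflict mitigation is often implemented by projecting one objective's gradient out of another's when they point in opposing directions. A hedged sketch in the spirit of PCGrad-style projection — FOUL's actual on-server procedure may differ:

```python
import numpy as np

def resolve_conflict(g_retain, g_forget):
    """If the 'retain' gradient conflicts with the 'forget' gradient
    (negative dot product), subtract the conflicting component so that
    unlearning the target data does not degrade retained knowledge."""
    dot = g_retain @ g_forget
    if dot < 0:
        g_retain = g_retain - dot / (g_forget @ g_forget) * g_forget
    return g_retain

g_retain = np.array([1.0, 1.0])
g_forget = np.array([-1.0, 0.0])   # pulls the opposite way on coordinate 0
print(resolve_conflict(g_retain, g_forget))  # → [0. 1.]
```

After projection, the retain gradient is orthogonal to the forget gradient, so the unlearning step no longer fights the retention objective.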
The application of privacy-preserving FL extends into critical sectors. In healthcare, Yuanfang Ren et al. from the University of Florida demonstrate the power of “Federated Learning with Multi-Partner OneFlorida+ Consortium Data for Predicting Major Postoperative Complications”. This study showcases FL models that predict postoperative complications and mortality with strong generalizability, all while meticulously preserving patient data privacy across multiple institutions. Beyond healthcare, Wei Li, Yaxin Zhang, and Yuxuan Wang from Tsinghua University propose “DeFRiS: Silo-Cooperative IoT Applications Scheduling via Decentralized Federated Reinforcement Learning” for IoT application scheduling. DeFRiS enables decentralized, silo-cooperative resource management across distributed edge nodes, reducing central coordination and enhancing efficiency in privacy-sensitive IoT environments. In 6G networks, Xiaoyu Zhang et al., also from Tsinghua University, introduce “SliceFed: Federated Constrained Multi-Agent DRL for Dynamic Spectrum Slicing in 6G”, providing a framework for decentralized and collaborative spectrum slicing that improves resource utilization and reduces latency.
Moving beyond FL, secure data handling is paramount in other AI contexts. Seydina Ousmane Diallo et al. from Télecom SudParis propose “Enabling Multi-Client Authorization in Dynamic SSE” (MASSE), a dynamic multi-client Searchable Symmetric Encryption (SSE) scheme. MASSE allows fine-grained access control and efficient revocation in encrypted databases, enabling secure, dynamic multi-client searches without revealing keywords or attributes to the server.
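The core SSE primitive is deterministic keyword tokens: the client derives a keyed token per keyword, so the server can match tokens against an encrypted index without ever seeing plaintext keywords. A minimal illustration of that primitive alone — it omits MASSE's multi-client authorization, attribute-based access control, and revocation entirely:

```python
import hmac
import hashlib

def search_token(key: bytes, keyword: str) -> bytes:
    """Deterministic keyword token: the server can match tokens against
    a stored index without learning the keyword itself."""
    return hmac.new(key, keyword.encode(), hashlib.sha256).digest()

def build_index(key: bytes, docs: dict) -> dict:
    """Map each keyword token to the ids of documents containing it."""
    index = {}
    for doc_id, words in docs.items():
        for w in words:
            index.setdefault(search_token(key, w), []).append(doc_id)
    return index

key = b"client-secret-key"
docs = {1: {"privacy", "federated"}, 2: {"privacy", "slicing"}}
index = build_index(key, docs)
print(sorted(index[search_token(key, "privacy")]))  # → [1, 2]
```

The server stores only HMAC digests and document ids; without the client's key it cannot invert a token back to its keyword, which is exactly the property multi-client schemes like MASSE then extend with per-client authorization.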
LLMs, while revolutionary, come with their own set of privacy and safety concerns. Frank P-W. Lo et al. introduce “AutoScreen-FW: An LLM-based Framework for Resume Screening”, leveraging LLMs for automated resume screening with integrated personas and evaluation metrics. This framework highlights how structured LLM use can improve objectivity and efficiency in HR workflows by reducing human bias. For ensuring LLM safety and compliance, Wenbin Hu et al. from Hong Kong University of Science and Technology present “OmniCompliance-100K: A Multi-Domain, Rule-Grounded, Real-World Safety Compliance Dataset”. This dataset fills a crucial gap by providing real-world, rule-based safety compliance data across multiple domains, critical for aligning LLMs with legal and ethical standards.
Furthermore, researchers are exploring how to improve LLM utility while managing risks. Author 1 and Author 2 from MindSLab, Tsinghua University and Stanford University introduce “An Agentic System for Schema Aware NL2SQL Generation”, which combines Small Language Models (SLMs) with LLMs for cost-efficient and accurate Natural Language to SQL generation, improving schema handling and reducing inference costs. The paper “PMAx: An Agentic Framework for AI-Driven Process Mining” by A. Antonov et al. from University of Lugano introduces a secure, end-to-end workflow for process mining that lets non-technical users analyze event logs through natural language queries without compromising data privacy or security.
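The SLM-plus-LLM pattern typically routes each query to the cheap small model first and escalates only when confidence is low. A toy sketch of that routing logic, with entirely hypothetical model interfaces — the paper's agentic system is more elaborate than this:

```python
def route_query(question, slm, llm, confidence_threshold=0.8):
    """Try the cheap small model first; escalate to the large model only
    when the SLM's self-reported confidence falls below the threshold.
    Both model interfaces are hypothetical stand-ins."""
    sql, confidence = slm(question)
    if confidence >= confidence_threshold:
        return sql, "slm"
    return llm(question), "llm"

# Toy stand-ins: the SLM is confident only on short, simple questions.
slm = lambda q: ("SELECT * FROM users;", 0.9 if len(q.split()) < 8 else 0.3)
llm = lambda q: "SELECT u.name FROM users u JOIN orders o ON u.id = o.user_id;"

print(route_query("list all users", slm, llm)[1])  # → slm
print(route_query("which users placed more than three orders last month",
                  slm, llm)[1])                    # → llm
```

The cost savings come from the fact that most routine queries never reach the large model; only ambiguous or schema-heavy questions pay the LLM's inference price.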
Under the Hood: Models, Datasets, & Benchmarks
Recent research heavily relies on specialized models and datasets to drive innovation and rigorously evaluate privacy-preserving AI:
- Federated Learning Algorithms: FedPBS integrates techniques from FedProx and FedBS for robust training on non-IID data. FedFew focuses on few-for-many multi-objective optimization to scale personalized FL. FOUL introduces a two-stage FUL algorithm combining learning-to-unlearn (L2U) via model disentanglement and on-server gradient matching. FEDOT employs orthogonal transformations to achieve dual privacy in black-box FL.
- Benchmarks for FL: The paper “Benchmarking Federated Learning in Edge Computing Environments: A Systematic Review and Performance Evaluation” highlights FEMNIST and Shakespeare datasets as key for evaluating FL in edge environments due to high data heterogeneity. FOUL introduces a novel benchmarking approach using domain generalization datasets like PACS, VLCS, and OfficeHome, along with the ‘time-to-forget’ metric.
- Privacy-Preserving Infrastructure: MASSE (from “Enabling Multi-Client Authorization in Dynamic SSE”) is a dynamic multi-client Searchable Symmetric Encryption (SSE) scheme designed for attribute-based access control and efficient revocation in encrypted databases.
- LLM Evaluation & Datasets: AutoScreen-FW (from “AutoScreen-FW: An LLM-based Framework for Resume Screening”) uses LLM personas and structured evaluation metrics. OmniCompliance-100K (“OmniCompliance-100K: A Multi-Domain, Rule-Grounded, Real-World Safety Compliance Dataset”) is a crucial new dataset for aligning LLMs with real-world safety and ethical regulations. Author A and Author B in “TrainDeeploy: Hardware-Accelerated Parameter-Efficient Fine-Tuning of Small Transformer Models at the Extreme Edge” explore efficient fine-tuning of small transformer models for edge deployment, leveraging hardware acceleration for performance. The work “There Are No Silly Questions: Evaluation of Offline LLM Capabilities from a Turkish Perspective” emphasizes context-aware and culturally relevant evaluation metrics for offline LLM performance, specifically for the Turkish language.
- Code Repositories: Many projects emphasize reproducibility: DeFRiS code is available at https://github.com/DeFRiS-Team/DeFRiS. PMAx has an open-source implementation within the ProMoAI tool suite at https://github.com/fit-process-mining/PMAx-evaluation. FEDOT’s code can be found at https://github.com/SeoulNationalUniversity/FEDOT. The agentic system for NL2SQL generation is available at https://github.com/mindslab25/CESMA. Builtjes and Hering provide a reproducible, open-source pipeline for clinical text extraction using LLMs at https://github.com/builtjes/llm_extractinator.
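The ‘time-to-forget’ metric mentioned above can be pictured as the number of unlearning rounds until accuracy on the forgotten data drops to chance level. A hedged sketch of one plausible definition — FOUL's exact formulation may differ:

```python
def time_to_forget(forget_accuracy_per_round, random_baseline=0.1, tol=0.02):
    """Rounds of unlearning until accuracy on the forget set falls to the
    random-guess baseline (e.g., 0.1 for a 10-class task). This definition
    is an illustrative assumption, not FOUL's published metric."""
    for round_idx, acc in enumerate(forget_accuracy_per_round, start=1):
        if acc <= random_baseline + tol:
            return round_idx
    return None  # never forgotten within the logged rounds

accs = [0.92, 0.55, 0.30, 0.11, 0.10]
print(time_to_forget(accs))  # → 4
```

A smaller value means the unlearning procedure erases the target clients' influence in fewer rounds, and therefore with less communication overhead.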
Impact & The Road Ahead
These advancements collectively paint a vibrant picture of a future where AI systems are not only intelligent but also inherently secure and privacy-aware. Federated learning, in particular, is proving its worth as a cornerstone for privacy-preserving AI, enabling collaboration in sensitive domains like healthcare and IoT without compromising user data. The development of robust defense mechanisms against data poisoning, along with techniques for efficient unlearning and personalized training, directly addresses critical concerns for real-world FL deployment.
The progress in LLMs, from rule-grounded safety datasets like OmniCompliance-100K to agentic systems for complex tasks, highlights a concerted effort to build more responsible and ethically aligned AI. The emphasis on transparency, as discussed in “Consumer Rights and Algorithms” by D McNair and Gregory M Dickinson, is crucial, stressing the need for frameworks that ensure algorithmic accountability and fairness. Even in areas like human-computer interaction, as shown by “Memory Printer: Exploring Everyday Reminiscing by Combining Slow Design with Generative AI-based Image Creation” by Zhou Fang and Janet Yi-Ching Huang, the ethical implications of AI, such as false memories and algorithmic bias, are being carefully considered.
The implications of this research are far-reaching. Secure FL will accelerate AI adoption in highly regulated industries, while robust LLM frameworks will pave the way for more reliable and safer human-AI interactions. Continued work on optimal algorithms, such as the enumeration of Eulerian trails in “Optimal Enumeration of Eulerian Trails in Directed Graphs” by Ben Bals et al., also supplies foundational building blocks for handling large-scale data with privacy in mind. The path forward involves continued interdisciplinary research, bridging cryptography, machine learning, and human-centered design to ensure that AI’s power is harnessed responsibly, ethically, and with an unwavering commitment to data privacy.