Differential Privacy: From Robust Algorithms to Guarding LLMs and Causal Inference
Latest 19 papers on differential privacy: Jun. 13, 2026
Differential Privacy (DP) has emerged as the gold standard for quantifying and enforcing privacy in data analysis and machine learning. As AI systems become more ubiquitous and data-hungry, ensuring that sensitive information remains confidential without compromising utility is paramount. Recent research showcases exciting advancements, pushing the boundaries of DP to address novel challenges in algorithm design, large language models, synthetic data generation, and even causal inference. Let’s dive into some of the latest breakthroughs.
The Big Idea(s) & Core Innovations
The core challenge in differential privacy often lies in balancing robust privacy guarantees with practical utility and computational efficiency. Several recent papers tackle this multifaceted problem from different angles.
For instance, the Gaussian mechanism has long been a practical default in DP, and it now has theoretical backing. Yu Wei, Alexander Bienstock, and Antigoni Polychroniadou (Georgia Institute of Technology, JPMorgan AI Research & AlgoCRYPT CoE), in their paper “Asymptotic Optimality of the High-Dimensional Gaussian Mechanism and Improved Low-Dimensional Mechanisms for Differential Privacy”, prove that it is asymptotically optimal in high dimensions for strong privacy settings. In other words, for high-dimensional data in that regime, the Gaussian mechanism is asymptotically the best additive-noise mechanism available, a crucial validation for practitioners. They also introduce the Spherical Generalized Gamma (SGG) mechanisms, which can achieve up to 15% lower Mean Squared Error (MSE) in low-dimensional settings.
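For readers who have not implemented it, here is a minimal sketch of the Gaussian mechanism itself. It assumes the classical textbook calibration σ = Δ₂ · sqrt(2 ln(1.25/δ)) / ε, which gives (ε, δ)-DP for ε in (0, 1). The function names are illustrative, and neither this calibration nor the code is taken from the paper (the SGG mechanisms are not shown).

```python
import math
import random


def gaussian_sigma(epsilon, delta, l2_sensitivity):
    """Noise scale from the classical calibration (valid for 0 < epsilon < 1)."""
    if not 0.0 < epsilon < 1.0:
        raise ValueError("classical calibration requires 0 < epsilon < 1")
    if not 0.0 < delta < 1.0:
        raise ValueError("delta must lie in (0, 1)")
    return math.sqrt(2.0 * math.log(1.25 / delta)) * l2_sensitivity / epsilon


def gaussian_mechanism(values, epsilon, delta, l2_sensitivity, rng=None):
    """Add i.i.d. Gaussian noise to every coordinate of a query answer."""
    rng = rng or random.Random()
    sigma = gaussian_sigma(epsilon, delta, l2_sensitivity)
    return [v + rng.gauss(0.0, sigma) for v in values]


# Usage: privatize a 4-dimensional query answer with L2 sensitivity 1.
noisy_answer = gaussian_mechanism(
    [10.0, 20.0, 30.0, 40.0], epsilon=0.5, delta=1e-5, l2_sensitivity=1.0
)
```

Note that the noise scale depends only on the L2 sensitivity, not on the dimension, which is part of why the mechanism is so convenient for high-dimensional outputs.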
Bridging the gap between theoretical DP parameters and practical deployment, Bogdan Kulynych and Antti Honkela (Lausanne University Hospital, University of Helsinki), in “On Choosing the μ Parameter in Gaussian Differential Privacy”, provide principled mappings from pure DP (ε-DP) to Gaussian DP (μ-GDP). Their key insight is that μ ≈ ε/5 serves as a conservative general-purpose conversion, offering concrete guidance for calibrating privacy parameters based on worst-case membership inference attack success.
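To make the μ parameter more tangible, the sketch below uses two standard facts about Gaussian DP rather than anything specific to the paper. A μ-GDP mechanism satisfies (ε, δ(ε))-DP with δ(ε) = Φ(−ε/μ + μ/2) − e^ε Φ(−ε/μ − μ/2), and the best possible membership inference advantage (true positive rate minus false positive rate) is 2Φ(μ/2) − 1. The helper `mu_rule_of_thumb` simply encodes the μ ≈ ε/5 heuristic quoted above; all function names are hypothetical.

```python
import math


def std_normal_cdf(x):
    """Standard normal CDF, written with math.erf."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def gdp_delta(mu, epsilon):
    """delta(epsilon) such that a mu-GDP mechanism is (epsilon, delta)-DP."""
    return (std_normal_cdf(-epsilon / mu + mu / 2.0)
            - math.exp(epsilon) * std_normal_cdf(-epsilon / mu - mu / 2.0))


def max_mia_advantage(mu):
    """Largest TPR - FPR any membership inference attack can reach under mu-GDP."""
    return 2.0 * std_normal_cdf(mu / 2.0) - 1.0


def mu_rule_of_thumb(epsilon):
    """The conservative mu ~ epsilon / 5 conversion quoted in the text."""
    return epsilon / 5.0


# Usage: translate a pure-DP budget into mu, then inspect the attack bound.
mu = mu_rule_of_thumb(1.0)
advantage_bound = max_mia_advantage(mu)
```

Reading privacy budgets through the attack-advantage bound is often easier to communicate to non-specialists than a raw (ε, δ) pair.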
Robustness in DP is further enhanced by Kelly Ramsay (York University) in “Computationally tractable robust differentially private mean estimation”, which introduces the ‘balloon mean’. This novel estimator, utilizing iterative clipping over expanding Mahalanobis balls, not only satisfies zero-concentrated differential privacy but also provides robustness against heavy tails and adversarial contamination, achieving minimax-optimal convergence rates while remaining computationally tractable.
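The general pattern of clipping, adding noise, and then widening the clipping region can be sketched in a few lines. The code below is not the balloon mean: it is a deliberately simplified one-dimensional illustration in which intervals stand in for Mahalanobis balls. It assumes a public dataset size, replace-one neighbouring datasets, and a starting centre of 0, and it relies on two standard zCDP facts: Gaussian noise with σ = Δ / sqrt(2ρ) gives ρ-zCDP, and zCDP budgets add up under composition. The starting radius and growth factor are arbitrary illustrative choices.

```python
import math
import random


def clip(x, lo, hi):
    """Project x onto the interval [lo, hi]."""
    return max(lo, min(hi, x))


def noisy_clipped_mean(data, center, radius, rho, rng):
    """One rho-zCDP estimate of the mean after clipping to [center - r, center + r]."""
    lo, hi = center - radius, center + radius
    clipped_mean = sum(clip(x, lo, hi) for x in data) / len(data)
    # Replacing one record moves the clipped mean by at most 2r / n.
    sensitivity = 2.0 * radius / len(data)
    sigma = sensitivity / math.sqrt(2.0 * rho)
    return clipped_mean + rng.gauss(0.0, sigma)


def iterative_clipped_mean(data, rho, start_radius=4.0, growth=2.0,
                           rounds=3, rng=None):
    """Recentre on each noisy estimate while expanding the clipping radius."""
    rng = rng or random.Random()
    rho_per_round = rho / rounds  # zCDP composes additively across rounds
    center, radius = 0.0, start_radius
    for _ in range(rounds):
        center = noisy_clipped_mean(data, center, radius, rho_per_round, rng)
        radius *= growth
    return center


# Usage: 1001 points spread evenly over [0, 10] plus one extreme outlier.
sample = [5.0 + 0.01 * (i - 500) for i in range(1001)] + [1e6]
estimate = iterative_clipped_mean(sample, rho=0.5, rng=random.Random(0))
```

Because the outlier is clipped in every round, it can shift the estimate by at most the clipping radius divided by the sample size, which is the intuition behind robustness to contamination.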
Addressing the sensitive area of synthetic data generation (SDG), Paul Andrey et al. (Univ. Lille, Inria, CNRS, Centrale Lille) highlight a critical issue in “Disparate Impact in Synthetic Data Generation”. They formalize disparate impact in SDG, demonstrating that modeling errors and even DP mechanisms can lead to unfair utility across sensitive groups. Their work reveals that methods with broader hypothesis classes offer better utility and less disparate impact, though group-wise modeling can increase estimation errors under high privacy. This is further echoed by Steven Golob, Sikha Pentyala, and Martine De Cock (University of Washington Tacoma, Ghent University) in “SoK: Reconstruction Attacks on Synthetic Tabular Data (Insights from Winning the NIST CRC)”, which shows that the synthetic data generator choice, not the attack sophistication, is the dominant factor in privacy risk, and that DP mainly protects at small budgets (ε ≤ 1).
For tabular data specifically, Toan Tran et al. (Emory University, Microsoft Research) propose Tab-PE in “Differentially Private Synthetic Data via APIs 4: Tabular Data”. This method extends the Private Evolution framework to implicitly capture high-order correlations that state-of-the-art marginal-based DP methods often miss, outperforming baselines in utility and efficiency under strict privacy regimes.
Protecting LLM inference is tackled by Peihua Mai et al. (National University of Singapore et al.) with “SharedRequest: Privacy-Preserving Model-Agnostic Inference for Large Language Models”. This framework protects user prompts by mixing original queries with noisy variants at the batch level, achieving model-agnostic privacy without altering LLM parameters. This approach yields over 20% utility improvement compared to DP baselines and significantly reduces query costs.
In decentralized learning, Yunsheng Yuan et al. (Shandong University, Inspur Cloud Information Technology Co) introduce DPDL in “DPDL: Towards Differential Privacy Preservation in Decentralized Stochastic Learning on Non-IID Data”. DPDL applies DP via Gaussian noise to cross-gradients, calibrating them with cosine similarity to handle non-IID data heterogeneity, achieving linear speedup convergence and robust defense against gradient inversion attacks.
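The building blocks named here are easy to sketch, with the caveat that the summary above does not spell out how DPDL combines them. The code below shows standard L2 clipping with Gaussian noise for a gradient received from a neighbour, plus a plain cosine similarity. The `combine` function is a purely hypothetical weighting rule added for illustration, and none of these names come from the paper.

```python
import math
import random


def l2_norm(vec):
    return math.sqrt(sum(x * x for x in vec))


def clip_l2(grad, clip_norm):
    """Scale grad down so that its L2 norm is at most clip_norm."""
    norm = l2_norm(grad)
    scale = min(1.0, clip_norm / norm) if norm > 0 else 1.0
    return [g * scale for g in grad]


def privatize_gradient(grad, clip_norm, noise_multiplier, rng):
    """Clip, then add Gaussian noise with sigma = noise_multiplier * clip_norm."""
    sigma = noise_multiplier * clip_norm
    return [g + rng.gauss(0.0, sigma) for g in clip_l2(grad, clip_norm)]


def cosine_similarity(a, b):
    norm_a, norm_b = l2_norm(a), l2_norm(b)
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


def combine(local_grad, noisy_cross_grad):
    """Hypothetical rule: down-weight cross-gradients that disagree in direction."""
    weight = max(0.0, cosine_similarity(local_grad, noisy_cross_grad))
    return [l + weight * c for l, c in zip(local_grad, noisy_cross_grad)]


# Usage: a node privatizes a neighbour's gradient before mixing it in.
rng = random.Random(0)
cross = privatize_gradient([0.9, 0.1, -0.2], clip_norm=1.0,
                           noise_multiplier=0.5, rng=rng)
update = combine([1.0, 0.0, 0.0], cross)
```

The intuition for similarity-based weighting is that on non-IID data a neighbour's gradient may point in an unhelpful direction, so its influence is reduced when it disagrees with the local one.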
For complex relational databases, Andrew Cascio et al. (Duke University, Binghamton University, Penn State University) present “DP4SQL: Differentially Private SQL with Flexible Privacy Policies”. This system moves beyond rigid privacy models, allowing data curators to specify granular plausible deniability requirements for different entities and attributes, leading to more appropriate noise injection for SQL queries.
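Why the choice of protected entity matters for noise can be seen with a plain Laplace mechanism on a COUNT query. The sketch below is not DP4SQL's interface: the table, the `private_count` helper, and the assumed cap of three orders per customer are all hypothetical. Protecting a single order row gives the count sensitivity 1, while protecting a whole customer gives sensitivity equal to the maximum number of rows that customer can contribute.

```python
import random
import sqlite3


def laplace_noise(scale, rng):
    """Laplace(0, scale) sample as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_count(conn, sql, epsilon, sensitivity, rng=None):
    """Run a COUNT query and add Laplace noise with scale sensitivity / epsilon."""
    rng = rng or random.Random()
    (true_count,) = conn.execute(sql).fetchone()
    return true_count + laplace_noise(sensitivity / epsilon, rng)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, customer_id INTEGER)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [(1, 10), (2, 10), (3, 11), (4, 12), (5, 12), (6, 12)],
)

# Assumed (and, in a real system, enforced) cap on rows per customer.
MAX_ORDERS_PER_CUSTOMER = 3
query = "SELECT COUNT(*) FROM orders"
rng = random.Random(0)

# Policy A: hide any single order, so sensitivity is 1.
row_level = private_count(conn, query, epsilon=1.0, sensitivity=1, rng=rng)
# Policy B: hide any single customer, so sensitivity is the contribution cap.
customer_level = private_count(conn, query, epsilon=1.0,
                               sensitivity=MAX_ORDERS_PER_CUSTOMER, rng=rng)
```

A policy that is stricter than it needs to be pays for it in accuracy, which is the motivation for letting curators state exactly which entities and attributes require deniability. Production systems also need to guard against floating-point artifacts in noise sampling, which this sketch ignores.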
Beyond traditional DP, Takao Murakami et al. (ISM/ROIS/AIST/RIKEN AIP, UEC, AIST) introduce a new privacy notion, Fully Oblivious Differential Privacy (FODP), in “Fully Oblivious Differential Privacy for Frequency Estimation in the Augmented Shuffle Model with Trusted Processors”. FODP strengthens standard DP to prevent side-channel attacks from memory access patterns and control flows in Trusted Execution Environments (TEEs), offering a general framework with concrete algorithms for secure frequency estimation.
Finally, the intersection of privacy and causal inference is explored by Jian Yang et al. (Hong Kong University of Science and Technology et al.) in “Preserving Data Privacy in Learning Causal Structure with Fully Homomorphic Encryption”. They propose a privacy-preserving method for learning causal structures using Fully Homomorphic Encryption (FHE), demonstrating that FHE can overcome challenges like division and logarithm operations through approximation techniques, achieving high consistency with plaintext methods while keeping data encrypted.
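The reason division and logarithms are awkward under FHE is that common schemes natively support only additions and multiplications on ciphertexts. A usual workaround is to replace such functions with iterations or polynomials built from those two operations. The sketch below shows two textbook approximations on plain floats standing in for ciphertexts: Newton's iteration for a reciprocal and a truncated Taylor series for the logarithm near 1. These are illustrative choices, not necessarily the approximations used in the paper, and no FHE library is called.

```python
def approx_reciprocal(a, initial_guess, iterations=6):
    """Newton iteration y <- y * (2 - a * y); uses only multiply and subtract.

    Converges to 1 / a when 0 < a * initial_guess < 2.
    """
    y = initial_guess
    for _ in range(iterations):
        y = y * (2.0 - a * y)
    return y


def approx_log(x, terms=10):
    """Truncated Taylor series of ln(x) around 1, accurate for x close to 1.

    The coefficients (-1)^(k+1) / k are plaintext constants, so evaluating
    the series needs only ciphertext additions and multiplications.
    """
    u = x - 1.0
    power = u
    total = 0.0
    for k in range(1, terms + 1):
        total += ((-1.0) ** (k + 1) / k) * power
        power *= u
    return total


# Usage: a division a / b becomes a * approx_reciprocal(b, guess).
quotient = 3.0 * approx_reciprocal(4.0, initial_guess=0.1)
log_value = approx_log(1.2)
```

Both approximations are only valid on a bounded input range, so in practice inputs need to be rescaled into that range before encryption, and each extra iteration or series term consumes multiplicative depth.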
Under the Hood: Models, Datasets, & Benchmarks
The innovations discussed rely on a diverse array of models, datasets, and benchmarks, showcasing the broad applicability and rigorous evaluation of these DP techniques.
- Language Models: The “Benchmarking Empirical Privacy Protection for Adaptations of Large Language Models” paper by Bartłomiej Marek et al. (CISPA Helmholtz Center) extensively uses Pythia models (1.4B, 1B, 410M, etc.), GPT-Neo, and OLMo 1B/2 1B on datasets like Pile, Bookcorpus2, GitHub, Enron Emails, and SAMSum. It highlights that LoRA (Low-Rank Adaptation) offers the best privacy-utility trade-offs for out-of-distribution data. The SharedRequest framework also utilizes Qwen2.5-0.5B/1.5B discriminator models with Legal-QA, Medical-QA, and MMLU-Biz datasets.
- Synthetic Data Generation: Andrey et al. used American Community Survey (ACS) data via the folktables library for their disparate impact analysis, providing a GitLab repository: https://gitlab.inria.fr/magnet/thesepaulandrey/disparate-impact-sdg. Golob et al.’s SoK evaluated 9 SDG methods across 5 datasets and contributed new attacks, winning the NIST Privacy CRC 2025. Tab-PE, with its code at https://github.com/microsoft/DPSDA, proposes a broad collection of new datasets to better reflect high-order correlations in DP tabular data generation.
- Confidential Computing & Telemetry: EnclaveScale from Hung Dang et al. (Van Lang University, Appota Group, FPT Corporation) utilizes Intel TDX enclaves on GCP Confidential VMs (C3 instances), with NVIDIA H100, A100, and L4 GPUs, and workloads like MLPerf Training v4.0 JAX BERT-Large and PyTorch ResNet-50. Their Rust implementation and associated attestation tools are available for exploration.
- Location Privacy: “Protecting K-Nearest Neighbor Queries from Location Inference Attacks” by Zhiyu Sun et al. (East China Normal University et al.) evaluates their DPRS framework on real-world datasets like Brightkite and Gowalla, with code accessible at https://github.com/reanatom/DPRS.
- Causal Inference: The FHE-based causal structure learning by Jian Yang et al. leverages the Microsoft SEAL FHE library (https://github.com/microsoft/SEAL) and EVA FHE compiler (https://github.com/microsoft/EVA) to process data from the bnlearn repository.
- Decentralized Learning: DPDL, by Yuan et al., was validated on MNIST and CIFAR-10 datasets against gradient inversion attacks.
Impact & The Road Ahead
These advancements signify a pivotal shift in how we approach privacy in AI/ML. The theoretical grounding for the Gaussian mechanism and practical parameter calibration will empower more confident deployment of DP. The focus on fairness in synthetic data, robust mean estimation, and explicit modeling of high-order correlations will lead to more trustworthy and useful privacy-preserving datasets. The emergence of model-agnostic DP for LLMs and hardware-assisted DP in TEEs marks critical steps toward securing complex, real-world AI systems, particularly in sensitive domains like mental health screening and data center telemetry.
The research into Fully Homomorphic Encryption for causal inference and multi-objective submodular maximization with DP opens new frontiers for privacy-preserving analytics where complex computations are required over encrypted data. However, challenges remain, particularly in the high-privacy regimes where utility trade-offs are significant, and in ensuring comprehensive privacy auditing across intricate pre-train/adapt pipelines. The need for flexible privacy policies in databases (DP4SQL) also highlights the demand for more nuanced and adaptable DP solutions.
Looking ahead, we can expect continued innovation at the intersection of DP with other privacy-enhancing technologies (PETs) like FHE and TEEs, further refining the balance between utility, privacy, and performance. The community’s growing awareness of disparate impact and empirical privacy risks, especially in LLMs, will drive the development of more sophisticated auditing frameworks and robust, fair, and practical privacy solutions for the next generation of AI.