Healthcare AI: From Trustworthy LLMs to Collaborative Robotics, Powering a Safer and Smarter Future
Latest 62 papers on healthcare: Mar. 28, 2026
The landscape of healthcare is undergoing a profound transformation, driven by rapid advancements in AI and Machine Learning. From enhancing clinical decision-making and patient care to streamlining administrative workflows and ensuring data privacy, AI is poised to revolutionize nearly every facet of the industry. Recent research highlights a concerted effort to build more trustworthy, efficient, and accessible AI systems, addressing critical challenges in areas like data privacy, model reliability, and human-AI collaboration.
The Big Idea(s) & Core Innovations
At the heart of these advancements is the pursuit of AI that not only performs complex tasks but does so reliably and ethically. A significant thrust is in leveraging Large Language Models (LLMs) for diverse healthcare applications. For instance, MedAidDialog: A Multilingual Multi-Turn Medical Dialogue Dataset for Accessible Healthcare by Nigam, Sarkar, and Patel (University of Birmingham, Heritage Institute of Technology) introduces a multilingual dialogue dataset and a parameter-efficient model, MedAidLM, to enable personalized, multi-turn medical consultations, bridging accessibility gaps for low-resource populations. Complementing this, Multi-Method Validation of Large Language Model Medical Translation Across High- and Low-Resource Languages by Anyaegbuna et al. (Stanford University, UCSF, Harvard Medical School) demonstrates that frontier LLMs maintain high semantic fidelity in medical translation across diverse languages, supporting more equitable access to healthcare information.
However, the power of LLMs comes with caveats, especially concerning reliability and privacy. Tan et al. (Microsoft Research Asia, Hong Kong University of Science and Technology), in their paper A Decade-Scale Benchmark Evaluating LLMs’ Clinical Practice Guidelines Detection and Adherence in Multi-turn Conversations, introduce CPGBench, revealing significant gaps in LLMs’ ability to adhere to clinical guidelines, flagging critical safety concerns. To tackle privacy, Adam Jakobsen (University of California, Berkeley) presents Knowledge-Guided Retrieval-Augmented Generation for Zero-Shot Psychiatric Data: Privacy Preserving Synthetic Data Generation, demonstrating how RAG with LLMs can simulate realistic psychiatric patient data while preserving privacy. Similarly, Aithal, Kotz, and Mitchell (University of Colorado Anschutz) propose PLACID in PLACID: Privacy-preserving Large language models for Acronym Clinical Inference and Disambiguation, a framework for accurate, privacy-preserving acronym disambiguation in clinical narratives using on-device small LLMs.
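The retrieval-grounded generation pattern behind such privacy-preserving synthesis can be sketched generically. The snippet below is a minimal illustration under stated assumptions, not the paper's implementation: the knowledge base, the bag-of-words scorer, and the prompt template are all hypothetical stand-ins, and the actual LLM call is left as a prompt string. The key property it demonstrates is that generation is conditioned on de-identified domain knowledge rather than on any real patient record.

```python
from collections import Counter
import math

# Hypothetical knowledge base of de-identified clinical snippets
# (contents are illustrative, not taken from the paper).
KNOWLEDGE_BASE = [
    "Major depressive disorder: persistent low mood, anhedonia, sleep disturbance.",
    "Generalized anxiety disorder: excessive worry, restlessness, muscle tension.",
    "PTSD: intrusive memories, hypervigilance, avoidance of trauma reminders.",
]

def score(query, doc):
    """Bag-of-words cosine similarity between a query and a document."""
    q, d = Counter(query.lower().split()), Counter(doc.lower().split())
    shared = set(q) & set(d)
    num = sum(q[t] * d[t] for t in shared)
    den = (math.sqrt(sum(v * v for v in q.values()))
           * math.sqrt(sum(v * v for v in d.values())))
    return num / den if den else 0.0

def retrieve(query, k=1):
    """Return the top-k knowledge snippets most similar to the query."""
    return sorted(KNOWLEDGE_BASE, key=lambda doc: score(query, doc), reverse=True)[:k]

def build_prompt(query):
    """Ground the generation request in retrieved domain knowledge so the
    LLM synthesizes a plausible record without seeing real patient data."""
    context = "\n".join(retrieve(query))
    return (f"Using only this clinical knowledge:\n{context}\n"
            f"Generate a synthetic (fictional) patient note about: {query}")

prompt = build_prompt("anxiety with excessive worry")
```

In a full pipeline, `prompt` would be sent to an LLM; the privacy argument rests on the knowledge base containing only non-identifying reference material.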
The emphasis on trustworthiness and interpretability extends beyond LLMs. MedForge: Interpretable Medical Deepfake Detection via Forgery-aware Reasoning by Chen et al. (National University of Singapore, CUHK) introduces MedForge-Reasoner, an MLLM-based detector that performs pre-hoc localized reasoning to identify and explain medical image forgeries, enhancing trust in AI diagnostics. Furthermore, Agboola (Epalea), in Theoretical Foundations of Latent Posterior Factors: Formal Guarantees for Multi-Evidence Reasoning, provides a robust theoretical framework for aggregating heterogeneous evidence, offering critical guarantees for calibration and robustness in high-stakes AI applications.
Beyond diagnostic and NLP tasks, AI is redefining chronic care management and hospital operations. Rethinking Health Agents: From Siloed AI to Collaborative Decision Mediators by Chung, Xu, and Pollack (University of Washington, Columbia University) reframes health agents as collaborative mediators to enhance multi-stakeholder interactions through shared situational awareness. This echoes the insights from Chung et al. (University of Washington, Johns Hopkins University) in Co-designing for the Triad: Design Considerations for Collaborative Decision-Making Technologies in Pediatric Chronic Care, highlighting the importance of digital tools that support collaborative decision-making among patients, caregivers, and providers. For large-scale data, Rajamohan et al. (New York University) introduce RAVEN in Scaling Recurrence-aware Foundation Models for Clinical Records via Next-Visit Prediction, a generative pretraining strategy for sequential EHR data that accounts for event recurrence, improving zero-shot generalization across diverse diseases.
Under the Hood: Models, Datasets, & Benchmarks
These innovations are powered by novel models, carefully curated datasets, and rigorous benchmarks:
- CPGBench: Introduced by Tan et al. (Microsoft Research Asia) in A Decade-Scale Benchmark Evaluating LLMs’ Clinical Practice Guidelines Detection and Adherence in Multi-turn Conversations, this is the first decade-scale benchmark for evaluating LLMs on clinical practice guideline adherence in multi-turn conversations. It comprises 32,155 recommendations from 3,418 global CPG documents.
- RAVEN: From Rajamohan et al. (New York University) in Scaling Recurrence-aware Foundation Models for Clinical Records via Next-Visit Prediction, this is a recurrence-aware foundation model for structured longitudinal EHR data, focused on next-visit event prediction. The paper emphasizes a recurrence-aware regularization mechanism.
- MedAidDialog & MedAidLM: Nigam, Sarkar, and Patel (University of Birmingham) introduce MedAidDialog: A Multilingual Multi-Turn Medical Dialogue Dataset for Accessible Healthcare, a multilingual medical dialogue dataset, along with MedAidLM, a parameter-efficient model for conversational medical assistance. Both are designed for low-resource settings.
- VOLMO: In VOLMO: Versatile and Open Large Models for Ophthalmology, Qin et al. (Yale University, Carnegie Mellon University) propose a model-agnostic and data-open framework for developing ophthalmology-specific multimodal LLMs, demonstrating strong performance in image-description generation and disease screening.
- CarePilot & CareFlow: Ghosh et al. (Indian Institute of Technology Patna, Mohamed bin Zayed University of AI) introduce CarePilot: A Multi-Agent Framework for Long-Horizon Computer Task Automation in Healthcare, a multi-agent framework for automating healthcare workflows. It is benchmarked on CareFlow, an expert-annotated benchmark for long-horizon healthcare tasks.
- MedForge-90K & MedForge-Reasoner: Chen et al. (National University of Singapore) introduce MedForge-90K, a large-scale benchmark for medical deepfake detection with expert-guided reasoning, and MedForge-Reasoner, an MLLM-based detector that performs pre-hoc localized reasoning. Code is available at https://anonymous.4open.science/r/MedForge-Reasoner-anonymize-2295.
- TrustFed: Kumar, Alam, and Chakraborty (Indian Institute of Technology Delhi) propose TrustFed: Enabling Trustworthy Medical AI under Data Privacy Constraints, a federated learning framework that ensures statistically valid uncertainty quantification in medical imaging, leveraging representation-aware client assignment and soft-nearest threshold aggregation.
- AEGIS: For regulatory compliance, Afdideh et al. (Karolinska Institutet) introduce AEGIS: An Operational Infrastructure for Post-Market Governance of Adaptive Medical AI Under US and EU Regulations, a governance framework enabling continuous model updates while adhering to regulatory standards.
- LLM-CAT: Zheng et al. (Peking University) propose Leveraging Computerized Adaptive Testing for Cost-effective Evaluation of Large Language Models in Medical Benchmarking, a computerized adaptive testing (CAT) framework for efficient and accurate evaluation of LLMs in medical domains, with code at https://github.com/zjiang4/LLM-CAT.
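The core mechanism behind computerized adaptive testing, which LLM-CAT applies to medical benchmarking, is to repeatedly administer the item with the highest Fisher information at the current ability estimate, so far fewer items are needed than in a fixed test. The sketch below is a generic textbook illustration, not LLM-CAT's implementation: it uses a Rasch (1PL) model, a hypothetical item bank of difficulty values, and a simulated "model under test" that answers correctly below a fixed difficulty threshold.

```python
import math

def p_correct(theta, b):
    """Rasch (1PL) model: probability of a correct response at ability theta."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def fisher_info(theta, b):
    """Item information at ability theta; maximal when difficulty b ≈ theta."""
    p = p_correct(theta, b)
    return p * (1.0 - p)

def estimate_theta(responses):
    """Grid-search MLE of ability from (difficulty, correct) pairs."""
    grid = [t / 10.0 for t in range(-40, 41)]
    def loglik(t):
        return sum(math.log(p_correct(t, b) if y else 1 - p_correct(t, b))
                   for b, y in responses)
    return max(grid, key=loglik)

def run_cat(item_bank, oracle, n_items=5, theta0=0.0):
    """Adaptively pick the most informative remaining item each round."""
    theta, asked, responses = theta0, set(), []
    for _ in range(n_items):
        b = max((x for x in item_bank if x not in asked),
                key=lambda x: fisher_info(theta, x))
        asked.add(b)
        responses.append((b, oracle(b)))
        theta = estimate_theta(responses)
    return theta

# Hypothetical item bank; the simulated test-taker answers correctly
# whenever the item difficulty is below 1.0.
bank = [-2.0, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5, 2.0]
est = run_cat(bank, oracle=lambda b: b < 1.0)
```

With only five adaptively chosen items, the ability estimate settles near the simulated threshold, which is the efficiency argument for CAT-style LLM evaluation.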
Impact & The Road Ahead
The impact of this research is profound, painting a future where AI systems are not just tools but trusted collaborators in healthcare. The push for privacy-preserving techniques like federated learning (as seen in Federated Learning with Multi-Partner OneFlorida+ Consortium Data for Predicting Major Postoperative Complications by Ren et al. (University of Florida) and Federated Learning for Privacy-Preserving Medical AI by Hoang (University of Surrey)) and synthetic data generation (Knowledge-Guided Retrieval-Augmented Generation for Zero-Shot Psychiatric Data: Privacy Preserving Synthetic Data Generation) is critical for deploying AI in sensitive medical contexts. Ethical fairness, as advocated by Roy et al. (Arizona State University, Cornell University) in Ethical Fairness without Demographics in Human-Centered AI, is becoming a core design principle, ensuring equitable outcomes without perpetuating biases.
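For readers unfamiliar with the mechanics these papers build on, a single federated averaging (FedAvg) round looks roughly like the following. This is a generic sketch of the baseline technique, not the method of any paper cited above: the linear model, the two "hospital" datasets, and the hyperparameters are all hypothetical. The point it illustrates is that raw patient data never leaves a site; only model weights are aggregated.

```python
import random

def local_update(weights, data, lr=0.1, epochs=5):
    """One client's training: least-squares fit on (x, y) pairs via SGD.
    Raw data stays on the client; only updated weights are returned."""
    w, b = weights
    for _ in range(epochs):
        for x, y in data:
            err = (w * x + b) - y
            w -= lr * err * x
            b -= lr * err
    return (w, b)

def fed_avg(global_weights, client_datasets):
    """Server step: average client models weighted by local sample counts."""
    updates = [(local_update(global_weights, d), len(d)) for d in client_datasets]
    total = sum(n for _, n in updates)
    w = sum(u[0] * n for u, n in updates) / total
    b = sum(u[1] * n for u, n in updates) / total
    return (w, b)

# Two hypothetical "hospitals" whose data follow y = 2x + 1 plus small noise.
random.seed(0)
def make_site(n):
    return [(x, 2 * x + 1 + random.gauss(0, 0.01))
            for x in [random.uniform(-1, 1) for _ in range(n)]]

sites = [make_site(30), make_site(50)]
weights = (0.0, 0.0)
for _ in range(20):  # communication rounds
    weights = fed_avg(weights, sites)
```

After a few rounds the shared model recovers the underlying relationship even though the server never sees either site's records; the papers above layer uncertainty quantification and client-assignment strategies on top of this basic loop.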
The development of multi-agent systems, such as Chung et al.’s (University of Washington, Columbia University) concept of AI as collaborative decision mediators in Rethinking Health Agents: From Siloed AI to Collaborative Decision Mediators, signals a shift from siloed AI to integrated human-AI teams. This is further reinforced by Nguyen et al. (University of New Brunswick) in Position: Multi-Agent Algorithmic Care Systems Demand Contestability for Trustworthy AI, arguing for contestability as a fundamental requirement for trustworthy multi-agent algorithmic care systems, enabling human oversight and accountability.
Future research will likely focus on closing the identified gaps in LLM adherence to clinical guidelines, enhancing robustness to adversarial attacks, and building more sophisticated governance frameworks for adaptive medical AI (like AEGIS). The integration of AI with extended reality (XR) for real-time decision support, as demonstrated by Sanchez, Tran, and Chheang (San Jose State University) in Towards Extended Reality Intelligence for Monitoring and Predicting Patient Readmission Risks, also holds immense promise. As AI continues to mature, its role will evolve from augmenting human capabilities to fundamentally transforming healthcare delivery, making it more intelligent, accessible, and human-centered than ever before.