Interpretability: Unveiling the Inner Workings of AI, From Neurons to Clinical Decisions
Latest 100 papers on interpretability: Mar. 7, 2026
The quest for interpretability in AI and machine learning has never been more critical. As models grow increasingly complex and are deployed in high-stakes domains like healthcare, finance, and autonomous systems, high accuracy alone is no longer enough. We need to understand why models make certain decisions in order to build trust, identify biases, and ensure reliability. Recent research has seen a surge of innovative approaches, pushing the boundaries of what’s possible in explainable AI (XAI) and offering a glimpse of a future where transparency is not a luxury but a core component of intelligent systems.
The Big Idea(s) & Core Innovations
Several papers highlight a paradigm shift from purely predictive models to those that inherently offer insights into their reasoning. A recurring theme is the move towards inherently interpretable architectures or methods that synthesize explanations, rather than merely extracting them post-hoc. For instance, the paper “An interpretable prototype parts-based neural network for medical tabular data” by Jacek Karolczak and Jerzy Stefanowski (Poznan University of Technology) introduces MEDIC, a prototype-based neural network that mimics clinical reasoning for medical tabular data. This means the model’s decisions are directly tied to discrete, human-understandable prototypes, aligning with medical thresholds and clinician language. This contrasts with traditional black-box models, fostering trust in healthcare AI.
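The core mechanism behind prototype-based classifiers like MEDIC can be sketched in a few lines. Everything below (the prototype values, features, and weights) is invented for illustration, not taken from the paper: the point is that predictions come from similarity to a small set of human-inspectable prototypes, so each decision can be read off as "this patient is closest to the at-risk prototype".

```python
import numpy as np

# Hypothetical prototypes in a 2-feature space (e.g. glucose, BMI), chosen
# to align with clinically meaningful thresholds.
prototypes = np.array([
    [90.0, 22.0],   # "healthy-range" prototype: normal glucose, normal BMI
    [180.0, 33.0],  # "at-risk" prototype: high glucose, high BMI
])
class_weights = np.array([[1.0, -1.0],   # healthy prototype votes for class 0
                          [-1.0, 1.0]])  # at-risk prototype votes for class 1

def predict(x, temperature=25.0):
    # Similarity to each prototype: a Gaussian kernel over squared Euclidean
    # distance, so every score answers "how close is x to prototype k?"
    d2 = ((prototypes - x) ** 2).sum(axis=1)
    sims = np.exp(-d2 / temperature**2)
    logits = sims @ class_weights
    return logits.argmax(), sims

label, sims = predict(np.array([175.0, 31.0]))
```

Here `sims` is the explanation: the prediction is driven almost entirely by proximity to the at-risk prototype, which a clinician can inspect directly.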
Similarly, “Causal Neural Probabilistic Circuits” by Weixin Chen and Han Zhao (University of Illinois Urbana-Champaign) enhances interpretability by integrating causal inference with probabilistic modeling. Their CNPC model is designed to approximate interventional class distributions, performing robustly even under distributional shifts, and offering a principled way to integrate causal reasoning into predictive models.
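In the simplest discrete setting with an observed confounder, the interventional class distributions that CNPC is designed to approximate reduce to the backdoor adjustment formula, P(y | do(x)) = Σ_z P(y | x, z) P(z). A toy worked example (all numbers invented, not from the paper):

```python
import numpy as np

# Z is a binary confounder; tables below are illustrative.
p_z = np.array([0.7, 0.3])        # P(Z=z)
p_y_given_xz = np.array([         # P(Y=1 | X=x, Z=z); rows index x, cols index z
    [0.10, 0.50],
    [0.40, 0.80],
])

def p_y1_do_x(x):
    # Backdoor adjustment: average P(Y=1 | x, z) over the *marginal* P(z).
    return float(p_y_given_xz[x] @ p_z)

# The observational P(Y=1 | X=x) would instead weight by P(Z=z | X=x),
# which differs whenever Z influences X; do(x) severs that dependence.
```

This is why interventional estimates stay stable under distributional shifts in P(Z | X): the adjustment never uses that conditional.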
In the realm of multimodal AI, interpretability is also seeing significant advancements. MedCoRAG, a framework presented in “MedCoRAG: Interpretable Hepatology Diagnosis via Hybrid Evidence Retrieval and Multispecialty Consensus” by Zheng Li et al. (Nanjing University of Science and Technology), combines retrieval-augmented generation (RAG) with multi-agent collaboration to emulate multidisciplinary consultations. This dynamic integration of medical knowledge graphs and clinical guidelines creates a structured, evidence-based diagnostic process, enhancing transparency and trust in AI diagnosis. For robust robotic manipulation, “Observing and Controlling Features in Vision-Language-Action Models” by Lucy Xiaoyang Shi et al. (University of California, Berkeley & others) proposes a framework for observing and controlling internal features in vision-language-action models, making complex multi-modal systems more adaptable and controllable.
Another significant thrust is the focus on making black-box models more transparent through clever analytical tools. “Exact Functional ANOVA Decomposition for Categorical Inputs Models” by Baptiste Ferrere et al. (EDF R&D, IMT, Sorbonne Université) offers a closed-form functional ANOVA decomposition for categorical data, overcoming the limitations of sampling-based SHAP approximations. This provides exact and efficient explanations, especially valuable for high-cardinality tabular data. In a similar vein, “Enhancing the Interpretability of SHAP Values Using Large Language Models” by Xianlong Zeng and Kewen Zhu (Ohio University) bridges the gap further by using LLMs to translate complex SHAP outputs into plain language, making explanations accessible to non-technical users.
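For finite categorical domains, the functional ANOVA decomposition is indeed computable exactly by enumeration, with no sampling. A minimal sketch under a uniform input distribution for two categorical inputs (illustrative only; the paper's closed-form method is more general and handles high cardinality efficiently):

```python
import numpy as np

# f(a, b) tabulated over two categorical inputs with 3 and 2 levels.
f = np.array([[1.0, 3.0],
              [2.0, 2.0],
              [0.0, 4.0]])

# Exact functional ANOVA under the uniform input measure:
# f = f0 + f_A(a) + f_B(b) + f_AB(a, b), each component zero-mean.
f0   = f.mean()
f_A  = f.mean(axis=1) - f0                    # main effect of input A
f_B  = f.mean(axis=0) - f0                    # main effect of input B
f_AB = f - f0 - f_A[:, None] - f_B[None, :]   # interaction residual

# Sobol-style variance shares follow directly from component variances.
var_total = f.var()
share_A = (f_A**2).mean() / var_total
```

In this toy table input A happens to have no main effect at all (`share_A` is zero); its influence is purely interaction, something a sampled SHAP approximation could easily blur.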
For understanding internal model dynamics, several papers delve into the microscopic workings of LLMs. “A Gauge Theory of Superposition: Toward a Sheaf-Theoretic Atlas of Neural Representations” by Hossein Javidnia (Dublin City University) introduces a gauge-theoretic framework with sheaf theory to model superposition and identify geometric obstructions to global interpretability, providing certified bounds on interference. “Hidden Breakthroughs in Language Model Training” by Sara Kangaslahti et al. (Harvard University, Google Research) uses POLCA to identify interpretable conceptual shifts during training, providing insight into when and how LLMs acquire skills like arithmetic. Challenging the assumption that LLMs truly reason, “Decoding Answers Before Chain-of-Thought: Evidence from Pre-CoT Probes and Activation Steering” by Kyle Cox et al. shows that LLMs often pre-commit to answers before generating their chain-of-thought (CoT), and that the pre-committed answer can even be redirected via activation steering, suggesting CoT may not always reflect genuine reasoning. This highlights the need for more faithful interpretability methods.
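Pre-CoT probing and activation steering both rest on the same primitive: a direction in hidden-state space that correlates with the eventual answer. A self-contained sketch with synthetic activations (no real model involved; all data below is simulated for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16
# Synthetic stand-ins for hidden states captured *before* CoT generation:
# "yes" and "no" activations differ along one hidden direction.
true_dir = rng.normal(size=d)
true_dir /= np.linalg.norm(true_dir)
acts_yes = rng.normal(size=(200, d)) + 2.0 * true_dir
acts_no  = rng.normal(size=(200, d)) - 2.0 * true_dir

# 1) Probe: the difference-of-means direction predicts the answer pre-CoT.
probe = acts_yes.mean(axis=0) - acts_no.mean(axis=0)
acc = ((acts_yes @ probe > 0).mean() + (acts_no @ probe < 0).mean()) / 2

# 2) Steering: adding the direction to a "no" activation flips what the
# probe reads out, without touching the generated CoT text at all.
steered = acts_no[0] + 8.0 * probe / np.linalg.norm(probe)
```

If the probe's readout on pre-CoT states already matches the final answer with high accuracy, the subsequent CoT is (at best) a rationalization of a decision already made.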
Under the Hood: Models, Datasets, & Benchmarks
These advancements are often powered by novel architectural designs, specialized datasets, and rigorous benchmarks:
- MEDIC (Prototype-based NN): Introduced in “An interpretable prototype parts-based neural network for medical tabular data”, this model features differentiable discretization aligned with medical thresholds. Evaluated on datasets like Diabetes Data Set, Cirrhosis Patient Survival Prediction Dataset, and Chronic Kidney Disease. Public code: https://github.com/.
- GALACTIC (Counterfactuals for Time-series Clustering): From “GALACTIC: Global and Local Agnostic Counterfactuals for Time-series Clustering” by Christos Fragkathoulas et al., this framework uses constrained gradient optimization and Minimum Description Length (MDL) for sparse, meaningful perturbations. No public code provided in the summary.
- ASR-TRA (Reinforcement Learning for ASR Robustness): Proposed in “Boosting ASR Robustness via Test-Time Reinforcement Learning with Audio-Text Semantic Rewards” by Linghan Fang et al., this causal RL framework leverages learnable decoder prompts and audio-text semantic rewards. Public code: https://github.com/fangcq/ASR-TRA.
- STCV (Sparse Regression for Non-linear Dynamics): In “Towards a data-scale independent regulariser for robust sparse identification of non-linear dynamics” by Jay Rauta et al., STCV is a magnitude-free algorithm based on Coefficient of Variation. Public code: https://github.com/RautJ/STCV.
- MedCoRAG (Hybrid RAG-Multi-Agent for Hepatology): Introduced in “MedCoRAG: Interpretable Hepatology Diagnosis via Hybrid Evidence Retrieval and Multispecialty Consensus” by Zheng Li et al., it uses medical knowledge graphs and clinical guidelines, validated on MIMIC-IV dataset. No public code provided.
- SPIRIT (Perceptive Shared Autonomy): From “SPIRIT: Perceptive Shared Autonomy for Robust Robotic Manipulation under Deep Learning Uncertainty”, this framework integrates perception and autonomous decision-making. Resources: https://sites.google.com/view/robotspirit. No public code provided.
- MUTEX & URTOX (Urdu Toxic Span Detection): In “MUTEX: Leveraging Multilingual Transformers and Conditional Random Fields for Enhanced Urdu Toxic Span Detection” by Inayat Arshad et al., URTOX is the first manually annotated token-level dataset for Urdu (14,342 samples). Public code: https://github.com/finalyear226-lab/urdu-toxic-span-dataset.
- BioLLMAgent (Hybrid LLM-RL for Psychiatry): Introduced in “BioLLMAgent: A Hybrid Framework with Enhanced Structural Interpretability for Simulating Human Decision-Making in Computational Psychiatry”. Public code: https://github.com/your-organization/BioLLMAgent.
- VideoHV-Agent (Multi-Agent for Long Video QA): From “Think, Then Verify: A Hypothesis-Verification Multi-Agent Framework for Long Video Understanding” by Zheng Wang et al., this system features specialized agents for hypothesis generation and evidence gathering. Public code: https://github.com/Haorane/VideoHV-Agent.
- DeformTrace (Deformable SSM for Forgery Localization): In “DeformTrace: A Deformable State Space Model with Relay Tokens for Temporal Forgery Localization” by Xiaodong Zhu et al., it uses Deformable Self-SSM (DS-SSM) and Relay Tokens. No public code provided.
- GDS (Gradient Deviation Scores for Data Detection): From “From Unfamiliar to Familiar: Detecting Pre-training Data via Gradient Deviations in Large Language Models” by Ruiqi Zhang et al., this method analyzes gradient behavior. Public code: https://github.com/kiky-space/icml-pdd.
- AGF (Attention-Gravitational Field): Introduced in “Attention’s Gravitational Field: A Power-Law Interpretation of Positional Correlation” by Edward Zhang, this framework reinterprets positional correlations. Public code: https://github.com/windyrobin/AGF/tree/main.
- Model Medicine & Neural MRI: From “Model Medicine: A Clinical Framework for Understanding, Diagnosing, and Treating AI Models” by Jihoon ‘JJ’ Jeong, this proposes a taxonomy and diagnostic tools. Public code for Neural MRI: https://github.com/ModuLabs/NeuralMRI.
- T3CEN (Hypertoroidal Covering for Color Equivariance): In “A Hypertoroidal Covering for Perfect Color Equivariance” by Yulong Yang et al. (Princeton University), this network achieves perfect equivariance to HSL shifts. Evaluated on datasets like Caltech-256, Oxford-IIIT Pet, SmallNORB. No public code provided.
- XPlore (Counterfactual Explanation in GNNs): Introduced in “Beyond Edge Deletion: A Comprehensive Approach to Counterfactual Explanation in Graph Neural Networks” by Matteo De Sanctis et al. (Sapienza University of Rome), XPlore considers edge insertions and node-feature perturbations. No public code provided.
- SAEs with Weight Regularization: From “Stable and Steerable Sparse Autoencoders with Weight Regularization” by Piotr Jedryszek and Oliver M. Crook (University of Oxford), this approach uses L2 weight penalties for stability. Public code: https://github.com/LukeMarks/feature-aligned-sae.
- LISTA-Transformer (Fault Diagnosis): In “LISTA-Transformer Model Based on Sparse Coding and Attention Mechanism and Its Application in Fault Diagnosis” by Zhang, Li et al., this model combines sparse coding and attention for feature extraction. No public code provided.
- Implicit U-KAN2.0 (Medical Image Segmentation): From “Implicit U-KAN2.0: Dynamic, Efficient and Interpretable Medical Image Segmentation”, it integrates SONO blocks and MultiKAN layers for efficiency and interpretability. Public code: https://math-ml-x.github.io/IUKAN2/.
- DCENWCNet (WBC Classification with LIME): Introduced in “DCENWCNet: A Deep CNN Ensemble Network for White Blood Cell Classification with LIME-Based Explainability” by Sibasish Das et al. (Amrita Vishwa Vidyapeetham), this CNN ensemble uses LIME for interpretability. No public code provided.
- GeoTop (Geometric-Topological Analysis for Classification): In “GeoTop: Advancing Image Classification with Geometric-Topological Analysis” by Mariem Abaacha and Ian Morilla (Université de Paris), GeoTop combines topological data analysis (TDA) with Lipschitz-Killing curvatures (LKCs). Public code: https://github.com/MorillaLab/GeoTop/tree/main/Code.
- SSAE (Step-Level Sparse Autoencoder): From “Step-Level Sparse Autoencoder for Reasoning Process Interpretation” by Xuan Yang et al. (City University of Hong Kong), SSAE interprets LLM reasoning at the step level. Public code: https://github.com/Miaow-Lab/SSAE.
- BRIGHT (Breast Pathology Foundation Model): Introduced in “BRIGHT: A Collaborative Generalist-Specialist Foundation Model for Breast Pathology” by Xiaojing Guo et al., this dual-pathway model is validated on TCGA datasets. No public code provided.
- DaFFs for PINNs: In “Enhancing Physics-Informed Neural Networks with Domain-aware Fourier Features: Towards Improved Performance and Interpretable Results” by Alberto Miño Calero et al. (NTNU, ETH Zürich), DaFFs allow PINNs to inherently satisfy boundary conditions. No public code provided.
- IPL (Interpretable Polynomial Learning): From “Towards Accurate and Interpretable Time-series Forecasting: A Polynomial Learning Approach” by Bo Liu et al. (Xi’an Jiaotong University), IPL uses polynomial representations for interpretability. Public code: https://github.com/Ariesoomoon/IPL_TS_experiments.
- Hybrid NN with ℓ1-regression: In “Embedding interpretable ℓ1-regression into neural networks for uncovering temporal structure in cell imaging” by Fabian Kabus et al. (University of Freiburg), this method combines neural networks with ℓ1-regularized VAR models. No public code provided.
- TVF (Time-Varying Filtering): Introduced in “Differentiable Time-Varying IIR Filtering for Real-Time Speech Denoising” by Riccardo Rota et al. (Logitech Europe S.A.), TVF merges DSP interpretability with deep learning. No public code provided.
- GTDiagnosis (Visual-Language for GTD Diagnosis): From “Intelligent Pathological Diagnosis of Gestational Trophoblastic Diseases via Visual-Language Deep Learning Model” by Yuhang Liu et al. (Tsinghua University), this expert model uses visual-language deep learning. Public code: https://github.com/GTDiagnosisTeam/GTDiagnosis.
- SHD Detection with GAM: In “Detecting Structural Heart Disease from Electrocardiograms via a Generalized Additive Model of Interpretable Foundation-Model Predictors” by Ya Zhou et al. (Fuwai Hospital), this framework uses ECG foundation models with a generalized additive model. Evaluated on the EchoNext benchmark dataset. No public code provided.
- Radiomic Feature Sets (Knee MRI): From “Retrieving Patient-Specific Radiomic Feature Sets for Transparent Knee MRI Assessment” by Yaxii C and J. C. Nguyen (University of California, San Francisco), this retrieval-based approach is for patient-specific feature selection. Public code: https://github.com/YaxiiC/OA_KLG_Retrieval.git.
- NLLB-200 Probing: In “Universal Conceptual Structure in Neural Translation: Probing NLLB-200’s Multilingual Geometry” by Kyle Mathewson (University of Alberta), this work explores multilingual geometry. Public code: https://github.com/kylemath/InterpretCognates.
- Composite Indicators with Decision Rules: From “An Explainable and Interpretable Composite Indicator Based on Decision Rules” by Salvatore Corrente et al. (University of Catania), this novel method uses logical ‘if…then…’ rules. No public code provided.
- Spoken Language Biomarker for Cognitive Impairment: In “Evaluating Spoken Language as a Biomarker for Automated Screening of Cognitive Impairment” by Maria R. Lima et al. (Imperial College London), this ML pipeline uses linguistic features. Evaluated on DementiaBank datasets. Public code: https://github.com/mariarlima/ml-speech-biomarkers.
- REFORM (Reasoning for Multimodal Manipulation Detection): From “Process Over Outcome: Cultivating Forensic Reasoning for Generalizable Multimodal Manipulation Detection” by Yuchen Zhang et al. (Xi’an Jiaotong University), REFORM uses GRPO-based RL. It introduces the ROM dataset (704k samples). Public code: https://github.com/YcZhangSing/REFORM.
- BAED (Few-shot Graph Learning with Explanation): In “BAED: a New Paradigm for Few-shot Graph Learning with Explanation in the Loop” by Chao Chen et al. (Harbin Institute of Technology), BAED integrates belief propagation and auxiliary GNNs. No public code provided.
- Explanation-Guided Adversarial Training: From “Explanation-Guided Adversarial Training for Robust and Interpretable Models”, this framework combines adversarial training with explanation guidance. Public code: https://github.com/your-organization/explanation-guided-adversarial-training.
- Representational Geometry Markers: In “Diagnosing Generalization Failures from Representational Geometry Markers” by Chi-Ning Chou et al. (Flatiron Institute, Harvard University), this approach uses geometric measures for OOD prediction. Public code: https://github.com/chung-neuroai-lab/ood-generalization-geometry.
- EMO-R3 (Reflective RL for Emotional Reasoning): From “EMO-R3: Reflective Reinforcement Learning for Emotional Reasoning in Multimodal Large Language Models” by Yiyang Fang et al. (Wuhan University, Xiaomi Inc.), EMO-R3 uses structured emotional thinking and reflective rewards. Public code: https://github.com/xiaomi-research/emo-r3.
- CausalProto (Unsupervised Causal Prototypical Networks): In “Unsupervised Causal Prototypical Networks for De-biased Interpretable Dermoscopy Diagnosis”, this network decouples pathological features from confounders. No public code provided.
- sEMG Tokenization & ActionEMG-43: From “From Continuous sEMG Signals to Discrete Muscle State Tokens: A Robust and Interpretable Representation Framework” by Yuepeng Chen et al. (Beijing University of Posts and Telecommunications), this introduces ActionEMG-43, a large-scale sEMG dataset with 43 actions. No public code provided.
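Of the methods above, the decision-rule composite indicator of Corrente et al. lends itself to a compact sketch. The rules, weights, and sub-indicators below are invented for illustration, not taken from the paper; what carries over is the structure: the final score decomposes into named "if…then…" rules that fired, so every score is auditable.

```python
# Hypothetical composite indicator built from transparent "if...then" rules.
# Each rule fires on a condition over raw sub-indicators and contributes a
# weighted point; the final score is the sum of fired-rule weights.
RULES = [
    ("high literacy",      lambda r: r["literacy"] >= 0.9,      2.0),
    ("low unemployment",   lambda r: r["unemployment"] <= 0.05, 1.5),
    ("clean energy share", lambda r: r["renewables"] >= 0.4,    1.0),
]

def composite_score(record):
    fired = [(name, w) for name, cond, w in RULES if cond(record)]
    return sum(w for _, w in fired), fired

region = {"literacy": 0.95, "unemployment": 0.07, "renewables": 0.5}
score, explanation = composite_score(region)
# `explanation` lists exactly which rules produced the score.
```

Unlike a weighted average of normalized sub-indicators, the explanation here is the computation itself: the score changes only when a named rule flips.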
Impact & The Road Ahead
The impact of this research is profound, promising to unlock AI’s full potential in safety-critical and sensitive applications. The shift towards inherently interpretable models in healthcare, as seen with MEDIC and MedCoRAG, means AI can finally act as a partner to clinicians rather than a black box. Projects like GTDiagnosis, which applies visual-language deep learning to gestational trophoblastic disease diagnosis, exemplify how AI can drastically improve efficiency and accuracy in specialized medical fields, reducing diagnostic time from minutes to seconds. Furthermore, the patient-specific radiomic features for knee MRI assessment explored in “Retrieving Patient-Specific Radiomic Feature Sets for Transparent Knee MRI Assessment” by Yaxii C and J. C. Nguyen ensure that AI-driven diagnostics are both precise and auditable.
For foundation models, two papers stand out. “The Geometric Inductive Bias of Grokking: Bypassing Phase Transitions via Architectural Topology” by Alper YILDIRIM (Independent Researcher) explores architectural interventions that reduce grokking delays, while “Compressed Sensing for Capability Localization in Large Language Models” by Anna Bair et al. (Carnegie Mellon University) reveals that LLM capabilities are localized to sparse subsets of attention heads. Together, these works pave the way for more efficient model design, targeted debugging, and enhanced control over AI behavior.
The development of specialized tools like GLUScope for analyzing gated activation functions, and of frameworks like TopicENA for scalable discourse analysis, underscores the growing need for sophisticated methods to dissect complex AI systems. The critical self-reflection in “Beyond Explainable AI (XAI): An Overdue Paradigm Shift and Post-XAI Research Directions” by Saleh Afroogh et al. (University of Texas at Austin) is a potent reminder that the pursuit of interpretability must be grounded in scientific rigor and verification, moving beyond superficial explanations. The journey toward truly transparent and trustworthy AI is long, but these breakthroughs show steady progress toward a future where AI not only performs brilliantly but also explains itself clearly, fostering greater collaboration and confidence in human-AI partnerships.