Representation Learning: Unifying Modalities, Tackling Bias, and Pioneering New Frontiers
Latest 50 papers on representation learning: Sep. 8, 2025
Representation learning remains a cornerstone of modern AI/ML, with applications ranging from computer vision to drug discovery. The ability of models to learn compact, meaningful representations directly from raw data underpins many of today’s capabilities. Challenges persist, however, particularly around data scarcity, bias, computational efficiency, and the integration of heterogeneous data sources. The recent papers surveyed here address these areas and propose methods aimed at more robust, fair, and efficient AI systems.
The Big Idea(s) & Core Innovations
One prominent theme across recent papers is the fusion and alignment of multiple modalities to create richer, more comprehensive representations. For instance, in “BiListing: Modality Alignment for Listings”, Guillaume Guy, Mihajlo Grbovic, Chun How Tan, and Han Zhao from Airbnb align the text and images of listings using large language models, reporting significant improvements in search ranking and revenue. Similarly, Hayeon Bang, Eunjin Choi, Seungheon Doh, and Juhan Nam from KAIST introduce “PianoBind: A Multimodal Joint Embedding Model for Pop-piano Music”, which integrates audio, symbolic (MIDI), and textual descriptions to capture subtle nuances in solo piano music and surpasses general-purpose models in text-to-music retrieval. This emphasis on multimodal integration extends to medical imaging, where Yuheng Li and colleagues from Georgia Institute of Technology and Emory University present “MedVista3D: Vision-Language Modeling for Reducing Diagnostic Errors in 3D CT Disease Detection, Understanding and Reporting”, which combines local detection with global understanding through multi-scale semantic alignment for 3D CT scans.
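Joint embedding models of this kind are often trained with a contrastive objective that pulls matched pairs together and pushes mismatched pairs apart. The sketch below is a generic symmetric InfoNCE loss in plain Python, not the loss used by BiListing, PianoBind, or MedVista3D, whose training details are not given here. The function names, the temperature value, and the toy vectors are assumptions for illustration, and inputs are assumed to be non-zero vectors.

```python
import math

def l2_normalize(vec):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def cross_entropy_on_diagonal(logits):
    """Mean cross-entropy where row i's correct class is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        peak = max(row)  # subtract the max for numerical stability
        log_z = peak + math.log(sum(math.exp(x - peak) for x in row))
        total += log_z - row[i]
    return total / len(logits)

def symmetric_info_nce(emb_a, emb_b, temperature=0.07):
    """emb_a[i] and emb_b[i] embed the same item in two modalities."""
    a = [l2_normalize(v) for v in emb_a]
    b = [l2_normalize(v) for v in emb_b]
    # Cosine similarity of every cross-modal pair, scaled by the temperature.
    logits = [[sum(x * y for x, y in zip(va, vb)) / temperature for vb in b]
              for va in a]
    transposed = [list(col) for col in zip(*logits)]
    # Average the a-to-b and b-to-a retrieval losses.
    return 0.5 * (cross_entropy_on_diagonal(logits)
                  + cross_entropy_on_diagonal(transposed))

# Hypothetical 2-D embeddings for two listings (text vs. image).
text_emb = [[1.0, 0.0], [0.0, 1.0]]
image_emb = [[0.9, 0.1], [0.1, 0.9]]
aligned = symmetric_info_nce(text_emb, image_emb)
shuffled = symmetric_info_nce(text_emb, image_emb[::-1])
print(aligned < shuffled)  # matched pairs should score the lower loss
```

Averaging both retrieval directions keeps the objective symmetric, so neither modality is treated as the sole query side.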
Another significant thrust is improving representation learning in resource-constrained or challenging data environments. In “What Fundamental Structure in Reward Functions Enables Efficient Sparse-Reward Learning?”, Ibne Farabi Shihab, Sanjeda Akter, and Anuj Sharma from Iowa State University introduce Policy-Aware Matrix Completion (PAMC), which exploits low-rank structure in reward functions to improve sample efficiency in sparse-reward reinforcement learning. For drug discovery, Amartya Banerjee et al. from UNC Chapel Hill and IIT Bombay propose VECTOR+ in “Valid Property-Enhanced Contrastive Learning for Targeted Optimization & Resampling for Novel Drug Design”, which structures chemical latent spaces based on biological function, enabling effective molecular design from limited datasets. The challenge of long-tailed visual recognition is addressed by Yifan Lan et al. from Huazhong University of Science and Technology in “Mixture of Balanced Information Bottlenecks for Long-Tailed Visual Recognition”, which uses a balanced information bottleneck (BIB) and a mixture of such bottlenecks (MBIB) to preserve label-related information through loss re-balancing and self-distillation.
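To make the low-rank idea concrete: if a reward table is (approximately) low rank, unobserved entries can be inferred from the observed ones. The sketch below is a generic rank-1 matrix completion by alternating least squares in plain Python. It is not PAMC, which is policy-aware and is not described in detail here; the function name, the rank-1 restriction, and the toy reward table are assumptions for illustration.

```python
def complete_rank_one(observed, n_rows, n_cols, n_iters=100):
    """Fit R[i][j] ~ u[i] * v[j] using only the observed entries.

    `observed` maps (row, col) -> value. Returns the factor lists u and v.
    """
    u = [1.0] * n_rows
    v = [1.0] * n_cols
    for _ in range(n_iters):
        # Update each row factor by least squares with v held fixed.
        for i in range(n_rows):
            num = sum(val * v[j] for (r, j), val in observed.items() if r == i)
            den = sum(v[j] ** 2 for (r, j) in observed if r == i)
            if den > 0:  # leave the factor unchanged if the row is unobserved
                u[i] = num / den
        # Update each column factor by least squares with u held fixed.
        for j in range(n_cols):
            num = sum(val * u[i] for (i, c), val in observed.items() if c == j)
            den = sum(u[i] ** 2 for (i, c) in observed if c == j)
            if den > 0:
                v[j] = num / den
    return u, v

# Toy reward table with reward[s][a] = (s + 1) * (a + 1); entry (2, 1) is unobserved.
observed = {(0, 0): 1.0, (0, 1): 2.0, (1, 0): 2.0, (1, 1): 4.0, (2, 0): 3.0}
u, v = complete_rank_one(observed, n_rows=3, n_cols=2)
estimate = u[2] * v[1]  # estimate of the unobserved entry; the true value is 6.0
print(round(estimate, 3))
```

Higher ranks would replace the scalar divisions with small least-squares solves per row and column.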
The research also tackles the critical issue of bias and fairness. In “A Primer on Causal and Statistical Dataset Biases for Fair and Robust Image Analysis”, Seyyed-Kalantari, Mittelstadt, Zietlow, and Zong, from institutions including University of California, Berkeley, and ETH Zurich, highlight that current debiasing techniques often fail in real-world deployment and can lead to a ‘levelling down’ effect, which underscores the need for a deeper understanding of causal and statistical biases. Relatedly, Dayeon Ki et al. from the University of Maryland and NAVER Cloud propose ORACLE in “Mitigating Semantic Leakage in Cross-lingual Embeddings via Orthogonality Constraint”, enforcing orthogonality between semantic and language representations to improve disentanglement in cross-lingual embeddings. This improves performance in tasks like cross-lingual retrieval, particularly in code-switching scenarios.
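An orthogonality constraint of this kind can be expressed as a penalty that is zero when the two representations of a sentence are perpendicular and grows as they align. The sketch below shows one simple form, the mean squared cosine similarity, in plain Python. It is an assumption for illustration rather than the ORACLE loss itself, and the function names and toy vectors are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def orthogonality_penalty(semantic, language):
    """Mean squared cosine between paired semantic and language vectors.

    Returns 0.0 when every pair is orthogonal and 1.0 when every pair is
    parallel, so minimizing it discourages the two spaces from overlapping.
    """
    pairs = list(zip(semantic, language))
    return sum(cosine(s, l) ** 2 for s, l in pairs) / len(pairs)

# Hypothetical 2-D components for one sentence.
disentangled = orthogonality_penalty([[1.0, 0.0]], [[0.0, 2.0]])
leaky = orthogonality_penalty([[1.0, 1.0]], [[2.0, 2.0]])
print(disentangled, leaky)
```

In training, a term like this would typically be added to the main task loss with a weighting coefficient.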
Finally, several papers explore novel architectural and theoretical foundations for representation learning. Eslam Abdelaleem et al. from Emory University introduce the Deep Variational Multivariate Information Bottleneck (DVMIB), a unifying framework that generalizes various dimensionality reduction methods and yields new techniques such as the Deep Variational Symmetric Information Bottleneck (DVSIB), which the authors report produces superior latent spaces. For protein data, Zhiyu Wang et al. from the University of Cambridge present “Topotein: Topological Deep Learning for Protein Representation Learning”, which captures hierarchical protein structures using novel data structures and SE(3)-equivariant networks, outperforming existing GNNs in protein analysis. Similarly, Sofía Pérez Casulo et al. from Universidad de la República introduce “LASE: Learned Adjacency Spectral Embeddings”, a neural architecture that learns interpretable, parameter-efficient spectral node embeddings through gradient descent, offering robustness to missing edges.
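For context on spectral node embeddings obtained by gradient descent: one classical formulation seeks node vectors X whose Gram matrix X Xᵀ approximates the adjacency matrix A, which can be fit by descending the squared Frobenius error. The sketch below does exactly that in plain Python. It does not reproduce the learned LASE architecture; the function names, step size, step count, and toy graph are assumptions for illustration.

```python
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def reconstruction_loss(adj, x):
    """Squared Frobenius error between adj and the Gram matrix of x."""
    n = len(adj)
    return sum((adj[i][j] - dot(x[i], x[j])) ** 2
               for i in range(n) for j in range(n))

def gd_spectral_embedding(adj, dim=2, lr=0.01, steps=500, seed=0):
    """Fit node embeddings x (n rows, dim columns) by plain gradient descent."""
    rng = random.Random(seed)
    n = len(adj)
    x = [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(n)]
    for _ in range(steps):
        residual = [[adj[i][j] - dot(x[i], x[j]) for j in range(n)]
                    for i in range(n)]
        # For symmetric adj the gradient w.r.t. x is -4 * residual @ x,
        # so a descent step adds 4 * lr * residual @ x.
        x = [[x[i][k] + 4 * lr * sum(residual[i][j] * x[j][k] for j in range(n))
              for k in range(dim)]
             for i in range(n)]
    return x

# Toy graph: two disjoint edges, 0-1 and 2-3.
adj = [[0, 1, 0, 0],
       [1, 0, 0, 0],
       [0, 0, 0, 1],
       [0, 0, 1, 0]]
start = gd_spectral_embedding(adj, steps=0)  # the random initialization
final = gd_spectral_embedding(adj)
print(reconstruction_loss(adj, start) > reconstruction_loss(adj, final))
```

Restricting the loss to observed node pairs is one way such an objective could be adapted to graphs with missing edges.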
Under the Hood: Models, Datasets, & Benchmarks
These advancements are underpinned by new models, specialized datasets, and rigorous benchmarks:
- SST-iTransformer: Proposed in Parking Availability Prediction via Fusing Multi-Source Data with A Self-Supervised Learning Enhanced Spatio-Temporal Inverted Transformer by Yin Huang et al., this model leverages a dual-branch attention mechanism for parking prediction.
- PianoBind: Introduced by Hayeon Bang et al. from KAIST in PianoBind: A Multimodal Joint Embedding Model for Pop-piano Music and trained on the PIAST dataset, with publicly released code and pretrained weights and a demo at https://hayeonbang.github.io/PianoBind/.
- MedVista3D: A multi-scale vision-language model for 3D CT scans, discussed in MedVista3D: Vision-Language Modeling for Reducing Diagnostic Errors in 3D CT Disease Detection, Understanding and Reporting by Yuheng Li et al., leveraging a Radiology Semantic Matching Bank (RSMB).
- Topotein (PCC & TCPNet): From Topotein: Topological Deep Learning for Protein Representation Learning by Zhiyu Wang et al., this approach uses Protein Combinatorial Complexes (PCC) and a Topology-Complete Perceptron Network (TCPNet) to model hierarchical protein structures. Code is available at github.com/ZW471/TopoteinWorkshop.
- PAMC: Policy-Aware Matrix Completion, a framework for sparse-reward RL from What Fundamental Structure in Reward Functions Enables Efficient Sparse-Reward Learning? by Ibne Farabi Shihab et al.
- Teacher-Student Model: For mitosis detection and classification, using MIDOG++ and MITOS WSI datasets, with code at https://github.com/MIDOGChallenge/teacher-student-mitosis, as seen in Teacher-Student Model for Detecting and Classifying Mitosis in the MIDOG 2025 Challenge by Seungho Choe et al.
- Graph-based Offline RL for Sepsis: Investigates GraphSAGE and GATv2 on the MIMIC-III dataset, as presented in Exploring a Graph-based Approach to Offline Reinforcement Learning for Sepsis Treatment by Taisiya Khakharova et al.
- PointAD+: For zero-shot 3D anomaly detection, leveraging CLIP’s generalization, introduced by Qihang Zhou et al. from Zhejiang University in PointAD+: Learning Hierarchical Representations for Zero-shot 3D Anomaly Detection.
- SurGBSA: A physics-informed pretraining model for molecular dynamics simulations, using CASF-2016 and PDBBind MD data, discussed in SurGBSA: Learning Representations From Molecular Dynamics Simulations by Derek Jones et al.
- NCPF: Neural Canonical Polyadic Factorization for traffic data imputation, from Neural Canonical Polyadic Factorization for Traffic Analysis by Wenyu Luo et al.
- Structure-preserving contrastive learning: Uses global topology and local graph geometry regularizers for spatial time series, with code at https://github.com/yiru-jiao/spclt, presented in Structure-preserving contrastive learning for spatial time series by Yiru Jiao et al.
- Predict, Cluster, Refine: A self-supervised framework for graph representation learning, from Predict, Cluster, Refine: A Joint Embedding Predictive Self-Supervised Framework for Graph Representation Learning.
- DCR Framework: For lifelong person re-identification, with code at https://github.com/LiuShiBen/DCR, discussed in Domain Consistency Representation Learning for Lifelong Person Re-Identification by Liu Shiben et al.
- DVSIB: Deep Variational Symmetric Information Bottleneck, a novel method from the DVMIB framework, presented in Deep Variational Multivariate Information Bottleneck – A Framework for Variational Losses by Eslam Abdelaleem et al.
- Cost-Driven LQG Control: Explores learning state representations by predicting multi-step costs in Linear Quadratic Gaussian control, by Yi Tian et al. from MIT and University of Maryland in Cost-Driven Representation Learning for Linear Quadratic Gaussian Control: Part I.
- EmoPerso: A self-supervised emotion-aware framework for personality detection, with code at https://github.com/slz0925/EmoPerso, from EmoPerso: Enhancing Personality Detection with Self-Supervised Emotion-Aware Modelling by Lingzhi Shen et al.
- Information-theoretic Multi-view Learning: A framework using mutual information for feature fusion, from Towards Comprehensive Information-theoretic Multi-view Learning by Zaiyan Khan.
- Synthetic Data & Hard Negatives: Two studies by Nikolaos Giakoumoglou et al. from Imperial College London on training self-supervised vision transformers with synthetic data: Fake & Square: Training Self-Supervised Vision Transformers with Synthetic Data and Synthetic Hard Negatives, and Unsupervised Training of Vision Transformers with Synthetic Negatives, with code for the latter at https://github.com/giakoumoglou/synco-v2.
- DroneSR: A framework for few-shot thermal image super-resolution, introducing the DroneSR dataset, with code at https://github.com/wengzp1/GARLSR, from DroneSR: Rethinking Few-shot Thermal Image Super-Resolution from Drone-based Perspective by Weng Zhipeng et al.
- Temporal Representation Learning for Ultrasound: Uses EchoNet-Dynamic dataset and temporally consistent masking, presented in Temporal Representation Learning for Real-Time Ultrasound Analysis by Yves Stebler et al.
- Propagation-aware Representation Learning: For social media graph analytics, with code at https://github.com/WeiJiang01/RPRL, from Towards Propagation-aware Representation Learning for Supervised Social Media Graph Analytics by Wei Jiang et al.
- CascadeFormer: A two-stage cascading transformer for skeleton-based human action recognition, with code at https://github.com/Yusen-Peng/CascadeFormer and checkpoints at https://huggingface.co/YusenPeng/CascadeFormerCheckpoints, from CascadeFormer: A Family of Two-stage Cascading Transformers for Skeleton-based Human Action Recognition by Yusen Peng and Alper Yilmaz.
- Entropy-based Speech Representation: Employs an entropy-based token aggregation framework for semantic speech representations, with code at https://www.modelscope.cn/models/iic/, in Entropy-based Coarse and Compressed Semantic Speech Representation Learning by Jialong Zuo et al.
- STVH & M-STVH: For multi-focused video group activities hashing, by Zhongmiao Qi et al. from Ningbo University in Multi-Focused Video Group Activities Hashing.
- ReDi: A joint image-feature synthesis method for generative image modeling, integrating DINOv2, with project page at https://representationdiffusion.github.io/, proposed by Theodoros Kouzelis et al. in Boosting Generative Image Modeling via Joint Image-Feature Synthesis.
- ViTaMIn: A robot-free visuo-tactile manipulation interface, with code at https://chuanyune.github.io/ViTaMIn_page, from ViTaMIn: Learning Contact-Rich Tasks Through Robot-Free Visuo-Tactile Manipulation Interface by Fangchen Liu et al.
- cMIM: Contrastive Mutual Information Machine, an extension of MIM for unified generative and discriminative representation learning, with code at https://github.com/NVIDIA/MIM, by Micha Livne from NVIDIA in Contrastive MIM: A Contrastive Mutual Information Framework for Unified Generative and Discriminative Representation Learning.
- Learnable Weighted Hybrid Autoencoder: For model order reduction, with code at https://github.com/csml-rpi/deep-ae-with-svd-convergence, from Beyond the Kolmogorov Barrier: A Learnable Weighted Hybrid Autoencoder for Model Order Reduction by Nithin Somasekharan and Shaowu Pan.
- InDiD: Instant Disorder Detection via Representation Learning, for change point detection, with a newly labeled video dataset and code at https://github.com/romanenkova95/InDiD, from InDiD: Instant Disorder Detection via Representation Learning by Evgenia Romanenkova et al.
- Atypical Video Dataset: Introduced in What Can We Learn from Harry Potter? An Exploratory Study of Visual Representation Learning from Atypical Videos by Qiyue Sun et al., for open-world visual representation learning. Dataset resources: https://julysun98.github.io/atypical_dataset.
- SatDINO: For self-supervised pretraining in remote sensing, using fMoW-RGB dataset, with code at https://github.com/strakaj/SatDINO, from SatDINO: A Deep Dive into Self-Supervised Pretraining for Remote Sensing by Jakub Straka and Ivan Gruber.
- LINKO: An LLM-augmented framework for multi-ontology integration in medical concept representation, with code at https://github.com/mohsen-nyb/LINKO.git, from Multi-Ontology Integration with Dual-Axis Propagation for Medical Concept Representation by Mohsen Nayebi Kerdabadi et al.
- Robust Spatial Representations from Binaural Audio: Utilizes HRTF-DATABASE and ANF-Generator, from Learning Robust Spatial Representations from Binaural Audio through Feature Distillation by S. Doclo and J. Bitzer.
- EEGDM: For EEG representation learning using latent diffusion models, from EEGDM: Learning EEG Representation with Latent Diffusion Model by Shaocong Wang et al.
- MAEs for Ultrasound Signals: Applies masked autoencoders for robust representation learning in ultrasound signal processing, from Masked Autoencoders for Ultrasound Signals: Robust Representation Learning for Downstream Applications.
- Human-AI Collaborative Bot Detection: An unsupervised framework using contrastive representation learning and LLMs for MMORPGs, from Human-AI Collaborative Bot Detection in MMORPGs by Jaeman Son and Hyunsoo Kim.
- Collaborative Evolution of Intelligent Agents: A multi-agent framework for microservice systems, by Renzi Meng from Northeastern University in Collaborative Evolution of Intelligent Agents in Large-Scale Microservice Systems.
- Multi-Modal Tumor and Peritumoral Feature Fusion Network: For distant metastasis prediction in head and neck cancer, from Prediction of Distant Metastasis for Head and Neck Cancer Patients Using Multi-Modal Tumor and Peritumoral Feature Fusion Network.
- Latent Double Machine Learning (latent DML): Integrates latent variable modeling into causal effect estimation, with code at https://github.com/nitaifingerhut/C-DML, from Latent Variable Modeling for Robust Causal Effect Estimation by Tetsuro Morimura et al.
- Autoencoders for Semantic Geometry: Surveys VAE, VQVAE, and SAE for compositional and distributional semantics, from Bridging Compositional and Distributional Semantics: A Survey on Latent Semantic Geometry via AutoEncoder by Yingji Zhang et al.
- Disentangled World Models (DisWM): For visual reinforcement learning with latent distillation, with a project page at https://qiwang067.github.io/diswm, by Qi Wang et al. in Disentangled World Models: Learning to Transfer Semantic Knowledge from Distracting Videos for Reinforcement Learning.
- Noro: Noise-Robust One-shot Voice Conversion, using hidden speaker representation learning, from Noro: Noise-Robust One-shot Voice Conversion with Hidden Speaker Representation Learning by Zhang, Y. et al.
- SDGNN: Parameter-Free Structural-Diversity Message Passing for Graph Neural Networks, with code at https://github.com/mingyue15694/SGDNN/tree/main, by Mingyue Kong et al. in Parameter-Free Structural-Diversity Message Passing for Graph Neural Networks.
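As background for the NCPF entry in the list above, a canonical polyadic (CP) model represents each tensor entry as a sum over rank components of the product of one factor value per mode, so a missing entry can be imputed from the factors alone. The sketch below shows only that classical reconstruction rule in plain Python; the neural variant proposed in the paper is not reproduced. The function name and the sensor/day/time-slot factors are hypothetical.

```python
from math import prod

def cp_entry(factors, index):
    """Value of one tensor entry under a rank-R CP model.

    factors[m][i] is the length-R factor row for position i along mode m,
    and index gives one position per mode.
    """
    rank = len(factors[0][0])
    return sum(
        prod(factors[mode][index[mode]][r] for mode in range(len(factors)))
        for r in range(rank)
    )

# Hypothetical rank-2 factors for a (sensor, day, time slot) traffic tensor.
sensors = [[1.0, 0.0], [0.5, 2.0]]
days = [[2.0, 1.0]]
slots = [[1.0, 1.0], [3.0, 0.5]]

# Entry for sensor 1, day 0, slot 1: 0.5*2.0*3.0 + 2.0*1.0*0.5
value = cp_entry([sensors, days, slots], (1, 0, 1))
print(value)  # 4.0
```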
Impact & The Road Ahead
Taken together, this research touches a wide range of AI/ML subfields. The advances in multimodal representation learning, exemplified by BiListing and PianoBind, point toward AI systems that can draw on richer combinations of text, image, audio, and symbolic data. This has direct implications for areas like augmented reality, smart interfaces, and content generation.
Critically, the efforts to address bias and improve fairness in models, as highlighted by “A Primer on Causal and Statistical Dataset Biases” and ORACLE, are essential for building trustworthy AI. As AI systems become more ubiquitous, ensuring they are robust and equitable across diverse populations and languages is paramount. The focus on low-resource and sparse-data settings, seen in VECTOR+ and PAMC, democratizes AI development, enabling powerful applications in domains where labeled data is scarce, such as drug discovery and personalized medicine.
Innovations in areas like topological deep learning for proteins (Topotein), advanced signal processing for ultrasound and EEG (MAEs for Ultrasound Signals, EEGDM), and robust control systems (Cost-Driven LQG Control) point towards a future where AI can tackle increasingly complex scientific and engineering challenges. The use of synthetic data and hard negatives for vision transformers (Fake & Square, Unsupervised Training of Vision Transformers) also offers a promising path to reduce the reliance on vast, expensive real-world datasets, fostering more sustainable AI development.
The integration of LLMs in diverse applications, from enhancing medical concept representations (LINKO) to verifying bot detection in MMORPGs (Human-AI Collaborative Bot Detection), illustrates the growing versatility and power of these models. This signifies a trend towards human-AI collaborative systems that leverage AI for enhanced decision-making and explainability.
The road ahead will likely see continued convergence of different AI paradigms (deep learning, reinforcement learning, and causal inference), supported by foundational work such as DVMIB and Latent Double Machine Learning. Further progress can be expected toward AI systems that are adaptable, fair, and able to operate in complex, real-world environments. Representation learning remains central to that effort.