
Segment Anything Model: Lighting Up, Slimming Down, and Diving into 3D — Recent Breakthroughs in Foundation Models

Latest 13 papers on segment anything model: May. 23, 2026

The Segment Anything Model (SAM) has reshaped computer vision with its ability to segment almost any object from a simple prompt. Real-world deployment, however, has faced hurdles: performance degrades under diverse conditions, computational demands are high, and many tasks need specific adaptation. Recent research is making SAM more robust, efficient, and versatile across a wide range of applications, from medical imaging to civil engineering and beyond.

The Big Idea(s) & Core Innovations

The common theme across these papers is adapting and extending foundation models such as SAM, SAM2, and SAM3 to address their limitations without sacrificing their core strengths. A key challenge is maintaining robustness across varied conditions. For instance, in “Lighting-aware Unified Model for Instance Segmentation” (https://arxiv.org/pdf/2605.20436), researchers from Iowa State University introduce PLAP-LCA, a lightweight adapter featuring a Lighting Convolutional-Attention (LCA) module. Rather than learning illumination invariance implicitly, this module uses Laplacian edge detection to explicitly model the physical signature of lighting variations at object boundaries, which grounds the invariance in image structure. This lets foundation models maintain performance under diverse real-world lighting, a critical step for practical deployment.
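To make the Laplacian idea concrete, here is a minimal sketch of a plain 3x3 Laplacian filter on a toy grayscale grid. It is not the LCA module from the paper; the `laplacian` helper and the toy image are illustrative assumptions. It only shows one property that makes such a filter attractive for lighting robustness: because the kernel weights sum to zero, a uniform brightness offset leaves the edge response unchanged.

```python
# 4-neighbour discrete Laplacian kernel; its weights sum to zero.
LAPLACIAN = [[0, 1, 0],
             [1, -4, 1],
             [0, 1, 0]]


def laplacian(image):
    """Return the Laplacian response for the interior pixels of a 2D list."""
    height, width = len(image), len(image[0])
    out = []
    for y in range(1, height - 1):
        row = []
        for x in range(1, width - 1):
            acc = 0
            for dy in range(3):
                for dx in range(3):
                    acc += LAPLACIAN[dy][dx] * image[y + dy - 1][x + dx - 1]
            row.append(acc)
        out.append(row)
    return out


# Toy 5x5 image (hypothetical data): a vertical edge between dark and bright.
image = [[10, 10, 200, 200, 200] for _ in range(5)]

# Same scene with a uniform additive brightness shift.
brighter = [[pixel + 40 for pixel in row] for row in image]

print(laplacian(image))     # [[190, -190, 0], [190, -190, 0], [190, -190, 0]]
print(laplacian(brighter))  # identical: the constant offset cancels out
```

The response is nonzero only next to the edge and is the same for both images. A multiplicative lighting change would scale the response rather than cancel, which is one reason a learned module on top of the raw filter is still useful.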

Another significant area of focus is expanding SAM’s utility beyond traditional image segmentation. “FG-TreeSeg: Flow-Guided Tree Crown Segmentation without Instance Annotations” (https://arxiv.org/pdf/2602.00470) by Pengyu Chen et al. introduces a training-free framework that transfers flow-based delineation dynamics from biomedical imaging to remote sensing. By exploiting the star-convex morphology that tree crowns share with biological cells, FG-TreeSeg achieves robust, annotation-free tree crown segmentation. Similarly, Stefano Colamonaco et al. from KU Leuven, in “Weakly Supervised Segmentation as Semantic-Based Regularization” (https://arxiv.org/pdf/2605.13674), introduce a neurosymbolic approach that uses differentiable fuzzy logic to integrate weak annotations and structural priors, outperforming fully supervised baselines on Pascal VOC.
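As a rough illustration of how a logical rule can become a differentiable penalty, the sketch below encodes one weak-annotation rule, "a pixel outside the bounding box is background", with the Reichenbach fuzzy implication. The rule, the function names, and the toy values are assumptions chosen for illustration; the paper's actual logic and operators may differ.

```python
def implies(a, b):
    """Reichenbach fuzzy implication: truth of (a -> b) for a, b in [0, 1]."""
    return 1.0 - a + a * b


def rule_loss(fg_probs, outside_box):
    """Penalty for violating 'pixel outside the box -> pixel is background'.

    fg_probs:    predicted foreground probability per pixel
    outside_box: 1.0 if the pixel lies outside the weak box label, else 0.0
    """
    truths = [implies(o, 1.0 - p) for p, o in zip(fg_probs, outside_box)]
    # The loss is zero when the rule is fully satisfied on every pixel.
    return 1.0 - sum(truths) / len(truths)


# Four pixels (hypothetical): the last two lie outside the box.
outside_box = [0.0, 0.0, 1.0, 1.0]

consistent = [1.0, 1.0, 0.0, 0.0]  # no foreground predicted outside the box
violating = [0.9, 0.8, 0.1, 0.7]   # last pixel is confidently foreground

print(rule_loss(consistent, outside_box))  # 0.0
print(rule_loss(violating, outside_box))   # approximately 0.2
```

Because the expression is built from sums and products, it is differentiable in the predicted probabilities and could be added to a standard segmentation loss as a regularization term.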

Efficiency and 3D capabilities are also hot topics. Huawei Noah’s Ark Lab’s “TinySAM 2: Extreme Memory Compression for Efficient Track Anything Model” (https://arxiv.org/pdf/2605.18013) demonstrates that joint spatial-temporal memory token compression can cut the memory footprint to 7% while retaining 90% of SAM 2.1’s performance, enabling real-time, on-device video segmentation. For 3D applications, “3D Modeling and Automated Measurement of Concrete Cracks via Segment Anything Refinement and Visual Inertial LiDAR Fusion” (https://arxiv.org/pdf/2501.09203) by Pengru Deng et al. from Central South University enhances SAM for concrete crack segmentation and 3D reconstruction using a novel crack-aware prompt strategy and Visual Inertial LiDAR (VIL) fusion, achieving sub-millimeter accuracy. Also in 3D segmentation, Raushan Joshi and Jean-Yves Guillemaut from the University of Surrey introduce “Robust Prior-Guided Segmentation for Editable 3D Gaussian Splatting” (https://arxiv.org/pdf/2605.16065), which integrates SAM-HQ masks with learned priors for multiview-consistent 3D object segmentation and editing in real time.
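For intuition on the memory compression mentioned above, the following sketch shrinks a toy memory bank along both axes: it keeps every `temporal_stride`-th frame and average-pools groups of neighbouring tokens within each kept frame. This is a generic stand-in, not TinySAM 2's method; all names and the pooling scheme are assumptions made for illustration.

```python
def mean_pool(vectors):
    """Average a non-empty list of equal-length vectors element-wise."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def compress_memory(memory, spatial_group, temporal_stride):
    """Compress a memory bank given as frames -> tokens -> feature vector.

    Temporal axis: keep every `temporal_stride`-th frame (others are dropped).
    Spatial axis:  average consecutive groups of `spatial_group` tokens.
    """
    compressed = []
    for frame in memory[::temporal_stride]:
        pooled = [mean_pool(frame[i:i + spatial_group])
                  for i in range(0, len(frame), spatial_group)]
        compressed.append(pooled)
    return compressed


def token_count(memory):
    """Total number of tokens stored across all frames."""
    return sum(len(frame) for frame in memory)


# Hypothetical bank: 4 frames x 8 tokens, each token a 2-d feature vector.
memory = [[[float(f), float(t)] for t in range(8)] for f in range(4)]
small = compress_memory(memory, spatial_group=4, temporal_stride=2)

print(token_count(memory), "->", token_count(small))  # 32 -> 4
print(small[0][0])  # [0.0, 1.5]
```

Here the retained fraction is 1 / (spatial_group * temporal_stride), so the two axes multiply. That multiplication is why compressing space and time jointly can reach very small footprints.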

From a benchmarking perspective, Xiangxiang Cui et al. from Beijing Normal University and the University of Surrey highlight critical vulnerabilities in “MedFM-Robust: Benchmarking Robustness of Medical Foundation Models” (https://arxiv.org/pdf/2605.19027). They show that the choice of fine-tuning strategy significantly affects robustness: LoRA shows nearly double the degradation of full fine-tuning, especially under medical-specific perturbations. In parallel, Yuyuan Liu et al. from the University of Oxford tackle multimodal challenges in “AuralSAM2: Enabling SAM2 Hear Through Pyramid Audio-Visual Feature Prompting” (https://arxiv.org/pdf/2506.01015), which addresses “audio prompt dilution” in SAM2 for audio-visual segmentation with a multi-scale feature pyramid and contrastive learning.
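For readers less familiar with the two fine-tuning strategies being compared, the sketch below shows the standard LoRA formulation for a single linear layer: the pretrained weight W stays frozen and only a low-rank update B·A is trained. The helper names and toy matrices are illustrative assumptions, and nothing here reproduces the benchmark's setup.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def lora_forward(W, A, B, x, alpha=1.0):
    """Compute W x + (alpha / r) * B (A x), with W frozen.

    A has shape r x d_in and B has shape d_out x r, where r is the rank.
    """
    rank = len(A)
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    return [b + (alpha / rank) * u for b, u in zip(base, update)]


def trainable_params(d_out, d_in, rank=None):
    """Trainable weights: full fine-tuning (rank=None) or LoRA at a given rank."""
    if rank is None:
        return d_out * d_in
    return rank * (d_in + d_out)


W = [[1.0, 0.0], [0.0, 1.0]]  # frozen pretrained weight (toy identity)
A = [[1.0, 1.0]]              # rank-1 down-projection
B_zero = [[0.0], [0.0]]       # B starts at zero, so the layer is unchanged
B_tuned = [[0.5], [0.0]]      # hypothetical value after some training

print(lora_forward(W, A, B_zero, [3.0, 4.0]))   # [3.0, 4.0]
print(lora_forward(W, A, B_tuned, [3.0, 4.0]))  # [6.5, 4.0]

full = trainable_params(768, 768)
lora = trainable_params(768, 768, rank=8)
print(full, lora)  # 589824 12288
```

For a 768 x 768 layer at rank 8, LoRA trains about 2% of the weights that full fine-tuning would. That constraint on the update is the trade-off the benchmark probes when it compares robustness across strategies.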

Efficiency is further addressed by Hoai-Chau Tran et al. from the University of Illinois at Urbana-Champaign in “SparseSAM: Structured Sparsification of Activations in Segment Anything Models” (https://arxiv.org/pdf/2605.17633), a training-free framework that achieves a 2x speedup and a 2.8x memory reduction by jointly sparsifying attention and MLP layers. For practical applications, Eugenia Moris et al. from Arionkoder LLC have developed “End-to-end plaque counting and virus titration from laboratory plate images with deep learning” (https://arxiv.org/pdf/2605.16008), which adapts SAM2 for well segmentation and SAM for plaque detection to automate virus quantification and shows strong agreement with manual counts.

For scene understanding, Jiyuan Liu et al. from Hangzhou Dianzi University propose “M4-SAM: Multi-Modal Mixture-of-Experts with Memory-Augmented SAM for RGB-D Video Salient Object Detection” (https://arxiv.org/pdf/2605.11760), which enhances SAM2 with Modality-Aware MoE-LoRA and prompt-free initialization. Yuxin Du et al. from Shanghai Jiao Tong University introduce “Focusable Monocular Depth Estimation” (https://arxiv.org/pdf/2605.11756), a new task for target-conditioned depth inference that combines SAM3’s spatial selectivity with Depth Anything’s geometry priors to prioritize foreground accuracy. Finally, bridging 2D and 3D segmentation, Yixiao Song et al. from Beijing Jiaotong University present “PointGS: Semantic-Consistent Unsupervised 3D Point Cloud Segmentation with 3D Gaussian Splatting” (https://arxiv.org/pdf/2605.11520), which uses 3D Gaussian Splatting as a unified representation to transfer 2D semantics from SAM to 3D point clouds without supervision, overcoming projection overlap issues.
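Returning to the activation sparsification idea behind SparseSAM, the sketch below shows what "structured" means in its simplest form: whole hidden channels are dropped, so the remaining activations stay dense and the next layer can use a smaller matrix. The selection rule (top-k channels by mean absolute activation) and all names are assumptions for illustration, not the paper's Stripe-Sort Attention or Residual-Consistency MLP.

```python
def channel_scores(activations):
    """Mean absolute activation per channel; input is tokens x channels."""
    tokens = len(activations)
    return [sum(abs(v) for v in column) / tokens
            for column in zip(*activations)]


def keep_top_channels(activations, keep):
    """Keep the `keep` highest-scoring channels and drop the rest entirely.

    Returns the kept channel indices (ascending) and the smaller dense
    activation matrix restricted to those channels.
    """
    scores = channel_scores(activations)
    ranked = sorted(range(len(scores)), key=lambda c: scores[c], reverse=True)
    kept = sorted(ranked[:keep])
    return kept, [[row[c] for c in kept] for row in activations]


# Hypothetical MLP hidden activations: 2 tokens x 4 channels.
activations = [[0.1, 2.0, -0.2, 1.0],
               [0.0, -3.0, 0.1, 1.5]]

kept, dense = keep_top_channels(activations, keep=2)
print(kept)   # [1, 3]
print(dense)  # [[2.0, 1.0], [-3.0, 1.5]]
```

In a real layer, the rows of the following weight matrix that correspond to dropped channels would be skipped as well, which is where the speed and memory savings of structured (as opposed to element-wise) sparsity come from.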

Under the Hood: Models, Datasets, & Benchmarks

These advances rest on targeted modifications to existing models, new datasets, and rigorous benchmarking.

  • Lighting-aware Unified Model for Instance Segmentation (https://arxiv.org/pdf/2605.20436) introduces PLAP-LCA (a dual-branch adapter with a Lighting Convolutional-Attention module) and a novel Unity-based synthetic dataset for physically accurate illumination evaluation. The authors state that the code and dataset will be released.
  • MedFM-Robust: Benchmarking Robustness of Medical Foundation Models (https://arxiv.org/pdf/2605.19027) presents a comprehensive benchmark for Med-VLMs (LLaVA-Med, MedGemma, Gemini-2.5-flash, GPT-4o-mini) and SAM-based segmentation models (MedSAM, SAM-Med2D), using a modality-adaptive perturbation pipeline across 8 medical imaging modalities. The code and resources are available at https://github.com/AbnerAI/MedFM-Robust.
  • 3D Modeling and Automated Measurement of Concrete Cracks (https://arxiv.org/pdf/2501.09203) builds upon SAM and DeepLabv3+ with a crack-aware prompt generation strategy and utilizes FastLIO2 LiDAR-Inertial Odometry. Code is at https://github.com/XR-Lee/CrackSeg.
  • TinySAM 2: Extreme Memory Compression (https://arxiv.org/pdf/2605.18013) features RepViT as a lightweight image encoder and joint memory management and spatiotemporal compression. It’s evaluated on SA-V, DAVIS 2017, and YTVOS datasets, using a 0.5% subset of SA-1B.
  • SparseSAM: Structured Sparsification of Activations (https://arxiv.org/pdf/2605.17633) is a training-free framework for SAM-B, SAM-L, and SAM-H checkpoints, employing Stripe-Sort Attention and Residual-Consistency MLP. It uses MS-COCO and HQ-44K datasets.
  • FG-TreeSeg: Flow-Guided Tree Crown Segmentation (https://arxiv.org/pdf/2602.00470) combines SegFormer with Cellpose-SAM for training-free tree crown segmentation, evaluated on NEON and BAMFORESTS datasets. Code is available for Cellpose-SAM and SegFormer.
  • Robust Prior-Guided Segmentation for Editable 3D Gaussian Splatting (https://arxiv.org/pdf/2605.16065) integrates SAM-HQ with a prior-guided label reassignment method for 3D Gaussian Splatting. A new high-quality multiview mask dataset is released at https://huggingface.co/datasets/joshir/3D-Scene-Segmentation-HQ.
  • End-to-end plaque counting and virus titration (https://arxiv.org/pdf/2605.16008) uses SAM2 for well segmentation and SAM (ViT-B backbone) for plaque detection. It introduces the Titra web platform (https://titra.app/) and a MAYV/CVB3 dataset from smartphone photography. Code is to be released at github.com/arionkoder/titra-ai.
  • AuralSAM2: Enabling SAM2 Hear Through Pyramid Audio-Visual Feature Prompting (https://arxiv.org/pdf/2506.01015) introduces an AuralFuser module and AudioCon contrastive learning strategy for SAM2. It’s tested on AVSBench and Ref-AVS datasets. Code is at https://github.com/yyliu01/AuralSAM2.
  • Weakly Supervised Segmentation as Semantic-Based Regularization (https://arxiv.org/pdf/2605.13674) proposes a neurosymbolic approach for SAM, evaluated on Pascal VOC 2012 and REFUGE2 datasets. Code is available at https://github.com/StefanoColamonaco/Logic-Guided-Segmentation.
  • M4-SAM: Multi-Modal Mixture-of-Experts with Memory-Augmented SAM (https://arxiv.org/pdf/2605.11760) extends SAM2 with Modality-Aware MoE-LoRA, Gated Multi-Level Feature Fusion, and Pseudo-Guided Initialization. It achieves state-of-the-art results on the DViSal, RDVS, and ViDSOD-100 RGB-D VSOD benchmarks. Code is at https://github.com/HankLiu2020/M4-SAM.
  • Focusable Monocular Depth Estimation (https://arxiv.org/pdf/2605.11756) introduces the FocusDepth framework combining SAM3 and Depth Anything through Multi-Scale Spatial-Aligned Fusion (MSSA), and establishes FDE-Bench, a new benchmark for target-centric depth evaluation.
  • PointGS: Semantic-Consistent Unsupervised 3D Point Cloud Segmentation (https://arxiv.org/pdf/2605.11520) leverages 3D Gaussian Splatting as an intermediate representation with SAM for unsupervised 3D point cloud segmentation on ScanNet-V2 and S3DIS. Code is at https://github.com/SebastianYIXIAO/pointGS.

Impact & The Road Ahead

The collective impact of this research is profound, propelling foundation models like SAM closer to real-world, robust, and efficient deployment. The advancements in illumination invariance, memory compression, and training-free adaptations mean that these powerful models can now operate in more challenging environments and on resource-constrained devices. The foray into 3D segmentation, multimodal processing (audio-visual), and domain-specific applications (medical, civil engineering, forestry) opens up new frontiers for automation and analysis. The creation of new benchmarks and datasets, particularly for robustness and specific tasks, is crucial for fostering continued progress.

Looking ahead, we can expect further innovations in parameter-efficient fine-tuning that maintain robustness, the seamless integration of multimodal cues, and more sophisticated neurosymbolic approaches that combine the power of foundation models with human-understandable logic and priors. The goal is clear: to make AI perception systems not just capable, but also reliable, adaptable, and efficient enough to transform industries and enhance our daily lives. The Segment Anything Model, continuously refined and expanded, is poised to remain at the forefront of this exciting evolution.
