OV3D-CG

Open-vocabulary 3D Instance Segmentation
with Contextual Guidance

ICCV 2025

1 Key Laboratory of AI Safety of Chinese Academy of Sciences (CAS),
Institute of Computing Technology, CAS, Beijing, 100190, China
2 University of Chinese Academy of Sciences, Beijing, 100049, China

Abstract

Open-vocabulary 3D instance segmentation (OV-3DIS), which aims to segment and classify objects beyond predefined categories, is a critical capability for embodied AI applications. Existing methods rely on pre-trained 2D foundation models and focus on instance-level features while overlooking contextual relationships, which limits their ability to generalize to rare or ambiguous objects. To address these limitations, we propose an OV-3DIS framework guided by contextual information. First, we employ a Class-agnostic Proposal Module, which integrates a pre-trained 3D segmentation model with a SAM-guided segmenter to extract robust 3D instance masks. Subsequently, we design a Semantic Reasoning Module, which selects the best viewpoint for each instance and constructs three 2D context-aware representations. These representations are processed by Multimodal Large Language Models (MLLMs) with Chain-of-Thought (CoT) prompting to enhance semantic inference. Our method outperforms state-of-the-art methods on the ScanNet200 and Replica datasets, demonstrating superior open-vocabulary segmentation capabilities. Moreover, a preliminary implementation in real-world scenarios verifies our method's robustness and accuracy, highlighting its potential for embodied AI tasks such as object-driven navigation.

Method

The Class-agnostic Proposal Module extracts initial 3D instance masks using both a pre-trained 3D instance segmenter and a SAM-based segmenter. The Semantic Reasoning Module selects the best viewpoint for each instance and represents the instance in 2D using bounding boxes, landmarks, and SAM-based masks. Multimodal LLMs with Chain-of-Thought (CoT) prompting then assign a semantic category to each instance based on contextual information from the scene.
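The viewpoint-selection and prompting steps described above can be illustrated with a minimal sketch. It is not the authors' implementation: the page does not state how the best viewpoint is scored, so the sketch assumes a simple criterion (the view in which the most instance points project inside the image, ignoring occlusion). The function names `project`, `visible_count`, `select_best_view`, and `build_cot_prompt`, the input layout, and the prompt wording are all hypothetical.

```python
from typing import Dict, List, Optional, Sequence, Tuple

Point = Tuple[float, float, float]
Intrinsics = Tuple[float, float, float, float]  # fx, fy, cx, cy


def project(point_cam: Point, k: Intrinsics) -> Optional[Tuple[float, float]]:
    """Pinhole projection of a point given in camera coordinates."""
    x, y, z = point_cam
    if z <= 0.0:  # behind the camera, cannot be seen
        return None
    fx, fy, cx, cy = k
    return (fx * x / z + cx, fy * y / z + cy)


def visible_count(points_cam: Sequence[Point], k: Intrinsics,
                  width: int, height: int) -> int:
    """Count points that project inside the image (occlusion is ignored)."""
    count = 0
    for p in points_cam:
        uv = project(p, k)
        if uv is not None and 0.0 <= uv[0] < width and 0.0 <= uv[1] < height:
            count += 1
    return count


def select_best_view(views: Dict[str, Sequence[Point]], k: Intrinsics,
                     width: int, height: int) -> str:
    """Assumed criterion: pick the view with the most in-frame instance points.

    `views` maps a view id to the instance's points expressed in that
    view's camera frame (a hypothetical input layout).
    """
    return max(views, key=lambda vid: visible_count(views[vid], k, width, height))


def build_cot_prompt(candidate_labels: List[str]) -> str:
    """Assemble an illustrative step-by-step prompt (wording is hypothetical)."""
    steps = [
        "You are shown three images of one object: a bounding-box view, "
        "a landmark-annotated view, and a SAM mask view.",
        "Step 1: Describe the object's shape, color, and material.",
        "Step 2: Describe the nearby objects and the type of room.",
        "Step 3: Using both, choose the best label from: "
        + ", ".join(candidate_labels) + ".",
    ]
    return "\n".join(steps)
```

For example, calling `select_best_view` with one view in which half the points lie behind the camera and another in which all points are in frame returns the id of the second view; the resulting images and the output of `build_cot_prompt` would then be passed to whichever multimodal LLM is in use.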

Qualitative Results

Common Objects

"potted plant"
"fireplace"

Uncommon Objects

"board games"
"lampshade"

Object Color

"yellow chair"
"purple duvet"

Long Description Affordance

"A grey fabric chair positioned next to a white desk"
"a red cylindrical air compressor mounted on a metal rack"

More Visualization Results
