👆Click-Gaussian: Interactive Segmentation to Any 3D Gaussians

ECCV 2024
Seokhun Choi¹*, Hyeonseop Song¹*, Jaechul Kim¹, Taehyeong Kim²†, Hoseok Do¹†
¹LG Electronics, ²Seoul National University, *Equal contribution, †Co-corresponding author

We present 👆Click-Gaussian, a swift and precise method for interactive segmentation of 3D Gaussians using two-level granularity feature fields derived from 2D segmentation masks.

Abstract

Interactive segmentation of 3D Gaussians opens a great opportunity for real-time manipulation of 3D scenes thanks to the real-time rendering capability of 3D Gaussian Splatting. However, current methods suffer from time-consuming post-processing to deal with noisy segmentation output. They also struggle to provide detailed segmentation, which is important for fine-grained manipulation of 3D scenes. In this study, we propose Click-Gaussian, which learns distinguishable feature fields of two-level granularity, facilitating segmentation without time-consuming post-processing. We delve into challenges stemming from inconsistently learned feature fields resulting from 2D segmentation obtained independently from a 3D scene. 3D segmentation accuracy deteriorates when 2D segmentation results across the views, the primary cues for 3D segmentation, are in conflict. To overcome these issues, we propose Global Feature-guided Learning (GFL). GFL constructs clusters of global feature candidates from noisy 2D segments across the views, which smooths out noise when training the features of 3D Gaussians. Our method runs in 10 ms per click, 15 to 130 times as fast as previous methods, while also significantly improving segmentation accuracy.
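To illustrate why a learned, distinguishable feature field can make a click cheap to resolve, here is a minimal NumPy sketch of one plausible selection step: compare the feature at the clicked location against every Gaussian's feature and keep those above a similarity threshold. The function name `select_gaussians_by_click`, the cosine-similarity criterion, and the `threshold` value are assumptions made for illustration; they are not taken from the paper or its code.

```python
import numpy as np

def select_gaussians_by_click(gaussian_feats, clicked_feat, threshold=0.9):
    """Return a boolean mask over Gaussians whose feature resembles the clicked one.

    gaussian_feats: (N, D) array, one feature vector per Gaussian (hypothetical layout).
    clicked_feat:   (D,) feature read at the clicked location (hypothetical input).
    threshold:      cosine-similarity cutoff (assumed value, not from the paper).
    """
    # Normalize so that the dot product equals cosine similarity.
    g = gaussian_feats / (np.linalg.norm(gaussian_feats, axis=1, keepdims=True) + 1e-8)
    q = clicked_feat / (np.linalg.norm(clicked_feat) + 1e-8)
    sim = g @ q
    return sim >= threshold

# Toy example: four Gaussians with 2-D features, click feature pointing along the x-axis.
feats = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]])
mask = select_gaussians_by_click(feats, np.array([1.0, 0.0]), threshold=0.9)
```

In this toy case only the first two Gaussians are selected. The whole step is a single matrix-vector product followed by a comparison, which is the kind of operation that needs no post-processing pass.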

Method Overview

Click-Gaussian architecture.

Overview of the proposed method. i) Our approach augments pre-trained 3D Gaussians with two-level granularity features $\mathbf{f}_i$. ii) These features are trained through contrastive learning, utilizing 2D rendered feature maps $\mathbf{F}$ and their corresponding SAM-generated masks $M$. iii) To address inconsistencies in mask signals across views, we introduce a Global Feature-guided Learning approach. For clarity, Global Feature-guided Learning at the fine level is omitted from the illustration. For more details, please refer to our paper.
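As a rough illustration of step ii), the following NumPy sketch computes a generic mask-supervised contrastive loss over pixel features sampled from a rendered feature map: pixels that share a mask ID are pulled together, and pixels with different mask IDs are pushed apart once their similarity exceeds a margin. This is a common formulation chosen for illustration, not the exact loss used in the paper; the function name `mask_contrastive_loss` and the `margin` value are assumptions.

```python
import numpy as np

def mask_contrastive_loss(pixel_feats, mask_ids, margin=0.5):
    """Generic contrastive loss over sampled pixels (illustrative, not the paper's loss).

    pixel_feats: (N, D) features sampled from a rendered feature map F.
    mask_ids:    (N,) integer ID of the 2D mask M each pixel belongs to.
    margin:      similarity above which different-mask pairs are penalized (assumed).
    """
    f = pixel_feats / (np.linalg.norm(pixel_feats, axis=1, keepdims=True) + 1e-8)
    sim = f @ f.T                                    # pairwise cosine similarity
    same = mask_ids[:, None] == mask_ids[None, :]    # True where two pixels share a mask
    off_diag = ~np.eye(len(mask_ids), dtype=bool)    # exclude self-pairs
    pos = same & off_diag
    neg = ~same
    # Pull same-mask pairs toward similarity 1.
    pos_loss = (1.0 - sim[pos]).mean() if pos.any() else 0.0
    # Penalize different-mask pairs only when they are more similar than the margin.
    neg_loss = np.maximum(sim[neg] - margin, 0.0).mean() if neg.any() else 0.0
    return float(pos_loss + neg_loss)

# Toy check: features that agree with the masks give a lower loss than ones that do not.
feats = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])
good = mask_contrastive_loss(feats, np.array([0, 0, 1, 1]))
bad = mask_contrastive_loss(feats, np.array([0, 1, 0, 1]))
```

When the masks from different views disagree about which pixels belong together, a loss of this kind receives conflicting targets for the same Gaussians, which is the inconsistency that step iii) is meant to address.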

Demo

Comparison

Comparison on LERF-Mask Dataset

The results are displayed in three rows per scene (Teatime, Ramen, and Figurines, in that order). For each scene, the first two rows show coarse-level and fine-level segmentation results, respectively, and the third row shows PCA visualizations of each model's finest-level feature field. Our approach demonstrates superior segmentation ability at both the coarse and fine levels. Red and yellow boxes indicate noisy results and under-segmentation, respectively.

Our approach extracts Gaussians in more detail and more cleanly, and runs up to 130 times faster than the baselines.

Our method produces more accurate and fine-grained results than the baselines when automatically segmenting everything in novel views.

Comparison on SPIn-NeRF Dataset

Our method achieves more precise Gaussian extraction, highlighted by red dotted lines, and runs about 15 times faster than SAGA.

Application

OpenAI recently announced Sora, a groundbreaking text-to-video generation model. To demonstrate Click-Gaussian's versatility in scene segmentation and manipulation tasks, we applied our method to two Sora-generated videos (santorini and snow-village). After pre-training 3DGS on each generated video, users can apply Click-Gaussian to flexibly make desired modifications such as resizing, translation, and text-based editing.

More Results

We show our PCA-visualized coarse-level and fine-level feature fields for three scenes (Figurines, Teatime, and Ramen) from the LERF-Mask Dataset.

We present automatic segmentation results (third and fifth columns) along with PCA visualizations of rendered feature maps (second and fourth columns) at two granularity levels for nine scenes (first column) from the LERF Dataset. Objects classified with the same ID in the segmentation results share the same overlaid color across the three given views, as each global cluster ID remains consistent throughout a scene.
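The view-consistent coloring described above can be pictured with a small NumPy sketch: if every view's rendered feature map is labeled against the same set of scene-level cluster centers, an object receives the same ID in every view. The function `assign_cluster_ids`, the nearest-center rule by cosine similarity, and the assumption that cluster centers are already available are all illustrative choices, not details confirmed by the paper.

```python
import numpy as np

def assign_cluster_ids(feature_map, cluster_centers):
    """Label each pixel with the index of its most similar global cluster center.

    feature_map:     (H, W, D) rendered feature map for one view (hypothetical input).
    cluster_centers: (K, D) scene-level cluster centers, assumed to be given.
    """
    h, w, d = feature_map.shape
    f = feature_map.reshape(-1, d)
    f = f / (np.linalg.norm(f, axis=1, keepdims=True) + 1e-8)
    c = cluster_centers / (np.linalg.norm(cluster_centers, axis=1, keepdims=True) + 1e-8)
    ids = np.argmax(f @ c.T, axis=1)   # nearest center by cosine similarity
    return ids.reshape(h, w)

# Toy example: two cluster centers, two 1x2 "views" of the same two objects.
centers = np.array([[1.0, 0.0], [0.0, 1.0]])
view_a = np.array([[[1.0, 0.0], [0.0, 1.0]]])
view_b = np.array([[[0.1, 0.9], [0.8, 0.2]]])
ids_a = assign_cluster_ids(view_a, centers)
ids_b = assign_cluster_ids(view_b, centers)
```

Because both views are labeled against the same `centers`, the object matching the first center gets ID 0 in each view even though it appears at a different pixel position.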

BibTeX

@article{choi2024click,
        title={Click-Gaussian: Interactive Segmentation to Any 3D Gaussians},
        author={Choi, Seokhun and Song, Hyeonseop and Kim, Jaechul and Kim, Taehyeong and Do, Hoseok},
        journal={arXiv preprint arXiv:2407.11793},
        year={2024}
}