Abstract

We present ObjectMatch, a semantic and object-centric camera pose estimator for RGB-D SLAM pipelines. Modern camera pose estimators rely on direct correspondences of overlapping regions between frames; however, they cannot align camera frames with little or no overlap. In this work, we propose to leverage indirect correspondences obtained via semantic object identification. For instance, when an object is seen from the front in one frame and from the back in another frame, we can provide additional pose constraints through canonical object correspondences. We first propose a neural network to predict such correspondences on a per-pixel level, which we then combine in our energy formulation with state-of-the-art keypoint matching solved with a joint Gauss-Newton optimization. In a pairwise setting, our method improves registration recall of state-of-the-art feature matching, including from 24% to 45% in pairs with 10% or less inter-frame overlap. In registering RGB-D sequences, our method outperforms cutting-edge SLAM baselines in challenging, low-frame-rate scenarios, achieving more than 35% reduction in trajectory error in multiple scenes.

Method Overview

Overview of our approach to incorporate object correspondence grounding in global pose estimation. From a set of input RGB-D frames, ObjectMatch predicts object instances for each frame with dense normalized object correspondences. The predicted object instances are used to identify objects across frames, forming indirect object correspondences. We combine object correspondences with an RGB-D version of SuperGlue [SuperGlue, BundleFusion] keypoint matches in a joint energy optimization that yields both camera and object poses in a global registration.
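To illustrate why indirect object correspondences can constrain camera poses even without frame overlap, here is a minimal sketch in Python with NumPy. It is not the joint Gauss-Newton energy optimization described above: it swaps in a closed-form Kabsch alignment per frame and composes the two object-to-camera transforms. The function names (`kabsch`, `relative_pose_via_object`), the synthetic data, and the assumption that canonical object coordinates are in metric scale (so no scale factor is estimated) are all hypothetical choices made for this illustration.

```python
import numpy as np


def kabsch(src, dst):
    """Least-squares rigid transform (R, t) such that dst ≈ R @ src + t."""
    src_c, dst_c = src.mean(axis=0), dst.mean(axis=0)
    H = (src - src_c).T @ (dst - dst_c)          # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))       # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = dst_c - R @ src_c
    return R, t


def to_homogeneous(R, t):
    """Pack a rotation and translation into a 4x4 transform."""
    T = np.eye(4)
    T[:3, :3], T[:3, 3] = R, t
    return T


def relative_pose_via_object(canon_i, cam_i, canon_j, cam_j):
    """Camera-j-to-camera-i transform from per-frame canonical correspondences.

    canon_*: (N, 3) canonical object coordinates predicted for observed pixels
             (assumed metric here; hypothetical simplification).
    cam_*:   (N, 3) the same pixels back-projected into that camera's frame.
    """
    T_i = to_homogeneous(*kabsch(canon_i, cam_i))  # object -> camera i
    T_j = to_homogeneous(*kabsch(canon_j, cam_j))  # object -> camera j
    return T_i @ np.linalg.inv(T_j)                # camera j -> camera i


def random_rotation(rng):
    """Random proper rotation from a QR decomposition."""
    Q, _ = np.linalg.qr(rng.normal(size=(3, 3)))
    if np.linalg.det(Q) < 0:
        Q[:, 0] = -Q[:, 0]
    return Q


# Synthetic check: frame i sees only the "front" of the object and frame j
# only the "back", so the two frames share no surface points at all.
rng = np.random.default_rng(0)
obj = rng.uniform(-0.5, 0.5, size=(200, 3))
front, back = obj[obj[:, 0] > 0], obj[obj[:, 0] <= 0]

R_i, t_i = random_rotation(rng), rng.normal(size=3)
R_j, t_j = random_rotation(rng), rng.normal(size=3)
cam_i = front @ R_i.T + t_i
cam_j = back @ R_j.T + t_j

T_est = relative_pose_via_object(front, cam_i, back, cam_j)
T_gt = to_homogeneous(R_i, t_i) @ np.linalg.inv(to_homogeneous(R_j, t_j))
print("max abs error:", np.abs(T_est - T_gt).max())
```

With noise-free synthetic data the composed transform matches the ground truth up to floating-point error. With real, noisy predictions a per-frame closed-form fit like this would be much less reliable, which is one motivation for combining object and keypoint terms in a single optimization.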

Pairwise Registration Results

The videos below show frame pairs with very low overlap, where feature matching fails while our method can still estimate camera poses using object correspondences. On the left is an overlay of the top-1 object match and its canonical correspondences. On the right is the RGB-D registration with the camera poses and the pose of the top-1 matching object.

A chair from the back and from the side

A table (desk) from the side and from the front

A chair from the front-left and from the top

A cabinet from the top and from the front-left

SLAM Sequence Registration Results

We show SLAM reconstructions of low-frame-rate TUM-RGBD scenes @ 1Hz and ScanNet scenes @ 1.5Hz. In the videos below, we first show an animated reconstruction of each sequence. We then show two object-based loop closures that our method detects and classical feature matching misses.

TUM-RGBD Fr3 Long Office @ 1Hz

ScanNet 0169_00 @ 1.5Hz

ScanNet 0207_00 @ 1.5Hz

BibTeX

@inproceedings{gumeli2023objectmatch,
      title={ObjectMatch: Robust Registration using Canonical Object Correspondences},
      author={G{\"u}meli, Can and Dai, Angela and Nie{\ss}ner, Matthias},
      booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
      pages={13082--13091},
      year={2023}
  }

ObjectMatch: Robust Registration using Canonical Object Correspondences

ObjectMatch estimates camera and object poses from two or more RGB-D images by leveraging object correspondences.