TLDR

We propose building Multiview Scene Graphs (MSG) from unposed images, representing a scene topologically with interconnected place and object nodes.


Teaser Image

Multiview Scene Graph (MSG). The MSG task takes unposed RGB images as input and outputs a place+object graph. The graph contains place-place edges and place-object edges. Connected place nodes represent images taken at the same place. When the same object is recognized in different views, its detections are associated and merged into a single object node, which is connected to every place node where the object appears.
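The graph structure described in the caption can be sketched as a small data structure. This is only an illustration: the class name, method names, and string ids below are hypothetical and are not taken from the MSG code release.

```python
from dataclasses import dataclass, field


@dataclass
class MultiviewSceneGraph:
    """Toy place+object graph. All names here are illustrative assumptions."""

    # Undirected place-place edges, stored as frozensets of two image ids.
    place_edges: set = field(default_factory=set)
    # Place-object edges, stored as (image_id, object_id) pairs.
    object_edges: set = field(default_factory=set)

    def connect_places(self, image_a, image_b):
        # Record that two images were judged to be taken at the same place.
        self.place_edges.add(frozenset((image_a, image_b)))

    def observe(self, image_id, object_id):
        # Reusing an object_id across images keeps the object as one node.
        self.object_edges.add((image_id, object_id))

    def object_nodes(self):
        return {obj for _, obj in self.object_edges}

    def places_seeing(self, object_id):
        return {img for img, obj in self.object_edges if obj == object_id}


graph = MultiviewSceneGraph()
graph.connect_places("img0", "img1")
graph.observe("img0", "chair")   # the chair seen from the first view
graph.observe("img1", "chair")   # the same chair seen from another view
graph.observe("img1", "lamp")
```

Here the chair is a single object node linked to two place nodes, which mirrors the merging of the same object across views described above.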


Abstract

A proper scene representation is central to the pursuit of spatial intelligence where agents can robustly reconstruct and efficiently understand 3D scenes. A scene representation is either metric, such as landmark maps in 3D reconstruction, 3D bounding boxes in object detection, or voxel grids in occupancy prediction, or topological, such as pose graphs with loop closures in SLAM or visibility graphs in SfM. In this work, we propose to build Multiview Scene Graphs (MSG) from unposed images, representing a scene topologically with interconnected place and object nodes. The task of building MSG is challenging for existing representation learning methods because it requires jointly addressing visual place recognition, object detection, and object association from images with limited fields of view and potentially large viewpoint changes. To evaluate any method tackling this task, we develop an MSG dataset and annotation based on a public 3D dataset. We also propose an evaluation metric based on the intersection-over-union score of MSG edges. Moreover, we develop a novel baseline method built on mainstream pretrained vision models, combining visual place recognition and object association into one Transformer decoder architecture. Experiments demonstrate that our method outperforms existing relevant baselines.
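To make the idea of an intersection-over-union score on graph edges concrete, here is a minimal sketch. It assumes predicted and ground-truth graphs already share node ids, and it treats two empty edge sets as a perfect match; both are simplifying assumptions, and the metric defined in the paper may differ in how predicted objects are matched to annotated ones. The function names are hypothetical.

```python
def edge_iou(predicted, ground_truth):
    """Intersection over union of two edge sets (assumes shared node ids)."""
    pred, gt = set(predicted), set(ground_truth)
    union = pred | gt
    if not union:
        # Convention chosen for this sketch: two empty edge sets agree fully.
        return 1.0
    return len(pred & gt) / len(union)


def undirected(edges):
    """Normalize (a, b) pairs so that (a, b) and (b, a) count as one edge."""
    return {frozenset(edge) for edge in edges}


# Place-place edges are undirected.
gt_pp = undirected([("img0", "img1"), ("img1", "img2")])
pred_pp = undirected([("img1", "img0"), ("img0", "img2")])
pp_iou = edge_iou(pred_pp, gt_pp)   # 1 shared edge out of 3 distinct edges

# Place-object edges are (image_id, object_id) pairs.
gt_po = {("img0", "chair"), ("img1", "chair")}
pred_po = {("img0", "chair")}
po_iou = edge_iou(pred_po, gt_po)   # 1 shared edge out of 2 distinct edges
```

In this toy case the place-place score is 1/3 and the place-object score is 1/2.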


Method

Method Image

The AoMSG model. Place and object queries are obtained by cropping the image feature map using the corresponding bounding boxes. The queries are then fed into the Transformer decoder to obtain the final place and object embeddings. Bounding boxes are shown in different colors for clarity. The parameters of the Transformer decoder and the linear projector heads are trained with supervised contrastive learning. The image encoder and object detector are pretrained and frozen.
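The query construction step can be illustrated with a small, dependency-free sketch. Several details are assumptions made for the example and are not confirmed by the text above: boxes are given in feature-map cell units as half-open `(x0, y0, x1, y1)` ranges, the cropped region is reduced to one vector by average pooling, and the place query uses a box covering the whole map. The function name `crop_query` is hypothetical.

```python
def crop_query(feature_map, box):
    """Average the feature vectors inside a box on an H x W x C feature map.

    box = (x0, y0, x1, y1) in feature-map cells, half-open on the right/bottom.
    Average pooling is an assumption of this sketch, not the stated method.
    """
    x0, y0, x1, y1 = box
    cells = [feature_map[y][x] for y in range(y0, y1) for x in range(x0, x1)]
    if not cells:
        raise ValueError("box covers no feature-map cells")
    channels = len(cells[0])
    return [sum(cell[k] for cell in cells) / len(cells) for k in range(channels)]


# A 2 x 3 feature map with 2 channels per cell (toy values).
feature_map = [
    [[1.0, 0.0], [3.0, 0.0], [5.0, 2.0]],
    [[1.0, 4.0], [3.0, 4.0], [5.0, 2.0]],
]

object_query = crop_query(feature_map, (0, 0, 2, 2))  # left 2 x 2 block
place_query = crop_query(feature_map, (0, 0, 3, 2))   # whole map (assumed)
```

In a real pipeline this cropping would more likely be done with a region-pooling operator on tensors; the nested-list version only shows which cells contribute to each query.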


BibTeX



Coming Soon

Acknowledgements

The work was supported in part through NSF grants 2238968 and 2322242, and the NYU IT High Performance Computing resources, services, and staff expertise.