FusionSense

Bridging Common Sense, Vision, and Touch for Robust Sparse-View Reconstruction

1 New York University, 2 Carnegie Mellon University, 3 University of Illinois, Urbana-Champaign
* Equal Contribution

Snapshot

TL;DR: A robot reconstructs visually and geometrically accurate surroundings from sparse visual and tactile data.


[Snapshot panels: Novel View Synthesis (RGB), Object Reconstruction (RGB), Novel View Synthesis (Depth), Novel View Synthesis (Normal)]


How It Works

1. Robust Global Shape Representation: A visual hull and depth maps estimated by foundation models initialize the 3D Gaussians. RGB-D images and foundation-model-estimated normals supervise the subsequent training.

2. Active Touch Selection: Geometric properties and common-sense reasoning from a VLM guide the robot to touch the most informative regions.

3. Local Geometric Optimization: Tactile readings are added as new anchor Gaussian points to refine the original 3D Gaussians (see the sketch after this list).
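
To make step 3 concrete, here is a minimal sketch of how tactile contact points might be folded into the Gaussian representation. It assumes the tactile sensor reports a contact depth map over its gel surface and that the sensor pose in the world frame is known from the robot's kinematics; the function names (`tactile_patch_to_points`, `add_tactile_anchors`), the pixel-size parameter, and the fixed anchor scale are illustrative, not the paper's actual implementation.

```python
import numpy as np

def tactile_patch_to_points(contact_depth, pixel_size, T_world_sensor):
    """Back-project a tactile contact depth map (H, W) into world-frame 3D points.
    pixel_size is meters per pixel on the gel surface; T_world_sensor is a 4x4
    sensor-to-world transform. (Hypothetical helper; sensor details will differ.)"""
    h, w = contact_depth.shape
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    # Points on the gel plane, displaced along the sensor z-axis by the indentation.
    pts_sensor = np.stack(
        [(xs - w / 2) * pixel_size,
         (ys - h / 2) * pixel_size,
         contact_depth], axis=-1).reshape(-1, 3)
    pts_h = np.concatenate([pts_sensor, np.ones((pts_sensor.shape[0], 1))], axis=1)
    return (pts_h @ T_world_sensor.T)[:, :3]

def add_tactile_anchors(gaussian_means, gaussian_scales, touch_points,
                        anchor_scale=1e-3):
    """Append tactile contact points as new 'anchor' Gaussians with a small fixed
    scale so they pin down local geometry during optimization."""
    new_scales = np.full((touch_points.shape[0], 3), anchor_scale)
    return (np.concatenate([gaussian_means, touch_points], axis=0),
            np.concatenate([gaussian_scales, new_scales], axis=0))
```

In practice the new anchors would also need opacities, colors, and rotations, and could be held fixed (or only weakly regularized) so that the optimization respects the measured contacts.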

Video Comparison




Method

Overview Image

1.1 Hybrid Structure Priors:
(1) Visual Hull: After taking sparse-view images, we use GPT-4o to classify the central object. The classification label is fed to Grounded SAM 2 to acquire masks of the object. The masks, along with the camera poses recorded by the robot, are used to estimate the visual hull (see the carving sketch below). This method is robust against traditionally challenging surfaces and materials.
(2) Metric Depth Estimator: For every sparse-view RGB image, we use Metric3D v2 to estimate depth. We do not use the depth captured by our RealSense camera here, even though it is metrically more accurate, because we found the estimated depth smoother and less noisy.
(3) Gaussians Initialization: The visual hull and estimated depth are combined to initialize the 3D Gaussian primitives.
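
A minimal voxel-carving sketch of the visual-hull step is shown below. It assumes a pinhole intrinsic matrix, world-to-camera extrinsics from the robot, and binary object masks (e.g. from Grounded SAM 2); the grid resolution and workspace bounds are illustrative parameters, and the actual implementation may differ.

```python
import numpy as np

def carve_visual_hull(masks, K, w2c_poses, bounds, res=128):
    """Voxel-carve a visual hull: keep a voxel only if it projects inside the
    object mask in every view.
    masks     : list of (H, W) boolean object masks
    K         : (3, 3) camera intrinsics
    w2c_poses : list of (4, 4) world-to-camera extrinsics
    bounds    : ((xmin, ymin, zmin), (xmax, ymax, zmax)) workspace box
    Returns (N, 3) voxel centers inside the hull."""
    lo, hi = np.asarray(bounds[0]), np.asarray(bounds[1])
    axes = [np.linspace(lo[i], hi[i], res) for i in range(3)]
    grid = np.stack(np.meshgrid(*axes, indexing="ij"), -1).reshape(-1, 3)
    keep = np.ones(len(grid), dtype=bool)
    for mask, w2c in zip(masks, w2c_poses):
        h, w = mask.shape
        cam = (np.c_[grid, np.ones(len(grid))] @ w2c.T)[:, :3]  # world -> camera
        in_front = cam[:, 2] > 1e-6
        uvw = cam @ K.T
        uv = np.floor(uvw[:, :2] / np.clip(uvw[:, 2:3], 1e-6, None)).astype(int)
        u, v = uv[:, 0], uv[:, 1]
        in_img = (u >= 0) & (u < w) & (v >= 0) & (v < h) & in_front
        inside = np.zeros(len(grid), dtype=bool)
        inside[in_img] = mask[v[in_img], u[in_img]]
        keep &= inside
    return grid[keep]
```

The carved voxel centers, optionally merged with points back-projected from the estimated depth maps, can then serve as the initial means of the 3D Gaussians.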

1.2 Hull Pruning:
(4) Visual Hull Pruning: During training, we design a "shell" around the visual hull to remove the floaters around it (see the pruning sketch after this list). Empirically, we notice that these floaters, while small at first, often snowball into a large negative impact on overall reconstruction quality.
(5) Supervision: We then use RGB-D images captured by our RealSense camera and normals estimated by DSINE to supervise the subsequent training.
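
Below is a minimal sketch of how shell-based floater pruning could look, assuming the hull is available as a point set and using a simple nearest-neighbor distance threshold; the `shell_margin` value and function name are illustrative rather than the paper's exact procedure.

```python
import numpy as np
from scipy.spatial import cKDTree

def prune_floaters(gaussian_means, hull_points, shell_margin=0.01):
    """Return a keep-mask that drops Gaussians whose centers lie farther than
    `shell_margin` (meters) from the visual hull, i.e. outside a dilated shell."""
    tree = cKDTree(hull_points)
    dist, _ = tree.query(gaussian_means, k=1)
    return dist <= shell_margin

# Applied periodically during training, e.g.:
# keep = prune_floaters(means, hull_points, shell_margin=0.01)
# means, scales, colors = means[keep], scales[keep], colors[keep]
```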


More Visualizations



[Gallery: for each object, panels show Novel View Synthesis (RGB), Ground Truth Comparison (RGB), Novel View Synthesis (Depth), and Novel View Synthesis (Normal).]

BibTeX


@misc{fang2024fusionsense,
  title={FusionSense: Bridging Common Sense, Vision, and Touch for Robust Sparse-View Reconstruction}, 
  author={Irving Fang and Kairui Shi and Xujin He and Siqi Tan and Yifan Wang and Hanwen Zhao and Hung-Jui Huang and Wenzhen Yuan and Chen Feng and Jing Zhang},
  year={2024},
  eprint={2410.08282},
  archivePrefix={arXiv},
  primaryClass={cs.RO},
  url={https://arxiv.org/abs/2410.08282}, 
}

Acknowledgements

Jing Zhang and Chen Feng are the corresponding authors. The work was supported in part through NSF grants 2024882, 2152565, and 2238968.