1. Robust Global Shape Representation: A visual hull and depth maps estimated by foundation models initialize the 3D Gaussians. RGB-D images and foundation-model-estimated normals supervise the subsequent training.
2. Active Touch Selection: Geometric properties and common sense from a VLM guide the robot to touch the most informative regions.
3. Local Geometric Optimization: Tactile readings are added as new anchor Gaussian points to refine the original 3D Gaussians (see the sketch right after this list).
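As a rough illustration of step 3, the sketch below appends tactile contact points to an existing set of Gaussians as small, fixed "anchor" primitives that later pruning should leave untouched. It is a minimal sketch under assumed conventions, not the released implementation: the function name, tensor layout, and the `anchor_scale` default are placeholders.

```python
import torch

def add_tactile_anchors(means, scales, opacities, anchor_mask,
                        tactile_points, anchor_scale=1e-3):
    """Append tactile contact points as fixed 'anchor' Gaussians.

    means          (N, 3): existing Gaussian centers
    scales         (N, 3): existing Gaussian scales
    opacities      (N, 1): existing opacities
    anchor_mask    (N,)  : True for Gaussians that pruning must keep
    tactile_points (M, 3): contact points from the tactile sensor, world frame
    """
    m = tactile_points.shape[0]
    new_scales = torch.full((m, 3), anchor_scale)   # tiny, surface-hugging Gaussians
    new_opacities = torch.ones(m, 1)                # start fully opaque
    new_mask = torch.ones(m, dtype=torch.bool)      # mark as anchors

    means = torch.cat([means, tactile_points], dim=0)
    scales = torch.cat([scales, new_scales], dim=0)
    opacities = torch.cat([opacities, new_opacities], dim=0)
    anchor_mask = torch.cat([anchor_mask, new_mask], dim=0)
    return means, scales, opacities, anchor_mask
```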
1.1 Hybrid Structure Priors:
(1) Visual Hull: After capturing sparse-view images, we use GPT-4o to classify the central object. The classification label is fed to Grounded SAM 2 to acquire masks of the object. The masks, together with the camera poses recorded by the robot, are used to estimate the visual hull. This method is robust against traditionally challenging surfaces and materials.
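For a concrete picture of hull estimation, here is a minimal voxel-carving sketch: a voxel survives only if it projects inside the object mask in every view. The interface (`carve_visual_hull`, pinhole intrinsics `K`, world-to-camera poses, workspace bounds) is assumed for illustration, and visibility handling is simplified away.

```python
import numpy as np

def carve_visual_hull(masks, K, w2c_poses, bounds, res=128):
    """Voxel-carve a visual hull from binary object masks.

    masks:     list of (H, W) boolean arrays, e.g. from Grounded SAM 2
    K:         (3, 3) pinhole camera intrinsics
    w2c_poses: list of (4, 4) world-to-camera extrinsics recorded by the robot
    bounds:    ((xmin, ymin, zmin), (xmax, ymax, zmax)) workspace box
    """
    lo, hi = np.asarray(bounds[0]), np.asarray(bounds[1])
    axes = [np.linspace(lo[i], hi[i], res) for i in range(3)]
    grid = np.stack(np.meshgrid(*axes, indexing="ij"), axis=-1).reshape(-1, 3)
    inside = np.ones(len(grid), dtype=bool)

    for mask, w2c in zip(masks, w2c_poses):
        h, w = mask.shape
        # world -> camera frame
        pts_cam = (w2c[:3, :3] @ grid.T + w2c[:3, 3:4]).T
        z = pts_cam[:, 2]
        # camera -> pixel coordinates
        uv = (K @ pts_cam.T).T
        u = uv[:, 0] / np.clip(uv[:, 2], 1e-6, None)
        v = uv[:, 1] / np.clip(uv[:, 2], 1e-6, None)
        valid = (z > 0) & (u >= 0) & (u < w) & (v >= 0) & (v < h)
        in_mask = np.zeros(len(grid), dtype=bool)
        in_mask[valid] = mask[v[valid].astype(int), u[valid].astype(int)]
        # keep a voxel only if every view projects it onto the object mask;
        # voxels behind the camera or outside the image are carved away here
        inside &= in_mask

    return grid[inside]
```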
(2) Metric Depth Estimator: For every sparse-view RGB image, we use Metric3D v2 to estimate depth. We do not use the depth captured by our RealSense camera at this stage, even though it is metrically more accurate, because we found the estimated depth smoother and less noisy.
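A back-projection step like the one below is all that is needed to turn each estimated depth map into a world-frame point cloud for initialization. The helper `backproject_depth` and its argument names are hypothetical; how the depth map itself is produced by Metric3D v2 is not shown here.

```python
import numpy as np

def backproject_depth(depth, K, c2w):
    """Lift an (H, W) metric depth map into a world-frame point cloud.

    depth: (H, W) depth in meters, e.g. predicted by a metric depth estimator
    K:     (3, 3) camera intrinsics
    c2w:   (4, 4) camera-to-world pose of this view
    """
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    fx, fy, cx, cy = K[0, 0], K[1, 1], K[0, 2], K[1, 2]
    # pixel -> camera frame
    x = (u - cx) / fx * depth
    y = (v - cy) / fy * depth
    pts_cam = np.stack([x, y, depth], axis=-1).reshape(-1, 3)
    # camera -> world frame
    pts_world = pts_cam @ c2w[:3, :3].T + c2w[:3, 3]
    return pts_world[depth.reshape(-1) > 0]   # drop invalid (zero-depth) pixels
```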
(3) Gaussian Initialization: The visual hull and the estimated depth are combined to initialize the 3D Gaussian primitives.
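One plausible way to combine the two priors is sketched below: hull samples and back-projected depth points are merged, and each point receives an isotropic scale from its nearest-neighbor distances, a common heuristic for seeding 3DGS point clouds. The helper name and the gray default color for hull points are assumptions, not the paper's exact procedure.

```python
import numpy as np
from scipy.spatial import cKDTree

def init_gaussians(hull_points, depth_points, colors=None):
    """Initialize Gaussian means/scales from hull voxels + back-projected depth.

    hull_points:  (Nh, 3) points sampled from the visual hull
    depth_points: (Nd, 3) points back-projected from estimated depth
    colors:       optional (Nd, 3) RGB for the depth points; hull points get gray
    """
    means = np.concatenate([hull_points, depth_points], axis=0)

    # isotropic scale from the mean distance to the 3 nearest neighbors
    dists, _ = cKDTree(means).query(means, k=4)
    scales = np.repeat(dists[:, 1:].mean(axis=1, keepdims=True), 3, axis=1)

    if colors is None:
        colors = np.full((len(depth_points), 3), 0.5)
    rgb = np.concatenate([np.full((len(hull_points), 3), 0.5), colors], axis=0)
    return means, scales, rgb
```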
1.2 Hull Pruning:
(4) Visual Hull Pruning: During training, we construct a "shell" around the visual hull and remove floaters outside it. Empirically, we notice that these floaters, while small at first, often snowball into a large negative impact on the overall reconstruction quality.
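In spirit, the pruning test can be as simple as the sketch below: measure each Gaussian center's distance to the hull and drop anything farther than a small shell margin. The `shell` value and the KD-tree formulation are illustrative, not the paper's exact criterion, and tactile anchor points from the local optimization stage would be kept regardless of this mask.

```python
import torch
from scipy.spatial import cKDTree

def hull_prune_mask(means, hull_points, shell=0.01):
    """Return a keep-mask: True for Gaussians within `shell` meters of the visual hull.

    means       (N, 3): current Gaussian centers (torch tensor)
    hull_points (M, 3): points sampled on/inside the visual hull (numpy array)
    """
    dists, _ = cKDTree(hull_points).query(means.detach().cpu().numpy(), k=1)
    # anything farther than the shell thickness is treated as a floater and pruned
    return torch.from_numpy(dists <= shell)
```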
(5) Supervision: We then use RGB-D images captured by our RealSense camera and normals estimated by DSINE to supervise the subsequent training.
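To make the supervision concrete, here is a hedged sketch of a per-view loss combining photometric, RealSense-depth, and DSINE-normal terms. The weights, dictionary keys, and loss forms are placeholders, not the values or formulation used in the paper.

```python
import torch
import torch.nn.functional as F

def supervision_loss(render, gt, w_rgb=1.0, w_depth=0.5, w_normal=0.5):
    """Combine photometric, depth, and normal terms for one training view.

    render / gt are dicts with:
      'rgb'    (3, H, W): rendered vs. captured RGB
      'depth'  (1, H, W): rendered vs. RealSense depth (valid where gt > 0)
      'normal' (3, H, W): rendered vs. DSINE-estimated unit normals
    """
    loss = w_rgb * F.l1_loss(render['rgb'], gt['rgb'])

    valid = gt['depth'] > 0                                   # ignore holes in sensor depth
    loss += w_depth * F.l1_loss(render['depth'][valid], gt['depth'][valid])

    # normal term: 1 - cosine similarity, averaged over pixels
    cos = (render['normal'] * gt['normal']).sum(dim=0).clamp(-1, 1)
    loss += w_normal * (1.0 - cos).mean()
    return loss
```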
@inproceedings{fang2025fusionsense,
  title     = {FusionSense: Bridging Common Sense, Vision, and Touch for Robust Sparse-View Reconstruction},
  author    = {Irving Fang and Kairui Shi and Xujin He and Siqi Tan and Yifan Wang and Hanwen Zhao and Hung-Jui Huang and Wenzhen Yuan and Chen Feng and Jing Zhang},
  booktitle = {2025 IEEE International Conference on Robotics and Automation (ICRA)},
  year      = {2025}
}