Visual-Auditory Proprioception of Soft Finger Shape and Contact

DeepCoFi

Qinsong Guo* Ke Yang* Hanwen Zhao Haohan Fang Haoxuan Wang Chen Feng
New York University, Brooklyn, NY, USA
DeepCoFi teaser image from paper

Overview of our multimodal proprioception framework. A soft robotic finger is instrumented with an internal camera, speaker, and microphone. The camera captures global bending, while spectrograms from acoustic reflections provide complementary contact cues, especially in occluded regions. The modalities are fused and processed through two sequential folding modules: Fold 1 reconstructs the global pose, and Fold 2 refines the surface with localized contact deformations.
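
The page does not detail the acoustic preprocessing, so the following is a minimal sketch of one plausible pipeline: a microphone recording of the internal speaker's emitted signal is converted into a normalized log-magnitude spectrogram that can feed a 2D image encoder. The sample rate, window parameters, and the synthetic recording are illustrative assumptions, not the paper's exact settings.

# Minimal sketch: turning a microphone recording of the internal speaker's
# signal into a log-magnitude spectrogram image. Sample rate and window
# parameters are assumptions for illustration, not the paper's values.
import numpy as np
from scipy import signal

FS = 48_000  # assumed microphone sample rate (Hz)

def to_log_spectrogram(audio: np.ndarray, fs: int = FS) -> np.ndarray:
    """Compute a normalized log-magnitude spectrogram as a 2D network input."""
    freqs, times, sxx = signal.spectrogram(
        audio, fs=fs, nperseg=1024, noverlap=768, window="hann"
    )
    log_sxx = 10.0 * np.log10(sxx + 1e-10)  # dB scale, avoid log(0)
    # Normalize to [0, 1] so the acoustic branch sees a consistent range.
    log_sxx -= log_sxx.min()
    log_sxx /= log_sxx.max() + 1e-8
    return log_sxx  # shape: (n_freq_bins, n_time_frames)

# Example: a synthetic 100 ms recording standing in for a real reflection.
dummy_audio = np.random.randn(int(0.1 * FS)).astype(np.float32)
spec = to_log_spectrogram(dummy_audio)
print(spec.shape)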

Abstract

Soft robotic fingers require precise proprioception of both global deformation and local contact to enable safe and dexterous manipulation. Vision-based methods can reconstruct overall shape but struggle under severe occlusion, while audio-only approaches provide complementary cues but lack spatial detail. We present DeepCoFi, a lightweight multimodal proprioception framework that fuses internal camera images with acoustic spectrograms to jointly recover finger geometry and contact. The framework leverages the complementary strengths of vision and acoustics and employs a FoldingNet-based two-stage decoder that first reconstructs global bending and then refines local contact deformations. To support this integration, we introduce a soft finger design that incorporates an exoskeleton-mounted camera and microphone in a single molding step, preserving compliance while enabling multimodal sensing. Experiments on a comprehensive dataset and real-world grasping tasks show that DeepCoFi achieves robust proprioception under occlusion and generalizes effectively to unseen deformations and contact conditions.

Method Overview

DeepCoFi pipeline figure

DeepCoFi model architecture. The framework encodes multimodal proprioceptive inputs (internal images and spectrograms) through ResNet-18 backbones to produce a fused latent codeword. The decoder applies two sequential folding modules: Fold 1 reconstructs the global bending shape, and Fold 2 refines local contact deformations in the predicted point cloud.
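
To make the architecture concrete, here is a minimal PyTorch sketch of the two-branch encoder and two-stage folding decoder described above. The layer sizes, grid resolution, and the fusion scheme (concatenation followed by a linear layer) are assumptions for illustration; the paper's exact dimensions may differ.

# Sketch of a two-branch ResNet-18 encoder with a FoldingNet-style
# two-stage decoder. Dimensions and fusion choices are illustrative.
import torch
import torch.nn as nn
from torchvision.models import resnet18

def folding_mlp(in_dim: int) -> nn.Sequential:
    """Point-wise MLP mapping (codeword + coordinates) to 3D points."""
    return nn.Sequential(
        nn.Conv1d(in_dim, 256, 1), nn.ReLU(),
        nn.Conv1d(256, 128, 1), nn.ReLU(),
        nn.Conv1d(128, 3, 1),
    )

class DeepCoFiSketch(nn.Module):
    def __init__(self, code_dim: int = 512, grid_size: int = 45):
        super().__init__()
        # One ResNet-18 per modality; spectrograms are assumed tiled to
        # three channels so the stock ResNet stem applies unchanged.
        self.img_enc = resnet18(num_classes=code_dim)
        self.aud_enc = resnet18(num_classes=code_dim)
        self.fuse = nn.Linear(2 * code_dim, code_dim)
        # Fold 1 deforms a fixed 2D grid; Fold 2 refines Fold 1's output.
        self.fold1 = folding_mlp(code_dim + 2)
        self.fold2 = folding_mlp(code_dim + 3)
        lin = torch.linspace(-1.0, 1.0, grid_size)
        grid = torch.stack(torch.meshgrid(lin, lin, indexing="ij"), dim=0)
        self.register_buffer("grid", grid.reshape(2, -1))  # (2, N)

    def forward(self, image: torch.Tensor, spec: torch.Tensor):
        b = image.size(0)
        code = self.fuse(torch.cat([self.img_enc(image), self.aud_enc(spec)], dim=1))
        code = code.unsqueeze(2).expand(-1, -1, self.grid.size(1))  # (B, C, N)
        grid = self.grid.unsqueeze(0).expand(b, -1, -1)             # (B, 2, N)
        coarse = self.fold1(torch.cat([code, grid], dim=1))         # global bending
        fine = self.fold2(torch.cat([code, coarse], dim=1))         # contact detail
        return coarse.transpose(1, 2), fine.transpose(1, 2)         # (B, N, 3)

A forward pass with a batch of images and three-channel spectrograms of shape (B, 3, 224, 224) yields a coarse and a refined point cloud of N = grid_size² points each, mirroring the Fold 1 / Fold 2 split in the figure.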

Experimental Setup

Contact data collection setup

Contact data collection setup. Left: experimental configuration with reference pads and contact pads mounted on the finger for repeatable geometry and controlled indentation. Right: contact pad schematic with a 3 × 8 slot grid (24 sites) used to parameterize contact locations.
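
As a small illustration of how the 24 contact sites can be indexed, the sketch below maps a slot index to its grid position and a normalized pad coordinate. Only the 3 × 8 layout comes from the setup description; the row/column ordering and the normalized parameterization are assumptions.

# Illustrative indexing of the 3 x 8 contact-pad slot grid (24 sites).
# Row/column ordering and the normalized (u, v) coordinates are assumptions.
ROWS, COLS = 3, 8

def slot_to_rc(slot: int) -> tuple[int, int]:
    """Map a slot index in [0, 24) to its (row, col) grid position."""
    assert 0 <= slot < ROWS * COLS
    return divmod(slot, COLS)

def slot_to_uv(slot: int) -> tuple[float, float]:
    """Normalized (u, v) in [0, 1]^2 at the slot center, for labeling contacts."""
    r, c = slot_to_rc(slot)
    return ((r + 0.5) / ROWS, (c + 0.5) / COLS)

print(slot_to_rc(10), slot_to_uv(10))  # (1, 2) (0.5, 0.3125)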

Grasping experiment setup and test objects

Grasping experiment. Left: two actuated soft robotic fingers with internal camera, speaker, and microphone perform convergent grasping motions. Right: test objects include letters A-F and geometric solids for classification.

Results

Confusion matrix for object classification

Confusion matrix for object classification with 85% overall accuracy on the test set. Cube and Octahedron achieve near-perfect recall, while the geometrically complex Icosahedron is most challenging.

Large-Scale Qualitative Comparison

Qualitative results across representative validation and test conditions

Qualitative results for representative validation and test conditions. Ground truth is shown in green and predictions in red. The two-stage multimodal model consistently localizes contacts, including occluded sites under large bending, while single-modality or baseline variants miss or misplace distant contacts.
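
Reconstructions of this kind (predicted vs. ground-truth point clouds) are commonly scored with a symmetric Chamfer distance; the page does not state the paper's evaluation metric, so the sketch below is a generic scoring utility rather than the authors' protocol.

# Symmetric Chamfer distance between predicted and ground-truth point
# clouds; a common metric for folding-based reconstruction, shown here
# as an assumption since the evaluation metric is not stated above.
import torch

def chamfer_distance(pred: torch.Tensor, gt: torch.Tensor) -> torch.Tensor:
    """pred: (B, N, 3), gt: (B, M, 3); returns mean symmetric Chamfer distance."""
    d = torch.cdist(pred, gt)                 # (B, N, M) pairwise distances
    return d.min(dim=2).values.mean() + d.min(dim=1).values.mean()

pred = torch.rand(1, 2025, 3)  # e.g., a 45 x 45 folded grid
gt = torch.rand(1, 2048, 3)
print(chamfer_distance(pred, gt).item())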

BibTeX

@inproceedings{deepcofi2026,
  title={Visual-Auditory Proprioception of Soft Finger Shape and Contact},
  author={Guo, Qinsong and Yang, Ke and Zhao, Hanwen and Fang, Haohan and Wang, Haoxuan and Feng, Chen},
  booktitle={Proceedings of the IEEE International Conference on Robotics and Automation (ICRA)},
  year={2026}
}