Visual-Auditory Proprioception of Soft Finger Shape and Contact

DeepCoFi

Qinsong Guo* Ke Yang* Hanwen Zhao Haohan Fang Haoxuan Wang Chen Feng
New York University, Brooklyn, NY, USA
DeepCoFi teaser image from paper

Overview of our multimodal proprioception framework. A soft robotic finger is instrumented with an internal camera, speaker, and microphone. The camera captures global bending, while spectrograms from acoustic reflections provide complementary contact cues, especially in occluded regions. The modalities are fused and processed through two sequential folding modules: Fold 1 reconstructs the global pose, and Fold 2 refines the surface with localized contact deformations.
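
The page does not detail the acoustic preprocessing, so the following is a minimal sketch of one plausible pipeline: a microphone recording of the internal speaker's emitted signal is converted into a normalized log-magnitude spectrogram that can feed a 2D image encoder. The sample rate, window parameters, and the synthetic recording are illustrative assumptions, not the paper's exact settings.

# Minimal sketch: turning a microphone recording of the internal speaker's
# signal into a log-magnitude spectrogram image. Sample rate and window
# parameters are assumptions for illustration, not the paper's values.
import numpy as np
from scipy import signal

FS = 48_000  # assumed microphone sample rate (Hz)

def to_log_spectrogram(audio: np.ndarray, fs: int = FS) -> np.ndarray:
    """Compute a normalized log-magnitude spectrogram as a 2D network input."""
    freqs, times, sxx = signal.spectrogram(
        audio, fs=fs, nperseg=1024, noverlap=768, window="hann"
    )
    log_sxx = 10.0 * np.log10(sxx + 1e-10)  # dB scale, avoid log(0)
    # Normalize to [0, 1] so the acoustic branch sees a consistent range.
    log_sxx -= log_sxx.min()
    log_sxx /= log_sxx.max() + 1e-8
    return log_sxx  # shape: (n_freq_bins, n_time_frames)

# Example: a synthetic 100 ms recording standing in for a real reflection.
dummy_audio = np.random.randn(int(0.1 * FS)).astype(np.float32)
spec = to_log_spectrogram(dummy_audio)
print(spec.shape)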

Abstract

Soft robotic fingers require precise proprioception of both global deformation and local contact to enable safe and dexterous manipulation. Vision-based methods can reconstruct overall shape but struggle under severe occlusion, while audio-only approaches provide complementary cues but lack spatial detail. We present DeepCoFi, a lightweight multimodal proprioception framework that fuses internal camera images with acoustic spectrograms to jointly recover finger geometry and contact. The framework leverages the complementary strengths of vision and acoustics and employs a FoldingNet-based two-stage decoder that first reconstructs global bending and then refines local contact deformations. To support this integration, we introduce a soft finger design that incorporates an exoskeleton-mounted camera and microphone in a single molding step, preserving compliance while enabling multimodal sensing. Experiments on a comprehensive dataset and real-world grasping tasks show that DeepCoFi achieves robust proprioception under occlusion and generalizes effectively to unseen deformations and contact conditions.

Method Overview

DeepCoFi pipeline figure

DeepCoFi model architecture. The framework encodes multimodal proprioceptive inputs (internal images and spectrograms) through ResNet-18 backbones to produce a fused latent codeword. The decoder applies two sequential folding modules: Fold 1 reconstructs the global bending shape, and Fold 2 refines local contact deformations in the predicted point cloud.
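
To make the architecture concrete, here is a minimal PyTorch sketch of the two-branch encoder and two-stage folding decoder described above. The layer sizes, grid resolution, and the fusion scheme (concatenation followed by a linear layer) are assumptions for illustration; the paper's exact dimensions may differ.

# Sketch of a two-branch ResNet-18 encoder with a FoldingNet-style
# two-stage decoder. Dimensions and fusion choices are illustrative.
import torch
import torch.nn as nn
from torchvision.models import resnet18

def folding_mlp(in_dim: int) -> nn.Sequential:
    """Point-wise MLP mapping (codeword + coordinates) to 3D points."""
    return nn.Sequential(
        nn.Conv1d(in_dim, 256, 1), nn.ReLU(),
        nn.Conv1d(256, 128, 1), nn.ReLU(),
        nn.Conv1d(128, 3, 1),
    )

class DeepCoFiSketch(nn.Module):
    def __init__(self, code_dim: int = 512, grid_size: int = 45):
        super().__init__()
        # One ResNet-18 per modality; spectrograms are assumed tiled to
        # three channels so the stock ResNet stem applies unchanged.
        self.img_enc = resnet18(num_classes=code_dim)
        self.aud_enc = resnet18(num_classes=code_dim)
        self.fuse = nn.Linear(2 * code_dim, code_dim)
        # Fold 1 deforms a fixed 2D grid; Fold 2 refines Fold 1's output.
        self.fold1 = folding_mlp(code_dim + 2)
        self.fold2 = folding_mlp(code_dim + 3)
        lin = torch.linspace(-1.0, 1.0, grid_size)
        grid = torch.stack(torch.meshgrid(lin, lin, indexing="ij"), dim=0)
        self.register_buffer("grid", grid.reshape(2, -1))  # (2, N)

    def forward(self, image: torch.Tensor, spec: torch.Tensor):
        b = image.size(0)
        code = self.fuse(torch.cat([self.img_enc(image), self.aud_enc(spec)], dim=1))
        code = code.unsqueeze(2).expand(-1, -1, self.grid.size(1))  # (B, C, N)
        grid = self.grid.unsqueeze(0).expand(b, -1, -1)             # (B, 2, N)
        coarse = self.fold1(torch.cat([code, grid], dim=1))         # global bending
        fine = self.fold2(torch.cat([code, coarse], dim=1))         # contact detail
        return coarse.transpose(1, 2), fine.transpose(1, 2)         # (B, N, 3)

A forward pass with a batch of images and three-channel spectrograms of shape (B, 3, 224, 224) yields a coarse and a refined point cloud of N = grid_size² points each, mirroring the Fold 1 / Fold 2 split in the figure.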

Experimental Setup

Contact data collection setup

Contact data collection setup. Left: experimental configuration with reference pads and contact pads mounted on the finger for repeatable geometry and controlled indentation. Right: contact pad schematic with a 3 × 8 slot grid (24 sites) used to parameterize contact locations.
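
As a small illustration of how the 24 contact sites can be indexed, the sketch below maps a slot index to its grid position and a normalized pad coordinate. Only the 3 × 8 layout comes from the setup description; the row/column ordering and the normalized parameterization are assumptions.

# Illustrative indexing of the 3 x 8 contact-pad slot grid (24 sites).
# Row/column ordering and the normalized (u, v) coordinates are assumptions.
ROWS, COLS = 3, 8

def slot_to_rc(slot: int) -> tuple[int, int]:
    """Map a slot index in [0, 24) to its (row, col) grid position."""
    assert 0 <= slot < ROWS * COLS
    return divmod(slot, COLS)

def slot_to_uv(slot: int) -> tuple[float, float]:
    """Normalized (u, v) in [0, 1]^2 at the slot center, for labeling contacts."""
    r, c = slot_to_rc(slot)
    return ((r + 0.5) / ROWS, (c + 0.5) / COLS)

print(slot_to_rc(10), slot_to_uv(10))  # (1, 2) (0.5, 0.3125)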

Grasping experiment setup and test objects

Grasping experiment. Left: two actuated soft robotic fingers with internal camera, speaker, and microphone perform convergent grasping motions. Right: test objects include letters A-F and geometric solids for classification.

Results

Confusion matrix for object classification

Confusion matrix for object classification with 85% overall accuracy on the test set. Cube and Octahedron achieve near-perfect recall, while the geometrically complex Icosahedron is most challenging.

Large-Scale Qualitative Comparison

Qualitative results across representative validation and test conditions

Qualitative results for representative validation and test conditions. Ground truth is shown in green and predictions in red. The two-stage multimodal model consistently localizes contacts, including occluded sites under large bending, while single-modality or baseline variants miss or misplace distant contacts.
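
Reconstructions of this kind (predicted vs. ground-truth point clouds) are commonly scored with a symmetric Chamfer distance; the page does not state the paper's evaluation metric, so the sketch below is a generic scoring utility rather than the authors' protocol.

# Symmetric Chamfer distance between predicted and ground-truth point
# clouds; a common metric for folding-based reconstruction, shown here
# as an assumption since the evaluation metric is not stated above.
import torch

def chamfer_distance(pred: torch.Tensor, gt: torch.Tensor) -> torch.Tensor:
    """pred: (B, N, 3), gt: (B, M, 3); returns mean symmetric Chamfer distance."""
    d = torch.cdist(pred, gt)                 # (B, N, M) pairwise distances
    return d.min(dim=2).values.mean() + d.min(dim=1).values.mean()

pred = torch.rand(1, 2025, 3)  # e.g., a 45 x 45 folded grid
gt = torch.rand(1, 2048, 3)
print(chamfer_distance(pred, gt).item())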

BibTeX

@inproceedings{deepcofi2026,
  title={Visual-Auditory Proprioception of Soft Finger Shape and Contact},
  author={Guo, Qinsong and Yang, Ke and Zhao, Hanwen and Fang, Haohan and Wang, Haoxuan and Feng, Chen},
  booktitle={Proceedings of the IEEE International Conference on Robotics and Automation (ICRA)},
  year={2026}
}