VLM See, Robot Do:

Human Demo Video to Robot Action Plan via Vision Language Model

Beichen Wang^*¹, Juexiao Zhang^*¹, Shuwen Dong^†¹, Irving Fang^†¹, Chen Feng¹

¹ New York University
^* Equal contribution, first authors. ^† Equal contribution, second authors.

TLDR

Interpret human demonstration videos and generate robot action plans using a pipeline of keyframe selection, visual perception and vision language model reasoning.

Method

Module 1: Keyframe Selection
We use MediaPipe APIs to detect hand keypoints and compute the speed of the hand.
The speed curve is then interpolated to be continuous, and its valleys (local minima) are selected as keyframes.
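
The valley-picking step can be illustrated with a minimal sketch. It is not the project's implementation: it assumes the hand has already been reduced to one 2D point per frame (for example a wrist keypoint from a hand detector), with `None` for frames where no hand was found. The function names, the linear gap filling, and the plain local-minimum rule are all assumptions made for illustration.

```python
import math
from typing import List, Optional, Sequence, Tuple

Point = Tuple[float, float]


def hand_speeds(positions: Sequence[Optional[Point]], fps: float) -> List[Optional[float]]:
    """Speed per frame from consecutive positions; None where it cannot be computed."""
    speeds: List[Optional[float]] = [None]  # frame 0 has no previous frame
    for prev, cur in zip(positions, positions[1:]):
        if prev is None or cur is None:
            speeds.append(None)  # hand missing in one of the two frames
        else:
            speeds.append(math.hypot(cur[0] - prev[0], cur[1] - prev[1]) * fps)
    return speeds


def interpolate_gaps(values: Sequence[Optional[float]]) -> List[float]:
    """Fill None entries linearly (assumed scheme); edges copy the nearest known value."""
    known = [i for i, v in enumerate(values) if v is not None]
    if not known:
        raise ValueError("no valid samples to interpolate from")
    filled = list(values)
    for i in range(known[0]):
        filled[i] = values[known[0]]
    for i in range(known[-1] + 1, len(values)):
        filled[i] = values[known[-1]]
    for a, b in zip(known, known[1:]):
        for i in range(a + 1, b):
            t = (i - a) / (b - a)
            filled[i] = values[a] + t * (values[b] - values[a])
    return filled


def speed_valleys(speeds: Sequence[float]) -> List[int]:
    """Frame indices where speed drops below the previous frame and does not drop further."""
    return [
        i
        for i in range(1, len(speeds) - 1)
        if speeds[i] < speeds[i - 1] and speeds[i] <= speeds[i + 1]
    ]


def select_keyframes(positions: Sequence[Optional[Point]], fps: float) -> List[int]:
    """Hypothetical end-to-end helper: speeds, gap filling, then valley detection."""
    return speed_valleys(interpolate_gaps(hand_speeds(positions, fps)))


# Toy trajectory: the hand moves quickly, slows down around frames 3-4, then speeds up.
xs = [0.0, 4.0, 8.0, 9.0, 10.0, 14.0, 18.0]
keyframes = select_keyframes([(x, 0.0) for x in xs], fps=1.0)
print(keyframes)
```

On real footage the speed signal is noisy, so a smoothing pass or a minimum spacing between valleys would likely be needed before picking keyframes; neither is shown here.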

Data and demos

We collected a dataset of human demonstration videos in three diverse categories: vegetable organization, garment organization, and wooden block stacking.

Below are the data and corresponding results.

Human demonstration
Here is a demonstration video for vegetable organization. The video illustrates how a human arranges the vegetable toys into specific containers one by one.

Robot execution
In this video, the robot executes the vegetable organization task in the same order as the human demonstration.

BibTeX


Coming Soon

Acknowledgements

The work was supported in part through NSF grants 2238968, 2322242, and 2024882, and the NYU IT High Performance Computing resources, services, and staff expertise.