VLM See, Robot Do:

Human Demo Video to Robot Action Plan via Vision Language Model

New York University
* Equal contribution, first authors. Equal contribution, second authors.

TLDR

We interpret human demonstration videos and generate robot action plans using a pipeline of keyframe selection, visual perception, and vision-language model reasoning.
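
For concreteness, here is a minimal Python sketch of how such a three-stage pipeline could be composed. Every function name, signature, and return value below is a hypothetical placeholder with a stub body, shown only to illustrate the data flow; none of it is this project's actual code or API.

# Hypothetical pipeline skeleton: the stubs below only illustrate how the
# three modules fit together; they are not this project's actual interfaces.

def select_keyframes(video_path):
    """Module 1: return frame indices at low-hand-speed moments (see Method)."""
    return [0]  # stub

def perceive_objects(video_path, frame_idx):
    """Module 2: detect the objects and containers visible in one keyframe."""
    return {"frame": frame_idx, "objects": []}  # stub

def plan_with_vlm(scene_states):
    """Module 3: ask a vision-language model to turn scene states into a step list."""
    return ["pick(object_1)", "place(container_1)"]  # stub

def video_to_action_plan(video_path):
    keyframes = select_keyframes(video_path)
    scenes = [perceive_objects(video_path, i) for i in keyframes]
    return plan_with_vlm(scenes)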


Teaser Image


Method

Method Image

Module 1: Keyframe Selection
We use the MediaPipe Hands API to detect hand keypoints and compute the hand's speed in each frame.
The speed curve is then interpolated to be continuous, and its valleys are selected as keyframes. A sketch of this module is shown below.
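
The following is a minimal sketch of this module, assuming the MediaPipe Hands Python solution for wrist keypoints, OpenCV for frame reading, NumPy interpolation to bridge dropped detections, and SciPy's find_peaks on the negated speed curve for valley detection. The smoothing window and minimum valley spacing are illustrative values, not the project's exact settings.

import cv2
import numpy as np
import mediapipe as mp
from scipy.signal import find_peaks

def keyframe_indices(video_path, smooth_window=9, min_gap=10):
    """Sketch of Module 1: pick frames where the hand speed hits a local minimum."""
    hands = mp.solutions.hands.Hands(static_image_mode=False, max_num_hands=1)
    cap = cv2.VideoCapture(video_path)

    positions, detected_frames = [], []
    frame_idx = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        result = hands.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        if result.multi_hand_landmarks:
            wrist = result.multi_hand_landmarks[0].landmark[0]  # landmark 0 is the wrist
            positions.append((wrist.x, wrist.y))
            detected_frames.append(frame_idx)
        frame_idx += 1
    cap.release()
    hands.close()

    if len(positions) < 2:
        return []

    # Interpolate the wrist trajectory over every frame so the speed curve stays
    # continuous even when detection drops out for a few frames.
    all_frames = np.arange(frame_idx)
    xs = np.interp(all_frames, detected_frames, [p[0] for p in positions])
    ys = np.interp(all_frames, detected_frames, [p[1] for p in positions])

    # Per-frame hand speed (normalized image units per frame), lightly smoothed.
    speed = np.hypot(np.diff(xs), np.diff(ys))
    speed = np.convolve(speed, np.ones(smooth_window) / smooth_window, mode="same")

    # Valleys of the speed curve are peaks of its negation.
    valleys, _ = find_peaks(-speed, distance=min_gap)
    return valleys.tolist()

The returned indices can then be used to pull the corresponding frames out of the video as keyframes for the later perception and reasoning modules.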


Data and Demos

We collected a dataset of human demonstration videos in three diverse categories: vegetable organization, garment organization, and wooden block stacking.

Data Image

Below are the data and corresponding results.

Human demonstration
Here is a demonstration video for vegetable organization. The video illustrates how a human arranges the vegetable toys into specific containers one by one.

Robot execution
In this video, the robot executes the vegetable organization task in the same order as demonstrated by the human.

BibTeX



Coming Soon

Acknowledgements

This work was supported in part by NSF grants 2238968, 2322242, and 2024882, and by the NYU IT High Performance Computing resources, services, and staff expertise.