EgoPAT3Dv2  

Predicting 3D Action Target from 2D Egocentric Vision for Human-Robot Interaction

1 New York University
2 North Carolina State University
*, † Equal Contribution



A human wearing a helmet camera manipulates objects in a shared workspace with a UR10E cobot. The cobot tries to reach the anticipated 3D action target with the shortest Cartesian path.

Abstract

A robot's ability to anticipate the 3D action target location of a hand's movement from egocentric videos can greatly improve safety and efficiency in human-robot interaction (HRI). While previous research predominantly focused on semantic action classification or 2D target region prediction, we argue that predicting the action target's 3D coordinate could pave the way for more versatile downstream robotics tasks, especially given the increasing prevalence of headset devices. This study substantially expands EgoPAT3D, the sole dataset dedicated to egocentric 3D action target prediction. We augment both its size and diversity, enhancing its potential for generalization. Moreover, we substantially enhance the baseline algorithm by introducing a large pre-trained model and human prior knowledge. Remarkably, our novel algorithm can now achieve superior prediction outcomes using solely RGB images, eliminating the previous need for 3D point clouds and IMU input. Furthermore, we deploy our enhanced baseline algorithm on a real-world robotic platform to illustrate its practical utility in a straightforward HRI task. This demonstration underscores the real-world applicability of our advancements and may inspire more HRI use cases involving egocentric vision.





Overview Video (3 mins)

 



Detailed Video (7 mins, to be available)

 




Method



We employ ConvNeXt_Tiny (denoted by ψ) to extract a visual feature vt = ψ(Xt) from each RGB frame. Hand landmarks {LM1t, LM2t, ..., LM21t} are first extracted with the Hand API from Google's MediaPipe; if no hand is detected, all landmarks are set to 0. A multi-layer perceptron (MLP), denoted by φ, then encodes the stacked hand landmarks into a feature ht = φ(stack(LM1t, ..., LM21t)). Finally, the two features are concatenated and fed into another MLP to obtain the fused per-frame feature ut = MLP(cat(vt, ht)).
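A minimal PyTorch sketch of this per-frame encoder is given below. The module layout, feature dimensions, and the use of torchvision's convnext_tiny backbone are illustrative assumptions, not the released implementation.

import torch
import torch.nn as nn
import torchvision.models as tvm

class FrameEncoder(nn.Module):
    def __init__(self, feat_dim=256):
        super().__init__()
        # psi: ConvNeXt_Tiny backbone; drop the classification layer to keep the pooled 768-d feature
        self.backbone = tvm.convnext_tiny(weights=tvm.ConvNeXt_Tiny_Weights.DEFAULT)
        self.backbone.classifier[2] = nn.Identity()
        # phi: encode the 21 stacked (x, y, z) MediaPipe landmarks
        self.hand_mlp = nn.Sequential(nn.Linear(21 * 3, 128), nn.ReLU(), nn.Linear(128, 128))
        # fusion MLP: u_t = MLP(cat(v_t, h_t))
        self.fuse_mlp = nn.Sequential(nn.Linear(768 + 128, feat_dim), nn.ReLU(), nn.Linear(feat_dim, feat_dim))

    def forward(self, frame, landmarks):
        # frame: (B, 3, H, W) RGB image X_t; landmarks: (B, 21, 3), all zeros if no hand was detected
        v_t = self.backbone(frame)                          # v_t = psi(X_t)
        h_t = self.hand_mlp(landmarks.flatten(1))           # h_t = phi(stacked landmarks)
        return self.fuse_mlp(torch.cat([v_t, h_t], dim=1))  # fused per-frame feature u_t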

We use a 2-layer LSTM to process the fused features, handling its outputs similarly to the original EgoPAT3D baseline. We divide the 3D space into a 1024×1024×1024 grid and generate a confidence score for each grid cell, using three separate MLPs on the LSTM output to obtain the scores along the three dimensions. Without loss of generality, consider dimension x at frame t. Let g ∈ ℝ1024 denote the grid cells along the x-dimension, with each cell's coordinate normalized to [-1, 1]. The score vector sxt ∈ ℝ1024 is computed as sxt = MLPx(LSTM(ut, lt-1)), where lt-1 is the learned hidden representation and l0 is set to 0. A binary mask mxt ∈ ℝ1024 removes all grid cells whose confidence is below a threshold γ. Letting sxt[i] and mxt[i] denote the score and mask of the i-th grid cell, we have:

\[ m^x_{t}[i] = \begin{cases} 1, & i \in \{j \,|\, s^x_{t}[j] > \gamma\} \\ 0, & i \in \{j \,|\, s^x_{t}[j] \leq \gamma\} \\ \end{cases} \]

The masked score is then calculated as ŝxt = mxt ⊙ sxt, where ⊙ denotes the element-wise product. The estimated target position in dimension x at frame t is then:

\[ x_t = (\hat{s}^x_{t})^{\top} g \;\in\; \mathbb{R} \]
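A hedged PyTorch sketch of this grid-scoring head, shown for the x-dimension only, follows; the hidden size, the threshold value γ, and all variable names are illustrative assumptions.

import torch
import torch.nn as nn

class TargetHeadX(nn.Module):
    def __init__(self, feat_dim=256, hidden=256, n_grids=1024, gamma=0.5):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden, num_layers=2, batch_first=True)   # 2-layer LSTM
        self.mlp_x = nn.Sequential(nn.Linear(hidden, n_grids), nn.Sigmoid())    # MLP_x -> s_t^x
        self.register_buffer("g", torch.linspace(-1.0, 1.0, n_grids))           # grid coordinates normalized to [-1, 1]
        self.gamma = gamma                                                       # confidence threshold (value assumed)

    def forward(self, u):
        # u: (B, T, feat_dim) fused features; the LSTM carries l_{t-1} internally, with l_0 = 0
        out, _ = self.lstm(u)
        s_x = self.mlp_x(out)                    # confidence scores, (B, T, 1024)
        m_x = (s_x > self.gamma).float()         # binary mask m_t^x
        s_hat = m_x * s_x                        # masked scores (element-wise product)
        return (s_hat * self.g).sum(dim=-1)      # x_t = (s_hat_t^x)^T g, per frame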

We post-process each LSTM prediction to incorporate human prior knowledge. For each frame t, we take the coordinate of the index-fingertip landmark as the 2D hand position Ĥt. The predicted 3D target position Pt (in meters) is transformed into a 2D pixel position ĤPt using the camera intrinsic parameters K and the image resolution (4K in our case); the depth is ignored in this transformation. We compute the hand position offset between consecutive frames, ĥt = ||Ĥt - Ĥt-1||2, and keep track of the maximum historical offset Ĝt = max(ĥ1, ..., ĥt). The final 2D position is calculated as ĜPt = ĤPt·(ĥt/Ĝt) + Ĥt·(1 - ĥt/Ĝt). This 2D result is then transformed back into a 3D position P̂t using the pre-transformation depth, again via the camera intrinsic parameters K and image resolution, to serve as the final prediction.
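The NumPy sketch below illustrates this post-processing for a single frame. The helper's exact signature, the choice of MediaPipe's index-fingertip landmark, and folding the image resolution into the intrinsic matrix K are assumptions made for illustration.

import numpy as np

def refine_with_hand_prior(P_t, H_t, H_prev, G_prev, K):
    """P_t: predicted 3D target in meters (camera frame); H_t, H_prev: 2D fingertip positions in pixels;
    G_prev: max historical hand offset; K: 3x3 camera intrinsic matrix (resolution assumed folded in)."""
    # project the 3D prediction to 2D pixels, remembering its depth for later back-projection
    uvw = K @ P_t
    HP_t, depth = uvw[:2] / uvw[2], P_t[2]
    # hand position offset between consecutive frames and its running maximum
    h_t = np.linalg.norm(H_t - H_prev)
    G_t = max(G_prev, h_t)
    # blend the projected prediction with the current hand position by relative hand speed
    w = h_t / G_t if G_t > 0 else 0.0
    GP_t = w * HP_t + (1.0 - w) * H_t
    # back-project the refined 2D point to 3D with the retained depth
    P_hat = depth * (np.linalg.inv(K) @ np.array([GP_t[0], GP_t[1], 1.0]))
    return P_hat, G_t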


EgoPAT3Dv2 Dataset

 

Access to the new EgoPAT3Dv2 dataset will be available soon.

 

 

BibTeX


Coming Soon
                
              

Acknowledgements

Chen Feng is the corresponding author. This work is supported by NSF Grant 2026479, and by NYU IT High-Performance Computing resources, services, and staff expertise.