A robot's ability to anticipate the 3D location of a hand movement's action target from egocentric videos can greatly improve safety and efficiency in human-robot interaction (HRI). While previous research has predominantly focused on semantic action classification or 2D target region prediction, we argue that predicting the action target's 3D coordinates could pave the way for more versatile downstream robotics tasks, especially given the increasing prevalence of headset devices. This study substantially expands EgoPAT3D, the sole dataset dedicated to egocentric 3D action target prediction, augmenting both its size and diversity to enhance its potential for generalization. Moreover, we substantially improve the baseline algorithm by introducing a large pre-trained model and human prior knowledge. Remarkably, our novel algorithm achieves superior prediction performance using RGB images alone, eliminating the previous need for 3D point clouds and IMU input. Furthermore, we deploy our enhanced baseline algorithm on a real-world robotic platform to illustrate its practical utility in a straightforward HRI task. This demonstration underscores the real-world applicability of our advancements and may inspire more HRI use cases involving egocentric vision.

We employ ConvNeXt_Tiny (denoted by ψ) to extract a visual feature v_{t} = ψ(X_{t}) from each RGB frame. Hand landmarks {LM^{1}_{t}, LM^{2}_{t}, ..., LM^{21}_{t}} are first extracted with the hand-landmark API from Google's MediaPipe; if no hand is detected, all landmarks are set to 0. A multi-layer perceptron (MLP), denoted by φ, then encodes the stacked hand landmarks into a hand feature h_{t} = φ(LM^{21stack}_{t}). After feature encoding, the two features are concatenated and fed into another MLP to obtain the fused feature u_{t} = MLP(cat(v_{t}, h_{t})) for a single frame.
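The per-frame fusion above can be sketched as follows. The hidden and fused dimensions, the tiny stand-in MLPs, and the random weights are illustrative assumptions rather than the trained model; only the 768-d ConvNeXt_Tiny feature size and the 21×3 landmark layout come from the components named in the text.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions: ConvNeXt_Tiny's global feature is 768-d;
# MediaPipe returns 21 landmarks x 3 coords; HID/FUSED are assumptions.
VIS_DIM, LM_DIM, HID, FUSED = 768, 21 * 3, 128, 256

def mlp(x, w1, b1, w2, b2):
    """Two-layer MLP with ReLU, standing in for phi and the fusion MLP."""
    return np.maximum(x @ w1 + b1, 0.0) @ w2 + b2

# Random weights as placeholders for learned parameters.
w1, b1 = 0.02 * rng.standard_normal((LM_DIM, HID)), np.zeros(HID)
w2, b2 = 0.02 * rng.standard_normal((HID, HID)), np.zeros(HID)
f1, fb1 = 0.02 * rng.standard_normal((VIS_DIM + HID, FUSED)), np.zeros(FUSED)
f2, fb2 = 0.02 * rng.standard_normal((FUSED, FUSED)), np.zeros(FUSED)

v_t = rng.standard_normal(VIS_DIM)   # v_t = psi(X_t), the visual feature
hand_detected = True                 # MediaPipe detection flag
landmarks = (rng.standard_normal((21, 3)) if hand_detected
             else np.zeros((21, 3)))  # all landmarks zeroed when no hand

h_t = mlp(landmarks.reshape(-1), w1, b1, w2, b2)         # h_t = phi(LM stack)
u_t = mlp(np.concatenate([v_t, h_t]), f1, fb1, f2, fb2)  # fused frame feature
```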

We use a 2-layer LSTM to process the fused feature. The steps for handling the LSTM outputs are similar to the original EgoPAT3D baseline. We divide the 3D space into grids of dimension 1024×1024×1024 and aim to generate a confidence score for each grid. We use three separate MLPs to process the output of the LSTM and obtain the confidence scores along the three dimensions. For example, without loss of generality, for dimension x at frame t, let g ∈ ℝ^{1024} denote all the grids in the x-dimension, where we normalize the coordinates of each grid to lie in [-1, 1]. The score vector s^{x}_{t} ∈ ℝ^{1024} is computed by s^{x}_{t} = MLP_{X}(LSTM(u_{t}, l_{t-1})), where l_{t-1} is the learned hidden representation and l_{0} is set to 0. A binary mask m_{t}^{x} ∈ ℝ^{1024} is used to zero out all grids whose confidence does not exceed a threshold γ. Letting s^{x}_{t}[i] and m^{x}_{t}[i] denote the score and mask for the i-th grid, we have:

\[
m^x_{t}[i] =
\begin{cases}
1, & i \in \{j \,|\, s^x_{t}[j] > \gamma\} \\
0, & i \in \{j \,|\, s^x_{t}[j] \leq \gamma\} \\
\end{cases}
\]
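The recurrent scoring step can be sketched minimally as below. For brevity this uses a single LSTM cell and a single linear scoring head, whereas the text describes a 2-layer LSTM and a full MLP head; the hidden size and all weights are placeholder assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
D, H, GRIDS = 256, 128, 1024  # fused-feature dim and hidden size assumed

def lstm_step(x, h, c, W, U, b):
    """One LSTM cell step; the four gates are stacked as [i, f, g, o]."""
    z = x @ W + h @ U + b
    i, f, g, o = np.split(z, 4)
    i, f, o = 1/(1+np.exp(-i)), 1/(1+np.exp(-f)), 1/(1+np.exp(-o))
    c = f * c + i * np.tanh(g)
    h = o * np.tanh(c)
    return h, c

# Placeholder weights for the recurrence and the x-dimension score head.
W = 0.05 * rng.standard_normal((D, 4 * H))
U = 0.05 * rng.standard_normal((H, 4 * H))
b = np.zeros(4 * H)
Wx = 0.05 * rng.standard_normal((H, GRIDS))

h = c = np.zeros(H)            # l_0 = 0
for t in range(5):             # process fused features u_1 .. u_5
    u_t = rng.standard_normal(D)
    h, c = lstm_step(u_t, h, c, W, U, b)

s_x = 1 / (1 + np.exp(-(h @ Wx)))  # per-grid confidence scores for dim x
```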

The masked score is then calculated by ŝ^{x}_{t} = m^{x}_{t} ⊙ s^{x}_{t}, where ⊙ denotes the element-wise (Hadamard) product. The estimated target position for dimension x at frame t is then:

\[ x_t = (\hat{s}^x_{t})^{\top} g \in \mathbb{R} \]
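Putting the masking and the weighted read-out together for one dimension; the sigmoid confidences are random stand-ins and γ = 0.5 is an assumed threshold value:

```python
import numpy as np

rng = np.random.default_rng(0)
GRIDS, gamma = 1024, 0.5  # grid count from the text; gamma value assumed

g = np.linspace(-1.0, 1.0, GRIDS)                   # normalized grid coords
s = 1 / (1 + np.exp(-rng.standard_normal(GRIDS)))   # stand-in confidences

m = (s > gamma).astype(s.dtype)  # binary mask m_t^x
s_hat = m * s                    # Hadamard product zeroes low-confidence grids
x_t = s_hat @ g                  # x_t = (s_hat)^T g
```

Note that, as in the formula above, the masked scores are read out as a plain weighted sum without renormalization.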

We post-process each LSTM prediction to incorporate human prior knowledge. For each frame t, we take the coordinate of the index fingertip landmark as the 2D hand position Ĥ_{t}. The predicted 3D target position P_{t} (in meters) is projected to a 2D pixel position Ĥ_{Pt} using the camera intrinsic parameters K and the image resolution (4K in our case); depth information is ignored in this projection. We compute the per-frame hand position offset ĥ_{t} = ||Ĥ_{t} - Ĥ_{t-1}||_{2} and keep track of the maximum historical offset Ĝ_{t} = max_{i ≤ t} ĥ_{i}. The final 2D position is calculated as Ĝ_{Pt} = Ĥ_{Pt}·(ĥ_{t}/Ĝ_{t}) + Ĥ_{t}·(1 - ĥ_{t}/Ĝ_{t}). This 2D result is then back-projected to a 3D position P̂_{t} with the pre-projection depth, again using K and the image resolution, and serves as the final prediction.
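The blending heuristic can be sketched as below: when the hand moves fast (offset near its historical maximum), the network prediction dominates; when the hand is nearly still, the hand position itself dominates. The pixel tracks are made-up examples, and the projection/back-projection with K is omitted.

```python
import numpy as np

# Hypothetical 2D pixel tracks: H_hand[t] is the index-fingertip landmark,
# H_pred[t] is the network's 3D prediction projected to pixels via K.
H_hand = np.array([[100., 200.], [110., 205.], [112., 206.]])
H_pred = np.array([[400., 300.], [410., 310.], [420., 320.]])

G, prev, final = 0.0, H_hand[0], []
for t in range(1, len(H_hand)):
    h = np.linalg.norm(H_hand[t] - prev)  # per-frame hand offset h_t
    prev = H_hand[t]
    G = max(G, h)                         # max historical offset G_t
    w = h / G if G > 0 else 0.0           # in [0, 1]: fast hand -> trust network
    final.append(H_pred[t] * w + H_hand[t] * (1.0 - w))
```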

The new EgoPAT3Dv2 dataset will be made available soon.

```
Coming Soon
```