DeepExplorer  

Metric-Free Exploration for Topological Mapping by Task and Motion Imitation in Feature Space

1 University of Oxford
2 New York University



DeepExplorer is a framework for efficient, metric-free visual exploration based on task and motion planning performed entirely in an image feature space. We train DeepExplorer via deeply-supervised imitation learning with joint feature hallucination and action prediction, whose importance is confirmed in our ablation study. Through experiments on both exploration and navigation tasks, we show DeepExplorer's efficiency and strong sim2sim/sim2real generalization capability.

Abstract

We propose DeepExplorer, a simple and lightweight metric-free exploration method for topological mapping of unknown environments. It performs Task And Motion Planning (TAMP) entirely in image feature space.

  • The Task Planner is a recurrent network that uses the latest sequence of image observations to hallucinate a feature representing the next-best exploration goal.
  • The Motion Planner then uses the current and the hallucinated features to generate an action taking the agent towards that goal.
  • The two planners are jointly trained via deeply-supervised imitation learning from expert demonstrations. During exploration, we iteratively call the two planners to predict the next action; the topological map is built by appending each new image observation and action to the map and by closing loops with visual place recognition (VPR), as sketched in the code below.

    The resulting topological map efficiently represents an environment's connectivity and traversability, so it can be used for tasks such as visual navigation. We show DeepExplorer's exploration efficiency and strong sim2sim generalization capability on large-scale simulation datasets like Gibson and MP3D. Its effectiveness is further validated via the image-goal navigation performance on the resulting topological map. We further show its strong zero-shot sim2real generalization capability in real-world experiments.
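    In pseudocode, the exploration-and-mapping loop described above looks roughly as follows. This is only a minimal sketch under assumed interfaces: the callables (feature_extractor, task_planner, motion_planner, vpr_match), the environment's reset()/step() API, and the graph bookkeeping are illustrative placeholders, not the released DeepExplorer code.

# Minimal sketch of the exploration loop, under the assumptions stated above.
import networkx as nx
import torch


def explore(env, feature_extractor, task_planner, motion_planner, vpr_match,
            num_steps=500, history_len=11):
    """Iteratively plan actions and grow a topological map as a graph."""
    topo_map = nx.Graph()
    obs = env.reset()
    feat = feature_extractor(obs)                       # f_t
    history = [feat] * history_len                      # seed the RNN context
    topo_map.add_node(0, feature=feat)
    prev_node = 0

    for t in range(1, num_steps + 1):
        # Task planning: hallucinate the next-best feature from recent features.
        feats_window = torch.stack(history[-history_len:], dim=0)
        goal_feat = task_planner(feats_window)          # hallucinated goal feature
        # Motion planning: pick the action that moves toward the hallucinated feature.
        action = int(motion_planner(feat, goal_feat).argmax(dim=-1))

        obs = env.step(action)
        feat = feature_extractor(obs)
        history.append(feat)

        # Append the new observation/action to the map ...
        topo_map.add_node(t, feature=feat)
        topo_map.add_edge(prev_node, t, action=action)

        # ... and close loops with visual place recognition (VPR).
        match = vpr_match(feat, topo_map)               # node id of a revisited place, or None
        if match is not None and match != prev_node:
            topo_map.add_edge(t, match)
        prev_node = t

    return topo_map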





    Presentation Video

     




    Method



    The feature extractor gψ takes image It as input and generates the corresponding feature vector ft. TaskPlanner πθT is a recurrent neural network (RNN) that consumes a sequence of features {ft−10, ..., ft} and hallucinates the next best feature to visit, f̂t+1. MotionPlanner πθM consumes the concatenation of ft and f̂t+1 and generates the action that moves the agent towards the hallucinated feature. During training, we supervise all intermediate outputs, including the intermediate hallucinated features {f̂t−9, ..., f̂t} and the intermediate actions {ât−10, ..., ât−1}, in addition to the final outputs f̂t+1 and ât. During inference, the current observation It is first encoded and fed into πθT to hallucinate f̂t+1, and then f̂t+1, concatenated with ft, is fed into πθM for motion planning. LT is an L2 loss and LM is a cross-entropy loss (the subscripts T and M denote Task and Motion, respectively); ht denotes the hidden state of the RNN.
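    To make the joint training concrete, below is a minimal PyTorch sketch of the two planners and the deeply-supervised loss described above. The GRU backbone, layer widths, 512-dimensional features, and 3-action space are illustrative assumptions for this sketch, not necessarily the paper's exact configuration.

# Hedged sketch of the jointly trained planners and the deeply-supervised loss.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TaskPlanner(nn.Module):
    """RNN that hallucinates the next-best feature from a feature sequence."""
    def __init__(self, feat_dim=512, hidden_dim=512):
        super().__init__()
        self.rnn = nn.GRU(feat_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, feat_dim)

    def forward(self, feats):                    # feats: (B, T, D) = {f_{t-10}, ..., f_t}
        h, _ = self.rnn(feats)                   # hidden state h at every step
        return self.head(h)                      # hallucinations {f̂_{t-9}, ..., f̂_{t+1}}


class MotionPlanner(nn.Module):
    """MLP mapping (current feature, hallucinated feature) to action logits."""
    def __init__(self, feat_dim=512, num_actions=3):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2 * feat_dim, 256), nn.ReLU(),
            nn.Linear(256, num_actions))

    def forward(self, feat, goal_feat):
        return self.mlp(torch.cat([feat, goal_feat], dim=-1))


def deeply_supervised_loss(task_planner, motion_planner, feats, next_feats, actions):
    """L_T (L2) on every hallucinated feature plus L_M (cross-entropy) on every action.

    feats:      (B, T, D) expert feature sequence {f_{t-10}, ..., f_t}
    next_feats: (B, T, D) ground-truth next features {f_{t-9}, ..., f_{t+1}}
    actions:    (B, T)    expert actions {a_{t-10}, ..., a_t}
    """
    pred_feats = task_planner(feats)                          # deep supervision: all steps
    logits = motion_planner(feats, pred_feats)                # action logits for all steps
    loss_T = F.mse_loss(pred_feats, next_feats)               # L_T
    loss_M = F.cross_entropy(logits.flatten(0, 1), actions.flatten())  # L_M
    return loss_T + loss_M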


    Datasets and Exploration Results

     








    Figure: (a) Gibson dataset; (b) MP3D dataset; (c) Real-World indoor dataset.

    We employ three datasets for a comprehensive evaluation: (1) the Gibson dataset, (2) the MatterPort3D (MP3D) dataset, and (3) a real-world indoor dataset.

     

     

    BibTeX

    
    @INPROCEEDINGS{He-RSS-23,
      AUTHOR    = {Yuhang He AND Irving Fang AND Yiming Li AND Rushi Bhavesh Shah AND Chen Feng},
      TITLE     = {{Metric-Free Exploration for Topological Mapping by Task and Motion Imitation in Feature Space}},
      BOOKTITLE = {Proceedings of Robotics: Science and Systems},
      YEAR      = {2023},
      ADDRESS   = {Daegu, Republic of Korea},
      MONTH     = {July},
      DOI       = {10.15607/RSS.2023.XIX.099}
    }

    Acknowledgements

    Chen Feng is the corresponding author. This work was supported by NSF grant 2238968.