Learning Simultaneous Navigation and Construction in Grid Worlds
Abstract

We propose to study a new learning task, mobile construction, to enable an agent to build designed structures in 1/2/3D grid worlds while navigating in the same evolving environments. Unlike existing robot learning tasks such as visual navigation and object manipulation, this task is challenging because of the interdependence between accurate localization and strategic construction planning. In pursuit of generic and adaptive solutions to this partially observable Markov decision process (POMDP) based on deep reinforcement learning (RL), we design a Deep Recurrent Q-Network (DRQN) with explicit recurrent position estimation in this dynamic grid world. Our extensive experiments show that pre-training this position estimation module before Q-learning can significantly improve the construction performance measured by the intersection-over-union score, achieving the best results in our benchmark of various baselines including model-free and model-based RL, a handcrafted SLAM-based policy, and human players.
Mobile Construction Task

Intelligent agents, from animal architects (e.g., mound-building termites and burrowing rodents) to human beings, can build structures while navigating inside the same dynamically evolving environment, revealing robust and coordinated spatial skills such as localization, mapping, and planning. Can we create artificial intelligence (AI) to perform similar mobile construction tasks?
Handcrafting such an AI with existing robotics techniques is difficult. A fundamental challenge is the tight interdependence of robot localization and long-term planning for environment modification. If GPS or similar techniques are unavailable (often due to occlusion), robots have to rely on simultaneous localization and mapping (SLAM) or structure from motion (SfM) for pose estimation. But mobile construction violates the basic static-environment assumption of classic visual SLAM methods, and even challenges SfM methods designed for dynamic scenes (Saputra et al., 2018). Thus, to tackle this interdependence challenge, the agent must strategically modify the environment while efficiently maintaining a memory of the evolving structure, so that it can localize and construct accurately.
Deep reinforcement learning (DRL) offers another possibility, especially given its recent success in game playing and robot control. Can deep networks learn a generic and adaptive policy that builds intermediate structures as temporary localization landmarks, which eventually evolve into the designed one? To answer this question, we design an efficient simulation environment with a series of mobile construction tasks in 1/2/3D grid worlds. This design reasonably simplifies the environment dynamics and sensing models while keeping the tasks nontrivial, and allows us to focus on the aforementioned interdependence challenge before advancing to other real-world complexities.
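To make the task concrete, a minimal 1D instance of such a grid world might look like the following sketch. The observation model (a small egocentric window of the height map, with the agent's position hidden), action space, and IoU-based reward shaping here are our own illustrative assumptions, not the paper's exact environment specification:

```python
import numpy as np

class MobileConstruction1D:
    """Illustrative 1D mobile-construction grid world (a sketch, not the
    paper's environment). The agent moves left/right or drops a brick at
    its current cell, observing only a local window of the evolving
    height map -- its absolute position is never revealed."""

    def __init__(self, width=10, plan=None, half_window=2, max_steps=50, seed=0):
        self.width = width
        # Target height map to build; default: a wall of height 1 everywhere.
        self.plan = np.asarray(plan if plan is not None else [1] * width)
        self.half_window = half_window
        self.max_steps = max_steps
        self.rng = np.random.default_rng(seed)
        self.reset()

    def reset(self):
        self.heights = np.zeros(self.width, dtype=int)  # built structure
        self.pos = int(self.rng.integers(self.width))   # hidden agent position
        self.t = 0
        return self._obs()

    def _obs(self):
        # Egocentric window of the height map; cells outside the world read -1.
        idx = np.arange(self.pos - self.half_window, self.pos + self.half_window + 1)
        return np.where((idx >= 0) & (idx < self.width),
                        self.heights[np.clip(idx, 0, self.width - 1)], -1)

    def iou(self):
        # Intersection-over-union between built and target height maps.
        inter = np.minimum(self.heights, self.plan).sum()
        union = np.maximum(self.heights, self.plan).sum()
        return inter / union if union > 0 else 1.0

    def step(self, action):
        # Actions: 0 = move left, 1 = move right, 2 = place a brick here.
        before = self.iou()
        if action == 0:
            self.pos = max(self.pos - 1, 0)
        elif action == 1:
            self.pos = min(self.pos + 1, self.width - 1)
        else:
            self.heights[self.pos] += 1
        self.t += 1
        reward = self.iou() - before        # shaped by IoU improvement
        done = self.t >= self.max_steps
        return self._obs(), reward, done
```

Because placing a brick changes future observations at that location, even this toy version exhibits the interdependence between localization and construction discussed above.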
Recurrent Position Estimation

After formulating mobile construction tasks, we benchmarked our naive model-free DRL baselines on them. We found that these DRL policies perform worse than our human baselines, especially on 2D and 3D tasks. We believe the low performance is due to the difficulty of jointly learning meaningful representations and an effective control policy via RL training alone. In particular, the aforementioned interdependence challenge of mobile construction tasks requires the agent to localize itself in a dynamic environment where the surrounding structure can change after the agent places a new brick. Inspired by recent studies (Stooke et al., 2021; Lample & Chaplot, 2017; Jaderberg et al., 2016; Mirowski et al., 2016) which decouple representation learning from RL, we propose a method that combines (1) a pre-trained localization network (L-Net) to estimate the current agent position and (2) a DRQN to select the best action based on the predicted positions and observations.
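The wiring between the two components can be sketched as below. This is a deliberately tiny, untrained stand-in: the layer sizes, the plain tanh recurrence, and feeding the full position distribution into the Q-network are our assumptions for illustration, not the paper's exact architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

def linear(n_in, n_out):
    # Small random weights for an untrained, illustrative network.
    return rng.normal(0, 0.1, (n_out, n_in)), np.zeros(n_out)

class LNet:
    """Recurrent position estimator (sketch): from the current local
    observation and the previous action, maintain a hidden state and
    predict a distribution over grid positions."""
    def __init__(self, obs_dim, n_actions, n_pos, hidden=32):
        self.Wx, self.bx = linear(obs_dim + n_actions, hidden)
        self.Wh, _ = linear(hidden, hidden)
        self.Wo, self.bo = linear(hidden, n_pos)
        self.h = np.zeros(hidden)

    def step(self, obs, prev_action_onehot):
        x = np.concatenate([obs, prev_action_onehot])
        self.h = np.tanh(self.Wx @ x + self.bx + self.Wh @ self.h)
        logits = self.Wo @ self.h + self.bo
        p = np.exp(logits - logits.max())
        return p / p.sum()                  # softmax over positions

class DRQN:
    """Recurrent Q-network (sketch) that consumes the observation together
    with L-Net's predicted position distribution and outputs Q-values."""
    def __init__(self, obs_dim, n_pos, n_actions, hidden=32):
        self.Wx, self.bx = linear(obs_dim + n_pos, hidden)
        self.Wh, _ = linear(hidden, hidden)
        self.Wq, self.bq = linear(hidden, n_actions)
        self.h = np.zeros(hidden)

    def step(self, obs, pos_dist):
        x = np.concatenate([obs, pos_dist])
        self.h = np.tanh(self.Wx @ x + self.bx + self.Wh @ self.h)
        return self.Wq @ self.h + self.bq   # one Q-value per action

# One decision step: L-Net localizes, DRQN acts on the predicted position.
obs_dim, n_actions, n_pos = 5, 3, 10
lnet = LNet(obs_dim, n_actions, n_pos)
drqn = DRQN(obs_dim, n_pos, n_actions)
pos_dist = lnet.step(np.zeros(obs_dim), np.eye(n_actions)[0])
q_values = drqn.step(np.zeros(obs_dim), pos_dist)
action = int(q_values.argmax())
```

Keeping separate hidden states in the two modules mirrors the decoupling idea: L-Net can be pre-trained on a localization objective alone, while the DRQN is trained with Q-learning on top of its frozen predictions.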
The performance difference between DRQN and DRQN+L-Net in constant/variable dense environments is shown below:
[Figure: (a) DRQN in 1/2/3D constant (dense) environments; (b) DRQN+L-Net in 1/2/3D constant (dense) environments; (c) DRQN in 1/2/3D variable (dense) environments; (d) DRQN+L-Net in 1/2/3D variable (dense) environments.]
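The decoupled pre-training behind these gains (supervising a position estimator with ground-truth position labels that the simulator provides for free, before any Q-learning) can be illustrated with a deliberately simplified, non-recurrent stand-in: a linear softmax classifier that learns to localize an agent on a fixed 1D terrain from local observations. The terrain, window size, feature encoding, and model are all our assumptions for illustration; the paper's L-Net is a recurrent deep network, not this linear model:

```python
import numpy as np

rng = np.random.default_rng(1)
WIDTH, HALF_WIN = 8, 2

# Fixed "terrain" whose cell values are all distinct, so a local window
# carries enough information to identify the agent's position.
terrain = rng.permutation(WIDTH)

def window_at(pos):
    # Egocentric window, clipped at the world boundary.
    idx = np.clip(np.arange(pos - HALF_WIN, pos + HALF_WIN + 1), 0, WIDTH - 1)
    return terrain[idx]

def features(window):
    # One-hot encode each cell value of the window.
    f = np.zeros((window.size, WIDTH))
    f[np.arange(window.size), window] = 1.0
    return f.ravel()

# Collect (observation, true position) pairs from a random walk; the
# position labels come "for free" from the simulator, as in pre-training.
X, y = [], []
pos = WIDTH // 2
for _ in range(2000):
    X.append(features(window_at(pos)))
    y.append(pos)
    pos = int(np.clip(pos + rng.choice([-1, 1]), 0, WIDTH - 1))
X, y = np.array(X), np.array(y)

# Train a linear softmax position classifier with full-batch gradient
# descent on the cross-entropy loss.
W = np.zeros((WIDTH, X.shape[1]))
for _ in range(200):
    logits = X @ W.T
    logits -= logits.max(axis=1, keepdims=True)
    P = np.exp(logits)
    P /= P.sum(axis=1, keepdims=True)
    G = P.copy()
    G[np.arange(len(y)), y] -= 1.0          # dCE/dlogits
    W -= 0.5 * (G.T @ X) / len(y)

accuracy = float(((X @ W.T).argmax(axis=1) == y).mean())
```

The same recipe scales up in the paper's setting: replace the linear classifier with a recurrent network over observation/action histories, then freeze or fine-tune it while the DRQN learns the construction policy.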