
CityWalker: Learning Embodied Urban Navigation from Web-Scale Videos

New York University

TL;DR

We leverage thousands of hours of online city walking and driving videos to train autonomous agents for robust, generalizable navigation in urban environments through scalable, data-driven imitation learning.



Abstract

Navigating dynamic urban environments presents significant challenges for embodied agents, requiring advanced spatial reasoning and adherence to common-sense norms. Despite progress, existing visual navigation methods struggle in map-free or off-street settings, limiting the deployment of autonomous agents like last-mile delivery robots. To overcome these obstacles, we propose a scalable, data-driven approach for human-like urban navigation by training agents on thousands of hours of in-the-wild city walking and driving videos sourced from the web. We introduce a simple and scalable data processing pipeline that extracts action supervision from these videos, enabling large-scale imitation learning without costly annotations. Our model learns sophisticated navigation policies to handle diverse challenges and critical scenarios. Experimental results show that training on large-scale, diverse datasets significantly enhances navigation performance, surpassing current methods. This work shows the potential of using abundant online video data to develop robust navigation policies for embodied agents in dynamic urban settings.
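As a concrete illustration of the action-supervision step described above, here is a minimal sketch of how waypoint labels might be derived from per-frame relative poses produced by visual odometry. The function name, the pose convention, and the prediction horizon are assumptions made for illustration, not the authors' implementation.

```python
import numpy as np

def poses_to_action_labels(rel_poses, horizon=5):
    """Derive waypoint action labels from per-frame relative poses.

    rel_poses: list of 4x4 SE(3) matrices, where rel_poses[t] maps
    frame t+1 into frame t's coordinate system (e.g., output of a
    visual odometry system; the exact backend is an assumption here).
    Returns, for each time step t, the next `horizon` camera positions
    expressed in frame t's coordinates -- a label for imitation learning.
    """
    labels = []
    T = len(rel_poses)
    for t in range(T - horizon):
        waypoints = []
        cum = np.eye(4)
        for k in range(horizon):
            cum = cum @ rel_poses[t + k]    # chain relative transforms
            waypoints.append(cum[:3, 3])    # translation = future position
        labels.append(np.stack(waypoints))  # shape (horizon, 3)
    return labels
```

Because the labels come directly from camera motion, this kind of pipeline needs no manual annotation, which is what makes web-scale training data feasible.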

CityWalker teaser figure

Method


Overall Pipeline of CityWalker. Our training pipeline starts from internet-sourced videos, using visual odometry to obtain relative poses between frames. At each time step, the model receives past observations, the past trajectory, and the target location as input; these are encoded by a frozen image encoder and a trainable coordinate encoder. A transformer processes the encoded inputs to generate future tokens, which an action head and an arrival head decode into action and arrival-status predictions. During training, tokens encoded from the actual future frames guide the transformer to hallucinate these future tokens.
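To make the architecture concrete, below is an illustrative PyTorch sketch of the forward pass as described in the caption: a frozen image encoder and a trainable coordinate encoder feed a transformer, whose output future tokens are decoded by an action head and an arrival head. All layer sizes, the number of future tokens, and the assumption that the image encoder emits one d_model-dimensional feature per frame are placeholders, not the authors' configuration; the training-time guidance from actual future-frame tokens would be an additional loss not shown here.

```python
import torch
import torch.nn as nn

class CityWalkerPolicySketch(nn.Module):
    """Illustrative sketch of the described architecture, not the
    authors' code. Assumes the frozen image encoder maps each frame
    to a single d_model-dimensional feature vector."""

    def __init__(self, img_encoder, d_model=512, n_future=5):
        super().__init__()
        self.img_encoder = img_encoder.eval()        # frozen image encoder
        for p in self.img_encoder.parameters():
            p.requires_grad = False
        self.coord_encoder = nn.Linear(2, d_model)   # trainable coordinate encoder
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=6)
        self.future_queries = nn.Parameter(torch.randn(n_future, d_model))
        self.action_head = nn.Linear(d_model, 2)     # (x, y) waypoint per future step
        self.arrival_head = nn.Linear(d_model, 1)    # arrival-status logit

    def forward(self, frames, past_traj, target):
        # frames: (B, T, C, H, W); past_traj: (B, T, 2); target: (B, 2)
        B, T = frames.shape[:2]
        with torch.no_grad():                        # encoder stays frozen
            img_tok = self.img_encoder(frames.flatten(0, 1)).view(B, T, -1)
        traj_tok = self.coord_encoder(past_traj)     # (B, T, d_model)
        tgt_tok = self.coord_encoder(target).unsqueeze(1)
        queries = self.future_queries.unsqueeze(0).expand(B, -1, -1)
        tokens = torch.cat([img_tok, traj_tok, tgt_tok, queries], dim=1)
        out = self.transformer(tokens)
        future = out[:, -queries.shape[1]:]          # "hallucinated" future tokens
        return self.action_head(future), self.arrival_head(future).squeeze(-1)
```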


Experiment Results

Scaling Results

Performance and Data Scale. Model performance, measured by MAOE, as a function of training data size in hours of video. We also show the zero-shot performance of our model when trained on driving videos only and on mixed driving and walking videos.
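For reference, one plausible reading of the MAOE metric is a mean absolute orientation error: the absolute angle between the predicted and ground-truth action directions, averaged over samples. The sketch below implements that assumed definition; the paper's exact formulation may differ.

```python
import numpy as np

def maoe(pred_wp, gt_wp):
    """Mean absolute orientation error in degrees (assumed definition).

    pred_wp, gt_wp: (N, 2) arrays of predicted / ground-truth
    displacement vectors (e.g., the first future waypoint).
    """
    pred_ang = np.arctan2(pred_wp[:, 1], pred_wp[:, 0])
    gt_ang = np.arctan2(gt_wp[:, 1], gt_wp[:, 0])
    diff = np.abs(pred_ang - gt_ang)
    diff = np.minimum(diff, 2 * np.pi - diff)  # wrap to [0, pi]
    return np.degrees(diff.mean())
```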


Qualitative Results


Qualitative Results. We divide the results into three categories. Success: the predicted action aligns well with the ground-truth action. Large error: the predicted action does not align with the ground truth but may still lead to successful navigation. Fail: the predicted action is likely to lead to failed navigation. The most significant observation is that large errors measured on offline data do not necessarily cause navigation failure, owing to the multimodal nature of policy learning. For example, in the fifth row, although the ground-truth action detours to the right of the traffic drum, the predicted action passing straight on the drum's left should also lead to successful navigation.
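The three categories above are assigned by inspection; purely for illustration, the snippet below shows one hypothetical way to bucket a single prediction by angular error. Both thresholds are invented for the example and are not from the paper.

```python
import numpy as np

def categorize(pred_wp, gt_wp, success_deg=30.0, fail_deg=90.0):
    """Hypothetical bucketing of one prediction into the three
    qualitative categories by the angle between the predicted and
    ground-truth displacement vectors (thresholds are invented)."""
    ang = np.abs(np.arctan2(pred_wp[1], pred_wp[0]) -
                 np.arctan2(gt_wp[1], gt_wp[0]))
    ang = np.degrees(min(ang, 2 * np.pi - ang))  # wrap to [0, 180]
    if ang <= success_deg:
        return "success"
    if ang <= fail_deg:
        return "large error"  # may still navigate successfully (multimodality)
    return "fail"
```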

BibTeX



Coming Soon

Acknowledgements

This work was supported by NSF Grants 2238968, 2121391, 2322242, and 2345139, and in part by the NYU IT High Performance Computing resources, services, and staff expertise. We also thank Xingyu Liu and Zixuan Hu for their help with data collection.