
Wanderland

Geometrically Grounded Simulation
for Open-World Embodied AI

1 New York University 2 Cornell University
⚖️: Equal contribution, random order ✉️: Corresponding author

TL;DR

Visual realism alone is insufficient for embodied AI. Trustworthy benchmarking demands the metric-scale geometric grounding that previous pipelines lack.





Why Do Existing Pipelines Fail?



  • Casual videos tend to have unidirectional capture, whereas our capture provides diverse camera views.
  • Vision-only 3D reconstruction still falls short of multi-sensor fusion SLAM, as the side-by-side comparisons show.

[Side-by-side reconstruction comparisons: Vid2Sim vs. Ours; GaussGym vs. Ours]

Our Framework

[Framework overview figure]
  1. Our pipeline begins with multi-sensor capture using the MetaCam device in real-world urban spaces.
  2. MetaCam Studio processes the raw data via LIV-SLAM to produce a colorized, globally consistent metric point cloud and accurate camera poses.
  3. We initialize 3D Gaussians from the metric point cloud and render per-view depth maps from this initialization.
  4. The 3DGS model is optimized with both photometric and depth losses; a minimal loss sketch follows this list.
  5. In parallel, we extract a reliable collision mesh from the same global point cloud; one plausible meshing route is sketched below.
  6. We integrate the trained 3DGS model and the collision mesh into a single USD scene.
  7. The USD scene can be loaded directly into Isaac Sim for training and evaluating navigation policies; a loading snippet follows.
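
A minimal sketch of the combined objective in step 4, assuming PyTorch tensors produced by a Gaussian rasterizer. gs_loss and the weight lambda_depth are illustrative names, not the authors' code:

import torch
import torch.nn.functional as F

def gs_loss(rendered_rgb, rendered_depth, gt_rgb, slam_depth, lambda_depth=0.2):
    # Photometric term: L1 between the rasterized image and the captured frame.
    l_photo = F.l1_loss(rendered_rgb, gt_rgb)
    # Depth term: supervise only pixels with valid metric depth rendered
    # from the global LIV-SLAM point cloud.
    valid = slam_depth > 0
    l_depth = F.l1_loss(rendered_depth[valid], slam_depth[valid])
    return l_photo + lambda_depth * l_depth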
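
For step 5, this page does not name a meshing algorithm; one plausible route (an assumption, not necessarily the authors' method) is Poisson surface reconstruction in Open3D, trimming the low-density vertices that Poisson tends to hallucinate in empty space. File names are placeholders:

import numpy as np
import open3d as o3d

# Load the globally consistent metric point cloud produced by LIV-SLAM.
pcd = o3d.io.read_point_cloud("scene_metric.ply")
pcd.estimate_normals(
    search_param=o3d.geometry.KDTreeSearchParamHybrid(radius=0.1, max_nn=30))
mesh, densities = o3d.geometry.TriangleMesh.create_from_point_cloud_poisson(
    pcd, depth=10)
# Drop the lowest-density vertices, which lie far from observed points.
d = np.asarray(densities)
mesh.remove_vertices_by_mask(d < np.quantile(d, 0.05))
o3d.io.write_triangle_mesh("collision_mesh.obj", mesh)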
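
For step 7, a minimal sketch of loading the exported scene in Isaac Sim. The module paths follow the omni.isaac.core API of the 2023.x releases (newer releases reorganized these modules), and wanderland_scene.usd is a placeholder path:

from omni.isaac.kit import SimulationApp

# SimulationApp must be created before importing other omni.isaac modules.
simulation_app = SimulationApp({"headless": True})

from omni.isaac.core import World
from omni.isaac.core.utils.stage import add_reference_to_stage

world = World()
# Reference the USD scene (trained 3DGS assets plus collision mesh) on stage.
add_reference_to_stage(usd_path="wanderland_scene.usd", prim_path="/World/Scene")
world.reset()
for _ in range(100):  # step physics and rendering
    world.step(render=True)
simulation_app.close()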

Data Statistics and Comparison

[Dataset statistics and comparison table]

BibTeX



@article{liu2025wanderland,
  title={Wanderland: Geometrically Grounded Simulation for Open-World Embodied AI},
  author={Xinhao Liu* and Jiaqi Li* and Youming Deng and Ruxin Chen and Yingjia Zhang and Yifei Ma and Li Guo and Yiming Li and Jing Zhang and Chen Feng},
  journal={arXiv preprint arXiv:2511.20620},
  year={2025}
}