Extrapolated Urban View Synthesis Benchmark

1 NYU  2 NVIDIA  3 University of Pennsylvania  4 USC  5 Stanford University
* Equal Contribution

TLDR

    We build a comprehensive real-world benchmark for quantitatively and qualitatively evaluating extrapolated novel view synthesis in large-scale urban scenes.

Overview Image
Our key contributions. Previous evaluations for urban view synthesis have primarily focused on interpolated poses, as the lack of ground-truth data has made it challenging to evaluate extrapolated poses. We address this gap by providing real-world data that enables both quantitative and qualitative evaluation of extrapolated view synthesis in urban scenes. The quantitative results reveal a significant performance drop in Gaussian Splatting when handling extrapolated views, highlighting the need for more robust novel view synthesis (NVS) methods.

Abstract

Photorealistic simulators are essential for training and evaluating vision-centric autonomous vehicles (AVs). At their core is Novel View Synthesis (NVS), a crucial capability that generates diverse unseen viewpoints to accommodate the broad and continuous pose distribution of AVs. Recent advances in radiance fields, such as 3D Gaussian Splatting, achieve photorealistic rendering at real-time speeds and have been widely used in modeling large-scale driving scenes. However, their performance is commonly evaluated in an interpolated setup with highly correlated training and test views. In contrast, extrapolation, where test views deviate significantly from training views, remains underexplored, limiting progress in generalizable simulation technology. To address this gap, we leverage publicly available AV datasets with multiple traversals, multiple vehicles, and multiple cameras to build the first Extrapolated Urban View Synthesis (EUVS) benchmark. Using this benchmark, we conduct quantitative and qualitative evaluations of state-of-the-art Gaussian Splatting methods across different difficulty levels. Our results show that Gaussian Splatting is prone to overfitting to training views. Moreover, incorporating diffusion priors or improving geometry does not fundamentally improve NVS under large view changes, highlighting the need for more robust approaches. We have released our data to help advance simulation technology for self-driving and urban robotics.
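
To make the distinction between the two evaluation setups concrete, the following minimal Python sketch contrasts an interpolated split (holding out frames from the same traversal used for training) with an extrapolated split (testing on a different traversal, vehicle, or camera). The Traversal container, frame paths, and sampling stride are illustrative assumptions, not the benchmark's actual data format.

from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Traversal:
    """One drive through a scene: an ordered list of image file paths."""
    frames: List[str]


def interpolated_split(traversal: Traversal, stride: int = 8) -> Tuple[List[str], List[str]]:
    """Hold out every `stride`-th frame of a single traversal;
    train and test views remain highly correlated."""
    test = traversal.frames[::stride]
    train = [f for i, f in enumerate(traversal.frames) if i % stride != 0]
    return train, test


def extrapolated_split(train_traversal: Traversal, test_traversal: Traversal) -> Tuple[List[str], List[str]]:
    """Train on one traversal and test on a different one (e.g. another lane,
    vehicle, or camera), so test poses deviate substantially from training poses."""
    return list(train_traversal.frames), list(test_traversal.frames)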


Dataset Visualization

Overview Image
Dataset distribution. Our dataset comprises 90,810 frames distributed over 104 cases, capturing a diverse array of multi-traversal paths and multi-agent interactions across varying difficulty levels.

Results Comparison across Different Levels

Overview Image
Qualitative and quantitative results across three difficulty levels. The results show a clear degradation in performance as the difficulty level increases, highlighting the challenge of maintaining consistency and realism in complex urban scenarios.
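
For reference, quantitative comparisons of this kind are typically reported with PSNR, SSIM, and LPIPS between rendered frames and held-out ground truth. The sketch below shows one common way to compute these metrics with scikit-image and the lpips package; it is an illustrative example, not the benchmark's exact evaluation script.

import numpy as np
import torch
import lpips
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

_lpips_net = lpips.LPIPS(net="alex")  # perceptual metric; downloads weights on first use


def nvs_metrics(rendered: np.ndarray, gt: np.ndarray) -> dict:
    """Compare a rendered frame against ground truth.
    Both inputs are HxWx3 float arrays with values in [0, 1]."""
    psnr = peak_signal_noise_ratio(gt, rendered, data_range=1.0)
    ssim = structural_similarity(gt, rendered, channel_axis=-1, data_range=1.0)
    # LPIPS expects NCHW tensors scaled to [-1, 1].
    to_tensor = lambda x: torch.from_numpy(x).permute(2, 0, 1).unsqueeze(0).float() * 2 - 1
    with torch.no_grad():
        lp = _lpips_net(to_tensor(rendered), to_tensor(gt)).item()
    return {"psnr": psnr, "ssim": ssim, "lpips": lp}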

Baseline Comparison Video

The translation-only experiment uses train and test traversals recorded on different lanes; a sketch of how such a lateral lane offset can be measured follows the videos below.

Train

Test
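
As referenced above, one simple way to quantify the lane offset in the translation-only setting is to measure, for each test-frame position, the distance to the nearest point on the training trajectory. The sketch below assumes both trajectories are available as 2D positions in a shared world frame; the function and the ~3.5 m lane width are illustrative assumptions, not part of the benchmark code.

import numpy as np


def lateral_offsets(train_xy: np.ndarray, test_xy: np.ndarray) -> np.ndarray:
    """train_xy: (N, 2) and test_xy: (M, 2) positions in a shared world frame (meters).
    Returns, for each test position, the distance to the closest train position."""
    # Pairwise distances (M, N), then the minimum over train positions per test point.
    diffs = test_xy[:, None, :] - train_xy[None, :, :]
    return np.linalg.norm(diffs, axis=-1).min(axis=1)


# Example: a test traversal roughly one lane (~3.5 m) offset from the train traversal.
train = np.stack([np.linspace(0, 100, 200), np.zeros(200)], axis=1)
test = np.stack([np.linspace(0, 100, 200), np.full(200, 3.5)], axis=1)
print(lateral_offsets(train, test).mean())  # ~3.5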

Radar Chart of Baseline Comparison

Baseline Comparison Table

Overview Image

BibTeX


@misc{han2024extrapolatedurbanviewsynthesis,
      title={Extrapolated Urban View Synthesis Benchmark}, 
      author={Xiangyu Han and Zhen Jia and Boyi Li and Yan Wang and Boris Ivanovic and Yurong You and Lingjie Liu and Yue Wang and Marco Pavone and Chen Feng and Yiming Li},
      year={2024},
      eprint={2412.05256},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2412.05256}, 
}

Acknowledgements

This work was supported in part by NSF grants 2238968 and 2121391, and by NYU IT High Performance Computing resources, services, and staff expertise. Yiming Li is supported by the NVIDIA Graduate Fellowship (2024-2025).