CityWalker: Learning Embodied Urban Navigation from Web-Scale Videos
Memorize What Matters: Emergent Scene Decomposition from Multitraverse
Multiview Scene Graph
FusionSense: Bridging Common Sense, Vision, and Touch for Robust Sparse-View Reconstruction
VLM See, Robot Do: Human Demo Video to Robot Action Plan via Vision Language Model
EgoPAT3Dv2: Predicting 3D Action Target from 2D Egocentric Vision for Human-Robot Interaction
Multiagent Multitraversal Multimodal Self-Driving: Open MARS Dataset
ActFormer: Scalable Collaborative Perception via Active Queries
NYC-Indoor-VPR: A Long-Term Indoor Visual Place Recognition Dataset with Semi-Automatic Annotation
Tell Me Where You Are: Multimodal LLMs Meet Place Recognition