CityWalker: Learning Embodied Urban Navigation from Web-Scale Videos
Multiview Scene Graph
VLM See, Robot Do: Human Demo Video to Robot Action Plan via Vision Language Model
EgoPAT3Dv2: Predicting 3D Action Target from 2D Egocentric Vision for Human-Robot Interaction
Multiagent Multitraversal Multimodal Self-Driving: Open MARS Dataset
ActFormer: Scalable Collaborative Perception via Active Queries
LUWA Dataset: Learning Lithic Use-Wear Analysis on Microscopic Images
LiDAR-based 4D Occupancy Completion and Forecasting
Among Us: Adversarially Robust Collaborative Perception by Consensus
SSCBench: A Large-Scale 3D Semantic Scene Completion Benchmark for Autonomous Driving