EgoPush

Learning End-to-End Egocentric Multi-Object Rearrangement for Mobile Robots

New York University
* Equal Contribution, † Corresponding Author

TL;DR

EgoPush is a learning framework that enables a mobile robot to perform long-horizon, multi-object non-prehensile rearrangement using only egocentric vision, without relying on global maps, precise localization, or external tracking.


How Do Mobile Robots Rearrange Objects Using Only Egocentric Vision?

Task Formulation

Overview Placeholder

EgoPush Deployment in the Real World


Try it yourself!

Click Start to load the Unity WebGL demo. Clicking outside the Unity area will stop it. If performance is poor, open the demo in a new tab.


Abstract

Humans can rearrange objects in cluttered environments using egocentric perception, navigating occlusions without global coordinates. Inspired by this capability, we study long-horizon multi-object non-prehensile rearrangement for mobile robots using a single egocentric camera. We introduce EgoPush, a policy learning framework that enables egocentric, perception-driven rearrangement without relying on explicit global state estimation, which often fails in dynamic scenes. EgoPush designs an object-centric latent space that encodes relative spatial relations among objects rather than absolute poses. This design lets a privileged reinforcement-learning (RL) teacher jointly learn latent states and mobile actions from sparse keypoints; the teacher is then distilled into a purely visual student policy. To reduce the supervision gap between the omniscient teacher and the partially observed student, we restrict the teacher's observations to visually accessible cues. This induces active perception behaviors that are recoverable from the student's viewpoint. To address long-horizon credit assignment, we decompose rearrangement into stage-level subproblems using temporally decayed, stage-local completion rewards. Extensive simulation experiments demonstrate that EgoPush significantly outperforms end-to-end RL baselines in success rate, with ablation studies validating each design choice. We further demonstrate zero-shot sim-to-real transfer on a mobile platform in the real world.
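The temporally decayed, stage-local completion reward mentioned in the abstract can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the bonus magnitude, the decay rate, and the convention that the step counter resets at each stage boundary are all assumptions made for the sketch.

```python
# Sketch of a temporally decayed, stage-local completion reward.
# Assumed (not from the paper): bonus=1.0, decay=0.99, and a step
# counter that resets whenever a new stage begins.

def stage_reward(stage_completed, steps_in_stage, bonus=1.0, decay=0.99):
    """Reward granted only when the current stage's subgoal is reached.

    The bonus decays with the number of steps spent inside the stage,
    so completing a stage quickly earns more than completing it slowly.
    Because the counter is reset at each stage boundary, credit stays
    local to the current subproblem rather than spanning the horizon.
    """
    if not stage_completed:
        return 0.0
    return bonus * decay ** steps_in_stage

# Finishing a stage in 10 steps earns more than finishing it in 100:
print(stage_reward(True, 10) > stage_reward(True, 100))  # True
```

Keeping the reward zero until a stage completes, and decaying it only within that stage, is one simple way to make credit assignment stage-local while still encouraging fast completion.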


Failure Cases of Baselines

EgoPush Method

Overview Image

How It Works

1.1 Object-Centric Representation:
(1) Task Roles: EgoPush partitions scene objects into three task roles: active object (currently pushed), anchor object (defines the target relation), and obstacles.
(2) Role-wise Encoding: A shared-weight estimator encodes each role into a latent embedding, and these embeddings are concatenated into an object-centric latent state.

1.2 Relative Spatial Reasoning:
(3) Relation-First Representation: Weight sharing places all roles in a common feature space, enabling the policy to reason over relative spatial relations instead of isolated object states.
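The role-wise encoding and weight sharing described above can be sketched as follows. Everything concrete here is an illustrative assumption, not the paper's configuration: the MLP architecture, the dimensions, and the input format (two 2-D keypoints per object, flattened and expressed in the robot's egocentric frame).

```python
import numpy as np

class RoleEncoder:
    """A small MLP whose weights are shared across all task roles."""

    def __init__(self, in_dim, hidden, out_dim, seed=0):
        rng = np.random.default_rng(seed)
        self.W1 = 0.1 * rng.standard_normal((in_dim, hidden))
        self.b1 = np.zeros(hidden)
        self.W2 = 0.1 * rng.standard_normal((hidden, out_dim))
        self.b2 = np.zeros(out_dim)

    def __call__(self, x):
        h = np.tanh(x @ self.W1 + self.b1)
        return h @ self.W2 + self.b2

# Assumed input format: two sparse 2-D keypoints per object,
# flattened to a 4-vector in the robot's egocentric frame.
encoder = RoleEncoder(in_dim=4, hidden=32, out_dim=16)

active = np.array([0.5, 0.2, 0.6, 0.3])      # object currently pushed
anchor = np.array([1.5, 0.0, 1.6, 0.1])      # defines the target relation
obstacle = np.array([1.0, -0.4, 1.1, -0.3])  # nearby clutter

# One encoder, applied per role: because the weights are shared, all
# three embeddings live in a common feature space, so concatenating
# them yields an object-centric latent state over which the policy can
# reason about relative spatial relations.
latent_state = np.concatenate(
    [encoder(active), encoder(anchor), encoder(obstacle)]
)
print(latent_state.shape)  # (48,)
```

The design point the sketch illustrates: a single set of encoder weights maps every role into the same feature space, which is what makes relative comparisons between roles meaningful.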

Object-Centric Representation

Results in Simulation

Depth Processing and Sim2Real

Comparison of Depth Processing Methods

Results in Real World

BibTeX


@article{An2026EgoPush,
  title = {EgoPush: Learning End-to-End Egocentric Multi-Object Rearrangement for Mobile Robots},
  author = {An, Boyuan and Wang, Zhexiong and Wang, Yipeng and Li, Jiaqi and Li, Sihang and Zhang, Jing and Feng, Chen},
  journal = {arXiv preprint arXiv:2602.18071},
  year = {2026}
}