From Intention to Execution

Probing the Generalization Boundaries of Vision-Language-Action Models

New York University
* Equal Contribution, Project Lead

TLDR

1. INT-ACT: a probing suite to evaluate the generalization capability of robotic VLAs.
2. Benchmarking SOTA VLAs to understand their generalization boundaries.


INT-ACT Categories

INT-ACT is a probing suite for evaluating the generalization capability of robotic VLAs. It consists of three categories of tasks, each probing a different generalization boundary.

Object Diversity: Ability to handle out-of-distribution objects.
Language Complexity: Ability to understand complex language instructions.
Vision-Language Thinking: Ability to perform commonsense and visual-language reasoning.

[Overview figure]

Truly generalist policies require perceptual ability beyond the object distributions encountered during training or fine-tuning.

In SimplerEnv, where the fine-tuning dataset is assumed to be BridgeV2, all manipulation tasks follow the template Put {Source} on {Target}. We therefore introduce four categories of out-of-distribution objects that resemble the original objects in affordances and grasping difficulty (see the sketch after the list below).

OOD Source: Source object not present in BridgeV2, but target object is.
OOD Target: Target object not present in BridgeV2, but source object is.
OOD Source + Target: Both source and target objects are not present in BridgeV2.
OOD Relation: Relation between objects is different from the training data. For example, if the training data has Put {Source} on {Target}, then the OOD relation can be Put {Target} on {Source}.
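To make the four categories concrete, here is a minimal Python sketch of how probe instructions could be instantiated from the Put {Source} on {Target} template. The OOD object names below are illustrative placeholders, not the actual INT-ACT object lists; only the template and the four categories come from the description above.

```python
from itertools import product

# Object pools. "carrot"/"spoon"/"plate"/"towel" follow the SimplerEnv Bridge tasks;
# the OOD pools are hypothetical stand-ins for objects unseen in BridgeV2.
IN_DIST_SOURCES = ["carrot", "spoon"]
IN_DIST_TARGETS = ["plate", "towel"]
OOD_SOURCES = ["toy dinosaur", "rubber duck"]
OOD_TARGETS = ["cutting board", "ceramic bowl"]

def make_task(source: str, target: str) -> str:
    """All manipulation tasks follow the Put {Source} on {Target} template."""
    return f"put {source} on {target}"

def build_probe_suite() -> dict[str, list[str]]:
    """Instantiate the four OOD categories from the object pools."""
    return {
        # Source object unseen, target seen.
        "ood_source": [make_task(s, t) for s, t in product(OOD_SOURCES, IN_DIST_TARGETS)],
        # Target object unseen, source seen.
        "ood_target": [make_task(s, t) for s, t in product(IN_DIST_SOURCES, OOD_TARGETS)],
        # Both objects unseen.
        "ood_source_target": [make_task(s, t) for s, t in product(OOD_SOURCES, OOD_TARGETS)],
        # Relation reversed relative to training: the training target becomes the source.
        "ood_relation": [make_task(t, s) for s, t in product(IN_DIST_SOURCES, IN_DIST_TARGETS)],
    }

if __name__ == "__main__":
    for category, tasks in build_probe_suite().items():
        print(category, tasks[:2])
```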


Benchmarking Results

All VLAs exhibit a persistent intention-action gap: thanks to their pretrained VLM backbones, they often correctly interpret out-of-distribution objects or instructions, yet their execution success rate still drops sharply.
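One way to quantify such a gap is to score, per episode, whether the policy's interpretation of the instruction was correct and whether the rollout actually succeeded, then compare the two rates. The sketch below is a hypothetical illustration of that bookkeeping; the field names and judging procedure are assumptions, not the INT-ACT evaluation code.

```python
from dataclasses import dataclass

@dataclass
class EpisodeResult:
    """Outcome of one evaluation episode (illustrative fields, not the INT-ACT API)."""
    intention_correct: bool  # did the policy identify the right objects/relation?
    task_success: bool       # did the rollout actually complete the manipulation?

def intention_action_gap(episodes: list[EpisodeResult]) -> tuple[float, float, float]:
    """Return (intention accuracy, execution success rate, gap between them)."""
    n = len(episodes)
    intention = sum(e.intention_correct for e in episodes) / n
    success = sum(e.task_success for e in episodes) / n
    return intention, success, intention - success

# Example: a policy that "understands" 9/10 instructions but completes only 3/10 tasks.
episodes = [EpisodeResult(i < 9, i < 3) for i in range(10)]
print(intention_action_gap(episodes))  # (0.9, 0.3, 0.6)
```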


BibTeX



Coming Soon

Acknowledgements

Chen Feng is the corresponding author.