From Intention to Execution

Probing the Generalization Boundaries of Vision-Language-Action Models

New York University
* Equal Contribution, Project Lead

TLDR

1. INT-ACT: a probing suite to evaluate the generalization capability of robotic VLAs.
2. Benchmarking SOTA VLAs to understand their generalization boundaries.


INT-ACT Categories

INT-ACT is a probing suite for evaluating the generalization capability of robotic VLAs. It consists of three categories of tasks, each probing a different generalization boundary.

Object Diversity: Ability to handle out-of-distribution objects.
Language Complexity: Ability to understand complex language instructions.
Vision-Language Thinking: Ability to perform commonsense and visual-language reasoning.

[Overview figure]

Truly generalist policies require perceptual ability beyond the object distributions encountered during training or fine-tuning.

In SimplerEnv, where the fine-tuning dataset is assumed to be BridgeV2, all manipulation tasks follow the template Put {Source} on {Target}. We therefore introduce four categories of out-of-distribution objects that resemble the original objects in affordances and grasping difficulty (see the sketch after the list below).

OOD Source: Source object not present in BridgeV2, but target object is.
OOD Target: Target object not present in BridgeV2, but source object is.
OOD Source + Target: Both source and target objects are not present in BridgeV2.
OOD Relation: Relation between objects is different from the training data. For example, if the training data has Put {Source} on {Target}, then the OOD relation can be Put {Target} on {Source}.
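To make the four categories concrete, here is a minimal Python sketch of how probe instructions could be instantiated from the Put {Source} on {Target} template. The OOD object names below are illustrative placeholders, not the actual INT-ACT object lists; only the template and the four categories come from the description above.

```python
from itertools import product

# Object pools. "carrot"/"spoon"/"plate"/"towel" follow the SimplerEnv Bridge tasks;
# the OOD pools are hypothetical stand-ins for objects unseen in BridgeV2.
IN_DIST_SOURCES = ["carrot", "spoon"]
IN_DIST_TARGETS = ["plate", "towel"]
OOD_SOURCES = ["toy dinosaur", "rubber duck"]
OOD_TARGETS = ["cutting board", "ceramic bowl"]

def make_task(source: str, target: str) -> str:
    """All manipulation tasks follow the Put {Source} on {Target} template."""
    return f"put {source} on {target}"

def build_probe_suite() -> dict[str, list[str]]:
    """Instantiate the four OOD categories from the object pools."""
    return {
        # Source object unseen, target seen.
        "ood_source": [make_task(s, t) for s, t in product(OOD_SOURCES, IN_DIST_TARGETS)],
        # Target object unseen, source seen.
        "ood_target": [make_task(s, t) for s, t in product(IN_DIST_SOURCES, OOD_TARGETS)],
        # Both objects unseen.
        "ood_source_target": [make_task(s, t) for s, t in product(OOD_SOURCES, OOD_TARGETS)],
        # Relation reversed relative to training: the training target becomes the source.
        "ood_relation": [make_task(t, s) for s, t in product(IN_DIST_SOURCES, IN_DIST_TARGETS)],
    }

if __name__ == "__main__":
    for category, tasks in build_probe_suite().items():
        print(category, tasks[:2])
```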


Benchmarking Results

All VLAs exhibit a persistent intention-action gap: thanks to their pretrained VLM backbones, they often correctly interpret out-of-distribution objects or instructions, yet their execution success rate still drops sharply.
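One way to quantify such a gap is to score, per episode, whether the policy's interpretation of the instruction was correct and whether the rollout actually succeeded, then compare the two rates. The sketch below is a hypothetical illustration of that bookkeeping; the field names and judging procedure are assumptions, not the INT-ACT evaluation code.

```python
from dataclasses import dataclass

@dataclass
class EpisodeResult:
    """Outcome of one evaluation episode (illustrative fields, not the INT-ACT API)."""
    intention_correct: bool  # did the policy identify the right objects/relation?
    task_success: bool       # did the rollout actually complete the manipulation?

def intention_action_gap(episodes: list[EpisodeResult]) -> tuple[float, float, float]:
    """Return (intention accuracy, execution success rate, gap between them)."""
    n = len(episodes)
    intention = sum(e.intention_correct for e in episodes) / n
    success = sum(e.task_success for e in episodes) / n
    return intention, success, intention - success

# Example: a policy that "understands" 9/10 instructions but completes only 3/10 tasks.
episodes = [EpisodeResult(i < 9, i < 3) for i in range(10)]
print(intention_action_gap(episodes))  # (0.9, 0.3, 0.6)
```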


BibTeX



Coming Soon

Acknowledgements

Chen Feng is the corresponding author.