INT-ACT is a probing suite for evaluating the generalization capabilities of robotic vision-language-action models (VLAs). It consists of three task categories, each probing a different generalization boundary:
- **Object Diversity**: Ability to handle out-of-distribution objects.
- **Language Complexity**: Ability to understand complex language instructions.
- **Vision-Language Thinking**: Ability to perform commonsense reasoning and visual-language thinking.
Truly generalist policies require perceptual ability beyond the object distributions encountered during training or fine-tuning.
In SimplerEnv, which assumes the fine-tuning dataset is BridgeV2, all manipulation tasks take the form `Put {Source} on {Target}`. We therefore introduce four categories of out-of-distribution (OOD) objects that resemble the original objects in affordances and grasping difficulty, illustrated by the sketch after this list:
- **OOD Source**: The source object is not present in BridgeV2, but the target object is.
- **OOD Target**: The target object is not present in BridgeV2, but the source object is.
- **OOD Source + Target**: Neither the source nor the target object is present in BridgeV2.
- **OOD Relation**: The relation between the objects differs from the training data. For example, if the training data contains `Put {Source} on {Target}`, an OOD relation can be `Put {Target} on {Source}`.
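To make the four categories concrete, here is a minimal sketch of how the OOD instruction variants could be instantiated from the BridgeV2-style template. The object lists and the `make_task` helper are hypothetical illustrations, not the actual INT-ACT task definitions.

```python
# Hypothetical sketch: instantiating the four OOD task categories from the
# BridgeV2-style template "Put {source} on {target}". Object names and the
# helper below are illustrative, not the actual INT-ACT task definitions.

# Objects seen during BridgeV2 fine-tuning (in-distribution).
IN_DIST_SOURCES = ["carrot", "spoon"]
IN_DIST_TARGETS = ["plate", "towel"]

# Novel objects absent from BridgeV2 (out-of-distribution), chosen to
# resemble the originals in affordances and grasping difficulty.
OOD_SOURCES = ["eggplant", "fork"]
OOD_TARGETS = ["cutting board", "sponge"]


def make_task(source: str, target: str, reversed_relation: bool = False) -> str:
    """Fill the manipulation template; optionally flip the spatial relation."""
    if reversed_relation:
        source, target = target, source  # e.g. "Put plate on carrot"
    return f"Put {source} on {target}"


tasks = {
    "ood_source":        make_task(OOD_SOURCES[0], IN_DIST_TARGETS[0]),
    "ood_target":        make_task(IN_DIST_SOURCES[0], OOD_TARGETS[0]),
    "ood_source_target": make_task(OOD_SOURCES[0], OOD_TARGETS[0]),
    "ood_relation":      make_task(IN_DIST_SOURCES[0], IN_DIST_TARGETS[0],
                                   reversed_relation=True),
}

for category, instruction in tasks.items():
    print(f"{category}: {instruction}")
```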
All VLAs exhibit a persistent intention-action gap: thanks to their pretrained VLM backbone, they correctly interpret out-of-distribution objects and instructions, yet their execution accuracy still falls sharply.
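One way to quantify this gap is to score intention (did the policy ground the instruction, e.g. reach toward the correct object?) separately from execution (did the task actually succeed?). The sketch below assumes hypothetical per-episode records with `grounded` and `succeeded` flags; it illustrates the metric, not the INT-ACT evaluation code.

```python
from dataclasses import dataclass


@dataclass
class Episode:
    # Hypothetical per-episode evaluation record (not the INT-ACT schema):
    grounded: bool   # policy moved toward / attended to the correct object
    succeeded: bool  # full task completed


def intention_action_gap(episodes: list[Episode]) -> float:
    """Difference between the grounding (intention) rate and the
    execution success rate.

    A large positive gap means the policy usually understands the
    instruction but fails to carry it out.
    """
    n = len(episodes)
    intention_rate = sum(e.grounded for e in episodes) / n
    execution_rate = sum(e.succeeded for e in episodes) / n
    return intention_rate - execution_rate


# Toy example: the policy grounds 9/10 instructions but completes only 3/10.
episodes = [Episode(grounded=i < 9, succeeded=i < 3) for i in range(10)]
print(f"intention-action gap: {intention_action_gap(episodes):.2f}")  # 0.60
```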
Coming Soon