Researchers propose a task-agnostic pretraining approach for Vision-Language-Action (VLA) models to address the scarcity of expert demonstrations. The method rests on a "Decomposition Hypothesis," which separates the acquisition of physical competence (movement) from semantic alignment (task execution). Under this hypothesis, only semantic alignment requires language supervision, which could reduce the dependence on costly observation-instruction-action triplets.
