“Helix addresses several issues previous robotic approaches faced, including balancing speed and generalization, scalability to manage high-dimensional actions, and architectural simplicity using standard models,” according to Figure.
Additionally, separating S1 and S2 allows each system to be improved independently, without requiring a shared observation space or action representation.
Helix was trained on a dataset of around 500 hours of teleoperated behaviors, with an auto-labeling VLM generating natural-language instructions for the demonstrations.
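The labeling step can be pictured as a simple hindsight-captioning loop. The sketch below is only illustrative: the `Segment` structure, the prompt wording, and the `query_vlm` callable are hypothetical stand-ins, not Figure's actual pipeline.

```python
# Minimal sketch of hindsight auto-labeling: a VLM is shown frames from a
# teleoperated segment and asked what instruction would have produced that behavior.
# `query_vlm` is a hypothetical stand-in for whatever captioning model is used.
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Segment:
    frames: List[bytes]          # sampled camera frames from one teleop clip
    actions: List[List[float]]   # recorded joint targets for the same clip


PROMPT = "What instruction would you give the robot to perform the action shown in these frames?"


def auto_label(segments: List[Segment],
               query_vlm: Callable[[List[bytes], str], str]) -> List[dict]:
    """Pair each teleoperated segment with a VLM-generated natural-language instruction."""
    dataset = []
    for seg in segments:
        instruction = query_vlm(seg.frames, PROMPT)  # hindsight label
        dataset.append({"instruction": instruction,
                        "frames": seg.frames,
                        "actions": seg.actions})
    return dataset
```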
The architecture pairs a 7B-parameter VLM (S2) with an 80M-parameter control transformer (S1): the control transformer processes visual inputs and produces responsive control conditioned on the latent representations generated by the VLM.
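In outline, the split means a slow module refreshes a latent conditioning vector at a low rate, while a small policy consumes the most recent latent together with fresh visual features at a much higher rate. The sketch below illustrates that pattern only; the toy networks, dimensions, and rates are placeholder assumptions rather than Helix's implementation.

```python
# Sketch of the S1/S2 split: a slow VLM-style module produces a latent plan vector,
# and a small control policy reuses that latent across many fast control steps.
# All sizes and the loop rates here are illustrative placeholders.
import torch
import torch.nn as nn


class SlowVLM(nn.Module):
    """Stand-in for the large System 2: fused vision/text features -> latent vector."""
    def __init__(self, latent_dim: int = 512):
        super().__init__()
        self.encoder = nn.Linear(2048, latent_dim)  # placeholder for the real VLM backbone

    def forward(self, vision_text_features: torch.Tensor) -> torch.Tensor:
        return self.encoder(vision_text_features)


class FastPolicy(nn.Module):
    """Stand-in for the small System 1: visual features + latent -> action."""
    def __init__(self, latent_dim: int = 512, obs_dim: int = 256, action_dim: int = 35):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim + obs_dim, 256), nn.ReLU(), nn.Linear(256, action_dim)
        )

    def forward(self, obs: torch.Tensor, latent: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([obs, latent], dim=-1))


s2, s1 = SlowVLM(), FastPolicy()
latent = s2(torch.randn(1, 2048))       # refreshed at a low rate (a few Hz)
for step in range(10):                  # inner control loop runs far more often
    action = s1(torch.randn(1, 256), latent)
```

Decoupling the two loops is what lets the fast policy stay responsive while still following the slower, semantically richer plan encoded in the latent.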