Part 4/12:
Dubbed Heterogeneous Pre-trained Transformers (HPT), this system unifies diverse data streams—visual inputs, sensor signals, robotic arm movements, and human actions—into a common "language." Much like how language models convert words and sentences into tokens for processing, HPT transforms robotic and visual data into tokens that the underlying Transformer architecture can interpret. This unified representation enables the model to recognize patterns spanning different tasks and environments, facilitating rapid adaptation.
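The tokenization analogy can be made concrete with a toy sketch. The code below is illustrative only, not HPT's actual implementation: it assumes each modality gets its own fixed linear projection ("stem") that maps raw features into a small set of tokens in a shared embedding space, so the outputs of different modalities can be concatenated into one sequence for a common Transformer trunk. The dimensions and modality names are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
TOKEN_DIM = 64  # shared embedding width (illustrative choice)

def make_stem(input_dim, n_tokens, token_dim=TOKEN_DIM):
    """Build a toy 'stem': a fixed linear map from one modality's
    flat feature vector to n_tokens tokens in the shared space."""
    W = rng.standard_normal((input_dim, n_tokens * token_dim)) / np.sqrt(input_dim)
    def stem(x):
        return (x @ W).reshape(n_tokens, token_dim)
    return stem

# Hypothetical modalities with different raw sizes.
camera_stem = make_stem(input_dim=3 * 32 * 32, n_tokens=16)  # flattened RGB image
proprio_stem = make_stem(input_dim=14, n_tokens=4)           # joint angles, etc.

image = rng.standard_normal(3 * 32 * 32)
joints = rng.standard_normal(14)

# Both modalities end up as (n_tokens, TOKEN_DIM) arrays that a shared
# Transformer trunk could consume as one concatenated token sequence.
tokens = np.concatenate([camera_stem(image), proprio_stem(joints)], axis=0)
print(tokens.shape)  # (20, 64)
```

The key design point the sketch captures is that once every input source is expressed as tokens of the same width, the downstream Transformer never needs to know which modality a token came from.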
How Does HPT Work?
The architecture comprises three main components:
- Stems: These act as translators, converting heterogeneous input signals from robots—such as images from cameras or sensor readings—into a shared representational language.