Part 2/10:
Traditional Transformers rely on self-attention, which computes a relationship between every pair of tokens (or image patches). The problem? This process scales quadratically with the number of tokens, making it computationally prohibitive for high-resolution images such as 1080p and above. At that resolution, over 60% of a vision transformer's total computation is spent just creating and applying attention matrices.
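A rough back-of-the-envelope calculation makes the blow-up concrete. The patch size and resolutions below are illustrative assumptions (a ViT-style 16x16 patch), not figures from the paper:

```python
# Illustrative only: token counts for a ViT-style model with 16x16 patches.
patch = 16
tokens_224 = (224 // patch) * (224 // patch)      # 14 * 14 = 196 tokens
tokens_1080p = (1080 // patch) * (1920 // patch)  # 67 * 120 = 8040 tokens

# Standard self-attention builds a tokens x tokens matrix per head.
print(tokens_224 ** 2)    # ~38 thousand entries
print(tokens_1080p ** 2)  # ~65 million entries, roughly a 1700x increase
```

The number of tokens grows only linearly with image area, but the attention matrix grows with its square, which is why high-resolution inputs become dominated by attention cost.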
Hydra Attention introduces a paradoxical but effective approach: instead of limiting the number of attention heads, it maximizes them, using as many heads as there are feature dimensions. With one feature per head, the attention computation becomes linear in both the number of tokens and the number of features, eliminating the quadratic bottleneck.
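Below is a minimal sketch of that idea in PyTorch, assuming the decomposable cosine-similarity kernel the paper pairs with its many-heads trick; the function name and tensor shapes are illustrative, not the authors' reference implementation:

```python
import torch
import torch.nn.functional as F

def hydra_attention(q, k, v):
    """Sketch of Hydra-style attention with heads == features.

    q, k, v: tensors of shape (batch, tokens, features).
    Runs in O(tokens * features) instead of the O(tokens^2 * features)
    cost of standard softmax self-attention.
    """
    # L2-normalize queries and keys along the feature dimension
    # (the cosine-similarity kernel that makes the computation decomposable).
    q = F.normalize(q, dim=-1)
    k = F.normalize(k, dim=-1)

    # Collapse keys and values into one global feature vector per example:
    # sum over tokens of the elementwise product k * v -> (batch, 1, features).
    kv = (k * v).sum(dim=1, keepdim=True)

    # Each token gates the shared global vector with its own query.
    return q * kv

# Quick shape check with random inputs.
x = torch.randn(2, 8040, 768)
out = hydra_attention(x, x, x)
print(out.shape)  # torch.Size([2, 8040, 768])
```

Because the keys and values collapse into a single global feature vector, the model never materializes a token-by-token similarity matrix; it makes one elementwise pass over the tokens, which is where the linear scaling comes from.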