Part 4/10:
Accuracy Boosts: Using Hydra Attention in a standard vision model such as ViT-B/16 (a Vision Transformer baseline) yields higher accuracy on ImageNet. For example, replacing just two standard attention layers with Hydra layers already improves accuracy slightly, and swapping in more Hydra layers brings further gains in efficiency and performance (see the sketch below for what a Hydra layer computes).
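As background for the layer-swapping result above, here is a minimal NumPy sketch of the Hydra attention operation described in the paper: one head per feature channel, with L2-normalized queries and keys standing in for softmax. The tensor shapes and helper names are illustrative, not the authors' code.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-6):
    """Normalize along the feature axis (the cosine-similarity kernel)."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def hydra_attention(q, k, v):
    """Hydra attention: as many heads as feature dimensions.

    q, k, v have shape (tokens, dim). Instead of an O(tokens^2 * dim)
    attention matrix, keys and values are mixed into one global feature
    vector, which is then gated by each query: O(tokens * dim) cost.
    """
    q, k = l2_normalize(q), l2_normalize(k)
    kv = (k * v).sum(axis=0)   # (dim,) global summary: sum over tokens of k_t * v_t
    return q * kv              # broadcast back to every token

# Tiny smoke test with random data (197 tokens, 768 channels, as in ViT-B/16).
rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((197, 768)) for _ in range(3))
print(hydra_attention(q, k, v).shape)  # (197, 768)
```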
Reduced Computation: Total floating-point operations (FLOPs), a measure of computational effort, drop by about 4%. That modest overall figure reflects how small attention's share already is at 224x224 resolution: the attention-related FLOPs fall from roughly 4.1% of the model's total to a mere 0.02%, effectively eliminating the cost of attention itself.
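The percentages quoted above can be sanity-checked with a back-of-the-envelope Python calculation. The constants below are assumptions, not figures from this article: ViT-B/16 at 224x224 with N = 197 tokens (14x14 patches plus a class token), d = 768 channels, 12 layers, roughly 17.6e9 total operations, and only the two token-mixing products counted as "attention" cost.

```python
# Rough check of the attention shares: ~4.1% for standard attention, ~0.02% for Hydra.
N, d, layers = 197, 768, 12      # assumed ViT-B/16 shapes at 224x224
total_ops = 17.6e9               # assumed total operation count for the full model

standard_attn = layers * 2 * N * N * d   # QK^T and AV products: O(N^2 * d) each
hydra_attn    = layers * 2 * N * d       # k*v sum and q*(kv) gating: O(N * d) each

print(f"standard attention share: {standard_attn / total_ops:.2%}")  # ~4.06%
print(f"hydra attention share:    {hydra_attn / total_ops:.3%}")     # ~0.021%
```

Under these assumptions the shares come out to about 4.1% and 0.02%, matching the numbers reported above; the per-image savings grow quickly at higher resolutions, since standard attention scales with the square of the token count while Hydra attention scales linearly.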