“If your model gets larger, you can solve much harder problems,” Peng said. “But if, at the same time, you also scale up your problems, it again becomes harder for larger models.” This suggests that the transformer architecture has inherent limitations.