Self-Attention Heads and Scaled Dot-Product: Analysis of the mechanism that allows models to weight the importance of different parts of the input sequence

March 22, 2026
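As a starting point, the weighting mechanism named in the title can be sketched with scaled dot-product attention: each query is compared against all keys, the scores are scaled by the square root of the key dimension, and a softmax turns them into weights over the value vectors. A minimal NumPy sketch, with illustrative function names and toy shapes (not taken from any particular implementation):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Return the attention output and the attention weight matrix.

    Q, K, V: arrays of shape (seq_len, d_k) / (seq_len, d_v).
    """
    d_k = K.shape[-1]
    # Dot-product similarity between queries and keys,
    # scaled by sqrt(d_k) to keep softmax inputs well-conditioned.
    scores = Q @ K.T / np.sqrt(d_k)
    # Row-wise softmax (subtract the max for numerical stability):
    # each row becomes a probability distribution over input positions.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    # Output is a weighted average of the value vectors.
    return weights @ V, weights

# Toy example: 3 tokens, model width 4.
rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 4))
K = rng.normal(size=(3, 4))
V = rng.normal(size=(3, 4))
out, w = scaled_dot_product_attention(Q, K, V)
print(out.shape, w.shape)
```

Each row of `w` sums to 1, so every output token is a convex combination of the value vectors, with larger weights on the positions the model deems most relevant.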