Understanding Multi-Head Attention in Transformers
Self-attention already helps a transformer understand relationships between words using Query, Key, and Value. But there is a problem: a single attention mechanism tends to focus on one kind of relationship at a time. Language doesn't work like that. A sentence can carry grammatical structure, meaning, and long-range links all at once. That is why transformers use multi-head attention.

What happens in multi-head attention

Instead of computing attention once, the model computes it multiple times in parallel. Each head has its own learned Query, Key, and Value projections, so different heads can attend to different kinds of relationships; the heads' outputs are then concatenated and projected back to the model dimension.
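To make this concrete, here is a minimal NumPy sketch of the idea. It is not from the original article: the function names, weight matrices (W_q, W_k, W_v, W_o), and toy dimensions are illustrative assumptions, and real implementations add masking, batching, and learned parameters.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    # Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V, per head.
    d_k = Q.shape[-1]
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)
    return softmax(scores) @ V

def multi_head_attention(X, W_q, W_k, W_v, W_o, num_heads):
    # Illustrative sketch: project once, split into heads, attend in
    # parallel, then concatenate and apply the output projection.
    seq_len, d_model = X.shape
    d_head = d_model // num_heads
    Q = (X @ W_q).reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)
    K = (X @ W_k).reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)
    V = (X @ W_v).reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)
    # Each head attends independently over the full sequence.
    heads = scaled_dot_product_attention(Q, K, V)  # (num_heads, seq_len, d_head)
    # Merge the heads back into one representation per token.
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ W_o

# Toy usage: 4 tokens, model width 8, 2 heads, random weights.
rng = np.random.default_rng(0)
seq_len, d_model, num_heads = 4, 8, 2
X = rng.normal(size=(seq_len, d_model))
W_q, W_k, W_v, W_o = (rng.normal(size=(d_model, d_model)) for _ in range(4))
out = multi_head_attention(X, W_q, W_k, W_v, W_o, num_heads)
print(out.shape)  # (4, 8) — same shape as the input
```

The key design point the sketch shows: the heads do not require extra model width. The d_model dimension is split into num_heads slices of size d_head, so each head works in a smaller subspace and the total cost stays comparable to single-head attention.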