MultiHeadAttention
Defined in fynance.models.attention
- class MultiHeadAttention(d_model, num_heads, dropout=0.0)
Bases: Module
Multi-Head Self-Attention.
Building block of the Transformer architecture (Vaswani et al., 2017). Each attention head learns to attend to a different subspace of the input, which is useful when several types of dependency coexist in a sequence, e.g. short-term and long-term momentum. The outputs of the heads are concatenated and projected back through w_o; a residual connection followed by layer normalization stabilizes training.
For finance-specific use, this layer is typically stacked with a feed-forward sublayer to form a Transformer encoder block applied to a sequence of returns or order-book states.
Splits the input into num_heads heads, applies ScaledDotProductAttention in parallel, then re-projects. A residual connection and layer norm are applied.
- Parameters:
- d_model : int
Model dimension (must be divisible by num_heads).
- num_heads : int
Number of attention heads.
- dropout : float, optional
Dropout on attention weights and output projection, default 0.
Examples
>>> import torch
>>> from fynance.models.attention import MultiHeadAttention
>>> mha = MultiHeadAttention(64, 4)
>>> x = torch.randn(2, 10, 64)
>>> out, attn = mha(x)
>>> out.shape
torch.Size([2, 10, 64])
>>> attn.shape
torch.Size([2, 4, 10, 10])
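The split, attend, concatenate and re-project steps described above can be sketched in plain PyTorch. This is a minimal illustration, not the fynance source: the class name SketchMHA, the projection names w_q, w_k and w_v, the post-norm placement of the layer norm, and the mask convention (True means "may attend") are all assumptions made for the example. Only w_o, the constructor arguments and the output shapes come from the documentation above.

```python
import math

import torch
import torch.nn as nn


class SketchMHA(nn.Module):
    """Illustrative multi-head self-attention (hypothetical, not the fynance code)."""

    def __init__(self, d_model, num_heads, dropout=0.0):
        super().__init__()
        if d_model % num_heads != 0:
            raise ValueError("d_model must be divisible by num_heads")
        self.num_heads = num_heads
        self.d_head = d_model // num_heads
        # Assumed names for the query/key/value projections.
        self.w_q = nn.Linear(d_model, d_model)
        self.w_k = nn.Linear(d_model, d_model)
        self.w_v = nn.Linear(d_model, d_model)
        self.w_o = nn.Linear(d_model, d_model)  # output projection
        self.drop = nn.Dropout(dropout)
        self.norm = nn.LayerNorm(d_model)

    def _split(self, t):
        # (B, T, d_model) -> (B, num_heads, T, d_head)
        B, T, _ = t.shape
        return t.view(B, T, self.num_heads, self.d_head).transpose(1, 2)

    def forward(self, x, mask=None):
        B, T, D = x.shape
        q = self._split(self.w_q(x))
        k = self._split(self.w_k(x))
        v = self._split(self.w_v(x))
        # Scaled dot-product scores, shape (B, num_heads, T, T).
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.d_head)
        if mask is not None:
            if mask.dim() == 3:
                mask = mask.unsqueeze(1)  # (B, T, T) -> (B, 1, T, T)
            # Assumption: True / nonzero entries mark positions that may be attended.
            scores = scores.masked_fill(~mask.bool(), float("-inf"))
        attn = self.drop(torch.softmax(scores, dim=-1))
        # Concatenate heads back to (B, T, d_model).
        ctx = (attn @ v).transpose(1, 2).reshape(B, T, D)
        # Residual connection followed by layer norm (post-norm, assumed).
        out = self.norm(x + self.drop(self.w_o(ctx)))
        return out, attn
```

With d_model=64 and num_heads=4, each head works on a 16-dimensional slice, which is why d_model must be divisible by num_heads.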
- forward(x, mask=None)
Forward pass.
- Parameters:
- x : torch.Tensor
Input of shape (B, T, d_model).
- mask : torch.Tensor, optional
Attention mask of shape (B, 1, T, T) or (B, T, T).
- Returns:
- torch.Tensor
Output of shape (B, T, d_model).
- torch.Tensor
Attention weights of shape (B, num_heads, T, T), one map per head.
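The two accepted mask shapes can be illustrated with a causal (lower-triangular) mask, which is a common choice for time series because it prevents a time step from attending to the future. The sketch below uses only plain PyTorch and does not call fynance. The documentation above does not state whether True marks allowed or blocked positions, so the convention used here (True means "may attend") is an assumption; check the library source before passing such a mask to forward.

```python
import torch

B, T, H = 2, 5, 4  # batch size, sequence length, number of heads

# Lower-triangular pattern: position t may see positions <= t.
causal = torch.tril(torch.ones(T, T, dtype=torch.bool))

mask_3d = causal.unsqueeze(0).expand(B, T, T)  # shape (B, T, T)
mask_4d = mask_3d.unsqueeze(1)                 # shape (B, 1, T, T)

# The singleton head axis lets the mask broadcast against per-head scores.
scores = torch.randn(B, H, T, T)
masked = scores.masked_fill(~mask_4d, float("-inf"))
weights = torch.softmax(masked, dim=-1)        # blocked positions get weight 0
```

The (B, 1, T, T) form broadcasts directly over the head axis; a (B, T, T) mask needs the extra axis inserted first, as done with unsqueeze(1) here.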