MultiHeadAttention

Defined in fynance.models.attention

class MultiHeadAttention(d_model, num_heads, dropout=0.0)

Bases: Module

Multi-Head Self-Attention.

Building block of the Transformer architecture (Vaswani et al., 2017). Each attention head works on its own learned projection of the input, so different heads can capture different kinds of dependency that coexist in a sequence, e.g. short-term and long-term momentum. The head outputs are concatenated and projected back to d_model dimensions through w_o; a residual connection and layer normalization stabilize training.

For finance-specific use, this layer is typically stacked with a feed-forward sublayer to form a Transformer encoder block applied to a return / order-book sequence.
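
The stacking described above might look like the following minimal sketch. EncoderBlock, its d_ff argument, and StandInAttention are hypothetical names introduced for illustration and are not part of fynance. The stand-in wraps torch.nn.MultiheadAttention only so the example runs without fynance installed; it ignores the mask. In practice the documented MultiHeadAttention, which already applies its own residual connection and layer norm, would be passed in instead.

```python
import torch
import torch.nn as nn


class EncoderBlock(nn.Module):
    """Hypothetical encoder block: attention sublayer, then feed-forward sublayer."""

    def __init__(self, attention, d_model, d_ff, dropout=0.0):
        super().__init__()
        # `attention` is any module whose forward(x, mask=None) returns
        # (out, attn), matching the call contract documented on this page.
        self.attention = attention
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(d_ff, d_model),
        )
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x, mask=None):
        h, attn = self.attention(x, mask=mask)
        # Residual connection and layer norm around the feed-forward sublayer.
        out = self.norm(h + self.ff(h))
        return out, attn


class StandInAttention(nn.Module):
    """Placeholder for fynance's MultiHeadAttention (assumption: same call contract)."""

    def __init__(self, d_model, num_heads):
        super().__init__()
        self.mha = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x, mask=None):
        # The mask is ignored here; this stand-in only illustrates shapes.
        h, attn = self.mha(x, x, x, average_attn_weights=False)
        return self.norm(x + h), attn


block = EncoderBlock(StandInAttention(64, 4), d_model=64, d_ff=256)
returns = torch.randn(2, 10, 64)  # (B, T, d_model), e.g. embedded return windows
out, attn = block(returns)
print(out.shape, attn.shape)
```

Injecting the attention module through the constructor keeps the block independent of any particular attention implementation, so swapping in the fynance layer should require no change to EncoderBlock itself.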

Splits the input into num_heads heads of size d_model / num_heads, applies ScaledDotProductAttention to each head in parallel, then concatenates the heads and re-projects them. A residual connection and layer norm are applied to the result.
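
The split / attend / re-project sequence can be sketched in plain PyTorch as below. This is an illustrative re-implementation under stated assumptions, not the fynance source: the class name SketchMultiHeadAttention, the layer names w_q, w_k, w_v, the post-norm ordering, and the mask convention (zero or False means "blocked") are all assumptions; only w_o is named in the text above.

```python
import math

import torch
import torch.nn as nn


class SketchMultiHeadAttention(nn.Module):
    """Illustrative sketch of multi-head self-attention; NOT the fynance source."""

    def __init__(self, d_model, num_heads, dropout=0.0):
        super().__init__()
        if d_model % num_heads != 0:
            raise ValueError("d_model must be divisible by num_heads")
        self.num_heads = num_heads
        self.d_head = d_model // num_heads
        # w_q, w_k, w_v are hypothetical names; w_o is mentioned in the docs.
        self.w_q = nn.Linear(d_model, d_model)
        self.w_k = nn.Linear(d_model, d_model)
        self.w_v = nn.Linear(d_model, d_model)
        self.w_o = nn.Linear(d_model, d_model)
        self.dropout = nn.Dropout(dropout)
        self.norm = nn.LayerNorm(d_model)

    def _split(self, t):
        # (B, T, d_model) -> (B, num_heads, T, d_head)
        B, T, _ = t.shape
        return t.view(B, T, self.num_heads, self.d_head).transpose(1, 2)

    def forward(self, x, mask=None):
        B, T, _ = x.shape
        q = self._split(self.w_q(x))
        k = self._split(self.w_k(x))
        v = self._split(self.w_v(x))
        # Scaled dot-product scores, shape (B, num_heads, T, T).
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.d_head)
        if mask is not None:
            if mask.dim() == 3:  # (B, T, T) -> (B, 1, T, T)
                mask = mask.unsqueeze(1)
            # Assumed convention: zero / False entries are blocked.
            scores = scores.masked_fill(mask == 0, float("-inf"))
        attn = self.dropout(torch.softmax(scores, dim=-1))
        # Concatenate heads back to (B, T, d_model), then re-project.
        ctx = (attn @ v).transpose(1, 2).reshape(B, T, -1)
        out = self.norm(x + self.dropout(self.w_o(ctx)))
        return out, attn


mha = SketchMultiHeadAttention(64, 4)
x = torch.randn(2, 10, 64)
out, attn = mha(x)
print(out.shape, attn.shape)
```

With dropout left at 0.0, each row of attn sums to one, which is a quick sanity check when comparing against the library layer.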

Parameters:

d_model (int): Model dimension (must be divisible by num_heads).
num_heads (int): Number of attention heads.
dropout (float, optional): Dropout probability applied to the attention weights and to the output projection. Default: 0.0.

Examples

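>>> import torch
>>> from fynance.models.attention import MultiHeadAttention
>>> mha = MultiHeadAttention(64, 4)  # d_model=64, num_heads=4
>>> x = torch.randn(2, 10, 64)  # (B, T, d_model)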
>>> out, attn = mha(x)
>>> out.shape
torch.Size([2, 10, 64])
>>> attn.shape
torch.Size([2, 4, 10, 10])

forward(x, mask=None)

Forward pass.

Parameters:

x (torch.Tensor): Input of shape (B, T, d_model), where B is the batch size and T the sequence length.
mask (torch.Tensor, optional): Attention mask of shape (B, 1, T, T) or (B, T, T).

Returns:

torch.Tensor: Output of shape (B, T, d_model).
torch.Tensor: Attention weights for each head, of shape (B, num_heads, T, T).
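
As an illustration of the two accepted mask shapes, the sketch below builds a causal (lower-triangular) mask and shows how the singleton head axis broadcasts against per-head scores. The convention used here (True means "may attend") is an assumption; this page does not state which convention fynance uses, so check the source before passing such a mask to forward. The scores tensor is random stand-in data.

```python
import torch

B, T, num_heads = 2, 10, 4

# Causal mask: position t may attend to positions <= t.
causal = torch.tril(torch.ones(T, T, dtype=torch.bool))  # (T, T)

mask_3d = causal.unsqueeze(0).expand(B, T, T)  # (B, T, T)
mask_4d = mask_3d.unsqueeze(1)                 # (B, 1, T, T)

# Stand-in attention scores; the size-1 head axis of the mask
# broadcasts across all num_heads heads.
scores = torch.randn(B, num_heads, T, T)
masked = scores.masked_fill(~mask_4d, float("-inf"))
weights = torch.softmax(masked, dim=-1)

print(mask_3d.shape, mask_4d.shape, weights.shape)
```

Blocked positions receive a weight of exactly zero after the softmax, which matters for return sequences where attending to future time steps would leak information.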