MultiHeadAttention

Defined in fynance.models.attention

class MultiHeadAttention(d_model, num_heads, dropout=0.0)

Bases: Module

Multi-Head Self-Attention.

Building block of the Transformer architecture (Vaswani et al., 2017). Each attention head can learn to attend to a different subspace of the input, which is useful when several types of dependency coexist in a sequence, e.g. short-term and long-term momentum. The head outputs are concatenated and projected back to d_model through w_o; a residual connection and layer normalization stabilize training.

In financial applications, this layer is typically stacked with a feed-forward sublayer to form a Transformer encoder block, which is then applied to a sequence of returns or order-book snapshots.
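The stacking described above could look roughly like the following sketch. Everything here is an assumption made for illustration: `EncoderBlockSketch`, `d_ff`, and `_StandInAttention` are hypothetical names, and the stand-in wraps `torch.nn.MultiheadAttention` only so the snippet runs without fynance. In practice the `attention` argument would be a `MultiHeadAttention` instance.

```python
import torch
import torch.nn as nn


class EncoderBlockSketch(nn.Module):
    """Attention sublayer followed by a position-wise feed-forward sublayer."""

    def __init__(self, attention, d_model, d_ff, dropout=0.0):
        super().__init__()
        # `attention` is any module mapping (B, T, d_model) to (out, attn_weights)
        # that already applies its own residual connection and layer norm.
        self.attention = attention
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(d_ff, d_model),
        )
        self.dropout = nn.Dropout(dropout)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x, mask=None):
        h, attn = self.attention(x, mask)
        # Residual connection and layer norm around the feed-forward sublayer.
        out = self.norm(h + self.dropout(self.ff(h)))
        return out, attn


class _StandInAttention(nn.Module):
    """Hypothetical placeholder with the same call signature as the layer above."""

    def __init__(self, d_model, num_heads):
        super().__init__()
        self.mha = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x, mask=None):
        # The mask is ignored in this placeholder; it only mimics the interface.
        out, attn = self.mha(x, x, x, need_weights=True, average_attn_weights=False)
        return self.norm(x + out), attn


block = EncoderBlockSketch(_StandInAttention(64, 4), d_model=64, d_ff=256)
returns = torch.randn(2, 10, 64)  # 2 sequences, 10 time steps, 64 features
out, attn = block(returns)
```

The block keeps the input shape, so several such blocks can be chained on the same (B, T, d_model) sequence.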

Splits the input into num_heads heads, applies ScaledDotProductAttention in parallel, then re-projects. A residual connection and layer norm are applied.
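As a rough illustration of these steps, here is a minimal PyTorch sketch. It is not fynance's implementation: the class name, the `w_q`, `w_k`, `w_v` projections, the order of residual and layer norm, and the mask convention (zero entries are blocked) are all assumptions; only `w_o` and the overall flow come from the description above.

```python
import math

import torch
import torch.nn as nn


class MultiHeadSelfAttentionSketch(nn.Module):
    """Illustrative stand-in, not the library's actual code."""

    def __init__(self, d_model, num_heads, dropout=0.0):
        super().__init__()
        if d_model % num_heads != 0:
            raise ValueError("d_model must be divisible by num_heads")
        self.num_heads = num_heads
        self.d_head = d_model // num_heads
        # Query, key, value, and output projections (w_q, w_k, w_v are assumed names).
        self.w_q = nn.Linear(d_model, d_model)
        self.w_k = nn.Linear(d_model, d_model)
        self.w_v = nn.Linear(d_model, d_model)
        self.w_o = nn.Linear(d_model, d_model)
        self.dropout = nn.Dropout(dropout)
        self.norm = nn.LayerNorm(d_model)

    def _split(self, t):
        # (B, T, d_model) -> (B, num_heads, T, d_head)
        B, T, _ = t.shape
        return t.view(B, T, self.num_heads, self.d_head).transpose(1, 2)

    def forward(self, x, mask=None):
        B, T, _ = x.shape
        q = self._split(self.w_q(x))
        k = self._split(self.w_k(x))
        v = self._split(self.w_v(x))
        # Scaled dot-product scores: (B, num_heads, T, T)
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.d_head)
        if mask is not None:
            if mask.dim() == 3:  # (B, T, T) -> (B, 1, T, T)
                mask = mask.unsqueeze(1)
            # Assumed convention: zero entries are blocked.
            scores = scores.masked_fill(mask == 0, float("-inf"))
        attn = torch.softmax(scores, dim=-1)
        ctx = self.dropout(attn) @ v  # (B, num_heads, T, d_head)
        # Concatenate the heads back to (B, T, d_model).
        ctx = ctx.transpose(1, 2).contiguous().view(B, T, -1)
        # Output projection, then residual connection and layer norm.
        out = self.norm(x + self.dropout(self.w_o(ctx)))
        return out, attn


layer = MultiHeadSelfAttentionSketch(64, 4)
x = torch.randn(2, 10, 64)
out, attn = layer(x)
```

Each row of `attn` is a softmax over the key positions, so it sums to 1 along the last dimension.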

Parameters:
d_model : int

Model dimension (must be divisible by num_heads).

num_heads : int

Number of attention heads.

dropout : float, optional

Dropout probability applied to the attention weights and the output projection. Default is 0.0.
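The divisibility constraint on d_model follows from splitting the model dimension evenly across heads. A small sketch of that arithmetic, where `head_dim` is a hypothetical helper and not part of the library:

```python
def head_dim(d_model: int, num_heads: int) -> int:
    """Width of each head, assuming d_model is split evenly across heads."""
    if d_model % num_heads != 0:
        raise ValueError(
            f"d_model={d_model} is not divisible by num_heads={num_heads}"
        )
    return d_model // num_heads


print(head_dim(64, 4))  # 16
```

With d_model=64, choosing num_heads=5 would raise an error under this assumption, since 64 cannot be split into five equal parts.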

Examples

>>> import torch
>>> mha = MultiHeadAttention(64, 4)
>>> x = torch.randn(2, 10, 64)
>>> out, attn = mha(x)
>>> out.shape
torch.Size([2, 10, 64])
>>> attn.shape
torch.Size([2, 4, 10, 10])
forward(x, mask=None)

Forward pass.

Parameters:
x : torch.Tensor

Input of shape (B, T, d_model).

mask : torch.Tensor, optional

Attention mask of shape (B, 1, T, T) or (B, T, T).

Returns:
torch.Tensor

Output of shape (B, T, d_model).

torch.Tensor

Attention weights of shape (B, num_heads, T, T), one (T, T) map per head.
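A common use of the mask argument is causal masking, so that each time step only attends to itself and earlier steps. The sketch below builds a mask of shape (B, 1, T, T) and applies it to random scores with plain tensor operations. The convention that True marks an allowed position is an assumption; the documentation above does not state which convention the layer expects.

```python
import torch

B, T, num_heads = 2, 10, 4

# Lower-triangular matrix: position t may attend to positions <= t.
causal = torch.tril(torch.ones(T, T, dtype=torch.bool))
mask = causal.unsqueeze(0).unsqueeze(0).expand(B, 1, T, T)  # (B, 1, T, T)

# How such a mask is typically applied to raw scores of shape (B, num_heads, T, T):
# blocked positions get -inf and therefore zero weight after the softmax.
scores = torch.randn(B, num_heads, T, T)
weights = torch.softmax(scores.masked_fill(~mask, float("-inf")), dim=-1)
```

The singleton second dimension lets the same mask broadcast across all heads.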