Week 5: Transformers

[jupyter][google colab][reveal]

Neil D. Lawrence

Abstract:

This lecture builds on deep neural networks to explore transformer architectures, focusing on how attention mechanisms require sophisticated chain rule applications and how they connect to the overparameterization and generalization themes.

ML Foundations Course Notebook Setup

[edit]

We install some bespoke code for creating and saving plots, as well as for loading data sets.

import importlib.util
# install_command (assumed to be provided by the course setup) returns the
# appropriate pip command for installing a package.
cmd = install_command('pods')
%system {cmd}
cmd = install_command('mlai')
%system {cmd}
import notutils
import pods
import mlai
import mlai.plot as plot

From Deep Networks to Transformers

[edit]

In our previous lectures, we explored how composing layers of basis functions creates deep neural networks, and we examined the chain rule and automatic differentiation that make training these networks possible. We’ve seen how structured data can be handled through convolutional neural networks, graph neural networks and recurrent networks. Today we’ll see how these foundations extend to one of the most important architectural innovations in deep learning: the transformer.

Transformers represent a fundamental shift from the sequential processing of RNNs to parallel attention mechanisms. This creates new challenges for automatic differentiation, as we’ll see.

The Attention Mechanism

The key insight of transformers is the attention mechanism, which allows the model to focus on different parts of the input sequence simultaneously. This creates a more complex gradient flow than standard neural networks.

Chain Rule for Transformer Attention

[edit]

The transformer attention mechanism is more complex than standard neural networks because the same input matrix \(\mathbf{X}\) appears in three different linear transformations. This creates a more intricate chain rule when computing gradients.

This multiple appearance is what allows the transformer to handle variable-length sequences, but it makes the chain rule computation a little more complex than for a standard neural network.

The attention mechanism computes a weighted combination of values, where the weights are determined by the similarity between queries and keys. The softmax ensures the weights sum to one.
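Concretely, in the standard scaled dot-product formulation the queries, keys and values are linear transformations of the same input, \(\mathbf{Q} = \mathbf{X}\mathbf{W}_Q\), \(\mathbf{K} = \mathbf{X}\mathbf{W}_K\), \(\mathbf{V} = \mathbf{X}\mathbf{W}_V\), and the attention output (written here as \(\mathbf{Z}\)) is
\[
\mathbf{A} = \mathrm{softmax}\!\left(\frac{\mathbf{Q}\mathbf{K}^\top}{\sqrt{d_k}}\right), \qquad \mathbf{Z} = \mathbf{A}\mathbf{V},
\]
where \(d_k\) is the key dimension and the softmax is applied row-wise so that each row of \(\mathbf{A}\) sums to one.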

In a standard neural network, we have a single path from input to output. In transformer attention, we have three parallel paths through the same input, making the chain rule more complex.

The gradient flow through attention involves computing how the loss changes with respect to the attention weights and the value matrix.

The gradient with respect to the attention matrix comes from the product with the value matrix. This tells us how much each attention weight should change.
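With \(\mathbf{G} = \partial L/\partial \mathbf{Z}\) denoting the gradient arriving from the layer above, these two gradients are
\[
\frac{\partial L}{\partial \mathbf{A}} = \mathbf{G}\mathbf{V}^\top, \qquad \frac{\partial L}{\partial \mathbf{V}} = \mathbf{A}^\top\mathbf{G}.
\]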

The gradient through the softmax requires the standard softmax gradient formula, accounting for the fact that attention weights sum to one.
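For a single row of attention weights \(\mathbf{a} = \mathrm{softmax}(\mathbf{s})\) the Jacobian is
\[
\frac{\partial a_i}{\partial s_j} = a_i(\delta_{ij} - a_j),
\]
so the gradient with respect to the logits becomes \(\frac{\partial L}{\partial s_j} = a_j\left(\frac{\partial L}{\partial a_j} - \sum_i a_i \frac{\partial L}{\partial a_i}\right)\).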

The gradients for queries and keys come from their interaction in the attention logits. Each query interacts with all keys, and each key interacts with all queries.
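With logits \(\mathbf{S} = \mathbf{Q}\mathbf{K}^\top/\sqrt{d_k}\), this interaction gives
\[
\frac{\partial L}{\partial \mathbf{Q}} = \frac{1}{\sqrt{d_k}}\frac{\partial L}{\partial \mathbf{S}}\,\mathbf{K}, \qquad \frac{\partial L}{\partial \mathbf{K}} = \frac{1}{\sqrt{d_k}}\left(\frac{\partial L}{\partial \mathbf{S}}\right)^{\!\top}\mathbf{Q}.
\]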

Finally, we combine all three gradient paths to get the gradient with respect to the input matrix. This is the key insight: the same input appears in three different transformations.
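Because \(\mathbf{Q}\), \(\mathbf{K}\) and \(\mathbf{V}\) are all linear in \(\mathbf{X}\), the three contributions simply add,
\[
\frac{\partial L}{\partial \mathbf{X}} = \frac{\partial L}{\partial \mathbf{Q}}\mathbf{W}_Q^\top + \frac{\partial L}{\partial \mathbf{K}}\mathbf{W}_K^\top + \frac{\partial L}{\partial \mathbf{V}}\mathbf{W}_V^\top.
\]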

The gradients for the weight matrices follow the standard pattern: input matrix transposed times the gradient of the output.
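In symbols,
\[
\frac{\partial L}{\partial \mathbf{W}_Q} = \mathbf{X}^\top\frac{\partial L}{\partial \mathbf{Q}}, \qquad \frac{\partial L}{\partial \mathbf{W}_K} = \mathbf{X}^\top\frac{\partial L}{\partial \mathbf{K}}, \qquad \frac{\partial L}{\partial \mathbf{W}_V} = \mathbf{X}^\top\frac{\partial L}{\partial \mathbf{V}}.
\]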

Our implementation uses a layered architecture where MultiHeadAttentionLayer composes multiple AttentionLayer instances. Each head computes its own attention independently, and gradients flow through each head separately before being combined.
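To make the composition concrete, here is a conceptual numpy sketch of multi-head attention: each head applies its own query, key and value projections, the head outputs are concatenated, and an output projection maps back to the model dimension. The function and argument names are illustrative only, not part of the mlai implementation.

import numpy as np

def multi_head_attention_sketch(X, head_weights, W_out):
    """Conceptual sketch: head_weights is a list of (W_Q, W_K, W_V) triples,
    one per head; W_out is the output projection."""
    head_outputs = []
    for W_Q, W_K, W_V in head_weights:
        Q, K, V = X @ W_Q, X @ W_K, X @ W_V
        scores = Q @ K.transpose(0, 2, 1) / np.sqrt(Q.shape[-1])
        A = np.exp(scores - scores.max(axis=-1, keepdims=True))
        A = A / A.sum(axis=-1, keepdims=True)      # row-wise softmax
        head_outputs.append(A @ V)                 # each head attends independently
    return np.concatenate(head_outputs, axis=-1) @ W_out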

Our implementation supports three attention modes: self-attention (single input), cross-attention (separate query and key-value inputs), and mixed attention (query from one input, key-value from another). Each mode has its own gradient computation pattern.

Implementing transformer gradients efficiently requires careful attention to memory usage and numerical stability. The softmax operation can be numerically unstable for large attention scores. Our implementation includes output projection which adds another gradient path.
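The usual remedy is to subtract the row-wise maximum from the attention scores before exponentiating, which leaves the softmax value unchanged but avoids overflow. A minimal sketch, independent of the mlai classes:

import numpy as np

def stable_softmax(scores):
    """Row-wise softmax with the maximum subtracted for numerical stability."""
    shifted = scores - scores.max(axis=-1, keepdims=True)
    exp_scores = np.exp(shifted)
    return exp_scores / exp_scores.sum(axis=-1, keepdims=True)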

The transformer attention mechanism requires a more sophisticated understanding of the chain rule because the same input participates in multiple parallel computations. Our layered implementation makes this complexity manageable by separating concerns and providing comprehensive gradient testing.

Students can verify the chain rule implementation using our comprehensive gradient testing framework. The tests demonstrate how to use finite differences to verify that our analytical gradients match the mathematical theory.
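A check along these lines can be written in a few lines. The sketch below assumes the layer and loss interface used later in this notebook (a forward method, a backward method whose first returned gradient is with respect to the input, and a loss with forward and gradient methods); the helper name is illustrative.

import numpy as np

def finite_difference_check(layer, loss_fn, X, target, eps=1e-6):
    """Compare the layer's analytical input gradient with a numerical estimate."""
    # Analytical gradient from the layer's backward pass
    output = layer.forward(X)
    analytic = layer.backward(loss_fn.gradient(output, target))[0]

    # Numerical gradient by central differences on each input entry
    numeric = np.zeros_like(X)
    it = np.nditer(X, flags=['multi_index'])
    for _ in it:
        idx = it.multi_index
        X_plus, X_minus = X.copy(), X.copy()
        X_plus[idx] += eps
        X_minus[idx] -= eps
        loss_plus = loss_fn.forward(layer.forward(X_plus), target)
        loss_minus = loss_fn.forward(layer.forward(X_minus), target)
        numeric[idx] = (loss_plus - loss_minus) / (2 * eps)

    # Maximum discrepancy between analytical and numerical gradients
    return np.max(np.abs(analytic - numeric))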

Transformer Architecture

Now we’ll see how to build a complete transformer model, integrating all the components we’ve discussed.

Simple Transformer Implementation

[edit]
import numpy as np

# create_interesting_sequence_data is a helper supplied with the course code;
# here it generates arithmetic sequences.
X_seq, y_seq = create_interesting_sequence_data('arithmetic')

print(f"Sequence data: {X_seq.shape} -> {y_seq.shape}")
print(f"Sample sequence: {X_seq[0]}")
print(f"Target sequence: {y_seq[0]}")
print(f"Pattern: {X_seq[0]} -> {y_seq[0]} (arithmetic progression)")

Explore Different Sequence Types

# Let's explore different types of interesting sequences
print("Different Sequence Types for Transformer Learning:")
print("=" * 60)

# 1. Arithmetic sequences (mathematical patterns)
X_arith, y_arith = create_interesting_sequence_data('arithmetic')
print(f"\nArithmetic sequences:")
print(f"  Example: {X_arith[0]} -> {y_arith[0]}")
print(f"  Pattern: Each number increases by a fixed step")

# 2. Pattern sequences (repeating patterns)
X_pattern, y_pattern = create_interesting_sequence_data('pattern')
print(f"\nPattern sequences:")
print(f"  Example: {X_pattern[0]} -> {y_pattern[0]}")
print(f"  Pattern: Repeating cycles (A, B, C, A, B, C, ...)")

# 3. Text-like sequences (natural language structure)
X_text, y_text = create_interesting_sequence_data('text')
print(f"\nText sequences:")
print(f"  Example: {X_text[0]} -> {y_text[0]}")
print(f"  Pattern: Word-like structures with spaces and punctuation")

print(f"\nThese different sequence types test different transformer capabilities:")
print(f"- Arithmetic: Mathematical reasoning and number relationships")
print(f"- Pattern: Memory and pattern recognition across sequences")
print(f"- Text: Natural language structure and word boundaries")

Create and Test Basic Attention Layer

from mlai import AttentionLayer
d_model = 64
n_heads = 4
seq_length = 8
vocab_size = 30

# Create basic attention layer (new modular approach)
attention_layer = AttentionLayer(d_model)

# Test forward pass (self-attention)
X_test = np.random.randn(2, seq_length, d_model)
attn_output = attention_layer.forward(X_test)

print("Basic Attention Layer Test:")
print(f"Input shape: {X_test.shape}")
print(f"Output shape: {attn_output.shape}")
print(f"Model parameters: {len(attention_layer.parameters)}")
print(f"Layer type: {type(attention_layer).__name__}")
print("This is now a proper Layer that can be composed with other layers!")

Test Multi-Head Attention Layer

from mlai import MultiHeadAttentionLayer
# Test multi-head attention layer (new modular approach)
multi_head_attention = MultiHeadAttentionLayer(d_model, n_heads)

# Forward pass (self-attention)
X_test = np.random.randn(2, seq_length, d_model)
attn_output = multi_head_attention.forward(X_test)

print("Multi-Head Attention Layer Test:")
print(f"Input shape: {X_test.shape}")
print(f"Output shape: {attn_output.shape}")
print(f"Number of heads: {n_heads}")
print(f"Model parameters: {len(multi_head_attention.parameters)}")
print("This composes multiple AttentionLayer instances for true multi-head attention!")

Test Chain Rule in Attention Layer

# Test gradient flow through attention layer (demonstrating chain rule)
from mlai import MeanSquaredError

X_test = np.random.randn(2, seq_length, d_model)

# Forward pass through attention layer
output = attention_layer.forward(X_test)

# Create dummy loss using proper loss function (consistent with neural network)
target = np.random.randn(2, seq_length, d_model)
loss_fn = MeanSquaredError()
loss_value = loss_fn.forward(output, target)

# Backward pass (demonstrates three-path chain rule)
loss_gradient = loss_fn.gradient(output, target)
gradients = attention_layer.backward(loss_gradient)

print("Chain Rule Demonstration:")
print(f"Loss value: {loss_value:.4f}")
print(f"Input gradient shape: {gradients[0].shape}")
print(f"Input gradient norm: {np.linalg.norm(gradients[0]):.4f}")
print("This shows how gradients flow through Q, K, V transformations")
print("The attention layer implements the three-path chain rule internally")
print("Gradients are computed using our comprehensive gradient testing framework!")

Train Simple Attention Model with Layered Architecture

X_seq, y_seq = create_interesting_sequence_data('arithmetic')
model, losses = train_attention_model(X_seq, y_seq)

Figure: Transformer Training Progress for sequence modeling

Visualise Attention Weights

Figure: Attention weights visualisation from the first head showing which positions the model attends to

Test Different Numbers of Heads

# Test different numbers of heads (showing composition)
d_model = 64
seq_length = 8

for n_heads in [1, 2, 4, 8]:
    multi_head_attention = MultiHeadAttentionLayer(d_model, n_heads)
    X_test = np.random.randn(2, seq_length, d_model)

    output = multi_head_attention.forward(X_test)

    print(f"n_heads={n_heads}: output shape={output.shape}")
    print(f"  Each head processes {d_model//n_heads} dimensions")

Test Different Activation Functions

# Compare different activation functions for attention
from mlai import SoftmaxActivation, SigmoidAttentionActivation, IdentityMinusSoftmaxActivation

d_model = 32
seq_length = 4
X_test = np.random.randn(1, seq_length, d_model)

print("Comparing Attention Activation Functions:")
print("=" * 50)

# Standard softmax attention
softmax_attention = Attention(d_model, activation=SoftmaxActivation())
output_softmax, weights_softmax = softmax_attention.forward(X_test, X_test, X_test)

print("1. SoftmaxActivation (Standard):")
print(f"   Weights sum: {weights_softmax.sum(axis=-1)[0, 0]:.6f}")
print(f"   Weights range: [{weights_softmax.min():.6f}, {weights_softmax.max():.6f}]")
print(f"   Attention matrix:\n{weights_softmax[0, :, :]}")

# Sigmoid with normalization
sigmoid_attention = Attention(d_model, activation=SigmoidAttentionActivation())
output_sigmoid, weights_sigmoid = sigmoid_attention.forward(X_test, X_test, X_test)

print("\n2. SigmoidAttentionActivation:")
print(f"   Weights sum: {weights_sigmoid.sum(axis=-1)[0, 0]:.6f}")
print(f"   Weights range: [{weights_sigmoid.min():.6f}, {weights_sigmoid.max():.6f}]")
print(f"   Attention matrix:\n{weights_sigmoid[0, :, :]}")

# Identity minus softmax (interesting alternative)
identity_attention = Attention(d_model, activation=IdentityMinusSoftmaxActivation())
output_identity, weights_identity = identity_attention.forward(X_test, X_test, X_test)

print("\n3. IdentityMinusSoftmaxActivation:")
print(f"   Weights sum: {weights_identity.sum(axis=-1)[0, 0]:.6f}")
print(f"   Weights range: [{weights_identity.min():.6f}, {weights_identity.max():.6f}]")
print(f"   Diagonal entries (1-softmax): {np.diag(weights_identity[0, :, :])}")
print(f"   Off-diagonal entries (-softmax): {weights_identity[0, 0, 1]:.6f}, {weights_identity[0, 1, 0]:.6f}")
print(f"   Attention matrix:\n{weights_identity[0, :, :]}")

print("\nKey Differences:")
print("- Softmax: Standard attention, weights sum to 1, all positive")
print("- Sigmoid: Alternative activation, weights sum to 1, all positive") 
print("- Identity-Minus-Softmax: Diagonal positive (1-softmax), off-diagonal negative (-softmax), sum to 0")
print("  This creates a 'contrast' attention pattern that emphasizes self-connections while")
print("  de-emphasizing connections to other positions.")

Visualise Different Attention Patterns

Figure: Comparison of different attention activation functions showing how they create different attention patterns

# Show the different attention patterns
print("Attention Pattern Analysis:")
print("Softmax: Standard attention, weights sum to 1")
print("Sigmoid: Alternative activation, weights sum to 1") 
print("Identity-Softmax: Contrast attention, weights sum to 0")
print("This demonstrates different attention behaviors!")

The different attention activation functions create fundamentally different behaviors:

Softmax Attention (Standard):

- All weights are positive (0 to 1)
- Each row sums to 1 (probability distribution)
- Represents ‘how much to attend to each position’
- Higher values = more attention

Sigmoid + Normalization:

- Similar to softmax but uses sigmoid activation
- All weights positive, rows sum to 1
- Alternative way to create attention weights

Identity Minus Softmax:

- Diagonal entries: positive (1 - softmax)
- Off-diagonal entries: negative (-softmax)
- Each row sums to 0 (not 1!)
- Creates ‘contrast’ attention:
  * Positive diagonal = ‘attend to self’
  * Negative off-diagonal = ‘de-emphasize others’
- Could be useful for tasks requiring:
  * Self-focus (diagonal emphasis)
  * Contrast learning (positive vs negative weights)
  * Sparse attention patterns

This demonstrates how different activation functions can create fundamentally different attention behaviors, even with the same underlying Q, K, V computation!

Positional Encoding Layer Test

from mlai import PositionalEncodingLayer
# Test positional encoding layer (new modular approach)
pe_layer = PositionalEncodingLayer(d_model, max_length=100)
X_test = np.random.randn(2, seq_length, d_model)

X_with_pe = pe_layer.forward(X_test)

print("Positional Encoding Layer Test:")
print(f"Input shape: {X_test.shape}")
print(f"Output shape: {X_with_pe.shape}")
print(f"PE added: {np.allclose(X_test + pe_layer.pe[:seq_length], X_with_pe)}")
print(f"Layer parameters: {len(pe_layer.parameters)} (should be 0 - no trainable params)")
print("This is now a proper Layer that can be composed with other layers!")

Build Transformer with Layered Architecture

# Create transformer using the new layered architecture
from mlai import LayeredNeuralNetwork, MultiHeadAttentionLayer, PositionalEncodingLayer

vocab_size = 30
d_model = 64
n_heads = 4

# Create layers for transformer
pos_encoding = PositionalEncodingLayer(d_model)
attention = MultiHeadAttentionLayer(d_model, n_heads)

# Build transformer using layered architecture
transformer_layers = [pos_encoding, attention]
transformer = LayeredNeuralNetwork(transformer_layers)

print("Layered Transformer Model Test:")
print(f"Model dimension: {d_model}")
print(f"Number of heads: {n_heads}")
print(f"Number of layers: {len(transformer_layers)}")
print(f"Total parameters: {len(transformer.parameters)}")
print(f"Layer types: {[type(layer).__name__ for layer in transformer_layers]}")
print("This demonstrates the new modular layered architecture!")
# Test forward pass with layered architecture
X_test = np.random.randn(2, 8, d_model)  # Embedded input
output = transformer.forward(X_test)

print("Layered Transformer Test:")
print(f"Input shape: {X_test.shape}")
print(f"Output shape: {output.shape}")
print(f"Model parameters: {len(transformer.parameters)}")
print("This demonstrates the new modular layered architecture!")
print("Each layer can be composed and tested independently!")

Training Layered Transformer Model

X_seq, y_seq = create_interesting_sequence_data('pattern')
transformer_model, transformer_losses = train_transformer_model(X_seq, y_seq, vocab_size=15)

Figure: Simple Transformer Training Progress

Compare Different Sequence Types

# Train on different sequence types to see how transformers handle different patterns
print("Training Transformers on Different Sequence Types:")
print("=" * 60)

# Train on arithmetic sequences
print("\n1. Arithmetic Sequences (Mathematical Patterns):")
X_arith, y_arith = create_interesting_sequence_data('arithmetic')
model_arith, losses_arith = train_attention_model(X_arith, y_arith)
print(f"Final loss: {losses_arith[-1]:.4f}")

# Train on pattern sequences  
print("\n2. Pattern Sequences (Repeating Patterns):")
X_pattern, y_pattern = create_interesting_sequence_data('pattern')
model_pattern, losses_pattern = train_attention_model(X_pattern, y_pattern)
print(f"Final loss: {losses_pattern[-1]:.4f}")

# Train on text sequences
print("\n3. Text Sequences (Natural Language Structure):")
X_text, y_text = create_interesting_sequence_data('text')
model_text, losses_text = train_attention_model(X_text, y_text)
print(f"Final loss: {losses_text[-1]:.4f}")

print(f"\nResults Analysis:")
print(f"- Arithmetic sequences: Tests mathematical reasoning")
print(f"- Pattern sequences: Tests memory and pattern recognition")
print(f"- Text sequences: Tests natural language understanding")
print(f"Lower loss indicates better learning of the underlying pattern!")

Benefits of the New Layered Architecture

# Demonstrate the benefits of the new modular layered architecture
print("Benefits of the New Layered Architecture:")
print("=" * 50)

# 1. Composable layers
from mlai import LayeredNeuralNetwork, MultiHeadAttentionLayer, PositionalEncodingLayer, LinearLayer, ReLUActivation

# Create different layer combinations
layers1 = [PositionalEncodingLayer(d_model), MultiHeadAttentionLayer(d_model, n_heads)]
layers2 = [PositionalEncodingLayer(d_model), MultiHeadAttentionLayer(d_model, n_heads), 
           LinearLayer(d_model, d_model), ReLUActivation()]

model1 = LayeredNeuralNetwork(layers1)
model2 = LayeredNeuralNetwork(layers2)

print(f"Model 1 (Attention only): {len(model1.parameters)} parameters")
print(f"Model 2 (Attention + Linear): {len(model2.parameters)} parameters")
print(f"Layer types in Model 1: {[type(l).__name__ for l in layers1]}")
print(f"Layer types in Model 2: {[type(l).__name__ for l in layers2]}")

# 2. Independent layer testing
print("\nIndependent Layer Testing:")
attention_layer = MultiHeadAttentionLayer(d_model, n_heads)
pos_layer = PositionalEncodingLayer(d_model)

# Test each layer independently
X_test = np.random.randn(2, 8, d_model)
pos_output = pos_layer.forward(X_test)
attn_output = attention_layer.forward(pos_output)

print(f"Positional encoding output shape: {pos_output.shape}")
print(f"Attention output shape: {attn_output.shape}")
print("Each layer can be tested and debugged independently!")

# 3. Gradient testing
print("\nGradient Testing:")
print("All layers have comprehensive gradient testing using finite differences")
print("This ensures mathematical correctness of the implementations")

# 4. Parameter management
print("\nParameter Management:")
print(f"Attention layer parameters: {len(attention_layer.parameters)}")
print(f"Positional encoding parameters: {len(pos_layer.parameters)} (should be 0)")
print("Each layer manages its own parameters with proper getter/setter methods")

print("\nThis demonstrates the power of the new modular architecture!")
print("Layers are composable, testable, and mathematically verified!")

The new layered architecture provides several key benefits:

1. Modularity and Composability:
   - Each layer is a self-contained unit with a consistent interface
   - Layers can be composed in any order to create complex architectures
   - Easy to experiment with different layer combinations

2. Independent Testing:
   - Each layer can be tested independently using our comprehensive gradient testing
   - Forward and backward passes are verified using finite differences
   - Mathematical correctness is ensured through numerical verification

3. Clean Separation of Concerns:
   - Attention logic is separate from positional encoding
   - Each layer has a single responsibility
   - Easy to understand and debug individual components

4. Consistent Interface:
   - All layers implement the same forward(), backward(), and parameters interface
   - Works seamlessly with LayeredNeuralNetwork
   - Follows the same patterns as other neural network components

5. Educational Clarity:
   - Students can understand each component in isolation
   - Clear demonstration of how complex architectures are built from simple components
   - Shows the power of composition over inheritance

This modular approach makes transformer architectures much more accessible and maintainable!

Training and Generalization

How do transformers relate to the overparameterization and generalization themes we discussed in the previous lecture?

Attention as Implicit Regularization

The attention mechanism provides a form of implicit regularization. Unlike the explicit regularization we discussed for standard neural networks, attention creates sparse, interpretable patterns that emerge during training.

Overparameterization in Transformers

Transformers generalize well precisely because they are highly overparameterized. This extends our previous discussion of how overparameterization enables generalization through the optimization process, with the attention mechanism providing additional structural constraints.

Summary and Future Directions

Transformers represent a significant evolution in deep learning architectures, but they also raise new questions about optimization, generalization, and the fundamental principles of learning. The attention mechanism provides a new form of inductive bias that we’re still learning to understand theoretically.

Further Reading

Thanks!

For more information on these subjects and more you might want to check the following resources.

References