
Week 5: Transformers


Nic Lane

Abstract:

This lecture will introduce transformers.

Plan for the Day

Think back to last week: Seq2Seq

Neural Machine Translation by Jointly Learning to Align and Translate (Bahdanau et al., 2015)


Learning Alignment - the first Attention

The alignment function \(f\) is implemented as a fully connected layer.
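A minimal sketch of this additive (Bahdanau-style) attention step, assuming a previous decoder state `s_prev` and a matrix of encoder states `H`; the matrices `W`, `U` and vector `v` stand in for the fully connected layer and are randomly initialised here purely for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def additive_attention(s_prev, H, W, U, v):
    """Score each encoder state h_j against the previous decoder state
    s_{i-1} with a small fully connected layer, then return the alignment
    weights and the context vector (the weighted sum of encoder states)."""
    scores = np.tanh(s_prev @ W + H @ U) @ v   # e_j = v^T tanh(W s_{i-1} + U h_j), shape (T_src,)
    alpha = softmax(scores)                    # alignment weights over source positions
    context = alpha @ H                        # context vector fed to the decoder
    return alpha, context

# toy example: 5 source positions, hidden size 8, attention size 16
rng = np.random.default_rng(0)
T, d, d_att = 5, 8, 16
s_prev = rng.normal(size=d)
H = rng.normal(size=(T, d))
W, U, v = rng.normal(size=(d, d_att)), rng.normal(size=(d, d_att)), rng.normal(size=d_att)
alpha, context = additive_attention(s_prev, H, W, U, v)
print(alpha.round(3), context.shape)
```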

Attention, Attention!

Plan for the Day

Attention is All You Need

Attention is All You Need

\[ \text{attention}(Q,K,V)=\text{softmax}\left(\frac{QK^{\top}}{\sqrt{d_{K}}}\right) V\]
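A direct NumPy transcription of this formula, just to make the shapes concrete; `Q`, `K`, `V` are assumed to be (sequence length × head dimension) matrices for a single head.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V"""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # (T_q, T_k): similarity of each query to each key
    scores -= scores.max(axis=-1, keepdims=True)     # subtract row max for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V                               # weighted sum of value vectors

# toy example: 4 positions, head dimension 8
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(4, 8)) for _ in range(3))
print(scaled_dot_product_attention(Q, K, V).shape)   # (4, 8)
```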

Attention is All You Need

Attention is All You Need

\[ \text{attention}(Q,K,V)=\text{softmax}\left(\frac{QK^{\top}}{\sqrt{d_{K}}}\right) V\]

Attention is All You Need: the Transformer

Attention is All You Need

The Transformer achieves better BLEU scores than previous state-of-the-art models on the WMT 2014 English-to-German
and English-to-French translation tasks (newstest2014), at a fraction of the training cost.

Plan for the Day

Reformers

Reformers: Locality Sensitive Hashing

Rather than attending all-to-all, split the sequence into buckets of similar vectors and attend only within each bucket. Kitaev et al. (2020) employ a locality-sensitive hashing scheme first proposed by Andoni et al. (2015).
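A minimal sketch of the angular LSH step used to form those buckets, assuming shared query/key vectors `x`; the random-rotation hash follows Andoni et al.'s scheme, and the sort-and-attend-within-bucket step is omitted for brevity.

```python
import numpy as np

def lsh_buckets(x, n_buckets, rng):
    """Assign each vector a bucket via random rotations (angular LSH).
    Vectors with high cosine similarity tend to land in the same bucket,
    so attention only needs to be computed within each bucket."""
    d = x.shape[-1]
    # project onto n_buckets/2 random directions; the hash is the argmax
    # over the concatenation [xR ; -xR]
    R = rng.normal(size=(d, n_buckets // 2))
    xR = x @ R
    return np.argmax(np.concatenate([xR, -xR], axis=-1), axis=-1)

rng = np.random.default_rng(0)
x = rng.normal(size=(1024, 64))            # 1024 positions, dimension 64
buckets = lsh_buckets(x, n_buckets=16, rng=rng)
# each position now attends only to positions sharing its bucket
print(np.bincount(buckets, minlength=16))
```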

Reformers performance

Loss: negative log-likelihood and perplexity, expressed in bits per dimension.
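For reference, "bits per dimension" is the average negative log-likelihood converted from nats to base 2; the numbers below are purely illustrative.

```python
import numpy as np

def bits_per_dim(nll_nats, n_dims):
    """Average negative log-likelihood (total, in nats) over all predicted
    dimensions, converted to base 2. Lower is better; 2**bpd is the
    per-dimension perplexity."""
    return nll_nats / (n_dims * np.log(2))

print(bits_per_dim(nll_nats=177.4, n_dims=64))   # total NLL of a 64-dimensional sample
```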

Linformers

Linformers

Linformers

The idea is very simple: add low-rank linear projections to the key and value matrices before the attention dot product.
This fixes the sequence-length dimension of the matrices entering the self-attention mechanism to a constant, so attention scales linearly with sequence length.
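A minimal sketch of that projection, assuming sequence length `n`, head dimension `d`, and a projected length `k` much smaller than `n`; `E` and `F` stand in for the learned key/value projection matrices and are random placeholders here.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def linformer_attention(Q, K, V, E, F):
    """Project K and V from length n down to a fixed length k before the
    dot product, so the attention matrix is (n, k) rather than (n, n)."""
    d = K.shape[-1]
    K_proj = E @ K                        # (k, d): compressed keys
    V_proj = F @ V                        # (k, d): compressed values
    scores = Q @ K_proj.T / np.sqrt(d)    # (n, k) instead of (n, n)
    return softmax(scores) @ V_proj       # (n, d)

rng = np.random.default_rng(0)
n, d, k = 4096, 64, 256
Q, K, V = (rng.normal(size=(n, d)) for _ in range(3))
E, F = (rng.normal(size=(k, n)) / np.sqrt(n) for _ in range(2))
print(linformer_attention(Q, K, V, E, F).shape)   # (4096, 64)
```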

Linformers performance

Training Transformers

Layer Normalization in Transformers

Training Transformers

Unbalanced Gradients
Amplification-induced Instability
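A sketch of the two layer-norm placements being contrasted here, assuming a generic `sublayer` (attention or feed-forward): post-LN normalises after the residual addition, as in the original Transformer, while pre-LN normalises before the sublayer and leaves the residual path untouched, which tends to give better-behaved gradients early in training.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalise each position's features to zero mean and unit variance
    (learned gain and bias omitted for brevity)."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def post_ln_block(x, sublayer):
    # original Transformer: LayerNorm(x + Sublayer(x))
    return layer_norm(x + sublayer(x))

def pre_ln_block(x, sublayer):
    # pre-LN variant: x + Sublayer(LayerNorm(x)); residual path stays unnormalised
    return x + sublayer(layer_norm(x))

rng = np.random.default_rng(0)
x = rng.normal(size=(10, 64))
W_ff = rng.normal(size=(64, 64)) * 0.1
ff = lambda h: np.maximum(h @ W_ff, 0.0)   # toy feed-forward sublayer
print(post_ln_block(x, ff).shape, pre_ln_block(x, ff).shape)
```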

Switch Transformers

Switch Transformers
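Switch Transformers replace the dense feed-forward layer with many expert feed-forward networks and route each token to exactly one of them. A minimal sketch of that top-1 routing, with toy random experts standing in for the real ones; the load-balancing loss and capacity factor from the paper are omitted.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def switch_layer(x, W_router, experts):
    """Top-1 routing: each token is sent to a single expert feed-forward
    network and its output is scaled by the router probability."""
    probs = softmax(x @ W_router)               # (tokens, n_experts) router scores
    choice = probs.argmax(axis=-1)              # chosen expert per token
    gate = probs[np.arange(len(x)), choice]     # probability of that choice
    out = np.empty_like(x)
    for e, expert in enumerate(experts):
        mask = choice == e
        out[mask] = expert(x[mask])             # only the chosen expert runs per token
    return gate[:, None] * out

rng = np.random.default_rng(0)
d, n_experts, tokens = 64, 4, 32
x = rng.normal(size=(tokens, d))
W_router = rng.normal(size=(d, n_experts))
experts = [(lambda W: (lambda h: np.maximum(h @ W, 0.0)))(rng.normal(size=(d, d)) * 0.1)
           for _ in range(n_experts)]
print(switch_layer(x, W_router, experts).shape)   # (32, 64)
```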

Switch Transformers - speed-up

Switch Transformers - multi-task performance

Plan for the Day

An Image is Worth 16x16 Words: Embedding and Architecture

An Image is Worth 16x16 Words

An Image is Worth 16x16 Words
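A minimal sketch of the patch embedding these slides refer to: split the image into 16x16 patches, flatten each, project it linearly, then prepend a class token and add position embeddings. All weights here are random placeholders for the learned parameters.

```python
import numpy as np

def patch_embed(img, patch=16, d_model=768):
    """Cut an (H, W, C) image into non-overlapping patch x patch tiles,
    flatten each tile, and project it to d_model: the '16x16 words'."""
    rng = np.random.default_rng(0)
    H, W, C = img.shape
    patches = (img.reshape(H // patch, patch, W // patch, patch, C)
                  .transpose(0, 2, 1, 3, 4)
                  .reshape(-1, patch * patch * C))       # (n_patches, 768 for 16x16x3)
    W_proj = rng.normal(size=(patch * patch * C, d_model)) * 0.02
    tokens = patches @ W_proj                            # (n_patches, d_model)
    cls = rng.normal(size=(1, d_model)) * 0.02           # [class] token
    pos = rng.normal(size=(tokens.shape[0] + 1, d_model)) * 0.02
    return np.concatenate([cls, tokens], axis=0) + pos   # input sequence to the transformer

img = np.zeros((224, 224, 3))
print(patch_embed(img).shape)   # (197, 768): 14*14 patches + class token
```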

AlphaFold - the protein folding problem

Convolutional solution: AlphaFold 1

Transformer-based solution: AlphaFold 2

Summary of the Day