# Week 4: Recurrent Neural Networks


Abstract:

This lecture introduces recurrent neural networks: structures that allow us to deal with sequences.

## Different from what we had before:

• different input type (sequences)
• different network building blocks
  • multiplicative interactions
  • gating
  • skip connections
• different objective
  • maximum likelihood
  • generative modelling

## Modelling sequences

• input to the network: $x_1, x_2, \ldots, x_T$
• sequences of different length
• sometimes an ‘EOS’ (end-of-sequence) symbol marks the end
• sequence classification (e.g. text classification)
• sequence generation (e.g. language generation)
• sequence-to-sequence (e.g. translation)

### RNN: different uses

Figure: RNN input/output configurations (one-to-one, one-to-many, many-to-one, many-to-many), from Andrej Karpathy’s blog post.

### Generating sequences

Goal: model the distribution of sequences

$p(x_{1:T}) = p(x_1, \ldots, x_T)$

Idea: model it one-step-at-a-time:

$p(x_{1:T}) = p(x_T\vert x_{1:T-1}) p(x_{T-1} \vert x_{1:T-2}) \cdots p(x_1)$
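To make the one-step-at-a-time idea concrete, here is a minimal numpy sketch (not from the lecture; the vocabulary size, hidden size and random weights are made up) that samples a sequence by repeatedly drawing $x_t$ from $p(x_t \mid x_{1:t-1})$, with the conditioning summarised by a recurrent hidden state:

```python
import numpy as np

rng = np.random.default_rng(0)
V, H = 5, 8                       # toy vocabulary size and hidden size (illustrative)
W_h, W_x = 0.1 * rng.normal(size=(H, H)), 0.1 * rng.normal(size=(H, V))
W_y, b_h, b_y = 0.1 * rng.normal(size=(V, H)), np.zeros(H), np.zeros(V)

def one_hot(i, n):
    v = np.zeros(n); v[i] = 1.0; return v

h, x, sequence = np.zeros(H), one_hot(0, V), []
for t in range(10):                                   # generate 10 symbols
    h = np.tanh(W_h @ h + W_x @ x + b_h)              # summary of x_{1:t-1}
    logits = W_y @ h + b_y
    p = np.exp(logits - logits.max()); p /= p.sum()   # softmax -> p(x_t | x_{1:t-1})
    x_t = rng.choice(V, p=p)                          # sample the next symbol
    sequence.append(x_t)
    x = one_hot(x_t, V)
print(sequence)
```

In practice the weights would be trained by maximum likelihood, i.e. maximising $\sum_t \log p(x_t \mid x_{1:t-1})$ over a corpus of sequences.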

### But it was not that easy

• vanilla RNNs forget too quickly
• colab illustration

Vanilla RNN:

$\mathbf{h}_{t+1} = \sigma(W_h \mathbf{h}_t + W_x \mathbf{x}_t + \mathbf{b_h})$

$\hat{y} = \phi(W_y \mathbf{h}_{T} + \mathbf{b}_y)$
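As a sketch of how these two equations are used for sequence classification (the dimensions and the softmax readout are illustrative assumptions, not the lecture’s code):

```python
import numpy as np

rng = np.random.default_rng(0)
D, H, C, T = 3, 8, 2, 20          # input dim, hidden dim, classes, length (made up)
W_h, W_x = 0.1 * rng.normal(size=(H, H)), 0.1 * rng.normal(size=(H, D))
W_y, b_h, b_y = 0.1 * rng.normal(size=(C, H)), np.zeros(H), np.zeros(C)

xs = rng.normal(size=(T, D))      # an input sequence x_1, ..., x_T
h = np.zeros(H)
for x_t in xs:                    # h_{t+1} = sigma(W_h h_t + W_x x_t + b_h)
    h = np.tanh(W_h @ h + W_x @ x_t + b_h)
logits = W_y @ h + b_y            # y_hat = phi(W_y h_T + b_y), with phi = softmax here
y_hat = np.exp(logits - logits.max()); y_hat /= y_hat.sum()
print(y_hat)
```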

### The gradients of the loss are

\begin{align} \frac{\partial \hat{L}}{\partial \mathbf{h}_t} &= \frac{\partial \hat{L}}{\partial \mathbf{h}_T} \prod_{s=t}^{T-1} \frac{\partial \mathbf{h}_{s+1}}{\partial \mathbf{h}_s} \\ &= \frac{\partial \hat{L}}{\partial \mathbf{h}_T} \prod_{s=t}^{T-1} D_s W_h, \end{align}

where

• $D_s = \operatorname{diag} \left[\sigma'(W_h \mathbf{h}_s + W_x \mathbf{x}_s + \mathbf{b_h})\right]$

• if $\sigma$ is ReLU, $\sigma'(z) \in \{0, 1\}$

### The norm of the gradient is upper bounded

\begin{align} \left\|\frac{\partial \hat{L}}{\partial \mathbf{h}_t}\right\| &\leq \left\|\frac{\partial \hat{L}}{\partial \mathbf{h}_T}\right\| \left(\prod_{s=t}^{T-1} \left\|D_s\right\|\right) \left\|W_h\right\|^{T-t}, \end{align}

• the norm of $D_s$ is at most 1 (for ReLU, its entries are 0 or 1)
• the norm of $W_h$ then controls the bound: $\|W_h\| < 1$ makes gradients vanish exponentially with distance, $\|W_h\| > 1$ lets them explode
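A small numerical sketch of what the bound implies (purely illustrative; the sizes, the random toy RNN and the function name are made up). It accumulates the Jacobian product $\prod_s D_s W_h$ for a ReLU RNN and compares its spectral norm against the bound $\|W_h\|^{T-t}$:

```python
import numpy as np

rng = np.random.default_rng(0)
H, T = 16, 50

def jacobian_product_norm(scale):
    """Spectral norm of prod_s D_s W_h for a toy ReLU RNN driven by noise."""
    W_h = scale * np.linalg.qr(rng.normal(size=(H, H)))[0]   # ||W_h||_2 = scale
    W_x = 0.1 * rng.normal(size=(H, H))
    h, J = np.zeros(H), np.eye(H)
    for _ in range(T):
        pre = W_h @ h + W_x @ rng.normal(size=H)
        h = np.maximum(pre, 0.0)                             # ReLU
        D = np.diag((pre > 0).astype(float))                 # D_s: sigma'(z) in {0, 1}
        J = D @ W_h @ J                                      # multiply in one more D_s W_h
    return np.linalg.norm(J, 2)

for scale in (0.5, 0.9, 1.5):
    bound = scale ** T                                       # ||W_h||^{T-t} for T - t = 50
    print(f"||W_h|| = {scale}: bound {bound:.2e}, actual {jacobian_product_norm(scale):.2e}")
```

The actual gradient norm can never exceed the bound, so for $\|W_h\| < 1$ it provably shrinks towards zero as the distance $T-t$ grows; this is one way to see why vanilla RNNs forget quickly.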

### Unitary Evolution RNNs

Idea: constrain $W_h$ to be unitary, so that $\|W_h\| = 1$ and this factor of the bound neither shrinks nor grows with sequence length.

### Unitary Evolution RNNs

Compose the weight matrix out of simple unitary transforms (diagonal phase matrices $D_k$, Householder reflections $R_k$, the Fourier transform $\mathcal{F}$ and its inverse, and a fixed permutation $\Pi$):

$W_h = D_3R_2\mathcal{F}^{-1}D_2\Pi R_1\mathcal{F}D_1$
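A sketch of the construction, loosely following Arjovsky et al. (2016) with made-up sizes and random parameters. Each factor is individually unitary, so their product is unitary too and has spectral norm exactly 1:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 8

def diag_phase():
    """D: diagonal matrix of unit-modulus complex phases e^{i theta}."""
    return np.diag(np.exp(1j * rng.uniform(-np.pi, np.pi, n)))

def reflection():
    """R: complex Householder reflection I - 2 v v* / ||v||^2 (unitary)."""
    v = rng.normal(size=n) + 1j * rng.normal(size=n)
    return np.eye(n) - 2.0 * np.outer(v, v.conj()) / np.vdot(v, v).real

F = np.fft.fft(np.eye(n)) / np.sqrt(n)       # unitary DFT matrix
F_inv = F.conj().T
Pi = np.eye(n)[rng.permutation(n)]           # fixed permutation

# W_h = D3 R2 F^{-1} D2 Pi R1 F D1
W_h = diag_phase() @ reflection() @ F_inv @ diag_phase() @ Pi @ reflection() @ F @ diag_phase()
print(np.allclose(W_h @ W_h.conj().T, np.eye(n)))   # True: W_h is unitary
print(np.linalg.norm(W_h, 2))                        # spectral norm is 1
```

Because $\|W_h\|_2 = 1$ exactly, the $\|W_h\|^{T-t}$ factor in the gradient bound stays constant no matter how long the sequence is.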

### More typical solution: gating

Vanilla RNN:

$\mathbf{h}_{t+1} = \sigma(W_h \mathbf{h}_t + W_x \mathbf{x}_t + \mathbf{b_h})$

Gated Recurrent Unit:

\begin{align} \mathbf{h}_{t+1} &= \mathbf{z}_t \odot \mathbf{h}_t + (1 - \mathbf{z}_t) \odot \tilde{\mathbf{h}}_t \\ \tilde{\mathbf{h}}_t &= \phi\left(W\mathbf{x}_t + U(\mathbf{r}_t \odot \mathbf{h}_t)\right)\\ \mathbf{r}_t &= \sigma(W_r\mathbf{x}_t + U_r\mathbf{h}_t)\\ \mathbf{z}_t &= \sigma(W_z\mathbf{x}_t + U_z\mathbf{h}_t) \end{align}
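A minimal numpy sketch of one GRU step implementing the equations above (biases are omitted, matching the slide; the sizes and initialisation are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
D, H = 4, 8                                   # input and hidden sizes (made up)
def mat(*shape): return 0.1 * rng.normal(size=shape)
W,   U   = mat(H, D), mat(H, H)               # candidate-state weights
W_r, U_r = mat(H, D), mat(H, H)               # reset-gate weights
W_z, U_z = mat(H, D), mat(H, H)               # update-gate weights
sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))

def gru_step(h, x):
    r = sigmoid(W_r @ x + U_r @ h)            # reset gate r_t
    z = sigmoid(W_z @ x + U_z @ h)            # update gate z_t
    h_tilde = np.tanh(W @ x + U @ (r * h))    # candidate state, phi = tanh
    return z * h + (1.0 - z) * h_tilde        # h_{t+1}: gated mix of old and new

h = np.zeros(H)
for x_t in rng.normal(size=(20, D)):          # run over a toy sequence
    h = gru_step(h, x_t)
print(h)
```

Note how the update gate lets the old state pass through untouched when $\mathbf{z}_t \approx 1$, which is exactly the shortcut behaviour discussed below.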

## GRU diagram

### LSTM: Long Short-Term Memory

• introduced by Hochreiter and Schmidhuber (1997)
• improved/tweaked several times since
• more gates to control behaviour
• 2009: Alex Graves’ LSTM wins the ICDAR connected handwriting recognition competition
• 2013: sets a new record on a natural speech recognition benchmark
• 2014: GRU proposed (a simplified LSTM)
• 2016: neural machine translation systems built on LSTMs

### ResNets

• allow for much deeper networks (101 or 152 layers)
• performance increases with depth
• set new records on benchmarks (ImageNet, COCO)
• used almost everywhere now

### ResNets behave like ensembles

Figure from Veit et al. (2016).

### Back to RNNs

• like ResNets, LSTMs and GRUs create “shortcuts”
• these allow information to skip processing
• data-dependent gating
• data-dependent shortcuts (contrasted in the sketch below)
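A toy sketch contrasting the two kinds of shortcut (purely illustrative; the transformation $f$ and the gate weights are made up): a ResNet block adds the input back unchanged, while a GRU-style update uses a gate computed from the data to decide how much of the old state to keep:

```python
import numpy as np

rng = np.random.default_rng(0)
H = 8
h = rng.normal(size=H)
W_f = 0.1 * rng.normal(size=(H, H))
W_g = 0.1 * rng.normal(size=(H, H))
f = lambda h: np.tanh(W_f @ h)                 # some learned transformation

# ResNet-style shortcut: the input is passed through unchanged and added back.
h_res = h + f(h)

# GRU-style shortcut: a data-dependent gate z decides how much to keep vs. replace.
z = 1.0 / (1.0 + np.exp(-(W_g @ h)))           # gate depends on the current state
h_gated = z * h + (1.0 - z) * f(h)
print(h_res, h_gated)
```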

## Different from what we had before:

• different input type (sequences)
• different network building blocks
  • multiplicative interactions
  • gating
  • skip connections
• different objective
  • maximum likelihood
  • generative modelling

### RNN: different uses

Figure: RNN input/output configurations, from Andrej Karpathy’s blog post.

### To engage with this material at home

Try the char-RNN Exercise from Udacity.