Week 4: Recurrent Neural Networks

Ferenc Huszár

Abstract:

This lecture introduces recurrent neural networks (RNNs), structures that allow us to deal with sequences.

Different from what we had before:

Modelling sequences

Recurrent Neural Network

RNN: Unrolled through time

RNN: different uses

figure from Andrej Karpathy’s blog post

Generating sequences

Goal: model the distribution of sequences

\[ p(x_{1:T}) = p(x_1, \ldots, x_T) \]

Idea: model it one-step-at-a-time:

\[ p(x_{1:T}) = p(x_T\vert x_{1:T-1}) p(x_{T-1} \vert x_{1:T-2}) \cdots p(x_1) \]
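In code, the one-step-at-a-time idea looks roughly like this (a minimal sketch assuming discrete tokens and a placeholder conditional; a real model computes \(p(x_t \vert x_{1:t-1})\) with an RNN):

```python
import numpy as np

rng = np.random.default_rng(0)
V = 5  # vocabulary size (illustrative assumption: discrete tokens, e.g. characters)

def next_step_probs(prefix):
    # Stand-in for p(x_t | x_{1:t-1}); here simply uniform over the vocabulary.
    # A real model would condition on the prefix, e.g. via an RNN hidden state.
    return np.full(V, 1.0 / V)

def log_likelihood(x):
    # log p(x_{1:T}) = sum_t log p(x_t | x_{1:t-1})
    return sum(np.log(next_step_probs(x[:t])[x[t]]) for t in range(len(x)))

def sample(T):
    # Draw x_1, then x_2 given x_1, and so on: one step at a time.
    x = []
    for _ in range(T):
        x.append(rng.choice(V, p=next_step_probs(x)))
    return x
```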

Modeling sequence distributions

Training: maximum likelihood

Sampling sequences

Char-RNN: Shakespeare

from Andrej Karpathy’s 2015 blog post

Char-RNN: Wikipedia

from Andrej Karpathy’s 2015 blog post

Char-RNN example: random XML

from Andrej Karpathy’s 2015 blog post

Char-RNN example: LaTeX

from Andrej Karpathy’s 2015 blog post

But, it was not that easy

Vanishing gradient problem

Vanishing/exploding gradients problem

Vanilla RNN:

\[ \mathbf{h}_{t+1} = \sigma(W_h \mathbf{h}_t + W_x \mathbf{x}_t + \mathbf{b}_h) \]

\[ \hat{y} = \phi(W_y \mathbf{h}_{T} + \mathbf{b}_y) \]
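As a sketch, the recurrence and read-out above in NumPy (taking \(\sigma = \tanh\) and \(\phi\) as the identity for illustration):

```python
import numpy as np

def rnn_forward(xs, W_h, W_x, b_h, W_y, b_y):
    # xs: sequence of input vectors x_1, ..., x_T
    h = np.zeros(W_h.shape[0])                   # initial hidden state h_0
    for x in xs:
        h = np.tanh(W_h @ h + W_x @ x + b_h)     # h_{t+1} = sigma(W_h h_t + W_x x_t + b_h)
    return W_y @ h + b_y                         # y_hat = phi(W_y h_T + b_y), phi = identity here
```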

The gradients of the loss are

\[\begin{align} \frac{\partial \hat{L}}{\partial \mathbf{h}_t} &= \frac{\partial \hat{L}}{\partial \mathbf{h}_T} \prod_{s=t}^{T-1} \frac{\partial \mathbf{h}_{s+1}}{\partial \mathbf{h}_s} \\ &= \frac{\partial \hat{L}}{\partial \mathbf{h}_T} \prod_{s=t}^{T-1} D_s W_h, \end{align}\]

where \(D_s = \operatorname{diag}\left(\sigma'(W_h \mathbf{h}_s + W_x \mathbf{x}_s + \mathbf{b}_h)\right)\) is the Jacobian of the elementwise nonlinearity.

The norm of the gradient is upper bounded

\[\begin{align} \left\|\frac{\partial \hat{L}}{\partial \mathbf{h}_t}\right\| &\leq \left\|\frac{\partial \hat{L}}{\partial \mathbf{h}_T}\right\| \left(\prod_{s=t}^{T-1} \left\|D_s\right\|\right) \left\|W_h\right\|^{T-t}. \end{align}\]
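A small numerical sketch of this bound (the sizes, the \(\tanh\) nonlinearity and the random pre-activations are illustrative assumptions): the norm of the Jacobian product typically collapses towards zero when \(\|W_h\|\) is small and blows up when it is large.

```python
import numpy as np

rng = np.random.default_rng(0)
n, T = 64, 100  # hidden size and sequence length (assumptions)

def product_norm(scale):
    # Spectral norm of prod_{s=t}^{T-1} D_s W_h for a tanh RNN,
    # where D_s = diag(1 - tanh(a_s)^2) is the activation Jacobian.
    W_h = scale * rng.standard_normal((n, n)) / np.sqrt(n)
    J = np.eye(n)
    for _ in range(T):
        a = rng.standard_normal(n)           # stand-in pre-activations
        D = np.diag(1.0 - np.tanh(a) ** 2)
        J = J @ (D @ W_h)
    return np.linalg.norm(J, 2)

for scale in [0.5, 1.0, 3.0]:
    print(scale, product_norm(scale))  # vanishes for small scale, explodes for large
```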

Unitary Evolution RNNs

Idea: constrain \(W_h\) to be unitary, so that \(\left\|W_h\right\| = 1\) and the \(\left\|W_h\right\|^{T-t}\) factor above can neither vanish nor explode.

Unitary Evolution RNNs

Compose weight matrix out of simple unitary transforms:

\[ W_h = D_3R_2\mathcal{F}^{-1}D_2\Pi R_1\mathcal{F}D_1 \]
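A sketch of why this works (my reading of the unitary evolution RNN construction; the diagonal phase matrices \(D_i\), Householder reflections \(R_i\), unitary DFT \(\mathcal{F}\) and fixed permutation \(\Pi\) below are assumptions about the individual factors): each factor is unitary, so their product \(W_h\) is unitary and \(\|W_h\| = 1\).

```python
import numpy as np

rng = np.random.default_rng(0)
n = 8  # hidden size (assumption)

def diag_phase(theta):
    # D: diagonal matrix with unit-modulus entries e^{i*theta}
    return np.diag(np.exp(1j * theta))

def reflection(v):
    # R: complex Householder reflection I - 2 v v^H / ||v||^2, unitary by construction
    v = v / np.linalg.norm(v)
    return np.eye(n) - 2.0 * np.outer(v, v.conj())

F = np.fft.fft(np.eye(n)) / np.sqrt(n)   # unitary DFT matrix (F^{-1} = F^H)
Pi = np.eye(n)[rng.permutation(n)]       # fixed permutation matrix

D1, D2, D3 = (diag_phase(rng.uniform(-np.pi, np.pi, n)) for _ in range(3))
R1, R2 = (reflection(rng.standard_normal(n) + 1j * rng.standard_normal(n)) for _ in range(2))

W_h = D3 @ R2 @ F.conj().T @ D2 @ Pi @ R1 @ F @ D1

# The product of unitary factors is unitary: W_h^H W_h = I (up to float error)
assert np.allclose(W_h.conj().T @ W_h, np.eye(n))
```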

More typical solution: gating

Vanilla RNN:

\[ \mathbf{h}_{t+1} = \sigma(W_h \mathbf{h}_t + W_x \mathbf{x}_t + \mathbf{b}_h) \]

Gated Recurrent Unit:

\[\begin{align} \mathbf{h}_{t+1} &= \mathbf{z}_t \odot \mathbf{h}_t + (1 - \mathbf{z}_t) \odot \tilde{\mathbf{h}}_t \\ \tilde{\mathbf{h}}_t &= \phi\left(W\mathbf{x}_t + U(\mathbf{r}_t \odot \mathbf{h}_t)\right)\\ \mathbf{r}_t &= \sigma(W_r\mathbf{x}_t + U_r\mathbf{h}_t)\\ \mathbf{z}_t &= \sigma(W_z\mathbf{x}_t + U_z\mathbf{h}_t)\\ \end{align}\]
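Written out directly from these equations, one GRU step is (NumPy sketch; biases omitted as on the slide, with \(\phi = \tanh\)):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(x, h, W, U, Wr, Ur, Wz, Uz):
    r = sigmoid(Wr @ x + Ur @ h)             # reset gate r_t
    z = sigmoid(Wz @ x + Uz @ h)             # update gate z_t
    h_tilde = np.tanh(W @ x + U @ (r * h))   # candidate state h~_t
    return z * h + (1.0 - z) * h_tilde       # h_{t+1}: convex combination keeps a direct path to h_t
```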

GRU diagram

LSTM: Long Short-Term Memory

Side note: dealing with depth

Deep Residual Networks (ResNets)

ResNets

ResNets behave like ensembles

from (Veit et al., 2016)
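One way to see this claim (a brief unrolling in my own notation, writing each residual block as \(\mathbf{h} \mapsto \mathbf{h} + F_i(\mathbf{h})\)):

\[\begin{align} \mathbf{h}_2 &= \mathbf{h}_1 + F_2(\mathbf{h}_1) \\ &= \mathbf{h}_0 + F_1(\mathbf{h}_0) + F_2\big(\mathbf{h}_0 + F_1(\mathbf{h}_0)\big), \end{align}\]

so two blocks already mix paths of depth 0, 1 and 2; with \(L\) blocks the output spreads over \(2^L\) such paths, most of them shallow, which is the sense in which Veit et al. argue ResNets act like ensembles of shallow networks.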

DenseNets

Back to RNNs

Different from what we had before:

RNN: different uses

figure from Andrej Karpathy’s blog post

To engage with this material at home

Try the char-RNN Exercise from Udacity.