Different from what we’ve seen before:
- different input type (sequences)
- different network building blocks
- multiplicative interactions
- gating
- skip connections
- different objective
- maximum likelihood
- generative modelling
Modelling sequences
- input to the network: x1,x2,…,xT
- sequences of different length
- sometimes an ‘EOS’ (end-of-sequence) symbol
- sequence classification (e.g. text classification)
- sequence generation (e.g. language generation)
- sequence-to-sequence (e.g. translation)
Recurrent Neural Network
![]()
RNN: Unrolled through time
![]()
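To make the unrolling concrete, here is a minimal NumPy sketch of a vanilla RNN run over a sequence; the tanh nonlinearity, the dimensions, and the weight scales are illustrative assumptions, not part of the slides.

```python
import numpy as np

def rnn_unroll(x_seq, W_h, W_x, b_h, h0):
    """Run a vanilla RNN over a sequence and return all hidden states.

    x_seq: (T, input_dim) inputs;  W_h: (hidden, hidden);  W_x: (hidden, input_dim).
    """
    h, states = h0, []
    for x_t in x_seq:                           # unroll through time
        h = np.tanh(W_h @ h + W_x @ x_t + b_h)  # h_{t+1} = sigma(W_h h_t + W_x x_t + b_h)
        states.append(h)
    return np.stack(states)                     # (T, hidden)

# Example: sequence of length 5, 3-dim inputs, 4-dim hidden state.
rng = np.random.default_rng(0)
T, d_in, d_h = 5, 3, 4
states = rnn_unroll(rng.normal(size=(T, d_in)),
                    0.1 * rng.normal(size=(d_h, d_h)),
                    0.1 * rng.normal(size=(d_h, d_in)),
                    np.zeros(d_h), np.zeros(d_h))
print(states.shape)  # (5, 4)
```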
Generating sequences
Goal: model the distribution of sequences
$p(x_{1:T}) = p(x_1, \ldots, x_T)$
Idea: model it one-step-at-a-time:
$p(x_{1:T}) = p(x_T \mid x_{1:T-1})\, p(x_{T-1} \mid x_{1:T-2}) \cdots p(x_1)$
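The factorization reads directly as code: the log-probability of a sequence is the sum of per-step conditional log-probabilities. A minimal sketch, where `step_probs` is a hypothetical stand-in for the model's softmax output given the prefix $x_{1:t-1}$.

```python
import numpy as np

def sequence_log_prob(x, step_probs):
    """log p(x_{1:T}) = sum_t log p(x_t | x_{1:t-1})."""
    log_p = 0.0
    for t in range(len(x)):
        probs = step_probs(x[:t])     # model's distribution over the next symbol
        log_p += np.log(probs[x[t]])
    return log_p

# Toy example: 3-symbol vocabulary, prefix-independent next-symbol distribution.
toy_model = lambda prefix: np.array([0.5, 0.3, 0.2])
print(sequence_log_prob([0, 2, 1], toy_model))  # log 0.5 + log 0.2 + log 0.3
```

Maximum-likelihood training maximizes exactly this quantity, summed over the training sequences.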
Modeling sequence distributions
![]()
Training: maximum likelihood
![]()
Sampling sequences
![]()
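Sampling uses the same one-step-at-a-time structure (ancestral sampling): draw $x_1$, feed it back in, draw $x_2$, and so on until an EOS symbol. A sketch under the same hypothetical `step_probs` interface, with EOS assumed to be symbol 0.

```python
import numpy as np

def sample_sequence(step_probs, eos=0, max_len=100, seed=0):
    """Ancestral sampling: draw x_t ~ p(x_t | x_{1:t-1}) and feed it back in."""
    rng = np.random.default_rng(seed)
    x = []
    for _ in range(max_len):
        probs = step_probs(x)                       # p(x_t | x_{1:t-1})
        x_t = int(rng.choice(len(probs), p=probs))
        x.append(x_t)
        if x_t == eos:                              # stop at the end-of-sequence symbol
            break
    return x

print(sample_sequence(lambda prefix: np.array([0.1, 0.6, 0.3])))
```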
But it was not that easy:
- vanilla RNNs forget too quickly
- vanishing gradients problem
- exploding gradients problem
Vanishing/exploding gradients problem
Vanilla RNN:
$h_{t+1} = \sigma(W_h h_t + W_x x_t + b_h)$
$\hat{y} = \phi(W_y h_T + b_y)$
The gradient of the loss with respect to $h_t$ is
$\frac{\partial \hat{L}}{\partial h_t} = \frac{\partial \hat{L}}{\partial h_T} \prod_{s=t}^{T-1} \frac{\partial h_{s+1}}{\partial h_s} = \frac{\partial \hat{L}}{\partial h_T} \left( \prod_{s=t}^{T-1} D_s \right) W_h^{T-t},$
where
- $D_t = \mathrm{diag}\left[\sigma'(W_h h_{t-1} + W_x x_t + b_h)\right]$
- if $\sigma$ is ReLU, $\sigma'(z) \in \{0, 1\}$
The norm of the gradient is upper bounded
$\left\| \frac{\partial \hat{L}}{\partial h_t} \right\| \le \left\| \frac{\partial \hat{L}}{\partial h_T} \right\| \, \|W_h\|^{T-t} \prod_{s=t}^{T-1} \|D_s\|,$
- with ReLU, the norm of $D_s$ is at most 1, so these factors can only shrink the gradient (vanishing)
- the norm of $W_h$, raised to the power $T-t$, can cause gradients to explode (see the numerical sketch below)
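The bound can be checked numerically: the ReLU factors $D_s$ can only shrink the product, while $\|W_h\|^{T-t}$ drives it towards zero or infinity depending on the scale of $W_h$. A small sketch (the weight scales and the ReLU mask probability are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
d, T = 10, 20

def backprop_factor_norm(scale):
    """Norm of the accumulated product of D_s W_h over T steps of a ReLU RNN."""
    W_h = scale * rng.normal(size=(d, d)) / np.sqrt(d)
    J = np.eye(d)
    for _ in range(T):
        D = np.diag((rng.random(d) > 0.3).astype(float))  # ReLU mask: sigma'(z) in {0, 1}
        J = D @ W_h @ J
    return np.linalg.norm(J)

print(backprop_factor_norm(scale=0.3))  # far below 1: gradients vanish
print(backprop_factor_norm(scale=3.0))  # far above 1: gradients explode
```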
More typical solution: gating
Vanilla RNN:
$h_{t+1} = \sigma(W_h h_t + W_x x_t + b_h)$
Gated Recurrent Unit:
$h_{t+1} = z_t \odot h_t + (1 - z_t) \odot \tilde{h}_t$
$\tilde{h}_t = \phi(W x_t + U (r_t \odot h_t))$
$r_t = \sigma(W_r x_t + U_r h_t)$
$z_t = \sigma(W_z x_t + U_z h_t)$
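A minimal NumPy sketch of one GRU step following the equations above; biases are omitted as in the slide, $\phi$ is taken to be tanh, and the dimensions are illustrative.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(x_t, h_t, W, U, W_r, U_r, W_z, U_z):
    """One GRU update: h_{t+1} = z_t * h_t + (1 - z_t) * h~_t."""
    r_t = sigmoid(W_r @ x_t + U_r @ h_t)           # reset gate
    z_t = sigmoid(W_z @ x_t + U_z @ h_t)           # update gate
    h_tilde = np.tanh(W @ x_t + U @ (r_t * h_t))   # candidate state (phi = tanh)
    return z_t * h_t + (1.0 - z_t) * h_tilde

# Example with a 3-dim input and a 4-dim hidden state.
rng = np.random.default_rng(0)
d_in, d_h = 3, 4
params = [0.1 * rng.normal(size=s) for s in [(d_h, d_in), (d_h, d_h)] * 3]
print(gru_step(rng.normal(size=d_in), np.zeros(d_h), *params).shape)  # (4,)
```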
GRU diagram
![]()
LSTM: Long Short-Term Memory
- by Hochreiter and Schmidhuber (1997)
- improved/tweaked several times since
- more gates to control behaviour
- 2009: Alex Graves' LSTM wins the ICDAR connected handwriting recognition competition
- 2013: sets a new record on a natural speech benchmark
- 2014: GRU proposed (a simplified LSTM)
- 2016: widely used for neural machine translation
RNNs for painting
![]()
Spatial LSTMs generating textures
![]()
Seq2Seq: neural machine translation
![]()
General algorithms as Seq2Seq
convex hull and triangulation
![]()
Pointer networks
![]()
Revisiting the basic idea
![]()
“Asking the network too much”
Attention layer
![]()
Attention layer
Attention weights:
$\alpha_{t,s} = \frac{e^{e_t^\top d_s}}{\sum_u e^{e_u^\top d_s}}$
Context vector:
$c_s = \sum_{t=1}^{T} \alpha_{t,s} e_t$
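The same equations as code: the scores are dot products $e_t^\top d_s$ between encoder states and the decoder state, $\alpha_{t,s}$ is a softmax over $t$, and the context $c_s$ is the weighted sum of encoder states. The shapes are illustrative.

```python
import numpy as np

def attention(E, d_s):
    """E: encoder states, shape (T, dim); d_s: decoder state, shape (dim,)."""
    scores = E @ d_s                               # e_t^T d_s for each t
    scores = scores - scores.max()                 # subtract max for numerical stability
    alpha = np.exp(scores) / np.exp(scores).sum()  # alpha_{t,s}: softmax over t
    c_s = alpha @ E                                # c_s = sum_t alpha_{t,s} e_t
    return alpha, c_s

rng = np.random.default_rng(0)
alpha, c_s = attention(rng.normal(size=(6, 4)), rng.normal(size=4))
print(alpha.sum(), c_s.shape)  # ~1.0, (4,)
```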
Attention layer visualised
![]()
Side note: dealing with depth
![]()
Side note: dealing with depth
![]()
Side note: dealing with depth
![]()
Deep Residual Networks (ResNets)
![]()
Deep Residual Networks (ResNets)
![]()
ResNets
- allow for much deeper networks (101 and 152 layers)
- performance increases with depth
- set new records on benchmarks (ImageNet, COCO)
- used almost everywhere now
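A minimal sketch of the residual idea behind these results: a block computes a correction $f(h)$ and adds it to its own input, so information and gradients can pass through the identity path unchanged. The two-layer ReLU branch and the shapes are illustrative assumptions (real ResNets use convolutions and batch normalization).

```python
import numpy as np

def residual_block(h, W1, W2):
    """y = h + f(h): the identity shortcut passes h through untouched."""
    f = W2 @ np.maximum(W1 @ h, 0.0)   # small two-layer ReLU residual branch
    return h + f

rng = np.random.default_rng(0)
d = 8
h = rng.normal(size=d)
y = residual_block(h, 0.01 * rng.normal(size=(d, d)), 0.01 * rng.normal(size=(d, d)))
print(np.allclose(y, h, atol=1e-2))    # with small weights the block is near the identity
```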
DenseNets
![]()
DenseNets
![]()
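DenseNets take the shortcut idea further: rather than adding, each layer receives the concatenation of the block input and all earlier layers' outputs. A minimal sketch of that connectivity pattern, using plain matrices instead of convolutions; the sizes are illustrative.

```python
import numpy as np

def dense_block(x, layers):
    """Each layer sees the concatenation of the block input and all earlier outputs."""
    features = [x]
    for W in layers:
        inp = np.concatenate(features)             # concatenate, not add
        features.append(np.maximum(W @ inp, 0.0))  # ReLU layer
    return np.concatenate(features)

rng = np.random.default_rng(0)
d, growth = 8, 4   # each layer adds 'growth' new features
layers = [0.1 * rng.normal(size=(growth, d + i * growth)) for i in range(3)]
print(dense_block(rng.normal(size=d), layers).shape)  # (8 + 3*4,) = (20,)
```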
Back to RNNs
- like ResNets, LSTMs and GRUs create “shortcuts”
- these allow information to skip processing
- data-dependent gating
- data-dependent shortcuts (compare the two updates below)
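The parallel can be written side by side: a residual block uses a fixed identity shortcut, while a GRU makes the shortcut data-dependent through its update gate.

ResNet block: $h_{l+1} = h_l + f(h_l)$
GRU update: $h_{t+1} = z_t \odot h_t + (1 - z_t) \odot \tilde{h}_t$, with the gate $z_t = \sigma(W_z x_t + U_z h_t)$ computed from the data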
Different from what we had before:
- different input type (sequences)
- different network building blocks
- multiplicative interactions
- gating
- skip connections
- different objective
- maximum likelihood
- generative modelling