figure from Andrej Karpathy’s blog post
Goal: model the distribution of sequences
\[ p(x_{1:T}) = p(x_1, \ldots, x_T) \]
Idea: model it one-step-at-a-time:
\[ p(x_{1:T}) = p(x_T\vert x_{1:T-1}) p(x_{T-1} \vert x_{1:T-2}) \cdots p(x_1) \]
from Andrej Karpathy’s 2015 blog post
from Andrej Karpathy’s 2015 blog post
from Andrej Karpathy’s 2015 blog post
from Andrej Karpathy’s 2015 blog post
from Andrej Karpathy’s 2015 blog post
Vanilla RNN:
\[ \mathbf{h}_{t+1} = \sigma(W_h \mathbf{h}_t + W_x \mathbf{x}_t + \mathbf{b_h}) \]
\[ \hat{y} = \phi(W_y \mathbf{h}_{T} + \mathbf{b}_y) \]
\[\begin{align} \frac{\partial \hat{L}}{\partial \mathbf{h}_t} &= \frac{\partial \hat{L}}{\partial \mathbf{h}_T} \prod_{s=t}^{T-1} \frac{\partial h_{s+1}}{\partial h_s} \\ &= \frac{\partial \hat{L}}{\mathbf{h}_T} \left( \prod_{s=t}^{T-1} D_s \right) W^{T-t}_h, \end{align}\]
where * \(D_t = \operatorname{diag} \left[\sigma'(W_t \mathbf{h}_{t-1} + + W_x \mathbf{x}_t + \mathbf{b_h})\right]\) * if \(\sigma\) is ReLU, \(\sigma'(z) \in \{0, 1\}\)
\[\begin{align} \left\|\frac{\partial \hat{L}}{\partial \mathbf{h}_t}\right\| &\leq \left\|\frac{\partial \hat{L}}{\mathbf{h}_T}\right\| \left\|W_h\right\|^{T-t} \prod_{s=t}^{T-1} \left\|D_s\right\|, \end{align}\]
Vanilla RNN:
\[ \mathbf{h}_{t+1} = \sigma(W_h \mathbf{h}_t + W_x \mathbf{x}_t + \mathbf{b_h}) \]
Gated Recurrent Unit:
\[\begin{align} \mathbf{h}_{t+1} &= \mathbf{z}_t \odot \mathbf{h}_t + (1 - \mathbf{z}_t) \tilde{\mathbf{h}}_t \\ \tilde{\mathbf{h}}_t &= \phi\left(W\mathbf{x}_t + U(\mathbf{r}_t \odot \mathbf{h}_t)\right)\\ \mathbf{r}_t &= \sigma(W_r\mathbf{x}_t + U_r\mathbf{h}_t)\\ \mathbf{z}_t &= \sigma(W_z\mathbf{x}_t + U_z\mathbf{h}_t)\\ \end{align}\]
travelling salesman
convex hull and triangulation
“Asking the network too much”
Attention weights:
\[ \alpha_{t,s} = \frac{e^{\mathbf{e}^T_t \mathbf{d}_s}}{\sum_u e^{\mathbf{e}^T_t \mathbf{d}_s}} \]
Context vector:
\[ \mathbf{c}_s = \sum_{t=1}^T \alpha_{t,s} \mathbf{e}_t \]
Try the char-RNN Exercise from Udacity.
from (Veit et al, 2016)
{Different from what we had before: