Week 6: Representation and Transfer Learning

Ferenc Huszár

Abstract:

In this lecture, Ferenc will introduce the ideas behind representation learning and transfer learning.

Unsupervised learning

Unsupervised learning goals

UL as distribution modelling

Deep learning for modelling distributions

Latent variable models

\[ p_\theta(x) = \int p_\theta(x, z) dz \]

Latent variable models

\[ p_\theta(x) = \int p_\theta(x\vert z) p_\theta(z) dz \]
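
To make the integral concrete, here is a minimal Monte Carlo sketch of estimating \(\log p_\theta(x)\), assuming a toy stand-in for \(p_\theta(x\vert z)\) and a standard normal prior (both choices are illustrative, not the lecture's model). This naive estimator has high variance in practice, which is part of what motivates the variational treatment later in the lecture.

```python
import numpy as np

def log_p_x_given_z(x, z):
    # Toy stand-in for log p_theta(x | z): here x | z ~ N(z, I), purely for illustration.
    return -0.5 * np.sum((x - z) ** 2) - 0.5 * x.size * np.log(2 * np.pi)

def naive_log_px(x, latent_dim=2, num_samples=10_000, seed=0):
    """Monte Carlo estimate of log p(x) = log E_{z ~ p(z)}[p(x | z)]."""
    rng = np.random.default_rng(seed)
    z = rng.standard_normal((num_samples, latent_dim))       # z ~ p(z) = N(0, I)
    log_w = np.array([log_p_x_given_z(x, z_k) for z_k in z])
    return np.logaddexp.reduce(log_w) - np.log(num_samples)  # log-mean-exp, for stability

print(naive_log_px(np.zeros(2)))   # estimated marginal log-likelihood of one data point
```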

Motivation 1

“it makes sense”

Motivation 2

manifold assumption

Motivation 3

from simple to complicated

\[ p_\theta(x) = \int p_\theta(x\vert z) p_\theta(z) dz \]

Motivation 3

from simple to complicated

\[ \underbrace{p_\theta(x)}_\text{complicated} = \int \underbrace{p_\theta(x\vert z) }_\text{simple}\underbrace{p_\theta(z)}_\text{simple} dz \]

Motivation 3

from simple to complicated

\[ \underbrace{p_\theta(x)}_\text{complicated} = \int \underbrace{\mathcal{N}\left(x; \mu_\theta(z), \operatorname{diag}(\sigma_\theta(z)) \right)}_\text{simple}\underbrace{\mathcal{N}(z; 0, I)}_\text{simple} dz \]
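
A sketch of how this construction is usually parameterised with a neural network (the layer sizes and module names below are illustrative assumptions, not from the slides): the decoder maps a draw from the simple prior to the parameters of a simple diagonal Gaussian over \(x\), and ancestral sampling then yields a draw from the complicated marginal \(p_\theta(x)\).

```python
import torch
import torch.nn as nn

class GaussianDecoder(nn.Module):
    """p_theta(x | z) = N(x; mu_theta(z), diag(sigma_theta(z))); illustrative sizes."""
    def __init__(self, latent_dim=16, data_dim=784, hidden=256):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(latent_dim, hidden), nn.ReLU())
        self.mu = nn.Linear(hidden, data_dim)
        self.log_sigma = nn.Linear(hidden, data_dim)

    def forward(self, z):
        h = self.body(z)
        return self.mu(h), self.log_sigma(h).exp()

decoder = GaussianDecoder()
z = torch.randn(8, 16)                  # simple: z ~ N(0, I)
mu, sigma = decoder(z)                  # simple: Gaussian conditional, given z
x = mu + sigma * torch.randn_like(mu)   # ancestral sample from the complicated p_theta(x)
```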

Motivation 4

variational learning

(Kingma and Welling, 2019) Variational Autoencoder

Variational autoencoder

Variational encoder: interpretable \(z\)

Self-supervised learning

basic idea

example: jigsaw puzzles

(Noroozi and Favaro, 2016)
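
A rough sketch of how such a pretext task can be constructed (simplified: Noroozi and Favaro train a classifier over a fixed set of tile permutations, whereas this hypothetical helper just returns the permutation itself):

```python
import numpy as np

def make_jigsaw_example(image, grid=3, rng=None):
    """Cut an (H, W, C) image into grid x grid tiles and shuffle them.

    The network is then trained to predict how the tiles were permuted,
    which it can only do by learning about object structure.
    """
    if rng is None:
        rng = np.random.default_rng()
    h, w = image.shape[0] // grid, image.shape[1] // grid
    tiles = [image[i * h:(i + 1) * h, j * w:(j + 1) * w]
             for i in range(grid) for j in range(grid)]
    perm = rng.permutation(len(tiles))       # pretext label
    return [tiles[k] for k in perm], perm    # inputs (shuffled tiles), target
```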

Data-efficiency in downstream task

(Hénaff et al, 2020)

Linearity in downstream task

(Chen et al, 2020)

Several self-supervised methods

Example: instance classification

Example: contrastive learning
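
A minimal sketch of a contrastive objective in the InfoNCE / SimCLR style, assuming `z1` and `z2` are encoder embeddings of two augmented views of the same batch of images (this simplified version only uses cross-view negatives; full SimCLR also uses within-view ones):

```python
import torch
import torch.nn.functional as F

def info_nce(z1, z2, temperature=0.1):
    """Contrastive loss: pull matching rows of z1 and z2 together, push the rest apart."""
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / temperature                       # (batch, batch) cosine similarities
    targets = torch.arange(z1.shape[0], device=z1.device)    # positive pair sits on the diagonal
    return F.cross_entropy(logits, targets)
```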

Example: Masked Language Models

image credit: (Lample and Conneau, 2019)
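
A sketch of the masked-language-model pretext task: hide a random subset of tokens and train the model to predict them from the remaining context. The 15% rate and the `-100` ignore index follow common practice and are included for illustration; BERT additionally keeps or randomises some of the selected tokens.

```python
import torch

def mask_tokens(token_ids, mask_id, mask_prob=0.15):
    """Return (masked inputs, targets); loss is computed only at masked positions."""
    labels = token_ids.clone()
    masked = torch.rand(token_ids.shape) < mask_prob
    labels[~masked] = -100            # ignore index: unmasked positions carry no loss
    inputs = token_ids.clone()
    inputs[masked] = mask_id          # replace selected tokens with the [MASK] id
    return inputs, labels
```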

BERT

Why should any of this work?

Predicting What You Already Know Helps: Provable Self-Supervised Learning

(Lee et al, 2020)

Provable Self-Supervised Learning

Assumptions:

Provable Self-Supervised Learning

\(X_1 \perp \!\!\! \perp X_2 \vert Y, Z\)

Provable Self-Supervised Learning

\[ X_1 \perp \!\!\! \perp X_2 \vert Y, Z \]

Provable Self-Supervised Learning

\[ 👀 \perp \!\!\! \perp 👄 \vert \text{age}, \text{gender}, \text{ethnicity} \]

Provable Self-Supervised Learning

If \(X_1 \perp \!\!\! \perp X_2 \vert Y\), then

\[ \mathbb{E}[X_2 \vert X_1 = x_1] = \sum_k \mathbb{E}[X_2\vert Y=k] \mathbb{P}[Y=k\vert X_1 = x_1] \]
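
The step follows from the tower rule and then the conditional independence assumption; spelling it out (this intermediate step is not on the slide):

\[ \mathbb{E}[X_2 \vert X_1 = x_1] = \sum_k \mathbb{E}[X_2\vert Y=k, X_1 = x_1]\, \mathbb{P}[Y=k\vert X_1 = x_1] = \sum_k \mathbb{E}[X_2\vert Y=k]\, \mathbb{P}[Y=k\vert X_1 = x_1], \]

where the second equality uses \(X_1 \perp \!\!\! \perp X_2 \vert Y\).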

Provable Self-Supervised Learning

\[\begin{aligned} &\mathbb{E}[X_2 \vert X_1=x_1] = \\ &\left[\begin{matrix} \mathbb{E}[X_2\vert Y=1], \ldots, \mathbb{E}[X_2\vert Y=k]\end{matrix}\right] \left[\begin{matrix} \mathbb{P}[Y=1\vert X_1=x_1]\\ \vdots \\ \mathbb{P}[Y=k\vert X_1=x_1]\end{matrix}\right] \end{aligned}\]

Provable Self-Supervised Learning

\[\begin{aligned} &\mathbb{E}[X_2 \vert X_1=x_1] = \\ &\underbrace{\left[\begin{matrix} \mathbb{E}[X_2\vert Y=1], \ldots, \mathbb{E}[X_2\vert Y=k]\end{matrix}\right]}_\mathbf{A}\left[\begin{matrix} \mathbb{P}[Y=1\vert X_1=x_1]\\ \vdots \\ \mathbb{P}[Y=k\vert X_1=x_1]\end{matrix}\right] \end{aligned}\]

Provable Self-Supervised Learning

\[ \mathbb{E}[X_2 \vert X_1=x_1] = \mathbf{A}\left[\begin{matrix} \mathbb{P}[Y=1\vert X_1=x_1]\\ \vdots \\ \mathbb{P}[Y=k\vert X_1=x_1]\end{matrix}\right] \]

Provable Self-Supervised Learning

\[ \mathbf{A}^\dagger \mathbb{E}[X_2 \vert X_1=x_1] = \left[\begin{matrix} \mathbb{P}[Y=1\vert X_1=x_1]\\ \vdots \\ \mathbb{P}[Y=k\vert X_1=x_1]\end{matrix}\right] \]

Provable Self-Supervised Learning

\[ \mathbf{A}^\dagger \underbrace{\mathbb{E}[X_2 \vert X_1=x_1]}_\text{pretext task} = \underbrace{\left[\begin{matrix} \mathbb{P}[Y=1\vert X_1=x_1]\\ \vdots \\ \mathbb{P}[Y=k\vert X_1=x_1]\end{matrix}\right]}_\text{downstream task} \]
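
A tiny numerical sanity check of this argument with made-up numbers (not from the paper), assuming two classes and a two-dimensional \(X_2\) so that \(\mathbf{A}\) has full column rank:

```python
import numpy as np

# Columns of A are the made-up class-conditional means E[X_2 | Y = k], k = 1, 2.
A = np.array([[2.0, 5.0],
              [1.0, -1.0]])

p_y_given_x1 = np.array([0.7, 0.3])     # true downstream posterior at some x_1

# Under X_1 ⟂ X_2 | Y, the pretext regression target at x_1 is A @ p:
pretext = A @ p_y_given_x1

# Because A has full column rank, a fixed linear map (the pseudo-inverse)
# turns the pretext prediction back into the downstream posterior:
print(np.linalg.pinv(A) @ pretext)      # -> [0.7, 0.3]
```

This is the sense in which solving the pretext task makes the downstream task solvable by a linear readout.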

Provable self-supervised learning summary

Recap

Variational learning

\[ \theta^\text{ML} = \operatorname{argmax}_\theta \sum_{x_i \in \mathcal{D}} \log p_\theta(x_i) \]

Variational learning

\[ \mathcal{L}(\theta, \psi) = \sum_{x_i \in \mathcal{D}} \log p_\theta(x_i) - \operatorname{KL}[q_\psi(z\vert x_i) \| p_\theta(z\vert x_i)] \]

Variational learning

\[ \mathcal{L}(\theta, \psi) = \sum_{x_i \in \mathcal{D}} \log p_\theta(x_i) + \mathbb{E}_{z\sim q_\psi} \log \frac{p_\theta(z\vert x_i)}{q_\psi(z\vert x_i)} \]

Variational learning

\[ \mathcal{L}(\theta, \psi) = \sum_{x_i \in \mathcal{D}} \mathbb{E}_{z\sim q_\psi} \log \frac{p_\theta(z\vert x_i) p_\theta(x_i)}{q_\psi(z\vert x_i)} \]

Variational learning

\[ \mathcal{L}(\theta, \psi) = \sum_{x_i \in \mathcal{D}} \mathbb{E}_{z\sim q_\psi} \log \frac{p_\theta(z, x_i)}{q_\psi(z\vert x_i)} \]

Variational learning

\[ \mathcal{L}(\theta, \psi) = \sum_{x_i \in \mathcal{D}} \mathbb{E}_{z\sim q_\psi(z\vert x_i)} \log p_\theta(x_i\vert z) - \operatorname{KL}[q_\psi(z\vert x_i) \| p_\theta(z)] \]

Variational learning

\[ \mathcal{L}(\theta, \psi) = \sum_{x_i \in \mathcal{D}} \underbrace{\mathbb{E}_{z\sim q_\psi(z\vert x_i)} \log p_\theta(x_i\vert z)}_\text{reconstruction} - \operatorname{KL}[q_\psi(z\vert x_i) \| p_\theta(z)] \]
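
Putting the final form of the bound into code, a minimal per-batch sketch assuming a Gaussian encoder \(q_\psi(z\vert x)\), a Bernoulli decoder, a standard normal prior, and a single reparameterised sample (these modelling choices are illustrative, not prescribed by the lecture):

```python
import torch
import torch.nn.functional as F

def elbo(x, encoder, decoder):
    """One-sample estimate of E_q[log p_theta(x|z)] - KL[q_psi(z|x) || p_theta(z)].

    Assumes encoder(x) -> (mu, log_var) of a diagonal Gaussian q_psi(z|x) and
    decoder(z) -> Bernoulli logits over x; both are hypothetical modules.
    """
    mu, log_var = encoder(x)
    z = mu + (0.5 * log_var).exp() * torch.randn_like(mu)          # reparameterisation trick
    reconstruction = -F.binary_cross_entropy_with_logits(
        decoder(z), x, reduction="none").sum(dim=1)                # log p_theta(x | z)
    kl = 0.5 * (mu ** 2 + log_var.exp() - log_var - 1).sum(dim=1)  # KL[q_psi || N(0, I)]
    return (reconstruction - kl).mean()   # maximise this bound (minimise its negative)
```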

Discussion of max likelihood

Representation learning vs max likelihood

Discussion of max likelihood