Week 6: Representation and Transfer Learning
Abstract:
In this lecture, Ferenc will introduce us to the ideas behind representation and transfer learning.
Unsupervised learning
- observations $x_1, x_2, \ldots$
- drawn i.i.d. from some $p_{\mathcal{D}}$
- can we learn something from this?
Unsupervised learning goals
- a model of the data distribution $p_\theta(x) \approx p_{\mathcal{D}}(x)$
  - compression
  - data reconstruction
  - sampling/generation
- a representation $z = g_\theta(x)$ or $q_\theta(z \mid x)$
  - downstream classification tasks
  - data visualisation
UL as distribution modeling
- defines the goal as modeling $p_\theta(x) \approx p_{\mathcal{D}}(x)$
- $\theta$: model parameters
- maximum likelihood estimation: $\theta_{\mathrm{ML}} = \arg\max_\theta \sum_{x_i \in \mathcal{D}} \log p_\theta(x_i)$ (a minimal numerical sketch follows)
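As a concrete illustration of maximum likelihood estimation, here is a minimal sketch assuming a toy one-dimensional Gaussian model family; the model choice and numbers are illustrative, not from the lecture:

```python
import numpy as np
from scipy.stats import norm

# Toy data drawn i.i.d. from an unknown data distribution p_D.
rng = np.random.default_rng(0)
data = rng.normal(loc=2.0, scale=0.5, size=1000)

# Model family: p_theta(x) = N(x; mu, sigma^2) with theta = (mu, sigma).
# For a Gaussian the maximum likelihood solution is available in closed form.
mu_ml, sigma_ml = data.mean(), data.std()

# theta_ML maximises the summed log-likelihood over the dataset.
log_likelihood = norm.logpdf(data, loc=mu_ml, scale=sigma_ml).sum()
print(f"theta_ML = ({mu_ml:.3f}, {sigma_ml:.3f}), log-likelihood = {log_likelihood:.1f}")
```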
Deep learning for modelling distributions
- auto-regressive models (e.g. RNNs)
  - $p_\theta(x_{1:T}) = \prod_{t=1}^{T} p_\theta(x_t \mid x_{1:t-1})$ (see the sketch after this list)
- implicit distributions (e.g. GANs)
  - $x = g_\theta(z)$, $z \sim \mathcal{N}(0, I)$
- flow models (e.g. RealNVP)
  - like above, but with $g_\theta$ invertible
- latent variable models (LVMs, e.g. VAE)
  - $p_\theta(x) = \int p_\theta(x, z)\, dz$
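As a sketch of the auto-regressive factorisation above: the sequence log-likelihood is simply a sum of conditional log-densities. The first-order Gaussian model below is an illustrative stand-in for an RNN, and all parameter values are assumptions made for the example:

```python
import numpy as np
from scipy.stats import norm

# Toy auto-regressive model: p(x_1) = N(0, sigma^2) and
# p(x_t | x_{1:t-1}) = N(x_t; a * x_{t-1}, sigma^2).
# An RNN would instead compute each conditional from a hidden state summarising x_{1:t-1}.
a, sigma = 0.8, 1.0

def log_prob(x):
    # log p(x_{1:T}) = sum_t log p(x_t | x_{1:t-1})
    cond_means = np.concatenate([[0.0], a * x[:-1]])
    return norm.logpdf(x, loc=cond_means, scale=sigma).sum()

print(log_prob(np.array([0.1, 0.3, -0.2, 0.5])))
```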
Latent variable models
$$p_\theta(x) = \int p_\theta(x, z)\, dz = \int p_\theta(x \mid z)\, p_\theta(z)\, dz$$
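Under this definition, the marginal likelihood can in principle be estimated by averaging $p_\theta(x \mid z)$ over prior samples. A minimal sketch, assuming an illustrative linear-Gaussian model for which the exact marginal is known for comparison:

```python
import numpy as np
from scipy.stats import norm

# Toy latent variable model: p(z) = N(0, 1), p(x|z) = N(x; 2*z, 0.5^2).
rng = np.random.default_rng(0)

def marginal_likelihood(x, n_samples=100_000):
    z = rng.standard_normal(n_samples)                  # z ~ p(z)
    return norm.pdf(x, loc=2.0 * z, scale=0.5).mean()   # E_z[p(x|z)] ≈ p(x)

# Exact marginal for this linear-Gaussian case: N(x; 0, 2^2 + 0.5^2).
x = 1.3
print(marginal_likelihood(x), norm.pdf(x, loc=0.0, scale=np.sqrt(4.25)))
```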
Motivation 1
“it makes sense”
- describes data in terms of a generative process
- e.g. object properties, locations
- the learnt $z$ is often interpretable
- causal reasoning often needs latent variables
Motivation 2
manifold assumption
- high-dimensional data doesn't occupy all of the space
- it is concentrated along a low-dimensional manifold
- $z \approx$ intrinsic coordinates within the manifold
Motivation 3
from simple to complicated
$$\underbrace{p_\theta(x)}_{\text{complicated}} = \int \underbrace{p_\theta(x \mid z)}_{\text{simple}}\; \underbrace{p_\theta(z)}_{\text{simple}}\, dz$$
$$\underbrace{p_\theta(x)}_{\text{complicated}} = \int \underbrace{\mathcal{N}\big(x;\, \mu_\theta(z),\, \operatorname{diag}(\sigma_\theta(z))\big)}_{\text{simple}}\; \underbrace{\mathcal{N}(z;\, 0, I)}_{\text{simple}}\, dz$$
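A minimal sketch of this "simple pieces, complicated marginal" construction via ancestral sampling; `mu_theta` and `sigma_theta` below are hypothetical stand-ins for a decoder network, not the lecture's model:

```python
import numpy as np

rng = np.random.default_rng(0)
W = np.array([[1.5, -0.7], [0.3, 2.0]])   # assumed "decoder" weights, for illustration

def mu_theta(z):                          # nonlinear mean function
    return np.tanh(z @ W)

def sigma_theta(z):                       # per-dimension standard deviations
    return 0.1 + 0.05 * np.abs(z)

z = rng.standard_normal((10_000, 2))                               # z ~ N(0, I)
x = mu_theta(z) + sigma_theta(z) * rng.standard_normal(z.shape)    # x ~ N(mu(z), diag(sigma(z)))
# Each conditional p(x|z) is a diagonal Gaussian, but the marginal over x is not Gaussian.
print(x.mean(axis=0), x.std(axis=0))
```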
Motivation 4
variational learning
- evaluating $p_\theta(x)$ is hard
  - so learning is hard
- evaluating $p_\theta(z \mid x)$ is hard
  - so inference is hard
- the variational framework gives us:
  - approximate learning
  - approximate inference

(Kingma and Welling, 2019) Variational Autoencoder
Variational autoencoder
- Decoder: $p_\theta(x \mid z) = \mathcal{N}\big(x;\, \mu_\theta(z),\, \sigma_n I\big)$
- Encoder: $q_\psi(z \mid x) = \mathcal{N}\big(z;\, \mu_\psi(x),\, \operatorname{diag}(\sigma_\psi(x))\big)$
- Prior: $p_\theta(z) = \mathcal{N}(0, I)$
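A minimal PyTorch sketch of this parameterisation, assuming illustrative layer sizes and a fully connected architecture (not the lecture's model): a Gaussian encoder producing $(\mu_\psi(x), \sigma_\psi(x))$, a reparameterised sample $z$, and a decoder producing the mean of $p_\theta(x \mid z)$.

```python
import torch
import torch.nn as nn

class VAE(nn.Module):
    def __init__(self, x_dim=784, z_dim=16, hidden=256, sigma_n=0.1):
        super().__init__()
        self.sigma_n = sigma_n                     # fixed decoder noise scale
        self.encoder = nn.Sequential(nn.Linear(x_dim, hidden), nn.ReLU(),
                                     nn.Linear(hidden, 2 * z_dim))
        self.decoder = nn.Sequential(nn.Linear(z_dim, hidden), nn.ReLU(),
                                     nn.Linear(hidden, x_dim))

    def forward(self, x):
        mu, log_sigma = self.encoder(x).chunk(2, dim=-1)   # q_psi(z|x) parameters
        z = mu + log_sigma.exp() * torch.randn_like(mu)    # reparameterisation trick
        x_mean = self.decoder(z)                           # mean of p_theta(x|z)
        return x_mean, mu, log_sigma

vae = VAE()
x_mean, mu, log_sigma = vae(torch.rand(8, 784))
print(x_mean.shape, mu.shape)
```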
Variational encoder: interpretable z

Self-supervised learning
basic idea
- turn an unsupervised problem into a supervised one
- turn datapoints $x_i$ into input–output pairs
- this is called an auxiliary or pretext task
- learn to solve the auxiliary task
- transfer the representation learned to the downstream task
example: jigsaw puzzles

Data-efficiency in downstream task

Linearity in downstream task

Several self-supervised methods
- auto-encoding
- denoising auto-encoding
- pseudo-likelihood
- instance classification
- contrastive learning
- masked language models
Example: instance classification
- pick a random data index $i$
- randomly transform the image $x_i$: $T(x_i)$
- auxiliary task: guess the data index $i$ from the transformed input $T(x_i)$ (see the sketch below)
- difficulty: $N$-way classification
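A minimal sketch of this pretext task, assuming a stand-in dataset, a simple fully connected encoder, and an illustrative augmentation; none of these specific choices come from the lecture:

```python
import torch
import torch.nn as nn

# Instance classification: every training image is its own class, and the network
# must recover the image index i from a randomly transformed view T(x_i).
N = 1000
images = torch.rand(N, 3, 32, 32)                    # stand-in dataset

def augment(x):
    # illustrative T: per-image random horizontal flip plus Gaussian noise
    flip = torch.rand(x.shape[0], 1, 1, 1) < 0.5
    return torch.where(flip, torch.flip(x, dims=[-1]), x) + 0.05 * torch.randn_like(x)

encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 128), nn.ReLU())
head = nn.Linear(128, N)                             # N-way "which image was it?" classifier
opt = torch.optim.Adam([*encoder.parameters(), *head.parameters()], lr=1e-3)

for _ in range(10):
    idx = torch.randint(0, N, (64,))                 # pick random data indices i
    logits = head(encoder(augment(images[idx])))     # predict i from T(x_i)
    loss = nn.functional.cross_entropy(logits, idx)
    opt.zero_grad(); loss.backward(); opt.step()
```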
Example: contrastive learning
- pick a random label $y \in \{0, 1\}$
- if $y = 1$: pick two random images $x_1$, $x_2$
- if $y = 0$: use the same image twice, $x_1 = x_2$
- auxiliary task: predict $y$ from $f_\theta(T_1(x_1))$ and $f_\theta(T_2(x_2))$ (see the sketch below)
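A minimal sketch of this pair construction, reusing the same kind of stand-in dataset, encoder $f_\theta$, and augmentation as above, plus an assumed small pair-classification head; all of these are illustrative choices, not the lecture's method:

```python
import torch
import torch.nn as nn

# Following the slide's convention: y = 1 means two different images,
# y = 0 means two augmented views of the same image.
N = 1000
images = torch.rand(N, 3, 32, 32)

def augment(x):
    flip = torch.rand(x.shape[0], 1, 1, 1) < 0.5
    return torch.where(flip, torch.flip(x, dims=[-1]), x) + 0.05 * torch.randn_like(x)

f = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 128), nn.ReLU())   # f_theta
pair_head = nn.Linear(2 * 128, 1)          # predicts y from the two representations
opt = torch.optim.Adam([*f.parameters(), *pair_head.parameters()], lr=1e-3)

for _ in range(10):
    y = torch.randint(0, 2, (64,)).float()
    i = torch.randint(0, N, (64,))
    j = torch.randint(0, N, (64,))
    j = torch.where(y.bool(), j, i)                         # y = 0: reuse the same image
    z1, z2 = f(augment(images[i])), f(augment(images[j]))   # f(T1(x1)), f(T2(x2))
    logits = pair_head(torch.cat([z1, z2], dim=-1)).squeeze(-1)
    loss = nn.functional.binary_cross_entropy_with_logits(logits, y)
    opt.zero_grad(); loss.backward(); opt.step()
```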
Example: Masked Language Models

image credit: (Lample and Conneau, 2019)
BERT

Why should any of this work?
Predicting What you Already Know Helps: Provable Self-Supervised Learning
Provable Self-Supervised Learning
Assumptions:
- the observable $X$ decomposes into $(X_1, X_2)$
- pretext task: we are only given $(X_1, X_2)$ pairs
- downstream task: we want to predict $Y$
- $X_1 \perp\!\!\!\perp X_2 \mid Y, Z$
- (+1 additional strong assumption)
Provable Self-Supervised Learning
(figures illustrating the assumption)
$$X_1 \perp\!\!\!\perp X_2 \mid Y, Z$$
e.g. 👀 ⊥⊥ 👄 | age, gender, ethnicity
Provable Self-Supervised Learning
If $X_1 \perp\!\!\!\perp X_2 \mid Y$, then
$$\mathbb{E}[X_2 \mid X_1 = x_1] = \sum_k \mathbb{E}[X_2 \mid Y = k]\;\mathbb{P}[Y = k \mid X_1 = x_1]$$
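This identity is just the tower rule combined with the conditional independence assumption; a short derivation for discrete $Y$:
$$\begin{aligned}
\mathbb{E}[X_2 \mid X_1 = x_1]
&= \mathbb{E}\big[\,\mathbb{E}[X_2 \mid Y, X_1]\,\big|\, X_1 = x_1\big] && \text{(tower rule)} \\
&= \sum_k \mathbb{E}[X_2 \mid Y = k, X_1 = x_1]\;\mathbb{P}[Y = k \mid X_1 = x_1] \\
&= \sum_k \mathbb{E}[X_2 \mid Y = k]\;\mathbb{P}[Y = k \mid X_1 = x_1] && \text{(using } X_1 \perp\!\!\!\perp X_2 \mid Y\text{)}
\end{aligned}$$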
Provable Self-Supervised Learning
In matrix form, with the class-conditional means collected as the columns of $A$:
$$\mathbb{E}[X_2 \mid X_1 = x_1] = \underbrace{\big[\,\mathbb{E}[X_2 \mid Y = 1],\; \ldots,\; \mathbb{E}[X_2 \mid Y = k]\,\big]}_{A} \begin{bmatrix} \mathbb{P}[Y = 1 \mid X_1 = x_1] \\ \vdots \\ \mathbb{P}[Y = k \mid X_1 = x_1] \end{bmatrix} = A \begin{bmatrix} \mathbb{P}[Y = 1 \mid X_1 = x_1] \\ \vdots \\ \mathbb{P}[Y = k \mid X_1 = x_1] \end{bmatrix}$$
Applying the pseudo-inverse $A^{\dagger}$:
$$\underbrace{A^{\dagger}\,\mathbb{E}[X_2 \mid X_1 = x_1]}_{\text{pretext task}} = \underbrace{\begin{bmatrix} \mathbb{P}[Y = 1 \mid X_1 = x_1] \\ \vdots \\ \mathbb{P}[Y = k \mid X_1 = x_1] \end{bmatrix}}_{\text{downstream task}}$$
Provable self-supervised learning summary
- under the assumption of conditional independence
- (and assuming the matrix $A$ has full rank)
- $\mathbb{P}[Y \mid x_1]$ is in the linear span of $\mathbb{E}[X_2 \mid x_1]$
- all we need is a linear model on top of $\mathbb{E}[X_2 \mid x_1]$ (see the numerical check below)
- note: $\mathbb{P}[Y \mid x_1, x_2]$ would be the truly optimal predictor
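A minimal numerical check of this conclusion on a synthetic discrete example (all distributions and dimensions below are made up for illustration): when $A$ has full column rank, the downstream posterior $\mathbb{P}[Y \mid x_1]$ is recovered by a linear map applied to the pretext target $\mathbb{E}[X_2 \mid x_1]$.

```python
import numpy as np

rng = np.random.default_rng(0)
k, n_x1, d_x2 = 3, 4, 5                              # classes, X1 values, dim of X2

p_y = np.array([0.2, 0.5, 0.3])                      # P[Y]
p_x1_given_y = rng.dirichlet(np.ones(n_x1), size=k)  # rows: P[X1 | Y = y]
A = rng.normal(size=(d_x2, k))                       # columns: E[X2 | Y = y]

# Downstream posterior P[Y | X1 = x1] via Bayes' rule.
joint = p_y[:, None] * p_x1_given_y                  # shape (k, n_x1)
p_y_given_x1 = joint / joint.sum(axis=0, keepdims=True)

# Pretext target E[X2 | X1 = x1] = sum_y E[X2 | Y = y] P[Y = y | X1 = x1]
# (this step uses X1 independent of X2 given Y).
e_x2_given_x1 = A @ p_y_given_x1                     # shape (d_x2, n_x1)

# Linear recovery: A^dagger E[X2 | X1 = x1] = P[Y | X1 = x1] when A has full column rank.
recovered = np.linalg.pinv(A) @ e_x2_given_x1
print(np.allclose(recovered, p_y_given_x1))          # True
```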
Recap
Variational learning
$$\theta_{\mathrm{ML}} = \arg\max_\theta \sum_{x_i \in \mathcal{D}} \log p_\theta(x_i)$$
The variational lower bound (ELBO) on this objective:
$$\begin{aligned}
\mathcal{L}(\theta, \psi)
&= \sum_{x_i \in \mathcal{D}} \Big( \log p_\theta(x_i) - \mathrm{KL}\big[q_\psi(z \mid x_i)\,\big\|\,p_\theta(z \mid x_i)\big] \Big) \\
&= \sum_{x_i \in \mathcal{D}} \Big( \log p_\theta(x_i) + \mathbb{E}_{z \sim q_\psi} \log \frac{p_\theta(z \mid x_i)}{q_\psi(z \mid x_i)} \Big) \\
&= \sum_{x_i \in \mathcal{D}} \mathbb{E}_{z \sim q_\psi} \log \frac{p_\theta(z \mid x_i)\, p_\theta(x_i)}{q_\psi(z \mid x_i)} \\
&= \sum_{x_i \in \mathcal{D}} \mathbb{E}_{z \sim q_\psi} \log \frac{p_\theta(z, x_i)}{q_\psi(z \mid x_i)} \\
&= \sum_{x_i \in \mathcal{D}} \Big( \underbrace{\mathbb{E}_{z \sim q_\psi(z \mid x_i)} \log p_\theta(x_i \mid z)}_{\text{reconstruction}} - \mathrm{KL}\big[q_\psi(z \mid x_i)\,\big\|\,p_\theta(z)\big] \Big)
\end{aligned}$$
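A minimal sketch of the final reconstruction-minus-KL form as a loss function, assuming the Gaussian encoder/decoder parameterisation from the VAE sketch above; the fixed `sigma_n` and the one-sample Monte Carlo estimate of the reconstruction term are illustrative choices:

```python
import torch

def elbo(x, x_mean, mu, log_sigma, sigma_n=0.1):
    # Reconstruction term: one-sample Monte Carlo estimate of E_{z~q}[log p(x|z)],
    # where x_mean is the decoder output for a reparameterised sample z ~ q(z|x).
    log_px_given_z = torch.distributions.Normal(x_mean, sigma_n).log_prob(x).sum(-1)
    # KL[q_psi(z|x) || N(0, I)] in closed form for a diagonal Gaussian encoder.
    kl = 0.5 * (mu ** 2 + (2 * log_sigma).exp() - 2 * log_sigma - 1).sum(-1)
    return (log_px_given_z - kl).mean()

# e.g. with the VAE sketch above: loss = -elbo(x, *vae(x)); loss.backward()
```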
Discussion of max likelihood
- trained so that $p_\theta(x)$ matches the data
- evaluated by how useful $p_\theta(z \mid x)$ is
- there is a mismatch
Representation learning vs max likelihood
(figures)
Discussion of max likelihood
- max likelihood may not produce good representations
- Why do variational methods find good representations?
- Are there alternative principles?