Inference in a GP has the following demands:
Complexity: $\mathcal{O}(n^3)$
Storage: $\mathcal{O}(n^2)$
Inference in a low-rank GP has the following demands:
Complexity: $\mathcal{O}(nm^2)$
Storage: $\mathcal{O}(nm)$
where $m$ is a user-chosen parameter.
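To make the cost difference concrete, here is a minimal sketch (not from the original slides) of a sparse GP solve whose dominant cost is $\mathcal{O}(nm^2)$ because only $m \times m$ matrices are ever factorized; the RBF kernel, the synthetic data, and the inducing inputs `Z` are illustrative assumptions.

```python
# Minimal sketch (illustrative assumptions throughout): a sparse GP solve whose
# dominant cost is O(n m^2), because only m x m matrices are ever factorized.
import numpy as np
from scipy.spatial.distance import cdist

def rbf(A, B):
    """Unit-variance RBF kernel with unit lengthscale (illustrative choice)."""
    return np.exp(-0.5 * cdist(A, B, "sqeuclidean"))

rng = np.random.default_rng(0)
n, m, sigma2 = 2000, 30, 0.1
X = rng.uniform(-3, 3, (n, 1))
y = np.sin(3 * X[:, 0]) + np.sqrt(sigma2) * rng.standard_normal(n)
Z = np.linspace(-3, 3, m)[:, None]          # user-chosen inducing inputs (the "m" above)

Kuu = rbf(Z, Z) + 1e-8 * np.eye(m)          # m x m
Kuf = rbf(Z, X)                             # m x n, the largest object stored: O(nm)

# sigma^2 Kuu + Kuf Kuf^T is m x m; forming Kuf Kuf^T costs O(n m^2),
# so no n x n matrix is ever built or inverted (that would be O(n^3) time, O(n^2) storage).
A = sigma2 * Kuu + Kuf @ Kuf.T
Xstar = np.linspace(-3, 3, 5)[:, None]
Ksu = rbf(Xstar, Z)
pred_mean = Ksu @ np.linalg.solve(A, Kuf @ y)   # sparse predictive mean at Xstar
print(pred_mean)
```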
Snelson and Ghahramani (n.d.), Quiñonero Candela and Rasmussen (2005), Lawrence (n.d.), Titsias (n.d.), Bui et al. (2017)
We’ve seen how we go from parametric to non-parametric.
The limit implies an infinite-dimensional $\mathbf{w}$.
Gaussian processes are generally non-parametric: combine data with covariance function to get model.
This representation cannot be summarized by a parameter vector of a fixed size.
Parametric models have a representation that does not respond to increasing training set size.
Bayesian posterior distributions over parameters contain the information about the training data.
Use Bayes' rule on the training data: $p(\mathbf{w}|\mathbf{y}, \mathbf{X})$.
Make predictions on test data:
$$p(\mathbf{y}^*|\mathbf{X}^*, \mathbf{y}, \mathbf{X}) = \int p(\mathbf{y}^*|\mathbf{w}, \mathbf{X}^*)\, p(\mathbf{w}|\mathbf{y}, \mathbf{X})\, \mathrm{d}\mathbf{w}.$$
$\mathbf{w}$ becomes a bottleneck through which information about the training set must pass to the test set.
Solution: increase m so that the bottleneck is so large that it no longer presents a problem.
How big is big enough for m? Non-parametrics says m→∞.
$$\mathbf{K}_{\mathbf{ff}} \approx \mathbf{Q}_{\mathbf{ff}} = \mathbf{K}_{\mathbf{fu}}\mathbf{K}_{\mathbf{uu}}^{-1}\mathbf{K}_{\mathbf{uf}}$$
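As a quick numerical check (a sketch, with an assumed RBF kernel and evenly spaced inducing inputs), the low-rank $\mathbf{Q}_{\mathbf{ff}}$ can be formed directly and compared with the full $\mathbf{K}_{\mathbf{ff}}$; the relative error shrinks as $m$ grows.

```python
# Sketch: build Qff = Kfu Kuu^{-1} Kuf and compare it with the full Kff.
# Kernel and inducing-input locations are illustrative assumptions.
import numpy as np
from scipy.spatial.distance import cdist

def rbf(A, B):
    return np.exp(-0.5 * cdist(A, B, "sqeuclidean"))

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, (200, 1))

for m in (5, 10, 20):
    Z = np.linspace(-3, 3, m)[:, None]
    Kff, Kfu = rbf(X, X), rbf(X, Z)
    Kuu = rbf(Z, Z) + 1e-8 * np.eye(m)
    Qff = Kfu @ np.linalg.solve(Kuu, Kfu.T)
    print(m, np.linalg.norm(Kff - Qff) / np.linalg.norm(Kff))  # error drops as m grows
```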
Figure: build-up of GP inference, from data $\mathbf{X}, \mathbf{y}$ and prior $f(\mathbf{x}) \sim \mathcal{GP}$ with $p(\mathbf{f}) = \mathcal{N}(\mathbf{0}, \mathbf{K}_{\mathbf{ff}})$ to the posterior $p(\mathbf{f}|\mathbf{y}, \mathbf{X})$.
Take an extra $m$ points on the function, $\mathbf{u} = f(\mathbf{Z})$.
$$p(\mathbf{y}, \mathbf{f}, \mathbf{u}) = p(\mathbf{y}|\mathbf{f})\, p(\mathbf{f}|\mathbf{u})\, p(\mathbf{u})$$
$$p(\mathbf{y}|\mathbf{f}) = \mathcal{N}(\mathbf{y}|\mathbf{f}, \sigma^2\mathbf{I}), \qquad p(\mathbf{f}|\mathbf{u}) = \mathcal{N}(\mathbf{f}|\mathbf{K}_{\mathbf{fu}}\mathbf{K}_{\mathbf{uu}}^{-1}\mathbf{u}, \tilde{\mathbf{K}}), \qquad p(\mathbf{u}) = \mathcal{N}(\mathbf{u}|\mathbf{0}, \mathbf{K}_{\mathbf{uu}})$$
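The augmented model can be sampled ancestrally, which makes the factorization above concrete. This is a sketch with an assumed RBF kernel and an assumed placement of $\mathbf{Z}$.

```python
# Sketch: ancestral sampling from the augmented model p(u) -> p(f|u) -> p(y|f),
# with p(f|u) = N(Kfu Kuu^{-1} u, Ktilde), Ktilde = Kff - Kfu Kuu^{-1} Kuf.
# The RBF kernel and the placement of Z are illustrative assumptions.
import numpy as np
from scipy.spatial.distance import cdist

def rbf(A, B):
    return np.exp(-0.5 * cdist(A, B, "sqeuclidean"))

rng = np.random.default_rng(0)
X = np.linspace(-3, 3, 100)[:, None]
Z = np.linspace(-3, 3, 10)[:, None]
sigma2 = 0.05

Kff, Kfu = rbf(X, X), rbf(X, Z)
Kuu = rbf(Z, Z) + 1e-8 * np.eye(len(Z))
Ktilde = Kff - Kfu @ np.linalg.solve(Kuu, Kfu.T)

u = rng.multivariate_normal(np.zeros(len(Z)), Kuu)            # u ~ p(u)
f = rng.multivariate_normal(Kfu @ np.linalg.solve(Kuu, u),    # f ~ p(f|u)
                            Ktilde + 1e-6 * np.eye(len(X)))
y = f + np.sqrt(sigma2) * rng.standard_normal(len(X))         # y ~ p(y|f)
```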
Figure: the model augmented with inducing points $\mathbf{Z}, \mathbf{u}$, with prior $p(\mathbf{u}) = \mathcal{N}(\mathbf{0}, \mathbf{K}_{\mathbf{uu}})$ and approximate posterior $\tilde{p}(\mathbf{u}|\mathbf{y}, \mathbf{X})$.
Instead of computing
$$p(\mathbf{f}|\mathbf{y}, \mathbf{X}) = \frac{p(\mathbf{y}|\mathbf{f})\, p(\mathbf{f}|\mathbf{X})}{\int p(\mathbf{y}|\mathbf{f})\, p(\mathbf{f}|\mathbf{X})\, \mathrm{d}\mathbf{f}}$$
we'll compute
$$p(\mathbf{u}|\mathbf{y}, \mathbf{Z}) = \frac{p(\mathbf{y}|\mathbf{u})\, p(\mathbf{u}|\mathbf{Z})}{\int p(\mathbf{y}|\mathbf{u})\, p(\mathbf{u}|\mathbf{Z})\, \mathrm{d}\mathbf{u}}$$
$$\log p(\mathbf{y}|\mathbf{u}) = \log \int p(\mathbf{y}|\mathbf{f})\, p(\mathbf{f}|\mathbf{u})\, \mathrm{d}\mathbf{f} = \int q(\mathbf{f}) \log \frac{p(\mathbf{y}|\mathbf{f})\, p(\mathbf{f}|\mathbf{u})}{q(\mathbf{f})}\, \mathrm{d}\mathbf{f} + \mathrm{KL}\left(q(\mathbf{f})\,\|\, p(\mathbf{f}|\mathbf{y}, \mathbf{u})\right).$$
Maximizing the lower bound minimizes the KL divergence (information gain):
$$\mathrm{KL}\left(p(\mathbf{f}|\mathbf{u})\,\|\, p(\mathbf{f}|\mathbf{y}, \mathbf{u})\right) = \int p(\mathbf{f}|\mathbf{u}) \log \frac{p(\mathbf{f}|\mathbf{u})}{p(\mathbf{f}|\mathbf{y}, \mathbf{u})}\, \mathrm{d}\mathbf{f}$$
This is minimized when all the information about $\mathbf{y}$ is already stored in $\mathbf{u}$.
The bound seeks an optimal compression from the information gain perspective.
If $\mathbf{u} = \mathbf{f}$ the bound is exact ($\mathbf{f}$ d-separates $\mathbf{y}$ from $\mathbf{u}$).
$$\begin{bmatrix}\mathbf{f} \\ \mathbf{u}\end{bmatrix} \sim \mathcal{N}(\mathbf{0}, \mathbf{K}) \quad \text{with} \quad \mathbf{K} = \begin{bmatrix}\mathbf{K}_{\mathbf{ff}} & \mathbf{K}_{\mathbf{fu}} \\ \mathbf{K}_{\mathbf{uf}} & \mathbf{K}_{\mathbf{uu}}\end{bmatrix}$$
If the likelihood, $p(\mathbf{y}|\mathbf{f})$, factorizes,
then the bound factorizes.
Now we need a choice of distributions for $\mathbf{f}$ and $\mathbf{y}|\mathbf{f}$ …
Introduce a variable set $\mathbf{u}$ which is finite dimensional.
$$p(\mathbf{y}^*|\mathbf{y}) \approx \int p(\mathbf{y}^*|\mathbf{u})\, q(\mathbf{u}|\mathbf{y})\, \mathrm{d}\mathbf{u}$$
But the dimensionality of $\mathbf{u}$ can be changed to improve the approximation.
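A sketch of what this approximate prediction looks like once a Gaussian $q(\mathbf{u}|\mathbf{y}) = \mathcal{N}(\mathbf{m}, \mathbf{S})$ is available. Here `m_u` and `S_u` are placeholders standing in for a fitted posterior, and the kernel is an assumed RBF.

```python
# Sketch: p(y*|y) ~= integral of p(y*|u) q(u|y) du for Gaussian q(u|y) = N(m_u, S_u).
# m_u, S_u are placeholders standing in for a fitted posterior (assumption).
import numpy as np
from scipy.spatial.distance import cdist

def rbf(A, B):
    return np.exp(-0.5 * cdist(A, B, "sqeuclidean"))

rng = np.random.default_rng(1)
Z = np.linspace(-3, 3, 10)[:, None]
Xstar = np.linspace(-3, 3, 7)[:, None]
sigma2 = 0.1

m_u = rng.standard_normal(len(Z))          # placeholder posterior mean over u
S_u = 0.1 * np.eye(len(Z))                 # placeholder posterior covariance over u

Kuu = rbf(Z, Z) + 1e-8 * np.eye(len(Z))
Ksu = rbf(Xstar, Z)
A = np.linalg.solve(Kuu, Ksu.T).T          # Ksu Kuu^{-1}

mean_star = A @ m_u                             # predictive mean
var_star = (1.0                                 # k(x*, x*) for a unit-variance kernel
            - np.einsum("ij,ij->i", A, Ksu)     # - ksu Kuu^{-1} kus
            + np.einsum("ij,ij->i", A @ S_u, A) # + ksu Kuu^{-1} S_u Kuu^{-1} kus
            + sigma2)                           # + observation noise
print(mean_star, var_star)
```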
$$p(\mathbf{y}) = \int p(\mathbf{y}|\mathbf{f})\, p(\mathbf{f})\, \mathrm{d}\mathbf{f}$$
$$p(\mathbf{y}) = \int p(\mathbf{y}|\mathbf{f})\, p(\mathbf{u}|\mathbf{f})\, p(\mathbf{f})\, \mathrm{d}\mathbf{f}\, \mathrm{d}\mathbf{u}$$
$$p(\mathbf{y}) = \int \left[\int p(\mathbf{y}|\mathbf{f})\, p(\mathbf{f}|\mathbf{u})\, \mathrm{d}\mathbf{f}\right] p(\mathbf{u})\, \mathrm{d}\mathbf{u}$$
$$p(\mathbf{y}|\mathbf{u}) = \int p(\mathbf{y}|\mathbf{f})\, p(\mathbf{f}|\mathbf{u})\, \mathrm{d}\mathbf{f}$$
Here $p(\mathbf{y}|\mathbf{u})$ plays the role of a likelihood, with $\mathbf{u}$ acting like the parameters $\boldsymbol{\theta}$ in $p(\mathbf{y}|\boldsymbol{\theta})$.
$$\begin{bmatrix}\mathbf{f} \\ \mathbf{u}\end{bmatrix} \sim \mathcal{N}\left(\mathbf{0}, \begin{bmatrix}\mathbf{K}_{\mathbf{ff}} & \mathbf{K}_{\mathbf{fu}} \\ \mathbf{K}_{\mathbf{uf}} & \mathbf{K}_{\mathbf{uu}}\end{bmatrix}\right), \qquad p(\mathbf{y}|\mathbf{f}) = \prod_i \mathcal{N}(y_i|f_i, \sigma^2)$$
For Gaussian likelihoods:
Define:
$$q_{i,i} = \mathrm{var}_{p(f_i|\mathbf{u})}(f_i) = \left\langle f_i^2\right\rangle_{p(f_i|\mathbf{u})} - \left\langle f_i\right\rangle^2_{p(f_i|\mathbf{u})}$$
We can write:
$$c_i = \exp\left(-\frac{q_{i,i}}{2\sigma^2}\right)$$
If the joint distribution $p(\mathbf{f}, \mathbf{u})$ is Gaussian then:
$$q_{i,i} = k_{i,i} - \mathbf{k}_{i,\mathbf{u}}^\top \mathbf{K}_{\mathbf{u},\mathbf{u}}^{-1} \mathbf{k}_{i,\mathbf{u}}$$
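A sketch computing $q_{i,i}$ and $c_i$ for an assumed unit-variance RBF kernel (so $k_{i,i} = 1$) and assumed inducing inputs.

```python
# Sketch: the conditional variances q_{i,i} and the factors c_i for an assumed
# unit-variance RBF kernel (so k_{i,i} = 1) and assumed inducing inputs Z.
import numpy as np
from scipy.spatial.distance import cdist

def rbf(A, B):
    return np.exp(-0.5 * cdist(A, B, "sqeuclidean"))

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, (50, 1))
Z = np.linspace(-3, 3, 8)[:, None]
sigma2 = 0.1

Kuu = rbf(Z, Z) + 1e-8 * np.eye(len(Z))
Kfu = rbf(X, Z)

# q_ii = k_ii - k_{i,u}^T Kuu^{-1} k_{i,u}; note it depends on Z but not on u
q_ii = 1.0 - np.einsum("ij,ij->i", Kfu, np.linalg.solve(Kuu, Kfu.T).T)
c = np.exp(-q_ii / (2 * sigma2))
print(q_ii.round(3), c.round(3))
```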
$c_i$ is not a function of $\mathbf{u}$, but it is a function of the inducing inputs $\mathbf{X}_{\mathbf{u}}$.
The sum of the $q_{i,i}$ is the total conditional variance.
If the conditional density $p(\mathbf{f}|\mathbf{u})$ is Gaussian, then it has covariance
$$\mathbf{Q} = \mathbf{K}_{\mathbf{ff}} - \mathbf{K}_{\mathbf{fu}}\mathbf{K}_{\mathbf{uu}}^{-1}\mathbf{K}_{\mathbf{uf}}.$$
$\mathrm{tr}(\mathbf{Q}) = \sum_i q_{i,i}$ is known as the total variance.
Because it is the trace of a conditional covariance, we call it the total conditional variance.
Measure the ’capacity of a density’.
Determinant of covariance represents ’volume’ of density.
The log determinant is (up to a constant) the entropy: the sum of the log eigenvalues of the covariance.
The trace of the covariance is the total variance: the sum of the eigenvalues of the covariance.
Since $\lambda > \log\lambda$, the total conditional variance upper bounds the entropy.
The exponentiated total variance bounds the determinant:
$$\det \mathbf{Q} < \exp\left(\mathrm{tr}(\mathbf{Q})\right)$$
because
$$\prod_{i=1}^k \lambda_i < \prod_{i=1}^k \exp(\lambda_i)$$
where $\{\lambda_i\}_{i=1}^k$ are the positive eigenvalues of $\mathbf{Q}$. This in turn implies
$$\det \mathbf{Q} < \prod_{i=1}^k \exp(q_{i,i}).$$
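A quick numerical sanity check of $\det\mathbf{Q} < \exp(\mathrm{tr}(\mathbf{Q}))$, under the same illustrative kernel assumptions; a small jitter is added because $\mathbf{Q}$ is typically near-singular.

```python
# Sketch: numerical check of det(Q) < exp(tr(Q)) for Q = Kff - Kfu Kuu^{-1} Kuf.
# Kernel and inducing inputs are illustrative; a small jitter keeps Q invertible.
import numpy as np
from scipy.spatial.distance import cdist

def rbf(A, B):
    return np.exp(-0.5 * cdist(A, B, "sqeuclidean"))

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, (20, 1))
Z = np.linspace(-3, 3, 5)[:, None]

Kff, Kfu = rbf(X, X), rbf(X, Z)
Kuu = rbf(Z, Z) + 1e-8 * np.eye(len(Z))
Q = Kff - Kfu @ np.linalg.solve(Kuu, Kfu.T) + 1e-10 * np.eye(len(X))

sign, logdetQ = np.linalg.slogdet(Q)
print(logdetQ, "<", np.trace(Q))   # log det Q < tr(Q), i.e. det Q < exp(tr Q)
```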
The conditional density $p(\mathbf{f}|\mathbf{u})$ can be seen as a communication channel.
Normally we have: Transmitter $\rightarrow \mathbf{u} \rightarrow$ Channel $p(\mathbf{f}|\mathbf{u}) \rightarrow \mathbf{f} \rightarrow$ Receiver, and we control $p(\mathbf{u})$ (the source density).
Here we can also control the transmission channel $p(\mathbf{f}|\mathbf{u})$.
Substitute the variational bound into the marginal likelihood:
$$p(\mathbf{y}) \geq \prod_{i=1}^n c_i \int \mathcal{N}\left(\mathbf{y}|\left\langle\mathbf{f}\right\rangle, \sigma^2\mathbf{I}\right) p(\mathbf{u})\, \mathrm{d}\mathbf{u}$$
Note that $\left\langle\mathbf{f}\right\rangle_{p(\mathbf{f}|\mathbf{u})} = \mathbf{K}_{\mathbf{f},\mathbf{u}}\mathbf{K}_{\mathbf{u},\mathbf{u}}^{-1}\mathbf{u}$ is linearly dependent on $\mathbf{u}$, making the marginalization of $\mathbf{u}$ straightforward. In the Gaussian case: $p(\mathbf{u}) = \mathcal{N}(\mathbf{u}|\mathbf{0}, \mathbf{K}_{\mathbf{u},\mathbf{u}})$.
$$\log p(\mathbf{y}|\mathbf{u}) = \log \int p(\mathbf{y}|\mathbf{f})\, p(\mathbf{f}|\mathbf{u}, \mathbf{X})\, \mathrm{d}\mathbf{f} = \log \mathbb{E}_{p(\mathbf{f}|\mathbf{u},\mathbf{X})}\left[p(\mathbf{y}|\mathbf{f})\right]$$
$$\log p(\mathbf{y}|\mathbf{u}) \geq \mathbb{E}_{p(\mathbf{f}|\mathbf{u},\mathbf{X})}\left[\log p(\mathbf{y}|\mathbf{f})\right] \triangleq \log \tilde{p}(\mathbf{y}|\mathbf{u})$$
No inversion of $\mathbf{K}_{\mathbf{ff}}$ required.
$$p(\mathbf{y}|\mathbf{u}) = \frac{p(\mathbf{y}|\mathbf{f})\, p(\mathbf{f}|\mathbf{u})}{p(\mathbf{f}|\mathbf{y}, \mathbf{u})}$$
$$\log p(\mathbf{y}|\mathbf{u}) = \log p(\mathbf{y}|\mathbf{f}) + \log \frac{p(\mathbf{f}|\mathbf{u})}{p(\mathbf{f}|\mathbf{y}, \mathbf{u})}$$
$$\log p(\mathbf{y}|\mathbf{u}) = \mathbb{E}_{p(\mathbf{f}|\mathbf{u})}\left[\log p(\mathbf{y}|\mathbf{f})\right] + \mathbb{E}_{p(\mathbf{f}|\mathbf{u})}\left[\log \frac{p(\mathbf{f}|\mathbf{u})}{p(\mathbf{f}|\mathbf{y}, \mathbf{u})}\right]$$
$$\log p(\mathbf{y}|\mathbf{u}) = \log\tilde{p}(\mathbf{y}|\mathbf{u}) + \mathrm{KL}\left[p(\mathbf{f}|\mathbf{u})\,\|\, p(\mathbf{f}|\mathbf{y}, \mathbf{u})\right]$$
$$\tilde{p}(\mathbf{y}|\mathbf{u}) = \prod_{i=1}^n \tilde{p}(y_i|\mathbf{u})$$
$$\tilde{p}(y_i|\mathbf{u}) = \mathcal{N}\left(y_i|\mathbf{k}_{f_i\mathbf{u}}\mathbf{K}_{\mathbf{uu}}^{-1}\mathbf{u}, \sigma^2\right)\exp\left\{-\frac{1}{2\sigma^2}\left(k_{f_if_i} - \mathbf{k}_{f_i\mathbf{u}}\mathbf{K}_{\mathbf{uu}}^{-1}\mathbf{k}_{\mathbf{u}f_i}\right)\right\}$$
A straightforward likelihood approximation, and a penalty term
$$\tilde{p}(\mathbf{u}|\mathbf{y}, \mathbf{Z}) = \frac{\tilde{p}(\mathbf{y}|\mathbf{u})\, p(\mathbf{u}|\mathbf{Z})}{\int \tilde{p}(\mathbf{y}|\mathbf{u})\, p(\mathbf{u}|\mathbf{Z})\, \mathrm{d}\mathbf{u}}$$
Computing the posterior costs $\mathcal{O}(nm^2)$.
We also get a lower bound of the marginal likelihood
The penalty term is:
$$\sum_{i=1}^n -\frac{1}{2\sigma^2}\left(k_{f_if_i} - \mathbf{k}_{f_i\mathbf{u}}\mathbf{K}_{\mathbf{uu}}^{-1}\mathbf{k}_{\mathbf{u}f_i}\right)$$
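Putting the pieces together, the bound collapses to $\log\mathcal{N}(\mathbf{y}|\mathbf{0}, \mathbf{Q}_{\mathbf{ff}} + \sigma^2\mathbf{I})$ plus the penalty above. The sketch below evaluates it naively (forming $n \times n$ matrices for readability rather than at the $\mathcal{O}(nm^2)$ cost) and confirms it sits below the exact log marginal likelihood; data, kernel, and inducing inputs are again illustrative.

```python
# Sketch: the collapsed bound log N(y | 0, Qff + sigma^2 I) + penalty, evaluated
# naively with n x n matrices for readability (not at the O(n m^2) cost).
# Data, kernel, and inducing inputs are illustrative assumptions.
import numpy as np
from scipy.stats import multivariate_normal
from scipy.spatial.distance import cdist

def rbf(A, B):
    return np.exp(-0.5 * cdist(A, B, "sqeuclidean"))

rng = np.random.default_rng(0)
n, m, sigma2 = 200, 10, 0.1
X = rng.uniform(-3, 3, (n, 1))
y = np.sin(3 * X[:, 0]) + np.sqrt(sigma2) * rng.standard_normal(n)
Z = np.linspace(-3, 3, m)[:, None]

Kff, Kfu = rbf(X, X), rbf(X, Z)
Kuu = rbf(Z, Z) + 1e-8 * np.eye(m)
Qff = Kfu @ np.linalg.solve(Kuu, Kfu.T)

likelihood_term = multivariate_normal.logpdf(y, mean=np.zeros(n), cov=Qff + sigma2 * np.eye(n))
penalty = -np.trace(Kff - Qff) / (2 * sigma2)   # the sum over q_ii above
bound = likelihood_term + penalty

exact = multivariate_normal.logpdf(y, mean=np.zeros(n), cov=Kff + sigma2 * np.eye(n))
print(bound, "<=", exact)                       # the bound sits below the exact value
```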
It's easy to show that as $\mathbf{Z} \to \mathbf{X}$:
$\mathbf{u} \to \mathbf{f}$ (and the posterior is exact).
The penalty term is zero.
The cost returns to $\mathcal{O}(n^3)$.
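A numerical confirmation of the first two points (a sketch, with an assumed RBF kernel): setting $\mathbf{Z} = \mathbf{X}$ makes the trace penalty numerically zero.

```python
# Sketch: with Z = X (so u = f), the trace penalty is numerically zero and the
# bound recovers the exact marginal likelihood, at the full O(n^3) cost.
import numpy as np
from scipy.spatial.distance import cdist

def rbf(A, B):
    return np.exp(-0.5 * cdist(A, B, "sqeuclidean"))

rng = np.random.default_rng(0)
n = 100
X = rng.uniform(-3, 3, (n, 1))

Kff = rbf(X, X)
Kuu = Kff + 1e-8 * np.eye(n)                 # Z = X, so Kuu coincides with Kff (plus jitter)
Qff = Kff @ np.linalg.solve(Kuu, Kff)
print(np.trace(Kff - Qff))                   # ~0: the penalty term vanishes
```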
So far we:
introduced Z,u
approximated the integral over $\mathbf{f}$ variationally
captured the information in $\tilde{p}(\mathbf{u}|\mathbf{y})$
obtained a lower bound on the marginal likelihood
saw the effect of the penalty term
prediction for new points
Omitted details:
optimization of the covariance parameters using the bound
optimization of Z (simultaneously)
the form of ˜p(u|y)
historical approximations
Random or systematic
Set $\mathbf{Z}$ to a subset of $\mathbf{X}$.
Set $\mathbf{u}$ to a subset of $\mathbf{f}$.
Approximation to $p(\mathbf{y}|\mathbf{u})$:
$p(y_i|\mathbf{u}) = p(y_i|u_i)$ for the points $i$ retained in the subset
$p(y_i|\mathbf{u}) = 1$ for the points $i$ left out
Deterministic Training Conditional (DTC)
Approximation to $p(\mathbf{y}|\mathbf{u})$:
As in our variational formulation, but without the penalty term.
Optimization of $\mathbf{Z}$ is difficult.
Fully Independent Training Conditional (FITC)
Approximation to $p(\mathbf{y}|\mathbf{u})$:
$$p(\mathbf{y}|\mathbf{u}) = \prod_i p(y_i|\mathbf{u})$$
Optimization of $\mathbf{Z}$ is still difficult, and there are some weird heteroscedastic effects.
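The heteroscedastic behaviour comes from the extra variance FITC folds into each point's likelihood, $k_{i,i} - \mathbf{k}_{i,\mathbf{u}}^\top\mathbf{K}_{\mathbf{uu}}^{-1}\mathbf{k}_{i,\mathbf{u}}$, which varies with the input location. A sketch (assumed RBF kernel, inducing inputs deliberately covering only part of the input range) shows this extra variance growing away from $\mathbf{Z}$:

```python
# Sketch: the per-point "extra noise" diag(Kff - Qff) that FITC folds into its
# likelihood. Because it varies with the input location, FITC behaves
# heteroscedastically; DTC ignores it, and the variational bound instead pays
# for it through the trace penalty. Kernel and inducing inputs are illustrative.
import numpy as np
from scipy.spatial.distance import cdist

def rbf(A, B):
    return np.exp(-0.5 * cdist(A, B, "sqeuclidean"))

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, (200, 1))
Z = np.linspace(-3, 0, 6)[:, None]     # inducing inputs covering only half the range

Kfu = rbf(X, Z)
Kuu = rbf(Z, Z) + 1e-8 * np.eye(len(Z))
extra_var = 1.0 - np.einsum("ij,ij->i", Kfu, np.linalg.solve(Kuu, Kfu.T).T)

# extra variance is small near Z and grows toward the prior variance far from Z
print(extra_var[X[:, 0] < 0].mean(), extra_var[X[:, 0] > 1].mean())
```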
Bayesian GP-LVM
Augment each layer with inducing variables $\mathbf{u}_i$.
Apply variational compression,
$$p(\mathbf{y}, \{\mathbf{h}_i\}_{i=1}^{\ell-1} \mid \{\mathbf{u}_i\}_{i=1}^{\ell}, \mathbf{X}) \geq \tilde{p}(\mathbf{y}|\mathbf{u}_\ell, \mathbf{h}_{\ell-1}) \prod_{i=2}^{\ell-1}\tilde{p}(\mathbf{h}_i|\mathbf{u}_i, \mathbf{h}_{i-1})\, \tilde{p}(\mathbf{h}_1|\mathbf{u}_1, \mathbf{X}) \times \exp\left(\sum_{i=1}^{\ell} -\frac{1}{2\sigma_i^2}\mathrm{tr}(\boldsymbol{\Sigma}_i)\right)$$
where $\tilde{p}(\mathbf{h}_i|\mathbf{u}_i, \mathbf{h}_{i-1}) = \mathcal{N}\left(\mathbf{h}_i|\mathbf{K}_{\mathbf{h}_i\mathbf{u}_i}\mathbf{K}_{\mathbf{u}_i\mathbf{u}_i}^{-1}\mathbf{u}_i, \sigma_i^2\mathbf{I}\right)$.
By maintaining explicit distributions over the inducing variables, James Hensman has developed a nested variant of variational compression.
Exciting thing: mathematically it looks like a deep neural network, but with inducing variables in place of basis functions.
Additional complexity control term in the objective function.
$$\log p(\mathbf{y}|\mathbf{X}) \geq -\frac{1}{\sigma_1^2}\mathrm{tr}(\boldsymbol{\Sigma}_1) - \sum_{i=2}^{\ell}\frac{1}{2\sigma_i^2}\left(\psi_i - \mathrm{tr}\left(\boldsymbol{\Phi}_i\mathbf{K}_{\mathbf{u}_i\mathbf{u}_i}^{-1}\right)\right) - \sum_{i=1}^{\ell}\mathrm{KL}\left(q(\mathbf{u}_i)\,\|\, p(\mathbf{u}_i)\right) - \sum_{i=2}^{\ell}\frac{1}{2\sigma_i^2}\mathrm{tr}\left(\left(\boldsymbol{\Phi}_i - \boldsymbol{\Psi}_i^\top\boldsymbol{\Psi}_i\right)\mathbf{K}_{\mathbf{u}_i\mathbf{u}_i}^{-1}\left\langle\mathbf{u}_i\mathbf{u}_i^\top\right\rangle_{q(\mathbf{u}_i)}\mathbf{K}_{\mathbf{u}_i\mathbf{u}_i}^{-1}\right) + \log\mathcal{N}\left(\mathbf{y}|\boldsymbol{\Psi}_\ell\mathbf{K}_{\mathbf{u}_\ell\mathbf{u}_\ell}^{-1}\mathbf{m}_\ell, \sigma_\ell^2\mathbf{I}\right)$$