With two unknowns and two observations: \[
\begin{aligned}
y_1 &= mx_1 + c\\
y_2 &= mx_2 + c
\end{aligned}
\]
An additional observation leads to an overdetermined system.
\[
y_3 = mx_3 + c
\]
This problem is resolved by introducing a noise model \(\epsilon\sim
\mathcal{N}\left(0,\sigma^2\right)\) \[
\begin{aligned}
y_1 = mx_1 + c + \epsilon_1\\
y_2 = mx_2 + c + \epsilon_2\\
y_3 = mx_3 + c + \epsilon_3
\end{aligned}
\]
Underdetermined System
What about two unknowns and one observation? \[y_1 = mx_1 + c\]
Can compute \(m\) given \(c\). \[m =
\frac{y_1 - c}{x_1}\]
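To make the underdetermined case concrete, here is a small sketch (my own illustration, anticipating the Bayesian treatment later in the lecture): treating \(c\) as a random draw, each assumed value of \(c\) gives a different slope \(m\) that fits the single observation exactly. The numbers are made up.

```python
import numpy as np

rng = np.random.default_rng(0)

# One observation, two unknowns: infinitely many (m, c) pairs fit exactly.
x1, y1 = 1.0, 3.0

# Each assumed value of c gives a consistent slope m = (y1 - c) / x1.
for c in rng.normal(0.0, 1.0, size=5):        # illustrative draws for c
    m = (y1 - c) / x1
    print(f"c = {c:+.2f}  ->  m = {m:+.2f}")  # every (m, c) pair passes through (x1, y1)
```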
The Bayesian Controversy: Philosophical Underpinnings
A segment from the 2012 lecture on philosophical underpinnings.
Noise Models
We aren’t modeling the entire system.
The noise model captures the mismatch between model and data.
Gaussian model justified by appeal to central limit theorem.
Other models also possible (Student-\(t\) for heavy tails).
Maximum likelihood with Gaussian noise leads to least
squares.
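As an illustrative check (not part of the original slides), the sketch below fits \(m\) and \(c\) to three made-up observations by least squares, which is exactly the maximum likelihood solution under the Gaussian noise model.

```python
import numpy as np

# Three made-up observations: overdetermined for the two unknowns m and c.
x = np.array([1.0, 2.0, 3.0])
y = np.array([2.1, 3.9, 6.2])

# Maximum likelihood under Gaussian noise = least squares:
# minimise sum_i (y_i - m x_i - c)^2 over m and c.
Phi = np.column_stack([x, np.ones_like(x)])          # design matrix with columns [x_i, 1]
(m, c), residual_ss, _, _ = np.linalg.lstsq(Phi, y, rcond=None)

sigma2 = residual_ss[0] / len(y)                     # ML estimate of the noise variance
print(f"m = {m:.3f}, c = {c:.3f}, sigma^2 = {sigma2:.3f}")
```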
Different Types of Uncertainty
The first type of uncertainty we are assuming is aleatoric
uncertainty.
The second type of uncertainty we are assuming is epistemic
uncertainty.
Aleatoric Uncertainty
This is uncertainty we couldn’t know even if we wanted to, e.g. the
result of a football match before it’s played.
Where a sheet of paper might land on the floor.
Epistemic Uncertainty
This is uncertainty we could in principle know the answer to. We
just haven’t observed enough yet, e.g. the result of a football match
after it’s played.
What colour socks your lecturer is wearing.
Further Reading
Section 1.2.3 (pg 21–24) of Bishop (2006)
Sections 3.1–3.4 (pg 95–117) of Rogers and Girolami (2011)
Section 1.2.6 (start from just past eq 1.64 pg 30-32) of Bishop
(2006)
Sum of Squares and Probability
… It is clear, that for the product \(\Omega = h^\mu \pi^{-\frac{1}{2}\mu} e^{-hh(vv +
v^\prime v^\prime + v^{\prime\prime} v^{\prime\prime} + \dots)}\)
to be maximised the sum \(vv + v^\prime
v^\prime + v^{\prime\prime} v^{\prime\prime} + \text{etc}.\)
ought to be minimized. Therefore, the most probable values of the
unknown quantities \(p , q, r , s
\text{etc}.\), should be that in which the sum of the squares of
the differences between the functions \(V,
V^\prime, V^{\prime\prime} \text{etc}.\), and the observed values
is minimized, provided the same degree of precision is presumed for
all the observations.
Prior Distribution
Bayesian inference requires a prior on the parameters.
The prior represents your belief, before you see the data,
about the likely value of the parameters.
For linear regression, consider a Gaussian prior on the
intercept:
\[c \sim
\mathcal{N}\left(0,\alpha_1\right)\]
Posterior Distribution
The posterior distribution is found by combining the prior with the
likelihood.
The posterior distribution is your belief, after you see the
data, about the likely value of the parameters.
The posterior is found through Bayes’ Rule\[
p(c|y) = \frac{p(y|c)p(c)}{p(y)}
\]
Main Trick
\[
p(c) = \frac{1}{\sqrt{2\pi\alpha_1}}
\exp\left(-\frac{1}{2\alpha_1}c^2\right)
\]\[
p(\mathbf{ y}|\mathbf{ x}, c, m, \sigma^2) =
\frac{1}{\left(2\pi\sigma^2\right)^{\frac{n}{2}}}
\exp\left(-\frac{1}{2\sigma^2}\sum_{i=1}^n(y_i - mx_i - c)^2\right)
\]\[
p(c| \mathbf{ y}, \mathbf{ x}, m, \sigma^2) = \frac{p(\mathbf{
y}|\mathbf{ x}, c, m, \sigma^2)p(c)}{p(\mathbf{ y}|\mathbf{ x}, m,
\sigma^2)}
\]\[
p(c| \mathbf{ y}, \mathbf{ x}, m, \sigma^2) = \frac{p(\mathbf{
y}|\mathbf{ x}, c, m, \sigma^2)p(c)}{\int p(\mathbf{ y}|\mathbf{ x}, c,
m, \sigma^2)p(c) \text{d} c}
\]\[
p(c| \mathbf{ y}, \mathbf{ x}, m, \sigma^2) \propto p(\mathbf{
y}|\mathbf{ x}, c, m, \sigma^2)p(c)
\]\[
\begin{aligned}
\log p(c | \mathbf{ y}, \mathbf{ x}, m, \sigma^2)
=&-\frac{1}{2\sigma^2} \sum_{i=1}^n(y_i-c -
mx_i)^2-\frac{1}{2\alpha_1} c^2 + \text{const}\\
= &-\frac{1}{2\sigma^2}\sum_{i=1}^n(y_i-mx_i)^2
-\left(\frac{n}{2\sigma^2} + \frac{1}{2\alpha_1}\right)c^2\\
& + c\frac{\sum_{i=1}^n(y_i-mx_i)}{\sigma^2},
\end{aligned}
\] complete the square of the quadratic form to obtain \[
\log p(c | \mathbf{ y}, \mathbf{ x}, m, \sigma^2) = -\frac{1}{2\tau^2}(c
- \mu)^2 +\text{const},
\] where \(\tau^2 = \left(n\sigma^{-2}
+\alpha_1^{-1}\right)^{-1}\) and \(\mu
= \frac{\tau^2}{\sigma^2} \sum_{i=1}^n(y_i-mx_i)\).
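The result can be checked numerically. Below is a minimal sketch (the data values and the helper name are my own, for illustration) that computes \(\tau^2\) and \(\mu\) directly from the formulas above.

```python
import numpy as np

def posterior_over_c(x, y, m, sigma2, alpha1):
    """Posterior over the intercept c for a fixed slope m and prior c ~ N(0, alpha1)."""
    n = len(y)
    tau2 = 1.0 / (n / sigma2 + 1.0 / alpha1)     # tau^2 = (n sigma^{-2} + alpha_1^{-1})^{-1}
    mu = (tau2 / sigma2) * np.sum(y - m * x)     # mu = (tau^2 / sigma^2) sum_i (y_i - m x_i)
    return mu, tau2

# Made-up data and fixed values, purely for illustration
x = np.array([1.0, 2.0, 3.0])
y = np.array([2.1, 3.9, 6.2])
mu, tau2 = posterior_over_c(x, y, m=2.0, sigma2=0.01, alpha1=1.0)
print(f"posterior over c: mean {mu:.3f}, variance {tau2:.5f}")
```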
The Joint Density
Really want to know the joint posterior density over the
parameters \(c\) and \(m\).
Could now integrate out over \(m\),
but it’s easier to consider the multivariate case.
Two Dimensional Gaussian
Consider height, \(h/m\) and
weight, \(w/kg\).
Could sample height from a distribution: \[
h \sim \mathcal{N}\left(1.7,0.0225\right).
\]
And similarly weight: \[
w \sim \mathcal{N}\left(75,36\right).
\]
Height and Weight Models
Independence Assumption
We assume height and weight are independent.
\[
p(w, h) = p(w)p(h).
\]
Sampling Two Dimensional Variables
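A minimal sketch of this sampling in code (my own illustration of the independence assumption above; the sample size is arbitrary): height and weight are drawn separately, and the implied body mass index values come out implausibly spread, which motivates the dependent model below.

```python
import numpy as np

rng = np.random.default_rng(0)

# Sample height (metres) and weight (kg) independently,
# reflecting the factorisation p(w, h) = p(w)p(h).
num_samples = 1000
h = rng.normal(1.7, np.sqrt(0.0225), size=num_samples)   # variance 0.0225
w = rng.normal(75.0, np.sqrt(36.0), size=num_samples)    # variance 36

# Under independence the implied body mass index w / h^2 spreads implausibly.
bmi = w / h**2
print(f"sampled BMI range: {bmi.min():.1f} to {bmi.max():.1f}")
```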
Body Mass Index
In reality they are dependent (body mass index \(= \frac{w}{h^2}\)).
To deal with this dependence we introduce correlated
multivariate Gaussians.
Let’s assume that the prior density is given by a zero mean Gaussian,
which is independent across each of the parameters, \[
\mathbf{ w}\sim \mathcal{N}\left(\mathbf{0},\alpha \mathbf{I}\right)
\] In other words, we are assuming, for the prior, that each
element of the parameters vector, \(w_i\), was drawn from a Gaussian density as
follows \[
w_i \sim \mathcal{N}\left(0,\alpha\right)
\] Let’s start by assigning the parameter of the prior
distribution, its variance \(\alpha\).
Further Reading
Multivariate Gaussians: Section 2.3 up to top of pg 85 of Bishop
(2006)
Section 3.3 (pg 152–159) of Bishop (2006)
Revisit Olympics Data
Use the Bayesian approach on the Olympics data with polynomials.
Choose a prior \(\mathbf{ w}\sim
\mathcal{N}\left(\mathbf{0},\alpha \mathbf{I}\right)\) with \(\alpha = 1\).
Choose noise variance \(\sigma^2 =
0.01\).
Sampling the Prior
Always useful to perform a ‘sanity check’ and sample from the prior
before observing the data.
Since \(\mathbf{ y}= \boldsymbol{
\Phi}\mathbf{ w}+ \boldsymbol{ \epsilon}\) just need to sample
\[
\mathbf{ w}\sim \mathcal{N}\left(\mathbf{0},\alpha\mathbf{I}\right)
\]\[
\boldsymbol{ \epsilon}\sim \mathcal{N}\left(\mathbf{0},\sigma^2\mathbf{I}\right)
\] with \(\alpha=1\) and \(\sigma^2 = 0.01\).
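A minimal sketch of this sanity check (the polynomial basis, input range and basis degree are my own illustrative choices, not prescribed by the slides):

```python
import numpy as np

rng = np.random.default_rng(1)

def polynomial_basis(x, degree):
    """Design matrix Phi with columns 1, x, x^2, ..., x^degree."""
    return np.column_stack([x**d for d in range(degree + 1)])

alpha, sigma2, degree = 1.0, 0.01, 4      # prior variance, noise variance, basis degree
x = np.linspace(-1, 1, 50)                # illustrative (rescaled) inputs
Phi = polynomial_basis(x, degree)

# Sample functions from the prior: w ~ N(0, alpha I), y = Phi w + eps, eps ~ N(0, sigma^2 I)
for _ in range(5):
    w = rng.normal(0.0, np.sqrt(alpha), size=Phi.shape[1])
    y_sample = Phi @ w + rng.normal(0.0, np.sqrt(sigma2), size=len(x))
    # y_sample is one draw from the prior over functions (e.g. plot against x)
```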
The Bayesian and maximum likelihood solutions are equivalent when
\(\alpha \rightarrow \infty\).
Equivalent to a prior for \(\mathbf{ w}\) with infinite variance.
In other cases \(\alpha \mathbf{I}\) regularizes the system (keeps
parameters smaller).
Sampling from the Posterior
Now check samples by extracting \(\mathbf{
w}\) from the posterior.
Now for \(\mathbf{ y}= \boldsymbol{
\Phi}\mathbf{ w}+ \boldsymbol{ \epsilon}\) need \[
\mathbf{ w}\sim \mathcal{N}\left(\boldsymbol{ \mu}_w,\mathbf{C}_w\right)
\] with \(\mathbf{C}_w =
\left[\sigma^{-2}\boldsymbol{ \Phi}^\top \boldsymbol{ \Phi}+
\alpha^{-1}\mathbf{I}\right]^{-1}\) and \(\boldsymbol{ \mu}_w =\mathbf{C}_w \sigma^{-2}
\boldsymbol{ \Phi}^\top \mathbf{ y}\)\[
\boldsymbol{ \epsilon}\sim
\mathcal{N}\left(\mathbf{0},\sigma^2\mathbf{I}\right)
\] with \(\alpha=1\) and \(\sigma^2 = 0.01\).
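A corresponding sketch for the posterior (again with illustrative data and basis; the formulas for \(\mathbf{C}_w\) and \(\boldsymbol{\mu}_w\) are the ones above):

```python
import numpy as np

rng = np.random.default_rng(2)

# Illustrative data and basis (stand-ins for the Olympics inputs and targets)
x = np.linspace(-1, 1, 20)
y = 1.0 + 0.5 * x - 2.0 * x**2 + rng.normal(0.0, 0.1, size=len(x))
degree, alpha, sigma2 = 4, 1.0, 0.01
Phi = np.column_stack([x**d for d in range(degree + 1)])

# Posterior: C_w = [sigma^{-2} Phi^T Phi + alpha^{-1} I]^{-1}, mu_w = C_w sigma^{-2} Phi^T y
C_w = np.linalg.inv(Phi.T @ Phi / sigma2 + np.eye(Phi.shape[1]) / alpha)
mu_w = C_w @ (Phi.T @ y) / sigma2

# Draw posterior samples of w and the corresponding functions Phi w
w_samples = rng.multivariate_normal(mu_w, C_w, size=5)
f_samples = w_samples @ Phi.T      # each row is one posterior sample of the function at x
```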
Marginal Likelihood
The marginal likelihood can also be computed; it has the form:
\[
p(\mathbf{ y}|\mathbf{X}, \sigma^2, \alpha) =
\frac{1}{(2\pi)^\frac{n}{2}\left|\mathbf{K}\right|^\frac{1}{2}}
\exp\left(-\frac{1}{2} \mathbf{ y}^\top \mathbf{K}^{-1} \mathbf{
y}\right)
\] where \(\mathbf{K}= \alpha
\boldsymbol{ \Phi}\boldsymbol{ \Phi}^\top + \sigma^2
\mathbf{I}\).
So it is a zero mean \(n\)-dimensional Gaussian with covariance
matrix \(\mathbf{K}\).
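A hedged sketch of evaluating this density numerically (the function name and the toy data are mine; the formula for \(\mathbf{K}\) is the one above). Using slogdet and a linear solve avoids forming \(\mathbf{K}^{-1}\) explicitly:

```python
import numpy as np

def log_marginal_likelihood(Phi, y, alpha, sigma2):
    """log p(y | X, sigma^2, alpha) for the zero-mean Gaussian with K = alpha Phi Phi^T + sigma^2 I."""
    n = len(y)
    K = alpha * Phi @ Phi.T + sigma2 * np.eye(n)
    _, logdetK = np.linalg.slogdet(K)
    return -0.5 * (n * np.log(2 * np.pi) + logdetK + y @ np.linalg.solve(K, y))

# Tiny made-up check with a linear basis [1, x]
x = np.array([0.0, 0.5, 1.0])
Phi = np.column_stack([np.ones_like(x), x])
y = np.array([0.1, 0.6, 0.9])
print(log_marginal_likelihood(Phi, y, alpha=1.0, sigma2=0.01))
```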
Computing the Expected Output
Given the posterior for the parameters, how can we compute the
expected output at a given location?
Output of model at location \(\mathbf{
x}_i\) is given by \[
f(\mathbf{ x}_i; \mathbf{ w}) = \boldsymbol{ \phi}_i^\top \mathbf{ w}
\]
We want the expected output under the posterior density, \(p(\mathbf{ w}|\mathbf{ y}, \mathbf{X}, \sigma^2,
\alpha)\).
Mean of mapping function will be given by \[
\begin{aligned} \left\langle f(\mathbf{ x}_i; \mathbf{
w})\right\rangle_{p(\mathbf{ w}|\mathbf{ y}, \mathbf{X}, \sigma^2,
\alpha)} &= \boldsymbol{ \phi}_i^\top \left\langle\mathbf{
w}\right\rangle_{p(\mathbf{ w}|\mathbf{ y}, \mathbf{X}, \sigma^2,
\alpha)} \\ & = \boldsymbol{ \phi}_i^\top \boldsymbol{ \mu}_w
\end{aligned}
\]
Variance of Expected Output
Variance of model at location \(\mathbf{
x}_i\) is given by \[
\begin{aligned}\text{var}(f(\mathbf{ x}_i; \mathbf{ w})) &=
\left\langle(f(\mathbf{ x}_i; \mathbf{ w}))^2\right\rangle -
\left\langle f(\mathbf{ x}_i; \mathbf{ w})\right\rangle^2 \\&=
\boldsymbol{ \phi}_i^\top\left\langle\mathbf{ w}\mathbf{
w}^\top\right\rangle \boldsymbol{ \phi}_i - \boldsymbol{ \phi}_i^\top
\left\langle\mathbf{ w}\right\rangle\left\langle\mathbf{
w}\right\rangle^\top \boldsymbol{ \phi}_i \\&= \boldsymbol{
\phi}_i^\top \mathbf{C}_w\boldsymbol{ \phi}_i
\end{aligned}
\] where all these expectations are taken under the posterior
density, \(p(\mathbf{ w}|\mathbf{ y},
\mathbf{X}, \sigma^2, \alpha)\).
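A small sketch putting the two results together (the basis vector and posterior moments are made-up placeholders; in practice \(\boldsymbol{\mu}_w\) and \(\mathbf{C}_w\) come from the posterior computation above):

```python
import numpy as np

def predictive_mean_and_variance(phi_i, mu_w, C_w):
    """Expected output phi_i^T mu_w and its variance phi_i^T C_w phi_i under the posterior."""
    mean = phi_i @ mu_w
    var = phi_i @ C_w @ phi_i
    return mean, var

# Made-up basis vector and posterior moments, purely for illustration
phi_i = np.array([1.0, 0.3])                  # e.g. [1, x_i] at a single input
mu_w = np.array([0.2, 0.9])
C_w = np.array([[0.05, 0.01], [0.01, 0.04]])
print(predictive_mean_and_variance(phi_i, mu_w, C_w))
```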
Further Reading
Section 3.7–3.8 (pg 122–133) of Rogers and Girolami (2011)