Bayesian Regression

Overdetermined System

  • With two unknowns and two observations: \[ \begin{aligned} y_1 = & mx_1 + c\\ y_2 = & mx_2 + c \end{aligned} \]
  • Additional observation leads to overdetermined system. \[ y_3 = mx_3 + c \]
  • This problem is solved through a noise model \(\epsilon\sim \mathcal{N}\left(0,\sigma^2\right)\) \[ \begin{aligned} y_1 = mx_1 + c + \epsilon_1\\ y_2 = mx_2 + c + \epsilon_2\\ y_3 = mx_3 + c + \epsilon_3 \end{aligned} \]

Underdetermined System

  • What about two unknowns and one observation? \[y_1 = mx_1 + c\]

Can compute \(m\) given \(c\): \[m = \frac{y_1 - c}{x_1}\]

Underdetermined System

The Bayesian Controversy: Philosophical Underpinnings

A segment from the 2012 lecture on the philosophical underpinnings.

Noise Models

  • We aren’t modeling the entire system.
  • The noise model captures the mismatch between the model and the data.
  • The Gaussian model is justified by appeal to the central limit theorem.
  • Other models are also possible (e.g. Student-\(t\) for heavy tails).
  • Maximum likelihood with Gaussian noise leads to least squares (see the sketch below).
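
To make the last point concrete, here is a minimal sketch (not part of the original slides) that simulates data from \(y = mx + c + \epsilon\) with Gaussian noise and recovers \(m\) and \(c\) by least squares; the “true” values and noise level are arbitrary choices for illustration.

```python
import numpy as np

# Arbitrary "true" parameters and noise level, chosen only for illustration.
m_true, c_true, sigma = 1.4, -3.0, 0.5

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=30)
y = m_true * x + c_true + rng.normal(0.0, sigma, size=x.shape)

# Design matrix with a column of ones for the intercept c.
X = np.column_stack([x, np.ones_like(x)])

# Least squares solution, which is the maximum likelihood estimate under Gaussian noise.
(m_ml, c_ml), *_ = np.linalg.lstsq(X, y, rcond=None)
print(m_ml, c_ml)
```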

Different Types of Uncertainty

  • The first type of uncertainty we are dealing with is aleatoric uncertainty.
  • The second type of uncertainty we are dealing with is epistemic uncertainty.

Aleatoric Uncertainty

  • This is uncertainty we couldn’t resolve even if we wanted to, e.g. the result of a football match before it’s played.
  • Or where a dropped sheet of paper might land on the floor.

Epistemic Uncertainty

  • This is uncertainty we could in principle know the answer to; we just haven’t observed enough yet, e.g. the result of a football match after it’s played.
  • Or what colour socks your lecturer is wearing.

Further Reading

  • Section 1.2.3 (pg 21–24) of Bishop (2006)

  • Sections 3.1–3.4 (pg 95–117) of Rogers and Girolami (2011)

  • Section 1.2.6 (start from just past eq 1.64 pg 30-32) of Bishop (2006)

Sum of Squares and Probability

… It is clear, that for the product \(\Omega = h^\mu \pi^{-\frac{1}{2}\mu} e^{-hh(vv + v^\prime v^\prime + v^{\prime\prime} v^{\prime\prime} + \dots)}\) to be maximised the sum \(vv + v ^\prime v^\prime + v^{\prime\prime} v^{\prime\prime} + \text{etc}.\) ought to be minimized. Therefore, the most probable values of the unknown quantities \(p , q, r , s \text{etc}.\), should be that in which the sum of the squares of the differences between the functions \(V, V^\prime, V^{\prime\prime} \text{etc}\), and the observed values is minimized, for all observations of the same degree of precision is presumed.

Prior Distribution

  • Bayesian inference requires a prior on the parameters.

  • The prior represents your belief, before you see the data, about the likely value of the parameters.

  • For linear regression, consider a Gaussian prior on the intercept:

    \[c \sim \mathcal{N}\left(0,\alpha_1\right)\]

Posterior Distribution

  • The posterior distribution is found by combining the prior with the likelihood.

  • The posterior distribution represents your belief, after you see the data, about the likely value of the parameters.

  • The posterior is found through Bayes’ Rule \[ p(c|y) = \frac{p(y|c)p(c)}{p(y)} \]

    \[ \text{posterior} = \frac{\text{likelihood}\times \text{prior}}{\text{marginal likelihood}}. \]

Bayes Update

Stages to Derivation of the Posterior

  • Multiply likelihood by prior
    • Since both are “exponentiated quadratics”, the product is also an exponentiated quadratic because \(\exp(a^2)\exp(b^2) = \exp(a^2 + b^2)\).
  • Complete the square to get the resulting density in the form of a Gaussian.
  • Recognise the mean and (co)variance of the Gaussian. This is the estimate of the posterior.

Main Trick

\[p(c) = \frac{1}{\sqrt{2\pi\alpha_1}} \exp\left(-\frac{1}{2\alpha_1}c^2\right)\] \[p(\mathbf{ y}|\mathbf{ x}, c, m, \sigma^2) = \frac{1}{\left(2\pi\sigma^2\right)^{\frac{n}{2}}} \exp\left(-\frac{1}{2\sigma^2}\sum_{i=1}^n(y_i - mx_i - c)^2\right)\]

\[p(c| \mathbf{ y}, \mathbf{ x}, m, \sigma^2) = \frac{p(\mathbf{ y}|\mathbf{ x}, c, m, \sigma^2)p(c)}{p(\mathbf{ y}|\mathbf{ x}, m, \sigma^2)}\]

\[p(c| \mathbf{ y}, \mathbf{ x}, m, \sigma^2) = \frac{p(\mathbf{ y}|\mathbf{ x}, c, m, \sigma^2)p(c)}{\int p(\mathbf{ y}|\mathbf{ x}, c, m, \sigma^2)p(c) \text{d} c}\]

\[p(c| \mathbf{ y}, \mathbf{ x}, m, \sigma^2) \propto p(\mathbf{ y}|\mathbf{ x}, c, m, \sigma^2)p(c)\]

\[\begin{aligned} \log p(c | \mathbf{ y}, \mathbf{ x}, m, \sigma^2) =&-\frac{1}{2\sigma^2} \sum_{i=1}^n(y_i-c - mx_i)^2-\frac{1}{2\alpha_1} c^2 + \text{const}\\ = &-\frac{1}{2\sigma^2}\sum_{i=1}^n(y_i-mx_i)^2 -\left(\frac{n}{2\sigma^2} + \frac{1}{2\alpha_1}\right)c^2\\ & + c\frac{\sum_{i=1}^n(y_i-mx_i)}{\sigma^2}, \end{aligned}\]

complete the square of the quadratic form to obtain \[\log p(c | \mathbf{ y}, \mathbf{ x}, m, \sigma^2) = -\frac{1}{2\tau^2}(c - \mu)^2 +\text{const},\] where \(\tau^2 = \left(n\sigma^{-2} +\alpha_1^{-1}\right)^{-1}\) and \(\mu = \frac{\tau^2}{\sigma^2} \sum_{i=1}^n(y_i-mx_i)\).
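
As a numerical companion to this derivation (a sketch, not the lecture’s own code), the snippet below evaluates \(\tau^2\) and \(\mu\) for the posterior over the offset \(c\) on synthetic data; the slope \(m\), noise variance \(\sigma^2\) and prior variance \(\alpha_1\) are fixed to arbitrary example values.

```python
import numpy as np

rng = np.random.default_rng(1)

# Example values, arbitrary for illustration.
m, sigma2, alpha1 = 1.4, 0.25, 1.0
c_true = -3.0

x = rng.uniform(0, 10, size=50)
y = m * x + c_true + rng.normal(0.0, np.sqrt(sigma2), size=x.shape)
n = x.size

# Posterior over c with m assumed known, from the completed square above:
# tau^2 = (n / sigma^2 + 1 / alpha_1)^{-1},  mu = tau^2 / sigma^2 * sum(y_i - m x_i).
tau2 = 1.0 / (n / sigma2 + 1.0 / alpha1)
mu = (tau2 / sigma2) * np.sum(y - m * x)
print(mu, tau2)  # posterior mean and variance of c
```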


The Joint Density

  • Really want to know the joint posterior density over the parameters \(c\) and \(m\).
  • Could now integrate out over \(m\), but it’s easier to consider the multivariate case.

Two Dimensional Gaussian

  • Consider height, \(h\) (in metres), and weight, \(w\) (in kilograms).
  • Could sample height from a distribution: \[ h \sim \mathcal{N}\left(1.7,0.0225\right). \]
  • And similarly weight: \[ w \sim \mathcal{N}\left(75,36\right). \] Both are sampled in the sketch below.
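
A minimal sketch (not from the slides) sampling height and weight from these two Gaussians; the variances 0.0225 m² and 36 kg² correspond to standard deviations of 0.15 m and 6 kg.

```python
import numpy as np

rng = np.random.default_rng(2)

# h ~ N(1.7, 0.0225) and w ~ N(75, 36); numpy's normal() takes the standard deviation.
h = rng.normal(1.7, np.sqrt(0.0225), size=1000)
w = rng.normal(75.0, np.sqrt(36.0), size=1000)

# Sampling each independently corresponds to the factorised density p(w, h) = p(w) p(h).
print(h.mean(), h.var(), w.mean(), w.var())
```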

Height and Weight Models

Independence Assumption

  • We assume height and weight are independent.

    \[ p(w, h) = p(w)p(h). \]

Sampling Two Dimensional Variables

Body Mass Index

  • In reality they are dependent: the body mass index is \(\frac{w}{h^2}\).
  • To deal with this dependence we introduce correlated multivariate Gaussians.

Sampling Two Dimensional Variables

Independent Gaussians

\[ p(w, h) = p(w)p(h) \]

Independent Gaussians

\[ p(w, h) = \frac{1}{\sqrt{2\pi \sigma_1^2}\sqrt{2\pi\sigma_2^2}} \exp\left(-\frac{1}{2}\left(\frac{(w-\mu_1)^2}{\sigma_1^2} + \frac{(h-\mu_2)^2}{\sigma_2^2}\right)\right) \]

Independent Gaussians

\[ p(w, h) = \frac{1}{\sqrt{2\pi\sigma_1^22\pi\sigma_2^2}} \exp\left(-\frac{1}{2}\left(\begin{bmatrix}w \\ h\end{bmatrix} - \begin{bmatrix}\mu_1 \\ \mu_2\end{bmatrix}\right)^\top\begin{bmatrix}\sigma_1^2& 0\\0&\sigma_2^2\end{bmatrix}^{-1}\left(\begin{bmatrix}w \\ h\end{bmatrix} - \begin{bmatrix}\mu_1 \\ \mu_2\end{bmatrix}\right)\right) \]

Independent Gaussians

\[ p(\mathbf{ y}) = \frac{1}{\det{2\pi \mathbf{D}}^{\frac{1}{2}}} \exp\left(-\frac{1}{2}(\mathbf{ y}- \boldsymbol{ \mu})^\top\mathbf{D}^{-1}(\mathbf{ y}- \boldsymbol{ \mu})\right) \]

Correlated Gaussian

Form a correlated Gaussian from the original by rotating the data space using the matrix \(\mathbf{R}\).

\[ p(\mathbf{ y}) = \frac{1}{\det{2\pi\mathbf{D}}^{\frac{1}{2}}} \exp\left(-\frac{1}{2}(\mathbf{ y}- \boldsymbol{ \mu})^\top\mathbf{D}^{-1}(\mathbf{ y}- \boldsymbol{ \mu})\right) \]

Correlated Gaussian

Form a correlated Gaussian from the original by rotating the data space using the matrix \(\mathbf{R}\).

\[ p(\mathbf{ y}) = \frac{1}{\det{2\pi\mathbf{D}}^{\frac{1}{2}}} \exp\left(-\frac{1}{2}(\mathbf{R}^\top\mathbf{ y}- \mathbf{R}^\top\boldsymbol{ \mu})^\top\mathbf{D}^{-1}(\mathbf{R}^\top\mathbf{ y}- \mathbf{R}^\top\boldsymbol{ \mu})\right) \]

Correlated Gaussian

Form a correlated Gaussian from the original by rotating the data space using the matrix \(\mathbf{R}\).

\[ p(\mathbf{ y}) = \frac{1}{\det{2\pi\mathbf{D}}^{\frac{1}{2}}} \exp\left(-\frac{1}{2}(\mathbf{ y}- \boldsymbol{ \mu})^\top\mathbf{R}\mathbf{D}^{-1}\mathbf{R}^\top(\mathbf{ y}- \boldsymbol{ \mu})\right) \] this gives a covariance matrix: \[ \mathbf{C}^{-1} = \mathbf{R}\mathbf{D}^{-1} \mathbf{R}^\top \]

Correlated Gaussian

Form a correlated Gaussian from the original by rotating the data space using the matrix \(\mathbf{R}\).

\[ p(\mathbf{ y}) = \frac{1}{\det{2\pi\mathbf{C}}^{\frac{1}{2}}} \exp\left(-\frac{1}{2}(\mathbf{ y}- \boldsymbol{ \mu})^\top\mathbf{C}^{-1} (\mathbf{ y}- \boldsymbol{ \mu})\right) \] this gives a covariance matrix: \[ \mathbf{C}= \mathbf{R}\mathbf{D} \mathbf{R}^\top \]
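
The sketch below illustrates this construction with an arbitrary rotation angle and diagonal variances: build \(\mathbf{C} = \mathbf{R}\mathbf{D}\mathbf{R}^\top\), sample from the correlated Gaussian, and check the empirical covariance.

```python
import numpy as np

rng = np.random.default_rng(3)

# Diagonal covariance D and a rotation R (the angle is an arbitrary example choice).
D = np.diag([1.0, 0.1])
theta = np.pi / 4
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])

# Correlated covariance obtained by rotating the independent one.
C = R @ D @ R.T
mu = np.zeros(2)

samples = rng.multivariate_normal(mu, C, size=5000)
print(np.cov(samples.T))  # should be close to C
```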

The Prior Density

Let’s assume that the prior density is given by a zero mean Gaussian that is independent across the parameters, \[ \mathbf{ w}\sim \mathcal{N}\left(\mathbf{0},\alpha \mathbf{I}\right) \] In other words, we are assuming, for the prior, that each element of the parameter vector, \(w_i\), was drawn from a Gaussian density as follows \[ w_i \sim \mathcal{N}\left(0,\alpha\right) \] Let’s start by assigning a value to the parameter of the prior distribution, its variance \(\alpha\).

Further Reading

  • Multivariate Gaussians: Section 2.3 up to top of pg 85 of Bishop (2006)

  • Section 3.3 up to 159 (pg 152–159) of Bishop (2006)

Revisit Olympics Data

  • Use the Bayesian approach on the Olympics data with polynomials.

  • Choose a prior \(\mathbf{ w}\sim \mathcal{N}\left(\mathbf{0},\alpha \mathbf{I}\right)\) with \(\alpha = 1\).

  • Choose noise variance \(\sigma^2 = 0.01\)

Sampling the Prior

  • It is always useful to perform a ‘sanity check’ and sample from the prior before observing the data.
  • Since \(\mathbf{ y}= \boldsymbol{ \Phi}\mathbf{ w}+ \boldsymbol{ \epsilon}\) we just need to sample \[ \mathbf{ w}\sim \mathcal{N}\left(\mathbf{0},\alpha\mathbf{I}\right) \] \[ \boldsymbol{ \epsilon}\sim \mathcal{N}\left(\mathbf{0},\sigma^2\mathbf{I}\right) \] with \(\alpha=1\) and \(\sigma^2 = 0.01\) (see the sketch below).
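
A sketch of this sanity check, assuming a simple polynomial basis over inputs rescaled to \([-1, 1]\) (the basis, degree and input range are illustrative choices, not the lecture’s exact setup):

```python
import numpy as np

rng = np.random.default_rng(4)
alpha, sigma2 = 1.0, 0.01
degree = 4

# Illustrative inputs; a polynomial basis Phi = [1, x, x^2, ...].
x = np.linspace(-1, 1, 25)
Phi = np.vander(x, degree + 1, increasing=True)

# Sample w ~ N(0, alpha I) and eps ~ N(0, sigma^2 I), then form y = Phi w + eps.
for _ in range(5):
    w = rng.normal(0.0, np.sqrt(alpha), size=degree + 1)
    eps = rng.normal(0.0, np.sqrt(sigma2), size=x.shape)
    y_sample = Phi @ w + eps
    print(y_sample[:3])  # in practice you would plot each prior sample against x
```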

Computing the Posterior

\[ p(\mathbf{ w}| \mathbf{ y}, \mathbf{ x}, \sigma^2) = \mathcal{N}\left(\mathbf{ w}|\boldsymbol{ \mu}_w,\mathbf{C}_w\right) \] with \[ \mathbf{C}_w= \left(\sigma^{-2}\boldsymbol{ \Phi}^\top \boldsymbol{ \Phi}+ \alpha^{-1}\mathbf{I}\right)^{-1} \] and \[ \boldsymbol{ \mu}_w= \mathbf{C}_w\sigma^{-2}\boldsymbol{ \Phi}^\top \mathbf{ y} \]
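
A direct sketch of these two formulas (the basis \(\boldsymbol{\Phi}\) and targets \(\mathbf{y}\) below are illustrative stand-ins):

```python
import numpy as np

def posterior(Phi, y, sigma2, alpha):
    """Posterior mean and covariance of w for Bayesian linear-in-the-parameters regression:
    C_w = (Phi^T Phi / sigma^2 + I / alpha)^{-1},  mu_w = C_w Phi^T y / sigma^2."""
    d = Phi.shape[1]
    C_w = np.linalg.inv(Phi.T @ Phi / sigma2 + np.eye(d) / alpha)
    mu_w = C_w @ Phi.T @ y / sigma2
    return mu_w, C_w

# Illustrative usage with a polynomial basis and synthetic targets.
rng = np.random.default_rng(5)
x = np.linspace(-1, 1, 20)
Phi = np.vander(x, 5, increasing=True)
y = np.sin(3 * x) + rng.normal(0.0, 0.1, size=x.shape)
mu_w, C_w = posterior(Phi, y, sigma2=0.01, alpha=1.0)
print(mu_w)
```

In practice a Cholesky factorization or `np.linalg.solve` is preferable to the explicit inverse; the inverse is kept here only to mirror the formula above.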

Olympic Marathon Data

  • Gold medal times for Olympic Marathon since 1896.
  • Marathons before 1924 didn’t have a standardized distance.
  • Present results using pace per km.
  • In 1904 the Marathon was badly organized, leading to very slow times.
Image from Wikimedia Commons http://bit.ly/16kMKHQ

Olympic Marathon Data

Olympic Data with Bayesian Polynomials

Hold Out Validation

5-fold Cross Validation

Model Fit

  • The marginal likelihood doesn’t always increase as the model order increases.
  • The Bayesian model always has 2 parameters (\(\alpha\) and \(\sigma^2\)), regardless of how many basis functions are used (and here we didn’t even fit them).
  • The maximum likelihood model overfits as the number of parameters increases.
  • Revisit the maximum likelihood solution with a validation set.

Regularized Mean

  • Validation fit here based on mean solution for \(\mathbf{ w}\) only.
  • For Bayesian solution \[ \boldsymbol{ \mu}_w = \left[\sigma^{-2}\boldsymbol{ \Phi}^\top\boldsymbol{ \Phi}+ \alpha^{-1}\mathbf{I}\right]^{-1} \sigma^{-2} \boldsymbol{ \Phi}^\top \mathbf{ y} \] instead of \[ \mathbf{ w}^* = \left[\boldsymbol{ \Phi}^\top\boldsymbol{ \Phi}\right]^{-1} \boldsymbol{ \Phi}^\top \mathbf{ y} \]
  • The two are equivalent when \(\alpha \rightarrow \infty\).
  • That limit is equivalent to a prior for \(\mathbf{ w}\) with infinite variance.
  • In other cases the \(\alpha^{-1}\mathbf{I}\) term regularizes the system (keeps the parameters smaller); see the sketch below.
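
A sketch comparing the two solutions on illustrative data; as \(\alpha\) grows, the regularized mean approaches the maximum likelihood weights.

```python
import numpy as np

def ml_solution(Phi, y):
    # Maximum likelihood / least squares: w* = (Phi^T Phi)^{-1} Phi^T y.
    return np.linalg.solve(Phi.T @ Phi, Phi.T @ y)

def regularized_mean(Phi, y, sigma2, alpha):
    # Bayesian posterior mean: mu_w = (Phi^T Phi / sigma^2 + I / alpha)^{-1} Phi^T y / sigma^2.
    d = Phi.shape[1]
    return np.linalg.solve(Phi.T @ Phi / sigma2 + np.eye(d) / alpha, Phi.T @ y / sigma2)

# Illustrative basis and targets.
rng = np.random.default_rng(6)
x = np.linspace(-1, 1, 30)
Phi = np.vander(x, 6, increasing=True)
y = np.sin(3 * x) + rng.normal(0.0, 0.1, size=x.shape)

w_ml = ml_solution(Phi, y)
mu_w = regularized_mean(Phi, y, sigma2=0.01, alpha=1.0)
mu_w_flat_prior = regularized_mean(Phi, y, sigma2=0.01, alpha=1e8)
print(np.max(np.abs(w_ml - mu_w_flat_prior)))  # near zero: the two coincide as alpha grows
print(np.linalg.norm(mu_w), np.linalg.norm(w_ml))  # regularized weights are typically smaller
```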

Sampling from the Posterior

  • Now sanity-check by sampling \(\mathbf{ w}\) from the posterior (see the sketch below).
  • Now for \(\mathbf{ y}= \boldsymbol{ \Phi}\mathbf{ w}+ \boldsymbol{ \epsilon}\) need \[ \mathbf{ w}\sim \mathcal{N}\left(\boldsymbol{ \mu}_w,\mathbf{C}_w\right) \] with \(\mathbf{C}_w = \left[\sigma^{-2}\boldsymbol{ \Phi}^\top \boldsymbol{ \Phi}+ \alpha^{-1}\mathbf{I}\right]^{-1}\) and \(\boldsymbol{ \mu}_w =\mathbf{C}_w \sigma^{-2} \boldsymbol{ \Phi}^\top \mathbf{ y}\) \[ \boldsymbol{ \epsilon}\sim \mathcal{N}\left(\mathbf{0},\sigma^2\mathbf{I}\right) \] with \(\alpha=1\) and \(\sigma^2 = 0.01\).
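
A sketch of drawing such samples, reusing an illustrative basis and targets as before; the posterior is computed from the formulas above and \(\mathbf{w}\) is drawn with numpy’s multivariate normal sampler.

```python
import numpy as np

rng = np.random.default_rng(7)
alpha, sigma2 = 1.0, 0.01

# Illustrative basis and targets (stand-ins for the Olympics setup).
x = np.linspace(-1, 1, 30)
Phi = np.vander(x, 5, increasing=True)
y = np.sin(3 * x) + rng.normal(0.0, np.sqrt(sigma2), size=x.shape)

# Posterior over w.
C_w = np.linalg.inv(Phi.T @ Phi / sigma2 + np.eye(Phi.shape[1]) / alpha)
mu_w = C_w @ Phi.T @ y / sigma2

# Draw functions y = Phi w + eps with w ~ N(mu_w, C_w) and eps ~ N(0, sigma^2 I).
for w in rng.multivariate_normal(mu_w, C_w, size=5):
    f = Phi @ w + rng.normal(0.0, np.sqrt(sigma2), size=x.shape)
    print(f[:3])  # in practice, plot these posterior samples alongside the data
```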

Marginal Likelihood

  • The marginal likelihood can also be computed; it has the form \[ p(\mathbf{ y}|\mathbf{X}, \sigma^2, \alpha) = \frac{1}{(2\pi)^\frac{n}{2}\left|\mathbf{K}\right|^\frac{1}{2}} \exp\left(-\frac{1}{2} \mathbf{ y}^\top \mathbf{K}^{-1} \mathbf{ y}\right) \] where \(\mathbf{K}= \alpha \boldsymbol{ \Phi}\boldsymbol{ \Phi}^\top + \sigma^2 \mathbf{I}\).

  • So it is a zero-mean \(n\)-dimensional Gaussian with covariance matrix \(\mathbf{K}\) (a sketch of the computation follows).
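
A sketch of evaluating this quantity, working with the log for numerical stability (the helper name is an illustrative choice, not from the slides):

```python
import numpy as np

def log_marginal_likelihood(Phi, y, sigma2, alpha):
    """log p(y | X, sigma^2, alpha) with K = alpha Phi Phi^T + sigma^2 I."""
    n = y.shape[0]
    K = alpha * Phi @ Phi.T + sigma2 * np.eye(n)
    _, logdet = np.linalg.slogdet(K)
    return -0.5 * (n * np.log(2.0 * np.pi) + logdet + y @ np.linalg.solve(K, y))
```

Evaluating this for different numbers of basis functions is one way to compare model fits, in line with the model-fit discussion above.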

Computing the Expected Output

  • Given the posterior for the parameters, how can we compute the expected output at a given location?
  • Output of model at location \(\mathbf{ x}_i\) is given by \[ f(\mathbf{ x}_i; \mathbf{ w}) = \boldsymbol{ \phi}_i^\top \mathbf{ w} \]
  • We want the expected output under the posterior density, \(p(\mathbf{ w}|\mathbf{ y}, \mathbf{X}, \sigma^2, \alpha)\).
  • Mean of mapping function will be given by \[ \begin{aligned} \left\langle f(\mathbf{ x}_i; \mathbf{ w})\right\rangle_{p(\mathbf{ w}|\mathbf{ y}, \mathbf{X}, \sigma^2, \alpha)} &= \boldsymbol{ \phi}_i^\top \left\langle\mathbf{ w}\right\rangle_{p(\mathbf{ w}|\mathbf{ y}, \mathbf{X}, \sigma^2, \alpha)} \\ & = \boldsymbol{ \phi}_i^\top \boldsymbol{ \mu}_w \end{aligned} \]

Variance of Expected Output

  • The variance of the model at location \(\mathbf{ x}_i\) is given by \[ \begin{aligned}\text{var}(f(\mathbf{ x}_i; \mathbf{ w})) &= \left\langle(f(\mathbf{ x}_i; \mathbf{ w}))^2\right\rangle - \left\langle f(\mathbf{ x}_i; \mathbf{ w})\right\rangle^2 \\&= \boldsymbol{ \phi}_i^\top\left\langle\mathbf{ w}\mathbf{ w}^\top\right\rangle \boldsymbol{ \phi}_i - \boldsymbol{ \phi}_i^\top \left\langle\mathbf{ w}\right\rangle\left\langle\mathbf{ w}\right\rangle^\top \boldsymbol{ \phi}_i \\&= \boldsymbol{ \phi}_i^\top \mathbf{C}_w\boldsymbol{ \phi}_i \end{aligned} \] where all these expectations are taken under the posterior density, \(p(\mathbf{ w}|\mathbf{ y}, \mathbf{X}, \sigma^2, \alpha)\), and \(\mathbf{C}_w\) is the posterior covariance (see the sketch below).
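
A sketch combining the last two results into a predictive mean and variance at new inputs; the optional noise term is an extra convenience for predicting \(y\) rather than \(f\), not something stated on the slides.

```python
import numpy as np

def predict(Phi_star, mu_w, C_w, sigma2=None):
    """Posterior mean and variance of f(x) = phi(x)^T w for each row of Phi_star."""
    mean = Phi_star @ mu_w
    # phi_i^T C_w phi_i for every row phi_i of Phi_star.
    var = np.einsum('ij,jk,ik->i', Phi_star, C_w, Phi_star)
    if sigma2 is not None:
        var = var + sigma2  # add observation noise to predict y rather than f
    return mean, var
```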

Further Reading

  • Section 3.7–3.8 (pg 122–133) of Rogers and Girolami (2011)

  • Section 3.4 (pg 161–165) of Bishop (2006)

Thanks!

References

Bishop, C.M., 2006. Pattern recognition and machine learning. Springer.
Gauss, C.F., 1809. Theoria motus corporum coelestium in sectionibus conicis solem ambientium. F. Perthes und I. H. Besser, Hamburg.
Rogers, S., Girolami, M., 2011. A first course in machine learning. CRC Press.