Bayesian Regression

Overdetermined System

  • With two unknowns and two observations: \[ \begin{aligned} y_1 = & mx_1 + c\\ y_2 = & mx_2 + c \end{aligned} \]
  • Additional observation leads to overdetermined system. \[ y_3 = mx_3 + c \]
  • This problem is solved through a noise model \(\epsilon\sim \mathcal{N}\left(0,\sigma^2\right)\) \[ \begin{aligned} y_1 = mx_1 + c + \epsilon_1\\ y_2 = mx_2 + c + \epsilon_2\\ y_3 = mx_3 + c + \epsilon_3 \end{aligned} \]

Underdetermined System

  • What about two unknowns and one observation? \[y_1 = mx_1 + c\]

Can compute \(m\) given \(c\): \[m = \frac{y_1 - c}{x_1}\]

Underdetermined System

The Bayesian Controversy: Philosophical Underpinnings

A segment from the 2012 lecture on the philosophical underpinnings.

Noise Models

  • We aren’t modeling the entire system.
  • The noise model captures the mismatch between the model and the data.
  • The Gaussian model is justified by appeal to the central limit theorem.
  • Other models are also possible (e.g. Student-\(t\) for heavy tails).
  • Maximum likelihood with Gaussian noise leads to least squares (see the sketch below).
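
To make the last point concrete, here is a minimal sketch (not part of the original slides) that simulates data from \(y = mx + c + \epsilon\) with Gaussian noise and recovers \(m\) and \(c\) by least squares; the “true” values and noise level are arbitrary choices for illustration.

```python
import numpy as np

# Arbitrary "true" parameters and noise level, chosen only for illustration.
m_true, c_true, sigma = 1.4, -3.0, 0.5

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=30)
y = m_true * x + c_true + rng.normal(0.0, sigma, size=x.shape)

# Design matrix with a column of ones for the intercept c.
X = np.column_stack([x, np.ones_like(x)])

# Least squares solution, which is the maximum likelihood estimate under Gaussian noise.
(m_ml, c_ml), *_ = np.linalg.lstsq(X, y, rcond=None)
print(m_ml, c_ml)
```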

Different Types of Uncertainty

  • The first type of uncertainty we are dealing with is aleatoric uncertainty.
  • The second type of uncertainty we are dealing with is epistemic uncertainty.

Aleatoric Uncertainty

  • This is uncertainty we couldn’t resolve even if we wanted to, e.g. the result of a football match before it’s played.
  • Or where a dropped sheet of paper might land on the floor.

Epistemic Uncertainty

  • This is uncertainty we could in principle know the answer to; we just haven’t observed enough yet, e.g. the result of a football match after it’s played.
  • Or what colour socks your lecturer is wearing.

Further Reading

  • Section 1.2.3 (pg 21–24) of Bishop (2006)

  • Sections 3.1–3.4 (pg 95–117) of Rogers and Girolami (2011)

  • Section 1.2.6 (start from just past eq 1.64 pg 30-32) of Bishop (2006)

Sum of Squares and Probability

… It is clear, that for the product \(\Omega = h^\mu \pi^{-\frac{1}{2}\mu} e^{-hh(vv + v^\prime v^\prime + v^{\prime\prime} v^{\prime\prime} + \dots)}\) to be maximised the sum \(vv + v ^\prime v^\prime + v^{\prime\prime} v^{\prime\prime} + \text{etc}.\) ought to be minimized. Therefore, the most probable values of the unknown quantities \(p , q, r , s \text{etc}.\), should be that in which the sum of the squares of the differences between the functions \(V, V^\prime, V^{\prime\prime} \text{etc}\), and the observed values is minimized, for all observations of the same degree of precision is presumed.

Prior Distribution

  • Bayesian inference requires a prior on the parameters.

  • The prior represents your belief, before you see the data, about the likely value of the parameters.

  • For linear regression, consider a Gaussian prior on the intercept:

    \[c \sim \mathcal{N}\left(0,\alpha_1\right)\]

Posterior Distribution

  • The posterior distribution is found by combining the prior with the likelihood.

  • The posterior distribution represents your belief, after you see the data, about the likely value of the parameters.

  • The posterior is found through Bayes’ Rule \[ p(c|y) = \frac{p(y|c)p(c)}{p(y)} \]

    \[ \text{posterior} = \frac{\text{likelihood}\times \text{prior}}{\text{marginal likelihood}}. \]

Bayes Update

Stages to Derivation of the Posterior

  • Multiply likelihood by prior
    • Since both are “exponentiated quadratics”, the product is also an exponentiated quadratic because \(\exp(a^2)\exp(b^2) = \exp(a^2 + b^2)\).
  • Complete the square to get the resulting density in the form of a Gaussian.
  • Recognise the mean and (co)variance of the Gaussian. This is the estimate of the posterior.

Main Trick

\[p(c) = \frac{1}{\sqrt{2\pi\alpha_1}} \exp\left(-\frac{1}{2\alpha_1}c^2\right)\] \[p(\mathbf{ y}|\mathbf{ x}, c, m, \sigma^2) = \frac{1}{\left(2\pi\sigma^2\right)^{\frac{n}{2}}} \exp\left(-\frac{1}{2\sigma^2}\sum_{i=1}^n(y_i - mx_i - c)^2\right)\]

\[p(c| \mathbf{ y}, \mathbf{ x}, m, \sigma^2) = \frac{p(\mathbf{ y}|\mathbf{ x}, c, m, \sigma^2)p(c)}{p(\mathbf{ y}|\mathbf{ x}, m, \sigma^2)}\]

\[p(c| \mathbf{ y}, \mathbf{ x}, m, \sigma^2) = \frac{p(\mathbf{ y}|\mathbf{ x}, c, m, \sigma^2)p(c)}{\int p(\mathbf{ y}|\mathbf{ x}, c, m, \sigma^2)p(c) \text{d} c}\]

\[p(c| \mathbf{ y}, \mathbf{ x}, m, \sigma^2) \propto p(\mathbf{ y}|\mathbf{ x}, c, m, \sigma^2)p(c)\]

\[\begin{aligned} \log p(c | \mathbf{ y}, \mathbf{ x}, m, \sigma^2) =&-\frac{1}{2\sigma^2} \sum_{i=1}^n(y_i-c - mx_i)^2-\frac{1}{2\alpha_1} c^2 + \text{const}\\ = &-\frac{1}{2\sigma^2}\sum_{i=1}^n(y_i-mx_i)^2 -\left(\frac{n}{2\sigma^2} + \frac{1}{2\alpha_1}\right)c^2\\ & + c\frac{\sum_{i=1}^n(y_i-mx_i)}{\sigma^2}, \end{aligned}\]

complete the square of the quadratic form to obtain \[\log p(c | \mathbf{ y}, \mathbf{ x}, m, \sigma^2) = -\frac{1}{2\tau^2}(c - \mu)^2 +\text{const},\] where \(\tau^2 = \left(n\sigma^{-2} +\alpha_1^{-1}\right)^{-1}\) and \(\mu = \frac{\tau^2}{\sigma^2} \sum_{i=1}^n(y_i-mx_i)\).
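
As a numerical companion to this derivation (a sketch, not the lecture’s own code), the snippet below evaluates \(\tau^2\) and \(\mu\) for the posterior over the offset \(c\) on synthetic data; the slope \(m\), noise variance \(\sigma^2\) and prior variance \(\alpha_1\) are fixed to arbitrary example values.

```python
import numpy as np

rng = np.random.default_rng(1)

# Example values, arbitrary for illustration.
m, sigma2, alpha1 = 1.4, 0.25, 1.0
c_true = -3.0

x = rng.uniform(0, 10, size=50)
y = m * x + c_true + rng.normal(0.0, np.sqrt(sigma2), size=x.shape)
n = x.size

# Posterior over c with m assumed known, from the completed square above:
# tau^2 = (n / sigma^2 + 1 / alpha_1)^{-1},  mu = tau^2 / sigma^2 * sum(y_i - m x_i).
tau2 = 1.0 / (n / sigma2 + 1.0 / alpha1)
mu = (tau2 / sigma2) * np.sum(y - m * x)
print(mu, tau2)  # posterior mean and variance of c
```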


The Joint Density

  • Really want to know the joint posterior density over the parameters \(c\) and \(m\).
  • Could now integrate out over \(m\), but it’s easier to consider the multivariate case.

Two Dimensional Gaussian

  • Consider height, \(h\) (in metres), and weight, \(w\) (in kilograms).
  • Could sample height from a distribution: \[ h \sim \mathcal{N}\left(1.7,0.0225\right). \]
  • And similarly weight: \[ w \sim \mathcal{N}\left(75,36\right). \] Both are sampled in the sketch below.
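
A minimal sketch (not from the slides) sampling height and weight from these two Gaussians; the variances 0.0225 m² and 36 kg² correspond to standard deviations of 0.15 m and 6 kg.

```python
import numpy as np

rng = np.random.default_rng(2)

# h ~ N(1.7, 0.0225) and w ~ N(75, 36); numpy's normal() takes the standard deviation.
h = rng.normal(1.7, np.sqrt(0.0225), size=1000)
w = rng.normal(75.0, np.sqrt(36.0), size=1000)

# Sampling each independently corresponds to the factorised density p(w, h) = p(w) p(h).
print(h.mean(), h.var(), w.mean(), w.var())
```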

Height and Weight Models

Independence Assumption

  • We assume height and weight are independent.

    \[ p(w, h) = p(w)p(h). \]

Sampling Two Dimensional Variables

Body Mass Index

  • In reality they are dependent: the body mass index is \(\frac{w}{h^2}\).
  • To deal with this dependence we introduce correlated multivariate Gaussians.

Sampling Two Dimensional Variables

Independent Gaussians

\[ p(w, h) = p(w)p(h) \]

Independent Gaussians

\[ p(w, h) = \frac{1}{\sqrt{2\pi \sigma_1^2}\sqrt{2\pi\sigma_2^2}} \exp\left(-\frac{1}{2}\left(\frac{(w-\mu_1)^2}{\sigma_1^2} + \frac{(h-\mu_2)^2}{\sigma_2^2}\right)\right) \]

Independent Gaussians

\[ p(w, h) = \frac{1}{\sqrt{2\pi\sigma_1^22\pi\sigma_2^2}} \exp\left(-\frac{1}{2}\left(\begin{bmatrix}w \\ h\end{bmatrix} - \begin{bmatrix}\mu_1 \\ \mu_2\end{bmatrix}\right)^\top\begin{bmatrix}\sigma_1^2& 0\\0&\sigma_2^2\end{bmatrix}^{-1}\left(\begin{bmatrix}w \\ h\end{bmatrix} - \begin{bmatrix}\mu_1 \\ \mu_2\end{bmatrix}\right)\right) \]

Independent Gaussians

\[ p(\mathbf{ y}) = \frac{1}{\det{2\pi \mathbf{D}}^{\frac{1}{2}}} \exp\left(-\frac{1}{2}(\mathbf{ y}- \boldsymbol{ \mu})^\top\mathbf{D}^{-1}(\mathbf{ y}- \boldsymbol{ \mu})\right) \]

Correlated Gaussian

Form a correlated Gaussian from the original by rotating the data space using the matrix \(\mathbf{R}\).

\[ p(\mathbf{ y}) = \frac{1}{\det{2\pi\mathbf{D}}^{\frac{1}{2}}} \exp\left(-\frac{1}{2}(\mathbf{ y}- \boldsymbol{ \mu})^\top\mathbf{D}^{-1}(\mathbf{ y}- \boldsymbol{ \mu})\right) \]

Correlated Gaussian

Form a correlated Gaussian from the original by rotating the data space using the matrix \(\mathbf{R}\).

\[ p(\mathbf{ y}) = \frac{1}{\det{2\pi\mathbf{D}}^{\frac{1}{2}}} \exp\left(-\frac{1}{2}(\mathbf{R}^\top\mathbf{ y}- \mathbf{R}^\top\boldsymbol{ \mu})^\top\mathbf{D}^{-1}(\mathbf{R}^\top\mathbf{ y}- \mathbf{R}^\top\boldsymbol{ \mu})\right) \]

Correlated Gaussian

Form a correlated Gaussian from the original by rotating the data space using the matrix \(\mathbf{R}\).

\[ p(\mathbf{ y}) = \frac{1}{\det{2\pi\mathbf{D}}^{\frac{1}{2}}} \exp\left(-\frac{1}{2}(\mathbf{ y}- \boldsymbol{ \mu})^\top\mathbf{R}\mathbf{D}^{-1}\mathbf{R}^\top(\mathbf{ y}- \boldsymbol{ \mu})\right) \] this gives a covariance matrix: \[ \mathbf{C}^{-1} = \mathbf{R}\mathbf{D}^{-1} \mathbf{R}^\top \]

Correlated Gaussian

Form a correlated Gaussian from the original by rotating the data space using the matrix \(\mathbf{R}\).

\[ p(\mathbf{ y}) = \frac{1}{\det{2\pi\mathbf{C}}^{\frac{1}{2}}} \exp\left(-\frac{1}{2}(\mathbf{ y}- \boldsymbol{ \mu})^\top\mathbf{C}^{-1} (\mathbf{ y}- \boldsymbol{ \mu})\right) \] this gives a covariance matrix: \[ \mathbf{C}= \mathbf{R}\mathbf{D} \mathbf{R}^\top \]
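
The sketch below illustrates this construction with an arbitrary rotation angle and diagonal variances: build \(\mathbf{C} = \mathbf{R}\mathbf{D}\mathbf{R}^\top\), sample from the correlated Gaussian, and check the empirical covariance.

```python
import numpy as np

rng = np.random.default_rng(3)

# Diagonal covariance D and a rotation R (the angle is an arbitrary example choice).
D = np.diag([1.0, 0.1])
theta = np.pi / 4
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])

# Correlated covariance obtained by rotating the independent one.
C = R @ D @ R.T
mu = np.zeros(2)

samples = rng.multivariate_normal(mu, C, size=5000)
print(np.cov(samples.T))  # should be close to C
```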

The Prior Density

Let’s assume that the prior density is given by a zero mean Gaussian that is independent across the parameters, \[ \mathbf{ w}\sim \mathcal{N}\left(\mathbf{0},\alpha \mathbf{I}\right) \] In other words, we are assuming, for the prior, that each element of the parameter vector, \(w_i\), was drawn from a Gaussian density as follows \[ w_i \sim \mathcal{N}\left(0,\alpha\right) \] Let’s start by assigning a value to the parameter of the prior distribution, its variance \(\alpha\).

Further Reading

  • Multivariate Gaussians: Section 2.3 up to top of pg 85 of Bishop (2006)

  • Section 3.3 up to 159 (pg 152–159) of Bishop (2006)

Revisit Olympics Data

  • Use the Bayesian approach on the Olympics data with polynomials.

  • Choose a prior \(\mathbf{ w}\sim \mathcal{N}\left(\mathbf{0},\alpha \mathbf{I}\right)\) with \(\alpha = 1\).

  • Choose noise variance \(\sigma^2 = 0.01\)

Sampling the Prior

  • It is always useful to perform a ‘sanity check’ and sample from the prior before observing the data.
  • Since \(\mathbf{ y}= \boldsymbol{ \Phi}\mathbf{ w}+ \boldsymbol{ \epsilon}\) we just need to sample \[ \mathbf{ w}\sim \mathcal{N}\left(\mathbf{0},\alpha\mathbf{I}\right) \] \[ \boldsymbol{ \epsilon}\sim \mathcal{N}\left(\mathbf{0},\sigma^2\mathbf{I}\right) \] with \(\alpha=1\) and \(\sigma^2 = 0.01\) (see the sketch below).
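
A sketch of this sanity check, assuming a simple polynomial basis over inputs rescaled to \([-1, 1]\) (the basis, degree and input range are illustrative choices, not the lecture’s exact setup):

```python
import numpy as np

rng = np.random.default_rng(4)
alpha, sigma2 = 1.0, 0.01
degree = 4

# Illustrative inputs; a polynomial basis Phi = [1, x, x^2, ...].
x = np.linspace(-1, 1, 25)
Phi = np.vander(x, degree + 1, increasing=True)

# Sample w ~ N(0, alpha I) and eps ~ N(0, sigma^2 I), then form y = Phi w + eps.
for _ in range(5):
    w = rng.normal(0.0, np.sqrt(alpha), size=degree + 1)
    eps = rng.normal(0.0, np.sqrt(sigma2), size=x.shape)
    y_sample = Phi @ w + eps
    print(y_sample[:3])  # in practice you would plot each prior sample against x
```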

Computing the Posterior

\[ p(\mathbf{ w}| \mathbf{ y}, \mathbf{ x}, \sigma^2) = \mathcal{N}\left(\mathbf{ w}|\boldsymbol{ \mu}_w,\mathbf{C}_w\right) \] with \[ \mathbf{C}_w= \left(\sigma^{-2}\boldsymbol{ \Phi}^\top \boldsymbol{ \Phi}+ \alpha^{-1}\mathbf{I}\right)^{-1} \] and \[ \boldsymbol{ \mu}_w= \mathbf{C}_w\sigma^{-2}\boldsymbol{ \Phi}^\top \mathbf{ y} \]
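
A direct sketch of these two formulas (the basis \(\boldsymbol{\Phi}\) and targets \(\mathbf{y}\) below are illustrative stand-ins):

```python
import numpy as np

def posterior(Phi, y, sigma2, alpha):
    """Posterior mean and covariance of w for Bayesian linear-in-the-parameters regression:
    C_w = (Phi^T Phi / sigma^2 + I / alpha)^{-1},  mu_w = C_w Phi^T y / sigma^2."""
    d = Phi.shape[1]
    C_w = np.linalg.inv(Phi.T @ Phi / sigma2 + np.eye(d) / alpha)
    mu_w = C_w @ Phi.T @ y / sigma2
    return mu_w, C_w

# Illustrative usage with a polynomial basis and synthetic targets.
rng = np.random.default_rng(5)
x = np.linspace(-1, 1, 20)
Phi = np.vander(x, 5, increasing=True)
y = np.sin(3 * x) + rng.normal(0.0, 0.1, size=x.shape)
mu_w, C_w = posterior(Phi, y, sigma2=0.01, alpha=1.0)
print(mu_w)
```

In practice a Cholesky factorization or `np.linalg.solve` is preferable to the explicit inverse; the inverse is kept here only to mirror the formula above.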

Olympic Marathon Data

  • Gold medal times for Olympic Marathon since 1896.
  • Marathons before 1924 didn’t have a standardized distance.
  • Present results using pace per km.
  • In 1904 the Marathon was badly organized, leading to very slow times.
Image from Wikimedia Commons http://bit.ly/16kMKHQ

Olympic Marathon Data

Olympic Data with Bayesian Polynomials

Hold Out Validation

5-fold Cross Validation

Model Fit

  • The marginal likelihood doesn’t always increase as the model order increases.
  • The Bayesian model always has 2 parameters (\(\alpha\) and \(\sigma^2\)), regardless of how many basis functions are used (and here we didn’t even fit them).
  • The maximum likelihood model overfits as the number of parameters increases.
  • Revisit the maximum likelihood solution with a validation set.

Regularized Mean

  • Validation fit here based on mean solution for \(\mathbf{ w}\) only.
  • For Bayesian solution \[ \boldsymbol{ \mu}_w = \left[\sigma^{-2}\boldsymbol{ \Phi}^\top\boldsymbol{ \Phi}+ \alpha^{-1}\mathbf{I}\right]^{-1} \sigma^{-2} \boldsymbol{ \Phi}^\top \mathbf{ y} \] instead of \[ \mathbf{ w}^* = \left[\boldsymbol{ \Phi}^\top\boldsymbol{ \Phi}\right]^{-1} \boldsymbol{ \Phi}^\top \mathbf{ y} \]
  • The two are equivalent when \(\alpha \rightarrow \infty\).
  • That limit is equivalent to a prior for \(\mathbf{ w}\) with infinite variance.
  • In other cases the \(\alpha^{-1}\mathbf{I}\) term regularizes the system (keeps the parameters smaller); see the sketch below.
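
A sketch comparing the two solutions on illustrative data; as \(\alpha\) grows, the regularized mean approaches the maximum likelihood weights.

```python
import numpy as np

def ml_solution(Phi, y):
    # Maximum likelihood / least squares: w* = (Phi^T Phi)^{-1} Phi^T y.
    return np.linalg.solve(Phi.T @ Phi, Phi.T @ y)

def regularized_mean(Phi, y, sigma2, alpha):
    # Bayesian posterior mean: mu_w = (Phi^T Phi / sigma^2 + I / alpha)^{-1} Phi^T y / sigma^2.
    d = Phi.shape[1]
    return np.linalg.solve(Phi.T @ Phi / sigma2 + np.eye(d) / alpha, Phi.T @ y / sigma2)

# Illustrative basis and targets.
rng = np.random.default_rng(6)
x = np.linspace(-1, 1, 30)
Phi = np.vander(x, 6, increasing=True)
y = np.sin(3 * x) + rng.normal(0.0, 0.1, size=x.shape)

w_ml = ml_solution(Phi, y)
mu_w = regularized_mean(Phi, y, sigma2=0.01, alpha=1.0)
mu_w_flat_prior = regularized_mean(Phi, y, sigma2=0.01, alpha=1e8)
print(np.max(np.abs(w_ml - mu_w_flat_prior)))  # near zero: the two coincide as alpha grows
print(np.linalg.norm(mu_w), np.linalg.norm(w_ml))  # regularized weights are typically smaller
```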

Sampling from the Posterior

  • Now sanity-check by sampling \(\mathbf{ w}\) from the posterior (see the sketch below).
  • Now for \(\mathbf{ y}= \boldsymbol{ \Phi}\mathbf{ w}+ \boldsymbol{ \epsilon}\) need \[ \mathbf{ w}\sim \mathcal{N}\left(\boldsymbol{ \mu}_w,\mathbf{C}_w\right) \] with \(\mathbf{C}_w = \left[\sigma^{-2}\boldsymbol{ \Phi}^\top \boldsymbol{ \Phi}+ \alpha^{-1}\mathbf{I}\right]^{-1}\) and \(\boldsymbol{ \mu}_w =\mathbf{C}_w \sigma^{-2} \boldsymbol{ \Phi}^\top \mathbf{ y}\) \[ \boldsymbol{ \epsilon}\sim \mathcal{N}\left(\mathbf{0},\sigma^2\mathbf{I}\right) \] with \(\alpha=1\) and \(\sigma^2 = 0.01\).
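
A sketch of drawing such samples, reusing an illustrative basis and targets as before; the posterior is computed from the formulas above and \(\mathbf{w}\) is drawn with numpy’s multivariate normal sampler.

```python
import numpy as np

rng = np.random.default_rng(7)
alpha, sigma2 = 1.0, 0.01

# Illustrative basis and targets (stand-ins for the Olympics setup).
x = np.linspace(-1, 1, 30)
Phi = np.vander(x, 5, increasing=True)
y = np.sin(3 * x) + rng.normal(0.0, np.sqrt(sigma2), size=x.shape)

# Posterior over w.
C_w = np.linalg.inv(Phi.T @ Phi / sigma2 + np.eye(Phi.shape[1]) / alpha)
mu_w = C_w @ Phi.T @ y / sigma2

# Draw functions y = Phi w + eps with w ~ N(mu_w, C_w) and eps ~ N(0, sigma^2 I).
for w in rng.multivariate_normal(mu_w, C_w, size=5):
    f = Phi @ w + rng.normal(0.0, np.sqrt(sigma2), size=x.shape)
    print(f[:3])  # in practice, plot these posterior samples alongside the data
```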

Marginal Likelihood

  • The marginal likelihood can also be computed; it has the form \[ p(\mathbf{ y}|\mathbf{X}, \sigma^2, \alpha) = \frac{1}{(2\pi)^\frac{n}{2}\left|\mathbf{K}\right|^\frac{1}{2}} \exp\left(-\frac{1}{2} \mathbf{ y}^\top \mathbf{K}^{-1} \mathbf{ y}\right) \] where \(\mathbf{K}= \alpha \boldsymbol{ \Phi}\boldsymbol{ \Phi}^\top + \sigma^2 \mathbf{I}\).

  • So it is a zero-mean \(n\)-dimensional Gaussian with covariance matrix \(\mathbf{K}\) (a sketch of the computation follows).
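
A sketch of evaluating this quantity, working with the log for numerical stability (the helper name is an illustrative choice, not from the slides):

```python
import numpy as np

def log_marginal_likelihood(Phi, y, sigma2, alpha):
    """log p(y | X, sigma^2, alpha) with K = alpha Phi Phi^T + sigma^2 I."""
    n = y.shape[0]
    K = alpha * Phi @ Phi.T + sigma2 * np.eye(n)
    _, logdet = np.linalg.slogdet(K)
    return -0.5 * (n * np.log(2.0 * np.pi) + logdet + y @ np.linalg.solve(K, y))
```

Evaluating this for different numbers of basis functions is one way to compare model fits, in line with the model-fit discussion above.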

Computing the Expected Output

  • Given the posterior for the parameters, how can we compute the expected output at a given location?
  • Output of model at location \(\mathbf{ x}_i\) is given by \[ f(\mathbf{ x}_i; \mathbf{ w}) = \boldsymbol{ \phi}_i^\top \mathbf{ w} \]
  • We want the expected output under the posterior density, \(p(\mathbf{ w}|\mathbf{ y}, \mathbf{X}, \sigma^2, \alpha)\).
  • Mean of mapping function will be given by \[ \begin{aligned} \left\langle f(\mathbf{ x}_i; \mathbf{ w})\right\rangle_{p(\mathbf{ w}|\mathbf{ y}, \mathbf{X}, \sigma^2, \alpha)} &= \boldsymbol{ \phi}_i^\top \left\langle\mathbf{ w}\right\rangle_{p(\mathbf{ w}|\mathbf{ y}, \mathbf{X}, \sigma^2, \alpha)} \\ & = \boldsymbol{ \phi}_i^\top \boldsymbol{ \mu}_w \end{aligned} \]

Variance of Expected Output

  • The variance of the model at location \(\mathbf{ x}_i\) is given by \[ \begin{aligned}\text{var}(f(\mathbf{ x}_i; \mathbf{ w})) &= \left\langle(f(\mathbf{ x}_i; \mathbf{ w}))^2\right\rangle - \left\langle f(\mathbf{ x}_i; \mathbf{ w})\right\rangle^2 \\&= \boldsymbol{ \phi}_i^\top\left\langle\mathbf{ w}\mathbf{ w}^\top\right\rangle \boldsymbol{ \phi}_i - \boldsymbol{ \phi}_i^\top \left\langle\mathbf{ w}\right\rangle\left\langle\mathbf{ w}\right\rangle^\top \boldsymbol{ \phi}_i \\&= \boldsymbol{ \phi}_i^\top \mathbf{C}_w\boldsymbol{ \phi}_i \end{aligned} \] where all these expectations are taken under the posterior density, \(p(\mathbf{ w}|\mathbf{ y}, \mathbf{X}, \sigma^2, \alpha)\), and \(\mathbf{C}_w\) is the posterior covariance (see the sketch below).
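
A sketch combining the last two results into a predictive mean and variance at new inputs; the optional noise term is an extra convenience for predicting \(y\) rather than \(f\), not something stated on the slides.

```python
import numpy as np

def predict(Phi_star, mu_w, C_w, sigma2=None):
    """Posterior mean and variance of f(x) = phi(x)^T w for each row of Phi_star."""
    mean = Phi_star @ mu_w
    # phi_i^T C_w phi_i for every row phi_i of Phi_star.
    var = np.einsum('ij,jk,ik->i', Phi_star, C_w, Phi_star)
    if sigma2 is not None:
        var = var + sigma2  # add observation noise to predict y rather than f
    return mean, var
```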

Further Reading

  • Section 3.7–3.8 (pg 122–133) of Rogers and Girolami (2011)

  • Section 3.4 (pg 161–165) of Bishop (2006)

Thanks!

References

Bishop, C.M., 2006. Pattern recognition and machine learning. Springer.
Gauss, C.F., 1809. Theoria motus corporum coelestium in sectionibus conicis solem ambientium. F. Perthes und I. H. Besser, Hamburg.
Rogers, S., Girolami, M., 2011. A first course in machine learning. CRC Press.