Bayesian Regression

Neil D. Lawrence

Dedan Kimathi University, Nyeri, Kenya

Overdetermined System

  • With two unknowns and two observations: \[ \begin{aligned} y_1 = & mx_1 + c\\ y_2 = & mx_2 + c \end{aligned} \]
  • Additional observation leads to overdetermined system. \[ y_3 = mx_3 + c \]
  • This problem is solved through a noise model \(\epsilon\sim \mathscr{N}\left(0,\sigma^2\right)\) \[ \begin{aligned} y_1 = mx_1 + c + \epsilon_1\\ y_2 = mx_2 + c + \epsilon_2\\ y_3 = mx_3 + c + \epsilon_3 \end{aligned} \]
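
A minimal sketch of this setup (NumPy, with hand-picked values for \(m\), \(c\), the inputs and the noise level): the overdetermined system is fit by least squares, which is the maximum likelihood solution under the Gaussian noise model.

```python
import numpy as np

# Three noisy observations, two unknowns (m, c).
# The true values, inputs and noise level are illustrative choices.
np.random.seed(0)
m_true, c_true, sigma = 2.0, 1.0, 0.1
x = np.array([1.0, 2.0, 3.0])
y = m_true * x + c_true + np.random.normal(0.0, sigma, size=x.shape)

# Design matrix with a column of ones for the offset c.
Phi = np.column_stack([x, np.ones_like(x)])

# Least squares solution (maximum likelihood under the Gaussian noise model).
(m_est, c_est), *_ = np.linalg.lstsq(Phi, y, rcond=None)
print(m_est, c_est)
```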

Underdetermined System

  • What about two unknowns and one observation? \[y_1 = mx_1 + c\]

Can compute \(m\) given \(c\). \[m = \frac{y_1 - c}{x_1}\]

Underdetermined System

Solution

  • Place a probability distribution over \(c\).
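
A small sketch of this idea (illustrative numbers for the single observation and the prior variance \(\alpha_1\)): each sampled value of \(c\) implies a slope \(m = (y_1 - c)/x_1\), so the distribution over \(c\) induces a distribution over lines.

```python
import numpy as np

# One observation (x1, y1), two unknowns: place a Gaussian prior over c
# and read off the implied slope m = (y1 - c) / x1 for each sample.
# The numbers and prior variance alpha_1 are illustrative choices.
np.random.seed(0)
x1, y1 = 2.0, 3.0
alpha_1 = 1.0

c_samples = np.random.normal(0.0, np.sqrt(alpha_1), size=5)
m_samples = (y1 - c_samples) / x1
for c, m in zip(c_samples, m_samples):
    print(f"c = {c:5.2f}  implies  m = {m:5.2f}")
```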

A Philosophical Dispute: Probabilistic Treatment of Parameters?

  • An approach is not Bayesian because it uses Bayes’ rule
  • It is Bayesian because the parameters (e.g. \(m\) and \(c\)) are treated with probability distributions
  • This is at the heart of the Bayesian vs frequentist controversy
  • My view is that it’s a false dichotomy: the two are complementary approaches

Noise Models

  • We aren’t modeling the entire system.
  • The noise model captures the mismatch between the model and the data.
  • Gaussian model justified by appeal to central limit theorem.
    • Other models also possible (Student-\(t\) for heavy tails).
  • Maximum likelihood with Gaussian noise leads to least squares.

Different Types of Uncertainty

  • The first type of uncertainty we consider is aleatoric uncertainty.
  • The second type of uncertainty we consider is epistemic uncertainty.

Aleatoric Uncertainty

  • This is uncertainty we couldn’t resolve even if we wanted to, e.g. the result of a football match before it’s played.
  • Where a sheet of paper might land on the floor.

Epistemic Uncertainty

  • This is uncertainty we could in principle know the answer to. We just haven’t observed enough yet, e.g. the result of a football match after it’s played.
  • What colour socks your lecturer is wearing.

Prior Distribution

  • Bayesian inference requires a prior on the parameters.

  • The prior represents your belief, before you see the data, about the likely value of the parameters.

  • For linear regression, consider a Gaussian prior on the intercept:

    \[c \sim \mathscr{N}\left(0,\alpha_1\right)\]

Posterior Distribution

  • The posterior distribution is found by combining the prior with the likelihood.
  • The posterior distribution represents your belief, after you see the data, about the likely value of the parameters.

Posterior Distribution

  • The posterior is found through Bayes’ Rule \[ p(c|y) = \frac{p(y|c)p(c)}{p(y)} \] \[ \text{posterior} = \frac{\text{likelihood}\times \text{prior}}{\text{marginal likelihood}} \]

Bayes Update

Stages to Derivation of the Posterior

  • Multiply likelihood by prior
    • because both are “exponentiated quadratics,” the product is also an exponentiated quadratic, since \(\exp(a^2)\exp(b^2) = \exp(a^2 + b^2)\).
  • Complete the square to get the resulting density in the form of a Gaussian.
  • Recognise the mean and (co)variance of the Gaussian. This is the posterior.

Main Trick

\[ p(c) = \frac{1}{\sqrt{2\pi\alpha_1}} \exp\left(-\frac{1}{2\alpha_1}c^2\right) \]

Main Trick

\[\begin{aligned} p(\mathbf{ y}|\mathbf{ x}, c, m, \sigma^2)& = \\ \frac{1}{\left(2\pi\sigma^2\right)^{\frac{n}{2}}} & \exp\left(-\frac{1}{2\sigma^2}\sum_{i=1}^n(y_i - mx_i - c)^2\right) \end{aligned}\]

\[ p(c| \mathbf{ y}, \mathbf{ x}, m, \sigma^2) = \frac{p(\mathbf{ y}|\mathbf{ x}, c, m, \sigma^2)p(c)}{p(\mathbf{ y}|\mathbf{ x}, m, \sigma^2)} \]

\[ p(c| \mathbf{ y}, \mathbf{ x}, m, \sigma^2) = \frac{p(\mathbf{ y}|\mathbf{ x}, c, m, \sigma^2)p(c)}{\int p(\mathbf{ y}|\mathbf{ x}, c, m, \sigma^2)p(c) \text{d} c} \]

\[ p(c| \mathbf{ y}, \mathbf{ x}, m, \sigma^2) \propto p(\mathbf{ y}|\mathbf{ x}, c, m, \sigma^2)p(c) \]

\[\begin{aligned} \log p(c | \mathbf{ y}, \mathbf{ x}, m, \sigma^2) = & -\frac{1}{2\sigma^2} \sum_{i=1}^n(y_i - c - mx_i)^2 \\ & -\frac{1}{2\alpha_1} c^2 + \text{const} \end{aligned}\]

\[ p(c| \mathbf{ y}, \mathbf{ x}, m, \sigma^2) \propto p(\mathbf{ y}|\mathbf{ x}, c, m, \sigma^2)p(c) \] \[\begin{aligned} \log p(c | \mathbf{ y}, \mathbf{ x}, m, \sigma^2) =& -\frac{1}{2\sigma^2}\sum_{i=1}^n(y_i-mx_i)^2 \\ & -\left(\frac{n}{2\sigma^2} + \frac{1}{2\alpha_1}\right)c^2\\ & + c\frac{\sum_{i=1}^n(y_i-mx_i)}{\sigma^2} +\text{const} \end{aligned}\]

Complete the Square

\[ \log p(c | \mathbf{ y}, \mathbf{ x}, m, \sigma^2) = -\frac{1}{2\tau^2}(c - \mu)^2 +\text{const} \] \[ \tau^2 = \left(n\sigma^{-2} +\alpha_1^{-1}\right)^{-1} \] \[\mu = \frac{\tau^2}{\sigma^2} \sum_{i=1}^n(y_i-mx_i)\]
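
A quick numerical sketch of these formulas (the data, \(m\), \(\sigma^2\) and \(\alpha_1\) below are illustrative choices, not values from the lecture):

```python
import numpy as np

# Closed-form posterior over c (m and sigma^2 held fixed):
#   tau^2 = (n / sigma^2 + 1 / alpha_1)^(-1)
#   mu    = tau^2 / sigma^2 * sum_i (y_i - m x_i)
np.random.seed(0)
x = np.linspace(0.0, 1.0, 10)
y = 2.0 * x + 1.0 + np.random.normal(0.0, 0.1, size=x.shape)
m, sigma2, alpha_1 = 2.0, 0.01, 1.0

n = len(y)
tau2 = 1.0 / (n / sigma2 + 1.0 / alpha_1)
mu = (tau2 / sigma2) * np.sum(y - m * x)
print(f"c | data ~ N({mu:.3f}, {tau2:.5f})")
```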

The Joint Density

  • Really want to know the joint posterior density over the parameters \(c\) and \(m\).
  • Could now consider \(m\), but it’s easier to consider the multivariate case.

Two Dimensional Gaussian

  • Consider height, \(h\), in metres and weight, \(w\), in kilograms.
  • Could sample height from a distribution: \[ h \sim \mathscr{N}\left(1.7,0.0225\right). \]
  • And similarly weight: \[ w \sim \mathscr{N}\left(75,36\right). \]

Height and Weight Models

Independence Assumption

  • We assume height and weight are independent.

    \[ p(w, h) = p(w)p(h). \]
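
A small sampling sketch under this independence assumption (using the means and variances quoted above):

```python
import numpy as np

# Sample height (metres) and weight (kilograms) independently.
# The quoted values are variances, so take square roots for the std.
np.random.seed(0)
h = np.random.normal(1.7, np.sqrt(0.0225), size=5)   # standard deviation 0.15 m
w = np.random.normal(75.0, np.sqrt(36.0), size=5)    # standard deviation 6 kg
print(np.column_stack([h, w]))
```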

Sampling Two Dimensional Variables

Body Mass Index

  • In reality they are dependent: body mass index \(= \frac{w}{h^2}\).
  • To deal with this dependence we introduce correlated multivariate Gaussians.

Sampling Two Dimensional Variables

Independent Gaussians

\[ p(w, h) = p(w)p(h) \]

Independent Gaussians

\[\begin{aligned} p(w, h) = \frac{1}{\sqrt{2\pi \sigma_1^2}\sqrt{2\pi\sigma_2^2}} \exp\left(-&\frac{1}{2}\left(\frac{(w-\mu_1)^2}{\sigma_1^2} \right. \\ \left. +& \frac{(h-\mu_2)^2}{\sigma_2^2}\right)\right) \end{aligned}\]

Independent Gaussians

\[ p(w, h) = \frac{1}{\sqrt{2\pi\sigma_1^2\,2\pi\sigma_2^2}} \exp\left(-\frac{1}{2}\left(\begin{bmatrix}w \\ h\end{bmatrix} - \begin{bmatrix}\mu_1 \\ \mu_2\end{bmatrix}\right)^\top\begin{bmatrix}\sigma_1^2& 0\\0&\sigma_2^2\end{bmatrix}^{-1}\left(\begin{bmatrix}w \\ h\end{bmatrix} - \begin{bmatrix}\mu_1 \\ \mu_2\end{bmatrix}\right)\right) \]

Independent Gaussians

\[ p(\mathbf{ y}) = \frac{1}{\det{2\pi \mathbf{D}}^{\frac{1}{2}}} \exp\left(-\frac{1}{2}(\mathbf{ y}- \boldsymbol{ \mu})^\top\mathbf{D}^{-1}(\mathbf{ y}- \boldsymbol{ \mu})\right) \]

Correlated Gaussian

\[ p(\mathbf{ y}) = \frac{1}{\det{2\pi\mathbf{D}}^{\frac{1}{2}}} \exp\left(-\frac{1}{2}(\mathbf{ y}- \boldsymbol{ \mu})^\top\mathbf{D}^{-1}(\mathbf{ y}- \boldsymbol{ \mu})\right) \]

Correlated Gaussian

\[ p(\mathbf{ y}) = \frac{1}{\det{2\pi\mathbf{D}}^{\frac{1}{2}}} \exp\left(-\frac{1}{2}(\mathbf{R}^\top\mathbf{ y}- \mathbf{R}^\top\boldsymbol{ \mu})^\top\mathbf{D}^{-1}(\mathbf{R}^\top\mathbf{ y}- \mathbf{R}^\top\boldsymbol{ \mu})\right) \]

Correlated Gaussian

\[ p(\mathbf{ y}) = \frac{1}{\det{2\pi\mathbf{D}}^{\frac{1}{2}}} \exp\left(-\frac{1}{2}(\mathbf{ y}- \boldsymbol{ \mu})^\top\mathbf{R}\mathbf{D}^{-1}\mathbf{R}^\top(\mathbf{ y}- \boldsymbol{ \mu})\right) \]

\[ \mathbf{C}^{-1} = \mathbf{R}\mathbf{D}^{-1} \mathbf{R}^\top \]

Correlated Gaussian

\[ p(\mathbf{ y}) = \frac{1}{\det{2\pi\mathbf{C}}^{\frac{1}{2}}} \exp\left(-\frac{1}{2}(\mathbf{ y}- \boldsymbol{ \mu})^\top\mathbf{C}^{-1} (\mathbf{ y}- \boldsymbol{ \mu})\right) \] \[ \mathbf{C}= \mathbf{R}\mathbf{D} \mathbf{R}^\top \]
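
A sketch of this construction (the rotation angle and the diagonal variances are illustrative choices): build \(\mathbf{C} = \mathbf{R}\mathbf{D}\mathbf{R}^\top\) and compare it with the empirical covariance of samples.

```python
import numpy as np

# Build a correlated covariance C = R D R^T from a rotation R and a
# diagonal D, then sample from the resulting Gaussian.
np.random.seed(0)
theta = np.pi / 4.0
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
D = np.diag([1.0, 0.1])
C = R @ D @ R.T

samples = np.random.multivariate_normal(np.zeros(2), C, size=1000)
print(np.cov(samples.T))   # empirical covariance approaches C
print(C)
```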

The Prior Density

\[ \mathbf{ w}\sim \mathscr{N}\left(\mathbf{0},\alpha \mathbf{I}\right) \]

\[ w_i \sim \mathscr{N}\left(0,\alpha\right) \]

Revisit Olympics Data

  • Use the Bayesian approach on the Olympics data with polynomials.

  • Choose a prior \(\mathbf{ w}\sim \mathscr{N}\left(\mathbf{0},\alpha \mathbf{I}\right)\) with \(\alpha = 1\).

  • Choose noise variance \(\sigma^2 = 0.01\)

Generating from the Model

\[ \mathbf{ y}= \boldsymbol{ \Phi}\mathbf{ w}+ \boldsymbol{ \epsilon} \]

  • Sample \[ \mathbf{ w}\sim \mathscr{N}\left(\mathbf{0},\alpha\mathbf{I}\right) \] \[ \boldsymbol{ \epsilon}\sim \mathscr{N}\left(\mathbf{0},\sigma^2\mathbf{I}\right) \] with \(\alpha=1\) and \(\sigma^2 = 0.01\).
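
A minimal sketch of generating from the model (the inputs and the polynomial basis degree are illustrative choices; \(\alpha\) and \(\sigma^2\) follow the values above):

```python
import numpy as np

# Generate from y = Phi w + epsilon with a polynomial basis.
np.random.seed(0)
alpha, sigma2 = 1.0, 0.01
x = np.linspace(-1.0, 1.0, 20)
Phi = np.vander(x, 5, increasing=True)   # columns 1, x, x^2, x^3, x^4

w = np.random.normal(0.0, np.sqrt(alpha), size=Phi.shape[1])
epsilon = np.random.normal(0.0, np.sqrt(sigma2), size=x.shape)
y = Phi @ w + epsilon
print(y[:5])
```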

Sampling from the Prior

Scaling Gaussian-distributed Variables

\[ f= \phi w \]

\[ w\sim \mathscr{N}\left(\mu_w,c_w\right) \]

\[ \phi w\sim \mathscr{N}\left(\phi\mu_w,\phi^2c_w\right) \]
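
A quick Monte Carlo check of this scaling property (the numbers are illustrative):

```python
import numpy as np

# If w ~ N(mu_w, c_w) then phi * w ~ N(phi * mu_w, phi^2 * c_w).
np.random.seed(0)
phi, mu_w, c_w = 3.0, 0.5, 2.0

w = np.random.normal(mu_w, np.sqrt(c_w), size=100000)
f = phi * w
print(f.mean(), phi * mu_w)        # means agree
print(f.var(), phi**2 * c_w)       # variances agree
```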

Sampling from the Prior

\[ \mathbf{ f}= \boldsymbol{ \Phi}\mathbf{ w}. \]

Computing the Posterior

\[ p(\mathbf{ w}| \mathbf{ y}, \mathbf{ x}, \sigma^2) = \mathscr{N}\left(\mathbf{ w}|\boldsymbol{ \mu}_w,\mathbf{C}_w\right) \] with \[ \mathbf{C}_w= \left(\sigma^{-2}\boldsymbol{ \Phi}^\top \boldsymbol{ \Phi}+ \alpha^{-1}\mathbf{I}\right)^{-1} \] and \[ \boldsymbol{ \mu}_w= \mathbf{C}_w\sigma^{-2}\boldsymbol{ \Phi}^\top \mathbf{ y} \]
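
A sketch of computing this posterior in code (synthetic data and an illustrative polynomial basis; \(\alpha\) and \(\sigma^2\) as above):

```python
import numpy as np

# Posterior over w for the basis-function model:
#   C_w  = (Phi^T Phi / sigma^2 + I / alpha)^(-1)
#   mu_w = C_w Phi^T y / sigma^2
np.random.seed(0)
alpha, sigma2 = 1.0, 0.01
x = np.linspace(-1.0, 1.0, 30)
Phi = np.vander(x, 5, increasing=True)
w_true = np.random.normal(0.0, np.sqrt(alpha), size=Phi.shape[1])
y = Phi @ w_true + np.random.normal(0.0, np.sqrt(sigma2), size=x.shape)

C_w = np.linalg.inv(Phi.T @ Phi / sigma2 + np.eye(Phi.shape[1]) / alpha)
mu_w = C_w @ Phi.T @ y / sigma2
print(mu_w)     # posterior mean, close to w_true
print(w_true)
```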

Multivariate Bayesian Inference

Sampling from the Posterior

  • Now check samples by extracting \(\mathbf{ w}\) from the posterior.

Sampling from the Posterior

  • Now for \(\mathbf{ y}= \boldsymbol{ \Phi}\mathbf{ w}+ \boldsymbol{ \epsilon}\) need \[ \mathbf{ w}\sim \mathscr{N}\left(\boldsymbol{ \mu}_w,\mathbf{C}_w\right) \] with \(\mathbf{C}_w = \left[\sigma^{-2}\boldsymbol{ \Phi}^\top \boldsymbol{ \Phi}+ \alpha^{-1}\mathbf{I}\right]^{-1}\) and \(\boldsymbol{ \mu}_w =\mathbf{C}_w \sigma^{-2} \boldsymbol{ \Phi}^\top \mathbf{ y}\) \[ \boldsymbol{ \epsilon}\sim \mathscr{N}\left(\mathbf{0},\sigma^2\mathbf{I}\right) \] with \(\alpha=1\) and \(\sigma^2 = 0.01\).
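
A sketch of drawing function samples from the posterior (the synthetic data and basis are illustrative choices):

```python
import numpy as np

# Sample w ~ N(mu_w, C_w) and form f = Phi w for each sample.
np.random.seed(0)
alpha, sigma2 = 1.0, 0.01
x = np.linspace(-1.0, 1.0, 30)
Phi = np.vander(x, 5, increasing=True)
y = Phi @ np.random.normal(0.0, 1.0, size=Phi.shape[1]) \
    + np.random.normal(0.0, np.sqrt(sigma2), size=x.shape)

C_w = np.linalg.inv(Phi.T @ Phi / sigma2 + np.eye(Phi.shape[1]) / alpha)
mu_w = C_w @ Phi.T @ y / sigma2

w_samples = np.random.multivariate_normal(mu_w, C_w, size=5)
f_samples = w_samples @ Phi.T    # each row is one sampled function on x
print(f_samples.shape)
```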

Polynomial Function Samples

Properties of Gaussian Variables

Sum of Gaussian-distributed Variables

Sampling from Gaussians and Summing Up

Matrix Multiplication of Gaussian Variables

\[ f_i = \boldsymbol{ \phi}_i^\top \mathbf{ w} \]

\[ f_i = \sum_{j=1}^K w_j \phi_{i, j} \]

Expectation under Gaussian

\[ \left\langle\mathbf{ f}\right\rangle_{\mathscr{N}\left(\mathbf{ w}|\boldsymbol{ \mu},\mathbf{C}\right)} = \int \mathbf{ f} \mathscr{N}\left(\mathbf{ w}|\boldsymbol{ \mu},\mathbf{C}\right) \text{d}\mathbf{ w}= \int \boldsymbol{ \Phi}\mathbf{ w} \mathscr{N}\left(\mathbf{ w}|\boldsymbol{ \mu},\mathbf{C}\right) \text{d}\mathbf{ w}= \boldsymbol{ \Phi}\int \mathbf{ w} \mathscr{N}\left(\mathbf{ w}|\boldsymbol{ \mu},\mathbf{C}\right) \text{d}\mathbf{ w}= \boldsymbol{ \Phi}\boldsymbol{ \mu} \]

\[ \left\langle\mathbf{ f}\right\rangle_{\mathscr{N}\left(\mathbf{ w}|\mathbf{0},\alpha\mathbf{I}\right)} = \mathbf{0} \]

Expectation under Gaussian

\[ \text{cov}\left(\mathbf{ f}\right)_{\mathscr{N}\left(\mathbf{ w}|\boldsymbol{ \mu},\mathbf{C}\right)} = \left\langle\mathbf{ f}\mathbf{ f}^\top\right\rangle_{\mathscr{N}\left(\mathbf{ w}|\boldsymbol{ \mu},\mathbf{C}\right)} - \left\langle\mathbf{ f}\right\rangle_{\mathscr{N}\left(\mathbf{ w}|\boldsymbol{ \mu},\mathbf{C}\right)}\left\langle\mathbf{ f}\right\rangle_{\mathscr{N}\left(\mathbf{ w}|\boldsymbol{ \mu},\mathbf{C}\right)}^\top \]

\[ \text{cov}\left(\mathbf{ f}\right)_{\mathscr{N}\left(\mathbf{ w}|\boldsymbol{ \mu},\mathbf{C}\right)} = \left\langle\mathbf{ f}\mathbf{ f}^\top\right\rangle_{\mathscr{N}\left(\mathbf{ w}|\boldsymbol{ \mu},\mathbf{C}\right)} - \boldsymbol{ \Phi}\boldsymbol{ \mu}\boldsymbol{ \mu}^\top \boldsymbol{ \Phi}^\top \]

\[ \text{cov}\left(\mathbf{ f}\right)_{\mathscr{N}\left(\mathbf{ w}|\boldsymbol{ \mu},\mathbf{C}\right)} = \left\langle\boldsymbol{ \Phi}\mathbf{ w}\mathbf{ w}^\top \boldsymbol{ \Phi}^\top\right\rangle_{\mathscr{N}\left(\mathbf{ w}|\boldsymbol{ \mu},\mathbf{C}\right)} - \boldsymbol{ \Phi}\boldsymbol{ \mu}\boldsymbol{ \mu}^\top \boldsymbol{ \Phi}^\top \] \[ \text{cov}\left(\mathbf{ f}\right)_{\mathscr{N}\left(\mathbf{ w}|\boldsymbol{ \mu},\mathbf{C}\right)} = \boldsymbol{ \Phi}\left\langle\mathbf{ w}\mathbf{ w}^\top\right\rangle_{\mathscr{N}\left(\mathbf{ w}|\boldsymbol{ \mu},\mathbf{C}\right)} \boldsymbol{ \Phi}^\top - \boldsymbol{ \Phi}\boldsymbol{ \mu}\boldsymbol{ \mu}^\top\boldsymbol{ \Phi}^\top \] which depends on the second moment of the Gaussian, \[ \left\langle\mathbf{ w}\mathbf{ w}^\top\right\rangle_{\mathscr{N}\left(\mathbf{ w}|\boldsymbol{ \mu},\mathbf{C}\right)} = \mathbf{C}+ \boldsymbol{ \mu}\boldsymbol{ \mu}^\top \]

\[ \text{cov}\left(\mathbf{ f}\right)_{\mathscr{N}\left(\mathbf{ w}|\boldsymbol{ \mu},\mathbf{C}\right)} = \boldsymbol{ \Phi}\mathbf{C}\boldsymbol{ \Phi}^\top \] so in the case of the prior distribution, where we have \(\mathbf{C}= \alpha \mathbf{I}\) we can write \[ \text{cov}\left(\mathbf{ f}\right)_{\mathscr{N}\left(\mathbf{ w}|\mathbf{0},\alpha \mathbf{I}\right)} = \alpha \boldsymbol{ \Phi}\boldsymbol{ \Phi}^\top \]
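
A Monte Carlo check of this result for the prior case (the basis and \(\alpha\) are illustrative choices):

```python
import numpy as np

# Check that cov(f) approaches alpha Phi Phi^T when w ~ N(0, alpha I).
np.random.seed(0)
alpha = 1.0
x = np.linspace(-1.0, 1.0, 4)
Phi = np.vander(x, 3, increasing=True)

W = np.random.normal(0.0, np.sqrt(alpha), size=(200000, Phi.shape[1]))
F = W @ Phi.T                  # each row is one sampled f = Phi w
print(np.cov(F.T))             # approaches alpha * Phi @ Phi.T
print(alpha * Phi @ Phi.T)
```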

\[ \mathbf{ y}= \mathbf{ f}+ \boldsymbol{ \epsilon} \]

\[ \mathbf{ f}\sim \mathscr{N}\left(\mathbf{0},\alpha\boldsymbol{ \Phi}\boldsymbol{ \Phi}^\top\right) \]

\[ p(\mathbf{ y}) = \mathscr{N}\left(\mathbf{ y}|\mathbf{0},\alpha \boldsymbol{ \Phi}\boldsymbol{ \Phi}^\top + \sigma^2\mathbf{I}\right) \]

Marginal Likelihood

\[ p(\mathbf{ y}|\mathbf{X}, \sigma^2, \alpha) = \frac{1}{(2\pi)^\frac{n}{2}\left|\mathbf{K}\right|^\frac{1}{2}} \exp\left(-\frac{1}{2} \mathbf{ y}^\top \mathbf{K}^{-1} \mathbf{ y}\right) \] \[ \mathbf{K}= \alpha \boldsymbol{ \Phi}\boldsymbol{ \Phi}^\top + \sigma^2 \mathbf{I} \]
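
A sketch of evaluating this marginal likelihood (the synthetic data, basis, \(\alpha\) and \(\sigma^2\) are illustrative choices):

```python
import numpy as np

# Log marginal likelihood with K = alpha Phi Phi^T + sigma^2 I.
np.random.seed(0)
alpha, sigma2 = 1.0, 0.01
x = np.linspace(-1.0, 1.0, 20)
Phi = np.vander(x, 4, increasing=True)
y = Phi @ np.random.normal(0.0, 1.0, size=Phi.shape[1]) \
    + np.random.normal(0.0, np.sqrt(sigma2), size=x.shape)

K = alpha * Phi @ Phi.T + sigma2 * np.eye(len(x))
_, logdet = np.linalg.slogdet(K)
n = len(y)
log_ml = -0.5 * (n * np.log(2.0 * np.pi) + logdet + y @ np.linalg.solve(K, y))
print(log_ml)
```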

Computing the Mean and Error Bars of the Functions

\[ \mathbf{ f}= \boldsymbol{ \Phi}\mathbf{ w} \]

Computing the Expected Output

  • Given the posterior for the parameters, how can we compute the expected output at a given location?
  • Output of model at location \(\mathbf{ x}_i\) is given by \[ f(\mathbf{ x}_i; \mathbf{ w}) = \boldsymbol{ \phi}_i^\top \mathbf{ w} \]

Computing the Expected Output

  • We want the expected output under the posterior density, \(p(\mathbf{ w}|\mathbf{ y}, \mathbf{X}, \sigma^2, \alpha)\).
  • Mean of mapping function will be given by \[ \begin{aligned} \left\langle f(\mathbf{ x}_i; \mathbf{ w})\right\rangle_{p(\mathbf{ w}|\mathbf{ y}, \mathbf{X}, \sigma^2, \alpha)} &= \boldsymbol{ \phi}_i^\top \left\langle\mathbf{ w}\right\rangle_{p(\mathbf{ w}|\mathbf{ y}, \mathbf{X}, \sigma^2, \alpha)} \\ & = \boldsymbol{ \phi}_i^\top \boldsymbol{ \mu}_w \end{aligned} \]

Computing Error Bars

\[ \text{cov}\left(\mathbf{ f}\right)_{\mathscr{N}\left(\mathbf{ w}|\boldsymbol{ \mu}_w,\mathbf{C}_w\right)} = \boldsymbol{ \Phi}\mathbf{C}_w \boldsymbol{ \Phi}^\top \]

Variance of Expected Output

  • Variance of model at location \(\mathbf{ x}_i\) is given by \[\begin{aligned}\text{var}(f(\mathbf{ x}_i; \mathbf{ w})) &= \left\langle(f(\mathbf{ x}_i; \mathbf{ w}))^2\right\rangle - \left\langle f(\mathbf{ x}_i; \mathbf{ w})\right\rangle^2 \\&= \boldsymbol{ \phi}_i^\top\left\langle\mathbf{ w}\mathbf{ w}^\top\right\rangle \boldsymbol{ \phi}_i - \boldsymbol{ \phi}_i^\top \left\langle\mathbf{ w}\right\rangle\left\langle\mathbf{ w}\right\rangle^\top \boldsymbol{ \phi}_i \\&= \boldsymbol{ \phi}_i^\top \mathbf{C}_w\boldsymbol{ \phi}_i \end{aligned}\] where all these expectations are taken under the posterior density, \(p(\mathbf{ w}|\mathbf{ y}, \mathbf{X}, \sigma^2, \alpha)\).
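
A sketch of computing the mean and error bars at test inputs (synthetic data and an illustrative basis):

```python
import numpy as np

# Posterior mean and error bars of the function at test inputs:
#   mean_i = phi_i^T mu_w,   var_i = phi_i^T C_w phi_i.
np.random.seed(0)
alpha, sigma2 = 1.0, 0.01
x = np.linspace(-1.0, 1.0, 30)
Phi = np.vander(x, 5, increasing=True)
y = Phi @ np.random.normal(0.0, 1.0, size=Phi.shape[1]) \
    + np.random.normal(0.0, np.sqrt(sigma2), size=x.shape)

C_w = np.linalg.inv(Phi.T @ Phi / sigma2 + np.eye(Phi.shape[1]) / alpha)
mu_w = C_w @ Phi.T @ y / sigma2

x_test = np.linspace(-1.2, 1.2, 7)
Phi_test = np.vander(x_test, 5, increasing=True)
f_mean = Phi_test @ mu_w
f_var = np.sum((Phi_test @ C_w) * Phi_test, axis=1)   # diag(Phi C_w Phi^T)
print(np.column_stack([x_test, f_mean, np.sqrt(f_var)]))
```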

Olympic Marathon Data

  • Gold medal times for Olympic Marathon since 1896.
  • Marathons before 1924 didn’t have a standardized distance.
  • Present results using pace per km.
  • In 1904 the Marathon was badly organized, leading to very slow times.
Image from Wikimedia Commons

Olympic Marathon Data

Olympic Data with Bayesian Polynomials

Hold Out Validation

5-fold Cross Validation

Model Fit

  • Marginal likelihood doesn’t always increase as model order increases.
  • The Bayesian model always has two parameters, \(\alpha\) and \(\sigma^2\), regardless of how many basis functions (and here we didn’t even fit them).
  • The maximum likelihood model overfits as the number of parameters increases.
  • Revisit maximum likelihood solution with validation set.
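
A sketch of the first point: evaluate the log marginal likelihood across polynomial orders on a synthetic data set (the data, \(\alpha\) and \(\sigma^2\) below are illustrative choices):

```python
import numpy as np

# Compare the log marginal likelihood across polynomial orders; the
# value need not keep rising with model order.
np.random.seed(0)
alpha, sigma2 = 1.0, 0.01
x = np.linspace(-1.0, 1.0, 25)
y = np.sin(3.0 * x) + np.random.normal(0.0, np.sqrt(sigma2), size=x.shape)

for degree in range(1, 8):
    Phi = np.vander(x, degree + 1, increasing=True)
    K = alpha * Phi @ Phi.T + sigma2 * np.eye(len(x))
    _, logdet = np.linalg.slogdet(K)
    log_ml = -0.5 * (len(y) * np.log(2.0 * np.pi) + logdet
                     + y @ np.linalg.solve(K, y))
    print(degree, round(log_ml, 2))
```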

Bayesian Interpretation of Regularisation

Bootstrap Prediction and Bayesian Misspecified Models: https://www.jstor.org/stable/3318894#metadata_info_tab_contents

Edwin Fong and Chris Holmes: On the Marginal Likelihood and Cross Validation https://arxiv.org/abs/1905.08737

Regularised Mean

  • The validation fit here is based on the mean solution for \(\mathbf{ w}\) only.
  • For Bayesian solution \[ \boldsymbol{ \mu}_w = \left[\sigma^{-2}\boldsymbol{ \Phi}^\top\boldsymbol{ \Phi}+ \alpha^{-1}\mathbf{I}\right]^{-1} \sigma^{-2} \boldsymbol{ \Phi}^\top \mathbf{ y} \] instead of \[ \mathbf{ w}^* = \left[\boldsymbol{ \Phi}^\top\boldsymbol{ \Phi}\right]^{-1} \boldsymbol{ \Phi}^\top \mathbf{ y} \]

Regularised Mean

  • The two are equivalent when \(\alpha \rightarrow \infty\).
  • That is equivalent to a prior for \(\mathbf{ w}\) with infinite variance.
  • In other cases the \(\alpha^{-1}\mathbf{I}\) term regularizes the system (keeps the parameters smaller).
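
A sketch comparing the two solutions (synthetic data and illustrative settings):

```python
import numpy as np

# Regularised posterior mean vs unregularised maximum likelihood solution.
np.random.seed(0)
alpha, sigma2 = 1.0, 0.01
x = np.linspace(-1.0, 1.0, 15)
Phi = np.vander(x, 6, increasing=True)
y = np.sin(3.0 * x) + np.random.normal(0.0, np.sqrt(sigma2), size=x.shape)

mu_w = np.linalg.solve(Phi.T @ Phi / sigma2 + np.eye(Phi.shape[1]) / alpha,
                       Phi.T @ y / sigma2)
w_ml = np.linalg.solve(Phi.T @ Phi, Phi.T @ y)
print("regularised mean:  ", np.round(mu_w, 2))
print("maximum likelihood:", np.round(w_ml, 2))
```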

Further Reading

  • Section 1.2.3 (pg 21–24) of Bishop (2006)

  • Sections 3.1-3.4 (pg 95-117) of Rogers and Girolami (2011)

  • Section 1.2.6 (start from just past eq 1.64 pg 30-32) of Bishop (2006)

  • Multivariate Gaussians: Section 2.3 up to top of pg 85 of Bishop (2006)

  • Section 3.3 up to 159 (pg 152–159) of Bishop (2006)

  • Section 3.7–3.8 (pg 122–133) of Rogers and Girolami (2011)

  • Section 3.4 (pg 161–165) of Bishop (2006)

Thanks!

References

Bishop, C.M., 2006. Pattern recognition and machine learning. Springer.
Rogers, S., Girolami, M., 2011. A first course in machine learning. CRC Press.