Bayesian Regression

Neil D. Lawrence

Dedan Kimathi University, Nyeri, Kenya

Overdetermined System

  • With two unknowns and two observations: \[ \begin{aligned} y_1 = & mx_1 + c\\ y_2 = & mx_2 + c \end{aligned} \]
  • Additional observation leads to overdetermined system. \[ y_3 = mx_3 + c \]
  • This problem is solved through a noise model \(\epsilon\sim \mathscr{N}\left(0,\sigma^2\right)\) \[ \begin{aligned} y_1 = mx_1 + c + \epsilon_1\\ y_2 = mx_2 + c + \epsilon_2\\ y_3 = mx_3 + c + \epsilon_3 \end{aligned} \]
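
A minimal sketch of this setup (NumPy, with hand-picked values for \(m\), \(c\), the inputs and the noise level): the overdetermined system is fit by least squares, which is the maximum likelihood solution under the Gaussian noise model.

```python
import numpy as np

# Three noisy observations, two unknowns (m, c).
# The true values, inputs and noise level are illustrative choices.
np.random.seed(0)
m_true, c_true, sigma = 2.0, 1.0, 0.1
x = np.array([1.0, 2.0, 3.0])
y = m_true * x + c_true + np.random.normal(0.0, sigma, size=x.shape)

# Design matrix with a column of ones for the offset c.
Phi = np.column_stack([x, np.ones_like(x)])

# Least squares solution (maximum likelihood under the Gaussian noise model).
(m_est, c_est), *_ = np.linalg.lstsq(Phi, y, rcond=None)
print(m_est, c_est)
```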

Underdetermined System

  • What about two unknowns and one observation? \[y_1 = mx_1 + c\]

Can compute \(m\) given \(c\). \[m = \frac{y_1 - c}{x_1}\]

Underdetermined System

Solution

  • Place a probability distribution over \(c\).
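
A small sketch of this idea (illustrative numbers for the single observation and the prior variance \(\alpha_1\)): each sampled value of \(c\) implies a slope \(m = (y_1 - c)/x_1\), so the distribution over \(c\) induces a distribution over lines.

```python
import numpy as np

# One observation (x1, y1), two unknowns: place a Gaussian prior over c
# and read off the implied slope m = (y1 - c) / x1 for each sample.
# The numbers and prior variance alpha_1 are illustrative choices.
np.random.seed(0)
x1, y1 = 2.0, 3.0
alpha_1 = 1.0

c_samples = np.random.normal(0.0, np.sqrt(alpha_1), size=5)
m_samples = (y1 - c_samples) / x1
for c, m in zip(c_samples, m_samples):
    print(f"c = {c:5.2f}  implies  m = {m:5.2f}")
```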

A Philosophical Dispute: Probabilistic Treatment of Parameters?

  • An approach is not Bayesian because it uses Bayes’ rule
  • It is Bayesian because the parameters (e.g. \(m\) and \(c\)) are treated with probability distributions
  • This is at the heart of the Bayesian vs frequentist controversy
  • My view is that it’s a false dichotomy: the two are complementary approaches

Noise Models

  • We aren’t modeling the entire system.
  • The noise model captures the mismatch between the model and the data.
  • Gaussian model justified by appeal to central limit theorem.
    • Other models also possible (Student-\(t\) for heavy tails).
  • Maximum likelihood with Gaussian noise leads to least squares.

Different Types of Uncertainty

  • The first type of uncertainty we consider is aleatoric uncertainty.
  • The second type of uncertainty we consider is epistemic uncertainty.

Aleatoric Uncertainty

  • This is uncertainty we couldn’t resolve even if we wanted to, e.g. the result of a football match before it’s played.
  • Where a sheet of paper might land on the floor.

Epistemic Uncertainty

  • This is uncertainty we could in principle know the answer to. We just haven’t observed enough yet, e.g. the result of a football match after it’s played.
  • What colour socks your lecturer is wearing.

Prior Distribution

  • Bayesian inference requires a prior on the parameters.

  • The prior represents your belief, before you see the data, about the likely value of the parameters.

  • For linear regression, consider a Gaussian prior on the intercept:

    \[c \sim \mathscr{N}\left(0,\alpha_1\right)\]

Posterior Distribution

  • The posterior distribution is found by combining the prior with the likelihood.
  • The posterior distribution represents your belief, after you see the data, about the likely value of the parameters.

Posterior Distribution

  • The posterior is found through Bayes’ Rule \[ p(c|y) = \frac{p(y|c)p(c)}{p(y)} \] \[ \text{posterior} = \frac{\text{likelihood}\times \text{prior}}{\text{marginal likelihood}} \]

Bayes Update

Stages to Derivation of the Posterior

  • Multiply likelihood by prior
    • because both are “exponentiated quadratics,” the product is also an exponentiated quadratic, since \(\exp(a^2)\exp(b^2) = \exp(a^2 + b^2)\).
  • Complete the square to get the resulting density in the form of a Gaussian.
  • Recognise the mean and (co)variance of the Gaussian. This is the posterior.

Main Trick

\[ p(c) = \frac{1}{\sqrt{2\pi\alpha_1}} \exp\left(-\frac{1}{2\alpha_1}c^2\right) \]

Main Trick

\[\begin{aligned} p(\mathbf{ y}|\mathbf{ x}, c, m, \sigma^2)& = \\ \frac{1}{\left(2\pi\sigma^2\right)^{\frac{n}{2}}} & \exp\left(-\frac{1}{2\sigma^2}\sum_{i=1}^n(y_i - mx_i - c)^2\right) \end{aligned}\]

\[ p(c| \mathbf{ y}, \mathbf{ x}, m, \sigma^2) = \frac{p(\mathbf{ y}|\mathbf{ x}, c, m, \sigma^2)p(c)}{p(\mathbf{ y}|\mathbf{ x}, m, \sigma^2)} \]

\[ p(c| \mathbf{ y}, \mathbf{ x}, m, \sigma^2) = \frac{p(\mathbf{ y}|\mathbf{ x}, c, m, \sigma^2)p(c)}{\int p(\mathbf{ y}|\mathbf{ x}, c, m, \sigma^2)p(c) \text{d} c} \]

\[ p(c| \mathbf{ y}, \mathbf{ x}, m, \sigma^2) \propto p(\mathbf{ y}|\mathbf{ x}, c, m, \sigma^2)p(c) \]

\[\begin{aligned} \log p(c | \mathbf{ y}, \mathbf{ x}, m, \sigma^2) = & -\frac{1}{2\sigma^2} \sum_{i=1}^n(y_i - c - mx_i)^2 \\ & -\frac{1}{2\alpha_1} c^2 + \text{const} \end{aligned}\]

\[ p(c| \mathbf{ y}, \mathbf{ x}, m, \sigma^2) \propto p(\mathbf{ y}|\mathbf{ x}, c, m, \sigma^2)p(c) \] \[\begin{aligned} \log p(c | \mathbf{ y}, \mathbf{ x}, m, \sigma^2) =& -\frac{1}{2\sigma^2}\sum_{i=1}^n(y_i-mx_i)^2 \\ & -\left(\frac{n}{2\sigma^2} + \frac{1}{2\alpha_1}\right)c^2\\ & + c\frac{\sum_{i=1}^n(y_i-mx_i)}{\sigma^2} +\text{const} \end{aligned}\]

Complete the Square

\[ \log p(c | \mathbf{ y}, \mathbf{ x}, m, \sigma^2) = -\frac{1}{2\tau^2}(c - \mu)^2 +\text{const} \] \[ \tau^2 = \left(n\sigma^{-2} +\alpha_1^{-1}\right)^{-1} \] \[\mu = \frac{\tau^2}{\sigma^2} \sum_{i=1}^n(y_i-mx_i)\]
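
A quick numerical sketch of these formulas (the data, \(m\), \(\sigma^2\) and \(\alpha_1\) below are illustrative choices, not values from the lecture):

```python
import numpy as np

# Closed-form posterior over c (m and sigma^2 held fixed):
#   tau^2 = (n / sigma^2 + 1 / alpha_1)^(-1)
#   mu    = tau^2 / sigma^2 * sum_i (y_i - m x_i)
np.random.seed(0)
x = np.linspace(0.0, 1.0, 10)
y = 2.0 * x + 1.0 + np.random.normal(0.0, 0.1, size=x.shape)
m, sigma2, alpha_1 = 2.0, 0.01, 1.0

n = len(y)
tau2 = 1.0 / (n / sigma2 + 1.0 / alpha_1)
mu = (tau2 / sigma2) * np.sum(y - m * x)
print(f"c | data ~ N({mu:.3f}, {tau2:.5f})")
```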

The Joint Density

  • Really want to know the joint posterior density over the parameters \(c\) and \(m\).
  • Could now consider \(m\), but it’s easier to consider the multivariate case.

Two Dimensional Gaussian

  • Consider height, \(h\), in metres and weight, \(w\), in kilograms.
  • Could sample height from a distribution: \[ h \sim \mathscr{N}\left(1.7,0.0225\right). \]
  • And similarly weight: \[ w \sim \mathscr{N}\left(75,36\right). \]

Height and Weight Models

Independence Assumption

  • We assume height and weight are independent.

    \[ p(w, h) = p(w)p(h). \]
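
A small sampling sketch under this independence assumption (using the means and variances quoted above):

```python
import numpy as np

# Sample height (metres) and weight (kilograms) independently.
# The quoted values are variances, so take square roots for the std.
np.random.seed(0)
h = np.random.normal(1.7, np.sqrt(0.0225), size=5)   # standard deviation 0.15 m
w = np.random.normal(75.0, np.sqrt(36.0), size=5)    # standard deviation 6 kg
print(np.column_stack([h, w]))
```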

Sampling Two Dimensional Variables

Body Mass Index

  • In reality they are dependent: body mass index \(= \frac{w}{h^2}\).
  • To deal with this dependence we introduce correlated multivariate Gaussians.

Sampling Two Dimensional Variables

Independent Gaussians

\[ p(w, h) = p(w)p(h) \]

Independent Gaussians

\[\begin{aligned} p(w, h) = \frac{1}{\sqrt{2\pi \sigma_1^2}\sqrt{2\pi\sigma_2^2}} \exp\left(-&\frac{1}{2}\left(\frac{(w-\mu_1)^2}{\sigma_1^2} \right. \\ \left. +& \frac{(h-\mu_2)^2}{\sigma_2^2}\right)\right) \end{aligned}\]

Independent Gaussians

\[ p(w, h) = \frac{1}{\sqrt{2\pi\sigma_1^2\,2\pi\sigma_2^2}} \exp\left(-\frac{1}{2}\left(\begin{bmatrix}w \\ h\end{bmatrix} - \begin{bmatrix}\mu_1 \\ \mu_2\end{bmatrix}\right)^\top\begin{bmatrix}\sigma_1^2& 0\\0&\sigma_2^2\end{bmatrix}^{-1}\left(\begin{bmatrix}w \\ h\end{bmatrix} - \begin{bmatrix}\mu_1 \\ \mu_2\end{bmatrix}\right)\right) \]

Independent Gaussians

\[ p(\mathbf{ y}) = \frac{1}{\det{2\pi \mathbf{D}}^{\frac{1}{2}}} \exp\left(-\frac{1}{2}(\mathbf{ y}- \boldsymbol{ \mu})^\top\mathbf{D}^{-1}(\mathbf{ y}- \boldsymbol{ \mu})\right) \]

Correlated Gaussian

\[ p(\mathbf{ y}) = \frac{1}{\det{2\pi\mathbf{D}}^{\frac{1}{2}}} \exp\left(-\frac{1}{2}(\mathbf{ y}- \boldsymbol{ \mu})^\top\mathbf{D}^{-1}(\mathbf{ y}- \boldsymbol{ \mu})\right) \]

Correlated Gaussian

\[ p(\mathbf{ y}) = \frac{1}{\det{2\pi\mathbf{D}}^{\frac{1}{2}}} \exp\left(-\frac{1}{2}(\mathbf{R}^\top\mathbf{ y}- \mathbf{R}^\top\boldsymbol{ \mu})^\top\mathbf{D}^{-1}(\mathbf{R}^\top\mathbf{ y}- \mathbf{R}^\top\boldsymbol{ \mu})\right) \]

Correlated Gaussian

\[ p(\mathbf{ y}) = \frac{1}{\det{2\pi\mathbf{D}}^{\frac{1}{2}}} \exp\left(-\frac{1}{2}(\mathbf{ y}- \boldsymbol{ \mu})^\top\mathbf{R}\mathbf{D}^{-1}\mathbf{R}^\top(\mathbf{ y}- \boldsymbol{ \mu})\right) \]

\[ \mathbf{C}^{-1} = \mathbf{R}\mathbf{D}^{-1} \mathbf{R}^\top \]

Correlated Gaussian

\[ p(\mathbf{ y}) = \frac{1}{\det{2\pi\mathbf{C}}^{\frac{1}{2}}} \exp\left(-\frac{1}{2}(\mathbf{ y}- \boldsymbol{ \mu})^\top\mathbf{C}^{-1} (\mathbf{ y}- \boldsymbol{ \mu})\right) \] \[ \mathbf{C}= \mathbf{R}\mathbf{D} \mathbf{R}^\top \]
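
A sketch of this construction (the rotation angle and the diagonal variances are illustrative choices): build \(\mathbf{C} = \mathbf{R}\mathbf{D}\mathbf{R}^\top\) and compare it with the empirical covariance of samples.

```python
import numpy as np

# Build a correlated covariance C = R D R^T from a rotation R and a
# diagonal D, then sample from the resulting Gaussian.
np.random.seed(0)
theta = np.pi / 4.0
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
D = np.diag([1.0, 0.1])
C = R @ D @ R.T

samples = np.random.multivariate_normal(np.zeros(2), C, size=1000)
print(np.cov(samples.T))   # empirical covariance approaches C
print(C)
```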

The Prior Density

\[ \mathbf{ w}\sim \mathscr{N}\left(\mathbf{0},\alpha \mathbf{I}\right) \]

\[ w_i \sim \mathscr{N}\left(0,\alpha\right) \]

Revisit Olympics Data

  • Use the Bayesian approach on the Olympics data with polynomials.

  • Choose a prior \(\mathbf{ w}\sim \mathscr{N}\left(\mathbf{0},\alpha \mathbf{I}\right)\) with \(\alpha = 1\).

  • Choose noise variance \(\sigma^2 = 0.01\)

Generating from the Model

\[ \mathbf{ y}= \boldsymbol{ \Phi}\mathbf{ w}+ \boldsymbol{ \epsilon} \]

  • Sample \[ \mathbf{ w}\sim \mathscr{N}\left(\mathbf{0},\alpha\mathbf{I}\right) \] \[ \boldsymbol{ \epsilon}\sim \mathscr{N}\left(\mathbf{0},\sigma^2\mathbf{I}\right) \] with \(\alpha=1\) and \(\sigma^2 = 0.01\).
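
A minimal sketch of generating from the model (the inputs and the polynomial basis degree are illustrative choices; \(\alpha\) and \(\sigma^2\) follow the values above):

```python
import numpy as np

# Generate from y = Phi w + epsilon with a polynomial basis.
np.random.seed(0)
alpha, sigma2 = 1.0, 0.01
x = np.linspace(-1.0, 1.0, 20)
Phi = np.vander(x, 5, increasing=True)   # columns 1, x, x^2, x^3, x^4

w = np.random.normal(0.0, np.sqrt(alpha), size=Phi.shape[1])
epsilon = np.random.normal(0.0, np.sqrt(sigma2), size=x.shape)
y = Phi @ w + epsilon
print(y[:5])
```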

Sampling from the Prior

Scaling Gaussian-distributed Variables

\[ f= \phi w \]

\[ w\sim \mathscr{N}\left(\mu_w,c_w\right) \]

\[ \phi w\sim \mathscr{N}\left(\phi\mu_w,\phi^2c_w\right) \]
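
A quick Monte Carlo check of this scaling property (the numbers are illustrative):

```python
import numpy as np

# If w ~ N(mu_w, c_w) then phi * w ~ N(phi * mu_w, phi^2 * c_w).
np.random.seed(0)
phi, mu_w, c_w = 3.0, 0.5, 2.0

w = np.random.normal(mu_w, np.sqrt(c_w), size=100000)
f = phi * w
print(f.mean(), phi * mu_w)        # means agree
print(f.var(), phi**2 * c_w)       # variances agree
```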

Sampling from the Prior

\[ \mathbf{ f}= \boldsymbol{ \Phi}\mathbf{ w}. \]

Computing the Posterior

\[ p(\mathbf{ w}| \mathbf{ y}, \mathbf{ x}, \sigma^2) = \mathscr{N}\left(\mathbf{ w}|\boldsymbol{ \mu}_w,\mathbf{C}_w\right) \] with \[ \mathbf{C}_w= \left(\sigma^{-2}\boldsymbol{ \Phi}^\top \boldsymbol{ \Phi}+ \alpha^{-1}\mathbf{I}\right)^{-1} \] and \[ \boldsymbol{ \mu}_w= \mathbf{C}_w\sigma^{-2}\boldsymbol{ \Phi}^\top \mathbf{ y} \]
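
A sketch of computing this posterior in code (synthetic data and an illustrative polynomial basis; \(\alpha\) and \(\sigma^2\) as above):

```python
import numpy as np

# Posterior over w for the basis-function model:
#   C_w  = (Phi^T Phi / sigma^2 + I / alpha)^(-1)
#   mu_w = C_w Phi^T y / sigma^2
np.random.seed(0)
alpha, sigma2 = 1.0, 0.01
x = np.linspace(-1.0, 1.0, 30)
Phi = np.vander(x, 5, increasing=True)
w_true = np.random.normal(0.0, np.sqrt(alpha), size=Phi.shape[1])
y = Phi @ w_true + np.random.normal(0.0, np.sqrt(sigma2), size=x.shape)

C_w = np.linalg.inv(Phi.T @ Phi / sigma2 + np.eye(Phi.shape[1]) / alpha)
mu_w = C_w @ Phi.T @ y / sigma2
print(mu_w)     # posterior mean, close to w_true
print(w_true)
```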

Multivariate Bayesian Inference

Sampling from the Posterior

  • Now check samples by extracting \(\mathbf{ w}\) from the posterior.

Sampling from the Posterior

  • Now for \(\mathbf{ y}= \boldsymbol{ \Phi}\mathbf{ w}+ \boldsymbol{ \epsilon}\) need \[ \mathbf{ w}\sim \mathscr{N}\left(\boldsymbol{ \mu}_w,\mathbf{C}_w\right) \] with \(\mathbf{C}_w = \left[\sigma^{-2}\boldsymbol{ \Phi}^\top \boldsymbol{ \Phi}+ \alpha^{-1}\mathbf{I}\right]^{-1}\) and \(\boldsymbol{ \mu}_w =\mathbf{C}_w \sigma^{-2} \boldsymbol{ \Phi}^\top \mathbf{ y}\) \[ \boldsymbol{ \epsilon}\sim \mathscr{N}\left(\mathbf{0},\sigma^2\mathbf{I}\right) \] with \(\alpha=1\) and \(\sigma^2 = 0.01\).
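
A sketch of drawing function samples from the posterior (the synthetic data and basis are illustrative choices):

```python
import numpy as np

# Sample w ~ N(mu_w, C_w) and form f = Phi w for each sample.
np.random.seed(0)
alpha, sigma2 = 1.0, 0.01
x = np.linspace(-1.0, 1.0, 30)
Phi = np.vander(x, 5, increasing=True)
y = Phi @ np.random.normal(0.0, 1.0, size=Phi.shape[1]) \
    + np.random.normal(0.0, np.sqrt(sigma2), size=x.shape)

C_w = np.linalg.inv(Phi.T @ Phi / sigma2 + np.eye(Phi.shape[1]) / alpha)
mu_w = C_w @ Phi.T @ y / sigma2

w_samples = np.random.multivariate_normal(mu_w, C_w, size=5)
f_samples = w_samples @ Phi.T    # each row is one sampled function on x
print(f_samples.shape)
```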

Polynomial Function Samples

Properties of Gaussian Variables

Sum of Gaussian-distributed Variables

Sampling from Gaussians and Summing Up

Matrix Multiplication of Gaussian Variables

\[ f_i = \boldsymbol{ \phi}_i^\top \mathbf{ w} \]

\[ f_i = \sum_{j=1}^K w_j \phi_{i, j} \]

Expectation under Gaussian

\[ \left\langle\mathbf{ f}\right\rangle_{\mathscr{N}\left(\mathbf{ w}|\boldsymbol{ \mu},\mathbf{C}\right)} = \int \mathbf{ f} \mathscr{N}\left(\mathbf{ w}|\boldsymbol{ \mu},\mathbf{C}\right) \text{d}\mathbf{ w}= \int \boldsymbol{ \Phi}\mathbf{ w} \mathscr{N}\left(\mathbf{ w}|\boldsymbol{ \mu},\mathbf{C}\right) \text{d}\mathbf{ w}= \boldsymbol{ \Phi}\int \mathbf{ w} \mathscr{N}\left(\mathbf{ w}|\boldsymbol{ \mu},\mathbf{C}\right) \text{d}\mathbf{ w}= \boldsymbol{ \Phi}\boldsymbol{ \mu} \]

\[ \left\langle\mathbf{ f}\right\rangle_{\mathscr{N}\left(\mathbf{ w}|\mathbf{0},\alpha\mathbf{I}\right)} = \mathbf{0} \]

Expectation under Gaussian

\[ \text{cov}\left(\mathbf{ f}\right)_{\mathscr{N}\left(\mathbf{ w}|\boldsymbol{ \mu},\mathbf{C}\right)} = \left\langle\mathbf{ f}\mathbf{ f}^\top\right\rangle_{\mathscr{N}\left(\mathbf{ w}|\boldsymbol{ \mu},\mathbf{C}\right)} - \left\langle\mathbf{ f}\right\rangle_{\mathscr{N}\left(\mathbf{ w}|\boldsymbol{ \mu},\mathbf{C}\right)}\left\langle\mathbf{ f}\right\rangle_{\mathscr{N}\left(\mathbf{ w}|\boldsymbol{ \mu},\mathbf{C}\right)}^\top \]

\[ \text{cov}\left(\mathbf{ f}\right)_{\mathscr{N}\left(\mathbf{ w}|\boldsymbol{ \mu},\mathbf{C}\right)} = \left\langle\mathbf{ f}\mathbf{ f}^\top\right\rangle_{\mathscr{N}\left(\mathbf{ w}|\boldsymbol{ \mu},\mathbf{C}\right)} - \boldsymbol{ \Phi}\boldsymbol{ \mu}\boldsymbol{ \mu}^\top \boldsymbol{ \Phi}^\top \]

\[ \text{cov}\left(\mathbf{ f}\right)_{\mathscr{N}\left(\mathbf{ w}|\boldsymbol{ \mu},\mathbf{C}\right)} = \left\langle\boldsymbol{ \Phi}\mathbf{ w}\mathbf{ w}^\top \boldsymbol{ \Phi}^\top\right\rangle_{\mathscr{N}\left(\mathbf{ w}|\boldsymbol{ \mu},\mathbf{C}\right)} - \boldsymbol{ \Phi}\boldsymbol{ \mu}\boldsymbol{ \mu}^\top \boldsymbol{ \Phi}^\top \] \[ \text{cov}\left(\mathbf{ f}\right)_{\mathscr{N}\left(\mathbf{ w}|\boldsymbol{ \mu},\mathbf{C}\right)} = \boldsymbol{ \Phi}\left\langle\mathbf{ w}\mathbf{ w}^\top\right\rangle_{\mathscr{N}\left(\mathbf{ w}|\boldsymbol{ \mu},\mathbf{C}\right)} \boldsymbol{ \Phi}^\top - \boldsymbol{ \Phi}\boldsymbol{ \mu}\boldsymbol{ \mu}^\top\boldsymbol{ \Phi}^\top \] which depends on the second moment of the Gaussian, \[ \left\langle\mathbf{ w}\mathbf{ w}^\top\right\rangle_{\mathscr{N}\left(\mathbf{ w}|\boldsymbol{ \mu},\mathbf{C}\right)} = \mathbf{C}+ \boldsymbol{ \mu}\boldsymbol{ \mu}^\top \]

\[ \text{cov}\left(\mathbf{ f}\right)_{\mathscr{N}\left(\mathbf{ w}|\boldsymbol{ \mu},\mathbf{C}\right)} = \boldsymbol{ \Phi}\mathbf{C}\boldsymbol{ \Phi}^\top \] so in the case of the prior distribution, where we have \(\mathbf{C}= \alpha \mathbf{I}\) we can write \[ \text{cov}\left(\mathbf{ f}\right)_{\mathscr{N}\left(\mathbf{ w}|\mathbf{0},\alpha \mathbf{I}\right)} = \alpha \boldsymbol{ \Phi}\boldsymbol{ \Phi}^\top \]
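
A Monte Carlo check of this result for the prior case (the basis and \(\alpha\) are illustrative choices):

```python
import numpy as np

# Check that cov(f) approaches alpha Phi Phi^T when w ~ N(0, alpha I).
np.random.seed(0)
alpha = 1.0
x = np.linspace(-1.0, 1.0, 4)
Phi = np.vander(x, 3, increasing=True)

W = np.random.normal(0.0, np.sqrt(alpha), size=(200000, Phi.shape[1]))
F = W @ Phi.T                  # each row is one sampled f = Phi w
print(np.cov(F.T))             # approaches alpha * Phi @ Phi.T
print(alpha * Phi @ Phi.T)
```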

\[ \mathbf{ y}= \mathbf{ f}+ \boldsymbol{ \epsilon} \]

\[ \mathbf{ f}\sim \mathscr{N}\left(\mathbf{0},\alpha\boldsymbol{ \Phi}\boldsymbol{ \Phi}^\top\right) \]

\[ p(\mathbf{ y}) = \mathscr{N}\left(\mathbf{ y}|\mathbf{0},\alpha \boldsymbol{ \Phi}\boldsymbol{ \Phi}^\top + \sigma^2\mathbf{I}\right) \]

Marginal Likelihood

\[ p(\mathbf{ y}|\mathbf{X}, \sigma^2, \alpha) = \frac{1}{(2\pi)^\frac{n}{2}\left|\mathbf{K}\right|^\frac{1}{2}} \exp\left(-\frac{1}{2} \mathbf{ y}^\top \mathbf{K}^{-1} \mathbf{ y}\right) \] \[ \mathbf{K}= \alpha \boldsymbol{ \Phi}\boldsymbol{ \Phi}^\top + \sigma^2 \mathbf{I} \]
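
A sketch of evaluating this marginal likelihood (the synthetic data, basis, \(\alpha\) and \(\sigma^2\) are illustrative choices):

```python
import numpy as np

# Log marginal likelihood with K = alpha Phi Phi^T + sigma^2 I.
np.random.seed(0)
alpha, sigma2 = 1.0, 0.01
x = np.linspace(-1.0, 1.0, 20)
Phi = np.vander(x, 4, increasing=True)
y = Phi @ np.random.normal(0.0, 1.0, size=Phi.shape[1]) \
    + np.random.normal(0.0, np.sqrt(sigma2), size=x.shape)

K = alpha * Phi @ Phi.T + sigma2 * np.eye(len(x))
_, logdet = np.linalg.slogdet(K)
n = len(y)
log_ml = -0.5 * (n * np.log(2.0 * np.pi) + logdet + y @ np.linalg.solve(K, y))
print(log_ml)
```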

Computing the Mean and Error Bars of the Functions

\[ \mathbf{ f}= \boldsymbol{ \Phi}\mathbf{ w} \]

Computing the Expected Output

  • Given the posterior for the parameters, how can we compute the expected output at a given location?
  • Output of model at location \(\mathbf{ x}_i\) is given by \[ f(\mathbf{ x}_i; \mathbf{ w}) = \boldsymbol{ \phi}_i^\top \mathbf{ w} \]

Computing the Expected Output

  • We want the expected output under the posterior density, \(p(\mathbf{ w}|\mathbf{ y}, \mathbf{X}, \sigma^2, \alpha)\).
  • Mean of mapping function will be given by \[ \begin{aligned} \left\langle f(\mathbf{ x}_i; \mathbf{ w})\right\rangle_{p(\mathbf{ w}|\mathbf{ y}, \mathbf{X}, \sigma^2, \alpha)} &= \boldsymbol{ \phi}_i^\top \left\langle\mathbf{ w}\right\rangle_{p(\mathbf{ w}|\mathbf{ y}, \mathbf{X}, \sigma^2, \alpha)} \\ & = \boldsymbol{ \phi}_i^\top \boldsymbol{ \mu}_w \end{aligned} \]

Computing Error Bars

\[ \text{cov}\left(\mathbf{ f}\right)_{\mathscr{N}\left(\mathbf{ w}|\boldsymbol{ \mu}_w,\mathbf{C}_w\right)} = \boldsymbol{ \Phi}\mathbf{C}_w \boldsymbol{ \Phi}^\top \]

Variance of Expected Output

  • Variance of model at location \(\mathbf{ x}_i\) is given by \[\begin{aligned}\text{var}(f(\mathbf{ x}_i; \mathbf{ w})) &= \left\langle(f(\mathbf{ x}_i; \mathbf{ w}))^2\right\rangle - \left\langle f(\mathbf{ x}_i; \mathbf{ w})\right\rangle^2 \\&= \boldsymbol{ \phi}_i^\top\left\langle\mathbf{ w}\mathbf{ w}^\top\right\rangle \boldsymbol{ \phi}_i - \boldsymbol{ \phi}_i^\top \left\langle\mathbf{ w}\right\rangle\left\langle\mathbf{ w}\right\rangle^\top \boldsymbol{ \phi}_i \\&= \boldsymbol{ \phi}_i^\top \mathbf{C}_w\boldsymbol{ \phi}_i \end{aligned}\] where all these expectations are taken under the posterior density, \(p(\mathbf{ w}|\mathbf{ y}, \mathbf{X}, \sigma^2, \alpha)\).
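
A sketch of computing the mean and error bars at test inputs (synthetic data and an illustrative basis):

```python
import numpy as np

# Posterior mean and error bars of the function at test inputs:
#   mean_i = phi_i^T mu_w,   var_i = phi_i^T C_w phi_i.
np.random.seed(0)
alpha, sigma2 = 1.0, 0.01
x = np.linspace(-1.0, 1.0, 30)
Phi = np.vander(x, 5, increasing=True)
y = Phi @ np.random.normal(0.0, 1.0, size=Phi.shape[1]) \
    + np.random.normal(0.0, np.sqrt(sigma2), size=x.shape)

C_w = np.linalg.inv(Phi.T @ Phi / sigma2 + np.eye(Phi.shape[1]) / alpha)
mu_w = C_w @ Phi.T @ y / sigma2

x_test = np.linspace(-1.2, 1.2, 7)
Phi_test = np.vander(x_test, 5, increasing=True)
f_mean = Phi_test @ mu_w
f_var = np.sum((Phi_test @ C_w) * Phi_test, axis=1)   # diag(Phi C_w Phi^T)
print(np.column_stack([x_test, f_mean, np.sqrt(f_var)]))
```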

Olympic Marathon Data

  • Gold medal times for Olympic Marathon since 1896.
  • Marathons before 1924 didn’t have a standardized distance.
  • Present results using pace per km.
  • In 1904 the Marathon was badly organized, leading to very slow times.
Image from Wikimedia Commons

Olympic Marathon Data

Olympic Data with Bayesian Polynomials

Hold Out Validation

5-fold Cross Validation

Model Fit

  • Marginal likelihood doesn’t always increase as model order increases.
  • The Bayesian model always has two parameters, \(\alpha\) and \(\sigma^2\), regardless of how many basis functions (and here we didn’t even fit them).
  • The maximum likelihood model overfits as the number of parameters increases.
  • Revisit maximum likelihood solution with validation set.
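
A sketch of the first point: evaluate the log marginal likelihood across polynomial orders on a synthetic data set (the data, \(\alpha\) and \(\sigma^2\) below are illustrative choices):

```python
import numpy as np

# Compare the log marginal likelihood across polynomial orders; the
# value need not keep rising with model order.
np.random.seed(0)
alpha, sigma2 = 1.0, 0.01
x = np.linspace(-1.0, 1.0, 25)
y = np.sin(3.0 * x) + np.random.normal(0.0, np.sqrt(sigma2), size=x.shape)

for degree in range(1, 8):
    Phi = np.vander(x, degree + 1, increasing=True)
    K = alpha * Phi @ Phi.T + sigma2 * np.eye(len(x))
    _, logdet = np.linalg.slogdet(K)
    log_ml = -0.5 * (len(y) * np.log(2.0 * np.pi) + logdet
                     + y @ np.linalg.solve(K, y))
    print(degree, round(log_ml, 2))
```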

Bayesian Interpretation of Regularisation

Bootstrap Prediction and Bayesian Misspecified Models: https://www.jstor.org/stable/3318894#metadata_info_tab_contents

Edwin Fong and Chris Holmes: On the Marginal Likelihood and Cross Validation https://arxiv.org/abs/1905.08737

Regularised Mean

  • The validation fit here is based on the mean solution for \(\mathbf{ w}\) only.
  • For Bayesian solution \[ \boldsymbol{ \mu}_w = \left[\sigma^{-2}\boldsymbol{ \Phi}^\top\boldsymbol{ \Phi}+ \alpha^{-1}\mathbf{I}\right]^{-1} \sigma^{-2} \boldsymbol{ \Phi}^\top \mathbf{ y} \] instead of \[ \mathbf{ w}^* = \left[\boldsymbol{ \Phi}^\top\boldsymbol{ \Phi}\right]^{-1} \boldsymbol{ \Phi}^\top \mathbf{ y} \]

Regularised Mean

  • The two are equivalent when \(\alpha \rightarrow \infty\).
  • That is equivalent to a prior for \(\mathbf{ w}\) with infinite variance.
  • In other cases the \(\alpha^{-1}\mathbf{I}\) term regularizes the system (keeps the parameters smaller).
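
A sketch comparing the two solutions (synthetic data and illustrative settings):

```python
import numpy as np

# Regularised posterior mean vs unregularised maximum likelihood solution.
np.random.seed(0)
alpha, sigma2 = 1.0, 0.01
x = np.linspace(-1.0, 1.0, 15)
Phi = np.vander(x, 6, increasing=True)
y = np.sin(3.0 * x) + np.random.normal(0.0, np.sqrt(sigma2), size=x.shape)

mu_w = np.linalg.solve(Phi.T @ Phi / sigma2 + np.eye(Phi.shape[1]) / alpha,
                       Phi.T @ y / sigma2)
w_ml = np.linalg.solve(Phi.T @ Phi, Phi.T @ y)
print("regularised mean:  ", np.round(mu_w, 2))
print("maximum likelihood:", np.round(w_ml, 2))
```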

Further Reading

  • Section 1.2.3 (pg 21–24) of Bishop (2006)

  • Sections 3.1-3.4 (pg 95-117) of Rogers and Girolami (2011)

  • Section 1.2.6 (start from just past eq 1.64 pg 30-32) of Bishop (2006)

  • Multivariate Gaussians: Section 2.3 up to top of pg 85 of Bishop (2006)

  • Section 3.3 up to 159 (pg 152–159) of Bishop (2006)

  • Section 3.7–3.8 (pg 122–133) of Rogers and Girolami (2011)

  • Section 3.4 (pg 161–165) of Bishop (2006)

Thanks!

References

Bishop, C.M., 2006. Pattern recognition and machine learning. Springer.
Rogers, S., Girolami, M., 2011. A first course in machine learning. CRC Press.