\[ \text{data} + \text{model} \stackrel{\text{compute}}{\rightarrow} \text{prediction}\]
point 1: \(x= 1\), \(y=3\) \[ 3 = m + c \]
point 2: \(x= 3\), \(y=1\) \[ 1 = 3m + c \]
point 3: \(x= 2\), \(y=2.5\) \[ 2.5 = 2m + c \]
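For concreteness, here is a minimal numerical check (a sketch using numpy; the setup is only illustrative): the first two points determine \(m\) and \(c\) exactly, and the third point is then inconsistent with that line.

```python
import numpy as np

# Solve the exact system given by points 1 and 2:
#   3 = m*1 + c
#   1 = m*3 + c
A = np.array([[1.0, 1.0],
              [3.0, 1.0]])
b = np.array([3.0, 1.0])
m, c = np.linalg.solve(A, b)
print(m, c)        # m = -1.0, c = 4.0

# Point 3 (x = 2, y = 2.5) does not lie on this line:
print(m * 2 + c)   # 2.0, not 2.5
```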
point 1: \(x= 1\), \(y=3\) \[ 3 = m + c + \epsilon_1 \]
point 2: \(x= 3\), \(y=1\) \[ 1 = 3m + c + \epsilon_2 \]
point 3: \(x= 2\), \(y=2.5\) \[ 2.5 = 2m + c + \epsilon_3 \]
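With the corruption terms \(\epsilon_i\) included, one common choice (assumed here for illustration, not the only option) is to pick \(m\) and \(c\) so that the \(\epsilon_i\) are as small as possible in the least-squares sense. A minimal sketch:

```python
import numpy as np

x = np.array([1.0, 3.0, 2.0])
y = np.array([3.0, 1.0, 2.5])

# Design matrix for y = m*x + c
X = np.column_stack([x, np.ones_like(x)])

# Least-squares estimates of m and c (minimise the sum of squared epsilons)
(m, c), *_ = np.linalg.lstsq(X, y, rcond=None)
eps = y - (m * x + c)   # the implied corruptions epsilon_i
print(m, c, eps)
```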
Set the mean of the Gaussian to be a function. \[ p\left(y_i|x_i\right)=\frac{1}{\sqrt{2\pi\sigma^2}}\exp \left(-\frac{\left(y_i-f\left(x_i\right)\right)^{2}}{2\sigma^2}\right). \]
This gives us a ‘noisy function’.
This is known as a stochastic process.
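A minimal sketch of sampling from such a noisy function, assuming for illustration a linear \(f(x)=mx+c\) and an arbitrary noise variance:

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x, m=-1.0, c=4.0):
    # an illustrative choice of underlying function
    return m * x + c

sigma2 = 0.25                  # noise variance (assumed for illustration)
x = np.array([1.0, 3.0, 2.0])

# each y_i is drawn from a Gaussian whose mean is f(x_i)
y = rng.normal(loc=f(x), scale=np.sqrt(sigma2))

# density of each observation under the model
p = np.exp(-(y - f(x))**2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)
print(y, p)
```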
Terminology | Mathematical notation | Description |
---|---|---|
joint | \(P(X=x, Y=y)\) | prob. that X=x and Y=y |
marginal | \(P(X=x)\) | prob. that X=x regardless of Y |
conditional | \(P(X=x\vert Y=y)\) | prob. that X=x given that Y=y |
Terminology | Definition | Probability Notation |
---|---|---|
Joint Probability | \(\lim_{N\rightarrow\infty}\frac{n_{X=3,Y=4}}{N}\) | \(P\left(X=3,Y=4\right)\) |
Marginal Probability | \(\lim_{N\rightarrow\infty}\frac{n_{X=5}}{N}\) | \(P\left(X=5\right)\) |
Conditional Probability | \(\lim_{N\rightarrow\infty}\frac{n_{X=3,Y=4}}{n_{Y=4}}\) | \(P\left(X=3\vert Y=4\right)\) |
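These limit-of-relative-frequency definitions can be illustrated by counting samples; the joint distribution below is made up purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 100_000

# A made-up joint distribution over X in {3, 5} and Y in {4, 6}
outcomes = np.array([(3, 4), (3, 6), (5, 4), (5, 6)])
probs = np.array([0.2, 0.1, 0.3, 0.4])
idx = rng.choice(len(outcomes), size=N, p=probs)
X, Y = outcomes[idx].T

n_X3_Y4 = np.sum((X == 3) & (Y == 4))
n_X5 = np.sum(X == 5)
n_Y4 = np.sum(Y == 4)

print(n_X3_Y4 / N)     # joint P(X=3, Y=4), close to 0.2
print(n_X5 / N)        # marginal P(X=5), close to 0.7
print(n_X3_Y4 / n_Y4)  # conditional P(X=3 | Y=4), close to 0.4
```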
Typically we should write out \(P\left(X=x,Y=y\right)\).
In practice, we often use \(P\left(x,y\right)\).
This looks very much like we might write a multivariate function, e.g. \(f\left(x,y\right)=\frac{x}{y}\).
We now quickly review the ‘rules of probability’.
All distributions are normalized. This is clear from the fact that \(\sum_{x}n_{x}=N\), which gives \[\sum_{x}P\left(x\right)={\lim_{N\rightarrow\infty}}\frac{\sum_{x}n_{x}}{N}={\lim_{N\rightarrow\infty}}\frac{N}{N}=1.\] A similar result can be derived for the marginal and conditional distributions.
Ignoring the limit in our definitions:
The marginal probability \(P\left(y\right)\) is \({\lim_{N\rightarrow\infty}}\frac{n_{y}}{N}\) .
The joint distribution \(P\left(x,y\right)\) is \({\lim_{N\rightarrow\infty}}\frac{n_{x,y}}{N}\).
\(n_{y}=\sum_{x}n_{x,y}\) so \[ {\lim_{N\rightarrow\infty}}\frac{n_{y}}{N}={\lim_{N\rightarrow\infty}}\sum_{x}\frac{n_{x,y}}{N}, \] in other words \[ P\left(y\right)=\sum_{x}P\left(x,y\right). \] This is known as the sum rule of probability.
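Both normalization and the sum rule are easy to verify numerically for a small table of joint probabilities (the values below are illustrative):

```python
import numpy as np

# P(x, y) as a table: rows indexed by x, columns by y (illustrative values)
P_xy = np.array([[0.10, 0.20],
                 [0.25, 0.05],
                 [0.15, 0.25]])

print(P_xy.sum())         # normalization: the entries sum to 1
P_y = P_xy.sum(axis=0)    # sum rule: P(y) = sum over x of P(x, y)
P_x = P_xy.sum(axis=1)    # likewise P(x) = sum over y of P(x, y)
print(P_y, P_x)
```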
\(y\) | 1 | 2 | 3 | 4 |
---|---|---|---|---|
\(P\left(y\right)\) | 0.3 | 0.2 | 0.1 | 0.4 |
\(y^2\) | 1 | 4 | 9 | 16 |
\(-\log(P(y))\) | 1.204 | 1.609 | 2.302 | 0.916 |
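Using the table above, the expectations under \(P(y)\) follow directly; a minimal sketch:

```python
import numpy as np

y = np.array([1, 2, 3, 4])
P = np.array([0.3, 0.2, 0.1, 0.4])

E_y = np.sum(y * P)               # E[y] = 2.6
E_y2 = np.sum(y**2 * P)           # E[y^2] = 8.4
entropy = np.sum(-np.log(P) * P)  # E[-log P(y)] ≈ 1.28
var_y = E_y2 - E_y**2             # E[y^2] - E[y]^2 = 1.64
print(E_y, E_y2, entropy, var_y)
```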
You are given the following sample of student heights,
\(i\) | 1 | 2 | 3 | 4 | 5 | 6 |
---|---|---|---|---|---|---|
\(y_i\) | 1.76 | 1.73 | 1.79 | 1.81 | 1.85 | 1.80 |
What is the sample mean?
What is the sample variance?
Can you compute a sample approximation to the expected value of \(-\log P(y)\)? (A sketch follows the table below.)
\(i\) | 1 | 2 | 3 | 4 | 5 | 6 |
---|---|---|---|---|---|---|
\(y_i\) | 1.76 | 1.73 | 1.79 | 1.81 | 1.85 | 1.80 |
\(y^2_i\) | 3.0976 | 2.9929 | 3.2041 | 3.2761 | 3.4225 | 3.2400 |
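A minimal sketch for these heights, using the \(1/n\) form of the sample variance and, for the last part, plugging a Gaussian density with the estimated mean and variance in for \(P(y)\) (one possible choice, not the only one):

```python
import numpy as np

y = np.array([1.76, 1.73, 1.79, 1.81, 1.85, 1.80])

mean = y.mean()                      # sample mean, 1.79
variance = np.mean(y**2) - mean**2   # sample variance (1/n form), about 0.0014

# sample approximation to E[-log P(y)], with a Gaussian density using the
# estimated mean and variance standing in for P(y)
log_p = -0.5 * np.log(2 * np.pi * variance) - (y - mean)**2 / (2 * variance)
print(mean, variance, np.mean(-log_p))
```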
Actually these “data” were sampled from a Gaussian with mean 1.7 and standard deviation 0.15. Are your estimates close to the real values? If not why not?
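To see why only six samples give noisy estimates, one can repeat the experiment many times under the stated generating distribution (a sketch; the number of repeats is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)

# Draw many datasets of six heights from the stated Gaussian
samples = rng.normal(loc=1.7, scale=0.15, size=(10_000, 6))
means = samples.mean(axis=1)

# The sample mean itself varies from dataset to dataset
print(means.mean(), means.std())  # roughly 1.7 and 0.15 / sqrt(6) ≈ 0.06
```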
See probability review at end of slides for reminders.
Section 2.2 (pg 41–53) of Rogers and Girolami (2011)
Section 2.4 (pg 55–58) of Rogers and Girolami (2011)
Section 2.5.1 (pg 58–60) of Rogers and Girolami (2011)
Section 2.5.3 (pg 61–62) of Rogers and Girolami (2011)
For other material in Bishop read:
Probability densities: Section 1.2.1 (pg 17–19) of Bishop (2006)
Expectations and covariances: Section 1.2.2 (pg 19–20) of Bishop (2006)
The Gaussian density: Section 1.2.4 (pg 24–28) of Bishop (2006) (don't worry about the material on bias)
For material on information theory and the KL divergence try Sections 1.6 and 1.6.1 (pg 48 onwards) of Bishop (2006)
If you are unfamiliar with probabilities you should complete the following exercises:
Exercise 1.7 of Bishop (2006)
Exercise 1.8 of Bishop (2006)
Exercise 1.9 of Bishop (2006)
book: The Atomic Human
twitter: @lawrennd
podcast: The Talking Machines
newspaper: Guardian Profile Page
blog posts: