Wake word classification (Global Pulse Project).
Breakthrough in 2012 with the ImageNet result of Alex Krizhevsky, Ilya Sutskever and Geoff Hinton.
We are given a data set containing ‘inputs’, \(\mathbf{X}\) and ‘targets’, \(\mathbf{ y}\).
Each data point consists of an input vector \(\mathbf{ x}_i\) and a class label, \(y_i\).
For binary classification assume \(y_i\) should be either \(1\) (yes) or \(-1\) (no); for the Bernoulli treatment below it is more convenient to code the classes as \(1\) and \(0\).
The input vector can be thought of as a set of features.
| \(y\) | 0 | 1 |
|---|---|---|
| \(P(y)\) | \((1-\pi)\) | \(\pi\) |
This is the Bernoulli distribution, which can be written compactly as \[ P(y) = \pi^y(1-\pi)^{(1-y)}. \]
The formula is a clever trick for switching between the two probabilities. As code, it might look like the following minimal sketch.
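```python
def bernoulli(y_i, pi):
    """Probability of binary outcome y_i under a Bernoulli with parameter pi."""
    if y_i == 1:
        return pi
    else:
        return 1 - pi

# bernoulli(1, 0.7) -> 0.7, bernoulli(0, 0.7) -> 0.3
```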
To estimate \(\pi\) we minimise the negative log likelihood, \[E(\pi) = -\log \prod_{i=1}^{n} P(y_i) = -\sum_{i=1}^{n} y_i \log \pi - \sum_{i=1}^{n} (1-y_i)\log(1-\pi).\]
Stationary point: set the derivative to zero, \[0 = -\frac{\sum_{i=1}^{n} y_i}{\pi} + \frac{\sum_{i=1}^{n} (1-y_i)}{1-\pi},\]
Rearrange to form \[(1-\pi)\sum_{i=1}^{n} y_i = \pi\sum_{i=1}^{n} (1-y_i),\]
Giving \[\sum_{i=1}^{n} y_i = \pi\left(\sum_{i=1}^{n} (1-y_i) + \sum_{i=1}^{n} y_i\right),\]
Recognise that \(\sum_{i=1}^{n} (1-y_i) + \sum_{i=1}^{n} y_i = n\) so we have \[\pi = \frac{\sum_{i=1}^{n} y_i}{n}\]
Estimate the probability associated with the Bernoulli by setting it to the number of observed positives divided by the total number of observations, \(n\).
This makes intuitive sense.
What’s your best guess of the probability that a coin toss comes up heads when you observe 47 heads from 100 tosses?
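A minimal numpy sketch of this estimate for the coin example (the data array here is hypothetical, with heads coded as 1 and tails as 0):

```python
import numpy as np

# hypothetical data: 47 heads (1) and 53 tails (0) from 100 tosses
y = np.array([1] * 47 + [0] * 53)

# maximum likelihood estimate: observed positives divided by n
pi = y.sum() / len(y)
print(pi)  # 0.47
```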
\[ \text{posterior} = \frac{\text{likelihood}\times\text{prior}}{\text{marginal likelihood}} \]
Four components:
1. Prior distribution
2. Likelihood
3. Posterior distribution
4. Marginal likelihood
Probabilistic Machine Learning: place probability distributions (or densities) over all the variables of interest.
In naive Bayes this is exactly what we do.
Form a classification algorithm by modelling the joint density of our observations.
We need to make assumptions about the form of the joint density.
Given model parameters \(\boldsymbol{ \theta}\) we assume that all data points in the model are independent. \[ p(y^*, \mathbf{ x}^*, \mathbf{ y}, \mathbf{X}|\boldsymbol{ \theta}) = p(y^*, \mathbf{ x}^*|\boldsymbol{ \theta})\prod_{i=1}^{n} p(y_i, \mathbf{ x}_i | \boldsymbol{ \theta}). \]
This is a conditional independence assumption.
We also make similar assumptions for regression (where \(\boldsymbol{ \theta}= \left\{\mathbf{ w},\sigma^2\right\}\)).
Here we assume joint density of \(\mathbf{ y}\) and \(\mathbf{X}\) is independent across the data given the parameters.
Computing the posterior distribution in this case becomes easier; this is known as the ‘Bayes classifier’.
The naive Bayes assumption is that, given the label, the features are conditionally independent, \[p(\mathbf{ x}_i|y_i, \boldsymbol{ \theta}) = \prod_{j=1}^{p} p(x_{i,j}|y_i, \boldsymbol{ \theta}).\]
To specify the joint distribution we also need the marginal, \(p(y_i)\), \[p(x_{i,j},y_i| \boldsymbol{ \theta}) = p(x_{i,j}|y_i, \boldsymbol{ \theta})p(y_i).\]
Because \(y_i\) is binary the Bernoulli density makes a suitable choice for our prior over \(y_i\), \[p(y_i|\pi) = \pi^{y_i} (1-\pi)^{1-y_i}\] where \(\pi\) now has the interpretation as being the prior probability that the classification should be positive.
Our objective is the negative log likelihood, \[\begin{align*} E(\boldsymbol{ \theta}, \pi)& = -\log p(\mathbf{ y}, \mathbf{X}|\boldsymbol{ \theta}, \pi) \\ &= -\sum_{i=1}^{n} \sum_{j=1}^{p} \log p(x_{i, j}|y_i, \boldsymbol{ \theta}) - \sum_{i=1}^{n} \log p(y_i|\pi), \end{align*}\] which separates into one term for the class-conditional densities and one for the prior, so each can be minimised independently.
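As an illustration, here is a minimal sketch of the maximum likelihood fit that minimises this objective when the features are binary; the helper name and the assumption that X is a binary numpy feature matrix are illustrative, not from the notes:

```python
import numpy as np

def fit_bernoulli_naive_bayes(X, y):
    """Maximum likelihood fit of Bernoulli naive Bayes.

    X: (n, p) binary feature matrix; y: (n,) binary labels.
    """
    pi = y.mean()                     # prior probability of the positive class
    theta_0 = X[y == 0].mean(axis=0)  # P(x_j = 1 | y = 0) for each feature j
    theta_1 = X[y == 1].mean(axis=0)  # P(x_j = 1 | y = 1) for each feature j
    return pi, theta_0, theta_1
```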
The distributions show the parameters of the independent class-conditional probabilities for each service. Each is a Bernoulli distribution whose parameter is given by \(\theta_0\) for the facilities without maternity services and by \(\theta_1\) for the facilities with maternity services. The parameters show that facilities with maternity services are also more likely to have other services, such as grid electricity, emergency transport and immunization programs.
The naive Bayes assumption says that the joint probability for these services is given by the product of each of these Bernoulli distributions.
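In code the product is usually evaluated in log space for numerical stability; a sketch, reusing the hypothetical theta vector from the fitting sketch above (entries assumed strictly between 0 and 1):

```python
import numpy as np

def log_bernoulli_product(x, theta):
    """Log of the naive Bayes product prod_j theta_j**x_j * (1 - theta_j)**(1 - x_j)."""
    return np.sum(x * np.log(theta) + (1 - x) * np.log(1 - theta))
```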
We have modelled the numbers in our table with a Gaussian density. Since several of these numbers are counts, a more appropriate distribution might be the Poisson distribution. But here we can see that the average number of nurses, health workers and doctors is higher in the facilities with maternal services (\(\mu_1\)) than in those without maternal services (\(\mu_0\)). There is also a small difference between the mean latitudes and longitudes. However, the standard deviation, which would be given by the square root of the variance parameters (\(\sigma_0\) and \(\sigma_1\)), is large, implying that the difference in latitude and longitude may be due to sampling error. To be sure, more analysis would be required.
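A sketch of how such per-class Gaussian parameters might be estimated by maximum likelihood (the helper and the continuous feature matrix X_cont are illustrative assumptions):

```python
import numpy as np

def fit_gaussian_conditionals(X_cont, y):
    """Per-class Gaussian parameters for continuous features (maximum likelihood)."""
    mu_0 = X_cont[y == 0].mean(axis=0)    # means without maternity services
    mu_1 = X_cont[y == 1].mean(axis=0)    # means with maternity services
    sigma_0 = X_cont[y == 0].std(axis=0)  # standard deviations, class 0
    sigma_1 = X_cont[y == 1].std(axis=0)  # standard deviations, class 1
    return mu_0, sigma_0, mu_1, sigma_1
```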
Laplace’s suggestion for estimating such probabilities is to add one pseudo-observation of each outcome to the counts, giving the smoothed estimate \[ \pi = \frac{\sum_{i=1}^{n} y_i + 1}{n+ 2}. \]
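In code, under the same assumptions as the sketches above (binary labels in a numpy array y), this might be:

```python
# Laplace-smoothed estimate: one pseudo-count for each outcome
pi_smoothed = (y.sum() + 1) / (len(y) + 2)
```

This also keeps the estimated probabilities away from exact zeros and ones, which protects the log-space computation above.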
twitter: @lawrennd
podcast: The Talking Machines
newspaper: Guardian Profile Page