Neil D. Lawrence
Dedan Kimathi University, Nyeri, Kenya
Prediction Function
From last time \[P(y_i|\mathbf{ w}, \mathbf{ x}) = \pi_i^{y_i} (1-\pi_i)^{1-y_i}\]
Trick for switching between probabilities
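A minimal sketch of this trick in plain Python, using an illustrative value for the predicted probability `pi_i`: the exponents select \(\pi_i\) when \(y_i=1\) and \(1-\pi_i\) when \(y_i=0\).

```python
# Illustrative predicted probability of the positive class.
pi_i = 0.8

for y_i in (0, 1):
    # pi_i**y_i * (1 - pi_i)**(1 - y_i) evaluates to pi_i when y_i = 1
    # and to (1 - pi_i) when y_i = 0.
    p = pi_i**y_i * (1 - pi_i)**(1 - y_i)
    print(y_i, p)
```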
Multiplying everything out leads to \[ \frac{\text{d}E(\mathbf{ w})}{\text{d}\mathbf{ w}} = -\sum_{i=1}^n \left(y_i - h\left(\mathbf{ w}^\top \boldsymbol{ \phi}(\mathbf{ x}_i)\right)\right) \boldsymbol{ \phi}(\mathbf{ x}_i). \]
\[ \mathbf{ w}_\text{new} \leftarrow \mathbf{ w}_\text{old} - \eta\left(H\left(\boldsymbol{ \phi}_i^\top \mathbf{ w}\right) (1-y_i) \boldsymbol{ \phi}_i - \left(1-H\left(\boldsymbol{ \phi}_i^\top \mathbf{ w}\right)\right) y_i \boldsymbol{ \phi}_i\right) \]
The gradient of the negative log-likelihood of logistic regression is \[ \frac{\text{d}E(\mathbf{ w})}{\text{d}\mathbf{ w}} = -\sum_{i=1}^n y_i\left(1-h\left(\mathbf{ w}^\top \boldsymbol{ \phi}_i\right)\right) \boldsymbol{ \phi}_i + \sum_{i=1}^n (1-y_i)h\left(\mathbf{ w}^\top \boldsymbol{ \phi}_i\right) \boldsymbol{ \phi}_i, \] so the gradient with respect to one point is \[ \frac{\text{d}E_i(\mathbf{ w})}{\text{d}\mathbf{ w}} = -y_i\left(1-h\left(\mathbf{ w}^\top \boldsymbol{ \phi}_i\right)\right) \boldsymbol{ \phi}_i + (1-y_i)h\left(\mathbf{ w}^\top \boldsymbol{ \phi}_i\right) \boldsymbol{ \phi}_i, \] and the stochastic gradient update for logistic regression (with mini-batch size set to 1) is \[ \mathbf{ w}_\text{new} \leftarrow \mathbf{ w}_\text{old} - \eta \left((1-y_i)h\left(\mathbf{ w}^\top \boldsymbol{ \phi}_i\right) - y_i\left(1-h\left(\mathbf{ w}^\top \boldsymbol{ \phi}_i\right)\right)\right) \boldsymbol{ \phi}_i. \] The difference between the two updates is that the perceptron uses the Heaviside function, \(H(\cdot)\), whereas logistic regression uses the sigmoid, \(h(\cdot)\), which is like a soft version of the Heaviside function.
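A minimal NumPy sketch of this stochastic update, assuming a basis matrix `Phi` whose rows are \(\boldsymbol{ \phi}_i^\top\), binary labels `y`, and a learning rate `eta`; the function names and toy data are illustrative, not taken from the lecture code.

```python
import numpy as np

def sigmoid(a):
    """Soft version of the Heaviside step: h(a) = 1 / (1 + exp(-a))."""
    return 1.0 / (1.0 + np.exp(-a))

def sgd_step(w, phi_i, y_i, eta=0.1):
    """One stochastic gradient step of logistic regression (mini-batch size 1)."""
    h = sigmoid(phi_i @ w)
    grad_i = -y_i * (1 - h) * phi_i + (1 - y_i) * h * phi_i
    return w - eta * grad_i

# Toy usage with an illustrative random basis matrix and labels.
rng = np.random.default_rng(0)
Phi = rng.normal(size=(100, 3))
y = (Phi @ np.array([1.0, -2.0, 0.5]) > 0).astype(float)
w = np.zeros(3)
for i in rng.permutation(len(y)):
    w = sgd_step(w, Phi[i], y[i])
```

Swapping `sigmoid` for a hard threshold recovers the Heaviside-based perceptron update shown above.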
Linear regression models a continuous response \(y_i\) as a function of inputs \(\mathbf{ x}_i\): \[y_i = f(\mathbf{ x}_i) + \epsilon_i\] where \(f(\mathbf{ x}_i) = \mathbf{ w}^\top\mathbf{ x}_i\)
Probabilistic model: \[p(y_i|\mathbf{ x}_i) = \mathscr{N}\left(y_i|\mathbf{ w}^\top\mathbf{ x}_i,\sigma^2\right)\]
Matrix form: \[\mathbf{ y}= \mathbf{X}\mathbf{ w}+ \boldsymbol{ \epsilon}\]
Expected prediction: \[\mathbb{E}[y_i|\mathbf{ x}_i] = \mathbf{ w}^\top\mathbf{ x}_i\]
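A short NumPy sketch of this model, with an illustrative design matrix `X`, weight vector `w_true`, and noise variance `sigma2` (all assumed values, not from the lecture):

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 50, 2
X = rng.normal(size=(n, d))          # design matrix, rows are x_i^T
w_true = np.array([0.7, -1.3])       # illustrative weights
sigma2 = 0.25                        # illustrative noise variance

# y = X w + epsilon, with epsilon_i ~ N(0, sigma2)
epsilon = rng.normal(scale=np.sqrt(sigma2), size=n)
y = X @ w_true + epsilon

# Expected prediction E[y_i | x_i] = w^T x_i
y_expected = X @ w_true
```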
Prediction Function
Objective Function
For this form the gradient of the log likelihood with respect to the model’s parameters is given by \[ \nabla_\mathbf{W}L(\mathbf{W}) = \left(T- \left\langle T\right\rangle_{p(\mathbf{ y}|\boldsymbol{ \theta})}\right)\boldsymbol{ \Phi}. \]
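A sketch of this gradient for softmax classification, assuming one-hot targets \(T\) stored with one column per data point, expected targets \(\left\langle T\right\rangle\) given by the softmax of the linear activations, and basis matrix \(\boldsymbol{ \Phi}\) with one row per data point; the shape convention is my assumption, chosen so that \((T - \left\langle T\right\rangle)\boldsymbol{ \Phi}\) conforms.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d, k = 40, 3, 4                  # data points, basis functions, classes
Phi = rng.normal(size=(n, d))       # basis matrix, one row per point
labels = rng.integers(0, k, size=n)
T = np.zeros((k, n))                # one-hot targets, one column per point
T[labels, np.arange(n)] = 1.0

W = np.zeros((k, d))
A = W @ Phi.T                       # class activations, shape (k, n)
# <T>: expected targets under the model, i.e. softmax over classes.
ET = np.exp(A - A.max(axis=0))
ET /= ET.sum(axis=0)

grad = (T - ET) @ Phi               # gradient of the log likelihood w.r.t. W
```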
The natural parameters of a univariate Gaussian with mean \(\mu\) and variance \(\sigma^2\), \[ y \sim \mathscr{N}\left(\mu, \sigma^2\right), \] are \(\theta_1 = \frac{\mu}{\sigma^2}\) and \(\theta_2 = -\tfrac{1}{2\sigma^2}\). This allows us to write \[ p(y | \boldsymbol{ \theta}) = \exp\left(\theta_1 y + \theta_2 y^2 - A(\boldsymbol{ \theta})\right), \] where the log partition function is \[ A(\theta_1, \theta_2) = -\frac{\theta_1^2}{4\theta_2} - \frac{1}{2}\log\left(-2\theta_2\right) + \frac{1}{2}\log 2\pi. \]
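A quick numerical check of these identities, as a sketch using NumPy and SciPy's normal density with illustrative values for \(\mu\) and \(\sigma^2\):

```python
import numpy as np
from scipy.stats import norm

mu, sigma2 = 1.5, 0.7                       # illustrative mean and variance
theta1 = mu / sigma2                        # natural parameters
theta2 = -1.0 / (2.0 * sigma2)

# Log partition function A(theta_1, theta_2).
A = -theta1**2 / (4 * theta2) - 0.5 * np.log(-2 * theta2) + 0.5 * np.log(2 * np.pi)

y = np.linspace(-2.0, 5.0, 7)
p_natural = np.exp(theta1 * y + theta2 * y**2 - A)
p_gaussian = norm.pdf(y, loc=mu, scale=np.sqrt(sigma2))
assert np.allclose(p_natural, p_gaussian)
```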
Diagnostic checks are essential for building confidence in a model’s reliability. Create residual plots against fitted values and each predictor, and look for systematic patterns: a U-shaped residual plot suggests missing quadratic terms. For logistic regression, plot predicted probabilities against actual outcomes in bins to check calibration. Calculate influence measures like Cook’s distance to identify outliers; in a house price model, a mansion might have outsized influence on the coefficients. Check variance inflation factors (VIF) for multicollinearity: a high VIF (>5-10) suggests problematic correlation between predictors.
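A sketch of these checks for a linear model, assuming statsmodels and matplotlib are available; the synthetic design matrix and coefficients are purely illustrative.

```python
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(3)
n = 200
X = rng.normal(size=(n, 2))
y = 1.0 + 0.5 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.3, size=n)

X_const = sm.add_constant(X)
fit = sm.OLS(y, X_const).fit()

# Residuals vs fitted values: look for curvature or funnelling.
plt.scatter(fit.fittedvalues, fit.resid, s=10)
plt.axhline(0.0, color="grey")
plt.xlabel("fitted values")
plt.ylabel("residuals")

# Cook's distance: flag points with outsized influence on the coefficients.
cooks_d, _ = fit.get_influence().cooks_distance
print("most influential point:", np.argmax(cooks_d), cooks_d.max())

# Variance inflation factors: values above ~5-10 indicate multicollinearity.
vif = [variance_inflation_factor(X_const, i) for i in range(1, X_const.shape[1])]
print("VIF per predictor:", vif)
```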
company: Trent AI
book: The Atomic Human
twitter: @lawrennd
newspaper: Guardian Profile Page
blog posts: