Regression: covariates \(\mathbf{Z}\) are observed
Factor analysis: \(\mathbf{Z}\) are latent/unknown
Solution: treat as probability distributions
Each element of the latent matrix is given a standard normal prior, \[
z_{i,j} \sim \mathscr{N}\left(0,1\right),
\] and we can write the density governing the latent variable associated with a single point as \[
\mathbf{ z}_{i, :} \sim \mathscr{N}\left(\mathbf{0},\mathbf{I}\right).
\]
Probabilistic PCA Maximum Likelihood Solution (Tipping and Bishop, 1999) \[
p\left(\mathbf{Y}|\mathbf{W}\right)=\prod_{i=1}^{n}\mathscr{N}\left(\mathbf{ y}_{i,:}|\mathbf{0},\mathbf{C}\right),\quad \mathbf{C}=\mathbf{W}\mathbf{W}^{\top}+\sigma^{2}\mathbf{I}
\]\[
\log p\left(\mathbf{Y}|\mathbf{W}\right)=-\frac{n}{2}\log\left|\mathbf{C}\right|-\frac{1}{2}\text{tr}\left(\mathbf{C}^{-1}\mathbf{Y}^{\top}\mathbf{Y}\right)+\text{const.}
\] If \(\mathbf{U}_{q}\) are the first \(q\) principal eigenvectors of \(n^{-1}\mathbf{Y}^{\top}\mathbf{Y}\) and the corresponding eigenvalues are \(\boldsymbol{\Lambda}_{q}\), then \[
\mathbf{W}=\mathbf{U}_{q}\mathbf{L}\mathbf{R}^{\top},\quad\mathbf{L}=\left(\boldsymbol{\Lambda}_{q}-\sigma^{2}\mathbf{I}\right)^{\frac{1}{2}}
\] where \(\mathbf{R}\) is an arbitrary rotation matrix.
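As a concrete illustration, here is a minimal numpy sketch of this solution (our own code, with assumed variable names): `Y` is an \(n\times d\) centred data matrix, and `q` and `sigma2` are chosen by hand. It returns \(\mathbf{W}\) with \(\mathbf{R}=\mathbf{I}\) and evaluates the log likelihood above.

```python
import numpy as np

def ppca_max_likelihood_W(Y, q, sigma2):
    """ML mapping W = U_q L (taking R = I), following Tipping and Bishop (1999)."""
    n, d = Y.shape
    S = Y.T @ Y / n                        # sample covariance of the centred data
    eigvals, eigvecs = np.linalg.eigh(S)   # eigenvalues in ascending order
    idx = np.argsort(eigvals)[::-1][:q]    # pick the q largest
    U_q, Lambda_q = eigvecs[:, idx], eigvals[idx]
    L = np.sqrt(np.maximum(Lambda_q - sigma2, 0.0))  # (Lambda_q - sigma^2 I)^{1/2}
    return U_q * L

def ppca_log_likelihood(Y, W, sigma2):
    """log p(Y | W) = -n/2 log|C| - 1/2 tr(C^{-1} Y^T Y) + const."""
    n, d = Y.shape
    C = W @ W.T + sigma2 * np.eye(d)
    _, logdetC = np.linalg.slogdet(C)
    return (-0.5 * n * logdetC
            - 0.5 * np.trace(np.linalg.solve(C, Y.T @ Y))
            - 0.5 * n * d * np.log(2 * np.pi))
```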
Probabilistic PCA
Represent data, \(\mathbf{Y}\), with a lower dimensional set of latent variables \(\mathbf{Z}\).
Assume a linear relationship of the form \[
\mathbf{ y}_{i,:}=\mathbf{W}\mathbf{ z}_{i,:}+\boldsymbol{ \epsilon}_{i,:},
\] where \[
\boldsymbol{ \epsilon}_{i,:} \sim \mathscr{N}\left(\mathbf{0},\sigma^2\mathbf{I}\right)
\]
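A short numpy sketch of this generative model (sizes and seed are our own illustrative choices): draw latent points from the standard normal prior above, map them through a randomly chosen \(\mathbf{W}\), and add the Gaussian noise.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, q = 200, 5, 2                      # illustrative sizes, not from the notes
sigma2 = 0.1

W = rng.standard_normal((d, q))          # linear map from latent to data space
Z = rng.standard_normal((n, q))          # z_{i,:} ~ N(0, I)
E = np.sqrt(sigma2) * rng.standard_normal((n, d))   # epsilon_{i,:} ~ N(0, sigma^2 I)
Y = Z @ W.T + E                          # y_{i,:} = W z_{i,:} + epsilon_{i,:}
```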
Probabilistic PCA
PPCA defines a probabilistic model where:
Data is generated from latent variables through linear transformation
Gaussian noise is added to the transformed variables
Classical PCA: find directions in the data, \(\mathbf{ z}= \mathbf{U}\mathbf{ y}\), for which the variance is maximized.
Lagrangian
The solution is found via constrained optimisation (using Lagrange multipliers): \[
L\left(\mathbf{u}_{1},\lambda_{1}\right)=\mathbf{u}_{1}^{\top}\mathbf{S}\mathbf{u}_{1}+\lambda_{1}\left(1-\mathbf{u}_{1}^{\top}\mathbf{u}_{1}\right)
\]
Gradient with respect to \(\mathbf{u}_{1}\): \[\frac{\text{d}L\left(\mathbf{u}_{1},\lambda_{1}\right)}{\text{d}\mathbf{u}_{1}}=2\mathbf{S}\mathbf{u}_{1}-2\lambda_{1}\mathbf{u}_{1}\]
Lagrangian
Setting the gradient to zero and rearranging gives \[\mathbf{S}\mathbf{u}_{1}=\lambda_{1}\mathbf{u}_{1},\] which is an eigenvalue problem.
Further directions that are orthogonal to the first can also be shown to be eigenvectors of the covariance.
PCA directions
Maximum variance directions are eigenvectors of the covariance matrix
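A quick numerical check of this statement (a toy sketch, not code from the notes): compute the sample covariance of some centred data, take its leading eigenvector, and confirm it satisfies the eigenvalue problem and captures the largest projected variance.

```python
import numpy as np

rng = np.random.default_rng(1)
Y = rng.standard_normal((500, 3)) @ np.diag([3.0, 1.0, 0.3])  # toy data
Y = Y - Y.mean(axis=0)                   # centre
S = Y.T @ Y / Y.shape[0]                 # sample covariance

eigvals, eigvecs = np.linalg.eigh(S)     # ascending eigenvalues
u1, lambda1 = eigvecs[:, -1], eigvals[-1]

assert np.allclose(S @ u1, lambda1 * u1)        # S u_1 = lambda_1 u_1
assert np.isclose(np.var(Y @ u1), lambda1)      # projected variance equals lambda_1
```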
Relationship to Matrix Factorization
PCA is closely related to matrix factorisation.
Instead of latent variables \(\mathbf{Z}\) and a mapping \(\mathbf{W}\):
Define users \(\mathbf{U}\) and items \(\mathbf{V}\) (see the sketch after this list)
Element-wise operations faster than matrix multiplication
Overall complexity: \(O(nd+ q^3)\) vs \(O(d^3)\)
Scales well for high-dimensional data with low latent dimension
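To make the connection concrete, here is a minimal sketch (our own toy example and variable names) that factorises a small ratings-style matrix into user and item factors by gradient descent on the squared reconstruction error; each step uses only low-rank matrix products, which is where the favourable scaling comes from.

```python
import numpy as np

rng = np.random.default_rng(2)
n_users, n_items, q = 20, 15, 3
Y = rng.standard_normal((n_users, n_items))   # toy "ratings" matrix

U = 0.1 * rng.standard_normal((n_users, q))   # user factors
V = 0.1 * rng.standard_normal((n_items, q))   # item factors
lr = 0.01

for _ in range(2000):
    E = Y - U @ V.T                           # residual of the current reconstruction
    U, V = U + lr * E @ V, V + lr * E.T @ U   # gradient step on 0.5*||Y - U V^T||^2

print(np.linalg.norm(Y - U @ V.T))            # error of the best rank-q fit found
```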
Reconstruction of the Data
Given any posterior projection of a data point, we can map back through \(\mathbf{W}\) and plot the reconstruction in the original data space.
We will now try to reconstruct the motion capture figure from some different places in the latent plot.
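A minimal sketch of such a reconstruction (our own stand-in data and variable names; the notebook's motion capture variables will differ): fit a two-dimensional PCA, pick a location in the latent plot, and map it back into the data space.

```python
import numpy as np

rng = np.random.default_rng(3)
Y = rng.standard_normal((100, 6))           # stand-in for the motion capture matrix
mu = Y.mean(axis=0)
Yc = Y - mu

# two-dimensional PCA: principal eigenvectors of the sample covariance
eigvals, eigvecs = np.linalg.eigh(Yc.T @ Yc / Yc.shape[0])
W = eigvecs[:, -2:]                         # d x 2 map from latent plot to data space

z_new = np.array([0.5, -1.0])               # a location chosen from the latent plot
y_reconstructed = mu + W @ z_new            # reconstruction in the original space
print(y_reconstructed)
```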
Other Data Sets to Explore
Below are a few other data sets from pods that you might want to explore with PCA. Both of them have \(d>n\), so you need to consider how to solve the larger eigenvalue problem efficiently without placing large demands on computer memory.
The data is actually quite high dimensional, and solving the eigenvalue problem in the high dimensional space can take some time. At this point we turn to a neat trick: you don’t have to solve the full eigenvalue problem for the \(d\times d\) covariance; you can instead solve the related eigenvalue problem in the \(n\times n\) space, and in this case \(n=200\), which is much smaller than \(d\).
The original eigenvalue problem has the form \[
\mathbf{Y}^\top\mathbf{Y}\mathbf{U} = \mathbf{U}\boldsymbol{\Lambda}
\] But if we premultiply by \(\mathbf{Y}\) then we can solve \[
\mathbf{Y}\mathbf{Y}^\top\mathbf{Y}\mathbf{U} = \mathbf{Y}\mathbf{U}\boldsymbol{\Lambda}
\] but it turns out that we can write \[
\mathbf{U}^\prime = \mathbf{Y}\mathbf{U} \boldsymbol{\Lambda}^{-\frac{1}{2}},
\] where \(\mathbf{U}^\prime\) is an orthonormal matrix because \[
\left.\mathbf{U}^\prime\right.^\top\mathbf{U}^\prime = \boldsymbol{\Lambda}^{-\frac{1}{2}}\mathbf{U}^\top\mathbf{Y}^\top\mathbf{Y}\mathbf{U} \boldsymbol{\Lambda}^{-\frac{1}{2}}
\] and since \(\mathbf{U}\) diagonalises \(\mathbf{Y}^\top\mathbf{Y}\), \[
\mathbf{U}^\top\mathbf{Y}^\top\mathbf{Y}\mathbf{U} = \boldsymbol{\Lambda},
\] then \[
\left.\mathbf{U}^\prime\right.^\top\mathbf{U}^\prime = \mathbf{I}.
\] In practice, then, we solve the smaller \(n\times n\) eigenvalue problem, \(\mathbf{Y}\mathbf{Y}^\top\mathbf{U}^\prime = \mathbf{U}^\prime\boldsymbol{\Lambda}\), and recover the \(d\)-dimensional eigenvectors as \(\mathbf{U} = \mathbf{Y}^\top\mathbf{U}^\prime\boldsymbol{\Lambda}^{-\frac{1}{2}}\).
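A numerical sketch of the trick (toy sizes of our own choosing): solve the \(n\times n\) eigenvalue problem and recover orthonormal eigenvectors of the \(d\times d\) covariance from it.

```python
import numpy as np

rng = np.random.default_rng(4)
n, d, q = 200, 1000, 5
Y = rng.standard_normal((n, d))
Y = Y - Y.mean(axis=0)                         # centre the data

# small n x n problem: Y Y^T U' = U' Lambda
eigvals, Uprime = np.linalg.eigh(Y @ Y.T)
Lambda, Uprime = eigvals[-q:], Uprime[:, -q:]  # keep the q largest

# recover the d-dimensional eigenvectors: U = Y^T U' Lambda^{-1/2}
U = Y.T @ Uprime / np.sqrt(Lambda)

assert np.allclose(U.T @ U, np.eye(q))          # orthonormal columns
assert np.allclose(Y.T @ (Y @ U), U * Lambda)   # Y^T Y U = U Lambda
```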
Olivetti Faces
im = np.reshape(Y[1, :].flatten(), (64, 64)).T  # reshape one 4096-pixel face vector into a 64 x 64 image
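To view the result (a sketch assuming `im` comes from the reshape above and matplotlib is available):

```python
import matplotlib.pyplot as plt

plt.imshow(im, cmap='gray')   # `im` is the 64 x 64 face from the line above
plt.axis('off')
plt.show()
```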
Visualizing the Eigenvectors
Reconstruction
Gene Expression
When Dimensionality Reduction Fails
Dimensionality reduction can fail in several key scenarios:
Truly high dimensional data with no simpler structure
Highly nonlinear relationships between dimensions
Varying intrinsic dimensionality across data
The Swiss Roll Example
Classic example of nonlinear structure
2D manifold embedded in 3D
Linear methods like PCA fail (see the sketch after this list)
Requires nonlinear methods (t-SNE, UMAP)
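A small numpy sketch of the point (a toy construction with our own parameter choices): build a Swiss roll, then project it with PCA; the linear projection flattens the roll onto a plane, so points far apart along the manifold land close together.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 1000
t = 1.5 * np.pi * (1 + 2 * rng.random(n))      # position along the roll
height = 20 * rng.random(n)                    # position across the roll

# a 2D manifold (t, height) embedded nonlinearly in 3D
X = np.column_stack([t * np.cos(t), height, t * np.sin(t)])

# linear PCA projection down to two dimensions
Xc = X - X.mean(axis=0)
eigvals, eigvecs = np.linalg.eigh(Xc.T @ Xc / n)
Z = Xc @ eigvecs[:, -2:]   # the roll stays folded over itself in this projection
```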
Common Failure Modes
Key failure scenarios:
Linear methods on nonlinear manifolds
Assuming global structure when only local exists
Not accounting for noise
Visualization and Human Perception
Human visual system is our highest bandwidth connection to the world
Optic tract: ~8.75 million bits/second
Verbal communication: only ~2,000 bits/minute
Active sensing through rapid eye movements (saccades)
The Atomic Human pages: bandwidth/communication 10-12, 16, 21, 29, 31, 34, 38, 41, 44, 65-67, 76, 81, 90-91, 104, 115, 149, 196, 214, 216, 235, 237-238, 302, 334; MacKay, Donald 227-228, 230-237, 267-270; optic nerve/tract 205, 235; O’Regan, Kevin 236-240, 250, 259, 262-263, 297, 299; saccade 236, 238, 259-260, 297, 301; visual system/visual cortex 204-206, 209, 235-239, 249-250, 255, 259, 260, 268-270, 281, 294, 297, 301, 324, 330.
Bishop, C.M., James, G.D., 1993. Analysis of multiphase flows using dual-energy gamma densitometry and neural networks. Nuclear Instruments and Methods in Physics Research A327, 580–593. https://doi.org/10.1016/0168-9002(93)90728-Z
Hotelling, H., 1933. Analysis of a complex of statistical variables into principal components. Journal of Educational Psychology 24, 417–441.
MacKay, D.M., 1991. Behind the eye. Basil Blackwell.
Tipping, M.E., Bishop, C.M., 1999. Probabilistic principal component analysis. Journal of the Royal Statistical Society, Series B 61, 611–622. https://doi.org/10.1111/1467-9868.00196