Clustering and High Dimensions

Neil D. Lawrence

Dedan Kimathi University, Nyeri, Kenya

Review

Clustering

  • Common approach for grouping data points
  • Assigns data points to discrete groups
  • Examples include:
    • Animal classification
    • Political affiliation grouping

Clustering vs Vector Quantisation

  • Clustering expects gaps between groups in data density
  • Vector quantisation may not require density gaps
  • For practical purposes, both involve:
    • Allocating points to groups
    • Determining optimal number of groups

Task

  • Task: associate data points with different labels.
  • Labels are not provided by humans.
  • The process is intuitive for humans; we do it naturally.

Platonic Ideals

  • Greek philosopher Plato considered the concept of ideals
  • The Platonic ideal bird is the most bird-like bird
  • In clustering, we find these ideals as cluster centres
  • Data points are allocated to their nearest centre

\(k\)-means Clustering

Mathematical Formulation

  • Represent objects as data vectors \(\mathbf{ x}_i\)
  • Represent cluster centres as vectors \(\boldsymbol{ \mu}_j\)
  • Define similarity/distance between objects and centres
  • Distance function: \(d_{ij} = f(\mathbf{ x}_i, \boldsymbol{ \mu}_j)\)

Squared Distance

  • Common choice: squared Euclidean distance \[ d_{ij} = \left(\mathbf{ x}_i - \boldsymbol{ \mu}_j\right)^\top\left(\mathbf{ x}_i - \boldsymbol{ \mu}_j\right) \]
  • Goal: find centres close to many data points

Objective Function

  • Given a similarity measure, we need to choose the number of cluster centres, \(K\).
  • Find their locations by allocating each data point to one of the centres and minimising the sum of squared distances, \[ E(\mathbf{M}) = \sum_{j=1}^{K} \sum_{i \in \mathbf{i}_j} \left(\mathbf{ x}_i - \boldsymbol{ \mu}_j\right)^\top\left(\mathbf{ x}_i - \boldsymbol{ \mu}_j\right), \] where \(\mathbf{i}_j\) is the set of indices of the data points allocated to the \(j\)th centre.

\(k\)-Means Clustering

  • \(k\)-means clustering is simple and quick to implement.
  • It is very sensitive to initialisation.

Initialisation

  • Initialisation is the process of selecting a starting set of parameters.
  • Optimisation result can depend on the starting point.
  • For \(k\)-means clustering you need to choose an initial set of centres.
  • The optimisation surface has many local optima, and the algorithm gets stuck in whichever one is nearest to its initialisation.

\(k\)-means Algorithm

  • Simple iterative clustering algorithm (a code sketch follows the steps below)
  • Key steps:
    1. Initialise with random centres
    2. Assign each point to its nearest centre
    3. Update centres as cluster means
    4. Repeat until assignments are stable
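
A minimal NumPy sketch of these steps (the function name, defaults and random initialisation scheme here are illustrative, not taken from the lecture):

```python
import numpy as np

def k_means(X, k, max_iters=100, seed=0):
    """A simple k-means sketch: X is an (n, d) array, k the number of centres."""
    rng = np.random.default_rng(seed)
    # 1. Initialise by picking k data points at random as the centres.
    centres = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iters):
        # 2. Assign each point to its nearest centre (squared Euclidean distance).
        d2 = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # 3. Update each centre as the mean of the points allocated to it.
        new_centres = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centres[j] for j in range(k)])
        # 4. Stop when the centres no longer move.
        if np.allclose(new_centres, centres):
            break
        centres = new_centres
    # The objective E(M): sum of squared distances to the allocated centres.
    error = ((X - centres[labels]) ** 2).sum()
    return centres, labels, error
```

Because the objective is non-convex, the routine is typically run from several random initialisations and the solution with the lowest error kept.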

Objective Function

  • Minimises sum of squared distances: \[ E=\sum_{j=1}^K \sum_{i\ \text{allocated to}\ j} \left(\mathbf{ y}_{i, :} - \boldsymbol{ \mu}_{j, :}\right)^\top\left(\mathbf{ y}_{i, :} - \boldsymbol{ \mu}_{j, :}\right) \]
  • Solution not guaranteed to be global or unique
  • Represents a non-convex optimization problem

\(k\)-Means Clustering

\(k\)-means

Hierarchical Clustering

Mathematical Formulation

  • Start with each data point as its own cluster
  • Define distance between clusters (linkage criterion)
  • At each step, merge the two closest clusters
  • Continue until all points are in one cluster
  • Result: A tree structure (dendrogram) showing merge history

Linkage Criteria

  • Single linkage: Distance between closest points in clusters
  • Complete linkage: Distance between farthest points in clusters
  • Average linkage: Average distance between all point pairs
  • Ward linkage: Minimises within-cluster variance increase (all four criteria are sketched in code below)
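
These criteria can be tried directly with scipy.cluster.hierarchy; a minimal sketch, assuming the data sit in a NumPy array of shape (n, d) (the toy data and the choice of three clusters are illustrative):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 2))  # illustrative data; replace with your own

# Build the merge history under each linkage criterion and cut it into 3 clusters.
for method in ["single", "complete", "average", "ward"]:
    Z = linkage(X, method=method)                     # (n-1, 4) merge history
    labels = fcluster(Z, t=3, criterion="maxclust")   # flat clustering from the tree
    print(method, np.bincount(labels)[1:])            # cluster sizes

# scipy.cluster.hierarchy.dendrogram(Z) plots the merge history (needs matplotlib).
```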

Ward’s Criterion

  • Ward’s method minimises the increase in within-cluster variance
  • Start with every data point as its own cluster of size 1.
  • At each step, merge the pair of clusters \(i\), \(j\) that minimises \[ \Delta_{i,j} = \frac{n_i n_j}{n_i + n_j} \|\boldsymbol{ \mu}_i - \boldsymbol{ \mu}_j\|^2 \]
  • Where \(\boldsymbol{ \mu}_i, \boldsymbol{ \mu}_j\) are cluster centroids and \(n_i, n_j\) are cluster sizes

Ward’s Criterion

  • Creates compact, roughly spherical clusters

  • Effective for data with clear cluster structure

  • Why Ward’s works well:

    • Minimises information loss during merging
    • Creates clusters with minimal internal variance
    • Produces balanced cluster sizes
    • Mathematically optimal for spherical clusters

Mathematical Derivation

  • Within-cluster sum of squares (\(E(\mathbf{M})\)):
    • \(E(\mathbf{M}) = \sum_{i=1}^k \sum_{\mathbf{ x}\in C_i} \|\mathbf{ x}- \boldsymbol{ \mu}_i\|^2\)
  • Ward’s criterion: Minimise increase in \(E(\mathbf{M})\) when merging
  • This is the same criterion as the \(k\)-means objective
  • Key insight: For spherical clusters, this is equivalent to minimising centroid distance weighted by cluster sizes

Mathematical Derivation of Ward Distance

  • Start with two clusters \(C_i\) and \(C_j\) with centroids \(\boldsymbol{ \mu}_i, \boldsymbol{ \mu}_j\)
  • After merging: new centroid \[ \boldsymbol{ \mu}_{ij} = \frac{n_i \boldsymbol{ \mu}_i + n_j \boldsymbol{ \mu}_j}{n_i + n_j} \]
  • Increase in \(E(\mathbf{M})\): \[ \Delta_{i,j} = \frac{n_i n_j}{n_i + n_j} \|\boldsymbol{ \mu}_i - \boldsymbol{ \mu}_j\|^2 \]

Mathematical Derivation - II

  • Step 1: Original \(E(\mathbf{M})\) for separate clusters \[ E(\mathbf{M})_{\text{original}} = \sum_{\mathbf{ x}\in C_i} \|\mathbf{ x}- \boldsymbol{ \mu}_i\|^2 + \sum_{\mathbf{ x}\in C_j} \|\mathbf{ x}- \boldsymbol{ \mu}_j\|^2 \]
  • Step 2: New \(E(\mathbf{M})\) after merging \[ E(\mathbf{M})_{\text{new}} = \sum_{\mathbf{ x}\in C_i \cup C_j} \|\mathbf{ x}- \boldsymbol{ \mu}_{ij}\|^2 \]
  • Step 3: Increase in \(E(\mathbf{M})\) \[ \Delta_{i,j} = E(\mathbf{M})_{\text{new}} - E(\mathbf{M})_{\text{original}} \]

Mathematical Derivation - III

  • Expanding the new \(E(\mathbf{M})\): \[ E(\mathbf{M})_{\text{new}} = \sum_{\mathbf{ x}\in C_i} \|\mathbf{ x}- \boldsymbol{ \mu}_{ij}\|^2 + \sum_{\mathbf{ x}\in C_j} \|\mathbf{ x}- \boldsymbol{ \mu}_{ij}\|^2 \]
  • Key identity \[\begin{aligned} \|\mathbf{ x}- \boldsymbol{ \mu}_{ij}\|^2 = & \|\mathbf{ x}- \boldsymbol{ \mu}_i\|^2 \\ & + \|\boldsymbol{ \mu}_i - \boldsymbol{ \mu}_{ij}\|^2 \\ & + 2(\mathbf{ x}- \boldsymbol{ \mu}_i)^\top(\boldsymbol{ \mu}_i - \boldsymbol{ \mu}_{ij}) \end{aligned}\]

Mathematical Derivation - IV

  • After simplification \[\begin{aligned} E(\mathbf{M})_{\text{new}} = & E(\mathbf{M})_{\text{original}} \\ & + n_i \|\boldsymbol{ \mu}_i - \boldsymbol{ \mu}_{ij}\|^2 \\ & + n_j \|\boldsymbol{ \mu}_j - \boldsymbol{ \mu}_{ij}\|^2 \end{aligned}\]

Mathematical Derivation - V

  • Substituting the centroid formula \[ \boldsymbol{ \mu}_{ij} = \frac{n_i \boldsymbol{ \mu}_i + n_j \boldsymbol{ \mu}_j}{n_i + n_j} \]
  • Final result: \[ \Delta_{i,j} = \frac{n_i n_j}{n_i + n_j} \|\boldsymbol{ \mu}_i - \boldsymbol{ \mu}_j\|^2 \] (checked numerically in the sketch below)
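
As a sanity check on the derivation, the closed-form \(\Delta_{i,j}\) can be compared with the increase in within-cluster sum of squares computed directly; a small sketch with made-up clusters:

```python
import numpy as np

rng = np.random.default_rng(1)
Ci = rng.normal(loc=0.0, size=(40, 3))   # cluster i, n_i = 40
Cj = rng.normal(loc=2.0, size=(25, 3))   # cluster j, n_j = 25

def within_ss(C):
    """Within-cluster sum of squares about the centroid."""
    return ((C - C.mean(axis=0)) ** 2).sum()

# Direct computation: E(M) after merging minus E(M) before merging.
delta_direct = within_ss(np.vstack([Ci, Cj])) - within_ss(Ci) - within_ss(Cj)

# Closed form: (n_i n_j / (n_i + n_j)) * ||mu_i - mu_j||^2.
ni, nj = len(Ci), len(Cj)
mui, muj = Ci.mean(axis=0), Cj.mean(axis=0)
delta_formula = ni * nj / (ni + nj) * ((mui - muj) ** 2).sum()

print(delta_direct, delta_formula)  # should agree to numerical precision
```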

Local Optimality

  • Spherical clusters have minimal \(E(\mathbf{M})\) for given number of points
  • Ward’s method makes locally optimal choices at each merge step
  • The weighting \(\frac{n_i n_j}{n_i + n_j}\) prevents premature merging of large clusters

Algorithm

  • Step 1: Start with \(n\) clusters (one per point)
  • Step 2: Compute distance matrix between all clusters
  • Step 3: Find and merge the two closest clusters
  • Step 4: Update distance matrix
  • Step 5: Repeat until one cluster remains (a naive code sketch of these steps follows below)
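
A deliberately naive sketch of these steps using Ward's distance (it recomputes all pairwise distances at every iteration rather than updating a distance matrix, so it is for illustration only):

```python
import numpy as np

def ward_agglomerative(X):
    """Naive agglomerative clustering under Ward's criterion.
    Returns the merge history as (cluster_a, cluster_b, delta) tuples."""
    # Step 1: start with n clusters, one per data point.
    clusters = {i: X[i:i + 1] for i in range(len(X))}
    history = []
    # Step 5: repeat until one cluster remains.
    while len(clusters) > 1:
        best = None
        # Steps 2-3: compute Ward's distance for every pair and find the closest.
        for a in clusters:
            for b in clusters:
                if a >= b:
                    continue
                Ca, Cb = clusters[a], clusters[b]
                na, nb = len(Ca), len(Cb)
                mu_a, mu_b = Ca.mean(axis=0), Cb.mean(axis=0)
                delta = na * nb / (na + nb) * ((mu_a - mu_b) ** 2).sum()
                if best is None or delta < best[2]:
                    best = (a, b, delta)
        a, b, delta = best
        # Step 4: merge the two closest clusters and record the merge.
        clusters[a] = np.vstack([clusters[a], clusters[b]])
        del clusters[b]
        history.append((a, b, delta))
    return history

X = np.random.default_rng(7).normal(size=(20, 2))
print(ward_agglomerative(X)[:3])  # first three merges
```

In practice a library routine such as scipy.cluster.hierarchy.linkage, shown earlier, is used; this sketch only illustrates the steps.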

Ward’s Criterion on Artificial Data

Oil Flow Data

Hierarchical Clustering of Oil Flow Data

Phylogenetic Trees

  • Hierarchical clustering of genetic sequence data
  • Creates evolutionary trees showing species relationships
  • Estimates common ancestors and mutation timelines
  • Critical for tracking viral evolution and outbreaks

Product Clustering

  • Hierarchical clustering for e-commerce products
  • Creates product taxonomy trees
  • Splits into nested categories (e.g. Electronics → Phones → Smartphones)

Hierarchical Clustering Challenge

  • Many products belong in multiple clusters (e.g. running shoes are both ‘sporting goods’ and ‘clothing’)
  • Tree structures are too rigid for natural categorisation
  • Human concept learning is more flexible:
    • Forms overlapping categories
    • Learns abstract rules
    • Builds causal theories

Thinking in High Dimensions

Mixtures of Gaussians

Mixtures of Gaussians

High Dimensional Data

Thinking in High Dimensions

  • Two dimensional plots of Gaussians can be misleading.
  • Our low dimensional intuitions can fail dramatically.
  • Two major issues:
    1. In high dimensions all the data moves to a ‘shell.’ There is nothing near the mean!
    2. Distances between points become constant.
  • These effects apply to many densities, not just the Gaussian.
  • Let’s consider a Gaussian “egg.”

Distance from a Mean

Dimensionality Greater than Three

The Gaussian Egg

1D Egg

In one dimension the yolk (the region within 0.95 standard deviations of the mean) contains 65.8% of the probability mass, the thin green layer (between 0.95 and 1.05 standard deviations) 4.8%, and the white (beyond 1.05 standard deviations) 29.4%.

2D Egg

In two dimensions the yolk contains 59.4%, the green 7.4% and the white 33.2%.

3D Egg

In three dimensions the yolk contains 56.1%, the green 9.2% and the white 34.7%.

The Gamma Density

Gamma Density Formula

\[\mathscr{G}\left(x|a,b\right)=\frac{b^{a}}{\Gamma\left(a\right)}x^{a-1}e^{-bx}\]
  • Shape parameter: \(a\) (controls form)
  • Rate parameter: \(b\) (controls scale)
  • Gamma function: \(\Gamma\left(a\right)=\int_{0}^{\infty}x^{a-1}e^{-x}\text{d}x\)

Gamma Properties

  • Mean: \(\frac{a}{b}\)
  • Variance: \(\frac{a}{b^{2}}\)
  • Support: \(x > 0\) (positive numbers only)
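
A quick numerical check of the density and its moments (the shape and rate values are illustrative; note that scipy.stats.gamma is parameterised by a scale, the inverse of the rate used here):

```python
import numpy as np
from scipy.stats import gamma as gamma_dist
from scipy.special import gamma as gamma_fn

a, b = 3.0, 2.0          # shape (a) and rate (b); illustrative values
x = np.linspace(0.1, 5.0, 50)

# Density from the formula: b^a / Gamma(a) * x^(a-1) * exp(-b x).
pdf_formula = b**a / gamma_fn(a) * x**(a - 1) * np.exp(-b * x)

# scipy uses a scale parameter, which is the inverse of the rate.
pdf_scipy = gamma_dist(a, scale=1.0 / b).pdf(x)
assert np.allclose(pdf_formula, pdf_scipy)

# Moments: mean a/b and variance a/b^2.
print(gamma_dist(a, scale=1.0 / b).mean(), a / b)     # both 1.5
print(gamma_dist(a, scale=1.0 / b).var(), a / b**2)   # both 0.75
```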

Special Cases

  • Exponential: \(a=1\) \(\rightarrow\) \(\mathscr{G}\left(x|1,b\right)=b e^{-bx}\), the exponential density with rate \(b\)
  • \(\chi^2\) (chi-squared, 1 df) \(a=\frac{1}{2}, b=\frac{1}{2}\) \(\rightarrow\) \(\chi_{1}^{2}\left(x\right)\)

Other Important Applications

  • Conjugate prior for Gaussian precision (inverse variance)
  • Modelling waiting times and lifetimes

Additive Property

  • If \[ x_k \sim \text{Gamma}(a_k, b) \] for \(k=1,\ldots,d\)
  • Then \[ \sum_{k=1}^dx_k \sim \text{Gamma}(\sum_{k=1}^d a_k, b) \]
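
A Monte Carlo sketch of this additive property (the shape and rate values are illustrative):

```python
import numpy as np
from scipy.stats import gamma, kstest

rng = np.random.default_rng(2)
shapes = np.array([0.5, 1.5, 2.0])   # a_k for k = 1..3
rate = 4.0                           # shared rate b

# Sum of independent Gamma(a_k, b) samples ...
samples = sum(gamma(a, scale=1.0 / rate).rvs(size=100_000, random_state=rng)
              for a in shapes)

# ... should follow Gamma(sum_k a_k, b).
stat, p_value = kstest(samples, gamma(shapes.sum(), scale=1.0 / rate).cdf)
print(p_value)  # a large p-value: no evidence against the additive property
```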

Warning: Parameterisation Confusion

  • Rate parameter: \(b\) (used here)
  • Scale parameter: \(\beta = b^{-1}\) (alternative)
  • Watch out for different software conventions.

Mathematics

What is the density of probability mass?

For a \(d\)-dimensional Gaussian distribution:

Individual components: \[ y_{i,k} \sim \mathscr{N}\left(0,\sigma^2\right) \]

Mathematics

What is the density of probability mass?

For a \(d\)-dimensional Gaussian distribution:

Squared components: \[ y_{i,k}^2 \sim \sigma^2 \chi_1^2 \] (scaled chi-squared)

Mathematics

What is the density of probability mass?

For a \(d\)-dimensional Gaussian distribution:

Gamma distribution: \[ y_{i,k}^2 \sim \mathscr{G}\left(\frac{1}{2},\frac{1}{2\sigma^2}\right) \]

Mathematics

What is the density of probability mass?

For a \(d\)-dimensional Gaussian distribution:

Sum of squares: \[ \sum_{k=1}^dy_{i,k}^2 \sim \mathscr{G}\left(\frac{d}{2},\frac{1}{2\sigma^2}\right) \]

Mathematics

What is the density of probability mass?

For a \(d\)-dimensional Gaussian distribution:

Expected value:

\[ \left\langle\sum_{k=1}^dy_{i,k}^2\right\rangle = d\sigma^2 \]

Mathematics

What is the density of probability mass?

Normalised sum: \[ \frac{1}{d}\sum_{k=1}^dy_{i,k}^2 \sim \mathscr{G}\left(\frac{d}{2},\frac{d}{2\sigma^2}\right) \]

Mathematics

What is the density of probability mass?

Expected normalised: \[ \frac{1}{d}\left\langle\sum_{k=1}^dy_{i,k}^2\right\rangle = \sigma^2 \]
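
These results can be checked by sampling (a small sketch; the dimensionalities are chosen only for illustration):

```python
import numpy as np

rng = np.random.default_rng(3)
sigma2 = 1.0
for d in [1, 10, 100, 1000]:
    Y = rng.normal(scale=np.sqrt(sigma2), size=(10_000, d))
    # Normalised squared distance of each sample from the mean.
    r2 = (Y ** 2).sum(axis=1) / d
    # The mean is sigma^2; the variance shrinks like 2 sigma^4 / d,
    # so the mass concentrates in a thin shell.
    print(d, r2.mean(), r2.var(), 2 * sigma2**2 / d)
```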

Distribution of Mass against Dimensionality

chi-squared Distributions

Densities of \(z^2\), \(z_1^2 + z_2^2\) and \(\sum_{i=1}^5 z^2_i\) for standard normal \(z_i\)

Square of Sample

  • The square of a sample from a Gaussian follows a scaled chi-squared density
  • The scaled \(\chi^2\) density is a gamma with \(a=\frac{1}{2}\), \(b=\frac{1}{2\sigma^{2}}\), \[ \mathscr{G}\left(x|a,b\right)=\frac{b^{a}}{\Gamma\left(a\right)}x^{a-1}e^{-bx} \]

Distance Distributions

  • The sum of independent gamma random variables with the same rate is gamma distributed with the sum of the shape parameters (the \(y_{i,k}\)s are independent)
  • Scaling a gamma random variable by a constant divides the rate parameter by that constant

Where is the Mass?

Proportions of volumes between yolk, green and white as \(d\rightarrow 1024\) (log scale)

Looking at Gaussian Samples

High Dimensional Gaussians and Interpoint Distances

Interpoint Distances

  • The other effect in high dimensions is that all points become equidistant.
  • We can show this for Gaussians with an argument similar to the one above.

Interpoint Distance Analysis

For two points \(i\) and \(j\) in \(d\)-dimensional space:

  1. Individual components: \(y_{i,k} \sim \mathscr{N}\left(0, \sigma_k^2\right)\) and \(y_{j,k} \sim \mathscr{N}\left(0, \sigma_k^2\right)\)
  2. Difference: \(y_{i,k} - y_{j,k} \sim \mathscr{N}\left(0, 2\sigma_k^2\right)\)
  3. Squared difference: \((y_{i,k} - y_{j,k})^2 \sim \mathscr{G}\left(\frac{1}{2},\frac{1}{4\sigma_k^2}\right)\)

Interpoint Distance Analysis

For spherical Gaussian where \(\sigma_k^2 = \sigma^2\):

Sum of squared differences: \[ \sum_{k=1}^d(y_{i,k} - y_{j,k})^2 \sim \mathscr{G}\left(\frac{d}{2},\frac{1}{4\sigma^2}\right) \]

Interpoint Distance Analysis

Normalised distance: \[ \frac{1}{d}\sum_{k=1}^d(y_{i,k} - y_{j,k})^2 \sim \mathscr{G}\left(\frac{d}{2},\frac{d}{4\sigma^2}\right) \]

Key Results

  • Mean distance: \(2\sigma^2\)
  • Variance: \(\frac{8\sigma^4}{d}\) (decreases with dimension!)
  • All points become equidistant as \(d\to \infty\) (see the sampling sketch below)
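
A sampling sketch of these results (sample sizes and dimensionalities are illustrative):

```python
import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(4)
sigma2 = 1.0
n = 500
for d in [10, 100, 1000]:
    Y = rng.normal(scale=np.sqrt(sigma2), size=(n, d))
    # Normalised squared distance between every pair of points.
    d2 = pdist(Y, metric="sqeuclidean") / d
    # Mean approaches 2 sigma^2; variance shrinks like 8 sigma^4 / d.
    print(d, d2.mean(), d2.var(), 8 * sigma2**2 / d)
```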

Central Limit Theorem and Non-Gaussian Case

  • Analytic for spherical, independent Gaussian data.
  • For independent data, the central limit theorem applies.
  • The mean squared distance in high D is the mean of the variances.
  • The variance about the mean scales as \(d^{-1}\).

Summary

  • In high dimensions, if the individual dimensions are independent, the distributions behave counter-intuitively.
  • All data sits at one standard deviation from the mean.
  • The densities of squared distances can be analytically calculated for the Gaussian case.

Summary

  • For non-Gaussian systems we can invoke the central limit theorem.
  • Next we will consider example data sets and see how their interpoint distances are distributed.

Example Data Sets

Sanity Check

Data sampled from an independent Gaussian distribution

  • If dimensions are independent, we expect low variance, Gaussian behaviour for the distribution of squared distances (a sketch reproducing this check follows below).
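
A sketch reproducing this check (assuming matplotlib is available; the theoretical curve is the \(\mathscr{G}\left(\frac{d}{2},\frac{d}{4\sigma^2}\right)\) density derived earlier):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.spatial.distance import pdist
from scipy.stats import gamma

d, n, sigma2 = 1000, 1000, 1.0
rng = np.random.default_rng(6)
Y = rng.normal(scale=np.sqrt(sigma2), size=(n, d))

# Histogram of normalised squared interpoint distances.
d2 = pdist(Y, metric="sqeuclidean") / d
plt.hist(d2, bins=50, density=True, alpha=0.5)

# Theoretical curve: Gamma(d/2) with rate d/(4 sigma^2), i.e. scale 4 sigma^2 / d.
x = np.linspace(d2.min(), d2.max(), 200)
plt.plot(x, gamma(d / 2, scale=4 * sigma2 / d).pdf(x))
plt.xlabel("normalised squared distance")
plt.show()
```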

Sanity Check

Distance distribution for a Gaussian with \(d=1000\), \(n=1000\) with theoretical curve

Sanity Check

Same data generation, but fewer data points.

  • If dimensions are independent, we expect low variance, Gaussian behaviour for the distribution of squared distances.

Sanity Check

Distance distribution for a Gaussian with \(d=1000\) and \(n=100\) with theoretical curve

Simulation of oil flow

Histogram for Simulation of oil flow

OSU Motion Capture Data: Run 1

Ohio State University motion capture

Histogram for Ohio State University motion capture

Robot Wireless Data

Robot WiFi Data

Robot wireless navigation

Histogram for Robot wireless navigation

Where does practice depart from our theory?

  • The situation for real data does not reflect what we expect.
  • Real data exhibits greater variance in interpoint distances.
    • Somehow the real data seems to have a smaller effective dimension.
  • Let’s look at another \(d=1000\) Gaussian.

1000-D Gaussian

Distance distribution for a different Gaussian with \(d=1000\)

  1. The Gaussian has a specific low-rank covariance matrix \(\mathbf{C}=\mathbf{W}\mathbf{W}^{\top}+\sigma^{2}\mathbf{I}\).

  2. Take \(\sigma^{2}=10^{-2}\) and sample \(\mathbf{W}\in\Re^{1000\times2}\) from \(\mathscr{N}\left(0,1\right)\).

  3. Theoretical curve computed assuming a dimensionality of 2 (a sampling sketch follows below).
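
A sketch of how such a sample might be generated and inspected (following the recipe above):

```python
import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(5)
d, n, q = 1000, 1000, 2
sigma2 = 1e-2

# Low-rank covariance C = W W^T + sigma^2 I, with W a d x 2 matrix drawn from N(0, 1).
W = rng.normal(size=(d, q))

# Each row of Y is a draw from N(0, C): a 2-D latent point mapped through W plus noise.
Z = rng.normal(size=(n, q))
Y = Z @ W.T + rng.normal(scale=np.sqrt(sigma2), size=(n, d))

# Normalised interpoint distances spread far more than the d=1000 theory predicts,
# reflecting the effective dimensionality of 2.
d2 = pdist(Y, metric="sqeuclidean") / d
print(d2.mean(), d2.var())
```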

Summary and Key Points

Thanks!

References

Bishop, C.M., James, G.D., 1993. Analysis of multiphase flows using dual-energy gamma densitometry and neural networks. Nuclear Instruments and Methods in Physics Research A327, 580–593. https://doi.org/10.1016/0168-9002(93)90728-Z