Clustering and High Dimensions
Neil D. Lawrence
2025-09-22
Dedan Kimathi University, Nyeri, Kenya
Clustering
Common approach for grouping data points
Assigns data points to discrete groups
Examples include:
Animal classification
Political affiliation grouping
Clustering vs Vector Quantisation
Clustering expects gaps between groups in data density
Vector quantization may not require density gaps
For practical purposes, both involve:
Allocating points to groups
Determining optimal number of groups
Task
Task : associate data points with different labels.
Labels are not provided by humans.
The process is intuitive for humans; we do it naturally.
Platonic Ideals
Greek philosopher Plato considered the concept of ideals
The Platonic ideal bird is the most bird-like bird
In clustering, we find these ideals as cluster centers
Data points are allocated to their nearest center
Squared Distance
Common choice: squared distance \[
d_{ij} = \|\mathbf{ x}_i - \boldsymbol{ \mu}_j\|^2
\]
Goal: find centres close to many data points
Objective Function
Given a similarity measure, we need the number of cluster centres, \(K\).
Find their location by allocating each centre a subset of the points and minimising the sum of the squared errors, \[
E(\mathbf{M}) = \sum_{j=1}^K \sum_{i \in \mathbf{i}_j} \|\mathbf{ x}_i - \boldsymbol{ \mu}_j\|^2
\] where \(\mathbf{i}_j\) is the set of indices of data points allocated to the \(j\)th centre.
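As a quick illustration, here is a minimal numpy sketch of this objective; the array names (toy_X, toy_centres, toy_assign) are illustrative rather than taken from the lecture code.

```python
import numpy as np

def kmeans_objective(X, centres, assignments):
    """Sum of squared distances from each point to its allocated centre."""
    diffs = X - centres[assignments]   # per-point difference from its own centre
    return np.sum(diffs ** 2)

# toy example: three points, two centres
toy_X = np.array([[0.0, 0.0], [0.1, 0.2], [2.0, 2.1]])
toy_centres = np.array([[0.05, 0.1], [2.0, 2.0]])
toy_assign = np.array([0, 0, 1])       # index of the centre allocated to each point
print(kmeans_objective(toy_X, toy_centres, toy_assign))
```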
\(k\) -Means Clustering
\(k\) -means clustering is simple and quick to implement.
It is very sensitive to initialisation.
Initialisation
Initialisation is the process of selecting a starting set of parameters.
Optimisation result can depend on the starting point.
For \(k\) -means clustering you need to choose an initial set of centres.
The optimisation surface has many local optima, and the algorithm gets stuck in those near the initialisation.
\(k\) -means Algorithm
Simple iterative clustering algorithm
Key steps:
Initialize with random centres
Assign points to nearest center
Update centres as cluster means
Repeat until stable
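Below is a bare-bones numpy sketch of these steps, assuming a data matrix X with one point per row; it is illustrative only, not the lecture's implementation.

```python
import numpy as np

def kmeans(X, K, n_iters=100, seed=0):
    """Bare-bones k-means: random initialisation, assign, update, repeat until stable."""
    rng = np.random.default_rng(seed)
    centres = X[rng.choice(len(X), size=K, replace=False)]   # random initial centres
    for _ in range(n_iters):
        # assign each point to its nearest centre
        dists = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
        assignments = dists.argmin(axis=1)
        # update each centre to the mean of its allocated points
        new_centres = np.array([
            X[assignments == k].mean(axis=0) if np.any(assignments == k) else centres[k]
            for k in range(K)
        ])
        if np.allclose(new_centres, centres):   # no change: converged
            break
        centres = new_centres
    return centres, assignments
```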
Objective Function
Minimizes sum of squared distances: \[
E=\sum_{j=1}^K \sum_{i\ \text{allocated to}\ j} \left(\mathbf{ y}_{i, :} - \boldsymbol{ \mu}_{j, :}\right)^\top\left(\mathbf{ y}_{i, :} - \boldsymbol{ \mu}_{j, :}\right)
\]
Solution not guaranteed to be global or unique
Represents a non-convex optimization problem
Linkage Criteria
Single linkage : Distance between closest points in clusters
Complete linkage : Distance between farthest points in clusters
Average linkage : Average distance between all point pairs
Ward linkage : Minimises within-cluster variance increase
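To make the criteria concrete, here is a small hypothetical example computed with scipy; the Ward value follows the \(\Delta_{i,j}\) expression introduced below rather than scipy's internal update convention.

```python
import numpy as np
from scipy.spatial.distance import cdist

# two small, hypothetical clusters
A = np.array([[0.0, 0.0], [0.0, 1.0]])
B = np.array([[3.0, 0.0], [4.0, 1.0]])

D = cdist(A, B)                  # pairwise distances between points in A and B
single = D.min()                 # single linkage: closest pair
complete = D.max()               # complete linkage: farthest pair
average = D.mean()               # average linkage: mean over all pairs
n_a, n_b = len(A), len(B)        # Ward: size-weighted squared distance between centroids
ward = (n_a * n_b) / (n_a + n_b) * np.sum((A.mean(axis=0) - B.mean(axis=0)) ** 2)
print(single, complete, average, ward)
```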
Ward’s Criterion
Ward’s method minimizes the increase in within-cluster variance
Start with every data point as its own cluster of size 1.
At each step merge the pair of clusters with the smallest \[
\Delta_{i,j} = \frac{n_i n_j}{n_i + n_j} \|\boldsymbol{ \mu}_i - \boldsymbol{ \mu}_j\|^2
\]
Where \(\boldsymbol{ \mu}_i, \boldsymbol{ \mu}_j\) are cluster centroids and \(n_i, n_j\) are cluster sizes
Mathematical Derivation
Within-cluster sum of squares (\(E(\mathbf{M})\) ):
\(E(\mathbf{M}) = \sum_{i=1}^k \sum_{\mathbf{ x}\in C_i} \|\mathbf{ x}- \boldsymbol{ \mu}_i\|^2\)
Ward’s criterion: Minimise increase in \(E(\mathbf{M})\) when merging
This is the same objective as is used in \(k\)-means
Key insight: For spherical clusters, this is equivalent to minimising centroid distance weighted by cluster sizes
Mathematical Derivation of Ward Distance
Start with two clusters \(C_i\) and \(C_j\) with centroids \(\boldsymbol{ \mu}_i, \boldsymbol{ \mu}_j\)
After merging: new centroid \[
\boldsymbol{ \mu}_{ij} = \frac{n_i \boldsymbol{ \mu}_i + n_j \boldsymbol{ \mu}_j}{n_i + n_j}
\]
Increase in \(E(\mathbf{M})\) : \[
\Delta_{i,j} = \frac{n_i n_j}{n_i + n_j} \|\boldsymbol{ \mu}_i - \boldsymbol{ \mu}_j\|^2
\]
Mathematical Derivation - II
Step 1: Original \(E(\mathbf{M})\) for separate clusters \[
E(\mathbf{M})_{\text{original}} = \sum_{\mathbf{ x}\in C_i} \|\mathbf{ x}- \boldsymbol{ \mu}_i\|^2 + \sum_{\mathbf{ x}\in C_j} \|\mathbf{ x}- \boldsymbol{ \mu}_j\|^2
\]
Step 2: New \(E(\mathbf{M})\) after merging \[
E(\mathbf{M})_{\text{new}} = \sum_{\mathbf{ x}\in C_i \cup C_j} \|\mathbf{ x}- \boldsymbol{ \mu}_{ij}\|^2
\]
Step 3: Increase in \(E(\mathbf{M})\) \[
\Delta_{i,j} = E(\mathbf{M})_{\text{new}} - E(\mathbf{M})_{\text{original}}
\]
Mathematical Derivation - III
Expanding the new \(E(\mathbf{M})\) : \[
E(\mathbf{M})_{\text{new}} = \sum_{\mathbf{ x}\in C_i} \|\mathbf{ x}- \boldsymbol{ \mu}_{ij}\|^2 + \sum_{\mathbf{ x}\in C_j} \|\mathbf{ x}- \boldsymbol{ \mu}_{ij}\|^2
\]
Key identity
\[\begin{aligned}
\|\mathbf{ x}- \boldsymbol{ \mu}_{ij}\|^2 = & \|\mathbf{ x}- \boldsymbol{ \mu}_i\|^2 \\ & + \|\boldsymbol{ \mu}_i - \boldsymbol{ \mu}_{ij}\|^2 \\ & + 2(\mathbf{ x}- \boldsymbol{ \mu}_i)^\top(\boldsymbol{ \mu}_i - \boldsymbol{ \mu}_{ij})
\end{aligned}\]
Summing over \(C_i\), the cross term vanishes because \(\sum_{\mathbf{ x}\in C_i}(\mathbf{ x}- \boldsymbol{ \mu}_i) = \mathbf{0}\)
Mathematical Derivation - IV
After simplification
\[\begin{aligned}
E(\mathbf{M})_{\text{new}} = & E(\mathbf{M})_{\text{original}} \\ & + n_i \|\boldsymbol{ \mu}_i - \boldsymbol{ \mu}_{ij}\|^2 \\ & + n_j \|\boldsymbol{ \mu}_j - \boldsymbol{ \mu}_{ij}\|^2
\end{aligned}\]
Mathematical Derivation - V
Substituting the centroid formula \[
\boldsymbol{ \mu}_{ij} = \frac{n_i \boldsymbol{ \mu}_i + n_j \boldsymbol{ \mu}_j}{n_i + n_j}
\]
Final result: \[
\Delta_{i,j} = \frac{n_i n_j}{n_i + n_j} \|\boldsymbol{ \mu}_i - \boldsymbol{ \mu}_j\|^2
\]
Local Optimality
Spherical clusters have minimal \(E(\mathbf{M})\) for given number of points
Ward’s method makes locally optimal choices at each merge step
The weighting \(\frac{n_i n_j}{n_i + n_j}\) prevents premature merging of large clusters
Algorithm
Step 1 : Start with \(n\) clusters (one per point)
Step 2 : Compute distance matrix between all clusters
Step 3 : Find and merge the two closest clusters
Step 4 : Update distance matrix
Step 5 : Repeat until one cluster remains
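In practice these steps are rarely hand-coded; a typical sketch with scipy's agglomerative clustering (using Ward's criterion, on hypothetical blob data) looks like this.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster, dendrogram

rng = np.random.default_rng(0)
# three hypothetical, well-separated blobs in two dimensions
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(30, 2)) for c in (0.0, 3.0, 6.0)])

Z = linkage(X, method="ward")                     # runs steps 1-5 above with Ward's criterion
labels = fcluster(Z, t=3, criterion="maxclust")   # cut the tree into three clusters
# dendrogram(Z) draws the merge tree (requires matplotlib)
```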
Ward’s Criterion on Artificial Data
Hierarchical clustering of some artificial data. On the left we have an artificially generated data set containing three clusters. On the right we can see the dendrogram formed by clustering using Ward’s criterion.
Oil Flow Data
Visualization of the first two dimensions of the oil flow data from Bishop and James (1993)
Hierarchical Clustering of Oil Flow Data
Hierarchical clustering applied to oil flow data. The dendrogram shows how different flow regimes are grouped based on their measurement similarities. The three main flow regimes (homogeneous, annular, and laminar) should form distinct clusters.
Phylogenetic Trees
Hierarchical clustering of genetic sequence data
Creates evolutionary trees showing species relationships
Estimates common ancestors and mutation timelines
Critical for tracking viral evolution and outbreaks
Product Clustering
Hierarchical clustering for e-commerce products
Creates product taxonomy trees
Splits into nested categories (e.g. Electronics → Phones → Smartphones)
Hierarchical Clustering Challenge
Many products belong in multiple clusters (e.g. running shoes are both ‘sporting goods’ and ‘clothing’)
Tree structures are too rigid for natural categorization
Human concept learning is more flexible:
Forms overlapping categories
Learns abstract rules
Builds causal theories
Thinking in High Dimensions
Mixtures of Gaussians
Two dimensional Gaussian data set.
Mixtures of Gaussians
Two dimensional data sets. Complex structure is not a problem for mixtures of Gaussians.
Thinking in High Dimensions
Two dimensional plots of Gaussians can be misleading.
Our low dimensional intuitions can fail dramatically.
Two major issues:
In high dimensions all the data moves to a ‘shell.’ There is nothing near the mean!
Distances between points become constant.
These effects apply to many densities.
Let’s consider a Gaussian “egg.”
Distance from a Mean
Distance from mean of the density (circle) to a given data point (square).
Dimensionality Greater than Three
1D Egg
Volumes associated with the one dimensional Gaussian egg. Here the yolk has 65.8%, the green has 4.8% and the white has 29.4% of the mass.
Here the yolk has 65.8%, the green has 4.8% and the white has 29.4%
2D Egg
Volumes associated with the regions in the two dimensional Gaussian egg. The yolk contains 59.4%, the green contains 7.4% and the white 33.2%.
Here the yolk has 59.4%, the green has 7.4% and the white has 33.2%
3D Egg
Volumes associated with the regions in the three dimensional Gaussian egg. Here the yolk has 56.1% the green has 9.2% the white has 34.7%.
Here the yolk has 56.1%, the green has 9.2% and the white has 34.7%
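These percentages can be reproduced numerically if we assume the yolk and green boundaries sit at 0.95 and 1.05 times \(\sqrt{d}\) standard deviations, so that the squared radius is compared against a \(\chi^2_d\) variable; a short check (this boundary convention is an assumption, chosen because it matches the quoted figures):

```python
from scipy.stats import chi2

# assumed boundaries at 0.95*sqrt(d) and 1.05*sqrt(d) standard deviations,
# so the squared-radius thresholds are d*0.95**2 and d*1.05**2 for a chi^2_d variable
for d in (1, 2, 3):
    yolk = chi2.cdf(d * 0.95 ** 2, df=d)
    green = chi2.cdf(d * 1.05 ** 2, df=d) - yolk
    white = 1.0 - yolk - green
    print(d, round(100 * yolk, 1), round(100 * green, 1), round(100 * white, 1))
# prints approximately: 1 65.8 4.8 29.4 / 2 59.4 7.4 33.2 / 3 56.1 9.2 34.7
```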
Gamma Properties
Mean: \(\frac{a}{b}\)
Variance: \(\frac{a}{b^{2}}\)
Support: \(x > 0\) (positive numbers only)
Special Cases
Exponential: \(a=1\), rate \(b\) \(\rightarrow\) exponential density \(b e^{-bx}\)
\(\chi^2\) (chi-squared, 1 degree of freedom): \(a=\frac{1}{2}\), \(b=\frac{1}{2}\) \(\rightarrow\) \(\chi_{1}^{2}\left(x\right)\)
Other Important Applications
Conjugate prior for Gaussian precision (inverse variance)
Modeling waiting times and lifetimes
Additive Property
If \[
x_k \sim \text{Gamma}(a_k, b)
\] for \(k=1,\ldots,d\)
Then \[
\sum_{k=1}^dx_k \sim \text{Gamma}(\sum_{k=1}^d a_k, b)
\]
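A sampling check of the additive property (the shape and rate values are illustrative):

```python
import numpy as np

a1, a2, b = 1.5, 2.5, 2.0        # two shape parameters, one shared rate (illustrative)
rng = np.random.default_rng(0)
x1 = rng.gamma(shape=a1, scale=1 / b, size=100_000)
x2 = rng.gamma(shape=a2, scale=1 / b, size=100_000)
s = x1 + x2

# compare sample moments with Gamma(a1 + a2, b): mean (a1 + a2)/b, variance (a1 + a2)/b^2
print(s.mean(), (a1 + a2) / b)
print(s.var(), (a1 + a2) / b ** 2)
```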
Warning: Parameterisation Confusion
Rate parameter: \(b\) (used here)
Scale parameter: \(\beta = b^{-1}\) (alternative)
Watch out for different software conventions.
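For example, scipy.stats.gamma is parameterised by shape and scale, so the rate \(b\) used in these notes must be inverted:

```python
from scipy.stats import gamma

a, b = 0.5, 2.0                  # shape a and *rate* b, as in these notes
dist = gamma(a, scale=1.0 / b)   # scipy uses a *scale* parameter, so pass 1/b
print(dist.mean(), a / b)        # both 0.25
print(dist.var(), a / b ** 2)    # both 0.125
```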
Mathematics
What is the density of probability mass?
For a \(d\) -dimensional Gaussian distribution:
Individual components : \[
y_{i,k} \sim \mathscr{N}\left(0,\sigma^2\right)
\]
Mathematics
What is the density of probability mass?
For a \(d\) -dimensional Gaussian distribution:
Squared components : \[
y_{i,k}^2 \sim \sigma^2 \chi_1^2
\] (scaled chi-squared)
Mathematics
What is the density of probability mass?
For a \(d\) -dimensional Gaussian distribution:
Gamma distribution : \[
y_{i,k}^2 \sim \mathscr{G}\left(\frac{1}{2},\frac{1}{2\sigma^2}\right)
\]
Mathematics
What is the density of probability mass?
For a \(d\) -dimensional Gaussian distribution:
Sum of squares : \[
\sum_{k=1}^dy_{i,k}^2 \sim \mathscr{G}\left(\frac{d}{2},\frac{1}{2\sigma^2}\right)
\]
Mathematics
What is the density of probability mass?
For a \(d\) -dimensional Gaussian distribution:
Expected value :
\[
\left\langle\sum_{k=1}^dy_{i,k}^2\right\rangle = d\sigma^2
\]
Mathematics
What is the density of probability mass?
Normalized sum : \[
\frac{1}{d}\sum_{k=1}^dy_{i,k}^2 \sim \mathscr{G}\left(\frac{d}{2},\frac{d}{2\sigma^2}\right)
\]
Mathematics
What is the density of probability mass?
Expected normalised : \[
\frac{1}{d}\left\langle\sum_{k=1}^dy_{i,k}^2\right\rangle = \sigma^2
\]
Distribution of Mass against Dimensionality
chi-squared Distributions
\(z^2\)
The \(\chi^2_1\) distribution, which gives the distribution of the square of a standardised normal variable, \(z^2\).
\(z_1^2 + z_2^2\)
The scaled \(\chi^2\), equivalent to the sum of the squares of two standardised normal variables, \(z_1^2 + z_2^2\).
\(\sum_{i=1}^5 z^2_i\)
The scaled \(\chi^2\), equivalent to the sum of the squares of five standardised normal variables, \(\sum_{i=1}^5 z^2_i\).
Square of Sample
Square of sample from Gaussian is scaled chi-squared density
The scaled \(\chi^2\) density is a gamma density with \(a=\frac{1}{2}\), \(b=\frac{1}{2\sigma^{2}}\), \[
\mathscr{G}\left(x|a,b\right)=\frac{b^{a}}{\Gamma\left(a\right)}x^{a-1}e^{-bx}
\]
Distance Distributions
The sum of gamma random variables with the same rate is gamma distributed with the sum of the shape parameters (the \(y_{i,k}\) are independent)
Scaling a gamma random variable rescales the rate parameter (multiplying the variable by \(c\) divides the rate by \(c\))
Where is the Mass?
Plot of probability mass versus dimension, showing the volume of density inside 0.95 of a standard deviation (yellow), between 0.95 and 1.05 standard deviations (green), and over 1.05 standard deviations (white).
Proportions of volumes between yolk, green and white as \(d\rightarrow 1024\) (log scale)
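A sketch of how such proportions can be computed, again assuming boundaries at \(0.95\sqrt{d}\) and \(1.05\sqrt{d}\) standard deviations:

```python
import numpy as np
from scipy.stats import chi2

# proportions of mass in yolk, green and white as d grows to 1024,
# with the assumed boundaries at 0.95*sqrt(d) and 1.05*sqrt(d) standard deviations
for d in 2 ** np.arange(11):     # d = 1, 2, 4, ..., 1024
    yolk = chi2.cdf(d * 0.95 ** 2, df=d)
    green = chi2.cdf(d * 1.05 ** 2, df=d) - yolk
    print(int(d), round(yolk, 3), round(green, 3), round(1 - yolk - green, 3))
```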
Looking at Gaussian Samples
Looking at a projected Gaussian. This plot shows, in two dimensions, samples from a potentially very high dimensional Gaussian density. The mean of the Gaussian is at the origin. There appears to be a lot of data near the mean, but when we bear in mind that the original data was sampled from a much higher dimensional Gaussian we realize that the data has been projected down to the mean from those other dimensions that we are not visualizing.
High Dimensional Gaussians and Interpoint Distances
Interpoint Distances
The other effect in high dimensions is that all points become equidistant.
We can show this for Gaussians with a proof similar to the one above.
Interpoint Distance Analysis
For two points \(i\) and \(j\) in d-dimensional space:
Individual components : \(y_{i,k} \sim \mathscr{N}\left(0, \sigma_k^2\right)\) and \(y_{j,k} \sim \mathscr{N}\left(0, \sigma_k^2\right)\)
Difference : \(y_{i,k} - y_{j,k} \sim \mathscr{N}\left(0, 2\sigma_k^2\right)\)
Squared difference : \((y_{i,k} - y_{j,k})^2 \sim \mathscr{G}\left(\frac{1}{2},\frac{1}{4\sigma_k^2}\right)\)
Interpoint Distance Analysis
For spherical Gaussian where \(\sigma_k^2 = \sigma^2\) :
Sum of squared differences : \[
\sum_{k=1}^d(y_{i,k} - y_{j,k})^2 \sim \mathscr{G}\left(\frac{d}{2},\frac{1}{4\sigma^2}\right)
\]
Interpoint Distance Analysis
Normalised distance : \[
\frac{1}{d}\sum_{k=1}^d(y_{i,k} - y_{j,k})^2 \sim \mathscr{G}\left(\frac{d}{2},\frac{d}{4\sigma^2}\right)
\]
Key Results
Mean squared distance: \(2\sigma^2\)
Variance: \(\frac{8\sigma^4}{d}\) (decreases with dimension!)
All points become equidistant as \(d\to \infty\)
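These results can be checked by simulation; the sketch below samples random pairs of points from a spherical Gaussian and compares the empirical mean and variance of the normalised squared distances with the predictions (the sizes are illustrative).

```python
import numpy as np

d, n, sigma2 = 1000, 500, 1.0
rng = np.random.default_rng(0)
Y = rng.normal(0.0, np.sqrt(sigma2), size=(n, d))

# normalised squared distances between randomly chosen pairs of distinct points
i = rng.integers(0, n, size=5000)
j = rng.integers(0, n, size=5000)
keep = i != j
dist2 = ((Y[i[keep]] - Y[j[keep]]) ** 2).sum(axis=1) / d

print(dist2.mean(), 2 * sigma2)           # mean close to 2 sigma^2
print(dist2.var(), 8 * sigma2 ** 2 / d)   # variance close to 8 sigma^4 / d
```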
Central Limit Theorem and Non-Gaussian Case
Analytic for spherical, independent Gaussian data.
For independent data, the central limit theorem applies.
The mean squared distance in high D is the mean of the variances.
The variance about the mean scales as \(d^{-1}\) .
Summary
In high dimensions, if the individual dimensions are independent, distributions behave counter-intuitively.
All data sits at one standard deviation from the mean.
The densities of squared distances can be analytically calculated for the Gaussian case.
Summary
For non-Gaussian systems we can invoke the central limit theorem.
Next we will consider example data sets and see how their interpoint distances are distributed.
Sanity Check
Data sampled from independent Gaussian distribution
If dimensions are independent, we expect low variance, Gaussian behavior for the distribution of squared distances.
Sanity Check
A good match between theory and the samples for a 1000 dimensional Gaussian distribution.
Distance distribution for a Gaussian with \(d=1000\) , \(n=1000\) with theoretical curve
Sanity Check
Same data generation, but fewer data points.
If dimensions are independent, we expect low variance, Gaussian behaviour for the distribution of squared distances.
Sanity Check
A good match between theory and the samples for a 1000 dimensional Gaussian distribution with 100 points.
Distance distribution for a Gaussian with \(d=1000\) and \(n=100\) with theoretical curve
Simulation of oil flow
Histogram of interpoint squared distances for the oil flow simulation.
OSU Motion Capture Data: Run 1
Ohio State University motion capture
Histogram of interpoint squared distances for the Ohio State University motion capture data.
Robot Wireless Data
Ground truth movement for the position taken while recording the multivariate time-course of wireless access point signal strengths.
Robot WiFi Data
Output dimension 1 from the robot wireless data. This plot shows signal strength changing over time.
Robot wireless navigation
Histogram of interpoint squared distances for the robot wireless navigation data.
Where does practice depart from our theory?
The situation for real data does not reflect what we expect.
Real data exhibits greater variances on interpoint distances.
Somehow the real data seems to have a smaller effective dimension.
Let’s look at another \(d=1000\) Gaussian.
1000-D Gaussian
Distance distribution for a different Gaussian with \(d=1000\)
Interpoint squared distance distribution for Gaussian with \(d=1000\) but low rank covariance.
Interpoint squared distance distribution for Gaussian with \(d=1000\) and theoretical curve for \(d=2\) .
Gaussian has a specific low rank covariance matrix \(\mathbf{C}=\mathbf{W}\mathbf{W}^{\top}+\sigma^{2}\mathbf{I}\) .
Take \(\sigma^{2}=10^{-2}\) and sample \(\mathbf{W}\in\Re^{1000\times2}\) from \(\mathscr{N}\left(0,1\right)\).
Theoretical curve taken assuming dimensionality of 2.
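A sketch of this experiment, under the stated sampling assumptions: the empirical variance of the normalised squared distances comes out far larger than the independent \(d=1000\) prediction and close to the prediction with an effective dimensionality of 2.

```python
import numpy as np

d, n, sigma2 = 1000, 1000, 1e-2
rng = np.random.default_rng(0)
W = rng.normal(size=(d, 2))                 # W in R^{1000 x 2}, entries from N(0, 1)
# samples with covariance C = W W^T + sigma^2 I
Y = rng.normal(size=(n, 2)) @ W.T + rng.normal(scale=np.sqrt(sigma2), size=(n, d))

# normalised squared distances between random pairs of distinct points
i, j = rng.integers(0, n, 5000), rng.integers(0, n, 5000)
keep = i != j
dist2 = ((Y[i[keep]] - Y[j[keep]]) ** 2).sum(axis=1) / d

v = dist2.mean() / 2                        # implied per-dimension variance
print(dist2.var())                          # observed variance of squared distances
print(8 * v ** 2 / d, 8 * v ** 2 / 2)       # d = 1000 prediction vs effective d = 2 prediction
```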