EM Algorithm for Gaussian Mixture Models

In an era dominated by multi-billion parameter neural networks, it is easy to dismiss classical statistical models as relics of the past. However, during my recent assignment implementing a Gaussian Mixture Model (GMM) from scratch, I was reminded of the sheer mathematical elegance that powers these traditional techniques — and reintroduced to the Expectation-Maximization (EM) algorithm.

1. What is a GMM?

A GMM assumes all data points are generated from a mixture of Gaussian distributions. The probability density function of a sample \(\boldsymbol{x}_i \in \mathbb{R}^d\) is:

\[p(\boldsymbol{x}_i \mid \boldsymbol{\theta}) = \sum_{j=1}^{k} \pi_j \mathcal{N}(\boldsymbol{x}_i \;; \boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j)\]

Where \(\pi_j\) is the mixing coefficient of the \(j\)-th cluster, and \(\boldsymbol{\mu}_j\), \(\boldsymbol{\Sigma}_j\) are the mean and covariance of each Gaussian component.

GMMs are ideal for low-to-medium dimensional continuous data. Unlike hard clustering (k-means), GMMs perform soft clustering — assigning probabilities rather than rigid labels. In credit risk modeling, for example, a GMM might say “this profile has a 75% chance of defaulting” rather than forcing a binary decision.

2. The Core Magic: EM Algorithm

Fitting a GMM via Maximum Likelihood requires maximizing the log-likelihood:

\[\ell(\boldsymbol{\theta}) = \sum_{i=1}^{n} \log \left[ \sum_{j=1}^{k} \pi_j \mathcal{N}(\boldsymbol{x}_i \;; \boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j) \right]\]

The problem: a logarithm of a sum that cannot be solved analytically. The EM algorithm bypasses this by introducing a responsibility matrix \(F_{ij}\) and using Jensen’s Inequality to construct a tractable lower bound:

\[\ell(\boldsymbol{\theta}) \ge \sum_{i=1}^{n} \sum_{j=1}^{k} F_{ij} \log \left[ \frac{p_{\boldsymbol{\theta}}(\boldsymbol{x}_i, Z_i=j)}{F_{ij}} \right] = Q(\mathbf{F}, \boldsymbol{\theta})\]

This transforms a messy \(\log \sum\) into a differentiable \(\sum \log\). The algorithm then alternates between two steps:

E-Step: Fix \(\boldsymbol{\theta}\), compute responsibilities \(F_{ij}\) — the probability that cluster \(j\) generated sample \(i\).

M-Step: Fix \(F_{ij}\), update parameters in closed form:

\[\boldsymbol{\mu}_j^{\text{new}} = \frac{\sum_{i=1}^{n} F_{ij} \boldsymbol{x}_i}{\sum_{i=1}^{n} F_{ij}} \quad , \quad \boldsymbol{\Sigma}_j^{\text{new}} = \frac{\sum_{i=1}^{n} F_{ij} (\boldsymbol{x}_i - \boldsymbol{\mu}_j)(\boldsymbol{x}_i - \boldsymbol{\mu}_j)^T}{\sum_{i=1}^{n} F_{ij}}\]

Every iteration strictly increases the log-likelihood until convergence.

3. Implementation Challenges

Two real-world pitfalls worth knowing:

Local Minima — The log-likelihood landscape is non-convex. Poor initialization leads to subpar solutions. Fix: run multiple random initializations and keep the best result.

Covariance Collapse — In high-dimensional data (e.g. flattened MNIST images), a cluster can collapse onto a single outlier, driving variance to zero and crashing the likelihood to infinity. Fix: clip the variance floor with np.maximum(sigma_sq, 1e-2).

Final Thoughts

Implementing EM from scratch offered a perspective that backpropagation never could — taking a mathematically intractable problem, approximating it with an elegant lower bound, and optimizing it through rigorous probabilistic reasoning.

Sometimes, looking back at classical machine learning is exactly what you need to move forward as a better engineer.




Enjoy Reading This Article?

Here are some more articles you might like to read next:

  • Supervised Recurrent Networks and the GRU to Vanishing Gradients