Gaussian Mixture Models
Returning to the Mystery
Remember Chibany’s original puzzle from Chapter 1? They had mystery bentos with two peaks in their weight distribution, but the average fell in a valley where no individual bento existed.
We now have all the tools to solve this completely:
- Chapter 1: Expected value paradox in mixtures
- Chapter 2: Continuous probability (PDFs, CDFs)
- Chapter 3: Gaussian distributions
- Chapter 4: Bayesian learning for parameters
Now we combine them: What if we have multiple Gaussian distributions mixed together, and we need to figure out both which component each observation belongs to AND the parameters of each component?
This is a Gaussian Mixture Model (GMM).
📚 Prerequisite: Understanding Categorization
Before tackling the full GMM learning problem, make sure you understand categorization in mixture models with known parameters.
⚠️ Recommended Preparation
If you haven’t already, work through the Gaussian Clusters assignment from Chapter 4:
📝 Assignment: Open in Colab: solution_2_gaussian_clusters.ipynb
📓 Interactive exploration: Open in Colab: gaussian_bayesian_interactive_exploration.ipynb (Part 2)
Why this matters:
- Chapter 4 Problem 2 teaches you how to compute P(category | observation) when parameters are known
- This chapter (5) extends that to learning parameters when they are unknown
- Understanding categorization with known parameters is essential before attempting to learn them!
What you’ll practice:
- Using Bayes’ rule: P(c|x) = p(x|c)P(c) / p(x)
- Computing marginal distributions: p(x) = Σ_c p(x|c)P(c)
- Understanding decision boundaries and how priors/variances affect them
- Visualizing bimodal vs. unimodal mixture distributions
The Bridge: Known Parameters → Unknown Parameters
In Chapter 4 Problem 2, you learned:
- Given: μ₁, μ₂, σ₁², σ₂², θ (all known)
- Infer: Which category for each observation?
- Formula: P(c=1|x) = θ·N(x;μ₁,σ₁²) / [θ·N(x;μ₁,σ₁²) + (1-θ)·N(x;μ₂,σ₂²)]
In this chapter, we tackle the harder problem:
- Given: Only observations x₁, x₂, …, xₙ
- Infer: Categories AND parameters μ₁, μ₂, σ₁², σ₂², θ
- Method: Expectation-Maximization (EM) algorithm
Think of it as:
- First (Chapter 4 Problem 2): “I know the recipe for tonkatsu (μ₁, σ₁²) and hamburger (μ₂, σ₂²). Given a weight, which is it?”
- Now (Chapter 5): “I don’t know the recipes! Can I figure them out from weights alone?”
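To make the Chapter 4 side of this bridge concrete, here is a minimal plain-Python sketch of the known-parameters case (the parameter values are illustrative assumptions, not values learned from data):

```python
from scipy.stats import norm

# Known parameters (illustrative assumptions)
theta = 0.7                  # P(c = 1): prior probability of tonkatsu
mu1, sigma1 = 500.0, 2.0     # tonkatsu: mean and std dev (g)
mu2, sigma2 = 350.0, 2.0     # hamburger: mean and std dev (g)

def p_tonkatsu_given_x(x):
    """Bayes' rule from above: P(c=1|x) = theta*N(x; mu1, sigma1^2) / p(x)."""
    num = theta * norm.pdf(x, mu1, sigma1)
    den = num + (1 - theta) * norm.pdf(x, mu2, sigma2)
    return num / den

print(p_tonkatsu_given_x(498.0))   # close to 1.0 -> almost certainly tonkatsu
print(p_tonkatsu_given_x(352.0))   # close to 0.0 -> almost certainly hamburger
```

This is the "known recipe" computation. The rest of the chapter asks how to get mu1, mu2, the variances, and theta in the first place.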
The Complete Problem
Chibany receives 20 mystery bentos. They measure their weights:
[498, 352, 501, 349, 497, 503, 351, 500, 348, 502,
 499, 350, 498, 353, 501, 347, 499, 502, 352, 500]

Looking at the histogram, they see two clear clusters around 350g and 500g.
The questions:
- How many types of bentos are there? (We’ll assume 2 for now)
- Which type is each bento? (Classification problem)
- What are the parameters for each type? (Learning problem)
Gaussian Mixture Model: The Math
A GMM says each observation comes from one of K Gaussian components:
$$p(x) = \sum_{k=1}^{K} \pi_k \cdot \mathcal{N}(x | \mu_k, \sigma_k^2)$$
Where:
- π_k: Mixing proportion (probability of component k)
- μ_k: Mean of component k
- σ_k²: Variance of component k
Constraint: $\sum_{k=1}^{K} \pi_k = 1$ (probabilities must sum to 1)
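As a quick illustration, the mixture density is just a weighted sum of Gaussian densities. Here is a minimal NumPy sketch (the component values are illustrative assumptions):

```python
import numpy as np
from scipy.stats import norm

# Illustrative two-component mixture
pis    = np.array([0.7, 0.3])      # mixing proportions, must sum to 1
mus    = np.array([500.0, 350.0])  # component means (g)
sigmas = np.array([2.0, 2.0])      # component standard deviations (g)

def mixture_pdf(x):
    """p(x) = sum_k pi_k * N(x; mu_k, sigma_k^2)"""
    x = np.atleast_1d(x)[:, None]                      # shape (n, 1)
    return np.sum(pis * norm.pdf(x, mus, sigmas), axis=1)

print(mixture_pdf([350.0, 425.0, 500.0]))
# High density near both means, essentially zero in the 425g "valley"
```

Notice how this sketch reproduces Chapter 1's paradox: the density is high at both cluster centers but nearly zero at their average.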
The Generative Story
- Choose a component: Sample k ~ Categorical(π₁, π₂, …, πₖ)
- Generate observation: Sample x ~ N(μₖ, σₖ²)
This is exactly what GenJAX is built for!
📘 Foundation Concept: Discrete + Continuous Together
Notice the beautiful combination here!
Step 1 is discrete (like Tutorial 1):
- Choose which component: k ~ Categorical(π₁, π₂, …, πₖ)
- This is just like choosing between {hamburger, tonkatsu}
- We’re counting discrete outcomes (component 1, component 2, …)
- From Tutorial 1: Random variables map outcomes to values
Step 2 is continuous (like Tutorial 3):
- Generate the actual weight: x ~ N(μₖ, σₖ²)
- This uses probability density we learned in Chapter 2
- We’re measuring continuous values (350g, 500g, …)
Why this matters:
- Real problems often combine both!
- Discrete choices (which category?) + Continuous measurements (what value?)
- Tutorial 1’s logic (discrete counting) works alongside Tutorial 3’s tools (continuous density)
- GenJAX handles both seamlessly in the same model
The power: Mixture models show that discrete and continuous probability aren’t separate worlds—they work together to model rich, real-world phenomena.
Two-Component Bento Model
For Chibany’s bentos with K=2 (tonkatsu and hamburger):
Component 1 (Tonkatsu):
- π₁ = 0.7 (70% of bentos)
- μ₁ = 500g
- σ₁² = 4 (std dev = 2g)
Component 2 (Hamburger):
- π₂ = 0.3 (30% of bentos)
- μ₂ = 350g
- σ₂² = 4 (std dev = 2g)
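Conceptually, sampling from this model looks like the following plain-NumPy sketch (a stand-in for the chapter's GenJAX cell; your random draws will differ from the output shown below):

```python
import numpy as np

rng = np.random.default_rng(0)

# Mixture parameters from above
pis    = np.array([0.7, 0.3])        # P(tonkatsu), P(hamburger)
mus    = np.array([500.0, 350.0])    # mean weight of each component (g)
sigmas = np.array([2.0, 2.0])        # std dev of each component (g)

n = 20
z = rng.choice(2, size=n, p=pis)          # step 1: pick a component per bento
weights = rng.normal(mus[z], sigmas[z])   # step 2: draw a weight from that Gaussian

print(f"Generated {np.sum(z == 0)} tonkatsu and {np.sum(z == 1)} hamburger bentos")
print("Weights:", np.round(weights, 1))
```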
Output:
Generated 14 tonkatsu and 6 hamburger bentos
Weights: [501.2 349.8 499.5 351.3 498.7 502.1 350.5 ...]

The Inference Problem
Forward (Generative): Given parameters (π, μ, σ²), generate observations ✅
Backward (Inference): Given observations, infer parameters (π, μ, σ²) and assignments ❓
This is harder! We need to solve:
- Which component did each observation come from?
- What are the parameters (μ₁, μ₂, σ₁², σ₂²)?
- What are the mixing proportions (π₁, π₂)?
These problems are interdependent:
- If we knew the assignments, we could easily estimate parameters (just compute means/variances per component)
- If we knew the parameters, we could compute assignment probabilities (which Gaussian is each point closer to?)
Classic chicken-and-egg problem!
Understanding the Inference Challenge
If we knew which type each bento was, learning would be straightforward - just compute the mean and variance for each group. Conversely, if we knew the true parameters, we could compute which component each observation likely came from.
This chicken-and-egg problem is exactly what probabilistic inference is designed to solve. Instead of point estimates, we’ll use GenJAX to reason about the full posterior distribution over both parameters and assignments.
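The Expectation-Maximization algorithm mentioned earlier breaks this circle by alternating the two easy steps: estimate assignments given parameters (E-step), then re-estimate parameters given those soft assignments (M-step). Here is a minimal NumPy sketch for K=2 on the bento weights (the initial guesses are arbitrary assumptions, not the chapter's exact code):

```python
import numpy as np
from scipy.stats import norm

weights = np.array([498, 352, 501, 349, 497, 503, 351, 500, 348, 502,
                    499, 350, 498, 353, 501, 347, 499, 502, 352, 500], dtype=float)

# Rough initial guesses (assumptions; any reasonable spread works)
mu  = np.array([400.0, 450.0])
sig = np.array([50.0, 50.0])
pi  = np.array([0.5, 0.5])

for _ in range(50):
    # E-step: responsibilities r[i, k] = P(z_i = k | x_i, current parameters)
    lik = pi * norm.pdf(weights[:, None], mu, sig)     # shape (n, K)
    r = lik / lik.sum(axis=1, keepdims=True)

    # M-step: re-estimate parameters from responsibility-weighted data
    nk  = r.sum(axis=0)
    pi  = nk / len(weights)
    mu  = (r * weights[:, None]).sum(axis=0) / nk
    sig = np.sqrt((r * (weights[:, None] - mu) ** 2).sum(axis=0) / nk)

print(np.round(mu, 1), np.round(sig, 2), np.round(pi, 2))
# Converges to component means near 350g and 500g
```

Note that EM returns a single point estimate of the parameters and assignments; the Bayesian treatment below keeps the full posterior instead.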
Bayesian GMM with GenJAX
Now let’s implement a fully Bayesian version using GenJAX, where we treat component assignments as latent variables to infer:
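Conceptually, the model looks like the following forward sampler, sketched here in plain NumPy rather than as a full GenJAX program (the priors and variable names are illustrative assumptions, not the chapter's exact code):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_bayesian_gmm(n=20, K=2):
    """Forward-sample the Bayesian GMM: priors -> parameters -> assignments -> data."""
    pi    = rng.dirichlet(np.ones(K))              # prior over mixing proportions
    mu    = rng.normal(425.0, 100.0, size=K)       # vague prior over component means (g)
    sigma = rng.gamma(2.0, 2.0, size=K)            # prior over component std devs (g)
    z     = rng.choice(K, size=n, p=pi)            # latent assignment for each bento
    x     = rng.normal(mu[z], sigma[z])            # observed weights
    return dict(pi=pi, mu=mu, sigma=sigma, z=z, x=x)

# Inference runs this story "backwards": condition on the observed x and ask which
# combinations of (pi, mu, sigma, z) could plausibly have produced it.
print(sample_bayesian_gmm()["x"].round(1))
```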
Note: The above shows the conceptual structure. In practice, GMM inference with GenJAX requires careful implementation of inference algorithms. We’ll explore more sophisticated inference techniques including MCMC and variational methods in later chapters. The Bayesian approach becomes particularly powerful for more complex models like DPMM (Chapter 6), where we want to reason about uncertainty in the number of components.
Model Selection: How Many Components?
How do we know K=2? What if there are 3 types of bentos, or 5?
In traditional approaches, you would fit multiple models with different K values and use criteria like BIC (Bayesian Information Criterion) to select the best one.
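For example, the traditional recipe can be sketched with scikit-learn's GaussianMixture (used here purely for illustration; it is an assumption, not part of the chapter's GenJAX code):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

weights = np.array([498, 352, 501, 349, 497, 503, 351, 500, 348, 502,
                    499, 350, 498, 353, 501, 347, 499, 502, 352, 500], dtype=float)
X = weights.reshape(-1, 1)

# Fit GMMs with different numbers of components and compare BIC (lower is better)
for k in range(1, 5):
    gmm = GaussianMixture(n_components=k, random_state=0).fit(X)
    print(k, round(gmm.bic(X), 1))
# For well-separated clusters like these, BIC typically points to K=2
```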
However, in fully Bayesian inference (which we’ll explore more in Chapter 6), we can treat K itself as a random variable and let the data inform us about the likely number of components through the posterior distribution.
Real-World Applications
GMMs aren’t just for bentos. They appear everywhere:
Image Segmentation
- Each pixel belongs to one of K clusters (e.g., foreground vs. background)
- Learn cluster parameters from pixel intensities
Speaker Identification
- Audio features from different speakers cluster differently
- GMM models the distribution of vocal characteristics
Anomaly Detection
- Normal data fits a mixture of typical patterns
- Outliers have low probability under all components
Customer Segmentation
- Customers cluster by behavior (high spenders, occasional buyers, etc.)
- Each segment modeled as a Gaussian in feature space
Practice Problems
Problem 1: Three Coffee Blends
A café serves three coffee blends. You measure 30 caffeine levels (mg/cup):
[82, 118, 155, 80, 120, 158, 79, 115, 160, 83, 121, 157,
81, 119, 156, 84, 117, 159, 78, 122, 154, 82, 116, 158,
80, 120, 155, 81, 118, 157]

a) Extend the Bayesian GMM code to K=3 components.
b) What prior distributions would be appropriate for the means if you know caffeine levels range from 50-200mg?
c) How would you interpret the posterior distribution over component assignments?
Show Solution
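A sketch of one possible approach, mirroring the plain-NumPy stand-in used earlier in the chapter (the priors here are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_three_blend_model(n=30, K=3):
    """a) Same generative structure as the K=2 bento model, just with K=3 components."""
    pi    = rng.dirichlet(np.ones(K))              # mixing proportions over 3 blends
    # b) Knowing caffeine lies in 50-200 mg suggests a prior concentrated on that range,
    #    e.g. Uniform(50, 200) on each mean (or a wide Normal centred near 125 mg).
    mu    = rng.uniform(50.0, 200.0, size=K)       # blend means (mg/cup)
    sigma = rng.gamma(2.0, 2.0, size=K)            # blend std devs (mg/cup)
    z     = rng.choice(K, size=n, p=pi)            # latent blend for each cup
    x     = rng.normal(mu[z], sigma[z])            # observed caffeine levels
    return dict(pi=pi, mu=mu, sigma=sigma, z=z, x=x)

print(sample_three_blend_model()["x"].round(1))

# c) Conditioning this model on the 30 measurements yields, for each cup i, a posterior
#    P(z_i = k | data) over the three blends; cups that fall between two cluster centres
#    receive the most uncertain assignments.
```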
Problem 2: Understanding Uncertainty
Using the Bayesian GMM for the bento data:
a) How would you quantify uncertainty about which component a particular observation belongs to?
b) How is this different from a point estimate of the assignment?
Show Solution
a) In the Bayesian approach, we get a full posterior distribution over component assignments. For each observation i, we can compute:
- P(z_i = 0 | data) - probability it’s component 0
- P(z_i = 1 | data) - probability it’s component 1
An observation near the decision boundary might have P(z_i = 0) ≈ 0.5, showing high uncertainty.
b) A point estimate would simply assign each observation to its most likely component, discarding information about confidence. The Bayesian approach preserves this uncertainty, which is crucial for:
- Identifying ambiguous cases
- Propagating uncertainty to downstream tasks
- Making better decisions under uncertainty
For example, a bento weighing 425g (right between the two clusters) would have high assignment uncertainty that we shouldn’t ignore.
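As a concrete check, here is a minimal sketch that computes these assignment probabilities, assuming equal mixing proportions and the component parameters used earlier in the chapter (both assumptions for illustration):

```python
import numpy as np
from scipy.stats import norm

mus    = np.array([500.0, 350.0])   # tonkatsu, hamburger means (g)
sigmas = np.array([2.0, 2.0])       # component std devs (g)
pis    = np.array([0.5, 0.5])       # equal mixing proportions (assumed)

def assignment_probs(x):
    """Posterior P(z = k | x), computed in log space to avoid underflow."""
    logp = np.log(pis) + norm.logpdf(x, mus, sigmas)
    logp -= logp.max()
    p = np.exp(logp)
    return p / p.sum()

for w in [498.0, 352.0, 425.0]:
    print(w, assignment_probs(w).round(3))
# 498g and 352g are assigned with near certainty; 425g comes out ~0.5 / 0.5
```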
What’s Next?
We now understand:
- Gaussian Mixture Models combine multiple Gaussians
- GMMs elegantly combine discrete choices (component assignments) with continuous observations
- GenJAX naturally expresses the generative process as a probabilistic program
- Bayesian inference preserves uncertainty over both parameters and assignments
But we had to specify K (number of components) in advance. What if we don’t know how many clusters exist?
In Chapter 6, we’ll learn about Dirichlet Process Mixture Models (DPMM): a Bayesian approach that learns the number of components automatically from the data!
Key Takeaways
- GMM: Mixture of K Gaussians with mixing proportions π
- Generative process: First choose component (discrete), then generate observation (continuous)
- Bayesian inference: Reason about full posterior over parameters and assignments
- GenJAX: Express GMMs declaratively as probabilistic programs
- Uncertainty: Preserve and quantify uncertainty about component membership
- Applications: Clustering, segmentation, anomaly detection
Next Chapter: Dirichlet Process Mixture Models →