Hierarchical Bayes :: Probability & Probabilistic Computing Tutorial

https://josephausterweil.github.io/probintro/intro2/12_hierarchical_bayes/index.html
Mon, 01 Jun 2026 00:00:00 +0000

Two extremes that both feel wrong

Chibany has been keeping a journal: for each student who brings them a bento, they record whether it was tonkatsu or hamburger. After a while the journal looks like this, with each student’s tonkatsu count $k_i$ out of their total bento count $n_i$:

| Student | Tonkatsu $k_i$ | Total $n_i$ | Raw fraction $k_i / n_i$ |
|---|---|---|---|
| Alyssa | 70 | 100 | 0.70 |
| Ben | 28 | 40 | 0.70 |
| Carmen | 6 | 10 | 0.60 |
| Diego | 3 | 5 | 0.60 |
| Emi | 2 | 2 | 1.00 |
| Farid | 0 | 1 | 0.00 |

Chibany wants, for each student, a believable estimate of $\theta_i$, that student’s underlying probability of bringing tonkatsu. Two obvious strategies both fail:

The Beta Distribution
https://josephausterweil.github.io/probintro/intro2/12_hierarchical_bayes/the-beta-distribution/index.html

The Beta distribution (a prior for a probability)

To pool partially we need a prior over a rate $\theta \in [0, 1]$: a probability distribution whose outcomes are themselves probabilities. The natural choice is the Beta distribution, written $\text{Beta}(a, b)$, and it is the one new piece of notation in this chapter.

Define-before-use: the Beta distribution $\text{Beta}(a, b)$ is a probability distribution over a single number $\theta$ between 0 and 1.
It has two shape parameters $a > 0$ and $b > 0$, and the most useful way to read them is as a soft count:

Partial Pooling & Shrinkage
https://josephausterweil.github.io/probintro/intro2/12_hierarchical_bayes/partial-pooling-and-shrinkage/index.html

Partial pooling and shrinkage

Now suppose the six students share a common population prior $\text{Beta}(a, b)$, say $\text{Beta}(6, 4)$, encoding “the typical student is about 60% tonkatsu ($\tfrac{6}{6+4} = 0.6$), with a prior strength of $a + b = 10$ bentos.” Each student’s estimate becomes their own Beta-Binomial posterior mean, $(a + k_i) / (a + b + n_i)$:

```python
import jax.numpy as jnp

names = ["Alyssa", "Ben", "Carmen", "Diego", "Emi", "Farid"]
k = jnp.array([70, 28, 6, 3, 2, 0])
n = jnp.array([100, 40, 10, 5, 2, 1])

a, b = 6.0, 4.0  # shared population prior: mean 0.6, strength 10

population_mean = a / (a + b)
raw = k / n
posterior_mean = (a + k) / (a + b + n)  # Beta-Binomial posterior mean per student

print(f"population mean = {population_mean:.2f}\n")
for name, r, pm in zip(names, raw, posterior_mean):
    print(f" {name:7s} raw {float(r):.2f} -> pooled {float(pm):.3f} (shift {float(pm - r):+.3f})")
```

Output:

Learning the Prior
https://josephausterweil.github.io/probintro/intro2/12_hierarchical_bayes/learning-the-prior/index.html

Where does the population prior come from?

So far we fixed $(a, b) = (6, 4)$ by hand. But the whole promise of this chapter was to learn the prior. The hierarchical model already contains the answer: $(a, b)$ is itself a latent variable with its own distribution, so we can infer it from the students’ data, the same way we’ve inferred every other unknown in this tutorial.
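Written out, the hierarchical model described here has three levels. The symbols $p(a, b)$ for the hyperprior and $B(\cdot, \cdot)$ for the Beta function are notation assumed for this sketch, not taken from the chapter:

```latex
\begin{aligned}
(a, b) &\sim p(a, b) \\
\theta_i \mid a, b &\sim \text{Beta}(a, b) \\
k_i \mid \theta_i &\sim \text{Binomial}(n_i, \theta_i) \\
p(k_i \mid a, b) &= \binom{n_i}{k_i} \, \frac{B(a + k_i,\; b + n_i - k_i)}{B(a, b)}
\end{aligned}
```

The last line integrates out $\theta_i$, so any candidate $(a, b)$ can be scored directly against a student’s counts.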
We put a broad, weakly-informative hyperprior on $(a, b)$: a prior on the population prior, meaning just “some plausible range of population shapes, nothing committed.” Below, it is a uniform box over $0.5 \le a, b \le 20$; widen it and the estimate barely moves until the bounds get extreme. We then observe all the students’ counts and weight candidate $(a, b)$ values by how well they explain the data. This is plain importance sampling, the exact tool from Chapter 5 and the GenJAX tutorial, now aimed one level up at the hyperparameters.

Connections & Summary
https://josephausterweil.github.io/probintro/intro2/12_hierarchical_bayes/connections-and-summary/index.html

The connection to No Free Lunch

Step back to where Chapter 7 left us. The No Free Lunch (NFL) theorem proved that a learner must bring a prior: inductive bias is not optional, because a learner that entertains every hypothesis equally can’t generalize at all. That sounds like a life sentence: someone has to hand the learner its bias.

Hierarchical Bayes is the escape hatch. The prior is still required (NFL is not repealed), but the learner can acquire it from data about related problems instead of being born with it. Each student is a small learning problem; the population level is where the learner discovers “students tend to be around 60% tonkatsu,” and that discovered bias is exactly what lets it make a sane guess for a brand-new student it has barely any data on. The hierarchy is where inductive bias comes from when you don’t want to hand-pick it.
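The importance-sampling recipe from “Learning the Prior,” and the “sane guess for a brand-new student” it enables, can be sketched in a few lines. This is a minimal sketch, not the tutorial’s own GenJAX code. It uses only the Python standard library, proposes $(a, b)$ from the uniform box $0.5 \le a, b \le 20$, and weights each proposal by the Beta-Binomial marginal likelihood of the six students’ counts. Integrating each $\theta_i$ out analytically is an assumption of this sketch. The function names, sample count, and seed are hypothetical choices for illustration.

```python
import math
import random

# Journal data from the table: tonkatsu counts k_i out of n_i bentos.
K = [70, 28, 6, 3, 2, 0]
N = [100, 40, 10, 5, 2, 1]


def log_beta(x, y):
    """Log of the Beta function B(x, y), computed via log-gamma."""
    return math.lgamma(x) + math.lgamma(y) - math.lgamma(x + y)


def log_marginal(a, b, ks=K, ns=N):
    """Log-likelihood of all counts given (a, b), with each theta_i integrated out.

    Binomial coefficients are dropped: they do not depend on (a, b),
    so they cancel when the weights are normalized.
    """
    return sum(log_beta(a + k, b + n - k) - log_beta(a, b) for k, n in zip(ks, ns))


def learn_prior(num_samples=20000, lo=0.5, hi=20.0, seed=0):
    """Importance sampling that uses the uniform-box hyperprior as the proposal."""
    rng = random.Random(seed)
    samples = [(rng.uniform(lo, hi), rng.uniform(lo, hi)) for _ in range(num_samples)]
    log_w = [log_marginal(a, b) for a, b in samples]
    top = max(log_w)  # subtract the max before exponentiating, for stability
    w = [math.exp(lw - top) for lw in log_w]
    total = sum(w)
    return samples, [wi / total for wi in w]


def predictive_mean(samples, weights, k=0, n=0):
    """Weighted estimate of theta for one student.

    Valid for a journal student (pass their k, n, already included in K and N)
    or for a brand-new student with no data yet (k = n = 0).
    """
    return sum(w * (a + k) / (a + b + n) for (a, b), w in zip(samples, weights))


samples, weights = learn_prior()
post_a = sum(w * a for (a, b), w in zip(samples, weights))
post_b = sum(w * b for (a, b), w in zip(samples, weights))
print(f"posterior mean of (a, b): ({post_a:.2f}, {post_b:.2f})")
print(f"brand-new student (no data): {predictive_mean(samples, weights):.3f}")
print(f"Farid (0 of 1):              {predictive_mean(samples, weights, 0, 1):.3f}")
```

Because the proposal equals the hyperprior, each weight is just the likelihood of the data under that $(a, b)$. A different proposal would also need the prior-to-proposal density ratio in the weight.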