Joint, conditional, marginal

Probability from zero

Learning objectives

Define a joint distribution for two random variables and read its joint PMF / PDF
Compute marginal distributions by summing or integrating out one variable
Compute conditional distributions via Bayes' rule on bivariate data
State independence formally and identify when it fails
Recognise correlation vs independence — and the cases where they diverge

§0.2 handled one random variable at a time. Now: two or more. The JOINT DISTRIBUTION carries everything: marginals and conditionals are both derived from it. Most real statistical questions are about RELATIONSHIPS between variables (Y given X, P(disease | symptom), regression, classification), so the joint-conditional-marginal triple is the foundation.

The joint distribution

For two random variables $X, Y$ :

Discrete: joint PMF $p_{X,Y}(x, y) = P(X = x, Y = y)$ with $\sum_{x, y} p_{X,Y}(x, y) = 1$ .
Continuous: joint PDF $f_{X,Y}(x, y)$ with $\iint f_{X,Y} = 1$ and $P((X, Y) \in A) = \iint_A f_{X,Y}$ .

The joint captures both marginals AND the dependence structure. You can't recover the joint from the marginals alone: different joint distributions can have identical marginals (the Fréchet bounds delimit how much the joint can vary once the marginals are fixed).
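To make the last claim concrete, here is a minimal sketch in Python with NumPy. The two 2×2 tables are made-up illustrative numbers, not taken from the lesson's widget: both have the same uniform marginals, yet one spreads its mass evenly and the other concentrates it on x = y.

```python
import numpy as np

# Two hypothetical joint PMFs on {0, 1} x {0, 1}; rows index x, columns index y.
joint_indep = np.array([[0.25, 0.25],
                        [0.25, 0.25]])
joint_dep = np.array([[0.45, 0.05],
                      [0.05, 0.45]])

def marginals(joint):
    """Return (p_X, p_Y) by summing the joint over the other variable."""
    return joint.sum(axis=1), joint.sum(axis=0)

for name, joint in [("independent", joint_indep), ("dependent", joint_dep)]:
    assert np.isclose(joint.sum(), 1.0)  # a valid joint PMF sums to 1
    p_x, p_y = marginals(joint)
    print(name, "p_X =", p_x, "p_Y =", p_y)
```

Both tables print the same marginals, so the marginals alone cannot tell you which joint generated the data.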

Marginals: integrate out the other variable

To get $f_X$ from $f_{X,Y}$ :

$f_X(x) = \int_{-\infty}^{\infty} f_{X,Y}(x, y)\,dy.$

Discrete analogue: $p_X(x) = \sum_y p_{X,Y}(x, y)$ . Marginals are projections — they discard the information about the OTHER variable.
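As a rough numerical illustration of integrating out a variable, the sketch below uses a hypothetical joint PDF, f(x, y) = x + y on the unit square (chosen only because its marginal has the simple closed form f_X(x) = x + 1/2). The function names and the midpoint-rule grid size are assumptions made for this example.

```python
import numpy as np

def joint_pdf(x, y):
    """Hypothetical joint PDF f(x, y) = x + y on the unit square, 0 elsewhere."""
    inside = (0 <= x) & (x <= 1) & (0 <= y) & (y <= 1)
    return np.where(inside, x + y, 0.0)

def marginal_x(x, n=1000):
    """Approximate f_X(x), the integral of f(x, y) over y in [0, 1], by the midpoint rule."""
    y_mid = (np.arange(n) + 0.5) / n       # midpoints of n equal cells on [0, 1]
    return joint_pdf(x, y_mid).sum() / n   # sum of f times the cell width 1/n

for x in (0.0, 0.5, 1.0):
    # Numerical marginal next to the closed form x + 1/2.
    print(x, marginal_x(x), x + 0.5)
```

The y-dependence disappears in the output: that is the "projection" described above.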

Conditionals: divide by the marginal

The conditional PDF of $Y$ given $X = x$ :

$f_{Y | X}(y | x) = \frac{f_{X,Y}(x, y)}{f_X(x)}$ for $f_X(x) > 0$ .

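Discrete: $P(Y = y | X = x) = P(X=x, Y=y) / P(X=x)$ , defined when $P(X = x) > 0$ . This is the definition of conditional probability; BAYES' RULE follows from it by rewriting the joint in the numerator as $P(X = x | Y = y) \, P(Y = y)$ . The conditional answers: "given we know X = x, what is the distribution of Y?".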
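The discrete recipe can be sketched on a small table. The joint PMF below is hypothetical (invented for this example), and the variable names are illustrative; the code divides each row by its marginal and then checks that reversing the conditioning via Bayes' rule agrees with conditioning on Y directly.

```python
import numpy as np

# Hypothetical joint PMF: rows are x in {0, 1}, columns are y in {0, 1, 2}.
joint = np.array([[0.10, 0.20, 0.10],
                  [0.30, 0.10, 0.20]])

p_x = joint.sum(axis=1)                 # marginal of X
p_y = joint.sum(axis=0)                 # marginal of Y

cond_y_given_x = joint / p_x[:, None]   # row i holds P(Y = y | X = i)
cond_x_given_y = joint / p_y[None, :]   # column j holds P(X = x | Y = j)

# Bayes' rule: P(X | Y) = P(Y | X) * P(X) / P(Y).
bayes = cond_y_given_x * p_x[:, None] / p_y[None, :]

print(cond_y_given_x)
print(cond_y_given_x.sum(axis=1))       # each conditional distribution sums to 1
print(np.allclose(bayes, cond_x_given_y))
```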

Independence: the joint factors

$X$ and $Y$ are INDEPENDENT (written $X \perp Y$ ) if and only if:

$f_{X,Y}(x, y) = f_X(x) \cdot f_Y(y) \text{ for all } x, y.$

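Equivalently: $f_{Y|X}(y|x) = f_Y(y)$ wherever $f_X(x) > 0$ , so knowing X tells you nothing new about Y. For discrete variables the same statements hold with PMFs in place of PDFs. Independence is a MUCH stronger condition than zero correlation: $\rho_{X,Y} = 0$ does not imply independence. The canonical counter-example: X ~ N(0, 1), Y = X² are dependent, yet Cov(X, Y) = E[X³] − E[X]·E[X²] = 0, so ρ = 0.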
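For a discrete table, the factorisation test can be sketched directly: compare the joint with the outer product of its own marginals. The helper name `is_independent` and both tables are hypothetical, introduced only for illustration.

```python
import numpy as np

def is_independent(joint, tol=1e-12):
    """True if the joint PMF equals the outer product of its marginals."""
    p_x = joint.sum(axis=1)
    p_y = joint.sum(axis=0)
    return bool(np.allclose(joint, np.outer(p_x, p_y), atol=tol))

# Built as a product of marginals, so it factors by construction.
joint_a = np.outer([0.3, 0.7], [0.2, 0.5, 0.3])
# Mass concentrated on x == y, so it cannot factor.
joint_b = np.array([[0.45, 0.05],
                    [0.05, 0.45]])

print(is_independent(joint_a))
print(is_independent(joint_b))
```

The check must hold in every cell: a single (x, y) where the joint differs from the product is enough for independence to fail.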

Correlation vs independence

INDEPENDENT ⇒ uncorrelated (ρ = 0).
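UNCORRELATED ⇏ independent in general. The implication does hold when (X, Y) is JOINTLY Gaussian.

For the Bivariate Normal, ρ = 0 ⇔ X ⊥ Y. Gaussian marginals alone are not enough: the pair must be jointly Gaussian. Outside that family, ρ = 0 can coexist with strong dependence, because ρ measures only LINEAR association. This is why "uncorrelated" is a weak property.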
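A quick simulation can make the gap between "uncorrelated" and "independent" visible. The sketch below, in Python with NumPy, draws X ~ N(0, 1) and sets Y = X². The sample size, the seed, and the thresholds 0.5 and 1.5 are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(200_000)
y = x ** 2                      # Y is a deterministic function of X

# Sample correlation: close to 0 despite total dependence.
rho_hat = np.corrcoef(x, y)[0, 1]

# Dependence shows up in the conditional mean of Y, which changes with |X|.
mean_y_small = y[np.abs(x) < 0.5].mean()
mean_y_large = y[np.abs(x) > 1.5].mean()

print(rho_hat, mean_y_small, mean_y_large)
```

The correlation reading sits near zero while the two conditional means differ sharply, which is exactly the situation ρ cannot detect.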

Try it

Default scenario "Correlated Normal" with ρ = 0.6. Slide the "Conditional X" slider from -2 to +2. Watch the orange conditional histogram SHIFT — E[Y | X = x] tracks ρ·x. The conditional MEAN changes with X (dependence!).
Set ρ = 0 (Normal scenario). Conditional histogram stays approximately centred at 0 regardless of where you slide X. This is independence (for Gaussian).
Switch to "Independent Uniforms". The marginal of Y is uniform on [0, 1]; the conditional Y | X is ALSO uniform on [0, 1] regardless of X. Independence: no information.
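Switch to "Smoker × Cancer" (toy). P(cancer | smoker) ≈ 0.20 vs P(cancer | non-smoker) ≈ 0.05. The marginal is ≈ 0.125: the average of the two conditionals weighted by P(smoker), which these numbers imply is ≈ 0.5. Conditioning on smoking status changes the answer by a factor of 4: strong dependence. This is conditioning, the ingredient behind Bayes' rule, in action.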
Set ρ = 0.95 in Normal mode. The conditional becomes very tight (a narrow slice): if the marginals are standard normal, its variance is 1 − ρ² ≈ 0.10. Cross-check: the sample correlation ρ̂ reading should match the chosen ρ within ±0.05 at n = 600.
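The "Correlated Normal" experiments can be approximated offline. The sketch below is not the widget's code: it assumes standard-normal marginals, uses a much larger sample than the widget's n = 600 so that thin slices are well populated, and the helper `conditional_slice` with its slice width is a hypothetical stand-in for the slider.

```python
import numpy as np

rng = np.random.default_rng(1)
rho, n = 0.6, 200_000

# Build correlated standard normals: Y = rho * X + sqrt(1 - rho^2) * Z.
x = rng.standard_normal(n)
z = rng.standard_normal(n)
y = rho * x + np.sqrt(1 - rho ** 2) * z

def conditional_slice(x0, half_width=0.05):
    """Samples of Y whose X falls in a thin slice around x0."""
    return y[np.abs(x - x0) < half_width]

for x0 in (-2.0, 0.0, 2.0):
    s = conditional_slice(x0)
    # Empirical conditional mean and variance next to rho * x0 and 1 - rho^2.
    print(x0, s.mean(), rho * x0, s.var(), 1 - rho ** 2)

print(np.corrcoef(x, y)[0, 1])  # sample correlation, to compare with rho
```

Under these assumptions the slice mean moves with ρ·x while the slice variance stays near 1 − ρ², which is why a larger ρ gives a tighter conditional histogram.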

For Smoker × Cancer, suppose the prevalence of cancer in the GENERAL population is 12.5% (the marginal). A patient screens positive on a non-specific symptom; you know they are a smoker. What changes? Why does §0.3's conditional-marginal framework make the calculation routine rather than ad-hoc?
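One way to see why the calculation becomes routine is to write it out. The sketch below uses the toy conditionals from the Smoker × Cancer scenario; the value P(smoker) = 0.5 is an assumption inferred from the 12.5% marginal, not a number stated in the lesson, and the variable names are illustrative.

```python
# Toy inputs from the Smoker x Cancer scenario.
p_cancer_given_smoker = 0.20
p_cancer_given_nonsmoker = 0.05
p_smoker = 0.5  # assumption: the value implied by the 0.125 marginal

# Law of total probability: the marginal P(cancer).
p_cancer = (p_cancer_given_smoker * p_smoker
            + p_cancer_given_nonsmoker * (1 - p_smoker))

# Bayes' rule: reverse the conditioning to get P(smoker | cancer).
p_smoker_given_cancer = p_cancer_given_smoker * p_smoker / p_cancer

print(p_cancer, p_smoker_given_cancer)
```

Knowing the patient is a smoker replaces the general-population base rate (the marginal) with the conditional P(cancer | smoker) as the starting point for any further update.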

What you now know

The joint distribution carries everything; marginals and conditionals are derived from it. Independence requires the joint to factor. Zero correlation is weaker than independence outside jointly-Gaussian-land. Conditionals (the Bayes'-rule construction) are how we update beliefs as new information arrives. §0.4 will compute expectations and moments of these distributions. §6 and §7 build on conditional distributions to develop causal inference and Bayesian inference respectively.

References

Wasserman, L. (2004). All of Statistics. Springer. (Chapter 3 — joint and conditional distributions.)
Casella, G., Berger, R.L. (2002). Statistical Inference, 2nd ed. Duxbury. (Sections 4.1-4.4.)
Pearl, J. (2009). Causality, 2nd ed. Cambridge University Press. (Conditional distributions as the language of causal inference.)
Ross, S.M. (2014). Introduction to Probability Models, 11th ed. Academic Press. (Chapter 3.)
MacKay, D.J.C. (2003). Information Theory, Inference, and Learning Algorithms. Cambridge University Press. (Free online; clean treatment of joint-conditional-marginal.)