Conditional Probability and Independence

Part 15, Chapter 15: Combinatorics and Probability

Learning objectives

  • Define conditional probability $P(A \mid B)$ and interpret it geometrically
  • Define independence and verify it via the multiplication rule $P(A \cap B) = P(A) P(B)$
  • Apply the law of total probability to partition-based computations
  • Use Bayes' theorem to invert conditional probabilities
  • Distinguish pairwise independence from mutual independence

Conditional probability and independence are the two engines of Bayesian reasoning, the framework behind spam filters, medical-diagnosis tools, recommender systems, and modern machine learning. Conditional probability tells us how to update a belief in light of evidence; independence tells us when evidence is informationally irrelevant. Bayes' theorem stitches them together by inverting cause-and-effect: from "given the disease, the test is positive 99% of the time" to "given a positive test, how likely is the disease?" These look like the same statement to a beginner, but they can differ by orders of magnitude.

Conditional probability

The conditional probability of $A$ given $B$ is defined for $P(B) > 0$ by

$$P(A \mid B) = \frac{P(A \cap B)}{P(B)}.$$

Geometrically, conditioning on $B$ restricts the sample space to $B$ and re-normalises so the new sample space has measure 1. The probability of any event $A$ in this restricted world is the fraction of $B$ that also belongs to $A$.

Rearranging gives the multiplication rule: $P(A \cap B) = P(A \mid B) P(B) = P(B \mid A) P(A)$. This chains naturally: $P(A \cap B \cap C) = P(A) P(B \mid A) P(C \mid A \cap B)$, and so on for any number of events.
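The definition and the multiplication rule can be checked by brute-force enumeration on a small sample space. The sketch below is illustrative only: the two-dice setup and the names `prob`, `cond_prob`, `sum_is_8`, and `first_is_even` are assumptions introduced here, not part of the chapter.

```python
from fractions import Fraction
from itertools import product

# Assumed example: all 36 ordered outcomes of two fair six-sided dice.
omega = list(product(range(1, 7), repeat=2))

def prob(event):
    """P(event) under the uniform measure; `event` is a predicate on outcomes."""
    return Fraction(sum(1 for w in omega if event(w)), len(omega))

def cond_prob(a, b):
    """P(A | B) = P(A and B) / P(B), defined only when P(B) > 0."""
    p_b = prob(b)
    if p_b == 0:
        raise ValueError("P(B) must be positive")
    return prob(lambda w: a(w) and b(w)) / p_b

def sum_is_8(w):
    return w[0] + w[1] == 8

def first_is_even(w):
    return w[0] % 2 == 0

p_a = prob(sum_is_8)                               # unconditional P(A)
p_a_given_b = cond_prob(sum_is_8, first_is_even)   # P(A | B)
p_joint = prob(lambda w: sum_is_8(w) and first_is_even(w))

print(p_a, p_a_given_b, p_joint)
```

Here conditioning on "first die is even" shrinks the sample space from 36 outcomes to 18, and the joint probability agrees with both factorisations $P(A \mid B) P(B)$ and $P(B \mid A) P(A)$.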

Independence

Two events are independent when conditioning on one does not change the probability of the other: $P(A \mid B) = P(A)$. Equivalently, and more symmetrically, $P(A \cap B) = P(A) P(B)$. This is the standard definition because it handles zero-probability events gracefully and treats both events on equal footing.

For more than two events we need mutual independence: every finite sub-collection $A_{i_1}, \ldots, A_{i_k}$ satisfies $P(A_{i_1} \cap \cdots \cap A_{i_k}) = P(A_{i_1}) \cdots P(A_{i_k})$. This is strictly stronger than pairwise independence, where only every pair factors.
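The gap between pairwise and mutual independence can be seen by checking the factorisation directly. The following is a minimal sketch on an assumed example (two flips of a fair coin, with events chosen for illustration); the helper names `prob` and `factorises` are hypothetical.

```python
from fractions import Fraction
from itertools import combinations, product

# Assumed example: two flips of a fair coin, four equally likely outcomes.
omega = set(product("HT", repeat=2))

def prob(event):
    """P(event) under the uniform measure; `event` is a set of outcomes."""
    return Fraction(len(event), len(omega))

def factorises(events):
    """True if P(intersection of events) equals the product of their probabilities."""
    joint = omega.intersection(*events)
    expected = Fraction(1)
    for e in events:
        expected *= prob(e)
    return prob(joint) == expected

A = {w for w in omega if w[0] == "H"}    # first flip is heads
B = {w for w in omega if w[1] == "H"}    # second flip is heads
C = {w for w in omega if w[0] == w[1]}   # the two flips agree

pairwise = all(factorises(pair) for pair in combinations([A, B, C], 2))
triple = factorises([A, B, C])

print("pairwise independent:", pairwise)
print("mutually independent:", pairwise and triple)
```

Every pair factors, but the triple intersection has probability 1/4 rather than 1/8, so the three events are pairwise independent without being mutually independent.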

Total probability and Bayes' theorem

If $\{B_1, B_2, \ldots\}$ is a partition of $\Omega$ (disjoint events whose union is everything), then for any event $A$:

$P(A) = \sum_i P(A \mid B_i) P(B_i)$ (the law of total probability).

Combining this with the multiplication rule gives Bayes' theorem:

$$P(B_j \mid A) = \frac{P(A \mid B_j) P(B_j)}{\sum_i P(A \mid B_i) P(B_i)}.$$

The numerator is "joint probability of $B_j$ and the evidence"; the denominator is "total probability of the evidence." The result tells you how to update your prior beliefs $P(B_j)$ into posterior beliefs $P(B_j \mid A)$ once $A$ has been observed.
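The two formulas translate almost line for line into code. The sketch below assumes a finite partition stored as dictionaries; the function names `total_probability` and `posterior` and the three-machine numbers are hypothetical, chosen only to illustrate the update.

```python
from fractions import Fraction

def total_probability(priors, likelihoods):
    """P(A) = sum_i P(A | B_i) P(B_i) over a finite partition {B_i}."""
    return sum(likelihoods[b] * priors[b] for b in priors)

def posterior(priors, likelihoods):
    """Bayes' theorem: P(B_j | A) for every block B_j of the partition."""
    evidence = total_probability(priors, likelihoods)
    if evidence == 0:
        raise ValueError("P(A) must be positive")
    return {b: likelihoods[b] * priors[b] / evidence for b in priors}

# Hypothetical data: three machines make 50%, 30%, 20% of all items,
# with defect rates 1%, 2%, 3%. A = "the item is defective".
priors = {"M1": Fraction(1, 2), "M2": Fraction(3, 10), "M3": Fraction(1, 5)}
likelihoods = {"M1": Fraction(1, 100), "M2": Fraction(2, 100), "M3": Fraction(3, 100)}

p_defect = total_probability(priors, likelihoods)
post = posterior(priors, likelihoods)

print(p_defect)
print(post)
```

Using `Fraction` keeps the arithmetic exact, so the posteriors sum to exactly 1. In this made-up example the machine with the largest prior (M1) ends up with the smallest posterior, because its defect rate is the lowest.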

The medical-test parable

A disease has prevalence $P(D) = 0.001$. A test has sensitivity $P(+ \mid D) = 0.99$ and specificity $P(- \mid D^{c}) = 0.99$. Your test comes back positive. What is $P(D \mid +)$?

$P(+) = P(+ \mid D) P(D) + P(+ \mid D^{c}) P(D^{c}) = 0.99 \cdot 0.001 + 0.01 \cdot 0.999 = 0.00099 + 0.00999 = 0.01098$.

$P(D \mid +) = \frac{0.00099}{0.01098} \approx 0.090$.

Even after a 99%-sensitive, 99%-specific positive test, the chance of disease is only about 9%, because the disease is rare and the false-positive rate, multiplied by the much larger healthy population, swamps the true-positive count. This is the base-rate fallacy: intuitive reasoning ignores $P(D)$ and concludes that a 99% test gives 99% confidence, which is dead wrong.
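One way to make the "swamping" concrete is to count people rather than multiply probabilities. The sketch below applies the parable's rates to an assumed population of one million (the population size is an illustrative choice, not from the text) and uses integer arithmetic throughout.

```python
from fractions import Fraction

population = 1_000_000                    # assumed population size, for illustration

diseased = population // 1000             # prevalence 0.001
healthy = population - diseased

true_positives = diseased * 99 // 100     # sensitivity 0.99
false_positives = healthy * 1 // 100      # 1 - specificity = 0.01

all_positives = true_positives + false_positives
ppv = Fraction(true_positives, all_positives)   # P(D | +)

print(true_positives, false_positives, all_positives)
print(ppv, float(ppv))
```

Roughly ten false positives arise for every true positive, which is why the posterior lands near 9% even though the test itself is 99% accurate in both directions.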

Where this shows up
  • Spam filters: Naive Bayes classifies emails by computing $P(\text{spam} \mid \text{words})$ via Bayes' theorem with conditional-independence assumptions across word tokens. Despite the simplifying assumption, it works astoundingly well.
  • Medical diagnosis: Every "positive predictive value" reported for a screening test is exactly a Bayesian posterior. Mammography, prostate-specific-antigen tests, and COVID-19 rapid antigen kits all face the base-rate problem when used on low-prevalence populations.
  • PageRank as a Markov chain: Google's PageRank models a random surfer who follows links with conditional probabilities derived from the link structure; the stationary distribution is the long-run probability of being on any given page, a fixed point of a vast table of conditional (transition) probabilities.
  • Forensic DNA evidence: The "match probability" is $P(\text{evidence} \mid \text{innocent})$, NOT $P(\text{innocent} \mid \text{evidence})$. Confusing the two (the prosecutor's fallacy) has led to wrongful convictions when the base rate of "people who could match" is large.
  • Bayesian neural networks: Instead of a single point estimate, these models track a posterior distribution over network weights given the training data. Decision-making under uncertainty, in areas such as autonomous vehicles and medical imaging, is a major motivation for this approach.

Pause and think: If $A$ and $B$ are independent, are $A$ and $B^{c}$ also independent? Prove it from the definition. (Hint: $P(A) = P(A \cap B) + P(A \cap B^{c})$.)

Try it

  • Two cards are drawn from a standard deck without replacement. Are the events "first is an ace" and "second is an ace" independent? Compute both $P(A) P(B)$ and $P(A \cap B)$ to check.
  • A jar has 3 red and 7 blue marbles. Draw two with replacement. Are the draws independent? Now repeat without replacement. Why does the answer flip?
  • Rare-disease redux: with prevalence $P(D) = 0.01$ (1%) and the same 99% sensitivity / 99% specificity, recompute $P(D \mid +)$. Why is the answer dramatically different from the 0.1%-prevalence case?
  • An urn contains 2 fair coins and 1 two-headed coin (3 coins total). You pick a coin at random and flip it; it comes up heads. What is the probability you picked the two-headed coin? (Note: the quiz uses a 4-coin variant of this same Bayesian setup; the answers differ because the priors differ, but the technique is identical.)
  • Show that three events $A, B, C$ can be pairwise independent yet NOT mutually independent. Construct an explicit example with three flips of a fair coin.

A trap to watch for

The prosecutor's fallacy, confusing $P(A \mid B)$ with $P(B \mid A)$, appears constantly in news reporting, legal arguments, and even peer-reviewed science. "A 1-in-a-million DNA match!" sounds damning until you remember that in a city of 10 million, about 10 people would be expected to match by chance alone. The correct quantity is $P(\text{guilty} \mid \text{match})$, which depends on the prior probability of guilt before the DNA evidence. Always check: does the cited probability condition on the cause or on the effect?
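To see how far apart the two conditionals can be, the city example can be pushed through Bayes' theorem. The sketch below rests on simplifying assumptions that are not in the text: exactly one culprit lives in the city, the culprit's DNA matches with certainty, and before the DNA evidence every resident is equally likely to be the culprit.

```python
from fractions import Fraction

city_population = 10_000_000
match_prob_innocent = Fraction(1, 1_000_000)   # P(match | innocent), the quoted figure
match_prob_guilty = Fraction(1)                # assumption: the culprit always matches

# Assumption: uniform prior over residents before any DNA evidence.
prior_guilty = Fraction(1, city_population)
prior_innocent = 1 - prior_guilty

# Expected number of innocent residents who match purely by chance.
expected_innocent_matches = (city_population - 1) * match_prob_innocent

# Law of total probability, then Bayes' theorem.
p_match = match_prob_guilty * prior_guilty + match_prob_innocent * prior_innocent
p_guilty_given_match = match_prob_guilty * prior_guilty / p_match

print(float(expected_innocent_matches))   # close to 10
print(float(p_guilty_given_match))        # close to 1/11
```

Under these assumptions $P(\text{match} \mid \text{innocent})$ is one in a million, yet $P(\text{guilty} \mid \text{match})$ is only about 1 in 11. A stronger prior from other evidence would raise the posterior, which is exactly why the prior cannot be ignored.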

What you now know

You can compute conditional probabilities by restricting the sample space, check independence via the factorisation $P(A \cap B) = P(A) P(B)$, invert direction with Bayes' theorem, and avoid the base-rate and prosecutor's fallacies. The next section moves from EVENT probabilities to RANDOM VARIABLES, numerical summaries of random outcomes, and introduces expectation and variance, the two summary statistics that drive almost every applied probability calculation.


