Probability: Sample Spaces and Events

Part 15, Chapter 15: Combinatorics and Probability

Learning objectives

  • Define a probability space and state Kolmogorov's three axioms
  • Compute probabilities of compound events using equally likely outcomes
  • Apply the complement rule and inclusion-exclusion for events
  • Use the union bound for upper-bounding probabilities of unions
  • Distinguish disjoint events from independent events (almost opposite)

Probability is the calculus of uncertainty. When the world refuses to be deterministic, the next dice roll, the next click, the next packet drop, we still want quantitative predictions. Probability supplies them by reducing "uncertainty" to a small number of consistent rules: outcomes, events, and a measure on those events that obeys three axioms. Once those axioms are in place, every formula in the rest of the chapter (conditional probability, expected value, the CLT) is a logical consequence.

The probability space (Omega,mathcalF,P)(\Omega, \mathcal{F}, P)

A sample space Omega\Omega is the set of all possible outcomes of an experiment. An event is any subset of Omega\Omega; the collection of allowed events is a sigma\sigma-algebra mathcalF\mathcal{F}. A probability measure is a function P: \mathcal{F} \to [0, 1] satisfying Kolmogorov's axioms:

  • (K1) P(A)geq0P(A) \geq 0 for every event AA.
  • (K2) P(Omega)=1P(\Omega) = 1, something certain happens.
  • (K3) For any countable sequence of disjoint events A1,A2,ldotsA_1, A_2, \ldots, P!left(bigcupiAiright)=sumiP(Ai)P\!\left(\bigcup_i A_i\right) = \sum_i P(A_i)right)=sumiP(Ai). (Countable additivity.)

From these alone you can derive every elementary identity. For example, since Omega=AcupAc\Omega = A \cup A^{c} with AA and AcA^{c} disjoint, K2 plus K3 give the complement rule: P(Ac)=1P(A)P(A^{c}) = 1 - P(A). Similarly P(emptyset)=0P(\emptyset) = 0.

Equally likely outcomes

When Omega\Omega is finite and every outcome is equally likely (fair coins, dice, random card draws), the probability of an event reduces to a counting ratio: P(A)=fracAOmegaP(A) = \frac{|A|}{|\Omega|}. This is why §15.1 came first, classical probability is just counting under a uniform distribution. Most introductory problems live here.

Inclusion-exclusion for probabilities

For overlapping events, the counting identity transfers directly: P(AcupB)=P(A)+P(B)P(AcapB)P(A \cup B) = P(A) + P(B) - P(A \cap B). The corrective P(AcapB)P(A \cap B) term removes the double-count. For three events the alternating sum extends: P(AcupBcupC)=P(A)+P(B)+P(C)P(AcapB)P(AcapC)P(BcapC)+P(AcapBcapC)P(A \cup B \cup C) = P(A) + P(B) + P(C) - P(A \cap B) - P(A \cap C) - P(B \cap C) + P(A \cap B \cap C).

The union bound drops the corrective terms and gives the inequality P(AcupB)leqP(A)+P(B)P(A \cup B) \leq P(A) + P(B). This is loose but indispensable: it converts hard intersection probabilities into easy sums and powers analysis of every randomised algorithm, every concentration inequality, and (e.g.) the entire theory of statistical learning generalisation bounds.

The Venn-diagram widget above lets you set up two overlapping events and read off P(A)P(A), P(B)P(B), P(AcapB)P(A \cap B), and P(AcupB)P(A \cup B). Adjust the overlap and watch the inclusion-exclusion identity stay invariant.

Where this shows up
  • Reliability engineering: The probability that a system with nn components fails is at most the sum of the individual failure probabilities, literally the union bound. Aircraft, nuclear-plant, and spacecraft reliability are designed around this inequality with redundant subsystems engineered so each one's individual failure probability is 10710^{-7} per hour or less.
  • Machine learning generalisation: A hypothesis class with HH functions has training-vs-test gap bounded by sqrtlog(H)/n\sqrt{\log(H)/n} via the union bound across hypotheses. This is the foundational PAC-learning result.
  • Networking, packet loss bounds: The probability that ANY of kk packets is lost is at most kcdotpk \cdot p where pp is the per-packet loss probability. Used to provision congestion windows in TCP and to size redundancy in QUIC and erasure-coded storage.
  • Insurance and actuarial science: Premium calculations begin from the probability that a covered event occurs in the policy term; reinsurance treaties use inclusion-exclusion to price overlapping risks across multiple insured parties.
  • A/B testing: Multiple-comparison corrections (Bonferroni) explicitly use the union bound: if you run kk simultaneous tests at significance alpha/k\alpha/k, the family-wise error rate stays below alpha\alpha. Without the correction, false positives multiply.

Pause and think: Suppose P(A)=0.6P(A) = 0.6 and P(B)=0.7P(B) = 0.7. What is the LARGEST possible value of P(AcupB)P(A \cup B)? The SMALLEST? (Hint: think about the most and least possible overlap. Note P(AcapB)geqP(A)+P(B)1P(A \cap B) \geq P(A) + P(B) - 1 since P(AcupB)leq1P(A \cup B) \leq 1.)

Try it

  • Predict first: when rolling two fair dice, what is P(textsum=7)P(\text{sum} = 7)? P(textsum=12)P(\text{sum} = 12)? Why is one much larger?
  • A bag has 4 red and 6 blue balls. Draw two without replacement. Compute P(textbothred)P(\text{both red}) using the multiplication principle on conditional probabilities.
  • Use the complement to compute P(textatleastoneheadin5fairtosses)P(\text{at least one head in 5 fair tosses}). Why is direct enumeration of "at least one" almost always inefficient?
  • If P(A)=0.3P(A) = 0.3 and P(B)=0.5P(B) = 0.5 and AcapB=emptysetA \cap B = \emptyset, what is P(AcupB)P(A \cup B)? Why can disjoint events have positive individual probabilities yet be maximally dependent?
  • Show that P(A)+P(B)1leqP(AcapB)leqmin(P(A),P(B))P(A) + P(B) - 1 \leq P(A \cap B) \leq \min(P(A), P(B)). These are the Fréchet bounds on the intersection.

A trap to watch for

Beginners regularly confuse "disjoint" with "independent" because both involve two events related in some way. They are almost opposite. Disjoint means AcapB=emptysetA \cap B = \emptyset, so P(AcapB)=0P(A \cap B) = 0, if AA happens, BB cannot. Independent (formal definition in §15.3) means P(AcapB)=P(A)P(B)P(A \cap B) = P(A) P(B), the two events have no informational link. For events both with positive probability, disjointness is the MAXIMALLY DEPENDENT case (knowing one rules the other out), while independence is the no-information case. They coincide only in the trivial case where one of the events has probability zero.

What you now know

You can build a probability space from a sample space and a measure, compute event probabilities by counting under uniformity, manipulate compound events with inclusion-exclusion, and bound probabilities of unions with the union bound. The next section deepens this with conditional probability and independence, the tools that let you update beliefs as evidence arrives and that power every Bayesian inference in machine learning.

Mark section complete →

References

  • Garrity, T. (2002). All the Mathematics You Missed. Cambridge University Press, ch. 15.
  • Ross, S. M. (2014). A First Course in Probability (9th ed.). Pearson, ch. 2.
  • Feller, W. (1968). An Introduction to Probability Theory and Its Applications, Vol. 1 (3rd ed.). Wiley, ch. 1.
  • Grimmett, G., Stirzaker, D. (2001). Probability and Random Processes (3rd ed.). Oxford University Press, ch. 1.
  • Durrett, R. (2019). Probability: Theory and Examples (5th ed.). Cambridge University Press, ch. 1.

This page is prerendered for SEO and accessibility. The interactive widgets above hydrate on JavaScript load.