Probability: Sample Spaces and Events
Learning objectives
- Define a probability space and state Kolmogorov's three axioms
- Compute probabilities of compound events using equally likely outcomes
- Apply the complement rule and inclusion-exclusion for events
- Use the union bound for upper-bounding probabilities of unions
- Distinguish disjoint events from independent events (almost opposite)
Probability is the calculus of uncertainty. When the world refuses to be deterministic, the next dice roll, the next click, the next packet drop, we still want quantitative predictions. Probability supplies them by reducing "uncertainty" to a small number of consistent rules: outcomes, events, and a measure on those events that obeys three axioms. Once those axioms are in place, every formula in the rest of the chapter (conditional probability, expected value, the CLT) is a logical consequence.
The probability space
A sample space is the set of all possible outcomes of an experiment. An event is any subset of ; the collection of allowed events is a -algebra . A probability measure is a function P: \mathcal{F} \to [0, 1] satisfying Kolmogorov's axioms:
- (K1) for every event .
- (K2) , something certain happens.
- (K3) For any countable sequence of disjoint events , . (Countable additivity.)
From these alone you can derive every elementary identity. For example, since with and disjoint, K2 plus K3 give the complement rule: . Similarly .
Equally likely outcomes
When is finite and every outcome is equally likely (fair coins, dice, random card draws), the probability of an event reduces to a counting ratio: . This is why §15.1 came first, classical probability is just counting under a uniform distribution. Most introductory problems live here.
Inclusion-exclusion for probabilities
For overlapping events, the counting identity transfers directly: . The corrective term removes the double-count. For three events the alternating sum extends: .
The union bound drops the corrective terms and gives the inequality . This is loose but indispensable: it converts hard intersection probabilities into easy sums and powers analysis of every randomised algorithm, every concentration inequality, and (e.g.) the entire theory of statistical learning generalisation bounds.
The Venn-diagram widget above lets you set up two overlapping events and read off , , , and . Adjust the overlap and watch the inclusion-exclusion identity stay invariant.
- Reliability engineering: The probability that a system with components fails is at most the sum of the individual failure probabilities, literally the union bound. Aircraft, nuclear-plant, and spacecraft reliability are designed around this inequality with redundant subsystems engineered so each one's individual failure probability is per hour or less.
- Machine learning generalisation: A hypothesis class with functions has training-vs-test gap bounded by via the union bound across hypotheses. This is the foundational PAC-learning result.
- Networking, packet loss bounds: The probability that ANY of packets is lost is at most where is the per-packet loss probability. Used to provision congestion windows in TCP and to size redundancy in QUIC and erasure-coded storage.
- Insurance and actuarial science: Premium calculations begin from the probability that a covered event occurs in the policy term; reinsurance treaties use inclusion-exclusion to price overlapping risks across multiple insured parties.
- A/B testing: Multiple-comparison corrections (Bonferroni) explicitly use the union bound: if you run simultaneous tests at significance , the family-wise error rate stays below . Without the correction, false positives multiply.
Pause and think: Suppose and . What is the LARGEST possible value of ? The SMALLEST? (Hint: think about the most and least possible overlap. Note since .)
Try it
- Predict first: when rolling two fair dice, what is ? ? Why is one much larger?
- A bag has 4 red and 6 blue balls. Draw two without replacement. Compute using the multiplication principle on conditional probabilities.
- Use the complement to compute . Why is direct enumeration of "at least one" almost always inefficient?
- If and and , what is ? Why can disjoint events have positive individual probabilities yet be maximally dependent?
- Show that . These are the Fréchet bounds on the intersection.
A trap to watch for
Beginners regularly confuse "disjoint" with "independent" because both involve two events related in some way. They are almost opposite. Disjoint means , so , if happens, cannot. Independent (formal definition in §15.3) means , the two events have no informational link. For events both with positive probability, disjointness is the MAXIMALLY DEPENDENT case (knowing one rules the other out), while independence is the no-information case. They coincide only in the trivial case where one of the events has probability zero.
What you now know
You can build a probability space from a sample space and a measure, compute event probabilities by counting under uniformity, manipulate compound events with inclusion-exclusion, and bound probabilities of unions with the union bound. The next section deepens this with conditional probability and independence, the tools that let you update beliefs as evidence arrives and that power every Bayesian inference in machine learning.
Mark section complete →
References
- Garrity, T. (2002). All the Mathematics You Missed. Cambridge University Press, ch. 15.
- Ross, S. M. (2014). A First Course in Probability (9th ed.). Pearson, ch. 2.
- Feller, W. (1968). An Introduction to Probability Theory and Its Applications, Vol. 1 (3rd ed.). Wiley, ch. 1.
- Grimmett, G., Stirzaker, D. (2001). Probability and Random Processes (3rd ed.). Oxford University Press, ch. 1.
- Durrett, R. (2019). Probability: Theory and Examples (5th ed.). Cambridge University Press, ch. 1.