Why probability? Axioms and Kolmogorov

Probability from zero

Learning objectives

State Kolmogorov's three axioms (non-negativity, normalisation, countable additivity)
Read a probability assignment and decide whether it is valid
Derive P(Aᶜ) = 1 − P(A) and inclusion–exclusion from the axioms
Recognise that additivity is the keystone — overlap requires the inclusion–exclusion correction
Connect the axioms to the empirical-frequency picture you will formalise as the LLN in §0.6

Probability theory is the formal language of uncertainty. It is the language every later chapter of this textbook speaks — every estimator, every confidence interval, every Bayesian posterior, every causal-inference identification strategy. So before any of that arrives, you need to be fluent in the rules of the language. This section is those rules: three short axioms, due to Kolmogorov in 1933, that turn out to be enough to build the entire edifice.

If you have only seen probability informally — "the probability of heads is one half because there are two equally likely outcomes" — three things are about to change. First, you will see why an axiomatic formulation is necessary and not just pedantic. Second, you will see exactly which calculations are licensed by which axiom. Third, you will see, through three live widgets, what would go wrong if the axioms were relaxed.

The intuition: why bother with axioms?

Imagine three reasonable-sounding rules of thumb a beginner might use:

"If event A is more likely than event B, then P(A) > P(B)." Sensible — but vague. How much more likely?
"If A and B have nothing in common, then P(A or B) = P(A) + P(B)." Also sensible — but only when "nothing in common" is precise.
"Some outcome must happen." True. But how does that turn into arithmetic?

The three Kolmogorov axioms turn these three intuitions into rules a machine could check. They are the smallest set of constraints that lets the calculus of probability work — and that excludes assignments that produce paradoxes. Drop any one of them and entire chapters of the rest of this textbook collapse.

The setup: sample space, events, σ-algebra

An experiment has a sample space $\Omega$, which is the set of every outcome we are willing to consider. For a coin flip, $\Omega = \{H, T\}$. For one roll of a six-sided die, $\Omega = \{1, 2, 3, 4, 5, 6\}$. For a real-valued measurement, $\Omega = \mathbb{R}$. For two flips of a coin, $\Omega = \{HH, HT, TH, TT\}$.

An event is a subset of $\Omega$. "The die comes up even" is the event $A = \{2, 4, 6\} \subseteq \Omega$. "Heads exactly once in two flips" is $A = \{HT, TH\}$. The empty set $\emptyset$ ("nothing happens") and the full space $\Omega$ ("something happens") are both events too.
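Sample spaces and events map directly onto ordinary sets. The sketch below is only an illustration; the variable names (`omega`, `exactly_one_head`, `die`, `even`) are chosen here and are not part of the text.

```python
from itertools import product

# Sample space for two coin flips: every ordered pair of H/T, joined into a string.
omega = {"".join(flips) for flips in product("HT", repeat=2)}

# An event is just a subset of the sample space.
exactly_one_head = {outcome for outcome in omega if outcome.count("H") == 1}

# Sample space and an event for one roll of a six-sided die.
die = set(range(1, 7))
even = {face for face in die if face % 2 == 0}

print(sorted(omega))             # ['HH', 'HT', 'TH', 'TT']
print(sorted(exactly_one_head))  # ['HT', 'TH']
print(even <= die)               # True: an event is a subset of the sample space
```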

For finite $\Omega$ every subset can be an event. For infinite $\Omega$ (e.g. $\mathbb{R}$) we have to be slightly careful: the collection of events forms a σ-algebra, meaning it contains $\Omega$ and is closed under complement and countable union (and therefore also under countable intersection). The technical bookkeeping is necessary because some subsets of $\mathbb{R}$ cannot be assigned a probability consistently (non-measurable sets, the same phenomenon that underlies the Banach–Tarski paradox), but it is not where the probability intuition lives. For now, when you read "event" think "a subset of outcomes we are willing to assign a probability to".

Kolmogorov's three axioms

A probability on $\Omega$ is a function $P$ that assigns a number to every event, subject to three rules:

Non-negativity. $P(A) \geq 0$ for every event $A$.
Normalisation. $P(\Omega) = 1$.
Countable additivity. For any countable sequence of pairwise disjoint events $A_1, A_2, A_3, \ldots$ (meaning $A_i \cap A_j = \emptyset$ whenever $i \neq j$), $P\!\left(\bigcup_{i=1}^{\infty} A_i\right) = \sum_{i=1}^{\infty} P(A_i)$.

That is the whole formalism. Three lines. Everything else in this section — and most of the next 200 pages — is a consequence.

Read the third one slowly. The qualifier "pairwise disjoint" does the heavy lifting. Additivity is a license to add probabilities, but only when the events do not overlap. The instant $A_1 \cap A_2 \neq \emptyset$, additivity is silent: you cannot use it. The most common beginner error with this material is forgetting that qualifier and writing $P(A \cup B) = P(A) + P(B)$ when $A$ and $B$ overlap; that produces wrong answers for everything from medical diagnostics to insurance pricing.
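A small numerical check of the "pairwise disjoint" qualifier, assuming a fair six-sided die with equally likely faces. The helper `prob` and the event names are hypothetical, introduced only for this sketch.

```python
from fractions import Fraction

def prob(event, omega):
    """Probability of an event when all outcomes are equally likely (assumed here)."""
    return Fraction(len(event & omega), len(omega))

omega = set(range(1, 7))  # one roll of a fair die
even = {2, 4, 6}
odd = {1, 3, 5}
high = {4, 5, 6}

# Disjoint events: additivity applies, so the sum equals the probability of the union.
disjoint_ok = prob(even | odd, omega) == prob(even, omega) + prob(odd, omega)
print(disjoint_ok)  # True

# Overlapping events: the naive sum overshoots because 4 and 6 are counted twice.
naive = prob(even, omega) + prob(high, omega)
actual = prob(even | high, omega)
print(naive, actual)  # 1 2/3
```

The naive sum claims that "even or high" is certain, although a roll of 1 or 3 belongs to neither event.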

What if we dropped one? — the axiom violator

Reading axioms is one thing; feeling why each one is necessary is another. Below are four mock probability assignments. One is a valid Kolmogorov probability; the other three each break a single axiom. For each, pick the axiom you think is violated, click Reveal, and read the consequence panel.

The widget makes a single point per card: if non-negativity fails, probabilities cease to be areas or frequencies; if normalisation fails, your expectations stop being well-defined; if additivity fails, you cannot compute the probability of a compound outcome by summing its disjoint pieces. Those are not stylistic preferences — those are the calculations the rest of this textbook hinges on.
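The same check can be sketched in code for a finite sample space. This is a minimal sketch, not the widget's implementation: the function `violated_axioms`, the helper `make`, and the four rain/dry assignments are hypothetical examples. On a finite $\Omega$, countable additivity reduces to additivity over pairs of disjoint events, which is what the code tests.

```python
from fractions import Fraction as F
from itertools import combinations

def powerset(omega):
    """All subsets of a finite sample space, as frozensets."""
    items = sorted(omega)
    return [frozenset(c) for r in range(len(items) + 1) for c in combinations(items, r)]

def violated_axioms(P, omega):
    """Names of the axioms that the set function P (dict: event -> number) breaks."""
    events = powerset(omega)
    broken = []
    if any(P[e] < 0 for e in events):
        broken.append("non-negativity")
    if P[frozenset(omega)] != 1:
        broken.append("normalisation")
    # Finite omega: it suffices to check every pair of disjoint events.
    if any(P[a | b] != P[a] + P[b] for a in events for b in events if not (a & b)):
        broken.append("additivity")
    return broken

omega = {"rain", "dry"}

def make(rain, dry, total):
    """Hypothetical assignment on the four events of the rain/dry sample space."""
    return {frozenset(): 0, frozenset({"rain"}): rain,
            frozenset({"dry"}): dry, frozenset(omega): total}

print(violated_axioms(make(F(3, 10), F(7, 10), 1), omega))       # []
print(violated_axioms(make(F(-1, 5), F(6, 5), 1), omega))        # ['non-negativity']
print(violated_axioms(make(F(1, 2), F(7, 10), F(6, 5)), omega))  # ['normalisation']
print(violated_axioms(make(F(1, 2), F(7, 10), 1), omega))        # ['additivity']
```

Exact fractions are used so that the equality tests are not disturbed by floating-point rounding.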

Consequences you can use

The three axioms are deliberately minimal, but they are enough to derive every working rule you will use day-to-day.

Complement rule. Since $A$ and $A^c$ are disjoint and $A \cup A^c = \Omega$, additivity gives $P(A) + P(A^c) = P(\Omega) = 1$, so $P(A^c) = 1 - P(A)$. Useful constantly: the probability of "at least one head in ten flips" is easier to write as $1 - P(\textrm{no heads})$ than as a sum of ten cases.
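The ten-flips example can be worked both ways. The sketch assumes a fair coin and independent flips, neither of which is stated in the text; under those assumptions the number of heads follows a binomial count, so the direct route sums the cases of exactly 1, 2, ..., 10 heads.

```python
from fractions import Fraction
from math import comb

n = 10  # number of flips; fair coin and independent flips are assumptions

# Complement route: one term.
p_no_heads = Fraction(1, 2) ** n
via_complement = 1 - p_no_heads

# Direct route: add the ten disjoint cases "exactly k heads" for k = 1..10.
via_direct_sum = sum(Fraction(comb(n, k), 2 ** n) for k in range(1, n + 1))

print(via_complement)                    # 1023/1024
print(via_complement == via_direct_sum)  # True
```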

Monotonicity. If $A \subseteq B$ then $P(A) \leq P(B)$. (Decompose $B = A \cup (B \setminus A)$, where the two pieces are disjoint, apply additivity to get $P(B) = P(A) + P(B \setminus A)$, and use non-negativity on $P(B \setminus A)$.) "Drawing a king" is a subset of "drawing a face card", so it cannot have larger probability.

Probabilities are in [0, 1]. From monotonicity and $A \subseteq \Omega$: $P(A) \leq P(\Omega) = 1$. Combined with non-negativity, every probability is in $[0, 1]$. This is a consequence, not a fourth axiom.
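The king and face-card example can be checked by enumeration. This sketch assumes a standard 52-card deck with every card equally likely; the names `deck`, `kings`, `faces` and `prob` are hypothetical.

```python
from fractions import Fraction
from itertools import product

ranks = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
suits = ["clubs", "diamonds", "hearts", "spades"]
deck = set(product(ranks, suits))  # 52 equally likely cards (assumed)

def prob(event):
    """Probability of an event under the uniform assignment on the deck."""
    return Fraction(len(event), len(deck))

kings = {card for card in deck if card[0] == "K"}
faces = {card for card in deck if card[0] in {"J", "Q", "K"}}

# The decomposition used in the proof: B = A ∪ (B \ A), with disjoint pieces.
decomposition_holds = prob(faces) == prob(kings) + prob(faces - kings)

print(kings <= faces)            # True: "king" is a subset of "face card"
print(prob(kings), prob(faces))  # 1/13 3/13
print(decomposition_holds)       # True
```

Because $P(B \setminus A) = 2/13$ is non-negative, $P(\text{king}) \leq P(\text{face card})$, and both values sit inside $[0, 1]$.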

Inclusion–exclusion for two events. When $A$ and $B$ are not necessarily disjoint:

$$P(A \cup B) = P(A) + P(B) - P(A \cap B).$$

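The intuition is double-counting: when you add $P(A) + P(B)$, every outcome in $A \cap B$ is counted in both, so you subtract it once. The derivation from the axioms takes three lines. Split the union into the pairwise disjoint pieces $A \setminus B$, $A \cap B$ and $B \setminus A$, so that additivity gives $P(A \cup B) = P(A \setminus B) + P(A \cap B) + P(B \setminus A)$. Additivity also gives $P(A) = P(A \setminus B) + P(A \cap B)$ and $P(B) = P(B \setminus A) + P(A \cap B)$. Adding the last two and subtracting $P(A \cap B)$ recovers the formula. This is the formula that restores additivity for overlapping events. The next widget makes it visible.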
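A quick exact check of inclusion–exclusion on a loaded die. The point masses below are hypothetical, chosen only so that the assignment is not uniform and the formula cannot be mistaken for a counting trick.

```python
from fractions import Fraction as F

# Hypothetical loaded die: face 6 carries half the probability.
mass = {1: F(1, 10), 2: F(1, 10), 3: F(1, 10), 4: F(1, 10), 5: F(1, 10), 6: F(1, 2)}
assert sum(mass.values()) == 1  # normalisation

def prob(event):
    """Additivity over single outcomes: sum the point masses in the event."""
    return sum((mass[outcome] for outcome in event), F(0))

A = {2, 4, 6}  # even
B = {4, 5, 6}  # at least four

lhs = prob(A | B)
rhs = prob(A) + prob(B) - prob(A & B)
three_pieces = prob(A - B) + prob(A & B) + prob(B - A)

print(prob(A), prob(B), prob(A & B))  # 7/10 7/10 3/5
print(lhs, rhs, three_pieces)         # 4/5 4/5 4/5
```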

The geometry: drag two events

The Venn explorer below shows two events $A$ and $B$ as circles over a unit-square sample space $\Omega = [0, 1]^2$. The probability of any event equals its area (a uniform distribution on the square). Drag the circles, resize them with the sliders, and watch the inclusion–exclusion check at the right.

Three configurations to try in particular:

Disjoint (use the preset): the intersection is empty, so $P(A \cap B) = 0$ and $P(A \cup B) = P(A) + P(B)$ exactly. The inclusion–exclusion correction term vanishes.
Overlapping: the intersection has positive probability. Now $P(A \cup B) < P(A) + P(B)$; the difference is exactly $P(A \cap B)$.
A contains B: $A \cap B = B$, so $P(A \cap B) = P(B)$. The union is just $A$, and inclusion–exclusion collapses to $P(A \cup B) = P(A)$.

The Monte Carlo readout has a tiny residual (on the order of a hundredth) because the areas are estimated by sampling 20 000 points; the true residual would be zero. That residual is the noise you will learn to quantify in §0.6 when you meet the Law of Large Numbers.
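How a sampling residual can arise is easy to sketch. The code below is not the widget's implementation, which the text does not describe: the circle positions, the seed and the function names are assumptions. It estimates each of the four areas from its own batch of 20 000 uniform points, so the inclusion–exclusion identity holds only up to noise, and then repeats the count on one shared batch, where the identity holds exactly.

```python
import math
import random

def in_circle(x, y, cx, cy, r):
    return (x - cx) ** 2 + (y - cy) ** 2 <= r ** 2

# Hypothetical circles, chosen to overlap and to stay inside the unit square.
def in_a(x, y):
    return in_circle(x, y, 0.35, 0.5, 0.25)

def in_b(x, y):
    return in_circle(x, y, 0.60, 0.5, 0.25)

def estimate(indicator, n, rng):
    """Fraction of n fresh uniform points in the unit square that satisfy indicator."""
    return sum(indicator(rng.random(), rng.random()) for _ in range(n)) / n

rng = random.Random(0)  # fixed seed so the run is reproducible
n = 20_000

# Four independent estimates: the identity holds only up to sampling noise.
p_a = estimate(in_a, n, rng)
p_b = estimate(in_b, n, rng)
p_both = estimate(lambda x, y: in_a(x, y) and in_b(x, y), n, rng)
p_union = estimate(lambda x, y: in_a(x, y) or in_b(x, y), n, rng)
residual = p_union - (p_a + p_b - p_both)

# One shared batch of points: the identity holds exactly, count by count.
points = [(rng.random(), rng.random()) for _ in range(n)]
count_a = sum(in_a(x, y) for x, y in points)
count_b = sum(in_b(x, y) for x, y in points)
count_both = sum(in_a(x, y) and in_b(x, y) for x, y in points)
count_union = sum(in_a(x, y) or in_b(x, y) for x, y in points)
shared_residual = count_union - (count_a + count_b - count_both)

print(f"exact P(A) = {math.pi * 0.25 ** 2:.4f}, estimated P(A) = {p_a:.4f}")
print(f"independent-sample residual = {residual:+.4f}")
print(f"shared-sample residual (in counts) = {shared_residual}")
```

The shared-sample residual is zero because every individual point satisfies the identity on indicators; only estimates built from separate samples can disagree.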

Where do these numbers come from? — frequency convergence

One last widget closes the loop. The Kolmogorov axioms tell you the rules of the probability calculus, but they say nothing about where the numbers come from. Two answers compete in practice: frequentist ("probability is the long-run relative frequency of an event") and Bayesian ("probability is a degree of belief"). For now, set Bayesian aside; you will meet it in Part 7.

The frequentist position has a sharp empirical prediction. If you assign theoretical probabilities to the faces of a die and then roll it many times, the empirical frequency of each face should approach the theoretical probability as the number of rolls grows. The widget below makes that statement testable. Pick a probability distribution with the sliders (note: the sliders auto-renormalise so the total is always 1 — axiom 2 in action), then click Roll 100, Roll 1000, Roll 10000, and watch what happens.

The bar chart on the left compares empirical (blue bars) to theoretical (orange markers); the line chart on the right tracks the empirical frequency of face 1 against $n$. As $n$ grows the blue line should hug the orange reference more and more tightly. That is the Law of Large Numbers, which §0.6 will state precisely and §0.7's Central Limit Theorem will sharpen by quantifying the rate of convergence.
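The convergence described above can be reproduced with a few lines of simulation. This is a sketch under stated assumptions rather than the widget's code: the function `running_frequency`, the skewed distribution and the seed are all hypothetical choices.

```python
import random

def running_frequency(probs, face, n_rolls, seed=0):
    """Empirical frequency of `face` after each of n_rolls simulated rolls."""
    rng = random.Random(seed)
    faces = list(range(1, len(probs) + 1))
    rolls = rng.choices(faces, weights=probs, k=n_rolls)
    hits = 0
    freqs = []
    for i, roll in enumerate(rolls, start=1):
        hits += roll == face  # True counts as 1
        freqs.append(hits / i)
    return freqs

# Hypothetical skewed die: face 1 has probability 0.7, the rest share 0.3 equally.
probs = [0.7, 0.06, 0.06, 0.06, 0.06, 0.06]
freqs = running_frequency(probs, face=1, n_rolls=10_000)

for n in (100, 1_000, 10_000):
    print(f"n = {n:>6}: empirical frequency of face 1 = {freqs[n - 1]:.3f}")
```

Changing the seed changes the early part of the path noticeably and the late part very little, which is the behaviour the line chart is meant to show.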

Try it

In the Venn explorer, find a configuration with $P(A) = P(B) \approx 0.3$ and $P(A \cap B) \approx 0.05$. Compute $P(A \cup B)$ by hand using inclusion–exclusion and check it against the widget's reported number.
Set the Venn explorer to "A contains B". What is $P(A \setminus B)$? Write it in terms of $P(A)$ and $P(B)$.
In the axiom violator, before clicking Reveal on the weather card, ask yourself: how could a forecaster end up writing $P(\textrm{rain}) + P(\textrm{no rain}) = 1.2$? What real-world process could produce this kind of error?
In the frequency convergence widget, set the sliders to a wildly skewed distribution (say P(face 1) ≈ 0.7, everything else small) and roll 100 times. Do the bars match the markers? Now roll 10 000 more. What does that suggest about how many samples you need before "the empirical frequency is the probability" becomes a safe claim?

Pause and reflect: the three axioms make no mention of where probabilities come from — only the rules they must obey. Why is that separation useful? Where in the rest of a real research workflow does each side of that separation (the rules of the calculus, the sources of the numbers) actually appear?

What you now know

You have the formal definition of a probability — a sample space, a σ-algebra of events, and a function $P$ obeying three axioms (non-negativity, normalisation, countable additivity). You can derive the complement rule, monotonicity, and inclusion–exclusion from those axioms in a few lines. You have seen, through the axiom-violator and the Venn explorer, exactly what goes wrong if any axiom is dropped, and through the frequency-convergence simulator you have seen the bridge from theoretical probabilities to the empirical frequencies the rest of the book will lean on. Section §0.2 picks up here and reformulates everything in the language of random variables, which is the move that turns events into numbers you can take expectations of.

References

Kolmogorov, A.N. (1933). Grundbegriffe der Wahrscheinlichkeitsrechnung. Springer. (The original axiomatic monograph; an English translation, Foundations of the Theory of Probability, was published by Chelsea in 1950, with a second edition in 1956.)
Wasserman, L. (2004). All of Statistics: A Concise Course in Statistical Inference. Springer. (Chapter 1, "Probability".)
Casella, G., Berger, R.L. (2002). Statistical Inference (2nd ed.). Duxbury. (Chapter 1, "Probability Theory".)
Feller, W. (1968). An Introduction to Probability Theory and Its Applications, Volume 1 (3rd ed.). Wiley.
Hájek, A. (2019). Interpretations of Probability. Stanford Encyclopedia of Philosophy. (For the philosophical grounding behind frequentist vs Bayesian readings of the same axioms.)