Expectations and moments
Learning objectives
- Compute the expectation E[X] as a probability-weighted average
- Compute variance Var(X) and standard deviation SD(X) and interpret them as spread
- Recognise skewness as asymmetry and kurtosis as tail-heaviness
- State the linearity property of expectation and use it for sums of random variables
- Diagnose distributions where the mean differs from the median, and explain what the gap indicates
The DISTRIBUTION of a random variable contains all its information — but to summarise it we use MOMENTS: a sequence of numbers (mean, variance, skewness, kurtosis, …) that capture progressively finer features. The first four moments together describe location, spread, asymmetry, and tail behaviour.
Expectation: the probability-weighted average
For discrete X with PMF p: $E[X] = \sum_x x\,p(x)$.
For continuous X with PDF f: $E[X] = \int_{-\infty}^{\infty} x\,f(x)\,dx$.
E[X] is the BALANCE POINT of the distribution if you imagine the PMF/PDF as a mass distribution on the real line. For Uniform[0,1] it's 0.5; for Exp(λ) it's 1/λ; for Normal(μ, σ²) it's μ.
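As a minimal sketch of the two definitions, the Python below computes a discrete expectation exactly and approximates a continuous one with a midpoint Riemann sum. The helper names (`expectation_discrete`, `expectation_continuous`), the truncation point for the Exponential, and the step count are illustrative assumptions rather than anything prescribed by this lesson.

```python
import math
from fractions import Fraction

def expectation_discrete(pmf):
    """E[X] = sum of x * p(x) over the support; pmf maps value -> probability."""
    return sum(x * p for x, p in pmf.items())

def expectation_continuous(pdf, lo, hi, steps=200_000):
    """Approximate E[X] = integral of x * f(x) dx on [lo, hi] by the midpoint rule."""
    width = (hi - lo) / steps
    total = 0.0
    for i in range(steps):
        x = lo + (i + 0.5) * width
        total += x * pdf(x) * width
    return total

# Fair six-sided die: each face has probability 1/6.
die = {face: Fraction(1, 6) for face in range(1, 7)}
die_mean = expectation_discrete(die)

# Uniform[0, 1]: density 1 on the unit interval.
uniform_mean = expectation_continuous(lambda x: 1.0, 0.0, 1.0)

# Exponential(lam): density lam * exp(-lam * x). Truncating the integral at
# 60 / lam discards a negligible tail (an assumption of this sketch).
lam = 2.0
exp_mean = expectation_continuous(lambda x: lam * math.exp(-lam * x), 0.0, 60.0 / lam)

print(die_mean, round(uniform_mean, 4), round(exp_mean, 4))
```

The die mean comes out as the exact fraction 7/2, while the two numerical integrals should land very close to 0.5 and 1/λ = 0.5 respectively.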
Expectation is also called the FIRST MOMENT. By analogy with physics, it's the centre of mass.
Linearity of expectation — its superpower
For ANY random variables X, Y (independent or not) and constants a, b:
$$E[aX + bY] = a\,E[X] + b\,E[Y]$$
This is the most useful single property in probability. It does NOT require independence. Example: for n i.i.d. samples $X_1, \dots, X_n$ with mean μ, $E[\bar{X}] = \frac{1}{n}\sum_{i=1}^{n} E[X_i] = \mu$, regardless of the distribution shape. This is the foundation of unbiasedness arguments.
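To see that linearity really does survive dependence, here is a small exact check, assuming a fair die X and the fully dependent variable Y = 7 − X. The joint PMF, the helper `expect`, and the constants a = 3, b = −2 are all chosen purely for illustration.

```python
from fractions import Fraction

# Joint PMF over (x, y) pairs: X is a fair die and Y = 7 - X, so Y is
# completely determined by X (as far from independent as possible).
joint = {(x, 7 - x): Fraction(1, 6) for x in range(1, 7)}

def expect(joint_pmf, g):
    """E[g(X, Y)] = sum of g(x, y) * p(x, y) over the joint support."""
    return sum(g(x, y) * p for (x, y), p in joint_pmf.items())

a, b = 3, -2  # arbitrary constants chosen for illustration
lhs = expect(joint, lambda x, y: a * x + b * y)
rhs = a * expect(joint, lambda x, y: x) + b * expect(joint, lambda x, y: y)

# By contrast, the product rule E[XY] = E[X] E[Y] does need independence
# (or at least zero covariance), and it fails for this pair.
e_xy = expect(joint, lambda x, y: x * y)
e_x_e_y = expect(joint, lambda x, y: x) * expect(joint, lambda x, y: y)

print(lhs, rhs, e_xy, e_x_e_y)
```

The first two values agree exactly, while the last two differ, which is the contrast between a property that holds for any joint distribution and one that does not.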
Variance and standard deviation
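The SECOND CENTRAL MOMENT measures spread around the mean μ = E[X]:
$$\mathrm{Var}(X) = E[(X - \mu)^2] = E[X^2] - (E[X])^2$$
Always non-negative. Zero iff X is constant (with probability 1). Standard deviation, $\mathrm{SD}(X) = \sqrt{\mathrm{Var}(X)}$, expresses the same spread in the original units of X.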
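Linearity-style result: $\mathrm{Var}(aX + b) = a^2\,\mathrm{Var}(X)$. Variance is NOT linear (the constant b drops out, the scalar a squares). For sums, $\mathrm{Var}(X + Y) = \mathrm{Var}(X) + \mathrm{Var}(Y) + 2\,\mathrm{Cov}(X, Y)$; only zero covariance makes variance additive.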
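The scaling rule Var(aX + b) = a² Var(X) and the covariance term in Var(X + Y) can both be checked exactly on a small example. The sketch below assumes a fair die X and the dependent variable Y = 7 − X; the helper names and the constants a = 3, b = 10 are hypothetical choices for illustration.

```python
from fractions import Fraction

def mean(pmf):
    """E[X] for a PMF given as value -> probability."""
    return sum(x * p for x, p in pmf.items())

def variance(pmf):
    """Var(X) = E[(X - mu)^2] for a PMF given as value -> probability."""
    mu = mean(pmf)
    return sum((x - mu) ** 2 * p for x, p in pmf.items())

def transform(pmf, g):
    """PMF of g(X), merging the probabilities of values that collide."""
    out = {}
    for x, p in pmf.items():
        out[g(x)] = out.get(g(x), 0) + p
    return out

die = {face: Fraction(1, 6) for face in range(1, 7)}
var_x = variance(die)

# Scaling: the shift b has no effect, the scale a enters squared.
a, b = 3, 10
var_scaled = variance(transform(die, lambda x: a * x + b))

# Sums: with Y = 7 - X, the sum X + Y is the constant 7, so its variance is 0.
y_pmf = transform(die, lambda x: 7 - x)
var_y = variance(y_pmf)
cov_xy = sum((x - mean(die)) * ((7 - x) - mean(y_pmf)) * p for x, p in die.items())
var_sum = variance(transform(die, lambda x: x + (7 - x)))

print(var_x, var_scaled, var_y, cov_xy, var_sum)
```

Here the covariance is as negative as it can be, so it cancels both variances exactly; dropping the covariance term would wrongly predict Var(X + Y) = 35/6.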
Higher moments: skewness and kurtosis
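The THIRD STANDARDISED MOMENT, SKEWNESS (with μ = E[X] and σ = SD(X)):
$$\mathrm{Skew}(X) = E\!\left[\left(\frac{X - \mu}{\sigma}\right)^{3}\right]$$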
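Symmetric distributions have skewness 0 (whenever the third moment exists). Right-skewed distributions (long right tail) have positive skewness; left-skewed have negative. As a rule of thumb, mean > median for right-skewed and mean < median for left-skewed distributions; this holds for the common unimodal families but is not a theorem, and counterexamples exist.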
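The FOURTH STANDARDISED MOMENT, KURTOSIS, and the more useful EXCESS KURTOSIS:
$$\mathrm{Kurt}(X) = E\!\left[\left(\frac{X - \mu}{\sigma}\right)^{4}\right], \qquad \text{excess kurtosis} = \mathrm{Kurt}(X) - 3$$
Normal has excess kurtosis 0 (subtracting the Normal's 3 makes Normal the reference). Heavy-tailed distributions, such as Student's t with few degrees of freedom, have positive excess kurtosis. The Cauchy is so heavy-tailed that its moments do not exist, so its kurtosis is undefined rather than merely large. Light-tailed or bimodal distributions can have negative excess kurtosis; Uniform[0, 1], for example, has −1.2. Kurtosis is about tail heaviness, NOT "peakedness" (a common misreading).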
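One way to make the standardised moments concrete is to compute them from raw moments E[X^k], which are known in closed form for several textbook distributions. The following is a sketch under that assumption; the function name `standardised_moments` and the dictionary layout are hypothetical.

```python
import math

def standardised_moments(raw):
    """Skewness and excess kurtosis from raw moments raw[k] = E[X^k], k = 1..4."""
    m1, m2, m3, m4 = raw[1], raw[2], raw[3], raw[4]
    var = m2 - m1 ** 2
    # Central moments via the binomial expansion of E[(X - m1)^k].
    c3 = m3 - 3 * m1 * m2 + 2 * m1 ** 3
    c4 = m4 - 4 * m1 * m3 + 6 * m1 ** 2 * m2 - 3 * m1 ** 4
    skew = c3 / var ** 1.5
    excess_kurt = c4 / var ** 2 - 3
    return skew, excess_kurt

# Exponential(1): E[X^k] = k!
exp_raw = {k: math.factorial(k) for k in range(1, 5)}
# Uniform[0, 1]: E[X^k] = 1 / (k + 1)
uni_raw = {k: 1 / (k + 1) for k in range(1, 5)}
# Standard Normal: E[X], E[X^2], E[X^3], E[X^4] = 0, 1, 0, 3
norm_raw = {1: 0, 2: 1, 3: 0, 4: 3}

exp_skew, exp_kurt = standardised_moments(exp_raw)
uni_skew, uni_kurt = standardised_moments(uni_raw)
norm_skew, norm_kurt = standardised_moments(norm_raw)

print(exp_skew, exp_kurt)
print(round(uni_skew, 6), round(uni_kurt, 6))
print(norm_skew, norm_kurt)
```

The Exponential shows both positive skewness and positive excess kurtosis, the Uniform is symmetric with negative excess kurtosis (no tails at all), and the Normal sits at the reference point of zero for both.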
Mean ≠ median: when does it matter?
For symmetric distributions: mean = median. For skewed distributions: they diverge. Use median (not mean) when the distribution is heavily skewed or has outliers (e.g., income, claim sizes, runtime). Mean is sensitive to single extreme observations; median is robust.
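The sensitivity of the mean to a single extreme observation is easy to demonstrate. The claim sizes below are made-up numbers for illustration only, and the sketch uses Python's standard `statistics` module.

```python
import statistics

# Hypothetical claim sizes (made-up numbers, not real data).
claims = [120, 135, 150, 160, 180, 210, 240]
with_outlier = claims + [25_000]   # add one extreme claim

mean_before = statistics.mean(claims)
median_before = statistics.median(claims)
mean_after = statistics.mean(with_outlier)
median_after = statistics.median(with_outlier)

for label, m, med in [
    ("typical", mean_before, median_before),
    ("with outlier", mean_after, median_after),
]:
    print(f"{label:>12}: mean = {m:.1f}, median = {med:.1f}")
```

A single extreme value multiplies the mean many times over, while the median moves only from one central observation to the midpoint of its neighbours.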
Try it
- Default Normal(0, 1). Mean = median = 0. Variance = 1. Skewness = 0. Excess kurtosis = 0. The shaded ±1σ region covers about 68% of the area.
- Switch to Exponential(1) (λ = 1). Mean = 1, median = ln(2) ≈ 0.693. The median sits LEFT of the mean — right-skewed. Skewness = 2; this large positive value reflects the long right tail.
- Switch to Beta(2, 5). Asymmetric, bounded on [0, 1]. Mode is at (α-1)/(α+β-2) = 0.2. Mean = α/(α+β) = 2/7 ≈ 0.286. Median ≈ 0.264. Mode < median < mean, so it is right-skewed.
- Set Beta α = β = 5. Symmetric — mean = median = 0.5. Roughly Normal-shaped.
- Switch to the Bimodal Mixture. Mean = 0 (by symmetry), median = 0 (also by symmetry). But the "typical" observation is NOT near 0; it's near ±1.5. Mean and median are MISLEADING for bimodal distributions — you need to look at the shape, not the moments alone.
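The Exponential and Beta(2, 5) summaries quoted in the list above can be reproduced without any plotting tool. This sketch assumes the closed-form Beta(2, 5) CDF obtained by integrating its density 30x(1 − x)⁴, and finds the median by bisection; the function names are hypothetical.

```python
import math

def exponential_median(lam):
    """Solve 1 - exp(-lam * m) = 0.5 for m, which gives ln(2) / lam."""
    return math.log(2) / lam

def beta_2_5_cdf(x):
    """CDF of Beta(2, 5), from integrating the density 30 * x * (1 - x)**4."""
    return 1 - (1 - x) ** 5 * (1 + 5 * x)

def bisect_quantile(cdf, q, lo=0.0, hi=1.0, tol=1e-10):
    """Find x in [lo, hi] with cdf(x) = q, assuming cdf is increasing."""
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if cdf(mid) < q:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

exp_median = exponential_median(1.0)
beta_mode = (2 - 1) / (2 + 5 - 2)
beta_mean = 2 / (2 + 5)
beta_median = bisect_quantile(beta_2_5_cdf, 0.5)

print(round(exp_median, 3), round(beta_mode, 3), round(beta_median, 3), round(beta_mean, 3))
```

For Beta(2, 5) the ordering mode < median < mean is the typical signature of a right-skewed unimodal distribution.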
For an Exponential(λ = 0.5), the mean is 2 and the median is ln(2)/0.5 ≈ 1.386. If an insurance claim follows this distribution, which summary should you report to a regulator — and which to a customer? Justify in one sentence each.
What you now know
Moments compress a distribution into a sequence of numbers. Mean and variance are often enough for a first summary; skewness flags asymmetry; kurtosis flags tail behaviour. Linearity of expectation holds for any joint distribution, whether or not the variables are dependent. §0.5 is the distribution catalog: each named distribution has a mean and variance formula you should commit to memory. §0.9 returns to moments via the moment-generating function, from which every moment can be recovered by differentiation when the function exists.
References
- Wasserman, L. (2004). All of Statistics. Springer. (Chapter 3 — expectation.)
- Casella, G., Berger, R.L. (2002). Statistical Inference, 2nd ed. (Section 2.2.)
- Westfall, P.H. (2014). "Kurtosis as peakedness, 1905-2014: R.I.P." The American Statistician 68(3). (Clarifying that kurtosis is about tails, not peakedness.)
- Joanes, D.N., Gill, C.A. (1998). "Comparing measures of sample skewness and kurtosis." The Statistician 47, 183-189.
- Feller, W. (1968). An Introduction to Probability Theory and Its Applications, Vol. 1. Wiley. (Chapter 9 — moments and generating functions.)