Expectations and moments

Probability from zero

Learning objectives

Compute the expectation E[X] as a probability-weighted average
Compute variance Var(X) and standard deviation SD(X) and interpret them as spread
Recognise skewness as asymmetry and kurtosis as tail-heaviness
State the linearity property of expectation and use it for sums of random variables
Diagnose distributions where the mean differs from the median, and explain what that difference implies

The DISTRIBUTION of a random variable contains all its information — but to summarise it we use MOMENTS: a sequence of numbers (mean, variance, skewness, kurtosis, …) that capture progressively finer features. The first four moments together describe location, spread, asymmetry, and tail behaviour.

Expectation: the probability-weighted average

For discrete X: $E[X] = \sum_x x \cdot p_X(x)$ .

For continuous X: $E[X] = \int_{-\infty}^{\infty} x \cdot f_X(x)\,dx$ .

E[X] is the BALANCE POINT of the distribution if you imagine the PMF/PDF as a mass distribution on the real line. For Uniform[0,1] it's 0.5; for Exp(λ) it's 1/λ; for Normal(μ, σ²) it's μ.
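Both definitions can be checked numerically. The following is a minimal sketch, not a reference implementation: the helper names `expectation_discrete` and `expectation_continuous`, the fair-die example, the rate λ = 2 and the truncation of the integral at 40 are all illustrative assumptions.

```python
import math

def expectation_discrete(pmf):
    """E[X] = sum of x * p(x); pmf maps each value to its probability."""
    return sum(x * p for x, p in pmf.items())

def expectation_continuous(pdf, lo, hi, steps=200_000):
    """Approximate the integral of x * f(x) over [lo, hi] with the midpoint rule."""
    width = (hi - lo) / steps
    total = 0.0
    for i in range(steps):
        x = lo + (i + 0.5) * width
        total += x * pdf(x) * width
    return total

# Discrete case: a fair six-sided die (illustrative example).
die = {face: 1 / 6 for face in range(1, 7)}
die_mean = expectation_discrete(die)

# Continuous case: Exp(lam) with lam = 2, truncated at 40 (assumption:
# the tail beyond 40 is negligible for this rate).
lam = 2.0
exp_mean = expectation_continuous(lambda x: lam * math.exp(-lam * x), 0.0, 40.0)

print(die_mean)  # close to 3.5
print(exp_mean)  # close to 1 / lam = 0.5
```

The die's balance point is 3.5, a value the die can never show, which is a reminder that E[X] need not be a possible outcome.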

Expectation is also called the FIRST MOMENT. By analogy with physics, it's the centre of mass.

Linearity of expectation — its superpower

For ANY random variables X, Y (independent or not) and constants a, b:

$$E[aX + bY] = a\,E[X] + b\,E[Y].$$

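This is arguably the most useful single property in probability, and it does NOT require independence; it only requires that the expectations exist. Example: for n samples $X_1, \dots, X_n$ with common mean $E[X]$, $E[\bar X] = \frac{1}{n}\sum_{i=1}^{n} E[X_i] = E[X]$ regardless of the distribution shape. This is the foundation of unbiasedness arguments.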
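To see that independence really is unnecessary, here is a small exact check on a deliberately dependent pair. It is a sketch under assumed values: the joint PMF (Y = X² with X uniform on {-1, 0, 1, 2}), the constants a and b, and the helper `expect` are all made up for illustration.

```python
# Joint PMF of (X, Y) with Y = X**2, so Y is fully determined by X.
joint = {(x, x * x): 1 / 4 for x in (-1, 0, 1, 2)}

def expect(g, joint_pmf):
    """E[g(X, Y)] as a probability-weighted sum over the joint PMF."""
    return sum(g(x, y) * p for (x, y), p in joint_pmf.items())

a, b = 3.0, -2.0
e_x = expect(lambda x, y: x, joint)
e_y = expect(lambda x, y: y, joint)

lhs = expect(lambda x, y: a * x + b * y, joint)  # E[aX + bY]
rhs = a * e_x + b * e_y                          # a E[X] + b E[Y]

# Products do NOT factor here, which confirms X and Y are dependent.
e_xy = expect(lambda x, y: x * y, joint)

print(lhs, rhs)         # both -1.5
print(e_xy, e_x * e_y)  # 2.0 versus 0.75
```

Linearity holds exactly even though E[XY] differs from E[X]E[Y], which is the distinction to keep in mind: sums split without independence, products generally do not.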

Variance and standard deviation

The SECOND CENTRAL MOMENT measures spread around the mean:

$$\mathrm{Var}(X) = E[(X - \mu)^2] = E[X^2] - (E[X])^2.$$
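The second equality follows from expanding the square and applying linearity of expectation, treating $\mu = E[X]$ as a constant:

```latex
\begin{aligned}
E[(X - \mu)^2] &= E[X^2 - 2\mu X + \mu^2] \\
               &= E[X^2] - 2\mu\,E[X] + \mu^2 \\
               &= E[X^2] - 2\mu^2 + \mu^2 \\
               &= E[X^2] - (E[X])^2 .
\end{aligned}
```

As a quick illustration (a fair six-sided die, chosen here only as an example): $E[X^2] = 91/6$ and $(E[X])^2 = 3.5^2 = 12.25$, so $\mathrm{Var}(X) = 35/12 \approx 2.917$.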

Variance is always non-negative, and it is zero if and only if X is constant (with probability 1). The standard deviation $\sigma = \sqrt{\mathrm{Var}(X)}$ expresses the same spread in the original units of X.

Scaling rule: $\mathrm{Var}(aX + b) = a^2 \mathrm{Var}(X)$ . Variance is NOT linear (the constant b drops out, the scalar a is squared). For sums, $\mathrm{Var}(X + Y) = \mathrm{Var}(X) + \mathrm{Var}(Y) + 2\,\mathrm{Cov}(X, Y)$ ; the variances simply add only when the covariance is zero (for example, when X and Y are independent).
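Both variance rules can be verified exactly on a small joint distribution. This is a sketch with assumed numbers: the joint PMF below (two positively correlated 0/1 variables), the constants a = 3 and b = 7, and the helpers `expect`, `var` and `cov` are illustrative only.

```python
# Joint PMF of two positively correlated 0/1 variables (made-up numbers).
joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}

def expect(g):
    """E[g(X, Y)] under the joint PMF above."""
    return sum(g(x, y) * p for (x, y), p in joint.items())

def var(g):
    """Var(g(X, Y)) = E[(g - E[g])^2]."""
    m = expect(g)
    return expect(lambda x, y: (g(x, y) - m) ** 2)

def cov(g, h):
    """Cov(g, h) = E[(g - E[g]) (h - E[h])]."""
    mg, mh = expect(g), expect(h)
    return expect(lambda x, y: (g(x, y) - mg) * (h(x, y) - mh))

X = lambda x, y: x
Y = lambda x, y: y

a, b = 3, 7
var_scaled = var(lambda x, y: a * x + b)        # Var(aX + b)
var_sum = var(lambda x, y: x + y)               # Var(X + Y), computed directly
var_sum_formula = var(X) + var(Y) + 2 * cov(X, Y)

print(var_scaled, a ** 2 * var(X))  # both close to 2.25: b drops out, a is squared
print(var_sum, var_sum_formula)     # both close to 0.8
print(var(X) + var(Y))              # 0.5: ignoring the covariance understates the spread
```

Here the covariance is 0.15, so simply adding the two variances would give 0.5 instead of the correct 0.8.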

Higher moments: skewness and kurtosis

The THIRD STANDARDISED MOMENT, SKEWNESS:

$$\mathrm{Skew}(X) = E\!\left[\left(\frac{X - \mu}{\sigma}\right)^3\right].$$

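Symmetric distributions have skewness 0 (when the third moment exists). Right-skewed distributions (long right tail) have positive skewness; left-skewed have negative. As a rule of thumb for unimodal distributions, mean > median under right skew and mean < median under left skew, although exceptions exist.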

The FOURTH STANDARDISED MOMENT, KURTOSIS, and the more useful EXCESS KURTOSIS:

$$\mathrm{ExcessKurt}(X) = E\!\left[\left(\frac{X - \mu}{\sigma}\right)^4\right] - 3.$$

Normal has excess kurtosis 0 (subtracting the Normal's kurtosis of 3 makes Normal the reference). Heavy-tailed distributions, such as Student's t with few degrees of freedom (df > 4 is needed for the fourth moment to be finite), have positive excess kurtosis. The Cauchy is heavier-tailed still, but its moments do not exist, so its kurtosis is undefined. Light-tailed / bimodal distributions can have negative excess kurtosis. Kurtosis is about tail heaviness, NOT "peakedness" (a common misreading).
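The two standardised moments can be computed directly from their definitions for any finite PMF. The sketch below is illustrative: the helper names and the two example distributions (a fair die and a rare-event Bernoulli with p = 0.1) are assumptions chosen to show one symmetric, light-tailed case and one strongly right-skewed case.

```python
import math

def standardised_moment(pmf, k):
    """E[((X - mu) / sigma)^k] for a PMF given as {value: probability}."""
    mu = sum(x * p for x, p in pmf.items())
    variance = sum((x - mu) ** 2 * p for x, p in pmf.items())
    sigma = math.sqrt(variance)
    return sum(((x - mu) / sigma) ** k * p for x, p in pmf.items())

def skewness(pmf):
    return standardised_moment(pmf, 3)

def excess_kurtosis(pmf):
    return standardised_moment(pmf, 4) - 3

die = {face: 1 / 6 for face in range(1, 7)}  # symmetric, bounded
rare_event = {0: 0.9, 1: 0.1}                # mostly 0, occasionally 1

print(skewness(die), excess_kurtosis(die))
# skewness close to 0, excess kurtosis close to -1.27 (lighter tails than Normal)

print(skewness(rare_event), excess_kurtosis(rare_event))
# skewness close to 2.67, excess kurtosis close to 5.11
```

The rare-event variable has positive excess kurtosis because occasional values land far from the mean in standard-deviation units, which is the tail-heaviness reading rather than a statement about how peaked the distribution looks.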

Mean ≠ median: when does it matter?

For symmetric distributions: mean = median. For skewed distributions: they diverge. Use median (not mean) when the distribution is heavily skewed or has outliers (e.g., income, claim sizes, runtime). Mean is sensitive to single extreme observations; median is robust.
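The sensitivity of the mean to a single extreme value is easy to demonstrate with the standard library. The runtimes below are made-up numbers used purely for illustration.

```python
import statistics

# Hypothetical request runtimes in milliseconds (illustrative data).
runtimes_ms = [12, 14, 15, 15, 16, 18, 20]
with_outlier = runtimes_ms + [900]  # one pathological slow request

mean_before = statistics.mean(runtimes_ms)
median_before = statistics.median(runtimes_ms)
mean_after = statistics.mean(with_outlier)
median_after = statistics.median(with_outlier)

print(mean_before, median_before)  # about 15.7 and 15
print(mean_after, median_after)    # 126.25 and 15.5
```

One observation moves the mean by a factor of about eight while the median shifts by half a millisecond.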

Try it

Default Normal(0, 1). Mean = median = 0. Variance = 1. Skewness = 0. Excess kurtosis = 0. The shaded ±1σ region covers about 68% of the area.
Switch to Exponential(1) (λ = 1). Mean = 1, median = ln(2) ≈ 0.693. The median sits LEFT of the mean — right-skewed. Skewness = 2; this large positive value reflects the long right tail.
Switch to Beta(2, 5). Asymmetric, bounded on [0, 1]. Mode is at (α-1)/(α+β-2) = 0.2. Mean = α/(α+β) = 2/7 ≈ 0.286. Median ≈ 0.264. Mode < median < mean, so the distribution is right-skewed.
Set Beta α = β = 5. Symmetric — mean = median = 0.5. Roughly Normal-shaped.
Switch to the Bimodal Mixture. Mean = 0 (by symmetry), median = 0 (also by symmetry). But the "typical" observation is NOT near 0; it's near ±1.5. Mean and median are MISLEADING for bimodal distributions — you need to look at the shape, not the moments alone.
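The bimodal case can be quantified without the interactive widget. The sketch below assumes an equal-weight mixture of Normal(-1.5, 0.5²) and Normal(1.5, 0.5²); the modes at ±1.5 come from the text above, but the component standard deviation of 0.5 and the equal weights are assumptions, so the widget's actual mixture may differ.

```python
from statistics import NormalDist

# Assumed components: equal weights, modes at +/-1.5, standard deviation 0.5.
left = NormalDist(mu=-1.5, sigma=0.5)
right = NormalDist(mu=1.5, sigma=0.5)

def mixture_cdf(x):
    """CDF of the equal-weight two-component mixture."""
    return 0.5 * left.cdf(x) + 0.5 * right.cdf(x)

# By symmetry the median (and the mean) sit at 0.
cdf_at_zero = mixture_cdf(0.0)

# Probability of landing within 0.5 of the mean/median...
near_centre = mixture_cdf(0.5) - mixture_cdf(-0.5)

# ...versus within 0.5 of either mode.
near_modes = (mixture_cdf(2.0) - mixture_cdf(1.0)) + (mixture_cdf(-1.0) - mixture_cdf(-2.0))

print(cdf_at_zero)  # 0.5
print(near_centre)  # roughly 0.02
print(near_modes)   # roughly 0.68
```

Under these assumptions only about 2% of observations fall within 0.5 of the mean and median, while about 68% fall within 0.5 of one of the two modes.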

For an Exponential(λ = 0.5), the mean is 2 and the median is ln(2)/0.5 ≈ 1.386. If an insurance claim follows this distribution, which summary should you report to a regulator — and which to a customer? Justify in one sentence each.

What you now know

Moments compress a distribution into a sequence of numbers. Mean and variance are often enough for a first summary; skewness flags asymmetry; kurtosis flags tail behaviour. Linearity of expectation holds whatever the dependence between the variables. §0.5 is the distribution catalog: each named distribution has a mean and variance formula you should commit to memory. §0.9 returns to moments via the moment-generating function, which, when it exists, yields every moment by differentiation.

References

Wasserman, L. (2004). All of Statistics. Springer. (Chapter 3 — expectation.)
Casella, G., Berger, R.L. (2002). Statistical Inference, 2nd ed. (Section 2.2.)
Westfall, P.H. (2014). "Kurtosis as peakedness, 1905-2014: R.I.P." The American Statistician 68(3). (Clarifying that kurtosis is about tails, not peakedness.)
Joanes, D.N., Gill, C.A. (1998). "Comparing measures of sample skewness and kurtosis." The Statistician 47, 183-189.
Feller, W. (1968). An Introduction to Probability Theory and Its Applications, Vol. 1. Wiley. (Chapter 9 — moments and generating functions.)