Expectations and moments

Probability from zero

Learning objectives

  • Compute the expectation E[X] as a probability-weighted average
  • Compute variance Var(X) and standard deviation SD(X) and interpret them as spread
  • Recognise skewness as asymmetry and kurtosis as tail-heaviness
  • State the linearity property of expectation and use it for sums of random variables
  • Diagnose distributions where the mean differs from the median, and explain what that difference implies

The DISTRIBUTION of a random variable contains all its information — but to summarise it we use MOMENTS: a sequence of numbers (mean, variance, skewness, kurtosis, …) that capture progressively finer features. The first four moments together describe location, spread, asymmetry, and tail behaviour.

Expectation: the probability-weighted average

For discrete X: $E[X] = \sum_x x \cdot p_X(x)$.

For continuous X: $E[X] = \int_{-\infty}^{\infty} x \cdot f_X(x)\,dx$.

E[X] is the BALANCE POINT of the distribution if you imagine the PMF/PDF as a mass distribution on the real line. For Uniform[0,1] it's 0.5; for Exp(λ) it's 1/λ; for Normal(μ, σ²) it's μ.
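As a small illustration of both formulas, the sketch below computes E[X] exactly for a fair six-sided die (an example chosen here for illustration) and approximates the integral for Exp(λ) with a simple midpoint rule. The helper names `expectation_discrete` and `expectation_continuous`, the choice λ = 2, and the truncation of the upper limit at 50/λ are all assumptions of this sketch.

```python
import math
from fractions import Fraction


def expectation_discrete(pmf):
    """E[X] = sum over the support of x * p(x); pmf maps value -> probability."""
    return sum(x * p for x, p in pmf.items())


def expectation_continuous(pdf, lo, hi, steps=100_000):
    """Midpoint-rule approximation of the integral of x * f(x) over [lo, hi]."""
    width = (hi - lo) / steps
    total = 0.0
    for i in range(steps):
        x = lo + (i + 0.5) * width
        total += x * pdf(x) * width
    return total


# Fair die: each face has probability 1/6 (illustrative assumption).
die = {face: Fraction(1, 6) for face in range(1, 7)}
die_mean = expectation_discrete(die)

# Exp(lam) has density lam * exp(-lam * x) for x >= 0. The infinite upper
# limit is truncated at 50 / lam, where the remaining tail is negligible.
lam = 2.0
exp_mean = expectation_continuous(
    lambda x: lam * math.exp(-lam * x), 0.0, 50.0 / lam
)

print(die_mean)   # exact fraction for the die
print(exp_mean)   # numerical value, close to 1 / lam
```

The die result is exact because `Fraction` avoids floating-point rounding; the exponential result is only an approximation and should land close to 1/λ.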

Expectation is also called the FIRST MOMENT. By analogy with physics, it's the centre of mass.

Linearity of expectation — its superpower

For ANY random variables X, Y (independent or not) and constants a, b:

$$E[aX + bY] = a\,E[X] + b\,E[Y].$$

This is arguably the single most useful property in probability, and it does NOT require independence. Example: for n i.i.d. samples $X_1, \dots, X_n$, linearity gives $E[\bar X] = \frac{1}{n}\sum_{i=1}^{n} E[X_i] = E[X]$ regardless of the distribution shape. This is the foundation of unbiasedness arguments.
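To see that independence really is not needed, here is a minimal sketch with two strongly dependent variables: X is a fair die roll and Y = X², so Y is completely determined by X. The die, the constants a = 3 and b = −2, and the helper `expect` are illustrative assumptions, not part of the text above.

```python
from fractions import Fraction

# X is a fair die roll; Y = X**2 is a function of X, so the two are dependent.
faces = range(1, 7)
p = Fraction(1, 6)


def expect(g):
    """E[g(X)] for the fair die, computed exactly."""
    return sum(g(x) * p for x in faces)


a, b = 3, -2
lhs = expect(lambda x: a * x + b * x**2)                     # E[aX + bY]
rhs = a * expect(lambda x: x) + b * expect(lambda x: x**2)   # a E[X] + b E[Y]

# Contrast: the product rule E[XY] = E[X] E[Y] does need more than this.
e_xy = expect(lambda x: x * x**2)
e_x_times_e_y = expect(lambda x: x) * expect(lambda x: x**2)

print(lhs == rhs)             # linearity holds despite the dependence
print(e_xy == e_x_times_e_y)  # the product rule fails here
```

The two sides of the linearity identity agree exactly, while E[XY] and E[X]E[Y] differ, which is the practical distinction between the two properties.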

Variance and standard deviation

The SECOND CENTRAL MOMENT measures spread around the mean:

$$\mathrm{Var}(X) = E[(X - \mu)^2] = E[X^2] - (E[X])^2.$$

Variance is always non-negative, and it is zero iff X is (almost surely) constant. The standard deviation $\sigma = \sqrt{\mathrm{Var}(X)}$ expresses the same spread in the original units of X.

Scaling and shifting: $\mathrm{Var}(aX + b) = a^2\,\mathrm{Var}(X)$. Variance is NOT linear (the constant b drops out, the scalar a is squared). For sums, $\mathrm{Var}(X + Y) = \mathrm{Var}(X) + \mathrm{Var}(Y) + 2\,\mathrm{Cov}(X, Y)$; variance adds across a sum only when the covariance is zero.
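The variance identities in this section can be checked exactly on a small example. The sketch below again assumes a fair die for X and takes Y = X² as a deliberately dependent partner; the helpers `expect`, `var`, and `cov` and the constants a = 3, b = 10 are hypothetical names and values chosen for illustration.

```python
from fractions import Fraction

faces = range(1, 7)
p = Fraction(1, 6)


def expect(g):
    """E[g(X)] for a fair die, computed exactly."""
    return sum(g(x) * p for x in faces)


def var(g):
    """Var(g(X)) from the definition E[(g(X) - mu)^2]."""
    mu = expect(g)
    return expect(lambda x: (g(x) - mu) ** 2)


def cov(g, h):
    """Cov(g(X), h(X)) = E[g h] - E[g] E[h]."""
    return expect(lambda x: g(x) * h(x)) - expect(g) * expect(h)


def x_val(x):
    return x


def y_val(x):
    return x**2  # Y depends on X, so Cov(X, Y) is not zero


# Definition versus the shortcut E[X^2] - (E[X])^2.
var_definition = var(x_val)
var_shortcut = expect(lambda x: x**2) - expect(x_val) ** 2

# Var(aX + b): the shift b drops out and the scale a is squared.
a, b = 3, 10
var_affine = var(lambda x: a * x + b)

# Var(X + Y) needs the covariance term when X and Y are dependent.
var_sum = var(lambda x: x_val(x) + y_val(x))
var_sum_formula = var(x_val) + var(y_val) + 2 * cov(x_val, y_val)

print(var_definition, var_shortcut)
print(var_affine, a**2 * var_definition)
print(var_sum, var_sum_formula)
```

Each printed pair should match. Dropping the `2 * cov(...)` term would make the last pair disagree, since the covariance here is positive.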

Higher moments: skewness and kurtosis

The THIRD STANDARDISED MOMENT, SKEWNESS:

$$\mathrm{Skew}(X) = E\!\left[\left(\frac{X - \mu}{\sigma}\right)^3\right].$$

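Symmetric distributions have skewness 0 (whenever the third moment exists). Right-skewed distributions (long right tail) have positive skewness; left-skewed distributions have negative skewness. As a rule of thumb, mean > median for right-skewed distributions and mean < median for left-skewed ones, although exceptions exist.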

The FOURTH STANDARDISED MOMENT is KURTOSIS; the more useful quantity is EXCESS KURTOSIS:

$$\mathrm{ExcessKurt}(X) = E\!\left[\left(\frac{X - \mu}{\sigma}\right)^4\right] - 3.$$

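The Normal has excess kurtosis 0 (subtracting the Normal's kurtosis of 3 makes it the reference). Heavy-tailed distributions have positive excess kurtosis: Student's t with ν > 4 degrees of freedom has excess kurtosis 6/(ν − 4), and for ν ≤ 4 it is infinite or undefined. The Cauchy has no finite moments at all, so its kurtosis is undefined. Light-tailed distributions (e.g., Uniform, with excess kurtosis −1.2) and some bimodal distributions have negative excess kurtosis. Kurtosis is about tail heaviness, NOT "peakedness" (a common misreading).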
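The standardised-moment definitions translate directly into numerical integration. The sketch below is one way to do it, using a plain midpoint rule rather than a library routine; `integrate` and `shape_summary` are hypothetical helper names, and the upper limit of 60 for Exp(1) is an assumed truncation point where the tail is negligible.

```python
import math


def integrate(g, lo, hi, steps=50_000):
    """Midpoint-rule approximation of the integral of g over [lo, hi]."""
    width = (hi - lo) / steps
    return sum(g(lo + (i + 0.5) * width) for i in range(steps)) * width


def shape_summary(pdf, lo, hi):
    """Return (mean, variance, skewness, excess kurtosis) of a density on [lo, hi]."""
    mean = integrate(lambda x: x * pdf(x), lo, hi)
    var = integrate(lambda x: (x - mean) ** 2 * pdf(x), lo, hi)
    sd = math.sqrt(var)
    # Third and fourth moments of the standardised variable (X - mean) / sd.
    skew = integrate(lambda x: ((x - mean) / sd) ** 3 * pdf(x), lo, hi)
    kurt = integrate(lambda x: ((x - mean) / sd) ** 4 * pdf(x), lo, hi)
    return mean, var, skew, kurt - 3.0


# Exp(1): density exp(-x) on x >= 0, truncated at 60 (assumption of this sketch).
exp1 = shape_summary(lambda x: math.exp(-x), 0.0, 60.0)

# Uniform[0, 1]: a light-tailed reference case.
unif = shape_summary(lambda x: 1.0, 0.0, 1.0)

print("Exp(1):      ", [round(v, 4) for v in exp1])
print("Uniform[0,1]:", [round(v, 4) for v in unif])
```

For Exp(1) the skewness and excess kurtosis should come out close to 2 and 6; for Uniform[0, 1] close to 0 and −1.2, a light-tailed case with negative excess kurtosis.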

Mean ≠ median: when does it matter?

For symmetric distributions: mean = median. For skewed distributions: they diverge. Use median (not mean) when the distribution is heavily skewed or has outliers (e.g., income, claim sizes, runtime). Mean is sensitive to single extreme observations; median is robust.
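A quick way to see the robustness difference is to add one extreme value to a small sample. The claim sizes below are made-up numbers for illustration only; the sketch uses the standard-library `statistics` module.

```python
import statistics

# Hypothetical claim sizes (illustrative values, not real data).
claims = [120, 150, 180, 200, 210, 250, 300]
with_outlier = claims + [25_000]  # one extreme claim

mean_before = statistics.mean(claims)
median_before = statistics.median(claims)
mean_after = statistics.mean(with_outlier)
median_after = statistics.median(with_outlier)

print(f"mean:   {mean_before:.2f} -> {mean_after:.2f}")
print(f"median: {median_before} -> {median_after}")
```

A single extreme observation moves the mean by more than an order of magnitude, while the median shifts only from one middle value to the average of the two middle values.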

[Interactive figure: Moments Visualizer]

Try it

  • Default Normal(0, 1). Mean = median = 0. Variance = 1. Skewness = 0. Excess kurtosis = 0. The shaded ±1σ region covers about 68% of the area.
  • Switch to Exponential(1) (λ = 1). Mean = 1, median = ln(2) ≈ 0.693. The median sits LEFT of the mean — right-skewed. Skewness = 2; this large positive value reflects the long right tail.
  • Switch to Beta(2, 5). Asymmetric, bounded on [0, 1]. Mode is at (α-1)/(α+β-2) = 0.2. Mean = α/(α+β) = 2/7 ≈ 0.286. Median ≈ 0.265. Right-skewed.
  • Set Beta α = β = 5. Symmetric — mean = median = 0.5. Roughly Normal-shaped.
  • Switch to the Bimodal Mixture. Mean = 0 (by symmetry), median = 0 (also by symmetry). But the "typical" observation is NOT near 0; it's near ±1.5. Mean and median are MISLEADING for bimodal distributions — you need to look at the shape, not the moments alone.
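The Exponential and Beta(2, 5) numbers quoted in the list can be reproduced without the widget. The sketch below finds medians by bisection on the CDF; `bisect_quantile` is a hypothetical helper, and the closed-form Beta(2, 5) CDF used here, 1 − (1 − x)⁵(1 + 5x), follows from integrating the density 30x(1 − x)⁴.

```python
import math


def bisect_quantile(cdf, q, lo, hi, tol=1e-12):
    """Find x in [lo, hi] with cdf(x) = q, assuming cdf is continuous and increasing."""
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if cdf(mid) < q:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


# Exponential(lam = 1): CDF is 1 - exp(-lam * x); the median should be ln(2) / lam.
lam = 1.0
exp_median = bisect_quantile(lambda x: 1 - math.exp(-lam * x), 0.5, 0.0, 50.0)


def beta_2_5_cdf(x):
    """CDF of Beta(2, 5), obtained by integrating 30 * x * (1 - x)**4."""
    return 1 - (1 - x) ** 5 * (1 + 5 * x)


alpha, beta = 2, 5
beta_mode = (alpha - 1) / (alpha + beta - 2)
beta_median = bisect_quantile(beta_2_5_cdf, 0.5, 0.0, 1.0)
beta_mean = alpha / (alpha + beta)

print(f"Exp(1) median:  {exp_median:.4f}  (ln 2 = {math.log(2):.4f})")
print(f"Beta(2,5) mode, median, mean: "
      f"{beta_mode:.3f}, {beta_median:.3f}, {beta_mean:.3f}")
```

For Beta(2, 5) the ordering mode < median < mean is the pattern commonly seen in right-skewed distributions.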

For an Exponential(λ = 0.5), the mean is 2 and the median is ln(2)/0.5 ≈ 1.386. If an insurance claim follows this distribution, which summary should you report to a regulator — and which to a customer? Justify in one sentence each.

What you now know

Moments compress a distribution into a sequence of numbers. Mean and variance are usually enough; skewness flags asymmetry; kurtosis flags tail behaviour. Linearity of expectation works across any joint structure. §0.5 is the distribution catalog — each named distribution has a mean and variance formula you should commit to memory. §0.9 returns to moments via the moment-generating function, which lets you compute all moments at once.

References

  • Wasserman, L. (2004). All of Statistics. Springer. (Chapter 3 — expectation.)
  • Casella, G., Berger, R.L. (2002). Statistical Inference, 2nd ed. (Section 2.2.)
  • Westfall, P.H. (2014). "Kurtosis as peakedness, 1905-2014: R.I.P." The American Statistician 68(3). (Clarifying that kurtosis is about tails, not peakedness.)
  • Joanes, D.N., Gill, C.A. (1998). "Comparing measures of sample skewness and kurtosis." The Statistician 47, 183-189.
  • Feller, W. (1968). An Introduction to Probability Theory and Its Applications, Vol. 1. Wiley. (Chapter 9 — moments and generating functions.)
