Random Variables, Expectation, and Variance

Part 15, Chapter 15: Combinatorics and Probability

Learning objectives

  • Define a discrete random variable and compute its expected value E[X]=xipiE[X] = \sum x_i p_i
  • Apply linearity of expectation, even when summands are dependent
  • Compute variance via Var(X)=E[X2](E[X])2\text{Var}(X) = E[X^2] - (E[X])^2 and interpret the standard deviation
  • Use the mean and variance formulas for Bernoulli, binomial, and geometric distributions
  • Recognise that variance adds for INDEPENDENT random variables and explain why

Expected value is the centre of gravity of a random variable; variance is the spread around that centre. Together these two numbers summarise almost everything you ever need to know about a distribution, they are the first and second moments, and a huge fraction of applied probability and statistics is "use the mean and variance, and lean on the CLT for the rest." This section introduces both quantities, develops the linearity-of-expectation trick that is one of the most powerful tools in the discipline, and exhibits the variance formulas for the canonical distributions.

Random variables and expectation

A random variable XX is a function X:OmegatomathbbRX: \Omega \to \mathbb{R} that assigns a real number to each outcome of an experiment. For a discrete random variable taking values x1,x2,ldotsx_1, x_2, \ldots with probabilities pi=P(X=xi)p_i = P(X = x_i)i), the expected value is

E[X] = \sum_i x_i \, p_i.

It is the long-run average of XX over many independent trials, by the law of large numbers. For a fair die roll, E[X] = (1 + 2 + \cdots + 6)/6 = 3.5, not an attainable outcome, but the long-run average per roll.

Linearity of expectation

For ANY random variables XX and YY, even when they are dependent, and any constants a,b,ca, b, c:

E[aX + bY + c] = a E[X] + b E[Y] + c.

This deceptively simple identity is one of the most-used tricks in combinatorics and probability. It lets you decompose a complicated random variable into a sum of indicator variables, compute the mean of each indicator (easy, it equals the probability of the event it indicates), and sum. No independence is required.

Classic application. What is the expected number of fixed points of a random permutation of 1,ldots,n\{1, \ldots, n\}? Let Xi=1X_i = 1i=1 if pi(i)=i\pi(i) = i, else 00. Then E[X_i] = 1/n (any of the nn positions is equally likely for element ii). By linearity, the expected number of fixed points is ncdot(1/n)=1n \cdot (1/n) = 1, surprisingly independent of nn.

Variance and standard deviation

The variance measures spread around the mean:

\text{Var}(X) = E[(X - E[X])^2] = E[X^2] - (E[X])^2.

The second form is the workhorse for computation: it sidesteps subtracting the mean from every value. The standard deviation sigmaX=sqrttextVar(X)\sigma_X = \sqrt{\text{Var}(X)}X=sqrttextVar(X) has the same units as XX and is the natural scale on which to quote spread.

Variance is NOT linear. For constants a,ca, c: textVar(aX+c)=a2textVar(X)\text{Var}(aX + c) = a^2 \text{Var}(X) (squared scaling, ignored shifts). For INDEPENDENT random variables X,YX, Y: textVar(X+Y)=textVar(X)+textVar(Y)\text{Var}(X + Y) = \text{Var}(X) + \text{Var}(Y). Without independence the cross-term 2,textCov(X,Y)2\,\text{Cov}(X, Y) appears.

Canonical distributions

  • Bernoulli(p)(p): P(X=1)=pP(X = 1) = p, P(X=0)=1pP(X = 0) = 1 - p. Then E[X] = p, textVar(X)=p(1p)\text{Var}(X) = p(1 - p). Variance is maximised at p=1/2p = 1/2 (most uncertainty), and zero when pin0,1p \in \{0, 1\} (deterministic).
  • Binomial(n,p)(n, p): sum of nn independent Bernoulli(p)(p). So by linearity E[X] = np; by additivity of variance under independence, textVar(X)=np(1p)\text{Var}(X) = np(1 - p).
  • Geometric(p)(p): number of trials up to and including the first success. E[X] = 1/p, textVar(X)=(1p)/p2\text{Var}(X) = (1 - p)/p^{2}. (A fair coin needs on average 2 flips to see the first head.)
  • Poisson(lambda)(\lambda): E[X] = \text{Var}(X) = \lambda. The equality of mean and variance is the defining structural feature of Poisson processes.

Use the grapher to sketch the binomial PMF f(k)=binomnkpk(1p)nkf(k) = \binom{n}{k}p^{k}(1-p)^{n-k} for fixed small nn (e.g., n=10n = 10) as a function of kk. Notice the mass concentrates around npnp; the width scales as sqrtnp(1p)\sqrt{np(1-p)}. The Central Limit Theorem (next section) makes this scaling rigorous.

Where this shows up
  • Finance, risk-adjusted return: The Sharpe ratio (E[R] - r_f)/\sigma_R divides excess expected return by standard deviation. Portfolio optimisation in the Markowitz framework is literally minimising variance subject to a fixed expected return.
  • Insurance pricing: The pure premium for a policy is E[L], the expected loss. Capital reserves are sized by sqrtnsigmaL\sqrt{n} \sigma_LL via the CLT (next section). Variance drives the cost of providing risk-pooling.
  • A/B testing: Sample-size formulas all come from sigma2/n\sigma^{2}/n. To detect a 1% relative effect with 80% power you need npropto1/delta2n \propto 1/\delta^{2} samples, where delta\delta is the relative effect, pure variance arithmetic.
  • Machine learning, bias-variance tradeoff: Predictor error decomposes as E[(\hat{y} - y)^{2}] = (\text{bias})^{2} + \text{variance} + \text{irreducible noise}. Tuning model complexity is a balance between these two competing quantities.
  • Quality control: Six-Sigma manufacturing targets a defect rate below P(Xmu>6sigma)P(|X - \mu| > 6\sigma) (under normality, about 1 in 500 million). The whole programme is an applied-variance discipline.
  • Pause and think: Why does the formula \text{Var}(X) = E[X^{2}] - (E[X])^{2} require E[X^{2}] \geq (E[X])^{2}? (Hint: variance is the expectation of a non-negative quantity.) This is a special case of the Cauchy-Schwarz inequality and the foundation of Jensen's inequality.

    Try it

    • Compute E[X] for a fair die. Then compute textVar(X)\text{Var}(X) using E[X^{2}] - (E[X])^{2}.
    • The expected number of heads in nn flips of a biased coin with P(H)=pP(H) = p is npnp. Re-derive this from linearity of expectation by writing the total as X1+X2+cdots+X_nX_1 + X_2 + \cdots + X_nn.
    • You roll a fair die until you see a 6. What is the expected number of rolls? (Hint: geometric distribution with p=1/6p = 1/6, so E[X] = 6.)
    • Show that for any non-negative random variable XX taking integer values, E[X] = \sum_{k=1}^{\infty} P(X \geq k). Use this to give a 1-line proof that geometric(p)(p) has mean 1/p1/p.
    • If XX and YY are independent with variances sigmaX2\sigma_X^{2}X2 and sigmaY2\sigma_Y^{2}Y2, find textVar(XY)\text{Var}(X - Y). Why does the answer use a plus, not a minus?

      A trap to watch for

      Beginners often try to extend linearity to variance: "textVar(X+Y)=textVar(X)+textVar(Y)\text{Var}(X + Y) = \text{Var}(X) + \text{Var}(Y)." This is only true under independence. In general, textVar(X+Y)=textVar(X)+textVar(Y)+2,textCov(X,Y)\text{Var}(X + Y) = \text{Var}(X) + \text{Var}(Y) + 2\,\text{Cov}(X, Y). If XX and YY are positively correlated, their sum has higher variance than the sum of their variances; if negatively correlated (a portfolio hedge!), the sum has LOWER variance, the foundation of diversification.

      What you now know

      You can compute means and variances of standard discrete distributions, apply linearity of expectation to decompose complicated random variables into sums of indicators, and recognise when independence lets variance add. The next section (the Central Limit Theorem) is the crown jewel of probability: it explains WHY the mean and variance summarise so much, for large sums of independent random variables, mean and variance literally determine the entire distribution.

      Mark section complete →

      References

      • Garrity, T. (2002). All the Mathematics You Missed. Cambridge University Press, ch. 15.
      • Ross, S. M. (2014). A First Course in Probability (9th ed.). Pearson, ch. 4–5.
      • Feller, W. (1968). An Introduction to Probability Theory and Its Applications, Vol. 1 (3rd ed.). Wiley, ch. 9.
      • Grimmett, G., Stirzaker, D. (2001). Probability and Random Processes (3rd ed.). Oxford University Press, ch. 3.
      • Mitzenmacher, M., Upfal, E. (2017). Probability and Computing (2nd ed.). Cambridge University Press, ch. 2–3 (linearity-of-expectation toolbox).

      This page is prerendered for SEO and accessibility. The interactive widgets above hydrate on JavaScript load.