Random Variables, Expectation, and Variance
Learning objectives
- Define a discrete random variable and compute its expected value
- Apply linearity of expectation, even when summands are dependent
- Compute variance via \text{Var}(X) = E[X^{2}] - (E[X])^{2} and interpret the standard deviation
- Use the mean and variance formulas for Bernoulli, binomial, and geometric distributions
- Recognise that variance adds for INDEPENDENT random variables and explain why
Expected value is the centre of gravity of a random variable; variance is the spread around that centre. Together these two numbers summarise much of what you need to know about a distribution: they are the first moment and the second central moment, and a huge fraction of applied probability and statistics is "use the mean and variance, and lean on the CLT for the rest." This section introduces both quantities, develops the linearity-of-expectation trick that is one of the most powerful tools in the discipline, and exhibits the variance formulas for the canonical distributions.
Random variables and expectation
A random variable is a function that assigns a real number to each outcome of an experiment. For a discrete random variable X taking values x_1, x_2, \ldots with probabilities p_1, p_2, \ldots, the expected value is
E[X] = \sum_i x_i \, p_i.
It is the long-run average of X over many independent trials, by the law of large numbers. For a fair die roll, E[X] = (1 + 2 + \cdots + 6)/6 = 3.5: not an attainable outcome, but the long-run average per roll.
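As a minimal sketch, the definition can be evaluated exactly for the fair die. The helper name `expectation` and the dictionary representation of the PMF are illustrative choices made here, not part of the text.

```python
from fractions import Fraction

def expectation(pmf):
    """Return E[X] = sum of x * P(X = x) for a PMF given as {value: probability}."""
    return sum(x * p for x, p in pmf.items())

# Fair six-sided die: each face has probability 1/6 (exact rational arithmetic).
die = {face: Fraction(1, 6) for face in range(1, 7)}

mean = expectation(die)
print(mean)         # 7/2
print(float(mean))  # 3.5
```

Using `Fraction` rather than floats keeps the arithmetic exact, so the result can be compared directly with the hand calculation.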
Linearity of expectation
For ANY random variables X and Y, even when they are dependent, and any constants a, b, c:
E[aX + bY + c] = a E[X] + b E[Y] + c.
This deceptively simple identity is one of the most-used tricks in combinatorics and probability. It lets you decompose a complicated random variable into a sum of indicator variables, compute the mean of each indicator (easy, it equals the probability of the event it indicates), and sum. No independence is required.
Classic application. What is the expected number of fixed points of a random permutation \pi of \{1, \ldots, n\}? Let X_i = 1 if \pi(i) = i, else X_i = 0. Then E[X_i] = 1/n (any of the n positions is equally likely for element i). By linearity, the expected number of fixed points is \sum_{i=1}^{n} E[X_i] = n \cdot (1/n) = 1, surprisingly independent of n.
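The fixed-point result can be checked by brute force for small n. The sketch below enumerates every permutation and averages the number of fixed points; the function name `mean_fixed_points` is hypothetical, and the enumeration is only practical for small n because there are n! permutations.

```python
from fractions import Fraction
from itertools import permutations

def mean_fixed_points(n):
    """Average number of fixed points over all n! permutations of range(n)."""
    total = 0
    count = 0
    for perm in permutations(range(n)):
        # Position i is a fixed point when the permutation maps i to itself.
        total += sum(1 for i, image in enumerate(perm) if image == i)
        count += 1
    return Fraction(total, count)

for n in range(1, 7):
    print(n, mean_fixed_points(n))  # the second column is 1 for every n
```

The indicators here are dependent (if all but one element is fixed, the last one must be too), yet the average still comes out to exactly 1, as linearity predicts.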
Variance and standard deviation
The variance measures spread around the mean:
\text{Var}(X) = E[(X - E[X])^2] = E[X^2] - (E[X])^2.
The second form is the workhorse for computation: it sidesteps subtracting the mean from every value. The standard deviation \sigma_X = \sqrt{\text{Var}(X)} has the same units as X and is the natural scale on which to quote spread.
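The equality of the two forms follows from linearity of expectation. Writing \mu = E[X] (a shorthand introduced here) and treating \mu as a constant:

```latex
\begin{aligned}
E[(X - \mu)^2] &= E[X^2 - 2\mu X + \mu^2] \\
               &= E[X^2] - 2\mu\,E[X] + \mu^2 \\
               &= E[X^2] - 2\mu^2 + \mu^2 \\
               &= E[X^2] - (E[X])^2.
\end{aligned}
```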
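Variance is NOT linear. For constants a, b: \text{Var}(aX + b) = a^{2} \, \text{Var}(X) (squared scaling, ignored shifts). For INDEPENDENT random variables X, Y: \text{Var}(X + Y) = \text{Var}(X) + \text{Var}(Y). Without independence the cross-term 2 \, \text{Cov}(X, Y) appears.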
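A small exact check of both rules, using a fair die as the example distribution. The constants a = 3 and b = 10 and the helper names are illustrative assumptions; the independent sum is built by multiplying marginal probabilities.

```python
from fractions import Fraction
from itertools import product

def mean(pmf):
    return sum(x * p for x, p in pmf.items())

def variance(pmf):
    """Var(X) = E[X^2] - (E[X])^2 for a PMF given as {value: probability}."""
    return sum(x * x * p for x, p in pmf.items()) - mean(pmf) ** 2

die = {face: Fraction(1, 6) for face in range(1, 7)}

# Scaling and shifting: Y = aX + b with illustrative constants.
a, b = 3, 10
scaled = {a * x + b: p for x, p in die.items()}

# Sum of two independent dice: joint probability is the product of marginals.
two_dice = {}
for (x, px), (y, py) in product(die.items(), repeat=2):
    two_dice[x + y] = two_dice.get(x + y, 0) + px * py

print(variance(die))       # 35/12
print(variance(scaled))    # 105/4, which equals 3**2 * 35/12
print(variance(two_dice))  # 35/6, which equals 35/12 + 35/12
```

The shift b = 10 leaves the variance untouched, while the factor a = 3 multiplies it by 9.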
Canonical distributions
- Bernoulli: P(X = 1) = p, P(X = 0) = 1 - p. Then E[X] = p, \text{Var}(X) = p(1 - p). Variance is maximised at p = 1/2 (most uncertainty), and zero when p \in \{0, 1\} (deterministic).
- Binomial: sum of n independent Bernoulli(p) variables. So by linearity E[X] = np; by additivity of variance under independence, \text{Var}(X) = np(1 - p).
- Geometric: number of trials up to and including the first success. E[X] = 1/p, \text{Var}(X) = (1 - p)/p^{2}. (A fair coin needs on average 2 flips to see the first head.)
- Poisson with rate \lambda: E[X] = \text{Var}(X) = \lambda. Equality of mean and variance is a hallmark of the Poisson distribution, though other distributions can share it.
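The closed forms can be compared against a direct computation from the PMF. This is a sketch under stated assumptions: the parameters n = 10 and p = 3/10 are illustrative, the binomial case is exact, and the geometric case is approximated by truncating the infinite sum at k = 500 and using floats.

```python
from fractions import Fraction
from math import comb

def mean_and_variance(pmf):
    """Return (E[X], Var(X)) for a PMF given as {value: probability}."""
    m = sum(x * p for x, p in pmf.items())
    v = sum(x * x * p for x, p in pmf.items()) - m ** 2
    return m, v

n, p = 10, Fraction(3, 10)  # illustrative parameters

# Binomial(n, p): P(X = k) = C(n, k) p^k (1 - p)^(n - k), computed exactly.
binomial = {k: comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n + 1)}
print(mean_and_variance(binomial))  # (Fraction(3, 1), Fraction(21, 10))

# Geometric(p): P(X = k) = (1 - p)^(k - 1) p for k = 1, 2, ...
# Truncated at k = 500; the neglected tail mass is negligible for p = 0.3.
q, pf = float(1 - p), float(p)
geometric = {k: q ** (k - 1) * pf for k in range(1, 501)}
geo_mean, geo_var = mean_and_variance(geometric)
print(geo_mean, 1 / pf)          # both close to 3.333...
print(geo_var, q / pf ** 2)      # both close to 7.777...
```

The binomial result matches np = 3 and np(1 - p) = 21/10 exactly; the geometric values agree with 1/p and (1 - p)/p^2 up to floating-point and truncation error.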
Use the grapher to sketch the binomial PMF for a few choices of n and p. Notice the mass concentrates around np; the width scales as \sqrt{np(1 - p)}. The Central Limit Theorem (next section) makes this scaling rigorous.
- Finance, risk-adjusted return: The Sharpe ratio (E[R] - r_f)/\sigma_R divides excess expected return by standard deviation. Portfolio optimisation in the Markowitz framework is literally minimising variance subject to a fixed expected return.
- Insurance pricing: The pure premium for a policy is E[L], the expected loss. Capital reserves are sized by the standard deviation \sigma_L of the loss, via the CLT (next section). Variance drives the cost of providing risk-pooling.
- A/B testing: Sample-size formulas all come from \text{Var}(\bar{X}) = \sigma^{2}/n. To detect a 1% relative effect with 80% power, the required number of samples n grows in proportion to 1/\delta^{2}, where \delta is the relative effect: pure variance arithmetic.
- Machine learning, bias-variance tradeoff: Expected squared prediction error decomposes as E[(\hat{y} - y)^{2}] = (\text{bias})^{2} + \text{variance} + \text{irreducible noise}. Tuning model complexity is a balance between the two competing terms, bias and variance; the noise term is outside the model's control.
- Quality control: Six-Sigma manufacturing targets a defect rate below P(|X - \mu| > 6\sigma) (under normality, about 1 in 500 million). The whole programme is an applied-variance discipline.
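The "1 in 500 million" figure in the quality-control item can be reproduced with a short calculation, assuming it refers to the two-sided tail of a normal distribution beyond six standard deviations. The function name `two_sided_normal_tail` is an illustrative choice.

```python
from math import erfc, sqrt

def two_sided_normal_tail(k):
    """P(|Z| > k) for a standard normal Z, via the complementary error function."""
    return erfc(k / sqrt(2))

tail = two_sided_normal_tail(6)
print(tail)      # roughly 2e-9
print(1 / tail)  # roughly 5e8, i.e. about 1 in 500 million
```

The identity used is P(|Z| > k) = 2(1 - \Phi(k)) = erfc(k / \sqrt{2}), which avoids the loss of precision that comes from subtracting a CDF value very close to 1.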
Pause and think: Why does the formula \text{Var}(X) = E[X^{2}] - (E[X])^{2} require E[X^{2}] \geq (E[X])^{2}? (Hint: variance is the expectation of a non-negative quantity.) This is a special case of both the Cauchy-Schwarz inequality and Jensen's inequality (applied to the convex function x \mapsto x^{2}).
Try it
- Compute E[X] for a fair die. Then compute \text{Var}(X) using E[X^{2}] - (E[X])^{2}.
- The expected number of heads in n flips of a biased coin with P(\text{heads}) = p is np. Re-derive this from linearity of expectation by writing the total as X = X_1 + X_2 + \cdots + X_n, where X_i = 1 if flip i is heads and 0 otherwise.
- You roll a fair die until you see a 6. What is the expected number of rolls? (Hint: geometric distribution with p = 1/6, so E[X] = 6.)
- Show that for any non-negative random variable X taking integer values, E[X] = \sum_{k=1}^{\infty} P(X \geq k). Use this to give a 1-line proof that a geometric random variable with success probability p has mean 1/p.
- If X and Y are independent with variances \sigma_X^{2} and \sigma_Y^{2}, find \text{Var}(X - Y). Why does the answer use a plus, not a minus?
A trap to watch for
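Beginners often try to extend linearity to variance: "\text{Var}(X + Y) = \text{Var}(X) + \text{Var}(Y)." This holds under independence (more generally, whenever \text{Cov}(X, Y) = 0) but not otherwise. In general, \text{Var}(X + Y) = \text{Var}(X) + \text{Var}(Y) + 2 \, \text{Cov}(X, Y). If X and Y are positively correlated, their sum has higher variance than the sum of their variances; if negatively correlated (a portfolio hedge!), the sum has LOWER variance, which is the foundation of diversification.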
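Two extreme cases make the trap concrete. In this sketch X is a fair die roll, and Y is either a copy of X (perfect positive correlation) or 7 - X (perfect negative correlation); both choices of Y are illustrative and have the same variance as X.

```python
from fractions import Fraction

def mean(values, probs):
    return sum(v * p for v, p in zip(values, probs))

def variance(values, probs):
    """Var = E[(V - E[V])^2] over equally indexed values and probabilities."""
    m = mean(values, probs)
    return sum((v - m) ** 2 * p for v, p in zip(values, probs))

faces = list(range(1, 7))
probs = [Fraction(1, 6)] * 6

var_x = variance(faces, probs)  # 35/12

# Positively correlated: Y = X, so X + Y = 2X.
var_sum_pos = variance([x + x for x in faces], probs)

# Negatively correlated: Y = 7 - X, so X + Y = 7 on every outcome.
var_sum_neg = variance([x + (7 - x) for x in faces], probs)

print(var_x + var_x)  # 35/6: what naive "variance adds" would predict
print(var_sum_pos)    # 35/3: larger, since Cov(X, Y) > 0
print(var_sum_neg)    # 0: smaller, since Cov(X, Y) < 0
```

In the hedged case the sum is constant, so all the variance is cancelled by the negative covariance term.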
What you now know
You can compute means and variances of standard discrete distributions, apply linearity of expectation to decompose complicated random variables into sums of indicators, and recognise when independence lets variance add. The next section (the Central Limit Theorem) is the crown jewel of probability: it explains WHY the mean and variance summarise so much. For large sums of independent random variables, the mean and variance alone determine the approximate (normal) distribution of the sum.