Communicating uncertainty without lying

Part 3 — Confidence intervals and uncertainty

Learning objectives

Recognise that numbers do not speak for themselves: the SAME study can be reported in ways that mislead OR ways that inform. Communication is part of the analysis, not separable from it
Catalogue the six canonical mis-communication failure modes — (i) p-value misread as 'probability the effect is real'; (ii) 'statistically significant' conflated with 'important / large / meaningful'; (iii) point estimate quoted without its CI; (iv) CI quietly summarised as a point estimate; (v) one-sided vs two-sided CI left ambiguous; (vi) model assumptions hidden
State the defensible-reporting checklist: ALWAYS report effect size with CI, ALWAYS state the model assumptions, ALWAYS include n, prefer plain-language glosses over bare statistics. Prefer visual displays over text-only summaries when shape information matters
Compare six ways to visualise uncertainty about a scalar estimand: point only, point + 95% CI error bar, density, violin, fan chart (Britton-Fisher-Whitley 1998), and hypothetical-outcome plot (HOP, Hullman et al. 2015). State which information each representation preserves and which it loses
Describe the '70% chance of rain' problem (Gigerenzer et al. 2005, Risk Analysis): the public commonly misreads a probability as a spatial or temporal fraction. Frequency framing ('7 out of every 10 days like this one') and icon arrays reduce the misinterpretation rate
State that a BAYESIAN CREDIBLE interval admits a direct probabilistic interpretation ('there is a 95% probability that θ ∈ [a, b]', conditional on prior and data) — making it easier for non-statisticians to read than a frequentist CI. Note this comes at the cost of being prior-dependent; the trade-off is unavoidable
Read the catalogue of seminal references: Spiegelhalter-Pearson-Short (2011) for visualising uncertainty broadly; Hullman et al. (2015) for HOPs specifically; Greenland et al. (2016) and Wasserstein-Lazar (2016) for p-value misinterpretations; Cumming (2008) for CIs as the recommended replacement; Tukey (1977) and Wilkinson (2005) as the visualisation foundations
Articulate the philosophy: uncertainty NOT communicated honestly is worse than no uncertainty at all — it gives consumers of research the illusion of certainty without the substance. Defensible communication is verbose; brevity often misleads. The mature researcher accepts this trade-off

The five sections that opened Part 3 built CONFIDENCE-INTERVAL machinery from the ground up — exact vs asymptotic CIs (§3.1), bootstrap CIs (§3.2), profile-likelihood CIs (§3.3), prediction intervals (§3.4), and calibration as the empirical-testability backbone (§3.5). Every one of those constructions produces a number — or rather, a pair of numbers — that summarises uncertainty about some estimand. §3.6 confronts the next question: how do you REPORT that uncertainty so the audience reads it correctly?

The answer is not obvious. The same study, with the same data, can be communicated in ways that mislead OR inform. A CI of [0.05, 0.55] for a treatment effect can be reported as "the effect is 0.30" (suppressing the entire uncertainty story), as "the effect is statistically significant (p = 0.02)" (which says nothing about magnitude), or as "we are 95% confident the true effect lies between a tiny 0.05 and a substantial 0.55, with effect-size point estimate 0.30 (n = 60, assuming approximately Normal errors)". These are three reports of the SAME analysis with wildly different consequences for downstream readers. This section is the field guide to picking the right framing.

The §3.6 arc has eight stops. First, the core problem: numbers do not speak for themselves. Second, the catalogue of six common mis-communication failure modes. Third, the defensible-reporting checklist. Fourth, six ways to visualise uncertainty about a scalar estimand, with the first widget that lets you toggle between them. Fifth, probabilistic forecasts and the "70% chance of rain" problem. Sixth, Bayesian credible intervals as a communication tool. Seventh, side-by-side misleading-vs-honest scenarios (the second widget). Eighth, where this lands in Parts 4, 7, 9, and 10. Together they close Part 3 with the communication side of the uncertainty story.

The core communication problem: numbers do not speak for themselves

A statistic — a p-value, a confidence interval, an odds ratio, a forecast probability — arrives in a research report as a string of characters. The reader supplies the INTERPRETATION. If the writer is careful, the report constrains the reader to the intended reading. If the writer is careless or evasive, the report leaves room for the reader to construct a different reading — usually one that exaggerates the certainty of the finding.

Consider a single example. A clinical trial of Drug A reports: "Drug A reduces stroke risk, p = 0.03." One short sentence. It is not false. But it invites at least three misreadings: (i) "There is a 97% chance the drug works," conflating $P(\text{data} \mid H_0)$ with $P(H_1 \mid \text{data})$; (ii) "The effect is statistically significant, so it must be clinically meaningful," conflating technical significance with practical importance; (iii) "Drug A reduces stroke risk by some unspecified amount that the reader can ignore," suppressing the effect-size story. The same data, honestly reported, look very different: "Drug A reduces stroke risk by 4 in 1000 (95% CI 1 to 7 in 1000), p = 0.03 (n = 8000, logistic-regression-based covariate adjustment). Below the clinically-relevant 6-in-1000 threshold for routine treatment recommendation." Roughly four times as long. Verbose. Defensible.

The communication problem is therefore a TRADE-OFF: brevity invites misreading; honest reporting requires verbosity. The mature researcher accepts the trade-off — and learns the small number of communication patterns that make defensible reports feel natural rather than burdensome. The §3.6 catalogue and the §3.6 visualisations are those patterns.
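The first misreading can be made concrete with Bayes' rule. The sketch below is illustrative only: the priors, the power of 0.80, and the 0.05 threshold are assumptions chosen for the example, not values from the Drug A trial, and it conditions on the event "the result was significant" rather than on the exact p-value. `prob_effect_real` is a hypothetical helper name.

```python
def prob_effect_real(prior, power, alpha):
    """P(H1 | significant result) by Bayes' rule.

    prior: P(H1) before seeing the data (NOT supplied by a p-value)
    power: P(significant | H1)
    alpha: P(significant | H0), the significance threshold
    """
    true_pos = power * prior
    false_pos = alpha * (1.0 - prior)
    return true_pos / (true_pos + false_pos)


# Same test, same threshold, three hypothetical priors.
for prior in (0.50, 0.10, 0.01):
    post = prob_effect_real(prior, power=0.80, alpha=0.05)
    print(f"prior P(H1) = {prior:.2f} -> P(H1 | significant) = {post:.2f}")
```

Under these assumed inputs the posterior probability ranges from about 0.94 down to about 0.14, depending only on the prior. None of the three answers can be read off the p-value alone.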

The catalogue of six mis-communication failure modes

The empirical-statistics literature has named the recurring misreadings. Greenland, Senn, Rothman, Carlin, Poole, Goodman, and Altman (2016, European Journal of Epidemiology 31(4), 337–350, "Statistical tests, P values, confidence intervals, and power: a guide to misinterpretations") provide the canonical 25-misinterpretation list. Wasserstein and Lazar (2016, American Statistician 70(2), 129–133, "The ASA's statement on p-values") set out six principles in response to widespread p-value misuse. Boiled down, six failure modes recur:

The p-value as "probability the effect is real." A p-value reports $P(\text{data at least as extreme as observed} \mid H_0)$; it does NOT report $P(H_1 \mid \text{data})$. The two are connected by Bayes' rule via the prior, which a p-value does not specify. A small p-value does not say the effect is real; it says the data would be surprising IF the null were true. What the data say about the alternative is a separate question.
"Significant" colloquially vs statistically. In everyday English "significant" means "important, large, meaningful". In statistics it means " $p < \alpha$ ", and with large enough $n$ even trivially small effects are "significant" in the technical sense. Cohen (1990, American Psychologist 45(12), 1304–1312, "Things I have learned, so far") warned against this conflation for decades; the misuse persists.
Point estimate quoted without a CI. "OR = 1.5: drug A increases recovery odds 50%" suppresses the uncertainty story. The same OR could come from a tight 95% CI [1.4, 1.6] or a loose [0.7, 3.2]. The single number is fiction. Wasserstein, Schirm, and Lazar (2019, American Statistician 73(s1), 1–19, "Moving to a world beyond p < 0.05") and the Greenland et al. (2016) list both rank this as a top failure mode.
CI summarised as point estimate. The inverse failure: a CI of [0.05, 0.55] gets reduced to "the effect is 0.30" by selective quotation. The mid-point is reported; the width is dropped. The downstream reader cannot tell whether [0.05, 0.55] or [0.29, 0.31] produced "0.30".
One-sided vs two-sided ambiguity. In a result reported as "95% CI: [−0.02, 0.50]", the lower endpoint −0.02 may be a one-sided 95% lower bound (the upper bound is implicitly +∞ and the full 5% error allowance sits in one tail) or one endpoint of a two-sided 95% CI (2.5% in each tail). The two answer different questions and the reporting must specify which. CONSORT 2010 (Schulz, Altman, Moher 2010, BMJ) requires this disclosure for trial reports.
Hidden model assumptions. A t-CI assumes Normality; a Wald-binomial CI assumes the asymptotic regime is reached; a bootstrap CI assumes smooth functionals; an LRT-CI assumes regularity. When the assumption fails, §3.5 calibration shows the nominal coverage is not the actual coverage. Defensible reports state the procedure, the assumption, and — when assumptions are dubious — provide an assumption-light alternative for comparison.

Each failure mode is preventable. The cost is a few extra sentences of careful prose. The §3.6 misleading-vs-honest widget below makes the contrast vivid.
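The second failure mode is easy to reproduce numerically. The following is a minimal sketch under stated assumptions: a two-sided z-test for a mean with known standard deviation, and a hypothetical effect of 0.02 standard deviations that stays fixed while only $n$ changes. `two_sided_p` is an illustrative helper name.

```python
import math


def two_sided_p(effect, sd, n):
    """Two-sided z-test p-value for H0: mean = 0, with known SD (an assumption)."""
    z = effect / (sd / math.sqrt(n))
    # erfc(|z| / sqrt(2)) equals 2 * (1 - Phi(|z|)) and stays accurate for large |z|.
    return math.erfc(abs(z) / math.sqrt(2))


effect, sd = 0.02, 1.0  # hypothetical: a shift of 2% of one standard deviation
for n in (100, 10_000, 1_000_000):
    p = two_sided_p(effect, sd, n)
    print(f"n = {n:>9,}  effect = {effect}  p = {p:.3g}")
```

The effect size never moves, yet the p-value falls from about 0.84 at $n = 100$ to below 0.05 at $n = 10{,}000$ and to a vanishingly small number at $n = 1{,}000{,}000$. "Significant" here reports the sample size as much as the effect.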

The defensible-reporting checklist

Boiled down to a procedural checklist for every quantitative result a researcher reports:

Report the EFFECT SIZE with its UNITS. Not "the result was significant" but "0.05 ng/mL".
Report the 95% CONFIDENCE INTERVAL (or BAYESIAN CREDIBLE INTERVAL). The single number is meaningless without the spread.
State whether the CI is ONE-sided or TWO-sided; if two-sided, the level on each tail.
State the SAMPLE SIZE $n$ (and effective sample size if relevant — Part 8 covers cluster effects and Part 6 covers IPTW).
State the MODEL ASSUMPTIONS — Normal-errors, proportional hazards, independent observations, IID, exchangeability, whatever applies. When assumptions are dubious, run a §3.5-style calibration check or report a robust alternative (bootstrap, conformal).
Add a PLAIN-LANGUAGE GLOSS: "we are 95% confident the true odds ratio lies between a small 5% increase and a more-than-doubling." Translate technical jargon ("OR", "RR", "HR") into everyday English at first use.
Prefer VISUAL DISPLAYS for shape-dependent claims. A density / violin / fan chart conveys skew and bimodality that a bare CI cannot. Use the §3.6 visualisations widget below as a guide.
For probabilistic forecasts (rain, election outcomes, disease incidence), prefer FREQUENCY FRAMING ("7 out of 10 days like this one") over bare percentages ("70% chance"). Add icon arrays where space allows.
If using BAYESIAN methods, state the PRIOR and run a sensitivity analysis with at least one alternative prior. Part 7 develops this in depth.

The checklist is verbose. It is also defensible — every item directly counters one of the failure modes from the catalogue. Adopting it is the difference between research that gets misinterpreted and research that gets understood.
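One way to make the checklist habitual is to refuse to print an estimate unless its companions are supplied. The sketch below assumes a hypothetical `Result` record and `defensible_report` formatter (neither comes from any library). The example values are the treatment-effect numbers from the opening of this section; "two-sided" is an assumption added for the illustration.

```python
from dataclasses import dataclass


@dataclass
class Result:
    label: str
    estimate: float
    ci_low: float
    ci_high: float
    n: int
    assumptions: str
    units: str = ""          # empty when the estimand is unitless
    level: int = 95
    sided: str = "two-sided"


def _with_units(value, units):
    # Avoid a trailing space when no units are given.
    return f"{value:g} {units}".strip()


def defensible_report(r: Result) -> str:
    """One sentence carrying effect size, interval, sidedness, n, and assumptions."""
    if not r.ci_low <= r.estimate <= r.ci_high:
        raise ValueError("point estimate must lie inside the interval")
    return (
        f"{r.label}: {_with_units(r.estimate, r.units)} "
        f"({r.level}% {r.sided} CI {_with_units(r.ci_low, r.units)} "
        f"to {_with_units(r.ci_high, r.units)}; n = {r.n}; "
        f"assumes {r.assumptions})."
    )


report = defensible_report(
    Result("Treatment effect", 0.30, 0.05, 0.55, n=60,
           assumptions="approximately Normal errors")
)
print(report)
```

Because `n` and `assumptions` have no defaults, constructing a `Result` without them fails immediately, which is the point of the design. The plain-language gloss and the visual display still have to be written by a person.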

Visualising uncertainty about a scalar estimand

Wilkinson (2005, The Grammar of Graphics, 2nd ed., Springer) argued that visual displays are interpretive INSTRUMENTS — they show patterns that text statistics cannot. For an uncertainty distribution about a single scalar $\theta$ (a treatment effect, a regression coefficient, a forecast probability), six representations are now standard:

Point estimate only. A single dot at $\hat\theta$ . Conveys NOTHING about uncertainty. The default for headlines that suppress the uncertainty story.
Point + CI error bar. A dot + a horizontal line + caps. Conveys central tendency and a 95% extreme range. Loses shape information — skew, bimodality, and tail mass all flatten into symmetric endpoints. The standard scientific default since the mid-20th century.
Density curve. The full posterior / sampling / bootstrap density. Conveys location, scale, skew, modality, and tail heaviness. Adds visual complexity. Tukey (1977, Exploratory Data Analysis, Addison-Wesley) and Wilkinson (2005) advocate this when shape matters.
Violin plot. Mirrored density with median tick and IQR box at the centre. Same information as a density curve, but the mirroring makes the visual centre-of-mass intuitive. Preferred default in many modern biomedical and psychology journals.
Fan chart. Nested credible bands (50%, 80%, 95%, 99%), darkest in the centre and fading outward. Britton, Fisher, and Whitley (1998, Bank of England Quarterly Bulletin Q1) introduced this for inflation forecasts; it is now standard in macroeconomics and epidemiology. Conveys multiple quantile levels in a single graphic.
Hypothetical-outcome plot (HOP). Multiple INDEPENDENT draws from the same uncertainty distribution displayed as small-multiples (often animated). Hullman, Resnick, and Adar (2015, PLoS ONE 10(11), e0142444, "Hypothetical outcome plots outperform error bars and violin plots for inferences about reliability of variable ordering") showed empirically that untrained observers judge uncertainty more accurately from HOPs than from bare error bars. The variability is shown by SEEING IT PLAY OUT: the frequentist interpretation made visible.

The first §3.6 widget lets you toggle between all six on the SAME underlying 4000-draw posterior. The teaching message is direct: choice of representation changes what the audience perceives. A symmetric Normal sampling distribution looks essentially the same in all six. A right-skewed (lognormal) or bimodal-mixture distribution looks RADICALLY different — the bare error bar lies about the shape, the density / violin tells the truth, the fan chart spreads the quantile story across visible bands, and the HOP animates the variability.
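The numbers behind several of these views can be computed from the same set of draws. The sketch below is not the widget's code. It assumes a lognormal generator with arbitrary parameters, a symmetric error bar built as mean ± 1.96 SD of the draws, and equal-tailed quantile bands; `quantile` is a small helper written here for the example.

```python
import random
import statistics


def quantile(sorted_xs, q):
    """Linear-interpolation quantile of pre-sorted data, 0 <= q <= 1."""
    pos = q * (len(sorted_xs) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_xs) - 1)
    return sorted_xs[lo] + (pos - lo) * (sorted_xs[hi] - sorted_xs[lo])


rng = random.Random(0)
# 4000 right-skewed draws; mu and sigma are illustrative choices.
draws = sorted(rng.lognormvariate(-1.5, 0.6) for _ in range(4000))

mean = statistics.fmean(draws)
sd = statistics.stdev(draws)
median = quantile(draws, 0.5)

# View 2, built the symmetric way: mean +/- 1.96 SD of the draws.
sym_bar = (mean - 1.96 * sd, mean + 1.96 * sd)

# View 5: nested equal-tailed bands, as a fan chart would shade them.
bands = {
    level: (quantile(draws, (1 - level) / 2), quantile(draws, (1 + level) / 2))
    for level in (0.50, 0.80, 0.95)
}

# View 6: one HOP frame is just a handful of individual draws.
hop = rng.sample(draws, 20)

print(f"mean {mean:.3f}  median {median:.3f}")
print(f"symmetric bar: {sym_bar[0]:.3f} to {sym_bar[1]:.3f}")
for level, (lo, hi) in bands.items():
    print(f"{int(level * 100)}% band: {lo:.3f} to {hi:.3f}")
```

With these assumed parameters the symmetric bar reaches below zero even though every draw is positive, while the 95% quantile band stretches much further above the median than below it. That asymmetry is the shape information the density, violin, and fan chart preserve.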

Things to verify in the widget:

Default settings: symmetric (Normal) shape, $n = 40$ , view = "point + 95% CI". The mean is around 0.30 (the true value), the CI is roughly symmetric around it. All six representations agree because the distribution IS symmetric.
Switch shape to "right-skew (lognormal)". The bare error bar (view 2) still looks symmetric, but the density (view 3) reveals a right tail. The mean exceeds the median; the 97.5th percentile is well to the right of the mean. A reader who only saw view 2 would not know the distribution is skewed.
With "right-skew" still selected, switch to view 4 (violin). The mirrored density makes the skew visible immediately. The IQR box is shifted to the left of the centre-of-mass. Switch to view 5 (fan chart) — the nested 50 / 80 / 95% bands show that the upper bound stretches much further than the lower.
Switch shape to "bimodal (mixture)". Now the symmetric ± error bar is actively misleading: it suggests a single mode at the centre when in fact the density has TWO peaks. Switch to view 3 (density) — both peaks are visible. View 4 (violin) shows the two modes mirrored. View 1 (point only) is the worst-case communication — it implies a single answer where the distribution has two.
Switch to view 6 (hypothetical outcomes). 20 independent draws are displayed as small-multiples on a single axis. The variability is felt directly: the reader counts how often a draw falls above 0.40, below 0.20, etc. Hullman et al. (2015) showed this improves trend judgements among non-statisticians.
Increase $n$ from 40 to 320. All six representations narrow proportionally to $1/\sqrt{n}$ — the CI shrinks, the density sharpens, the fan bands contract. The intuition behind the §1.6 standard-error scaling becomes visible across the entire toolkit.
Click "Re-roll sample". The mean and CI shift slightly (Monte-Carlo noise on the simulated draws); the shape persists because the underlying generator is fixed. With "Re-roll HOP draws", only the 20 displayed lines change — the underlying distribution is fixed but the small-multiples re-sample.
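The $1/\sqrt{n}$ scaling in the checks above can be confirmed with a few lines of arithmetic. This is a sketch for a Normal-theory interval for a mean. The standard deviation of 1.0 is a placeholder and cancels out of the ratios.

```python
import math


def ci_width(sigma, n, z=1.96):
    """Width of a two-sided Normal-theory 95% CI for a mean: 2 * z * sigma / sqrt(n)."""
    return 2 * z * sigma / math.sqrt(n)


sigma = 1.0  # placeholder; any positive value gives the same ratios
ratio_40_to_320 = ci_width(sigma, 40) / ci_width(sigma, 320)
ratio_10_to_320 = ci_width(sigma, 10) / ci_width(sigma, 320)

print(f"n 40 -> 320 narrows the CI by a factor of {ratio_40_to_320:.2f}")
print(f"n 10 -> 320 narrows the CI by a factor of {ratio_10_to_320:.2f}")
```

An eightfold increase in $n$ narrows the interval by $\sqrt{8} \approx 2.83$, and a 32-fold increase by $\sqrt{32} \approx 5.66$.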

Probabilistic forecasts: the "70% chance of rain" problem

Probabilistic forecasts deserve a dedicated subsection because the framing problem is particularly acute. Gigerenzer, Hertwig, van den Broek, Fasolo, and Katsikopoulos (2005, Risk Analysis 25(3), 623–629) asked members of the public in five cities what a forecast of "30% chance of rain tomorrow" means (the study used 30%; this section uses 70% as its running example, and the interpretation problem is identical). The intended meaning, "of all days with conditions like tomorrow's, rain falls on 30% of them", was the majority choice only in New York; in the four European cities it was judged the least appropriate reading. The dominant misreadings were "rain for 30% of the day tomorrow" and "rain over 30% of the area tomorrow." Both are incorrect by definition. Neither corresponds to the meteorological process the forecast actually predicts.

Gigerenzer (2002, Reckoning with Risk, Penguin) and Spiegelhalter, Pearson, and Short (2011, Science 333(6048), 1393–1400, "Visualizing uncertainty about the future") consolidated the recommendation: frequency framings beat bare percentages. The honest report is not "70% chance of rain" but "of every 10 days with conditions like tomorrow's, we expect rain on 7 of them." The abstract probability is replaced by a CONCRETE COUNT of imagined cases — a frequentist statement the audience can interpret without translating.

Adding an ICON ARRAY (7 raindrops + 3 suns, or whatever the ratio implies) reduces misinterpretation further. Studies in medical risk communication (Spiegelhalter et al. 2011; Galesic, Garcia-Retamero, Gigerenzer 2009, Health Psychology 28(2), 210–216) show that the icon-array format is interpreted accurately by both patients and physicians. The second §3.6 widget shows this exact framing for the rain forecast.
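The frequency framing and a text-only icon array are both mechanical to generate. The helpers below (`frequency_frame`, `icon_array`) are hypothetical names written for this sketch; the letters "R" and "s" stand in for raindrop and sun icons.

```python
def frequency_frame(prob, denominator=10, unit="days like this one"):
    """Rewrite a probability as 'k out of every N <unit>'."""
    count = round(prob * denominator)
    if abs(count - prob * denominator) > 1e-9:
        # e.g. 0.73 is not a whole count out of 10; a denominator of 100 would work
        raise ValueError("not a whole count at this denominator; try a larger one")
    return f"{count} out of every {denominator} {unit}"


def icon_array(prob, denominator=10, hit="R", miss="s"):
    """One icon per imagined case: `hit` where the event occurs, `miss` otherwise."""
    count = round(prob * denominator)
    return " ".join([hit] * count + [miss] * (denominator - count))


print(frequency_frame(0.7))  # 7 out of every 10 days like this one
print(icon_array(0.7))       # R R R R R R R s s s
```

The guard in `frequency_frame` is a design choice: rounding 0.73 to "7 out of 10" would quietly change the forecast, so the function asks for a denominator that represents the probability exactly.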

Bayesian credible intervals as a communication tool

The frequentist confidence-interval story has a deep communication awkwardness: "We are 95% confident that $\theta \in [a, b]$ " is NOT a probability statement about $\theta$ — it is a procedural statement about the long-run frequency with which the interval-construction procedure covers the true value (§3.5 calibration). Non-statisticians find this distinction unintuitive; the §3.1 warning against reading the 95% as the probability that the realised interval covers the true value is widely ignored in practice.

A BAYESIAN CREDIBLE INTERVAL [a, b] for a parameter $\theta$ admits the direct probabilistic reading: "There is a 95% posterior probability that $\theta \in [a, b]$ , given the data and the prior." This is exactly the reading the non-statistician audience supplies anyway. The credible-interval framing is therefore EASIER TO COMMUNICATE — at the cost of being prior-dependent (different priors produce different intervals on the same data). The trade-off is unavoidable; the choice depends on whether the writer can defend the prior or prefers the procedural-only stance.
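Prior dependence is easiest to see on a small example. The sketch below assumes a binomial proportion with a conjugate Beta prior, hypothetical data (12 successes in 40 trials), and two hypothetical priors. The interval is read off Monte Carlo draws, so its endpoints carry a little simulation noise.

```python
import random


def credible_interval(successes, trials, prior_a, prior_b,
                      level=0.95, draws=20_000, seed=0):
    """Equal-tailed credible interval for a proportion under a Beta prior.

    Conjugacy gives the posterior Beta(prior_a + successes, prior_b + failures).
    """
    rng = random.Random(seed)
    a = prior_a + successes
    b = prior_b + (trials - successes)
    xs = sorted(rng.betavariate(a, b) for _ in range(draws))
    lo = xs[int((1 - level) / 2 * (draws - 1))]
    hi = xs[int((1 + level) / 2 * (draws - 1))]
    return lo, hi


flat = credible_interval(12, 40, prior_a=1, prior_b=1)       # flat Beta(1, 1)
sceptical = credible_interval(12, 40, prior_a=2, prior_b=8)  # centred near 0.2

print(f"flat prior:      95% credible interval {flat[0]:.2f} to {flat[1]:.2f}")
print(f"sceptical prior: 95% credible interval {sceptical[0]:.2f} to {sceptical[1]:.2f}")
```

Both intervals support the direct reading "there is a 95% posterior probability that the proportion lies in this range", but they are different ranges from the same data. That is why the prior has to be reported alongside the interval.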

Gelman and Hill (2007, Data Analysis Using Regression and Multilevel/Hierarchical Models, Cambridge University Press, chapter on communication) argue for the Bayesian credible-interval framing as the default in applied research, with the prior disclosed and a sensitivity analysis run. Part 7 of this textbook develops the machinery (Metropolis, Gibbs, HMC, posterior predictive checks) and revisits the communication question in §7.6. For §3.6 the key point is that the credible-interval framing is a legitimate communication tool — not a Bayesian-vs-frequentist methodological war, but a pragmatic choice about how to make uncertainty intelligible.

Misleading vs honest: side-by-side scenarios

The second §3.6 widget presents six side-by-side scenarios. Each scenario shows the SAME study finding, communicated in a misleading way (left card) and an honest way (right card, hidden until you click Reveal). The diagnosis panel names the failure mode and points to the canonical reference. Scenarios 1 to 5 cover failure modes (i), (ii), (iii), (v), and (vi) from the catalogue above; scenario 6 covers the probabilistic-forecast framing problem.

Things to verify in the widget:

Start with scenario 1 ("p-value misread as probability the effect is real"). The misleading card quotes p = 0.03 and concludes "97% chance the drug works". Click Reveal: the honest card adds the absolute risk reduction (4 in 1000 with 95% CI 1 to 7 in 1000) and points out that p is $P(\text{data} \mid H_0)$ , not $P(H_1 \mid \text{data})$ . The diagnosis cites Greenland et al. (2016) and the ASA p-value statement.
Scenario 2 ("statistically significant"). The misleading card reports "statistically significant (p < 0.001)" with no effect size. The honest card discloses the effect (0.05 ng/mL with CI 0.03 to 0.07) AND the clinical relevance threshold (0.20 ng/mL) — showing the effect is statistically significant but clinically negligible. The diagnosis cites Cohen (1990) on the conflation of statistical with practical significance.
Scenario 3 ("point estimate without CI"). The misleading card quotes "OR = 1.5: 50% higher recovery odds". The honest card adds 95% CI 1.05 to 2.14 and n = 80, plus a plain-language gloss ("anywhere from a tiny 5% boost to a more-than-doubling"). The contrast — same point estimate, drastically different uncertainty story — makes the case for always pairing effects with CIs.
Scenario 4 ("one-sided vs two-sided ambiguity"). The misleading card reports "95% CI: [−0.02, 0.50]" without specifying. The honest card reports BOTH the two-sided CI [−0.12, 0.60] and the one-sided lower bound [−0.02, +∞), and discloses which question the study is answering.
Scenario 5 ("hidden assumptions"). The misleading card reports a t-test CI [−0.05, 1.20]. The honest card discloses that Shapiro-Wilk rejects Normality (p < 0.001) and reports the bootstrap CI [−0.30, 1.95] for comparison — substantially wider. The diagnosis cites §3.5 calibration: assumption failure means nominal coverage ≠ actual coverage.
Scenario 6 ("70% chance of rain"). The misleading card is the bare probability. The honest card reframes as "7 out of every 10 days like this one" with a 10-icon array (7 raindrops, 3 suns). The diagnosis cites Gigerenzer (2002) and Spiegelhalter et al. (2011) on frequency framings beating percentages.
Click through ALL SIX scenarios using the "Next scenario →" button. Each scenario embodies a distinct communication failure discussed in §3.6; together they cover the §3.6 communication landscape.
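The distinction in scenario 4 comes down to which Normal quantile multiplies the standard error. The sketch below uses a hypothetical estimate and standard error under a Normal approximation; it does not attempt to reproduce the widget's numbers.

```python
from statistics import NormalDist


def normal_intervals(estimate, se, level=0.95):
    """Two-sided interval and one-sided lower bound at the same nominal level."""
    z_two = NormalDist().inv_cdf(1 - (1 - level) / 2)  # about 1.960 at 95%
    z_one = NormalDist().inv_cdf(level)                # about 1.645 at 95%
    two_sided = (estimate - z_two * se, estimate + z_two * se)
    one_sided_lower = estimate - z_one * se            # interval is [bound, +inf)
    return two_sided, one_sided_lower


two_sided, lower = normal_intervals(estimate=0.24, se=0.15)  # hypothetical inputs
print(f"two-sided 95% CI:      [{two_sided[0]:.3f}, {two_sided[1]:.3f}]")
print(f"one-sided 95% bound:   [{lower:.3f}, +inf)")
```

At the same nominal 95%, the one-sided lower bound always sits closer to the estimate than the two-sided lower endpoint, because the whole 5% error allowance is spent in one tail. A reader shown only "95%" cannot tell which of the two was computed.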

Where this lands in the rest of the textbook

The §3.6 communication framework — effect size + CI + assumptions + n + plain-language gloss + visual display — recurs throughout the rest of the textbook. Specific connections:

Part 4 (linear regression). §4.7 (model selection) and §4.8 (causal warnings) develop the regression-specific communication rules: report the coefficient estimates with CIs AND robust standard errors when heteroscedasticity is suspected; report the $R^2$ AND the residual standard error; distinguish association from causation in the prose. The diagnostic plots (§4.3) are themselves uncertainty visualisations applied to residuals.
Part 7 (Bayesian methods). §7.6 (posterior-predictive checks) develops the Bayesian-credible-interval communication convention. The whole part replaces the frequentist procedural framing with direct probability statements that are easier to communicate but harder to defend without prior justification.
Part 9 (ML for researchers). §9.4 (calibration) and §9.7 ("Reporting an ML result so the reader can trust it") develop the ML-specific reporting checklist: discrimination metric + calibration diagram + ECE + Brier score + fairness audit. The §3.6 communication framework is the foundation; §9.7 is the ML-specific deployment.
Part 10 (real-research capstones). Every one of the six capstones (§§10.1–10.6) closes with a communication section that applies the §3.6 checklist to the specific finding. The capstones are the worked examples of defensible reporting end-to-end.

Try it

In the uncertainty-visualizations widget, set shape = symmetric (Normal), $n = 40$ . Cycle through all six views. Confirm that for a symmetric distribution, the bare error bar conveys essentially the same information as the density and the violin. The fan chart adds quantile detail; the HOP adds replication intuition.
Same widget. Switch shape to "right-skew (lognormal)". Cycle through all six views again. Note that the bare error bar (view 2) still looks symmetric while the density / violin / fan chart all reveal the right tail. State which views convey the skew and which hide it.
Same widget. Switch shape to "bimodal (mixture)". Cycle through views 1, 2, 3, 4, 6. Note that views 1 (point only) and 2 (point + CI) are misleading — they imply a single answer where the distribution has two modes. The density (view 3) and violin (view 4) make the bimodality visible. The HOP (view 6) draws roughly half its 20 samples near each mode, also revealing the bimodality.
Same widget. Set $n = 10$ (the smallest), then increase to $n = 320$ . Watch the bands shrink across all six representations. State the scaling: width $\propto 1/\sqrt{n}$ (the §1.6 standard-error rule). Confirm visually that increasing $n$ by a factor of 32 narrows the CI by a factor of $\sqrt{32} \approx 5.66$ .
In the misleading-vs-honest widget, work through scenarios 1, 2, and 3. For each, identify (before clicking Reveal) what is missing from the misleading card. After clicking Reveal, check whether the honest card supplied exactly what you identified.
Same widget. Scenarios 4 and 5 (one-vs-two-sided, hidden assumptions). These are subtler — the misleading card looks defensible at first glance, but the honest card discloses additional information that changes the interpretation. State which information the misleading card omitted and what its omission implied.
Same widget. Scenario 6 (70% chance of rain). State two ways a member of the lay public might misread the bare percentage. Then verify the icon-array honest version corresponds to the frequency-framing recommendation (Gigerenzer 2002, Spiegelhalter et al. 2011).
Pen-and-paper. Apply the defensible-reporting checklist to a study finding from your own field (or invent one). Write the misleading short version FIRST (one sentence). Then write the honest long version (five to eight sentences). Compare the two for length, defensibility, and how each handles each item on the checklist.
Pen-and-paper. Suppose a clinical trial reports a hazard ratio HR = 1.30 (95% CI 1.05 to 1.61) for Drug B vs placebo, with n = 600, assuming proportional hazards. Write a defensible one-paragraph summary including effect size, CI, sample size, assumption, plain-language gloss, and one sentence about what the result does, and does not, let you conclude.
Pen-and-paper. State why a Bayesian credible interval admits a direct probabilistic interpretation while a frequentist CI does not. State the cost of the Bayesian framing (prior dependence). Cite Gelman & Hill (2007) on when each framing is preferred in applied research.

Pause and reflect: §3.6 has cast COMMUNICATION as a substantive part of the analysis, not a separable post-hoc step. The same study, with the same data and the same calibrated CI, can be reported in ways that mislead OR inform. The catalogue of six failure modes — p-value-as-effect-probability, "significant" colloquially, point-without-CI, CI-as-point, one-vs-two-sided, hidden-assumptions — names the recurring pitfalls. The defensible-reporting checklist supplies the counter-pattern: effect size + CI + assumptions + n + plain-language gloss + visual display. The §3.6 visualisations (point, CI, density, violin, fan, HOP) and the frequency-framing convention for probabilistic forecasts give the visual and verbal vocabulary. Uncertainty that is not communicated honestly is worse than no uncertainty at all — it gives the audience the illusion of certainty without the substance. With §3.6 Part 3 closes; Parts 4 through 10 will all REUSE this communication framework.

What you now know

You can articulate the core communication problem: numbers do not speak for themselves; the SAME study can be reported in ways that mislead OR inform; communication is part of the analysis, not separable from it. You know the example: a clinical-trial result reported as "p = 0.03" (one short sentence, three misreadings) versus the same result reported with effect size + CI + assumptions + n + plain-language gloss + relevance-threshold comparison (verbose, defensible).

You can name the SIX canonical failure modes from the Greenland et al. (2016) and Wasserstein-Lazar (2016) catalogues: (i) p-value misread as probability the effect is real; (ii) "statistically significant" conflated with "important / meaningful / large"; (iii) point estimate quoted without its CI; (iv) CI silently summarised as a point estimate; (v) one-sided vs two-sided CI left ambiguous; (vi) model assumptions hidden. Each failure mode has a canonical reference and a defensible counter-pattern.

You can state the DEFENSIBLE-REPORTING CHECKLIST: effect size with units, 95% CI (or credible interval), one- vs two-sided spec, sample size $n$ , model assumptions, plain-language gloss, visual display, frequency framing for probabilistic forecasts, prior + sensitivity analysis if Bayesian. The checklist is verbose; the verbosity is the cost of defensibility.

You can compare SIX visualisations of uncertainty about a scalar estimand: (1) point only — no uncertainty signal; (2) point + 95% CI error bar — central tendency and extreme range, loses shape; (3) density curve — full shape; (4) violin plot — mirrored density with median + IQR; (5) fan chart (Britton et al. 1998) — nested credible bands at 50 / 80 / 95 / 99%; (6) hypothetical-outcome plot (Hullman et al. 2015) — independent draws displayed as small-multiples, animating the variability. You know which representations preserve shape information (3, 4, 5, 6) and which flatten it (1, 2).

You can describe the "70% chance of rain" problem (Gigerenzer et al. 2005): the lay public commonly misreads probabilities as spatial or temporal fractions. The fix is FREQUENCY FRAMING ("7 out of every 10 days like this one") plus icon arrays (Spiegelhalter et al. 2011). The same recommendation applies across medical risk, election forecasts, and macroeconomic projections.

You can articulate the BAYESIAN-CREDIBLE-INTERVAL communication advantage: "there is a 95% probability that $\theta \in [a, b]$ " admits a direct probabilistic reading, easier for non-statisticians to interpret. The cost is prior dependence; the trade-off is unavoidable. Gelman & Hill (2007) advocate this framing for applied research with prior disclosure and sensitivity analysis (Part 7 develops the machinery).

You can articulate the philosophy: uncertainty NOT communicated honestly is worse than no uncertainty at all. Defensible communication is verbose; brevity often misleads. The mature researcher accepts this trade-off and learns the small number of communication patterns that make defensible reports feel natural. The §3.6 checklist + visualisations + scenarios provide the patterns; Parts 4 through 10 deploy them across the rest of the textbook.

Where this lands in Parts 4-10. Part 4 (linear regression): §§4.3, 4.7, 4.8 apply the §3.6 communication framework to regression coefficients, model-selection criteria, and causal-inference warnings. Part 7 (Bayesian methods): §7.6 (posterior-predictive checks) develops the Bayesian-credible-interval communication convention. Part 9 (ML for researchers): §9.7 ("Reporting an ML result so the reader can trust it") is the ML-specific deployment of the §3.6 checklist, building on §9.4 calibration. Part 10 (real-research capstones): all six capstones close with §3.6-compliant communication sections, demonstrating defensible reporting end-to-end. §3.6 is the framework; the rest of the textbook is the application.

Part 3 is now complete: six sections covering exact vs asymptotic CIs (§3.1), bootstrap CIs (§3.2), profile-likelihood and LRT CIs (§3.3), prediction intervals (§3.4), calibration as the empirical-testability backbone (§3.5), and communicating uncertainty without lying (§3.6). The reader has the construction toolkit, the empirical-testability framework, AND the communication framework. Part 4 begins next — linear regression done seriously.

References

Spiegelhalter, D., Pearson, M., Short, I. (2011). "Visualizing uncertainty about the future." Science 333(6048), 1393–1400. (Definitive review of uncertainty visualisation across medical risk, weather, economic forecasting. Discusses error bars, density plots, fan charts, icon arrays, and the empirical evidence on which formats reduce misinterpretation. The §3.6 foundation reference for the visualisation side.)
Hullman, J., Resnick, P., Adar, E. (2015). "Hypothetical outcome plots outperform error bars and violin plots for inferences about reliability of variable ordering." PLoS ONE 10(11), e0142444. (Introduces and empirically validates the hypothetical-outcome plot (HOP). Untrained observers judge uncertainty more accurately from animated HOPs than from bare error bars. The cited evidence for view 6 of the §3.6 visualisations widget.)
Greenland, S., Senn, S.J., Rothman, K.J., Carlin, J.B., Poole, C., Goodman, S.N., Altman, D.G. (2016). "Statistical tests, P values, confidence intervals, and power: a guide to misinterpretations." European Journal of Epidemiology 31(4), 337–350. (The canonical 25-misinterpretation list. Items 1, 5, 6, 9, 19 are the direct sources for §3.6 failure modes 1–6. Mandatory reading for any researcher reporting quantitative findings.)
Wasserstein, R.L., Lazar, N.A. (2016). "The ASA's statement on p-values: context, process, and purpose." American Statistician 70(2), 129–133. (The American Statistical Association's official six-principle statement, issued in response to widespread p-value misuse. Principle 2 — "P-values do not measure the probability that the studied hypothesis is true" — is the direct counter to §3.6 failure mode 1.)
Wasserstein, R.L., Schirm, A.L., Lazar, N.A. (2019). "Moving to a world beyond 'p < 0.05'." American Statistician 73(s1), 1–19. (The ASA follow-up. Advocates for retiring statistical-significance language and reporting effect sizes with intervals. Directly supports §3.6 failure modes 2 and 3.)
Cumming, G. (2008). "Replication and p intervals: p values predict the future only vaguely, but confidence intervals do much better." Perspectives on Psychological Science 3(4), 286–300. (Empirical and theoretical argument that CIs predict replication outcomes substantially better than p-values. A key reference for the §3.6 case "always report the CI, not just the p-value".)
Gelman, A., Hill, J. (2007). Data Analysis Using Regression and Multilevel/Hierarchical Models. Cambridge University Press. (The communication chapter — and the worked examples throughout — develop the Bayesian-credible-interval framing as the default for applied research. Foundation for §3.6 Bayesian-credible-interval subsection and a recurring reference in Parts 4, 5, 7.)
Tukey, J.W. (1977). Exploratory Data Analysis. Addison-Wesley. (The founding text of exploratory data analysis. Argues for visual representations over text statistics whenever shape information matters — the philosophical underpinning of §3.6 visualisations and Part 4 regression diagnostics.)
Wilkinson, L. (2005). The Grammar of Graphics, 2nd ed. Springer. (The systematic theory of statistical visualisation. Develops density, violin, and faceted small-multiples as principled visual instruments. Foundational reference for the §3.6 visualisations widget and the ggplot2 / Vega-Lite ecosystems built on it.)
Gigerenzer, G. (2002). Reckoning with Risk: Learning to Live with Uncertainty. Penguin Books. (Argues for frequency framings over probability framings in risk communication. Documents the systematic misreading of bare percentages by both lay readers and trained physicians. The §3.6 case for frequency framing in the "70% chance of rain" subsection.)
Britton, E., Fisher, P.G., Whitley, J. (1998). "The Inflation Report projections: understanding the fan chart." Bank of England Quarterly Bulletin, Q1, 30–37. (The original Bank of England fan chart introduction. Defines the nested-credible-band display for forecast uncertainty. Now standard across macroeconomic forecasting, epidemiology, and climate science.)
Schulz, K.F., Altman, D.G., Moher, D. (2010). "CONSORT 2010 Statement: updated guidelines for reporting parallel group randomised trials." BMJ 340, c332. (The international reporting-standards document for randomised controlled trials. Requires effect sizes with CIs, one- vs two-sided specification, and assumption disclosure — the §3.6 checklist made mandatory for the most regulated form of clinical research.)