Decision Trees

Chapter 7: Tree-Based Models I — Decision Trees

Learning objectives

Describe the structure of a decision tree: nodes, branches, and leaves
Compute Gini impurity and entropy for a node
Calculate information gain to determine the best split
Understand pruning strategies to prevent overfitting
Apply decision trees to lithology classification

Decision Trees: Interpretable ML

A decision tree is a flowchart-like model that makes predictions by recursively splitting the data based on feature values. It is one of the most interpretable machine learning models: you can follow the path from root to leaf and understand exactly why a prediction was made.

1. Tree Structure

Components

Root node: the top of the tree; contains the entire dataset.
Internal nodes: each performs a test on a feature (e.g., "Is gamma ray > 75 API?").
Branches: the outcomes of the test (e.g., "Yes" and "No").
Leaf nodes: terminal nodes that provide the prediction (class label for classification, value for regression).

Example for lithology classification:

[Gamma Ray > 75?] /
Yes No /
[Density > 2.5?] [Calcite > 50%?] / \ /
Yes No Yes No | | | | SHALE SILTSTONE LIMESTONE SANDSTONE

2. Splitting Criteria

Gini Impurity

Gini impurity measures how "mixed" a node is. For a node with $C$ classes and probability $p_k$ of class $k$ :

G = 1 - \sum_{k=1}^{C} p_k^2

$G = 0$ means the node is pure (all samples belong to one class). The maximum Gini for a binary problem is $0.5$ (equal mix of both classes).

Example: A node has 30 sandstone and 10 shale samples (total 40). $p_\text{sand} = 30/40 = 0.75$ , $p_\text{shale} = 10/40 = 0.25$ . $G = 1 - (0.75^2 + 0.25^2) = 1 - (0.5625 + 0.0625) = 0.375$ .

Entropy

An alternative measure from information theory:

H = -\sum_{k=1}^{C} p_k \log_2 p_k

$H = 0$ for a pure node, $H = 1$ for a binary node with equal class distribution.

Example: Same node (30 sand, 10 shale): $H = -(0.75 \log_2 0.75 + 0.25 \log_2 0.25) = -(0.75 \times (-0.415) + 0.25 \times (-2)) = 0.311 + 0.5 = 0.811$ .

Information Gain

Information gain measures the reduction in impurity achieved by splitting on a feature:

\text{IG} = H(\text{parent}) - \sum_{j} \frac{n_j}{n} H(\text{child}_j)

The feature with the highest information gain (or largest Gini reduction) is chosen for splitting at each node.

Example: Parent entropy $H = 0.811$ . After splitting on "gamma ray > 75": left child (25 sand, 2 shale) and right child (5 sand, 8 shale). Weighted child entropy: $\frac{27}{40} H_L + \frac{13}{40} H_R$ . If this is lower than 0.811, the split is beneficial.

3. Building the Tree

Recursive Algorithm

At the current node, evaluate all possible splits on all features.
Select the split with the highest information gain (or largest Gini reduction).
Partition the data into child nodes.
Recursively repeat for each child node.
Stop when a stopping criterion is met.

4. Pruning

Why Prune?

An unpruned decision tree will keep splitting until every leaf is pure or no more features are available. This leads to overfitting: the tree memorises the training data, including its noise, and generalises poorly to new data.

Pre-Pruning (Early Stopping)

Max depth: limit how deep the tree can grow.
Min samples per leaf: require a minimum number of samples in each leaf.
Min samples per split: require a minimum number of samples to justify a new split.
Max leaf nodes: limit the total number of leaves.

Post-Pruning (Cost-Complexity Pruning)

Grow the full tree, then remove branches that add little predictive power. Uses a parameter $\alpha$ (ccp_alpha in scikit-learn) that penalises tree complexity.

5. Pros and Cons

Advantages

Interpretable: easy to visualise and explain to non-technical stakeholders.
No feature scaling needed: trees are invariant to monotonic transformations.
Handles both numerical and categorical features.
Fast prediction: just follow a path from root to leaf.

Disadvantages

Prone to overfitting (mitigated by pruning).
Unstable: small changes in data can produce a very different tree.
Biased toward features with many levels.
Cannot extrapolate: predictions are always within the range of training data.

Geoscience Application

Decision trees are excellent for lithology classification from well logs. A geologist can inspect the tree and verify that the splits make geological sense (e.g., "gamma ray > 75 → probably shale" is consistent with shale having higher radioactivity). Feature importance from the tree reveals which logs are most diagnostic for each rock type.

6. CART Algorithm Detail

How Splits Are Chosen

The Classification and Regression Trees (CART) algorithm is the most widely used tree-building method (it is what scikit-learn implements). CART builds binary trees: every internal node has exactly two children.

At each node, CART considers every possible split on every feature:

For each feature $x_j$ , sort the training data by that feature.
Consider every midpoint between consecutive distinct values as a candidate threshold $t$ .
For each candidate split $x_j \leq t$ , compute the impurity reduction: $\Delta G = G(\text{parent}) - \frac{n_L}{n} G(\text{left}) - \frac{n_R}{n} G(\text{right})$
Select the feature and threshold that maximise $\Delta G$ .

For a dataset with $m$ samples and $n$ features, each split evaluation is $O(mn \log m)$ (due to sorting). The total tree construction is $O(mn^2 \log m)$ in the worst case.

7. Regression Trees

Using Variance Reduction

For regression (predicting a continuous value), decision trees replace class impurity with variance reduction. The prediction at each leaf is the mean of the target values in that leaf.

The splitting criterion becomes:

\Delta \text{Var} = \text{Var}(\text{parent}) - \frac{n_L}{n}\text{Var}(\text{left}) - \frac{n_R}{n}\text{Var}(\text{right})

where $\text{Var}(S) = \frac{1}{|S|}\sum_{i \in S}(y_i - \bar{y}_S)^2$ .

Example: A regression tree predicting porosity from well-log features might split on "RHOB > 2.4 g/cc" because samples with higher density have systematically lower porosity, reducing variance in both child nodes.

Multi-Output Trees

Decision trees can predict multiple targets simultaneously. For multi-output regression, the splitting criterion sums variance reduction across all targets:

\Delta = \sum_{t=1}^{T} \Delta \text{Var}_t

In scikit-learn, pass a multi-column $\mathbf{y}$ to DecisionTreeRegressor. This is useful in geoscience when predicting several petrophysical properties (porosity, permeability, water saturation) from the same set of well-log features in a single model.

8. Feature Importance

Mean Decrease in Impurity (MDI)

A natural measure of feature importance falls out of the tree-building process. The importance of feature $j$ is the total impurity reduction achieved by splits on that feature, averaged over all trees (in an ensemble) or across the single tree:

\text{Imp}(j) = \sum_{\text{nodes } t \text{ splitting on } j} \frac{n_t}{n} \cdot \Delta G_t

Importances are normalised to sum to 1. In scikit-learn, access via tree.feature_importances_.

Caution: MDI importance is biased toward high-cardinality features (features with many distinct values get more split opportunities). For a more reliable estimate, use permutation importance: randomly shuffle one feature at a time and measure how much test accuracy drops.

9. Decision Trees vs. Linear Models

When Trees Beat Linear Models

Decision trees and linear models have complementary strengths:

Trees excel when the true decision boundary is non-linear and axis-aligned (e.g., "if GR > 75 AND RHOB < 2.4, predict sandstone"). They capture threshold effects and feature interactions naturally.
Linear models excel when the true relationship is approximately linear or when the boundary is diagonal in feature space (e.g., porosity decreasing linearly with depth).
Trees handle mixed feature types (numerical + categorical) without preprocessing. Linear models require encoding and scaling.
Linear models extrapolate beyond the training range; trees cannot (they predict the nearest leaf value).

In practice, tree ensembles (Random Forest, Gradient Boosting) combine many trees and often outperform both single trees and linear models on tabular geoscience data.

Geoscience: Well Log Interpretation Trees

Decision trees are particularly powerful for automated well-log interpretation:

Electrofacies classification: A tree trained on GR, RHOB, NPHI, RT, and PE can automatically classify depth intervals into electrofacies. The tree rules (e.g., "GR < 60 and PE > 3.5 $\to$ limestone") can be validated against core data and geological knowledge.
Pay zone identification: Regression trees predict net-to-gross or effective porosity, helping identify productive intervals. Feature importances reveal which logs are most diagnostic.
Missing log prediction: When a log (e.g., density) was not run in a well, a regression tree trained on other wells can predict it from available logs.

[Refs: Breiman et al., Classification and Regression Trees; Mitchell, Machine Learning]

References

Hastie, T., Tibshirani, R., Friedman, J. (2009). The Elements of Statistical Learning (2nd ed.), ch. 9 (additive models, trees, related methods). Springer.
James, G., Witten, D., Hastie, T., Tibshirani, R. (2021). An Introduction to Statistical Learning (2nd ed.), ch. 8 (tree-based methods). Springer.
Murphy, K.P. (2022). Probabilistic Machine Learning: An Introduction, ch. 18 (trees, forests). MIT Press.
Olden, J.D., Lawler, J.J., Poff, N.L. (2008). Machine learning methods without tears: A primer for ecologists. Q. Rev. Biol. 83(2), 171–193.