Nearest Neighbor and KNN

Chapter 5: Distance-Based Learning — K-Nearest Neighbors

Learning objectives

Compute distances between points using Euclidean, Manhattan, Minkowski, and Cosine metrics
Classify a new point using 1-NN and K-NN with majority voting
Explain the curse of dimensionality and its effect on KNN
Understand the importance of feature scaling for distance-based methods
Apply KNN for both classification and regression in geoscience problems
Choose an appropriate K using cross-validation and the elbow method
Describe the computational complexity of KNN and acceleration strategies

Learning from Neighbors

K-Nearest Neighbors (KNN) is one of the simplest and most intuitive machine learning algorithms. The core idea: to classify a new data point, find the $K$ closest training examples and let them "vote" on the class. KNN is a lazy learner — it stores the entire training set and does all the work at prediction time, rather than building an explicit model during training. This makes training instantaneous but prediction potentially slow for large datasets.

In geoscience, KNN is widely used for mineral classification from X-ray fluorescence (XRF) data, well-log facies identification, soil type mapping from geochemical surveys, and weather analogue forecasting. Its simplicity makes it an excellent baseline before trying more complex models.

1. Distance Metrics

The foundation of KNN is the concept of "closeness." Different distance metrics capture different notions of similarity. The choice of metric can significantly affect classification results, especially in geoscience where features may have different physical meanings.

Euclidean Distance (L2)

The straight-line distance between two points $\mathbf{x} = (x_1, \ldots, x_n)$ and $\mathbf{y} = (y_1, \ldots, y_n)$ :

d_{\text{Euclidean}}(\mathbf{x}, \mathbf{y}) = \sqrt{\sum_{i=1}^{n}(x_i - y_i)^2}

This is the most commonly used distance metric. For 2D points $(x_1, x_2)$ and $(y_1, y_2)$ : $d = \sqrt{(x_1-y_1)^2 + (x_2-y_2)^2}$ .

Worked numeric example: Consider two rock samples described by quartz content (%) and feldspar content (%): $A = (65, 20)$ and $B = (50, 35)$ .

d(A, B) = \sqrt{(65-50)^2 + (20-35)^2} = \sqrt{225 + 225} = \sqrt{450} \approx 21.21

Euclidean distance works well when all features are on similar scales and when the "straight-line" concept of closeness is appropriate.

Manhattan Distance (L1)

Sum of absolute differences along each dimension — like navigating a grid of city blocks:

d_{\text{Manhattan}}(\mathbf{x}, \mathbf{y}) = \sum_{i=1}^{n}|x_i - y_i|

Worked numeric example: Using the same rock samples $A = (65, 20)$ and $B = (50, 35)$ :

d(A, B) = |65-50| + |20-35| = 15 + 15 = 30

Manhattan distance is more robust to outliers than Euclidean distance because it does not square the differences. In geoscience, it can be useful when features represent independent physical quantities (e.g., different mineral percentages that do not interact geometrically).

Minkowski Distance (General Form)

A generalisation that includes both Euclidean and Manhattan as special cases:

d_p(\mathbf{x}, \mathbf{y}) = \left(\sum_{i=1}^{n}|x_i - y_i|^p\right)^{1/p}

$p = 1$ : Manhattan distance. $p = 2$ : Euclidean distance. As $p \to \infty$ , this becomes the Chebyshev distance (maximum absolute difference along any single dimension): $d_\infty = \max_i |x_i - y_i|$ .

Higher values of $p$ give more weight to the single largest feature difference. Lower $p$ distributes weight more evenly across all dimensions.

Cosine Similarity

Cosine similarity measures the angle between two vectors, ignoring their magnitude:

\text{cosine\_sim}(\mathbf{x}, \mathbf{y}) = \frac{\mathbf{x} \cdot \mathbf{y}}{\|\mathbf{x}\| \, \|\mathbf{y}\|}

Cosine distance is defined as $1 - \text{cosine_sim}$ . This metric is useful when the direction of the feature vector matters more than its magnitude. In geoscience, cosine similarity is valuable for comparing geochemical compositions where two samples may have the same mineral ratios but different total concentrations (e.g., comparing XRF spectra where sample preparation affects total counts).

Example: Samples $A = (30, 60, 10)$ and $B = (15, 30, 5)$ have identical proportions (same direction) so cosine similarity = 1.0, but Euclidean distance is large.

2. The 1-NN Algorithm

Simplest Case

Given a new query point $\mathbf{q}$ :

Compute the distance from $\mathbf{q}$ to every training point.
Find the training point with the smallest distance.
Assign the class of that nearest neighbor to $\mathbf{q}$ .

1-NN is simple but sensitive to noise: a single mislabelled training point can cause misclassification. The decision boundary of 1-NN forms a Voronoi tessellation of the feature space — each training point "owns" the region of space closest to it.

An important theoretical result: as the training set size $n \to \infty$ , the 1-NN error rate is at most twice the Bayes optimal error rate. This means 1-NN, despite its simplicity, has strong theoretical guarantees given enough data.

3. The K-NN Algorithm

Voting Mechanism

Instead of relying on just 1 neighbor, K-NN uses the $K$ closest neighbors and applies majority voting:

Compute the distance from the query point to all training points.
Select the $K$ training points with the smallest distances.
Count the votes: the class that appears most often among the $K$ neighbors wins.
In case of a tie, options include: random selection, choosing the class of the closest neighbor among the tied classes, or using distance-weighted voting.

Distance-weighted voting: Instead of giving each neighbor an equal vote, weight the vote by the inverse of the distance: $w_i = \frac{1}{d_i}$ . Closer neighbors have more influence. This often improves performance, especially when there are noisy or borderline samples among the K neighbors.

Choosing K

The choice of $K$ is crucial and represents a bias-variance trade-off:

Small K (e.g., 1 or 3): captures local patterns, but sensitive to noise and outliers. Low bias, high variance.
Large K (e.g., 20 or 50): smoother decision boundary, more robust to noise, but may miss fine-grained patterns. High bias, low variance.
Rule of thumb: try $K = \sqrt{n}$ where $n$ is the number of training samples, then tune using cross-validation.
Use odd K for binary classification to avoid ties.

The Elbow Method for K Selection

Plot test accuracy (or cross-validation accuracy) against K for a range of values (e.g., 1 to 25). Look for the "elbow" — the point where increasing K no longer significantly improves accuracy. Beyond this point, increasing K adds bias without reducing variance.

Cross-Validation for K Selection

For each candidate K, perform k-fold cross-validation (e.g., 5-fold). Split the training data into 5 parts, train on 4 and validate on 1, rotate, and average the accuracy. The K with the highest average CV accuracy is selected. This is more reliable than using a single train/test split because it uses all data for both training and validation.

4. Feature Scaling

Why It Is Essential

KNN relies on distances. If one feature has a much larger range than others, it will dominate the distance calculation. This is a critical issue in geoscience where features span vastly different scales.

Concrete Example of Scaling Failure

Consider classifying rock samples using depth (m) and porosity (%). Sample A: (2000 m, 25%). Sample B: (2005 m, 5%). Sample C: (3000 m, 24%).

Without scaling, $d(A, B) = \sqrt{5^2 + 20^2} = \sqrt{425} \approx 20.6$ . And $d(A, C) = \sqrt{1000^2 + 1^2} = \sqrt{1000001} \approx 1000$ .

The depth difference completely swamps the porosity difference. KNN would say A is closer to B, ignoring the huge porosity difference (25% vs 5%). But a geologist knows that porosity is far more diagnostic of rock type than a 5-metre depth difference.

StandardScaler (Z-Score Normalisation)

x' = \frac{x - \mu}{\sigma}

Transforms each feature to zero mean and unit variance. Best when features are approximately normally distributed. This is the most commonly used scaler for KNN in practice.

MinMaxScaler (Min-Max Normalisation)

x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}}

Scales each feature to the range $[0, 1]$ . Sensitive to outliers (a single extreme value compresses the rest of the data). Best when the data does not have extreme outliers and you want a bounded range.

When to Use Which

Use StandardScaler as a default for KNN. Use MinMaxScaler when you need bounded features (e.g., for algorithms that require inputs in [0,1]). Always fit the scaler on the training set only, then transform both training and test sets with the same parameters to avoid data leakage.

5. KNN for Regression

Averaging Instead of Voting

KNN can also predict continuous values. Instead of majority voting, average the target values of the $K$ nearest neighbors:

\hat{y} = \frac{1}{K}\sum_{i=1}^{K}y_{\text{neighbor}_i}

Weighted KNN regression uses inverse-distance weights:

\hat{y} = \frac{\sum_{i=1}^{K} w_i \, y_i}{\sum_{i=1}^{K} w_i}, \quad w_i = \frac{1}{d_i}

In geoscience, KNN regression can be used to predict porosity from well-log measurements, estimate permeability from core data, or interpolate geochemical concentrations across a survey grid.

Example: Predict the porosity at a new well location. The 3 nearest wells have porosities of 18%, 22%, and 20% at distances of 1 km, 2 km, and 4 km. Uniform average: $\hat{y} = (18+22+20)/3 = 20%$ . Weighted average: $\hat{y} = \frac{18/1 + 22/2 + 20/4}{1/1 + 1/2 + 1/4} = \frac{18 + 11 + 5}{1.75} = \frac{34}{1.75} \approx 19.4%$ .

6. The Curse of Dimensionality

High Dimensions Break Distance

As the number of features grows, data points become increasingly spread out in the high-dimensional space. This phenomenon has profound consequences for KNN:

Distances converge: In high dimensions, the ratio of the nearest to the farthest neighbor distance approaches 1. If all distances are nearly equal, the concept of "nearest" loses meaning.
Volume grows exponentially: The volume of the unit hypercube in $d$ dimensions is $1^d = 1$ , but the volume of the inscribed hypersphere shrinks rapidly. Most of the volume is in the "corners" of the space where no data lives.
Data becomes sparse: To maintain the same density of data points, you need exponentially more samples as dimensions increase. With 10 features, you might need $10^{10}$ samples for the same coverage that $10^2 = 100$ samples provide in 2D.

Intuitive Example

Imagine you have 100 well-log readings with 2 features (gamma ray and density). The data fills a 2D plane nicely and KNN works well. Now add 50 more features (every possible log curve, elemental concentrations, etc.). The same 100 points are now scattered in a 52-dimensional space. Each point is roughly equally far from every other point. KNN cannot distinguish meaningful neighbors from random ones.

Solutions

Dimensionality reduction with PCA: project the data onto the top principal components that capture most variance.
Feature selection: choose only the most relevant features using domain knowledge or statistical tests. A geologist might select gamma ray, density, and neutron porosity as the three most diagnostic logs.
Collect more data: if you must use many features, you need a correspondingly large training set.

7. Computational Complexity

Cost of Prediction

For each new query point, brute-force KNN must compute the distance to all $n$ training points in $d$ dimensions. This costs $O(nd)$ per prediction. For a test set of $m$ queries: $O(mnd)$ .

With large training sets (e.g., millions of seismic trace samples), this becomes prohibitively slow.

Acceleration with KD-Trees and Ball Trees

KD-trees partition the feature space into axis-aligned regions. For low-dimensional data ( $d < 20$ ), KD-trees reduce the average query time to $O(d \log n)$ . However, in high dimensions, KD-trees degrade to brute-force performance.

Ball trees use nested hyperspheres and perform better in moderate dimensions (up to ~40). scikit-learn automatically selects the best algorithm based on the data shape.

In practice, for geoscience datasets with 3–10 features and a few thousand samples, brute-force KNN is fast enough. For larger datasets, consider using the algorithm="kd_tree" or algorithm="ball_tree" options in scikit-learn.

8. Geoscience Applications

Mineral Classification from XRF Data

Given the elemental concentrations (Si, Al, Fe, Ca, Mg, K, Na) from X-ray fluorescence analysis of a rock sample, classify it as granite, basalt, sandstone, or limestone by finding the K most similar samples in a reference database. Each training sample is a known rock type with measured XRF concentrations. KNN compares the new sample's geochemical fingerprint against the database.

Well-Log Facies Classification

Using multiple well-log curves (gamma ray, resistivity, bulk density, neutron porosity, photoelectric factor) as features, KNN can classify each depth interval into a lithofacies (shale, siltstone, clean sandstone, cemented sandstone, limestone, dolomite). This is one of the most common applications of KNN in petroleum geoscience.

Weather Analogue Forecasting

Find the K most similar historical weather patterns (based on pressure, temperature, humidity fields) and use their outcomes to forecast tomorrow's weather. This "analogue" method is a direct application of KNN to atmospheric science.

[Refs: Bishop, Pattern Recognition and ML; Hastie et al., Elements of Statistical Learning; Cover & Hart, Nearest Neighbor Pattern Classification, 1967]

References

Hastie, T., Tibshirani, R., Friedman, J. (2009). The Elements of Statistical Learning (2nd ed.), ch. 13.3 (k-nearest neighbor classifiers). Springer.
James, G., Witten, D., Hastie, T., Tibshirani, R. (2021). An Introduction to Statistical Learning (2nd ed.), ch. 3 & 4 (KNN regression and classification). Springer.
Bishop, C.M. (2006). Pattern Recognition and Machine Learning, ch. 2.5 (nonparametric methods). Springer.
Murphy, K.P. (2022). Probabilistic Machine Learning: An Introduction, ch. 16.1 (KNN). MIT Press.