Introduction
Cross-validation is a vital step in ensuring that machine learning models generalise well to unseen data. By systematically splitting the dataset into training and testing subsets, cross-validation helps to assess model performance, minimise overfitting, and guide model selection.
This page covers the most common cross-validation techniques with examples to help you implement them effectively.
Why is Cross-Validation Important?
- Improves Generalization: Ensures the model performs well on unseen data.
- Reduces Overfitting: Provides a robust estimate of performance by using multiple splits of the dataset.
- Aids in Model Selection: Helps choose the best algorithm or hyperparameter settings.
1. K-Fold Cross-Validation
- Concept:
Split the dataset into k equally sized folds. Use k-1 folds for training and the remaining fold for testing. Repeat the process k times, ensuring each fold serves as the test set once.
- Advantages:
- Reliable estimates of model performance.
- Efficient for small to medium-sized datasets.
- Illustrative Example:
Suppose we have a dataset with 1000 samples and set k=5.
- Step 1: Split the data into 5 folds (200 samples each).
- Step 2: Train on folds 1–4 and test on fold 5.
- Step 3: Rotate the test fold and repeat until all folds are used.
- Result: Average the performance metrics from the 5 iterations.
Code Example:
from sklearn.model_selection import KFold
from sklearn.metrics import accuracy_score
from sklearn.ensemble import RandomForestClassifier
# Example dataset
X = [[i] for i in range(1, 11)]
y = [0, 1, 0, 1, 0, 1, 0, 1, 0, 1]
kf = KFold(n_splits=5, shuffle=True, random_state=42)
model = RandomForestClassifier()
for train_index, test_index in kf.split(X):
    X_train, X_test = [X[i] for i in train_index], [X[i] for i in test_index]
    y_train, y_test = [y[i] for i in train_index], [y[i] for i in test_index]
    model.fit(X_train, y_train)
    predictions = model.predict(X_test)
    print("Fold Accuracy:", accuracy_score(y_test, predictions))
2. Stratified K-Fold Cross-Validation
- Concept:
Ensures each fold maintains the same proportion of classes as the original dataset.
- Applications:
- Ideal for imbalanced classification problems (e.g., fraud detection, rare disease diagnosis).
- Illustrative Example:
If the dataset has 80% class A and 20% class B, each fold will preserve this ratio.
Code Example:
from sklearn.model_selection import StratifiedKFold
X = [[i] for i in range(1, 21)]
y = [0] * 15 + [1] * 5 # Imbalanced dataset
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for train_index, test_index in skf.split(X, y):
    print("Train:", train_index, "Test:", test_index)
3. Leave-One-Out Cross-Validation (LOOCV)
- Concept:
Each sample in the dataset is used as a test set once, while the remaining samples form the training set.
- Advantages:
- Maximises training data for each iteration.
- Disadvantages:
- Computationally expensive for large datasets.
Code Example:
from sklearn.model_selection import LeaveOneOut
X = [[i] for i in range(1, 6)]
y = [0, 1, 0, 1, 0]
loo = LeaveOneOut()
for train_index, test_index in loo.split(X):
    print("Train:", train_index, "Test:", test_index)
4. Hold-Out Validation
- Concept:
Split the data into separate training, validation, and test sets (e.g., 70%-15%-15%).
- Applications:
- Useful for quick model prototyping.
- Illustrative Example:
- Train: 70% of the dataset.
- Validation: 15% to tune hyperparameters.
- Test: 15% for final evaluation.
Code Example:
from sklearn.model_selection import train_test_split
X = [[i] for i in range(1, 11)]
y = [0, 1, 0, 1, 0, 1, 0, 1, 0, 1]
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.3, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42)
print("Training set:", X_train)
print("Validation set:", X_val)
print("Test set:", X_test)
5. Time-Series Cross-Validation
- Concept:
Maintains the chronological order of the data. Uses older data for training and newer data for validation.
- Applications:
- Time-series forecasting (e.g., stock prices, weather prediction).
- Illustrative Example:
Suppose we have data from January to December. Train the model on January–June and validate on July–August. Repeat with increasing time windows.
Code Example:
from sklearn.model_selection import TimeSeriesSplit
X = [[i] for i in range(1, 13)] # Monthly data
y = [i for i in range(1, 13)]
tscv = TimeSeriesSplit(n_splits=3)
for train_index, test_index in tscv.split(X):
    print("Train:", train_index, "Test:", test_index)
Best Practices for Cross-Validation
- Choose the Right Technique: Match the cross-validation method to your data and problem type (e.g., stratified for imbalanced datasets, time series for sequential data).
- Watch for Data Leakage: Ensure the test set remains unseen during training and preprocessing to avoid inflated performance metrics (see the sketch after this list).
- Monitor Computational Cost: For large datasets, prefer cheaper schemes such as hold-out validation or K-Fold with a small number of folds over exhaustive methods like LOOCV.
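One common source of leakage is fitting preprocessing (for example a scaler) on the full dataset before splitting. Wrapping the preprocessing and the model in a scikit-learn Pipeline keeps every step inside the training portion of each fold; the StandardScaler and LogisticRegression combination below is purely illustrative:

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X = [[i] for i in range(1, 11)]
y = [0, 1, 0, 1, 0, 1, 0, 1, 0, 1]

# The scaler is re-fit on the training portion of every fold, so the test fold
# never influences the preprocessing statistics
pipeline = make_pipeline(StandardScaler(), LogisticRegression())
scores = cross_val_score(pipeline, X, y, cv=5)
print("Leak-free cross-validated accuracy:", scores.mean())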
Conclusion
Cross-validation is an essential step in building reliable machine learning models. By choosing the appropriate technique, you can ensure that your models are both accurate and generalizable.
Next Topic: Hyperparameter Tuning ➡️