Introduction
Data preprocessing is a critical step in preparing raw data for machine learning models. Properly preprocessed data improves model accuracy, ensures compatibility with algorithms, and reduces potential biases. Below are the essential preprocessing techniques explained with examples and code snippets.
1. Data Encoding
Machine learning models require numerical data, so categorical variables must be encoded into numerical formats.
Techniques:
- Label Encoding:
  Assigns a unique integer to each category. Suitable for ordinal data.
  Example: converting ['Low', 'Medium', 'High'] to [0, 1, 2].
  Note that scikit-learn's LabelEncoder assigns codes in alphabetical order ('High' → 0, 'Low' → 1, 'Medium' → 2), so it will not respect the intended ranking on its own; see the sketch after the snippet below for an order-preserving alternative.
```python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
df['encoded'] = le.fit_transform(df['category'])
```
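If the ranking itself matters, scikit-learn's OrdinalEncoder lets you state the category order explicitly instead of relying on alphabetical codes. A minimal sketch, reusing the same illustrative df and column name:

```python
from sklearn.preprocessing import OrdinalEncoder

# Spell out the rank order so Low -> 0, Medium -> 1, High -> 2
oe = OrdinalEncoder(categories=[['Low', 'Medium', 'High']])
df['encoded'] = oe.fit_transform(df[['category']])
```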
- One-Hot Encoding:
  Creates a binary column for each category, ideal for non-ordinal data.
  Example: converting ['Cat', 'Dog', 'Bird'] to:

  | Cat | Dog | Bird |
  |-----|-----|------|
  | 1   | 0   | 0    |
  | 0   | 1   | 0    |
  | 0   | 0   | 1    |
```python
import pandas as pd

# Replace the 'category' column with one binary column per category
df = pd.get_dummies(df, columns=['category'])
```
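For linear models, it is common to drop one of the generated columns so the indicators are not perfectly collinear (the so-called dummy-variable trap); pandas supports this via the drop_first flag:

```python
# Keep k-1 indicator columns for a k-category feature
df = pd.get_dummies(df, columns=['category'], drop_first=True)
```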
2. Scaling and Normalization
Scaling puts features on comparable magnitudes so that no single feature dominates distance-based or gradient-based algorithms.
Techniques:
- Min-Max Scaling:
Scales data to a fixed range (usually 0 to 1) while preserving the shape of the distribution.
```python
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
df['scaled'] = scaler.fit_transform(df[['feature']])
```
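Under the hood this applies x' = (x - min) / (max - min). The same result can be computed by hand, which makes the formula explicit (a sketch, reusing the illustrative 'feature' column):

```python
col = df['feature']
df['scaled_by_hand'] = (col - col.min()) / (col.max() - col.min())
```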
- Standardization (Z-Score Normalization):
Centers data around a mean of 0 with a standard deviation of 1, useful for features with varying magnitudes.
```python
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
df['standardized'] = scaler.fit_transform(df[['feature']])
```
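The formula here is z = (x - mean) / std. One subtlety worth knowing: StandardScaler uses the population standard deviation (ddof=0), so a hand-rolled pandas version must match that to reproduce its output:

```python
col = df['feature']
df['z_by_hand'] = (col - col.mean()) / col.std(ddof=0)
```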
3. Feature Transformation
Feature transformation changes the representation of feature values to expose patterns or improve model performance.
Examples:
- Log Transformation:
  Reduces right-skew in data distributions; particularly effective for non-negative data with outliers or long tails.
```python
import numpy as np

# Adding 1 before the log handles zero values
df['log_transformed'] = np.log(df['feature'] + 1)
```
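NumPy also ships np.log1p, which computes log(1 + x) in one step and is more numerically accurate for values close to zero:

```python
df['log_transformed'] = np.log1p(df['feature'])
```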
- Polynomial Features:
Captures complex, non-linear relationships by creating higher-order feature combinations.
```python
from sklearn.preprocessing import PolynomialFeatures

poly = PolynomialFeatures(degree=2)
poly_features = poly.fit_transform(df[['feature1', 'feature2']])
```
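To inspect which terms were created, scikit-learn (1.0+) exposes get_feature_names_out; with two inputs at degree 2 you get the bias term, both originals, their squares, and their product:

```python
# ['1', 'feature1', 'feature2', 'feature1^2', 'feature1 feature2', 'feature2^2']
print(poly.get_feature_names_out(['feature1', 'feature2']))
```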
- Discretization:
  Converts continuous variables into discrete bins or categories, useful for creating ordinal data.
```python
from sklearn.preprocessing import KBinsDiscretizer

kbins = KBinsDiscretizer(n_bins=4, encode='ordinal', strategy='uniform')
df['binned'] = kbins.fit_transform(df[['feature']])
```
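strategy='uniform' produces equal-width bins; the other built-in strategies are 'quantile' (bins holding roughly equal numbers of samples) and 'kmeans' (bin edges from 1-D k-means clustering):

```python
# Quantile bins: each bin gets ~25% of the samples when n_bins=4
kbins = KBinsDiscretizer(n_bins=4, encode='ordinal', strategy='quantile')
df['binned'] = kbins.fit_transform(df[['feature']])
```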
Real-World Applications
- Data Encoding:
  - Label Encoding: ranking customer satisfaction as ‘Good,’ ‘Average,’ ‘Poor.’
  - One-Hot Encoding: encoding product categories for an e-commerce recommendation system.
- Scaling and Normalization:
  - Normalizing financial transaction amounts for fraud detection models.
  - Standardizing medical test results for healthcare analytics.
- Feature Transformation:
  - Log transforming revenue data for predictive analytics.
  - Using polynomial features for better predictions in housing price models.
Why is Preprocessing Essential?
- Improves model performance by creating clean and structured datasets.
- Reduces the risk of biased outcomes from skewed or unbalanced data.
- Ensures compatibility with specific machine learning algorithms that require normalized or encoded inputs.
Next Steps
After preprocessing, the next step is Feature Engineering and Optimization, where you’ll refine the dataset by identifying and creating features that have the greatest impact on model accuracy.
With your data preprocessed, you’re one step closer to building robust AI models.