Data Preprocessing Techniques

Introduction

Data preprocessing is a critical step in preparing raw data for machine learning models. Properly preprocessed data improves model accuracy, ensures compatibility with algorithms, and reduces potential biases. Below are the essential preprocessing techniques explained with examples and code snippets.


1. Data Encoding

Machine learning models require numerical data, so categorical variables must be encoded into numerical formats.

Techniques:

  • Label Encoding:
    Assigns a unique integer to each category. Suitable for ordinal data.
    Example: Converting ['Low', 'Medium', 'High'] to [0, 1, 2] (but see the ordering caveat after the snippet below).

python

  from sklearn.preprocessing import LabelEncoder

  le = LabelEncoder()
  # Replace each category string with an integer code
  df['encoded'] = le.fit_transform(df['category'])
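
One caveat: LabelEncoder assigns integers in sorted (alphabetical) order, so ['Low', 'Medium', 'High'] actually maps to High=0, Low=1, Medium=2. When the ranking matters, scikit-learn's OrdinalEncoder accepts an explicit order; a minimal sketch, assuming the same df with a 'category' column:

python

  from sklearn.preprocessing import OrdinalEncoder

  # Explicit ordering guarantees Low=0, Medium=1, High=2
  oe = OrdinalEncoder(categories=[['Low', 'Medium', 'High']])
  df['encoded'] = oe.fit_transform(df[['category']])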

  • One-Hot Encoding:
    Creates a binary column for each category; ideal for non-ordinal (nominal) data.
    Example: Converting ['Cat', 'Dog', 'Bird'] to:
  Cat | Dog | Bird
   1  |  0  |  0  
   0  |  1  |  0  
   0  |  0  |  1  

python

  import pandas as pd

  # Expand 'category' into one binary (0/1) column per category value
  df = pd.get_dummies(df, columns=['category'])
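
Note that pd.get_dummies does not remember which categories it saw, so new data at prediction time can yield a different set of columns. For a training/serving workflow, scikit-learn's OneHotEncoder is the more robust option; a minimal sketch, again assuming a 'category' column:

python

  from sklearn.preprocessing import OneHotEncoder

  # handle_unknown='ignore' maps unseen categories to all-zero rows;
  # sparse_output=False (scikit-learn >= 1.2) returns a dense array
  ohe = OneHotEncoder(sparse_output=False, handle_unknown='ignore')
  encoded = ohe.fit_transform(df[['category']])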

2. Scaling and Normalization

Scaling puts features on comparable magnitudes so that no single feature dominates distance- or gradient-based models.

Techniques:

  • Min-Max Scaling:
    Scales data to a fixed range (usually 0 to 1) via x' = (x − min) / (max − min), preserving the shape of the distribution.

python

  from sklearn.preprocessing import MinMaxScaler

  scaler = MinMaxScaler()
  # Rescale the column into the [0, 1] range
  df['scaled'] = scaler.fit_transform(df[['feature']])

  • Standardization (Z-Score Normalization):
    Centers data to a mean of 0 and a standard deviation of 1 via z = (x − μ) / σ, useful when features have very different magnitudes.

python

  from sklearn.preprocessing import StandardScaler

  scaler = StandardScaler()
  # Subtract the column mean and divide by its standard deviation
  df['standardized'] = scaler.fit_transform(df[['feature']])
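
A caveat that applies to both scalers: fit them on the training split only, then reuse the learned parameters on the test split, so test-set statistics never leak into training. A minimal sketch, assuming arrays X_train and X_test:

python

  from sklearn.preprocessing import StandardScaler

  scaler = StandardScaler()
  # Learn mean and standard deviation from the training data only...
  X_train_scaled = scaler.fit_transform(X_train)
  # ...then apply those same parameters to the test data
  X_test_scaled = scaler.transform(X_test)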

3. Feature Transformation

Feature transformation modifies the data format to highlight patterns or improve model performance.

Examples:

  • Log Transformation:
    Reduces skewness in data distributions; particularly effective for right-skewed data with long tails or large outliers.

python

  import numpy as np

  # log(1 + x) avoids log(0) on zero values; np.log1p(df['feature']) is equivalent
  df['log_transformed'] = np.log(df['feature'] + 1)

  • Polynomial Features:
    Captures complex, non-linear relationships by creating higher-order feature combinations.

python

  from sklearn.preprocessing import PolynomialFeatures

  poly = PolynomialFeatures(degree=2)
  # Adds squared terms and the feature1 * feature2 interaction
  poly_features = poly.fit_transform(df[['feature1', 'feature2']])
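
Continuing the snippet above, you can list exactly which combinations were generated; with degree=2 and two inputs these are a bias term, the originals, both squares, and the interaction:

python

  # ['1', 'feature1', 'feature2', 'feature1^2', 'feature1 feature2', 'feature2^2']
  print(poly.get_feature_names_out(['feature1', 'feature2']))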

  • Discretization (Binning):
    Converts continuous variables into discrete bins or categories, useful for creating ordinal features from continuous ones.

python

  from sklearn.preprocessing import KBinsDiscretizer

  # Four equal-width bins, returned as ordinal codes 0-3
  kbins = KBinsDiscretizer(n_bins=4, encode='ordinal', strategy='uniform')
  df['binned'] = kbins.fit_transform(df[['feature']])
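
With strategy='uniform' the boundaries are evenly spaced between the column's minimum and maximum; continuing the snippet above, you can inspect them after fitting:

python

  # One array of 5 boundary values defining the 4 bins of 'feature'
  print(kbins.bin_edges_)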

Real-World Applications

  • Data Encoding:
    • Label Encoding: Ranking customer satisfaction as 'Good', 'Average', 'Poor'.
    • One-Hot Encoding: Encoding product categories for an e-commerce recommendation system.
  • Scaling and Normalization:
    • Normalizing financial transaction amounts for fraud detection models.
    • Standardizing medical test results for healthcare analytics.
  • Feature Transformation:
    • Log transforming revenue data for predictive analytics.
    • Using polynomial features for better predictions in housing price models.
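
In practice these steps are rarely applied one at a time. scikit-learn's ColumnTransformer bundles them so each column gets the right treatment and the fitted transforms can be reused at prediction time. A minimal sketch, with hypothetical numeric columns 'amount' and 'age' and a categorical column 'product':

python

  from sklearn.compose import ColumnTransformer
  from sklearn.preprocessing import OneHotEncoder, StandardScaler

  # Standardize the numeric columns; one-hot encode the categorical one
  preprocess = ColumnTransformer([
      ('num', StandardScaler(), ['amount', 'age']),
      ('cat', OneHotEncoder(handle_unknown='ignore'), ['product']),
  ])
  X = preprocess.fit_transform(df)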

Why is Preprocessing Essential?

  • Improves model performance by creating clean and structured datasets.
  • Reduces the risk of biased outcomes from skewed or unbalanced data.
  • Ensures compatibility with specific machine learning algorithms that require normalized or encoded inputs.

Next Steps

After preprocessing, the next step is Feature Engineering and Optimization, where you’ll refine the dataset by identifying and creating features that have the greatest impact on model accuracy.

With your data preprocessed, you’re one step closer to building robust AI models.