Why Use One-Hot Encoding Instead of Label Encoding?
When preparing data for machine learning, encoding categorical variables is a crucial step. Two common approaches are Label Encoding and One-Hot Encoding. Choosing the right one can impact model performance.
1. What is Label Encoding?
Label encoding assigns each category a numeric value:
Red → 0
Blue → 1
Green → 2
Pros:
- Simple and memory efficient.
- Works well for ordinal data (where categories have a natural order).
Cons:
- Implies an order/priority between categories that may not exist.
- Models might misinterpret numerical values as having mathematical meaning.
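The second drawback is easy to see with a few lines of plain Python. This is a minimal sketch that reuses the Red/Blue/Green mapping above; the variable names are illustrative only.

```python
# Hypothetical mapping, matching the Red/Blue/Green example above
label_map = {"Red": 0, "Blue": 1, "Green": 2}
inverse_map = {code: name for name, code in label_map.items()}

colors = ["Red", "Green"]
codes = [label_map[c] for c in colors]

# Arithmetic on the codes is well defined, but meaningless for colors:
# the "average" of Red (0) and Green (2) is 1, which decodes to Blue.
mean_code = sum(codes) / len(codes)
print(codes)                        # [0, 2]
print(inverse_map[int(mean_code)])  # Blue
```

Any model that adds, averages, or compares these codes is doing the same kind of arithmetic.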
2. What is One-Hot Encoding?
One-hot encoding creates a binary column for each category:
Color_Red → 1 or 0
Color_Blue → 1 or 0
Color_Green → 1 or 0
Example:
| Color | Red | Blue | Green |
|---|---|---|---|
| Red | 1 | 0 | 0 |
| Green | 0 | 0 | 1 |
| Blue | 0 | 1 | 0 |
Pros:
- No unintended ordering.
- Safer for nominal (unordered) categories.
Cons:
- Can create many columns for high-cardinality features.
- Uses more memory than label encoding, since every category gets its own column (sparse storage reduces the cost).
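As a rough illustration of how the column count grows, here is a minimal pure-Python sketch. The `one_hot` helper is hypothetical (not a library function), and it orders columns alphabetically, which is an arbitrary choice.

```python
def one_hot(values):
    """Return (categories, rows) with one 0/1 column per distinct value."""
    categories = sorted(set(values))  # alphabetical column order (arbitrary choice)
    rows = [[1 if value == cat else 0 for cat in categories] for value in values]
    return categories, rows

categories, rows = one_hot(["Red", "Green", "Blue"])
print(categories)  # ['Blue', 'Green', 'Red']
print(rows)        # [[0, 0, 1], [0, 1, 0], [1, 0, 0]]

# A made-up high-cardinality feature: 1,000 distinct user IDs
user_ids = [f"user_{i}" for i in range(1000)]
id_categories, id_rows = one_hot(user_ids)
print(len(id_categories))  # 1000 columns, versus a single column with label encoding
```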
3. Why Choose One-Hot Over Label Encoding?
Choose One-Hot when:
- Categories are nominal (no natural order), e.g., "Red", "Blue", "Green".
- You want to avoid models thinking Green (2) is greater than Red (0).
Choose Label Encoding when:
- Categories are ordinal (have an inherent order), e.g., "Small" < "Medium" < "Large".
- Tree-based models (like Random Forest, XGBoost) can sometimes handle label encoding without misinterpreting the numbers.
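For ordinal data, the codes only help if they follow the real order. The sketch below assumes pandas is available and uses a made-up `sizes` list; it shows that letting the categories be inferred sorts them alphabetically, while passing the order explicitly gives Small < Medium < Large.

```python
import pandas as pd

sizes = ["Small", "Large", "Medium", "Small"]  # made-up sample values

# Categories inferred from the data are sorted alphabetically:
# Large=0, Medium=1, Small=2, which does not match the real order.
alpha_codes = pd.Categorical(sizes).codes

# Passing the order explicitly gives Small=0, Medium=1, Large=2.
ordered = pd.Categorical(sizes, categories=["Small", "Medium", "Large"], ordered=True)
ordered_codes = ordered.codes

print(list(alpha_codes))    # [2, 0, 1, 2]
print(list(ordered_codes))  # [0, 2, 1, 0]
```

scikit-learn's `LabelEncoder` also assigns codes in sorted order, so for ordinal features it is usually safer to state the order yourself.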
4. Real-Life Example
Predicting house prices:
- Feature: Neighborhood (nominal)
- Using label encoding: A → 0, B → 1, C → 2
A linear model might then treat neighborhood C as worth more than B, and B as worth more than A, just because 2 > 1 > 0.
Using one-hot encoding avoids this assumption:
| Neighborhood_A | Neighborhood_B | Neighborhood_C |
|---|---|---|
| 1 | 0 | 0 |
| 0 | 1 | 0 |
| 0 | 0 | 1 |
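To make the effect concrete, here is a small sketch that fits an ordinary least-squares line to both encodings with NumPy. The prices are invented for illustration, and neighborhood B is deliberately the cheapest so that price does not follow the A=0, B=1, C=2 order.

```python
import numpy as np

# Made-up prices (in thousands); B is the cheapest neighborhood
neighborhoods = ["A", "A", "B", "B", "C", "C"]
prices = np.array([300.0, 310.0, 100.0, 110.0, 200.0, 210.0])

# Label encoding: one numeric column plus an intercept column
codes = np.array([{"A": 0, "B": 1, "C": 2}[n] for n in neighborhoods], dtype=float)
X_label = np.column_stack([np.ones_like(codes), codes])
coef_label, *_ = np.linalg.lstsq(X_label, prices, rcond=None)
pred_label = X_label @ coef_label

# One-hot encoding: one 0/1 column per neighborhood, no intercept needed
X_onehot = np.array([[n == c for c in "ABC"] for n in neighborhoods], dtype=float)
coef_onehot, *_ = np.linalg.lstsq(X_onehot, prices, rcond=None)
pred_onehot = X_onehot @ coef_onehot

print("label-encoded max error:", np.abs(prices - pred_label).max())
print("one-hot max error:      ", np.abs(prices - pred_onehot).max())
```

With these numbers, the one-hot fit recovers each neighborhood's average price, while the label-encoded fit is forced onto a single straight line through the codes and misses neighborhood B badly.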
5. Python Example
import pandas as pd
from sklearn.preprocessing import OneHotEncoder, LabelEncoder
# Sample data
df = pd.DataFrame({'Color': ['Red', 'Blue', 'Green', 'Red']})
# Label Encoding (LabelEncoder assigns codes in sorted order: Blue=0, Green=1, Red=2)
le = LabelEncoder()
df['Color_Label'] = le.fit_transform(df['Color'])
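# One-Hot Encoding (`sparse` was renamed to `sparse_output` in scikit-learn 1.2; use `sparse=False` on older versions)
ohe = OneHotEncoder(sparse_output=False)
ohe_df = pd.DataFrame(ohe.fit_transform(df[['Color']]), columns=ohe.get_feature_names_out(['Color']))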
# Combine results
final_df = pd.concat([df, ohe_df], axis=1)
print(final_df)
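For quick exploration, the same one-hot columns can also be produced with pandas alone. This is a minimal sketch using `pd.get_dummies` on the same sample data; `dtype=int` is passed so the columns hold 0/1 rather than booleans.

```python
import pandas as pd

# Same sample data as in the example above
df = pd.DataFrame({'Color': ['Red', 'Blue', 'Green', 'Red']})

# One 0/1 column per category, named <column>_<category>
dummies = pd.get_dummies(df, columns=['Color'], dtype=int)
print(dummies)
```

One reason to prefer `OneHotEncoder` in a modeling pipeline is that a fitted encoder remembers its categories and can apply the same columns to new data, whereas `get_dummies` builds columns from whatever values are present in the frame it is given.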
In short:
- One-Hot Encoding = safer for nominal data.
- Label Encoding = suitable for ordinal data or when using models that handle categorical values natively.
Choosing the right encoding method can prevent subtle errors and improve model accuracy.