Understanding OneHotEncoder in Scikit-learn with a Multiple Choice Example
When working with machine learning, handling categorical data properly is essential. One widely used method is One-Hot Encoding, which converts categorical features into binary vectors. Let’s dive into a multiple-choice question example to see how this works.
The Code Example
from sklearn.preprocessing import OneHotEncoder

# Two columns per row: a fruit name and a number
data = [['apple', 3], ['banana', 1], ['apple', 2], ['orange', 1], ['banana', 3]]

ohe = OneHotEncoder(sparse_output=False)  # sparse_output requires scikit-learn >= 1.2
ohe.fit(data)
print(ohe.transform(data).shape[1])
Step-by-Step Explanation
Dataset Preparation
The dataset has two features:
- Fruit names: apple, banana, orange
- Numbers: 1, 2, 3
Unique categories:
- Fruits → 3 unique values
- Numbers → 3 unique values
OneHotEncoder Initialization
ohe = OneHotEncoder(sparse_output=False)
Here, sparse_output=False ensures the output will be a dense NumPy array instead of a sparse matrix.
Fitting the Encoder
ohe.fit(data)
The encoder learns the unique categories from both columns.
Transform and Shape
print(ohe.transform(data).shape[1])
After one-hot encoding:
- Fruits → 3 binary columns
- Numbers → 3 binary columns
- Total = 3 + 3 = 6 columns
Visual Representation
Before encoding:
["apple", 3]
["banana", 1]
["apple", 2]
["orange", 1]
["banana", 3]
After encoding (simplified example):
[1,0,0, 0,0,1] # apple + 3
[0,1,0, 1,0,0] # banana + 1
[1,0,0, 0,1,0] # apple + 2
[0,0,1, 1,0,0] # orange + 1
[0,1,0, 0,0,1] # banana + 3
Correct Answer
The output of the code will be:
6
Key Takeaways
- OneHotEncoder expands categorical features into multiple binary features.
- The number of columns after encoding equals the sum of unique categories across all features.
- Even numbers are treated as categorical labels here.
- For continuous numeric features, scaling methods (like StandardScaler) should be used instead.
✅ Final Answer: 6