Understanding OneHotEncoder in Scikit-learn with a Multiple Choice Example

When working with machine learning, handling categorical data properly is essential. One widely used method is One-Hot Encoding, which converts categorical features into binary vectors. Let’s walk through a multiple-choice-style question to see how it works: what does the code below print?


The Code Example

from sklearn.preprocessing import OneHotEncoder

data = [['apple', 3], ['banana', 1], ['apple', 2], ['orange', 1], ['banana', 3]]

ohe = OneHotEncoder(sparse_output=False)
ohe.fit(data)
print(ohe.transform(data).shape[1])

Step-by-Step Explanation

  1. Dataset Preparation
    The dataset has two features:

    • Fruit names: apple, banana, orange

    • Numbers: 1, 2, 3

    Unique categories:

    • Fruits → 3 unique values

    • Numbers → 3 unique values

  2. OneHotEncoder Initialization

    ohe = OneHotEncoder(sparse_output=False)
    

    Here, sparse_output=False makes the encoder return a dense NumPy array instead of the default SciPy sparse matrix. (Before scikit-learn 1.2 this parameter was named sparse.)

  3. Fitting the Encoder

    ohe.fit(data)
    

    The encoder learns the unique categories of each column and stores them, in sorted order, in ohe.categories_.

  4. Transform and Shape

    print(ohe.transform(data).shape[1])
    

    After one-hot encoding:

    • Fruits → 3 binary columns

    • Numbers → 3 binary columns

    • Total = 3 + 3 = 6 columns


Visual Representation

Before encoding:

["apple", 3]
["banana", 1]
["apple", 2]
["orange", 1]
["banana", 3]

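After encoding (simplified: the real output holds floats such as 1. and 0.; the columns are ordered apple, banana, orange, then 1, 2, 3):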

[1,0,0, 0,0,1]   # apple + 3
[0,1,0, 1,0,0]   # banana + 1
[1,0,0, 0,1,0]   # apple + 2
[0,0,1, 1,0,0]   # orange + 1
[0,1,0, 0,0,1]   # banana + 3
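To make the mapping in the rows above concrete, here is a hand-rolled sketch in plain Python that reproduces them. It is not how scikit-learn implements the encoder; it only assumes that each feature's categories are sorted and that every row gets one binary block per feature. The helper one_hot_row is a hypothetical name used for illustration.

```python
data = [['apple', 3], ['banana', 1], ['apple', 2], ['orange', 1], ['banana', 3]]

# Sorted unique values of each column, mirroring the sorted order
# of the categories the encoder learns.
columns = list(zip(*data))
categories = [sorted(set(col)) for col in columns]

def one_hot_row(row):
    """Concatenate one binary block per feature for a single row."""
    encoded = []
    for value, cats in zip(row, categories):
        encoded.extend(1 if value == cat else 0 for cat in cats)
    return encoded

encoded_rows = [one_hot_row(row) for row in data]
for row, encoded in zip(data, encoded_rows):
    print(encoded, '#', row)
```

Each row contains exactly two ones: one in the fruit block and one in the number block.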

Correct Answer

The output of the code will be:

6

Key Takeaways

  • OneHotEncoder expands categorical features into multiple binary features.

  • With default settings, the number of columns after encoding equals the sum of the unique category counts across all features.

  • Numeric values are also treated as categorical labels here, because OneHotEncoder encodes every column it is given.

  • For continuous numeric features, scaling methods (like StandardScaler) should be used instead.

Final Answer: 6
