Feature Extraction from Dictionaries in Scikit-Learn: Why DictVectorizer is the Right Choice
When preparing data for machine learning models, one of the biggest challenges is feature extraction. Raw data often comes in various formats — text, images, dictionaries, JSON, etc. To feed this into ML models, we need a way to convert it into a numerical matrix.
Consider the following dataset:
data = [
    {'age': 4, 'height': 96.0},
    {'age': 1, 'height': 73.9},
    {'age': 3, 'height': 88.9},
    {'age': 2, 'height': 81.6}
]
Now the question is:
👉 Which API from Scikit-Learn can be used to extract features from this dictionary-style data?
The Options
- DictVectorizer
- HashingVectorizer
- FeatureHasher
Why DictVectorizer is Correct
The DictVectorizer in Scikit-Learn is specifically designed to handle lists of dictionaries and convert them into a numeric feature matrix. Each key in the dictionary becomes a feature (column), and the values become the corresponding entries.
Example:
from sklearn.feature_extraction import DictVectorizer
# sparse=False returns a dense NumPy array instead of a SciPy sparse matrix
vec = DictVectorizer(sparse=False)
features = vec.fit_transform(data)
print(features)
print(vec.get_feature_names_out())
Output:
[[ 4.  96. ]
 [ 1.  73.9]
 [ 3.  88.9]
 [ 2.  81.6]]
['age' 'height']
Here:
- The DictVectorizer transformed the dictionaries into a NumPy array.
- Each dictionary key (age, height) became a feature column.
- Values remained as they were, making it a perfect fit for structured dictionary-style data.
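As a quick follow-up sketch (new_record below is a made-up sample, not part of the original dataset), the fitted vectorizer can be reused on new dictionaries, and inverse_transform maps rows back to dictionaries:
# Reuse the vectorizer fitted above on a previously unseen record.
new_record = [{'age': 5, 'height': 102.3}]   # hypothetical new sample
print(vec.transform(new_record))             # [[  5.  102.3]]
# inverse_transform maps numeric rows back to {feature: value} dicts.
print(vec.inverse_transform(features))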
Why Not the Others?
- HashingVectorizer
  - Designed for text data (bag-of-words).
  - It tokenizes text into words, then applies a hashing trick.
  - Not suitable for numeric dictionaries like the one above.
- FeatureHasher
  - Similar to HashingVectorizer but works on feature mappings.
  - Good for high-dimensional categorical data, but unnecessary when you already have small, clean dictionaries with numeric values.
A short sketch below contrasts the inputs these two transformers expect.
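This is a minimal sketch; the docs list and the n_features value are illustrative choices, not part of the original question:
from sklearn.feature_extraction import FeatureHasher
from sklearn.feature_extraction.text import HashingVectorizer
# HashingVectorizer expects raw text documents, not dictionaries.
docs = ["the cat sat", "the dog ran"]               # hypothetical corpus
text_features = HashingVectorizer(n_features=8).fit_transform(docs)
print(text_features.shape)                          # (2, 8) sparse matrix
# FeatureHasher does accept dictionaries, but it hashes keys into a fixed
# number of columns, so readable feature names are lost.
hashed = FeatureHasher(n_features=8).transform(data)   # reuses data from above
print(hashed.shape)                                    # (4, 8) sparse matrix
Both return sparse matrices whose columns are anonymous hash buckets, which is exactly why DictVectorizer, with its named columns, is the better fit for this data.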
Key Takeaway
When working with structured dictionary-like data in Scikit-Learn:
✅ Use DictVectorizer
❌ Avoid HashingVectorizer and FeatureHasher (better suited for text or large sparse categorical data).
Final Answer:
To extract features from the given data:
from sklearn.feature_extraction import DictVectorizer
vec = DictVectorizer(sparse=False)
features = vec.fit_transform(data)
👉 Correct API: DictVectorizer
As a real-world extension, DictVectorizer also handles categorical (string-valued) features such as {'gender': 'male', 'age': 25}: string values are one-hot encoded automatically, while numeric values pass through unchanged.
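Here is a minimal sketch of that behavior (the people list is an invented two-record example built around that dictionary):
from sklearn.feature_extraction import DictVectorizer
# String-valued keys are one-hot encoded; numeric keys pass through as-is.
people = [
    {'gender': 'male', 'age': 25},
    {'gender': 'female', 'age': 32},   # hypothetical second record
]
vec = DictVectorizer(sparse=False)
print(vec.fit_transform(people))
# [[25.  0.  1.]
#  [32.  1.  0.]]
print(vec.get_feature_names_out())
# ['age' 'gender=female' 'gender=male']
Each distinct string value gets its own gender=<value> column, while the numeric age column is left unchanged.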