Feature Extraction from Dictionaries in Scikit-Learn: Why DictVectorizer is the Right Choice

When preparing data for machine learning models, one of the biggest challenges is feature extraction. Raw data often comes in various formats — text, images, dictionaries, JSON, etc. To feed this into ML models, we need a way to convert it into a numerical matrix.

Consider the following dataset:

data = [
    {'age': 4, 'height': 96.0},
    {'age': 1, 'height': 73.9},
    {'age': 3, 'height': 88.9},
    {'age': 2, 'height': 81.6}
]

Now the question is:
👉 Which API from Scikit-Learn can be used to extract features from this dictionary-style data?


The Options

  1. DictVectorizer

  2. HashingVectorizer

  3. FeatureHasher


Why DictVectorizer is Correct

The DictVectorizer in Scikit-Learn is specifically designed to handle lists of dictionaries and convert them into a numeric feature matrix. Each key in the dictionary becomes a feature (column), and the values become the corresponding entries.

Example:

from sklearn.feature_extraction import DictVectorizer

vec = DictVectorizer(sparse=False)
features = vec.fit_transform(data)

print(features)
print(vec.get_feature_names_out())

Output:

[[ 4.  96. ]
 [ 1.  73.9]
 [ 3.  88.9]
 [ 2.  81.6]]
['age' 'height']

Here:

  • The DictVectorizer transformed the list of dictionaries into a dense NumPy array (because sparse=False was passed; by default it returns a SciPy sparse matrix).

  • Each dictionary key (age, height) became a feature column, in sorted order.

  • Numeric values passed through unchanged, making it a perfect fit for structured dictionary-style data.


Why Not the Others?

  • HashingVectorizer

    • Designed for raw text documents (bag-of-words).

    • It tokenizes each document into words, then applies the hashing trick.

    • Its input is a sequence of strings, so it cannot consume numeric dictionaries like the ones above.

  • FeatureHasher

    • Similar in spirit to HashingVectorizer, but it operates on feature mappings, and it can accept dictionaries directly.

    • It hashes feature names into a fixed number of anonymous columns. That is valuable for very high-dimensional categorical data, but it loses feature names and is unnecessary when you already have small, clean dictionaries with numeric values.
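To make the contrast concrete, here is a small sketch of FeatureHasher applied to the same kind of input (the n_features value is an arbitrary choice for illustration):

```python
from sklearn.feature_extraction import FeatureHasher

data = [
    {'age': 4, 'height': 96.0},
    {'age': 1, 'height': 73.9},
]

# FeatureHasher maps features into a fixed-size hashed space;
# columns are anonymous hash buckets, not named features.
hasher = FeatureHasher(n_features=8, input_type='dict')
hashed = hasher.transform(data).toarray()

print(hashed.shape)  # (2, 8)
```

Note that there is no equivalent of get_feature_names_out() here: once hashed, the mapping from columns back to 'age' and 'height' is gone, which is exactly why DictVectorizer is preferable for this dataset.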


Key Takeaway

When working with structured dictionary-like data in Scikit-Learn:

✅ Use DictVectorizer
❌ Avoid HashingVectorizer and FeatureHasher (better suited for text or large sparse categorical data).


Final Answer:

To extract features from the given data:

from sklearn.feature_extraction import DictVectorizer

vec = DictVectorizer(sparse=False)
features = vec.fit_transform(data)

👉 Correct API: DictVectorizer
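One practical detail worth knowing: once fitted, the vectorizer can transform new dictionaries, and any key missing from a sample is filled with 0 rather than raising an error. A quick sketch (the new sample is hypothetical):

```python
from sklearn.feature_extraction import DictVectorizer

data = [
    {'age': 4, 'height': 96.0},
    {'age': 1, 'height': 73.9},
    {'age': 3, 'height': 88.9},
    {'age': 2, 'height': 81.6},
]

vec = DictVectorizer(sparse=False)
vec.fit(data)

# New sample missing the 'height' key: absent features default to 0.
new = vec.transform([{'age': 5}])
print(new)  # [[5. 0.]]
```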


