# Baselines Matter: Understanding DummyClassifier and “Most Frequent” Strategy
When you’re starting a machine learning project, a strong baseline is your best friend. It tells you whether your fancy model is actually learning anything useful—or just adding complexity. In scikit-learn, `DummyClassifier` is the go-to tool for building such baselines.
This post walks through a concrete example, explains why the result looks the way it does, and shows how to use baselines responsibly.
---
## The Setup
- Feature matrix: X with shape (1000, 5)
- Labels: y with two classes {0, 1}
- Class distribution: 650 examples are class 1, 350 are class 0
- Code:
```python
from sklearn.dummy import DummyClassifier

base_clf = DummyClassifier(strategy='most_frequent')
base_clf.fit(X, y)           # X, y as described in the setup above
print(base_clf.score(X, y))  # accuracy on the training data
```
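This snippet assumes `X` and `y` already exist. If you want to run it end to end, here is a minimal stand-in that fabricates data with the shape and class balance described above; the feature values are arbitrary, because the most-frequent baseline ignores the features entirely:
```python
import numpy as np

# Stand-in data: 1000 samples, 5 features, 650 labeled 1 and 350 labeled 0
rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 5))
y = np.array([1] * 650 + [0] * 350)
```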
What happens here?
- `strategy='most_frequent'` always predicts the majority class seen during `fit`.
- Since class 1 appears 650/1000 times, the classifier will always predict 1.
Therefore, the training accuracy printed is:
- Accuracy = correct predictions / total = 650 / 1000 = 0.65

So the output is exactly the majority-class share: 0.65.
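A quick sanity check (using `base_clf`, `X`, and `y` from above) confirms that the baseline only ever predicts the majority class and that its accuracy equals the class ratio:
```python
import numpy as np

preds = base_clf.predict(X)
print(np.unique(preds))     # [1] -- only the majority class is ever predicted
print((preds == y).mean())  # 0.65 -- the same number base_clf.score(X, y) reports
```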
---
## Why This Baseline Is Important
- It quantifies the minimum performance a model must beat. If your logistic regression or random forest scores around 0.66 on the same data, it’s barely better than predicting the majority class every time.
- It highlights class imbalance. A seemingly “good” accuracy (65%) is actually trivial to achieve here and may be useless for minority-class detection.
---
## Beyond Accuracy: Metrics That Matter
For imbalanced data, accuracy can be misleading. Prefer:
- Precision, Recall, F1 (especially for the minority class)
- ROC AUC and PR AUC (Precision-Recall AUC is often more informative with imbalance)
- Confusion matrix to see error types
```python
from sklearn.metrics import classification_report, confusion_matrix

# Per-class breakdown of the baseline's training-set predictions
y_pred = base_clf.predict(X)
print(confusion_matrix(y, y_pred))
print(classification_report(y, y_pred, digits=3))
```
With the most-frequent strategy (always predicting 1):
- True Positives: 650
- False Positives: 350
- True Negatives: 0
- False Negatives: 0
- Recall for class 1 = 1.00, but precision for class 1 = 650 / (650+350) = 0.65
- Class 0 is never detected (recall = 0.00)
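Concretely, scikit-learn lays out the confusion matrix with rows as true classes and columns as predicted classes (ordered 0, then 1), so the matrix printed above would contain:
```python
# confusion_matrix(y, y_pred):
#
#   [[  0 350]    <- true 0s: none predicted correctly, all 350 predicted as 1
#    [  0 650]]   <- true 1s: all 650 predicted as 1 (true positives)
```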
---
## Better Baselines You Should Try
`DummyClassifier` supports multiple strategies:
- `most_frequent`: always predict the majority class
- `stratified`: predict labels at random, following the training class distribution
- `uniform`: predict each class uniformly at random
- `constant`: predict a user-specified label
```python
DummyClassifier(strategy='stratified', random_state=42)  # random draws weighted by class frequency
DummyClassifier(strategy='uniform', random_state=42)     # random draws, each class equally likely
DummyClassifier(strategy='constant', constant=0)         # always predict class 0
```
These help you understand whether a “real” model beats random or distribution-aware guessing.
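To see how these baselines differ in practice, here is a small sketch (reusing the `X` and `y` from the setup) that scores each strategy with 5-fold cross-validation:
```python
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_val_score

baseline_configs = [
    dict(strategy='most_frequent'),
    dict(strategy='stratified', random_state=42),
    dict(strategy='uniform', random_state=42),
    dict(strategy='constant', constant=0),  # the constant strategy needs a target label
]

for params in baseline_configs:
    clf = DummyClassifier(**params)
    scores = cross_val_score(clf, X, y, cv=5, scoring='accuracy')
    print(f"{params['strategy']:>13}: {scores.mean():.3f}")
```
On a 65/35 class balance you would expect roughly 0.65 for `most_frequent`, about 0.55 (0.65² + 0.35²) for `stratified`, 0.50 for `uniform`, and 0.35 for `constant=0`.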
---
## Train/Test Protocol: Avoid Self-Congratulation
The code above scores on the training set. For meaningful evaluation:
```python
from sklearn.model_selection import train_test_split

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

base = DummyClassifier(strategy='most_frequent')
base.fit(X_tr, y_tr)
print('Test accuracy:', base.score(X_te, y_te))
```
- Use `stratify=y` to preserve class ratios across splits.
- Compare every real model against this test-set baseline.
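As a sketch of that comparison (logistic regression here is purely illustrative; substitute whatever model you are actually evaluating), fit the model on the same split and report both numbers side by side:
```python
from sklearn.linear_model import LogisticRegression

model = LogisticRegression(max_iter=1000)
model.fit(X_tr, y_tr)

print('Baseline test accuracy:', base.score(X_te, y_te))
print('Model test accuracy:   ', model.score(X_te, y_te))
```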
---
## Key Takeaways
- The “most frequent” baseline here yields 0.65 accuracy because 65% of labels are class 1.
- Always create and report a baseline—then ensure your model meaningfully exceeds it.
- Don’t rely on accuracy alone; use recall, precision, F1, and PR AUC for imbalanced data.
- Evaluate on a stratified train/test split to avoid misleading conclusions.
If you have a specific goal (e.g., maximizing recall for class 0), choose your metrics, decision thresholds, and candidate models with that goal in mind rather than defaulting to accuracy.