# Baselines Matter: Understanding DummyClassifier and “Most Frequent” Strategy
When you’re starting a machine learning project, a strong baseline is your best friend. It tells you whether your fancy model is actually learning anything useful—or just adding complexity. In scikit-learn, `DummyClassifier` is the go-to tool for building such baselines.
This post walks through a concrete example, explains why the result looks the way it does, and shows how to use baselines responsibly.
---
## The Setup
- Feature matrix: X with shape (1000, 5)
- Labels: y with two classes {0, 1}
- Class distribution: 650 examples are class 1, 350 are class 0
- Code:
```python
from sklearn.dummy import DummyClassifier

base_clf = DummyClassifier(strategy='most_frequent')
base_clf.fit(X, y)           # X, y as described in the setup above
print(base_clf.score(X, y))  # accuracy on the training data
```
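This snippet assumes `X` and `y` already exist. If you want to run it end to end, here is a minimal stand-in that fabricates data with the shape and class balance described above; the feature values are arbitrary, because the most-frequent baseline ignores the features entirely:
```python
import numpy as np

# Stand-in data: 1000 samples, 5 features, 650 labeled 1 and 350 labeled 0
rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 5))
y = np.array([1] * 650 + [0] * 350)
```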
What happens here?
- `strategy='most_frequent'` always predicts the majority class seen during `fit`.
- Since class 1 appears 650/1000 times, the classifier will always predict 1.
Therefore, the training accuracy printed is:
- Accuracy = correct predictions / total = 650 / 1000 = 0.65

So the output is exactly the majority-class share: 0.65.
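A quick sanity check (using `base_clf`, `X`, and `y` from above) confirms that the baseline only ever predicts the majority class and that its accuracy equals the class ratio:
```python
import numpy as np

preds = base_clf.predict(X)
print(np.unique(preds))     # [1] -- only the majority class is ever predicted
print((preds == y).mean())  # 0.65 -- the same number base_clf.score(X, y) reports
```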
---
## Why This Baseline Is Important
- It quantifies the minimum performance a model must beat. If your logistic regression or random forest scores around 0.66 on the same data, it’s barely better than predicting the majority class every time.
- It highlights class imbalance. A seemingly “good” accuracy (65%) is actually trivial to achieve here and may be useless for minority-class detection.
---
## Beyond Accuracy: Metrics That Matter
For imbalanced data, accuracy can be misleading. Prefer:
- Precision, Recall, F1 (especially for the minority class)
- ROC AUC and PR AUC (Precision-Recall AUC is often more informative with imbalance)
- Confusion matrix to see error types
```python
from sklearn.metrics import classification_report, confusion_matrix

# Per-class breakdown of the baseline's training-set predictions
y_pred = base_clf.predict(X)
print(confusion_matrix(y, y_pred))
print(classification_report(y, y_pred, digits=3))
```
With the most-frequent strategy (always predicting 1):
- True Positives: 650
- False Positives: 350
- True Negatives: 0
- False Negatives: 0
- Recall for class 1 = 1.00, but precision for class 1 = 650 / (650+350) = 0.65
- Class 0 is never detected (recall = 0.00)
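Concretely, scikit-learn lays out the confusion matrix with rows as true classes and columns as predicted classes (ordered 0, then 1), so the matrix printed above would contain:
```python
# confusion_matrix(y, y_pred):
#
#   [[  0 350]    <- true 0s: none predicted correctly, all 350 predicted as 1
#    [  0 650]]   <- true 1s: all 650 predicted as 1 (true positives)
```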
---
## Better Baselines You Should Try
`DummyClassifier` supports multiple strategies:
- `most_frequent`: always predict the majority class
- `stratified`: predict labels at random, following the training class distribution
- `uniform`: predict each class uniformly at random
- `constant`: predict a user-specified label
```python
DummyClassifier(strategy='stratified', random_state=42)  # random draws weighted by class frequency
DummyClassifier(strategy='uniform', random_state=42)     # random draws, each class equally likely
DummyClassifier(strategy='constant', constant=0)         # always predict class 0
```
These help you understand whether a “real” model beats random or distribution-aware guessing.
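To see how these baselines differ in practice, here is a small sketch (reusing the `X` and `y` from the setup) that scores each strategy with 5-fold cross-validation:
```python
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_val_score

baseline_configs = [
    dict(strategy='most_frequent'),
    dict(strategy='stratified', random_state=42),
    dict(strategy='uniform', random_state=42),
    dict(strategy='constant', constant=0),  # the constant strategy needs a target label
]

for params in baseline_configs:
    clf = DummyClassifier(**params)
    scores = cross_val_score(clf, X, y, cv=5, scoring='accuracy')
    print(f"{params['strategy']:>13}: {scores.mean():.3f}")
```
On a 65/35 class balance you would expect roughly 0.65 for `most_frequent`, about 0.55 (0.65² + 0.35²) for `stratified`, 0.50 for `uniform`, and 0.35 for `constant=0`.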
---
## Train/Test Protocol: Avoid Self-Congratulation
The code above scores on the training set. For meaningful evaluation:
```python
from sklearn.model_selection import train_test_split

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

base = DummyClassifier(strategy='most_frequent')
base.fit(X_tr, y_tr)
print('Test accuracy:', base.score(X_te, y_te))
```
- Use `stratify=y` to preserve class ratios across splits.
- Compare every real model against this test-set baseline.
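As a sketch of that comparison (logistic regression here is purely illustrative; substitute whatever model you are actually evaluating), fit the model on the same split and report both numbers side by side:
```python
from sklearn.linear_model import LogisticRegression

model = LogisticRegression(max_iter=1000)
model.fit(X_tr, y_tr)

print('Baseline test accuracy:', base.score(X_te, y_te))
print('Model test accuracy:   ', model.score(X_te, y_te))
```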
---
## Key Takeaways
- The “most frequent” baseline here yields 0.65 accuracy because 65% of labels are class 1.
- Always create and report a baseline—then ensure your model meaningfully exceeds it.
- Don’t rely on accuracy alone; use recall, precision, F1, and PR AUC for imbalanced data.
- Evaluate on a stratified train/test split to avoid misleading conclusions.
If you have a specific goal (e.g., maximizing recall for class 0), choose your metrics, decision thresholds, and candidate models with that goal in mind rather than defaulting to accuracy.