
# Baselines Matter: Understanding DummyClassifier and “Most Frequent” Strategy


When you’re starting a machine learning project, a strong baseline is your best friend. It tells you whether your fancy model is actually learning anything useful—or just adding complexity. In scikit-learn, DummyClassifier is the go-to tool for building such baselines.


This post walks through a concrete example, explains why the result looks the way it does, and shows how to use baselines responsibly.


---


## The Setup


- Feature matrix: X with shape (1000, 5)

- Labels: y with two classes {0, 1}

- Class distribution: 650 examples are class 1, 350 are class 0

- Code:


```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Synthetic data matching the setup above: 650 class-1 and 350 class-0 labels
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = np.array([1] * 650 + [0] * 350)

base_clf = DummyClassifier(strategy='most_frequent')
base_clf.fit(X, y)

print(base_clf.score(X, y))  # 0.65
```


What happens here?


- `strategy='most_frequent'` always predicts the majority class seen during `fit`.

- Since class 1 appears 650/1000 times, the classifier will always predict 1.


Therefore, the training accuracy printed is:


- Accuracy = correct predictions / total = 650 / 1000 = 0.65


So, the output is 0.65.


---


## Why This Baseline Is Important


- It quantifies the minimum performance a model must beat. If your logistic regression or random forest scores around 0.66 on the same data, it’s barely better than predicting the majority class every time.

- It highlights class imbalance. A seemingly “good” accuracy (65%) is actually trivial to achieve here and may be useless for minority-class detection.
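To make the comparison concrete, here is a sketch that fits a logistic regression next to the baseline. The data is randomly generated for illustration (the features carry no real signal), so the model should land near the 0.65 baseline rather than beat it:

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 5))          # uninformative features
y = np.array([1] * 650 + [0] * 350)     # 65% class 1, 35% class 0

baseline = DummyClassifier(strategy='most_frequent').fit(X, y)
model = LogisticRegression().fit(X, y)

print('Baseline accuracy:', baseline.score(X, y))  # 0.65
print('Model accuracy:   ', model.score(X, y))     # near 0.65: no real signal
```

On a real dataset with informative features, the gap between these two numbers is the first honest measure of what your model has learned.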


---


## Beyond Accuracy: Metrics That Matter


For imbalanced data, accuracy can be misleading. Prefer:


- Precision, Recall, F1 (especially for the minority class)

- ROC AUC and PR AUC (Precision-Recall AUC is often more informative with imbalance)

- Confusion matrix to see error types


```python

from sklearn.metrics import classification_report, confusion_matrix


y_pred = base_clf.predict(X)

print(confusion_matrix(y, y_pred))

print(classification_report(y, y_pred, digits=3))

```


With the most-frequent strategy (always predicting 1):

- True Positives: 650

- False Positives: 350

- True Negatives: 0

- False Negatives: 0

- Recall for class 1 = 1.00, but precision for class 1 = 650 / (650+350) = 0.65

- Class 0 is never detected (recall = 0.00)
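These counts can be verified directly from the confusion matrix. A minimal sketch, using the 650/350 label split described above and an all-ones prediction vector:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

y = np.array([1] * 650 + [0] * 350)
y_pred = np.ones_like(y)                  # most_frequent always predicts 1

# ravel() flattens the 2x2 matrix in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y, y_pred).ravel()
print(tn, fp, fn, tp)                     # 0 350 0 650

precision_1 = tp / (tp + fp)              # 650 / 1000 = 0.65
recall_0 = tn / (tn + fp)                 # 0 / 350 = 0.0
print(precision_1, recall_0)
```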


---


## Better Baselines You Should Try


DummyClassifier supports multiple strategies:


- `most_frequent`: always predict the majority class

- `stratified`: predict according to class distribution

- `uniform`: predict uniformly at random

- `constant`: predict a user-specified label


```python

DummyClassifier(strategy='stratified', random_state=42)

DummyClassifier(strategy='uniform', random_state=42)

DummyClassifier(strategy='constant', constant=0)

```


These help you understand whether a “real” model beats random or distribution-aware guessing.
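A quick way to compare the strategies is to score each on the same data. This sketch uses synthetic labels matching the 650/350 split; the exact scores of the random strategies depend on the seed:

```python
import numpy as np
from sklearn.dummy import DummyClassifier

rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 5))
y = np.array([1] * 650 + [0] * 350)

scores = {}
for strategy in ['most_frequent', 'stratified', 'uniform']:
    clf = DummyClassifier(strategy=strategy, random_state=42)
    scores[strategy] = clf.fit(X, y).score(X, y)
    print(strategy, scores[strategy])
```

Expect `most_frequent` at exactly 0.65, `stratified` around 0.65² + 0.35² ≈ 0.545, and `uniform` around 0.5.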


---


## Train/Test Protocol: Avoid Self-Congratulation


The code above scores on the training set. For meaningful evaluation:


```python

from sklearn.model_selection import train_test_split


X_tr, X_te, y_tr, y_te = train_test_split(

    X, y, test_size=0.2, stratify=y, random_state=42

)


base = DummyClassifier(strategy='most_frequent')

base.fit(X_tr, y_tr)

print('Test accuracy:', base.score(X_te, y_te))

```


- Use `stratify=y` to preserve class ratios across splits.

- Compare every real model against this test-set baseline.
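For a more stable estimate than a single split, you can also cross-validate the baseline. A sketch, again on synthetic 650/350 data (scikit-learn uses stratified folds by default for classifiers, so each fold keeps the 65/35 ratio):

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 5))
y = np.array([1] * 650 + [0] * 350)

scores = cross_val_score(DummyClassifier(strategy='most_frequent'), X, y, cv=5)
print(scores.mean())   # 0.65 across stratified folds
```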


---


## Key Takeaways


- The “most frequent” baseline here yields 0.65 accuracy because 65% of labels are class 1.

- Always create and report a baseline—then ensure your model meaningfully exceeds it.

- Don’t rely on accuracy alone; use recall, precision, F1, and PR AUC for imbalanced data.

- Evaluate on a stratified train/test split to avoid misleading conclusions.


