Understanding max_samples in BaggingClassifier in Python

When working with ensemble methods in machine learning, Bagging (Bootstrap Aggregating) is a popular technique to improve model performance by reducing variance. Scikit-learn provides a convenient implementation called BaggingClassifier for classification problems.

Let’s break down a sample code snippet:

from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

model = BaggingClassifier(
    base_estimator=DecisionTreeClassifier(),  # renamed to estimator in scikit-learn 1.2+
    max_samples=0.5,    # each tree is trained on 50% of the training samples
    n_estimators=10,    # the ensemble contains 10 trees
    random_state=42     # makes the random sampling reproducible
)

model.fit(X_train, y_train)  # X_train, y_train are your prepared training data

What is max_samples?

  • The max_samples parameter controls the number or fraction of samples drawn from the training dataset (with replacement, by default) to train each base estimator.

  • It can take either an integer or a float:

    • Integer: The exact number of samples to draw.

    • Float (between 0 and 1): The fraction of the training dataset to draw. For example, max_samples=0.5 means 50% of the training samples will be randomly selected for each base estimator. (A short sketch of both forms follows this list.)
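
Both forms can be passed to the same constructor. Here is a minimal sketch, using a synthetic dataset purely for illustration (the sizes are arbitrary):

from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, random_state=0)

# Integer: each base estimator draws exactly 200 samples.
bag_int = BaggingClassifier(DecisionTreeClassifier(), max_samples=200, random_state=0)

# Float: each base estimator draws 50% of the training set (500 samples here).
bag_frac = BaggingClassifier(DecisionTreeClassifier(), max_samples=0.5, random_state=0)

bag_int.fit(X, y)
bag_frac.fit(X, y)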

Why use max_samples?

  1. Diversity in the Ensemble: By training each base estimator on a different random subset of the data, the model reduces correlation between estimators, which improves generalization.

  2. Efficiency: You don’t always need the full dataset to train each estimator. Smaller subsets can speed up training.
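
To get a feel for the efficiency point, you can time the fit with different subsample sizes. This is only a rough sketch on synthetic data (the dataset size, n_estimators, and fractions are arbitrary choices, not recommendations):

from time import perf_counter

from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=20000, n_features=20, random_state=0)

for frac in (1.0, 0.25):
    clf = BaggingClassifier(DecisionTreeClassifier(), n_estimators=20,
                            max_samples=frac, random_state=0)
    start = perf_counter()
    clf.fit(X, y)
    print(f"max_samples={frac}: fit took {perf_counter() - start:.2f}s")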

Example:

Assume your training dataset has 1000 samples:

  • With max_samples=0.5, each Decision Tree in the ensemble is trained on 500 points drawn at random from the training set (with replacement, since bootstrap=True by default).

  • With max_samples=1.0 (the default if not specified), each tree draws 1000 samples; because the draws are with replacement, this is a classic bootstrap sample in which some points appear more than once and others are left out.
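
You can check the subset size on a fitted model: in recent scikit-learn versions the estimators_samples_ attribute holds, for each base estimator, the indices of the training samples it was given. A minimal sketch on a synthetic 1000-sample dataset:

from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, random_state=0)

model = BaggingClassifier(DecisionTreeClassifier(), max_samples=0.5,
                          n_estimators=10, random_state=42)
model.fit(X, y)

# Indices of the samples drawn for the first tree: 500 of them (0.5 * 1000).
print(len(model.estimators_samples_[0]))
# Usually fewer than 500 unique indices, because the draws are with replacement by default.
print(len(set(model.estimators_samples_[0])))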

Other Parameters in BaggingClassifier

  • base_estimator: The model used for each estimator (e.g., DecisionTreeClassifier here). Note that this parameter has been renamed to estimator in newer scikit-learn versions (1.2+).

  • n_estimators: Number of base estimators in the ensemble (10 in our example).

  • random_state: Ensures reproducibility by controlling random sampling.

Key Points to Remember

  • max_samples is NOT the number of estimators, nor the number of features. It only affects how many training samples each estimator sees.

  • It can help reduce overfitting if your base estimator tends to overfit on the full dataset.

  • Typical values: 0.5 to 1.0 (50%-100% of samples).
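
One rough way to explore the overfitting point is to compare cross-validated accuracy across a few max_samples settings. A minimal sketch on synthetic data (the printed numbers depend entirely on the dataset, so no particular ordering is guaranteed):

from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, n_informative=10, random_state=0)

for frac in (0.5, 0.7, 1.0):
    clf = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50,
                            max_samples=frac, random_state=0)
    scores = cross_val_score(clf, X, y, cv=5)
    print(f"max_samples={frac}: mean CV accuracy = {scores.mean():.3f}")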

Conclusion

The max_samples parameter is a powerful way to control diversity and generalization in bagging ensembles. By selecting a fraction of training data for each estimator, you create multiple slightly different models whose predictions, when aggregated, lead to a more robust overall model.

Answer to the Question: “The percentage of training samples used to train each base estimator.”


