⚖️ Logistic Regression in Sklearn: Handling Class Imbalance and Regularization
Introduction
Logistic Regression is one of the most widely used algorithms for binary classification. While it is simple, powerful, and interpretable, two important aspects play a huge role in its performance:
- Class imbalance – when one class has far more samples than the other.
- Regularization – controlling model complexity to avoid overfitting.
In this blog, we’ll understand how class_weight='balanced' and the parameter C work in LogisticRegression from scikit-learn.
🔹 The Example Code
```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# A small synthetic, imbalanced dataset so the snippet runs end to end
X, y = make_classification(n_samples=100, weights=[0.9, 0.1], random_state=0)

model = LogisticRegression(class_weight='balanced', C=0.5)
model.fit(X, y)
```
This code trains a Logistic Regression model with:
- class_weight='balanced'
- C=0.5

Now let’s break down what this means.
🔹 Class Weight and Imbalanced Datasets
When classes are imbalanced (e.g., 90% negatives, 10% positives), the model might be biased towards the majority class.
👉 Setting class_weight='balanced':
- Automatically adjusts weights inversely proportional to class frequencies.
- Rare classes get higher weight, so the model pays more attention to them.
✅ Correct understanding:
“The balanced mode uses the values of y to automatically adjust weights inversely proportional to class frequencies in the input data.”
❌ Wrong assumption:
"Equal weights are given to all classes." → Incorrect: weights are scaled inversely to class frequency, so the minority class receives the larger weight.
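To see exactly what 'balanced' computes, here is a small sketch using scikit-learn's compute_class_weight helper on a hypothetical 90/10 label split (the class counts are made up for illustration):

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical imbalanced labels: 90 negatives, 10 positives
y = np.array([0] * 90 + [1] * 10)

# The 'balanced' heuristic: n_samples / (n_classes * count_per_class)
weights = compute_class_weight(class_weight='balanced', classes=np.array([0, 1]), y=y)
print(weights)  # ≈ [0.556, 5.0]: the rare class gets roughly 9x the weight
```

The majority class ends up down-weighted (100 / (2 × 90) ≈ 0.556) and the minority class up-weighted (100 / (2 × 10) = 5.0), which is exactly the inverse-frequency behavior described above.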
🔹 Regularization with Parameter C
Logistic Regression in sklearn always applies regularization unless the penalty is explicitly disabled (penalty=None in recent versions of scikit-learn, penalty='none' in older ones).
- The parameter C is the inverse of regularization strength.
- Lower C → stronger regularization.
- Higher C → weaker regularization.
In the given code:
- C=0.5 means the model applies moderately strong regularization, stronger than the default C=1.0.
✅ Correct understanding:
“Setting C=0.5 means the model applies regularization, with a stronger penalty than the default C=1.0.”
❌ Wrong assumption:
“No regularization is applied because C is set.” → Incorrect: C only scales the penalty; it never turns it off.
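To see the shrinkage effect of C in practice, here is a quick sketch on a synthetic dataset (generated with make_classification; the variable names are chosen for illustration) comparing coefficient magnitudes under strong and weak regularization:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# Smaller C = stronger L2 penalty = coefficients shrink toward zero
strong = LogisticRegression(C=0.01).fit(X, y)
weak = LogisticRegression(C=100.0).fit(X, y)

print(np.linalg.norm(strong.coef_), np.linalg.norm(weak.coef_))
# The strongly regularized model has the smaller coefficient norm
```

Both models fit the same data; only the penalty strength differs, and the coefficient norms show it directly.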
🔹 Final Takeaways
- class_weight='balanced' handles imbalanced datasets by adjusting weights inversely to class frequencies.
- The C parameter controls regularization strength (smaller C = stronger regularization).
- Logistic Regression in sklearn applies regularization by default, unless it is explicitly disabled.
👉 Pro Tip: Always check class balance in your dataset. If classes are highly skewed, use class_weight='balanced' or manually specify class weights to prevent bias.
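A quick way to run that check before training is to count the labels; here is a minimal sketch (the 450/50 split is made up for illustration, and the ratio threshold is just a rule of thumb):

```python
import numpy as np

y = np.array([0] * 450 + [1] * 50)  # hypothetical labels

# Count samples per class
classes, counts = np.unique(y, return_counts=True)
print(dict(zip(classes.tolist(), counts.tolist())))  # {0: 450, 1: 50}

# A rough heuristic: if one class outnumbers another severely, rebalance
imbalance_ratio = counts.max() / counts.min()
if imbalance_ratio > 3:
    print("Skewed classes: consider class_weight='balanced'")
```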