🌳 Decision Tree Splitting Rules Explained (with min_samples_split & min_samples_leaf)
Decision Trees are among the most intuitive machine learning algorithms. However, when we tune their hyperparameters, especially min_samples_split and min_samples_leaf, it’s important to understand how they influence the tree’s growth.
Let’s explore this with a concrete example from Scikit-Learn’s DecisionTreeClassifier.
📌 The Code
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
# Load dataset
X, y = load_breast_cancer(as_frame=True, return_X_y=True)
# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
# Decision Tree with constraints
clf = DecisionTreeClassifier(min_samples_split=7, min_samples_leaf=4, random_state=5)
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))
Here we set:
- min_samples_split = 7 → a node must have at least 7 samples to even attempt splitting.
- min_samples_leaf = 4 → after splitting, each child node must have at least 4 samples.
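To see the effect on a fitted tree, you can inspect the trained clf from the snippet above: every leaf should contain at least 4 training samples. This is a quick sanity check using scikit-learn's Tree internals (children_left == -1 marks a leaf), not something required by the workflow:

```python
import numpy as np

# Assumes `clf` has already been fitted as in the snippet above.
tree = clf.tree_
is_leaf = tree.children_left == -1          # -1 marks a leaf node
leaf_sizes = tree.n_node_samples[is_leaf]   # training samples that reach each leaf

print("Smallest leaf:", leaf_sizes.min())                  # should be >= 4
print("All leaves >= 4 samples:", np.all(leaf_sizes >= 4)) # True if the constraint held
```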
⚙️ Rules for Splitting a Node
A split at node N will only be performed if:
- Node size check: samples_at_node ≥ min_samples_split. If the node has fewer than 7 samples, it is not split.
- Child size check: after the split, both children must have ≥ 4 samples (min_samples_leaf). Otherwise the split is not performed.
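As a quick sanity check, here is a minimal sketch of these two rules as a standalone helper. Note that is_valid_split is not part of scikit-learn; it simply mirrors the two checks above:

```python
def is_valid_split(node_size, left_size, right_size,
                   min_samples_split=7, min_samples_leaf=4):
    """Return True if a node with `node_size` samples may be split into
    children of `left_size` and `right_size` samples."""
    if node_size < min_samples_split:        # node size check
        return False
    if left_size < min_samples_leaf or right_size < min_samples_leaf:  # child size check
        return False
    return True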
📝 Scenarios to Test
We are given multiple hypothetical scenarios. Let’s check them one by one:
✅ Scenario 1: Node N = 15 (split → 10 left, 5 right)
- Node size = 15 ≥ 7 → ✅
- Left = 10 ≥ 4, Right = 5 ≥ 4 → ✅
👉 Valid Split
✅ Scenario 2: Node N = 8 (split → 4 left, 4 right)
- Node size = 8 ≥ 7 → ✅
- Left = 4 ≥ 4, Right = 4 ≥ 4 → ✅
👉 Valid Split
❌ Scenario 3: Node N = 9 (split → 2 left, 7 right)
- Node size = 9 ≥ 7 → ✅
- Left = 2 < 4 → ❌ (violates min_samples_leaf)
👉 Invalid Split
✅ Scenario 4: Node N = 14 (split → 4 left, 10 right)
- Node size = 14 ≥ 7 → ✅
- Left = 4 ≥ 4, Right = 10 ≥ 4 → ✅
👉 Valid Split
❌ Scenario 5: Node N = 6 (split → 3 left, 3 right)
- Node size = 6 < 7 → ❌ (violates min_samples_split)
👉 Invalid Split
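If you prefer to verify these by code rather than by hand, the hypothetical is_valid_split helper defined earlier reproduces each verdict:

```python
scenarios = {
    "N=15 -> (10, 5)": (15, 10, 5),
    "N=8  -> (4, 4)":  (8, 4, 4),
    "N=9  -> (2, 7)":  (9, 2, 7),
    "N=14 -> (4, 10)": (14, 4, 10),
    "N=6  -> (3, 3)":  (6, 3, 3),
}

for label, (n, left, right) in scenarios.items():
    verdict = "valid" if is_valid_split(n, left, right) else "invalid"
    print(label, "->", verdict)
# Expected: valid, valid, invalid, valid, invalid
```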
🎯 Final Correct Options
Only these scenarios lead to valid splits:
- Node N = 15 → (10, 5)
- Node N = 8 → (4, 4)
- Node N = 14 → (4, 10)
🚀 Key Takeaways
- min_samples_split ensures that a node is not split if it is too small.
- min_samples_leaf ensures that leaf nodes are not too small, which helps prevent overfitting.
- Always check both conditions before concluding whether a split is valid.
By tuning these parameters, you can control tree depth, generalization ability, and robustness of your model.
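If you would rather let cross-validation pick these values, a sketch with GridSearchCV (reusing X_train and y_train from the snippet above; the grid values are only illustrative) could look like this:

```python
from sklearn.model_selection import GridSearchCV

# Illustrative grid of candidate values, not a recommendation.
param_grid = {
    "min_samples_split": [2, 5, 7, 10, 20],
    "min_samples_leaf": [1, 2, 4, 8],
}

search = GridSearchCV(
    DecisionTreeClassifier(random_state=5),
    param_grid,
    cv=5,
    scoring="accuracy",
)
search.fit(X_train, y_train)

print(search.best_params_)   # best combination found by cross-validation
print(search.best_score_)    # mean cross-validated accuracy for that combination
```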