Feature Selection with VarianceThreshold in Machine Learning

Feature selection is an important preprocessing step in machine learning. It helps remove uninformative features so that the model can train faster and generalize better. One simple but powerful technique for feature selection is VarianceThreshold, which removes features that have very little variance (i.e., almost constant across samples).

In this blog, we’ll walk through an example where we apply VarianceThreshold to a dataset, compute variances manually, and confirm the result.
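
Before diving into that example, here is the idea at its simplest. The tiny array below is made up purely for illustration: one of its columns is constant, so the default VarianceThreshold (threshold of 0) drops it.

from sklearn.feature_selection import VarianceThreshold

# Toy, made-up data: the third column is constant, so it carries no information
X = [
    [1, 10, 7],
    [2, 20, 7],
    [3, 30, 7],
]

vt = VarianceThreshold()            # default threshold=0.0 removes only zero-variance features
print(vt.fit_transform(X).shape)    # (3, 2) -- the constant column has been dropped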


Problem Statement

We are given a dataset of shape 4 × 4 (4 samples, 4 features):

from sklearn.feature_selection import VarianceThreshold  

data = [
    [1, 2, 3, 4],
    [2, 1, 3, 2],
    [1, 3, 2, 4],
    [4, 2, 4, 2]
]

We apply a variance threshold of 0.1:

vf = VarianceThreshold(threshold=0.1)   # features with variance <= 0.1 will be dropped
selected_data = vf.fit_transform(data)
print(selected_data.shape[1])           # number of features that survive

The question is: How many features will remain after applying this threshold?


Step-by-Step Variance Calculation

Variance is calculated column-wise (feature-wise). VarianceThreshold uses population variance (divides by n, not n-1).
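
As a quick sanity check (assuming NumPy is installed), np.var uses the same population formula by default (ddof=0), while ddof=1 would give the sample variance instead. Here it is on the first feature's values:

import numpy as np

feature_0 = np.array([1, 2, 1, 4])    # the first column of our dataset

print(np.var(feature_0))              # 1.5 -- population variance (ddof=0), what VarianceThreshold uses
print(np.var(feature_0, ddof=1))      # 2.0 -- sample variance (ddof=1), NOT what VarianceThreshold uses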


Feature 0 → [1, 2, 1, 4]

  • Mean μ = (1 + 2 + 1 + 4) / 4 = 2

  • Squared deviations = (1 − 2)² + (2 − 2)² + (1 − 2)² + (4 − 2)² = 1 + 0 + 1 + 4 = 6

  • Variance = 6 / 4 = 1.5


Feature 1 → [2, 1, 3, 2]

  • Mean μ = (2 + 1 + 3 + 2) / 4 = 2

  • Squared deviations = 0 + 1 + 1 + 0 = 2

  • Variance = 2 / 4 = 0.5


Feature 2 → [3, 3, 2, 4]

  • Mean μ = (3 + 3 + 2 + 4) / 4 = 3

  • Squared deviations = 0 + 0 + 1 + 1 = 2

  • Variance = 2 / 4 = 0.5


Feature 3 → [4, 2, 4, 2]

  • Mean μ = (4 + 2 + 4 + 2) / 4 = 3

  • Squared deviations = 1 + 1 + 1 + 1 = 4

  • Variance = 4 / 4 = 1.0
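
Rather than repeating the arithmetic by hand for every column, the same four numbers fall out of a single NumPy call (column-wise, i.e. axis=0):

import numpy as np

data = np.array([
    [1, 2, 3, 4],
    [2, 1, 3, 2],
    [1, 3, 2, 4],
    [4, 2, 4, 2],
])

print(np.var(data, axis=0))   # [1.5 0.5 0.5 1. ] -- one population variance per feature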


Decision Based on Threshold

  • Computed variances: [1.5, 0.5, 0.5, 1.0]

  • Threshold = 0.1

Since all variances are greater than 0.1, no features are removed.
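
We don't even have to recompute these values ourselves: after fitting, the selector exposes the variances it measured and a boolean mask of the features it kept. Continuing with the fitted vf object from the code above:

print(vf.variances_)     # [1.5 0.5 0.5 1. ] -- matches our manual calculation
print(vf.get_support())  # [ True  True  True  True] -- every feature clears the 0.1 threshold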


Final Output

  • Shape after transformation = (4, 4)

  • Number of selected features = 4

So, the output of the code is:

4

Key Takeaways

  • VarianceThreshold is a simple baseline method for feature selection.

  • Features with variance below the threshold are removed, as they provide little to no useful information.

  • In our case, all features had enough variance, so none were removed.

👉 Always remember: VarianceThreshold works best for filtering out constant or near-constant features before applying more sophisticated feature selection or dimensionality reduction methods.
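
In practice, that usually means putting VarianceThreshold at the front of a pipeline. The sketch below shows one possible arrangement; the follow-up selector, the classifier, and the X_train/y_train names are placeholders, not part of the example above:

from sklearn.pipeline import Pipeline
from sklearn.feature_selection import VarianceThreshold, SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression

pipe = Pipeline([
    ("drop_low_variance", VarianceThreshold(threshold=0.1)),   # cheap filter: remove near-constant features
    ("select_best", SelectKBest(f_classif, k=2)),              # then a more sophisticated selector
    ("model", LogisticRegression()),
])

# pipe.fit(X_train, y_train) would then apply the same filtering consistently at fit and predict time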


✨ That’s it! You now understand how to compute variances manually and verify why VarianceThreshold kept all features in this dataset.


