Demystifying Dimensionality Reduction & PCA
When working with machine learning, you might encounter datasets with hundreds or even thousands of features. While more features might seem better, they can actually make your model slower, harder to interpret, and more prone to overfitting. This is where dimensionality reduction comes in.
1. What is Dimensionality Reduction?
Dimensionality reduction is the process of reducing the number of features (columns) in your dataset while keeping the most important patterns in the data.
Why it matters:
- Speeds up training and reduces computation.
- Avoids the curse of dimensionality — too many features make it harder for the model to generalize.
- Removes noise and redundancy from data.
- Makes visualization easier — you can plot high-dimensional data in 2D or 3D.
2. Does It Remove Data Points?
No — dimensionality reduction does not delete rows (data points). Instead:
- It transforms each point into a new coordinate system.
- The number of features changes, but the number of samples stays the same.
Example:
Before: 100 samples × 50 features
After: 100 samples × 5 features
Still 100 samples, just represented in a smaller feature space.
Because the data is compressed, some information is inevitably lost; the goal is to keep only the important patterns and discard the rest.
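To see that only the feature count changes, here is a minimal scikit-learn sketch on random placeholder data (the shapes simply mirror the example above):

import numpy as np
from sklearn.decomposition import PCA

# Placeholder data: 100 samples, 50 features
X = np.random.rand(100, 50)

pca = PCA(n_components=5)
X_reduced = pca.fit_transform(X)

print(X.shape)          # (100, 50) -> 100 samples, 50 features
print(X_reduced.shape)  # (100, 5)  -> still 100 samples, now 5 features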
3. What is PCA? (Principal Component Analysis)
PCA is one of the most common dimensionality reduction techniques.
Intuition:
- Imagine you have points scattered in 3D space.
- PCA finds a new set of axes (directions) that capture the most variation in the data.
- Then, it re-expresses the data using only the most important axes.
How PCA Works — Step by Step
1. Standardize the data
   Make sure each feature has mean = 0 and variance = 1, so no feature dominates due to its scale.
2. Find the covariance matrix
   This measures how features vary together.
3. Compute eigenvectors & eigenvalues
   - Eigenvectors = new directions (principal components).
   - Eigenvalues = amount of variance each direction explains.
4. Sort by variance explained
   Keep the top k components that explain the most variance.
5. Transform the data
   Project the original data onto these components (a minimal NumPy sketch of these steps follows below).
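To make the steps concrete, here is a minimal NumPy sketch of the same procedure on a small random matrix. The data, the number of components k, and the variable names are purely illustrative:

import numpy as np

# Placeholder data: 100 samples, 10 features
X = np.random.rand(100, 10)
k = 3  # number of components to keep

# 1. Standardize: mean 0, variance 1 for each feature
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# 2. Covariance matrix of the standardized features
cov = np.cov(X_std, rowvar=False)

# 3. Eigenvectors (directions) and eigenvalues (variance explained)
eigenvalues, eigenvectors = np.linalg.eigh(cov)

# 4. Sort directions by variance explained, keep the top k
order = np.argsort(eigenvalues)[::-1]
components = eigenvectors[:, order[:k]]

# 5. Project the data onto the top k components
X_reduced = X_std @ components

print(X_reduced.shape)  # (100, 3)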
4. Example: Reducing 2D to 1D
Imagine you have 2 features: height and weight. The data points form a diagonal cloud.
- PCA finds the line (1D axis) that best fits the spread.
- Each point is projected onto this line.
- Now, instead of (height, weight), each person is described by one number — their position along this new axis (see the sketch below).
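A small sketch of this idea, using made-up height and weight values (the numbers are purely illustrative) and scikit-learn's PCA:

import numpy as np
from sklearn.decomposition import PCA

# Made-up (height in cm, weight in kg) pairs forming a rough diagonal cloud
people = np.array([
    [160, 55], [165, 60], [170, 68], [175, 72], [180, 80], [185, 88],
])

# Collapse the two correlated features onto a single best-fit axis
pca = PCA(n_components=1)
positions = pca.fit_transform(people)

print(positions.ravel())  # one number per person: position along the new axis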
5. Real-Life Example in Python
Let's reduce scikit-learn's handwritten digits dataset from 64 features (8×8 pixel images) to 2 features for visualization.
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
# Load dataset
digits = load_digits()
X = digits.data
y = digits.target
# Apply PCA
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)
# Plot
plt.figure(figsize=(8,6))
scatter = plt.scatter(X_pca[:,0], X_pca[:,1], c=y, cmap='tab10', s=15)
plt.legend(*scatter.legend_elements(), title="Digits")
plt.xlabel("Principal Component 1")
plt.ylabel("Principal Component 2")
plt.title("MNIST Digits Projected into 2D via PCA")
plt.show()
What happens here?
- The original data had 64 features (8×8 pixel images).
- PCA compressed it to 2 features.
- We can now visualize the digits in a simple 2D plot.
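To check how much of the original variance those 2 components actually retain, you can query the fitted pca object from the snippet above (the exact numbers depend on the data):

# Fraction of total variance captured by each of the 2 components
print(pca.explained_variance_ratio_)
print(pca.explained_variance_ratio_.sum())  # total fraction retained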
6. Key Notes on PCA
- Unsupervised: Doesn't use labels.
- Linear transformation: New features are weighted combinations of the original features (see the snippet below).
- Variance-focused: Keeps directions where data varies the most.
- Trade-off: Fewer features = less detail, but also less noise.
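As an illustration of the "linear transformation" point, the fitted pca from the digits example stores the combination weights in components_ (the shapes shown assume that example):

# Each row is one principal component: one weight per original pixel feature
print(pca.components_.shape)  # (2, 64)

# The first new feature is a weighted sum of the mean-centered pixels:
# X_pca[:, 0] == (X - pca.mean_) @ pca.components_[0]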
7. When to Use PCA?
- Before feeding data into a machine learning model, to remove noise.
- For visualizing high-dimensional data in 2D/3D.
- When dealing with multicollinearity (highly correlated features).
- For speeding up algorithms on large datasets (a typical pipeline is sketched below).
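As a sketch of the first and last points, PCA is often placed in front of a classifier in a scikit-learn Pipeline. The choice of LogisticRegression and n_components=30 below is an illustrative assumption, not a prescription:

from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, random_state=0
)

# Standardize, reduce 64 pixel features to 30 components, then classify
model = make_pipeline(
    StandardScaler(),
    PCA(n_components=30),
    LogisticRegression(max_iter=1000),
)
model.fit(X_train, y_train)
print(model.score(X_test, y_test))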
In short: PCA doesn’t throw away your data points — it gives them a more compact, information-rich description. Think of it like compressing a photo: the picture is smaller, but still recognizable.