The Ultimate Guide to Scikit-learn Models (with Key Hyperparameters)
Choosing the right model in scikit-learn can be tricky.
This guide gives you model categories, when to use them, real-world examples, and important hyperparameters to tune for better performance.
1. Classification Models — Predicting Categories
| Model | When to Use | Example | Key Hyperparameters |
|---|---|---|---|
| LogisticRegression | Binary/multi-class classification with a linear decision boundary. | Will it rain tomorrow? (Yes/No) | penalty (l1, l2, elasticnet), C (inverse regularization), solver |
| KNeighborsClassifier | Classification by similarity to nearest neighbors. | Classify plants by leaf shape. | n_neighbors, weights (uniform, distance), metric |
| DecisionTreeClassifier | Simple, interpretable rules for non-linear data. | Diagnose diabetes. | max_depth, min_samples_split, min_samples_leaf, criterion |
| RandomForestClassifier | Multiple trees for higher accuracy & robustness. | Credit card fraud detection. | n_estimators, max_depth, min_samples_split, max_features |
| GradientBoostingClassifier | Sequential trees that fix previous errors. | Predict customer churn. | n_estimators, learning_rate, max_depth, subsample |
| HistGradientBoostingClassifier | Faster gradient boosting for large data. | Classify product reviews. | max_iter, learning_rate, max_depth, l2_regularization |
| GaussianNB | Naive Bayes for continuous features. | Classify iris species from petal measurements. | var_smoothing |
| BernoulliNB | Naive Bayes for binary features. | Detect keyword presence in text. | alpha, binarize |
| MultinomialNB | Naive Bayes for count data. | Text classification. | alpha, fit_prior |
| SVC | Works well for small-to-medium datasets with clear margins. | Tumor classification. | kernel, C, gamma |
| LinearSVC | Fast linear SVM for large-scale classification. | Sentiment analysis. | C, penalty, loss |
| MLPClassifier | Neural network for non-linear decision boundaries. | Handwriting recognition. | hidden_layer_sizes, activation, solver, alpha, learning_rate |
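
As a rough illustration of how two rows from the table above might be compared, here is a minimal sketch that cross-validates a linear model against a tree ensemble. The synthetic dataset and the specific hyperparameter values are assumptions chosen for the example, not recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary classification data (assumed, for illustration only).
X, y = make_classification(
    n_samples=500, n_features=10, n_informative=5, random_state=0
)

# Two candidates from the table; hyperparameter values are placeholders.
models = {
    "logreg": make_pipeline(
        StandardScaler(), LogisticRegression(C=1.0, max_iter=1000)
    ),
    "forest": RandomForestClassifier(
        n_estimators=100, max_depth=5, random_state=0
    ),
}

# 5-fold cross-validated accuracy for each candidate.
scores = {}
for name, model in models.items():
    scores[name] = cross_val_score(model, X, y, cv=5, scoring="accuracy").mean()
    print(f"{name}: mean accuracy = {scores[name]:.3f}")
```

Scaling is wrapped in a pipeline for the linear model so that the scaler is fit only on each training fold; the forest does not need it.
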
2. Regression Models — Predicting Continuous Values
| Model | When to Use | Example | Key Hyperparameters |
|---|---|---|---|
| LinearRegression | Simple linear relationships. | Predict rainfall from humidity. | (No major hyperparameters) |
| Ridge | Linear regression with L2 regularization. | Predict house prices. | alpha, solver |
| Lasso | Linear regression with L1 regularization (feature selection). | Predict crop yield. | alpha, selection |
| ElasticNet | Mix of L1 and L2 penalties. | Predict rainfall with many correlated features. | alpha, l1_ratio |
| KNeighborsRegressor | Based on nearest neighbors’ average. | Estimate soil pH. | n_neighbors, weights, metric |
| DecisionTreeRegressor | Non-linear regression with rules. | Predict wind speed. | max_depth, min_samples_split, min_samples_leaf |
| RandomForestRegressor | Multiple trees for stable regression. | Predict electricity usage. | n_estimators, max_depth, min_samples_split |
| GradientBoostingRegressor | Boosted trees for high accuracy. | Predict rental prices. | n_estimators, learning_rate, max_depth, subsample |
| HistGradientBoostingRegressor | Fast gradient boosting on large data. | Predict crop yield. | max_iter, learning_rate, max_depth |
| SVR | SVM for regression. | Predict river water levels. | kernel, C, epsilon, gamma |
| MLPRegressor | Neural network for regression. | Predict solar power output. | hidden_layer_sizes, activation, solver, alpha |
| HuberRegressor | Robust to outliers. | Predict rainfall in noisy data. | epsilon, alpha |
| TheilSenRegressor | Robust, works well with small data. | Estimate median house prices. | max_subpopulation, n_jobs |
| RANSACRegressor | Ignores outliers when fitting. | Predict rainfall ignoring faulty readings. | min_samples, residual_threshold, max_trials |
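
To make the difference between L2 and L1 regularization concrete, the sketch below fits Ridge and Lasso on the same synthetic data and counts how many coefficients each drives to exactly zero. The dataset and the `alpha=1.0` values are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge
from sklearn.model_selection import train_test_split

# Synthetic data: 20 features, only 5 of which carry signal (assumed setup).
X, y = make_regression(
    n_samples=200, n_features=20, n_informative=5, noise=5.0, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Same regularization strength for both; alpha=1.0 is a placeholder value.
ridge = Ridge(alpha=1.0).fit(X_train, y_train)
lasso = Lasso(alpha=1.0).fit(X_train, y_train)

# Lasso can set coefficients to exactly zero; Ridge only shrinks them.
n_zero_ridge = int(np.sum(ridge.coef_ == 0))
n_zero_lasso = int(np.sum(lasso.coef_ == 0))

print("Ridge R^2:", round(ridge.score(X_test, y_test), 3), "zero coefs:", n_zero_ridge)
print("Lasso R^2:", round(lasso.score(X_test, y_test), 3), "zero coefs:", n_zero_lasso)
```
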
3. Clustering Models — Grouping Data Without Labels
| Model | When to Use | Example | Key Hyperparameters |
|---|---|---|---|
| KMeans | Data with spherical clusters, known number of clusters. | Group weather stations by climate. | n_clusters, init, max_iter |
| MiniBatchKMeans | Faster KMeans for large datasets. | Cluster cities by rainfall. | n_clusters, batch_size, max_iter |
| AgglomerativeClustering | Hierarchical grouping. | Group countries by seasonal temperatures. | n_clusters, linkage |
| DBSCAN | Arbitrary-shaped clusters + noise detection. | Detect abnormal rainfall areas. | eps, min_samples |
| Birch | Large datasets with streaming data. | Cluster satellite images. | n_clusters, threshold, branching_factor |
| SpectralClustering | Graph-based clustering. | Group river basins by water flow. | n_clusters, affinity |
| GaussianMixture | Probabilistic clustering. | Classify cloud types. | n_components, covariance_type |
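
The contrast between KMeans (spherical clusters) and DBSCAN (arbitrary shapes plus noise) can be sketched on the two-moons toy dataset. The data, `eps=0.2`, and `min_samples=5` are assumed values picked for this example; real data usually needs its own tuning.

```python
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score

# Two interleaved half-moons: a non-spherical shape (assumed toy data).
X, y_true = make_moons(n_samples=300, noise=0.05, random_state=0)

# KMeans needs the number of clusters up front.
kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# DBSCAN infers clusters from density; the label -1 marks noise points.
dbscan_labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)
n_dbscan_clusters = len(set(dbscan_labels) - {-1})

# Adjusted Rand index compares each labeling with the true moon membership.
print("KMeans ARI:", round(adjusted_rand_score(y_true, kmeans_labels), 3))
print("DBSCAN ARI:", round(adjusted_rand_score(y_true, dbscan_labels), 3))
print("DBSCAN clusters found:", n_dbscan_clusters)
```
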
4. Anomaly Detection — Spotting Outliers
| Model | When to Use | Example | Key Hyperparameters |
|---|---|---|---|
| IsolationForest | High-dimensional anomaly detection. | Detect extreme rainfall. | n_estimators, max_samples, contamination |
| OneClassSVM | Complex anomaly boundaries. | Find abnormal wind patterns. | kernel, nu, gamma |
| LocalOutlierFactor | Local density-based outliers. | Detect faulty sensors. | n_neighbors, contamination, metric |
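
Here is a small sketch of how `contamination` is used in practice with two of the detectors above. The data (a Gaussian cloud plus a handful of injected far-away points) and the 5% contamination value are assumptions for illustration; both estimators return `-1` for outliers and `1` for inliers.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor

# 200 "normal" points plus 10 injected outliers far from the cloud (assumed data).
rng = np.random.default_rng(0)
normal = rng.normal(loc=0.0, scale=1.0, size=(200, 2))
outliers = rng.uniform(low=6.0, high=8.0, size=(10, 2))
X = np.vstack([normal, outliers])

# contamination is the expected share of outliers; 0.05 is a guess here.
iso = IsolationForest(n_estimators=100, contamination=0.05, random_state=0)
iso_labels = iso.fit_predict(X)

lof = LocalOutlierFactor(n_neighbors=20, contamination=0.05)
lof_labels = lof.fit_predict(X)

print("IsolationForest flagged:", int((iso_labels == -1).sum()))
print("LocalOutlierFactor flagged:", int((lof_labels == -1).sum()))
```
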
5. Time Series (Outside scikit-learn but Common)
scikit-learn has no dedicated time series forecasting models, so forecasting work usually relies on other libraries. Common choices:
| Model | When to Use | Example | Key Hyperparameters |
|---|---|---|---|
| ARIMA / SARIMA | Trend & seasonality. | Monthly rainfall forecast. | p, d, q, P, D, Q, m |
| Prophet | Automated trend & seasonality handling. | Predict seasonal flooding. | changepoint_prior_scale, seasonality_mode |
| LSTM | Sequence learning. | Predict river levels. | units, dropout, epochs, batch_size |
✅ How to Choose a Model
- Check your target
  - Category → Classification
  - Number → Regression
  - No target → Clustering
  - Rare events → Anomaly detection
  - Time component → Time series
- Start simple, then tune
  Begin with a basic model, then adjust key hyperparameters for performance.
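
One way the "start simple, then tune" advice might look in code: score a trivial baseline first, then grid-search a single key hyperparameter of a basic model. The synthetic data and the grid of `C` values are assumptions for this sketch.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic classification data (assumed, for illustration only).
X, y = make_classification(
    n_samples=400, n_features=10, n_informative=5, random_state=0
)

# Step 1: a baseline that always predicts the most frequent class.
baseline_score = cross_val_score(
    DummyClassifier(strategy="most_frequent"), X, y, cv=5
).mean()

# Step 2: a basic model, tuning only C (inverse regularization strength).
pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
param_grid = {"logisticregression__C": [0.01, 0.1, 1.0, 10.0]}
search = GridSearchCV(pipeline, param_grid, cv=5, scoring="accuracy")
search.fit(X, y)

print("Baseline accuracy:", round(baseline_score, 3))
print("Best C:", search.best_params_["logisticregression__C"])
print("Tuned accuracy:", round(search.best_score_, 3))
```

If the tuned model barely beats the baseline, that is often a sign to revisit the features or the model family before widening the grid.
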