Understanding the Libraries Imported and Their Roles in Our Data Science Notebook
When working on a machine learning project, especially data preprocessing and modeling, we rely on several powerful Python libraries. Let's explore which libraries were imported in this notebook, why we needed them, and what role they played in the process.
https://www.kaggle.com/code/jating4you/21f2000735-notebook-t22025?scriptVersionId=255216108
1. pandas — The Backbone for Data Handling
import pandas as pd
Why imported?
pandas is the go-to library for data manipulation and analysis. It provides easy-to-use data structures like DataFrames which make handling tabular data straightforward.
How was it used?
- Reading CSV files (pd.read_csv) to load the train and test datasets.
- Handling missing values, dropping columns, removing duplicates, and performing feature engineering (creating new columns).
- Converting date columns to datetime and extracting day, month, and year features (see the sketch below).
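For illustration, a minimal sketch of this workflow might look like the following (the file names and column names are assumptions, not necessarily the notebook's exact code):

```python
import pandas as pd

# Load the train and test datasets (file names are placeholders).
train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")

# Remove exact duplicate rows.
train = train.drop_duplicates()

# Convert the date column to datetime and extract calendar features.
train["date"] = pd.to_datetime(train["date"], errors="coerce")
train["day"] = train["date"].dt.day
train["month"] = train["date"].dt.month
train["year"] = train["date"].dt.year
```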
2. seaborn and matplotlib.pyplot — Data Visualization Tools
import seaborn as sns
import matplotlib.pyplot as plt
Why imported?
Visualization is crucial for understanding data distributions, spotting outliers, and seeing relationships between features.
How was it used?
- Creating boxplots to identify outliers in pageViews (see the sketch below).
- Plotting scatterplots and boxplots to analyze the relationship between features and the target variable (purchaseValue).
- Plotting heatmaps to visualize correlation matrices and feature importances.
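As a rough sketch, building on the train DataFrame loaded above (pageViews and purchaseValue are the column names described in the notebook):

```python
import seaborn as sns
import matplotlib.pyplot as plt

# Boxplot to spot outliers in pageViews.
sns.boxplot(x=train["pageViews"])
plt.title("Distribution of pageViews")
plt.show()

# Scatterplot of pageViews against the target purchaseValue.
sns.scatterplot(x=train["pageViews"], y=train["purchaseValue"])
plt.title("pageViews vs purchaseValue")
plt.show()
```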
3. category_encoders — For Target Encoding of Categorical Features
from category_encoders import TargetEncoder
Why imported?
Categorical variables need to be converted into numerical values for most ML models. Target encoding is a powerful technique that replaces each category with a value derived from the target variable, preserving important information.
How was it used?
- Encoding categorical columns in the train and test sets based on the purchaseValue target before applying one-hot encoding (see the sketch below).
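A minimal sketch of how TargetEncoder is typically applied (the categorical column names here are placeholders, not the notebook's actual list):

```python
from category_encoders import TargetEncoder

# Hypothetical categorical columns; replace with the notebook's actual ones.
cat_cols = ["country", "city", "browser"]

encoder = TargetEncoder(cols=cat_cols)

# Fit on the training data using purchaseValue as the target,
# then apply the same learned mapping to the test set.
train[cat_cols] = encoder.fit_transform(train[cat_cols], train["purchaseValue"])
test[cat_cols] = encoder.transform(test[cat_cols])
```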
4. sklearn.model_selection — Splitting Data for Training and Validation
from sklearn.model_selection import train_test_split
Why imported?
To evaluate how well a model generalizes, we split the data into training, validation, and test sets. This helps detect overfitting and supports hyperparameter tuning.
How was it used?
- Splitting the dataset into training, validation, and test subsets (see the sketch below).
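One common pattern for a three-way split looks like this (the split sizes and random seed are illustrative assumptions):

```python
from sklearn.model_selection import train_test_split

# Features and target (purchaseValue is the target described in the notebook).
X = train.drop(columns=["purchaseValue"])
y = train["purchaseValue"]

# First hold out a test split, then split the remainder into train/validation.
X_temp, X_test_split, y_temp, y_test_split = train_test_split(
    X, y, test_size=0.2, random_state=42
)
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.25, random_state=42
)
```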
5. ML Libraries and Metrics
from xgboost import XGBRegressor
from lightgbm import LGBMRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.impute import SimpleImputer
import numpy as np
import time
Why imported?
These are the core tools for building, training, and evaluating machine learning models.
How was each used?
- XGBoost, LightGBM, Random Forest: three powerful regression algorithms trained to predict purchaseValue.
- mean_absolute_error, r2_score: evaluation metrics to measure prediction error and model performance.
- SimpleImputer: imputes missing values (here, with the median strategy) for the Random Forest inputs.
- NumPy: numeric operations like rounding and clipping predictions.
- time: measuring training durations.
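To make these roles concrete, here is a hedged sketch of the Random Forest branch, combining SimpleImputer, timing, the metrics, and NumPy clipping. It assumes the X_train/X_val/y_train/y_val split from the sketch above and already-numeric features; the hyperparameters are placeholders:

```python
import time
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.metrics import mean_absolute_error, r2_score

# Median imputation for Random Forest inputs (scikit-learn forests cannot
# handle NaNs, unlike XGBoost/LightGBM).
imputer = SimpleImputer(strategy="median")
X_train_imp = imputer.fit_transform(X_train)
X_val_imp = imputer.transform(X_val)

# Time the training run.
start = time.time()
rf = RandomForestRegressor(n_estimators=200, random_state=42, n_jobs=-1)
rf.fit(X_train_imp, y_train)
print(f"Random Forest trained in {time.time() - start:.1f} s")

# Evaluate with MAE and R²; clip predictions so they stay non-negative.
preds = np.clip(rf.predict(X_val_imp), 0, None)
print("MAE:", mean_absolute_error(y_val, preds))
print("R²:", r2_score(y_val, preds))
```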
6. Key Code Operations and Their Purpose
Data Cleaning & Preprocessing
- Dropping columns with only one unique value to reduce noise.
- Removing sessionId and userId, since identifiers likely don't help prediction.
- Removing duplicates for clean data.
- Dropping columns with too many missing values (>90%).
- Filling missing values and converting types for categorical and numerical columns (totals.bounces, new_visits, etc.).
- Extracting date features (day, week, month, year) from the date column.
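A compact sketch of these cleaning steps (the specific column names and fill values are assumptions based on the descriptions above):

```python
# Drop columns with only one unique value (they carry no signal).
constant_cols = [c for c in train.columns if train[c].nunique(dropna=False) <= 1]
train = train.drop(columns=constant_cols)

# Drop identifier columns.
train = train.drop(columns=["sessionId", "userId"], errors="ignore")

# Drop columns where more than 90% of the values are missing.
sparse_cols = [c for c in train.columns if train[c].isna().mean() > 0.9]
train = train.drop(columns=sparse_cols)

# Fill and retype selected columns (assumed to be numeric-like flags).
train["totals.bounces"] = train["totals.bounces"].fillna(0).astype(int)
train["new_visits"] = train["new_visits"].fillna(0).astype(int)
```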
Feature Engineering
- Creating engagement and temporal features like:
  - Hits per session, views per hit, activity scores.
  - Weekend indicator, quarter, month start/end, seasons.
- Binning skewed variables using quantiles.
- Interaction features (multiplying related columns to capture combined effects).
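For example, an engagement ratio, a few temporal flags, quantile bins, and an interaction term might be built like this (totalHits and the other column names are assumptions):

```python
import pandas as pd

# Engagement ratio; add 1 to the denominator to avoid division by zero.
train["views_per_hit"] = train["pageViews"] / (train["totalHits"] + 1)

# Temporal flags derived from the parsed date column.
train["is_weekend"] = (train["date"].dt.dayofweek >= 5).astype(int)
train["quarter"] = train["date"].dt.quarter
train["is_month_start"] = train["date"].dt.is_month_start.astype(int)

# Quantile-based bins for a skewed variable.
train["pageViews_bin"] = pd.qcut(train["pageViews"], q=4, labels=False, duplicates="drop")

# Interaction feature: multiply two related columns.
train["hits_x_views"] = train["totalHits"] * train["pageViews"]
```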
Outlier Handling
- Visualizing the pageViews distribution via boxplots.
- Using the IQR (interquartile range) method to cap outliers in pageViews.
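The IQR capping step can be sketched as follows:

```python
# 1.5 * IQR rule for pageViews.
q1 = train["pageViews"].quantile(0.25)
q3 = train["pageViews"].quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Clip values outside the whiskers back to the boundaries.
train["pageViews"] = train["pageViews"].clip(lower=lower, upper=upper)
```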
Exploratory Data Analysis (EDA)
- Plotting scatterplots and boxplots of numerical and categorical features against the target.
- Computing the correlation matrix and plotting a heatmap to understand feature relationships.
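A typical correlation heatmap, restricted to numeric columns, might look like this:

```python
import seaborn as sns
import matplotlib.pyplot as plt

# Correlation matrix over numeric columns, rendered as a heatmap.
corr = train.select_dtypes(include="number").corr()
plt.figure(figsize=(10, 8))
sns.heatmap(corr, cmap="coolwarm")
plt.title("Feature correlation heatmap")
plt.show()
```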
Encoding
- Applying target encoding to categorical features (as in the TargetEncoder sketch above).
- Then using one-hot encoding to prepare the features for the models.
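One common way to finish the encoding step is pd.get_dummies; the column names shown here are placeholders, and the notebook's exact approach may differ:

```python
import pandas as pd

# One-hot encode remaining low-cardinality categorical columns.
ohe_cols = ["deviceCategory", "channelGrouping"]
train = pd.get_dummies(train, columns=ohe_cols, drop_first=True)
test = pd.get_dummies(test, columns=ohe_cols, drop_first=True)

# Make sure the test set has exactly the training feature columns.
feature_cols = [c for c in train.columns if c != "purchaseValue"]
test = test.reindex(columns=feature_cols, fill_value=0)
```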
Modeling
- Training and evaluating three regressors:
  - XGBoost (with early stopping).
  - LightGBM (with early stopping and logging).
  - Random Forest (with median imputation).
- Comparing models using R² and MAE metrics.
- Plotting feature importances for the best model.
- Making final predictions on the test set with the best model (see the sketch below).
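As a final hedged sketch, continuing from the split above, this is roughly what the XGBoost branch with early stopping, evaluation, and feature-importance plotting looks like. The hyperparameters are placeholders, and note that with older xgboost versions early_stopping_rounds is passed to fit() rather than the constructor:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from xgboost import XGBRegressor
from sklearn.metrics import mean_absolute_error, r2_score

# Early stopping monitored on the validation split.
xgb_model = XGBRegressor(
    n_estimators=1000,
    learning_rate=0.05,
    random_state=42,
    early_stopping_rounds=50,
)
xgb_model.fit(X_train, y_train, eval_set=[(X_val, y_val)], verbose=False)

# Validation metrics used to compare the three models.
val_pred = np.clip(xgb_model.predict(X_val), 0, None)
print("MAE:", mean_absolute_error(y_val, val_pred))
print("R²:", r2_score(y_val, val_pred))

# Feature importances for the best-performing model.
importances = pd.Series(xgb_model.feature_importances_, index=X_train.columns)
importances.sort_values().tail(15).plot(kind="barh")
plt.title("Top feature importances")
plt.show()

# Final predictions on the test set (assumed to have been preprocessed
# and encoded the same way as the training features).
final_preds = np.clip(xgb_model.predict(test[X_train.columns]), 0, None)
```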
Summary: Why These Libraries?
| Library/Module | Purpose | Why Essential |
|---|---|---|
| pandas | Data loading, manipulation, cleaning | Core data handling |
| seaborn, matplotlib | Data visualization | Understand data distributions & relations |
| category_encoders | Target encoding | Better categorical encoding |
| sklearn.model_selection | Data splitting | Proper training and validation splits |
| xgboost, lightgbm, RandomForestRegressor | Modeling | Powerful, fast predictive models |
| sklearn.metrics | Performance evaluation | Measure model quality |
| sklearn.impute | Missing data imputation | Handle missing values for Random Forest |
| numpy | Numeric operations | Array manipulation and calculations |
| time | Timing model training | Measure computational efficiency |
Final Thoughts
Each library and tool in this notebook has a specific role in the machine learning pipeline, from loading and cleaning the data, to understanding it visually, encoding categorical features, splitting for validation, training models, and finally evaluating and predicting. Understanding this helps you build efficient, well-structured data science projects.