Understanding the Libraries Imported and Their Roles in Our Data Science Notebook

When working on a machine learning project, especially during data preprocessing and modeling, we rely on several powerful Python libraries. Let's explore which libraries were imported in this notebook, why we needed them, and what role they played in the process.


The full notebook is available on Kaggle: https://www.kaggle.com/code/jating4you/21f2000735-notebook-t22025?scriptVersionId=255216108


1. pandas — The Backbone for Data Handling

import pandas as pd
  • Why imported?
    pandas is the go-to library for data manipulation and analysis. It provides easy-to-use data structures such as the DataFrame, which make handling tabular data straightforward.

  • How was it used?

    • Reading CSV files (pd.read_csv) for loading train and test datasets.

    • Handling missing values, dropping columns, removing duplicates, and performing feature engineering (creating new columns).

    • Converting date columns to datetime and extracting day, month, year features.
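
A minimal sketch of how these pandas steps typically look (the file names and the date column's format are assumptions, not the notebook's exact schema):

import pandas as pd

# Load the train and test datasets (file names assumed)
train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")

# Convert a date column to datetime and extract temporal features
train["date"] = pd.to_datetime(train["date"], errors="coerce")
train["day"] = train["date"].dt.day
train["month"] = train["date"].dt.month
train["year"] = train["date"].dt.year

# Remove duplicate rows and drop columns with a single unique value
train = train.drop_duplicates()
train = train.loc[:, train.nunique() > 1]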


2. seaborn and matplotlib.pyplot — Data Visualization Tools

import seaborn as sns
import matplotlib.pyplot as plt
  • Why imported?
    Visualization is crucial for understanding data distributions, spotting outliers, and examining relationships between features.

  • How were they used?

    • Creating boxplots for identifying outliers in pageViews.

    • Plotting scatterplots and boxplots to analyze the relationship between features and the target variable (purchaseValue).

    • Plotting heatmaps to visualize correlation matrices and feature importances.
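
The plots described above can be produced with a few lines like the following (pageViews and purchaseValue follow the notebook's column names; the exact styling is an assumption):

import seaborn as sns
import matplotlib.pyplot as plt

# Boxplot to spot outliers in pageViews
sns.boxplot(x=train["pageViews"])
plt.show()

# Scatterplot of a numeric feature against the target
sns.scatterplot(data=train, x="pageViews", y="purchaseValue")
plt.show()

# Heatmap of the correlation matrix over numeric columns
corr = train.select_dtypes(include="number").corr()
sns.heatmap(corr, cmap="coolwarm", annot=False)
plt.show()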


3. category_encoders — For Target Encoding of Categorical Features

from category_encoders import TargetEncoder
  • Why imported?
    Categorical variables need to be converted into numerical values for most ML models. Target encoding replaces each category with a statistic (typically the mean) of the target variable for that category, preserving predictive information without blowing up dimensionality.

  • How was it used?

    • Encoding categorical columns in train and test sets based on the purchaseValue target before applying one-hot encoding.
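
A sketch of the target-encoding step (the categorical column names below are placeholders; the encoder is fit on the training target only, then applied to both sets):

from category_encoders import TargetEncoder

cat_cols = ["browser", "deviceCategory"]  # placeholder column names
encoder = TargetEncoder(cols=cat_cols)

# Fit on training data only, then apply the learned mapping to both sets
train[cat_cols] = encoder.fit_transform(train[cat_cols], train["purchaseValue"])
test[cat_cols] = encoder.transform(test[cat_cols])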


4. sklearn.model_selection — Splitting Data for Training and Validation

from sklearn.model_selection import train_test_split
  • Why imported?
    To evaluate how well a model generalizes, we split data into training, validation, and test sets. This helps detect overfitting and supports hyperparameter tuning.

  • How was it used?

    • Splitting the dataset into training, validation, and test subsets.
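
For reference, a typical split looks like this (the 80/20 ratio and random seed are assumptions, not necessarily the notebook's values):

from sklearn.model_selection import train_test_split

X = train.drop(columns=["purchaseValue"])
y = train["purchaseValue"]

# Hold out 20% of the rows for validation
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)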


5. ML Libraries and Metrics

from xgboost import XGBRegressor
from lightgbm import LGBMRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.impute import SimpleImputer
import numpy as np
import time
  • Why imported?
    These are the core tools for building, training, and evaluating machine learning models.

  • How was each used?

    • XGBoost, LightGBM, RandomForest: Different powerful regression algorithms trained to predict purchaseValue.

    • mean_absolute_error, r2_score: Evaluation metrics to measure prediction accuracy and model performance.

    • SimpleImputer: Imputes missing values (here, median strategy) for Random Forest inputs.

    • NumPy: Numeric operations like rounding and clipping predictions.

    • time: Measure training durations.
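
Putting these pieces together, a sketch of the training and evaluation loop might look like this (hyperparameters are illustrative, and passing early_stopping_rounds to the constructor assumes a recent xgboost version):

import time
import numpy as np
from xgboost import XGBRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.metrics import mean_absolute_error, r2_score

# Train XGBoost with early stopping and time the run
start = time.time()
xgb = XGBRegressor(n_estimators=500, learning_rate=0.05, early_stopping_rounds=50)
xgb.fit(X_train, y_train, eval_set=[(X_val, y_val)], verbose=False)
print(f"XGBoost trained in {time.time() - start:.1f}s")

# Clip predictions at zero (purchase values cannot be negative), then score
preds = np.clip(xgb.predict(X_val), 0, None)
print("MAE:", mean_absolute_error(y_val, preds))
print("R2 :", r2_score(y_val, preds))

# Random Forest cannot handle NaNs, so impute with the median first
imputer = SimpleImputer(strategy="median")
rf = RandomForestRegressor(n_estimators=200, n_jobs=-1, random_state=42)
rf.fit(imputer.fit_transform(X_train), y_train)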


6. Key Code Operations and Their Purpose

Data Cleaning & Preprocessing

  • Dropping columns with only one unique value to reduce noise.

  • Removing sessionId and userId since they likely don't help prediction.

  • Removing duplicates for clean data.

  • Dropping columns with too many missing values (>90%).

  • Filling missing values and converting types for categorical and numerical columns (totals.bounces, new_visits, etc.).

  • Extracting date features (day, week, month, year) from a date column.
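
A compact sketch of the missing-value handling described above (the 90% threshold is from the notebook; the exact column names are assumptions):

# Drop columns where more than 90% of the values are missing
missing_ratio = train.isna().mean()
train = train.drop(columns=missing_ratio[missing_ratio > 0.9].index)

# For flags like totals.bounces / new_visits, a missing value usually means 0
for col in ["totals.bounces", "new_visits"]:
    if col in train.columns:
        train[col] = train[col].fillna(0).astype(int)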

Feature Engineering

  • Creating engagement, temporal, and interaction features, such as:

    • Hits per session, views per hit, activity scores.

    • Weekend indicator, quarter, month start/end, seasons.

    • Binning skewed variables using quantiles.

    • Interaction features (multiplying related columns to capture combined effects).
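
These steps translate roughly into code like the following (feature names are illustrative; the notebook's exact formulas may differ):

# Engagement ratio (add 1 to the denominator to avoid division by zero)
train["hits_per_pageview"] = train["hits"] / (train["pageViews"] + 1)

# Temporal indicators derived from the date column
train["is_weekend"] = (train["date"].dt.dayofweek >= 5).astype(int)
train["quarter"] = train["date"].dt.quarter
train["is_month_start"] = train["date"].dt.is_month_start.astype(int)

# Quantile binning of a skewed variable
train["pageViews_bin"] = pd.qcut(train["pageViews"], q=4, labels=False, duplicates="drop")

# Interaction feature: multiply related columns to capture combined effects
train["hits_x_pageViews"] = train["hits"] * train["pageViews"]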

Outlier Handling

  • Visualizing pageViews distribution via boxplots.

  • Using IQR (Interquartile Range) method to cap outliers in pageViews.
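
The IQR capping step can be sketched as follows (the 1.5 multiplier is the conventional choice; the notebook may use a different factor):

# Cap pageViews outliers using the interquartile range
q1, q3 = train["pageViews"].quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
train["pageViews"] = train["pageViews"].clip(lower, upper)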

Exploratory Data Analysis (EDA)

  • Plotting scatter and boxplots for numerical and categorical features vs target.

  • Computing correlation matrix and plotting heatmap to understand feature relationships.

Encoding

  • Applying target encoding to categorical features.

  • Then using one-hot encoding to prepare features for models.
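
After target encoding (shown earlier), the one-hot step might look like this (low_card_cols is a hypothetical list of the remaining low-cardinality columns; X_test holds the competition test features):

# One-hot encode the remaining categoricals
X_train = pd.get_dummies(X_train, columns=low_card_cols, drop_first=True)
X_test = pd.get_dummies(X_test, columns=low_card_cols, drop_first=True)

# Align columns so train and test share the same one-hot feature set
X_train, X_test = X_train.align(X_test, join="left", axis=1, fill_value=0)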

Modeling

  • Training and evaluating three regressors:

    • XGBoost (with early stopping).

    • LightGBM (with early stopping and logging).

    • Random Forest (with median imputation).

  • Comparing models using R² and MAE metrics.

  • Plotting feature importance for the best model.

  • Making final predictions on the test set with the best model.
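
A sketch of the LightGBM run with early stopping, logging, and the feature-importance plot (hyperparameters and the top-20 cutoff are assumptions):

import lightgbm as lgb
from lightgbm import LGBMRegressor

# LightGBM with early stopping and periodic logging via callbacks
lgbm = LGBMRegressor(n_estimators=1000, learning_rate=0.05)
lgbm.fit(
    X_train, y_train,
    eval_set=[(X_val, y_val)],
    eval_metric="mae",
    callbacks=[lgb.early_stopping(50), lgb.log_evaluation(100)],
)

# Feature importances for the best model
importances = pd.Series(lgbm.feature_importances_, index=X_train.columns).nlargest(20)
sns.barplot(x=importances.values, y=importances.index)
plt.title("Top 20 feature importances")
plt.show()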


Summary: Why These Libraries?

Library/Module | Purpose | Why Essential
pandas | Data loading, manipulation, cleaning | Core data handling
seaborn, matplotlib | Data visualization | Understand data distribution & relations
category_encoders | Target encoding | Better categorical encoding
sklearn.model_selection | Data splitting | Proper training and validation splits
xgboost, lightgbm, RandomForestRegressor | Modeling | Powerful, fast predictive models
sklearn.metrics | Performance evaluation | Measure model quality
sklearn.impute | Missing data imputation | Handle missing values for Random Forest
numpy | Numeric operations | Array manipulation and calculations
time | Timing model training | Measure computational efficiency

Final Thoughts

Each library and tool in this notebook has a specific role in the machine learning pipeline — from loading and cleaning the data, through visual exploration, categorical encoding, and validation splitting, to training models and finally evaluating and predicting. Understanding these roles helps you build efficient, well-structured data science projects.

