Understanding the Libraries Imported and Their Roles in Our Data Science Notebook

When working on a machine learning project, especially during data preprocessing and modeling, we rely on several powerful Python libraries. Let's explore which libraries were imported in this notebook, why we needed them, and what role they played in the process.


The full notebook is available on Kaggle: https://www.kaggle.com/code/jating4you/21f2000735-notebook-t22025?scriptVersionId=255216108


1. pandas — The Backbone for Data Handling

import pandas as pd
  • Why imported?
    pandas is the go-to library for data manipulation and analysis. It provides easy-to-use data structures such as the DataFrame, which make handling tabular data straightforward.

  • How was it used?

    • Reading CSV files (pd.read_csv) for loading train and test datasets.

    • Handling missing values, dropping columns, removing duplicates, and performing feature engineering (creating new columns).

    • Converting date columns to datetime and extracting day, month, year features.
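
A minimal sketch of how these pandas steps typically look (the file names and the date column's format are assumptions, not the notebook's exact schema):

import pandas as pd

# Load the train and test datasets (file names assumed)
train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")

# Convert a date column to datetime and extract temporal features
train["date"] = pd.to_datetime(train["date"], errors="coerce")
train["day"] = train["date"].dt.day
train["month"] = train["date"].dt.month
train["year"] = train["date"].dt.year

# Remove duplicate rows and drop columns with a single unique value
train = train.drop_duplicates()
train = train.loc[:, train.nunique() > 1]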


2. seaborn and matplotlib.pyplot — Data Visualization Tools

import seaborn as sns
import matplotlib.pyplot as plt
  • Why imported?
    Visualization is crucial for understanding data distributions, spotting outliers, and examining relationships between features.

  • How were they used?

    • Creating boxplots for identifying outliers in pageViews.

    • Plotting scatterplots and boxplots to analyze the relationship between features and the target variable (purchaseValue).

    • Plotting heatmaps to visualize correlation matrices and feature importances.
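
The plots described above can be produced with a few lines like the following (pageViews and purchaseValue follow the notebook's column names; the exact styling is an assumption):

import seaborn as sns
import matplotlib.pyplot as plt

# Boxplot to spot outliers in pageViews
sns.boxplot(x=train["pageViews"])
plt.show()

# Scatterplot of a numeric feature against the target
sns.scatterplot(data=train, x="pageViews", y="purchaseValue")
plt.show()

# Heatmap of the correlation matrix over numeric columns
corr = train.select_dtypes(include="number").corr()
sns.heatmap(corr, cmap="coolwarm", annot=False)
plt.show()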


3. category_encoders — For Target Encoding of Categorical Features

from category_encoders import TargetEncoder
  • Why imported?
    Categorical variables need to be converted into numerical values for most ML models. Target encoding replaces each category with a statistic (typically the mean) of the target variable for that category, preserving predictive information without blowing up dimensionality.

  • How was it used?

    • Encoding categorical columns in train and test sets based on the purchaseValue target before applying one-hot encoding.
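
A sketch of the target-encoding step (the categorical column names below are placeholders; the encoder is fit on the training target only, then applied to both sets):

from category_encoders import TargetEncoder

cat_cols = ["browser", "deviceCategory"]  # placeholder column names
encoder = TargetEncoder(cols=cat_cols)

# Fit on training data only, then apply the learned mapping to both sets
train[cat_cols] = encoder.fit_transform(train[cat_cols], train["purchaseValue"])
test[cat_cols] = encoder.transform(test[cat_cols])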


4. sklearn.model_selection — Splitting Data for Training and Validation

from sklearn.model_selection import train_test_split
  • Why imported?
    To evaluate how well a model generalizes, we split data into training, validation, and test sets. This helps detect overfitting and supports hyperparameter tuning.

  • How was it used?

    • Splitting the dataset into training, validation, and test subsets.
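
For reference, a typical split looks like this (the 80/20 ratio and random seed are assumptions, not necessarily the notebook's values):

from sklearn.model_selection import train_test_split

X = train.drop(columns=["purchaseValue"])
y = train["purchaseValue"]

# Hold out 20% of the rows for validation
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)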


5. ML Libraries and Metrics

from xgboost import XGBRegressor
from lightgbm import LGBMRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.impute import SimpleImputer
import numpy as np
import time
  • Why imported?
    These are the core tools for building, training, and evaluating machine learning models.

  • How was each used?

    • XGBoost, LightGBM, RandomForest: Different powerful regression algorithms trained to predict purchaseValue.

    • mean_absolute_error, r2_score: Evaluation metrics to measure prediction accuracy and model performance.

    • SimpleImputer: Imputes missing values (here, median strategy) for Random Forest inputs.

    • NumPy: Numeric operations like rounding and clipping predictions.

    • time: Measure training durations.
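
Putting these pieces together, a sketch of the training and evaluation loop might look like this (hyperparameters are illustrative, and passing early_stopping_rounds to the constructor assumes a recent xgboost version):

import time
import numpy as np
from xgboost import XGBRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.metrics import mean_absolute_error, r2_score

# Train XGBoost with early stopping and time the run
start = time.time()
xgb = XGBRegressor(n_estimators=500, learning_rate=0.05, early_stopping_rounds=50)
xgb.fit(X_train, y_train, eval_set=[(X_val, y_val)], verbose=False)
print(f"XGBoost trained in {time.time() - start:.1f}s")

# Clip predictions at zero (purchase values cannot be negative), then score
preds = np.clip(xgb.predict(X_val), 0, None)
print("MAE:", mean_absolute_error(y_val, preds))
print("R2 :", r2_score(y_val, preds))

# Random Forest cannot handle NaNs, so impute with the median first
imputer = SimpleImputer(strategy="median")
rf = RandomForestRegressor(n_estimators=200, n_jobs=-1, random_state=42)
rf.fit(imputer.fit_transform(X_train), y_train)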


6. Key Code Operations and Their Purpose

Data Cleaning & Preprocessing

  • Dropping columns with only one unique value to reduce noise.

  • Removing sessionId and userId since they likely don't help prediction.

  • Removing duplicates for clean data.

  • Dropping columns with too many missing values (>90%).

  • Filling missing values and converting types for categorical and numerical columns (totals.bounces, new_visits, etc.).

  • Extracting date features (day, week, month, year) from a date column.
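
A compact sketch of the missing-value handling described above (the 90% threshold is from the notebook; the exact column names are assumptions):

# Drop columns where more than 90% of the values are missing
missing_ratio = train.isna().mean()
train = train.drop(columns=missing_ratio[missing_ratio > 0.9].index)

# For flags like totals.bounces / new_visits, a missing value usually means 0
for col in ["totals.bounces", "new_visits"]:
    if col in train.columns:
        train[col] = train[col].fillna(0).astype(int)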

Feature Engineering

  • Creating engagement, temporal, and interaction features, such as:

    • Hits per session, views per hit, activity scores.

    • Weekend indicator, quarter, month start/end, seasons.

    • Binning skewed variables using quantiles.

    • Interaction features (multiplying related columns to capture combined effects).
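
These steps translate roughly into code like the following (feature names are illustrative; the notebook's exact formulas may differ):

# Engagement ratio (add 1 to the denominator to avoid division by zero)
train["hits_per_pageview"] = train["hits"] / (train["pageViews"] + 1)

# Temporal indicators derived from the date column
train["is_weekend"] = (train["date"].dt.dayofweek >= 5).astype(int)
train["quarter"] = train["date"].dt.quarter
train["is_month_start"] = train["date"].dt.is_month_start.astype(int)

# Quantile binning of a skewed variable
train["pageViews_bin"] = pd.qcut(train["pageViews"], q=4, labels=False, duplicates="drop")

# Interaction feature: multiply related columns to capture combined effects
train["hits_x_pageViews"] = train["hits"] * train["pageViews"]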

Outlier Handling

  • Visualizing pageViews distribution via boxplots.

  • Using IQR (Interquartile Range) method to cap outliers in pageViews.
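
The IQR capping step can be sketched as follows (the 1.5 multiplier is the conventional choice; the notebook may use a different factor):

# Cap pageViews outliers using the interquartile range
q1, q3 = train["pageViews"].quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
train["pageViews"] = train["pageViews"].clip(lower, upper)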

Exploratory Data Analysis (EDA)

  • Plotting scatter and boxplots for numerical and categorical features vs target.

  • Computing correlation matrix and plotting heatmap to understand feature relationships.

Encoding

  • Applying target encoding to categorical features.

  • Then using one-hot encoding to prepare features for models.
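
After target encoding (shown earlier), the one-hot step might look like this (low_card_cols is a hypothetical list of the remaining low-cardinality columns; X_test holds the competition test features):

# One-hot encode the remaining categoricals
X_train = pd.get_dummies(X_train, columns=low_card_cols, drop_first=True)
X_test = pd.get_dummies(X_test, columns=low_card_cols, drop_first=True)

# Align columns so train and test share the same one-hot feature set
X_train, X_test = X_train.align(X_test, join="left", axis=1, fill_value=0)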

Modeling

  • Training and evaluating three regressors:

    • XGBoost (with early stopping).

    • LightGBM (with early stopping and logging).

    • Random Forest (with median imputation).

  • Comparing models using R² and MAE metrics.

  • Plotting feature importance for the best model.

  • Making final predictions on the test set with the best model.
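
A sketch of the LightGBM run with early stopping, logging, and the feature-importance plot (hyperparameters and the top-20 cutoff are assumptions):

import lightgbm as lgb
from lightgbm import LGBMRegressor

# LightGBM with early stopping and periodic logging via callbacks
lgbm = LGBMRegressor(n_estimators=1000, learning_rate=0.05)
lgbm.fit(
    X_train, y_train,
    eval_set=[(X_val, y_val)],
    eval_metric="mae",
    callbacks=[lgb.early_stopping(50), lgb.log_evaluation(100)],
)

# Feature importances for the best model
importances = pd.Series(lgbm.feature_importances_, index=X_train.columns).nlargest(20)
sns.barplot(x=importances.values, y=importances.index)
plt.title("Top 20 feature importances")
plt.show()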


Summary: Why These Libraries?

Library/Module | Purpose | Why Essential
pandas | Data loading, manipulation, cleaning | Core data handling
seaborn, matplotlib | Data visualization | Understand data distribution & relations
category_encoders | Target encoding | Better categorical encoding
sklearn.model_selection | Data splitting | Proper training and validation splits
xgboost, lightgbm, RandomForestRegressor | Modeling | Powerful, fast predictive models
sklearn.metrics | Performance evaluation | Measure model quality
sklearn.impute | Missing data imputation | Handle missing values for Random Forest
numpy | Numeric operations | Array manipulation and calculations
time | Timing model training | Measure computational efficiency

Final Thoughts

Each library and tool in this notebook has a specific role in the machine learning pipeline — from loading and cleaning the data, through visual exploration, categorical encoding, and validation splitting, to training models and finally evaluating and predicting. Understanding these roles helps you build efficient, well-structured data science projects.

