Table of Contents¶
- Introduction
- Notebook settings
- Data Loading, Cleaning and Initial Exploration
- Data Preparation
- EDA
- Statistical Inference and Hypothesis Testing
- Data Preprocessing
- Hyperparameter Tuning including Class Weights
- Ensemble Model with Voting Classifier
- Cross-validation with Stratified K-Fold
- Bootstrapping with Bagging
- Learning Curve Analysis - Ensemble Model (Voting Classifier)
- Confusion Matrix for the Ensemble Model (Voting Classifier)
- LIME Interpretation
- AUC-ROC Curves for the Ensemble Model
- Final Conclusion
- Suggested Improvements
Introduction¶
According to the World Health Organization (WHO), stroke is the second leading cause of death globally, responsible for approximately 11% of total deaths. This dataset is used to predict whether a patient is likely to have a stroke based on input parameters such as gender, age, various diseases, and smoking status. Each row in the data provides relevant information about the patient.
Dataset Structure¶
| Column | Description | Datatype | Count |
|---|---|---|---|
| gender | Gender: 'Male' 'Female' 'Other' | object | 5110 |
| age | Age | float64 | 5110 |
| hypertension | If someone has hypertension: 0-'No', 1-'Yes' | int64 | 5110 |
| heart_disease | If someone has heart disease: 0-'No', 1-'Yes' | int64 | 5110 |
| ever_married | If someone is married or not: 'Yes' 'No' | object | 5110 |
| work_type | Type of work: 'Private' 'Self-employed' 'Govt_job' 'children' 'Never_worked' | object | 5110 |
| Residence_type | Type of residence: 'Urban' or 'Rural' | object | 5110 |
| avg_glucose_level | Glucose level average | float64 | 5110 |
| bmi | Body Mass Index: an estimate of body fat based on height and weight measurements | float64 | 4909 |
| smoking_status | Smoking status: 'formerly smoked' 'never smoked' 'smokes' 'Unknown' | object | 5110 |
Target variable:
| Column | Description | Datatype | Count |
|---|---|---|---|
| stroke | If someone had stroke or not: 0-'No', 1-'Yes' | int64 | 5110 |
Notebook settings¶
from assets.utils.functions import *
# Standard libraries
import warnings
import joblib
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import matplotlib.patches as mpatches
from IPython.display import display
# Model explanation tools
import lime
import lime.lime_tabular
# Data inspection and EDA
import sidetable as stb
from skimpy import skim
# Statistical analysis and hypothesis testing
from scipy import stats
from scipy.stats import norm, ttest_ind, chi2_contingency, t, zscore
import statsmodels.api as sm
import statsmodels.stats.api as sms_stats
from statsmodels.stats.weightstats import CompareMeans, DescrStatsW
from statsmodels.stats.outliers_influence import variance_inflation_factor
# Machine learning and modeling
from sklearn.model_selection import (train_test_split, cross_val_score,
learning_curve, cross_val_predict,
StratifiedKFold, GridSearchCV,
RandomizedSearchCV)
from sklearn.impute import SimpleImputer
from sklearn.exceptions import DataConversionWarning
from sklearn.pipeline import Pipeline as SkPipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import (OneHotEncoder, LabelEncoder,
StandardScaler, MinMaxScaler)
from sklearn.ensemble import (RandomForestClassifier,
GradientBoostingClassifier,
VotingClassifier, BaggingClassifier)
from sklearn.metrics import (classification_report, roc_auc_score,
f1_score, make_scorer, accuracy_score,
confusion_matrix, ConfusionMatrixDisplay,
roc_curve)
from sklearn.inspection import permutation_importance
from sklearn.utils import resample, shuffle
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import ShuffleSplit
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
from catboost import CatBoostClassifier
# Resampling techniques and pipelines (imbalanced data)
from imblearn.pipeline import Pipeline as ImbPipeline
from imblearn.over_sampling import SMOTE
# Categorical encoding
from category_encoders import TargetEncoder
# Warnings
warnings.filterwarnings('ignore')
%load_ext pycodestyle_magic
# %reload_ext pycodestyle_magic
%pycodestyle_on
%flake8_on --max_line_length 79
%matplotlib inline
Data Loading, Cleaning and Initial Exploration¶
main_df = pd.read_csv("assets/data/healthcare-dataset-stroke-data.csv")
main_df.head()
| id | gender | age | hypertension | heart_disease | ever_married | work_type | Residence_type | avg_glucose_level | bmi | smoking_status | stroke | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 9046 | Male | 67.0 | 0 | 1 | Yes | Private | Urban | 228.69 | 36.6 | formerly smoked | 1 |
| 1 | 51676 | Female | 61.0 | 0 | 0 | Yes | Self-employed | Rural | 202.21 | NaN | never smoked | 1 |
| 2 | 31112 | Male | 80.0 | 0 | 1 | Yes | Private | Rural | 105.92 | 32.5 | never smoked | 1 |
| 3 | 60182 | Female | 49.0 | 0 | 0 | Yes | Private | Urban | 171.23 | 34.4 | smokes | 1 |
| 4 | 1665 | Female | 79.0 | 1 | 0 | Yes | Self-employed | Rural | 174.12 | 24.0 | never smoked | 1 |
main_df.stb.freq(['stroke'], cum_cols=False).round(2)
| stroke | count | percent | |
|---|---|---|---|
| 0 | 0 | 4861 | 95.13 |
| 1 | 1 | 249 | 4.87 |
# Drop column = 'id'
main_df.drop(columns='id', inplace=True)
main_df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 5110 entries, 0 to 5109 Data columns (total 11 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 gender 5110 non-null object 1 age 5110 non-null float64 2 hypertension 5110 non-null int64 3 heart_disease 5110 non-null int64 4 ever_married 5110 non-null object 5 work_type 5110 non-null object 6 Residence_type 5110 non-null object 7 avg_glucose_level 5110 non-null float64 8 bmi 4909 non-null float64 9 smoking_status 5110 non-null object 10 stroke 5110 non-null int64 dtypes: float64(3), int64(3), object(5) memory usage: 439.3+ KB
skim(main_df)
╭──────────────────────────────────────────────── skimpy summary ─────────────────────────────────────────────────╮ │ Data Summary Data Types │ │ ┏━━━━━━━━━━━━━━━━━━━┳━━━━━━━━┓ ┏━━━━━━━━━━━━━┳━━━━━━━┓ │ │ ┃ dataframe ┃ Values ┃ ┃ Column Type ┃ Count ┃ │ │ ┡━━━━━━━━━━━━━━━━━━━╇━━━━━━━━┩ ┡━━━━━━━━━━━━━╇━━━━━━━┩ │ │ │ Number of rows │ 5110 │ │ string │ 5 │ │ │ │ Number of columns │ 11 │ │ float64 │ 3 │ │ │ └───────────────────┴────────┘ │ int64 │ 3 │ │ │ └─────────────┴───────┘ │ │ number │ │ ┏━━━━━━━━━━━━━━━━━━━━━┳━━━━━━┳━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━┳━━━━━━━┳━━━━━━━┳━━━━━━━━┓ │ │ ┃ column_name ┃ NA ┃ NA % ┃ mean ┃ sd ┃ p0 ┃ p25 ┃ p50 ┃ p75 ┃ p100 ┃ hist ┃ │ │ ┡━━━━━━━━━━━━━━━━━━━━━╇━━━━━━╇━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━╇━━━━━━━╇━━━━━━━╇━━━━━━━━┩ │ │ │ age │ 0 │ 0 │ 43.23 │ 22.61 │ 0.08 │ 25 │ 45 │ 61 │ 82 │ ▅▆▇▇▇▆ │ │ │ │ hypertension │ 0 │ 0 │ 0.09746 │ 0.2966 │ 0 │ 0 │ 0 │ 0 │ 1 │ ▇ ▁ │ │ │ │ heart_disease │ 0 │ 0 │ 0.05401 │ 0.2261 │ 0 │ 0 │ 0 │ 0 │ 1 │ ▇ │ │ │ │ avg_glucose_level │ 0 │ 0 │ 106.1 │ 45.28 │ 55.12 │ 77.25 │ 91.88 │ 114.1 │ 271.7 │ ▇▅▁▁▁ │ │ │ │ bmi │ 201 │ 3.93 │ 28.89 │ 7.854 │ 10.3 │ 23.5 │ 28.1 │ 33.1 │ 97.6 │ ▅▇▁ │ │ │ │ stroke │ 0 │ 0 │ 0.04873 │ 0.2153 │ 0 │ 0 │ 0 │ 0 │ 1 │ ▇ │ │ │ └─────────────────────┴──────┴───────┴──────────┴─────────┴────────┴────────┴───────┴───────┴───────┴────────┘ │ │ string │ │ ┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┓ │ │ ┃ column_name ┃ NA ┃ NA % ┃ words per row ┃ total words ┃ │ │ ┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━┩ │ │ │ gender │ 0 │ 0 │ 1 │ 5110 │ │ │ │ ever_married │ 0 │ 0 │ 1 │ 5110 │ │ │ │ work_type │ 0 │ 0 │ 1 │ 5110 │ │ │ │ Residence_type │ 0 │ 0 │ 1 │ 5110 │ │ │ │ smoking_status │ 0 │ 0 │ 1.5 │ 7887 │ │ │ └───────────────────────────────┴────────┴────────────┴─────────────────────────────┴────────────────────────┘ │ ╰────────────────────────────────────────────────────── End ──────────────────────────────────────────────────────╯
Strategy for BMI missing data: Median imputation¶
I decided to go with median imputation, since the median is robust to outliers and represents the central tendency well. Given the slight skew, it won't distort the distribution significantly, and it is a good general-purpose strategy for the small share of missing data we have here (201 missing values, about 3.9%).
# Handling missing values in the 'BMI' column by filling with the median value
bmi_median = main_df['bmi'].median()
# Assign the median to the missing BMI values
main_df['bmi'] = main_df['bmi'].fillna(bmi_median)
# Check if there are missing values
missing_values = main_df.isnull().sum()
print("Missing values:")
print(missing_values)
Missing values: gender 0 age 0 hypertension 0 heart_disease 0 ever_married 0 work_type 0 Residence_type 0 avg_glucose_level 0 bmi 0 smoking_status 0 stroke 0 dtype: int64
check_blank_or_whitespace(main_df)
Count of empty strings or single spaces per column: gender 0 age 0 hypertension 0 heart_disease 0 ever_married 0 work_type 0 Residence_type 0 avg_glucose_level 0 bmi 0 smoking_status 0 stroke 0 dtype: int64
check_distinct_values(main_df)
Distinct values for 'gender': ['Male' 'Female' 'Other'] Distinct values for 'ever_married': ['Yes' 'No'] Distinct values for 'work_type': ['Private' 'Self-employed' 'Govt_job' 'children' 'Never_worked'] Distinct values for 'Residence_type': ['Urban' 'Rural'] Distinct values for 'smoking_status': ['formerly smoked' 'never smoked' 'smokes' 'Unknown']
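Both check_blank_or_whitespace and check_distinct_values come from the project's assets.utils.functions module (pulled in by the star import in the settings cell). A minimal sketch of what they presumably do, inferred from their printed output — the exact implementations are assumptions:
def check_blank_or_whitespace(df):
    # Count cells that are empty strings or a single space, per column
    blanks = df.apply(
        lambda col: col.isin(['', ' ']).sum() if col.dtype == object else 0)
    print("Count of empty strings or single spaces per column:")
    print(blanks)

def check_distinct_values(df):
    # Print the unique values of every string column
    for col in df.select_dtypes(include='object').columns:
        print(f"Distinct values for '{col}': {df[col].unique()}")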
Gender Analysis¶
main_df['gender'].value_counts()
gender Female 2994 Male 2115 Other 1 Name: count, dtype: int64
Since there is just one record for gender 'Other', I decided to drop it from the dataset.
main_df.drop(main_df[main_df.gender == 'Other'].index, inplace=True)
main_df.reset_index(drop=True, inplace=True)
Outliers Analysis¶
I want to visualize potential outliers for the features that might present such values. Among the numerical features, I won't treat age, hypertension, or heart disease as having outliers: all age values seem reasonable, and the other two are binary.
# Numerical features to inspect for outliers
numerical_features = ['avg_glucose_level', 'bmi']
# Function to detect outliers using the IQR method
def count_outliers(feature):
Q1 = main_df[feature].quantile(0.25)
Q3 = main_df[feature].quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
outliers = main_df[(
main_df[feature] < lower_bound) | (
main_df[feature] > upper_bound)]
return outliers.shape[0]
# Plot outliers for each numerical feature using boxplots
plt.figure(figsize=(15, 10))
# Loop through each feature, create a boxplot, and print outlier count
for i, feature in enumerate(numerical_features, 1):
# Count the number of outliers
outliers_count = count_outliers(feature)
# Create a boxplot
plt.subplot(2, 3, i)
sns.boxplot(x=main_df[feature])
plt.title(f"Boxplot of {feature} \n Outliers: {outliers_count}")
plt.tight_layout()
plt.show()
Winsorizing (Capping) Outliers¶
For features like BMI and average glucose level, capping outliers is an effective way to handle extreme values without losing data. By capping, we replace values above or below a threshold with the threshold itself, reducing the influence of extreme data points. This can improve model performance, reduce overfitting, aid interpretability, and balance the data distribution.
# Apply capping for 'avg_glucose_level' and 'bmi'
main_df = cap_outliers(main_df, 'avg_glucose_level')
main_df = cap_outliers(main_df, 'bmi')
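The cap_outliers helper also lives in assets.utils.functions and isn't shown in this notebook. A minimal sketch of an IQR-based capping function consistent with the boxplot fences above (the actual implementation may differ):
def cap_outliers(df, feature, factor=1.5):
    # Winsorize: clip values outside the IQR fences to the fence values
    q1 = df[feature].quantile(0.25)
    q3 = df[feature].quantile(0.75)
    iqr = q3 - q1
    lower = q1 - factor * iqr
    upper = q3 + factor * iqr
    df[feature] = df[feature].clip(lower=lower, upper=upper)
    return df
With factor=1.5 this reproduces the 169.365 cap on avg_glucose_level that shows up in the encoded data later in the notebook.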
Data Preparation¶
Grouping Data¶
Binning is useful for further EDA: grouping continuous features into meaningful ranges makes it easier to understand the data and spot patterns.
BMI Common Ranges:
- Underweight: Below 18.5
- Normal weight: 18.5 - 24.9
- Overweight: 25.0 - 29.9
- Obese: 30.0 and above
# Step 1: Define the BMI bins and labels
bmi_bins = [0, 18.5, 24.9, 29.9, float('inf')]
bmi_labels = ['Underweight', 'Normal weight', 'Overweight', 'Obesity']
# Step 2: Create a new column in the dataframe for categorized BMI
main_df['bmi_cat'] = pd.cut(main_df['bmi'], bins=bmi_bins,
labels=bmi_labels, right=False)
# Step 3: Calculate the distribution of each BMI category
bmi_dist = main_df['bmi_cat'].value_counts(normalize=True).sort_index()
# Display the distribution of the bmi levels
print(bmi_dist)
bmi_cat Underweight 0.065962 Normal weight 0.237816 Overweight 0.315326 Obesity 0.380896 Name: proportion, dtype: float64
Common Glucose Level Ranges:
- Low: Below 70 mg/dL
- Normal: 70-99 mg/dL
- Prediabetes: 100-125 mg/dL
- Diabetes: 126 mg/dL and above
# Step 1: glucose level bins and labels
bins = [0, 70, 99, 125, float('inf')] # Use 'inf' for an upper bound
labels = ['Low', 'Normal', 'Prediabetes', 'Diabetes']
# Step 2: New column in the dataframe for categorized glucose levels
main_df['avg_glucose_level_cat'] = pd.cut(main_df['avg_glucose_level'],
bins=bins, labels=labels,
right=False)
# Step 3: Distribution of each glucose level category
glucose_dist = main_df['avg_glucose_level_cat'].value_counts().sort_index()
print(glucose_dist)
avg_glucose_level_cat Low 754 Normal 2315 Prediabetes 1041 Diabetes 999 Name: count, dtype: int64
# Step 1: Define age group bins and labels
age_bins = [0, 2, 12, 18, 35, 60, float('inf')] # Use 'inf' for an upper bound
age_labels = ['Infant', 'Child', 'Adolescent', 'Young Adults',
'Middle Aged Adults', 'Old Aged Adults']
# Step 2: Create a new column in the dataframe for categorized age groups
main_df['age_group_cat'] = pd.cut(main_df['age'],
bins=age_bins, labels=age_labels,
right=False)
# Step 3: Calculate the distribution of each age group category
age_group_dist = main_df['age_group_cat'].value_counts().sort_index()
# Display the distribution of the new age groups
print(age_group_dist)
age_group_cat Infant 120 Child 423 Adolescent 313 Young Adults 988 Middle Aged Adults 1889 Old Aged Adults 1376 Name: count, dtype: int64
# Dataframe info and checks
print(main_df.info())
<class 'pandas.core.frame.DataFrame'> RangeIndex: 5109 entries, 0 to 5108 Data columns (total 14 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 gender 5109 non-null object 1 age 5109 non-null float64 2 hypertension 5109 non-null int64 3 heart_disease 5109 non-null int64 4 ever_married 5109 non-null object 5 work_type 5109 non-null object 6 Residence_type 5109 non-null object 7 avg_glucose_level 5109 non-null float64 8 bmi 5109 non-null float64 9 smoking_status 5109 non-null object 10 stroke 5109 non-null int64 11 bmi_cat 5109 non-null category 12 avg_glucose_level_cat 5109 non-null category 13 age_group_cat 5109 non-null category dtypes: category(3), float64(3), int64(3), object(5) memory usage: 454.8+ KB None
EDA¶
Univariate Analysis¶
The goal of univariate analysis is to explore each feature on its own: different plot types visualize the distributions and surface patterns in the data.
# Plot distributions of all features ('id' was already dropped above)
univariate_analysis(main_df)
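univariate_analysis is another helper from assets.utils.functions. A rough sketch of the plotting loop it presumably wraps (the panel layout and plot choices are assumptions):
def univariate_analysis(df):
    # Bar charts for categorical columns, histograms for numeric ones
    cat_cols = list(df.select_dtypes(include=['object', 'category']).columns)
    num_cols = list(df.select_dtypes(include='number').columns)
    cols = cat_cols + num_cols
    n_rows = (len(cols) + 2) // 3
    fig, axes = plt.subplots(n_rows, 3, figsize=(18, 4 * n_rows))
    for ax, col in zip(axes.ravel(), cols):
        if col in cat_cols:
            sns.countplot(data=df, x=col, ax=ax)
        else:
            sns.histplot(data=df, x=col, kde=True, ax=ax)
        ax.set_title(f"Distribution of {col}")
    for ax in axes.ravel()[len(cols):]:
        ax.set_visible(False)  # hide unused panels
    plt.tight_layout()
    plt.show()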
Insights from Univariate Plots:¶
- Gender Distribution:
- The dataset has a higher proportion of females (58.6%) compared to males (41.4%). The single record for “Other” gender was already dropped above.
- Age Distribution:
- The age distribution appears uniform, with most individuals in the range of 20 to 80 years old. This balanced distribution should help the model generalize well across different age groups.
- Hypertension Distribution:
- A large proportion of individuals do not have hypertension (90.3%), with only 9.7% reporting hypertension. This imbalance might be an important feature to focus on, as it could significantly impact stroke prediction.
- Heart Disease Distribution:
- Most individuals do not have heart disease (94.6%), with only 5.4% having the condition. Like hypertension, this imbalance could make heart disease a critical indicator for stroke risk.
- Marital Status Distribution:
- The majority of individuals are married (65.6%). Marital status could have implications on lifestyle factors that may indirectly affect stroke risk.
- Work Type Distribution:
- Most individuals work in the private sector (57.6%), while a smaller percentage are self-employed (16%) or work in government jobs (12.9%). The “children” category accounts for 13.4%, while “never worked” is a minimal category (0.4%).
- Residence Type Distribution:
- The distribution between urban and rural populations is nearly equal, with a slight tilt toward urban areas (50.8%).
- Average Glucose Level Distribution (after capping):
- The glucose levels have been capped, and we see a smoother distribution between 60-140 mg/dL. Some individuals still sit at the upper cap (about 169 mg/dL), but these extremes are now controlled, which reduces skewness and the impact of outliers.
- BMI Distribution (after capping):
- After capping BMI, the distribution appears more normalized, with most individuals falling within the 20-35 range. Extreme values beyond 45 have been capped, preventing those extreme points from skewing the model’s performance.
- Smoking Status Distribution:
- There is a significant portion of “unknown” smoking status (30.7%). However, the majority of individuals have “never smoked” (37.9%), and smaller portions fall into “formerly smoked” (17.3%) and “smokes” (15.4%) categories.
- Stroke Distribution:
- There is a heavy imbalance between the no-stroke group (95.1%) and the stroke group (4.9%). This imbalance will need to be addressed in the model through techniques like resampling or adjusting class weights.
- BMI Categories Distribution:
- The distribution of BMI categories shows that the largest portion of the population falls under “Obesity” (38.1%) and “Overweight” (31.5%), while fewer individuals are in the “Normal weight” (23.8%) or “Underweight” (6.6%) categories. Capping helped smooth the extreme BMI values.
- Average Glucose Level Categories Distribution:
- The majority of the population falls in the “Normal” range (45.3%), while the rest are distributed across “Low” (14.8%), “Prediabetes” (20.4%), and “Diabetes” (19.6%) categories. These categories could serve as useful features for stroke prediction.
Overall, after capping the BMI and glucose levels, the distributions appear more normalized, helping prevent the impact of extreme values on model performance. The insights from this univariate analysis will help guide feature engineering and model improvement strategies.
Now it's important to run a bivariate analysis to understand the relationship between the features and the target variable, stroke.
features_stroke(main_df)
Insights from the Features vs. Stroke Plots:¶
- Gender vs Stroke:
- Both males and females have a similar proportion of stroke occurrences, with slightly more males experiencing strokes. (The single “Other” record was removed earlier, so it does not appear here.)
- Age vs Stroke:
- Stroke occurrences increase with age, with the median age for stroke cases higher than for non-stroke cases. There is a clear trend showing older individuals are more likely to have a stroke, as expected.
- Hypertension vs Stroke:
- Individuals with hypertension are more likely to experience strokes. While most of the dataset does not have hypertension, those with hypertension show a higher proportion of strokes.
- Heart Disease vs Stroke:
- The presence of heart disease is strongly associated with stroke. A higher proportion of individuals with heart disease experience strokes compared to those without heart disease.
- Ever Married vs Stroke:
- There is a slight trend indicating that married individuals might be more prone to strokes. However, the difference is not as pronounced.
- Work Type vs Stroke:
- People in the private sector or self-employed seem to have more stroke occurrences compared to other work types. Those who have never worked and government employees show fewer strokes, potentially due to lifestyle or stress factors related to different work types.
- Residence Type vs Stroke:
- There is no significant difference in stroke occurrence between urban and rural residents, indicating that residential location may not be a strong determinant of stroke.
- Average Glucose Level vs Stroke:
- Individuals with higher average glucose levels tend to have more strokes. The median glucose level is notably higher for individuals with stroke, highlighting that high glucose is an important factor in stroke risk.
- BMI vs Stroke:
- Individuals with higher BMI values seem to have a higher stroke risk. The median BMI for stroke cases is higher, indicating that obesity is likely a risk factor for stroke.
- Smoking Status vs Stroke:
- Individuals who formerly smoked or currently smoke show a higher risk of stroke. Interestingly, the “unknown” category also shows a non-negligible number of strokes, which could suggest missing data or unreported smoking habits are still relevant to stroke risk.
- Average Glucose Level Categories vs Stroke:
- A clear trend emerges where individuals with “Prediabetes” or “Diabetes” categories show a higher proportion of strokes. This emphasizes the importance of glucose control in preventing strokes.
- BMI Categories vs Stroke:
- The risk of stroke increases with higher BMI categories. Individuals in the “Obesity” category have the highest stroke risk, followed by those in the “Overweight” category, further reinforcing the link between obesity and stroke risk.
Conclusion:
Key Risk Factors: age, hypertension, heart disease, high glucose levels, and higher BMI are strong indicators of stroke risk. Smoking status also plays a role, with smokers and former smokers showing a higher tendency for strokes. Lifestyle and Health Conditions: Factors such as smoking status and marital status show potential associations but are less conclusive.
Hypertension analysis with other features¶
Now I want to explore what kind of impact hypertension has on other features. I will analyze the relationship between hypertension and other features to understand how hypertension affects the likelihood of stroke.
# Create a figure and axes for 3 subplots
fig, axes = plt.subplots(1, 3, figsize=(24, 6))
# Plot for hypertension VS age VS gender
sns.boxplot(data=main_df, x='hypertension', y='age', hue='gender', ax=axes[0])
axes[0].set_title('Hypertension vs Age by Gender')
# Plot for hypertension VS avg_glucose_level VS gender
sns.boxplot(data=main_df, x='hypertension', y='avg_glucose_level',
hue='gender', ax=axes[1])
axes[1].set_title('Hypertension vs Avg Glucose Level by Gender')
# Plot for hypertension VS bmi VS gender
sns.boxplot(data=main_df, x='hypertension', y='bmi', hue='gender', ax=axes[2])
axes[2].set_title('Hypertension vs BMI by Gender')
# Adjust layout to prevent overlapping
plt.tight_layout()
plt.show()
multiple_distribution(
main_df,
group_by='hypertension',
hue='heart_disease',
column='gender',
legend_loc=(1, 1.1)
)
multiple_distribution(
main_df,
group_by='hypertension',
hue='heart_disease',
column='stroke',
legend_loc=(1, 1.1)
)
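multiple_distribution likewise comes from assets.utils.functions. A plausible sketch, assuming it draws one countplot of group_by split by hue for each value of column (the percentage annotations in the real plots are omitted here):
def multiple_distribution(df, group_by, hue, column, legend_loc=(1, 1)):
    # One panel per value of `column`, counting `hue` within `group_by`
    values = df[column].unique()
    fig, axes = plt.subplots(1, len(values), figsize=(7 * len(values), 5))
    for ax, val in zip(np.atleast_1d(axes), values):
        subset = df[df[column] == val]
        sns.countplot(data=subset, x=group_by, hue=hue, ax=ax)
        ax.set_title(f"{column} = {val}")
        ax.legend(title=hue, bbox_to_anchor=legend_loc)
    plt.tight_layout()
    plt.show()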
Insights from the Hypertension Plots:¶
Hypertension vs Age by Gender:
- Median Age Difference: Individuals with hypertension are generally older compared to those without hypertension, across all genders.
- Age Distribution: The age range is more concentrated between 40 and 60 for hypertensive individuals, while non-hypertensive individuals have a broader age distribution, especially in the younger age group.
Hypertension vs Avg Glucose Level by Gender:
- Higher Glucose Levels: Individuals with hypertension tend to have higher average glucose levels across all genders, indicating a possible correlation between high glucose levels and hypertension.
- Outliers: There are significant outliers in the non-hypertensive group, suggesting some individuals without hypertension still have high glucose levels.
Hypertension vs BMI by Gender:
- Similar BMI Distribution: The BMI distribution is relatively similar across both hypertensive and non-hypertensive groups, though hypertensive individuals have slightly higher median BMI values.
- Outliers in BMI: There are numerous outliers in both groups, especially among non-hypertensive individuals, indicating a wide range of BMI values.
Heart Disease and Hypertension:
- The presence of heart disease is more prevalent in individuals with hypertension than in those without, regardless of stroke status.
- For individuals without a stroke (stroke = 0), the percentage with heart disease is higher in those with hypertension (11.8%) than in those without (4.0%).
- For individuals with a stroke (stroke = 1), the percentage with heart disease is also higher in those with hypertension (19.7%) than in those without (18.6%).
Impact of Stroke:
- Overall, heart disease is significantly more common in stroke patients than in non-stroke patients, for both hypertension categories.
Relative Comparison:
- The difference in heart disease prevalence between hypertensive and non-hypertensive groups is more pronounced in individuals without a stroke.
Conclusion:
Age, heart disease, and average glucose level are significant indicators of hypertension across different genders. Individuals with hypertension are generally older and have higher average glucose levels, indicating a potential link between these factors. Additionally, heart disease is more prevalent among hypertensive individuals, especially those who have suffered a stroke. BMI, on the other hand, shows a more varied distribution, suggesting that while it is related to hypertension, it may not be as strong an indicator as the other factors mentioned. These findings highlight the importance of monitoring glucose levels and cardiovascular health in managing hypertension, particularly in older populations.
Smoking Status analysis with other features¶
multiple_distribution(
main_df,
group_by='hypertension',
hue='smoking_status',
column='stroke',
legend_loc=(1, 1.1)
)
Interestingly, among individuals with hypertension, the smoking-status pattern looks essentially the same whether or not they had a stroke.
# Step 1: Create a crosstab of smoking status, hypertension, and heart disease
crosstab_df = pd.crosstab(
index=[main_df['smoking_status'], main_df['hypertension']],
columns=main_df['heart_disease'],
normalize='index'
)
# Step 2: Create a heatmap from the crosstab
plt.figure(figsize=(10, 6))
sns.heatmap(crosstab_df, annot=True, cmap='Blues', fmt=".1%",
cbar_kws={'label': 'Percentage'})
plt.title('Heatmap of Smoking Status, Hypertension, and Heart Disease')
plt.xlabel('Heart Disease')
plt.ylabel('Smoking Status and Hypertension')
plt.show()
Insights from the Smoking Status Plots:¶
- Hypertension and Smoking:
  - Among individuals with hypertension and no stroke, the majority never smoked (46.3%), followed by formerly smoked (23.4%).
  - In hypertensive individuals with a stroke, the percentage of those who never smoked is still the highest (48.5%), indicating smoking may not be the primary factor for strokes in hypertensive individuals.
- Non-Hypertensive Individuals:
  - Those without hypertension but with a stroke have higher percentages of individuals who formerly smoked (28.8%) and currently smoke (16.7%), suggesting smoking might play a role in stroke risk for non-hypertensive individuals.
- Stroke Influence:
  - For individuals with hypertension, smoking status shows relatively higher percentages of never smoked and formerly smoked compared to those without hypertension, regardless of stroke status.
Insights from the Second Plot (Smoking Status, Hypertension, and Heart Disease):
- Heart Disease and Hypertension Correlation:
- The risk of heart disease is notably higher in individuals who are hypertensive and smoke. For example, 16% of hypertensive smokers have heart disease compared to 6.6% of non-hypertensive smokers.
- The highest risk group for heart disease is hypertensive individuals who have formerly smoked (17.5%).
- Lower Heart Disease Prevalence:
- Non-hypertensive individuals across all smoking statuses have lower percentages of heart disease, indicating hypertension is a stronger predictor for heart disease than smoking alone.
- Unknown Smoking Status:
- Individuals with unknown smoking status but without hypertension show the lowest prevalence of heart disease (2.9%), highlighting that hypertension remains a critical factor.
Overall, these visualizations emphasize the interplay between hypertension, smoking status, and both stroke and heart disease risk, suggesting that hypertension is a stronger risk factor for heart disease, while smoking impacts stroke risk more significantly in non-hypertensive individuals.
bivariate_features(main_df,
group_by='avg_glucose_level_cat',
compare_by='hypertension')
bivariate_features(main_df,
group_by='work_type',
compare_by='avg_glucose_level_cat')
bivariate_features(main_df,
group_by='work_type',
compare_by='hypertension')
Correlation Matrix¶
# Create a copy of the dataframe to avoid modifying the original
corr_data = main_df.copy()
# Encoding categorical features for correlation analysis
encoder = LabelEncoder()
categorical_cols = corr_data.select_dtypes(
include=["object", "category"]).columns
# Encode categorical columns into numerical values
for col in categorical_cols:
corr_data[col] = encoder.fit_transform(corr_data[col])
# Compute correlation matrix
corr_matrix = corr_data.corr()
# Generate mask for the upper triangle
mask = np.triu(np.ones_like(corr_matrix, dtype=bool))
# Set up the matplotlib figure
plt.figure(figsize=(16, 13))
# Draw the heatmap with the mask
sns.heatmap(corr_matrix, mask=mask, annot=True, cmap='coolwarm', square=True)
plt.title("Correlation Heatmap")
plt.show()
Insights from the Correlation Matrix:¶
Correlation coefficient (r):
Values range from -1 to +1.
- +1: Perfect positive correlation, meaning as one feature increases, the other increases.
- -1: Perfect negative correlation, meaning as one feature increases, the other decreases.
- 0: No correlation, meaning there is no linear relationship between the features.
age and hypertension have a correlation of 0.28, indicating a weak-to-moderate positive relationship: older people tend to have more hypertension.
Ever_married and age show a high positive correlation of 0.68, which is quite intuitive since older individuals are more likely to be married.
BMI and work_type have a negative correlation of -0.32, suggesting that certain work types might be linked to lower BMI values.
Most features show small to moderate correlations, with the highest being between age and marital status. None of the correlations are strong enough to indicate significant multicollinearity, but age, hypertension, heart disease, and marital status appear to have some influence on each other and on the target variable (stroke).
Multicollinearity Analysis¶
# Calculate VIF for each feature in corr_data
X_vif = corr_data.drop(columns=['stroke'])
vif_data = pd.DataFrame()
vif_data["Feature"] = X_vif.columns
vif_data["VIF"] = [variance_inflation_factor(
X_vif.values, i) for i in range(X_vif.shape[1])]
# Display VIF results
print(vif_data)
Feature VIF 0 gender 1.732442 1 age 10.805309 2 hypertension 1.217095 3 heart_disease 1.166223 4 ever_married 5.715051 5 work_type 4.890263 6 Residence_type 1.995980 7 avg_glucose_level 10.433977 8 bmi 16.321572 9 smoking_status 3.125287 10 bmi_cat 2.871162 11 avg_glucose_level_cat 3.663102 12 age_group_cat 8.661315
Insights: The VIF values can be interpreted as follows:
- VIF < 5: Low to moderate multicollinearity. This is generally acceptable.
- VIF between 5 and 10: Moderate multicollinearity, which may require further investigation.
- VIF > 10: High multicollinearity, signaling problematic relationships between variables that could affect model stability.
Features like age (10.81), bmi (16.32), and avg_glucose_level (10.43) show high VIF values, indicating potential multicollinearity among them. (VIF for feature i is 1 / (1 − R²ᵢ), where R²ᵢ comes from regressing feature i on all the other features.) However, since the upcoming preprocessing steps will help reduce multicollinearity, I decided not to drop any features based on the VIF values.
Statistical Inference and Hypothesis Testing¶
Since I want to test multiple hypotheses, I will run a test for each feature, as follows (a sketch of the test helpers appears after this list):
- Continuous variables: T-tests
- Categorical variables: Chi-Square tests
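Both test helpers come from assets.utils.functions. A minimal sketch matching the printed output below — Welch's t-test with a statsmodels confidence interval, and a chi-square test of independence; the exact implementations are assumptions:
def perform_ttests(df, feature, alpha=0.05):
    # Welch's t-test comparing the feature mean between stroke groups
    group1 = df.loc[df['stroke'] == 1, feature]
    group0 = df.loc[df['stroke'] == 0, feature]
    t_stat, p_value = ttest_ind(group1, group0, equal_var=False)
    cm = CompareMeans(DescrStatsW(group1), DescrStatsW(group0))
    ci_low, ci_high = cm.tconfint_diff(alpha=alpha, usevar='unequal')
    print(f"Feature: {feature}")
    print(f"T-Statistic: {t_stat:.4f}, P-Value: {p_value}")
    print("95% Confidence Interval for the Difference in Means:",
          (ci_low, ci_high))
    if p_value < alpha:
        print(f"Reject the null hypothesis: Significant difference in "
              f"{feature} between stroke and no-stroke groups.")
    else:
        print(f"Fail to reject the null hypothesis: No significant "
              f"difference in {feature} between stroke and no-stroke groups.")

def perform_chi2_test(df, feature, alpha=0.05):
    # Chi-square test of independence between the feature and stroke
    contingency = pd.crosstab(df[feature], df['stroke'])
    chi2_stat, p_value, dof, expected = chi2_contingency(contingency)
    print(f"Feature: {feature}")
    print(f"Chi-Square Statistic: {chi2_stat:.4f}, P-Value: {p_value}")
    if p_value < alpha:
        print(f"Reject the null hypothesis: Significant association "
              f"between {feature} and stroke.")
    else:
        print(f"Fail to reject the null hypothesis: No significant "
              f"association between {feature} and stroke.")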
main_df['stroke'] = main_df['stroke'].astype(int)
# Continuous features for T-Test
continuous_features = ['avg_glucose_level', 'bmi', 'age',
'heart_disease', 'hypertension']
for feature in continuous_features:
perform_ttests(main_df, feature)
# Categorical features for Chi-Square Test
categorical_features = ['work_type', 'smoking_status',
'ever_married', 'Residence_type', 'gender']
for feature in categorical_features:
perform_chi2_test(main_df, feature)
Feature: avg_glucose_level T-Statistic: 6.6415, P-Value: 1.766534321786102e-10 95% Confidence Interval for the Difference in Means: (12.558552333081456, 23.142887178049982) Reject the null hypothesis: Significant difference in avg_glucose_level between stroke and no-stroke groups. Feature: bmi T-Statistic: 3.7795, P-Value: 0.00019067641466155664 95% Confidence Interval for the Difference in Means: (0.6761611400158061, 2.145547422629734) Reject the null hypothesis: Significant difference in bmi between stroke and no-stroke groups. Feature: age T-Statistic: 29.6819, P-Value: 2.175773269747846e-95 95% Confidence Interval for the Difference in Means: (24.046577639007694, 27.460145351720612) Reject the null hypothesis: Significant difference in age between stroke and no-stroke groups. Feature: heart_disease T-Statistic: 5.6578, P-Value: 4.1031523007226445e-08 95% Confidence Interval for the Difference in Means: (0.0923369828292136, 0.19093437420385695) Reject the null hypothesis: Significant difference in heart_disease between stroke and no-stroke groups. Feature: hypertension T-Statistic: 6.2202, P-Value: 1.985812780520689e-09 95% Confidence Interval for the Difference in Means: (0.1203992598303807, 0.23194344431955238) Reject the null hypothesis: Significant difference in hypertension between stroke and no-stroke groups. Feature: work_type Chi-Square Statistic: 49.1591, P-Value: 5.40903546949726e-10 Reject the null hypothesis: Significant association between work_type and stroke. Feature: smoking_status Chi-Square Statistic: 29.2257, P-Value: 2.007704175610833e-06 Reject the null hypothesis: Significant association between smoking_status and stroke. Feature: ever_married Chi-Square Statistic: 58.8678, P-Value: 1.686285619167346e-14 Reject the null hypothesis: Significant association between ever_married and stroke. Feature: Residence_type Chi-Square Statistic: 1.0750, P-Value: 0.29982523877153633 Fail to reject the null hypothesis: No significant association between Residence_type and stroke. Feature: gender Chi-Square Statistic: 0.3400, P-Value: 0.5598277580669416 Fail to reject the null hypothesis: No significant association between gender and stroke.
Since Residence_type and gender show no significant association with the target variable, we could drop them. However, we will keep them for now and see how the model performs.
Data Preprocessing¶
# Drop unneeded columns
columns_to_drop = ['avg_glucose_level_cat', 'bmi_cat', 'age_group_cat']
main_df.drop(columns=columns_to_drop, errors='ignore', inplace=True)
# Encode the categorical features with explicit ordinal mappings
encoded_df = main_df.copy()
encoded_df.ever_married = main_df.ever_married.map(
{'No': 0, 'Yes': 1})
encoded_df.smoking_status = main_df.smoking_status.map(
{'never smoked': 0, 'formerly smoked': 1,
'smokes': 2, 'Unknown': 3})
encoded_df.work_type = main_df.work_type.map(
{'Private': 0, 'Self-employed': 1, 'children': 2,
'Govt_job': 3, 'Never_worked': 4})
encoded_df.Residence_type = main_df.Residence_type.map(
{'Rural': 0, 'Urban': 1})
encoded_df.gender = main_df.gender.map(
{'Male': 0, 'Female': 1})
# Preview the encoded dataframe
encoded_df.head()
| gender | age | hypertension | heart_disease | ever_married | work_type | Residence_type | avg_glucose_level | bmi | smoking_status | stroke | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 67.0 | 0 | 1 | 1 | 0 | 1 | 169.365 | 36.6 | 1 | 1 |
| 1 | 1 | 61.0 | 0 | 0 | 1 | 1 | 0 | 169.365 | 28.1 | 0 | 1 |
| 2 | 0 | 80.0 | 0 | 1 | 1 | 0 | 0 | 105.920 | 32.5 | 0 | 1 |
| 3 | 1 | 49.0 | 0 | 0 | 1 | 0 | 1 | 169.365 | 34.4 | 2 | 1 |
| 4 | 1 | 79.0 | 1 | 0 | 1 | 1 | 0 | 169.365 | 24.0 | 0 | 1 |
X = encoded_df.drop('stroke', axis=1)
y = encoded_df['stroke']
# Split the data - avoid data leakage
train_x, test_x, train_y, test_y = train_test_split(
X, y, test_size=0.2, random_state=42, stratify=y)
# Fit imputation on the training data only
# (bmi was already filled above, so this is a defensive no-op here)
imputer = SimpleImputer(strategy='mean')
train_x['bmi'] = imputer.fit_transform(train_x[['bmi']])
# Apply the same imputation on test data
test_x['bmi'] = imputer.transform(test_x[['bmi']])
# Label Encoding
le = LabelEncoder()
train_x['gender'] = le.fit_transform(train_x['gender'])
# Apply the same encoding on test data
test_x['gender'] = le.transform(test_x['gender'])
# Scale the training data
scaler = StandardScaler()
train_x[['age', 'avg_glucose_level', 'bmi']] = scaler.fit_transform(
train_x[['age', 'avg_glucose_level', 'bmi']])
# Apply the scaling to the test data (do not fit again)
test_x[['age', 'avg_glucose_level', 'bmi']] = scaler.transform(
test_x[['age', 'avg_glucose_level', 'bmi']])
# Apply SMOTE to only the training data
smote = SMOTE(random_state=42)
balanced_train_x, balanced_train_y = smote.fit_resample(train_x, train_y)
Baseline Model¶
# Train model
model = RandomForestClassifier(random_state=42)
model.fit(balanced_train_x, balanced_train_y)
# Evaluate on the test set (unseen data)
y_pred = model.predict(test_x)
# Print evaluation metrics
print("Classification Report:\n", classification_report(test_y, y_pred))
print("ROC-AUC Score:",
roc_auc_score(test_y, model.predict_proba(test_x)[:, 1]))
Classification Report:
precision recall f1-score support
0 0.96 0.92 0.94 972
1 0.13 0.22 0.16 50
accuracy 0.89 1022
macro avg 0.54 0.57 0.55 1022
weighted avg 0.92 0.89 0.90 1022
ROC-AUC Score: 0.7478086419753087
Baseline Results Conclusion:¶
The baseline model demonstrates strong performance for class 0 (non-stroke), with 96% precision and 92% recall, indicating that the model is very good at identifying non-stroke cases. However, it struggles significantly with class 1 (stroke), showing only 13% precision and 22% recall, which means it has a high false positive rate and misses many true stroke cases.
The overall accuracy is 89%, but this is mainly driven by the dominant class 0. The macro average (0.55 f1-score) and the ROC-AUC score of 0.748 reflect the model’s limitations in handling the minority class (stroke cases). This imbalance suggests that further work on class balancing and model improvement is needed to better capture stroke predictions.
Below I will adjust the class weights to see if that improves model performance, while also including additional models to test; each one is evaluated with the evaluate_model helper sketched next.
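The evaluate_model helper is also from assets.utils.functions; a minimal sketch consistent with the printed reports (assumed, not the actual implementation):
def evaluate_model(model, train_x, train_y, test_x, test_y, model_name):
    # Fit on the (resampled) training data, report on the untouched test set
    model.fit(train_x, train_y)
    y_pred = model.predict(test_x)
    print(f"{model_name} Results:\n",
          classification_report(test_y, y_pred))
    print(f"{model_name} ROC-AUC Score:",
          roc_auc_score(test_y, model.predict_proba(test_x)[:, 1]))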
models = {
'CatBoost': CatBoostClassifier(
auto_class_weights='Balanced', iterations=500,
depth=6, learning_rate=0.1, silent=True),
'Gradient Boosting': GradientBoostingClassifier(
n_estimators=300, max_depth=10, learning_rate=0.2,
subsample=0.7, random_state=42 # No class_weight
),
'XGBoost': XGBClassifier(
n_estimators=300, max_depth=10, learning_rate=0.2,
colsample_bytree=0.8, subsample=0.7,
scale_pos_weight=10, # Balance classes
random_state=42
),
'LightGBM': LGBMClassifier(
n_estimators=300, max_depth=6, learning_rate=0.1,
class_weight='balanced', # Automatically balances classes
subsample=0.7, verbosity=-1, random_state=42
),
'Random Forest': RandomForestClassifier(
n_estimators=200, max_depth=10, min_samples_split=10,
min_samples_leaf=4, class_weight={0: 1, 1: 5},
random_state=42
)
}
# Evaluate each model
for model_name, model in models.items():
evaluate_model(model, balanced_train_x,
balanced_train_y, test_x, test_y, model_name)
CatBoost Results:
precision recall f1-score support
0 0.95 0.96 0.96 972
1 0.11 0.10 0.10 50
accuracy 0.92 1022
macro avg 0.53 0.53 0.53 1022
weighted avg 0.91 0.92 0.91 1022
CatBoost ROC-AUC Score: 0.7602880658436214
Gradient Boosting Results:
precision recall f1-score support
0 0.96 0.96 0.96 972
1 0.15 0.12 0.13 50
accuracy 0.92 1022
macro avg 0.55 0.54 0.55 1022
weighted avg 0.92 0.92 0.92 1022
Gradient Boosting ROC-AUC Score: 0.7665020576131687
XGBoost Results:
precision recall f1-score support
0 0.96 0.90 0.93 972
1 0.11 0.24 0.15 50
accuracy 0.86 1022
macro avg 0.53 0.57 0.54 1022
weighted avg 0.92 0.86 0.89 1022
XGBoost ROC-AUC Score: 0.7423045267489711
LightGBM Results:
precision recall f1-score support
0 0.96 0.96 0.96 972
1 0.15 0.14 0.15 50
accuracy 0.92 1022
macro avg 0.55 0.55 0.55 1022
weighted avg 0.92 0.92 0.92 1022
LightGBM ROC-AUC Score: 0.7653703703703704
Random Forest Results:
precision recall f1-score support
0 0.99 0.65 0.78 972
1 0.11 0.82 0.19 50
accuracy 0.66 1022
macro avg 0.55 0.73 0.49 1022
weighted avg 0.94 0.66 0.75 1022
Random Forest ROC-AUC Score: 0.7678395061728395
Summary:¶
While all models perform very well on the majority class (stroke=0), there is a consistent struggle to effectively handle the minority class (stroke=1). Random Forest stands out with high recall for the minority class but sacrifices overall accuracy. Other models, such as CatBoost, Gradient Boosting, and LightGBM, balance overall accuracy well but fail to significantly improve minority class detection.
Hyperparameter Tuning including Class Weights¶
# Hyperparameter grids for CatBoost
catboost_param_grid = {
'iterations': [200, 300, 500],
'depth': [6, 8, 10],
'learning_rate': [0.01, 0.1, 0.2],
'l2_leaf_reg': [1, 3, 5],
'auto_class_weights': ['Balanced'] # Keep class balancing
}
# Hyperparameter grids for Gradient Boosting
gb_param_grid = {
'n_estimators': [200, 300, 500],
'max_depth': [6, 8, 10],
'learning_rate': [0.01, 0.1, 0.2],
'min_samples_split': [2, 5, 10],
'min_samples_leaf': [1, 2, 4]
}
# Hyperparameter grids for XGBoost
xgb_param_grid = {
'n_estimators': [200, 300, 500],
'max_depth': [6, 8, 10],
'learning_rate': [0.01, 0.1, 0.2],
'scale_pos_weight': [10, 20, 30], # Adjust for class imbalance
'subsample': [0.7, 0.8, 1.0],
'colsample_bytree': [0.7, 0.8, 1.0]
}
# Hyperparameter grids for Random Forest
rf_param_grid = {
'n_estimators': [200, 300, 500],
'max_depth': [10, 20, None],
'min_samples_split': [2, 5, 10],
'min_samples_leaf': [1, 2, 4],
'class_weight': [{0: 1, 1: 5}, {0: 1, 1: 10}, {0: 1, 1: 20}]
}
# Hyperparameter grids for LightGBM
lgbm_param_grid = {
'n_estimators': [200, 300, 500],
'max_depth': [6, 8, 10],
'learning_rate': [0.01, 0.1, 0.2],
'class_weight': ['balanced'],
'subsample': [0.7, 0.8, 1.0]
}
# Define the models
catboost_model = CatBoostClassifier(
auto_class_weights='Balanced', silent=True, random_state=42)
xgb_model = XGBClassifier(
use_label_encoder=False, random_state=42, scale_pos_weight=20,
eval_metric='logloss', verbosity=0)
lgbm_model = LGBMClassifier(
class_weight='balanced', random_state=42, verbosity=-1)
rf_model = RandomForestClassifier(class_weight='balanced',
random_state=42)
gb_model = GradientBoostingClassifier(random_state=42)
# RandomizedSearchCV for CatBoost
catboost_search = RandomizedSearchCV(
catboost_model, param_distributions=catboost_param_grid,
n_iter=10, scoring='precision', cv=3, verbose=0,
random_state=42, n_jobs=-1)
# RandomizedSearchCV for XGBoost
xgb_search = RandomizedSearchCV(
xgb_model, param_distributions=xgb_param_grid,
n_iter=10, scoring='precision', cv=3, verbose=0,
random_state=42, n_jobs=-1)
# RandomizedSearchCV for Random Forest
rf_search = RandomizedSearchCV(
rf_model, param_distributions=rf_param_grid, n_iter=10,
scoring='precision', cv=3, verbose=0, random_state=42, n_jobs=-1)
# RandomizedSearchCV for LightGBM
lgbm_search = RandomizedSearchCV(
lgbm_model, param_distributions=lgbm_param_grid, n_iter=10,
scoring='precision', cv=3, verbose=0, random_state=42, n_jobs=-1)
# RandomizedSearchCV for Gradient Boosting
gb_search = RandomizedSearchCV(
gb_model, param_distributions=gb_param_grid, n_iter=10,
scoring='precision', cv=3, verbose=0, random_state=42, n_jobs=-1)
# Fit the searches to the training data
catboost_search.fit(balanced_train_x, balanced_train_y)
xgb_search.fit(balanced_train_x, balanced_train_y)
rf_search.fit(balanced_train_x, balanced_train_y)
lgbm_search.fit(balanced_train_x, balanced_train_y)
gb_search.fit(balanced_train_x, balanced_train_y)
# Get Hyperparameters
print("Best CatBoost Parameters:", catboost_search.best_params_)
print("Best XGBoost Parameters:", xgb_search.best_params_)
print("Best Random Forest Parameters:", rf_search.best_params_)
print("Best LightGBM Parameters:", lgbm_search.best_params_)
print("Best Gradient Boosting Parameters:", gb_search.best_params_)
Best CatBoost Parameters: {'learning_rate': 0.1, 'l2_leaf_reg': 3, 'iterations': 500, 'depth': 6, 'auto_class_weights': 'Balanced'}
Best XGBoost Parameters: {'subsample': 0.7, 'scale_pos_weight': 10, 'n_estimators': 300, 'max_depth': 10, 'learning_rate': 0.2, 'colsample_bytree': 0.8}
Best Random Forest Parameters: {'n_estimators': 500, 'min_samples_split': 5, 'min_samples_leaf': 1, 'max_depth': 20, 'class_weight': {0: 1, 1: 10}}
Best LightGBM Parameters: {'subsample': 0.7, 'n_estimators': 300, 'max_depth': 6, 'learning_rate': 0.1, 'class_weight': 'balanced'}
Best Gradient Boosting Parameters: {'n_estimators': 500, 'min_samples_split': 10, 'min_samples_leaf': 2, 'max_depth': 10, 'learning_rate': 0.1}
Now that RandomizedSearchCV has found the best hyperparameters for each model, I will retrain the models with those hyperparameters and class weights to see if performance improves.
# Train and evaluate CatBoost
best_catboost = CatBoostClassifier(
**catboost_search.best_params_, random_state=42, silent=True)
best_catboost.fit(balanced_train_x, balanced_train_y)
y_pred_catboost = best_catboost.predict(test_x)
print("CatBoost Results after Hyperparameter:\n",
classification_report(test_y, y_pred_catboost))
print("CatBoost ROC-AUC after Hyperparameter:",
roc_auc_score(test_y, best_catboost.predict_proba(test_x)[:, 1]))
# Train and evaluate XGBoost
best_xgb = XGBClassifier(
**xgb_search.best_params_, random_state=42, use_label_encoder=False,
eval_metric='logloss', verbosity=0)
best_xgb.fit(balanced_train_x, balanced_train_y)
y_pred_xgb = best_xgb.predict(test_x)
print("XGBoost Results after Hyperparameter:\n",
classification_report(test_y, y_pred_xgb))
print("XGBoost ROC-AUC after Hyperparameter:",
roc_auc_score(test_y, best_xgb.predict_proba(test_x)[:, 1]))
# Train and evaluate Random Forest
best_rf = RandomForestClassifier(
**rf_search.best_params_, random_state=42)
best_rf.fit(balanced_train_x, balanced_train_y)
y_pred_rf = best_rf.predict(test_x)
print("Random Forest Results after Hyperparameter:\n",
classification_report(test_y, y_pred_rf))
print("Random Forest ROC-AUC after Hyperparameter:",
roc_auc_score(test_y, best_rf.predict_proba(test_x)[:, 1]))
# Train and evaluate LightGBM
best_lgbm = LGBMClassifier(
**lgbm_search.best_params_, random_state=42, verbosity=-1)
best_lgbm.fit(balanced_train_x, balanced_train_y)
y_pred_lgbm = best_lgbm.predict(test_x)
print("LightGBM Results after Hyperparameter:\n",
classification_report(test_y, y_pred_lgbm))
print("LightGBM ROC-AUC after Hyperparameter:",
roc_auc_score(test_y, best_lgbm.predict_proba(test_x)[:, 1]))
# Train and evaluate Gradient Boosting
best_gb = GradientBoostingClassifier(
**gb_search.best_params_, random_state=42)
best_gb.fit(balanced_train_x, balanced_train_y)
y_pred_gb = best_gb.predict(test_x)
print("Gradient Boosting Results after Hyperparameter:\n",
classification_report(test_y, y_pred_gb))
print("Gradient Boosting ROC-AUC after Hyperparameter",
roc_auc_score(test_y, best_gb.predict_proba(test_x)[:, 1]))
CatBoost Results after Hyperparameter:
precision recall f1-score support
0 0.96 0.95 0.96 972
1 0.15 0.16 0.16 50
accuracy 0.91 1022
macro avg 0.55 0.56 0.56 1022
weighted avg 0.92 0.91 0.92 1022
CatBoost ROC-AUC after Hyperparameter: 0.7675308641975309
XGBoost Results after Hyperparameter:
precision recall f1-score support
0 0.96 0.90 0.93 972
1 0.11 0.24 0.15 50
accuracy 0.86 1022
macro avg 0.53 0.57 0.54 1022
weighted avg 0.92 0.86 0.89 1022
XGBoost ROC-AUC after Hyperparameter: 0.7423045267489711
Random Forest Results after Hyperparameter:
precision recall f1-score support
0 0.96 0.87 0.91 972
1 0.12 0.36 0.18 50
accuracy 0.84 1022
macro avg 0.54 0.61 0.55 1022
weighted avg 0.92 0.84 0.88 1022
Random Forest ROC-AUC after Hyperparameter: 0.7509156378600823
LightGBM Results after Hyperparameter:
precision recall f1-score support
0 0.96 0.96 0.96 972
1 0.15 0.14 0.15 50
accuracy 0.92 1022
macro avg 0.55 0.55 0.55 1022
weighted avg 0.92 0.92 0.92 1022
LightGBM ROC-AUC after Hyperparameter: 0.7653703703703704
Gradient Boosting Results after Hyperparameter:
precision recall f1-score support
0 0.95 0.96 0.96 972
1 0.13 0.12 0.12 50
accuracy 0.92 1022
macro avg 0.54 0.54 0.54 1022
weighted avg 0.91 0.92 0.92 1022
Gradient Boosting ROC-AUC after Hyperparameter 0.7502880658436214
Conclusion after Hyperparameter Tuning:¶
After hyperparameter tuning, all models show consistent performance, with high precision, recall, and accuracy for the majority class (class 0). However, precision and recall for the minority class (class 1) remain relatively low across all models, reflecting the challenge of imbalanced data. While tuning maintained high accuracy and ROC-AUC scores, further work is needed to boost class 1 (stroke) performance, possibly through more advanced balancing techniques or feature engineering. Next, I will check whether an ensemble model can improve performance.
Ensemble Model with Voting Classifier¶
ensemble_ml = VotingClassifier(estimators=[
('catboost', best_catboost),
('xgb', best_xgb),
('rf', best_rf),
('lgbm', best_lgbm),
('gb', best_gb)
], voting='soft', weights=[2, 1, 1, 1, 1])
ensemble_ml.fit(balanced_train_x, balanced_train_y)
y_pred_voting = ensemble_ml.predict(test_x)
print("Ensemble (Voting Classifier) Results:\n",
classification_report(test_y, y_pred_voting))
Ensemble (Voting Classifier) Results:
precision recall f1-score support
0 0.96 0.95 0.95 972
1 0.13 0.14 0.14 50
accuracy 0.91 1022
macro avg 0.55 0.55 0.55 1022
weighted avg 0.92 0.91 0.91 1022
Conclusion:¶
The ensemble model, which combines the strengths of multiple classifiers (CatBoost, XGBoost, Random Forest, LightGBM, and Gradient Boosting) using soft voting, yields strong overall performance, particularly for the majority class (class 0). The accuracy remains high at 0.91, with weighted average precision, recall, and f1-score around 0.91. To further improve precision and recall for the minority class (class 1), I will adjust the decision threshold of the ensemble model.
# Get predicted probabilities
y_pred_proba_voting = ensemble_ml.predict_proba(
test_x)[:, 1] # Probabilities for class 1
# Threshold (try values below the default 0.5, e.g., 0.3 or 0.25)
threshold = 0.28
y_pred_adjusted = (
y_pred_proba_voting >= threshold).astype(int)
# Evaluate the adjusted predictions
print("Ensemble (Voting Classifier) with Adjusted Threshold Results:\n",
classification_report(test_y, y_pred_adjusted))
Ensemble (Voting Classifier) with Adjusted Threshold Results:
precision recall f1-score support
0 0.96 0.88 0.92 972
1 0.13 0.34 0.19 50
accuracy 0.86 1022
macro avg 0.55 0.61 0.55 1022
weighted avg 0.92 0.86 0.89 1022
Conclusion:¶
After adjusting the threshold to 0.28 (I tried other values, but 0.28 achieved the best trade-off) for the ensemble model (Voting Classifier), the results show a notable trade-off between precision and recall for the minority class (stroke=1). By lowering the threshold, the model becomes more sensitive to stroke cases: recall for class 1 increased to 0.34 from the previous 0.14. However, this comes at the cost of precision, which remains low at 0.13.
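The 0.28 value came from manual trial and error. A small sweep like the sketch below (an illustration, not the original search) makes that tuning reproducible by scoring class-1 F1 and recall across candidate thresholds; in practice it should run on a validation split rather than the test set, to avoid tuning on test data:
from sklearn.metrics import recall_score  # not imported in the settings cell

# Score candidate thresholds on class-1 F1 and recall
for thr in np.arange(0.20, 0.51, 0.02):
    preds = (y_pred_proba_voting >= thr).astype(int)
    print(f"threshold={thr:.2f}  "
          f"F1(class 1)={f1_score(test_y, preds):.3f}  "
          f"recall(class 1)={recall_score(test_y, preds):.3f}")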
Cross-validation with Stratified K-Fold¶
# StratifiedKFold to maintain the same class distribution across folds
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
# List to hold F1-scores for each fold
f1_scores = []
# Custom cross-validation loop
for train_index, val_index in skf.split(balanced_train_x, balanced_train_y):
# Split data into train and validation sets
X_train, X_val = balanced_train_x.iloc[
train_index], balanced_train_x.iloc[val_index]
y_train, y_val = balanced_train_y.iloc[
train_index], balanced_train_y.iloc[val_index]
# Fit the voting classifier on the current fold's training data
ensemble_ml.fit(X_train, y_train)
# Get predicted probabilities on the validation set
y_pred_proba_val = ensemble_ml.predict_proba(X_val)[:, 1]
# Apply the adjusted threshold
y_pred_adjusted_val = (y_pred_proba_val >= threshold).astype(int)
# Calculate F1-score with adjusted threshold and store it
f1 = f1_score(y_val, y_pred_adjusted_val)
f1_scores.append(f1)
# Convert the list to a numpy array for easier statistics
f1_scores = np.array(f1_scores)
# Output the F1-scores and mean
print(f"Custom Cross-Validation F1 Scores: {f1_scores}")
print(f"Mean F1 Score with Threshold Adjustment: {np.mean(f1_scores)}")
Custom Cross-Validation F1 Scores: [0.93567961 0.93967093 0.92882992 0.93268648 0.94067278] Mean F1 Score with Threshold Adjustment: 0.9355079438193711
Conclusion:¶
The custom cross-validation with StratifiedKFold and the adjusted threshold produced consistent F1-scores across all folds, ranging from 0.9288 to 0.9407, with a mean of 0.9355. Note, however, that these folds are drawn from the SMOTE-balanced training data, so the scores are optimistic relative to the imbalanced test set; they mainly show that the voting ensemble with the adjusted threshold behaves stably across different subsets of the (resampled) data.
Bootstrapping with Bagging¶
Now, since I have the cross-validation results, I will apply bootstrapping with bagging to further stabilize the model’s performance and reduce variance. By using the BaggingClassifier, I can train multiple instances of the VotingClassifier on bootstrapped samples of the training data. This approach helps mitigate overfitting, enhances robustness, and provides a more reliable ensemble model. The adjusted threshold will again be applied to fine-tune the precision-recall trade-off for the minority class, ensuring a balanced evaluation of the final model.
# Wrap the VotingClassifier in a BaggingClassifier to apply bootstrapping
# (note: `base_estimator` was renamed to `estimator` in scikit-learn 1.2+)
bagging_clf = BaggingClassifier(base_estimator=ensemble_ml,
                                n_estimators=10,  # bootstrap iterations
                                random_state=42,
                                n_jobs=-1,
                                bootstrap=True)
# Fit the bagging ensemble model on the balanced training data
bagging_clf.fit(balanced_train_x, balanced_train_y)
# Get predicted probabilities for the test set
y_pred_proba_bagging = bagging_clf.predict_proba(test_x)[:, 1]
# Adjust the threshold
threshold = 0.28
y_pred_adjusted_bagging = (y_pred_proba_bagging >= threshold).astype(int)
# Evaluate the adjusted predictions for the bagging ensemble
print("Bagging (Voting Classifier) with Adjusted Threshold Results:\n",
classification_report(test_y, y_pred_adjusted_bagging))
# Evaluate the ROC-AUC score for the bagging ensemble
roc_auc = roc_auc_score(test_y, y_pred_proba_bagging)
print("Bagging Ensemble ROC-AUC:", roc_auc)
Bagging (Voting Classifier) with Adjusted Threshold Results:
precision recall f1-score support
0 0.97 0.86 0.91 972
1 0.13 0.40 0.19 50
accuracy 0.83 1022
macro avg 0.55 0.63 0.55 1022
weighted avg 0.92 0.83 0.87 1022
Bagging Ensemble ROC-AUC: 0.7741975308641975
Conclusion¶
The results indicate that applying bagging with the VotingClassifier led to a noticeable improvement in the recall for the minority class (stroke=1). Before bagging, the recall was 0.34, but after incorporating bagging, the recall increased to 0.40. This suggests that the model is now better at identifying true positive cases (strokes) while maintaining an acceptable precision and F1-score. Although the overall accuracy slightly decreased to 0.83, the trade-off is worthwhile as the model’s ability to detect strokes (recall) improved, which is crucial for addressing the minority class in this imbalanced dataset. Additionally, the ROC-AUC score of 0.774 indicates a strong performance in distinguishing between the two classes.
Cross Validation with Threshold Adjustment after Bootstrap and Bagging¶
Now, since I have the bootstrap and bagging results, I will apply cross-validation with threshold adjustment to confirm that the model's performance is consistent and generalizes across different subsets of the data. Here, cross_val_predict splits the data into multiple folds, collects one out-of-fold prediction per sample, and the adjusted threshold is then applied to those pooled predictions.
# Number of folds for cross-validation
cv_folds = 5
# Perform cross-validation predictions using the bagging classifier
y_pred_proba_cv = cross_val_predict(
bagging_clf, balanced_train_x,
balanced_train_y, cv=cv_folds,
method='predict_proba')[:, 1] # probability for class 1
# Adjust the threshold during cross-validation
threshold = 0.28
y_pred_adjusted_cv = (y_pred_proba_cv >= threshold).astype(int)
# cross_val_predict yields one out-of-fold prediction per sample, so a
# single pooled F1-score is computed over all folds at once
f1_pooled = f1_score(balanced_train_y, y_pred_adjusted_cv)
print(f"Pooled Out-of-Fold F1 Score with Threshold Adjustment: {f1_pooled}")
Pooled Out-of-Fold F1 Score with Threshold Adjustment: 0.9244851258581235
Conclusion¶
The pooled out-of-fold F1-score of 0.924 indicates strong, stable performance of the bagging classifier with the adjusted threshold on the resampled training data. Note that this is a single aggregate score over all out-of-fold predictions rather than five independent per-fold scores; a genuine per-fold breakdown is sketched below. Overall, the model balances precision and recall effectively, and the threshold adjustment contributed to the high F1-score.
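For reference, a true per-fold breakdown would refit the model on each fold. A minimal sketch, assuming `balanced_train_x`/`balanced_train_y` are pandas objects (use plain array indexing otherwise) and reusing the same 0.28 threshold; fold scores will differ slightly from the pooled value:
from sklearn.base import clone
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
fold_f1 = []
for train_idx, val_idx in skf.split(balanced_train_x, balanced_train_y):
    # Refit a fresh copy of the bagging ensemble on this fold's training part
    model = clone(bagging_clf)
    model.fit(balanced_train_x.iloc[train_idx], balanced_train_y.iloc[train_idx])
    # Score the held-out part with the same adjusted threshold
    proba = model.predict_proba(balanced_train_x.iloc[val_idx])[:, 1]
    preds = (proba >= 0.28).astype(int)
    fold_f1.append(f1_score(balanced_train_y.iloc[val_idx], preds))
print("Per-fold F1:", np.round(fold_f1, 4))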
Overfitting Check¶
Now, since I have the cross-validation results, I will run an overfitting check to verify that the model generalizes to unseen data. By evaluating performance on both the training set and the test set, I can assess whether the model has overfitted: if training performance is significantly better than test performance, the model has learned patterns specific to the training data that may not generalize to new data. This step is crucial for judging the balance between model complexity and generalization.
# Check training performance
y_train_pred = ensemble_ml.predict(balanced_train_x)
print("Training Set Performance:\n",
classification_report(balanced_train_y, y_train_pred))
# Check test performance
y_test_pred = ensemble_ml.predict(test_x)
print("Test Set Performance:\n",
classification_report(test_y, y_test_pred))
Training Set Performance:
precision recall f1-score support
0 0.99 0.99 0.99 3888
1 0.99 0.99 0.99 3888
accuracy 0.99 7776
macro avg 0.99 0.99 0.99 7776
weighted avg 0.99 0.99 0.99 7776
Test Set Performance:
precision recall f1-score support
0 0.96 0.95 0.95 972
1 0.12 0.14 0.13 50
accuracy 0.91 1022
macro avg 0.54 0.54 0.54 1022
weighted avg 0.91 0.91 0.91 1022
Conclusion¶
The model demonstrates near-perfect performance on the training set, with an accuracy of 99%, indicating that it has likely overfitted to the training data. This is evident because the test set performance shows a significant drop in metrics, especially for class 1 (stroke). The precision, recall, and F1-score for class 1 are notably lower on the test set, with a recall of only 14% and an F1-score of 13%, while the overall accuracy is still relatively high at 91%.
This suggests that while the model fits the training data almost perfectly, it struggles to generalize to unseen data, particularly for the minority class (stroke). Overfitting is likely, and further adjustments such as regularization or hyperparameter tuning may be necessary to improve generalization.
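One concrete way to attack the overfitting would be to constrain the tree-based members of the ensemble. The sketch below is an assumption-laden illustration, not tuning that was actually run: it presumes the VotingClassifier contains a random forest registered under the name 'rf' (check ensemble_ml.named_estimators_ for the real names).
# Limit tree depth and leaf size of the forest member to curb memorization;
# the 'rf' step name is an assumption about how ensemble_ml was built
param_grid = {
    'rf__max_depth': [5, 10, 15],
    'rf__min_samples_leaf': [5, 10, 20],
}
grid = GridSearchCV(ensemble_ml, param_grid, scoring='f1', cv=5, n_jobs=-1)
grid.fit(balanced_train_x, balanced_train_y)
print("Best params:", grid.best_params_)
print("Best CV F1:", grid.best_score_)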
Learning Curve Analysis - Ensemble Model (Voting Classifier)¶
The learning curve provides insights into how the model’s F1 score evolves as the training set size increases. The model is trained using different subsets of the training data, and its performance is evaluated on both the training set and cross-validation folds using F1 score as the evaluation metric.
# Generate learning curve data
train_sizes, train_scores, test_scores = learning_curve(
ensemble_ml, balanced_train_x, balanced_train_y, cv=5,
scoring='f1', n_jobs=-1, train_sizes=np.linspace(0.1, 1.0, 10))
# Calculate mean and standard deviation for the training and test scores
train_scores_mean = np.mean(train_scores, axis=1)
train_scores_std = np.std(train_scores, axis=1)
test_scores_mean = np.mean(test_scores, axis=1)
test_scores_std = np.std(test_scores, axis=1)
# Plot learning curve
plt.figure()
plt.title("Learning Curve (Voting Classifier)")
plt.xlabel("Training Examples")
plt.ylabel("F1 Score")
# Plot the training and test scores
plt.fill_between(train_sizes, train_scores_mean - train_scores_std,
train_scores_mean + train_scores_std, alpha=0.1, color="r")
plt.fill_between(train_sizes, test_scores_mean - test_scores_std,
test_scores_mean + test_scores_std, alpha=0.1, color="g")
plt.plot(train_sizes, train_scores_mean, 'o-', color="r",
label="Training F1 score")
plt.plot(train_sizes, test_scores_mean, 'o-', color="g",
label="Cross-validation F1 score")
plt.legend(loc="best")
plt.show()
Conclusion¶
The learning curve shows the performance of the Voting Classifier as more training data is added:
- The training F1 score (red line) remains high and consistent, close to 1.0, indicating that the model fits the training data very well.
- The cross-validation F1 score (green line) improves steadily as the amount of training data increases, which suggests that the model is generalizing better as more data is used.
- The green shaded area represents the variability of the cross-validation F1 score across different folds. As the training data size increases, this variability reduces, indicating more stable performance on unseen data.
Overall, the model appears to be learning well, and the gap between training and validation scores suggests that, while the model fits the data well, some generalization improvements could be made.
Confusion Matrix for the Ensemble Model (Voting Classifier)¶
The confusion matrix is a valuable tool for understanding how well a model performs on classification tasks by showing the number of correct and incorrect predictions for each class. It provides a detailed breakdown of the model’s performance on both the positive and negative classes (in this case, stroke or no stroke).
# Get predictions for the test set using the ensemble model
y_pred_voting = ensemble_ml.predict(test_x)
# Generate confusion matrix
conf_matrix = confusion_matrix(test_y, y_pred_voting)
# Display the confusion matrix
disp = ConfusionMatrixDisplay(confusion_matrix=conf_matrix,
display_labels=[0, 1])
disp.plot(cmap=plt.cm.Blues)
plt.title("Confusion Matrix for Ensemble Model")
plt.show()
Conclusion¶
- True Negatives (927): The model correctly classified 927 out of 972 instances of class 0 (no stroke).
- False Positives (45): The model incorrectly classified 45 instances of class 0 as class 1 (stroke).
- False Negatives (43): The model incorrectly classified 43 instances of class 1 as class 0 (no stroke).
- True Positives (7): The model correctly classified 7 out of 50 instances of class 1 (stroke).
The confusion matrix highlights the challenges of identifying the minority class (stroke = 1), where the model has a relatively low true positive rate (recall). This suggests that while the model is quite good at identifying negative cases (no stroke), there is still room for improvement in detecting positive cases (stroke).
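These four counts reproduce the headline metrics directly; a quick arithmetic check:
tn, fp, fn, tp = 927, 45, 43, 7
recall = tp / (tp + fn)                      # 7 / 50   = 0.14 (stroke sensitivity)
precision = tp / (tp + fp)                   # 7 / 52   ≈ 0.13
specificity = tn / (tn + fp)                 # 927/972  ≈ 0.95
accuracy = (tp + tn) / (tn + fp + fn + tp)   # 934/1022 ≈ 0.91
print(recall, precision, specificity, accuracy)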
LIME Interpretation¶
LIME is used to explain the decision of an ensemble voting classifier for stroke prediction. It breaks down the contribution of each feature to a specific prediction, providing insight into how the model arrived at its decision for an individual data point.
# Before scaling, save a copy of the unscaled training and test data
unscaled_train_x = train_x.copy()
unscaled_test_x = test_x.copy()
# Scale the training data
scaler = StandardScaler()
train_x[['age', 'avg_glucose_level', 'bmi']] = scaler.fit_transform(
train_x[['age', 'avg_glucose_level', 'bmi']])
test_x[['age', 'avg_glucose_level', 'bmi']] = scaler.transform(
test_x[['age', 'avg_glucose_level', 'bmi']])
# Train your model using the scaled data
ensemble_ml.fit(train_x, train_y)
# Create the LIME explainer using the **unscaled** training data
explainer = lime.lime_tabular.LimeTabularExplainer(
training_data=np.array(unscaled_train_x),
feature_names=unscaled_train_x.columns,
class_names=['No Stroke', 'Stroke'],
mode='classification'
)
# Select a test instance (unscaled) to explain, e.g., the first test sample
test_instance_unscaled = unscaled_test_x.iloc[0].values.reshape(1, -1)
# LIME perturbs the unscaled instance, so the perturbed samples must be
# scaled before reaching the model, which was trained on scaled data
def predict_fn(data):
    df = pd.DataFrame(data, columns=unscaled_train_x.columns)
    df[['age', 'avg_glucose_level', 'bmi']] = scaler.transform(
        df[['age', 'avg_glucose_level', 'bmi']])
    return ensemble_ml.predict_proba(df)
# Generate the explanation for the unscaled test instance
explanation = explainer.explain_instance(
    test_instance_unscaled[0], predict_fn)
# Plot the LIME explanation
fig = explanation.as_pyplot_figure()
plt.title('LIME Explanation for Stroke Prediction', fontsize=16)
plt.xticks(fontsize=12)
plt.yticks(fontsize=12)
plt.tight_layout()
plt.show()
Conclusion¶
The LIME explanation visualizes the contribution of individual features to the model’s stroke prediction for a specific instance:
- Green bars represent features that push the prediction towards 'Stroke' (class 1) in LIME's default colour scheme.
- Red bars represent features that push the prediction towards 'No Stroke' (class 0).
- The length of each bar indicates the magnitude of the feature's contribution to the prediction.
Overall conclusion
- Age and BMI had the largest negative influence on predicting a stroke.
- Hypertension contributed positively towards predicting a stroke.
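The same weights can be read numerically via LIME's as_list() method, which removes any ambiguity about the colour coding:
# (condition, weight) pairs, sorted by absolute contribution;
# positive weights push towards 'Stroke', negative towards 'No Stroke'
for feature_rule, weight in explanation.as_list():
    direction = 'Stroke' if weight > 0 else 'No Stroke'
    print(f"{feature_rule:40s} {weight:+.4f} -> {direction}")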
AUC-ROC Curves for the Ensemble Model¶
The ROC curve is a graphical representation of the performance of a classification model at different threshold levels. It plots the True Positive Rate (TPR) against the False Positive Rate (FPR).
# Get the predicted probabilities
# for the test set from the ensemble model
y_pred_proba = ensemble_ml.predict_proba(
test_x)[:, 1] # Probabilities for class 1 (Stroke)
# Calculate the FPR, TPR, and threshold values for the ROC curve
fpr, tpr, thresholds = roc_curve(test_y, y_pred_proba)
# Calculate the AUC score
roc_auc = roc_auc_score(test_y, y_pred_proba)
# Plot the ROC curve
plt.figure(figsize=(8, 6))
plt.plot(fpr, tpr, color='orange',
label=f'Ensemble ROC curve (area = {roc_auc:.2f})')
plt.plot([0, 1], [0, 1], color='navy', linestyle='--') # Baseline
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic (ROC) Curve for Ensemble Model')
plt.legend(loc='lower right')
plt.show()
Conclusion:¶
The AUC-ROC curve provides insights into the overall performance of the ensemble model:
- The AUC score is 0.80, which indicates that the ensemble model has a good ability to distinguish between stroke and non-stroke cases.
- The ROC curve moves away from the diagonal (random chance), suggesting that the model has a reasonable performance in terms of balancing true positives and false positives.
The orange curve shows how the model performs across different thresholds, and the closer it is to the top-left corner, the better the model is at distinguishing between classes.
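Because roc_curve already returns the candidate thresholds, the cutoff maximizing Youden's J statistic (TPR - FPR) can be read straight from the arrays computed above; a short sketch:
# Youden's J = TPR - FPR; its maximum marks the threshold with the best
# trade-off between sensitivity and specificity on this test set
j_scores = tpr - fpr
best_idx = np.argmax(j_scores)
print(f"Best threshold by Youden's J: {thresholds[best_idx]:.3f} "
      f"(TPR = {tpr[best_idx]:.2f}, FPR = {fpr[best_idx]:.2f})")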
Model Deployment: FastAPI¶
The trained ensemble is serialized with joblib so that it can later be served from a FastAPI application.
import joblib
# Alias the trained ensemble under its deployment name
fco_model_01 = ensemble_ml
# Save the trained ensemble model to disk
joblib.dump(fco_model_01, 'fco_model_01.pkl')
['fco_model_01.pkl']
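A minimal serving sketch follows, illustrative rather than the project's actual deployment code. It assumes requests arrive already encoded exactly as the columns of train_x, that the fitted StandardScaler has also been dumped (scaling happens outside the estimator, so the model alone is not enough), and that the field list shown is a placeholder for the full schema.
import joblib
import pandas as pd
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
model = joblib.load('fco_model_01.pkl')
scaler = joblib.load('scaler.pkl')  # hypothetical file: dump the fitted scaler too

class PatientFeatures(BaseModel):
    # Illustrative subset; the real schema must mirror train_x's encoded columns
    age: float
    hypertension: int
    heart_disease: int
    avg_glucose_level: float
    bmi: float

@app.post('/predict')
def predict(features: PatientFeatures):
    df = pd.DataFrame([features.dict()])
    df[['age', 'avg_glucose_level', 'bmi']] = scaler.transform(
        df[['age', 'avg_glucose_level', 'bmi']])
    proba = model.predict_proba(df)[0, 1]
    # Apply the same 0.28 decision threshold used during evaluation
    return {'stroke_probability': float(proba),
            'stroke_predicted': bool(proba >= 0.28)}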
Final Conclusion¶
Overall, this project demonstrated the challenges and intricacies of working with imbalanced datasets in a medical setting, where correctly identifying rare events (such as strokes) is critical. Recall for stroke prediction improved, but further gains could come from more advanced resampling techniques, feature engineering, and possibly deep learning methods. The results show that a combination of model selection, hyperparameter tuning, threshold adjustment, and ensemble learning can meaningfully raise minority-class recall, even though stroke detection on unseen data remains the hardest part of the problem.
Suggested Improvements¶
- Advanced Feature Engineering:
- Explore creating new features based on domain knowledge, such as interactions between age, hypertension, and BMI. This could improve the model’s ability to detect complex patterns related to stroke risk.
- Class Weight Adjustment:
- Further explore the use of class weights in models like Random Forest, XGBoost, and LightGBM to emphasize the minority class (stroke=1). Class-weighting strategies could improve recall and F1-scores for stroke predictions by handling the imbalance more directly (see the sketch after this list).
- Resampling Techniques:
- Test different resampling strategies like ADASYN, SMOTE-ENN, or Random Over/Under-sampling to balance the dataset. Combining oversampling with undersampling (e.g., SMOTE-Tomek) might yield better performance for the minority class; this combination is also sketched below.
- Threshold Optimization Techniques:
- Explore techniques such as Optimal Cutoff Point Analysis or Youden's Index to find the threshold that best balances sensitivity and specificity for stroke prediction (a Youden's J sketch appears in the ROC section above). This could help fine-tune the model's decision-making.
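As an illustration of the SMOTE-Tomek and class-weight suggestions above, a minimal sketch using the imbalanced-learn package (an added dependency, not imported elsewhere in this notebook), applied to the training split:
from collections import Counter
from imblearn.combine import SMOTETomek

# Oversample the minority class with SMOTE, then remove ambiguous
# majority/minority pairs with Tomek links
smote_tomek = SMOTETomek(random_state=42)
resampled_x, resampled_y = smote_tomek.fit_resample(train_x, train_y)
print('Class distribution after SMOTE-Tomek:', Counter(resampled_y))

# Class weighting is an alternative that avoids resampling entirely
weighted_rf = RandomForestClassifier(class_weight='balanced', random_state=42)
weighted_rf.fit(train_x, train_y)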