Table of Contents¶
- Introduction
- Notebook settings
- Data Loading
- Optimize Memory Usage
- EDA
- Application_train (train_df) Dataset
- Statistical Inference
- Bureau Data Analysis
- Bureau Balance Data Analysis
- Previous Application Data Analysis
- Installments Payments Data Analysis
- POS_CASH_balance Data Analysis
- Credit Card Balance Data Analysis
- Main conclusions from EDA
- Suggestions for EDA Improvement
Reference to reduce memory usage function used: https://www.kaggle.com/rinnqd/reduce-memory-usage
Introduction¶
This exploratory data analysis (EDA) focuses on the Home Credit Default Risk competition hosted on Kaggle. The primary objective of this competition is to predict the likelihood of a client defaulting on a loan based on a rich dataset provided by Home Credit, a financial institution. By accurately predicting credit risk, Home Credit aims to enhance financial inclusion by providing fair and responsible access to credit for underserved populations.
The dataset includes a variety of features such as demographic information, financial data, and historical loan records, making it an excellent opportunity to explore data preprocessing, feature engineering, and machine learning techniques. This EDA aims to uncover key patterns, relationships, and potential predictors that can assist in building an effective model.
Notebook settings¶
from assets.utils.functions import *
%load_ext pycodestyle_magic
# %reload_ext pycodestyle_magic
%pycodestyle_on
%flake8_on --max_line_length 79
%matplotlib inline
# Standard libraries
from IPython.display import display
import warnings
import joblib
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import matplotlib.patches as mpatches
from statsmodels.stats.outliers_influence import variance_inflation_factor
import phik
from phik import resources, report
from scipy.stats import ttest_ind
# from IPython.core.display import display, HTML
warnings.filterwarnings('ignore')
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
# for 100% jupyter notebook cell width
# display(HTML("<style>.container { width:100% !important; }</style>"))
Data Loading¶
# Loading datasets
train_df = pd.read_csv(
'../analytical/assets/data/application_train.csv')
test_df = pd.read_csv(
'../analytical/assets/data/application_test.csv')
bureau_df = pd.read_csv(
'../analytical/assets/data/bureau.csv')
bureau_bal_df = pd.read_csv(
'../analytical/assets/data/bureau_balance.csv')
credit_card_bal_df = pd.read_csv(
'../analytical/assets/data/credit_card_balance.csv')
installments_pay_df = pd.read_csv(
'../analytical/assets/data/installments_payments.csv')
poscash_bal_df = pd.read_csv(
'../analytical/assets/data/pos_cash_balance.csv')
prev_app_df = pd.read_csv(
'../analytical/assets/data/previous_application.csv')
# List of train and test datasets
train_test_df_list = {
"train_df": train_df,
"test_df": test_df
}
# List of all datasets
all_df_dict = {
"train_df": train_df,
"test_df": test_df,
"bureau_df": bureau_df,
"bureau_bal_df": bureau_bal_df,
"credit_card_bal_df": credit_card_bal_df,
"installments_pay_df": installments_pay_df,
"poscash_bal_df": poscash_bal_df,
"prev_app_df": prev_app_df
}
# Displaying the shape of each dataset
print("application_train :", train_df.shape)
print("application_test :", test_df.shape)
print("bureau :", bureau_df.shape)
print("bureau_balance :", bureau_bal_df.shape)
print("credit_card_balance :", credit_card_bal_df.shape)
print("installments_payments :", installments_pay_df.shape)
print("pos_cash_balance :", poscash_bal_df.shape)
print("previous_application :", prev_app_df.shape)
application_train : (307511, 122)
application_test : (48744, 121)
bureau : (1716428, 17)
bureau_balance : (27299925, 3)
credit_card_balance : (3840312, 23)
installments_payments : (13605401, 8)
pos_cash_balance : (10001358, 8)
previous_application : (1670214, 37)
Optimize Memory Usage:¶
- One technique for handling large dataframes is to reduce a DataFrame's memory usage by downcasting numerical columns to smaller, more memory-efficient data types. However, this optimization can affect precision, especially for floating-point numbers.
Since some features can be stored in smaller data types, we will use the following function to reduce the memory usage of the dataframes and standardize the feature data types, keeping them consistent across the training and testing datasets.
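The imported `memory_usage_improve` helper is not shown in this notebook; based on the linked Kaggle recipe and the dtype pattern in the output below (32-bit floats/ints, category strings), it likely resembles the following sketch. The function name and exact behavior here are assumptions, not the actual helper:

```python
import pandas as pd


def memory_usage_improve_sketch(df: pd.DataFrame) -> pd.DataFrame:
    """Downcast 64-bit numerics to 32-bit and convert object columns
    to category. A hypothetical reconstruction of the imported helper;
    the real one also prints a per-dtype usage report."""
    before_mb = df.memory_usage(deep=True).sum() / 1024 ** 2
    for col in df.columns:
        dtype = df[col].dtype
        if dtype == 'float64':
            df[col] = df[col].astype('float32')
        elif dtype == 'int64':
            df[col] = df[col].astype('int32')
        elif dtype == 'object':
            df[col] = df[col].astype('category')
    after_mb = df.memory_usage(deep=True).sum() / 1024 ** 2
    print(f"Memory reduced by: {before_mb - after_mb:.2f} MB "
          f"({100 * (before_mb - after_mb) / before_mb:.2f}%)")
    return df
```

Note that modifying `df` in place is what lets the loop over `all_df_dict` below work without reassignment.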
# Apply the function to all DataFrames
for name, df in all_df_dict.items():
print(f"\n--- Optimizing {name} DataFrame ---\n")
memory_usage_improve(df)
--- Optimizing train_df DataFrame ---
Memory Usage Before Optimization:
Memory usage of float64 columns: 152.50 MB
Memory usage of int64 columns: 96.19 MB
Memory usage of object columns: 288.00 MB
Total memory usage: 536.69 MB
Memory Usage After Optimization:
Memory usage of float32 columns: 76.25 MB
Memory usage of int32 columns: 48.10 MB
Memory usage of category columns: 4.71 MB
Total memory usage: 129.05 MB
Memory reduced by: 407.64 MB (75.95%)
--- Optimizing test_df DataFrame ---
Memory Usage Before Optimization:
Memory usage of float64 columns: 24.17 MB
Memory usage of int64 columns: 14.88 MB
Memory usage of object columns: 45.69 MB
Total memory usage: 84.74 MB
Memory Usage After Optimization:
Memory usage of float32 columns: 12.09 MB
Memory usage of int32 columns: 7.44 MB
Memory usage of category columns: 0.76 MB
Total memory usage: 20.28 MB
Memory reduced by: 64.45 MB (76.06%)
--- Optimizing bureau_df DataFrame ---
Memory Usage Before Optimization:
Memory usage of float64 columns: 104.76 MB
Memory usage of int64 columns: 78.57 MB
Memory usage of object columns: 328.78 MB
Total memory usage: 512.11 MB
Memory Usage After Optimization:
Memory usage of float32 columns: 52.38 MB
Memory usage of int32 columns: 39.29 MB
Memory usage of category columns: 4.91 MB
Total memory usage: 96.58 MB
Memory reduced by: 415.53 MB (81.14%)
--- Optimizing bureau_bal_df DataFrame ---
Memory Usage Before Optimization:
Memory usage of float64 columns: 0.00 MB
Memory usage of int64 columns: 416.56 MB
Memory usage of object columns: 1510.04 MB
Total memory usage: 1926.61 MB
Memory Usage After Optimization:
Memory usage of float32 columns: 0.00 MB
Memory usage of int32 columns: 208.28 MB
Memory usage of category columns: 26.04 MB
Total memory usage: 234.32 MB
Memory reduced by: 1692.29 MB (87.84%)
--- Optimizing credit_card_bal_df DataFrame ---
Memory Usage Before Optimization:
Memory usage of float64 columns: 439.49 MB
Memory usage of int64 columns: 205.09 MB
Memory usage of object columns: 231.10 MB
Total memory usage: 875.69 MB
Memory Usage After Optimization:
Memory usage of float32 columns: 219.74 MB
Memory usage of int32 columns: 102.55 MB
Memory usage of category columns: 3.66 MB
Total memory usage: 325.96 MB
Memory reduced by: 549.73 MB (62.78%)
--- Optimizing installments_pay_df DataFrame ---
Memory Usage Before Optimization:
Memory usage of float64 columns: 519.00 MB
Memory usage of int64 columns: 311.40 MB
Memory usage of object columns: 0.00 MB
Total memory usage: 830.41 MB
Memory Usage After Optimization:
Memory usage of float32 columns: 259.50 MB
Memory usage of int32 columns: 155.70 MB
Memory usage of category columns: 0.00 MB
Total memory usage: 415.20 MB
Memory reduced by: 415.20 MB (50.00%)
--- Optimizing poscash_bal_df DataFrame ---
Memory Usage Before Optimization:
Memory usage of float64 columns: 152.61 MB
Memory usage of int64 columns: 381.52 MB
Memory usage of object columns: 603.12 MB
Total memory usage: 1137.25 MB
Memory Usage After Optimization:
Memory usage of float32 columns: 76.30 MB
Memory usage of int32 columns: 190.76 MB
Memory usage of category columns: 9.54 MB
Total memory usage: 276.60 MB
Memory reduced by: 860.65 MB (75.68%)
--- Optimizing prev_app_df DataFrame ---
Memory Usage Before Optimization:
Memory usage of float64 columns: 191.14 MB
Memory usage of int64 columns: 76.46 MB
Memory usage of object columns: 1633.03 MB
Total memory usage: 1900.63 MB
Memory Usage After Optimization:
Memory usage of float32 columns: 95.57 MB
Memory usage of int32 columns: 38.23 MB
Memory usage of category columns: 25.50 MB
Total memory usage: 159.30 MB
Memory reduced by: 1741.33 MB (91.62%)
The memory optimization process has significantly reduced the memory footprint of the dataset. Here's a breakdown of the improvements:
| Dataset | Memory Before (MB) | Memory After (MB) | Reduction (MB) | Reduction (%) |
|---|---|---|---|---|
| train_df | 536.69 | 129.05 | 407.64 | 75.95% |
| test_df | 84.74 | 20.28 | 64.45 | 76.06% |
| bureau_df | 512.11 | 96.58 | 415.53 | 81.14% |
| bureau_bal_df | 1926.61 | 234.32 | 1692.29 | 87.84% |
| credit_card_bal_df | 875.69 | 325.96 | 549.73 | 62.78% |
| installments_pay_df | 830.41 | 415.20 | 415.20 | 50.00% |
| poscash_bal_df | 1137.25 | 276.60 | 860.65 | 75.68% |
| prev_app_df | 1900.63 | 159.30 | 1741.33 | 91.62% |
Overall Improvement:
Across all eight datasets, total memory usage dropped from roughly 7,804 MB to roughly 1,657 MB, a ~79% reduction; the training dataset alone went from 536.69 MB to 129.05 MB (~76%).
EDA¶
print('-'*80)
print(f'The shape of application_train.csv is: {train_df.shape}')
print('-'*80)
print(
    f"Number of duplicate values in application_train: "
    f"{train_df.duplicated().sum()}"
)
print('-'*80)
display(train_df.head())
--------------------------------------------------------------------------------
The shape of application_train.csv is: (307511, 122)
--------------------------------------------------------------------------------
Number of duplicate values in application_train: 0
--------------------------------------------------------------------------------
| SK_ID_CURR | TARGET | NAME_CONTRACT_TYPE | CODE_GENDER | FLAG_OWN_CAR | FLAG_OWN_REALTY | CNT_CHILDREN | AMT_INCOME_TOTAL | AMT_CREDIT | AMT_ANNUITY | AMT_GOODS_PRICE | NAME_TYPE_SUITE | NAME_INCOME_TYPE | NAME_EDUCATION_TYPE | NAME_FAMILY_STATUS | NAME_HOUSING_TYPE | REGION_POPULATION_RELATIVE | DAYS_BIRTH | DAYS_EMPLOYED | DAYS_REGISTRATION | DAYS_ID_PUBLISH | OWN_CAR_AGE | FLAG_MOBIL | FLAG_EMP_PHONE | FLAG_WORK_PHONE | FLAG_CONT_MOBILE | FLAG_PHONE | FLAG_EMAIL | OCCUPATION_TYPE | CNT_FAM_MEMBERS | REGION_RATING_CLIENT | REGION_RATING_CLIENT_W_CITY | WEEKDAY_APPR_PROCESS_START | HOUR_APPR_PROCESS_START | REG_REGION_NOT_LIVE_REGION | REG_REGION_NOT_WORK_REGION | LIVE_REGION_NOT_WORK_REGION | REG_CITY_NOT_LIVE_CITY | REG_CITY_NOT_WORK_CITY | LIVE_CITY_NOT_WORK_CITY | ORGANIZATION_TYPE | EXT_SOURCE_1 | EXT_SOURCE_2 | EXT_SOURCE_3 | APARTMENTS_AVG | BASEMENTAREA_AVG | YEARS_BEGINEXPLUATATION_AVG | YEARS_BUILD_AVG | COMMONAREA_AVG | ELEVATORS_AVG | ENTRANCES_AVG | FLOORSMAX_AVG | FLOORSMIN_AVG | LANDAREA_AVG | LIVINGAPARTMENTS_AVG | LIVINGAREA_AVG | NONLIVINGAPARTMENTS_AVG | NONLIVINGAREA_AVG | APARTMENTS_MODE | BASEMENTAREA_MODE | YEARS_BEGINEXPLUATATION_MODE | YEARS_BUILD_MODE | COMMONAREA_MODE | ELEVATORS_MODE | ENTRANCES_MODE | FLOORSMAX_MODE | FLOORSMIN_MODE | LANDAREA_MODE | LIVINGAPARTMENTS_MODE | LIVINGAREA_MODE | NONLIVINGAPARTMENTS_MODE | NONLIVINGAREA_MODE | APARTMENTS_MEDI | BASEMENTAREA_MEDI | YEARS_BEGINEXPLUATATION_MEDI | YEARS_BUILD_MEDI | COMMONAREA_MEDI | ELEVATORS_MEDI | ENTRANCES_MEDI | FLOORSMAX_MEDI | FLOORSMIN_MEDI | LANDAREA_MEDI | LIVINGAPARTMENTS_MEDI | LIVINGAREA_MEDI | NONLIVINGAPARTMENTS_MEDI | NONLIVINGAREA_MEDI | FONDKAPREMONT_MODE | HOUSETYPE_MODE | TOTALAREA_MODE | WALLSMATERIAL_MODE | EMERGENCYSTATE_MODE | OBS_30_CNT_SOCIAL_CIRCLE | DEF_30_CNT_SOCIAL_CIRCLE | OBS_60_CNT_SOCIAL_CIRCLE | DEF_60_CNT_SOCIAL_CIRCLE | DAYS_LAST_PHONE_CHANGE | FLAG_DOCUMENT_2 | FLAG_DOCUMENT_3 | FLAG_DOCUMENT_4 | FLAG_DOCUMENT_5 | FLAG_DOCUMENT_6 | 
FLAG_DOCUMENT_7 | FLAG_DOCUMENT_8 | FLAG_DOCUMENT_9 | FLAG_DOCUMENT_10 | FLAG_DOCUMENT_11 | FLAG_DOCUMENT_12 | FLAG_DOCUMENT_13 | FLAG_DOCUMENT_14 | FLAG_DOCUMENT_15 | FLAG_DOCUMENT_16 | FLAG_DOCUMENT_17 | FLAG_DOCUMENT_18 | FLAG_DOCUMENT_19 | FLAG_DOCUMENT_20 | FLAG_DOCUMENT_21 | AMT_REQ_CREDIT_BUREAU_HOUR | AMT_REQ_CREDIT_BUREAU_DAY | AMT_REQ_CREDIT_BUREAU_WEEK | AMT_REQ_CREDIT_BUREAU_MON | AMT_REQ_CREDIT_BUREAU_QRT | AMT_REQ_CREDIT_BUREAU_YEAR | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 100002 | 1 | Cash loans | M | N | Y | 0 | 202500.0 | 406597.5 | 24700.5 | 351000.0 | Unaccompanied | Working | Secondary / secondary special | Single / not married | House / apartment | 0.018801 | -9461 | -637 | -3648.0 | -2120 | NaN | 1 | 1 | 0 | 1 | 1 | 0 | Laborers | 1.0 | 2 | 2 | WEDNESDAY | 10 | 0 | 0 | 0 | 0 | 0 | 0 | Business Entity Type 3 | 0.083037 | 0.262949 | 0.139376 | 0.0247 | 0.0369 | 0.9722 | 0.6192 | 0.0143 | 0.00 | 0.0690 | 0.0833 | 0.1250 | 0.0369 | 0.0202 | 0.0190 | 0.0000 | 0.0000 | 0.0252 | 0.0383 | 0.9722 | 0.6341 | 0.0144 | 0.0000 | 0.0690 | 0.0833 | 0.1250 | 0.0377 | 0.022 | 0.0198 | 0.0 | 0.0 | 0.0250 | 0.0369 | 0.9722 | 0.6243 | 0.0144 | 0.00 | 0.0690 | 0.0833 | 0.1250 | 0.0375 | 0.0205 | 0.0193 | 0.0000 | 0.00 | reg oper account | block of flats | 0.0149 | Stone, brick | No | 2.0 | 2.0 | 2.0 | 2.0 | -1134.0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 1 | 100003 | 0 | Cash loans | F | N | N | 0 | 270000.0 | 1293502.5 | 35698.5 | 1129500.0 | Family | State servant | Higher education | Married | House / apartment | 0.003541 | -16765 | -1188 | -1186.0 | -291 | NaN | 1 | 1 | 0 | 1 | 1 | 0 | Core staff | 2.0 | 1 | 1 | MONDAY | 11 | 0 | 0 | 0 | 0 | 0 | 0 | School | 0.311267 | 0.622246 | NaN | 0.0959 | 0.0529 | 0.9851 | 0.7960 | 0.0605 | 0.08 | 0.0345 | 0.2917 | 0.3333 | 0.0130 | 0.0773 | 0.0549 | 0.0039 | 0.0098 | 0.0924 | 0.0538 | 0.9851 | 0.8040 | 0.0497 | 0.0806 | 0.0345 | 0.2917 | 0.3333 | 0.0128 | 0.079 | 0.0554 | 0.0 | 0.0 | 0.0968 | 0.0529 | 0.9851 | 0.7987 | 0.0608 | 0.08 | 0.0345 | 0.2917 | 0.3333 | 0.0132 | 0.0787 | 0.0558 | 0.0039 | 0.01 | reg oper account | block of flats | 0.0714 | Block | No | 1.0 | 0.0 | 1.0 | 0.0 | -828.0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 100004 | 0 | Revolving loans | M | Y | Y | 0 | 67500.0 | 135000.0 | 6750.0 | 135000.0 | Unaccompanied | Working | Secondary / secondary special | Single / not married | House / apartment | 0.010032 | -19046 | -225 | -4260.0 | -2531 | 26.0 | 1 | 1 | 1 | 1 | 1 | 0 | Laborers | 1.0 | 2 | 2 | MONDAY | 9 | 0 | 0 | 0 | 0 | 0 | 0 | Government | NaN | 0.555912 | 0.729567 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.0 | 0.0 | 0.0 | 0.0 | -815.0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 100006 | 0 | Cash loans | F | N | Y | 0 | 135000.0 | 312682.5 | 29686.5 | 297000.0 | Unaccompanied | Working | Secondary / secondary special | Civil marriage | House / apartment | 0.008019 | -19005 | -3039 | -9833.0 | -2437 | NaN | 1 | 1 | 0 | 1 | 0 | 0 | Laborers | 2.0 | 2 | 2 | WEDNESDAY | 17 | 0 | 0 | 0 | 0 | 0 | 0 | Business Entity Type 3 | NaN | 0.650442 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2.0 | 0.0 | 2.0 | 0.0 | -617.0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | NaN | NaN | NaN | NaN | NaN | NaN |
| 4 | 100007 | 0 | Cash loans | M | N | Y | 0 | 121500.0 | 513000.0 | 21865.5 | 513000.0 | Unaccompanied | Working | Secondary / secondary special | Single / not married | House / apartment | 0.028663 | -19932 | -3038 | -4311.0 | -3458 | NaN | 1 | 1 | 0 | 1 | 0 | 0 | Core staff | 1.0 | 2 | 2 | THURSDAY | 11 | 0 | 0 | 0 | 0 | 1 | 1 | Religion | NaN | 0.322738 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.0 | 0.0 | 0.0 | 0.0 | -1106.0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
print('-'*80)
print(f'The shape of application_test.csv is: {test_df.shape}')
print('-'*80)
print(
    f"Number of duplicate values in application_test: "
    f"{test_df.duplicated().sum()}"
)
print('-'*80)
display(test_df.head())
--------------------------------------------------------------------------------
The shape of application_test.csv is: (48744, 121)
--------------------------------------------------------------------------------
Number of duplicate values in application_test: 0
--------------------------------------------------------------------------------
| SK_ID_CURR | NAME_CONTRACT_TYPE | CODE_GENDER | FLAG_OWN_CAR | FLAG_OWN_REALTY | CNT_CHILDREN | AMT_INCOME_TOTAL | AMT_CREDIT | AMT_ANNUITY | AMT_GOODS_PRICE | NAME_TYPE_SUITE | NAME_INCOME_TYPE | NAME_EDUCATION_TYPE | NAME_FAMILY_STATUS | NAME_HOUSING_TYPE | REGION_POPULATION_RELATIVE | DAYS_BIRTH | DAYS_EMPLOYED | DAYS_REGISTRATION | DAYS_ID_PUBLISH | OWN_CAR_AGE | FLAG_MOBIL | FLAG_EMP_PHONE | FLAG_WORK_PHONE | FLAG_CONT_MOBILE | FLAG_PHONE | FLAG_EMAIL | OCCUPATION_TYPE | CNT_FAM_MEMBERS | REGION_RATING_CLIENT | REGION_RATING_CLIENT_W_CITY | WEEKDAY_APPR_PROCESS_START | HOUR_APPR_PROCESS_START | REG_REGION_NOT_LIVE_REGION | REG_REGION_NOT_WORK_REGION | LIVE_REGION_NOT_WORK_REGION | REG_CITY_NOT_LIVE_CITY | REG_CITY_NOT_WORK_CITY | LIVE_CITY_NOT_WORK_CITY | ORGANIZATION_TYPE | EXT_SOURCE_1 | EXT_SOURCE_2 | EXT_SOURCE_3 | APARTMENTS_AVG | BASEMENTAREA_AVG | YEARS_BEGINEXPLUATATION_AVG | YEARS_BUILD_AVG | COMMONAREA_AVG | ELEVATORS_AVG | ENTRANCES_AVG | FLOORSMAX_AVG | FLOORSMIN_AVG | LANDAREA_AVG | LIVINGAPARTMENTS_AVG | LIVINGAREA_AVG | NONLIVINGAPARTMENTS_AVG | NONLIVINGAREA_AVG | APARTMENTS_MODE | BASEMENTAREA_MODE | YEARS_BEGINEXPLUATATION_MODE | YEARS_BUILD_MODE | COMMONAREA_MODE | ELEVATORS_MODE | ENTRANCES_MODE | FLOORSMAX_MODE | FLOORSMIN_MODE | LANDAREA_MODE | LIVINGAPARTMENTS_MODE | LIVINGAREA_MODE | NONLIVINGAPARTMENTS_MODE | NONLIVINGAREA_MODE | APARTMENTS_MEDI | BASEMENTAREA_MEDI | YEARS_BEGINEXPLUATATION_MEDI | YEARS_BUILD_MEDI | COMMONAREA_MEDI | ELEVATORS_MEDI | ENTRANCES_MEDI | FLOORSMAX_MEDI | FLOORSMIN_MEDI | LANDAREA_MEDI | LIVINGAPARTMENTS_MEDI | LIVINGAREA_MEDI | NONLIVINGAPARTMENTS_MEDI | NONLIVINGAREA_MEDI | FONDKAPREMONT_MODE | HOUSETYPE_MODE | TOTALAREA_MODE | WALLSMATERIAL_MODE | EMERGENCYSTATE_MODE | OBS_30_CNT_SOCIAL_CIRCLE | DEF_30_CNT_SOCIAL_CIRCLE | OBS_60_CNT_SOCIAL_CIRCLE | DEF_60_CNT_SOCIAL_CIRCLE | DAYS_LAST_PHONE_CHANGE | FLAG_DOCUMENT_2 | FLAG_DOCUMENT_3 | FLAG_DOCUMENT_4 | FLAG_DOCUMENT_5 | FLAG_DOCUMENT_6 | 
FLAG_DOCUMENT_7 | FLAG_DOCUMENT_8 | FLAG_DOCUMENT_9 | FLAG_DOCUMENT_10 | FLAG_DOCUMENT_11 | FLAG_DOCUMENT_12 | FLAG_DOCUMENT_13 | FLAG_DOCUMENT_14 | FLAG_DOCUMENT_15 | FLAG_DOCUMENT_16 | FLAG_DOCUMENT_17 | FLAG_DOCUMENT_18 | FLAG_DOCUMENT_19 | FLAG_DOCUMENT_20 | FLAG_DOCUMENT_21 | AMT_REQ_CREDIT_BUREAU_HOUR | AMT_REQ_CREDIT_BUREAU_DAY | AMT_REQ_CREDIT_BUREAU_WEEK | AMT_REQ_CREDIT_BUREAU_MON | AMT_REQ_CREDIT_BUREAU_QRT | AMT_REQ_CREDIT_BUREAU_YEAR | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 100001 | Cash loans | F | N | Y | 0 | 135000.0 | 568800.0 | 20560.5 | 450000.0 | Unaccompanied | Working | Higher education | Married | House / apartment | 0.018850 | -19241 | -2329 | -5170.0 | -812 | NaN | 1 | 1 | 0 | 1 | 0 | 1 | NaN | 2.0 | 2 | 2 | TUESDAY | 18 | 0 | 0 | 0 | 0 | 0 | 0 | Kindergarten | 0.752614 | 0.789654 | 0.159520 | 0.0660 | 0.0590 | 0.9732 | NaN | NaN | NaN | 0.1379 | 0.125 | NaN | NaN | NaN | 0.0505 | NaN | NaN | 0.0672 | 0.0612 | 0.9732 | NaN | NaN | NaN | 0.1379 | 0.125 | NaN | NaN | NaN | 0.0526 | NaN | NaN | 0.0666 | 0.0590 | 0.9732 | NaN | NaN | NaN | 0.1379 | 0.125 | NaN | NaN | NaN | 0.0514 | NaN | NaN | NaN | block of flats | 0.0392 | Stone, brick | No | 0.0 | 0.0 | 0.0 | 0.0 | -1740.0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 100005 | Cash loans | M | N | Y | 0 | 99000.0 | 222768.0 | 17370.0 | 180000.0 | Unaccompanied | Working | Secondary / secondary special | Married | House / apartment | 0.035792 | -18064 | -4469 | -9118.0 | -1623 | NaN | 1 | 1 | 0 | 1 | 0 | 0 | Low-skill Laborers | 2.0 | 2 | 2 | FRIDAY | 9 | 0 | 0 | 0 | 0 | 0 | 0 | Self-employed | 0.564990 | 0.291656 | 0.432962 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 3.0 |
| 2 | 100013 | Cash loans | M | Y | Y | 0 | 202500.0 | 663264.0 | 69777.0 | 630000.0 | NaN | Working | Higher education | Married | House / apartment | 0.019101 | -20038 | -4458 | -2175.0 | -3503 | 5.0 | 1 | 1 | 0 | 1 | 0 | 0 | Drivers | 2.0 | 2 | 2 | MONDAY | 14 | 0 | 0 | 0 | 0 | 0 | 0 | Transport: type 3 | NaN | 0.699787 | 0.610991 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.0 | 0.0 | 0.0 | 0.0 | -856.0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 4.0 |
| 3 | 100028 | Cash loans | F | N | Y | 2 | 315000.0 | 1575000.0 | 49018.5 | 1575000.0 | Unaccompanied | Working | Secondary / secondary special | Married | House / apartment | 0.026392 | -13976 | -1866 | -2000.0 | -4208 | NaN | 1 | 1 | 0 | 1 | 1 | 0 | Sales staff | 4.0 | 2 | 2 | WEDNESDAY | 11 | 0 | 0 | 0 | 0 | 0 | 0 | Business Entity Type 3 | 0.525734 | 0.509677 | 0.612704 | 0.3052 | 0.1974 | 0.9970 | 0.9592 | 0.1165 | 0.32 | 0.2759 | 0.375 | 0.0417 | 0.2042 | 0.2404 | 0.3673 | 0.0386 | 0.08 | 0.3109 | 0.2049 | 0.9970 | 0.9608 | 0.1176 | 0.3222 | 0.2759 | 0.375 | 0.0417 | 0.2089 | 0.2626 | 0.3827 | 0.0389 | 0.0847 | 0.3081 | 0.1974 | 0.9970 | 0.9597 | 0.1173 | 0.32 | 0.2759 | 0.375 | 0.0417 | 0.2078 | 0.2446 | 0.3739 | 0.0388 | 0.0817 | reg oper account | block of flats | 0.3700 | Panel | No | 0.0 | 0.0 | 0.0 | 0.0 | -1805.0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 3.0 |
| 4 | 100038 | Cash loans | M | Y | N | 1 | 180000.0 | 625500.0 | 32067.0 | 625500.0 | Unaccompanied | Working | Secondary / secondary special | Married | House / apartment | 0.010032 | -13040 | -2191 | -4000.0 | -4262 | 16.0 | 1 | 1 | 1 | 1 | 0 | 0 | NaN | 3.0 | 2 | 2 | FRIDAY | 5 | 0 | 0 | 0 | 0 | 1 | 1 | Business Entity Type 3 | 0.202145 | 0.425687 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.0 | 0.0 | 0.0 | 0.0 | -821.0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | NaN | NaN | NaN | NaN | NaN | NaN |
Summary and Insights:
application_train.csv:
- The application_train.csv dataset consists of 307,511 records and 122 features.
- These features represent various personal and financial details about customers, such as age, income, loan type, and housing characteristics.
- The dataset contains 307,511 unique loan applications, identified by the SK_ID_CURR column.
- The TARGET column indicates whether a customer defaulted on their loan:
- 0 signifies a non-defaulter.
- 1 signifies a defaulter.
- Importantly, there are no duplicate entries in this dataset.
application_test.csv:
- The application_test.csv dataset includes 48,744 records and 121 features.
- These features are identical to those in application_train.csv, except for the absence of the TARGET column, which needs to be predicted.
- The dataset contains 48,744 unique loan applications, also identified by the SK_ID_CURR column.
- Like the training data, there are no duplicate entries in this dataset.
Null values (Nan) analysis¶
I want to check for missing values in the dataset. I will compute the percentage of missing values in each column and then visualize them with a bar plot. Since the dataset is large, I will display only the columns with a missing percentage greater than 0; to do that, I will build a dataframe of the features and their null percentages and plot it.
# NaN values in the training dataset
view_percent_nan(
null_df(train_df), 'Application Train Dataset', grid=True)
Number of columns having NaN values: 67 columns
# NaN values in the test dataset
view_percent_nan(null_df(test_df), 'Application Test Dataset', grid=True)
Number of columns having NaN values: 64 columns
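The `null_df` and `view_percent_nan` helpers come from `assets.utils.functions` and are not shown here; the core computation is presumably along these lines (a sketch, not the actual helper):

```python
import pandas as pd


def null_percentages(df: pd.DataFrame) -> pd.DataFrame:
    """Percentage of NaN values per column, keeping only columns
    that actually contain NaNs, sorted descending."""
    pct = df.isnull().mean() * 100
    pct = pct[pct > 0].sort_values(ascending=False)
    print(f"Number of columns having NaN values: {len(pct)} columns")
    return pct.rename('percent_nan').to_frame()
```

The resulting frame is what a bar plot like the one above would be drawn from.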
Conclusion for Missing Values in application_train.csv and application_test.csv
application_train.csv Observations:
- Out of the 122 features, 67 columns contain missing values.
- Several columns, such as those related to COMMONAREA and NONLIVINGAPARTMENTS, have close to 70% missing data.
- Most of the columns with >50% missing values are associated with apartment statistics. This suggests that these data points were either optional during data entry or unavailable for many borrowers.
- Dropping these columns entirely would result in a significant loss of information. Hence, appropriate techniques, such as imputation or feature engineering, will need to be applied to handle the missing values effectively.
application_test.csv Observations:
- Out of the 121 features, 64 columns contain missing values, which is comparable to the application_train dataset.
- The percentage of missing values in the test set closely mirrors that of the training set. This similarity in distributions between the training and test datasets indicates a consistent data collection process.
- Columns with high missing percentages in the test set are also primarily related to apartment statistics, aligning with the patterns observed in the training data.
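As noted above, dropping the high-NaN columns would lose information. A baseline imputation sketch is shown below (median for numeric columns, mode for categoricals); the per-feature strategies here are illustrative, not the notebook's final choice:

```python
import pandas as pd


def simple_impute(df: pd.DataFrame) -> pd.DataFrame:
    """Median-impute numeric columns and mode-impute the rest.
    A baseline sketch only; targeted strategies per feature group
    (e.g. apartment statistics) would likely do better."""
    out = df.copy()
    for col in out.columns:
        if pd.api.types.is_numeric_dtype(out[col]):
            out[col] = out[col].fillna(out[col].median())
        else:
            mode = out[col].mode()
            if not mode.empty:  # guard against all-NaN columns
                out[col] = out[col].fillna(mode.iloc[0])
    return out
```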
Distribution of Target Variable¶
I will now visualize the distribution of the target variable in the training dataset. The target variable represents whether a customer defaulted on their loan (1) or not (0). Understanding the distribution of the target variable is crucial for assessing the class balance and potential biases in the dataset.
train_df['TARGET'] = train_df['TARGET'].astype(int)
plot_target_distribution(train_df)
The TARGET variable shows the status of the loan, where:
- 0: indicates that the loan was repaid on time (no payment issues), also known as non-default or
- 1: indicates the client had payment difficulties (payment issues or did not repay on time), also known as a default.
As we can see, the TARGET variable is highly imbalanced: about 92% of the loans were repaid on time and 8% were not.
Downsampling or oversampling will be needed before model training; it will be applied at the modelling stage later on.
For datasets with an imbalance in class distribution, certain algorithms that are sensitive to imbalances cannot process the data effectively without adjustments. Similarly, the choice of performance metrics must be carefully considered.
Metrics like Accuracy are often unsuitable for imbalanced datasets, as they tend to favor the majority class, leading to misleading results. Instead, alternative metrics such as ROC-AUC, Log-Loss, F1-Score, and Confusion Matrix are better suited for evaluating model performance.
Another critical point is that the number of individuals who actually default is very small, and they often exhibit distinct behavioral patterns. In scenarios like fraud detection, default prediction, or anomaly detection, these outliers play a vital role. They must be retained and analyzed carefully, as they may hold the key to distinguishing between defaulters and non-defaulters.
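The class shares quoted above come straight from `value_counts`; a self-contained sketch, with a toy series standing in for `train_df['TARGET']`:

```python
import pandas as pd


def class_shares(target: pd.Series) -> pd.Series:
    """Percentage share of each class, most frequent first."""
    return (target.value_counts(normalize=True) * 100).round(2)


# Toy stand-in for train_df['TARGET'] with the ~92/8 split seen above
toy_target = pd.Series([0] * 92 + [1] * 8)
print(class_shares(toy_target))
```

In the notebook itself this would simply be `class_shares(train_df['TARGET'])`.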
Correlation Analysis: Numerical and Categorical Features¶
Since this is a large dataset, I want to ensure the most relevant features are selected for model training. To achieve this, I will conduct correlation analysis to identify the relationships between the features and the target variable. This analysis will help identify the most important features that influence the target variable.
A heatmap will be used to visualize the Phi-K correlation coefficient between each pair of features in the dataset. Phi-K measures the strength of association between two variables, including categorical ones. Unlike Pearson's correlation, it ranges from 0 to 1, where 0 indicates no association and 1 indicates complete association; it carries no sign or direction.
The Phi-K coefficient plays a role analogous to the standard correlation coefficient but works with categorical, ordinal, and interval variables alike. It helps us identify whether one categorical feature is associated with another, which makes it particularly useful for understanding relationships between categorical variables.
Phi-K Correlation advantages:
- Captures both linear and non-linear relationships.
- Handles skewed and non-normal data effectively.
- Ideal for data with mixed types or categorical variables.
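Phi-K itself is computed via the `phik` library imported earlier (e.g. `train_df.phik_matrix()`). As a classical point of comparison for categorical association, Cramér's V can be hand-rolled in a few lines; this is purely illustrative and not used by the notebook:

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency


def cramers_v(x: pd.Series, y: pd.Series) -> float:
    """Cramér's V between two categorical series; 0 = no association,
    1 = complete association (like Phi-K, it has no sign)."""
    table = pd.crosstab(x, y)
    chi2 = chi2_contingency(table, correction=False)[0]
    n = table.to_numpy().sum()
    r, k = table.shape
    return float(np.sqrt(chi2 / (n * (min(r, k) - 1))))


# A column compared with itself is perfectly associated:
a = pd.Series(['x', 'y'] * 50)
print(cramers_v(a, a))  # 1.0
```

Unlike Cramér's V, Phi-K also captures non-linear dependence between binned interval variables, which is why it is preferred here.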
Correlation Analysis: Categorical Features¶
# Categorical columns (object type and boolean-like integers)
categorical_columns_auto = list(
train_df.select_dtypes(
include=['object', 'category']).columns)
# Add boolean-like numerical features explicitly
flag_columns = [
'FLAG_MOBIL', 'FLAG_EMP_PHONE', 'FLAG_WORK_PHONE',
'FLAG_CONT_MOBILE', 'FLAG_PHONE', 'FLAG_EMAIL',
'REGION_RATING_CLIENT', 'REGION_RATING_CLIENT_W_CITY',
'REG_REGION_NOT_LIVE_REGION', 'REG_REGION_NOT_WORK_REGION',
'LIVE_REGION_NOT_WORK_REGION', 'REG_CITY_NOT_LIVE_CITY',
'REG_CITY_NOT_WORK_CITY', 'LIVE_CITY_NOT_WORK_CITY'
]
# Add FLAG_DOCUMENT_2 to FLAG_DOCUMENT_21 explicitly
document_flags = ['FLAG_DOCUMENT_' + str(i) for i in range(2, 22)]
# Combine all
categorical_columns = list(set(
categorical_columns_auto + flag_columns + document_flags))
# Report any listed columns missing from the dataset,
# then keep only the columns actually present
missing_cols = [
    col for col in categorical_columns if col not in train_df.columns]
print(f"Columns in the list but missing from the dataset: {missing_cols}")
categorical_columns = [
    col for col in categorical_columns if col in train_df.columns]
Columns in the list but missing from the dataset: []
# Categorical columns and target
categorical_columns_with_target = categorical_columns + ['TARGET']
# Display top correlations
view_categ_corr_matrix(
data=train_df,
categorical_columns=categorical_columns_with_target,
figsize=(20, 20),
mask_upper=True,
show_target_top_corr=True,
target_top_columns=10
)
Categories with highest correlation value with Target:
| | Column Name | Phik-Correlation |
|---|---|---|
| 0 | OCCUPATION_TYPE | 0.102846 |
| 40 | ORGANIZATION_TYPE | 0.089164 |
| 16 | NAME_INCOME_TYPE | 0.084831 |
| 3 | REG_CITY_NOT_WORK_CITY | 0.079946 |
| 20 | FLAG_EMP_PHONE | 0.072087 |
| 48 | REG_CITY_NOT_LIVE_CITY | 0.069588 |
| 43 | FLAG_DOCUMENT_3 | 0.069525 |
| 22 | NAME_FAMILY_STATUS | 0.056043 |
| 17 | NAME_HOUSING_TYPE | 0.051107 |
| 19 | LIVE_CITY_NOT_WORK_CITY | 0.050956 |
Insights
The Phi-K correlation heatmap and summary table reveal the relationships between categorical features and the target variable (TARGET). Notably, the correlations between most features and TARGET are weak, as the Phi-K values are relatively low. This suggests that none of the categorical features have a strong direct influence on the target variable.
Key Findings:
- Top Features with the Highest Correlation with TARGET:
- OCCUPATION_TYPE: 0.1028 – This feature shows the strongest correlation with the target, albeit still weak.
- ORGANIZATION_TYPE: 0.0892 – Indicates some level of association with the target.
- NAME_INCOME_TYPE: 0.0848 – Represents another notable feature linked to income categories.
- REG_CITY_NOT_WORK_CITY: 0.0799 – Indicates some spatial dynamics that might slightly affect the target.
- FLAG_EMP_PHONE: 0.0721 – Suggests a weak relationship related to employment verification.
- Additional Noteworthy Features:
- REG_CITY_NOT_LIVE_CITY: 0.0696 – Highlights slight geographic differences between living and registration locations.
- FLAG_DOCUMENT_3: 0.0695 – Indicates weak association with documentation completeness.
- NAME_FAMILY_STATUS: 0.0560 – Correlates modestly with family status categories.
- NAME_HOUSING_TYPE: 0.0511 – Housing-related categories show a small influence on the target.
- LIVE_CITY_NOT_WORK_CITY: 0.0510 – Reflects weak spatial influences on the target.
General Observation:
- No strong correlations were observed between the target variable and any categorical feature, as all Phi-K values are below 0.2. This indicates that categorical features individually do not strongly determine the target.
Correlation Analysis: Numerical Features¶
Now, I will create a heatmap to show the correlation between all numeric features in the dataset. This heatmap will help us identify numeric features that are strongly correlated with each other and also highlight those that have a strong correlation with the target variable.
# Columns to drop
columns_to_drop = ['SK_ID_CURR'] + list(
set(categorical_columns_with_target) - set(['TARGET']))
view_num_corr_matrix(
data=train_df,
columns_to_drop=columns_to_drop,
figsize=(25, 23),
mask_upper=True,
linewidth=0.1,
fontsize=10,
cmap='coolwarm'
)
# Select numerical features
numerical_columns = train_df.select_dtypes(
include=['float32', 'int32']).columns.tolist()
# Initialize the correlation matrix for numerical features
corr_mat = CorrelationMatrix(
data=train_df, target='TARGET',
numerical_columns=numerical_columns, min_unique_values=10)
top_corr_numerical = corr_mat.target_top_corr(target_top_columns=10)
print("Top Phi-K Correlations for Numerical Features:")
display(top_corr_numerical)
Top Phi-K Correlations for Numerical Features:
| | Column Name | Phik-Correlation |
|---|---|---|
| 16 | EXT_SOURCE_3 | 0.247680 |
| 14 | EXT_SOURCE_1 | 0.217846 |
| 15 | EXT_SOURCE_2 | 0.213965 |
| 7 | DAYS_BIRTH | 0.102395 |
| 63 | DAYS_LAST_PHONE_CHANGE | 0.073215 |
| 8 | DAYS_EMPLOYED | 0.072094 |
| 10 | DAYS_ID_PUBLISH | 0.067769 |
| 5 | AMT_GOODS_PRICE | 0.059093 |
| 24 | FLOORSMAX_AVG | 0.058743 |
| 52 | FLOORSMAX_MEDI | 0.058445 |
Insights
- The heatmap effectively illustrates the degree of correlation between each feature and all others in the dataset.
- A majority of the heatmap displays a blue color, indicating very low correlation values. This suggests that most features are not strongly correlated with one another.
- In contrast, we observe some distinct patterns in the center of the heatmap with contrasting shades, signifying higher correlations. These correlations are primarily seen among features related to apartment statistics.
- Upon examining the features in application_train, it is evident that apartment-related statistics (e.g., mean, median, and mode) are naturally correlated, as they represent similar attributes. Additionally, features within the same category, such as means (e.g., Number of Elevators, Living Area, Non-Living Area, Basement Area), also exhibit high correlations with one another.
- Certain feature pairs, such as AMT_GOODS_PRICE and AMT_CREDIT or DAYS_EMPLOYED and DAYS_BIRTH, show strong correlations, which might require attention during model building.
- Highly correlated features can increase the complexity of the model without providing significant additional value. Therefore, inter-correlated features may need to be removed to improve model efficiency.
- Notably, the EXT_SOURCE features show strong correlations with the target variable. These features are likely to be highly significant for the classification task and should be prioritized in feature selection.
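Building on the last two points, one simple way to shortlist redundant features is to scan the upper triangle of an absolute Pearson correlation matrix and flag one member of each pair above a threshold. The frame and the 0.9 cutoff below are illustrative assumptions, not the notebook's actual pipeline:

```python
import numpy as np
import pandas as pd

# Toy frame with one near-duplicate pair (column names are illustrative only)
rng = np.random.default_rng(42)
x = rng.normal(size=300)
toy = pd.DataFrame({
    'AMT_CREDIT': x,
    'AMT_GOODS_PRICE': 0.98 * x + rng.normal(scale=0.05, size=300),
    'EXT_SOURCE_2': rng.normal(size=300),
})

# Keep only the upper triangle so each pair is inspected once,
# then flag one column from every pair above the threshold
corr = toy.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [c for c in upper.columns if (upper[c] > 0.9).any()]
print(to_drop)
```

Dropping the flagged columns removes most of the pairwise redundancy while keeping one representative of each correlated group.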
Multicollinearity Analysis¶
The goal here is to examine the presence of multicollinearity among the numerical features in the dataset. Multicollinearity occurs when two or more independent variables are highly correlated, leading to redundancy in the information provided by the features. This can negatively impact the model's performance, as it may introduce noise and instability in the predictions.
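For intuition, the VIF of feature i equals 1 / (1 - R^2_i), where R^2_i comes from regressing feature i on all the other features (plus an intercept). The synthetic data below is an assumption for illustration; the check simply verifies the formula against statsmodels:

```python
import numpy as np
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 3))
# Make column 2 nearly a linear combination of columns 0 and 1
X[:, 2] = 0.8 * X[:, 0] + 0.2 * X[:, 1] + rng.normal(scale=0.3, size=500)

# Manual VIF: regress column 2 on the others (with intercept), then 1/(1 - R^2)
A = np.column_stack([X[:, :2], np.ones(len(X))])
coef, *_ = np.linalg.lstsq(A, X[:, 2], rcond=None)
resid = X[:, 2] - A @ coef
r2 = 1 - np.sum(resid ** 2) / np.sum((X[:, 2] - X[:, 2].mean()) ** 2)
manual_vif = 1.0 / (1.0 - r2)

# statsmodels equivalent (note the appended constant column)
sm_vif = variance_inflation_factor(np.column_stack([X, np.ones(len(X))]), 2)
print(manual_vif, sm_vif)  # the two values agree
```

This is why a constant column is appended before calling `variance_inflation_factor` in the cell below: without it, the implied regressions have no intercept and the VIFs are inflated.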
# Exclude irrelevant and non-numeric columns
columns_to_exclude = ['SK_ID_CURR', 'TARGET']
columns_for_vif = [
col for col in train_df.columns if col not in columns_to_exclude]
# Create a copy of the dataset for VIF analysis
multicollinearity_data = train_df[columns_for_vif].copy()
# Encode categorical variables to numeric
categorical_cols = multicollinearity_data.select_dtypes(
include=['object', 'category']).columns
for col in categorical_cols:
multicollinearity_data[col] = multicollinearity_data[col].astype(
'category').cat.codes
# Handle missing and infinite values
multicollinearity_data.fillna(
multicollinearity_data.mean(), inplace=True)
multicollinearity_data.replace(
[float('inf'), -float('inf')], 0, inplace=True)
# Add a constant column for VIF calculation
multicollinearity_data = multicollinearity_data.assign(CONSTANT=1)
# Calculate Variance Inflation Factor (VIF)
vif_data = pd.DataFrame()
vif_data['Feature'] = multicollinearity_data.columns
vif_data['VIF'] = [variance_inflation_factor(
multicollinearity_data.values, i) for i in range(
multicollinearity_data.shape[1])]
# Drop the constant column from the results
vif_data = vif_data[vif_data['Feature'] != 'CONSTANT']
# Sort and display the VIF results
vif_data = vif_data.sort_values(by='VIF', ascending=False)
print(vif_data)
| | Feature | VIF |
|---|---|---|
| 21 | FLAG_EMP_PHONE | 2209.718472 |
| 16 | DAYS_EMPLOYED | 2189.441347 |
| 45 | YEARS_BUILD_AVG | 379.598773 |
| 73 | YEARS_BUILD_MEDI | 378.368117 |
| 91 | OBS_60_CNT_SOCIAL_CIRCLE | 332.441132 |
| 89 | OBS_30_CNT_SOCIAL_CIRCLE | 331.921964 |
| 75 | ELEVATORS_MEDI | 300.852974 |
| 76 | ENTRANCES_MEDI | 292.604097 |
| 70 | APARTMENTS_MEDI | 289.053368 |
| 77 | FLOORSMAX_MEDI | 285.926160 |
| 78 | FLOORSMIN_MEDI | 279.862892 |
| 48 | ENTRANCES_AVG | 258.957539 |
| 42 | APARTMENTS_AVG | 258.902094 |
| 47 | ELEVATORS_AVG | 242.028826 |
| 49 | FLOORSMAX_AVG | 240.232505 |
| 50 | FLOORSMIN_AVG | 229.106694 |
| 81 | LIVINGAREA_MEDI | 221.029095 |
| 53 | LIVINGAREA_AVG | 196.001510 |
| 80 | LIVINGAPARTMENTS_MEDI | 181.425466 |
| 52 | LIVINGAPARTMENTS_AVG | 160.537111 |
| 74 | COMMONAREA_MEDI | 154.317179 |
| 46 | COMMONAREA_AVG | 135.833277 |
| 71 | BASEMENTAREA_MEDI | 124.030316 |
| 44 | YEARS_BEGINEXPLUATATION_AVG | 107.424140 |
| 43 | BASEMENTAREA_AVG | 104.370801 |
| 79 | LANDAREA_MEDI | 88.002448 |
| 82 | NONLIVINGAPARTMENTS_MEDI | 85.449105 |
| 72 | YEARS_BEGINEXPLUATATION_MEDI | 83.078401 |
| 83 | NONLIVINGAREA_MEDI | 79.161674 |
| 56 | APARTMENTS_MODE | 75.210034 |
| 61 | ELEVATORS_MODE | 74.520567 |
| 63 | FLOORSMAX_MODE | 65.565445 |
| 64 | FLOORSMIN_MODE | 63.225511 |
| 51 | LANDAREA_AVG | 61.305289 |
| 54 | NONLIVINGAPARTMENTS_AVG | 61.292899 |
| 59 | YEARS_BUILD_MODE | 60.782848 |
| 55 | NONLIVINGAREA_AVG | 58.414368 |
| 62 | ENTRANCES_MODE | 53.177391 |
| 67 | LIVINGAREA_MODE | 53.175891 |
| 8 | AMT_GOODS_PRICE | 42.896141 |
| 6 | AMT_CREDIT | 41.926529 |
| 57 | BASEMENTAREA_MODE | 39.809246 |
| 66 | LIVINGAPARTMENTS_MODE | 39.095080 |
| 60 | COMMONAREA_MODE | 34.314145 |
| 65 | LANDAREA_MODE | 34.237400 |
| 68 | NONLIVINGAPARTMENTS_MODE | 25.961802 |
| 69 | NONLIVINGAREA_MODE | 25.930602 |
| 58 | YEARS_BEGINEXPLUATATION_MODE | 18.900677 |
| 27 | CNT_FAM_MEMBERS | 17.760169 |
| 4 | CNT_CHILDREN | 13.257755 |
| 95 | FLAG_DOCUMENT_3 | 12.266187 |
| 28 | REGION_RATING_CLIENT | 10.989874 |
| 29 | REGION_RATING_CLIENT_W_CITY | 10.872970 |
| 33 | REG_REGION_NOT_WORK_REGION | 9.075078 |
| 86 | TOTALAREA_MODE | 7.987570 |
| 36 | REG_CITY_NOT_WORK_CITY | 7.573843 |
| 34 | LIVE_REGION_NOT_WORK_REGION | 7.357019 |
| 85 | HOUSETYPE_MODE | 6.643642 |
| 88 | EMERGENCYSTATE_MODE | 6.382314 |
| 98 | FLAG_DOCUMENT_6 | 6.212168 |
| 37 | LIVE_CITY_NOT_WORK_CITY | 6.037850 |
| 0 | NAME_CONTRACT_TYPE | 5.376454 |
| 100 | FLAG_DOCUMENT_8 | 5.293834 |
| 87 | WALLSMATERIAL_MODE | 5.087571 |
| 12 | NAME_FAMILY_STATUS | 4.109616 |
| 90 | DEF_30_CNT_SOCIAL_CIRCLE | 4.066447 |
| 92 | DEF_60_CNT_SOCIAL_CIRCLE | 3.871220 |
| 7 | AMT_ANNUITY | 2.784039 |
| 15 | DAYS_BIRTH | 2.448739 |
| 35 | REG_CITY_NOT_LIVE_CITY | 2.446805 |
| 32 | REG_REGION_NOT_LIVE_REGION | 2.436091 |
| 84 | FONDKAPREMONT_MODE | 1.830200 |
| 38 | ORGANIZATION_TYPE | 1.717919 |
| 97 | FLAG_DOCUMENT_5 | 1.624040 |
| 14 | REGION_POPULATION_RELATIVE | 1.557676 |
| 26 | OCCUPATION_TYPE | 1.411083 |
| 1 | CODE_GENDER | 1.279183 |
| 22 | FLAG_WORK_PHONE | 1.264659 |
| 39 | EXT_SOURCE_1 | 1.250308 |
| 2 | FLAG_OWN_CAR | 1.234432 |
| 40 | EXT_SOURCE_2 | 1.210609 |
| 101 | FLAG_DOCUMENT_9 | 1.202234 |
| 10 | NAME_INCOME_TYPE | 1.189643 |
| 17 | DAYS_REGISTRATION | 1.182372 |
| 11 | NAME_EDUCATION_TYPE | 1.159386 |
| 103 | FLAG_DOCUMENT_11 | 1.158173 |
| 24 | FLAG_PHONE | 1.157811 |
| 18 | DAYS_ID_PUBLISH | 1.148993 |
| 31 | HOUR_APPR_PROCESS_START | 1.142273 |
| 13 | NAME_HOUSING_TYPE | 1.127115 |
| 3 | FLAG_OWN_REALTY | 1.121518 |
| 115 | AMT_REQ_CREDIT_BUREAU_DAY | 1.113343 |
| 93 | DAYS_LAST_PHONE_CHANGE | 1.108702 |
| 41 | EXT_SOURCE_3 | 1.072609 |
| 23 | FLAG_CONT_MOBILE | 1.065465 |
| 5 | AMT_INCOME_TOTAL | 1.063118 |
| 114 | AMT_REQ_CREDIT_BUREAU_HOUR | 1.059802 |
| 119 | AMT_REQ_CREDIT_BUREAU_YEAR | 1.058902 |
| 116 | AMT_REQ_CREDIT_BUREAU_WEEK | 1.055091 |
| 108 | FLAG_DOCUMENT_16 | 1.037873 |
| 25 | FLAG_EMAIL | 1.034020 |
| 110 | FLAG_DOCUMENT_18 | 1.032908 |
| 105 | FLAG_DOCUMENT_13 | 1.032336 |
| 19 | OWN_CAR_AGE | 1.022424 |
| 106 | FLAG_DOCUMENT_14 | 1.021144 |
| 112 | FLAG_DOCUMENT_20 | 1.020985 |
| 9 | NAME_TYPE_SUITE | 1.017852 |
| 117 | AMT_REQ_CREDIT_BUREAU_MON | 1.015819 |
| 107 | FLAG_DOCUMENT_15 | 1.008876 |
| 118 | AMT_REQ_CREDIT_BUREAU_QRT | 1.008262 |
| 113 | FLAG_DOCUMENT_21 | 1.005291 |
| 99 | FLAG_DOCUMENT_7 | 1.004723 |
| 96 | FLAG_DOCUMENT_4 | 1.003694 |
| 111 | FLAG_DOCUMENT_19 | 1.003514 |
| 94 | FLAG_DOCUMENT_2 | 1.003042 |
| 109 | FLAG_DOCUMENT_17 | 1.002282 |
| 30 | WEEKDAY_APPR_PROCESS_START | 1.001028 |
| 104 | FLAG_DOCUMENT_12 | 1.000682 |
| 20 | FLAG_MOBIL | 1.000566 |
| 102 | FLAG_DOCUMENT_10 | 1.000539 |
Insights
- As noted in the correlation matrix above, the heatmap shows distinct high-correlation patterns among the apartment-statistics features. The same pattern appears here: near-perfectly collinear groups such as APARTMENTS_AVG, APARTMENTS_MEDI, and APARTMENTS_MODE produce very large VIF values. These features measure essentially the same attribute in slightly different ways, indicating redundancy in the dataset.
Possible actions
These highly correlated features can be addressed by dropping variables with extreme VIF (Variance Inflation Factor) values or by applying dimensionality reduction such as PCA (Principal Component Analysis) to reduce multicollinearity and improve model stability.
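A minimal sketch of VIF-based pruning under these assumptions (the 10.0 threshold and toy frame are hypothetical; this step is not part of the notebook's pipeline):

```python
import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor


def prune_by_vif(df, threshold=10.0):
    """Iteratively drop the feature with the highest VIF until every
    remaining VIF is below `threshold` (numeric, NaN-free frame assumed)."""
    cols = list(df.columns)
    while True:
        # Append a constant so each implied regression has an intercept
        X = np.column_stack([df[cols].values, np.ones(len(df))])
        vifs = [variance_inflation_factor(X, i) for i in range(len(cols))]
        worst = int(np.argmax(vifs))
        if vifs[worst] < threshold:
            return cols
        cols.pop(worst)


# Toy frame: 'c' is an almost exact copy of 'a', so one of the two is pruned
rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({'a': a,
                    'b': rng.normal(size=200),
                    'c': a + rng.normal(scale=0.01, size=200)})
kept = prune_by_vif(toy)
print(kept)  # 'b' plus one of the near-duplicates
```

Recomputing the VIFs after each drop matters: removing one member of a collinear group usually brings the VIFs of the remaining members back down.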
Application_train: Categorical Variables Analysis¶
Distribution of NAME_CONTRACT_TYPE feature and Analysis
This column provides details about the type of loan associated with the applicant. According to the documentation from Home Credit, there are two loan categories: Revolving Loans and Cash Loans. Brief explanation of the two loan types:
- Cash Loans: Fixed-term loans provided as a lump sum (entire loan amount), repaid in regular installments over a set period.
- Revolving Loans: Credit lines that allow borrowing, repayment, and re-borrowing up to a set limit, such as with credit cards.
print(train_df['NAME_CONTRACT_TYPE'].value_counts())
categ_distr_target(train_df, 'NAME_CONTRACT_TYPE')
NAME_CONTRACT_TYPE
Cash loans         278232
Revolving loans     29279
Name: count, dtype: int64

Percentages of TARGET = 0 (repaid on time) for each category in 'NAME_CONTRACT_TYPE':
Cash loans: 91.65%
Revolving loans: 94.52%
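The percentage table above is produced by the notebook's categ_distr_target helper; a hypothetical re-implementation of just that calculation, on a toy frame, could look like this:

```python
import pandas as pd


def pct_repaid_by_category(df, col):
    """Share of TARGET == 0 (repaid on time) within each category, in percent."""
    return (df.groupby(col)['TARGET']
              .apply(lambda s: (s == 0).mean() * 100)
              .round(2))


# Toy frame standing in for train_df (illustrative values only)
toy = pd.DataFrame({
    'NAME_CONTRACT_TYPE': ['Cash loans'] * 4 + ['Revolving loans'] * 2,
    'TARGET': [0, 0, 0, 1, 0, 0],
})
res = pct_repaid_by_category(toy, 'NAME_CONTRACT_TYPE')
print(res)
```

The same groupby pattern underlies every per-category percentage printed in this section.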
Insights
Revolving loans make up only a small fraction (~10%) of the total number of loans; however, when focusing on applicants who had difficulties repaying, Cash loans show a slightly higher percentage of defaults than Revolving loans.
The distribution of loan types and their associated risk (represented by target = 1) shows distinct differences between Cash loans and Revolving loans:
- Distribution of Loan Types:
- Cash loans make up the majority of the data, accounting for 90.48% of all loans.
- Revolving loans represent a much smaller portion, with only 9.52% of the total loans.
- Risk of Default (target = 1) by Loan Type:
- For Cash loans, 8.35% of the loans are associated with client having payment difficulties, indicating default risk.
- For Revolving loans, the percentage of loans with client having payment difficulties is slightly lower, at 5.48%.
Implication:
- Although Revolving loans have a slightly lower percentage of defaults compared to Cash loans, the risk in both categories is notable.
- The higher proportion of Cash loans in the dataset means that the majority of defaults come from this category, potentially influencing model predictions and risk assessments focused on this loan type.
This analysis highlights the importance of loan type in assessing default risk, with Cash loans being both more prevalent and slightly riskier than Revolving loans.
Distribution of CODE_GENDER feature and Analysis
Here we will analyze the distribution of CODE_GENDER feature and its impact on the target variable.
print(train_df['CODE_GENDER'].value_counts())
categ_distr_target(train_df, 'CODE_GENDER')
CODE_GENDER
F      202448
M      105059
XNA         4
Name: count, dtype: int64

Percentages of TARGET = 0 (repaid on time) for each category in 'CODE_GENDER':
F: 93.00%
M: 89.86%
XNA: 100.00%
Insights
The distribution of the CODE_GENDER feature reveals some notable insights:
- There are 4 rows in the application_train table with the gender labeled as 'XNA', which is not meaningful and can be treated as missing (NaN) values. Since these rows are few in number and belong exclusively to the Non-Defaulter category, they can be safely removed without significant impact.
- From the first subplot, it is evident that the dataset contains more female applicants (65.8%) than male applicants (34.2%).
- Interestingly, the second plot shows that male applicants have a higher default rate (10.14%) compared to female applicants (7.00%), despite accounting for a smaller share of total applicants.
These observations suggest that male applicants exhibit a higher likelihood of defaulting on loans compared to female applicants, considering this data.
Distribution of OCCUPATION_TYPE feature and Analysis
The OCCUPATION_TYPE feature represents the type of occupation or employment category of the applicants. Understanding its distribution provides insights into the employment demographics of the dataset and helps assess whether certain occupations are associated with a higher likelihood of loan default (TARGET = 1). By analyzing this feature, we can identify patterns or trends that may influence loan repayment behavior across different occupational groups. This analysis is critical for understanding how employment type impacts credit risk.
print(train_df['OCCUPATION_TYPE'].value_counts())
categ_distr_target(train_df, 'OCCUPATION_TYPE')
OCCUPATION_TYPE
Laborers                 55186
Sales staff              32102
Core staff               27570
Managers                 21371
Drivers                  18603
High skill tech staff    11380
Accountants               9813
Medicine staff            8537
Security staff            6721
Cooking staff             5946
Cleaning staff            4653
Private service staff     2652
Low-skill Laborers        2093
Waiters/barmen staff      1348
Secretaries               1305
Realty agents              751
HR staff                   563
IT staff                   526
Name: count, dtype: int64

Percentages of TARGET = 0 (repaid on time) for each category in 'OCCUPATION_TYPE':
Accountants: 95.17%
Cleaning staff: 90.39%
Cooking staff: 89.56%
Core staff: 93.70%
Drivers: 88.67%
HR staff: 93.61%
High skill tech staff: 93.84%
IT staff: 93.54%
Laborers: 89.42%
Low-skill Laborers: 82.85%
Managers: 93.79%
Medicine staff: 93.30%
Private service staff: 93.40%
Realty agents: 92.14%
Sales staff: 90.37%
Secretaries: 92.95%
Security staff: 89.26%
Waiters/barmen staff: 88.72%
Insights
- Low-skill Laborers have the highest percentage of defaults (17.15%) despite representing only roughly 1% of the total applicants, followed by Drivers (11.33%) and Waiters/barmen staff (11.28%).
- Occupations such as Accountants, High skill tech staff, and Managers show relatively low default percentages, highlighting their financial reliability.
Conclusion:
- The type of occupation has a noticeable impact on loan repayment behavior. Occupations with higher skills or stable jobs (e.g., Accountants, Managers, Core staff) are less likely to default compared to roles like Low-skill Laborers or Waiters/barmen staff.
- This insight can be valuable in predicting credit risk and designing targeted strategies for different occupational categories.
Distribution of ORGANIZATION_TYPE feature and Analysis
This feature is related to the type of organization where the applicant is employed.
# Subplots setup
fig, axes = plt.subplots(1, 2, figsize=(25, 14))
sns.set(style='whitegrid', font_scale=1.2)
# Spacing between plots
plt.subplots_adjust(wspace=0.4)
# Plot 1: Distribution of ORGANIZATION_TYPE
count_organization = train_df[
'ORGANIZATION_TYPE'].value_counts().sort_values(ascending=False)
sns.barplot(
ax=axes[0],
x=count_organization.values,
y=count_organization.index,
order=count_organization.index,
palette="viridis"
)
axes[0].set_title(
'Total Distribution of ORGANIZATION_TYPE', fontsize=16, pad=20)
axes[0].set_xlabel('Counts', fontsize=14)
axes[0].set_ylabel('ORGANIZATION_TYPE', fontsize=14)
# Plot 2: Percentage of Defaulters for each ORGANIZATION_TYPE
default_counts = train_df[
train_df['TARGET'] == 1]['ORGANIZATION_TYPE'].value_counts()
percentage_default_per_organization = (
default_counts / count_organization * 100).dropna().sort_values(
ascending=False)
sns.barplot(
ax=axes[1],
x=percentage_default_per_organization.values,
y=percentage_default_per_organization.index,
order=percentage_default_per_organization.index,
palette="coolwarm"
)
axes[1].set_title(
'Percentage of TARGET = 1 (did not repay on time)', fontsize=16, pad=20)
axes[1].set_xlabel('Percentage of TARGET = 1', fontsize=14)
axes[1].set_ylabel('ORGANIZATION_TYPE', fontsize=14)
# Show the plots
plt.show()
# Count the total number of unique categories
print(f"Total Number of categories in ORGANIZATION_TYPE: "
f"{len(train_df['ORGANIZATION_TYPE'].unique())}")
Total Number of categories in ORGANIZATION_TYPE: 58
Insights
- The feature ORGANIZATION_TYPE contains a wide variety of categories, showcasing applicants’ employment sectors.
- The most common category is “Business Entity Type 3”, with the highest number of applicants.
- Other significant categories include “Self-employed”, “Medicine”, and “Security” sectors.
- Percentage of Defaulters:
- The defaulter percentage varies significantly across categories, indicating that employment sector plays a role in repayment behavior.
- The top 10 categories with the highest percentage of defaulters are:
- “XNA”: ~15%
- “Transport: type 3”
- “Restaurant”
- “Construction”
- “Trade: type 6”
- “Kindergarten”
- “Mobile”
- “Trade: type 4”
- “Housing”
- “Emergency”
- These categories exhibit defaulter percentages ranging from ~6% to ~15%, with “XNA” standing out as a particularly risky category.
- Conclusion:
- The ORGANIZATION_TYPE feature provides valuable insight into loan repayment risk across different employment sectors.
- Sectors like “XNA”, “Transport”, and “Restaurant” have a notably high percentage of defaulters and might require stricter credit assessment.
- On the other hand, sectors with lower defaulter percentages indicate relatively stable repayment behavior.
Distribution of FLAG_DOCUMENT_3 feature and Analysis
The FLAG_DOCUMENT_3 feature indicates whether a specific document (document 3) was submitted during the loan application process. Since it presented a correlation of 0.069525 with the target variable, it is essential to understand its distribution and potential impact on loan repayment behavior.
print(train_df['FLAG_DOCUMENT_3'].value_counts())
categ_distr_target(train_df, 'FLAG_DOCUMENT_3')
FLAG_DOCUMENT_3
1    218340
0     89171
Name: count, dtype: int64

Percentages of TARGET = 0 (repaid on time) for each category in 'FLAG_DOCUMENT_3':
1: 91.16%
0: 93.82%
Insights
- Submitting FLAG_DOCUMENT_3 appears to be slightly associated with a higher likelihood of default. However, the overall differences in repayment and default rates between the two groups are modest.
- While this feature may have some predictive value for loan default, it is likely not a dominant factor on its own.
Distribution of FLAG_EMP_PHONE feature and Analysis
This feature indicates whether the applicant provided a work phone number during the loan application process (1 = YES, 0 = NO).
print(train_df['FLAG_EMP_PHONE'].value_counts())
categ_distr_target(train_df, 'FLAG_EMP_PHONE')
FLAG_EMP_PHONE
1    252125
0     55386
Name: count, dtype: int64

Percentages of TARGET = 0 (repaid on time) for each category in 'FLAG_EMP_PHONE':
1: 91.34%
0: 94.60%
Insights
- Having a flagged work phone (FLAG_EMP_PHONE = 1) seems to slightly correlate with a higher likelihood of default (TARGET = 1), as the default percentage is higher (8.66%) compared to those without a flagged work phone (5.40%).
- However, the majority of applicants in both categories repaid their loans on time (TARGET = 0), so this feature may have a relatively small impact on predicting loan repayment behavior.
Distribution of NAME_HOUSING_TYPE feature and Analysis
This feature describes the housing arrangement of the client, providing insight into their living situation and potentially their financial commitments. Understanding the housing situation is valuable in evaluating the client’s financial stability and repayment ability, as different housing types may indicate varying levels of financial obligation or disposable income.
print(train_df['NAME_HOUSING_TYPE'].value_counts())
categ_distr_target(train_df, 'NAME_HOUSING_TYPE')
NAME_HOUSING_TYPE
House / apartment      272868
With parents            14840
Municipal apartment     11183
Rented apartment         4881
Office apartment         2617
Co-op apartment          1122
Name: count, dtype: int64

Percentages of TARGET = 0 (repaid on time) for each category in 'NAME_HOUSING_TYPE':
House / apartment: 92.20%
With parents: 88.30%
Municipal apartment: 91.46%
Rented apartment: 87.69%
Office apartment: 93.43%
Co-op apartment: 92.07%
Insights
- Distribution:
- The majority of clients (88.73%) live in their own house or apartment, indicating that homeownership is the predominant housing type.
- Other housing types have significantly smaller shares.
- Default Rate:
- Clients living in rented apartments have the highest default rate (12.31%), followed by those living with parents (11.70%), suggesting these groups may face more financial instability.
- Municipal apartment residents also show a relatively high default rate (8.54%).
- The lowest default rates are observed among clients in office apartments (6.57%) and those in houses or apartments they own (7.80%), indicating greater financial stability for these groups.
This analysis highlights the relationship between housing type and loan repayment behavior, where clients who rent or live with parents exhibit a higher likelihood of default, while homeowners and those with employer-provided housing tend to be more reliable borrowers.
Distribution of NAME_EDUCATION_TYPE feature and Analysis
The NAME_EDUCATION_TYPE feature represents the highest level of education attained by the client. Analyzing this feature can help us understand the relationship between education level and loan repayment behavior. This analysis provides insights into how educational background correlates with default risk, aiding in identifying patterns among different education groups.
print(train_df['NAME_EDUCATION_TYPE'].value_counts())
categ_distr_target(train_df, 'NAME_EDUCATION_TYPE')
NAME_EDUCATION_TYPE
Secondary / secondary special    218391
Higher education                  74863
Incomplete higher                 10277
Lower secondary                    3816
Academic degree                     164
Name: count, dtype: int64

Percentages of TARGET = 0 (repaid on time) for each category in 'NAME_EDUCATION_TYPE':
Secondary / secondary special: 91.06%
Higher education: 94.64%
Incomplete higher: 91.52%
Lower secondary: 89.07%
Academic degree: 98.17%
Insights
- Education level plays a significant role in repayment behavior, with higher education levels generally associated with lower default rates.
- Applicants with a “Lower secondary” education level have the highest risk of defaulting, even though they represent only 1.2% of the total applicants, so they may need extra attention during the credit evaluation process.
- The “Academic degree” category, though small in size, demonstrates the most reliable repayment behavior.
Distribution of REGION_RATING_CLIENT_W_CITY feature and Analysis
The REGION_RATING_CLIENT_W_CITY feature represents a rating of the region where the client resides, taking the city into account. The rating takes the values 1, 2, and 3. Analyzing its distribution and relationship with the target variable (TARGET) can help us understand how regional factors, including the client’s city of residence, influence loan repayment behavior and default risk.
print(train_df['REGION_RATING_CLIENT_W_CITY'].value_counts())
categ_distr_target(train_df, 'REGION_RATING_CLIENT_W_CITY')
REGION_RATING_CLIENT_W_CITY
2    229484
3     43860
1     34167
Name: count, dtype: int64

Percentages of TARGET = 0 (repaid on time) for each category in 'REGION_RATING_CLIENT_W_CITY':
2: 92.08%
3: 88.60%
1: 95.16%
Insights
- Applicants from regions rated 3 tend to have the highest default rates, indicating increased credit risk for these areas.
- Conversely, applicants from regions rated 1 show the lowest default rates, making them comparatively lower-risk borrowers.
- This feature provides useful insights into how regional ratings, when accounting for city-level factors, correlate with repayment behavior and default risk.
Distribution of REG_CITY_NOT_LIVE_CITY, REG_CITY_NOT_WORK_CITY and LIVE_CITY_NOT_WORK_CITY features and Analysis
These features represent flags indicating discrepancies between a client’s addresses at the city level:
- REG_CITY_NOT_LIVE_CITY: Indicates whether the client’s permanent address differs from their contact address.
- REG_CITY_NOT_WORK_CITY: Indicates whether the client’s permanent address differs from their work address.
- LIVE_CITY_NOT_WORK_CITY: Indicates whether the client’s contact address differs from their work address.
Understanding the distribution and relationship of these features with the target variable can provide insights into how address mismatches influence loan repayment behavior. Discrepancies in addresses may indicate lifestyle or employment patterns that could potentially correlate with default risk.
categ_distr_target(train_df, 'REG_CITY_NOT_LIVE_CITY')
categ_distr_target(train_df, 'REG_CITY_NOT_WORK_CITY')
categ_distr_target(train_df, 'LIVE_CITY_NOT_WORK_CITY')
Percentages of TARGET = 0 (repaid on time) for each category in 'REG_CITY_NOT_LIVE_CITY':
0: 92.28%
1: 87.77%

Percentages of TARGET = 0 (repaid on time) for each category in 'REG_CITY_NOT_WORK_CITY':
0: 92.69%
1: 89.39%

Percentages of TARGET = 0 (repaid on time) for each category in 'LIVE_CITY_NOT_WORK_CITY':
0: 92.34%
1: 90.03%
Insights
- Applicants with mismatched addresses, whether between permanent, contact, or work addresses, are generally more likely to default on loans.
- Among the three features, REG_CITY_NOT_LIVE_CITY has the strongest correlation with default, followed by REG_CITY_NOT_WORK_CITY and LIVE_CITY_NOT_WORK_CITY.
- These address features may provide useful signals for credit risk modeling, but their contributions should be combined with other significant factors for more robust predictions.
Visualizing Continuous Variables¶
Continuous variables offer a wealth of information about applicants’ profiles and behaviors. Here, we examine three key variables:
- Age: Captures the age of the applicant in years, providing insights into the age distribution and its potential relationship with repayment behavior.
- Employment Duration: Reflects the number of years the applicant has been employed, a crucial indicator of financial stability and repayment capacity.
- Days Since Document Change: Represents the number of days before the loan application that the applicant updated their identity documents, which may highlight recent administrative changes and potential risk factors.
By visualizing these variables, we aim to uncover patterns and trends that differentiate on-time payers from late payers, helping us better understand their impact on credit risk.
Distribution of Applicant Age
In the dataset, the age of applicants is recorded in days, making it less intuitive for analysis. To enhance interpretability, we will convert age into years, providing a clearer and more meaningful representation.
train_df['AGE_YEARS'] = train_df['DAYS_BIRTH'] * -1 / 365
view_continuous_feat(
data=train_df,
column_name='AGE_YEARS',
plots=['distplot', 'box', 'CDF'],
scale_limits=(0, 70), # Age range
log_scale=False
)
Insights
- Density Plot (Left Panel):
- The density plot reveals that the majority of applicants are in the age range of 30 to 50 years.
- Late payers (Defaulters) tend to have a slightly younger distribution compared to on-time payers (Non-defaulters), with a visible shift in the curve toward the left.
- Box Plot (Middle Panel):
- The median age of on-time payers is higher compared to late payers, indicating that older applicants are generally more reliable in repaying their loans.
- Late payers show a slightly wider interquartile range, reflecting more variation in their age group.
- CDF Plot (Right Panel):
- The cumulative distribution shows that:
- Approximately 50% of late payers are younger than ~35 years, whereas the same percentage for on-time payers is slightly higher at ~40 years.
- The curve for late payers rises more steeply at younger ages, suggesting a concentration of younger defaulters.
Conclusions:
- Age plays a significant role in repayment behavior, with younger applicants showing a higher likelihood of default.
- This analysis emphasizes the importance of considering age as a factor during risk evaluation, as older applicants generally exhibit more stable repayment behavior.
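The CDF reading above can be reproduced directly: the value of the empirical CDF at an age threshold is simply the share of the group below that age. The ages below are made-up toy values, not dataset statistics; the real curve uses train_df['AGE_YEARS'] split by TARGET:

```python
import numpy as np

# Toy ages for the late-payer group (illustrative values only)
ages_late = np.array([24, 28, 31, 33, 36, 41, 47, 52])
threshold = 35

# Empirical CDF evaluated at the threshold
share_younger = (ages_late < threshold).mean()
print(f"{share_younger:.0%} of late payers are younger than {threshold}")
```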
Distribution of Days Employed
The dataset records the employment duration of applicants in days, which can be challenging to interpret directly. To make the analysis more meaningful and intuitive, we will convert the recorded values from days into years. This transformation will provide a clearer representation of the distribution of employment duration and enable better insights into applicants’ employment histories.
# Check percentiles for DAYS_EMPLOYED
view_percentiles(train_df, column_name='DAYS_EMPLOYED')
Percentile values for column: DAYS_EMPLOYED
The 0th percentile value of DAYS_EMPLOYED is -17912.00
The 25th percentile value of DAYS_EMPLOYED is -2760.00
The 50th percentile value of DAYS_EMPLOYED is -1213.00
The 75th percentile value of DAYS_EMPLOYED is -289.00
The 90th percentile value of DAYS_EMPLOYED is 365243.00
The 92th percentile value of DAYS_EMPLOYED is 365243.00
The 94th percentile value of DAYS_EMPLOYED is 365243.00
The 96th percentile value of DAYS_EMPLOYED is 365243.00
The 98th percentile value of DAYS_EMPLOYED is 365243.00
The 100th percentile value of DAYS_EMPLOYED is 365243.00
The DAYS_EMPLOYED feature contains the anomalous value 365243.00. In the Kaggle discussion forum, several comments suggest that 365243 in the DAYS_EMPLOYED column is used as a placeholder for missing or "infinite" values, a data-entry convention indicating that no valid employment duration was recorded for those entries. https://www.kaggle.com/c/home-credit-default-risk/discussion/57248
Here's a breakdown of the comment:
Key Points:
- 365243 as "Infinity" or Missing:
In the dataset, 365243 is used to signify "infinity" or missing data for the days_employed column.
This might mean the person has been employed for an indefinite time or that the data for their employment duration is unavailable.
To handle this, you should treat 365243 as a missing value (NA).
- XNA/XAP as Missing:
Similarly, strings like XNA or XAP in other categorical columns are placeholders for missing values.
These can be replaced with NaN (Not a Number) for better handling during data analysis and machine learning.
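Both conventions described above can be neutralized in one pass. A minimal sketch (the toy frame and the `CODE_GENDER` column here are illustrative; on the real data this would be applied to `train_df` and `test_df`):

```python
import numpy as np
import pandas as pd


def clean_placeholders(df: pd.DataFrame) -> pd.DataFrame:
    """Replace the 365243 sentinel and the XNA/XAP strings with NaN."""
    out = df.copy()
    if 'DAYS_EMPLOYED' in out.columns:
        # 365243 marks "infinite"/missing employment duration
        out['DAYS_EMPLOYED'] = out['DAYS_EMPLOYED'].replace(365243, np.nan)
    # XNA / XAP mark missing values in categorical columns
    out = out.replace({'XNA': np.nan, 'XAP': np.nan})
    return out


toy = pd.DataFrame({
    'DAYS_EMPLOYED': [-2760, 365243, -1213],
    'CODE_GENDER': ['M', 'XNA', 'F'],
})
cleaned = clean_placeholders(toy)
```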
train_df['YEARS_EMPLOYED'] = train_df['DAYS_EMPLOYED'] * -1 / 365
view_continuous_feat(
data=train_df,
column_name='YEARS_EMPLOYED',
plots=['distplot', 'box'],
scale_limits=(0, 50),
log_scale=False
)
Insights
The analysis of the YEARS_EMPLOYED feature reveals several key insights:
- Density Distribution:
- The majority of applicants, regardless of repayment behavior, have less than 10 years of employment. The density peaks within this range, indicating that most individuals have relatively short employment histories.
- Differences Between Groups:
- Late payers (represented by the black curve) tend to have slightly longer employment histories compared to on-time payers (red curve), as observed in the extended tail of the late payers’ distribution.
- However, the overall differences between the two groups remain subtle.
- Box Plot Insights:
- The median employment duration is relatively similar for both on-time and late payers.
- A significant number of outliers, with employment durations exceeding 20 years, are observed in the dataset.
In summary, while employment duration shows a slight trend of longer histories for late payers, the differences are not highly pronounced, and most applicants fall within a similar range of employment experience.
Distribution of DAYS_ID_PUBLISH
This feature represents the number of days before the application that the applicant last changed the identity document used to apply for the loan (time is measured relative to the application date). Analyzing this variable can provide insights into the recency of document updates and their potential impact on loan repayment behavior.
view_continuous_feat(data=train_df,
column_name='DAYS_ID_PUBLISH',
plots=['box', 'CDF'],
figsize=(10, 8))
Insights
The DAYS_ID_PUBLISH feature does not appear to be a strong differentiator between on-time payers and late payers. Both groups show similar trends in their identity document update timing. While this feature might have limited predictive power, it can still provide some context in combination with other variables.
Distribution of EXT_SOURCES
Since these features presented the highest correlation with the target variable, it is essential to understand their distributions and relationships with the target variable. By visualizing the EXT_SOURCE features, we can gain insights into how these external scores influence loan repayment behavior and default risk.
- EXT_SOURCE_1 -> 0.217846
- EXT_SOURCE_2 -> 0.213965
- EXT_SOURCE_3 -> 0.247680
view_continuous_feat(data=train_df,
column_name='EXT_SOURCE_1',
plots=['distplot', 'box'],
figsize=(16, 8))
view_continuous_feat(data=train_df,
column_name='EXT_SOURCE_2',
plots=['distplot', 'box'],
figsize=(16, 8))
view_continuous_feat(data=train_df,
column_name='EXT_SOURCE_3',
plots=['distplot', 'box'],
figsize=(16, 8))
Insights
The EXT_SOURCE variables (EXT_SOURCE_1, EXT_SOURCE_2, and EXT_SOURCE_3) show significant differences between on-time payers and late payers. These variables likely capture external risk scores or creditworthiness assessments.
Since the data is already normalized, the distributions of these features are comparable, allowing for direct comparisons between the groups. The meaning can be understood as:
- 0: Indicates the highest risk score, suggesting a higher likelihood of default or poorer creditworthiness.
- 1: Represents the lowest risk score, indicating a lower likelihood of default and stronger creditworthiness.
Here’s a summary of observations:
1. EXT_SOURCE_1:
- The density plot indicates that late payers (red) have lower EXT_SOURCE_1 values on average compared to on-time payers (green).
- The box plot confirms this observation, showing a clear separation between the two groups, with on-time payers having higher EXT_SOURCE_1 scores overall.
2. EXT_SOURCE_2:
- This variable shows a similar pattern to EXT_SOURCE_1, with late payers tending to have lower values and on-time payers clustering around higher scores.
- The box plot highlights a more noticeable separation between the two groups, suggesting EXT_SOURCE_2 could be an important predictor for default risk.
3. EXT_SOURCE_3:
- Like the other two external sources, EXT_SOURCE_3 shows higher values for on-time payers compared to late payers in both the density plot and the box plot.
- The difference in distributions is particularly evident in the density plot, with late payers concentrated toward lower scores.
Key Insight:
All three EXT_SOURCE variables consistently indicate that higher values are associated with lower default risk. This trend suggests that these variables are likely critical in distinguishing creditworthy applicants from higher-risk ones, making them valuable predictors in any risk modeling process.
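The correlation ranking quoted above can be reproduced with `corrwith`. A minimal sketch, shown on an illustrative toy frame (values are made up; on the real data, pass `train_df`):

```python
import pandas as pd


def ext_source_target_corr(df: pd.DataFrame) -> pd.Series:
    """Absolute Pearson correlation of each external score with the target."""
    cols = ['EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3']
    return df[cols].corrwith(df['TARGET']).abs().sort_values(ascending=False)


toy = pd.DataFrame({
    'EXT_SOURCE_1': [0.9, 0.8, 0.2, 0.1],
    'EXT_SOURCE_2': [0.5, 0.9, 0.4, 0.6],
    'EXT_SOURCE_3': [0.7, 0.6, 0.3, 0.2],
    'TARGET': [0, 0, 1, 1],
})
ranking = ext_source_target_corr(toy)
```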
Statistical Inference¶
Statistical inference involves drawing conclusions about a population based on sample data. In the context of credit default analysis, it helps identify significant patterns and relationships between various features (e.g., income, loan amount, family size) and default risk through hypothesis testing, confidence intervals, and other statistical techniques. This process provides insights to guide feature engineering and predictive modeling.
train_df['TARGET'] = train_df['TARGET'].astype(int)
valid_income_data = train_df[train_df[
'AMT_INCOME_TOTAL'].notnull()]
defaulters_income = valid_income_data[
valid_income_data['TARGET'] == 1]['AMT_INCOME_TOTAL']
non_defaulters_income = valid_income_data[
valid_income_data['TARGET'] == 0]['AMT_INCOME_TOTAL']
defaulters = train_df[
train_df['TARGET'] == 1]
non_defaulters = train_df[
train_df['TARGET'] == 0]
Hypothesis 1: Do Income Levels Significantly Differ Between Defaulters and Non-Defaulters?¶
Null Hypothesis (H₀): There is no significant difference in income levels between defaulters and non-defaulters. The mean income for defaulters is equal to that of non-defaulters.
Alternative Hypothesis (H₁): There is a significant difference in income levels between defaulters and non-defaulters. The mean income for defaulters is not equal to that of non-defaulters.
income_results = hypothesis_test(
'AMT_INCOME_TOTAL', defaulters, non_defaulters)
print("### Hypothesis 1: Income Levels Influence Default Rates ###")
print(f"T-statistic: {income_results['T-statistic']:.2f}, "
f"P-value: {income_results['P-value']:.4f}")
print(f"Defaulters Mean Income: {income_results['Defaulters Mean']:.2f}")
print(f"Non-Defaulters Mean Income: "
f"{income_results['Non-Defaulters Mean']:.2f}")
print(f"95% Confidence Interval: {income_results['95% CI']}")
print(f"Decision: {income_results['Decision']}\n")
### Hypothesis 1: Income Levels Influence Default Rates ### T-statistic: -2.21, P-value: 0.0272 Defaulters Mean Income: 165611.77 Non-Defaulters Mean Income: 169077.73 95% Confidence Interval: (-12763.362769931764, 5831.425269931764) Decision: Reject the null hypothesis
Insights
Based on the t-test results, the mean income for defaulters ($165,611.77) differs from that of non-defaulters ($169,077.73), with a p-value of 0.0272, below the 0.05 significance level, so the null hypothesis is rejected. Note, however, that the reported 95% confidence interval for the difference in mean income (-$12,763.36 to $5,831.43) contains zero, which is inconsistent with a two-sided test rejecting at the 0.05 level; the interval is likely computed by a different method than the test statistic, so this result should be interpreted with caution. In any case, the absolute difference in means is small, so the practical effect size appears limited.
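The `hypothesis_test` helper used throughout this section is imported from `assets.utils.functions` and is not shown in the notebook. A minimal sketch of what such a helper might look like, built on Welch's t-test (`scipy.stats.ttest_ind`); the real helper additionally returns group means and a 95% confidence interval, which are omitted here:

```python
import numpy as np
from scipy import stats


def hypothesis_test_sketch(sample_a, sample_b, alpha=0.05,
                           alternative='two-sided'):
    """Welch's t-test comparing two samples, with a simple decision rule."""
    a = np.asarray(sample_a, dtype=float)
    b = np.asarray(sample_b, dtype=float)
    t_stat, p_value = stats.ttest_ind(a, b, equal_var=False,
                                      alternative=alternative)
    decision = ('Reject the null hypothesis' if p_value < alpha
                else 'Fail to reject the null hypothesis')
    return {'T-statistic': float(t_stat), 'P-value': float(p_value),
            'Mean A': a.mean(), 'Mean B': b.mean(), 'Decision': decision}


# Two clearly separated samples should reject the null
result = hypothesis_test_sketch(range(100), [x + 50 for x in range(100)])
```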
Hypothesis 2: The Loan Amount Has a Positive Influence on Default Risk¶
Null Hypothesis (H₀): The mean loan amount for defaulters is less than or equal to the mean loan amount for non-defaulters.
Alternative Hypothesis (H₁): The mean loan amount for defaulters is greater than the mean loan amount for non-defaulters.
loan_results = hypothesis_test(
'AMT_CREDIT', defaulters, non_defaulters, alternative='greater')
print("### Hypothesis 2: Loan Amount Affects Default Risk ###")
print(f"T-statistic: {loan_results['T-statistic']:.2f}, "
f"P-value: {loan_results['P-value']:.4f}")
print(f"Defaulters Mean Loan Amount: {loan_results['Defaulters Mean']:.2f}")
print(f"Non-Defaulters Mean Loan Amount: "
f"{loan_results['Non-Defaulters Mean']:.2f}")
print(f"95% Confidence Interval: {loan_results['95% CI']}")
print(f"Decision: {loan_results['Decision']}\n")
### Hypothesis 2: Loan Amount Affects Default Risk ### T-statistic: -16.85, P-value: 1.0000 Defaulters Mean Loan Amount: 557778.50 Non-Defaulters Mean Loan Amount: 602648.31 95% Confidence Interval: (-49432.87600700068, -40306.74899299932) Decision: Fail to reject the null hypothesis
Insights
The sample means run in the opposite direction of the stated hypothesis: the mean loan amount for defaulters is significantly lower than that of non-defaulters, and the 95% confidence interval, which lies entirely below zero, confirms this. Because the test is one-sided (alternative: defaulters' mean is greater), the large negative t-statistic yields a p-value of 1.0000, providing no evidence for the alternative. Therefore, we fail to reject the null hypothesis.
Hypothesis 3: Age Distribution Differs Between Defaulters and Non-Defaulters¶
Null Hypothesis (H₀): The age distribution (in years) of defaulters is the same as that of non-defaulters.
Alternative Hypothesis (H₁): The age distribution (in years) of defaulters is different from that of non-defaulters.
age_results = hypothesis_test(
'DAYS_BIRTH', defaulters, non_defaulters)
print("### Hypothesis 3: Age Distribution Differs ###")
print(f"T-statistic: {age_results['T-statistic']:.2f}, "
f"P-value: {age_results['P-value']:.4f}")
print(f"Defaulters Mean Age (Years): "
f"{-age_results['Defaulters Mean'] / 365:.2f}")
print(f"Non-Defaulters Mean Age (Years): "
f"{-age_results['Non-Defaulters Mean'] / 365:.2f}")
print(f"95% Confidence Interval: "
f"{(-age_results['95% CI'][1] / 365, -age_results['95% CI'][0] / 365)}")
print(f"Decision: {age_results['Decision']}\n")
### Hypothesis 3: Age Distribution Differs ### T-statistic: 43.52, P-value: 0.0000 Defaulters Mean Age (Years): 40.78 Non-Defaulters Mean Age (Years): 44.21 95% Confidence Interval: (-3.5833728693545583, -3.2842891613164795) Decision: Reject the null hypothesis
Insights
The results show a significant difference in the age distribution between defaulters and non-defaulters, as evidenced by a T-statistic of 43.52 and a P-value of 0.0000. The mean age of defaulters is 40.78 years, which is notably younger than the mean age of non-defaulters at 44.21 years. The 95% confidence interval for the difference in mean ages ranges from -3.58 to -3.28 years, further confirming the disparity. Therefore, we reject the null hypothesis and conclude that age distribution significantly differs between the two groups.
Hypothesis 4: Employment Stability Impacts Default Risk¶
Null Hypothesis (H₀): There is no significant difference in the average length of employment (DAYS_EMPLOYED) between defaulters and non-defaulters.
Alternative Hypothesis (H₁): The average length of employment significantly differs between defaulters and non-defaulters.
employment_results = hypothesis_test(
'DAYS_EMPLOYED', defaulters, non_defaulters)
print("### Hypothesis 4: Employment Stability and Default Risk ###")
print(f"T-statistic: {employment_results['T-statistic']:.2f}, "
f"P-value: {employment_results['P-value']:.4f}")
print(f"Defaulters Mean Employment Days: "
f"{-employment_results['Defaulters Mean']:.2f}")
print(f"Non-Defaulters Mean Employment Days: "
f"{-employment_results['Non-Defaulters Mean']:.2f}")
ci = employment_results['95% CI']
print(f"95% Confidence Interval: {(-ci[1], -ci[0])}")
print(f"Decision: {employment_results['Decision']}\n")
### Hypothesis 4: Employment Stability and Default Risk ### T-statistic: -24.94, P-value: 0.0000 Defaulters Mean Employment Days: -42394.68 Non-Defaulters Mean Employment Days: -65696.15 95% Confidence Interval: (21724.5494170607, 24878.391933186253) Decision: Reject the null hypothesis
Insights
The test indicates a significant difference in DAYS_EMPLOYED between defaulters and non-defaulters, and the null hypothesis is rejected based on the t-statistic, p-value, and confidence interval. One caveat: the means are computed on raw DAYS_EMPLOYED, which still contains the 365243 placeholder, so the printed (negated) values are distorted and should not be read as literal employment durations; re-running the test after replacing the placeholder with NaN would give a cleaner estimate. The overall direction is nevertheless consistent with employment stability mattering: applicants with shorter employment histories tend to show a higher likelihood of default.
Hypothesis 5: Family Size and Default Risk Correlation¶
Null Hypothesis (H₀): Family size does not impact default risk; the mean family size of defaulters and non-defaulters is the same.
Alternative Hypothesis (H₁): Family size impacts default risk; the mean family size of defaulters differs from that of non-defaulters.
family_results = hypothesis_test(
'CNT_FAM_MEMBERS', defaulters, non_defaulters)
print("### Hypothesis 5: Family Size Impacts Default Risk ###")
print(f"T-statistic: {family_results['T-statistic']:.2f}, "
f"P-value: {family_results['P-value']:.4f}")
print(f"Defaulters Mean Family Size: "
f"{family_results['Defaulters Mean']:.2f}")
print(f"Non-Defaulters Mean Family Size: "
f"{family_results['Non-Defaulters Mean']:.2f}")
print(f"95% Confidence Interval: {family_results['95% CI']}")
print(f"Decision: {family_results['Decision']}\n")
### Hypothesis 5: Family Size Impacts Default Risk ### T-statistic: 5.16, P-value: 0.0000 Defaulters Mean Family Size: 2.18 Non-Defaulters Mean Family Size: 2.15 95% Confidence Interval: (0.018817774837163313, 0.04341281216845192) Decision: Reject the null hypothesis
Insights
The analysis reveals that defaulters tend to have a slightly larger average family size (2.18) compared to non-defaulters (2.15). The T-statistic of 5.16 and a P-value of 0.0000 indicate strong statistical significance. The 95% confidence interval for the difference in means is (0.0188, 0.0434), which does not include zero. The null hypothesis is rejected, suggesting that family size has a statistically significant, albeit small, impact on default risk.
Bureau Data Analysis¶
Context:
- All clients' previous credits provided by other financial institutions that were reported to the Credit Bureau (for clients who have a loan in our sample).
- For every loan in our sample, there are as many rows as number of credits the client had in Credit Bureau before the application date.
print(f'The shape of bureau.csv is: {bureau_df.shape}')
print(f'Number of unique SK_ID_BUREAU in bureau.csv are: '
f'{len(bureau_df.SK_ID_BUREAU.unique())}')
print(f'Number of unique SK_ID_CURR in bureau.csv are: '
f'{len(bureau_df.SK_ID_CURR.unique())}')
train_unique = set(train_df.SK_ID_CURR.unique())
bureau_unique = set(bureau_df.SK_ID_CURR.unique())
print(f'Number of overlapping SK_ID_CURR in application_train.csv and '
f'bureau.csv are: {len(train_unique.intersection(bureau_unique))}')
test_unique = set(test_df.SK_ID_CURR.unique())
print(f'Number of overlapping SK_ID_CURR in application_test.csv and '
f'bureau.csv are: {len(test_unique.intersection(bureau_unique))}')
print(f'Number of duplicate values in bureau: '
f'{bureau_df.shape[0] - bureau_df.drop_duplicates().shape[0]}')
display(bureau_df.head(5))
The shape of bureau.csv is: (1716428, 17) Number of unique SK_ID_BUREAU in bureau.csv are: 1716428 Number of unique SK_ID_CURR in bureau.csv are: 305811 Number of overlapping SK_ID_CURR in application_train.csv and bureau.csv are: 263491 Number of overlapping SK_ID_CURR in application_test.csv and bureau.csv are: 42320 Number of duplicate values in bureau: 0
| SK_ID_CURR | SK_ID_BUREAU | CREDIT_ACTIVE | CREDIT_CURRENCY | DAYS_CREDIT | CREDIT_DAY_OVERDUE | DAYS_CREDIT_ENDDATE | DAYS_ENDDATE_FACT | AMT_CREDIT_MAX_OVERDUE | CNT_CREDIT_PROLONG | AMT_CREDIT_SUM | AMT_CREDIT_SUM_DEBT | AMT_CREDIT_SUM_LIMIT | AMT_CREDIT_SUM_OVERDUE | CREDIT_TYPE | DAYS_CREDIT_UPDATE | AMT_ANNUITY | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 215354 | 5714462 | Closed | currency 1 | -497 | 0 | -153.0 | -153.0 | NaN | 0 | 91323.0 | 0.0 | NaN | 0.0 | Consumer credit | -131 | NaN |
| 1 | 215354 | 5714463 | Active | currency 1 | -208 | 0 | 1075.0 | NaN | NaN | 0 | 225000.0 | 171342.0 | NaN | 0.0 | Credit card | -20 | NaN |
| 2 | 215354 | 5714464 | Active | currency 1 | -203 | 0 | 528.0 | NaN | NaN | 0 | 464323.5 | NaN | NaN | 0.0 | Consumer credit | -16 | NaN |
| 3 | 215354 | 5714465 | Active | currency 1 | -203 | 0 | NaN | NaN | NaN | 0 | 90000.0 | NaN | NaN | 0.0 | Credit card | -16 | NaN |
| 4 | 215354 | 5714466 | Active | currency 1 | -629 | 0 | 1197.0 | NaN | 77674.5 | 0 | 2700000.0 | NaN | NaN | 0.0 | Consumer credit | -21 | NaN |
Insights
The bureau.csv dataset contains detailed credit bureau data, with 1,716,428 records and 17 features. Key observations include:
- Unique Identifiers:
- There are 1,716,428 unique SK_ID_BUREAU entries, each representing a unique credit record.
- The dataset covers 305,811 unique SK_ID_CURR, indicating the number of unique applicants in this dataset.
- Overlap with Main Datasets:
- A total of 263,491 applicants in application_train.csv overlap with bureau.csv, indicating their credit records are included in the bureau data.
- Similarly, 42,320 applicants in application_test.csv overlap with bureau.csv, making it a valuable source for analyzing both training and testing datasets.
- Data Quality:
- No duplicate values were found in the bureau.csv dataset, indicating high data integrity.
This analysis confirms that the bureau.csv dataset is a comprehensive and reliable source for augmenting the main datasets with additional credit bureau information.
Next, we check for missing values in the bureau dataset: the percentage of missing values in each column, visualized with a bar plot, as was done for the application_train and application_test datasets.
nan_df_bureau = null_df(bureau_df)
view_percent_nan(nan_df_bureau, 'bureau', grid=True)
Number of columns having NaN values: 7 columns
Insights
Total of 7 features out of 17 have missing values in the bureau dataset. The columns with the highest missing percentages are:
- AMT_ANNUITY: ~70% missing values.
- AMT_CREDIT_MAX_OVERDUE: ~65% missing values.
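The `null_df` helper is a project utility and is not shown; computing per-column missing percentages is essentially a one-liner, sketched here on an illustrative toy frame:

```python
import numpy as np
import pandas as pd


def null_percentages(df: pd.DataFrame) -> pd.Series:
    """Percentage of missing values per column, highest first (zeros dropped)."""
    pct = df.isnull().mean() * 100
    return pct[pct > 0].sort_values(ascending=False)


toy = pd.DataFrame({
    'AMT_ANNUITY': [np.nan, np.nan, np.nan, 100.0],
    'AMT_CREDIT_SUM': [1.0, 2.0, np.nan, 4.0],
    'SK_ID_BUREAU': [1, 2, 3, 4],
})
res = null_percentages(toy)
```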
Merging Bureau Data with Application Data¶
# Merge the filtered train_df with bureau_df
bureau_merged = train_df[['SK_ID_CURR', 'TARGET']].merge(
bureau_df, on='SK_ID_CURR', how='left')
# Verify the merge
print("Merging completed. Shape of bureau_merged:", bureau_merged.shape)
print(bureau_merged['TARGET'].value_counts(dropna=False))
Merging completed. Shape of bureau_merged: (1509345, 18) TARGET 0 1390368 1 118977 Name: count, dtype: int64
Correlation Analysis: Categorical Features¶
bureau_merged['TARGET'] = bureau_merged['TARGET'].astype(int)
# Define categorical columns and target
bureau_categorical_cols = ['TARGET', 'CREDIT_ACTIVE',
'CREDIT_CURRENCY', 'CREDIT_TYPE']
# Display top correlations
view_categ_corr_matrix(
data=bureau_merged,
categorical_columns=bureau_categorical_cols,
figsize=(6, 6),
mask_upper=True,
show_target_top_corr=True
)
Categories with highest correlation value with Target:
| Column Name | Phik-Correlation | |
|---|---|---|
| 1 | CREDIT_ACTIVE | 0.064481 |
| 3 | CREDIT_TYPE | 0.049954 |
| 2 | CREDIT_CURRENCY | 0.004993 |
Insights
- CREDIT_ACTIVE:
- This feature shows the strongest correlation with the TARGET variable. It suggests that the current status of the credit (whether active, closed, or other) may provide useful information about the likelihood of repayment issues.
- CREDIT_TYPE:
- The type of credit (e.g., mortgage, car loan, etc.) also exhibits a moderate correlation with the TARGET, indicating its potential relevance for predicting loan defaults.
- CREDIT_CURRENCY:
- This feature shows minimal correlation with the TARGET, suggesting that the currency in which the credit was issued does not significantly influence repayment behavior.
These findings can guide feature selection, highlighting CREDIT_ACTIVE and CREDIT_TYPE as more predictive features for modeling.
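The phik scores above come from the project's plotting helper. As a cross-check, a closely related association measure between a categorical column and the binary target is Cramér's V, computed from a chi-squared contingency table. A sketch (not the notebook's actual method):

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency


def cramers_v(x: pd.Series, y: pd.Series) -> float:
    """Cramér's V association between two categorical series (0 to 1)."""
    table = pd.crosstab(x, y)
    chi2 = chi2_contingency(table, correction=False)[0]
    n = table.to_numpy().sum()
    r, k = table.shape
    return float(np.sqrt(chi2 / (n * (min(r, k) - 1))))
```

For example, `cramers_v(bureau_merged['CREDIT_ACTIVE'], bureau_merged['TARGET'])` would give a comparable 0-to-1 association score.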
Correlation Analysis: Numerical Features¶
# Columns to drop
columns_to_drop = ['SK_ID_CURR', 'SK_ID_BUREAU',
'CREDIT_ACTIVE', 'CREDIT_CURRENCY',
'CREDIT_TYPE']
view_num_corr_matrix(
data=bureau_merged,
columns_to_drop=columns_to_drop,
figsize=(10, 8),
mask_upper=True,
linewidth=0.1,
fontsize=10,
cmap='coolwarm'
)
# Select numerical columns dynamically, excluding IDs
numerical_columns_bureau = [
col for col in bureau_merged.columns
if bureau_merged[col].dtype in [
'float32', 'int32'] and col not in [
'SK_ID_CURR', 'SK_ID_BUREAU']
]
# Ensure important columns are included
key_features = ['DAYS_CREDIT', 'CREDIT_DAY_OVERDUE',
'CNT_CREDIT_PROLONG', 'DAYS_CREDIT_UPDATE']
numerical_columns_bureau = list(set(
numerical_columns_bureau + key_features))
# Initialize correlation matrix
corr_mat_bureau = CorrelationMatrix(
data=bureau_merged,
target='TARGET',
numerical_columns=numerical_columns_bureau,
min_unique_values=10
)
# Compute top correlated numerical features
top_corr_numerical_bureau = corr_mat_bureau.target_top_corr(
target_top_columns=10)
# Display results
print("Top Phi-K Correlations for Numerical Features in Bureau:")
display(top_corr_numerical_bureau)
Top Phi-K Correlations for Numerical Features in Bureau:
| Column Name | Phik-Correlation | |
|---|---|---|
| 11 | DAYS_CREDIT | 0.088648 |
| 0 | DAYS_CREDIT_ENDDATE | 0.018997 |
| 5 | AMT_CREDIT_SUM_OVERDUE | 0.005654 |
| 7 | AMT_CREDIT_SUM_LIMIT | 0.005107 |
| 10 | AMT_CREDIT_MAX_OVERDUE | 0.004280 |
| 4 | CNT_CREDIT_PROLONG | 0.003862 |
| 3 | CREDIT_DAY_OVERDUE | 0.002528 |
| 8 | DAYS_CREDIT_UPDATE | 0.002158 |
| 2 | AMT_CREDIT_SUM_DEBT | 0.001695 |
| 9 | AMT_CREDIT_SUM | 0.000670 |
Insights
- Key Correlations with the Target Variable:
- DAYS_CREDIT has the highest correlation with the target (0.088648), indicating it plays a significant role in distinguishing between late payers and on-time payers. It suggests that the number of days since the last credit application is an important predictor.
- Other features, such as DAYS_CREDIT_ENDDATE (0.018997), show some level of association but are less impactful.
- Low Correlation for Other Features:
- Features such as AMT_CREDIT_SUM, AMT_CREDIT_SUM_DEBT, and AMT_ANNUITY have extremely low correlation values, indicating they have minimal direct association with the target variable.
- AMT_CREDIT_SUM_OVERDUE and AMT_CREDIT_SUM_LIMIT have slightly higher correlations, but they remain relatively weak predictors.
While DAYS_CREDIT stands out as the most relevant feature, most of the numerical variables in the bureau dataset have low correlations with the target variable. This indicates limited predictive power for these features individually, and they may need to be engineered further or used in combination to add value to a predictive model.
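One common way to combine these weak individual features, as suggested above, is to aggregate the bureau rows per applicant before merging with the application data. A sketch using pandas named aggregation (the aggregate column names here are illustrative, not from the notebook):

```python
import pandas as pd


def aggregate_bureau(bureau: pd.DataFrame) -> pd.DataFrame:
    """Collapse bureau rows to one row per applicant with summary statistics."""
    return bureau.groupby('SK_ID_CURR').agg(
        BUREAU_LOAN_COUNT=('SK_ID_BUREAU', 'count'),
        DAYS_CREDIT_MIN=('DAYS_CREDIT', 'min'),
        DAYS_CREDIT_MEAN=('DAYS_CREDIT', 'mean'),
        AMT_CREDIT_SUM_TOTAL=('AMT_CREDIT_SUM', 'sum'),
    ).reset_index()


# Toy frame: applicant 1 has two bureau credits, applicant 2 has one
toy = pd.DataFrame({
    'SK_ID_CURR': [1, 1, 2],
    'SK_ID_BUREAU': [10, 11, 12],
    'DAYS_CREDIT': [-500, -200, -1000],
    'AMT_CREDIT_SUM': [1000.0, 2000.0, 500.0],
})
feats = aggregate_bureau(toy)
```

The resulting one-row-per-applicant frame can then be merged onto `train_df` via `SK_ID_CURR`.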
Bureau: Categorical Variables Analysis¶
The goal here is to analyze the Bureau categorical features and their relationship with the target variable.
Distribution of CREDIT_ACTIVE feature and Analysis
This feature is related to the status of the Credit Bureau (CB) reported credits. It provides insights into the current status of the credit, such as whether it is active, closed, or in another state. Analyzing this feature can help us understand how the credit status influences loan repayment behavior and default risk.
unique_values_counts = bureau_merged['CREDIT_ACTIVE'].value_counts()
print(unique_values_counts)
categ_distr_target(bureau_merged, 'CREDIT_ACTIVE')
CREDIT_ACTIVE Closed 917733 Active 541919 Sold 5653 Bad debt 20 Name: count, dtype: int64 Percentages of TARGET = 0 (repaid on time) for each category in 'CREDIT_ACTIVE': Closed: 93.07% Active: 90.71% Sold: 89.90% Bad debt: 80.00%
Insights
- Bad debt accounts stand out with the highest percentage of non-repayment (20%), indicating that these accounts may represent higher-risk borrowers; note, however, that this is by far the smallest category (only 20 records), so the estimate is unstable.
- Sold accounts also show a relatively higher proportion of non-repayment compared to Closed and Active accounts; like Bad debt, this is a small category (5,653 records).
- Credit risk appears more concentrated in Bad debt and Sold accounts, suggesting areas for closer monitoring or policy adjustments.
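The per-category repayment percentages reported above can be computed with a single groupby, sketched here on an illustrative toy frame:

```python
import pandas as pd


def repaid_pct_by_category(df: pd.DataFrame, col: str) -> pd.Series:
    """Percentage of TARGET == 0 (repaid on time) within each category."""
    return (1 - df.groupby(col)['TARGET'].mean()) * 100


toy = pd.DataFrame({
    'CREDIT_ACTIVE': ['Closed', 'Closed', 'Active', 'Bad debt'],
    'TARGET': [0, 0, 1, 1],
})
pcts = repaid_pct_by_category(toy, 'CREDIT_ACTIVE')
```

On the merged data, `repaid_pct_by_category(bureau_merged, 'CREDIT_ACTIVE')` would reproduce the figures printed by `categ_distr_target`.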
Bureau: Numerical Variables Analysis¶
Distribution of DAYS_CREDIT feature and Analysis This feature represents how many days before the current application the client applied for Credit Bureau credit. It provides insight into the recency of the credit application and its potential impact on loan repayment behavior. Since this is recorded in days, we convert it to years for better interpretability.
bureau_merged['YEARS_CREDIT'] = bureau_merged['DAYS_CREDIT'] / -365
view_continuous_feat(
data=bureau_merged,
column_name='YEARS_CREDIT',
plots=['distplot', 'box'],
figsize=(15, 8),
log_scale=False
)
Insights
- Shorter credit histories are more strongly associated with non-repayment, as observed in the higher density of late payers at lower credit years. Late payers tend to have more recent credit applications compared to on-time payers.
- A robust credit history may act as a positive indicator for on-time repayment and creditworthiness.
Distribution of DAYS_CREDIT_ENDDATE feature and Analysis This feature records the remaining duration of the Credit Bureau credit (in days) at the time of the Home Credit application. Analyzing it can help us understand how the remaining credit duration influences loan repayment behavior and default risk.
# Check percentiles for DAYS_CREDIT_ENDDATE
view_percentiles(bureau_merged, column_name='DAYS_CREDIT_ENDDATE')
Percentile values for column: DAYS_CREDIT_ENDDATE The 0th percentile value of DAYS_CREDIT_ENDDATE is -42060.00 The 25th percentile value of DAYS_CREDIT_ENDDATE is -1144.00 The 50th percentile value of DAYS_CREDIT_ENDDATE is -334.00 The 75th percentile value of DAYS_CREDIT_ENDDATE is 473.00 The 90th percentile value of DAYS_CREDIT_ENDDATE is 1332.00 The 92th percentile value of DAYS_CREDIT_ENDDATE is 1503.00 The 94th percentile value of DAYS_CREDIT_ENDDATE is 1693.00 The 96th percentile value of DAYS_CREDIT_ENDDATE is 6258.00 The 98th percentile value of DAYS_CREDIT_ENDDATE is 27473.00 The 100th percentile value of DAYS_CREDIT_ENDDATE is 31199.00
view_continuous_feat(
data=bureau_merged,
column_name='DAYS_CREDIT_ENDDATE',
plots=['box'],
figsize=(8, 6),
log_scale=False
)
Insights
- The majority of the data clusters around 0 days, with a symmetrical distribution of outliers in both directions. Positive values suggest loans yet to end, while negative values indicate past loans.
- The overall trends between On-time Payers and Late Payers are similar, with no significant difference in the central tendencies or variability.
- Possible Actions:
- These extreme values (like -42,060 days, roughly 115 years) will need to be carefully addressed during the data preprocessing stage, as they could distort analysis and model predictions.
- Potential strategies include removing or capping such extreme values to maintain the integrity of the dataset.
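Capping, as suggested, can be done with percentile-based clipping (winsorizing). A minimal sketch; the 1st/99th quantile bounds are illustrative choices, not values from the notebook:

```python
import pandas as pd


def cap_extremes(s: pd.Series, lower_q: float = 0.01,
                 upper_q: float = 0.99) -> pd.Series:
    """Clip a series at the given quantiles to tame extreme outliers."""
    lo, hi = s.quantile(lower_q), s.quantile(upper_q)
    return s.clip(lower=lo, upper=hi)


# Toy series echoing the percentile spread seen in DAYS_CREDIT_ENDDATE
s = pd.Series([-42060.0, -1144.0, -334.0, 473.0, 31199.0])
capped = cap_extremes(s)
```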
Distribution of DAYS_ENDDATE_FACT feature and Analysis This feature represents the number of days since the CB credit ended at the time of the Home Credit application (recorded only for closed credits). Analyzing it can help us understand how the elapsed time since a credit ended influences loan repayment behavior and default risk.
# Check percentiles for DAYS_ENDDATE_FACT
view_percentiles(bureau_merged, column_name='DAYS_ENDDATE_FACT')
Percentile values for column: DAYS_ENDDATE_FACT The 0th percentile value of DAYS_ENDDATE_FACT is -42023.00 The 25th percentile value of DAYS_ENDDATE_FACT is -1503.00 The 50th percentile value of DAYS_ENDDATE_FACT is -900.00 The 75th percentile value of DAYS_ENDDATE_FACT is -427.00 The 90th percentile value of DAYS_ENDDATE_FACT is -177.00 The 92th percentile value of DAYS_ENDDATE_FACT is -145.00 The 94th percentile value of DAYS_ENDDATE_FACT is -111.00 The 96th percentile value of DAYS_ENDDATE_FACT is -78.00 The 98th percentile value of DAYS_ENDDATE_FACT is -43.00 The 100th percentile value of DAYS_ENDDATE_FACT is 0.00
view_continuous_feat(
data=bureau_merged,
column_name='DAYS_ENDDATE_FACT',
plots=['box'],
figsize=(8, 6),
log_scale=False,
scale_limits=[-40000, 0]
)
Insights
- Extreme Values:
- The 0th percentile value is -42,023 days (~115 years), which seems erroneous and unrealistic. This might indicate a data error or an inherited record. These extreme values should be removed during preprocessing to ensure accurate analysis.
- Box-Plot Observations:
- Defaulters (Target = 1): Tend to have fewer days since their credit ended, implying that their previous loans were closed more recently.
- Non-Defaulters (Target = 0): Generally have a longer duration since their credit ended, suggesting better financial behavior over time.
- Percentile Insights:
- Most values for this column lie within a realistic range:
- Median (50th percentile): -900 days (~2.5 years).
- 75th percentile: -427 days (~1.2 years).
- The higher percentiles (closer to 0) suggest that some loans ended very recently.
- Possible Actions:
- The discrepancy in loan closure timelines between defaulters and non-defaulters could be a key indicator in understanding repayment behavior.
- Cleaning the extreme values will make the analysis and models more robust.
Distribution of DAYS_CREDIT_UPDATE feature and Analysis This feature records how many days before the loan application the last information about the Credit Bureau credit was received.
# Check percentiles for DAYS_CREDIT_UPDATE
view_percentiles(bureau_merged, column_name='DAYS_CREDIT_UPDATE')
Percentile values for column: DAYS_CREDIT_UPDATE The 0th percentile value of DAYS_CREDIT_UPDATE is -41947.00 The 25th percentile value of DAYS_CREDIT_UPDATE is -904.00 The 50th percentile value of DAYS_CREDIT_UPDATE is -406.00 The 75th percentile value of DAYS_CREDIT_UPDATE is -33.00 The 90th percentile value of DAYS_CREDIT_UPDATE is -13.00 The 92th percentile value of DAYS_CREDIT_UPDATE is -11.00 The 94th percentile value of DAYS_CREDIT_UPDATE is -9.00 The 96th percentile value of DAYS_CREDIT_UPDATE is -7.00 The 98th percentile value of DAYS_CREDIT_UPDATE is -5.00 The 100th percentile value of DAYS_CREDIT_UPDATE is 372.00
view_continuous_feat(
data=bureau_merged,
column_name='DAYS_CREDIT_UPDATE',
plots=['box'],
figsize=(8, 6),
log_scale=False,
scale_limits=[-40000, 400]
)
Insights
- Extreme Values:
- The 0th percentile value is -41,947 days (~115 years), which is clearly erroneous. This value will need to be removed during the preprocessing stage to maintain the integrity of the analysis.
- Box-Plot Observations:
- Defaulters (Target = 1): Tend to have a smaller number of days since the Credit Bureau information was updated, indicating that their credit information might have been updated more recently compared to Non-Defaulters.
- Non-Defaulters (Target = 0): Have larger median and upper percentile values, suggesting that their credit information tends to remain stable for longer periods.
- Percentile Insights:
- Beyond the extreme 0th percentile, most of the values appear reasonable:
- Median (50th percentile): -406 days (~1.1 years).
- 75th percentile: -33 days (~1 month).
- A small percentage of records have recent updates (e.g., percentiles closer to zero), reflecting regular credit checks or updates.
- Possible Actions:
- The discrepancy in update timelines between defaulters and non-defaulters can provide valuable insights into borrower behavior.
- Cleaning the extreme erroneous value will enhance the data quality and enable more accurate predictions.
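The cleanup suggested above can be sketched with a small helper. This is a minimal sketch on toy data; the helper name and the -20,000-day cutoff are illustrative assumptions, not values fixed by this notebook:

```python
import numpy as np
import pandas as pd


def drop_extreme_days(df, column, lower_bound_days):
    """Set values older than lower_bound_days to NaN (assumed cutoff)."""
    out = df.copy()
    out.loc[out[column] < lower_bound_days, column] = np.nan
    return out


# Toy example: -41947 (~115 years) is clearly erroneous.
toy = pd.DataFrame({'DAYS_CREDIT_UPDATE': [-41947, -904, -406, -33]})
cleaned = drop_extreme_days(toy, 'DAYS_CREDIT_UPDATE', -20000)
print(cleaned['DAYS_CREDIT_UPDATE'].tolist())
```

Marking the extremes as NaN rather than dropping rows keeps the rest of each record available for aggregation.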
Bureau Balance Data Analysis¶
Context:
Monthly balances of previous credits in Credit Bureau.
- This table has one row for each month of history of every previous credit reported to Credit Bureau – i.e. the table has (# loans in sample * # of relative previous credits * # of months where we have some history observable for the previous credits) rows.
print(f'The shape of bureau_balance.csv is: {bureau_bal_df.shape}')
# duplicated() returns a boolean mask; sum it to count duplicate rows
dup_count = bureau_bal_df.duplicated().sum()
print(f'Number of duplicate values in bureau_balance: {dup_count}')
display(bureau_bal_df.head(5))
The shape of bureau_balance.csv is: (27299925, 3) Number of duplicate values in bureau_balance: 0
| SK_ID_BUREAU | MONTHS_BALANCE | STATUS | |
|---|---|---|---|
| 0 | 5715448 | 0 | C |
| 1 | 5715448 | -1 | C |
| 2 | 5715448 | -2 | C |
| 3 | 5715448 | -3 | C |
| 4 | 5715448 | -4 | C |
print(f'Number of unique SK_ID_BUREAU in bureau_balance.csv are: '
      f'{len(bureau_bal_df.SK_ID_BUREAU.unique())}')
print(f'Number of unique values for STATUS are: '
f'{len(bureau_bal_df.STATUS.unique())}')
print(f'Unique values of STATUS are:\n{bureau_bal_df.STATUS.unique()}')
print(f'Max number of months for Months Balance: '
f'{np.abs(bureau_bal_df.MONTHS_BALANCE.min())}')
Number of unique SK_ID_BUREAU in bureau_balance.csv are: 817395 Number of unique values for STATUS are: 8 Unique values of STATUS are: ['C', '0', 'X', '1', '2', '3', '5', '4'] Categories (8, object): ['0', '1', '2', '3', '4', '5', 'C', 'X'] Max number of months for Months Balance: 96
# NaN values in the bureau_balance dataset
view_percent_nan(null_df(
bureau_bal_df), 'bureau_balance', grid=True)
The dataframe bureau_balance does not contain any NaN values.
Insights
- Dataset Overview:
- The bureau_balance dataset has a total of 27,299,925 rows and 3 columns.
- There are no duplicate rows, ensuring data integrity.
- Unique Identifiers:
- The column SK_ID_BUREAU contains 817,395 unique identifiers, linking each monthly record to a specific previous credit reported to the Credit Bureau.
- The STATUS column has 8 unique values, representing different credit status categories.
- STATUS Values:
- The unique STATUS categories are: ['C', '0', 'X', '1', '2', '3', '5', '4'].
- These represent various credit conditions, such as closed accounts ('C'), no overdue ('0'), unknown status ('X'), and increasing levels of delinquency ('1' through '5').
- Months Balance:
- The dataset spans a maximum of 96 months, providing historical insight into borrowers' credit activity over the last 8 years.
- Missing Values:
- The dataset contains no missing values, ensuring complete data coverage for all records.
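Because each SK_ID_BUREAU carries up to 96 monthly STATUS rows, a natural next step is to roll the monthly history up into per-credit features. A minimal sketch on toy data; the column names match bureau_balance, the '1'-'5' delinquency reading follows the insight above, and the derived feature names are assumptions:

```python
import pandas as pd

# Toy stand-in for bureau_bal_df; column names match the real table.
toy = pd.DataFrame({
    'SK_ID_BUREAU': [1, 1, 1, 2, 2],
    'MONTHS_BALANCE': [0, -1, -2, 0, -1],
    'STATUS': ['C', '0', '1', '2', 'X'],
})

# STATUS '1'-'5' encode increasing delinquency; flag those months.
dpd_statuses = ['1', '2', '3', '4', '5']
agg = (
    toy.assign(IS_DPD=toy['STATUS'].isin(dpd_statuses).astype(int))
       .groupby('SK_ID_BUREAU')
       .agg(MONTHS_TRACKED=('MONTHS_BALANCE', 'size'),
            MONTHS_DPD=('IS_DPD', 'sum'))
       .reset_index()
)
print(agg)
```

Aggregates like these can later be joined back to bureau.csv via SK_ID_BUREAU and then to the application data via SK_ID_CURR.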
Previous Application Data Analysis¶
Context:
- All previous applications for Home Credit loans of clients who have loans in our sample.
- There is one row for each previous application related to loans in our data sample.
print(f'The shape of previous_application is: {prev_app_df.shape}')
print(f'Number of unique SK_ID_PREV in previous_application are: '
      f'{len(prev_app_df.SK_ID_PREV.unique())}')
print(f'Number of unique SK_ID_CURR in previous_application are: '
f'{len(prev_app_df.SK_ID_CURR.unique())}')
train_unique = set(train_df.SK_ID_CURR.unique())
prev_app_unique = set(prev_app_df.SK_ID_CURR.unique())
overlap_train = len(train_unique.intersection(prev_app_unique))
print(f'Number of overlapping SK_ID_CURR in application_train.csv and '
f'previous_application are: {overlap_train}')
test_unique = set(test_df.SK_ID_CURR.unique())
overlap_test = len(test_unique.intersection(prev_app_unique))
print(f'Number of overlapping SK_ID_CURR in application_test.csv and '
f'previous_application are: {overlap_test}')
duplicates = prev_app_df.shape[0] - prev_app_df.drop_duplicates().shape[0]
print(f'Number of duplicate values in previous_application: {duplicates}')
display(prev_app_df.head())
The shape of previous_application is: (1670214, 37) Number of unique SK_ID_PREV in previous_application are: 1670214 Number of unique SK_ID_CURR in previous_application are: 338857 Number of overlapping SK_ID_CURR in application_train.csv and previous_application are: 291057 Number of overlapping SK_ID_CURR in application_test.csv and previous_application are: 47800 Number of duplicate values in previous_application: 0
| SK_ID_PREV | SK_ID_CURR | NAME_CONTRACT_TYPE | AMT_ANNUITY | AMT_APPLICATION | AMT_CREDIT | AMT_DOWN_PAYMENT | AMT_GOODS_PRICE | WEEKDAY_APPR_PROCESS_START | HOUR_APPR_PROCESS_START | FLAG_LAST_APPL_PER_CONTRACT | NFLAG_LAST_APPL_IN_DAY | RATE_DOWN_PAYMENT | RATE_INTEREST_PRIMARY | RATE_INTEREST_PRIVILEGED | NAME_CASH_LOAN_PURPOSE | NAME_CONTRACT_STATUS | DAYS_DECISION | NAME_PAYMENT_TYPE | CODE_REJECT_REASON | NAME_TYPE_SUITE | NAME_CLIENT_TYPE | NAME_GOODS_CATEGORY | NAME_PORTFOLIO | NAME_PRODUCT_TYPE | CHANNEL_TYPE | SELLERPLACE_AREA | NAME_SELLER_INDUSTRY | CNT_PAYMENT | NAME_YIELD_GROUP | PRODUCT_COMBINATION | DAYS_FIRST_DRAWING | DAYS_FIRST_DUE | DAYS_LAST_DUE_1ST_VERSION | DAYS_LAST_DUE | DAYS_TERMINATION | NFLAG_INSURED_ON_APPROVAL | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2030495 | 271877 | Consumer loans | 1730.430054 | 17145.0 | 17145.0 | 0.0 | 17145.0 | SATURDAY | 15 | Y | 1 | 0.0 | 0.182832 | 0.867336 | XAP | Approved | -73 | Cash through the bank | XAP | NaN | Repeater | Mobile | POS | XNA | Country-wide | 35 | Connectivity | 12.0 | middle | POS mobile with interest | 365243.0 | -42.0 | 300.0 | -42.0 | -37.0 | 0.0 |
| 1 | 2802425 | 108129 | Cash loans | 25188.615234 | 607500.0 | 679671.0 | NaN | 607500.0 | THURSDAY | 11 | Y | 1 | NaN | NaN | NaN | XNA | Approved | -164 | XNA | XAP | Unaccompanied | Repeater | XNA | Cash | x-sell | Contact center | -1 | XNA | 36.0 | low_action | Cash X-Sell: low | 365243.0 | -134.0 | 916.0 | 365243.0 | 365243.0 | 1.0 |
| 2 | 2523466 | 122040 | Cash loans | 15060.735352 | 112500.0 | 136444.5 | NaN | 112500.0 | TUESDAY | 11 | Y | 1 | NaN | NaN | NaN | XNA | Approved | -301 | Cash through the bank | XAP | Spouse, partner | Repeater | XNA | Cash | x-sell | Credit and cash offices | -1 | XNA | 12.0 | high | Cash X-Sell: high | 365243.0 | -271.0 | 59.0 | 365243.0 | 365243.0 | 1.0 |
| 3 | 2819243 | 176158 | Cash loans | 47041.335938 | 450000.0 | 470790.0 | NaN | 450000.0 | MONDAY | 7 | Y | 1 | NaN | NaN | NaN | XNA | Approved | -512 | Cash through the bank | XAP | NaN | Repeater | XNA | Cash | x-sell | Credit and cash offices | -1 | XNA | 12.0 | middle | Cash X-Sell: middle | 365243.0 | -482.0 | -152.0 | -182.0 | -177.0 | 1.0 |
| 4 | 1784265 | 202054 | Cash loans | 31924.394531 | 337500.0 | 404055.0 | NaN | 337500.0 | THURSDAY | 9 | Y | 1 | NaN | NaN | NaN | Repairs | Refused | -781 | Cash through the bank | HC | NaN | Repeater | XNA | Cash | walk-in | Credit and cash offices | -1 | XNA | 24.0 | high | Cash Street: high | NaN | NaN | NaN | NaN | NaN | NaN |
Insights
- Dataset Overview:
- The previous_application dataset contains 1,670,214 rows and 37 columns.
- There are no duplicate rows, ensuring clean and reliable data for analysis.
- Unique Identifiers:
- The column SK_ID_PREV has 1,670,214 unique entries, one for each previous application record.
- The column SK_ID_CURR has 338,857 unique values, corresponding to individual applicants.
- Overlap with Training and Test Data:
- 291,057 applicants (SK_ID_CURR) overlap with the application_train.csv dataset.
- 47,800 applicants overlap with the application_test.csv dataset.
This dataset provides a detailed record of previous loan applications, offering valuable insights into the applicants’ credit and loan histories. It serves as an essential source for understanding historical loan behaviors, which can be leveraged to enhance predictions for new applications.
Previous Application: Null Values analysis¶
# NaN values in the training dataset
view_percent_nan(null_df(
prev_app_df), 'Previous Applications Dataset', grid=True)
Number of columns having NaN values: 16 columns
Insights
- Extent of Missing Values:
- A total of 16 columns in the prev_app_df dataset contain missing (NaN) values.
- Features with High Missing Values:
- Columns such as RATE_INTEREST_PRIVILEGED and RATE_INTEREST_PRIMARY have nearly 100% missing values, indicating these features may not be usable for analysis unless imputed or dropped, though they will still be examined before a final decision is made.
- Insights from Visualization:
- The chart provides a clear overview of the percentage of missing values for each column, helping prioritize which features need attention during preprocessing.
- Possible Actions:
- Columns with excessive NaN values (e.g., >80%) may need to be removed or handled carefully.
- For columns with moderate missing values, consider imputation strategies based on the data distribution or domain-specific knowledge.
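The drop-above-a-threshold rule suggested above can be sketched as follows; the 80% cutoff and the helper name are illustrative assumptions:

```python
import numpy as np
import pandas as pd


def drop_mostly_missing(df, threshold=0.8):
    """Split off columns whose NaN share exceeds the assumed threshold."""
    nan_share = df.isna().mean()
    dropped = sorted(nan_share[nan_share > threshold].index)
    return df.drop(columns=dropped), dropped


# Toy data: one column 90% missing, one 20% missing.
toy = pd.DataFrame({
    'RATE_INTEREST_PRIMARY': [np.nan] * 9 + [0.18],
    'AMT_ANNUITY': [1000.0] * 8 + [np.nan, np.nan],
})
kept, dropped = drop_mostly_missing(toy, threshold=0.8)
print(dropped)  # ['RATE_INTEREST_PRIMARY']
```

Returning the list of dropped columns keeps the preprocessing decision auditable and makes it easy to apply the same column set to the test data.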
Merging Previous Application with Application Data Target¶
# Merge the filtered train_df with prev_app_df
prev_app_merged = train_df[['SK_ID_CURR', 'TARGET']].merge(
prev_app_df, on='SK_ID_CURR', how='left')
# Verify the merge
print("Merging completed. Shape of prev_app_merged:", prev_app_merged.shape)
print(prev_app_merged['TARGET'].value_counts(dropna=False))
Merging completed. Shape of prev_app_merged: (1430155, 38) TARGET 0 1306815 1 123340 Name: count, dtype: int64
Correlation Analysis: Categorical Features¶
prev_app_merged['TARGET'] = prev_app_merged['TARGET'].astype(int)
# Define categorical columns and include additional relevant features
cat_cols_prev_app = ['TARGET'] + prev_app_merged.dtypes[
prev_app_merged.dtypes == 'category'].index.tolist() + [
'NFLAG_INSURED_ON_APPROVAL']
view_categ_corr_matrix(
data=prev_app_merged,
categorical_columns=cat_cols_prev_app,
figsize=(12, 10),
mask_upper=True,
show_target_top_corr=True
)
Categories with highest correlation value with Target:
| Column Name | Phik-Correlation | |
|---|---|---|
| 5 | NAME_CONTRACT_STATUS | 0.088266 |
| 16 | PRODUCT_COMBINATION | 0.063839 |
| 7 | CODE_REJECT_REASON | 0.062771 |
| 1 | NAME_CONTRACT_TYPE | 0.050859 |
| 13 | CHANNEL_TYPE | 0.050302 |
| 10 | NAME_GOODS_CATEGORY | 0.042951 |
| 4 | NAME_CASH_LOAN_PURPOSE | 0.040305 |
| 6 | NAME_PAYMENT_TYPE | 0.039752 |
| 14 | NAME_SELLER_INDUSTRY | 0.038077 |
| 15 | NAME_YIELD_GROUP | 0.034626 |
Insights
- Strong Correlation with TARGET:
- NAME_CONTRACT_STATUS (Phi-K: 0.088), PRODUCT_COMBINATION (Phi-K: 0.064), and CODE_REJECT_REASON (Phi-K: 0.063) show the highest correlations with loan repayment behavior. These features require further investigation as they may significantly influence the likelihood of default.
- Inter-feature Relationships:
- PRODUCT_COMBINATION is closely linked to features like NAME_CONTRACT_TYPE, NAME_PRODUCT_TYPE, and NAME_PORTFOLIO, indicating shared relevance in loan structuring.
- NAME_GOODS_CATEGORY is strongly associated with NAME_SELLER_INDUSTRY, suggesting the type of goods and seller industry may jointly impact credit outcomes.
- Notable Features:
- CHANNEL_TYPE and NAME_CASH_LOAN_PURPOSE also show moderate correlations with TARGET, implying their potential role in loan repayment analysis.
Correlation Analysis: Numerical Features¶
num_cols_prev_app = prev_app_merged.select_dtypes(
include=['float32', 'float64', 'int32', 'int64']).columns.tolist()
num_cols_prev_app = [
col for col in num_cols_prev_app if col not in [
'SK_ID_CURR', 'SK_ID_PREV']
]
corr_mat_prev = CorrelationMatrix(
data=prev_app_merged,
target='TARGET',
numerical_columns=num_cols_prev_app,
min_unique_values=10
)
corr_mat_prev.plot_correlation_matrix(figsize=(14, 12))
top_corr_numerical = corr_mat_prev.target_top_corr(target_top_columns=10)
print("Top Phi-K Correlations for Numerical Features:")
display(top_corr_numerical)
Top Phi-K Correlations for Numerical Features:
| Column Name | Phik-Correlation | |
|---|---|---|
| 11 | CNT_PAYMENT | 0.056639 |
| 9 | DAYS_DECISION | 0.053695 |
| 12 | DAYS_FIRST_DRAWING | 0.048990 |
| 6 | RATE_DOWN_PAYMENT | 0.039592 |
| 5 | HOUR_APPR_PROCESS_START | 0.038121 |
| 8 | RATE_INTEREST_PRIVILEGED | 0.028204 |
| 14 | DAYS_LAST_DUE_1ST_VERSION | 0.027877 |
| 15 | DAYS_LAST_DUE | 0.027321 |
| 16 | DAYS_TERMINATION | 0.026480 |
| 0 | AMT_ANNUITY | 0.013808 |
Insights
- Highly Correlated Feature Pairs:
- Strong correlations are observed between:
- DAYS_TERMINATION and DAYS_LAST_DUE
- AMT_CREDIT, AMT_APPLICATION, and AMT_GOODS_PRICE
- AMT_CREDIT and AMT_ANNUITY
- These relationships suggest redundancy, which may require feature selection or combination.
- Correlation with Target:
- Features like CNT_PAYMENT and DAYS_DECISION show the highest correlation with the TARGET, but these correlations are weak, indicating limited direct relationships with the target variable.
- Key Insights for Modeling:
- Limited correlation with the target implies that complex, non-linear models may better capture patterns in the data.
- Highly correlated feature pairs could introduce multicollinearity and might need to be addressed during feature engineering.
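Pairs such as AMT_CREDIT/AMT_GOODS_PRICE can be flagged programmatically before feature selection. A minimal sketch using plain Pearson correlation on toy data; the 0.9 threshold is an assumed cutoff:

```python
import numpy as np
import pandas as pd


def correlated_pairs(df, threshold=0.9):
    """List numeric column pairs with |Pearson r| above threshold."""
    corr = df.corr(numeric_only=True).abs()
    # Keep only the upper triangle so each pair is reported once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    return [(a, b, round(float(upper.loc[a, b]), 3))
            for a in upper.index for b in upper.columns
            if upper.loc[a, b] > threshold]


# Toy data: AMT_CREDIT and AMT_GOODS_PRICE move almost in lockstep.
toy = pd.DataFrame({
    'AMT_CREDIT': [100.0, 200.0, 300.0, 400.0],
    'AMT_GOODS_PRICE': [90.0, 210.0, 290.0, 410.0],
    'CNT_PAYMENT': [12.0, 6.0, 24.0, 10.0],
})
pairs = correlated_pairs(toy, threshold=0.9)
print(pairs)
```

Dropping (or combining) one column from each flagged pair is a common way to reduce the multicollinearity noted above.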
Previous Application: Categorical Variables Analysis¶
The goal here is to analyze the Bureau categorical features and their relationship with the target variable.
Distribution of NAME_CONTRACT_TYPE feature and Analysis
The NAME_CONTRACT_TYPE feature represents the type of contract associated with previous applications, such as cash loans, consumer loans (POS), and others. Analyzing its distribution helps in understanding the preference for various product types among applicants and their potential correlation with loan repayment behavior.
unique_values_counts = prev_app_merged['NAME_CONTRACT_TYPE'].value_counts()
print(unique_values_counts)
categ_distr_target(prev_app_merged, 'NAME_CONTRACT_TYPE')
NAME_CONTRACT_TYPE Cash loans 626764 Consumer loans 625256 Revolving loans 161368 XNA 313 Name: count, dtype: int64 Percentages of TARGET = 0 (repaid on time) for each category in 'NAME_CONTRACT_TYPE': Cash loans: 90.87% Consumer loans: 92.29% Revolving loans: 89.53% XNA: 79.87%
Insights
The NAME_CONTRACT_TYPE distribution reveals that the majority of previous loans are either Cash Loans (44.33%) or Consumer Loans (44.23%). A smaller proportion consists of Revolving Loans (11.41%) and a negligible number of loans labeled as XNA (0.02%).
- Default Rates:
- Loans categorized as XNA have the highest default rate (20.13%), though their sample size is minimal.
- Revolving Loans follow with a default rate of 10.47%.
- Cash Loans and Consumer Loans exhibit relatively lower default rates, at 9.13% and 7.71% respectively.
- Implications:
- The higher default rate for Revolving Loans and XNA loans suggests they may require stricter risk assessment.
- The low default rate for Consumer Loans highlights their relatively better repayment behavior.
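The per-category percentages reported above come from the project-specific helper categ_distr_target; the underlying computation is a plain groupby on TARGET. A minimal sketch on toy data (the numbers are illustrative, not the real rates):

```python
import pandas as pd

# Toy stand-in for prev_app_merged; values are illustrative only.
toy = pd.DataFrame({
    'NAME_CONTRACT_TYPE': ['Cash loans', 'Cash loans', 'Consumer loans',
                           'Consumer loans', 'Revolving loans'],
    'TARGET': [1, 0, 0, 0, 1],
})

# Default rate per category: share of TARGET = 1 within each group.
rates = (
    toy.groupby('NAME_CONTRACT_TYPE')['TARGET']
       .mean()
       .mul(100)
       .round(2)
       .sort_values(ascending=False)
)
print(rates)
```

The repayment percentage reported by the helper is simply 100 minus this default rate.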
Distribution of NAME_CONTRACT_STATUS feature and Analysis
The NAME_CONTRACT_STATUS feature represents the contract status during the month.
unique_values_counts = prev_app_merged['NAME_CONTRACT_STATUS'].value_counts()
print(unique_values_counts)
categ_distr_target(prev_app_merged, 'NAME_CONTRACT_STATUS')
NAME_CONTRACT_STATUS Approved 886099 Canceled 259441 Refused 245390 Unused offer 22771 Name: count, dtype: int64 Percentages of TARGET = 0 (repaid on time) for each category in 'NAME_CONTRACT_STATUS': Approved: 92.41% Canceled: 90.83% Refused: 88.00% Unused offer: 91.75%
Insights
- Loan Approval Trends:
- Approved loans dominate with 62.68% of all loans.
- Canceled and Refused loans each account for around 18%.
- Unused offers are the least common, making up only 1.61% of all loans.
- Default Rates:
- Loans with a Refused status show the highest default rate at 12%, likely due to profiles being inadequate for approval.
- Loans with a Canceled status have a default rate of 9.17%.
- The Approved loans exhibit the lowest default rate of 7.59%, aligning with logical expectations given their approval status.
These observations suggest that past contract statuses can provide valuable insights into repayment behavior, with “Refused” loans being a significant indicator of higher default likelihood.
Distribution of CODE_REJECT_REASON feature and Analysis
This feature represents the reason for rejection of the previous application.
unique_values_counts = prev_app_merged['CODE_REJECT_REASON'].value_counts()
print(unique_values_counts)
categ_distr_target(prev_app_merged, 'CODE_REJECT_REASON')
CODE_REJECT_REASON XAP 1145533 HC 145984 LIMIT 47773 SCO 32636 CLIENT 22771 SCOFR 10875 XNA 4378 VERIF 3079 SYSTEM 672 Name: count, dtype: int64 Percentages of TARGET = 0 (repaid on time) for each category in 'CODE_REJECT_REASON': XAP: 92.05% HC: 88.27% LIMIT: 87.44% SCO: 90.00% CLIENT: 91.75% SCOFR: 79.07% XNA: 89.90% VERIF: 90.74% SYSTEM: 93.75%
Insights
- Distribution of Rejection Reasons:
- The most common value is XAP, accounting for roughly 81% of all records; in this dataset XAP generally marks applications that were not actually refused.
- Other notable reasons include HC (10.33%) and LIMIT (3.38%).
- Less common reasons include SCOFR, VERIF, and SYSTEM, which together represent a very small fraction of the dataset.
- Default Rates by Rejection Reason:
- Applicants rejected for the SCOFR reason show the highest default rate at 20.93%, indicating a strong association between this rejection reason and defaults.
- Other reasons with relatively higher default rates include LIMIT (12.56%) and HC (11.73%).
- XAP, the most common rejection reason, has a low default rate of 7.95%, suggesting that while frequent, it does not heavily associate with defaults.
- The lowest default rate is observed for the SYSTEM reason, at just 6.25%.
These findings indicate that certain rejection reasons, particularly SCOFR, LIMIT, and HC, may be potential risk indicators of default and warrant further investigation.
Distribution of CHANNEL_TYPE feature and Analysis
unique_values_counts = prev_app_merged['CHANNEL_TYPE'].value_counts()
print(unique_values_counts)
categ_distr_target(prev_app_merged, 'CHANNEL_TYPE')
CHANNEL_TYPE Credit and cash offices 600342 Country-wide 423062 Stone 183508 Regional / Local 92472 Contact center 58891 AP+ (Cash loan) 49624 Channel of corporate sales 5396 Car dealer 406 Name: count, dtype: int64 Percentages of TARGET = 0 (repaid on time) for each category in 'CHANNEL_TYPE': Credit and cash offices: 90.99% Country-wide: 91.74% Stone: 92.63% Regional / Local: 92.39% Contact center: 89.73% AP+ (Cash loan): 87.17% Channel of corporate sales: 93.62% Car dealer: 94.83%
Insights
- Most Common Channels:
- The majority of previous applications were through Credit and Cash Offices (42.47%) and Country-wide (29.93%).
- Other channels, such as Stone (12.98%) and Regional/Local (6.54%), had relatively fewer applications.
- The least-used channels were Car Dealer (0.03%) and Channel of Corporate Sales (0.38%).
- Default Rates by Channel:
- The highest default rate was seen in the AP+ (Cash loan) channel, with 12.83% of applicants defaulting.
- Contact Center and Credit and Cash Offices also showed relatively higher default rates at 10.27% and 9.01%, respectively.
- The Car Dealer channel had the lowest default rate at 5.17%, indicating better creditworthiness in this category.
These insights indicate that certain channels like AP+ (Cash loan) and Contact Center are associated with higher default risks, whereas Car Dealer and Channel of Corporate Sales show relatively safer applicant profiles.
Distribution of PRODUCT_COMBINATION feature and Analysis
unique_values_counts = prev_app_merged['PRODUCT_COMBINATION'].value_counts()
print(unique_values_counts)
categ_distr_target(prev_app_merged, 'PRODUCT_COMBINATION')
PRODUCT_COMBINATION Cash 235101 POS household with interest 226464 POS mobile with interest 190850 Cash X-Sell: middle 120036 Cash X-Sell: low 110599 Card Street 94594 POS industry with interest 83508 POS household without interest 71519 Card X-Sell 66774 Cash Street: high 51699 Cash X-Sell: high 50802 Cash Street: middle 29934 Cash Street: low 28593 POS other with interest 20309 POS mobile without interest 19890 POS industry without interest 10621 POS others without interest 2095 Name: count, dtype: int64 Percentages of TARGET = 0 (repaid on time) for each category in 'PRODUCT_COMBINATION': Cash: 90.46% POS household with interest: 92.27% POS mobile with interest: 91.19% Cash X-Sell: middle: 92.16% Cash X-Sell: low: 93.42% Card Street: 88.71% POS industry with interest: 93.70% POS household without interest: 93.36% Card X-Sell: 90.70% Cash Street: high: 88.64% Cash X-Sell: high: 88.51% Cash Street: middle: 88.38% Cash Street: low: 89.90% POS other with interest: 91.87% POS mobile without interest: 92.12% POS industry without interest: 95.36% POS others without interest: 92.79%
Insights
- Most Common Product Combinations:
- The three most common product combinations are Cash (16.63%), POS Household with Interest (16.02%), and POS Mobile with Interest (13.50%), accounting for about 46% of all applications.
- Other notable combinations include Cash X-Sell: Middle (8.49%) and Cash X-Sell: Low (7.83%), with lower representation for categories like POS Others Without Interest (0.15%) and POS Industry Without Interest (0.75%).
- Default Rates by Product Combination:
- Highest Default Rates:
- Categories like Cash Street: High (11.36%), Cash X-Sell: High (11.49%), Card Street (11.29%), and Cash Street: Middle (11.62%) show the highest percentage of defaulters.
- Lowest Default Rates:
- The POS Industry Without Interest category exhibits the lowest default rate at 4.64%, indicating safer credit profiles.
- Other combinations such as Cash X-Sell: Low (6.58%) and POS Household Without Interest (6.64%) also demonstrate relatively low default rates.
Observations and Insights:
- Cash-related categories (like Cash Street and Cash X-Sell) show a higher likelihood of defaults compared to POS-related categories.
- POS combinations without interest are generally associated with lower default rates, likely reflecting more stable repayment behaviors.
Previous Application: Numerical Variables Analysis¶
Distribution of DAYS_DECISION feature and Analysis
This feature indicates, relative to the current application, when the decision about the previous application was made.
view_continuous_feat(
data=prev_app_merged,
column_name='DAYS_DECISION',
plots=['distplot', 'box'],
figsize=(15, 8),
log_scale=False,
)
Insights
- Distribution Insights:
- The distribution of DAYS_DECISION shows that decisions for both defaulters and non-defaulters cluster closer to recent times, with the peak around 0 to -500 days.
- Non-defaulters have slightly more spread-out decision timelines compared to defaulters.
- Observations:
- Defaulters: The decision on previous applications for defaulters tends to be made more recently than for non-defaulters.
- Non-Defaulters: Non-defaulters generally had decisions on their previous applications made further back in time, implying potentially better credit profiles.
This behavior may indicate that recent application decisions could serve as a subtle indicator of financial instability or repayment challenges.
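The defaulter vs. non-defaulter comparison above can be quantified with a simple group summary. A minimal sketch on toy data (values are illustrative, not the real distribution):

```python
import pandas as pd

# Toy stand-in for prev_app_merged; values are illustrative only.
toy = pd.DataFrame({
    'TARGET': [1, 1, 1, 0, 0, 0],
    'DAYS_DECISION': [-100, -250, -400, -600, -900, -1500],
})

# Summarise decision recency for defaulters vs. non-defaulters.
summary = toy.groupby('TARGET')['DAYS_DECISION'].agg(['median', 'mean'])
print(summary)
```

In this toy data, defaulters (TARGET = 1) show the more recent median decision (-250 vs. -900 days), mirroring the pattern described above.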
Distribution of DAYS_FIRST_DRAWING feature and Analysis
# Check percentiles for DAYS_FIRST_DRAWING
view_percentiles(prev_app_merged, column_name='DAYS_FIRST_DRAWING')
view_continuous_feat(
data=prev_app_merged,
column_name='DAYS_FIRST_DRAWING',
plots=['box'],
figsize=(8, 6),
log_scale=False,
scale_limits=[-3000, 0]
)
Percentile values for column: DAYS_FIRST_DRAWING The 0th percentile value of DAYS_FIRST_DRAWING is -2922.00 The 25th percentile value of DAYS_FIRST_DRAWING is 365243.00 The 50th percentile value of DAYS_FIRST_DRAWING is 365243.00 The 75th percentile value of DAYS_FIRST_DRAWING is 365243.00 The 90th percentile value of DAYS_FIRST_DRAWING is 365243.00 The 92th percentile value of DAYS_FIRST_DRAWING is 365243.00 The 94th percentile value of DAYS_FIRST_DRAWING is 365243.00 The 96th percentile value of DAYS_FIRST_DRAWING is 365243.00 The 98th percentile value of DAYS_FIRST_DRAWING is 365243.00 The 100th percentile value of DAYS_FIRST_DRAWING is 365243.00
Insights
- Erroneous Values: The majority of values in DAYS_FIRST_DRAWING are the placeholder 365243 (from the 25th percentile upward). As already mentioned for DAYS_EMPLOYED, 365243 is used in this dataset to signify "infinity" or missing data. These points will need to be removed in the data preprocessing stage.
- Distribution: After excluding erroneous values, Defaulters tend to have their first drawing on previous credit more recently compared to Non-Defaulters. This is evident from the box plot, where Defaulters show lower median and 75th percentile values than Non-Defaulters.
Possible Actions:
- Remove erroneous values (e.g., 365243) during preprocessing.
- Further analyze valid data for more robust insights.
Distribution of DAYS_FIRST_DUE, DAYS_LAST_DUE_1ST_VERSION, DAYS_LAST_DUE, and DAYS_TERMINATION features and Analysis
# Check percentiles for the remaining DAYS_* columns
view_percentiles(prev_app_merged, column_name='DAYS_FIRST_DUE')
print('-'*80)
view_percentiles(prev_app_merged, column_name='DAYS_LAST_DUE_1ST_VERSION')
print('-'*80)
view_percentiles(prev_app_merged, column_name='DAYS_LAST_DUE')
print('-'*80)
view_percentiles(prev_app_merged, column_name='DAYS_TERMINATION')
Percentile values for column: DAYS_FIRST_DUE The 0th percentile value of DAYS_FIRST_DUE is -2892.00 The 25th percentile value of DAYS_FIRST_DUE is -1630.00 The 50th percentile value of DAYS_FIRST_DUE is -825.00 The 75th percentile value of DAYS_FIRST_DUE is -408.00 The 90th percentile value of DAYS_FIRST_DUE is -173.00 The 92th percentile value of DAYS_FIRST_DUE is -137.00 The 94th percentile value of DAYS_FIRST_DUE is -86.00 The 96th percentile value of DAYS_FIRST_DUE is -4.00 The 98th percentile value of DAYS_FIRST_DUE is 365243.00 The 100th percentile value of DAYS_FIRST_DUE is 365243.00 -------------------------------------------------------------------------------- Percentile values for column: DAYS_LAST_DUE_1ST_VERSION The 0th percentile value of DAYS_LAST_DUE_1ST_VERSION is -2801.00 The 25th percentile value of DAYS_LAST_DUE_1ST_VERSION is -1244.00 The 50th percentile value of DAYS_LAST_DUE_1ST_VERSION is -358.00 The 75th percentile value of DAYS_LAST_DUE_1ST_VERSION is 135.00 The 90th percentile value of DAYS_LAST_DUE_1ST_VERSION is 1374.00 The 92th percentile value of DAYS_LAST_DUE_1ST_VERSION is 365243.00 The 94th percentile value of DAYS_LAST_DUE_1ST_VERSION is 365243.00 The 96th percentile value of DAYS_LAST_DUE_1ST_VERSION is 365243.00 The 98th percentile value of DAYS_LAST_DUE_1ST_VERSION is 365243.00 The 100th percentile value of DAYS_LAST_DUE_1ST_VERSION is 365243.00 -------------------------------------------------------------------------------- Percentile values for column: DAYS_LAST_DUE The 0th percentile value of DAYS_LAST_DUE is -2889.00 The 25th percentile value of DAYS_LAST_DUE is -1316.00 The 50th percentile value of DAYS_LAST_DUE is -534.00 The 75th percentile value of DAYS_LAST_DUE is -73.00 The 90th percentile value of DAYS_LAST_DUE is 365243.00 The 92th percentile value of DAYS_LAST_DUE is 365243.00 The 94th percentile value of DAYS_LAST_DUE is 365243.00 The 96th percentile value of DAYS_LAST_DUE is 365243.00 The 98th percentile value of DAYS_LAST_DUE is 365243.00 The 100th percentile value of DAYS_LAST_DUE is 365243.00 -------------------------------------------------------------------------------- Percentile values for column: DAYS_TERMINATION The 0th percentile value of DAYS_TERMINATION is -2874.00 The 25th percentile value of DAYS_TERMINATION is -1269.00 The 50th percentile value of DAYS_TERMINATION is -494.00 The 75th percentile value of DAYS_TERMINATION is -42.00 The 90th percentile value of DAYS_TERMINATION is 365243.00 The 92th percentile value of DAYS_TERMINATION is 365243.00 The 94th percentile value of DAYS_TERMINATION is 365243.00 The 96th percentile value of DAYS_TERMINATION is 365243.00 The 98th percentile value of DAYS_TERMINATION is 365243.00 The 100th percentile value of DAYS_TERMINATION is 365243.00
Insights
- Erroneous Values:
- Across all the DAYS_* columns (DAYS_FIRST_DUE, DAYS_LAST_DUE_1ST_VERSION, DAYS_LAST_DUE, DAYS_TERMINATION), we observe the presence of erroneous values like 365243. These are evident in the 98th percentile and above in most cases.
- Such values likely represent missing or placeholder data and should be addressed during preprocessing.
- Distribution Insights:
- For valid values, these columns show logical ranges where earlier days correspond to earlier events (e.g., due dates, termination dates). Defaulters generally have earlier due and termination dates compared to Non-Defaulters.
- Removing erroneous values will provide a clearer analysis and better input for modeling.
Possible Actions:
- Preprocessing: Replace or drop the erroneous values (e.g., 365243) to ensure the data is clean for analysis and modeling.
- Analysis Post-Cleanup: Re-evaluate these columns after handling erroneous values to extract meaningful patterns related to TARGET.
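The preprocessing step suggested above, treating 365243 as missing, can be sketched with a small helper; the helper name is an assumption:

```python
import numpy as np
import pandas as pd

DAYS_COLS = ['DAYS_FIRST_DRAWING', 'DAYS_FIRST_DUE',
             'DAYS_LAST_DUE_1ST_VERSION', 'DAYS_LAST_DUE',
             'DAYS_TERMINATION']


def clean_days_placeholder(df, columns=DAYS_COLS, placeholder=365243):
    """Replace the 365243 placeholder with NaN in the given columns."""
    out = df.copy()
    present = [c for c in columns if c in out.columns]
    out[present] = out[present].replace(placeholder, np.nan)
    return out


# Toy example with two of the affected columns.
toy = pd.DataFrame({'DAYS_FIRST_DUE': [-134.0, 365243.0],
                    'DAYS_TERMINATION': [365243.0, -37.0]})
cleaned = clean_days_placeholder(toy)
print(cleaned.isna().sum().tolist())  # [1, 1]
```

Replacing the placeholder with NaN (rather than dropping rows) preserves the valid fields of each application for later aggregation.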
Install Payments Data Analysis¶
Context:
This table represents the repayment history for the previously disbursed credits in Home Credit related to the loans in our sample.
- There is a) one row for every payment that was made, plus b) one row for each missed payment.
- One row is equivalent to one payment of one installment OR one installment corresponding to one payment of one previous Home Credit credit related to loans in our sample.
print(f'The shape of installments_payments is: {installments_pay_df.shape}')
print(f'Number of unique SK_ID_PREV in installments_payments are: '
      f'{len(installments_pay_df.SK_ID_PREV.unique())}')
print(f'Number of unique SK_ID_CURR in installments_payments are: '
      f'{len(installments_pay_df.SK_ID_CURR.unique())}')
overlap_train = len(set(train_df.SK_ID_CURR.unique()).intersection(
    set(installments_pay_df.SK_ID_CURR.unique())))
print(f'Number of overlapping SK_ID_CURR in application_train.csv and '
      f'installments_payments are: {overlap_train}')
overlap_test = len(set(test_df.SK_ID_CURR.unique()).intersection(
    set(installments_pay_df.SK_ID_CURR.unique())))
print(f'Number of overlapping SK_ID_CURR in application_test.csv and '
      f'installments_payments are: {overlap_test}')
# duplicated() returns a boolean mask; sum it to count duplicate rows
duplicates = installments_pay_df.duplicated().sum()
print(f'Number of duplicate values in installments_payments: {duplicates}')
display(installments_pay_df.head())
The shape of installments_payments is: (13605401, 8) Number of unique SK_ID_PREV in installments_payments are: 997752 Number of unique SK_ID_CURR in installments_payments are: 339587 Number of overlapping SK_ID_CURR in application_train.csv and installments_payments are: 291643 Number of overlapping SK_ID_CURR in application_test.csv and installments_payments are: 47944 Number of duplicate values in installments_payments: 0
Insights
- Dataset Summary:
- The installments_payments table contains 13.6 million rows representing installment histories for previous loans.
- There are 997,752 unique previous loans (SK_ID_PREV) across 339,587 unique applicants (SK_ID_CURR).
- Overlap with Training and Test Data:
- 291,643 applicants from application_train.csv had previous loans.
- 47,944 applicants from application_test.csv had previous loans.
- No duplicate rows are present in the dataset.
- Key Observations:
- Nearly all applicants in the training and test datasets have installment histories in the installments_payments table, indicating a rich source of historical repayment information.
- This table provides 8 unique features, primarily related to installment statistics, offering crucial insights into repayment behaviors on previous loans.
- Conclusion:
- The table offers significant value for feature engineering, especially in assessing applicant repayment patterns and their correlations with current loan default probabilities.
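The repayment-pattern features hinted at above can be sketched by rolling the table up to one row per applicant. This is a minimal, hypothetical example: the aggregate names are illustrative, and only the standard installments_payments columns are assumed.

```python
import pandas as pd


def aggregate_installments_per_applicant(inst: pd.DataFrame) -> pd.DataFrame:
    """Roll installment history up to one row per SK_ID_CURR.

    The aggregate names are illustrative, not the notebook's features.
    """
    return inst.groupby('SK_ID_CURR').agg(
        N_INSTALLMENTS=('SK_ID_PREV', 'count'),
        N_PREV_LOANS=('SK_ID_PREV', 'nunique'),
        MEAN_AMT_PAYMENT=('AMT_PAYMENT', 'mean'),
        MAX_AMT_INSTALMENT=('AMT_INSTALMENT', 'max'),
    ).reset_index()
```

The resulting frame can then be merged back onto train_df via SK_ID_CURR, the same key used for the TARGET merge in this section.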
Install Payments: Null Values analysis¶
# NaN values in the installments_payments dataset
view_percent_nan(null_df(
installments_pay_df), 'Install Payments Dataset', grid=True)
Number of columns having NaN values: 2 columns
Insights
- Out of the 8 columns in the installments_payments table, only 2 columns contain NaN values.
- The proportion of missing values in these columns is minimal, at just 0.02%, which is negligible and unlikely to pose any significant concern.
Merging Install Payments with Application Data Target¶
# Merge the filtered train_df with installments_pay_df
inst_pay_merged = train_df[
['SK_ID_CURR', 'TARGET']].merge(
installments_pay_df, on='SK_ID_CURR', how='left')
# Convert 'TARGET' to numeric
inst_pay_merged['TARGET'] = pd.to_numeric(
inst_pay_merged['TARGET']).round().astype(int)
# Verify the merge
print("Merging completed. Shape of inst_pay_merged:", inst_pay_merged.shape)
print(inst_pay_merged['TARGET'].value_counts(dropna=False))
Merging completed. Shape of inst_pay_merged: (11607460, 9)
TARGET
0    10726258
1      881202
Name: count, dtype: int64
Correlation Analysis: Numerical Features¶
# Select numerical columns for the correlation matrix
num_cols_inst_app = inst_pay_merged.select_dtypes(
include=['float32', 'float64', 'int32', 'int64']).columns.tolist()
num_cols_inst_app = [
col for col in num_cols_inst_app if col not in [
'SK_ID_CURR', 'SK_ID_PREV']]
# Initialize and plot the correlation matrix
corr_mat_inst = CorrelationMatrix(
data=inst_pay_merged,
target='TARGET',
numerical_columns=num_cols_inst_app,
min_unique_values=10
)
# Plot the correlation matrix
corr_mat_inst.plot_correlation_matrix(figsize=(8, 6))
top_corr_numerical = corr_mat_inst.target_top_corr(target_top_columns=10)
print("Top Phi-K Correlations for Numerical Features:")
display(top_corr_numerical)
Top Phi-K Correlations for Numerical Features:
| Column Name | Phik-Correlation | |
|---|---|---|
| 3 | DAYS_ENTRY_PAYMENT | 0.047231 |
| 2 | DAYS_INSTALMENT | 0.046815 |
| 1 | NUM_INSTALMENT_NUMBER | 0.022993 |
| 4 | AMT_INSTALMENT | 0.004125 |
| 5 | AMT_PAYMENT | 0.003084 |
| 0 | NUM_INSTALMENT_VERSION | 0.002198 |
Insights
- Feature Correlations:
- There are two pairs of highly correlated features:
- DAYS_INSTALMENT and DAYS_ENTRY_PAYMENT
- AMT_INSTALMENT and AMT_PAYMENT
- These correlations are logical since they represent expected relationships:
- When the installment was due vs. when it was paid.
- The amount due vs. the amount paid.
- Impact on Feature Engineering:
- These correlated features can be leveraged to derive new, uncorrelated features (e.g., differences or ratios) for more robust predictive modeling.
- Relationship with Target:
- The features do not exhibit significant correlation with the target variable (TARGET), suggesting no strong linear relationship between them.
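As a sketch of the difference/ratio idea above (assuming the merged frame's column names; the derived feature names are illustrative, not the notebook's):

```python
import numpy as np
import pandas as pd


def add_repayment_features(inst: pd.DataFrame) -> pd.DataFrame:
    """Derive de-correlated repayment features from the correlated pairs."""
    out = inst.copy()
    # Positive delay means the payment landed after the due date.
    out['PAYMENT_DELAY'] = out['DAYS_ENTRY_PAYMENT'] - out['DAYS_INSTALMENT']
    # Shortfall between what was due and what was actually paid.
    out['PAYMENT_SHORTFALL'] = out['AMT_INSTALMENT'] - out['AMT_PAYMENT']
    # Fraction paid, guarding against zero-amount installments.
    out['PAYMENT_RATIO'] = (
        out['AMT_PAYMENT'] / out['AMT_INSTALMENT'].replace(0, np.nan))
    return out
```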
Correlation Analysis: Continuous Features¶
First, we will group the data by the SK_ID_PREV field and calculate the mean for each group, resulting in one averaged row for each of the client's previous loans. This will help us analyze the relationship between the continuous features and the target variable.
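A minimal sketch of that grouping (assuming the inst_pay_merged frame built earlier; only numeric columns are averaged):

```python
import pandas as pd


def average_per_previous_loan(inst_pay_merged: pd.DataFrame) -> pd.DataFrame:
    """Return one averaged row per previous loan (SK_ID_PREV)."""
    return (
        inst_pay_merged
        .groupby(['SK_ID_PREV', 'SK_ID_CURR', 'TARGET'], as_index=False)
        .mean(numeric_only=True)
    )
```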
Distribution of DAYS_INSTALMENT feature and Analysis¶
view_continuous_feat(
data=inst_pay_merged,
column_name='DAYS_INSTALMENT',
plots=['box'],
figsize=(8, 6),
log_scale=False,
)
Insights
- Defaulters (TARGET = 1) tend to have more recent installment due dates (DAYS_INSTALMENT closer to 0) compared to non-defaulters (TARGET = 0).
- Non-defaulters generally have installments due farther back in time compared to defaulters, as seen from their median and interquartile range.
- This suggests that overdue or recent due dates might be indicative of higher default risk, which could be explored further for feature engineering.
Distribution of DAYS_ENTRY_PAYMENT feature and Analysis¶
view_continuous_feat(
data=inst_pay_merged,
column_name='DAYS_ENTRY_PAYMENT',
plots=['box'],
figsize=(8, 6),
log_scale=False,
)
Insights
- Defaulters (TARGET = 1) tend to have more recent payment dates (DAYS_ENTRY_PAYMENT closer to 0) compared to non-defaulters (TARGET = 0).
- Non-defaulters generally have their payments recorded further back in time compared to defaulters, as seen from their median and interquartile range.
- This suggests that the recency of payment might correlate with default risk, potentially indicating payment delays or financial instability among defaulters.
- The presence of outliers for non-defaulters with very distant payment dates should also be investigated further.
POS_CASH_balance Data Analysis¶
Context:
Monthly balance snapshots of previous POS (point of sales) and cash loans that the applicant had with Home Credit.
- This table has one row for each month of history of every previous credit in Home Credit (consumer credit and cash loans) related to loans in our sample – i.e. the table has (#loans in sample * # of relative previous credits * # of months in which we have some history observable for the previous credits) rows.
print(f'The shape of poscash_bal_df is: {poscash_bal_df.shape}')
print(f'Number of unique SK_ID_PREV in poscash_bal_df are: '
f'{len(poscash_bal_df.SK_ID_PREV.unique())}')
print(f'Number of unique SK_ID_CURR in poscash_bal_df are: '
f'{len(poscash_bal_df.SK_ID_CURR.unique())}')
overlap_train = len(set(train_df.SK_ID_CURR.unique()).intersection(
set(poscash_bal_df.SK_ID_CURR.unique())))
print(f'Number of overlapping SK_ID_CURR in application_train.csv and '
f'poscash_bal_df are: {overlap_train}')
overlap_test = len(set(test_df.SK_ID_CURR.unique()).intersection(
set(poscash_bal_df.SK_ID_CURR.unique())))
print(f'Number of overlapping SK_ID_CURR in application_test.csv and '
f'poscash_bal_df are: {overlap_test}')
duplicates = poscash_bal_df.duplicated().sum()
print(f'Number of duplicate values in poscash_bal_df: {duplicates}')
display(poscash_bal_df.head())
The shape of poscash_bal_df is: (10001358, 8)
Number of unique SK_ID_PREV in poscash_bal_df are: 936325
Number of unique SK_ID_CURR in poscash_bal_df are: 337252
Number of overlapping SK_ID_CURR in application_train.csv and poscash_bal_df are: 289444
Number of overlapping SK_ID_CURR in application_test.csv and poscash_bal_df are: 47808
Number of duplicate values in poscash_bal_df: 0
Insights
- The poscash_bal_df dataset contains 10,001,358 records across 8 features.
- It includes 936,325 unique previous loans (SK_ID_PREV) and 337,252 unique clients (SK_ID_CURR).
- Out of these, 289,444 clients overlap with the training dataset and 47,808 clients overlap with the test dataset.
- There are no duplicate records in the dataset.
POS_CASH_balance: Null Values analysis¶
# NaN values in the POS_CASH_balance
view_percent_nan(null_df(
poscash_bal_df), 'POS_CASH_balance Dataset', grid=True)
Number of columns having NaN values: 2 columns
Insights
- The POS_CASH_balance dataset has two columns (CNT_INSTALMENT_FUTURE and CNT_INSTALMENT) containing NaN values.
- Both columns show a similar proportion of missing values, approximately 25%.
- Although significant, this level of missing data can potentially be imputed or managed during preprocessing to retain useful insights from the dataset.
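One way the imputation could be handled (an assumption, not this notebook's final strategy): fill gaps within each previous loan's month-ordered history first, then fall back to the column median for loans with no observed value at all.

```python
import pandas as pd


def impute_pos_cash(pos: pd.DataFrame) -> pd.DataFrame:
    """Fill CNT_INSTALMENT* gaps per loan, then by column median."""
    out = pos.sort_values(['SK_ID_PREV', 'MONTHS_BALANCE']).copy()
    for col in ['CNT_INSTALMENT', 'CNT_INSTALMENT_FUTURE']:
        # Propagate known values within each previous loan's history.
        out[col] = out.groupby('SK_ID_PREV')[col].transform(
            lambda s: s.ffill().bfill())
        # Loans with no observed value at all fall back to the median.
        out[col] = out[col].fillna(out[col].median())
    return out
```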
Merging POS_CASH_balance with Application Data Target¶
# Merge the filtered train_df with pos_cash_bal_df
pos_cash_merged = train_df[
['SK_ID_CURR', 'TARGET']].merge(
poscash_bal_df, on='SK_ID_CURR', how='left')
# Verify the merge
print("Merging completed. Shape of pos_cash_merged:", pos_cash_merged.shape)
print(pos_cash_merged['TARGET'].value_counts(dropna=False))
pos_cash_merged['TARGET'] = pos_cash_merged['TARGET'].astype(int)
Merging completed. Shape of pos_cash_merged: (8561442, 9)
TARGET
0    7931652
1     629790
Name: count, dtype: int64
Correlation Analysis: Numerical Features¶
# Select numerical columns for the correlation matrix
num_cols_pos_cash = pos_cash_merged.select_dtypes(
include=['float32', 'float64', 'int32', 'int64']).columns.tolist()
num_cols_pos_cash = [
col for col in num_cols_pos_cash if col not in [
'SK_ID_CURR', 'SK_ID_PREV']
]
# Initialize and plot the correlation matrix
corr_mat_pos_cash = CorrelationMatrix(
data=pos_cash_merged,
target='TARGET',
numerical_columns=num_cols_pos_cash,
min_unique_values=10
)
# Plot the correlation matrix
corr_mat_pos_cash.plot_correlation_matrix(figsize=(8, 6))
top_corr_numerical = corr_mat_pos_cash.target_top_corr(target_top_columns=10)
print("Top Phi-K Correlations for Numerical Features:")
display(top_corr_numerical)
Top Phi-K Correlations for Numerical Features:
| Column Name | Phik-Correlation | |
|---|---|---|
| 2 | CNT_INSTALMENT_FUTURE | 0.033194 |
| 1 | CNT_INSTALMENT | 0.030947 |
| 0 | MONTHS_BALANCE | 0.027383 |
| 3 | SK_DPD | 0.012773 |
| 4 | SK_DPD_DEF | 0.010539 |
Insights
The heatmap displays the correlation among numerical features in the pos_cash_merged dataset.
- A notable pair of moderately correlated features is CNT_INSTALMENT and CNT_INSTALMENT_FUTURE, which is logical as these features represent past and future installment counts, respectively.
- The correlation of features with the target variable (TARGET) is quite low, indicating that these features do not have a strong linear relationship with the likelihood of repayment or default.
Correlation Analysis: Continuous Features¶
Distribution of CNT_INSTALMENT_FUTURE feature and Analysis¶
view_continuous_feat(
data=pos_cash_merged,
column_name='CNT_INSTALMENT_FUTURE',
plots=['box'],
figsize=(8, 6),
log_scale=False,
)
Insights
The box plot above illustrates the distribution of CNT_INSTALMENT_FUTURE (future installment counts) for both defaulters (TARGET = 1) and non-defaulters (TARGET = 0).
- Defaulters (TARGET = 1) generally have a slightly lower median number of future installments compared to non-defaulters.
- Non-defaulters exhibit a wider range of future installment counts, with more outliers on the higher end.
- This distribution suggests that defaulters are less likely to have a large number of future installments, possibly indicating shorter or fewer ongoing loans. However, the difference between the two groups is not very pronounced.
Credit Card Balance Data Analysis¶
Context:
- Monthly balance snapshots of previous credit cards that the applicant has with Home Credit.
- This table has one row for each month of history of every previous credit in Home Credit (consumer credit and cash loans) related to loans in our sample – i.e. the table has (#loans in sample * # of relative previous credit cards * #of months where we have some history observable for the previous credit card) rows.
print(f'The shape of credit_card_bal_df is: {credit_card_bal_df.shape}')
print(f'Number of unique SK_ID_PREV in credit_card_bal_df are: '
f'{len(credit_card_bal_df.SK_ID_PREV.unique())}')
print(f'Number of unique SK_ID_CURR in credit_card_bal_df are: '
f'{len(credit_card_bal_df.SK_ID_CURR.unique())}')
overlap_train = len(set(train_df.SK_ID_CURR.unique()).intersection(
set(credit_card_bal_df.SK_ID_CURR.unique())))
print(f'Number of overlapping SK_ID_CURR in application_train.csv and '
f'credit_card_bal_df are: {overlap_train}')
overlap_test = len(set(test_df.SK_ID_CURR.unique()).intersection(
set(credit_card_bal_df.SK_ID_CURR.unique())))
print(f'Number of overlapping SK_ID_CURR in application_test.csv and '
f'credit_card_bal_df are: {overlap_test}')
duplicates = credit_card_bal_df.duplicated().sum()
print(f'Number of duplicate values in credit_card_bal_df: {duplicates}')
display(credit_card_bal_df.head())
The shape of credit_card_bal_df is: (3840312, 23)
Number of unique SK_ID_PREV in credit_card_bal_df are: 104307
Number of unique SK_ID_CURR in credit_card_bal_df are: 103558
Number of overlapping SK_ID_CURR in application_train.csv and credit_card_bal_df are: 86905
Number of overlapping SK_ID_CURR in application_test.csv and credit_card_bal_df are: 16653
Number of duplicate values in credit_card_bal_df: 0
| SK_ID_PREV | SK_ID_CURR | MONTHS_BALANCE | AMT_BALANCE | AMT_CREDIT_LIMIT_ACTUAL | AMT_DRAWINGS_ATM_CURRENT | AMT_DRAWINGS_CURRENT | AMT_DRAWINGS_OTHER_CURRENT | AMT_DRAWINGS_POS_CURRENT | AMT_INST_MIN_REGULARITY | AMT_PAYMENT_CURRENT | AMT_PAYMENT_TOTAL_CURRENT | AMT_RECEIVABLE_PRINCIPAL | AMT_RECIVABLE | AMT_TOTAL_RECEIVABLE | CNT_DRAWINGS_ATM_CURRENT | CNT_DRAWINGS_CURRENT | CNT_DRAWINGS_OTHER_CURRENT | CNT_DRAWINGS_POS_CURRENT | CNT_INSTALMENT_MATURE_CUM | NAME_CONTRACT_STATUS | SK_DPD | SK_DPD_DEF | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2562384 | 378907 | -6 | 56.970001 | 135000 | 0.0 | 877.5 | 0.0 | 877.5 | 1700.324951 | 1800.0 | 1800.0 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 1 | 0.0 | 1.0 | 35.0 | Active | 0 | 0 |
| 1 | 2582071 | 363914 | -1 | 63975.554688 | 45000 | 2250.0 | 2250.0 | 0.0 | 0.0 | 2250.000000 | 2250.0 | 2250.0 | 60175.078125 | 64875.554688 | 64875.554688 | 1.0 | 1 | 0.0 | 0.0 | 69.0 | Active | 0 | 0 |
| 2 | 1740877 | 371185 | -7 | 31815.224609 | 450000 | 0.0 | 0.0 | 0.0 | 0.0 | 2250.000000 | 2250.0 | 2250.0 | 26926.425781 | 31460.085938 | 31460.085938 | 0.0 | 0 | 0.0 | 0.0 | 30.0 | Active | 0 | 0 |
| 3 | 1389973 | 337855 | -4 | 236572.109375 | 225000 | 2250.0 | 2250.0 | 0.0 | 0.0 | 11795.759766 | 11925.0 | 11925.0 | 224949.281250 | 233048.968750 | 233048.968750 | 1.0 | 1 | 0.0 | 0.0 | 10.0 | Active | 0 | 0 |
| 4 | 1891521 | 126868 | -1 | 453919.468750 | 450000 | 0.0 | 11547.0 | 0.0 | 11547.0 | 22924.890625 | 27000.0 | 27000.0 | 443044.406250 | 453919.468750 | 453919.468750 | 0.0 | 1 | 0.0 | 1.0 | 101.0 | Active | 0 | 0 |
Insights
The credit_card_bal_df dataset contains 3.84M rows and 23 columns. It includes information for 104k unique previous loans (SK_ID_PREV) across 103.5k unique clients (SK_ID_CURR).
- Of these clients, 86.9k overlap with the training dataset, and 16.6k overlap with the test dataset.
- There are no duplicate rows in the dataset, ensuring data integrity for analysis.
Credit Card Balance: Null Values analysis¶
# NaN values in the credit_card_bal_df dataset
view_percent_nan(null_df(
credit_card_bal_df), 'Credit Card Balance Dataset', grid=True)
Number of columns having NaN values: 9 columns
Insights
The credit_card_bal_df dataset contains 9 columns with NaN values out of 23 total features.
- Most of the features with NaN values (7 columns) are related to drawing amounts and counts, with approximately 20% missing values.
- The remaining 2 columns are associated with installment statistics, having a slightly lower proportion of missing values.
- Overall, the percentage of NaN values in this dataset is relatively lower compared to other datasets analyzed so far, making it more manageable for preprocessing.
Merging Credit Card Balance with Application Data Target¶
# Merge the filtered train_df with credit_card_bal_df
credit_card_merged = train_df[
['SK_ID_CURR', 'TARGET']].merge(
credit_card_bal_df, on='SK_ID_CURR', how='left')
# Verify the merge
print(
"Merging completed. Shape of credit_card_merged:", credit_card_merged.shape
)
print(credit_card_merged['TARGET'].value_counts(dropna=False))
credit_card_merged['TARGET'] = credit_card_merged['TARGET'].astype(int)
Merging completed. Shape of credit_card_merged: (3448571, 24)
TARGET
0    3201279
1     247292
Name: count, dtype: int64
# Select numerical columns for the correlation matrix
num_cols_credit_card = credit_card_merged.select_dtypes(
include=['float32', 'float64', 'int32', 'int64']).columns.tolist()
num_cols_credit_card = [
col for col in num_cols_credit_card if col not in [
'SK_ID_CURR', 'SK_ID_PREV']
]
# Initialize and plot the correlation matrix
corr_mat_credit_card = CorrelationMatrix(
data=credit_card_merged,
target='TARGET',
numerical_columns=num_cols_credit_card,
min_unique_values=10
)
# Plot the correlation matrix
corr_mat_credit_card.plot_correlation_matrix(figsize=(15, 13))
top_corr_numerical = corr_mat_credit_card.target_top_corr(
target_top_columns=10)
print("Top Phi-K Correlations for Numerical Features:")
display(top_corr_numerical)
Top Phi-K Correlations for Numerical Features:
| Column Name | Phik-Correlation | |
|---|---|---|
| 1 | AMT_BALANCE | 0.059838 |
| 11 | AMT_RECIVABLE | 0.059311 |
| 12 | AMT_TOTAL_RECEIVABLE | 0.059288 |
| 10 | AMT_RECEIVABLE_PRINCIPAL | 0.058895 |
| 0 | MONTHS_BALANCE | 0.050330 |
| 7 | AMT_INST_MIN_REGULARITY | 0.042174 |
| 17 | CNT_INSTALMENT_MATURE_CUM | 0.037519 |
| 13 | CNT_DRAWINGS_ATM_CURRENT | 0.030052 |
| 2 | AMT_CREDIT_LIMIT_ACTUAL | 0.028752 |
| 14 | CNT_DRAWINGS_CURRENT | 0.027841 |
Insights
- A few groups of highly correlated features are identified:
- AMT_RECEIVABLE_PRINCIPAL, AMT_RECIVABLE, AMT_TOTAL_RECEIVABLE, and AMT_BALANCE: These are closely related and reflect similar financial measures.
- AMT_PAYMENT_TOTAL_CURRENT and AMT_PAYMENT_CURRENT: These features are also strongly correlated as they represent similar payment metrics.
- These correlations are logical, as the features describe similar aspects of credit card usage and repayment.
- Correlation with TARGET is minimal, indicating a weak linear relationship between these features and the likelihood of repayment issues.
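Beyond dropping near-duplicates, the balance/limit columns suggest a derived feature. A hypothetical sketch (the UTILIZATION name is illustrative; only the AMT_BALANCE and AMT_CREDIT_LIMIT_ACTUAL columns shown above are assumed):

```python
import numpy as np
import pandas as pd


def add_utilization(cc: pd.DataFrame) -> pd.DataFrame:
    """Monthly credit utilization: balance relative to the actual limit."""
    out = cc.copy()
    # A zero limit would divide to infinity, so treat it as missing.
    limit = out['AMT_CREDIT_LIMIT_ACTUAL'].replace(0, np.nan)
    out['UTILIZATION'] = out['AMT_BALANCE'] / limit
    return out
```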
Correlation Analysis: Continuous Features¶
Distribution of AMT_BALANCE feature and Analysis¶
This feature represents the balance on the credit card at the end of the month, i.e. the balance during the month of the previous credit.
view_continuous_feat(
data=credit_card_merged,
column_name='AMT_BALANCE',
plots=['box'],
figsize=(8, 6),
log_scale=False,
)
Insights
The box plot shows the distribution of AMT_BALANCE across defaulters (TARGET = 1) and non-defaulters (TARGET = 0):
- Defaulters (TARGET = 1) tend to have slightly higher balances on average compared to non-defaulters, as seen by the elevated median value in the red box plot.
- Outliers are present in both groups, with some individuals having exceptionally high balances, especially in the non-defaulter category (TARGET = 0).
- Despite some variation, there is no drastic difference in the overall distribution, suggesting that AMT_BALANCE alone may not strongly distinguish defaulters from non-defaulters.
Distribution of AMT_TOTAL_RECEIVABLE feature and Analysis¶
This feature represents the total amount receivable on the previous credit.
view_continuous_feat(
data=credit_card_merged,
column_name='AMT_TOTAL_RECEIVABLE',
plots=['box'],
figsize=(8, 6),
log_scale=False,
)
Insights
The box plot illustrates the distribution of AMT_TOTAL_RECEIVABLE for defaulters (TARGET = 1) and non-defaulters (TARGET = 0):
- Defaulters (TARGET = 1) have a slightly higher median value for AMT_TOTAL_RECEIVABLE compared to non-defaulters, indicating they tend to have more outstanding amounts on average.
- Outliers are observed in both groups, with significantly higher values in some cases, particularly for non-defaulters (TARGET = 0).
- The overall distribution shows substantial overlap between the two groups, suggesting that while AMT_TOTAL_RECEIVABLE may provide some insight, it alone may not be a strong distinguishing factor between defaulters and non-defaulters.
Distribution of CNT_INSTALMENT_MATURE_CUM feature and Analysis¶
This feature represents the number of paid installments on the previous credit.
view_continuous_feat(
data=credit_card_merged,
column_name='CNT_INSTALMENT_MATURE_CUM',
plots=['box'],
figsize=(8, 6),
log_scale=False,
)
Insights
The box plot depicts the distribution of CNT_INSTALMENT_MATURE_CUM for defaulters (TARGET = 1) and non-defaulters (TARGET = 0):
- Non-defaulters (TARGET = 0) tend to have slightly higher cumulative matured installments compared to defaulters. The median value for non-defaulters is higher, suggesting they typically have more completed installments.
- Outliers are present in both groups, with significantly high values observed, but the spread and range are similar for both defaulters and non-defaulters.
- The overlap in the distributions indicates that CNT_INSTALMENT_MATURE_CUM may have limited discriminative power between defaulters and non-defaulters on its own. However, it could still provide value in combination with other features.
Main conclusions from EDA¶
Based on the comprehensive EDA performed, we can derive several key insights and actionable steps for the dataset:
- Data Integration and Merging:
- The dataset consists of multiple tables containing detailed information about past loans, credit card balances, installment payments, and more. These will need to be merged systematically to ensure the combined data is meaningful and maintains the relationships between key features.
- Categorical Variables:
- Certain categorical features, such as NAME_CONTRACT_TYPE, NAME_CONTRACT_STATUS, and PRODUCT_COMBINATION, exhibit strong discriminatory power between defaulters and non-defaulters. These features will be particularly valuable for classification tasks and model development.
- Erroneous and Outlier Data:
- A few continuous numerical variables, such as DAYS_FIRST_DRAWING, DAYS_FIRST_DUE, and DAYS_TERMINATION, contain erroneous or unrealistic values that need to be handled appropriately.
- While addressing outliers, it is important to remember that defaulters often exhibit abnormal behaviors. Thus, removing outliers indiscriminately may result in losing crucial information.
- Feature Correlation:
- The correlation heatmaps revealed sets of highly correlated features, such as AMT_BALANCE, AMT_RECIVABLE, and AMT_TOTAL_RECEIVABLE. These features, while providing similar information, may need dimensionality reduction techniques to avoid redundancy without losing predictive power.
- Class Imbalance:
- The dataset is imbalanced, with significantly fewer defaulters compared to non-defaulters. Special care, such as oversampling, undersampling, or class-weight adjustments, will be necessary during model training to ensure fair performance.
- Behavioral Patterns:
- Defaulters exhibit distinct behavioral patterns, such as more recent credit activity (e.g., DAYS_DECISION) and different installment counts (e.g., CNT_INSTALMENT_FUTURE). These patterns will be key to constructing features for predictive models.
- Handling Missing Values:
- While missing values are present in several tables, they are generally localized to a small number of features. Strategies such as imputation or feature-specific handling will be necessary but not overly complex.
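For the class-imbalance point above, one hedged option is a class-weight adjustment. The sketch below reimplements the common "balanced" heuristic, n_samples / (n_classes * class_count); the helper name is illustrative and no particular model is assumed:

```python
import numpy as np


def balanced_class_weights(y: np.ndarray) -> dict:
    """Weights mirroring scikit-learn's 'balanced' heuristic."""
    classes, counts = np.unique(y, return_counts=True)
    n_samples = y.size
    return {cls: n_samples / (len(classes) * cnt)
            for cls, cnt in zip(classes, counts)}
```

With the heavily skewed TARGET counts seen in the merges above, defaulters would receive a weight roughly an order of magnitude larger than non-defaulters, which a model's class-weight parameter can consume directly.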
Next Steps:
With these insights, we are now ready to transition to the Data Cleaning and Feature Engineering phase. This can involve:
- Handling missing and erroneous values.
- Creating derived features from correlated and redundant variables.
- Encoding categorical variables effectively.
- Addressing class imbalance to ensure robust model performance.
- Preparing the dataset for model building and evaluation.
Suggestions for EDA Improvement¶
- Advanced Visualizations
- Use interactive visualizations to allow stakeholders to drill down into EDA insights.
- Include comparative boxplots, swarm plots, and violin plots to better represent distributions for key variables.
- Feature Redundancy
- Address highly correlated numerical variables identified in the EDA:
- For instance, AMT_TOTAL_RECEIVABLE and AMT_BALANCE.
- Identify redundant features using VIF (Variance Inflation Factor) analysis and drop them to reduce multicollinearity.
- Segmentation Analysis
- Segment applicants by meaningful categories, such as income brackets, employment type, or age groups, to identify specific risk profiles.
- Use clustering algorithms (e.g., K-Means, DBSCAN) on numerical features to discover hidden applicant segments.
- Exploring Temporal Dynamics
- Explore trends in DAYS_INSTALMENT, DAYS_LAST_DUE, and similar time-based variables to identify temporal patterns among defaulters and non-defaulters.
- Segment data by application date ranges to uncover seasonal or economic trends influencing default rates.
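The VIF screen suggested under Feature Redundancy can be sketched with the statsmodels helper already imported in the notebook settings (the vif_table name is illustrative; dropping NaN rows first is an assumption about preprocessing):

```python
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor


def vif_table(df: pd.DataFrame, cols: list) -> pd.DataFrame:
    """Rank features by Variance Inflation Factor (higher = more redundant)."""
    X = df[cols].dropna().astype(float)
    # Add an intercept column so each VIF is computed against a constant.
    X = X.assign(const=1.0)
    vifs = [variance_inflation_factor(X.values, i)
            for i in range(len(cols))]
    return (pd.DataFrame({'feature': cols, 'VIF': vifs})
            .sort_values('VIF', ascending=False))
```

Features with VIF above a chosen threshold (10 is a common rule of thumb), such as one of the near-duplicate receivable/balance columns, would then be candidates for dropping.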