← Back to Insight-ML

Loan Status Analysis: Exploring Approval Patterns

📅 2024
🏦 381 Loan Applications
🤖 4 ML Models
Python Pandas Matplotlib Seaborn Scikit-learn Machine Learning Finance Analytics EDA

🎯 Project Overview

This project focuses on analyzing a dataset of loan applications to understand the factors affecting loan approval status. By examining various attributes such as applicant details, income, loan amount, credit history, and property area, the analysis aims to uncover trends, disparities, and insights in loan approval patterns.

The project combines thorough exploratory data analysis with four machine learning classifiers (KNN, Decision Tree, Gaussian Naïve Bayes, and Random Forest) to predict loan approval status and evaluate which features are most influential.

📊 Dataset Description

The dataset comprises 381 rows and 13 columns; each column records one attribute of a loan application.

#  | Column            | Description
1  | Loan_ID           | A unique loan ID
2  | Gender            | Gender of the applicant (Male/Female)
3  | Married           | Marital status of the applicant (Yes/No)
4  | Dependents        | Number of dependents of the applicant
5  | Education         | Education level (Graduate/Not Graduate)
6  | Self_Employed     | Whether the applicant is self-employed (Yes/No)
7  | ApplicantIncome   | Income of the applicant
8  | CoapplicantIncome | Income of the co-applicant
9  | LoanAmount        | Loan amount in thousands
10 | Loan_Amount_Term  | Term of the loan in months
11 | Credit_History    | Credit history (1: meets guidelines, 0: does not meet)
12 | Property_Area     | Area where the property is located (Urban/Semi-urban/Rural)
13 | Loan_Status       | Loan approved (Y/N)

Data Quality

📈 Exploratory Data Analysis

4.1 | Individual Variables Analysis

Loan Status Distribution

Overview of the overall loan approval rate across all 381 applications in the dataset.
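
As a rough sketch of how the approval rate behind this chart could be computed, the snippet below tallies Y/N labels with the standard library. The `approval_rate` helper and the `toy_status` list are illustrative assumptions, not code or values from the project, which presumably worked on the `Loan_Status` column in pandas.

```python
from collections import Counter

def approval_rate(statuses):
    """Return (label counts, share of 'Y') for a list of 'Y'/'N' labels."""
    counts = Counter(statuses)
    total = sum(counts.values())
    return counts, counts["Y"] / total

# Toy labels for illustration only; not taken from the project's dataset.
toy_status = ["Y", "Y", "N", "Y", "N", "Y", "Y", "N"]
counts, rate = approval_rate(toy_status)
print(dict(counts), f"approved share = {rate:.3f}")
```

A strongly imbalanced split would be a reason to read accuracy alongside the confusion matrices later in the write-up.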

Categorical Variable Distributions

Bar chart breakdown of all categorical features including Gender, Married, Education, Self-Employed, and Property Area.

Histogram: Continuous Data

Distribution of continuous variables: ApplicantIncome, CoapplicantIncome, LoanAmount, and Loan_Amount_Term.

Gender Distribution

Breakdown of loan applicants by gender.

Married Distribution

Breakdown of loan applicants by marital status.

Dependents Distribution

Distribution of the number of dependents among applicants.

Education Distribution

Breakdown of applicants by education level (Graduate vs Not Graduate).

Self Employed Distribution

Proportion of applicants who are self-employed versus not self-employed.

Credit History Distribution

Distribution of applicants by credit history compliance.

Property Area Distribution

Breakdown of applicants by property area: Urban, Semi-urban, and Rural.

KDE: Continuous Data

KDE curves for continuous variables revealing distribution shapes and skewness patterns.
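
The skewness a KDE curve shows visually can also be put into a number. The following is a minimal sketch of the moment-based (Fisher-Pearson) skewness coefficient; the `skewness` helper and both toy lists are assumptions for illustration, not project data.

```python
def skewness(values):
    """Moment-based skewness g1 = m3 / m2**1.5 (biased estimator)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # second central moment
    m3 = sum((v - mean) ** 3 for v in values) / n  # third central moment
    return m3 / m2 ** 1.5

# Toy values for illustration only.
symmetric = [1, 2, 3, 4, 5]
right_tailed = [1, 1, 2, 2, 3, 20]
print(skewness(symmetric))     # close to zero for a symmetric sample
print(skewness(right_tailed))  # positive for a long right tail
```

Income-like variables often have a long right tail, which would show up here as a clearly positive value.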

4.2 | Outlier Identification

Box Plot: All Features

Box plots exposing the spread and outliers in continuous variables, particularly ApplicantIncome and LoanAmount.
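
Box plots conventionally draw points beyond 1.5 times the interquartile range as outliers. The project's exact outlier criterion is not stated, so the sketch below assumes that conventional rule; `iqr_outliers` and the toy incomes are hypothetical.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Return values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    # 'inclusive' uses linear interpolation between the sample min and max.
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]

# Toy incomes for illustration only; one value is deliberately extreme.
toy_income = [25, 30, 32, 35, 38, 40, 42, 45, 48, 300]
print(iqr_outliers(toy_income))
```

Whether flagged rows should be dropped, capped, or kept is a separate modelling decision that this sketch does not make.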

4.3 | Pairs of Variables Insights

Box and Scatter

Combined box and scatter plots showing relationships between continuous variables and loan status.

Violin Plot

Violin plots combining distribution shape with quartile summaries across continuous features.

Violin Plot: Binary

Violin plots split by loan approval status, highlighting distributional differences between approved and rejected applications.

4.4 | Multiple Variables Examination

Correlation Heatmap

Correlation heatmap across all continuous and encoded features, revealing relationships between income, loan amount, and approval status.
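
Each cell of such a heatmap is a pairwise correlation coefficient. As a minimal sketch, assuming Pearson correlation (the write-up does not name the method), the matrix can be built as follows; the helper names and the toy columns are illustrative only.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def correlation_matrix(columns):
    """Nested dict of pairwise correlations for a dict of numeric columns."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}

# Toy columns for illustration only; Loan_Status is encoded as 0/1 here.
toy = {
    "ApplicantIncome": [2.0, 4.0, 6.0, 8.0],
    "LoanAmount": [1.0, 2.0, 3.0, 4.0],
    "Loan_Status": [0, 1, 0, 1],
}
matrix = correlation_matrix(toy)
for name, row in matrix.items():
    print(name, {other: round(r, 3) for other, r in row.items()})
```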

Regression Plot

Regression plots highlighting linear relationships between continuous predictors and loan approval.

Pair Plot

Pairplot providing a full cross-feature view with loan status as the grouping variable.

4.5 | Hypothesis Testing

Chi-squared Test: Testing independence between categorical variables and loan status.
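
To make the test concrete, here is a minimal sketch of Pearson's chi-squared statistic for a 2x2 contingency table, for example Credit_History (rows) against Loan_Status (columns). The counts are made up for illustration, and no continuity correction is applied, whereas SciPy's `chi2_contingency` applies Yates' correction to 2x2 tables by default, so results can differ slightly from a library call.

```python
import math

def chi_squared_2x2(table):
    """Pearson chi-squared statistic and p-value for a 2x2 table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    # With 1 degree of freedom the chi-squared survival function
    # reduces to erfc(sqrt(stat / 2)).
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value

# Hypothetical counts, not taken from the dataset.
toy_table = [[20, 10],
             [10, 20]]
stat, p_value = chi_squared_2x2(toy_table)
print(f"chi2 = {stat:.3f}, p = {p_value:.4f}")
```

A small p-value would suggest the two variables are not independent; the closed-form p-value used here is valid only for tables with one degree of freedom.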

Z-test: Comparing the means of continuous variables between approved and rejected applications.
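
A two-sample z-test of this kind can be sketched as below. It assumes the groups are independent and large enough for the normal approximation, and it uses sample variances in place of known population variances; `two_sample_z` and the toy samples are hypothetical.

```python
import math
import statistics

def two_sample_z(a, b):
    """z statistic and two-sided p-value for the difference in means of a and b."""
    mean_a, mean_b = statistics.fmean(a), statistics.fmean(b)
    standard_error = math.sqrt(
        statistics.variance(a) / len(a) + statistics.variance(b) / len(b)
    )
    z = (mean_a - mean_b) / standard_error
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail
    return z, p_value

# Toy samples for illustration only (e.g. a feature split by loan status).
approved = [10, 11, 12, 13, 14]
rejected = [1, 2, 3, 4, 5]
z, p_value = two_sample_z(approved, rejected)
print(f"z = {z:.2f}, p = {p_value:.2e}")
```

With samples as small as this toy example a t-test would normally be preferred; the numbers only show the mechanics.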

🤖 Model Development & Evaluation

5.1 | Data Normalization

Continuous features are normalized prior to model training.
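
The write-up does not say which normalization was used. As one common possibility, the sketch below assumes min-max scaling to the range [0, 1]; `min_max_scale` is a hypothetical helper, and standardization to zero mean and unit variance would be an equally plausible choice.

```python
def min_max_scale(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

# Toy loan amounts for illustration only.
print(min_max_scale([10, 20, 30]))
```

Scaling matters most for distance-based models such as KNN, where an unscaled income column would otherwise dominate the distance.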

5.2 | Feature Encoding

Categorical columns are label-encoded for use in classifiers.
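
Label encoding maps each category to an integer code. A minimal sketch, assuming codes are assigned in sorted order (which is also how scikit-learn's `LabelEncoder` orders its classes), could look like this; the helper names are hypothetical.

```python
def fit_label_encoder(values):
    """Map each distinct label to an integer code, in sorted label order."""
    return {label: code for code, label in enumerate(sorted(set(values)))}

def encode(values, mapping):
    """Replace each label by its integer code."""
    return [mapping[v] for v in values]

# Toy Property_Area values for illustration only.
areas = ["Urban", "Rural", "Semi-urban", "Urban"]
mapping = fit_label_encoder(areas)
print(mapping)
print(encode(areas, mapping))
```

The integer codes imply an ordering that nominal categories do not have; tree-based models tolerate this, while distance-based models may be affected by it.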

5.3 | Feature Selection

Feature Importance

ExtraTreesClassifier ranks Credit_History as the dominant feature, followed by ApplicantIncome and LoanAmount.

Feature Selection

Features selected above the importance threshold for use in model training.

5.4 | Model Preparation

Train / Test Data Split

An 80/20 split is applied to produce training and test sets for all classifiers.
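
A shuffled 80/20 split can be sketched with the standard library as follows. The function name, the seed, and the rounding of the test-set size are assumptions; a library splitter may round differently, so exact row counts can vary by one.

```python
import random

def train_test_split_rows(rows, test_fraction=0.2, seed=42):
    """Shuffle row indices with a fixed seed and hold out a test fraction."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    n_test = round(len(rows) * test_fraction)
    test_idx = set(indices[:n_test])
    train = [row for i, row in enumerate(rows) if i not in test_idx]
    test = [row for i, row in enumerate(rows) if i in test_idx]
    return train, test

# Stand-in rows: just the indices 0..380, matching the 381 applications.
rows = list(range(381))
train, test = train_test_split_rows(rows)
print(len(train), len(test))
```

Fixing the seed keeps the split reproducible, so all four classifiers are compared on the same held-out rows.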

5.5 | KNeighborsClassifier

Overfitting / Underfitting Detection: KNN

Training and test accuracy curves across K values, showing the optimal K for generalization.
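
The idea behind these curves can be sketched with a small hand-written KNN: for each K, record accuracy on the training set and on the test set, then compare. Everything below (function names, toy points, the K values) is an illustrative assumption; the project itself uses scikit-learn's KNeighborsClassifier.

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, point, k):
    """Majority vote among the k training points closest to `point`."""
    nearest = sorted(range(len(train_X)),
                     key=lambda i: math.dist(train_X[i], point))[:k]
    votes = Counter(train_y[i] for i in nearest)
    return votes.most_common(1)[0][0]

def accuracy(train_X, train_y, X, y, k):
    """Share of points in X whose predicted label matches y."""
    hits = sum(knn_predict(train_X, train_y, p, k) == label
               for p, label in zip(X, y))
    return hits / len(y)

def sweep_k(train_X, train_y, test_X, test_y, k_values):
    """Return {k: (train accuracy, test accuracy)} for each k."""
    return {k: (accuracy(train_X, train_y, train_X, train_y, k),
                accuracy(train_X, train_y, test_X, test_y, k))
            for k in k_values}

# Toy, well-separated points for illustration only.
train_X = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
train_y = ["N", "N", "N", "Y", "Y", "Y"]
test_X = [(0.5, 0.5), (5.5, 5.5)]
test_y = ["N", "Y"]
curves = sweep_k(train_X, train_y, test_X, test_y, k_values=[1, 3, 5])
for k, (train_acc, test_acc) in curves.items():
    print(k, train_acc, test_acc)
```

On real data, a large gap between training and test accuracy at small K points to overfitting, while low accuracy on both at large K points to underfitting.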

Confusion Matrix: KNN

KNN classification results on the loan status test set.
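
For readers unfamiliar with confusion matrices, the sketch below shows how the four cells and the accuracy derived from them are counted for a binary Y/N target. The labels are made up and do not reproduce the project's results; the same counting applies to each of the four classifiers.

```python
def confusion_counts(y_true, y_pred, positive="Y"):
    """Count true/false positives and negatives for a binary target."""
    counts = {"tp": 0, "fp": 0, "fn": 0, "tn": 0}
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            counts["tp" if truth == positive else "fp"] += 1
        else:
            counts["fn" if truth == positive else "tn"] += 1
    return counts

def accuracy_from_counts(counts):
    """Accuracy = correct predictions / all predictions."""
    return (counts["tp"] + counts["tn"]) / sum(counts.values())

# Hypothetical labels for illustration only.
y_true = ["Y", "Y", "N", "N", "Y", "N"]
y_pred = ["Y", "N", "N", "Y", "Y", "N"]
cm = confusion_counts(y_true, y_pred)
print(cm, round(accuracy_from_counts(cm), 3))
```

In a lending context, false positives (approving a loan that should be rejected) and false negatives usually carry different costs, which accuracy alone does not show.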

5.6 | DecisionTreeClassifier

Confusion Matrix: Decision Tree

Decision Tree classification results on the loan status test set.

5.7 | Gaussian Naïve Bayes

Confusion Matrix: Gaussian Naïve Bayes

Gaussian Naïve Bayes classification results on the loan status test set.

5.8 | RandomForestClassifier

Confusion Matrix: Random Forest

Random Forest classification results on the loan status test set.

5.9 | Best Model Result

Best Model Result: All Classifiers

Comparison of all four classifiers (KNN, Decision Tree, Gaussian Naïve Bayes, and Random Forest), ranked by test accuracy.

🎉 Key Insights