Project Overview
This project focuses on analyzing a dataset of loan applications to understand the factors affecting loan approval status. By examining various attributes such as applicant details, income, loan amount, credit history, and property area, the analysis aims to uncover trends, disparities, and insights in loan approval patterns.
The project combines thorough exploratory data analysis with four machine learning classifiers (KNN, Decision Tree, Gaussian Naïve Bayes, and Random Forest) to predict loan approval status and evaluate which features are most influential.
Dataset Description
The dataset comprises 381 rows and 13 columns, each representing different attributes related to loan applications.
| # | Column | Description |
|---|---|---|
| 1 | Loan_ID | A unique loan ID |
| 2 | Gender | Gender of the applicant (Male/Female) |
| 3 | Married | Marital status of the applicant (Yes/No) |
| 4 | Dependents | Number of dependents of the applicant |
| 5 | Education | Education level (Graduate/Not Graduate) |
| 6 | Self_Employed | Whether the applicant is self-employed (Yes/No) |
| 7 | ApplicantIncome | Income of the applicant |
| 8 | CoapplicantIncome | Income of the co-applicant |
| 9 | LoanAmount | Loan amount in thousands |
| 10 | Loan_Amount_Term | Term of the loan in months |
| 11 | Credit_History | Credit history (1: meets guidelines, 0: does not meet) |
| 12 | Property_Area | Area where the property is located (Urban/Semi-urban/Rural) |
| 13 | Loan_Status | Loan approved (Y/N) |
Data Quality
- Missing Values: Some missing values exist in Gender, Dependents, Self_Employed, Loan_Amount_Term, and Credit_History.
- Duplicates: No duplicate values found.
- Row Count: 381 entries (RangeIndex 0 to 380).
- Data Types: 4 float columns, 1 integer column, and 8 object columns.
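The checks above can be reproduced with a few pandas calls. The sketch below is illustrative only: it builds a tiny hand-made frame rather than loading the real file, and the row values are placeholders, not records from the dataset.

```python
import numpy as np
import pandas as pd

# Tiny illustrative frame (NOT the real dataset); column names follow the table above.
df = pd.DataFrame({
    "Loan_ID": ["LP001", "LP002", "LP003", "LP004"],
    "Gender": ["Male", None, "Female", "Male"],
    "ApplicantIncome": [4583, 3000, 2583, 6000],
    "LoanAmount": [128.0, 66.0, 120.0, 141.0],
    "Credit_History": [1.0, np.nan, 1.0, 0.0],
    "Loan_Status": ["N", "Y", "Y", "Y"],
})

missing_per_column = df.isna().sum()                  # missing values per column
duplicate_rows = int(df.duplicated().sum())           # fully duplicated rows
dtype_counts = df.dtypes.astype(str).value_counts()   # number of columns per dtype

print(missing_per_column[missing_per_column > 0])
print("duplicate rows:", duplicate_rows)
print(dtype_counts)
```

On the real data, the same three calls would be expected to surface the missing-value columns, the zero duplicate count, and the dtype breakdown listed above.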
Exploratory Data Analysis
4.1 | Individual Variables Analysis
Loan Status Distribution
Overview of the overall loan approval rate across all 381 applications in the dataset.
Categorical Variable Distributions
Bar chart breakdown of all categorical features including Gender, Married, Education, Self-Employed, and Property Area.
Histogram: Continuous Data
Distribution of continuous variables: ApplicantIncome, CoapplicantIncome, LoanAmount, and Loan_Amount_Term.
Gender Distribution
Breakdown of loan applicants by gender.
Married Distribution
Breakdown of loan applicants by marital status.
Dependents Distribution
Distribution of the number of dependents among applicants.
Education Distribution
Breakdown of applicants by education level (Graduate vs Not Graduate).
Self Employed Distribution
Proportion of applicants who are self-employed vs salaried.
Credit History Distribution
Distribution of applicants by credit history compliance.
Property Area Distribution
Breakdown of applicants by property area: Urban, Semi-urban, and Rural.
KDE: Continuous Data
KDE curves for continuous variables revealing distribution shapes and skewness patterns.
4.2 | Outlier Identification
Box Plot: All Features
Box plots exposing the spread and outliers in continuous variables, particularly ApplicantIncome and LoanAmount.
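Box plots conventionally draw whiskers at 1.5 times the interquartile range (IQR), so the same rule can flag outliers numerically. The helper `iqr_outliers` and the income values below are hypothetical, chosen only to illustrate the rule.

```python
import pandas as pd


def iqr_outliers(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Return the entries lying more than k * IQR outside the quartiles."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return values[(values < lower) | (values > upper)]


# Made-up incomes with one extreme value (not taken from the dataset).
incomes = pd.Series(
    [2500, 3000, 3200, 3500, 4000, 4200, 4500, 20000], name="ApplicantIncome"
)
outliers = iqr_outliers(incomes)
print(outliers.tolist())
```

Whether flagged rows are dropped, capped, or kept is a separate modeling decision that this sketch does not make.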
4.3 | Pairs of Variables Insights
Box and Scatter
Combined box and scatter plots showing relationships between continuous variables and loan status.
Violin Plot
Violin plots combining distribution shape with quartile summaries across continuous features.
Violin Plot: Binary
Violin plots split by loan approval status, highlighting distributional differences between approved and rejected applications.
4.4 | Multiple Variables Examination
Correlation Heatmap
Correlation heatmap across all continuous and encoded features, revealing relationships between income, loan amount, and approval status.
Regression Plot
Regression plots highlighting linear relationships between continuous predictors and loan approval.
Pair Plot
Pairplot providing a full cross-feature view with loan status as the grouping variable.
4.5 | Hypothesis Testing
Chi-squared Test: Tests independence between each categorical variable and loan status.
Z-test: Compares the means of continuous variables between approved and rejected applications.
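Both tests can be sketched as follows. This is a minimal illustration on synthetic data, not the project's actual test code: `two_sample_z_test` is a hypothetical helper implementing a generic two-sided, two-sample z-test with unpooled variances, and the counts and loan amounts are invented.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Chi-squared test of independence on a synthetic contingency table.
toy = pd.DataFrame({
    "Credit_History": [1] * 40 + [0] * 20,
    "Loan_Status": ["Y"] * 32 + ["N"] * 8 + ["Y"] * 4 + ["N"] * 16,
})
table = pd.crosstab(toy["Credit_History"], toy["Loan_Status"])
chi2, p_chi2, dof, expected = stats.chi2_contingency(table)
print(f"chi2={chi2:.2f}, dof={dof}, p={p_chi2:.2g}")


def two_sample_z_test(a, b):
    """Two-sided z-test for a difference in means (large-sample approximation)."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    standard_error = np.sqrt(a.var(ddof=1) / a.size + b.var(ddof=1) / b.size)
    z = (a.mean() - b.mean()) / standard_error
    p = 2 * stats.norm.sf(abs(z))
    return z, p


# Synthetic loan amounts for approved vs. rejected applications.
approved_amounts = np.arange(100, 160)
rejected_amounts = np.arange(120, 180)
z, p_z = two_sample_z_test(approved_amounts, rejected_amounts)
print(f"z={z:.2f}, p={p_z:.2g}")
```

A small p-value in either test is evidence against independence (chi-squared) or against equal group means (z-test). Note that `chi2_contingency` applies Yates' continuity correction to 2x2 tables by default.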
Model Development & Evaluation
5.1 | Data Normalization
Continuous features are normalized prior to model training.
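The normalization method is not named here, so the sketch below assumes min-max scaling with scikit-learn's `MinMaxScaler`. The two columns stand in for ApplicantIncome and LoanAmount and hold made-up values.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up values for two continuous features (income, loan amount).
X = np.array([
    [2500.0, 100.0],
    [5000.0, 150.0],
    [10000.0, 300.0],
])

scaler = MinMaxScaler()              # rescales each column to the [0, 1] range
X_scaled = scaler.fit_transform(X)
print(X_scaled)
```

If standardization (zero mean, unit variance) is what the project actually uses, `StandardScaler` has the same `fit_transform` interface.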
5.2 | Feature Encoding
Categorical columns are label-encoded for use in classifiers.
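A minimal sketch of label encoding with scikit-learn's `LabelEncoder`, using a three-row illustrative frame. Keeping one encoder per column (the `encoders` dict is an assumption of this sketch) makes it possible to map the integers back to the original labels later.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Illustrative rows only; category names follow the dataset description.
df = pd.DataFrame({
    "Education": ["Graduate", "Not Graduate", "Graduate"],
    "Property_Area": ["Urban", "Rural", "Semi-urban"],
})

encoders = {}
encoded = df.copy()
for col in df.columns:
    encoder = LabelEncoder()
    encoded[col] = encoder.fit_transform(df[col])   # classes are sorted alphabetically
    encoders[col] = encoder

print(encoded)
print(list(encoders["Property_Area"].classes_))
```

Label encoding imposes an arbitrary integer order on categories, which tree-based models tolerate better than distance-based ones such as KNN.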
5.3 | Feature Selection
Feature Importance
ExtraTreesClassifier ranks Credit_History as the dominant feature, followed by ApplicantIncome and LoanAmount.
Feature Selection
Features selected above the importance threshold for use in model training.
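The ranking-and-threshold step might look like the sketch below. It runs on synthetic data in which the label mostly follows Credit_History, so the ranking it prints illustrates the mechanism and is not a result from the real dataset. The mean-importance threshold is an assumption, since the actual cut-off is not stated.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import ExtraTreesClassifier

rng = np.random.default_rng(0)
n = 300

# Synthetic stand-in data (NOT the real dataset).
X = pd.DataFrame({
    "Credit_History": rng.integers(0, 2, size=n),
    "ApplicantIncome": rng.normal(5000, 1500, size=n),
    "LoanAmount": rng.normal(130, 40, size=n),
})
flip = rng.random(n) < 0.05   # flip about 5% of labels as noise
y = np.where(flip, 1 - X["Credit_History"], X["Credit_History"])

model = ExtraTreesClassifier(n_estimators=200, random_state=0).fit(X, y)
ranking = pd.Series(model.feature_importances_, index=X.columns).sort_values(
    ascending=False
)

threshold = ranking.mean()    # hypothetical cut-off, not taken from the project
selected = ranking[ranking > threshold].index.tolist()

print(ranking)
print("selected:", selected)
```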
5.4 | Model Preparation
Train / Test Data Split
An 80/20 split is applied to produce training and test sets for all classifiers.
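With scikit-learn, an 80/20 split is typically written as below. The arrays are placeholders with the dataset's row count, and `stratify=y` and `random_state=42` are assumptions of this sketch rather than settings confirmed by the project.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder features and labels with the dataset's row count (381).
X = np.arange(381 * 3).reshape(381, 3)
y = np.arange(381) % 2

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,     # 20% held out for testing
    stratify=y,        # assumption: keep class proportions equal in both sets
    random_state=42,   # assumption: fixed seed for reproducibility
)
print(X_train.shape, X_test.shape)
```

Because 20% of 381 is not a whole number, scikit-learn rounds the test set up, giving 77 test rows and 304 training rows.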
5.5 | KNeighborsClassifier
Overfitting / Underfitting Detection: KNN
Training and test accuracy curves across K values, showing the optimal K for generalization.
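The accuracy curves can be produced by refitting KNN for each candidate K and recording both scores. The sketch below uses synthetic data from `make_classification`, and the K range of 1 to 15 is an assumption.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the prepared loan features (NOT the real dataset).
X, y = make_classification(
    n_samples=381, n_features=5, n_informative=3, n_redundant=0, random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

scores = []   # rows of (k, train_accuracy, test_accuracy)
for k in range(1, 16):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    scores.append((k, knn.score(X_train, y_train), knn.score(X_test, y_test)))

# Pick the K with the highest test accuracy (the smallest K wins ties).
best_k, best_train_acc, best_test_acc = max(scores, key=lambda row: row[2])
print(f"best K = {best_k}: train={best_train_acc:.3f}, test={best_test_acc:.3f}")
```

A large gap between training and test accuracy at small K signals overfitting, while both scores dropping together at large K signals underfitting. Choosing K on the test set, as this sketch does for brevity, gives an optimistic estimate; a separate validation split or cross-validation is the safer choice.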
Confusion Matrix: KNN
KNN classification results on the loan status test set.
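For reference, a confusion matrix is computed from true and predicted labels as sketched below. The label lists are invented for illustration and are not model output from this project. The same call applies to every classifier in this section.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Invented labels for illustration (Y = approved, N = rejected).
y_true = ["Y", "Y", "N", "Y", "N", "Y", "N", "Y"]
y_pred = ["Y", "Y", "Y", "Y", "N", "N", "N", "Y"]

# Rows are true classes and columns are predicted classes, in the order given.
cm = confusion_matrix(y_true, y_pred, labels=["N", "Y"])
accuracy = accuracy_score(y_true, y_pred)

print(cm)
print("accuracy:", accuracy)
```

Here the off-diagonal cells count one rejected application predicted as approved and one approved application predicted as rejected.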
5.6 | DecisionTreeClassifier
Confusion Matrix: Decision Tree
Decision Tree classification results on the loan status test set.
5.7 | Gaussian Naïve Bayes
Confusion Matrix: Gaussian Naïve Bayes
Gaussian Naïve Bayes classification results on the loan status test set.
5.8 | RandomForestClassifier
Confusion Matrix: Random Forest
Random Forest classification results on the loan status test set.
5.9 | Best Model Result
Best Model Result: All Classifiers
Comparison of all four classifiers (KNN, Decision Tree, Gaussian Naïve Bayes, and Random Forest), ranked by test accuracy.
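A comparison of this kind can be sketched by fitting each classifier on the same split and sorting by test accuracy. The data is synthetic and the hyperparameters are scikit-learn defaults plus a fixed seed, both assumptions, so the printed ranking says nothing about the project's actual results.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the prepared loan features (NOT the real dataset).
X, y = make_classification(
    n_samples=381, n_features=5, n_informative=3, n_redundant=0, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

models = {
    "KNN": KNeighborsClassifier(),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Gaussian Naive Bayes": GaussianNB(),
    "Random Forest": RandomForestClassifier(random_state=0),
}

# Fit every model on the same training data and score it on the same test data.
results = {
    name: model.fit(X_train, y_train).score(X_test, y_test)
    for name, model in models.items()
}
ranking = sorted(results.items(), key=lambda item: item[1], reverse=True)

for name, test_accuracy in ranking:
    print(f"{name}: {test_accuracy:.3f}")
```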
Key Insights
- Credit history is the strongest predictor of loan approval: applicants who meet credit guidelines are approved at a significantly higher rate than those who do not.
- Missing values are present in Gender, Dependents, Self_Employed, Loan_Amount_Term, and Credit_History, requiring careful imputation before modeling.
- ApplicantIncome and LoanAmount are right-skewed with significant outliers, particularly for high-income applicants requesting large loan amounts.
- Semi-urban properties tend to have higher approval rates than Urban and Rural ones, suggesting a property area effect in the approval process.
- Graduates are approved at higher rates than non-graduates, and married applicants show a modest approval advantage over unmarried applicants.
- Chi-squared tests confirm statistically significant associations between loan approval and Credit_History, Education, and Property_Area.
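The imputation mentioned in the insights above could be handled as in the sketch below. The strategy is an assumption, since none is stated here: the most frequent value for categorical or flag-like columns, and the median for the numeric loan term. The rows are illustrative, not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Illustrative rows with gaps in three of the affected columns.
df = pd.DataFrame({
    "Gender": ["Male", None, "Male", "Female"],
    "Loan_Amount_Term": [360.0, np.nan, 360.0, 180.0],
    "Credit_History": [1.0, 1.0, np.nan, 0.0],
})

mode_columns = ["Gender", "Credit_History"]   # assumption: fill with most frequent value
median_columns = ["Loan_Amount_Term"]         # assumption: fill with the median

filled = df.copy()
for col in mode_columns:
    filled[col] = filled[col].fillna(filled[col].mode().iloc[0])
for col in median_columns:
    filled[col] = filled[col].fillna(filled[col].median())

print(filled)
```

To avoid leaking information from the test set, the fill values should be computed on the training split only and then applied to both splits.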