
Diabetes Risk Analysis

📅 2024
🩸 768 Patient Records
🤖 5 ML Models
Python · Pandas · Matplotlib · Seaborn · Scikit-learn · Machine Learning · Healthcare Analytics · EDA

🎯 Project Overview

This project focuses on analyzing a dataset of Pima Indians to understand the risk factors associated with diabetes. By examining various attributes such as glucose levels, BMI, insulin, and age, the analysis aims to uncover trends, correlations, and insights into diabetes risk.

The project combines thorough exploratory data analysis with five machine learning classifiers (KNN, Decision Tree, Naïve Bayes, Random Forest, and SVM) to predict diabetes status and evaluate which features drive prediction performance.

📊 Dataset Description

The dataset comprises 768 rows and 9 columns: eight predictor attributes related to diabetes risk in Pima Indian women, plus the Outcome label.

| # | Column | Description |
|---|--------|-------------|
| 1 | Pregnancies | Number of pregnancies the individual has had |
| 2 | Glucose | Plasma glucose concentration (2-hour oral glucose tolerance test) |
| 3 | BloodPressure | Diastolic blood pressure (mm Hg) |
| 4 | SkinThickness | Triceps skin fold thickness (mm) |
| 5 | Insulin | 2-hour serum insulin (mu U/ml) |
| 6 | BMI | Body mass index (weight in kg / height in m²) |
| 7 | DiabetesPedigreeFunction | Likelihood of diabetes based on family history |
| 8 | Age | Age of the individual in years |
| 9 | Outcome | Diabetes status: 1 = diabetic, 0 = non-diabetic |

Data Quality

📈 Exploratory Data Analysis

4.1 | Individual Variables Analysis

Histogram: All Features

Pregnancies and insulin are heavily right-skewed; glucose follows a near-normal distribution with a slight right tail (μ = 120.89).

KDE: All Features

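Blood pressure is unimodal and symmetric; skin thickness and insulin show bimodal shapes, with a second peak at zero that points to a cluster of zero-valued records. Zeros in these columns are commonly treated as missing measurements rather than true readings.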
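One way to quantify the zero-value clusters is to count the share of zeros per column. The sketch below is illustrative only: the list of "suspect" columns and the helper name `zero_share` are assumptions, and the toy frame holds made-up values rather than the real records.

```python
import pandas as pd

# Columns where a zero is physiologically implausible (an assumption of this sketch).
ZERO_SUSPECT_COLS = ["Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI"]


def zero_share(df: pd.DataFrame, cols=ZERO_SUSPECT_COLS) -> pd.Series:
    """Return the fraction of rows equal to zero for each listed column that exists."""
    present = [c for c in cols if c in df.columns]
    return (df[present] == 0).mean().sort_values(ascending=False)


# Tiny made-up frame for illustration; not the real dataset.
toy = pd.DataFrame({
    "Glucose": [148, 85, 0, 89],
    "Insulin": [0, 0, 0, 94],
    "BMI": [33.6, 26.6, 23.3, 28.1],
})
print(zero_share(toy))
```

Columns with a large zero share are natural candidates for imputation or separate handling before modeling.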

Categorical Distribution: Pregnancies & Outcome

Frequency distributions confirm a class imbalance: 65.1% non-diabetic vs 34.9% diabetic.
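The class shares can be reproduced from label counts alone. This is a minimal sketch; the helper name `class_shares` is hypothetical, and the label list is built from counts that are consistent with the quoted percentages for 768 rows.

```python
from collections import Counter


def class_shares(labels):
    """Return {label: percentage of rows}, rounded to one decimal place."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: round(100 * n / total, 1) for label, n in counts.items()}


# 500 non-diabetic and 268 diabetic labels: counts consistent with the shares above.
labels = [0] * 500 + [1] * 268
shares = class_shares(labels)
print(shares)  # {0: 65.1, 1: 34.9}
```

With roughly a 2:1 imbalance, a classifier that always predicts "non-diabetic" would already reach about 65% accuracy, which is a useful baseline when reading the model results.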

4.2 | Pairs of Variables Insights

Continuous Features vs Target

Diabetic individuals show higher mean glucose (141 vs 110), higher mean BMI (35.1 vs 30.3), and a higher mean age (37.1 vs 31.2) than non-diabetic individuals.
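Group means like these are typically obtained with a pandas `groupby` on the target column. The sketch below assumes the column names from the dataset description; the helper `means_by_outcome` is hypothetical and the toy values are made up for illustration.

```python
import pandas as pd


def means_by_outcome(df: pd.DataFrame, cols) -> pd.DataFrame:
    """Mean of each listed column, split by the Outcome label (0 / 1)."""
    return df.groupby("Outcome")[list(cols)].mean()


# Made-up rows for illustration only; not the real records.
toy = pd.DataFrame({
    "Outcome": [0, 0, 1, 1],
    "Glucose": [90, 110, 130, 150],
    "Age": [25, 35, 30, 50],
})
means = means_by_outcome(toy, ["Glucose", "Age"])
print(means)
```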

Count Data by Outcome

Pregnancies, Blood Pressure, Skin Thickness, and Age all show visible distributional shifts between diabetic and non-diabetic groups.

Grouped Count Data by Outcome

Higher glucose groups (>25) and higher BMI groups (6–7) skew strongly toward diabetic outcomes.

Strip Plots by Outcome

Glucose and BMI have the clearest vertical separation between the two classes.

Scatter Matrix

Glucose vs BMI and glucose vs age pairings show the clearest class separation across all feature combinations.

4.3 | Outlier Identification

Boxplots: All Features

Insulin has the most severe upper outliers (up to ~850 mu U/ml); pregnancies and diabetes pedigree function also show significant right-tail spread.
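Boxplot whiskers conventionally use the 1.5 × IQR rule, so the outliers visible in the plots can be flagged numerically the same way. This is a sketch under that assumption; the helper name `iqr_outlier_mask` and the sample values are illustrative, not taken from the project.

```python
import numpy as np


def iqr_outlier_mask(values, k=1.5):
    """Boolean mask of points outside [Q1 - k*IQR, Q3 + k*IQR]."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return (values < lower) | (values > upper)


# Made-up insulin-like values with one extreme reading.
insulin_like = [0, 50, 80, 100, 120, 150, 180, 846]
mask = iqr_outlier_mask(insulin_like)
print(mask.tolist())
```

Here only the extreme reading falls outside the fences; the zero does not, which is one reason zero-valued records need a separate check.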

4.4 | Handling Outliers

Violin Grid by Outcome: Post Winsorization

After Winsorization, extreme values are capped and the distributions have shorter tails, while the class separation signal is preserved across all features.

Numeric Data After Winsorization

A three-panel view (KDE, box, scatter vs outcome) confirms tighter distributions while retaining meaningful between-group differences.
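Winsorization caps extreme values at chosen limits instead of dropping the rows. The sketch below uses percentile clipping with NumPy; the 5th/95th percentile limits and the helper name `winsorize_percentile` are assumptions, since the limits used in the project are not stated. `scipy.stats.mstats.winsorize` offers a related implementation.

```python
import numpy as np


def winsorize_percentile(values, lower_pct=5, upper_pct=95):
    """Clip values to the [lower_pct, upper_pct] percentile range."""
    values = np.asarray(values, dtype=float)
    lo, hi = np.percentile(values, [lower_pct, upper_pct])
    return np.clip(values, lo, hi)


# Illustrative input: the integers 1..100.
raw = np.arange(1, 101)
capped = winsorize_percentile(raw)
print(capped.min(), capped.max())
```

The row count is unchanged and values inside the limits are untouched, which is why between-group differences survive the transformation.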

4.5 | Multiple Variables Examination

Pairplot: All Features

Diagonal KDE plots show glucose and BMI have the most distinct separation between diabetic and non-diabetic distributions.

4.6 | Correlation Analysis

Correlation Heatmap: Lower Triangle

Strongest pairings, in descending order: Age & Pregnancies (0.58), Skin Thickness & Insulin (0.51), Glucose & Outcome (0.49).

Correlation Heatmap: Full

Full symmetric heatmap confirms Glucose as the highest correlate with Outcome at 0.49.
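Because a correlation matrix is symmetric, the lower-triangle view carries the same information as the full one. A minimal sketch of extracting it with pandas and NumPy follows; the helper name `lower_triangle_corr` and the toy columns are assumptions for illustration.

```python
import numpy as np
import pandas as pd


def lower_triangle_corr(df: pd.DataFrame) -> pd.DataFrame:
    """Pearson correlations with the diagonal and upper triangle set to NaN."""
    corr = df.corr()
    keep = np.tril(np.ones(corr.shape, dtype=bool), k=-1)  # strictly below the diagonal
    return corr.where(keep)


# Made-up columns: b rises with a, c falls with a.
toy = pd.DataFrame({
    "a": [1, 2, 3, 4],
    "b": [2, 4, 6, 8],
    "c": [4, 3, 2, 1],
})
lt = lower_triangle_corr(toy)
print(lt)
```

For plotting, seaborn's `heatmap` accepts a boolean `mask` argument that hides the masked cells, so the same triangle logic can be applied there.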

Scatter with Regression Lines

Glucose shows the steepest positive slope; Blood Pressure and Pregnancies show much weaker trends against the outcome.

🤖 Model Development & Evaluation

5.2 | Feature Selection

Feature Importance: All Features

ExtraTreesClassifier ranks Glucose as the dominant feature (importance ≈ 0.25), followed by BMI and Age. Insulin ranks last, likely because of its noisy, zero-heavy distribution.

Selected Features Above Threshold

All 8 features are retained: none falls below the importance cutoff, so every feature is carried forward into modeling.
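Importance-based selection with `ExtraTreesClassifier` can be sketched as follows. Everything specific here is an assumption for illustration: the data is synthetic (`make_classification`), the feature names are placeholders, and the `THRESHOLD` value is hypothetical because the project's cutoff is not stated.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier

# Synthetic stand-in for the 8 predictors; not the real data.
X, y = make_classification(n_samples=300, n_features=8, n_informative=4, random_state=0)
names = [f"feature_{i}" for i in range(8)]  # placeholder names

model = ExtraTreesClassifier(n_estimators=100, random_state=0).fit(X, y)

# Importances are normalized, so they sum to 1 across features.
importances = dict(zip(names, model.feature_importances_))

THRESHOLD = 0.01  # hypothetical cutoff
selected = [name for name, imp in importances.items() if imp >= THRESHOLD]

for name, imp in sorted(importances.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {imp:.3f}")
```

Since the importances sum to 1, a value near 0.25 for one of eight features is about twice the uniform share of 0.125.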

Train / Test Split Distribution

An 80/20 split produces 614 training instances and 154 test instances.
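The 614 / 154 counts follow from how scikit-learn rounds the test share: 0.2 × 768 = 153.6, which is rounded up to 154. The sketch below reproduces the sizes on placeholder arrays; the `stratify` and `random_state` arguments are assumptions, as the source does not say whether the split was stratified or seeded.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays with the dataset's shape; values are not the real data.
X = np.zeros((768, 8))
y = np.array([0] * 500 + [1] * 268)

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,
    stratify=y,        # assumption: keep the class ratio equal in both parts
    random_state=42,   # assumption: arbitrary seed for reproducibility
)
print(len(X_train), len(X_test))  # 614 154
```

Stratifying is a common choice with imbalanced labels because it keeps the diabetic share nearly identical in the training and test sets.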

5.4 | KNeighborsClassifier

Overfitting / Underfitting Detection: KNN

Training accuracy starts at 100% at K=1 and falls sharply, while test accuracy rises from 64%. The two curves converge around K=25–30, which suggests that a K in this range generalizes best.
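A curve like this comes from fitting one KNN model per K and recording both accuracies. The sketch below shows the loop on synthetic data, so its numbers will not match the figure; the K range of 1 to 30 and the variable names are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data with the same shape as the project's dataset.
X, y = make_classification(n_samples=768, n_features=8, n_informative=4, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

scores = {}  # K -> (train accuracy, test accuracy)
for k in range(1, 31):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
    scores[k] = (knn.score(X_tr, y_tr), knn.score(X_te, y_te))

# Pick the K with the highest test accuracy (ties go to the smallest K).
best_k = max(scores, key=lambda k: scores[k][1])
print(best_k, scores[best_k])
```

At K=1 every training point is its own nearest neighbor, which explains the perfect training accuracy at the left edge of the curve. Choosing K on the test set, as done here for brevity, gives an optimistic estimate; cross-validation on the training set is the safer option.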

Confusion Matrix: KNN

KNN achieves 77.92% test accuracy after hyperparameter tuning.

5.5 | DecisionTreeClassifier

Decision Tree Plot

The tree first splits on Glucose (≤ 0.213, a threshold that suggests the features were scaled before training), then branches on Age, BMI, and Blood Pressure, which is consistent with the feature importance ranking.

Confusion Matrix: Decision Tree

Decision Tree achieves the highest test accuracy at 79.22%.

5.6 | Naïve Bayes

Confusion Matrix: Gaussian Naïve Bayes

Naïve Bayes lands around 72–74% test accuracy with the simplest model assumptions.

5.7 | RandomForestClassifier

Confusion Matrix: Random Forest

Random Forest scores 75.32% test accuracy despite the highest training accuracy (88.27%), indicating mild overfitting.

5.8 | Support Vector Machine

Confusion Matrix: SVM

SVM lands around 73–74% test accuracy after kernel and regularization tuning.

5.9 | Best Model Result

Best Model Result: All Classifiers

Decision Tree leads at 79.22%, narrowly beating KNN (77.92%). Random Forest shows mild overfitting; Naïve Bayes and SVM both land around 72–74%.
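To put the gaps in perspective, the reported accuracies can be converted back into counts of correctly classified test instances. This sketch assumes all three models were scored on the same 154-instance test set; the helper name `implied_correct` is hypothetical.

```python
TEST_SIZE = 154  # test instances from the 80/20 split

# Test accuracies (in percent) as reported for the three exactly quoted models.
reported = {"Decision Tree": 79.22, "KNN": 77.92, "Random Forest": 75.32}


def implied_correct(accuracy_pct, n=TEST_SIZE):
    """Number of correct predictions implied by an accuracy on n instances."""
    return round(accuracy_pct / 100 * n)


counts = {name: implied_correct(acc) for name, acc in reported.items()}
print(counts)  # {'Decision Tree': 122, 'KNN': 120, 'Random Forest': 116}
```

Under that assumption, the lead of Decision Tree over KNN amounts to two test instances, so the ranking could plausibly change with a different random split.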

🎉 Key Insights