Project Overview
This project focuses on analyzing a dataset of Pima Indians to understand the risk factors associated with diabetes. By examining various attributes such as glucose levels, BMI, insulin, and age, the analysis aims to uncover trends, correlations, and insights into diabetes risk.
The project combines thorough exploratory data analysis with five machine learning classifiers (KNN, Decision Tree, Naïve Bayes, Random Forest, and SVM) to predict diabetes status and evaluate which features drive prediction performance.
Dataset Description
The dataset comprises 768 rows and 9 columns, each representing different attributes related to diabetes risk in Pima Indian women.
| # | Column | Description |
|---|---|---|
| 1 | Pregnancies | Number of pregnancies the individual has had |
| 2 | Glucose | Plasma glucose concentration (2-hour oral glucose tolerance test) |
| 3 | BloodPressure | Diastolic blood pressure (mm Hg) |
| 4 | SkinThickness | Triceps skin fold thickness (mm) |
| 5 | Insulin | 2-hour serum insulin (mu U/ml) |
| 6 | BMI | Body mass index (weight in kg / height in m²) |
| 7 | DiabetesPedigreeFunction | Likelihood of diabetes based on family history |
| 8 | Age | Age of the individual in years |
| 9 | Outcome | Diabetes status: 1 = diabetic, 0 = non-diabetic |
Data Quality
- Missing Values: The dataset contains no null values, although zeros in Insulin and Skin Thickness likely stand in for unrecorded measurements.
- Duplicates: The dataset contains no duplicate rows.
- RangeIndex: The dataset includes 768 entries.
- Data Types: 7 integer columns and 2 float columns.
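The checks listed above map onto a few pandas calls. The following is a minimal sketch on a small hypothetical frame (the values are invented for illustration, not rows from the dataset); the `zero_as_missing` column list is an assumption about which features cannot plausibly be zero.

```python
import pandas as pd

# Hypothetical sample rows for illustration only; not real dataset records.
df = pd.DataFrame({
    "Glucose": [120, 95, 160, 110, 140],
    "SkinThickness": [30, 0, 25, 0, 40],
    "Insulin": [0, 80, 0, 0, 150],
    "BMI": [31.2, 24.8, 36.5, 27.0, 40.1],
    "Outcome": [0, 0, 1, 0, 1],
})

null_counts = df.isnull().sum()              # NaN count per column
duplicate_rows = int(df.duplicated().sum())  # fully identical rows
dtype_counts = df.dtypes.value_counts()      # number of columns per dtype

# Assumption: a zero in these columns is a placeholder for "not measured".
zero_as_missing = ["Glucose", "SkinThickness", "Insulin", "BMI"]
zero_counts = (df[zero_as_missing] == 0).sum()

print("nulls:", int(null_counts.sum()), "duplicates:", duplicate_rows)
print(dtype_counts)
print(zero_counts)
```

A null check alone reports a clean dataset here, which is why the explicit zero count is worth running alongside it.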
Exploratory Data Analysis
4.1 | Individual Variables Analysis
Histogram - All Features
Pregnancies and insulin are heavily right-skewed; glucose follows a near-normal distribution with a slight right tail (μ = 120.89).
KDE - All Features
Blood pressure is unimodal and symmetric; skin thickness and insulin show bimodal shapes suggesting zero-value clusters in the data.
Categorical Distribution - Pregnancies & Outcome
Frequency distributions confirm a class imbalance: 65.1% non-diabetic vs 34.9% diabetic.
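To put the imbalance in context, the short sketch below computes the class shares and the accuracy of a trivial majority-class predictor. The raw counts (500 and 268) are inferred from the reported percentages of 768 rows; the text itself states only the percentages.

```python
# Class counts inferred from the reported shares of 768 rows (an assumption:
# the text states percentages only, not raw counts).
n_total = 768
n_negative = 500                    # Outcome = 0 (non-diabetic)
n_positive = n_total - n_negative   # Outcome = 1 (diabetic)

share_negative = n_negative / n_total
share_positive = n_positive / n_total

# A classifier that always predicts the majority class reaches this accuracy,
# so it is a useful floor when reading the model results.
majority_baseline = max(share_negative, share_positive)

print(f"non-diabetic: {share_negative:.1%}, diabetic: {share_positive:.1%}")
print(f"majority-class baseline accuracy: {majority_baseline:.1%}")
```

Any classifier's test accuracy is best read against this baseline rather than against 50%.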
4.2 | Pairs of Variables Insights
Continuous Features vs Target
Diabetic individuals consistently show higher mean glucose (141 vs 110), higher BMI (35.1 vs 30.3), and are older (37.1 vs 31.2).
Count Data by Outcome
Pregnancies, Blood Pressure, Skin Thickness, and Age all show visible distributional shifts between diabetic and non-diabetic groups.
Grouped Count Data by Outcome
Higher glucose groups (>25) and higher BMI groups (6–7) skew strongly toward diabetic outcomes.
Strip Plots by Outcome
Glucose and BMI have the clearest vertical separation between the two classes.
Scatter Matrix
Glucose vs BMI and glucose vs age pairings show the clearest class separation across all feature combinations.
4.3 | Outlier Identification
Boxplots - All Features
Insulin has the most severe upper outliers (up to ~850 mu U/ml); pregnancies and diabetes pedigree function also show significant right-tail spread.
4.4 | Handling Outliers
Violin Grid by Outcome - Post Winsorization
After Winsorization, distributions are cleaner with the class separation signal preserved across all features.
Numeric Data After Winsorization
A three-panel view (KDE, box, scatter vs outcome) confirms tighter distributions while retaining meaningful between-group differences.
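Winsorization caps extreme values at chosen percentiles instead of dropping the rows. The sketch below shows the idea with NumPy; the 5th/95th percentile limits and the sample values are assumptions for illustration, since the limits used in the project are not stated here. `scipy.stats.mstats.winsorize` offers a similar operation with count-based limits.

```python
import numpy as np

def winsorize(values, lower_pct=5.0, upper_pct=95.0):
    """Clip values to the given lower and upper percentiles."""
    values = np.asarray(values, dtype=float)
    low, high = np.percentile(values, [lower_pct, upper_pct])
    return np.clip(values, low, high)

# Hypothetical insulin-like sample with one extreme value.
sample = np.array([0, 0, 45, 80, 94, 120, 150, 168, 200, 850])
clipped = winsorize(sample)

print("before:", sample.min(), sample.max())
print("after: ", clipped.min(), clipped.max())
```

Because values are capped rather than removed, the row count and the ordering of observations are unchanged, which is what keeps the between-group signal intact.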
4.5 | Multiple Variables Examination
Pairplot - All Features
Diagonal KDE plots show glucose and BMI have the most distinct separation between diabetic and non-diabetic distributions.
4.6 | Correlation Analysis
Correlation Heatmap - Lower Triangle
Strongest pairings: Skin Thickness & Insulin (0.51), Age & Pregnancies (0.58), Glucose & Outcome (0.49).
Correlation Heatmap - Full
Full symmetric heatmap confirms Glucose as the highest correlate with Outcome at 0.49.
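As a rough sketch of how the two heatmap views relate, the code below builds a Pearson correlation matrix, masks the upper triangle so each pair appears once, and ranks features by absolute correlation with the target. The data is synthetic; only the column names mirror the dataset, so the printed coefficients will not match the figures reported here.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200
# Synthetic stand-in data; only the column names mirror the dataset.
glucose = rng.normal(120, 30, n)
age = rng.normal(33, 11, n)
outcome = (glucose + rng.normal(0, 30, n) > 130).astype(int)
df = pd.DataFrame({"Glucose": glucose, "Age": age, "Outcome": outcome})

corr = df.corr()  # Pearson correlation by default

# Mask the upper triangle (diagonal included) so each pair appears once.
mask = np.triu(np.ones(corr.shape, dtype=bool))
lower = corr.mask(mask)

# Rank features by absolute correlation with the target.
target_corr = corr["Outcome"].drop("Outcome").abs().sort_values(ascending=False)

print(lower.round(2))
print(target_corr.round(2))
```

The masked frame can be passed to a heatmap function directly; the full matrix carries the same information mirrored across the diagonal.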
Scatter with Regression Lines
Glucose shows the steepest positive slope; Blood Pressure and Pregnancies show much weaker trends against the outcome.
Model Development & Evaluation
5.2 | Feature Selection
Feature Importance - All Features
ExtraTreesClassifier ranks Glucose as the dominant feature (importance ≈ 0.25), followed by BMI and Age. Insulin ranks last due to its noisy zero-heavy distribution.
Selected Features Above Threshold
All 8 features are retained; none fall below the importance cutoff, suggesting that every feature contributes to prediction.
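The selection step can be sketched as follows: fit an `ExtraTreesClassifier`, read `feature_importances_`, and keep the features at or above a cutoff. The data here is synthetic, and both the `threshold` value and the estimator settings are hypothetical, since the project's actual cutoff is not stated in this section.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier

feature_names = ["Pregnancies", "Glucose", "BloodPressure", "SkinThickness",
                 "Insulin", "BMI", "DiabetesPedigreeFunction", "Age"]

# Synthetic stand-in for the real feature matrix (768 rows, 8 features).
X, y = make_classification(n_samples=768, n_features=8, n_informative=5,
                           n_redundant=1, random_state=0)

model = ExtraTreesClassifier(n_estimators=200, random_state=0)
model.fit(X, y)

importances = model.feature_importances_  # non-negative, sums to 1
threshold = 0.05                          # hypothetical cutoff
selected = [name for name, score in zip(feature_names, importances)
            if score >= threshold]

ranked = sorted(zip(feature_names, importances),
                key=lambda pair: pair[1], reverse=True)
for name, score in ranked:
    print(f"{name:<26}{score:.3f}")
print("kept:", selected)
```

Since the importances sum to 1 across 8 features, an even split would give each feature 0.125, which is a handy reference point when judging a value such as 0.25.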
Train / Test Split Distribution
An 80/20 split produces 614 training instances and 154 test instances.
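The split sizes follow from how scikit-learn rounds: 20% of 768 is 153.6, which `train_test_split` rounds up to 154 test rows, leaving 614 for training. A minimal sketch with placeholder arrays; the `stratify` and `random_state` arguments are assumptions, not settings confirmed by the project.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays with the dataset's shape; the values are irrelevant here.
X = np.zeros((768, 8))
y = np.array([0] * 500 + [1] * 268)

# stratify and random_state are assumptions, not settings from the project.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

print(len(X_train), len(X_test))  # 614 154
```

Stratifying keeps the roughly 65/35 class ratio in both partitions, which matters with a test set this small.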
5.4 | KNeighborsClassifier
Overfitting / Underfitting Detection - KNN
Training accuracy starts at 100% (K=1) and falls sharply; test accuracy rises from 64%. The curves converge around K=25–30, suggesting that the best K for generalization lies in that range.
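A train-versus-test curve of this kind can be produced by sweeping K and recording both accuracies. The sketch below uses synthetic data and an assumed K range of 1 to 40, so the numbers it prints will differ from the ones reported here.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data with the same shape as the real dataset.
X, y = make_classification(n_samples=768, n_features=8, n_informative=5,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# KNN is distance-based, so features are standardized with training statistics.
scaler = StandardScaler().fit(X_train)
X_train_s, X_test_s = scaler.transform(X_train), scaler.transform(X_test)

scores = {}
for k in range(1, 41):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train_s, y_train)
    scores[k] = (knn.score(X_train_s, y_train), knn.score(X_test_s, y_test))

best_k = max(scores, key=lambda k: scores[k][1])
print("K=1 train/test:", scores[1])
print("best K by test accuracy:", best_k, scores[best_k])
```

At K=1 every training point is its own nearest neighbor, which explains the 100% training accuracy. Choosing K on the test set, as done here for brevity, is optimistic; cross-validation on the training split is the safer choice.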
Confusion Matrix - KNN
KNN achieves 77.92% test accuracy after hyperparameter tuning.
5.5 | DecisionTreeClassifier
Decision Tree Plot
The tree first splits on Glucose (≤ 0.213), then branches on Age, BMI, and Blood Pressure, visually confirming the feature importance ranking.
Confusion Matrix - Decision Tree
Decision Tree achieves the highest test accuracy at 79.22%.
5.6 | Naïve Bayes
Confusion Matrix - Gaussian Naïve Bayes
Naïve Bayes lands around 72–74% test accuracy with the simplest model assumptions.
5.7 | RandomForestClassifier
Confusion Matrix - Random Forest
Random Forest scores 75.32% test accuracy despite the highest training accuracy (88.27%), indicating mild overfitting.
5.8 | Support Vector Machine
Confusion Matrix - SVM
SVM lands around 73–74% test accuracy after kernel and regularization tuning.
5.9 | Best Model Result
Best Model Result - All Classifiers
Decision Tree leads at 79.22%, narrowly beating KNN (77.92%). Random Forest shows mild overfitting; Naïve Bayes and SVM both land around 72–74%.
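A comparison like this can be sketched as a loop over the five classifiers that records train accuracy, test accuracy, and the gap between them. The data is synthetic and the hyperparameters are illustrative placeholders, not the tuned values from the project, so the ranking printed here need not match the one above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data with the same shape as the real dataset.
X, y = make_classification(n_samples=768, n_features=8, n_informative=5,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Hyperparameters are illustrative, not the tuned values from the project.
models = {
    "KNN": make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=25)),
    "Decision Tree": DecisionTreeClassifier(max_depth=4, random_state=0),
    "Naive Bayes": GaussianNB(),
    "Random Forest": RandomForestClassifier(n_estimators=200, random_state=0),
    "SVM": make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0)),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    train_acc = model.score(X_train, y_train)
    test_acc = model.score(X_test, y_test)
    # A large train-minus-test gap is the overfitting signal.
    results[name] = (train_acc, test_acc, train_acc - test_acc)

for name, (train_acc, test_acc, gap) in sorted(
        results.items(), key=lambda item: item[1][1], reverse=True):
    print(f"{name:<14} train={train_acc:.4f} test={test_acc:.4f} gap={gap:.4f}")
```

With 154 test rows, the reported 79.22% and 77.92% correspond to 122 and 120 correct predictions, so the lead of Decision Tree over KNN amounts to two test samples and may not hold under a different split.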
Key Insights
- Glucose is the strongest predictor of diabetes, with the highest feature importance (≈ 0.25) and the highest correlation with Outcome (0.49).
- Class imbalance is present (65.1% non-diabetic vs 34.9% diabetic), which influences model performance and evaluation strategy.
- Insulin and Skin Thickness contain heavy zero-value clusters suggesting missing or unreported measurements, reflected in bimodal KDE shapes.
- Age and Pregnancies are the most correlated feature pair (0.58), followed by Skin Thickness and Insulin (0.51).
- Decision Tree outperformed all models at 79.22% test accuracy, with KNN close behind at 77.92%. Random Forest overfits slightly despite higher training accuracy.
- Winsorization effectively tightened extreme distributions, particularly insulin, while preserving the between-group signal needed for classification.