🎯 Project Overview
This project focuses on analyzing a dataset of Iris flowers to understand the diversity among different species. By examining various attributes such as sepal length, sepal width, petal length, and petal width, the analysis aims to uncover trends, disparities, and insights into the different Iris species.
The project combines thorough exploratory data analysis with five machine learning classifiers (KNN, Decision Tree, and three variants of Naïve Bayes: Gaussian, Multinomial, and Bernoulli) to predict Iris species and evaluate which features drive prediction performance.
📊 Dataset Description
The dataset comprises 150 rows, one per Iris flower, and 6 columns, each holding one attribute of the flower.
| # | Column | Description |
|---|---|---|
| 1 | Id | A unique identifier for each Iris flower |
| 2 | SepalLengthCm | Length of the sepals in centimeters |
| 3 | SepalWidthCm | Width of the sepals in centimeters |
| 4 | PetalLengthCm | Length of the petals in centimeters |
| 5 | PetalWidthCm | Width of the petals in centimeters |
| 6 | Species | Species of the Iris flower (Setosa, Versicolor, Virginica) |
Data Quality
- Missing Values: The dataset contains no missing values.
- Duplicates: The dataset contains no duplicate rows.
- RangeIndex: The dataset includes 150 entries.
- Data Types: 4 float columns, 1 integer column, and 1 object column.
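The checks listed above can be sketched with pandas. The four-row frame below is a made-up stand-in for the real 150-row file (the project's loading code is not shown), and the variable names are illustrative only.

```python
import pandas as pd

# Hypothetical stand-in for the real dataset; column names follow the table above.
df = pd.DataFrame({
    "Id": [1, 2, 3, 4],
    "SepalLengthCm": [5.1, 4.9, 7.0, 6.3],
    "SepalWidthCm": [3.5, 3.0, 3.2, 3.3],
    "PetalLengthCm": [1.4, 1.4, 4.7, 6.0],
    "PetalWidthCm": [0.2, 0.2, 1.4, 2.5],
    "Species": ["Setosa", "Setosa", "Versicolor", "Virginica"],
})

missing_per_column = df.isnull().sum()        # NaN count per column
duplicate_rows = int(df.duplicated().sum())   # rows identical to an earlier row
dtype_counts = df.dtypes.astype(str).value_counts().to_dict()

print(missing_per_column)
print("duplicate rows:", duplicate_rows)
print(dtype_counts)
```

Because `Id` is unique per row, `df.duplicated()` can only flag a row if the identifier itself repeats; dropping `Id` before the check would test for repeated measurements instead.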
📈 Exploratory Data Analysis
4.1 | Individual Variables Analysis
Species Distribution
The dataset is perfectly balanced: each of the three species accounts for 50 of the 150 records, one third (about 33.33%) of the data.
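One quick way to verify the class balance is sketched below. It assumes scikit-learn's bundled copy of the Iris data as a stand-in for the project's own file, which may differ in column names and label spelling.

```python
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()                   # assumption: bundled copy, not the project's CSV
counts = np.bincount(iris.target)    # number of records per class
shares = counts / counts.sum()

for name, n, share in zip(iris.target_names, counts, shares):
    print(f"{name}: {n} records ({share:.2%})")
```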
Histogram — Continuous Features
Histogram distributions of the four continuous features, with density curves overlaid across all species.
KDE — Continuous Features
KDE curves reveal that petal length and petal width show clear bimodal patterns, indicating strong species separation.
4.2 | Outlier Identification
Boxplots — All Features
Box plots exposing the spread and outliers across all four features.
4.3 | Handling Outliers
KDE by Species — Post Outlier Treatment
After applying Winsorization, KDE plots by species show how each feature separates across Setosa, Versicolor, and Virginica.
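Winsorization caps extreme values at chosen percentiles instead of dropping the rows. The sketch below is a minimal NumPy version; the 5th and 95th percentile limits, the helper name `winsorize`, and the sample values are all assumptions, since the limits used in the project are not stated.

```python
import numpy as np

def winsorize(values, lower_pct=5.0, upper_pct=95.0):
    """Clip values to the given lower and upper percentiles (hypothetical limits)."""
    values = np.asarray(values, dtype=float)
    low, high = np.percentile(values, [lower_pct, upper_pct])
    return np.clip(values, low, high)

# Made-up sepal widths with one extreme value at each end
sepal_width = np.array([2.0, 2.9, 3.0, 3.0, 3.1, 3.2, 3.4, 3.5, 3.6, 4.4])
treated = winsorize(sepal_width)

print("before:", sepal_width.min(), sepal_width.max())
print("after: ", treated.min(), treated.max())
```

Only the tails move; values between the two limits, and therefore the median, are left unchanged.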
Violin Grid by Species
Violin plots reinforce the same picture — Setosa is clearly distinct, while Versicolor and Virginica overlap more on sepal features.
4.4 | Pairs of Variables Insights
Multiple Variable Scatter
Scatter plots of petal and sepal dimensions, colored by species, confirm that petal measurements are the strongest separators.
Violin Boxplot Grid
Violin-boxplot overlays per species and feature, combining distribution shape with quartile summaries.
Outlier Analysis
Outlier analysis using regression plots across all features with species as the grouping variable.
4.5 | Multiple Variables Examination
Correlation Heatmap
The correlation heatmap shows that petal length and petal width are highly correlated (0.96), and both correlate strongly with sepal length (0.87 and 0.82 respectively).
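The coefficients behind such a heatmap are plain Pearson correlations. The sketch below computes them with NumPy, assuming scikit-learn's bundled Iris copy in place of the project's file and untreated values (no Winsorization), so small differences from the reported figures are possible.

```python
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()   # assumption: bundled copy, columns in the order listed below
names = ["sepal_length", "sepal_width", "petal_length", "petal_width"]

# rowvar=False treats each column as one variable, giving a 4x4 matrix
corr = np.corrcoef(iris.data, rowvar=False)

for i in range(len(names)):
    for j in range(i + 1, len(names)):
        print(f"{names[i]} vs {names[j]}: {corr[i, j]:.2f}")
```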
Relplot — Sepal
Faceted relplot showing sepal relationships broken out per species.
Relplot — Petal
Faceted relplot showing petal relationships broken out per species.
Pairplot — All Features
Pairplot providing a full cross-feature view with per-species KDE on the diagonal.
4.6 | Hypothesis Testing with Z-test
Z-tests were conducted to assess whether mean differences across species for each feature are statistically significant.
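As an illustration of the idea, a two-sample Z-test on the difference in means can be sketched with the standard library alone. The function name `two_sample_z_test` and the sample values are hypothetical, and the exact test variant used in the project is not stated; this version uses sample variances, which is a large-sample approximation.

```python
import math
from statistics import NormalDist, mean, variance

def two_sample_z_test(a, b):
    """Return (z, two-sided p-value) for the difference in means of a and b."""
    standard_error = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    z = (mean(a) - mean(b)) / standard_error
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p

# Made-up petal lengths (cm) for two species, for illustration only
setosa = [1.4, 1.4, 1.3, 1.5, 1.4, 1.7, 1.4, 1.5]
versicolor = [4.7, 4.5, 4.9, 4.0, 4.6, 4.5, 4.7, 3.3]

z, p = two_sample_z_test(setosa, versicolor)
print(f"z = {z:.2f}, p = {p:.3g}")
```

With only a handful of observations per group a t-test would usually be the safer choice; the Z approximation is more reasonable at the 50 records per species available here.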
🤖 Model Development & Evaluation
5.1 | Feature Selection
Feature Importance
ExtraTreesClassifier feature importances confirm petal width and petal length as the dominant predictors, with sepal features contributing far less.
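A minimal sketch of how such importances can be obtained is shown below. The bundled scikit-learn dataset, `n_estimators=200`, and `random_state=0` are assumptions rather than the project's actual settings, so the exact scores will differ.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import ExtraTreesClassifier

iris = load_iris()   # assumption: bundled copy, not the project's CSV
model = ExtraTreesClassifier(n_estimators=200, random_state=0)  # hypothetical settings
model.fit(iris.data, iris.target)

# feature_importances_ is normalized so the scores sum to 1
importances = dict(zip(iris.feature_names, model.feature_importances_))
for name, score in sorted(importances.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {score:.3f}")
```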
5.2 | Data Normalization
Feature values are normalized prior to model training.
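The normalization method is not specified here. One common option is min-max scaling to the [0, 1] range, sketched below with NumPy; the helper name `min_max_scale` and the sample rows are illustrative assumptions.

```python
import numpy as np

def min_max_scale(X):
    """Rescale each column of X to the [0, 1] range."""
    X = np.asarray(X, dtype=float)
    col_min = X.min(axis=0)
    col_range = X.max(axis=0) - col_min
    col_range[col_range == 0] = 1.0   # avoid division by zero for constant columns
    return (X - col_min) / col_range

# Three made-up rows: sepal length, sepal width, petal length, petal width (cm)
X = np.array([[5.1, 3.5, 1.4, 0.2],
              [7.0, 3.2, 4.7, 1.4],
              [6.3, 3.3, 6.0, 2.5]])
X_scaled = min_max_scale(X)
print(X_scaled)
```

In a train/test workflow the minimum and range would normally be computed on the training split only and then reused on the test split, to avoid leaking test information into training.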
5.3 | KNeighborsClassifier
Overfitting / Underfitting Detection — KNN
Overfit/underfit detection across different values of k, with training and testing accuracy tracked.
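The k sweep can be sketched as follows. The 80/20 stratified split, the `random_state`, and the range of k are assumptions, as is the use of scikit-learn's bundled data, so the resulting accuracies are not expected to match the project's figures.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
# Hypothetical split; the project's actual split is not stated
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

results = []   # (k, training accuracy, testing accuracy)
for k in range(1, 16):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    results.append((k, knn.score(X_train, y_train), knn.score(X_test, y_test)))

for k, train_acc, test_acc in results:
    print(f"k={k:2d}  train={train_acc:.3f}  test={test_acc:.3f}")
```

A training accuracy well above the testing accuracy at small k points toward overfitting, while both scores dropping together at large k points toward underfitting.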
Confusion Matrix — KNN
KNN achieves 96.7% accuracy on the test set.
5.4 | DecisionTreeClassifier
Confusion Matrix — Decision Tree
Decision Tree achieves 93.3% accuracy on the Iris test set.
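A confusion matrix of this kind can be produced along the following lines. The split, the `random_state` values, and the default tree settings are assumptions, and scikit-learn's bundled data stands in for the project's file.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
# Hypothetical 80/20 stratified split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

tree = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
cm = confusion_matrix(y_test, tree.predict(X_test))   # rows: true class, columns: predicted

print(cm)
print("accuracy:", cm.trace() / cm.sum())   # correct predictions sit on the diagonal
```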
5.5 | Naïve Bayes
Confusion Matrix — Gaussian Naïve Bayes
Gaussian Naïve Bayes achieves 96.7% accuracy on the test set. Three Naïve Bayes variants were evaluated (Gaussian, Multinomial, and Bernoulli), and all three reached the same score.
Confusion Matrix — Multinomial Naïve Bayes
Multinomial Naïve Bayes achieves 96.7% accuracy on the test set.
Confusion Matrix — Bernoulli Naïve Bayes
Bernoulli Naïve Bayes achieves 96.7% accuracy on the test set.
Naïve Bayes Algorithm Scores
All three Naïve Bayes variants scored identically at 96.7%.
5.6 | Best Model Result
Best Model Result — All Classifiers
KNN and GaussianNB tied for the top spot at 96.7%, a score matched by the other two Naïve Bayes variants, with Decision Tree close behind at 93.3%.
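A side-by-side comparison of the five classifiers can be sketched as below. Everything beyond the classifier names is an assumption: the bundled dataset, the split, the default hyperparameters, and the absence of preprocessing. The scores it prints are therefore not expected to reproduce the figures above.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import BernoulliNB, GaussianNB, MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
# Hypothetical 80/20 stratified split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

models = {
    "KNN": KNeighborsClassifier(),              # default n_neighbors=5
    "DecisionTree": DecisionTreeClassifier(random_state=42),
    "GaussianNB": GaussianNB(),
    "MultinomialNB": MultinomialNB(),           # requires non-negative features
    "BernoulliNB": BernoulliNB(),               # binarizes features at 0.0 by default
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = model.score(X_test, y_test)  # mean accuracy on the test split

for name, acc in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {acc:.3f}")
```

`BernoulliNB` is especially sensitive to preprocessing: with raw centimeter values every feature is above the default threshold, so its result depends on how the features were scaled or which `binarize` value was chosen.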
🎉 Key Insights
- Petal measurements are the strongest separators — petal length and petal width are highly correlated (0.96) and provide the clearest class boundaries across all three species.
- Setosa is perfectly separable from Versicolor and Virginica across all feature combinations, while Versicolor and Virginica show moderate overlap on sepal features.
- The dataset is perfectly balanced: each of the three species accounts for 50 of the 150 records (one third, about 33.33%), removing class imbalance as a concern.
- Winsorization cleaned extreme outlier values while preserving the between-species signal needed for classification.
- KNN and GaussianNB tied at 96.7% test accuracy, with Decision Tree close behind at 93.3%. All three Naïve Bayes variants scored identically.
- Sepal features contribute far less to prediction than petal features, as confirmed by ExtraTreesClassifier feature importance scores.