Iris Diversity: Analysis, Modeling, Prediction

📅2024
🌸150 Flower Records
🤖4 ML Models
Python, Pandas, Matplotlib, Seaborn, Scikit-learn, Machine Learning, Classification, EDA

🎯 Project Overview

This project focuses on analyzing a dataset of Iris flowers to understand the diversity among different species. By examining various attributes such as sepal length, sepal width, petal length, and petal width, the analysis aims to uncover trends, disparities, and insights into the different Iris species.

The project combines thorough exploratory data analysis with machine learning classifiers (KNN, Decision Tree, and three variants of Naïve Bayes: Gaussian, Multinomial, and Bernoulli) to predict Iris species and to evaluate which features drive prediction performance.

📊 Dataset Description

The dataset comprises 150 rows and 6 columns: an identifier, four measurements in centimeters, and the species label.

#  Column         Description
1  Id             A unique identifier for each Iris flower
2  SepalLengthCm  Length of the sepals in centimeters
3  SepalWidthCm   Width of the sepals in centimeters
4  PetalLengthCm  Length of the petals in centimeters
5  PetalWidthCm   Width of the petals in centimeters
6  Species        Species of the Iris flower (Setosa, Versicolor, Virginica)
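
As a quick check of the layout described above, the sketch below rebuilds an equivalent frame from scikit-learn's bundled copy of the Iris data. The project itself presumably reads a CSV file; the loader, the column renaming, and the way `Id` is constructed here are assumptions made for illustration only.

```python
import pandas as pd
from sklearn.datasets import load_iris

# Assumption: scikit-learn's bundled Iris data stands in for the project's own file.
iris = load_iris(as_frame=True)

df = iris.frame.rename(columns={
    "sepal length (cm)": "SepalLengthCm",
    "sepal width (cm)": "SepalWidthCm",
    "petal length (cm)": "PetalLengthCm",
    "petal width (cm)": "PetalWidthCm",
})

# Replace the integer target with species names and add a 1-based Id column.
df["Species"] = df.pop("target").map(dict(enumerate(iris.target_names)))
df.insert(0, "Id", range(1, len(df) + 1))

print(df.shape)         # (rows, columns)
print(df.dtypes)        # column types
print(df.isna().sum())  # missing values per column
```

Printing the shape, dtypes, and missing-value counts is usually enough to confirm that the table matches the description before any plotting starts.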

Data Quality

📈 Exploratory Data Analysis

4.1 | Individual Variables Analysis

Species Distribution

The dataset is perfectly balanced: each of the three species accounts for 50 of the 150 records (33.33%).
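
The class balance can be verified with a short count. This is a minimal sketch that assumes scikit-learn's bundled Iris data as a stand-in for the project's file; the variable names are illustrative.

```python
import pandas as pd
from sklearn.datasets import load_iris

# Assumption: bundled Iris data; species names come from the loader.
iris = load_iris(as_frame=True)
species = iris.target.map(dict(enumerate(iris.target_names)))

counts = species.value_counts()
shares = species.value_counts(normalize=True).mul(100).round(2)

print(pd.DataFrame({"count": counts, "share_%": shares}))
```

With balanced classes, plain accuracy is a reasonable headline metric for the classifiers evaluated later.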

Histogram — Continuous Features

Histogram distributions of the four continuous features, with density curves overlaid across all species.

KDE — Continuous Features

KDE curves reveal that petal length and petal width show clear bimodal patterns, indicating strong species separation.

4.2 | Outlier Identification

Boxplots — All Features

Box plots exposing the spread and outliers across all four features.
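
Box plots conventionally flag points lying more than 1.5 times the interquartile range beyond the quartiles. The sketch below applies that rule numerically. It assumes the standard 1.5 × IQR whisker convention and scikit-learn's bundled Iris data, neither of which is confirmed by the project itself.

```python
from sklearn.datasets import load_iris

# Assumption: bundled Iris data, with columns renamed to match the table above.
X = load_iris(as_frame=True).data
X.columns = ["SepalLengthCm", "SepalWidthCm", "PetalLengthCm", "PetalWidthCm"]

# Quartiles and the 1.5 * IQR fences, computed per feature.
q1, q3 = X.quantile(0.25), X.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Boolean mask of outlying cells, then a count per feature.
is_outlier = (X < lower) | (X > upper)
outlier_counts = is_outlier.sum()

print(outlier_counts)
```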

4.3 | Handling Outliers

KDE by Species — Post Outlier Treatment

After applying Winsorization, KDE plots by species show how each feature separates across Setosa, Versicolor, and Virginica.
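
Winsorization caps extreme values at chosen percentiles instead of dropping rows. The sketch below clips each feature to its 5th and 95th percentiles. The 5%/95% limits are an assumption for illustration, since the limits used in the project are not stated here.

```python
from sklearn.datasets import load_iris

# Assumption: bundled Iris data, with columns renamed to match the table above.
X = load_iris(as_frame=True).data
X.columns = ["SepalLengthCm", "SepalWidthCm", "PetalLengthCm", "PetalWidthCm"]

# Hypothetical limits: cap each feature at its own 5th and 95th percentile.
lower = X.quantile(0.05)
upper = X.quantile(0.95)

# axis=1 aligns the per-feature limits with the DataFrame columns.
X_wins = X.clip(lower=lower, upper=upper, axis=1)

print(X.describe().loc[["min", "max"]])
print(X_wins.describe().loc[["min", "max"]])
```

Because values are capped rather than removed, the dataset keeps all 150 rows and the class balance is unchanged.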

Violin Grid by Species

Violin plots reinforce the same picture — Setosa is clearly distinct, while Versicolor and Virginica overlap more on sepal features.

4.4 | Pairs of Variables Insights

Multiple Variable Scatter

Scatter plots of petal and sepal dimensions, colored by species, confirm that petal measurements are the strongest separators.

Violin Boxplot Grid

Violin-boxplot overlays per species and feature, combining distribution shape with quartile summaries.

Outlier Analysis

Outlier analysis using regression plots across all features with species as the grouping variable.

4.5 | Multiple Variables Examination

Correlation Heatmap

The correlation heatmap shows that petal length and petal width are highly correlated (0.96), and both correlate strongly with sepal length (0.87 and 0.82 respectively).
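
The values behind a heatmap like this come from a pairwise Pearson correlation matrix. The sketch below computes it on scikit-learn's bundled, untreated Iris data, which is an assumption; the project's heatmap may have been drawn after outlier treatment, so small differences are possible.

```python
from sklearn.datasets import load_iris

# Assumption: bundled Iris data, with columns renamed to match the table above.
X = load_iris(as_frame=True).data
X.columns = ["SepalLengthCm", "SepalWidthCm", "PetalLengthCm", "PetalWidthCm"]

# Pearson correlation between every pair of features.
corr = X.corr()

print(corr.round(2))
```

Passing `corr` to a heatmap function such as seaborn's `heatmap` with annotations enabled would give a figure of the kind described.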

Relplot — Sepal

Faceted relplot showing sepal relationships broken out per species.

Relplot — Petal

Faceted relplot showing petal relationships broken out per species.

Pairplot — All Features

Pairplot providing a full cross-feature view with per-species KDE on the diagonal.

4.6 | Hypothesis Testing with Z-test

Z-tests were conducted to assess whether mean differences across species for each feature are statistically significant.
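
One way to run such a test is a two-sample z-test on each pair of species for each feature. The sketch below is an illustration under stated assumptions: it uses sample variances in the standard error, a two-sided alternative, and scikit-learn's bundled Iris data. The helper `two_sample_ztest` is hypothetical and not taken from the project.

```python
import math
from itertools import combinations

import numpy as np
from sklearn.datasets import load_iris

# Assumption: bundled Iris data, with columns renamed to match the table above.
iris = load_iris(as_frame=True)
X = iris.data
X.columns = ["SepalLengthCm", "SepalWidthCm", "PetalLengthCm", "PetalWidthCm"]
species = iris.target.map(dict(enumerate(iris.target_names)))


def two_sample_ztest(a, b):
    """Two-sided z-test for a difference in means (hypothetical helper)."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    se = math.sqrt(a.var(ddof=1) / a.size + b.var(ddof=1) / b.size)
    z = (a.mean() - b.mean()) / se
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided p-value under N(0, 1)
    return z, p


results = {}
for feature in X.columns:
    for s1, s2 in combinations(species.unique(), 2):
        z, p = two_sample_ztest(X.loc[species == s1, feature],
                                X.loc[species == s2, feature])
        results[(feature, s1, s2)] = (z, p)
        print(f"{feature:14s} {s1:>10s} vs {s2:<10s} z = {z:8.2f}  p = {p:.3g}")
```

With 50 observations per species the normal approximation is reasonable. Running twelve comparisons at once would normally call for a multiple-comparison correction, which this sketch leaves out.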

🤖 Model Development & Evaluation

5.1 | Feature Selection

Feature Importance

ExtraTreesClassifier feature importances confirm petal width and petal length as the dominant predictors, with sepal features contributing far less.
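
A ranking of this kind can be obtained from the `feature_importances_` attribute of a fitted `ExtraTreesClassifier`. In the sketch below, the number of trees and the random seed are assumptions, and scikit-learn's bundled Iris data stands in for the project's file.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.ensemble import ExtraTreesClassifier

# Assumption: bundled Iris data, with columns renamed to match the table above.
iris = load_iris(as_frame=True)
X, y = iris.data, iris.target
X.columns = ["SepalLengthCm", "SepalWidthCm", "PetalLengthCm", "PetalWidthCm"]

# Hypothetical settings: 200 trees and a fixed seed for repeatability.
model = ExtraTreesClassifier(n_estimators=200, random_state=42)
model.fit(X, y)

# Impurity-based importances, normalized to sum to 1.
importances = (
    pd.Series(model.feature_importances_, index=X.columns)
    .sort_values(ascending=False)
)
print(importances.round(3))
```

Impurity-based importances can vary with the seed and tend to split credit between correlated features, so the ordering of petal length and petal width may not be stable from run to run.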

5.2 | Data Normalization

Feature values are normalized prior to model training.
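
The scaler used is not named here, so the sketch below assumes min-max scaling to the [0, 1] range. It also assumes a stratified 20% hold-out, which is consistent with the reported accuracies (96.7% and 93.3% correspond to 29 and 28 correct out of 30). The scaler is fitted on the training split only, so no information from the test set leaks into the preprocessing.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

X, y = load_iris(return_X_y=True)

# Assumption: 80/20 stratified split with a fixed seed.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Assumption: min-max normalization; learn min and max from training data only.
scaler = MinMaxScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s = scaler.transform(X_test)

print(X_train_s.min(axis=0), X_train_s.max(axis=0))
print(X_test_s.min(axis=0), X_test_s.max(axis=0))
```

Test-set values can fall slightly outside [0, 1] because the scaler never saw them during fitting. Distance-based models such as KNN benefit most from this step; a Decision Tree is unaffected by monotonic rescaling.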

5.3 | KNeighborsClassifier

Overfitting / Underfitting Detection — KNN

Overfit/underfit detection across different values of k, with training and testing accuracy tracked.
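
A sweep of this kind can be sketched as a loop over k that records accuracy on both splits. The range of k, the split, and the scaler below are assumptions, not the project's confirmed settings.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import MinMaxScaler

X, y = load_iris(return_X_y=True)

# Assumption: 80/20 stratified split and min-max scaling fitted on the training set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
scaler = MinMaxScaler().fit(X_train)
X_train_s, X_test_s = scaler.transform(X_train), scaler.transform(X_test)

# Hypothetical search range: k = 1..20.
ks = list(range(1, 21))
train_acc, test_acc = [], []
for k in ks:
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train_s, y_train)
    train_acc.append(knn.score(X_train_s, y_train))
    test_acc.append(knn.score(X_test_s, y_test))

for k, tr, te in zip(ks, train_acc, test_acc):
    print(f"k={k:2d}  train={tr:.3f}  test={te:.3f}")
```

A large gap between training and testing accuracy at small k suggests overfitting, while both curves falling together at large k suggests underfitting. Plotting the two lists against `ks` gives a chart of the kind described.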

Confusion Matrix — KNN

KNN achieves 96.7% accuracy on the test set.
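
A confusion matrix and the accuracy derived from it can be computed as sketched below. The choice of k = 5, the split, and the scaler are assumptions, so the numbers printed need not match the reported figure.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import MinMaxScaler

X, y = load_iris(return_X_y=True)

# Assumption: 80/20 stratified split and min-max scaling fitted on the training set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
scaler = MinMaxScaler().fit(X_train)

# Assumption: k = 5 (scikit-learn's default).
knn = KNeighborsClassifier(n_neighbors=5).fit(scaler.transform(X_train), y_train)
y_pred = knn.predict(scaler.transform(X_test))

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_test, y_pred)
acc = accuracy_score(y_test, y_pred)

print(cm)
print(f"accuracy = {acc:.3f}")
```

Accuracy equals the sum of the diagonal divided by the total count, so every off-diagonal entry is one misclassified flower.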

5.4 | DecisionTreeClassifier

Confusion Matrix — Decision Tree

Decision Tree classification results on the Iris test set.

5.5 | Naïve Bayes

Confusion Matrix — Gaussian Naïve Bayes

Three Naïve Bayes variants were evaluated — Gaussian, Multinomial, and Bernoulli — all achieving 96.7% accuracy.

Confusion Matrix — Multinomial Naïve Bayes

Multinomial Naïve Bayes achieves 96.7% accuracy on the test set.

Confusion Matrix — Bernoulli Naïve Bayes

Bernoulli Naïve Bayes achieves 96.7% accuracy on the test set.

Naïve Bayes Algorithm Scores

All three Naïve Bayes variants scored identically at 96.7%.
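
The three variants can be compared on one split as sketched below. The split, the scaler, and the default hyperparameters are assumptions, so the scores printed here are not expected to reproduce the reported 96.7%. The variants also make different input assumptions: `MultinomialNB` needs non-negative features, and `BernoulliNB` thresholds each feature (its `binarize` parameter, 0.0 by default) before fitting.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import BernoulliNB, GaussianNB, MultinomialNB
from sklearn.preprocessing import MinMaxScaler

X, y = load_iris(return_X_y=True)

# Assumption: 80/20 stratified split and min-max scaling fitted on the training set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
scaler = MinMaxScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
# Clip so unseen test values cannot go negative (MultinomialNB expects counts >= 0).
X_test_s = np.clip(scaler.transform(X_test), 0.0, 1.0)

# Assumption: default hyperparameters for all three variants.
variants = {
    "GaussianNB": GaussianNB(),
    "MultinomialNB": MultinomialNB(),
    "BernoulliNB": BernoulliNB(),
}

scores = {}
for name, model in variants.items():
    model.fit(X_train_s, y_train)
    scores[name] = model.score(X_test_s, y_test)
    print(f"{name:14s} accuracy = {scores[name]:.3f}")
```

Of the three, only the Gaussian variant models continuous measurements directly, so results for the other two depend heavily on how the features are scaled or binarized.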

5.6 | Best Model Result

Best Model Result — All Classifiers

KNN and GaussianNB tied for the top spot at 96.7% (the score also reported for the Multinomial and Bernoulli variants), with Decision Tree close behind at 93.3%.
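
A side-by-side ranking like this can be produced by scoring each model on the same hold-out set. The sketch below assumes an 80/20 stratified split, min-max scaling for KNN and GaussianNB, and default hyperparameters, none of which are confirmed project settings.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Assumption: 80/20 stratified split with a fixed seed.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Pipelines keep the scaler fitted on training data only.
models = {
    "KNN": make_pipeline(MinMaxScaler(), KNeighborsClassifier(n_neighbors=5)),
    "DecisionTree": DecisionTreeClassifier(random_state=42),
    "GaussianNB": make_pipeline(MinMaxScaler(), GaussianNB()),
}

scores = pd.Series(
    {name: m.fit(X_train, y_train).score(X_test, y_test) for name, m in models.items()}
).sort_values(ascending=False)

print(scores.round(3))
print(f"one test sample = {1 / len(y_test):.3f} accuracy")
```

On a 30-sample test set each misclassified flower moves accuracy by about 3.3 points, so a gap such as 96.7% versus 93.3% amounts to a single flower and should be read with caution.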

🎉 Key Insights