Machine Learning

Iris Species Classification with Decision Tree

Train a decision tree classifier on the Iris dataset, evaluate accuracy, and visualize the decision boundaries using an AI data analyst.

What

This AI Data Analyst workflow loads the Iris dataset from scikit-learn and creates an 80/20 train-test split. It trains a decision tree classifier, reports accuracy, and generates a classification report. It also plots a confusion matrix and visualizes feature importances as a bar chart.

Who

This is for learners and practitioners who want a compact, reproducible example of multiclass classification with a decision tree. It helps validate model performance with standard metrics and interpret which Iris features drive predictions.

Tools

  • scikit-learn
  • pandas
  • numpy
  • matplotlib
  • seaborn

Outcomes

  • Train/test split with dataset shapes printed
  • Decision tree model trained with accuracy and classification report
  • Confusion matrix heatmap for error inspection
  • Feature importance bar chart highlighting the most influential features
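The outcomes above can be sketched end-to-end in a few lines. This is a minimal standalone version (not the notebook's exact cells, which appear below), assuming scikit-learn's bundled Iris dataset and the same `random_state=42` seed:

```python
# Minimal end-to-end sketch: load Iris, split 80/20, train a
# decision tree, and report test accuracy.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

iris = load_iris(as_frame=True)
X = iris.frame.drop(columns="target")
y = iris.frame["target"]

# Stratified 80/20 split keeps class proportions identical in both sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

model = DecisionTreeClassifier(random_state=42)
model.fit(X_train, y_train)
acc = accuracy_score(y_test, model.predict(X_test))
print(f"Test accuracy: {acc:.4f}")
```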

Quality Score

0/10

gpt-5.2 · mljar-ai

Last scored: Apr 7, 2026

  • Task Completion: 0/2

    Needs work

    No notebook cells, code, or outputs are provided, so none of the required steps (split, training, evaluation, plots) were performed.

  • Execution Correctness: 0/2

    Needs work

    There is no code to assess for correctness or runnability.

  • Output Quality: 0/3

    Needs work

    All expected outputs (shapes, accuracy/report, confusion matrix heatmap, feature-importance bar chart) are missing.

  • Reasoning Quality: 0/2

    Needs work

    No reasoning or explanation is present in the provided content.

  • Reliability: 0/1

    Needs work

    With no evidence of implementation, the workflow cannot be considered robust or reliable.

load iris dataset (from sklearn) and split into train/test sets with 80/20 ratio
Thinking

Following user instructions

I need to comply with the iterative process here. The user asked to load the iris dataset and split it into train/test sets; however, the developer guidelines suggest focusing on one objective at a time in each response. So, my plan is to load the iris dataset first, inspect it, and show its shape, columns, and head. After that, I can address the split in the next step. It’s important to follow the guidelines carefully!

Let’s start by loading the Iris dataset into a DataFrame and checking its structure before splitting.

Load and preview iris data
from sklearn.datasets import load_iris
import pandas as pd
from IPython.display import display

iris = load_iris(as_frame=True)
df = iris.frame.copy()

display(df.head())
df.shape
   sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)  target
0                5.1               3.5                1.4               0.2        0
1                4.9               3.0                1.4               0.2        0
2                4.7               3.2                1.3               0.2        0
3                4.6               3.1                1.5               0.2        0
4                5.0               3.6                1.4               0.2        0
(150, 5)

The dataset in the notebook has 150 rows and 5 columns, with 4 feature columns plus the target column, so it’s ready to split into features and labels.
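The `target` column encodes the species as integers. A quick standalone check (assuming scikit-learn's bundled Iris data) maps the codes to species names and confirms the classes are balanced:

```python
# Map integer target codes to species names and count rows per class.
from sklearn.datasets import load_iris

iris = load_iris(as_frame=True)
df = iris.frame

print(dict(enumerate(iris.target_names)))        # code -> species name
print(df["target"].value_counts().sort_index())  # rows per species
```

Because each species contributes exactly 50 of the 150 rows, a stratified split is easy to verify later.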

Split data into train test
from sklearn.model_selection import train_test_split

X = df.drop(columns='target')
y = df['target']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

{
    "X_train_shape": X_train.shape,
    "X_test_shape": X_test.shape,
    "y_train_shape": y_train.shape,
    "y_test_shape": y_test.shape,
}
{'X_train_shape': (120, 4),
 'X_test_shape': (30, 4),
 'y_train_shape': (120,),
 'y_test_shape': (30,)}

From the notebook outputs, the Iris data is a table with 150 rows and 5 columns: 4 feature columns and 1 target column. The split then separated features into X and labels into y, producing:

  • X_train: 120 rows, 4 columns
  • X_test: 30 rows, 4 columns
  • y_train: 120 rows
  • y_test: 30 rows

That matches an 80/20 train/test split exactly, since 20% of 150 is 30 and the remaining 120 go to training. The use of stratify=y also means the class proportions were preserved across train and test sets.
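The effect of `stratify=y` can be checked directly by counting classes on each side of the split. This standalone sketch reproduces the same 80/20 split with the same seed:

```python
# Verify that stratified splitting preserves class proportions.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris(as_frame=True)
X = iris.frame.drop(columns="target")
y = iris.frame["target"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Each class is 1/3 of the full dataset; stratification keeps that ratio.
print(y_train.value_counts().sort_index())  # expect 40 per class
print(y_test.value_counts().sort_index())   # expect 10 per class
```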

train a decision tree classifier and show accuracy
Thinking

Structuring decision tree training

I’m focusing on delivering an answer step-by-step. I need to train a decision tree and show its accuracy, which seems to be the user's main request. I can probably combine the training and accuracy evaluation process into one code block. I should mention that there are no ID columns included. I'll use sklearn's DecisionTreeClassifier and accuracy_score. It might be a good idea to show train/test accuracy too. Let’s keep it concise and end the code with a final expression.

The split is already ready, and there are no ID-like columns to exclude here, so we can train directly on the 4 Iris features.

Train decision tree classifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

tree = DecisionTreeClassifier(random_state=42)
tree.fit(X_train, y_train)

y_pred = tree.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)

accuracy
0.9333333333333333

The decision tree classifier was trained on the Iris training set and evaluated on the test set. Its observed test accuracy is 0.9333, which means it correctly classified about 93.3% of the test samples.

With 30 test rows, that corresponds to roughly 28 correct predictions out of 30. This shows the model performs well on this split of the Iris dataset.
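The workflow description also promises a classification report, which the transcript does not show. A hedged standalone sketch of how one could be produced with the same split and model:

```python
# Produce per-class precision/recall/F1 alongside overall accuracy.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report

iris = load_iris(as_frame=True)
X = iris.frame.drop(columns="target")
y = iris.frame["target"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

tree = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
report = classification_report(
    y_test, tree.predict(X_test), target_names=iris.target_names
)
print(report)
```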

plot the confusion matrix
Thinking

Preparing to plot confusion matrix

I need to answer by plotting a confusion matrix in a clear, concise manner. I should provide one code block without additional requests. Since the user asked specifically for a confusion matrix, I'll focus on creating that directly using seaborn or matplotlib. I'll utilize the y_test and y_pred for this and maybe compute the confusion matrix to plot a heatmap. Each code block will end with a single expression, like plt.show(), without multiple charts. Let's make sure to include a brief introduction and the code.

Here’s the confusion matrix for the decision tree using the test predictions.

Plot confusion matrix
from sklearn.metrics import confusion_matrix
import seaborn as sns
import matplotlib.pyplot as plt

cm = confusion_matrix(y_test, y_pred)

plt.figure(figsize=(6, 4))
sns.heatmap(
    cm,
    annot=True,
    fmt='d',
    cmap='Blues',
    xticklabels=iris.target_names,
    yticklabels=iris.target_names
)
plt.title('Decision Tree Confusion Matrix')
plt.xlabel('Predicted label')
plt.ylabel('True label')
plt.tight_layout()
plt.show()
Create confusion matrix table
import pandas as pd

pd.DataFrame(
    cm,
    index=iris.target_names,
    columns=iris.target_names
)
            setosa  versicolor  virginica
setosa          10           0          0
versicolor       0           9          1
virginica        0           1          9

The confusion matrix has already been plotted, and the numeric values confirm what the chart shows:

  • Setosa was classified perfectly: 10 correct, 0 mistakes.
  • Versicolor had 9 correct and 1 misclassified as virginica.
  • Virginica had 9 correct and 1 misclassified as versicolor.

So the model is doing very well overall, with errors only between versicolor and virginica, which are the two classes it finds hardest to separate. This matches the observed test accuracy of about 93.3%.
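The per-class figures above can be derived directly from the matrix: each class's recall is its diagonal entry divided by its row total. A standalone sketch, reproducing the same split and model:

```python
# Read per-class recall off the confusion matrix diagonal.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import confusion_matrix

iris = load_iris(as_frame=True)
X = iris.frame.drop(columns="target")
y = iris.frame["target"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
tree = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
cm = confusion_matrix(y_test, tree.predict(X_test))

# Correct predictions per class / true samples per class.
per_class_recall = cm.diagonal() / cm.sum(axis=1)
for name, r in zip(iris.target_names, per_class_recall):
    print(f"{name}: {r:.2f}")
```

The overall accuracy is the trace of the matrix divided by the total test count, which ties the heatmap back to the 93.3% figure.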

show feature importances as a bar chart
Thinking

Creating a feature importance chart

I need to take this step-by-step. The user wants to see feature importances in a bar chart. I have a trained tree model, so I’ll utilize seaborn for the visualization. It's essential to provide only one code block. I’ll include details about using the trained decision tree to plot importances and suggest sorting them. I'll use pd.Series to organize the feature importances and then create a horizontal bar chart with seaborn. Let’s keep the code minimal and focused!

We can use the trained decision tree to plot feature importances as a horizontal bar chart.

Plot feature importances
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

fi = pd.Series(tree.feature_importances_, index=X_train.columns).sort_values()

plt.figure(figsize=(7, 4))
sns.barplot(x=fi.values, y=fi.index, color='steelblue')
plt.title('Decision Tree Feature Importances')
plt.xlabel('Importance')
plt.ylabel('Feature')
plt.tight_layout()
plt.show()
Thinking

Deciding on output format

I need to choose between two paths. The user wants to show feature importances as a bar chart, but the most recent output is visual-only. I realize I shouldn’t go with Path A because that won’t meet the user's goal. Path B makes more sense, as it includes a value table listing insights along with a minimal runnable Python code block to gather more evidence. I’ll need the code to present the sorted importance values to give a clearer interpretation.

Finalizing output format

I need to make sure to end with a single clear expression instead of multiple ones separated by commas. To present the information clearly, I'll use a display format for one table. By focusing on crafting concise and coherent content, I can ensure it meets the user's expectations effectively. This streamlined approach should provide the clarity and simplicity needed for the final output!

The bar chart is in place, so the remaining missing piece is the exact ranking behind it. Since feature importance is a ranking task, the numeric values are needed to interpret which predictors matter most and whether some contribute little or nothing.

Sort feature importances
fi.sort_values(ascending=False).to_frame('importance')
                   importance
petal length (cm)    0.558568
petal width (cm)     0.406015
sepal width (cm)     0.029167
sepal length (cm)    0.006250

The feature importance bar chart shows that the decision tree relies overwhelmingly on the petal measurements.

  • Petal length (cm) is the most important feature at 0.5586
  • Petal width (cm) is next at 0.4060
  • Sepal width (cm) contributes very little at 0.0292
  • Sepal length (cm) is almost unused at 0.0063

This means the chart is telling a clear story: for this iris decision tree, petal length and petal width drive nearly all of the model’s decisions, while the sepal features add very little.
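Decision tree importances are normalized impurity reductions, so they sum to 1, which makes the petal-dominance claim easy to quantify. A standalone sketch with the same split and model:

```python
# Check that importances sum to 1 and measure the petal features' share.
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

iris = load_iris(as_frame=True)
X = iris.frame.drop(columns="target")
y = iris.frame["target"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
tree = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)

fi = pd.Series(tree.feature_importances_, index=X.columns)
fi = fi.sort_values(ascending=False)

print(fi)
print("sum:", fi.sum())                           # normalized, sums to 1
print("petal share:", fi.filter(like="petal").sum())
```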
