Machine Learning

Iris Species Classification with Decision Tree

Train a decision tree classifier on the Iris dataset, evaluate accuracy, and visualize the decision boundaries using an AI data analyst.

What

This AI Data Analyst workflow loads the Iris dataset from scikit-learn and creates an 80/20 train-test split. It trains a decision tree classifier, reports accuracy, and generates a classification report. It also plots a confusion matrix and visualizes feature importances as a bar chart.

Who

This is for learners and practitioners who want a compact, reproducible example of multiclass classification with a decision tree. It helps validate model performance with standard metrics and interpret which Iris features drive predictions.

Tools

  • scikit-learn
  • pandas
  • numpy
  • matplotlib
  • seaborn

Outcomes

  • Train/test split with dataset shapes printed
  • Decision tree model trained with accuracy and classification report
  • Confusion matrix heatmap for error inspection
  • Feature importance bar chart highlighting the most influential features
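The outcomes above can be sketched end-to-end in a few lines. This is a minimal standalone version (not the notebook's exact cells, which appear below), assuming scikit-learn's bundled Iris dataset and the same `random_state=42` seed:

```python
# Minimal end-to-end sketch: load Iris, split 80/20, train a
# decision tree, and report test accuracy.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

iris = load_iris(as_frame=True)
X = iris.frame.drop(columns="target")
y = iris.frame["target"]

# Stratified 80/20 split keeps class proportions identical in both sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

model = DecisionTreeClassifier(random_state=42)
model.fit(X_train, y_train)
acc = accuracy_score(y_test, model.predict(X_test))
print(f"Test accuracy: {acc:.4f}")
```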

Quality Score

0/10

gpt-5.2 · mljar-ai

Last scored: Apr 7, 2026

  • Task Completion: 0/2

    Needs work

    No notebook cells, code, or outputs are provided, so none of the required steps (split, training, evaluation, plots) were performed.

  • Execution Correctness: 0/2

    Needs work

    There is no code to assess for correctness or runnability.

  • Output Quality: 0/3

    Needs work

    All expected outputs (shapes, accuracy/report, confusion matrix heatmap, feature-importance bar chart) are missing.

  • Reasoning Quality: 0/2

    Needs work

    No reasoning or explanation is present in the provided content.

  • Reliability: 0/1

    Needs work

    With no evidence of implementation, the workflow cannot be considered robust or reliable.

load iris dataset (from sklearn) and split into train/test sets with 80/20 ratio
Thinking

Following user instructions

I need to comply with the iterative process here. The user asked to load the iris dataset and split it into train/test sets; however, the developer guidelines suggest focusing on one objective at a time in each response. So, my plan is to load the iris dataset first, inspect it, and show its shape, columns, and head. After that, I can address the split in the next step. It’s important to follow the guidelines carefully!

Let’s start by loading the Iris dataset into a DataFrame and checking its structure before splitting.

Load and preview iris data
from sklearn.datasets import load_iris
import pandas as pd
from IPython.display import display

iris = load_iris(as_frame=True)
df = iris.frame.copy()

display(df.head())
df.shape
   sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)  target
0                5.1               3.5                1.4               0.2        0
1                4.9               3.0                1.4               0.2        0
2                4.7               3.2                1.3               0.2        0
3                4.6               3.1                1.5               0.2        0
4                5.0               3.6                1.4               0.2        0
(150, 5)

The dataset in the notebook has 150 rows and 5 columns, with 4 feature columns plus the target column, so it’s ready to split into features and labels.
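The `target` column encodes the species as integers. A quick standalone check (assuming scikit-learn's bundled Iris data) maps the codes to species names and confirms the classes are balanced:

```python
# Map integer target codes to species names and count rows per class.
from sklearn.datasets import load_iris

iris = load_iris(as_frame=True)
df = iris.frame

print(dict(enumerate(iris.target_names)))        # code -> species name
print(df["target"].value_counts().sort_index())  # rows per species
```

Because each species contributes exactly 50 of the 150 rows, a stratified split is easy to verify later.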

Split data into train test
from sklearn.model_selection import train_test_split

X = df.drop(columns='target')
y = df['target']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

{
    "X_train_shape": X_train.shape,
    "X_test_shape": X_test.shape,
    "y_train_shape": y_train.shape,
    "y_test_shape": y_test.shape,
}
{'X_train_shape': (120, 4),
 'X_test_shape': (30, 4),
 'y_train_shape': (120,),
 'y_test_shape': (30,)}

From the notebook outputs, the Iris data is a table with 150 rows and 5 columns: 4 feature columns and 1 target column. The split then separated features into X and labels into y, producing:

  • X_train: 120 rows, 4 columns
  • X_test: 30 rows, 4 columns
  • y_train: 120 rows
  • y_test: 30 rows

That matches an 80/20 train/test split exactly, since 20% of 150 is 30 and the remaining 120 go to training. The use of stratify=y also means the class proportions were preserved across train and test sets.
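The effect of `stratify=y` can be checked directly by counting classes on each side of the split. This standalone sketch reproduces the same 80/20 split with the same seed:

```python
# Verify that stratified splitting preserves class proportions.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris(as_frame=True)
X = iris.frame.drop(columns="target")
y = iris.frame["target"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Each class is 1/3 of the full dataset; stratification keeps that ratio.
print(y_train.value_counts().sort_index())  # expect 40 per class
print(y_test.value_counts().sort_index())   # expect 10 per class
```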

train a decision tree classifier and show accuracy
Thinking

Structuring decision tree training

I’m focusing on delivering an answer step-by-step. I need to train a decision tree and show its accuracy, which seems to be the user's main request. I can probably combine the training and accuracy evaluation process into one code block. I should mention that there are no ID columns included. I'll use sklearn's DecisionTreeClassifier and accuracy_score. It might be a good idea to show train/test accuracy too. Let’s keep it concise and end the code with a final expression.

The split is already ready, and there are no ID-like columns to exclude here, so we can train directly on the 4 Iris features.

Train decision tree classifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

tree = DecisionTreeClassifier(random_state=42)
tree.fit(X_train, y_train)

y_pred = tree.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)

accuracy
0.9333333333333333

The decision tree classifier was trained on the Iris training set and evaluated on the test set. Its observed test accuracy is 0.9333, which means it correctly classified about 93.3% of the test samples.

With 30 test rows, that corresponds to roughly 28 correct predictions out of 30. This shows the model performs well on this split of the Iris dataset.
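The workflow description also promises a classification report, which the transcript does not show. A hedged standalone sketch of how one could be produced with the same split and model:

```python
# Produce per-class precision/recall/F1 alongside overall accuracy.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report

iris = load_iris(as_frame=True)
X = iris.frame.drop(columns="target")
y = iris.frame["target"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

tree = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
report = classification_report(
    y_test, tree.predict(X_test), target_names=iris.target_names
)
print(report)
```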

plot the confusion matrix
Thinking

Preparing to plot confusion matrix

I need to answer by plotting a confusion matrix in a clear, concise manner. I should provide one code block without additional requests. Since the user asked specifically for a confusion matrix, I'll focus on creating that directly using seaborn or matplotlib. I'll utilize the y_test and y_pred for this and maybe compute the confusion matrix to plot a heatmap. Each code block will end with a single expression, like plt.show(), without multiple charts. Let's make sure to include a brief introduction and the code.

Here’s the confusion matrix for the decision tree using the test predictions.

Plot confusion matrix
from sklearn.metrics import confusion_matrix
import seaborn as sns
import matplotlib.pyplot as plt

cm = confusion_matrix(y_test, y_pred)

plt.figure(figsize=(6, 4))
sns.heatmap(
    cm,
    annot=True,
    fmt='d',
    cmap='Blues',
    xticklabels=iris.target_names,
    yticklabels=iris.target_names
)
plt.title('Decision Tree Confusion Matrix')
plt.xlabel('Predicted label')
plt.ylabel('True label')
plt.tight_layout()
plt.show()
Create confusion matrix table
import pandas as pd

pd.DataFrame(
    cm,
    index=iris.target_names,
    columns=iris.target_names
)
            setosa  versicolor  virginica
setosa          10           0          0
versicolor       0           9          1
virginica        0           1          9

The confusion matrix has already been plotted, and the numeric values confirm what the chart shows:

  • Setosa was classified perfectly: 10 correct, 0 mistakes.
  • Versicolor had 9 correct and 1 misclassified as virginica.
  • Virginica had 9 correct and 1 misclassified as versicolor.

So the model is doing very well overall, with errors only between versicolor and virginica, which are the two classes it finds hardest to separate. This matches the observed test accuracy of about 93.3%.
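The per-class figures above can be derived directly from the matrix: each class's recall is its diagonal entry divided by its row total. A standalone sketch, reproducing the same split and model:

```python
# Read per-class recall off the confusion matrix diagonal.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import confusion_matrix

iris = load_iris(as_frame=True)
X = iris.frame.drop(columns="target")
y = iris.frame["target"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
tree = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
cm = confusion_matrix(y_test, tree.predict(X_test))

# Correct predictions per class / true samples per class.
per_class_recall = cm.diagonal() / cm.sum(axis=1)
for name, r in zip(iris.target_names, per_class_recall):
    print(f"{name}: {r:.2f}")
```

The overall accuracy is the trace of the matrix divided by the total test count, which ties the heatmap back to the 93.3% figure.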

show feature importances as a bar chart
Thinking

Creating a feature importance chart

I need to take this step-by-step. The user wants to see feature importances in a bar chart. I have a trained tree model, so I’ll utilize seaborn for the visualization. It's essential to provide only one code block. I’ll include details about using the trained decision tree to plot importances and suggest sorting them. I'll use pd.Series to organize the feature importances and then create a horizontal bar chart with seaborn. Let’s keep the code minimal and focused!

We can use the trained decision tree to plot feature importances as a horizontal bar chart.

Plot feature importances
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

fi = pd.Series(tree.feature_importances_, index=X_train.columns).sort_values()

plt.figure(figsize=(7, 4))
sns.barplot(x=fi.values, y=fi.index, color='steelblue')
plt.title('Decision Tree Feature Importances')
plt.xlabel('Importance')
plt.ylabel('Feature')
plt.tight_layout()
plt.show()
Thinking

Deciding on output format

I need to choose between two paths. The user wants to show feature importances as a bar chart, but the most recent output is visual-only. I realize I shouldn’t go with Path A because that won’t meet the user's goal. Path B makes more sense, as it includes a value table listing insights along with a minimal runnable Python code block to gather more evidence. I’ll need the code to present the sorted importance values to give a clearer interpretation.

Finalizing output format

I need to make sure to end with a single clear expression instead of multiple ones separated by commas. To present the information clearly, I'll use a display format for one table. By focusing on crafting concise and coherent content, I can ensure it meets the user's expectations effectively. This streamlined approach should provide the clarity and simplicity needed for the final output!

The bar chart is in place, so the remaining missing piece is the exact ranking behind it. Since feature importance is a ranking task, the numeric values are needed to interpret which predictors matter most and whether some contribute little or nothing.

Sort feature importances
fi.sort_values(ascending=False).to_frame('importance')
                   importance
petal length (cm)    0.558568
petal width (cm)     0.406015
sepal width (cm)     0.029167
sepal length (cm)    0.006250

The feature importance bar chart shows that the decision tree relies overwhelmingly on the petal measurements.

  • Petal length (cm) is the most important feature at 0.5586
  • Petal width (cm) is next at 0.4060
  • Sepal width (cm) contributes very little at 0.0292
  • Sepal length (cm) is almost unused at 0.0063

This means the chart is telling a clear story: for this iris decision tree, petal length and petal width drive nearly all of the model’s decisions, while the sepal features add very little.
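Decision tree importances are normalized impurity reductions, so they sum to 1, which makes the petal-dominance claim easy to quantify. A standalone sketch with the same split and model:

```python
# Check that importances sum to 1 and measure the petal features' share.
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

iris = load_iris(as_frame=True)
X = iris.frame.drop(columns="target")
y = iris.frame["target"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
tree = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)

fi = pd.Series(tree.feature_importances_, index=X.columns)
fi = fi.sort_values(ascending=False)

print(fi)
print("sum:", fi.sum())                           # normalized, sums to 1
print("petal share:", fi.filter(like="petal").sum())
```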
