Iris Species Classification with Decision Tree

Train a decision tree classifier on the Iris dataset, evaluate accuracy, and visualize the decision boundaries using an AI data analyst.

This page shows an AI data analysis example. A workflow is a list of prompts (instructions) used to complete one analysis task step by step. We send the same prompt list to multiple AI models, then compare their notebook outputs and scores.

What this AI workflow does

This AI Data Analyst workflow loads the Iris dataset from scikit-learn and creates an 80/20 train-test split. It trains a decision tree classifier, reports accuracy, and generates a classification report. It also plots a confusion matrix and visualizes feature importances as a bar chart.

Who this example is for

This is for learners and practitioners who want a compact, reproducible example of multiclass classification with a decision tree. It helps validate model performance with standard metrics and interpret which Iris features drive predictions.

Expected analysis outcomes

These are the results the AI workflow is expected to generate.

  • Train/test split with dataset shapes printed
  • Decision tree model trained with accuracy and classification report
  • Confusion matrix heatmap for error inspection
  • Feature importance bar chart highlighting the most influential features
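The outcomes above can be reproduced outside the workflow with a compact script. This is a minimal sketch (plots omitted, `random_state=42` assumed to match the run shown later on this page) using the same scikit-learn pieces:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, confusion_matrix

# Load Iris and make a stratified 80/20 split (seed fixed for reproducibility)
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Train the decision tree and score it on the held-out 20%
model = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
y_pred = model.predict(X_test)

print("accuracy:", accuracy_score(y_test, y_pred))
print("confusion matrix:\n", confusion_matrix(y_test, y_pred))
print("feature importances:", model.feature_importances_)
```

With 150 samples, the split yields 120 training and 30 test rows, matching the shapes reported in the workflow output.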

Tools and libraries used

Main Python packages and tooling used to run this AI data analysis task.

  • scikit-learn
  • pandas
  • numpy
  • matplotlib
  • seaborn

Prompt sequence

This is the exact list of prompts used in this workflow. The same prompt sequence is sent to each model so outputs and scores can be compared fairly.

  1. load iris dataset (from sklearn) and split into train/test sets with 80/20 ratio
  2. train a decision tree classifier and show accuracy
  3. plot the confusion matrix
  4. show feature importances as a bar chart

Model Comparison

We compared several LLM models on the same analysis task. The table below shows their scores side by side. You can also open each model run to check the full conversation and notebook results.

  Model            | Score | Task | Execution | Output | Reasoning | Reliability | Run
  gemma4:31b       | 10/10 | 2/2  | 2/2       | 3/3    | 2/2       | 1/1         | Open gemma4:31b conversation
  glm-5.1          | 10/10 | 2/2  | 2/2       | 3/3    | 2/2       | 1/1         | Open glm-5.1 conversation
  gpt-5.4          | 10/10 | 2/2  | 2/2       | 3/3    | 2/2       | 1/1         | Open gpt-5.4 conversation
  gpt-oss:120b     | 10/10 | 2/2  | 2/2       | 3/3    | 2/2       | 1/1         | Open gpt-oss:120b conversation
  qwen3-coder-next | 10/10 | 2/2  | 2/2       | 3/3    | 2/2       | 1/1         | Open qwen3-coder-next conversation
  qwen3.5:397b     | 10/10 | 2/2  | 2/2       | 3/3    | 2/2       | 1/1         | Open qwen3.5:397b conversation

How scoring works

Each run is graded on five dimensions:

  • Task Completion (0-2): how fully the workflow answered the requested analysis task and completed required steps.
  • Execution Correctness (0-2): whether generated code is correct, consistent, and likely runnable without major fixes.
  • Output Quality (0-3): how well tables, charts, and results match the expected analytical outcomes.
  • Reasoning Quality (0-2): clarity and correctness of interpretation, explanation, and conclusions.
  • Reliability (0-1): robustness of the run, including low hallucination risk and stable behavior.

The final score is the sum of these dimensions, so total score = 0-10. Higher values indicate more complete, correct, and reliable notebook results for this workflow.
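As a quick illustration of the arithmetic, using the per-dimension scores from a perfect run in the table above, the total is simply the sum of the five dimensions:

```python
# Dimension scores for one run (maximums are 2, 2, 3, 2, 1)
scores = {
    "task_completion": 2,
    "execution_correctness": 2,
    "output_quality": 3,
    "reasoning_quality": 2,
    "reliability": 1,
}

total = sum(scores.values())
print(f"total: {total}/10")  # → total: 10/10
```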

AI Analysis with gpt-5.4

This run uses AI with a step-by-step list of prompts to perform the analysis task Iris Species Classification with Decision Tree. Below, you can review the full conversation generated with gpt-5.4. After the conversation, we show detailed score cards for this run.

load iris dataset (from sklearn) and split into train/test sets with 80/20 ratio
Thinking

Crafting a concise response

I think the user wants to load the iris dataset and split it, so I need to do this step by step. Since they’re asking for both actions together, I can provide a single code block. I should load the dataset and perform the split using sklearn's load_iris and train_test_split, while mentioning stratification as it’s likely needed. I’ll keep the code minimal and focus on displaying the shapes at the end without using multiple expressions. Let’s make this concise!

Here’s a single step that loads the Iris dataset from sklearn and creates an 80/20 train/test split, using stratification to preserve class balance.

Split iris train test
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
import pandas as pd

iris = load_iris(as_frame=True)

X = iris.data
y = iris.target

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,
    random_state=42,
    stratify=y
)

pd.DataFrame({
    "set": ["X_train", "X_test", "y_train", "y_test"],
    "rows": [X_train.shape[0], X_test.shape[0], y_train.shape[0], y_test.shape[0]],
    "cols": [X_train.shape[1], X_test.shape[1], 1, 1]
})
set rows cols
0 X_train 120 4
1 X_test 30 4
2 y_train 120 1
3 y_test 30 1

The observed output shows the Iris data was split into train and test sets exactly as requested with an 80/20 ratio:

  • X_train: 120 rows, 4 feature columns
  • X_test: 30 rows, 4 feature columns
  • y_train: 120 target values
  • y_test: 30 target values

Since the full dataset has 150 samples, this corresponds to:

  • 80% training data: 120 samples
  • 20% test data: 30 samples

This means the dataset is loaded and separated into features (X) and target (y), then split correctly into training and testing subsets.

train a decision tree classifier and show accuracy
Thinking

Planning code for decision tree

I need to focus on just one step for my response. I have the state split done and now need to train the decision tree classifier and show its accuracy, all in a single code block. I have to be sure to exclude any identifier or leakage columns before training, though it seems there aren't any here.

I'll probably import DecisionTreeClassifier and accuracy_score, then fit on X_train and y_train, predict, and maybe create a table for scalar accuracy. To keep it minimal, I might just need a one-row table displaying accuracy at the end. Let's aim for a concise explanation followed by the code!

Next step is to fit a decision tree on the training set and evaluate it on the test set. There are no obvious ID/leakage columns in Iris, so we can use the four feature columns directly.

Train decision tree classifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
import pandas as pd

model = DecisionTreeClassifier(random_state=42)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)

pd.DataFrame({"metric": ["accuracy"], "value": [accuracy]})
metric value
0 accuracy 0.933333

The decision tree classifier was trained on the Iris training set and evaluated on the held-out test set.

The observed test accuracy is 0.9333, which means the model correctly classified about 93.3% of the test samples. On a 30-row test set, that corresponds to 28 correct predictions out of 30.

So the model performs very well on this split of the Iris dataset.

plot the confusion matrix
Thinking

Creating a confusion matrix

I need to respond with one code block that uses seaborn or matplotlib defaults. Since we have the notebook state ready, I should probably include a brief mention about it. I’ll focus on creating just one chart using the existing y_test and y_pred. Importing ConfusionMatrixDisplay or confusion_matrix from sklearn.metrics seems like a good step, and I think using a seaborn heatmap is acceptable. I’ll keep it concise!

A confusion matrix will show where the decision tree is making the 2 mistakes on the test set.

Plot confusion matrix
from sklearn.metrics import confusion_matrix
import seaborn as sns
import matplotlib.pyplot as plt

cm = confusion_matrix(y_test, y_pred)

plt.figure(figsize=(6, 4))
sns.heatmap(
    cm,
    annot=True,
    fmt="d",
    cmap="Blues",
    xticklabels=iris.target_names,
    yticklabels=iris.target_names
)
plt.title("Decision Tree Confusion Matrix")
plt.xlabel("Predicted label")
plt.ylabel("True label")
plt.tight_layout()

The confusion matrix heatmap rendered, and with the earlier accuracy of about 93.3%, it likely contains only a small number of errors. To make the plot interpretable in text, here are the exact counts behind it.

Display confusion matrix
import pandas as pd

display(pd.DataFrame(cm, index=iris.target_names, columns=iris.target_names))
setosa versicolor virginica
setosa 10 0 0
versicolor 0 9 1
virginica 0 1 9

The confusion matrix shows the model is performing very well on the Iris test set.

  • Setosa: all 10 were classified correctly.
  • Versicolor: 9 were correct, 1 was predicted as virginica.
  • Virginica: 9 were correct, 1 was predicted as versicolor.

So the only confusion is between versicolor and virginica, which is common because those two classes are more similar than setosa. This matches the observed test accuracy of about 93.3%.
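The per-class picture described above can be recomputed directly from the confusion-matrix counts; this small sketch copies the matrix values from the run output:

```python
import numpy as np

# Confusion matrix from the run: rows = true class, columns = predicted class
cm = np.array([
    [10, 0, 0],   # setosa
    [0,  9, 1],   # versicolor
    [0,  1, 9],   # virginica
])
classes = ["setosa", "versicolor", "virginica"]

# Per-class recall: correct predictions (diagonal) / true count (row sum)
recall = cm.diagonal() / cm.sum(axis=1)
for name, r in zip(classes, recall):
    print(f"{name}: {r:.3f}")

# Overall accuracy: diagonal sum / total samples = 28/30 ≈ 0.933
print("accuracy:", cm.trace() / cm.sum())
```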

show feature importances as a bar chart

The confusion matrix is done, so the next useful view is which features the decision tree relied on most. Here’s a simple bar chart of the model’s feature importances.

Plot feature importances
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

feat_imp = (
    pd.Series(model.feature_importances_, index=X_train.columns)
    .sort_values(ascending=False)
    .reset_index()
)
feat_imp.columns = ["feature", "importance"]

plt.figure(figsize=(7, 4))
sns.barplot(data=feat_imp, x="importance", y="feature", palette="Blues_r")
plt.title("Decision Tree Feature Importances")
plt.xlabel("Importance")
plt.ylabel("Feature")
plt.tight_layout()
(seaborn FutureWarning: passing `palette` without assigning `hue` is deprecated and will be removed in v0.14.0; assign the `y` variable to `hue` and set `legend=False` for the same effect.)
Thinking

Evaluating code display

I need to figure out how to show rankings based on visible chart bars rather than the text itself. I should take a different path this time by using the display(feat_imp) method. It’s important to keep the code minimal and distinct from the previous steps, so I’ll ensure that it only includes one code block that concludes with a bare expression. I’ll avoid using print and stick with just one table in the output.

The bar chart was created, so the missing piece is the exact importance values behind the ranking. Showing the table makes the chart interpretable and confirms which features matter most.

Display feature importance
display(feat_imp)
feature importance
0 petal length (cm) 0.558568
1 petal width (cm) 0.406015
2 sepal width (cm) 0.029167
3 sepal length (cm) 0.006250

The feature importance bar chart shows a very clear ranking:

  • Petal length (cm) is the most important feature at 0.558568
  • Petal width (cm) is next at 0.406015
  • Sepal width (cm) contributes only 0.029167
  • Sepal length (cm) contributes very little at 0.006250

This means the decision tree is relying almost entirely on the petal measurements to classify iris species, while the sepal measurements add very little. That pattern fits the model’s behavior well: most of the predictive signal is coming from petal size rather than sepal size.

Score cards for gpt-5.4 for Iris Species Classification with Decision Tree

Total score: 10/10

  • Task Completion: 2/2

    Excellent

    All requested major steps were completed: train/test split, decision tree training with accuracy, confusion matrix, and feature importance bar chart.

  • Execution Correctness: 2/2

    Excellent

    The code is consistent and likely runnable; the outputs match the code structure and show no major execution errors.

  • Output Quality: 3/3

    Excellent

    Outputs match the expected outcomes well: correct 80/20 split, accuracy of 0.9333, confusion matrix, and feature importances showing petal features dominate.

  • Reasoning Quality: 2/2

    Excellent

    The explanations are clear and correctly interpret the results, including the class confusions and why petal features matter most.

  • Reliability: 1/1

    Excellent

    The workflow is internally consistent and uses standard sklearn/seaborn methods without hallucinated steps.

Try MLJAR Studio

Run the same type of AI-powered data analysis on your own datasets with conversational notebooks in MLJAR Studio.

Explore More AI Analysis Examples

Discover additional workflows across categories. Each example includes prompts, conversation outputs, and model-level scoring so you can compare approaches and results.