Iris Species Classification with Decision Tree

Train a decision tree classifier on the Iris dataset, evaluate accuracy, and visualize the decision boundaries using an AI data analyst.

This page shows an AI data analysis example. A workflow is a list of prompts (instructions) used to complete one analysis task step by step. We send the same prompt list to multiple AI models, then compare their notebook outputs and scores.

What this AI workflow does

This AI Data Analyst workflow loads the Iris dataset from scikit-learn and creates an 80/20 train-test split. It trains a decision tree classifier, reports accuracy, and generates a classification report. It also plots a confusion matrix and visualizes feature importances as a bar chart.
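
The workflow described above can be condensed into a few lines of scikit-learn code. This is a minimal sketch of the same steps (load, split, train, evaluate), not the exact code the models generated; the `random_state=42` and stratified split are assumptions chosen for reproducibility:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Load the Iris data and make an 80/20 stratified split
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Train the decision tree and evaluate on the held-out 20%
clf = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
y_pred = clf.predict(X_test)

print(accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
```

The confusion-matrix heatmap and feature-importance bar chart shown later in this page build on `confusion_matrix(y_test, y_pred)` and `clf.feature_importances_` respectively.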

Who this example is for

This is for learners and practitioners who want a compact, reproducible example of multiclass classification with a decision tree. It helps validate model performance with standard metrics and interpret which Iris features drive predictions.

Expected analysis outcomes

These are the results the AI workflow is expected to generate.

  • Train/test split with dataset shapes printed
  • Decision tree model trained with accuracy and classification report
  • Confusion matrix heatmap for error inspection
  • Feature importance bar chart highlighting the most influential features

Tools and libraries used

Main Python packages and tooling used to run this AI data analysis task.

  • scikit-learn
  • pandas
  • numpy
  • matplotlib
  • seaborn

Prompt sequence

This is the exact list of prompts used in this workflow. The same prompt sequence is sent to each model so outputs and scores can be compared fairly.

  1. load iris dataset (from sklearn) and split into train/test sets with 80/20 ratio
  2. train a decision tree classifier and show accuracy
  3. plot the confusion matrix
  4. show feature importances as a bar chart

Model Comparison

We compared several LLMs on the same analysis task. The table below shows their scores side by side. You can also open each model run to review the full conversation and notebook results.

Model             Score  Task  Execution  Output  Reasoning  Reliability  Run
gemma4:31b        10/10  2/2   2/2        3/3     2/2        1/1          Open gemma4:31b conversation
glm-5.1           10/10  2/2   2/2        3/3     2/2        1/1          Open glm-5.1 conversation
gpt-5.4           10/10  2/2   2/2        3/3     2/2        1/1          Open gpt-5.4 conversation
gpt-oss:120b      10/10  2/2   2/2        3/3     2/2        1/1          Open gpt-oss:120b conversation
qwen3-coder-next  10/10  2/2   2/2        3/3     2/2        1/1          Open qwen3-coder-next conversation
qwen3.5:397b      10/10  2/2   2/2        3/3     2/2        1/1          Open qwen3.5:397b conversation

How scoring works

Each run is graded on five dimensions:

  • Task Completion (0-2): how fully the workflow answered the requested analysis task and completed required steps.
  • Execution Correctness (0-2): whether generated code is correct, consistent, and likely runnable without major fixes.
  • Output Quality (0-3): how well tables, charts, and results match the expected analytical outcomes.
  • Reasoning Quality (0-2): clarity and correctness of interpretation, explanation, and conclusions.
  • Reliability (0-1): robustness of the run, including low hallucination risk and stable behavior.

The final score is the sum of these dimensions, so the total ranges from 0 to 10. Higher values indicate more complete, correct, and reliable notebook results for this workflow.
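
The aggregation is simple addition across the five rubric dimensions. A quick illustration, using hypothetical per-dimension scores for a perfect run:

```python
# Hypothetical per-dimension scores following the rubric above
scores = {
    "task_completion": 2,        # out of 2
    "execution_correctness": 2,  # out of 2
    "output_quality": 3,         # out of 3
    "reasoning_quality": 2,      # out of 2
    "reliability": 1,            # out of 1
}

# Final score is the plain sum of the dimensions
total = sum(scores.values())
print(f"{total}/10")  # a perfect run scores 10/10
```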

AI Analysis with gpt-oss:120b

This run uses AI with a step-by-step list of prompts to perform the analysis task Iris Species Classification with Decision Tree. Below, you can review the full conversation generated with gpt-oss:120b. After the conversation, we show detailed score cards for this run.

load iris dataset (from sklearn) and split into train/test sets with 80/20 ratio
Thinking

We need to load iris dataset from sklearn, split train/test 80/20. Provide code block with loading, splitting, then show shapes maybe. According to rules: load and inspect only: show shape, columns, head. So after splitting, we can display shapes of X_train, X_test, y_train, y_test. Provide code. End with bare expression: maybe X_train.shape. But we can display multiple using separate lines? The rule: end each code block with at most one final bare expression. We can have separate statements before. So we can do:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
import pandas as pd

data = load_iris(as_frame=True)
df = data.frame
X = df.drop(columns='target')
y = df['target']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

X_train.shape, X_test.shape, y_train.shape, y_test.shape

But that is a tuple, which is comma-separated expression. Not allowed? Should not output comma-separated expressions. So instead, maybe display each shape individually using separate lines and final expression just one variable.

We can do:

display(X_train.head())
X_train.shape

But we need to show shape of all splits. Could compute a dictionary and then display. Maybe final expression is a dict: {'X_train': X_train.shape, ...}. That's a single expression not comma-separated? It's a dict with commas inside but maybe considered fine. The rule says "Do not output comma-separated expressions (e.g., df.head(), df.shape)." Using a dict might be okay. Safer to just output X_train.shape.

We'll provide code showing head and shape for train set as example. That satisfies inspection.

Let's produce code.

Here we load the Iris data, turn it into a pandas DataFrame, and split it into an 80 % training set and a 20 % test set (stratified by the target).
We then display a few rows of the training features and show the shape of the training split.

Split iris training set
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
import pandas as pd

# Load iris as a DataFrame
iris = load_iris(as_frame=True)
df = iris.frame                     # columns: sepal length, sepal width, petal length, petal width, target

# Features / target
X = df.drop(columns="target")
y = df["target"]

# 80/20 split, stratified to keep class proportions
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Inspect the training portion
display(X_train.head())
X_train.shape
     sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)
8                  4.4               2.9                1.4               0.2
106                4.9               2.5                4.5               1.7
76                 6.8               2.8                4.8               1.4
9                  4.9               3.1                1.5               0.1
89                 5.5               2.5                4.0               1.3
(120, 4)
Thinking

The user asked to load iris dataset and split into train/test sets with 80/20 ratio. The code was executed, and output shows head of X_train and shape (120,4). That's sufficient: we have loaded and split, showing training set shape. For completeness may also show test shape, but the goal was just to load and split. The output already demonstrates split, training set shape (120 rows). Could also mention test set shape is 30 rows (since iris has 150). We have not displayed test shape, but we can infer. However to fully answer, we could state that training set has 120 rows, test set 30 rows. We can compute test shape, but we could also mention based on split ratio. Might be okay to just summarize. Since we have the evidence for training shape, we can also compute test shape quickly without code? We know original dataset size is 150 (Iris). 20% => 30. Could state that. That's derived, not needed code.

Thus Path A: provide final answer summarizing outcome.

The Iris dataset was loaded, its features and target were separated, and an 80 % / 20 % stratified split was performed.

  • Training set: 120 rows × 4 feature columns (sepal length (cm), sepal width (cm), petal length (cm), petal width (cm)).
  • Test set: 30 rows × 4 feature columns (the remaining 20 % of the data).

The split preserves the class distribution across the train and test subsets.
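
As a side note (not part of the recorded run), the claim that stratification preserves the class distribution can be verified with a quick count. This sketch repeats the same split so it is self-contained:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Same 80/20 stratified split as in the cell above
iris = load_iris(as_frame=True)
X = iris.frame.drop(columns="target")
y = iris.frame["target"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Iris has 50 samples per class, so stratification yields
# 40 per class in train and 10 per class in test
print(y_train.value_counts().sort_index())
print(y_test.value_counts().sort_index())
```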

train a decision tree classifier and show accuracy
Thinking

We need to train a decision tree classifier on X_train, y_train, then evaluate accuracy on test set. Use sklearn. Provide code block that fits model, predicts, computes accuracy, and displays result. Follow notebook rules: final expression bare, maybe display accuracy. Provide just one code block.

Evaluate decision tree accuracy
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Train a decision tree on the training split
tree_clf = DecisionTreeClassifier(random_state=42)
tree_clf.fit(X_train, y_train)

# Predict on the test split
y_pred = tree_clf.predict(X_test)

# Compute and show accuracy
accuracy = accuracy_score(y_test, y_pred)
accuracy
0.9333333333333333
Thinking

We have loaded iris, split, trained decision tree, accuracy output is 0.9333333333333333. The original goal: "train a decision tree classifier and show accuracy". The latest output already shows the accuracy. So sufficient, we respond with final answer summarizing accuracy.

No code needed.

The decision‑tree classifier achieved an accuracy of 0.9333333333333333 on the 20 % test split.

plot the confusion matrix
Thinking

We need to plot confusion matrix for the decision tree predictions. Use seaborn heatmap. We have variables: y_test, y_pred. Ensure they are available. Provide code to compute confusion matrix and plot. Use classification_report? Just confusion matrix.

We need to follow guidelines: one chart per step. Provide code block with seaborn heatmap. Use sklearn.metrics.confusion_matrix. Provide axis labels, title, maybe normalization. Provide code.

Here’s a quick view of how the model’s predictions are distributed across the three iris classes:

Plot confusion matrix
from sklearn.metrics import confusion_matrix
import seaborn as sns
import matplotlib.pyplot as plt

# Compute the confusion matrix
cm = confusion_matrix(y_test, y_pred)

# Plot it as a heatmap
plt.figure(figsize=(6, 4))
sns.heatmap(
    cm,
    annot=True,
    fmt="d",
    cmap="Blues",
    cbar=False,
    xticklabels=iris.target_names,
    yticklabels=iris.target_names,
)
plt.xlabel("Predicted class")
plt.ylabel("True class")
plt.title("Confusion Matrix – Decision Tree on Iris Data")
plt.tight_layout()
plt.show()
Thinking

We need to decide if current outputs are sufficient to answer the original user goal: "plot the confusion matrix". The user already executed code to plot confusion matrix and got a figure displayed. The output shows a figure. So the goal is satisfied: they have a plot. According to guidelines, if outputs are sufficient, we should provide final answer with no code. Provide natural final answer based on observed outputs, include <!--final answer-->.

We should not repeat output, but we can describe the confusion matrix content maybe. But guidelines: "Do not restate output already visible in the latest output." So we shouldn't restate the figure, but we can summarize: the plot shows values.

We can say something like: The confusion matrix has been plotted, showing correct classifications etc. Provide final answer.

Thus Path A.

The decision‑tree’s confusion matrix has been plotted, displaying the counts of true vs. predicted classes for the iris test set.

show feature importances as a bar chart
Thinking

We need to respond by providing code to plot feature importances bar chart for the trained DecisionTreeClassifier. Provide explanation then code. According to instructions: one step, provide code block, no extra instructions. Use seaborn/matplotlib. Use tree_clf.feature_importances_ and feature names from X columns. Provide display of bar chart. Stop.

Here’s a simple bar chart of the decision‑tree’s feature importances:

Plot feature importances
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Feature names from the training DataFrame
feature_names = X_train.columns

# Pull the importance values from the fitted model
importances = tree_clf.feature_importances_

# Build a tidy DataFrame for plotting
imp_df = pd.DataFrame({
    "feature": feature_names,
    "importance": importances
}).sort_values(by="importance", ascending=False)

# Plot
plt.figure(figsize=(6, 4))
sns.barplot(
    data=imp_df,
    x="importance",
    y="feature",
    palette="viridis"
)
plt.xlabel("Importance")
plt.title("Decision Tree Feature Importances – Iris Dataset")
plt.tight_layout()
plt.show()
/tmp/ipykernel_2872800/2682457621.py:19: FutureWarning: 

Passing `palette` without assigning `hue` is deprecated and will be removed in v0.14.0. Assign the `y` variable to `hue` and set `legend=False` for the same effect.

  sns.barplot(
Thinking

We need to decide if current outputs sufficiently answer the original goal: "show feature importances as a bar chart". The notebook already has cell-4 that plotted a bar chart of feature importances. The output is a figure. According to sufficiency heuristics: if the goal is to show/plot/chart and latest output contains both a valid chart and explicit numeric/text values supporting the conclusion, choose Path A. However they mention that for ranking tasks like top features, do not choose Path A from chart-only output; first produce a value table/list via Path B. The goal is to show feature importances as a bar chart. The chart is already shown; no need for numeric table? But the heuristic says for ranking tasks, we should provide a table/list before concluding? The user explicitly asked to "show feature importances as a bar chart". The chart exists. The heuristic says: "If the goal is to show/plot/chart/visualize and latest output contains both a valid chart/figure object and explicit numeric/text values supporting the conclusion, choose Path A unless there is an explicit error. Prefer Path A whenever observed outputs already answer the goal." The chart does not include numeric values, but the underlying data is displayed in the chart? Not explicit. The chart is visual, but may not show numbers. The heuristic for ranking tasks says: "If the goal is to show/plot/chart/visualize and latest output contains both a valid chart/figure object and explicit numeric/text values supporting the conclusion, choose Path A." It does not have explicit numeric values displayed. However they say for ranking tasks, do not choose Path A from chart-only output; first produce a value table/list via Path B. Feature importance ranking is a ranking task. So we should provide underlying numeric values table.

Thus we choose Path B: provide a minimal code block to display the feature importances values (sorted) perhaps as a table. Provide insights: The last chart shows the importances; here is the table of values. Provide one code block to display the DataFrame. No extra explanations beyond concise. We'll do that.

The bar chart visualizes how each iris feature contributes to the decision‑tree’s predictions. Below is a concise table with the exact importance values used for that plot:

Show numeric importances
# Show the numeric feature importances in descending order
imp_df
   feature             importance
2  petal length (cm)     0.558568
3  petal width (cm)      0.406015
1  sepal width (cm)      0.029167
0  sepal length (cm)     0.006250
Thinking

We need to decide if we have enough to answer original goal: "show feature importances as a bar chart". The notebook has already produced a bar chart (output-4) showing feature importances. Also they displayed the numeric table (output-5). The goal was to show feature importances as a bar chart; that has been satisfied. According to sufficiency heuristics: if chart-only not enough, produce values. But they already have values displayed in output-5 (a table). The chart is there, and numbers are shown. So we can consider goal satisfied.

Thus Path A: provide final answer summarizing insights, provide final answer comment.

Need to include <!--final answer-->.

Should not include code, just natural final answer.

Let's write: The bar chart shows petal length most important, then petal width, etc. Also give numeric importances.

Make sure to include comment.

The decision‑tree model attributes most of its predictive power to petal length (cm) (≈ 0.56 importance) followed by petal width (cm) (≈ 0.41). The remaining features contribute very little: sepal width (cm) ≈ 0.03 and sepal length (cm) ≈ 0.01. This ranking is reflected both in the bar chart and the accompanying table of numeric importances.

Score cards for gpt-oss:120b for Iris Species Classification with Decision Tree

Total score: 10/10

  • Task Completion: 2/2

    Excellent

    All major requested steps were completed: train/test split, decision tree training with accuracy, confusion matrix plot, and feature-importance bar chart.

  • Execution Correctness: 2/2

    Excellent

    The code shown is coherent and likely runnable; the outputs confirm successful execution of the split, model fit, evaluation, and plots.

  • Output Quality: 3/3

    Excellent

    Outputs match the expected outcomes semantically, including the split shapes, accuracy, confusion matrix heatmap, and feature importances showing petal length and petal width as dominant.

  • Reasoning Quality: 2/2

    Excellent

    The workflow uses appropriate ML steps and correctly interprets the feature-importance results. The explanations are clear and aligned with the task.

  • Reliability: 1/1

    Excellent

    The workflow is consistent and robust overall, with only a minor deprecation warning that does not affect the results.

Try MLJAR Studio

Run the same type of AI-powered data analysis on your own datasets with conversational notebooks in MLJAR Studio.

Explore More AI Analysis Examples

Discover additional workflows across categories. Each example includes prompts, conversation outputs, and model-level scoring so you can compare approaches and results.