Iris Species Classification with Decision Tree

Train a decision tree classifier on the Iris dataset, evaluate its accuracy, and visualize the results (confusion matrix and feature importances) with an AI data analyst.

This page shows an AI data analysis example. A workflow is a step-by-step list of prompts (instructions) used to complete one analysis task. We send the same prompt list to multiple AI models, then compare their notebook outputs and scores.

What this AI workflow does

This AI Data Analyst workflow loads the Iris dataset from scikit-learn and creates an 80/20 train-test split. It trains a decision tree classifier, reports accuracy, and generates a classification report. It also plots a confusion matrix and visualizes feature importances as a bar chart.
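The classification-report step described above can be sketched as follows. This is a minimal illustration, not the exact code from any run; the variable names (`clf`, `X_train`, etc.) are our own, and we assume the same 80/20 split with `random_state=42` used later in this page.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report

# Load the Iris dataset and make the 80/20 split described above
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42)

# Train a decision tree and print per-class precision/recall/F1
clf = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
print(classification_report(y_test, clf.predict(X_test),
                            target_names=iris.target_names))
```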

Who this example is for

This is for learners and practitioners who want a compact, reproducible example of multiclass classification with a decision tree. It helps validate model performance with standard metrics and interpret which Iris features drive predictions.

Expected analysis outcomes

These are the results the AI workflow is expected to generate.

  • Train/test split with dataset shapes printed
  • Decision tree model trained with accuracy and classification report
  • Confusion matrix heatmap for error inspection
  • Feature importance bar chart highlighting the most influential features

Tools and libraries used

Main Python packages and tooling used to run this AI data analysis task.

  • scikit-learn
  • pandas
  • numpy
  • matplotlib
  • seaborn

Prompt sequence

This is the exact list of prompts used in this workflow. The same prompt sequence is sent to each model so outputs and scores can be compared fairly.

  1. load iris dataset (from sklearn) and split into train/test sets with 80/20 ratio
  2. train a decision tree classifier and show accuracy
  3. plot the confusion matrix
  4. show feature importances as a bar chart

Model Comparison

We compared several LLM models on the same analysis task. The table below shows their scores side by side. You can also open each model run to check the full conversation and notebook results.

Model Source     | Score | Task | Execution | Output | Reasoning | Reliability | Run
gemma4:31b       | 10/10 | 2/2  | 2/2       | 3/3    | 2/2       | 1/1         | Open gemma4:31b conversation
glm-5.1          | 10/10 | 2/2  | 2/2       | 3/3    | 2/2       | 1/1         | Open glm-5.1 conversation
gpt-5.4          | 10/10 | 2/2  | 2/2       | 3/3    | 2/2       | 1/1         | Open gpt-5.4 conversation
gpt-oss:120b     | 10/10 | 2/2  | 2/2       | 3/3    | 2/2       | 1/1         | Open gpt-oss:120b conversation
qwen3-coder-next | 10/10 | 2/2  | 2/2       | 3/3    | 2/2       | 1/1         | Open qwen3-coder-next conversation
qwen3.5:397b     | 10/10 | 2/2  | 2/2       | 3/3    | 2/2       | 1/1         | Open qwen3.5:397b conversation

How scoring works

Each run is graded on five dimensions:

  • Task Completion (0-2): how fully the workflow answered the requested analysis task and completed required steps.
  • Execution Correctness (0-2): whether generated code is correct, consistent, and likely runnable without major fixes.
  • Output Quality (0-3): how well tables, charts, and results match the expected analytical outcomes.
  • Reasoning Quality (0-2): clarity and correctness of interpretation, explanation, and conclusions.
  • Reliability (0-1): robustness of the run, including low hallucination risk and stable behavior.

The final score is the sum of these five dimensions, so the total ranges from 0 to 10. Higher values indicate more complete, correct, and reliable notebook results for this workflow.
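As a quick sanity check on the arithmetic, a run's total is just the sum of its five dimension scores. The values below are illustrative (a perfect run, matching the maxima listed above):

```python
# Dimension maxima: task 2, execution 2, output 3, reasoning 2, reliability 1
scores = {"task": 2, "execution": 2, "output": 3, "reasoning": 2, "reliability": 1}
total = sum(scores.values())
print(f"{total}/10")  # a perfect run sums to 10/10
```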

AI Analysis with glm-5.1

This run uses AI with a step-by-step list of prompts to perform the analysis task Iris Species Classification with Decision Tree. Below, you can review the full conversation generated with glm-5.1. After the conversation, we show detailed score cards for this run.

load iris dataset (from sklearn) and split into train/test sets with 80/20 ratio
Thinking

The user wants to load the iris dataset from sklearn and split it into train/test sets with an 80/20 ratio. I'll provide a simple, runnable code block for this.

Split iris data
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Load iris dataset
iris = load_iris()
X = iris.data
y = iris.target

# Split into train/test (80/20)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Quick summary
f"X_train shape: {X_train.shape}, X_test shape: {X_test.shape}, y_train shape: {y_train.shape}, y_test shape: {y_test.shape}"
'X_train shape: (120, 4), X_test shape: (30, 4), y_train shape: (120,), y_test shape: (30,)'

The dataset is loaded and split as requested. We have:

  • Training set: 120 samples with 4 features each
  • Test set: 30 samples with 4 features each

This reflects the standard 80/20 split on the 150-sample Iris dataset (120 train / 30 test), using random_state=42 for reproducibility.
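Note that the split above does not stratify by class, so per-class counts in the test set may be slightly uneven. A hedged variant (same 80/20 ratio and seed, not part of the original run) adds `stratify` to preserve the 50/50/50 class balance:

```python
from collections import Counter
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()
# stratify keeps class proportions identical in both splits
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42,
    stratify=iris.target)
print(Counter(y_test))  # exactly 10 samples per class in the 30-sample test set
```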

train a decision tree classifier and show accuracy
Thinking

The user wants to train a Decision Tree Classifier on the already-split iris dataset and show accuracy. I'll use the X_train, X_test, y_train, y_test from the previous step.

Train decision tree classifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Train Decision Tree Classifier
clf = DecisionTreeClassifier(random_state=42)
clf.fit(X_train, y_train)

# Predict and compute accuracy
y_pred = clf.predict(X_test)
accuracy_score(y_test, y_pred)
1.0
Thinking

The user asked to train a decision tree classifier and show accuracy. The notebook shows that the decision tree classifier was trained on the Iris training set and achieved an accuracy of 1.0 (100%) on the test set. This directly answers the original goal.

The Decision Tree classifier achieved 100% accuracy on the test set, meaning it correctly predicted all 30 test samples. This perfect score is common on the Iris dataset with a decision tree, as its class boundaries are relatively simple and a single tree can fully capture them without any misclassifications.
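A single 30-sample test split can be optimistic, so a quick cross-validation gives a more robust estimate of the tree's accuracy. This is a minimal sketch, not part of the original run; the exact fold scores depend on the CV assignment:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
clf = DecisionTreeClassifier(random_state=42)
# 5-fold stratified CV over the full 150-sample dataset
scores = cross_val_score(clf, iris.data, iris.target, cv=5)
print(scores.round(3), "mean:", scores.mean().round(3))
```

Typically the mean lands a little below 1.0, which is a useful reminder that the perfect single-split score partly reflects which 30 samples ended up in the test set.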

plot the confusion matrix
Thinking

The user wants to plot the confusion matrix for the decision tree classifier trained on the Iris dataset. I'll use seaborn's heatmap for a nice visualization, along with sklearn's confusion_matrix.

Plot confusion matrix
from sklearn.metrics import confusion_matrix
import seaborn as sns
import matplotlib.pyplot as plt

# Compute confusion matrix
cm = confusion_matrix(y_test, y_pred)

# Plot it
plt.figure(figsize=(6, 5))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
            xticklabels=iris.target_names, yticklabels=iris.target_names)
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion Matrix - Decision Tree (Iris)')
plt.tight_layout()
plt.show()
Thinking

The user's goal was to plot the confusion matrix. The confusion matrix has been plotted successfully in cell-3, showing a perfect diagonal with all predictions correct (matching the 1.0 accuracy). The plot was rendered successfully. The goal is complete.

The confusion matrix plot has been rendered successfully. It shows a perfect diagonal — all 30 test samples were classified correctly, with no off-diagonal misclassifications. This aligns with the 100% accuracy observed earlier.

Actual     | Predicted setosa | Predicted versicolor | Predicted virginica
setosa     | 10               | 0                    | 0
versicolor | 0                | 9                    | 0
virginica  | 0                | 0                    | 11

The Decision Tree classifier perfectly separated all three Iris species in the test set.

show feature importances as a bar chart
Thinking

The user wants to see feature importances as a bar chart. I have a trained DecisionTreeClassifier (clf) and the Iris dataset with feature names in iris.feature_names.

Plot feature importances
import pandas as pd

# Get feature importances from the trained model
importances = pd.Series(clf.feature_importances_, index=iris.feature_names).sort_values(ascending=True)

# Plot as horizontal bar chart
plt.figure(figsize=(7, 4))
importances.plot(kind='barh', color='steelblue')
plt.xlabel('Importance')
plt.ylabel('Feature')
plt.title('Decision Tree Feature Importances (Iris)')
plt.tight_layout()
plt.show()
Thinking

The user's goal was to show feature importances as a bar chart. The latest output (cell-4) shows a bar chart was successfully rendered. However, the output is chart-only (Figure size 700x400) without the actual numeric values displayed. Since the goal was to "show feature importances as a bar chart" and we have a valid chart, but the sufficiency heuristic says for ranking tasks with chart-only output, I should first produce a value table. Let me check - the goal is "show feature importances as a bar chart" - this is a visualization goal, and the chart was produced. But the heuristic says for ranking tasks (like feature importance), don't choose Path A from chart-only output; first produce a value table.

So I should go with Path B - print the underlying values.

The bar chart was rendered successfully, but since the output is visual-only, let me print the actual numeric feature importance values to accompany it.

Sort feature importances
importances.sort_values(ascending=False)
petal length (cm)    0.906143
petal width (cm)     0.077186
sepal width (cm)     0.016670
sepal length (cm)    0.000000
dtype: float64
Thinking

The user's original goal was to show feature importances as a bar chart. Looking at the notebook state, I can see that:

  1. Cell-4 produced a bar chart of feature importances (output shows <Figure size 700x400 with 1 Axes>)
  2. Cell-5 produced the numeric values of the feature importances sorted in descending order

The outputs are:

  • petal length (cm): 0.906143
  • petal width (cm): 0.077186
  • sepal width (cm): 0.016670
  • sepal length (cm): 0.000000

So we have both the bar chart (visual) and the underlying numeric values. The goal is complete - we have the bar chart showing feature importances and the numeric values to back it up.

This is a case where I should choose Path A - the outputs are sufficient to answer the original goal.

The feature importance bar chart has been displayed, and the numeric values confirm what it shows:

  • Petal length dominates with ~90.6% importance, making it the most influential feature for the Decision Tree's predictions on the Iris dataset.
  • Petal width contributes a modest ~7.7%.
  • Sepal width has minimal impact at ~1.7%.
  • Sepal length has zero importance — the trained tree never split on this feature.

This is a typical pattern for the Iris dataset: petal measurements are far more discriminative than sepal measurements for distinguishing the three species.
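The claim that petal measurements are more discriminative than sepal measurements can be checked directly from the per-class feature means. A minimal sketch (not part of the original run; the DataFrame construction is our own):

```python
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df["species"] = pd.Categorical.from_codes(iris.target, iris.target_names)

# Per-species means: the petal columns spread far apart across species,
# while the sepal columns overlap much more
print(df.groupby("species", observed=True).mean().round(2))
```

Setosa's average petal length sits well below 2 cm while virginica's exceeds 5 cm, which is exactly why a single petal-length split does most of the tree's work.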

Score cards for glm-5.1 for Iris Species Classification with Decision Tree

Total score: 10/10

  • Task Completion: 2/2

    Excellent

    All requested major steps were completed: train/test split, decision tree training, accuracy, confusion matrix, and feature importance bar chart.

  • Execution Correctness: 2/2

    Excellent

    The code blocks are coherent and likely runnable, with valid sklearn, seaborn, and matplotlib usage. No major syntax or logic errors are evident from the provided content.

  • Output Quality: 3/3

    Excellent

    Outputs match the expected outcomes semantically: 80/20 split shapes, 1.0 accuracy, confusion matrix visualization, and feature importances showing petal length and petal width dominate.

  • Reasoning Quality: 2/2

    Excellent

    The assistant correctly interprets the results and explains why the Iris decision tree performs well, including the dominance of petal features.

  • Reliability: 1/1

    Excellent

    The workflow is consistent and grounded in the shown outputs, with no obvious hallucinations or unsupported claims.

Try MLJAR Studio

Run the same type of AI-powered data analysis on your own datasets with conversational notebooks in MLJAR Studio.

Explore More AI Analysis Examples

Discover additional workflows across categories. Each example includes prompts, conversation outputs, and model-level scoring so you can compare approaches and results.