Iris Species Classification with Decision Tree

Train a decision tree classifier on the Iris dataset, evaluate accuracy, and visualize the decision boundaries using an AI data analyst.

This page shows an AI data analysis example. A workflow is a list of prompts (instructions) used to complete one analysis task step by step. We send the same prompt list to multiple AI models, then compare their notebook outputs and scores.

What this AI workflow does

This AI Data Analyst workflow loads the Iris dataset from scikit-learn and creates an 80/20 train-test split. It trains a decision tree classifier, reports accuracy, and generates a classification report. It also plots a confusion matrix and visualizes feature importances as a bar chart.
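The pipeline described above can be sketched in a few lines. This is a minimal, standalone version assuming scikit-learn is installed; `random_state=42` is an illustrative choice that reproduces the 80/20 shapes reported later on this page, not a requirement of the workflow.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Load the Iris dataset and create an 80/20 train-test split
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42)

# Train the decision tree and evaluate on the held-out test set
clf = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
y_pred = clf.predict(X_test)

print(X_train.shape, X_test.shape)           # dataset shapes
print(accuracy_score(y_test, y_pred))        # overall accuracy
print(classification_report(y_test, y_pred)) # per-class precision/recall/F1
print(confusion_matrix(y_test, y_pred))      # error inspection
print(clf.feature_importances_)              # drivers of the predictions
```

The confusion matrix and feature importances printed here are the same arrays the workflow later renders as a heatmap and a bar chart.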

Who this example is for

This is for learners and practitioners who want a compact, reproducible example of multiclass classification with a decision tree. It helps validate model performance with standard metrics and interpret which Iris features drive predictions.

Expected analysis outcomes

These are the results the AI workflow is expected to generate.

  • Train/test split with dataset shapes printed
  • Decision tree model trained with accuracy and classification report
  • Confusion matrix heatmap for error inspection
  • Feature importance bar chart highlighting the most influential features

Tools and libraries used

Main Python packages and tooling used to run this AI data analysis task.

  • scikit-learn
  • pandas
  • numpy
  • matplotlib
  • seaborn

Prompt sequence

This is the exact list of prompts used in this workflow. The same prompt sequence is sent to each model so outputs and scores can be compared fairly.

  1. load iris dataset (from sklearn) and split into train/test sets with 80/20 ratio
  2. train a decision tree classifier and show accuracy
  3. plot the confusion matrix
  4. show feature importances as a bar chart

Model Comparison

We compared several LLMs on the same analysis task. The table below shows their scores side by side. You can also open each model's run to review the full conversation and notebook results.

Model              Score  Task  Execution  Output  Reasoning  Reliability  Run
gemma4:31b         10/10  2/2   2/2        3/3     2/2        1/1          Open gemma4:31b conversation
glm-5.1            10/10  2/2   2/2        3/3     2/2        1/1          Open glm-5.1 conversation
gpt-5.4            10/10  2/2   2/2        3/3     2/2        1/1          Open gpt-5.4 conversation
gpt-oss:120b       10/10  2/2   2/2        3/3     2/2        1/1          Open gpt-oss:120b conversation
qwen3-coder-next   10/10  2/2   2/2        3/3     2/2        1/1          Open qwen3-coder-next conversation
qwen3.5:397b       10/10  2/2   2/2        3/3     2/2        1/1          Open qwen3.5:397b conversation

How scoring works

Each run is graded on five dimensions:

  • Task Completion (0-2): how fully the workflow answered the requested analysis task and completed required steps.
  • Execution Correctness (0-2): whether generated code is correct, consistent, and likely runnable without major fixes.
  • Output Quality (0-3): how well tables, charts, and results match the expected analytical outcomes.
  • Reasoning Quality (0-2): clarity and correctness of interpretation, explanation, and conclusions.
  • Reliability (0-1): robustness of the run, including low hallucination risk and stable behavior.

The final score is the sum of these dimensions, giving a total between 0 and 10. Higher values indicate more complete, correct, and reliable notebook results for this workflow.
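As a quick arithmetic check, a run's total is simply the sum of its five dimension scores. The sketch below mirrors one of the 10/10 runs from the comparison table; the dictionary keys are illustrative names, not part of the scoring system itself.

```python
# Rubric dimensions and the maximum points for each (task 0-2, execution 0-2,
# output 0-3, reasoning 0-2, reliability 0-1), filled with a perfect run's values.
scores = {
    "task_completion": 2,        # 0-2
    "execution_correctness": 2,  # 0-2
    "output_quality": 3,         # 0-3
    "reasoning_quality": 2,      # 0-2
    "reliability": 1,            # 0-1
}

total = sum(scores.values())
print(f"{total}/10")  # -> 10/10
```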

AI Analysis with gemma4:31b

This run uses AI with a step-by-step list of prompts to perform the analysis task Iris Species Classification with Decision Tree. Below, you can review the full conversation generated with gemma4:31b. After the conversation, we show detailed score cards for this run.

load iris dataset (from sklearn) and split into train/test sets with 80/20 ratio
Thinking
  • Goal: Load the iris dataset from sklearn and split it into train/test sets (80/20 ratio).
  • Constraints:
    • Iterative process (one step at a time).
    • One Python code block per response.
    • First step: Load and inspect only (shape, columns, head).
    • Currency rule: N/A here.
    • Notebook rules: Variables displayed directly, no comma-separated expressions, no print for simple values.
  • The user wants to load and split.
  • According to the "Notebook rules", when a dataset is not yet loaded, I must first load and inspect only (shape, columns, head).
  • I cannot do the split in the same step as the initial load and inspection.
  • Load iris from sklearn.datasets.
  • Convert to DataFrame for easier inspection.
  • Display df.shape, df.columns, and df.head().
  • Wait, the rule says: "When a dataset is not yet loaded in current notebook context, first load and inspect only: show shape, columns, and head()." and "Do not combine first-load data inspection with plotting/modeling in the same response."
  • Splitting is part of modeling preparation. So I will load and inspect first.
  • Import load_iris from sklearn.datasets.
  • Import pandas.
  • Create DataFrame.
  • Show shape, columns, and head sequentially.

I'll start by loading the Iris dataset and inspecting its structure to ensure everything is correct before we proceed with the split.

Load and inspect Iris data
import pandas as pd
from sklearn.datasets import load_iris

# Load the dataset
iris = load_iris()
df = pd.DataFrame(data=iris.data, columns=iris.feature_names)
df['target'] = iris.target

# Inspect the dataset
display(df.shape)
display(df.columns)
display(df.head())
(150, 5)
Index(['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)',
       'petal width (cm)', 'target'],
      dtype='object')
   sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)  target
0                5.1               3.5               1.4               0.2        0
1                4.9               3.0               1.4               0.2        0
2                4.7               3.2               1.3               0.2        0
3                4.6               3.1               1.5               0.2        0
4                5.0               3.6               1.4               0.2        0
Thinking

The user wants to load the iris dataset and split it into train/test sets with an 80/20 ratio. The previous step loaded the dataset into a DataFrame df and inspected it. Current state:

  • df exists with 150 rows and 5 columns (4 features + 1 target).
  • The original goal was: load AND split.
  • Loading is done. Splitting is not yet done.

Therefore, Path B is required to perform the split.

Plan:

  • Acknowledge the data is loaded (150 samples, target column present).
  • Use train_test_split from sklearn.model_selection to perform the 80/20 split.
  • Define features X and target y.
  • Split and display the shapes of the resulting sets to verify the ratio.

The dataset is loaded with 150 samples and 5 columns, including the target variable. Now I will split the data into features and target, then perform the 80/20 train-test split.

Split train test sets
from sklearn.model_selection import train_test_split

# Separate features and target
X = df.drop(columns=['target'])
y = df['target']

# Split into 80% train and 20% test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Verify the split shapes
display(X_train.shape, X_test.shape)
(120, 4)
(30, 4)
Thinking

The user goal is to "load iris dataset (from sklearn) and split into train/test sets with 80/20 ratio".

Reviewing the notebook state:

  1. Cell 1: load_iris() was called, df was created with 150 rows and 5 columns (4 features + target).
  2. Cell 2: train_test_split(X, y, test_size=0.2, random_state=42) was called.
  3. Output 2: (120, 4) and (30, 4).
    • 120 is 80% of 150.
    • 30 is 20% of 150.
    • X_train has 120 samples, X_test has 30 samples.

The outputs confirm that the dataset was loaded correctly and split precisely into the requested 80/20 ratio. All components of the user goal have been addressed and verified by the output.

Decision: Path A.

The Iris dataset has been loaded with 150 samples and 5 columns. The data was split into training and testing sets using an 80/20 ratio, resulting in 120 samples for training and 30 samples for testing, as confirmed by the output shapes.

train a decision tree classifier and show accuracy
Thinking

The user wants to train a Decision Tree Classifier on the previously split Iris dataset and display the accuracy.

  • `X_train`, `X_test`, `y_train`, `y_test` are already available in the notebook state.
  • The data is split into 80/20.
  • Features are numeric (sepal/petal lengths and widths).

  • Import `DecisionTreeClassifier` from `sklearn.tree`.
  • Import `accuracy_score` from `sklearn.metrics`.
  • Instantiate the model.
  • Fit the model on the training set.
  • Predict on the test set.
  • Calculate and display the accuracy.

  • Use seaborn/matplotlib for visualization if requested (not requested yet, just accuracy).
  • Return one runnable code block.
  • Keep it simple.

I will now train a Decision Tree classifier using the training set and evaluate its accuracy on the test set.

Train and evaluate model
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Initialize and train the Decision Tree classifier
clf = DecisionTreeClassifier(random_state=42)
clf.fit(X_train, y_train)

# Make predictions and calculate accuracy
y_pred = clf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)

accuracy
1.0
Thinking

The user wants to train a decision tree classifier and show the accuracy. Looking at the provided notebook state:

  • Cell 1: Loaded Iris dataset into df.
  • Cell 2: Split data into X_train, X_test, y_train, y_test (80/20 split).
  • Cell 3: Trained DecisionTreeClassifier and calculated accuracy_score.
  • Output 3 shows the accuracy value as 1.0.

The original goal was to "train a decision tree classifier and show accuracy". The outputs show that the model was trained and the accuracy is 1.0 (100%).

Since the outputs directly answer the user's goal, I should follow Path A.

Insights:

  • The Decision Tree classifier achieved an accuracy of 1.0 on the test set.
  • This means it correctly predicted all 30 samples in the test set.

Final answer should be a natural response based on these observations.

The Decision Tree classifier was trained on the Iris dataset and achieved an accuracy of 1.0 on the test set, meaning it correctly predicted every sample in the evaluation group.

plot the confusion matrix
Thinking

The user wants to plot the confusion matrix for the trained Decision Tree classifier. The notebook state contains y_test (true labels) and y_pred (predicted labels). I should use sklearn.metrics.confusion_matrix and seaborn.heatmap for visualization.

Plan:

  1. Compute the confusion matrix using confusion_matrix from sklearn.metrics.
  2. Plot the heatmap using seaborn.
  3. Add labels and a title.

I'll plot the confusion matrix to see how the model performed across the different iris species.

Plot confusion matrix
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import confusion_matrix

# Calculate the confusion matrix
cm = confusion_matrix(y_test, y_pred)

# Plot using seaborn
plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', 
            xticklabels=iris.target_names, 
            yticklabels=iris.target_names)
plt.title('Confusion Matrix - Decision Tree')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.show()
Thinking

The user wants to plot the confusion matrix. The latest output (Cell 4) shows a figure object <Figure size 800x600 with 2 Axes>, and the code used sns.heatmap on cm (the confusion matrix). The variable cm is present in the state as an ndarray of shape (3, 3). The accuracy was 1.0 (Cell 3). The goal is "plot the confusion matrix". The user has already run the code to plot the confusion matrix. However, the sufficiency heuristic says: "If the latest output is chart-only/visual-only and does not include underlying numeric/text values, choose Path B first to print the values used in the chart." The output for Cell 4 is just the figure object. While the heatmap annot=True puts numbers on the chart, the prompt asks me to analyze the output. In a real notebook, the image is visible. In this context, the "output" provided is <Figure size 800x600 with 2 Axes>. To be rigorous and follow the heuristic, I should print the numeric values of the confusion matrix to provide concrete insights based on data, not just a figure object.

Insights:

  • The model achieved 100% accuracy (1.0).
  • The confusion matrix cm contains the distribution of predictions.
  • Since accuracy is 1.0, the confusion matrix will have values only on the main diagonal.

Path B: Print the cm array to see the exact counts.

Wait, looking at the prompt: "If the goal is to show/plot/chart/visualize and latest output contains both a valid chart/figure object and explicit numeric/text values supporting the conclusion, choose Path A". The current output only has the figure object. It doesn't have the numeric array printed as text in the output. Therefore, Path B.

Plan:

  1. Note that the heatmap was generated.
  2. Print the cm matrix to verify the exact counts for each class.

The confusion matrix has been plotted, showing the model's performance across the three iris species. Since the accuracy is 1.0, the heatmap indicates that all predictions align perfectly with the actual labels. To provide the exact numeric distribution of these predictions, I will display the raw confusion matrix values.

Display confusion matrix
# Display the numeric values of the confusion matrix
cm
array([[10,  0,  0],
       [ 0,  9,  0],
       [ 0,  0, 11]])
Thinking

The user wants to plot the confusion matrix. Looking at the notebook state:

  • Cell 4 plotted the confusion matrix using a seaborn heatmap.
  • Cell 5 displayed the raw numpy array cm: array([[10, 0, 0], [0, 9, 0], [0, 0, 11]]).
  • The accuracy was 1.0.
  • The labels used were iris.target_names.

The original goal "plot the confusion matrix" has been met. I have both the visual chart and the numeric values confirming that the model correctly classified 10 Setosa, 9 Versicolor, and 11 Virginica samples without any misclassifications.

Since the goal is "plot the confusion matrix" and both the plot and the numeric support are present, I should follow Path A.

Final Answer content:

  • The confusion matrix shows perfect classification for the test set.
  • 10 samples of Setosa, 9 of Versicolor, and 11 of Virginica were all correctly predicted.
  • There are no off-diagonal values, which confirms the 1.0 accuracy.

Wait, the prompt asks for a natural final answer based on observed outputs.

Path A: Provide natural final answer, include <!--final answer-->, no code.

The confusion matrix shows that the Decision Tree classifier achieved perfect predictions on the test set. Specifically, it correctly identified 10 samples of Setosa, 9 of Versicolor, and 11 of Virginica, with no misclassifications across any of the categories. This result is consistent with the 1.0 accuracy score observed.

show feature importances as a bar chart
Thinking

The user wants to visualize the feature importances of the trained DecisionTreeClassifier as a bar chart.

  • `clf`: The trained `DecisionTreeClassifier` model.
  • `X`: The original DataFrame containing the feature names.
  • `iris`: The bunch object (though `X.columns` is easier).
  • `X_train`: Training features.

  • Get `feature_importances_` from `clf`.
  • Map these importances to the corresponding column names from `X`.
  • Create a DataFrame or series for easier plotting.
  • Use seaborn to create a bar chart.

  • One runnable Python code block.
  • Brief explanation.
  • No end-to-end pipeline.
  • Default to seaborn.

Since the Decision Tree model is already trained, we can extract the feature_importances_ attribute to see which features contributed most to the predictions. I'll map these values to the feature names and plot them in a bar chart.

Plot feature importances
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Get feature importances from the classifier
importances = clf.feature_importances_
feature_names = X.columns

# Create a DataFrame for plotting
feat_imp_df = pd.DataFrame({'Feature': feature_names, 'Importance': importances})
feat_imp_df = feat_imp_df.sort_values(by='Importance', ascending=False)

# Plot the bar chart
plt.figure(figsize=(10, 6))
sns.barplot(x='Importance', y='Feature', data=feat_imp_df, palette='viridis')
plt.title('Feature Importances - Decision Tree')
plt.xlabel('Importance Score')
plt.ylabel('Features')
plt.show()
/var/folders/pp/x90pfm0140d4glxshn77bthh0000gn/T/ipykernel_6746/772402326.py:15: FutureWarning: 

Passing `palette` without assigning `hue` is deprecated and will be removed in v0.14.0. Assign the `y` variable to `hue` and set `legend=False` for the same effect.

  sns.barplot(x='Importance', y='Feature', data=feat_imp_df, palette='viridis')
Thinking

The user wants to show feature importances as a bar chart. Looking at the notebook state:

  • Cell 6 executed the code to get feature_importances_, created a DataFrame feat_imp_df, and plotted a bar chart.
  • Output 6 is a Figure.

According to the "Sufficiency heuristics":

  • "If the latest output is chart-only/visual-only and does not include underlying numeric/text values, choose Path B first to print the values used in the chart."
  • "For ranking tasks (for example top features, feature importance, top-N categories), do not choose Path A from chart-only output; first produce a value table/list via Path B."

The latest output is just a Figure object (<Figure 1000x600 with 1 Axes>). I don't have the actual numerical values of the importances in the provided text outputs. Therefore, I must choose Path B to print the feat_imp_df (the values) before concluding.

Plan:

  1. Provide brief insights from the current state (the chart was generated).
  2. Provide a Python code block to display the feat_imp_df DataFrame to get the exact numbers.

The bar chart has been generated, visualizing the relative contribution of each feature to the Decision Tree's predictions. To provide precise insights into which features are most influential, the underlying importance scores need to be extracted.

Display feature importances
# Display the feature importances as a sorted table
display(feat_imp_df)
   Feature             Importance
2  petal length (cm)     0.906143
3  petal width (cm)      0.077186
1  sepal width (cm)      0.016670
0  sepal length (cm)     0.000000
Thinking

The user goal is "show feature importances as a bar chart". Looking at the notebook state:

  • Cell 6 generated a bar chart of feature importances.
  • Cell 7 displayed the numeric values of these importances in a sorted table (feat_imp_df).
  • Output 7 shows:
    • petal length (cm): 0.906143
    • petal width (cm): 0.077186
    • sepal width (cm): 0.016670
    • sepal length (cm): 0.000000

The user has both the visual (bar chart) and the numeric evidence (sorted table). The goal is fully satisfied.

Decision: Path A. Analysis:

  • Petal length is by far the most important feature (~90.6%).
  • Petal width is the second most important (~7.7%).
  • Sepal width has minimal influence (~1.7%).
  • Sepal length has no importance (0.0) for this specific Decision Tree model.

The feature importance bar chart and the corresponding data show that "petal length (cm)" is the dominant predictor for the model, accounting for approximately 90.6% of the importance. "petal width (cm)" provides a secondary contribution of about 7.7%, while "sepal width (cm)" has a negligible impact (1.7%) and "sepal length (cm)" contributes nothing to the decision process of this specific tree.

Score cards for gemma4:31b for Iris Species Classification with Decision Tree

Total score: 10/10

  • Task Completion: 2/2

    Excellent

    All major requested steps were completed: loading/splitting the Iris data, training the decision tree, plotting the confusion matrix, and showing feature importances.

  • Execution Correctness: 2/2

    Excellent

    The code appears runnable and the outputs are consistent with the workflow, including the split shapes, accuracy, confusion matrix, and feature importance plot.

  • Output Quality: 3/3

    Excellent

    Outputs match the expected outcomes semantically: 80/20 split, accuracy of 1.0, a valid confusion matrix heatmap, and a feature-importance bar chart showing petal length and petal width as dominant.

  • Reasoning Quality: 2/2

    Excellent

    The notebook reasoning is clear and follows the task sequence appropriately, with correct interpretation of the model results and feature importances.

  • Reliability: 1/1

    Excellent

    The workflow is consistent and robust overall, with only a minor seaborn deprecation warning that does not affect the results.

Try MLJAR Studio

Run the same type of AI-powered data analysis on your own datasets with conversational notebooks in MLJAR Studio.

Explore More AI Analysis Examples

Discover additional workflows across categories. Each example includes prompts, conversation outputs, and model-level scoring so you can compare approaches and results.