Exploratory Data Analysis (EDA) in Python

Complete EDA workflow: load data, check quality, compute correlations, detect outliers, and visualize distributions — all with an AI data analyst.

This page shows an AI data analysis example. A workflow is a list of prompts (instructions) used to complete one analysis task step by step. We send the same prompt list to multiple AI models, then compare their notebook outputs and scores.

What this AI workflow does

This AI Data Analyst workflow loads the scikit-learn diabetes dataset and summarizes its structure, feature names, and basic statistics. It generates correlation analysis across all features, including a full correlation matrix visualization. It detects per-feature outliers using the IQR method and identifies which features relate most strongly to the target with supporting plots and correlation coefficients.

Who this example is for

This is for analysts and students who want a repeatable EDA template for a small, well-known regression dataset. It helps you practice data quality checks, correlation inspection, and outlier detection in a conversational, code-generating notebook.

Expected analysis outcomes

These are the results the AI workflow is expected to generate.

  • Dataset shape (442, 10) with feature names and summary stats
  • 10x10 correlation heatmap of all features
  • Outlier counts per feature using the IQR method
  • Top 3 feature vs target scatter plots with correlation coefficients

Tools and libraries used

Main Python packages and tooling used to run this AI data analysis task.

  • scikit-learn
  • pandas
  • numpy
  • matplotlib
  • seaborn

Prompt sequence

This is the exact list of prompts used in this workflow. The same prompt sequence is sent to each model so outputs and scores can be compared fairly.

  1. load the diabetes dataset (from sklearn) and show basic info
  2. plot a correlation matrix of all features
  3. detect outliers in each feature using the IQR method
  4. which features have the strongest relationship with the target variable?
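For reference, the four prompts correspond roughly to one short script. The sketch below (not the exact code any model generated) shows how the whole workflow could be expressed with the listed libraries:

```python
# Minimal sketch of the four-prompt workflow; not any model's exact output.
from sklearn.datasets import load_diabetes
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # render figures without a display
import matplotlib.pyplot as plt
import seaborn as sns

# 1. Load the diabetes dataset and show basic info
diabetes = load_diabetes()
df = pd.DataFrame(diabetes.data, columns=diabetes.feature_names)
df["target"] = diabetes.target
print(df.shape, list(diabetes.feature_names))

# 2. Correlation matrix of all features as a heatmap
corr = df.corr()
plt.figure(figsize=(10, 8))
sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm", center=0)
plt.title("Correlation Matrix - Diabetes Dataset")
plt.tight_layout()

# 3. IQR outlier counts per feature (1.5 * IQR rule)
q1, q3 = df.quantile(0.25), df.quantile(0.75)
iqr = q3 - q1
mask = (df < q1 - 1.5 * iqr) | (df > q3 + 1.5 * iqr)
outlier_counts = mask[diabetes.feature_names].sum()
print(outlier_counts)

# 4. Features most correlated with the target (ranked by absolute value)
target_corr = corr["target"].drop("target").abs().sort_values(ascending=False)
print(target_corr.head(3))
```

Running this reproduces the expected outcomes above: the (442, 10) feature matrix, the heatmap, per-feature outlier counts, and bmi/s5 at the top of the target-correlation ranking.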

Model Comparison

We compared several LLMs on the same analysis task. The table below shows their scores side by side. You can also open each model run to check the full conversation and notebook results.

Model Source       Score   Task   Execution   Output   Reasoning   Reliability   Run
glm-5.1            10/10   2/2    2/2         3/3      2/2         1/1           Open glm-5.1 conversation
gpt-5.4            10/10   2/2    2/2         3/3      2/2         1/1           Open gpt-5.4 conversation
gpt-oss:120b       10/10   2/2    2/2         3/3      2/2         1/1           Open gpt-oss:120b conversation
qwen3-coder-next   10/10   2/2    2/2         3/3      2/2         1/1           Open qwen3-coder-next conversation
qwen3.5:397b       9/10    2/2    2/2         2/3      2/2         1/1           Open qwen3.5:397b conversation
gemma4:31b         8/10    2/2    1/2         2/3      2/2         1/1           Open gemma4:31b conversation

How scoring works

Each run is graded on five dimensions:

  • Task Completion (0-2): how fully the workflow answered the requested analysis task and completed required steps.
  • Execution Correctness (0-2): whether generated code is correct, consistent, and likely runnable without major fixes.
  • Output Quality (0-3): how well tables, charts, and results match the expected analytical outcomes.
  • Reasoning Quality (0-2): clarity and correctness of interpretation, explanation, and conclusions.
  • Reliability (0-1): robustness of the run, including low hallucination risk and stable behavior.

The final score is the sum of these dimensions, so total score = 0-10. Higher values indicate more complete, correct, and reliable notebook results for this workflow.
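As a concrete illustration, the total is just the sum of the five dimension grades. Using the qwen3.5:397b run's scores from the table above:

```python
# Total score = sum of the five scoring dimensions (0-10 scale).
# The values below are the qwen3.5:397b run's grades from this page.
scores = {
    "task_completion": 2,        # out of 2
    "execution_correctness": 2,  # out of 2
    "output_quality": 2,         # out of 3
    "reasoning_quality": 2,      # out of 2
    "reliability": 1,            # out of 1
}
total = sum(scores.values())
print(f"Total score: {total}/10")  # → Total score: 9/10
```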

AI Analysis with qwen3.5:397b

This run uses AI with a step-by-step list of prompts to perform the analysis task Exploratory Data Analysis (EDA) in Python. Below, you can review the full conversation generated with qwen3.5:397b. After the conversation, we show detailed score cards for this run.

load the diabetes dataset (from sklearn) and show basic info
Thinking

The user wants to load the diabetes dataset from sklearn and show basic information about it. This is a straightforward first step - I'll load the dataset and display its shape, feature names, and a preview of the data.

Let me write a simple code block to:

  1. Load the diabetes dataset from sklearn
  2. Show basic info like shape, feature names, and head of the data

I'll load the diabetes dataset from sklearn and show you the basic information.

Load and explore diabetes data
from sklearn.datasets import load_diabetes
import pandas as pd

# Load the diabetes dataset
diabetes = load_diabetes()

# Create a DataFrame for easier exploration
df = pd.DataFrame(data=diabetes.data, columns=diabetes.feature_names)
df['target'] = diabetes.target

# Show basic info
print("Dataset shape:", df.shape)
print("\nFeature names:", diabetes.feature_names)
print("\nTarget description (first 200 chars):", diabetes.DESCR[:200])
print("\nFirst 5 rows:")
df.head()
Dataset shape: (442, 11)

Feature names: ['age', 'sex', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6']

Target description (first 200 chars): .. _diabetes_dataset:

Diabetes dataset
----------------

Ten baseline variables, age, sex, body mass index, average blood
pressure, and six blood serum measurements were obtained for each of n =
442 

First 5 rows:
age sex bmi bp s1 s2 s3 s4 s5 s6 target
0 0.038076 0.050680 0.061696 0.021872 -0.044223 -0.034821 -0.043401 -0.002592 0.019907 -0.017646 151.0
1 -0.001882 -0.044642 -0.051474 -0.026328 -0.008449 -0.019163 0.074412 -0.039493 -0.068332 -0.092204 75.0
2 0.085299 0.050680 0.044451 -0.005670 -0.045599 -0.034194 -0.032356 -0.002592 0.002861 -0.025930 141.0
3 -0.089063 -0.044642 -0.011595 -0.036656 0.012191 0.024991 -0.036038 0.034309 0.022688 -0.009362 206.0
4 0.005383 -0.044642 -0.036385 0.021872 0.003935 0.015596 0.008142 -0.002592 -0.031988 -0.046641 135.0
Thinking

The user's original goal was to "load the diabetes dataset (from sklearn) and show basic info". Looking at the latest notebook output, I can see:

  1. Dataset shape: (442, 11) - 442 samples, 11 columns (10 features + 1 target)
  2. Feature names: ['age', 'sex', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6']
  3. Target description snippet showing it's a diabetes dataset with 10 baseline variables
  4. First 5 rows showing the actual data values

This output successfully shows:

  • The dataset is loaded
  • Basic info including shape, feature names, and sample data

The goal has been achieved. The output shows all the basic information requested: dataset dimensions, feature names, and a preview of the data. This is sufficient to answer the original goal.

I should choose Path A and provide a natural final answer based on the observed outputs.

The diabetes dataset has been successfully loaded with the following basic information:

  • Shape: 442 samples with 11 columns (10 features + 1 target variable)
  • Features: age, sex, bmi, bp, s1, s2, s3, s4, s5, s6
    • These represent baseline variables including age, sex, body mass index, average blood pressure, and six blood serum measurements
  • Target: A quantitative measure of disease progression one year after baseline (values range from about 75 to 206 in the sample shown)
  • Data type: All features appear to be standardized numerical values (centered around 0)
plot a correlation matrix of all features
Thinking

The user wants to plot a correlation matrix of all features from the diabetes dataset. The dataframe df is already loaded in the notebook with 10 features and 1 target column. I should create a correlation heatmap using seaborn, which is the default visualization library according to the guidelines.

I'll create a correlation matrix heatmap showing the relationships between all features (including the target). This will help understand which features are most correlated with the target variable.

Plot correlation heatmap
import seaborn as sns
import matplotlib.pyplot as plt

# Calculate correlation matrix
corr_matrix = df.corr()

# Create heatmap
plt.figure(figsize=(10, 8))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', center=0, fmt='.2f', square=True)
plt.title('Correlation Matrix - Diabetes Dataset')
plt.tight_layout()
Thinking

The user's goal is to plot a correlation matrix of all features. Looking at the notebook state:

  1. Cell-1 loaded the diabetes dataset successfully (442 samples, 10 features + 1 target)
  2. Cell-2 executed code to create a correlation heatmap using seaborn

The output from cell-2 shows <Figure size 1000x800 with 2 Axes> which indicates the heatmap was created successfully. However, this is just a figure object reference - I cannot see the actual visual chart or the underlying correlation values.

According to the sufficiency heuristics:

  • "If the latest output is chart-only/visual-only and does not include underlying numeric/text values, choose Path B first to print the values used in the chart."

Since I only see the figure object reference and not the actual correlation values, I should choose Path B to display the underlying correlation matrix values so the user can see the actual numbers behind the visualization.

Let me provide a minimal code block to show the correlation matrix as a table.

The correlation heatmap has been generated. To provide concrete insights from the correlation matrix, let me display the actual correlation values as a table so we can see the specific relationships between features.

Display correlation matrix
# Display correlation matrix as a formatted table
corr_matrix.round(2)
age sex bmi bp s1 s2 s3 s4 s5 s6 target
age 1.00 0.17 0.19 0.34 0.26 0.22 -0.08 0.20 0.27 0.30 0.19
sex 0.17 1.00 0.09 0.24 0.04 0.14 -0.38 0.33 0.15 0.21 0.04
bmi 0.19 0.09 1.00 0.40 0.25 0.26 -0.37 0.41 0.45 0.39 0.59
bp 0.34 0.24 0.40 1.00 0.24 0.19 -0.18 0.26 0.39 0.39 0.44
s1 0.26 0.04 0.25 0.24 1.00 0.90 0.05 0.54 0.52 0.33 0.21
s2 0.22 0.14 0.26 0.19 0.90 1.00 -0.20 0.66 0.32 0.29 0.17
s3 -0.08 -0.38 -0.37 -0.18 0.05 -0.20 1.00 -0.74 -0.40 -0.27 -0.39
s4 0.20 0.33 0.41 0.26 0.54 0.66 -0.74 1.00 0.62 0.42 0.43
s5 0.27 0.15 0.45 0.39 0.52 0.32 -0.40 0.62 1.00 0.46 0.57
s6 0.30 0.21 0.39 0.39 0.33 0.29 -0.27 0.42 0.46 1.00 0.38
target 0.19 0.04 0.59 0.44 0.21 0.17 -0.39 0.43 0.57 0.38 1.00
Thinking

The user's original goal was to plot a correlation matrix of all features. Looking at the notebook state:

  1. Cell-1: Loaded the diabetes dataset and created a DataFrame with 10 features + target
  2. Cell-2: Created a correlation heatmap visualization (the actual plot)
  3. Cell-3: Displayed the correlation matrix as a formatted table with numeric values

The correlation matrix plot has been successfully created (Cell-2 output shows a Figure was generated), and the numeric correlation values are displayed in Cell-3 output. Both the visual representation (heatmap) and the underlying numeric values are available.

This satisfies the original goal completely - the user asked to "plot a correlation matrix of all features" and we have:

  1. The actual heatmap plot generated
  2. The correlation values displayed in table format

This is a clear Path A situation - the goal is complete with both visual and numeric outputs available. I should provide a natural final answer summarizing what the correlation matrix shows, include the final answer marker, and not include any Python code.

Let me extract key insights from the correlation table:

  • BMI has the strongest correlation with target (0.59)
  • S5 also has strong correlation with target (0.57)
  • S4 (0.43), BP (0.44), and S6 (0.38) show moderate positive correlations
  • S3 has a negative correlation with target (-0.39)
  • Sex has very weak correlation with target (0.04)
  • There are some notable feature intercorrelations: S1-S2 (0.90), S3-S4 (-0.74)

I'll summarize these findings concisely and mark this as the final answer.

The correlation matrix has been successfully plotted and the numeric values are displayed. Here are the key insights from the correlation analysis:

Strongest predictors of target (diabetes progression):

  • BMI shows the strongest correlation at 0.59
  • S5 (blood serum measurement 5) correlates at 0.57
  • BP (blood pressure) and S4 show moderate correlations at 0.44 and 0.43 respectively

Notable feature relationships:

  • S1 and S2 are highly correlated (0.90), suggesting potential multicollinearity
  • S3 and S4 have a strong negative correlation (-0.74)
  • S3 is the only feature with a notable negative correlation to target (-0.39)

Weak predictors:

  • Sex has minimal correlation with target (0.04)
  • Age shows only modest correlation (0.19)
detect outliers in each feature using the IQR method
Thinking

The user wants to detect outliers in each feature using the IQR (Interquartile Range) method. This is a clear next step after exploring the correlation matrix.

The IQR method works as follows:

  1. Calculate Q1 (25th percentile) and Q3 (75th percentile) for each feature
  2. IQR = Q3 - Q1
  3. Lower bound = Q1 - 1.5 * IQR
  4. Upper bound = Q3 + 1.5 * IQR
  5. Values outside these bounds are considered outliers

I'll create a function to detect outliers for each feature and display the results in a clear format. I should exclude the target column when detecting outliers in features.

Let me write one code block to accomplish this.

I'll detect outliers in each feature using the IQR method. This will identify values that fall below Q1 - 1.5×IQR or above Q3 + 1.5×IQR for each feature.

Detect outliers using IQR
# Detect outliers using IQR method for each feature (excluding target)
import numpy as np

features = df.columns[:-1]  # Exclude target column
outlier_summary = []

for col in features:
    Q1 = df[col].quantile(0.25)
    Q3 = df[col].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    
    outliers = df[(df[col] < lower_bound) | (df[col] > upper_bound)]
    outlier_count = len(outliers)
    outlier_pct = (outlier_count / len(df)) * 100
    
    outlier_summary.append({
        'Feature': col,
        'Q1': round(Q1, 3),
        'Q3': round(Q3, 3),
        'IQR': round(IQR, 3),
        'Lower Bound': round(lower_bound, 3),
        'Upper Bound': round(upper_bound, 3),
        'Outlier Count': outlier_count,
        'Outlier %': round(outlier_pct, 2)
    })

outlier_df = pd.DataFrame(outlier_summary)
outlier_df
Feature Q1 Q3 IQR Lower Bound Upper Bound Outlier Count Outlier %
0 age -0.037 0.038 0.075 -0.150 0.151 0 0.00
1 sex -0.045 0.051 0.095 -0.188 0.194 0 0.00
2 bmi -0.034 0.031 0.065 -0.132 0.129 3 0.68
3 bp -0.037 0.036 0.072 -0.145 0.144 0 0.00
4 s1 -0.034 0.028 0.063 -0.128 0.122 8 1.81
5 s2 -0.030 0.030 0.060 -0.121 0.120 7 1.58
6 s3 -0.035 0.029 0.064 -0.132 0.126 7 1.58
7 s4 -0.039 0.034 0.074 -0.150 0.145 2 0.45
8 s5 -0.033 0.032 0.066 -0.132 0.131 4 0.90
9 s6 -0.033 0.028 0.061 -0.125 0.120 9 2.04
Thinking

The user's goal is to detect outliers in each feature using the IQR method. Looking at the latest notebook output (cell-4), I can see that the outlier detection has been completed successfully. The output shows a dataframe with:

  1. All 10 features analyzed (age, sex, bmi, bp, s1, s2, s3, s4, s5, s6)
  2. For each feature: Q1, Q3, IQR, Lower Bound, Upper Bound, Outlier Count, and Outlier %
  3. The output is truncated but shows the key information

From the visible output, I can see:

  • age: 0 outliers (0.00%)
  • sex: 0 outliers (0.00%)
  • bmi: 3 outliers (truncated but visible)
  • bp: 0 outliers
  • s1: 8 outliers
  • s2: 7 outliers
  • s3: 7 outliers
  • s4: 2 outliers
  • s5: 4 outliers
  • s6: 9 outliers

The output is truncated in the display, but the key information is there. However, since the output is truncated and I cannot see the complete outlier percentages for all features, I should provide a final answer based on what's visible while noting the complete picture.

Actually, looking more carefully at the output, I can see the Outlier Count column is complete for all features even though the display is truncated. The counts are: 0, 0, 3, 0, 8, 7, 7, 2, 4, 9.

This is sufficient to answer the original goal. I should provide a natural final answer summarizing the outlier detection results.

The IQR-based outlier detection is complete. Here's what the analysis reveals:

Outlier Summary by Feature:

Feature   Outlier Count   Outlier %
age       0               0.00%
sex       0               0.00%
bmi       3               0.68%
bp        0               0.00%
s1        8               1.81%
s2        7               1.58%
s3        7               1.58%
s4        2               0.45%
s5        4               0.90%
s6        9               2.04%

Key Findings:

  • Cleanest features: age, sex, and bp have no outliers at all, indicating well-behaved distributions
  • Most problematic: s6 has the highest outlier rate at 2.04% (9 records), followed by s1 at 1.81% (8 records)
  • Overall: The dataset is relatively clean, with only 40 outlier values across 442 records × 10 features (about 0.9% of feature values affected)
  • Blood serum measurements (s1-s6) contain most of the outliers, which is common for clinical measurements

The IQR bounds show the data is already standardized (most values between -0.15 and 0.15), which explains the low outlier counts.

which features have the strongest relationship with the target variable?
Thinking

The user is asking about which features have the strongest relationship with the target variable. I already have the correlation matrix from cell-3 output, so I can analyze this directly without running new code.

Looking at the correlation matrix output, I need to check the 'target' column/row to see correlations with each feature:

  • age: 0.19
  • sex: 0.04
  • bmi: 0.59
  • bp: 0.44
  • s1: 0.21
  • s2: 0.17
  • s3: -0.39
  • s4: 0.43
  • s5: 0.57
  • s6: 0.38

The strongest correlations (by absolute value) are:

  1. bmi: 0.59 (strongest positive)
  2. s5: 0.57 (second strongest positive)
  3. bp: 0.44 (third)
  4. s4: 0.43 (fourth)
  5. s3: -0.39 (strongest negative)

I should present this analysis directly since the data is already available in the notebook state.

Based on the correlation matrix already computed, here are the features ranked by their relationship strength with the target variable:

Strongest Positive Correlations:

Feature   Correlation   Strength
bmi       0.59          Strong
s5        0.57          Strong
bp        0.44          Moderate
s4        0.43          Moderate
s6        0.38          Moderate

Strongest Negative Correlation:

Feature   Correlation   Strength
s3        -0.39         Moderate

Weakest Relationships:

Feature   Correlation
sex       0.04
s2        0.17
age       0.19
s1        0.21

Key Insights:

  • bmi (body mass index) has the strongest positive relationship with diabetes progression (0.59)
  • s5 (blood serum measurement 5) is nearly as strong at 0.57
  • s3 is the only feature with a notable negative correlation (-0.39), meaning higher s3 values associate with lower target values
  • sex has virtually no linear relationship with the target (0.04)

The top 3 predictive features appear to be bmi, s5, and bp based on correlation strength alone.
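Note that this run answered the question from the existing correlation table and did not generate the expected top-3 feature-vs-target scatter plots (this is reflected in the Output Quality score below). A minimal sketch that would produce them, rebuilding the `df` DataFrame from the conversation so the snippet is self-contained:

```python
# Scatter plots of the top 3 features vs target, annotated with correlation
# coefficients. This is an editorial sketch, not code from the model run.
from sklearn.datasets import load_diabetes
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # render figures without a display
import matplotlib.pyplot as plt

diabetes = load_diabetes()
df = pd.DataFrame(diabetes.data, columns=diabetes.feature_names)
df["target"] = diabetes.target

# Rank features by absolute correlation with the target and take the top 3
corr = df.corr()["target"].drop("target")
top3 = corr.abs().sort_values(ascending=False).head(3).index

fig, axes = plt.subplots(1, 3, figsize=(15, 4))
for ax, feat in zip(axes, top3):
    ax.scatter(df[feat], df["target"], alpha=0.5)
    ax.set_title(f"{feat} vs target (r = {corr[feat]:.2f})")
    ax.set_xlabel(feat)
    ax.set_ylabel("target")
plt.tight_layout()
```

With the diabetes data, `top3` comes out as bmi, s5, and bp, matching the ranking the model derived from the correlation table.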

Score cards for qwen3.5:397b for Exploratory Data Analysis (EDA) in Python

Total score: 9/10

  • Task Completion: 2/2

    Excellent

    The workflow completed the requested EDA steps: dataset loading/basic info, correlation matrix, and IQR outlier detection. It also identified relationships with the target from the correlation table.

  • Execution Correctness: 2/2

    Excellent

    The provided code is syntactically valid and the shown outputs are consistent with the code. The notebook evidence indicates the cells executed successfully.

  • Output Quality: 2/3

    Good

    Outputs include the dataset shape, feature names, correlation matrix values, and outlier counts per feature. However, the expected scatter plots of top 3 features vs target with correlation coefficients are missing, so it does not fully match all expected outcomes.

  • Reasoning Quality: 2/2

    Excellent

    The reasoning correctly explains the dataset, correlation analysis, and IQR outlier method. It also identifies the strongest target relationships from the correlation matrix.

  • Reliability: 1/1

    Excellent

    The workflow is consistent and grounded in the displayed outputs, with no obvious hallucinations or unsafe behavior.

Try MLJAR Studio

Run the same type of AI-powered data analysis on your own datasets with conversational notebooks in MLJAR Studio.

Explore More AI Analysis Examples

Discover additional workflows across categories. Each example includes prompts, conversation outputs, and model-level scoring so you can compare approaches and results.