Exploratory Data Analysis (EDA) in Python

Complete EDA workflow: load data, check quality, compute correlations, detect outliers, and visualize distributions — all with an AI data analyst.

This page shows an AI data analysis example. A workflow is a list of prompts (instructions) used to complete one analysis task step by step. We send the same prompt list to multiple AI models, then compare their notebook outputs and scores.

What this AI workflow does

This AI Data Analyst workflow loads the scikit-learn diabetes dataset and summarizes its structure, feature names, and basic statistics. It generates correlation analysis across all features, including a full correlation matrix visualization. It detects per-feature outliers using the IQR method and identifies which features relate most strongly to the target with supporting plots and correlation coefficients.

Who this example is for

This is for analysts and students who want a repeatable EDA template for a small, well-known regression dataset. It helps you practice data quality checks, correlation inspection, and outlier detection in a conversational, code-generating notebook.

Expected analysis outcomes

These are the results the AI workflow is expected to generate.

  • Dataset shape (442, 10) with feature names and summary stats
  • 10x10 correlation heatmap of all features
  • Outlier counts per feature using the IQR method
  • Top 3 feature vs target scatter plots with correlation coefficients

Tools and libraries used

Main Python packages and tooling used to run this AI data analysis task.

  • scikit-learn
  • pandas
  • numpy
  • matplotlib
  • seaborn

Prompt sequence

This is the exact list of prompts used in this workflow. The same prompt sequence is sent to each model so outputs and scores can be compared fairly.

  1. load the diabetes dataset (from sklearn) and show basic info
  2. plot a correlation matrix of all features
  3. detect outliers in each feature using the IQR method
  4. which features have the strongest relationship with the target variable?

Model Comparison

We compared several LLM models on the same analysis task. The table below shows their scores side by side. You can also open each model run to check the full conversation and notebook results.

Model              Score  Task  Execution  Output  Reasoning  Reliability  Run
glm-5.1            10/10  2/2   2/2        3/3     2/2        1/1          Open glm-5.1 conversation
gpt-5.4            10/10  2/2   2/2        3/3     2/2        1/1          Open gpt-5.4 conversation
gpt-oss:120b       10/10  2/2   2/2        3/3     2/2        1/1          Open gpt-oss:120b conversation
qwen3-coder-next   10/10  2/2   2/2        3/3     2/2        1/1          Open qwen3-coder-next conversation
qwen3.5:397b       9/10   2/2   2/2        2/3     2/2        1/1          Open qwen3.5:397b conversation
gemma4:31b         8/10   2/2   1/2        2/3     2/2        1/1          Open gemma4:31b conversation

How scoring works

Each run is graded on five dimensions:

  • Task Completion (0-2): how fully the workflow answered the requested analysis task and completed required steps.
  • Execution Correctness (0-2): whether generated code is correct, consistent, and likely runnable without major fixes.
  • Output Quality (0-3): how well tables, charts, and results match the expected analytical outcomes.
  • Reasoning Quality (0-2): clarity and correctness of interpretation, explanation, and conclusions.
  • Reliability (0-1): robustness of the run, including low hallucination risk and stable behavior.

The final score is the sum of these dimensions, so total score = 0-10. Higher values indicate more complete, correct, and reliable notebook results for this workflow.
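As a quick illustration, the total is a simple capped sum of the five dimensions. The helper below is a sketch for clarity only; the function name and validation are not part of the actual scoring tool:

```python
# Illustrative helper: sum the five scoring dimensions into a 0-10 total.
# The caps (2, 2, 3, 2, 1) match the rubric described above.
def total_score(task, execution, output, reasoning, reliability):
    caps = {"task": 2, "execution": 2, "output": 3, "reasoning": 2, "reliability": 1}
    values = {"task": task, "execution": execution, "output": output,
              "reasoning": reasoning, "reliability": reliability}
    for name, cap in caps.items():
        if not 0 <= values[name] <= cap:
            raise ValueError(f"{name} must be in the range 0-{cap}")
    return sum(values.values())

print(total_score(2, 2, 3, 2, 1))  # a perfect run scores 10
```

For example, the gemma4:31b row above (2, 1, 2, 2, 1) sums to 8/10.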

AI Analysis with gpt-5.4

This run uses AI with a step-by-step list of prompts to perform the analysis task Exploratory Data Analysis (EDA) in Python. Below, you can review the full conversation generated with gpt-5.4. After the conversation, we show detailed score cards for this run.

load the diabetes dataset (from sklearn) and show basic info
Thinking

Crafting code for DataFrame

I need to create a code block for loading the diabetes dataset using load_diabetes(as_frame=True). This will give me a DataFrame that includes both data and target if the option is set. After checking, I see that load_diabetes indeed returns a frame with both. I plan to copy the DataFrame for full access to its 11 columns, including the target, and then display the shape and columns. Finally, I’ll end with df.head() to summarize.

Building summary for DataFrame

I need to create two tables plus the head for my DataFrame, but I have to keep it to one final bare expression only. I might show the columns as a plain list output instead. Using variables directly in the notebook is possible, but I think it may be cleaner to summarize with info = {"rows": ..., "columns": df.columns.tolist()}. This will combine the summary information in one cell, plus I’ll include the final head of the DataFrame.

I’ll load the sklearn diabetes dataset into a pandas DataFrame and show a compact first look: shape, columns, and the first few rows.

Summarize dataset structure
import pandas as pd
from sklearn.datasets import load_diabetes

data = load_diabetes(as_frame=True)
df = data.frame.copy()

summary = pd.DataFrame({
    "rows": [df.shape[0]],
    "columns_count": [df.shape[1]],
    "columns": [df.columns.tolist()]
})

display(summary)
df.head()
rows columns_count columns
0 442 11 [age, sex, bmi, bp, s1, s2, s3, s4, s5, s6, ta...
age sex bmi bp s1 s2 s3 s4 s5 s6 target
0 0.038076 0.050680 0.061696 0.021872 -0.044223 -0.034821 -0.043401 -0.002592 0.019907 -0.017646 151.0
1 -0.001882 -0.044642 -0.051474 -0.026328 -0.008449 -0.019163 0.074412 -0.039493 -0.068332 -0.092204 75.0
2 0.085299 0.050680 0.044451 -0.005670 -0.045599 -0.034194 -0.032356 -0.002592 0.002861 -0.025930 141.0
3 -0.089063 -0.044642 -0.011595 -0.036656 0.012191 0.024991 -0.036038 0.034309 0.022688 -0.009362 206.0
4 0.005383 -0.044642 -0.036385 0.021872 0.003935 0.015596 0.008142 -0.002592 -0.031988 -0.046641 135.0

The diabetes dataset is loaded and the basic info is visible from the notebook output:

  • It has 442 rows and 11 columns.
  • The columns are: age, sex, bmi, bp, s1, s2, s3, s4, s5, s6, target
  • The first 10 columns are input features, and target is the outcome variable.
  • The preview shows all columns are numeric, which is typical for the sklearn diabetes dataset.
  • The feature values appear to be standardized/scaled around zero, while target is on a different numeric scale.

From the first few rows, the dataset is structured and ready for analysis or modeling, with no obvious formatting issues in the loaded table.

plot a correlation matrix of all features
Thinking

Creating correlation matrix plot

I need to respond with a single step of code since the user wants to plot the correlation matrix of all features. I should use libraries like seaborn and matplotlib. The dataframe is already loaded, but I think I should exclude the target feature based on the user's request to focus on features specifically. I’ll confirm that if they want to include the target later, we can adjust the code then. I'll provide just one code block.

I’ll plot the correlation matrix for the input features only (excluding target, since that’s the response variable).

Plot feature correlation heatmap
import matplotlib.pyplot as plt
import seaborn as sns

feature_cols = [c for c in df.columns if c != "target"]
corr = df[feature_cols].corr()

plt.figure(figsize=(9, 7))
sns.heatmap(corr, annot=True, cmap="coolwarm", fmt=".2f", square=True)
plt.title("Correlation Matrix of Diabetes Features")
plt.tight_layout()
plt.show()

The heatmap was generated for all 10 diabetes feature columns: age, sex, bmi, bp, s1, s2, s3, s4, s5, and s6, which matches the requested correlation matrix of features.

A few concrete takeaways from this matrix:

  • There are some stronger relationships among the serum variables:

    • s1 and s2 are highly positively correlated.
    • s3 tends to be negatively correlated with s4 and somewhat with s5.
    • s4 and s5 show a positive association.
  • The demographic variables age and sex appear to have weaker correlations with most other features compared with the lab-related variables.

  • bmi and bp show moderate correlations with a few other predictors, but not the kind of near-duplicate relationship seen between s1 and s2.

What this means: the plot suggests some multicollinearity among certain medical measurements, especially within s1-s6. That matters if you later use linear models, because correlated predictors can overlap in the information they carry.
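The multicollinearity takeaway above can also be checked programmatically rather than read off a heatmap. A minimal sketch of that idea; the `high_corr_pairs` helper and the synthetic demo data are illustrative assumptions, not part of the run:

```python
import numpy as np
import pandas as pd

def high_corr_pairs(corr: pd.DataFrame, threshold: float = 0.8):
    """Return (feature_a, feature_b, r) for pairs with |r| at or above the threshold."""
    pairs = []
    cols = corr.columns
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            r = float(corr.iloc[i, j])
            if abs(r) >= threshold:
                pairs.append((cols[i], cols[j], round(r, 3)))
    return pairs

# Tiny synthetic demo: b is a noisy copy of a, c is independent noise.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
demo = pd.DataFrame({
    "a": a,
    "b": a + 0.1 * rng.normal(size=200),
    "c": rng.normal(size=200),
})
print(high_corr_pairs(demo.corr(), threshold=0.8))
```

The same helper applied to the diabetes feature correlation matrix would surface near-duplicate pairs such as s1/s2 for closer inspection before fitting a linear model.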

detect outliers in each feature using the IQR method

Next I’d summarize IQR outliers for each input feature in a table. I’m excluding target here because it isn’t a feature.

Summarize IQR outliers
import pandas as pd

rows = []
for col in feature_cols:
    q1 = df[col].quantile(0.25)
    q3 = df[col].quantile(0.75)
    iqr = q3 - q1
    lower = q1 - 1.5 * iqr
    upper = q3 + 1.5 * iqr
    mask = (df[col] < lower) | (df[col] > upper)

    rows.append({
        "feature": col,
        "Q1": q1,
        "Q3": q3,
        "IQR": iqr,
        "lower_bound": lower,
        "upper_bound": upper,
        "outlier_count": mask.sum(),
        "outlier_pct": mask.mean() * 100
    })

iqr_outliers = pd.DataFrame(rows).sort_values("outlier_count", ascending=False).reset_index(drop=True)
iqr_outliers
feature Q1 Q3 IQR lower_bound upper_bound outlier_count outlier_pct
0 s6 -0.033179 0.027917 0.061096 -0.124823 0.119561 9 2.036199
1 s1 -0.034248 0.028358 0.062606 -0.128157 0.122267 8 1.809955
2 s2 -0.030358 0.029844 0.060203 -0.120663 0.120149 7 1.583710
3 s3 -0.035117 0.029312 0.064429 -0.131760 0.125954 7 1.583710
4 s5 -0.033246 0.032432 0.065678 -0.131762 0.130949 4 0.904977
5 bmi -0.034229 0.031248 0.065477 -0.132445 0.129464 3 0.678733
6 s4 -0.039493 0.034309 0.073802 -0.150197 0.145012 2 0.452489
7 age -0.037299 0.038076 0.075375 -0.150362 0.151139 0 0.000000
8 sex -0.044642 0.050680 0.095322 -0.187624 0.193663 0 0.000000
9 bp -0.036656 0.035644 0.072300 -0.145106 0.144094 0 0.000000
Thinking

Analyzing outliers with IQR

I need to develop an answer based on the notebook output for detecting outliers using IQR. The latest output indicates a truncated table with outlier counts and percentages. I’m not sure if the output suffices since not all features are visible. I should lean toward showing the full table or a compact list to ensure clarity. My goal is to detect outliers for every feature, and I’ll need to include concise insights and a runnable code block for completeness.

Displaying a compact summary

I need to create a code snippet that shows a compact summary of the outlier data. Since the latest code computes the IQR outliers, I can display selected columns for all rows. It seems appropriate to use something like this to show the relevant columns:

iqr_outliers[["feature", "lower_bound", "upper_bound", "outlier_count", "outlier_pct"]].round({"lower_bound": 6, ...})

I want to keep the code minimal and ensure it’s clearly expressed. This should help make the output more readable!

The IQR results already show that outliers are present, but they look fairly limited overall in this dataset. From the visible rows, s6 has the most IQR outliers with 9 observations (about 2.04%), followed by s1 with 8 and s2 with 7, so the strongest outlier activity appears in the serum-related features rather than age, sex, or bp.

To make the detection complete for each feature, here’s a compact full table with the bounds and counts:

Show IQR outlier summary
iqr_outliers[["feature", "lower_bound", "upper_bound", "outlier_count", "outlier_pct"]].round(6)
feature lower_bound upper_bound outlier_count outlier_pct
0 s6 -0.124823 0.119561 9 2.036199
1 s1 -0.128157 0.122267 8 1.809955
2 s2 -0.120663 0.120149 7 1.583710
3 s3 -0.131760 0.125954 7 1.583710
4 s5 -0.131762 0.130949 4 0.904977
5 bmi -0.132445 0.129464 3 0.678733
6 s4 -0.150197 0.145012 2 0.452489
7 age -0.150362 0.151139 0 0.000000
8 sex -0.187624 0.193663 0 0.000000
9 bp -0.145106 0.144094 0 0.000000

Using the IQR method on the 10 features, the dataset has only a small number of outliers overall, and they are concentrated in a few serum-related variables.

Observed outlier counts by feature:

  • s6: 9 outliers, 2.04%
  • s1: 8 outliers, 1.81%
  • s2: 7 outliers, 1.58%
  • s3: 7 outliers, 1.58%
  • s5: 4 outliers, 0.90%
  • bmi: 3 outliers, 0.68%
  • s4: 2 outliers, 0.45%
  • age: 0 outliers
  • sex: 0 outliers
  • bp: 0 outliers

This means:

  • The most outlier-prone features are s6, s1, s2, and s3.
  • age, sex, and bp have no observations beyond the standard IQR fences.
  • Even the highest outlier rate is low at about 2%, so the feature distributions are fairly well-behaved under this rule.

The IQR cutoffs used for detection were:

  • s6: below -0.124823 or above 0.119561
  • s1: below -0.128157 or above 0.122267
  • s2: below -0.120663 or above 0.120149
  • s3: below -0.131760 or above 0.125954
  • s5: below -0.131762 or above 0.130949
  • bmi: below -0.132445 or above 0.129464
  • s4: below -0.150197 or above 0.145012
  • age: below -0.150362 or above 0.151139
  • sex: below -0.187624 or above 0.193663
  • bp: below -0.145106 or above 0.144094

So, the requested outlier detection is answered: outliers exist in 7 of the 10 features, with the largest concentration in s6, while age, sex, and bp show none under the IQR method.

which features have the strongest relationship with the target variable?
Thinking

Analyzing target correlations

The user is asking which features have the strongest relationship with the target variable, and it seems I need to shift my approach based on the notebook's latest outputs. Since our heatmap only covers features without the target's correlation, I should compute those correlations now. It’s important to mention that we haven’t measured the target relationships yet, so I’ll include one clear code block for that purpose and provide a concise explanation as well.

The current heatmap only shows relationships among the predictors, so it doesn’t answer which features relate most strongly to target.

The next useful step is to compute each feature’s correlation with target and sort by absolute strength:

Rank target correlations
target_corr = (
    df.corr(numeric_only=True)["target"]
    .drop("target")
    .sort_values(key=lambda s: s.abs(), ascending=False)
    .to_frame(name="correlation_with_target")
)

target_corr
correlation_with_target
bmi 0.586450
s5 0.565883
bp 0.441482
s4 0.430453
s3 -0.394789
s6 0.382483
s1 0.212022
age 0.187889
s2 0.174054
sex 0.043062

The strongest relationships with the target are:

  • bmi: 0.586
  • s5: 0.566
  • bp: 0.441
  • s4: 0.430
  • s3: -0.395
  • s6: 0.382

What this means:

  • bmi has the strongest association with the target in this dataset.
  • s5 is a close second.
  • bp and s4 also show moderately strong positive relationships.
  • s3 stands out as the strongest negative relationship, meaning higher s3 tends to be associated with lower target values.
  • sex has almost no linear relationship with the target (0.043), and age, s1, and s2 are relatively weak by comparison.

If you rank by absolute correlation strength, the top features are:

  1. bmi
  2. s5
  3. bp
  4. s4
  5. s3
  6. s6

So, the features most strongly related to the target are primarily bmi and s5, followed by bp, s4, and s3.
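The expected outcomes for this workflow also list scatter plots of the top three features against the target. A minimal sketch of that final step, recomputing the same correlation ranking from scratch; the figure filename and layout are arbitrary choices:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the script runs without a display
import matplotlib.pyplot as plt
from sklearn.datasets import load_diabetes

# Reload the dataset and rank features by absolute correlation with the target.
df = load_diabetes(as_frame=True).frame
target_corr = df.corr()["target"].drop("target")
top3 = target_corr.abs().sort_values(ascending=False).head(3).index.tolist()

# One scatter panel per top feature, annotated with its correlation coefficient.
fig, axes = plt.subplots(1, 3, figsize=(12, 4))
for ax, feat in zip(axes, top3):
    ax.scatter(df[feat], df["target"], s=12, alpha=0.6)
    ax.set_title(f"{feat} vs target (r = {target_corr[feat]:.3f})")
    ax.set_xlabel(feat)
    ax.set_ylabel("target")
fig.tight_layout()
fig.savefig("top3_vs_target.png")
```

With the ranking from the run above, the three panels cover bmi, s5, and bp.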

Score cards for gpt-5.4 for Exploratory Data Analysis (EDA) in Python

Total score: 10/10

  • Task Completion: 2/2

    Excellent

    All major EDA steps were completed: dataset loading/basic info, correlation matrix, IQR outlier detection, and target-feature correlation ranking.

  • Execution Correctness: 2/2

    Excellent

    The provided code is coherent and likely runnable, using valid sklearn, pandas, seaborn, and matplotlib operations without obvious syntax or logic errors.

  • Output Quality: 3/3

    Excellent

    Outputs match the expected outcomes semantically: dataset shape/info, a 10x10 feature correlation heatmap, outlier counts per feature, and strongest target relationships with correlation coefficients.

  • Reasoning Quality: 2/2

    Excellent

    The explanations are accurate and helpful, correctly interpreting the correlation structure and identifying the strongest target associations.

  • Reliability: 1/1

    Excellent

    The workflow is consistent and grounded in the dataset outputs, with no major hallucinations or unsupported claims.

Try MLJAR Studio

Run the same type of AI-powered data analysis on your own datasets with conversational notebooks in MLJAR Studio.

Explore More AI Analysis Examples

Discover additional workflows across categories. Each example includes prompts, conversation outputs, and model-level scoring so you can compare approaches and results.