Exploratory Data Analysis (EDA) in Python

Complete EDA workflow: load data, check quality, compute correlations, detect outliers, and visualize distributions — all with an AI data analyst.

This page shows an AI data analysis example. A workflow is a list of prompts (instructions) used to complete one analysis task step by step. We send the same prompt list to multiple AI models, then compare their notebook outputs and scores.

What this AI workflow does

This AI Data Analyst workflow loads the scikit-learn diabetes dataset and summarizes its structure, feature names, and basic statistics. It generates correlation analysis across all features, including a full correlation matrix visualization. It detects per-feature outliers using the IQR method and identifies which features relate most strongly to the target with supporting plots and correlation coefficients.

Who this example is for

This is for analysts and students who want a repeatable EDA template for a small, well-known regression dataset. It helps you practice data quality checks, correlation inspection, and outlier detection in a conversational, code-generating notebook.

Expected analysis outcomes

These are the results the AI workflow is expected to generate.

  • Dataset shape (442, 10) with feature names and summary stats
  • 10x10 correlation heatmap of all features
  • Outlier counts per feature using the IQR method
  • Top 3 feature vs target scatter plots with correlation coefficients

Tools and libraries used

Main Python packages and tooling used to run this AI data analysis task.

  • scikit-learn
  • pandas
  • numpy
  • matplotlib
  • seaborn

Prompt sequence

This is the exact list of prompts used in this workflow. The same prompt sequence is sent to each model so outputs and scores can be compared fairly.

  1. load the diabetes dataset (from sklearn) and show basic info
  2. plot a correlation matrix of all features
  3. detect outliers in each feature using the IQR method
  4. which features have the strongest relationship with the target variable?

Model Comparison

We compared several LLM models on the same analysis task. The table below shows their scores side by side. You can also open each model run to check the full conversation and notebook results.

Model              Score  Task  Execution  Output  Reasoning  Reliability  Run
glm-5.1            10/10  2/2   2/2        3/3     2/2        1/1          Open glm-5.1 conversation
gpt-5.4            10/10  2/2   2/2        3/3     2/2        1/1          Open gpt-5.4 conversation
gpt-oss:120b       10/10  2/2   2/2        3/3     2/2        1/1          Open gpt-oss:120b conversation
qwen3-coder-next   10/10  2/2   2/2        3/3     2/2        1/1          Open qwen3-coder-next conversation
qwen3.5:397b       9/10   2/2   2/2        2/3     2/2        1/1          Open qwen3.5:397b conversation
gemma4:31b         8/10   2/2   1/2        2/3     2/2        1/1          Open gemma4:31b conversation

How scoring works

Each run is graded on five dimensions:

  • Task Completion (0-2): how fully the workflow answered the requested analysis task and completed required steps.
  • Execution Correctness (0-2): whether generated code is correct, consistent, and likely runnable without major fixes.
  • Output Quality (0-3): how well tables, charts, and results match the expected analytical outcomes.
  • Reasoning Quality (0-2): clarity and correctness of interpretation, explanation, and conclusions.
  • Reliability (0-1): robustness of the run, including low hallucination risk and stable behavior.

The final score is the sum of these dimensions, so total score = 0-10. Higher values indicate more complete, correct, and reliable notebook results for this workflow.
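As a quick illustration, the total is a simple capped sum of the five dimensions. The helper below is a sketch for clarity only; the function name and validation are not part of the actual scoring tool:

```python
# Illustrative helper: sum the five scoring dimensions into a 0-10 total.
# The caps (2, 2, 3, 2, 1) match the rubric described above.
def total_score(task, execution, output, reasoning, reliability):
    caps = {"task": 2, "execution": 2, "output": 3, "reasoning": 2, "reliability": 1}
    values = {"task": task, "execution": execution, "output": output,
              "reasoning": reasoning, "reliability": reliability}
    for name, cap in caps.items():
        if not 0 <= values[name] <= cap:
            raise ValueError(f"{name} must be in the range 0-{cap}")
    return sum(values.values())

print(total_score(2, 2, 3, 2, 1))  # a perfect run scores 10
```

For example, the gemma4:31b row above (2, 1, 2, 2, 1) sums to 8/10.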

AI Analysis with gpt-5.4

This run uses AI with a step-by-step list of prompts to perform the analysis task Exploratory Data Analysis (EDA) in Python. Below, you can review the full conversation generated with gpt-5.4. After the conversation, we show detailed score cards for this run.

load the diabetes dataset (from sklearn) and show basic info
Thinking

Crafting code for DataFrame

I need to create a code block for loading the diabetes dataset using load_diabetes(as_frame=True). This will give me a DataFrame that includes both data and target if the option is set. After checking, I see that load_diabetes indeed returns a frame with both. I plan to copy the DataFrame for full access to its 11 columns, including the target, and then display the shape and columns. Finally, I’ll end with df.head() to summarize.

Building summary for DataFrame

I need to create two tables plus the head for my DataFrame, but I have to keep it to one final bare expression only. I might show the columns as a plain list output instead. Using variables directly in the notebook is possible, but I think it may be cleaner to summarize with info = {"rows": ..., "columns": df.columns.tolist()}. This will combine the summary information in one cell, plus I’ll include the final head of the DataFrame.

I’ll load the sklearn diabetes dataset into a pandas DataFrame and show a compact first look: shape, columns, and the first few rows.

Summarize dataset structure
import pandas as pd
from sklearn.datasets import load_diabetes

data = load_diabetes(as_frame=True)
df = data.frame.copy()

summary = pd.DataFrame({
    "rows": [df.shape[0]],
    "columns_count": [df.shape[1]],
    "columns": [df.columns.tolist()]
})

display(summary)
df.head()
rows columns_count columns
0 442 11 [age, sex, bmi, bp, s1, s2, s3, s4, s5, s6, ta...
age sex bmi bp s1 s2 s3 s4 s5 s6 target
0 0.038076 0.050680 0.061696 0.021872 -0.044223 -0.034821 -0.043401 -0.002592 0.019907 -0.017646 151.0
1 -0.001882 -0.044642 -0.051474 -0.026328 -0.008449 -0.019163 0.074412 -0.039493 -0.068332 -0.092204 75.0
2 0.085299 0.050680 0.044451 -0.005670 -0.045599 -0.034194 -0.032356 -0.002592 0.002861 -0.025930 141.0
3 -0.089063 -0.044642 -0.011595 -0.036656 0.012191 0.024991 -0.036038 0.034309 0.022688 -0.009362 206.0
4 0.005383 -0.044642 -0.036385 0.021872 0.003935 0.015596 0.008142 -0.002592 -0.031988 -0.046641 135.0

The diabetes dataset is loaded and the basic info is visible from the notebook output:

  • It has 442 rows and 11 columns.
  • The columns are: age, sex, bmi, bp, s1, s2, s3, s4, s5, s6, target
  • The first 10 columns are input features, and target is the outcome variable.
  • The preview shows all columns are numeric, which is typical for the sklearn diabetes dataset.
  • The feature values appear to be standardized/scaled around zero, while target is on a different numeric scale.

From the first few rows, the dataset is structured and ready for analysis or modeling, with no obvious formatting issues in the loaded table.

plot a correlation matrix of all features
Thinking

Creating correlation matrix plot

I need to respond with a single step of code since the user wants to plot the correlation matrix of all features. I should use libraries like seaborn and matplotlib. The dataframe is already loaded, but I think I should exclude the target feature based on the user's request to focus on features specifically. I’ll confirm that if they want to include the target later, we can adjust the code then. I'll provide just one code block.

I’ll plot the correlation matrix for the input features only (excluding target, since that’s the response variable).

Plot feature correlation heatmap
import matplotlib.pyplot as plt
import seaborn as sns

feature_cols = [c for c in df.columns if c != "target"]
corr = df[feature_cols].corr()

plt.figure(figsize=(9, 7))
sns.heatmap(corr, annot=True, cmap="coolwarm", fmt=".2f", square=True)
plt.title("Correlation Matrix of Diabetes Features")
plt.tight_layout()
plt.show()

The heatmap was generated for all 10 diabetes feature columns: age, sex, bmi, bp, s1, s2, s3, s4, s5, and s6, which matches the requested correlation matrix of features.

A few concrete takeaways from this matrix:

  • There are some stronger relationships among the serum variables:

    • s1 and s2 are highly positively correlated.
    • s3 tends to be negatively correlated with s4 and somewhat with s5.
    • s4 and s5 show a positive association.
  • The demographic variables age and sex appear to have weaker correlations with most other features compared with the lab-related variables.

  • bmi and bp show moderate correlations with a few other predictors, but not the kind of near-duplicate relationship seen between s1 and s2.

What this means: the plot suggests some multicollinearity among certain medical measurements, especially within s1-s6. That matters if you later use linear models, because correlated predictors can overlap in the information they carry.
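The multicollinearity takeaway above can also be checked programmatically rather than read off a heatmap. A minimal sketch of that idea; the `high_corr_pairs` helper and the synthetic demo data are illustrative assumptions, not part of the run:

```python
import numpy as np
import pandas as pd

def high_corr_pairs(corr: pd.DataFrame, threshold: float = 0.8):
    """Return (feature_a, feature_b, r) for pairs with |r| at or above the threshold."""
    pairs = []
    cols = corr.columns
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            r = float(corr.iloc[i, j])
            if abs(r) >= threshold:
                pairs.append((cols[i], cols[j], round(r, 3)))
    return pairs

# Tiny synthetic demo: b is a noisy copy of a, c is independent noise.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
demo = pd.DataFrame({
    "a": a,
    "b": a + 0.1 * rng.normal(size=200),
    "c": rng.normal(size=200),
})
print(high_corr_pairs(demo.corr(), threshold=0.8))
```

The same helper applied to the diabetes feature correlation matrix would surface near-duplicate pairs such as s1/s2 for closer inspection before fitting a linear model.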

detect outliers in each feature using the IQR method

Next I’d summarize IQR outliers for each input feature in a table. I’m excluding target here because it isn’t a feature.

Summarize IQR outliers
import pandas as pd

rows = []
for col in feature_cols:
    q1 = df[col].quantile(0.25)
    q3 = df[col].quantile(0.75)
    iqr = q3 - q1
    lower = q1 - 1.5 * iqr
    upper = q3 + 1.5 * iqr
    mask = (df[col] < lower) | (df[col] > upper)

    rows.append({
        "feature": col,
        "Q1": q1,
        "Q3": q3,
        "IQR": iqr,
        "lower_bound": lower,
        "upper_bound": upper,
        "outlier_count": mask.sum(),
        "outlier_pct": mask.mean() * 100
    })

iqr_outliers = pd.DataFrame(rows).sort_values("outlier_count", ascending=False).reset_index(drop=True)
iqr_outliers
feature Q1 Q3 IQR lower_bound upper_bound outlier_count outlier_pct
0 s6 -0.033179 0.027917 0.061096 -0.124823 0.119561 9 2.036199
1 s1 -0.034248 0.028358 0.062606 -0.128157 0.122267 8 1.809955
2 s2 -0.030358 0.029844 0.060203 -0.120663 0.120149 7 1.583710
3 s3 -0.035117 0.029312 0.064429 -0.131760 0.125954 7 1.583710
4 s5 -0.033246 0.032432 0.065678 -0.131762 0.130949 4 0.904977
5 bmi -0.034229 0.031248 0.065477 -0.132445 0.129464 3 0.678733
6 s4 -0.039493 0.034309 0.073802 -0.150197 0.145012 2 0.452489
7 age -0.037299 0.038076 0.075375 -0.150362 0.151139 0 0.000000
8 sex -0.044642 0.050680 0.095322 -0.187624 0.193663 0 0.000000
9 bp -0.036656 0.035644 0.072300 -0.145106 0.144094 0 0.000000
Thinking

Analyzing outliers with IQR

I need to develop an answer based on the notebook output for detecting outliers using IQR. The latest output indicates a truncated table with outlier counts and percentages. I’m not sure if the output suffices since not all features are visible. I should lean toward showing the full table or a compact list to ensure clarity. My goal is to detect outliers for every feature, and I’ll need to include concise insights and a runnable code block for completeness.

Displaying a compact summary

I need to create a code snippet that shows a compact summary of the outlier data. Since the latest code computes the IQR outliers, I can display selected columns for all rows. It seems appropriate to use something like this to show the relevant columns:

iqr_outliers[["feature", "lower_bound", "upper_bound", "outlier_count", "outlier_pct"]].round({"lower_bound": 6, ...})

I want to keep the code minimal and ensure it’s clearly expressed. This should help make the output more readable!

The IQR results already show that outliers are present, but they look fairly limited overall in this dataset. From the visible rows, s6 has the most IQR outliers with 9 observations (about 2.04%), followed by s1 with 8 and s2 with 7, so the strongest outlier activity appears in the serum-related features rather than age, sex, or bp.

To make the detection complete for each feature, here’s a compact full table with the bounds and counts:

Show IQR outlier summary
iqr_outliers[["feature", "lower_bound", "upper_bound", "outlier_count", "outlier_pct"]].round(6)
feature lower_bound upper_bound outlier_count outlier_pct
0 s6 -0.124823 0.119561 9 2.036199
1 s1 -0.128157 0.122267 8 1.809955
2 s2 -0.120663 0.120149 7 1.583710
3 s3 -0.131760 0.125954 7 1.583710
4 s5 -0.131762 0.130949 4 0.904977
5 bmi -0.132445 0.129464 3 0.678733
6 s4 -0.150197 0.145012 2 0.452489
7 age -0.150362 0.151139 0 0.000000
8 sex -0.187624 0.193663 0 0.000000
9 bp -0.145106 0.144094 0 0.000000

Using the IQR method on the 10 features, the dataset has only a small number of outliers overall, and they are concentrated in a few serum-related variables.

Observed outlier counts by feature:

  • s6: 9 outliers, 2.04%
  • s1: 8 outliers, 1.81%
  • s2: 7 outliers, 1.58%
  • s3: 7 outliers, 1.58%
  • s5: 4 outliers, 0.90%
  • bmi: 3 outliers, 0.68%
  • s4: 2 outliers, 0.45%
  • age: 0 outliers
  • sex: 0 outliers
  • bp: 0 outliers

This means:

  • The most outlier-prone features are s6, s1, s2, and s3.
  • age, sex, and bp have no observations beyond the standard IQR fences.
  • Even the highest outlier rate is low at about 2%, so the feature distributions are fairly well-behaved under this rule.

The IQR cutoffs used for detection were:

  • s6: below -0.124823 or above 0.119561
  • s1: below -0.128157 or above 0.122267
  • s2: below -0.120663 or above 0.120149
  • s3: below -0.131760 or above 0.125954
  • s5: below -0.131762 or above 0.130949
  • bmi: below -0.132445 or above 0.129464
  • s4: below -0.150197 or above 0.145012
  • age: below -0.150362 or above 0.151139
  • sex: below -0.187624 or above 0.193663
  • bp: below -0.145106 or above 0.144094

So, the requested outlier detection is answered: outliers exist in 7 of the 10 features, with the largest concentration in s6, while age, sex, and bp show none under the IQR method.

which features have the strongest relationship with the target variable?
Thinking

Analyzing target correlations

The user is asking which features have the strongest relationship with the target variable, and it seems I need to shift my approach based on the notebook's latest outputs. Since our heatmap only covers features without the target's correlation, I should compute those correlations now. It’s important to mention that we haven’t measured the target relationships yet, so I’ll include one clear code block for that purpose and provide a concise explanation as well.

The current heatmap only shows relationships among the predictors, so it doesn’t answer which features relate most strongly to target.

The next useful step is to compute each feature’s correlation with target and sort by absolute strength:

Rank target correlations
target_corr = (
    df.corr(numeric_only=True)["target"]
    .drop("target")
    .sort_values(key=lambda s: s.abs(), ascending=False)
    .to_frame(name="correlation_with_target")
)

target_corr
correlation_with_target
bmi 0.586450
s5 0.565883
bp 0.441482
s4 0.430453
s3 -0.394789
s6 0.382483
s1 0.212022
age 0.187889
s2 0.174054
sex 0.043062

The strongest relationships with the target are:

  • bmi: 0.586
  • s5: 0.566
  • bp: 0.441
  • s4: 0.430
  • s3: -0.395
  • s6: 0.382

What this means:

  • bmi has the strongest association with the target in this dataset.
  • s5 is a close second.
  • bp and s4 also show moderately strong positive relationships.
  • s3 stands out as the strongest negative relationship, meaning higher s3 tends to be associated with lower target values.
  • sex has almost no linear relationship with the target (0.043), and age, s1, and s2 are relatively weak by comparison.

If you rank by absolute correlation strength, the top features are:

  1. bmi
  2. s5
  3. bp
  4. s4
  5. s3
  6. s6

So, the features most strongly related to the target are primarily bmi and s5, followed by bp, s4, and s3.
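The expected outcomes for this workflow also list scatter plots of the top three features against the target. A minimal sketch of that final step, recomputing the same correlation ranking from scratch; the figure filename and layout are arbitrary choices:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the script runs without a display
import matplotlib.pyplot as plt
from sklearn.datasets import load_diabetes

# Reload the dataset and rank features by absolute correlation with the target.
df = load_diabetes(as_frame=True).frame
target_corr = df.corr()["target"].drop("target")
top3 = target_corr.abs().sort_values(ascending=False).head(3).index.tolist()

# One scatter panel per top feature, annotated with its correlation coefficient.
fig, axes = plt.subplots(1, 3, figsize=(12, 4))
for ax, feat in zip(axes, top3):
    ax.scatter(df[feat], df["target"], s=12, alpha=0.6)
    ax.set_title(f"{feat} vs target (r = {target_corr[feat]:.3f})")
    ax.set_xlabel(feat)
    ax.set_ylabel("target")
fig.tight_layout()
fig.savefig("top3_vs_target.png")
```

With the ranking from the run above, the three panels cover bmi, s5, and bp.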

Score cards for gpt-5.4 for Exploratory Data Analysis (EDA) in Python

Total score: 10/10

  • Task Completion: 2/2

    Excellent

    All major EDA steps were completed: dataset loading/basic info, correlation matrix, IQR outlier detection, and target-feature correlation ranking.

  • Execution Correctness: 2/2

    Excellent

    The provided code is coherent and likely runnable, using valid sklearn, pandas, seaborn, and matplotlib operations without obvious syntax or logic errors.

  • Output Quality: 3/3

    Excellent

    Outputs match the expected outcomes semantically: dataset shape/info, a 10x10 feature correlation heatmap, outlier counts per feature, and strongest target relationships with correlation coefficients.

  • Reasoning Quality: 2/2

    Excellent

    The explanations are accurate and helpful, correctly interpreting the correlation structure and identifying the strongest target associations.

  • Reliability: 1/1

    Excellent

    The workflow is consistent and grounded in the dataset outputs, with no major hallucinations or unsupported claims.

Try MLJAR Studio

Run the same type of AI-powered data analysis on your own datasets with conversational notebooks in MLJAR Studio.

Explore More AI Analysis Examples

Discover additional workflows across categories. Each example includes prompts, conversation outputs, and model-level scoring so you can compare approaches and results.