Exploratory Data Analysis (EDA) in Python

Complete EDA workflow: load data, check quality, compute correlations, detect outliers, and visualize distributions — all with an AI data analyst.

This page shows an AI data analysis example. A workflow is a list of prompts (instructions) used to complete one analysis task step by step. We send the same prompt list to multiple AI models, then compare their notebook outputs and scores.

What this AI workflow does

This AI Data Analyst workflow loads the scikit-learn diabetes dataset and summarizes its structure, feature names, and basic statistics. It generates correlation analysis across all features, including a full correlation matrix visualization. It detects per-feature outliers using the IQR method and identifies which features relate most strongly to the target with supporting plots and correlation coefficients.

Who this example is for

This is for analysts and students who want a repeatable EDA template for a small, well-known regression dataset. It helps you practice data quality checks, correlation inspection, and outlier detection in a conversational, code-generating notebook.

Expected analysis outcomes

These are the results the AI workflow is expected to generate.

  • Dataset shape (442, 10) with feature names and summary stats
  • 10x10 correlation heatmap of all features
  • Outlier counts per feature using the IQR method
  • Top 3 feature vs target scatter plots with correlation coefficients
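
The first three outcomes above can be reproduced with a short, self-contained sketch. This is a minimal illustration, not the graded workflow itself: the variable names and the choice to keep the target out of the feature DataFrame (so the shape is (442, 10)) are our own assumptions.

```python
import pandas as pd
from sklearn.datasets import load_diabetes

# Load the diabetes dataset into a features-only DataFrame
diabetes = load_diabetes()
df = pd.DataFrame(diabetes.data, columns=diabetes.feature_names)
print(df.shape)  # (442, 10)

# Correlation of each feature with the target, strongest (by absolute value) first
target = pd.Series(diabetes.target, name="target")
corr_with_target = df.corrwith(target).sort_values(key=abs, ascending=False)
print(corr_with_target.head(3))

# Per-feature outlier counts using the 1.5 * IQR rule
q1, q3 = df.quantile(0.25), df.quantile(0.75)
iqr = q3 - q1
outlier_counts = ((df < q1 - 1.5 * iqr) | (df > q3 + 1.5 * iqr)).sum()
print(outlier_counts.sort_values(ascending=False))
```

Keeping the target as a separate Series avoids the (442, 11) shape that appears when it is appended as an extra column.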

Tools and libraries used

Main Python packages and tooling used to run this AI data analysis task.

  • scikit-learn
  • pandas
  • numpy
  • matplotlib
  • seaborn

Prompt sequence

This is the exact list of prompts used in this workflow. The same prompt sequence is sent to each model so outputs and scores can be compared fairly.

  1. load the diabetes dataset (from sklearn) and show basic info
  2. plot a correlation matrix of all features
  3. detect outliers in each feature using the IQR method
  4. which features have the strongest relationship with the target variable?

Model Comparison

We compared several LLMs on the same analysis task. The table below shows their scores side by side. You can also open each model run to review the full conversation and notebook results.

Model            | Score | Task | Execution | Output | Reasoning | Reliability | Run
glm-5.1          | 10/10 | 2/2  | 2/2       | 3/3    | 2/2       | 1/1         | Open glm-5.1 conversation
gpt-5.4          | 10/10 | 2/2  | 2/2       | 3/3    | 2/2       | 1/1         | Open gpt-5.4 conversation
gpt-oss:120b     | 10/10 | 2/2  | 2/2       | 3/3    | 2/2       | 1/1         | Open gpt-oss:120b conversation
qwen3-coder-next | 10/10 | 2/2  | 2/2       | 3/3    | 2/2       | 1/1         | Open qwen3-coder-next conversation
qwen3.5:397b     | 9/10  | 2/2  | 2/2       | 2/3    | 2/2       | 1/1         | Open qwen3.5:397b conversation
gemma4:31b       | 8/10  | 2/2  | 1/2       | 2/3    | 2/2       | 1/1         | Open gemma4:31b conversation

How scoring works

Each run is graded on five dimensions:

  • Task Completion (0-2): how fully the workflow answered the requested analysis task and completed required steps.
  • Execution Correctness (0-2): whether generated code is correct, consistent, and likely runnable without major fixes.
  • Output Quality (0-3): how well tables, charts, and results match the expected analytical outcomes.
  • Reasoning Quality (0-2): clarity and correctness of interpretation, explanation, and conclusions.
  • Reliability (0-1): robustness of the run, including low hallucination risk and stable behavior.

The final score is the sum of these dimensions, so total score = 0-10. Higher values indicate more complete, correct, and reliable notebook results for this workflow.
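
The total is a plain sum of the five dimensions. As a minimal sketch of the arithmetic, using the gemma4:31b scores from the comparison table above:

```python
# Per-dimension scores for one run (gemma4:31b, taken from the comparison table)
scores = {
    "task_completion": 2,        # out of 2
    "execution_correctness": 1,  # out of 2
    "output_quality": 2,         # out of 3
    "reasoning_quality": 2,      # out of 2
    "reliability": 1,            # out of 1
}

total = sum(scores.values())
print(f"{total}/10")  # 8/10 for this run
```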

AI Analysis with gemma4:31b

This run uses AI with a step-by-step list of prompts to perform the analysis task "Exploratory Data Analysis (EDA) in Python". Below, you can review the full conversation generated with gemma4:31b. After the conversation, we show detailed score cards for this run.

load the diabetes dataset (from sklearn) and show basic info
Thinking

The user wants to load the diabetes dataset from sklearn and see basic information about it.

  1. Load the dataset using sklearn.datasets.load_diabetes.
  2. Convert it into a pandas DataFrame for easier inspection.
  3. Show basic info: shape, columns, and the first few rows (head).

Plan:

  • Import load_diabetes from sklearn.datasets.
  • Import pandas.
  • Load the data.
  • Create a DataFrame.
  • Display basic information.

I'll load the diabetes dataset from scikit-learn and put it into a pandas DataFrame so we can inspect the features and the target variable.

Load diabetes dataset
import pandas as pd
from sklearn.datasets import load_diabetes

# Load the dataset
diabetes = load_diabetes()
df = pd.DataFrame(diabetes.data, columns=diabetes.feature_names)
df['target'] = diabetes.target

# Display basic info
display(df.head())
df.shape
age sex bmi bp s1 s2 s3 s4 s5 s6 target
0 0.038076 0.050680 0.061696 0.021872 -0.044223 -0.034821 -0.043401 -0.002592 0.019907 -0.017646 151.0
1 -0.001882 -0.044642 -0.051474 -0.026328 -0.008449 -0.019163 0.074412 -0.039493 -0.068332 -0.092204 75.0
2 0.085299 0.050680 0.044451 -0.005670 -0.045599 -0.034194 -0.032356 -0.002592 0.002861 -0.025930 141.0
3 -0.089063 -0.044642 -0.011595 -0.036656 0.012191 0.024991 -0.036038 0.034309 0.022688 -0.009362 206.0
4 0.005383 -0.044642 -0.036385 0.021872 0.003935 0.015596 0.008142 -0.002592 -0.031988 -0.046641 135.0
(442, 11)
Thinking

The user goal was to "load the diabetes dataset (from sklearn) and show basic info".

Looking at the notebook state:

  • The dataset has been loaded into df.
  • df.head() was displayed, showing the first 5 rows of features (age, sex, bmi, bp, s1-s6) and the target.
  • df.shape was displayed as (442, 11).
  • This confirms the dataset is loaded and provides basic information (shape, column names, first few rows).

The output clearly shows:

  • 442 samples and 11 columns.
  • Features are scaled/normalized (small decimal values).
  • Target is a continuous numerical value (e.g., 151.0, 75.0).

The original goal is "load the diabetes dataset (from sklearn) and show basic info". The current output provides:

  1. Loading (completed).
  2. Basic info: Column names, a sample of the data, and the dataset dimensions.

This is sufficient to satisfy the original goal.

Path A: Final answer based on observed outputs.

The diabetes dataset has been successfully loaded, containing 442 samples and 11 columns. The features (age, sex, bmi, bp, and s1 through s6) appear to be pre-scaled, while the target represents a quantitative measure of disease progression.

plot a correlation matrix of all features
Thinking

The user wants to plot a correlation matrix of all features in the diabetes dataset. The dataset df is already loaded and exists in the notebook state.

Plan:

  1. Calculate the correlation matrix using df.corr().
  2. Use seaborn (default plotting library) to create a heatmap.
  3. Ensure the plot has a title and clear labels.

Constraint checklist & Confidence score:

  1. Propose one next step? Yes.
  2. Provide one runnable code block? Yes.
  3. Analyze latest outputs? Yes.
  4. Only one code block? Yes.
  5. No comma-separated expressions? Yes.
  6. No $ for money? N/A.
  7. No print for variables? Yes.
  8. No multi-plot per step? Yes.
  9. Minimal code? Yes.

Confidence Score: 5/5

Mental Sandbox:

  • import seaborn as sns
  • import matplotlib.pyplot as plt
  • corr = df.corr()
  • sns.heatmap(corr, annot=True, cmap='coolwarm', fmt=".2f")
  • plt.title("Correlation Matrix")

The data is already in df. I will use seaborn as requested by the visualization policy.

I'll create a heatmap to visualize the correlations between all features and the target variable. This will help us identify which features have the strongest linear relationship with the disease progression.

Plot correlation heatmap
import seaborn as sns
import matplotlib.pyplot as plt

# Calculate correlation matrix
corr = df.corr()

# Plot heatmap
plt.figure(figsize=(10, 8))
sns.heatmap(corr, annot=True, cmap='coolwarm', fmt=".2f", linewidths=0.5)
plt.title("Correlation Matrix of Diabetes Dataset Features")
plt.show()
Thinking

The user wants to plot a correlation matrix of all features. Looking at the notebook state:

  • cell-1 loaded the diabetes dataset into df.
  • cell-2 calculated the correlation matrix corr = df.corr() and plotted it using sns.heatmap.
  • output-2 shows a figure object (<Figure size 1000x800 with 2 Axes>).

The goal was to "plot a correlation matrix of all features". The current state shows the code was executed and a figure was produced. However, the "Sufficiency heuristics" state: "If the latest output is chart-only/visual-only and does not include underlying numeric/text values, choose Path B first to print the values used in the chart." The current output is indeed just the figure object (<Figure size 1000x800 with 2 Axes>). I cannot see the actual correlation values in the provided notebook state (I only see the figure object placeholder).

To provide concrete insights and a complete answer, I need to see the actual numeric correlation values. The user's goal is to "plot", which has been done, but to "analyze" and "extract concrete insights" (as requested by the system prompt's persona instructions), I need the numbers.

Wait, the system prompt says "If the goal is to show/plot/chart/visualize and latest output contains both a valid chart/figure object and explicit numeric/text values supporting the conclusion, choose Path A". Here, the output is only the figure object. I don't have the numeric values of corr in the output text.

Therefore, I should choose Path B to print the correlation values (specifically the target correlations) to provide insights.

Actually, the goal was specifically to "plot a correlation matrix". The plot was generated. But since I am acting as an AI Data Analyst and the output block provided to me is just <Figure size 1000x800 with 2 Axes>, I don't actually see the heat map. In a real scenario, I'd see the image. But based on the provided text representation of the notebook, I only see the figure object.

Wait, if the user just wants the plot, it's done. But the persona instructions say "extract concrete insights from observed outputs". If the observed output is just a figure handle, I can't extract insights.

Let's check the Path A/B decision logic again. Goal: "plot a correlation matrix of all features". Output: figure generated. Heuristic: "If the latest output is chart-only/visual-only... choose Path B first to print the values".

I will print the correlation matrix (or at least the correlations with the target) to be able to describe the relationships.

Actually, looking at the prompt: "Analyze the latest executed notebook outputs and notebook state against the original user goal." The goal was "plot a correlation matrix". The user's request is now an "Auto-run follow-up" to analyze. If the goal was just to plot, and it's plotted, Path A is tempting. But the prompt requires "concrete insights". Without the numbers, I can't give insights.

I'll choose Path B to print the correlation matrix.

Wait, if the output provided in the prompt for [output-2] is just <Figure size 1000x800 with 2 Axes>, it means I have no visual data. I must get the numeric data to analyze it.

One more check: "If the goal is to show/plot/chart/visualize and latest output contains both a valid chart/figure object and explicit numeric/text values supporting the conclusion, choose Path A". It only contains the figure object.

I will print the correlation matrix.

The correlation matrix has been plotted, but to provide specific insights into which features most strongly influence the target, I will display the numerical correlation values.

Calculate target correlations
# Display the correlations of all features with the target variable
df.corr()['target'].sort_values(ascending=False)
target    1.000000
bmi       0.586450
s5        0.565883
bp        0.441482
s4        0.430453
s6        0.382483
s1        0.212022
age       0.187889
s2        0.174054
sex       0.043062
s3       -0.394789
Name: target, dtype: float64
Thinking

The user's original goal was to "plot a correlation matrix of all features". Looking at the notebook state:

  • Cell 2 plotted the heatmap (correlation matrix) using sns.heatmap.
  • Cell 3 displayed the sorted correlation values of all features with the target variable.

The user's goal was specifically to plot a correlation matrix. The heatmap in Cell 2 fulfills this request perfectly. The subsequent step (Cell 3) provided the numeric support for the visualization.

Based on the sufficiency heuristics:

  • "If the goal is to show/plot/chart/visualize and latest output contains both a valid chart/figure object and explicit numeric/text values supporting the conclusion, choose Path A".
  • The heatmap (chart) is present, and the sorted correlation list (numeric values) is present.

Insights from outputs:

  • bmi (0.59) and s5 (0.57) have the strongest positive correlation with the target.
  • s3 (-0.39) has a notable negative correlation.
  • sex (0.04) shows almost no linear relationship with the target.

Decision: Path A.

The correlation matrix visualization and the subsequent numeric breakdown show that bmi and s5 have the strongest positive linear relationships with the disease progression target. Conversely, s3 exhibits a significant negative correlation. Features like sex show very little linear correlation with the target.

detect outliers in each feature using the IQR method
Thinking

The user wants to detect outliers in each feature of the diabetes dataset using the Interquartile Range (IQR) method.

  • Dataset df is loaded and available.
  • It contains numerical features: age, sex, bmi, bp, s1, s2, s3, s4, s5, s6 and the target.
  • The goal is specifically "detect outliers in each feature using the IQR method".

For each column in df:

  • Calculate Q1 (25th percentile).
  • Calculate Q3 (75th percentile).
  • Calculate IQR = Q3 - Q1.
  • Define the lower bound as Q1 - 1.5 × IQR.
  • Define the upper bound as Q3 + 1.5 × IQR.
  • Count values outside these bounds.

Plan:

  • Iterate through the columns of the DataFrame.
  • Create a summary table showing the number of outliers detected for each feature.
  • Output a brief explanation of IQR, the Python code to compute the counts, and a final expression to show the summary.

I will use the Interquartile Range (IQR) method to identify outliers. For each feature, I'll calculate the first quartile (Q1) and third quartile (Q3), then define the bounds as Q1 - 1.5 × IQR and Q3 + 1.5 × IQR.

Detect feature outliers
# Calculate IQR and detect outliers for each feature
outlier_counts = {}

for col in df.columns:
    Q1 = df[col].quantile(0.25)
    Q3 = df[col].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    
    # Identify rows that fall outside the bounds
    outliers = df[(df[col] < lower_bound) | (df[col] > upper_bound)]
    outlier_counts[col] = len(outliers)

# Convert results to a Series for better display
pd.Series(outlier_counts).sort_values(ascending=False)
s6        9
s1        8
s2        7
s3        7
s5        4
bmi       3
s4        2
age       0
sex       0
bp        0
target    0
dtype: int64
Thinking

The user wants to detect outliers in each feature using the IQR method. Looking at the provided notebook state:

  • Cell 4 executed the IQR logic.
  • Output 4 provides the number of outliers per column:
    • s6: 9
    • s1: 8
    • s2: 7
    • s3: 7
    • s5: 4
    • bmi: 3
    • s4: 2
    • age: 0
    • sex: 0
    • bp: 0
    • target: 0

The goal was to "detect outliers in each feature using the IQR method". The current output provides the count of outliers for every feature. This is a direct answer to the goal.

Path A: The outputs are sufficient. I should provide a final answer summarizing the findings.

The IQR method identifies outliers for several features in the dataset. The highest number of outliers are found in the blood serum measurements, specifically s6 (9 outliers), s1 (8 outliers), s2 (7 outliers), and s3 (7 outliers). Moderate outliers are present in s5 and bmi, while s4 has very few. Features such as age, sex, bp, and the target variable contain no outliers according to this method.

which features have the strongest relationship with the target variable?
Thinking

The user wants to know which features have the strongest relationship with the target variable.

  • The dataset is the diabetes dataset.
  • A correlation matrix was calculated and a heatmap was plotted in [cell-2].
  • The correlations of all features with the target variable were specifically calculated and sorted in [cell-3].

  • target (1.000000): the target with itself, obviously.
  • bmi (0.586450): strong positive correlation.
  • s5 (0.565883): strong positive correlation.
  • bp (0.441482): moderate positive correlation.
  • s4 (0.430453): moderate positive correlation.
  • s6 (0.382483): moderate positive correlation.
  • s3 (-0.394789): moderate negative correlation.

  • Relationship strength is determined by the absolute value of the correlation coefficient.
  • The strongest positive predictors are bmi and s5.
  • s3 also shows a significant negative relationship.

The user is asking for analysis of existing output. I should provide a natural language answer based on [output-3]. No code is needed since the information is already present.

  • Identify the top features.
  • Mention the direction of the relationship (positive vs negative).

Based on the correlation analysis in the previous steps, the features with the strongest relationship to the target variable are:

  1. bmi (0.59): The strongest positive correlation.
  2. s5 (0.57): A strong positive correlation.
  3. bp (0.44): A moderate positive correlation.
  4. s4 (0.43): A moderate positive correlation.
  5. s3 (-0.39): A significant negative correlation, meaning as s3 decreases, the target value tends to increase.

Overall, bmi and s5 are the most influential predictors in this dataset.

Score cards for gemma4:31b on Exploratory Data Analysis (EDA) in Python

Total score: 8/10

  • Task Completion: 2/2

    Excellent

    The workflow completed the main EDA steps: loaded the diabetes dataset, plotted the correlation matrix, computed target correlations, and detected outliers with IQR.

  • Execution Correctness: 1/2

    Needs work

    The code is mostly runnable, but the dataset shape is incorrect for the expected task because the target was added as an extra column, making it (442, 11) instead of the expected (442, 10).

  • Output Quality: 2/3

    Good

    Outputs include the dataset head, shape, correlation ranking, and outlier counts. However, the expected scatter plots of the top 3 features vs target with correlation coefficients are missing.

  • Reasoning Quality: 2/2

    Excellent

    The analysis correctly identifies the strongest target relationships, especially bmi and s5, and uses a valid IQR method for outlier detection.

  • Reliability: 1/1

    Excellent

    The workflow is reasonably consistent and uses standard pandas/sklearn/seaborn operations, though it omits one expected visualization step.
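
The expected output this run omitted, scatter plots of the top 3 features against the target with correlation coefficients, could be produced with a sketch like the following. The figure layout, the non-interactive backend, and the output filename are our own choices, not part of the graded run.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend for scripted runs
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.datasets import load_diabetes

# Rebuild the DataFrame used throughout the conversation
diabetes = load_diabetes()
df = pd.DataFrame(diabetes.data, columns=diabetes.feature_names)
df["target"] = diabetes.target

# Rank features by absolute correlation with the target
corr = df.corr()["target"].drop("target")
top3 = corr.abs().sort_values(ascending=False).head(3).index.tolist()

# One scatter plot per top feature, annotated with its correlation coefficient
fig, axes = plt.subplots(1, 3, figsize=(15, 4))
for ax, feature in zip(axes, top3):
    ax.scatter(df[feature], df["target"], alpha=0.5)
    ax.set_xlabel(feature)
    ax.set_ylabel("target")
    ax.set_title(f"{feature} vs target (r = {corr[feature]:.2f})")
fig.tight_layout()
fig.savefig("top3_features_vs_target.png")
```

Ranked by absolute correlation, the top 3 are bmi, s5, and bp; this is the step that would have lifted the run's Output Quality score.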

Try MLJAR Studio

Run the same type of AI-powered data analysis on your own datasets with conversational notebooks in MLJAR Studio.

Explore More AI Analysis Examples

Discover additional workflows across categories. Each example includes prompts, conversation outputs, and model-level scoring so you can compare approaches and results.