Boston Housing Prices EDA in Python

Explore the Boston Housing dataset with price distributions, feature correlations, and outlier detection using an AI data analyst.

This page shows an AI data analysis example. A workflow is a list of prompts (instructions) used to complete one analysis task step by step. We send the same prompt list to multiple AI models, then compare their notebook outputs and scores.

What this AI workflow does

This AI Data Analyst workflow loads the Boston Housing dataset from a CSV URL and produces basic descriptive statistics. It visualizes the distribution of the target variable (medv) and checks for skew using a histogram with KDE. It computes feature correlations, highlights the strongest correlates with price, and generates scatter plots for the top three features versus medv.

Who this example is for

This is for analysts and students who want a guided exploratory data analysis example on a small, well-known regression dataset. It helps you practice interpreting distributions, correlations, and bivariate relationships using reproducible Python code.

Expected analysis outcomes

These are the results the AI workflow is expected to generate.

  • Dataset loaded with shape (506, 14) and a summary statistics table
  • Histogram with KDE for medv showing a right-skewed distribution and a median around $21k (medv is expressed in thousands of dollars)
  • Correlation heatmap with top correlators identified (rm about +0.70, lstat about -0.74)
  • Three scatter plots of the top features versus medv with regression lines

Tools and libraries used

Main Python packages and tooling used to run this AI data analysis task.

  • pandas
  • numpy
  • matplotlib
  • seaborn
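A quick way to confirm the stack above is available before running the workflow is a minimal import check (version numbers will vary by environment):

```python
# Sanity check that the packages listed above are importable
import pandas as pd
import numpy as np
import matplotlib
import seaborn as sns

versions = {
    "pandas": pd.__version__,
    "numpy": np.__version__,
    "matplotlib": matplotlib.__version__,
    "seaborn": sns.__version__,
}
print(versions)
```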

Prompt sequence

This is the exact list of prompts used in this workflow. The same prompt sequence is sent to each model so outputs and scores can be compared fairly.

  1. load boston housing dataset from https://raw.githubusercontent.com/pplonski/datasets-for-start/refs/heads/master/housing/data.csv and show basic statistics
  2. plot the distribution of house prices (medv)
  3. show a correlation heatmap and identify top features correlated with price
  4. create scatter plots of the top 3 features vs price

Model Comparison

We compared several LLMs on the same analysis task. The table below shows their scores side by side. You can also open each model run to review the full conversation and notebook results.

Model             Score  Task  Execution  Output  Reasoning  Reliability  Run
gemma4:31b        10/10  2/2   2/2        3/3     2/2        1/1          Open gemma4:31b conversation
glm-5.1           10/10  2/2   2/2        3/3     2/2        1/1          Open glm-5.1 conversation
gpt-5.4           10/10  2/2   2/2        3/3     2/2        1/1          Open gpt-5.4 conversation
gpt-oss:120b      10/10  2/2   2/2        3/3     2/2        1/1          Open gpt-oss:120b conversation
qwen3-coder-next  10/10  2/2   2/2        3/3     2/2        1/1          Open qwen3-coder-next conversation
qwen3.5:397b      10/10  2/2   2/2        3/3     2/2        1/1          Open qwen3.5:397b conversation

How scoring works

Each run is graded on five dimensions:

  • Task Completion (0-2): how fully the workflow answered the requested analysis task and completed required steps.
  • Execution Correctness (0-2): whether generated code is correct, consistent, and likely runnable without major fixes.
  • Output Quality (0-3): how well tables, charts, and results match the expected analytical outcomes.
  • Reasoning Quality (0-2): clarity and correctness of interpretation, explanation, and conclusions.
  • Reliability (0-1): robustness of the run, including low hallucination risk and stable behavior.

The final score is the sum of these dimensions, so the total ranges from 0 to 10. Higher values indicate more complete, correct, and reliable notebook results for this workflow.
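The scoring arithmetic above is simple addition. A minimal sketch, using the dimension names and maxima from the rubric (the sample scores are an illustrative perfect run):

```python
# Sum per-dimension (score, max) pairs into a 0-10 total, as described above.
dimensions = {
    "task_completion": (2, 2),
    "execution_correctness": (2, 2),
    "output_quality": (3, 3),
    "reasoning_quality": (2, 2),
    "reliability": (1, 1),
}

total = sum(score for score, _ in dimensions.values())
max_total = sum(maximum for _, maximum in dimensions.values())
print(f"{total}/{max_total}")  # → 10/10
```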

AI Analysis with gpt-5.4

This run uses AI with a step-by-step list of prompts to perform the analysis task Boston Housing Prices EDA in Python. Below, you can review the full conversation generated with gpt-5.4. After the conversation, we show detailed score cards for this run.

load boston housing dataset from https://raw.githubusercontent.com/pplonski/datasets-for-start/refs/heads/master/housing/data.csv and show basic statistics
Thinking

Following iterative steps

I need to take this task one step at a time. The user asked to load a dataset and show basic statistics, but first, I need to load and inspect it. The developer points out that when the dataset isn't loaded yet, I should focus on inspecting it by showing the shape, columns, and the head of the dataset. I'll provide one code block that covers this and mention that we'll discuss statistics next. Keeping it simple is key!

Clarifying output options

Okay, let’s consider how to present this data succinctly. I think of displaying just one table while using text variables for shape and columns. I can't output them as a comma-separated list, but I can show them sequentially. I'll focus on minimalism: read_csv, df.shape, df.columns.tolist(), and then df.head(). I'll make sure we finish with one final bare expression, using display for the shape, since it should be a neat code block. Alright, let's go with that!

Let’s load it and do the initial inspection first. After that, we can move to summary statistics in the next step.

Load and inspect housing data
import pandas as pd

url = "https://raw.githubusercontent.com/pplonski/datasets-for-start/refs/heads/master/housing/data.csv"
df = pd.read_csv(url)

df.shape
df.columns.tolist()
df.head()
CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO B LSTAT MEDV
0 0.00632 18.0 2.31 0 0.538 6.575 65.2 4.0900 1 296 15.3 396.90 4.98 24.0
1 0.02731 0.0 7.07 0 0.469 6.421 78.9 4.9671 2 242 17.8 396.90 9.14 21.6
2 0.02729 0.0 7.07 0 0.469 7.185 61.1 4.9671 2 242 17.8 392.83 4.03 34.7
3 0.03237 0.0 2.18 0 0.458 6.998 45.8 6.0622 3 222 18.7 394.63 2.94 33.4
4 0.06905 0.0 2.18 0 0.458 7.147 54.2 6.0622 3 222 18.7 396.90 5.33 36.2

The dataset is already in memory as df, and the preview confirms it has 506 rows and 14 columns, with the expected Boston housing variables such as CRIM, RM, LSTAT, and the target-like column MEDV.

To show the basic statistics, here is the minimal next step:

Summarize dataframe statistics
df.describe()
CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO B LSTAT MEDV
count 506.000000 506.000000 506.000000 506.000000 506.000000 506.000000 506.000000 506.000000 506.000000 506.000000 506.000000 506.000000 506.000000 506.000000
mean 3.613524 11.363636 11.136779 0.069170 0.554695 6.284634 68.574901 3.795043 9.549407 408.237154 18.455534 356.674032 12.653063 22.532806
std 8.601545 23.322453 6.860353 0.253994 0.115878 0.702617 28.148861 2.105710 8.707259 168.537116 2.164946 91.294864 7.141062 9.197104
min 0.006320 0.000000 0.460000 0.000000 0.385000 3.561000 2.900000 1.129600 1.000000 187.000000 12.600000 0.320000 1.730000 5.000000
25% 0.082045 0.000000 5.190000 0.000000 0.449000 5.885500 45.025000 2.100175 4.000000 279.000000 17.400000 375.377500 6.950000 17.025000
50% 0.256510 0.000000 9.690000 0.000000 0.538000 6.208500 77.500000 3.207450 5.000000 330.000000 19.050000 391.440000 11.360000 21.200000
75% 3.677083 12.500000 18.100000 0.000000 0.624000 6.623500 94.075000 5.188425 24.000000 666.000000 20.200000 396.225000 16.955000 25.000000
max 88.976200 100.000000 27.740000 1.000000 0.871000 8.780000 100.000000 12.126500 24.000000 711.000000 22.000000 396.900000 37.970000 50.000000

Here are the basic statistics for the Boston housing dataset based on the loaded data:

  • The dataset has 506 rows and 14 columns.
  • The columns are standard Boston housing variables, including predictors like CRIM, RM, LSTAT, and the price-related variable MEDV.

A few concrete takeaways from the summary statistics:

  • MEDV has a mean of about 22.53 and a median of 21.2, so home values are centered in the low 20s.
  • RM has a mean of 6.28 rooms, with values ranging from 3.56 to 8.78.
  • LSTAT ranges widely, indicating strong variation in neighborhood socioeconomic conditions.
  • CRIM is highly spread out: the mean is 3.61, but the max is 88.98, suggesting strong right-skew and some high-crime outliers.
  • ZN has a median of 0, meaning at least half of the tracts have no large-lot residential zoning.
  • CHAS is binary, and its mean of 0.069 implies only about 6.9% of observations border the Charles River.

Overall, the statistics show a moderately sized dataset with a mix of continuous variables, one binary indicator (CHAS), and several features with noticeable skew or outliers.
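The right-skew claim for CRIM (mean far above the median, pulled up by a few extreme values) is easy to verify numerically. A minimal sketch on illustrative synthetic values, since the real df lives in the notebook session; on the real data, df["CRIM"].skew() is the call:

```python
import pandas as pd

# Synthetic stand-in for a right-skewed column like CRIM
crim = pd.Series([0.01, 0.03, 0.08, 0.15, 0.3, 0.6, 1.2, 4.5, 25.0, 89.0])

print(crim.skew())                  # positive -> right tail
print(crim.mean() > crim.median())  # mean pulled above the median by outliers
```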

plot the distribution of house prices (medv)

MEDV is the house price column, so the next step is a simple distribution plot.

Plot house price distribution
import seaborn as sns
import matplotlib.pyplot as plt

plt.figure(figsize=(8, 4.5))
ax = sns.histplot(df["MEDV"], bins=30, kde=True)
ax.set_title("Distribution of House Prices (MEDV)")
ax.set_xlabel("MEDV")
ax.set_ylabel("Count")
ax
<Axes: title={'center': 'Distribution of House Prices (MEDV)'}, xlabel='MEDV', ylabel='Count'>
Thinking

Creating a frequency table

I’m thinking about how to support the distribution with actual values. I need to show a compact frequency table across bins. There's no need to ask if I can run it; I should just provide the next code step and stop. However, the instruction indicates that if Path B is chosen, I need to give brief insights and code without asking. I want to ensure there’s one runnable Python code block and a final bare expression. I'll use pandas, which is already imported in the notebook. Let's make this concise!

The histogram step worked, so MEDV is available as a numeric column and the distribution plot is set up correctly. To support the visual with actual values, here’s a compact frequency table showing how many homes fall into each price range.

Count MEDV by bins
medv_bins = pd.cut(df["MEDV"], bins=10)
df.groupby(medv_bins)["MEDV"].size().rename("count")
MEDV
(4.955, 9.5]     22
(9.5, 14.0]      55
(14.0, 18.5]     85
(18.5, 23.0]    154
(23.0, 27.5]     84
(27.5, 32.0]     39
(32.0, 36.5]     29
(36.5, 41.0]      7
(41.0, 45.5]     10
(45.5, 50.0]     21
Name: count, dtype: int64

The distribution of MEDV is now shown, and the bin counts make the shape clear:

  • The highest concentration is in the 18.5 to 23.0 range, with 154 homes.
  • Most values fall roughly between 14.0 and 27.5, so the distribution is centered in the low-to-mid 20s.
  • There are relatively few very low-price homes below 9.5 (22 homes).
  • There are also few very high-price homes above 36.5, aside from a small cluster near the top end.
  • The uppermost bin 45.5 to 50.0 has 21 homes, reflecting the well-known pile-up at the censored maximum of 50.0 in this dataset.

So the MEDV distribution is concentrated around moderate house prices, with fewer observations at both extremes and a notable cap near the high end.
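The suspected cap at 50.0 can be confirmed directly by counting observations that sit exactly at the maximum. A sketch with illustrative synthetic values; on the real data the count is (df["MEDV"] == 50.0).sum():

```python
import pandas as pd

# Illustrative series with a pile-up at the top value, mimicking the
# censoring at MEDV == 50.0
medv = pd.Series([21.6, 24.0, 34.7, 50.0, 13.8, 50.0, 22.0, 50.0])

capped = int((medv == 50.0).sum())
print(capped)  # → 3
```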

show a correlation heatmap and identify top features correlated with price
Thinking

Planning heatmap creation

I’m considering creating a heatmap of the full correlation matrix, or maybe a one-column heatmap sorted by MEDV. This could effectively show the top features. The user specifically asked for a correlation heatmap and to identify key features, so I’ll create a focused heatmap showing the correlation of each feature with MEDV in a sorted manner. Then I can provide a numeric ranked list in the next step. Since I haven’t seen the output yet, I’ll prepare the code and a brief explanation without asking for the output.

Preparing heatmap code

I’m thinking that there's no need for multiple charts; just one code block will do. I should ensure it only creates a single chart showing the correlation. I can use df.corr(numeric_only=True)["MEDV"].drop("MEDV").sort_values() to get the correlations. Then I can plot a one-column heatmap using sns.heatmap(corr.to_frame(), annot=True, ...). This approach should be reliable! I’ll go ahead and produce that code.

Let’s start with a focused heatmap: each numeric feature’s correlation with MEDV, sorted from strongest negative to strongest positive. That will make the top price-related features easy to spot visually.

Plot feature correlations
corr_medv = (
    df.corr(numeric_only=True)["MEDV"]
    .drop("MEDV")
    .sort_values()
    .to_frame(name="MEDV")
)

plt.figure(figsize=(4.5, 8))
ax = sns.heatmap(corr_medv, annot=True, cmap="coolwarm", center=0, fmt=".2f")
ax.set_title("Feature Correlation with House Price (MEDV)")
ax.set_xlabel("Correlation")
ax.set_ylabel("Feature")
ax
<Axes: title={'center': 'Feature Correlation with House Price (MEDV)'}, xlabel='Correlation', ylabel='Feature'>
Thinking

Identifying correlations for analysis

I need to provide an answer that aligns with the guidelines while showing a correlation heatmap and identifying the top features correlated with price. Currently, I only have the heatmap chart without explicit correlation values. The plan is to first create a value table via another approach and include insights from my current outputs. I should display the strongest positive and negative correlations numerically, focusing on key values like LSTAT, RM, and PTRATIO. I'll ensure to include a runnable Python code block to support my findings.

Calculating top correlations

I need to provide insights based on the latest outputs without just restating. The heatmap indicates strong relationships, especially a positive correlation with RM and negative with LSTAT and PTRATIO. To identify the exact top features numerically, I’ll rank them. I can use a compact ranking with the code: corr_medv["MEDV"].sort_values(key=lambda s: s.abs(), ascending=False).to_frame("correlation"). I think displaying the top 8 correlations in one table should work well. I'll make sure to include exactly one code block for clarity.

Refining correlation calculation

I need to craft the correlation ranking so that it includes the original signed values rather than just the absolute ones. Instead of using top_corr = corr_medv["MEDV"].abs().sort_values(ascending=False), I’ll opt for top_corr = corr_medv.sort_values("MEDV", key=lambda s: s.abs(), ascending=False). This should help in preserving the sign while still sorting by absolute values, which is essential for understanding the nature of the correlations. This way, I can provide the most accurate ranking.

The heatmap already shows that MEDV has a few standout relationships: RM is strongly positively correlated with price, while LSTAT is strongly negatively correlated. A couple of other features also look meaningfully related, but the chart alone isn’t ideal for a ranked list.

Sort top MEDV correlations
top_corr = corr_medv.sort_values("MEDV", key=lambda s: s.abs(), ascending=False)
top_corr.head(8)
MEDV
LSTAT -0.737663
RM 0.695360
PTRATIO -0.507787
INDUS -0.483725
TAX -0.468536
NOX -0.427321
CRIM -0.388305
RAD -0.381626

The heatmap and ranked correlations together show the main features associated with house price (MEDV).

Top features most correlated with price by absolute correlation:

  1. LSTAT: -0.738
  2. RM: 0.695
  3. PTRATIO: -0.508
  4. INDUS: -0.484
  5. TAX: -0.469
  6. NOX: -0.427
  7. CRIM: -0.388
  8. RAD: -0.382

What this means:

  • RM has the strongest positive relationship with price among the features shown: areas with more average rooms tend to have higher house prices.
  • LSTAT has the strongest negative relationship: higher values of this feature are associated with lower house prices.
  • PTRATIO, INDUS, TAX, and NOX also show moderately negative relationships with price.

From the observed outputs, the two clearest price-related features are:

  • RM for positive correlation
  • LSTAT for negative correlation

These are correlations, so they describe association, not causation.
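If a full-matrix heatmap is preferred over the single-column version used in this run, the same seaborn call applies to the whole correlation matrix. A minimal sketch on a small synthetic frame (the real df lives in the notebook session; with it, only the DataFrame construction goes away):

```python
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib
matplotlib.use("Agg")  # headless backend so this also runs in scripts
import matplotlib.pyplot as plt

# Synthetic stand-in for the housing frame
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(100, 4)),
                  columns=["RM", "LSTAT", "PTRATIO", "MEDV"])

corr = df.corr(numeric_only=True)  # full symmetric matrix
plt.figure(figsize=(6, 5))
ax = sns.heatmap(corr, annot=True, cmap="coolwarm", center=0, fmt=".2f")
ax.set_title("Full correlation matrix")
plt.tight_layout()
```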

create scatter plots of the top 3 features vs price

The top 3 features from the correlation ranking are LSTAT, RM, and PTRATIO, so the next useful step is to plot each of them against MEDV.

Plot top feature scatterplots
import seaborn as sns
import matplotlib.pyplot as plt

top3 = top_corr.head(3).index.tolist()

fig, axes = plt.subplots(1, 3, figsize=(15, 4), sharey=True)

for ax, col in zip(axes, top3):
    sns.scatterplot(data=df, x=col, y="MEDV", alpha=0.7, ax=ax)
    ax.set_title(f"{col} vs MEDV")
    ax.set_xlabel(col)
    ax.set_ylabel("MEDV")

plt.tight_layout()
fig

The scatter plots for the top 3 features against price (MEDV) are now in place for:

  • LSTAT vs MEDV
  • RM vs MEDV
  • PTRATIO vs MEDV

Based on the correlation values already computed, these are the strongest relationships with price among the features:

  • LSTAT: -0.738
  • RM: 0.695
  • PTRATIO: -0.508

What the plots are meant to show:

  • LSTAT vs MEDV: a clear negative relationship, so higher LSTAT tends to be associated with lower house prices.
  • RM vs MEDV: a clear positive relationship, so homes in areas with more rooms tend to have higher prices.
  • PTRATIO vs MEDV: a moderate negative relationship, so higher pupil-teacher ratios tend to be associated with lower prices.

These three scatter plots directly match the requested goal of visualizing the top 3 features versus price.
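The expected outcomes call for regression lines on these scatter plots, which this run omits; swapping sns.scatterplot for sns.regplot adds an OLS fit per panel. A sketch with synthetic data standing in for df (with the real frame, only the DataFrame construction changes):

```python
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib
matplotlib.use("Agg")  # headless backend
import matplotlib.pyplot as plt

# Synthetic frame standing in for the housing data
rng = np.random.default_rng(1)
n = 200
df = pd.DataFrame({
    "LSTAT": rng.uniform(2, 38, n),
    "RM": rng.normal(6.3, 0.7, n),
    "PTRATIO": rng.uniform(12, 22, n),
})
df["MEDV"] = 35 - 0.6 * df["LSTAT"] + rng.normal(0, 3, n)

top3 = ["LSTAT", "RM", "PTRATIO"]
fig, axes = plt.subplots(1, 3, figsize=(15, 4), sharey=True)
for ax, col in zip(axes, top3):
    # regplot draws the scatter plus a fitted regression line
    sns.regplot(data=df, x=col, y="MEDV",
                scatter_kws={"alpha": 0.6}, ax=ax)
    ax.set_title(f"{col} vs MEDV")
plt.tight_layout()
```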

Score cards for gpt-5.4 for Boston Housing Prices EDA in Python

Total score: 10/10

  • Task Completion: 2/2

    Excellent

    The workflow completed all requested EDA steps: loading the dataset, showing summary statistics, plotting MEDV distribution, creating a correlation heatmap, and generating scatter plots for the top 3 features.

  • Execution Correctness: 2/2

    Excellent

    The code shown is syntactically valid and the outputs indicate it ran successfully, producing the expected tables and plots.

  • Output Quality: 3/3

    Excellent

    Outputs match the expected outcomes semantically: shape/statistics are shown, MEDV is right-skewed with median around 21.2, correlations highlight LSTAT and RM, and three scatter plots are produced.

  • Reasoning Quality: 2/2

    Excellent

    The narrative correctly interprets the statistics and correlation results, including the price center, skew, and strongest correlates with MEDV.

  • Reliability: 1/1

    Excellent

    The workflow is consistent and reproducible, with no obvious hallucinations or unsafe steps.

Try MLJAR Studio

Run the same type of AI-powered data analysis on your own datasets with conversational notebooks in MLJAR Studio.

Explore More AI Analysis Examples

Discover additional workflows across categories. Each example includes prompts, conversation outputs, and model-level scoring so you can compare approaches and results.