Boston Housing Prices EDA in Python

Explore the Boston Housing dataset with price distributions, feature correlations, and outlier detection using an AI data analyst.

What

This AI Data Analyst workflow loads the Boston Housing dataset from a CSV URL and produces basic descriptive statistics. It visualizes the distribution of the target variable (medv) and checks for skew using a histogram with KDE. It computes feature correlations, highlights the strongest correlates with price, and generates scatter plots for the top three features versus medv.

Who

This is for analysts and students who want a guided exploratory data analysis example on a small, well-known regression dataset. It helps you practice interpreting distributions, correlations, and bivariate relationships using reproducible Python code.

Tools

pandas
numpy
matplotlib
seaborn

Outcomes

Dataset loaded with shape (506, 14) and a summary statistics table
Histogram with KDE for medv showing a right-skewed distribution and a median around 21.2 (MEDV is in $1000s, so about $21k)
Correlation heatmap with top correlators identified (rm about +0.70, lstat about -0.74)
Three scatter plots of the top features versus medv with regression lines

Quality Score

7/10

gpt-5.2
mljar-ai

Last scored: Apr 7, 2026

Task Completion: 2/2
Excellent
All requested EDA steps are present: data load + basic stats (including shape via narrative), MEDV distribution plot, correlation heatmap + top correlators, and scatter plots for the top 3 features vs MEDV.
Execution Correctness: 1/2
Needs work
Most code is runnable, but the scatter plots do not include regression lines as required; otherwise the loading, describe, histplot, corr/heatmap, and correlation ranking code are valid.
Output Quality: 2/3
Good
Outputs match key expectations: stats table for 506x14, histogram with KDE, and correlations showing RM (~0.695) and LSTAT (~-0.738) as top features. However, the scatter plots lack regression lines, so the final visualization requirement is not fully met.
Reasoning Quality: 1/2
Needs work
Interpretations of summary stats and correlations are generally correct, but the assistant also states it "can't determine the skew direction without seeing the chart" despite having the plot output, and it doesn't explicitly confirm the expected right-skew/median detail.
Reliability: 1/1
Excellent
Workflow is consistent with the notebook evidence and avoids unsupported claims about computed values (correlations are shown). Main weakness is omission of regression lines rather than hallucination or unsafe behavior.

load boston housing dataset from https://raw.githubusercontent.com/pplonski/datasets-for-start/refs/heads/master/housing/data.csv and show basic statistics

Thinking

Loading and Inspecting Dataset

Alright, I need to load the dataset and show some basic statistics. However, since the dataset isn't loaded yet, the first step is to load and inspect it, showing the shape, columns, and the first few rows. The developer has emphasized that I shouldn't combine the inspection with plotting or modeling at this stage. So, I'll ensure to follow these directives and provide the required outputs sequentially, keeping everything simple and focused. Let's create the code!

Creating Load and Check Code

I think the simplest approach for loading and inspecting the dataset is to use pandas. First, I’ll import the necessary libraries and load the CSV file from the URL. I’ll display the shape and columns without printing unnecessary details, and the final expression will be the head of the DataFrame. This keeps it clean and straightforward. I want to mention that we're first loading and checking the structure, saving the basic statistics for the next step.

Let’s start by loading the data and checking its structure first. Then we can summarize the basic statistics in the next step.

Load and inspect housing data

import pandas as pd
from IPython.display import display

url = "https://raw.githubusercontent.com/pplonski/datasets-for-start/refs/heads/master/housing/data.csv"
df = pd.read_csv(url)

display(pd.DataFrame({"rows": [df.shape[0]], "columns": [df.shape[1]]}))
display(pd.DataFrame({"column": df.columns}))
df.head()

	rows	columns
0	506	14

	column
0	CRIM
1	ZN
2	INDUS
3	CHAS
4	NOX
5	RM
6	AGE
7	DIS
8	RAD
9	TAX
10	PTRATIO
11	B
12	LSTAT
13	MEDV

	CRIM	ZN	INDUS	NOX	RM	AGE	DIS	RAD	TAX	PTRATIO	B	LSTAT	MEDV
0	0.00632	18.0	2.31	0.538	6.575	65.2	4.0900	1	296	15.3	396.90	4.98	24.0
1	0.02731	0.0	7.07	0.469	6.421	78.9	4.9671	2	242	17.8	396.90	9.14	21.6
2	0.02729	0.0	7.07	0.469	7.185	61.1	4.9671	2	242	17.8	392.83	4.03	34.7
3	0.03237	0.0	2.18	0.458	6.998	45.8	6.0622	3	222	18.7	394.63	2.94	33.4
4	0.06905	0.0	2.18	0.458	7.147	54.2	6.0622	3	222	18.7	396.90	5.33	36.2

You’ve already confirmed the dataset structure: it has 506 rows and 14 columns, and the preview shows the expected Boston Housing variables, including MEDV as the target.

To complete the “basic statistics” part, here is the minimal next step:

Summarize dataframe statistics

df.describe()

	CRIM	ZN	INDUS	CHAS	NOX	RM	AGE	DIS	RAD	TAX	PTRATIO	B	LSTAT	MEDV
count	506.000000	506.000000	506.000000	506.000000	506.000000	506.000000	506.000000	506.000000	506.000000	506.000000	506.000000	506.000000	506.000000	506.000000
mean	3.613524	11.363636	11.136779	0.069170	0.554695	6.284634	68.574901	3.795043	9.549407	408.237154	18.455534	356.674032	12.653063	22.532806
std	8.601545	23.322453	6.860353	0.253994	0.115878	0.702617	28.148861	2.105710	8.707259	168.537116	2.164946	91.294864	7.141062	9.197104
min	0.006320	0.000000	0.460000	0.000000	0.385000	3.561000	2.900000	1.129600	1.000000	187.000000	12.600000	0.320000	1.730000	5.000000
25%	0.082045	0.000000	5.190000	0.000000	0.449000	5.885500	45.025000	2.100175	4.000000	279.000000	17.400000	375.377500	6.950000	17.025000
50%	0.256510	0.000000	9.690000	0.000000	0.538000	6.208500	77.500000	3.207450	5.000000	330.000000	19.050000	391.440000	11.360000	21.200000
75%	3.677083	12.500000	18.100000	0.000000	0.624000	6.623500	94.075000	5.188425	24.000000	666.000000	20.200000	396.225000	16.955000	25.000000
max	88.976200	100.000000	27.740000	1.000000	0.871000	8.780000	100.000000	12.126500	24.000000	711.000000	22.000000	396.900000	37.970000	50.000000

The dataset contains 506 observations and 14 variables, which matches the standard Boston Housing structure. The columns include predictors such as CRIM, RM, LSTAT, and TAX, plus the target column MEDV (median home value).

From the basic statistics shown:

There are no missing values in the displayed summary, since every column has a count of 506.
MEDV is the house-value variable; its average is around 22.53, with values ranging from 5.0 to 50.0.
RM averages about 6.28 rooms per dwelling, with most homes between roughly 5.89 and 6.62 rooms.
LSTAT has a median around 11.36 and ranges from 1.73 to 37.97, indicating a fairly wide spread across neighborhoods.
CRIM is highly right-skewed: the median is only about 0.26, while the maximum is 88.98, so a small number of areas have much higher crime rates than most.
ZN has a median of 0, meaning at least half the tracts have no large-lot residential zoning.
CHAS is binary and has a mean of about 0.069, so only about 6.9% of observations border the Charles River.
NOX averages about 0.555, and PTRATIO averages about 18.46, giving a sense of pollution and pupil-teacher ratios across areas.

Overall, the summary shows a complete numeric dataset with substantial variation across features, and several variables—especially CRIM, ZN, and likely TAX/RAD—appear unevenly distributed rather than symmetric.
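
The skew claims above are read off by comparing each column's mean and median. A minimal sketch of how that check could be made explicit is shown below. The helper name `skew_report` and the toy values are assumptions made for illustration (they are not the Boston data); with the notebook's `df` loaded, the same function could be called on it directly.

```python
import pandas as pd


def skew_report(df: pd.DataFrame) -> pd.DataFrame:
    """Per-column missing count, mean, median, and sample skewness."""
    numeric = df.select_dtypes("number")
    report = pd.DataFrame(
        {
            "missing": numeric.isna().sum(),
            "mean": numeric.mean(),
            "median": numeric.median(),
            "skew": numeric.skew(),
        }
    )
    # Put the most asymmetric columns first, regardless of skew direction.
    return report.sort_values("skew", key=lambda s: s.abs(), ascending=False)


# Toy stand-in data, made up for illustration only.
toy = pd.DataFrame(
    {
        "right_skewed": [0.1, 0.2, 0.2, 0.3, 0.4, 9.0],
        "symmetric": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    }
)
report = skew_report(toy)
print(report)
```

A mean well above the median together with a large positive skew value is the pattern described for CRIM; a skew near zero indicates a roughly symmetric column.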

plot the distribution of house prices (medv)

Thinking

Planning data visualization

I need to respond with one concise code block to analyze the latest outputs and plot the distribution of MEDV using seaborn or matplotlib. Since the dataset is already loaded and inspected, I can plot directly. I’ll mention that MEDV ranges from 5 to 50 with a mean of about 22.5 based on the stats. I will ensure to provide one chart and end the code block with one simple expression, avoiding display options.

MEDV ranges from about 5.0 to 50.0, so a histogram is a good first look at the price distribution.

Plot house price distribution

import seaborn as sns
import matplotlib.pyplot as plt

plt.figure(figsize=(8, 5))
sns.histplot(df["MEDV"], bins=30, kde=True)
plt.title("Distribution of House Prices (MEDV)")
plt.xlabel("MEDV")
plt.ylabel("Count")
plt.tight_layout()
plt.show()

Summarize MEDV statistics

medv_stats = df["MEDV"].describe()
medv_stats

count    506.000000
mean      22.532806
std        9.197104
min        5.000000
25%       17.025000
50%       21.200000
75%       25.000000
max       50.000000
Name: MEDV, dtype: float64

The distribution of MEDV is now clear from the histogram and summary stats.

MEDV is centered around the low 20s, with:

mean: 22.53
median: 21.20
25th percentile: 17.03
75th percentile: 25.00

A few takeaways from the plotted distribution:

Most house prices cluster roughly between 17 and 25.
The mean is slightly above the median, which suggests a mild right skew.
The maximum value is 50.0, while the minimum is 5.0, so the distribution has a fairly wide spread.
The upper cap at 50.0 is notable and often appears as a ceiling in this dataset, which can create a visible pile-up at the high end of the histogram.

So, the house-price distribution is concentrated between the high teens and the mid 20s, is somewhat right-skewed, and is capped at 50.0 at the top.
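
The skew direction and the ceiling at the maximum can also be checked numerically rather than read off the histogram. The sketch below is one way to do that; the helper name `skew_and_cap` and the toy price values are assumptions for illustration, not values from the dataset. With the notebook's data, the call would be `skew_and_cap(df["MEDV"])`.

```python
import pandas as pd


def skew_and_cap(values: pd.Series) -> dict:
    """Summarize asymmetry and how many observations sit at the maximum."""
    top = values.max()
    return {
        "mean": float(values.mean()),
        "median": float(values.median()),
        "skew": float(values.skew()),  # > 0 suggests a right skew
        "max": float(top),
        "n_at_max": int((values == top).sum()),  # pile-up at the ceiling
    }


# Toy prices, made up for illustration only.
toy_prices = pd.Series([12.0, 17.0, 19.5, 21.0, 22.0, 25.0, 31.0, 50.0, 50.0, 50.0])
summary = skew_and_cap(toy_prices)
print(summary)
```

A positive `skew` with `mean` above `median` supports the right-skew reading, and an `n_at_max` greater than one hints that the top value acts as a cap rather than a single extreme observation.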

show a correlation heatmap and identify top features correlated with price

Thinking

Creating correlation heatmap

I need to answer the user's request by showing a correlation heatmap and identifying the features correlated with price. I’ll start with the heatmap since the original request includes both elements. Focusing on one step, I think creating the heatmap first is best. I know the target is MEDV and the data frame is loaded. I’ll use seaborn or matplotlib. It might be helpful to annotate the heatmap and keep it concise, maybe displaying a full 14x14 chart.

Finalizing the plot display

I need to make sure to end the code block with plt.show(), which is a standard way to display the plot. It’s perfectly fine to conclude like that, and I don’t need to ask for further output since the plot will show automatically. I'll focus on crafting this to be concise so that it clearly communicates the intent without unnecessary details. Keeping it straightforward is definitely the goal here!

A good next step is the correlation heatmap so we can see how MEDV relates to the other variables at a glance.

Plot correlation heatmap

import matplotlib.pyplot as plt
import seaborn as sns

corr = df.corr(numeric_only=True)

plt.figure(figsize=(11, 8))
sns.heatmap(corr, cmap="coolwarm", center=0, annot=True, fmt=".2f", square=True)
plt.title("Correlation Heatmap")
plt.tight_layout()
plt.show()

Thinking

Deciding on analysis approach

I need to choose between options A and B for creating a correlation heatmap. The goal is to identify which features most correlate with price. Since we only have the heatmap figure without numeric values, option B is better for ranking. I should provide brief insights based on current outputs and include a minimal runnable code block to display correlations with MEDV sorted. The heatmap shows a mix of positive and negative relationships, but the chart alone isn't enough for ranking. I'll keep it concise.

The heatmap already shows that MEDV has both strong positive and strong negative relationships with other variables, but the figure alone is not enough to reliably rank the top features.

Sort MEDV correlations

medv_corr = corr["MEDV"].drop("MEDV").sort_values(key=lambda s: s.abs(), ascending=False)
medv_corr.to_frame("correlation_with_MEDV")

	correlation_with_MEDV
LSTAT	-0.737663
RM	0.695360
PTRATIO	-0.507787
INDUS	-0.483725
TAX	-0.468536
NOX	-0.427321
CRIM	-0.388305
RAD	-0.381626
AGE	-0.376955
ZN	0.360445
B	0.333461
DIS	0.249929
CHAS	0.175260

From the heatmap and the correlation ranking with MEDV (price), the strongest features are:

LSTAT: -0.738 — strongest relationship overall; a higher share of lower-status population is associated with lower price
RM: 0.695 — more rooms is strongly associated with higher price
PTRATIO: -0.508 — higher pupil-teacher ratio is associated with lower price
INDUS: -0.484
TAX: -0.469

Other notable negative correlations with price:

NOX: -0.427
CRIM: -0.388
RAD: -0.382
AGE: -0.377

Moderate positive correlations:

ZN: 0.360
B: 0.333

Weak correlations:

DIS: 0.250
CHAS: 0.175

In short, the top features most correlated with price are LSTAT and RM by a clear margin. LSTAT is the strongest negative correlate, and RM is the strongest positive correlate. The heatmap supports this by showing the most saturated cells in the MEDV row and column for those two variables.
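
For readers who want to see what the numbers in the ranking measure, the sketch below computes the Pearson correlation coefficient by hand and compares it with NumPy's built-in. Pearson is the default method of `DataFrame.corr`, which the notebook calls without a `method` argument. The function name `pearson_r` and the toy `rooms` and `price` values are assumptions for illustration, not data from the notebook.

```python
import numpy as np


def pearson_r(x, y) -> float:
    """Pearson r: covariance of x and y scaled by both standard deviations."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()
    dy = y - y.mean()
    return float((dx * dy).sum() / np.sqrt((dx**2).sum() * (dy**2).sum()))


# Toy values, made up for illustration only.
rooms = [5.0, 5.5, 6.0, 6.5, 7.0, 7.5]
price = [14.0, 18.0, 20.0, 24.0, 30.0, 41.0]

r = pearson_r(rooms, price)
r_numpy = float(np.corrcoef(rooms, price)[0, 1])
print(round(r, 3), round(r_numpy, 3))
```

Because the coefficient only captures linear association, a curved relationship can be strong while still yielding a moderate value, which is one reason the ranking is best read together with scatter plots.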

create scatter plots of the top 3 features vs price

Thinking

Generating scatter plots code

I need to create a single block of code for generating scatter plots of the top 3 features against price. The top features sorted by absolute correlation are LSTAT, RM, and PTRATIO. I'll generate one figure with 3 subplots, which should satisfy the "one chart" requirement. I'll use seaborn or matplotlib, and ensure the code is concise. The final code will end with plt.show(), and I won't include any additional expressions. Let’s make this clear and effective!

Based on the correlation ranking, the top 3 features are LSTAT, RM, and PTRATIO. Here’s one figure with three scatter plots against MEDV:

Plot top feature scatterplots

import matplotlib.pyplot as plt
import seaborn as sns

top_features = medv_corr.head(3).index.tolist()

fig, axes = plt.subplots(1, 3, figsize=(16, 4))

for ax, feature in zip(axes, top_features):
    sns.scatterplot(data=df, x=feature, y="MEDV", ax=ax, alpha=0.7)
    ax.set_title(f"{feature} vs MEDV")
    ax.set_xlabel(feature)
    ax.set_ylabel("MEDV")

plt.tight_layout()
plt.show()

The scatter plots for the top 3 features versus price are now in place, and the earlier correlation values tell us exactly why these three were chosen:

LSTAT vs MEDV: strongest relationship, with a clear negative trend. As LSTAT increases, MEDV tends to decrease.
RM vs MEDV: strongest positive trend. Higher RM generally aligns with higher MEDV.
PTRATIO vs MEDV: noticeable negative association. Higher PTRATIO tends to come with lower MEDV.

These match the top three absolute correlations already computed:

LSTAT: -0.737663
RM: 0.695360
PTRATIO: -0.507787

So the plots directly support the original goal: they visualize how the three features most correlated with price relate to MEDV, with LSTAT and RM showing the strongest patterns.
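
The scatter cell in this notebook uses `sns.scatterplot`, which draws points only, while the Outcomes list calls for regression lines. One possible way to add them is seaborn's `regplot`, sketched below. The helper name `plot_with_fit` and the synthetic data are assumptions made so the block runs on its own; they are not part of the original notebook.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns


def plot_with_fit(df, features, target="MEDV"):
    """One panel per feature: scatter points plus a fitted linear trend line."""
    fig, axes = plt.subplots(1, len(features), figsize=(16, 4), squeeze=False)
    for ax, feature in zip(axes[0], features):
        sns.regplot(
            data=df,
            x=feature,
            y=target,
            ax=ax,
            scatter_kws={"alpha": 0.7},
            line_kws={"color": "red"},
        )
        ax.set_title(f"{feature} vs {target}")
    fig.tight_layout()
    return fig


# Synthetic stand-in data, generated for illustration only.
rng = np.random.default_rng(0)
n = 60
toy = pd.DataFrame(
    {
        "LSTAT": rng.uniform(2, 35, n),
        "RM": rng.uniform(4, 8, n),
        "PTRATIO": rng.uniform(13, 22, n),
    }
)
toy["MEDV"] = 35 - 0.8 * toy["LSTAT"] + rng.normal(0, 2, n)

fig = plot_with_fit(toy, ["LSTAT", "RM", "PTRATIO"])
```

In the notebook itself, the equivalent call would presumably be `plot_with_fit(df, medv_corr.head(3).index.tolist())` followed by `plt.show()`. The fitted line is a straight-line fit, so it will understate any curvature in a relationship.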
