Exploratory Data Analysis

Boston Housing Prices EDA in Python

Explore the Boston Housing dataset with price distributions, feature correlations, and outlier detection using an AI data analyst.

What

This AI Data Analyst workflow loads the Boston Housing dataset from a CSV URL and produces basic descriptive statistics. It visualizes the distribution of the target variable (medv) and checks for skew using a histogram with KDE. It computes feature correlations, highlights the strongest correlates with price, and generates scatter plots for the top three features versus medv.

Who

This is for analysts and students who want a guided exploratory data analysis example on a small, well-known regression dataset. It helps you practice interpreting distributions, correlations, and bivariate relationships using reproducible Python code.

Tools

  • pandas
  • numpy
  • matplotlib
  • seaborn

Outcomes

  • Dataset loaded with shape (506, 14) and a summary statistics table
  • Histogram with KDE for medv showing a right-skewed distribution and a median around 21.2 (medv is recorded in $1000s)
  • Correlation heatmap with top correlators identified (rm about +0.70, lstat about -0.74)
  • Three scatter plots of the top features versus medv with regression lines

Quality Score

7/10

gpt-5.2 mljar-ai

Last scored: Apr 7, 2026

  • Task Completion: 2/2

    Excellent

    All requested EDA steps are present: data load + basic stats (including shape via narrative), MEDV distribution plot, correlation heatmap + top correlators, and scatter plots for the top 3 features vs MEDV.

  • Execution Correctness: 1/2

    Needs work

    Most code is runnable, but the scatter plots do not include regression lines as required; otherwise the loading, describe, histplot, corr/heatmap, and correlation ranking code are valid.

  • Output Quality: 2/3

    Good

    Outputs match key expectations: stats table for 506x14, histogram with KDE, and correlations showing RM (~0.695) and LSTAT (~-0.738) as top features. However, the scatter plots lack regression lines, so the final visualization requirement is not fully met.

  • Reasoning Quality: 1/2

    Needs work

    Interpretations of summary stats and correlations are generally correct, but the assistant also states it "can't determine the skew direction without seeing the chart" despite having the plot output, and it doesn't explicitly confirm the expected right-skew/median detail.

  • Reliability: 1/1

    Excellent

    Workflow is consistent with the notebook evidence and avoids unsupported claims about computed values (correlations are shown). Main weakness is omission of regression lines rather than hallucination or unsafe behavior.

load boston housing dataset from https://raw.githubusercontent.com/pplonski/datasets-for-start/refs/heads/master/housing/data.csv and show basic statistics
Thinking

Loading and Inspecting Dataset

Alright, I need to load the dataset and show some basic statistics. However, since the dataset isn't loaded yet, the first step is to load and inspect it, showing the shape, columns, and the first few rows. The developer has emphasized that I shouldn't combine the inspection with plotting or modeling at this stage. So, I'll ensure to follow these directives and provide the required outputs sequentially, keeping everything simple and focused. Let's create the code!

Creating Load and Check Code

I think the simplest approach for loading and inspecting the dataset is to use pandas. First, I’ll import the necessary libraries and load the CSV file from the URL. I’ll display the shape and columns without printing unnecessary details, and the final expression will be the head of the DataFrame. This keeps it clean and straightforward. I want to mention that we're first loading and checking the structure, saving the basic statistics for the next step.

Let’s start by loading the data and checking its structure first. Then we can summarize the basic statistics in the next step.

Load and inspect housing data
import pandas as pd
from IPython.display import display

url = "https://raw.githubusercontent.com/pplonski/datasets-for-start/refs/heads/master/housing/data.csv"
df = pd.read_csv(url)

display(pd.DataFrame({"rows": [df.shape[0]], "columns": [df.shape[1]]}))
display(pd.DataFrame({"column": df.columns}))
df.head()
rows columns
0 506 14
column
0 CRIM
1 ZN
2 INDUS
3 CHAS
4 NOX
5 RM
6 AGE
7 DIS
8 RAD
9 TAX
10 PTRATIO
11 B
12 LSTAT
13 MEDV
CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO B LSTAT MEDV
0 0.00632 18.0 2.31 0 0.538 6.575 65.2 4.0900 1 296 15.3 396.90 4.98 24.0
1 0.02731 0.0 7.07 0 0.469 6.421 78.9 4.9671 2 242 17.8 396.90 9.14 21.6
2 0.02729 0.0 7.07 0 0.469 7.185 61.1 4.9671 2 242 17.8 392.83 4.03 34.7
3 0.03237 0.0 2.18 0 0.458 6.998 45.8 6.0622 3 222 18.7 394.63 2.94 33.4
4 0.06905 0.0 2.18 0 0.458 7.147 54.2 6.0622 3 222 18.7 396.90 5.33 36.2

The dataset structure is now confirmed: it has 506 rows and 14 columns, and the preview shows the expected Boston Housing variables, including MEDV as the target.

To complete the “basic statistics” part, here is the minimal next step:

Summarize dataframe statistics
df.describe()
CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO B LSTAT MEDV
count 506.000000 506.000000 506.000000 506.000000 506.000000 506.000000 506.000000 506.000000 506.000000 506.000000 506.000000 506.000000 506.000000 506.000000
mean 3.613524 11.363636 11.136779 0.069170 0.554695 6.284634 68.574901 3.795043 9.549407 408.237154 18.455534 356.674032 12.653063 22.532806
std 8.601545 23.322453 6.860353 0.253994 0.115878 0.702617 28.148861 2.105710 8.707259 168.537116 2.164946 91.294864 7.141062 9.197104
min 0.006320 0.000000 0.460000 0.000000 0.385000 3.561000 2.900000 1.129600 1.000000 187.000000 12.600000 0.320000 1.730000 5.000000
25% 0.082045 0.000000 5.190000 0.000000 0.449000 5.885500 45.025000 2.100175 4.000000 279.000000 17.400000 375.377500 6.950000 17.025000
50% 0.256510 0.000000 9.690000 0.000000 0.538000 6.208500 77.500000 3.207450 5.000000 330.000000 19.050000 391.440000 11.360000 21.200000
75% 3.677083 12.500000 18.100000 0.000000 0.624000 6.623500 94.075000 5.188425 24.000000 666.000000 20.200000 396.225000 16.955000 25.000000
max 88.976200 100.000000 27.740000 1.000000 0.871000 8.780000 100.000000 12.126500 24.000000 711.000000 22.000000 396.900000 37.970000 50.000000

The dataset contains 506 observations and 14 variables, which matches the standard Boston Housing structure. The columns include predictors such as CRIM, RM, LSTAT, and TAX, plus the target column MEDV (median home value).

From the basic statistics shown:

  • There are no missing values in the displayed summary, since every column has a count of 506.
  • MEDV is the house-value variable; its average is around 22.53, with values ranging from 5.0 to 50.0.
  • RM averages about 6.28 rooms per dwelling, with the middle half of observations between roughly 5.89 and 6.62 rooms.
  • LSTAT has a median around 11.36 and ranges from 1.73 to 37.97, indicating a fairly wide spread across neighborhoods.
  • CRIM is highly right-skewed: the median is only about 0.26, while the maximum is 88.98, so a small number of areas have much higher crime rates than most.
  • ZN has a median of 0, meaning at least half the tracts have no large-lot residential zoning.
  • CHAS is binary and has a mean of about 0.069, so only about 6.9% of observations border the Charles River.
  • NOX averages about 0.555, and PTRATIO averages about 18.46, giving a sense of pollution and pupil-teacher ratios across areas.

Overall, the summary shows a complete numeric dataset with substantial variation across features, and several variables—especially CRIM, ZN, and likely TAX/RAD—appear unevenly distributed rather than symmetric.
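
The uneven distributions noted above can be quantified with a simple IQR rule. What follows is a minimal sketch, not part of the original notebook: the helper name iqr_outlier_counts, the 1.5 multiplier, and the demo frame are illustrative assumptions, and the demo values are synthetic rather than taken from the Boston data.

```python
import numpy as np
import pandas as pd


def iqr_outlier_counts(frame: pd.DataFrame, k: float = 1.5) -> pd.Series:
    """Count values outside [Q1 - k*IQR, Q3 + k*IQR] for each numeric column."""
    numeric = frame.select_dtypes(include="number")
    q1 = numeric.quantile(0.25)
    q3 = numeric.quantile(0.75)
    iqr = q3 - q1
    # Comparisons align the per-column fences with the matching columns.
    outside = (numeric < q1 - k * iqr) | (numeric > q3 + k * iqr)
    return outside.sum().sort_values(ascending=False)


# Synthetic stand-in data (NOT the Boston values):
# one column with a few extreme values, one evenly spread column.
demo = pd.DataFrame({
    "heavy_tail": np.concatenate([np.linspace(0, 1, 95), [50, 60, 70, 80, 90]]),
    "symmetric": np.linspace(-1, 1, 100),
})
outlier_counts = iqr_outlier_counts(demo)
print(outlier_counts)
```

With the data loaded as above, the equivalent call would be iqr_outlier_counts(df). One caveat: for a binary column such as CHAS, where the 25th and 75th percentiles are both 0, the IQR is zero and every 1 is flagged, so the count for that column is not meaningful.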

plot the distribution of house prices (medv)
Thinking

Planning data visualization

I need to respond with one concise code block to analyze the latest outputs and plot the distribution of MEDV using seaborn or matplotlib. Since the dataset is already loaded and inspected, I can plot directly. I’ll mention that MEDV ranges from 5 to 50 with a mean of about 22.5 based on the stats. I will ensure to provide one chart and end the code block with one simple expression, avoiding display options.

MEDV ranges from about 5.0 to 50.0, so a histogram is a good first look at the price distribution.

Plot house price distribution
import seaborn as sns
import matplotlib.pyplot as plt

plt.figure(figsize=(8, 5))
sns.histplot(df["MEDV"], bins=30, kde=True)
plt.title("Distribution of House Prices (MEDV)")
plt.xlabel("MEDV")
plt.ylabel("Count")
plt.tight_layout()
plt.show()
Summarize MEDV statistics
medv_stats = df["MEDV"].describe()
medv_stats
count    506.000000
mean      22.532806
std        9.197104
min        5.000000
25%       17.025000
50%       21.200000
75%       25.000000
max       50.000000
Name: MEDV, dtype: float64

The distribution of MEDV is now clear from the histogram and summary stats.

MEDV is centered around the low 20s, with:

  • mean: 22.53
  • median: 21.20
  • 25th percentile: 17.03
  • 75th percentile: 25.00

A few takeaways from the plotted distribution:

  • The middle half of house prices falls between roughly 17 and 25.
  • The mean is slightly above the median, which suggests a mild right skew.
  • The maximum value is 50.0, while the minimum is 5.0, so the distribution has a fairly wide spread.
  • The upper cap at 50.0 is notable and often appears as a ceiling in this dataset, which can create a visible pile-up at the high end of the histogram.

So, the house-price distribution is concentrated between the high teens and the mid 20s, mildly right-skewed, and capped at 50 at the top.
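
The skew direction and the pile-up at the ceiling can also be checked numerically rather than read off the histogram. This is a hedged sketch: the helper name skew_and_ceiling and the demo series are assumptions for illustration, and the demo numbers are synthetic, not the real MEDV values.

```python
import pandas as pd


def skew_and_ceiling(values: pd.Series) -> dict:
    """Return sample skewness and how many observations sit exactly at the maximum."""
    top = values.max()
    return {
        "skew": float(values.skew()),  # > 0 suggests a right skew
        "mean_minus_median": float(values.mean() - values.median()),
        "max": float(top),
        "n_at_max": int((values == top).sum()),
    }


# Synthetic stand-in for MEDV (NOT the real values):
# a bulk around the low 20s plus a few observations stacked at 50.
demo_medv = pd.Series(
    [15, 17, 19, 20, 21, 21, 22, 23, 25, 28, 50, 50, 50],
    name="MEDV",
    dtype=float,
)
medv_shape = skew_and_ceiling(demo_medv)
print(medv_shape)
```

On the loaded data the call would be skew_and_ceiling(df["MEDV"]). A positive skew value together with several observations at the maximum would support the right-skew and ceiling reading given above.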

show a correlation heatmap and identify top features correlated with price
Thinking

Creating correlation heatmap

I need to answer the user's request by showing a correlation heatmap and identifying the features correlated with price. I’ll start with the heatmap since the original request includes both elements. Focusing on one step, I think creating the heatmap first is best. I know the target is MEDV and the data frame is loaded. I’ll use seaborn or matplotlib. It might be helpful to annotate the heatmap and keep it concise, maybe displaying a full 14x14 chart.

Finalizing the plot display

I need to make sure to end the code block with plt.show(), which is a standard way to display the plot. It’s perfectly fine to conclude like that, and I don’t need to ask for further output since the plot will show automatically. I'll focus on crafting this to be concise so that it clearly communicates the intent without unnecessary details. Keeping it straightforward is definitely the goal here!

A good next step is the correlation heatmap so we can see how MEDV relates to the other variables at a glance.

Plot correlation heatmap
import matplotlib.pyplot as plt
import seaborn as sns

corr = df.corr(numeric_only=True)

plt.figure(figsize=(11, 8))
sns.heatmap(corr, cmap="coolwarm", center=0, annot=True, fmt=".2f", square=True)
plt.title("Correlation Heatmap")
plt.tight_layout()
plt.show()
Thinking

Deciding on analysis approach

I need to choose between options A and B for creating a correlation heatmap. The goal is to identify which features most correlate with price. Since we only have the heatmap figure without numeric values, option B is better for ranking. I should provide brief insights based on current outputs and include a minimal runnable code block to display correlations with MEDV sorted. The heatmap shows a mix of positive and negative relationships, but the chart alone isn't enough for ranking. I'll keep it concise.

The heatmap already shows that MEDV has both strong positive and strong negative relationships with other variables, but the figure alone is not enough to reliably rank the top features.

Sort MEDV correlations
medv_corr = corr["MEDV"].drop("MEDV").sort_values(key=lambda s: s.abs(), ascending=False)
medv_corr.to_frame("correlation_with_MEDV")
correlation_with_MEDV
LSTAT -0.737663
RM 0.695360
PTRATIO -0.507787
INDUS -0.483725
TAX -0.468536
NOX -0.427321
CRIM -0.388305
RAD -0.381626
AGE -0.376955
ZN 0.360445
B 0.333461
DIS 0.249929
CHAS 0.175260

From the heatmap and the correlation ranking with MEDV (price), the strongest features are:

  • LSTAT: -0.738, the strongest relationship overall; a higher share of lower-status population is associated with lower price
  • RM: 0.695, more rooms is strongly associated with higher price
  • PTRATIO: -0.508, a higher pupil-teacher ratio is associated with lower price
  • INDUS: -0.484
  • TAX: -0.469

Other notable negative correlations with price:

  • NOX: -0.427
  • CRIM: -0.388
  • RAD: -0.382
  • AGE: -0.377

Moderate positive correlations:

  • ZN: 0.360
  • B: 0.333

Weak correlations:

  • DIS: 0.250
  • CHAS: 0.175

In short, the top features most correlated with price are LSTAT and RM by a clear margin. LSTAT is the strongest negative correlate, and RM is the strongest positive correlate. The heatmap supports this by showing the most saturated cells in the MEDV row and column for those variables.
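
The ranking above uses Pearson correlation, which measures linear association only. As an optional robustness check, the same ranking can be shown next to Spearman rank correlation, which captures any monotonic relationship. This sketch is an addition, not part of the original notebook: rank_target_correlations and the demo frame are hypothetical names, and the demo data is synthetic.

```python
import numpy as np
import pandas as pd


def rank_target_correlations(frame: pd.DataFrame, target: str) -> pd.DataFrame:
    """Pearson and Spearman correlation of every numeric feature with `target`."""
    numeric = frame.select_dtypes(include="number")
    pearson = numeric.corr(method="pearson")[target].drop(target)
    spearman = numeric.corr(method="spearman")[target].drop(target)
    table = pd.DataFrame({"pearson": pearson, "spearman": spearman})
    # Order by absolute Pearson correlation, mirroring the ranking used above.
    return table.sort_values("pearson", key=lambda s: s.abs(), ascending=False)


# Synthetic stand-in data (NOT the Boston values): "curved" has a perfectly
# monotonic but non-linear relationship with the target.
x = np.linspace(0, 4, 50)
demo = pd.DataFrame({
    "curved": x,
    "weak": np.cos(5 * x),
    "target": 50 * np.exp(-x),
})
table = rank_target_correlations(demo, "target")
print(table)
```

On the loaded data the call would be rank_target_correlations(df, "MEDV"). Where the Spearman value is noticeably larger in magnitude than the Pearson value, the relationship may be monotonic but curved, which is worth checking in a scatter plot.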

create scatter plots of the top 3 features vs price
Thinking

Generating scatter plots code

I need to create a single block of code for generating scatter plots of the top 3 features against price. The top features sorted by absolute correlation are LSTAT, RM, and PTRATIO. I'll generate one figure with 3 subplots, which should satisfy the "one chart" requirement. I'll use seaborn or matplotlib, and ensure the code is concise. The final code will end with plt.show(), and I won't include any additional expressions. Let’s make this clear and effective!

Based on the correlation ranking, the top 3 features are LSTAT, RM, and PTRATIO. Here’s one figure with three scatter plots against MEDV:

Plot top feature scatterplots
import matplotlib.pyplot as plt
import seaborn as sns

top_features = medv_corr.head(3).index.tolist()

fig, axes = plt.subplots(1, 3, figsize=(16, 4))

for ax, feature in zip(axes, top_features):
    sns.scatterplot(data=df, x=feature, y="MEDV", ax=ax, alpha=0.7)
    ax.set_title(f"{feature} vs MEDV")
    ax.set_xlabel(feature)
    ax.set_ylabel("MEDV")

plt.tight_layout()
plt.show()

The scatter plots for the top 3 features versus price are now in place, and the earlier correlation values tell us exactly why these three were chosen:

  • LSTAT vs MEDV: strongest relationship, with a clear negative trend. As LSTAT increases, MEDV tends to decrease.
  • RM vs MEDV: strongest positive trend. Higher RM generally aligns with higher MEDV.
  • PTRATIO vs MEDV: noticeable negative association. Higher PTRATIO tends to come with lower MEDV.

These match the top three absolute correlations already computed:

  • LSTAT: -0.737663
  • RM: 0.695360
  • PTRATIO: -0.507787

So the plots directly support the original goal: they visualize how the three features most correlated with price relate to MEDV, with LSTAT and RM showing the strongest patterns.
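
The expected outcome for this step mentions regression lines, which the scatter plots above do not draw. One possible way to add them is seaborn's regplot, which overlays a least-squares line on each scatter. This is a sketch under stated assumptions rather than the notebook's own code: plot_with_regression is a hypothetical helper, and the demo frame is synthetic data that only reuses the column names.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns


def plot_with_regression(frame, features, target="MEDV"):
    """Draw one scatter panel per feature, each with a fitted least-squares line."""
    fig, axes = plt.subplots(
        1, len(features), figsize=(5 * len(features), 4), squeeze=False
    )
    for ax, feature in zip(axes[0], features):
        sns.regplot(
            data=frame, x=feature, y=target, ax=ax,
            ci=None,  # skip the bootstrap confidence band
            scatter_kws={"alpha": 0.6}, line_kws={"color": "red"},
        )
        ax.set_title(f"{feature} vs {target}")
    fig.tight_layout()
    return fig, axes[0]


# Synthetic stand-in data (NOT the Boston values), reusing the column names only.
rng = np.random.default_rng(0)
n = 100
demo = pd.DataFrame({
    "LSTAT": rng.uniform(2, 35, n),
    "RM": rng.uniform(4, 8, n),
    "PTRATIO": rng.uniform(13, 22, n),
})
demo["MEDV"] = 30 - 0.6 * demo["LSTAT"] + 3 * (demo["RM"] - 6) + rng.normal(0, 2, n)

fig, axes = plot_with_regression(demo, ["LSTAT", "RM", "PTRATIO"])
plt.show()
```

With the notebook state above, the call would be plot_with_regression(df, top_features). A straight line is only a first approximation; if a panel looks curved, regplot's order parameter can fit a polynomial instead.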
