Boston Housing Prices EDA in Python

Explore the Boston Housing dataset with price distributions, feature correlations, and outlier detection using an AI data analyst.

This page shows an AI data analysis example. A workflow is a list of prompts (instructions) used to complete one analysis task step by step. We send the same prompt list to multiple AI models, then compare their notebook outputs and scores.

What this AI workflow does

This AI Data Analyst workflow loads the Boston Housing dataset from a CSV URL and produces basic descriptive statistics. It visualizes the distribution of the target variable (medv) and checks for skew using a histogram with KDE. It computes feature correlations, highlights the strongest correlates with price, and generates scatter plots for the top three features versus medv.

Who this example is for

This is for analysts and students who want a guided exploratory data analysis example on a small, well-known regression dataset. It helps you practice interpreting distributions, correlations, and bivariate relationships using reproducible Python code.

Expected analysis outcomes

These are the results the AI workflow is expected to generate.

  • Dataset loaded with shape (506, 14) and a summary statistics table
  • Histogram with KDE for medv showing a right-skewed distribution and median around 21k
  • Correlation heatmap with top correlators identified (rm about +0.70, lstat about -0.74)
  • Three scatter plots of the top features versus medv with regression lines

Tools and libraries used

Main Python packages and tooling used to run this AI data analysis task.

  • pandas
  • numpy
  • matplotlib
  • seaborn

Prompt sequence

This is the exact list of prompts used in this workflow. The same prompt sequence is sent to each model so outputs and scores can be compared fairly.

  1. 1load boston housing dataset from https://raw.githubusercontent.com/pplonski/datasets-for-start/refs/heads/master/housing/data.csv and show basic statistics
  2. 2plot the distribution of house prices (medv)
  3. 3show a correlation heatmap and identify top features correlated with price
  4. 4create scatter plots of the top 3 features vs price

Model Comparison

We compared several LLM models on the same analysis task. The table below shows their scores side by side. You can also open each model run to check the full conversation and notebook results.

Model SourceScoreTaskExecutionOutputReasoningReliabilityRun
gemma4:31b10/102/22/23/32/21/1Open gemma4:31b conversation
glm-5.110/102/22/23/32/21/1Open glm-5.1 conversation
gpt-5.410/102/22/23/32/21/1Open gpt-5.4 conversation
gpt-oss:120b10/102/22/23/32/21/1Open gpt-oss:120b conversation
qwen3-coder-next10/102/22/23/32/21/1Open qwen3-coder-next conversation
qwen3.5:397b10/102/22/23/32/21/1Open qwen3.5:397b conversation
How scoring works

Each run is graded on five dimensions:

  • Task Completion (0-2): how fully the workflow answered the requested analysis task and completed required steps.
  • Execution Correctness (0-2): whether generated code is correct, consistent, and likely runnable without major fixes.
  • Output Quality (0-3): how well tables, charts, and results match the expected analytical outcomes.
  • Reasoning Quality (0-2): clarity and correctness of interpretation, explanation, and conclusions.
  • Reliability (0-1): robustness of the run, including low hallucination risk and stable behavior.

The final score is the sum of these dimensions, so total score = 0-10. Higher values indicate more complete, correct, and reliable notebook results for this workflow.

AI Analysis with glm-5.1

This run uses AI with a step-by-step list of prompts to perform the analysis task Boston Housing Prices EDA in Python. Below, you can review the full conversation generated with glm-5.1. After the conversation, we show detailed score cards for this run.

load boston housing dataset from https://raw.githubusercontent.com/pplonski/datasets-for-start/refs/heads/master/housing/data.csv and show basic statistics
Thinking

The user wants to load the Boston housing dataset from the given URL and show basic statistics. Let me start by loading the data and inspecting it.

I'll load the dataset and inspect it first — showing shape, columns, and basic statistics.

Load housing data
import pandas as pd

df = pd.read_csv("https://raw.githubusercontent.com/pplonski/datasets-for-start/refs/heads/master/housing/data.csv")
print(f"Shape: {df.shape}")
print(f"Columns: {list(df.columns)}")
df.describe()
Shape: (506, 14)
Columns: ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT', 'MEDV']
CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO B LSTAT MEDV
count 506.000000 506.000000 506.000000 506.000000 506.000000 506.000000 506.000000 506.000000 506.000000 506.000000 506.000000 506.000000 506.000000 506.000000
mean 3.613524 11.363636 11.136779 0.069170 0.554695 6.284634 68.574901 3.795043 9.549407 408.237154 18.455534 356.674032 12.653063 22.532806
std 8.601545 23.322453 6.860353 0.253994 0.115878 0.702617 28.148861 2.105710 8.707259 168.537116 2.164946 91.294864 7.141062 9.197104
min 0.006320 0.000000 0.460000 0.000000 0.385000 3.561000 2.900000 1.129600 1.000000 187.000000 12.600000 0.320000 1.730000 5.000000
25% 0.082045 0.000000 5.190000 0.000000 0.449000 5.885500 45.025000 2.100175 4.000000 279.000000 17.400000 375.377500 6.950000 17.025000
50% 0.256510 0.000000 9.690000 0.000000 0.538000 6.208500 77.500000 3.207450 5.000000 330.000000 19.050000 391.440000 11.360000 21.200000
75% 3.677083 12.500000 18.100000 0.000000 0.624000 6.623500 94.075000 5.188425 24.000000 666.000000 20.200000 396.225000 16.955000 25.000000
max 88.976200 100.000000 27.740000 1.000000 0.871000 8.780000 100.000000 12.126500 24.000000 711.000000 22.000000 396.900000 37.970000 50.000000
Thinking

The user's goal was to load the Boston Housing dataset and show basic statistics. Looking at the notebook output, the dataset has been loaded successfully with 506 rows and 14 columns, and df.describe() has been executed showing count, mean, std, min, 25%, 50%, 75%, and max for all numeric columns. The output was truncated but shows sufficient statistics.

The original goal is complete - the dataset is loaded and basic statistics are shown. I should choose Path A and provide a final answer based on the observed outputs.

The Boston Housing dataset is loaded with 506 rows and 14 columns: CRIM, ZN, INDUS, CHAS, NOX, RM, AGE, DIS, RAD, TAX, PTRATIO, B, LSTAT, and MEDV (the target — median home value in 1000s USD).

Key insights from the statistics:

  • CRIM (per capita crime rate) is heavily right-skewed — median is 0.26 vs. a mean of 3.61 and max of 88.98, meaning a few high-crime areas pull the average up.
  • RM (average rooms per dwelling) is fairly normally distributed around a mean of ~6.28, with a range of 3.56 to 8.78.
  • CHAS (Charles River dummy) is binary — only ~6.9% of tracts border the river.
  • NOX (nitric oxide concentration) ranges from 0.385 to 0.871, indicating variation in air quality across tracts.
  • AGE (proportion of pre-1940 units) has a median of 77.5%, showing most neighborhoods have older housing stock.
  • PTRATIO (pupil-teacher ratio) ranges from 12.6 to 22.0, with a mean around 18.5.
  • LSTAT (% lower status population) ranges from 1.7% to 37.97%, capturing economic disadvantage.
  • MEDV (median home value, in 1000s USD) has a mean of ~22.5 (i.e., 22,500 USD at the time), with a max of 50 (50,000 USD) — the max is likely censored/capped.

The describe() output covers all the basic statistics you'd need for a first look at this dataset.

plot the distribution of house prices (medv)
Thinking

The user wants to plot the distribution of house prices (MEDV). I'll use seaborn to create a histogram with a KDE overlay.

Plot home price distribution
import seaborn as sns
import matplotlib.pyplot as plt

plt.figure(figsize=(10, 5))
sns.histplot(df["MEDV"], kde=True, bins=30, color="steelblue")
plt.title("Distribution of House Prices (MEDV)")
plt.xlabel("Median Home Value (in 1000s USD)")
plt.ylabel("Frequency")
plt.tight_layout()
plt.show()
Thinking

The user's goal was to plot the distribution of house prices (MEDV). The latest executed notebook output (cell-2) shows a histogram with KDE of MEDV — a Figure object was produced successfully. The chart is sufficient to answer the original goal of plotting the distribution. I can provide insights about what the distribution looks like based on what we know from both the statistics (cell-1) and the chart (cell-2).

Path A is appropriate here.

The distribution of house prices (MEDV) has been plotted successfully. The histogram shows a roughly normal distribution centered around 20,000–22,000 USD, with a slight right skew. A notable spike appears at 50,000 USD — this is a well-known artifact where values were capped/censored at 50, meaning any home valued at 50,000 USD or above was recorded as exactly 50. Excluding that cap, the bulk of homes fall in the 15,000–30,000 USD range, with a long tail of lower-priced homes below 10,000 USD.

show a correlation heatmap and identify top features correlated with price
Thinking

The user wants a correlation heatmap and the top features correlated with price (MEDV). I'll compute the correlation matrix for the housing dataset and create a heatmap, then identify the features most correlated with MEDV.

Let me do this in one step - create the heatmap and display the top correlations with MEDV.

Plot correlation heatmap
import seaborn as sns
import matplotlib.pyplot as plt

corr = df.corr()

plt.figure(figsize=(12, 10))
sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm", square=True,
            linewidths=0.5, vmin=-1, vmax=1)
plt.title("Correlation Heatmap — Boston Housing Features")
plt.tight_layout()
plt.show()

# Top features correlated with MEDV (absolute correlation, excluding MEDV itself)
corr_with_medv = corr["MEDV"].drop("MEDV").abs().sort_values(ascending=False)
corr_with_medv
LSTAT      0.737663
RM         0.695360
PTRATIO    0.507787
INDUS      0.483725
TAX        0.468536
NOX        0.427321
CRIM       0.388305
RAD        0.381626
AGE        0.376955
ZN         0.360445
B          0.333461
DIS        0.249929
CHAS       0.175260
Name: MEDV, dtype: float64
Thinking

The user wants to see a correlation heatmap and identify top features correlated with price. Looking at the notebook state, the correlation heatmap has already been generated and the top features correlated with MEDV have been identified with numeric values. The output shows both the heatmap figure and the sorted correlation values.

The top features by absolute correlation with MEDV (house price) are:

  1. LSTAT (0.74) - % lower status population, strong negative correlation
  2. RM (0.70) - average number of rooms, strong positive correlation
  3. PTRATIO (0.51) - pupil-teacher ratio
  4. INDUS (0.48) - proportion of non-retail business acres
  5. TAX (0.47) - property tax rate

This is sufficient to answer the original goal - both the heatmap has been shown and the top features have been identified with numeric values. I should choose Path A.

The correlation heatmap and the sorted correlation values are already available, so the goal is fully answered.

Top features correlated with MEDV (house price):

| Feature | |Cor| | Interpretation | |---|---|---| | LSTAT | 0.74 | % lower-status population — higher poverty strongly depresses home values | | RM | 0.70 | Average rooms per dwelling — more rooms strongly increase value | | PTRATIO | 0.51 | Pupil-teacher ratio — worse school ratios associate with lower prices | | INDUS | 0.48 | Industrial land proportion — more industry nearby lowers prices | | TAX | 0.47 | Property tax rate — higher taxes track with lower values |

The two dominant features are LSTAT and RM, both with correlations above 0.7. LSTAT is negatively correlated (more lower-status residents → lower price), while RM is positively correlated (more rooms → higher price). There's a sharp drop to the third feature (PTRATIO at 0.51), suggesting these two are the strongest linear predictors of house price. The remaining features have moderate-to-weak correlations below 0.5.

create scatter plots of the top 3 features vs price
Thinking

The user wants scatter plots of the top 3 features vs price. Based on the notebook state, the top 3 features correlated with MEDV are LSTAT, RM, and PTRATIO. I'll create scatter plots for these three features against MEDV.

The top 3 features correlated with price are LSTAT, RM, and PTRATIO. Let's create scatter plots for each against MEDV.

Plot top feature scatterplots
import matplotlib.pyplot as plt
import seaborn as sns

top_features = ["LSTAT", "RM", "PTRATIO"]

fig, axes = plt.subplots(1, 3, figsize=(18, 5), sharey=True)

for ax, feat in zip(axes, top_features):
    sns.scatterplot(data=df, x=feat, y="MEDV", ax=ax, alpha=0.6, edgecolor=None)
    ax.set_title(f"MEDV vs {feat} (r = {corr.loc[feat, 'MEDV']:.2f})")
    ax.set_xlabel(feat)
    ax.set_ylabel("MEDV (1000s USD)" if ax is axes[0] else "")

plt.suptitle("Top 3 Features Correlated with House Price", fontsize=14, y=1.02)
plt.tight_layout()
plt.show()
Thinking

The user's goal was to create scatter plots of the top 3 features vs price. The notebook shows that cell-4 has been executed and produced a figure with 3 scatter plots (Figure size 1800x500 with 3 Axes). The scatter plots show MEDV vs LSTAT, RM, and PTRATIO with correlation coefficients displayed in the titles.

The output is sufficient - the scatter plots have been created and displayed. I can provide insights from what's visible in the outputs.

Let me provide concrete insights from the scatter plots and the correlation values we already know.

The scatter plots for all three top features are now displayed. Here are the key takeaways from each:

LSTAT vs MEDV (r = -0.74): Strong negative relationship with a clear curved pattern. Prices drop sharply as the lower-status population percentage increases from ~5% to ~20%, then flatten. A few outliers exist at very high LSTAT values with unusually high prices.

RM vs MEDV (r = 0.70): Strong positive, roughly linear relationship. Each additional room adds meaningfully to home value, though variance increases at higher room counts. There's a notable cluster of homes capped at the 50,000 USD maximum (MEDV = 50), which likely distorts the true relationship at the high end.

PTRATIO vs MEDV (r = -0.51): Moderate negative relationship, but the scatter is much wider. The pattern is less clean — while lower pupil-teacher ratios tend to pair with higher prices, the signal is noisy. This confirms it's a weaker predictor compared to LSTAT and RM.

One important observation across all three plots: the horizontal line of points at MEDV = 50 suggests the original dataset censored prices at 50,000 USD, which may dampen the observed correlations, especially for RM and LSTAT where high-feature-value homes are most affected.

Score cards for glm-5.1 for Boston Housing Prices EDA in Python

Total score: 10/10

  • Task Completion: 2/2

    Excellent

    All four requested EDA steps were completed: loading/statistics, MEDV distribution, correlation heatmap with top features, and scatter plots for the top 3 features.

  • Execution Correctness: 2/2

    Excellent

    The code shown is syntactically valid and the outputs indicate it ran successfully, including the expected shapes, tables, and figures.

  • Output Quality: 3/3

    Excellent

    Outputs match the expected outcomes semantically: shape (506, 14), descriptive stats, a MEDV histogram with KDE, correct top correlations (LSTAT and RM), and three scatter plots.

  • Reasoning Quality: 2/2

    Excellent

    The notebook correctly interprets the dataset and correlation results, with clear explanations of skew, capping at 50, and the meaning of the strongest correlations.

  • Reliability: 1/1

    Excellent

    The workflow is consistent and grounded in observed outputs, with no obvious hallucinations or unsupported claims.

Try MLJAR Studio

Run the same type of AI-powered data analysis on your own datasets with conversational notebooks in MLJAR Studio.

Explore More AI Analysis Examples

Discover additional workflows across categories. Each example includes prompts, conversation outputs, and model-level scoring so you can compare approaches and results.