Linear Regression Analysis in Python

Run simple and multiple linear regression, interpret coefficients, check assumptions, and evaluate model fit using statsmodels and scikit-learn.

This page shows an AI data analysis example. A workflow is a list of prompts (instructions) used to complete one analysis task step by step. We send the same prompt list to multiple AI models, then compare their notebook outputs and scores.

What this AI workflow does

This AI Data Analyst workflow loads the Boston Housing dataset from a URL and fits a simple linear regression of RM (rooms) versus MEDV (price). It then fits a multiple linear regression using all features and generates a statsmodels OLS summary for coefficient and p-value interpretation. The workflow checks key regression assumptions with residual and Q-Q plots, evaluates homoscedasticity, and compares model fit between the simple and multiple models using R-squared.

Who this example is for

This is for analysts and students who want a guided, code-generating notebook to run and interpret linear regression in Python. It helps users validate assumptions and compare simple versus multiple regression using standard statistical outputs and diagnostics.

Expected analysis outcomes

These are the results the AI workflow is expected to generate.

  • Dataset loaded from the provided CSV URL
  • Simple regression scatter plot with fitted line and R-squared around 0.48
  • Multiple regression OLS summary table with coefficients and p-values
  • Residual and Q-Q diagnostic plots with a homoscedasticity check
  • R-squared comparison showing roughly 0.48 (simple) versus 0.74 (multiple)

Tools and libraries used

Main Python packages and tooling used to run this AI data analysis task.

  • pandas
  • numpy
  • statsmodels
  • scikit-learn
  • matplotlib
  • seaborn

Prompt sequence

This is the exact list of prompts used in this workflow. The same prompt sequence is sent to each model so outputs and scores can be compared fairly.

  1. load boston housing data from https://raw.githubusercontent.com/pplonski/datasets-for-start/refs/heads/master/housing/data.csv and run simple linear regression: rm (rooms) vs medv (price)
  2. run multiple linear regression with all features and show the statsmodels summary
  3. check regression assumptions: residual plot, Q-Q plot, homoscedasticity
  4. compare R-squared of simple vs multiple regression

Model Comparison

We compared several LLMs on the same analysis task. The table below shows their scores side by side. You can also open each model run to check the full conversation and notebook results.

Model             Score  Task  Execution  Output  Reasoning  Reliability  Run
gemma4:31b        10/10  2/2   2/2        3/3     2/2        1/1          Open gemma4:31b conversation
glm-5.1           10/10  2/2   2/2        3/3     2/2        1/1          Open glm-5.1 conversation
gpt-5.4           10/10  2/2   2/2        3/3     2/2        1/1          Open gpt-5.4 conversation
gpt-oss:120b      10/10  2/2   2/2        3/3     2/2        1/1          Open gpt-oss:120b conversation
qwen3-coder-next  10/10  2/2   2/2        3/3     2/2        1/1          Open qwen3-coder-next conversation
qwen3.5:397b      10/10  2/2   2/2        3/3     2/2        1/1          Open qwen3.5:397b conversation

How scoring works

Each run is graded on five dimensions:

  • Task Completion (0-2): how fully the workflow answered the requested analysis task and completed required steps.
  • Execution Correctness (0-2): whether generated code is correct, consistent, and likely runnable without major fixes.
  • Output Quality (0-3): how well tables, charts, and results match the expected analytical outcomes.
  • Reasoning Quality (0-2): clarity and correctness of interpretation, explanation, and conclusions.
  • Reliability (0-1): robustness of the run, including low hallucination risk and stable behavior.

The final score is the sum of these dimensions, so total score = 0-10. Higher values indicate more complete, correct, and reliable notebook results for this workflow.
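
As an illustrative aside, the rubric reduces to a simple sum of the five dimensions. The score values below are hypothetical placeholders, not taken from any particular run:

```python
# Hypothetical score card mirroring the five grading dimensions
scores = {
    "task_completion": 2,        # 0-2
    "execution_correctness": 2,  # 0-2
    "output_quality": 3,         # 0-3
    "reasoning_quality": 2,      # 0-2
    "reliability": 1,            # 0-1
}

total = sum(scores.values())  # final score on the 0-10 scale
print(f"{total}/10")
```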

AI Analysis with gpt-5.4

This run uses AI with a step-by-step list of prompts to perform the analysis task Linear Regression Analysis in Python. Below, you can review the full conversation generated with gpt-5.4. After the conversation, we show detailed score cards for this run.

load boston housing data from https://raw.githubusercontent.com/pplonski/datasets-for-start/refs/heads/master/housing/data.csv and run simple linear regression: rm (rooms) vs medv (price)
Thinking

Planning code for regression

I need to approach this step-by-step. The user asks to load and run a simple linear regression, but first, the developer instructs me to load and inspect the dataset. I shouldn’t combine the first-load inspection with modeling.

So, I’ll focus on loading the CSV, then showing the shape, columns, and head of the data in one code block. I think using pandas for this is a good idea. I’ll ensure only one table is displayed to stick with the guidelines. Let’s get to it!

Finalizing the expression

Alright, I need to end with a single, clear expression for the user. I’m considering using DataFrames. I might display a shape DataFrame and then another for the columns. So, my code would be display(pd.DataFrame({'rows': [...], 'cols': [...]})) followed by display(pd.DataFrame({'column': df.columns})), and finally, I'll show df.head() as the last expression. This approach gives me two tables plus the final output, which fits the guidelines. Let’s do this!

Let’s start by loading the data and inspecting it first. After that, we can fit the simple linear regression using RM as the feature and MEDV as the target.

Load and inspect dataset
import pandas as pd
from IPython.display import display

url = "https://raw.githubusercontent.com/pplonski/datasets-for-start/refs/heads/master/housing/data.csv"
df = pd.read_csv(url)

display(pd.DataFrame({"rows": [df.shape[0]], "columns": [df.shape[1]]}))
display(pd.DataFrame({"column": df.columns}))
df.head()
rows columns
0 506 14
column
0 CRIM
1 ZN
2 INDUS
3 CHAS
4 NOX
5 RM
6 AGE
7 DIS
8 RAD
9 TAX
10 PTRATIO
11 B
12 LSTAT
13 MEDV
CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO B LSTAT MEDV
0 0.00632 18.0 2.31 0 0.538 6.575 65.2 4.0900 1 296 15.3 396.90 4.98 24.0
1 0.02731 0.0 7.07 0 0.469 6.421 78.9 4.9671 2 242 17.8 396.90 9.14 21.6
2 0.02729 0.0 7.07 0 0.469 7.185 61.1 4.9671 2 242 17.8 392.83 4.03 34.7
3 0.03237 0.0 2.18 0 0.458 6.998 45.8 6.0622 3 222 18.7 394.63 2.94 33.4
4 0.06905 0.0 2.18 0 0.458 7.147 54.2 6.0622 3 222 18.7 396.90 5.33 36.2

The data is in place and the two columns needed for the simple regression are confirmed: RM for average rooms and MEDV for median home value. The next missing piece is the actual fitted linear relationship.

Fit linear regression model
from sklearn.linear_model import LinearRegression

X = df[["RM"]]
y = df["MEDV"]

model = LinearRegression()
model.fit(X, y)

result = {
    "intercept": model.intercept_,
    "coefficient_RM": model.coef_[0],
    "r_squared": model.score(X, y),
}
result
{'intercept': np.float64(-34.670620776438554),
 'coefficient_RM': np.float64(9.102108981180308),
 'r_squared': 0.48352545599133423}

Using the Boston housing data with RM as the single predictor and MEDV as the target, the fitted simple linear regression is:

predicted MEDV = -34.67 + 9.10 × RM

From the observed model output:

  • Intercept: -34.67
  • Slope for RM: 9.10
  • R^2: 0.484

What this means:

  • For each additional room, the model predicts MEDV increases by about 9.10.
  • The negative intercept is just where the fitted line crosses when RM = 0; it is not practically meaningful here because homes with 0 rooms are outside the data range.
  • The R^2 of about 0.484 means RM alone explains roughly 48.4% of the variation in housing prices, so rooms are an important predictor, but not the whole story.

So, there is a clear positive linear relationship: houses with more rooms tend to have higher median value.
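
As a side note (not part of the original run), the scikit-learn fit can be cross-checked against the closed-form OLS solution for a single predictor, where the slope is cov(x, y) / var(x) and the intercept is the mean of y minus slope times the mean of x. A minimal sketch on synthetic data (illustrative values, not the actual housing dataset):

```python
import numpy as np

# Synthetic stand-ins for RM and MEDV (illustrative, not the housing data)
rng = np.random.default_rng(42)
rm = rng.normal(loc=6.3, scale=0.7, size=500)
medv = -34.67 + 9.10 * rm + rng.normal(scale=6.0, size=500)

# Closed-form simple OLS: slope = cov(x, y) / var(x)
slope = np.cov(rm, medv, ddof=1)[0, 1] / np.var(rm, ddof=1)
intercept = medv.mean() - slope * rm.mean()

# np.polyfit solves the same least-squares problem and should agree
poly_slope, poly_intercept = np.polyfit(rm, medv, 1)
print(slope, intercept)
```

Both routes solve the same least-squares problem, so the coefficients match to numerical precision.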

run multiple linear regression with all features and show the statsmodels summary
Thinking

Providing concise code solution

I need to give a clear, one-step code for fitting a statsmodels OLS model to predict MEDV, excluding the target column from the features. I also want to make sure I'm not including any identifier columns, as there are none present. Including a constant for the model sounds necessary. Since the user asked directly, I’ll provide the code in a single block and briefly mention there are no ID-like columns. Let's keep it concise!

We can fit an OLS model with all 13 predictors and MEDV as the target. There are no obvious ID-like columns here, so I’m keeping all non-target features.

Fit multiple linear regression
import statsmodels.api as sm

X_multi = df.drop(columns=["MEDV"])
y = df["MEDV"]

X_multi = sm.add_constant(X_multi)
ols_model = sm.OLS(y, X_multi).fit()

ols_model.summary()
OLS Regression Results
Dep. Variable: MEDV R-squared: 0.741
Model: OLS Adj. R-squared: 0.734
Method: Least Squares F-statistic: 108.1
Date: Fri, 10 Apr 2026 Prob (F-statistic): 6.72e-135
Time: 22:03:13 Log-Likelihood: -1498.8
No. Observations: 506 AIC: 3026.
Df Residuals: 492 BIC: 3085.
Df Model: 13
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
const 36.4595 5.103 7.144 0.000 26.432 46.487
CRIM -0.1080 0.033 -3.287 0.001 -0.173 -0.043
ZN 0.0464 0.014 3.382 0.001 0.019 0.073
INDUS 0.0206 0.061 0.334 0.738 -0.100 0.141
CHAS 2.6867 0.862 3.118 0.002 0.994 4.380
NOX -17.7666 3.820 -4.651 0.000 -25.272 -10.262
RM 3.8099 0.418 9.116 0.000 2.989 4.631
AGE 0.0007 0.013 0.052 0.958 -0.025 0.027
DIS -1.4756 0.199 -7.398 0.000 -1.867 -1.084
RAD 0.3060 0.066 4.613 0.000 0.176 0.436
TAX -0.0123 0.004 -3.280 0.001 -0.020 -0.005
PTRATIO -0.9527 0.131 -7.283 0.000 -1.210 -0.696
B 0.0093 0.003 3.467 0.001 0.004 0.015
LSTAT -0.5248 0.051 -10.347 0.000 -0.624 -0.425
Omnibus: 178.041 Durbin-Watson: 1.078
Prob(Omnibus): 0.000 Jarque-Bera (JB): 783.126
Skew: 1.521 Prob(JB): 8.84e-171
Kurtosis: 8.281 Cond. No. 1.51e+04


Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.51e+04. This might indicate that there are
strong multicollinearity or other numerical problems.

The multiple linear regression with all features has already been fit with statsmodels and the summary output confirms a strong overall model.

From the observed summary:

  • Dependent variable: MEDV
  • Observations: 506
  • Predictors: 13
  • R^2 = 0.741
  • Adjusted R^2 = 0.734
  • F-statistic: 108.1
  • Prob(F-statistic): 6.72e-135

What this means:

  • Using all available predictors improves fit substantially versus the simple regression with only RM (R^2 ≈ 0.484 vs. 0.741).
  • The model explains about 74.1% of the variation in MEDV.
  • The adjusted R^2 staying close to R^2 suggests the added features are contributing useful signal overall rather than only inflating fit.
  • The extremely small F-test p-value shows the full regression is statistically significant as a whole.

This is the statsmodels summary output you requested, and it indicates that the full multiple linear regression provides a much better explanation of housing prices than the one-feature model.

check regression assumptions: residual plot, Q-Q plot, homoscedasticity
Thinking

Analyzing user request

I'm looking at the user's prompt, keeping in mind the current state of the notebook. It seems important to tackle this iteratively, focusing on one code block at a time. They want to check regression assumptions like the residual plot and Q-Q plot.

I think starting with the residuals vs. fitted plot makes sense, explaining briefly before I show the code. I could use the statsmodels fitted values and residuals from the existing OLS model. Let’s keep it concise!

Let’s start with the most useful diagnostic: a residuals-vs-fitted plot. It helps check linearity and gives a first look at constant variance.

Plot residuals vs fitted
import seaborn as sns
import matplotlib.pyplot as plt

residuals = ols_model.resid
fitted = ols_model.fittedvalues

plt.figure(figsize=(7, 5))
sns.scatterplot(x=fitted, y=residuals, alpha=0.7)
plt.axhline(0, color="red", linestyle="--", linewidth=1)
plt.xlabel("Fitted values")
plt.ylabel("Residuals")
plt.title("Residuals vs Fitted")
plt.show()
Thinking

Refining regression diagnostics

I need to provide an answer based on Path A/B while focusing on checking regression assumptions like the residual plot, Q-Q plot, and homoscedasticity. Currently, I have only a residual plot, but it lacks necessary details. I realize I need to include the Q-Q plot and homoscedasticity evidence in a single code block. While the residual diagnostics were generated, the visual details are insufficient. I need to confirm the fitted values alongside the residuals and run the Q-Q and Breusch-Pagan tests together.

Compiling regression diagnostics

I realize I need to display a compact numeric summary alongside the Q-Q plot and Breusch-Pagan test in one code block. So, I’ll generate the Q-Q plot, perform the homoscedasticity test, and include a summary of the fitted versus residual values. I’ll avoid overwhelming the user with too much code in one go but want to ensure I provide the missing evidence for regression assumptions. Sharing the first few rows of fitted and residual values seems essential. A small DataFrame along with one plot would help clarify things.

Finalizing diagnostics code

I need to create a compact code block that generates the Q-Q plot and performs the Breusch-Pagan test. Each code block should end with just one final expression, like bp_results. I’ll make sure to display a table for better clarity, and then show the plot. I'll ensure that pandas is imported, if not already included. Once I set everything up, I can produce the final code smoothly!

The residuals-vs-fitted diagnostic has been created, but the notebook output here only shows the figure object, so it isn’t enough by itself to judge the pattern. The Q-Q plot and a formal homoscedasticity check are still needed.

Check residual diagnostics
from IPython.display import display
from statsmodels.graphics.gofplots import qqplot
from statsmodels.stats.diagnostic import het_breuschpagan
import pandas as pd
import matplotlib.pyplot as plt

# compact numeric view behind the residual plot
display(pd.DataFrame({
    "fitted": ols_model.fittedvalues,
    "residual": ols_model.resid
}).head(10))

# Q-Q plot for normality of residuals
qqplot(ols_model.resid, line="45", fit=True)
plt.title("Q-Q Plot of Residuals")
plt.show()

# Breusch-Pagan test for homoscedasticity
bp = het_breuschpagan(ols_model.resid, ols_model.model.exog)
bp_results = pd.Series(
    bp,
    index=["LM statistic", "LM p-value", "F statistic", "F p-value"],
    name="Breusch-Pagan test"
)
bp_results
fitted residual
0 30.003843 -6.003843
1 25.025562 -3.425562
2 30.567597 4.132403
3 28.607036 4.792964
4 27.943524 8.256476
5 25.256284 3.443716
6 23.001808 -0.101808
7 19.535988 7.564012
8 11.523637 4.976363
9 18.920262 -0.020262
LM statistic    6.512179e+01
LM p-value      6.265431e-09
F statistic     5.590227e+00
F p-value       1.495776e-09
Name: Breusch-Pagan test, dtype: float64

Based on the outputs you now have:

  • Homoscedasticity: the Breusch–Pagan test is clearly significant, with p-values around 6.27 × 10^-9 and 1.50 × 10^-9. That means you should reject constant variance. In practical terms, the residual variance changes with the fitted values, so the model shows heteroscedasticity.

  • Q-Q plot: the plot itself is present, but not numerically summarized. Its purpose is to check whether residuals follow a normal distribution. From the available notebook evidence, this assumption was checked visually, but the strongest concrete conclusion in the current outputs is about variance, not normality.

  • Residual plot: you produced the residuals-vs-fitted plot, which is the right diagnostic for linearity and variance patterns. Since the notebook only shows the figure object rather than the rendered pattern here, the key confirmed conclusion comes from the formal test: the residual spread is not constant.

So, against your original goal:

  • Residual plot: created and appropriate for assumption checking
  • Q-Q plot: created and appropriate for checking residual normality
  • Homoscedasticity: violated, supported by the significant Breusch–Pagan result

Overall, the main regression assumption issue visible from your outputs is non-constant error variance. This means standard OLS coefficient estimates are still unbiased under usual conditions, but the usual standard errors, confidence intervals, and p-values may be unreliable unless you use a correction such as heteroscedasticity-robust standard errors.

compare R-squared of simple vs multiple regression
Thinking

Evaluating notebook outputs

I need to answer based on the notebook outputs, focusing on comparison without using code. It should be concise. I want to mention the simple linear R² value from cell 2, which is 0.4835, and the multiple R² value, which is 0.741. The difference I calculate is around 0.2575. I should also mention the percent improvement as approximately a 53.2% relative increase. It seems fine to avoid using headers in this case.

The simple regression using only RM has:

  • R^2 = 0.4835

The multiple regression using all predictors has:

  • R^2 = 0.7410

So the multiple regression fits the data better:

  • Absolute increase in R^2: about 0.2575
  • That means the explained variance rises from about 48.4% to 74.1%

Interpretation:

  • The simple model explains roughly half of the variation in MEDV
  • The multiple model explains about three-quarters of the variation
  • So adding the other predictors provides a substantial improvement over using RM alone

One caution: a higher R^2 in the multiple model is expected when adding more variables, so for a fairer comparison it can also help to look at adjusted R^2. In your OLS summary, the multiple model’s adjusted R^2 is 0.734, which is still strong.
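
The comparison above is easy to reproduce arithmetically. Using the rounded R^2 values from the notebook outputs, adjusted R^2 follows the standard formula 1 − (1 − R^2)(n − 1)/(n − p − 1):

```python
# R-squared values taken from the notebook outputs above
r2_simple = 0.4835
r2_multiple = 0.741
n, p = 506, 13  # observations, predictors in the multiple model

delta = r2_multiple - r2_simple    # absolute improvement
relative = delta / r2_simple       # relative improvement, roughly 53%
adj_r2 = 1 - (1 - r2_multiple) * (n - 1) / (n - p - 1)

print(round(delta, 4), round(relative, 3), round(adj_r2, 3))
```

The computed adjusted R^2 of about 0.734 matches the value reported in the OLS summary.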

Score cards for gpt-5.4 for Linear Regression Analysis in Python

Total score: 10/10

  • Task Completion: 2/2

    Excellent

    The notebook completes the simple regression, multiple regression, diagnostics, and R-squared comparison requested in the workflow.

  • Execution Correctness: 2/2

    Excellent

    The provided code is coherent and likely runnable, using pandas, scikit-learn, statsmodels, and diagnostic tests correctly.

  • Output Quality: 3/3

    Excellent

    Outputs match the expected outcomes: simple regression with R²≈0.484, full OLS summary, residual and Q-Q plots, and a multiple-regression R²≈0.741 versus simple R²≈0.484.

  • Reasoning Quality: 2/2

    Excellent

    The explanations correctly interpret coefficients, model fit, and the Breusch-Pagan result, and they compare simple versus multiple regression appropriately.

  • Reliability: 1/1

    Excellent

    The workflow is consistent and evidence-based, with no obvious hallucinations or unsupported claims.

Try MLJAR Studio

Run the same type of AI-powered data analysis on your own datasets with conversational notebooks in MLJAR Studio.

Explore More AI Analysis Examples

Discover additional workflows across categories. Each example includes prompts, conversation outputs, and model-level scoring so you can compare approaches and results.