Linear Regression Analysis in Python

Run simple and multiple linear regression, interpret coefficients, check assumptions, and evaluate model fit using statsmodels and scikit-learn.

This page shows an AI data analysis example. A workflow is a list of prompts (instructions) used to complete one analysis task step by step. We send the same prompt list to multiple AI models, then compare their notebook outputs and scores.

What this AI workflow does

This AI Data Analyst workflow loads the Boston Housing dataset from a URL and fits a simple linear regression of RM (rooms) versus MEDV (price). It then fits a multiple linear regression using all features and generates a statsmodels OLS summary for coefficient and p-value interpretation. The workflow checks key regression assumptions with residual and Q-Q plots, evaluates homoscedasticity, and compares model fit between the simple and multiple models using R-squared.

Who this example is for

This is for analysts and students who want a guided, code-generating notebook to run and interpret linear regression in Python. It helps users validate assumptions and compare simple versus multiple regression using standard statistical outputs and diagnostics.

Expected analysis outcomes

These are the results the AI workflow is expected to generate.

  • Dataset loaded from the provided CSV URL
  • Simple regression scatter plot with fitted line and R-squared around 0.48
  • Multiple regression OLS summary table with coefficients and p-values
  • Residual and Q-Q diagnostic plots with a homoscedasticity check
  • R-squared comparison showing roughly 0.48 (simple) versus 0.74 (multiple)

Tools and libraries used

Main Python packages and tooling used to run this AI data analysis task.

  • pandas
  • numpy
  • statsmodels
  • scikit-learn
  • matplotlib
  • seaborn

Prompt sequence

This is the exact list of prompts used in this workflow. The same prompt sequence is sent to each model so outputs and scores can be compared fairly.

  1. load boston housing data from https://raw.githubusercontent.com/pplonski/datasets-for-start/refs/heads/master/housing/data.csv and run simple linear regression: rm (rooms) vs medv (price)
  2. run multiple linear regression with all features and show the statsmodels summary
  3. check regression assumptions: residual plot, Q-Q plot, homoscedasticity
  4. compare R-squared of simple vs multiple regression

Model Comparison

We compared several LLMs on the same analysis task. The table below shows their scores side by side. You can also open each model run to check the full conversation and notebook results.

Model             Score  Task  Execution  Output  Reasoning  Reliability  Run
gemma4:31b        10/10  2/2   2/2        3/3     2/2        1/1          Open gemma4:31b conversation
glm-5.1           10/10  2/2   2/2        3/3     2/2        1/1          Open glm-5.1 conversation
gpt-5.4           10/10  2/2   2/2        3/3     2/2        1/1          Open gpt-5.4 conversation
gpt-oss:120b      10/10  2/2   2/2        3/3     2/2        1/1          Open gpt-oss:120b conversation
qwen3-coder-next  10/10  2/2   2/2        3/3     2/2        1/1          Open qwen3-coder-next conversation
qwen3.5:397b      10/10  2/2   2/2        3/3     2/2        1/1          Open qwen3.5:397b conversation

How scoring works

Each run is graded on five dimensions:

  • Task Completion (0-2): how fully the workflow answered the requested analysis task and completed required steps.
  • Execution Correctness (0-2): whether generated code is correct, consistent, and likely runnable without major fixes.
  • Output Quality (0-3): how well tables, charts, and results match the expected analytical outcomes.
  • Reasoning Quality (0-2): clarity and correctness of interpretation, explanation, and conclusions.
  • Reliability (0-1): robustness of the run, including low hallucination risk and stable behavior.

The final score is the sum of these dimensions, so total score = 0-10. Higher values indicate more complete, correct, and reliable notebook results for this workflow.
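
As an illustrative aside, the rubric reduces to a simple sum of the five dimensions. The score values below are hypothetical placeholders, not taken from any particular run:

```python
# Hypothetical score card mirroring the five grading dimensions
scores = {
    "task_completion": 2,        # 0-2
    "execution_correctness": 2,  # 0-2
    "output_quality": 3,         # 0-3
    "reasoning_quality": 2,      # 0-2
    "reliability": 1,            # 0-1
}

total = sum(scores.values())  # final score on the 0-10 scale
print(f"{total}/10")
```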

AI Analysis with gpt-5.4

This run uses AI with a step-by-step list of prompts to perform the analysis task Linear Regression Analysis in Python. Below, you can review the full conversation generated with gpt-5.4. After the conversation, we show detailed score cards for this run.

load boston housing data from https://raw.githubusercontent.com/pplonski/datasets-for-start/refs/heads/master/housing/data.csv and run simple linear regression: rm (rooms) vs medv (price)
Thinking

Planning code for regression

I need to approach this step-by-step. The user asks to load and run a simple linear regression, but first, the developer instructs me to load and inspect the dataset. I shouldn’t combine the first-load inspection with modeling.

So, I’ll focus on loading the CSV, then showing the shape, columns, and head of the data in one code block. I think using pandas for this is a good idea. I’ll ensure only one table is displayed to stick with the guidelines. Let’s get to it!

Finalizing the expression

Alright, I need to end with a single, clear expression for the user. I’m considering using DataFrames. I might display a shape DataFrame and then another for the columns. So, my code would be display(pd.DataFrame({'rows': [...], 'cols': [...]})) followed by display(pd.DataFrame({'column': df.columns})), and finally, I'll show df.head() as the last expression. This approach gives me two tables plus the final output, which fits the guidelines. Let’s do this!

Let’s start by loading the data and inspecting it first. After that, we can fit the simple linear regression using RM as the feature and MEDV as the target.

Load and inspect dataset
import pandas as pd
from IPython.display import display

url = "https://raw.githubusercontent.com/pplonski/datasets-for-start/refs/heads/master/housing/data.csv"
df = pd.read_csv(url)

display(pd.DataFrame({"rows": [df.shape[0]], "columns": [df.shape[1]]}))
display(pd.DataFrame({"column": df.columns}))
df.head()
rows columns
0 506 14
column
0 CRIM
1 ZN
2 INDUS
3 CHAS
4 NOX
5 RM
6 AGE
7 DIS
8 RAD
9 TAX
10 PTRATIO
11 B
12 LSTAT
13 MEDV
CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO B LSTAT MEDV
0 0.00632 18.0 2.31 0 0.538 6.575 65.2 4.0900 1 296 15.3 396.90 4.98 24.0
1 0.02731 0.0 7.07 0 0.469 6.421 78.9 4.9671 2 242 17.8 396.90 9.14 21.6
2 0.02729 0.0 7.07 0 0.469 7.185 61.1 4.9671 2 242 17.8 392.83 4.03 34.7
3 0.03237 0.0 2.18 0 0.458 6.998 45.8 6.0622 3 222 18.7 394.63 2.94 33.4
4 0.06905 0.0 2.18 0 0.458 7.147 54.2 6.0622 3 222 18.7 396.90 5.33 36.2

The data is in place and the two columns needed for the simple regression are confirmed: RM for average rooms and MEDV for median home value. The next missing piece is the actual fitted linear relationship.

Fit linear regression model
from sklearn.linear_model import LinearRegression

X = df[["RM"]]
y = df["MEDV"]

model = LinearRegression()
model.fit(X, y)

result = {
    "intercept": model.intercept_,
    "coefficient_RM": model.coef_[0],
    "r_squared": model.score(X, y),
}
result
{'intercept': np.float64(-34.670620776438554),
 'coefficient_RM': np.float64(9.102108981180308),
 'r_squared': 0.48352545599133423}

Using the Boston housing data with RM as the single predictor and MEDV as the target, the fitted simple linear regression is:

predicted MEDV = -34.67 + 9.10 × RM

From the observed model output:

  • Intercept: -34.67
  • Slope for RM: 9.10
  • R^2: 0.484

What this means:

  • For each additional room, the model predicts MEDV increases by about 9.10.
  • The negative intercept is just where the fitted line crosses when RM = 0; it is not practically meaningful here because homes with 0 rooms are outside the data range.
  • The R^2 of about 0.484 means RM alone explains roughly 48.4% of the variation in housing prices, so rooms are an important predictor, but not the whole story.

So, there is a clear positive linear relationship: houses with more rooms tend to have higher median value.
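
As a side note (not part of the original run), the scikit-learn fit can be cross-checked against the closed-form OLS solution for a single predictor, where the slope is cov(x, y) / var(x) and the intercept is the mean of y minus slope times the mean of x. A minimal sketch on synthetic data (illustrative values, not the actual housing dataset):

```python
import numpy as np

# Synthetic stand-ins for RM and MEDV (illustrative, not the housing data)
rng = np.random.default_rng(42)
rm = rng.normal(loc=6.3, scale=0.7, size=500)
medv = -34.67 + 9.10 * rm + rng.normal(scale=6.0, size=500)

# Closed-form simple OLS: slope = cov(x, y) / var(x)
slope = np.cov(rm, medv, ddof=1)[0, 1] / np.var(rm, ddof=1)
intercept = medv.mean() - slope * rm.mean()

# np.polyfit solves the same least-squares problem and should agree
poly_slope, poly_intercept = np.polyfit(rm, medv, 1)
print(slope, intercept)
```

Both routes solve the same least-squares problem, so the coefficients match to numerical precision.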

run multiple linear regression with all features and show the statsmodels summary
Thinking

Providing concise code solution

I need to give a clear, one-step code for fitting a statsmodels OLS model to predict MEDV, excluding the target column from the features. I also want to make sure I'm not including any identifier columns, as there are none present. Including a constant for the model sounds necessary. Since the user asked directly, I’ll provide the code in a single block and briefly mention there are no ID-like columns. Let's keep it concise!

We can fit an OLS model with all 13 predictors and MEDV as the target. There are no obvious ID-like columns here, so I’m keeping all non-target features.

Fit multiple linear regression
import statsmodels.api as sm

X_multi = df.drop(columns=["MEDV"])
y = df["MEDV"]

X_multi = sm.add_constant(X_multi)
ols_model = sm.OLS(y, X_multi).fit()

ols_model.summary()
OLS Regression Results
Dep. Variable: MEDV R-squared: 0.741
Model: OLS Adj. R-squared: 0.734
Method: Least Squares F-statistic: 108.1
Date: Fri, 10 Apr 2026 Prob (F-statistic): 6.72e-135
Time: 22:03:13 Log-Likelihood: -1498.8
No. Observations: 506 AIC: 3026.
Df Residuals: 492 BIC: 3085.
Df Model: 13
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
const 36.4595 5.103 7.144 0.000 26.432 46.487
CRIM -0.1080 0.033 -3.287 0.001 -0.173 -0.043
ZN 0.0464 0.014 3.382 0.001 0.019 0.073
INDUS 0.0206 0.061 0.334 0.738 -0.100 0.141
CHAS 2.6867 0.862 3.118 0.002 0.994 4.380
NOX -17.7666 3.820 -4.651 0.000 -25.272 -10.262
RM 3.8099 0.418 9.116 0.000 2.989 4.631
AGE 0.0007 0.013 0.052 0.958 -0.025 0.027
DIS -1.4756 0.199 -7.398 0.000 -1.867 -1.084
RAD 0.3060 0.066 4.613 0.000 0.176 0.436
TAX -0.0123 0.004 -3.280 0.001 -0.020 -0.005
PTRATIO -0.9527 0.131 -7.283 0.000 -1.210 -0.696
B 0.0093 0.003 3.467 0.001 0.004 0.015
LSTAT -0.5248 0.051 -10.347 0.000 -0.624 -0.425
Omnibus: 178.041 Durbin-Watson: 1.078
Prob(Omnibus): 0.000 Jarque-Bera (JB): 783.126
Skew: 1.521 Prob(JB): 8.84e-171
Kurtosis: 8.281 Cond. No. 1.51e+04


Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.51e+04. This might indicate that there are
strong multicollinearity or other numerical problems.

The multiple linear regression with all features has already been fit with statsmodels and the summary output confirms a strong overall model.

From the observed summary:

  • Dependent variable: MEDV
  • Observations: 506
  • Predictors: 13
  • R^2 = 0.741
  • Adjusted R^2 = 0.734
  • F-statistic: 108.1
  • Prob(F-statistic): 6.72e-135

What this means:

  • Using all available predictors improves fit substantially versus the simple regression with only RM (R^2 ≈ 0.484 vs. 0.741).
  • The model explains about 74.1% of the variation in MEDV.
  • The adjusted R^2 staying close to R^2 suggests the added features are contributing useful signal overall rather than only inflating fit.
  • The extremely small F-test p-value shows the full regression is statistically significant as a whole.

This is the statsmodels summary output you requested, and it indicates that the full multiple linear regression provides a much better explanation of housing prices than the one-feature model.

check regression assumptions: residual plot, Q-Q plot, homoscedasticity
Thinking

Analyzing user request

I'm looking at the user's prompt, keeping in mind the current state of the notebook. It seems important to tackle this iteratively, focusing on one code block at a time. They want to check regression assumptions like the residual plot and Q-Q plot.

I think starting with the residuals vs. fitted plot makes sense, explaining briefly before I show the code. I could use the statsmodels fitted values and residuals from the existing OLS model. Let’s keep it concise!

Let’s start with the most useful diagnostic: a residuals-vs-fitted plot. It helps check linearity and gives a first look at constant variance.

Plot residuals vs fitted
import seaborn as sns
import matplotlib.pyplot as plt

residuals = ols_model.resid
fitted = ols_model.fittedvalues

plt.figure(figsize=(7, 5))
sns.scatterplot(x=fitted, y=residuals, alpha=0.7)
plt.axhline(0, color="red", linestyle="--", linewidth=1)
plt.xlabel("Fitted values")
plt.ylabel("Residuals")
plt.title("Residuals vs Fitted")
plt.show()
Thinking

Refining regression diagnostics

I need to provide an answer based on Path A/B while focusing on checking regression assumptions like the residual plot, Q-Q plot, and homoscedasticity. Currently, I have only a residual plot, but it lacks necessary details. I realize I need to include the Q-Q plot and homoscedasticity evidence in a single code block. While the residual diagnostics were generated, the visual details are insufficient. I need to confirm the fitted values alongside the residuals and run the Q-Q and Breusch-Pagan tests together.

Compiling regression diagnostics

I realize I need to display a compact numeric summary alongside the Q-Q plot and Breusch-Pagan test in one code block. So, I’ll generate the Q-Q plot, perform the homoscedasticity test, and include a summary of the fitted versus residual values. I’ll avoid overwhelming the user with too much code in one go but want to ensure I provide the missing evidence for regression assumptions. Sharing the first few rows of fitted and residual values seems essential. A small DataFrame along with one plot would help clarify things.

Finalizing diagnostics code

I need to create a compact code block that generates the Q-Q plot and performs the Breusch-Pagan test. Each code block should end with just one final expression, like bp_results. I’ll make sure to display a table for better clarity, and then show the plot. I'll ensure that pandas is imported, if not already included. Once I set everything up, I can produce the final code smoothly!

The residuals-vs-fitted diagnostic has been created, but the notebook output here only shows the figure object, so it isn’t enough by itself to judge the pattern. The Q-Q plot and a formal homoscedasticity check are still needed.

Check residual diagnostics
from IPython.display import display
from statsmodels.graphics.gofplots import qqplot
from statsmodels.stats.diagnostic import het_breuschpagan
import pandas as pd
import matplotlib.pyplot as plt

# compact numeric view behind the residual plot
display(pd.DataFrame({
    "fitted": ols_model.fittedvalues,
    "residual": ols_model.resid
}).head(10))

# Q-Q plot for normality of residuals
qqplot(ols_model.resid, line="45", fit=True)
plt.title("Q-Q Plot of Residuals")
plt.show()

# Breusch-Pagan test for homoscedasticity
bp = het_breuschpagan(ols_model.resid, ols_model.model.exog)
bp_results = pd.Series(
    bp,
    index=["LM statistic", "LM p-value", "F statistic", "F p-value"],
    name="Breusch-Pagan test"
)
bp_results
fitted residual
0 30.003843 -6.003843
1 25.025562 -3.425562
2 30.567597 4.132403
3 28.607036 4.792964
4 27.943524 8.256476
5 25.256284 3.443716
6 23.001808 -0.101808
7 19.535988 7.564012
8 11.523637 4.976363
9 18.920262 -0.020262
LM statistic    6.512179e+01
LM p-value      6.265431e-09
F statistic     5.590227e+00
F p-value       1.495776e-09
Name: Breusch-Pagan test, dtype: float64

Based on the outputs you now have:

  • Homoscedasticity: the Breusch–Pagan test is clearly significant, with p-values around 6.27 × 10^-9 and 1.50 × 10^-9. That means you should reject constant variance. In practical terms, the residual variance changes with the fitted values, so the model shows heteroscedasticity.

  • Q-Q plot: the plot itself is present, but not numerically summarized. Its purpose is to check whether residuals follow a normal distribution. From the available notebook evidence, this assumption was checked visually, but the strongest concrete conclusion in the current outputs is about variance, not normality.

  • Residual plot: you produced the residuals-vs-fitted plot, which is the right diagnostic for linearity and variance patterns. Since the notebook only shows the figure object rather than the rendered pattern here, the key confirmed conclusion comes from the formal test: the residual spread is not constant.

So, against your original goal:

  • Residual plot: created and appropriate for assumption checking
  • Q-Q plot: created and appropriate for checking residual normality
  • Homoscedasticity: violated, supported by the significant Breusch–Pagan result

Overall, the main regression assumption issue visible from your outputs is non-constant error variance. This means standard OLS coefficient estimates are still unbiased under usual conditions, but the usual standard errors, confidence intervals, and p-values may be unreliable unless you use a correction such as heteroscedasticity-robust standard errors.

compare R-squared of simple vs multiple regression
Thinking

Evaluating notebook outputs

I need to answer based on the notebook outputs, focusing on comparison without using code. It should be concise. I want to mention the simple linear R² value from cell 2, which is 0.4835, and the multiple R² value, which is 0.741. The difference I calculate is around 0.2575. I should also mention the percent improvement as approximately a 53.2% relative increase. It seems fine to avoid using headers in this case.

The simple regression using only RM has:

  • R^2 = 0.4835

The multiple regression using all predictors has:

  • R^2 = 0.7410

So the multiple regression fits the data better:

  • Absolute increase in R^2: about 0.2575
  • That means the explained variance rises from about 48.4% to 74.1%

Interpretation:

  • The simple model explains roughly half of the variation in MEDV
  • The multiple model explains about three-quarters of the variation
  • So adding the other predictors provides a substantial improvement over using RM alone

One caution: a higher R^2 in the multiple model is expected when adding more variables, so for a fairer comparison it can also help to look at adjusted R^2. In your OLS summary, the multiple model’s adjusted R^2 is 0.734, which is still strong.
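
The comparison above is easy to reproduce arithmetically. Using the rounded R^2 values from the notebook outputs, adjusted R^2 follows the standard formula 1 − (1 − R^2)(n − 1)/(n − p − 1):

```python
# R-squared values taken from the notebook outputs above
r2_simple = 0.4835
r2_multiple = 0.741
n, p = 506, 13  # observations, predictors in the multiple model

delta = r2_multiple - r2_simple    # absolute improvement
relative = delta / r2_simple       # relative improvement, roughly 53%
adj_r2 = 1 - (1 - r2_multiple) * (n - 1) / (n - p - 1)

print(round(delta, 4), round(relative, 3), round(adj_r2, 3))
```

The computed adjusted R^2 of about 0.734 matches the value reported in the OLS summary.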

Score cards for gpt-5.4 for Linear Regression Analysis in Python

Total score: 10/10

  • Task Completion: 2/2

    Excellent

    The notebook completes the simple regression, multiple regression, diagnostics, and R-squared comparison requested in the workflow.

  • Execution Correctness: 2/2

    Excellent

    The provided code is coherent and likely runnable, using pandas, scikit-learn, statsmodels, and diagnostic tests correctly.

  • Output Quality: 3/3

    Excellent

    Outputs match the expected outcomes: simple regression with R²≈0.484, full OLS summary, residual and Q-Q plots, and a multiple-regression R²≈0.741 versus simple R²≈0.484.

  • Reasoning Quality: 2/2

    Excellent

    The explanations correctly interpret coefficients, model fit, and the Breusch-Pagan result, and they compare simple versus multiple regression appropriately.

  • Reliability: 1/1

    Excellent

    The workflow is consistent and evidence-based, with no obvious hallucinations or unsupported claims.

Try MLJAR Studio

Run the same type of AI-powered data analysis on your own datasets with conversational notebooks in MLJAR Studio.

Explore More AI Analysis Examples

Discover additional workflows across categories. Each example includes prompts, conversation outputs, and model-level scoring so you can compare approaches and results.