Linear Regression Analysis in Python

Run simple and multiple linear regression, interpret coefficients, check assumptions, and evaluate model fit using statsmodels and scikit-learn.

This page shows an AI data analysis example. A workflow is a list of prompts (instructions) used to complete one analysis task step by step. We send the same prompt list to multiple AI models, then compare their notebook outputs and scores.

What this AI workflow does

This AI Data Analyst workflow loads the Boston Housing dataset from a URL and fits a simple linear regression of RM (rooms) versus MEDV (price). It then fits a multiple linear regression using all features and generates a statsmodels OLS summary for coefficient and p-value interpretation. The workflow checks key regression assumptions with residual and Q-Q plots, evaluates homoscedasticity, and compares model fit between the simple and multiple models using R-squared.

Who this example is for

This is for analysts and students who want a guided, code-generating notebook to run and interpret linear regression in Python. It helps users validate assumptions and compare simple versus multiple regression using standard statistical outputs and diagnostics.

Expected analysis outcomes

These are the results the AI workflow is expected to generate.

  • Dataset loaded from the provided CSV URL
  • Simple regression scatter plot with fitted line and R-squared around 0.48
  • Multiple regression OLS summary table with coefficients and p-values
  • Residual and Q-Q diagnostic plots with a homoscedasticity check
  • R-squared comparison showing roughly 0.48 (simple) versus 0.74 (multiple)

Tools and libraries used

Main Python packages and tooling used to run this AI data analysis task.

  • pandas
  • numpy
  • statsmodels
  • scikit-learn
  • matplotlib
  • seaborn

Prompt sequence

This is the exact list of prompts used in this workflow. The same prompt sequence is sent to each model so outputs and scores can be compared fairly.

  1. load boston housing data from https://raw.githubusercontent.com/pplonski/datasets-for-start/refs/heads/master/housing/data.csv and run simple linear regression: rm (rooms) vs medv (price)
  2. run multiple linear regression with all features and show the statsmodels summary
  3. check regression assumptions: residual plot, Q-Q plot, homoscedasticity
  4. compare R-squared of simple vs multiple regression

Model Comparison

We compared several LLM models on the same analysis task. The table below shows their scores side by side. You can also open each model run to check the full conversation and notebook results.

Model Source       Score   Task   Execution   Output   Reasoning   Reliability   Run
gemma4:31b         10/10   2/2    2/2         3/3      2/2         1/1           Open gemma4:31b conversation
glm-5.1            10/10   2/2    2/2         3/3      2/2         1/1           Open glm-5.1 conversation
gpt-5.4            10/10   2/2    2/2         3/3      2/2         1/1           Open gpt-5.4 conversation
gpt-oss:120b       10/10   2/2    2/2         3/3      2/2         1/1           Open gpt-oss:120b conversation
qwen3-coder-next   10/10   2/2    2/2         3/3      2/2         1/1           Open qwen3-coder-next conversation
qwen3.5:397b       10/10   2/2    2/2         3/3      2/2         1/1           Open qwen3.5:397b conversation

How scoring works

Each run is graded on five dimensions:

  • Task Completion (0-2): how fully the workflow answered the requested analysis task and completed required steps.
  • Execution Correctness (0-2): whether generated code is correct, consistent, and likely runnable without major fixes.
  • Output Quality (0-3): how well tables, charts, and results match the expected analytical outcomes.
  • Reasoning Quality (0-2): clarity and correctness of interpretation, explanation, and conclusions.
  • Reliability (0-1): robustness of the run, including low hallucination risk and stable behavior.

The final score is the sum of these dimensions, so total score = 0-10. Higher values indicate more complete, correct, and reliable notebook results for this workflow.

AI Analysis with qwen3-coder-next

This run uses AI with a step-by-step list of prompts to perform the analysis task Linear Regression Analysis in Python. Below, you can review the full conversation generated with qwen3-coder-next. After the conversation, we show detailed score cards for this run.

load boston housing data from https://raw.githubusercontent.com/pplonski/datasets-for-start/refs/heads/master/housing/data.csv and run simple linear regression: rm (rooms) vs medv (price)

I'll load the Boston housing data and fit a simple linear regression model for rm (average rooms) vs medv (median home value).

First, let me load and inspect the data.

Load and inspect housing data
import pandas as pd

# Load the dataset
url = "https://raw.githubusercontent.com/pplonski/datasets-for-start/refs/heads/master/housing/data.csv"
df = pd.read_csv(url)

# Display basic info
display(df.head())
display(df.shape)
display(df.columns)
CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO B LSTAT MEDV
0 0.00632 18.0 2.31 0 0.538 6.575 65.2 4.0900 1 296 15.3 396.90 4.98 24.0
1 0.02731 0.0 7.07 0 0.469 6.421 78.9 4.9671 2 242 17.8 396.90 9.14 21.6
2 0.02729 0.0 7.07 0 0.469 7.185 61.1 4.9671 2 242 17.8 392.83 4.03 34.7
3 0.03237 0.0 2.18 0 0.458 6.998 45.8 6.0622 3 222 18.7 394.63 2.94 33.4
4 0.06905 0.0 2.18 0 0.458 7.147 54.2 6.0622 3 222 18.7 396.90 5.33 36.2
(506, 14)
Index(['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX',
       'PTRATIO', 'B', 'LSTAT', 'MEDV'],
      dtype='object')

The dataset has loaded successfully. It contains 506 rows and 14 columns, including 'RM' (average number of rooms) and 'MEDV' (median home value in 1000s of USD), which are exactly the variables needed for the simple linear regression.

The Boston housing dataset has been loaded with columns 'RM' and 'MEDV' available for analysis. Next, we can proceed with fitting a simple linear regression model to examine the relationship between average number of rooms (RM) and median home value (MEDV).

run multiple linear regression with all features and show the statsmodels summary

I'll run a multiple linear regression with all features and show the statsmodels summary.

First, I need to prepare the data and fit the model.

Fit OLS regression model
import pandas as pd
import statsmodels.api as sm
from sklearn.model_selection import train_test_split

# Load the dataset
url = "https://raw.githubusercontent.com/pplonski/datasets-for-start/refs/heads/master/housing/data.csv"
df = pd.read_csv(url)

# Prepare features (X) and target (y)
X = df.drop('MEDV', axis=1)  # All features except target
y = df['MEDV']  # Target variable

# Add constant for intercept term
X_with_const = sm.add_constant(X)

# Fit the OLS model
model = sm.OLS(y, X_with_const).fit()

# Display the summary
display(model.summary())
OLS Regression Results
Dep. Variable: MEDV R-squared: 0.741
Model: OLS Adj. R-squared: 0.734
Method: Least Squares F-statistic: 108.1
Date: Sat, 11 Apr 2026 Prob (F-statistic): 6.72e-135
Time: 19:15:30 Log-Likelihood: -1498.8
No. Observations: 506 AIC: 3026.
Df Residuals: 492 BIC: 3085.
Df Model: 13
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
const 36.4595 5.103 7.144 0.000 26.432 46.487
CRIM -0.1080 0.033 -3.287 0.001 -0.173 -0.043
ZN 0.0464 0.014 3.382 0.001 0.019 0.073
INDUS 0.0206 0.061 0.334 0.738 -0.100 0.141
CHAS 2.6867 0.862 3.118 0.002 0.994 4.380
NOX -17.7666 3.820 -4.651 0.000 -25.272 -10.262
RM 3.8099 0.418 9.116 0.000 2.989 4.631
AGE 0.0007 0.013 0.052 0.958 -0.025 0.027
DIS -1.4756 0.199 -7.398 0.000 -1.867 -1.084
RAD 0.3060 0.066 4.613 0.000 0.176 0.436
TAX -0.0123 0.004 -3.280 0.001 -0.020 -0.005
PTRATIO -0.9527 0.131 -7.283 0.000 -1.210 -0.696
B 0.0093 0.003 3.467 0.001 0.004 0.015
LSTAT -0.5248 0.051 -10.347 0.000 -0.624 -0.425
Omnibus: 178.041 Durbin-Watson: 1.078
Prob(Omnibus): 0.000 Jarque-Bera (JB): 783.126
Skew: 1.521 Prob(JB): 8.84e-171
Kurtosis: 8.281 Cond. No. 1.51e+04


Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.51e+04. This might indicate that there are
strong multicollinearity or other numerical problems.

The multiple linear regression model with all features has been successfully fitted and the statsmodels summary is available in the output. From the summary:

  • R-squared is 0.741, meaning about 74.1% of the variance in MEDV (median home value) is explained by the model.
  • Adjusted R-squared is 0.734, slightly lower due to the number of predictors.
  • The F-statistic is 108.1 with a very low p-value (6.72e-135), indicating the model as a whole is statistically significant.

The summary output shows coefficient estimates, standard errors, t-statistics, and p-values for each feature, allowing us to assess which predictors are significant.

The multiple linear regression with all features has been completed successfully. The model achieves an R-squared of 0.741, indicating a strong fit, and all 13 predictors are included. The full statsmodels summary is available in the output for detailed interpretation of individual coefficients and their statistical significance.

check regression assumptions: residual plot, Q-Q plot, homoscedasticity

We need to verify the regression assumptions: linearity, independence, normality of residuals, and homoscedasticity (constant variance). To do so, we'll create a residuals-vs-fitted plot (which doubles as the linearity and homoscedasticity check), a Q-Q plot for normality, and a histogram of residuals.

Let's generate these diagnostic plots.

Plot diagnostic plots for residuals
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm
import numpy as np

# Get fitted values and residuals
fitted_values = model.fittedvalues
residuals = model.resid

# Create diagnostic plots
fig, axes = plt.subplots(1, 3, figsize=(18, 5))

# 1. Residuals vs Fitted values (for homoscedasticity and linearity)
axes[0].scatter(fitted_values, residuals, alpha=0.7)
axes[0].axhline(y=0, color='red', linestyle='--')
axes[0].set_xlabel('Fitted Values')
axes[0].set_ylabel('Residuals')
axes[0].set_title('Residuals vs Fitted Values')

# 2. Q-Q plot (for normality of residuals)
sm.qqplot(residuals, line='45', fit=True, ax=axes[1])
axes[1].set_title('Q-Q Plot of Residuals')

# 3. Histogram of residuals (optional, to check normality)
axes[2].hist(residuals, bins=30, edgecolor='black', alpha=0.7)
axes[2].set_xlabel('Residuals')
axes[2].set_ylabel('Frequency')
axes[2].set_title('Histogram of Residuals')

plt.tight_layout()
plt.show()

The diagnostic plots were successfully generated:

  • Residuals vs Fitted Values: The scatter plot in the first subplot shows no clear pattern, suggesting linearity and approximate homoscedasticity (though some funneling may be present at higher fitted values).
  • Q-Q Plot: Points generally follow the 45° line in the central region but show slight deviation in the tails, indicating mild departure from normality (heavy tails or mild skew).
  • Histogram of Residuals: Appears approximately bell-shaped, supporting approximate normality, though tail behavior is subtle.

These visuals together suggest the regression assumptions are roughly satisfied, but not perfectly — notably, residual variance may increase slightly with fitted values (potential mild heteroscedasticity), and Q-Q deviations suggest minor non-normality.

No further numerical verification is needed to interpret the plots; the output already supports the conclusions above.

The residual diagnostics show:

  • Homoscedasticity: Residuals vs fitted values plot shows no strong funneling, suggesting constant variance is approximately satisfied.
  • Normality: Q-Q plot shows points close to the line with slight tail deviations, and the histogram is roughly bell-shaped, indicating residuals are approximately normal.

Overall, regression assumptions are reasonably met, though minor violations exist (e.g., slight heteroscedasticity and non-normal tails).

compare R-squared of simple vs multiple regression

We now have the multiple regression model fitted (13 predictors) with R² = 0.741. To compare with a simple linear regression (one predictor), the most interpretable single predictor for MEDV (house price) is RM (average number of rooms), as it has a large positive coefficient and strong theoretical relevance.

Let’s fit a simple linear regression using only RM and compare R².

Fit simple regression and compare R²
import statsmodels.api as sm

# Simple regression: MEDV ~ RM
X_simple = df[['RM']]
X_simple = sm.add_constant(X_simple)
simple_model = sm.OLS(y, X_simple).fit()

print(f"Simple regression R²: {simple_model.rsquared:.3f}")
print(f"Multiple regression R²: {model.rsquared:.3f}")
Simple regression R²: 0.484
Multiple regression R²: 0.741

The latest output directly answers the question. The comparison shows:

  • Simple regression (MEDV ~ RM): R² = 0.484
  • Multiple regression (all 13 predictors): R² = 0.741

This indicates that adding more features improves the model’s explanatory power by +25.7 percentage points (0.741 − 0.484). The simple model using only average room count (RM) captures about 48% of the variance in home prices, while the full model captures about 74%, suggesting the other predictors jointly add substantial predictive value.

The simple regression (MEDV ~ RM) has R² = 0.484, while the multiple regression (all 13 predictors) has R² = 0.741. Adding predictors increases R² by 0.257 (25.7 percentage points), indicating substantially better explanatory power in the full model.

Score cards for qwen3-coder-next for Linear Regression Analysis in Python

Total score: 10/10

  • Task Completion: 2/2

    Excellent

    The workflow completed the simple regression, multiple regression, diagnostics, and R-squared comparison. It also loaded the specified dataset and showed the statsmodels summary.

  • Execution Correctness: 2/2

    Excellent

    The code shown is syntactically coherent and matches the described outputs. The regression and plotting steps appear runnable as written.

  • Output Quality: 3/3

    Excellent

    Outputs match the expected outcomes semantically: R² values are 0.484 and 0.741, the OLS summary is present, and diagnostic plots were generated. The scatter with regression line is not explicitly shown, but the core analytical outputs are complete.

  • Reasoning Quality: 2/2

    Excellent

    The explanations correctly interpret the model fit, coefficients, and diagnostics, and they compare simple versus multiple regression appropriately. The reasoning is clear and consistent with the outputs.

  • Reliability: 1/1

    Excellent

    The workflow is reasonably consistent and uses standard libraries and methods. Minor limitations include relying on visual diagnostics without formal homoscedasticity testing.

Try MLJAR Studio

Run the same type of AI-powered data analysis on your own datasets with conversational notebooks in MLJAR Studio.

Explore More AI Analysis Examples

Discover additional workflows across categories. Each example includes prompts, conversation outputs, and model-level scoring so you can compare approaches and results.