Hypothesis Testing in Python (t-test, ANOVA)

Perform t-tests, chi-square tests, and ANOVA using real data to answer business questions — guided by an AI data analyst.

This page shows an AI data analysis example. A workflow is a list of prompts (instructions) used to complete one analysis task step by step. We send the same prompt list to multiple AI models, then compare their notebook outputs and scores.

What this AI workflow does

This AI Data Analyst workflow loads the Titanic training dataset from a URL and runs a two-sample t-test to compare average age between survivors and non-survivors. It performs a chi-square test of independence to assess whether survival is associated with passenger class. It runs a one-way ANOVA to test whether fare differs across the three passenger classes and summarizes the results in plain language.

Who this example is for

This is for analysts and students who need a concrete, code-driven example of hypothesis testing on a real dataset. It helps anyone answering business-style questions with statistical tests and interpretable outputs in a notebook.

Expected analysis outcomes

These are the results the AI workflow is expected to generate.

t-test result comparing survivor vs non-survivor age with p-value and interpretation
chi-square test result for survival vs passenger class independence with p-value and interpretation
one-way ANOVA result for fare differences across classes with p-value and interpretation
plain-language summary table consolidating all three hypothesis tests

Tools and libraries used

Main Python packages and tooling used to run this AI data analysis task.

pandas
numpy
scipy
statsmodels
matplotlib
seaborn

Prompt sequence

This is the exact list of prompts used in this workflow. The same prompt sequence is sent to each model so outputs and scores can be compared fairly.

1. load titanic data from https://raw.githubusercontent.com/pplonski/datasets-for-start/refs/heads/master/Titanic/train.csv and test whether average age differs significantly between survivors and non-survivors using a t-test
2. test whether survival rate is independent of passenger class using a chi-square test
3. test whether fare differs significantly across the three passenger classes using ANOVA
4. summarise all three test results and their business interpretation

Model Comparison

We compared several LLMs on the same analysis task. The table below shows their scores side by side. You can also open each model run to check the full conversation and notebook results.

Model Source	Score	Task	Execution	Output	Reasoning	Reliability	Run
gemma4:31b	10/10	2/2	2/2	3/3	2/2	1/1	Open gemma4:31b conversation
glm-5.1	10/10	2/2	2/2	3/3	2/2	1/1	Open glm-5.1 conversation
gpt-5.4	10/10	2/2	2/2	3/3	2/2	1/1	Open gpt-5.4 conversation
gpt-oss:120b	10/10	2/2	2/2	3/3	2/2	1/1	Open gpt-oss:120b conversation
qwen3-coder-next	10/10	2/2	2/2	3/3	2/2	1/1	Open qwen3-coder-next conversation
qwen3.5:397b	10/10	2/2	2/2	3/3	2/2	1/1	Open qwen3.5:397b conversation

How scoring works

Each run is graded on five dimensions:

Task Completion (0-2): how fully the workflow answered the requested analysis task and completed required steps.
Execution Correctness (0-2): whether generated code is correct, consistent, and likely runnable without major fixes.
Output Quality (0-3): how well tables, charts, and results match the expected analytical outcomes.
Reasoning Quality (0-2): clarity and correctness of interpretation, explanation, and conclusions.
Reliability (0-1): robustness of the run, including low hallucination risk and stable behavior.

The final score is the sum of these dimensions, so total score = 0-10. Higher values indicate more complete, correct, and reliable notebook results for this workflow.
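The arithmetic behind the total can be sketched in a few lines. This is only an illustration of the rubric described above; the dictionary keys and the `total_score` helper are hypothetical names, not part of any grading tool referenced on this page.

```python
# Maximum points per dimension, as listed in the rubric above.
MAX_POINTS = {
    "task_completion": 2,
    "execution_correctness": 2,
    "output_quality": 3,
    "reasoning_quality": 2,
    "reliability": 1,
}


def total_score(scores: dict) -> int:
    """Sum the five dimension scores after checking each is within its range."""
    if set(scores) != set(MAX_POINTS):
        raise ValueError("scores must contain exactly the five rubric dimensions")
    for name, value in scores.items():
        if not 0 <= value <= MAX_POINTS[name]:
            raise ValueError(f"{name} must be between 0 and {MAX_POINTS[name]}")
    return sum(scores.values())


# A run that earns full marks on every dimension, like the rows in the table.
perfect_run = dict(MAX_POINTS)
print(total_score(perfect_run))
```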

AI Analysis with glm-5.1

This run uses AI with a step-by-step list of prompts to perform the analysis task Hypothesis Testing in Python (t-test, ANOVA). Below, you can review the full conversation generated with glm-5.1. After the conversation, we show detailed score cards for this run.

load titanic data from https://raw.githubusercontent.com/pplonski/datasets-for-start/refs/heads/master/Titanic/train.csv and test whether average age differs significantly between survivors and non-survivors using a t-test

Thinking

The user wants to load the Titanic dataset and perform a t-test to compare average age between survivors and non-survivors. Let me start by loading and inspecting the data first.

Let's start by loading the data and inspecting it briefly before running the t-test.

Load Titanic dataset

import pandas as pd

df = pd.read_csv("https://raw.githubusercontent.com/pplonski/datasets-for-start/refs/heads/master/Titanic/train.csv")
df.shape

(891, 12)

Preview first rows

df.head()

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked
0	1	0	3	Braund, Mr. Owen Harris	male	22.0	1	0	A/5 21171	7.2500	NaN	S
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.0	1	0	PC 17599	71.2833	C85	C
2	3	1	3	Heikkinen, Miss. Laina	female	26.0	0	0	STON/O2. 3101282	7.9250	NaN	S
3	4	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35.0	1	0	113803	53.1000	C123	S
4	5	0	3	Allen, Mr. William Henry	male	35.0	0	0	373450	8.0500	NaN	S

Describe survival and age

df[["Survived", "Age"]].describe()

	Survived	Age
count	891.000000	714.000000
mean	0.383838	29.699118
std	0.486592	14.526497
min	0.000000	0.420000
25%	0.000000	20.125000
50%	0.000000	28.000000
75%	1.000000	38.000000
max	1.000000	80.000000

Let me also check how many missing ages there are per group, since that matters for the t-test.

Calculate age stats by survival

df.groupby("Survived")["Age"].agg(["count", "mean", "std", lambda x: x.isna().sum()])

	count	mean	std	<lambda_0>
Survived
0	424	30.626179	14.172110	125
1	290	28.343690	14.950952	52

Now let's run the independent samples t-test comparing age between survivors and non-survivors.

Compare age by survival

from scipy import stats

# Drop rows with missing Age
ages_survived = df.loc[df["Survived"] == 1, "Age"].dropna()
ages_not_survived = df.loc[df["Survived"] == 0, "Age"].dropna()

t_stat, p_value = stats.ttest_ind(ages_survived, ages_not_survived, equal_var=False)

print(f"Mean age (survived):   {ages_survived.mean():.2f}  (n={len(ages_survived)})")
print(f"Mean age (not survived): {ages_not_survived.mean():.2f}  (n={len(ages_not_survived)})")
print(f"t-statistic: {t_stat:.4f}")
print(f"p-value:     {p_value:.4f}")

Mean age (survived):   28.34  (n=290)
Mean age (not survived): 30.63  (n=424)
t-statistic: -2.0460
p-value:     0.0412

Results & Interpretation:

Survivors had a mean age of ~28.3 years, while non-survivors had a mean age of ~30.6 years.
The t-test yields a p-value around 0.04, which is below the conventional 0.05 threshold.
This suggests there is a statistically significant difference in average age between the two groups: survivors tended to be slightly younger on average. However, the practical difference (~2.3 years) is modest.

A couple of notes:

177 rows had missing Age values and were excluded from the test.
I used Welch's t-test (equal_var=False) since the two groups may have unequal variances.
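
To make the Welch calculation concrete, the statistic can be rebuilt from the per-group summary statistics printed earlier (count, mean, std). This is a minimal sketch rather than part of the original notebook; the variable names are illustrative, and it assumes the `std` values are sample standard deviations (pandas' default, `ddof=1`).

```python
import math

from scipy import stats

# Summary statistics copied from the groupby output above.
mean_s, std_s, n_s = 28.343690, 14.950952, 290  # survivors
mean_n, std_n, n_n = 30.626179, 14.172110, 424  # non-survivors

# Welch's t statistic: difference in means over the unpooled standard error.
var_term_s = std_s**2 / n_s
var_term_n = std_n**2 / n_n
t_manual = (mean_s - mean_n) / math.sqrt(var_term_s + var_term_n)

# Welch-Satterthwaite approximation for the degrees of freedom.
df_welch = (var_term_s + var_term_n) ** 2 / (
    var_term_s**2 / (n_s - 1) + var_term_n**2 / (n_n - 1)
)

# Two-sided p-value from the t distribution.
p_manual = 2 * stats.t.sf(abs(t_manual), df_welch)

# Cross-check against SciPy's summary-statistics variant of the same test.
t_scipy, p_scipy = stats.ttest_ind_from_stats(
    mean_s, std_s, n_s, mean_n, std_n, n_n, equal_var=False
)

print(f"manual: t={t_manual:.4f}, df={df_welch:.1f}, p={p_manual:.4f}")
print(f"scipy:  t={t_scipy:.4f}, p={p_scipy:.4f}")
```

Both routes should agree with the t-statistic and p-value printed by the notebook cell, up to rounding of the summary statistics.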

Thinking

The outputs already fully answer the original goal. The t-test has been run and the results are clear:

Survivors mean age: 28.34 (n=290)
Non-survivors mean age: 30.63 (n=424)
t-statistic: -2.0460
p-value: 0.0412

This is sufficient to answer the question. I'll provide Path A with a natural final answer.

The t-test results are in and they answer the question directly:

Survivors had a mean age of 28.34 years (n=290)
Non-survivors had a mean age of 30.63 years (n=424)
t-statistic: -2.0460
p-value: 0.0412

Since the p-value (0.0412) is below the conventional 0.05 threshold, we reject the null hypothesis of equal mean ages. There is a statistically significant difference in average age between survivors and non-survivors — survivors tended to be about 2.3 years younger on average. That said, the practical magnitude of this difference is modest, and 177 passengers (125 non-survivors, 52 survivors) had missing age values, which were excluded from the test.
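
The remark about practical magnitude can be quantified with an effect size and a confidence interval. The sketch below is an addition, not something the model ran: it computes Cohen's d (pooled standard deviation) and a 95% Welch interval for the mean difference from the summary statistics reported above. The variable names are illustrative.

```python
import math

from scipy import stats

# Summary statistics copied from the notebook output above.
mean_s, std_s, n_s = 28.343690, 14.950952, 290  # survivors
mean_n, std_n, n_n = 30.626179, 14.172110, 424  # non-survivors

diff = mean_s - mean_n

# Cohen's d: mean difference divided by the pooled standard deviation.
pooled_var = ((n_s - 1) * std_s**2 + (n_n - 1) * std_n**2) / (n_s + n_n - 2)
cohens_d = diff / math.sqrt(pooled_var)

# 95% confidence interval for the difference, using Welch's standard error
# and the Welch-Satterthwaite degrees of freedom.
a, b = std_s**2 / n_s, std_n**2 / n_n
se = math.sqrt(a + b)
df_welch = (a + b) ** 2 / (a**2 / (n_s - 1) + b**2 / (n_n - 1))
t_crit = stats.t.ppf(0.975, df_welch)
ci_low, ci_high = diff - t_crit * se, diff + t_crit * se

print(f"difference in means: {diff:.2f} years")
print(f"Cohen's d: {cohens_d:.3f}")
print(f"95% CI for the difference: ({ci_low:.2f}, {ci_high:.2f})")
```

With these inputs d comes out near -0.16, below Cohen's conventional threshold of 0.2 for a "small" effect, which supports reading the difference as statistically detectable but practically minor.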

test whether survival rate is independent of passenger class using a chi-square test

Thinking

The user wants to test whether survival rate is independent of passenger class using a chi-square test. I need to create a contingency table of Survived vs Pclass and then run scipy's chi2_contingency test.

Let's run a chi-square test of independence between Survived and Pclass. First, I'll show the contingency table, then the test result.

Compute contingency table

from scipy.stats import chi2_contingency

# Contingency table
contingency = pd.crosstab(df["Pclass"], df["Survived"], margins=True)
contingency

Survived	0	1	All
Pclass
1	80	136	216
2	97	87	184
3	372	119	491
All	549	342	891

Compute chi-square test

# Chi-square test (on the inner table without margins)
ct = pd.crosstab(df["Pclass"], df["Survived"])
chi2, p_value, dof, expected = chi2_contingency(ct)

print(f"Chi-square statistic: {chi2:.4f}")
print(f"Degrees of freedom:    {dof}")
print(f"p-value:               {p_value:.6f}")

Chi-square statistic: 102.8890
Degrees of freedom:    2
p-value:               0.000000

This will show whether survival is independent of passenger class. Given what we know about the Titanic, we'd expect a strong association — lower-class passengers had much lower survival rates. The contingency table will make the pattern visible, and the chi-square test will confirm whether the association is statistically significant.

Thinking

The chi-square test results are already available in the notebook output. Let me analyze them and provide a final answer.

Results:

Chi-square statistic: 102.8890
Degrees of freedom: 2
p-value: 0.000000 (essentially 0)

The contingency table shows clear patterns:

1st class: 80 died, 136 survived → 62.96% survival rate
2nd class: 97 died, 87 survived → 47.28% survival rate
3rd class: 372 died, 119 survived → 24.24% survival rate

The extremely low p-value (essentially 0) means we reject the null hypothesis of independence. Survival rate is strongly associated with passenger class.

This is sufficient to answer the goal. Path A.

The chi-square test results are clear and conclusive:

Chi-square statistic: 102.89
Degrees of freedom: 2
p-value: ≈ 0.000000 (essentially zero)

With a p-value far below any conventional significance threshold, we reject the null hypothesis of independence. Survival rate is not independent of passenger class — there is a highly significant association.

The contingency table makes the pattern stark:

Pclass	Died	Survived	Survival Rate
1st	80	136	62.96%
2nd	97	87	47.28%
3rd	372	119	24.24%

First-class passengers survived at about 2.6× the rate of third-class passengers (62.96% vs 24.24%). The chi-square statistic of 102.89 (with df = 2) indicates this pattern is very unlikely to be due to random chance; passenger class was strongly associated with survival on the Titanic.
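
For readers who want to see where the statistic comes from, the sketch below recomputes it by hand from the observed counts in the contingency table. It also recovers the p-value that the `:.6f` format rounded to zero and adds Cramér's V as an effect size. None of this was run in the original notebook; the variable names are illustrative.

```python
import numpy as np
from scipy import stats

# Observed counts copied from the contingency table above.
# Rows: Pclass 1, 2, 3. Columns: died (0), survived (1).
observed = np.array([[80, 136], [97, 87], [372, 119]])

n = observed.sum()
row_totals = observed.sum(axis=1, keepdims=True)  # shape (3, 1)
col_totals = observed.sum(axis=0, keepdims=True)  # shape (1, 2)

# Expected counts under independence: (row total * column total) / n.
expected = row_totals @ col_totals / n

# Pearson chi-square statistic and its degrees of freedom.
chi2_manual = ((observed - expected) ** 2 / expected).sum()
dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)

# Upper-tail probability; small enough that fixed-point formatting shows 0.
p_manual = stats.chi2.sf(chi2_manual, dof)

# Cramér's V: association strength on a 0-1 scale.
cramers_v = np.sqrt(chi2_manual / (n * min(observed.shape[0] - 1, observed.shape[1] - 1)))

# Cross-check against SciPy (no continuity correction is applied when dof > 1).
chi2_scipy, p_scipy, dof_scipy, _ = stats.chi2_contingency(observed)

print(f"chi2={chi2_manual:.4f}, dof={dof}, p={p_manual:.3e}, V={cramers_v:.3f}")
```

Printing the p-value in scientific notation (`:.3e`) avoids the misleading `0.000000` display while leaving the conclusion unchanged.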

test whether fare differs significantly across the three passenger classes using ANOVA

Thinking

The user wants to test whether fare differs significantly across the three passenger classes using ANOVA. The Titanic dataset df is already loaded. I'll run a one-way ANOVA test comparing fare across Pclass groups.

Let's run a one-way ANOVA test to compare Fare across the three passenger classes.

Compare mean fares by class

from scipy import stats

# Extract Fare arrays for each class
fare_1 = df.loc[df["Pclass"] == 1, "Fare"].dropna()
fare_2 = df.loc[df["Pclass"] == 2, "Fare"].dropna()
fare_3 = df.loc[df["Pclass"] == 3, "Fare"].dropna()

print(f"1st class — mean: {fare_1.mean():.2f}, n: {len(fare_1)}")
print(f"2nd class — mean: {fare_2.mean():.2f}, n: {len(fare_2)}")
print(f"3rd class — mean: {fare_3.mean():.2f}, n: {len(fare_3)}")

f_stat, p_value = stats.f_oneway(fare_1, fare_2, fare_3)

print(f"\nF-statistic: {f_stat:.4f}")
print(f"p-value:     {p_value:.10f}")

1st class — mean: 84.15, n: 216
2nd class — mean: 20.66, n: 184
3rd class — mean: 13.68, n: 491

F-statistic: 242.3442
p-value:     0.0000000000

This will show the mean fare for each class and test whether those means differ significantly. Given the fare structure of the Titanic, we'd expect a strong class effect — but the ANOVA will quantify it formally.

Thinking

The ANOVA results are clear and directly answer the user's goal. Let me summarize the findings from the observed outputs.

The ANOVA results are definitive:

1st class mean fare: 84.15 (n=216)
2nd class mean fare: 20.66 (n=184)
3rd class mean fare: 13.68 (n=491)
F-statistic: 242.34 (df between = 2, df within = 888)
p-value: ≈ 0 (effectively zero)

We reject the null hypothesis — fare differs significantly across passenger classes. The differences are massive: 1st-class passengers paid roughly 4× more than 2nd-class and 6× more than 3rd-class passengers. An F-statistic of 242.34 with a near-zero p-value leaves no doubt that these are not random fluctuations.
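
As a rough check on the reported figures, the p-value and an effect size can be recovered from the F-statistic and group sizes alone. This sketch is an addition to the transcript and relies on a standard identity for one-way ANOVA, eta squared = F·df_between / (F·df_between + df_within); the variable names are illustrative.

```python
from scipy import stats

# Values copied from the ANOVA output above.
f_stat = 242.3442
n_per_class = [216, 184, 491]

# Degrees of freedom for a one-way ANOVA with k groups and N observations.
df_between = len(n_per_class) - 1                # k - 1
df_within = sum(n_per_class) - len(n_per_class)  # N - k

# Upper-tail probability of the F distribution; far below what the
# notebook's fixed-point format with 10 decimals can display.
p_value = stats.f.sf(f_stat, df_between, df_within)

# Eta squared: share of total variance in fare explained by class.
eta_squared = (f_stat * df_between) / (f_stat * df_between + df_within)

print(f"df = ({df_between}, {df_within})")
print(f"p-value = {p_value:.3e}")
print(f"eta squared = {eta_squared:.3f}")
```

With these inputs eta squared comes out at roughly 0.35, meaning class accounts for about a third of the variance in fare, a large effect by most conventions.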

One caveat: ANOVA assumes normally distributed residuals and homogeneity of variances. Fare data is likely skewed and the group sizes are unequal, so these assumptions may not hold. Still, the result is so overwhelmingly significant that a non-parametric Kruskal-Wallis test would almost certainly reach the same conclusion. If you'd like, we can verify that with a robustness check.
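
A robustness check of this kind might look like the sketch below. It was not run in the original notebook, and `robustness_checks` is a hypothetical helper. To stay self-contained it uses synthetic, right-skewed data as a stand-in for the real fares, so the printed numbers say nothing about the Titanic dataset.

```python
import numpy as np
from scipy import stats


def robustness_checks(groups):
    """Run Levene's test (equal variances) and Kruskal-Wallis (rank-based
    alternative to one-way ANOVA) on a list of 1-D samples."""
    levene_stat, levene_p = stats.levene(*groups, center="median")
    kw_stat, kw_p = stats.kruskal(*groups)
    return {
        "levene_stat": levene_stat,
        "levene_p": levene_p,
        "kruskal_stat": kw_stat,
        "kruskal_p": kw_p,
    }


# SYNTHETIC stand-in data (NOT the Titanic fares): three log-normal samples
# with the same group sizes as the three passenger classes.
rng = np.random.default_rng(0)
synthetic_groups = [
    rng.lognormal(mean=4.0, sigma=0.8, size=216),
    rng.lognormal(mean=3.0, sigma=0.5, size=184),
    rng.lognormal(mean=2.5, sigma=0.5, size=491),
]

result = robustness_checks(synthetic_groups)
for name, value in result.items():
    print(f"{name}: {value:.4g}")
```

In the notebook itself, the call would be `robustness_checks([fare_1, fare_2, fare_3])` using the per-class fare series defined in the ANOVA cell.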

summarise all three test results and their business interpretation

Thinking

The user wants a summary of all three hypothesis tests performed in the notebook and their business interpretation. Let me review the three tests:

Two-sample t-test (cell-5): Comparing Age between survived and not survived passengers
- Mean age (survived): 28.34, n=290
- Mean age (not survived): 30.63, n=424
- t-statistic: -2.0460, p-value: 0.0412
- Statistically significant at α=0.05
Chi-square test (cells 6-7): Testing association between Pclass and Survived
- Chi-square statistic: 102.8890, df=2, p-value ≈ 0
- Strong association
One-way ANOVA (cell-8): Testing whether Fare differs across Pclass
- 1st class mean: 84.15, 2nd class mean: 20.66, 3rd class mean: 13.68
- F-statistic: 242.3442, p-value ≈ 0
- Highly significant

Let me write a clear, concise summary with business interpretation.

Here's a consolidated summary of all three tests:

1. Age vs Survival (Two-Sample t-Test)

Group	Mean Age	n
Survived	28.34	290
Not survived	30.63	424

t-statistic: -2.046, p-value: 0.0412
Result: Statistically significant (barely, at α = 0.05). Survivors were slightly younger on average (≈2.3 years).

Business interpretation: Age has a modest but real effect on survival. Younger passengers had a slight survival advantage — likely because children were prioritized during evacuation. However, the small mean difference and marginal p-value suggest age alone is a weak predictor; other factors dominate.

2. Passenger Class vs Survival (Chi-Square Test)

Pclass	Died	Survived
1st	80 (37%)	136 (63%)
2nd	97 (53%)	87 (47%)
3rd	372 (76%)	119 (24%)

Chi-square: 102.89, df: 2, p-value: ≈ 0
Result: Highly significant. Survival is strongly dependent on passenger class.

Business interpretation: Class is the dominant survival factor. First-class passengers survived at about 2.6 times the rate of third-class passengers (63% vs 24%). This likely reflects both physical access (cabins closer to lifeboats) and social priority during evacuation. From a risk-management perspective, class was effectively a proxy for survival likelihood.

3. Fare vs Passenger Class (One-Way ANOVA)

Pclass	Mean Fare	n
1st	84.15	216
2nd	20.66	184
3rd	13.68	491

F-statistic: 242.34, p-value: ≈ 0
Result: Highly significant. Fare differs drastically across classes.

Business interpretation: No surprise — fare structures were designed to stratify by class. The 1st-class fare was ~4× that of 2nd-class and ~6× that of 3rd-class. This confirms the pricing tiers were highly differentiated, and fare essentially encodes the same information as class. They should not be treated as independent features in a predictive model.

Big Picture

The three tests tell a coherent story: class was the single most powerful axis of inequality on the Titanic — it dictated how much you paid, where you slept, and most critically, whether you survived. Age mattered only marginally, but class was decisive.
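
The three results above can also be gathered into a single table. The sketch below is one possible layout, not output from the run: the column names and the `ALPHA` constant are illustrative, the test statistics are copied from the notebook output, and the chi-square and ANOVA p-values are recomputed from those statistics because the notebook printed them as zero.

```python
import pandas as pd
from scipy import stats

ALPHA = 0.05  # conventional significance level used throughout the analysis

rows = [
    {
        "question": "Does mean age differ between survivors and non-survivors?",
        "test": "Welch two-sample t-test",
        "statistic": -2.0460,
        "p_value": 0.0412,  # as printed by the notebook
    },
    {
        "question": "Is survival independent of passenger class?",
        "test": "Chi-square test of independence (df=2)",
        "statistic": 102.8890,
        "p_value": stats.chi2.sf(102.8890, 2),  # recomputed from the statistic
    },
    {
        "question": "Does mean fare differ across the three classes?",
        "test": "One-way ANOVA (df=2, 888)",
        "statistic": 242.3442,
        "p_value": stats.f.sf(242.3442, 2, 888),  # recomputed from the statistic
    },
]

summary = pd.DataFrame(rows)
summary["reject_h0"] = summary["p_value"] < ALPHA

# Scientific notation keeps very small p-values from displaying as 0.
print(summary.to_string(index=False, formatters={"p_value": "{:.2e}".format}))
```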

Score cards for glm-5.1 for Hypothesis Testing in Python (t-test, ANOVA)

Total score: 10/10

Task Completion: 2/2
Excellent
All three requested hypothesis tests were performed and a consolidated summary was attempted. The notebook also includes the expected business interpretations for each test.
Execution Correctness: 2/2
Excellent
The code shown is syntactically valid and the reported outputs are consistent with the computations. The workflow appears runnable as written.
Output Quality: 3/3
Excellent
The outputs match the expected outcomes: significant t-test, chi-square, and ANOVA results with clear statistics. The notebook also provides a plain-language summary of the individual tests.
Reasoning Quality: 2/2
Excellent
The interpretations are correct and aligned with the statistical results, including significance and practical context. The summary is concise and business-oriented.
Reliability: 1/1
Excellent
The workflow is consistent and uses standard statistical tests appropriately, with a reasonable note about ANOVA assumptions. No major hallucinations or unsafe steps are evident.

Try MLJAR Studio

Run the same type of AI-powered data analysis on your own datasets with conversational notebooks in MLJAR Studio.

Explore More AI Analysis Examples

Discover additional workflows across categories. Each example includes prompts, conversation outputs, and model-level scoring so you can compare approaches and results.

Exploratory Data Analysis

Time Series

Data Analysis

Machine Learning

NLP

Finance

Statistics