Hypothesis Testing in Python (t-test, ANOVA)

Perform t-tests, chi-square tests, and ANOVA using real data to answer business questions — guided by an AI data analyst.

This page shows an AI data analysis example. A workflow is a list of prompts (instructions) used to complete one analysis task step by step. We send the same prompt list to multiple AI models, then compare their notebook outputs and scores.

What this AI workflow does

This AI Data Analyst workflow loads the Titanic training dataset from a URL and runs a two-sample t-test to compare average age between survivors and non-survivors. It performs a chi-square test of independence to assess whether survival is associated with passenger class. It runs a one-way ANOVA to test whether fare differs across the three passenger classes and summarizes the results in plain language.

Who this example is for

This is for analysts and students who need a concrete, code-driven example of hypothesis testing on a real dataset. It helps anyone answering business-style questions with statistical tests and interpretable outputs in a notebook.

Expected analysis outcomes

These are the results the AI workflow is expected to generate.

t-test result comparing survivor vs non-survivor age with p-value and interpretation
chi-square test result for survival vs passenger class independence with p-value and interpretation
one-way ANOVA result for fare differences across classes with p-value and interpretation
plain-language summary table consolidating all three hypothesis tests

Tools and libraries used

Main Python packages and tooling used to run this AI data analysis task.

pandas
numpy
scipy
statsmodels
matplotlib
seaborn

Prompt sequence

This is the exact list of prompts used in this workflow. The same prompt sequence is sent to each model so outputs and scores can be compared fairly.

1. load titanic data from https://raw.githubusercontent.com/pplonski/datasets-for-start/refs/heads/master/Titanic/train.csv and test whether average age differs significantly between survivors and non-survivors using a t-test
2. test whether survival rate is independent of passenger class using a chi-square test
3. test whether fare differs significantly across the three passenger classes using ANOVA
4. summarise all three test results and their business interpretation

Model Comparison

We compared several LLMs on the same analysis task. The table below shows their scores side by side. You can also open each model run to check the full conversation and notebook results.

Model Source	Score	Task	Execution	Output	Reasoning	Reliability	Run
gemma4:31b	10/10	2/2	2/2	3/3	2/2	1/1	Open gemma4:31b conversation
glm-5.1	10/10	2/2	2/2	3/3	2/2	1/1	Open glm-5.1 conversation
gpt-5.4	10/10	2/2	2/2	3/3	2/2	1/1	Open gpt-5.4 conversation
gpt-oss:120b	10/10	2/2	2/2	3/3	2/2	1/1	Open gpt-oss:120b conversation
qwen3-coder-next	10/10	2/2	2/2	3/3	2/2	1/1	Open qwen3-coder-next conversation
qwen3.5:397b	10/10	2/2	2/2	3/3	2/2	1/1	Open qwen3.5:397b conversation

How scoring works

Each run is graded on five dimensions:

Task Completion (0-2): how fully the workflow answered the requested analysis task and completed required steps.
Execution Correctness (0-2): whether generated code is correct, consistent, and likely runnable without major fixes.
Output Quality (0-3): how well tables, charts, and results match the expected analytical outcomes.
Reasoning Quality (0-2): clarity and correctness of interpretation, explanation, and conclusions.
Reliability (0-1): robustness of the run, including low hallucination risk and stable behavior.

The final score is the sum of these dimensions, so total score = 0-10. Higher values indicate more complete, correct, and reliable notebook results for this workflow.
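As a small illustration of the arithmetic above, the total can be computed by summing the five dimension scores after a range check. This is only a sketch: the dictionary keys and the `total_score` helper are hypothetical names chosen for this example, not part of any grading tool described on this page.

```python
# Maximum points per rubric dimension, as listed above.
MAX_POINTS = {
    "task": 2,
    "execution": 2,
    "output": 3,
    "reasoning": 2,
    "reliability": 1,
}


def total_score(scores):
    """Sum the five dimension scores after checking each is within range."""
    for name, max_points in MAX_POINTS.items():
        value = scores[name]
        if not 0 <= value <= max_points:
            raise ValueError(f"{name} must be between 0 and {max_points}, got {value}")
    return sum(scores[name] for name in MAX_POINTS)


# Example: the qwen3-coder-next row from the comparison table.
run = {"task": 2, "execution": 2, "output": 3, "reasoning": 2, "reliability": 1}
print(total_score(run))  # 10
```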

AI Analysis with qwen3-coder-next

This run uses AI with a step-by-step list of prompts to perform the analysis task Hypothesis Testing in Python (t-test, ANOVA). Below, you can review the full conversation generated with qwen3-coder-next. After the conversation, we show detailed score cards for this run.

load titanic data from https://raw.githubusercontent.com/pplonski/datasets-for-start/refs/heads/master/Titanic/train.csv and test whether average age differs significantly between survivors and non-survivors using a t-test

Load and preview Titanic dataset

import pandas as pd

# Load the Titanic dataset
url = "https://raw.githubusercontent.com/pplonski/datasets-for-start/refs/heads/master/Titanic/train.csv"
df = pd.read_csv(url)

# Display basic info about the dataset
df.head()

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Ticket	Fare	Cabin	Embarked
0	1	0	3	Braund, Mr. Owen Harris	male	22.0	1	A/5 21171	7.2500	NaN	S
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.0	1	PC 17599	71.2833	C85	C
2	3	1	3	Heikkinen, Miss. Laina	female	26.0	0	STON/O2. 3101282	7.9250	NaN	S
3	4	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35.0	1	113803	53.1000	C123	S
4	5	0	3	Allen, Mr. William Henry	male	35.0	0	373450	8.0500	NaN	S

The dataset has been loaded successfully and shows the expected columns for the Titanic dataset, including Survived and Age. To test whether average age differs significantly between survivors and non-survivors, a t-test is needed. First, I'll check the distribution of Age values to ensure it's suitable for a t-test (i.e., enough valid data and not heavily skewed). A quick summary of age statistics by survival status is required.

Describe age distribution by survival status

# Check age distribution by survival status
df.groupby('Survived')['Age'].describe()

	count	mean	std	min	25%	50%	75%	max
Survived
0	424.0	30.626179	14.172110	1.00	21.0	28.0	39.0	74.0
1	290.0	28.343690	14.950952	0.42	19.0	28.0	36.0	80.0

The output shows the descriptive statistics for Age grouped by survival status: non-survivors (0) have a mean age of 30.63, while survivors (1) have a mean age of 28.34. This suggests a small difference in average age, but it does not confirm statistical significance. To test whether this difference is statistically significant, a two-sample t-test is needed.

Compute t-test for age by survival status

from scipy import stats

# Extract age values for survivors and non-survivors, ignoring NaNs
age_non_survivors = df[df['Survived'] == 0]['Age'].dropna()
age_survivors = df[df['Survived'] == 1]['Age'].dropna()

# Perform Welch's two-sample t-test (equal_var=False does not assume equal variances)
t_stat, p_val = stats.ttest_ind(age_non_survivors, age_survivors, equal_var=False)

# Return t-statistic and p-value
t_stat, p_val

(np.float64(2.0460301043939704), np.float64(0.04118965162586641))

The two-sample t-test results show a t-statistic of 2.046 and a p-value of 0.0412. Since the p-value is less than the conventional threshold of 0.05, the difference in average age between non-survivors (30.63 years) and survivors (28.34 years) is statistically significant. This supports the conclusion that age differs significantly between survivors and non-survivors.
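Statistical significance says little about the size of the difference, so it can help to pair the test with an effect size. The sketch below is not part of the original run. It recomputes the Welch t-test from the rounded summary statistics in the `describe()` output using `scipy.stats.ttest_ind_from_stats`, and adds Cohen's d with a pooled standard deviation as one common (assumed) choice of effect-size measure.

```python
import math

from scipy import stats

# Summary statistics copied from the describe() output above.
n0, mean0, std0 = 424, 30.626179, 14.172110  # non-survivors
n1, mean1, std1 = 290, 28.343690, 14.950952  # survivors

# Welch's t-test (equal_var=False) computed from summary statistics alone.
t_stat, p_val = stats.ttest_ind_from_stats(
    mean0, std0, n0, mean1, std1, n1, equal_var=False
)

# Cohen's d with a pooled standard deviation, as a rough effect-size measure.
pooled_sd = math.sqrt(((n0 - 1) * std0**2 + (n1 - 1) * std1**2) / (n0 + n1 - 2))
cohens_d = (mean0 - mean1) / pooled_sd

print(f"t = {t_stat:.3f}, p = {p_val:.4f}, d = {cohens_d:.2f}")
```

The t-statistic and p-value should agree with the notebook output up to rounding. A Cohen's d of roughly 0.16 is usually read as a small effect, so the age gap of about 2.3 years is statistically detectable but modest.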

test whether survival rate is independent of passenger class using a chi-square test

To test whether survival rate is independent of passenger class, we’ll use a chi-square test of independence on the contingency table of Survived vs Pclass.

Here’s the next step:

Test survival by class with chi-square

import pandas as pd
from scipy import stats

# Create a contingency table of Survived vs Pclass
contingency_table = pd.crosstab(df['Survived'], df['Pclass'])

# Perform chi-square test
chi2, p_val, dof, expected = stats.chi2_contingency(contingency_table)

# Display contingency table and test results
contingency_table, (chi2, p_val, dof)

(Pclass      1   2    3
 Survived              
 0          80  97  372
 1         136  87  119,
 (np.float64(102.88898875696056), np.float64(4.549251711298793e-23), 2))

The chi-square test output shows:

Contingency table:

Pclass      1   2    3
Survived              
0          80  97  372
1         136  87  119

Test statistic: chi2 = 102.89
Degrees of freedom: dof = 2
p-value: 4.55e-23

Since the p-value is far below any conventional significance level (e.g., 0.05), we reject the null hypothesis that survival is independent of passenger class. This indicates a very strong association between class and survival.
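To make the strength of the association concrete, the counts in the contingency table can be turned into per-class survival rates and an effect size. The following sketch is an addition, not part of the original run. It rebuilds the table from the printed counts, checks the expected frequencies by hand, and computes Cramér's V, one commonly used (here assumed) effect-size measure for contingency tables.

```python
import numpy as np
from scipy import stats

# Observed counts copied from the contingency table above
# (rows: Survived 0/1, columns: Pclass 1/2/3).
observed = np.array([[80, 97, 372],
                     [136, 87, 119]])

chi2, p_val, dof, expected = stats.chi2_contingency(observed)

# Expected counts under independence: row total * column total / grand total.
n = observed.sum()
expected_manual = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / n

# Cramér's V: effect size for a contingency table, between 0 and 1.
cramers_v = np.sqrt(chi2 / (n * (min(observed.shape) - 1)))

# Share of survivors within each passenger class.
survival_rate = observed[1] / observed.sum(axis=0)

print(f"chi2 = {chi2:.2f}, dof = {dof}, Cramér's V = {cramers_v:.2f}")
print("survival rate by class:", np.round(survival_rate, 2))
```

The survival rates come out at roughly 63% in first class, 47% in second, and 24% in third, which is the pattern the chi-square test flags as incompatible with independence.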

test whether fare differs significantly across the three passenger classes using ANOVA

To test whether fare differs significantly across passenger classes, we’ll perform a one-way ANOVA using scipy.stats.f_oneway, grouping by Pclass.

Perform one-way ANOVA on fares by passenger class

from scipy import stats

# Group fares by passenger class
fare_class_1 = df[df['Pclass'] == 1]['Fare']
fare_class_2 = df[df['Pclass'] == 2]['Fare']
fare_class_3 = df[df['Pclass'] == 3]['Fare']

# Perform one-way ANOVA
f_stat, p_val = stats.f_oneway(fare_class_1, fare_class_2, fare_class_3)

(f_stat, p_val)

(np.float64(242.344156517448), np.float64(1.0313763209140291e-84))

The latest ANOVA output shows:

F-statistic = 242.34
p-value = 1.03e-84

This extremely small p-value (< 0.001) indicates strong evidence against the null hypothesis that mean fares are equal across all three passenger classes. Therefore, fare differs significantly across passenger classes.
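For readers who want to see what the F-statistic measures, the sketch below computes it by hand as the ratio of between-group to within-group variance and compares it with `scipy.stats.f_oneway`. It uses synthetic, right-skewed values generated for illustration only (an assumption, not the Titanic fares), and adds a Kruskal-Wallis test as one possible rank-based check when group distributions are skewed.

```python
import numpy as np
from scipy import stats

# Hypothetical, synthetic right-skewed values for three groups (NOT Titanic data).
rng = np.random.default_rng(0)
groups = [
    rng.lognormal(mean=4.0, sigma=0.6, size=50),
    rng.lognormal(mean=3.0, sigma=0.5, size=50),
    rng.lognormal(mean=2.5, sigma=0.5, size=50),
]

# One-way ANOVA by hand: F = (SS_between / (k - 1)) / (SS_within / (N - k)).
all_values = np.concatenate(groups)
grand_mean = all_values.mean()
k, n_total = len(groups), all_values.size
ss_between = sum(g.size * (g.mean() - grand_mean) ** 2 for g in groups)
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)
f_manual = (ss_between / (k - 1)) / (ss_within / (n_total - k))

# The library result should match the manual computation.
f_scipy, p_anova = stats.f_oneway(*groups)

# Rank-based alternative that does not assume normally distributed groups.
h_stat, p_kruskal = stats.kruskal(*groups)

print(f"F (manual) = {f_manual:.2f}, F (scipy) = {f_scipy:.2f}, p = {p_anova:.3g}")
print(f"Kruskal-Wallis H = {h_stat:.2f}, p = {p_kruskal:.3g}")
```

On the real data, the same `stats.kruskal` call could be applied to the three fare series as a robustness check, since fares are typically heavily right-skewed.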

summarise all three test results and their business interpretation

Here's a concise summary of all three statistical tests and their business interpretation:

Age & Survival (Two-sample t-test)
- Result: t = 2.05, p = 0.041
- Interpretation: Statistically significant difference in mean age between survivors and non-survivors. Younger passengers were slightly more likely to survive.
Survival vs Passenger Class (Chi-square test)
- Result: χ² = 102.89, p ≈ 4.55×10⁻²³
- Interpretation: Strong association between passenger class and survival. First-class passengers had much higher survival rates than third-class.
Fare & Passenger Class (One-way ANOVA)
- Result: F = 242.34, p ≈ 1.03×10⁻⁸⁴
- Interpretation: Extremely significant difference in mean fare across passenger classes. This confirms that class is a strong determinant of fare; the test itself does not involve survival.

Overall, passenger class was strongly associated with survival, fares differed sharply by class, and survivors were on average about 2.3 years younger than non-survivors. Fare was not tested against survival directly, so any link between fare and survival is indirect, through class.
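The three results can also be consolidated into a single table, as listed under the expected outcomes. The sketch below is one possible layout, assuming a conventional significance threshold of 0.05. The column names and question wording are illustrative choices, while the statistics and p-values are copied from the outputs above.

```python
import pandas as pd

ALPHA = 0.05  # conventional significance threshold (an assumption)

# Statistics and p-values copied from the three test outputs above.
summary = pd.DataFrame(
    [
        {
            "question": "Does mean age differ by survival?",
            "test": "Welch t-test",
            "statistic": 2.046,
            "p_value": 0.0412,
        },
        {
            "question": "Is survival independent of class?",
            "test": "Chi-square",
            "statistic": 102.89,
            "p_value": 4.55e-23,
        },
        {
            "question": "Does mean fare differ by class?",
            "test": "One-way ANOVA",
            "statistic": 242.34,
            "p_value": 1.03e-84,
        },
    ]
)

# Flag each result against the chosen threshold.
summary["significant"] = summary["p_value"] < ALPHA

print(summary.to_string(index=False))
```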

Score cards for qwen3-coder-next for Hypothesis Testing in Python (t-test, ANOVA)

Total score: 10/10

Task Completion: 2/2
Excellent
All three required hypothesis tests were performed and a summary interpretation was provided.
Execution Correctness: 2/2
Excellent
The code is syntactically valid and the reported outputs are consistent with the shown computations.
Output Quality: 3/3
Excellent
The outputs match the expected outcomes semantically: significant t-test, chi-square, and ANOVA results, plus a plain-language summary table-like recap.
Reasoning Quality: 2/2
Excellent
The interpretations correctly connect p-values to the hypotheses and explain the business meaning of each test.
Reliability: 1/1
Excellent
The workflow is consistent and uses standard statistical tests appropriately on the Titanic dataset.

Try MLJAR Studio

Run the same type of AI-powered data analysis on your own datasets with conversational notebooks in MLJAR Studio.

Explore More AI Analysis Examples

Discover additional workflows across categories. Each example includes prompts, conversation outputs, and model-level scoring so you can compare approaches and results.

Exploratory Data Analysis

Time Series

Data Analysis

Machine Learning

NLP

Finance

Statistics