Hypothesis Testing in Python (t-test, ANOVA)

Perform t-tests, chi-square tests, and ANOVA on real data to answer business questions, guided by an AI data analyst.

This page shows an AI data analysis example. A workflow is a list of prompts (instructions) used to complete one analysis task step by step. We send the same prompt list to multiple AI models, then compare their notebook outputs and scores.

What this AI workflow does

This AI Data Analyst workflow loads the Titanic training dataset from a URL and runs a two-sample t-test to compare average age between survivors and non-survivors. It performs a chi-square test of independence to assess whether survival is associated with passenger class. It runs a one-way ANOVA to test whether fare differs across the three passenger classes and summarizes the results in plain language.

Who this example is for

This is for analysts and students who need a concrete, code-driven example of hypothesis testing on a real dataset. It helps anyone answering business-style questions with statistical tests and interpretable outputs in a notebook.

Expected analysis outcomes

These are the results the AI workflow is expected to generate.

t-test result comparing survivor vs non-survivor age with p-value and interpretation
chi-square test result for survival vs passenger class independence with p-value and interpretation
one-way ANOVA result for fare differences across classes with p-value and interpretation
plain-language summary table consolidating all three hypothesis tests

Tools and libraries used

Main Python packages and tooling used to run this AI data analysis task.

pandas
numpy
scipy
statsmodels
matplotlib
seaborn

Prompt sequence

This is the exact list of prompts used in this workflow. The same prompt sequence is sent to each model so outputs and scores can be compared fairly.

1. load titanic data from https://raw.githubusercontent.com/pplonski/datasets-for-start/refs/heads/master/Titanic/train.csv and test whether average age differs significantly between survivors and non-survivors using a t-test
2. test whether survival rate is independent of passenger class using a chi-square test
3. test whether fare differs significantly across the three passenger classes using ANOVA
4. summarise all three test results and their business interpretation

Model Comparison

We compared several LLMs on the same analysis task. The table below shows their scores side by side. You can also open each model run to check the full conversation and notebook results.

Model Source	Score	Task	Execution	Output	Reasoning	Reliability	Run
gemma4:31b	10/10	2/2	2/2	3/3	2/2	1/1	Open gemma4:31b conversation
glm-5.1	10/10	2/2	2/2	3/3	2/2	1/1	Open glm-5.1 conversation
gpt-5.4	10/10	2/2	2/2	3/3	2/2	1/1	Open gpt-5.4 conversation
gpt-oss:120b	10/10	2/2	2/2	3/3	2/2	1/1	Open gpt-oss:120b conversation
qwen3-coder-next	10/10	2/2	2/2	3/3	2/2	1/1	Open qwen3-coder-next conversation
qwen3.5:397b	10/10	2/2	2/2	3/3	2/2	1/1	Open qwen3.5:397b conversation

How scoring works

Each run is graded on five dimensions:

Task Completion (0-2): how fully the workflow answered the requested analysis task and completed required steps.
Execution Correctness (0-2): whether generated code is correct, consistent, and likely runnable without major fixes.
Output Quality (0-3): how well tables, charts, and results match the expected analytical outcomes.
Reasoning Quality (0-2): clarity and correctness of interpretation, explanation, and conclusions.
Reliability (0-1): robustness of the run, including low hallucination risk and stable behavior.

The final score is the sum of these dimensions, so total score = 0-10. Higher values indicate more complete, correct, and reliable notebook results for this workflow.
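The arithmetic behind the total can be sketched in a few lines. This is only an illustration of the rubric as described above; the names `MAX_POINTS` and `total_score` are hypothetical and not part of any grading tool referenced on this page.

```python
# Maximum points per dimension, as listed in the rubric above.
MAX_POINTS = {
    "task_completion": 2,
    "execution_correctness": 2,
    "output_quality": 3,
    "reasoning_quality": 2,
    "reliability": 1,
}


def total_score(scores: dict) -> int:
    """Check each dimension against its maximum and return the sum."""
    for name, limit in MAX_POINTS.items():
        value = scores[name]
        if not 0 <= value <= limit:
            raise ValueError(f"{name} must be between 0 and {limit}, got {value}")
    return sum(scores[name] for name in MAX_POINTS)


# Per-dimension scores reported for each run in the comparison table.
run = {
    "task_completion": 2,
    "execution_correctness": 2,
    "output_quality": 3,
    "reasoning_quality": 2,
    "reliability": 1,
}
print(total_score(run))  # 10
```

The maxima sum to 10, which is why the total is reported on a 0-10 scale.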

AI Analysis with gemma4:31b

This run uses AI with a step-by-step list of prompts to perform the analysis task Hypothesis Testing in Python (t-test, ANOVA). Below, you can review the full conversation generated with gemma4:31b. After the conversation, we show detailed score cards for this run.

load titanic data from https://raw.githubusercontent.com/pplonski/datasets-for-start/refs/heads/master/Titanic/train.csv and test whether average age differs significantly between survivors and non-survivors using a t-test

Thinking

The user wants to analyze the Titanic dataset. Specifically, they want to:

Load the data from a provided URL.
Test if the average age differs significantly between survivors and non-survivors using a t-test.

Following the "AI Data Analyst" persona rules:

I must work iteratively.
I cannot provide the full pipeline at once.
The first step for any dataset that isn't loaded is to load and inspect: shape, columns, and head().
I should only provide one Python code block per response.
I must use the specified currency formatting (though not applicable here).
I must follow the visualization/ML rules (though not applicable yet).

Plan: Step 1: Load the data from the URL and inspect it. Step 2: (In subsequent turns) Handle missing values for the 'Age' column and 'Survived' column. Step 3: (In subsequent turns) Perform the t-test.

Current step: Load and inspect.

I'll start by loading the Titanic dataset from the provided URL and inspecting its structure to see how the age and survival data are stored.

Load and inspect data

import pandas as pd

url = "https://raw.githubusercontent.com/pplonski/datasets-for-start/refs/heads/master/Titanic/train.csv"
df = pd.read_csv(url)

# Inspect the dataset
display(df.head())
display(df.shape)
display(df.columns)

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Ticket	Fare	Cabin	Embarked
0	1	0	3	Braund, Mr. Owen Harris	male	22.0	1	A/5 21171	7.2500	NaN	S
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.0	1	PC 17599	71.2833	C85	C
2	3	1	3	Heikkinen, Miss. Laina	female	26.0	0	STON/O2. 3101282	7.9250	NaN	S
3	4	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35.0	1	113803	53.1000	C123	S
4	5	0	3	Allen, Mr. William Henry	male	35.0	0	373450	8.0500	NaN	S

(891, 12)

Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
      dtype='str')

Thinking

The user wants to test whether the average age differs significantly between survivors and non-survivors using a t-test. The dataset is loaded. The columns Survived and Age exist. To perform a t-test, I need to separate the Age values into two groups based on Survived (0 or 1). I must handle missing values in the Age column, as t-tests cannot be performed on NaN values.

Plan:

Extract Age for Survived == 1 and Survived == 0.
Remove NaN values from these groups.
Run scipy.stats.ttest_ind.
Display the results.

The latest output confirms the presence of Survived and Age columns. Now I need to perform the actual statistical test.

The dataset contains the necessary columns, Survived and Age, but since the Age column likely contains missing values, I need to handle those before performing the t-test to ensure accurate results.

Perform age t-test

from scipy import stats

# Split age into two groups based on survival status, removing NaNs
survived_age = df[df['Survived'] == 1]['Age'].dropna()
not_survived_age = df[df['Survived'] == 0]['Age'].dropna()

# Perform independent t-test
t_stat, p_value = stats.ttest_ind(survived_age, not_survived_age)

# Display means and test results
display(pd.DataFrame({
    'Group': ['Survived', 'Not Survived'],
    'Mean Age': [survived_age.mean(), not_survived_age.mean()],
    'Count': [len(survived_age), len(not_survived_age)]
}))

print(f"T-statistic: {t_stat:.4f}")
print(f"P-value: {p_value:.4f}")

	Group	Mean Age	Count
0	Survived	28.343690	290
1	Not Survived	30.626179	424

T-statistic: -2.0667
P-value: 0.0391

Thinking

The user wants to know if the average age differs significantly between survivors and non-survivors using a t-test.

Looking at the notebook state:

Data loaded: df.
Groups created: survived_age (mean ~28.34, count 290) and not_survived_age (mean ~30.63, count 424).
T-test performed: $T$-statistic = -2.0667, $P$-value = 0.0391.

Interpretation:

The $p$-value (0.0391) is less than the standard significance level of 0.05.
This indicates that the difference in average age between survivors and non-survivors is statistically significant.
On average, survivors were younger (28.34 years) than non-survivors (30.63 years).

The original goal was: "load titanic data ... and test whether average age differs significantly between survivors and non-survivors using a t-test".

The results are fully available in output-2. Path A is appropriate.

Final answer construction:

State the observed means.
State the p-value.
Conclude whether the difference is significant.

The t-test results show a p-value of 0.0391, which is below the standard significance level of 0.05. This indicates that the difference in average age between survivors and non-survivors is statistically significant. On average, survivors were younger (28.34 years) compared to those who did not survive (30.63 years).
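The call to stats.ttest_ind above uses SciPy's default equal_var=True, which is the pooled-variance Student's t-test. A minimal sketch of a robustness check follows, assuming one wants Welch's unequal-variance test and an effect size alongside the p-value. The helper name `compare_means` is hypothetical, and the sample values are synthetic: only the group sizes (290 and 424) come from the output above, while the means and spreads passed to the random generator are made up for illustration.

```python
import numpy as np
from scipy import stats


def compare_means(a, b):
    """Welch's t-test plus Cohen's d for two independent samples."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # Drop missing values, mirroring the .dropna() step in the notebook.
    a = a[~np.isnan(a)]
    b = b[~np.isnan(b)]
    # equal_var=False selects Welch's test (no equal-variance assumption).
    t_stat, p_value = stats.ttest_ind(a, b, equal_var=False)
    # Cohen's d based on the pooled sample standard deviation.
    pooled_var = (
        (len(a) - 1) * a.var(ddof=1) + (len(b) - 1) * b.var(ddof=1)
    ) / (len(a) + len(b) - 2)
    cohens_d = (a.mean() - b.mean()) / np.sqrt(pooled_var)
    return t_stat, p_value, cohens_d


# Synthetic stand-in for the two age groups (NOT the Titanic data).
rng = np.random.default_rng(0)
group_a = rng.normal(loc=28.0, scale=15.0, size=290)
group_b = rng.normal(loc=31.0, scale=14.0, size=424)

t_stat, p_value, d = compare_means(group_a, group_b)
print(f"Welch t = {t_stat:.4f}, p = {p_value:.4f}, Cohen's d = {d:.3f}")
```

In the notebook, the same helper could be called with `df.loc[df['Survived'] == 1, 'Age']` and `df.loc[df['Survived'] == 0, 'Age']`. A small absolute value of d would indicate that a statistically significant difference is still modest in practical terms.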

test whether survival rate is independent of passenger class using a chi-square test

Thinking

The user wants to test if survival rate is independent of passenger class using a chi-square test.

- `df` is already loaded and contains `Survived` and `Pclass`.
- Survival is binary (0 or 1).
- Pclass is categorical (1, 2, 3).

- A chi-square test of independence requires a contingency table (cross-tabulation) of the two variables.
- I can use `pd.crosstab` to create the table.
- I can use `scipy.stats.chi2_contingency` to perform the test.

- Import `stats` from `scipy` (already done in previous cell, but good to keep it in mind or re-import if necessary, though in a notebook it's usually persistent).
- Create a contingency table of `Survived` vs `Pclass`.
- Run the chi-square test.
- Display the table and the p-value.

To test whether survival is independent of passenger class, I will create a contingency table (cross-tabulation) of Survived and Pclass and then perform the Chi-square test of independence.

Run Chi-square test

from scipy import stats

# Create a contingency table of Survived and Pclass
contingency_table = pd.crosstab(df['Survived'], df['Pclass'])

# Perform the Chi-square test
chi2, p_value, dof, expected = stats.chi2_contingency(contingency_table)

# Display the table and the results
display(contingency_table)
print(f"Chi-square statistic: {chi2:.4f}")
print(f"P-value: {p_value:.4f}")

Pclass	1	2	3
Survived
0	80	97	372
1	136	87	119

Chi-square statistic: 102.8890
P-value: 0.0000

Thinking

The user wants to test whether survival rate is independent of passenger class using a chi-square test.

Looking at the notebook state:

cell-3 performed the chi-square test of independence between Survived and Pclass.
The contingency_table shows the counts of survivors (1) and non-survivors (0) across the three passenger classes (1, 2, 3).
The output shows:
- Chi-square statistic: 102.8890
- P-value: 0.0000 (which is $< 0.05$)
The contingency table shows a clear difference in survival proportions (e.g., class 1 has more survivors than non-survivors, whereas class 3 has significantly more non-survivors).

Conclusion:

The p-value is extremely low ($p < 0.0001$), which means we reject the null hypothesis of independence.
Therefore, survival rate is not independent of passenger class; there is a statistically significant association between the two.

Decision: Path A. The outputs are sufficient to answer the goal.

The chi-square test yields a statistic of 102.8890 and a p-value that prints as 0.0000 at four decimal places (that is, p < 0.0001), well below the standard significance level of 0.05. The hypothesis that survival is independent of passenger class is therefore rejected; there is a statistically significant association between the two. In the contingency table, 136 of 216 first-class passengers survived (about 63%), compared with 119 of 491 third-class passengers (about 24%).
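A minimal sketch of how the same test could be reported with per-class survival rates, a p-value in scientific notation, and an effect size. The observed counts are copied from the contingency table above, so no download is needed; Cramér's V is an addition that the original conversation does not compute.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Observed counts copied from the contingency table in the output above.
observed = pd.DataFrame(
    [[80, 97, 372], [136, 87, 119]],
    index=pd.Index([0, 1], name="Survived"),
    columns=pd.Index([1, 2, 3], name="Pclass"),
)

chi2, p_value, dof, expected = stats.chi2_contingency(observed)

# Survival rate per class: survivors divided by the class total.
survival_rate = observed.loc[1] / observed.sum(axis=0)

# Cramér's V: association strength on a 0-1 scale.
n = observed.to_numpy().sum()
cramers_v = np.sqrt(chi2 / (n * (min(observed.shape) - 1)))

print(survival_rate.round(3))
# Scientific notation avoids the misleading "0.0000" from fixed-point formatting.
print(f"chi2 = {chi2:.4f}, dof = {dof}, p = {p_value:.2e}")
print(f"Cramér's V = {cramers_v:.3f}")
```

Formatting with `:.2e` instead of `:.4f` shows the order of magnitude of a very small p-value rather than rounding it to zero.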

test whether fare differs significantly across the three passenger classes using ANOVA

Thinking

The user wants to test if the Fare differs significantly across the three passenger classes (Pclass) using ANOVA (Analysis of Variance).

*   `df`: Titanic dataset (DataFrame) containing `Fare` and `Pclass`.
*   Columns of interest: `Fare` (numeric) and `Pclass` (categorical/ordinal).

*   Objective: ANOVA test to compare means of `Fare` across `Pclass`.
*   Step 1: Group `Fare` by `Pclass`.
*   Step 2: Use `scipy.stats.f_oneway` to perform the ANOVA.
*   Step 3: Provide descriptive statistics (means per class) for context.

*   The `Fare` column might have NaNs? Checking the first load output... `Fare` seems okay, but it's good practice to handle missing values. In the Titanic dataset, `Fare` usually doesn't have NaNs in the train set, but I should ensure the arrays passed to `f_oneway` are clean.
*   `Pclass` has values 1, 2, 3.

*   Group `Fare` values based on `Pclass`.
*   Run `stats.f_oneway`.
*   Print results.

To test whether the fare differs significantly across the three passenger classes, I will group the fares by class and perform a one-way ANOVA test.

Perform ANOVA on fares

from scipy import stats

# Group fares by Pclass
fares_by_class = [df[df['Pclass'] == i]['Fare'] for i in [1, 2, 3]]

# Perform one-way ANOVA
f_stat, p_value = stats.f_oneway(*fares_by_class)

# Calculate means for each class to provide context
means = df.groupby('Pclass')['Fare'].mean()

# Display means and test results
display(means.to_frame(name='Mean Fare'))
print(f"F-statistic: {f_stat:.4f}")
print(f"P-value: {p_value:.4f}")

	Mean Fare
Pclass
1	84.154687
2	20.662183
3	13.675550

F-statistic: 242.3442
P-value: 0.0000

Thinking

The user wants to know if the fare differs significantly across the three passenger classes using ANOVA.

Looking at the notebook state:

Cell 4 executed stats.f_oneway(*fares_by_class).
The output shows:
- Mean Fare for Pclass 1: 84.154687
- Mean Fare for Pclass 2: 20.662183
- Mean Fare for Pclass 3: 13.675550
- F-statistic: 242.3442
- P-value: 0.0000

The p-value is essentially 0, which is much lower than the standard significance level of 0.05. This means we reject the null hypothesis that the means are equal. There is a statistically significant difference in fares across the three passenger classes. The means clearly show that first-class passengers paid significantly more (84.15 USD) than second-class (20.66 USD) and third-class (13.68 USD) passengers.

Current outputs are sufficient to answer the original goal. I will choose Path A.

Currency rule: Use USD instead of $.

The ANOVA returns an F-statistic of 242.3442 and a p-value that prints as 0.0000 at four decimal places (p < 0.0001), far below the 0.05 threshold. This indicates that mean fare differs significantly across the three passenger classes. The group means show the size of the gap: first class averages 84.15 USD, second class 20.66 USD, and third class 13.68 USD.
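One-way ANOVA assumes roughly equal variances and approximately normal residuals, which fare-like data with a long right tail may not satisfy. The sketch below shows one possible robustness check using Levene's test and the rank-based Kruskal-Wallis test. The data are synthetic lognormal samples, not the Titanic fares: the group sizes (216, 184, 491) are the class totals from the contingency table earlier, and the lognormal parameters are arbitrary assumptions chosen only to produce skewed groups.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Synthetic right-skewed "fares" for three classes (NOT the real data).
groups = [
    rng.lognormal(mean=4.0, sigma=0.8, size=216),
    rng.lognormal(mean=2.9, sigma=0.5, size=184),
    rng.lognormal(mean=2.4, sigma=0.5, size=491),
]

# Classic one-way ANOVA, as used in the notebook.
f_stat, p_anova = stats.f_oneway(*groups)

# Levene's test (median-centred) checks the equal-variance assumption.
lev_stat, p_levene = stats.levene(*groups, center="median")

# Kruskal-Wallis compares the groups on ranks, without assuming normality.
h_stat, p_kruskal = stats.kruskal(*groups)

print(f"ANOVA:          F = {f_stat:.2f}, p = {p_anova:.2e}")
print(f"Levene:         W = {lev_stat:.2f}, p = {p_levene:.2e}")
print(f"Kruskal-Wallis: H = {h_stat:.2f}, p = {p_kruskal:.2e}")
```

In the notebook, the existing `fares_by_class` list could be passed in place of `groups`. If Levene's test flags unequal variances, agreement between the ANOVA and Kruskal-Wallis conclusions gives some reassurance that the result does not hinge on the ANOVA assumptions.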

summarise all three test results and their business interpretation

Thinking

The user wants a summary of three statistical tests performed in the notebook and their business (or practical) interpretation.
- The notebook contains:
  1. T-test: Age vs. Survived.
  2. Chi-square test: Pclass vs. Survived.
  3. ANOVA: Fare vs. Pclass.
- Test 1 (T-test): Age vs. Survival.
  - Mean Age (Survived): 28.34
  - Mean Age (Not Survived): 30.63
  - P-value: 0.0391 (Significant at 0.05).
  - Interpretation: There is a statistically significant difference in age between those who survived and those who didn't. Younger passengers were slightly more likely to survive.
- Test 2 (Chi-square): Pclass vs. Survival.
  - P-value: 0.0000 (Highly significant).
  - Crosstab: Pclass 1 has a higher survival rate relative to its count compared to Pclass 3.
  - Interpretation: Passenger class is strongly associated with survival. Being in a higher class significantly increased the chances of survival.
- Test 3 (ANOVA): Fare vs. Pclass.
  - Mean Fares: Pclass 1 (84.15 USD), Pclass 2 (20.66 USD), Pclass 3 (13.68 USD).
  - P-value: 0.0000 (Highly significant).
  - Interpretation: There is a significant difference in ticket prices across the three classes, as expected. This confirms that the class system was clearly delineated by cost.
- Ensure currency is formatted as USD per the system prompt.
- Avoid using $ symbols for money.
- Keep it professional and data-driven.

Here is a summary of the three statistical tests and their interpretations:

1. Age and Survival (Independent T-test)

Result: P-value = 0.0391 (Significant).
Interpretation: There is a statistically significant difference in the average age of passengers who survived (28.34 years) versus those who did not (30.63 years). This suggests that younger passengers had a slightly higher probability of survival.

2. Passenger Class and Survival (Chi-square Test)

Result: P-value < 0.0001 (Highly Significant).
Interpretation: There is a strong association between passenger class and survival. The data indicates that survival rates were not distributed equally across classes; passengers in 1st class had a significantly higher chance of survival compared to those in 3rd class.

3. Fare and Passenger Class (One-way ANOVA)

Result: P-value < 0.0001 (Highly Significant).
Interpretation: There is a significant difference in the fares paid across the three classes. With mean fares of 84.15 USD for 1st class, 20.66 USD for 2nd class, and 13.68 USD for 3rd class, the pricing structure clearly separated the passenger tiers.
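The three results can also be consolidated into a single table. The following is a minimal sketch that uses only the statistics printed earlier in the conversation; the column names and the `p_upper_bound` helper column are illustrative choices. For the chi-square and ANOVA rows the exact p-value was not printed (it displayed as 0.0000), so the table carries "< 0.0001" and uses 0.0001 as a conservative upper bound for the decision.

```python
import pandas as pd

ALPHA = 0.05  # significance level used throughout the conversation

# (question, test, statistic, p-value upper bound, p-value as displayed)
rows = [
    ("Age vs. survival", "Independent t-test", -2.0667, 0.0391, "0.0391"),
    ("Survival vs. passenger class", "Chi-square test of independence",
     102.8890, 0.0001, "< 0.0001"),
    ("Fare vs. passenger class", "One-way ANOVA", 242.3442, 0.0001, "< 0.0001"),
]

summary = pd.DataFrame(
    rows,
    columns=["Question", "Test", "Statistic", "p_upper_bound", "P-value"],
)

# Reject the null hypothesis when the p-value bound is below ALPHA.
summary["Decision"] = summary["p_upper_bound"].lt(ALPHA).map(
    {True: "Reject H0", False: "Fail to reject H0"}
)

print(summary.drop(columns="p_upper_bound").to_string(index=False))
```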

Score cards for gemma4:31b for Hypothesis Testing in Python (t-test, ANOVA)

Total score: 10/10

Task Completion: 2/2
Excellent
All three requested hypothesis tests were performed and a summary interpretation was provided. The notebook also loaded the Titanic data as required.
Execution Correctness: 2/2
Excellent
The code shown is syntactically valid and uses appropriate statistical tests for each question. It appears runnable based on the provided outputs.
Output Quality: 3/3
Excellent
The outputs match the expected outcomes semantically: t-test p < 0.05 with younger survivors, chi-square p < 0.001, and ANOVA p < 0.001 with fare differences by class. The notebook also includes descriptive context for each test.
Reasoning Quality: 2/2
Excellent
The interpretations correctly connect each p-value to the corresponding business/statistical conclusion. The reasoning is clear and consistent with the reported results.
Reliability: 1/1
Excellent
The workflow is consistent and uses standard statistical methods without obvious hallucinations or unsupported claims. It is reasonably robust for the stated task.

Try MLJAR Studio

Run the same type of AI-powered data analysis on your own datasets with conversational notebooks in MLJAR Studio.
