Hypothesis Testing in Python (t-test, ANOVA)

Perform t-tests, chi-square tests, and ANOVA using real data to answer business questions — guided by an AI data analyst.

This page shows an AI data analysis example. A workflow is a list of prompts (instructions) used to complete one analysis task step by step. We send the same prompt list to multiple AI models, then compare their notebook outputs and scores.

What this AI workflow does

This AI Data Analyst workflow loads the Titanic training dataset from a URL and runs a two-sample t-test to compare average age between survivors and non-survivors. It performs a chi-square test of independence to assess whether survival is associated with passenger class. It runs a one-way ANOVA to test whether fare differs across the three passenger classes and summarizes the results in plain language.

Who this example is for

This is for analysts and students who need a concrete, code-driven example of hypothesis testing on a real dataset. It helps anyone answering business-style questions with statistical tests and interpretable outputs in a notebook.

Expected analysis outcomes

These are the results the AI workflow is expected to generate.

t-test result comparing survivor vs non-survivor age with p-value and interpretation
chi-square test result for survival vs passenger class independence with p-value and interpretation
one-way ANOVA result for fare differences across classes with p-value and interpretation
plain-language summary table consolidating all three hypothesis tests

Tools and libraries used

Main Python packages and tooling used to run this AI data analysis task.

pandas
numpy
scipy
statsmodels
matplotlib
seaborn

Prompt sequence

This is the exact list of prompts used in this workflow. The same prompt sequence is sent to each model so outputs and scores can be compared fairly.

1. load titanic data from https://raw.githubusercontent.com/pplonski/datasets-for-start/refs/heads/master/Titanic/train.csv and test whether average age differs significantly between survivors and non-survivors using a t-test
2. test whether survival rate is independent of passenger class using a chi-square test
3. test whether fare differs significantly across the three passenger classes using ANOVA
4. summarise all three test results and their business interpretation

Model Comparison

We compared several LLMs on the same analysis task. The table below shows their scores side by side. You can also open each model run to check the full conversation and notebook results.

Model Source	Score	Task	Execution	Output	Reasoning	Reliability	Run
gemma4:31b	10/10	2/2	2/2	3/3	2/2	1/1	Open gemma4:31b conversation
glm-5.1	10/10	2/2	2/2	3/3	2/2	1/1	Open glm-5.1 conversation
gpt-5.4	10/10	2/2	2/2	3/3	2/2	1/1	Open gpt-5.4 conversation
gpt-oss:120b	10/10	2/2	2/2	3/3	2/2	1/1	Open gpt-oss:120b conversation
qwen3-coder-next	10/10	2/2	2/2	3/3	2/2	1/1	Open qwen3-coder-next conversation
qwen3.5:397b	10/10	2/2	2/2	3/3	2/2	1/1	Open qwen3.5:397b conversation

How scoring works

Each run is graded on five dimensions:

Task Completion (0-2): how fully the workflow answered the requested analysis task and completed required steps.
Execution Correctness (0-2): whether generated code is correct, consistent, and likely runnable without major fixes.
Output Quality (0-3): how well tables, charts, and results match the expected analytical outcomes.
Reasoning Quality (0-2): clarity and correctness of interpretation, explanation, and conclusions.
Reliability (0-1): robustness of the run, including low hallucination risk and stable behavior.

The final score is the sum of these dimensions, so total score = 0-10. Higher values indicate more complete, correct, and reliable notebook results for this workflow.
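
As a minimal sketch of the arithmetic described above, the snippet below sums the five dimension scores and checks each against its stated maximum. The dictionary keys and the `total_score` helper are hypothetical names chosen for illustration; only the maxima and the gpt-5.4 scores come from this page.

```python
# Rubric maxima as stated in "How scoring works" (key names are illustrative).
RUBRIC_MAX = {
    "task_completion": 2,
    "execution_correctness": 2,
    "output_quality": 3,
    "reasoning_quality": 2,
    "reliability": 1,
}


def total_score(scores):
    """Validate each dimension against its maximum and return the sum."""
    if set(scores) != set(RUBRIC_MAX):
        raise KeyError("scores must cover exactly the five rubric dimensions")
    for name, value in scores.items():
        if not 0 <= value <= RUBRIC_MAX[name]:
            raise ValueError(f"{name} must be between 0 and {RUBRIC_MAX[name]}")
    return sum(scores.values())


# Scores reported for the gpt-5.4 run in the comparison table.
gpt_5_4 = {
    "task_completion": 2,
    "execution_correctness": 2,
    "output_quality": 3,
    "reasoning_quality": 2,
    "reliability": 1,
}

total = total_score(gpt_5_4)
print(f"{total}/{sum(RUBRIC_MAX.values())}")
```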

AI Analysis with gpt-5.4

This run uses AI with a step-by-step list of prompts to perform the analysis task Hypothesis Testing in Python (t-test, ANOVA). Below, you can review the full conversation generated with gpt-5.4. After the conversation, we show detailed score cards for this run.

load titanic data from https://raw.githubusercontent.com/pplonski/datasets-for-start/refs/heads/master/Titanic/train.csv and test whether average age differs significantly between survivors and non-survivors using a t-test

Thinking

Considering data display

I’m thinking about how to display the shape of data. Showing it as a pd.DataFrame could work. While a shape tuple is fine, the special load inspection might require details like shape, columns, and head. It seems that I need to keep the call for the data variable short, so I’ll go with df. I need to ensure that I follow the guideline of either a single chart or table per step.

Let’s start by loading the Titanic training data and checking the structure before we run the t-test.

Load Titanic data

import pandas as pd
from IPython.display import display

url = "https://raw.githubusercontent.com/pplonski/datasets-for-start/refs/heads/master/Titanic/train.csv"
df = pd.read_csv(url)

display(pd.DataFrame({"rows": [df.shape[0]], "columns": [df.shape[1]]}))
display(pd.DataFrame({"column": df.columns}))
df.head()

	rows	columns
0	891	12

	column
0	PassengerId
1	Survived
2	Pclass
3	Name
4	Sex
5	Age
6	SibSp
7	Parch
8	Ticket
9	Fare
10	Cabin
11	Embarked

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Ticket	Fare	Cabin	Embarked
0	1	0	3	Braund, Mr. Owen Harris	male	22.0	1	A/5 21171	7.2500	NaN	S
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.0	1	PC 17599	71.2833	C85	C
2	3	1	3	Heikkinen, Miss. Laina	female	26.0	0	STON/O2. 3101282	7.9250	NaN	S
3	4	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35.0	1	113803	53.1000	C123	S
4	5	0	3	Allen, Mr. William Henry	male	35.0	0	373450	8.0500	NaN	S

The data is loaded and the key columns for your test are present: Survived and Age. To answer whether mean age differs significantly between survivors and non-survivors, we still need the actual group means and the t-test result.

Compare age by survival

from scipy.stats import ttest_ind
import pandas as pd

age_survived = df.loc[df["Survived"] == 1, "Age"].dropna()
age_not_survived = df.loc[df["Survived"] == 0, "Age"].dropna()

result = pd.DataFrame({
    "group": ["Survived", "Did not survive"],
    "n": [age_survived.shape[0], age_not_survived.shape[0]],
    "mean_age": [age_survived.mean(), age_not_survived.mean()],
    "std_age": [age_survived.std(), age_not_survived.std()]
})

test = ttest_ind(age_survived, age_not_survived, equal_var=False)

display(result)
pd.DataFrame({
    "t_statistic": [test.statistic],
    "p_value": [test.pvalue]
})

	group	n	mean_age	std_age
0	Survived	290	28.343690	14.950952
1	Did not survive	424	30.626179	14.172110

	t_statistic	p_value
0	-2.04603	0.04119

Yes. Based on the observed outputs, the average age differs significantly between survivors and non-survivors at the 5% significance level.

From the data with non-missing ages:

Survivors: n = 290, mean age 28.34
Non-survivors: n = 424, mean age 30.63

Using Welch’s two-sample t-test:

t = -2.046
p = 0.04119

Because the p-value is below 0.05, we reject the null hypothesis of equal mean ages. In this sample, survivors were on average about 2.28 years younger than non-survivors.
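
As a cross-check that needs no download, the sketch below re-derives the Welch test from the group summary statistics printed above and adds Cohen's d as a rough effect size. The effect-size step is not part of the original run; it is included here only as one common way to judge how large a roughly 2.28-year gap is relative to the spread in ages.

```python
import math

from scipy.stats import ttest_ind_from_stats

# Summary statistics copied from the output table above.
n_surv, mean_surv, sd_surv = 290, 28.343690, 14.950952
n_died, mean_died, sd_died = 424, 30.626179, 14.172110

# Welch's t-test (unequal variances) computed from summary statistics only.
welch = ttest_ind_from_stats(
    mean_surv, sd_surv, n_surv,
    mean_died, sd_died, n_died,
    equal_var=False,
)

# Cohen's d with a pooled standard deviation (an added illustration,
# not something the original notebook computed).
pooled_sd = math.sqrt(
    ((n_surv - 1) * sd_surv**2 + (n_died - 1) * sd_died**2)
    / (n_surv + n_died - 2)
)
cohens_d = (mean_surv - mean_died) / pooled_sd

print(f"t = {welch.statistic:.3f}, p = {welch.pvalue:.4f}, d = {cohens_d:.3f}")
```

A d of this size is conventionally read as a small effect, which is consistent with a p-value that only just clears the 5% threshold.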

test whether survival rate is independent of passenger class using a chi-square test

Thinking

Planning chi-square test

I need to craft an answer based on the user's request to test whether survival rate is independent of passenger class using a chi-square test. Since we have the DataFrame ready, I can compute the contingency table and chi-square in one step. I'll propose the next step with a single code block, making sure to include a display of the contingency table along with the chi-square test result.

I'll keep it concise, using markdown and avoiding unnecessary symbols.

Defining chi-square test output

I need to provide an answer to the user's question about testing independence between survival rate and passenger class. Since we're missing output, I can combine elements, but I should stick to displaying one table at a time per the rules.

It might make sense to compute the chi-square test and return a summary table with values like chi2, degrees of freedom, and p-value. Including the contingency table could be helpful, but that leads to multiple tables, which isn't preferred.

I could also include a brief interpretation after the code, focusing on conveying the main outcomes.

Crafting summary output

I see that if I can't display results directly, I can still hold the crosstab in a variable and create a summary as the final output. The user's goal seems to be about testing independence, so I can compute the contingency table and the chi-square test, then return a summary.

I'll create a summary DataFrame that includes observed counts but keep it simple and concise. I'll remember that the variable for passenger class is Pclass and ensure to craft clear and straightforward text around this.

Let’s run the chi-square test using Pclass and Survived.

Run chi square test

from scipy.stats import chi2_contingency
import pandas as pd

ct = pd.crosstab(df["Pclass"], df["Survived"])
chi2, p_value, dof, expected = chi2_contingency(ct)

pd.DataFrame({
    "chi2_statistic": [chi2],
    "degrees_of_freedom": [dof],
    "p_value": [p_value]
})

	chi2_statistic	degrees_of_freedom	p_value
0	102.888989	2	4.549252e-23

The chi-square test shows that survival rate is not independent of passenger class.

Observed result:

$\chi^2 = 102.89$
degrees of freedom = 2
$p \approx 4.55 \times 10^{-23}$

Because the p-value is far below $0.05$, we reject the null hypothesis of independence. This means there is a strong association between Pclass and Survived in this Titanic dataset.

In plain terms, passenger class and survival were related: survival rates differed substantially across classes.
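
A p-value this small says the association is unlikely to be chance, but not how strong it is. The sketch below recomputes the p-value from the reported statistic and adds Cramér's V as an effect size. It assumes the contingency table used all 891 loaded rows (no missing values in Pclass or Survived) and has 3 rows and 2 columns; Cramér's V itself was not part of the original run.

```python
import math

from scipy.stats import chi2

# Values reported in the output table above.
chi2_stat = 102.888989
dof = 2

# Assumption: every one of the 891 loaded rows entered the 3 x 2 crosstab.
n_obs = 891
n_rows, n_cols = 3, 2

# Upper-tail probability of the chi-square distribution.
p_value = chi2.sf(chi2_stat, dof)

# Cramér's V: sqrt(chi2 / (n * min(rows - 1, cols - 1))), ranging from 0 to 1.
cramers_v = math.sqrt(chi2_stat / (n_obs * min(n_rows - 1, n_cols - 1)))

print(f"p = {p_value:.3e}, Cramér's V = {cramers_v:.3f}")
```

Under these assumptions V comes out near 0.34, which is usually described as a moderate association for a table with one degree of freedom in its smaller dimension.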

test whether fare differs significantly across the three passenger classes using ANOVA

We can test that with a one-way ANOVA using Fare as the numeric outcome and Pclass as the 3 groups.

Compare fares across classes

from scipy.stats import f_oneway
import pandas as pd

fare_1 = df.loc[df["Pclass"] == 1, "Fare"].dropna()
fare_2 = df.loc[df["Pclass"] == 2, "Fare"].dropna()
fare_3 = df.loc[df["Pclass"] == 3, "Fare"].dropna()

f_stat, p_value = f_oneway(fare_1, fare_2, fare_3)

pd.DataFrame({
    "F_statistic": [f_stat],
    "p_value": [p_value],
    "n_class_1": [fare_1.shape[0]],
    "n_class_2": [fare_2.shape[0]],
    "n_class_3": [fare_3.shape[0]]
})

	F_statistic	p_value	n_class_1	n_class_2	n_class_3
0	242.344157	1.031376e-84	216	184	491

Yes, fare differs significantly across the three passenger classes.

From the ANOVA output:

$F = 242.34$
$p = 1.03 \times 10^{-84}$

Because the p-value is far below a typical significance level such as $0.05$, we reject the null hypothesis that mean fare is the same in all three classes.

This means passenger class is strongly associated with fare, and at least one class has a different average fare from the others. The result is statistically very strong given the class sizes:

Class 1: 216 passengers
Class 2: 184 passengers
Class 3: 491 passengers

So, based on the observed ANOVA result, fares are not equal across the three passenger classes.
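
To put a size on "strongly associated", the sketch below converts the reported F statistic and group sizes into eta squared, the share of fare variance explained by class, and rechecks the p-value from the F distribution. Eta squared is an added illustration, not something the original run computed.

```python
from scipy import stats

# Values reported in the ANOVA output above.
f_stat = 242.344157
group_sizes = [216, 184, 491]

k = len(group_sizes)          # number of groups
n_total = sum(group_sizes)    # 891 passengers with a fare
df_between = k - 1            # 2
df_within = n_total - k       # 888

# Upper-tail probability of F(df_between, df_within).
p_value = stats.f.sf(f_stat, df_between, df_within)

# For one-way ANOVA, eta squared can be recovered from F and the degrees
# of freedom: SS_between / SS_total = F*df_b / (F*df_b + df_w).
eta_squared = (f_stat * df_between) / (f_stat * df_between + df_within)

print(f"p = {p_value:.3e}, eta squared = {eta_squared:.3f}")
```

An eta squared of roughly 0.35 means class accounts for about a third of the variation in fare. ANOVA alone does not say which classes differ, so a post-hoc comparison would be the natural next step.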

summarise all three test results and their business interpretation

Thinking

Summarizing test results

I need to summarize the three test results and provide a business interpretation without using any coding. The response should be concise and conversational, maybe even using bullet points. I think I should discuss the t-test for age vs survival, the chi-square test for pclass vs survived, and ANOVA for fare vs pclass. For the business interpretation, we could relate this to passenger strategy for the Titanic, ensuring we avoid any currency symbols. I'll include the observed statistics and their meanings as well.

Here’s a concise summary of the three tests and what they mean in practical terms:

Age vs survival (Welch two-sample t-test)
- Mean age of survivors: 28.34
- Mean age of non-survivors: 30.63
- Test result: $t = -2.05$, $p = 0.041$
- Interpretation: There is a statistically significant difference in average age between survivors and non-survivors at the 5% level. Survivors were, on average, slightly younger.
- Business meaning: Age appears to have some relationship with survival, but the difference is fairly small in absolute terms, so age alone is probably not a strong standalone driver for decision-making or prediction.

Passenger class vs survival (Chi-square test of independence)
- Test result: $\chi^2 = 102.89$, $p = 4.55 \times 10^{-23}$
- Interpretation: Survival and passenger class are not independent. Survival rates differ strongly by class.
- Business meaning: Passenger class is a major factor associated with survival. This suggests socioeconomic or access-related advantages tied to higher classes, and it would likely be an important variable in any survival analysis or predictive model.

Fare vs passenger class (One-way ANOVA)
- Test result: $F = 242.34$, $p = 1.03 \times 10^{-84}$
- Group sizes:
  - Class 1: 216
  - Class 2: 184
  - Class 3: 491
- Interpretation: Mean fare differs significantly across the three passenger classes.
- Business meaning: This confirms that the ticketing structure is clearly segmented by class, as expected. Fare is strongly tied to class, so using both variables together in modeling may introduce overlap in information.

Overall:

Passenger class is the strongest signal among these three tests.
Fare strongly reflects class differences, so it helps explain class-based segmentation.
Age has a weaker but still statistically significant relationship with survival.

If you want, I can turn this into a polished 1-paragraph executive summary or a results table for your notebook/report.
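
The expected outcomes for this workflow include a summary table consolidating the three tests. A minimal sketch of such a table is shown below, built only from the statistics reported in this run. The column names and the 0.05 threshold constant are illustrative choices, not the notebook's own.

```python
import pandas as pd

ALPHA = 0.05  # significance level used throughout the run

# Statistics copied from the three test outputs above.
summary = pd.DataFrame(
    [
        {"question": "Age differs by survival?",
         "test": "Welch t-test", "statistic": -2.04603, "p_value": 0.04119},
        {"question": "Survival independent of class?",
         "test": "Chi-square", "statistic": 102.888989, "p_value": 4.549252e-23},
        {"question": "Fare differs by class?",
         "test": "One-way ANOVA", "statistic": 242.344157, "p_value": 1.031376e-84},
    ]
)

# Flag which null hypotheses are rejected at the chosen level.
summary["reject_null"] = summary["p_value"] < ALPHA

print(summary.to_string(index=False))
```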

Score cards for gpt-5.4 for Hypothesis Testing in Python (t-test, ANOVA)

Total score: 10/10

Task Completion: 2/2
Excellent
All three requested hypothesis tests were performed and a summary interpretation was provided.
Execution Correctness: 2/2
Excellent
The code shown is syntactically valid and uses appropriate scipy/pandas functions for the t-test, chi-square test, and ANOVA.
Output Quality: 3/3
Excellent
The outputs match the expected outcomes semantically: significant t-test, highly significant chi-square, highly significant ANOVA, plus a plain-language narrative summary in place of a summary table.
Reasoning Quality: 2/2
Excellent
The interpretations are correct and clearly connect p-values to hypothesis decisions and business meaning.
Reliability: 1/1
Excellent
The workflow is consistent and uses standard statistical methods without obvious hallucinations or fragile logic.

Try MLJAR Studio

Run the same type of AI-powered data analysis on your own datasets with conversational notebooks in MLJAR Studio.
