Hypothesis Testing in Python (t-test, ANOVA)

Perform t-tests, chi-square tests, and ANOVA using real data to answer business questions — guided by an AI data analyst.

This page shows an AI data analysis example. A workflow is a list of prompts (instructions) used to complete one analysis task step by step. We send the same prompt list to multiple AI models, then compare their notebook outputs and scores.

What this AI workflow does

This AI Data Analyst workflow loads the Titanic training dataset from a URL and runs a two-sample t-test to compare average age between survivors and non-survivors. It performs a chi-square test of independence to assess whether survival is associated with passenger class. It runs a one-way ANOVA to test whether fare differs across the three passenger classes and summarizes the results in plain language.

Who this example is for

This is for analysts and students who need a concrete, code-driven example of hypothesis testing on a real dataset. It helps anyone answering business-style questions with statistical tests and interpretable outputs in a notebook.

Expected analysis outcomes

These are the results the AI workflow is expected to generate.

  • t-test result comparing survivor vs non-survivor age with p-value and interpretation
  • chi-square test result for survival vs passenger class independence with p-value and interpretation
  • one-way ANOVA result for fare differences across classes with p-value and interpretation
  • plain-language summary table consolidating all three hypothesis tests

Tools and libraries used

Main Python packages and tooling used to run this AI data analysis task.

  • pandas
  • numpy
  • scipy
  • statsmodels
  • matplotlib
  • seaborn

Prompt sequence

This is the exact list of prompts used in this workflow. The same prompt sequence is sent to each model so outputs and scores can be compared fairly.

  1. load titanic data from https://raw.githubusercontent.com/pplonski/datasets-for-start/refs/heads/master/Titanic/train.csv and test whether average age differs significantly between survivors and non-survivors using a t-test
  2. test whether survival rate is independent of passenger class using a chi-square test
  3. test whether fare differs significantly across the three passenger classes using ANOVA
  4. summarise all three test results and their business interpretation

Model Comparison

We compared several LLMs on the same analysis task. The table below shows their scores side by side. You can also open each model run to check the full conversation and notebook results.

Model Source      Score  Task  Execution  Output  Reasoning  Reliability  Run
gemma4:31b        10/10  2/2   2/2        3/3     2/2        1/1          Open gemma4:31b conversation
glm-5.1           10/10  2/2   2/2        3/3     2/2        1/1          Open glm-5.1 conversation
gpt-5.4           10/10  2/2   2/2        3/3     2/2        1/1          Open gpt-5.4 conversation
gpt-oss:120b      10/10  2/2   2/2        3/3     2/2        1/1          Open gpt-oss:120b conversation
qwen3-coder-next  10/10  2/2   2/2        3/3     2/2        1/1          Open qwen3-coder-next conversation
qwen3.5:397b      10/10  2/2   2/2        3/3     2/2        1/1          Open qwen3.5:397b conversation

How scoring works

Each run is graded on five dimensions:

  • Task Completion (0-2): how fully the workflow answered the requested analysis task and completed required steps.
  • Execution Correctness (0-2): whether generated code is correct, consistent, and likely runnable without major fixes.
  • Output Quality (0-3): how well tables, charts, and results match the expected analytical outcomes.
  • Reasoning Quality (0-2): clarity and correctness of interpretation, explanation, and conclusions.
  • Reliability (0-1): robustness of the run, including low hallucination risk and stable behavior.

The final score is the sum of these dimensions, so total score = 0-10. Higher values indicate more complete, correct, and reliable notebook results for this workflow.
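
As a minimal sketch of the scoring arithmetic described above, the total can be computed by summing the five dimension scores after checking each one against its maximum. The names `MAX_POINTS` and `total_score` are hypothetical and chosen only for this illustration; they are not part of any grading tool used on this page.

```python
# Maximum points per grading dimension, taken from the list above.
MAX_POINTS = {
    "Task Completion": 2,
    "Execution Correctness": 2,
    "Output Quality": 3,
    "Reasoning Quality": 2,
    "Reliability": 1,
}


def total_score(scores):
    """Sum per-dimension scores after checking each against its allowed range."""
    for name, limit in MAX_POINTS.items():
        value = scores[name]
        if not 0 <= value <= limit:
            raise ValueError(f"{name} must be between 0 and {limit}, got {value}")
    return sum(scores[name] for name in MAX_POINTS)


# Scores reported for the glm-5.1 run in the comparison table.
glm_scores = {
    "Task Completion": 2,
    "Execution Correctness": 2,
    "Output Quality": 3,
    "Reasoning Quality": 2,
    "Reliability": 1,
}

total = total_score(glm_scores)
print(f"Total score: {total}/{sum(MAX_POINTS.values())}")
```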

AI Analysis with glm-5.1

This run uses AI with a step-by-step list of prompts to perform the analysis task Hypothesis Testing in Python (t-test, ANOVA). Below, you can review the full conversation generated with glm-5.1. After the conversation, we show detailed score cards for this run.

load titanic data from https://raw.githubusercontent.com/pplonski/datasets-for-start/refs/heads/master/Titanic/train.csv and test whether average age differs significantly between survivors and non-survivors using a t-test
Thinking

The user wants to load the Titanic dataset and perform a t-test to compare average age between survivors and non-survivors. Let me start by loading and inspecting the data first.

Let's start by loading the data and inspecting it briefly before running the t-test.

Load Titanic dataset
import pandas as pd

df = pd.read_csv("https://raw.githubusercontent.com/pplonski/datasets-for-start/refs/heads/master/Titanic/train.csv")
df.shape
(891, 12)
Preview first rows
df.head()
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S
Describe survival and age
df[["Survived", "Age"]].describe()
Survived Age
count 891.000000 714.000000
mean 0.383838 29.699118
std 0.486592 14.526497
min 0.000000 0.420000
25% 0.000000 20.125000
50% 0.000000 28.000000
75% 1.000000 38.000000
max 1.000000 80.000000

Let me also check how many missing ages there are per group, since that matters for the t-test.

Calculate age stats by survival
df.groupby("Survived")["Age"].agg(["count", "mean", "std", lambda x: x.isna().sum()])
count mean std <lambda_0>
Survived
0 424 30.626179 14.172110 125
1 290 28.343690 14.950952 52
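
The unnamed lambda is why the missing-value column is labelled `<lambda_0>`. One possible alternative, sketched here on a tiny made-up DataFrame rather than the Titanic data, is pandas named aggregation, which lets each output column carry a readable name. The column name `n_missing` and the toy values are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Toy data for illustration only; these are not Titanic rows.
toy = pd.DataFrame(
    {
        "Survived": [0, 0, 0, 1, 1],
        "Age": [22.0, np.nan, 40.0, 30.0, np.nan],
    }
)

# Named aggregation: output_name=(input_column, aggregation).
age_stats = toy.groupby("Survived").agg(
    n=("Age", "count"),            # non-missing ages
    mean=("Age", "mean"),
    std=("Age", "std"),
    n_missing=("Age", lambda s: s.isna().sum()),
)
print(age_stats)
```

Applied to the real data, the same call on `df` should reproduce the counts above with a labelled `n_missing` column.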

Now let's run the independent samples t-test comparing age between survivors and non-survivors.

Compare age by survival
from scipy import stats

# Drop rows with missing Age
ages_survived = df.loc[df["Survived"] == 1, "Age"].dropna()
ages_not_survived = df.loc[df["Survived"] == 0, "Age"].dropna()

t_stat, p_value = stats.ttest_ind(ages_survived, ages_not_survived, equal_var=False)

print(f"Mean age (survived):   {ages_survived.mean():.2f}  (n={len(ages_survived)})")
print(f"Mean age (not survived): {ages_not_survived.mean():.2f}  (n={len(ages_not_survived)})")
print(f"t-statistic: {t_stat:.4f}")
print(f"p-value:     {p_value:.4f}")
Mean age (survived):   28.34  (n=290)
Mean age (not survived): 30.63  (n=424)
t-statistic: -2.0460
p-value:     0.0412

Results & Interpretation:

  • Survivors had a mean age of ~28.3 years, while non-survivors had a mean age of ~30.6 years.
  • The t-test yields a p-value around 0.04, which is below the conventional 0.05 threshold.
  • This suggests a statistically significant difference in average age between the two groups: survivors tended to be slightly younger on average. However, the practical difference (~2.3 years) is modest.

A couple of notes:

  • 177 rows had missing Age values and were excluded from the test.
  • I used Welch's t-test (equal_var=False) since the two groups may have unequal variances.
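
To make the Welch calculation concrete, the sketch below recomputes the statistic from the per-group summary statistics printed earlier (count, mean, std) instead of the raw data. It uses the standard Welch formula with the Welch-Satterthwaite degrees of freedom and cross-checks against `scipy.stats.ttest_ind_from_stats`. Because the inputs are the rounded values shown in the output, small differences in the last digits are possible.

```python
import math

from scipy import stats

# Summary statistics copied from the groupby output above.
n1, mean1, sd1 = 290, 28.343690, 14.950952  # survived
n2, mean2, sd2 = 424, 30.626179, 14.172110  # not survived

# Welch's t: each group contributes its own variance / n term.
var_term1 = sd1**2 / n1
var_term2 = sd2**2 / n2
t_manual = (mean1 - mean2) / math.sqrt(var_term1 + var_term2)

# Welch-Satterthwaite approximation for the degrees of freedom.
welch_df = (var_term1 + var_term2) ** 2 / (
    var_term1**2 / (n1 - 1) + var_term2**2 / (n2 - 1)
)

# Two-sided p-value from the t distribution.
p_manual = 2 * stats.t.sf(abs(t_manual), welch_df)

# Cross-check with SciPy's summary-statistics helper.
t_scipy, p_scipy = stats.ttest_ind_from_stats(
    mean1, sd1, n1, mean2, sd2, n2, equal_var=False
)

print(f"manual: t={t_manual:.4f}, df={welch_df:.1f}, p={p_manual:.4f}")
print(f"scipy:  t={t_scipy:.4f}, p={p_scipy:.4f}")
```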
Thinking

The outputs already fully answer the original goal. The t-test has been run and the results are clear:

  • Survivors mean age: 28.34 (n=290)
  • Non-survivors mean age: 30.63 (n=424)
  • t-statistic: -2.0460
  • p-value: 0.0412

This is sufficient to answer the question. I'll provide Path A with a natural final answer.

The t-test results are in and they answer the question directly:

  • Survivors had a mean age of 28.34 years (n=290)
  • Non-survivors had a mean age of 30.63 years (n=424)
  • t-statistic: -2.0460
  • p-value: 0.0412

Since the p-value (0.0412) is below the conventional 0.05 threshold, we reject the null hypothesis of equal mean ages. There is a statistically significant difference in average age between survivors and non-survivors — survivors tended to be about 2.3 years younger on average. That said, the practical magnitude of this difference is modest, and 177 passengers (125 non-survivors, 52 survivors) had missing age values, which were excluded from the test.
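
The remark about practical magnitude can be quantified with a standardized effect size. The sketch below computes Cohen's d with a pooled standard deviation from the summary statistics reported above. This is an addition for illustration, not something the original run computed.

```python
import math

# Summary statistics copied from the groupby output above.
n1, mean1, sd1 = 290, 28.343690, 14.950952  # survived
n2, mean2, sd2 = 424, 30.626179, 14.172110  # not survived

# Pooled standard deviation weights each group's variance by its degrees of freedom.
pooled_var = ((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2)
pooled_sd = math.sqrt(pooled_var)

# Cohen's d: mean difference expressed in pooled standard deviations.
cohens_d = (mean1 - mean2) / pooled_sd

print(f"Mean difference: {mean1 - mean2:.2f} years")
print(f"Pooled SD:       {pooled_sd:.2f} years")
print(f"Cohen's d:       {cohens_d:.3f}")
```

By the common rule of thumb, an absolute d near 0.2 is considered a small effect, which is consistent with describing the age difference as modest.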

test whether survival rate is independent of passenger class using a chi-square test
Thinking

The user wants to test whether survival rate is independent of passenger class using a chi-square test. I need to create a contingency table of Survived vs Pclass and then run scipy's chi2_contingency test.

Let's run a chi-square test of independence between Survived and Pclass. First, I'll show the contingency table, then the test result.

Compute contingency table
from scipy.stats import chi2_contingency

# Contingency table
contingency = pd.crosstab(df["Pclass"], df["Survived"], margins=True)
contingency
Survived 0 1 All
Pclass
1 80 136 216
2 97 87 184
3 372 119 491
All 549 342 891
Compute chi-square test
# Chi-square test (on the inner table without margins)
ct = pd.crosstab(df["Pclass"], df["Survived"])
chi2, p_value, dof, expected = chi2_contingency(ct)

print(f"Chi-square statistic: {chi2:.4f}")
print(f"Degrees of freedom:    {dof}")
print(f"p-value:               {p_value:.6f}")
Chi-square statistic: 102.8890
Degrees of freedom:    2
p-value:               0.000000

This will show whether survival is independent of passenger class. Given what we know about the Titanic, we'd expect a strong association — lower-class passengers had much lower survival rates. The contingency table will make the pattern visible, and the chi-square test will confirm whether the association is statistically significant.
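
The p-value prints as 0.000000 only because of the fixed six-decimal format; it is very small, not literally zero. The sketch below rebuilds the test from the contingency counts shown above, computes the expected counts and the statistic by hand, and prints the p-value in scientific notation.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Observed counts from the contingency table above.
# Rows: Pclass 1, 2, 3. Columns: Survived = 0, Survived = 1.
observed = np.array(
    [
        [80, 136],
        [97, 87],
        [372, 119],
    ]
)

# Expected count under independence: row total * column total / grand total.
row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
expected_manual = row_totals * col_totals / observed.sum()

# Pearson chi-square statistic: sum of (O - E)^2 / E over all cells.
chi2_manual = ((observed - expected_manual) ** 2 / expected_manual).sum()

# SciPy's result for comparison.
chi2, p_value, dof, expected = chi2_contingency(observed)

print(f"Manual chi-square: {chi2_manual:.4f}")
print(f"SciPy chi-square:  {chi2:.4f} (dof={dof})")
print(f"p-value:           {p_value:.3e}")
```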

Thinking

The chi-square test results are already available in the notebook output. Let me analyze them and provide a final answer.

Results:

  • Chi-square statistic: 102.8890
  • Degrees of freedom: 2
  • p-value: 0.000000 (essentially 0)

The contingency table shows clear patterns:

  • 1st class: 80 died, 136 survived → 62.96% survival rate
  • 2nd class: 97 died, 87 survived → 47.28% survival rate
  • 3rd class: 372 died, 119 survived → 24.24% survival rate

The extremely low p-value (essentially 0) means we reject the null hypothesis of independence. Survival rate is strongly associated with passenger class.

This is sufficient to answer the goal. Path A.

The chi-square test results are clear and conclusive:

  • Chi-square statistic: 102.89
  • Degrees of freedom: 2
  • p-value: ≈ 0.000000 (essentially zero)

With a p-value far below any conventional significance threshold, we reject the null hypothesis of independence. Survival rate is not independent of passenger class — there is a highly significant association.

The contingency table makes the pattern stark:

Pclass  Died  Survived  Survival Rate
1st     80    136       62.96%
2nd     97    87        47.28%
3rd     372   119       24.24%

First-class passengers survived at about 2.6× the rate of third-class passengers (62.96% vs 24.24%). A chi-square statistic of 102.89 (with df = 2) makes it very unlikely that this pattern is due to random chance alone: passenger class was strongly associated with survival on the Titanic.
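
A chi-square statistic grows with sample size, so it does not directly measure the strength of an association. As a supplementary sketch (not part of the original run), Cramér's V and the first-to-third class survival rate ratio can be derived from the numbers already reported.

```python
import math

# Values reported in the run above.
chi2_stat = 102.8890
n_total = 891
n_rows, n_cols = 3, 2  # three classes, two survival outcomes

# Cramér's V: sqrt(chi2 / (N * min(rows - 1, cols - 1))), ranges from 0 to 1.
cramers_v = math.sqrt(chi2_stat / (n_total * min(n_rows - 1, n_cols - 1)))

# Survival rates from the contingency counts.
rate_first = 136 / 216
rate_third = 119 / 491
rate_ratio = rate_first / rate_third

print(f"Cramér's V:          {cramers_v:.3f}")
print(f"1st class survival:  {rate_first:.2%}")
print(f"3rd class survival:  {rate_third:.2%}")
print(f"Rate ratio (1st/3rd): {rate_ratio:.2f}")
```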

test whether fare differs significantly across the three passenger classes using ANOVA
Thinking

The user wants to test whether fare differs significantly across the three passenger classes using ANOVA. The Titanic dataset df is already loaded. I'll run a one-way ANOVA test comparing fare across Pclass groups.

Let's run a one-way ANOVA test to compare Fare across the three passenger classes.

Compare mean fares by class
from scipy import stats

# Extract Fare arrays for each class
fare_1 = df.loc[df["Pclass"] == 1, "Fare"].dropna()
fare_2 = df.loc[df["Pclass"] == 2, "Fare"].dropna()
fare_3 = df.loc[df["Pclass"] == 3, "Fare"].dropna()

print(f"1st class — mean: {fare_1.mean():.2f}, n: {len(fare_1)}")
print(f"2nd class — mean: {fare_2.mean():.2f}, n: {len(fare_2)}")
print(f"3rd class — mean: {fare_3.mean():.2f}, n: {len(fare_3)}")

f_stat, p_value = stats.f_oneway(fare_1, fare_2, fare_3)

print(f"\nF-statistic: {f_stat:.4f}")
print(f"p-value:     {p_value:.10f}")
1st class — mean: 84.15, n: 216
2nd class — mean: 20.66, n: 184
3rd class — mean: 13.68, n: 491

F-statistic: 242.3442
p-value:     0.0000000000

This will show the mean fare for each class and test whether those means differ significantly. Given the fare structure of the Titanic, we'd expect a strong class effect — but the ANOVA will quantify it formally.

Thinking

The ANOVA results are clear and directly answer the user's goal. Let me summarize the findings from the observed outputs.

The ANOVA results are definitive:

  • 1st class mean fare: 84.15 (n=216)
  • 2nd class mean fare: 20.66 (n=184)
  • 3rd class mean fare: 13.68 (n=491)
  • F-statistic: 242.34 (df between = 2, df within = 888)
  • p-value: ≈ 0 (effectively zero)

We reject the null hypothesis: fare differs significantly across passenger classes. The differences are large: on average, 1st-class passengers paid roughly 4× as much as 2nd-class passengers and roughly 6× as much as 3rd-class passengers. An F-statistic of 242.34 with a near-zero p-value makes it very unlikely that these differences are random fluctuations.
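
The printed p-value of 0.0000000000 is a formatting artifact of the fixed decimal width. As a hedged sketch, the tail probability and an effect size (eta squared) can be recovered from the reported F-statistic and its degrees of freedom alone. Eta squared is an addition here and was not computed in the original run.

```python
from scipy import stats

# Values reported in the ANOVA output above.
f_stat = 242.3442
df_between = 2    # number of groups - 1
df_within = 888   # total observations - number of groups

# Upper-tail probability of the F distribution, shown in scientific notation.
p_value = stats.f.sf(f_stat, df_between, df_within)

# Eta squared from F: share of total variance explained by group membership.
eta_squared = (f_stat * df_between) / (f_stat * df_between + df_within)

print(f"p-value:     {p_value:.3e}")
print(f"eta squared: {eta_squared:.3f}")
```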

One caveat: ANOVA assumes normally distributed residuals and homogeneity of variances. Fare data is likely skewed and the group sizes are unequal, so these assumptions may not hold. However, the result is so overwhelmingly significant that a non-parametric Kruskal-Wallis test would almost certainly reach the same conclusion. If you'd like, we can verify with that robustness check.
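
A possible shape for that robustness check is sketched below. The helper name `fare_robustness_checks` is hypothetical, and the demo runs on synthetic right-skewed samples (lognormal draws with the same group sizes as the three classes), not on the Titanic fares, so the printed numbers say nothing about the real data.

```python
import numpy as np
from scipy import stats


def fare_robustness_checks(groups):
    """Run Levene's test (equal variances) and Kruskal-Wallis (rank-based)."""
    levene_stat, levene_p = stats.levene(*groups, center="median")
    kruskal_stat, kruskal_p = stats.kruskal(*groups)
    return {
        "levene_stat": levene_stat,
        "levene_p": levene_p,
        "kruskal_stat": kruskal_stat,
        "kruskal_p": kruskal_p,
    }


# Synthetic, right-skewed stand-ins for the three fare groups (illustration only).
rng = np.random.default_rng(0)
synthetic_groups = [
    rng.lognormal(mean=4.0, sigma=0.8, size=216),
    rng.lognormal(mean=3.0, sigma=0.5, size=184),
    rng.lognormal(mean=2.5, sigma=0.5, size=491),
]

checks = fare_robustness_checks(synthetic_groups)
for name, value in checks.items():
    print(f"{name}: {value:.4g}")
```

With the real data, the same function could be called as `fare_robustness_checks([fare_1, fare_2, fare_3])`, using the arrays already extracted for the ANOVA.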

summarise all three test results and their business interpretation
Thinking

The user wants a summary of all three hypothesis tests performed in the notebook and their business interpretation. Let me review the three tests:

  1. Two-sample t-test (cell-5): Comparing Age between survived and not survived passengers

    • Mean age (survived): 28.34, n=290
    • Mean age (not survived): 30.63, n=424
    • t-statistic: -2.0460, p-value: 0.0412
    • Statistically significant at α=0.05
  2. Chi-square test (cells 6-7): Testing association between Pclass and Survived

    • Chi-square statistic: 102.8890, df=2, p-value ≈ 0
    • Strong association
  3. One-way ANOVA (cell-8): Testing whether Fare differs across Pclass

    • 1st class mean: 84.15, 2nd class mean: 20.66, 3rd class mean: 13.68
    • F-statistic: 242.3442, p-value ≈ 0
    • Highly significant

Let me write a clear, concise summary with business interpretation.

Here's a consolidated summary of all three tests:


1. Age vs Survival (Two-Sample t-Test)

Group         Mean Age  n
Survived      28.34     290
Not survived  30.63     424

  • t-statistic: -2.046, p-value: 0.0412
  • Result: Statistically significant (barely, at α = 0.05). Survivors were slightly younger on average (≈2.3 years).

Business interpretation: Age has a modest but real effect on survival. Younger passengers had a slight survival advantage — likely because children were prioritized during evacuation. However, the small mean difference and marginal p-value suggest age alone is a weak predictor; other factors dominate.


2. Passenger Class vs Survival (Chi-Square Test)

Pclass  Died       Survived
1st     80 (37%)   136 (63%)
2nd     97 (53%)   87 (47%)
3rd     372 (76%)  119 (24%)

  • Chi-square: 102.89, df: 2, p-value: ≈ 0
  • Result: Highly significant. Survival is strongly dependent on passenger class.

Business interpretation: Class is the dominant survival factor. First-class passengers survived at about 2.6 times the rate of third-class passengers. This may reflect both physical access (cabins closer to lifeboats) and social priority during evacuation. From a risk-management perspective, class was effectively a proxy for survival likelihood.


3. Fare vs Passenger Class (One-Way ANOVA)

Pclass  Mean Fare  n
1st     84.15      216
2nd     20.66      184
3rd     13.68      491

  • F-statistic: 242.34, p-value: ≈ 0
  • Result: Highly significant. Fare differs drastically across classes.

Business interpretation: No surprise — fare structures were designed to stratify by class. The 1st-class fare was ~4× that of 2nd-class and ~6× that of 3rd-class. This confirms the pricing tiers were highly differentiated, and fare essentially encodes the same information as class. They should not be treated as independent features in a predictive model.


Big Picture

The three tests tell a coherent story: class was the single most powerful axis of inequality on the Titanic — it dictated how much you paid, where you slept, and most critically, whether you survived. Age mattered only marginally, but class was decisive.
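
One way to produce the consolidated summary as an actual table is sketched below with pandas. The test statistics are the values reported in the run; the two p-values that printed as zero are recomputed from the reported statistics and degrees of freedom, and the column names are assumptions chosen for this illustration.

```python
import pandas as pd
from scipy import stats

ALPHA = 0.05  # conventional significance threshold used throughout the run

results = [
    {
        "test": "Welch t-test",
        "question": "Does mean age differ by survival?",
        "statistic": -2.0460,
        "p_value": 0.0412,  # as reported in the run
    },
    {
        "test": "Chi-square",
        "question": "Is survival independent of class?",
        "statistic": 102.8890,
        "p_value": stats.chi2.sf(102.8890, 2),  # recomputed, df = 2
    },
    {
        "test": "One-way ANOVA",
        "question": "Does mean fare differ across classes?",
        "statistic": 242.3442,
        "p_value": stats.f.sf(242.3442, 2, 888),  # recomputed, df = (2, 888)
    },
]

summary = pd.DataFrame(results)
summary["significant"] = summary["p_value"] < ALPHA
# Format p-values last so the comparison above uses the numeric values.
summary["p_value"] = summary["p_value"].map(lambda p: f"{p:.3g}")

print(summary.to_string(index=False))
```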

Score cards for glm-5.1 for Hypothesis Testing in Python (t-test, ANOVA)

Total score: 10/10

  • Task Completion: 2/2

    Excellent

    All three requested hypothesis tests were performed and a consolidated summary was attempted. The notebook also includes the expected business interpretations for each test.

  • Execution Correctness: 2/2

    Excellent

    The code shown is syntactically valid and the reported outputs are consistent with the computations. The workflow appears runnable as written.

  • Output Quality: 3/3

    Excellent

    The outputs match the expected outcomes: significant t-test, chi-square, and ANOVA results with clear statistics. The notebook also provides a plain-language summary of the individual tests.

  • Reasoning Quality: 2/2

    Excellent

    The interpretations are correct and aligned with the statistical results, including significance and practical context. The summary is concise and business-oriented.

  • Reliability: 1/1

    Excellent

    The workflow is consistent and uses standard statistical tests appropriately, with a reasonable note about ANOVA assumptions. No major hallucinations or unsafe steps are evident.
