Hypothesis Testing in Python (t-test, ANOVA)

Perform t-tests, chi-square tests, and ANOVA using real data to answer business questions — guided by an AI data analyst.

This page shows an AI data analysis example. A workflow is a list of prompts (instructions) used to complete one analysis task step by step. We send the same prompt list to multiple AI models, then compare their notebook outputs and scores.

What this AI workflow does

This AI Data Analyst workflow loads the Titanic training dataset from a URL and runs a two-sample t-test to compare average age between survivors and non-survivors. It performs a chi-square test of independence to assess whether survival is associated with passenger class. It runs a one-way ANOVA to test whether fare differs across the three passenger classes and summarizes the results in plain language.

Who this example is for

This is for analysts and students who need a concrete, code-driven example of hypothesis testing on a real dataset. It helps anyone answering business-style questions with statistical tests and interpretable outputs in a notebook.

Expected analysis outcomes

These are the results the AI workflow is expected to generate.

  • t-test result comparing survivor vs non-survivor age with p-value and interpretation
  • chi-square test result for survival vs passenger class independence with p-value and interpretation
  • one-way ANOVA result for fare differences across classes with p-value and interpretation
  • plain-language summary table consolidating all three hypothesis tests

Tools and libraries used

Main Python packages and tooling used to run this AI data analysis task.

  • pandas
  • numpy
  • scipy
  • statsmodels
  • matplotlib
  • seaborn

Prompt sequence

This is the exact list of prompts used in this workflow. The same prompt sequence is sent to each model so outputs and scores can be compared fairly.

  1. load titanic data from https://raw.githubusercontent.com/pplonski/datasets-for-start/refs/heads/master/Titanic/train.csv and test whether average age differs significantly between survivors and non-survivors using a t-test
  2. test whether survival rate is independent of passenger class using a chi-square test
  3. test whether fare differs significantly across the three passenger classes using ANOVA
  4. summarise all three test results and their business interpretation

Model Comparison

We compared several LLMs on the same analysis task. The table below shows their scores side by side. You can also open each model run to check the full conversation and notebook results.

Model             Score  Task  Execution  Output  Reasoning  Reliability  Run
gemma4:31b        10/10  2/2   2/2        3/3     2/2        1/1          Open gemma4:31b conversation
glm-5.1           10/10  2/2   2/2        3/3     2/2        1/1          Open glm-5.1 conversation
gpt-5.4           10/10  2/2   2/2        3/3     2/2        1/1          Open gpt-5.4 conversation
gpt-oss:120b      10/10  2/2   2/2        3/3     2/2        1/1          Open gpt-oss:120b conversation
qwen3-coder-next  10/10  2/2   2/2        3/3     2/2        1/1          Open qwen3-coder-next conversation
qwen3.5:397b      10/10  2/2   2/2        3/3     2/2        1/1          Open qwen3.5:397b conversation

How scoring works

Each run is graded on five dimensions:

  • Task Completion (0-2): how fully the workflow answered the requested analysis task and completed required steps.
  • Execution Correctness (0-2): whether generated code is correct, consistent, and likely runnable without major fixes.
  • Output Quality (0-3): how well tables, charts, and results match the expected analytical outcomes.
  • Reasoning Quality (0-2): clarity and correctness of interpretation, explanation, and conclusions.
  • Reliability (0-1): robustness of the run, including low hallucination risk and stable behavior.

The final score is the sum of these dimensions, so total score = 0-10. Higher values indicate more complete, correct, and reliable notebook results for this workflow.
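
The arithmetic behind the total can be sketched in a few lines of Python. This is only an illustration of the rubric described above; the names `MAX_POINTS` and `total_score` are hypothetical and do not come from any grading tool.

```python
# Maximum points per dimension, taken from the rubric above.
MAX_POINTS = {
    "task_completion": 2,
    "execution_correctness": 2,
    "output_quality": 3,
    "reasoning_quality": 2,
    "reliability": 1,
}


def total_score(scores: dict[str, int]) -> int:
    """Validate each dimension against its maximum and return the sum."""
    if set(scores) != set(MAX_POINTS):
        raise ValueError("scores must cover exactly the five rubric dimensions")
    for name, value in scores.items():
        if not 0 <= value <= MAX_POINTS[name]:
            raise ValueError(f"{name} must be between 0 and {MAX_POINTS[name]}")
    return sum(scores.values())


# A run that earns full marks on every dimension.
perfect_run = dict(MAX_POINTS)
print(total_score(perfect_run))
```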

AI Analysis with qwen3-coder-next

This run uses AI with a step-by-step list of prompts to perform the analysis task Hypothesis Testing in Python (t-test, ANOVA). Below, you can review the full conversation generated with qwen3-coder-next. After the conversation, we show detailed score cards for this run.

load titanic data from https://raw.githubusercontent.com/pplonski/datasets-for-start/refs/heads/master/Titanic/train.csv and test whether average age differs significantly between survivors and non-survivors using a t-test
Load and preview Titanic dataset
import pandas as pd

# Load the Titanic dataset
url = "https://raw.githubusercontent.com/pplonski/datasets-for-start/refs/heads/master/Titanic/train.csv"
df = pd.read_csv(url)

# Display basic info about the dataset
df.head()
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S

The dataset has been loaded successfully and shows the expected columns for the Titanic dataset, including Survived and Age. To test whether average age differs significantly between survivors and non-survivors, a t-test is needed. First, I'll check the distribution of Age values to ensure it's suitable for a t-test (i.e., enough valid data and not heavily skewed). A quick summary of age statistics by survival status is required.

Describe age distribution by survival status
# Check age distribution by survival status
df.groupby('Survived')['Age'].describe()
count mean std min 25% 50% 75% max
Survived
0 424.0 30.626179 14.172110 1.00 21.0 28.0 39.0 74.0
1 290.0 28.343690 14.950952 0.42 19.0 28.0 36.0 80.0

The output shows the descriptive statistics for Age grouped by survival status: non-survivors (0) have a mean age of 30.63, while survivors (1) have a mean age of 28.34. This suggests a small difference in average age, but it does not confirm statistical significance. To test whether this difference is statistically significant, a two-sample t-test is needed.

Compute t-test for age by survival status
from scipy import stats

# Extract age values for survivors and non-survivors, ignoring NaNs
age_non_survivors = df[df['Survived'] == 0]['Age'].dropna()
age_survivors = df[df['Survived'] == 1]['Age'].dropna()

# Perform Welch's two-sample t-test (equal_var=False does not assume equal variances)
t_stat, p_val = stats.ttest_ind(age_non_survivors, age_survivors, equal_var=False)

# Return t-statistic and p-value
t_stat, p_val
(np.float64(2.0460301043939704), np.float64(0.04118965162586641))

The two-sample t-test results show a t-statistic of 2.046 and a p-value of 0.0412. Since the p-value is less than the conventional threshold of 0.05, the difference in average age between non-survivors (30.63 years) and survivors (28.34 years) is statistically significant. This supports the conclusion that age differs significantly between survivors and non-survivors.
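
As a cross-check outside the recorded run, the same Welch t-test can be recomputed from the summary statistics in the groupby output, without reloading the data. The sketch below uses `scipy.stats.ttest_ind_from_stats` and adds Cohen's d as a rough effect size; the dictionary names are illustrative, and the effect-size step is an addition that the run itself did not perform.

```python
from scipy import stats

# Summary statistics copied from the groupby output above (Age by Survived).
non_survivors = {"mean": 30.626179, "std": 14.172110, "n": 424}
survivors = {"mean": 28.343690, "std": 14.950952, "n": 290}

# Welch's t-test from summary statistics (equal_var=False), matching the call in the run.
result = stats.ttest_ind_from_stats(
    non_survivors["mean"], non_survivors["std"], non_survivors["n"],
    survivors["mean"], survivors["std"], survivors["n"],
    equal_var=False,
)
t_stat, p_val = result.statistic, result.pvalue

# Cohen's d with a pooled standard deviation, as a rough effect-size measure.
pooled_var = (
    (non_survivors["n"] - 1) * non_survivors["std"] ** 2
    + (survivors["n"] - 1) * survivors["std"] ** 2
) / (non_survivors["n"] + survivors["n"] - 2)
cohens_d = (non_survivors["mean"] - survivors["mean"]) / pooled_var ** 0.5

print(f"t = {t_stat:.3f}, p = {p_val:.4f}, d = {cohens_d:.2f}")
```

Up to rounding of the inputs, this should agree with the t-statistic and p-value reported above. The small value of d suggests that the difference, while statistically significant, is modest in practical terms.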

test whether survival rate is independent of passenger class using a chi-square test

To test whether survival rate is independent of passenger class, we’ll use a chi-square test of independence on the contingency table of Survived vs Pclass.

Here’s the next step:

Test survival by class with chi-square
import pandas as pd
from scipy import stats

# Create a contingency table of Survived vs Pclass
contingency_table = pd.crosstab(df['Survived'], df['Pclass'])

# Perform chi-square test
chi2, p_val, dof, expected = stats.chi2_contingency(contingency_table)

# Display contingency table and test results
contingency_table, (chi2, p_val, dof)
(Pclass      1   2    3
 Survived              
 0          80  97  372
 1         136  87  119,
 (np.float64(102.88898875696056), np.float64(4.549251711298793e-23), 2))

The chi-square test output shows:

  • Contingency table:
    Pclass      1   2    3
    Survived              
    0          80  97  372
    1         136  87  119
    
  • Test statistic: chi2 = 102.89
  • Degrees of freedom: dof = 2
  • p-value: 4.55e-23

Since the p-value is far below any conventional significance level (e.g., 0.05), we reject the null hypothesis that survival is independent of passenger class. This indicates a very strong association between class and survival.
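
A p-value this small says the association is unlikely to be chance, but not how strong it is. As a hedged follow-up that was not part of the recorded run, the sketch below recomputes the test from the counts in the contingency table, then adds Cramér's V and the survival rate per class. Variable names are illustrative.

```python
import numpy as np
from scipy import stats

# Observed counts copied from the contingency table above.
# Rows: Survived = 0, 1; columns: Pclass = 1, 2, 3.
observed = np.array([[80, 97, 372],
                     [136, 87, 119]])

chi2, p_val, dof, expected = stats.chi2_contingency(observed)

# Cramér's V: chi-square scaled by sample size and the smaller table dimension.
n = observed.sum()
cramers_v = np.sqrt(chi2 / (n * (min(observed.shape) - 1)))

# Survival rate per class: survivors divided by the class total.
survival_rate = observed[1] / observed.sum(axis=0)

print(f"chi2 = {chi2:.2f}, dof = {dof}, Cramér's V = {cramers_v:.2f}")
print(dict(zip([1, 2, 3], survival_rate.round(3))))
```

The per-class rates make the direction of the association explicit, which the chi-square statistic alone does not show.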

test whether fare differs significantly across the three passenger classes using ANOVA

To test whether fare differs significantly across passenger classes, we’ll perform a one-way ANOVA using scipy.stats.f_oneway, grouping by Pclass.

Perform one-way ANOVA on fares by passenger class
from scipy import stats

# Group fares by passenger class
fare_class_1 = df[df['Pclass'] == 1]['Fare']
fare_class_2 = df[df['Pclass'] == 2]['Fare']
fare_class_3 = df[df['Pclass'] == 3]['Fare']

# Perform one-way ANOVA
f_stat, p_val = stats.f_oneway(fare_class_1, fare_class_2, fare_class_3)

(f_stat, p_val)
(np.float64(242.344156517448), np.float64(1.0313763209140291e-84))

The latest ANOVA output shows:

  • F-statistic = 242.34
  • p-value = 1.03e-84

This extremely small p-value (< 0.001) indicates strong evidence against the null hypothesis that mean fares are equal across all three passenger classes. Therefore, fare differs significantly across passenger classes.
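
One-way ANOVA assumes roughly equal variances and approximately normal residuals, which fare data may not satisfy. The sketch below, which is not part of the recorded run, shows two possible follow-ups: eta-squared recovered from the reported F-statistic, and a hypothetical helper `robustness_checks` that runs Levene's test and the Kruskal-Wallis test next to the ANOVA. The helper is demonstrated on synthetic data only; the degrees of freedom assume 3 classes and 891 passengers (the total of the contingency table earlier in the run).

```python
import numpy as np
from scipy import stats

# Eta-squared recovered from the reported F-statistic.
# Degrees of freedom assume k = 3 classes and N = 891 passengers.
f_reported = 242.344156517448
df_between, df_within = 3 - 1, 891 - 3
eta_squared = (f_reported * df_between) / (f_reported * df_between + df_within)


def robustness_checks(*groups):
    """Run Levene's test and Kruskal-Wallis alongside one-way ANOVA."""
    return {
        "anova_p": stats.f_oneway(*groups).pvalue,
        "levene_p": stats.levene(*groups).pvalue,    # equality of variances
        "kruskal_p": stats.kruskal(*groups).pvalue,  # rank-based alternative
    }


# Synthetic right-skewed "fares" for illustration only; these are NOT Titanic values.
rng = np.random.default_rng(0)
fake_groups = [rng.lognormal(mean=m, sigma=0.6, size=200) for m in (4.0, 3.0, 2.5)]
checks = robustness_checks(*fake_groups)

print(f"eta squared = {eta_squared:.3f}")
print({name: f"{p:.3g}" for name, p in checks.items()})
```

In the notebook, the three fare series from the ANOVA cell could be passed to `robustness_checks` in place of the synthetic groups.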

summarise all three test results and their business interpretation

Here's a concise summary of all three statistical tests and their business interpretation:

  1. Age & Survival (Two-sample t-test)

    • Result: t = 2.05, p = 0.041
    • Interpretation: Statistically significant difference in mean age between survivors and non-survivors. Survivors were about 2.3 years younger on average (28.34 vs 30.63 years), so the difference is small in practical terms.
  2. Survival vs Passenger Class (Chi-square test)

    • Result: χ² = 102.89, p ≈ 4.55×10⁻²³
    • Interpretation: Strong association between passenger class and survival. First-class passengers had much higher survival rates than third-class.
  3. Fare & Passenger Class (One-way ANOVA)

    • Result: F = 242.34, p ≈ 1.03×10⁻⁸⁴
    • Interpretation: Highly significant difference in mean fare across passenger classes, so class is a strong predictor of fare. This test does not involve survival; the link between class and survival comes from the chi-square test.

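Overall, passenger class was strongly associated with survival, fare differed sharply by class, and age showed a small but statistically significant difference between survivors and non-survivors. Fare was not tested against survival directly, so any fare-survival link is inferred through class. These patterns are a useful starting point for understanding historical risk or inequity in maritime disasters.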
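
The expected outcomes mention a summary table consolidating all three tests. A minimal sketch of such a table, built with pandas from the statistics reported in this run, might look like the following. The column names and the Bonferroni column are illustrative additions, not something the run produced.

```python
import pandas as pd

# Test statistics and p-values copied from the outputs above.
results = [
    ("Age vs survival", "Welch t-test", 2.0460301043939704, 0.04118965162586641),
    ("Survival vs class", "Chi-square", 102.88898875696056, 4.549251711298793e-23),
    ("Fare vs class", "One-way ANOVA", 242.344156517448, 1.0313763209140291e-84),
]
summary = pd.DataFrame(results, columns=["question", "test", "statistic", "p_value"])

ALPHA = 0.05  # conventional threshold used in the run
summary["significant"] = summary["p_value"] < ALPHA

# Bonferroni adjustment for three tests (an addition, not part of the run).
summary["p_bonferroni"] = (summary["p_value"] * len(summary)).clip(upper=1.0)
summary["significant_bonferroni"] = summary["p_bonferroni"] < ALPHA

print(summary.to_string(index=False))
```

Under the stricter Bonferroni rule the age result (0.041 × 3 ≈ 0.124) would no longer pass the 0.05 threshold, while the other two results are unaffected. Whether such a correction is appropriate depends on how the three questions are framed.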

Score cards for qwen3-coder-next for Hypothesis Testing in Python (t-test, ANOVA)

Total score: 10/10

  • Task Completion: 2/2

    Excellent

    All three required hypothesis tests were performed and a summary interpretation was provided.

  • Execution Correctness: 2/2

    Excellent

    The code is syntactically valid and the reported outputs are consistent with the shown computations.

  • Output Quality: 3/3

    Excellent

    The outputs match the expected outcomes semantically: significant t-test, chi-square, and ANOVA results, plus a plain-language summary table-like recap.

  • Reasoning Quality: 2/2

    Excellent

    The interpretations correctly connect p-values to the hypotheses and explain the business meaning of each test.

  • Reliability: 1/1

    Excellent

    The workflow is consistent and uses standard statistical tests appropriately on the Titanic dataset.
