Hypothesis Testing in Python (t-test, ANOVA)

Perform t-tests, chi-square tests, and ANOVA using real data to answer business questions — guided by an AI data analyst.

This page shows an AI data analysis example. A workflow is a list of prompts (instructions) used to complete one analysis task step by step. We send the same prompt list to multiple AI models, then compare their notebook outputs and scores.

What this AI workflow does

This AI Data Analyst workflow loads the Titanic training dataset from a URL and runs a two-sample t-test to compare average age between survivors and non-survivors. It performs a chi-square test of independence to assess whether survival is associated with passenger class. It runs a one-way ANOVA to test whether fare differs across the three passenger classes and summarizes the results in plain language.

Who this example is for

This is for analysts and students who need a concrete, code-driven example of hypothesis testing on a real dataset. It helps anyone answering business-style questions with statistical tests and interpretable outputs in a notebook.

Expected analysis outcomes

These are the results the AI workflow is expected to generate.

  • t-test result comparing survivor vs non-survivor age with p-value and interpretation
  • chi-square test result for survival vs passenger class independence with p-value and interpretation
  • one-way ANOVA result for fare differences across classes with p-value and interpretation
  • plain-language summary table consolidating all three hypothesis tests

Tools and libraries used

Main Python packages and tooling used to run this AI data analysis task.

  • pandas
  • numpy
  • scipy
  • statsmodels
  • matplotlib
  • seaborn

Prompt sequence

This is the exact list of prompts used in this workflow. The same prompt sequence is sent to each model so outputs and scores can be compared fairly.

  1. load titanic data from https://raw.githubusercontent.com/pplonski/datasets-for-start/refs/heads/master/Titanic/train.csv and test whether average age differs significantly between survivors and non-survivors using a t-test
  2. test whether survival rate is independent of passenger class using a chi-square test
  3. test whether fare differs significantly across the three passenger classes using ANOVA
  4. summarise all three test results and their business interpretation

Model Comparison

We compared several LLMs on the same analysis task. The table below shows their scores side by side. You can also open each model run to check the full conversation and notebook results.

| Model Source | Score | Task | Execution | Output | Reasoning | Reliability | Run |
|---|---|---|---|---|---|---|---|
| gemma4:31b | 10/10 | 2/2 | 2/2 | 3/3 | 2/2 | 1/1 | Open gemma4:31b conversation |
| glm-5.1 | 10/10 | 2/2 | 2/2 | 3/3 | 2/2 | 1/1 | Open glm-5.1 conversation |
| gpt-5.4 | 10/10 | 2/2 | 2/2 | 3/3 | 2/2 | 1/1 | Open gpt-5.4 conversation |
| gpt-oss:120b | 10/10 | 2/2 | 2/2 | 3/3 | 2/2 | 1/1 | Open gpt-oss:120b conversation |
| qwen3-coder-next | 10/10 | 2/2 | 2/2 | 3/3 | 2/2 | 1/1 | Open qwen3-coder-next conversation |
| qwen3.5:397b | 10/10 | 2/2 | 2/2 | 3/3 | 2/2 | 1/1 | Open qwen3.5:397b conversation |

How scoring works

Each run is graded on five dimensions:

  • Task Completion (0-2): how fully the workflow answered the requested analysis task and completed required steps.
  • Execution Correctness (0-2): whether generated code is correct, consistent, and likely runnable without major fixes.
  • Output Quality (0-3): how well tables, charts, and results match the expected analytical outcomes.
  • Reasoning Quality (0-2): clarity and correctness of interpretation, explanation, and conclusions.
  • Reliability (0-1): robustness of the run, including low hallucination risk and stable behavior.

The final score is the sum of these dimensions, so total score = 0-10. Higher values indicate more complete, correct, and reliable notebook results for this workflow.
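
The arithmetic behind the total can be sketched in a few lines. This is only an illustration of the rubric as described here; the names `MAX_POINTS` and `total_score` are hypothetical and do not come from the grading tool itself.

```python
# Maximum points per dimension, as listed in the rubric above.
MAX_POINTS = {
    "Task Completion": 2,
    "Execution Correctness": 2,
    "Output Quality": 3,
    "Reasoning Quality": 2,
    "Reliability": 1,
}


def total_score(scores):
    """Sum the per-dimension scores after checking each one is in range."""
    for name, value in scores.items():
        if not 0 <= value <= MAX_POINTS[name]:
            raise ValueError(f"{name} must be between 0 and {MAX_POINTS[name]}")
    return sum(scores.values())


# Example: a run that earns full marks on every dimension.
full_marks = dict(MAX_POINTS)
print(total_score(full_marks), "/", sum(MAX_POINTS.values()))
```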

AI Analysis with gpt-5.4

This run uses AI with a step-by-step list of prompts to perform the analysis task Hypothesis Testing in Python (t-test, ANOVA). Below, you can review the full conversation generated with gpt-5.4. After the conversation, we show detailed score cards for this run.

load titanic data from https://raw.githubusercontent.com/pplonski/datasets-for-start/refs/heads/master/Titanic/train.csv and test whether average age differs significantly between survivors and non-survivors using a t-test
Thinking

Considering data display

I’m thinking about how to display the shape of data. Showing it as a pd.DataFrame could work. While a shape tuple is fine, the special load inspection might require details like shape, columns, and head. It seems that I need to keep the call for the data variable short, so I’ll go with df. I need to ensure that I follow the guideline of either a single chart or table per step.

Let’s start by loading the Titanic training data and checking the structure before we run the t-test.

Load Titanic data
import pandas as pd
from IPython.display import display

url = "https://raw.githubusercontent.com/pplonski/datasets-for-start/refs/heads/master/Titanic/train.csv"
df = pd.read_csv(url)

display(pd.DataFrame({"rows": [df.shape[0]], "columns": [df.shape[1]]}))
display(pd.DataFrame({"column": df.columns}))
df.head()
rows columns
0 891 12
column
0 PassengerId
1 Survived
2 Pclass
3 Name
4 Sex
5 Age
6 SibSp
7 Parch
8 Ticket
9 Fare
10 Cabin
11 Embarked
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S

The data is loaded and the key columns for your test are present: Survived and Age. To answer whether mean age differs significantly between survivors and non-survivors, we still need the actual group means and the t-test result.

Compare age by survival
from scipy.stats import ttest_ind
import pandas as pd

age_survived = df.loc[df["Survived"] == 1, "Age"].dropna()
age_not_survived = df.loc[df["Survived"] == 0, "Age"].dropna()

result = pd.DataFrame({
    "group": ["Survived", "Did not survive"],
    "n": [age_survived.shape[0], age_not_survived.shape[0]],
    "mean_age": [age_survived.mean(), age_not_survived.mean()],
    "std_age": [age_survived.std(), age_not_survived.std()]
})

test = ttest_ind(age_survived, age_not_survived, equal_var=False)

display(result)
pd.DataFrame({
    "t_statistic": [test.statistic],
    "p_value": [test.pvalue]
})
group n mean_age std_age
0 Survived 290 28.343690 14.950952
1 Did not survive 424 30.626179 14.172110
t_statistic p_value
0 -2.04603 0.04119

Yes. Based on the observed outputs, the average age differs significantly between survivors and non-survivors at the 5% significance level.

From the data with non-missing ages:

  • Survivors: n = 290, mean age 28.34
  • Non-survivors: n = 424, mean age 30.63

Using Welch’s two-sample t-test:

  • t = -2.046
  • p = 0.04119

Because the p-value is below 0.05, we reject the null hypothesis of equal mean ages. In this sample, survivors were on average about 2.28 years younger than non-survivors.
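
As a cross-check, the Welch result above can be approximately reproduced from the reported group summaries alone, without reloading the data. The sketch below uses `scipy.stats.ttest_ind_from_stats` and also computes Cohen's d with a pooled standard deviation. The effect-size step is an addition for illustration and was not part of the original run.

```python
import math

from scipy.stats import ttest_ind_from_stats

# Summary statistics copied from the group table above
# (pandas .std() reports the sample standard deviation, ddof=1).
n_s, mean_s, std_s = 290, 28.343690, 14.950952  # survived
n_d, mean_d, std_d = 424, 30.626179, 14.172110  # did not survive

# Welch's t-test recomputed from summary statistics only.
welch = ttest_ind_from_stats(
    mean_s, std_s, n_s, mean_d, std_d, n_d, equal_var=False
)

# Cohen's d with a pooled standard deviation (added effect-size measure).
pooled_sd = math.sqrt(
    ((n_s - 1) * std_s**2 + (n_d - 1) * std_d**2) / (n_s + n_d - 2)
)
cohens_d = (mean_s - mean_d) / pooled_sd

print(f"t = {welch.statistic:.3f}, p = {welch.pvalue:.4f}, d = {cohens_d:.3f}")
```

Under these inputs the standardized difference is small, roughly 0.16 standard deviations, so the age gap is statistically significant but modest.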

test whether survival rate is independent of passenger class using a chi-square test
Thinking

Planning chi-square test

I need to craft an answer based on the user's request to test whether survival rate is independent of passenger class using a chi-square test. Since we have the DataFrame ready, I can compute the contingency table and chi-square in one step. I'll propose the next step with a single code block, making sure to include a display of the contingency table along with the chi-square test result.

I'll keep it concise, using markdown and avoiding unnecessary symbols.

Defining chi-square test output

I need to provide an answer to the user's question about testing independence between survival rate and passenger class. Since we're missing output, I can combine elements, but I should stick to displaying one table at a time per the rules.

It might make sense to compute the chi-square test and return a summary table with values like chi2, degrees of freedom, and p-value. Including the contingency table could be helpful, but that leads to multiple tables, which isn't preferred.

I could also include a brief interpretation after the code, focusing on conveying the main outcomes.

Crafting summary output

I see that if I can't display results directly, I can still hold the crosstab in a variable and create a summary as the final output. The user's goal seems to be about testing independence, so I can compute the contingency table and the chi-square test, then return a summary.

I'll create a summary DataFrame that includes observed counts but keep it simple and concise. I'll remember that the variable for passenger class is Pclass and ensure to craft clear and straightforward text around this.

Let’s run the chi-square test using Pclass and Survived.

Run chi square test
from scipy.stats import chi2_contingency
import pandas as pd

ct = pd.crosstab(df["Pclass"], df["Survived"])
chi2, p_value, dof, expected = chi2_contingency(ct)

pd.DataFrame({
    "chi2_statistic": [chi2],
    "degrees_of_freedom": [dof],
    "p_value": [p_value]
})
chi2_statistic degrees_of_freedom p_value
0 102.888989 2 4.549252e-23

The chi-square test shows that survival rate is not independent of passenger class.

Observed result:

  • $\chi^2 = 102.89$
  • degrees of freedom = 2
  • $p \approx 4.55 \times 10^{-23}$

Because the p-value is far below $0.05$, we reject the null hypothesis of independence. This means there is a strong association between Pclass and Survived in this Titanic dataset.

In plain terms, passenger class and survival were related: survival rates differed substantially across classes.
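
A p-value this small says the association is very unlikely to be chance, but not how strong it is. One common way to gauge strength is Cramér's V, sketched below from the reported statistic. It assumes all 891 rows enter the contingency table (no missing `Pclass` or `Survived` values), which the run does not state explicitly.

```python
import math

# Chi-square statistic reported above.
chi2_stat = 102.888989

# Assumption: every one of the 891 loaded rows is counted in the crosstab.
n = 891

# Table shape: three passenger classes by two survival outcomes.
n_rows, n_cols = 3, 2

# Cramér's V = sqrt(chi2 / (n * (min(rows, cols) - 1)))
cramers_v = math.sqrt(chi2_stat / (n * (min(n_rows, n_cols) - 1)))
print(f"Cramér's V = {cramers_v:.3f}")
```

The value comes out near 0.34, which is often read as a moderate association, although such thresholds are conventions rather than fixed rules.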

test whether fare differs significantly across the three passenger classes using ANOVA

We can test that with a one-way ANOVA using Fare as the numeric outcome and Pclass as the 3 groups.

Compare fares across classes
from scipy.stats import f_oneway
import pandas as pd

fare_1 = df.loc[df["Pclass"] == 1, "Fare"].dropna()
fare_2 = df.loc[df["Pclass"] == 2, "Fare"].dropna()
fare_3 = df.loc[df["Pclass"] == 3, "Fare"].dropna()

f_stat, p_value = f_oneway(fare_1, fare_2, fare_3)

pd.DataFrame({
    "F_statistic": [f_stat],
    "p_value": [p_value],
    "n_class_1": [fare_1.shape[0]],
    "n_class_2": [fare_2.shape[0]],
    "n_class_3": [fare_3.shape[0]]
})
F_statistic p_value n_class_1 n_class_2 n_class_3
0 242.344157 1.031376e-84 216 184 491

Yes, fare differs significantly across the three passenger classes.

From the ANOVA output:

  • F = 242.34
  • $p = 1.03 \times 10^{-84}$

Because the p-value is far below a typical significance level such as $0.05$, we reject the null hypothesis that mean fare is the same in all three classes.

This means passenger class is strongly associated with fare, and at least one class has a different average fare from the others. The result is statistically very strong given the class sizes:

  • Class 1: 216 passengers
  • Class 2: 184 passengers
  • Class 3: 491 passengers

So, based on the observed ANOVA result, fares are not equal across the three passenger classes.
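
To put a size on that difference, the share of fare variance explained by class (eta squared) can be recovered from the F statistic and the group sizes reported above. This is a sketch added for illustration; the original run did not compute an effect size.

```python
# Values reported in the ANOVA output above.
f_stat = 242.344157
group_sizes = [216, 184, 491]

df_between = len(group_sizes) - 1                # k - 1
df_within = sum(group_sizes) - len(group_sizes)  # N - k

# For a one-way ANOVA:
#   eta^2 = SS_between / SS_total
#         = (F * df_between) / (F * df_between + df_within)
eta_squared = (f_stat * df_between) / (f_stat * df_between + df_within)
print(f"eta squared = {eta_squared:.3f}")
```

With these inputs, passenger class accounts for roughly 35% of the variance in fare.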

summarise all three test results and their business interpretation
Thinking

Summarizing test results

I need to summarize the three test results and provide a business interpretation without using any coding. The response should be concise and conversational, maybe even using bullet points. I think I should discuss the t-test for age vs survival, the chi-square test for pclass vs survived, and ANOVA for fare vs pclass. For the business interpretation, we could relate this to passenger strategy for the Titanic, ensuring we avoid any currency symbols. I'll include the observed statistics and their meanings as well.

Here’s a concise summary of the three tests and what they mean in practical terms:

  • Age vs survival (Welch two-sample t-test)

    • Mean age of survivors: 28.34
    • Mean age of non-survivors: 30.63
    • Test result: t = -2.05, p = 0.041
    • Interpretation: There is a statistically significant difference in average age between survivors and non-survivors at the 5% level. Survivors were, on average, slightly younger.
    • Business meaning: Age appears to have some relationship with survival, but the difference is fairly small in absolute terms, so age alone is probably not a strong standalone driver for decision-making or prediction.
  • Passenger class vs survival (Chi-square test of independence)

    • Test result: $\chi^2 = 102.89$, $p = 4.55 \times 10^{-23}$
    • Interpretation: Survival and passenger class are not independent. Survival rates differ strongly by class.
    • Business meaning: Passenger class is a major factor associated with survival. This suggests socioeconomic or access-related advantages tied to higher classes, and it would likely be an important variable in any survival analysis or predictive model.
  • Fare vs passenger class (One-way ANOVA)

    • Test result: $F = 242.34$, $p = 1.03 \times 10^{-84}$
    • Group sizes:
      • Class 1: 216
      • Class 2: 184
      • Class 3: 491
    • Interpretation: Mean fare differs significantly across the three passenger classes.
    • Business meaning: This confirms that the ticketing structure is clearly segmented by class, as expected. Fare is strongly tied to class, so using both variables together in modeling may introduce overlap in information.

Overall:

  • Passenger class is the strongest signal among these three tests.
  • Fare strongly reflects class differences, so it helps explain class-based segmentation.
  • Age has a weaker but still statistically significant relationship with survival.

If you want, I can turn this into a polished 1-paragraph executive summary or a results table for your notebook/report.
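
A results table of the kind offered above could be assembled along these lines. The sketch copies the statistics from the three test outputs in this conversation; the column names and the `ALPHA` constant are illustrative choices, not part of the original run.

```python
import pandas as pd

ALPHA = 0.05  # significance level used throughout the run

# Statistics copied from the three test outputs in this conversation.
summary = pd.DataFrame(
    [
        {"question": "Age vs survival", "test": "Welch t-test",
         "statistic": -2.04603, "p_value": 0.04119},
        {"question": "Passenger class vs survival", "test": "Chi-square",
         "statistic": 102.888989, "p_value": 4.549252e-23},
        {"question": "Fare vs passenger class", "test": "One-way ANOVA",
         "statistic": 242.344157, "p_value": 1.031376e-84},
    ]
)

# Flag which null hypotheses are rejected at the chosen level.
summary["reject_null"] = summary["p_value"] < ALPHA
print(summary.to_string(index=False))
```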

Score cards for gpt-5.4 for Hypothesis Testing in Python (t-test, ANOVA)

Total score: 10/10

  • Task Completion: 2/2

    Excellent

    All three requested hypothesis tests were performed and a summary interpretation was provided.

  • Execution Correctness: 2/2

    Excellent

    The code shown is syntactically valid and uses appropriate scipy/pandas functions for the t-test, chi-square test, and ANOVA.

  • Output Quality: 3/3

    Excellent

    The outputs match the expected outcomes semantically: significant t-test, highly significant chi-square, highly significant ANOVA, plus a plain-language narrative summary that stands in for the summary table.

  • Reasoning Quality: 2/2

    Excellent

    The interpretations are correct and clearly connect p-values to hypothesis decisions and business meaning.

  • Reliability: 1/1

    Excellent

    The workflow is consistent and uses standard statistical methods without obvious hallucinations or fragile logic.
