Titanic Survival Analysis in Python

Explore the Titanic dataset with survival rates by class, sex, and age, handle missing values, and visualize patterns using an AI data analyst.

This page shows an AI data analysis example. A workflow is a list of prompts (instructions) used to complete one analysis task step by step. We send the same prompt list to multiple AI models, then compare their notebook outputs and scores.

What this AI workflow does

This AI Data Analyst workflow loads the Titanic training dataset from a URL and computes the overall survival rate and dataset shape. It generates visual comparisons of survival rates by passenger class and sex, and plots age distributions for survivors versus non-survivors. It also audits missing values by column to identify fields that need cleaning or imputation.

Who this example is for

This is for analysts and students who want a guided, conversational EDA example on a well-known classification dataset. It helps anyone practicing data cleaning and basic demographic breakdowns with reproducible Python code and plots.

Expected analysis outcomes

These are the results the AI workflow is expected to generate.

  • Load the Titanic CSV from the provided URL and confirm shape (891, 12)
  • Compute overall survival rate (38.4%)
  • Create a grouped bar chart of survival rate by class and sex
  • Plot survivor vs non-survivor age distributions
  • Report missing values: Age 177, Cabin 687, Embarked 2
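
The expected outcomes above can be checked programmatically once the data is loaded. The sketch below is a minimal illustration, not part of the recorded workflow: the helper name `summarize` and the toy frame are assumptions made up for this example, so it runs without downloading the CSV.

```python
import pandas as pd


def summarize(df: pd.DataFrame) -> dict:
    """Collect the shape, survival rate (percent), and missing-value counts."""
    missing = df.isna().sum()
    return {
        "shape": df.shape,
        "survival_rate_percent": round(df["Survived"].mean() * 100, 1),
        # Keep only columns that actually contain missing values.
        "missing": missing[missing > 0].to_dict(),
    }


# Toy data for illustration only (NOT the Titanic dataset).
toy = pd.DataFrame({
    "Survived": [0, 1, 1, 0],
    "Age": [22.0, None, 26.0, 35.0],
    "Cabin": [None, "C85", None, None],
})

summary = summarize(toy)
print(summary)
```

Applied to the real training file, the same helper should report the values listed above: shape (891, 12), a survival rate of 38.4, and missing counts for Age, Cabin, and Embarked.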

Tools and libraries used

Main Python packages and tooling used to run this AI data analysis task.

  • pandas
  • numpy
  • matplotlib
  • seaborn
  • requests

Prompt sequence

This is the exact list of prompts used in this workflow. The same prompt sequence is sent to each model so outputs and scores can be compared fairly.

  1. load titanic dataset from https://raw.githubusercontent.com/pplonski/datasets-for-start/refs/heads/master/Titanic/train.csv and show survival rate overall
  2. plot survival rate by passenger class and sex
  3. show age distribution for survivors vs non-survivors
  4. how many missing values are there and which columns?

Model Comparison

We compared several LLMs on the same analysis task. The table below shows their scores side by side. You can also open each model run to check the full conversation and notebook results.

Model Source      Score  Task  Execution  Output  Reasoning  Reliability  Run
gemma4:31b        10/10  2/2   2/2        3/3     2/2        1/1          Open gemma4:31b conversation
glm-5.1           10/10  2/2   2/2        3/3     2/2        1/1          Open glm-5.1 conversation
gpt-5.4           10/10  2/2   2/2        3/3     2/2        1/1          Open gpt-5.4 conversation
gpt-oss:120b      10/10  2/2   2/2        3/3     2/2        1/1          Open gpt-oss:120b conversation
qwen3-coder-next  10/10  2/2   2/2        3/3     2/2        1/1          Open qwen3-coder-next conversation
qwen3.5:397b      10/10  2/2   2/2        3/3     2/2        1/1          Open qwen3.5:397b conversation

How scoring works

Each run is graded on five dimensions:

  • Task Completion (0-2): how fully the workflow answered the requested analysis task and completed required steps.
  • Execution Correctness (0-2): whether generated code is correct, consistent, and likely runnable without major fixes.
  • Output Quality (0-3): how well tables, charts, and results match the expected analytical outcomes.
  • Reasoning Quality (0-2): clarity and correctness of interpretation, explanation, and conclusions.
  • Reliability (0-1): robustness of the run, including low hallucination risk and stable behavior.

The final score is the sum of these dimensions, so total score = 0-10. Higher values indicate more complete, correct, and reliable notebook results for this workflow.
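
As a small illustration of this rubric, the sketch below sums the five dimension scores and rejects values outside each range. The names `MAX_POINTS` and `total_score` are hypothetical, chosen for this example; the page does not describe how the scoring is implemented.

```python
# Maximum points per dimension, taken from the rubric above.
MAX_POINTS = {
    "task_completion": 2,
    "execution_correctness": 2,
    "output_quality": 3,
    "reasoning_quality": 2,
    "reliability": 1,
}


def total_score(scores: dict) -> int:
    """Sum the dimension scores after checking each one is within its range."""
    for name, value in scores.items():
        if name not in MAX_POINTS:
            raise ValueError(f"unknown dimension: {name}")
        if not 0 <= value <= MAX_POINTS[name]:
            raise ValueError(f"{name} must be between 0 and {MAX_POINTS[name]}")
    return sum(scores.values())


# Dimension scores reported for the gpt-5.4 run in the comparison table.
example_scores = {
    "task_completion": 2,
    "execution_correctness": 2,
    "output_quality": 3,
    "reasoning_quality": 2,
    "reliability": 1,
}

print(total_score(example_scores), "out of", sum(MAX_POINTS.values()))
```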

AI Analysis with gpt-5.4

This run uses AI with a step-by-step list of prompts to perform the analysis task Titanic Survival Analysis in Python. Below, you can review the full conversation generated with gpt-5.4. After the conversation, we show detailed score cards for this run.

load titanic dataset from https://raw.githubusercontent.com/pplonski/datasets-for-start/refs/heads/master/Titanic/train.csv and show survival rate overall
Thinking

Calculating survival rate

First, I need to load and inspect the data before calculating the survival rate. I want to ensure I provide exactly one runnable code block. The code will read a CSV file and display its shape, columns, and the first few rows in sequence. I’ll avoid using any comma-separated expressions and ensure to end the code block with only one final bare expression. The plan involves using pandas to achieve this effectively.

Let’s start by loading the data and doing the required first inspection.

Load Titanic data
import pandas as pd
from IPython.display import display

url = "https://raw.githubusercontent.com/pplonski/datasets-for-start/refs/heads/master/Titanic/train.csv"
df = pd.read_csv(url)

display(pd.DataFrame({"rows": [df.shape[0]], "columns_count": [df.shape[1]]}))
display(pd.DataFrame({"column": df.columns}))

df.head()
   rows  columns_count
0  891   12

    column
0   PassengerId
1   Survived
2   Pclass
3   Name
4   Sex
5   Age
6   SibSp
7   Parch
8   Ticket
9   Fare
10  Cabin
11  Embarked

   PassengerId  Survived  Pclass  Name                                               Sex     Age   SibSp  Parch  Ticket            Fare     Cabin  Embarked
0  1            0         3       Braund, Mr. Owen Harris                            male    22.0  1      0      A/5 21171         7.2500   NaN    S
1  2            1         1       Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0  1      0      PC 17599          71.2833  C85    C
2  3            1         3       Heikkinen, Miss. Laina                             female  26.0  0      0      STON/O2. 3101282  7.9250   NaN    S
3  4            1         1       Futrelle, Mrs. Jacques Heath (Lily May Peel)       female  35.0  1      0      113803            53.1000  C123   S
4  5            0         3       Allen, Mr. William Henry                           male    35.0  0      0      373450            8.0500   NaN    S

The data is already loaded, and the target column Survived is present, so the only missing piece is the overall survival rate.

Calculate survival rate
survival_rate = df["Survived"].mean()

pd.DataFrame({
    "survival_rate": [survival_rate],
    "survival_rate_percent": [survival_rate * 100]
})
survival_rate survival_rate_percent
0 0.383838 38.383838

The dataset shown in the notebook has 891 rows and includes the Survived column, so it matches the Titanic training data you asked for.

From the computed result, the overall survival rate is 0.3838, which means 38.38% of passengers survived.

plot survival rate by passenger class and sex

Overall survival is 38.38%. Next, this bar chart shows survival rate by Pclass and Sex where the mean of Survived is the survival rate.

Plot survival rates by class
import seaborn as sns
import matplotlib.pyplot as plt

plt.figure(figsize=(8, 5))
ax = sns.barplot(data=df, x="Pclass", y="Survived", hue="Sex", errorbar=None)

ax.set_title("Survival Rate by Passenger Class and Sex")
ax.set_xlabel("Passenger Class")
ax.set_ylabel("Survival Rate")
ax.set_ylim(0, 1)

plt.legend(title="Sex")
plt.tight_layout()
plt.show()

The chart is in place, but the latest output is visual-only, so the missing evidence is the underlying survival rates by class and sex.

Compute survival rates
rate_by_class_sex = (
    df.groupby(["Pclass", "Sex"])["Survived"]
      .mean()
      .reset_index(name="survival_rate")
)

rate_by_class_sex
Pclass Sex survival_rate
0 1 female 0.968085
1 1 male 0.368852
2 2 female 0.921053
3 2 male 0.157407
4 3 female 0.500000
5 3 male 0.135447

The plot is supported by the computed survival rates, and the pattern is very clear:

  • Women had much higher survival rates than men in every passenger class.
  • First-class women had the highest survival rate at about 96.8%.
  • Second-class women were also very likely to survive at about 92.1%.
  • Third-class women had a lower survival rate than women in higher classes, but still much higher than men, at 50.0%.

For men:

  • First-class men survived at about 36.9%.
  • Second-class men survived at about 15.7%.
  • Third-class men had the lowest survival rate at about 13.5%.

So the chart answers the question directly: survival depended strongly on sex and also on passenger class, with the best outcomes for female passengers in higher classes and the worst outcomes for male passengers in lower classes.
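
The same rates can be laid out as a class-by-sex table to make the gap between women and men explicit. This is a minimal sketch and not part of the recorded run: it re-enters the six rates shown in the output above rather than reloading the dataset, and the column name `female_minus_male` is an assumption introduced here.

```python
import pandas as pd

# Rates copied from the notebook output above (as displayed).
rates = pd.DataFrame({
    "Pclass": [1, 1, 2, 2, 3, 3],
    "Sex": ["female", "male"] * 3,
    "survival_rate": [0.968085, 0.368852, 0.921053, 0.157407, 0.500000, 0.135447],
})

# One row per class, one column per sex.
wide = rates.pivot(index="Pclass", columns="Sex", values="survival_rate")

# Difference in survival rate between women and men within each class.
wide["female_minus_male"] = wide["female"] - wide["male"]

print(wide.round(3))
```

With these numbers the gap is widest in second class (about 76 percentage points) and narrowest in third class (about 36 points). On the full DataFrame, `df.pivot_table(index="Pclass", columns="Sex", values="Survived", aggfunc="mean")` should give the same layout directly.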

show age distribution for survivors vs non-survivors
Thinking

Preparing age distribution plot

I need to create a code block to show the age distribution for survivors versus non-survivors. First, I'll ensure the data is in the notebook state since it's already inspected. I want to create one chart only using seaborn. I'll likely drop any missing ages to handle that. I'll include concise explanations, relevant titles, and labels. Finally, I'll make sure it ends with plt.show(). Let's craft that brief text alongside the code.

Let’s compare the age distribution for survivors vs non-survivors with an overlaid histogram.

Plot age distribution by survival
age_df = df.dropna(subset=["Age"]).copy()

plt.figure(figsize=(8, 5))
sns.histplot(
    data=age_df,
    x="Age",
    hue="Survived",
    bins=30,
    stat="density",
    common_norm=False,
    element="step",
    fill=True,
    alpha=0.35
)

plt.title("Age Distribution: Survivors vs Non-Survivors")
plt.xlabel("Age")
plt.ylabel("Density")
plt.legend(title="Survived", labels=["No", "Yes"])
plt.tight_layout()
plt.show()

The histogram was created successfully, but the current output is only the figure object, so there isn’t enough visible evidence yet to describe the age distributions confidently. A compact summary of age by survival status will make the comparison concrete.

Summarize age by survival
age_summary = (
    age_df.groupby("Survived")["Age"]
    .describe()[["count", "mean", "std", "min", "25%", "50%", "75%", "max"]]
    .rename(index={0: "Non-survivors", 1: "Survivors"})
)

age_summary
Survived       count  mean       std        min   25%   50%   75%   max
Non-survivors  424.0  30.626179  14.172110  1.00  21.0  28.0  39.0  74.0
Survivors      290.0  28.343690  14.950952  0.42  19.0  28.0  36.0  80.0

The age distributions for survivors and non-survivors are fairly similar, but survivors skew a bit younger.

From the observed summary:

  • Non-survivors: mean age 30.6, median 28, middle 50% roughly 21 to 39
  • Survivors: mean age 28.3, median 28, middle 50% roughly 19 to 36

What this suggests:

  • The median age is the same for both groups at 28, so the typical passenger age was similar.
  • Survivors have a slightly lower average age, which indicates a modest shift toward younger ages among those who survived.
  • The central spread is also a bit lower for survivors, with their 25th and 75th percentiles both below those of non-survivors.
  • Both groups cover a wide age range, from infants to elderly passengers, so age alone did not fully separate survival outcomes.

So, based on the plotted distribution and the numeric summary, survivors were somewhat younger on average, but the two age distributions still overlap heavily.

how many missing values are there and which columns?
Thinking

Analyzing missing values in dataset

I need to respond to the user's prompt by using the current state of the notebook. I should take it step by step. The DataFrame is loaded, but I need to analyze the latest outputs to assess the original goal. The user specifically asked, "how many missing values are there and which columns?" I realize that missing values haven’t been computed yet. I’ll propose a single code block to check for these values, focusing on those greater than zero.

Let’s check missing values by column and keep only columns that actually have any missing data.

Summarize missing values
missing_summary = (
    df.isna()
      .sum()
      .rename("missing_count")
      .to_frame()
      .query("missing_count > 0")
      .sort_values("missing_count", ascending=False)
)

missing_summary
missing_count
Cabin 687
Age 177
Embarked 2

There are missing values in 3 columns:

  • Cabin: 687 missing
  • Age: 177 missing
  • Embarked: 2 missing

So most missing data is concentrated in Cabin, with a moderate amount in Age, and very little in Embarked.
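
These counts are easier to compare as shares of the 891 rows. The sketch below is an illustration added alongside the run, not part of it: it uses only the counts and row total reported above, and the variable names are assumptions for this example.

```python
# Row total and missing counts as reported in the notebook outputs above.
total_rows = 891
missing_counts = {"Cabin": 687, "Age": 177, "Embarked": 2}

# Share of rows with a missing value, in percent, rounded to one decimal.
missing_percent = {
    column: round(count / total_rows * 100, 1)
    for column, count in missing_counts.items()
}

for column, percent in missing_percent.items():
    print(f"{column}: {percent}% missing")
```

Roughly 77% of Cabin values and 20% of Age values are missing, while Embarked is nearly complete. On the loaded DataFrame, `df.isna().mean() * 100` should produce the same percentages for every column.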

Score cards for gpt-5.4 for Titanic Survival Analysis in Python

Total score: 10/10

  • Task Completion: 2/2

    Excellent

    All four requested analyses were completed: overall survival rate, survival by class and sex, age distribution by survival status, and missing values by column.

  • Execution Correctness: 2/2

    Excellent

    The code is coherent and likely runnable as shown, with valid pandas/seaborn/matplotlib usage and outputs consistent with the Titanic dataset.

  • Output Quality: 3/3

    Excellent

    Outputs match the expected outcomes semantically: 38.38% survival, 891x12 shape, correct missing-value counts, and the expected survival pattern by class and sex. The age comparison also correctly indicates survivors skew younger.

  • Reasoning Quality: 2/2

    Excellent

    The explanations are clear and grounded in the computed results, correctly interpreting the survival patterns and age distribution without obvious errors.

  • Reliability: 1/1

    Excellent

    The workflow is consistent and uses direct computations from the dataset, with no hallucinated methods or unsupported claims.
