Titanic Survival Analysis in Python

Explore the Titanic dataset with survival rates by class, sex, and age, handle missing values, and visualize patterns using an AI data analyst.

This page shows an AI data analysis example. A workflow is a list of prompts (instructions) used to complete one analysis task step by step. We send the same prompt list to multiple AI models, then compare their notebook outputs and scores.

What this AI workflow does

This AI Data Analyst workflow loads the Titanic training dataset from a URL and computes the overall survival rate and dataset shape. It generates visual comparisons of survival rates by passenger class and sex, and plots age distributions for survivors versus non-survivors. It also audits missing values by column to identify fields that need cleaning or imputation.

Who this example is for

This is for analysts and students who want a guided, conversational EDA example on a well-known classification dataset. It helps anyone practicing data cleaning and basic demographic breakdowns with reproducible Python code and plots.

Expected analysis outcomes

These are the results the AI workflow is expected to generate.

Load the Titanic CSV from the provided URL and confirm shape (891, 12)
Compute overall survival rate (38.4%)
Create a grouped bar chart of survival rate by class and sex
Plot survivor vs non-survivor age distributions
Report missing values: Age 177, Cabin 687, Embarked 2
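
The numeric expectations above (shape, survival rate, missing counts) can be collected in one pass. The following is a minimal sketch, not part of the workflow itself: the helper name `summarize` and the toy frame are hypothetical and only illustrate the output format.

```python
import pandas as pd


def summarize(df: pd.DataFrame) -> dict:
    """Collect the quantities the expected-outcomes list refers to.

    Hypothetical helper: assumes a 0/1 column named "Survived".
    """
    missing = df.isna().sum()
    return {
        "shape": df.shape,
        "survival_rate_pct": round(df["Survived"].mean() * 100, 1),
        # Keep only columns that actually contain missing values.
        "missing": missing[missing > 0].to_dict(),
    }


# Toy frame with made-up rows, used only to show the structure of the result.
toy = pd.DataFrame({
    "Survived": [0, 1, 1, 0],
    "Age": [22.0, None, 26.0, 35.0],
    "Cabin": [None, "C85", None, "C123"],
})

result = summarize(toy)
print(result)
```

Applied to the full training file, the same helper would be expected to report the shape, rate, and missing counts listed above.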

Tools and libraries used

Main Python packages and tooling used to run this AI data analysis task.

pandas
numpy
matplotlib
seaborn
requests

Prompt sequence

This is the exact list of prompts used in this workflow. The same prompt sequence is sent to each model so outputs and scores can be compared fairly.

1. load titanic dataset from https://raw.githubusercontent.com/pplonski/datasets-for-start/refs/heads/master/Titanic/train.csv and show survival rate overall
2. plot survival rate by passenger class and sex
3. show age distribution for survivors vs non-survivors
4. how many missing values are there and which columns?

Model Comparison

We compared several LLMs on the same analysis task. The table below shows their scores side by side. You can also open each model run to check the full conversation and notebook results.

Model Source	Score	Task	Execution	Output	Reasoning	Reliability	Run
gemma4:31b	10/10	2/2	2/2	3/3	2/2	1/1	Open gemma4:31b conversation
glm-5.1	10/10	2/2	2/2	3/3	2/2	1/1	Open glm-5.1 conversation
gpt-5.4	10/10	2/2	2/2	3/3	2/2	1/1	Open gpt-5.4 conversation
gpt-oss:120b	10/10	2/2	2/2	3/3	2/2	1/1	Open gpt-oss:120b conversation
qwen3-coder-next	10/10	2/2	2/2	3/3	2/2	1/1	Open qwen3-coder-next conversation
qwen3.5:397b	10/10	2/2	2/2	3/3	2/2	1/1	Open qwen3.5:397b conversation

How scoring works

Each run is graded on five dimensions:

Task Completion (0-2): how fully the workflow answered the requested analysis task and completed required steps.
Execution Correctness (0-2): whether generated code is correct, consistent, and likely runnable without major fixes.
Output Quality (0-3): how well tables, charts, and results match the expected analytical outcomes.
Reasoning Quality (0-2): clarity and correctness of interpretation, explanation, and conclusions.
Reliability (0-1): robustness of the run, including low hallucination risk and stable behavior.

The final score is the sum of these dimensions, so total score = 0-10. Higher values indicate more complete, correct, and reliable notebook results for this workflow.
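
As a rough illustration of the arithmetic, the total can be computed by checking each dimension against its maximum and summing. This is a sketch under assumptions: the dictionary keys and the function name `total_score` are hypothetical, and only the point ranges come from the rubric above.

```python
# Maximum points per dimension, taken from the rubric above.
# The short key names are hypothetical labels for this sketch.
MAX_POINTS = {
    "task": 2,
    "execution": 2,
    "output": 3,
    "reasoning": 2,
    "reliability": 1,
}


def total_score(scores: dict) -> int:
    """Validate each dimension against its range, then sum them."""
    for name, cap in MAX_POINTS.items():
        value = scores[name]
        if not 0 <= value <= cap:
            raise ValueError(f"{name} must be between 0 and {cap}, got {value}")
    return sum(scores[name] for name in MAX_POINTS)


# A run that earns full marks on every dimension.
perfect_run = dict(MAX_POINTS)
print(total_score(perfect_run))  # 10
```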

AI Analysis with gpt-5.4

In this run, gpt-5.4 works through the prompt list step by step to complete the Titanic Survival Analysis in Python task. Below, you can review the full conversation generated with gpt-5.4. After the conversation, we show detailed score cards for this run.

load titanic dataset from https://raw.githubusercontent.com/pplonski/datasets-for-start/refs/heads/master/Titanic/train.csv and show survival rate overall

Thinking

Calculating survival rate

First, I need to load and inspect the data before calculating the survival rate. I want to ensure I provide exactly one runnable code block. The code will read a CSV file and display its shape, columns, and the first few rows in sequence. I’ll avoid using any comma-separated expressions and ensure to end the code block with only one final bare expression. The plan involves using pandas to achieve this effectively.

Let’s start by loading the data and doing the required first inspection.

Load Titanic data

import pandas as pd
from IPython.display import display

url = "https://raw.githubusercontent.com/pplonski/datasets-for-start/refs/heads/master/Titanic/train.csv"
df = pd.read_csv(url)

display(pd.DataFrame({"rows": [df.shape[0]], "columns_count": [df.shape[1]]}))
display(pd.DataFrame({"column": df.columns}))

df.head()

	rows	columns_count
0	891	12

	column
0	PassengerId
1	Survived
2	Pclass
3	Name
4	Sex
5	Age
6	SibSp
7	Parch
8	Ticket
9	Fare
10	Cabin
11	Embarked

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked
0	1	0	3	Braund, Mr. Owen Harris	male	22.0	1	0	A/5 21171	7.2500	NaN	S
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.0	1	0	PC 17599	71.2833	C85	C
2	3	1	3	Heikkinen, Miss. Laina	female	26.0	0	0	STON/O2. 3101282	7.9250	NaN	S
3	4	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35.0	1	0	113803	53.1000	C123	S
4	5	0	3	Allen, Mr. William Henry	male	35.0	0	0	373450	8.0500	NaN	S

The data is already loaded, and the target column Survived is present, so the only missing piece is the overall survival rate.

Calculate survival rate

survival_rate = df["Survived"].mean()

pd.DataFrame({
    "survival_rate": [survival_rate],
    "survival_rate_percent": [survival_rate * 100]
})

	survival_rate	survival_rate_percent
0	0.383838	38.383838

The dataset shown in the notebook has 891 rows and includes the Survived column, so it matches the Titanic training data you asked for.

From the computed result, the overall survival rate is 0.3838, which means 38.38% of passengers survived.
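
The reason `.mean()` works here is that `Survived` is coded as 0/1, so the mean equals the share of ones. A minimal sketch of that equivalence, using a made-up five-row series rather than the real data:

```python
import pandas as pd

# Hypothetical 0/1 outcomes (not real passengers): 3 survivors out of 5.
survived = pd.Series([0, 1, 1, 1, 0], name="Survived")

# The mean of a 0/1 column is the fraction of ones.
rate_from_mean = survived.mean()
rate_from_counts = survived.sum() / len(survived)

# value_counts(normalize=True) gives the same share, per outcome.
shares = survived.value_counts(normalize=True)

print(rate_from_mean, rate_from_counts)
print(shares)
```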

plot survival rate by passenger class and sex

Overall survival is 38.38%. Next, a bar chart shows the survival rate by Pclass and Sex. Because Survived is coded 0/1, the mean of Survived within each group is that group's survival rate.

Plot survival rates by class

import seaborn as sns
import matplotlib.pyplot as plt

plt.figure(figsize=(8, 5))
ax = sns.barplot(data=df, x="Pclass", y="Survived", hue="Sex", errorbar=None)

ax.set_title("Survival Rate by Passenger Class and Sex")
ax.set_xlabel("Passenger Class")
ax.set_ylabel("Survival Rate")
ax.set_ylim(0, 1)

plt.legend(title="Sex")
plt.tight_layout()
plt.show()

The chart is in place, but the latest output is visual-only, so the missing evidence is the underlying survival rates by class and sex.

Compute survival rates

rate_by_class_sex = (
    df.groupby(["Pclass", "Sex"])["Survived"]
      .mean()
      .reset_index(name="survival_rate")
)

rate_by_class_sex

	Pclass	Sex	survival_rate
0	1	female	0.968085
1	1	male	0.368852
2	2	female	0.921053
3	2	male	0.157407
4	3	female	0.500000
5	3	male	0.135447

The plot is supported by the computed survival rates, and the pattern is very clear:

Women had much higher survival rates than men in every passenger class.
First-class women had the highest survival rate at about 96.8%.
Second-class women were also very likely to survive at about 92.1%.
Third-class women had a lower survival rate than women in higher classes, but still much higher than men, at 50.0%.

For men:

First-class men survived at about 36.9%.
Second-class men survived at about 15.7%.
Third-class men had the lowest survival rate at about 13.5%.

So the chart answers the question directly: survival depended strongly on sex and also on passenger class, with the best outcomes for female passengers in higher classes and the worst outcomes for male passengers in lower classes.
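
To make the sex gap within each class explicit, the six rates reported in the table above can be reshaped into a class-by-sex grid. This is an illustrative sketch: the rates are copied from the notebook output, while the `gap` column name is an assumption introduced here.

```python
import pandas as pd

# Survival rates copied from the groupby output above.
rates = pd.DataFrame({
    "Pclass": [1, 1, 2, 2, 3, 3],
    "Sex": ["female", "male"] * 3,
    "survival_rate": [0.968085, 0.368852, 0.921053, 0.157407, 0.500000, 0.135447],
})

# One row per class, one column per sex.
wide = rates.pivot(index="Pclass", columns="Sex", values="survival_rate")

# Female-minus-male difference in survival rate within each class.
wide["gap"] = wide["female"] - wide["male"]

print(wide.round(3))
```

With these reported numbers, the gap is largest in second class (about 0.76) and smallest in third class (about 0.36).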

show age distribution for survivors vs non-survivors

Thinking

Preparing age distribution plot

I need to create a code block to show the age distribution for survivors versus non-survivors. First, I'll ensure the data is in the notebook state since it's already inspected. I want to create one chart only using seaborn. I'll likely drop any missing ages to handle that. I'll include concise explanations, relevant titles, and labels. Finally, I'll make sure it ends with plt.show(). Let's craft that brief text alongside the code.

Let’s compare the age distribution for survivors vs non-survivors with an overlaid histogram.

Plot age distribution by survival

age_df = df.dropna(subset=["Age"]).copy()

plt.figure(figsize=(8, 5))
sns.histplot(
    data=age_df,
    x="Age",
    hue="Survived",
    bins=30,
    stat="density",
    common_norm=False,
    element="step",
    fill=True,
    alpha=0.35
)

plt.title("Age Distribution: Survivors vs Non-Survivors")
plt.xlabel("Age")
plt.ylabel("Density")
plt.legend(title="Survived", labels=["No", "Yes"])
plt.tight_layout()
plt.show()
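
One caveat with the plot code above: relabeling legend entries by position depends on the order in which the plot artists were drawn, which does not necessarily match the hue order, so the "No"/"Yes" labels can end up swapped. A possible alternative, sketched here under assumptions, is to attach readable labels to the data before plotting. The `Outcome` column name and the toy rows are hypothetical.

```python
import pandas as pd

# Hypothetical toy rows standing in for the Titanic frame (not real records).
toy = pd.DataFrame({
    "Age": [22.0, 38.0, None, 35.0],
    "Survived": [0, 1, 1, 0],
})

# Drop rows without an age, then map the 0/1 code to an explicit label.
labelled = toy.dropna(subset=["Age"]).assign(
    Outcome=lambda d: d["Survived"].map({0: "No", 1: "Yes"})
)

print(labelled)
# The labelled frame could then be plotted with hue="Outcome", letting the
# plotting library build the legend from the values themselves.
```

Because the labels travel with the rows, the legend no longer depends on drawing order.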

The histogram was created successfully, but the current output is only the figure object, so there isn’t enough visible evidence yet to describe the age distributions confidently. A compact summary of age by survival status will make the comparison concrete.

Summarize age by survival

age_summary = (
    age_df.groupby("Survived")["Age"]
    .describe()[["count", "mean", "std", "min", "25%", "50%", "75%", "max"]]
    .rename(index={0: "Non-survivors", 1: "Survivors"})
)

age_summary

	count	mean	std	min	25%	50%	75%	max
Survived
Non-survivors	424.0	30.626179	14.172110	1.00	21.0	28.0	39.0	74.0
Survivors	290.0	28.343690	14.950952	0.42	19.0	28.0	36.0	80.0

The age distributions for survivors and non-survivors are fairly similar, but survivors skew a bit younger.

From the observed summary:

Non-survivors: mean age 30.6, median 28, middle 50% roughly 21 to 39
Survivors: mean age 28.3, median 28, middle 50% roughly 19 to 36

What this suggests:

The median age is the same for both groups at 28, so the typical passenger age was similar.
Survivors have a slightly lower average age, which indicates a modest shift toward younger ages among those who survived.
The middle 50% of ages also sits slightly lower for survivors: their 25th and 75th percentiles (19 and 36) are both below those of non-survivors (21 and 39).
Both groups cover a wide age range, from infants to elderly passengers, so age alone did not fully separate survival outcomes.

So, based on the plotted distribution and the numeric summary, survivors were somewhat younger on average, but the two age distributions still overlap heavily.
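
The figures in this summary can be cross-checked with a little arithmetic. The sketch below uses only numbers reported on this page (the per-group counts, means, and quartiles, plus the 891 rows and 177 missing ages); the variable names are hypothetical.

```python
import pandas as pd

# Values copied from the age summary table above.
summary = pd.DataFrame(
    {
        "count": [424, 290],
        "mean": [30.626179, 28.343690],
        "q25": [21.0, 19.0],
        "q75": [39.0, 36.0],
    },
    index=["Non-survivors", "Survivors"],
)

# Interquartile range: width of the middle 50% of ages in each group.
summary["iqr"] = summary["q75"] - summary["q25"]

# Passengers with a known age should equal total rows minus missing ages.
total_with_age = int(summary["count"].sum())
expected_with_age = 891 - 177

# Count-weighted mean age across both groups.
pooled_mean = (summary["count"] * summary["mean"]).sum() / total_with_age

print(summary)
print(total_with_age, expected_with_age, round(pooled_mean, 1))
```

The two counts agree at 714, and the pooled mean age comes out at about 29.7, between the two group means as expected.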

how many missing values are there and which columns?

Thinking

Analyzing missing values in dataset

I need to respond to the user's prompt by using the current state of the notebook. I should take it step by step. The DataFrame is loaded, but I need to analyze the latest outputs to assess the original goal. The user specifically asked, "how many missing values are there and which columns?" I realize that missing values haven’t been computed yet. I’ll propose a single code block to check for these values, focusing on those greater than zero.

Let’s check missing values by column and keep only columns that actually have any missing data.

Summarize missing values

missing_summary = (
    df.isna()
      .sum()
      .rename("missing_count")
      .to_frame()
      .query("missing_count > 0")
      .sort_values("missing_count", ascending=False)
)

missing_summary

	missing_count
Cabin	687
Age	177
Embarked	2

There are missing values in 3 columns:

Cabin: 687 missing
Age: 177 missing
Embarked: 2 missing

So most missing data is concentrated in Cabin, with a moderate amount in Age, and very little in Embarked.
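
Raw counts are easier to judge as a share of the 891 rows. The following sketch converts the three reported counts into percentages; the `missing_pct` column name is an assumption introduced for illustration.

```python
import pandas as pd

n_rows = 891  # dataset row count reported earlier on this page

# Missing counts copied from the notebook output above.
missing = pd.Series({"Cabin": 687, "Age": 177, "Embarked": 2}, name="missing_count")

report = missing.to_frame()
# Share of rows with a missing value in each column, as a percentage.
report["missing_pct"] = (report["missing_count"] / n_rows * 100).round(1)

print(report)
```

On these numbers, roughly 77% of Cabin values, 20% of Age values, and well under 1% of Embarked values are missing, which is one way to decide which fields are candidates for imputation and which may be too sparse to fill.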

Score cards for gpt-5.4 for Titanic Survival Analysis in Python

Total score: 10/10

Task Completion: 2/2
Excellent
All four requested analyses were completed: overall survival rate, survival by class and sex, age distribution by survival status, and missing values by column.
Execution Correctness: 2/2
Excellent
The code is coherent and likely runnable as shown, with valid pandas/seaborn/matplotlib usage and outputs consistent with the Titanic dataset.
Output Quality: 3/3
Excellent
Outputs match the expected outcomes semantically: 38.38% survival, 891x12 shape, correct missing-value counts, and the expected survival pattern by class and sex. The age comparison also correctly indicates survivors skew younger.
Reasoning Quality: 2/2
Excellent
The explanations are clear and grounded in the computed results, correctly interpreting the survival patterns and age distribution without obvious errors.
Reliability: 1/1
Excellent
The workflow is consistent and uses direct computations from the dataset, with no hallucinated methods or unsupported claims.

Try MLJAR Studio

Run the same type of AI-powered data analysis on your own datasets with conversational notebooks in MLJAR Studio.

Explore More AI Analysis Examples

Discover additional workflows across categories. Each example includes prompts, conversation outputs, and model-level scoring so you can compare approaches and results.

Exploratory Data Analysis

Time Series

Data Analysis

Machine Learning

NLP

Finance

Statistics