Titanic Survival Analysis in Python

Explore the Titanic dataset with survival rates by class, sex, and age, handle missing values, and visualize patterns using an AI data analyst.

This page shows an AI data analysis example. A workflow is a list of prompts (instructions) used to complete one analysis task step by step. We send the same prompt list to multiple AI models, then compare their notebook outputs and scores.

What this AI workflow does

This AI Data Analyst workflow loads the Titanic training dataset from a URL and computes the overall survival rate and dataset shape. It generates visual comparisons of survival rates by passenger class and sex, and plots age distributions for survivors versus non-survivors. It also audits missing values by column to identify fields that need cleaning or imputation.

Who this example is for

This is for analysts and students who want a guided, conversational EDA example on a well-known classification dataset. It helps anyone practicing data cleaning and basic demographic breakdowns with reproducible Python code and plots.

Expected analysis outcomes

These are the results the AI workflow is expected to generate.

Load the Titanic CSV from the provided URL and confirm shape (891, 12)
Compute overall survival rate (38.4%)
Create a grouped bar chart of survival rate by class and sex
Plot survivor vs non-survivor age distributions
Report missing values: Age 177, Cabin 687, Embarked 2

Tools and libraries used

Main Python packages and tooling used to run this AI data analysis task.

pandas
numpy
matplotlib
seaborn
requests

Prompt sequence

This is the exact list of prompts used in this workflow. The same prompt sequence is sent to each model so outputs and scores can be compared fairly.

1. load titanic dataset from https://raw.githubusercontent.com/pplonski/datasets-for-start/refs/heads/master/Titanic/train.csv and show survival rate overall
2. plot survival rate by passenger class and sex
3. show age distribution for survivors vs non-survivors
4. how many missing values are there and which columns?

Model Comparison

We compared several LLMs on the same analysis task. The table below shows their scores side by side. You can also open each model run to check the full conversation and notebook results.

Model Source	Score	Task	Execution	Output	Reasoning	Reliability	Run
gemma4:31b	10/10	2/2	2/2	3/3	2/2	1/1	Open gemma4:31b conversation
glm-5.1	10/10	2/2	2/2	3/3	2/2	1/1	Open glm-5.1 conversation
gpt-5.4	10/10	2/2	2/2	3/3	2/2	1/1	Open gpt-5.4 conversation
gpt-oss:120b	10/10	2/2	2/2	3/3	2/2	1/1	Open gpt-oss:120b conversation
qwen3-coder-next	10/10	2/2	2/2	3/3	2/2	1/1	Open qwen3-coder-next conversation
qwen3.5:397b	10/10	2/2	2/2	3/3	2/2	1/1	Open qwen3.5:397b conversation

How scoring works

Each run is graded on five dimensions:

Task Completion (0-2): how fully the workflow answered the requested analysis task and completed required steps.
Execution Correctness (0-2): whether generated code is correct, consistent, and likely runnable without major fixes.
Output Quality (0-3): how well tables, charts, and results match the expected analytical outcomes.
Reasoning Quality (0-2): clarity and correctness of interpretation, explanation, and conclusions.
Reliability (0-1): robustness of the run, including low hallucination risk and stable behavior.

The final score is the sum of these dimensions, so total score = 0-10. Higher values indicate more complete, correct, and reliable notebook results for this workflow.
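As a minimal sketch of the arithmetic described above, the total can be computed by summing the five dimension scores after checking each against its maximum. The dictionary keys and the `total_score` helper are illustrative names, not part of the grading tool.

```python
# Maximum points per dimension, taken from the scoring description above.
# The key names are hypothetical labels chosen for this sketch.
MAX_POINTS = {
    "task_completion": 2,
    "execution_correctness": 2,
    "output_quality": 3,
    "reasoning_quality": 2,
    "reliability": 1,
}


def total_score(scores):
    """Sum the five dimension scores after validating their ranges."""
    if set(scores) != set(MAX_POINTS):
        raise ValueError("scores must contain exactly the five dimensions")
    for name, value in scores.items():
        if not 0 <= value <= MAX_POINTS[name]:
            raise ValueError(f"{name} must be between 0 and {MAX_POINTS[name]}")
    return sum(scores.values())


# A run that earns the maximum in every dimension totals 10.
perfect_run = dict(MAX_POINTS)
print(total_score(perfect_run))
```

The maxima add up to 2 + 2 + 3 + 2 + 1 = 10, which is why every row in the comparison table reads 10/10 when all dimensions are at their maximum.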

AI Analysis with qwen3-coder-next

This run uses AI with a step-by-step list of prompts to perform the analysis task Titanic Survival Analysis in Python. Below, you can review the full conversation generated with qwen3-coder-next. After the conversation, we show detailed score cards for this run.

load titanic dataset from https://raw.githubusercontent.com/pplonski/datasets-for-start/refs/heads/master/Titanic/train.csv and show survival rate overall

Load Titanic dataset and display basic info

import pandas as pd

# Load the Titanic dataset
url = "https://raw.githubusercontent.com/pplonski/datasets-for-start/refs/heads/master/Titanic/train.csv"
df = pd.read_csv(url)

# Display the dataset shape and the first rows (survival rate is computed in the next cell)
print("Dataset shape:", df.shape)
print("\nFirst 5 rows:")
df.head()

Dataset shape: (891, 12)

First 5 rows:

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked
0	1	0	3	Braund, Mr. Owen Harris	male	22.0	1	0	A/5 21171	7.2500	NaN	S
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.0	1	0	PC 17599	71.2833	C85	C
2	3	1	3	Heikkinen, Miss. Laina	female	26.0	0	0	STON/O2. 3101282	7.9250	NaN	S
3	4	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35.0	1	0	113803	53.1000	C123	S
4	5	0	3	Allen, Mr. William Henry	male	35.0	0	0	373450	8.0500	NaN	S

The dataset is loaded successfully with 891 rows and 12 columns, including the Survived column (0 = did not survive, 1 = survived). The first few rows confirm the structure: PassengerId, Survived, Pclass, Name, Sex, Age, etc.
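A load step like this can be followed by a quick structural check before any statistics are computed. The sketch below is one possible way to do that; `check_titanic_frame` is a hypothetical helper, and the two-row inline CSV is a stand-in so the example runs without network access. The column names are the twelve reported by the run.

```python
import io

import pandas as pd

# Column names as reported in the run's missing-value listing.
EXPECTED_COLUMNS = [
    "PassengerId", "Survived", "Pclass", "Name", "Sex", "Age",
    "SibSp", "Parch", "Ticket", "Fare", "Cabin", "Embarked",
]


def check_titanic_frame(df, expected_rows=None):
    """Hypothetical helper: raise ValueError if the frame looks wrong."""
    absent = [col for col in EXPECTED_COLUMNS if col not in df.columns]
    if absent:
        raise ValueError(f"missing columns: {absent}")
    if expected_rows is not None and len(df) != expected_rows:
        raise ValueError(f"expected {expected_rows} rows, got {len(df)}")
    if not df["Survived"].isin([0, 1]).all():
        raise ValueError("Survived must contain only 0 and 1")
    return df


# Two-row stand-in for the real file so the sketch runs offline.
sample_csv = io.StringIO(
    "PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked\n"
    '1,0,3,"Braund, Mr. Owen Harris",male,22,1,0,A/5 21171,7.25,,S\n'
    '3,1,3,"Heikkinen, Miss. Laina",female,26,0,0,STON/O2. 3101282,7.925,,S\n'
)
sample = check_titanic_frame(pd.read_csv(sample_csv), expected_rows=2)
print(sample.shape)
```

With the real file, the same call would presumably be `check_titanic_frame(pd.read_csv(url), expected_rows=891)`.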

To calculate and show the overall survival rate (proportion of passengers who survived), I need to compute the mean of the Survived column.

Calculate survival rate

survival_rate = df['Survived'].mean()
survival_rate

np.float64(0.3838383838383838)

The overall survival rate is 38.38%, meaning about 38 out of every 100 passengers in this training set survived. This is the mean of the Survived column (0 = died, 1 = survived) across the 891 records.

The overall survival rate in the Titanic training dataset is 38.38%.
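The reason the mean works here is that averaging a 0/1 column is the same as dividing the number of ones by the number of rows. The sketch below illustrates this on a toy series (made-up values, not the real data) and then applies the identity to the figures reported in the run.

```python
import pandas as pd

# Toy stand-in for the Survived column: 1 = survived, 0 = died.
survived = pd.Series([0, 1, 1, 0, 0, 1, 0, 0], name="Survived")

# The mean of a 0/1 column equals survivors divided by total rows.
rate_from_mean = survived.mean()
rate_from_counts = (survived == 1).sum() / len(survived)
assert rate_from_mean == rate_from_counts

print(f"{rate_from_mean:.1%}")

# Using the reported rate and row count, the implied number of survivors
# in the training set is rate * rows.
implied_survivors = round(0.3838383838383838 * 891)
print(implied_survivors)
```

On the toy series the rate is 3 out of 8, or 37.5%. For the reported figures, 0.3838 of 891 rows corresponds to 342 survivors.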

plot survival rate by passenger class and sex

Let's visualize the survival rate by passenger class (Pclass) and sex (Sex). I'll create a grouped bar chart using seaborn, where the x-axis represents passenger class, the bars are colored by sex, and the height represents the survival rate.

First, I'll compute the survival rate for each group:

Compute survival rate by Pclass and Sex

import pandas as pd

# Compute survival rate by Pclass and Sex
grouped = df.groupby(['Pclass', 'Sex'])['Survived'].mean().unstack()

# Display the computed rates as a table for reference
grouped

Sex	female	male
Pclass
1	0.968085	0.368852
2	0.921053	0.157407
3	0.500000	0.135447

From the computed survival rates by Pclass and Sex:

First class: 96.8% of females and 36.9% of males survived.
Second class: 92.1% of females and 15.7% of males survived.
Third class: 50.0% of females and 13.5% of males survived.

This clearly shows both a class and gender effect: women had much higher survival rates than men in every class, and higher class correlated with higher survival, especially for women.

The goal is to plot survival rate by passenger class and sex. The underlying values are now available, so I’ll generate a clear grouped bar chart using seaborn.

Plot survival rates by class and sex

import seaborn as sns
import matplotlib.pyplot as plt

# Prepare data for plotting (long format)
plot_data = grouped.reset_index().melt(id_vars='Pclass', var_name='Sex', value_name='SurvivalRate')

# Create grouped bar chart
plt.figure(figsize=(8, 5))
sns.barplot(data=plot_data, x='Pclass', y='SurvivalRate', hue='Sex')
plt.title('Survival Rate by Passenger Class and Sex')
plt.xlabel('Passenger Class')
plt.ylabel('Survival Rate')
plt.ylim(0, 1)
plt.show()

The bar chart visualizes survival rate by passenger class and sex. Survival is highest for first-class females (≈96.8%) and lowest for third-class males (≈13.5%). Females survived at higher rates than males in every class, though the size of the gap varies (about 60, 76, and 36 percentage points in first, second, and third class), and survival declines with lower class for both sexes.
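Group means like these are easier to judge when the group sizes are shown next to them, since a rate from a handful of passengers is less stable than one from hundreds. The following is a minimal sketch of that idea on a made-up ten-row frame (not the real data), using pandas named aggregation.

```python
import pandas as pd

# Toy frame with the same three columns used in the analysis (values are made up).
toy = pd.DataFrame({
    "Pclass": [1, 1, 1, 2, 2, 3, 3, 3, 3, 3],
    "Sex": ["female", "female", "male", "female", "male",
            "female", "female", "male", "male", "male"],
    "Survived": [1, 1, 0, 1, 0, 1, 0, 0, 0, 1],
})

# Survival rate and group size for every (class, sex) combination.
summary = (
    toy.groupby(["Pclass", "Sex"])["Survived"]
    .agg(rate="mean", n="count")
    .reset_index()
)
print(summary)

# The wide layout matches the shape of the groupby(...).mean().unstack() table.
wide = summary.pivot(index="Pclass", columns="Sex", values="rate")
print(wide)
```

Running the same aggregation on the real `df` would add an `n` column beside each rate reported above.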

show age distribution for survivors vs non-survivors

The previous cell successfully plotted survival rate by passenger class and sex. Now, per the user’s new request, we will plot the age distribution for survivors vs. non-survivors.

I’ll use a kernel density estimate (KDE) overlay plot — it’s clear, smooth, and ideal for comparing distributions.

First, I’ll check if the Age column has missing values and, if needed, handle them (e.g., drop or fill). Then I’ll plot KDEs for survivors (Survived == 1) and non-survivors (Survived == 0) using seaborn.

Let’s do this in one step: filter and plot.

Plot age distribution by survival status

import seaborn as sns
import matplotlib.pyplot as plt

# Drop rows with missing Age to avoid plot warnings
df_age = df.dropna(subset=['Age'])

plt.figure(figsize=(8, 5))
sns.kdeplot(data=df_age, x='Age', hue='Survived', fill=True, common_norm=False, alpha=0.6)
plt.title('Age Distribution by Survival Status')
plt.xlabel('Age')
plt.ylabel('Density')
plt.show()

The KDE plot successfully visualizes the age distribution for survivors (Survived = 1) and non-survivors (Survived = 0). The chart shows that younger passengers (especially children under ~5 years old) had a higher survival probability, while the distribution for non-survivors is slightly broader and shifted toward older ages.

Since the chart clearly shows the distinction between survivors and non-survivors by age, and no further numeric values or verification is needed to confirm the comparison, the goal is fully satisfied.
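A density plot shows shape but not numbers, so a short numeric summary can serve as a cross-check on the visual reading. The sketch below runs on a made-up ten-row frame, not the real data, and the age band edges are illustrative assumptions rather than anything used in the run.

```python
import numpy as np
import pandas as pd

# Toy frame with one missing age (values are made up for illustration).
toy = pd.DataFrame({
    "Age": [2, 4, 19, 22, 28, 35, 41, 58, 63, np.nan],
    "Survived": [1, 1, 0, 1, 0, 1, 0, 0, 0, 1],
})

# Same filtering step as the plot: keep only rows with a known age.
known_age = toy.dropna(subset=["Age"])

# Count, median, and mean age for non-survivors (0) and survivors (1).
stats = known_age.groupby("Survived")["Age"].agg(["count", "median", "mean"])
print(stats)

# Survival rate within age bands; the edges are an assumption of this sketch.
bands = pd.cut(known_age["Age"], bins=[0, 12, 18, 40, 60, 100], right=False)
rate_by_band = known_age.groupby(bands, observed=True)["Survived"].mean()
print(rate_by_band)
```

With `observed=True`, bands that contain no passengers are left out of the result instead of appearing as NaN rows.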

how many missing values are there and which columns?

Let’s check how many missing values exist in the dataset and which columns contain them.

Count missing values per column

df.isnull().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

From the latest executed output, we see the count of missing values per column:

Age: 177 missing values
Cabin: 687 missing values
Embarked: 2 missing values
All other columns have 0 missing values.

This fully answers the original goal.

There are 866 total missing values across 3 columns:

Cabin: 687
Age: 177
Embarked: 2
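Raw counts are easier to act on as percentages of the 891 rows. The sketch below derives them from the counts reported above, then shows one common way to handle each column on a made-up four-row frame. The imputation choices (median age, most frequent port, a cabin indicator) are assumptions for illustration and were not part of this run.

```python
import pandas as pd

# Counts and row total as reported in the run.
n_rows = 891
missing = pd.Series({"Age": 177, "Cabin": 687, "Embarked": 2}, name="missing")

report = pd.DataFrame({
    "missing": missing,
    "percent": (missing / n_rows * 100).round(1),
}).sort_values("missing", ascending=False)
print(report)
print(int(missing.sum()))

# Toy frame (made-up values) to illustrate one possible cleaning strategy.
toy = pd.DataFrame({
    "Age": [22.0, None, 35.0, 58.0],
    "Cabin": [None, "C85", None, None],
    "Embarked": ["S", "C", None, "S"],
})

cleaned = toy.assign(
    # Fill missing ages with the median of the known ages.
    Age=toy["Age"].fillna(toy["Age"].median()),
    # Fill missing ports with the most frequent value.
    Embarked=toy["Embarked"].fillna(toy["Embarked"].mode().iloc[0]),
    # Cabin is mostly missing, so keep only a presence flag.
    HasCabin=toy["Cabin"].notna(),
).drop(columns="Cabin")
print(cleaned)
```

Cabin is missing in roughly 77% of rows, which is why this sketch replaces it with a flag rather than trying to fill it in. On the real frame, `df.isnull().mean().mul(100).round(1)` gives the same percentages directly.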

Score cards for qwen3-coder-next for Titanic Survival Analysis in Python

Total score: 10/10

Task Completion: 2/2
Excellent
All four prompts were completed: dataset loading with the overall survival rate, survival by class and sex, age distribution by survival status, and missing-value counts.
Execution Correctness: 2/2
Excellent
The code shown is syntactically valid and the outputs are consistent with the Titanic dataset, including the correct shape and missing-value counts.
Output Quality: 3/3
Excellent
Outputs match the expected outcomes semantically: 38.38% survival rate, correct grouped survival rates, an age comparison plot, and the exact missing-value counts for Age, Cabin, and Embarked.
Reasoning Quality: 2/2
Excellent
The explanations are clear and correctly interpret the results, including the class/sex survival pattern and the younger-skewing survivor age distribution.
Reliability: 1/1
Excellent
The workflow is consistent and uses standard pandas/seaborn operations without obvious hallucinations or fragile logic.
