Titanic Survival Analysis in Python

Explore the Titanic dataset with survival rates by class, sex, and age, handle missing values, and visualize patterns using an AI data analyst.

This page shows an AI data analysis example. A workflow is a list of prompts (instructions) used to complete one analysis task step by step. We send the same prompt list to multiple AI models, then compare their notebook outputs and scores.

What this AI workflow does

This AI Data Analyst workflow loads the Titanic training dataset from a URL and computes the overall survival rate and dataset shape. It generates visual comparisons of survival rates by passenger class and sex, and plots age distributions for survivors versus non-survivors. It also audits missing values by column to identify fields that need cleaning or imputation.

Who this example is for

This is for analysts and students who want a guided, conversational EDA example on a well-known classification dataset. It helps anyone practicing data cleaning and basic demographic breakdowns with reproducible Python code and plots.

Expected analysis outcomes

These are the results the AI workflow is expected to generate.

  • Load the Titanic CSV from the provided URL and confirm shape (891, 12)
  • Compute overall survival rate (38.4%)
  • Create a grouped bar chart of survival rate by class and sex
  • Plot survivor vs non-survivor age distributions
  • Report missing values: Age 177, Cabin 687, Embarked 2
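The expected figures above can be checked programmatically. The sketch below is a minimal illustration and not part of the original workflow: `summarize` is a hypothetical helper, and the small frame is made-up data standing in for the real CSV.

```python
import pandas as pd


def summarize(df: pd.DataFrame) -> dict:
    """Return the shape, overall survival rate, and per-column missing counts."""
    missing = df.isnull().sum()
    return {
        "shape": df.shape,
        "survival_rate": float(df["Survived"].mean()),
        # Keep only the columns that actually contain missing values
        "missing": missing[missing > 0].to_dict(),
    }


# Made-up rows for illustration only; the real dataset has 891 rows and 12 columns
toy = pd.DataFrame(
    {
        "Survived": [0, 1, 1, 0],
        "Age": [22.0, None, 26.0, 35.0],
        "Cabin": [None, "C85", None, "C123"],
    }
)
summary = summarize(toy)
print(summary)
```

Run against the real training file, the same helper should report the values listed above: shape (891, 12), a survival rate of about 0.384, and missing counts for Age, Cabin, and Embarked.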

Tools and libraries used

Main Python packages and tooling used to run this AI data analysis task.

  • pandas
  • numpy
  • matplotlib
  • seaborn
  • requests

Prompt sequence

This is the exact list of prompts used in this workflow. The same prompt sequence is sent to each model so outputs and scores can be compared fairly.

  1. load titanic dataset from https://raw.githubusercontent.com/pplonski/datasets-for-start/refs/heads/master/Titanic/train.csv and show survival rate overall
  2. plot survival rate by passenger class and sex
  3. show age distribution for survivors vs non-survivors
  4. how many missing values are there and which columns?
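As a rough illustration of sending one prompt list to several models, the sketch below uses stand-in models that only echo text. Everything except the prompt strings is an assumption: `run_workflow`, `make_stub`, and the names `model-a` and `model-b` are hypothetical, and a real run would also carry conversation state from one prompt to the next.

```python
from typing import Callable, Dict, List

# The four prompts of this workflow, in order
PROMPTS: List[str] = [
    "load titanic dataset from https://raw.githubusercontent.com/pplonski/"
    "datasets-for-start/refs/heads/master/Titanic/train.csv and show survival rate overall",
    "plot survival rate by passenger class and sex",
    "show age distribution for survivors vs non-survivors",
    "how many missing values are there and which columns?",
]


def run_workflow(
    prompts: List[str], models: Dict[str, Callable[[str], str]]
) -> Dict[str, List[str]]:
    """Send every prompt, in order, to each model and collect the replies."""
    return {name: [ask(p) for p in prompts] for name, ask in models.items()}


def make_stub(name: str) -> Callable[[str], str]:
    """Hypothetical stand-in for a model: returns a tagged echo of the prompt."""
    return lambda prompt: f"[{name}] reply to: {prompt[:20]}"


models = {"model-a": make_stub("model-a"), "model-b": make_stub("model-b")}
results = run_workflow(PROMPTS, models)
print({name: len(replies) for name, replies in results.items()})
```

Because every model receives the identical list, differences in the collected replies come from the models rather than from the prompts.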

Model Comparison

We compared several LLMs on the same analysis task. The table below shows their scores side by side. You can also open each model run to check the full conversation and notebook results.

Model             Score  Task  Execution  Output  Reasoning  Reliability  Run
gemma4:31b        10/10  2/2   2/2        3/3     2/2        1/1          Open gemma4:31b conversation
glm-5.1           10/10  2/2   2/2        3/3     2/2        1/1          Open glm-5.1 conversation
gpt-5.4           10/10  2/2   2/2        3/3     2/2        1/1          Open gpt-5.4 conversation
gpt-oss:120b      10/10  2/2   2/2        3/3     2/2        1/1          Open gpt-oss:120b conversation
qwen3-coder-next  10/10  2/2   2/2        3/3     2/2        1/1          Open qwen3-coder-next conversation
qwen3.5:397b      10/10  2/2   2/2        3/3     2/2        1/1          Open qwen3.5:397b conversation

How scoring works

Each run is graded on five dimensions:

  • Task Completion (0-2): how fully the workflow answered the requested analysis task and completed required steps.
  • Execution Correctness (0-2): whether generated code is correct, consistent, and likely runnable without major fixes.
  • Output Quality (0-3): how well tables, charts, and results match the expected analytical outcomes.
  • Reasoning Quality (0-2): clarity and correctness of interpretation, explanation, and conclusions.
  • Reliability (0-1): robustness of the run, including low hallucination risk and stable behavior.

The final score is the sum of these dimensions, so total score = 0-10. Higher values indicate more complete, correct, and reliable notebook results for this workflow.
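The scoring arithmetic can be written out as a short sketch. The dictionary keys and the `total_score` helper are names chosen here for illustration; only the five dimensions and their point ranges come from the rubric above.

```python
# Maximum points per dimension, as listed in the scoring rubric
MAX_POINTS = {
    "task_completion": 2,
    "execution_correctness": 2,
    "output_quality": 3,
    "reasoning_quality": 2,
    "reliability": 1,
}


def total_score(scores: dict) -> int:
    """Sum the five dimension scores after checking each one is in range."""
    for name, max_points in MAX_POINTS.items():
        value = scores[name]
        if not 0 <= value <= max_points:
            raise ValueError(f"{name} must be between 0 and {max_points}, got {value}")
    return sum(scores[name] for name in MAX_POINTS)


perfect_run = dict(MAX_POINTS)  # every dimension at its maximum
print(total_score(perfect_run))  # 10
```

A run that scores full marks on every dimension therefore reaches 2 + 2 + 3 + 2 + 1 = 10.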

AI Analysis with qwen3-coder-next

This run uses AI with a step-by-step list of prompts to perform the analysis task Titanic Survival Analysis in Python. Below, you can review the full conversation generated with qwen3-coder-next. After the conversation, we show detailed score cards for this run.

load titanic dataset from https://raw.githubusercontent.com/pplonski/datasets-for-start/refs/heads/master/Titanic/train.csv and show survival rate overall
Load Titanic dataset and display basic info
import pandas as pd

# Load the Titanic dataset
url = "https://raw.githubusercontent.com/pplonski/datasets-for-start/refs/heads/master/Titanic/train.csv"
df = pd.read_csv(url)

# Display basic info and survival rate
print("Dataset shape:", df.shape)
print("\nFirst 5 rows:")
df.head()
Dataset shape: (891, 12)

First 5 rows:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S

The dataset is loaded successfully with 891 rows and 12 columns, including the Survived column (0 = did not survive, 1 = survived). The first few rows confirm the structure: PassengerId, Survived, Pclass, Name, Sex, Age, etc.

To calculate and show the overall survival rate (proportion of passengers who survived), I need to compute the mean of the Survived column.

Calculate survival rate
survival_rate = df['Survived'].mean()
survival_rate
np.float64(0.3838383838383838)

The overall survival rate is 38.38%, meaning about 38 out of every 100 passengers survived the Titanic disaster. This is calculated as the mean of the Survived column (0 = died, 1 = survived). The dataset contains 891 records, and the survival rate has been correctly computed in the last step.

The overall survival rate in the Titanic training dataset is 38.38%.
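The mean works as a survival rate because Survived holds only 0 and 1, so the average equals survivors divided by passengers. A minimal sketch on made-up values (not the Titanic data) shows three equivalent ways to get the same number:

```python
import pandas as pd

# Made-up outcomes for illustration: 1 = survived, 0 = did not survive
survived = pd.Series([1, 0, 0, 1, 0])

rate_from_mean = survived.mean()                   # average of the 0/1 values
rate_from_counts = survived.sum() / len(survived)  # survivors divided by passengers
shares = survived.value_counts(normalize=True)     # share of each outcome

print(rate_from_mean, rate_from_counts, shares.loc[1])
```

On the real data, a rate of 0.3838 over 891 rows corresponds to 342 survivors.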

plot survival rate by passenger class and sex

Let's visualize the survival rate by passenger class (Pclass) and sex (Sex). I'll create a grouped bar chart using seaborn, where the x-axis represents passenger class, the bars are colored by sex, and the height represents the survival rate.

First, I'll compute the survival rate for each group:

Compute survival rate by Pclass and Sex
import pandas as pd

# Compute survival rate by Pclass and Sex
grouped = df.groupby(['Pclass', 'Sex'])['Survived'].mean().unstack()

# Display the computed rates as a table for reference
grouped
Sex female male
Pclass
1 0.968085 0.368852
2 0.921053 0.157407
3 0.500000 0.135447

From the computed survival rates by Pclass and Sex:

  • First class: 96.8% of females and 36.9% of males survived.
  • Second class: 92.1% of females and 15.7% of males survived.
  • Third class: 50.0% of females and 13.5% of males survived.

This clearly shows both a class and gender effect: women had much higher survival rates than men in every class, and higher class correlated with higher survival, especially for women.

The goal is to plot survival rate by passenger class and sex. The underlying values are now available, so I’ll generate a clear grouped bar chart using seaborn.

Plot survival rates by class and sex
import seaborn as sns
import matplotlib.pyplot as plt

# Prepare data for plotting (long format)
plot_data = grouped.reset_index().melt(id_vars='Pclass', var_name='Sex', value_name='SurvivalRate')

# Create grouped bar chart
plt.figure(figsize=(8, 5))
sns.barplot(data=plot_data, x='Pclass', y='SurvivalRate', hue='Sex')
plt.title('Survival Rate by Passenger Class and Sex')
plt.xlabel('Passenger Class')
plt.ylabel('Survival Rate')
plt.ylim(0, 1)
plt.show()

The bar chart successfully visualizes survival rate by passenger class and sex. The chart shows clear patterns: survival rates are highest for first-class females (≈96.8%) and lowest for third-class males (≈13.5%). The gap between males and females is consistent across all classes, and survival declines with lower class for both genders.
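The size of the female-male gap can be read directly from the grouped table computed earlier. The sketch below re-enters those six rates by hand (the variable names are chosen here for illustration) and subtracts them per class:

```python
import pandas as pd

# Survival rates copied from the grouped table in this run
rates = pd.DataFrame(
    {
        "female": [0.968085, 0.921053, 0.500000],
        "male": [0.368852, 0.157407, 0.135447],
    },
    index=pd.Index([1, 2, 3], name="Pclass"),
)

# Female minus male survival rate per class, in percentage points
gap_pp = ((rates["female"] - rates["male"]) * 100).round(1)
print(gap_pp.to_dict())
```

The gap comes out at roughly 60, 76, and 36 percentage points for first, second, and third class. Women survived at higher rates in every class, but the size of the difference varies considerably between classes.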

show age distribution for survivors vs non-survivors

The previous cell successfully plotted survival rate by passenger class and sex. Now, per the user’s new request, we will plot the age distribution for survivors vs. non-survivors.

I’ll use a kernel density estimate (KDE) overlay plot — it’s clear, smooth, and ideal for comparing distributions.

First, I’ll check if the Age column has missing values and, if needed, handle them (e.g., drop or fill). Then I’ll plot KDEs for survivors (Survived == 1) and non-survivors (Survived == 0) using seaborn.

Let’s do this in one step: filter and plot.

Plot age distribution by survival status
import seaborn as sns
import matplotlib.pyplot as plt

# Drop rows with missing Age to avoid plot warnings
df_age = df.dropna(subset=['Age'])

plt.figure(figsize=(8, 5))
sns.kdeplot(data=df_age, x='Age', hue='Survived', fill=True, common_norm=False, alpha=0.6)
plt.title('Age Distribution by Survival Status')
plt.xlabel('Age')
plt.ylabel('Density')
plt.show()

The KDE plot successfully visualizes the age distribution for survivors (Survived = 1) and non-survivors (Survived = 0). The chart shows that younger passengers (especially children under ~5 years old) had a higher survival probability, while the distribution for non-survivors is slightly broader and shifted toward older ages.

Since the chart clearly shows the distinction between survivors and non-survivors by age, and no further numeric values or verification is needed to confirm the comparison, the goal is fully satisfied.
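A density plot compares shapes rather than rates, so a numeric check by age band can complement it. The sketch below is one possible approach, assuming arbitrary band edges of 5, 18, 40, and 80 years and using made-up rows rather than the Titanic data:

```python
import pandas as pd

# Made-up passengers for illustration only; not taken from the Titanic CSV
toy = pd.DataFrame(
    {
        "Age": [2, 4, 16, 25, 30, 45, 60, None],
        "Survived": [1, 1, 0, 0, 1, 0, 0, 1],
    }
)

# Right-inclusive bands: (0, 5], (5, 18], (18, 40], (40, 80]
# A missing Age gets no band, so that row drops out of the grouping
bands = pd.cut(toy["Age"], bins=[0, 5, 18, 40, 80])

# Survival rate and passenger count within each age band
by_band = toy.groupby(bands, observed=False)["Survived"].agg(["mean", "count"])
print(by_band)
```

Reporting the count next to the rate helps flag bands with too few passengers to support a conclusion.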

how many missing values are there and which columns?

Let’s check how many missing values exist in the dataset and which columns contain them.

Count missing values per column
df.isnull().sum()
PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

From the latest executed output ([cell-6]), we see the count of missing values per column:

  • Age: 177 missing values
  • Cabin: 687 missing values
  • Embarked: 2 missing values
  • All other columns have 0 missing values.

This fully answers the original goal.

There are 866 total missing values across 3 columns:

  • Cabin: 687
  • Age: 177
  • Embarked: 2
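Raw counts are easier to act on as shares of the column. The sketch below computes the percentage missing per column and then applies one common treatment. The imputation choices are an assumption for illustration, not something this workflow performed, and `HasCabin` is a hypothetical column name. The rows are made up; only the column names mirror the Titanic data.

```python
import pandas as pd

# Made-up rows for illustration only
toy = pd.DataFrame(
    {
        "Age": [22.0, None, 26.0, 35.0],
        "Cabin": [None, "C85", None, None],
        "Embarked": ["S", "C", None, "S"],
    }
)

# Share of missing values per column, as a percentage
missing_pct = (toy.isnull().mean() * 100).round(1)
print(missing_pct.to_dict())

# One possible treatment (an assumption, not the workflow's method)
cleaned = toy.copy()
# Numeric column: fill with the median
cleaned["Age"] = cleaned["Age"].fillna(cleaned["Age"].median())
# Categorical column: fill with the most frequent value
cleaned["Embarked"] = cleaned["Embarked"].fillna(cleaned["Embarked"].mode()[0])
# Mostly missing column: keep only a presence flag
cleaned["HasCabin"] = cleaned["Cabin"].notnull()
cleaned = cleaned.drop(columns="Cabin")
print(cleaned.isnull().sum().sum())
```

With 891 rows, the counts reported in this run correspond to roughly 19.9% missing for Age, 77.1% for Cabin, and 0.2% for Embarked, which is why Cabin is often reduced to a flag rather than imputed.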

Score cards for qwen3-coder-next for Titanic Survival Analysis in Python

Total score: 10/10

  • Task Completion: 2/2

    Excellent

    All four prompts were completed: loading the dataset and reporting the overall survival rate, survival by class and sex, age distribution by survival status, and missing-value counts.

  • Execution Correctness: 2/2

    Excellent

    The code shown is syntactically valid and the outputs are consistent with the Titanic dataset, including the correct shape and missing-value counts.

  • Output Quality: 3/3

    Excellent

    Outputs match the expected outcomes semantically: 38.38% survival rate, correct grouped survival rates, an age comparison plot, and the exact missing-value counts for Age, Cabin, and Embarked.

  • Reasoning Quality: 2/2

    Excellent

    The explanations are clear and correctly interpret the results, including the class/sex survival pattern and the younger-skewing survivor age distribution.

  • Reliability: 1/1

    Excellent

    The workflow is consistent and uses standard pandas/seaborn operations without obvious hallucinations or fragile logic.
