Exploratory Data Analysis

Titanic Survival Analysis in Python

Explore the Titanic dataset with survival rates by class, sex, and age, handle missing values, and visualize patterns using an AI data analyst.

What

This AI Data Analyst workflow loads the Titanic training dataset from a URL and computes the overall survival rate and dataset shape. It generates visual comparisons of survival rates by passenger class and sex, and plots age distributions for survivors versus non-survivors. It also audits missing values by column to identify fields that need cleaning or imputation.

Who

This is for analysts and students who want a guided, conversational EDA example on a well-known classification dataset. It helps anyone practicing data cleaning and basic demographic breakdowns with reproducible Python code and plots.

Tools

pandas
numpy
matplotlib
seaborn
requests

Outcomes

Load the Titanic CSV from the provided URL and confirm shape (891, 12)
Compute overall survival rate (38.4%)
Create a grouped bar chart of survival rate by class and sex
Plot survivor vs non-survivor age distributions
Report missing values: Age 177, Cabin 687, Embarked 2

Quality Score

7/10

gpt-5.2, mljar-ai

Last scored: Apr 7, 2026

Task Completion: 1/2
Needs work
Loaded the dataset, computed overall survival rate, produced the requested plots, and computed missing values. However, it did not report the dataset shape (891, 12) and did not explicitly state the key plot findings expected (e.g., 1st class + female highest; survivors skew younger).
Execution Correctness: 2/2
Excellent
All provided code blocks are syntactically correct and logically consistent (read_csv, groupby mean, seaborn plots, missingness summary). The workflow is likely runnable in a standard notebook environment.
Output Quality: 2/3
Good
Outputs match several expected outcomes: survival rate ~38.38% and missing values (Age 177, Cabin 687, Embarked 2) with correct percentages, and plots were generated. But the analysis text avoids/omits the expected semantic conclusions from the plots and does not show the dataset shape.
Reasoning Quality: 1/2
Needs work
Reasoning is generally coherent about what the code/plots represent, but it is overly cautious and fails to extract the main insights the task asked for (class/sex survival ranking and age skew).
Reliability: 1/1
Excellent
Approach is standard EDA and reasonably robust (uses dropna for age, computes missingness systematically). Minor fragility: uses display() without importing it, though this typically works in notebooks.
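The fragility noted above can be avoided with a guarded import. This is a minimal sketch, assuming that falling back to `print` is acceptable when IPython is not installed; that fallback is an illustrative choice, not something the workflow itself does.

```python
# Prefer IPython's rich display when it is available.
try:
    from IPython.display import display
except ImportError:
    # Assumption: plain-text output is good enough outside a notebook.
    display = print

display("display() is usable in notebooks and in plain scripts")
```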

load titanic dataset from https://raw.githubusercontent.com/pplonski/datasets-for-start/refs/heads/master/Titanic/train.csv and show survival rate overall

Thinking

Considering data inspection

I need to focus on loading and inspecting the data without combining it with plotting or modeling. It looks like I'm expected to show the shape of the dataset, the columns, and the first few rows with the head() function. I should keep my output concise and ensure there's exactly one code block. It seems best to follow the guidelines closely to avoid mixing different tasks. Let's make sure to do that!

Let’s start by loading the Titanic training data and inspecting its structure so we can confirm the survival column is present and named as expected.

Load and inspect Titanic data

import pandas as pd
from IPython.display import display

url = "https://raw.githubusercontent.com/pplonski/datasets-for-start/refs/heads/master/Titanic/train.csv"
df = pd.read_csv(url)

display(pd.DataFrame({"rows": [df.shape[0]], "columns": [df.shape[1]]}))
display(pd.DataFrame({"columns": df.columns}))
df.head()

	rows	columns
0	891	12

	columns
0	PassengerId
1	Survived
2	Pclass
3	Name
4	Sex
5	Age
6	SibSp
7	Parch
8	Ticket
9	Fare
10	Cabin
11	Embarked

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Ticket	Fare	Cabin	Embarked
0	1	0	3	Braund, Mr. Owen Harris	male	22.0	1	A/5 21171	7.2500	NaN	S
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.0	1	PC 17599	71.2833	C85	C
2	3	1	3	Heikkinen, Miss. Laina	female	26.0	0	STON/O2. 3101282	7.9250	NaN	S
3	4	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35.0	1	113803	53.1000	C123	S
4	5	0	3	Allen, Mr. William Henry	male	35.0	0	373450	8.0500	NaN	S

The data is in place and the Survived column is confirmed, so the only missing piece is the actual overall survival rate.
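The column check described above can also be made explicit in code. The sketch below is one possible way to do it; the helper name `check_binary_target` and the three-row frame are hypothetical and are not part of the original notebook.

```python
import pandas as pd


def check_binary_target(frame: pd.DataFrame, target: str = "Survived") -> tuple:
    """Raise if `target` is missing or holds values other than 0/1; return the shape."""
    if target not in frame.columns:
        raise KeyError(f"expected column {target!r}, found {list(frame.columns)}")
    observed = set(frame[target].dropna().unique())
    if not observed <= {0, 1}:
        raise ValueError(f"{target!r} has unexpected values: {sorted(observed)}")
    return frame.shape


# Hypothetical three-row stand-in for the real dataset.
toy = pd.DataFrame({"Survived": [0, 1, 1], "Pclass": [3, 1, 3]})
shape = check_binary_target(toy)
print(shape)
```

On the real data the same call would be `check_binary_target(df)`, which should return the shape shown in the output above.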

Calculate survival rate

survival_rate = df["Survived"].mean()

pd.DataFrame({
    "survival_rate": [survival_rate],
    "survival_rate_percent": [survival_rate * 100]
})

	survival_rate	survival_rate_percent
0	0.383838	38.383838

The dataset contains 891 passengers and includes the Survived column needed for this calculation.

The overall survival rate is 0.3838, which means 38.38% of passengers survived.
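The calculation works because the mean of a 0/1 column equals the fraction of ones. The sketch below illustrates that equivalence on a small made-up series (not the Titanic data) and also shows the raw counts behind the rate.

```python
import pandas as pd

# Hypothetical toy column: 1 = survived, 0 = did not survive.
survived = pd.Series([1, 0, 0, 1, 0], name="Survived")

# Two equivalent ways to get the survival rate.
rate_from_mean = survived.mean()
rate_from_counts = (survived == 1).sum() / len(survived)

# Raw counts per outcome, ordered 0 then 1.
counts = survived.value_counts().sort_index()

print(rate_from_mean, rate_from_counts)
print(counts)
```

Reporting the counts next to the rate makes it easier to see how many passengers the percentage is based on.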

plot survival rate by passenger class and sex

A grouped bar chart is the simplest way to compare survival rate across passenger class and sex.

Plot survival rates by class

import seaborn as sns
import matplotlib.pyplot as plt
from matplotlib.ticker import PercentFormatter

plt.figure(figsize=(8, 5))
ax = sns.barplot(data=df, x="Pclass", y="Survived", hue="Sex", estimator="mean", errorbar=None)

ax.set_title("Survival Rate by Passenger Class and Sex")
ax.set_xlabel("Passenger Class")
ax.set_ylabel("Survival Rate")
ax.yaxis.set_major_formatter(PercentFormatter(1))
ax.legend(title="Sex")

plt.tight_layout()
ax

<Axes: title={'center': 'Survival Rate by Passenger Class and Sex'}, xlabel='Passenger Class', ylabel='Survival Rate'>

The current chart is the right one for the question: each bar represents the mean of Survived, so it is showing survival rate by passenger class and sex. What’s missing is the exact values behind those bars.

Calculate survival rates

rates = (
    df.groupby(["Pclass", "Sex"])["Survived"]
    .mean()
    .reset_index(name="survival_rate")
)

rates["survival_rate_percent"] = (rates["survival_rate"] * 100).round(2)
rates

	Pclass	Sex	survival_rate	survival_rate_percent
0	1	female	0.968085	96.81
1	1	male	0.368852	36.89
2	2	female	0.921053	92.11
3	2	male	0.157407	15.74
4	3	female	0.500000	50.00
5	3	male	0.135447	13.54

The plot and the grouped survival-rate table together answer the question clearly.

Women had much higher survival rates than men in every passenger class:

1st class: female 96.81%, male 36.89%
2nd class: female 92.11%, male 15.74%
3rd class: female 50.00%, male 13.54%

What this means from the chart:

Sex was a very strong factor: female passengers survived at much higher rates than male passengers across all classes.
Passenger class also mattered: for both sexes, survival was highest in 1st class, then 2nd, then 3rd.
1st-class women had the highest survival rate at nearly 97%.
3rd-class men had the lowest survival rate at about 13.5%.
The class effect looks different by sex: among men, survival drops sharply from 1st class (36.89%) to 2nd (15.74%) and 3rd (13.54%), while among women the sharp drop is from 2nd class (92.11%) to 3rd (50.00%).

So the visualization shows a clear combined effect of class and sex on Titanic survival: women, especially in higher classes, were far more likely to survive.
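To put a number on the gap between the sexes, the grouped table can be reshaped so that each class is one row. The sketch below re-enters the six percentages reported above by hand and computes the female-minus-male difference; the column name `gap_female_minus_male` is an illustrative choice.

```python
import pandas as pd

# Percentages copied from the grouped survival-rate table above.
rates = pd.DataFrame({
    "Pclass": [1, 1, 2, 2, 3, 3],
    "Sex": ["female", "male"] * 3,
    "survival_rate_percent": [96.81, 36.89, 92.11, 15.74, 50.00, 13.54],
})

# One row per class, one column per sex.
wide = rates.pivot(index="Pclass", columns="Sex", values="survival_rate_percent")

# Percentage-point gap between women and men within each class.
wide["gap_female_minus_male"] = (wide["female"] - wide["male"]).round(2)

print(wide)
```

With these inputs the gap is largest in 2nd class and smallest in 3rd class.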

show age distribution for survivors vs non-survivors

Thinking

Creating age distribution chart

I'm focusing on producing a clear answer based on the notebook state. I’ll follow an iterative approach, step-by-step, to analyze recent outputs and aim for a chart showing age distribution for survivors versus non-survivors. Since the dataset is loaded, I can plot it directly. I'll likely use a Seaborn histogram or KDE. I’ll ensure to handle missing age data with dropna.

Now, I’ll provide one code block for the histogram visualization.

A clear next view is an overlaid age histogram split by survival status.

Plot age distribution by survival

plot_df = df[["Age", "Survived"]].dropna().copy()
plot_df["Survived"] = plot_df["Survived"].map({0: "Non-survivor", 1: "Survivor"})

plt.figure(figsize=(9, 5))
ax = sns.histplot(
    data=plot_df,
    x="Age",
    hue="Survived",
    bins=30,
    stat="density",
    common_norm=False,
    element="step",
    fill=False
)

ax.set_title("Age Distribution: Survivors vs Non-survivors")
ax.set_xlabel("Age")
ax.set_ylabel("Density")

plt.tight_layout()
ax

<Axes: title={'center': 'Age Distribution: Survivors vs Non-survivors'}, xlabel='Age', ylabel='Density'>

Thinking

Crafting insights on age distribution

I need to provide insights based on the age distribution for survivors versus non-survivors, using our chart object since we don’t have a numeric table. The chart visualizes the data, but it doesn’t offer exact values, so a summary of statistics by survivor status is important. I'll include a minimal runnable code block focusing on grouped descriptive stats for Age by Survived. Let’s ensure I keep the expression concise. Here’s the code block to gather that missing evidence.

# Grouped descriptive statistics for Age by Survived status
age_stats = df.groupby('Survived')['Age'].describe()
display(age_stats)

Interpreting the age distribution

I have the histogram ready, which allows me to visually compare the age distributions for survivors and non-survivors. However, to give it more context with concrete values, I need to create a compact summary of age by status. This summary will include statistics like count, mean, median, and quartiles, which will help interpret the results effectively. Let’s make sure to present these insights clearly for the user!

The age distribution chart is already there, so the comparison is visually set up. What’s still missing is the underlying age summary by survival status, which will let us describe the difference concretely rather than only by eye.

Summarize age by survival

age_summary = (
    df.groupby("Survived")["Age"]
    .agg(
        count="count",
        mean="mean",
        median="median",
        min="min",
        q1=lambda s: s.quantile(0.25),
        q3=lambda s: s.quantile(0.75),
        max="max",
    )
    .round(2)
)

age_summary.index = age_summary.index.map({0: "Non-survivor", 1: "Survivor"})
age_summary

	count	mean	median	min	q1	q3	max
Survived
Non-survivor	424	30.63	28.0	1.00	21.0	39.0	74.0
Survivor	290	28.34	28.0	0.42	19.0	36.0	80.0

The age distribution differs modestly between survivors and non-survivors.

From the plotted distributions and the summary table:

Non-survivors skew a bit older overall: mean age is 30.63 versus 28.34 for survivors.
The median age is the same for both groups at 28, so the difference shows up in the mean and the quartiles rather than in the median.
Survivors are slightly younger across the middle of the distribution:
- 25th percentile: 19 for survivors vs 21 for non-survivors
- 75th percentile: 36 for survivors vs 39 for non-survivors
The observed age ranges are broad in both groups:
- Non-survivors: 1.00 to 74.0
- Survivors: 0.42 to 80.0

So, the age plot supports this takeaway: survivors tend to be somewhat younger on average, but the two age distributions overlap heavily.
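Because the two distributions overlap so heavily, a complementary view is the survival rate within age bands. The sketch below shows the idea with `pd.cut` on a small made-up frame; the bin edges, the band labels, and the toy rows are all illustrative assumptions, not values from this analysis.

```python
import pandas as pd

# Hypothetical toy rows; the last one has a missing age.
toy = pd.DataFrame({
    "Age": [4, 9, 22, 35, 41, 58, 63, None],
    "Survived": [1, 1, 0, 1, 0, 0, 1, 0],
})

# Assumed band edges: (0, 12], (12, 40], (40, 120].
bins = [0, 12, 40, 120]
labels = ["child", "adult", "older adult"]

# Drop rows without an age, then assign each passenger to a band.
valid = toy.dropna(subset=["Age"]).copy()
valid["age_band"] = pd.cut(valid["Age"], bins=bins, labels=labels)

# Passenger count and survival rate per band.
band_rates = valid.groupby("age_band", observed=True)["Survived"].agg(["count", "mean"])

print(band_rates)
```

On the real data, replacing `toy` with `df` would give one survival rate per band, which can be read alongside the histogram.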

how many missing values are there and which columns?

Thinking

Assessing missing values

I need to answer based on the notebook state since the user asked about missing values. They want to know how many there are and which columns are affected. It sounds like I should provide a concise code block in markdown format for this.

I think the code should count missing values per column and filter for those greater than zero. I could create a missing_df to summarize this, and then I’ll keep it simple with just one code block. Let's get right to crafting that!

Let’s check missing values by column first.

Summarize missing values

missing_df = (
    df.isna()
    .sum()
    .reset_index()
    .rename(columns={"index": "column", 0: "missing_values"})
)

missing_df = missing_df[missing_df["missing_values"] > 0].sort_values("missing_values", ascending=False).reset_index(drop=True)
missing_df

	column	missing_values
0	Cabin	687
1	Age	177
2	Embarked	2

There are 866 missing values in total across 3 columns.

The columns with missing values are:

Cabin: 687 missing
Age: 177 missing
Embarked: 2 missing

This means most of the missing data is concentrated in Cabin, with a moderate amount in Age, and almost none in Embarked.
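The counts are easier to compare as a share of all rows. The sketch below uses only the figures reported above (891 rows and the three missing-value counts) to derive the percentage missing per column.

```python
# Figures taken from the outputs above.
n_rows = 891
missing_counts = {"Cabin": 687, "Age": 177, "Embarked": 2}

# Share of rows with a missing value, per column, in percent.
missing_percent = {
    column: round(100 * count / n_rows, 2)
    for column, count in missing_counts.items()
}

total_missing = sum(missing_counts.values())

print(missing_percent)
print(total_missing)
```

With the DataFrame loaded, `df.isna().mean() * 100` computes the same percentages directly from the data.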
