Exploratory Data Analysis

Titanic Survival Analysis in Python

Explore the Titanic dataset with survival rates by class, sex, and age, audit missing values, and visualize patterns using an AI data analyst.

What

This AI Data Analyst workflow loads the Titanic training dataset from a URL and computes the overall survival rate and dataset shape. It generates visual comparisons of survival rates by passenger class and sex, and plots age distributions for survivors versus non-survivors. It also audits missing values by column to identify fields that need cleaning or imputation.

Who

This is for analysts and students who want a guided, conversational EDA example on a well-known classification dataset. It helps anyone practicing data cleaning and basic demographic breakdowns with reproducible Python code and plots.

Tools

  • pandas
  • numpy
  • matplotlib
  • seaborn
  • requests

Outcomes

  • Load the Titanic CSV from the provided URL and confirm shape (891, 12)
  • Compute overall survival rate (38.4%)
  • Create a grouped bar chart of survival rate by class and sex
  • Plot survivor vs non-survivor age distributions
  • Report missing values: Age 177, Cabin 687, Embarked 2

Quality Score

7/10

gpt-5.2 mljar-ai

Last scored: Apr 7, 2026

  • Task Completion: 1/2

    Needs work

    Loaded the dataset, computed overall survival rate, produced the requested plots, and computed missing values. However, it did not report the dataset shape (891, 12) and did not explicitly state the key plot findings expected (e.g., 1st class + female highest; survivors skew younger).

  • Execution Correctness: 2/2

    Excellent

    All provided code blocks are syntactically correct and logically consistent (read_csv, groupby mean, seaborn plots, missingness summary). The workflow is likely runnable in a standard notebook environment.

  • Output Quality: 2/3

    Good

    Outputs match several expected outcomes: survival rate ~38.38% and missing values (Age 177, Cabin 687, Embarked 2) with correct percentages, and plots were generated. But the analysis text avoids/omits the expected semantic conclusions from the plots and does not show the dataset shape.

  • Reasoning Quality: 1/2

    Needs work

    Reasoning is generally coherent about what the code/plots represent, but it is overly cautious and fails to extract the main insights the task asked for (class/sex survival ranking and age skew).

  • Reliability: 1/1

    Excellent

    Approach is standard EDA and reasonably robust (uses dropna for age, computes missingness systematically). Minor fragility: uses display() without importing it, though this typically works in notebooks.

load titanic dataset from https://raw.githubusercontent.com/pplonski/datasets-for-start/refs/heads/master/Titanic/train.csv and show survival rate overall
Thinking

Considering data inspection

I need to focus on loading and inspecting the data without combining it with plotting or modeling. It looks like I'm expected to show the shape of the dataset, the columns, and the first few rows with the head() function. I should keep my output concise and ensure there's exactly one code block. It seems best to follow the guidelines closely to avoid mixing different tasks. Let's make sure to do that!

Let’s start by loading the Titanic training data and inspecting its structure so we can confirm the survival column is present and named as expected.

Load and inspect Titanic data
import pandas as pd
from IPython.display import display

url = "https://raw.githubusercontent.com/pplonski/datasets-for-start/refs/heads/master/Titanic/train.csv"
df = pd.read_csv(url)

display(pd.DataFrame({"rows": [df.shape[0]], "columns": [df.shape[1]]}))
display(pd.DataFrame({"columns": df.columns}))
df.head()
rows columns
0 891 12
columns
0 PassengerId
1 Survived
2 Pclass
3 Name
4 Sex
5 Age
6 SibSp
7 Parch
8 Ticket
9 Fare
10 Cabin
11 Embarked
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S

The data is in place and the Survived column is confirmed, so the only missing piece is the actual overall survival rate.
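The same confirmation can be done programmatically instead of by eye. The following is a minimal sketch, not part of the original workflow: `check_binary_target` is a hypothetical helper, and the two-row frame reuses only the first two rows of the `head()` output above.

```python
import pandas as pd

def check_binary_target(frame: pd.DataFrame, target: str = "Survived") -> tuple:
    """Hypothetical helper: confirm the target column exists and holds only 0/1."""
    if target not in frame.columns:
        raise KeyError(f"column {target!r} not found")
    values = set(frame[target].dropna().unique())
    if not values <= {0, 1}:
        raise ValueError(f"unexpected values in {target!r}: {sorted(values)}")
    return frame.shape

# Two rows copied from the head() output above, limited to three columns.
sample = pd.DataFrame({
    "PassengerId": [1, 2],
    "Survived": [0, 1],
    "Pclass": [3, 1],
})

print(check_binary_target(sample))
```

On the full dataset the call would be `check_binary_target(df)`, which should return the shape already shown above.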

Calculate survival rate
survival_rate = df["Survived"].mean()

pd.DataFrame({
    "survival_rate": [survival_rate],
    "survival_rate_percent": [survival_rate * 100]
})
survival_rate survival_rate_percent
0 0.383838 38.383838

The dataset contains 891 passengers and includes the Survived column needed for this calculation.

The overall survival rate is 0.3838, which means 38.38% of passengers survived.
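Taking the mean works here because Survived is coded 0/1, so the mean equals the share of 1s. A minimal sketch of that equivalence follows; the Series is rebuilt by hand rather than loaded from the CSV, and the count of 342 survivors is an assumption inferred from the reported rate (0.383838 × 891).

```python
import pandas as pd

# Hand-built stand-in for df["Survived"]: 342 ones and 549 zeros (891 rows).
# The 342 figure is inferred from the reported rate, not read from the file.
survived = pd.Series([1] * 342 + [0] * 549, name="Survived")

rate_from_mean = survived.mean()                       # mean of a 0/1 column
rate_from_counts = survived.sum() / len(survived)      # survivors / passengers

print(f"{rate_from_mean:.6f}")
print(f"{rate_from_counts:.2%}")
```

Both expressions give the same number, so either form can be used depending on whether the raw counts are also needed.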

plot survival rate by passenger class and sex

A grouped bar chart is the simplest way to compare survival rate across passenger class and sex.

Plot survival rates by class
import seaborn as sns
import matplotlib.pyplot as plt
from matplotlib.ticker import PercentFormatter

plt.figure(figsize=(8, 5))
ax = sns.barplot(data=df, x="Pclass", y="Survived", hue="Sex", estimator="mean", errorbar=None)

ax.set_title("Survival Rate by Passenger Class and Sex")
ax.set_xlabel("Passenger Class")
ax.set_ylabel("Survival Rate")
ax.yaxis.set_major_formatter(PercentFormatter(1))
ax.legend(title="Sex")

plt.tight_layout()
ax
<Axes: title={'center': 'Survival Rate by Passenger Class and Sex'}, xlabel='Passenger Class', ylabel='Survival Rate'>

The current chart is the right one for the question: each bar represents the mean of Survived, so it is showing survival rate by passenger class and sex. What’s missing is the exact values behind those bars.

Calculate survival rates
rates = (
    df.groupby(["Pclass", "Sex"])["Survived"]
    .mean()
    .reset_index(name="survival_rate")
)

rates["survival_rate_percent"] = (rates["survival_rate"] * 100).round(2)
rates
Pclass Sex survival_rate survival_rate_percent
0 1 female 0.968085 96.81
1 1 male 0.368852 36.89
2 2 female 0.921053 92.11
3 2 male 0.157407 15.74
4 3 female 0.500000 50.00
5 3 male 0.135447 13.54

The plot and the grouped survival-rate table together answer the question clearly.

Women had much higher survival rates than men in every passenger class:

  • 1st class: female 96.81%, male 36.89%
  • 2nd class: female 92.11%, male 15.74%
  • 3rd class: female 50.00%, male 13.54%

What this means from the chart:

  • Sex was a very strong factor: female passengers survived at much higher rates than male passengers across all classes.
  • Passenger class also mattered: within each sex, survival was highest in 1st class, then 2nd, then 3rd.
  • 1st-class women had the highest survival rate at nearly 97%.
  • 3rd-class men had the lowest survival rate at about 13.5%.
  • The class gap is especially visible among men: survival drops sharply from 1st class (36.89%) to 2nd (15.74%) and 3rd (13.54%).

So the visualization shows a clear combined effect of class and sex on Titanic survival: women, especially in higher classes, were far more likely to survive.
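Rates alone hide how many passengers sit behind each bar, which matters when groups differ in size. A possible extension, sketched under assumptions: `survival_with_counts` is a hypothetical helper, and the toy rows below are made up for illustration and are not Titanic records.

```python
import pandas as pd

def survival_with_counts(frame: pd.DataFrame) -> pd.DataFrame:
    """Return passengers, survivors, and survival rate per (Pclass, Sex) group."""
    return (
        frame.groupby(["Pclass", "Sex"])["Survived"]
        .agg(passengers="count", survivors="sum", survival_rate="mean")
        .reset_index()
    )

# Toy rows for illustration only; these are NOT Titanic records.
toy = pd.DataFrame({
    "Pclass": [1, 1, 1, 3, 3, 3, 3],
    "Sex": ["female", "female", "male", "female", "male", "male", "male"],
    "Survived": [1, 1, 0, 1, 0, 0, 1],
})

table = survival_with_counts(toy)
print(table)
```

Applied to the real data as `survival_with_counts(df)`, this would put group sizes next to the percentages listed above.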

show age distribution for survivors vs non-survivors
Thinking

Creating age distribution chart

I'm focusing on producing a clear answer based on the notebook state. I’ll follow an iterative, step-by-step approach to analyze recent outputs and aim for a chart showing age distribution for survivors versus non-survivors. Since the dataset is loaded, I can plot it directly. I'll likely use a Seaborn histogram or KDE. I’ll make sure to handle missing age data with dropna.

Now, I’ll provide one code block for the histogram visualization.

A clear next view is an overlaid age histogram split by survival status.

Plot age distribution by survival
plot_df = df[["Age", "Survived"]].dropna().copy()
plot_df["Survived"] = plot_df["Survived"].map({0: "Non-survivor", 1: "Survivor"})

plt.figure(figsize=(9, 5))
ax = sns.histplot(
    data=plot_df,
    x="Age",
    hue="Survived",
    bins=30,
    stat="density",
    common_norm=False,
    element="step",
    fill=False
)

ax.set_title("Age Distribution: Survivors vs Non-survivors")
ax.set_xlabel("Age")
ax.set_ylabel("Density")

plt.tight_layout()
ax
<Axes: title={'center': 'Age Distribution: Survivors vs Non-survivors'}, xlabel='Age', ylabel='Density'>
Thinking

Crafting insights on age distribution

I need to provide insights based on the age distribution for survivors versus non-survivors, using our chart object since we don’t have a numeric table. The chart visualizes the data, but it doesn’t offer exact values, so a summary of statistics by survivor status is important. I'll include a minimal runnable code block focusing on grouped descriptive stats for Age by Survived. Let’s ensure I keep the expression concise. Here’s the code block to gather that missing evidence.

# Grouped descriptive statistics for Age by Survived status
age_stats = df.groupby('Survived')['Age'].describe()
display(age_stats)

Interpreting the age distribution

I have the histogram ready, which allows me to visually compare the age distributions for survivors and non-survivors. However, to give it more context with concrete values, I need to create a compact summary of age by status. This summary will include statistics like count, mean, median, and quartiles, which will help interpret the results effectively. Let’s make sure to present these insights clearly for the user!

The age distribution chart is already there, so the comparison is visually set up. What’s still missing is the underlying age summary by survival status, which will let us describe the difference concretely rather than only by eye.

Summarize age by survival
age_summary = (
    df.groupby("Survived")["Age"]
    .agg(
        count="count",
        mean="mean",
        median="median",
        min="min",
        q1=lambda s: s.quantile(0.25),
        q3=lambda s: s.quantile(0.75),
        max="max",
    )
    .round(2)
)

age_summary.index = age_summary.index.map({0: "Non-survivor", 1: "Survivor"})
age_summary
count mean median min q1 q3 max
Survived
Non-survivor 424 30.63 28.0 1.00 21.0 39.0 74.0
Survivor 290 28.34 28.0 0.42 19.0 36.0 80.0

The age distribution differs modestly between survivors and non-survivors.

From the plotted distributions and the summary table:

  • Non-survivors skew a bit older overall: mean age is 30.63 versus 28.34 for survivors.
  • The median age is the same for both groups at 28, so the difference shows up in the mean and the quartiles rather than at the midpoint.
  • Survivors are slightly younger across the middle of the distribution:
    • 25th percentile: 19 for survivors vs 21 for non-survivors
    • 75th percentile: 36 for survivors vs 39 for non-survivors
  • The observed age ranges are broad in both groups:
    • Non-survivors: 1.00 to 74.0
    • Survivors: 0.42 to 80.0

So, the age plot supports this takeaway: survivors tend to be somewhat younger on average, but the two age distributions overlap heavily.
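The group counts in the summary cover only passengers with a recorded age, which can be cross-checked against the figures reported in this walkthrough. A small sanity-check sketch, using only numbers copied from the summary table and the 177 missing Age values listed in the outcomes; the variable names are illustrative.

```python
# Figures copied from the age summary table and the reported missing Age count.
n_rows = 891
missing_age = 177
groups = {
    "Non-survivor": {"count": 424, "mean": 30.63},
    "Survivor": {"count": 290, "mean": 28.34},
}

# Rows with a recorded age should equal total rows minus missing ages.
n_with_age = sum(g["count"] for g in groups.values())
assert n_with_age == n_rows - missing_age

# Count-weighted average of the two group means approximates the overall mean age
# (approximate because the group means are already rounded to two decimals).
overall_mean_age = sum(g["count"] * g["mean"] for g in groups.values()) / n_with_age
print(n_with_age, round(overall_mean_age, 2))
```

The check confirms that both the histogram and the summary describe the 714 passengers with a known age, not all 891.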

how many missing values are there and which columns?
Thinking

Assessing missing values

I need to answer based on the notebook state since the user asked about missing values. They want to know how many there are and which columns are affected. It sounds like I should provide a concise code block in markdown format for this.

I think the code should count missing values per column and filter for those greater than zero. I could create a missing_df to summarize this, and then I’ll keep it simple with just one code block. Let's get right to crafting that!

Let’s check missing values by column first.

Summarize missing values
missing_df = (
    df.isna()
    .sum()
    .reset_index()
    .rename(columns={"index": "column", 0: "missing_values"})
)

missing_df = missing_df[missing_df["missing_values"] > 0].sort_values("missing_values", ascending=False).reset_index(drop=True)
missing_df
column missing_values
0 Cabin 687
1 Age 177
2 Embarked 2

There are 866 missing values in total across 3 columns.

The columns with missing values are:

  • Cabin: 687 missing
  • Age: 177 missing
  • Embarked: 2 missing

This means most of the missing data is concentrated in Cabin, with a moderate amount in Age, and almost none in Embarked.
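Counts are easier to judge next to the share of rows they represent. The sketch below is an assumption-laden illustration: `missing_summary` is a hypothetical helper, the toy frame holds made-up values, and the last step simply divides the reported counts by the 891 rows.

```python
import numpy as np
import pandas as pd

def missing_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Count and percentage of missing values per column, largest first."""
    summary = pd.DataFrame({
        "missing_values": frame.isna().sum(),
        "missing_percent": (frame.isna().mean() * 100).round(2),
    })
    summary = summary[summary["missing_values"] > 0]
    return summary.sort_values("missing_values", ascending=False)

# Toy frame for illustration only; the values are made up.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Fare": [7.25, 71.28, 7.93, 53.10],
})
summary = missing_summary(toy)
print(summary)

# Same arithmetic applied to the counts reported above (891 rows in total).
reported = {"Cabin": 687, "Age": 177, "Embarked": 2}
reported_percent = {col: round(n / 891 * 100, 2) for col, n in reported.items()}
print(reported_percent)
```

On the loaded dataset, `missing_summary(df)` would return the same three columns with their percentages attached.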
