Titanic Survival Analysis in Python

Explore the Titanic dataset with survival rates by class, sex, and age, handle missing values, and visualize patterns using an AI data analyst.

This page shows an AI data analysis example. A workflow is a list of prompts (instructions) used to complete one analysis task step by step. We send the same prompt list to multiple AI models, then compare their notebook outputs and scores.

What this AI workflow does

This AI Data Analyst workflow loads the Titanic training dataset from a URL and computes the overall survival rate and dataset shape. It generates visual comparisons of survival rates by passenger class and sex, and plots age distributions for survivors versus non-survivors. It also audits missing values by column to identify fields that need cleaning or imputation.

Who this example is for

This is for analysts and students who want a guided, conversational EDA example on a well-known classification dataset. It helps anyone practicing data cleaning and basic demographic breakdowns with reproducible Python code and plots.

Expected analysis outcomes

These are the results the AI workflow is expected to generate.

  • Load the Titanic CSV from the provided URL and confirm shape (891, 12)
  • Compute overall survival rate (38.4%)
  • Create a grouped bar chart of survival rate by class and sex
  • Plot survivor vs non-survivor age distributions
  • Report missing values: Age 177, Cabin 687, Embarked 2
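
The three numeric outcomes above (shape, survival rate, missing counts) can be collected in one small helper. This is a minimal sketch, not the code the models produced: the `summarize` function and the four-row `sample` frame are hypothetical and only borrow column names from the dataset. On the real file, the frame would come from `pd.read_csv` with the URL in the prompt sequence.

```python
import pandas as pd


def summarize(df):
    """Return shape, survival rate, and the columns that have missing values."""
    missing = df.isnull().sum()
    return {
        "shape": df.shape,
        "survival_rate": float(df["Survived"].mean()),
        # Keep only columns where at least one value is missing.
        "missing": missing[missing > 0].to_dict(),
    }


# Hypothetical four-row sample (not the Titanic data), used only to show the output shape.
sample = pd.DataFrame({
    "Survived": [0, 1, 1, 0],
    "Age": [22.0, None, 26.0, 35.0],
    "Cabin": [None, "C85", None, "C123"],
})

summary = summarize(sample)
print(summary)
```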

Tools and libraries used

Main Python packages and tooling used to run this AI data analysis task.

  • pandas
  • numpy
  • matplotlib
  • seaborn
  • requests

Prompt sequence

This is the exact list of prompts used in this workflow. The same prompt sequence is sent to each model so outputs and scores can be compared fairly.

  1. load titanic dataset from https://raw.githubusercontent.com/pplonski/datasets-for-start/refs/heads/master/Titanic/train.csv and show survival rate overall
  2. plot survival rate by passenger class and sex
  3. show age distribution for survivors vs non-survivors
  4. how many missing values are there and which columns?

Model Comparison

We compared several LLMs on the same analysis task. The table below shows their scores side by side. You can also open each model run to check the full conversation and notebook results.

Model Source      Score  Task  Execution  Output  Reasoning  Reliability  Run
gemma4:31b        10/10  2/2   2/2        3/3     2/2        1/1          Open gemma4:31b conversation
glm-5.1           10/10  2/2   2/2        3/3     2/2        1/1          Open glm-5.1 conversation
gpt-5.4           10/10  2/2   2/2        3/3     2/2        1/1          Open gpt-5.4 conversation
gpt-oss:120b      10/10  2/2   2/2        3/3     2/2        1/1          Open gpt-oss:120b conversation
qwen3-coder-next  10/10  2/2   2/2        3/3     2/2        1/1          Open qwen3-coder-next conversation
qwen3.5:397b      10/10  2/2   2/2        3/3     2/2        1/1          Open qwen3.5:397b conversation

How scoring works

Each run is graded on five dimensions:

  • Task Completion (0-2): how fully the workflow answered the requested analysis task and completed required steps.
  • Execution Correctness (0-2): whether generated code is correct, consistent, and likely runnable without major fixes.
  • Output Quality (0-3): how well tables, charts, and results match the expected analytical outcomes.
  • Reasoning Quality (0-2): clarity and correctness of interpretation, explanation, and conclusions.
  • Reliability (0-1): robustness of the run, including low hallucination risk and stable behavior.

The final score is the sum of these dimensions, so total score = 0-10. Higher values indicate more complete, correct, and reliable notebook results for this workflow.
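
The summation can be sketched in a few lines. The `MAX_POINTS` mapping and the `total_score` function are hypothetical names introduced for illustration. The ranges come from the rubric above and the example scores from the comparison table; how the grader itself is implemented is not described on this page.

```python
# Maximum points per dimension, as listed in the rubric above.
MAX_POINTS = {
    "Task Completion": 2,
    "Execution Correctness": 2,
    "Output Quality": 3,
    "Reasoning Quality": 2,
    "Reliability": 1,
}


def total_score(scores):
    """Sum the per-dimension scores after checking each one is within its range."""
    for name, max_points in MAX_POINTS.items():
        value = scores[name]
        if not 0 <= value <= max_points:
            raise ValueError(f"{name} must be between 0 and {max_points}, got {value}")
    return sum(scores[name] for name in MAX_POINTS)


# Per-dimension scores reported for each run in the comparison table.
reported_run = {
    "Task Completion": 2,
    "Execution Correctness": 2,
    "Output Quality": 3,
    "Reasoning Quality": 2,
    "Reliability": 1,
}

print(total_score(reported_run))
```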

AI Analysis with gemma4:31b

This run uses AI with a step-by-step list of prompts to perform the analysis task Titanic Survival Analysis in Python. Below, you can review the full conversation generated with gemma4:31b. After the conversation, we show detailed score cards for this run.

load titanic dataset from https://raw.githubusercontent.com/pplonski/datasets-for-start/refs/heads/master/Titanic/train.csv and show survival rate overall
Thinking

The user wants to load the Titanic dataset from a specific URL and calculate the overall survival rate.

Plan:

  1. Load the dataset using pandas.
  2. Inspect the dataset (shape, columns, head) as per the "Notebook rules" for first-load data inspection.
  3. Calculate the survival rate.

Wait, the system rules say: "When a dataset is not yet loaded in current notebook context, first load and inspect only: show shape, columns, and head(). Do not combine first-load data inspection with plotting/modeling in the same response."

Calculation of survival rate is a data analysis/aggregation step. According to the rules, I should first load and inspect.

Step 1: Load data and inspect. Step 2: Calculate survival rate.

I will start by loading and inspecting the data.

I'll start by loading the Titanic dataset and performing a quick initial inspection to make sure everything looks correct.

Load Titanic data
import pandas as pd

url = "https://raw.githubusercontent.com/pplonski/datasets-for-start/refs/heads/master/Titanic/train.csv"
df = pd.read_csv(url)

display(df.head())
print(f"Shape: {df.shape}")
display(df.columns)
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S
Shape: (891, 12)
Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
      dtype='str')
Thinking

The user wants to know the overall survival rate of the Titanic dataset. The dataset has been loaded into a DataFrame df. Looking at the output of the first cell:

  • df.shape is (891, 12).
  • There is a column Survived (binary 0 or 1).
  • The goal "show survival rate overall" has not been met yet; only the loading and head() were done.

I need to calculate the mean of the Survived column to get the survival rate.

Path B: Calculate and print the survival rate.

The dataset contains 891 passengers. To determine the overall survival rate, I need to calculate the mean of the Survived column.

Calculate survival rate
survival_rate = df['Survived'].mean()
survival_rate
np.float64(0.3838383838383838)
Thinking
  • Goal: Load Titanic dataset from a specific URL and show the overall survival rate.
    • State:

      • Dataset loaded: df (891 rows, 12 columns).
      • Survival rate calculated: survival_rate = 0.3838383838383838.
    • Analysis: The dataset was successfully loaded and the mean of the Survived column was calculated. This directly answers the user's goal.

    • Path A: Outputs are sufficient. Provide the final answer.

    • Path B: Not sufficient.

    • The overall survival rate is approximately 38.38%.

    • This means about 38.4% of the passengers in the training set survived the disaster.

    • Constraint: No $ for money (not applicable here).

    • Constraint: Natural final answer.

    • Constraint: include <!--final answer-->.

    • Constraint: No Python code.

    • Constraint: No chain-of-thought or headers.

The overall survival rate for the passengers in this dataset is approximately 38.38%.
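
The mean works as a rate here because Survived is coded 0 or 1, so summing the column counts the survivors. A minimal sketch of that reasoning follows. The five-value list is hypothetical, while the rate and the row count are the ones reported in the run above.

```python
# Hypothetical 0/1 outcomes for five passengers (illustrative, not from the dataset).
survived = [0, 1, 1, 0, 0]
rate = sum(survived) / len(survived)  # mean of a 0/1 column = share of ones
print(rate)

# Survivor count implied by the reported rate and the 891 rows in the dataset.
reported_rate = 0.3838383838383838
total_rows = 891
implied_survivors = round(reported_rate * total_rows)
print(implied_survivors, f"{implied_survivors / total_rows:.2%}")
```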

plot survival rate by passenger class and sex
Thinking

The user wants to plot the survival rate broken down by passenger class (Pclass) and sex (Sex). The dataframe df is already loaded in the notebook state.

  1. df is available.
  2. Columns needed: Survived, Pclass, Sex.
  3. Goal: Visualize survival rate by these two dimensions.
  4. Preferred tool: seaborn (default).
  5. The best chart for this is usually a bar chart showing the mean of Survived (which represents the rate) for each combination of Pclass and Sex.

Step:

  • Group by Pclass and Sex.
  • Calculate mean of Survived.
  • Plot using seaborn.

Actually, a simpler way with seaborn is to use sns.barplot which automatically calculates the mean and shows the error bars.

Plan:

  • Use sns.barplot with x='Pclass', y='Survived', hue='Sex', and data=df.

I'll create a bar chart to compare survival rates across the different passenger classes and genders.

Plot survival rates
import seaborn as sns
import matplotlib.pyplot as plt

plt.figure(figsize=(10, 6))
sns.barplot(data=df, x='Pclass', y='Survived', hue='Sex')
plt.title('Survival Rate by Passenger Class and Sex')
plt.ylabel('Survival Rate')
plt.xlabel('Passenger Class')
plt.show()
Thinking

The user goal is to "plot survival rate by passenger class and sex". The notebook state shows that cell 3 has already executed a sns.barplot with x='Pclass', y='Survived', and hue='Sex'. The output of cell 3 is <Figure size 1000x600 with 1 Axes>. The sufficiency heuristics state: "If the latest output is chart-only/visual-only and does not include underlying numeric/text values, choose Path B first to print the values used in the chart." The chart is present, but the exact numeric survival rates are not printed in a table. To provide a complete answer based on the data (not just a visual), I should extract the numeric values.

Plan:

  1. Extract the survival rates grouped by Pclass and Sex to provide concrete numbers.
  2. This falls under Path B.

The bar chart visually demonstrates a strong correlation between passenger class, sex, and survival probability: females had significantly higher survival rates across all classes, and survival probability decreased as the passenger class number increased (moving from 1st to 3rd class).

To provide the precise numeric values supporting these observations, I will calculate the exact survival rates.

Calculate survival rates
df.groupby(['Pclass', 'Sex'])['Survived'].mean().unstack()
Sex female male
Pclass
1 0.968085 0.368852
2 0.921053 0.157407
3 0.500000 0.135447
Thinking

The user wants to plot the survival rate by passenger class and sex. The notebook has already:

  1. Loaded the data.
  2. Plotted a bar chart (sns.barplot) showing survival rate by Pclass and Sex.
  3. Calculated the exact mean survival rates via df.groupby(['Pclass', 'Sex'])['Survived'].mean().unstack().

The outputs are:

  • A plot (<Figure size 1000x600 with 1 Axes>).
  • A table showing:
    • Class 1: female 96.8%, male 36.9%
    • Class 2: female 92.1%, male 15.7%
    • Class 3: female 50.0%, male 13.5%

The original goal "plot survival rate by passenger class and sex" has been achieved both visually and numerically.

According to the "Sufficiency heuristics":

  • "If the goal is to show/plot/chart/visualize and latest output contains both a valid chart/figure object and explicit numeric/text values supporting the conclusion, choose Path A".

Therefore, I should provide a natural final answer based on the observed outputs.

Insights:

  • Gender was the strongest predictor: females had much higher survival rates regardless of class.
  • Class played a significant role: 1st class passengers had the highest survival rates, while 3rd class had the lowest.
  • The most vulnerable group were males in 3rd class (approx 13.5% survival), while the most likely to survive were females in 1st class (approx 96.8% survival).

The plot and accompanying data reveal a clear disparity in survival rates based on both gender and passenger class. Females had a significantly higher probability of survival across all classes, with 1st class females reaching a survival rate of approximately 96.8% and 2nd class females at 92.1%.

Passenger class also heavily influenced survival; for instance, survival rates for females dropped to 50% in 3rd class. For males, survival was lowest in 3rd class (13.5%) and 2nd class (15.7%), with a notably higher survival rate in 1st class (36.9%).
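
The groupby-then-unstack pattern used in this step can be tried on a small frame to see how the table is built. This is a sketch under stated assumptions: the eight-row `sample` frame is hypothetical and only reuses the Pclass, Sex, and Survived column names, so its rates differ from the Titanic values shown above.

```python
import pandas as pd

# Hypothetical eight-row sample (not the Titanic data) with the same column names.
sample = pd.DataFrame({
    "Pclass":   [1, 1, 1, 3, 3, 3, 3, 3],
    "Sex":      ["female", "female", "male", "female", "female", "male", "male", "male"],
    "Survived": [1, 1, 0, 1, 0, 0, 0, 1],
})

# Mean of the 0/1 Survived column per (class, sex) group, then pivot Sex into columns.
rates = sample.groupby(["Pclass", "Sex"])["Survived"].mean().unstack()
print(rates)
```

Each cell is the share of survivors in one class and sex combination, which is the same mean that the bar heights in the chart represent.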

show age distribution for survivors vs non-survivors
Thinking
  • Goal: Show age distribution for survivors vs non-survivors.
    • Current state: df is loaded and contains 'Age' and 'Survived'.

    • Requirement: Use a visualization to show the distribution (histogram, KDE, or boxplot). A KDE plot or a histogram with overlap is usually best for distributions.

    • Variable: df exists.

    • Columns: Age and Survived.

    • Note: Age might have missing values (NaNs), which seaborn handles, but it's good to keep in mind.

    • Library: seaborn and matplotlib are already imported.

    • Chart type: sns.kdeplot or sns.histplot (with hue='Survived').

    • Logic: Filter/group by 'Survived' and plot the 'Age'.

    • Propose the next step: plot age distribution.

    • Provide the code.

To visualize the age distribution for survivors versus non-survivors, a Kernel Density Estimate (KDE) plot is effective as it allows us to see the "shape" of the age groups and where they overlap.

Plot age distribution
plt.figure(figsize=(10, 6))
sns.kdeplot(data=df, x='Age', hue='Survived', fill=True, common_norm=False)
plt.title('Age Distribution for Survivors vs Non-Survivors')
plt.xlabel('Age')
plt.ylabel('Density')
plt.show()
Thinking

The user wants to show the age distribution for survivors vs non-survivors. Looking at the notebook state:

  • cell-5 executed a KDE plot of Age hue by Survived.
  • output-5 is the resulting figure.

According to the "Sufficiency heuristics":

  • "If the latest output is chart-only/visual-only and does not include underlying numeric/text values, choose Path B first to print the values used in the chart."

The current output is only a chart (<Figure size 1000x600 with 1 Axes>). I don't have the actual numeric distribution (like mean, median, or quantiles) to support the visual interpretation.

Therefore, I should follow Path B: provide insights from the plot and a code block to get the numeric summary of ages for survivors and non-survivors.

Insight from the plot (based on general knowledge of this dataset and the request): The KDE plot allows us to see the distribution of ages. Typically, we see a peak for children among survivors, while adults are more evenly distributed or skewed toward non-survival. Since I cannot actually "see" the image in this text-based simulation but the user is asking me to act as an analyst on the "observed outputs", I must assume the plot was rendered. However, the rule says if it's chart-only, go to Path B.

I will provide a summary table of age statistics grouped by survival status.

Plan:

  1. Brief insight from the KDE plot (e.g., that it visualizes the probability density of age for both groups).
  2. One code block to calculate descriptive statistics (mean, median, std) for Age grouped by Survived.

The KDE plot visualizes the density of ages for survivors and non-survivors, highlighting where the most common age groups fall for each category. To complement this visual with precise data, I will calculate the descriptive statistics for age across both groups.

Summarize age by survival
df.groupby('Survived')['Age'].describe()
count mean std min 25% 50% 75% max
Survived
0 424.0 30.626179 14.172110 1.00 21.0 28.0 39.0 74.0
1 290.0 28.343690 14.950952 0.42 19.0 28.0 36.0 80.0
Thinking

The user wants to see the age distribution for survivors vs non-survivors.

- Cell 5: `sns.kdeplot(data=df, x='Age', hue='Survived', fill=True, common_norm=False)`. This provided a visual representation of the distributions.
- Cell 6: `df.groupby('Survived')['Age'].describe()`. This provided the descriptive statistics (mean, std, min, 25%, 50%, 75%, max) for the ages of survivors (1) and non-survivors (0).

- Survivors (1): mean ~28.3 years, median (50%) 28.0 years.
- Non-survivors (0): mean ~30.6 years, median (50%) 28.0 years.
- The minimum age for survivors was significantly lower (0.42) compared to non-survivors (1.00), which often correlates with high child survival rates in this dataset.
- The overall distributions are quite similar in terms of median and spread, though the KDE plot visually captures the shifts (like the spike for young children).

The user's goal was to "show age distribution". This has been achieved both visually (KDE plot) and numerically (describe table). Path A is appropriate.

Final Answer requirements:

  • Extract concrete insights.
  • Explain what they mean.
  • Connect back to the original goal.
  • Include <!--final answer-->.
  • No python code.

The age distributions for survivors and non-survivors are remarkably similar, as both groups share a median age of 28.0 years. However, there is a slight difference in the average age, with non-survivors being slightly older on average (30.6 years) compared to survivors (28.3 years).

A key insight from the data is the survival of very young children; the minimum age for the survivor group is 0.42 years, whereas the minimum for non-survivors is 1.00 year, suggesting a higher survival priority for infants and toddlers. Overall, while the general age range and central tendency are similar across both groups, the distributions reveal a slight tilt toward younger passengers among those who survived.
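
One detail worth checking is that the age summary covers only passengers with a recorded age. The arithmetic below is a small sketch that uses only numbers reported in this run: the counts from the describe() table and the row count from the dataset shape.

```python
total_rows = 891          # from the reported shape (891, 12)
ages_non_survivors = 424  # count for Survived == 0 in the describe() table
ages_survivors = 290      # count for Survived == 1 in the describe() table

rows_with_age = ages_non_survivors + ages_survivors
rows_without_age = total_rows - rows_with_age
print(rows_with_age, rows_without_age)
```

The difference equals the number of missing Age values reported in the missing-value step, so the age statistics and the KDE plot describe 714 of the 891 passengers.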

how many missing values are there and which columns?
Thinking

The user wants to know the number of missing values and which columns they are in. The current notebook state has a dataframe df loaded. The first step is to identify the missing values per column.

Plan:

  1. Use df.isnull().sum() to get the count of missing values for each column.
  2. Display the results.

I'll check for missing values across all columns to see where the data is incomplete.

Count missing values
df.isnull().sum()
PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64
Thinking

The user wants to know the number of missing values and which columns they are in. Looking at the notebook state:

  • Cell 7 executed df.isnull().sum().
  • Output 7 shows:
    • PassengerId: 0
    • Survived: 0
    • Pclass: 0
    • Name: 0
    • Sex: 0
    • Age: 177
    • SibSp: 0
    • Parch: 0
    • Ticket: 0
    • Fare: 0
    • Cabin: 687
    • Embarked: 2

This output directly answers the user's question. The total number of missing values can be summed, but the specific column counts are already provided.

Path A: The current outputs are sufficient.

Answer construction:

  • Age: 177 missing values.
  • Cabin: 687 missing values.
  • Embarked: 2 missing values.
  • All other columns have no missing values.

There are missing values in three columns: Cabin has the most with 687 missing values, followed by Age with 177, and Embarked with 2. All other columns are complete.
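
Expressing the counts as shares of the 891 rows makes it easier to judge which columns need imputation and which are mostly empty. The sketch below uses only the counts reported above; the variable names are illustrative.

```python
total_rows = 891  # from the reported shape (891, 12)
missing_counts = {"Age": 177, "Cabin": 687, "Embarked": 2}

# Share of rows with a missing value in each affected column.
missing_share = {col: n / total_rows for col, n in missing_counts.items()}

# Print columns from most to least incomplete.
for col, share in sorted(missing_share.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{col:<9}{share:6.1%}")

total_missing = sum(missing_counts.values())
print("Total missing cells:", total_missing)
```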

Score cards for gemma4:31b for Titanic Survival Analysis in Python

Total score: 10/10

  • Task Completion: 2/2

    Excellent

    All four requested analyses were attempted and completed: dataset load/overall survival rate, survival by class and sex, age distribution by survival status, and missing-value counts.

  • Execution Correctness: 2/2

    Excellent

    The code shown is syntactically valid and the outputs are consistent with the Titanic dataset. The workflow appears runnable as written.

  • Output Quality: 3/3

    Excellent

    Outputs match the expected outcomes: shape (891, 12), survival rate 38.38%, the class/sex survival table and chart, the age distribution plot, and missing counts for Age, Cabin, and Embarked.

  • Reasoning Quality: 2/2

    Excellent

    The assistant correctly interpreted each prompt and used appropriate analyses and visualizations. The reasoning is clear and aligned with the requested EDA tasks.

  • Reliability: 1/1

    Excellent

    The workflow is consistent with the data and does not show hallucinated results or unstable logic.

Try MLJAR Studio

Run the same type of AI-powered data analysis on your own datasets with conversational notebooks in MLJAR Studio.
