Titanic Survival Analysis in Python

Explore the Titanic dataset with survival rates by class, sex, and age, handle missing values, and visualize patterns using an AI data analyst.

This page shows an AI data analysis example. A workflow is a list of prompts (instructions) used to complete one analysis task step by step. We send the same prompt list to multiple AI models, then compare their notebook outputs and scores.

What this AI workflow does

This AI Data Analyst workflow loads the Titanic training dataset from a URL and computes the overall survival rate and dataset shape. It generates visual comparisons of survival rates by passenger class and sex, and plots age distributions for survivors versus non-survivors. It also audits missing values by column to identify fields that need cleaning or imputation.

Who this example is for

This is for analysts and students who want a guided, conversational EDA example on a well-known classification dataset. It helps anyone practicing data cleaning and basic demographic breakdowns with reproducible Python code and plots.

Expected analysis outcomes

These are the results the AI workflow is expected to generate.

  • Load the Titanic CSV from the provided URL and confirm shape (891, 12)
  • Compute overall survival rate (38.4%)
  • Create a grouped bar chart of survival rate by class and sex
  • Plot survivor vs non-survivor age distributions
  • Report missing values: Age 177, Cabin 687, Embarked 2

Tools and libraries used

Main Python packages and tooling used to run this AI data analysis task.

  • pandas
  • numpy
  • matplotlib
  • seaborn
  • requests

Prompt sequence

This is the exact list of prompts used in this workflow. The same prompt sequence is sent to each model so outputs and scores can be compared fairly.

  1. load titanic dataset from https://raw.githubusercontent.com/pplonski/datasets-for-start/refs/heads/master/Titanic/train.csv and show survival rate overall
  2. plot survival rate by passenger class and sex
  3. show age distribution for survivors vs non-survivors
  4. how many missing values are there and which columns?

Model Comparison

We compared several LLMs on the same analysis task. The table below shows their scores side by side. You can also open each model run to check the full conversation and notebook results.

Model Source      Score  Task  Execution  Output  Reasoning  Reliability  Run
gemma4:31b        10/10  2/2   2/2        3/3     2/2        1/1          Open gemma4:31b conversation
glm-5.1           10/10  2/2   2/2        3/3     2/2        1/1          Open glm-5.1 conversation
gpt-5.4           10/10  2/2   2/2        3/3     2/2        1/1          Open gpt-5.4 conversation
gpt-oss:120b      10/10  2/2   2/2        3/3     2/2        1/1          Open gpt-oss:120b conversation
qwen3-coder-next  10/10  2/2   2/2        3/3     2/2        1/1          Open qwen3-coder-next conversation
qwen3.5:397b      10/10  2/2   2/2        3/3     2/2        1/1          Open qwen3.5:397b conversation

How scoring works

Each run is graded on five dimensions:

  • Task Completion (0-2): how fully the workflow answered the requested analysis task and completed required steps.
  • Execution Correctness (0-2): whether generated code is correct, consistent, and likely runnable without major fixes.
  • Output Quality (0-3): how well tables, charts, and results match the expected analytical outcomes.
  • Reasoning Quality (0-2): clarity and correctness of interpretation, explanation, and conclusions.
  • Reliability (0-1): robustness of the run, including low hallucination risk and stable behavior.

The final score is the sum of these dimensions, so total score = 0-10. Higher values indicate more complete, correct, and reliable notebook results for this workflow.
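
As a rough illustration of how the total is assembled, the sketch below sums the five dimension scores and checks each one against its maximum from the rubric. The `MAX_POINTS` dictionary and the `total_score` helper are illustrative names chosen here; they are assumptions, not part of the actual grading tooling.

```python
# Maximum points per dimension, taken from the rubric above.
MAX_POINTS = {
    "task_completion": 2,
    "execution_correctness": 2,
    "output_quality": 3,
    "reasoning_quality": 2,
    "reliability": 1,
}


def total_score(scores: dict) -> int:
    """Sum the dimension scores after validating them (hypothetical helper)."""
    if set(scores) != set(MAX_POINTS):
        raise ValueError("scores must cover exactly the five rubric dimensions")
    for name, value in scores.items():
        if not 0 <= value <= MAX_POINTS[name]:
            raise ValueError(f"{name} must be between 0 and {MAX_POINTS[name]}")
    return sum(scores.values())


# Example: a run that earns full marks on every dimension.
perfect_run = dict(MAX_POINTS)
print(total_score(perfect_run))  # 10
```

The maxima add up to 10, which is why a run with full marks in every dimension appears as 10/10 in the comparison table.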

AI Analysis with qwen3.5:397b

This run uses AI with a step-by-step list of prompts to perform the analysis task Titanic Survival Analysis in Python. Below, you can review the full conversation generated with qwen3.5:397b. After the conversation, we show detailed score cards for this run.

load titanic dataset from https://raw.githubusercontent.com/pplonski/datasets-for-start/refs/heads/master/Titanic/train.csv and show survival rate overall
Thinking

The user wants me to load the Titanic dataset from a URL and show the overall survival rate. I'll do this in one step - load the data and calculate the survival rate.

Let me write code to:

  1. Load the CSV from the URL
  2. Show basic info (shape, columns, head)
  3. Calculate and display the overall survival rate

I'll keep this as one code block since it's all related to loading and initial inspection.

I'll load the Titanic dataset and calculate the overall survival rate.

Load and summarize Titanic data
import pandas as pd

# Load Titanic dataset from URL
url = "https://raw.githubusercontent.com/pplonski/datasets-for-start/refs/heads/master/Titanic/train.csv"
df = pd.read_csv(url)

# Show basic info
print(f"Dataset shape: {df.shape}")
print(f"\nColumns: {list(df.columns)}")
print(f"\nFirst 5 rows:")
df.head()

# Calculate overall survival rate
survival_rate = df['Survived'].mean()
print(f"\nOverall survival rate: {survival_rate:.2%}")
Dataset shape: (891, 12)

Columns: ['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp', 'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked']

First 5 rows:

Overall survival rate: 38.38%
Thinking

The user asked to load the Titanic dataset and show the survival rate overall. Looking at the notebook outputs:

  1. Dataset loaded successfully: 891 rows, 12 columns
  2. Columns identified: PassengerId, Survived, Pclass, Name, Sex, Age, SibSp, Parch, Ticket, Fare, Cabin, Embarked
  3. Overall survival rate calculated: 38.38%

The original goal has been fully achieved:

  • ✓ Dataset loaded from the specified URL
  • ✓ Overall survival rate shown (38.38%)

This is a clear Path A situation - the outputs are sufficient to answer the original goal. I should provide a natural final answer with the key insight and mark it as complete.

The Titanic dataset has been loaded successfully with 891 passengers and 12 columns.

Overall Survival Rate: 38.38%

This means that out of the 891 passengers in this training dataset, approximately 342 passengers survived the disaster, while 549 did not. Less than 4 out of 10 passengers survived.
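
Because `Survived` is coded as 0 or 1, its mean is the share of survivors, which is why `df['Survived'].mean()` gives the survival rate directly. The sketch below rebuilds only that column from the counts quoted above (342 survivors, 549 non-survivors) instead of downloading the CSV. The `survived` Series is a stand-in built for illustration, not the real dataset.

```python
import pandas as pd

# Stand-in for df['Survived']: 342 ones and 549 zeros, the counts quoted above.
survived = pd.Series([1] * 342 + [0] * 549, name="Survived")

rate = survived.mean()             # mean of a 0/1 column = fraction of ones
n_survivors = int(survived.sum())  # sum of a 0/1 column = number of ones

print(f"{len(survived)} passengers, {n_survivors} survivors")
print(f"Survival rate: {rate:.2%}")
```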

plot survival rate by passenger class and sex
Thinking

The user wants to visualize survival rate by passenger class (Pclass) and sex (Sex). I'll create a grouped bar chart showing survival rates across different combinations of Pclass and Sex. This is a classic Titanic analysis visualization.

I'll use seaborn for this since it's the default chart library and works well for categorical comparisons.

I'll create a visualization showing survival rates broken down by both passenger class and sex.

Notebook code
import seaborn as sns
import matplotlib.pyplot as plt

# Set figure size
plt.figure(figsize=(10, 6))

# Create grouped bar chart: survival rate by Pclass and Sex
sns.barplot(data=df, x='Pclass', y='Survived', hue='Sex', 
            estimator=lambda x: x.mean() * 100, errorbar=None,
            palette={'male': '#3498db', 'female': '#e74c3c'})

plt.title('Survival Rate by Passenger Class and Sex', fontsize=14)
plt.xlabel('Passenger Class', fontsize=12)
plt.ylabel('Survival Rate (%)', fontsize=12)
plt.legend(title='Sex', loc='upper right')
plt.ylim(0, 100)
plt.grid(axis='y', alpha=0.3)

plt.show()
Thinking

The user wants me to analyze the notebook outputs against the original goal of plotting survival rate by passenger class and sex.

Looking at the outputs:

  • Cell-1 loaded the dataset and calculated overall survival rate (38.38%)
  • Cell-2 created a bar chart showing survival rate by Pclass and Sex

The chart was created (output shows a Figure object), but I don't see the actual numeric values from the chart. According to the sufficiency heuristics:

  • "If the latest output is chart-only/visual-only and does not include underlying numeric/text values, choose Path B first to print the values used in the chart."

So I should choose Path B and provide a code block that shows the actual survival rate numbers by Pclass and Sex in table format, so the user can see the concrete values behind the visualization.

Let me write minimal code to display the survival rates by Pclass and Sex as a table.

The bar chart has been created showing survival rates by passenger class and sex. However, to provide complete insights, I should show the actual numeric values behind the visualization.

Show survival rates by class and sex
# Show survival rates by Pclass and Sex as a table
survival_by_class_sex = df.groupby(['Pclass', 'Sex'])['Survived'].mean().unstack() * 100
survival_by_class_sex.round(2)
Sex     female   male
Pclass
1        96.81  36.89
2        92.11  15.74
3        50.00  13.54
Thinking

The user's original goal was to "plot survival rate by passenger class and sex". Looking at the notebook state and outputs:

  1. Cell-1: Loaded the Titanic dataset and showed basic info including overall survival rate (38.38%)
  2. Cell-2: Created a bar chart showing survival rate by Pclass and Sex
  3. Cell-3: Displayed a table with the actual numeric survival rates by Pclass and Sex

The outputs show:

  • A bar chart was created (output-2 shows a Figure object)
  • A table with exact survival rates is available (output-3):
    • Class 1: female 96.81%, male 36.89%
    • Class 2: female 92.11%, male 15.74%
    • Class 3: female 50.00%, male 13.54%

The original goal was to "plot survival rate by passenger class and sex" - this has been accomplished with:

  1. A visual bar chart (cell-2)
  2. Numeric values in table form (cell-3)

Both the visualization and the underlying data are now available. The goal is complete - we have both the plot and the numeric breakdown showing survival rates by passenger class and sex.

This qualifies for Path A: the outputs are sufficient to answer the original goal. I should provide a natural final answer with insights from the observed outputs and include <!--final answer-->, without any Python code block.

The visualization and data analysis are complete. Here are the key findings from the survival rate breakdown by passenger class and sex:

Key Insights:

  1. Sex is the strongest predictor: Across all classes, females had significantly higher survival rates than males.

  2. Class matters most for males:

    • 1st class males: 36.89% survival
    • 2nd class males: 15.74% survival
    • 3rd class males: 13.54% survival

    Male survival dropped dramatically from 1st to lower classes.

  3. Female survival was high across all classes, but still showed class disparity:

    • 1st class females: 96.81% survival
    • 2nd class females: 92.11% survival
    • 3rd class females: 50.00% survival

    Notably, 3rd class female survival (50%) was still higher than 1st class male survival (36.89%).

  4. The "women and children first" policy is evident: Even 3rd class females had better odds than 1st class males, suggesting gender was prioritized over class during evacuation.

The bar chart visualizes these disparities clearly, showing how both socioeconomic status (class) and gender influenced survival outcomes on the Titanic.
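
The class-by-sex table in this run comes from `groupby(...).mean().unstack()`. An equivalent way to get the same layout is `pivot_table`, sketched here on a tiny made-up frame. The eight rows in `toy` are invented for illustration and are not Titanic records.

```python
import pandas as pd

# Tiny invented sample with the same column names as the Titanic data.
toy = pd.DataFrame({
    "Pclass":   [1, 1, 1, 1, 3, 3, 3, 3],
    "Sex":      ["female", "female", "male", "male",
                 "female", "female", "male", "male"],
    "Survived": [1, 1, 1, 0, 1, 0, 0, 0],
})

# Approach used in the run: group, average the 0/1 flag, pivot Sex into columns.
by_group = toy.groupby(["Pclass", "Sex"])["Survived"].mean().unstack() * 100

# Alternative: pivot_table with aggfunc="mean" builds the same grid in one call.
by_pivot = toy.pivot_table(index="Pclass", columns="Sex",
                           values="Survived", aggfunc="mean") * 100

print(by_group)
print(by_pivot)
```

Either form works; `pivot_table` may read more naturally when the goal is a two-way table from the start.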

show age distribution for survivors vs non-survivors
Thinking

The user wants to see the age distribution for survivors vs non-survivors. This is a new visualization request. I should create a histogram or density plot comparing the age distributions of survivors and non-survivors.

I'll use seaborn to create overlapping histograms or a KDE plot to show the age distribution by survival status. This will help visualize whether age was a factor in survival.

Let me create one code block for this visualization.

I'll create a visualization showing the age distribution for survivors versus non-survivors. This will help us see if age played a role in survival outcomes.

Plot age distribution by survival
import seaborn as sns
import matplotlib.pyplot as plt

# Set figure size
plt.figure(figsize=(10, 6))

# Create histogram with KDE for age distribution by survival status
sns.histplot(data=df, x='Age', hue='Survived', 
             bins=20, kde=True, 
             palette={0: '#e74c3c', 1: '#27ae60'},
             alpha=0.5, element='step')

plt.title('Age Distribution: Survivors vs Non-Survivors', fontsize=14)
plt.xlabel('Age', fontsize=12)
plt.ylabel('Count', fontsize=12)
plt.legend(labels=['Non-Survived', 'Survived'], title='Survival')
plt.grid(axis='y', alpha=0.3)

plt.show()
Thinking

The user wants me to analyze the age distribution for survivors vs non-survivors. Looking at the notebook state, cell-4 was executed and produced a histogram/KDE plot showing age distribution by survival status. The output shows "<Figure size 1000x600 with 1 Axes>" which indicates the chart was created successfully.

However, according to the sufficiency heuristics:

  • "If the latest output is chart-only/visual-only and does not include underlying numeric/text values, choose Path B first to print the values used in the chart."

The current output only shows the figure object, not the actual numeric values (like mean age, median age, age ranges for survivors vs non-survivors). To provide a complete analysis, I should extract and show the underlying statistics.

Let me choose Path B to get the numeric summary statistics for age by survival status, which will complement the visualization and allow for concrete insights.

The age distribution histogram has been created, showing the count of passengers across different age groups separated by survival status. However, to provide concrete insights, I need to see the underlying statistics.

Let me extract the key numeric summaries for age by survival status:

Notebook code
# Show age statistics for survivors vs non-survivors
df.groupby('Survived')['Age'].agg(['count', 'mean', 'median', 'min', 'max', 'std']).round(2)
          count   mean  median   min   max    std
Survived
0           424  30.63    28.0  1.00  74.0  14.17
1           290  28.34    28.0  0.42  80.0  14.95
Thinking

The user wants to see the age distribution for survivors vs non-survivors. Looking at the notebook outputs:

  1. Cell-4 created a histogram visualization showing age distribution by survival status
  2. Cell-5 provided numeric statistics showing:
    • Non-survivors (0): 424 passengers, mean age 30.63, median 28.0, std 14.17
    • Survivors (1): 290 passengers, mean age 28.34, median 28.0, std 14.95

The visualization (histogram with KDE) and the summary statistics together provide a complete picture of the age distribution for survivors vs non-survivors. The key insights are:

  • Both groups have the same median age (28.0 years)
  • Non-survivors have a slightly higher mean age (30.63 vs 28.34)
  • The age ranges are similar (1-74 for non-survivors, 0.42-80 for survivors)
  • Standard deviations are similar (~14-15 years)

This is sufficient to answer the original goal. The histogram shows the distribution shape and the statistics provide the numeric summary. I should provide a final answer summarizing these findings.

The age distribution analysis for survivors vs non-survivors is complete. Here are the key findings:

Age Statistics by Survival Status:

Survival  Count  Mean Age  Median Age  Age Range  Std Dev
No (0)    424    30.63     28.0        1–74       14.17
Yes (1)   290    28.34     28.0        0.42–80    14.95

Key Insights:

  • Median age is identical (28.0 years) for both survivors and non-survivors, suggesting age alone wasn't a strong differentiator for the typical passenger.

  • Non-survivors were slightly older on average (30.63 vs 28.34 years), a difference of about 2.3 years.

  • Similar spread: Both groups have comparable standard deviations (~14–15 years), indicating similar age diversity.

  • Full age range represented: Passengers from infants (0.42 years) to elderly (80 years) were on board, with survivors including both the youngest and oldest passengers.

The histogram visualization shows overlapping distributions with a slight shift toward younger ages for survivors. This suggests that while there was a modest tendency for younger passengers to survive, age was not as decisive a factor as other variables like sex and passenger class (which showed much larger survival rate differences in earlier analysis).
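
One detail worth noting in the statistics above: the two counts add up to 714 rather than 891. Pandas aggregations such as `count`, `mean`, and `median` skip missing values by default, and `Age` is missing for 177 passengers. The sketch below shows that behaviour on a small invented frame (the values in `toy` are illustrative, not Titanic records).

```python
import numpy as np
import pandas as pd

# Invented sample: one missing Age in each survival group.
toy = pd.DataFrame({
    "Survived": [0, 0, 0, 1, 1, 1],
    "Age":      [20.0, 40.0, np.nan, 10.0, np.nan, 30.0],
})

# count and mean ignore NaN, so each group reports only rows with a known Age.
stats = toy.groupby("Survived")["Age"].agg(["count", "mean"])

# size counts every row in the group, whether or not Age is missing.
stats["rows"] = toy.groupby("Survived").size()

print(stats)
```

Comparing `count` with `rows` makes it visible how many passengers in each group were left out of the age statistics.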

how many missing values are there and which columns?
Thinking

The user is asking about missing values in the dataset. I need to check which columns have missing values and how many. This is a straightforward data quality check that should be done with a simple code block.

Let me write code to show missing values count and percentage for each column.

Let me check for missing values in the dataset:

Notebook code
# Show missing values count and percentage per column
missing = df.isnull().sum()
missing_pct = (df.isnull().sum() / len(df) * 100).round(2)
missing_df = pd.DataFrame({'Missing Count': missing, 'Missing %': missing_pct})
missing_df[missing_df['Missing Count'] > 0]
          Missing Count  Missing %
Age                 177      19.87
Cabin               687      77.10
Embarked              2       0.22
Thinking

The user asked about missing values and which columns have them. Looking at the notebook state, cell-6 was just executed and output-6 shows the missing values information clearly:

  • Age: 177 missing values (19.87%)
  • Cabin: 687 missing values (77.10%)
  • Embarked: 2 missing values (0.22%)

This directly answers the user's question. The output is sufficient to provide a final answer. I should choose Path A and provide a natural final answer based on the observed outputs.

Based on the latest output, here's the complete missing values summary:

Columns with Missing Values:

Column    Missing Count  Missing %
Cabin     687            77.10%
Age       177            19.87%
Embarked  2              0.22%

Summary:

  • 3 out of 12 columns have missing values
  • Cabin has the most severe missingness (over three-quarters of the data)
  • Age has a moderate amount of missing data (about 1 in 5 passengers)
  • Embarked has only 2 missing values (negligible)

The dataset contains 891 total records, so these percentages represent the proportion of missing data per column.
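
The run stops at auditing missing values. As one possible follow-up (an assumption on our part, not something the models were asked to do), a common baseline is to fill `Age` with the median, fill `Embarked` with the most frequent value, and replace the sparse `Cabin` column with a flag that records whether a cabin was listed. The sketch uses a small invented frame, and the `HasCabin` column name is hypothetical.

```python
import numpy as np
import pandas as pd

# Invented sample mimicking the three columns that have gaps.
toy = pd.DataFrame({
    "Age":      [22.0, np.nan, 38.0, 30.0],
    "Cabin":    [np.nan, "C85", np.nan, np.nan],
    "Embarked": ["S", "C", None, "S"],
})

cleaned = toy.copy()

# Numeric column: fill gaps with the median of the observed values.
cleaned["Age"] = cleaned["Age"].fillna(cleaned["Age"].median())

# Categorical column with few gaps: fill with the most frequent value.
cleaned["Embarked"] = cleaned["Embarked"].fillna(cleaned["Embarked"].mode()[0])

# Mostly-missing column: keep only whether a value was recorded.
cleaned["HasCabin"] = cleaned["Cabin"].notna().astype(int)
cleaned = cleaned.drop(columns="Cabin")

print(cleaned.isnull().sum())
```

Median and mode imputation are simple defaults; whether they are appropriate depends on what the cleaned data will be used for.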

Score cards for qwen3.5:397b for Titanic Survival Analysis in Python

Total score: 10/10

  • Task Completion: 2/2

    Excellent

    All four requested analyses were completed: dataset loading and survival rate, survival by class and sex, age distribution by survival status, and missing values by column.

  • Execution Correctness: 2/2

    Excellent

    The code is coherent and likely runnable as shown, with valid pandas/seaborn usage and outputs matching the computations.

  • Output Quality: 3/3

    Excellent

    Outputs match the expected outcomes semantically: 38.38% survival, 891x12 shape, correct missing-value counts, and the class/sex and age visualizations with supporting tables.

  • Reasoning Quality: 2/2

    Excellent

    The workflow uses appropriate EDA steps and interprets the results correctly, including the class/sex survival pattern and age comparison.

  • Reliability: 1/1

    Excellent

    The analysis is consistent and grounded in the dataset outputs, with no obvious hallucinations or unsupported claims.
