Machine Learning

Telco Customer Churn Prediction in Python

Analyze the Telco Customer Churn dataset, engineer features, train a random forest classifier, and identify top churn drivers.

What

This AI Data Analyst workflow loads the Telco Customer Churn CSV dataset and computes the overall churn rate. It encodes categorical variables, checks class balance, and prepares features for modeling. It trains a random forest classifier, reports accuracy and a classification report, and plots the top 10 feature importances to highlight churn drivers.

Who

This is for data analysts and data scientists who want a reproducible churn modeling example using a standard telco dataset. It helps practitioners practice preprocessing mixed-type tabular data and interpreting a tree-based classifier with feature importance plots.

Tools

  • pandas
  • numpy
  • scikit-learn
  • matplotlib
  • seaborn

Outcomes

  • Dataset loaded with shape (7043, 21) and churn rate around 26.5%
  • Categorical encoding completed and class imbalance visualized
  • Random forest model trained with accuracy around 0.80 and a classification report
  • Top 10 churn drivers plotted, typically including tenure, MonthlyCharges, and TotalCharges

Quality Score

8/10

gpt-5.2 · mljar-ai

Last scored: Apr 7, 2026

  • Task Completion: 2/2

    Excellent

    All requested steps are present: dataset loaded with churn rate, categorical encoding with class balance, Random Forest training with accuracy, and a top-10 feature importance plot.

  • Execution Correctness: 2/2

    Excellent

    Code is coherent and likely runnable end-to-end: reads CSV, uses get_dummies, performs stratified train/test split, trains RandomForestClassifier, computes accuracy, and plots importances without obvious errors.

  • Output Quality: 2/3

    Good

    Outputs include churn rate table, class balance table, accuracy value, and a feature-importance figure. However, the workflow does not present the actual top-10 feature names/values in text (only a plot image), and encoding includes customerID, which makes the interpretability of the 'drivers' questionable.

  • Reasoning Quality: 2/2

    Excellent

    Reasoning is clear about what each step does and appropriately flags the major issue of one-hot encoding high-cardinality identifiers (customerID) affecting feature importance interpretation.

  • Reliability: 0/1

    Needs work

    The approach is fragile for the stated goal of identifying churn drivers because it encodes customerID (and potentially string-form numeric fields) without cleaning/dropping, producing an extremely high-dimensional matrix that can distort importances.
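The cleanup this rationale calls for (drop the identifier, coerce string-typed numerics before one-hot encoding) can be sketched as a small helper. This is a hedged sketch: `prepare_features` and the `demo` frame are illustrative names run on a tiny synthetic stand-in, not the workflow's actual code.

```python
import pandas as pd

def prepare_features(df):
    """Drop the identifier, coerce string-typed numerics, and one-hot
    encode only the remaining (true) categorical columns."""
    out = df.drop(columns=["customerID"], errors="ignore")
    # TotalCharges ships as text with blank strings; coerce to numeric (blanks -> NaN)
    out["TotalCharges"] = pd.to_numeric(out["TotalCharges"], errors="coerce")
    y = out.pop("Churn").map({"No": 0, "Yes": 1})
    X = pd.get_dummies(out, drop_first=True)
    return X, y

# Tiny synthetic frame reproducing the quirks (ID column, blank TotalCharges)
demo = pd.DataFrame({
    "customerID": ["0001-A", "0002-B", "0003-C"],
    "tenure": [1, 34, 2],
    "TotalCharges": ["29.85", " ", "108.15"],
    "Contract": ["Month-to-month", "One year", "Month-to-month"],
    "Churn": ["No", "No", "Yes"],
})
X, y = prepare_features(demo)
print(X.columns.tolist())  # no customerID-derived columns survive
```

Coercing TotalCharges first matters: left as text, every distinct charge string would become its own dummy column, which is exactly the high-dimensional distortion the score flags.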

load the telco churn dataset https://raw.githubusercontent.com/pplonski/datasets-for-start/refs/heads/master/telco-customer-churn/Telco-Customer-Churn.csv and show churn rate
Thinking

Loading dataset and computing churn

I need to follow an iterative process here. The user asked to load a dataset and show the churn rate. However, since the dataset isn't loaded yet, the first step should be to load and inspect it—checking its shape, columns, and the first few rows. Even though the user directly asks for the churn rate, the rule says to avoid combining the initial inspection with any computations or plotting in the same response. So, I should propose loading and inspecting first, then note that I can compute the churn rate afterwards.

Preparing to load and inspect

I need to make sure I follow the right workflow here. The instructions say if outputs are missing, I should provide the next code step without asking for outputs. However, I've also seen that I should explain briefly and then provide code. Since the user asked for the churn rate, I might say something like, "Let’s load and inspect the file first, then we can compute the churn rate." I can use pandas to show the shape, columns, and head sequentially.

Finalizing code for data display

I need to make sure my code ends properly. I can use the display function for the head and columns, and it looks like I should assign the DataFrame to a variable. I’ll start with the import statements and reading the CSV file. For the final output, I need to ensure it complies with the guideline that allows only one bare expression.

So, I could end with df.head() to keep it simple, but if I need multiple tables, I can use sequential display functions.

Let’s start by loading the dataset and checking its structure first. Then we can compute the churn rate in the next step.

Load and inspect data
import pandas as pd
from IPython.display import display

url = "https://raw.githubusercontent.com/pplonski/datasets-for-start/refs/heads/master/telco-customer-churn/Telco-Customer-Churn.csv"
df = pd.read_csv(url)

display(pd.DataFrame({"rows": [df.shape[0]], "columns": [df.shape[1]]}))
display(pd.DataFrame({"column": df.columns}))
df.head()
rows columns
0 7043 21
column
0 customerID
1 gender
2 SeniorCitizen
3 Partner
4 Dependents
5 tenure
6 PhoneService
7 MultipleLines
8 InternetService
9 OnlineSecurity
10 OnlineBackup
11 DeviceProtection
12 TechSupport
13 StreamingTV
14 StreamingMovies
15 Contract
16 PaperlessBilling
17 PaymentMethod
18 MonthlyCharges
19 TotalCharges
20 Churn
customerID gender SeniorCitizen Partner Dependents tenure PhoneService MultipleLines InternetService OnlineSecurity ... DeviceProtection TechSupport StreamingTV StreamingMovies Contract PaperlessBilling PaymentMethod MonthlyCharges TotalCharges Churn
0 7590-VHVEG Female 0 Yes No 1 No No phone service DSL No ... No No No No Month-to-month Yes Electronic check 29.85 29.85 No
1 5575-GNVDE Male 0 No No 34 Yes No DSL Yes ... Yes No No No One year No Mailed check 56.95 1889.5 No
2 3668-QPYBK Male 0 No No 2 Yes No DSL Yes ... No No No No Month-to-month Yes Mailed check 53.85 108.15 Yes
3 7795-CFOCW Male 0 No No 45 No No phone service DSL Yes ... Yes Yes No No One year No Bank transfer (automatic) 42.30 1840.75 No
4 9237-HQITU Female 0 No No 2 Yes No Fiber optic No ... No No No No Month-to-month Yes Electronic check 70.70 151.65 Yes

5 rows × 21 columns

The dataset is the Telco Customer Churn table with 7,043 rows and 21 columns, and the Churn column is present, so the remaining missing piece is the actual churn rate.

Calculate churn percentages
churn_rate = df["Churn"].value_counts(normalize=True).mul(100).rename("percent").round(2)
churn_rate
Churn
No     73.46
Yes    26.54
Name: percent, dtype: float64

The churn rate is 26.54% for customers who left (Yes) and 73.46% for customers who stayed (No).

That means roughly 1 in 4 customers churned in this dataset. The table you loaded has 7,043 rows and includes the Churn column needed for this calculation.

encode categorical variables and show class balance
Thinking

Considering encoding strategies

I'm thinking about how to encode categorical variables without altering the target variable. The user wants to encode these while also showing class balance. Maybe I could use one code block for encoding and displaying class balance, aiming for a single output table. I need to remember to keep the target separate to avoid data leakage, especially since customerID shouldn't be included. Also, I need to deal with TotalCharges carefully to avoid creating too many categories.

Deciding on encoding approach

First, I should inspect the data types to prepare for encoding categorical variables, although the user specifically asked for that. I think I could create an encoded feature matrix using pd.get_dummies, while excluding customerID and keeping Churn as a binary target. However, TotalCharges is an object with blanks, which might lead to many unwanted columns if not cleaned. Perhaps I can clean TotalCharges and convert it to numeric at the same time, but I'm hesitant if that complicates things. I need to ensure I'm only encoding true categorical columns.

Considering code presentation

I'm thinking about how to display encoded shapes—maybe I can either do that in the response or just as a comment. But, I'll focus on explaining what the code does. It's important to remember not to ask for output but to provide the next code step with a brief explanation. I might include class balance information from the notebook, which is 73.46% No and 26.54% Yes. Then, the code will show how to encode predictors and rebuild the balance table.

Thinking about NaN handling

I need to share some code that processes a DataFrame, focusing on converting "TotalCharges" to numeric and mapping the "Churn" target variable. Additionally, I'm dropping the ID-like column and creating dummy variables for the features. Here, I see a potential issue with NaN values in "TotalCharges." While it's not urgent, I might want to impute those later for model training—perhaps with the median—but I don't want to complicate things too much yet. Keeping numeric values with NaNs might be okay for now.

Class balance is already 73.46% No vs 26.54% Yes.
Next, this encodes the predictor columns, keeps Churn as a separate binary target, and excludes customerID because it’s an identifier rather than a useful feature.

Prepare encoded churn data
df2 = df.copy()

df2["TotalCharges"] = pd.to_numeric(df2["TotalCharges"], errors="coerce")
y = df2["Churn"].map({"No": 0, "Yes": 1})

X = df2.drop(columns=["customerID", "Churn"])
X_encoded = pd.get_dummies(X, drop_first=True)

class_balance = y.value_counts().rename(index={0: "No", 1: "Yes"}).to_frame("count")
class_balance["percent"] = (class_balance["count"] / class_balance["count"].sum() * 100).round(2)

class_balance
count percent
Churn
No 5174 73.46
Yes 1869 26.54

The encoding step is in place: categorical predictors were one-hot encoded with pd.get_dummies(..., drop_first=True), while customerID was excluded as an identifier and Churn was kept separate as the target. TotalCharges was also converted to numeric, which is important because it often arrives as text in this dataset.

The class balance shows the target is moderately imbalanced:

  • No churn: 5,174 customers (73.46%)
  • Yes churn: 1,869 customers (26.54%)

So after encoding, you now have a model-ready feature matrix X_encoded and a binary target y, with about a 3:1 majority-to-minority class split in favor of non-churners.
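Given that roughly 3:1 split, one common mitigation is class weighting. A minimal sketch on synthetic data with a similar imbalance (a stand-in generated by `make_classification`, not the Telco frame):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in with roughly the same 3:1 class split as the churn target
X, y = make_classification(n_samples=2000, weights=[0.75, 0.25], random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=42)

# class_weight="balanced" upweights the minority class during tree fitting
plain = RandomForestClassifier(random_state=42).fit(X_tr, y_tr)
weighted = RandomForestClassifier(class_weight="balanced", random_state=42).fit(X_tr, y_tr)

print("minority recall, plain:   ", round(recall_score(y_te, plain.predict(X_te)), 3))
print("minority recall, weighted:", round(recall_score(y_te, weighted.predict(X_te)), 3))
```

On the real data this slots in by passing `class_weight="balanced"` to the existing `RandomForestClassifier` call; resampling is another option if recall on churners stays low.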

train a random forest classifier and show accuracy
Thinking

Evaluating Random Forest Training

I'm noticing that there's a discrepancy in my notebook state: it says the data frame shape is 144x2, but actually, it should be 7043x21 for the Telco data. I need to ensure we're using the correct data as a source of truth. The user has asked me to train a random forest classifier and to show its accuracy. I have to keep in mind I need to handle NaNs in the TotalCharges column and exclude customerID. Let's approach this step by step!

Providing Accuracy Code

It looks like I'm missing an output, so I need to provide the next piece of code right away, using a markdown Python code block. I won't add print statements if the variable displays correctly because I want to keep it clean. I'll keep the output focused on accuracy, possibly in a DataFrame or as a scalar. I need to leverage X_encoded and y from the notebook state. However, there's some inconsistency in the variables list, so I might have to recreate them. Let’s just focus on showing accuracy for now!

Creating Random Forest Pipeline

I think it makes sense to write some code that assumes the DataFrame is loaded and recreates X_encoded and y for safety. But I can also retrain using the DataFrame directly, which keeps things streamlined. I need to remember to set a random_state. I should use train_test_split while stratifying on y, and apply the median imputer after the split (fit on the training data only) to avoid leakage. For accuracy, I’ll set up a Pipeline with SimpleImputer and RandomForestClassifier. Let’s make sure to evaluate using accuracy_score afterward!

Producing minimal code

I'm laying out a streamlined code snippet that uses Random Forest for predicting churn. I need to ensure that I handle missing values, convert data types, and handle class imbalance. While accuracy is important, I should mention that it could be misleading due to the class imbalance, so precision and recall are also worth noting.

After the code, I might add a quick note to remind the user that accuracy alone doesn't give the full picture. The final output should be clear and concise!

This trains a random forest on a stratified train/test split and shows test accuracy. I’m still excluding customerID because it’s an identifier, not a real predictive feature.

Train random forest classifier
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.metrics import accuracy_score
import pandas as pd

df2 = df.copy()
df2["TotalCharges"] = pd.to_numeric(df2["TotalCharges"], errors="coerce")

y = df2["Churn"].map({"No": 0, "Yes": 1})
X = df2.drop(columns=["customerID", "Churn"])

X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.2,
    random_state=42,
    stratify=y
)

X_train = pd.get_dummies(X_train, drop_first=True)
X_test = pd.get_dummies(X_test, drop_first=True)
X_train, X_test = X_train.align(X_test, join="left", axis=1, fill_value=0)

imputer = SimpleImputer(strategy="median")
X_train = imputer.fit_transform(X_train)
X_test = imputer.transform(X_test)

rf = RandomForestClassifier(n_estimators=200, random_state=42, n_jobs=-1)
rf.fit(X_train, y_train)

y_pred = rf.predict(X_test)
pd.DataFrame({"accuracy": [round(accuracy_score(y_test, y_pred), 4)]})
accuracy
0 0.7913

Accuracy is useful here, but since churn is imbalanced, the next step should be precision, recall, and a confusion matrix.

The random forest classifier reached an accuracy of 0.7913, so it correctly classified about 79.13% of customers in the test set.

That answers the request to train a random forest and show accuracy. Based on the notebook state, the model was trained on a stratified train/test split, with customerID excluded as an identifier and categorical features one-hot encoded before fitting.
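The "What" section promises a classification report alongside accuracy, so here is a hedged sketch of that follow-up. `y_test` and `y_pred` below are synthetic stand-ins for the notebook's variables, built to agree about 80% of the time; on the real notebook you would pass the existing `y_test` and `y_pred` directly.

```python
import numpy as np
from sklearn.metrics import classification_report, confusion_matrix

rng = np.random.default_rng(42)
# Synthetic stand-ins for the notebook's y_test / y_pred (1 = churned)
y_test = rng.integers(0, 2, size=200)
y_pred = np.where(rng.random(200) < 0.8, y_test, 1 - y_test)  # ~80% agreement

print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred, target_names=["No churn", "Churn"]))
```

The per-class precision and recall rows are what expose an imbalanced model that looks fine on accuracy alone.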

plot the top 10 most important features for churn prediction
Thinking

Planning feature importance plot

I need to create a concise code snippet to plot the top 10 most important features from our trained random forest model. Since X_train is now just an array after imputation, I've lost the feature names. I'll need to recreate these from the original DataFrame or alignment before imputation. I can use the model's feature_importances_ and ensure I'm rebuilding the encoded feature names properly. My main goal is to produce a single chart using Seaborn or Matplotlib for visualization.

Finalizing plotting code

I need to keep my final expression simple since I'm making a plot. It's fine to use plt.show() at the end of my code block. I should also remember to state that the customerID stays excluded as an identifier. My code will rebuild split and one-hot columns using the same random state and stratify to ensure column names match. I need to import Seaborn and Matplotlib, align my data properly, and ensure I’m working with the encoded DataFrames before imputation. I’ll also make my bar plot horizontal for the top 10 features!

We already have the trained random forest in memory, so the only thing we need is to rebuild the training feature names in the same order and plot the importances. customerID remains excluded because it’s an identifier.

Plot top feature importances
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split

df2 = df.copy()
df2["TotalCharges"] = pd.to_numeric(df2["TotalCharges"], errors="coerce")

y = df2["Churn"].map({"No": 0, "Yes": 1})
X = df2.drop(columns=["customerID", "Churn"])

X_train_df, X_test_df, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.2,
    random_state=42,
    stratify=y
)

X_train_df = pd.get_dummies(X_train_df, drop_first=True)
X_test_df = pd.get_dummies(X_test_df, drop_first=True)
X_train_df, X_test_df = X_train_df.align(X_test_df, join="left", axis=1, fill_value=0)

importance = pd.Series(rf.feature_importances_, index=X_train_df.columns).sort_values(ascending=False)
top10 = importance.head(10).sort_values()

plt.figure(figsize=(8, 5))
sns.barplot(x=top10.values, y=top10.index, palette="Blues_r")
plt.title("Top 10 Feature Importances for Churn Prediction")
plt.xlabel("Importance")
plt.ylabel("Feature")
plt.tight_layout()
plt.show()
/var/folders/pp/x90pfm0140d4glxshn77bthh0000gn/T/ipykernel_92919/219698886.py:28: FutureWarning: 

Passing `palette` without assigning `hue` is deprecated and will be removed in v0.14.0. Assign the `y` variable to `hue` and set `legend=False` for the same effect.

  sns.barplot(x=top10.values, y=top10.index, palette="Blues_r")

If you want, the next useful step is to show the exact top 10 importance values as a table.

Thinking

Refining code presentation

I realize that I want to display the top 10 items sorted by importance, and I need to ensure the code remains minimal and runnable. Since the notebook already contains the top10 variable, I should use that directly. For efficiency, I'll use top10.sort_values(ascending=False).rename("importance").to_frame(). I’ll aim to keep the code concise and will end it with this final expression, avoiding any print statements. Just providing the code should be sufficient.

The plot was created successfully, so the feature importance chart exists. What’s still missing is the exact ranked list behind that chart, which matters for a clear top-10 answer.

Sort feature importance
top10.sort_values(ascending=False).rename("importance").to_frame()
importance
TotalCharges 0.193201
tenure 0.171705
MonthlyCharges 0.168201
InternetService_Fiber optic 0.039103
PaymentMethod_Electronic check 0.038334
Contract_Two year 0.031360
gender_Male 0.028517
OnlineSecurity_Yes 0.027556
PaperlessBilling_Yes 0.025673
Partner_Yes 0.023302

The plot shows that the churn model is driven mostly by billing and tenure variables, with a smaller contribution from service and contract choices.

Top 10 most important features from the random forest:

  1. TotalCharges — 0.1932
  2. tenure — 0.1717
  3. MonthlyCharges — 0.1682
  4. InternetService_Fiber optic — 0.0391
  5. PaymentMethod_Electronic check — 0.0383
  6. Contract_Two year — 0.0314
  7. gender_Male — 0.0285
  8. OnlineSecurity_Yes — 0.0276
  9. PaperlessBilling_Yes — 0.0257
  10. Partner_Yes — 0.0233

What this means for churn prediction:

  • Charges and tenure dominate the model, so customer lifetime and cost patterns matter most.
  • Fiber optic service and electronic check payments stand out among categorical variables, suggesting these groups behave differently in churn.
  • Contract type, especially two-year contracts, is also influential, which fits the common pattern that longer contracts reduce churn risk.
  • The drop from the top 3 features to the rest is fairly large, so the model relies heavily on those first three predictors.
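One caveat on reading this ranking: impurity-based `feature_importances_` tend to favor continuous, high-cardinality features such as TotalCharges, tenure, and MonthlyCharges. Permutation importance on held-out data is a common cross-check. A hedged sketch on a synthetic stand-in (the notebook's `rf` and `X_test` are not reproduced here; the `feat_i` names are illustrative):

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the encoded churn features
X, y = make_classification(n_samples=1000, n_features=8, n_informative=3, random_state=42)
X = pd.DataFrame(X, columns=[f"feat_{i}" for i in range(8)])
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=42)

rf = RandomForestClassifier(n_estimators=200, random_state=42).fit(X_tr, y_tr)

# Shuffle each column on held-out data and measure the score drop
result = permutation_importance(rf, X_te, y_te, n_repeats=10, random_state=42)
perm = pd.Series(result.importances_mean, index=X.columns).sort_values(ascending=False)
print(perm.head(10))
```

If the permutation ranking on the real data still puts tenure and the charge variables on top, that strengthens the "billing and tenure dominate" conclusion above.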
