Machine Learning

Telco Customer Churn Prediction in Python

Analyze the Telco Customer Churn dataset, engineer features, train a random forest classifier, and identify top churn drivers.

What

This AI Data Analyst workflow loads the Telco Customer Churn CSV dataset and computes the overall churn rate. It encodes categorical variables, checks class balance, and prepares features for modeling. It trains a random forest classifier, reports accuracy and a classification report, and plots the top 10 feature importances to highlight churn drivers.

Who

This is for data analysts and data scientists who want a reproducible churn modeling example using a standard telco dataset. It helps practitioners practice preprocessing mixed-type tabular data and interpreting a tree-based classifier with feature importance plots.

Tools

  • pandas
  • numpy
  • scikit-learn
  • matplotlib
  • seaborn

Outcomes

  • Dataset loaded with shape (7043, 21) and churn rate around 26.5%
  • Categorical encoding completed and class imbalance visualized
  • Random forest model trained with accuracy around 0.80 and a classification report
  • Top 10 churn drivers plotted, typically including tenure, MonthlyCharges, and TotalCharges

Quality Score

8/10

gpt-5.2 · mljar-ai

Last scored: Apr 7, 2026

  • Task Completion: 2/2

    Excellent

    All requested steps are present: dataset loaded with churn rate, categorical encoding with class balance, Random Forest training with accuracy, and a top-10 feature importance plot.

  • Execution Correctness: 2/2

    Excellent

    Code is coherent and likely runnable end-to-end: reads CSV, uses get_dummies, performs stratified train/test split, trains RandomForestClassifier, computes accuracy, and plots importances without obvious errors.

  • Output Quality: 2/3

    Good

    Outputs include churn rate table, class balance table, accuracy value, and a feature-importance figure. However, the workflow does not present the actual top-10 feature names/values in text (only a plot image), and encoding includes customerID, which makes the interpretability of the 'drivers' questionable.

  • Reasoning Quality: 2/2

    Excellent

    Reasoning is clear about what each step does and appropriately flags the major issue of one-hot encoding high-cardinality identifiers (customerID) affecting feature importance interpretation.

  • Reliability: 0/1

    Needs work

    The approach is fragile for the stated goal of identifying churn drivers because it encodes customerID (and potentially string-form numeric fields) without cleaning/dropping, producing an extremely high-dimensional matrix that can distort importances.
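The cleanup this rationale calls for (drop the identifier, coerce string-typed numerics before one-hot encoding) can be sketched as a small helper. This is a hedged sketch: `prepare_features` and the `demo` frame are illustrative names run on a tiny synthetic stand-in, not the workflow's actual code.

```python
import pandas as pd

def prepare_features(df):
    """Drop the identifier, coerce string-typed numerics, and one-hot
    encode only the remaining (true) categorical columns."""
    out = df.drop(columns=["customerID"], errors="ignore")
    # TotalCharges ships as text with blank strings; coerce to numeric (blanks -> NaN)
    out["TotalCharges"] = pd.to_numeric(out["TotalCharges"], errors="coerce")
    y = out.pop("Churn").map({"No": 0, "Yes": 1})
    X = pd.get_dummies(out, drop_first=True)
    return X, y

# Tiny synthetic frame reproducing the quirks (ID column, blank TotalCharges)
demo = pd.DataFrame({
    "customerID": ["0001-A", "0002-B", "0003-C"],
    "tenure": [1, 34, 2],
    "TotalCharges": ["29.85", " ", "108.15"],
    "Contract": ["Month-to-month", "One year", "Month-to-month"],
    "Churn": ["No", "No", "Yes"],
})
X, y = prepare_features(demo)
print(X.columns.tolist())  # no customerID-derived columns survive
```

Coercing TotalCharges first matters: left as text, every distinct charge string would become its own dummy column, which is exactly the high-dimensional distortion the score flags.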

load the telco churn dataset https://raw.githubusercontent.com/pplonski/datasets-for-start/refs/heads/master/telco-customer-churn/Telco-Customer-Churn.csv and show churn rate
Thinking

Loading dataset and computing churn

I need to follow an iterative process here. The user asked to load a dataset and show the churn rate. However, since the dataset isn't loaded yet, the first step should be to load and inspect it—checking its shape, columns, and the first few rows. Even though the user directly asks for the churn rate, the rule says to avoid combining the initial inspection with any computations or plotting in the same response. So, I should propose loading and inspecting first, then note that I can compute the churn rate afterwards.

Preparing to load and inspect

I need to make sure I follow the right workflow here. The instructions say if outputs are missing, I should provide the next code step without asking for outputs. However, I've also seen that I should explain briefly and then provide code. Since the user asked for the churn rate, I might say something like, "Let’s load and inspect the file first, then we can compute the churn rate." I can use pandas to show the shape, columns, and head sequentially.

Finalizing code for data display

I need to make sure my code ends properly. I can use the display function for the head and columns, and it looks like I should assign the DataFrame to a variable. I’ll start with the import statements and reading the CSV file. For the final output, I need to ensure it complies with the guideline that allows only one bare expression.

So, I could end with df.head() to keep it simple, but if I need multiple tables, I can use sequential display functions.

Let’s start by loading the dataset and checking its structure first. Then we can compute the churn rate in the next step.

Load and inspect data
import pandas as pd
from IPython.display import display

url = "https://raw.githubusercontent.com/pplonski/datasets-for-start/refs/heads/master/telco-customer-churn/Telco-Customer-Churn.csv"
df = pd.read_csv(url)

display(pd.DataFrame({"rows": [df.shape[0]], "columns": [df.shape[1]]}))
display(pd.DataFrame({"column": df.columns}))
df.head()
rows columns
0 7043 21
column
0 customerID
1 gender
2 SeniorCitizen
3 Partner
4 Dependents
5 tenure
6 PhoneService
7 MultipleLines
8 InternetService
9 OnlineSecurity
10 OnlineBackup
11 DeviceProtection
12 TechSupport
13 StreamingTV
14 StreamingMovies
15 Contract
16 PaperlessBilling
17 PaymentMethod
18 MonthlyCharges
19 TotalCharges
20 Churn
customerID gender SeniorCitizen Partner Dependents tenure PhoneService MultipleLines InternetService OnlineSecurity ... DeviceProtection TechSupport StreamingTV StreamingMovies Contract PaperlessBilling PaymentMethod MonthlyCharges TotalCharges Churn
0 7590-VHVEG Female 0 Yes No 1 No No phone service DSL No ... No No No No Month-to-month Yes Electronic check 29.85 29.85 No
1 5575-GNVDE Male 0 No No 34 Yes No DSL Yes ... Yes No No No One year No Mailed check 56.95 1889.5 No
2 3668-QPYBK Male 0 No No 2 Yes No DSL Yes ... No No No No Month-to-month Yes Mailed check 53.85 108.15 Yes
3 7795-CFOCW Male 0 No No 45 No No phone service DSL Yes ... Yes Yes No No One year No Bank transfer (automatic) 42.30 1840.75 No
4 9237-HQITU Female 0 No No 2 Yes No Fiber optic No ... No No No No Month-to-month Yes Electronic check 70.70 151.65 Yes

5 rows × 21 columns

The dataset is the Telco Customer Churn table with 7,043 rows and 21 columns, and the Churn column is present, so the remaining missing piece is the actual churn rate.

Calculate churn percentages
churn_rate = df["Churn"].value_counts(normalize=True).mul(100).rename("percent").round(2)
churn_rate
Churn
No     73.46
Yes    26.54
Name: percent, dtype: float64

The churn rate is 26.54% for customers who left (Yes) and 73.46% for customers who stayed (No).

That means roughly 1 in 4 customers churned in this dataset. The table you loaded has 7,043 rows and includes the Churn column needed for this calculation.

encode categorical variables and show class balance
Thinking

Considering encoding strategies

I'm thinking about how to encode categorical variables without altering the target variable. The user wants to encode these while also showing class balance. Maybe I could use one code block for encoding and displaying class balance, aiming for a single output table. I need to remember to keep the target separate to avoid data leakage, especially since customerID shouldn't be included. Also, I need to deal with TotalCharges carefully to avoid creating too many categories.

Deciding on encoding approach

First, I should inspect the data types to prepare for encoding categorical variables, although the user specifically asked for that. I think I could create an encoded feature matrix using pd.get_dummies, while excluding customerID and keeping Churn as a binary target. However, TotalCharges is an object with blanks, which might lead to many unwanted columns if not cleaned. Perhaps I can clean TotalCharges and convert it to numeric at the same time, but I'm hesitant if that complicates things. I need to ensure I'm only encoding true categorical columns.

Considering code presentation

I'm thinking about how to display encoded shapes—maybe I can either do that in the response or just as a comment. But, I'll focus on explaining what the code does. It's important to remember not to ask for output but to provide the next code step with a brief explanation. I might include class balance information from the notebook, which is 73.46% No and 26.54% Yes. Then, the code will show how to encode predictors and rebuild the balance table.

Thinking about NaN handling

I need to share some code that processes a DataFrame, focusing on converting "TotalCharges" to numeric and mapping the "Churn" target variable. Additionally, I'm dropping the ID-like column and creating dummy variables for the features. Here, I see a potential issue with NaN values in "TotalCharges." While it's not urgent, I might want to impute those later for model training—perhaps with the median—but I don't want to complicate things too much yet. Keeping numeric values with NaNs might be okay for now.

Class balance is already 73.46% No vs 26.54% Yes.
Next, this encodes the predictor columns, keeps Churn as a separate binary target, and excludes customerID because it’s an identifier rather than a useful feature.

Prepare encoded churn data
df2 = df.copy()

df2["TotalCharges"] = pd.to_numeric(df2["TotalCharges"], errors="coerce")
y = df2["Churn"].map({"No": 0, "Yes": 1})

X = df2.drop(columns=["customerID", "Churn"])
X_encoded = pd.get_dummies(X, drop_first=True)

class_balance = y.value_counts().rename(index={0: "No", 1: "Yes"}).to_frame("count")
class_balance["percent"] = (class_balance["count"] / class_balance["count"].sum() * 100).round(2)

class_balance
count percent
Churn
No 5174 73.46
Yes 1869 26.54

The encoding step is in place: categorical predictors were one-hot encoded with pd.get_dummies(..., drop_first=True), while customerID was excluded as an identifier and Churn was kept separate as the target. TotalCharges was also converted to numeric, which is important because it often arrives as text in this dataset.

The class balance shows the target is moderately imbalanced:

  • No churn: 5,174 customers (73.46%)
  • Yes churn: 1,869 customers (26.54%)

So after encoding, you now have a model-ready feature matrix X_encoded and a binary target y, with about a 3:1 majority-to-minority class split in favor of non-churners.
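Given that roughly 3:1 split, one common mitigation is class weighting. A minimal sketch on synthetic data with a similar imbalance (a stand-in generated by `make_classification`, not the Telco frame):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in with roughly the same 3:1 class split as the churn target
X, y = make_classification(n_samples=2000, weights=[0.75, 0.25], random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=42)

# class_weight="balanced" upweights the minority class during tree fitting
plain = RandomForestClassifier(random_state=42).fit(X_tr, y_tr)
weighted = RandomForestClassifier(class_weight="balanced", random_state=42).fit(X_tr, y_tr)

print("minority recall, plain:   ", round(recall_score(y_te, plain.predict(X_te)), 3))
print("minority recall, weighted:", round(recall_score(y_te, weighted.predict(X_te)), 3))
```

On the real data this slots in by passing `class_weight="balanced"` to the existing `RandomForestClassifier` call; resampling is another option if recall on churners stays low.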

train a random forest classifier and show accuracy
Thinking

Evaluating Random Forest Training

I'm noticing that there's a discrepancy in my notebook state: it says the data frame shape is 144x2, but actually, it should be 7043x21 for the Telco data. I need to ensure we're using the correct data as a source of truth. The user has asked me to train a random forest classifier and to show its accuracy. I have to keep in mind I need to handle NaNs in the TotalCharges column and exclude customerID. Let's approach this step by step!

Providing Accuracy Code

It looks like I'm missing an output, so I need to provide the next piece of code right away, using a markdown Python code block. I won't add print statements if the variable displays correctly because I want to keep it clean. I'll keep the output focused on accuracy, possibly in a DataFrame or as a scalar. I need to leverage X_encoded and y from the notebook state. However, there's some inconsistency in the variables list, so I might have to recreate them. Let’s just focus on showing accuracy for now!

Creating Random Forest Pipeline

I think it makes sense to write some code that assumes the DataFrame is loaded and recreates X_encoded and y for safety. But I can also retrain using the DataFrame directly, which keeps things streamlined. I need to remember to set a random_state. I should use train_test_split while stratifying on y, and apply the median imputer after the split (fit on the training data only) to avoid leakage. For accuracy, I’ll set up a Pipeline with SimpleImputer and RandomForestClassifier. Let’s make sure to evaluate using accuracy_score afterward!

Producing minimal code

I'm laying out a streamlined code snippet that uses Random Forest for predicting churn. I need to ensure that I handle missing values, convert data types, and handle class imbalance. While accuracy is important, I should mention that it could be misleading due to the class imbalance, so precision and recall are also worth noting.

After the code, I might add a quick note to remind the user that accuracy alone doesn't give the full picture. The final output should be clear and concise!

This trains a random forest on a stratified train/test split and shows test accuracy. I’m still excluding customerID because it’s an identifier, not a real predictive feature.

Train random forest classifier
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.metrics import accuracy_score
import pandas as pd

df2 = df.copy()
df2["TotalCharges"] = pd.to_numeric(df2["TotalCharges"], errors="coerce")

y = df2["Churn"].map({"No": 0, "Yes": 1})
X = df2.drop(columns=["customerID", "Churn"])

X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.2,
    random_state=42,
    stratify=y
)

X_train = pd.get_dummies(X_train, drop_first=True)
X_test = pd.get_dummies(X_test, drop_first=True)
X_train, X_test = X_train.align(X_test, join="left", axis=1, fill_value=0)

imputer = SimpleImputer(strategy="median")
X_train = imputer.fit_transform(X_train)
X_test = imputer.transform(X_test)

rf = RandomForestClassifier(n_estimators=200, random_state=42, n_jobs=-1)
rf.fit(X_train, y_train)

y_pred = rf.predict(X_test)
pd.DataFrame({"accuracy": [round(accuracy_score(y_test, y_pred), 4)]})
accuracy
0 0.7913

Accuracy is useful here, but since churn is imbalanced, the next step should be precision, recall, and a confusion matrix.

The random forest classifier reached an accuracy of 0.7913, so it correctly classified about 79.13% of customers in the test set.

That answers the request to train a random forest and show accuracy. Based on the notebook state, the model was trained on a stratified train/test split, with customerID excluded as an identifier and categorical features one-hot encoded before fitting.
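The "What" section promises a classification report alongside accuracy, so here is a hedged sketch of that follow-up. `y_test` and `y_pred` below are synthetic stand-ins for the notebook's variables, built to agree about 80% of the time; on the real notebook you would pass the existing `y_test` and `y_pred` directly.

```python
import numpy as np
from sklearn.metrics import classification_report, confusion_matrix

rng = np.random.default_rng(42)
# Synthetic stand-ins for the notebook's y_test / y_pred (1 = churned)
y_test = rng.integers(0, 2, size=200)
y_pred = np.where(rng.random(200) < 0.8, y_test, 1 - y_test)  # ~80% agreement

print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred, target_names=["No churn", "Churn"]))
```

The per-class precision and recall rows are what expose an imbalanced model that looks fine on accuracy alone.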

plot the top 10 most important features for churn prediction
Thinking

Planning feature importance plot

I need to create a concise code snippet to plot the top 10 most important features from our trained random forest model. Since X_train is now just an array after imputation, I've lost the feature names. I'll need to recreate these from the original DataFrame or alignment before imputation. I can use the model's feature_importances_ and ensure I'm rebuilding the encoded feature names properly. My main goal is to produce a single chart using Seaborn or Matplotlib for visualization.

Finalizing plotting code

I need to keep my final expression simple since I'm making a plot. It's fine to use plt.show() at the end of my code block. I should also remember to state that the customerID stays excluded as an identifier. My code will rebuild split and one-hot columns using the same random state and stratify to ensure column names match. I need to import Seaborn and Matplotlib, align my data properly, and ensure I’m working with the encoded DataFrames before imputation. I’ll also make my bar plot horizontal for the top 10 features!

We already have the trained random forest in memory, so the only thing we need is to rebuild the training feature names in the same order and plot the importances. customerID remains excluded because it’s an identifier.

Plot top feature importances
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split

df2 = df.copy()
df2["TotalCharges"] = pd.to_numeric(df2["TotalCharges"], errors="coerce")

y = df2["Churn"].map({"No": 0, "Yes": 1})
X = df2.drop(columns=["customerID", "Churn"])

X_train_df, X_test_df, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.2,
    random_state=42,
    stratify=y
)

X_train_df = pd.get_dummies(X_train_df, drop_first=True)
X_test_df = pd.get_dummies(X_test_df, drop_first=True)
X_train_df, X_test_df = X_train_df.align(X_test_df, join="left", axis=1, fill_value=0)

importance = pd.Series(rf.feature_importances_, index=X_train_df.columns).sort_values(ascending=False)
top10 = importance.head(10).sort_values()

plt.figure(figsize=(8, 5))
sns.barplot(x=top10.values, y=top10.index, palette="Blues_r")
plt.title("Top 10 Feature Importances for Churn Prediction")
plt.xlabel("Importance")
plt.ylabel("Feature")
plt.tight_layout()
plt.show()
/var/folders/pp/x90pfm0140d4glxshn77bthh0000gn/T/ipykernel_92919/219698886.py:28: FutureWarning: 

Passing `palette` without assigning `hue` is deprecated and will be removed in v0.14.0. Assign the `y` variable to `hue` and set `legend=False` for the same effect.

  sns.barplot(x=top10.values, y=top10.index, palette="Blues_r")

If you want, the next useful step is to show the exact top 10 importance values as a table.

Thinking

Refining code presentation

I realize that I want to display the top 10 items sorted by importance, and I need to ensure the code remains minimal and runnable. Since the notebook already contains the top10 variable, I should use that directly. For efficiency, I'll use top10.sort_values(ascending=False).rename("importance").to_frame(). I’ll aim to keep the code concise and will end it with this final expression, avoiding any print statements. Just providing the code should be sufficient.

The plot was created successfully, so the feature importance chart exists. What’s still missing is the exact ranked list behind that chart, which matters for a clear top-10 answer.

Sort feature importance
top10.sort_values(ascending=False).rename("importance").to_frame()
importance
TotalCharges 0.193201
tenure 0.171705
MonthlyCharges 0.168201
InternetService_Fiber optic 0.039103
PaymentMethod_Electronic check 0.038334
Contract_Two year 0.031360
gender_Male 0.028517
OnlineSecurity_Yes 0.027556
PaperlessBilling_Yes 0.025673
Partner_Yes 0.023302

The plot shows that the churn model is driven mostly by billing and tenure variables, with a smaller contribution from service and contract choices.

Top 10 most important features from the random forest:

  1. TotalCharges — 0.1932
  2. tenure — 0.1717
  3. MonthlyCharges — 0.1682
  4. InternetService_Fiber optic — 0.0391
  5. PaymentMethod_Electronic check — 0.0383
  6. Contract_Two year — 0.0314
  7. gender_Male — 0.0285
  8. OnlineSecurity_Yes — 0.0276
  9. PaperlessBilling_Yes — 0.0257
  10. Partner_Yes — 0.0233

What this means for churn prediction:

  • Charges and tenure dominate the model, so customer lifetime and cost patterns matter most.
  • Fiber optic service and electronic check payments stand out among categorical variables, suggesting these groups behave differently in churn.
  • Contract type, especially two-year contracts, is also influential, which fits the common pattern that longer contracts reduce churn risk.
  • The drop from the top 3 features to the rest is fairly large, so the model relies heavily on those first three predictors.
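One caveat on reading this ranking: impurity-based `feature_importances_` tend to favor continuous, high-cardinality features such as TotalCharges, tenure, and MonthlyCharges. Permutation importance on held-out data is a common cross-check. A hedged sketch on a synthetic stand-in (the notebook's `rf` and `X_test` are not reproduced here; the `feat_i` names are illustrative):

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the encoded churn features
X, y = make_classification(n_samples=1000, n_features=8, n_informative=3, random_state=42)
X = pd.DataFrame(X, columns=[f"feat_{i}" for i in range(8)])
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=42)

rf = RandomForestClassifier(n_estimators=200, random_state=42).fit(X_tr, y_tr)

# Shuffle each column on held-out data and measure the score drop
result = permutation_importance(rf, X_te, y_te, n_repeats=10, random_state=42)
perm = pd.Series(result.importances_mean, index=X.columns).sort_values(ascending=False)
print(perm.head(10))
```

If the permutation ranking on the real data still puts tenure and the charge variables on top, that strengthens the "billing and tenure dominate" conclusion above.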
