Machine Learning
Telco Customer Churn Prediction in Python
Analyze the Telco Customer Churn dataset, engineer features, train a random forest classifier, and identify top churn drivers.
What
This AI Data Analyst workflow loads the Telco Customer Churn CSV dataset and computes the overall churn rate. It encodes categorical variables, checks class balance, and prepares features for modeling. It trains a random forest classifier, reports accuracy and a classification report, and plots the top 10 feature importances to highlight churn drivers.
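The steps described above can be sketched roughly as follows. This is a minimal illustration, not the workflow's actual code: the helper name `train_churn_model`, the hyperparameters, and the 80/20 split are assumptions, and a small synthetic frame stands in for the real CSV so the snippet runs without the dataset. Only three Telco-style columns are simulated.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split


def train_churn_model(df: pd.DataFrame, target: str = "Churn", seed: int = 42):
    """One-hot encode features, split with stratification, fit a random forest.

    Hypothetical helper; the split size and n_estimators are assumptions.
    """
    y = (df[target] == "Yes").astype(int)
    X = pd.get_dummies(df.drop(columns=[target]))
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=seed
    )
    model = RandomForestClassifier(n_estimators=200, random_state=seed)
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    report = classification_report(y_test, y_pred, zero_division=0)
    return model, X.columns, accuracy_score(y_test, y_pred), report


# Synthetic stand-in for the Telco CSV (assumed column names, made-up values).
rng = np.random.default_rng(0)
n = 400
contract = rng.choice(["Month-to-month", "One year", "Two year"], size=n)
# Illustrative rule only: month-to-month customers churn more often.
churn_prob = np.where(contract == "Month-to-month", 0.45, 0.10)
demo = pd.DataFrame({
    "tenure": rng.integers(0, 73, size=n),
    "Contract": contract,
    "MonthlyCharges": rng.uniform(20, 120, size=n).round(2),
    "Churn": np.where(rng.random(n) < churn_prob, "Yes", "No"),
})

churn_rate = (demo["Churn"] == "Yes").mean()  # overall churn rate
model, feature_names, acc, report = train_churn_model(demo)
print(f"churn rate: {churn_rate:.1%}, accuracy: {acc:.2f}")
print(report)
```

With the real dataset, `demo` would be replaced by the frame returned from `pd.read_csv` on the Telco file; the numbers printed here come from synthetic data and say nothing about the actual churn rate or accuracy.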
Who
This is for data analysts and data scientists who want a reproducible churn modeling example using a standard telco dataset. It helps practitioners practice preprocessing mixed-type tabular data and interpreting a tree-based classifier with feature importance plots.
Tools
- pandas
- numpy
- scikit-learn
- matplotlib
- seaborn
Outcomes
- Dataset loaded with shape (7043, 21) and churn rate around 26.5%
- Categorical encoding completed and class imbalance visualized
- Random forest model trained with accuracy around 0.80 and a classification report
- Top 10 churn drivers plotted, typically including tenure, MonthlyCharges, and TotalCharges
Quality Score
8/10
Last scored: Apr 7, 2026
Task Completion: 2/2
Excellent. All requested steps are present: dataset loaded with churn rate, categorical encoding with class balance, random forest training with accuracy, and a top-10 feature importance plot.
Execution Correctness: 2/2
Excellent. Code is coherent and likely runnable end-to-end: it reads the CSV, uses get_dummies, performs a stratified train/test split, trains RandomForestClassifier, computes accuracy, and plots importances without obvious errors.
Output Quality: 2/3
Good. Outputs include a churn rate table, a class balance table, an accuracy value, and a feature-importance figure. However, the workflow does not present the actual top-10 feature names and values as text (only as a plot image), and the encoding includes customerID, which makes the reported 'drivers' hard to interpret.
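One way to close the gap noted above is to print the ranked importances as text alongside the plot. The sketch below assumes a fitted scikit-learn forest and a list of feature names; `top_importances` is a hypothetical helper, and the data is synthetic (from `make_classification`), so the names and values are placeholders rather than real churn drivers.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier


def top_importances(model, feature_names, k=10):
    """Return the k largest feature importances as a descending Series."""
    importances = pd.Series(model.feature_importances_, index=list(feature_names))
    return importances.sort_values(ascending=False).head(k)


# Synthetic data standing in for the encoded Telco feature matrix.
X, y = make_classification(
    n_samples=300, n_features=15, n_informative=5, random_state=0
)
names = [f"feature_{i}" for i in range(X.shape[1])]  # placeholder names
model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

top10 = top_importances(model, names)
print(top10.round(4).to_string())  # name/value pairs, largest first
```

The same Series can feed the bar chart (for example via `top10.plot.barh()`), so the text and the figure stay in sync.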
Reasoning Quality: 2/2
Excellent. Reasoning is clear about what each step does and appropriately flags the major issue: one-hot encoding a high-cardinality identifier (customerID) affects how the feature importances can be interpreted.
Reliability: 0/1
Needs work. The approach is fragile for the stated goal of identifying churn drivers because it encodes customerID (and potentially numeric fields stored as strings) without cleaning or dropping them first, producing an extremely high-dimensional matrix that can distort the importances.
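A possible fix, sketched under assumptions: drop the identifier and coerce string-form numerics before encoding. The helper `clean_telco` is hypothetical, and the snippet assumes `TotalCharges` arrives as strings with occasional blank entries; whether to drop or impute the resulting missing rows is a judgment call, and dropping is shown only because it is the shortest option.

```python
import pandas as pd


def clean_telco(df: pd.DataFrame) -> pd.DataFrame:
    """Drop the ID column and convert TotalCharges to a numeric dtype."""
    out = df.drop(columns=["customerID"], errors="ignore").copy()
    # Blank or non-numeric strings become NaN instead of extra categories.
    out["TotalCharges"] = pd.to_numeric(out["TotalCharges"], errors="coerce")
    # Assumption: rows without a usable TotalCharges are simply dropped.
    return out.dropna(subset=["TotalCharges"])


# Tiny made-up frame with Telco-style column names.
raw = pd.DataFrame({
    "customerID": ["0001-A", "0002-B", "0003-C", "0004-D"],
    "tenure": [1, 34, 0, 45],
    "Contract": ["Month-to-month", "One year", "Two year", "One year"],
    "TotalCharges": ["29.85", "1889.5", " ", "1840.75"],
    "Churn": ["No", "No", "No", "Yes"],
})

# Naive encoding: every ID and every TotalCharges string becomes a column.
naive = pd.get_dummies(raw.drop(columns=["Churn"]))
# Cleaned encoding: only genuine categoricals are expanded.
cleaned = pd.get_dummies(clean_telco(raw).drop(columns=["Churn"]))
print(naive.shape[1], cleaned.shape[1])
```

On four rows the difference is small, but in the naive version the column count grows with the number of customers, which is the high-dimensional matrix the reliability note warns about.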