Churn Prediction with AutoML
Sometimes we don’t even realize how common machine learning (ML) is in our daily lives. Various “intelligent” algorithms help us find the most relevant information (Google), suggest what movie to watch (Netflix), or influence our shopping decisions (Amazon). The biggest international companies quickly recognized the potential of machine learning and turned it into business solutions.
Today, ML is no longer reserved for big companies. Imagine a fairly realistic situation in which a company tries to predict customer behavior from some personal data. Just a few years ago, the best strategy for solving this problem would have been to hire a good data science team. Now, thanks to the growing popularity of ML and ready-made tools, such solutions are within reach even for small start-ups.
Today, I would like to show you a demo of how to solve a difficult business problem with ML. We will take advantage of the mljar.com service and its R API. With just a few lines of code we will be able to achieve very good results.
A typical question for companies operating on a contractual basis (such as Internet or phone providers) is whether a customer will stay for another period or churn. It is crucial for such a company to focus on the customers who are at risk of churning in order to retain them.
In this example we will help a telecom company predict which customers are likely to renew a contract and which are not. We will base the predictions on historical data. The dataset is publicly available and can be downloaded, for example, here: https://github.com/WLOGSolutions/telco-customer-churn-in-r-and-h2o/tree/master/data.
The dataset contains descriptors that fall into three categories:
- demographic data: gender, age, education, etc.
- account information: number of complaints, average call duration, etc.
- services used: Internet, voice, etc.
Finally, there is a column with two labels, churn and no churn, which is the target we want to predict.
Now that we have defined the problem and understand our data, we can move on to the code. mljar.com has both R and Python APIs, but this time we focus on the former.
First of all, we need to load the necessary libraries.
If any of them is missing, you can easily install it with the install.packages command.
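The original listing of libraries is not reproduced here, so the following is only a sketch of how one might check for missing packages before starting. The package list (mljar and pROC, both mentioned in this post) is an assumption; adjust it to whatever your script actually uses, and see the mljar-api-R repository for how to install the client.

```r
# Packages this walkthrough is assumed to rely on (an assumption, not the author's exact list).
required_pkgs <- c("mljar", "pROC")

# requireNamespace() returns FALSE instead of raising an error when a package is absent.
is_installed <- vapply(required_pkgs, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- required_pkgs[!is_installed]

if (length(missing_pkgs) > 0) {
  message("Missing packages: ", paste(missing_pkgs, collapse = ", "))
  # install.packages(missing_pkgs)  # uncomment if they are available in your repository
}
```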
The following lines read and clean the dataset in three steps: we drop the irrelevant column (time), keep only complete records, and remove duplicated rows.
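As a minimal sketch of these three cleaning steps, here is a base R version that runs on a tiny made-up data frame. The column names other than time are hypothetical, and in practice churn_raw would come from reading the downloaded file instead.

```r
# Toy stand-in for the churn data; all column names are hypothetical.
churn_raw <- data.frame(
  time       = c(1, 2, 3, 4, 5),
  age        = c(34, 51, NA, 51, 28),
  complaints = c(0, 2, 1, 2, 0),
  churn      = c("no churn", "churn", "churn", "churn", "no churn"),
  stringsAsFactors = FALSE
)

# 1. Drop the irrelevant time column.
churn <- churn_raw[, setdiff(names(churn_raw), "time"), drop = FALSE]
# 2. Keep only complete records (rows without any NA).
churn <- churn[complete.cases(churn), , drop = FALSE]
# 3. Remove duplicated rows.
churn <- churn[!duplicated(churn), , drop = FALSE]
```

Dropping the time column first matters here: rows that differ only in time would otherwise not be recognized as duplicates.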
It is important to validate our final ML model before publishing it, so we split the churn data into training and test sets in a 7:3 ratio.
For convenience, let’s assign features and labels to separate variables:
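The 7:3 split and the separation of features and labels described above can be sketched in base R as follows. The toy data frame and its column names are hypothetical, and the variable names (x_train, y_train, and so on) are illustrative only.

```r
set.seed(42)  # makes the random split reproducible

# Toy stand-in for the cleaned churn data; column names are hypothetical.
churn <- data.frame(
  age        = c(34, 51, 28, 45, 62, 39, 57, 23, 48, 31),
  complaints = c(0, 2, 0, 1, 3, 0, 2, 0, 1, 0),
  churn      = rep(c("no churn", "churn"), times = 5),
  stringsAsFactors = FALSE
)

# Draw 70% of the row indices for training; the remaining rows form the test set.
train_idx <- sample(seq_len(nrow(churn)), size = round(0.7 * nrow(churn)))
train <- churn[train_idx, ]
test  <- churn[-train_idx, ]

# Separate the features (x) from the label column (y).
feature_cols <- setdiff(names(churn), "churn")
x_train <- train[, feature_cols, drop = FALSE]
y_train <- train$churn
x_test  <- test[, feature_cols, drop = FALSE]
y_test  <- test$churn
```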
And that’s it! We are ready to use mljar to train our ML models.
(Before you start, make sure that you have an mljar account and that your mljar token is set. For more information, take a look at: https://github.com/mljar/mljar-api-R/)
The main advantage of mljar is that it frees the user from having to pick a single model up front. We can train several different models and compare their performance. In this case, we choose the Area Under the ROC Curve (AUC, https://en.wikipedia.org/wiki/Receiver_operating_characteristic) as the performance measure.
We use the following ML algorithms: logistic regression, xgboost, light gradient boosting, extra trees, regularized greedy forest and k-nearest neighbors. In R, training all of these models is as simple as this:
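A sketch of what such a call might look like is shown below. The argument names (proj_title, exp_title, algorithms, metric) and the algorithm identifiers are assumptions based on the mljar-api-R README, so check them against the version you have installed. The experiment title and the wrapper function itself are hypothetical.

```r
# Algorithm identifiers assumed from the mljar-api-R README (verify there):
# logistic regression, xgboost, LightGBM, extra trees, regularized greedy forest, k-NN.
churn_algorithms <- c("logreg", "xgb", "lgb", "etc", "rgfc", "knnc")

# Hypothetical wrapper; nothing is sent to mljar until the function is called.
train_churn_models <- function(x_train, y_train) {
  mljar::mljar_fit(
    x_train, y_train,
    proj_title = "Churn",
    exp_title  = "churn-experiment",  # hypothetical experiment title
    algorithms = churn_algorithms,
    metric     = "auc"
  )
}
```

Calling train_churn_models(x_train, y_train) requires the mljar package, a valid token and network access.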
After running this, when you log in to the mljar.com web service and select the “Churn” project, you should see a preview of the training data with some basic statistics. It is very helpful when you consider which features to use. In this example we decided to use all of them, but that’s not always the best strategy.
With a little bit of patience (how much depends on your mljar subscription), we get the final model with the best results. All models are also listed at mljar.com in the Results card.
The mljar_fit function returns the best model, which in this case is ensemble_avg.
Ensemble average means that the model is a combination of several models, which together usually perform better than any individual model.
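To illustrate the idea only, here is a toy example of averaging predicted probabilities from three models. The numbers are made up, and a plain unweighted mean is an assumption; mljar may build its ensemble differently.

```r
# Hypothetical churn probabilities from three models for the same four customers.
p_logreg <- c(0.20, 0.70, 0.55, 0.10)
p_xgb    <- c(0.30, 0.90, 0.40, 0.05)
p_etc    <- c(0.10, 0.80, 0.55, 0.15)

# Unweighted ensemble: average the probabilities customer by customer.
p_ensemble <- rowMeans(cbind(p_logreg, p_xgb, p_etc))
```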
Another useful option of the mljar web service is hidden in the “Feature Analysis” card. If you select a model there, you will see a graphical illustration of feature importance. For instance, for the extra trees algorithm we have:
The graph leads to the conclusion that age, unpaid invoice balance and monthly billed amounts are the most important customer descriptors, whereas the number of calls and the use of some extra services have almost no impact on churning.
To predict labels on the test set, we use the mljar_predict command.
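A sketch of the prediction step, assuming mljar_predict takes the fitted model, the test features and the project title in that order (as in the mljar-api-R README; verify against your installed version). The wrapper name is hypothetical.

```r
# Hypothetical wrapper; needs a model returned by mljar_fit and a valid mljar token.
predict_churn <- function(model, x_test) {
  # The third argument is assumed to be the project title used during training.
  mljar::mljar_predict(model, x_test, "Churn")
}
```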
Note that the variable y_pr contains the predicted probabilities of our churn/no-churn labels. With the help of the pROC package, we can easily find out which threshold is best in our case:
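To show what "best threshold" can mean, here is a base R sketch that picks the cut-off maximizing Youden's J statistic (sensitivity + specificity - 1), which is the default criterion behind pROC's coords(..., "best"). This is a swapped-in manual computation on made-up data, not the original pROC code, and pROC may report a slightly different numeric threshold because it places cut-offs between observed scores.

```r
# Hypothetical true labels (1 = churn) and predicted churn probabilities.
y_true <- c(0, 0, 1, 0, 1, 1, 0, 1)
y_prob <- c(0.10, 0.70, 0.40, 0.20, 0.80, 0.65, 0.30, 0.90)

# Candidate thresholds: every distinct predicted probability.
thresholds <- sort(unique(y_prob))

# Youden's J for each candidate threshold.
youden_j <- vapply(thresholds, function(t) {
  pred <- y_prob >= t
  sensitivity <- sum(pred & y_true == 1) / sum(y_true == 1)
  specificity <- sum(!pred & y_true == 0) / sum(y_true == 0)
  sensitivity + specificity - 1
}, numeric(1))

best_threshold <- thresholds[which.max(youden_j)]

# Logical predictions: TRUE means "churn".
y_pred <- y_prob >= best_threshold
```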
In the data frame y_pr_c our classification results are encoded as logical values, so let’s transform them into a more readable form:
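One way to do this conversion is sketched below. The column name prediction is hypothetical, as is the convention that TRUE stands for churn.

```r
# Toy stand-in for y_pr_c; the column name is hypothetical, TRUE is assumed to mean "churn".
y_pr_c <- data.frame(prediction = c(TRUE, FALSE, FALSE, TRUE))

# Map logical values to readable labels with a fixed level order.
y_pr_c$label <- factor(
  ifelse(y_pr_c$prediction, "churn", "no churn"),
  levels = c("churn", "no churn")
)
```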
Finally, we can evaluate our model’s prediction with different measures:
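As an illustration of such measures, the sketch below computes a confusion matrix, accuracy, precision, recall and F1 in base R on made-up labels. The choice of metrics is an assumption; the original evaluation may have used others.

```r
lv <- c("churn", "no churn")

# Hypothetical true and predicted labels for eight customers.
actual <- factor(c("churn", "no churn", "churn", "no churn",
                   "no churn", "churn", "no churn", "no churn"), levels = lv)
predicted <- factor(c("churn", "no churn", "no churn", "no churn",
                      "churn", "churn", "no churn", "no churn"), levels = lv)

# Confusion matrix: rows are predictions, columns are true labels.
cm <- table(Predicted = predicted, Actual = actual)

tp <- cm["churn", "churn"]        # churners correctly flagged
fp <- cm["churn", "no churn"]     # loyal customers flagged as churners
fn <- cm["no churn", "churn"]     # churners that were missed

accuracy  <- sum(diag(cm)) / sum(cm)
precision <- tp / (tp + fp)
recall    <- tp / (tp + fn)
f1        <- 2 * precision * recall / (precision + recall)
```

For churn prevention, recall on the churn class is often the figure to watch, since a missed churner usually costs more than an unnecessary retention offer.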
A similar analysis has been performed using another R library for machine learning, H2O (see https://github.com/WLOGSolutions/telco-customer-churn-in-r-and-h2o). You can see a comparison of the results in the table below:
I hope I have convinced you that some complex business problems can be solved easily with machine learning algorithms, using tools like mljar.
(Worth knowing: you can repeat every step of the above analysis with no coding at all, using only the mljar web service. Try it out at mljar.com. If you would like to run the code, it is here.)
For more interesting R resources please check r-bloggers.