"Change is the only constant in life." Heraclitus' observation is just as relevant in the Machine Learning world. ML models running in production need to be monitored, because data can change over time and degrade the model's predictions. A change in data distribution is called data drift. In this article I will show you how to build a dashboard for data drift detection in Python. I will use the Evidently Python package to detect data drift and build a dashboard in a Jupyter Notebook. The notebook will then be published as a web application with the Mercury framework.
The web application is deployed in the Heroku cloud and accessible at data-drift-detection.herokuapp.com. The code for this article is on GitHub: pplonski/data-drift-detection. A demo of the website is below:
What is data drift?
Data drift is a change in the data that can affect Machine Learning model performance. There are different types of drift. One is a change in the target distribution. For example, you have a model that predicts house prices based on the property description (number of rooms, location, etc.), but the market changes and all property prices go up. If you don't detect the drift in the target distribution and don't update the model with new targets, your model will predict prices that are too low. A change in the target distribution is called target drift. Another type is covariate drift: a change in the input data distribution. For example, a categorical feature starts to contain a category that was never seen during training.
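To make the covariate drift example concrete, here is a tiny sketch (toy data with a hypothetical property_type feature, not from the article's dataset) that flags categories appearing in production but unseen in training:

```python
import pandas as pd

# Reference (training) data and new production data -- toy example
reference = pd.DataFrame({"property_type": ["house", "flat", "flat", "house"]})
production = pd.DataFrame({"property_type": ["flat", "house", "bungalow"]})

# Categories seen in production but absent from the training data
new_categories = set(production["property_type"]) - set(reference["property_type"])
print(new_categories)  # {'bungalow'}
```

A non-empty set here is a simple signal that the input distribution has changed and the model may be asked to score values it has never seen.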
What to do after data drift detection?
When data drift is detected, the ML model should be updated. There are many ways to do this, and the right one depends on the data. The most straightforward approach is to train a new model on all available data samples. Another approach is to train the model only on the new samples. There are also approaches based on sample weighting: giving higher weights to fresh data samples and lower weights to old ones.
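As an illustration of the sample-weighting idea, here is a minimal sketch using scikit-learn's standard sample_weight argument on synthetic data (the 0.99 decay factor and the model choice are arbitrary assumptions for the example, not recommendations from the article):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data standing in for a mix of old and new samples;
# assume rows are ordered oldest -> newest
X, y = make_classification(n_samples=200, random_state=0)

# Exponentially decayed weights: the newest sample gets weight 1.0,
# older samples count progressively less
age = np.arange(len(X))[::-1]  # age 0 for the newest (last) row
weights = 0.99 ** age

model = LogisticRegression(max_iter=1000)
model.fit(X, y, sample_weight=weights)
```

Tuning the decay factor trades off responsiveness to fresh data against stability on historical patterns.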
How to detect data drift?
Data drift can be detected in different ways. The simplest approach is to use statistical tests that compare the distribution of the training data (referred to as the baseline or reference) with the live (production) data. If the difference between the two distributions is significant, then drift has occurred. The most popular measures are the two-sample Kolmogorov-Smirnov test, the Chi-square test, Jensen–Shannon divergence, and Wasserstein distance. An alternative approach is to use a Machine Learning model to monitor data quality; hybrid approaches are possible as well.
Personally, I'd start with a simple statistical test and try to understand the data's behavior. If you are reading this article, you are on a good path to building a robust ML system.
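The article uses Evidently for this, but the core idea of a two-sample test can be sketched directly with SciPy (the data here is synthetic, and the 0.05 threshold is a common but arbitrary choice):

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(42)
reference = rng.normal(loc=0.0, scale=1.0, size=500)   # training (baseline) data
production = rng.normal(loc=0.5, scale=1.0, size=500)  # live data with a shifted mean

# Two-sample Kolmogorov-Smirnov test: a small p-value means
# the two samples are unlikely to come from the same distribution
statistic, p_value = ks_2samp(reference, production)
drift_detected = p_value < 0.05
print(f"KS statistic={statistic:.3f}, p-value={p_value:.2e}, drift={drift_detected}")
```

With a mean shift of 0.5 and 500 samples per side, the test confidently reports drift; in practice you would run such a check per feature.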
Data Drift Detection in Python
Let’s start by setting up a code repository for the project. I’m using GitHub. My repository address is: pplonski/data-drift-detection. Please create a new repo and clone it to the local file system:
git clone git@github.com:pplonski/data-drift-detection.git
I'm using a virtual environment for each project and keep the needed packages in the requirements.txt file:

```shell
virtualenv evi
source evi/bin/activate
pip install evidently mljar-mercury
```

The requirements.txt file should list evidently and mljar-mercury. evidently is a package for data drift detection and monitoring. mljar-mercury is a package for converting a Jupyter Notebook into a web application so it can be shared with non-technical users. The evidently docs recommend enabling its Jupyter extension; however, I didn't run those commands and it worked fine (I just needed to add mode="inline" when displaying dashboards).
Let's add the evi environment to the Jupyter kernels list:

```shell
python -m ipykernel install --user --name evi
```
We are ready to create a new notebook. Please remember to select the evi kernel when creating it. Let's import the needed packages:

```python
import pandas as pd
from sklearn import datasets
from evidently.dashboard import Dashboard
from evidently.dashboard.tabs import DataDriftTab, CatTargetDriftTab
```
Let's use a toy dataset from the sklearn package: the Iris dataset, which represents a classification problem.

```python
iris = datasets.load_iris()
iris_frame = pd.DataFrame(iris.data, columns=iris.feature_names)
iris_frame['target'] = iris.target
```
We will use the first 75 samples as reference data (the data used for training). The new_samples variable will control the number of new samples considered for drift detection. The drift detection and dashboard can be constructed with 3 lines of Python code:

```python
iris_data_drift_report = Dashboard(tabs=[DataDriftTab(verbose_level=verbose),
                                         CatTargetDriftTab(verbose_level=verbose)])
iris_data_drift_report.calculate(iris_frame[:75], iris_frame[-new_samples:], column_mapping=None)
iris_data_drift_report.show(mode="inline")
```
The verbose variable is a boolean that controls the verbosity level of the displayed dashboard. I've set it to False. The notebook with the code is presented below:
It works locally, but how do we share the notebook with other stakeholders? Let's make an interactive web application from it and share it with other users. We will use the open-source Mercury framework.
Convert notebook to web app
The Mercury framework is controlled with a YAML header. Just add the YAML configuration in the first cell of the notebook and set the cell type to RAW. The YAML header is below:
```yaml
---
title: 🌷 Iris drift detection
description: Data and target drift detection in classification task
show-code: False
params:
    new_samples:
        input: slider
        value: 25
        label: New samples count
        min: 10
        max: 75
    verbose:
        input: checkbox
        value: False
        label: Verbose
---
```
It sets the title and description of the app and hides the code with the show-code: False parameter. Two widgets are added:
- new_samples - a slider widget, with a default value of 25, that controls the number of new samples used for detecting drift,
- verbose - a checkbox that controls the verbosity of the dashboard.
It is important to note that Mercury widgets are directly connected to variables in the Python code. The variables connected with widgets should be defined in a separate cell:

```python
new_samples = 25
verbose = False
```
The end-user will be able to tweak the widget values. The notebook will be re-executed with the new parameters after clicking Run. That's all! 😊 You can run the web application locally with the mercury run command.
Please open the web browser at http://127.0.0.1:8000. You should see the notebook as a web app. The user can tweak the widget values and click Run to execute the notebook with the new parameters.
The Mercury framework is built on top of Django. It can be deployed on any machine with Python available. I will deploy it to Heroku, which offers free dynos that are perfect for demo purposes. Please check the Mercury documentation for other deployment options. I'm using the Heroku CLI, so the deployment can be done with one command.
Let's create the files needed by Heroku. The Procfile should contain:

```
web: mercury run 0.0.0.0:$PORT
```

We also need a runtime.txt file that pins the Python version for Heroku.
The Heroku app can be created with the following command:

```shell
heroku create data-drift-detection
```
It will create a new dyno for us. The next step is to add all the files to the repo and commit them:

```shell
git add Iris-Drift-Detection.ipynb
git add requirements.txt
git add Procfile
git add runtime.txt
git commit -m "Data drift detection app"
```
The web app is deployed with the following command:

```shell
git push heroku main
```
Please wait a while, and the application should be available online. My app is running at data-drift-detection.herokuapp.com.
Scheduling and email notifications
You can add scheduling and email notifications with the Mercury framework. You need to add the schedule and notify parameters to the YAML header. The schedule parameter accepts a crontab string, which defines the time intervals at which the notebook will be executed automatically. An example schedule parameter that executes the notebook at 8:30 AM from Monday to Friday is presented below:
```yaml
schedule: '30 8 * * 1-5'
```
What is more, the executed notebook can be attached as a PDF to the email:

```yaml
notify:
    on_success: email@example.com, firstname.lastname@example.org
    attachment: pdf
```
The on_success parameter defines the list of emails that will receive the notification after a successful notebook execution. The attachment parameter defines the notebook format that will be attached to the email. The complete YAML header will look like this:
```yaml
---
title: 🌷 Iris drift detection
description: Data and target drift detection in classification task
schedule: '30 8 * * 1-5'
notify:
    on_success: firstname.lastname@example.org, email@example.com
    attachment: pdf
show-code: False
params:
    new_samples:
        input: slider
        value: 25
        label: New samples count
        min: 10
        max: 75
    verbose:
        input: checkbox
        value: False
        label: Verbose
---
```
After the deployment, the notebook will be executed automatically, exported to PDF, and sent by email. With the schedule above you will get a PDF report with the data drift check every weekday. Please check the Mercury documentation for more details about notebook scheduling.
Data drift monitoring and detection is an important part of the Machine Learning pipeline. Tools like Evidently can greatly simplify the process of data drift monitoring. The Mercury framework makes it possible to deploy data drift monitoring notebooks as web applications. The notebook can be executed automatically and sent as a PDF by email.
- Python code used in the article is available at github.com/pplonski/data-drift-detection
- Web application for data drift monitoring is available online on data-drift-detection.herokuapp.com
If you have any questions or need help, please subscribe to the newsletter below. In the form you can leave feedback, a comment, or a question. We do our best to respond quickly :)
💌 Join our newsletter 💌
Subscribe to our newsletter to receive product updates