PostgreSQL and Machine Learning
I will show you how to apply Machine Learning algorithms to data from a PostgreSQL database to get insights and predictions. I will use the Automated Machine Learning (AutoML) package supervised, an open-source Python package. Thanks to AutoML I will get quick access to many ML algorithms: Decision Tree, Logistic Regression, Random Forest, Xgboost, Neural Network. The AutoML will handle feature engineering as well. I will show you Python code snippets that can be reused to integrate Machine Learning with PostgreSQL as a part of an ETL pipeline.
You can find all the code used in this post on GitHub.
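If you would like to run the code yourself, the packages used in this post can be installed with pip (this is my suggested minimal set; the supervised package is published on PyPI as mljar-supervised):
pip install mljar-supervised psycopg2-binary pandas numpy scikit-learn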
The marketing data
I will use the Portuguese Bank Marketing dataset (the bank_cleaned.csv file). This dataset is about marketing campaigns that promote financial products to existing customers of a Portuguese bank. Each contact with a client is described by the following columns:
columns = ["age", "job", "marital", "education", "default_payment", "balance", "housing",
           "loan", "day", "month", "duration", "campaign", "pdays", "previous", "poutcome",
           "response",  # the target
           ]
The response is the target column; it tells us whether the customer subscribed to the financial product. The goal of the analysis is to predict whether a customer will subscribe.
In this analysis I will split the dataset into:
- training data (32,672 samples),
- testing data (8,169 samples).
Both datasets are inserted into the database, but for the test data the response is not inserted.
Setup PostgreSQL database in Docker
I will set up the PostgreSQL database in Docker. The Dockerfile with PostgreSQL:
FROM postgres:alpine
EXPOSE 5555
To build the Docker image and run the container:
docker build -t mydb:latest .
docker run --name my_local_db -e POSTGRES_PASSWORD=1234 -e POSTGRES_DB=db -p 5555:5432 mydb:latest
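To verify that the database is up before moving on, you can (for example) list the container and ping the server with pg_isready; both commands below are just a suggested sanity check:
docker ps --filter name=my_local_db
pg_isready -h 0.0.0.0 -p 5555 -U postgres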
Create table and insert the training data
For interacting with the database I will use Python scripts and the psycopg2 package. To initialize the database, please use the init_db.py file. Let's dig into the code.
""" init_db.py file """
import numpy as np
import pandas as pd
import psycopg2
from io import StringIO
from sklearn.model_selection import train_test_split
from db import db_engine
create_table_sql = """
CREATE TABLE IF NOT EXISTS marketing (
id serial PRIMARY KEY,
age integer,
job varchar(128),
marital varchar(128),
education varchar(128),
default_payment varchar(128),
balance integer,
housing varchar(128),
loan varchar(128),
day integer,
month varchar(128),
duration real,
campaign integer,
pdays integer,
previous integer,
poutcome varchar(128),
response varchar(128),
predicted_response varchar(128)
)
"""
get_data_sql = """select * from marketing"""
df = pd.read_csv("data/bank_cleaned.csv", index_col="id")
df.drop("response_binary", axis=1, inplace=True)
df["predicted_response"] = ""
test_size= 0.20 # 20% for testing
df_train, df_test = train_test_split(df, test_size=test_size, random_state=1234)
df_train.to_csv("data/train.csv")
df_test.to_csv("data/test.csv")
df_test = df_test.copy()
df_test["response"] = ""
df = pd.concat([df_train, df_test])
try:
conn = psycopg2.connect(db_engine())
cur = conn.cursor()
print("Create marketing table")
cur.execute(create_table_sql)
conn.commit()
print("Insert train and test data into table ...")
buffer = StringIO()
df.to_csv(buffer, index_label="id", header=False)
buffer.seek(0)
cur.copy_from(buffer, "marketing", sep=",")
conn.commit()
print("Insert finished.")
cur.close()
except Exception as e:
print("Problems:", str(e))
The code does three things:
- Creates the marketing table if it does not exist.
- Splits the data into train and test sets (80%/20% split) and saves both to disk.
- Inserts the datasets into the database table. The response values are removed from the test samples.
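To initialize the database, run the script (the printed messages come from the print() calls in init_db.py):
python init_db.py
Create marketing table
Insert train and test data into table ...
Insert finished.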
The data is in the database. Let's log into PostgreSQL to check:
> psql -U postgres -d db --host=0.0.0.0 --port=5555
db=# select count(*) from marketing;
count
-------
40841
(1 row)
db=# select response, count(*) from marketing group by response;
response | count
----------+-------
no | 28952
| 8169
yes | 3720
(3 rows)
Let's train the Machine Learning models!
To integrate PostgreSQL with Machine Learning we will need:
- a method to get training data - get_train_data(),
- a method to get live data (for computing predictions) - get_live_data(),
- a method to insert predictions into the database - insert_predictions(predictions, ids),
- a method to get predictions (to compute the accuracy) - get_predictions().
I've created a db.py file to communicate with the database (with psycopg2):
""" db.py file """
""" Database API """
import json
import numpy as np
import pandas as pd
import psycopg2
from io import StringIO
def db_engine():
config = json.load(open("config.json"))
host = config["connection"]["host"]
port = config["connection"]["port"]
user = config["connection"]["user"]
# password should be hidden in production setting
# do not store it in config.json
password = config["connection"]["password"]
db = config["connection"]["db"]
return "user='{}' password='{}' host='{}' port='{}' dbname='{}'".format(
user, password, host, port, db
)
def sql_to_df(sql_query):
try:
conn = psycopg2.connect(db_engine())
cur = conn.cursor()
cur.execute(sql_query)
df = pd.DataFrame(cur.fetchall(), columns=[elt[0] for elt in cur.description])
cur.close()
return df
except Exception as e:
print("Problems:", str(e))
return None
def get_train_data():
config = json.load(open("config.json"))
table = config["automl"]["table"]
features = config["automl"]["features"]
target = config["automl"]["target"]
get_data_sql = f"select {','.join(features+[target])} from {table} where {target} != ''"
df = sql_to_df(get_data_sql)
if df is None:
return None, None
return df[features], df[target]
def get_live_data():
config = json.load(open("config.json"))
table = config["automl"]["table"]
features = config["automl"]["features"]
target = config["automl"]["target"]
predicted = config["automl"]["predicted"]
id_column = config["automl"]["id"]
get_data_sql = f"select {','.join(features + [id_column])} from {table} where {target} = '' and {predicted} = ''"
df = sql_to_df(get_data_sql)
if df is None:
return None, None
return df[features], df[id_column]
def get_predictions():
config = json.load(open("config.json"))
table = config["automl"]["table"]
target = config["automl"]["target"]
predicted = config["automl"]["predicted"]
id_column = config["automl"]["id"]
get_data_sql = f"select {','.join([predicted] + [id_column])} from {table} where {target} = ''"
df = sql_to_df(get_data_sql)
if df is None:
return None
df.index = df[id_column]
return df
def insert_predictions(predictions, ids):
config = json.load(open("config.json"))
table = config["automl"]["table"]
predicted = config["automl"]["predicted"]
id_column = config["automl"]["id"]
try:
conn = psycopg2.connect(db_engine())
cur = conn.cursor()
tuples = list(zip(predictions, ids))
sql_query = f"update {table} set {predicted} = %s where {id_column} = %s"
cur.executemany(sql_query, tuples)
conn.commit()
except Exception as e:
print("Problems:", str(e))
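As a quick sanity check of these helpers, you can run a minimal sketch like the one below (it assumes the marketing table was already populated by init_db.py):
""" sketch: smoke test for db.py """
from db import get_live_data
from db import get_train_data

X_train, y_train = get_train_data()
print(X_train.shape)           # features of rows with a known response
print(y_train.value_counts())  # class distribution of the target

X_live, ids = get_live_data()
print(X_live.shape)            # rows still waiting for predictions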
You can see that all the information needed to connect and to get the data is loaded from the config.json file:
{
  "connection": {
    "host": "0.0.0.0",
    "port": 5555,
    "user": "postgres",
    "password": "1234",
    "db": "db"
  },
  "automl": {
    "table": "marketing",
    "features": [
      "age",
      "job",
      "marital",
      "education",
      "default_payment",
      "balance",
      "housing",
      "loan",
      "day",
      "month",
      "duration",
      "campaign",
      "pdays",
      "previous",
      "poutcome"
    ],
    "target": "response",
    "predicted": "predicted_response",
    "id": "id"
  }
}
- This file contains the connection details (host, port, user, password, db).
- Additionally, it defines the data source for Machine Learning (the table parameter). The features list describes the AutoML input, target is the AutoML output, predicted is the name of the column where predictions will be stored, and id is the index column.
- You can reuse this file to define your own integration of PostgreSQL with AutoML.
- The password is in the config file for example purposes only. In a production setting it should be hidden, for example in an environment variable (see the sketch below).
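For example, db_engine() could prefer an environment variable and fall back to config.json only for local development; the variable name DB_PASSWORD below is just an example:
import os

# sketch: read the password from the environment instead of config.json
password = os.environ.get("DB_PASSWORD", config["connection"]["password"])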
Let's train AutoML
You might find it surprising, but there are only 5 lines of code in the train_automl.py file:
""" train_automl.py file """
from db import get_train_data
from supervised import AutoML
# get the training data
X_train, y_train = get_train_data()
# train AutoML
automl = AutoML(results_path="Response_Classifier")
automl.fit(X_train, y_train)
This code gets the training data from the database and fits the AutoML object. The results of the AutoML are saved in the Response_Classifier directory. All models and preprocessing details are automatically saved to the hard drive. Additionally, README.md Markdown reports are created for the AutoML and for each Machine Learning model. You can check them on GitHub; here are links for a few examples:
| Best model | name | model_type | metric_type | metric_value | train_time | Link |
|---|---|---|---|---|---|---|
| | 1_Baseline | Baseline | logloss | 0.354508 | 0.32 | Results link |
| | 2_DecisionTree | Decision Tree | logloss | 0.269144 | 15.8 | Results link |
| | 3_Linear | Linear | logloss | 0.237079 | 7.45 | Results link |
| | 4_Default_RandomForest | Random Forest | logloss | 0.230933 | 14.16 | Results link |
| | 5_Default_Xgboost | Xgboost | logloss | 0.203599 | 10.48 | Results link |
| | 6_Default_NeuralNetwork | Neural Network | logloss | 0.209968 | 27.67 | Results link |
| the best | Ensemble | Ensemble | logloss | 0.199728 | 1.75 | Results link |
From the results you can see that the best model is the Ensemble (which is a combination of Xgboost and Neural Network). The Ensemble breaks the 0.2 logloss level, while none of the single algorithms can.
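If you want more control over the training, the AutoML constructor accepts additional arguments. Below is a minimal sketch; the argument names follow the mljar-supervised API, but the values are examples only and not the settings used in this post:
""" sketch: a customized AutoML run """
from db import get_train_data
from supervised import AutoML

X_train, y_train = get_train_data()
automl = AutoML(
    results_path="Response_Classifier_v2",     # example path, to keep the original results intact
    mode="Compete",                            # optimize for performance instead of explanations
    algorithms=["Xgboost", "Neural Network"],  # restrict the set of trained algorithms
    total_time_limit=3600,                     # training budget in seconds
    eval_metric="logloss",
)
automl.fit(X_train, y_train)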
Do you need Machine Learning at all?
There is a Baseline algorithm trained by the AutoML. It returns the most frequent class as its prediction. To check whether you need Machine Learning at all, we can compare the best model's performance with the Baseline:
% difference = (0.3545 - 0.1997) / 0.3545 * 100.0 = 43.67%
The best model is about 43% better than the Baseline. Yes, you should use Machine Learning for this problem. (I would consider a difference smaller than 5%-10% a sign to reconsider using Machine Learning, or a need to double-check your data.)
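The same calculation in Python, using the rounded logloss values from the formula above:
# relative improvement of the Ensemble over the Baseline (logloss, lower is better)
baseline_logloss = 0.3545
best_logloss = 0.1997
improvement = (baseline_logloss - best_logloss) / baseline_logloss * 100.0
print(f"{improvement:.2f}%")  # 43.67%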
Automated Exploratory Data Analysis
The AutoML has a feature to compute Automated Exploratory Data Analysis (EDA). All plots and information from the EDA are saved in a Markdown report which can be checked at this link. (Notice that all reports are stored on GitHub, so they can be easily shared.)
Example EDA for the target:
- Feature: target
- Feature type: categorical
- Missing: 0.0%
- Count: 32672
- Unique: 2
- Top: no
- Freq: 28952
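You can reproduce these target statistics yourself with a few lines of pandas (a sketch that reads the train.csv file saved by init_db.py):
import pandas as pd

df_train = pd.read_csv("data/train.csv", index_col="id")
print(df_train["response"].describe())       # count, unique, top, freq
print(df_train["response"].isnull().mean())  # fraction of missing values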
Insights with Explainability
By default, the supervised package produces explanations for its models. Let me show you a few examples, all produced with the 5 lines of code above (2 of which are imports).
Decision Tree visualization
Visualization of the Decision Tree algorithm, made with the dtreeviz package. The visualization is available in the Markdown report.
Coefficients of Linear model
You can inspect the Linear model coefficients to check which ones are used the most. I will show the top-10 coefficients here (based on their values). The full list of coefficients can be found in the Markdown report.
| feature | Learner_1 |
|---|---|
| month_mar | 1.65158 |
| poutcome_success | 1.60419 |
| month_sep | 1.1394 |
| duration | 1.07018 |
| month_oct | 1.01796 |
| month_dec | 0.611299 |
| job_retired | 0.510886 |
| job_student | 0.501499 |
| job_admin. | 0.223268 |
| education_tertiary | 0.219852 |
| ... | ... |
Feature Importance
Xgboost has the best performance as a single model. Let's check which features were the most important. The AutoML uses two methods to compute feature importance:
- permutation-based feature importance,
- SHAP-based feature importance.
(A quick hand-made sketch of the permutation-based idea follows the importance plots below.)
Permutation-based importance
SHAP-based importance
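The permutation-based method is simple enough to sketch by hand: shuffle one feature at a time and measure how much the validation score drops. Below is a minimal illustration with scikit-learn on a few numeric columns; this is not the AutoML's internal code, just the idea behind the plot above:
""" sketch: permutation-based feature importance """
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

df = pd.read_csv("data/train.csv", index_col="id")
numeric_features = ["age", "balance", "day", "duration", "campaign", "pdays", "previous"]
X, y = df[numeric_features], df["response"]
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=1234)

model = RandomForestClassifier(n_estimators=100, random_state=1234).fit(X_train, y_train)
result = permutation_importance(model, X_val, y_val, n_repeats=5, random_state=1234)
for name, importance in sorted(zip(numeric_features, result.importances_mean), key=lambda x: -x[1]):
    print(f"{name}: {importance:.4f}")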
SHAP Explanations - Dependence Plots
Additionally, the AutoML produces dependence plots with the shap package.
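For reference, generating a single dependence plot with the shap package looks roughly like the sketch below (a plain Xgboost model on numeric columns, not the exact code used by the AutoML):
""" sketch: SHAP dependence plot for the duration feature """
import pandas as pd
import shap
from xgboost import XGBClassifier

df = pd.read_csv("data/train.csv", index_col="id")
numeric_features = ["age", "balance", "day", "duration", "campaign", "pdays", "previous"]
X = df[numeric_features]
y = (df["response"] == "yes").astype(int)

model = XGBClassifier(n_estimators=100).fit(X, y)
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)
shap.dependence_plot("duration", shap_values, X)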
SHAP Explanations - Decision Plots
The decision plots for the top-10 correct classifications for response=yes:
The decision plots for the top-10 incorrect classifications for response=yes:
Predict
OK, enough. You can definitely see that supervised is an amazing ML package. Let's use our model to compute predictions and insert them into PostgreSQL:
""" predict.py file """
from db import get_live_data
from db import insert_predictions
from supervised import AutoML
X_live, ids = get_live_data()
if X_live is None or not X_live.shape[0]:
print("No new data")
else:
print("Compute predictions")
automl = AutoML(results_path="Response_Classifier")
predictions = automl.predict(X_live)
print("Insert prediction into DB")
insert_predictions(predictions, ids)
The code for computing predictions works in three steps:
- Get the live data - rows without response (our target) and without predicted_response values.
- Compute predictions with the AutoML (the ensemble of Xgboost and Neural Network).
- Insert the predictions into the database.
Let's see how it looks in the database:
db=# select * from marketing where predicted_response in ('yes', 'no') limit 10;
id | age | job | marital | education | default_payment | balance | housing | loan | day | month | duration | campaign | pdays | previous | poutcome | response | predicted_response
-------+-----+---------------+----------+-----------+-----------------+---------+---------+------+-----+-------+----------+----------+-------+----------+----------+----------+--------------------
44301 | 35 | management | married | tertiary | no | 775 | no | no | 27 | jul | 0.15 | 1 | -1 | 0 | unknown | | no
2940 | 41 | services | married | secondary | no | 890 | yes | no | 14 | may | 13.82 | 1 | -1 | 0 | unknown | | yes
43358 | 26 | student | married | tertiary | no | 8750 | no | no | 26 | mar | 1.63 | 2 | -1 | 0 | unknown | | no
6184 | 28 | management | single | tertiary | no | 939 | yes | no | 27 | may | 12.33 | 1 | -1 | 0 | unknown | | yes
1991 | 44 | unemployed | married | secondary | no | -407 | yes | no | 9 | may | 1.98 | 2 | -1 | 0 | unknown | | no
35422 | 31 | technician | single | secondary | no | 1626 | no | no | 7 | may | 5.93 | 2 | 286 | 3 | failure | | no
25082 | 33 | self-employed | divorced | tertiary | no | 454 | no | yes | 18 | nov | 2.23 | 1 | -1 | 0 | unknown | | no
21757 | 47 | management | married | tertiary | no | 600 | no | no | 19 | aug | 4.83 | 15 | -1 | 0 | unknown | | no
45162 | 29 | admin. | single | secondary | no | 464 | no | no | 9 | nov | 3.47 | 2 | 91 | 3 | success | | yes
30301 | 39 | technician | married | primary | no | 1846 | no | no | 5 | feb | 5.1 | 5 | -1 | 0 | unknown | | no
(10 rows)
Nice, isn't it? You can fill your database with information from the future!
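To make this a part of an ETL pipeline, predict.py can be scheduled to run periodically, for example with cron (the project path below is hypothetical):
# hypothetical crontab entry: score new rows at the start of every hour
0 * * * * cd /path/to/project && python predict.py >> predict.log 2>&1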
Feedback and accuracy
OK, let's assume that some time has passed and the 'real' response values are now known, so we can compute the accuracy of the model.
""" feedback.py file """
""" Compute accuracy of predictions """
import json
import numpy as np
import pandas as pd
from db import get_predictions
config = json.load(open("config.json"))
target = config["automl"]["target"]
predicted = config["automl"]["predicted"]
id_column = config["automl"]["id"]
test = pd.read_csv("data/test.csv", index_col=id_column)
predictions = get_predictions()
test[predicted] = predictions[predicted]
accuracy = np.round(np.sum(test[predicted] == test[target]) / test.shape[0] * 100.0, 2)
print(f"Accuracy: {accuracy}%")
This code has three steps:
- Get the test data from the file with the 'real' response values.
- Get the predictions from the database.
- Compute the accuracy of the predictions.
Running the feedback.py file will print out the accuracy:
python feedback.py
Accuracy: 91.14%
This means that 91.14% of the answers from our Machine Learning model were correct! Pretty nice! :)
Summary
- The Python scripts for communication between PostgreSQL and Machine Learning were shown. The scripts can be extended and used as a part of an ETL pipeline.
- The Machine Learning model was trained with the Automated Machine Learning package supervised.
- The described solution pulls data from PostgreSQL and keeps it in local memory. There are solutions for in-database Machine Learning, for example MADlib, which can train ML models inside the database. From my experience, the in-database selection of algorithms is limited (just basic ML models). Even with in-database algorithms the data needs to be loaded into memory somewhere, so you can still be limited by the database machine's capacity. And just imagine what debugging in-database ML models would look like ...
- All code scripts are available on GitHub.