Sep 16 2020 · Piotr Płoński

PostgreSQL and Machine Learning

I will show you how to apply Machine Learning algorithms to data from a PostgreSQL database to get insights and predictions. I will use the Automated Machine Learning (AutoML) package supervised, which is open-source Python software. Thanks to AutoML, I will get quick access to many ML algorithms: Decision Tree, Logistic Regression, Random Forest, Xgboost, and Neural Network. The AutoML will handle feature engineering as well. I will show you Python code snippets that can be reused to integrate Machine Learning with PostgreSQL as a part of an ETL pipeline.

You can find all the code used in this post on GitHub.


The marketing data

I will use the Portuguese Bank Marketing dataset (the bank_cleaned.csv file). This dataset describes marketing campaigns that promote financial products to existing customers of a Portuguese bank. Each contact with a client is described by the following columns:

columns = ["age", "job", "marital", "education", "default_payment", "balance", "housing",
           "loan", "day", "month", "duration", "campaign", "pdays", "previous", "poutcome",
           "response"  # the target
          ]

The response is the target column; it contains the information whether the customer subscribed to the financial product. The goal of the analysis is to predict whether a customer will subscribe.
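If you want to take a quick look at the data before loading it into the database, a short pandas check (a minimal sketch, assuming the bank_cleaned.csv file is in the data/ directory, as in the repository) shows the shape and the target distribution:

import pandas as pd

# peek at the marketing data and the target distribution
df = pd.read_csv("data/bank_cleaned.csv", index_col="id")
print(df.shape)
print(df["response"].value_counts())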

In this analysis I will split the dataset into:

  • training data (32,672 samples),
  • testing data (8,169 samples).

Both datasets are inserted into the database, but for the testing data the response values are not inserted.

Setup PostgreSQL database in Docker

I will set up the PostgreSQL database in Docker.

The Dockerfile with PostgreSQL:

FROM postgres:alpine
EXPOSE 5555

To build the Docker image and run the container (host port 5555 is mapped to PostgreSQL's default port 5432 inside the container):

docker build -t mydb:latest .
docker run --name my_local_db -e POSTGRES_PASSWORD=1234 -e POSTGRES_DB=db -p 5555:5432 mydb:latest
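Once the container is running, it is worth a quick smoke test that the database is reachable from Python. A minimal sketch using psycopg2 and the credentials from the docker run command above:

import psycopg2

# quick connectivity check against the database running in Docker
conn = psycopg2.connect(
    host="0.0.0.0", port=5555, user="postgres", password="1234", dbname="db"
)
cur = conn.cursor()
cur.execute("SELECT version();")
print(cur.fetchone()[0])
cur.close()
conn.close()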

Create table and insert the training data

To interact with the database I will use Python scripts and the psycopg2 package. To initialize the database, please use the init_db.py file. Let's dig into the code.

""" init_db.py file """
import numpy as np
import pandas as pd
import psycopg2
from io import StringIO

from sklearn.model_selection import train_test_split
from db import db_engine 

create_table_sql = """
CREATE TABLE IF NOT EXISTS marketing (
    id serial PRIMARY KEY,
    age integer,
    job varchar(128),
    marital varchar(128),
    education varchar(128),
    default_payment varchar(128),
    balance integer,
    housing varchar(128),
    loan varchar(128),
    day integer,
    month varchar(128),
    duration real,
    campaign integer, 
    pdays integer,
    previous integer,
    poutcome varchar(128),
    response varchar(128),
    predicted_response varchar(128)
)
"""

get_data_sql = """select * from marketing"""

df = pd.read_csv("data/bank_cleaned.csv", index_col="id")
df.drop("response_binary", axis=1, inplace=True)
df["predicted_response"] = ""
test_size= 0.20 # 20% for testing
df_train, df_test = train_test_split(df, test_size=test_size, random_state=1234)
df_train.to_csv("data/train.csv")
df_test.to_csv("data/test.csv")
df_test = df_test.copy()
df_test["response"] = ""

df = pd.concat([df_train, df_test])

try:
    conn = psycopg2.connect(db_engine())
    cur = conn.cursor()
    print("Create marketing table")
    cur.execute(create_table_sql)
    conn.commit()
    print("Insert train and test data into table ...")
    buffer = StringIO()
    df.to_csv(buffer, index_label="id", header=False)
    buffer.seek(0)
    cur.copy_from(buffer, "marketing", sep=",")
    conn.commit()
    print("Insert finished.")

    cur.close()
    conn.close()
except Exception as e:
    print("Problems:", str(e))

The code does three things:

  1. Creates the marketing table if it does not exist.
  2. Splits the data into train and test sets (an 80%/20% split). Both datasets are saved to disk.
  3. Inserts both datasets into the table in the database. The response values are removed from the test samples.

The data is in the database. Let's log into PostgreSQL to check:

> psql -U postgres -d db --host=0.0.0.0 --port=5555 

db=# select count(*) from marketing;
 count 
-------
 40841
(1 row)

db=# select response, count(*) from marketing group by response;
 response | count 
----------+-------
 no       | 28952
          |  8169
 yes      |  3720
(3 rows)

Let's train the Machine Learning models!

To integrate PostgreSQL with Machine Learning, we will need:

  • a method to get the training data - get_train_data(),
  • a method to get live data (for computing predictions) - get_live_data(),
  • a method to insert predictions into the database - insert_predictions(predictions, ids),
  • a method to get predictions back (to compute the accuracy) - get_predictions().

I've created the db.py file to communicate with the database (with psycopg2):

""" db.py file """
""" Database API """
import json
import numpy as np
import pandas as pd
import psycopg2
from io import StringIO

def db_engine():
   config = json.load(open("config.json"))
   host = config["connection"]["host"]
   port = config["connection"]["port"]
   user = config["connection"]["user"]
   # password should be hidden in production setting
   # do not store it in config.json
   password = config["connection"]["password"]
   db = config["connection"]["db"]

   return "user='{}' password='{}' host='{}' port='{}' dbname='{}'".format(
       user, password, host, port, db
   )

def sql_to_df(sql_query):
   try:
       conn = psycopg2.connect(db_engine())
       cur = conn.cursor()
       cur.execute(sql_query)
       df = pd.DataFrame(cur.fetchall(), columns=[elt[0] for elt in cur.description])
       cur.close()
       return df
   except Exception as e:
       print("Problems:", str(e))
   
   return None


def get_train_data():
   config = json.load(open("config.json"))
   table = config["automl"]["table"]
   features = config["automl"]["features"]
   target = config["automl"]["target"]
   get_data_sql = f"select {','.join(features+[target])} from {table} where {target} != ''"
   df = sql_to_df(get_data_sql)
   if df is None:
       return None, None
   return df[features], df[target]
   
   
def get_live_data():
   config = json.load(open("config.json"))
   table = config["automl"]["table"]
   features = config["automl"]["features"]
   target = config["automl"]["target"]
   predicted = config["automl"]["predicted"]
   id_column = config["automl"]["id"]
   get_data_sql = f"select {','.join(features + [id_column])} from {table} where {target} = '' and {predicted} = ''"
   df = sql_to_df(get_data_sql)
   if df is None:
       return None, None
   return df[features], df[id_column]


def get_predictions():
   config = json.load(open("config.json"))
   table = config["automl"]["table"]
   target = config["automl"]["target"]
   predicted = config["automl"]["predicted"]
   id_column = config["automl"]["id"]
   get_data_sql = f"select {','.join([predicted] + [id_column])} from {table} where {target} = ''"
   df = sql_to_df(get_data_sql)
   if df is None:
       return None
   df.index = df[id_column]
   return df
   
def insert_predictions(predictions, ids):
   config = json.load(open("config.json"))
   table = config["automl"]["table"]
   predicted = config["automl"]["predicted"]
   id_column = config["automl"]["id"]

   try:
       conn = psycopg2.connect(db_engine())
       cur = conn.cursor()
       tuples = list(zip(predictions, ids))
       sql_query = f"update {table} set {predicted} = %s where {id_column} = %s"
       cur.executemany(sql_query, tuples)
       conn.commit()
   except Exception as e:
       print("Problems:", str(e))

You can see that all the information needed to connect and to get the data is loaded from the config.json file:

{
    "connection": {
        "host": "0.0.0.0",
        "port": 5555,
        "user": "postgres",
        "password": "1234",
        "db": "db"
    },
    "automl": {
        "table": "marketing",
        "features": [
            "age",
            "job",
            "marital",
            "education",
            "default_payment",
            "balance",
            "housing",
            "loan",
            "day",
            "month",
            "duration",
            "campaign",
            "pdays",
            "previous",
            "poutcome"
        ],
        "target": "response",
        "predicted": "predicted_response",
        "id": "id"
    }
}
  • This file contains the connection details (host, port, user, password, db).
  • Additionally, it defines the data source for Machine Learning (the table parameter). The features describe the AutoML input, target is the AutoML output, predicted is the name of the column where predictions will be stored, and id is the index column.
  • You can reuse this file to define your own integration of PostgreSQL with AutoML.
  • The password is in the config file just for example purposes. In a production setting it should be hidden, for example in an environment variable, as shown in the sketch below.
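
A minimal sketch of a db_engine() variant that reads the password from an environment variable instead of config.json (the POSTGRES_PASSWORD variable name is just an example, not part of the original repository):

import json
import os

def db_engine():
    config = json.load(open("config.json"))
    conn = config["connection"]
    # read the secret from the environment instead of the config file
    password = os.environ["POSTGRES_PASSWORD"]
    return "user='{}' password='{}' host='{}' port='{}' dbname='{}'".format(
        conn["user"], password, conn["host"], conn["port"], conn["db"]
    )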

Let's train AutoML

You might find it surprising, but there are only 5 lines of code in the train_automl.py file:

""" train_automl.py file """
from db import get_train_data
from supervised import AutoML

# get the training data
X_train, y_train = get_train_data()
# train AutoML
automl = AutoML(results_path="Response_Classifier")
automl.fit(X_train, y_train)

This code gets the training data from the database and calls fit() on the AutoML object. The results of the AutoML are saved in the Response_Classifier directory. All models and preprocessing details are automatically saved to the hard drive. Additionally, README.md Markdown reports are created for the AutoML run and for each Machine Learning model. You can check them on GitHub; here are links to a few examples:

 Best model | name                    | model_type     | metric_type | metric_value | train_time | Link
------------+-------------------------+----------------+-------------+--------------+------------+--------------
            | 1_Baseline              | Baseline       | logloss     | 0.354508     | 0.32       | Results link
            | 2_DecisionTree          | Decision Tree  | logloss     | 0.269144     | 15.8       | Results link
            | 3_Linear                | Linear         | logloss     | 0.237079     | 7.45       | Results link
            | 4_Default_RandomForest  | Random Forest  | logloss     | 0.230933     | 14.16      | Results link
            | 5_Default_Xgboost       | Xgboost        | logloss     | 0.203599     | 10.48      | Results link
            | 6_Default_NeuralNetwork | Neural Network | logloss     | 0.209968     | 27.67      | Results link
 the best   | Ensemble                | Ensemble       | logloss     | 0.199728     | 1.75       | Results link

From the results you can see that the best model is an Ensemble (a combination of Xgboost and Neural Network). The Ensemble breaks the 0.2 logloss level, which none of the single algorithms can do.
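
The run above uses the AutoML defaults. If you want to control which algorithms are trained or cap the training time, the AutoML constructor accepts additional parameters; a minimal sketch (the values below are examples, not the settings used in this post, and the exact parameter set may differ between package versions):

from supervised import AutoML

# example configuration, not the one used for the results above
automl = AutoML(
    results_path="Response_Classifier",
    mode="Explain",  # mode with reports and explanations
    algorithms=["Baseline", "Decision Tree", "Linear",
                "Random Forest", "Xgboost", "Neural Network"],
    total_time_limit=30 * 60,  # limit the whole training to 30 minutes
)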

Do you need Machine Learning at all?

AutoML also trains a Baseline algorithm, which returns the most frequent class as the prediction. To check whether you need Machine Learning at all, we can compare the best model's performance with the Baseline:

% difference = (0.3545 - 0.1997) / 0.3545 * 100.0 =  43.67%

The best model is over 43% better than the Baseline. Yes, you should use Machine Learning for this problem. (I would consider a difference smaller than 5%-10% a sign to reconsider using Machine Learning, or a prompt to double-check your data.)
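The same check can be scripted, so it can run automatically after each training (using the rounded logloss values from the leaderboard above):

baseline_logloss = 0.3545
best_logloss = 0.1997

# relative improvement of the best model over the Baseline
improvement = (baseline_logloss - best_logloss) / baseline_logloss * 100.0
print(f"Improvement over Baseline: {improvement:.2f}%")  # prints 43.67%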

Automated Exploratory Data Analysis

The AutoML has a feature to compute Automated Exploratory Data Analysis (EDA). All plots and information from the EDA are saved in a Markdown report, which can be checked at this link. (Notice that all reports are stored on GitHub, so they can easily be shared.)

Example EDA for Target

Feature: target
Feature type: categorical
Missing: 0.0%
Unique: 2
Count: 32672
Top: no
Freq: 28952

Target distribution

Insights with Explainability

By default, the supervised package produces explanations for the models. Let me show you a few examples, all produced with the 5 lines of code above (2 of them imports).

Decision Tree visualization

A visualization of the Decision Tree algorithm, made with the dtreeviz package. The visualization is available in the Markdown report.

Decision tree visualization

Coefficients of Linear model

You can inspect the Linear model coefficients to check which ones matter the most. Here are the top 10 coefficients (by value). The full list of coefficients can be found in the Markdown report.

 feature            | Learner_1
--------------------+-----------
 month_mar          | 1.65158
 poutcome_success   | 1.60419
 month_sep          | 1.1394
 duration           | 1.07018
 month_oct          | 1.01796
 month_dec          | 0.611299
 job_retired        | 0.510886
 job_student        | 0.501499
 job_admin.         | 0.223268
 education_tertiary | 0.219852
 ...                | ...

Feature Importance

The Xgboost model has the best performance as a single model. Let's check which features were the most important. The AutoML uses two methods to compute feature importance:

  • permutation-based feature importance,
  • and SHAP-based feature importance.

Permutation-based importance

Permutation based feature importance

SHAP-based importance

SHAP based feature importance

SHAP Explanations - Dependence Plots

Additionally, the AutoML produces dependence plots with the shap package:

SHAP dependence plot

SHAP Explanations - Decision Plots

The decision plots for the top 10 correct classifications for response=yes:

SHAP the best decision plots

The decision plots for the top 10 incorrect classifications for response=yes:

SHAP the worst decision plots

Predict

OK, enough. You can definitely see that supervised is an amazing ML package. Let's use our model to compute predictions and insert them into PostgreSQL:

""" predict.py file """
from db import get_live_data
from db import insert_predictions
from supervised import AutoML


X_live, ids = get_live_data()
if X_live is None or not X_live.shape[0]:
    print("No new data")
else:
    print("Compute predictions")
    automl = AutoML(results_path="Response_Classifier")
    predictions = automl.predict(X_live)
    print("Insert prediction into DB")
    insert_predictions(predictions, ids)

The code for computing predictions works in three steps:

  1. Get the live data - the rows without response (our target) and predicted_response values.
  2. Compute predictions with the AutoML (the ensemble of Xgboost and Neural Network).
  3. Insert the predictions into the database.
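
In an ETL setting, predict.py would typically be run on a schedule (cron or a workflow orchestrator is the usual choice in production). Purely as an illustration, a minimal Python runner could look like this:

import subprocess
import time

# re-run the prediction script every hour (illustration only)
while True:
    subprocess.run(["python", "predict.py"], check=False)
    time.sleep(60 * 60)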

Here is how it looks in the database:

db=# select * from marketing where predicted_response in ('yes', 'no') limit 10;
  id   | age |      job      | marital  | education | default_payment | balance | housing | loan | day | month | duration | campaign | pdays | previous | poutcome | response | predicted_response 
-------+-----+---------------+----------+-----------+-----------------+---------+---------+------+-----+-------+----------+----------+-------+----------+----------+----------+--------------------
 44301 |  35 | management    | married  | tertiary  | no              |     775 | no      | no   |  27 | jul   |     0.15 |        1 |    -1 |        0 | unknown  |          | no
  2940 |  41 | services      | married  | secondary | no              |     890 | yes     | no   |  14 | may   |    13.82 |        1 |    -1 |        0 | unknown  |          | yes
 43358 |  26 | student       | married  | tertiary  | no              |    8750 | no      | no   |  26 | mar   |     1.63 |        2 |    -1 |        0 | unknown  |          | no
  6184 |  28 | management    | single   | tertiary  | no              |     939 | yes     | no   |  27 | may   |    12.33 |        1 |    -1 |        0 | unknown  |          | yes
  1991 |  44 | unemployed    | married  | secondary | no              |    -407 | yes     | no   |   9 | may   |     1.98 |        2 |    -1 |        0 | unknown  |          | no
 35422 |  31 | technician    | single   | secondary | no              |    1626 | no      | no   |   7 | may   |     5.93 |        2 |   286 |        3 | failure  |          | no
 25082 |  33 | self-employed | divorced | tertiary  | no              |     454 | no      | yes  |  18 | nov   |     2.23 |        1 |    -1 |        0 | unknown  |          | no
 21757 |  47 | management    | married  | tertiary  | no              |     600 | no      | no   |  19 | aug   |     4.83 |       15 |    -1 |        0 | unknown  |          | no
 45162 |  29 | admin.        | single   | secondary | no              |     464 | no      | no   |   9 | nov   |     3.47 |        2 |    91 |        3 | success  |          | yes
 30301 |  39 | technician    | married  | primary   | no              |    1846 | no      | no   |   5 | feb   |      5.1 |        5 |    -1 |        0 | unknown  |          | no
(10 rows)

Nice, isn't it? You can fill your database with information from the future!

Feedback and accuracy

OK, let's assume that some time has passed and the 'real' response values are known, so we can compute the accuracy of the model.

""" feedback.py file """
""" Compute accuracy of predictions """
import json
import numpy as np
import pandas as pd
from db import get_predictions


config = json.load(open("config.json"))
target = config["automl"]["target"]
predicted = config["automl"]["predicted"]
id_column = config["automl"]["id"]

test = pd.read_csv("data/test.csv", index_col=id_column)
predictions = get_predictions()
test[predicted] = predictions[predicted]

accuracy = np.round(np.sum(test[predicted] == test[target]) / test.shape[0] * 100.0, 2)
print(f"Accuracy: {accuracy}%")

This code has three steps:

  1. Get test data from the file with 'real' response values.
  2. Get predictions from the database.
  3. Compute the accuracy of the predictions.

Running the feedback.py file will print out the accuracy:

python feedback.py 

Accuracy: 91.14%

An accuracy of 91.14% means that 91.14% of the answers from our Machine Learning model were correct. Pretty nice! :)
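Note that the response classes are imbalanced (far more 'no' than 'yes' answers, as seen in the counts earlier), so plain accuracy can be optimistic. A minimal sketch that breaks the results down into a confusion matrix, reusing the test, target, and predicted variables from feedback.py:

import pandas as pd

# cross-tabulate the real responses against the predicted ones
confusion = pd.crosstab(
    test[target], test[predicted], rownames=["real"], colnames=["predicted"]
)
print(confusion)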

Summary

  • The Python scripts for communication between PostgreSQL and Machine Learning were shown. The scripts can be extended and used as a part of an ETL pipeline.
  • The Machine Learning model was trained with the Automated Machine Learning package supervised.
  • The described solution pulls data from PostgreSQL and keeps it in local memory. There are solutions for in-database Machine Learning, for example MADlib, which can train ML models inside the database. From my experience, the in-database offering of algorithms is limited (just basic ML models). Even for in-database algorithms the data needs to be loaded somewhere, so you can be limited by the database machine's capacity. And just imagine what debugging in-database ML models looks like ...
  • All code scripts are available on GitHub.