CatBoost Custom Evaluation Metric CatBoost is a powerful gradient boosting framework. It can be used for classification, regression, and ranking. It is available in many languages, like: Python, R, Java, and C++. It can handle categorical features without any preprocessing. As all gradient boosting algorithms it can overfit if trained with too many trees (iterations). If the number of trees is too small, then we will observe underfit. To find the optimal number of trees the early stopping can be applied. This technique observes the evaluation metric on the separate dataset (from training).

CatBoost has many evaluation metrics built-in:

  • evaluation metrics for binary classification: docs,
  • evaluation metrics for multiclass classification: docs,
  • evaluation metrics for regression: docs.

Although there are many metrics, sometimes there might be a situation that you would like to monitor an evaluation metric that is not available, in such situation you can write your own custom evaluation metric. In this post, I will show you how to do this.

Loss function vs. objective vs. evaluation metric

There can be defined two functions in the CatBoost:

  • loss function that is used for tree building (for training), it needs to be differentiable,
  • evaluation metric that is used in early stopping for overfitting detection. It doesn’t need to be differentiable.

The loss function is sometimes called the objective. In this post, we will set a custom evaluation metric.

Class for custom eval_metric

In the CatBoost the evaluation metric needs to be defined as a class with three methods:

  • get_final_error(self, error, weight),
  • is_max_optimal(self),
  • evaluate(self, appoxes, target, weight).

In this post, we will create custom evaluation metric for computing Pearson correlation between predictions and target values. The code to compute correlation as a custom metric:

class CatBoostEvalMetricPearson(object):
    def get_final_error(self, error, weight):
        return error

    def is_max_optimal(self):
        # the larger metric value the better
        return True

    def evaluate(self, approxes, target, weight):
        assert len(approxes) == 1
        assert len(target) == len(approxes[0])
        preds = np.array(approxes[0])
        target = np.array(target)
        return np.corrcoef(target, preds)[0, 1], 0

Example training of CatBoost with custom eval_metric

Let’s create a working example with a toy dataset. To generate the dataset, we will use make_regression() from a scikit-learn package.

import numpy as np
from catboost import CatBoostRegressor, Pool
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from matplotlib import pyplot as plt
import matplotlib
matplotlib.rcParams.update({'font.size': 14})

# generate data
X, y = make_regression(n_samples=100, n_features=10, n_informative=5)
X_train, X_validation, y_train, y_validation = train_test_split(X, y, 
                                      test_size=0.25, random_state=42)

When creating the CatBoost model we need to set the eval_metric. It can be a string in the case of a built-in metric. For custom metric we pass there an object of the class CatBoostEvalMetricPearson. In the fit() method we need to pass eval_set - an additional data that will be used for evaluation metric monitoring.

model = CatBoostRegressor(

After the training, we can access the metric values from each iteration by accessing the evals_result_ variable from the CatBoostRegressor object. Please notice that to get metric values we use the class name as a string "CatBoostEvalMetricPearson". We get learning curves by plotting evaluation metric values.

plt.plot(model.evals_result_["learn"]["CatBoostEvalMetricPearson"], label="Training Correlation")
plt.plot(model.evals_result_["validation"]["CatBoostEvalMetricPearson"], label="Validation Correlation")
plt.xlabel("Number of trees")

The learning curves for training and testing sets with Pearson correlation: CatBoost Custom Evaluation Metric with Correlation


The CatBoost algorithm has many built-in evaluation metric. In the case of a special metric, it can be easily added by creating a custom class.

In the MLJAR AutoML package (available at GitHub) there are many evaluation metrics available:

  • for binary classification: ‘logloss’, ‘auc’, ‘f1’, ‘average_precision’, or ‘accuracy’,
  • for multiclass classification: ‘logloss’, ‘f1’, or ‘accuracy’,
  • for regression: ‘rmse’, ‘mse’, ‘mae’, ‘r2’, ‘mape’, ‘spearman’, or ‘pearson’. You can easily select evaluation metric by setting eval_metric in AutoML() constructor:
automl = AutoML(mode="Compete", eval_metric="f1"), y)

You can check details of different metrics implementation in this file: link to

Check our open-source AutoML framework for tabular data!