Baseline vs CatBoost

Baseline is the simplest algorithm that provides predictions without complex computations. For classification tasks, the Baseline returns the most frequent class. For regression tasks, the Baseline returns the average of the target from training data.
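The two Baseline rules described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the benchmark's actual implementation; the function names `fit_baseline_classifier` and `fit_baseline_regressor` and the toy data are hypothetical.

```python
from collections import Counter
from statistics import mean


def fit_baseline_classifier(y_train):
    """Return a predictor that always outputs the most frequent training class."""
    most_frequent_class, _ = Counter(y_train).most_common(1)[0]
    return lambda rows: [most_frequent_class for _ in rows]


def fit_baseline_regressor(y_train):
    """Return a predictor that always outputs the mean of the training target."""
    target_mean = mean(y_train)
    return lambda rows: [target_mean for _ in rows]


# Toy data (hypothetical): the features are ignored entirely by both predictors.
predict_class = fit_baseline_classifier(["no", "yes", "no", "no"])
predict_value = fit_baseline_regressor([1.0, 2.0, 6.0])

class_predictions = predict_class([[0.1], [0.2]])   # ["no", "no"]
value_predictions = predict_value([[0.1], [0.2]])   # [3.0, 3.0]
```

In scikit-learn, `DummyClassifier(strategy="most_frequent")` and `DummyRegressor(strategy="mean")` behave the same way; whether this benchmark used exactly those classes is an assumption.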

References

Nathan de Lara, Edouard Pineau, A Simple Baseline Algorithm for Graph Classification, 2018.

License

License for Scikit-Learn implementation of Baseline: New BSD License

CatBoost is a gradient boosting framework developed by Yandex. It supports both numerical and categorical features.

It runs on Linux, Windows, and macOS and provides interfaces for Python and R. Trained models can also be used in C++, Java, C#, Rust, CoreML, ONNX, and PMML.

References

Anna Veronika Dorogush, Andrey Gulin, Gleb Gusev, Nikita Kazeev, Liudmila Ostroumova Prokhorenkova, Aleksandr Vorobev, Fighting biases with dynamic boosting, 2017.
Anna Veronika Dorogush, Vasily Ershov, Andrey Gulin, CatBoost: gradient boosting with categorical features support, Workshop on ML Systems at NIPS 2017.

License

Apache-2.0 License

Binary classification

Baseline 0:19 CatBoost

Multiclass classification

Baseline 0:7 CatBoost

Regression

Baseline 0:16 CatBoost
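The three tallies above read most naturally as win counts per task type (an assumption; the page does not define them). The sketch below shows how such a tally could be computed from a few of the scores listed further down, remembering that higher is better for accuracy and lower is better for RMSE. The helper name `count_wins` is hypothetical.

```python
def count_wins(scores, higher_is_better):
    """Count datasets won by each algorithm from (baseline, catboost) score pairs."""
    baseline_wins = 0
    catboost_wins = 0
    for baseline, catboost in scores:
        if baseline == catboost:
            continue  # a tie counts for neither side
        baseline_better = (baseline > catboost) if higher_is_better else (baseline < catboost)
        if baseline_better:
            baseline_wins += 1
        else:
            catboost_wins += 1
    return baseline_wins, catboost_wins


# Accuracy on Adult, Amazon employee access, and Aps failure (higher is better).
accuracy_scores = [(0.76072, 0.87439), (0.94211, 0.95545), (0.98191, 0.99466)]
# RMSE on Abalone and Airlines depdelay 1m (lower is better).
rmse_scores = [(3.22557, 2.15223), (28.87721, 28.23039)]

accuracy_tally = count_wins(accuracy_scores, higher_is_better=True)   # (0, 3)
rmse_tally = count_wins(rmse_scores, higher_is_better=False)          # (0, 2)
```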

Binary classification

Adult dataset

Metric: Accuracy

Baseline 0.76072 - vs - 0.87439 CatBoost

This is the Adult dataset. The prediction task is to determine whether a person makes over 50K a year. Barry Becker extracted the data from the 1994 Census database. The variables are all self-explanatory except fnlwgt, which is a proxy for a person's demographic background: 'People with similar demographic characteristics should have similar weights.' This similarity statement does not hold across the 51 different states.

Category: People

Rows: 48,842 Columns: 14

Available at OpenML: https://openml.org/d/1590
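The accuracy figures reported for the classification datasets are the fraction of rows predicted correctly. For the Baseline, that fraction is simply the share of the most frequent class among the evaluated rows. A minimal sketch with hypothetical toy labels (not the real Adult data):

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for truth, guess in zip(y_true, y_pred) if truth == guess)
    return correct / len(y_true)


# Hypothetical labels: three rows of the majority class, one of the minority class.
y_true = ["<=50K", "<=50K", "<=50K", ">50K"]

# The Baseline predicts the most frequent class for every row.
majority_class = Counter(y_true).most_common(1)[0][0]
y_pred = [majority_class] * len(y_true)

baseline_accuracy = accuracy(y_true, y_pred)  # 3 correct out of 4
```

This is why the Baseline scores high on imbalanced datasets and close to 0.5 on balanced binary ones, so the gap to CatBoost is more informative than either number alone.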

Amazon employee access dataset

Metric: Accuracy

Baseline 0.94211 - vs - 0.95545 CatBoost

This is an Amazon_employee_access database. The data consists of real historical data collected in 2010 and 2011. Employees are manually allowed or denied access to resources over time. The data is used to create an algorithm capable of learning from this historical data to predict approval or denial for an unseen set of employees. There is a considerable amount of data regarding an employee's role within an organization and the resources to which they have access. Given the data related to current employees and their provisioned access, models can be built that automatically determine access privileges as employees enter and leave roles within a company. These auto-access models seek to minimize the human involvement required to grant or revoke employee access.

Category: Technology

Rows: 32,769 Columns: 9

Available at OpenML: https://openml.org/d/4135

Aps failure dataset

Metric: Accuracy

Baseline 0.98191 - vs - 0.99466 CatBoost

This is the APS Failure at Scania Trucks dataset. It consists of data collected from heavy Scania trucks in everyday usage. The system in focus is the Air Pressure System (APS), which generates pressurized air used in various truck functions, such as braking and gear changes. The dataset's positive class consists of failures of a specific component of the APS. The negative class consists of trucks with failures of components not related to the APS.

Category: Manufacturing

Rows: 76,000 Columns: 170

Available at OpenML: https://openml.org/d/41138

Banknote authentication dataset

Metric: Accuracy

Baseline 0.55539 - vs - 0.99854 CatBoost

This is the banknote-authentication dataset, which is about distinguishing genuine and forged banknotes. The data were extracted from images taken of genuine and forged banknote-like specimens. An industrial camera usually used for print inspection was used for digitization.

Category: Fintech

Rows: 1,372 Columns: 4

Available at OpenML: https://openml.org/d/1462

Bank marketing dataset

Metric: Accuracy

Baseline 0.88302 - vs - 0.91146 CatBoost

This is the Bank Marketing dataset. The data relates to direct marketing campaigns of a Portuguese banking institution. The campaigns were based on phone calls. More than one contact with the same client was often required to assess whether the product (a bank term deposit) would be subscribed to.

Category: Marketing

Rows: 45,211 Columns: 16

Available at OpenML: https://openml.org/d/1461

Bioresponse dataset

Metric: Accuracy

Baseline 0.54226 - vs - 0.81098 CatBoost

This is a Bioresponse database. Predict a biological response of molecules from their chemical properties. The first column contains experimental data describing an actual biological response; the molecule was seen to elicit this response (1) or not (0). The remaining columns represent molecular descriptors (d1 through d1776); these are calculated properties that can capture some of the molecule's characteristics - for example, size, shape, or elemental constitution. The descriptor matrix has been normalized.

Category: Technology

Rows: 3,751 Columns: 1,776

Available at OpenML: https://openml.org/d/4134

Churn dataset

Metric: Accuracy

Baseline 0.8586 - vs - 0.9646 CatBoost

This is a churn dataset. It relates telephony account features and usage to whether or not the customer churned.

Category: Marketing

Rows: 5,000 Columns: 20

Available at OpenML: https://openml.org/d/40701

Click prediction small dataset

Metric: Accuracy

Baseline 0.83158 - vs - 0.83619 CatBoost

This is a Click_prediction_small database. This data is derived from the 2012 KDD Cup. The data is about advertisements shown alongside search results in a search engine and whether or not people clicked on these ads. A search session contains information on user id, the user's query, ads displayed to the user, and the target feature indicating whether a user clicked at least one of the ads in this session.

Category: Marketing

Rows: 39,948 Columns: 11

Available at OpenML: https://openml.org/d/1220

Credit approval dataset

Metric: Accuracy

Baseline 0.55507 - vs - 0.89565 CatBoost

This is a credit-approval dataset. This file concerns credit card applications. All attribute names and values have been changed to meaningless symbols to protect the confidentiality of the data.

Category: Banking

Rows: 690 Columns: 15

Available at OpenML: https://openml.org/d/29

Credit g dataset

Metric: Accuracy

Baseline 0.7 - vs - 0.792 CatBoost

This is the German Credit dataset. It classifies people described by a set of attributes as good or bad credit risks. The attributes include information such as job type, age, and credit history.

Category: Banking

Rows: 1,000 Columns: 20

Available at OpenML: https://openml.org/d/31

Diabetes dataset

Metric: Accuracy

Baseline 0.65104 - vs - 0.79948 CatBoost

This is the Pima Indians Diabetes Database. The diagnostic, binary-valued variable investigated is whether the patient shows signs of diabetes according to World Health Organization criteria.

Category: Healthcare

Rows: 768 Columns: 8

Available at OpenML: https://openml.org/d/37

Electricity dataset

Metric: Accuracy

Baseline 0.57545 - vs - 0.90464 CatBoost

This is an Electricity dataset. This data was collected from the Australian New South Wales Electricity Market. In this market, prices are not fixed and are affected by the market's demand and supply. They are set every five minutes. Electricity transfers to/from the neighboring state of Victoria were done to alleviate fluctuations.

Category: Energy

Rows: 45,312 Columns: 8

Available at OpenML: https://openml.org/d/151

Higgs dataset

Metric: Accuracy

Baseline 0.52858 - vs - 0.73033 CatBoost

This is the Higgs dataset for Higgs boson detection. The data has been produced using Monte Carlo simulations.

Category: Science

Rows: 98,050 Columns: 28

Available at OpenML: https://openml.org/d/23512

Internet advertisements dataset

Metric: Accuracy

Baseline 0.86002 - vs - 0.97957 CatBoost

This dataset represents a set of possible advertisements on Internet pages. The features encode the image's geometry (if available) as well as phrases occurring in the URL, the image's URL and alt text, the anchor text, and words occurring near the anchor text. The task is to predict whether an image is an advertisement ('ad') or not ('nonad').

Category: Marketing

Rows: 3,279 Columns: 1,558

Available at OpenML: https://openml.org/d/40978

Kddcup09 churn dataset

Metric: Accuracy

Baseline 0.92656 - vs - 0.92764 CatBoost

This is a KDDCup09_churn database. The KDD Cup 2009 offers the opportunity to work on large marketing databases from the French Telecom company Orange to predict customers' propensity to switch providers (churn). Churn is one of two primary factors that determine the steady-state level of customers a business will support. In its broadest sense, the churn rate is a measure of the number of individuals or items moving into or out of a collection over a specific period of time.

Category: Marketing

Rows: 50,000 Columns: 230

Available at OpenML: https://openml.org/d/1112

Kddcup09 upselling dataset

Metric: Accuracy

Baseline 0.92636 - vs - 0.95176 CatBoost

This is a KDDCup09_upselling database. Customer Relationship Management (CRM) is a key element of modern marketing strategies. The KDD Cup 2009 offers the opportunity to work on large marketing databases from the French Telecom company Orange to predict the propensity of customers to switch provider (churn), buy new products or services (appetency), or buy upgrades or add-ons proposed to them to make the sale more profitable (up-selling).

Category: Marketing

Rows: 50,000 Columns: 230

Available at OpenML: https://openml.org/d/1114

Phishing websites dataset

Metric: Accuracy

Baseline 0.55694 - vs - 0.96943 CatBoost

This is the Phishing Websites dataset. Although plenty of articles about predicting phishing websites have been published, no reliable training dataset has been made publicly available, perhaps because there is no agreement in the literature on the definitive features that characterize phishing webpages. Hence it is difficult to shape a dataset that covers all possible features. In this dataset, the authors shed light on the important features that have proved to be sound and effective in predicting phishing websites.

Category: Web

Rows: 11,055 Columns: 30

Available at OpenML: https://openml.org/d/4534

Spambase dataset

Metric: Accuracy

Baseline 0.60596 - vs - 0.96218 CatBoost

This is a SPAM E-mail Database. This collection of spam e-mails came from postmasters and individuals who had filed spam. Collection of non-spam e-mails came from filed work and personal e-mails, and hence the word 'george' and the area code '650' are indicators of non-spam. These are useful when constructing a personalized spam filter. One would either have to blind such non-spam indicators or get an extensive collection of non-spam to generate a general-purpose spam filter.

Category: Technology

Rows: 4,601 Columns: 57

Available at OpenML: https://openml.org/d/44

Wdbc dataset

Metric: Accuracy

Baseline 0.62742 - vs - 0.97891 CatBoost

This is the WDBC dataset (Wisconsin Diagnostic Breast Cancer). Features are computed from a digitized image of a fine needle aspirate (FNA) of a breast mass. They describe the characteristics of the cell nuclei present in the image.

Category: Healthcare

Rows: 569 Columns: 30

Available at OpenML: https://openml.org/d/1510

Multiclass classification

Amazon commerce reviews dataset

Metric: Accuracy

Baseline 0.02 - vs - 0.76933 CatBoost

This is the amazon-commerce-reviews dataset. It is derived from customer reviews on the Amazon Commerce website and is intended for authorship identification. Most previous studies conducted identification experiments for two to ten authors. In the online context, however, reviews to be identified usually have more potential authors, and classification algorithms are normally not adapted to a large number of target classes. To examine the robustness of classification algorithms, the authors of this database identified 50 of the most active users (represented by a unique ID and username) who frequently posted reviews in these newsgroups. The number of reviews collected for each user is 30.

Category: Marketing

Rows: 1,500 Columns: 10,000

Available at OpenML: https://openml.org/d/1457

Car dataset

Metric: Accuracy

Baseline 0.70023 - vs - 0.98727 CatBoost

The Car Evaluation Database contains examples with the structural information removed, i.e., directly relates CAR to the six input attributes: buying, maint, doors, persons, lug_boot, safety. Because of known underlying concept structure, this database may be particularly useful for testing constructive induction and structure discovery methods.

Category: Automotive

Rows: 1,728 Columns: 6

Available at OpenML: https://openml.org/d/40975

Cnae 9 dataset

Metric: Accuracy

Baseline 0.11111 - vs - 0.93796 CatBoost

This is the cnae-9 database. It contains 1,080 documents of free-text business descriptions of Brazilian companies, categorized into a subset of 9 categories. The original texts were preprocessed to obtain the current data set: first, only letters were kept, and then prepositions were removed from the texts. Next, the words were transformed into their canonical form. Finally, each document was represented as a vector, where the weight of each word is its frequency in the document. This data set is highly sparse.

Category: Business

Rows: 1,080 Columns: 856

Available at OpenML: https://openml.org/d/1468

Connect 4 dataset

Metric: Accuracy

Baseline 0.6583 - vs - 0.82045 CatBoost

This database contains all legal 8-ply positions in the game of connect-4 in which neither player has won yet, and in which the next move is not forced. Attributes represent board positions on the 7x6 board (7 columns by 6 rows). The outcome class is the game-theoretical value for the first player (2: win, 1: loss, 0: draw).

Category: Gaming

Rows: 67,557 Columns: 42

Available at OpenML: https://openml.org/d/40668

Mfeat factors dataset

Metric: Accuracy

Baseline 0.1 - vs - 0.9725 CatBoost

One of a set of 6 datasets describing features of handwritten numerals (0 - 9) extracted from a collection of Dutch utility maps. Corresponding patterns in different datasets correspond to the same original character. 200 instances per class (for a total of 2,000 instances) have been digitized in binary images.

Category: Technology

Rows: 2,000 Columns: 216

Available at OpenML: https://openml.org/d/12
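The Baseline accuracy of 0.1 on this dataset follows directly from the class balance described above: with 10 digit classes of 200 instances each, always predicting a single class is right for 200 of 2,000 rows. A short sketch of that arithmetic, assuming the evaluated rows keep the same balance:

```python
from collections import Counter

# 10 digit classes (0-9), 200 instances each, as stated in the dataset description.
labels = [digit for digit in range(10) for _ in range(200)]

# With perfectly balanced classes, every class ties for "most frequent";
# Counter.most_common returns one of them (the first encountered).
majority_class, majority_count = Counter(labels).most_common(1)[0]

# Always predicting that class is correct exactly majority_count times.
baseline_accuracy = majority_count / len(labels)  # 200 / 2000
```

The same reasoning gives roughly 1/k for any balanced dataset with k classes, which matches the other low multiclass Baseline scores in this section.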

Segment dataset

Metric: Accuracy

Baseline 0.14286 - vs - 0.97835 CatBoost

The instances were drawn randomly from a database of 7 outdoor images. The images were hand-segmented to create a classification for every pixel. Each instance is a 3x3 region.

Category: Technology

Rows: 2,310 Columns: 19

Available at OpenML: https://openml.org/d/40984

Vehicle dataset

Metric: Accuracy

Baseline 0.25768 - vs - 0.80851 CatBoost

This is the vehicle silhouettes dataset. The purpose is to classify a given silhouette as one of four types of vehicle, using a set of features extracted from the silhouette. The vehicle may be viewed from one of many different angles.

Category: Automotive

Rows: 846 Columns: 18

Available at OpenML: https://openml.org/d/54

Regression

Abalone dataset

Metric: Root Mean Square Error (RMSE)

Baseline 3.22557 - vs - 2.15223 CatBoost

This is the Abalone dataset. The task is to predict the age of abalone from physical measurements. The age of abalone is determined by cutting the shell through the cone, staining it, and counting the number of rings through a microscope. Other measurements, which are easier to obtain, are used to predict the age. Further information, such as weather patterns and location (hence food availability), could also be used.

Category: Animals

Rows: 4,177 Columns: 8

Available at OpenML: https://openml.org/d/42726
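The regression scores in this section are Root Mean Square Error values, where lower is better. The sketch below shows how RMSE is computed and what it looks like for a Baseline that always predicts the training mean. The toy numbers are hypothetical and are not taken from the Abalone data.

```python
from math import sqrt


def rmse(y_true, y_pred):
    """Root Mean Square Error: square root of the mean squared difference."""
    squared_errors = [(truth - guess) ** 2 for truth, guess in zip(y_true, y_pred)]
    return sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical toy targets for training and evaluation.
y_train = [2.0, 4.0, 6.0]
y_test = [1.0, 7.0]

# The Baseline predicts the training mean (4.0) for every test row.
train_mean = sum(y_train) / len(y_train)
baseline_predictions = [train_mean] * len(y_test)

# Errors are -3 and +3, squared errors are 9 and 9, so the RMSE is 3.0.
baseline_rmse = rmse(y_test, baseline_predictions)
```

Because RMSE is expressed in the units of the target, the values are not comparable across datasets; only the Baseline and CatBoost figures for the same dataset should be compared.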

Airlines depdelay 1m dataset

Metric: Root Mean Square Error (RMSE)

Baseline 28.87721 - vs - 28.23039 CatBoost

This is the Airlines Departure Delay Prediction dataset. It is a processed version of the original data, designed to predict departure delay.

Category: Insurance

Rows: 188,318 Columns: 131

Available at OpenML: https://openml.org/d/42721

Allstate claims severity dataset

Metric: Root Mean Square Error (RMSE)

Baseline 2,904.09295 - vs - 1,969.81091 CatBoost

This is an Allstate Claims severity database. This dataset contains insurance claims. Allstate is developing automated methods of predicting the cost, and hence severity, of claims. This dataset was shared on Kaggle to find insight into better ways to predict claims severity.

Category: Energy

Rows: 45,312 Columns: 8

Available at OpenML: https://openml.org/d/42571

Black friday dataset

Metric: Root Mean Square Error (RMSE)

Baseline 5,082.32359 - vs - 3,473.29718 CatBoost

This is a Black Friday database. It contains customer purchases on Black Friday and information such as the age, gender, and marital status of consumers.

Category: Retail

Rows: 166,821 Columns: 9

Available at OpenML: https://openml.org/d/41540

Boston dataset

Metric: Root Mean Square Error (RMSE)

Baseline 9.19864 - vs - 3.03283 CatBoost

This is the Boston house-price database. It contains information such as the per capita crime rate by town, the proportion of non-retail business acres per town, and the average number of rooms per dwelling.

Category: Real Estate

Rows: 506 Columns: 13

Available at OpenML: https://openml.org/d/531

Buzzinsocialmedia twitter dataset

Metric: Root Mean Square Error (RMSE)

Baseline 612.35298 - vs - 203.44832 CatBoost

This is a Buzz in the Social Media Twitter database. This data-set contains examples of buzz events from two different social networks: Twitter, and Tom's Hardware, a forum network focusing on new technology with more conservative dynamics.

Category: Social Media

Rows: 583,250 Columns: 77

Available at OpenML: https://openml.org/d/4549

Colleges dataset

Metric: Root Mean Square Error (RMSE)

Baseline 0.22494 - vs - 0.1422 CatBoost

This is the Colleges database. It gathers information on about 7,800 different US colleges, including geographical information, statistics about the student population, and post-graduation career earnings.

Category: People

Rows: 7,063 Columns: 47

Available at OpenML: https://openml.org/d/42727

Diamonds dataset

Metric: Root Mean Square Error (RMSE)

Baseline 3,989.45129 - vs - 527.77227 CatBoost

This is a Diamonds database. This classic dataset contains the prices and other attributes of almost 54,000 diamonds.

Category: Technology

Rows: 53,940 Columns: 9

Available at OpenML: https://openml.org/d/42225

House sales dataset

Metric: Root Mean Square Error (RMSE)

Baseline 367,123.68404 - vs - 111,992.81295 CatBoost

This is a house_sales database. This dataset contains house sale prices for King County, which includes Seattle. It includes homes sold between May 2014 and May 2015. It contains 19 house features plus the price and the id columns, along with 21613 observations. It's a great dataset for evaluating simple regression models.

Category: Business

Rows: 21,613 Columns: 22

Available at OpenML: https://openml.org/d/42731

Moneyball dataset

Metric: Root Mean Square Error (RMSE)

Baseline 91.6178 - vs - 22.54679 CatBoost

This is the Moneyball database. This dataset contains some of the information that was available to Billy Beane and Paul DePodesta, who worked for the Oakland Athletics in the early 2000s and changed the game of baseball. It can be used to understand their statistical methods better. The database contains information such as team, league, year, runs scored, and wins.

Category: Sport

Rows: 1,232 Columns: 14

Available at OpenML: https://openml.org/d/41021

Nyc taxi green dec 2016 dataset

Metric: Root Mean Square Error (RMSE)

Baseline 2.71372 - vs - 1.82319 CatBoost

This is a Trip Record Data database. It is provided by the New York City Taxi and Limousine Commission (TLC). The dataset includes TLC green taxi trips from December 2016.

Category: Automotive

Rows: 581,835 Columns: 18

Available at OpenML: https://openml.org/d/42729

Online news popularity dataset

Metric: Root Mean Square Error (RMSE)

Baseline 11,626.98622 - vs - 11,480.91625 CatBoost

This is an Online News Popularity database. This dataset summarizes a heterogeneous set of features about Mashable articles in a period of two years. The goal is to predict the number of shares in social networks (popularity).

Category: Marketing

Rows: 39,644 Columns: 60

Available at OpenML: https://openml.org/d/42724

Santander transaction value dataset

Metric: Root Mean Square Error (RMSE)

Baseline 8,235,688.0032 - vs - 7,129,743.58376 CatBoost

This is a Santander Transaction Value database. It provides an anonymized dataset containing numeric feature variables, the numeric target column, and a string ID column. The Santander Group supplied this database on Kaggle to find a way to identify the value of transactions for each potential customer. This is the first step that Santander needs to nail in order to personalize their services at scale. The digitalization of everyday lives means that customers expect services to be delivered in a personalized and timely manner... and often before they've even realized they need the service.

Category: Technology

Rows: 4,459 Columns: 4,992

Available at OpenML: https://openml.org/d/42572

Space ga dataset

Metric: Root Mean Square Error (RMSE)

Baseline 0.1981 - vs - 0.10543 CatBoost

This is an Election database. It contains 3,107 observations on county votes cast in the 1980 U.S. presidential election. Specifically, it contains the total number of votes cast in the 1980 presidential election per county (VOTES), the population in each county of 18 years of age or older (POP), the population in each county with a 12th grade or higher education (EDUCATION), the number of owner-occupied housing units (HOUSES), the aggregate income (INCOME), the X spatial coordinate of the county (XCOORD), and the Y spatial coordinate of the county (YCOORD).

Category: Public Sector

Rows: 3,107 Columns: 6

Available at OpenML: https://openml.org/d/507

Us crime dataset

Metric: Root Mean Square Error (RMSE)

Baseline 0.233 - vs - 0.13126 CatBoost

This is a Communities and Crime database covering communities within the United States. The data combines socio-economic data from the 1990 US Census, law enforcement data from the 1990 US LEMAS survey, and crime data from the 1995 FBI UCR.

Category: People

Rows: 1,994 Columns: 127

Available at OpenML: https://openml.org/d/42730

Wine quality dataset

Metric: Root Mean Square Error (RMSE)

Baseline 0.87331 - vs - 0.59672 CatBoost

This is a Wine Quality database. The data relates to red and white variants of the Portuguese 'Vinho Verde' wine. Due to privacy and logistic issues, only physicochemical (input) and sensory (output) variables are available (e.g., there is no data about grape types, wine brand, or wine selling price). The inputs include objective tests (e.g., pH values), and the output is based on sensory data (the median of at least 3 evaluations made by wine experts). Each expert graded the wine quality between 0 (very bad) and 10 (excellent).

Category: Retail

Rows: 6,497 Columns: 11

Available at OpenML: https://openml.org/d/287