Lightgbm vs Extra Trees

Extra Trees (Extremely Randomized Trees) is an ensemble learning algorithm. It constructs a set of decision trees, and during tree construction the decision rule is selected randomly. The algorithm is very similar to Random Forest, except that split values are selected randomly.
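The difference in how split values are chosen can be illustrated with a small sketch. It is a simplified illustration, not the Scikit-Learn implementation: the helper names are hypothetical, only a single feature is considered, and the impurity measure (Gini) is an assumption. The published algorithm also draws one random threshold per candidate feature and then keeps the best of those draws, which is omitted here.

```python
import random

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def split_impurity(values, labels, threshold):
    """Weighted Gini impurity after splitting one feature at `threshold`."""
    left = [y for x, y in zip(values, labels) if x <= threshold]
    right = [y for x, y in zip(values, labels) if x > threshold]
    n = len(labels)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

def best_threshold(values, labels):
    """Random-Forest-style choice: scan all midpoints, keep the purest split."""
    points = sorted(set(values))
    candidates = [(a + b) / 2 for a, b in zip(points, points[1:])]
    return min(candidates, key=lambda t: split_impurity(values, labels, t))

def random_threshold(values, rng):
    """Extra-Trees-style choice: draw the threshold uniformly at random."""
    return rng.uniform(min(values), max(values))

# Toy feature column and labels (illustrative data, not from the benchmark)
values = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
labels = [0, 0, 0, 1, 1, 1]

rng = random.Random(0)  # fixed seed so the draw is repeatable
searched = best_threshold(values, labels)
drawn = random_threshold(values, rng)
print("searched threshold:", searched)
print("randomly drawn threshold:", drawn)
```

The searched threshold is the one that minimizes impurity, while the drawn threshold can fall anywhere in the feature's range. Skipping the search makes each tree cheaper to build and adds randomness across the ensemble.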

Reference

P. Geurts, D. Ernst, and L. Wehenkel, Extremely randomized trees, Machine Learning, vol. 63, pp. 3-42, 2006.

License

License for Scikit-Learn implementation of Extra Trees: New BSD License

Links

ExtraTreesClassifier Documentation

ExtraTreesRegressor Documentation

Scikit-Learn GitHub

Scikit-Learn Website

LightGBM (Light Gradient Boosting Machine) is a machine learning library, developed by Microsoft, that provides algorithms under the gradient boosting framework.

It runs on Linux, Windows, and macOS, and supports C++, Python, R, and C#.

Reference

Guolin Ke, Qi Meng, Thomas Finley, Taifeng Wang, Wei Chen, Weidong Ma, Qiwei Ye, Tie-Yan Liu, LightGBM: A Highly Efficient Gradient Boosting Decision Tree, NIPS 2017, pp. 3149-3157.

Qi Meng, Guolin Ke, Taifeng Wang, Wei Chen, Qiwei Ye, Zhi-Ming Ma, Tie-Yan Liu, A Communication-Efficient Parallel Algorithm for Decision Tree, NIPS 2016, pp. 1279-1287.

License

MIT License

Links

LightGBM GitHub repository

LightGBM Documentation


Algorithms were compared on OpenML datasets: 19 binary classification datasets, 7 multi-class classification datasets, and 16 regression datasets. The algorithms were trained with the mljar-supervised AutoML package, with advanced feature engineering switched off and without ensembling. All models were trained with 5-fold cross-validation with shuffling, and with stratification for classification tasks.
Different hyperparameters were checked for each algorithm during training.
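The validation scheme described above, 5-fold cross-validation with shuffling and stratification, can be sketched in plain Python. This is a minimal illustration of the idea and not the code used by mljar-supervised; the function name, the seed, and the toy labels are assumptions.

```python
import random
from collections import defaultdict

def stratified_kfold(labels, k=5, seed=42):
    """Return k lists of validation indices that preserve class proportions."""
    rng = random.Random(seed)  # seed value is an arbitrary assumption
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)  # shuffle within each class
        # Deal the shuffled indices of this class round-robin into the folds
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)
    return folds

# Toy imbalanced target: 80 negatives and 20 positives
labels = [0] * 80 + [1] * 20
folds = stratified_kfold(labels)

for i, val_idx in enumerate(folds):
    # Every sample outside the validation fold is used for training
    train_idx = [j for f in folds if f is not val_idx for j in f]
    positives = sum(labels[j] for j in val_idx)
    print(f"fold {i}: train={len(train_idx)} val={len(val_idx)} positives={positives}")
```

Each validation fold keeps the 80:20 class ratio of the full data, which is the point of stratification: a metric such as AUC is then computed on folds that all contain both classes.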

For binary classification the Area Under ROC Curve (AUC) metric was used.
For multi-class classification the LogLoss metric was used.
The regression task was optimized with Root Mean Square Error (RMSE).
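For readers unfamiliar with the three metrics, the following sketch gives plain-Python definitions. The function names and toy inputs are illustrative assumptions, and the benchmark itself presumably relied on library implementations rather than code like this.

```python
import math

def auc(y_true, scores):
    """AUC as the chance that a positive is scored above a negative (ties count 0.5)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def logloss(y_true, probs, eps=1e-15):
    """Multi-class cross-entropy; probs[i][c] is the predicted probability of class c."""
    total = 0.0
    for y, row in zip(y_true, probs):
        p = min(max(row[y], eps), 1 - eps)  # clip to avoid log(0)
        total -= math.log(p)
    return total / len(y_true)

def rmse(y_true, y_pred):
    """Root mean square error between targets and predictions."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true))

# Toy inputs (not benchmark data)
print(auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))
print(logloss([0, 1], [[0.5, 0.5], [0.5, 0.5]]))
print(rmse([3.0, 5.0], [1.0, 5.0]))
```

Note the direction of each metric: a higher AUC is better, while lower LogLoss and RMSE values are better. This matters when reading the per-dataset results.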

Algorithms were scored on each dataset and compared. The better-performing algorithm receives 1 point for each dataset. The more points an algorithm collects, the better.
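The point-based scoring can be expressed as a short tally. The sketch below uses four of the per-dataset results reported on this page. The function and variable names are hypothetical. The page does not state how ties are handled, so awarding no point on a tie is an assumption.

```python
# Whether a larger metric value means a better model
HIGHER_IS_BETTER = {"AUC": True, "LOGLOSS": False, "RMSE": False}

# (dataset, metric, Lightgbm score, Extra Trees score), copied from the results on this page
results = [
    ("adult", "AUC", 0.9318, 0.898),
    ("diabetes", "AUC", 0.8978, 0.9193),
    ("car", "LOGLOSS", 0.0017, 0.2404),
    ("boston", "RMSE", 2.0456, 2.3956),
]

def tally(results):
    """Give 1 point per dataset to the algorithm with the better metric value."""
    points = {"Lightgbm": 0, "Extra Trees": 0}
    for _, metric, lgbm, extra in results:
        if lgbm == extra:
            continue  # assumption: a tie gives no point to either side
        lgbm_wins = (lgbm > extra) == HIGHER_IS_BETTER[metric]
        points["Lightgbm" if lgbm_wins else "Extra Trees"] += 1
    return points

points = tally(results)
print(points)
```

On this four-dataset subset Lightgbm collects three points and Extra Trees one, because diabetes is the only dataset of the four where Extra Trees has the better value.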

Binary classification

Lightgbm 18:1 Extra Trees

Multiclass classification

Lightgbm 7:0 Extra Trees

Regression

Lightgbm 16:0 Extra Trees

Lightgbm is the winner.

Binary classification

apsfailure dataset

Metric: Area Under ROC Curve (AUC)

Lightgbm 0.9919 - vs - 0.9864 Extra Trees

This is the APS Failure at Scania Trucks dataset. The dataset consists of data collected from heavy Scania trucks in everyday usage. The system in focus is the Air Pressure System (APS), which generates pressurized air utilized in various functions in a truck, ...

Category: Manufacturing

# Rows: 76,000 # Columns: 170

internet-advertisements dataset

Metric: Area Under ROC Curve (AUC)

Lightgbm 0.9656 - vs - 0.9383 Extra Trees

This dataset represents a set of possible advertisements on Internet pages. The features encode the image's geometry (if available) as well as phrases occurring in the URL, the image's URL and alt text, the anchor text, and words occurring near the anchor ...

Category: Marketing

# Rows: 3,279 # Columns: 1,558

kddcup09_churn dataset

Metric: Area Under ROC Curve (AUC)

Lightgbm 0.7371 - vs - 0.7141 Extra Trees

This is a KDDCup09_churn database. The KDD Cup 2009 offers the opportunity to work on large marketing databases from the French Telecom company Orange to predict customers' propensity to switch providers (churn). Churn is one of two primary factors that ...

Category: Marketing

# Rows: 50,000 # Columns: 230

kddcup09_upselling dataset

Metric: Area Under ROC Curve (AUC)

Lightgbm 0.8668 - vs - 0.8335 Extra Trees

This is a KDDCup09_upselling database. Customer Relationship Management (CRM) is a key element of modern marketing strategies. The KDD Cup 2009 offers the opportunity to work on large marketing databases from the French Telecom company Orange to predict ...

Category: Marketing

# Rows: 50,000 # Columns: 230

adult dataset

Metric: Area Under ROC Curve (AUC)

Lightgbm 0.9318 - vs - 0.898 Extra Trees

This is an Adult database. The prediction task is to determine whether a person makes over 50K a year. Data extraction was done by Barry Becker from the 1994 Census database. Variables are all self-explanatory except fnlwgt. This is a proxy for the ...

Category: People

# Rows: 48,842 # Columns: 14

amazon_employee_access dataset

Metric: Area Under ROC Curve (AUC)

Lightgbm 0.8651 - vs - 0.7482 Extra Trees

This is an Amazon_employee_access database. The data consists of real historical data collected from 2010 & 2011. Employees are manually allowed or denied access to resources over time. The data is used to create an algorithm capable of learning from ...

Category: Technology

# Rows: 32,769 # Columns: 9

bank-marketing dataset

Metric: Area Under ROC Curve (AUC)

Lightgbm 0.9377 - vs - 0.9091 Extra Trees

This is the Bank Marketing dataset. The data is related to direct marketing campaigns of a Portuguese banking institution. The marketing campaigns were based on phone calls. More than one contact to the same client was often required to assess if the product ...

Category: Marketing

# Rows: 45,211 # Columns: 16

banknote-authentication dataset

Metric: Area Under ROC Curve (AUC)

Lightgbm 1.0 - vs - 1.0 Extra Trees

This is the banknote-authentication dataset, about distinguishing genuine and forged banknotes. Data were extracted from images that were taken from genuine and forged banknote-like specimens. For digitization, an industrial camera usually used for print ...

Category: Fintech

# Rows: 1,372 # Columns: 4

bioresponse dataset

Metric: Area Under ROC Curve (AUC)

Lightgbm 0.8794 - vs - 0.8447 Extra Trees

This is a Bioresponse database. Predict a biological response of molecules from their chemical properties. The first column contains experimental data describing an actual biological response; the molecule was seen to elicit this response (1) or not (0). ...

Category: Technology

# Rows: 3,751 # Columns: 1,776

churn dataset

Metric: Area Under ROC Curve (AUC)

Lightgbm 0.8878 - vs - 0.8702 Extra Trees

This is a churn dataset. A dataset relating characteristics of telephony account features and usage and whether or not the customer churned.

Category: Marketing

# Rows: 5,000 # Columns: 20

click_prediction_small dataset

Metric: Area Under ROC Curve (AUC)

Lightgbm 0.695 - vs - 0.658 Extra Trees

This is a Click_prediction_small database. This data is derived from the 2012 KDD Cup. The data is about advertisements shown alongside search results in a search engine and whether or not people clicked on these ads. A search session contains information ...

Category: Marketing

# Rows: 39,948 # Columns: 11

credit-approval dataset

Metric: Area Under ROC Curve (AUC)

Lightgbm 0.9333 - vs - 0.9274 Extra Trees

This is a credit-approval dataset. This file concerns credit card applications. All attribute names and values have been changed to meaningless symbols to protect the confidentiality of the data.

Category: Banking

# Rows: 690 # Columns: 15

credit-g dataset

Metric: Area Under ROC Curve (AUC)

Lightgbm 0.8329 - vs - 0.83 Extra Trees

This is a German Credit dataset. It classifies people described by a set of attributes as good or bad credit risks. The dataset contains information such as job type, age, and credit history.

Category: Banking

# Rows: 1,000 # Columns: 20

diabetes dataset

Metric: Area Under ROC Curve (AUC)

Lightgbm 0.8978 - vs - 0.9193 Extra Trees

This is a Pima Indians Diabetes Database. According to World Health Organization criteria, the diagnostic, binary-valued variable investigated is whether the patient shows signs of diabetes.

Category: Healthcare

# Rows: 768 # Columns: 8

electricity dataset

Metric: Area Under ROC Curve (AUC)

Lightgbm 0.9875 - vs - 0.8547 Extra Trees

This is an Electricity dataset. This data was collected from the Australian New South Wales Electricity Market. In this market, prices are not fixed and are affected by the market's demand and supply. They are set every five minutes. Electricity transfers ...

Category: Energy

# Rows: 45,312 # Columns: 8

higgs dataset

Metric: Area Under ROC Curve (AUC)

Lightgbm 0.8165 - vs - 0.7548 Extra Trees

This is a Higgs database. Higgs Boson detection data. The data has been produced using Monte Carlo simulations.

Category: Science

# Rows: 98,050 # Columns: 28

phishingwebsites dataset

Metric: Area Under ROC Curve (AUC)

Lightgbm 0.9958 - vs - 0.9864 Extra Trees

This is the Phishing Websites dataset. Although plenty of articles about predicting phishing websites have been disseminated these days, no reliable training dataset has been published publicly, maybe because there is no agreement in the literature on the ...

Category: Web

# Rows: 11,055 # Columns: 30

spambase dataset

Metric: Area Under ROC Curve (AUC)

Lightgbm 0.987 - vs - 0.9675 Extra Trees

This is a SPAM E-mail Database. This collection of spam e-mails came from postmasters and individuals who had filed spam. Collection of non-spam e-mails came from filed work and personal e-mails, and hence the word 'george' and the area code '650' are ...

Category: Technology

# Rows: 4,601 # Columns: 57

wdbc dataset

Metric: Area Under ROC Curve (AUC)

Lightgbm 0.996 - vs - 0.996 Extra Trees

This is a WDBC dataset (Wisconsin Diagnostic Breast Cancer). Features are computed from a digitized image of a fine needle aspirate (FNA) of a breast mass. They describe the characteristics of the cell nuclei present in the image.

Category: Healthcare

# Rows: 569 # Columns: 30

Multiclass classification

amazon-commerce-reviews dataset

Metric: Cross-Entropy Loss (LOGLOSS)

Lightgbm 1.2214 - vs - 2.3412 Extra Trees

This is the amazon-commerce-reviews dataset. It is derived from customer reviews on the Amazon Commerce Website and is intended for authorship identification. Most previous studies conducted identification experiments for two to ten authors. But in the online context, ...

Category: Marketing

# Rows: 1,500 # Columns: 10,000

car dataset

Metric: Cross-Entropy Loss (LOGLOSS)

Lightgbm 0.0017 - vs - 0.2404 Extra Trees

The Car Evaluation Database contains examples with the structural information removed, i.e., directly relates CAR to the six input attributes: buying, maint, doors, persons, lug_boot, safety. Because of known underlying concept structure, this database ...

Category: Automotive

# Rows: 1,728 # Columns: 6

cnae-9 dataset

Metric: Cross-Entropy Loss (LOGLOSS)

Lightgbm 0.1504 - vs - 0.8523 Extra Trees

This is a cnae-9 database. It is a data set containing 1080 documents of free text business descriptions of Brazilian companies categorized into a subset of 9 categories. The original texts were preprocessed to obtain the current data set: initially, ...

Category: Business

# Rows: 1,080 # Columns: 856

connect-4 dataset

Metric: Cross-Entropy Loss (LOGLOSS)

Lightgbm 0.3251 - vs - 0.6792 Extra Trees

This database contains all legal 8-ply positions in the game of connect-4 in which neither player has won yet, and in which the next move is not forced. Attributes represent board positions on a 6x7 board. The outcome class is the game-theoretical value ...

Category: Gaming

# Rows: 67,557 # Columns: 42

mfeat-factors dataset

Metric: Cross-Entropy Loss (LOGLOSS)

Lightgbm 0.1214 - vs - 0.3081 Extra Trees

One of a set of 6 datasets describing features of handwritten numerals (0 - 9) extracted from a collection of Dutch utility maps. Corresponding patterns in different datasets correspond to the same original character. 200 instances per class (for a total ...

Category: Technology

# Rows: 2,000 # Columns: 216

segment dataset

Metric: Cross-Entropy Loss (LOGLOSS)

Lightgbm 0.0814 - vs - 0.1923 Extra Trees

The instances were drawn randomly from a database of 7 outdoor images. The images were hand-segmented to create a classification for every pixel. Each instance is a 3x3 region.

Category: Technology

# Rows: 2,310 # Columns: 19

vehicle dataset

Metric: Cross-Entropy Loss (LOGLOSS)

Lightgbm 0.4718 - vs - 0.5506 Extra Trees

This is the vehicle silhouettes dataset. The purpose is to classify a given silhouette as one of four types of vehicle, using a set of features extracted from the silhouette. The vehicle may be viewed from one of many different angles.

Category: Automotive

# Rows: 846 # Columns: 18

Regression

airlines_depdelay_1m dataset

Metric: Root Mean Square Error (RMSE)

Lightgbm 27.1951 - vs - 27.5511 Extra Trees

This is an Airlines Departure Delay Prediction dataset. It is a processed version of the original data, designed to predict departure delay.

Category: Technology

# Rows: 1,000,000 # Columns: 9

allstate_claims_severity dataset

Metric: Root Mean Square Error (RMSE)

Lightgbm 1,919.8 - vs - 2,229.53 Extra Trees

This is an Allstate Claims severity database. This dataset contains insurance claims. Allstate is developing automated methods of predicting the cost, and hence severity, of claims. This dataset was shared on Kaggle to find insight into better ways to ...

Category: Insurance

# Rows: 188,318 # Columns: 131

buzzinsocialmedia_twitter dataset

Metric: Root Mean Square Error (RMSE)

Lightgbm 149.839 - vs - 197.648 Extra Trees

This is a Buzz in the Social Media Twitter database. This data-set contains examples of buzz events from two different social networks: Twitter, and Tom's Hardware, a forum network focusing on new technology with more conservative dynamics.

Category: Social Media

# Rows: 583,250 # Columns: 77

moneyball dataset

Metric: Root Mean Square Error (RMSE)

Lightgbm 21.225 - vs - 22.3639 Extra Trees

This is the Moneyball database. This dataset contains some of the information that was available to Billy Beane and Paul DePodesta, who worked for the Oakland Athletics in the early 2000s and changed the game of baseball. It can be used to understand ...

Category: Sport

# Rows: 1,232 # Columns: 14

onlinenewspopularity dataset

Metric: Root Mean Square Error (RMSE)

Lightgbm 15,437.8 - vs - 15,443.6 Extra Trees

This is an Online News Popularity database. This dataset summarizes a heterogeneous set of features about Mashable articles in a period of two years. The goal is to predict the number of shares in social networks (popularity).

Category: Marketing

# Rows: 39,644 # Columns: 60

santander_transaction_value dataset

Metric: Root Mean Square Error (RMSE)

Lightgbm 7,368,640.0 - vs - 7,858,450.0 Extra Trees

This is a Santander Transaction Value database. It provides an anonymized dataset containing numeric feature variables, the numeric target column, and a string ID column. The Santander Group supplied this database on Kaggle to find a way to identify the ...

Category: Technology

# Rows: 4,459 # Columns: 4,992

abalone dataset

Metric: Root Mean Square Error (RMSE)

Lightgbm 2.1894 - vs - 2.3431 Extra Trees

This is Abalone data. Predicting the age of abalone from physical measurements. The age of abalone is determined by cutting the shell through the cone, staining it, and counting the number of rings through a microscope. There are other, easier to obtain ...

Category: Animals

# Rows: 4,177 # Columns: 8

black_friday dataset

Metric: Root Mean Square Error (RMSE)

Lightgbm 3,399.58 - vs - 3,592.62 Extra Trees

This is a Black Friday database. It contains customer purchases on Black Friday and information such as the age, gender, and marital status of consumers.

Category: Retail

# Rows: 166,821 # Columns: 9

boston dataset

Metric: Root Mean Square Error (RMSE)

Lightgbm 2.0456 - vs - 2.3956 Extra Trees

This is the Boston house-price database. It contains information such as the per capita crime rate by town, the proportion of non-retail business acres per town, and the average number of rooms per dwelling.

Category: Real Estate

# Rows: 506 # Columns: 13

colleges dataset

Metric: Root Mean Square Error (RMSE)

Lightgbm 0.1444 - vs - 0.1621 Extra Trees

This is the Colleges database. It gathers information on about 7800 different US colleges, including geographical information, stats about the population attending, and post-graduation career earnings.

Category: People

# Rows: 7,063 # Columns: 47

diamonds dataset

Metric: Root Mean Square Error (RMSE)

Lightgbm 486.745 - vs - 941.415 Extra Trees

This is a Diamonds database. This classic dataset contains the prices and other attributes of almost 54,000 diamonds.

Category: Technology

# Rows: 53,940 # Columns: 9

house_sales dataset

Metric: Root Mean Square Error (RMSE)

Lightgbm 97,406.9 - vs - 150,496.0 Extra Trees

This is a house_sales database. This dataset contains house sale prices for King County, which includes Seattle. It includes homes sold between May 2014 and May 2015. It contains 19 house features plus the price and the id columns, along with 21613 observations. ...

Category: Business

# Rows: 21,613 # Columns: 22

nyc-taxi-green-dec-2016 dataset

Metric: Root Mean Square Error (RMSE)

Lightgbm 1.5557 - vs - 1.9205 Extra Trees

This is a Trip Record Data database. It is provided by the New York City Taxi and Limousine Commission (TLC). The dataset includes TLC green taxi trips from December 2016.

Category: Automotive

# Rows: 581,835 # Columns: 18

space_ga dataset

Metric: Root Mean Square Error (RMSE)

Lightgbm 0.0926 - vs - 0.1145 Extra Trees

This is an Election database. It contains 3,107 observations on county votes cast in the 1980 U.S. presidential election. Specifically, it contains the total number of votes cast in the 1980 presidential election per county (VOTES), the population in ...

Category: Public Sector

# Rows: 3,107 # Columns: 6

us_crime dataset

Metric: Root Mean Square Error (RMSE)

Lightgbm 0.1294 - vs - 0.1338 Extra Trees

This is a Communities and Crime database. Communities within the United States. The data combines socio-economic data from the 1990 US Census, law enforcement data from the 1990 US LEMAS survey, and crime data from the 1995 FBI UCR.

Category: People

# Rows: 1,994 # Columns: 127

wine_quality dataset

Metric: Root Mean Square Error (RMSE)

Lightgbm 0.585 - vs - 0.696 Extra Trees

This is a Wine Quality database. Datasets are related to red and white variants of the Portuguese "Vinho Verde" wine. Due to privacy and logistic issues, only physicochemical (inputs) ...

Category: Retail

# Rows: 6,497 # Columns: 11
