Neural Network vs Xgboost

Neural Network (Multi-Layer Perceptron, MLP) is an algorithm inspired by biological neural networks. The MLP consists of a connected graph of processing units that mimic neurons. The connections between units carry so-called weights, whose values are selected during the training process. The training goal is to minimize the error between the values predicted by the MLP and the true values.
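
The training idea described above can be sketched in plain Python. This is a toy illustration only, not the Scikit-Learn implementation: the function name `train_mlp`, the tanh hidden layer, the linear output, the squared-error loss, and the tiny OR-gate data are all assumptions chosen to keep the example short.

```python
import math
import random

# Toy training data (assumed for illustration): the logical OR of two inputs.
OR_SAMPLES = [([0.0, 0.0], 0.0), ([0.0, 1.0], 1.0),
              ([1.0, 0.0], 1.0), ([1.0, 1.0], 1.0)]


def train_mlp(samples, hidden=3, lr=0.1, epochs=500, seed=0):
    """Train a one-hidden-layer MLP with per-sample gradient descent.

    Returns the mean squared error before and after training.
    """
    rng = random.Random(seed)
    n_in = len(samples[0][0])
    # The weights are the connections between units. They start out random
    # and are adjusted during training.
    w1 = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(hidden)]
    b1 = [0.0] * hidden
    w2 = [rng.uniform(-0.5, 0.5) for _ in range(hidden)]
    b2 = 0.0

    def forward(x):
        # Hidden units apply tanh to a weighted sum; the output unit is linear.
        h = [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
             for row, b in zip(w1, b1)]
        return h, sum(w * hi for w, hi in zip(w2, h)) + b2

    def mean_squared_error():
        return sum((forward(x)[1] - y) ** 2 for x, y in samples) / len(samples)

    initial_loss = mean_squared_error()
    for _ in range(epochs):
        for x, y in samples:
            h, pred = forward(x)
            err = pred - y  # gradient of 0.5 * (pred - y)**2 w.r.t. pred
            for j in range(hidden):
                # Backpropagate through the tanh unit before changing w2[j].
                grad_h = err * w2[j] * (1.0 - h[j] ** 2)
                w2[j] -= lr * err * h[j]
                for i in range(n_in):
                    w1[j][i] -= lr * grad_h * x[i]
                b1[j] -= lr * grad_h
            b2 -= lr * err
    return initial_loss, mean_squared_error()


initial_loss, final_loss = train_mlp(OR_SAMPLES)
print(f"error before training: {initial_loss:.4f}, after: {final_loss:.4f}")
```

The error after training should be lower than the error before training, which is the minimization goal stated above.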

References

  • Hinton, Geoffrey E. Connectionist learning procedures. Artificial Intelligence, vol. 40, pp. 185-234, 1989.

License

License for Scikit-Learn implementation of Neural Network: New BSD License

XGBoost (Extreme Gradient Boosting) is a library that provides machine learning algorithms under the gradient boosting framework.

It works with major operating systems like Linux, Windows and macOS. It can run on a single machine or in the distributed environment with frameworks like Apache Hadoop, Apache Spark, Apache Flink, Dask, and DataFlow.

The library is available with interfaces in many programming languages: C++, Python, Java, R, Julia, Perl, and Scala.
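
The gradient boosting idea behind the library can be sketched in a few lines of plain Python. This is a deliberately simplified toy, not XGBoost's algorithm or API: it assumes a single numeric feature, squared-error loss, and one-split trees (stumps), and the names `fit_stump`, `boost`, and `mse` are hypothetical. XGBoost itself adds, among other things, regularization and second-order gradient information.

```python
def mse(y_true, y_pred):
    """Mean squared error between two equally long sequences."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def fit_stump(xs, residuals):
    """Find the single threshold split on xs that best fits the residuals.

    Returns (threshold, left_value, right_value).
    """
    best = None
    # Skip the largest value so the right-hand side is never empty.
    for t in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        lv, rv = sum(left) / len(left), sum(right) / len(right)
        sse = (sum((r - lv) ** 2 for r in left)
               + sum((r - rv) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, t, lv, rv)
    return best[1:]


def boost(xs, ys, rounds=20, lr=0.3):
    """Additively fit stumps to the residuals of the current ensemble."""
    base = sum(ys) / len(ys)  # start from the mean prediction
    preds = [base] * len(ys)
    stumps = []
    for _ in range(rounds):
        # For squared error, the negative gradient is simply the residual.
        residuals = [y - p for y, p in zip(ys, preds)]
        t, lv, rv = fit_stump(xs, residuals)
        stumps.append((t, lv, rv))
        # Each new stump corrects part of what the ensemble still gets wrong.
        preds = [p + lr * (lv if x <= t else rv) for x, p in zip(xs, preds)]
    return base, stumps, preds


# Made-up data with a jump between x = 4 and x = 5.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [1.0, 1.2, 0.9, 1.1, 3.0, 3.2, 2.9, 3.1]
base, stumps, preds = boost(xs, ys)
print(f"MSE of mean-only model: {mse(ys, [base] * len(ys)):.4f}")
print(f"MSE after boosting:     {mse(ys, preds):.4f}")
```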

License

Apache-2.0 License


Binary classification

Neural Network 0:7 Xgboost

Multiclass classification

Neural Network 1:4 Xgboost

Regression

Neural Network 1:6 Xgboost
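
One plausible reading of these score lines is that each dataset counts as one win for the model with the better metric: higher accuracy for classification, lower RMSE for regression. The sketch below applies that assumed rule to the multiclass and regression figures listed further down. The function name `tally` and the tie handling are illustrative choices, not part of the benchmark.

```python
# (Neural Network, Xgboost) metric values copied from the dataset cards.
MULTICLASS_ACCURACY = {
    "Car": (0.989, 0.99595),
    "Mfeat factors": (0.9695, 0.9705),
    "Segment": (0.96883, 0.97835),
    "Vehicle": (0.81915, 0.78132),
    "Wine quality": (0.56972, 0.69195),
}
REGRESSION_RMSE = {
    "Abalone": (2.25937, 2.20467),
    "House sales": (208744.02098, 123275.11599),
    "Moneyball": (39.0368, 25.0563),
    "Online news popularity": (44885.69545, 11628.62756),
    "Santander transaction value": (7107567922.43132, 7207917.87205),
    "Space ga": (0.11161, 0.11335),
    "Us crime": (0.2192, 0.13726),
}


def tally(results, higher_is_better):
    """Count wins per model; returns (neural_network_wins, xgboost_wins)."""
    nn_wins = xgb_wins = 0
    for nn_score, xgb_score in results.values():
        if nn_score == xgb_score:
            continue  # assumption: a tie counts for neither model
        if higher_is_better:
            nn_better = nn_score > xgb_score
        else:
            nn_better = nn_score < xgb_score
        if nn_better:
            nn_wins += 1
        else:
            xgb_wins += 1
    return nn_wins, xgb_wins


print("Multiclass:", tally(MULTICLASS_ACCURACY, higher_is_better=True))  # (1, 4)
print("Regression:", tally(REGRESSION_RMSE, higher_is_better=False))     # (1, 6)
```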


Binary classification

Adult dataset

Metric: Accuracy

Neural Network 0.85214 - vs - 0.87582 Xgboost

This is the Adult database. The prediction task is to determine whether a person makes over 50K a year. Barry Becker extracted the data from the 1994 Census database. The variables are all self-explanatory except fnlwgt, which is a proxy for a person's demographic background: 'People with similar demographic characteristics should have similar weights.' This similarity statement does not apply across all 51 states.

Category: People

Rows: 48,842 Columns: 14
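
Accuracy, the metric used for the classification comparisons here, is the fraction of predictions that match the true labels. A minimal sketch follows; the function name and the sample labels are illustrative and are not taken from the benchmark code.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that equal the corresponding true label."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Made-up labels in the style of the Adult income task.
y_true = [">50K", "<=50K", "<=50K", ">50K"]
y_pred = [">50K", "<=50K", ">50K", ">50K"]
print(accuracy(y_true, y_pred))  # 3 of 4 correct -> 0.75
```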


Apsfailure dataset

Metric: Accuracy

Neural Network 0.98884 - vs - 0.99487 Xgboost

This is the APS Failure at Scania Trucks dataset. It consists of data collected from heavy Scania trucks in everyday usage. The system in focus is the Air Pressure System (APS), which generates pressurized air utilized in various functions in a truck, such as braking and gear changes. The dataset's positive class consists of component failures for a specific component of the APS. The negative class consists of trucks with failures for components not related to the APS.

Category: Manufacturing

Rows: 76,000 Columns: 170


Internet advertisements dataset

Metric: Accuracy

Neural Network 0.96889 - vs - 0.97652 Xgboost

This dataset represents a set of possible advertisements on Internet pages. The features encode the image's geometry (if available) as well as phrases occurring in the URL, the image's URL and alt text, the anchor text, and words occurring near the anchor text. The task is to predict whether an image is an advertisement ('ad') or not ('nonad').

Category: Marketing

Rows: 3,279 Columns: 1,558


Kddcup09 upselling dataset

Metric: Accuracy

Neural Network 0.93434 - vs - 0.95172 Xgboost

This is a KDDCup09_upselling database. Customer Relationship Management (CRM) is a key element of modern marketing strategies. The KDD Cup 2009 offers the opportunity to work on large marketing databases from the French Telecom company Orange to predict the propensity of customers to switch provider (churn), buy new products or services (appetency), or buy upgrades or add-ons proposed to them to make the sale more profitable (up-selling).

Category: Marketing

Rows: 50,000 Columns: 230


Phishing websites dataset

Metric: Accuracy

Neural Network 0.96852 - vs - 0.97467 Xgboost

This is the Phishing Websites dataset. Although plenty of articles about predicting phishing websites have been published, no reliable training dataset has been made publicly available, perhaps because there is no agreement in the literature on the definitive features that characterize phishing webpages. Hence it is difficult to shape a dataset that covers all possible features. In this dataset, the authors shed light on the important features that have proved to be sound and effective in predicting phishing websites.

Category: Web

Rows: 11,055 Columns: 30


Spambase dataset

Metric: Accuracy

Neural Network 0.94436 - vs - 0.95892 Xgboost

This is a SPAM E-mail Database. This collection of spam e-mails came from postmasters and individuals who had filed spam. Collection of non-spam e-mails came from filed work and personal e-mails, and hence the word 'george' and the area code '650' are indicators of non-spam. These are useful when constructing a personalized spam filter. One would either have to blind such non-spam indicators or get an extensive collection of non-spam to generate a general-purpose spam filter.

Category: Technology

Rows: 4,601 Columns: 57


Wdbc dataset

Metric: Accuracy

Neural Network 0.96485 - vs - 0.97891 Xgboost

This is the WDBC dataset (Wisconsin Diagnostic Breast Cancer). Features are computed from a digitized image of a fine needle aspirate (FNA) of a breast mass. They describe the characteristics of the cell nuclei present in the image.

Category: Healthcare

Rows: 569 Columns: 30


Multiclass classification

Car dataset

Metric: Accuracy

Neural Network 0.989 - vs - 0.99595 Xgboost

The Car Evaluation Database contains examples with the structural information removed, i.e., it directly relates CAR to the six input attributes: buying, maint, doors, persons, lug_boot, safety. Because of the known underlying concept structure, this database may be particularly useful for testing constructive induction and structure discovery methods.

Category: Automotive

Rows: 1,728 Columns: 6


Mfeat factors dataset

Metric: Accuracy

Neural Network 0.9695 - vs - 0.9705 Xgboost

This is one of a set of 6 datasets describing features of handwritten numerals (0 - 9) extracted from a collection of Dutch utility maps. Corresponding patterns in different datasets correspond to the same original character. 200 instances per class (for a total of 2,000 instances) have been digitized in binary images.

Category: Technology

Rows: 2,000 Columns: 216


Segment dataset

Metric: Accuracy

Neural Network 0.96883 - vs - 0.97835 Xgboost

The instances were drawn randomly from a database of 7 outdoor images. The images were hand-segmented to create a classification for every pixel. Each instance is a 3x3 region.

Category: Technology

Rows: 2,310 Columns: 19


Vehicle dataset

Metric: Accuracy

Neural Network 0.81915 - vs - 0.78132 Xgboost

This is the Vehicle Silhouettes dataset. The purpose is to classify a given silhouette as one of four types of vehicle, using a set of features extracted from the silhouette. The vehicle may be viewed from one of many different angles.

Category: Automotive

Rows: 846 Columns: 18


Wine quality dataset

Metric: Accuracy

Neural Network 0.56972 - vs - 0.69195 Xgboost

This is a Wine Quality database. Datasets are related to red and white variants of the Portuguese 'Vinho Verde' wine. Due to privacy and logistic issues, only physicochemical (inputs) and sensory (the output) variables are available (e.g., there is no data about grape types, wine brand, wine selling price, etc.). The inputs include objective tests (e.g., pH values), and the output is based on sensory data (median of at least 3 evaluations made by wine experts). Each expert graded the wine quality between 0 (very bad) and 10 (very excellent).

Category: Retail

Rows: 6,497 Columns: 11


Regression

Abalone dataset

Metric: Root Mean Square Error (RMSE)

Neural Network 2.25937 - vs - 2.20467 Xgboost

This is the Abalone dataset. The task is to predict the age of abalone from physical measurements. The age of abalone is determined by cutting the shell through the cone, staining it, and counting the number of rings through a microscope. Other measurements, which are easier to obtain, are used to predict the age. Further information, such as weather patterns and location (hence food availability), can also be used.

Category: Animals

Rows: 4,177 Columns: 8
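
RMSE, the metric used for the regression comparisons here, is the square root of the mean squared difference between predicted and true values, so lower is better. It is expressed in the units of the target, which is why its magnitude differs so much between datasets. A minimal sketch with made-up numbers (the function name is illustrative, not taken from the benchmark code):

```python
import math


def rmse(y_true, y_pred):
    """Root mean square error between true and predicted values."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Made-up ring counts: every prediction is off by exactly 2.
y_true = [10, 8, 12, 9]
y_pred = [12, 6, 10, 11]
print(rmse(y_true, y_pred))  # 2.0
```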


House sales dataset

Metric: Root Mean Square Error (RMSE)

Neural Network 208,744.02098 - vs - 123,275.11599 Xgboost

This is a house_sales database. This dataset contains house sale prices for King County, which includes Seattle. It includes homes sold between May 2014 and May 2015. It contains 19 house features plus the price and the id columns, along with 21,613 observations. It's a great dataset for evaluating simple regression models.

Category: Business

Rows: 21,613 Columns: 22


Moneyball dataset

Metric: Root Mean Square Error (RMSE)

Neural Network 39.0368 - vs - 25.0563 Xgboost

This is the Moneyball database. This dataset contains some of the information that was available to Billy Beane and Paul DePodesta, who worked for the Oakland Athletics in the early 2000s and changed the game of baseball. It can be used to understand their statistical methods better. The database contains such information as team, league, year, runs scored, wins.

Category: Sport

Rows: 1,232 Columns: 14


Online news popularity dataset

Metric: Root Mean Square Error (RMSE)

Neural Network 44,885.69545 - vs - 11,628.62756 Xgboost

This is an Online News Popularity database. This dataset summarizes a heterogeneous set of features about Mashable articles in a period of two years. The goal is to predict the number of shares in social networks (popularity).

Category: Marketing

Rows: 39,644 Columns: 60


Santander transaction value dataset

Metric: Root Mean Square Error (RMSE)

Neural Network 7,107,567,922.43132 - vs - 7,207,917.87205 Xgboost

This is a Santander Transaction Value database. It provides an anonymized dataset containing numeric feature variables, the numeric target column, and a string ID column. The Santander Group supplied this database on Kaggle to find a way to identify the value of transactions for each potential customer. This is the first step that Santander needs to nail in order to personalize their services at scale. The digitalization of everyday lives means that customers expect services to be delivered in a personalized and timely manner... and often before they've even realized they need the service.

Category: Technology

Rows: 4,459 Columns: 4,992


Space ga dataset

Metric: Root Mean Square Error (RMSE)

Neural Network 0.11161 - vs - 0.11335 Xgboost

This is an Election database. It contains 3,107 observations on county votes cast in the 1980 U.S. presidential election. Specifically, it contains the total number of votes cast in the 1980 presidential election per county (VOTES), the population in each county of 18 years of age or older (POP), the population in each county with a 12th grade or higher education (EDUCATION), the number of owner-occupied housing units (HOUSES), the aggregate income (INCOME), the X spatial coordinate of the county (XCOORD), and the Y spatial coordinate of the county (YCOORD).

Category: Public Sector

Rows: 3,107 Columns: 6


Us crime dataset

Metric: Root Mean Square Error (RMSE)

Neural Network 0.2192 - vs - 0.13726 Xgboost

This is a Communities and Crime database. It describes communities within the United States. The data combines socio-economic data from the 1990 US Census, law enforcement data from the 1990 US LEMAS survey, and crime data from the 1995 FBI UCR.

Category: People

Rows: 1,994 Columns: 127