Train Decision Tree classifier

Classification is the task of predicting discrete target labels. The scikit-learn package provides an implementation of the Decision Tree algorithm for classification tasks. The class for building a classification Decision Tree is called DecisionTreeClassifier. We will train a Decision Tree model on the Iris dataset, which describes the properties of iris flowers. The class column is the target label, and the rest of the columns are the flower features.
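To give some intuition for how a Decision Tree chooses its splits, here is a minimal pure-Python sketch of Gini impurity, the measure selected by the criterion="gini" setting used later in this notebook. The helper names gini_impurity and weighted_gini and the toy labels are made up for illustration; this is not scikit-learn's internal implementation.

```python
from collections import Counter


def gini_impurity(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((count / n) ** 2 for count in counts.values())


def weighted_gini(left, right):
    """Impurity of a split: child impurities weighted by child size."""
    n = len(left) + len(right)
    return (len(left) / n) * gini_impurity(left) + (len(right) / n) * gini_impurity(right)


# toy labels, invented for this example
parent = ["setosa"] * 3 + ["versicolor"] * 3

print(gini_impurity(parent))                  # 0.5 (two equally frequent classes)
print(weighted_gini(parent[:3], parent[3:]))  # 0.0 (both children are pure)
```

A lower weighted impurity after a split means the children are more homogeneous than the parent, which is what the tree looks for when it picks a feature and threshold.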

# import packages
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

Load the sample dataset and then split it into the X (features) and y (target) variables.

# load example dataset
df = pd.read_csv(
    "https://raw.githubusercontent.com/pplonski/datasets-for-start/master/iris/data.csv",
    skipinitialspace=True,
)
# display first rows
df.head()
# create X columns list and set y column
x_cols = [
    "sepal length (cm)",
    "sepal width (cm)",
    "petal length (cm)",
    "petal width (cm)",
]
y_col = "class"
# set input matrix
X = df[x_cols]
# set target vector
y = df[y_col]
# display data shapes
print(f"X shape is {X.shape}")
print(f"y shape is {y.shape}")

Create a Decision Tree object with the DecisionTreeClassifier class, fit it on the data, and compute predictions.

# initialize Decision Tree
my_tree = DecisionTreeClassifier(criterion="gini", random_state=42)
# display model card
my_tree
# fit model
my_tree.fit(X, y)
# compute prediction
predicted = my_tree.predict(X)
print("Predictions")
print(predicted)
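The predictions above are computed on the same rows the tree was fitted on, so they say little about how the model behaves on unseen flowers. A minimal sketch of a hold-out check follows. It assumes scikit-learn's bundled copy of Iris (load_iris) as a stand-in for the CSV so that the example is self-contained; the names X_demo, y_demo and holdout_tree, and the 25% test size, are choices made for this illustration only.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# stand-in for the CSV: scikit-learn's bundled Iris data (assumption)
X_demo, y_demo = load_iris(return_X_y=True, as_frame=True)

# keep 25% of the rows aside; stratify keeps class proportions similar
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=42
)

# fit on the training part only
holdout_tree = DecisionTreeClassifier(criterion="gini", random_state=42)
holdout_tree.fit(X_train, y_train)

# compare accuracy on seen and unseen rows
train_acc = accuracy_score(y_train, holdout_tree.predict(X_train))
test_acc = accuracy_score(y_test, holdout_tree.predict(X_test))
print(f"train accuracy: {train_acc:.3f}")
print(f"test accuracy:  {test_acc:.3f}")
```

A large gap between the two numbers would suggest the tree has memorized the training rows.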

# predict class probabilities
predicted_proba = my_tree.predict_proba(X)
print("Predicted class probabilities")
print(predicted_proba)
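The two outputs above are closely related: each column of predict_proba corresponds to one entry of the fitted model's classes_ attribute, and predict returns the class with the highest probability. The sketch below checks this on scikit-learn's bundled Iris data, used here as a stand-in for the CSV; demo_tree and the other variable names are made up for this example.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

# stand-in for the CSV: scikit-learn's bundled Iris data (assumption)
X_demo, y_demo = load_iris(return_X_y=True)

demo_tree = DecisionTreeClassifier(criterion="gini", random_state=42)
demo_tree.fit(X_demo, y_demo)

# one row per sample, one column per class, columns ordered as classes_
proba = demo_tree.predict_proba(X_demo)
print(demo_tree.classes_)
print(proba.shape)

# picking the most probable column reproduces predict()
labels_from_proba = demo_tree.classes_[np.argmax(proba, axis=1)]
print((labels_from_proba == demo_tree.predict(X_demo)).all())
```

For a tree grown without depth limits, the leaves are often pure, so most probabilities are likely to be exactly 0 or 1; limiting the depth typically produces more graded values.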

Conclusions

In this notebook, we trained a Decision Tree classifier on the Iris dataset. This notebook serves solely to demonstrate how to train a Decision Tree model for a classification task. For more advanced topics, please refer to other notebooks to learn how to:

  • tune hyperparameters for the Decision Tree,
  • save and load the Decision Tree model,
  • visualize the Decision Tree model,
  • evaluate prediction performance using different metrics.

Packages used in the train-decision-tree-classifier.ipynb

The following packages need to be installed in your Python environment to run this notebook. Please note that MLJAR Studio automatically installs and imports required modules for you.

pandas>=1.0.0

scikit-learn>=1.5.0