Tune Decision Tree classifier

This notebook demonstrates tuning a Decision Tree model. We'll find the best hyperparameters for a Decision Tree classifier on the Iris dataset using randomized search and cross-validation, then train the model with these parameters on the full dataset.

This notebook was created with MLJAR Studio

MLJAR Studio is a Python code editor with interactive code recipes and a local AI assistant.
The code recipes UI is displayed at the top of code cells.

Documentation

This notebook uses the very simple Iris dataset just to show the process of hyperparameter tuning. In a real-life scenario, the dataset will be more complex, and you will need more iterations in the randomized search to find the best-performing hyperparameters.

# import packages
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import KFold
from sklearn.model_selection import RandomizedSearchCV

Let's load a sample dataset using the Sample datasets recipe.

# load example dataset
df = pd.read_csv(
    "https://raw.githubusercontent.com/pplonski/datasets-for-start/master/iris/data.csv",
    skipinitialspace=True,
)
# display first rows
df.head()

Split the DataFrame into X and y. The X matrix contains features. The y vector is our target. The Decision Tree will learn to predict the target values.

# create X columns list and set y column
x_cols = [
    "sepal length (cm)",
    "sepal width (cm)",
    "petal length (cm)",
    "petal widght (cm)",
]
y_col = "class"
# set input matrix
X = df[x_cols]
# set target vector
y = df[y_col]
# display data shapes
print(f"X shape is {X.shape}")
print(f"y shape is {y.shape}")

Create the Decision Tree model; it will be used for hyperparameter tuning. We don't need to set hyperparameters right now, because we don't know their best values yet.

# initialize Decision Tree
my_tree = DecisionTreeClassifier(criterion="gini", random_state=42)
# display model card
my_tree
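
If you would like to see which default values the model starts with, you can display them. This small check is optional and was not part of the original recipe; it only uses the standard get_params method of scikit-learn estimators.

# optional: inspect the default hyperparameters before tuning
my_tree.get_params()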

We will use 5-fold cross-validation and randomized search to check different combinations of hyperparameters. The best-performing hyperparameters will be selected at the end of tuning.

# create validation strategy
vs = KFold(n_splits=5, shuffle=True, random_state=42)
# parameters grid for search
params_grid = {
    "criterion": ["gini", "entropy", "log_loss"],
    "max_depth": [2, 3, 4, 5, 6, 7, 8],
}
# create search strategy
cv_search = RandomizedSearchCV(
    my_tree,
    params_grid,
    n_iter=10,
    scoring="accuracy",
    cv=vs,
    random_state=42,
    verbose=4,
)
# run search strategy
cv_search.fit(X, y)
# display best parameters
print(f"Best score {cv_search.best_score_}")
print(f"Best params {cv_search.best_params_}")

Train the Decision Tree with the best hyperparameter values on all samples from the dataset.

# initialize Decision Tree with the best hyperparameters found above
my_tree = DecisionTreeClassifier(criterion="log_loss", max_depth=3, random_state=42)
# fit model
my_tree.fit(X, y)
# display model card
my_tree
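
Instead of copying the values by hand, you could also build the final model directly from the search results. This is an optional alternative sketch, assuming cv_search from the cell above is still available.

# optional alternative: reuse the best parameters found by the search
best_tree = DecisionTreeClassifier(**cv_search.best_params_, random_state=42)
best_tree.fit(X, y)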

Conclusions

Tuning hyperparameters is an important step in building a good Machine Learning pipeline. Tuned hyperparameters help the model predict more accurately on unseen samples.

The process of hyperparameter search is similar for all Machine Learning algorithms; you can apply this strategy to Random Forest, XGBoost, or Neural Networks, as shown in the sketch below.
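
For illustration, here is a short sketch of the same tuning strategy applied to a Random Forest. The parameter grid is only an example and is not part of the original notebook.

# example: apply the same tuning strategy to Random Forest
from sklearn.ensemble import RandomForestClassifier

rf_params_grid = {
    "n_estimators": [50, 100, 200],
    "max_depth": [3, 5, 7, None],
}
rf_search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    rf_params_grid,
    n_iter=10,
    scoring="accuracy",
    cv=vs,
    random_state=42,
)
rf_search.fit(X, y)
print(f"Best params {rf_search.best_params_}")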

Recipes used in the tune-decision-tree-classifier.ipynb

All code recipes used in this notebook are listed below. You can click them to check their documentation.

Packages used in the tune-decision-tree-classifier.ipynb

List of packages that need to be installed in your Python environment to run this notebook. Please note that MLJAR Studio automatically installs and imports required modules for you.

pandas>=1.0.0

scikit-learn>=1.5.0