Tune Decision Tree classifier

The purpose of this notebook is to show how a Decision Tree model can be tuned. We will search for the best combination of hyperparameters for a Decision Tree classifier trained on the Iris dataset, using a randomized search to traverse the hyperparameter space. The search is evaluated with cross-validation. After the search, we will train a Decision Tree with the selected hyperparameters on the full dataset.

# import packages
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import KFold
from sklearn.model_selection import RandomizedSearchCV
# load example dataset
df = pd.read_csv(
    "https://raw.githubusercontent.com/pplonski/datasets-for-start/master/iris/data.csv",
    skipinitialspace=True,
)
# display first rows
df.head()
# create X columns list and set y column
x_cols = [
    "sepal length (cm)",
    "sepal width (cm)",
    "petal length (cm)",
    "petal widght (cm)",
]
y_col = "class"
# set input matrix
X = df[x_cols]
# set target vector
y = df[y_col]
# display data shapes
print(f"X shape is {X.shape}")
print(f"y shape is {y.shape}")
# initialize Decision Tree
my_tree = DecisionTreeClassifier(criterion="gini", random_state=42)
# display model card
my_tree
# create validation strategy
vs = KFold(n_splits=5, shuffle=True, random_state=42)
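For classification problems it is often preferable to keep the class proportions equal in every fold. A minimal sketch of a stratified alternative (a drop-in replacement for the `KFold` above):

```python
# sketch: a stratified validation strategy that preserves class
# proportions in each fold (often preferred for classification)
from sklearn.model_selection import StratifiedKFold

vs = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
```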
# parameters grid for search
params_grid = {
    "criterion": ["gini", "entropy", "log_loss"],
    "max_depth": [2, 3, 4, 5, 6, 7, 8],
}
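`RandomizedSearchCV` also accepts distributions, not just fixed lists, so values are sampled rather than enumerated. A hedged sketch using `scipy.stats.randint` for `max_depth` (the ranges here mirror the grid above and are illustrative):

```python
# sketch: sample max_depth from a distribution instead of a fixed list
from scipy.stats import randint

params_dist = {
    "criterion": ["gini", "entropy", "log_loss"],
    "max_depth": randint(2, 9),  # samples integers 2..8 (upper bound exclusive)
}
```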
# create search strategy
cv_search = RandomizedSearchCV(
    my_tree,
    params_grid,
    n_iter=10,
    scoring="accuracy",
    cv=vs,
    random_state=42,
    verbose=4,
)
# run search strategy
cv_search.fit(X, y)
# display best parameters
print(f"Best score {cv_search.best_score_}")
print(f"Best params {cv_search.best_params_}")
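Beyond the single best result, `cv_search.cv_results_` records every sampled combination. A self-contained sketch (rebuilding a small search on Iris via `sklearn.datasets.load_iris` for illustration) that loads the results into a DataFrame:

```python
# minimal sketch: inspect every tried combination via cv_results_
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, RandomizedSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
search = RandomizedSearchCV(
    DecisionTreeClassifier(random_state=42),
    {"criterion": ["gini", "entropy"], "max_depth": [2, 3, 4, 5]},
    n_iter=5,
    scoring="accuracy",
    cv=KFold(n_splits=5, shuffle=True, random_state=42),
    random_state=42,
)
search.fit(X, y)
# cv_results_ holds mean/std test scores for each sampled combination
results = pd.DataFrame(search.cv_results_)
print(results[["params", "mean_test_score", "std_test_score"]])
```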

Train the Decision Tree with the best hyperparameters on the full dataset.

# initialize Decision Tree
my_tree = DecisionTreeClassifier(criterion="log_loss", max_depth=3, random_state=42)
# display model card
my_tree
# fit model
my_tree.fit(X, y)

Conclusions

Tuning hyperparameters is an important step in building a good Machine Learning pipeline. Well-tuned parameters help the model generalize better.

Recipes used in the tune-decision-tree-classifier.ipynb

All code recipes used in this notebook are listed below. You can click them to check their documentation.

Packages used in the tune-decision-tree-classifier.ipynb

List of packages that need to be installed in your Python environment to run this notebook. Please note that MLJAR Studio automatically installs and imports required modules for you.

pandas>=1.0.0

scikit-learn>=1.5.0