Tune Decision Tree classifier
This notebook demonstrates tuning a Decision Tree model. We'll find the best hyperparameters for a Decision Tree classifier on the Iris dataset using randomized search and cross-validation, then train the model with these parameters on the full dataset.
MLJAR Studio is a Python code editor with interactive code recipes and a local AI assistant.
The code recipes UI is displayed at the top of code cells.
This notebook uses the very simple Iris dataset just to show the process of hyperparameter tuning. In a real-life scenario the dataset will be more complex, and you will need more iterations in the randomized search to find the best-performing hyperparameters.
# import packages
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import KFold
from sklearn.model_selection import RandomizedSearchCV
Let's load a sample dataset using the Sample datasets recipe.
# load example dataset
df = pd.read_csv(
"https://raw.githubusercontent.com/pplonski/datasets-for-start/master/iris/data.csv",
skipinitialspace=True,
)
# display first rows
df.head()
Split the DataFrame into X and y. The X matrix contains the features. The y vector is our target. The Decision Tree will learn to predict the target values.
# create X columns list and set y column
x_cols = [
"sepal length (cm)",
"sepal width (cm)",
"petal length (cm)",
"petal width (cm)",
]
y_col = "class"
# set input matrix
X = df[x_cols]
# set target vector
y = df[y_col]
# display data shapes
print(f"X shape is {X.shape}")
print(f"y shape is {y.shape}")
Create a Decision Tree model that will be used for hyperparameter tuning. We don't need to set the hyperparameters right now, because we don't yet know their best values.
# initialize Decision Tree
my_tree = DecisionTreeClassifier(criterion="gini", random_state=42)
# display model card
my_tree
We will use 5-fold cross-validation and randomized search to check different combinations of hyperparameters. The combination with the highest mean accuracy across the folds will be selected at the end of tuning.
# create validation strategy
vs = KFold(n_splits=5, shuffle=True, random_state=42)
# parameters grid for search
params_grid = {
"criterion": ["gini", "entropy", "log_loss"],
"max_depth": [2, 3, 4, 5, 6, 7, 8],
}
# create search strategy
cv_search = RandomizedSearchCV(
my_tree,
params_grid,
n_iter=10,
scoring="accuracy",
cv=vs,
random_state=42,
verbose=4,
)
# run search strategy
cv_search.fit(X, y)
# display best parameters
print(f"Best score {cv_search.best_score_}")
print(f"Best params {cv_search.best_params_}")
Train the Decision Tree with the best hyperparameter values on all samples from the dataset.
# initialize Decision Tree
my_tree = DecisionTreeClassifier(criterion="log_loss", max_depth=3, random_state=42)
# display model card
my_tree
# fit model
my_tree.fit(X, y)
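Typing the best values by hand works, but they can also be taken directly from the search object. The following is a minimal sketch under the same offline assumption (scikit-learn's bundled Iris data rather than the CSV file); `final_tree` and `refit_tree` are illustrative names. It relies on two documented attributes of `RandomizedSearchCV`: `best_params_`, and `best_estimator_`, which is refit on the full data when `refit=True` (the default).

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, RandomizedSearchCV
from sklearn.tree import DecisionTreeClassifier

# assumption: use the Iris copy bundled with scikit-learn instead of the CSV
X, y = load_iris(return_X_y=True, as_frame=True)

search = RandomizedSearchCV(
    DecisionTreeClassifier(random_state=42),
    {
        "criterion": ["gini", "entropy", "log_loss"],
        "max_depth": [2, 3, 4, 5, 6, 7, 8],
    },
    n_iter=10,
    scoring="accuracy",
    cv=KFold(n_splits=5, shuffle=True, random_state=42),
    random_state=42,
)
search.fit(X, y)

# option 1: unpack the best parameters into a fresh model and fit on all data
final_tree = DecisionTreeClassifier(random_state=42, **search.best_params_)
final_tree.fit(X, y)

# option 2: reuse the model the search already refit on all of X and y
refit_tree = search.best_estimator_

# both routes give a tree with the same settings and the same predictions
same = bool((final_tree.predict(X) == refit_tree.predict(X)).all())
print(search.best_params_, same)
```

Reading the values from `best_params_` avoids copy errors and keeps the final model in sync if the search space or the data changes.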
Conclusions
Tuning hyperparameters is an important step in building a good Machine Learning pipeline. Tuned hyperparameters help the model make more accurate predictions on unseen samples.
The hyperparameter search process is similar for all Machine Learning algorithms. You can apply the same strategy to Random Forest, XGBoost, or Neural Networks.
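As a rough illustration of reusing the strategy with another algorithm, the sketch below runs the same kind of search for a Random Forest. The search space here is hypothetical, chosen only for the example: `n_estimators` is sampled from a `scipy.stats.randint` distribution, which `RandomizedSearchCV` accepts in place of a list, and `n_iter=5` is kept small for speed. The data again comes from scikit-learn's bundled Iris copy so the block runs offline.

```python
from scipy.stats import randint
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, RandomizedSearchCV

# assumption: use the Iris copy bundled with scikit-learn instead of the CSV
X, y = load_iris(return_X_y=True, as_frame=True)

# hypothetical search space for illustration only
param_distributions = {
    "n_estimators": randint(10, 101),  # integers from 10 to 100 inclusive
    "max_depth": [2, 3, 4, 5, 6, 7, 8],
}

rf_search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions,
    n_iter=5,
    scoring="accuracy",
    cv=KFold(n_splits=5, shuffle=True, random_state=42),
    random_state=42,
)
rf_search.fit(X, y)

print(f"Best score {rf_search.best_score_}")
print(f"Best params {rf_search.best_params_}")
```

Only the estimator and the search space change; the validation strategy, scoring, and the way results are read stay the same.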
Packages used in the tune-decision-tree-classifier.ipynb
List of packages that need to be installed in your Python environment to run this notebook. Please note that MLJAR Studio automatically installs and imports required modules for you.
pandas>=1.0.0
scikit-learn>=1.5.0