Decision Tree feature importance

Scikit-learn's permutation importance assesses the impact of each feature on a Decision Tree model's predictions by measuring how much performance drops when feature values are randomly shuffled.

This notebook was created with MLJAR Studio

MLJAR Studio is a Python code editor with interactive code recipes and a local AI assistant.
The code recipes UI is displayed at the top of code cells.

Documentation

A Decision Tree can be used to obtain feature importance. However, the built-in feature importance can be misleading for datasets with high-cardinality features. The built-in (impurity-based) approach derives importance from the splits a feature contributes in the tree's decision nodes; features with high cardinality offer many more candidate split points, so they are selected more often than low-cardinality features and tend to receive inflated scores. This can produce a misleading importance plot.
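To make the cardinality bias concrete, here is a minimal sketch on synthetic data (not the housing dataset used below). A fully grown tree assigns a nonzero impurity-based importance to a pure-noise column, simply because its many distinct values provide plenty of split points for fitting residual noise.

# sketch: impurity-based importance on synthetic data with a noise feature
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)
n = 1000
# informative feature with only 3 distinct values
informative = rng.randint(0, 3, size=n)
# pure-noise feature with ~1000 distinct values (high cardinality)
noise = rng.uniform(size=n)
y_demo = informative * 2.0 + rng.normal(scale=0.5, size=n)
X_demo = pd.DataFrame({"informative": informative, "noise": noise})
# an unrestricted tree keeps splitting on the noise column to fit residuals
demo_tree = DecisionTreeRegressor(random_state=0).fit(X_demo, y_demo)
print(dict(zip(X_demo.columns, demo_tree.feature_importances_.round(3))))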

That's why I would like to show you how to get feature importance for a Decision Tree model with the permutation method. The idea behind it is very simple: we take the fitted model and compute its prediction score, then we shuffle each feature in turn and measure the decrease in performance. The larger the decrease, the more important the feature.

Permutation-based feature importance is algorithm agnostic: it can be used with any model that implements the scikit-learn API.
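To illustrate the idea independently of scikit-learn's implementation, here is a minimal hand-written sketch. The model, X, and y names are placeholders for any fitted estimator with a .score() method and a pandas DataFrame of features; the permutation_importance helper used later in this notebook does the same thing with more options.

# sketch: permutation importance written by hand for any fitted estimator
import numpy as np

def manual_permutation_importance(model, X, y, n_repeats=10, random_state=42):
    rng = np.random.RandomState(random_state)
    # score of the fitted model on unshuffled data
    baseline = model.score(X, y)
    importances = {}
    for col in X.columns:
        drops = []
        for _ in range(n_repeats):
            X_shuffled = X.copy()
            # break the association between this feature and the target
            X_shuffled[col] = rng.permutation(X_shuffled[col].values)
            drops.append(baseline - model.score(X_shuffled, y))
        # mean drop in score over the repeats
        importances[col] = float(np.mean(drops))
    return importances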

# import packages
import pandas as pd
from sklearn.tree import DecisionTreeRegressor
from matplotlib import pyplot as plt
from sklearn.inspection import permutation_importance

Load sample data for regression task.

# load example dataset
df = pd.read_csv(
    "https://raw.githubusercontent.com/pplonski/datasets-for-start/master/housing/data.csv"
)
# display first rows
df.head()
# create X columns list and set y column
x_cols = [
    "CRIM",
    "ZN",
    "INDUS",
    "CHAS",
    "NOX",
    "RM",
    "AGE",
    "DIS",
    "RAD",
    "TAX",
    "PTRATIO",
    "B",
    "LSTAT",
]
y_col = "MEDV"
# set input matrix
X = df[x_cols]
# set target vector
y = df[y_col]
# display data shapes
print(f"X shape is {X.shape}")
print(f"y shape is {y.shape}")
# initialize Decision Tree
my_tree = DecisionTreeRegressor(criterion="squared_error", max_depth=5, random_state=42)
# display model card
my_tree
# fit model
my_tree.fit(X, y)
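Optionally, you can also look at the tree's built-in impurity-based importances at this point; comparing them with the permutation scores computed below lets you check the earlier discussion of cardinality bias on this dataset.

# optional: built-in impurity-based importances for later comparison
print(dict(zip(my_tree.feature_names_in_, my_tree.feature_importances_.round(3))))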

Compute the permutation importances and plot them as a horizontal bar chart.

# compute permutation importance
result = permutation_importance(
    my_tree, X, y, scoring="neg_mean_squared_error", n_repeats=10, random_state=42
)
# plot importance
sorted_idx = result.importances_mean.argsort()
_ = plt.barh(my_tree.feature_names_in_[sorted_idx], result.importances_mean[sorted_idx])
_ = plt.xlabel("Feature importance")
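The result object also contains the standard deviation of the score drop across the n_repeats shuffles, which is worth inspecting alongside the means. A small sketch of tabulating both:

# optional: tabulate mean and standard deviation of the permutation importance
importances_df = pd.DataFrame(
    {
        "feature": my_tree.feature_names_in_,
        "importance_mean": result.importances_mean,
        "importance_std": result.importances_std,
    }
).sort_values("importance_mean", ascending=False)
print(importances_df)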

Conclusions

Feature importance is valuable knowledge when building a Machine Learning pipeline. Sometimes you can drop the less important features to speed up the pipeline, as sketched below.
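A minimal sketch of such a selection step (the zero threshold is an arbitrary illustrative value, not a recommendation):

# sketch: keep only features whose mean permutation importance exceeds a threshold
threshold = 0.0
selected = [col for col, imp in zip(x_cols, result.importances_mean) if imp > threshold]
X_reduced = X[selected]
print(f"Kept {len(selected)} of {X.shape[1]} features: {selected}")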

The permutation approach is applicable to any ML algorithm that implements the scikit-learn API. Please note that this method can be computationally expensive if you use a complex model with a large dataset.


Packages used in the decision-tree-feature-importance.ipynb

List of packages that need to be installed in your Python environment to run this notebook. Please note that MLJAR Studio automatically installs and imports required modules for you.

pandas>=1.0.0

scikit-learn>=1.5.0

matplotlib>=3.8.4