LightGBM predict on Pandas DataFrame - Column Order Matters
When working with machine learning models like LightGBM, one important detail you must remember is the order of the columns in your data. I recently discovered that column order matters a lot when computing predictions, even when using Pandas DataFrames.
At first, I thought that keeping the column order consistent was only important for NumPy arrays. With Pandas DataFrames, I believed the order would not affect the prediction results. However, my experiments showed a surprising result: for LightGBM, if you change the column order of your DataFrame after training, the predictions will change.
Interestingly, even ChatGPT was not aware of this subtle issue.
I hope this article helps others avoid the same mistake and improves the way you work with machine learning models like LightGBM.
LightGBM predict with shuffled columns
I wrote a simple Python example that demonstrates the problem:
- there is a Pandas DataFrame with four columns: feature1, feature2, feature3, and target,
- I train a LightGBM regressor on the data using the scikit-learn API,
- I compute predictions on data with the original column order,
- I reverse the column order to feature3, feature2, feature1 and compute predictions again.
To my surprise, when I compared the predictions they were different! Below is the code to reproduce the issue. I was using LightGBM version 4.6.0.
import numpy as np
import pandas as pd
import lightgbm as lgb
from sklearn.model_selection import train_test_split
# Get LightGBM version
print(lgb.__version__)
# Set random seed for reproducibility
np.random.seed(42)
# Generate random data
n_samples = 1000
df = pd.DataFrame({
    'feature1': np.random.randn(n_samples),
    'feature2': np.random.randn(n_samples),
    'feature3': np.random.randn(n_samples)
})
# Create a target with a linear relationship plus some noise
df['target'] = 3 * df['feature1'] - 2 * df['feature2'] + np.random.randn(n_samples)
# Split the data into training and test sets
train_df, test_df = train_test_split(df, test_size=0.2, random_state=42)
X_train = train_df.drop(columns=['target'])
y_train = train_df['target']
X_test = test_df.drop(columns=['target'])
y_test = test_df['target']
# Train a LightGBM regressor
model = lgb.LGBMRegressor(random_state=42, verbosity=-1)
model.fit(X_train, y_train)
# Predict using the original column order
pred_original = model.predict(X_test)
# Shuffle the columns in the test set
shuffled_columns = ['feature3', 'feature2', 'feature1']
X_test_shuffled = X_test[shuffled_columns]
# Predict using the shuffled column order
pred_shuffled = model.predict(X_test_shuffled)
# Compare the predictions
are_equal = np.allclose(pred_original, pred_shuffled)
print("Are the predictions identical when columns are shuffled? ", are_equal)
# Print original predictions
print(pred_original[:10])
# Print shuffled predictions
print(pred_shuffled[:10])
The output of the above script is below - you can compare the predictions, and they are different!
4.6.0
Are the predictions identical when columns are shuffled? False
[ 1.07174678 5.64838888 -6.05203608 -4.45541263 -5.23009083 0.52663101
-2.48320415 2.21768145 1.94955987 -4.83940242]
[ 5.35309562 3.07502193 1.01734446 -6.29262504 4.21820035 -2.28689916
5.44986385 3.71556868 0.59879776 -2.18837962]
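One way to see that the column names are effectively ignored here is to pass the shuffled data as a plain NumPy array, which is positional by definition. This is a quick check, under the assumption that LightGBM maps DataFrame columns by position rather than by name:
# Predict on the shuffled data converted to a NumPy array (purely positional)
pred_shuffled_np = model.predict(X_test_shuffled.to_numpy())
# If columns are matched by position, this should equal the DataFrame result
print(np.allclose(pred_shuffled, pred_shuffled_np))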
How does it work in scikit-learn
I also compared this behavior with the random forest model from the scikit-learn package. When using random forests, if you change the column order in your DataFrame, the code throws an exception.
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
# Set random seed for reproducibility
np.random.seed(42)
# Generate random data
n_samples = 1000
df = pd.DataFrame({
    'feature1': np.random.randn(n_samples),
    'feature2': np.random.randn(n_samples),
    'feature3': np.random.randn(n_samples)
})
# Create a target with a linear relationship plus some noise
df['target'] = 3 * df['feature1'] - 2 * df['feature2'] + np.random.randn(n_samples)
# Split the data into training and test sets
train_df, test_df = train_test_split(df, test_size=0.2, random_state=42)
X_train = train_df.drop(columns=['target'])
y_train = train_df['target']
X_test = test_df.drop(columns=['target'])
# Train a RandomForestRegressor
model = RandomForestRegressor(random_state=42)
model.fit(X_train, y_train)
# Predict using the original column order
pred_original = model.predict(X_test)
# Shuffle the columns in the test set
shuffled_columns = ['feature3', 'feature2', 'feature1']
X_test_shuffled = X_test[shuffled_columns]
# Predict using the shuffled column order
pred_shuffled = model.predict(X_test_shuffled)
# Compare the predictions
are_equal = np.allclose(pred_original, pred_shuffled)
print("Are the predictions identical when columns are shuffled? ", are_equal)
The output from running RandomForestRegressor:
Traceback (most recent call last):
File "/Users/olunia/sandbox/mljar-supervised/rf_order.py", line 37, in <module>
pred_shuffled = model.predict(X_test_shuffled)
File "/Users/olunia/sandbox/mljar-supervised/venv/lib/python3.9/site-packages/sklearn/ensemble/_forest.py", line 1063, in predict
X = self._validate_X_predict(X)
File "/Users/olunia/sandbox/mljar-supervised/venv/lib/python3.9/site-packages/sklearn/ensemble/_forest.py", line 641, in _validate_X_predict
X = self._validate_data(
File "/Users/olunia/sandbox/mljar-supervised/venv/lib/python3.9/site-packages/sklearn/base.py", line 608, in _validate_data
self._check_feature_names(X, reset=reset)
File "/Users/olunia/sandbox/mljar-supervised/venv/lib/python3.9/site-packages/sklearn/base.py", line 535, in _check_feature_names
raise ValueError(message)
ValueError: The feature names should match those that were passed during fit.
Feature names must be in the same order as they were in fit.
I think it is better to throw an exception than to silently produce wrong predictions.
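If you do hit this error with scikit-learn, the fix is analogous to the LightGBM one shown in the next section: reorder the columns back to the order stored on the fitted model. Below is a minimal sketch, assuming scikit-learn 1.0 or newer, where an estimator fitted on a DataFrame exposes the training column order in feature_names_in_:
# Reorder the shuffled columns back to the order seen during fit
# (feature_names_in_ is available when the model was fitted on a DataFrame)
X_test_reordered = X_test_shuffled[model.feature_names_in_]
pred_reordered = model.predict(X_test_reordered)
print(np.allclose(pred_original, pred_reordered))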
A Simple Fix for column order
If you use the scikit-learn API for LightGBM, there is an easy way to solve this issue. The fitted LightGBM model has an attribute called feature_name_ that stores the original order of the features used during training. Before making predictions, you simply need to reorder the columns of your new data to match this original order.
X_test_shuffled = X_test[model.feature_name_]
# Predict using the reordered columns
pred_shuffled = model.predict(X_test_shuffled)
# Compare the predictions
are_equal = np.allclose(pred_original, pred_shuffled)
print("Are the predictions identical when columns are shuffled? ", are_equal)
LightGBM validate input features
After writing this article, a friend of mine recommended using the validate_features argument of the predict() function. This argument is False by default, but when set to True it forces a feature order check. Below is the updated code and the exception returned by LightGBM:
# Predict using the shuffled column order
pred_shuffled = model.predict(X_test_shuffled, validate_features=True)
Output
4.6.0
[LightGBM] [Fatal] Expected 'feature1' at position 0 but found 'feature3'
Traceback (most recent call last):
File "/Users/olunia/sandbox/mljar-supervised/lgbm_order.py", line 40, in <module>
pred_shuffled = model.predict(X_test_shuffled, validate_features=True)
File "/Users/olunia/sandbox/mljar-supervised/venv/lib/python3.9/site-packages/lightgbm/sklearn.py", line 1144, in predict
return self._Booster.predict( # type: ignore[union-attr]
File "/Users/olunia/sandbox/mljar-supervised/venv/lib/python3.9/site-packages/lightgbm/basic.py", line 4767, in predict
return predictor.predict(
File "/Users/olunia/sandbox/mljar-supervised/venv/lib/python3.9/site-packages/lightgbm/basic.py", line 1149, in predict
_safe_call(
File "/Users/olunia/sandbox/mljar-supervised/venv/lib/python3.9/site-packages/lightgbm/basic.py", line 313, in _safe_call
raise LightGBMError(_LIB.LGBM_GetLastError().decode("utf-8"))
lightgbm.basic.LightGBMError: Expected 'feature1' at position 0 but found 'feature3'
In my opinion, the safest approach is to always reorder the columns and keep feature validation enabled:
# Reorder features
X_test_shuffled = X_test[model.feature_name_]
# Predict using the reordered columns with feature validation
pred_shuffled = model.predict(X_test_shuffled, validate_features=True)
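You can wrap both steps in a small helper and call it everywhere you predict. This is just a sketch: the function name predict_in_training_order is my own, and it assumes the LightGBM scikit-learn API with the feature_name_ attribute and validate_features argument used above:
def predict_in_training_order(model, X):
    # Reorder the DataFrame columns to the order used during training,
    # then predict with feature validation enabled as an extra safety net.
    return model.predict(X[model.feature_name_], validate_features=True)

pred_safe = predict_in_training_order(model, X_test_shuffled)
print(np.allclose(pred_original, pred_safe))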
Summary
This experience taught me a valuable lesson: always keep the column order consistent between training and prediction, even when using Pandas DataFrames. It is a simple detail that can have a big impact on your model's performance when running in production.