Train Random Forest regressor

Scikit-learn provides an implementation of the Random Forest algorithm that can be used to predict a continuous target. In this notebook, we will use Python code to train a RandomForestRegressor to predict real estate prices.

This notebook was created with MLJAR Studio

MLJAR Studio is a Python code editor with interactive code recipes and a local AI assistant.
The code recipes UI is displayed at the top of code cells.


The hard part of creating a good Machine Learning pipeline is data preparation. In this notebook, we are using the Housing dataset from the Sample datasets recipe. This dataset is already prepared for training: there are no missing values and no categorical features that need preprocessing.

# import packages
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestRegressor
# load example dataset
df = pd.read_csv(
    "https://raw.githubusercontent.com/pplonski/datasets-for-start/master/housing/data.csv"
)
# display first rows
df.head()
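The dataset is described above as ready for training. A quick optional check (not part of the original recipe) can confirm that there are no missing values and that all columns are numeric:

# check that the dataset needs no preprocessing
print(f"Missing values: {df.isnull().sum().sum()}")
print(df.dtypes)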

We need to split the pandas DataFrame into X and y. What are these X and y variables?

In Machine Learning, X and y are commonly used to represent:

  • X: The input features or independent variables. It's typically a matrix where each row represents an observation and each column represents a feature.
  • y: The target variable or dependent variable. It's usually a vector where each element represents the target value corresponding to each observation in "X".

For example, in the context of training a model to predict real estate prices:

  • X: Various features such as the size of the house, number of bedrooms, location, etc.
  • y: The real estate prices.
# create X columns list and set y column
x_cols = [
    "CRIM",
    "ZN",
    "INDUS",
    "CHAS",
    "NOX",
    "RM",
    "AGE",
    "DIS",
    "RAD",
    "TAX",
    "PTRATIO",
    "B",
    "LSTAT",
]
y_col = "MEDV"
# set input matrix
X = df[x_cols]
# set target vector
y = df[y_col]
# display data shapes
print(f"X shape is {X.shape}")
print(f"y shape is {y.shape}")

The code below initializes the forest variable, an object of the RandomForestRegressor class. The model is not ready for use right after initialization; it only stores the settings - the hyperparameter values that will be used during fit().

# initialize Random Forest
forest = RandomForestRegressor(
    n_estimators=100, criterion="squared_error", random_state=42, n_jobs=-1
)
# display model card
forest
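If you want to see every hyperparameter value the model will use during fit(), including the defaults we did not set explicitly, you can print them (an optional check, not part of the original recipe):

# list every hyperparameter the estimator will use during fit()
print(forest.get_params())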

Training of the Random Forest is performed with the fit() function. The code looks simple, but this step might be the most time-consuming one. Training on large datasets can take hours or days - no kidding.

# fit model
forest.fit(X, y)
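On this small dataset the fit is fast, but if you are curious how long training takes, you can time it with a small optional snippet (not part of the original recipe):

import time

# time the training step (refits the same model)
start = time.perf_counter()
forest.fit(X, y)
print(f"Training took {time.perf_counter() - start:.2f} seconds")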

We compute predictions on the training dataset, just to show you how to use the Random Forest regressor to compute predictions.

# compute prediction
predicted = forest.predict(X)
print("Predictions")
print(predicted)
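Raw predictions are hard to judge by eye. One optional way to summarize how close they are to the true values (not shown in the original notebook) is to compute standard regression metrics on the training data:

from sklearn.metrics import mean_absolute_error, r2_score

# summarize prediction quality on the training data
print(f"MAE: {mean_absolute_error(y, predicted):.2f}")
print(f"R^2: {r2_score(y, predicted):.2f}")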

Let's add a new predicted column to our DataFrame. We will create a scatter plot with matplotlib showing ground truth values vs predicted values.

# save predictions in DataFrame
df["predicted"] = predicted
# create scatter
plt.scatter(df["MEDV"], df["predicted"], color="tab:blue", alpha=0.3)
plt.title("Grand truth vs predicted")
plt.xlabel("True value")
plt.ylabel("Predicted")
# display plot
plt.show()
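To make the plot easier to read, you can optionally redraw it with the y = x reference line that marks perfect predictions (a small addition, not part of the original recipe):

# redraw the scatter with a y = x reference line for perfect predictions
plt.scatter(df["MEDV"], df["predicted"], color="tab:blue", alpha=0.3)
lims = [df["MEDV"].min(), df["MEDV"].max()]
plt.plot(lims, lims, color="tab:red", linestyle="--", label="perfect prediction (y = x)")
plt.title("Ground truth vs predicted")
plt.xlabel("True value")
plt.ylabel("Predicted")
plt.legend()
plt.show()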

Conclusions

I really enjoyed building this Python notebook. I used the Housing dataset from the Sample datasets recipe. This dataset describes real estate properties and is perfect for Machine Learning exercises because it doesn't require data preprocessing. The Random Forest regressor was trained on all available data. In real life, you would keep a portion of the samples (~25%) for testing, as sketched below. At the end, predictions were computed. The scatter plot shows true target values vs predicted ones - for each point we know both the real value and the predicted one. In the case of perfect predictions, all points would lie on the straight line y = x. In our notebook, you can see that there are some errors in the predictions. This is expected: a model that makes perfect predictions on its training data is most likely overfitting.
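As a rough sketch of that hold-out split (not part of the original notebook; the test_forest variable name is ours), keeping 25% of the samples for testing could look like this:

from sklearn.model_selection import train_test_split

# hold out 25% of the samples for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)
test_forest = RandomForestRegressor(n_estimators=100, random_state=42, n_jobs=-1)
test_forest.fit(X_train, y_train)
# R^2 on data the model has never seen
print(f"Test R^2: {test_forest.score(X_test, y_test):.2f}")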

All the best! 😊


Packages used in the train-random-forest-regressor.ipynb

List of packages that need to be installed in your Python environment to run this notebook. Please note that MLJAR Studio automatically installs and imports required modules for you.

pandas>=1.0.0

scikit-learn>=1.5.0

matplotlib>=3.8.4
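If you run this notebook outside MLJAR Studio, the packages above can be installed manually, for example from a notebook cell (an optional step, not in the original):

%pip install "pandas>=1.0.0" "scikit-learn>=1.5.0" "matplotlib>=3.8.4"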
