AutoML Housing Use Case

How AutoML Can Help

MLJAR AutoML simplifies housing market analysis by automating the training and selection of machine learning models. It helps with housing-related tasks such as predicting property prices, identifying market trends, and evaluating investment opportunities. To demonstrate, we will use the house-prices dataset, which contains 1,460 samples and describes each property with characteristics such as area, number of bedrooms and bathrooms, location, year of construction, and others. These features are used to predict house prices by capturing the relationship between property features and market values. By automating this analysis, MLJAR AutoML gives companies insights and forecasts that support informed decisions.

Business Value

25%
More Accurate Price Predictions

When compared to conventional approaches, MLJAR AutoML improves price predictions by about 25% and offers accurate property value estimates.

20%
Improved Forecasting

Using property data, MLJAR AutoML improves the assessment of possible investments by providing a 20% increase in forecasting accuracy for future price changes.

30%
Better Trend Detection

MLJAR AutoML's algorithms can detect new trends and changes in the housing market up to 30% more accurately, which supports better strategic decisions.

25%
Better Strategic Planning

By optimizing decision-making processes with machine learning, MLJAR AutoML can improve strategic planning and operational efficiency by approximately 25%.

AutoML Report

MLJAR AutoML creates a detailed report filled with valuable information, offering deep insights into model performance, data analysis, and evaluation metrics. Here are some examples.

Leaderboard

In this case, we used Compete mode to better determine housing prices. It uses feature generation and Stacked Ensemble techniques. AutoML chose rmse as the metric it would use to measure the performance of trained models. It then selected Ensemble_Stacked as the best model, as shown in the table and chart below.

name | model_type | metric_type | metric_value | train_time | best model
1_DecisionTree | Decision Tree | rmse | 45384.9 | 2.8 |
2_DecisionTree | Decision Tree | rmse | 41576.1 | 2.7 |
3_DecisionTree | Decision Tree | rmse | 41576.1 | 2.75 |
4_Linear | Linear | rmse | 5.60193e+14 | 4.07 |
5_Default_LightGBM | LightGBM | rmse | 28368.7 | 6.02 |
6_Default_Xgboost | Xgboost | rmse | 28406.3 | 8.36 |
7_Default_CatBoost | CatBoost | rmse | 26154.6 | 125.91 |
5_Default_LightGBM_GoldenFeatures | LightGBM | rmse | 28917.3 | 15.85 |
5_Default_LightGBM_KMeansFeatures | LightGBM | rmse | 28909.9 | 6.83 |
5_Default_LightGBM_RandomFeature | LightGBM | rmse | 28918.4 | 12.34 |
5_Default_LightGBM_SelectedFeatures | LightGBM | rmse | 26587.2 | 14.35 |
10_LightGBM_SelectedFeatures | LightGBM | rmse | 27225.5 | 5.03 |
11_LightGBM_SelectedFeatures | LightGBM | rmse | 26835.3 | 28.96 |
14_LightGBM_SelectedFeatures | LightGBM | rmse | 27387.4 | 6.24 |
5_Default_LightGBM_SelectedFeatures_BoostOnErrors | LightGBM | rmse | 27911.5 | 6.12 |
Ensemble | Ensemble | rmse | 25373.1 | 0.7 |
5_Default_LightGBM_SelectedFeatures_Stacked | LightGBM | rmse | 26498.7 | 5.83 |
6_Default_Xgboost_Stacked | Xgboost | rmse | 26029.1 | 6.06 |
11_LightGBM_SelectedFeatures_Stacked | LightGBM | rmse | 26557.4 | 6.48 |
10_LightGBM_SelectedFeatures_Stacked | LightGBM | rmse | 26327.8 | 5.22 |
14_LightGBM_SelectedFeatures_Stacked | LightGBM | rmse | 26467.9 | 6.11 |
Ensemble_Stacked | Ensemble | rmse | 25030.3 | 1.08 | the best
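
For readers unfamiliar with the metric: rmse (root mean squared error) is the square root of the average squared difference between predicted and actual prices, so it is expressed in the same units as SalePrice and lower values are better. The sketch below computes it in plain Python. The function name and the price values are made up for illustration; they are not taken from the dataset or from the leaderboard.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# hypothetical sale prices and predictions (illustration only)
actual = [200000, 150000, 320000]
predicted = [210000, 140000, 350000]
print(f"RMSE: {rmse(actual, predicted):.1f}")
```

Because errors are squared before averaging, a single large miss raises rmse more than several small ones, which is why the third house dominates the result here.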

Performance

[Figure: AutoML performance chart]

Spearman Correlation of Models

The heatmap shows the pairwise Spearman correlation coefficients between the predictions of the trained models. Each cell measures how monotonic the relationship between two models' predictions is, that is, how similarly the two models rank the same data points. Values close to 1 mean that two models order the data points almost identically, while values near 0 mean little or no agreement. The heatmap therefore shows how similar the models are to one another, not how accurate each one is.

[Figure: heatmap of Spearman correlations between models]
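
A correlation matrix of this kind can be reproduced from a table of predictions. The following is a minimal sketch using pandas' DataFrame.corr with method="spearman"; the model names and prediction values are hypothetical and only illustrate the calculation, not how MLJAR AutoML builds its report.

```python
import pandas as pd

# hypothetical predictions from three models for the same five houses
preds = pd.DataFrame({
    "model_a": [100.0, 150.0, 200.0, 250.0, 300.0],
    "model_b": [110.0, 140.0, 260.0, 240.0, 310.0],
    "model_c": [300.0, 250.0, 200.0, 150.0, 100.0],
})

# Spearman correlation compares the rank order of the predictions,
# not their actual values
corr = preds.corr(method="spearman")
print(corr.round(2))
```

In this toy data, model_a and model_b rank the houses almost identically, while model_c reverses the order and therefore gets a negative coefficient.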

Install and import necessary packages

Install the packages with the command:

pip install pandas scikit-learn mljar-supervised

Import the packages into your code:

# import packages
import pandas as pd
from sklearn.model_selection import train_test_split
from supervised import AutoML

Load data

We will read data from a house-prices dataset.

# load example dataset
df = pd.read_csv("https://raw.githubusercontent.com/pplonski/datasets-for-start/master/house_prices/data.csv")
# display DataFrame shape
print(f"Loaded data shape {df.shape}")
# display first rows
df.head()
Loaded data shape (1460, 81)
Id MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape LandContour Utilities ... PoolArea PoolQC Fence MiscFeature MiscVal MoSold YrSold SaleType SaleCondition SalePrice
0 1 60 RL 65.0 8450 Pave NaN Reg Lvl AllPub ... 0 NaN NaN NaN 0 2 2008 WD Normal 208500
1 2 20 RL 80.0 9600 Pave NaN Reg Lvl AllPub ... 0 NaN NaN NaN 0 5 2007 WD Normal 181500
2 3 60 RL 68.0 11250 Pave NaN IR1 Lvl AllPub ... 0 NaN NaN NaN 0 9 2008 WD Normal 223500
3 4 70 RL 60.0 9550 Pave NaN IR1 Lvl AllPub ... 0 NaN NaN NaN 0 2 2006 WD Abnorml 140000
4 5 60 RL 84.0 14260 Pave NaN IR1 Lvl AllPub ... 0 NaN NaN NaN 0 12 2008 WD Normal 250000

5 rows × 81 columns

Split dataframe to train/test

We split the dataframe into a training set (95% of the rows), used to fit the models, and a test set (the remaining 5%), held out for evaluation. This lets us assess the model's performance on unseen data.

# split data
train, test = train_test_split(df, train_size=0.95, shuffle=True, random_state=42)
# display data shapes
print(f"All data shape {df.shape}")
print(f"Train shape {train.shape}")
print(f"Test shape {test.shape}")
All data shape (1460, 81)
Train shape (1387, 81)
Test shape (73, 81)

Select X,y for ML training

We will split the training set into features (X) and target (y) variables for model training.

# create X columns list and set y column
x_cols = ["Id", "MSSubClass", "MSZoning", "LotFrontage", "LotArea", "Street", "Alley", "LotShape", "LandContour", "Utilities", "LotConfig", "LandSlope", "Neighborhood", "Condition1", "Condition2", "BldgType", "HouseStyle", "OverallQual", "OverallCond", "YearBuilt", "YearRemodAdd", "RoofStyle", "RoofMatl", "Exterior1st", "Exterior2nd", "MasVnrType", "MasVnrArea", "ExterQual", "ExterCond", "Foundation", "BsmtQual", "BsmtCond", "BsmtExposure", "BsmtFinType1", "BsmtFinSF1", "BsmtFinType2", "BsmtFinSF2", "BsmtUnfSF", "TotalBsmtSF", "Heating", "HeatingQC", "CentralAir", "Electrical", "1stFlrSF", "2ndFlrSF", "LowQualFinSF", "GrLivArea", "BsmtFullBath", "BsmtHalfBath", "FullBath", "HalfBath", "BedroomAbvGr", "KitchenAbvGr", "KitchenQual", "TotRmsAbvGrd", "Functional", "Fireplaces", "FireplaceQu", "GarageType", "GarageYrBlt", "GarageFinish", "GarageCars", "GarageArea", "GarageQual", "GarageCond", "PavedDrive", "WoodDeckSF", "OpenPorchSF", "EnclosedPorch", "3SsnPorch", "ScreenPorch", "PoolArea", "PoolQC", "Fence", "MiscFeature", "MiscVal", "MoSold", "YrSold", "SaleType", "SaleCondition"]
y_col = "SalePrice"
# set input matrix
X = train[x_cols]
# set target vector
y = train[y_col]
# display data shapes
print(f"X shape is {X.shape}")
print(f"y shape is {y.shape}")
X shape is (1387, 80)
y shape is (1387,)
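
Typing out every column name is error-prone. As an alternative, the feature list can be derived from the dataframe by excluding the target column. This is a minimal sketch on a three-row excerpt of columns shown earlier (MSSubClass, LotArea, SalePrice); it is self-contained for illustration, and in the tutorial the same list comprehension would be applied to train.

```python
import pandas as pd

# three-row excerpt standing in for the house-prices training data
df = pd.DataFrame({
    "MSSubClass": [60, 20, 60],
    "LotArea": [8450, 9600, 11250],
    "SalePrice": [208500, 181500, 223500],
})

y_col = "SalePrice"
# every column except the target becomes a feature
x_cols = [col for col in df.columns if col != y_col]

# set input matrix and target vector
X = df[x_cols]
y = df[y_col]
print(f"Feature columns: {x_cols}")
print(f"X shape is {X.shape}")
print(f"y shape is {y.shape}")
```

Applied to the full dataset, this keeps Id among the features, as the explicit list above does; whether a row identifier should be used as a feature is a modelling choice.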

Fit AutoML

Next, we train models on our dataset. The fit() method handles model training and optimization automatically. As mentioned earlier, we use Compete mode, here with a total time limit of 300 seconds.

# create automl object
automl = AutoML(total_time_limit=300, mode="Compete")
# train automl
automl.fit(X, y)

Predict

Generate predictions using the trained AutoML model on test data.

# predict with AutoML, passing only the feature columns used in training
predictions = automl.predict(test[x_cols])
# predicted values
print(predictions)
[144200.01854481 333014.50875311 106213.75943714 153613.39006172
 324427.2050469   82008.58172825 230573.3966845  141587.7751637
  81523.7365335  141939.3305265  151555.51526989 122295.0114666
 113791.78288931 196722.3717284  170567.81345626 132755.77420974
 194888.61854689 132510.55987797 111215.81206006 210739.81815179
 158445.52435324 215305.04739866 172982.38898362 133318.89830203
 200452.49475991 165324.35603598 190850.35102739 115355.07335218
 171394.52910499 195549.31027695 121300.92655452 265344.66703149
 212471.70549282 116628.94083787 261174.65724333 151219.67212891
 131934.61867884 207408.74543931 349658.26693882 101137.52189884
 124121.94806606 242956.2250202  118476.9337727  372726.93034982
 127100.51320983 131670.11189905 109370.39616166 126313.82013718
 434816.2930139  130982.65444246 122208.70573846 208118.20920927
 117267.20590026 350947.94791427 150172.14678739 241902.91119148
 196417.82949507 160127.23490299 136301.1819027  105939.59190505
  70311.72629056 156734.23298213 300181.65260702 282059.83093084
 298303.82265402 216830.45019613 107068.93271429 343398.87749412
 113438.95318629 164573.35348328 121959.33290497 123339.73676794
 114607.86273824]
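
Since the test set still contains the actual SalePrice values, the predictions can be checked against them. Below is a minimal sketch, assuming numpy is available (it is installed as a dependency of pandas). The helper name evaluate_predictions and the sample numbers are hypothetical; in the tutorial's context the inputs would be test["SalePrice"] and predictions.

```python
import numpy as np


def evaluate_predictions(y_true, y_pred):
    """Return mean absolute error and mean absolute percentage error (in %)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    abs_errors = np.abs(y_pred - y_true)
    mae = float(np.mean(abs_errors))
    # relative error per house, averaged and expressed as a percentage
    mape = float(np.mean(abs_errors / np.abs(y_true)) * 100)
    return {"mae": mae, "mape_percent": mape}


# hypothetical actual prices and predictions (illustration only)
actual = [140000, 250000, 180000, 310000]
predicted = [147000, 240000, 189000, 310000]
metrics = evaluate_predictions(actual, predicted)
print(metrics)
```

The absolute error is expressed in price units, while the percentage error puts it in proportion to each house's price, which is often easier to interpret across cheap and expensive properties.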

Conclusions

The adoption of MLJAR AutoML in the housing industry brings transformative benefits. It offers precise property valuations, insightful market trend analyses, and personalized customer experiences. By automating complex data analysis and operational tasks, AutoML allows real estate professionals to make informed decisions and focus more on strategic initiatives. As this technology advances, its role in enhancing efficiency and accuracy in the housing sector will become increasingly indispensable, paving the way for a smarter, more responsive real estate market.

See you soon👋.