AutoML Climate Change Use Case

How AutoML Can Help

As an example, we use a dataset of simulation crashes that occurred during uncertainty quantification (UQ) ensembles for climate models. The dataset contains records from simulations run in LLNL's UQ Pipeline software system using a Latin hypercube sampling technique. It comprises three distinct 180-member Latin hypercube ensembles, for a total of 540 simulations. Of these, 46 failed because of numerical problems triggered by particular parameter combinations. Our goal is to build classification models that, given the input parameter values, predict whether a simulation will succeed or fail. With MLJAR AutoML we can automate much of the work of building and evaluating predictive models, which lets us examine the factors behind simulation failures more thoroughly. This automation improves both the efficiency and the accuracy of the analysis, supporting more reliable climate models and better-informed decisions in climate research.

Business Value

30%
Faster

MLJAR AutoML can process large climate datasets up to 30% faster than conventional techniques. This speed-up enables quicker identification of trends and anomalies, making climate research and decision-making more responsive.

25%
More Accurate

By automating model building and optimization, MLJAR AutoML can increase the forecasting accuracy of climate models by up to 25%. More accurate forecasts, in turn, support better-informed policy and mitigation planning.

35%
Larger Data Capacity

MLJAR AutoML can efficiently handle and analyze up to 35% more data than traditional methods. This capacity supports more thorough climate simulations and a deeper understanding of complex climate systems.

20%
Deeper Insights

By surfacing patterns and relationships in climate data that were previously overlooked, automation with MLJAR AutoML can improve the detection of important factors influencing climate change by up to 20%.

40%
Quicker Model Development

MLJAR AutoML can cut the time needed to build and deploy predictive models by up to 40%, shortening the cycle from data insights to workable climate strategies.

AutoML Report

MLJAR AutoML generates comprehensive reports packed with valuable information, giving deep insight into model performance, data analysis, and evaluation metrics. A few examples follow.

Leaderboard

To evaluate the trained models, AutoML used logloss as its performance metric. As the table below shows, the Ensemble achieved the lowest logloss and was therefore chosen as the best model.

Best model   name                     model_type      metric_type   metric_value   train_time
             1_Baseline               Baseline        logloss       0.283614       0.79
             2_DecisionTree           Decision Tree   logloss       0.49616        10.27
             3_Linear                 Linear          logloss       0.216          4.41
             4_Default_Xgboost        Xgboost         logloss       0.249461       5.28
             5_Default_NeuralNetwork  Neural Network  logloss       0.315241       2.47
             6_Default_RandomForest   Random Forest   logloss       0.231318       7.17
the best     Ensemble                 Ensemble        logloss       0.215956       1.59
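For intuition about the metric: logloss averages the negative log of the probability a model assigns to the true class, so confident wrong predictions are penalized heavily and lower values are better. A minimal sketch with scikit-learn's log_loss, using illustrative labels and probabilities rather than values from this dataset:

```python
from sklearn.metrics import log_loss

# illustrative labels (1 = success, 0 = crash) and
# predicted probabilities of class 1 from a hypothetical model
y_true = [1, 0, 1, 1, 0]
y_prob = [0.9, 0.2, 0.8, 0.7, 0.1]

# log loss averages -log(probability assigned to the true class)
score = log_loss(y_true, y_prob)
print(round(score, 4))  # lower is better; here about 0.2027
```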

Performance

AutoML Performance

Spearman Correlation of Models

The plot shows the relationships between the models based on their rank-order correlation. Spearman correlation measures how well the relationship between two variables can be described by a monotonic function; here it quantifies how consistently two models rank the same samples. Higher correlation values indicate stronger association, meaning the models perform similarly across the data.

models spearman correlation
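The underlying statistic is easy to compute directly. A minimal sketch with scipy, using hypothetical scores from two models (not values from this run):

```python
from scipy.stats import spearmanr

# hypothetical crash probabilities from two models on the same five simulations
model_a = [0.10, 0.40, 0.35, 0.80, 0.25]
model_b = [0.15, 0.45, 0.30, 0.90, 0.20]

# both models rank the simulations identically,
# so their Spearman correlation is perfect
rho, _ = spearmanr(model_a, model_b)
print(round(rho, 4))  # → 1.0
```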

Feature Importance

The heatmap shows how important each feature is to each model. Every cell gives the importance of a single feature for a particular model, with color intensity indicating the degree of importance: darker or more vivid colors mean higher importance, lighter colors mean lower importance. Comparing models this way helps identify which features consistently contribute to prediction performance and which have little effect.

feature importance across models
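One common way to obtain the per-model importance values such a heatmap aggregates is permutation importance: shuffle one feature's values and measure how much the model's score drops. A sketch on synthetic stand-in data with scikit-learn (this is an illustration of the technique, not MLJAR's internal computation):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

# synthetic stand-in for the simulation data: 5 inputs, 2 of them informative
X, y = make_classification(n_samples=200, n_features=5, n_informative=2,
                           n_redundant=0, random_state=42)
model = RandomForestClassifier(random_state=42).fit(X, y)

# importance = mean drop in score when a feature's values are shuffled
result = permutation_importance(model, X, y, n_repeats=10, random_state=42)
for i, imp in enumerate(result.importances_mean):
    print(f"feature_{i}: {imp:.3f}")
```

The informative features receive clearly higher scores than the noise features, which is the same contrast the heatmap colors convey.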

Install and import necessary packages

Install the packages with the command:

pip install mljar-supervised scikit-learn

Import the packages into your code:

# import packages
from supervised import AutoML
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

Load data

We will read data from an OpenML dataset.

# load dataset
data = fetch_openml(data_id=1467, as_frame=True)
X = data.data
y = data.target
# display data shape
print(f"Loaded X shape {X.shape}")
print(f"Loaded y shape {y.shape}")
# display first rows
X.head()
Loaded X shape (540, 20)
Loaded y shape (540,)
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16 V17 V18 V19 V20
0 1 1 502 0.927825 0.252866 0.298838 0.170521 0.735936 0.428325 0.567947 0.474370 0.245675 0.104226 0.869091 0.997518 0.448620 0.307522 0.858310 0.796997 0.869893
1 1 2 249 0.457728 0.359448 0.306957 0.843331 0.934851 0.444572 0.828015 0.296618 0.616870 0.975786 0.914344 0.845247 0.864152 0.346713 0.356573 0.438447 0.512256
2 1 3 202 0.373238 0.517399 0.504993 0.618903 0.605571 0.746225 0.195928 0.815667 0.679355 0.803413 0.643995 0.718441 0.924775 0.315371 0.250642 0.285636 0.365858
3 1 4 56 0.104055 0.197533 0.421837 0.742056 0.490828 0.005525 0.392123 0.010015 0.471463 0.597879 0.761659 0.362751 0.912819 0.977971 0.845921 0.699431 0.475987
4 1 5 278 0.513199 0.061812 0.635837 0.844798 0.441502 0.191926 0.487546 0.358534 0.551543 0.743877 0.312349 0.650223 0.522261 0.043545 0.376660 0.280098 0.132283

5 rows × 20 columns

Split dataframe to train/test

Splitting the dataframe into train and test sets gives us separate data for training and for evaluating the model, so we can assess its performance on unseen data.

# split data
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.90, shuffle=True, stratify=y, random_state=42)
# display data shapes
print(f"X_train shape {X_train.shape}")
print(f"X_test shape {X_test.shape}")
print(f"y_train shape {y_train.shape}")
print(f"y_test shape {y_test.shape}")
X_train shape (486, 20)
X_test shape (54, 20)
y_train shape (486,)
y_test shape (54,)

Fit AutoML

We need to train a model for our dataset. The fit() method will handle the model training and optimization automatically.

# create automl object
automl = AutoML(total_time_limit=300, mode="Explain")
# train automl
automl.fit(X_train, y_train)

Predict

Generate predictions using the trained AutoML model on test data.

# predict with AutoML
predictions = automl.predict(X_test)
# predicted values
print(predictions)
['2' '1' '1' '1' '1' '1' '1' '1' '1' '1' '2' '1' '1' '2' '1' '1' '1' '1'
 '1' '2' '2' '2' '1' '1' '1' '2' '2' '1' '1' '2' '1' '1' '1' '2' '2' '1'
 '2' '1' '1' '1' '1' '1' '2' '1' '1' '1' '1' '1' '1' '1' '1' '2' '1' '1'
 '2' '1' '1' '1' '2' '1' '2' '2' '2' '1' '1' '2' '1' '2' '2' '1' '1' '1'
 '1' '1' '2' '1' '1' '1' '1' '1' '1' '1' '1' '2' '1' '2' '1' '1' '1' '2'
 '1' '2' '1' '2' '1' '1' '1' '1' '1' '2' '1' '1' '2' '1' '1' '1']

Compute accuracy

We compute the accuracy score by comparing the true values (y_test) with our predictions.

# compute metric
metric_accuracy = accuracy_score(y_test, predictions)
print(f"Accuracy: {metric_accuracy}")
Accuracy: 0.9074074074074074
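Because crashes are rare in this dataset (46 of 540 runs), accuracy alone can hide how well the minority class is detected; a confusion matrix gives a fuller picture. A minimal sketch with illustrative labels, not the actual test split:

```python
from sklearn.metrics import confusion_matrix

# illustrative labels only; '1' and '2' mirror the dataset's two outcome classes
y_true = ['1', '2', '2', '2', '1', '2']
y_pred = ['1', '2', '2', '1', '1', '2']

# rows are true classes, columns are predicted classes, in the order of `labels`
cm = confusion_matrix(y_true, y_pred, labels=['1', '2'])
print(cm)  # [[2 0]
           #  [1 3]]
```

In the real evaluation you would pass y_test and predictions instead, which shows exactly how many failed simulations the model misses.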

Conclusions

Applied to climate model simulations, MLJAR AutoML offers several benefits for improving model reliability and understanding simulation crashes. By automating the intricate work of data analysis, it improves both the prediction of simulation outcomes and the root-cause analysis of failures. This automation lets researchers process large amounts of simulation data quickly and effectively, uncovering trends and factors that conventional techniques might miss. As the technology is adopted more widely, it has the potential to advance our understanding of climate processes and improve the accuracy of climate forecasts.

See you soon👋.