AutoML Biology Use Case

How AutoML Can Help

MLJAR AutoML automates the application of machine learning models, which simplifies complex chemical analysis. It supports a range of chemical-property tasks, including identifying important molecular descriptors, forecasting a compound's biodegradability, and assessing environmental impact quickly and precisely. The QSAR Biodegradation dataset used here contains 1,055 samples described by molecular descriptors such as topological indices, atom counts, and molecular weight. These features capture the relationship between molecular structure and environmental behavior and are used to predict whether a compound is biodegradable. By automating difficult data analysis and providing accurate insights and predictions, MLJAR AutoML helps researchers and businesses make confident decisions, innovate, and drive scientific and industrial progress.
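To make the prediction task concrete, the short sketch below loads the same OpenML dataset used later in this article (data id 1494) and prints its size and the class balance of the biodegradability target. It is only a quick look and assumes scikit-learn is already installed (installation is covered further down).

# quick look at the QSAR Biodegradation dataset (sketch)
from sklearn.datasets import fetch_openml

# data id 1494 is the QSAR Biodegradation dataset on OpenML
data = fetch_openml(data_id=1494, as_frame=True)
# print the number of samples/features and the class balance of the target
print(data.data.shape)
print(data.target.value_counts())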

Business Value

30%
Faster

MLJAR AutoML can analyze large datasets from microbiological investigations up to 30% faster than conventional techniques, which can speed up the identification of novel microbes, genetic markers, or potential therapeutic targets. This shortens the time it takes for new discoveries to reach the market and accelerates the research cycle.

25%
More Accurate

By automating model selection and tuning, MLJAR AutoML can increase the accuracy of predictive models used in microbiology by approximately 25%. This results in better predictions of microbial behavior, more dependable diagnoses, and more effective treatment plans.

20%
More Efficient

By optimizing processes and enhancing data management, MLJAR AutoML's integration with current laboratory information management systems (LIMS) and other data sources can increase the overall efficiency of microbiological research operations by about 20%.

30%
Lower Costs

By eliminating the requirement for specialist data science knowledge, MLJAR AutoML may result in a 30% reduction in the cost of creating and maintaining machine learning models. This makes resource allocation more efficient for microbiological labs and research organizations.

40%
More Productive

By automating repetitive operations such as feature engineering, data preprocessing, and model validation, MLJAR AutoML can increase researcher productivity by approximately 40%, freeing up time for more innovative and strategic work.

AutoML Report

MLJAR AutoML offers in-depth understanding of model performance, data analysis, and evaluation metrics through extensive, automatically generated reports full of useful information. Here are a few examples.
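These reports are produced automatically during training. As a minimal sketch, assuming the results_path argument of mljar-supervised (which controls where the leaderboard and per-model markdown reports are written; the directory name below is arbitrary), the output location can be set explicitly:

# sketch: choose where AutoML stores its reports
from supervised import AutoML

automl = AutoML(results_path="AutoML_biodegradability", mode="Compete")
# after calling automl.fit(...), the leaderboard and per-model reports
# can be found as markdown files inside the results_path directory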

Leaderboard

Here, Compete mode was applied to better evaluate biodegradability. This mode makes use of stacked ensembles and feature engineering techniques such as golden features. AutoML chose logloss as the performance metric for judging the trained models, and the table and charts below show that it ultimately selected the Ensemble as the best model.

name                                model_type         metric_type  metric_value  train_time
1_DecisionTree                      Decision Tree      logloss      0.465164      3.98
4_Linear                            Linear             logloss      0.379504      3.17
53_ExtraTrees                       Extra Trees        logloss      0.315308      4.1
36_CatBoost                         CatBoost           logloss      0.298112      5.28
63_NeuralNetwork                    Neural Network     logloss      0.396284      4.04
31_CatBoost_GoldenFeatures          CatBoost           logloss      0.291778      4.65
34_CatBoost_KMeansFeatures          CatBoost           logloss      0.298418      4.61
33_CatBoost_RandomFeature           CatBoost           logloss      0.300238      4.84
40_RandomForest_SelectedFeatures    Random Forest      logloss      0.306136      4.45
63_NeuralNetwork_SelectedFeatures   Neural Network     logloss      0.378618      4.27
76_CatBoost_SelectedFeatures        CatBoost           logloss      0.278526      4.56
83_LightGBM                         LightGBM           logloss      0.307299      4.42
86_ExtraTrees_SelectedFeatures      Extra Trees        logloss      0.308615      4.85
89_Xgboost                          Xgboost            logloss      0.318557      4.52
101_NearestNeighbors                Nearest Neighbors  logloss      0.557712      4.36
108_Xgboost_SelectedFeatures        Xgboost            logloss      0.26159       4.65
110_LightGBM_SelectedFeatures       LightGBM           logloss      0.292253      4.6
114_RandomForest_SelectedFeatures   Random Forest      logloss      0.306244      5.54
Ensemble (best)                     Ensemble           logloss      0.216808      65.36
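The leaderboard shown above can also be retrieved programmatically. A minimal sketch, assuming the automl instance trained in the Fit AutoML step below and the get_leaderboard() method of mljar-supervised (if it is not available in your version, the same table is saved as leaderboard.csv inside the results directory):

# sketch: pull the leaderboard from a fitted AutoML instance
# assumes automl.fit(...) has already been called
leaderboard = automl.get_leaderboard()
# the leaderboard is a pandas DataFrame with model names, metrics, and train times
print(leaderboard)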

Performance

[Figure: AutoML Performance]

[Figure: AutoML Performance Boxplot]

Install and import necessary packages

Install the packages with the command:

pip install mljar-supervised scikit-learn

Import the packages into your code:

# import packages
from supervised import AutoML
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

Load data

We will read the QSAR Biodegradation dataset from OpenML (data id 1494).

# load dataset
data = fetch_openml(data_id=1494, as_frame=True)
X = data.data
y = data.target
# display data shape
print(f"Loaded X shape {X.shape}")
print(f"Loaded y shape {y.shape}")
# display first rows
X.head()
Loaded X shape (1055, 41)
Loaded y shape (1055,)
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 ... V32 V33 V34 V35 V36 V37 V38 V39 V40 V41
0 3.919 2.6909 0 0 0 0 0 31.4 2 0 ... 0 0 0 0 2.949 1.591 0 7.253 0 0
1 4.170 2.1144 0 0 0 0 0 30.8 1 1 ... 0 0 0 0 3.315 1.967 0 7.257 0 0
2 3.932 3.2512 0 0 0 0 0 26.7 2 4 ... 0 0 0 1 3.076 2.417 0 7.601 0 0
3 3.000 2.7098 0 0 0 0 0 20.0 0 2 ... 0 0 0 1 3.046 5.000 0 6.690 0 0
4 4.236 3.3944 0 0 0 0 0 29.4 2 4 ... 0 0 0 0 3.351 2.405 0 8.003 0 0

5 rows × 41 columns

Split dataframe to train/test

We split the dataframe into train and test sets to create separate datasets for training and evaluating the model. This lets us assess the model's performance on unseen data.

# split data
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.90, shuffle=True, stratify=y, random_state=42)
# display data shapes
print(f"X_train shape {X_train.shape}")
print(f"X_test shape {X_test.shape}")
print(f"y_train shape {y_train.shape}")
print(f"y_test shape {y_test.shape}")
X_train shape (949, 41)
X_test shape (106, 41)
y_train shape (949,)
y_test shape (106,)

Fit AutoML

We need to train a model for our dataset. The fit() method will handle the model training and optimization automatically. As I mentioned earlier, we will use Compete mode.

# create automl object
automl = AutoML(total_time_limit=600, mode="Compete")
# train automl
automl.fit(X_train, y_train)

Predict

Generate predictions using the trained AutoML model on test data.

# predict with AutoML
predictions = automl.predict(X_test)
# predicted values
print(predictions)
['2' '1' '1' '1' '1' '1' '1' '1' '1' '1' '2' '1' '1' '2' '1' '1' '1' '1'
 '1' '2' '2' '2' '1' '1' '1' '2' '2' '1' '1' '2' '1' '1' '1' '2' '2' '1'
 '2' '1' '1' '1' '1' '1' '2' '1' '1' '1' '1' '1' '1' '1' '1' '2' '1' '1'
 '2' '1' '1' '1' '2' '1' '2' '2' '2' '1' '1' '2' '1' '2' '2' '1' '1' '1'
 '1' '1' '2' '1' '1' '1' '1' '1' '1' '1' '1' '2' '1' '2' '1' '1' '1' '2'
 '1' '2' '1' '2' '1' '1' '1' '1' '1' '2' '1' '1' '2' '1' '1' '1']
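Since the models were optimized for logloss, class probabilities can be more informative than hard labels. A minimal sketch, assuming the same fitted automl object and test data, using AutoML's predict_proba():

# sketch: predict class probabilities instead of hard labels
probabilities = automl.predict_proba(X_test)
# one column per class; show the first few rows
print(probabilities[:5])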

Compute accuracy

We compute the accuracy score by comparing the true values (y_test) with our predictions.

# compute metric
metric_accuracy = accuracy_score(y_test, predictions)
print(f"Accuracy: {metric_accuracy}")
Accuracy: 0.8301886792452831

Conclusions

There are many advantages to using MLJAR AutoML in biological research and diagnostics. By automating the analysis of extensive biological and chemical data, it enables precise forecasts, early detection of biological phenomena, and the development of targeted interventions. Because AutoML handles enormous datasets efficiently, it can find patterns and insights that traditional approaches might miss. As the technology matures, it will become increasingly important in biological research and diagnostics, improving both research outcomes and the pace of discovery.

See you soon👋.