AutoML Biology Use Case

How AutoML Can Help

MLJAR AutoML automates the building of machine learning models, which simplifies complex chemical analysis. It helps with tasks such as identifying important molecular descriptors, predicting a compound's biodegradability, and assessing environmental impact quickly and precisely. The QSAR Biodegradation dataset used here contains 1,055 samples described by 41 features, including topological descriptors, atom counts, and molecular weight. These features capture the relationship between molecular structure and environmental behavior, and they are used to predict whether a substance is biodegradable. By automating difficult data analysis and providing accurate predictions, MLJAR AutoML helps researchers and businesses make confident decisions and innovate.

Business Value

30%
Faster

Large datasets from microbiological investigations can be analyzed by MLJAR AutoML up to 30% faster than by conventional techniques, which can speed up the identification of novel microbes, genetic markers, or possible therapeutic targets. This shortens the time it takes for new discoveries to reach the market and speeds up the research cycle.

25%
More Accurate

By automating model selection and tuning, MLJAR AutoML can increase the accuracy of predictive models used in microbiology by approximately 25%. This leads to better predictions of microbial behavior, more dependable diagnoses, and more successful treatment plans.

20%
More Efficient

By optimizing processes and enhancing data management, MLJAR AutoML's integration with current laboratory information management systems (LIMS) and other data sources can increase the overall efficiency of microbiological research operations by about 20%.

30%
Lower Costs

By eliminating the requirement for specialist data science knowledge, MLJAR AutoML may result in a 30% reduction in the cost of creating and maintaining machine learning models. This makes resource allocation more efficient for microbiological labs and research organizations.

40%
More Productive

Through the automation of repetitive operations like feature engineering, data preprocessing, and model validation, MLJAR AutoML can increase researcher productivity by approximately 40%. This frees up time for more innovative and high-level strategic duties.

AutoML Report

MLJAR AutoML generates extensive reports that give an in-depth view of model performance, data analysis, and evaluation metrics. Here are a few examples.

Leaderboard

Here, Compete mode was used to predict biodegradation. This mode uses a stacked Ensemble and feature-generation steps such as the Golden Features and K-Means Features variants listed in the leaderboard. AutoML selected logloss as the evaluation metric for the trained models. For logloss, lower is better, so the Ensemble, with a logloss of 0.216808, was selected as the best model, as the table and charts below show.

Best model	name	model_type	metric_type	metric_value	train_time
	1_DecisionTree	Decision Tree	logloss	0.465164	3.98
	4_Linear	Linear	logloss	0.379504	3.17
	53_ExtraTrees	Extra Trees	logloss	0.315308	4.1
	36_CatBoost	CatBoost	logloss	0.298112	5.28
	63_NeuralNetwork	Neural Network	logloss	0.396284	4.04
	31_CatBoost_GoldenFeatures	CatBoost	logloss	0.291778	4.65
	34_CatBoost_KMeansFeatures	CatBoost	logloss	0.298418	4.61
	33_CatBoost_RandomFeature	CatBoost	logloss	0.300238	4.84
	40_RandomForest_SelectedFeatures	Random Forest	logloss	0.306136	4.45
	63_NeuralNetwork_SelectedFeatures	Neural Network	logloss	0.378618	4.27
	76_CatBoost_SelectedFeatures	CatBoost	logloss	0.278526	4.56
	83_LightGBM	LightGBM	logloss	0.307299	4.42
	86_ExtraTrees_SelectedFeatures	Extra Trees	logloss	0.308615	4.85
	89_Xgboost	Xgboost	logloss	0.318557	4.52
	101_NearestNeighbors	Nearest Neighbors	logloss	0.557712	4.36
	108_Xgboost_SelectedFeatures	Xgboost	logloss	0.26159	4.65
	110_LightGBM_SelectedFeatures	LightGBM	logloss	0.292253	4.6
	114_RandomForest_SelectedFeatures	Random Forest	logloss	0.306244	5.54
the best	Ensemble	Ensemble	logloss	0.216808	65.36
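
To make the leaderboard metric more concrete, here is a minimal sketch of binary logloss in plain Python. The helper name `binary_log_loss` and the toy labels and probabilities are made up for illustration; they are not taken from the AutoML run, and AutoML computes the metric internally.

```python
import math

def binary_log_loss(y_true, p_pred, eps=1e-15):
    """Average negative log-likelihood for 0/1 labels and predicted P(label == 1)."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        # Clip probabilities so log(0) never occurs.
        p = min(max(p, eps), 1 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

# Hypothetical labels and predicted probabilities (illustration only).
y_true = [1, 0, 1, 1]
confident = [0.9, 0.1, 0.8, 0.7]   # mostly right and fairly sure
hesitant = [0.6, 0.4, 0.6, 0.6]    # right, but close to 0.5

print("confident model:", binary_log_loss(y_true, confident))
print("hesitant model: ", binary_log_loss(y_true, hesitant))
```

Both toy models classify every sample correctly at a 0.5 threshold, yet the confident one gets the lower logloss. This is why the leaderboard can rank models more finely than accuracy alone would.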

Performance

[Figure: AutoML Performance]

[Figure: AutoML Performance Boxplot]

Install and import necessary packages

Install the packages with the command:

pip install mljar-supervised scikit-learn

Import the packages into your code:

# import packages
from supervised import AutoML
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

Load data

We will read data from an OpenML dataset.

# load dataset
data = fetch_openml(data_id=1494, as_frame=True)
X = data.data
y = data.target
# display data shape
print(f"Loaded X shape {X.shape}")
print(f"Loaded y shape {y.shape}")
# display first rows
X.head()

Loaded X shape (1055, 41)
Loaded y shape (1055,)

	V1	V2	V8	V9	V10	...	V35	V36	V37	V39
0	3.919	2.6909	31.4	2	0	...	0	2.949	1.591	7.253
1	4.170	2.1144	30.8	1	1	...	0	3.315	1.967	7.257
2	3.932	3.2512	26.7	2	4	...	1	3.076	2.417	7.601
3	3.000	2.7098	20.0	0	2	...	1	3.046	5.000	6.690
4	4.236	3.3944	29.4	2	4	...	0	3.351	2.405	8.003

5 rows × 41 columns

Split dataframe to train/test

We split the data into a training set and a test set so that the model can be evaluated on samples it has not seen during training. Here 90% of the rows go to training, and stratify=y keeps the class proportions the same in both sets.

# split data
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.90, shuffle=True, stratify=y, random_state=42)
# display data shapes
print(f"X_train shape {X_train.shape}")
print(f"X_test shape {X_test.shape}")
print(f"y_train shape {y_train.shape}")
print(f"y_test shape {y_test.shape}")

X_train shape (949, 41)
X_test shape (106, 41)
y_train shape (949,)
y_test shape (106,)
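
The idea behind stratify=y can be sketched in plain Python. This is only an illustration of stratification, not scikit-learn's implementation; the helper `stratified_split` and the synthetic labels are assumptions made up for this example.

```python
import random
from collections import Counter, defaultdict

def stratified_split(labels, train_size=0.9, seed=42):
    """Return (train_idx, test_idx), splitting each class separately."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        # Take the same fraction of every class for training.
        cut = round(len(indices) * train_size)
        train_idx.extend(indices[:cut])
        test_idx.extend(indices[cut:])
    return train_idx, test_idx

# Synthetic, imbalanced labels (made up; not the real class counts).
labels = ["1"] * 700 + ["2"] * 300
train_idx, test_idx = stratified_split(labels)

train_share = Counter(labels[i] for i in train_idx)["2"] / len(train_idx)
test_share = Counter(labels[i] for i in test_idx)["2"] / len(test_idx)
print(f"class '2' share in train: {train_share:.2f}, in test: {test_share:.2f}")
```

Because each class is split on its own, both sets keep the same share of the minority class, which matters on a small test set of 106 rows.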

Fit AutoML

Next we train models on the training set. The fit() method handles model training and optimization automatically. As mentioned earlier, we use Compete mode, here with a total time limit of 600 seconds.

# create automl object
automl = AutoML(total_time_limit=600, mode="Compete")
# train automl
automl.fit(X_train, y_train)

Predict

Generate predictions using the trained AutoML model on test data.

# predict with AutoML
predictions = automl.predict(X_test)
# predicted values
print(predictions)

['2' '1' '1' '1' '1' '1' '1' '1' '1' '1' '2' '1' '1' '2' '1' '1' '1' '1'
 '1' '2' '2' '2' '1' '1' '1' '2' '2' '1' '1' '2' '1' '1' '1' '2' '2' '1'
 '2' '1' '1' '1' '1' '1' '2' '1' '1' '1' '1' '1' '1' '1' '1' '2' '1' '1'
 '2' '1' '1' '1' '2' '1' '2' '2' '2' '1' '1' '2' '1' '2' '2' '1' '1' '1'
 '1' '1' '2' '1' '1' '1' '1' '1' '1' '1' '1' '2' '1' '2' '1' '1' '1' '2'
 '1' '2' '1' '2' '1' '1' '1' '1' '1' '2' '1' '1' '2' '1' '1' '1']

Compute accuracy

We compute the accuracy score by comparing the true values (y_test) with our predictions.

# compute metric
metric_accuracy = accuracy_score(y_test, predictions)
print(f"Accuracy: {metric_accuracy}")

Accuracy: 0.8301886792452831
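
Accuracy is simply the fraction of test samples predicted correctly. With 106 test rows, the reported value is consistent with 88 correct predictions (88/106 ≈ 0.8302). Below is a minimal sketch of the same calculation in plain Python; the helper name `accuracy` and the toy labels are made up for illustration.

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the prediction equals the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

# Toy labels in the same string format as the dataset (illustration only).
y_true = ["1", "1", "2", "2", "1"]
y_pred = ["1", "2", "2", "2", "1"]

print(f"Toy accuracy: {accuracy(y_true, y_pred)}")    # 4 of 5 correct
print(f"88 correct out of 106: {88 / 106:.4f}")
```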

Conclusions

Using MLJAR AutoML in biological research and diagnostics has many advantages. By automating the analysis of extensive biological and chemical data, it enables precise predictions, early detection of biological phenomena, and the development of targeted interventions. Because AutoML handles large datasets efficiently, it can find patterns and insights that older approaches might miss. As the technology develops, it will become increasingly important in biological research and diagnostics and will help improve research outcomes.

See you soon👋.