Healthcare

AutoML for Breast Cancer Variant Pathogenicity Prediction: Optimizing Dataset Design for Genomic AI

  • machine learning
  • artificial intelligence
  • AutoML
  • automated machine learning
  • breast cancer
  • genomic medicine
  • pathogenicity prediction
  • variant classification
  • precision oncology
  • clinical AI
  • genomics
  • SHAP interpretability
  • dataset optimization

MLJAR tools were used in the following publication.

Leveraging AutoML to Optimize Dataset Selection for Improved Breast Cancer Variant Pathogenicity Prediction

Rahaf M. Ahmad, Noura Al Dhaheri, Mohd Saberi Mohamad, Bassam R. Ali

United Arab Emirates University, College of Medicine and Health Sciences | Multimedia University, Centre for Advanced Analytics and AI | Universitas Brawijaya

This study systematically benchmarks three leading AutoML frameworks—TPOT, H2O AutoML, and MLJAR—to determine the optimal dataset composition for breast cancer variant pathogenicity prediction. By evaluating four curated genomic datasets, the authors demonstrate that disease-specific, cancer-focused data significantly improves predictive performance, calibration, and model interpretability. The best-performing dataset achieved near-perfect AUC scores across frameworks, supported by SHAP, LIME, and permutation importance analyses highlighting biologically relevant conservation and ensemble-based features. The research establishes a scalable and clinically interpretable machine learning framework for precision oncology and genomic decision support systems.

Computational and Structural Biotechnology Journal • October 24, 2025

DOI: 10.1016/j.csbj.2025.10.052
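The study's core comparison, benchmarking candidate models by cross-validated AUC on a breast cancer dataset, can be sketched in plain scikit-learn. This is an illustrative stand-in, not the paper's pipeline: it uses scikit-learn's bundled diagnostic breast cancer dataset and three generic classifiers, since the paper's curated variant-pathogenicity datasets and the TPOT/H2O/MLJAR frameworks themselves are not reproduced here.

```python
# Illustrative sketch of AUC-based model benchmarking (assumed setup,
# not the paper's actual datasets or AutoML frameworks).
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

candidates = {
    "logistic_regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=5000)
    ),
    "random_forest": RandomForestClassifier(n_estimators=200, random_state=42),
    "gradient_boosting": GradientBoostingClassifier(random_state=42),
}

scores = {}
for name, model in candidates.items():
    # 5-fold cross-validated ROC AUC, mirroring the paper's AUC-based ranking
    scores[name] = cross_val_score(model, X, y, cv=5, scoring="roc_auc").mean()
    print(f"{name}: mean AUC = {scores[name]:.3f}")
```

In the paper this comparison is repeated across four dataset compositions, which is what lets the authors attribute performance differences to dataset design rather than to any single framework.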

Research Domains

Explore peer-reviewed and applied machine learning studies across diverse domains, including healthcare analytics, financial modeling, manufacturing optimization, and structured data classification problems.

Why Researchers and ML Engineers Choose MLJAR Studio

A private, AI-powered Python notebook designed for reproducible machine learning experiments, structured benchmarking, and applied research workflows - fully under your control.

Reproducible Machine Learning Experiments

Design structured pipelines, save experiment runs, and compare results across iterations with full transparency. Every validation setup, hyperparameter configuration, and model benchmark is recorded - making your research repeatable and defensible.

Local-First Execution & Data Control

Run all workflows directly on your machine. Sensitive datasets remain private, with no mandatory cloud uploads or external AI services required. Maintain full control over runtime environments and compliance requirements.

Autonomous Model Benchmarking & Optimization

Automatically compare candidate models, perform cross-validation, and run hyperparameter optimization while retaining full visibility into generated Python code and evaluation metrics. Accelerate experimentation without sacrificing methodological rigor.
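The idea of optimizing hyperparameters while keeping every configuration and metric inspectable can be sketched with a transparent grid search. This is a generic scikit-learn example under assumed parameters, not MLJAR Studio's own API: every tried configuration and its cross-validated AUC remains available for audit after the search.

```python
# Illustrative sketch: hyperparameter optimization with full visibility
# into every evaluated configuration (assumed grid, not MLJAR Studio's API).
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = load_breast_cancer(return_X_y=True)

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [100, 200], "max_depth": [4, None]},
    scoring="roc_auc",
    cv=5,
)
search.fit(X, y)

# cv_results_ records every configuration and its score, so the
# benchmark is repeatable and auditable rather than a black box.
for params, auc in zip(search.cv_results_["params"],
                       search.cv_results_["mean_test_score"]):
    print(params, f"AUC={auc:.3f}")
print("best:", search.best_params_)
```

Keeping the full results table, rather than only the winning model, is what makes a benchmark defensible in a research setting.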

Build Research-Grade ML Workflows Locally

Run automated model benchmarking, hyperparameter optimization, and autonomous experiments while keeping full control over your data.