AI and AutoML for Tabular Data: BERT-Sort Semantic Ordinal Encoding with Large Language Models
- AI for tabular data
- AutoML feature engineering
- semantic ordinal encoding
- BERT for structured data
- Large Language Models in AutoML
- categorical feature encoding
- machine learning preprocessing
- MLJAR AutoML evaluation
- LLM feature representation
- automated data preparation
MLJAR tools were used in the following publication.
BERT-Sort: A Zero-shot MLM Semantic Encoder on Ordinal Features for AutoML
Mehdi Bahrami, Wei-Peng Chen, Lei Liu, Mukul Prasad
Fujitsu Research of America, Sunnyvale, California, USA
This research introduces BERT-Sort, a novel AI-powered semantic encoding framework that improves ordinal categorical feature handling in AutoML systems for tabular data. By leveraging zero-shot Masked Language Models (BERT, RoBERTa, XLM) to capture semantic relationships between ordinal values, the method significantly outperforms traditional alphabetical encoders such as OrdinalEncoder. Evaluated across 10 public datasets and 42 ordinal features, BERT-Sort improves ordinal accuracy by up to 27% and enhances downstream machine learning performance across leading AutoML platforms including MLJAR, H2O, FLAML, and AutoGluon. The study demonstrates how Large Language Models (LLMs) can enhance feature engineering, automated preprocessing, and end-to-end AI model performance in structured data pipelines.
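The core problem BERT-Sort addresses can be seen in a few lines: a default ordinal encoder assigns integer codes by alphabetical order, which scrambles the true ranking of values like "low" < "medium" < "high". The sketch below (illustrative only, not the authors' implementation; the `ratings` values are made up) contrasts alphabetical codes with the semantic ordering that BERT-Sort recovers automatically from a masked language model:

```python
# Why alphabetical ordinal encoding scrambles semantics (illustrative example).
ratings = ["low", "medium", "high", "medium", "low"]

# Alphabetical encoding, as a default ordinal encoder effectively does:
alpha_order = sorted(set(ratings))           # ['high', 'low', 'medium']
alpha_codes = {v: i for i, v in enumerate(alpha_order)}

# Semantic encoding, the ordering BERT-Sort aims to recover zero-shot:
semantic_order = ["low", "medium", "high"]   # assumed ground-truth ranking
semantic_codes = {v: i for i, v in enumerate(semantic_order)}

print([alpha_codes[v] for v in ratings])     # [1, 2, 0, 2, 1] - order scrambled
print([semantic_codes[v] for v in ratings])  # [0, 1, 2, 1, 0] - order preserved
```

With alphabetical codes, "high" sorts below "low", so any model that exploits the numeric ordering of the feature learns an inverted relationship; the semantic codes keep the monotonic structure intact.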
AutoML Conference 2022 • May 9, 2022
Research Domains
Explore peer-reviewed and applied machine learning studies across diverse domains, including healthcare analytics, financial modeling, manufacturing optimization, and structured data classification problems.
Why Researchers and ML Engineers Choose MLJAR Studio
A private, AI-powered Python notebook designed for reproducible machine learning experiments, structured benchmarking, and applied research workflows - fully under your control.
Reproducible Machine Learning Experiments
Design structured pipelines, save experiment runs, and compare results across iterations with full transparency. Every validation setup, hyperparameter configuration, and model benchmark is recorded - making your research repeatable and defensible.
Local-First Execution & Data Control
Run all workflows directly on your machine. Sensitive datasets remain private, with no mandatory cloud uploads or external AI services required. Maintain full control over runtime environments and compliance requirements.
Autonomous Model Benchmarking & Optimization
Automatically compare candidate models, perform cross-validation, and run hyperparameter optimization while retaining full visibility into generated Python code and evaluation metrics. Accelerate experimentation without sacrificing methodological rigor.
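The reproducibility point above hinges on recording the exact validation setup. As a minimal sketch (not MLJAR Studio's internal code), a seeded k-fold split shows how fixing the random seed makes the same folds, and therefore the same benchmark, recoverable across runs:

```python
import random

def kfold_indices(n_samples, k, seed=42):
    """Yield (train_idx, valid_idx) pairs for k-fold cross-validation.

    A fixed seed makes the shuffle, and thus every fold, reproducible.
    """
    rng = random.Random(seed)
    idx = list(range(n_samples))
    rng.shuffle(idx)
    fold_size = n_samples // k
    for f in range(k):
        valid = idx[f * fold_size:(f + 1) * fold_size]
        train = [i for i in idx if i not in set(valid)]
        yield train, valid

# Same seed -> identical folds on every run, so results can be re-verified.
folds = list(kfold_indices(10, k=5))
print(len(folds))  # 5
```

Logging the seed and fold assignments alongside hyperparameters and metrics is what makes an experiment comparison defensible rather than anecdotal.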
Build Research-Grade ML Workflows Locally
Run automated model benchmarking, hyperparameter optimization, and autonomous experiments while keeping full control over your data.