gpt-oss:120b
Average score: 9.75/10
Scored workflows: 24
We defined practical analysis workflows spanning multiple domains, then ran each workflow with AI Data Analyst using different LLM engines. On this page you can browse every workflow, open the full notebook conversations, and compare model quality in shared score tables. Overall, the results show that modern LLMs perform very well on structured data analysis tasks.
Browse the benchmark workflows by domain and open any analysis to review its prompts, conversation, outputs, and model scores.
The cards and table below compare all scored model runs across all published analysis workflows.
Average score (0-10)
Each model was scored on the same 24 workflows. Average scores across the six models, from highest to lowest: 9.75 (gpt-oss:120b), 9.58, 9.38, 8.96, 8.58, and 8.46.
This table compares model scores for each workflow. Open any score chip to jump directly to that model's conversation and review the full prompts, code, outputs, and score cards.
We tested the same step-by-step data analysis workflows across multiple LLM models and compared results using a shared scoring rubric. Across domains, most models deliver strong notebook outputs with high task completion and useful analytical reasoning. Use these examples as a reference for prompt design, model selection, and workflow quality before running similar analyses on your own data in MLJAR Studio.
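As a rough illustration of how the per-model averages above are produced, the minimal Python sketch below aggregates per-workflow scores (0-10) into a ranked summary. The model name model-b and all score lists are hypothetical placeholders, not published benchmark data; the real benchmark averages over 24 scored workflows per model.

```python
# Minimal sketch: aggregate per-workflow scores (0-10) into per-model averages.
# All score values here are illustrative placeholders, not benchmark results.
from statistics import mean

scores = {
    "gpt-oss:120b": [10, 10, 9, 10],  # one entry per scored workflow (hypothetical values)
    "model-b": [9, 10, 8, 9],         # hypothetical model name and scores
}

# Rank models by average score, highest first, as in the comparison table.
for model, workflow_scores in sorted(scores.items(), key=lambda kv: mean(kv[1]), reverse=True):
    print(f"{model}: average {mean(workflow_scores):.2f}/10 "
          f"over {len(workflow_scores)} scored workflows")
```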