gpt-oss:120b
Average score: 9.75/10
Scored workflows: 24
We defined practical analysis workflows spanning multiple domains, then ran each workflow with AI Data Analyst using different LLM engines. On this page you can browse every workflow, open the full notebook conversations, and compare model quality in shared score tables. Overall, the results show that modern LLMs perform very well on structured data analysis tasks.
Browse the benchmark workflows by domain and open any analysis to review its prompts, conversation, outputs, and model scores.
The cards and table below compare all scored model runs across all published analysis workflows.
Average score (0-10)
Each model was scored on the same 24 workflows. Average scores across the six models, from highest to lowest: 9.75 (gpt-oss:120b), 9.58, 9.38, 8.96, 8.58, and 8.46.
This table compares model scores for each workflow. Open any score chip to jump directly to that model's conversation and review the full prompts, code, outputs, and score cards.
We tested the same step-by-step data analysis workflows across multiple LLM models and compared results using a shared scoring rubric. Across domains, most models deliver strong notebook outputs with high task completion and useful analytical reasoning. Use these examples as a reference for prompt design, model selection, and workflow quality before running similar analyses on your own data in MLJAR Studio.
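As a rough illustration of how the per-model averages above are produced, the minimal Python sketch below aggregates per-workflow scores (0-10) into a ranked summary. The model name model-b and all score lists are hypothetical placeholders, not published benchmark data; the real benchmark averages over 24 scored workflows per model.

```python
# Minimal sketch: aggregate per-workflow scores (0-10) into per-model averages.
# All score values here are illustrative placeholders, not benchmark results.
from statistics import mean

scores = {
    "gpt-oss:120b": [10, 10, 9, 10],  # one entry per scored workflow (hypothetical values)
    "model-b": [9, 10, 8, 9],         # hypothetical model name and scores
}

# Rank models by average score, highest first, as in the comparison table.
for model, workflow_scores in sorted(scores.items(), key=lambda kv: mean(kv[1]), reverse=True):
    print(f"{model}: average {mean(workflow_scores):.2f}/10 "
          f"over {len(workflow_scores)} scored workflows")
```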