AI Data Analysis Benchmarks for Time Series

We defined practical analysis workflows from multiple domains, then ran them with AI Data Analyst using different LLM engines. On this page you can browse each workflow, open full notebook conversations, and compare model quality in shared score tables. Overall, modern LLMs handle these structured data analysis tasks well: in Time Series, four of the six models tested average 9.20/10 or higher across five workflows.

Time Series Workflow Examples

Browse reproducible AI data analysis workflows in Time Series. Open any example to review prompts, conversation steps, generated code, outputs, and model-level quality scores.

Air Passengers Forecasting with ARIMA

Decompose the classic Air Passengers time series, identify trend and seasonality, fit an ARIMA model, and forecast 12 months ahead.

Open analysis →

Apple Stock Price Analysis in Python

Load Apple stock price data, compute moving averages, calculate daily returns, visualize volatility, and compute the Sharpe ratio.

Open analysis →
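
As a rough illustration of the steps this workflow names (moving averages, daily returns, Sharpe ratio), here is a minimal standard-library sketch. It is not the code the benchmarked models generated. The prices are made up, and the function names, the zero risk-free rate, and the 252-day annualization factor are assumptions chosen for the example.

```python
import math
import statistics


def moving_average(prices, window):
    """Simple moving average; positions without a full window are None."""
    return [
        None if i < window - 1 else sum(prices[i - window + 1 : i + 1]) / window
        for i in range(len(prices))
    ]


def daily_returns(prices):
    """Simple (arithmetic) return between consecutive closing prices."""
    return [curr / prev - 1.0 for prev, curr in zip(prices, prices[1:])]


def sharpe_ratio(returns, risk_free_daily=0.0, periods_per_year=252):
    """Annualized Sharpe ratio from daily returns (assumed conventions)."""
    excess = [r - risk_free_daily for r in returns]
    vol = statistics.stdev(excess)  # sample standard deviation
    if vol == 0:
        return float("nan")  # undefined when returns never vary
    return statistics.mean(excess) / vol * math.sqrt(periods_per_year)


# Made-up closing prices for illustration only (not real AAPL data).
prices = [100.0, 102.0, 101.0, 105.0, 107.0, 106.0]
ma3 = moving_average(prices, 3)
rets = daily_returns(prices)
sharpe = sharpe_ratio(rets)
```

In a notebook the same steps would more likely use pandas (`rolling`, `pct_change`); the plain-Python version just keeps the arithmetic visible.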

Bitcoin Returns and Volatility Analysis

Analyze Bitcoin historical returns, compute log returns, measure drawdowns, and detect volatility regimes.

Open analysis →
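
The quantities in this workflow can be sketched in a few lines of standard-library Python. This is an illustration under stated assumptions, not the benchmarked notebook code: the prices are synthetic, and the fixed-threshold rule on rolling standard deviation is a deliberately simple stand-in for volatility regime detection. The `window` and `threshold` values are arbitrary.

```python
import math
import statistics


def log_returns(prices):
    """Log return between consecutive prices: ln(p_t / p_{t-1})."""
    return [math.log(curr / prev) for prev, curr in zip(prices, prices[1:])]


def drawdowns(prices):
    """Fractional decline from the running peak (0.0 at a new high)."""
    peak = prices[0]
    out = []
    for p in prices:
        peak = max(peak, p)
        out.append(p / peak - 1.0)
    return out


def volatility_regimes(returns, window, threshold):
    """Label each full rolling window 'high' or 'low' by its sample std dev.

    A fixed threshold is an assumption made for this sketch only.
    """
    labels = []
    for i in range(window, len(returns) + 1):
        vol = statistics.stdev(returns[i - window : i])
        labels.append("high" if vol > threshold else "low")
    return labels


# Synthetic prices for illustration only (not real Bitcoin data).
prices = [100.0, 110.0, 99.0, 104.5, 120.0, 90.0]
lr = log_returns(prices)
dd = drawdowns(prices)
max_drawdown = min(dd)  # most negative value is the worst drawdown
regimes = volatility_regimes(lr, window=3, threshold=0.1)
```

Log returns add up across periods, so `sum(lr)` equals the log of the total price change, which is one reason they are often preferred over simple returns for this kind of analysis.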

Energy Consumption Forecasting with Prophet

Analyze hourly energy consumption data, explore daily and weekly patterns, and build a forecasting model using Prophet.

Open analysis →

Time Series Anomaly Detection in Python

Detect anomalies in a time series using rolling z-score and Isolation Forest, then visualize flagged points.

Open analysis →
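
The rolling z-score half of this workflow can be sketched with the standard library alone (Isolation Forest is left out here because it needs scikit-learn). The function name, the window of 5, the threshold of 3.0, and the synthetic series are all assumptions for the example, not values taken from the benchmark.

```python
import statistics


def rolling_zscore_anomalies(values, window=5, threshold=3.0):
    """Return indices whose z-score against the preceding window exceeds threshold."""
    flagged = []
    for i in range(window, len(values)):
        history = values[i - window : i]  # excludes the point being tested
        mean = statistics.mean(history)
        std = statistics.stdev(history)
        if std == 0:
            continue  # flat window: z-score undefined, skip the point
        z = (values[i] - mean) / std
        if abs(z) > threshold:
            flagged.append(i)
    return flagged


# Synthetic series with one obvious spike at index 6.
series = [10.0, 10.2, 9.9, 10.1, 10.0, 10.3, 25.0, 10.1, 9.8, 10.2]
anomalies = rolling_zscore_anomalies(series, window=5, threshold=3.0)
print(anomalies)  # prints [6]
```

Scoring each point against the window that precedes it keeps the spike from inflating its own baseline. The spike does inflate the standard deviation of the following windows, which is a known limitation of this simple approach.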

Model Comparison for Time Series

Compare LLM performance across workflows in this category. Open any score chip to jump directly to that model run and inspect the full conversation and notebook output.

Model	Average score (0-10)	Scored workflows (n)
gpt-5.4	9.60	5
gpt-oss:120b	9.60	5
gemma4:31b	9.20	5
glm-5.1	9.20	5
qwen3.5:397b	8.00	5
qwen3-coder-next	6.80	5

Detailed Workflow Comparison Table for Time Series

This table compares model scores for each workflow in Time Series. Open any score chip to jump directly to the selected model conversation and review full prompts, code, outputs, and score cards.

Workflow	Slug	gemma4:31b	glm-5.1	gpt-5.4	gpt-oss:120b	qwen3-coder-next	qwen3.5:397b
Air Passengers Forecasting with ARIMA	air-passengers-forecast	10.0/10	10.0/10	10.0/10	10.0/10	8.0/10	8.0/10
Apple Stock Price Analysis in Python	stock-price-analysis	10.0/10	10.0/10	10.0/10	10.0/10	6.0/10	7.0/10
Bitcoin Returns and Volatility Analysis	crypto-returns-analysis	10.0/10	6.0/10	10.0/10	10.0/10	10.0/10	5.0/10
Energy Consumption Forecasting with Prophet	energy-consumption-forecast	10.0/10	10.0/10	10.0/10	10.0/10	3.0/10	10.0/10
Time Series Anomaly Detection in Python	anomaly-detection	6.0/10	10.0/10	8.0/10	8.0/10	7.0/10	10.0/10
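
The per-model averages reported earlier on this page appear to be plain unweighted means of the five workflow scores in this table. The short sketch below reproduces them from the table values. The dictionary layout and variable names are illustrative, and the assumption of equal weights is inferred from the numbers, not stated by the benchmark.

```python
from statistics import mean

# Scores copied from the table, in workflow order:
# ARIMA, stock prices, Bitcoin, Prophet, anomaly detection.
scores = {
    "gemma4:31b": [10.0, 10.0, 10.0, 10.0, 6.0],
    "glm-5.1": [10.0, 10.0, 6.0, 10.0, 10.0],
    "gpt-5.4": [10.0, 10.0, 10.0, 10.0, 8.0],
    "gpt-oss:120b": [10.0, 10.0, 10.0, 10.0, 8.0],
    "qwen3-coder-next": [8.0, 6.0, 10.0, 3.0, 7.0],
    "qwen3.5:397b": [8.0, 7.0, 5.0, 10.0, 10.0],
}

# Unweighted mean per model (assumed aggregation), rounded to two decimals.
averages = {model: round(mean(vals), 2) for model, vals in scores.items()}

# Highest average first; ties broken alphabetically by model name.
ranking = sorted(averages.items(), key=lambda kv: (-kv[1], kv[0]))

for model, avg in ranking:
    print(f"{model}\t{avg:.2f}")
```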

What This Benchmark Shows

We ran the same step-by-step data analysis workflows on multiple LLMs and scored the results with a shared rubric. In Time Series, four of the six models average 9.20/10 or higher, while scores on individual workflows range from 3.0/10 to 10.0/10, so a model with a strong average can still underperform on a specific task. Use these examples as a reference for prompt design, model selection, and workflow quality before running similar analyses on your own data in MLJAR Studio.

Start using AI for Time Series

MLJAR Studio helps you analyze data with AI, run machine learning workflows, and build reproducible notebook-based results on your own computer.

Runs locally • Supports local LLMs
