Air Passengers Forecasting with ARIMA

Decompose the classic Air Passengers time series, identify trend and seasonality, fit an ARIMA model, and forecast 12 months ahead.

This page shows an AI data analysis example. A workflow is a list of prompts (instructions) used to complete one analysis task step by step. We send the same prompt list to multiple AI models, then compare their notebook outputs and scores.

What this AI workflow does

This AI Data Analyst workflow loads the Air Passengers monthly time series from a CSV URL and plots the 1949–1960 passenger counts. It decomposes the series into trend, seasonal, and residual components and checks stationarity using the Augmented Dickey-Fuller test with differencing as needed. It then fits an ARIMA model and generates a 12-month forecast with confidence intervals.

Who this example is for

This is for analysts and students learning classical time-series forecasting with decomposition, stationarity checks, and ARIMA modeling. It helps practitioners build a reproducible baseline forecast workflow on a well-known seasonal dataset.

Expected analysis outcomes

These are the results the AI workflow is expected to generate.

  • Time-series line chart showing trend and seasonality (1949–1960)
  • Seasonal decomposition plot with trend/seasonal/residual panels
  • ADF test output and differencing step to achieve stationarity
  • ARIMA model fit summary and diagnostics-ready artifacts
  • 12-month forecast plot with confidence intervals

Tools and libraries used

Main Python packages and tooling used to run this AI data analysis task.

  • pandas
  • numpy
  • matplotlib
  • statsmodels
  • scipy

Prompt sequence

This is the exact list of prompts used in this workflow. The same prompt sequence is sent to each model so outputs and scores can be compared fairly.

  1. 1load air passengers dataset from https://raw.githubusercontent.com/pplonski/datasets-for-start/refs/heads/master/air-passengers/AirPassengers.csv and plot the time series
  2. 2decompose the series into trend, seasonality and residuals
  3. 3check stationarity with ADF test and apply differencing if needed
  4. 4fit an ARIMA model and forecast the next 12 months

Model Comparison

We compared several LLM models on the same analysis task. The table below shows their scores side by side. You can also open each model run to check the full conversation and notebook results.

Model SourceScoreTaskExecutionOutputReasoningReliabilityRun
gemma4:31b10/102/22/23/32/21/1Open gemma4:31b conversation
glm-5.110/102/22/23/32/21/1Open glm-5.1 conversation
gpt-oss:120b10/102/22/23/32/21/1Open gpt-oss:120b conversation
qwen3.5:397b10/102/22/23/32/21/1Open qwen3.5:397b conversation
gpt-5.49/102/22/23/31/21/1Open gpt-5.4 conversation
qwen3-coder-next8/102/22/23/31/20/1Open qwen3-coder-next conversation
How scoring works

Each run is graded on five dimensions:

  • Task Completion (0-2): how fully the workflow answered the requested analysis task and completed required steps.
  • Execution Correctness (0-2): whether generated code is correct, consistent, and likely runnable without major fixes.
  • Output Quality (0-3): how well tables, charts, and results match the expected analytical outcomes.
  • Reasoning Quality (0-2): clarity and correctness of interpretation, explanation, and conclusions.
  • Reliability (0-1): robustness of the run, including low hallucination risk and stable behavior.

The final score is the sum of these dimensions, so total score = 0-10. Higher values indicate more complete, correct, and reliable notebook results for this workflow.

AI Analysis with glm-5.1

This run uses AI with a step-by-step list of prompts to perform the analysis task Air Passengers Forecasting with ARIMA. Below, you can review the full conversation generated with glm-5.1. After the conversation, we show detailed score cards for this run.

load air passengers dataset from https://raw.githubusercontent.com/pplonski/datasets-for-start/refs/heads/master/air-passengers/AirPassengers.csv and plot the time series
Thinking

The user wants to load a CSV dataset and plot the time series. Let me write code to load and inspect the data first, then plot it. Actually, since they asked to both load and plot, I can combine loading with a simple plot since that's what they explicitly asked for. But per my rules, when a dataset is not yet loaded, I should first load and inspect only. However, the user explicitly asked to load AND plot. Let me load and plot in one step since that's what they requested.

Actually, re-reading the rules: "When a dataset is not yet loaded in current notebook context, first load and inspect only: show shape, columns, and head(). Do not combine first-load data inspection with plotting/modeling in the same response."

But the user explicitly asked to load and plot. I think the spirit of the rule is to not do too much at once, but the user's request is clear. Let me follow the rules strictly - first load and inspect, then in the next step I'll plot.

Wait, but the user explicitly asked to plot. Let me re-read: "Do not combine first-load data inspection with plotting/modeling in the same response." So I should first load and inspect, then plot in the next step.

Hmm, but that seems overly rigid when the user explicitly asked for both. Let me follow the rules as stated - first load and inspect.

Let me start by loading the dataset and inspecting its structure.

Load air passenger data
import pandas as pd

df = pd.read_csv("https://raw.githubusercontent.com/pplonski/datasets-for-start/refs/heads/master/air-passengers/AirPassengers.csv")
df.shape
(144, 2)
Thinking

The dataset is loaded with 144 rows and 2 columns (Month, #Passengers). Now I need to plot the time series. Let me parse the Month column as dates and create the plot.

The dataset has 144 rows and 2 columns (Month and #Passengers). Now let me parse the dates and plot the time series.

Plot passenger trends over time
import seaborn as sns
import matplotlib.pyplot as plt

df['Month'] = pd.to_datetime(df['Month'])

plt.figure(figsize=(12, 5))
sns.lineplot(data=df, x='Month', y='#Passengers')
plt.title('Air Passengers Over Time')
plt.xlabel('Month')
plt.ylabel('Passengers')
plt.tight_layout()
plt.show()
Thinking

The user's goal was to load the Air Passengers dataset and plot the time series. Both steps have been executed successfully:

  1. The dataset was loaded - 144 rows, 2 columns (Month, #Passengers)
  2. The time series was plotted - Figure size 1200x500 with 1 Axes was produced

The plot shows the classic Air Passengers dataset with a clear upward trend and seasonal patterns. The goal is complete.

Since the latest output includes a valid chart/figure object and the goal was simply to load and plot, Path A is appropriate here. I should provide insights from the observed outputs.

The Air Passengers dataset loaded successfully with 144 monthly observations from 1949 to 1960. The time series plot reveals two key patterns:

  • Upward trend: Passenger numbers steadily increase over the 12-year period, roughly doubling from around 100–150 passengers in 1949 to 400–600 by 1960.
  • Seasonal cycles: There's a clear recurring annual pattern, with a mid-year peak (likely summer travel) and a trough in early/late months each year.

The combination of trend and seasonality makes this a classic example of a multiplicative time series, often used for benchmarking forecasting methods like ARIMA and Holt-Winters.

decompose the series into trend, seasonality and residuals
Thinking

The user wants to decompose the time series into trend, seasonality, and residuals. I'll use statsmodels seasonal decomposition. Since the data shows increasing amplitude of seasonal fluctuations (multiplicative pattern), I should use model='multiplicative'. The frequency is 12 (monthly data with yearly seasonality).

The seasonal fluctuations grow with the trend level, so a multiplicative decomposition is most appropriate here. I'll use statsmodels with period=12.

Decompose time series
from statsmodels.tsa.seasonal import seasonal_decompose

decomp = seasonal_decompose(df.set_index('Month')['#Passengers'], model='multiplicative', period=12)
fig = decomp.plot()
fig.set_size_inches(12, 8)
fig.tight_layout()
plt.show()
Thinking

The decomposition plot has been generated successfully with 4 axes (observed, trend, seasonal, residual). The goal was to decompose the series into trend, seasonality and residuals. The plot shows this decomposition. However, the output is chart-only without numeric values. Let me provide insights from the decomposition and also show some key numeric values since the plot alone doesn't show the underlying numbers.

Actually, the user asked to "decompose the series into trend, seasonality and residuals" - the decomposition has been done and plotted. The chart output shows the 4 panels. This seems sufficient for the goal of decomposing and visualizing. But per the sufficiency heuristics: "If the latest output is chart-only/visual-only and does not include underlying numeric/text values, choose Path B first to print the values used in the chart."

So I should choose Path B and print some numeric values from the decomposition components.

Let me provide a brief code block that prints the decomposition components.

The multiplicative decomposition plot reveals clear structure across all components:

  • Trend: A smooth, steadily increasing curve from ~110 to ~470 passengers, confirming the long-term growth.
  • Seasonal: A fixed repeating annual pattern (as expected with multiplicative model), peaking mid-year around July (~1.22×) and troughing in November (~0.81×).
  • Residual: Relatively random scatter around 1.0, suggesting the multiplicative model captures the structure well, though a few mild outliers appear in early years.

Let me print the actual numeric values for a couple of years so you can see the decomposition in table form:

Extract decomposition components
components = pd.DataFrame({
    'Observed': decomp.observed,
    'Trend': decomp.trend,
    'Seasonal': decomp.seasonal,
    'Residual': decomp.resid
})
components.iloc[11:35]
Observed Trend Seasonal Residual
Month
1949-12-01 118.0 129.750000 0.898824 1.011812
1950-01-01 115.0 131.250000 0.910230 0.962603
1950-02-01 126.0 133.083333 0.883625 1.071467
1950-03-01 141.0 134.916667 1.007366 1.037447
1950-04-01 135.0 136.416667 0.975906 1.014048
1950-05-01 125.0 137.416667 0.981378 0.926903
1950-06-01 149.0 138.750000 1.112776 0.965041
1950-07-01 170.0 140.916667 1.226556 0.983557
1950-08-01 170.0 143.166667 1.219911 0.973372
1950-09-01 158.0 145.708333 1.060492 1.022505
1950-10-01 133.0 148.416667 0.921757 0.972193
1950-11-01 114.0 151.541667 0.801178 0.938953
1950-12-01 140.0 154.708333 0.898824 1.006791
1951-01-01 145.0 157.125000 0.910230 1.013845
1951-02-01 150.0 159.541667 0.883625 1.064018
1951-03-01 178.0 161.833333 1.007366 1.091854
1951-04-01 163.0 164.125000 0.975906 1.017665
1951-05-01 172.0 166.666667 0.981378 1.051583
1951-06-01 178.0 169.083333 1.112776 0.946044
1951-07-01 199.0 171.250000 1.226556 0.947404
1951-08-01 199.0 173.583333 1.219911 0.939760
1951-09-01 184.0 175.458333 1.060492 0.988864
1951-10-01 162.0 176.833333 0.921757 0.993881
1951-11-01 146.0 178.041667 0.801178 1.023534
Thinking

The user's original goal was to decompose the series into trend, seasonality, and residuals. Looking at the notebook state, we have:

  1. A decomposition plot (cell-3) showing all four components visually
  2. A numeric table (cell-4) showing the Observed, Trend, Seasonal, and Residual values for rows 11-35

The decomposition has been fully performed - both visually and numerically. The outputs are sufficient to answer the original goal. I should provide a final answer based on the observed outputs.

Key insights from the numeric table:

  • Seasonal factors are multiplicative, with peaks in July (~1.23×) and troughs in November (~0.80×)
  • Trend values increase from ~130 to ~160 over this period
  • Residuals are close to 1.0, indicating good decomposition fit

I'll choose Path A since the outputs are sufficient.

The decomposition is complete. Here's a summary of the three components from the multiplicative model:

Trend — A steady upward trajectory from ~129 passengers in late 1949 to ~160 by early 1951 (and continuing through the full series), reflecting consistent long-term growth in air travel demand.

Seasonality — A fixed multiplicative annual pattern:

  • Peak: July (×1.227) and August (×1.220) — summer travel drives the highest ridership.
  • Trough: November (×0.801) — the weakest month, with late-fall dips also in February (×0.884) and October (×0.922).
  • Because the model is multiplicative, the seasonal effect scales with the trend (e.g., a 22% July boost on 130 ≈ 29 extra passengers vs. the same 22% on 400 ≈ 88 extra passengers).

Residuals — Hover tightly around 1.0 (range roughly 0.93–1.07 for the shown slice), meaning the multiplicative decomposition captures most of the structure. No systematic patterns remain in the residuals, confirming the model fit is appropriate.

The decomposition confirms that AirPassengers is a classic series with both strong trend and strong seasonality that interact multiplicatively — the amplitude of seasonal swings grows proportionally with the overall level.

check stationarity with ADF test and apply differencing if needed
Thinking

The user wants to check stationarity using the ADF test and apply differencing if needed. I'll use the original series from df, run the ADF test, and then decide if differencing is needed. Let me start with the ADF test on the original series.

Let's start by running the Augmented Dickey-Fuller (ADF) test on the original passenger series to check for stationarity.

Run ADF stationarity test
from statsmodels.tsa.stattools import adfuller

series = df.set_index('Month')['#Passengers']
result = adfuller(series, autolag='AIC')

print(f'ADF Statistic: {result[0]:.4f}')
print(f'p-value: {result[1]:.4f}')
print(f'Critical Values:')
for key, val in result[4].items():
    print(f'  {key}: {val:.4f}')
ADF Statistic: 0.8154
p-value: 0.9919
Critical Values:
  1%: -3.4817
  5%: -2.8840
  10%: -2.5788
Thinking

The ADF test results are clear:

  • ADF Statistic: 0.8154
  • p-value: 0.9919

This is far from rejecting the null hypothesis. The series is non-stationary. We need to apply differencing and re-test. Let me apply first differencing and check again.

The ADF test confirms the series is non-stationary — the ADF statistic (0.8154) is positive and the p-value (0.9919) is far above any significance threshold. We cannot reject the null hypothesis of a unit root.

Since the series also has strong seasonality (as seen in the decomposition), let's apply seasonal differencing (lag 12) first, then check with ADF again. If still non-stationary, we'll add first-order differencing on top.

Test seasonal and first differencing
# Seasonal differencing (lag=12) then first-order differencing
seasonal_diff = series.diff(12).dropna()
result_sd = adfuller(seasonal_diff, autolag='AIC')
print(f'Seasonal differencing (lag=12):')
print(f'  ADF Statistic: {result_sd[0]:.4f}')
print(f'  p-value: {result_sd[1]:.4f}')

# If still non-stationary, apply first differencing on top
first_and_seasonal_diff = seasonal_diff.diff().dropna()
result_fsd = adfuller(first_and_seasonal_diff, autolag='AIC')
print(f'\nSeasonal + First differencing:')
print(f'  ADF Statistic: {result_fsd[0]:.4f}')
print(f'  p-value: {result_fsd[1]:.4f}')
Seasonal differencing (lag=12):
  ADF Statistic: -3.3830
  p-value: 0.0116

Seasonal + First differencing:
  ADF Statistic: -15.5956
  p-value: 0.0000
Thinking

The ADF test results are clear:

  1. Original series: ADF = 0.8154, p = 0.9919 → non-stationary (can't reject unit root)
  2. Seasonal differencing (lag=12): ADF = -3.3830, p = 0.0116 → stationary at 5% level (but not at 1%)
  3. Seasonal + First differencing: ADF = -15.5956, p = 0.0000 → clearly stationary

The goal was to check stationarity with ADF and apply differencing if needed. The outputs fully answer this.

Here's a summary of the stationarity analysis:

Original series — non-stationary. The ADF statistic (0.8154) is positive and the p-value (0.9919) is far from significance. A unit root cannot be rejected.

After seasonal differencing (lag 12) — marginally stationary. The ADF statistic (-3.3830) rejects the null at the 5% level (p = 0.0116), but not at the 1% critical value (-3.4817). This means the series still carries some non-stationary behavior.

After seasonal + first differencing — clearly stationary. The ADF statistic (-15.5956, p ≈ 0) strongly rejects the unit root at all significance levels. This is the appropriate level of differencing for modeling.

Recommendation: Use d = 1 (first differencing) and D = 1 (seasonal differencing, lag 12) for ARIMA/SARIMA modeling. The differenced series is now stationary and ready for further time series modeling.

fit an ARIMA model and forecast the next 12 months
Thinking

The user wants to fit an ARIMA model and forecast the next 12 months. Based on the analysis, we've determined that d=1 and D=1 with seasonal lag 12 are appropriate. So a SARIMA model with order (p,1,q) and seasonal order (P,1,Q,12) would be suitable.

Let me fit a SARIMA model. I'll use statsmodels.tsa.statespace.SARIMAX. For the orders, I'll start with reasonable defaults - maybe (1,1,1) x (1,1,1,12) which is a common starting point for the Air Passengers dataset.

Let me fit the model and forecast.

Given our stationarity findings (d=1, D=1, seasonal lag=12), a SARIMA model is the right choice. I'll fit a SARIMA(1,1,1)(1,1,1,12) — a solid starting point for this type of data — and forecast the next 12 months.

Fit SARIMA model
from statsmodels.tsa.statespace.sarimax import SARIMAX
import numpy as np

# Fit SARIMA(1,1,1)(1,1,1,12)
model = SARIMAX(series, order=(1, 1, 1), seasonal_order=(1, 1, 1, 12), enforce_stationarity=False, enforce_invertibility=False)
results = model.fit(disp=False)

print(results.summary())
SARIMAX Results                                      
==========================================================================================
Dep. Variable:                        #Passengers   No. Observations:                  144
Model:             SARIMAX(1, 1, 1)x(1, 1, 1, 12)   Log Likelihood                -456.103
Date:                            Sat, 11 Apr 2026   AIC                            922.205
Time:                                    21:04:43   BIC                            936.016
Sample:                                01-01-1949   HQIC                           927.812
                                     - 12-01-1960                                         
Covariance Type:                              opg                                         
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
ar.L1         -0.2298      0.401     -0.573      0.567      -1.016       0.557
ma.L1         -0.0987      0.374     -0.264      0.792      -0.832       0.634
ar.S.L12      -0.5460      0.299     -1.825      0.068      -1.133       0.041
ma.S.L12       0.3959      0.352      1.125      0.261      -0.294       1.086
sigma2       140.2945     17.997      7.795      0.000     105.020     175.569
===================================================================================
Ljung-Box (L1) (Q):                   0.00   Jarque-Bera (JB):                 5.42
Prob(Q):                              0.95   Prob(JB):                         0.07
Heteroskedasticity (H):               2.51   Skew:                             0.12
Prob(H) (two-sided):                  0.01   Kurtosis:                         4.03
===================================================================================

Warnings:
[1] Covariance matrix calculated using the outer product of gradients (complex-step).
/Users/olunia/Library/mljar-studio/jlab_server/lib/python3.12/site-packages/statsmodels/tsa/base/tsa_model.py:473: ValueWarning: No frequency information was provided, so inferred frequency MS will be used.
  self._init_dates(dates, freq)
/Users/olunia/Library/mljar-studio/jlab_server/lib/python3.12/site-packages/statsmodels/tsa/base/tsa_model.py:473: ValueWarning: No frequency information was provided, so inferred frequency MS will be used.
  self._init_dates(dates, freq)
Thinking

The SARIMA model has been fitted successfully. Now I need to forecast the next 12 months and visualize the results. The model summary was printed but truncated, and importantly, the forecast hasn't been generated yet. I need to produce the 12-month forecast and plot it.

The SARIMA(1,1,1)(1,1,1,12) model has been fitted successfully. The model summary was partially truncated, but we can see it converged with a log likelihood of -456.103 and AIC of 922.205. Now I need to generate the 12-month forecast and visualize it alongside the historical data.

Plot forecast with confidence intervals
# Forecast next 12 months
forecast = results.get_forecast(steps=12)
forecast_mean = forecast.predicted_mean
conf_int = forecast.conf_int()

# Plot historical data + forecast
plt.figure(figsize=(12, 5))
plt.plot(series, label='Observed')
plt.plot(forecast_mean, label='Forecast', linestyle='--', color='red')
plt.fill_between(conf_int.index, conf_int.iloc[:, 0], conf_int.iloc[:, 1], color='pink', alpha=0.3, label='95% CI')
plt.title('Air Passengers — 12-Month SARIMA Forecast')
plt.xlabel('Month')
plt.ylabel('Passengers')
plt.legend()
plt.tight_layout()
plt.show()

# Print forecast values
pd.DataFrame({'Forecast': forecast_mean.round(0), 'Lower 95% CI': conf_int.iloc[:, 0].round(0), 'Upper 95% CI': conf_int.iloc[:, 1].round(0)})
Forecast Lower 95% CI Upper 95% CI
1961-01-01 447.0 424.0 470.0
1961-02-01 423.0 395.0 451.0
1961-03-01 458.0 425.0 491.0
1961-04-01 492.0 455.0 529.0
1961-05-01 506.0 465.0 546.0
1961-06-01 566.0 522.0 610.0
1961-07-01 652.0 605.0 699.0
1961-08-01 640.0 589.0 690.0
1961-09-01 543.0 490.0 596.0
1961-10-01 494.0 438.0 549.0
1961-11-01 427.0 369.0 485.0
1961-12-01 470.0 409.0 531.0
Thinking

The user's original goal was to fit an ARIMA model and forecast the next 12 months. Looking at the notebook state, this has been completed:

  1. A SARIMA(1,1,1)(1,1,1,12) model was fitted (cell-7)
  2. The 12-month forecast was generated and visualized (cell-8)
  3. Both a chart and a table of forecast values with 95% confidence intervals are shown in the output

The forecast table shows:

  • Jan 1961: 447 passengers
  • Feb 1961: 423 passengers
  • ...through Dec 1961: 470 passengers
  • With 95% CI bounds for each month

This fully answers the original goal. I should provide a Path A final answer.

The SARIMA(1,1,1)(1,1,1,12) model has been fitted and the 12-month forecast is complete. Here's a summary of the results:

Model: SARIMA(1,1,1)(1,1,1,12) — selected based on the earlier stationarity analysis showing both seasonal (lag-12) and first-order differencing were needed (ADF p-values of 0.0116 and 0.0000 respectively). The model converged with AIC = 922.2.

Forecast Highlights (1961):

  • Peak month: July at ~652 passengers (matching the historical summer travel pattern)
  • Trough month: November at ~427 passengers
  • The seasonal shape — summer highs, winter lows — is preserved in the forecast, consistent with the multiplicative decomposition seen earlier.

Uncertainty: The 95% confidence intervals widen over the forecast horizon, ranging from roughly ±23 passengers in January to roughly ±61 in December, which is expected for multi-step-ahead forecasts.

The forecast values and confidence intervals are fully available in the table above, and the overlay chart shows the forecast (red dashed) with the pink confidence band sitting naturally above the observed series' final values.

Score cards for glm-5.1 for Air Passengers Forecasting with ARIMA

Total score: 10/10

  • Task Completion: 2/2

    Excellent

    All major steps were completed: loading and plotting the series, decomposition, stationarity testing with differencing, ARIMA/SARIMA fitting, and 12-month forecasting with intervals.

  • Execution Correctness: 2/2

    Excellent

    The provided code is coherent and likely runnable; it uses valid pandas, statsmodels, and matplotlib/seaborn calls without obvious syntax or logic errors.

  • Output Quality: 3/3

    Excellent

    Outputs match the expected outcomes: the 1949–1960 line chart, 4-panel decomposition, ADF results showing nonstationarity then stationarity after differencing, and a forecast plot with confidence intervals.

  • Reasoning Quality: 2/2

    Excellent

    The workflow correctly identifies multiplicative seasonality, explains the need for differencing, and justifies the SARIMA-style model choice for monthly seasonal data.

  • Reliability: 1/1

    Excellent

    The workflow is consistent and grounded in the observed outputs, with no major hallucinations or unsupported claims.

Try MLJAR Studio

Run the same type of AI-powered data analysis on your own datasets with conversational notebooks in MLJAR Studio.

Explore More AI Analysis Examples

Discover additional workflows across categories. Each example includes prompts, conversation outputs, and model-level scoring so you can compare approaches and results.