Air Passengers Forecasting with ARIMA

Decompose the classic Air Passengers time series, identify trend and seasonality, fit an ARIMA model, and forecast 12 months ahead.

This page shows an AI data analysis example. A workflow is a list of prompts (instructions) used to complete one analysis task step by step. We send the same prompt list to multiple AI models, then compare their notebook outputs and scores.

What this AI workflow does

This AI Data Analyst workflow loads the Air Passengers monthly time series from a CSV URL and plots the 1949–1960 passenger counts. It decomposes the series into trend, seasonal, and residual components and checks stationarity using the Augmented Dickey-Fuller test with differencing as needed. It then fits an ARIMA model and generates a 12-month forecast with confidence intervals.

Who this example is for

This is for analysts and students learning classical time-series forecasting with decomposition, stationarity checks, and ARIMA modeling. It helps practitioners build a reproducible baseline forecast workflow on a well-known seasonal dataset.

Expected analysis outcomes

These are the results the AI workflow is expected to generate.

  • Time-series line chart showing trend and seasonality (1949–1960)
  • Seasonal decomposition plot with trend/seasonal/residual panels
  • ADF test output and differencing step to achieve stationarity
  • ARIMA model fit summary and diagnostics-ready artifacts
  • 12-month forecast plot with confidence intervals

Tools and libraries used

Main Python packages and tooling used to run this AI data analysis task.

  • pandas
  • numpy
  • matplotlib
  • statsmodels
  • scipy

Prompt sequence

This is the exact list of prompts used in this workflow. The same prompt sequence is sent to each model so outputs and scores can be compared fairly.

  1. load air passengers dataset from https://raw.githubusercontent.com/pplonski/datasets-for-start/refs/heads/master/air-passengers/AirPassengers.csv and plot the time series
  2. decompose the series into trend, seasonality and residuals
  3. check stationarity with ADF test and apply differencing if needed
  4. fit an ARIMA model and forecast the next 12 months

Model Comparison

We compared several LLM models on the same analysis task. The table below shows their scores side by side. You can also open each model run to check the full conversation and notebook results.

| Model Source | Score | Task | Execution | Output | Reasoning | Reliability | Run |
| gemma4:31b | 10/10 | 2/2 | 2/2 | 3/3 | 2/2 | 1/1 | Open gemma4:31b conversation |
| glm-5.1 | 10/10 | 2/2 | 2/2 | 3/3 | 2/2 | 1/1 | Open glm-5.1 conversation |
| gpt-oss:120b | 10/10 | 2/2 | 2/2 | 3/3 | 2/2 | 1/1 | Open gpt-oss:120b conversation |
| qwen3.5:397b | 10/10 | 2/2 | 2/2 | 3/3 | 2/2 | 1/1 | Open qwen3.5:397b conversation |
| gpt-5.4 | 9/10 | 2/2 | 2/2 | 3/3 | 1/2 | 1/1 | Open gpt-5.4 conversation |
| qwen3-coder-next | 8/10 | 2/2 | 2/2 | 3/3 | 1/2 | 0/1 | Open qwen3-coder-next conversation |

How scoring works

Each run is graded on five dimensions:

  • Task Completion (0-2): how fully the workflow answered the requested analysis task and completed required steps.
  • Execution Correctness (0-2): whether generated code is correct, consistent, and likely runnable without major fixes.
  • Output Quality (0-3): how well tables, charts, and results match the expected analytical outcomes.
  • Reasoning Quality (0-2): clarity and correctness of interpretation, explanation, and conclusions.
  • Reliability (0-1): robustness of the run, including low hallucination risk and stable behavior.

The final score is the sum of these dimensions, so total score = 0-10. Higher values indicate more complete, correct, and reliable notebook results for this workflow.
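As a concrete instance of that arithmetic, the five dimensions can be summed directly; a trivial sketch using the gpt-5.4 scores from the comparison table (the dictionary keys are illustrative names, not part of the grading system):

```python
# gpt-5.4's per-dimension scores from the comparison table.
scores = {
    "task_completion": 2,        # out of 2
    "execution_correctness": 2,  # out of 2
    "output_quality": 3,         # out of 3
    "reasoning_quality": 1,      # out of 2
    "reliability": 1,            # out of 1
}

# The total score is simply the sum of the five dimensions.
total = sum(scores.values())
print(f"{total}/10")  # 9/10
```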

AI Analysis with gpt-5.4

This run uses AI with a step-by-step list of prompts to perform the analysis task Air Passengers Forecasting with ARIMA. Below, you can review the full conversation generated with gpt-5.4. After the conversation, we show detailed score cards for this run.

load air passengers dataset from https://raw.githubusercontent.com/pplonski/datasets-for-start/refs/heads/master/air-passengers/AirPassengers.csv and plot the time series
Thinking

Planning data inspection

I need to follow an iterative process, tackling one step at a time. The user wants to load and plot data, but the developer advises starting with loading the dataset and inspecting it — specifically showing the shape, columns, and head() of the DataFrame before plotting. I think it's essential to clarify this and provide the necessary code for loading and inspecting first. I can display the shape of the DataFrame, using a single code block for clarity. I'll make sure to mention that plotting will be the next step!

Constructing data summary

I think it's better to create a dictionary summary and display it, followed by the head() of the DataFrame. Notebook rules suggest displaying multiple tables sequentially, but I should adhere to the guideline of inspecting data first by showing shape, columns, and head(). I'll load the DataFrame while parsing 'Month' as datetime. The first code snippet will be df = pd.read_csv(url, parse_dates=['Month']). Then, I’ll display the summary as a dictionary and call df.head() to finish up.

Finalizing data display

I can create a dictionary using {"shape": df.shape, "columns": df.columns.tolist()} as a single expression. It looks good to use display() for that dictionary and then call df.head() afterward. This way, I can keep it all in one code block. I shouldn't prompt a run since no outputs are available yet. I’ll mention that the next step will involve making a line plot, and then I'll keep the answer concise!

Let’s start by loading the dataset and checking that the date column looks right before plotting it.

Load and preview data
import pandas as pd
from IPython.display import display

url = "https://raw.githubusercontent.com/pplonski/datasets-for-start/refs/heads/master/air-passengers/AirPassengers.csv"

df = pd.read_csv(url, parse_dates=["Month"])

display({"shape": df.shape, "columns": df.columns.tolist()})
df.head()
{'shape': (144, 2), 'columns': ['Month', '#Passengers']}
Month #Passengers
0 1949-01-01 112
1 1949-02-01 118
2 1949-03-01 132
3 1949-04-01 129
4 1949-05-01 121

The dataset is now loaded in the notebook with the two fields needed for a time series: Month and #Passengers. The dates were parsed correctly, and the preview shows monthly observations starting in January 1949, so the next direct step is the line plot.

Plot air passengers over time
import seaborn as sns
import matplotlib.pyplot as plt

plt.figure(figsize=(10, 4))
sns.lineplot(data=df, x="Month", y="#Passengers")
plt.title("Air Passengers Over Time")
plt.xlabel("Month")
plt.ylabel("Number of Passengers")
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
Thinking

Determining final answer format

I need to make a decision based on the A/B outputs regarding the best way to present the answer. Output-1 includes both shape columns and head, while output-2 is only a figure object. Given that the latest output is chart-only, using the context from previous outputs shows that it contains both a valid chart and supporting numeric values. So, I will conclude by noting that a time-series line chart was created with the number of passengers plotted over monthly dates.

Clarifying dataset details

I might want to mention that the dataset covers 144 months, starting from January 1949. I shouldn’t infer the end date too much, but I know that 144 monthly records suggest it could end in December 1960. I’ll stick to the observed outputs showing 144 rows starting from 1949-01-01. I don't need to repeat what's already visible, but I should highlight that this is a monthly time series with one date column and one passenger-count column, which is sufficient for plotting. A line chart was created.

The data used for the plot is a monthly time series with Month as the date field and #Passengers as the passenger count field. The observed preview confirms the series begins in January 1949, and the notebook contains 144 monthly records, which is the expected structure for this dataset.

The time series plot has been created as a line chart with:

  • Month on the x-axis
  • #Passengers on the y-axis

From the loaded values shown before plotting, the chart is based on monthly passenger counts such as 112, 118, 132, 129, and 121 in the opening months, so the visualization is using the correct underlying series.

decompose the series into trend, seasonality and residuals
Thinking

Decomposing time series

I want to provide an answer based on how to decompose a time series into trend, seasonality, and residuals. I'll suggest the next code step, which involves importing seasonal_decompose and plotting the results. Since we're working with a monthly dataset, I'll set the month as an index and use the model='multiplicative' due to the increasing amplitude of seasonality in the AirPassengers dataset. I'll provide a single code block without asking for output. My final expression will just be the plotting command.

For this dataset, a multiplicative decomposition is a good default because the seasonal swings get larger as the series level rises.

Plot seasonal decomposition
from statsmodels.tsa.seasonal import seasonal_decompose
import matplotlib.pyplot as plt

ts = df.set_index("Month")["#Passengers"]

decomp = seasonal_decompose(ts, model="multiplicative", period=12)
fig = decomp.plot()
fig.set_size_inches(10, 8)
plt.tight_layout()
plt.show()
Thinking

Considering data presentation

I'm thinking I should keep one table with maybe the first 15 rows to illustrate the NaNs at the edges, which arise from the moving average and seasonal patterns. Typically, there's a default of one table per step, but it might be okay to override that here. I’ll consider using the display function with just the head for brevity, ensuring I include the trend, seasonality, residuals, and the underlying numeric values. Let's make it concise!

The decomposition figure was created successfully, so the series has already been split into observed, trend, seasonal, and residual components. The only missing piece is the underlying numeric values, since the current output is just a figure object.

Create decomposition components table
from IPython.display import display
import pandas as pd

parts = pd.DataFrame({
    "observed": decomp.observed,
    "trend": decomp.trend,
    "seasonal": decomp.seasonal,
    "resid": decomp.resid,
})

display(parts.head(15))
parts.tail(15)
observed trend seasonal resid
Month
1949-01-01 112.0 NaN 0.910230 NaN
1949-02-01 118.0 NaN 0.883625 NaN
1949-03-01 132.0 NaN 1.007366 NaN
1949-04-01 129.0 NaN 0.975906 NaN
1949-05-01 121.0 NaN 0.981378 NaN
1949-06-01 135.0 NaN 1.112776 NaN
1949-07-01 148.0 126.791667 1.226556 0.951664
1949-08-01 148.0 127.250000 1.219911 0.953401
1949-09-01 136.0 127.958333 1.060492 1.002220
1949-10-01 119.0 128.583333 0.921757 1.004028
1949-11-01 104.0 129.000000 0.801178 1.006270
1949-12-01 118.0 129.750000 0.898824 1.011812
1950-01-01 115.0 131.250000 0.910230 0.962603
1950-02-01 126.0 133.083333 0.883625 1.071467
1950-03-01 141.0 134.916667 1.007366 1.037447
observed trend seasonal resid
Month
1959-10-01 407.0 440.958333 0.921757 1.001337
1959-11-01 362.0 445.833333 0.801178 1.013461
1959-12-01 405.0 450.625000 0.898824 0.999919
1960-01-01 417.0 456.333333 0.910230 1.003928
1960-02-01 391.0 461.375000 0.883625 0.959079
1960-03-01 419.0 465.208333 1.007366 0.894086
1960-04-01 461.0 469.333333 0.975906 1.006495
1960-05-01 472.0 472.750000 0.981378 1.017359
1960-06-01 535.0 475.041667 1.112776 1.012079
1960-07-01 622.0 NaN 1.226556 NaN
1960-08-01 606.0 NaN 1.219911 NaN
1960-09-01 508.0 NaN 1.060492 NaN
1960-10-01 461.0 NaN 0.921757 NaN
1960-11-01 390.0 NaN 0.801178 NaN
1960-12-01 432.0 NaN 0.898824 NaN

Your series is decomposed into the three requested parts, and the outputs show a clear pattern:

  • Trend: the trend component rises steadily over time, from roughly 127 passengers/month in mid-1949 to about 475 passengers/month by mid-1960. This indicates strong long-term growth in air travel.
  • Seasonality: the seasonal component is stable by month and multiplicative, meaning it scales with the level of the series.
    • Strongest months are around July–August with seasonal factors near 1.22, so those months are about 22% above trend.
    • Weakest month is November with a factor near 0.80, about 20% below trend.
    • Other lower-than-trend months include January, February, October, and December.
  • Residuals: the residual component is mostly close to 1.0, which means the trend + seasonality explain most of the variation. For example, many residual values are near 0.95 to 1.02, so the unexplained noise is relatively small.
  • Edge NaNs: the missing values at the beginning and end of the trend/residual columns are expected in classical decomposition because centered moving averages cannot be computed at the boundaries.

So, in plain terms: the passenger series has a strong upward trend, repeating yearly seasonality, and comparatively modest residual noise. The multiplicative decomposition was an appropriate choice because the seasonal swings increase as the series level increases.

check stationarity with ADF test and apply differencing if needed
Thinking

Planning stationarity check

I need to check the stationarity of a time series with the ADF test and apply differencing if necessary. It seems the user has updated their goal, so I should base my approach on the current state. I’ll run the ADF test on the original series first, analyze the p-value, and if it's non-stationary, then proceed with differencing. I’ll keep it concise with one code block, possibly including a summary table, and make sure to import the right libraries first.

The decomposition already suggests the series is not stationary because it has both a strong trend and yearly seasonality. The clean next step is to confirm that with an ADF test on the original series.

Run ADF test
from statsmodels.tsa.stattools import adfuller
import pandas as pd

adf = adfuller(ts.dropna())

adf_result = pd.Series(
    {
        "adf_statistic": adf[0],
        "p_value": adf[1],
        "used_lags": adf[2],
        "n_obs": adf[3],
    }
)

crit = pd.Series(adf[4], name="critical_value")
pd.concat([adf_result, crit])
adf_statistic      0.815369
p_value            0.991880
used_lags         13.000000
n_obs            130.000000
1%                -3.481682
5%                -2.884042
10%               -2.578770
dtype: float64

The ADF output shows the series is not stationary in its current form.

  • The ADF statistic is 0.815, which is not more negative than any of the critical values.
  • The p-value is 0.9919, far above common thresholds like 0.05.

That means we fail to reject the null hypothesis of a unit root, so the original passenger series is non-stationary. This matches the earlier decomposition: there is a strong upward trend and clear yearly seasonality.

Because of that, differencing is needed. For this series, a good choice is:

  • first differencing to remove trend: y_t - y_{t-1}
  • and likely seasonal differencing at lag 12 to remove annual seasonality: y_t - y_{t-12}

So the practical conclusion is:

  • Original series: not stationary
  • Action needed: apply differencing
  • Most appropriate here: seasonal structure suggests using seasonal differencing (12), often combined with first differencing if one differencing step alone is not enough

Based on the observed output alone, the series still needs to be differenced before using methods that assume stationarity.

fit an ARIMA model and forecast the next 12 months
Thinking

Fitting ARIMA for forecasting

I need to answer the user's request to fit an ARIMA model and forecast for the next 12 months using the current notebook state. The outputs indicate that the data is nonstationary, so I should consider differencing. Since the user mentioned seasonality, proposing a seasonal ARIMA (SARIMA) model seems appropriate. I'll aim to fit a SARIMAX model on the original series while handling nonstationarity directly, but I must focus on providing the solution in a single code block.

Fitting a baseline SARIMA model

Since the user wants to work iteratively, I think it’s best to first fit a baseline SARIMA model and inspect its summary. The user's request for a forecast suggests that I can fit the model now and provide the forecast later. I’ll note that the ADF test indicates nonstationarity, so I'll use differencing terms d=1 and D=1 with a seasonal period of 12. I plan to fit the model using SARIMAX from statsmodels and show only the summary table in one code block to keep things concise.

Since the series is non-stationary and clearly seasonal, I’d use a seasonal ARIMA with one regular difference and one seasonal difference: d=1, D=1, seasonal period 12.

Here’s a good first fit to start with:

Fit SARIMAX model
from statsmodels.tsa.statespace.sarimax import SARIMAX

model = SARIMAX(
    ts,
    order=(1, 1, 1),
    seasonal_order=(1, 1, 1, 12),
    enforce_stationarity=False,
    enforce_invertibility=False
)

model_fit = model.fit(disp=False)
model_fit.summary()
/Users/olunia/Library/mljar-studio/jlab_server/lib/python3.12/site-packages/statsmodels/tsa/base/tsa_model.py:473: ValueWarning: No frequency information was provided, so inferred frequency MS will be used.
  self._init_dates(dates, freq)
SARIMAX Results
Dep. Variable: #Passengers No. Observations: 144
Model: SARIMAX(1, 1, 1)x(1, 1, 1, 12) Log Likelihood -456.103
Date: Fri, 10 Apr 2026 AIC 922.205
Time: 19:35:23 BIC 936.016
Sample: 01-01-1949 HQIC 927.812
- 12-01-1960
Covariance Type: opg
coef std err z P>|z| [0.025 0.975]
ar.L1 -0.2298 0.401 -0.573 0.567 -1.016 0.557
ma.L1 -0.0987 0.374 -0.264 0.792 -0.832 0.634
ar.S.L12 -0.5460 0.299 -1.825 0.068 -1.133 0.041
ma.S.L12 0.3959 0.352 1.125 0.261 -0.294 1.086
sigma2 140.2945 17.997 7.795 0.000 105.020 175.569
Ljung-Box (L1) (Q): 0.00 Jarque-Bera (JB): 5.42
Prob(Q): 0.95 Prob(JB): 0.07
Heteroskedasticity (H): 2.51 Skew: 0.12
Prob(H) (two-sided): 0.01 Kurtosis: 4.03


Warnings:
[1] Covariance matrix calculated using the outer product of gradients (complex-step).

If this fits cleanly, the next step is to generate the 12-month forecast and plot it.
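The ValueWarning above is harmless here, but it can be avoided by declaring the index frequency before fitting. A small sketch with a hypothetical stand-in series (dates parsed from strings, like the CSV's Month column, carry no frequency):

```python
import numpy as np
import pandas as pd

# Dates parsed from strings have no frequency attached, which is what
# triggers statsmodels' "inferred frequency MS" warning.
dates = pd.to_datetime([f"1949-{m:02d}-01" for m in range(1, 13)])
ts = pd.Series(np.arange(12.0), index=dates)
print(ts.index.freq)  # None

# Declare the monthly-start frequency explicitly; forecast dates can
# then be built without inference (and without the warning).
ts = ts.asfreq("MS")
print(ts.index.freqstr)  # MS
```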

The SARIMA model has already been fit successfully on the monthly passenger series, and the setup is appropriate for the data you explored:

  • The earlier ADF result showed the raw series was non-stationary.
  • The fitted model uses both regular differencing and seasonal differencing: (1,1,1) × (1,1,1,12).
  • That matches the visible pattern in the series: upward trend plus strong 12-month seasonality.

What’s still missing for your original goal is the actual 12-month forecast values. Here is the minimal next step to produce them:

Generate forecast summary
forecast = model_fit.get_forecast(steps=12)
forecast_df = forecast.summary_frame()[["mean", "mean_ci_lower", "mean_ci_upper"]]
forecast_df
#Passengers mean mean_ci_lower mean_ci_upper
1961-01-01 447.222620 424.007632 470.437608
1961-02-01 422.734099 394.771420 450.696778
1961-03-01 457.700895 424.797900 490.603890
1961-04-01 492.277505 455.270834 529.284177
1961-05-01 505.563366 464.826096 546.300636
1961-06-01 565.671782 521.526200 609.817365
1961-07-01 651.986404 604.675665 699.297144
1961-08-01 639.634243 589.357594 689.910891
1961-09-01 543.058845 489.981679 596.136011
1961-10-01 493.727026 437.989896 549.464156
1961-11-01 426.938001 368.662189 485.213813
1961-12-01 470.135609 409.427185 530.844033

You fitted a seasonal ARIMA model successfully:

  • Model: (1,1,1)×(1,1,1,12)
  • This choice is consistent with the observed data:
    • the ADF test on the raw series had a very high p-value (0.9919), so the original series was not stationary,
    • the monthly pattern is clearly seasonal, so seasonal differencing with period 12 is appropriate.

The 12-month forecast for 1961 is:

| Month | Forecast | Lower 95% CI | Upper 95% CI |
| 1961-01 | 447.22 | 424.01 | 470.44 |
| 1961-02 | 422.73 | 394.77 | 450.70 |
| 1961-03 | 457.70 | 424.80 | 490.60 |
| 1961-04 | 492.28 | 455.27 | 529.28 |
| 1961-05 | 505.56 | 464.83 | 546.30 |
| 1961-06 | 565.67 | 521.53 | 609.82 |
| 1961-07 | 651.99 | 604.68 | 699.30 |
| 1961-08 | 639.63 | 589.36 | 689.91 |
| 1961-09 | 543.06 | 489.98 | 596.14 |
| 1961-10 | 493.73 | 437.99 | 549.46 |
| 1961-11 | 426.94 | 368.66 | 485.21 |
| 1961-12 | 470.14 | 409.43 | 530.84 |

What this means:

  • The forecast preserves the strong annual seasonality already visible in the history.
  • Peak demand is expected in July–August 1961, with July forecast highest at about 651.99 passengers.
  • Lower months are expected in February and November, matching the historical seasonal pattern.
  • The confidence intervals widen a bit across the horizon, reflecting uncertainty in longer-range forecasts.

So the requested ARIMA-style forecasting task has been answered with a fitted seasonal ARIMA model and a next-12-month forecast.
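The expected outcomes also list a forecast plot with confidence intervals, which this run stopped short of producing. A sketch of that last step, using hypothetical stand-in objects for ts and forecast_df (in the notebook, the real history series and forecast summary frame would be plotted instead):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend for scripted runs
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Stand-ins: one year of history plus a 12-step forecast frame with the
# same columns that summary_frame() produces.
hist_idx = pd.date_range("1960-01-01", periods=12, freq="MS")
ts = pd.Series(400 + 10 * np.arange(12.0), index=hist_idx)
fc_idx = pd.date_range("1961-01-01", periods=12, freq="MS")
forecast_df = pd.DataFrame(
    {
        "mean": 520 + 10 * np.arange(12.0),
        "mean_ci_lower": 500 + 10 * np.arange(12.0),
        "mean_ci_upper": 540 + 10 * np.arange(12.0),
    },
    index=fc_idx,
)

# Plot history, forecast mean, and a shaded confidence band.
fig, ax = plt.subplots(figsize=(10, 4))
ax.plot(ts.index, ts, label="history")
ax.plot(forecast_df.index, forecast_df["mean"], label="forecast")
ax.fill_between(
    forecast_df.index,
    forecast_df["mean_ci_lower"],
    forecast_df["mean_ci_upper"],
    alpha=0.2,
    label="95% CI",
)
ax.set_title("12-Month Forecast with Confidence Intervals")
ax.legend()
fig.tight_layout()
```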

Score cards for gpt-5.4 for Air Passengers Forecasting with ARIMA

Total score: 9/10

  • Task Completion: 2/2

    Excellent

    The workflow loads the dataset, plots the series, performs multiplicative decomposition, runs an ADF test, fits an ARIMA/SARIMAX model, and produces a 12-month forecast.

  • Execution Correctness: 2/2

    Excellent

    The code shown is syntactically coherent and the outputs indicate it ran successfully, including the decomposition, ADF results, model summary, and forecast table.

  • Output Quality: 3/3

    Excellent

    The outputs match the expected outcomes semantically: a monthly line chart, a 4-panel decomposition, an ADF p-value showing non-stationarity, and a forecast with confidence intervals.

  • Reasoning Quality: 1/2

    Needs work

The reasoning is mostly correct, but it is somewhat verbose and includes some speculative commentary; it explains the needed differencing after the ADF result but never applies it explicitly in code, despite the prompt.

  • Reliability: 1/1

    Excellent

    The workflow is reasonably consistent and uses standard time-series methods, though it relies on a fixed ARIMA specification rather than explicit model selection.

Try MLJAR Studio

Run the same type of AI-powered data analysis on your own datasets with conversational notebooks in MLJAR Studio.

Explore More AI Analysis Examples

Discover additional workflows across categories. Each example includes prompts, conversation outputs, and model-level scoring so you can compare approaches and results.