Air Passengers Forecasting with ARIMA

Decompose the classic Air Passengers time series, identify trend and seasonality, fit an ARIMA model, and forecast 12 months ahead.

This page shows an AI data analysis example. A workflow is a list of prompts (instructions) used to complete one analysis task step by step. We send the same prompt list to multiple AI models, then compare their notebook outputs and scores.

What this AI workflow does

This AI Data Analyst workflow loads the Air Passengers monthly time series from a CSV URL and plots the 1949–1960 passenger counts. It decomposes the series into trend, seasonal, and residual components and checks stationarity using the Augmented Dickey-Fuller test with differencing as needed. It then fits an ARIMA model and generates a 12-month forecast with confidence intervals.

Who this example is for

This is for analysts and students learning classical time-series forecasting with decomposition, stationarity checks, and ARIMA modeling. It helps practitioners build a reproducible baseline forecast workflow on a well-known seasonal dataset.

Expected analysis outcomes

These are the results the AI workflow is expected to generate.

  • Time-series line chart showing trend and seasonality (1949–1960)
  • Seasonal decomposition plot with trend/seasonal/residual panels
  • ADF test output and differencing step to achieve stationarity
  • ARIMA model fit summary and diagnostics-ready artifacts
  • 12-month forecast plot with confidence intervals

Tools and libraries used

Main Python packages and tooling used to run this AI data analysis task.

  • pandas
  • numpy
  • matplotlib
  • statsmodels
  • scipy
  • seaborn

Prompt sequence

This is the exact list of prompts used in this workflow. The same prompt sequence is sent to each model so outputs and scores can be compared fairly.

  1. load air passengers dataset from https://raw.githubusercontent.com/pplonski/datasets-for-start/refs/heads/master/air-passengers/AirPassengers.csv and plot the time series
  2. decompose the series into trend, seasonality and residuals
  3. check stationarity with ADF test and apply differencing if needed
  4. fit an ARIMA model and forecast the next 12 months

Model Comparison

We compared several LLM models on the same analysis task. The table below shows their scores side by side. You can also open each model run to check the full conversation and notebook results.

Model Source      Score  Task  Execution  Output  Reasoning  Reliability  Run
gemma4:31b        10/10  2/2   2/2        3/3     2/2        1/1          Open gemma4:31b conversation
glm-5.1           10/10  2/2   2/2        3/3     2/2        1/1          Open glm-5.1 conversation
gpt-5.4           10/10  2/2   2/2        3/3     2/2        1/1          Open gpt-5.4 conversation
gpt-oss:120b      10/10  2/2   2/2        3/3     2/2        1/1          Open gpt-oss:120b conversation
qwen3-coder-next  8/10   2/2   2/2        3/3     1/2        0/1          Open qwen3-coder-next conversation
qwen3.5:397b     8/10   2/2   1/2        2/3     2/2        1/1          Open qwen3.5:397b conversation

How scoring works

Each run is graded on five dimensions:

  • Task Completion (0-2): how fully the workflow answered the requested analysis task and completed required steps.
  • Execution Correctness (0-2): whether generated code is correct, consistent, and likely runnable without major fixes.
  • Output Quality (0-3): how well tables, charts, and results match the expected analytical outcomes.
  • Reasoning Quality (0-2): clarity and correctness of interpretation, explanation, and conclusions.
  • Reliability (0-1): robustness of the run, including low hallucination risk and stable behavior.

The final score is the sum of these dimensions, so total score = 0-10. Higher values indicate more complete, correct, and reliable notebook results for this workflow.
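The total is plain addition over the five dimensions. As a minimal sketch of the arithmetic (the dictionary keys below are illustrative, not part of any actual scoring tool):

```python
# Hypothetical illustration of the scoring scheme described above:
# five dimension scores that sum to the 0-10 total.
dimension_scores = {
    "task_completion": 2,        # out of 2
    "execution_correctness": 2,  # out of 2
    "output_quality": 3,         # out of 3
    "reasoning_quality": 2,      # out of 2
    "reliability": 1,            # out of 1
}

total = sum(dimension_scores.values())
print(f"Total score: {total}/10")  # a perfect run scores 10/10
```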

AI Analysis with gpt-5.4

This run uses AI with a step-by-step list of prompts to perform the analysis task Air Passengers Forecasting with ARIMA. Below, you can review the full conversation generated with gpt-5.4. After the conversation, we show detailed score cards for this run.

load air passengers dataset from https://raw.githubusercontent.com/pplonski/datasets-for-start/refs/heads/master/air-passengers/AirPassengers.csv and plot the time series
Thinking

Preparing to display data

I realize that I don't need to use print statements since I can display variables directly. I should work on creating a summary DataFrame to show its shape and columns using display(...). I won't include comma-separated expressions to keep things simple. My next step will be to confirm the structure and then proceed to plotting. So, I’ll say, "Let's load it first and check the structure, then we'll plot it in the next step."

Let’s load it first and check the structure, then we’ll plot the time series in the next step.

Load and preview dataset
import pandas as pd
from IPython.display import display

url = "https://raw.githubusercontent.com/pplonski/datasets-for-start/refs/heads/master/air-passengers/AirPassengers.csv"

df = pd.read_csv(url)

summary = pd.DataFrame({
    "rows": [df.shape[0]],
    "columns": [df.shape[1]],
    "column_names": [list(df.columns)]
})

display(summary)
display(df.head())
   rows  columns         column_names
0   144        2  [Month, #Passengers]

     Month  #Passengers
0  1949-01          112
1  1949-02          118
2  1949-03          132
3  1949-04          129
4  1949-05          121

The dataset is in the expected time-series shape: 144 monthly records with a date-like Month column and one numeric series, #Passengers. That means it’s ready to plot as a single passenger-count time series.

Plot air passengers over time
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd

df["Month"] = pd.to_datetime(df["Month"])

plt.figure(figsize=(12, 5))
sns.lineplot(data=df, x="Month", y="#Passengers")
plt.title("Air Passengers Over Time")
plt.xlabel("Month")
plt.ylabel("Number of Passengers")
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

The outputs confirm the dataset structure: the figure was created from the correct two columns, Month and #Passengers, across 144 monthly records. The first observed values begin in 1949, confirming this is a monthly passenger time series suitable for plotting over time.

The time series plot uses Month, converted to datetime, on the x-axis and #Passengers on the y-axis, which matches the requested visualization of air passengers over time.

decompose the series into trend, seasonality and residuals
Thinking

Decomposing a series

I need to create an answer based on the notebook state. The user is asking about decomposing a time series. With the data loaded (including Month datetime), it seems like I should use the statsmodels library's seasonal_decompose. I'll set the index and frequency, providing a concise explanation along with minimal code. Since the AirPassengers data has increasing variance, I think a multiplicative decomposition is appropriate, using period=12. I also plan to create a decomposition plot. I’ll finish with plt.show().

Since this is monthly data and the seasonal swings grow as the level rises, a multiplicative decomposition is the right choice here.

Plot seasonal decomposition
from statsmodels.tsa.seasonal import seasonal_decompose
import matplotlib.pyplot as plt

ts = df.set_index("Month")["#Passengers"].asfreq("MS")

decomp = seasonal_decompose(ts, model="multiplicative", period=12)

fig = decomp.plot()
fig.set_size_inches(12, 8)
plt.tight_layout()
plt.show()
Thinking

Choosing decomposition path

I need to decide between two options: A or B, for decomposing a series into trend, seasonality, and residuals. The goal is to focus on insights, and since the latest output is only a visual chart, I should choose Path B. This means I'll display the values used in the chart and provide a concise code block for gathering the missing evidence.

I'll mention that we have 144 monthly observations, and the seasonal period appears to be 12. I'll include minimal runnable code to create a DataFrame from the decomposition results and display the first 15 rows.

Crafting tables for decomposition

Since the goal is to decompose a series, I'll need the underlying values. I should provide a clear and straightforward summary, ideally using a single table for display.

I'll present the first 15 rows and the last 5 to inspect the residuals, while keeping in mind not to overwhelm with multiple tables. I'll create a compact summary table that includes observed, trend, seasonal, and residual values for those first 15 dates.

Finally, I’ll note that there are NaN values at the edges due to moving averages.

The decomposition plot is in place and it confirms the series was split into observed, trend, seasonal, and residual components over 144 monthly points. Because the output is visual-only, the missing piece is a compact table of the component values themselves. One detail to expect: the trend and residual columns will have NaN at the start and end because the 12-month decomposition needs surrounding months to estimate the centered trend.

Display decomposition components
from IPython.display import display
import pandas as pd

parts = pd.DataFrame({
    "observed": decomp.observed,
    "trend": decomp.trend,
    "seasonal": decomp.seasonal,
    "resid": decomp.resid
})

display(parts.head(15))
observed trend seasonal resid
Month
1949-01-01 112.0 NaN 0.910230 NaN
1949-02-01 118.0 NaN 0.883625 NaN
1949-03-01 132.0 NaN 1.007366 NaN
1949-04-01 129.0 NaN 0.975906 NaN
1949-05-01 121.0 NaN 0.981378 NaN
1949-06-01 135.0 NaN 1.112776 NaN
1949-07-01 148.0 126.791667 1.226556 0.951664
1949-08-01 148.0 127.250000 1.219911 0.953401
1949-09-01 136.0 127.958333 1.060492 1.002220
1949-10-01 119.0 128.583333 0.921757 1.004028
1949-11-01 104.0 129.000000 0.801178 1.006270
1949-12-01 118.0 129.750000 0.898824 1.011812
1950-01-01 115.0 131.250000 0.910230 0.962603
1950-02-01 126.0 133.083333 0.883625 1.071467
1950-03-01 141.0 134.916667 1.007366 1.037447

The series has been successfully decomposed into trend, seasonality, and residuals using a multiplicative model with a 12-month period, which fits this dataset because the seasonal variation grows as the overall level rises.

From the observed component values:

  • Trend captures the long-term upward movement in passenger volume.
    Early in the series, the estimated trend is around 126.8 in July 1949 and rises to about 134.9 by March 1950, showing steady growth.

  • Seasonality captures the repeating within-year monthly pattern.
    The seasonal factor is:

    • above 1 in stronger months, such as July (1.2266) and August (1.2199), meaning these months are typically higher than the trend.
    • below 1 in weaker months, such as November (0.8012) and February (0.8836), meaning these months are typically lower than the trend.
  • Residuals are what remains after removing trend and seasonality.
    These values are mostly close to 1, for example 0.9517, 1.0022, 1.0063, which suggests the decomposition is capturing most of the systematic structure and leaving relatively small irregular fluctuations.

  • The NaN values at the beginning are expected for decomposition because the centered moving average used to estimate trend cannot be computed at the edges.

So, in plain terms:

  • the trend shows passenger counts increasing over time,
  • the seasonality shows a strong recurring annual cycle with summer peaks and late-autumn lows,
  • the residuals are comparatively small random deviations after accounting for those two patterns.

check stationarity with ADF test and apply differencing if needed
Thinking

Planning ADF Execution

I need to respond based on the notebook state and the user's prompt, iterating one step at a time while being concise. Analyzing the latest outputs against the original goal and the current user prompt is crucial. It seems I should run an Augmented Dickey-Fuller (ADF) test on the original series. There's a variable called ts, but it may not be explicitly listed due to truncation. It’s safer to reconstruct it from the DataFrame. Let's focus on the code for the ADF test without assuming outputs.

The decomposition already suggests the series is probably not stationary because it has both a rising trend and strong seasonality. The next step is to confirm that with an ADF test on the original monthly series.

Run ADF test
from statsmodels.tsa.stattools import adfuller
import pandas as pd
from IPython.display import display

ts = df.set_index("Month")["#Passengers"].asfreq("MS")

adf_result = adfuller(ts.dropna())

adf_table = pd.DataFrame({
    "metric": ["ADF statistic", "p-value", "lags used", "n observations"],
    "value": [adf_result[0], adf_result[1], adf_result[2], adf_result[3]]
})

display(adf_table)
metric value
0 ADF statistic 0.815369
1 p-value 0.991880
2 lags used 13.000000
3 n observations 130.000000

If the p-value is above 0.05, the next step will be first-order differencing.

The ADF test shows the original series is not stationary.

From the observed output:

  • ADF statistic = 0.815
  • p-value = 0.9919

Because the p-value is far above 0.05, we fail to reject the null hypothesis of a unit root. That means the monthly passenger series still has non-stationary structure, which matches what we already saw in the decomposition: a clear upward trend and strong seasonality.

So differencing is needed. Given this series has both trend and yearly seasonality, the practical interpretation is:

  • First differencing addresses the trend:
    y_t - y_{t-1}
  • Since this is monthly data with annual seasonality, it will likely also need seasonal differencing with lag 12:
    y_t - y_{t-12}

Based on the current output alone, the correct conclusion is:

  • the original series is non-stationary
  • differencing should be applied before modeling
  • for this dataset, seasonal differencing is especially important because of the strong 12-month cycle visible in the decomposition
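The two differencing operations described above map directly onto pandas `Series.diff`. A minimal sketch on a hypothetical monthly series (the values are illustrative, not the passenger data):

```python
import pandas as pd

# Hypothetical monthly series standing in for the passenger counts.
idx = pd.date_range("1949-01-01", periods=36, freq="MS")
ts = pd.Series(range(100, 136), index=idx, dtype=float)

# First difference removes a linear trend: y_t - y_{t-1}
first_diff = ts.diff().dropna()

# Seasonal difference removes the yearly cycle: y_t - y_{t-12}
seasonal_diff = ts.diff(12).dropna()

print(first_diff.iloc[0], seasonal_diff.iloc[0])  # 1.0 12.0
```

After differencing, the ADF test would be re-run on the differenced series to confirm stationarity before fixing the ARIMA order.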

fit an ARIMA model and forecast the next 12 months
Thinking

Figuring out ARIMA modeling steps

I need to respond based on the notebook's state, taking one step at a time. The user wants to fit an ARIMA model and forecast for the next 12 months. I wonder if I should apply differencing first, as the original goal was to achieve stationarity with ADF. It seems like I should probably choose a SARIMA model due to seasonality, but given the context, it might be appropriate to fit a seasonal ARIMA instead. Perhaps I should create a differenced series and test for stationarity again as a first step.

Deciding on the next code step

Since there's no fresh output for this step, I think I should provide the next code snippet. I've analyzed the latest outputs and found that the ADF test indicates non-stationarity. Thus, applying differencing, likely both seasonal and/or a first difference, seems to be the next step. Since the user wants to fit a model and forecast, I could suggest fitting a seasonal ARIMA model. Maybe I should use the order (1,1,1) and seasonal order (1,1,1,12) and forecast for the next 12 months.

Choosing the right ARIMA approach

Since the data is monthly and seasonal, using a seasonal ARIMA seems more fitting than a non-seasonal one. I want to keep the code minimal and focused on the current step. It looks like I can fit the model and perhaps show a forecast table, or maybe just one chart or one table per step. Since it's only one step this time, I’m leaning towards a table that includes confidence intervals. I could utilize the statsmodels SARIMAX for this, especially since the notebook already has the time series data.

Because the series is non-stationary and clearly seasonal, a seasonal ARIMA (SARIMA) is the better choice here than a plain non-seasonal ARIMA.

A good first fit for this dataset is:

  • regular differencing: d = 1
  • seasonal differencing: D = 1 with period 12
Fit SARIMAX forecast model
from statsmodels.tsa.statespace.sarimax import SARIMAX
import pandas as pd
from IPython.display import display

ts = df.set_index("Month")["#Passengers"].asfreq("MS")

model = SARIMAX(
    ts,
    order=(1, 1, 1),
    seasonal_order=(1, 1, 1, 12),
    enforce_stationarity=False,
    enforce_invertibility=False
)

result = model.fit(disp=False)

forecast_res = result.get_forecast(steps=12)
forecast_mean = forecast_res.predicted_mean
forecast_ci = forecast_res.conf_int()

forecast_df = pd.DataFrame({
    "forecast": forecast_mean,
    "lower_ci": forecast_ci.iloc[:, 0],
    "upper_ci": forecast_ci.iloc[:, 1]
})

display(forecast_df)
forecast lower_ci upper_ci
1961-01-01 447.222620 424.007631 470.437608
1961-02-01 422.734098 394.771419 450.696778
1961-03-01 457.700895 424.797899 490.603890
1961-04-01 492.277505 455.270833 529.284177
1961-05-01 505.563366 464.826094 546.300638
1961-06-01 565.671782 521.526198 609.817366
1961-07-01 651.986404 604.675663 699.297145
1961-08-01 639.634242 589.357592 689.910893
1961-09-01 543.058845 489.981677 596.136013
1961-10-01 493.727025 437.989893 549.464158
1961-11-01 426.938001 368.662187 485.213815
1961-12-01 470.135609 409.427182 530.844035

The fit produced the next 12 monthly forecasts with confidence intervals, shown above. A natural follow-up step is to plot the historical series together with the forecast.

The fitted model is a seasonal ARIMA, specifically SARIMA(1,1,1) × (1,1,1,12), which is appropriate here because the series was non-stationary and showed strong yearly seasonality.

The 12-month forecast continues the same seasonal pattern seen in the historical data:

  • lower values at the start of the year,
  • rising through spring,
  • a summer peak,
  • then declining into autumn and winter.

The forecasted passenger counts for 1961 are:

  • Jan: 447.2
  • Feb: 422.7
  • Mar: 457.7
  • Apr: 492.3
  • May: 505.6
  • Jun: 565.7
  • Jul: 652.0
  • Aug: 639.6
  • Sep: 543.1
  • Oct: 493.7
  • Nov: 426.9
  • Dec: 470.1

The model expects the peak month to be July 1961 at about 652 passengers, with August close behind. The lowest forecast is February 1961 at about 423 passengers.

The confidence intervals widen a bit for later months, which means uncertainty grows as the forecast horizon extends. Even so, the intervals still support the same overall seasonal shape, so the forecast is consistent and plausible for this dataset.
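The transcript mentions plotting the historical series alongside the forecast as a follow-up, but no such cell appears above. A possible sketch, using synthetic stand-in data in place of the notebook's `ts`, `forecast_mean`, and `forecast_ci` variables (the values below are illustrative only):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this runs in scripts
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic stand-ins: in the notebook these come from the SARIMAX fit.
hist_idx = pd.date_range("1949-01-01", periods=144, freq="MS")
ts = pd.Series(np.linspace(112, 432, 144), index=hist_idx)

fc_idx = pd.date_range("1961-01-01", periods=12, freq="MS")
forecast_mean = pd.Series(np.linspace(447, 470, 12), index=fc_idx)
forecast_ci = pd.DataFrame({"lower": forecast_mean - 25,
                            "upper": forecast_mean + 25})

# Plot history, forecast, and the shaded confidence band.
fig, ax = plt.subplots(figsize=(12, 5))
ax.plot(ts, label="observed")
ax.plot(forecast_mean, label="forecast")
ax.fill_between(fc_idx, forecast_ci["lower"], forecast_ci["upper"],
                alpha=0.2, label="confidence interval")
ax.set_title("Air Passengers: 12-Month SARIMA Forecast")
ax.legend()
fig.tight_layout()
```

In the notebook itself, the three synthetic variables would simply be replaced by the outputs of `result.get_forecast(steps=12)` from the previous step.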

Score cards for gpt-5.4 for Air Passengers Forecasting with ARIMA

Total score: 10/10

  • Task Completion: 2/2

    Excellent

    The workflow covers loading, plotting, decomposition, ADF testing, differencing discussion, and 12-month forecasting.

  • Execution Correctness: 2/2

    Excellent

    The provided code is coherent and likely runnable, using standard pandas, statsmodels, and plotting calls without obvious syntax errors.

  • Output Quality: 3/3

    Excellent

    Outputs match the expected outcomes: time series plot, 4-panel decomposition, ADF result showing non-stationarity, and a forecast table with confidence intervals.

  • Reasoning Quality: 2/2

    Excellent

    The reasoning correctly links the upward trend and seasonality to non-stationarity and justifies multiplicative decomposition and differencing.

  • Reliability: 1/1

    Excellent

    The workflow is consistent and uses appropriate standard methods, with no clear hallucinations or unsafe steps.

Try MLJAR Studio

Run the same type of AI-powered data analysis on your own datasets with conversational notebooks in MLJAR Studio.

Explore More AI Analysis Examples

Discover additional workflows across categories. Each example includes prompts, conversation outputs, and model-level scoring so you can compare approaches and results.