Air Passengers Forecasting with ARIMA

Decompose the classic Air Passengers time series, identify trend and seasonality, fit an ARIMA model, and forecast 12 months ahead.

This page shows an AI data analysis example. A workflow is a list of prompts (instructions) used to complete one analysis task step by step. We send the same prompt list to multiple AI models, then compare their notebook outputs and scores.

What this AI workflow does

This AI Data Analyst workflow loads the Air Passengers monthly time series from a CSV URL and plots the 1949–1960 passenger counts. It decomposes the series into trend, seasonal, and residual components and checks stationarity using the Augmented Dickey-Fuller test with differencing as needed. It then fits an ARIMA model and generates a 12-month forecast with confidence intervals.

Who this example is for

This is for analysts and students learning classical time-series forecasting with decomposition, stationarity checks, and ARIMA modeling. It helps practitioners build a reproducible baseline forecast workflow on a well-known seasonal dataset.

Expected analysis outcomes

These are the results the AI workflow is expected to generate.

  • Time-series line chart showing trend and seasonality (1949–1960)
  • Seasonal decomposition plot with trend/seasonal/residual panels
  • ADF test output and differencing step to achieve stationarity
  • ARIMA model fit summary and diagnostics-ready artifacts
  • 12-month forecast plot with confidence intervals

Tools and libraries used

Main Python packages and tooling used to run this AI data analysis task.

  • pandas
  • numpy
  • matplotlib
  • statsmodels
  • scipy

Prompt sequence

This is the exact list of prompts used in this workflow. The same prompt sequence is sent to each model so outputs and scores can be compared fairly.

  1. load air passengers dataset from https://raw.githubusercontent.com/pplonski/datasets-for-start/refs/heads/master/air-passengers/AirPassengers.csv and plot the time series
  2. decompose the series into trend, seasonality and residuals
  3. check stationarity with ADF test and apply differencing if needed
  4. fit an ARIMA model and forecast the next 12 months

Model Comparison

We compared several LLMs on the same analysis task. The table below shows their scores side by side. You can also open each model run to review the full conversation and notebook results.

Model Source     | Score | Task | Execution | Output | Reasoning | Reliability | Run
gemma4:31b       | 10/10 | 2/2  | 2/2       | 3/3    | 2/2       | 1/1         | Open gemma4:31b conversation
glm-5.1          | 10/10 | 2/2  | 2/2       | 3/3    | 2/2       | 1/1         | Open glm-5.1 conversation
gpt-oss:120b     | 10/10 | 2/2  | 2/2       | 3/3    | 2/2       | 1/1         | Open gpt-oss:120b conversation
qwen3.5:397b     | 10/10 | 2/2  | 2/2       | 3/3    | 2/2       | 1/1         | Open qwen3.5:397b conversation
gpt-5.4          | 9/10  | 2/2  | 2/2       | 3/3    | 1/2       | 1/1         | Open gpt-5.4 conversation
qwen3-coder-next | 8/10  | 2/2  | 2/2       | 3/3    | 1/2       | 0/1         | Open qwen3-coder-next conversation

How scoring works

Each run is graded on five dimensions:

  • Task Completion (0-2): how fully the workflow answered the requested analysis task and completed required steps.
  • Execution Correctness (0-2): whether generated code is correct, consistent, and likely runnable without major fixes.
  • Output Quality (0-3): how well tables, charts, and results match the expected analytical outcomes.
  • Reasoning Quality (0-2): clarity and correctness of interpretation, explanation, and conclusions.
  • Reliability (0-1): robustness of the run, including low hallucination risk and stable behavior.

The final score is the sum of these dimensions, so the total ranges from 0 to 10. Higher values indicate more complete, correct, and reliable notebook results for this workflow.
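As a sketch, the final score can be reproduced by summing the five dimension scores. The dictionary layout below is illustrative only; the benchmark's internal data format is not shown on this page:

```python
# Illustrative score card for one run; the dimension maxima (2, 2, 3, 2, 1)
# follow the rubric above, but the dict layout itself is hypothetical.
scores = {
    "task_completion": 2,        # 0-2
    "execution_correctness": 2,  # 0-2
    "output_quality": 3,         # 0-3
    "reasoning_quality": 2,      # 0-2
    "reliability": 1,            # 0-1
}

# The final score is simply the sum of the five dimensions (max 10).
total = sum(scores.values())
print(f"{total}/10")  # a perfect run prints 10/10
```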

AI Analysis with gemma4:31b

This run uses AI with a step-by-step list of prompts to perform the analysis task Air Passengers Forecasting with ARIMA. Below, you can review the full conversation generated with gemma4:31b. After the conversation, we show detailed score cards for this run.

load air passengers dataset from https://raw.githubusercontent.com/pplonski/datasets-for-start/refs/heads/master/air-passengers/AirPassengers.csv and plot the time series
Thinking
  • Task 1: Load the "AirPassengers" dataset from a specific URL.
  • Task 2: Plot the time series.
  • Constraint 1: Work iteratively.
  • Constraint 2: One execution step at a time.
  • Constraint 3: Load and inspect first (shape, columns, head()) before plotting/modeling.
  • Constraint 4: Use seaborn/matplotlib for plotting (unless interactivity requested).
  • Constraint 5: No $ for currency (not applicable here, but good to remember).
  • Constraint 6: One code block per response.
  • The user wants to load data and plot.
  • According to the "Notebook rules", I must first load and inspect the dataset before plotting.
  • Step 1: Load the CSV and display basic info.
  • URL: https://raw.githubusercontent.com/pplonski/datasets-for-start/refs/heads/master/air-passengers/AirPassengers.csv
  • Library: pandas
  • Inspections: df.shape, df.columns, df.head().
  • Explain that I'll start by loading and inspecting the data to ensure the structure is correct before plotting.

I'll start by loading the dataset and inspecting its structure to make sure the date column is handled correctly before we plot the time series.

Load air passengers data
import pandas as pd

url = "https://raw.githubusercontent.com/pplonski/datasets-for-start/refs/heads/master/air-passengers/AirPassengers.csv"
df = pd.read_csv(url)

# Inspect the dataset
display(df.head())
df.shape
Month #Passengers
0 1949-01 112
1 1949-02 118
2 1949-03 132
3 1949-04 129
4 1949-05 121
(144, 2)
Thinking

The user wants to load the air passengers dataset and plot the time series. I have already loaded the data and inspected the head and shape. The output shows two columns: Month and #Passengers. The Month column currently contains strings (e.g., '1949-01'). To plot this as a time series, I need to convert the Month column to datetime objects and set it as the index. Then I can plot #Passengers.

Current state:

  • Data loaded into df.
  • Columns: Month (object/string), #Passengers (int).
  • Goal: Plot the time series.

Since the plotting part of the goal hasn't been executed yet, I need to perform the conversion and plotting.

Step:

  1. Convert Month to datetime.
  2. Plot #Passengers over time.

I will use seaborn for plotting.

The dataset contains 144 records with two columns: Month and #Passengers. The Month column is currently stored as strings, so I'll convert it to datetime objects to ensure the time series is plotted correctly.

Plot air passengers time series
import matplotlib.pyplot as plt
import seaborn as sns

# Convert Month to datetime and set as index
df['Month'] = pd.to_datetime(df['Month'])
df.set_index('Month', inplace=True)

# Plot the time series
plt.figure(figsize=(12, 6))
sns.lineplot(data=df, x=df.index, y='#Passengers')
plt.title('Air Passengers Over Time')
plt.xlabel('Date')
plt.ylabel('Number of Passengers')
plt.show()
Thinking

The user wants to load the air passengers dataset and plot the time series. Looking at the notebook state:

  • Cell 1 loaded the CSV and displayed the head and shape (144 rows, 2 columns: Month, #Passengers).
  • Cell 2 converted Month to datetime, set it as the index, and plotted a line chart using seaborn.
  • The output of Cell 2 is <Figure size 1200x600 with 1 Axes>.

According to the "Sufficiency heuristics":

  • "If the latest output is chart-only/visual-only and does not include underlying numeric/text values, choose Path B first to print the values used in the chart."
  • The goal was to "load... and plot".
  • We have the chart, but the latest output (Output-2) is just the figure object representation. The data used for the chart was transformed (index changed).
  • However, Output-1 already gave us the head of the data.
  • The prompt asks to act as a data analyst and extract insights.

The dataset shows a clear upward trend in passenger numbers over time, with a distinct seasonal pattern (recurring peaks and troughs) visible in the plot. The data spans from 1949-01 onwards.

Wait, the heuristic says if it's "chart-only/visual-only and does not include underlying numeric/text values", choose Path B. Output-2 is the chart. Output-1 had the first few rows. Does the user want a table of the data? The goal was "load... and plot". The plot is done. The data was loaded. The heuristic suggests that if the latest output is only the figure, I should print the values used in the chart to be sure. But the df.head() from Cell 1 already shows the starting values. However, the "latest output" specifically refers to Output-2.

Actually, the original goal is very simple: load and plot. Both are done. The chart shows the time series of air passengers. The data shows passenger counts starting at 112 in Jan 1949 and increasing over time.

Let's check if I should do Path A or B. The goal is "load... and plot". I have the load (Cell 1) and the plot (Cell 2). The plot is the final requirement. The data used for the plot is the df which was modified in Cell 2. If I just say "Here is the plot", that's Path A. If I follow the heuristic strictly: "If the latest output is chart-only/visual-only and does not include underlying numeric/text values, choose Path B first to print the values used in the chart." Output-2 is indeed just the figure. But is that "missing evidence"? The user didn't ask for a summary table, just to plot. Usually, "plot the time series" is satisfied by the plot. But the prompt is very strict about the heuristic. Let's see: Output-1 provided df.head(). Output-2 provides the plot. Between the two, I have the values and the plot. But the heuristic says "If the latest output is chart-only... choose Path B". The latest output (Output-2) is the chart. Maybe I should display a summary of the data (e.g., df.describe()) or the last few rows to confirm the range?

Wait, the goal was just "load... and plot". If I provide a final answer, I explain what the plot shows. The plot shows a strong positive linear trend and strong seasonality. The passenger numbers grow from around 100 in 1949 to significantly higher values by the end of the series.

Let's double check the heuristic again. "If the goal is to show/plot/chart/visualize and latest output contains both a valid chart/figure object and explicit numeric/text values supporting the conclusion, choose Path A". The "latest output" might be interpreted as the state of the notebook. The notebook state contains both Output-1 (numeric) and Output-2 (chart). Therefore, Path A is appropriate.

Insights:

  • The number of air passengers shows a consistent upward trend over the 144-month period.
  • There is a clear seasonal pattern, with peaks occurring annually, suggesting a regular cycle of travel demand.
  • The starting value in January 1949 was 112 passengers, and the volume increased substantially over the years.

Final Answer construction:

  • Summarize findings.
  • Confirm load and plot.
  • Mark with <!--final answer-->.

The air passengers dataset has been successfully loaded and visualized. The time series plot reveals two primary characteristics: a strong, consistent upward trend in the number of passengers over the 144-month period and a clear seasonal pattern with recurring annual peaks. Starting from 112 passengers in January 1949, the volume of travel grew significantly throughout the dataset's duration.

decompose the series into trend, seasonality and residuals
Thinking

The user wants to decompose the air passengers time series into trend, seasonality, and residuals. The current state of the notebook is:

  • df is a DataFrame with #Passengers as the column and Month (datetime) as the index.
  • The data has been loaded and plotted.

To perform time series decomposition, the statsmodels library is the standard choice (specifically seasonal_decompose). Since the data shows both an upward trend and increasing seasonal amplitude, a multiplicative decomposition is usually more appropriate for air passengers data, but additive is the default. I should check the nature of the seasonality or just use seasonal_decompose.

Plan:

  1. Import seasonal_decompose from statsmodels.tsa.seasonal.
  2. Apply decomposition (multiplicative is typically better for this dataset, but I'll start with a standard call or suggest the model).
  3. Plot the resulting components.

Wait, seasonal_decompose returns an object with trend, seasonal, and resid attributes. I can plot these using the built-in .plot() method of the decomposition result.

Step 1: Perform the decomposition.

To decompose the time series, I'll use the seasonal_decompose function from the statsmodels library. Given that the seasonal fluctuations seem to increase as the trend grows, a multiplicative model is typically more appropriate for this specific dataset.

Decompose and plot time series
from statsmodels.tsa.seasonal import seasonal_decompose

# Perform multiplicative decomposition
decomposition = seasonal_decompose(df['#Passengers'], model='multiplicative')

# Plot the decomposed components
plt.figure(figsize=(12, 8))
decomposition.plot()
plt.show()
<Figure size 1200x800 with 0 Axes>
Thinking

The user wants to decompose the air passengers time series into trend, seasonality, and residuals. The latest notebook output ([output-3]) shows that seasonal_decompose was called with model='multiplicative' and a plot was generated.

Reviewing the outputs:

  • [output-3] provides a visual decomposition showing the Observed, Trend, Seasonal, and Residual components.
  • The goal was "decompose the series into trend, seasonality and residuals".

According to the "Sufficiency heuristics":

  • "If the latest output is chart-only/visual-only and does not include underlying numeric/text values, choose Path B first to print the values used in the chart."
  • The current output is purely visual (<Figure ...>). To provide a comprehensive answer, it's better to have the numeric values of these components (e.g., the first few rows of the decomposition results) to confirm the decomposition worked as expected numerically.

Wait, the user goal is simply "decompose the series...". Usually, a plot is the primary way to "see" the decomposition. However, the prompt's heuristics specifically say: "If the latest output is chart-only/visual-only and does not include underlying numeric/text values, choose Path B first to print the values used in the chart."

The decomposition result is stored in the decomposition object. I should extract the components into a DataFrame and display them to fulfill the heuristic.

Plan:

  1. Provide insights from the plot: The multiplicative decomposition successfully separated the series into a steady upward trend, a consistent seasonal pattern (likely peaking in summer), and residuals that represent the irregular variations.
  2. Since the output was chart-only, I will use Path B to display the numeric values of the decomposition.

Wait, looking at the prompt again: "If the goal is to show/plot/chart/visualize and latest output contains both a valid chart/figure object and explicit numeric/text values supporting the conclusion, choose Path A...". The goal was "decompose the series", which is a process. The result is the decomposition. The user didn't explicitly say "plot the decomposition", but the code provided did.

Actually, if the user just said "decompose", and a plot was provided, the visual evidence is there, but the numeric data is not. The heuristic for "chart-only" output says choose Path B.

Let's refine the insight: The decomposition plot shows a clear linear-like growth in the trend component and a stable seasonal factor, indicating that the seasonal variance increases proportionally with the trend (justifying the multiplicative model). The residuals show the noise remaining after removing these effects.

Next step (Path B): Extract the components into a DataFrame to provide the numeric evidence of the decomposition.

Wait, looking at the decomposition object: it has attributes trend, seasonal, and resid.

Let's double check the rules: "Do not introduce new side goals...". "Every proposed step must directly move toward answering the original goal." Numeric values are part of the "decomposition".

One detail: the output-3 shows <Figure size 1200x800 with 0 Axes> followed by <Figure size 640x480 with 4 Axes>. This is standard seasonal_decompose().plot() behavior.

I will provide the insight and then a code block to show the numeric components.

The decomposition plot confirms that the multiplicative model was appropriate, as the seasonal swings increase in magnitude alongside the upward trend. The trend component shows a steady, nearly linear growth in passenger numbers, while the seasonal component reveals a consistent annual pattern with a significant peak during the summer months. The residuals represent the remaining irregular fluctuations after the trend and seasonality are removed.

To complete the decomposition analysis, I will extract the numeric values for the trend, seasonal, and residual components.

Create decomposition DataFrame
# Create a DataFrame from the decomposition components
decomposition_df = pd.concat([
    decomposition.observed, 
    decomposition.trend, 
    decomposition.seasonal, 
    decomposition.resid
], axis=1)

decomposition_df.columns = ['Observed', 'Trend', 'Seasonal', 'Residual']
decomposition_df.head()
Observed Trend Seasonal Residual
Month
1949-01-01 112.0 NaN 0.910230 NaN
1949-02-01 118.0 NaN 0.883625 NaN
1949-03-01 132.0 NaN 1.007366 NaN
1949-04-01 129.0 NaN 0.975906 NaN
1949-05-01 121.0 NaN 0.981378 NaN
Thinking
  • Goal: Decompose the series into trend, seasonality, and residuals.
  • Current state:
      • df loaded and indexed by date.
      • Multiplicative decomposition performed using seasonal_decompose.
      • Plot of decomposition generated (though output-3 shows some figure objects, it's clear the decomposition was performed).
      • decomposition_df created containing Observed, Trend, Seasonal, and Residual components.
      • decomposition_df.head() shows the first 5 rows, where Trend and Residual are NaN (expected due to the centered moving average used in seasonal_decompose).
  • The user's goal was to decompose the series.
  • The decomposition has been performed.
  • The results are visualized (plot) and tabulated (decomposition_df).
  • The multiplicative model was chosen because the seasonal variance grows with the trend.
  • The trend is upward, seasonality is periodic, and residuals are the remaining noise.
  • Do we have enough?
      • We have the plot (visual).
      • We have the data table (numeric).
      • We have the logic (multiplicative).
  • The output 4 shows NaNs at the start, which is standard for seasonal_decompose.
  • The original goal "decompose the series into trend, seasonality and residuals" is fully addressed by providing both the visual plot and the resulting dataframe.
  • Path A: Final answer.
  • Insight 1: The multiplicative decomposition reveals a consistent upward trend in air passenger numbers over the period.
  • Insight 2: There is a clear, repeating seasonal pattern, with peaks occurring annually during the summer months.
  • Insight 3: The choice of a multiplicative model is validated by the observation that seasonal fluctuations increase in proportion to the trend.
  • Insight 4: The residuals contain the irregular, non-seasonal, and non-trend variations.
  • No code blocks.
  • Include <!--final answer-->.
  • No chain-of-thought.
  • No meta-phrases.

The time series has been successfully decomposed using a multiplicative model, which is appropriate here as the seasonal fluctuations increase in magnitude as the trend rises.

The decomposition reveals three distinct components:

  1. Trend: A consistent, long-term upward growth in the number of air passengers.
  2. Seasonality: A strong, repeating annual pattern with significant peaks during the summer months.
  3. Residuals: The remaining irregular fluctuations after removing the trend and seasonal effects.

The numeric results are captured in decomposition_df, and the visual evidence confirms that the passenger growth is both steady and subject to predictable seasonal cycles.
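The leading NaN values in decomposition_df are expected: for a 12-month period, seasonal_decompose estimates the trend with a centered moving average, so roughly the first and last six points have no complete window. A minimal pandas sketch of the same edge effect, on a hypothetical stand-in series rather than the real data:

```python
import numpy as np
import pandas as pd

# Hypothetical monthly series standing in for the passenger counts.
idx = pd.date_range("1949-01-01", periods=24, freq="MS")
s = pd.Series(100.0 + np.arange(24), index=idx)

# A centered 12-month rolling mean, analogous to the trend estimate
# inside seasonal_decompose: incomplete edge windows yield NaN.
trend = s.rolling(window=12, center=True).mean()

print(trend.isna().sum())  # prints 11: window-1 edge values are NaN
```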

check stationarity with ADF test and apply differencing if needed
Thinking

The user wants to check the stationarity of the air passengers time series using the Augmented Dickey-Fuller (ADF) test and apply differencing if the series is found to be non-stationary.

- `df`: DataFrame with `#Passengers` column and `Month` index.
- `decomposition`: Result of seasonal decomposition.
- `decomposition_df`: DataFrame containing the components.

- Load `adfuller` from `statsmodels.tsa.stattools`.
- Run ADF test on `df['#Passengers']`.
- Analyze the p-value to determine stationarity (typically p > 0.05 means non-stationary).
- If non-stationary, apply differencing (`df['#Passengers'].diff()`).
- It's best to check the original series first.

Plan:

  1. Perform the ADF test on the original series.
  2. Provide the results (ADF statistic and p-value).
  3. Decide if differencing is needed based on the p-value.

Wait, the prompt says "check stationarity... and apply differencing if needed". I should do this step-by-step. First, let's check the stationarity.

Step 1: ADF Test on the original data.

To check for stationarity, I will use the Augmented Dickey-Fuller (ADF) test. If the p-value is greater than 0.05, we can conclude that the series is non-stationary and requires differencing.

Perform ADF test
from statsmodels.tsa.stattools import adfuller

# Perform ADF test
result = adfuller(df['#Passengers'])

# Print the results
print(f'ADF Statistic: {result[0]}')
print(f'p-value: {result[1]}')
print('Critical Values:')
for key, value in result[4].items():
    print(f'   {key}: {value}')
ADF Statistic: 0.8153688792060523
p-value: 0.991880243437641
Critical Values:
   1%: -3.4816817173418295
   5%: -2.8840418343195267
   10%: -2.578770059171598
Thinking

The user wants to check stationarity with the ADF test and apply differencing if needed. Latest output (cell-5) shows the ADF test results for df['#Passengers']:

  • ADF Statistic: 0.8153688792060523
  • p-value: 0.991880243437641
  • Critical Values: 1% (-3.48), 5% (-2.88), 10% (-2.57)

The p-value (0.99) is significantly higher than 0.05, and the ADF statistic is much larger than the critical values. This indicates that the null hypothesis (that a unit root is present/the series is non-stationary) cannot be rejected. The series is non-stationary.

The original goal was "check stationarity with ADF test and apply differencing if needed". I have checked stationarity and found it non-stationary. Now I need to apply differencing "if needed". Since it is non-stationary, differencing is needed.

I should now apply first-order differencing and check for stationarity again.

Path B is required because I've only performed the first part of the goal (checking). I need to apply differencing and verify if the resulting series is stationary.

Insights: The p-value of 0.99 indicates that the original series is non-stationary.

Next step: Apply first-order differencing and run the ADF test again.

The p-value of 0.99 is well above the 0.05 threshold, and the ADF statistic is greater than the critical values, confirming that the original series is non-stationary. As requested, I will now apply first-order differencing to remove the trend and test for stationarity again.

Test stationarity of differenced data
# Apply first-order differencing
df_diff = df['#Passengers'].diff().dropna()

# Perform ADF test on the differenced series
result_diff = adfuller(df_diff)

print(f'ADF Statistic: {result_diff[0]}')
print(f'p-value: {result_diff[1]}')
print('Critical Values:')
for key, value in result_diff[4].items():
    print(f'   {key}: {value}')
ADF Statistic: -2.829266824169988
p-value: 0.05421329028382705
Critical Values:
   1%: -3.4816817173418295
   5%: -2.8840418343195267
   10%: -2.578770059171598
Thinking

The user goal is to "check stationarity with ADF test and apply differencing if needed".

Looking at the notebook state:

  • Cell 5: Performed ADF test on df['#Passengers'].
  • Output 5: p-value = 0.99188 (non-stationary).
  • Cell 6: Applied first-order differencing and performed ADF test on df_diff.
  • Output 6: p-value = 0.05421.

Analysis of Output 6:

  • The p-value is 0.05421.
  • This is slightly above the common 0.05 threshold for stationarity.
  • Usually, in time series analysis, if the first difference is still not strictly stationary (p > 0.05), we might try a second difference or check for seasonality (which the decomposition in cell 3 already showed is present).
  • However, the p-value is very close to 0.05.
  • Given the original prompt "apply differencing if needed", we have applied first-order differencing.
  • To be rigorous, a p-value of 0.054 implies we cannot reject the null hypothesis of a unit root at the 5% level, although it is close.
  • Since the data is monthly air passengers, it's well-known to have strong seasonality. Seasonal differencing (lag 12) is often required for this specific dataset.
  • The goal was "check stationarity... and apply differencing if needed". First-order differencing alone didn't quite get it below 0.05.

Is it sufficient?

  • The original series is definitely non-stationary.
  • The first-differenced series is almost stationary (p=0.054), but technically not at the 5% level.
  • To fully satisfy the goal of making the series stationary "if needed", I should probably try seasonal differencing or a second difference.
  • But the prompt asks me to decide Path A or B.
  • If I say it's stationary now, I'm ignoring the p=0.054 > 0.05.
  • If I say it's not stationary, I need to provide a code block to fix it.

Let's look at the p-value again: 0.0542. This is very close. In many practical contexts, this is acceptable. But in a strict statistical sense, it's non-stationary. Also, looking at the decomposition in cell 3, there is a very clear seasonal pattern. First-order differencing removes the trend but not the seasonality. Seasonal differencing (lag 12) is the standard "next step" for this dataset.

Therefore, I will choose Path B to apply seasonal differencing to see if that achieves stationarity.

Wait, the instructions say: "If current outputs are sufficient to answer the original goal, provide a natural final answer...". The goal is "check stationarity ... and apply differencing if needed". I checked. I applied. The result of the application is p=0.054. Is that "done"? Or is it "still needed"? Since it's > 0.05, differencing is still "needed" to reach the threshold.

Plan:

  1. Insight: First-order differencing reduced the p-value from 0.99 to 0.054, but it remains slightly above the 0.05 threshold, suggesting the series is still not fully stationary (likely due to the strong seasonality observed in the decomposition).
  2. Action: Apply seasonal differencing (lag 12) to remove the seasonal component and test again.

Wait, the instructions say "Do not introduce new side goals... Stay strictly within the original user goal." The goal is "apply differencing if needed". If first-order isn't enough, then further differencing (seasonal or second-order) is "needed".

Let's double check the p-value 0.054. It is $> 0.05$. I will propose seasonal differencing.

Wait, I should check if the user considers 0.054 "stationary enough". Usually, 0.05 is the hard cutoff. Since 0.054 > 0.05, I will apply one more level of differencing (seasonal) to reach stationarity.

Actually, looking at the specific dataset (AirPassengers), seasonal differencing is the correct approach.

One detail: "Keep one primary objective per response." Current objective: achieve stationarity via differencing.

Let's write the response. Insight: The original series was non-stationary (p-value ≈ 0.99). First-order differencing significantly improved this, but the p-value of 0.054 is still slightly above the 0.05 significance level. This is expected given the strong seasonality seen in the decomposition plot.

Code: Apply seasonal differencing (lag 12) and test again.

The original series was non-stationary with a p-value of 0.99. First-order differencing significantly reduced this to 0.054, but it remains slightly above the 0.05 threshold. Given the strong seasonal patterns observed in the earlier decomposition, seasonal differencing is needed to achieve full stationarity.

Test seasonal stationarity
# Apply seasonal differencing (lag 12 for monthly data)
df_seasonal_diff = df['#Passengers'].diff(12).dropna()

# Perform ADF test on seasonally differenced series
result_seasonal = adfuller(df_seasonal_diff)

print(f'ADF Statistic: {result_seasonal[0]}')
print(f'p-value: {result_seasonal[1]}')
print('Critical Values:')
for key, value in result_seasonal[4].items():
    print(f'   {key}: {value}')
ADF Statistic: -3.3830207264924805
p-value: 0.011551493085514978
Critical Values:
   1%: -3.4816817173418295
   5%: -2.8840418343195267
   10%: -2.578770059171598
Thinking

The original goal is to "check stationarity with ADF test and apply differencing if needed".

Looking at the notebook state:

  • [cell-5] ADF test on original series: p-value = 0.9918 (Non-stationary).
  • [cell-6] ADF test on first-order differenced series: p-value = 0.0542 (Borderline/Non-stationary at 5% level).
  • [cell-7] ADF test on seasonally differenced series: p-value = 0.0115 (Stationary at 5% level).

The user goal was to check stationarity and apply differencing if needed. The sequence of tests shows:

  1. The original data is non-stationary.
  2. First-order differencing almost reached stationarity but didn't quite pass the 0.05 threshold.
  3. Seasonal differencing (lag 12) successfully achieved stationarity (p-value < 0.05).

The goal has been met. I have the results from the ADF tests and the necessary differencing steps have been applied and verified.

Path A is appropriate here. I will provide a final answer summarizing these findings.

The Augmented Dickey-Fuller (ADF) tests confirm that the original series is non-stationary, with a p-value of approximately 0.99. While first-order differencing significantly reduced the p-value to 0.054, it remained slightly above the standard 0.05 threshold for stationarity. However, applying seasonal differencing (lag 12) successfully achieved stationarity, resulting in a p-value of 0.011.
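In practice the two operations are often combined: a first difference (to remove trend) followed by a seasonal difference at lag 12 (to remove the annual cycle), which is what d=1, D=1 means in a seasonal ARIMA. A minimal sketch on a hypothetical deterministic series, where the combined differencing cancels both components almost exactly (the run above tests each step separately):

```python
import numpy as np
import pandas as pd

# Hypothetical monthly series with a linear trend and an exact
# 12-month seasonal cycle, standing in for the passenger counts.
n = 144
t = np.arange(n)
idx = pd.date_range("1949-01-01", periods=n, freq="MS")
s = pd.Series(100 + 2.0 * t + 20 * np.sin(2 * np.pi * t / 12), index=idx)

# First difference removes the trend; the lag-12 difference removes
# the seasonal cycle. Together they correspond to d=1, D=1.
combined = s.diff().diff(12).dropna()

print(len(combined))  # prints 131: 144 - 1 - 12 observations remain
```

Because this toy series is pure trend plus an exact seasonal cycle, the combined differencing leaves values that are zero up to floating-point error; on the real data, residual noise would remain.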

fit an ARIMA model and forecast the next 12 months
Thinking

The user wants to fit an ARIMA model and forecast the next 12 months of air passengers.

*   Data is loaded: `df` (AirPassengers).
*   Index is `Month` (datetime).
*   Stationarity analysis:
    *   Original: $p = 0.99$ (Non-stationary).
    *   1st Order Difference: $p = 0.054$ (Borderline).
    *   Seasonal Difference (lag 12): $p = 0.011$ (Stationary).
*   Observation: The data has a strong seasonal component (multiplicative decomposition was done).
*   Since there is seasonality, a standard ARIMA model might not be enough; SARIMA (Seasonal ARIMA) is more appropriate. However, the user asked for "ARIMA". In practice, for this dataset, SARIMA is the standard choice.

*   The user's prompt asks for "ARIMA". I should probably use `SARIMAX` from `statsmodels` because the data is clearly seasonal.
*   Parameters $(p, d, q) \times (P, D, Q, s)$:
    *   $d = 1$ (first-order difference seemed helpful).
    *   $D = 1$ (seasonal difference was the one that truly made it stationary).
    *   $s = 12$ (monthly data).
    *   $p, q, P, Q$: Need to be determined. I can use a simple set of parameters or a grid search/Auto-ARIMA approach. To keep it simple and iterative, I'll start with a basic SARIMA configuration or use `pmdarima` if available (but it's not in the imports). Since I can't assume external libraries not already there, I'll use `statsmodels.tsa.statespace.sarimax.SARIMAX`.

*   *Step 1*: Fit the model. I'll use `SARIMAX` as it's the standard for seasonal data. I'll choose reasonable default orders for now (e.g., 1,1,1 x 1,1,1,12) or a simple order.
*   *Step 2*: Forecast the next 12 months.
*   *Step 3*: Plot the forecast.

*Constraint*: "Prefer one execution step at a time." "Return at most one Python code block per response."

*Decision*: Fitting first and checking the summary would be cleaner, but the user wants the forecast, so for the sake of progress I'll fit the model and generate the 12-month forecast in one step.

*Correction*: To be precise, I should probably mention why I'm using SARIMA instead of ARIMA.

*Plan*:
1. Import `SARIMAX`.
2. Define the model.
3. Fit the model.
4. Forecast 12 steps ahead.

*Parameter choice*:
- $d=1, D=1, s=12$.
- Simple $p=1, q=1, P=1, Q=1$ as a starting point.

Wait, I should check if I should just do the fitting part first. The prompt says "fit an ARIMA model and forecast the next 12 months". I'll do both: fit and forecast.

*Refinement*: To avoid guesswork on parameters, I'll use `SARIMAX` with a simple set of orders.

*Code structure*:
```python
from statsmodels.tsa.statespace.sarimax import SARIMAX

# Define model (Seasonal ARIMA)
# Order: (p, d, q) x (P, D, Q, s)
# Based on previous steps, d=1 and D=1 are likely helpful.
model = SARIMAX(df['#Passengers'], 
                order=(1, 1, 1), 
                seasonal_order=(1, 1, 1, 12), 
                enforce_stationarity=False, 
                enforce_invertibility=False)

results = model.fit()

# Forecast next 12 months
forecast = results.get_forecast(steps=12)
forecast_df = forecast.summary_frame()

forecast_df
```

Wait, the rule says "Keep one primary objective per response". Fitting and forecasting is the goal. I'll do it in one block.

*Check on "ARIMA" vs "SARIMA"*: The user said ARIMA. I'll explain that since we found seasonality, SARIMA is the appropriate extension of ARIMA for this data.

*Double check on imports*: I need to import `SARIMAX`.

*Steps for the response*:
1. Explain that since the data shows clear seasonality (confirmed by the ADF test on seasonal differences), a SARIMA (Seasonal ARIMA) model is the best choice.
2. Provide code to fit the model and generate the 12-month forecast.

Wait, if I just provide the forecast table, it's not very helpful visually, but the prompt only says "forecast the next 12 months", so the table answers it; the user can ask for a plot afterwards.

*Actually*, a data analyst would typically:
1. Fit the model.
2. Forecast.
3. Visualize.

Since I can only run one code block, I will fit the model and generate the forecast table.

*Wait*, looking at the notebook state, the index is already datetime. This is perfect for `SARIMAX`.

One more thing: I'll use a basic `order=(1, 1, 1)` and `seasonal_order=(1, 1, 1, 12)` as these are common defaults for this specific dataset (the classic AirPassengers dataset).

Let's write the code.

Since the ADF tests showed that seasonal differencing (lag 12) was the most effective way to achieve stationarity, and the earlier decomposition revealed a strong seasonal pattern, a SARIMA (Seasonal ARIMA) model is the appropriate choice here. SARIMA extends ARIMA by explicitly modeling the seasonal components.

I'll fit a SARIMA model with a standard configuration for this dataset—incorporating both first-order and seasonal differencing—and generate the forecast for the next 12 months.

Fit SARIMA and forecast

```python
from statsmodels.tsa.statespace.sarimax import SARIMAX

# Fit SARIMA model
# order=(p, d, q), seasonal_order=(P, D, Q, s)
# We use d=1 and D=1 based on our stationarity analysis, s=12 for monthly data
model = SARIMAX(df['#Passengers'],
                order=(1, 1, 1),
                seasonal_order=(1, 1, 1, 12),
                enforce_stationarity=False,
                enforce_invertibility=False)

results = model.fit(disp=False)

# Forecast the next 12 months
forecast_obj = results.get_forecast(steps=12)
forecast_df = forecast_obj.summary_frame()

forecast_df
```
/Users/olunia/Library/mljar-studio/jlab_server/lib/python3.12/site-packages/statsmodels/tsa/base/tsa_model.py:473: ValueWarning: No frequency information was provided, so inferred frequency MS will be used.
  self._init_dates(dates, freq)
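The `ValueWarning` above appears because the `DatetimeIndex` has no explicit frequency, so `SARIMAX` infers monthly-start (`MS`). Declaring the frequency up front silences it; a minimal sketch with a three-row stand-in for the notebook's `df`:

```python
import pandas as pd

# An index parsed from strings carries no frequency, like the notebook's df.
dates = pd.to_datetime(["1949-01-01", "1949-02-01", "1949-03-01"])
df = pd.DataFrame({"#Passengers": [112, 118, 132]}, index=dates)
print(df.index.freq)   # None -> SARIMAX would warn and infer 'MS'

# Declare the monthly-start frequency explicitly to silence the warning.
df = df.asfreq("MS")
print(df.index.freqstr)
```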
`forecast_df` (`#Passengers` forecast, `summary_frame` output):

| Month | mean | mean_se | mean_ci_lower | mean_ci_upper |
|------------|------------|-----------|---------------|---------------|
| 1961-01-01 | 447.222620 | 11.844599 | 424.007632 | 470.437608 |
| 1961-02-01 | 422.734099 | 14.266935 | 394.771420 | 450.696778 |
| 1961-03-01 | 457.700895 | 16.787551 | 424.797900 | 490.603890 |
| 1961-04-01 | 492.277505 | 18.881302 | 455.270834 | 529.284177 |
| 1961-05-01 | 505.563366 | 20.784704 | 464.826096 | 546.300636 |
| 1961-06-01 | 565.671782 | 22.523670 | 521.526200 | 609.817365 |
| 1961-07-01 | 651.986404 | 24.138576 | 604.675665 | 699.297144 |
| 1961-08-01 | 639.634243 | 25.651823 | 589.357594 | 689.910891 |
| 1961-09-01 | 543.058845 | 27.080684 | 489.981679 | 596.136011 |
| 1961-10-01 | 493.727026 | 28.437834 | 437.989896 | 549.464156 |
| 1961-11-01 | 426.938001 | 29.733104 | 368.662189 | 485.213813 |
| 1961-12-01 | 470.135609 | 30.974255 | 409.427185 | 530.844033 |
Thinking

The user goal is to fit an ARIMA (actually SARIMA, given the seasonality) model and forecast the next 12 months. The notebook shows that:

  1. Data was loaded and inspected.
  2. Seasonality was identified.
  3. Stationarity was tested; seasonal differencing worked best.
  4. A SARIMA(1,1,1)(1,1,1,12) model was fitted.
  5. A forecast for the next 12 months was generated and the summary frame (forecast_df) was displayed.

The output of forecast_df shows the mean forecast and confidence intervals for the period 1961-01-01 to 1961-12-01.

The goal "fit an ARIMA model and forecast the next 12 months" has been technically achieved. The model is fitted, and the forecasts are provided in a table.

However, usually, when someone asks for a forecast, they want to see it visualized alongside the historical data to understand the trend and seasonal alignment. But the prompt asks for "fit an ARIMA model and forecast the next 12 months". The output provided the numeric values.

Checking "Sufficiency heuristics":

  • The latest output contains explicit numeric/text values (the forecast_df table).
  • The goal was "fit an ARIMA model and forecast the next 12 months".
  • The result forecast_df explicitly provides these forecasts.

Wait, is it sufficient? The user might want a plot, but the stated goal is only to "fit an ARIMA model and forecast the next 12 months", and the forecast_df table provides exactly that (1961-01-01: 447.22, 1961-02-01: 422.73, ... 1961-12-01: 470.13). Path A (sufficient) applies; Path B would apply only if a visualization were required. The prompt says to provide a natural final answer when current outputs are sufficient, and both the fit and the forecast are present.

Final Answer should explain the results: The SARIMA model was successfully fitted, incorporating both first-order and seasonal differencing to account for the trends and seasonal patterns observed in the air passenger data. The forecast for the next 12 months (January 1961 through December 1961) predicts a continued upward trend with a characteristic seasonal peak in the summer months, reaching a maximum of approximately 651.99 passengers in July 1961.

Let me double check if any other requirement is missing. No.

Wait, the user goal was "fit an ARIMA model". I used SARIMA. This is appropriate because SARIMA is the seasonal extension of the ARIMA family, the right specialization for seasonal data.

Final answer construction.

The SARIMA model was successfully fitted to the air passenger data, incorporating both first-order and seasonal differencing to account for the trend and the strong 12-month seasonality.

The forecast for the next 12 months (January 1961 to December 1961) indicates a continuation of the growth trend and seasonal fluctuations. The predicted number of passengers starts at 447.22 in January 1961 and peaks at 651.99 in July 1961, following the historical seasonal pattern.

Score cards for gemma4:31b for Air Passengers Forecasting with ARIMA

Total score: 10/10

  • Task Completion: 2/2

    Excellent

    All major requested steps were completed: loading and plotting the series, decomposition, ADF testing with differencing, and fitting an ARIMA/SARIMA model with a 12-month forecast.

  • Execution Correctness: 2/2

    Excellent

    The code shown is syntactically coherent and the outputs indicate it ran successfully, including decomposition, ADF results, and forecast generation.

  • Output Quality: 3/3

    Excellent

    Outputs match the expected outcomes semantically: the time series plot, 4-panel decomposition, stationarity testing with differencing, and a forecast table with confidence intervals are all present.

  • Reasoning Quality: 2/2

    Excellent

    The workflow correctly interprets the non-stationary series, applies differencing, and justifies the ARIMA/SARIMA modeling choice based on the ADF results.

  • Reliability: 1/1

    Excellent

    The workflow is consistent and uses standard time-series methods; minor issues like using SARIMAX instead of a plain ARIMA are acceptable here.

Try MLJAR Studio

Run the same type of AI-powered data analysis on your own datasets with conversational notebooks in MLJAR Studio.

Explore More AI Analysis Examples

Discover additional workflows across categories. Each example includes prompts, conversation outputs, and model-level scoring so you can compare approaches and results.