HFW 4: Less lags are more

Using 4 lagged quarters to forecast revenues performs worse than a naïve model

Feb 17, 2025

In our last post, we compared an autoregressive linear regression model with recursive forecasting to the naïve model. We found the linear regression performed similarly to the naïve model, with some modest differences. This result is something to keep in the back of our mind. Even though linear regression is not considered all that sophisticated in terms of machine learning, it is a heck of a lot more sophisticated than straightlining the last actual value. That such sophistication doesn't yield better performance across a broad sample of companies with different end markets, business models, and product mix should be filed away in our memory banks. For future comparisons, we'll keep the the naïve and linear regression models as benchmarks. We'll now start considering other time series models in earnest. Our first will be the autoregressive, or AR model.

An autoregressive model uses past values to predict the future value. There can be one or many past values as in the following equation:

\(X_t = c + \phi X_{t-1} + \phi X_{t-2} + \phi X_{t-3} + \epsilon\)

Here there are three lagged values, a constant c, and an error ϵ. The coefficients, or ϕs, are estimated using different methods—typically ordinary least squares (OLS; e.g., linear regression) or maximum likelihood estimation (MLE). Some practitioners use what is called the Yule-Walker equation. This assumes the time series are stationary, which is not the case for most financial data, so we won't discuss it that much here. We could, of course, transform the data to get it to approximate a stationary series, but we'll wait for more complex algorithms that do automatically do that for us.1 But that's getting ahead of ourselves.

Now, we know you might be scratching your head and thinking, "Wait! Wasn't your autoregressive linear regression model really just an AR(1) model?" The answer is yes! And, in fact, even though the Python package we're using—statsmodels—employs maximum likelihood estimation to solve for the coefficients, rather than ordinary least squares, it returns the same result. This is because that algorithm assumes the error terms are normally distributed, just like OLS. So we've already created a proper AR(1) model without any extra work! What efficiency!

That said, let's look at an AR model with additional lags. If we were planning to do this more rigorously, we'd examine the autocorrelations between the revenues of all 50 companies and then either select an appropriate lag for each company based on the statistical significance of the correlation coefficient. We'll save that for a Code Walk-Through. Here, we'll just use our intuition and pick an AR(4) model. The reason? In our former life as an equity research analyst, the last four quarters had the greatest bearing on how we thought about the next four. Certainly things could change within a year, but looking further back from those four quarters usually didn't help in terms of insight unless we had a good prior period as an analogy for our expectations for the future. Not very rigorous, but better than no view!

We could, of course, repeat the AR(4) model using linear regression, but we'll now move to the statsmodels’ AutoReg class to make our lives easier. This should all be clear in the code. After we run through all the model building, we calculate and then plot the scaled RMSEs by sector below.

Nothing stands out apart from XLU, which looks surprisingly poor. For the purposes of this blog, we won't look for causes other than to note that the year-over-year changes in revenue look pretty large (e.g., Nextera (NEE)) or exhibit some remarkable changepoints (e.g., Vistra (VST), and Sempra (SRE)) for the companies in that sector. See the chart below for details

Just about any model that is looking at more history than the last value will have a hard time with accurate forecasts, especially when faced with such changepoints. Let's move on the comparisons with the naïve and AR(1) models.

While the AR(4) performs a bit better on a few sectors—XLP and XLV—it is worse, sometimes exceptionally so, than the naïve and AR(1) models.

Let's look at this across all tickers instead of by sector using a violin plot.

There are some significant errors in the AR(4) model, as evidenced by the long neck (in keeping with the metaphor). Let's look at the top 20 worst errors by company and model. As opposed to our last post where we sorted by the naïve model, here we'll sort by AR(4).

While there are some issues, the biggest flub comes from Vistra (VST). Although CVX looks pretty bad too. If we were thinking about using AR(4), we'd need to drill down to see if removing VST helps performance. But from this cursory assessment, we'd do well to avoid AR(4) altogether. The next model we'll look at is the moving average. Stay tuned!

Code below.

# Load packages
import pandas as pd
import numpy as np
from datetime import datetime, timedelta
import statsmodels.api as sm
from statsmodels.tsa.ar_model import AutoReg
import matplotlib.pyplot as plt
from matplotlib.ticker import FuncFormatter
import matplotlib as mpl
import pickle
import seaborn as sns

# Assign chart style
plt.style.use('seaborn-v0_8')
plt.rcParams["figure.figsize"] = (12,6)

# Functions to load and save files
def save_dict_to_file(data, filename):
    with open(filename, 'wb') as f:
        pickle.dump(data, f)

def load_dict_from_file(filename):
    with open(filename, 'rb') as f:
        return pickle.load(f)

# Symbols used
etf_symbols = ['XLF', 'XLI', 'XLE', 'XLK', 'XLV', 'XLY', 'XLP', 'XLB', 'XLU', 'XLC']

ticker_list = ["SHW", "LIN", "ECL", "FCX", "VMC",
               "XOM", "CVX", "COP", "WMB", "SLB",
               "JPM", "V", "MA", "BAC", "GS",
               "CAT", "RTX", "DE", "UNP", "BA",
               "AAPL", "MSFT", "NVDA", "ORCL", "CRM",
               "COST", "WMT", "PG", "KO", "PEP",
               "NEE", "D", "DUK", "VST", "SRE",
               "LLY", "UNH", "JNJ", "PFE", "MRK",
               "AMZN", "SBUX", "HD", "BKNG", "MCD",
               "META", "GOOG", "NFLX", "T", "DIS"
]

xlb = ["SHW", "LIN", "ECL", "FCX", "VMC"]
xle = ["XOM", "CVX", "COP", "WMB", "SLB"]
xlf = ["JPM", "V", "MA", "BAC", "GS"]
xli = ["CAT", "RTX", "DE", "UNP", "BA"]
xlk = ["AAPL", "MSFT", "NVDA", "ORCL", "CRM"]
xlp = ["COST", "WMT", "PG", "KO", "PEP"]
xlu = ["NEE", "D", "DUK", "VST", "SRE"]
xlv = ["LLY", "UNH", "JNJ", "PFE", "MRK"]
xly = ["AMZN", "SBUX", "HD", "BKNG", "MCD"]
xlc = ["META", "GOOG", "NFLX", "T", "DIS"]

sectors = [xlf,  xli,  xle,  xlk, xlv,  xly, xlp, xlb, xlu, xlc]

# Sector dictionary
sector_dict = {symbol.lower(): tickers for symbol, tickers in zip(etf_symbols, sectors)}

# Load data from disk
# See Code Walk-Throughs for how we built the data set
df_sector_dict = load_dict_from_file("path/to/data/df_sector_dict.pkl")

# Create functions for indexing
def create_index(series):
    if series.iloc[0] > 0:
        return series/series.iloc[0] * 100
    else:
        return (series - series.iloc[0])/-series.iloc[0] * 100

# Clean dataframes
df_rev_index_dict = {}
for key in sector_dict:
    temp_df = df_sector_dict[key].copy()
    col_1 = temp_df.columns[0]
    temp_df = temp_df[[col_1] + [x for x in temp_df.columns if 'revenue' in x]]
    temp_df.columns = ['date'] + [x.replace('revenue_', '').lower() for x in temp_df.columns[1:]]
    temp_idx = temp_df.copy()
    temp_idx[[x for x in temp_idx if 'date' not in x]] = temp_idx[[x for x in temp_idx if 'date' not in x]].apply(create_index)
    df_rev_index_dict[key] = temp_idx

# Create train/test dataframes
df_rev_train_dict = {}
df_rev_test_dict = {}
for key in df_rev_index_dict:
    df_out = df_rev_index_dict[key]
    df_rev_train_dict[key] = df_out.loc[df_out['date'] < "2023-01-01"]
    df_rev_test_dict[key] = df_out.loc[df_out['date'] >= "2023-01-01"]

# Create model dictionary
ar_models = {}
lags = 4
for key in df_rev_train_dict:
    tickers = [x.lower() for x in sector_dict[key]]
    length = df_rev_train_dict[key].shape[0]
    df_train = df_rev_train_dict[key]
    ar_models[key] = {}
    for ticker in tickers:
        try: 
            model = AutoReg(df_train[ticker], lags).fit(cov_type="fixed scale")
            # model = AutoReg(df_train[ticker], lags).fit()
            # model = sm.tsa.SARIMAX(df_train[ticker], order=(lags,0,0)).fit()
        except ValueError as e:
            print(f"{ticker} fails due to {e}")

        ar_models[key][ticker] = model

# Check one sector's models output
print(ar_models['xlb']['lin'].summary())

# Function for scaled RMSE
def get_rmse_scaled(series):
    return np.sqrt(np.mean((series['actual']-series['predicted'])**2))/np.mean(series['actual'])

# RMSE dataframe
# Create dataframe
ar_err_df = pd.DataFrame(columns = ['sector', 'ticker', 'actual', 'predicted'])
count = 0
for key in ar_models:
    for ticker in ar_models[key]:            
        # print(f"{key}: {ticker}")
        try: 
            y_act = df_rev_test_dict[key][ticker].values
            y_pred = ar_models[key][ticker].predict(16,19, dynamic=False).values
        except ValueError as e:
            print(f"{ticker} fails due to {e}")
            y_act = np.nan
            y_pred = np.nan

        ar_err_df.loc[count] = [key.upper(), ticker, y_act, y_pred]
        count += 1

for ticker in ar_err_df['ticker'].to_list():
    actual = ar_err_df.loc[ar_err_df['ticker']==ticker, 'actual'].values[0]
    predicted = ar_err_df.loc[ar_err_df['ticker']==ticker, 'predicted'].values[0]
    rmse = np.sqrt(np.mean((actual - predicted)**2))
    rmse_scaled = rmse/actual.mean()
    ar_err_df.loc[ar_err_df['ticker']==ticker, 'rmse'] = rmse
    ar_err_df.loc[ar_err_df['ticker']==ticker, 'rmse_scaled'] = rmse_scaled

# Function to flatten
def flatten_df(dataf: pd.DataFrame, group_name:str, cols:list) -> pd.DataFrame:
    df_grouped = dataf.groupby(group_name)[cols].agg(list)

    for col in cols:
        df_grouped[col] = df_grouped[col].apply(lambda x: np.concatenate(x))
    
    df_long = df_grouped.apply(pd.Series.explode).reset_index()

    return df_long

# Clean and flatten ar_err_df
ar_df_long = flatten_df(ar_err_df, 'sector', ['actual', 'predicted'])

# Graph scaled RMSE by sector
(ar_df_long.groupby('sector')[['actual', 'predicted']].apply(get_rmse_scaled).sort_values(ascending=True)*100).plot(kind='bar', rot=0)
plt.xlabel('')
plt.ylabel('RMSE (%)')
plt.title('Scaled RMSE by Sector for AR(4) Models')
plt.show()

# Why is XLU so poor?
print(df_rev_test_dict['xlu'].iloc[0,1:].values/df_rev_train_dict['xlu'].iloc[-4,1:].values)

# Plot
pd.DataFrame((df_rev_test_dict['xlu'].iloc[:,1:].values/df_rev_train_dict['xlu'].iloc[-4:,1:].values-1)*100,
             columns = df_rev_test_dict['xlu'].columns[1:]).T.rename(columns={0:"Q1", 1:"Q2", 2:"Q3", 3:"Q4"}).plot(kind='bar', rot=0)
plt.xticks(np.arange(5), [x.upper() for x in df_rev_test_dict['xlu'].columns[1:]])
plt.ylabel("Change (%)")
plt.legend(loc='upper center', ncol=4)
plt.title('Year-over-year change for each company in XLU')
plt.show()

# Load comparison dataframe
comp_df = pd.read_pickle("hello_world/data/comp_df.pkl")
ar = ar_df_long.groupby('sector')[['actual', 'predicted']].apply(get_rmse_scaled).reset_index().rename(columns={0:'ar'})
comp_df = comp_df.merge(ar, how="left", on='sector')

# Plot RMSE by sector and model
(comp_df.set_index('sector')[['naive', 'lr', 'ar']].sort_values('naive')*100).plot(kind='bar', rot=0)
plt.xlabel('')
plt.ylabel('RMSE (%)')
plt.ylim(0, 110)
plt.legend(['Naive', 'AR(1)', 'AR(4)'], loc='upper center', ncol=3)
plt.title('Scaled RMSE Comparisons by Sector and Model')
plt.savefig("hello_world/images/day_5_naive_ar1_ar4_rmses_by_sector.png")
plt.show()

# Group by tickers
ar_ticker_long = flatten_df(ar_err_df, 'ticker', ['actual', 'predicted'])

comp_all_df = pd.read_pickle("hello_world/data/comp_all_df.pkl")
ar_ticker = ar_ticker_long.groupby('ticker')[['actual', 'predicted']].apply(get_rmse_scaled).reset_index().rename(columns={0:'ar'})
comp_all_df = comp_all_df.merge(ar_ticker, how="left", on='ticker')

# Violinplot
sns.violinplot(comp_all_df[['naive', 'lr', 'ar']]*100)
plt.xticks(np.arange(3), ['Naive', 'AR(1)', 'AR(4)'])
plt.ylabel('RMSE (%)')
plt.title('Scaled RMSE by forecasting model all companies')
plt.show()

# Top 20 by naive model from last post
n_large = [x.upper() for x in comp_all_df.nlargest(10,"naive")['ticker'].to_list()]
(comp_all_df.nlargest(10, "naive").set_index("ticker")*100).plot(kind='bar', rot=0)
plt.xlabel('')
plt.xticks(np.arange(10), n_large)
plt.ylabel('Scaled RMSE (%)')
plt.legend(['Naive', 'AR(1)', 'AR(4)'], loc='upper center', ncol=3)
plt.title('Top 20% highest error rate by company and Model')
plt.show()

# Top 20 by ar4
n_large = [x.upper() for x in comp_all_df.nlargest(10,"ar")['ticker'].to_list()]
(comp_all_df.nlargest(10, "ar").set_index("ticker")*100).plot(kind='bar', rot=0)
plt.xlabel('')
plt.xticks(np.arange(10), n_large)
plt.ylabel('Scaled RMSE (%)')
plt.legend(['Naive', 'AR(1)', 'AR(4)'], loc='upper center', ncol=3)
plt.title('Top 20% highest error rate by company and Model')
plt.show()

Such a transformation is often referred to as differencing, which usually takes the form of calculating the rate of change of the particular value from a past value. That rate of change can be in the same units as the value of interest (via subtraction) or expressed as percent or logarithm.

Options, Stocks, Machines on Substack

Discussion about this post