In our last post, we put the I in ARIMA and showed that, overall, this more complex model did not outperform the simpler naïve or autoregressive models. This is a natural transition point as we move to a different class of models that aim to solve a thorny problem common to time series data: heteroskedasticity. For those not familiar with that mouthful, no need to worry; we'll discuss it soon enough. But first we need to make a detour to build a second naïve model. This one is based on revenue growth rates rather than raw revenues. Strictly speaking, a growth rate-based model is not exactly naïve, but it does fall in that category of assuming no prior knowledge of the future other than what we already know about the present.
Before we get ahead of ourselves, let's establish a roadmap. We’ll build the intuition behind growth rate-type forecasting models; compare such a model or models to our benchmarks; identify and analyze some of the issues with this and the previous types of forecasting we've executed thus far; then, with that knowledge, we’ll implement the types of models that deal with that heteroskee, whatever that word is above.
Why use growth rates rather than raw numbers to forecast? There are some deep mathematical artifacts that can explain this in a rigorous fashion. But a simple reason is the following. Growth rates can be more stable. Now stable is a pretty nebulous term in this context. But, at the risk of parroting a famous judge's definition of something more salacious, we'll know it when we see it. Let's bring back our indexed revenue chart for all the companies in the ten sectors of the S&P 500.
Even indexed, we can see a lot of variability that makes it hard to say whether next quarter's revenues will be 500 or 90. What if we transform all those revenues into growth rates? We show that below.
True, this seems almost just as squiggly. But here are a few items to point out. Besides the fact that the numbers are smaller and thus easier to wrap our heads around (we have small brains), we see that they tend to be a bit more rangebound. The trend has disappeared. That makes the rates, as opposed to the levels, a bit easier to forecast. Additionally, such a transformation can create more stationary data. In this case, though, we see that there are still some periods that feature large jumps while others are relatively calm. We'll leave what is meant precisely by the word stationary and what exactly we’re applying it to (e.g., means or variance) open for now. But this should ring a bell for those who often hear that the problem with time series data and forecasting is non-stationarity, even if it’s not exactly clear what the speaker means by it.
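To make the transformation concrete, here is a minimal sketch of turning a trending level series into sequential growth rates with log differences, the same `np.log(x).diff()*100` transformation used in the plotting code at the end of this post. The revenue numbers are made up for illustration and do not come from our data set.

```python
import numpy as np
import pandas as pd

# Hypothetical indexed revenue series (first quarter = 100); illustrative values only
revenue = pd.Series([100.0, 104.0, 110.0, 113.0, 121.0, 127.0, 131.0, 140.0])

# Sequential (quarter-over-quarter) growth as log differences, in percent
growth = np.log(revenue).diff().dropna() * 100

# The levels trend upward, while the growth rates stay within a narrow band
print(f"Level range:  {revenue.min():.1f} to {revenue.max():.1f}")
print(f"Growth range: {growth.min():.2f}% to {growth.max():.2f}%")
```

The levels drift from 100 to 140, while the growth rates all sit within a few percentage points of each other. That is the "rangebound" behavior described above, though real revenue series will be noisier than this toy example.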
What if we use growth rates, naïvely, to forecast revenues? How we decide to use them requires a bit less naïveté than our straightline approach from earlier. Indeed, we can use the sequential or year-over-year change. We could also use the last four quarters' sequential or year-over-year changes to forecast the next four quarters. These last two methods are closer to what a fundamental research analyst might use. We’re not there yet. We'll take the most recent sequential change and straightline that to keep it as naïve as possible. Here's the scaled RMSE chart you should all be more than familiar with by now.
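Before looking at the results, the mechanics of straightlining the most recent sequential change can be sketched in a few lines. This is a simplified illustration rather than the full code at the end of the post: the helper names `naive_growth_forecast` and `scaled_rmse` are our own inventions for this example, and the train and test numbers are made up. It uses the simple ratio of the last two observations, which equals the exponentiated log difference used in the full code.

```python
import numpy as np

def naive_growth_forecast(history, horizon=4):
    """Compound the most recent sequential growth factor forward `horizon` periods."""
    history = np.asarray(history, dtype=float)
    growth = history[-1] / history[-2]      # last quarter-over-quarter growth factor
    steps = np.arange(1, horizon + 1)       # 1, 2, ..., horizon
    return history[-1] * growth ** steps

def scaled_rmse(actual, predicted):
    """RMSE divided by the mean of the actual values."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return np.sqrt(np.mean((actual - predicted) ** 2)) / actual.mean()

# Hypothetical indexed revenues; illustrative values only
train = [100.0, 102.0, 105.0, 110.0]
test = [112.0, 116.0, 118.0, 123.0]

forecast = naive_growth_forecast(train, horizon=len(test))
print(forecast.round(2))
print(f"Scaled RMSE: {scaled_rmse(test, forecast):.1%}")
```

Because the last growth factor is compounded, a single unusually strong or weak quarter at the end of the training window gets amplified across the whole forecast horizon. That is worth keeping in mind when reading the chart.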
If you’ve read the other posts (we certainly hope you have!), then this bar chart should look familiar, although the performance of XLP might be disconcerting. In our next post, we’ll compare this result with the benchmark models and expand on how to use growth rates in more sophisticated ways. Stay tuned!
Code below
# Load packages
import pandas as pd
import numpy as np
from datetime import datetime, timedelta
import matplotlib.pyplot as plt
import pickle
# Assign chart style
plt.style.use('seaborn-v0_8')
plt.rcParams["figure.figsize"] = (12,6)
# Handy functions
def save_dict_to_file(data, filename):
    with open(filename, 'wb') as f:
        pickle.dump(data, f)

def load_dict_from_file(filename):
    with open(filename, 'rb') as f:
        return pickle.load(f)

def get_rmse_dataframe(model_dict, rev_dict, start, end, dynamic=True):
    # Collect actual and predicted values for every ticker in every sector
    err_df = pd.DataFrame(columns = ['sector', 'ticker', 'actual', 'predicted'])
    count = 0
    for key in model_dict:
        for ticker in model_dict[key]:
            try:
                y_act = rev_dict[key][ticker].values
                if dynamic:
                    # Dynamic forecast: predictions are fed back in as lagged values
                    y_pred = model_dict[key][ticker].predict(start, end, dynamic=True).values
                else:
                    y_pred = model_dict[key][ticker].predict(start, end).values
            except ValueError as e:
                print(f"{ticker} fails due to {e}")
                y_act = np.nan
                y_pred = np.nan
            err_df.loc[count] = [key.upper(), ticker, y_act, y_pred]
            count += 1
    # Compute raw and scaled RMSE per ticker
    for ticker in err_df['ticker'].to_list():
        actual = err_df.loc[err_df['ticker']==ticker, 'actual'].values[0]
        predicted = err_df.loc[err_df['ticker']==ticker, 'predicted'].values[0]
        rmse = np.sqrt(np.mean((actual - predicted)**2))
        # np.mean also handles the scalar NaN stored for tickers that failed above
        rmse_scaled = rmse/np.mean(actual)
        err_df.loc[err_df['ticker']==ticker, 'rmse'] = rmse
        err_df.loc[err_df['ticker']==ticker, 'rmse_scaled'] = rmse_scaled
    return err_df

def get_rmse_scaled(series):
    return np.sqrt(np.mean((series['actual']-series['predicted'])**2))/np.mean(series['actual'])

def flatten_df(dataf: pd.DataFrame, group_name:str, cols:list) -> pd.DataFrame:
    df_grouped = dataf.groupby(group_name)[cols].agg(list)
    for col in cols:
        df_grouped[col] = df_grouped[col].apply(lambda x: np.concatenate(x))
    df_long = df_grouped.apply(pd.Series.explode).reset_index()
    return df_long

# Symbols used
etf_symbols = ['XLF', 'XLI', 'XLE', 'XLK', 'XLV', 'XLY', 'XLP', 'XLB', 'XLU', 'XLC']
ticker_list = ["SHW", "LIN", "ECL", "FCX", "VMC",
"XOM", "CVX", "COP", "WMB", "SLB",
"JPM", "V", "MA", "BAC", "GS",
"CAT", "RTX", "DE", "UNP", "BA",
"AAPL", "MSFT", "NVDA", "ORCL", "CRM",
"COST", "WMT", "PG", "KO", "PEP",
"NEE", "D", "DUK", "VST", "SRE",
"LLY", "UNH", "JNJ", "PFE", "MRK",
"AMZN", "SBUX", "HD", "BKNG", "MCD",
"META", "GOOG", "NFLX", "T", "DIS"
]
xlb = ["SHW", "LIN", "ECL", "FCX", "VMC"]
xle = ["XOM", "CVX", "COP", "WMB", "SLB"]
xlf = ["JPM", "V", "MA", "BAC", "GS"]
xli = ["CAT", "RTX", "DE", "UNP", "BA"]
xlk = ["AAPL", "MSFT", "NVDA", "ORCL", "CRM"]
xlp = ["COST", "WMT", "PG", "KO", "PEP"]
xlu = ["NEE", "D", "DUK", "VST", "SRE"]
xlv = ["LLY", "UNH", "JNJ", "PFE", "MRK"]
xly = ["AMZN", "SBUX", "HD", "BKNG", "MCD"]
xlc = ["META", "GOOG", "NFLX", "T", "DIS"]
sectors = [xlf, xli, xle, xlk, xlv, xly, xlp, xlb, xlu, xlc]
# Sector dictionary
sector_dict = {symbol.lower(): tickers for symbol, tickers in zip(etf_symbols, sectors)}
# Load data from disk
# See Code Walk-Throughs for how we built the data set
df_sector_dict = load_dict_from_file("hello_world/data/simfin_df_rev_dict.pkl")
# Create functions for indexing
def create_index(series):
    if series.iloc[0] > 0:
        return series/series.iloc[0] * 100
    else:
        return (series - series.iloc[0])/-series.iloc[0] * 100

# Clean dataframes
df_rev_index_dict = {}
for key in sector_dict:
    temp_df = df_sector_dict[key].copy()
    col_1 = temp_df.columns[0]
    # Keep the date column plus the revenue columns, then shorten the names to tickers
    temp_df = temp_df[[col_1] + [x for x in temp_df.columns if 'revenue' in x]]
    temp_df.columns = ['date'] + [x.replace('revenue_', '').lower() for x in temp_df.columns[1:]]
    temp_idx = temp_df.copy()
    # Index every revenue column to its first observation
    value_cols = [x for x in temp_idx if 'date' not in x]
    temp_idx[value_cols] = temp_idx[value_cols].apply(create_index)
    df_rev_index_dict[key] = temp_idx

# Create train/test dataframes
df_rev_train_dict = {}
df_rev_test_dict = {}
for key in df_rev_index_dict:
    df_out = df_rev_index_dict[key]
    df_rev_train_dict[key] = df_out.loc[df_out['date'] < "2023-01-01"]
    df_rev_test_dict[key] = df_out.loc[df_out['date'] >= "2023-01-01"]

# Plot dataframes
fig, axes = plt.subplots(3, 3, sharey=False, sharex=True, figsize=(14,8))
for idx, ax in enumerate(fig.axes):
    etf = etf_symbols[idx].lower()
    stocks = [x.lower() for x in df_rev_train_dict[etf].columns[1:]]
    df_plot = df_rev_train_dict[etf]
    ax.plot(df_plot['date'], df_plot[stocks])
    ax.set_xlabel('')
    ax.tick_params(axis='x', rotation=45)
    ax.legend([x.upper() for x in stocks], fontsize=6, loc='upper left')
    if idx == 3:
        ax.set_ylabel("Index")
    ax.set_title(etf.upper())
plt.tight_layout()
plt.show()
# Plot sequential change
fig, axes = plt.subplots(3, 3, sharey=False, sharex=True, figsize=(14,8))
for idx, ax in enumerate(fig.axes):
    etf = etf_symbols[idx].lower()
    stocks = [x.lower() for x in df_rev_train_dict[etf].columns[1:]]
    df_plot = df_rev_train_dict[etf].copy()
    # Sequential growth as log differences, in percent
    df_plot[stocks] = df_plot[stocks].apply(lambda x: np.log(x).diff()*100)
    ax.plot(df_plot['date'], df_plot[stocks])
    ax.set_xlabel('')
    ax.tick_params(axis='x', rotation=45)
    ax.legend([x.upper() for x in stocks], fontsize=6, loc='upper left')
    if idx == 3:
        ax.set_ylabel("Percent (%)")
    ax.set_title(etf.upper())
plt.tight_layout()
plt.show()
# Create naive growth rate forecast
rmse_chg = []
for etf in etf_symbols:
    key = etf.lower()
    # Create forecast df: log-difference the training data
    fcst_df = df_rev_train_dict[key].copy()
    cols = [x.lower() for x in fcst_df.columns[1:]]
    fcst_df[cols] = fcst_df[cols].apply(lambda x: np.log(x).diff())
    # Exponentiate the last log change to get the growth factor (e.g., 1.03 for +3%)
    fcst = np.exp(fcst_df[cols].iloc[-1].values.astype(float))
    # Get actuals and predicted
    actual = df_rev_test_dict[key].copy()
    predicted = pd.DataFrame(columns=cols)
    # Copy the last training row as floats so compounding cannot alter the training data
    base = df_rev_train_dict[key][cols].iloc[-1].values.astype(float).copy()
    # Compound the growth factor forward four quarters (assumes a four-quarter test set)
    for i in range(4):
        base *= fcst
        predicted.loc[i] = base.copy()
    predicted = predicted.astype(float)
    # Calculate rmse per company and for the sector as a whole
    errors = actual.iloc[:, 1:].T.values - predicted.T.values
    rmse_comp = np.sqrt(np.mean(errors**2, axis=1, dtype=float))
    rmse_all = np.sqrt(np.mean(errors**2, dtype=float))
    rmse = np.append(rmse_comp, rmse_all)
    # Calculate means
    avg = actual.iloc[:, 1:].mean().values
    avg_all = actual.iloc[:, 1:].values.flatten().mean()
    mean_all = np.append(avg, avg_all)
    # Create rmse dataframe
    names = cols + [f'{key}_all']
    temp_df = pd.DataFrame(zip(names, rmse, mean_all), columns=['ticker', 'rmse', 'avg'])
    temp_df['rmse_scaled'] = temp_df['rmse']/temp_df['avg']
    temp_df['sector'] = etf
    rmse_chg.append(temp_df)

# Create error dataframe
chg_err_df = pd.concat(rmse_chg, axis=0).reset_index(drop=True)
print(chg_err_df)
# Graph: keep only the sector-level rows and plot scaled RMSE in percent
sector_rows = chg_err_df['ticker'].isin([f"{x.lower()}_all" for x in etf_symbols])
(chg_err_df.loc[sector_rows, ['sector', 'rmse_scaled']]
    .set_index('sector')
    .mul(100)
    .sort_values('rmse_scaled')
    .plot(kind='bar', rot=0, legend=False))
plt.xlabel('')
plt.ylabel('RMSE (%)')
plt.title('Scaled RMSE by sector for naive growth forecast')
plt.show()