[HFWWT 1]: Hello Forecasting World Walk Through 1
Identify top 5 stocks by S&P 500 sector ETF and use Alpha Vantage API to find tickers
In this post, we walk through how we built up the dataset we use for our revenue forecasting series. While this might be a bit dry, we hope to show the logic and hurdles we had to overcome so that readers can, if they want, reproduce the same dataset for themselves. Let's go!
There are a number of financial APIs available to retail investors, with both free and paid versions. We could easily build something using a paid API, but that would not be reproducible for readers who don't want to fork out the dough for such a thing. A few of the available APIs are Alpha Vantage, SimFin, and Tiingo. We've used all of these in one form or another. We've also used others like Finnhub and Financial Modeling Prep (FMP). The FundamentalAnalysis package was the main way we interacted with FMP, which you can see in our post on forecasting net income using neural networks. However, that package was deprecated and ported to FinanceToolkit. We did not feel like learning another package to get the series going, so we won't use FMP or Finnhub. We may come back to these packages in the future. Whatever the case, we have three APIs to use plus the ever useful yfinance package.
We first need to find the top 10 companies in each of the sectors, that is, the underlying components of the Select Sector SPDR® ETFs: ['XLF', 'XLI', 'XLE', 'XLK', 'XLV', 'XLY', 'XLP', 'XLB', 'XLU', 'XLC']. After loading our packages and setting our API keys, we instantiate the symbols and create a function to scrape the State Street website that has this data handy.
# Load packages
import os
import requests
import pandas as pd
import numpy as np
import yfinance as yf
import matplotlib.pyplot as plt
from matplotlib.ticker import FuncFormatter
import matplotlib as mpl
import time
import re
from dotenv import load_dotenv
import pickle
plt.style.use('seaborn-v0_8')
plt.rcParams["figure.figsize"] = (12,6)
# File save/load functions
def save_dict_to_file(data, filename):
    with open(filename, 'wb') as f:
        pickle.dump(data, f)

def load_dict_from_file(filename):
    with open(filename, 'rb') as f:
        return pickle.load(f)
# API keys
load_dotenv()
av_api_key = os.getenv('ALPHAVANTAGE_API_KEY')
ti_api_key = os.getenv('TIINGO_API_KEY')
simfin_api_key = os.getenv('SIMFIN_API_KEY')
etf_symbols = ['XLF', 'XLI', 'XLE', 'XLK', 'XLV', 'XLY', 'XLP', 'XLB', 'XLU', 'XLC']
def get_etf_holdings(etf_symbol):
    url = f"https://www.ssga.com/us/en/intermediary/etfs/the-financial-select-sector-spdr-fund-{etf_symbol.lower()}"
    # Send a GET request to fetch the webpage content
    response = requests.get(url)
    response.raise_for_status()  # Check for request errors
    # Read the HTML content into a list of DataFrames
    dfs = pd.read_html(response.text)
    # Iterate through the DataFrames to find the one containing the Top Ten Holdings
    for df in dfs:
        if all(col in df.columns for col in ['Name', 'Shares Held', 'Weight']):
            top_ten_holdings = df
            break
    top_ten_holdings.columns = ['name', 'shares', 'weight']
    top_ten_holdings['sector'] = etf_symbol
    top_ten_holdings = top_ten_holdings[['sector', 'name', 'shares', 'weight']]
    top_ten_holdings['shares'] = pd.to_numeric(top_ten_holdings['shares'], errors='coerce')
    top_ten_holdings['weight'] = top_ten_holdings['weight'].apply(lambda x: x.replace("%",''))
    top_ten_holdings['weight'] = pd.to_numeric(top_ten_holdings['weight'], errors='coerce')
    return top_ten_holdings
# Get holdings
all_holdings = []
for symbol in etf_symbols:
    holdings = get_etf_holdings(symbol)
    all_holdings.append(holdings)
# Convert holdings lists to dataframe
all_holdings = pd.concat([holding for holding in all_holdings], axis=0)
# Check correct datatypes
print(all_holdings.info())
# Save to csv if desired
# all_holdings.to_csv("path/to/data/all_holdings.csv", index=False)
# Load csv if desired
# all_holdings = pd.read_csv("path/to/data/all_holdings.csv")
A few notes on strategy. We use dotenv because it creates a nice way to access API keys stored in a .env file, which is a good way to store keys and secrets you don't want to share if you're working with production code or a local branch of a repo that you'll eventually merge (or hope to!) with the main branch.
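For reference, the .env file itself is nothing more than a plain text file of key-value pairs sitting in the project root. The variable names below match the os.getenv calls in the code above; the values are placeholders for your own keys:
ALPHAVANTAGE_API_KEY=your_alpha_vantage_key_here
TIINGO_API_KEY=your_tiingo_key_here
SIMFIN_API_KEY=your_simfin_key_here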
You may notice that while running the get_etf_holdings function you get a warning such as:
<stdin>:8: FutureWarning: Passing literal html to 'read_html' is deprecated and will be removed in a future version. To read from a literal string, wrap it in a 'StringIO' object
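If you want to address it directly, a small tweak to the read_html call inside get_etf_holdings, wrapping the literal HTML in a StringIO object as the message suggests, makes the warning go away:
from io import StringIO
# Inside get_etf_holdings, instead of pd.read_html(response.text):
dfs = pd.read_html(StringIO(response.text))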
Either way, you can safely ignore the warning. Some folks elect to import warnings to turn off such messages. I'd rather see them, but know to ignore them. When you print(all_holdings.info())
you should see the following:
<class 'pandas.core.frame.DataFrame'>
Index: 100 entries, 0 to 9
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 sector 100 non-null object
1 name 100 non-null object
2 shares 100 non-null int64
3 weight 100 non-null float64
dtypes: float64(1), int64(1), object(2)
memory usage: 3.9+ KB
It's critical that shares
and weight
are in some numeric format in order to sort those values. We took care of that in the get_etf_holdings
function above. Next we group by sector and find the top 5 stocks by weight.
# Create Top 5 holdings dataframe
top_5 = all_holdings.groupby('sector')[['sector','name', 'weight']].apply(lambda x: x.nlargest(5, 'weight')).reset_index(drop=True)
print(top_5)
The resulting data frame has 50 rows, which we won't show in full below for space reasons. Nonetheless, you should see something like the following:
sector name weight
0 XLB LINDE PLC 21.05
1 XLB SHERWIN WILLIAMS CO/THE 8.17
2 XLB AIR PRODUCTS + CHEMICALS INC 6.96
3 XLB ECOLAB INC 6.18
4 XLB FREEPORT MCMORAN INC 5.98
-----------------------------------------------
44 XLV MERCK + CO. INC. 4.91
45 XLY AMAZON.COM INC 20.97
46 XLY TESLA INC 17.90
47 XLY HOME DEPOT INC 6.29
48 XLY MCDONALD S CORP 4.59
49 XLY BOOKING HOLDINGS INC 4.31
If you are running this weeks or months after we first published the post, those companies will probably be different. But the key here is to confirm you're getting output as close as possible to what I'm getting.
Let's move on. While this is a very useful list, and assuming State Street doesn't block scraping or change the format of its website for the SPDRs anytime soon, this should also be a good function to get the names of the companies on which we want to collect data in the future. The main problem: no tickers!
We'll have to use the APIs we mentioned above to find the tickers associated with the companies. You'll also notice that some of the companies have odd formatting—for example, MCDONALD S CORP
or SHERWIN WILLIAMS CO/THE
. This will require some text cleaning/regex. This is definitely not the most interesting part of the data wrangling process. Indeed, we find working with regex mostly opaque and a frustrating experience best avoided.
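Just to make that concrete, here's the flavor of thing we mean. This is a throwaway sketch, not the cleaning we actually use below, and the patterns are only guesses aimed at the two examples above:
import re

def clean_name_sketch(name: str) -> str:
    # Hypothetical helper: restore the possessive in names like "MCDONALD S CORP"
    name = re.sub(r"\b(\w+) S\b", r"\1'S", name)
    # Drop the "/THE" suffix convention, e.g., "SHERWIN WILLIAMS CO/THE"
    name = re.sub(r"/THE\b", "", name)
    return name.strip()

print(clean_name_sketch("MCDONALD S CORP"))          # MCDONALD'S CORP
print(clean_name_sketch("SHERWIN WILLIAMS CO/THE"))  # SHERWIN WILLIAMS CO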
Let's start with Alpha Vantage. It has a symbol search endpoint, which we initially thought could work. Note: if you're on the free version, there is a daily limit of 25 API calls at the time of writing, so you might want to take my word for it rather than waste the requests. Spoiler alert: this endpoint does not work so well for our purposes.
# Create function to extract the name from the top_5 holdings dataframe and change "+" to "&" to match conventions in APIs
def extract_words(name: str) -> str:
    # Swap "+" for "&" to match the naming conventions used by the APIs
    if "+" in name:
        parts = name.split()
        plus_index = parts.index("+")
        name = " ".join(parts[:plus_index]) + " & " + " ".join(parts[plus_index + 1:])
    # Drop the "CO/THE" suffix (e.g., SHERWIN WILLIAMS CO/THE) and trim whitespace
    name = name.replace("CO/THE", "")
    return name.strip()
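# Quick optional sanity check on a couple of the oddly formatted names from the holdings table
# (the expected outputs shown assume the split/replace/strip logic in extract_words above)
print(extract_words("AIR PRODUCTS + CHEMICALS INC"))  # AIR PRODUCTS & CHEMICALS INC
print(extract_words("SHERWIN WILLIAMS CO/THE"))       # SHERWIN WILLIAMS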
# Create function to find the ticker associated with each company name using Alpha Vantage's symbol search
def get_ticker_alpha_vantage(company_name: str, api_key: str | None) -> str:
    url = f'https://www.alphavantage.co/query?function=SYMBOL_SEARCH&keywords={company_name}&apikey={api_key}'
    r = requests.get(url)
    data = r.json()
    dat = pd.DataFrame(data['bestMatches'])
    ticker = dat.loc[(dat["4. region"] == 'United States') & (dat['8. currency'] == 'USD') & (dat["1. symbol"].str.len() <= 4), "1. symbol"].values[0]
    return ticker
# Get company names
companies = top_5['name'].to_list()
companies = [extract_words(name) for name in companies]

tickers_av = []
missing_av = []
for company in companies:
    try:
        ticker = get_ticker_alpha_vantage(company, av_api_key)
        tickers_av.append(ticker)
    except Exception as e:
        missing_av.append(company)
        print(f"Error for {company}: {e}")

assert len(tickers_av) + len(missing_av) == 50, "You lost some tickers!"

if len(missing_av) > 0:
    print(f"Missing {len(missing_av)} tickers!")
else:
    print("Good to go!")
If you do run all of the above functions and loops, you'll notice that we're wearing a belt and suspenders with the except
and assert
logic. The except
is there as a placeholder to ensure we can iterate through all the companies without the for loop failing. The assert
is there in case, for some reason, the two lists don't account for all 50 companies. Finally, the last condition checks how many missing tickers we actually have. That should print out the following: Missing 34 tickers!
(Note: you may get one or two more or less than this.) Whatever the case, that's too many for our purposes—almost 70% missing. We could, of course, fix some of the regex and try again, or simply move on to the next API with the hope that it is in fact better. Since we’re not committed to one API or another, we’ll look at Tiingo in the next walk-through. Stay tuned!