[HFWWT 3]: Walk-Through 3
Trying Yahoo! to find tickers missed by Alpha Vantage and Tiingo
In our last walk-through of the code that builds the Hello Forecasting World dataset, we tried the Tiingo search endpoint to assign tickers to the 50 stocks that comprise the top five holdings in State Street's S&P 500 sector ETFs. Unfortunately, while Tiingo found more tickers than Alpha Vantage, almost 20% of them were useless. We could have spent time debugging, but we decided instead to see whether Yahoo! would prove a better source. Searching Yahoo! Finance turns up lots of company info; the catch is that it returns a lot of it. One could spend a fair amount of time writing functions to handle all that, or one could hack through to a good-enough answer. We chose the latter.
We'll assume you have the scripts from walk-throughs 1 and 2. Let's move to the Yahoo search.
# Try Yahoo
url = "https://query1.finance.yahoo.com/v1/finance/search"
headers = {
    "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:122.0) Gecko/20100101 Firefox/122.0"
}
params = {
    "q": "company",
    "lang": "en-US",
    "region": "US",
}

yahoo_query = []
for name in companies:
    params["q"] = name
    data = requests.get(url, params=params, headers=headers).json()
    df = pd.DataFrame(data["quotes"])
    yahoo_query.append(df)
    # Sleep 1 second so we don't get flagged!
    time.sleep(1)

# Check for missing companies
for idx, ticker in enumerate(yahoo_query):
    if len(ticker) == 0:
        print(companies[idx])
The for loop prints out the following:
META PLATFORMS INC CLASS A
BERKSHIRE HATHAWAY INC CL B
MASTERCARD INC A
MCDONALD S CORP
Not a bad result at first glance. Only four companies are missing from the query, albeit large ones! If worst comes to worst, it should be pretty easy to find the tickers for these companies by hand. But that's not our plan: we want to do as much of this programmatically as possible, to be nice to our future selves by cutting out the manual hunt-and-peck if we need to do this again.
As already mentioned, the Yahoo! query returns a lot of info, much of which is not useful for our purposes. Let's see.
# Concatenate lists into a data frame and print head
comp_info = pd.concat([x for x in yahoo_query if len(x) > 0], axis=0)
print(comp_info.head().iloc[:,:-4]) # remove last four columns which are mostly irrelevant
>>> print(comp_info.head().iloc[:,:-4])
exchange shortname quoteType symbol index score typeDisp longname exchDisp sector sectorDisp industry industryDisp
0 NMS Linde plc EQUITY LIN quotes 20253.0 Equity Linde plc NASDAQ Basic Materials Basic Materials Specialty Chemicals Specialty Chemicals
1 MEX LINDE PLC EQUITY LIN1N.MX quotes 20003.0 Equity Linde plc Mexico NaN NaN NaN NaN
2 LSE LINDE PLC LINDE ORD SHS EQUITY 0M2B.L quotes 20003.0 Equity Linde plc London NaN NaN NaN NaN
3 GER Linde plc R EQUITY LIN.DE quotes 20003.0 Equity Linde plc XETRA Basic Materials Basic Materials Specialty Chemicals Specialty Chemicals
4 VIE LINDE PLC EQUITY LIN.VI quotes 20002.0 Equity Linde plc Vienna Basic Materials Basic Materials Specialty Chemicals Specialty Chemicals
[5 rows x 13 columns]
A few issues should stand out. First, the same company, Linde, appears on multiple exchanges with multiple symbols. We'll need to select the correct exchange code, which turns out to be NYQ (or NMS for NASDAQ listings). Let's drill down a little more into the data to figure out what type of filtering we'll need to do.
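Before filtering on exchange codes, note that Yahoo's score column already ranks results, so an alternative we didn't pursue would be to keep the top-scoring row per longname. A sketch on made-up rows mirroring the Linde printout above (the values are illustrative, not Yahoo's actual responses):

```python
import pandas as pd

# Toy rows echoing the Linde output above; scores and symbols are illustrative
toy = pd.DataFrame({
    "symbol":   ["LIN", "LIN1N.MX", "LIN.DE"],
    "exchange": ["NMS", "MEX", "GER"],
    "score":    [20253.0, 20003.0, 20003.0],
    "longname": ["Linde plc"] * 3,
})

# Keep the single highest-scoring row per company
best = toy.sort_values("score", ascending=False).drop_duplicates("longname")
print(best[["longname", "symbol"]])
```

We stuck with explicit exchange filters instead, since the score semantics are undocumented.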
# Drill down
print(comp_info['exchange'].unique())
print(comp_info.loc[comp_info['exchange'] == "NYQ"].shape[0])
print(comp_info.loc[comp_info['exchange'] == "NYQ",'longname'].nunique())
>>> print(comp_info['exchange'].unique())
['NMS' 'MEX' 'LSE' 'GER' 'VIE' 'DUS' 'STU' 'NYQ' 'MIL' 'SAO' 'FRA' 'HAN'
'BER' 'TLO' 'BUE' 'DXE' 'MUN' 'CXE' 'IOB' 'HAM' 'SGO' 'PNK' 'NEO' 'BVC'
'MSC' 'JPX' 'NGM' 'VAN' 'KSC' 'WSE']
>>> print(comp_info.loc[comp_info['exchange'] == "NYQ"].shape[0])
41
>>> print(comp_info.loc[comp_info['exchange'] == "NYQ",'longname'].nunique())
31
Quite a lot of exchanges! When we select only NYQ we find 41 rows, but counting unique values of longname yields only 31. More filtering necessary! Let's whittle the data down into a more manageable dataframe and see what we can see.
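To see which names account for that 41-versus-31 gap, value_counts flags any longname appearing more than once. A minimal sketch on an invented frame (real column names, made-up rows, since we can't reproduce Yahoo's exact responses here):

```python
import pandas as pd

# Toy stand-in for the NYQ slice of comp_info: invented companies
nyq = pd.DataFrame({
    "longname": ["Acme Corp", "Acme Corp", "Widget Inc"],
    "symbol":   ["ACME", "ACME-W", "WID"],
})

# Names appearing more than once are the ones needing extra filtering
counts = nyq["longname"].value_counts()
print(counts[counts > 1])
```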
# Create filtered dataframe
comp_tickers = comp_info.loc[(comp_info['exchDisp'].isin(["NASDAQ", "NYSE"])) & # Exchanges of interest
                             (comp_info['symbol'].str.len() < 5) &              # Drop long, suffixed tickers (e.g., LIN.DE)
                             comp_info['exchange'].isin(["NMS", "NYQ"]) &       # Belt and suspenders
                             ~comp_info['shortname'].str.contains("%"),         # Anomaly found, but not shown
                             ["shortname", "symbol"]].reset_index(drop=True)    # Columns of interest; reset index
# Print number of rows
print(comp_tickers.shape[0])
# Print random sample of company names and tickers
np.random.seed(123)
print(comp_tickers.sample(5))
>>> comp_tickers = comp_info.loc[(comp_info['exchDisp'].isin(["NASDAQ", "NYSE"])) &
... (comp_info['symbol'].str.len()<5) &
... comp_info['exchange'].isin(["NMS", "NYQ"]) &
... ~comp_info['shortname'].str.contains("%"),
... ["shortname", "symbol"]].reset_index(drop=True)
>>>
>>> np.random.seed(123)
>>> print(comp_tickers.shape[0])
47
>>> print(comp_tickers.sample(5))
shortname symbol
8 EOG Resources, Inc. EOG
11 GE Aerospace GE
44 Tesla, Inc. TSLA
18 NVIDIA Corporation NVDA
43 Amazon.com, Inc. AMZN
This seems better, but there are still some issues. The filtered dataframe has 47 rows, yet we know the Yahoo query missed four companies, so at most 46 rows should be unique: we have at least one duplicate, and probably some incorrect tickers too. We need to check the company names from the Yahoo query against the company names from our State Street scrape. If we had a big reference database we could simply check the Yahoo names against it. But we don't, and we wouldn't call State Street's naming conventions canonical either. So we have to figure out a way to remove punctuation and superfluous abbreviations like corp., comp., and inc. We do that in the code below.
# Normalize function to clean and standardize strings,
# removing common corporate terms but keeping "&"
def normalize(name):
    name = name.upper()                                                               # Convert to uppercase
    name = re.sub(r'\b(CORP(ORATION)?|COMP(ANY)?|HOLD(ING)?|THE|INC\.?)\b', '', name) # Remove CORP/COMP/HOLD/THE/INC variants
    name = re.sub(r'[^A-Z0-9 &]', '', name)                                           # Remove punctuation except "&"
    name = re.sub(r' ', '', name)                                                     # Remove spaces
    return name
# Normalize both lists with the function above
spdr_list = companies.copy()
yahoo_list = comp_tickers['shortname'].to_list()
normalized_spdr_list = [normalize(name) for name in spdr_list]
normalized_yahoo_list = [normalize(name) for name in yahoo_list]

# Filter yahoo_list to keep only items that have a match in spdr_list
cleaned_list = [
    yahoo for yahoo, norm_yahoo in zip(yahoo_list, normalized_yahoo_list)
    if norm_yahoo in normalized_spdr_list
]

# Filter spdr_list to keep only items not matched in yahoo_list
missing_list = [
    spdr for spdr, norm_spdr in zip(spdr_list, normalized_spdr_list)
    if norm_spdr not in normalized_yahoo_list
]

# Print cleaned list and its length
print(cleaned_list)
print(len(cleaned_list))

# Print missing list and its length
print(missing_list)
print(len(missing_list))
The last print functions reveal the following.
>>> print(cleaned_list)
['Linde plc', 'Sherwin-Williams Company (The)', 'Ecolab Inc.', 'Freeport-McMoRan, Inc.', 'Netflix, Inc.', 'Exxon Mobil Corporation', 'Chevron Corporation', 'ConocoPhillips', 'EOG Resources, Inc.', 'JP Morgan Chase & Co.', 'Bank of America Corporation', 'Caterpillar, Inc.', 'RTX Corporation', 'Honeywell International Inc.', 'Union Pacific Corporation', 'Apple Inc.', 'NVIDIA Corporation', 'Microsoft Corporation', 'Broadcom Inc.', 'Salesforce, Inc.', 'Costco Wholesale Corporation', 'Walmart Inc.', 'Procter & Gamble Company (The)', 'Coca-Cola Company (The)', 'Pepsico, Inc.', 'NextEra Energy, Inc.', 'Southern Company (The)', 'Vistra Corp.', 'Johnson & Johnson', 'AbbVie Inc.', 'Amazon.com, Inc.', 'Tesla, Inc.', 'Home Depot, Inc. (The)']
>>> print(len(cleaned_list))
33
>>> # Print missing list
>>>
>>> print(missing_list)
['AIR PRODUCTS & CHEMICALS INC', 'META PLATFORMS INC CLASS A', 'ALPHABET INC CL A', 'ALPHABET INC CL C', 'TAKE TWO INTERACTIVE SOFTWRE', 'WILLIAMS COS INC', 'BERKSHIRE HATHAWAY INC CL B', 'VISA INC CLASS A SHARES', 'MASTERCARD INC A', 'GENERAL ELECTRIC', 'CONSTELLATION ENERGY', 'DUKE ENERGY CORP', 'ELI LILLY & CO', 'UNITEDHEALTH GROUP INC', 'MERCK & CO. INC.', 'MCDONALD S CORP', 'BOOKING HOLDINGS INC']
>>> print(len(missing_list))
17
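As a quick sanity check on the matching logic, here is the normalize function applied to a few names in the two styles we've seen, Yahoo's mixed case and State Street's all caps (the function is repeated so the snippet stands alone; the example names are illustrative):

```python
import re

def normalize(name):
    name = name.upper()
    name = re.sub(r'\b(CORP(ORATION)?|COMP(ANY)?|HOLD(ING)?|THE|INC\.?)\b', '', name)
    name = re.sub(r'[^A-Z0-9 &]', '', name)  # strip punctuation, keep "&"
    name = re.sub(r' ', '', name)            # strip spaces
    return name

# Differently styled names collapse to the same key
print(normalize("Apple Inc."))             # APPLE
print(normalize("APPLE INC"))              # APPLE
print(normalize("JP Morgan Chase & Co."))  # JPMORGANCHASE&CO
```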
We matched 33 of the 50 companies we need, or 66%. A so-so result, but better than a coin flip. The remaining companies are well-known enough that we can add their tickers to the final list by hand. That is definitely less than ideal; we would have preferred to solve this programmatically, but that might have taken too long and still not returned a great result. Such are the vagaries of open-source, reproducible research. We'll put together the ticker list, add the missing tickers, and then we'll be able to move to the next step: getting the revenue data. That will have to wait for the next walk-through, however. The remaining code is below.
# Get tickers
tickers = comp_tickers.loc[comp_tickers['shortname'].isin(cleaned_list), "symbol"].to_list()
ticker_full = tickers.copy()
# Create manually searched tickers
manual_tickers = ["APD", "META", "GOOGL", "GOOG", "TTWO", "WMB", "BRK.B", "V", "MA", "GE", "CEG", "DUK", "LLY", "UNH", "MRK", "MCD", "BKNG"]
assert len(ticker_full) + len(manual_tickers) == 50, "Still missing tickers!"
# Add missing tickers
ticker_full.extend(manual_tickers)
# Print ticker_full
print(ticker_full)
>>> print(ticker_full)
['LIN', 'SHW', 'ECL', 'FCX', 'NFLX', 'XOM', 'CVX', 'COP', 'EOG', 'JPM', 'BAC', 'CAT', 'RTX', 'HON', 'UNP', 'AAPL', 'NVDA', 'MSFT', 'AVGO', 'CRM', 'COST', 'WMT', 'PG', 'KO', 'PEP', 'NEE', 'SO', 'VST', 'JNJ', 'ABBV', 'AMZN', 'TSLA', 'HD', 'APD', 'META', 'GOOGL', 'GOOG', 'TTWO', 'WMB', 'BRK.B', 'V', 'MA', 'GE', 'CEG', 'DUK', 'LLY', 'UNH', 'MRK', 'MCD', 'BKNG']
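As a postscript: the programmatic route we skipped could be sketched with fuzzy matching from the standard library's difflib, which scores string similarity rather than requiring exact normalized matches. The name pairs below are illustrative stand-ins for normalized_spdr_list and normalized_yahoo_list; the 0.6 cutoff is difflib's default and would need tuning on real data:

```python
from difflib import get_close_matches

# Illustrative pairs in the two styles we saw above
spdr_names = ["MCDONALD S CORP", "BERKSHIRE HATHAWAY INC CL B"]
yahoo_names = ["McDonald's Corporation", "Berkshire Hathaway Inc."]

for name in spdr_names:
    # Return the single best candidate scoring above the similarity cutoff
    match = get_close_matches(name, [y.upper() for y in yahoo_names], n=1, cutoff=0.6)
    print(name, "->", match[0] if match else "no match")
```

The risk, of course, is false positives on similar names (the two Alphabet share classes, for instance), which is partly why we settled for the exact-match-plus-manual approach above.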