[HFWWT 3]: Walk-Through 3
Trying Yahoo! to find tickers missed by Alpha Vantage and Tiingo
In our last walk-through of the code that builds the Hello Forecasting World dataset, we tried the Tiingo search endpoint to assign tickers to the 50 stocks that comprise the top five holdings in State Street's S&P 500 sector ETFs. Unfortunately, while Tiingo found more tickers than Alpha Vantage, almost 20% of them were useless. We could have spent time debugging, but we decided instead to see whether Yahoo! would prove a better source. Searching Yahoo! Finance turns up lots of company info; the catch is that it returns a lot of it. One could spend a fair amount of time writing functions to handle all that, or one could hack through to a good-enough answer. We chose the latter.
We'll assume you have the scripts from walk-throughs 1 and 2. Let's move to the Yahoo search.
# Try Yahoo
url = "https://query1.finance.yahoo.com/v1/finance/search"
headers = {
    "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:122.0) Gecko/20100101 Firefox/122.0"
}
params = {
    "q": "company",
    "lang": "en-US",
    "region": "US",
}

yahoo_query = []
for name in companies:
    params["q"] = name
    data = requests.get(url, params=params, headers=headers).json()
    df = pd.DataFrame(data["quotes"])
    yahoo_query.append(df)
    # Sleep 1 second so we don't get flagged!
    time.sleep(1)

# Check for missing companies
for idx, ticker in enumerate(yahoo_query):
    if len(ticker) == 0:
        print(companies[idx])
The for loop prints out the following:
META PLATFORMS INC CLASS A
BERKSHIRE HATHAWAY INC CL B
MASTERCARD INC A
MCDONALD S CORP
Not a bad result at first glance. Only four companies are missing from the query, albeit large ones! If worst comes to worst, it should be pretty easy to find the tickers for these companies by hand. But that's not our plan: we want to do as much of this programmatically as possible, to be nice to our future selves by cutting out the manual hunt-and-peck if we need to do this again.
As already mentioned, the Yahoo! query returns a lot of info, much of which is not useful for our purposes. Let's see.
# Concatenate lists into a data frame and print head
comp_info = pd.concat([x for x in yahoo_query if len(x) > 0], axis=0)
print(comp_info.head().iloc[:,:-4]) # remove last four columns which are mostly irrelevant
>>> print(comp_info.head().iloc[:,:-4])
exchange shortname quoteType symbol index score typeDisp longname exchDisp sector sectorDisp industry industryDisp
0 NMS Linde plc EQUITY LIN quotes 20253.0 Equity Linde plc NASDAQ Basic Materials Basic Materials Specialty Chemicals Specialty Chemicals
1 MEX LINDE PLC EQUITY LIN1N.MX quotes 20003.0 Equity Linde plc Mexico NaN NaN NaN NaN
2 LSE LINDE PLC LINDE ORD SHS EQUITY 0M2B.L quotes 20003.0 Equity Linde plc London NaN NaN NaN NaN
3 GER Linde plc R EQUITY LIN.DE quotes 20003.0 Equity Linde plc XETRA Basic Materials Basic Materials Specialty Chemicals Specialty Chemicals
4 VIE LINDE PLC EQUITY LIN.VI quotes 20002.0 Equity Linde plc Vienna Basic Materials Basic Materials Specialty Chemicals Specialty Chemicals
[5 rows x 13 columns]
A few issues should stand out. First, the same company, Linde, appears on multiple exchanges with multiple symbols. We'll need to select the correct exchange code, which turns out to be NYQ (or NMS for NASDAQ listings). Let's drill down a little more into the data to figure out what type of filtering we'll need to do.
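Before filtering on exchange codes, note that Yahoo's score column already ranks results, so an alternative we didn't pursue would be to keep the top-scoring row per longname. A sketch on made-up rows mirroring the Linde printout above (the values are illustrative, not Yahoo's actual responses):

```python
import pandas as pd

# Toy rows echoing the Linde output above; scores and symbols are illustrative
toy = pd.DataFrame({
    "symbol":   ["LIN", "LIN1N.MX", "LIN.DE"],
    "exchange": ["NMS", "MEX", "GER"],
    "score":    [20253.0, 20003.0, 20003.0],
    "longname": ["Linde plc"] * 3,
})

# Keep the single highest-scoring row per company
best = toy.sort_values("score", ascending=False).drop_duplicates("longname")
print(best[["longname", "symbol"]])
```

We stuck with explicit exchange filters instead, since the score semantics are undocumented.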
# Drill down
print(comp_info['exchange'].unique())
print(comp_info.loc[comp_info['exchange'] == "NYQ"].shape[0])
print(comp_info.loc[comp_info['exchange'] == "NYQ",'longname'].nunique())
>>> print(comp_info['exchange'].unique())
['NMS' 'MEX' 'LSE' 'GER' 'VIE' 'DUS' 'STU' 'NYQ' 'MIL' 'SAO' 'FRA' 'HAN'
'BER' 'TLO' 'BUE' 'DXE' 'MUN' 'CXE' 'IOB' 'HAM' 'SGO' 'PNK' 'NEO' 'BVC'
'MSC' 'JPX' 'NGM' 'VAN' 'KSC' 'WSE']
>>> print(comp_info.loc[comp_info['exchange'] == "NYQ"].shape[0])
41
>>> print(comp_info.loc[comp_info['exchange'] == "NYQ",'longname'].nunique())
31
Quite a lot of exchanges! When we select only NYQ we find 41 rows, but counting unique values of longname yields only 31. More filtering necessary! Let's whittle the data down into a more manageable dataframe and see what we can see.
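To see which names account for that 41-versus-31 gap, value_counts flags any longname appearing more than once. A minimal sketch on an invented frame (real column names, made-up rows, since we can't reproduce Yahoo's exact responses here):

```python
import pandas as pd

# Toy stand-in for the NYQ slice of comp_info: invented companies
nyq = pd.DataFrame({
    "longname": ["Acme Corp", "Acme Corp", "Widget Inc"],
    "symbol":   ["ACME", "ACME-W", "WID"],
})

# Names appearing more than once are the ones needing extra filtering
counts = nyq["longname"].value_counts()
print(counts[counts > 1])
```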
# Create filtered dataframe
comp_tickers = comp_info.loc[(comp_info['exchDisp'].isin(["NASDAQ", "NYSE"])) & # Exchanges of interest
                             (comp_info['symbol'].str.len() < 5) &              # Drop long, suffixed tickers (e.g., LIN.DE)
                             comp_info['exchange'].isin(["NMS", "NYQ"]) &       # Belt and suspenders
                             ~comp_info['shortname'].str.contains("%"),         # Anomaly found, but not shown
                             ["shortname", "symbol"]].reset_index(drop=True)    # Columns of interest; reset index
# Print number of rows
print(comp_tickers.shape[0])
# Print random sample of company names and tickers
np.random.seed(123)
print(comp_tickers.sample(5))
>>> comp_tickers = comp_info.loc[(comp_info['exchDisp'].isin(["NASDAQ", "NYSE"])) &
... (comp_info['symbol'].str.len()<5) &
... comp_info['exchange'].isin(["NMS", "NYQ"]) &
... ~comp_info['shortname'].str.contains("%"),
... ["shortname", "symbol"]].reset_index(drop=True)
>>>
>>> np.random.seed(123)
>>> print(comp_tickers.shape[0])
47
>>> print(comp_tickers.sample(5))
shortname symbol
8 EOG Resources, Inc. EOG
11 GE Aerospace GE
44 Tesla, Inc. TSLA
18 NVIDIA Corporation NVDA
43 Amazon.com, Inc. AMZN
This seems better, but there are still some issues. The filtered dataframe has 47 rows, yet we know the Yahoo query missed four companies, so at most 46 rows should be unique: we have at least one duplicate, and probably some incorrect tickers too. We need to check the company names from the Yahoo query against the company names from our State Street scrape. If we had a big reference database we could simply check the Yahoo names against it. But we don't, and we wouldn't call State Street's naming conventions canonical either. So we have to figure out a way to remove punctuation and superfluous abbreviations like corp., comp., and inc. We do that in the code below.
# Normalize function to clean and standardize strings,
# removing common corporate terms but keeping "&"
def normalize(name):
    name = name.upper()                                                               # Convert to uppercase
    name = re.sub(r'\b(CORP(ORATION)?|COMP(ANY)?|HOLD(ING)?|THE|INC\.?)\b', '', name) # Remove CORP/COMP/HOLD/THE/INC variants
    name = re.sub(r'[^A-Z0-9 &]', '', name)                                           # Remove punctuation except "&"
    name = re.sub(r' ', '', name)                                                     # Remove spaces
    return name
# Normalize both lists with the function above
spdr_list = companies.copy()
yahoo_list = comp_tickers['shortname'].to_list()
normalized_spdr_list = [normalize(name) for name in spdr_list]
normalized_yahoo_list = [normalize(name) for name in yahoo_list]

# Filter yahoo_list to keep only items that have a match in spdr_list
cleaned_list = [
    yahoo for yahoo, norm_yahoo in zip(yahoo_list, normalized_yahoo_list)
    if norm_yahoo in normalized_spdr_list
]

# Filter spdr_list to keep only items not matched in yahoo_list
missing_list = [
    spdr for spdr, norm_spdr in zip(spdr_list, normalized_spdr_list)
    if norm_spdr not in normalized_yahoo_list
]

# Print cleaned list and its length
print(cleaned_list)
print(len(cleaned_list))

# Print missing list and its length
print(missing_list)
print(len(missing_list))
The last print functions reveal the following.
>>> print(cleaned_list)
['Linde plc', 'Sherwin-Williams Company (The)', 'Ecolab Inc.', 'Freeport-McMoRan, Inc.', 'Netflix, Inc.', 'Exxon Mobil Corporation', 'Chevron Corporation', 'ConocoPhillips', 'EOG Resources, Inc.', 'JP Morgan Chase & Co.', 'Bank of America Corporation', 'Caterpillar, Inc.', 'RTX Corporation', 'Honeywell International Inc.', 'Union Pacific Corporation', 'Apple Inc.', 'NVIDIA Corporation', 'Microsoft Corporation', 'Broadcom Inc.', 'Salesforce, Inc.', 'Costco Wholesale Corporation', 'Walmart Inc.', 'Procter & Gamble Company (The)', 'Coca-Cola Company (The)', 'Pepsico, Inc.', 'NextEra Energy, Inc.', 'Southern Company (The)', 'Vistra Corp.', 'Johnson & Johnson', 'AbbVie Inc.', 'Amazon.com, Inc.', 'Tesla, Inc.', 'Home Depot, Inc. (The)']
>>> print(len(cleaned_list))
33
>>> # Print missing list
>>>
>>> print(missing_list)
['AIR PRODUCTS & CHEMICALS INC', 'META PLATFORMS INC CLASS A', 'ALPHABET INC CL A', 'ALPHABET INC CL C', 'TAKE TWO INTERACTIVE SOFTWRE', 'WILLIAMS COS INC', 'BERKSHIRE HATHAWAY INC CL B', 'VISA INC CLASS A SHARES', 'MASTERCARD INC A', 'GENERAL ELECTRIC', 'CONSTELLATION ENERGY', 'DUKE ENERGY CORP', 'ELI LILLY & CO', 'UNITEDHEALTH GROUP INC', 'MERCK & CO. INC.', 'MCDONALD S CORP', 'BOOKING HOLDINGS INC']
>>> print(len(missing_list))
17
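As a quick sanity check on the matching logic, here is the normalize function applied to a few names in the two styles we've seen, Yahoo's mixed case and State Street's all caps (the function is repeated so the snippet stands alone; the example names are illustrative):

```python
import re

def normalize(name):
    name = name.upper()
    name = re.sub(r'\b(CORP(ORATION)?|COMP(ANY)?|HOLD(ING)?|THE|INC\.?)\b', '', name)
    name = re.sub(r'[^A-Z0-9 &]', '', name)  # strip punctuation, keep "&"
    name = re.sub(r' ', '', name)            # strip spaces
    return name

# Differently styled names collapse to the same key
print(normalize("Apple Inc."))             # APPLE
print(normalize("APPLE INC"))              # APPLE
print(normalize("JP Morgan Chase & Co."))  # JPMORGANCHASE&CO
```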
We matched 33 of the 50 companies we need, or 66%. A so-so result, but better than a coin flip. The remaining companies are well-known enough that we can add their tickers to the final list by hand. That is definitely less than ideal; we would have preferred to solve this programmatically, but that might have taken too long and still not returned a great result. Such are the vagaries of open-source, reproducible research. We'll put together the ticker list, add the missing tickers, and then we'll be able to move to the next step: getting the revenue data. That will have to wait for the next walk-through, however. The remaining code is below.
# Get tickers
tickers = comp_tickers.loc[comp_tickers['shortname'].isin(cleaned_list), "symbol"].to_list()
ticker_full = tickers.copy()
# Create manually searched tickers
manual_tickers = ["APD", "META", "GOOGL", "GOOG", "TTWO", "WMB", "BRK.B", "V", "MA", "GE", "CEG", "DUK", "LLY", "UNH", "MRK", "MCD", "BKNG"]
assert len(ticker_full) + len(manual_tickers) == 50, "Still missing tickers!"
# Add missing tickers
ticker_full.extend(manual_tickers)
# Print ticker_full
print(ticker_full)
>>> print(ticker_full)
['LIN', 'SHW', 'ECL', 'FCX', 'NFLX', 'XOM', 'CVX', 'COP', 'EOG', 'JPM', 'BAC', 'CAT', 'RTX', 'HON', 'UNP', 'AAPL', 'NVDA', 'MSFT', 'AVGO', 'CRM', 'COST', 'WMT', 'PG', 'KO', 'PEP', 'NEE', 'SO', 'VST', 'JNJ', 'ABBV', 'AMZN', 'TSLA', 'HD', 'APD', 'META', 'GOOGL', 'GOOG', 'TTWO', 'WMB', 'BRK.B', 'V', 'MA', 'GE', 'CEG', 'DUK', 'LLY', 'UNH', 'MRK', 'MCD', 'BKNG']
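As a postscript: the programmatic route we skipped could be sketched with fuzzy matching from the standard library's difflib, which scores string similarity rather than requiring exact normalized matches. The name pairs below are illustrative stand-ins for normalized_spdr_list and normalized_yahoo_list; the 0.6 cutoff is difflib's default and would need tuning on real data:

```python
from difflib import get_close_matches

# Illustrative pairs in the two styles we saw above
spdr_names = ["MCDONALD S CORP", "BERKSHIRE HATHAWAY INC CL B"]
yahoo_names = ["McDonald's Corporation", "Berkshire Hathaway Inc."]

for name in spdr_names:
    # Return the single best candidate scoring above the similarity cutoff
    match = get_close_matches(name, [y.upper() for y in yahoo_names], n=1, cutoff=0.6)
    print(name, "->", match[0] if match else "no match")
```

The risk, of course, is false positives on similar names (the two Alphabet share classes, for instance), which is partly why we settled for the exact-match-plus-manual approach above.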