My trip down a rabbit hole in pursuit of understanding Python

As a resident of Pennsylvania, I was curious about the distribution of votes for US President across geography (e.g. rural PA compared with urban PA) as well as vote modality (e.g. in-person compared with by mail). I understood that it would take at least a few days after the election to count votes cast by mail, and analysts had suggested that Trump supporters would be represented at a higher relative rate among in-person votes. Still, I was surprised by the frequency and format of the reporting over the days that followed. At 10:33am on 11/6/20, I calculated that Biden needed approximately 61% of the remaining votes to pull ahead of Trump in Pennsylvania. I took some notes during the day between tasks:

  • 10:33am: 540 votes were reported and Biden won 49% of them; I got on a conference call.
  • 11:02am: 73 votes were reported and Biden won 81% of them; I got on another conference call.
  • 11:51am: 18,295 votes were reported and Biden won 88% of them; I made a telephone call.
  • 11:59am: 123 votes were reported and Biden won 50% of them.

I thought, "What the hell kind of system are we running here?"
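
For the curious, the 61% figure is simple break-even arithmetic: if the trailing candidate is down by a deficit D with R votes left to count, pulling even requires a share of 1/2 + D/(2R) of the remainder. A minimal sketch in Python (the deficit and remaining-vote numbers below are hypothetical, chosen only to illustrate the calculation, not the actual Pennsylvania totals that morning):

    # Hypothetical numbers for illustration only -- not the actual
    # Pennsylvania totals on the morning of 11/6/20.
    def breakeven_share(deficit, remaining):
        """Share of the remaining two-way vote the trailing candidate
        needs just to pull even, assuming every remaining vote goes
        to one of the two candidates."""
        return 0.5 + deficit / (2 * remaining)

    print(f"{breakeven_share(90_000, 410_000):.1%}")  # -> 61.0%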

Around lunchtime, I shared my observations with my brother, and he asked how I was pulling the data. It occurred to me that, since helping establish an insurtech startup, Insurance Skout, some years earlier, I had been interested in learning how to automate processes, gather data, and create cool visualizations. Yet I had never taken the time to invest in that learning.

And so began my brief journey down the rabbit hole of Python. I have never taken a computer science class or learned programming, aside from a simple, amateur effort to learn LPC for text-based video game development in high school. I thought I would invest a few hours into learning the language, and the experience has been super interesting. I expected it would be a challenge to learn the language and its syntax; I hoped I could learn enough over a pair of weekends to develop a sense of the language, grab data via an API, and then conduct some analysis and visualization of that data. So, my partially read books continue to be partially read; I instead invested some time over these past two weekends in a Python for Dummies book and a few YouTube videos.

Where to start

After reviewing the cursory Python for Dummies introduction and the YouTube videos, I considered areas I might be curious about, including the following:

  • Insurance or health care data (as a result of my business)
  • Financial markets data (as a result of personal interest in markets)
  • Weather data (as a matter of curiosity)

I discovered tons of cool, publicly available (read: free) data sets. I decided to begin working with some global data on COVID cases, as well as a public API available through the FDIC. My objective was to develop enough comfort with Python and data structures to add, by today (11/14), an “arrow to my quiver” of tools for getting and analyzing data.

COVID data

I started with the dataset that is publicly available from The Covid Tracking Project, which is the dataset used by Johns Hopkins to power their Covid Tracker project. I was able to learn enough to grab some data, in this instance via CSV file, and then conduct some analysis and visualization. One notable issue I struggled with was moving from a full data file covering all countries to a limited report focused on four, which I chose based upon a blog post I wrote earlier this year about introductory math during COVID. Along the way I learned about the pandas module and its DataFrames. There is something wrong with the way my calculations are counting cases, but aside from the need to fix that math and to get Matplotlib to stop overwriting labels, I admit feeling a sense of satisfaction in building this.
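
A minimal sketch of that filtering step, assuming a CSV with date, location, and total_cases columns (the file name, column names, and country list here are my placeholders; the real file's schema may differ):

    import pandas as pd
    import matplotlib.pyplot as plt

    # Hypothetical file and column names for illustration;
    # the actual dataset's schema may differ.
    df = pd.read_csv("covid_data.csv", parse_dates=["date"])

    # Move from the full all-countries file to a short list.
    countries = ["United States", "Germany", "South Korea", "Italy"]
    subset = df[df["location"].isin(countries)]

    # One line per country. autofmt_xdate() rotates the date labels
    # so they stop overwriting one another.
    fig, ax = plt.subplots()
    for name, group in subset.groupby("location"):
        ax.plot(group["date"], group["total_cases"], label=name)
    ax.set_ylabel("Cumulative cases")
    ax.legend()
    fig.autofmt_xdate()
    plt.show()

Grouping by country keeps the plotting loop the same no matter how many countries end up in the list.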

A screenshot of output from my program looking at COVID data in Python

FDIC data

I also thought it would be interesting to attempt an analysis of bank performance by gathering data from the FDIC's public API via a get() request. Here I wrestled with writing code that could systematically and easily change my report to include different financial measures (e.g. assets, ROA, ROE, deposits), different states (e.g. Pennsylvania and Maryland), and a different number of observations. I found it much easier to troubleshoot my dozens of errors with one object at a time, rather than all possible objects. Although I had spent time reading about the differences between lists and dictionaries in my book, I failed to appreciate the distinction (including lists of lists and lists of dictionaries) until wrestling with the JSON data that came back from the FDIC site.
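
A sketch of the kind of parameterized request I was working toward, assuming the FDIC BankFind endpoint, filter syntax, field names, and response shape shown below (all of which are my assumptions and worth verifying against the published API documentation):

    import requests

    # Assumed FDIC BankFind endpoint; verify against the API docs.
    API_URL = "https://banks.data.fdic.gov/api/financials"

    def fetch_measures(state, fields, limit=10):
        """Pull selected financial measures for banks in one state."""
        params = {
            "filters": f'STNAME:"{state}"',   # assumed filter syntax
            "fields": ",".join(fields),
            "limit": limit,
            "format": "json",
        }
        resp = requests.get(API_URL, params=params)
        resp.raise_for_status()
        # The payload is a dictionary whose "data" key holds a list of
        # dictionaries -- the list-of-dicts structure I had glossed
        # over when reading about data types.
        return [row["data"] for row in resp.json()["data"]]

    # Hypothetical field names (e.g. ASSET for total assets, DEP for
    # total deposits).
    for row in fetch_measures("Pennsylvania", ["ASSET", "ROA", "ROE", "DEP"]):
        print(row)

Making the state, measures, and observation count plain arguments is the kind of structure that makes it practical to troubleshoot one object at a time instead of all possible objects at once.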

Some screenshots of output from my program looking at FDIC data in Python

What next?

I feel a sense of satisfaction with these two efforts. I would like to learn to use the tools a bit more efficiently, as I can see cool use cases for marketing in our business; there are public data on mid-market employer benefit programs available for download from the DOL. There are also financial questions I am curious about; I wonder how disparate data from the FDIC, Federal Reserve, and BLS might be paired with commercial data to create forward-looking insights about credit markets. From a health care perspective, there are interesting datasets available via API from CMS; I think I will start with the Procedure Price Lookup.

I continue to have a few partially read books scattered throughout our home. I imagine I will not make much progress on them this weekend because I would like to spend a little more time down this rabbit hole before getting back to work on Monday!