I had a great conversation earlier with some friends and colleagues about the roles MRF files might play in the practice of healthcare policy, the challenges native to these efforts, and the work of researchers in this space.
I think there is a good opportunity to contribute to our understanding of the dynamics of American healthcare by connecting insights from payer MRFs, hospital MRFs, and real claims data.
Some challenges include differing naming conventions, differing MRF file structures, and varying forms of reimbursement (e.g., percentage, fee schedule, negotiated rate).
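To make the reimbursement-form problem concrete, here is a minimal sketch of what a normalizer might look like. The negotiated_type values come from the CMS in-network-rates schema; treating base_rate as the base for a percentage arrangement is my own assumption for illustration:

# A minimal sketch, not a production normalizer. The negotiated_type values
# ("negotiated", "derived", "fee schedule", "percentage", "per diem") come
# from the CMS in-network-rates schema; the base_rate fallback is a
# hypothetical assumption for illustration.
def normalize_rate(negotiated_price, base_rate=None):
    """Return a comparable dollar figure where possible, else None."""
    rate = negotiated_price.get("negotiated_rate")
    rate_type = negotiated_price.get("negotiated_type")
    if rate_type in ("negotiated", "derived", "fee schedule"):
        return rate  # already expressed as a dollar amount
    if rate_type == "percentage" and base_rate is not None:
        return rate / 100.0 * base_rate  # percentage of some reference rate
    return None  # per diem and others need more context (e.g., length of stay)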
The objective of this blog post is to share some prior efforts and research I conducted while exploring examples of the MRF files produced under the Transparency in Coverage (TiC) rules.
But, why do we care?
Per our conversation, these efforts are useful on many fronts: public policy and personal inquiry; research that could lead to system improvement and therefore social value; and development of novel datasets and strategies that could add value to customers and create commercial value. There is also potential for direct commercial benefit: plan sponsors of self-insured health plans, insurers, and other risk-bearing entities can make better decisions around risk and procurement based on better data. This was a recent slide from a Valhalla Business Advisors meeting.
Hospital Files
As a starting point, I pulled the MRF for Hershey from this website: https://www.pennstatehealth.org/patients-visitors/billing-medical-records/hospital-price-transparency. Note that the hospital also provides a link to the CMS FAQ on these file formats: https://www.pennstatehealth.org/patients-visitors/billing-medical-records/hospital-price-transparency#:~:text=Centers%20for%20Medicare%20and%20Medicaid%20Services%20(CMS)%20requirement
Here is my initial exploratory Python code:
import pandas as pd

file_path = '/Users/jensthorsen/Workspace/Notebooks/transparency blogpost dec 23/251854772_the-milton-s-hershey-medical-center_standardcharges.csv'

# Attempting to read the CSV file with a different encoding (ISO-8859-1)
try:
    data = pd.read_csv(file_path, encoding='ISO-8859-1')
    file_info = {
        "size": data.memory_usage(deep=True).sum(),  # size in bytes
        "shape": data.shape,                         # rows and columns
        "columns": data.columns.tolist(),            # column names
        "first_few_rows": data.head()                # first few rows for a preview
    }
except Exception as e:
    file_info = str(e)

file_info
Here is the initial data return:
{'size': 267915425,
 'shape': (103012, 49),
 'columns': ['Line Type', 'CDM Tech', 'CDM Tech - Description', 'DRG',
             'CPT / HCPC', 'NDC', 'Revenue Code', 'Gross Charge',
             'Self Pay Cash Price', 'Min Negotiated Rate', 'Max Negotiated Rate',
             'Aetna Better Health MA CHIP', 'Aetna Comm', 'Aetna Medicare',
             'Aetna Signature Administrators', 'AmeriHealth MA',
             'AmeriHealth MC Adv', 'CBC CHIP', 'CBC Comm', 'CBC Medicare',
             'Cigna Comm', 'Cigna Medicare', 'Geisinger Comm', 'Geisinger MA',
             'Geisinger Medicare', 'Health Partners Plans MA',
             'Health Partners Plans MA CHIP', 'Health Partners Plans Medicare',
             'Highmark ACA', 'Highmark CHIP', 'Highmark Choice Blue',
             'Highmark Comm', 'Highmark Medicare', 'Highmark Wholecare MA',
             'Highmark Wholecare Medicare', 'Humana Medicare', 'Multiplan',
             'PA Health & Wellness MA', 'PA Health & Wellness Medicare',
             'Preferred Healthcare Lanc', 'Preferred Healthcare Systems',
             'UHC Comm', 'UHC MA CHIP', 'UHC Medicare',
             'UHC Medicare I-SNP / D-SNP', 'UPMC Comm', 'UPMC MA',
             'UPMC MA CHIP', 'UPMC Medicare'],
 'first_few_rows':
  Line Type CDM Tech             CDM Tech - Description  DRG CPT / HCPC  NDC
0       NaN  4407175         FNA BX W/O IMG GDN EA ADDL  NaN      10004  NaN
1       NaN  3514915  FINE NEEDLE ASPIRATION(W/IMAGING)  NaN      10005  NaN
2       NaN  3514918  FINE NEEDLE ASPIRATION(W/IMAGING)  NaN      10005  NaN
3       NaN  4407176         FNA BX W/US GDN 1ST LESION  NaN      10005  NaN
4       NaN  4805500               US FNA Thyroid Gland  NaN      10005  NaN

   Revenue Code Gross Charge Self Pay Cash Price Min Negotiated Rate  ...
0         761.0           56               19.49                   0  ...
1         761.0     3,654.00            1,271.96              553.71  ...
2         761.0     3,654.00            1,271.96              553.71  ...
3         761.0     3,654.00            1,271.96              553.71  ...
4         761.0     3,654.00            1,271.96              553.71  ...

  Preferred Healthcare Lanc Preferred Healthcare Systems  UHC Comm
0                        42                           42     42.78
1                  2,740.50                     2,740.50  2,791.66
2                  2,740.50                     2,740.50  2,791.66
3                  2,740.50                     2,740.50  2,791.66
4                  2,740.50                     2,740.50  2,791.66

  UHC MA CHIP UHC Medicare UHC Medicare I-SNP / D-SNP UPMC Comm   UPMC MA
0        22.4            0                          0     42.04     21.11
1    1,461.60       651.42                     664.45  2,743.06  1,377.19
2    1,461.60       651.42                     664.45  2,743.06  1,377.19
3    1,461.60       651.42                     664.45  2,743.06  1,377.19
4    1,461.60       651.42                     664.45  2,743.06  1,377.19

  UPMC MA CHIP UPMC Medicare
0        24.84             0
1     1,620.55        651.42
2     1,620.55        651.42
3     1,620.55        651.42
4     1,620.55        651.42

[5 rows x 49 columns]}
Also, here is the output of data.info():
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 103012 entries, 0 to 103011
Data columns (total 49 columns):
 #   Column                          Non-Null Count   Dtype
---  ------                          --------------   -----
 0   Line Type                       2084 non-null    object
 1   CDM Tech                        100926 non-null  object
 2   CDM Tech - Description          103007 non-null  object
 3   DRG                             2081 non-null    float64
 4   CPT / HCPC                      56652 non-null   object
 5   NDC                             12424 non-null   object
 6   Revenue Code                    67595 non-null   float64
 7   Gross Charge                    100926 non-null  object
 8   Self Pay Cash Price             103007 non-null  object
 9   Min Negotiated Rate             102961 non-null  object
 10  Max Negotiated Rate             102961 non-null  object
 11  Aetna Better Health MA CHIP     102171 non-null  object
 12  Aetna Comm                      101620 non-null  object
 13  Aetna Medicare                  55160 non-null   object
 14  Aetna Signature Administrators  101620 non-null  object
 15  AmeriHealth MA                  55711 non-null   object
 16  AmeriHealth MC Adv              55160 non-null   object
 17  CBC CHIP                        101645 non-null  object
 18  CBC Comm                        101645 non-null  object
 19  CBC Medicare                    101620 non-null  object
 20  Cigna Comm                      101620 non-null  object
 21  Cigna Medicare                  55160 non-null   object
 22  Geisinger Comm                  101620 non-null  object
 23  Geisinger MA                    55711 non-null   object
 24  Geisinger Medicare              55160 non-null   object
 25  Health Partners Plans MA        55711 non-null   object
 26  Health Partners Plans MA CHIP   55711 non-null   object
 27  Health Partners Plans Medicare  55160 non-null   object
 28  Highmark ACA                    63824 non-null   object
 29  Highmark CHIP                   55185 non-null   object
 30  Highmark Choice Blue            63824 non-null   object
 31  Highmark Comm                   63824 non-null   object
 32  Highmark Medicare               55160 non-null   object
 33  Highmark Wholecare MA           55711 non-null   object
 34  Highmark Wholecare Medicare     55160 non-null   object
 35  Humana Medicare                 55160 non-null   object
 36  Multiplan                       101620 non-null  object
 37  PA Health & Wellness MA         55711 non-null   object
 38  PA Health & Wellness Medicare   55160 non-null   object
 39  Preferred Healthcare Lanc       101620 non-null  object
 40  Preferred Healthcare Systems    101620 non-null  object
 41  UHC Comm                        101620 non-null  object
 42  UHC MA CHIP                     102171 non-null  object
 43  UHC Medicare                    55160 non-null   object
 44  UHC Medicare I-SNP / D-SNP      55160 non-null   object
 45  UPMC Comm                       101620 non-null  object
 46  UPMC MA                         102171 non-null  object
 47  UPMC MA CHIP                    102171 non-null  object
 48  UPMC Medicare                   55160 non-null   object
dtypes: float64(2), object(47)
memory usage: 38.5+ MB
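One thing data.info() makes obvious: nearly every rate column comes in as an object dtype, because values like "3,654.00" carry thousands separators. Before doing arithmetic on these columns, a cleaning pass along these lines helps (a sketch, assuming the comma formatting holds throughout the file):

# Convert the comma-formatted rate strings (e.g., "3,654.00") to floats.
# 'Gross Charge' onward are the dollar-amount columns (index 7 and up);
# errors='coerce' turns anything unparseable into NaN rather than raising.
rate_columns = data.columns[7:]
for col in rate_columns:
    data[col] = pd.to_numeric(
        data[col].astype(str).str.replace(',', '', regex=False),
        errors='coerce'
    )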
Ok, so there is some interesting stuff here. There are no rows for CPT codes 99213 and 99214, but there are rows for 99211, per this code:
# Set the option to display all columns
pd.set_option('display.max_columns', None)

# Filter rows and select the first 25 columns
filtered_data = data[data['CPT / HCPC'] == '99211'].iloc[:, 0:25]

# Convert the DataFrame to HTML; index=False suppresses the index
html = filtered_data.to_html(index=False)

# Display the HTML in the Jupyter notebook
from IPython.display import display, HTML
display(HTML(html))
When trying to identify the relative allowances for the highest-cost services, the top of the list is dominated by varying doses of high-cost therapies (e.g., Zolgensma, CAR T-cell therapy, and SYNCARDIA). With a little effort, here is a listing of some comparative allowances for high-cost services:
# Creating filters for each specified term to ensure only one entry per term
car_t_cell_therapy_filter = data['CDM Tech - Description'].str.contains("CAR T-Cell Therapy", case=False, na=False)
heartmate_filter = data['CDM Tech - Description'].str.contains("HEARTMATE", case=False, na=False)
syncardia_filter = data['CDM Tech - Description'].str.contains("SYNCARDIA", case=False, na=False)
zolgensma_filter = data['CDM Tech - Description'].str.contains("Zolgensma", case=False, na=False)
mcs_evaheart_implant_filter = data['CDM Tech - Description'].str.contains("MCS Evaheart Implant", case=False, na=False)

# Getting one entry for each term
one_car_t_cell_therapy = data[car_t_cell_therapy_filter].head(1)
one_heartmate = data[heartmate_filter].head(1)
one_syncardia = data[syncardia_filter].head(1)
one_zolgensma = data[zolgensma_filter].head(1)
one_mcs_evaheart_implant = data[mcs_evaheart_implant_filter].head(1)

# Combining these entries
combined_specific_entries = pd.concat([one_car_t_cell_therapy, one_heartmate, one_syncardia, one_zolgensma, one_mcs_evaheart_implant])

# Creating a filter to exclude these specific entries from the rest of the data
exclude_specific_terms_filter = ~(car_t_cell_therapy_filter | heartmate_filter | syncardia_filter | zolgensma_filter | mcs_evaheart_implant_filter)

# Getting the next five highest charges that do not match those terms.
# Gross Charge holds comma-formatted strings in the raw file, so sort on a
# numeric version rather than lexicographically.
gross_numeric = pd.to_numeric(
    data['Gross Charge'].astype(str).str.replace(',', '', regex=False),
    errors='coerce'
)
next_five_unique_entries = (
    data.assign(_gross=gross_numeric)
        .loc[exclude_specific_terms_filter]
        .sort_values('_gross', ascending=False)
        .drop(columns='_gross')
        .head(5)
)

# Combining the specific entries with the unique entries
final_combined_entries = pd.concat([combined_specific_entries, next_five_unique_entries])

# Reordering columns (reordered_columns was defined earlier in the notebook;
# this is a reasonable stand-in)
reordered_columns = ['CDM Tech - Description', 'CPT / HCPC', 'Gross Charge',
                     'Self Pay Cash Price', 'Min Negotiated Rate', 'Max Negotiated Rate']
final_combined_entries_info = final_combined_entries[reordered_columns]
final_combined_entries_info
Payer MRF
Payer MRFs seem to be more challenging. Let's start with Highmark. Here is the website they have posted for these files, at https://mrfdata.hmhs.com/:
If you click through on a state, e.g., Highmark Blue Shield Pennsylvania, you download a JSON file. JSON stands for JavaScript Object Notation; in parlance, it is a string representation of a dictionary, which holds data in key-value pairs. At first glance, this initial JSON file looks like this:
However, underneath this otherwise unintelligible word-salad are three keys:
reporting_entity_name
reporting_entity_type
reporting_structure
Under these is more detail. For the first two keys, you can see this example:
{"reporting_entity_name":"Highmark Blue Shield Pennsylvania","reporting_entity_type":"Health Insurance Issuer","reporting_structure":
It is in the reporting_structure key that things get interesting. Its value is a list of dictionaries. Here is one example of the start of item 1 in such a list, of which there are 28 in this particular file:
You can notice that, within this list, there is a series of additional dictionaries. The structure, per highmark_data['reporting_structure'][0].keys(), includes:
reporting_plans: You can see the example here has a single dictionary within a list, with keys including plan_name, plan_id_type, plan_id, and plan_market_type.
in_network_files: You can see this is also a list, with links to rate files. There are 312 files of this sort, each of which has a 'description' key (basically all "National In-Network Negotiated Rates File") and a 'location' key (e.g., 'https://highmarkbspa.mrf.bcbs.com/2023-12_020_02I0_in-network-rates_1_of_2.json.gz?&Expires=1705586456&Signature=HzHRrll5RHTR2J3XsP5KwdwMO57CQjOVJj8pKje45Lvve6hsZLvz1i6otthQj75S8MYXnLeRa86u~16QwMqR~YzyzqZJNKU3WFfswRc5Tspdn88POxpJNU5WT5QRUMgbZxQdzrXYxDsMHeFFG3g0Fi3zzq5pLhlph6CSbpn~epsHcz7QHqW7mLoJuSd60xf~qnXfpZvUX7o~Z7ILRMz0oKcs4fEGWypfFxLX8hYtqeKH1XK4RfouhrBYt8RvJRmbg8AZs8cahb6s6~4jwPy8jP~IB0vRaxRL7qDY0alE2KcK8Jen5CjezHwmglZV9DqlsVYoFhcX0eVJQdnzuiG-vQ__&Key-Pair-Id=K27TQMT39R1C8A'). A quick sketch of summarizing this index follows below.
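For concreteness, here is a short sketch (assuming the index JSON has been downloaded locally, using the filename that appears later in this post) that loads the index and summarizes those pieces:

import json
from collections import Counter

# Load the Highmark index file (the same file referenced later in this post)
with open('2023-12-01_Highmark_Blue_Shield_of_Pennsylvania_index.json', 'r') as f:
    highmark_data = json.load(f)

# 28 items in reporting_structure for this file
print(len(highmark_data['reporting_structure']))

first = highmark_data['reporting_structure'][0]
print(list(first.keys()))

# 312 linked rate files, almost all sharing one description
files = first['in_network_files']
print(len(files))
print(Counter(f['description'] for f in files).most_common(3))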
So, down the rabbit hole we go. This file is rather large: it extracts to a new JSON file that is 1.13 GB in size. (Keep in mind that there were 312 of these just in the prior tab, which covered just one region for Highmark!)
So, here is some initial insight into the structure:
Note that these nested and enormous dictionaries, multiplied across months, insurers, and plans, are what caused our firm to create this slide for a recent roundtable. Our friend Brendan also helped us sort through this for a few purposes with his firm, using Parquet files and a ClickHouse database (note the SQL query in the file for an example of how these can work).
"i am curious why do an analyse structure routine when the schema is laid out at cms.gov eg. https://github.com/CMSgov/price-transparency-guide/tree/master/schemas/in-network-rates ?" (deityrush, from the Discord server, 8/3/23)
Here is some code to investigate the structure of these keys:
# Sample code to explore the top-level keys

# Print basic information
print("Reporting Entity Name:", in_network_rates_data['reporting_entity_name'])
print("Reporting Entity Type:", in_network_rates_data['reporting_entity_type'])
print("Last Updated On:", in_network_rates_data['last_updated_on'])
print("Version:", in_network_rates_data['version'])

# Explore 'provider_references'
print("\nProvider References Structure:")
if 'provider_references' in in_network_rates_data:
    provider_refs = in_network_rates_data['provider_references']
    print("Type:", type(provider_refs))
    print("First few entries:", provider_refs[:5])  # Assuming it's a list

# Explore 'in_network'
print("\nIn Network Structure:")
if 'in_network' in in_network_rates_data:
    in_network = in_network_rates_data['in_network']
    print("Type:", type(in_network))
    print("First few entries:", in_network[:5])  # Assuming it's a list
And here is some info:
And, finally:
So, perhaps these are enough notes for tonight. I will see you soon! There is a lot of opportunity to work on this and create some insights and value!
I returned to look at this…
So, while my kids are in piano lessons at the Music Academy in State College, I thought I would explore this a bit further. In the prior file, the reporting entity (from the Highmark website, mind you) was Arkansas Blue Cross Blue Shield. I tried pulling the next file and, again, it is Arkansas Blue Cross and Blue Shield.
I wonder if, as the prior files were “linked” – i.e. 1 of 2 and 2 of 2 – if the next file will be a different entity. This one is 10.3 GB in size and 1 of 4.
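While waiting, it is easy to check how the files chunk together by parsing the "_N_of_M" suffix out of each location URL. A quick sketch (the filename pattern is just what I have observed in the Highmark links):

import re
from urllib.parse import urlparse

def parse_parts(location_url):
    """Pull the 'N of M' chunking out of a rate-file URL, if present."""
    # Strip the signed query string, keep just the filename
    filename = urlparse(location_url).path.rsplit('/', 1)[-1]
    match = re.search(r'_(\d+)_of_(\d+)\.json(?:\.gz)?$', filename)
    return (int(match.group(1)), int(match.group(2))) if match else None

# e.g. parse_parts('https://highmarkbspa.mrf.bcbs.com/2023-12_020_02I0_in-network-rates_1_of_2.json.gz?...')
# -> (1, 2)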
While I continue to wait for this file to download, I should share some insights about the community working with this data. A few folks have stood out as being particularly helpful to me personally: Brian Cotter has built a community on LinkedIn and a Discord server, and Brendan Younger has supported my efforts to understand the space by sharing insights into data structures and tools, such as Parquet and ClickHouse, for coping with the massive size of these files. Valhalla also contracted with him for certain bespoke datasets. Brendan's firm, Tynbil, is worth looking at for insights into what the data can eventually "look like" after a strong ETL process, and he has good thoughts about why it all matters.
Here is the link to the Discord server: https://discord.com/channels/1107282462444371998/1107282463161581612
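For context on the Parquet idea: flatten the nested JSON into tabular rows once, then query the columnar files cheaply ever after. Here is a minimal sketch of such a flattening (my guess at a reasonable row shape, not Brendan's actual schema):

import pandas as pd

def in_network_to_rows(in_network_items):
    """Flatten in_network entries into one row per (code, provider group, price)."""
    rows = []
    for item in in_network_items:
        for rate in item.get('negotiated_rates', []):
            npis = [npi for group in rate.get('provider_groups', [])
                    for npi in group.get('npi', [])]
            for price in rate.get('negotiated_prices', []):
                rows.append({
                    'billing_code': item.get('billing_code'),
                    'billing_code_type': item.get('billing_code_type'),
                    'description': item.get('description'),
                    'negotiated_type': price.get('negotiated_type'),
                    'negotiated_rate': price.get('negotiated_rate'),
                    'npis': npis,
                })
    return rows

# df = pd.DataFrame(in_network_to_rows(data['in_network']))
# df.to_parquet('highmark_rates.parquet')  # requires pyarrow or fastparquet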
Ok, the file has loaded. Now we need to wait for the JSON object to get read into memory… But then this…
One of the challenges of being a self-taught tinkerer with computer programming, built on a foundation of Harvard's CS50 plus various projects, YouTube videos, and Python/SQL/C for Dummies, is that coping with certain issues of sophistication (e.g., very large files and complicated SQL queries) by using more "advanced" techniques (e.g., streaming, or "reading chunks into memory") can feel a bit like wandering in the dark. Fortunately, the emergence of ChatGPT, and friends like Brian and Brendan, have helped me cope and get some workable files. ChatGPT suggested the following alternative approach, and this worked:
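The gist of that approach, reconstructed here as a sketch (it is the same streaming idea the loop below uses): parse the file incrementally with ijson so only a sliver sits in memory, and stop as soon as the key of interest appears.

import ijson

# Stream just the reporting_entity_name out of a multi-gigabyte JSON file
# without loading the whole document into memory.
with open('temp_file.json', 'rb') as f:
    for prefix, event, value in ijson.parse(f):
        if (prefix, event) == ('reporting_entity_name', 'string'):
            print("Reporting Entity Name:", value)
            break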
Darn. Again, not Highmark. Let's write a loop across all these files to find one that has Highmark in the name of the reporting entity. Here is the code:
import ijson
import requests
import os
import gzip

for file_info in highmark_data['reporting_structure'][0]['in_network_files']:
    file_url = file_info['location']

    # Download file to a temporary location
    response = requests.get(file_url)
    temp_file_path = "temp_file.json.gz"  # Change the extension to .gz
    with open(temp_file_path, 'wb') as temp_file:
        temp_file.write(response.content)

    # Check if the file is compressed and decompress if necessary
    if temp_file_path.endswith('.gz'):
        with gzip.open(temp_file_path, 'rb') as compressed_file, open("temp_file.json", 'wb') as decompressed_file:
            decompressed_file.write(compressed_file.read())
        os.remove(temp_file_path)  # Remove the compressed file after decompression
        temp_file_path = "temp_file.json"  # Update the file path to the decompressed file

    # Process the file in a streaming manner
    found_highmark = False
    with open(temp_file_path, 'rb') as file:
        parser = ijson.parse(file)
        for prefix, event, value in parser:
            if (prefix, event) == ('reporting_entity_name', 'string'):
                if "Highmark" in value:
                    found_highmark = True
                break

    # Check for Highmark and delete the file if not found
    if found_highmark:
        print(f"Success: 'Highmark' found in the file {file_url}")
        break
    else:
        print(f"File URL: {file_url}")
        print(f"Reporting Entity Name: {value}")
        os.remove(temp_file_path)

# Clean up: delete the temp file if it still exists
if os.path.exists(temp_file_path):
    os.remove(temp_file_path)
The code is good – I will rerun this on a computer at my house later this afternoon so it can work while we are taking the kids to an Irish Step Dance performance in Bellefonte.
Photo from last weekend…
P.S. Getting closer… This returned -> Reporting Entity Name: Highmark Blue Cross Blue Shield Delaware
Back from Irish Step Dance
Ok, so I have found the Highmark Blue Shield records — there are 35 — and they begin at index location 95. Here is some code to evaluate the structure and sample data from the first of these JSON files:
import json

# Path to the JSON file
file_path = '/Users/jensthorsen/Workspace/Notebooks/transparency blogpost dec 23/2023-12_378_49K0_in-network-rates_03_of_35.json'

# Load and analyze the JSON file
try:
    with open(file_path, 'r') as file:
        data = json.load(file)

    # Print the top-level keys
    print("Top-level keys in the JSON file:")
    for key in data.keys():
        print(key)

    # Optionally, print some sample data for each key
    print("\nSample data for each top-level key:")
    for key, value in data.items():
        # Print a sample of the data (e.g., the first few items if it's a list, or the value itself)
        if isinstance(value, list):
            print(f"{key}: {value[:3]}")  # Print the first three items of the list
        else:
            print(f"{key}: {value}")
except Exception as e:
    print(f"An error occurred while reading the file: {e}")
Ok, so it is clear that the action is in the in_network key.
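To see what that key holds, here is a quick probe of the first entry (field names per the CMS in-network-rates schema; the comments note what typically appears, which may vary by file):

# Probe the first in_network entry's nested shape
item = data['in_network'][0]
print(list(item.keys()))

rate = item['negotiated_rates'][0]
print(list(rate.keys()))  # typically provider_groups (or provider_references) plus negotiated_prices

price = rate['negotiated_prices'][0]
print(list(price.keys()))  # e.g., negotiated_type, negotiated_rate, expiration_date, billing_class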
How do we find the Hershey providers in the payer file?
This code is very imperfect, and it probably fails to get all the Hershey relationships; I found examples of providers with NPIs who work at Hershey but seem to show up in Highmark's file under their own NPIs. I believe there are efforts to associate all these NPI values by relationship, but that is beyond the scope of this weekend.
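As an aside, one way to sanity-check which organization an NPI belongs to is the public NPPES registry API. Here is a sketch (endpoint and parameters per the NPPES docs as I understand them, so verify before relying on it):

import requests

def npi_org_name(npi):
    """Look up an NPI in the public NPPES registry and return its name, if found."""
    resp = requests.get(
        'https://npiregistry.cms.hhs.gov/api/',
        params={'number': npi, 'version': '2.1'},
        timeout=30,
    )
    results = resp.json().get('results', [])
    if not results:
        return None
    basic = results[0].get('basic', {})
    return basic.get('organization_name') or basic.get('last_name')

# npi_org_name('1568435477')  # Hershey's NPI, per this post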
Here is some code that leads to a nice list of dictionaries with Hershey’s NPI – 1568435477.
import gzip
import ijson
import json
import os
import requests
from tqdm import tqdm

def cycle_through_highmark_files(iterations):
    return_list = []

    # Hershey's NPI (other candidates tried: 1477790871, 1235525676)
    hershey_npi = 1568435477

    # Path to the Highmark index JSON file
    index_path = '/Users/jensthorsen/Workspace/Notebooks/transparency blogpost dec 23/2023-12-01_Highmark_Blue_Shield_of_Pennsylvania_index.json'
    with open(index_path, 'r') as file:
        highmark_data = json.load(file)

    in_network_files = highmark_data['reporting_structure'][0]['in_network_files']

    # Start index - the Highmark Blue Shield files begin at index 95 (0-based)
    start_index = 95

    for index, file_info in enumerate(in_network_files[start_index:start_index + iterations], start=start_index):
        try:
            file_url = file_info['location']
            print(f"Downloading file {index + 1}/{len(in_network_files)}: {file_url}")

            # Download the file with a progress bar
            response = requests.get(file_url, stream=True)
            total_size_in_bytes = int(response.headers.get('content-length', 0))
            block_size = 1024  # 1 Kibibyte
            progress_bar = tqdm(total=total_size_in_bytes, unit='iB', unit_scale=True)
            temp_file_path = f"temp_file_{index}.json.gz"
            with open(temp_file_path, 'wb') as temp_file:
                for chunk in response.iter_content(block_size):
                    progress_bar.update(len(chunk))
                    temp_file.write(chunk)
            progress_bar.close()

            # Decompress if necessary
            if temp_file_path.endswith('.gz'):
                with gzip.open(temp_file_path, 'rb') as compressed_file:
                    with open("temp_file.json", 'wb') as decompressed_file:
                        decompressed_file.write(compressed_file.read())
                os.remove(temp_file_path)
                temp_file_path = "temp_file.json"

            # Load the decompressed JSON into memory
            with open(temp_file_path, 'r') as file:
                data = json.load(file)

            # Scan the in_network entries for Hershey's NPI
            print("Looking for Hershey")
            in_network_data = data['in_network']
            print(f"Data Length: {len(in_network_data)}")
            for entry in in_network_data:
                if hershey_npi in entry['negotiated_rates'][0]['provider_groups'][0]['npi']:
                    return_list.append(entry)

            # Clean up temp files
            if os.path.exists(temp_file_path):
                os.remove(temp_file_path)

            print(f"finishing loop - {index}")

        except Exception as e:
            print(f"An error occurred: {e}")
            continue

    return return_list
This returns 710 entries with these keys: dict_keys(['negotiation_arrangement', 'name', 'billing_code_type', 'billing_code_type_version', 'billing_code', 'description', 'negotiated_rates'])
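For reference, the $1,016.97 figure cited below came from scanning those entries along these lines (a sketch over the return_list produced above):

# Pull the negotiated price(s) for CPT 12032 out of the Hershey entries
for entry in return_list:
    if entry['billing_code'] == '12032':
        for rate in entry['negotiated_rates']:
            for price in rate.get('negotiated_prices', []):
                print(entry['description'], price.get('negotiated_type'),
                      price.get('negotiated_rate'))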
So this raises the question: how does the rate for CPT code 12032 under NPI 1568435477 in this file, $1,016.97, compare to the hospital file? Let's take a look.
data.loc[data["CPT / HCPC"]=='12032', ["CDM Tech","CPT / HCPC","Gross Charge","Self Pay Cash Price","Highmark Comm","Highmark Medicare","Highmark CHIP", "Highmark ACA", "CBC Comm"]]
Of course they don’t tie out!
I think this exercise gives a good initial background on the sources of data, the reasons for investing in efforts to parse and draw meaningful insights from them, and the hurdles that present themselves as we approach this project. Take a look and let me know what you think!