Health Data Modalities - EHR and Claims Data
Berkson’s Paradox. A and B are independent. Both cause C. Condition on C (say, by only looking at cases where C happened) and A and B now appear related.
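A quick way to convince yourself is to simulate it. This is a minimal sketch, not anything from the lecture: the probability `p`, the rule "C happens if A or B happens", and the function names are all assumptions picked for illustration.

```python
import random


def pearson(xs, ys):
    """Plain Pearson correlation, written out to avoid extra dependencies."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5


def simulate_berkson(n=20000, p=0.3, seed=0):
    rng = random.Random(seed)
    # A and B are drawn independently (assumed Bernoulli(p) each).
    a = [int(rng.random() < p) for _ in range(n)]
    b = [int(rng.random() < p) for _ in range(n)]
    # Assumed collider: C happens whenever A or B is present.
    c = [int(x or y) for x, y in zip(a, b)]

    overall = pearson(a, b)
    # Keep only the rows where C happened, i.e. condition on C.
    kept = [(x, y) for x, y, z in zip(a, b, c) if z]
    selected = pearson([x for x, _ in kept], [y for _, y in kept])
    return overall, selected


overall_corr, selected_corr = simulate_berkson()
print(f"corr(A, B) in everyone: {overall_corr:+.3f}")
print(f"corr(A, B) given C:     {selected_corr:+.3f}")
```

In the full sample the correlation sits near zero. Among the rows where C happened it turns clearly negative, even though nothing links A and B.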
What makes EHR data different?
It’s weird as heck.
All data you see is conditioned on being in the health system. What does a 100% healthy person look like in EHR data? They’re a missing data point lol (it’s a joke).
It’s not collected for analysis and is a byproduct of clinical care. This is the critical thing. You also don’t get the same features per patient. Events happen at irregular and informative times.
And patients are not independent! They share wards, doctors, policies.
Missing data is the norm and is informative.
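A toy simulation of why missingness carries signal. Everything here is an assumption for illustration: the 20% prevalence, the ordering probabilities, and the idea that clinicians order a lab more often when the patient looks sick.

```python
import random


def simulate_lab_orders(n=10000, seed=0):
    """Return (sick, measured) pairs for n hypothetical patients."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        sick = rng.random() < 0.2           # assumed prevalence
        p_order = 0.8 if sick else 0.1      # assumed ordering behaviour
        measured = rng.random() < p_order   # was the lab ever ordered?
        rows.append((sick, measured))
    return rows


def sick_rate(rows, measured):
    """Fraction of sick patients among those with / without the lab."""
    group = [sick for sick, m in rows if m == measured]
    return sum(group) / len(group)


rows = simulate_lab_orders()
rate_measured = sick_rate(rows, measured=True)
rate_missing = sick_rate(rows, measured=False)
print(f"P(sick | lab measured) = {rate_measured:.2f}")
print(f"P(sick | lab missing)  = {rate_missing:.2f}")
```

Under these assumptions, whether the lab exists at all tells you a lot about the patient before you even look at its value. Imputing the missing values as if they were missing at random would throw that away.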
EHR is the system that generates the data. Data captured for the purpose of billing is not a faithful representation of patient state. What about pharmacies? If you see a prescription, are you sure that the patient took the medication? Data represents what was done, not everything that happened.
What does it look like?
You can look at something like OpenMRS to get a sense. Not a full picture.
MEDS Schema: Subject ID, Time, Event, Value. That’s the entire schema! Note no Hospital ID… MEDS is tied to a single Healthcare System.
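To make the four-field idea concrete, here is a minimal sketch of a long-format event table. The field names mirror the four listed above and are not necessarily the exact column names in the MEDS spec. The event names and values are made up.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass(frozen=True)
class Event:
    subject_id: int
    time: datetime
    event: str                # what happened (hypothetical names below)
    value: Optional[float]    # numeric result, if the event has one


# Made-up rows, deliberately out of order and uneven across patients.
events = [
    Event(1, datetime(2024, 1, 3, 9, 45), "DIAGNOSIS_SEPSIS", None),
    Event(2, datetime(2024, 2, 10, 14, 0), "HEART_RATE", 71.0),
    Event(1, datetime(2024, 1, 3, 8, 15), "HEART_RATE", 92.0),
    Event(1, datetime(2024, 1, 3, 9, 40), "LACTATE", 3.1),
]


def timeline(events, subject_id):
    """All events for one subject, in time order."""
    own = (e for e in events if e.subject_id == subject_id)
    return sorted(own, key=lambda e: e.time)


def features_seen(events, subject_id):
    """The set of event types recorded for a subject."""
    return {e.event for e in timeline(events, subject_id)}


for e in timeline(events, 1):
    print(e.time.isoformat(), e.event, e.value)
print(features_seen(events, 2))
```

Note how patient 2 has only a heart rate while patient 1 has three kinds of events. The long format handles "different features per patient" without any empty columns.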
What is Claims Data?
EHR data is generated through the process of care. But Claims data is generated through the process of billing: what services were provided and what diagnoses justified them. You have razor-thin margins in the ‘business’ of Healthcare.
EHR data has finer temporal granularity/resolution than Claims data. Also, with Claims you just see Codes + Costs and that’s it.
For ML you really need to understand the economic incentive structure. Broadly speaking you have:
- Fee For Service (FFS) models: you do a service, you get paid. These prioritize volume and upcoding (overtreatment? decoupling patient incentive from hospital incentive).
- Capitated Service models: these pay a fixed amount per patient. These prioritize under-documentation(?)
Think telemedicine, especially after the pandemic.
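The two payment models above can be sketched as toy revenue functions. The fee schedule, the per-member rate, and the function names are all hypothetical. Real contracts are far messier.

```python
def ffs_revenue(services, fee_schedule):
    """Fee for service: every billed service adds its fee."""
    return sum(fee_schedule[s] for s in services)


def capitated_revenue(members, rate_per_member_month, months=1):
    """Capitation: a fixed payment per member per month, whatever is done."""
    return members * rate_per_member_month * months


# Made-up fee schedule, purely for illustration.
fee_schedule = {"office_visit": 100.0, "mri": 800.0}

one_visit = ffs_revenue(["office_visit"], fee_schedule)
visit_plus_mri = ffs_revenue(["office_visit", "mri"], fee_schedule)
capitated_year = capitated_revenue(members=1, rate_per_member_month=50.0, months=12)

print(f"FFS, visit only:      {one_visit:.2f}")
print(f"FFS, visit + MRI:     {visit_plus_mri:.2f}")
print(f"Capitated, full year: {capitated_year:.2f}")
```

Under FFS the extra MRI adds revenue. Under capitation the list of services never enters the revenue function, so every service is pure cost. The data each system generates will reflect that.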
Aside: Primary care in other countries is better paid since it’s the frontline in keeping costs down. Here in the US, it’s all about inpatient settings. Elsewhere it’s about prevention to keep costs down. Not here. That would be very communist.
Coding Systems
As for codes you can pick the first one or the most lucrative one (“upcoding”) or… there is a lot of financial incentive here. This is sadly one of the biggest uses of AI! Both for picking the most ‘lucrative’ code (by hospitals) and for detecting that and downcoding. What a system.
See Zipf’s Law regarding codes.
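One reading of the Zipf pointer is that a handful of codes account for most records and the rest form a long tail. Here is a sketch of what that looks like, using synthetic codes sampled with weight 1/rank. The code names, vocabulary size, and record count are all made up. Real code frequencies will only roughly follow this shape.

```python
import random
from collections import Counter


def sample_codes(n_codes=50, n_records=20000, seed=0):
    """Draw synthetic codes where the code at rank r has weight 1/r."""
    rng = random.Random(seed)
    codes = [f"CODE_{rank:03d}" for rank in range(1, n_codes + 1)]
    weights = [1.0 / rank for rank in range(1, n_codes + 1)]
    return rng.choices(codes, weights=weights, k=n_records)


counts = Counter(sample_codes())
ranked = counts.most_common()  # (code, count) pairs, most frequent first
top10_share = sum(c for _, c in ranked[:10]) / sum(counts.values())

# Under Zipf's law, rank * count stays roughly constant down the list.
for rank, (code, count) in enumerate(ranked[:5], start=1):
    print(rank, code, count, rank * count)
print(f"top 10 of {len(ranked)} codes cover {top10_share:.0%} of records")
```

For ML this means most of the code vocabulary is rare. A model sees plenty of examples of the head and almost none of the tail.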