
Day 01

EHR Notes

Early on, we went from paper to EHR (Paper → EHR) but did not exploit the advantages of the records being electronic.

You need to reconstruct “Observational Data” from data that was collected for other purposes (e.g. legal, research, regulatory).

ADT Data → Admissions, Discharge, Transfer

The health system has 9 hospitals. Outpatient and inpatient can use different DBs! However, at least here (NYP/Columbia/Cornell), everyone went to Epic in February 2020.

150 outpatient locations (Jersey, Children’s hospital, further downtown.)

Here in the Heights, public insurance and Hispanic. Upper East Side, rich people, mostly white, private insurance.

Ancillaries: Each lab/department might have their own preference for a data collection system.

TODO: EMR versus EHR?

TODO: Data Warehousing?

OMOP is meant to be analytical (at least primarily.)

Terminology Management System → Semantic Alignment → OMOP

Epic COSMOS — Patient data across all their sites (vendors?). Cancer Staging data density is 5%!

Just using Epic doesn’t mean you’re going to get ‘clean’ or standardized data. The lung people and the breast people at the same hospital will use it differently. EHR data ultimately depends on the care signal which is the result of a lot of processes and people. Consider for example a simple question: “How many people are on the floor at an ICU?” You cannot state with confidence just based on the HL7 data coming in!

So: Which EHR? What’s the signal that populated it?

Claims Notes

Schneeweiss et al: Record generation process and sources of bias at each step.

Structured + Unstructured Data → Medical Coders (not clinicians). This is essentially for the reimbursement process: a lovely song and dance.

You must distinguish between a Payor and a Provider when you look at a piece of claims data. Which perspective? Pre-adjudication or post-adjudication?

Start from the top! Does the patient actually seek care for a malady?

Three forms, three classes of codes.

When using ICD, which version and which modification? ICD10-CM (Clinical Modification, used in the US; it adds granularity, again for billing (“acuity of care”)), ICD10-CA (Canada), etc. Half of European countries are OK with using ‘just’ ICD10. Modifications are payor driven (of course).

The OHDSI vocabulary is a collection of standardized vocabularies (six month release cycle).

In the US there is no ‘national’ or ‘global’ patient identifier. These are all local. A Columbia PID and an Aetna PID are not the same. Tokenization and linkage to solve this problem are a cottage industry.

TODO: Tokenization?
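One common reading of tokenization, sketched under assumptions: both data holders normalize the same identifying fields and run them through a keyed hash, so records can be linked on the token without exchanging the raw identifiers. The field choice, the normalization, and `SECRET_KEY` are all hypothetical here; real linkage vendors have their own schemes.

```python
import hashlib
import hmac

# Hypothetical shared secret; in practice whoever runs the linkage manages it.
SECRET_KEY = b"example-secret-not-for-real-use"


def make_token(first_name, last_name, birth_date):
    """Return a hex token built from normalized identifiers.

    birth_date is assumed to already be a YYYY-MM-DD string.
    """
    normalized = "|".join(
        part.strip().lower() for part in (first_name, last_name, birth_date)
    )
    return hmac.new(SECRET_KEY, normalized.encode("utf-8"), hashlib.sha256).hexdigest()


# The same person as recorded by two organizations, with messy formatting.
columbia_token = make_token("Ana", "Diaz", "1980-02-03")
aetna_token = make_token(" ana ", "DIAZ", "1980-02-03")

# A different birth date gives a different token.
other_token = make_token("Ana", "Diaz", "1981-02-03")
```

Exact-match tokens like this break on typos and name changes, which is part of why linkage is its own industry.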

Broadly, these are the clinically useful pieces: Demographics, Healthcare Utilization, Procedures, Diagnoses, Discharge Status.

Hark!

Diagnosis and Procedures: A procedure can rule out a disease, but the diagnosis codes in claims will show the disease!

Seeing a person with a diagnosis code does not necessarily mean they have the disease!

There is a difference between “patients with DM2” and “patients who have a diagnosis code of DM2”!

Diagnosis codes are for the purposes of reimbursement. This is a tug of war of money. And then we look at these residual artifacts and attempt to reconstruct a story. We do this by merging a lot of data sources and studying the bias to try and understand and minimize it.

The health system here is horribly broken.

Three populations fill the same forms:

  • Commercial Claims
  • Medicaid
  • Medicare

How stuff ends up in EHR:

  • You show up, you tell doctor
  • Doctor writes it down
  • It is imparted some structure (note that there is structured data and notes.)

TODO: Structured Data and Notes ← how are these populated? Are these the only ones?


What do we have in the source data and what can we do with it? This isn’t about what the OMOP CDM can ‘do’.

Roseanne example: “Maybe” in the EHR column. How are we capturing? Claims data has more rules and structure. Bad faith argument: I am a beleaguered doctor or I am sloppy, but I do want to get paid! So “maybe” would go into all rows in the EHR column, with more “yes” entries in the Claims column (so you get paid).

OMOP CDM

Observation Period: from the payer’s perspective, your enrollment period is the Observation Period.

“If I don’t see it, it didn’t happen” ← to support this claim you need to be observable!

Observation Period
Visit x n (Encounter with doctor)
Condition x n
Drugs
Measurement
Measurement Value
Procedure_occurrence (P codes)
Freetext Notes

Note: You can have visits without their children, or children without visits. This is not a stricture! It depends on the source data!

Observation_period Table: You can have two+ rows for when you switch insurances. Two Observation periods cannot overlap!
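The no-overlap rule can be checked mechanically. A minimal sketch, assuming each row is a `(person_id, start_date, end_date)` tuple with inclusive dates; the function name and the toy rows are made up for illustration.

```python
from datetime import date
from itertools import groupby


def find_overlaps(periods):
    """Return (person_id, start_date) for every period that overlaps an earlier one.

    Dates are treated as inclusive, so a period starting on the day
    another one ends counts as an overlap (an assumption of this sketch).
    """
    offenders = []
    ordered = sorted(periods, key=lambda p: (p[0], p[1]))
    for person_id, rows in groupby(ordered, key=lambda p: p[0]):
        latest_end = None
        for _, start, end in rows:
            if latest_end is not None and start <= latest_end:
                offenders.append((person_id, start))
            latest_end = end if latest_end is None else max(latest_end, end)
    return offenders


periods = [
    # Person 1 switched insurance cleanly: two back-to-back periods.
    (1, date(2019, 1, 1), date(2019, 12, 31)),
    (1, date(2020, 1, 1), date(2020, 12, 31)),
    # Person 2 has two periods that overlap in June 2019.
    (2, date(2019, 1, 1), date(2019, 6, 30)),
    (2, date(2019, 6, 1), date(2019, 12, 31)),
]

overlaps = find_overlaps(periods)
```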

visit_occurrence: Outpatient, Emergency, Hospital, long term care visit.

Generate clinical question → Get standard concepts → Get them from OMOP → Query using OMOP concepts.
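The last step of that chain, sketched with an in-memory SQLite table. The table and column names follow the OMOP `condition_occurrence` table, but the concept IDs (9001, 9002) and the rows are invented placeholders, not real OHDSI concepts.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE condition_occurrence ("
    "person_id INTEGER, condition_concept_id INTEGER, condition_start_date TEXT)"
)
conn.executemany(
    "INSERT INTO condition_occurrence VALUES (?, ?, ?)",
    [
        (1, 9001, "2020-01-15"),
        (1, 9001, "2020-06-01"),  # same person, second record
        (2, 9002, "2020-02-10"),
        (3, 9001, "2021-03-05"),
    ],
)


def persons_with_condition(conn, concept_ids):
    """Return sorted person_ids with at least one record for the given standard concepts."""
    placeholders = ",".join("?" for _ in concept_ids)
    sql = (
        "SELECT DISTINCT person_id FROM condition_occurrence "
        f"WHERE condition_concept_id IN ({placeholders}) ORDER BY person_id"
    )
    return [row[0] for row in conn.execute(sql, list(concept_ids))]


cohort = persons_with_condition(conn, [9001])
```

This returns people who have a record with the code, which is not the same thing as people who have the disease.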

condition_source_concept_id versus condition_concept_id? The former is the same across three sites in the US but different in Japan? This is because we use ICD10-CM in the US but they use ICD10 in Japan (for example).
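A toy illustration of the two columns, assuming a lookup keyed by (vocabulary, code). The vocabulary names `ICD10CM` and `ICD10` are the OHDSI vocabulary IDs; every concept ID below is a made-up placeholder, and mapping both codes to one standard concept is a simplification.

```python
# (vocabulary_id, source code) -> (source concept id, standard concept id)
# All numeric IDs here are invented for illustration.
CONCEPT_LOOKUP = {
    ("ICD10CM", "E11.65"): (5001, 9001),  # US site
    ("ICD10", "E11"): (5002, 9001),       # Japanese site
}


def to_condition_row(vocabulary_id, code):
    """Build the concept-related columns of a condition_occurrence row."""
    source_concept_id, standard_concept_id = CONCEPT_LOOKUP[(vocabulary_id, code)]
    return {
        "condition_source_value": code,
        "condition_source_concept_id": source_concept_id,
        "condition_concept_id": standard_concept_id,
    }


us_row = to_condition_row("ICD10CM", "E11.65")
japan_row = to_condition_row("ICD10", "E11")
```

The source columns keep what each site actually recorded; the standard column is what lets one query run across both.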


ATLAS Notes

The colors mean something when you search. Red are non-standard. Standard are blue (Standard means “standard to OHDSI” e.g. SNOMED). Purple are Classification (TODO)

Domain == Which Table to look at
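As a lookup, for a handful of domains. The table names are OMOP CDM tables; the helper function is just for illustration and the list is not exhaustive.

```python
# Domain of a concept -> CDM table where records for that concept live.
DOMAIN_TO_TABLE = {
    "Condition": "condition_occurrence",
    "Drug": "drug_exposure",
    "Procedure": "procedure_occurrence",
    "Measurement": "measurement",
    "Observation": "observation",
    "Visit": "visit_occurrence",
}


def table_for(domain_id):
    """Return the table to look at for a concept's domain."""
    try:
        return DOMAIN_TO_TABLE[domain_id]
    except KeyError:
        raise ValueError(f"No table listed for domain {domain_id!r}") from None


drug_table = table_for("Drug")
```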

Evidnet → OHDSI Evidence Network

Drug searches: most of the time you filter on ingredient level.


ETL

Several tools exist to help with the ETL process (White Rabbit, Rabbit in a Hat, Usagi, ACHILLES).

TODO: What is the value of ETL documentation?

This is almost never a straightforward process. You will need to make decisions. ETL processes are not lossless. However, George wrote a paper about how you get the same patients whether you use the source data or OMOP.

You can have an analysis and deployment process using an Agile methodology.

E.g. You have a lab test: order and result are two different things. From a hospital POV this matters, but maybe not so much for observational research, so you may smush them together.
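A sketch of what smushing could look like, assuming the source has separate order and result records joined by an `order_id`. The source field names are hypothetical; the output keys borrow column names from the OMOP `measurement` table. Dropping unresulted orders and dating the row by the result are decisions of this sketch, not rules.

```python
orders = [
    {"order_id": "A1", "person_id": 1, "test_code": "HBA1C", "ordered_at": "2021-03-01"},
    {"order_id": "A2", "person_id": 2, "test_code": "HBA1C", "ordered_at": "2021-03-02"},
]
results = [
    {"order_id": "A1", "value": 6.8, "unit": "%", "resulted_at": "2021-03-02"},
]


def smush(orders, results):
    """Collapse each order and its result into one measurement-like record."""
    results_by_order = {r["order_id"]: r for r in results}
    merged = []
    for order in orders:
        result = results_by_order.get(order["order_id"])
        if result is None:
            # ETL decision: an order with no result is dropped (information is lost).
            continue
        merged.append(
            {
                "person_id": order["person_id"],
                "measurement_source_value": order["test_code"],
                "measurement_date": result["resulted_at"],
                "value_as_number": result["value"],
                "unit_source_value": result["unit"],
            }
        )
    return merged


measurements = smush(orders, results)
```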

E.g. Simply having an ICD code doesn’t mean that it should be ferried to the condition table. But it may work out well most of the time.

TODO: Who assigns the condition_source_concept_id?

You can have many OMOP instances! You can merge them together! This is not magical either just because you have the same schema! E.g. You have two different DOBs for the same patient. What do you do?
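Before deciding what to do, you at least need to find the disagreements. A minimal sketch, assuming the patients have already been linked under a shared key (itself the hard part) and each instance is reduced to a dict of key to date of birth; all names and rows are invented.

```python
from datetime import date


def find_dob_conflicts(instance_a, instance_b):
    """Return {key: (dob_in_a, dob_in_b)} for linked patients whose DOBs disagree."""
    shared_keys = instance_a.keys() & instance_b.keys()
    return {
        key: (instance_a[key], instance_b[key])
        for key in shared_keys
        if instance_a[key] != instance_b[key]
    }


ehr_instance = {
    "p1": date(1980, 2, 3),
    "p2": date(1975, 7, 19),
}
claims_instance = {
    "p1": date(1980, 2, 3),
    "p2": date(1975, 7, 9),   # disagrees with the EHR instance
    "p3": date(1990, 1, 1),   # only present in claims, so not a conflict
}

conflicts = find_dob_conflicts(ehr_instance, claims_instance)
```

Which value wins is still a judgment call that the shared schema does not answer.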

Snapshots/versions: at CUIMC they only store four quarters of versions. The rest end up in archival (they’re inaccessible.)

Hark!
  • Vocabulary dictates where the data lands.
  • We’re worried about what happened to the patient and not the health system.
  • Always question the data. Clinical data is exceedingly complicated because the Clinical signal is exceedingly complicated.