Research Data Management

Data tables, entry, and querying.

Data Tables

What it says on the tin — think Excel sheets or CSVs. There’s a lot of value in storing data in a single file: you need a “Final Dataset” when you do research, since a paper about a clinical trial might not be published until five or more years later. Note on PHI: use study IDs, not PHI, in any table used for analysis and paper-writing (it’s rare that you will need PHI details there).

For longitudinal data, use multiple rows per participant. Just try to keep it simple — normalize only if necessary.

Study IDs should be unique and meaningless — not DOBs or anything else derived from PHI; UUIDs or some sequential numbering work. In a cluster randomized trial, use $CLUSTER_NUMBER___$PARTICIPANT_NUMBER, for example. The mapping from study IDs to identities is stored in a ‘vault’ whose ‘key’ only the PI has access to.
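A minimal sketch of both ID styles — the structured cluster-trial format from above and a random, meaningless token. Function names and the zero-padding widths are illustrative, not a standard:

```python
import secrets

def make_study_id(cluster: int, participant: int) -> str:
    """Build a cluster-trial study ID like '003___017'.

    The ID encodes only cluster/participant numbers -- never DOBs,
    initials, or anything else derived from PHI.
    """
    return f"{cluster:03d}___{participant:03d}"

def make_random_id(n_bytes: int = 4) -> str:
    """A meaningless, hard-to-guess alternative (random hex token)."""
    return secrets.token_hex(n_bytes)
```

Either way, the table mapping these IDs back to real identities lives only in the keyed vault, never next to the analysis dataset.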

Cloud Providers: AWS may have an agreement with Columbia that Google does not.

REDCap

Allows researchers to build online surveys for data collection — a platform that makes all of this easy. It started at Vanderbilt. It’s pretty bullet-proof from a HIPAA standpoint, so you get all of that ‘for free’. Supports a lot of study designs.

Doesn’t work very well for things like self-monitoring (think Fitbit data).

Dictionaries

This is a living, descriptive schema of the variables in your study. At a minimum, think (Variable, Type, Description). Any metadata? That goes here too.
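A minimal sketch of such a dictionary, with the (Variable, Type, Description) core plus one metadata column (units), serialized as a CSV that can live alongside the dataset. The variable names and fields are illustrative:

```python
import csv
import io

# Minimal data dictionary: (Variable, Type, Description) plus metadata.
DICTIONARY = [
    {"variable": "study_id", "type": "str",
     "description": "Unique, meaningless participant ID"},
    {"variable": "visit", "type": "int",
     "description": "Visit number (1 = baseline)"},
    {"variable": "sbp", "type": "float",
     "description": "Systolic blood pressure", "units": "mmHg"},
]

def dictionary_csv(rows):
    """Serialize the dictionary so it travels with the dataset."""
    fields = ["variable", "type", "description", "units"]
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=fields)
    writer.writeheader()
    for row in rows:
        writer.writerow({f: row.get(f, "") for f in fields})
    return buf.getvalue()
```

“Living” means this file is updated whenever a variable is added, renamed, or recoded — not written once at the start.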

Common Data Elements

How can you share your data with other researchers? Think of how you can help them with systematic reviews and meta-analyses!

Sometimes a study might require this — NINR requires PROMIS, for example.

Data Entry

Paper!

A lot of clinical trial data are still collected on paper! Think of older participants, lower socioeconomic status, or no access to technology.

The good thing here is that you have a paper trail. The bad things are time, human error, and QA (you cannot stop respondents from skipping questions or giving garbage answers).

Now what about multi-center studies? Create a Data Entry Manual and train people. Create digital validation checks. Emphasize security.
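A minimal sketch of the kind of digital validation check a Data Entry Manual might specify. The field names and ranges here are illustrative — a real study defines its own:

```python
def validate_record(rec: dict) -> list[str]:
    """Return a list of problems found in one entered record.

    Empty list means the record passes all checks.
    """
    problems = []
    # Required field: every row must carry a study ID.
    if not rec.get("study_id"):
        problems.append("missing study_id")
    # Range check: flag implausible values instead of silently accepting them.
    age = rec.get("age")
    if age is not None and not (18 <= age <= 110):
        problems.append(f"age out of range: {age}")
    # Controlled vocabulary: only coded responses allowed.
    if rec.get("consent") not in ("yes", "no"):
        problems.append("consent must be yes/no")
    return problems
```

Running checks like these at entry time (rather than at analysis time, years later) is what keeps multi-center data comparable.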

Electronic

Lots of advantages here of course. All usual caveats still apply.

Who Does the Coding?

Free-text → coded responses. Generally, coded responses are preferred, but you need to train people to code consistently. It’s great if you can capture both!

Data Processing

Convert the data to some format suitable for analysis (for qualitative work: a master file for qualitative coding). Lots of cleaning, imputation, checking for systematic errors, etc. Data hygiene ftw.

As for imputation, consult a statistician — the thinking about this task changes over time. Perform a sensitivity analysis with and without imputation.

Data Querying

What it says on the tin. Lots of software for this: SAS, SPSS, R, Python, etc. Morae for usability testing; NVivo for qualitative research.

Data security

Do any and all required training. Use your head (encrypt, don’t use email, etc.). Ask if unsure. Maintain audit trails of who has access. Recertify regularly.

If you are transferring anything to another organization, you must have a Data Use Agreement with that organization. You cannot even send deidentified data!

Limited Datasets & PHI

A limited dataset removes names, SSNs, street addresses, photos and images, biometric identifiers, device identifiers, VINs, certificate/license numbers, and so on.

Full deidentification (Safe Harbor) removes all of these plus geographic subdivisions smaller than 20K people and all elements of dates except the year — so keep a birth year, but drop the month and day, and do the same for admission and discharge dates.
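A minimal sketch of two of the Safe Harbor date/age rules — reducing dates to year only, and aggregating ages 90 and over into a single category. Function names are illustrative:

```python
import datetime

def safe_harbor_date(d: datetime.date) -> str:
    """Keep only the year: all other date elements must be removed."""
    return str(d.year)

def safe_harbor_age(age: int):
    """Ages 90 and over are collapsed into a single '90+' category,
    since exact high ages are themselves identifying."""
    return "90+" if age >= 90 else age
```

These transformations are applied to every date directly related to an individual (birth, admission, discharge, death), not just DOB.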

Note that deidentification is a defined term in HIPAA; anonymization isn’t.

Data Reporting

Usually report aggregates (means, etc).

Data Sharing

You need to ask people for permission to use their data for another project (or state up front that you may want to use it for “future research”). Be explicit. Seek informed consent.