The hardest part of teaching clinical database design is helping students grasp the need for precision in naming and representing data elements. Newbie modelers often assume that everyone will understand their data-capture assumptions because, well, they're obvious. Experience has shown that having students attempt to merge data from multiple sources works well as the proverbial picture that is worth a thousand words.
I ran into the problem of merging data from disparate sources while working on the CNICS project at UAB. The variation among sites in capturing information for the same data elements was surprising. Since then, I have noticed the same variations when reviewing clinical data. About five years ago, I put together a lecture that discussed the most problematic variations and created a fictional data set to use as an instructional aid. Here are the issues addressed in the lecture/data set.
Terms with “Obvious” Meanings
Too often, terms are assumed to have an obvious meaning. For example, the term “social drinker” appears in the Social History section of the data set. This is usually the interviewing clinician's interpretation of what he/she has been told. What level of alcohol consumption does it actually represent? Even if the term is fairly well standardized within a given medical facility, its meaning is unlikely to be understood by outside reviewers.
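A structured record makes the hidden assumptions explicit. As a minimal sketch (the field names and categories here are invented for illustration, not taken from any real schema), a free-text label like “social drinker” could be replaced with what was actually reported:

```python
from dataclasses import dataclass

# Hypothetical structured alternative to the free-text term "social
# drinker": capture the reported quantity and who characterized it.
@dataclass
class AlcoholHistory:
    drinks_per_week: float   # actual reported quantity
    drink_type: str          # e.g. "beer", "wine", "spirits", "mixed"
    reported_by: str         # "patient" vs. "clinician interpretation"

# Instead of "social drinker", record what was actually said:
entry = AlcoholHistory(drinks_per_week=4, drink_type="wine",
                       reported_by="patient")
print(entry.drinks_per_week)  # 4
```

An outside reviewer can interpret `drinks_per_week = 4` without knowing anything about the originating facility's conventions.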
Lack of Specificity
A common problem encountered when merging data or reviewing charts is a lack of specificity. Smoking history is a prime example. Often it appears as a bare “Yes/No” value. However, research data on the effects of tobacco use are most useful when usage frequency is paired with a duration measure: 20 cigarettes/day for 15 years. Another missing piece of information is the type of tobacco used. Only rarely are pipes, chewing tobacco, snuff, and cigars listed as separate items.
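A sketch of what a more specific smoking-history record might look like, assuming an invented schema (the field names are illustrative). It captures type, frequency, and duration, from which the standard pack-year measure falls out:

```python
from dataclasses import dataclass

# Hypothetical structured smoking-history record; field names are
# illustrative, not drawn from any real data set.
@dataclass
class SmokingHistory:
    tobacco_type: str        # "cigarette", "pipe", "chewing", "snuff", "cigar"
    units_per_day: float     # usage frequency
    duration_years: float    # how long at that rate

    def pack_years(self, units_per_pack: int = 20) -> float:
        """Conventional pack-year measure: packs/day * years smoked."""
        return (self.units_per_day / units_per_pack) * self.duration_years

# 20 cigarettes/day for 15 years = 15 pack-years
history = SmokingHistory("cigarette", 20, 15)
print(history.pack_years())  # 15.0
```

A plain “Yes” collapses all of this into one bit; the structured record preserves the dose and duration a researcher actually needs.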
Missing Units and Normal Ranges
Missing units and normal-range information seem to occur mostly in abstracted data sets. The original units and normal values are not recorded, forcing the reviewer to make assumptions about them. Medications and lab results are the usual culprits. Most clinicians can guess the correct interpretation; others may be confused. Either way, guessing is bad.
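One defense is to make the record carry its own units and reference range so they travel with the value. A minimal sketch, with an invented record layout (the glucose range shown is an example value, not a clinical reference):

```python
from dataclasses import dataclass

# Hypothetical lab-result record that stores its units and reference
# range alongside the value, so no downstream reviewer has to guess.
@dataclass
class LabResult:
    test_name: str
    value: float
    units: str
    ref_low: float
    ref_high: float

    def flag(self) -> str:
        """Mark the result Low, High, or Normal against its own range."""
        if self.value < self.ref_low:
            return "L"
        if self.value > self.ref_high:
            return "H"
        return "N"

glucose = LabResult("glucose", 142.0, "mg/dL", 70.0, 99.0)
print(f"{glucose.value} {glucose.units} "
      f"[{glucose.ref_low}-{glucose.ref_high}] {glucose.flag()}")
# → 142.0 mg/dL [70.0-99.0] H
```

A bare “142” in an abstracted data set forces the reviewer to guess both the units and whether the value was abnormal; this record answers both questions itself.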
Abbreviation and Acronym Potpourri
Every clinical facility has its in-house abbreviations and acronyms. Unfortunately, they can make it into data sets and cause problems far from their home territory. Researchers who use them in databases without a proper data dictionary to track them invite self-inflicted misery. It can be difficult to remember what an infrequently used acronym or abbreviation means five years later.
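A data dictionary can be as simple as a table that maps every abbreviation in the data set to its expansion, with lookups that fail loudly rather than guess. A minimal sketch (the entries below are invented examples):

```python
# Sketch of a minimal data dictionary: every abbreviation used in the
# data set gets an entry. These expansions are illustrative examples.
DATA_DICTIONARY = {
    "SOB":  "shortness of breath",
    "NKDA": "no known drug allergies",
    "HTN":  "hypertension",
}

def expand(abbrev: str) -> str:
    """Fail loudly on an undocumented abbreviation instead of guessing."""
    if abbrev not in DATA_DICTIONARY:
        raise KeyError(f"{abbrev!r} is not in the data dictionary")
    return DATA_DICTIONARY[abbrev]
```

The point is less the mechanism than the discipline: an abbreviation that cannot be added to the dictionary should not be allowed into the data set.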
The Meaning of Blank Fields
A blank field can be interpreted in so many ways: “information not currently available”; “intentionally left blank”; “negative response”; “not done”; “unknown”; and many others. Blank fields should always be accompanied by an explanation. For the CNICS project, we created a set of special markers just to address the problem of missing data. Since then, coming up with specific meanings for blank fields has been sort of a pet project (my “being and nothingness” codes).
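One way to implement such markers is a closed set of missing-data codes, so a field is stored either as a real value or as an explicit reason, never as a bare blank. The CNICS markers themselves are not described here, so the codes below are invented for illustration:

```python
from enum import Enum

# Illustrative "nothingness" codes: the specific reasons a field holds
# no value. These enum members are invented examples.
class MissingReason(Enum):
    NOT_AVAILABLE = "information not currently available"
    INTENTIONALLY_BLANK = "intentionally left blank"
    NEGATIVE_RESPONSE = "negative response"
    NOT_DONE = "not done"
    UNKNOWN = "unknown"

# A field is stored either as a real value or as a MissingReason,
# never as an unexplained blank.
smoking_status = MissingReason.NOT_DONE
print(smoking_status.value)  # not done
```

A reviewer merging this data can then distinguish “the test was never run” from “the patient said no,” which a blank cell cannot convey.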
A lack of precision when designing databases usually results in misery somewhere down the line. This exercise has been very effective at demonstrating that good clinical data sets require attention to detail.