What is the Smallest Viable Unit of Clinical Data?

In trying to free myself from past assumptions about clinical information system design, I decided it was a good idea to question everything. Like everyone else, my view of electronic systems has been colored by the use of paper charts. Shaking off preconceptions is hard. In doing so, I have found it helpful to look at other areas where the move from paper to electronic provided vast improvements over paper.

Drawing and painting software are excellent examples. What one can do with these programs is amazing: textures, emulations of just about any medium (oil, acrylic, watercolors, chalk). They permit artistic experimentation in a way paper could never match. Underlying all of this capability are algorithms that simulate the key properties that, for instance, make oil paintings look different from watercolors. Essential properties (light absorption, spread on paper, brush strokes) have all been rendered in code. Now, how do we apply these lessons to clinical software?

A proper first step seems to be separating the information from the medium. Paper charts have a certain organization and layout. For example, they allow random access to major sections, while searches within a section are sequential. Lists are in chronological order, and results are compared by flipping back and forth between reports. Radiology reports are present, but the actual images are not. The paper chart imposes specific limitations on the amount of information stored, how it is organized, and how it is accessed. Over the long history of clinicians using paper charts, an accommodation occurred that tied the access and storage of clinical information to paper’s constraints.

Shortliffe and Barnett (1) define a clinical datum as “…any single observation of a patient.” They further state that every datum can be defined by five elements: patient, parameter, value, time, and the method by which the observation was made. Moving from this definition, the next step is asking how it might be affected by paper concepts. Consider for a moment that paper-based data are relatively immutable; therefore, everyone using a paper chart has access to the same views. In addition, whatever use one makes of the chart’s data must be done outside the confines of the chart. Is it possible that the static nature of the paper chart has resulted in unintended limitations on electronic systems? I see three likely areas of influence.
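The five elements translate naturally into a small data structure. Here is a minimal sketch in Python; the field names and sample values are my own illustrations, not taken from Shortliffe and Barnett:

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass(frozen=True)
class ClinicalDatum:
    """One observation of a patient, per the five elements
    described by Shortliffe and Barnett. Field names are
    illustrative, not a standard."""
    patient: str       # patient identifier
    parameter: str     # what was observed, e.g. "serum potassium"
    value: float       # the observed value
    time: datetime     # when the observation was made
    method: str        # how the observation was made

# A hypothetical observation:
k_level = ClinicalDatum(
    patient="pt-001",
    parameter="serum potassium",
    value=4.1,
    time=datetime(2015, 4, 12, 8, 30),
    method="venous draw",
)
```

Making the structure immutable (`frozen=True`) mirrors the point above: an observation, once made, is a fixed fact; any correction is a new observation.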

Data exchange
When moving clinical data from one clinician to another, what additional information is required to make what is received acceptable and useful? Obviously, some provenance is required, but is it the same for every data transfer, or does it depend on the eventual use? Stated another way: what metadata are required for exchange, and how are these requirements best determined? Metadata are not discussed in terms of paper chart usage because the chart is self-contained. As a result, the smallest viable unit of clinical data in a paper chart should differ from that in an electronic one.

As described by Shortliffe and Barnett, a clinical datum qualifies, mathematically, as a relation. As such, I wonder if it is possible to describe and analyze data exchange requirements mathematically, an approach that I can find no evidence of ever having been tried.

Data modeling and storage
Few EHR systems have executable workflow models or other aspects of process-awareness. They are generally data repositories and, as such, are optimized to capture structured data and redisplay them to clinicians. Creating data models for capture and display purposes is much simpler than creating data models that, for example, support clinical research. Concepts such as visit, encounter, pending status, and explicit labeling of missing data can be quite helpful in clinical research, but are not required to create an electronic chart. Yes, RDBMS are quite flexible, but a relational schema designed with one use in mind is not readily repurposed for others.

A stable schema designed around specific use cases can become unstable if too many after-the-fact changes are made. One way around this is to separate storage from semantics. However, that leads to another issue: how best to test such systems to assure that all query results are correct. (If anyone knows of a test suite for information or data models in clinical systems, please send a link.) Having personally experienced fifth normal form-type query issues (2,3), I know that difficult-to-detect errors can creep into queries that return large result sets, especially when more than a few tables are involved.
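One common way to separate storage from semantics is an entity-attribute-value style layout, where the physical schema never changes when new parameters are added and meaning lives in the queries layered on top. The sketch below, using an in-memory SQLite database, is a toy illustration under my own table and column names, not the design of any particular EHR:

```python
import sqlite3

con = sqlite3.connect(":memory:")

# One generic observation table; adding a new parameter never
# requires a schema change, only a new row.
con.execute("""CREATE TABLE obs (
    patient   TEXT,
    parameter TEXT,
    value     TEXT,
    time      TEXT,
    method    TEXT)""")

rows = [
    ("pt-001", "serum potassium", "4.1", "2015-04-12T08:30", "venous draw"),
    ("pt-001", "weight_kg",       "81.6", "2015-04-12T08:32", "standing scale"),
]
con.executemany("INSERT INTO obs VALUES (?, ?, ?, ?, ?)", rows)

# Semantics live in the queries, not the schema:
k = con.execute(
    "SELECT value FROM obs WHERE parameter = 'serum potassium'"
).fetchone()
```

The trade-off is exactly the testing problem raised above: because meaning is no longer enforced by the schema, every query carries semantic assumptions that the database itself cannot check.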

Patient state vs. data model
One thing an electronic system can do that paper never could is provide a computational model of the patient. This is the area where I think paper-based thinking is most evident. We all know that EHR systems are repositories of clinical data, and the data in these systems are relatively passive: they sit there waiting to be accessed by a clinician or perhaps a CDS algorithm. A computational model of a patient, which I refer to as a patient state matrix, could make clinical systems more useful for certain types of problems, such as catching diagnostic errors. The idea is that instead of querying RDBMS tables, or even an information model, CDS and similar functions would consult the patient state matrix, which would be dynamic and updated in real time.

I have toyed with this idea for years, and only recently has it seemed computationally viable. Of course, clinical data units are an issue here as well. Since the patient state matrix lives in memory and models the patient, additional elements beyond those identified by Shortliffe and Barnett would be required; for instance, a decay metric that tells how long a value is to be considered true.
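A decay metric can be sketched simply: attach to each value a window after which it should no longer be trusted. This is a toy illustration of the idea, assuming invented decay windows; a real patient state matrix would need far richer structure:

```python
from datetime import datetime, timedelta

# Toy patient state matrix: parameter -> (value, observed_at, decay window).
# The decay window is the extra element beyond Shortliffe and Barnett's
# five; the specific windows below are invented for illustration.
state = {
    "serum potassium": (4.1, datetime(2015, 4, 12, 8, 30), timedelta(hours=24)),
    "heart rate":      (72,  datetime(2015, 4, 12, 8, 35), timedelta(minutes=15)),
}

def current(parameter, now):
    """Return the value only while it is still inside its decay
    window; otherwise None, signaling it can no longer be trusted."""
    value, observed, decay = state[parameter]
    return value if now - observed <= decay else None

now = datetime(2015, 4, 12, 10, 0)
# At 10:00, the potassium value (24 h window) is still trusted,
# while the heart rate (15 min window) has decayed.
```

Note that a decayed value is not deleted; it simply stops asserting anything about the patient's present state, which is exactly the distinction a passive repository cannot make.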

I am not certain of the best way (or even if there is a best way) to optimally model and represent clinical data. However, I do know that untested assumptions are not the path to major breakthroughs. Current assumptions about clinical data and how they are used grew out of clinicians’ experiences with paper records. These paper-era assumptions have affected terminology design, EHR design, data exchange efforts, and other aspects of clinical informatics. They may well be correct and represent the best possible expressions of data modeling and data use in clinical systems. Even so, I have a nagging feeling that, just as workflow technology has been a glaring omission in clinical systems, the “obviousness” of clinical data properties could be hiding something just as egregious. One thing is certain: the only way to find out is by asking questions. And the best place to start is with those questions whose answers seem patently obvious, like: Why do apples always fall down, never sideways? (4)

Asking what the smallest viable unit of clinical data is requires one to address not only how clinical data are used and stored, but also how one can assure their quality, what factors are essential for their security and proper exchange, and what they ultimately represent. I think the answers to these questions are less obvious than they seem, and that is exactly what worries me. Question everything…

  1. Shortliffe EH, Barnett GO. Biomedical data: their acquisition, storage and use. In Shortliffe EH, Cimino J eds Biomedical Informatics: Computer Applications in Health Care and Biomedicine, Fourth Edition. London, UK: Springer-Verlag; 2014:39-66.
  2. Owens J. Big Data x 5th Normal Form = Really Big Data Errors. http://integrated-modeling-method.com/big-data-5nf-fifth-normal-form/. Accessed April 12, 2015.
  3. Kent W. A Simple Guide to Five Normal Forms in Relational Database Theory. http://www.bkent.net/Doc/simple5.htm. Accessed April 12, 2015.
  4. Isaac Newton (apple incident). Wikipedia. http://en.wikipedia.org/wiki/Isaac_Newton. Accessed April 12, 2015.


  1. I wonder if we take data too seriously. I recently heard Bob Wachter speak about a serious error that he at least partially attributed to EHR design, particularly poorly designed alerts. He asks whether enough attention has been given to (1) what kind and how many alerts should be given; and (2) how alerts should be presented, and whether they should differ based on their seriousness. In a related vein, what data is important? When is it important? What should be most prominently displayed? What kind of data is most likely to cause harm if it’s not accurate? I hear a lot of people complain about having too much data to sort through to put together the story of the patient, potentially missing a key piece like a needle in a haystack. I don’t have answers; it may be different in different situations, but I think it’s important to ask now that we actually have the capacity to generate and present far more data than ever before.

    1. Hi Sandra. I don’t think the problem is taking data too seriously.

      Data elements are a form of abstraction about the state of the real world. When clinicians look at a K+ level, they assume it represents a true internal condition of the patient. Therefore, when looking at data usage, say in the form of alerts, there are multiple issues. The main one is what the alert is trying to say about the real world when it fires. Many alerts are mechanical: they fire because some rule or database condition is present. However, mechanical firings are unaware of real-world states and so never learn. So, I don’t think we take data too seriously. Rather, I think that since we began collecting patient data on paper, we think of it as a chart note or a lab value, and do not appreciate it as a window into the state of the patient.

  2. Interesting concept. I have worked with a couple of systems (and written one) that had, as their basic data point, the “Observation” as you describe it. GE’s Centricity even calls the table they live in the “Obs” table. The system that I have been working on has a “Response calling” system in it that sounds a lot like your “Patient State Matrix.” It uses a second table that relates a bunch of other discrete “Observations” to one “State observation” (in this case, the response of a disease to treatment). It is not perfect or complete yet, but we are working on it.

    1. Clyde, sounds interesting. The patient state matrix idea came about as a result of trying to figure out a way to make CDS more robust. It is an in-memory data structure. In trying to model the PSM, I decided that a new way of marking clinical data is required that takes into account truth decay (how long a value can be trusted) and other parameters not currently included in EHR systems. Creating a PSM may take time, but I think it is an essential feature of future systems. I will be writing more about this in the future, but in a longer format than a blog post.
