Since my last post about NoSQL data stores, I have been sporadically using the book Seven Databases in Seven Weeks (1) to explore NoSQL systems. This book has been a great guide to understanding how the various types of NoSQL stores differ and what each type brings to the game.
While NoSQL systems have proven to be interesting, I am, at heart, a relational guy. No surprise there; I have been using relational systems for years. I know how to solve problems using RDBMS, and I am comfortable with data modeling and normalization—even past third normal form. A recent project, however, gave me a real-world understanding of why NoSQL systems are catching on as solutions to certain types of problems.
While trying to create a data model for an application, I encountered a common data modeling problem: how to manage supertype/subtype data (2). Anyone building a moderately-complex application has seen this problem. Obviously, applications use databases to store and retrieve information. Some information is simple, which means it is easy to store and retrieve. For example, a contacts application may store only a name, primary and secondary address, a couple of phone numbers, and a few other items. If the goal is to store no more than 10 data elements, then data modeling is simple and so is retrieval. A simple SELECT statement will do the trick.
SELECT name, primary address, primary phone FROM contacts WHERE name = “some name”
Adding or updating a record is similarly straightforward. When the data become more complex, queries become more complex as well. If SELECT queries involve joining multiple tables or inserting new records means adding to multiple tables, then locking tables to complete queries can be time-consuming. Speed was one of the drivers for NoSQL system development. Websites that handle millions of queries opted for speed over the consistency that RDBMS transactions provide (though some NOSQL system do offer transactions).
Another driver for NoSQL development was the need for flexibility. Being able to grow and change schema quickly without dealing with the issues that can occur when changing relational schema also proved to be very attractive. Now, with that bit of background, let’s take a look at my supertype/subtype data modeling issue.
Supertypes and subtypes
I am creating a searchable repository. All items are information resources, and there are seven different types (so far). Each resource has important properties that I want to be searchable. In addition, every query must be as fast as possible because this will be a public-facing web application. So what is the best data model for supporting fast queries, allowing addition of new properties as needed, and being easy to maintain?
Modeling for this scenario means deciding among the following choices: 1) one huge info_resource table that will hold all resource types, 2) a main table that holds common properties accompanied by seven additional tables that hold data specific to each resource type or, 3) one table for each resource type.
Dealing with subtypes crops up often in clinical applications. For example, if an inpatient EHR system allows for feature access by roles, there will be a need to discriminate between various types of employees who access the system. In any hospital, there will be clinical and non-clinical employees. Clinical employees will have many different backgrounds, training/certification requirements, degrees, etc. As a result, there are different categories of employees, and each category may have specific data elements that are unique to it. Consequently, the more one wishes to grant access based on specific properties, the more complex the data model is likely to become and, by extension, the queries required to access all the information for a specific employee. In other words a simple employee table that holds all required information is not feasible.
One huge table
Kicking normalization to the curb and putting everything in one huge table would work. After all, storage is cheaper than bandwidth. Using this option, one would write SELECT, UPDATE, and INSERT queries for each subtype that would operate only on the fields of interest. There would be no joins required—a good thing. However, if a large number of properties were involved, the table would be a mess. Scratch this option.
Main table linked to a sub-table for each resource type
This is workable. It allows normalization rules to be properly applied, and it feels right. Of course, now obtaining all of the information for a resource requires a join. Inserts, and possibly updates, require accessing at least two tables and some type of transaction mechanism to assure consistency.
Inserts and updates are not a real concern for me because they will be relatively infrequent. However, since website users will be able to search and browse information—the main activities—joins are a concern. Importantly, all resources can be located by searching the main info_resources table. In addition, new properties may be added easily for each resource type.
One table for each subtype without a main table
Creating a table for each subtype eliminates joins and allows for normalization, but it also makes it harder to find all pertinent resources. For example, if one resource is a cooking class and another a cookbook, they would be in separate tables, yet I would want a search for “French cooking” to return both resources. With separate tables, all tables would have to be searched to locate all resources.
Being that I have spent so much time using RDBMS, I chose option #2 as my initial data model. As always, life brings change. In this case, I found that I added both properties and tables after supposedly finalizing the data model. I also realized that there were additional types of potential resources that I might consider adding, so my stated total of seven was subject to change. Having realized this, I began thinking more about how the database would perform on a typical server. Full-text searching will be enabled on a few large fields, which raises some performance concerns.
After wrestling with option #2 for a while, I began wondering if NoSQL might be a good solution for easily adding new properties and whether using keyword instead of full-text searching would be just as effective.
The attraction of a document database such as MongoDB (1) is that it makes it easy to create records for any resource type without worrying about having to add data fields later. All one need do is append them. In addition, all resource types can be contained in a single collection (the document database equivalent of a table). Thus, a search through the collection would be able to find all products. Together, these features address my concerns regarding ease of adding properties or types, and locating all relevant products.
The utility of the document approach should not be underestimated. Consider the type of relational model required to store all types of lab results. For example, a serum K+ level is simple—it has a name “serum potassium,” a numeric value, and units. However, a Chem-6 panel consists of six name-value-unit components. A CBC with differential and peripheral smear has even more components, and the smear is reported as text. Using a document database, all of these could easily be stored and searched within the same collection. Documents make searching and storing items with varying properties simple.
Given the above, moving to a document store might seem to be the thing to do. Well… that’s not exactly clear yet. Now I must consider if there needs to be any relationships between resource types. At present, the most that I see a need for in terms of relations are who searched for what items, the popular items accessed, and things of that nature. None of these require links between resource types.
Where do I go from here? The relational model has been rendered in MySQL 5.6. It will be tested with dummy data for query speed. Since I am using OOP to build the system and there is a data access layer, the code will be compatible for either data store. Once the MySQL tests are done, I will spend a week or so testing with MongoDB. Should I decide to go with MongoDB, I’ll write a post about it.
[On another note… Many have asked about the clinical workflow forum. It was removed because of technical problems. I hope to restart the forum—actually as a Q & A feature—by the end of October. Doing so will require upgrades to the underlying software, testing, a theme update, and a few other changes. If there is no post on 10/27, you’ll know the site upgrade took more time than expected.]
- Redmond E, Wilson JR, Carter J. Seven Databases in Seven Weeks: A Guide to Modern Databases and the NoSQL Movement. Dallas, TX: Pragmatic Bookshelf, 2012.
- Hoffer JA, Ramesh V, and Topi H. Chapter 4: The enhanced E-R model and business rules. In: Hoffer JA, Ramesh V, and Topi H. Modern Database Management, 9th ed. Upper Saddle River, NJ: Pearson Education; 2010:150-198.