Medical Informatics I: Principles of Database Design

Article Index
Medical Informatics I: Principles of Database Design
Page 2
Page 3
Page 4
All Pages

Principles of Database Design This lecture was delivered as part of a week-long survey course designed to familiarize individuals with the application of computer technologies and information science in medicine. The course a National Library of Medicine fellowship program directed at medical educators, medical librarians, medical administrators, and young faculty who are not currently knowledgeable but can become agents of change in their institutions.

"[Medical Informatics] is the science of organizing information to make it useful, to make it retrievable, so people can use it to solve health problems and understand health and disease better. It is the technology for implementing that science, such as databases, communication networks, and other forms of digital tools.

We are increasingly surrounded by so much information that only computers provide a plausible way to keep it under control and find out the facts that we need. That is what medical informatics is about."
Dan Masys, interview at MBL, May 2001

In practice, medical informatics is the application of technology to all aspects of health care information. It involves both the art and science of organizing medical knowledge and applying such knowledge for the purpose of preventing human disease and suffering. Technology is pervasive and necessary to handle and manipulate the growing body of medical knowledge. Technology winds its way throughout health care--from processing results of medical research to applying knowledge in clinical practice; from accessing and processing patient records to making decisions in evidence based practice; from telemedicine to knowledge-based and decision-support systems.

Below you will find the transcript for this lecture, or you can view the video here http://www.mblwhoilibrary.org/services/lecture_series/masys/lecture_video.html

Transcript Part I: Introduction:

"Principles of Database Design" has been a staple of this course for the last 7 years, for a reason that endures, and is increasingly important in our ubiquitous computing environment, where all of you probably have desk top applications that allow you to easily create databases for your own purpose, and that of your organization.

The outline of what we are going to cover in the next 90 minutes is, first of all, our explanation to you of why we think it is important to learn this.

Secondly, the principles and the paradigms. "Paradigm"...Ted Nelson has described it as an idea that is too big to get through the door. The paradigms of the generations of database systems that have evolved over the last 3 decades. The principles of the most widely used database modeling these days, relational databases, are important so we are going to focus in some detail upon how they are designed and built, look at some of the methods that are used for creating systems of various sizes, and then lastly we will have a classroom exercise where here in real time on the screen, we will convert an old-style MEDLINE record into a relational database equivalent of it, so you can understand some of the principles of both entity relationship modeling and one-to-many parent/child relationships.

Why this topic is important/Historical Perspective:

So why is this important? In our health care environments and in laboratory and research environments, you are constantly besieged by vendors who sell you very attractive looking user interfaces. But the principle message of this session is that the durability of systems--their flexibility, their power--is really in the data model and not so much in the user interface. And the data model, if well-designed, will outlast and support uses of the data that nobody even imagined when the first system was developed.

Now when I said this before, actually the people from Columbia University were here and they said, "You don't understand. The data model outlives the original system design even if it is a bad one. They persist abnormally long in some cases regardless of whether they are well done or badly done, so it is important to try to do them well if you can." The evaluation and comparison of vendor products, again which is often done just on the way they look and feel to users, is actually much more powerfully done by comparing data models if those are available, and it also gives you a vocabulary for communicating with vendors and people who are helping you to build systems. Not the least, however, is the ability to make your own databases, and most people do, in fact, with tools such as Microsoft Office, that includes Microsoft Access.

How many have built at least one database in something like Microsoft Access? So, it looks like about two-thirds, maybe three-quarters of the class has done that. Then you realize how easy it is. Any fool can build a database and many do, just like Web pages. The entry level skill set is very, very low but in fact you quickly confront the issues of data representation and modeling whenever the task gets more than just the equivalent of recipe cards.

What is a database? Well, the term is associated extricably with computers, but in fact in its generic sense, it is really any organized collection of information, though these days computer-based representation is implicit in the idea of database. The general notion is that they are systematic. That is, they have a structure and a functionality that supports automated retrieval beyond what it is capable to do with just manual record keeping, and also support automated symbol manipulation. The idea of being able to reason on the data, make conclusions based on values that are present in the database, is an important aspect of things such as clinical decision support and other forms of rule-based systems design that are increasingly built, not with coded rules but rather with reasoning using the values of data in databases.

Historically, the first computer systems were, by and large, mainframe systems from the 50's and 60's, created with a software program that wrote a file, that is, a sequence of characters on the disc that were the sole province of the program that created them. That is, they were owned by the application program. And by and large the files in those early days were what were called sequential files. If you think of a document as just a long string of characters, maybe a million characters long, the sequential file would simply have them head to tail in the sequence in which it would be read from the beginning to the end.

Sequential files gave way to what are called "random access files" where, instead of trying to find a particular value half way through a million bytes, one could look up in a table where a particular record was in the middle of that very large collection of bytes, and go directly to that location on the disc to retrieve it. Those systems were brittle in a sense that if one changed the definition of how the file was laid out, all programs that accessed the data had to be rewritten and many would stop returning the correct values, if someone for example, inserted a few extra characters into each record in the sequential file. And as well, the general problem of a program owning the data did not allow other programs to have simultaneous access to it. So they were good for single applications that were generally referred to as stovepipe applications. They didn't talk to one another. They were stand-alone applications and such programs are commonly written and still existing in the world today. It is a situation where an application has the hubris of believing that it is the entire universe of importance and does not have any way of either exchanging data with other applications or changing the structure of their contents.

Transcript Part II: Data Base Management Systems- Hierarchical Databases

So, that gave rise to a more flexible approach...the general notion of DBMS's or data-based management systems, that inserted a layer of abstraction, that is, a data model between the application and the underlying stored data. The first of them on the scene, especially in medical terms, were hierarchical databases where the general notion is that data is laid out and in a set of tree-like structures, where there is a root. The hierarchy is , more or less like an inverted tree that has child records. And each of those may have relationships to other records. The best known of these in the medical instance would have been "MUMPS" Massachusetts Utility Multiprogramming System, that Octo Barnett and the group at the MGH computer lab developed specifically for storing patient data, in a way that allowed one to very efficiently follow these links to get one patient's set of results. And in fact, most hierarchical systems...and there are lots of them out there...there is more MUMPS code and more hierarchical databases in clinical health care than there are in any other data model, although that is slowly changing through relational systems. Because of the fact that the Defense Department and the VA use a MUMPS based architecture (the DHCP architecture for all of you who are familiar with VA environments). It is built on this data model of this linked list of parent/child records where the patient's social security number is the root of each of the trees. It makes it very, very efficient to look up all the results related to one person.

Advantages of Hierarchical Databases:

The advantages compared to what came before it were that it provided very efficient storage. In a sense you only created one of these child records when the data appeared. That is, you didn't create white space or empty records. You only created it so it was packed very tightly, and you stored not only a new record that would be inserted, but also the linkage to what piece of data was related to it. And it provides very rapid access to the predetermined data hierarchy. So, looking up lab results to get to a patient's serum sodium, for example, is a blink of an eye, a few milliseconds, because the system is optimized for this patient-at-a-time look up.

Disadvantage of Hierarchical:

The disadvantage of these hierarchical models is that by and large, it tends to be very difficult to view from anything other than the tree perspective, looking down the trees. So if you want to look across the tree of all laboratory values that were from that date, the only way to get there is that you have to start at the root and look down at all the trees until you get to that corresponding level in the data.

So that for doing outcomes analysis--, secondary uses where you aggregate the data--, where you ask questions like; how many patients that we saw last month with hyperthyroidism also had hypercalcemia ---those kind of queries can sometimes take hours or days to run because of this very inefficient way that it is all "patient-at-a-time" oriented and you are trying to look across the patients rather than down the trees.

It is very hard to modify the underlying structure of the trees if you have complex relationships among the data as well. That notwithstanding, there is is a lot of MUMPS code still out there, and MUMPS databases that exist, and understanding their strengths is important. If you are a provider, it is absolutely the ideal way to look up a "patient at a time". If you are a health care organization, you are trading off a dominant efficiency in one mode of looking at data, for this relative inefficiency of trying to look across the trees.

Network Information Database:

The next model that appeared on the landscape generally in computer science and specifically in health care with Larry Weed and the PROMIS [Problem-Oriented Medical Information System] system, which is the health-oriented record that was created at the University of Vermont, and was a predecessor of the Internet as we know it. It was the database as hypertext...the so-called "network information database" where you can have individual records and the relationships could be quite arbitrary. That is, any record could be a parent/child relationship or a peer level relationship. You could have multiple relationships between all kinds of records. So it is a very powerful model for establishing a semantic, or meaning relationships, of things in health care.

When we built the PDQ system at the National Cancer Institute, we chose the PROMIS Database management system specifically because we had some very severe naming problems in non-Hodgkin's lymphoma. There were three naming systems that were used for these lymphomas that had evolved over many years, and they were a tangled network. That is, the same word was used for different kinds of lymphoma, and in different naming systems one might be a "parent" of one term and a "child" of another, but it didn't occur that way in a second naming system. It allowed us essentially to create any arbitrary set of relationships between the information objects, that is, the facts contained in a system. The strength of doing that, is that you can model "many-to-many" relationships--and we will talk more about what we mean by "many-to-many" relationships-- as well as things that look like hierarchies.

So you can make a network database look just like a MUMPS hierarchy if you needed to, and as well, you could make it look like a simple list of things where this one was chained to this one...was chained to this one... You could also make a network database look like a sequential access file. You could make it look like a hierarchy. You could even actually make it look like a relational system that we will cover later.

Disadvantages of Network Information Databases:

The disadvantages--and we certainly incurred the disadvantages when we built the PDQ system--is that it is very difficult to predict and control the effects of transitive relationships. That is, you may know that B is to C, that is true...that this is the parent of this, and this is the sibling...but then someone else comes along sometime later and creates this backlink--now all of a sudden you had a new relationship, you had B, C and D. Now there is a relationship between B and D that you didn't anticipate. And so you can get both recursion, where things are pointing to one another in loops, and you can even get these bi-directional relationships.

So what happens is you get a very tangled web--an apt metaphor because that is exactly what the Internet is like--the Hypercard of the Macintosh and the World Wide Web "where anything can point to anything...all just a bunch of links" is really this model of hypertext raised to a global level, where now it is absolutely impossible to predict or control these relationships between Websites and URL's because it is so easy to set links between things. It is also very input/output intensive. That is, the computer has to go to the disc a lot to assemble all of the picture. What this might be, for example, one logical record that describes a relationship in a physiologic system, but you have to go out and actually individually read, not only these individual facts, but also the relationships between them. And it has its potential, and I have to say more than potential, it is almost inevitable...that sooner or later your database becomes incomprehensible. You just can't understand what it is you have built. It is not unlike the Web.

Relational Database Systems:

So that gave rise to the notion of rows and columns and the dominant metaphor of our time built on the theory of relational systems which is a mathematical step theory, the key paper of which was E. F. Codd's work from about 1970. Where the unit of information is by and large a very simple and very easy to understand: two-dimensional array rows and columns. Rows and columns define a table and one can have relationships between individual data items within the table and among tables in increasingly complicated models. But each of the individual elements is this...almost looking like a 3 x 5 card...a simple two-dimensional array.

Advantages of Relational Database Systems:

The benefit of doing that is that the model, even as it grows in complexity and you get dozens and dozens of tables, the individual atomic element tends to remain understandable. It is relatively easy then to create a variety of logical aggregations by setting relationships between tables. For example, here is a data linkage which associates a value in this patient unique identifier," patient UI columns 12345, with the same data values in a column of the same name the patient unique identifier in a table whose name is lab test. So it is possible to create arbitrary sets of logical collections of records that are based on these identical data values, occurring in columns that are in different tables. One can also distinguish views, so that depending upon what your rights of access are you can block, so certain users cannot see certain tables. At the level of the row and the column it is possible to create rules and filters that make the database appear to have many, many different structures even though underlying it, it may be a much larger group of data items.

The good news is that structure is easily modifiable. That is if you add new elements, you add a new column to a table, relational database systems tend not to break because the code that looks for the existing column names, still finds them. And unlike the sequential record, where changing the underlying sequence of the characters cause the computer to look up in the wrong place, the model of a relational system is, you can add new columns and as long as you don't break the old columns or delete them, the programs written many years ago continue to work as they had originally been designed. So, it is easy to modify the structure.

Disadvantages of Relational Database Systems:

Now the disadvantage is again, it is I/O (Input/Output) intensive. Lots of output because the computer, in order to find Elmer Smith's serum sodium, has to go to this table, get this unique identifier, go to an intermediate file called an index, that tells it where to find the same value in another file. So there is lots and lots of disc activity, and if you look at the little light on your disc drive it blinks a lot before the first record appears on the screen. One logical record may be comprised of many, many physical records.

The original GenBank database for example was a flat sequential text file describing the DNA sequence, the bibliographic record and some key words and features in the sequence, and those were stored as individual flat text files. Walter Goad's group out at Los Alamos created the first version of GenBank that had those features. When the National Library of Medicine took over GenBank and moved it to the Library in the late 1980's, the fully normalized...and I will tell you what I mean by that...relational version of those same files, had 65 tables in it, because of each of the elements that could repeat within a GenBank record. We will go through that in our database exercise at the end of this session.

Object-Oriented Databases:

There was a new kid on the block in the late 80's called object-oriented databases. Has anybody worked with object-oriented databases? Smalltalk was one of the first of them, and they were the wave of the future 15 years ago. They still appear to be the wave of the future at the turn of the 21st Century.

Advantages of Object-Oriented Databases:

They have very attractive features and properties, such as the ability to store multiple data types. So we could put pictures, images, sounds, graphics, electrical signals and not worry about the data type in database. We would simply stick all of those things in the container and let the database worry about how to handle all the different processing.

The object-oriented model of database design included the notion of encapsulation of data and programs. The notion of capsulation was that, the data carried around with it the links that told the system how to behave properly. So for example, if you want to print an image, say, a black and white or colored image, often that is a different print routine than simply printing a grocery list which has alphanumeric characters. The data would carry with it the specification of which programs should do which operations on the data. An additional notion of encapsulation was as a system designer you never needed to worry about that because the data would take care of itself in that regard. And that resulted in the notion of programs, the data-based programs that did useful work could be sets of very small messages such as "print yourself". And the data would somehow know what that meant, and it would go out and gather together all of the programs it needed to print itself.

The advantages were that applications programs consist of these very high level programs, or commands and functions that do not need to know anything about the underlying data organization. It also featured the idea of modularity. That is, we could write programs that were little, provably correct modules that would always run as designed. Then you could just string them together and you would get reliable predictive effects from that. The idea of reusability, that is, we could write one of these modules and just use it for all kinds of different things. It would be sitting out there and know how to behave correctly and we could have portability between systems.

Disadvantages of Object-Oriented Databases:

The disadvantage is that these systems are early in commercialization. This text has not changed since 1987. These systems have been early in commercialization for a very, very long time. They also are very, very CPU intensive because they have to do additional levels of decoding because of this encapsulization, compared to standard data based symptoms.

And unfortunately, although we had this idea of portability, it has never been realized. That is, there are many dialects even now, of objects and queries, so there was never developed a uniform language of the types such as is found in SQL, the Structured Query Language...that "talks" to relational systems and has become a dominant language of databases.

XML Databases:

The newest arrived "kid on the block" are XML databases--extensible markup language. XML is a first cousin of HTML that you will be having more experience with in this course, and it is built on the notion of tagged formats. That is the notion of a "start tag" and an "end tag", and what is between it is the data defined by those tags. So in a sense the tags are the field names, and the data between them is the content of the data base.

XML is a proper subset of what had been a page description language. The librarians will recognize that SGML was developed by publishers as a page description language for print format, meaning it told you which words to make bold, and which words to make italic, and where to place them on the page. That page generation language evolved into HTML, which is an improper subset, in a sense that in the hypertext markup language of the Web, you could take SGML and make HTML out of it, but you lose content by doing that. You can't go back to SGML because it is an improper subset that does not have all of the elements of full SGML.

In this model the XML database, the unit record is not defined necessarily in the structure of a computer program or a data based layer, but rather is defined by what is called a "DTD" or a Document Type Description, and we will show you an example of that. Every vendor is now claiming that they have XML databases...but a lot of this is actually a layer of processing where the actual data is not stored with XML tags...it may be stored as a relational table. But when it goes in and out of the database, the tags get added or stripped off, so that it is stored in a relatively compact fashion. But as far as you are concerned, it appears to be as if the data were stored with its tags.

So here is an example of an XML coded medical record entry. It looks, again, like HTML where you have the definition of a high level object--in this case a document--and the closing tag which says, "This is the end of the document." And inside of it are a set of a defined identifiers.

So, for example, this document happens to have a document type of diagnostic radiology examinations. And then it has some other predefined variables---if you will, tagged elements of the database. And here is a definition of an event, a definition of a patient where the patient has characteristics of patient ID, patient name, patient date of birth, patient sex value, and that is the end of the patient logical record.

And you could define arbitrary information objects to store, and that is exactly the power of XMLthat you can create these things in a way that represents an almost arbitrary level of complexity...that you decompose the major object in a nested hierarchy where in-between these high level tags, like "document" and "end document" and "event" and "end event" and "patient" and "end patient", one can have arbitrary repeats of characteristics of that particular object.

This is very attractive, and it is also the case that XML can now be natively interpreted by things like Web browsers. So it is possible with XML to create not only a data model, but something that automatically the Web browser knows how to display correctly because it understands this syntax of an "object" and "end object" tag.

Advantages of XML Databases:

Very powerful model and its strengths are this flexibility to represent a very wide range of data, almost arbitrarily anything you imagine. The data carries with it it's field assignment, and it handles relatively sparse data compactly. So, for example, if there is no instance of a laboratory value for this medical record, you don't even put that tag in. You can leave it out.

Disadvantages of XML Databases:

The weaknesses are that the database tools for doing this are not nearly as mature and efficient as they are to relational systems. The messages tend to be very verbose.

I had a very painful experience in my bio-informatics work of downloading all of Medline 10.7 million citations in Medline are now available only in XML format. Has anybody looked at Medline XML format? Well, and we are going to go into it in detail, but you know the XML the record of Medline is a relatively complicated record. It has all these little sub-elements like author initials, author names, publication types, pub years, and pages, and it goes on and on. The 36 gigabytes of data of XML data that comprises Medline contains 12 gigabytes of data and 24 gigabytes of XML tags. So it is an extreme example of this verbosity of the messages and tags.
In some cases where you have relatively complex records and many, many, many records...millions of them...almost all of your processing time is spent just reading the doggone tags. I have actually been an advocate of asking NLM to publish a more compact version of the new XML, but since they are using it for internal production, I haven't gotten very far.

It is also is obviously I/O intensive because again, of this notion that whenever you encounter a tag you have to go look up what it means if you are the computer, and it trades off decreased efficiency for increased flexibility. And I put this question mark in scalability last year before I encountered this Medline problem, but I am increasingly persuaded that XML is an excellent representation for medium-sized databases of medium complexity. So, maybe up to a million records and maybe hundred fields it is good for that. Not so good at 10.7 million records, times however many fields there are in the Medline format.

Transcript Part III: Data Models:

One of the fundamental assertions of this class and I have already said it but will say it again is that the data model is really the most critical aspect of system design and function. That is where you should begin when you are in the business of either working with a vendor, or designing your own new system. Data models, to endure, should really reflect the real world objects that they describe and the relationships between those objects. And I will show you more of what I mean by that in a little bit. A correct data model (as well as an incorrect one) subserves and outlasts the original application and many not anticipated at systems start up. The object is not the instance. The object is the abstraction of all members of that class, all objects that share those same characteristics. And we group those objects by criteria that represent the abstract as an empty table. This is a relational model here, but actually you can model other underlying databases the same way. So, this is not everything that defines a dog, but for purposes of our system it contains the characteristics that are of interest to us. We have a dog name, a dog breed, its favorite food and its birth date, as an example of that set of abstracted characteristics that are meaningful to us for purposes of whatever system we are designing.

Those empty tables then can be filled in with "instances"--in computer science the two dollar word for this is "instantiation". And an "instantiation" just means we have created a new row that has actual values for each of the classes--the categories of things that are in the empty table format.

So you can see some of the examples, there. This model can be used to represent not only objects but their relationships in the real world, and events and things that happen in time and places. So the relationships between objects are themselves "attributes", that is, characteristics of the objects. And those attributes themselves can be modeled in relational tables.

Here is an example of a relationship where there is a dog table and each dog has an owner name. And that hence the relationship to another table, where this is a human being who has the relationship of owning the dog. So we have linked data from two different tables by virtue of common occurrence of the column in both of those table structures.

The important rules about which one does this, are that all of the real world things in the set; that is, the instances, the rows as you create them and fill in the table have to have all the same characteristics. This is a common problem in database design, that you could have a model that looks absolutely sensible, but fails.

Here is a table called license, and it has characteristics of the license number, the expiration date, the manufacturer the model. Here is a row that is an instance of the car that sits very nicely, but when we try to put FiFi in here, which is a different kind of object from the real world, even though it may have a license, these characteristics don't match. So what happens...you either get holes in the table because that characteristic simply doesn't apply...you are trying to store two different kinds of objects in the same table design...or you get very, very strange values. And what happens then, is you are writing computer programs to look at the values and decide what am I going to do with this one. Do I do something different with it than with the other one?

This seems ludicrous but I will tell you it happens more often than you can imagine, in what people call, "data model creep", where someone creates the first data model, looks pretty good, and everything fits together. And the next person comes along and says, "But we need to add a new field. Where shall we stick it?" And they often violate the sensibility of the object definitions implicit in the original table design.

So all instances have to conform to the same rules. There are a number of different types of objects. They represent tangible things like books and dogs and such. They also can used to represent relationship and role. So one, maybe a human being that has multiple roles. I am a doctor with respect to this person. I am a credit card holder with respect to a credit organization. I am a borrower with respect to the library. So the notion of "roles" joining these physical objects--the things they do in the world--are one of the underlying powers of these relational models.

We can also use these tables to represent things that occur in time and space--incidents or events such as the ordering of a lab test. We can also bind tables together with interactions. For example, we could have a table called "purchase" that binds a buyer to a seller, and perhaps to an object that has been bought or sold. And we can even use tables, and most commonly do, with what are called "computer-assisted software engineering tools"-- to define the content of the system, so that you have data-driven systems, where the data dictionary controls the behaviors of how the database system works.

This is a common notation...there are three major alternative forms. So if you are in the business of designing a new system, working with a vendor and you are sitting down to lay out the data model, you will see it commonly in three different ways.

The first is an empty table form, where you have the name of the table in a kind of column that runs across the top. And then you have the individual empty columns names, a graphical form where the table names the first line, and the fields that go into essentially the column names, but they become the fields of the records, are listed vertically. One that has an asterisk means that the foreign key reference, that is a data value that exists in some other linked table whenever you see an asterisk in this design.

And then there is a simple text form where you have the name of the table and in parentheses a comma separated list of the fields or the attributes of that table, and again, if you are in this design process with a vendor, they will often send you these kinds of things to get your approval for the nature of your design.

E. F. Codd came up with a set of rules by which you would know whether one of these systems is well designed or not. And they actually extend out through third and fourth normal forms, but we are only going to do the top level here because the complexities get more and more subtle.

The first rule on how you design fully normalized relational models, and normalized, in a sense, means a correct data model. One instance of an object has exactly one value for each attribute that is at each intersection of a row and a column. There is only one value. It is always there, never missing and there are no repeated groups or embedded structure.

Rule number two is that attributes must not have any internal structures. So this is a mistake. For convenience sake you might say, "Well I just want to store the name and the age and the sex." So you create this data formatting where you are going to have the age-sex. The problem is, those are two independent attributes. One changes and one doesn't and now we have to be able to take them apart for purposes for example of looking with one system that may only be interested in one, the gender, and the other system is only interested in the age.

So now we have an internal processing rule by which we have to take the data apart because we inappropriately squashed it together. We put it in one field when it really should have been two separate fields or columns.

If you obey these rules...that each instance has exactly one value for each attribute...and there is no internal structure...you are about a third of a way there for a good design. That is, you are in first normal form by the E. F. Codd rules.

Rule three extends that, to say that every attribute should represent a characteristic of the entire object, and not a characteristic of a limited part of the object. Now, that is a little more subtle, but here is an example that shows you what they mean by that.

It would be absolutely sensible...if your job was to create a little database that had the members of committees in your hospital, and you would say, "Well, I think I want to store the person's name, and I want to store the name of the committee they are on. And I want to store the date that their term expires on their appointment for that committee, and I also need to know the date that they first joined the hospital staff."

Well that may be sensible from your particular view of the data, but this is a mistake because the date you joined the hospital staff is not an attribute of your membership of your committee. That fact of the real world, that aspect of the object of that person, if you will, persists independent of whether it is stored in your data base, and it is not an attribute of being somehow of your committee membership.

So that the correct way to do that, is to store just those things that relate to the committee membership and then store a foreign key reference to, for example, the personnel table, where you get the date that somebody was first employed. So you make that a relationship, a foreign key rather than storing into your table.

Because what happens when you store it in your table and an error is discovered because you are not the gold standard for that information? You have the classic one that we have hundreds of places where we store phone numbers in databases all through the organization. Well, phone numbers change all the time, and nobody knows all of the locations where each number is stored. Part of the data gets updated and part of it does not, so this embedding of items that are not aspects of the object has a very insidious problem of data asynchrony and update.

Relationships are the abstraction of a set of associations that hold, again, systematically. That is, this is up at the level of our theoretically pure object. Not the actual instances necessarily, but when we define an object in the world, we create relationships that hold systemically between all of those classes.

So for example, a patient occupies a bed, a library contains books, a specimen is assayed. Often we use these inverse relationships, and we often pair them so we have bi-directional relationships...you can work that logic in both directions. And most relationships by and large can be stated in an inverse, though it isn't always the case that both of these are created in a database system. But if the library lends a book, one can state it in a kind of passive voice that the book is lent by the library and those might represent actually two different relationships, depending upon the data model.

Relationship Types:

There are a variety of relationship types. The simplest one is the one-to-one relationship. Each state has one, and only one, governor, and each governor governs one and only one state. When you see this, you have to wonder why are we putting these in separate tables. Because if it really is a one-to-one relationship and it always holds, then one might imagine those are actually part of a single abstraction, but there might be good reasons... For example, you might have a table of governors of all states, and might have other characteristics of the state that you store which have nothing to do with the characteristics of the government. One-to-many relationships are shown in these data models. I should say the one-to-one relationship has a single arrow on both ends. That single arrow means there is one occurrence on both sides. A one-to-many relationship has a double arrow, meaning that this, as a record, may have multiple occurrences of this related record. So one dog owner may own many dogs but each dog can have only one owner, at least in this model.

There are however, even more complicated relationships such as the relationships between authors and books. One author may write many books, but each book may have many authors, is written by many authors. And in this case, the data model shows double arrows on both ends of that relationship.

And modeling many-to-many complexities becomes one of the more challenging aspects of relational database design. In fact, the way that one generally has to do this, is to insert yet a third table because of à priori, there is no way, if you have many-to-many relationships...it is hard to establish which records in this table are related to which records in this table. So, by and large, in a relational system, in order to model a many-to-many relationship, you have to create a linking file, which has...the nature of which records are related in those two other files.

Here is an example of a many-to-many relationship, where one drug manufacturer may manufacture many, many drugs and one generic drug may be manufactured by many different manufacturers. So that, what binds those together is a linking table of the license to manufacture in a kind of FDA sense, where we have the manufacturer name, which is a foreign key reference to this table, the generic name...and then the characteristics of this license, for example, the date it was issued. That is how you establish, from these many-to-many relationships, just the ones that apply between the two related tables.

Entity Relationship Diagram:

The overall system design process has escalating layers depending upon how many people are involved in it and how big the system is. For all of the databases that you would build including the ones that you just start on your own desktop, and the ones you are going to fully own, and nobody else is ever going to use it is very much recommended that you begin with a paper and pencil exercise or...we are going to do it with Microsoft Word. We will show you how you can use that as a quick computer assisted software engineering tool to create what is called the entity relationship diagram.
The "entity" or the "object" the relationships are the things that link them together. And what is going to become your table? Don't start, even in Microsoft Access, just naming fields, because you will get in a thicket. In square brackets we have the things that are applicable for multi programmer projects. What a vendor will often do, is create a thing called a " state transition model". And that is to look at the data that is filled into those tables. Under what conditions does it change? Who is allowed to create it? Who is allowed to edit it? What processes have access to it? Those state transition models describe all of the possible values, in essence, that the data can take in the system. And some...although these tend to be reserved for more complicated...some of them take "the day in the life of a data item" view. A flow diagram that says, "Okay, if once an item is created, how does it make its way through the system to various reports, screens, users, analysis routines?" And that is the kind of data flow view that looks at every single object in the tables and decides how it changes over time, and as it moves through an organization. Because physically it may move from one database to another.

Transcript Part IV: Classroom Exercise of Creating a Relational Database:

So, with those as the background of the general principles for relational database design, lets make reality out of this by doing an exercise that is based on some information you already saw.

This was the article from this morning's session, this classic article on medical informatics, an emerging academic discipline and institutional priority. So librarians, and probably most of you, will recognize it as the classic Medline in the sense of "Classic Coke". The classic Medline format for this, which had the field at the far left column separated with a little white space and a hyphen, and then a value for that, and in some cases there would be repeating values generally separated by semicolons, and an interesting aspect of the original ELHILL design.

ELHILL is a database engine that was created in the late 60's, early 70's, to store Medline data...quite a remarkable innovation for its time. The original ELHILL system actually stored this record as a string of characters...went head to tail all the way down from here to here. And the reason was, that in order to display this on the...it was a very efficient single read to disc to get this from here to here. it was the equivalent of what we described for the GenBank system, where it was essentially a flat text file, where the unit records were marked, and where the embedded items representing the sub-elements of the record could be pulled out. By and large, it was all optimized to deliver just one big chunk.

So, lets take this original Medline record, and in this Medline unique identifier...and create a relational form for it. I have up here, as I mentioned, a relational model. And we are going to call the main record, "Main", and I will start it off...I will make it easy by saying the first thing we are going to stick there is the Medline unique identifier. This is an important...what is called, "referential integrity token" in relational systems. And that is the system generated unique identifier, meaning it attaches to no other object that can be used to link all of the sub-components of a logical record.

So we have the first field here--the Medline unique identifier. I will go back and get that Medline unique identifier from that citation, and we will just copy it and paste it here. So we have that value, bless its heart, it always comes with its formatting.

Alright, lets go back to the other--ignore the dots--and lets go back to the other model and look at the next element of a Medline citation. It is the authors. In this case, we have two authors, Bob Greenes and Ted Shortliffe. How are we going to model those in the relational mode? Do we create an author field here in this main record? You are saying no. Why don't we create an author field? Well from this article's point of view it isn't really worried that authors could be responsible for more than one.

But the problem that there may be more than one author, and we don't know how many there are, is a very fundamental limitation. So, when you are doing simple database design on your desktop, the first and slippery road to perdition is to say, "Well, I will just make room for ten of them, or 15, or 20...and I will call it "author1" and "author2" and "author3". And inevitably time wounds all heals, and it will get you, because the next... in a very short period of time someone will come along and if you made 12 there will be 13 and now you will be storing it someplace else.

So that suggests that we need a parent/child relationship, and we need a separate table for authors. What is the nature of the author table?

I am also showing you, by the way, that you can in your own teaching, pretty effectively use tools like Microsoft Word and its drawing capability to do this kind of chalkboard stuff. And the nice thing about it is that you know you can "insert" if you decide you changed the order, unlike your flip chart. Now you are trying to write all in the margin and stick stuff in. You can keep the orderly appearance here.

So here is the author table, and conveniently enough, it is filled in with the first field being the Medline UI. That is, in fact, the "referential integrity token" here. And we will put a star there because it is a "foreign key reference" meaning that same value can be found in some other table.

And what are the elements of an author unit record? Let's go look at the original thing, here. We have Greenes and Shortliffe and we will go back. We have last name...and here is a value of "Greenes" and first name, first and initial. Do we need to make first and middle initials? Well, not if you have a way...it depends upon your query language, and if your query language allows you to do wild card masking and those kinds of things, then it doesn't really gain you anything to put the middle initial separate from the first initial.

Is there anything else we want to store about the author? The presentation order of this author field is, of course, important in our world. It is...being the first author and last author is much different than being the middle author of 15, so that people get very touchy about the order.

Now there are a couple of ways to control order. The default way in most database systems is what is called the "collation order", meaning physically if you create record A and then create record B, as the system reads, it will always read the first one created before it gets to the second one. The collation order will look like the physical order of the rows in the table. But if you want to force that function, if you want to make sure you always get it right, if you inserted another author and it was the third author, then you might have a sequence number to control the presentation of the authors. It is variable, and in fact Medline doesn't have this. It actually uses just the collation order.

Anything else we need to put in the author table? Okay, so there is another field. This is an older citation, and it does have an "AD", an author address. So herein lies a potential problem...that each author might have a different address. And if one simplifies the model by some convention such as "we only store the address of the first author", then you have controlled this situation.

But you might have a version of a bibliographic database in your own institution where, in fact, the author address is an attribute of being an author. And the reason for that, is that people might want to get their mail in their role as an author, from a different place than they get their mail as a faculty member, or... you name it. So, in this case, if you were building your own bibliographic database you might wish to have a field, which is author, addressed in its various sub-elements, but that is optional, depending upon the design of your system.

So this then, has in the modeling sense, a one-to-many relationship. We will get this arrow, and we will say that is the one. And since I can't make a double arrow, I will just copy this and shorten it a little bit and superimpose it.

So we have a one-to-many relationship between the main Medline record and authors. This is a way of showing that one citation may have many authors. It doesn't matter whether we have...interestingly...it doesn't matter if we have no authors, and it doesn't matter if we have one or a thousand. All we have to do is have repeating records, each of which is linked by this Medline unique identifierand that establishes the relationship to the main record.

Alright. How about titles? Here is the title. Does the title go in the main record or does the title go someplace else? Why does it go in the main record? Okay, it is an "attribute" but that is necessary, not sufficient, because similarly these are all attributes of the citation.

Well, the rule to use for whether something goes in the core record is whether it occurs only once, and in the full-normalized form of Codd it must occur. That is, if all citations must have a title and must have one and only one title, then it goes in this main record.

In the full and correct model if the title was optional, in other words, you could have a citation with no title. Then that actually would go in a "child" table. But for most purposes people don't go to that purest extent in their relational designs, and they would just leave it blank if they either didn't know it or there wasn't one. But the general notion is, if something occurs once and only once, and must occur, it then goes in this main record, the core attribute of the system. I am using bullets here, I will make sure I have a bullet for the title.

Let's look at the other record again. Ah, now here is an interesting one. MeSH headings. You will see here a very interesting situation, where we not only have key words, a variety of key words that again represent a one-to-many relationship, but those key words also have modifiers that modify them. So how are we going to handle this?

Go back to our relational model. Obviously--I shouldn't say obviously--do we want to put the MeSH headings in the main record? No. Right. And the reason is, that is right, because there is a one-to- many relationship, and we don't know how many. We can never guess ahead of time how many there might be, so if we don't know ahead of time, we are going to have to put it in a "child" table. So we will copy that, we will make a new table over here. We will rename it MeSH headings. We will get rid of these other things and we start, once again, with the Medline UI. That establishes this one-to-many relationship with the main record.

And what is the next field we want to include in the MeSH heading? Well, how about a field called "MeSH main"? Right. So we know we have MeSH main headings, and subheadings that may modify those main headings.

Alright, so the main heading here, is "hospital information system". Let's go to the other window, and that is the main MeSH heading. Are there any other characteristics we want to store about that heading? That is probably a Freudian slip. What do we need to know about that heading? That is right. You recognize this from the old Medline...that if it has an asterisk, that means it is a main heading.

So let's call this "star", meaning that is one of the most important topics there, and that could have a value of an asterisk or no asterisk. It actually is a logical value. It could be a true-false stored in a database and that is the way it is stored actually now in a relational Medline. So if that is present it means that is one of the most important topic headings. But if we look at... actually let's go to the other one and pick up the one that had rather than hospital information system, career choice, let's look at "medical informatics" as a MeSH heading, which is modified by education and trends.

Alright. And trends is one of the starred things. So, let's put this in as the value. Medical informatics. How are we going to handle the relationship between those subheads and the main headings and the relationship to the main citation? Do we want to create fields here in the MeSH headings, which is subheadings? Does that work? No. And why doesn't that work? Because it is a one-to-many relationship. We don't know whether there are any subheadings at all, and we don't know when they occur...whether there is going to be one, two, two hundred, two thousand; well not two thousand but it could be a potentially large number. So we need to take subheading out and we will move this up and create its own table and we will call this subheading.

But now we have an interesting problem which is...if we simply relate this subheading to the Medline UI, we miss the fact that the subheading is actually a modifier of the main heading. It is not a modifier directly of the citation. So how do we do that? How do we handle that? How can we represent the fact that this has a relationship to this record and not a direct relationship to that, but that this unit...this plus this...is in fact, a modifier or a characteristic of the main Medline citation? So how do you do that?

Okay. So we would have a field called MeSH main and a field called MeSH Medline UI, and then a field called MeSH sub, which has this value of...in this case let's make it "trends" because that has a star on it. So there is "trends".

By the way, we are going to put this on the Web so you don't need to be following it on "key stroke for key stroke" here. So, we put the name "MeSH heading" or what we could do actually, and more compactly, is put a unique identifier in this case, either the MeSH tree number for that term, or as you will hear about shortly--the metathesaurus concept--unique identifier for that particular heading.

So now we have established an interesting relationship. Let's go ahead and show that we have a couple of many-to-many relationships here. It is hard to get perfect with arrows. Anyway, you get the idea.

But there is a one-to-many relationship here, and there is a similar one-to-many relationship between the MeSH heading and the subheading. This has, though, the interesting property that when we search by qualifying these elements, it allows us, for example, to search directly by subheading, bypassing main headings.

So if you just want to know all articles that are related to trends, you could retrieve all the main Medline UI's without inferring this relationship to the main heading. Or we could take advantage of the existence of the main heading and its unique relationship to the subheading. So we get the best of both worlds.

And I won't do the one-to-many arrows here. So, it is "one-to-many-to-many" that allows us to model both the direct relationship to the citation and the one-to-many relationship between subheadings and headings. The idea here...the main idea is...and in fact, MeSH main here is a foreign key reference.

The main idea is that when you have these embedded one-to-many relationships you can add what is called a tuple. That is a tuple is the occurrence of more than one set of values that represent foreign key references. So that you can establish the uniqueness of just this relationship between the subheading and the main heading.

So think about this whenever you are getting these cascading sets of relationships, where it is one-to-many-to-many is by and large, as you go down each level. And you nest another one, the number of items in your tuple, that is, the set of keys that uniquely define a relationship, tends to grow with one additional level. So if we had subheadings of subheadings, now you would have to have MeSH main and MeSH sub1 as the set of unique values that establish the relationships for this one record.

So let's go back then and we can probably then, since we have shown an example of both a nested one-to-many-to-many relationship and a single occurring, we can look at each of these additional elements and decide how to handle this.

So how about publication types? Does that go in the main record? No, it goes in a "child" table that we will probably called PT or publication type, again, with the Medline unique identifier and multiple records. Each one representing each instance of that publication type, of which there might be many.

How about entry month? This one was entered into the database in 5/90. Where does that go? Would it go in the main record? It does go in the main record because entry month occurs once and only once, and it has got to be there for all records in the database.

[Talking]

Yeah, from the perspective of this record for one Medline citation, the characteristic of its entry month occurs only once from its perspective. Now from the perspective of the database I might wish to retrieve all records whose EM is 9005 in order to do some kind of maintenance or that. So that from the database perspective, that value occurs multiple times. But the perspective, the unit record of the citation that occurs once and only once. Although many, many records may have that same value.

Go back and look at the other one. How about the abstract? Abstracts are long text strings. Increasingly they are getting complex internal structure but the original Medline records does not have a reliable internal structure that allows you to take it apart. So would you put the abstract in the main record or in a "child" record? In the main record. Okay.

So there are some tradeoff issues here. Do all Medline records have an abstract? Well, no, they don't. But by and large, our database systems allow us to put a variable length text string in a logical record, so we could, in fact, have this whole abstract in here. But you know what happens when you do that...if you declare it as a variable length character type. The database system, unbeknownst to you, creates another table and stores it as an external reference. So even if you choose you think to create a main table, in fact the system will take the prerogative if it doesn't know how many characters are going to be ... because an abstract would be 200 characters, 400 or 2,000. Most database systems will actually unbeknownst to you create a separate child file, but it is one of these levels of technical complexity that is hidden from the user. It appears to be a field in your data table even though it actually lives in a linked table.

The purest form of the abstract, since many Medline citations, especially from the early years, don't have them, would be to create a "child" table called "abstract" that is linked on the Medline unique identifier, and if it doesn't exist, no harm, no foul. No use of white space in the database. So it turns out abstracts are the kind of thing which in some data models will be an external file, and in some they will not. And neither of those is necessarily more correct than the other, depending upon what you are doing.

How about the address? We talked about that.

That is right. And then a fully normalized provably correct model whenever there is the opportunity of the case that a value may be missing. According to Cobb's rules, that needs to be a separate table. You are not allowed to have missing... no holes in the table. So it is not only singly occurring attributes, it is that all attributes must occur.

But it turns out this is the most frequently violated rule because empty space doesn't cost you very much these days, and disc is cheap and it doesn't slow down the system to have just white space in a field. So again this is the tension between the provably correct model from a computer science standpoint and operational decisions that you make to just create an efficient uniform record. So most people would just leave the white space there and leave it empty.

Alright, the source is again, a singly occurring attribute, so that would go in the main table. And depending upon the design of your system, as we mentioned, "author address" might be occurring only once in the original Medline model. There is a rule that said there could only be one author address. But, of course, that was a corruption of the real world, because the truly correct model, the one that would endure longest, is that "address" is an attribute of the author, because authors do move around. And that "address" is actually an attribute of that citation, that they wrote that paper when they were affiliated with that institution. And if they affiliate with a new institution, that former institution affiliation remains, because that was the context in which that was created.

So, you get the sense of how one decomposes these records that have internal structure, especially where it is repeating values and wherever you see these repeating values, establish what is called a parent/child relationship...a one-to-many relationship in a relational model.

Now this kind of complexity is the kind of thing that separates the beginners from the accomplished database architects. Because it is so, so much easier to just say, I just think I will leave space for 10 authors and just create that in one flat file record. But it always...you always lose in the end whenever you have tried to embed in a flat file, simple records, something that inherently in the real world is unpredictable in the number of repeated occurrences.

You can get the full transcript and watch the lecture's video here http://www.mblwhoilibrary.org/services/lecture_series/masys/index.html

< Prev		Next >

Mediformatica - The Medical Informatics Portal

Latest Blog Articles

Medical Informatics I: Principles of Database Design

Software

Healthcare IT in Egypt

BLOG

VIDEOS

Who's Online

Counter

Medical Informatics

Hospital Information Systems

Videos

Mediformatica's Blog