Medical Informatics I: Principles of Database Design

Article Index
Medical Informatics I: Principles of Database Design
Page 2
Page 3
Page 4
All Pages

Page 4 of 4

Transcript Part IV: Classroom Exercise of Creating a Relational Database:

So, with those as the background of the general principles for relational database design, lets make reality out of this by doing an exercise that is based on some information you already saw.

This was the article from this morning's session, this classic article on medical informatics, an emerging academic discipline and institutional priority. So librarians, and probably most of you, will recognize it as the classic Medline in the sense of "Classic Coke". The classic Medline format for this, which had the field at the far left column separated with a little white space and a hyphen, and then a value for that, and in some cases there would be repeating values generally separated by semicolons, and an interesting aspect of the original ELHILL design.

ELHILL is a database engine that was created in the late 60's, early 70's, to store Medline data...quite a remarkable innovation for its time. The original ELHILL system actually stored this record as a string of characters...went head to tail all the way down from here to here. And the reason was, that in order to display this on the...it was a very efficient single read to disc to get this from here to here. it was the equivalent of what we described for the GenBank system, where it was essentially a flat text file, where the unit records were marked, and where the embedded items representing the sub-elements of the record could be pulled out. By and large, it was all optimized to deliver just one big chunk.

So, lets take this original Medline record, and in this Medline unique identifier...and create a relational form for it. I have up here, as I mentioned, a relational model. And we are going to call the main record, "Main", and I will start it off...I will make it easy by saying the first thing we are going to stick there is the Medline unique identifier. This is an important...what is called, "referential integrity token" in relational systems. And that is the system generated unique identifier, meaning it attaches to no other object that can be used to link all of the sub-components of a logical record.

So we have the first field here--the Medline unique identifier. I will go back and get that Medline unique identifier from that citation, and we will just copy it and paste it here. So we have that value, bless its heart, it always comes with its formatting.

Alright, lets go back to the other--ignore the dots--and lets go back to the other model and look at the next element of a Medline citation. It is the authors. In this case, we have two authors, Bob Greenes and Ted Shortliffe. How are we going to model those in the relational mode? Do we create an author field here in this main record? You are saying no. Why don't we create an author field? Well from this article's point of view it isn't really worried that authors could be responsible for more than one.

But the problem that there may be more than one author, and we don't know how many there are, is a very fundamental limitation. So, when you are doing simple database design on your desktop, the first and slippery road to perdition is to say, "Well, I will just make room for ten of them, or 15, or 20...and I will call it "author1" and "author2" and "author3". And inevitably time wounds all heals, and it will get you, because the next... in a very short period of time someone will come along and if you made 12 there will be 13 and now you will be storing it someplace else.

So that suggests that we need a parent/child relationship, and we need a separate table for authors. What is the nature of the author table?

I am also showing you, by the way, that you can in your own teaching, pretty effectively use tools like Microsoft Word and its drawing capability to do this kind of chalkboard stuff. And the nice thing about it is that you know you can "insert" if you decide you changed the order, unlike your flip chart. Now you are trying to write all in the margin and stick stuff in. You can keep the orderly appearance here.

So here is the author table, and conveniently enough, it is filled in with the first field being the Medline UI. That is, in fact, the "referential integrity token" here. And we will put a star there because it is a "foreign key reference" meaning that same value can be found in some other table.

And what are the elements of an author unit record? Let's go look at the original thing, here. We have Greenes and Shortliffe and we will go back. We have last name...and here is a value of "Greenes" and first name, first and initial. Do we need to make first and middle initials? Well, not if you have a way...it depends upon your query language, and if your query language allows you to do wild card masking and those kinds of things, then it doesn't really gain you anything to put the middle initial separate from the first initial.

Is there anything else we want to store about the author? The presentation order of this author field is, of course, important in our world. It is...being the first author and last author is much different than being the middle author of 15, so that people get very touchy about the order.

Now there are a couple of ways to control order. The default way in most database systems is what is called the "collation order", meaning physically if you create record A and then create record B, as the system reads, it will always read the first one created before it gets to the second one. The collation order will look like the physical order of the rows in the table. But if you want to force that function, if you want to make sure you always get it right, if you inserted another author and it was the third author, then you might have a sequence number to control the presentation of the authors. It is variable, and in fact Medline doesn't have this. It actually uses just the collation order.

Anything else we need to put in the author table? Okay, so there is another field. This is an older citation, and it does have an "AD", an author address. So herein lies a potential problem...that each author might have a different address. And if one simplifies the model by some convention such as "we only store the address of the first author", then you have controlled this situation.

But you might have a version of a bibliographic database in your own institution where, in fact, the author address is an attribute of being an author. And the reason for that, is that people might want to get their mail in their role as an author, from a different place than they get their mail as a faculty member, or... you name it. So, in this case, if you were building your own bibliographic database you might wish to have a field, which is author, addressed in its various sub-elements, but that is optional, depending upon the design of your system.

So this then, has in the modeling sense, a one-to-many relationship. We will get this arrow, and we will say that is the one. And since I can't make a double arrow, I will just copy this and shorten it a little bit and superimpose it.

So we have a one-to-many relationship between the main Medline record and authors. This is a way of showing that one citation may have many authors. It doesn't matter whether we have...interestingly...it doesn't matter if we have no authors, and it doesn't matter if we have one or a thousand. All we have to do is have repeating records, each of which is linked by this Medline unique identifierand that establishes the relationship to the main record.

Alright. How about titles? Here is the title. Does the title go in the main record or does the title go someplace else? Why does it go in the main record? Okay, it is an "attribute" but that is necessary, not sufficient, because similarly these are all attributes of the citation.

Well, the rule to use for whether something goes in the core record is whether it occurs only once, and in the full-normalized form of Codd it must occur. That is, if all citations must have a title and must have one and only one title, then it goes in this main record.

In the full and correct model if the title was optional, in other words, you could have a citation with no title. Then that actually would go in a "child" table. But for most purposes people don't go to that purest extent in their relational designs, and they would just leave it blank if they either didn't know it or there wasn't one. But the general notion is, if something occurs once and only once, and must occur, it then goes in this main record, the core attribute of the system. I am using bullets here, I will make sure I have a bullet for the title.

Let's look at the other record again. Ah, now here is an interesting one. MeSH headings. You will see here a very interesting situation, where we not only have key words, a variety of key words that again represent a one-to-many relationship, but those key words also have modifiers that modify them. So how are we going to handle this?

Go back to our relational model. Obviously--I shouldn't say obviously--do we want to put the MeSH headings in the main record? No. Right. And the reason is, that is right, because there is a one-to- many relationship, and we don't know how many. We can never guess ahead of time how many there might be, so if we don't know ahead of time, we are going to have to put it in a "child" table. So we will copy that, we will make a new table over here. We will rename it MeSH headings. We will get rid of these other things and we start, once again, with the Medline UI. That establishes this one-to-many relationship with the main record.

And what is the next field we want to include in the MeSH heading? Well, how about a field called "MeSH main"? Right. So we know we have MeSH main headings, and subheadings that may modify those main headings.

Alright, so the main heading here, is "hospital information system". Let's go to the other window, and that is the main MeSH heading. Are there any other characteristics we want to store about that heading? That is probably a Freudian slip. What do we need to know about that heading? That is right. You recognize this from the old Medline...that if it has an asterisk, that means it is a main heading.

So let's call this "star", meaning that is one of the most important topics there, and that could have a value of an asterisk or no asterisk. It actually is a logical value. It could be a true-false stored in a database and that is the way it is stored actually now in a relational Medline. So if that is present it means that is one of the most important topic headings. But if we look at... actually let's go to the other one and pick up the one that had rather than hospital information system, career choice, let's look at "medical informatics" as a MeSH heading, which is modified by education and trends.

Alright. And trends is one of the starred things. So, let's put this in as the value. Medical informatics. How are we going to handle the relationship between those subheads and the main headings and the relationship to the main citation? Do we want to create fields here in the MeSH headings, which is subheadings? Does that work? No. And why doesn't that work? Because it is a one-to-many relationship. We don't know whether there are any subheadings at all, and we don't know when they occur...whether there is going to be one, two, two hundred, two thousand; well not two thousand but it could be a potentially large number. So we need to take subheading out and we will move this up and create its own table and we will call this subheading.

But now we have an interesting problem which is...if we simply relate this subheading to the Medline UI, we miss the fact that the subheading is actually a modifier of the main heading. It is not a modifier directly of the citation. So how do we do that? How do we handle that? How can we represent the fact that this has a relationship to this record and not a direct relationship to that, but that this unit...this plus this...is in fact, a modifier or a characteristic of the main Medline citation? So how do you do that?

Okay. So we would have a field called MeSH main and a field called MeSH Medline UI, and then a field called MeSH sub, which has this value of...in this case let's make it "trends" because that has a star on it. So there is "trends".

By the way, we are going to put this on the Web so you don't need to be following it on "key stroke for key stroke" here. So, we put the name "MeSH heading" or what we could do actually, and more compactly, is put a unique identifier in this case, either the MeSH tree number for that term, or as you will hear about shortly--the metathesaurus concept--unique identifier for that particular heading.

So now we have established an interesting relationship. Let's go ahead and show that we have a couple of many-to-many relationships here. It is hard to get perfect with arrows. Anyway, you get the idea.

But there is a one-to-many relationship here, and there is a similar one-to-many relationship between the MeSH heading and the subheading. This has, though, the interesting property that when we search by qualifying these elements, it allows us, for example, to search directly by subheading, bypassing main headings.

So if you just want to know all articles that are related to trends, you could retrieve all the main Medline UI's without inferring this relationship to the main heading. Or we could take advantage of the existence of the main heading and its unique relationship to the subheading. So we get the best of both worlds.

And I won't do the one-to-many arrows here. So, it is "one-to-many-to-many" that allows us to model both the direct relationship to the citation and the one-to-many relationship between subheadings and headings. The idea here...the main idea is...and in fact, MeSH main here is a foreign key reference.

The main idea is that when you have these embedded one-to-many relationships you can add what is called a tuple. That is a tuple is the occurrence of more than one set of values that represent foreign key references. So that you can establish the uniqueness of just this relationship between the subheading and the main heading.

So think about this whenever you are getting these cascading sets of relationships, where it is one-to-many-to-many is by and large, as you go down each level. And you nest another one, the number of items in your tuple, that is, the set of keys that uniquely define a relationship, tends to grow with one additional level. So if we had subheadings of subheadings, now you would have to have MeSH main and MeSH sub1 as the set of unique values that establish the relationships for this one record.

So let's go back then and we can probably then, since we have shown an example of both a nested one-to-many-to-many relationship and a single occurring, we can look at each of these additional elements and decide how to handle this.

So how about publication types? Does that go in the main record? No, it goes in a "child" table that we will probably called PT or publication type, again, with the Medline unique identifier and multiple records. Each one representing each instance of that publication type, of which there might be many.

How about entry month? This one was entered into the database in 5/90. Where does that go? Would it go in the main record? It does go in the main record because entry month occurs once and only once, and it has got to be there for all records in the database.

[Talking]

Yeah, from the perspective of this record for one Medline citation, the characteristic of its entry month occurs only once from its perspective. Now from the perspective of the database I might wish to retrieve all records whose EM is 9005 in order to do some kind of maintenance or that. So that from the database perspective, that value occurs multiple times. But the perspective, the unit record of the citation that occurs once and only once. Although many, many records may have that same value.

Go back and look at the other one. How about the abstract? Abstracts are long text strings. Increasingly they are getting complex internal structure but the original Medline records does not have a reliable internal structure that allows you to take it apart. So would you put the abstract in the main record or in a "child" record? In the main record. Okay.

So there are some tradeoff issues here. Do all Medline records have an abstract? Well, no, they don't. But by and large, our database systems allow us to put a variable length text string in a logical record, so we could, in fact, have this whole abstract in here. But you know what happens when you do that...if you declare it as a variable length character type. The database system, unbeknownst to you, creates another table and stores it as an external reference. So even if you choose you think to create a main table, in fact the system will take the prerogative if it doesn't know how many characters are going to be ... because an abstract would be 200 characters, 400 or 2,000. Most database systems will actually unbeknownst to you create a separate child file, but it is one of these levels of technical complexity that is hidden from the user. It appears to be a field in your data table even though it actually lives in a linked table.

The purest form of the abstract, since many Medline citations, especially from the early years, don't have them, would be to create a "child" table called "abstract" that is linked on the Medline unique identifier, and if it doesn't exist, no harm, no foul. No use of white space in the database. So it turns out abstracts are the kind of thing which in some data models will be an external file, and in some they will not. And neither of those is necessarily more correct than the other, depending upon what you are doing.

How about the address? We talked about that.

That is right. And then a fully normalized provably correct model whenever there is the opportunity of the case that a value may be missing. According to Cobb's rules, that needs to be a separate table. You are not allowed to have missing... no holes in the table. So it is not only singly occurring attributes, it is that all attributes must occur.

But it turns out this is the most frequently violated rule because empty space doesn't cost you very much these days, and disc is cheap and it doesn't slow down the system to have just white space in a field. So again this is the tension between the provably correct model from a computer science standpoint and operational decisions that you make to just create an efficient uniform record. So most people would just leave the white space there and leave it empty.

Alright, the source is again, a singly occurring attribute, so that would go in the main table. And depending upon the design of your system, as we mentioned, "author address" might be occurring only once in the original Medline model. There is a rule that said there could only be one author address. But, of course, that was a corruption of the real world, because the truly correct model, the one that would endure longest, is that "address" is an attribute of the author, because authors do move around. And that "address" is actually an attribute of that citation, that they wrote that paper when they were affiliated with that institution. And if they affiliate with a new institution, that former institution affiliation remains, because that was the context in which that was created.

So, you get the sense of how one decomposes these records that have internal structure, especially where it is repeating values and wherever you see these repeating values, establish what is called a parent/child relationship...a one-to-many relationship in a relational model.

Now this kind of complexity is the kind of thing that separates the beginners from the accomplished database architects. Because it is so, so much easier to just say, I just think I will leave space for 10 authors and just create that in one flat file record. But it always...you always lose in the end whenever you have tried to embed in a flat file, simple records, something that inherently in the real world is unpredictable in the number of repeated occurrences.

You can get the full transcript and watch the lecture's video here http://www.mblwhoilibrary.org/services/lecture_series/masys/index.html

<< Prev - Next

< Prev		Next >

Mediformatica - The Medical Informatics Portal

Latest Blog Articles

Medical Informatics I: Principles of Database Design - Page 4

Software

Healthcare IT in Egypt

BLOG

VIDEOS

Who's Online

Counter

Medical Informatics

Hospital Information Systems

Videos

Mediformatica's Blog