Medical Informatics I: Principles of Database Design
Transcript Part II: Database Management Systems - Hierarchical Databases

So, that gave rise to a more flexible approach: the general notion of DBMSs, or database management systems, which inserted a layer of abstraction, that is, a data model, between the application and the underlying stored data. The first of them on the scene, especially in medicine, were hierarchical databases, where the general notion is that data is laid out in a set of tree-like structures, each with a root. The hierarchy is, more or less, an inverted tree with child records, and each of those may have relationships to other records. The best known of these in the medical world was MUMPS, the Massachusetts General Hospital Utility Multi-Programming System, which Octo Barnett and the group at the MGH computer lab developed specifically for storing patient data in a way that let you very efficiently follow these links to get one patient's set of results. In fact, there is more MUMPS code and more hierarchical database use in clinical health care than in any other data model (although that is slowly changing in favor of relational systems), in large part because the Defense Department and the VA use a MUMPS-based architecture (the DHCP architecture, for those of you familiar with VA environments). It is built on this data model of linked parent/child records where the patient's social security number is the root of each tree, which makes it very, very efficient to look up all the results related to one person.
Advantages of Hierarchical Databases:
The advantages compared to what came before were that it provided very efficient storage. In a sense, you only created one of these child records when the data appeared; you didn't create white space or empty records, so everything was packed very tightly, and each new record that was inserted also stored the linkage to the piece of data it was related to. And it provides very rapid access along the predetermined data hierarchy. Looking up lab results to get to a patient's serum sodium, for example, takes the blink of an eye, a few milliseconds, because the system is optimized for this patient-at-a-time lookup.
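To make that concrete, here is a minimal sketch in Python (rather than MUMPS) of a patient-rooted hierarchy and the patient-at-a-time lookup it makes fast. The identifiers and values are illustrative assumptions, not the actual DHCP layout.

```python
# A toy patient-rooted hierarchy in the spirit of a MUMPS global: the patient
# identifier is the root, and child records hang off it by category, then by date.
# Only values that actually exist are stored, so the structure stays tightly packed.
hierarchical_db = {
    "123-45-6789": {
        "demographics": {"name": "Elmer Smith", "dob": "1932-04-01"},
        "labs": {
            "2003-02-10": {"Na": 141, "K": 4.2},
            "2003-02-11": {"Na": 138},   # no potassium drawn, so no empty slot stored
        },
    },
}

def lab_value(db, patient_id, date, test):
    """Follow the links down one tree: the fast, patient-at-a-time lookup."""
    return db[patient_id]["labs"][date][test]

print(lab_value(hierarchical_db, "123-45-6789", "2003-02-10", "Na"))   # 141
```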
Disadvantages of Hierarchical Databases:
The disadvantage of these hierarchical models is that, by and large, it tends to be very difficult to view the data from anything other than the tree perspective, looking down the trees. So if you want to look across the trees, say at all laboratory values from a given date, the only way to get there is to start at the root of every tree and look down it until you get to the corresponding level in the data.
So for doing outcomes analysis, or secondary uses where you aggregate the data, where you ask questions like "how many patients that we saw last month with hyperthyroidism also had hypercalcemia?", those kinds of queries can sometimes take hours or days to run, because everything is "patient-at-a-time" oriented and you are trying to look across the patients rather than down the trees.
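A hedged sketch of why: with the same kind of toy layout, an aggregate question has no shortcut, so every patient tree must be visited. The "problems" list is an assumed field, added purely for illustration.

```python
# The same toy layout, but now the question cuts across patients, so every
# root must be visited and every tree walked: the hierarchy gives no shortcut.
db = {
    "123-45-6789": {"problems": ["hyperthyroidism", "hypercalcemia"]},
    "987-65-4321": {"problems": ["hypertension"]},
}

def patients_with_both(db, a, b):
    return [pid for pid, rec in db.items()
            if a in rec.get("problems", []) and b in rec.get("problems", [])]

# With millions of patient trees, this full scan is why such queries took hours or days.
print(patients_with_both(db, "hyperthyroidism", "hypercalcemia"))   # ['123-45-6789']
```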
It is also very hard to modify the underlying structure of the trees if you have complex relationships among the data. That notwithstanding, there is a lot of MUMPS code still out there, and a lot of MUMPS databases, and understanding their strengths is important. If you are a provider, it is absolutely the ideal way to look up a patient at a time. If you are a health care organization, you are trading off a dominant efficiency in one mode of looking at the data for a relative inefficiency in trying to look across the trees.
Network Information Databases:
The next model that appeared on the landscape, generally in computer science and specifically in health care, came with Larry Weed and the PROMIS [Problem-Oriented Medical Information System] system, the health-oriented record created at the University of Vermont, which was a predecessor of the Internet as we know it. It was the database as hypertext, the so-called "network information database", where you have individual records and the relationships can be quite arbitrary. That is, any record can have a parent/child relationship or a peer-level relationship with any other record, and you can have multiple relationships between all kinds of records. So it is a very powerful model for establishing semantic, or meaning, relationships among things in health care.
When we built the PDQ system at the National Cancer Institute, we chose the PROMIS database management system specifically because we had some very severe naming problems in non-Hodgkin's lymphoma. There were three naming systems for these lymphomas that had evolved over many years, and they formed a tangled network. That is, the same word was used for different kinds of lymphoma, and a term that was a "parent" of one term and a "child" of another in one naming system would not occur that way in a second naming system. The network model allowed us to create essentially any arbitrary set of relationships between the information objects, that is, the facts contained in the system. The strength of doing that is that you can model "many-to-many" relationships (and we will talk more about what we mean by "many-to-many" relationships) as well as things that look like hierarchies.
So you can make a network database look just like a MUMPS hierarchy if you need to, or like a simple list of things where each record is chained to the next. You can make it look like a sequential access file, like a hierarchy, or even like a relational system, which we will cover later.
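A network database is, in essence, records plus arbitrary labeled links between them. A minimal sketch, with invented term names and relationship labels standing in for the real nomenclatures:

```python
# A toy network database: records plus arbitrary, labeled links between them.
# Term names and relationship labels are purely illustrative.
links = [
    ("TermA", "parent-of", "TermB"),    # asserted by one naming system
    ("TermC", "parent-of", "TermA"),    # a second system sees TermA as a child
    ("TermB", "synonym-of", "TermC"),   # and a peer-level link ties them together
]

def related(term, label):
    """Any record can point to any other, so many-to-many relationships come for free."""
    return [dst for src, rel, dst in links if src == term and rel == label]

print(related("TermA", "parent-of"))   # ['TermB']
```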
Disadvantages of Network Information Databases:
The disadvantages, and we certainly incurred them when we built the PDQ system, are that it is very difficult to predict and control the effects of transitive relationships. That is, you may know that B is related to C, that one is the parent and the other the child, or that they are siblings. But then someone comes along sometime later and creates a backlink, and all of a sudden you have a new chain through B, C and D, and a relationship between B and D that you didn't anticipate. So you can get recursion, where things point to one another in loops, and you can even get these bi-directional relationships.
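The point is easy to demonstrate: following the links out of B, a later-added link quietly makes D (and even B itself, through a loop) part of B's neighborhood. A small sketch with abstract record names:

```python
# Records and their outgoing links, named as in the B/C/D example above.
graph = {"B": ["C"], "C": ["D"], "D": ["B"]}   # D -> B is the later backlink

def reachable(graph, start):
    """Everything transitively linked from a record; 'seen' guards against loops."""
    seen, stack = set(), [start]
    while stack:
        for nxt in graph.get(stack.pop(), []):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen

print(reachable(graph, "B"))   # {'B', 'C', 'D'}: B now reaches D, and even itself
```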
So what happens is that you get a very tangled web, an apt metaphor, because that is exactly what the Internet is like. HyperCard on the Macintosh and the World Wide Web, where anything can point to anything and it is all just a bunch of links, are really this model of hypertext raised to a global level, where it is absolutely impossible to predict or control the relationships between Websites and URLs because it is so easy to set links between things. It is also very input/output intensive; that is, the computer has to go to the disk a lot to assemble the whole picture. One logical record might describe, for example, a relationship in a physiologic system, but the computer has to individually read not only the individual facts but also the relationships between them. And there is the potential, I have to say more than potential, it is almost inevitable, that sooner or later your database becomes incomprehensible. You just can't understand what it is you have built. It is not unlike the Web.
Relational Database Systems:
So that gave rise to the notion of rows and columns, the dominant metaphor of our time, built on the theory of relational systems, which is based on mathematical set theory; the key paper was E. F. Codd's work from about 1970. The unit of information is by and large very simple and very easy to understand: a two-dimensional array of rows and columns. Rows and columns define a table, and one can have relationships between individual data items within a table and among tables in increasingly complicated models. But each of the individual elements is this simple two-dimensional array, almost like a 3 x 5 card.
Advantages of Relational Database Systems:
The benefit of doing that is that even as the model grows in complexity and you get dozens and dozens of tables, the individual atomic element tends to remain understandable. It is then relatively easy to create a variety of logical aggregations by setting relationships between tables. For example, a data linkage can associate a value in the patient unique identifier ("patient UI") column, say 12345, with the same data value in a column of the same name in a table called "lab test". So it is possible to create arbitrary logical collections of records based on identical data values occurring in columns of different tables. One can also define views, so that depending upon what your rights of access are, certain users cannot see certain tables. At the level of the row and the column it is possible to create rules and filters that make the database appear to have many, many different structures, even though underneath it may be a much larger group of data items.
The good news is that the structure is easily modifiable. If you add new elements, say a new column to a table, relational database systems tend not to break, because code that looks for the existing column names still finds them. Unlike the sequential record, where changing the underlying sequence of characters causes the computer to look in the wrong place, the model of a relational system is that you can add new columns, and as long as you don't break or delete the old columns, programs written many years ago continue to work as they were originally designed. So it is easy to modify the structure.
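Both points, the join by matching column values and the painless addition of a column, can be sketched with Python's built-in sqlite3. The table and column names are illustrative assumptions, not any particular clinical schema:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE patient  (patient_ui INTEGER, name TEXT, dob TEXT);
    CREATE TABLE lab_test (patient_ui INTEGER, test TEXT, value REAL);
    INSERT INTO patient  VALUES (12345, 'Elmer Smith', '1932-04-01');
    INSERT INTO lab_test VALUES (12345, 'serum sodium', 141);
    -- A view exposing only what a given class of user is allowed to see.
    CREATE VIEW lab_only AS SELECT patient_ui, test, value FROM lab_test;
""")

# The logical collection exists only at query time, assembled by matching the
# identical value (12345) in columns of the same name in two different tables.
print(con.execute("""
    SELECT p.name, t.test, t.value
    FROM patient p JOIN lab_test t ON p.patient_ui = t.patient_ui
    WHERE p.patient_ui = 12345
""").fetchall())

# Adding a column years later does not break old queries, because they name the
# columns they need rather than relying on byte positions in the record.
con.execute("ALTER TABLE lab_test ADD COLUMN units TEXT")
print(con.execute("SELECT test, value FROM lab_test").fetchall())
```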
Disadvantages of Relational Database Systems:
Now the disadvantage, again, is that it is I/O (input/output) intensive. There is a lot of disk activity because the computer, in order to find Elmer Smith's serum sodium, has to go to one table, get the unique identifier, then go to an intermediate file called an index that tells it where to find the same value in another file. So there is lots and lots of disk activity, and if you look at the little light on your disk drive, it blinks a lot before the first record appears on the screen. One logical record may be comprised of many, many physical records.
The original GenBank database, for example, was a flat sequential text file describing the DNA sequence, the bibliographic record, and some keywords and features in the sequence, and those were stored as individual flat text files. Walter Goad's group out at Los Alamos created the first version of GenBank that had those features. When the National Library of Medicine took over GenBank and moved it to the Library in the late 1980s, the fully normalized (and I will tell you what I mean by that) relational version of those same files had 65 tables in it, because of all the elements that can repeat within a GenBank record. We will go through that in our database exercise at the end of this session.
Object-Oriented Databases:
There was a new kid on the block in the late 80's called object-oriented databases. Has anybody worked with object-oriented databases? Smalltalk was one of the first of them, and they were the wave of the future 15 years ago. They still appear to be the wave of the future at the turn of the 21st Century.
Advantages of Object-Oriented Databases:
They have very attractive features and properties, such as the ability to store multiple data types. So we could put pictures, images, sounds, graphics, and electrical signals into the database and not worry about the data type. We would simply stick all of those things in the container and let the database worry about how to handle all the different processing.
The object-oriented model of database design included the notion of encapsulation of data and programs. The notion of encapsulation was that the data carried around with it the links that told the system how to behave properly. So, for example, if you want to print an image, say a black-and-white or color image, that often requires a different print routine than simply printing a grocery list of alphanumeric characters. The data would carry with it the specification of which programs should do which operations on it. An additional notion of encapsulation was that as a system designer you never needed to worry about that, because the data would take care of itself in that regard. And that resulted in the notion that the database programs that did useful work could be sets of very small messages, such as "print yourself". The data would somehow know what that meant, and it would go out and gather together all of the programs it needed to print itself.
The advantages were that application programs could consist of these very high-level commands and functions that do not need to know anything about the underlying data organization. It also featured the idea of modularity: we could write little, provably correct modules that would always run as designed, string them together, and get reliable, predictable effects. There was the idea of reusability, that we could write one of these modules and use it for all kinds of different things; it would be sitting out there knowing how to behave correctly. And we could have portability between systems.
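The "print yourself" idea can be sketched with ordinary Python classes: each kind of data carries its own printing behavior, and the application sends the same small message to everything. The class and method names are illustrative, not any particular object-oriented database's API.

```python
class TextNote:
    def __init__(self, text):
        self.text = text
    def print_yourself(self):
        # Plain alphanumeric data: a simple text rendering is enough.
        print(self.text)

class XRayImage:
    def __init__(self, pixels):
        self.pixels = pixels
    def print_yourself(self):
        # Image data needs a different routine; the object itself knows which one.
        print(f"[rendering {len(self.pixels)} pixels as a grayscale image]")

# The application never inspects the data type; it just sends the same message.
for item in (TextNote("eggs, milk, bread"), XRayImage([0] * 512)):
    item.print_yourself()
```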
Disadvantages of Object-Oriented Databases:
The disadvantage is that these systems are early in commercialization. That text has not changed since 1987; these systems have been "early in commercialization" for a very, very long time. They are also very, very CPU intensive, because they have to do additional levels of decoding on account of this encapsulation, compared to standard database systems.
And unfortunately, although we had this idea of portability, it has never been realized. There are many dialects of objects and queries even now, and no uniform language was ever developed of the kind found in SQL, the Structured Query Language that "talks" to relational systems and has become the dominant language of databases.
XML Databases:
The newest "kid on the block" is XML databases, built on the extensible markup language. XML is a first cousin of HTML, which you will be having more experience with in this course, and it is built on the notion of tagged formats: a "start tag", an "end tag", and in between them the data defined by those tags. So in a sense the tags are the field names, and the data between them is the content of the database.
XML is a proper subset of what had been a page description language. The librarians will recognize that SGML was developed by publishers as a page description language for print formatting; it told you which words to make bold, which words to make italic, and where to place them on the page. That page description language evolved into HTML, which is a lossy subset, in the sense that you can take SGML and make HTML out of it, but you lose content by doing so. You can't go back to SGML, because HTML does not have all of the elements of full SGML.
In the XML database model, the unit record is not necessarily defined in the structure of a computer program or a database layer, but rather by what is called a "DTD", or Document Type Definition, and we will show you an example of that. Every vendor is now claiming that they have XML databases, but a lot of this is actually a layer of processing where the actual data is not stored with XML tags; it may be stored as a relational table. When the data goes in and out of the database, the tags get added or stripped off, so that it is stored in a relatively compact fashion. But as far as you are concerned, it appears as if the data were stored with its tags.
So here is an example of an XML-coded medical record entry. It looks, again, like HTML: you have the definition of a high-level object, in this case a document, and a closing tag which says, "This is the end of the document." And inside of it is a set of defined identifiers.
So, for example, this document happens to have a document type of diagnostic radiology examination. Then it has some other predefined variables, tagged elements of the database, if you will. There is a definition of an event, and a definition of a patient, where the patient has the characteristics of patient ID, patient name, patient date of birth, and patient sex value, and then the end of the patient logical record.
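The transcript describes the slide rather than reproducing it, so here is a hedged reconstruction of what such a record might look like, parsed with Python's standard ElementTree. The tag names and values are assumptions for illustration, not the actual DTD.

```python
import xml.etree.ElementTree as ET

# Illustrative record only; tag names and values are assumptions, not a real DTD.
record = """
<document type="diagnostic radiology examination">
  <event date="2003-02-10" kind="chest x-ray"/>
  <patient>
    <patient_id>12345</patient_id>
    <patient_name>Elmer Smith</patient_name>
    <patient_dob>1932-04-01</patient_dob>
    <patient_sex>M</patient_sex>
  </patient>
</document>
"""

doc = ET.fromstring(record)
# The tags act as the field names; the text between them is the data.
print(doc.findtext("patient/patient_name"))   # Elmer Smith
```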
And you can define arbitrary information objects to store; that is exactly the power of XML, that you can create these things in a way that represents an almost arbitrary level of complexity. You decompose the major object into a nested hierarchy where, in between the high-level tags like "document" and "end document", "event" and "end event", "patient" and "end patient", one can have arbitrary repeats of the characteristics of that particular object.
This is very attractive, and it is also the case that XML can now be natively interpreted by things like Web browsers. So it is possible with XML to create not only a data model, but something that the Web browser automatically knows how to display correctly, because it understands this syntax of an "object" and "end object" tag.
Advantages of XML Databases:
It is a very powerful model, and its strengths are this flexibility to represent a very wide range of data, almost anything you can imagine. The data carries with it its field assignment, and it handles relatively sparse data compactly. So, for example, if there is no instance of a laboratory value for a medical record, you don't even put that tag in; you can leave it out.
Disadvantages of XML Databases:
The weaknesses are that the database tools for doing this are not nearly as mature and efficient as they are for relational systems. The messages also tend to be very verbose.
I had a very painful experience in my bioinformatics work of downloading all of Medline; the 10.7 million citations in Medline are now available only in XML format. Has anybody looked at the Medline XML format? We are going to go into it in detail, but the XML record of Medline is relatively complicated. It has all these little sub-elements like author initials, author names, publication types, publication years, and pages, and it goes on and on. The 36 gigabytes of XML that comprise Medline contain 12 gigabytes of data and 24 gigabytes of XML tags. So it is an extreme example of the verbosity of the messages and tags.
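That two-bytes-of-tag-per-byte-of-data ratio is easy to appreciate with a quick measurement on a single record. A rough sketch with a made-up miniature citation, using only the standard library:

```python
import xml.etree.ElementTree as ET

# A made-up miniature citation record, purely to illustrate tag overhead.
record = ("<citation><author_name>Smith J</author_name>"
          "<pub_year>2003</pub_year><pages>1-7</pages></citation>")
data_bytes = sum(len(text) for text in ET.fromstring(record).itertext())
print(f"{data_bytes} bytes of data out of {len(record)} bytes total; "
      f"the other {len(record) - data_bytes} bytes are tags")
```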
In some cases where you have relatively complex records and many, many, many records...millions of them...almost all of your processing time is spent just reading the doggone tags. I have actually been an advocate of asking NLM to publish a more compact version of the new XML, but since they are using it for internal production, I haven't gotten very far.
It is also obviously I/O intensive, because whenever the computer encounters a tag it has to go look up what it means, so it trades off decreased efficiency for increased flexibility. I put a question mark on scalability last year, before I encountered this Medline problem, but I am increasingly persuaded that XML is an excellent representation for medium-sized databases of medium complexity. So, up to maybe a million records and maybe a hundred fields, it is good for that. It is not so good at 10.7 million records times however many fields there are in the Medline format.