by W H Inmon, Forest Rim Technology

In today’s world most medical records take the form of an EMR (or EHR). With the EMR, the medical community is able to capture and transfer medical records by computer.

But there is an Achilles heel to the EMR: it contains text, or narrative information. To be handled usefully by the computer, text must first be put into a structured format. Narrative is almost useless to the computer as long as it remains in the form of text.

By placing the contents of the EMR into a form that is useful to the computer, the medical community can read and analyze a practically unlimited amount of data. Stated differently, while the EMR remains text, the medical community is limited to the amount of data that an individual can manually read or analyze, and an individual can manually process only a small fraction of the records. By putting the medical record into a form that is useful to the computer, the medical community opens up the possibility of reading and analyzing thousands and thousands of medical records efficiently and conveniently.

Fortunately there is new technology that allows the text found in the EMR to be transformed into a structure that is amenable to the computer. That technology is known as Textual ETL.

Textual ETL is a technology that reads raw, narrative text such as that found in medical records and turns that text into a standard structured data base. A central and important part of the work that Textual ETL does is the disambiguation of the text found in the medical record. The data base that is produced is said to contain “normalized” or “disambiguated” text.

The Textual ETL Process

A medical record is read and processed by Textual ETL. Fig 1 shows the essential Textual ETL process.

The result of the disambiguation process done by Textual ETL is the “normalization” of text. The text is produced in a linear manner in a data base. While the data base that Textual ETL produces is usable, the linearity of its data makes it less than intuitive to the neophyte. To make the data more usable and more intuitive, it is necessary to restructure the data inside the data base. Once restructured, the data is much more “friendly” or intuitive to the person needing to use it.

Restructuring the Disambiguated, Normalized Data

Fig 2 shows that the data coming out of Textual ETL is restructured into a more intuitive format.

The format of the rows produced by the restructuring looks like that seen in Fig 3.

At first glance the rows that are produced appear to be a simple flat file. Initially it is not obvious that the rows contain anything terribly important or interesting. But on closer examination, the rows produced by the Textual ETL/restructuring process closely reflect the narration found in the medical record.

In order to see the relationship between the source medical record and the restructured formatted data base record that has been created, consider the following.

Extracting Words & Phrases from the Medical Record

Fig 4 shows that the medical record has been scanned and analyzed, and that certain words and phrases have been selected from the medical record for inclusion into the data base.

The word or phrase that has been selected for inclusion in the data base is the result of one of many different types of processing done by Textual ETL. Some of the Textual ETL processes that might be responsible for the selection include taxonomy resolution, homograph resolution, or acronym resolution. Or the word or phrase might have been selected because of proximity resolution, stop word processing, custom variable processing or inline contextualization. There are many other techniques for selecting a word or phrase from the medical record. In every case, Textual ETL has selected the word or phrase because it is important in the medical record and needs to be available to the research analyst.

However the word or phrase was selected from the medical record, it is found in the row of data that has been extracted, as seen in the diagram.
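As a rough illustration of two of the techniques named above, stop word processing and acronym resolution, consider the following minimal sketch. It is not Textual ETL's actual implementation. The stop word list, the acronym table and the function name `select_terms` are all assumptions made for the example.

```python
import re

# Hypothetical stop word list; a production list would be much longer.
STOP_WORDS = {"the", "a", "an", "of", "and", "has", "is", "with"}

# Hypothetical acronym table mapping clinical shorthand to its expansion.
ACRONYMS = {"htn": "hypertension", "sob": "shortness of breath"}


def select_terms(sentence):
    """Drop stop words and expand known acronyms in a sentence."""
    tokens = re.findall(r"[A-Za-z]+", sentence)
    selected = []
    for token in tokens:
        key = token.lower()
        if key in STOP_WORDS:
            continue  # stop word processing: discard words with no analytic value
        # acronym resolution: replace shorthand with its expansion when known
        selected.append(ACRONYMS.get(key, key))
    return selected


terms = select_terms("The patient has HTN and SOB")
```

Here `terms` holds the words and phrases that survive both steps, each of which would become a candidate row in the data base.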

Patient Identification

In the same row of data in the data base is found the patient identification, as seen in Fig 5.

It is seen in the figure that the patient identifier has been located in the medical record. The patient identifier is then attached to every row in the data base belonging to the medical record. Because the patient identifier and the word or phrase of interest are found in the same row, it is immediately obvious for whom the word or phrase was written in the medical record.
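The idea of stamping every row with the patient identifier can be sketched as follows. This is only an illustration under assumptions: the "Patient ID:" label, the sample record and the function name `attach_patient_id` are hypothetical, and real medical records identify patients in many different ways.

```python
import re

# Hypothetical pattern: assumes the identifier appears as "Patient ID: <value>".
PATIENT_ID_PATTERN = re.compile(r"Patient ID:\s*(\S+)")


def attach_patient_id(record_text, phrases):
    """Return one row per selected phrase, each carrying the patient identifier."""
    match = PATIENT_ID_PATTERN.search(record_text)
    patient_id = match.group(1) if match else None
    return [{"patient_id": patient_id, "phrase": phrase} for phrase in phrases]


# Invented sample record, for illustration only.
record = "Patient ID: P-1001\nThe patient reports chest pain. Prescribed Zofran."
rows = attach_patient_id(record, ["chest pain", "Zofran"])
```

Every row produced for the record carries the same identifier, so no row has to be traced back to the source document to learn whom it concerns.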

Negation of Word or Phrase

Another important piece of data is the negation of the term. Fig 6 shows that occasionally a term found in a medical record will be negated by the doctor writing the medical report.

Occasionally a doctor will say – “The patient does not have angina.” In this case there is a negation of the term that is found in the medical record. It is very straightforward and obvious when a term – a word or phrase – has been negated because the negation appears in the same row of data as the word or phrase being negated.
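A very simple way to picture negation handling is to look for a negation cue that precedes the term in the same sentence. The sketch below does only that. The cue list and the function names are assumptions, and this is a deliberately naive stand-in rather than the method Textual ETL uses.

```python
import re

# Hypothetical cue list; a real system would need a much richer set of cues.
NEGATION_CUES = ("no", "not", "denies", "without")


def is_negated(sentence, term):
    """Return True if a negation cue appears before the term in the sentence."""
    lowered = sentence.lower()
    position = lowered.find(term.lower())
    if position == -1:
        return False  # the term does not occur in this sentence at all
    words_before = re.findall(r"[a-z']+", lowered[:position])
    return any(word in NEGATION_CUES for word in words_before)


def negation_row(sentence, term):
    """Build a row that keeps the term and its negation flag side by side."""
    return {"phrase": term, "negated": is_negated(sentence, term)}


row = negation_row("The patient does not have angina.", "angina")
```

Because the flag sits in the same row as the term, an analyst counting cases of angina can exclude negated mentions with a single condition.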

Taxonomic Identification of a Word or Phrase

Another important relationship of data is the taxonomical categorization of the word or phrase. Typical words or phrases that are taxonomically significant in the world of medicine include words from SNOMED or ICD-10, for example. Fig 7 shows that the taxonomical categorization of the word or phrase is found in the same row as the word or phrase.

Not all words or phrases have a taxonomical categorization. If that is the case this column of data will be blank. But if a word or phrase has a taxonomical categorization, this is where it will be found. Also note that on occasion a word or phrase will have more than one taxonomical categorization. If that is the case there will be more than one row of data that has been created. There will be one row of data created for each taxonomical categorization that applies to the word or phrase.

As a simple example of a taxonomical categorization, the word “Zofran” might be classified as a medication.

The taxonomical categorization may appear by various means in Textual ETL. The simplest and most common means by which taxonomical categorization appears is by simple taxonomy resolution. But there are other techniques by which taxonomy categorization appears as well.
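Simple taxonomy resolution, including the rule of one row per categorization and a blank column when none applies, might be sketched like this. The tiny taxonomy below is invented for illustration and is not drawn from SNOMED or ICD-10. Only the classification of "Zofran" as a medication comes from the example above, and the function name `taxonomy_rows` is an assumption.

```python
# Hypothetical taxonomy: maps a lowercased term to its categorizations.
TAXONOMY = {
    "zofran": ["medication"],
    "cold": ["diagnosis", "symptom"],  # invented entry with two categorizations
}


def taxonomy_rows(phrase):
    """Return one row per taxonomical categorization of the phrase.

    A phrase with no categorization still yields one row, with the
    taxonomy column left blank.
    """
    categories = TAXONOMY.get(phrase.lower(), [])
    if not categories:
        return [{"phrase": phrase, "taxonomy": ""}]
    return [{"phrase": phrase, "taxonomy": category} for category in categories]


zofran_rows = taxonomy_rows("Zofran")
cold_rows = taxonomy_rows("cold")
blank_rows = taxonomy_rows("yesterday")
```

A term with two categorizations produces two rows, so a query on either category finds it without any special handling.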

Taxonomical categorization is most helpful in the disambiguation of text.

Sub-classification in the Medical Record

Another important piece of data found in the restructured data base is that of the sub-classification of text created by the doctor making the medical record. Fig 8 shows the sub-classification of data.

A sub-classification of data may be a topic such as “Nose”. The patient may have had some notable condition that pertains to the nose. The doctor would simply create a category of data for “Nose” and then start to make comments about the nose. If those comments included the word or phrase that has been selected, then the subcategory would appear in this part of the data base record.

Super Classification of Text in the Medical Record

In line with the doctor’s creation of subcategories is the occasional “super category” of text that is created. Fig 9 shows the super category of text and where it is placed in the data base record.

A super class of categorization might look like a doctor’s “Impression”, an “Assessment”, or a “Treatment Plan”. The super categorization may or may not include one or more sub-classifications, depending on the doctor’s style in the creation of the medical record.
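How sub-classifications and super categories might be carried onto each row can be sketched as follows. The sketch assumes, purely for illustration, that a heading is a line ending in a colon and that the super categories are exactly the three named above. Real medical records follow no such fixed convention, and the function name `classify_lines` is hypothetical.

```python
# Assumption: the only super categories are the three examples from the text.
SUPER_CATEGORIES = {"impression", "assessment", "treatment plan"}


def classify_lines(lines):
    """Tag each line of narrative with the headings it was written under."""
    super_category = ""
    sub_category = ""
    rows = []
    for line in lines:
        text = line.strip()
        if not text:
            continue
        if text.endswith(":"):  # assumption: headings end with a colon
            heading = text[:-1]
            if heading.lower() in SUPER_CATEGORIES:
                super_category = heading
                sub_category = ""  # a new super category resets the sub category
            else:
                sub_category = heading
            continue
        rows.append({
            "text": text,
            "super_category": super_category,
            "sub_category": sub_category,
        })
    return rows


# Invented sample note, for illustration only.
note = [
    "Assessment:",
    "Nose:",
    "Mild congestion noted.",
    "Treatment Plan:",
    "Prescribed Zofran.",
]
rows = classify_lines(note)
```

The second row has a super category but a blank sub-classification, reflecting a doctor who wrote directly under “Treatment Plan” without creating a subcategory.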

The Order of Text

Another important feature of the records of data that are created is the order in which the doctor has created the record. Fig 10 shows that the sequence of terms created by the doctor in the medical record is recorded and maintained.
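One simple way to preserve the order of terms is to record where each term occurs in the text and number the rows accordingly. The sketch below does this under assumptions: the column names `sequence` and `offset` and the function name `ordered_rows` are hypothetical, and only the first occurrence of each phrase is considered.

```python
def ordered_rows(record_text, phrases):
    """Number the selected phrases in the order they appear in the record."""
    lowered = record_text.lower()
    found = []
    for phrase in phrases:
        offset = lowered.find(phrase.lower())  # first occurrence only
        if offset != -1:
            found.append((offset, phrase))
    found.sort()  # sort by position in the text
    return [
        {"sequence": number, "phrase": phrase, "offset": offset}
        for number, (offset, phrase) in enumerate(found, start=1)
    ]


# Invented sample text, for illustration only.
text = "Patient reports nausea. No angina. Prescribed Zofran."
rows = ordered_rows(text, ["Zofran", "nausea", "angina"])
```

Whatever order the phrases are supplied in, the rows come back in the order the doctor wrote them.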

Identifying the Medical Record

A final important piece of information is the identification of the medical record itself. Fig 11 shows that the identification of the medical record is retained for all the entries in the data base.

It is seen then that there is a very close correlation between the important elements of the medical record and the restructured data base that has been created. All of the information needed by the research analyst is found in the same record. There is no searching that is needed by the research analyst because all the pertinent data is held in the same record. There are no “look ups” that are required. Because all the data that is pertinent and important is included in a single row of the restructured, disambiguated data, the processing required of the computer analyst is as straightforward as it can get. Processing a data base analytically does not get any easier than reading a single record and processing it.
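To make the single-row idea concrete, the sketch below models a restructured row and answers a question in one pass over the rows, with no look ups. The column names are assumptions inferred from the elements described above, not the actual layout produced by Textual ETL, and the sample rows are invented.

```python
from dataclasses import dataclass


@dataclass
class Row:
    """One restructured row; the column names are hypothetical."""
    record_id: str
    patient_id: str
    sequence: int
    phrase: str
    negated: bool
    taxonomy: str
    sub_category: str
    super_category: str


def patients_with(rows, phrase):
    """Patients for whom the phrase appears and is not negated, in one pass."""
    wanted = phrase.lower()
    return {
        row.patient_id
        for row in rows
        if row.phrase.lower() == wanted and not row.negated
    }


# Invented sample rows, for illustration only.
rows = [
    Row("MR-1", "P-1001", 1, "angina", True, "diagnosis", "", "Assessment"),
    Row("MR-2", "P-1002", 1, "angina", False, "diagnosis", "", "Assessment"),
    Row("MR-2", "P-1002", 2, "Zofran", False, "medication", "", "Treatment Plan"),
]
result = patients_with(rows, "angina")
```

Each row is examined on its own, which is why the analysis needs no joins and no second trip to the source medical record.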

All Pertinent Information in a Single Record

Fig 12 shows that all the pertinent information needed for analysis that comes from the medical record is found in the record itself.

Visualization of the Data

Once the data base of disambiguated data/text is created, it is often used as input into visualization software. Many people like to see visualizations of data rather than data in a data base.

Fig 14 shows the usage of visualization/analytical software.

Mirror Image

Another perspective of the medical record and the restructured data base is that the restructured data base is a mirror image of the medical record. The primary difference between the two forms of the data is that the restructured record is in the form of a relational data base, which can be read and understood by the computer.

Fig 15 shows this mirror relationship.
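Because the restructured data is relational, it can be queried with ordinary SQL. The following sketch uses Python's built-in sqlite3 module with an in-memory data base. The table name, the column names and the sample rows are all assumptions made for illustration.

```python
import sqlite3

connection = sqlite3.connect(":memory:")

# Hypothetical table layout mirroring the elements of the medical record.
connection.execute(
    """
    CREATE TABLE restructured (
        record_id      TEXT,
        patient_id     TEXT,
        sequence       INTEGER,
        phrase         TEXT,
        negated        INTEGER,
        taxonomy       TEXT,
        sub_category   TEXT,
        super_category TEXT
    )
    """
)

# Invented sample rows, for illustration only.
sample_rows = [
    ("MR-1", "P-1001", 1, "nausea", 0, "symptom", "", "Assessment"),
    ("MR-1", "P-1001", 2, "angina", 1, "diagnosis", "", "Assessment"),
    ("MR-1", "P-1001", 3, "Zofran", 0, "medication", "", "Treatment Plan"),
    ("MR-2", "P-1002", 1, "Zofran", 0, "medication", "", "Treatment Plan"),
]
connection.executemany(
    "INSERT INTO restructured VALUES (?, ?, ?, ?, ?, ?, ?, ?)", sample_rows
)

# Count non-negated mentions per taxonomical category.
counts = connection.execute(
    """
    SELECT taxonomy, COUNT(*)
    FROM restructured
    WHERE negated = 0
    GROUP BY taxonomy
    ORDER BY taxonomy
    """
).fetchall()
```

The query touches a single table, and the negated mention of angina is excluded by one condition in the WHERE clause.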


Bill Inmon is the founder of Forest Rim Technology located in Castle Rock, Colorado. Forest Rim Technology produces Textual ETL and the data base that can be restructured from Textual ETL. With Textual ETL you can turn document-oriented data into an analytical data base that can be analyzed by the computer analyst.