Transforming Medical Records into a Computer Usable Database Format

Electronic Medical Records and Textual ETL

In today’s world, most medical records take the form of electronic medical records (EMR). With EMR, the medical community can capture and transfer medical records by computer. But EMR has an Achilles heel.

That Achilles heel is that the EMR contains text, or narrative information. To be handled usefully by the computer, text must first be put into a structured format. In its raw narrative form, text is almost useless to the computer. Once the contents of the EMR are in a form the computer can use, the medical community can read and analyze far more data than any person could review by hand. Stated differently, as long as the EMR holds only text, the medical community is limited to the amount of data an individual can read or analyze manually, and an individual can process only a limited number of documents.

By putting the medical record into a form that is useful to the computer, the medical community opens up the possibility of reading and analyzing thousands of medical records efficiently and conveniently. An individual working through the same records as raw text can process only a small fraction of them. Fortunately, there is technology that transforms the text found in the EMR into a structure the computer can work with.

That technology is Textual ETL. Textual ETL reads raw, narrative text, such as that found in medical records, and turns it into a standard structured database. A central part of the work Textual ETL does is the disambiguation of the text it reads from the medical record. The resulting database is said to contain “normalized” or “disambiguated” text.


A medical record is read and processed by Textual ETL. Fig 1 shows the essential Textual ETL process. The result of the disambiguation done by Textual ETL is the “normalization” of text, which is written to a database in a linear manner. While the database produced by Textual ETL is usable, the linearity of its data makes it less than intuitive to the neophyte. To make the data more usable and more intuitive, it is necessary to restructure it inside the database. Once restructured, the data is much more “friendly” to the person who needs to use it.

Restructuring the Disambiguated, Normalized Data

The data coming out of Textual ETL is restructured into a more intuitive format. At first glance, the rows that are produced look like a simple flat file, and it is not obvious that they contain anything important or interesting. On closer examination, however, the rows produced by the Textual ETL/restructuring process closely reflect the narration found in the medical record. To see the relationship between the source medical record and the restructured database record, consider the following.

Extracting Words & Phrases from the Medical Record

Certain words and phrases are selected from the medical record for inclusion in the database. A word or phrase is selected as the result of one of many different types of processing done by Textual ETL. For example, it might have been selected because of proximity resolution, stop word processing, custom variable processing or inline contextualization, and there are many other selection techniques. In every case, Textual ETL has selected the word or phrase because it is important in the medical record and needs to be available to the research analyst. However the word or phrase was selected from the medical record, it is found in the row of data that has been extracted, as seen in the diagram.

Patient Identification

The same row of data in the database also holds the patient identification. The patient identifier is attached to every row in the database that belongs to the medical record. Because the patient identifier and the word or phrase of interest are found in the same row, it is immediately obvious for whom the word or phrase was written in the medical record.

Negation of Word or Phrase

Another important piece of data is the negation of the term. Occasionally a term found in a medical record will be negated by the doctor writing the report. For example, a doctor might write, “The patient does not have angina.” In this case the term found in the medical record has been negated. It is obvious when a term (a word or phrase) has been negated because the negation appears in the same row of data as the term.
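
As a minimal sketch of how the negation column might be used, assume each restructured row carries a patient identifier, the selected term, and a negation flag. The tuple layout, the function name patients_with_term, and the sample values below are hypothetical illustrations, not the actual output layout of Textual ETL.

```python
# Hypothetical rows: (patient_id, term, negated). The layout and the
# values are assumptions for illustration, not real Textual ETL output.
rows = [
    ("P001", "angina", True),    # e.g. "The patient does not have angina."
    ("P002", "angina", False),   # term stated without negation
    ("P003", "angina", True),
]


def patients_with_term(rows, term):
    """Return sorted IDs of patients for whom the term appears un-negated."""
    return sorted(
        {pid for pid, t, negated in rows if t == term and not negated}
    )


print(patients_with_term(rows, "angina"))
```

Because the flag sits in the same row as the term, the filter needs no second pass over the source text to decide whether a mention counts.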

Sub-classification in the Medical Record

Another important piece of data found in the restructured database is the sub-classification of text created by the doctor making the medical record. A sub-classification may be a topic such as “Nose”. The patient may have had some notable condition that pertains to the nose. The doctor would simply create a category for “Nose” and then make comments about the nose. If those comments include the word or phrase that has been selected, the sub-classification appears in this part of the database record.

Super Classification of Text in the Medical Record

In line with the doctor’s creation of sub-classifications is the occasional “super classification” of text. A super classification might be a doctor’s “Impression”, “Assessment”, or “Treatment Plan”. A super classification may or may not include one or more sub-classifications, depending on the doctor’s style in creating the medical record.

Identifying the Medical Record

A final important piece of information is the identification of the medical record itself.

There is thus a very close correlation between the important elements of the medical record and the restructured database that has been created. All of the information needed by the research analyst is found in the same row. No searching and no “look ups” are required. Because all the pertinent data is included in a single row of the restructured, disambiguated data, the processing required of the analyst is as straightforward as it can get: analytical processing of a database does not get any easier than reading a single record and processing it.
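
The row described in the preceding sections can be sketched as a simple record type. This is only an illustration under stated assumptions: the class name TermRow, its field names, the helper find_rows, and the sample values are all hypothetical, and the real column layout produced by Textual ETL may differ.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class TermRow:
    """One restructured row. All field names here are hypothetical."""
    record_id: str     # identifies the source medical record
    patient_id: str    # identifies the patient
    term: str          # the selected word or phrase
    negated: bool      # True if the doctor negated the term
    sub_class: str     # e.g. "Nose"; empty string if none
    super_class: str   # e.g. "Assessment"; empty string if none


def find_rows(rows: Iterable[TermRow], term: str, super_class: str) -> List[TermRow]:
    """Single pass over the rows; every test uses fields of the row itself."""
    return [
        r for r in rows
        if r.term == term and r.super_class == super_class and not r.negated
    ]


# Made-up sample data for illustration only.
sample = [
    TermRow("R100", "P001", "angina", True, "", "Assessment"),
    TermRow("R101", "P002", "angina", False, "", "Assessment"),
    TermRow("R102", "P003", "congestion", False, "Nose", "Impression"),
]

for row in find_rows(sample, "angina", "Assessment"):
    print(row.patient_id, row.record_id)
```

The point of the sketch is that the query reads each row once and never joins against another table, which is what the absence of “look ups” amounts to in practice.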

Visualization of the Data

Once the database of disambiguated data/text is created, it is often used as input into visualization software. Many people like to see visualizations of data rather than data in a database.

Mirror Image

Another perspective of the medical record and the restructured database is that the restructured database is a mirror image of the medical record. The primary difference between the two forms of the data is that the restructured record is in the form of a relational database, which can be read and understood by the computer.
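
To make the relational form concrete, the sketch below loads a few made-up rows into an in-memory SQLite table and runs one analytical query. The table name term_rows, the column names, and the data are assumptions chosen for illustration; they are not the schema Textual ETL actually produces.

```python
import sqlite3

# In-memory database; table and column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE term_rows (
        record_id   TEXT,
        patient_id  TEXT,
        term        TEXT,
        negated     INTEGER,  -- 1 if the term was negated, else 0
        sub_class   TEXT,
        super_class TEXT
    )
    """
)
conn.executemany(
    "INSERT INTO term_rows VALUES (?, ?, ?, ?, ?, ?)",
    [
        ("R100", "P001", "angina", 1, "", "Assessment"),
        ("R101", "P002", "angina", 0, "", "Assessment"),
        ("R102", "P003", "congestion", 0, "Nose", "Impression"),
    ],
)

# Count distinct patients per term, ignoring negated mentions.
counts = conn.execute(
    """
    SELECT term, COUNT(DISTINCT patient_id)
    FROM term_rows
    WHERE negated = 0
    GROUP BY term
    ORDER BY term
    """
).fetchall()
print(counts)
conn.close()
```

A result set of this kind (term and patient count) is also the sort of summary that could be handed to visualization software.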

Contact us to learn more about how Textual ETL can enhance medical research with medical records.