by W H Inmon, Forest Rim Technology
With the rising costs of medicine and the advent of an aging population, there has never been a better time for accurate and thorough medical research.
For years doctors and hospitals have treated patients and kept records of the treatment, examinations, and outcomes of the care given. For a given patient the information has been adequate. But there is a wealth of information that can be gathered when those medical records are examined collectively. Looking at many medical records together can yield insight into patterns of disease and medical conditions that may not be apparent when looking at just one or two records. Until now, however, examining many medical records collectively has been challenging.
When a patient undergoes medical care, there are many reasons for the encounter. There is:
- emergency care
and many more reasons why a patient needs medical care.
And every time the patient undergoes an episode of care, careful records are taken.
The essence of these records is text that describes the intricacies of the encounter or episode of care. Sometimes the text describing the encounter is verbose. Sometimes the text is terse. The amount of text and the nature of the language depend on the physician, the nature of the encounter, and many other factors.
Over time these medical records are collected by doctors, hospitals, and other agencies. For a given patient the collection of the records forms the personal medical history of the patient. There is much value to the patient from these records.
But there is an even greater value to these records when the records are examined collectively. When a research organization can examine 10,000, 100,000, and even 1,000,000 records at a time, patterns relating to disease and medical conditions start to emerge that say a lot about disease and the human condition – not just information about a given patient.
Over time medical records are collected, often from different sources.
It is customary for these records to be collected electronically. Standard technology stores the electronically collected records on conventional systems such as Microsoft NT, IBM DB2 or Hadoop, among others. Typically disk storage is used to hold the data.
While storing medical records electronically has many advantages and many valid uses, it has one major drawback: the records can be usefully accessed and analyzed only one patient at a time.
There are several reasons for this limitation. The first reason for the limitation is that the records are stored textually. Standard technology does not handle unstructured text well. Standard technology handles structured data, numerical data and transactions quite well. But when it comes to text, standard technology is good for storing the text but not for retrieving and analyzing the text. The lack of structure of the text defeats many of the advantages of standard technology.
A second reason why standard technology does not lend itself to the collective analysis of text is that most of the data resides on very different sources and technologies. One source of medical records is housed in Microsoft’s NT. Another source of medical records is housed in IBM’s DB2. Another source of medical records is housed in Hadoop, and so forth. These technologies simply were never designed to work seamlessly with other technologies. Therefore it is no surprise that trying to look at medical records collectively is a real challenge when the medical records are scattered over different technologies, as they often are.
Another major challenge when medical records are examined collectively is the difference in terminology. Orthopedic surgeons call a broken bone one thing and general practitioners call a broken bone something else. Conversely, the same term can mean different things: the abbreviation “ha” means “heart attack” to a cardiologist while the same abbreviation means “hepatitis A” to an endocrinologist. So merely throwing a bunch of medical records together is no guarantee that a collective analysis will yield anything meaningful.
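The “ha” example can be sketched in a few lines of Python. This is only an illustration of the idea of resolving a homograph from context. The lookup table, the function name, and the use of the author's specialty as the deciding context are all assumptions made for the sketch, not a description of how any particular product works.

```python
# Hypothetical lookup table: (abbreviation, specialty) -> expanded term.
# The two "ha" meanings come from the text; the table layout is assumed.
HOMOGRAPHS = {
    ("ha", "cardiology"): "heart attack",
    ("ha", "endocrinology"): "hepatitis A",
}


def expand_abbreviations(text: str, specialty: str) -> str:
    """Expand known abbreviations, using the author's specialty as context."""
    expanded = []
    for token in text.split():
        key = (token.lower(), specialty.lower())
        # Tokens with no entry for this specialty pass through unchanged.
        expanded.append(HOMOGRAPHS.get(key, token))
    return " ".join(expanded)


note = "patient had ha in 2010"
print(expand_abbreviations(note, "cardiology"))
print(expand_abbreviations(note, "endocrinology"))
```

The same note yields two different expansions depending on who wrote it, which is why pooling records without this step can mislead a collective analysis.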
All of these problems with the integration of text and more must be surmounted if a collective analysis of medical records is to yield anything useful.
Fortunately there is a solution to the need for looking at medical records collectively. That solution is Forest Rim Technology’s Textual ETL technology.
Fig 6 shows that Forest Rim Technology reads medical records wherever they are found, on whatever technology they reside. Forest Rim Technology doesn’t care if the data comes from IBM, Teradata, NT, Oracle or any other source. As long as it is electronically readable text, Forest Rim Technology can handle it. After the medical records are read, terminology differences (synonyms and homographs) are resolved. Forest Rim Technology has sophisticated logic to handle the integration of different terminologies. The data from multiple medical records is integrated into a single whole. Further edits such as stop word removal (e.g., “a”, “an”, “the”, “what”, “to”, “as”) and stemming are performed to make the text that has been read pliable and ready for integrated analysis.
Forest Rim Technology creates an integrated foundation of medical data drawn from any electronically readable source.
After Forest Rim Technology finishes the editing and conditioning of data, Forest Rim Technology can pass the data on to a reporting engine, SeePower. SeePower takes the conditioned data and produces a special kind of visualization: a SOM, or “self-organizing map”.
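The conditioning steps named above (synonym resolution, stop word removal, stemming) can be sketched roughly as follows. This is a generic illustration, not Forest Rim Technology's logic: the synonym table is hypothetical, the stop word list is only the six words quoted in the text, and the suffix-stripping stemmer is deliberately crude (a real pipeline would more likely use an established stemming algorithm).

```python
import re

# Stop words quoted in the text; a production list would be much longer.
STOP_WORDS = {"a", "an", "the", "what", "to", "as"}

# Hypothetical synonym table mapping variants to one canonical term.
SYNONYMS = {"fx": "fracture"}


def naive_stem(word: str) -> str:
    """Strip a few common suffixes. A crude stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        # Only strip when at least three characters remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def condition(text: str) -> list[str]:
    """Lowercase, tokenize, drop stop words, map synonyms, then stem."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    kept = [t for t in tokens if t not in STOP_WORDS]
    return [naive_stem(SYNONYMS.get(t, t)) for t in kept]


print(condition("The patient was treated as an outpatient for a fx"))
```

Note that the crude stemmer would turn "fractures" into "fractur" rather than "fracture", which is one reason a real stemming algorithm is preferable.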
SOMs are a very special kind of visualization. SOMs reflect the entire mass of data that has been read and conditioned. SOMs are capable of representing thousands of documents and millions of words and phrases. In addition, the SOM that is produced is dynamically accessible.
The basic idea behind a SOM is to group related text together and to show it in aggregate. Fig 8 shows a SOM.
In Fig 8 the SOM shows that there is a concentration of information in one place and a sparsity of information elsewhere. In addition the SOM shows that there is a continuum of information from one type of information to the next. All of the text that has been read, every word and phrase from all of the documents, is represented in the SOM.
As an example, suppose the medical records were from women aged 20 to 50. There would be concentrations of information from thousands of medical records about childbirth, monthly cycles, and menopause. There would be less information about smoking, broken bones, and obesity. And there would be very little information about rare blood conditions, rare bone conditions, and other rare disorders.
The information that is regularly occurring in the many medical records would appear grouped together as a “dark spot” in the SOM. The information that is very infrequently occurring would appear as a “light spot” in the SOM.
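To make the “dark spot” and “light spot” idea concrete, here is a minimal sketch of a self-organizing map in plain Python. It is a textbook-style training loop, not SeePower's implementation. It assumes each record has already been reduced to a small numeric vector (here, hypothetical indicator values for three topics); the grid size, learning rate, and decay schedule are arbitrary choices for illustration.

```python
import math
import random


def best_cell(grid, vec):
    """Return the grid cell whose weight vector is closest to vec."""
    return min(grid, key=lambda cell: math.dist(grid[cell], vec))


def train_som(vectors, rows=3, cols=3, epochs=50, seed=0):
    """Train a small rectangular SOM; returns {(row, col): weights}."""
    rng = random.Random(seed)
    dim = len(vectors[0])
    grid = {(r, c): [rng.random() for _ in range(dim)]
            for r in range(rows) for c in range(cols)}
    for epoch in range(epochs):
        frac = 1.0 - epoch / epochs
        rate = 0.5 * frac                           # learning rate decays
        radius = max(rows, cols) / 2 * frac + 0.5   # neighbourhood shrinks
        for vec in vectors:
            bmu = best_cell(grid, vec)
            for cell, weights in grid.items():
                dist = math.dist(cell, bmu)
                if dist <= radius:
                    # Cells near the best match are pulled toward the input.
                    pull = rate * math.exp(-dist * dist / (2 * radius * radius))
                    for i in range(dim):
                        weights[i] += pull * (vec[i] - weights[i])
    return grid


def hit_counts(grid, vectors):
    """Count how many records land on each cell (the map's density)."""
    counts = {cell: 0 for cell in grid}
    for vec in vectors:
        counts[best_cell(grid, vec)] += 1
    return counts


# Hypothetical topic vectors: (childbirth, smoking, rare blood condition).
RECORDS = [[1.0, 0.0, 0.0]] * 8 + [[0.0, 1.0, 0.0]] * 3 + [[0.0, 0.0, 1.0]]

som = train_som(RECORDS)
for cell, count in sorted(hit_counts(som, RECORDS).items()):
    print(cell, count)
```

Cells with high counts correspond to the frequently occurring information (the dark spots); cells with few or no hits correspond to rare or absent information (the light spots).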
One of the most useful aspects of the SOM is the ability to drill down.
When an analyst drills down, the analyst selects one word or phrase and explores the word and its relationship with other words further. In addition, the analyst can drill across. The analyst can see what text is closely related to what other text. All of this analysis is done by moving a cursor across the SOM.
As an example, suppose the analyst finds an unexpectedly large number of cases of emphysema. The analyst can isolate those cases and look at them in many ways: by geography, by age, by gender, by weight, by smoking habits, and so forth. The drill down can go to as low a level of detail as desired.
Furthermore, if a really deep analysis is required, the analyst can look at the source documents that the word or phrase came from.
In the case of drilling down on emphysema, the analyst can go down to the actual medical record itself.
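The drill-down described above, from a term to a breakdown by attribute and finally to the source records, can be sketched with an inverted index. The record layout, field names, and sample data below are hypothetical and exist only to illustrate the idea.

```python
from collections import Counter, defaultdict

# Hypothetical conditioned records: an id, some attributes, and the terms found.
RECORDS = [
    {"id": "r1", "state": "CO", "smoker": True, "terms": {"emphysema", "cough"}},
    {"id": "r2", "state": "CO", "smoker": True, "terms": {"emphysema"}},
    {"id": "r3", "state": "UT", "smoker": False, "terms": {"fracture"}},
    {"id": "r4", "state": "UT", "smoker": False, "terms": {"emphysema", "asthma"}},
]


def build_index(records):
    """Map each term to the set of record ids that contain it."""
    index = defaultdict(set)
    for rec in records:
        for term in rec["terms"]:
            index[term].add(rec["id"])
    return index


def drill_down(records, index, term, attribute):
    """Break the records containing `term` down by one attribute."""
    ids = index.get(term, set())
    return Counter(rec[attribute] for rec in records if rec["id"] in ids)


def source_ids(index, term):
    """Return the ids of the source records a term came from."""
    return sorted(index.get(term, set()))


index = build_index(RECORDS)
print(drill_down(RECORDS, index, "emphysema", "smoker"))
print(source_ids(index, "emphysema"))
```

Keeping the record ids in the index is what allows the analysis to go all the way back to the original document.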
In a word, the SOM gives the analyst the capability of exploring and analyzing thousands of medical records all at once in a visual and natural mode of exploration.
But perhaps the most interesting aspect of a SOM is the ability to show correlations of text from thousands of medical records together.
When a SOM shows a concentration of information in one place and a concentration of information elsewhere, there is a correlation of information. Sometimes that correlation of information is weak. Sometimes the correlation of information is strong. In any case the correlation shows up visually and clearly as a result of the examination of thousands of medical documents.
As an example, suppose an analyst has done a study of the records of a particular kind of cancer – say skin cancer. The analyst can see immediately the correlating factors. The analyst can see age, exposure to sunlight, skin type. But the analyst can see other kinds of relationships as well, which may not be expected, such as ingestion of vitamin C, other medications, gender, occupation, and so forth. All of the correlating factors make their appearance if they have ever been caught in a medical record.
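One simple way to quantify this kind of relationship is to count how often two terms appear in the same record and compare that with what chance alone would predict. The sketch below uses co-occurrence counts and the "lift" ratio. This choice of measure is an assumption made for illustration; the source does not say how the SOM derives its correlations, and the sample records are invented. A lift above 1 suggests the terms appear together more often than expected, which indicates association, not causation.

```python
from collections import Counter
from itertools import combinations


def cooccurrence(records):
    """Count how many records contain each unordered pair of terms."""
    pairs = Counter()
    for terms in records:
        for a, b in combinations(sorted(set(terms)), 2):
            pairs[(a, b)] += 1
    return pairs


def lift(records, a, b):
    """Observed joint frequency divided by the frequency expected by chance."""
    n = len(records)
    has_a = sum(1 for terms in records if a in terms)
    has_b = sum(1 for terms in records if b in terms)
    both = sum(1 for terms in records if a in terms and b in terms)
    if has_a == 0 or has_b == 0:
        return 0.0
    return (both / n) / ((has_a / n) * (has_b / n))


# Hypothetical term sets extracted from four records.
RECORDS = [
    {"skin cancer", "sun exposure"},
    {"skin cancer", "sun exposure", "vitamin c"},
    {"sun exposure"},
    {"vitamin c"},
]

print(cooccurrence(RECORDS).most_common(3))
print(round(lift(RECORDS, "skin cancer", "sun exposure"), 2))
```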
Of course, once an analyst has detected such a correlation, the correlation can be isolated and examined further.
There is one other thing Forest Rim Technology does that is of value to the research analyst. The output from Forest Rim Technology does not have to be used visually as described. Once edited and conditioned, the data from the medical records is available for further analysis using conventional analytical tools such as SAS, Business Objects, Cognos, Tableau, Qlik, etc.
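As a rough illustration of handing conditioned data to a conventional tool, the sketch below writes term counts per record as CSV, a format most analytical tools can import. The column names and sample rows are hypothetical, and the source does not specify the actual output format of Textual ETL.

```python
import csv
import io


def to_csv(rows, fieldnames):
    """Serialize a list of dict rows to CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical conditioned output: one row per (record, term) pair.
ROWS = [
    {"record_id": "r1", "term": "emphysema", "count": 2},
    {"record_id": "r2", "term": "fracture", "count": 1},
]

print(to_csv(ROWS, ["record_id", "term", "count"]))
```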
Visualization, together with the access and conditioning of medical records, then becomes the key to looking at and analyzing medical records collectively.
Forest Rim Technology is located in Castle Rock, CO. Forest Rim Technology produces textual ETL, a technology that allows unstructured text to be disambiguated and placed into a standard database where it can be analyzed. Forest Rim Technology was founded by Bill Inmon. For more information see www.forestrimtech.com