by W H Inmon, Forest Rim Technology
There are many forms of unstructured text. There are contracts. There are medical records. There are doctors' notes. There are comments fields. But perhaps the most widely found form of unstructured information is email.
Some Characteristics of Emails
Each of the different forms of text has its own characteristics. Some forms of text are structurally repetitive. Some forms of text are non repetitive. Some forms of text use a casual style of conversation. Other forms of text use very formal language. Some forms of text are in a single language; other forms of text use multiple languages.
Given that emails are ubiquitous and given that emails are found all over the world, emails are found in every language found on the earth.
Some emails contain casual conversation. But other emails contain conversation that is vital to the business. For example, an email may contain information to the effect that a customer is mad. Or that a shipment is late or has been found to be broken. Or perhaps a bridge has been blocked or closed, preventing a speedy delivery. In many cases emails contain information important to the running and operation of the business.
As a rule emails tend to be non repetitive. When a person is creating an email, there is no one looking over the person’s shoulder dictating how the email should be written. A person can write a long email or a short email. A person can write an email formally or casually. A person may use foul language in an email if they wish. A person may write the email in English, Spanish, Russian or Chinese.
Because of the free form nature of emails, there is no structural pattern that emerges from emails. Emails are simply non repetitive, around the world.
Another characteristic of emails is that they tend to contain many abbreviations. Emails may be said to contain lots of “shorthand”. As long as the receiving person understands the message, the author of an email is free to use as many abbreviations as he/she chooses to use. Using abbreviations saves the author time in the construction of the email.
Therefore it is normal for an email to contain much sketchy and otherwise cryptic information.
Another characteristic of emails is that it often makes sense to group them together. Often the conversation found in one email is relevant to, and only understandable in the context of, other emails. By grouping emails together and sequencing them, the entire conversation can be captured, much as if the analyst were listening to the whole exchange.
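The grouping and sequencing of emails can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the Email record, its thread_id field, and the group_into_conversations function are hypothetical names invented for the example. Real mail would typically derive the thread from message headers rather than from a ready-made identifier.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Email:
    # Hypothetical minimal record; real emails carry many more fields.
    thread_id: str
    sent_at: datetime
    sender: str
    body: str


def group_into_conversations(emails):
    """Group emails by thread, then order each group by send time."""
    conversations = defaultdict(list)
    for email in emails:
        conversations[email.thread_id].append(email)
    for thread in conversations.values():
        thread.sort(key=lambda e: e.sent_at)
    return dict(conversations)


# Made-up sample data, deliberately out of order.
emails = [
    Email("t1", datetime(2012, 3, 2, 9, 30), "support@example.com", "We are looking into it."),
    Email("t2", datetime(2012, 3, 1, 8, 0), "ops@example.com", "Bridge closed, expect delay."),
    Email("t1", datetime(2012, 3, 1, 17, 45), "customer@example.com", "My order arrived broken."),
]
conversations = group_into_conversations(emails)
for message in conversations["t1"]:
    print(message.sent_at.isoformat(), message.sender, message.body)
```

Once the emails are grouped and sorted, reading a thread from top to bottom replays the conversation in the order in which it took place.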
Another characteristic of emails is that they can become quite voluminous over time. There are lots of reasons why corporations accumulate lots of emails over time. Sometimes emails contain spam. Other times emails are “blather”. Blather occurs in an email when there is day to day conversation about topics other than business. And on other occasions people are just plain “chatty”. For these reasons and more, over time organizations tend to accumulate lots of emails. Terabytes and terabytes of email.
There are plenty of incentives for an organization to manage the volumes of data that are found in the corporate email store. One good way to manage the volumes of email is for the organization to filter out the spam and the blather. By filtering out the spam and the blather, the corporation is left with only the emails that are relevant to the corporate business.
This filtering can be done by Forest Rim Technology’s Email filter.
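To make the idea of filtering concrete, here is a deliberately simple keyword-based sketch in Python. It is not Forest Rim Technology's filter and makes no claim about how that product works; the term lists and the classify and filter_business functions are assumptions chosen for illustration only.

```python
import re

# Hypothetical term lists; a real filter would be tuned to the organization.
SPAM_TERMS = {"lottery", "winner", "unsubscribe", "free offer"}
BUSINESS_TERMS = {"shipment", "invoice", "order", "customer", "contract", "delivery", "defect"}


def classify(body):
    """Label an email body as 'spam', 'business', or 'blather'."""
    text = body.lower()
    words = set(re.findall(r"[a-z']+", text))
    # Spam terms may span several words, so match them as substrings.
    if any(term in text for term in SPAM_TERMS):
        return "spam"
    # Business terms are matched as whole words.
    if words & BUSINESS_TERMS:
        return "business"
    return "blather"


def filter_business(bodies):
    """Keep only the emails classified as relevant to the business."""
    return [body for body in bodies if classify(body) == "business"]


inbox = [
    "You are a lottery winner! Claim your free offer now.",
    "The shipment to the customer is late.",
    "Did you see the game last night?",
]
print(filter_business(inbox))
```

A keyword list of this kind is crude, but it shows the shape of the step: every email is assigned to one of a few classes, and only the business class moves on to further processing.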
After the filtering is done, it might seem that the next step is to place the emails that remain in a standard relational data base. In fact the emails themselves are not placed in a relational data base at all. Instead the emails are passed through Textual ETL, such as the Textual ETL developed by Forest Rim Technology. By passing the email text through Textual ETL, a lot of good things happen:
- the words and terms that are needed for analytical processing are identified and separated from the email,
- the volume of data is significantly reduced by taking emails and breaking them down into the bare essentials,
- the many different facets of textual manipulation are provided by Textual ETL, and so forth.
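The first two points, separating out the useful words and reducing the volume, can be illustrated with a small Python sketch. It is not the algorithm used by Textual ETL; the stop-word list and the extract_terms function are assumptions made for the example.

```python
import re

# Hypothetical stop-word list; real lists are much longer.
STOP_WORDS = {"a", "an", "and", "are", "has", "is", "of", "on", "the", "to", "was", "we"}


def extract_terms(body):
    """Return the distinct non-stop-word terms, in order of first appearance."""
    terms = []
    for word in re.findall(r"[a-z0-9']+", body.lower()):
        if word not in STOP_WORDS and word not in terms:
            terms.append(word)
    return terms


print(extract_terms("The shipment to the customer is late and the customer is mad."))
```

In this sample a twelve-word sentence is reduced to four terms (shipment, customer, late, mad), which is the kind of reduction the bullet points describe.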
There are many features of textual manipulation provided by Textual ETL. Some of them are:
- categorization by taxonomic divisions,
- alternate spelling synthesis,
- date standardization,
- text to numeric conversion,
- custom variable formatting and recognition, and so forth.
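Two of these features, date standardization and text to numeric conversion, lend themselves to a short Python sketch. The code below is an illustration of the general idea only, not a description of how Textual ETL implements it; the list of accepted date formats and the function names standardize_date and words_to_number are assumptions.

```python
from datetime import datetime

# Assumed input formats; a real system would recognize many more.
DATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y", "%B %d, %Y", "%d %b %Y")


def standardize_date(text):
    """Return the date as YYYY-MM-DD, or None if no known format matches."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None


UNITS = {
    "zero": 0, "one": 1, "two": 2, "three": 3, "four": 4, "five": 5,
    "six": 6, "seven": 7, "eight": 8, "nine": 9, "ten": 10,
    "eleven": 11, "twelve": 12, "thirteen": 13, "fourteen": 14,
    "fifteen": 15, "sixteen": 16, "seventeen": 17, "eighteen": 18,
    "nineteen": 19,
}
TENS = {
    "twenty": 20, "thirty": 30, "forty": 40, "fifty": 50,
    "sixty": 60, "seventy": 70, "eighty": 80, "ninety": 90,
}


def words_to_number(text):
    """Convert simple English number words (below one million) to an int."""
    total, current = 0, 0
    for word in text.lower().replace("-", " ").split():
        if word == "and":
            continue
        if word in UNITS:
            current += UNITS[word]
        elif word in TENS:
            current += TENS[word]
        elif word == "hundred":
            current *= 100
        elif word == "thousand":
            total += current * 1000
            current = 0
        else:
            raise ValueError(f"not a number word: {word!r}")
    return total + current


print(standardize_date("July 20, 2011"), standardize_date("07/20/2011"))
print(words_to_number("two hundred and five"))
```

The point of both functions is the same: many surface forms in the text are mapped to one canonical value, so that later queries can compare dates and numbers directly.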
For a lengthy discussion of the features of Textual ETL, please refer to the book BUILDING THE UNSTRUCTURED DATA WAREHOUSE, Technics Publications, 2011.
By passing emails through Textual ETL, the text found in the emails is capable of being placed in a standard relational data base.
An explanation of the relationship between the relational data base and the content of the email is in order here. The relational data base that is created contains words and phrases that are useful for analytic processing. The relational data base also contains pointers back to the email. Whenever the analyst needs to return to the actual email during analytical processing, it is very easy to do so.
This simple design, in which the relational data base points back to the email, allows massive amounts of email to be stored outside the analytical relational data base. The relational data base therefore never has to grow to monstrous proportions.
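A minimal sketch of this pointer design, written in Python with the standard sqlite3 module, might look as follows. The table name email_term, its columns, the sample rows, and the dictionary standing in for the external email store are all assumptions made for illustration; the source does not specify an actual schema.

```python
import sqlite3

# Stand-in for an external email store (file system, archive, mail server).
EMAIL_STORE = {
    "msg-001": "Customer reports the shipment arrived broken.",
    "msg-002": "Bridge closed on route 9; delivery will be late.",
}

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE email_term (
        term      TEXT NOT NULL,   -- word or phrase kept for analysis
        category  TEXT,            -- e.g. a taxonomy classification
        email_id  TEXT NOT NULL    -- pointer back to the source email
    )
    """
)
conn.executemany(
    "INSERT INTO email_term (term, category, email_id) VALUES (?, ?, ?)",
    [
        ("shipment", "logistics", "msg-001"),
        ("broken", "complaint", "msg-001"),
        ("delivery", "logistics", "msg-002"),
        ("late", "complaint", "msg-002"),
    ],
)


def emails_for_term(term):
    """Follow the pointers from a term back to the full email text."""
    rows = conn.execute(
        "SELECT DISTINCT email_id FROM email_term WHERE term = ? ORDER BY email_id",
        (term,),
    ).fetchall()
    return [EMAIL_STORE[email_id] for (email_id,) in rows]


print(emails_for_term("late"))
```

Only the short rows live in the data base. The full text of each email stays in the external store and is fetched only when the analyst asks for it.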
The analysis of emails can be done by using any standard Business Intelligence tool. The analysis consists (typically) of a series of SQL statements. The analyst can then search through the emails and find whatever emails are relevant to a problem.
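As a rough illustration of such a SQL statement, the Python sketch below builds a tiny in-memory table with the standard sqlite3 module and asks for every email that mentions a given customer, in the order in which the emails were sent. The table layout, the column names, and the sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE email_term (term TEXT, category TEXT, email_id TEXT, sent_at TEXT)"
)
# Made-up rows: one row per term found in an email.
conn.executemany(
    "INSERT INTO email_term VALUES (?, ?, ?, ?)",
    [
        ("acme corp", "customer", "msg-003", "2012-03-05T10:15:00"),
        ("lawsuit", "legal", "msg-003", "2012-03-05T10:15:00"),
        ("acme corp", "customer", "msg-001", "2012-03-01T17:45:00"),
        ("defect", "complaint", "msg-001", "2012-03-01T17:45:00"),
        ("globex", "customer", "msg-002", "2012-03-02T09:00:00"),
    ],
)

# Find every email that mentions the customer, oldest first.
query = """
    SELECT DISTINCT email_id, sent_at
    FROM email_term
    WHERE category = 'customer' AND term = ?
    ORDER BY sent_at
"""
matching = conn.execute(query, ("acme corp",)).fetchall()
for email_id, sent_at in matching:
    print(sent_at, email_id)
```

Because the timestamps are stored in ISO 8601 form, ordering them as text also orders them chronologically.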
As an example of the analysis that can be done, consider these cases –
- A customer has a problem and is threatening to sue. The analyst can go back and find out if there have been any email exchanges between the customer and the company. The analyst can find out if there have been any internal conversations about the customer and whatever problem the customer is having. The analyst can sequence the emails in the order in which they appeared. In short the emails can provide a wealth of information about the customer and the problem the customer is having or has had.
- A shipment has been delayed and is going to be late. The analyst can find out from emails such things as why the shipment has been delayed, how much warning has been given to the customer, what departments have been involved, and what has been done proactively.
- A product has a defect and the store selling the product is upset. By looking at emails, the analyst can find out who knew about the defect, who was proactive in trying to address it, and so forth. Even the non existence of emails can alert a company that the inspection process is not being done properly.
In short, by using emails the analyst can find out huge amounts of information that relate to resolving thorny issues.
There is another way to look at the analysis of emails. That way is to contrast the analysis of emails with the analysis of raw operational data. When an analyst looks at raw operational data, the analyst can discern WHAT has happened. By doing analysis of emails, the analyst can find out WHY it happened. And in the final analysis, when running a business, the insight gained by looking at WHY something has happened is MORE important to the business than understanding what has happened.
Emails contain much important information. The insight that can be gained by looking at emails is important and is different from the insight that can be gained by looking at what has happened.
In order to be effective, email must be filtered before it is analyzed because of the spam and the blather that typically exist in an email stream. After email is filtered it needs to be passed through Textual ETL. After having been passed through Textual ETL, a relational data base is created that can be used to analyze the email stream.
IF YOU WANT TO PROACTIVELY MANAGE YOUR EMAIL ENVIRONMENT, START WITH FOREST RIM TECHNOLOGY
Forest Rim Technology was formed by Bill Inmon in order to provide technology to bridge the gap between structured and unstructured data. Forest Rim Technology is located in Castle Rock, Colorado.