Books on Textual ETL

Our Technology: Disambiguation of Textual Data

At the heart of the ability to use textual data in an analytical mode is the disambiguation of the textual data. Simply dumping textual data into a database does nothing for the analyst’s ability to use the textual data in an analytical manner. The textual data must be preconditioned.

Bill Inmon founded FOREST RIM® to introduce the proprietary technology of textual disambiguation™ for the purpose of converting and preparing any type of textual data for analytical processing.

Some of the techniques for integrating raw text are simple and straightforward, such as standardizing dates so that they can be read consistently, and entering the text into a DBMS. Once the dates have been standardized, they can be compared and used analytically by a Business Intelligence (BI) tool. Another simple technique is that of reading text relating to numbers and turning those numbers into numerics. Without this simple facility, BI cannot be used against textual data when it comes to analysis and comparison of numerical data. And those are but two of the many techniques that FOREST RIM® has developed in order to turn textual data into a format and structure that can be meaningfully analyzed by standard Business Intelligence tools.
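As a rough illustration of these two techniques, the following Python sketch standardizes a few common date layouts and converts spelled-out numbers into numerics. It is not FOREST RIM®'s implementation: the names `DATE_FORMATS`, `standardize_date`, and `words_to_number` are hypothetical, and the list of recognized date layouts is an assumption that a real corpus would need to extend.

```python
import re
from datetime import datetime

# Assumed date layouts for illustration only; real text needs many more.
# Month names are matched in English under Python's default locale.
DATE_FORMATS = ["%B %d, %Y", "%d %B %Y", "%m/%d/%Y", "%Y-%m-%d"]

def standardize_date(text):
    """Return the date in ISO 8601 form, or None if no known layout matches."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None

UNITS = {w: i for i, w in enumerate(
    "zero one two three four five six seven eight nine ten eleven twelve "
    "thirteen fourteen fifteen sixteen seventeen eighteen nineteen".split())}
TENS = {w: 10 * i for i, w in enumerate(
    "twenty thirty forty fifty sixty seventy eighty ninety".split(), start=2)}
SCALES = {"thousand": 1_000, "million": 1_000_000}

def words_to_number(text):
    """Convert a spelled-out English whole number into an int."""
    total, current = 0, 0
    for word in re.split(r"[\s-]+", text.lower().strip()):
        if not word or word == "and":
            continue
        if word in UNITS:
            current += UNITS[word]
        elif word in TENS:
            current += TENS[word]
        elif word == "hundred":
            # "hundred" alone is read as one hundred
            current = (current or 1) * 100
        elif word in SCALES:
            total += (current or 1) * SCALES[word]
            current = 0
        else:
            raise ValueError(f"unrecognized number word: {word!r}")
    return total + current

print(standardize_date("July 4, 2012"))                 # 2012-07-04
print(standardize_date("07/04/2012"))                   # 2012-07-04
print(words_to_number("two hundred and fifty-three"))   # 253
```

Once both dates resolve to the same ISO string and the number is an integer, an ordinary query or BI tool can compare them directly.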

Other processes that must be applied to textual data as part of its integration and preparation for analytical use include:

  • stemming of text
  • removing stop words
  • providing for alternate spelling
  • creation of synonym variables
  • creation of proximity variables
  • creation of patterned variables
  • creation of named variables
  • creation of list variables
  • standardization of dates
  • text to numeric conversion of unstructured data
  • standardization of numeric data into a queryable format
  • recognition of sub documents within the document
  • visualization of textual data for analysis
  • screening of textual data for relevancy

This is a short list of the many features FOREST RIM® has developed within its flagship product, TextualETL™. TextualETL™ deploys the embedded logic and technology known as “textual disambiguation™”.
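To make a few of the listed steps concrete, here is a minimal Python sketch that removes stop words, resolves alternate spellings, stems words, and attaches a synonym class. It is only an illustration under stated assumptions, not the logic inside TextualETL™: the word tables are tiny hypothetical samples, and `naive_stem` is a deliberately crude suffix stripper rather than a full stemming algorithm such as Porter's.

```python
import re

# Tiny hypothetical lookup tables, for illustration only.
STOP_WORDS = {"a", "an", "the", "of", "and", "to", "in", "is", "was", "were"}
ALTERNATE_SPELLINGS = {"colour": "color", "organisation": "organization"}
SYNONYMS = {"car": "vehicle", "truck": "vehicle", "automobile": "vehicle"}

def naive_stem(word):
    """Strip one common suffix, keeping at least three letters of the word."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

def precondition(text):
    """Return one record per remaining word: spelling, stem, synonym class."""
    records = []
    for token in re.findall(r"[a-z]+", text.lower()):
        if token in STOP_WORDS:
            continue  # stop words carry little analytical value
        token = ALTERNATE_SPELLINGS.get(token, token)
        stem = naive_stem(token)
        records.append({
            "word": token,
            "stem": stem,
            "synonym": SYNONYMS.get(stem),  # None when no class is known
        })
    return records

for record in precondition("The trucks were painted in a new colour"):
    print(record)
```

Each output record could then be loaded as a row in a relational table, so that a query for the synonym class "vehicle" also finds text that only mentions trucks.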

Unstructured and Semi-structured Data

Textual data comes in different forms. These forms of text greatly influence the way the data is organized and integrated. Unstructured data has no implied structure. Tolstoy’s novel WAR AND PEACE is an example of a document where data is essentially unstructured. Apart from its division into chapters, the work has no internal structuring of text.

Email is a common form of unstructured data. There are no rules whatsoever for writing an email. The author can use any words, any sentence structure, and any language that he or she wishes. Emails are a good example of free-form text at its best.

In semi-structured data there is an implied order of text within the document. Consider a recipe book. Within the recipe book there are many sub-documents, each marked by the beginning of a new recipe and the ending of the previous one. In order to make sense of the text found in a semi-structured document, it is necessary to recognize where those sub-document divisions exist.
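The recipe book case can be sketched in a few lines of Python. This is only an illustration: it assumes, hypothetically, that every recipe begins with a line of the form `Recipe: <title>`, and the names `RECIPE_START` and `split_sub_documents` are invented for the example. Real documents rarely have such a clean marker, so the boundary rule would need to be adapted to each source.

```python
import re

# Assumed boundary marker: a line such as "Recipe: Pancakes".
RECIPE_START = re.compile(r"^Recipe:\s*(.+)$")

def split_sub_documents(lines):
    """Group lines into sub-documents, starting a new one at each marker."""
    sub_documents = []
    current = None
    for raw in lines:
        line = raw.strip()
        match = RECIPE_START.match(line)
        if match:
            # A new recipe begins, which also ends the previous one.
            current = {"title": match.group(1).strip(), "body": []}
            sub_documents.append(current)
        elif current is not None and line:
            current["body"].append(line)
        # Text before the first marker and blank lines are ignored.
    return sub_documents

book = """Recipe: Pancakes
flour
eggs

Recipe: Omelette
eggs
butter""".splitlines()

for doc in split_sub_documents(book):
    print(doc["title"], doc["body"])
```

Once the boundaries are known, each sub-document can be analyzed on its own, so that "eggs" in one recipe is not confused with "eggs" in another.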

An extreme form of a semi-structured document is a list. Not only are there subdivisions of data within a list, but there are typically no recognizable labels that identify each member of the list. Instead, data in a list is recognized by its position relative to the other variables found in the list.
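A small Python sketch shows what recognizing list data by relative order can look like. The field order (`name`, `city`, `amount`) and the comma delimiter are assumptions made up for this example; in practice the analyst has to supply the order for each list, since the text itself carries no labels.

```python
# Hypothetical field order supplied by the analyst, not found in the text.
FIELD_ORDER = ("name", "city", "amount")

def parse_list_entry(line, delimiter=","):
    """Label each value in a list entry purely by its position."""
    values = [value.strip() for value in line.split(delimiter)]
    if len(values) != len(FIELD_ORDER):
        raise ValueError(
            f"expected {len(FIELD_ORDER)} fields, found {len(values)}: {line!r}"
        )
    return dict(zip(FIELD_ORDER, values))

print(parse_list_entry("Jane Doe, Denver, 125.50"))
# {'name': 'Jane Doe', 'city': 'Denver', 'amount': '125.50'}
```

Rejecting entries with the wrong number of fields is a deliberate choice here: when meaning depends only on position, a missing value would silently shift every later field under the wrong label.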