Textual Disambiguation

Textual disambiguation is the process of determining the meaning of raw text from the context in which the text is written or otherwise exists.

There are two forms of textual disambiguation. In the first, raw text is examined a single word or phrase at a time. This form is common in Google and other dictionary-based products.

In the second form, whole bodies of text are disambiguated at once, rather than one word or phrase at a time.

Textual ETL is the process by which textual disambiguation of entire documents (and bodies of documents) is accomplished.

An early form of textual disambiguation was natural language processing (NLP). While NLP has some practical uses, its problem is that much of the context needed for disambiguation is not textual at all. The environmental surroundings, the weather, the people conversing, the time of day, the temperature, the date, a business occasion and many other factors greatly influence the context and interpretation of the raw text, yet none of them appears in the text itself. That greatly complicates NLP processing.

When textual ETL is used for textual disambiguation, a large number of algorithms and a large amount of input are brought to bear. Typical inputs to textual ETL are taxonomies and ontologies, context vocabularies, and context acronym dictionaries. Typical algorithms are stemming, homographic resolution, association block processing, alternate spelling, and word-delimited indexing, among others.
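To make the role of context vocabularies and context acronym dictionaries more concrete, here is a minimal sketch in Python. It is not how any textual ETL product is implemented. The vocabularies, the acronym entries and the function names (`detect_context`, `resolve_acronyms`) are all assumptions made up for illustration. The idea it shows is only this: the whole body of text is used to guess a context first, and that context then decides what an ambiguous acronym means.

```python
import re

# Hypothetical context vocabularies: words that signal each domain.
CONTEXT_VOCABULARIES = {
    "cardiology": {"heart", "cardiac", "chest", "patient", "ecg"},
    "networking": {"router", "packet", "server", "latency", "bandwidth"},
}

# Hypothetical context acronym dictionary: one acronym, one meaning per domain.
ACRONYMS = {
    "ha": {"cardiology": "heart attack", "networking": "high availability"},
}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def detect_context(tokens):
    """Return the domain whose vocabulary matches the most tokens, or None."""
    scores = {
        domain: sum(1 for token in tokens if token in vocabulary)
        for domain, vocabulary in CONTEXT_VOCABULARIES.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


def resolve_acronyms(text):
    """Return (acronym, meaning, context) for each acronym the context resolves."""
    tokens = tokenize(text)
    context = detect_context(tokens)
    resolved = []
    for token in tokens:
        meanings = ACRONYMS.get(token)
        if meanings and context in meanings:
            resolved.append((token, meanings[context], context))
    return resolved
```

With these sample dictionaries, "HA" in a sentence about a patient with chest pain resolves to "heart attack", while the same two letters in a sentence about a router and latency resolve to "high availability". A bare "HA" with no surrounding vocabulary is left unresolved.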

One of the features of textual ETL is the ability to operate in multiple languages and on shorthand, slang and instant messaging as well as proper text. Textual ETL is also used for disambiguating log tapes and other forms of logged messages. In addition, textual ETL can handle improperly formed text, such as the text that comes out of OCR processing.

The output of textual ETL and textual disambiguation is a standard database (often a relational database) that can be accessed and analyzed by standard analytical software. In that sense, textual disambiguation opens the door to analytical processing of text.
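As a rough illustration of what "a standard database that can be analyzed by standard software" might look like, the sketch below loads a few disambiguated terms into an in-memory SQLite database and runs an ordinary SQL aggregate over them. The table name `disambiguated_term`, its columns and the helper functions are assumptions chosen for the example, not a schema that textual ETL is known to produce.

```python
import sqlite3


def load_terms(rows):
    """Create an in-memory relational table and load disambiguated terms into it.

    Each row is (doc_id, byte_offset, raw_text, context, meaning).
    The schema is hypothetical and exists only for this example.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """
        CREATE TABLE disambiguated_term (
            doc_id      TEXT    NOT NULL,
            byte_offset INTEGER NOT NULL,
            raw_text    TEXT    NOT NULL,
            context     TEXT    NOT NULL,
            meaning     TEXT    NOT NULL
        )
        """
    )
    conn.executemany(
        "INSERT INTO disambiguated_term VALUES (?, ?, ?, ?, ?)", rows
    )
    conn.commit()
    return conn


def count_by_meaning(conn):
    """Count occurrences of each resolved meaning, most frequent first."""
    cursor = conn.execute(
        """
        SELECT meaning, COUNT(*) AS occurrences
        FROM disambiguated_term
        GROUP BY meaning
        ORDER BY occurrences DESC, meaning
        """
    )
    return cursor.fetchall()
```

Once the text has been reduced to rows like these, a question such as "how often does each meaning occur across all documents" becomes a plain `GROUP BY` query that any SQL-based analytical tool can run.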

Typical forms of input to textual ETL include files with standard Microsoft extensions (doc, docx, txt, xls, etc.), HTML, databases, email, tweets, log tapes, Big Data, etc.

Through textual ETL and textual disambiguation, the organization can start to store and analyze major blocks of raw text that could not previously be analyzed in an automated manner.

Bill Inmon

About the Author:

Best known as the “Father of Data Warehousing”, Bill Inmon has become the most prolific and well-known author worldwide in the data warehousing and business intelligence arena. In addition to authoring more than 50 books and 650 articles, Bill has been a monthly columnist with the Business Intelligence Network, EIM Institute and Data Management Review. In 2007, Bill was named by Computerworld as one of the "Ten IT People Who Mattered in the Last 40 Years" of the computer profession.