Text – The New Data

Traditionally, statistical methods have focused on the use of numerical data, perhaps partitioned by classification data.  A classic example of this would be one-way analysis of variance, or multiple linear regression containing classification variables that have been internally coded as integer values.

This tradition is so strong that it is quite natural to filter out unstructured textual data during the data preparation phase, to the point where our data tables contain only the data structures consistent with our traditional approaches to analysis.

But at the core of our source data there is often text-based information that provides an unstructured commentary on our data.  The text delivers context to the human readers who consume the associated structured numerical data.

At the level of data architecture, free-form text is a mechanism for capturing information that is not articulated in the form of a data dictionary.  If we think of IT systems that act as a “system of record”, then this text-based data is often used to make real-time decisions and assessments, but is often thrown away when it comes to historical data analysis and review.

An example would be that of an insurance company.  When a customer informs the insurer of an accident or loss, an “insurance claim” is created in the company’s claims management system.  This system will track a variety of structured data associated with the claim.  For purposes of example, let’s assume the claim relates to a motor vehicle accident:

  • Date and time of the accident
  • Location of the accident
  • Was there vehicle damage?
  • Was there personal injury?
  • Severity of damage / injury?

In addition, there may be other questions to help classify the nature of the accident, but it’s impossible to capture the entire essence of the incident.  There is always an open-ended commentary field to capture the voice of the claimant.  This commentary helps the claims handler deal with the claim in a fair and sympathetic manner.  Rarely is it used to augment historical analysis of claims data.

But now data is everywhere.  It doesn’t just exist in our internal IT systems.  Even if we don’t actively participate in social media, it’s hard to deny its relevance as a source for determining social trends.  And with a little effort the internet can act as a source of other types of information relevant to our business, albeit not always expressed in a way that is immediately suited to data analysis.

For past projects I have written custom code to perform some crude forms of frequency analysis of words that appear in free-form text fields.  With the introduction of version 13 of JMP I have access to a much richer set of tools.  I’ll be writing additional blog entries to explore these new capabilities of JMP Software – both standard edition and Pro.  These entries will have the classifier ‘text analysis’.
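
To give a feel for what a crude word-frequency analysis of a free-form text field looks like, here is a minimal sketch in Python.  It is not the custom code from those past projects, and it is not how JMP does it.  The function name, the short stop-word list and the sample claim comments are all invented for illustration.

```python
import re
from collections import Counter

# Hypothetical stop-word list; a real analysis would use a much fuller one.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "was",
              "i", "my", "at", "on", "it", "when"}


def word_frequencies(comments, stop_words=STOP_WORDS):
    """Count how often each word appears across a list of free-form comments."""
    counts = Counter()
    for text in comments:
        # Lowercase the text and keep runs of letters (and apostrophes) as words.
        words = re.findall(r"[a-z']+", text.lower())
        counts.update(w for w in words if w not in stop_words)
    return counts


# Invented sample commentary, for illustration only.
sample_comments = [
    "I was stopped at a red light when the other car hit my rear bumper.",
    "The other driver ran a red light and hit my car.",
]

freqs = word_frequencies(sample_comments)
for word, n in freqs.most_common(5):
    print(word, n)
```

Even a count this simple starts to surface recurring themes in the commentary, although it knows nothing about phrases, word stems or context.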
