In my previous post I introduced the sample data table Pet Survey. I created a column formula to classify each respondent according to whether they owned a cat, a dog, or both. Even in this simple example, there were signs of the problems that arise when processing unstructured text data. My classification of “dog” missed responses referring to huskies; my classification of “cat” incorrectly included references to cattle. I looked at the Text Explorer platform and focused on the output contained in the lists of terms and phrases. In this post I want to focus on workflow: using the functionality within the Text Explorer platform to gain meaningful insights into my data and to answer specific questions.
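To make the cat/cattle and husky problems concrete, here is a minimal Python sketch. It is not the JMP column formula from the previous post; the responses, the term lists, and the function names (`naive_owns`, `owns`, `classify`) are all made up for illustration. It contrasts a plain substring test with a whole-word match against a small list of terms.

```python
import re

# Hypothetical survey responses, invented for illustration only
responses = [
    "I have a cat called Tom",
    "We keep cattle on the farm",
    "Two huskies",
    "A dog and two cats",
]

def naive_owns(text, keyword):
    """Substring test: true if the keyword appears anywhere in the text."""
    return keyword in text.lower()

# Assumed term lists; a real analysis would build these from the data
CAT_TERMS = ["cat", "cats", "kitten", "kittens"]
DOG_TERMS = ["dog", "dogs", "husky", "huskies"]

def owns(text, terms):
    """Whole-word test: true if any listed term appears as a complete word."""
    pattern = r"\b(?:" + "|".join(map(re.escape, terms)) + r")\b"
    return re.search(pattern, text, flags=re.IGNORECASE) is not None

def classify(text):
    """Label a response as 'cat', 'dog', 'both' or 'neither'."""
    cat = owns(text, CAT_TERMS)
    dog = owns(text, DOG_TERMS)
    if cat and dog:
        return "both"
    if cat:
        return "cat"
    if dog:
        return "dog"
    return "neither"

labels = [classify(r) for r in responses]
```

With the substring test, "We keep cattle on the farm" counts as a cat owner and "Two huskies" is missed as a dog owner. The whole-word version avoids both mistakes, but only because "huskies" was added to the term list by hand, which is exactly the kind of curation that becomes impractical as the data grows.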
In this post I will walk through some of the common tasks that are undertaken when we process unstructured text-based data. This will also give me the opportunity to introduce the terminology associated with text processing.
Traditionally, statistical methods have focused on numerical data, perhaps partitioned by classification data. Classic examples are one-way analysis of variance, or multiple linear regression containing classification variables that have been internally coded as integer values.
Since writing this post I have placed the associated code on the JMP File Exchange …
The problem with the internet is that it gives you too much information, or rather, it takes too long to gather the information. I often cross-reference hotel booking sites with TripAdvisor, and it's a laborious process. So this evening I decided to streamline my process by writing a script to gather the user reviews into a JMP table and a simple report.