Process capability indices are a convenient way of summarising process performance. They contain information about how far a process has shifted from its target as well as the expected number of defective parts per million. In an earlier post I showed the relationship between the capability indices and the process shift. In this post I will use the indices to calculate the expected number of defective parts per million (dppm). The calculation involves probabilities calculated by reference to a Normal distribution, and I will use the JSL script editor to perform these calculations.

Let’s be clear. JMP reports dppm figures when it calculates the capability indices, but nonetheless I think it’s important to understand how the information is generated rather than just blindly follow software output. As a case-study, it is also a good illustration of using the script editor to utilise the probability distribution functions available in JMP.

First I want to start with a simplified scenario where the process is on target, so that I can work solely with the Cp index. If the process has upper and lower specs of U and L respectively, and the process standard deviation is σ, then

*Cp = (U − L) / 6σ*

This is the ratio of the specification window to the process width. Using 6σ as a measure of process width is just a convention: when I was first introduced to quality methods it was quite common to define the process width as 5.15σ.

Let me start with the case of Cp = 1 i.e. the spec window and the process width are identical. Now the calculation of dppm is the same as calculating the probability of an observation being outside the 6σ width.

With any probability distribution (such as the Normal distribution) there are two standard ways of enumerating it: the probability density function (pdf) and the cumulative distribution function (cdf). In JMP the function that generates the pdf for a Normal distribution is **Normal Density**, whereas the cdf is called **Normal Distribution**.

The cdf for the Normal distribution takes 3 arguments:

*Normal Distribution(z, mu, sigma )*

It’s good to start with a trivial test case where the answer is obvious! If I take a standard normal i.e. with mean of zero and standard deviation of 1 then I know by symmetry that 50% of the data will be less than or equal to zero:
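In the script editor that calculation looks something like the following (a minimal sketch of what the original screenshot likely showed):

```jsl
// probability that a standard Normal observation is <= 0
Normal Distribution( 0 );  // returns 0.5
```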

The result is a probability. Of course I could have multiplied by 100 if I wanted to explicitly express the result as a percentage, or by 10^6 if I wanted parts per million.

If I wanted the probability of being less than 3 standard deviations above the mean I could write:
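A one-line sketch (reconstructing the screenshot that belongs here):

```jsl
// probability of being less than 3 standard deviations above the mean
Normal Distribution( 3 );  // ~0.99865
```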

To calculate the probability of being within the range +/- 3σ I can write:
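Something like this (a sketch in place of the original screenshot):

```jsl
// probability of being within +/- 3 sigma of the mean
Normal Distribution( 3 ) - Normal Distribution( -3 );  // ~0.9973
```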

With probability calculations it is often easier to calculate the probability of the logical opposite of our goal and then subtract it from one to produce the final result; this is the case with dppm calculations. In the calculation below p is the probability of being within the ±3σ range, so (1−p) gives me the probability of being outside the range. The 10^6 scale factor gives me the result in terms of parts per million:
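The calculation, sketched in JSL:

```jsl
// dppm for a process with Cp = 1 (spec limits at +/- 3 sigma)
p = Normal Distribution( 3 ) - Normal Distribution( -3 );
(1 - p) * 10^6;  // ~2700 defective parts per million
```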

This is the well-known result that 0.27% of data is outside the 6σ width of a Normal distribution.

Recall I said that an alternative definition of the process width is 5.15σ. The motivation for this definition is that the proportion outside this process width is very close to 1% (dppm looks worse but calculations are easier!).

Having illustrated the probability calculation for the case that the spec window is the same size as the process width, let me now take the case of Cp = 2, i.e. the spec window has a width twice the process width. The spec window is now 12σ wide, so all I need to do is change the calculation to use the range ±6σ:
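The amended calculation looks something like this:

```jsl
// Cp = 2: spec limits sit at +/- 6 sigma from the (on-target) mean
p = Normal Distribution( 6 ) - Normal Distribution( -6 );
(1 - p) * 10^9;  // ~2 defective parts per billion
```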

For this we need to be thinking in terms of parts per billion!

So far my calculations have assumed that I have a process on target, which allows Cp to be a sufficient descriptor of process performance. Now I want to consider both Cp and Cpk. Implicit in these two statistics is a process shift:

*shift = 3σ(Cp-Cpk)*

(see my previous post for the derivation of this result).

For purposes of illustration I will use the classic criteria associated with six sigma methodology: Cp=2 and Cpk=1.5. This corresponds to a process shift of 1.5σ.

Without loss of generality I can assume that the shift is positive in relation to the process target (which I’ll assume to be midpoint between the spec limits) as illustrated below:

When on target the process mean is 6σ from each spec limit. With the process shift the mean is 4.5σ from the upper limit and 7.5σ from the lower limit. The number of defective parts per million will correspond to the proportion which is outside the range -7.5σ to +4.5σ:
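In JSL the calculation is a small variation on the earlier ones (a sketch in place of the original screenshot):

```jsl
// six sigma process: Cp = 2, Cpk = 1.5 (a 1.5 sigma shift)
p = Normal Distribution( 4.5 ) - Normal Distribution( -7.5 );
(1 - p) * 10^6;  // ~3.4 defective parts per million
```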

This is the benchmark result of 3.4 defective parts per million for a “six sigma” process.

---

If you don’t write JSL scripts you may never have had a need to use the script editor, so let’s first take a look at this. On the toolbar the second icon will create a new script window:

If you prefer you can use the following menu path: *File>New>Script.*

The script window is just a blank window into which we type commands. We want to use these commands to perform some calculations, so we also need an output area: right-mouse-click and select the option **Show Embedded Log**. The window splits into two sections:

The upper section is the input region where we can type JSL statements with results being displayed in the lower section.

JMP has a vast library of in-built functions for performing mathematical and statistical calculations. Here is a simple example of using the *Pi* function to obtain the value of π:

To generate this output I type “Pi()” and then click the run-script icon:

If you think about how we write functions in mathematics we might write something like:

y = f(x)

A similar notation is used in JSL. In particular, the parentheses tell the script editor that what is being written is a reference to an in-built function (if a valid function is identified by JMP then the script editor changes the text colour to blue). The parentheses also act to contain any arguments. In the above example *f* is a function of *x*; for example, *f* could be the natural logarithm and *x* could be the value 2.7183:
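In the script editor this would look something like:

```jsl
Log( 2.7183 );  // natural logarithm; returns ~1
```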

If I wanted a base-10 logarithm then I would use the function *Log10*.

I’m not limited to single functions. I can build fully featured mathematical expressions using the following operations:

In this calculation of the area of a circle I wanted to assign my radius value to a variable before defining the formula for the area. To do this I had to write two lines of code, in which case I have to delimit the lines with a semicolon.
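A sketch of the two-line calculation (the radius value 5 is just an illustrative choice):

```jsl
// area of a circle of radius 5
r = 5;
Pi() * r^2;  // ~78.54
```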

To be proficient in using the script editor as a calculator you need familiarity with the functions available inside JMP. Fortunately these are documented online:

These functions are also documented in the JMP help system under the Scripting Index option.

In my next post I will illustrate using the script editor to perform probability calculations to estimate number of defective parts per million for a process where I know the capability indices.

---

The process capability statistic Cp compares process variation against the width of a process operating window:

*Cp = (U − L) / 6s*

where U and L are the upper and lower specification limits respectively, and s represents the standard deviation of the process variation.

In order to take account of process location the ratio is extended to include the process mean μ:

*Cpk = Min( (U − μ) / 3s , (μ − L) / 3s )*

So now we can confidently talk about process capability in terms of the indices Cp and Cpk. But it seems to me that this is convenient shorthand at the expense of transparency.

For example, if I am given values for Cp and Cpk the underlying process shift is not necessarily obvious.

Whilst the relationship between process shift and capability indices is not immediately apparent, there is nonetheless a simple relationship:

*Δ = 3s (Cp − Cpk)*

where the shift Δ is measured as the distance of the process mean from the target.

The rest of this post looks at the derivation of this result.

Let’s assume without loss of generality that the process shift is positive (with respect to the target T). Then the minimum in the Cpk definition is governed by the upper specification limit:

*Cpk = (U − μ) / 3s*

The process shift is *Δ = μ − T*, so we need appropriate expressions for μ and U.

From the above expression:

*3s·Cpk = U − μ*

which implies

*μ = U − 3s·Cpk*

If we assume that the specs are symmetric about the target then

*T = (U + L) / 2*

which implies

*U − T = (U − L) / 2*

but also

*Cp = (U − L) / 6s*

therefore

*U − L = 6s·Cp*

this implies

*U − T = 3s·Cp*

therefore

*U = T + 3s·Cp*

Using the above expressions for μ and U, an expression for the process shift can be constructed and simplified:

*Δ = μ − T = (U − 3s·Cpk) − T = (U − T) − 3s·Cpk = 3s·Cp − 3s·Cpk = 3s (Cp − Cpk)*
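As a quick numerical sanity check, here is a short JSL sketch plugging in the classic six sigma benchmark values (s = 1 is an arbitrary illustrative choice):

```jsl
// verify shift = 3*s*(Cp - Cpk) for Cp = 2, Cpk = 1.5
s = 1;  // process standard deviation
Cp = 2;
Cpk = 1.5;
3 * s * (Cp - Cpk);  // returns 1.5, i.e. the familiar 1.5 sigma shift
```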

In this post I have derived a simple relationship between process shift and the capability indices Cp and Cpk. Given the simplicity of the relationship, my derivation feels somewhat laboured – perhaps you know of a more direct method?

---

Many statistical methods are expressed in the form of a hypothesis test: it’s one of the fundamental constructions within the field of inferential statistics. One of the outcomes of this construction is a probability outcome, or p-value, the notorious number which is subjected to extensive use and abuse! See for example:

When we conduct an experiment it can feel like we have collected an extensive set of data – but typically that data only represents a single sample. It’s hard to think in terms of probabilities once we have this data in our hands. Think about tossing a coin. Before a coin toss the probability of *heads* is of course 0.5, assuming a fair coin. But once the coin is tossed, I either have a heads or tails. There is no longer any ambiguity and the idea of probabilities feels irrelevant: the act of completing the experiment can make the use of statistics feel somewhat academic.

The way that probability is taught is to avoid the notion of single outcomes. We don’t just toss the coin once, we toss it 100 times – now the probability remains relevant after the event – the proportions of heads and tails are described by the probability.

The toss of a coin is of course a trivial example. But in all honesty I think that as soon as we move to more complex scenarios the probabilities get hard to understand at an intuitive level. We can do the math and understand the theory but there can be a disconnect between what is understood at an intellectual level versus an emotional *gut* feeling. At the end of the day if we conduct a complex experiment and collect the results, then we believe in those results: the p-value might be intended to help us keep in mind the probabilistic nature of the outcome, but as I’ve said before, it’s hard to think this way when we have single-outcome results in our hands.

Part of the problem with the interpretation of p-values is that it can be hard to take the null hypothesis seriously, and that means failing to understand the nature of type I errors. A technique that I have found useful when running training courses is to make the probabilities more visible by using simulation techniques to give a better sense of the probabilistic nature of experiment outcomes.

First I have to decide what the null hypothesis is. Tossing a coin is too trivial. I like the idea of using a simple linear regression because it is very visual. It is easy to understand at a scientific level but sufficiently complex that it contains appropriate components of statistical inference (analysis of variance, parameter estimation, etc).

The graph below shows the case of a null hypothesis (red line) and an alternative hypothesis (blue line):

If we take a sample of data under the conditions of the null hypothesis then we cannot expect the data to fall precisely on the horizontal red line. Therefore there is ambiguity as to whether the data is indicative of the null hypothesis or the alternate hypothesis. It is this ambiguity that we try and describe using p-values.

In the spirit of JMP, interactive visualisation is a much more powerful way to illustrate the ambiguity. To do this we can sample data from a population described by the null hypothesis and then use this data to build a regression model; this process of sampling and modelling can be repeated over and over, in just the same way that we can toss a coin multiple times.

The single-sample simulation can be articulated using a JMP table with a column formula:
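A minimal sketch of such a table (the table name, column names and noise parameters are illustrative assumptions, not necessarily those used in the original screenshot) — under the null hypothesis the response has no dependence on X, only random noise:

```jsl
// 20-row table: X is fixed, Y is pure noise (the null hypothesis)
dt = New Table( "Null Simulation",
	Add Rows( 20 ),
	New Column( "X", Numeric, Continuous, Formula( Row() ) ),
	New Column( "Y", Numeric, Continuous, Formula( 10 + Random Normal( 0, 1 ) ) )
);
```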

Based on the data in the table, the Bivariate platform can be used to visualise the relationship in the data. Using a script the random components of the formula can be updated and the graph can be refreshed. Each refresh corresponds to a flip of the coin.

This is what the null hypothesis really looks like!

The above visualisation really helps to express the problem that we are trying to solve when we use inferential techniques embedded within our model building process. And we can go further than just visualising these outcomes – we can analyse the data and get insights that provide intuition into traditional statistical theory.

One statistic that scientists feel very comfortable with is the R-square statistic. We would expect an R-square value of zero if the null hypothesis were true – but let’s take a look at how it really behaves based on our simulated outcomes. Here is a grid of just 9 runs:

Most people would probably think that an R-square of 0.76 would be associated with a “significant” model. If you teach statistics, how many times have you been asked “what is a good value for R-square?” – well now we can start introducing probabilistic thinking even to this question!

For the entire collection of simulation runs we can look at the overall distribution of R-square values:

It is interesting to ask where the 95% threshold is. That is, we would like to be able to make a statement along the lines of “95% of the time the R-square has a value less than xyz”. This can be achieved using the JMP data filter, slowly moving the slider from the right-hand side until 95% of the rows have been selected:

For this scenario we can say that 95% of the time the null hypothesis generates an R-square less than 0.635. Or: there is a 5% chance that the R-square statistic will exceed 0.635 even if the null hypothesis is true.

If you have followed the logic that I described for the R-square statistic then you will realise that exactly the same procedure can be applied for statistics more appropriate to statistical inference.

Specifically, the simulation script can grab the F-ratio values displayed in the ANOVA table, and these values can be plotted as a histogram:

What have we just done? We’ve discovered the F-distribution empirically!

And using the data filter we can determine the 95% threshold:

For this sample the threshold value is 7.65. What have we done? We have just determined empirically the p-value for an alpha level of 0.05 (theoretically, for this number of degrees of freedom, an F-ratio of about 7.7 generates a p-value of 0.05).

A similar procedure can be applied to a t-ratio. However, when comparing the absolute value of the t-ratio the data filtering needs to be performed in two steps. First select a high value that eliminates 2.5% of the data:

Then adjust the lower value so that the total number of matching rows is 95% of the entire data:

Based on this example we would conclude that an absolute value for the t-ratio of 2.88 yields a p-value of 0.05. This is in excellent agreement with the theoretical value for this number of data points.

Not only is JMP an excellent tool for statistical analysis and visual discovery, but it can also be used to provide a level of intuitive understanding of statistics beyond that which is achieved through traditional teaching. The techniques that I have described in this post rely on some relatively simple JSL scripts to (1) repetitively generate regression models based on randomly sampled data and (2) extract summary statistics from the Bivariate report window. I’ve not listed the actual code for these scripts because this is a post about statistical learning, not about programming – however, feel free to contact me if you would like details of the scripts.

---

From the JMP menu system select *View > Add-Ins…*

From here you are presented with a list of add-ins that have been registered for your installation of JMP.

Select the add-in that you are interested in. Below the list you will see a link for the home folder. Clicking on this link will open the folder for the selected add-in.

The contents of the home folder will be specific to the add-in. The folder may contain just a single script file, or a more elaborate file structure.

Typically an add-in produced by Pega Analytics contains the following file structure:

---

The idea of object-orientation is not new to JSL, but user-created objects require a complex code structure that wraps data and functions into namespaces (for example, see the navigation wizard).

In version 14, there is explicit support for classes, which dramatically simplifies the process of creating reusable objects. I thought I would introduce them by means of a real-world example: a notification window that shows progress when stepping through a sequence of time-consuming steps.

I have a class *ActivityStatusClass* that displays an activity status window as illustrated above. Before any class is used it first has to be defined. But the benefit comes from its use, so let me focus on that first. Here is an example of creating an instance of the ActivityStatusClass to show progress as I fit 4 nonlinear models (for clarity of code the nonlinear modelling itself is not relevant and has been replaced by wait statements!):

```jsl
// reference to the ActivityStatusClass
include("ActivityStatusClass.jsl");
// list of notification messages for each activity
lst = {
	"Linear Kinetic Model",
	"Accelerating Kinetic Model",
	"Decelerating Kinetic Model",
	"Power Kinetic Model"
};
// create an activity status object
status = newObject(ActivityStatusClass(lst));
// do activity 1 ...
status:start();
wait(2);
// do task 2
status:startNextTask();
wait(2);
// do task 3
status:startNextTask();
wait(2);
// do task 4
status:startNextTask();
wait(2);
// finish the activity status (closes the window) and delete the object
status:finish();
status << delete;
```

**Line 2**: the class has been defined in a separate jsl file (of the same name). The definition is referenced by *including* the file.

**Line 11**: a new object is instantiated by calling the function **newObject**. The name of the object is the name of the class (no quotes). This particular object requires a single parameter, which is the list of notifications created on line 4.

**Line 13**: the method *start* launches the notification window and displays the first notification message. Notice that the method is invoked on the object using the namespace colon notation: *status:start()*.

**Lines 16, 19, 22**: each time the method *startNextTask* is invoked, the previous activity is marked as complete (an hourglass icon is replaced with a green tick mark) and the notification message for the next activity is displayed.

**Line 25**: the *finish* method marks the final activity as complete and after a momentary pause the status window is closed.

**Line 26**: housekeeping: the object still resides in memory but can be removed by sending the *delete* message to it. Note that this is an object message and uses ‘<<’, not the namespace colon notation.

So that’s how the object is used. In general there are a few common steps:

- Reference the class definition (typically via an *include* statement)
- Create an object using the *newObject* function
- Invoke object-specific *methods*
- *Delete* the object when it is no longer required

Here is the definition of the class (as defined in the file “ActivityStatusClass.jsl”):

```jsl
namesDefaultToHere(1);
/*------------------------------------------------------------------------------
Class: ActivityStatusClass
------------------------------------------------------------------------------*/
defineClass("ActivityStatusClass",
{
	// properties
	lstActivities = empty(),        // list of task descriptions (strings)
	numActivities = 0,
	currentActivityIndex = empty(),
	winTitle = "Activity Status",
	width = 300,
	delay = 0,
	// status icons (blob strings elided)
	iconBusy = newImage(charToBlob( "..." )),
	iconDone = newImage(charToBlob( "..." )),
	// display box lists
	nwStatus = empty(),
	lstActivityStatusVLB = {},
	lstActivityStatusTB = {},
	/*----------------------------------------------------------------------
	Constructor: constructor
	Input Parameters:
		lstActivities - string list of activity descriptions
		title - (optional) title for the status window
	----------------------------------------------------------------------*/
	_init_ = method({lstActivities},
		numActivities = nItems(lstActivities);
	),
	/*----------------------------------------------------------------------
	Property: Title (get/set)
	The title of the status window
	----------------------------------------------------------------------*/
	getTitle = method({}, return(winTitle) ),
	setTitle = method({title}, winTitle = title ),
	/*----------------------------------------------------------------------
	Property: Width (get/set)
	The width of the text boxes used to display the activity description
	----------------------------------------------------------------------*/
	getWidth = method({}, return(width) ),
	setWidth = method({value}, width = value ),
	/*----------------------------------------------------------------------
	Property: Delay (get/set)
	Add a delay to the display before the next step is performed
	----------------------------------------------------------------------*/
	getDelay = method({}, return(delay) ),
	setDelay = method({value}, delay = value ),
	/*----------------------------------------------------------------------
	Method: startNextTask
	Update the status display to show the start of the next activity
	----------------------------------------------------------------------*/
	startNextTask = method({},
		vlb = lstActivityStatusVLB[currentActivityIndex];
		(vlb << child) << delete;
		vlb << append(PictureBox(iconDone));
		tb = lstActivityStatusTB[currentActivityIndex];
		tb << setFontStyle("Normal");
		tb << reshow;
		wait(0);
		currentActivityIndex++;
		vlb = lstActivityStatusVLB[currentActivityIndex];
		vlb << append(PictureBox(iconBusy));
		tb = lstActivityStatusTB[currentActivityIndex];
		tb << setFontStyle("Bold");
		tb << reshow;
		wait(delay);
	),
	/*----------------------------------------------------------------------
	Method: start
	Launches the status window
	----------------------------------------------------------------------*/
	start = method({},
		nwStatus = NewWindow(winTitle, showMenu(0), showToolbars(0),
			BorderBox(top(20), bottom(40), left(80), right(20),
				lub = LineupBox(nCol(2), spacing(10))
			)
		);
		lstActivityStatusVLB = {};
		lstActivityStatusTB = {};
		for (i = 1, i <= numActivities, i++,
			lub << append(vlb = VListBox());
			lub << append(tb = Text Box(lstActivities[i], <<Set Width(width)));
			insertInto(lstActivityStatusVLB, vlb);
			insertInto(lstActivityStatusTB, tb);
		);
		currentActivityIndex = 1;
		vlb = lstActivityStatusVLB[1];
		vlb << append(PictureBox(iconBusy));
		tb = lstActivityStatusTB[1];
		tb << setFontStyle("Bold");
		tb << reshow;
		wait(delay);
	),
	/*----------------------------------------------------------------------
	Method: finish
	Mark the current task completed then close the status window
	----------------------------------------------------------------------*/
	finish = method({},
		vlb = lstActivityStatusVLB[currentActivityIndex];
		(vlb << child) << delete;
		vlb << append(PictureBox(iconDone));
		tb = lstActivityStatusTB[currentActivityIndex];
		tb << setFontStyle("Normal");
		tb << reshow;
		wait(max(delay, 1.0));
		nwStatus << closeWindow;
	)
}
);
```

I’m not going to try and rationalise the code structure for the class definition, but here are a few observations:

- Properties of the class can be identified by declaring variables with initialised values (see lines 8 through to 20); these act as public variables in that they are accessible to the code of all the class methods without any qualification.
- Each property is delimited with a comma.
- Each method is standard JSL (i.e. expressions separated by semicolons). Arguments can optionally be specified.
- Unlike functions, the scope of variables inside a method definition does not need to be explicitly specified (there is no *default local*). Variables are local to the method, except for those that have been defined outside of the method call, which act as class-level properties.
- There is a special initialisation method that has the name *_init_*.
- Each method is delimited by a comma (but don’t put a comma after the last method!).

With regard to this specific class definition, lines 15 and 16 define image icons based on a very long character string that would overwhelm the widget I use to display code snippets – therefore I have replaced the text with “…”. This means the code will not run as intended; however, once JMP 14 is officially released I will post the full code on the file exchange.

---

Most programming tasks require operations to be performed on data. In object-oriented programming, the goal is to create re-usable packages that combine both data and the functions that operate on the data. In JSL this can conveniently be done using namespaces to act as the container for both data and function definitions.

In the code below a namespace is used to deliver the object-oriented functionality for a navigation wizard. The code pattern is based on the code-base described by Drew Foglia, Principal Software Developer with JMP.

```jsl
NavigatorClass = New Namespace("NavigatorClass-pega-analytics.co.uk");
NavigatorClass:newNavigator = function({maxIndex=1}, {default local},
	// create a unique namespace for this object instance
	ns = newNamespace();
	ns:maxIndex = maxIndex;   // maximum number of steps in the wizard
	ns:index = 1;             // current index of the navigation wizard
	ns:traceFlow = 0;         // enables diagnostic output when true
	ns:getIndex = evalexpr(function({}, {this=namespace(expr(ns<<getname))},
		if (this:traceFlow, print("getIndex()"));
		return(this:index);
	));
	ns:enableTraceMode = evalexpr(function({enable}, {this=namespace(expr(ns<<getname))},
		// enable tracing of method calls
		if (this:traceFlow, print("enableTraceMode()"));
		this:traceFlow = enable;
	));
	ns:nextStep = evalexpr(function({}, {this=namespace(expr(ns<<getname))},
		// move forward to the next step of the wizard
		if (this:traceFlow, print("nextStep()"));
		this:index++;
		this:index = min(this:index, this:maxIndex);
		this:_enableButtons();
		this:_updateContent();
	));
	ns:previousStep = evalexpr(function({}, {this=namespace(expr(ns<<getname))},
		// move backward to the previous step of the wizard
		if (this:traceFlow, print("previousStep()"));
		this:index--;
		this:index = max(this:index, 1);
		this:_enableButtons();
		this:_updateContent();
	));
	ns:_enableButtons = evalexpr(function({}, {this=namespace(expr(ns<<getname))},
		// perform logic to correctly enable/disable navigation buttons
		if (this:traceFlow, print("_enableButtons()"));
		if (this:index >= this:maxIndex,
			this:btnNext << enable(0)
		,
			this:btnNext << enable(1)
		);
		if (this:index <= 1,
			this:btnBack << enable(0)
		,
			this:btnBack << enable(1)
		)
	));
	ns:_reportContent = evalexpr(function({}, {this=namespace(expr(ns<<getname)), content},
		// display box content for the current step of the navigation wizard
		if (this:traceFlow, print("_reportContent()"));
		content = Text Box("content here for report " || char(this:index));
		return(content);
	));
	ns:_updateContent = evalexpr(function({}, {this=namespace(expr(ns<<getname))},
		// refresh content
		if (this:traceFlow, print("_updateContent()"));
		this:pb << set title(char(this:index) || " of " || char(this:maxIndex));
		this:bb << delete;
		this:pb << sib append(
			this:bb = Border Box(top(20),
				this:_reportContent()
			)
		);
	));
	ns:_createNavigatorWindow = evalexpr(function({}, {this=namespace(expr(ns<<getname)), win, self},
		// create the window containing navigation controls and content
		if (this:traceFlow, print("_createNavigatorWindow()"));
		win = New Window("Navigator",
			Border Box(left(20), right(20), bottom(10),
				V List Box(
					this:pb = Panel Box(
						char(this:index) || " of " || char(this:maxIndex),
						Border Box(top(0), bottom(0), left(10), right(10),
							H List Box(
								this:btnBack = Button Box("< back", , <<Enable(0)),
								this:btnNext = Button Box("next >"),
								Text Box(" ")
							)
						)
					),
					this:bb = Border Box(Top(20),
						this:_reportContent()
					)
				)
			)
		);
		self = this << get name;
		eval(parse(evalinsert("\[
			this:btnNext << set script(
				this = namespace("^self^");
				this:nextStep()
			);
			this:btnBack << set script(
				this = namespace("^self^");
				this:previousStep()
			);
		]\")));
		return(win);
	));
	// on creating a new instance of the class, render the navigation window
	ns:_createNavigatorWindow();
	// final step - return a reference to the class instance in the form of a namespace 'object'
	return(ns);
);
```

Assume the code is written in a file “NavigatorClass.jsl”. To use the class definition I would write the following code:

```jsl
Include("NavigatorClass.jsl");
numSteps = 5;
nav = NavigatorClass:newNavigator(numSteps);
```

Currently the content that is displayed in the wizard is defined by a single line of code within the function _reportContent:

```jsl
ns:_reportContent = evalexpr(function({}, {this=namespace(expr(ns<<getname)), content},
	// display box content for the current step of the navigation wizard
	if (this:traceFlow, print("_reportContent()"));
	content = Text Box("content here for report " || char(this:index));
	return(content);
));
```

To make the navigator useful the code needs to be customised to display relevant content.

The code pattern for each function within the namespace looks complex. This complexity is required so that the object has the concept of *self*. This becomes apparent when you create two simultaneous instances of the navigation wizard. Even though both wizards use the functions nextStep and previousStep to navigate, they successfully navigate the correct window. For simpler code implementations this would not be the case – you could click the next button in one window and the content would change in a second window!

---

Let’s say I construct a regression model using the *Fit Model* platform. The usual method to visualise the model is to use the prediction profiler available from within the platform.

If I want to visualise the model later (or perform post-modelling tasks such as simulation) then I can save the prediction formula as a column and access the *Profiler* platform directly from JMP’s graph menu.

The trouble is, plotting a formula gives you just that: the formula with no sense of goodness of fit.

If you want to have confidence intervals on the curves then you have to use the profiler via the *Fit Model* platform.

*Or so I thought – but it turns out that I was wrong.*

It is possible to include the confidence interval: perhaps I should have known how to do this, but I think it’s sufficiently obscure to warrant being classed as a hidden secret!

When you have the prediction formula there is also an option to save the formula for the standard error of prediction. So far so good. But there is no obvious way to tell the prediction profiler that it should use this formula column to compute the confidence intervals. But it will do it for you if you name the columns correctly.

- The model formula column must be named *“Pred Formula of <name>”*
- The formula for the standard error must be named *“PredSE of <name>”*

*<name>* can be anything (typically the name of the response being modelled). The rest of the wording must be exact, including the space between “Pred” and “Formula” and no space in “PredSE”.
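If the saved columns don’t already follow this convention they can be renamed manually, or with a couple of lines of JSL (the response name “Y” and the existing column names below are hypothetical):

```jsl
dt = Current Data Table();
Column( dt, "Predicted Y" ) << Set Name( "Pred Formula of Y" );
Column( dt, "StdErr Pred Y" ) << Set Name( "PredSE of Y" );
```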

Here is an example:

If you have done it correctly the following message appears:

Click *Yes* and hey-presto!

---

With respect to statistical modelling, the activation of a neuron can be represented as a logistic model. I want to illustrate how the behaviour of a single node within a neural network is the same as a logistic model, and show how networking extends the utility of the model beyond the capabilities of a single logistic representation.

To achieve this goal I think it’s easier if I just construct some artificial data:

The data has been coloured by the (binary) response classification. Low values of X correspond to blue; high values correspond to red. A logistic model will transform the input X values into probabilities representing my target class (the red values).
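The logistic transform itself is just a smooth S-shaped curve. A sketch in Python, with coefficients I have picked by hand (not fitted values) so that the curve crosses 0.5 near X = 15.5:

```python
import math

def logistic(x, intercept, slope):
    """Logistic transform: maps any real input to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-(intercept + slope * x)))

# Illustrative coefficients: the curve crosses probability 0.5 at X = 15.5.
p_low = logistic(5.0, intercept=-15.5, slope=1.0)    # low X  -> probability near 0
p_high = logistic(25.0, intercept=-15.5, slope=1.0)  # high X -> probability near 1
```

Low X values map to probabilities near zero (blue), high X values to probabilities near one (red), with a smooth transition between.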

JMP output for a corresponding model representation using a neural network with a single node is shown below:

If we think of the purpose of the model as detecting “red” then both models “trigger” when X exceeds a value of about 15.5, and based on the data used to “train” the models both the logistic regression and neural network have training misclassification rates of 7.5% and validation misclassification rates of 11.7%.

How will the two methods handle increased complexity? I’ve amended the data so that there are now some red data points for low values of X:

A single logistic function needs the data to transition smoothly in a single direction; all that happens is that the red data points at the low values of X become misclassified:

What I need to do is to build two separate models, one to model the red values at low X values and the other to model the red values for high X values. In fact I only need to build the model once and then use a global data filter to adjust the data that is being included:

Now imagine these two models working together over the entire range of X. The prediction profile would look something like this:

For a neural network, this combination of models is achieved by having a network consisting of two neurons:

In isolation each neuron is performing a simple regression. The power of neural networks is this ability to network neurons together so that in combination they can produce a single model descriptive of the entire data, rather than having to isolate special cases and model the data separately.
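To make that combination concrete, here is a sketch of a two-neuron network in Python. The weights are hand-picked for illustration, not fitted values from the post: one neuron activates for high X, the other for low X, and an output logistic combines them so that “red” is predicted at both extremes:

```python
import math

def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))

def two_neuron_net(x):
    """Sketch of a two-neuron network with hand-picked (not fitted) weights.

    Neuron 1 activates for high X (the original transition near 15.5);
    neuron 2 activates for low X (the added red points). The output
    layer combines the two activations through a final logistic.
    """
    h1 = logistic(1.0 * x - 15.5)   # "high X" neuron
    h2 = logistic(-1.0 * x + 5.0)   # "low X" neuron
    return logistic(8.0 * h1 + 8.0 * h2 - 4.0)

# Probability of "red" is high at both extremes and low in the middle.
```

Each hidden neuron is just a logistic regression; the output layer is what lets them act as a single model over the whole range of X.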

Notice in particular that when I created the two logistic models, I had to look at the data and make a decision to “cut” the data at X=12. For the neural network, this point, at which the two logistic models “join”, emerges automatically from the network as observed in the prediction profiler.


This is the default output that I have from the *Text Explorer* platform working with the sample data *Pet Survey*.

Let’s try and use the platform to determine whether a respondent owns a cat or a dog. To do this I want to focus on the terms. Really I’m only interested in “cat” and “dog”, but I have to take into account possible variations.

Common variations of a word share the same stem. The most obvious example is “cat” and “cats”. I would like to collapse both of these words into a single term “cat”.

Stemming is the process of combining words that start with the same sequence of characters (the stem).

When the platform is launched there is an option to specify stem rules: *Stem For Combining* and *Stem All Terms*. The default is *No Stemming*.

The above output is based on the default option of no stemming. But I can change the option from the red triangle hotspot:

*Term Options > Stemming*

I want to stem the terms for the purpose of combining them, so I select the option *Stem For Combining*. The output now changes:

Whereas I had 46 counts of dog and 48 counts of dogs, I now have 94 counts of the stem dog.

Another good example is the stem bark; this includes 7 instances of “bark”, 12 instances of “barking” and 5 instances of “barks”.
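The bookkeeping behind this is easy to picture. Here is a sketch in Python using a very crude suffix-stripping stemmer (JMP’s stemming rules are more sophisticated; this just shows how token counts collapse onto a shared stem, using the counts from the report above):

```python
from collections import Counter

def crude_stem(word):
    """Very crude suffix stripping -- an illustration of the idea only."""
    for suffix in ("ing", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

# Token counts from the Pet Survey report above.
token_counts = Counter({"dog": 46, "dogs": 48, "bark": 7, "barking": 12, "barks": 5})

stem_counts = Counter()
for token, n in token_counts.items():
    stem_counts[crude_stem(token)] += n
# stem_counts: dog -> 94, bark -> 24
```

The 46 + 48 counts of “dog” and “dogs” collapse to 94 counts of the stem dog, and the 7 + 12 + 5 counts of “bark”, “barking” and “barks” collapse to 24 counts of the stem bark.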

The effect of stemming rules is to allow multiple tokens to be combined into a single term. Sometimes we want to combine terms even though they don’t share a common stem. In this data, I want to place “huskies” with the “dog” term. I can achieve that by *recoding*. Note that unlike column recoding, this functionality is local to the Text Explorer platform. To activate recoding I select the option from the red triangle hotspot:

*Term Options > Manage Recodes*

Selecting this option launches a window for managing recodes:

With this recoding 7 instances of “huskies” are now combined with the 94 “dog” terms.

*[technical note: JMP applies recoding rules before stem rules]*
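That ordering matters, and a small sketch in Python shows why (the stemmer here is my own crude stand-in, not JMP’s):

```python
def crude_stem(word):
    """Crude stand-in stemmer: strip a trailing 's'."""
    return word[:-1] if word.endswith("s") and len(word) > 3 else word

recodes = {"huskies": "dog"}   # the recode used in the post

def normalise(token):
    """Apply the recode first, then stemming -- the order JMP applies them."""
    token = recodes.get(token, token)
    return crude_stem(token)

# "huskies" is recoded straight to "dog"; "dogs" reaches "dog" via stemming.
```

If the order were reversed, “huskies” would first be stemmed to something like “huskie”, and the recode rule keyed on “huskies” would never fire.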

The purpose of generating the “cat” and “dog” terms was to help determine whether a particular respondent was the owner of a cat or dog.

As you would expect from JMP, the reports within the Text Explorer platform are live-linked to the source data table. That means you can right-click on the dog term, choose *Select Rows* and all associated rows in the table are selected. Rather than selecting the rows I want to generate an indicator column (or a formula) that indicates the status; and I want to do it for both dog and cat terms:

This generates indicator columns that can be used to filter and classify the data rows:

I would like to use the indicator to determine whether or not the respondent owns a dog or cat. I can use it that way but there will be a degree of misclassification. The indicator doesn’t know the context of the words – it’s simply an indication of whether a term appears in the body of text.
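Conceptually an indicator column is nothing more than a per-document term lookup. A minimal sketch in Python, with made-up survey responses standing in for the real data:

```python
# Hypothetical stand-ins for the survey responses (not the real Pet Survey data).
docs = [
    "My dog barks at the mailman",
    "Our cat sleeps all day",
    "I walk the dog and feed the cat",
]

def indicator(docs, term):
    """1 if the term appears among a document's (lower-cased) tokens, else 0."""
    return [1 if term in doc.lower().split() else 0 for doc in docs]

dog_indicator = indicator(docs, "dog")   # [1, 0, 1]
cat_indicator = indicator(docs, "cat")   # [0, 1, 1]
```

As in JMP, the indicator records only presence of the term, with no sense of the context in which the word appears.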

A stronger indication of ownership might be a phrase such as “my dog” or “our cat”.

Most of the phrases relate to pet behaviour. That’s not surprising given the original question of the survey (“Think about your cat or dog. What’s the first thought that comes to mind?”). But I want to look for phrases that relate to ownership. In the data there are phrases such as “my cat …”; why is it that these phrases are not listed? The answer is stop words. Phrases are not listed if they start or end with stop words; “my” and “our” are both built-in stop words.

To exclude a built-in stop word I need to include it in the local exceptions list:

With these exceptions the most frequent phrases become “my cat” and “my dog”. Notice the high occurrence of the term “my”; on its own it’s not particularly informative, which is why it is a built-in stop word.
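The stop-word filtering of phrases can be sketched in a few lines of Python. The stop-word list and sample sentence here are tiny stand-ins of my own; the logic mirrors the rule described above, that a phrase is dropped if it starts or ends with a stop word unless that word is on the exceptions list:

```python
STOP_WORDS = {"my", "our", "the", "a", "is"}   # tiny stand-in stop-word list
EXCEPTIONS = {"my", "our"}                     # locally re-admitted stop words

def bigrams(tokens):
    """All two-word phrases in a token sequence."""
    return list(zip(tokens, tokens[1:]))

def keep_phrase(phrase):
    """Drop a phrase if it starts or ends with a (non-excepted) stop word."""
    first, last = phrase[0], phrase[-1]
    return all(w not in STOP_WORDS or w in EXCEPTIONS for w in (first, last))

tokens = "my cat chased the dog".split()
phrases = [" ".join(p) for p in bigrams(tokens) if keep_phrase(p)]
# phrases == ["my cat", "cat chased"]
# "my cat" survives because "my" is on the exceptions list;
# "chased the" and "the dog" are dropped because of "the".
```

Without the exception, “my cat” would be filtered out along with the genuinely uninformative phrases.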

Just as with terms, I can right-click on phrases and save them as indicators. Whilst these indicators give me much less coverage of the data, they give me a very high degree of confidence that they indicate ownership of a particular type of pet.
