    // purge the temporary folder
    files = filesIndirectory(tmpDir);
    for (i=1,i<=nItems(files),i++,
        path = convertFilePath(files[i],base(tmpDir));
        if (!isDirectory(path),
            deleteFile(path)
        )
    );

]]>

To enable code folding, select the option under the *Script Editor* section of *Preferences*, found under the *File* menu.

I use it with my user-defined functions to give me an overview of contents within an include file. Combined with appropriately placed comments this helps to summarise the contents of a library of functions. Here is an example:

]]>

The calculations of process capability analysis can be reversed so that for a given set of target capability values the associated specification limits can be generated. The calculation is straightforward for a normal distribution but needs a bit more thought when it comes to asymmetric distributions.

Traditionally process capability is defined with respect to a normal distribution. The capability index is the ratio of the specification width to process width:

*Cp = (USL − LSL) / 6σ*

Where σ is the standard deviation of the process variation and two-sided spec limits (USL, LSL) are assumed. The width of the specification window can be identified simply by rearranging the above formula:

*USL − LSL = 6σ × Cp*

In the case where we haven’t established our spec limits then we can substitute a target value for Cp to generate the width of the specification window (USL-LSL).

If our spec limits are symmetric with respect to our target, and the process is on-target, then the symmetry can be used to determine the levels of the lower and upper specs:

*LSL = TGT − 3σ × Cp^target*, *USL = TGT + 3σ × Cp^target*

Where TGT represents our process target. I’ve used the ‘target’ superscript to make it more explicit that I am using a Cp value that represents the target value, and not the value derived from the data.

I can illustrate this with a specific example. Suppose that we have a process with a mean value of 100 and a standard deviation of 10 and that we want to identify spec limits that would result in a Cp of 2.0.

I can simply plug the numbers into the above formulae:

*LSL = 100 − 3 × 10 × 2.0 = 40*, *USL = 100 + 3 × 10 × 2.0 = 160*

The reverse calculation requires simple algebraic operations applied to the standard definitions of process capability indices. No knowledge of statistical theory is required.

The above calculations can be verified by taking a (large) random sample from a normal distribution and performing a process capability study using the proposed specification limits:

The simulation confirms that my proposed specification limits yield a value of 2.0 for Cp.
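A minimal JSL sketch of this kind of check, assuming the example values above (mean 100, standard deviation 10, target Cp of 2.0). The variable names and the sample size are illustrative choices, and Cp is computed here directly from the sample standard deviation rather than through a capability platform report:

```jsl
// Example values taken from the text
tgt = 100;
sigma = 10;
cpTarget = 2.0;

// Reverse calculation of the spec limits
lsl = tgt - 3 * sigma * cpTarget;   // 40
usl = tgt + 3 * sigma * cpTarget;   // 160

// Large random sample from the assumed normal process (sample size is arbitrary)
x = J( 100000, 1, Random Normal( tgt, sigma ) );

// Cp estimated from the simulated data using the proposed limits
cpSim = (usl - lsl) / (6 * Std Dev( x ));

Show( lsl, usl, cpSim );
```

With a sample this large, cpSim should land very close to the target of 2.0, varying slightly from run to run.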

The reverse calculation for non-normal (asymmetric) distributions is more complex and first it is necessary to understand how the definition of capability indices is generalised for distributions such as lognormal and Weibull.

Whilst capability studies can be summarised by simple indices, they encapsulate the notion of defective parts per million. To illustrate this let’s take a simple case where we have a process on target and a process capability of 1. In this instance, by definition, the specification width is identical to the process width of 6σ.

The conversion of capability indices to dppm is dependent on the underlying distribution

We know that for a normal distribution 99.73% of the data is contained within this range. That’s another way of saying that 0.27% of data is outside of spec, equivalent to 2700 defective parts per million (dppm). The conversion of capability indices to dppm is therefore dependent on using the underlying distribution to generate probability outcomes.

Without loss of generality let’s assume that we have asymmetric process data that can be characterised using a Weibull distribution. If 6σ is the width of a normal distribution then what is the width of a Weibull distribution?

The appropriate way to define the width of the Weibull distribution is so that it has equivalent probability outcomes to the normal distribution: the width can be defined so that it contains 99.73% of the data. Furthermore this interval can be located in such a way that 0.135% of the data falls either side of this interval.

To illustrate this principle I have generated some random data sampled from a lognormal distribution and through a process of trial and error identified specification limits that satisfy the above criteria:

Having established the principle for generalising capability indices I now want to explore the calculations that are required to generate proposed spec limits.

The 6σ width that we associate with a normal distribution can be generalised to an interval that contains 99.73% of the data. More specifically:

*6σ → Pu − Pl*

Where Pu is the upper percentile (100%-0.135%) and Pl is the lower percentile (0.135%).

This can be used to generalise the definition of process capability:

*Cp = (USL − LSL) / (Pu − Pl)*

Furthermore the one-sided capability indices can be generalised by replacing the average value (Ybar) with the median (P50):

*Cpl = (P50 − LSL) / (P50 − Pl)*

and

*Cpu = (USL − P50) / (Pu − P50)*

Let me take the case where the process is on-target. In that case Cpl = Cpu = Cp.

Therefore

*Cp = (P50 − LSL) / (P50 − Pl)*

I can rearrange this to get an expression for LSL:

*LSL = P50 − Cp × (P50 − Pl)*

Taking the same approach for Cpu reveals:

*USL = P50 + Cp × (Pu − P50)*

As usual I want to verify my calculations, which I can do using simulated data. The first step is for me to generate a sample of data selected randomly from a Weibull distribution:

Let me assume that my target value for Cp is 1.50. I can use this value in my expressions for the specification limits:

*LSL = P50 − 1.5 × (P50 − Pl)*, *USL = P50 + 1.5 × (Pu − P50)*

But I also need to calculate the percentiles P50, Pu and Pl. I can do this using the *Weibull Quantile* function:

Using the alpha and beta parameter estimates from the fitted distribution I calculate the following values:

Now I have all the information that I need to calculate the spec limit values required to generate my desired Cp goal. In fact let me do it as a short JSL script:

I estimate the specs to be 70 and 319.

Am I right? I can use JMP to perform the process capability analysis using these numbers:

Spot on!
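The spec-limit calculation for the Weibull case can be sketched in JSL along the following lines. The alpha (scale) and beta (shape) values are placeholders, because the fitted estimates are not given in the text, so the output will not reproduce the 70 and 319 quoted above. The sketch also assumes that JMP's alpha corresponds to the scale argument and beta to the shape argument of *Weibull Quantile*:

```jsl
// Placeholder parameter estimates (hypothetical, not the values fitted in the post)
alpha = 200;   // scale
beta = 8;      // shape

cpTarget = 1.5;

// Percentiles of the fitted distribution: Weibull Quantile( p, shape, scale )
p50 = Weibull Quantile( 0.5, beta, alpha );       // median
pl = Weibull Quantile( 0.00135, beta, alpha );    // lower 0.135% point
pu = Weibull Quantile( 0.99865, beta, alpha );    // upper 99.865% point

// Spec limits that give Cpl = Cpu = cpTarget
lsl = p50 - cpTarget * (p50 - pl);
usl = p50 + cpTarget * (pu - p50);

Show( p50, pl, pu, lsl, usl );
```

Substituting the alpha and beta estimates from a real fitted distribution should give limits that can then be checked with a capability analysis, as done above.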

]]>In this post I will explore the relationship between a lognormal distribution and a normal distribution.

JMP has a collection of functions for generating random data sampled from a specific distribution:

So it’s easy for me to generate data for both a normal and lognormal distribution, and to compare them:

Now that I can look at the lognormal distribution let me take a closer look at its parameters. To generate random data from a lognormal distribution I use the following function:

Here is the distribution using mu=4.6 and sigma=0.35:

What I find confusing is that sigma is not the standard deviation of the data and mu is not the mean. Presumably then, they relate to the parameters of the associated normal distribution. Let’s see. I can create a new variable Z which is the log transform of the data:

Hey presto – the mean and standard deviation match the parameters I used for the *Random Lognormal* function.
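The same check can be sketched without a data table by holding the sample in a matrix. The sample size and variable names here are illustrative assumptions; the mu and sigma values are the ones used above:

```jsl
mu = 4.6;
sigma = 0.35;

// Random sample from the lognormal distribution (sample size is arbitrary)
x = J( 10000, 1, Random Lognormal( mu, sigma ) );

// Log transform of the data
z = Log( x );

// The mean and standard deviation of z should be close to mu and sigma,
// whereas those of x are quite different
Show( Mean( z ), Std Dev( z ), Mean( x ), Std Dev( x ) );
```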

What if I want to generate a lognormal distribution and I want to specify the values for the mean and standard deviation? Let me take a specific example:

I want to generate a lognormal distribution with the same mean and standard deviation as the above data.

The calculation is more complex than you might expect. If x̄ and s represent the mean and standard deviation of the normal distribution then the parameters for the lognormal distribution are given by:

*sigma = sqrt( ln(1 + s² / x̄²) )*, *mu = ln(x̄) − sigma² / 2*

Applying these equations to the above data yields values of -0.005 and 0.1 respectively.

Finally, I can verify these numbers by using them with the Random Lognormal function to generate some sample data. If I have the correct parameters then the data will have a mean of 1.0 and a standard deviation of 0.1:
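A short JSL sketch of this verification, assuming a required mean of 1.0 and standard deviation of 0.1 as stated above. It uses the standard relationship between the moments of a lognormal distribution and its mu and sigma parameters; the variable names and sample size are illustrative:

```jsl
// Required mean and standard deviation of the lognormal data
m = 1.0;
s = 0.1;

// Convert to the mu and sigma parameters of the lognormal distribution
sigma = Sqrt( Log( 1 + (s / m) ^ 2 ) );   // about 0.1
mu = Log( m ) - sigma ^ 2 / 2;            // about -0.005

// Generate sample data and compare its moments with the requirement
x = J( 100000, 1, Random Lognormal( mu, sigma ) );
Show( mu, sigma, Mean( x ), Std Dev( x ) );
```

The reported mean and standard deviation of x should be close to 1.0 and 0.1, with small differences from run to run.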

]]>

Process capability indices are a convenient way of summarising process performance. They contain information about how far a process has shifted from the target as well as the expected number of defective parts per million. In an earlier post I showed the relationship between the capability indices and the process shift. In this post I will use the indices to calculate the expected number of defective parts per million. The calculation will involve probabilities that are calculated by reference to a Normal distribution. I will use the JSL script editor to perform these calculations.

Let’s be clear. JMP reports dppm figures when it calculates the capability indices, but nonetheless I think it’s important to understand how the information is generated rather than just blindly follow software output. As a case-study, it is also a good illustration of using the script editor to utilise the probability distribution functions available in JMP.

First I want to start with a simplified scenario where the process is on target and I can work solely with the Cp index. If the process has upper and lower specs of U and L respectively and the process standard deviation is σ then

*Cp = (U − L) / 6σ*

This is the ratio of the specification window to the process width. Using 6σ as a measure of process width is just a convention: when I was first introduced to quality methods it was quite common to define the process width as 5.15σ.

Let me start with the case of Cp = 1 i.e. the spec window and the process width are identical. Now the calculation of dppm is the same as calculating the probability of an observation being outside the 6σ width.

With any probability distribution (such as a Normal distribution) there are two ways in which the distribution can be described: a probability density function (pdf) and a cumulative distribution function (cdf). In JMP the function that generates the pdf for a Normal distribution is **Normal Density** whereas the cdf is called **Normal Distribution**.

The cdf for the Normal distribution takes 3 arguments:

*Normal Distribution(z, mu, sigma )*

It’s good to start with a trivial test case where the answer is obvious! If I take a standard normal i.e. with mean of zero and standard deviation of 1 then I know by symmetry that 50% of the data will be less than or equal to zero:
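In the script editor this test case might be written as follows (a minimal sketch; the variable name is arbitrary):

```jsl
// P( Z <= 0 ) for a standard normal: mean 0, standard deviation 1
p = Normal Distribution( 0, 0, 1 );
Show( p );   // p = 0.5;
```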

The result is a probability. Of course I could have multiplied by 100 if I wanted to explicitly express the result as a percentage or 10^6 if I wanted parts per million.

If I wanted the probability of being less than 3 standard deviations above the mean I could write:

To calculate the probability of being within the range +/- 3σ I can write:

With probability calculations it is often easier to calculate the logical opposite of our goal and then take one minus this value to produce the final result; this is the case with dppm calculations. In the calculation below p is the probability of being within the +/- 3σ range: so (1-p) gives me the probability of being outside the range. The 10^6 scale factor gives me the result in terms of parts per million:

This is the well-known result that 0.27% of data is outside the 6σ width of a Normal distribution.
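The three calculations described above can be sketched together in JSL. The variable names are illustrative, and a standard normal (mean 0, standard deviation 1) is assumed so that z is measured in units of σ:

```jsl
// Probability of being less than 3 standard deviations above the mean
pBelow = Normal Distribution( 3, 0, 1 );   // about 0.99865

// Probability of being within the range +/- 3 sigma
p = Normal Distribution( 3, 0, 1 ) - Normal Distribution( -3, 0, 1 );   // about 0.9973

// Probability of being outside the range, in parts per million
dppm = (1 - p) * 10 ^ 6;   // about 2700

Show( pBelow, p, dppm );
```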

Recall I said that an alternative definition of the process width is 5.15σ. The motivation for this definition is that the proportion outside this process width is very close to 1% (dppm looks worse but calculations are easier!).

Having illustrated the probability calculation for the case that the spec window is the same size as the process width, let me now take the case of Cp=2 i.e. the spec window has a width twice the process width. The width of the spec window is now 12σ and all I need to do is change the calculation to use a range +/-6σ:

For this we need to be thinking in terms of parts per billion!
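A sketch of the Cp=2 calculation in JSL, again assuming a standard normal and illustrative variable names. The result is shown in both parts per million and parts per billion:

```jsl
// Probability of being within +/- 6 sigma
p = Normal Distribution( 6, 0, 1 ) - Normal Distribution( -6, 0, 1 );

dppm = (1 - p) * 10 ^ 6;   // about 0.002 parts per million
dppb = (1 - p) * 10 ^ 9;   // about 2 parts per billion

Show( dppm, dppb );
```

Because the tail area is so small, an alternative that avoids subtracting from 1 is to use the symmetry of the distribution and compute 2 * Normal Distribution( -6, 0, 1 ) directly.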

So far my calculations have assumed that I have a process on target, which allows Cp to be a sufficient descriptor of process performance. Now I want to consider both Cp and Cpk. Implicit in these two statistics is a process shift:

*shift = 3σ(Cp-Cpk)*

(see my previous post for the derivation of this result).

For purposes of illustration I will use the classic criteria associated with six sigma methodology: Cp=2 and Cpk=1.5. This corresponds to a process shift of 1.5σ.

Without loss of generality I can assume that the shift is positive in relation to the process target (which I’ll assume to be midpoint between the spec limits) as illustrated below:

When on target the process mean is 6σ from each spec limit. With the process shift the mean is 4.5σ from the upper limit and 7.5σ from the lower limit. The number of defective parts per million will correspond to the proportion which is outside the range -7.5σ to +4.5σ:

This is the benchmark result of 3.4 parts per million defective for a “six sigma” process.
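The shifted-process calculation can be sketched in JSL as follows. Distances are expressed in units of σ relative to the process mean, and the variable names are illustrative:

```jsl
cp = 2;
cpk = 1.5;

// Process shift in units of sigma: 3 * (Cp - Cpk) = 1.5
shift = 3 * (cp - cpk);

// Distance from the shifted mean to each spec limit, in units of sigma
upper = 3 * cp - shift;       //  4.5
lower = -(3 * cp + shift);    // -7.5

// Proportion within spec, then defective parts per million
p = Normal Distribution( upper, 0, 1 ) - Normal Distribution( lower, 0, 1 );
dppm = (1 - p) * 10 ^ 6;      // about 3.4

Show( shift, upper, lower, dppm );
```

Almost all of the 3.4 dppm comes from the upper tail; the contribution from beyond 7.5σ on the lower side is negligible.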

]]>

If you don’t write JSL scripts you may never have had a need to use the script editor, so let’s first take a look at this. On the toolbar the second icon will create a new script window:

If you prefer you can use the following menu path: *File>New>Script.*

The script window is just a blank window into which we type commands. We want to use these commands to perform some calculations, so we also need an output area: right-mouse-click and select the option **show embedded log**. The window splits into two sections:

The upper section is the input region where we can type JSL statements with results being displayed in the lower section.

JMP has a vast library of in-built functions for performing mathematical and statistical calculations. Here is a simple example of using the *PI* function to obtain the value of π:

To generate this output I type “Pi()” and then click the run-script icon:

If you think about how we write functions in mathematics we might write something like:

y = f(x)

A similar notation is used in JSL. In particular the parentheses instruct the script editor that what is being written is a reference to an in-built function (if a valid function is identified by JMP then the script editor changes the text colour to blue). The parentheses also act to contain possible arguments. In the above example the function *f* is a function of *x*. For example the f function could be a natural logarithm and x could be the value 2.7183:

If I wanted a base-10 logarithm then I would use the function *LOG10*.
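As a small sketch, these function calls can be typed into the script window and run together; *Show* writes each expression and its value to the log. The argument of 100 for *Log10* is just an illustrative value:

```jsl
// Value of pi
Show( Pi() );

// Natural logarithm: the result is very close to 1
Show( Log( 2.7183 ) );

// Base-10 logarithm: the result is 2
Show( Log10( 100 ) );
```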

I’m not limited to single functions. I can build fully featured mathematical expressions using the following operations:

In this calculation for the area of a circle I wanted to assign my radius value to a variable before defining the formula for the area. To do this I have had to write 2 lines of code in which case I have to delimit the lines with a semicolon.
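A two-line calculation of this kind might look as follows. The radius value is an arbitrary choice for illustration:

```jsl
// Assign the radius to a variable, then use it in the formula for the area
r = 2;
area = Pi() * r ^ 2;

Show( area );   // about 12.566
```

The semicolon after the first statement is what separates the two expressions; the final semicolon is optional.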

To be proficient in using the script editor as a calculator you need familiarity with the functions available inside JMP. Fortunately these are documented online:

These functions are also documented in the JMP help system under the Scripting Index option.

In my next post I will illustrate using the script editor to perform probability calculations to estimate number of defective parts per million for a process where I know the capability indices.

]]>

The process capability statistic Cp compares process variation against the width of a process operating window:

*Cp = (U − L) / 6σ*

where U and L are the upper and lower specification limits respectively, and σ represents the standard deviation of the process variation.

In order to take account of process location the ratio is extended to include process mean μ:

*Cpl = (μ − L) / 3σ*, *Cpu = (U − μ) / 3σ* & *Cpk = Min(Cpl, Cpu)*

So now we can confidently talk about process capability in terms of the indices Cp and Cpk. But it seems to me that this is convenient shorthand at the expense of transparency.

For example, if I am given values for Cp and Cpk the underlying process shift is not necessarily obvious.

Whilst the relationship between process shift and capability indices is not immediately apparent there is nonetheless a simple relationship:

*Δ = 3σ(Cp − Cpk)*

where the shift Δ is measured as the distance of the process mean from the target.

The rest of this post looks at the derivation of this result.

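Let’s assume without loss of generality that the process shift is positive (with respect to the target T). Then:

*Cpk = (U − μ) / 3σ*

The process shift is *Δ = μ − T* so we need an appropriate expression for μ and T.

From the above expression:

*3σ × Cpk = U − μ*

which implies

*μ = U − 3σ × Cpk*

If we assume that the specs are symmetric then

*T = (U + L) / 2*

which implies

*L = 2T − U*

but also

*Cp = (U − L) / 6σ*

therefore

*L = U − 6σ × Cp*

this implies

*2T − U = U − 6σ × Cp*

therefore

*T = U − 3σ × Cp*

Using the above expressions for μ and T an expression for process shift can be constructed and simplified:

*Δ = μ − T = (U − 3σ × Cpk) − (U − 3σ × Cp) = 3σ(Cp − Cpk)*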

In this post I have derived a simple relationship between process shift and the capability indices Cp and Cpk. Given the simplicity of the relationship, my derivation feels somewhat laboured – perhaps you know of a more direct method?

]]>

Many statistical methods are expressed in the form of a hypothesis test: it’s one of the fundamental constructions within the field of inferential statistics. One of the outcomes of this construction is a probability outcome, or p-value, the notorious number which is subjected to extensive use and abuse! See for example:

When we conduct an experiment it can feel like we have collected an extensive set of data – but typically that data only represents a single sample. It’s hard to think in terms of probabilities once we have this data in our hands. Think about tossing a coin. Before a coin toss the probability of *heads* is of course 0.5, assuming a fair coin. But once the coin is tossed, I either have a heads or tails. There is no longer any ambiguity and the idea of probabilities feels irrelevant: the act of completing the experiment can make the use of statistics feel somewhat academic.

The way that probability is taught is to avoid the notion of single outcomes. We don’t just toss the coin once, we toss it 100 times – now the probability remains relevant after the event – the proportions of heads and tails are described by the probability.

The toss of a coin is of course a trivial example. But in all honesty I think that as soon as we move to more complex scenarios the probabilities get hard to understand at an intuitive level. We can do the math and understand the theory but there can be a disconnect between what is understood at an intellectual level versus an emotional *gut* feeling. At the end of the day if we conduct a complex experiment and collect the results, then we believe in those results: the p-value might be intended to help us keep in mind the probabilistic nature of the outcome, but as I’ve said before, it’s hard to think this way when we have single-outcome results in our hands.

Part of the problem with the interpretation of p-values is that it can be hard to take the null hypothesis seriously: that means not understanding the nature of type I errors. A technique that I have found useful when running training courses is to make the probabilities more visible by using simulation techniques to give a better sense of the probabilistic nature of experiment outcomes.

First I have to decide what the null hypothesis is. Tossing a coin is too trivial. I like the idea of using a simple linear regression because it is very visual. It is easy to understand at a scientific level but sufficiently complex that it contains appropriate components of statistical inference (analysis of variance, parameter estimation, etc).

The graph below shows the case of a null hypothesis (red line) and an alternative hypothesis (blue line):

If we take a sample of data under the conditions of the null hypothesis then we cannot expect the data to fall precisely on the horizontal red line. Therefore there is ambiguity as to whether the data is indicative of the null hypothesis or the alternate hypothesis. It is this ambiguity that we try and describe using p-values.

In the spirit of JMP, interactive visualisation is a much more powerful way to illustrate the ambiguity. To do this we can sample data from a population described by the null hypothesis and then use this data to build a regression model; this process of sampling and modelling can be repeated over and over, in just the same way that we can toss a coin multiple times.

The single-sample simulation can be articulated using a JMP table with a column formula:

Based on the data in the table, the Bivariate platform can be used to visualise the relationship in the data. Using a script the random components of the formula can be updated and the graph can be refreshed. Each refresh corresponds to a flip of the coin.
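A minimal sketch of what such a table and graph could look like in JSL. This is not the author's script: the table name, the sample size of 6, and the mean and standard deviation of the noise are all assumed values chosen for illustration:

```jsl
// Single sample drawn under the null hypothesis:
// y does not depend on x, it is only random noise about a constant mean
dt = New Table( "Null Hypothesis Sample",
	Add Rows( 6 ),   // assumed sample size
	New Column( "x", Numeric, Continuous, Formula( Row() ) ),
	New Column( "y", Numeric, Continuous, Formula( Random Normal( 10, 1 ) ) )
);

// Make sure the column formulas have been evaluated before analysing the data
dt << Run Formulas;

// Visualise the relationship and fit a regression line
biv = dt << Bivariate( Y( :y ), X( :x ), Fit Line );
```

Running the script again draws a fresh random sample, which is the equivalent of another flip of the coin. Sending the message *Rerun Formulas* to the table re-evaluates the random formula in place.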

This is what the null hypothesis really looks like!

The above visualisation really helps to express the problem that we are trying to solve when we use inferential techniques embedded within our model building process. And we can go further than just visualising these outcomes – we can analyse the data and get insights that provide intuition into traditional statistical theory.

One statistic that scientists feel very comfortable with is the R-square statistic. We would expect an R-square value of zero if the null hypothesis were true – but let’s take a look at how it really behaves based on our simulated outcomes. Here is a grid of just 9 runs:

Most people would probably think that an R-square of 0.76 would be associated with a “significant” model. If you teach statistics, how many times have you been asked “what is a good value for R-square?” – well now we can start introducing probabilistic thinking even to this question!

For the entire collection of simulation runs we can look at the overall distribution of R-square values:

It is interesting to ask where the 95% threshold is. That is, we would like to be able to make a statement along the lines of “95% of the time the R-square has a value less than xyz”. This can be achieved using the JMP data filter – slowly moving the slider from the right hand side until 95% of the rows have been selected:

For this scenario we can say that 95% of the time the null hypothesis generates an R-square less than 0.635. Or: there is a 5% chance that the R-square statistic will exceed 0.635 even if the null hypothesis is true.

If you have followed the logic that I described for the R-square statistic then you will realise that exactly the same procedure can be applied for statistics more appropriate to statistical inference.

Specifically, the simulation script can grab the F-ratio values displayed in the ANOVA table, and these values can be plotted as a histogram:

What have we just done? We’ve discovered the F-distribution empirically!

And using the data filter we can determine the 95% threshold:

For this sample the threshold value is 7.65. What have we done? We have just determined empirically the F-ratio that corresponds to a p-value of 0.05 (theoretically, for this number of degrees of freedom, an F-ratio of about 7.7 generates a p-value of 0.05).

A similar procedure can be applied to a t-ratio. However, when comparing the absolute value of the t-ratio the data filtering needs to be performed in two steps. First select a high value that eliminates 2.5% of the data:

Then adjust the lower value so that the total number of matching rows is 95% of the entire data:

Based on this example we would conclude that an absolute value for the t-ratio of 2.88 yields a p-value of 0.05. This is in excellent agreement with the theoretical value for this number of data points.
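The theoretical thresholds can be obtained from the quantile functions in JSL. The sketch below assumes a simple linear regression on 6 data points, giving 1 and 4 degrees of freedom; the sample size is not stated in the text, so this is a guess based on the quoted F-ratio of about 7.7:

```jsl
n = 6;             // assumed number of data points (not stated in the post)
dfError = n - 2;   // error degrees of freedom for a straight-line fit

// Critical F-ratio for alpha = 0.05
fCrit = F Quantile( 0.95, 1, dfError );    // about 7.71

// Critical absolute t-ratio (two-sided) for alpha = 0.05
tCrit = t Quantile( 0.975, dfError );      // about 2.78

// Equivalent R-square threshold, using R-square = F / (F + dfError)
rsqCrit = fCrit / (fCrit + dfError);       // about 0.66

Show( fCrit, tCrit, rsqCrit );
```

For a single-predictor model the critical F-ratio is the square of the critical t-ratio, which gives a quick consistency check on the two values.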

Not only is JMP an excellent tool for statistical analysis and visual discovery, but it can also be used to provide a level of intuitive understanding of statistics beyond that which is achieved through traditional teaching. The techniques that I have described in this post rely on some relatively simple JSL scripts to (1) repetitively generate regression models based on randomly sampled data and (2) extract summary statistics from the Bivariate report window. I’ve not listed the actual code for these scripts because this is a post about statistical learning, not about programming – however, feel free to contact me if you would like details of the scripts.

]]>

From the JMP menu system select *View>Add-Ins…*.

From here you are presented with a list of add-ins that have been registered for your installation of JMP.

Select the add-in that you are interested in. Below the list you will see a link for the home folder. Clicking on this link will open the folder for the selected add-in.

The contents of the home folder will be specific to the add-in. The folder may just contain a single script file, or may contain a more elaborate file structure.

Typically an add-in produced by Pega Analytics contains the following file structure:

]]>The idea of object-orientation is not new to JSL, but user-created objects require a complex code structure that wraps data and functions into namespaces (for example, see the navigation wizard).

In version 14, there is explicit support for classes which dramatically simplifies the process of creating reusable objects. I thought I would introduce them by means of a real-world example: a notification window that shows progress when stepping through a sequence of time-consuming steps.

I have a class *ActivityStatusClass* that displays an activity status window as illustrated above. Before any class is used it first has to be defined. But the benefit comes from its use, so let me focus on that first. Here is an example of creating an instance of the ActivityStatusClass to show progress as I fit 4 nonlinear models (for the sake of clarity of code the nonlinear modelling is not relevant and has been replaced by wait statements!):

    // reference to the ActivityStatusClass
    include("ActivityStatusClass.jsl");
    // list of notification messages for each activity
    lst = {
        "Linear Kinetic Model",
        "Accelerating Kinetic Model",
        "Decelerating Kinetic Model",
        "Power Kinetic Model"
    };
    // create an activity status object
    status = newObject(ActivityStatusClass(lst));
    // do activity 1 ...
    status:start();
    wait(2);
    // do task 2
    status:startNextTask();
    wait(2);
    // do task 3
    status:startNextTask();
    wait(2);
    // do task 4
    status:startNextTask();
    Wait(2);
    // finish the activity status (closes the window) and delete the object
    status:finish();
    status << delete;

**Line 2**: the class has been defined in a separate jsl file (of the same name). The definition is referenced by *including* the file.

**Line 11**: a new object is instantiated by calling the function **newObject**. The type of the object is given by the name of the class (no quotes). This particular object requires a single parameter which is the list of notifications created on line 4.

**Line 13**: the method *start* launches the notification window and displays the first notification message. Notice that the notation to invoke the method for the object uses a colon, as in *status:start()*.

**Lines 16, 19, 22**: each time the method *startNextTask* is invoked, the previous activity is marked as complete (an hour glass icon is replaced with a green tick mark) and the notification message for the next activity is displayed.

**Line 25:** the *finish* method marks the final activity as complete and after a momentary pause the status window is closed.

**Line 26:** housekeeping: the object still resides in memory but can be removed by sending the delete message to it. Note that this is an object message and uses ‘<<’ rather than the namespace colon notation.

So that’s how the object is used. In general there are a few common steps:

- Reference the class definition (typically via an *include* statement)
- Create an object using the *newObject* function
- Invoke object-specific *methods*
- *Delete* the object when it is no longer required
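The four steps can be sketched once more, this time also calling the optional setter methods (*setTitle* and *setDelay*) that belong to this class. The activity names, title and delay value are hypothetical, and the setters are called before *start* on the assumption that the window picks up its title when it is created:

```jsl
// 1. reference the class definition
include("ActivityStatusClass.jsl");

// 2. create an object (hypothetical activity names)
lst = {"Import Data", "Fit Model"};
status = newObject(ActivityStatusClass(lst));

// optional configuration before the window is launched
status:setTitle("Analysis Progress");
status:setDelay(0.5);

// 3. invoke object-specific methods
status:start();
wait(1);                  // stand-in for the first activity
status:startNextTask();
wait(1);                  // stand-in for the second activity
status:finish();

// 4. delete the object when it is no longer required
status << delete;
```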

Here is the definition of the class (as defined in the file “ActivityStatusClass.jsl”):

    namesDefaultToHere(1);

    /*------------------------------------------------------------------------------
        Class: ActivityStatusClass
    ------------------------------------------------------------------------------*/
    defineClass("ActivityStatusClass",
    {
        // properties
        lstActivities = empty(),    // list of tasks descriptions (strings)
        numActivities = 0,
        currentActivityIndex = empty(),
        winTitle = "Activity Status",
        width = 300,
        delay = 0,
        iconBusy = newImage(charToBlob( "..." )),
        iconDone = newImage(charToBlob( "..." )),
        // display box lists
        nwStatus = empty(),
        lstActivityStatusVLB = {},
        lstActivityStatusTB = {},

        /*----------------------------------------------------------------------
            Constructor: constructor

            Input Parameters:
                lstActivities - string list of activity descriptions
                title - (optional) title for the status window
        ----------------------------------------------------------------------*/
        _init_ = method({lstActivities},
            numActivities = nItems(lstActivities);
        ),

        /*----------------------------------------------------------------------
            Property: Title (get/set)
            The title of the status window
        ----------------------------------------------------------------------*/
        getTitle = method({},
            return(winTitle)
        ),
        setTitle = method({title},
            winTitle = title
        ),

        /*----------------------------------------------------------------------
            Property: Width (get/set)
            The width of the text boxes used to display the activity description
        ----------------------------------------------------------------------*/
        getWidth = method({},
            return(width)
        ),
        setWidth = method({value},
            width = value
        ),

        /*----------------------------------------------------------------------
            Property: Delay (get/set)
            Add a delay to the display before the next step is performed
        ----------------------------------------------------------------------*/
        getDelay = method({},
            return(delay)
        ),
        setDelay = method({value},
            delay = value
        ),

        /*----------------------------------------------------------------------
            Method: startNextTask
            Update the status display to show start of next
        ----------------------------------------------------------------------*/
        startNextTask = method({},
            vlb = lstActivityStatusVLB[currentActivityIndex];
            (vlb<<child)<<delete;
            vlb << append(PictureBox(iconDone));
            tb = lstActivityStatusTB[currentActivityIndex];
            tb << setFontStyle("Normal");
            tb << reshow;
            wait(0);
            currentActivityIndex++;
            vlb = lstActivityStatusVLB[currentActivityIndex];
            vlb << append(PictureBox(iconBusy));
            tb = lstActivityStatusTB[currentActivityIndex];
            tb << setFontStyle("Bold");
            tb << reshow;
            wait(delay);
        ),

        /*----------------------------------------------------------------------
            Method: start
            Launches the status window
        ----------------------------------------------------------------------*/
        start = method({},
            nwStatus = NewWindow(winTitle,
                showMenu(0), showToolbars(0),
                BorderBox(top(20),bottom(40),left(80),right(20),
                    lub = LineupBox(nCol(2),spacing(10))
                )
            );
            lstActivityStatusVLB = {};
            lstActivityStatusTB = {};
            for (i=1,i<=numActivities,i++,
                lub << append(vlb =VListBox());
                lub << append(tb= Text Box(lstActivities[i],<<Set Width(width)));
                insertInto(lstActivityStatusVLB,vlb);
                insertInto(lstActivityStatusTB,tb);
            );
            currentActivityIndex = 1;
            vlb = lstActivityStatusVLB[1];
            vlb << append(PictureBox(iconBusy));
            tb = lstActivityStatusTB[1];
            tb << setFontStyle("Bold");
            tb << reshow;
            wait(delay);
        ),

        /*----------------------------------------------------------------------
            Method: finish
            Mark the current task completed then close the status window
        ----------------------------------------------------------------------*/
        finish = method({},
            vlb = lstActivityStatusVLB[currentActivityIndex];
            (vlb<<child)<<delete;
            vlb << append(PictureBox(iconDone));
            tb = lstActivityStatusTB[currentActivityIndex];
            tb << setFontStyle("Normal");
            tb << reshow;
            wait(max(delay,1.0));
            //nwStatus << closeWindow;
        )
    }
    );

I’m not going to try and rationalise the code structure for the class definition, but here are a few observations:

- Properties of the class can be identified by declaring variables with initialised values (see lines 8 through to 20); these act as public variables in that they are accessible to the code of all the class methods without any qualification.
- Each property is delimited with a comma
- Each method is standard JSL (i.e. expressions separated by semicolons). Arguments can be optionally specified.
- Unlike functions, the scope of variables inside a method definition does not need to be explicitly specified (there is no *default local*). Variables are local to the method, except for those that have been defined outside of the method call that act as class-level properties.
- There is a special initialisation method that has the name *_init_*.
- Each method is delimited by a comma (but don’t put a comma after the last method!)

With regard to this specific class definition the lines 15 and 16 define image icons based on a very long character string that would overwhelm the widget that I use to display code snippets – therefore I have replaced the text with “…”. This will mean that the code will not run as intended – however, once JMP is officially being distributed I will post the code on the file exchange.

]]>