In this post I take a closer look at the different model types that are available to support curve fitting.

Each of the model categories contains a variety of models with differing numbers of parameters:

If you use linear regression (standard least squares) you will be familiar with this type of model:

. . .

Whilst gradient descent algorithms can be used to estimate these parameters, the primary role of curve fitting is to fit parameters that form part of a nonlinear equation – typically representing some mechanistic model relating to a scientific application. All other model types fall into this category of nonlinear models.

The basic sigmoid function takes the following form:

It characterises the case where an unbounded x variable is transformed into a y variable contained within the range 0 to 1. It is therefore particularly useful for modelling a response that represents a proportion.
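The basic sigmoid is easy to sketch directly; here it is in Python (purely for illustration), using the standard form y = 1/(1 + e^(-x)):

```python
import math

def sigmoid(x):
    # standard logistic sigmoid: maps the whole real line into (0, 1)
    return 1.0 / (1.0 + math.exp(-x))

print(sigmoid(0))   # 0.5 -- the inflection point
print(sigmoid(5))   # close to 1
print(sigmoid(-5))  # close to 0
```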

The logistic function introduces one or more parameters to generalise the behaviour of this S-curve. For example, a parameter can be introduced to control the growth rate:

The curve has a point of inflection at x=0. The introduction of a second parameter allows the location of this inflection point to be adjusted:

This is the formula for the *Logistic 2P* model.

Whilst y is a continuous response, these types of model are often used to model a binary outcome (0 or 1). In this case the y value is interpreted as the probability of an outcome of 1 given a specified value of x.

The *Logistic 3P* model introduces a third parameter allowing the curve to have an upper asymptote other than 1:

And the *Logistic 4P* model provides a description of both upper and lower asymptotes with parameters c and d:
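One common parameterisation of the four-parameter logistic (used here for illustration; JMP's exact form may differ slightly) is y = c + (d − c)/(1 + e^(−a(x−b))), with c and d the lower and upper asymptotes:

```python
import math

def logistic_4p(x, a, b, c, d):
    # a: growth rate, b: inflection point,
    # c: lower asymptote, d: upper asymptote
    return c + (d - c) / (1.0 + math.exp(-a * (x - b)))

# at the inflection point x = b the response is midway between the asymptotes
print(logistic_4p(2.0, a=1.5, b=2.0, c=10.0, d=50.0))  # 30.0
```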

There is also a *Logistic 5P* model that allows the curve to be asymmetric about the inflection point:

The logistic functions described earlier typically represent the case where the response is derived from the probability of a binary outcome. Alternatively, we can model the S-curve on the basis that it represents the cumulative distribution function Φ of a Normal distribution:

where the parameter *a* represents the growth rate and *b* is the point of inflection. This is the *Probit 2P* model.

The *Probit 4P* model introduces parameters to control the lower and upper asymptotes:

The 5-parameter logistic model describes an S-shaped curve that is asymmetric about the inflection point. A Gompertz curve can be considered to be a special case of this model. As described in Wikipedia, the model was first proposed as a description of human mortality.

A four-parameter model is also available that provides parameters for both lower and upper asymptotes.

Another S-shaped curve is the *Weibull Growth* model, often used in reliability engineering:

Where *a* is the upper asymptote, *b* is the growth rate, and *c* is the inflection point.

*Exponential 2P* is the basic exponential model:

The parameter *b* is a scaling parameter and λ represents the growth rate. If λ is negative, then it represents the rate of decay.

The *Exponential 3P* model adds an additive term to control the asymptote of the curve:

An alternative parameterisation is the mechanistic growth model:

JMP also supports bi-exponential models. These models are the sum of two exponentials and appear as 4-parameter and 5-parameter models:

Growth of cells in a bioreactor can be characterised by a number of phases:

JMP’s *Cell Growth 4P* model takes the form:

where:

a = peak value if mortality rate is zero

b = response at time zero

c = cell division rate

d = cell mortality rate

The bell-shaped curve associated with a Normal distribution is more generically described by a *Gaussian* function of the form:

The Lorentzian curve is superficially similar to the Gaussian bell-shape, but has heavier tails:

Peak curves are used, for example, to model spectroscopic peaks.

For both models the parameter *a* corresponds to the maximum value of the peak; *b* controls the width of the peak, and *c* is the critical point: the value of x where the curve reaches its maximum value.
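As a sketch (my assumed parameterisations, for illustration only; JMP's forms may differ in how the width parameter enters), the Gaussian peak can be written a·exp(−(x−c)²/(2b²)) and the Lorentzian a·b²/((x−c)² + b²):

```python
import math

def gaussian_peak(x, a, b, c):
    # a: peak height, b: width, c: peak location
    return a * math.exp(-((x - c) ** 2) / (2.0 * b ** 2))

def lorentzian_peak(x, a, b, c):
    # a: peak height, b: half-width, c: peak location
    return a * b ** 2 / ((x - c) ** 2 + b ** 2)

# both curves reach their maximum value a at x = c,
# but the Lorentzian decays much more slowly in the tails
print(gaussian_peak(5.0, a=2.0, b=1.0, c=5.0))    # 2.0
print(lorentzian_peak(5.0, a=2.0, b=1.0, c=5.0))  # 2.0
```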

Pharmacokinetic models seek to describe the kinetics of a drug once it has been administered into the body. The *One Compartment Oral Dose* model has the following parameterisation:

where:

a = area under the curve

b = elimination rate

c = absorption rate

JMP also supports a *Two Compartment IV Bolus Dose* model, but that is beyond my LaTeX skills!

Named after the biochemists Leonor Michaelis (1875-1949) and Maud Menten (1879-1960), this model is used to describe enzyme kinetics:

The parameter *a* represents the maximum reaction rate (in the literature often referred to as V_{max}), and the *b* parameter (in the literature often referred to as the Michaelis constant K_{m}) is the value of x such that the response is half V_{max}; it is an inverse measure of the substrate's affinity for the enzyme.
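The model itself is simple enough to sketch directly (Python here, for illustration): v = V_{max}·x/(K_{m} + x), so the response is exactly half of V_{max} when x = K_{m}:

```python
def michaelis_menten(x, vmax, km):
    # vmax: maximum reaction rate; km: substrate concentration
    # at which the rate reaches half of vmax
    return vmax * x / (km + x)

print(michaelis_menten(3.0, vmax=10.0, km=3.0))  # 5.0 -- half of vmax
```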

We are not limited to selecting from a pre-defined library of curve types. Any nonlinear function can be expressed as a column formula and fitted using the *Nonlinear Platform*. In fact it is one of my most frequently used platforms. But that is a topic for another day.

A string variable is created by enclosing the text within double-quotes.

str = "this is a text string"

The script editor will automatically colour code string variables as purple (this behaviour can be customised under preferences).

What if the text string itself contains the double-quote character? In this instance the quote character needs to be preceded by a special escape notation that indicates that the character is part of the string and not the delimiter of the string:

str = "my name is \!"David\!"";

If a string requires a large number of double-quotes the use of escape sequences becomes tedious and error prone. Fortunately JSL provides an alternative notation that can be used:

str = " . . . \[ . . . ]\ . . . "

Text within the square brackets can be written without the need to use escape sequences, for example:

str = "\[ my name is "David"]\"

Often a string variable is constructed by adding multiple strings together. String addition is referred to as concatenation and is performed using the concatenation operator: ||.

myName = "David"; str = "My name is " || myName;

In the above example the string variable *str* was constructed from a literal quoted string plus a second variable that contained a string. Sometimes the second variable is numeric, in which case it must be explicitly converted to a string. This can be performed using the *Char* function:

age = 25; str = "My age is " || Char(age) || " – yeah, I wish!"

In the above example it was necessary to concatenate three elements in order to construct the string that had the age variable embedded in it. Another technique is to use string substitution.

We could write the above string with a place-holder for the age:

str = "My age is pAge – yeah, I wish!"

And now we can substitute the actual value for the placeholder:

SubstituteInto(str, "pAge", Char(age));

Note that the value being substituted needs to be a string, so in this example the *Char* function is still required to convert age from a number to a string.

With numeric values it is often the case that we want to perform some form of conditional logic based on their value:

If (yield > 80, status = "good" , status = "poor" );

With string values we can also perform conditional logic:

If (status == "good", colour = "green" , colour = "red" );

Or equivalently, using the *Match* function:

Match(status, "good", colour = "green", "poor", colour = "red" );

String comparisons are case-sensitive. JMP has functions to transform case: *lowercase*, *uppercase*, and *titlecase*. So a more robust comparison would be:

Match(lowercase(status), "good", colour = "green", "poor", colour = "red" );

The number of characters in a text string can be determined using the *Length* function:

Example:

str = "Hello World"; nChars = Length(str); // nChars = 11

To locate portions of text within a string, JMP provides a variety of functions, including *word*, *words*, *left*, *right*, and *substr*.

*strWord = Word( n, string, <delimiter> )*

The *word* function returns the n’th word within a string. The default delimiter for each word within the string is a space character. However, the optional delimiter field can be used to identify an alternate character to be used.

*Example:*

str = "Hello 'John'";
secondWord = word(2, str); // secondWord = "'John'"
strName = word(2, str, "'"); // strName = "John"

**lstWords = Words( string, <delimiter> )**

Whereas the *word* function identifies a single specific word within a string, the *words* function returns all the individual words as items of a list:

If an empty string is specified as the delimiter then each character of the string is treated as a separate word.

*Example 1: isolating individual characters*

strName = "John"; lstChars = Words(strName,""); // lstChars = {"J", "o", "h", "n"}

*Example 2: counting the number of words*

Number Of Words = Function({str}, {default local},
	nitems(words(str))
);
str = "this is a sentence";
n = Number Of Words(str); // n = 4

*Example 3: determining the file extension of a file*

filePath = "c:\documents\big class.jmp";
w = Words(filePath, ".");
filetype = w[nitems(w)]; // filetype = "jmp"
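For comparison, the same file-extension trick in Python is a one-line split (illustrative only; the path is the example one from above):

```python
# split on "." and take the last piece, exactly as the JSL example
# does with Words() and nitems()
file_path = r"c:\documents\big class.jmp"
file_type = file_path.split(".")[-1]
print(file_type)  # jmp
```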

**strLeftMostChars = Left( string, n, <filler> )**

The *left* function can be used to extract the *n* leftmost characters of a string. An optional filler character can be specified for instances where the string may be less than n characters in length.

Similarly there is a *right* function to extract the *n* rightmost characters.

*Example : determining whether a file is a JMP table*

filePath = "c:\documents\big class.jmp";
if (right(filePath,4) == ".jmp", isJmpTable = 1 , isJmpTable = 0 );

**strSubstring = Substr( string, offset, <count> )**

The *substr* function returns part of the string composed of *count* characters starting at position *offset*.

*Example:*

str = "Hello 'John'";
strQuotedName = Word(2, str);
strName = Substr(strQuotedName, 2, Length(strQuotedName) - 2); // strName = "John"

Pattern matching in JSL provides an exceptionally powerful and flexible mechanism for searching and manipulating text strings. Central to pattern matching is the creation of variables that contain pattern definitions. These definitions are then processed by pattern matching functions. I will illustrate the principle based on a scenario that I am currently working on.

I have a column formula that contains a model of the form:

Where K1, K2, N1 and C are parameters that are estimated using the nonlinear platform. I want to inspect the formula for the column and retrieve the K2 value, which represents the activation energy for this kinetic equation.

I can retrieve the formula by sending the *getFormula* message to the column; this is what it looks like:

Parameter( {K1 = 0.016232665, K2 = -683.35374, N1 = 1.6719301, C = 0.06863445}, K1 * Exp( -K2 / :Temperature) * :RH ^ N1 * :Time + C )

Notice that there is a pattern to how the parameter values are specified:

*K2 = <k2_value>,*

I can describe this pattern by defining a pattern matching variable:

pattern = "K2 = " + PatArb()>>k2_value + ", "

This pattern variable says “*find K2 followed by an equals sign, then some arbitrary text, then a comma*”. It also stores the arbitrary text in the variable *k2_value*. The value is arbitrary in that it doesn’t have a known value, but it is the value I am seeking.

Now I can apply the pattern to a string representation of the formula, check that I have a successful match, and if so convert the string value of *k2_value* to a number:

fml = char( :Y << get formula );
pattern = "K2 = " + PatArb() >> k2_value + ",";
success = patMatch(fml, pattern);
if (success,
	k2 = num(k2_value);
,
	write("Failed to find k2")
);
// now k2 = -683.35374
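For readers more at home with regular expressions, the same extraction sketched in Python (illustrative only; the JSL pattern matcher is what the post actually uses):

```python
import re

fml = ("Parameter( {K1 = 0.016232665, K2 = -683.35374, "
       "N1 = 1.6719301, C = 0.06863445}, "
       "K1 * Exp( -K2 / :Temperature) * :RH ^ N1 * :Time + C )")

# find "K2 = <value>," and capture the value, like the JSL PatArb pattern
m = re.search(r"K2 = (.*?),", fml)
if m:
    k2 = float(m.group(1))
    print(k2)  # -683.35374
else:
    print("Failed to find k2")
```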


Version 12 of JMP introduced the **Zip Archive** object for manipulating zip files. A zip archive object is created using the *Open* function:

za = open("Data.zip", zip);

The first thing that we might want to do is to determine the contents of the zip file. The *dir* message can be sent to a zip archive object to produce a directory listing of its contents:

To read the contents of the first JMP table within the zip file I can send the *read* message to the zip archive object:

blobdata = za << read( lst[1], format(blob) )

The JMP file is in a binary format, not straight text. I deal with that by adding 'format(blob)' to the message. The data that describes the contents of this file is contained in a variable that I have named *blobdata*.

To create the physical file I need to save this data:

path = "c:\documents\" || lst[1];
savetextfile(path, blobdata);

Now I can access the JMP data in the usual way:

dt = open(path);

Now that I have established the code framework for handling the contents of the zip file, I can iterate over all of the contained files:

za = open("Data.zip", zip);
lst = za << dir;
for (i = 1, i <= nitems(lst), i++,
	blobdata = za << read(lst[i], format(blob));
	path = "c:\documents\" || lst[i];
	savetextfile(path, blobdata);
);
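Python's standard library has a direct equivalent, which may help clarify what the JSL messages are doing. This is a self-contained sketch (the archive is built in memory and the file name and contents are invented):

```python
import io
import zipfile

# build a small zip in memory so the example is self-contained
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as z:
    z.writestr("big class.jmp", b"...binary JMP data...")

# list the contents and read each entry as raw bytes --
# the equivalents of  za << dir  and  za << read(name, format(blob))
extracted = {}
with zipfile.ZipFile(buf) as za:
    for name in za.namelist():
        extracted[name] = za.read(name)

print(list(extracted))  # ['big class.jmp']
```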

Let’s say that I want to perform the following tasks:

- Allow the user to select the zip file
- Present the user a list of files contained within the zip file
- Allow the user to select a file
- Open the selected file

Here is the code:

namesdefaulttohere(1);
// let the user pick the zip file
zipPath = pickfile("Select File:", , {"Zip Files|zip"});
if (ismissing(zipPath), throw());
// list contents of zip file
za = open(zipPath, zip);
lst = za << dir;
// let the user select a file from the list
nw = NewWindow("Selection", <<modal,
	<<onClose( sel = lb << getSelected ),
	BorderBox(top(10), bottom(20), left(20), right(20),
		VListBox(
			TextBox("Select a file:"),
			SpacerBox(size(0, 6)),
			lb = ListBox(lst, maxselected(1))
		)
	)
);
if (nw["Button"] == -1, throw());
// grab the name of the file from the list selection
if (nitems(sel) > 0,
	file = sel[1]
,
	throw()
);
// unzip the file
blob = za << read(file, format(blob));
// open the table
filePath = convertfilepath(file, base("$TEMP"));
show(filePath);
savetextfile(filePath, blob);
dt = open(filePath);


TransparentRGB = function({r, g, b, opacity = 0.65}, {default local},
	red = opacity*r + (1-opacity);
	green = opacity*g + (1-opacity);
	blue = opacity*b + (1-opacity);
	return(RGBColor(red, green, blue));
);


// purge the temporary folder
files = filesIndirectory(tmpDir);
for (i = 1, i <= nItems(files), i++,
	path = convertFilePath(files[i], base(tmpDir));
	if (!isDirectory(path),
		deleteFile(path)
	)
);


To enable code folding, enable the option under the *Script Editor* section of *Preferences*, found under the *File* menu.

I use it with my user-defined functions to give me an overview of contents within an include file. Combined with appropriately placed comments this helps to summarise the contents of a library of functions. Here is an example:


The calculations of process capability analysis can be reversed so that, for a given set of target capability values, the associated specification limits can be generated. The calculation is straightforward for a normal distribution but needs a bit more thought when it comes to asymmetric distributions.

Traditionally process capability is defined with respect to a normal distribution. The capability index is the ratio of the specification width to process width:

Where σ is the standard deviation of the process variation and two-sided spec limits (USL,LSL) are assumed. The width of the specification window can be identified simply by rearranging the above formula:

In the case where we haven’t established our spec limits then we can substitute a target value for Cp to generate the width of the specification window (USL-LSL).

If our spec limits are symmetric with respect to our target, and the process is on-target, then the symmetry can be used to determine the levels of the lower and upper specs:

,

Where TGT represents our process target. I’ve used the ‘target’ superscript to make it more explicit that I am using a Cp value that represents the target value, and not the value derived from the data.

I can illustrate this with a specific example. Suppose that we have a process with a mean value of 100 and a standard deviation of 10 and that we want to identify spec limits that would result in a Cp of 2.0.

I can simply plug the numbers into the above formulae:
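The arithmetic is trivial; sketched in Python, with LSL = mean − 3σ·Cp and USL = mean + 3σ·Cp:

```python
mean, sigma = 100.0, 10.0
cp_target = 2.0

# symmetric limits: the spec window is 6*sigma*Cp wide, centred on the mean
lsl = mean - 3.0 * sigma * cp_target
usl = mean + 3.0 * sigma * cp_target
print(lsl, usl)  # 40.0 160.0
```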

The reverse calculation requires simple algebraic operations applied to the standard definitions of process capability indices. No knowledge of statistical theory is required.

The above calculations can be verified by taking a (large) random sample from a normal distribution and performing a process capability study using the proposed specification limits:

The simulation confirms that my proposed specification limits yield a Cp value of 2.0.

The reverse calculation for non-normal (asymmetric) distributions is more complex and first it is necessary to understand how the definition of capability indices is generalised for distributions such as lognormal and Weibull.

Whilst capability studies can be summarised by simple indices, they encapsulate the notion of defective parts per million. To illustrate this let’s take a simple case where we have a process on target and a process capability of 1. In this instance, by definition, the specification width is identical to the process width of 6σ.


We know that for a normal distribution 99.73% of the data is contained with this range. That’s another way of saying that 0.27% of data is outside of spec, equivalent to 2700 defective parts per million (dppm). The conversion of capability indices to dppm is therefore dependent on using the underlying distribution to generate probability outcomes.

Without loss of generality let’s assume that we have asymmetric process data that can be characterised using a Weibull distribution. If 6σ is the width of a normal distribution then what is the width of a Weibull distribution?

The appropriate way to define the width of the Weibull distribution is so that it has equivalent probability outcomes to the normal distribution: the width can be defined so that it contains 99.73% of the data. Furthermore this interval can be located in such a way that 0.135% of the data falls either side of this interval.

To illustrate this principle I have generated some random data sampled from a lognormal distribution and through a process of trial and error identified specification limits that satisfy the above criteria:

Having established the principle for generalising capability indices I now want to explore the calculations that are required to generate proposed spec limits.

The 6σ width that we associate with a normal distribution can be generalised to an interval that contains 99.73% of the data. More specifically:

Where Pu is the upper percentile (100%-0.135%) and Pl is the lower percentile (0.135%).

This can be used to generalise the definition of process capability:

Furthermore the one-sided capability indices can be generalised by replacing the average value (Ybar) with the median (P50):

and

Let me take the case where the process is on-target. In that case Cpl = Cpu = Cp.

Therefore

I can rearrange this to get an expression for LSL:

Taking the same approach for Cpu reveals:

As usual I want to verify my calculations, which I can do using simulated data. The first step is for me to generate a sample of data selected randomly from a Weibull distribution:

Let me assume that my target value for Cp is 1.50. I can use this value in my expressions for the specification limits:

But I also need to calculate the percentiles P50, Pu and PL. I can do this using the *Weibull Quantile* function:

Using the alpha and beta parameter estimates from the fitted distribution I calculate the following values:

Now I have all the information that I need to calculate the spec limit values required to generate my desired Cp goal. In fact let me do it as a short JSL script:
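In outline the calculation looks like this, sketched in Python with hypothetical α and β values (the post's fitted parameters come from its own random sample, so these numbers won't reproduce 70 and 319). The Weibull quantile is Q(p) = β(−ln(1−p))^(1/α), and the limits follow from LSL = P50 − Cp(P50 − Pl) and USL = P50 + Cp(Pu − P50):

```python
import math

def weibull_quantile(p, alpha, beta):
    # inverse of the Weibull cdf  F(x) = 1 - exp(-(x/beta)^alpha)
    return beta * (-math.log(1.0 - p)) ** (1.0 / alpha)

alpha, beta = 10.0, 100.0   # hypothetical fitted shape and scale
cp_target = 1.5

p50 = weibull_quantile(0.5, alpha, beta)      # median
pl = weibull_quantile(0.00135, alpha, beta)   # lower 0.135% point
pu = weibull_quantile(0.99865, alpha, beta)   # upper 0.135% point

# rearranged one-sided capability definitions
lsl = p50 - cp_target * (p50 - pl)
usl = p50 + cp_target * (pu - p50)
print(round(lsl, 1), round(usl, 1))
```

Note that the two limits sit at different distances from the median, reflecting the asymmetry of the distribution.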

I estimate the specs to be 70 and 319.

Am I right? I can use JMP to perform the process capability analysis using these numbers:

Spot on!

In this post I will explore the relationship between a lognormal distribution and a normal distribution.

JMP has a collection of functions for generating random data sampled from a specific distribution:

So it’s easy for me to generate data for both a normal and lognormal distribution, and to compare them:

Now that I can look at the lognormal distribution let me take a closer look at its parameters. To generate random data from a lognormal distribution I use the following function:

Here is the distribution using mu=4.6 and sigma=0.35:

What I find confusing is that sigma is not the standard deviation of the data and mu is not the mean. Presumably then, they relate to the parameters of the associated normal distribution. Let’s see. I can create a new variable Z which is the log transform of the data:

Hey presto – the mean and standard deviation match the parameters I used for the *Random Lognormal* function.

. . . I want to generate a lognormal distribution and I want to specify the values for the mean and standard deviation? Let me take a specific example:

I want to generate a lognormal distribution with the same mean and standard deviation as the above data.

The calculation is more complex than you might expect. If *m* and *s* represent the mean and standard deviation of the data, then the parameters for the lognormal distribution are given by:

Applying these equations to the above data yields values of -0.005 and 0.1 respectively.
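To make the conversion concrete, here it is sketched in Python, using the standard moment-matching formulas μ = ln(m²/√(m² + s²)) and σ = √ln(1 + s²/m²):

```python
import math

# desired mean and standard deviation of the lognormal data
m, s = 1.0, 0.1

# standard conversion to the lognormal mu and sigma parameters
mu = math.log(m ** 2 / math.sqrt(m ** 2 + s ** 2))
sigma = math.sqrt(math.log(1.0 + (s / m) ** 2))
print(round(mu, 3), round(sigma, 3))  # -0.005 0.1
```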

Finally, I can verify these numbers by using them with the Random Lognormal function to generate some sample data. If I have the correct parameters then the data will have a mean of 1.0 and a standard deviation of 0.1:


Process capability indices are a convenient way of summarising process performance. They contain information about how far a process has shifted from the mean as well as the expected number of defective parts per million. In an earlier post I showed the relationship between the capability indices and the process shift. In this post I will use the indices to calculate the expected number of defective parts per million. The calculation will involve probabilities that are calculated by reference to a Normal distribution. I will use the JSL script editor to perform these calculations.

Let’s be clear. JMP reports dppm figures when it calculates the capability indices, but nonetheless I think it’s important to understand how the information is generated rather than just blindly follow software output. As a case-study, it is also a good illustration of using the script editor to utilise the probability distribution functions available in JMP.

First I want to start with a simplified scenario where the process is on target and I can work solely with the Cp index. If the process has upper and lower specs of U and L respectively and the process standard deviation is σ then

This is the ratio of the specification window to the process width. Using 6σ as a measure of process width is just a convention: when I was first introduced to quality methods it was quite common to define the process width as 5.15σ.

Let me start with the case of Cp = 1 i.e. the spec window and the process width are identical. Now the calculation of dppm is the same as calculating the probability of an observation being outside the 6σ width.

With any probability distribution (such as a Normal distribution) there are two variations in how the distribution is enumerated: a probability density function (pdf) and a cumulative distribution function (cdf). In JMP the function that generates the pdf for a Normal distribution is **Normal Density**, whereas the cdf is called **Normal Distribution**.

The cdf for the Normal distribution takes 3 arguments:

*Normal Distribution(z, mu, sigma )*

It’s good to start with a trivial test case where the answer is obvious! If I take a standard normal i.e. with mean of zero and standard deviation of 1 then I know by symmetry that 50% of the data will be less than or equal to zero:

The result is a probability. Of course I could have multiplied by 100 if I wanted to explicitly express the result as a percentage, or by 10^6 if I wanted parts per million.

If I wanted the probability of being less than 3 standard deviations from the mean I could write:

To calculate the probability of being within the range +/- 3σ I can write:

With probability calculations it is often easier to calculate the logical opposite of our goal and then take one minus this value to produce the final result; this is the case with dppm calculations. In the calculation below p is the probability of being within the +/- 3σ range: so (1-p) gives me the probability of being outside the range. The 10^6 scale factor gives me the result in terms of parts per million:

This is the well-known result that 0.27% of data is outside the 6σ width of a Normal distribution.

Recall I said that an alternative definition of the process width is 5.15σ. The motivation for this definition is that the proportion outside this process width is very close to 1% (dppm looks worse but calculations are easier!).

Having illustrated the probability calculation for the case that the spec window is the same size as the process width, let me now take the case of Cp=2 i.e. the spec window has a width twice the process width. The width of the process is now 12σ and all I need to do is change the calculation to use a range +/-6σ:

For this we need to be thinking in terms of parts per billion!
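Both calculations are easy to check outside JMP with Python's error function, using Φ(z) = ½(1 + erf(z/√2)):

```python
import math

def normal_cdf(z):
    # standard normal cumulative probability, via the error function
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# Cp = 1: proportion outside +/- 3 sigma, expressed in dppm
dppm = (1.0 - (normal_cdf(3.0) - normal_cdf(-3.0))) * 1e6
print(round(dppm))  # 2700

# Cp = 2: proportion outside +/- 6 sigma -- parts per billion territory
ppb = (1.0 - (normal_cdf(6.0) - normal_cdf(-6.0))) * 1e9
print(round(ppb, 2))  # about 2 defective parts per billion
```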

So far my calculations have assumed that I have a process on target, which allows Cp to be a sufficient descriptor of process performance. Now I want to consider both Cp and Cpk. Implicit in these two statistics is a process shift:

*shift = 3σ(Cp-Cpk)*

(see my previous post for the derivation of this result).

For purposes of illustration I will use the classic criteria associated with six sigma methodology: Cp=2 and Cpk=1.5. This corresponds to a process shift of 1.5σ.

Without loss of generality I can assume that the shift is positive in relation to the process target (which I’ll assume to be midpoint between the spec limits) as illustrated below:

When on target the process mean is 6σ from each spec limit. With the process shift the mean is 4.5σ from the upper limit and 7.5σ from the lower limit. The number of defective parts per million will correspond to the proportion which is outside the range -7.5σ to +4.5σ:

This is the benchmark result of 3.4 parts per million defective for a "six sigma" process.
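This tail calculation is also easy to check with Python's error function (Φ(z) = ½(1 + erf(z/√2))); almost all of the probability comes from the upper tail at 4.5σ:

```python
import math

def normal_cdf(z):
    # standard normal cumulative probability
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# with a +1.5 sigma shift, the spec limits sit at -7.5 and +4.5
# process-sigmas from the mean
dppm = (1.0 - (normal_cdf(4.5) - normal_cdf(-7.5))) * 1e6
print(round(dppm, 1))  # 3.4
```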
