Pattern matching is a powerful technique for searching text strings and for extracting or replacing the parts that match a pattern. In this post I will illustrate some of the basic principles of pattern matching. This will be followed by more advanced scenarios.
Pattern Matching Variables
Let’s say I have the following text string:
    str = "<header>eqpt=e1,batch=a1</header><data>a=10,b=20,c=30,d=40,e=50</data>";
Such a string (but probably much larger!) might be derived from loading data from an external file (using the function Load Text File).
Let’s say I want to extract the name of the equipment (“e1”) from this text string. I can locate the value by noting that it is preceded by the text “eqpt=” and is followed by a comma. If I assume that these values will be constant for all text strings then I can create a JSL pattern variable:
    pattern = "eqpt=" + PatArb() + ",";
PatArb is a pattern matching function that matches an arbitrary pattern of text (in this example that text is “e1”).
Pattern Matching
Now that I have described the pattern that I am interested in I can ask whether my text string contains the pattern:
    isMatch = Pat Match( str, pattern );
isMatch will have the value 1 (true) if the match is found, otherwise it will be 0 (false).
Extracting Matched Patterns
Knowing that I have found a match is useful but it doesn’t tell me the value of the match – remember, I want to find the name of the equipment (“e1” in this example).
The information I am interested in corresponds to the arbitrary text string identified by the Pat Arb function. I can ask JMP to store the matched text in a variable using the following notation:
PatArb() >> variableName
Using this notation my pattern variable becomes:
    pattern = "eqpt=" + PatArb()>>eqptName + ",";
Now when I apply the pattern matching I can identify the piece of equipment:
    isMatch = Pat Match( str, pattern );
    If (isMatch,
        Show(eqptName)
    ,
        Print("equipment name not found")
    );
If you run this code you should see in the JMP log window that the variable eqptName has been assigned the value “e1”.
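The same notation can capture more than one value in a single pattern. The following is a minimal sketch, not part of the original example: it assumes a header-only string (hdr) and introduces the variable name batchName purely for illustration, capturing the batch value alongside the equipment name.

```jsl
// Hypothetical header-only string, so that the sketch runs on its own
hdr = "<header>eqpt=e1,batch=a1</header>";

// Two captures in one pattern: each Pat Arb() takes the shortest text
// that still lets the literal text after it match.
// batchName is a name introduced for this sketch.
pattern = "eqpt=" + Pat Arb() >> eqptName + ",batch=" + Pat Arb() >> batchName + "</header>";

If( Pat Match( hdr, pattern ),
    Show( eqptName, batchName ),
    Print( "header fields not found" )
);
```

This relies on the fields appearing in a fixed order (eqpt first, then batch); if the order can vary, a separate pattern per field is probably the safer choice.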
Extracting a Data Assignment
Let’s take a look at another example. In addition to the header section there is also a data section:
    <data>a=10,b=20,c=30,d=40,e=50</data>
In particular it contains assignments with the following pattern:
variableName = value
I want to extract this information. Let’s try the simplest pattern definition that we could use to define this:
    pattern = "<data>" + PatArb()>>variableName + "=" + PatArb()>>value + "</data>";
To apply the pattern I use this code:
    isMatch = Pat Match( str, pattern );
    If (isMatch,
        Show(variableName,value)
    ,
        Print("pattern did not match")
    );
This correctly identifies the variable name “a” but the assigned value is “10,b=20,c=30,d=40,e=50”. It’s taken everything to the right of the equals sign. I can fix that by saying that the value is delimited by a comma:
    pattern = "<data>" + PatArb()>>variableName + "=" + PatArb()>>value + ",";
This works; my log window contains the following output:
    variableName = "a";
    value = "10";
Or at least, it works in the sense that I have extracted the first variable. If I wanted the next variable I would need a different pattern:
    pattern = "<data>" + PatArb() + "," + PatArb()>>variableName + "=" + PatArb()>>value + ",";
Writing a separate pattern for each variable isn’t a viable solution. I want to define a single pattern and use it iteratively.
Iteration with String Replacement
One way of doing this is to apply a pattern to the string, then throw away the part of the string that matched the pattern. Then I apply the pattern again. This is easier to understand by example.
First I am going to simplify the problem by extracting just the data component from the string:
    str = "<header>eqpt=e1,batch=a1</header><data>a=10,b=20,c=30,d=40,e=50</data>";
    strData = "";
    pattern = "<data>" + Pat Arb()>>strData + "</data>";
    Pat Match( str, pattern );
    Show(strData);
The variable strData now contains the string:
“a=10,b=20,c=30,d=40,e=50”
I want to create a pattern to pick out the first variable assignment:
    pattern = PatArb()>>name + "=" + PatArb()>>value + ",";
This new pattern can be applied to the data string:
    isMatch = Pat Match(strData,pattern);
    If (isMatch,
        Show(name,value)
    ,
        Print("no match")
    );
This successfully extracts the first name and value:
    name = "a";
    value = "10";
Now I want to throw the first part of the data string away, so that I can apply the pattern to the subsequent text. The Pat Match function has an optional third argument. This argument defines some replacement text.
If a pattern match is found, then the text that matched the pattern is replaced with the replacement text, and the string variable itself is modified. If I want to throw away the text that matched the pattern I can use an empty string.
    Show(strData);
    isMatch = Pat Match(strData,pattern, "");
    If (isMatch,
        Show(name,value);
        Show(strData);
    ,
        Print("no match")
    );
The log window contains the following output:
    strData = "a=10,b=20,c=30,d=40,e=50";
    name = "a";
    value = "10";
    strData = "b=20,c=30,d=40,e=50";
Notice that the data string now starts with my next variable assignment. Now I just need to re-apply my pattern. I can do this using a While loop. Also, since I am now creating multiple variable name/value pairs, it is more convenient to place them in an associative array:
    arrData = Associative Array();
    isMatch = 1;
    While(isMatch,
        isMatch = Pat Match(strData,pattern, "");
        If (isMatch,
            arrData[name] = Num(value)
        )
    );
    Show(arrData);
My log window looks like this:
    arrData = ["a" => 10, "b" => 20, "c" => 30, "d" => 40];
I’ve successfully iterated through the text string creating pairs of assignments stored in an associative array.
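Once the pairs are in an associative array they can be looked up by name or listed in turn. The following is a small sketch of that, assuming a hand-built two-entry array rather than the parsed one, so that it runs on its own; the variable keys and the loop counter i are names introduced here for illustration.

```jsl
// Hand-built stand-in for the parsed array (assumption for this sketch)
arrData = Associative Array();
arrData["a"] = 10;
arrData["b"] = 20;

// Look up a single value by its name
Show( arrData["b"] );

// List every stored name with its value
keys = arrData << Get Keys;
For( i = 1, i <= N Items( keys ), i++,
    Print( keys[i] || " = " || Char( arrData[keys[i]] ) )
);
```

Storing the values with Num(), as the loop above does, means they come back as numbers, which is why Char() is needed before concatenating them into a message.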
There is one problem to deal with. It hasn’t picked out the last assignment (“e=50”). My pattern explicitly states that the assignment is followed by a comma. That condition is not satisfied for the last assignment. I’m going to take a pragmatic approach, which is to simply append a comma to the end of my data string! Here is my final code:
    str = "<header>eqpt=e1,batch=a1</header><data>a=10,b=20,c=30,d=40,e=50</data>";
    strData = "";
    pattern = "<data>" + Pat Arb()>>strData + "</data>";
    Pat Match( str, pattern );
    strData = strData || ",";
    pattern = PatArb()>>name + "=" + PatArb()>>value + ",";
    arrData = Associative Array();
    isMatch = 1;
    While(isMatch,
        isMatch = Pat Match(strData,pattern, "");
        If (isMatch,
            arrData[name] = Num(value)
        )
    );
    Show(arrData);
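For comparison, here is a rough sketch of a different technique that does not need the appended comma. It is not pattern matching: it swaps in the Words() function to split the data string on delimiters. The names pairs, parts and i are introduced for illustration, and the sketch assumes that neither names nor values ever contain a comma or an equals sign.

```jsl
// Not pattern matching: a delimiter-splitting alternative using Words()
strData = "a=10,b=20,c=30,d=40,e=50";
arrData = Associative Array();

// Split into "name=value" items at each comma
pairs = Words( strData, "," );

For( i = 1, i <= N Items( pairs ), i++,
    // Split each item into its name and its value at the equals sign
    parts = Words( pairs[i], "=" );
    // Skip anything that is not exactly name=value
    If( N Items( parts ) == 2,
        arrData[parts[1]] = Num( parts[2] )
    )
);
Show( arrData );
```

Because the last item needs no trailing delimiter, all five assignments should end up in arrData. The pattern-based loop remains the more flexible approach when the text between values is less regular than a single delimiter character.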
I like the sentinel comma in the last example. Thanks for this post!
Many thanks! Just wanted to say that.
Thanks!