How does pattern matching work behind the scenes in F#? - algorithm

I am completely new to F# (and functional programming in general), but I see pattern matching used everywhere in sample code. I am wondering how pattern matching actually works. For example, I imagine it working like a for loop in other languages, checking for matches on each item in a collection. This is probably far from correct, so how does it actually work behind the scenes?

How does pattern matching actually work? The same as a for loop?
It is about as far from a for loop as you could imagine: instead of looping, a pattern match is compiled to an efficient automaton. There are two styles of automaton, which I call the "decision tree" and the "French style." Each style offers different advantages: the decision tree inspects the minimum number of values needed to make a decision, but a naive implementation may require exponential code space in the worst case. The French style offers a different time-space tradeoff, with good but not optimal guarantees for both.
But the absolutely definitive work on this problem is Luc Maranget's excellent paper "Compiling Pattern Matching to Good Decision Trees" from the 2008 ML Workshop. Luc's paper basically shows how to get the best of both worlds. If you want a treatment that may be slightly more accessible to the amateur, I humbly recommend my own offering, "When Do Match-Compilation Heuristics Matter?"
Writing a pattern-match compiler is easy and fun!

It depends on what kind of pattern matching you mean; it is quite a powerful construct and can be used in all sorts of ways. However, I'll try to explain how pattern matching works on lists. You can write, for example, these patterns:
match l with
| [1; 2; 3] -> // specific list of 3 elements
| 1::rest -> // list starting with 1 followed by more elements
| x::xs -> // non-empty list with element 'x' followed by a list
| [] -> // empty list (no elements)
The F# list is actually a discriminated union containing two cases - [] representing an empty list or x::xs representing a list with first element x followed by some other elements. In C#, this might be represented like this:
// Represents any list
abstract class List<T> { }
// Case '[]' representing an empty list
class EmptyList<T> : List<T> { }
// Case 'x::xs' representing list with element followed by other list
class ConsList<T> : List<T> {
    public T Value { get; set; }
    public List<T> Rest { get; set; }
}
The patterns above would be compiled to the following (I'm using pseudo-code to make this simpler):
if (l is ConsList) && (l.Value == 1) &&
   (l.Rest is ConsList) && (l.Rest.Value == 2) &&
   (l.Rest.Rest is ConsList) && (l.Rest.Rest.Value == 3) &&
   (l.Rest.Rest.Rest is EmptyList) then
    // specific list of 3 elements
else if (l is ConsList) && (l.Value == 1) then
    var rest = l.Rest;
    // list starting with 1 followed by more elements
else if (l is ConsList) then
    var x = l.Value, xs = l.Rest;
    // non-empty list with element 'x' followed by a list
else if (l is EmptyList) then
    // empty list (no elements)
As you can see, there is no looping involved. When processing lists in F#, you would use recursion to implement looping, but pattern matching is used on individual elements (ConsList) that together compose the entire list.
Pattern matching on lists is a specific case of discriminated union which is discussed by sepp2k. There are other constructs that may appear in pattern matching, but essentially all of them are compiled using some (complicated) if statement.
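To make the "no looping" point concrete, here is a minimal Python sketch of the same if/else chain. The class names `Empty` and `Cons` and the helpers `from_items` and `describe` are made up for illustration; this mirrors the pseudo-code above and is not what the F# compiler actually emits.

```python
from dataclasses import dataclass
from typing import Any


class FList:
    """Base type standing in for the F# list union (hypothetical name)."""


@dataclass(frozen=True)
class Empty(FList):
    """Case '[]': the empty list."""


@dataclass(frozen=True)
class Cons(FList):
    """Case 'x::xs': one element followed by another list."""
    value: Any
    rest: FList


def from_items(*items):
    """Helper for building test data only; not part of the matching itself."""
    result = Empty()
    for item in reversed(items):
        result = Cons(item, result)
    return result


def describe(l):
    """Checks a fixed number of nodes, so there is no loop over the list."""
    if (isinstance(l, Cons) and l.value == 1
            and isinstance(l.rest, Cons) and l.rest.value == 2
            and isinstance(l.rest.rest, Cons) and l.rest.rest.value == 3
            and isinstance(l.rest.rest.rest, Empty)):
        return "specific list of 3 elements"
    elif isinstance(l, Cons) and l.value == 1:
        return "list starting with 1"
    elif isinstance(l, Cons):
        return "non-empty list"
    else:
        return "empty list"
```

However long the list is, `describe` inspects at most the first three nodes and the tail after them.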

No, it doesn't loop. If you have a pattern match like this
match x with
| Foo (a, b) -> a + b
| Bar c -> c
this compiles down to something like this pseudo code:
if (x is a Foo)
    let a = (first element of x) in
    let b = (second element of x) in
    a+b
else if (x is a Bar)
    let c = (first element of x) in
    c
If Foo and Bar are constructors from an algebraic data type (i.e. a type defined like type FooBar = Foo of int * int | Bar of int), the operations x is a Foo and x is a Bar are simple comparisons. If they are defined by an active pattern, the operations are defined by that pattern.

If you compile your F# code to an .exe and then take a look with Reflector, you can see the C# "equivalent" of the F# code.
I've used this method to look at F# examples quite a bit.

Related

Kotlin map not working with List of String

I have been working on code where I have to generate all possible ways to construct the target string from a word bank. I am using the code below.
Print Statement:
println("---------- How Construct -------")
println("${
    window.howConstruct("purple", listOf(
        "purp",
        "p",
        "ur",
        "le",
        "purpl"
    ))
}")
Function Call:
fun howConstruct(
    target: String,
    wordBank: List<String>,
): List<List<String>> {
    if (target.isEmpty()) return emptyList()
    var result = emptyList<List<String>>()
    for (word in wordBank) {
        if (target.indexOf(word) == 0) { // Starting with prefix
            val substring = target.substring(word.length)
            val suffixWays = howConstruct(substring, wordBank)
            val targetWays = suffixWays.map { way ->
                val a = way.toMutableList().apply {
                    add(word)
                }
                a.toList()
            }
            result = targetWays
        }
    }
    return result
}
Expected Output:-
[['purp','le'],['p','ur','p','le']]
Current Output:-
[]
Your code is almost working; only a few small changes are needed to get the required output:
If the target is empty, return listOf(emptyList()) instead of emptyList().
Use add(0, word) instead of add(word).
Accumulate the matches with result = result + targetWays instead of result = targetWays.
The first of those changes is the important one. Your function returns a list of matches; and since each match is itself a list of strings, it returns a list of lists of strings. Once your code has matched the entire target and calls itself one last time, it returns an empty list (i.e. no matches) instead of a list containing an empty list (meaning one match with no remaining strings).
The second change simply fixes the order of strings within each match, which was reversed (because it appended the prefix after the returned suffix match).
The third change keeps the matches found for earlier words. As written, every word that is a prefix of the target overwrites result, so a later prefix with no complete match (here "purpl") resets it to an empty list.
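For readers who want to check the corrected logic outside Kotlin, here is a minimal Python sketch of the same prefix-based recursion. The name `how_construct` is just a transliteration of the question's function, and the sketch assumes the word bank contains no empty strings. It returns a list holding one empty match for the empty target, puts the prefix first, and accumulates the matches across all words in the bank.

```python
def how_construct(target, word_bank):
    """Return every way to build `target` by concatenating words from `word_bank`."""
    if target == "":
        # One match with nothing left to add, rather than no matches at all.
        return [[]]
    result = []
    for word in word_bank:
        if target.startswith(word):
            suffix_ways = how_construct(target[len(word):], word_bank)
            # Put the prefix first, and keep the matches found for earlier words.
            result.extend([word] + way for way in suffix_ways)
    return result


print(how_construct("purple", ["purp", "p", "ur", "le", "purpl"]))
# prints [['purp', 'le'], ['p', 'ur', 'p', 'le']]
```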
However, there are many other ways that code could be improved. Rather than list them all individually, it's probably easier to give an alternative version:
fun howConstruct(target: String, wordBank: List<String>
): List<List<String>>
    = if (target == "") listOf(emptyList())
    else wordBank.filter{ target.endsWith(it) } // Look for suffixes of the target in the word bank
        .flatMap { suffix: String ->
            howConstruct(target.removeSuffix(suffix), wordBank) // For each, recurse to search the rest
                .map{ it + suffix } } // And append the suffix to each match.
That does almost exactly the same as your code, except that it searches from the end of the string — matching suffixes — instead of from the beginning. The result is the same; the main benefit is that it's simpler to append a suffix string to a partial match list (using +) than to prepend a prefix (which is quite messy, as you found).
However, it's a lot more concise, mainly because it uses a functional style — in particular, it uses filter() to determine which words are valid suffixes, and flatMap() to collate the list of matches corresponding to each one recursively, as well as map() to append the suffix to each one (like your code does). That avoids all the business of looping over lists, creating lists, and adding to them. As a result, it doesn't need to deal with mutable lists or variables, avoiding some sources of confusion and error.
I've written it as an expression body (with = instead of { … }) for simplicity. I find that's simpler and clearer for short functions, though this one is about the limit. It might fit as an extension function on String, since it's effectively returning a transformation of the string without any side-effects, though again, that tends to work best on short functions.
There are also several small tweaks. It's a bit simpler — and more efficient — to use startsWith() or endsWith() instead of indexOf(); removePrefix() or removeSuffix() is arguably slightly clearer than substring(); and I find == "" clearer than isEmpty().
(Also, the name howConstruct() doesn't really describe the result very well, but I haven't come up with anything better so far…)
Many of these changes are of course a matter of personal preference, and I'm sure other developers would write it in many other ways! But I hope this has given some ideas.
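The suffix-based, loop-free style described above can also be sketched in Python with a nested comprehension. This is only an illustration of the idea, not a translation anyone needs to adopt; the name `how_construct_from_end` is hypothetical, and empty words in the bank are skipped explicitly to avoid infinite recursion.

```python
def how_construct_from_end(target, word_bank):
    """Build matches from the end of `target`, appending each suffix to the partial matches."""
    if target == "":
        return [[]]
    return [
        way + [suffix]                       # append the suffix to each partial match
        for suffix in word_bank
        if suffix and target.endswith(suffix)
        for way in how_construct_from_end(target[:len(target) - len(suffix)], word_bank)
    ]
```

Appending with `way + [suffix]` builds a new list each time, so no mutable state is shared between matches.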

Do any functional programming languages have syntax sugar for changing part of an object?

In imperative programming, there is concise syntax sugar for changing part of an object, e.g. assigning to a field:
foo.bar = new_value
Or to an element of an array, or in some languages an array-like list:
a[3] = new_value
In functional programming, the idiom is not to mutate part of an existing object, but to create a new object with most of the same values, but a different value for that field or element.
At the semantic level, this brings about significant improvements in ease of understanding and composing code, albeit not without trade-offs.
I am asking here about the trade-offs at the syntax level. In general, creating a new object with most of the same values, but a different value for one field or element, is a much more heavyweight operation in terms of how it looks in your code.
Is there any functional programming language that provides syntax sugar to make that operation look more concise? Obviously you can write a function to do it, but imperative languages provide syntax sugar to make it more concise than calling a procedure; do any functional languages provide syntax sugar to make it more concise than calling a function? I could swear that I have seen syntax sugar for at least the object.field case, in some functional language, though I forget which one it was.
(Performance is out of scope here. In this context, I am talking only about what the code looks like and does, not how fast it does it.)
Haskell records have this functionality. You can define a record to be:
data Person = Person
  { name :: String
  , age :: Int
  }
And an instance:
johnSmith :: Person
johnSmith = Person
  { name = "John Smith"
  , age = 24
  }
And create an altered copy:
johnDoe :: Person
johnDoe = johnSmith {name = "John Doe"}
-- Result:
-- johnDoe = Person
--   { name = "John Doe"
--   , age = 24
--   }
This syntax, however, is cumbersome when you have to update deeply nested records. The lens library solves this problem quite well.
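As a rough analogy outside Haskell, Python's `dataclasses.replace` performs the same copy-with-one-field-changed operation on frozen dataclasses, and it shows the same awkwardness once records are nested. The `Address` type and the sample values below are invented for illustration.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Address:
    city: str
    zip_code: str


@dataclass(frozen=True)
class Person:
    name: str
    age: int
    address: Address


john_smith = Person("John Smith", 24, Address("Springfield", "12345"))

# Flat update: a new Person with one field changed; the original is untouched.
john_doe = replace(john_smith, name="John Doe")

# Nested update: each level has to be rebuilt by hand, which is the pain point
# that lens-style libraries are meant to remove.
moved = replace(john_smith, address=replace(john_smith.address, city="Shelbyville"))
```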
However, Haskell lists do not provide an update syntax because updating on lists will have an O(n) cost - they are singly-linked lists.
If you want efficient update on list-like collections, you can use Arrays in the array package, or Vectors in the vector package. They both have the infix operator (//) for updating:
alteredVector = someVector // [(1, "some value")]
-- similar to `someVector[1] = "some value"`
It is not built in, but I think infix notation is convenient enough!
One language with that kind of sugar is F#. It allows you to write
let myRecord3 = { myRecord2 with Y = 100; Z = 2 }
Scala also has sugar for updating a Map:
ms + (k -> v)
ms updated (k,v)
In a language such as Haskell, you would need to write this yourself. If you can express the update as a key-value pair, you might define
let structure' =
      update structure key value
or
update structure (key, value)
which would let you use infix notation such as
structure `update` (key, value)
structure // (key, value)
As a proof of concept, here is one possible (inefficient) implementation, which also fails if your index is out of range:
module UpdateList (updateList, (//)) where
import Data.List (splitAt)
updateList :: [a] -> (Int,a) -> [a]
updateList xs (i,y) = let ( initial, (_:final) ) = splitAt i xs
                      in initial ++ (y:final)
infixl 6 // -- Same precedence as +
(//) :: [a] -> (Int,a) -> [a]
(//) = updateList
With this definition, ["a","b","c","d"] // (2,"C") returns ["a","b","C","d"]. And [1,2] // (2,3) throws a runtime exception, but I leave that as an exercise for the reader.
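For comparison, here is a small Python sketch of the same non-mutating update that also handles the out-of-range case explicitly. The function name `update_list` and the choice to raise `IndexError` are assumptions made for this example.

```python
def update_list(xs, index, value):
    """Return a new list equal to `xs` except that position `index` holds `value`."""
    if not 0 <= index < len(xs):
        raise IndexError(f"index {index} out of range for list of length {len(xs)}")
    # Slicing copies, so the original list is left unchanged.
    return xs[:index] + [value] + xs[index + 1:]


print(update_list(["a", "b", "c", "d"], 2, "C"))  # prints ['a', 'b', 'C', 'd']
```

Like the Haskell version, this costs O(n) per update because the whole list is copied.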
H. Rhen gave an example of Haskell record syntax that I did not know about, so I’ve removed the last part of my answer. See theirs instead.

What is an algorithm/structure that can be used to effectively find if an object matches any of a set of patterns?

A pattern is a hash with values and functions. For example:
pattern = {a:1,b:2,c:function(x){ return x<5; }}
There is a function that checks if an object matches a pattern. For example, an object will match the pattern above if obj.a == 1, obj.b == 2 and obj.c < 5. Some tests:
matches(pattern,{a:1,b:3,c:2}) == false // because b != 2
matches(pattern,{a:1,b:2,c:7}) == false // because c >= 5
matches(pattern,{a:1,b:2,c:3}) == true //fine
matches(pattern,{a:1,b:2,c:2,d:4}) == true //no problems in having extras
Suppose that I have a set of patterns and I want to find if an object matches any of those patterns. I could check one by one, but, this way, I have an O(n) complexity, where n is the number of patterns. I have a feeling that this can be optimized if I use the set of patterns to build some other structure; but I'm not sure what that structure could be. Thoughts?
You can create a decision tree (or optimize it into a BDD, a binary decision diagram). This requires reading each relevant variable only once during the evaluation of each object.
A BDD is a way to evaluate a logical formula, in your case the logical formula is
pattern_1 OR pattern_2 OR pattern_3 OR .... OR pattern_n
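A minimal Python sketch of the decision-tree idea follows, under some loudly stated assumptions: patterns are plain dicts, constant values are hashable, function-valued entries are treated as predicates that are only evaluated at the leaves, and the names `build_tree` and `matches_any` are invented here. Patterns that do not test the chosen key are copied into every branch, so the tree can grow large; a real implementation would share subtrees, which is what a BDD does.

```python
def build_tree(patterns):
    """Build a decision tree that branches on constant-valued keys."""
    const_keys = [k for p in patterns for k, v in p.items() if not callable(v)]
    if not const_keys:
        # Only predicate tests (or no tests) remain; check them at the leaf.
        return ("leaf", patterns)
    key = const_keys[0]
    branches, default = {}, []
    for p in patterns:
        if key in p and not callable(p[key]):
            rest = {k: v for k, v in p.items() if k != key}
            branches.setdefault(p[key], []).append(rest)
        else:
            default.append(p)  # does not constrain `key` by a constant
    subtrees = {value: build_tree(ps + default) for value, ps in branches.items()}
    return ("node", key, subtrees, build_tree(default))


def matches_any(tree, obj):
    """Walk the tree, reading each constant-tested key of `obj` at most once."""
    while tree[0] == "node":
        _, key, subtrees, default = tree
        tree = subtrees.get(obj[key], default) if key in obj else default
    return any(
        all(k in obj and test(obj[k]) for k, test in p.items())
        for p in tree[1]
    )


patterns = [{"a": 1, "b": 2, "c": lambda x: x < 5}, {"a": 2}]
tree = build_tree(patterns)
print(matches_any(tree, {"a": 1, "b": 2, "c": 3}))  # prints True
print(matches_any(tree, {"a": 1, "b": 3, "c": 2}))  # prints False
```

The tree is built once from the pattern set; each lookup then follows a single path instead of testing every pattern in turn.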

Efficient data structure/algorithm for transliteration based word lookup

I'm looking for an efficient data structure/algorithm for storing and searching transliteration-based word lookups (like Google does: http://www.google.com/transliterate/ but I'm not trying to use the Google transliteration API). Unfortunately, the natural language I'm trying to work on doesn't have any soundex implemented, so I'm on my own.
For an open source project, I'm currently using plain arrays to store the word list and dynamically generating a regular expression (based on user input) to match against them. It works fine, but regular expressions are more powerful and resource-intensive than I need. For example, I'm afraid this solution will drain too much battery if I port it to handheld devices, as searching over thousands of words with a regular expression is too costly.
There must be a better way to accomplish this for complex languages. How does the Pinyin input method work, for example? Any suggestions on where to start?
Thanks in advance.
Edit: If I understand correctly, this is what @Dialecticus suggested:
I want to transliterate from Language1, which has 3 characters a,b,c, to Language2, which has 6 characters p,q,r,x,y,z. Because the languages differ in the number of characters they possess and in their phones, it is often not possible to define a one-to-one mapping.
Let's assume that, phonetically, this is our associative array/transliteration table:
a -> p, q
b -> r
c -> x, y, z
We also have a valid word lists in plain arrays for Language2:
...
px
qy
...
If the user types ac, the possible combinations become px, py, pz, qx, qy, qz after transliteration step 1. In step 2 we have to do another search in the valid word list and eliminate every one of them except px and qy.
What I'm doing currently is not that different from the above approach. Instead of making possible combinations using the transliteration table, I'm building a regular expression [pq][xyz] and matching that with my valid word list, which provides the output px and qy.
I'm eager to know if there is any better method than that.
From what I understand, you have an input string S in an alphabet (let's call it A1) and you want to convert it to the string S' which is its equivalent in another alphabet A2. Actually, if I understand correctly, you want to generate a list [S'1,S'2,...,S'n] of output strings which might potentially be equivalent to S.
One approach that comes to mind is, for each word in the list of valid words in A2, to generate the list of strings in A1 that map to it. Using the example in your edit, we have
px->ac
qy->ac
pr->ab
(I have added an extra valid word pr for clarity)
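The reverse table above can be derived mechanically from the transliteration table. Here is a minimal Python sketch, assuming (as in the example) that every target character comes from exactly one source character; if a target character could come from several, each word would need a list of keys instead. The names `invert_table` and `group_by_source` are made up for this illustration.

```python
TABLE = {"a": ["p", "q"], "b": ["r"], "c": ["x", "y", "z"]}


def invert_table(table):
    """Map each target character back to its (assumed unique) source character."""
    return {target: source for source, targets in table.items() for target in targets}


def group_by_source(valid_words, table):
    """Group valid target-language words by the source string that produces them."""
    reverse = invert_table(table)
    groups = {}
    for word in valid_words:
        key = "".join(reverse[ch] for ch in word)
        groups.setdefault(key, []).append(word)
    return groups


print(group_by_source(["px", "qy", "pr"], TABLE))
# prints {'ac': ['px', 'qy'], 'ab': ['pr']}
```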
Now that we know what possible series of input symbols will always map to a valid word, we can use our table to build a Trie.
Each node will hold a pointer to a list of valid words in A2 that map to the sequence of symbols in A1 that form the path from the root of the Trie to the current node.
Thus for our example, the Trie would look something like this
            Root (empty)
                | a
                |
                V
     +---Node (empty)---+
     | b                | c
     |                  |
     V                  V
 Node (pr)         Node (px,qy)
Starting at the root node, as symbols are consumed transitions are made from the current node to its child marked with the symbol consumed until we have read the entire string. If at any point no transition is defined for that symbol, the entered string does not exist in our trie and thus does not map to a valid word in our target language. Otherwise, at the end of the process, the list of words associated with the current node is the list of valid words the input string maps to.
Apart from the initial cost of building the trie (the trie can be shipped pre-built if we never want the list of valid words to change), this takes O(n) on the length of the input to find a list of mapping valid words.
Using a Trie also provides the advantage that you can use it to find the list of all valid words that can be generated by adding more symbols to the end of the input, i.e. a prefix match. For example, if fed with the input symbol 'a', we can use the trie to find all valid words that can begin with 'a' ('px','qy','pr'). But doing that is not as fast as finding the exact match.
Here's a quick hack at a solution (in Java):
import java.util.*;
class TrieNode{
    // child nodes - size of array depends on your alphabet size,
    // here we are only using the lowercase English characters 'a'-'z'
    TrieNode[] next=new TrieNode[26];
    List<String> words;
    public TrieNode(){
        words=new ArrayList<String>();
    }
}
class Trie{
    private TrieNode root=null;
    public void addWord(String sourceLanguage, String targetLanguage){
        root=add(root,sourceLanguage.toCharArray(),0,targetLanguage);
    }
    private static int convertToIndex(char c){ // you need to change this for your alphabet
        return (c-'a');
    }
    private TrieNode add(TrieNode cur, char[] s, int pos, String targ){
        if (cur==null){
            cur=new TrieNode();
        }
        if (s.length==pos){
            cur.words.add(targ);
        }
        else{
            cur.next[convertToIndex(s[pos])]=add(cur.next[convertToIndex(s[pos])],s,pos+1,targ);
        }
        return cur;
    }
    public List<String> findMatches(String text){
        return find(root,text.toCharArray(),0);
    }
    private List<String> find(TrieNode cur, char[] s, int pos){
        if (cur==null) return new ArrayList<String>();
        else if (pos==s.length){
            return cur.words;
        }
        else{
            return find(cur.next[convertToIndex(s[pos])],s,pos+1);
        }
    }
}
class MyMiniTransliiterator{
    public static void main(String args[]){
        Trie t=new Trie();
        t.addWord("ac","px");
        t.addWord("ac","qy");
        t.addWord("ab","pr");
        System.out.println(t.findMatches("ac")); // prints [px, qy]
        System.out.println(t.findMatches("ab")); // prints [pr]
        System.out.println(t.findMatches("ba")); // prints [] since this does not match anything
    }
}
This is a very simple trie, no compression or speedups and only works on lower case English characters for the input language. But it can be easily modified for other character sets.
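The Java code above covers exact lookups only. As a sketch of the prefix match mentioned earlier, here is a small Python trie that uses a dict per node (so it is not tied to 'a'-'z') and collects every word stored at or below the node reached by the input. The method names `find_matches` and `find_with_prefix` are invented for this example, and no ordering of the prefix results is guaranteed.

```python
class TrieNode:
    def __init__(self):
        self.children = {}   # source symbol -> child node
        self.words = []      # valid target words for the path ending here


class Trie:
    def __init__(self):
        self.root = TrieNode()

    def add_word(self, source, target):
        node = self.root
        for ch in source:
            node = node.children.setdefault(ch, TrieNode())
        node.words.append(target)

    def _walk(self, text):
        """Follow `text` from the root; return the node reached, or None."""
        node = self.root
        for ch in text:
            node = node.children.get(ch)
            if node is None:
                return None
        return node

    def find_matches(self, text):
        """Exact lookup: words whose source string is exactly `text`."""
        node = self._walk(text)
        return list(node.words) if node is not None else []

    def find_with_prefix(self, prefix):
        """Prefix lookup: words whose source string starts with `prefix`."""
        start = self._walk(prefix)
        if start is None:
            return []
        found, stack = [], [start]
        while stack:                      # visit the whole subtree
            node = stack.pop()
            found.extend(node.words)
            stack.extend(node.children.values())
        return found


t = Trie()
t.add_word("ac", "px")
t.add_word("ac", "qy")
t.add_word("ab", "pr")
print(t.find_matches("ac"))  # prints ['px', 'qy']
```

The prefix lookup visits the whole subtree below the input, which is why it is slower than the exact match.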
I would build the transliterated sentence one symbol at a time, instead of one word at a time. For most languages it is possible to transliterate every symbol independently of the other symbols in the word. You can still have exceptions, whole words that have to be transliterated as complete words, but a transliteration table of symbols and exceptions will surely be smaller than a transliteration table of all existing words.
Best structure for transliteration table is some sort of associative array, probably utilizing hash tables. In C++ there's std::unordered_map, and in C# you would use Dictionary.

An efficient technique to replace an occurrence in a sequence with mutable or immutable state

I am searching for an efficient technique to find a sequence of Op occurrences in a Seq[Op]. Once an occurrence is found, I want to replace it with a defined replacement and run the same search again until the list stops changing.
Scenario:
I have three types of Op case classes. Pop() extends Op, Push() extends Op and Nop() extends Op. I want to replace the occurrence of Push(), Pop() with Nop(). Basically the code could look like seq.replace(Push() ~ Pop() ~> Nop()).
Problem:
Now that I call seq.replace(...) I will have to search in the sequence for an occurrence of Push(), Pop(). So far so good. I find the occurrence. But now I will have to splice the occurrence from the list and insert the replacement.
Now there are two options. My list could be mutable or immutable. If I use an immutable list I am scared regarding performance because those sequences are usually 500+ elements in size. If I replace a lot of occurrences like A ~ B ~ C ~> D ~ E I will create a lot of new objects, if I am not mistaken. However, I could also use a mutable sequence like ListBuffer[Op].
Basically from a linked-list background I would just do some pointer-bending and after a total of four operations I am done with the replacement without creating new objects. That is why I am now concerned about performance. Especially since this is a performance-critical operation for me.
Question:
How would you implement the replace() method in a Scala fashion and what kind of data structure would you use keeping in mind that this is a performance-critical operation?
I am happy with answers that point me in the right direction or pseudo code. No need to write a full replace method.
Thank you.
Ok, some considerations to be made. First, recall that, on lists, tail does not create objects, and prepending (::) only creates one object for each prepended element. That's pretty much as good as you can get, generally speaking.
One way of doing this would be this:
def myReplace(input: List[Op], pattern: List[Op], replacement: List[Op]) = {
  // This function should be part of a KMP algorithm instead, for performance
  // Returns true when `pattern` is a prefix of `list`
  def compare(pattern: List[Op], list: List[Op]): Boolean = (pattern, list) match {
    case (x :: xs, y :: ys) if x == y => compare(xs, ys)
    case (Nil, _) => true // whole pattern consumed; the rest of the list may be anything
    case _ => false
  }
  var processed: List[Op] = Nil
  var unprocessed: List[Op] = input
  val patternLength = pattern.length
  val reversedReplacement = replacement.reverse
  // Do this until we finish processing the whole sequence
  while (unprocessed.nonEmpty) {
    // This inside algorithm would be better if replaced by KMP
    // Quickly process non-matching sequences
    while (unprocessed.nonEmpty && unprocessed.head != pattern.head) {
      processed ::= unprocessed.head
      unprocessed = unprocessed.tail
    }
    if (unprocessed.nonEmpty) {
      if (compare(pattern, unprocessed)) {
        processed :::= reversedReplacement
        unprocessed = unprocessed drop patternLength
      } else {
        processed ::= unprocessed.head
        unprocessed = unprocessed.tail
      }
    }
  }
  processed.reverse
}
You may gain speed by using KMP, particularly if the pattern searched for is long.
Now, what is the problem with this algorithm? The problem is that it won't test if the replaced pattern causes a match before that position. For instance, if I replace ACB with C, and I have an input AACBB, then the result of this algorithm will be ACB instead of C.
To avoid this problem, you should create a backtrack. First, you check at which position in your pattern the replacement may happen:
val positionOfReplacement = pattern.indexOfSlice(replacement)
Then, you modify the replacement part of the algorithm like this:
if (compare(pattern, unprocessed)) {
  if (positionOfReplacement > 0) {
    // Drop the match, put the replacement back, and re-examine the elements before it
    unprocessed = unprocessed drop patternLength
    unprocessed :::= replacement
    unprocessed :::= (processed take positionOfReplacement).reverse
    processed = processed drop positionOfReplacement
  } else {
    processed :::= reversedReplacement
    unprocessed = unprocessed drop patternLength
  }
} else {
This will backtrack enough to solve the problem.
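A different way to get the same "re-check after replacing" behaviour is a two-stack rewrite, sketched here in Python. This is a swapped-in technique, not the Scala code above: the name `rewrite` is hypothetical, and the sketch assumes a non-empty pattern and a replacement that cannot regenerate the pattern forever (otherwise it would not terminate).

```python
def rewrite(seq, pattern, replacement):
    """Replace `pattern` with `replacement` repeatedly until no occurrence remains."""
    pattern = list(pattern)
    replacement = list(replacement)
    if not pattern:
        raise ValueError("pattern must not be empty")
    out = []                            # already-scanned prefix, free of matches
    pending = list(reversed(seq))       # unscanned input, next element at the end
    while pending:
        out.append(pending.pop())
        if out[-len(pattern):] == pattern:
            del out[-len(pattern):]     # splice the match out of the output
            # Feed the replacement back through the scanner so it can combine
            # with the elements that came before the match.
            pending.extend(reversed(replacement))
    return out


print(rewrite("AACBB", "ACB", "C"))  # prints ['C']
```

Because the replacement is pushed back onto the input, a match that only appears after a replacement (as in AACBB) is still found, and each step only touches the ends of the two lists.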
This algorithm won't deal efficiently, however, with multiple patterns at the same time, which I guess is where you are going. For that, you'll probably need some adaptation of KMP to do it efficiently, or otherwise use a DFA to control possible matchings. It gets even worse if you want to match both AB and ABC.
In practice, the full-blown problem is equivalent to regex match & replace, where the replacement is a function of the match. Which means, of course, you may want to start looking into regex algorithms.
EDIT
I was forgetting to complete my reasoning. If that technique doesn't work for some reason, then my advice is to go with an immutable tree-based vector. Tree-based vectors enable replacement of partial sequences with a low amount of copying.
And if that doesn't do, then the solution is doubly linked lists. And pick one from a library with slice replacement -- otherwise you may end up spending way too much time debugging a known but tricky algorithm.

Resources