F# List optimisation - performance

From an unordered list of int, I want to have the smallest difference between two elements. I have a code that is working but way to slow. Can anyone sugest some change to improve the performance? Please explain why you did the change and what will be the performance gain.
let allInt = [ 5; 8; 9 ]
let sortedList = allInt |> List.sort;
let differenceList = [ for a in 0 .. N-2 do yield sortedList.Item a - sortedList.Item a + 1 ]
printfn "%i" (List.min differenceList) // print 1 (because 9-8 smallest difference)
I think I'm doing to much list creation or iteration but I don't know how to write it differently in F#...yet.
Edit: I'm testing this code on list with 100 000 items or more.
Edit 2: I believe that if I can calculte the difference and have the min in one go it should improve the perf a lot, but I don't know how to do that, anay idea?
Thanks in advance

The List.Item performs in O(n) time and is probably the main performance bottle neck in your code. The evaluation of differenceList iterates the elements of sortedList by index, which means the performance is around O((N-2)(2(N-2))), which simplifies to O(N^2), where N is the number of elements in sortedList. For long lists, this will eventually perform badly.
What I would do is to eliminate calls to Item and instead use the List.pairwise operation
let data =
[ let rnd = System.Random()
for i in 1..100000 do yield rnd.Next() ]
#time
let result =
data
|> List.sort
|> List.pairwise // convert list from [a;b;c;...] to [(a,b); (b,c); ...]
|> List.map (fun (a,b) -> a - b |> abs) // Calculates the absolute difference
|> List.min
#time
The #time directives lets me measure execution time in F# Interactive and the output I get when running this code is:
--> Timing now on
Real: 00:00:00.029, CPU: 00:00:00.031, GC gen0: 1, gen1: 1, gen2: 0
val result : int = 0
--> Timing now off

F#'s built-in list type is implemented as a linked list, which means accessing elements by index has to enumerate the list all the way to the index each time. In your case you have two index accesses repeated N-2 times, getting slower and slower with each iteration, as the index grows and each access needs to go through longer part of the list.
First way out of this would be using an array instead of a list, which is a trivial change, but grants you faster index access.
(*
[| and |] let you define an array literal,
alternatively use List.toArray allInt
*)
let allInt = [| 5; 8; 9 |]
let sortedArray = allInt |> Array.sort;
let differenceList = [ for a in 0 .. N-2 do yield sortedArray.[a] - sortedArray.[a + 1] ]
Another approach might be pairing up the neighbours in the list, subtracting them and then finding a min.
let differenceList =
sortedList
|> List.pairwise
|> List.map (fun (x,y) -> x - y)
List.pairwise takes a list of elements and returns a list of the neighbouring pairs. E.g. in your example List.pairwise [ 5; 8; 9 ] = [ (5, 8); (8, 9) ], so that you can easily work with the pairs in the next step, the subtraction mapping.
This way is better, but these functions from List module take a list as input and produce a new list as the output, having to pass through the list 3 times (1 for pairwise, 1 for map, 1 for min at the end). To solve this, you can use functions from the Seq module, which work with .NETs IEnumerable<'a> interface allowing lazy evaluation resulting usually in fewer passes.
Fortunately in this case Seq defines alternatives for all the functions we use here, so the next step is trivial:
let differenceSeq =
sortedList
|> Seq.pairwise
|> Seq.map (fun (x,y) -> x - y)
let minDiff = Seq.min differenceSeq
This should need only one enumeration of the list (excluding the sorting phase of course).
But I cannot guarantee you which approach will be fastest. My bet would be on simply using an array instead of the list, but to find out, you will have to try it out and measure for yourself, on your data and your hardware. BehchmarkDotNet library can help you with that.

The rest of your question is adequately covered by the other answers, so I won't duplicate them. But nobody has yet addressed the question you asked in your Edit 2. To answer that question, if you're doing a calculation and then want the minimum result of that calculation, you want List.minBy. One clue that you want List.minBy is when you find yourself doing a map followed by a min operation (as both the other answers are doing): that's a classic sign that you want minBy, which does that in one operation instead of two.
There's one gotcha to watch out for when using List.minBy: It returns the original value, not the result of the calculation. I.e., if you do ints |> List.pairwise |> List.minBy (fun (a,b) -> abs (a - b)), then what List.minBy is going to return is a pair of items, not the difference. It's written that way because if it gives you the original value but you really wanted the result, you can always recalculate the result; but if it gave you the result and you really wanted the original value, you might not be able to get it. (Was that difference of 1 the difference between 8 and 9, or between 4 and 5?)
So in your case, you could do:
let allInt = [5; 8; 9]
let minPair =
allInt
|> List.pairwise
|> List.minBy (fun (x,y) -> abs (x - y))
let a, b = minPair
let minDifference = abs (a - b)
printfn "The difference between %d and %d was %d" a b minDifference
The List.minBy operation also exists on sequences, so if your list is large enough that you want to avoid creating an intermediate list of pairs, then use Seq.pairwise and Seq.minBy instead:
let allInt = [5; 8; 9]
let minPair =
allInt
|> Seq.pairwise
|> Seq.minBy (fun (x,y) -> abs (x - y))
let a, b = minPair
let minDifference = abs (a - b)
printfn "The difference between %d and %d was %d" a b minDifference
EDIT: Yes, I see that you've got a list of 100,000 items. So you definitely want the Seq version of this. The F# seq type is just IEnumerable, so if you're used to C#, think of the Seq functions as LINQ expressions and you'll have the right idea.
P.S. One thing to note here: see how I'm doing let a, b = minPair? That's called destructuring assignment, and it's really useful. I could also have done this:
let a, b =
allInt
|> Seq.pairwise
|> Seq.minBy (fun (x,y) -> abs (x - y))
and it would have given me the same result. Seq.minBy returns a tuple of two integers, and the let a, b = (tuple of two integers) expression takes that tuple, matches it against the pattern a, b, and thus assigns a to have the value of that tuple's first item, and b to have the value of that tuple's second item. Notice how I used the phrase "matches it against the pattern": this is the exact same thing as when you use a match expression. Explaining match expressions would make this answer too long, so I'll just point you to an excellent reference on them if you haven't already read it:
https://fsharpforfunandprofit.com/posts/match-expression/

Here is my solution:
let minPair xs =
let foo (x, y) = abs (x - y)
xs
|> List.allPairs xs
|> List.filter (fun (x, y) -> x <> y)
|> List.minBy foo
|> foo

Related

Can someone please help explain how this merge sort algorithm works?

I started learning F# yesterday and am struggling a bit with all the new functional programming.
I am trying to understand this implementation of merge sort which uses a split function. The split function is defined as:
let rec split = function
| [] -> ([], [])
| [a] -> ([a], [])
| a :: b :: cs -> let (r, s) = split cs
in (a :: r, b :: s)
The way I understand it, we take a list and return a tuple of lists, having split the list into two halves. If we pattern match on the empty list we return a tuple of empty lists, if we match on a list with one element we return a tuple with the list and an empty list, but the recursive case is eluding me.
a :: b :: cs means a prepended to b prepended to cs, right? So this is the case where the list has at least 3 elements? If so, we return two values, r and s, but I have not seen this "in" keyword used before. As far as I can tell, we prepend a, the first element, to r and b, the second element, to s, and then split on the remainder of the list, cs. But this does not appear to split the list in half to me.
Could anybody please help explain how the recursive case works? Thanks a lot.
You can ignore the in keyword in this case, so you can read the last case as just:
| a :: b :: cs ->
let (r, s) = split cs
(a :: r, b :: s)
Note that this will match any list of length 2 or greater, not 3 as you originally thought. When the list has exactly two elements, cs will be the empty list.
So what's going on in this case is:
If the list has at least 2 elements:
Name the first element a
Name the second element b
Name the rest of the list cs (even if it's empty)
Split cs recursively, which gives us two new lists, r and s
Create two more new lists:
One with a on the front of r
The other with b on the front of s
Return the two new lists
You can see this in operation if you call the function like this:
split [] |> printfn "%A" // [],[]
split [1] |> printfn "%A" // [1],[]
split [1; 2] |> printfn "%A" // [1],[2]
split [1; 2; 3] |> printfn "%A" // [1; 3],[2]
split [1; 2; 3; 4] |> printfn "%A" // [1; 3],[2; 4]
split [1; 2; 3; 4; 5] |> printfn "%A" // [1; 3; 5],[2; 4]
Update: What exactly does in do?
The in keyword is just a way to put a let-binding inside an expression. So, for example, we could write let x = 5 in x + x, which is an expression that has the value 10. This syntax is inherited from OCaml, and is still useful when you want to write the entire expression on one line.
In modern F#, we can use whitespace/indentation instead, by replacing the in keyword with a newline. So nowadays, we would usually write this expression as follows:
let x = 5
x + x
The two forms are semantically equivalent. More details here.
cs is [] when there are only two items in the list. When there are 3 or more items in the list then it recurses where cs is the list without the first two items. When there is only one item it returns [a],[] and when the list is empty it returns [],[] .

How can I select a random value from a list using F#

I'm new to F# and I'm trying to figure out how to return a random string value from a list/array of strings.
I have a list like this:
["win8FF40", "win10Chrome45", "win7IE11"]
How can I randomly select and return one item from the list above?
Here is my first try:
let combos = ["win8FF40";"win10Chrome45";"win7IE11"]
let getrandomitem () =
let rnd = System.Random()
fun (combos : string[]) -> combos.[rnd.Next(combos.Length)]
Both the answers given here by latkin and mydogisbox are good, but I still want to add a third approach that I sometimes use. This approach isn't faster, but it's more flexible and more composable, and fast enough for small sequences. Depending on your needs, you can use one of higher performance options given here, or you can use the following.
Single-argument function using Random
Instead of directly enabling you to select a single element, I often define a shuffleR function like this:
open System
let shuffleR (r : Random) xs = xs |> Seq.sortBy (fun _ -> r.Next())
This function has the type System.Random -> seq<'a> -> seq<'a>, so it works with any sort of sequence: lists, arrays, collections, and lazily evaluated sequences (although not with infinite sequences).
If you want a single random element from a list, you can still do that:
> [1..100] |> shuffleR (Random ()) |> Seq.head;;
val it : int = 85
but you can also take, say, three randomly picked elements:
> [1..100] |> shuffleR (Random ()) |> Seq.take 3;;
val it : seq<int> = seq [95; 92; 12]
No-argument function
Sometimes, I don't care about having to pass in that Random value, so I instead define this alternative version:
let shuffleG xs = xs |> Seq.sortBy (fun _ -> Guid.NewGuid())
It works in the same way:
> [1..100] |> shuffleG |> Seq.head;;
val it : int = 11
> [1..100] |> shuffleG |> Seq.take 3;;
val it : seq<int> = seq [69; 61; 42]
Although the purpose of Guid.NewGuid() isn't to provide random numbers, it's often random enough for my purposes - random, in the sense of being unpredictable.
Generalised function
Neither shuffleR nor shuffleG are truly random. Due to the ways Random and Guid.NewGuid() work, both functions may result in slightly skewed distributions. If this is a concern, you can define an even more general-purpose shuffle function:
let shuffle next xs = xs |> Seq.sortBy (fun _ -> next())
This function has the type (unit -> 'a) -> seq<'b> -> seq<'b> when 'a : comparison. It can still be used with Random:
> let r = Random();;
val r : Random
> [1..100] |> shuffle (fun _ -> r.Next()) |> Seq.take 3;;
val it : seq<int> = seq [68; 99; 54]
> [1..100] |> shuffle (fun _ -> r.Next()) |> Seq.take 3;;
val it : seq<int> = seq [99; 63; 11]
but you can also use it with some of the cryptographically secure random number generators provided by the Base Class Library:
open System.Security.Cryptography
open System.Collections.Generic
let rng = new RNGCryptoServiceProvider ()
let bytes = Array.zeroCreate<byte> 100
rng.GetBytes bytes
let q = bytes |> Queue
FSI:
> [1..100] |> shuffle (fun _ -> q.Dequeue()) |> Seq.take 3;;
val it : seq<int> = seq [74; 82; 61]
Unfortunately, as you can see from this code, it's quite cumbersome and brittle. You have to know the length of the sequence up front; RNGCryptoServiceProvider implements IDisposable, so you should make sure to dispose of rng after use; and items will be removed from q after use, which means it's not reusable.
Cryptographically random sort or selection
Instead, if you really need a cryptographically correct sort or selection, it'd be easier to do it like this:
let shuffleCrypto xs =
let a = xs |> Seq.toArray
use rng = new RNGCryptoServiceProvider ()
let bytes = Array.zeroCreate a.Length
rng.GetBytes bytes
Array.zip bytes a |> Array.sortBy fst |> Array.map snd
Usage:
> [1..100] |> shuffleCrypto |> Array.head;;
val it : int = 37
> [1..100] |> shuffleCrypto |> Array.take 3;;
val it : int [] = [|35; 67; 36|]
This isn't something I've ever had to do, though, but I thought I'd include it here for the sake of completeness. While I haven't measured it, it's most likely not the fastest implementation, but it should be cryptographically random.
Your problem is that you are mixing Arrays and F# Lists (*type*[] is a type notation for Array). You could modify it like this to use lists:
let getrandomitem () =
let rnd = System.Random()
fun (combos : string list) -> List.nth combos (rnd.Next(combos.Length))
That being said, indexing into a List is usually a bad idea since it has O(n) performance since an F# list is basically a linked-list. You would be better off making combos into an array if possible like this:
let combos = [|"win8FF40";"win10Chrome45";"win7IE11"|]
I wrote a blog post on exactly this topic a while ago: http://latkin.org/blog/2013/11/16/selecting-a-random-element-from-a-linked-list-3-approaches-in-f/
3 approaches are given there, with discussion of performance and tradeoffs of each.
To summarize:
// pro: simple, fast in practice
// con: 2-pass (once to get length, once to select nth element)
let method1 lst (rng : Random) =
List.nth lst (rng.Next(List.length lst))
// pro: ~1 pass, list length is not bound by int32
// con: more complex, slower in practice
let method2 lst (rng : Random) =
let rec step remaining picks top =
match (remaining, picks) with
| ([], []) -> failwith "Don't pass empty list"
// if only 1 element is picked, this is the result
| ([], [p]) -> p
// if multiple elements are picked, select randomly from them
| ([], ps) -> step ps [] -1
| (h :: t, ps) ->
match rng.Next() with
// if RNG makes new top number, picks list is reset
| n when n > top -> step t [h] n
// if RNG ties top number, add current element to picks list
| n when n = top -> step t (h::ps) top
// otherwise ignore and move to next element
| _ -> step t ps top
step lst [] -1
// pro: exactly 1 pass
// con: more complex, slowest in practice due to tuple allocations
let method3 lst (rng : Random) =
snd <| List.fold (fun (i, pick) elem ->
if rng.Next(i) = 0 then (i + 1, elem)
else (i + 1, pick)
) (0, List.head lst) lst
Edit: I should clarify that above shows a few ways to get a random element from a list, assuming you must use a list. If it fits with the rest of your program's design, it is definitely more efficient to take a random element from an array.

F# insert/remove item from list

How should I go about removing a given element from a list? As an example, say I have list ['A'; 'B'; 'C'; 'D'; 'E'] and want to remove the element at index 2 to produce the list ['A'; 'B'; 'D'; 'E']? I've already written the following code which accomplishes the task, but it seems rather inefficient to traverse the start of the list when I already know the index.
let remove lst i =
let rec remove lst lst' =
match lst with
| [] -> lst'
| h::t -> if List.length lst = i then
lst' # t
else
remove t (lst' # [h])
remove lst []
let myList = ['A'; 'B'; 'C'; 'D'; 'E']
let newList = remove myList 2
Alternatively, how should I insert an element at a given position? My code is similar to the above approach and most likely inefficient as well.
let insert lst i x =
let rec insert lst lst' =
match lst with
| [] -> lst'
| h::t -> if List.length lst = i then
lst' # [x] # lst
else
insert t (lst' # [h])
insert lst []
let myList = ['A'; 'B'; 'D'; 'E']
let newList = insert myList 2 'C'
Removing element at the specified index isn't a typical operation in functional programming - that's why it seems difficult to find the right implementation of these operations. In functional programming, you'll usually process the list element-by-element using recursion, or implement the processing in terms of higher-level declarative operations. Perhaps if you could clarfiy what is your motivation, we can give a better answer.
Anyway, to implement the two operations you wanted, you can use existing higher-order functions (that traverse the entire list a few times, because there is really no good way of doing this without traversing the list):
let removeAt index input =
input
// Associate each element with a boolean flag specifying whether
// we want to keep the element in the resulting list
|> List.mapi (fun i el -> (i <> index, el))
// Remove elements for which the flag is 'false' and drop the flags
|> List.filter fst |> List.map snd
To insert element to the specified index, you could write:
let insertAt index newEl input =
// For each element, we generate a list of elements that should
// replace the original one - either singleton list or two elements
// for the specified index
input |> List.mapi (fun i el -> if i = index then [newEl; el] else [el])
|> List.concat
However, as noted earlier - unless you have a very good reasons for using these functions, you should probably consider describing your goals more broadly and use an alternative (more functional) solution.
Seems the most idiomatic (not tail recursive):
let rec insert v i l =
match i, l with
| 0, xs -> v::xs
| i, x::xs -> x::insert v (i - 1) xs
| i, [] -> failwith "index out of range"
let rec remove i l =
match i, l with
| 0, x::xs -> xs
| i, x::xs -> x::remove (i - 1) xs
| i, [] -> failwith "index out of range"
it seems rather inefficient to
traverse the start of the list when I
already know the index.
F# lists are singly-linked lists, so you don't have indexed access to them. But most of the time, you don't need it. The majority of indexed operations on arrays are iteration from front to end, which is exactly the most common operation on immutable lists. Its also pretty common to add items to the end of an array, which isn't really the most efficient operation on singly linked lists, but most of the time you can use the "cons and reverse" idiom or use an immutable queue to get the same result.
Arrays and ResizeArrays are really the best choice if you need indexed access, but they aren't immutable. A handful of immutable data structures like VLists allow you to create list-like data structures supporting O(1) cons and O(log n) indexed random access if you really need it.
If you need random access in a list, consider using System.Collections.Generic.List<T> or System.Collections.Generic.LinkedList<T> instead of a F# list.
I know this has been here for a while now, but just had to do something like this recently and I came up with this solution, maybe it isn't the most efficient, but it surely is the shortest idiomatic code I found for it
let removeAt index list =
list |> List.indexed |> List.filter (fun (i, _) -> i <> index) |> List.map snd
The List.Indexed returns a list of tuples which are the index in the list and the actual item in that position after that all it takes is to filter the one tuple matching the inputted index and get the actual item afterwards.
I hope this helps someone who's not extremely concerned with efficiency and wants brief code
The following includes a bit of error checking as well
let removeAt index = function
| xs when index >= 0 && index < List.length xs ->
xs
|> List.splitAt index
|> fun (x,y) -> y |> List.skip 1 |> List.append x
| ys -> ys
Lets go thru it and explain the code
// use the function syntax
let removeAt index = function
// then check if index is within the size of the list
| xs when index >= 0 && index < List.length xs ->
xs
// split the list before the given index
// splitAt : int -> 'T list -> ('T list * 'T list)
// this gives us a tuple of the the list with 2 sublists
|> List.splitAt index
// define a function which
// first drops the element on the snd tuple element
// then appends the remainder of that sublist to the fst tuple element
// and return all of it
|> fun (x,y) -> y |> List.skip 1 |> List.append x
//index out of range so return the original list
| ys -> ys
And if you don't like the idea of simply returning the original list on indexOutOfRange - wrap the return into something
let removeAt index = function
| xs when index >= 0 && index < List.length xs ->
xs
|> List.splitAt index
|> fun (x,y) -> y |> List.skip 1 |> List.append x
|> Some
| ys -> None
I think this should be quite faster than Juliet's or Tomas' proposal but most certainly Mauricio's comment is hitting it home. If one needs to remove or delete items other data structures seem a better fit.

Transformation of a list of positions to a 2D array of positions using functional programming (F#)

How would you make the folowing code functional with the same speed? In general, as an input I have a list of objects containing position coordinates and other stuff and I need to create a 2D array consisting those objects.
let m = Matrix.Generic.create 6 6 []
let pos = [(1.3,4.3); (5.6,5.4); (1.5,4.8)]
pos |> List.iter (fun (pz,py) ->
let z, y = int pz, int py
m.[z,y] <- (pz,py) :: m.[z,y]
)
It could be probably done in this way:
let pos = [(1.3,4.3); (5.6,5.4); (1.5,4.8)]
Matrix.generic.init 6 6 (fun z y ->
pos |> List.fold (fun state (pz,py) ->
let iz, iy = int pz, int py
if iz = z && iy = y then (pz,py) :: state else state
) []
)
But I guess it would be much slower because it loops through the whole matrix times the list versus the former list iteration...
PS: the code might be wrong as I do not have F# on this computer to check it.
It depends on the definition of "functional". I would say that a "functional" function means that it always returns the same result for the same parameters and that it doesn't modify any global state (or the value of parameters if they are mutable). I think this is a sensible definition for F#, but it also means that there is nothing "dis-functional" with using mutation locally.
In my point of view, the following function is "functional", because it creates and returns a new matrix instead of modifying an existing one, but of course, the implementation of the function uses mutation.
let performStep m =
let res = Matrix.Generic.create 6 6 []
let pos = [(1.3,4.3); (5.6,5.4); (1.5,4.8)]
for pz, py in pos do
let z, y = int pz, int py
res.[z,y] <- (pz,py) :: m.[z,y]
res
Mutation-free version:
Now, if you wanted to make the implementation fully functional, then I would start by creating a matrix that contains Some(pz, py) in the places where you want to add the new list element to the element of the matrix and None in all other places. I guess this could be done by initializing a sparse matrix. Something like this:
let sp = pos |> List.map (fun (pz, py) -> int pz, int py, (pz, py))
let elementsToAdd = Matrix.Generic.initSparse 6 6 sp
Then you should be able to combine the original matrix m with the newly created elementsToAdd. This can be certainly done using init (however, having something like map2 would be maybe nicer):
let res = Matrix.init 6 6 (fun i j ->
match elementsToAdd.[i, j], m.[i, j] with
| Some(n), res -> n::res
| _, res -> res )
There is still quite likely some mutation hidden in the F# library functions (such as init and initSparse), but at least it shows one way to implement the operation using more primitive operations.
EDIT: This will work only if you need to add at most single element to each matrix cell. If you wanted to add multiple elements, you'd have to group them first (e.g. using Seq.groupBy)
You can do something like this:
[1.3, 4.3; 5.6, 5.4; 1.5, 4.8]
|> Seq.groupBy (fun (pz, py) -> int pz, int py)
|> Seq.map (fun ((pz, py), ps) -> pz, py, ps)
|> Matrix.Generic.initSparse 6 6
But in your question you said:
How would you make the folowing code functional with the same speed?
And in a later comment you said:
Well, I try to avoid mutability so that the code would be simple to paralelize in the future
I am afraid this is a triumph of hope over reality. Functional code generally has poor absolute performance and scales badly when parallelized. Given the huge amount of allocation this code is doing, you're not likely to see any performance gain from parallelism at all.
Why do you want to do it functionally? The Matrix type is designed to be mutated, so the way you're doing it now looks good to me.
If you really want to do it functionally, though, here's what I'd do:
let pos = [(1.3,4.3); (5.6,5.4); (1.5,4.8)]
let addValue m k v =
if Map.containsKey k m then
Map.add k (v::m.[k]) m
else
Map.add k [v] m
let map =
pos
|> List.map (fun (x,y) -> (int x, int y),(x,y))
|> List.fold (fun m (p,q) -> addValue m p q) Map.empty
let m = Matrix.Generic.init 6 6 (fun x y -> if (Map.containsKey (x,y) map) then map.[x,y] else [])
This runs through the list once, creating an immutable map from indices to lists of points. Then, we initialize each entry in the matrix, doing a single map lookup for each entry. This should take total time O(M + N log N) where M and N are the number of entries in your matrix and list respectively. I believe that your original solution using mutation takes O(M+N) time and your revised solution takes O(M*N) time.

Why is using a sequence so much slower than using a list in this example

Background:
I have a sequence of contiguous, time-stamped data.
The data-sequence has holes in it, some large, others just a single missing value.
Whenever the hole is just a single missing value, I want to patch the holes using a dummy-value (larger holes will be ignored).
I would like to use lazy generation of the patched sequence, and I am thus using Seq.unfold.
I have made two versions of the method to patch the holes in the data.
The first consumes the sequence of data with holes in it and produces the patched sequence. This is what i want, but the methods runs horribly slow when the number of elements in the input sequence rises above 1000, and it gets progressively worse the more elements the input sequence contains.
The second method consumes a list of the data with holes and produces the patched sequence and it runs fast. This is however not what I want, since this forces the instantiation of the entire input-list in memory.
I would like to use the (sequence -> sequence) method rather than the (list -> sequence) method, to avoid having the entire input-list in memory at the same time.
Questions:
1) Why is the first method so slow (getting progressively worse with larger input lists)
(I am suspecting that it has to do with repeatedly creating new sequences with Seq.skip 1, but I am not sure)
2) How can I make the patching of holes in the data fast, while using an input sequence rather than an input list?
The code:
open System
// Method 1 (Slow)
let insertDummyValuesWhereASingleValueIsMissing1 (timeBetweenContiguousValues : TimeSpan) (values : seq<(DateTime * float)>) =
let sizeOfHolesToPatch = timeBetweenContiguousValues.Add timeBetweenContiguousValues // Only insert dummy-values when the gap is twice the normal
(None, values) |> Seq.unfold (fun (prevValue, restOfValues) ->
if restOfValues |> Seq.isEmpty then
None // Reached the end of the input seq
else
let currentValue = Seq.hd restOfValues
if prevValue.IsNone then
Some(currentValue, (Some(currentValue), Seq.skip 1 restOfValues )) // Only happens to the first item in the seq
else
let currentTime = fst currentValue
let prevTime = fst prevValue.Value
let timeDiffBetweenPrevAndCurrentValue = currentTime.Subtract(prevTime)
if timeDiffBetweenPrevAndCurrentValue = sizeOfHolesToPatch then
let dummyValue = (prevTime.Add timeBetweenContiguousValues, 42.0) // 42 is chosen here for obvious reasons, making this comment superfluous
Some(dummyValue, (Some(dummyValue), restOfValues))
else
Some(currentValue, (Some(currentValue), Seq.skip 1 restOfValues))) // Either the two values were contiguous, or the gap between them was too large to patch
// Method 2 (Fast)
let insertDummyValuesWhereASingleValueIsMissing2 (timeBetweenContiguousValues : TimeSpan) (values : (DateTime * float) list) =
let sizeOfHolesToPatch = timeBetweenContiguousValues.Add timeBetweenContiguousValues // Only insert dummy-values when the gap is twice the normal
(None, values) |> Seq.unfold (fun (prevValue, restOfValues) ->
match restOfValues with
| [] -> None // Reached the end of the input list
| currentValue::restOfValues ->
if prevValue.IsNone then
Some(currentValue, (Some(currentValue), restOfValues )) // Only happens to the first item in the list
else
let currentTime = fst currentValue
let prevTime = fst prevValue.Value
let timeDiffBetweenPrevAndCurrentValue = currentTime.Subtract(prevTime)
if timeDiffBetweenPrevAndCurrentValue = sizeOfHolesToPatch then
let dummyValue = (prevTime.Add timeBetweenContiguousValues, 42.0)
Some(dummyValue, (Some(dummyValue), currentValue::restOfValues))
else
Some(currentValue, (Some(currentValue), restOfValues))) // Either the two values were contiguous, or the gap between them was too large to patch
// Test data
let numbers = {1.0..10000.0}
let contiguousTimeStamps = seq { for n in numbers -> DateTime.Now.AddMinutes(n)}
let dataWithOccationalHoles = Seq.zip contiguousTimeStamps numbers |> Seq.filter (fun (dateTime, num) -> num % 77.0 <> 0.0) // Has a gap in the data every 77 items
let timeBetweenContiguousValues = (new TimeSpan(0,1,0))
// The fast sequence-patching (method 2)
dataWithOccationalHoles |> List.of_seq |> insertDummyValuesWhereASingleValueIsMissing2 timeBetweenContiguousValues |> Seq.iter (fun pair -> printfn "%f %s" (snd pair) ((fst pair).ToString()))
// The SLOOOOOOW sequence-patching (method 1)
dataWithOccationalHoles |> insertDummyValuesWhereASingleValueIsMissing1 timeBetweenContiguousValues |> Seq.iter (fun pair -> printfn "%f %s" (snd pair) ((fst pair).ToString()))
Any time you break apart a seq using Seq.hd and Seq.skip 1 you are almost surely falling into the trap of going O(N^2). IEnumerable<T> is an awful type for recursive algorithms (including e.g. Seq.unfold), since these algorithms almost always have the structure of 'first element' and 'remainder of elements', and there is no efficient way to create a new IEnumerable that represents the 'remainder of elements'. (IEnumerator<T> is workable, but its API programming model is not so fun/easy to work with.)
If you need the original data to 'stay lazy', then you should use a LazyList (in the F# PowerPack). If you don't need the laziness, then you should use a concrete data type like 'list', which you can 'tail' into in O(1).
(You should also check out Avoiding stack overflow (with F# infinite sequences of sequences) as an FYI, though it's only tangentially applicable to this problem.)
Seq.skip constructs a new sequence. I think that is why your original approach is slow.
My first inclination is to use a sequence expression and Seq.pairwise. This is fast and easy to read.
let insertDummyValuesWhereASingleValueIsMissingSeq (timeBetweenContiguousValues : TimeSpan) (values : seq<(DateTime * float)>) =
let sizeOfHolesToPatch = timeBetweenContiguousValues.Add timeBetweenContiguousValues // Only insert dummy-values when the gap is twice the normal
seq {
yield Seq.hd values
for ((prevTime, _), ((currentTime, _) as next)) in Seq.pairwise values do
let timeDiffBetweenPrevAndCurrentValue = currentTime.Subtract(prevTime)
if timeDiffBetweenPrevAndCurrentValue = sizeOfHolesToPatch then
let dummyValue = (prevTime.Add timeBetweenContiguousValues, 42.0) // 42 is chosen here for obvious reasons, making this comment superfluous
yield dummyValue
yield next
}

Resources