performance of static member constraint functions


I'm trying to learn static member constraints in F#. From reading Tomas Petricek's blog post, I understand that writing an inline function that "uses only operations that are themselves written using static member constraints" will make my function work correctly for all numeric types that satisfy those constraints. This question indicates that inline works somewhat similarly to C++ templates, so I wasn't expecting any performance difference between these two functions:
let MultiplyTyped (A : double[,]) (B : double[,]) =
    let rA, cA = (Array2D.length1 A) - 1, (Array2D.length2 A) - 1
    let cB = (Array2D.length2 B) - 1
    let C = Array2D.zeroCreate<double> (Array2D.length1 A) (Array2D.length2 B)
    for i = 0 to rA do
        for k = 0 to cA do
            for j = 0 to cB do
                C.[i,j] <- C.[i,j] + A.[i,k] * B.[k,j]
    C
let inline MultiplyGeneric (A : 'T[,]) (B : 'T[,]) =
    let rA, cA = Array2D.length1 A - 1, Array2D.length2 A - 1
    let cB = Array2D.length2 B - 1
    let C = Array2D.zeroCreate<'T> (Array2D.length1 A) (Array2D.length2 B)
    for i = 0 to rA do
        for k = 0 to cA do
            for j = 0 to cB do
                C.[i,j] <- C.[i,j] + A.[i,k] * B.[k,j]
    C
Nevertheless, to multiply two 1024 x 1024 matrices, MultiplyTyped completes in an average of 2550 ms on my machine, whereas MultiplyGeneric takes about 5150 ms. I originally thought that zeroCreate was at fault in the generic version, but changing that line to the one below didn't make a difference.
let C = Array2D.init<'T> (Array2D.length1 A) (Array2D.length2 B) (fun i j -> LanguagePrimitives.GenericZero)
Is there something I'm missing here to make MultiplyGeneric perform the same as MultiplyTyped? Or is this expected?
edit: I should mention that this is VS2010, F# 2.0, Win7 64bit, release build. Platform target is x64 (to test larger matrices) - this makes a difference: x86 produces similar results for the two functions.
Bonus question: the type inferred for MultiplyGeneric is the following:
val inline MultiplyGeneric :
    ^T [,] -> ^T [,] -> ^T [,]
      when ( ^T or ^a) : (static member ( + ) : ^T * ^a -> ^T) and
           ^T : (static member ( * ) : ^T * ^T -> ^a)
Where does the ^a type come from?
edit 2: here's my testing code:
let r = new System.Random()
let A = Array2D.init 1024 1024 (fun i j -> r.NextDouble())
let B = Array2D.init 1024 1024 (fun i j -> r.NextDouble())
let test f =
    let sw = System.Diagnostics.Stopwatch.StartNew()
    f() |> ignore
    sw.Stop()
    printfn "%A" sw.ElapsedMilliseconds
for i = 1 to 5 do
    test (fun () -> MultiplyTyped A B)
for i = 1 to 5 do
    test (fun () -> MultiplyGeneric A B)

Good question. I'll answer the easy part first: the ^a is just part of the natural generalization process. Imagine you had a type like this:
type T = | T with
    static member (+)(T, i:int) = T
    static member (*)(T, T) = 0
Then you can still use your MultiplyGeneric function with arrays of this type: multiplying elements of A and B will give you ints, but that's okay because you can still add them to elements of C and get back values of type T to store back into C.
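A minimal sketch of that generalization. The helper mulAdd is hypothetical (it is not in the original post) and simply performs the same c + a * b step as the loop body; the type T is restated so the block stands alone:

```fsharp
// Restates the example type: T * T yields an int, and T + int yields a T.
type T =
    | T
    static member ( + ) (T, i: int) = T
    static member ( * ) (T, T) = 0

// Hypothetical helper mirroring the loop body C.[i,j] + A.[i,k] * B.[k,j].
// The product is allowed to have a different type from its operands,
// which is where the extra type variable in the inferred signature comes from.
let inline mulAdd c a b = c + a * b

// float * float -> float, then float + float -> float
let withFloats = mulAdd 1.0 2.0 3.0

// T * T -> int, then T + int -> T
let withT = mulAdd T T T
```

Both calls type-check against the same inline definition; only the second one needs the intermediate type to differ from the element type.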
As to your performance question, I'm afraid I don't have a great explanation. Your basic understanding is right - using MultiplyGeneric with double[,] arguments should be equivalent to using MultiplyTyped. If you use ildasm to look at the IL the compiler generates for the following F# code:
let arr = Array2D.zeroCreate 1024 1024
let f1 = MultiplyTyped arr
let f2 = MultiplyGeneric arr
let timer = System.Diagnostics.Stopwatch()
timer.Start()
f1 arr |> ignore
printfn "%A" timer.Elapsed
timer.Restart()
f2 arr |> ignore
printfn "%A" timer.Elapsed
then you can see that the compiler really does generate identical code for each of them, putting the inlined code for MultiplyGeneric into an internal static function. The only difference that I see in the generated code is in the names of locals, and when running from the command line I get roughly equal elapsed times. However, running from FSI I see a difference similar to what you've reported.
It's not clear to me why this would be. As I see it there are two possibilities:
FSI's code generation may be doing something slightly different from the static compiler.
The CLR's JIT compiler may treat code generated at runtime slightly differently from compiled code. For instance, as I mentioned, my code above using MultiplyGeneric actually results in an internal method that contains the inlined body. Perhaps the CLR's JIT handles the difference between public and internal methods differently when they are generated at runtime than when they are in statically compiled code.

I'd like to see your benchmarks. I don't get the same results (VS 2012 F# 3.0 Win 7 64-bit).
let m = Array2D.init 1024 1024 (fun i j -> float i * float j)
let test f =
    let sw = System.Diagnostics.Stopwatch.StartNew()
    f() |> ignore
    sw.Stop()
    printfn "%A" sw.Elapsed
test (fun () -> MultiplyTyped m m)
> 00:00:09.6013188
test (fun () -> MultiplyGeneric m m)
> 00:00:09.1686885
Decompiling with Reflector, the functions look identical.
Regarding your last question, the least restrictive constraint is inferred. In this line
C.[i,j] <- C.[i,j] + A.[i,k] * B.[k,j]
because the result type of A.[i,k] * B.[k,j] is unspecified, and is passed immediately to (+), an extra type could be involved. If you want to tighten the constraint you can replace that line with
let temp : 'T = A.[i,k] * B.[k,j]
C.[i,j] <- C.[i,j] + temp
That will change the signature to
val inline MultiplyGeneric :
    A: ^T [,] -> B: ^T [,] -> ^T [,]
      when ^T : (static member ( * ) : ^T * ^T -> ^T) and
           ^T : (static member ( + ) : ^T * ^T -> ^T)
EDIT
Using your test, here's the output:
//MultiplyTyped
00:00:09.9904615
00:00:09.5489653
00:00:10.0562346
00:00:09.7023183
00:00:09.5123992
//MultiplyGeneric
00:00:09.1320273
00:00:08.8195283
00:00:08.8523408
00:00:09.2496603
00:00:09.2950196
Here's the same test on ideone (with a few minor changes to stay within the time limit: 512x512 matrix and one test iteration). It runs F# 2.0 and produced similar results.

Related

Found ** in Ocaml, but not for exponentiation

In a book about logics (https://www.cl.cam.ac.uk/~jrh13/atp/OCaml/real.ml), I find this kind of code:
let integer_qelim =
  simplify ** evalc **
  lift_qelim linform (cnnf posineq ** evalc) cooper;;
I have seen ** before, but it was for exponentiation purposes, whereas here I do not think it is used for that, as the datatypes are not numeric. I would say it is some kind of function combinator, but I have no idea.
I think this book was written for a version 3.06, but an updated code for 4 (https://github.com/newca12/ocaml-atp) maintains this, so ** is still used in that way that I do not understand.

In OCaml, you can bind any behavior to an operator, e.g.,
let ( ** ) x y = print_endline x; print_endline y
so that "hello" ** "world" would print
hello
world
In the code that you reference, the (**) operator is bound to function composition:
let ( ** ) = fun f g x -> f(g x)

That's a utility function defined in lib.ml:
let ( ** ) = fun f g x -> f(g x);;
It's a composition-operator, often referred to as compose in other examples.
You can use it like this:
let a x = x^"a" in
let b x = x^"b" in
let c x = x^"c" in
let foo = a ** b ** c in
foo "input-";;
- : string = "input-cba"
You could write it as
let foo x = a (b (c x))
or
let foo x = a @@ b @@ c x
or
let foo x = c x |> b |> a
as well.

F# function takes too many arguments or used in a context not expected

I'm trying to implement a cost function and I currently have
let computeCost (X : Matrix<double>) (y : Vector<double>) (theta : Vector<double>) =
    let m = y.Count |> double
    let J = (1.0/(2.0*m))*(((X*theta - y) |> Vector.map (fun x -> x*x)).Sum)
    J
For some reason I get an error on the half after the first * saying "This function takes too many arguments, or is used in a context where a function is not expected."
However, when I do this
let computeCost (X : Matrix<double>) (y : Vector<double>) (theta : Vector<double>) =
    let m = y.Count |> double
    let J = (((X*theta - y) |> Vector.map (fun x -> x*x)).Sum)
    J
It works perfectly fine and it says that val J:float, which is what I expect. But as soon as I add in the second piece, the (1.0/(2.0*m)) part, I get the error. I have parentheses around everything, so I don't see how it can be some partial function being applied or something along those lines. I'm sure it's something dumb but I can't seem to figure it out.

Never mind, I'm dumb and I fell back into my C# ways of using .Sum(). The actual way of using it is
let computeCost (X : Matrix<double>) (y : Vector<double>) (theta : Vector<double>) =
    let m = y.Count |> double
    let J = (1.0/(2.0*m)) * (((X*theta - y) |> Vector.map (fun x -> x*x)) |> Vector.sum)
    J
And this seemed to fix it.
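For readers without the matrix library at hand, here is a rough sketch of the same cost formula, J = (1/(2m)) * sum((X*theta - y)^2), written against plain arrays. The name computeCostArrays and the use of float[][] in place of Matrix<double> are assumptions made only for this illustration:

```fsharp
// Hypothetical array-based variant; X is an array of rows standing in for Matrix<double>.
let computeCostArrays (X: float[][]) (y: float[]) (theta: float[]) =
    let m = float y.Length
    // Row-by-row dot product, i.e. X * theta.
    let predictions =
        X |> Array.map (fun row -> Array.fold2 (fun acc x t -> acc + x * t) 0.0 row theta)
    // Element-wise squared error, then summed with a function rather than a method call.
    let squaredErrors = Array.map2 (fun p yi -> (p - yi) * (p - yi)) predictions y
    (1.0 / (2.0 * m)) * Array.sum squaredErrors

let xs = [| [| 1.0; 1.0 |]; [| 1.0; 2.0 |] |]
let ys = [| 1.0; 2.0 |]
let perfectFit = computeCostArrays xs ys [| 0.0; 1.0 |]
let zeroTheta = computeCostArrays xs ys [| 0.0; 0.0 |]
```

As in the fix above, the sum is taken by piping into a module function (Array.sum here, Vector.sum there) instead of referring to a method without calling it.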

Small difference in types

I have three functions that ought to be equal:
let add1 x = x + 1
let add2 = (+) 1
let add3 = (fun x -> x + 1)
Why do the types of these methods differ?
add1 and add3 are int -> int, but add2 is (int -> int).
They all work as expected, I am just curious as to why FSI presents them differently?

This is typically an unimportant distinction, but if you're really curious, see the Arity Conformance for Values section of the F# spec.
My quick summary would be that (int -> int) is a superset of int -> int. Since add1 and add3 are syntactic functions, they are inferred to have the more specific type int -> int, while add2 is a function value and is therefore inferred to have the type (int -> int) (and cannot be treated as an int -> int).
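One practical way to see the difference between a syntactic function and a function value is to count how often the defining code runs. This is only a sketch, assuming the definitions sit at the top level of a script or module; the counters and the names ending in Counted are made up for the illustration:

```fsharp
let bodyRuns = ref 0
let bindingRuns = ref 0

// Syntactic function: the body (including the counter update) runs on every call.
let add1Counted x =
    bodyRuns.Value <- bodyRuns.Value + 1
    x + 1

// Function value: the right-hand side is an ordinary expression that is evaluated
// once, when the value is bound, and happens to produce a function.
let add2Counted =
    bindingRuns.Value <- bindingRuns.Value + 1
    (+) 1

let r1 = add1Counted 10
let r2 = add1Counted 20
let r3 = add2Counted 10
let r4 = add2Counted 20
```

After these four calls bodyRuns holds 2 while bindingRuns holds 1, although both definitions compute the same results.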

F# image manipulation performance problem

I am currently trying to improve the performance of an F# program to make it as fast as its C# equivalent. The program applies a filter array to a buffer of pixels. Access to memory is always done using pointers.
Here is the C# code which is applied to each pixel of an image:
unsafe private static byte getPixelValue(byte* buffer, double* filter, int filterLength, double filterSum)
{
    double sum = 0.0;
    for (int i = 0; i < filterLength; ++i)
    {
        sum += (*buffer) * (*filter);
        ++buffer;
        ++filter;
    }
    sum = sum / filterSum;
    if (sum > 255) return 255;
    if (sum < 0) return 0;
    return (byte) sum;
}
The F# code looks like this and takes three times as long as the C# program:
let getPixelValue (buffer:nativeptr<byte>) (filterData:nativeptr<float>) filterLength filterSum : byte =
    let rec accumulatePixel (acc:float) (buffer:nativeptr<byte>) (filter:nativeptr<float>) i =
        if i > 0 then
            let newAcc = acc + (float (NativePtr.read buffer) * (NativePtr.read filter))
            accumulatePixel newAcc (NativePtr.add buffer 1) (NativePtr.add filter 1) (i-1)
        else
            acc
    let acc = (accumulatePixel 0.0 buffer filterData filterLength) / filterSum
    match acc with
    | _ when acc > 255.0 -> 255uy
    | _ when acc < 0.0 -> 0uy
    | _ -> byte acc
Using mutable variables and a for loop in F# results in the same speed as using recursion. All projects are configured to run in Release mode with code optimization turned on.
How could the performance of the F# version be improved?
EDIT:
The bottleneck seems to be in (NativePtr.get buffer offset). If I replace this code with a fixed value and also replace the corresponding code in the C# version with a fixed value, I get about the same speed for both programs. In fact, in C# the speed does not change at all, but in F# it makes a huge difference.
Can this behaviour possibly be changed or is it rooted deeply in the architecture of F#?
EDIT 2:
I refactored the code again to use for-loops. The execution speed remains the same:
let mutable acc = 0.0
let mutable f = filterData
let mutable b = tBuffer
for i in 1 .. filter.FilterLength do
    acc <- acc + (float (NativePtr.read b)) * (NativePtr.read f)
    f <- NativePtr.add f 1
    b <- NativePtr.add b 1
If I compare the IL code of a version that uses (NativePtr.read b) and another version that is the same except that it uses a fixed value 111uy instead of reading it from the pointer, only the following lines in the IL code change:
111uy has IL-Code ldc.i4.s 0x6f (0.3 seconds)
(NativePtr.read b) has IL-Code lines ldloc.s b and ldobj uint8 (1.4 seconds)
For comparison: C# does the filtering in 0.4 seconds.
The fact that reading the filter does not impact performance while reading from the image buffer does is somehow confusing. Before I filter a line of the image I copy the line into a buffer that has the length of a line. That's why the read operations are not spread all over the image but are within this buffer, which has a size of about 800 bytes.

If we look at the actual IL code of the inner loop which traverses both buffers in parallel generated by C# compiler (relevant part):
L_0017: ldarg.0
L_0018: ldc.i4.1
L_0019: conv.i
L_001a: add
L_001b: starg.s buffer
L_001d: ldarg.1
L_001e: ldc.i4.8
L_001f: conv.i
L_0020: add
and F# compiler:
L_0017: ldc.i4.1
L_0018: conv.i
L_0019: sizeof uint8
L_001f: mul
L_0020: add
L_0021: ldarg.2
L_0022: ldc.i4.1
L_0023: conv.i
L_0024: sizeof float64
L_002a: mul
L_002b: add
we'll notice that the C# code uses only the add operator, while F# needs both mul and add. But on each step we only need to increment the pointers (by 'sizeof byte' and 'sizeof float' respectively), not to calculate an address (addrBase + (sizeof byte)), so the F# mul is unnecessary (it always multiplies by 1).
The cause for that is that C# defines ++ operator for pointers while F# provides only add : nativeptr<'T> -> int -> nativeptr<'T> operator:
[<NoDynamicInvocation>]
let inline add (x : nativeptr<'a>) (n:int) : nativeptr<'a> = to_nativeint x + nativeint n * (# "sizeof !0" type('a) : nativeint #) |> of_nativeint
So it's not "rooted deeply" in F#, it's just that module NativePtr lacks inc and dec functions.
Btw, I suspect the above sample could be written in a more concise manner if the arguments were passed as arrays instead of raw pointers.
UPDATE:
So the following code gives only a 1% speed-up (it seems to generate IL very similar to that of C#):
let getPixelValue (buffer:nativeptr<byte>) (filterData:nativeptr<float>) filterLength filterSum : byte =
    let rec accumulatePixel (acc:float) (buffer:nativeptr<byte>) (filter:nativeptr<float>) i =
        if i > 0 then
            let newAcc = acc + (float (NativePtr.read buffer) * (NativePtr.read filter))
            accumulatePixel newAcc (NativePtr.ofNativeInt <| (NativePtr.toNativeInt buffer) + (nativeint 1)) (NativePtr.ofNativeInt <| (NativePtr.toNativeInt filter) + (nativeint 8)) (i-1)
        else
            acc
    let acc = (accumulatePixel 0.0 buffer filterData filterLength) / filterSum
    match acc with
    | _ when acc > 255.0 -> 255uy
    | _ when acc < 0.0 -> 0uy
    | _ -> byte acc
Another thought: it might also depend on the number of calls to getPixelValue your test does (F# splits this function into two methods while C# does it in one).
Could you post your testing code here?
Regarding arrays: I'd expect the code to be at least more concise (and not unsafe).
UPDATE #2:
Looks like the actual bottleneck here is byte->float conversion.
C#:
L_0003: ldarg.1
L_0004: ldind.u1
L_0005: conv.r8
F#:
L_000c: ldarg.1
L_000d: ldobj uint8
L_0012: conv.r.un
L_0013: conv.r8
For some reason F# uses the following path: byte->float32->float64, while C# does only byte->float64. Not sure why that is, but with the following hack my F# version runs at the same speed as C# on gradbot's test sample (BTW, thanks gradbot for the test!):
let inline preadConvert (p : nativeptr<byte>) = (# "conv.r8" (# "ldobj !0" type (byte) p : byte #) : float #)
let inline pinc (x : nativeptr<'a>) : nativeptr<'a> = NativePtr.toNativeInt x + (# "sizeof !0" type('a) : nativeint #) |> NativePtr.ofNativeInt
let rec accumulatePixel_ed (acc, buffer, filter, i) =
    if i > 0 then
        accumulatePixel_ed
            (acc + (preadConvert buffer) * (NativePtr.read filter),
             (pinc buffer),
             (pinc filter),
             (i-1))
    else
        acc
Results:
adrian 6374985677.162810 1408.870900 ms
gradbot 6374985677.162810 1218.908200 ms
C# 6374985677.162810 227.832800 ms
C# Offset 6374985677.162810 224.921000 ms
mutable 6374985677.162810 1254.337300 ms
ed'ka 6374985677.162810 227.543100 ms
LAST UPDATE
It turned out that we can achieve the same speed even without any hacks:
let rec accumulatePixel_ed_last (acc, buffer, filter, i) =
    if i > 0 then
        accumulatePixel_ed_last
            (acc + (float << int16 <| NativePtr.read buffer) * (NativePtr.read filter),
             (NativePtr.add buffer 1),
             (NativePtr.add filter 1),
             (i-1))
    else
        acc
All we need to do is convert the byte into, say, int16 and then into float. This way the 'costly' conv.r.un instruction is avoided.
PS Relevant conversion code from "prim-types.fs" :
let inline float (x: ^a) =
    (^a : (static member ToDouble : ^a -> float) (x))
     when ^a : float = (# "" x : float #)
     when ^a : float32 = (# "conv.r8" x : float #)
     // [skipped]
     when ^a : int16 = (# "conv.r8" x : float #)
     // [skipped]
     when ^a : byte = (# "conv.r.un conv.r8" x : float #)
     when ^a : decimal = (System.Convert.ToDouble((# "" x : decimal #)))
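As a quick sanity check on that workaround, the sketch below confirms that going through int16 yields the same float as the direct conversion for every possible byte value. It says nothing about speed; the function names direct and viaInt16 are made up for the illustration:

```fsharp
// Direct conversion; per the excerpt above this maps to conv.r.un followed by conv.r8.
let direct (b: byte) = float b

// Widen to int16 first; per the excerpt above the int16 case maps to a single conv.r8.
let viaInt16 (b: byte) = float (int16 b)

// Compare the two paths over the whole byte range 0..255.
let allEqual =
    [ 0 .. 255 ]
    |> List.map byte
    |> List.forall (fun b -> direct b = viaInt16 b)

printfn "%b" allEqual
```

Since every byte fits in an int16 without changing sign, the detour changes only the instructions emitted, not the values computed.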

How does this compare? It has fewer calls to NativePtr.
let getPixelValue (buffer:nativeptr<byte>) (filterData:nativeptr<float>) filterLength filterSum : byte =
    let accumulatePixel (acc:float) (buffer:nativeptr<byte>) (filter:nativeptr<float>) length =
        let rec accumulate acc offset =
            if offset < length then
                let newAcc = acc + (float (NativePtr.get buffer offset) * (NativePtr.get filter offset))
                accumulate newAcc (offset + 1)
            else
                acc
        accumulate acc 0
    let acc = (accumulatePixel 0.0 buffer filterData filterLength) / filterSum
    match acc with
    | _ when acc > 255.0 -> 255uy
    | _ when acc < 0.0 -> 0uy
    | _ -> byte acc
F# source code of NativePtr.
[<NoDynamicInvocation>]
[<CompiledName("AddPointerInlined")>]
let inline add (x : nativeptr<'T>) (n:int) : nativeptr<'T> = toNativeInt x + nativeint n * (# "sizeof !0" type('T) : nativeint #) |> ofNativeInt
[<NoDynamicInvocation>]
[<CompiledName("GetPointerInlined")>]
let inline get (p : nativeptr<'T>) n = (# "ldobj !0" type ('T) (add p n) : 'T #)

My results on a larger test.
adrian 6374730426.098020 1561.102500 ms
gradbot 6374730426.098020 1842.768000 ms
C# 6374730426.098020 150.793500 ms
C# Offset 6374730426.098020 150.318900 ms
mutable 6374730426.098020 1446.616700 ms
F# test code
open Microsoft.FSharp.NativeInterop
open System.Runtime.InteropServices
open System.Diagnostics
open AccumulatePixel
#nowarn "9"
let test size fn =
    let bufferByte = Marshal.AllocHGlobal(size * 4)
    let bufferFloat = Marshal.AllocHGlobal(size * 8)
    let bi = NativePtr.ofNativeInt bufferByte
    let bf = NativePtr.ofNativeInt bufferFloat
    let random = System.Random()
    // Indices run from 0 to size - 1 so that element 0 is initialized
    // and no write lands past the end of the float buffer.
    for i in 0 .. size - 1 do
        NativePtr.set bi i (byte <| random.Next() % 256)
        NativePtr.set bf i (random.NextDouble())
    let duration (f, name) =
        let stopWatch = Stopwatch.StartNew()
        let time = f(0.0, bi, bf, size)
        stopWatch.Stop()
        printfn "%10s %f %f ms" name time stopWatch.Elapsed.TotalMilliseconds
    List.iter duration fn
    Marshal.FreeHGlobal bufferFloat
    Marshal.FreeHGlobal bufferByte
let rec accumulatePixel_adrian (acc, buffer, filter, i) =
    if i > 0 then
        let newAcc = acc + (float (NativePtr.read buffer) * (NativePtr.read filter))
        accumulatePixel_adrian (newAcc, (NativePtr.add buffer 1), (NativePtr.add filter 1), (i - 1))
    else
        acc
let accumulatePixel_gradbot (acc, buffer, filter, length) =
    let rec accumulate acc offset =
        if offset < length then
            let newAcc = acc + (float (NativePtr.get buffer offset) * (NativePtr.get filter offset))
            accumulate newAcc (offset + 1)
        else
            acc
    accumulate acc 0
let accumulatePixel_mutable (acc, buffer, filter, length) =
    let mutable acc = 0.0
    let mutable f = filter
    let mutable b = buffer
    for i in 1 .. length do
        acc <- acc + (float (NativePtr.read b)) * (NativePtr.read f)
        f <- NativePtr.add f 1
        b <- NativePtr.add b 1
    acc
[
    accumulatePixel_adrian, "adrian";
    accumulatePixel_gradbot, "gradbot";
    AccumulatePixel.getPixelValue, "C#";
    AccumulatePixel.getPixelValueOffset, "C# Offset";
    accumulatePixel_mutable, "mutable";
]
|> test 100000000
System.Console.ReadLine() |> ignore
C# test code
namespace AccumulatePixel
{
    public class AccumulatePixel
    {
        unsafe public static double getPixelValue(double sum, byte* buffer, double* filter, int filterLength)
        {
            for (int i = 0; i < filterLength; ++i)
            {
                sum += (*buffer) * (*filter);
                ++buffer;
                ++filter;
            }
            return sum;
        }
        unsafe public static double getPixelValueOffset(double sum, byte* buffer, double* filter, int filterLength)
        {
            for (int i = 0; i < filterLength; ++i)
            {
                sum += buffer[i] * filter[i];
            }
            return sum;
        }
    }
}

How do I translate this Haskell to F#?

I'm trying to learn F# by translating some Haskell code I wrote a very long time ago, but I'm stuck!
percent :: Int -> Int -> Float
percent a b = (fromInt a / fromInt b) * 100
freqs :: String -> [Float]
freqs ws = [percent (count x ws) (lowers ws) | x <- ['a' .. 'z']]
I've managed this:
let percent a b = (float a / float b) * 100.
although I don't like having to have the . after the 100.
What is the name of the operation I am performing in freqs, and how do I translate it to F#?
Edit: count and lowers are Char -> String -> Int and String -> Int respectively, and I have translated these already.

This is a list comprehension, and in F# it looks like the last two lines below:
// stub out since don't know the implementation
let count (c:char) (s:string) = 4
let lowers (s:string) = 10
// your code
let percent a b = (float a / float b) * 100.
let freq ws = [for x in ['a'..'z'] do
                  yield percent (count x ws) (lowers ws)]
More generally I think Haskell list comprehensions have the form suggested by the example below, and the corresponding F# is shown.
// Haskell
// [e(x,y) | x <- l1, y <- l2, pred(x,y)]
// F#
[for x in l1 do
    for y in l2 do
        if pred(x,y) then
            yield e(x,y)]

Note that Brian's F# code:
let freq ws = [for x in ['a'..'z'] do yield percent (count x ws) (lowers ws)]
Can be written more elegantly as:
let freq ws = [for x in 'a'..'z' -> percent (count x ws) (lowers ws)]
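To try the comprehension end to end, here is a small self-contained sketch. The bodies of count and lowers are guesses based only on the types given in the question (Char -> String -> Int and String -> Int); the original implementations were not posted:

```fsharp
// Hypothetical implementations: how often c occurs in s, and how many lowercase letters s has.
let count (c: char) (s: string) =
    s |> Seq.filter (fun ch -> ch = c) |> Seq.length

let lowers (s: string) =
    s |> Seq.filter (fun ch -> System.Char.IsLower ch) |> Seq.length

let percent a b = (float a / float b) * 100.

// One percentage per letter 'a'..'z', in alphabetical order.
let freqs ws = [ for x in 'a' .. 'z' -> percent (count x ws) (lowers ws) ]

let sample = freqs "abba"
```

With the input "abba" the first two entries (for 'a' and 'b') are both 50.0 and the remaining 24 entries are 0.0.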
