Basic F# / Rust performance comparison

This is a simplistic performance test based on https://www.youtube.com/watch?v=QlMLB2-G25c, which compares the performance of Rust vs WASM vs Python vs Go.
The original Rust program (from https://github.com/masmullin2000/random-sort-examples) is:
use rand::prelude::*;

fn main() {
    let vec = make_random_vec(1_000_000, 100);
    for _ in 0..250 {
        let mut v = vec.clone();
        // v.sort_unstable();
        v.sort(); // using stable sort, as the F# sort is a stable sort
    }
}

pub fn make_random_vec(sz: usize, modulus: i64) -> Vec<i64> {
    let mut v: Vec<i64> = Vec::with_capacity(sz);
    for _ in 0..sz {
        let x: i64 = random();
        v.push(x % modulus);
    }
    v
}
So I created the following F# program to compare against Rust:
open System

let rec cls (arr: int64 array) count =
    if count > 0 then
        let v1 = Array.copy arr
        let v2 = Array.sort v1
        cls arr (count - 1)
    else
        ()

let rnd = Random()
let rndArray = Array.init 1000000 (fun _ -> int64 (rnd.Next(100)))
cls rndArray 250 |> ignore
I was expecting F# to be slower (both were run on WSL2), but I got the following times on my Core i7 8th-gen laptop:
Rust - around 17 seconds
Rust (unstable sort) - around 2.7 seconds
F# - around 11 seconds
My questions:
Is the .NET compiler doing some sort of optimisation that throws away some of the processing because the return values are not being used, resulting in the F# code running faster, or am I doing something wrong?
Does F# have an unstable sort function that I can use to compare against the Rust unstable sort?

Related

Which Rust RNG should be used for multithreaded sampling?

I am trying to create a function in Rust which will sample from M normal distributions N times. I have the sequential version below, which runs fine. I am trying to parallelize it using Rayon, but am encountering the error
Rc<UnsafeCell<ReseedingRng<rand_chacha::chacha::ChaCha12Core, OsRng>>> cannot be sent between threads safely
It seems my rand::thread_rng does not implement the traits Send and Sync. I tried using StdRng and OsRng, which both do, to no avail, because then I get errors that the variables pred and rng cannot be borrowed as mutable because they are captured in a Fn closure.
This is the working code below. It errors when I change the first into_iter() to into_par_iter().
use rand_distr::{Normal, Distribution};
use std::time::Instant;
use rayon::prelude::*;

fn rprednorm(n: i32, means: Vec<f64>, sds: Vec<f64>) -> Vec<Vec<f64>> {
    let mut rng = rand::thread_rng();
    let mut preds = vec![vec![0.0; n as usize]; means.len()];
    (0..means.len()).into_iter().for_each(|i| {
        (0..n).into_iter().for_each(|j| {
            let normal = Normal::new(means[i], sds[i]).unwrap();
            preds[i][j as usize] = normal.sample(&mut rng);
        })
    });
    preds
}

fn main() {
    let means = vec![0.0; 67000];
    let sds = vec![1.0; 67000];
    let start = Instant::now();
    let preds = rprednorm(100, means, sds);
    let duration = start.elapsed();
    println!("{:?}", duration);
}
Any advice on how to make these two iterators parallel?
Thanks.
It seems my rand::thread_rng does not implement the traits Send and Sync.
Why are you trying to send a thread_rng? The entire point of thread_rng is that it's a per-thread RNG.
then I get errors that the variables pred and rng cannot be borrowed as mutable because they are captured in a Fn closure.
Well yes, you need to Clone the StdRng (or Copy the OsRng) into each closure. As for preds, that can't work for a similar reason: once you parallelise the loop, the compiler does not know that every i is distinct, so as far as it's concerned the writes to preds[i] could overlap (you could have two iterations running in parallel which try to write to the same place at the same time), which is illegal.
The solution is to use rayon to iterate in parallel over the destination vector:
fn rprednorm(n: i32, means: Vec<f64>, sds: Vec<f64>) -> Vec<Vec<f64>> {
    let mut preds = vec![vec![0.0; n as usize]; means.len()];
    preds.par_iter_mut().enumerate().for_each(|(i, e)| {
        let mut rng = rand::thread_rng();
        (0..n).into_iter().for_each(|j| {
            let normal = Normal::new(means[i], sds[i]).unwrap();
            e[j as usize] = normal.sample(&mut rng);
        })
    });
    preds
}
Alternatively with OsRng, it's just a marker ZST, so you can refer to it as a value:
fn rprednorm(n: i32, means: Vec<f64>, sds: Vec<f64>) -> Vec<Vec<f64>> {
    let mut preds = vec![vec![0.0; n as usize]; means.len()];
    preds.par_iter_mut().enumerate().for_each(|(i, e)| {
        (0..n).into_iter().for_each(|j| {
            let normal = Normal::new(means[i], sds[i]).unwrap();
            e[j as usize] = normal.sample(&mut rand::rngs::OsRng);
        })
    });
    preds
}
StdRng doesn't seem very suitable for this use case: you'd either have to create one per top-level iteration to get different samplings, or initialise a base RNG and then clone it once per task, in which case they'd all produce the same sequence (as they'd share a seed).

Rust sorting uses surprisingly few comparisons

I am currently learning Rust (using the Rust book), and one page mentions counting the number of times the sorting key was used while sorting an array. I modified the code to count this for arbitrary sizes, and here is the code:
fn main() {
    const MAX: i32 = 10000;
    for n in 1..MAX {
        let mut v: Vec<i32> = (1..n).collect();
        let mut ops = 0;
        v.sort_by(|x, y| {
            ops += 1;
            x.cmp(y)
        });
        if n - 2 >= 0 {
            assert_eq!(n - 2, ops);
        }
        // println!("A list of {n} elements is sorted in {ops} operations");
    }
}
However, it seems that in order to sort a vector of n elements, Rust only needs n-2 comparisons (the code above runs without panicking).
How can this be possible? Aren't sorts supposed to be O(n*log(n))?
Is it because Rust somehow "noticed" that my input vector was already sorted?
Even in that case, how can a vector of length 2 be sorted without any comparisons? Shouldn't it be at least n-1?
The biggest misconception you have, I think, is here:
fn main() {
    const SIZE: i32 = 5;
    let v: Vec<i32> = (1..SIZE).collect();
    println!("{}", v.len());
}
4
The range 1..SIZE does not include SIZE and contains SIZE-1 elements.
Further, it will already be sorted, so it's as simple as iterating through it once.
See here:
fn main() {
    const SIZE: i32 = 5;
    let mut v: Vec<i32> = (1..SIZE).collect();
    let mut ops = 0;
    v.sort_by(|x, y| {
        ops += 1;
        let result = x.cmp(y);
        println!(" - cmp {} vs {} => {:?}", x, y, result);
        result
    });
    println!("Total comparisons: {}", ops);
}
- cmp 4 vs 3 => Greater
- cmp 3 vs 2 => Greater
- cmp 2 vs 1 => Greater
Total comparisons: 3
it seems that in order to sort a vector of n elements, Rust only needs n-2 comparisons
That is incorrect. In order to sort an already sorted vector (which yours are), Rust needs n-1 comparisons. It doesn't detect that; it's just an inherent property of the mergesort implementation that Rust uses.
If it isn't already sorted, it will be more:
fn main() {
    let mut v: Vec<i32> = vec![2, 4, 1, 3];
    let mut ops = 0;
    v.sort_by(|x, y| {
        ops += 1;
        let result = x.cmp(y);
        println!(" - cmp {} vs {} => {:?}", x, y, result);
        result
    });
    println!("Total comparisons: {}", ops);
}
- cmp 3 vs 1 => Greater
- cmp 1 vs 4 => Less
- cmp 3 vs 4 => Less
- cmp 1 vs 2 => Less
- cmp 3 vs 2 => Greater
Total comparisons: 5
FYI sort_by:
pub fn sort_by<F>(&mut self, mut compare: F)
where
    F: FnMut(&T, &T) -> Ordering,
{
    merge_sort(self, |a, b| compare(a, b) == Less);
}
and it actually invokes merge_sort:
/// This merge sort borrows some (but not all) ideas from TimSort, which is described in detail
/// [here](https://github.com/python/cpython/blob/main/Objects/listsort.txt).
///
/// The algorithm identifies strictly descending and non-descending subsequences, which are called
/// natural runs. There is a stack of pending runs yet to be merged. Each newly found run is pushed
/// onto the stack, and then some pairs of adjacent runs are merged until these two invariants are
/// satisfied:
///
/// 1. for every `i` in `1..runs.len()`: `runs[i - 1].len > runs[i].len`
/// 2. for every `i` in `2..runs.len()`: `runs[i - 2].len > runs[i - 1].len + runs[i].len`
///
/// The invariants ensure that the total running time is *O*(*n* \* log(*n*)) worst-case.
#[cfg(not(no_global_oom_handling))]
fn merge_sort<T, F>(v: &mut [T], mut is_less: F)
how can a vector of length 2 be sorted without any comparisons? Shouldn't it at least be n-1?
(1..2) produces a vector of length 1 (it starts at 1 and stops before 2). So when n == 2 in your code, note that the length of the vector is one.
Let me demonstrate what actually happens in merge_sort when the input is a short slice (at most MAX_INSERTION, i.e. 20, elements).
// MAX_INSERTION: 20
if len <= MAX_INSERTION {
    // if len is less than 2, the `is_less` closure is never called, so no comparisons are counted.
    if len >= 2 {
        for i in (0..len - 1).rev() {
            insert_head(&mut v[i..], &mut is_less); // <- go into `insert_head`.
        }
    }
    return;
}
fn insert_head<T, F>(v: &mut [T], is_less: &mut F)
where
    F: FnMut(&T, &T) -> bool,
{
    if v.len() >= 2 && is_less(&v[1], &v[0]) // <- here it uses the closure to make a comparison.
So if your input has fewer than 21 elements, it will be sorted in place via insertion sort, which avoids allocations.
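As a quick sanity check of the counting technique (a sketch I added; exact counts can vary between standard-library versions, so the checks below only bound them rather than assert exact numbers):

```rust
// Count how many times `sort_by` invokes the comparator for a given vector.
fn count_comparisons(v: &mut Vec<i32>) -> usize {
    let mut ops = 0;
    v.sort_by(|x, y| {
        ops += 1;
        x.cmp(y)
    });
    ops
}

fn main() {
    // Two distinct elements: at least one comparison is unavoidable.
    let mut pair = vec![2, 1];
    let pair_ops = count_comparisons(&mut pair);
    println!("pair sorted with {} comparisons", pair_ops);

    // An already-sorted run of length 4 needs at least len - 1 comparisons
    // just to verify its order.
    let mut sorted: Vec<i32> = (1..5).collect();
    let sorted_ops = count_comparisons(&mut sorted);
    println!("sorted run checked with {} comparisons", sorted_ops);
}
```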

What part of my code to filter prime numbers causes it to slow down as it processes?

I am doing some problems on Project Euler. This challenge requires filtering prime numbers from an array. I was halfway to my solution when I found that my Rust program was a bit slow, so I added a progress bar to check the progress.
Here is the code:
extern crate pbr;
use self::pbr::ProgressBar;

pub fn is_prime(i: i32) -> bool {
    for d in 2..i {
        if i % d == 0 {
            return false;
        }
    }
    true
}

pub fn calc_sum_loop(max_num: i32) -> i32 {
    let mut pb = ProgressBar::new(max_num as u64);
    pb.format("[=>_]");
    let mut sum_primes = 0;
    for i in 1..max_num {
        if is_prime(i) {
            sum_primes += i;
        }
        pb.inc();
    }
    sum_primes
}

pub fn solve() {
    println!("About to calculate sum of primes in the first 20000");
    println!("When using a forloop {:?}", calc_sum_loop(400000));
}
I am calling the solve function from my main.rs file. It turns out that the loop iterates much faster at the beginning and much slower later on.
➜ euler-rust git:(master) ✗ cargo run --release
Finished release [optimized] target(s) in 0.05s
Running `target/release/euler-rust`
About to calculate sum of primes..
118661 / 400000 [===========>__________________________] 29.67 % 48780.25/s 6s
...
...
400000 / 400000 [=======================================] 100.00 % 23725.24/s
I am sort of drawing a blank in what might be causing this slowdown. It feels like Rust should be able to be much faster than what I am currently seeing. Note that I am telling Cargo to build with the --release flag. I am aware that not doing this might slow things down even further.
The function that is slowing the execution down is is_prime. It does trial division by every d in 2..i, so each call costs O(i) work; as i grows, every iteration of the outer loop gets more expensive, which is exactly the slowdown your progress bar shows.
You may consider using a more efficient crate like primes, or you can check efficient prime-number-checking algorithms here.
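Even without an external crate, trial division only needs to test divisors up to the square root of i. Here is a sketch of that optimisation (my own addition, not code from the crates mentioned); note that unlike the original it also returns false for i < 2, since 0 and 1 are not prime:

```rust
// Trial division up to sqrt(i): if i has a factor larger than its square
// root, it must also have one smaller, so we never need to test past it.
pub fn is_prime(i: i32) -> bool {
    if i < 2 {
        return false;
    }
    if i % 2 == 0 {
        return i == 2; // 2 is the only even prime
    }
    let mut d = 3;
    while d * d <= i {
        if i % d == 0 {
            return false;
        }
        d += 2; // skip even divisors
    }
    true
}

fn main() {
    // sum of primes below 20: 2 + 3 + 5 + 7 + 11 + 13 + 17 + 19 = 77
    let sum: i32 = (1..20).filter(|&i| is_prime(i)).sum();
    println!("{}", sum); // prints 77
}
```

This turns each call from O(i) into O(sqrt(i)), which is usually enough to make the progress bar speed difference disappear for bounds like 400000.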

Is this a valid implementation of `std::mem::drop`?

According to The Rust Programming Language, ch15-03, std::mem::drop takes an object, receives its ownership, and calls its drop function.
That's what this code does:
fn my_drop<T>(x: T) {}

fn main() {
    let x = 5;
    let y = &x;
    let mut z = 4;
    let v = vec![3, 4, 2, 5, 3, 5];
    my_drop(v);
}
Is this what std::mem::drop does? Does it perform any other cleanup tasks other than these?
Let's take a look at the source:
#[inline]
#[stable(feature = "rust1", since = "1.0.0")]
pub fn drop<T>(_x: T) { }
#[inline] gives a hint to the compiler that the function should be inlined. #[stable] is used by the standard library to mark APIs that are available on the stable channel. Otherwise, it's really just an empty function! When _x goes out of scope as drop returns, its destructor is run; there is no other way to perform cleanup tasks implicitly in Rust.
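To see that the destructor really does run when the argument goes out of scope inside the empty function, here is a small sketch (my own addition, using a hypothetical Tracker type) that observes the Drop impl firing:

```rust
use std::sync::atomic::{AtomicBool, Ordering};

// Flag flipped by the destructor so we can observe when it runs.
static DROPPED: AtomicBool = AtomicBool::new(false);

struct Tracker;

impl Drop for Tracker {
    fn drop(&mut self) {
        DROPPED.store(true, Ordering::SeqCst);
    }
}

// Same shape as std::mem::drop: take ownership, do nothing.
fn my_drop<T>(_x: T) {}

fn main() {
    let t = Tracker;
    my_drop(t); // `t` is moved in; its destructor runs as `_x` goes out of scope
    assert!(DROPPED.load(Ordering::SeqCst));
    println!("destructor ran inside my_drop");
}
```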

performance of static member constraint functions

I'm trying to learn static member constraints in F#. From reading Tomas Petricek's blog post, I understand that writing an inline function that "uses only operations that are themselves written using static member constraints" will make my function work correctly for all numeric types that satisfy those constraints. This question indicates that inline works somewhat similarly to C++ templates, so I wasn't expecting any performance difference between these two functions:
let MultiplyTyped (A: double[,]) (B: double[,]) =
    let rA, cA = (Array2D.length1 A) - 1, (Array2D.length2 A) - 1
    let cB = (Array2D.length2 B) - 1
    let C = Array2D.zeroCreate<double> (Array2D.length1 A) (Array2D.length2 B)
    for i = 0 to rA do
        for k = 0 to cA do
            for j = 0 to cB do
                C.[i,j] <- C.[i,j] + A.[i,k] * B.[k,j]
    C

let inline MultiplyGeneric (A: 'T[,]) (B: 'T[,]) =
    let rA, cA = Array2D.length1 A - 1, Array2D.length2 A - 1
    let cB = Array2D.length2 B - 1
    let C = Array2D.zeroCreate<'T> (Array2D.length1 A) (Array2D.length2 B)
    for i = 0 to rA do
        for k = 0 to cA do
            for j = 0 to cB do
                C.[i,j] <- C.[i,j] + A.[i,k] * B.[k,j]
    C
Nevertheless, to multiply two 1024 x 1024 matrices, MultiplyTyped completes in an average of 2550 ms on my machine, whereas MultiplyGeneric takes about 5150 ms. I originally thought that zeroCreate was at fault in the generic version, but changing that line to the one below didn't make a difference.
let C = Array2D.init<'T> (Array2D.length1 A) (Array2D.length2 B) (fun i j -> LanguagePrimitives.GenericZero)
Is there something I'm missing here to make MultiplyGeneric perform the same as MultiplyTyped? Or is this expected?
edit: I should mention that this is VS2010, F# 2.0, Win7 64bit, release build. Platform target is x64 (to test larger matrices) - this makes a difference: x86 produces similar results for the two functions.
Bonus question: the type inferred for MultiplyGeneric is the following:
val inline MultiplyGeneric :
  ^T [,] -> ^T [,] -> ^T [,]
    when (^T or ^a) : (static member ( + ) : ^T * ^a -> ^T) and
         ^T : (static member ( * ) : ^T * ^T -> ^a)
Where does the ^a type come from?
edit 2: here's my testing code:
let r = new System.Random()
let A = Array2D.init 1024 1024 (fun i j -> r.NextDouble())
let B = Array2D.init 1024 1024 (fun i j -> r.NextDouble())

let test f =
    let sw = System.Diagnostics.Stopwatch.StartNew()
    f() |> ignore
    sw.Stop()
    printfn "%A" sw.ElapsedMilliseconds

for i = 1 to 5 do
    test (fun () -> MultiplyTyped A B)
for i = 1 to 5 do
    test (fun () -> MultiplyGeneric A B)
Good question. I'll answer the easy part first: the ^a is just part of the natural generalization process. Imagine you had a type like this:
type T = | T with
    static member (+)(T, i: int) = T
    static member (*)(T, T) = 0
Then you can still use your MultiplyGeneric function with arrays of this type: multiplying elements of A and B will give you ints, but that's okay because you can still add them to elements of C and get back values of type T to store back into C.
As to your performance question, I'm afraid I don't have a great explanation. Your basic understanding is right - using MultiplyGeneric with double[,] arguments should be equivalent to using MultiplyTyped. If you use ildasm to look at the IL the compiler generates for the following F# code:
let arr = Array2D.zeroCreate 1024 1024
let f1 = MultiplyTyped arr
let f2 = MultiplyGeneric arr

let timer = System.Diagnostics.Stopwatch()
timer.Start()
f1 arr |> ignore
printfn "%A" timer.Elapsed
timer.Restart()
f2 arr |> ignore
printfn "%A" timer.Elapsed
then you can see that the compiler really does generate identical code for each of them, putting the inlined code for MultiplyGeneric into an internal static function. The only difference I see in the generated code is in the names of locals, and when running from the command line I get roughly equal elapsed times. However, running from FSI I see a difference similar to what you've reported.
It's not clear to me why this would be. As I see it there are two possibilities:
FSI's code generation may be doing something slightly different than the static compiler
The CLR's JIT compiler may treat code generated at runtime slightly differently from statically compiled code. For instance, as I mentioned, my code above using MultiplyGeneric actually results in an internal method that contains the inlined body. Perhaps the CLR's JIT handles the difference between public and internal methods differently when they are generated at runtime than when they are in statically compiled code.
I'd like to see your benchmarks. I don't get the same results (VS 2012 F# 3.0 Win 7 64-bit).
let m = Array2D.init 1024 1024 (fun i j -> float i * float j)

let test f =
    let sw = System.Diagnostics.Stopwatch.StartNew()
    f() |> ignore
    sw.Stop()
    printfn "%A" sw.Elapsed
test (fun () -> MultiplyTyped m m)
> 00:00:09.6013188
test (fun () -> MultiplyGeneric m m)
> 00:00:09.1686885
Decompiling with Reflector, the functions look identical.
Regarding your last question, the least restrictive constraint is inferred. In this line
C.[i,j] <- C.[i,j] + A.[i,k] * B.[k,j]
because the result type of A.[i,k] * B.[k,j] is unspecified, and is passed immediately to (+), an extra type could be involved. If you want to tighten the constraint you can replace that line with
let temp: 'T = A.[i,k] * B.[k,j]
C.[i,j] <- C.[i,j] + temp
That will change the signature to
val inline MultiplyGeneric :
  A: ^T [,] -> B: ^T [,] -> ^T [,]
    when ^T : (static member ( * ) : ^T * ^T -> ^T) and
         ^T : (static member ( + ) : ^T * ^T -> ^T)
EDIT
Using your test, here's the output:
//MultiplyTyped
00:00:09.9904615
00:00:09.5489653
00:00:10.0562346
00:00:09.7023183
00:00:09.5123992
//MultiplyGeneric
00:00:09.1320273
00:00:08.8195283
00:00:08.8523408
00:00:09.2496603
00:00:09.2950196
Here's the same test on ideone (with a few minor changes to stay within the time limit: 512x512 matrix and one test iteration). It runs F# 2.0 and produced similar results.
