Why is my for loop code slower than an iterator?

I am trying to solve the LeetCode problem distribute-candies. It is easy: just find the minimum of the number of kinds of candy and half the total number of candies.
Here's my solution (runs in 48 ms):
use std::collections::HashSet;

pub fn distribute_candies(candies: Vec<i32>) -> i32 {
    let sister_candies = (candies.len() / 2) as i32;
    let mut kind = 0;
    let mut candies_kinds = HashSet::new();
    for candy in candies.into_iter() {
        if candies_kinds.insert(candy) {
            kind += 1;
            if kind > sister_candies {
                return sister_candies;
            }
        }
    }
    kind
}
However, I found a solution using an iterator:
use std::collections::HashSet;
use std::cmp::min;

pub fn distribute_candies(candies: Vec<i32>) -> i32 {
    min(candies.iter().collect::<HashSet<_>>().len(), candies.len() / 2) as i32
}
and it runs in 36 ms.
I can't quite understand why the iterator solution is faster than my for loop solution. Are there some magic optimizations that Rust is performing in the background?

The main difference is that the iterator version internally uses Iterator::size_hint to determine how much space to reserve in the HashSet before collecting into it. This prevents repeatedly having to reallocate as the set grows.
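As a minimal illustration (my addition, not part of the original answer): a slice iterator knows its exact length and reports it through size_hint, which is what collecting into a HashSet can use to reserve all the space up front.
let candies = vec![1, 2, 2, 3];
// The slice iterator reports an exact size hint of (len, Some(len)),
// so HashSet::from_iter can preallocate capacity for every element.
assert_eq!(candies.iter().size_hint(), (4, Some(4)));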
You can do the same using HashSet::with_capacity instead of HashSet::new:
let mut candies_kinds = HashSet::with_capacity(candies.len());
In my benchmark this single change makes your code significantly faster than the iterator. However, if I simplify your code to remove the early bailout optimisation, it runs in almost exactly the same time as the iterator version.
use std::collections::HashSet;

pub fn distribute_candies(candies: &[i32]) -> i32 {
    let sister_candies = (candies.len() / 2) as i32;
    let mut candies_kinds = HashSet::with_capacity(candies.len());
    for &candy in candies {
        candies_kinds.insert(candy);
    }
    sister_candies.min(candies_kinds.len() as i32)
}
Timings:
test tests::bench_iter ... bench: 262,315 ns/iter (+/- 23,704)
test tests::bench_loop ... bench: 307,697 ns/iter (+/- 16,119)
test tests::bench_loop_with_capacity ... bench: 112,194 ns/iter (+/- 18,295)
test tests::bench_loop_with_capacity_no_bailout ... bench: 259,961 ns/iter (+/- 17,712)
This suggests to me that the HashSet preallocation is the dominant difference. Your early-bailout optimisation also proves very effective, at least with the dataset I happened to choose.
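For reference, here is a sketch of the combined version, preallocation plus the early bailout from the question, which is presumably close to what the bench_loop_with_capacity line above measured:
use std::collections::HashSet;

// Sketch: HashSet preallocation combined with the early-bailout check.
pub fn distribute_candies(candies: &[i32]) -> i32 {
    let sister_candies = (candies.len() / 2) as i32;
    let mut kind = 0;
    let mut candies_kinds = HashSet::with_capacity(candies.len());
    for &candy in candies {
        if candies_kinds.insert(candy) {
            kind += 1;
            if kind > sister_candies {
                return sister_candies;
            }
        }
    }
    kind
}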

Related

Basic F# / Rust performance comparison

This is a simplistic performance test based on https://www.youtube.com/watch?v=QlMLB2-G25c, which compares the performance of Rust vs WASM vs Python vs Go.
The original Rust program (from https://github.com/masmullin2000/random-sort-examples) is:
use rand::prelude::*;

fn main() {
    let vec = make_random_vec(1_000_000, 100);
    for _ in 0..250 {
        let mut v = vec.clone();
        // v.sort_unstable();
        v.sort(); // using stable sort as f# sort is a stable sort
    }
}

pub fn make_random_vec(sz: usize, modulus: i64) -> Vec<i64> {
    let mut v: Vec<i64> = Vec::with_capacity(sz);
    for _ in 0..sz {
        let x: i64 = random();
        v.push(x % modulus);
    }
    v
}
So I created the following F# program to compare against Rust:
open System

let rec cls (arr: int64 array) count =
    if count > 0 then
        let v1 = Array.copy arr
        let v2 = Array.sort v1
        cls arr (count - 1)
    else
        ()

let rnd = Random()
let rndArray = Array.init 1000000 (fun _ -> int64 (rnd.Next(100)))

cls rndArray 250 |> ignore
I was expecting F# to be slower (both running on WSL2), but I got the following times on my 8th-gen Core i7 laptop:
Rust - around 17 seconds
Rust (unstable sort) - around 2.7 seconds
F# - around 11 seconds
My questions:
Is the .NET compiler doing some sort of optimisation that throws away some of the processing because the return values are not being used, resulting in the F# code running faster, or am I doing something wrong?
Does F# have an unstable sort function that I can use to compare against the Rust unstable sort?

Which Rust RNG should be used for multithreaded sampling?

I am trying to create a function in Rust which will sample from M normal distributions N times. I have the sequential version below, which runs fine. I am trying to parallelize it using Rayon, but am encountering the error
Rc<UnsafeCell<ReseedingRng<rand_chacha::chacha::ChaCha12Core, OsRng>>> cannot be sent between threads safely
It seems my rand::thread_rng does not implement the traits Send and Sync. I tried using StdRng and OsRng, which both do, to no avail, because then I get errors that the variables preds and rng cannot be borrowed as mutable because they are captured in a Fn closure.
This is the working code below. It errors when I change the first into_iter() to into_par_iter().
use rand_distr::{Normal, Distribution};
use std::time::Instant;
use rayon::prelude::*;

fn rprednorm(n: i32, means: Vec<f64>, sds: Vec<f64>) -> Vec<Vec<f64>> {
    let mut rng = rand::thread_rng();
    let mut preds = vec![vec![0.0; n as usize]; means.len()];
    (0..means.len()).into_iter().for_each(|i| {
        (0..n).into_iter().for_each(|j| {
            let normal = Normal::new(means[i], sds[i]).unwrap();
            preds[i][j as usize] = normal.sample(&mut rng);
        })
    });
    preds
}

fn main() {
    let means = vec![0.0; 67000];
    let sds = vec![1.0; 67000];
    let start = Instant::now();
    let preds = rprednorm(100, means, sds);
    let duration = start.elapsed();
    println!("{:?}", duration);
}
Any advice on how to make these two iterators parallel?
Thanks.
It seems my rand::thread_rng does not implement the traits Send and Sync.
Why are you trying to send a thread_rng? The entire point of thread_rng is that it's a per-thread RNG.
then I get errors that the variables preds and rng cannot be borrowed as mutable because they are captured in a Fn closure.
Well yes, you need to Clone the StdRng (or Copy the OsRng) into each closure. As for preds, that can't work for a similar reason: once you parallelise the loop, the compiler does not know that every i is distinct, so as far as it's concerned the write accesses to preds[i] could overlap (you could have two iterations running in parallel which try to write to the same place at the same time), which is illegal.
The solution is to use rayon to iterate in parallel over the destination vector:
fn rprednorm(n: i32, means: Vec<f64>, sds: Vec<f64>) -> Vec<Vec<f64>> {
    let mut preds = vec![vec![0.0; n as usize]; means.len()];
    preds.par_iter_mut().enumerate().for_each(|(i, e)| {
        let mut rng = rand::thread_rng();
        (0..n).into_iter().for_each(|j| {
            let normal = Normal::new(means[i], sds[i]).unwrap();
            e[j as usize] = normal.sample(&mut rng);
        })
    });
    preds
}
Alternatively with OsRng, it's just a marker ZST, so you can refer to it as a value:
fn rprednorm(n: i32, means: Vec<f64>, sds: Vec<f64>) -> Vec<Vec<f64>> {
    let mut preds = vec![vec![0.0; n as usize]; means.len()];
    preds.par_iter_mut().enumerate().for_each(|(i, e)| {
        (0..n).into_iter().for_each(|j| {
            let normal = Normal::new(means[i], sds[i]).unwrap();
            e[j as usize] = normal.sample(&mut rand::rngs::OsRng);
        })
    });
    preds
}
StdRng doesn't seem very suitable for this use case: you'll either have to create one per top-level iteration to get different samplings, or you'll have to initialise a base rng and then clone it once per task, in which case they'll all produce the same sequence (as they'll share a seed).
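That said, if reproducible output matters, one option (my sketch, not part of the original answer; rprednorm_seeded and base_seed are made-up names) is to derive a distinct seed per top-level task, so each row gets its own deterministic stream:
use rand::{rngs::StdRng, SeedableRng};
use rand_distr::{Distribution, Normal};
use rayon::prelude::*;

// Sketch: one StdRng per row, deterministically seeded from (base_seed, i),
// so results are reproducible across runs and thread schedules.
fn rprednorm_seeded(n: usize, means: &[f64], sds: &[f64], base_seed: u64) -> Vec<Vec<f64>> {
    means
        .par_iter()
        .zip(sds.par_iter())
        .enumerate()
        .map(|(i, (&mean, &sd))| {
            let mut rng = StdRng::seed_from_u64(base_seed.wrapping_add(i as u64));
            // Hoist the distribution out of the inner loop; it never changes.
            let normal = Normal::new(mean, sd).unwrap();
            (0..n).map(|_| normal.sample(&mut rng)).collect::<Vec<f64>>()
        })
        .collect()
}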

What part of my code to filter prime numbers causes it to slow down as it processes?

I am doing some problems on Project Euler. This challenge requires filtering prime numbers from an array. I was halfway to a solution when I noticed that my code was running rather slowly, so I added a progress bar to check the progress.
Here is the code:
extern crate pbr;
use self::pbr::ProgressBar;

pub fn is_prime(i: i32) -> bool {
    for d in 2..i {
        if i % d == 0 {
            return false;
        }
    }
    true
}

pub fn calc_sum_loop(max_num: i32) -> i32 {
    let mut pb = ProgressBar::new(max_num as u64);
    pb.format("[=>_]");
    let mut sum_primes = 0;
    for i in 1..max_num {
        if is_prime(i) {
            sum_primes += i;
        }
        pb.inc();
    }
    sum_primes
}

pub fn solve() {
    println!("About to calculate sum of primes in the first 400000");
    println!("When using a forloop {:?}", calc_sum_loop(400000));
}
I am calling the solve function from my main.rs file. It turns out that my for loop gets through iterations much faster at the beginning and much more slowly later on.
➜ euler-rust git:(master) ✗ cargo run --release
Finished release [optimized] target(s) in 0.05s
Running `target/release/euler-rust`
About to calculate sum of primes..
118661 / 400000 [===========>__________________________] 29.67 % 48780.25/s 6s
...
...
400000 / 400000 [=======================================] 100.00 % 23725.24/s
I am sort of drawing a blank in what might be causing this slowdown. It feels like Rust should be able to be much faster than what I am currently seeing. Note that I am telling Cargo to build with the --release flag. I am aware that not doing this might slow things down even further.
The function that is slowing down the execution is is_prime(i: i32). It does trial division by every d in 2..i, so the cost of each call grows linearly with i; that is why the progress rate drops as the loop advances. You may consider using a more efficient crate like primes, or you can read up on efficient primality-testing algorithms.
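For illustration, here is a sketch of the standard trial-division improvement: checking divisors only up to the square root makes each call O(√i) instead of O(i). (Unlike the original, it also returns false for 0 and 1.)
// Sketch: trial division only needs to test divisors d with d * d <= i.
pub fn is_prime(i: i32) -> bool {
    if i < 2 {
        return false;
    }
    let mut d = 2;
    while d * d <= i {
        if i % d == 0 {
            return false;
        }
        d += 1;
    }
    true
}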

Zero-cost abstractions: performance of for-loop vs. iterators

Reading about zero-cost abstractions and looking at Introduction to rust: a low-level language with high-level abstractions, I tried to compare two approaches to computing the dot product of two vectors: one using a for loop and one using iterators.
#![feature(test)]

extern crate rand;
extern crate test;

use std::cmp::min;

fn dot_product_1(x: &[f64], y: &[f64]) -> f64 {
    let mut result: f64 = 0.0;
    for i in 0..min(x.len(), y.len()) {
        result += x[i] * y[i];
    }
    return result;
}

fn dot_product_2(x: &[f64], y: &[f64]) -> f64 {
    x.iter().zip(y).map(|(&a, &b)| a * b).sum::<f64>()
}

#[cfg(test)]
mod bench {
    use test::Bencher;
    use rand::{Rng, thread_rng};
    use super::*;

    const LEN: usize = 30;

    #[test]
    fn test_1() {
        let x = [1.0, 2.0, 3.0];
        let y = [2.0, 4.0, 6.0];
        let result = dot_product_1(&x, &y);
        assert_eq!(result, 28.0);
    }

    #[test]
    fn test_2() {
        let x = [1.0, 2.0, 3.0];
        let y = [2.0, 4.0, 6.0];
        let result = dot_product_2(&x, &y);
        assert_eq!(result, 28.0);
    }

    fn rand_array(cnt: u32) -> Vec<f64> {
        let mut rng = thread_rng();
        (0..cnt).map(|_| rng.gen::<f64>()).collect()
    }

    #[bench]
    fn bench_small_1(b: &mut Bencher) {
        let samples = rand_array(2 * LEN as u32);
        b.iter(|| {
            dot_product_1(&samples[0..LEN], &samples[LEN..2 * LEN])
        })
    }

    #[bench]
    fn bench_small_2(b: &mut Bencher) {
        let samples = rand_array(2 * LEN as u32);
        b.iter(|| {
            dot_product_2(&samples[0..LEN], &samples[LEN..2 * LEN])
        })
    }
}
The latter of the links above claims that the version with the iterators should have similar performance "and actually be a little bit faster". However, when benchmarking the two, I get very different results:
running 2 tests
test bench::bench_small_loop ... bench: 20 ns/iter (+/- 1)
test bench::bench_small_iter ... bench: 24 ns/iter (+/- 2)
test result: ok. 0 passed; 0 failed; 0 ignored; 2 measured; 0 filtered out
So, where did the "zero-cost abstraction" go?
Update: Adding the fold example provided by @wimh and using split_at instead of slices gives the following result:
running 3 tests
test bench::bench_small_fold ... bench: 18 ns/iter (+/- 1)
test bench::bench_small_iter ... bench: 21 ns/iter (+/- 1)
test bench::bench_small_loop ... bench: 24 ns/iter (+/- 1)
test result: ok. 0 passed; 0 failed; 0 ignored; 3 measured; 0 filtered out
So it seems that the additional time comes directly or indirectly from constructing the slices inside the measured code. To check that this was indeed the case, I tried the following two approaches with the same result (here shown for the fold case and for map + sum):
#[bench]
fn bench_small_iter(b: &mut Bencher) {
    let samples = rand_array(2 * LEN);
    let s0 = &samples[0..LEN];
    let s1 = &samples[LEN..2 * LEN];
    b.iter(|| dot_product_iter(s0, s1))
}

#[bench]
fn bench_small_fold(b: &mut Bencher) {
    let samples = rand_array(2 * LEN);
    let (s0, s1) = samples.split_at(LEN);
    b.iter(|| dot_product_fold(s0, s1))
}
It seems zero cost to me. I wrote your code slightly more idiomatically, using the same random values for both tests, and then tested multiple times:
use std::cmp::min;
use rand::{Rng, SeedableRng};

fn dot_product_1(x: &[f64], y: &[f64]) -> f64 {
    let mut result: f64 = 0.0;
    for i in 0..min(x.len(), y.len()) {
        result += x[i] * y[i];
    }
    result
}

fn dot_product_2(x: &[f64], y: &[f64]) -> f64 {
    x.iter().zip(y).map(|(&a, &b)| a * b).sum()
}

fn rand_array(cnt: usize) -> Vec<f64> {
    let mut rng = rand::rngs::StdRng::seed_from_u64(42);
    rng.sample_iter(&rand::distributions::Standard).take(cnt).collect()
}

#[bench]
fn bench_small_1(b: &mut Bencher) {
    let samples = rand_array(2 * LEN);
    let (s0, s1) = samples.split_at(LEN);
    b.iter(|| dot_product_1(s0, s1))
}

#[bench]
fn bench_small_2(b: &mut Bencher) {
    let samples = rand_array(2 * LEN);
    let (s0, s1) = samples.split_at(LEN);
    b.iter(|| dot_product_2(s0, s1))
}
bench_small_1 20 ns/iter (+/- 6)
bench_small_2 17 ns/iter (+/- 1)
bench_small_1 19 ns/iter (+/- 3)
bench_small_2 17 ns/iter (+/- 2)
bench_small_1 19 ns/iter (+/- 5)
bench_small_2 17 ns/iter (+/- 3)
bench_small_1 19 ns/iter (+/- 3)
bench_small_2 24 ns/iter (+/- 7)
bench_small_1 28 ns/iter (+/- 1)
bench_small_2 24 ns/iter (+/- 1)
bench_small_1 27 ns/iter (+/- 1)
bench_small_2 25 ns/iter (+/- 1)
bench_small_1 28 ns/iter (+/- 1)
bench_small_2 25 ns/iter (+/- 1)
bench_small_1 28 ns/iter (+/- 1)
bench_small_2 25 ns/iter (+/- 1)
bench_small_1 28 ns/iter (+/- 0)
bench_small_2 25 ns/iter (+/- 1)
bench_small_1 28 ns/iter (+/- 1)
bench_small_2 17 ns/iter (+/- 1)
In 9 of the 10 runs, the idiomatic code was faster than the for loop. This was done on 2.9 GHz Core i9 (I9-8950HK) with 32 GB RAM, compiled with rustc 1.31.0-nightly (fc403ad98 2018-09-30).
For fun, I rewrote the benchmark to use criterion, a port of the Haskell benchmarking library.
Cargo.toml
[package]
name = "mats-zero-cost-rust"
version = "0.1.0"
authors = ["mats"]
[dev-dependencies]
criterion = "0.2"
rand = "0.4"
[[bench]]
name = "my_benchmark"
harness = false
benches/my_benchmark.rs
#[macro_use]
extern crate criterion;
extern crate rand;

use std::cmp::min;

use criterion::Criterion;
use rand::{thread_rng, Rng};

const LEN: usize = 30;

fn dot_product_loop(x: &[f64], y: &[f64]) -> f64 {
    let mut result: f64 = 0.0;
    for i in 0..min(x.len(), y.len()) {
        result += x[i] * y[i];
    }
    return result;
}

fn dot_product_iter(x: &[f64], y: &[f64]) -> f64 {
    x.iter().zip(y).map(|(&a, &b)| a * b).sum()
}

fn rand_array(cnt: u32) -> Vec<f64> {
    let mut rng = thread_rng();
    (0..cnt).map(|_| rng.gen()).collect()
}

fn criterion_loop_with_slice(c: &mut Criterion) {
    c.bench_function("loop with slice", |b| {
        let samples = rand_array(2 * LEN as u32);
        b.iter(|| dot_product_loop(&samples[0..LEN], &samples[LEN..2 * LEN]))
    });
}

fn criterion_loop_without_slice(c: &mut Criterion) {
    c.bench_function("loop without slice", |b| {
        let samples = rand_array(2 * LEN as u32);
        let (s0, s1) = samples.split_at(LEN);
        b.iter(|| dot_product_loop(s0, s1))
    });
}

fn criterion_iter_with_slice(c: &mut Criterion) {
    c.bench_function("iterators with slice", |b| {
        let samples = rand_array(2 * LEN as u32);
        b.iter(|| dot_product_iter(&samples[0..LEN], &samples[LEN..2 * LEN]))
    });
}

fn criterion_iter_without_slice(c: &mut Criterion) {
    c.bench_function("iterators without slice", |b| {
        let samples = rand_array(2 * LEN as u32);
        let (s0, s1) = samples.split_at(LEN);
        b.iter(|| dot_product_iter(s0, s1))
    });
}

criterion_group!(benches, criterion_loop_with_slice, criterion_loop_without_slice, criterion_iter_with_slice, criterion_iter_without_slice);
criterion_main!(benches);
I observe these results:
kolmodin#blin:~/code/mats-zero-cost-rust$ cargo bench
Compiling mats-zero-cost-rust v0.1.0 (/home/kolmodin/code/mats-zero-cost-rust)
Finished release [optimized] target(s) in 1.16s
Running target/release/deps/my_benchmark-6f00e042fd40bc1d
Gnuplot not found, disabling plotting
loop with slice time: [7.5794 ns 7.6843 ns 7.8016 ns]
Found 14 outliers among 100 measurements (14.00%)
1 (1.00%) high mild
13 (13.00%) high severe
loop without slice time: [24.384 ns 24.486 ns 24.589 ns]
Found 3 outliers among 100 measurements (3.00%)
2 (2.00%) low severe
1 (1.00%) low mild
iterators with slice time: [13.842 ns 13.852 ns 13.863 ns]
Found 6 outliers among 100 measurements (6.00%)
1 (1.00%) low mild
4 (4.00%) high mild
1 (1.00%) high severe
iterators without slice time: [13.420 ns 13.424 ns 13.430 ns]
Found 12 outliers among 100 measurements (12.00%)
1 (1.00%) low mild
1 (1.00%) high mild
10 (10.00%) high severe
Gnuplot not found, disabling plotting
Using rustc 1.30.0 (da5f414c2 2018-10-24) on an AMD Ryzen 7 2700X.
The iterator implementation gets similar results for using slice and split_at.
The loop implementation gets very different results. The version with slice is significantly faster.

Trying famous branch-prediction example sometimes results in strange times

I tried to duplicate the example in this famous question. My code looks like this:
#![feature(test)]

extern crate rand;
extern crate test;

use test::Bencher;
use rand::{thread_rng, Rng};

type ItemType = u8;
type SumType = u64;
const TEST_SIZE: usize = 32_768;

#[bench]
fn bench_train(b: &mut Bencher) {
    let numbers = get_random_vec();
    b.iter(|| calc_sum(&numbers));
}

#[bench]
fn bench_train_sort(b: &mut Bencher) {
    let mut numbers = get_random_vec();
    numbers.sort(); // <-- the magic difference
    b.iter(|| calc_sum(&numbers));
}

fn get_random_vec() -> Vec<ItemType> {
    thread_rng().gen_iter().take(TEST_SIZE).collect()
}

fn calc_sum(numbers: &Vec<ItemType>) -> SumType {
    let mut sum = 0;
    for &num in numbers {
        if num < ItemType::max_value() / 2 {
            sum += num.into();
        }
    }
    sum
}
If I benchmark the exact code from above I get reasonable results (like in the linked question):
test bench_train ... bench: 148,611 ns/iter (+/- 8,445)
test bench_train_sort ... bench: 21,064 ns/iter (+/- 1,980)
However, if I change SumType to u8 both versions run equally fast and much faster overall:
test bench_train ... bench: 1,272 ns/iter (+/- 64)
test bench_train_sort ... bench: 1,280 ns/iter (+/- 170)
First off: of course the sum will overflow all the time, but in release mode Rust's overflow checks are disabled, so we just calculate a wrong result without panicking. Could this be the reason for the surprisingly short time?
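To make the wrapping behaviour concrete, here is a tiny illustration (mine, not from the original post) of what u8 addition does once overflow checks are off:
// With overflow checks disabled, u8 addition wraps modulo 256.
// wrapping_add makes the same behaviour explicit in any build mode:
let x: u8 = 200;
assert_eq!(x.wrapping_add(100), 44); // 300 mod 256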
Even stranger: when I change the implementation of calc_sum to something more idiomatic, the results change again. My second implementation:
fn calc_sum(numbers: &Vec<ItemType>) -> SumType {
    numbers.iter()
           .filter(|&&num| num < ItemType::max_value() / 2)
           .fold(0, |acc, &num| acc + (num as SumType))
}
With this implementation the SumType doesn't matter anymore. With u8 as well as with u64 I get these results:
test bench_train ... bench: 144,411 ns/iter (+/- 12,533)
test bench_train_sort ... bench: 16,966 ns/iter (+/- 1,100)
So we again get the numbers we expect. The question is:
What is the reason for the strange running times?
PS: I tested with cargo bench which compiles in release mode.
PPS: I just noticed that in the first implementation of calc_sum I use into() for casting, whereas I use as in the second example. When also using as in the first example, I get more strange numbers. With SumType = u64:
test bench_train ... bench: 39,850 ns/iter (+/- 2,355)
test bench_train_sort ... bench: 39,344 ns/iter (+/- 2,581)
With SumType = u8:
test bench_train ... bench: 1,184 ns/iter (+/- 339)
test bench_train_sort ... bench: 1,239 ns/iter (+/- 85)
I took a quick look at the assembly, and it appears that if you use SumType = u8, LLVM generates SSE2 instructions to do vector operations, which is much faster. In theory, LLVM should be able to optimize filter(...).fold(...) to the same code, but in practice it cannot always remove the overhead of the abstraction. I hope that when MIR is added, Rust won't have to rely on LLVM to do Rust-specific optimizations.
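One way to probe that explanation (an assumption of mine, not something the original answer benchmarked; calc_sum_branchless is a made-up name) is to express the conditional add as a select rather than a branch, which LLVM can usually vectorize whether or not the input is sorted:
// Branch-free variant: the condition compiles to a select/cmov instead of a
// conditional jump, so LLVM can typically autovectorize it on unsorted input.
fn calc_sum_branchless(numbers: &[u8]) -> u64 {
    numbers
        .iter()
        .map(|&num| if num < u8::max_value() / 2 { num as u64 } else { 0 })
        .sum()
}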
