Zero-cost abstractions: performance of for-loop vs. iterators

After reading Zero-cost abstractions and looking at Introduction to rust: a low-level language with high-level abstractions, I tried to compare two approaches to computing the dot product of a vector: one using a for loop and one using iterators.
#![feature(test)]
extern crate rand;
extern crate test;

use std::cmp::min;

fn dot_product_1(x: &[f64], y: &[f64]) -> f64 {
    let mut result: f64 = 0.0;
    for i in 0..min(x.len(), y.len()) {
        result += x[i] * y[i];
    }
    return result;
}

fn dot_product_2(x: &[f64], y: &[f64]) -> f64 {
    x.iter().zip(y).map(|(&a, &b)| a * b).sum::<f64>()
}

#[cfg(test)]
mod bench {
    use test::Bencher;
    use rand::{thread_rng, Rng};
    use super::*;

    const LEN: usize = 30;

    #[test]
    fn test_1() {
        let x = [1.0, 2.0, 3.0];
        let y = [2.0, 4.0, 6.0];
        let result = dot_product_1(&x, &y);
        assert_eq!(result, 28.0);
    }

    #[test]
    fn test_2() {
        let x = [1.0, 2.0, 3.0];
        let y = [2.0, 4.0, 6.0];
        let result = dot_product_2(&x, &y);
        assert_eq!(result, 28.0);
    }

    fn rand_array(cnt: u32) -> Vec<f64> {
        let mut rng = thread_rng();
        (0..cnt).map(|_| rng.gen::<f64>()).collect()
    }

    #[bench]
    fn bench_small_1(b: &mut Bencher) {
        let samples = rand_array(2 * LEN as u32);
        b.iter(|| {
            dot_product_1(&samples[0..LEN], &samples[LEN..2 * LEN])
        })
    }

    #[bench]
    fn bench_small_2(b: &mut Bencher) {
        let samples = rand_array(2 * LEN as u32);
        b.iter(|| {
            dot_product_2(&samples[0..LEN], &samples[LEN..2 * LEN])
        })
    }
}
The latter of the links above claims that the version with the iterators should have similar performance "and actually be a little bit faster". However, when benchmarking the two, I get very different results:
running 2 tests
test bench::bench_small_loop ... bench: 20 ns/iter (+/- 1)
test bench::bench_small_iter ... bench: 24 ns/iter (+/- 2)
test result: ok. 0 passed; 0 failed; 0 ignored; 2 measured; 0 filtered out
So, where did the "zero-cost abstraction" go?
Update: Adding the fold example provided by @wimh and using split_at instead of slices gives the following results.
running 3 tests
test bench::bench_small_fold ... bench: 18 ns/iter (+/- 1)
test bench::bench_small_iter ... bench: 21 ns/iter (+/- 1)
test bench::bench_small_loop ... bench: 24 ns/iter (+/- 1)
test result: ok. 0 passed; 0 failed; 0 ignored; 3 measured; 0 filtered out
So it seems that the additional time comes, directly or indirectly, from constructing the slices inside the measured code. To check that this was indeed the case, I tried the following two approaches with the same result (shown here for the fold case and the map + sum case):
#[bench]
fn bench_small_iter(b: &mut Bencher) {
    let samples = rand_array(2 * LEN);
    let s0 = &samples[0..LEN];
    let s1 = &samples[LEN..2 * LEN];
    b.iter(|| dot_product_iter(s0, s1))
}

#[bench]
fn bench_small_fold(b: &mut Bencher) {
    let samples = rand_array(2 * LEN);
    let (s0, s1) = samples.split_at(LEN);
    b.iter(|| dot_product_fold(s0, s1))
}

It seems zero cost to me. I rewrote your code slightly more idiomatically, used the same random values for both tests, and then ran the benchmarks multiple times:
// In addition to the #![feature(test)] / extern crate setup from the question:
use std::cmp::min;
use rand::{Rng, SeedableRng};
use test::Bencher;

const LEN: usize = 30;

fn dot_product_1(x: &[f64], y: &[f64]) -> f64 {
    let mut result: f64 = 0.0;
    for i in 0..min(x.len(), y.len()) {
        result += x[i] * y[i];
    }
    result
}

fn dot_product_2(x: &[f64], y: &[f64]) -> f64 {
    x.iter().zip(y).map(|(&a, &b)| a * b).sum()
}

fn rand_array(cnt: usize) -> Vec<f64> {
    // Fixed seed so both benchmarks see the same random values.
    let mut rng = rand::rngs::StdRng::seed_from_u64(42);
    rng.sample_iter(&rand::distributions::Standard).take(cnt).collect()
}

#[bench]
fn bench_small_1(b: &mut Bencher) {
    let samples = rand_array(2 * LEN);
    let (s0, s1) = samples.split_at(LEN);
    b.iter(|| dot_product_1(s0, s1))
}

#[bench]
fn bench_small_2(b: &mut Bencher) {
    let samples = rand_array(2 * LEN);
    let (s0, s1) = samples.split_at(LEN);
    b.iter(|| dot_product_2(s0, s1))
}
bench_small_1 20 ns/iter (+/- 6)
bench_small_2 17 ns/iter (+/- 1)
bench_small_1 19 ns/iter (+/- 3)
bench_small_2 17 ns/iter (+/- 2)
bench_small_1 19 ns/iter (+/- 5)
bench_small_2 17 ns/iter (+/- 3)
bench_small_1 19 ns/iter (+/- 3)
bench_small_2 24 ns/iter (+/- 7)
bench_small_1 28 ns/iter (+/- 1)
bench_small_2 24 ns/iter (+/- 1)
bench_small_1 27 ns/iter (+/- 1)
bench_small_2 25 ns/iter (+/- 1)
bench_small_1 28 ns/iter (+/- 1)
bench_small_2 25 ns/iter (+/- 1)
bench_small_1 28 ns/iter (+/- 1)
bench_small_2 25 ns/iter (+/- 1)
bench_small_1 28 ns/iter (+/- 0)
bench_small_2 25 ns/iter (+/- 1)
bench_small_1 28 ns/iter (+/- 1)
bench_small_2 17 ns/iter (+/- 1)
In 9 of the 10 runs, the idiomatic code was faster than the for loop. This was done on a 2.9 GHz Core i9 (i9-8950HK) with 32 GB RAM, compiled with rustc 1.31.0-nightly (fc403ad98 2018-09-30).
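For benchmarks this small it can also help to hide the inputs from the optimizer, so the measured call cannot be specialized on the benchmark-local slices. A minimal sketch of such a variant (the bench_small_2_opaque name is just illustrative), using the nightly test::black_box helper alongside the benchmarks above:

#[bench]
fn bench_small_2_opaque(b: &mut Bencher) {
    let samples = rand_array(2 * LEN);
    let (s0, s1) = samples.split_at(LEN);
    // black_box keeps the optimizer from assuming anything about the slices.
    b.iter(|| dot_product_2(test::black_box(s0), test::black_box(s1)))
}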

For fun, I rewrote the benchmark to use criterion, a port of the Haskell benchmarking library.
Cargo.toml
[package]
name = "mats-zero-cost-rust"
version = "0.1.0"
authors = ["mats"]
[dev-dependencies]
criterion = "0.2"
rand = "0.4"
[[bench]]
name = "my_benchmark"
harness = false
benches/my_benchmark.rs
#[macro_use]
extern crate criterion;
extern crate rand;

use std::cmp::min;

use criterion::Criterion;
use rand::{thread_rng, Rng};

const LEN: usize = 30;

fn dot_product_loop(x: &[f64], y: &[f64]) -> f64 {
    let mut result: f64 = 0.0;
    for i in 0..min(x.len(), y.len()) {
        result += x[i] * y[i];
    }
    return result;
}

fn dot_product_iter(x: &[f64], y: &[f64]) -> f64 {
    x.iter().zip(y).map(|(&a, &b)| a * b).sum()
}

fn rand_array(cnt: u32) -> Vec<f64> {
    let mut rng = thread_rng();
    (0..cnt).map(|_| rng.gen()).collect()
}

fn criterion_loop_with_slice(c: &mut Criterion) {
    c.bench_function("loop with slice", |b| {
        let samples = rand_array(2 * LEN as u32);
        b.iter(|| dot_product_loop(&samples[0..LEN], &samples[LEN..2 * LEN]))
    });
}

fn criterion_loop_without_slice(c: &mut Criterion) {
    c.bench_function("loop without slice", |b| {
        let samples = rand_array(2 * LEN as u32);
        let (s0, s1) = samples.split_at(LEN);
        b.iter(|| dot_product_loop(s0, s1))
    });
}

fn criterion_iter_with_slice(c: &mut Criterion) {
    c.bench_function("iterators with slice", |b| {
        let samples = rand_array(2 * LEN as u32);
        b.iter(|| dot_product_iter(&samples[0..LEN], &samples[LEN..2 * LEN]))
    });
}

fn criterion_iter_without_slice(c: &mut Criterion) {
    c.bench_function("iterators without slice", |b| {
        let samples = rand_array(2 * LEN as u32);
        let (s0, s1) = samples.split_at(LEN);
        b.iter(|| dot_product_iter(s0, s1))
    });
}

criterion_group!(
    benches,
    criterion_loop_with_slice,
    criterion_loop_without_slice,
    criterion_iter_with_slice,
    criterion_iter_without_slice
);
criterion_main!(benches);
I observe these results:
kolmodin#blin:~/code/mats-zero-cost-rust$ cargo bench
Compiling mats-zero-cost-rust v0.1.0 (/home/kolmodin/code/mats-zero-cost-rust)
Finished release [optimized] target(s) in 1.16s
Running target/release/deps/my_benchmark-6f00e042fd40bc1d
Gnuplot not found, disabling plotting
loop with slice time: [7.5794 ns 7.6843 ns 7.8016 ns]
Found 14 outliers among 100 measurements (14.00%)
1 (1.00%) high mild
13 (13.00%) high severe
loop without slice time: [24.384 ns 24.486 ns 24.589 ns]
Found 3 outliers among 100 measurements (3.00%)
2 (2.00%) low severe
1 (1.00%) low mild
iterators with slice time: [13.842 ns 13.852 ns 13.863 ns]
Found 6 outliers among 100 measurements (6.00%)
1 (1.00%) low mild
4 (4.00%) high mild
1 (1.00%) high severe
iterators without slice time: [13.420 ns 13.424 ns 13.430 ns]
Found 12 outliers among 100 measurements (12.00%)
1 (1.00%) low mild
1 (1.00%) high mild
10 (10.00%) high severe
Gnuplot not found, disabling plotting
Using rustc 1.30.0 (da5f414c2 2018-10-24) on an AMD Ryzen 7 2700X.
The iterator implementation gets similar results whether it uses the slices or split_at.
The loop implementation gets very different results: the version with the slices is significantly faster.
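A possible follow-up experiment (a sketch only, not something that was benchmarked here) is to hide the slice bounds from the optimizer with criterion::black_box, so the "with slice" and "without slice" variants are compiled with the same amount of information about the slice lengths:

fn criterion_loop_with_opaque_slice(c: &mut Criterion) {
    c.bench_function("loop with black_box'd slice bounds", |b| {
        let samples = rand_array(2 * LEN as u32);
        b.iter(|| {
            // The optimizer no longer knows that the slices have length LEN.
            let len = criterion::black_box(LEN);
            dot_product_loop(&samples[0..len], &samples[len..2 * len])
        })
    });
}

If the gap then shrinks, the fast "loop with slice" number would be largely the compiler exploiting the constant bounds rather than the loop itself being cheaper.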

Related

Why is Vec::with_capacity slower than Vec::new for small final lengths?

Consider this code.
type Int = i32;
const MAX_NUMBER: Int = 1_000_000;

fn main() {
    let result1 = with_new();
    let result2 = with_capacity();
    assert_eq!(result1, result2)
}

fn with_new() -> Vec<Int> {
    let mut result = Vec::new();
    for i in 0..MAX_NUMBER {
        result.push(i);
    }
    result
}

fn with_capacity() -> Vec<Int> {
    let mut result = Vec::with_capacity(MAX_NUMBER as usize);
    for i in 0..MAX_NUMBER {
        result.push(i);
    }
    result
}
Both functions produce the same output. One uses Vec::new, the other uses Vec::with_capacity. For small values of MAX_NUMBER (like in the example), with_capacity is slower than new. Only for larger final vector lengths (e.g. 100 million) is the with_capacity version as fast as the new version.
Flamegraph for 1 million elements
Flamegraph for 100 million elements
It is my understanding that with_capacity should always be faster if the final length is known, because the heap memory is allocated once, in a single chunk. In contrast, the version with new has to grow the vector repeatedly as it fills up, which results in more allocations and copies.
What am I missing?
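For reference, here is a small sketch (separate from the code above) that counts how often a Vec built with Vec::new actually reallocates; the capacity grows geometrically rather than by one element per push, so the count stays far smaller than MAX_NUMBER:

fn main() {
    let mut v: Vec<i32> = Vec::new();
    let mut last_capacity = v.capacity();
    let mut reallocations = 0;
    for i in 0..1_000_000 {
        v.push(i);
        if v.capacity() != last_capacity {
            // capacity() only changes when the Vec had to reallocate and copy.
            last_capacity = v.capacity();
            reallocations += 1;
        }
    }
    println!("{} reallocations, final capacity {}", reallocations, v.capacity());
}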
Edit
The first section was compiled with the debug profile. If I use the release profile with the following settings in Cargo.toml
[package]
name = "vec_test"
version = "0.1.0"
edition = "2021"
[profile.release]
opt-level = 3
debug = 2
I still get the following result for a length of 10 million.
I was not able to reproduce this in a synthetic benchmark:
use criterion::{criterion_group, criterion_main, BenchmarkId, Criterion};

fn with_new(size: i32) -> Vec<i32> {
    let mut result = Vec::new();
    for i in 0..size {
        result.push(i);
    }
    result
}

fn with_capacity(size: i32) -> Vec<i32> {
    let mut result = Vec::with_capacity(size as usize);
    for i in 0..size {
        result.push(i);
    }
    result
}

pub fn with_new_benchmark(c: &mut Criterion) {
    let mut group = c.benchmark_group("with_new");
    for size in [100, 1_000, 10_000, 100_000, 1_000_000, 10_000_000].iter() {
        group.bench_with_input(BenchmarkId::from_parameter(size), size, |b, &size| {
            b.iter(|| with_new(size));
        });
    }
    group.finish();
}

pub fn with_capacity_benchmark(c: &mut Criterion) {
    let mut group = c.benchmark_group("with_capacity");
    for size in [100, 1_000, 10_000, 100_000, 1_000_000, 10_000_000].iter() {
        group.bench_with_input(BenchmarkId::from_parameter(size), size, |b, &size| {
            b.iter(|| with_capacity(size));
        });
    }
    group.finish();
}

criterion_group!(benches, with_new_benchmark, with_capacity_benchmark);
criterion_main!(benches);
Here's the output (with outliers and other benchmarking stuff removed):
with_new/100 time: [331.17 ns 331.38 ns 331.61 ns]
with_new/1000 time: [1.1719 us 1.1731 us 1.1745 us]
with_new/10000 time: [8.6784 us 8.6840 us 8.6899 us]
with_new/100000 time: [77.524 us 77.596 us 77.680 us]
with_new/1000000 time: [1.6966 ms 1.6978 ms 1.6990 ms]
with_new/10000000 time: [22.063 ms 22.084 ms 22.105 ms]
with_capacity/100 time: [76.796 ns 76.859 ns 76.926 ns]
with_capacity/1000 time: [497.90 ns 498.14 ns 498.39 ns]
with_capacity/10000 time: [5.0058 us 5.0084 us 5.0112 us]
with_capacity/100000 time: [50.358 us 50.414 us 50.470 us]
with_capacity/1000000 time: [1.0861 ms 1.0868 ms 1.0876 ms]
with_capacity/10000000 time: [10.644 ms 10.656 ms 10.668 ms]
The with_capacity runs were consistently faster than the with_new runs. The closest were the 10,000- to 1,000,000-element runs, where with_capacity still only took ~60% of the time; in the others it took half the time or less.
It crossed my mind that there could be some strange const-propagation behavior going on, but even with individual functions with hard-coded sizes (playground link omitted for brevity), the behavior didn't significantly change:
with_new/100 time: [313.87 ns 314.22 ns 314.56 ns]
with_new/1000 time: [1.1498 us 1.1505 us 1.1511 us]
with_new/10000 time: [7.9062 us 7.9095 us 7.9130 us]
with_new/100000 time: [77.925 us 77.990 us 78.057 us]
with_new/1000000 time: [1.5675 ms 1.5683 ms 1.5691 ms]
with_new/10000000 time: [20.956 ms 20.990 ms 21.023 ms]
with_capacity/100 time: [76.943 ns 76.999 ns 77.064 ns]
with_capacity/1000 time: [535.00 ns 536.22 ns 537.21 ns]
with_capacity/10000 time: [5.1122 us 5.1150 us 5.1181 us]
with_capacity/100000 time: [50.064 us 50.092 us 50.122 us]
with_capacity/1000000 time: [1.0768 ms 1.0776 ms 1.0784 ms]
with_capacity/10000000 time: [10.600 ms 10.613 ms 10.628 ms]
Your testing code only calls each strategy once, so it's conceivable that your testing environment is changed after calling the first one (a potential culprit being heap fragmentation as suggested by @trent_formerly_cl in the comments, though there could be others: CPU boosting/throttling, spatial and/or temporal cache locality, OS behavior, etc.). A benchmarking framework like criterion helps avoid a lot of these problems by iterating each test multiple times (including warmup iterations).
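Even without criterion, the one-shot bias can be reduced by interleaving several runs of each strategy and keeping the best time. A rough sketch along those lines, reusing the with_new and with_capacity functions from the benchmark above (time_best_of is just an illustrative helper name):

use std::time::Instant;

fn time_best_of<F: FnMut() -> Vec<i32>>(label: &str, runs: u32, mut f: F) {
    let mut best = u128::MAX;
    let mut total_len = 0usize; // use the result so the work is not optimized away
    for _ in 0..runs {
        let start = Instant::now();
        let v = f();
        best = best.min(start.elapsed().as_micros());
        total_len += v.len();
    }
    println!("{}: best of {} runs = {} us (total len {})", label, runs, best, total_len);
}

fn main() {
    // Alternate the strategies so a one-off environment effect cannot
    // systematically favour whichever one happens to run first.
    for _ in 0..3 {
        time_best_of("with_new", 5, || with_new(1_000_000));
        time_best_of("with_capacity", 5, || with_capacity(1_000_000));
    }
}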

Why is a useless if statement improving performance?

I'm writing a small library in Rust as a 'collision detection' module. In trying to make a line-segment to line-segment intersection test as fast as possible, I've run into a strange discrepancy where seemingly useless code makes the function run faster.
My goal is to optimise this method as much as possible, and I imagine that knowing why the executable code changes performance as significantly as it does would help with that goal. Whether it is a quirk of the Rust compiler or something else, if anyone is knowledgeable enough to understand this, here is what I have so far:
The two functions:
fn line_line(a1x: f64, a1y: f64, a2x: f64, a2y: f64, b1x: f64, b1y: f64, b2x: f64, b2y: f64) -> bool {
    let dax = a2x - a1x;
    let day = a2y - a1y;
    let dbx = b2x - b1x;
    let dby = b2y - b1y;

    let dot = dax * dby - dbx * day;
    let dd = dot * dot;

    let nd1x = a1x - b1x;
    let nd1y = a1y - b1y;
    let t = (dax * nd1y - day * nd1x) * dot;
    let u = (dbx * nd1y - dby * nd1x) * dot;

    u >= 0.0 && u <= dd && t >= 0.0 && t <= dd
}

fn line_line_if(a1x: f64, a1y: f64, a2x: f64, a2y: f64, b1x: f64, b1y: f64, b2x: f64, b2y: f64) -> bool {
    let dax = a2x - a1x;
    let day = a2y - a1y;
    let dbx = b2x - b1x;
    let dby = b2y - b1y;

    let dot = dax * dby - dbx * day;
    if dot == 0.0 { return false } // useless branch if dot isn't 0 ?

    let dd = dot * dot;

    let nd1x = a1x - b1x;
    let nd1y = a1y - b1y;
    let t = (dax * nd1y - day * nd1x) * dot;
    let u = (dbx * nd1y - dby * nd1x) * dot;

    u >= 0.0 && u <= dd && t >= 0.0 && t <= dd
}
Criterion-rs bench code:
use criterion::{black_box, criterion_group, criterion_main, Criterion};

fn criterion_benchmark(c: &mut Criterion) {
    c.bench_function("line-line", |b| b.iter(|| line_line(
        black_box(0.0), black_box(1.0), black_box(2.0), black_box(0.0),
        black_box(1.0), black_box(0.0), black_box(2.0), black_box(4.0))));
    c.bench_function("line-line-if", |b| b.iter(|| line_line_if(
        black_box(0.0), black_box(1.0), black_box(2.0), black_box(0.0),
        black_box(1.0), black_box(0.0), black_box(2.0), black_box(4.0))));
}

criterion_group!(benches, criterion_benchmark);
criterion_main!(benches);
Note that with these inputs, the if statement will not evaluate to true.
assert_eq!(line_line_if(0.0, 1.0, 2.0, 0.0, 1.0, 0.0, 2.0, 4.0), true); // does not panic
Criterion-rs bench results:
line-line time: [5.6718 ns 5.7203 ns 5.7640 ns]
line-line-if time: [5.1215 ns 5.1791 ns 5.2312 ns]
Godbolt.org Rust compilation results for these two pieces of code.
Unfortunately I cannot read asm as well as I'd like. I also do not know how to ask rustc on the website to compile this without inlining the entire calculation, but if you can do that, I hope it may be of help.
My current environment is:
- Windows 10 (latest standard)
- Ryzen 3200G CPU
- Standard Rust install (no nightly/beta), compiling with znver1 (last I checked)
Testing is being done with the cargo bench command, which I assume applies optimizations, but I haven't found confirmation of this anywhere. Criterion-rs is doing all of the benchmark handling.
Thank you in advance if you're able to answer this.
EDIT:
Comparing the code in the godbolt link with -C opt-level=3 --target x86_64-pc-windows-msvc (the toolchain currently used by my installation of rustc) reveals that the compiler has inserted many unpcklpd and unpckhpd instructions into the function, which may be slowing the code down rather than optimizing it. Setting the opt-level parameter to 2 removes the extra unpck instructions without otherwise significantly changing the compilation result. However, adding
[profile.release]
opt-level = 2
to Cargo.toml caused a recompilation, but the reversed performance remained; the same happened at opt-level = 1. I do not have target-cpu=native specified.
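If target CPU features are a suspect, one thing to try (a sketch of a possible experiment, not a confirmed fix) is pinning the target CPU for the whole build via a .cargo/config.toml, so the benchmark is not compiled for the generic x86-64 baseline:

# .cargo/config.toml
[build]
rustflags = ["-C", "target-cpu=native"]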

Why does rayon-based parallel processing take more time than serial processing?

Learning Rayon, I wanted to compare the performance of parallel calculation and serial calculation of the Fibonacci series. Here's my code:
use rayon;
use std::time::Instant;

fn main() {
    let nth = 30;

    let now = Instant::now();
    let fib = fibonacci_serial(nth);
    println!(
        "[s] The {}th number in the fibonacci sequence is {}, elapsed: {}",
        nth,
        fib,
        now.elapsed().as_micros()
    );

    let now = Instant::now();
    let fib = fibonacci_parallel(nth);
    println!(
        "[p] The {}th number in the fibonacci sequence is {}, elapsed: {}",
        nth,
        fib,
        now.elapsed().as_micros()
    );
}

fn fibonacci_parallel(n: u64) -> u64 {
    if n <= 1 {
        return n;
    }
    let (a, b) = rayon::join(|| fibonacci_parallel(n - 2), || fibonacci_parallel(n - 1));
    a + b
}

fn fibonacci_serial(n: u64) -> u64 {
    if n <= 1 {
        return n;
    }
    fibonacci_serial(n - 2) + fibonacci_serial(n - 1)
}
Run in Rust Playground
I expected the elapsed time of the parallel calculation to be smaller than the elapsed time of the serial calculation, but the result was the opposite:
# `s` stands for serial calculation and `p` for parallel
[s] The 30th number in the fibonacci sequence is 832040, elapsed: 12127
[p] The 30th number in the fibonacci sequence is 832040, elapsed: 990379
My implementation of the serial/parallel calculation may have flaws. But if not, why am I seeing these results?
I think the real reason is that you create far too many tasks: every call of fibonacci_parallel spawns another pair of rayon tasks, and because you call fibonacci_parallel again in the closures, the number of tasks grows exponentially with n.
This is utterly terrible for the OS/rayon.
An approach to solve this problem could be this:
fn fibonacci_parallel(n: u64) -> u64 {
    fn inner(n: u64) -> u64 {
        if n <= 1 {
            return n;
        }
        inner(n - 2) + inner(n - 1)
    }

    if n <= 1 {
        return n;
    }
    let (a, b) = rayon::join(|| inner(n - 2), || inner(n - 1));
    a + b
}
This way the work is only split once, into two parallel tasks that both execute the serial inner function. With this addition I get
op#VBOX /t/t/foo> cargo run --release 40
Finished release [optimized] target(s) in 0.03s
Running `target/release/foo 40`
[s] The 40th number in the fibonacci sequence is 102334155, elapsed: 1373741
[p] The 40th number in the fibonacci sequence is 102334155, elapsed: 847343
But as said, for low numbers parallel execution is not worth it:
op#VBOX /t/t/foo> cargo run --release 20
Finished release [optimized] target(s) in 0.02s
Running `target/release/foo 20`
[s] The 20th number in the fibonacci sequence is 6765, elapsed: 82
[p] The 20th number in the fibonacci sequence is 6765, elapsed: 241
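A further refinement (a sketch of a common pattern, not part of the answer above) is to only split the work in parallel while the subproblem is large and fall back to the serial function below a cut-off, so the number of rayon tasks stays bounded even for large n; the threshold of 20 here is an arbitrary assumption to be tuned:

fn fibonacci_cutoff(n: u64) -> u64 {
    const CUTOFF: u64 = 20; // assumed threshold; tune for your machine
    if n <= 1 {
        return n;
    }
    if n <= CUTOFF {
        // Small subproblems are cheaper to compute serially than to schedule.
        return fibonacci_serial(n);
    }
    let (a, b) = rayon::join(|| fibonacci_cutoff(n - 2), || fibonacci_cutoff(n - 1));
    a + b
}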

Why is my for loop code slower than an iterator?

I am trying to solve the leetcode problem distribute-candies. It is easy: just take the minimum of the number of candy kinds and half the number of candies.
Here's my solution (cost 48ms):
use std::collections::HashSet;

pub fn distribute_candies(candies: Vec<i32>) -> i32 {
    let sister_candies = (candies.len() / 2) as i32;
    let mut kind = 0;
    let mut candies_kinds = HashSet::new();

    for candy in candies.into_iter() {
        if candies_kinds.insert(candy) {
            kind += 1;
            if kind > sister_candies {
                return sister_candies;
            }
        }
    }
    kind
}
However, I found a solution using an iterator:
use std::collections::HashSet;
use std::cmp::min;

pub fn distribute_candies(candies: Vec<i32>) -> i32 {
    min(candies.iter().collect::<HashSet<_>>().len(), candies.len() / 2) as i32
}
and it costs 36ms.
I can't quite understand why the iterator solution is faster than my for loop solution. Are there some magic optimizations that Rust is performing in the background?
The main difference is that the iterator version internally uses Iterator::size_hint to determine how much space to reserve in the HashSet before collecting into it. This prevents repeatedly having to reallocate as the set grows.
You can do the same using HashSet::with_capacity instead of HashSet::new:
let mut candies_kinds = HashSet::with_capacity(candies.len());
In my benchmark this single change makes your code significantly faster than the iterator. However, if I simplify your code to remove the early bailout optimisation, it runs in almost exactly the same time as the iterator version.
pub fn distribute_candies(candies: &[i32]) -> i32 {
    let sister_candies = (candies.len() / 2) as i32;
    let mut candies_kinds = HashSet::with_capacity(candies.len());

    for candy in candies.into_iter() {
        candies_kinds.insert(candy);
    }
    sister_candies.min(candies_kinds.len() as i32)
}
Timings:
test tests::bench_iter ... bench: 262,315 ns/iter (+/- 23,704)
test tests::bench_loop ... bench: 307,697 ns/iter (+/- 16,119)
test tests::bench_loop_with_capacity ... bench: 112,194 ns/iter (+/- 18,295)
test tests::bench_loop_with_capacity_no_bailout ... bench: 259,961 ns/iter (+/- 17,712)
This suggests to me that the HashSet preallocation is the dominant difference. Your additional optimisation also proves to be very effective - at least with the dataset that I happened to choose.
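To see the hint that collect() gets to work with, here is a tiny sketch (illustrative only): a slice iterator knows its exact length, so its size_hint is tight and the HashSet can be sized up front.

use std::collections::HashSet;

fn main() {
    let candies = vec![1, 1, 2, 2, 3, 3];
    let iter = candies.iter();
    println!("{:?}", iter.size_hint()); // prints (6, Some(6)): an exact length hint
    let kinds: HashSet<_> = iter.collect();
    println!("{} kinds", kinds.len()); // prints 3
}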

Trying famous branch-prediction example sometimes results in strange times

I tried to duplicate the example in this famous question. My code looks like this:
#![feature(test)]
extern crate rand;
extern crate test;

use test::Bencher;
use rand::{thread_rng, Rng};

type ItemType = u8;
type SumType = u64;
const TEST_SIZE: usize = 32_768;

#[bench]
fn bench_train(b: &mut Bencher) {
    let numbers = get_random_vec();
    b.iter(|| calc_sum(&numbers));
}

#[bench]
fn bench_train_sort(b: &mut Bencher) {
    let mut numbers = get_random_vec();
    numbers.sort(); // <-- the magic difference
    b.iter(|| calc_sum(&numbers));
}

fn get_random_vec() -> Vec<ItemType> {
    thread_rng().gen_iter().take(TEST_SIZE).collect()
}

fn calc_sum(numbers: &Vec<ItemType>) -> SumType {
    let mut sum = 0;
    for &num in numbers {
        if num < ItemType::max_value() / 2 {
            sum += num.into();
        }
    }
    sum
}
If I benchmark the exact code from above I get reasonable results (like in the linked question):
test bench_train ... bench: 148,611 ns/iter (+/- 8,445)
test bench_train_sort ... bench: 21,064 ns/iter (+/- 1,980)
However, if I change SumType to u8 both versions run equally fast and much faster overall:
test bench_train ... bench: 1,272 ns/iter (+/- 64)
test bench_train_sort ... bench: 1,280 ns/iter (+/- 170)
First off: of course, the sum will overflow all the time, but in release mode the overflow checks of Rust are disabled, so we just calculate a wrong result without panicking. Could this be the reason for the surprisingly short time?
Even stranger: when I change the implementation of calc_sum to something more idiomatic, the results change again. My second implementation:
fn calc_sum(numbers: &Vec<ItemType>) -> SumType {
    numbers.iter()
        .filter(|&&num| num < ItemType::max_value() / 2)
        .fold(0, |acc, &num| acc + (num as SumType))
}
With this implementation the SumType doesn't matter anymore. With u8 as well as with u64 I get these results:
test bench_train ... bench: 144,411 ns/iter (+/- 12,533)
test bench_train_sort ... bench: 16,966 ns/iter (+/- 1,100)
So we again get the numbers we expect. The question is:
What is the reason for the strange running times?
PS: I tested with cargo bench which compiles in release mode.
PPS: I just noticed that in the first implementation of calc_sum I use into() for casting, whereas I use as in the second example. When also using as in the first example, I get more strange numbers. With SumType = u64:
test bench_train ... bench: 39,850 ns/iter (+/- 2,355)
test bench_train_sort ... bench: 39,344 ns/iter (+/- 2,581)
With SumType = u8:
test bench_train ... bench: 1,184 ns/iter (+/- 339)
test bench_train_sort ... bench: 1,239 ns/iter (+/- 85)
I took a quick look at the assembly, and it appears that if you use SumType = u8 then LLVM generates SSE2 instructions to do vector operations, which is much faster. In theory, LLVM should be able to optimize filter(...).fold(...) to the same code, but in practice it cannot always remove the overhead of the abstraction. I hope that when MIR is added, Rust won't have to rely on LLVM to do Rust-specific optimizations.
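One way to make the sorted vs. unsorted gap disappear by hand, roughly mimicking what the vectorized code does, is to write the condition branchlessly so branch prediction no longer matters. A sketch (not from the answer above), reusing the ItemType/SumType aliases from the question:

fn calc_sum_branchless(numbers: &[ItemType]) -> SumType {
    let mut sum: SumType = 0;
    for &num in numbers {
        // Turn the branch into arithmetic: keep is 1 when the condition holds, else 0.
        let keep = (num < ItemType::max_value() / 2) as SumType;
        sum += keep * SumType::from(num);
    }
    sum
}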
