Why is VecDeque slower than a Vec? - data-structures

I'm beginning to optimize performance of a crate, and I swapped out a Vec for a VecDeque. The container maintains elements in sorted order (it's supposed to be fairly small, so I didn't yet bother trying a heap) and is occasionally split down the middle into two separate containers (another reason I haven't yet tried a heap) with drain.
I'd expect this second operation to be much faster: I can copy the first half of the collection out, then simply rotate and decrease the length of the original (now second) collection. However, when I run my #[bench] tests, performing the above operations a variable number of times (results below, in millions of ns/iter), I observed a performance decrease with the VecDeque:
           test a   test b   test c   test d
Vec        12.6     5.9      5.9      3.8
VecDeque   13.6     8.9      7.3      5.8
A reproducible example (gist):
#![feature(test)]
extern crate test;

use std::collections::VecDeque;

fn insert_in_sorted_order_vec<E: Ord + Eq>(t: &mut Vec<E>, k: E) {
    match t.binary_search(&k) {
        Ok(i) => t[i] = k,
        Err(i) => t.insert(i, k),
    }
}

fn insert_in_sorted_order_vecdeque<E: Ord + Eq>(t: &mut VecDeque<E>, k: E) {
    match t.binary_search(&k) {
        Ok(i) => t[i] = k,
        Err(i) => t.insert(i, k),
    }
}

fn split_vec<T>(mut t: Vec<T>) -> (Vec<T>, Vec<T>) {
    let a = t.drain(0..(t.len() / 2)).collect();
    (a, t)
}

fn split_vecdeque<T>(mut t: VecDeque<T>) -> (VecDeque<T>, VecDeque<T>) {
    let a = t.drain(0..(t.len() / 2)).collect();
    (a, t)
}

#[cfg(test)]
mod tests {
    use super::*;
    use test::Bencher;

    static ITERS_BEFORE_SPLIT: u32 = 50;
    static ITERS_TIME: u32 = 10_000;

    #[bench]
    fn vec_manip(b: &mut Bencher) {
        b.iter(|| {
            let mut v = Vec::new();
            for i in 0..(ITERS_TIME / ITERS_BEFORE_SPLIT) {
                for j in 1..(ITERS_BEFORE_SPLIT + 1) {
                    insert_in_sorted_order_vec(&mut v, i * j / (i + j)); // 'random'-ish illustrative number
                }
                v = split_vec(v).1;
            }
        });
    }

    #[bench]
    fn vecdeque_manip(b: &mut Bencher) {
        b.iter(|| {
            let mut v = VecDeque::new();
            for i in 0..(ITERS_TIME / ITERS_BEFORE_SPLIT) {
                for j in 1..(ITERS_BEFORE_SPLIT + 1) {
                    insert_in_sorted_order_vecdeque(&mut v, i * j / (i + j)); // 'random'-ish illustrative number
                }
                v = split_vecdeque(v).1;
            }
        });
    }
}
The Vec implementation took 69.2k ns/iter, and the VecDeque implementation took 91.8k.
I've repeated and verified these results a number of times - why is it that performance decreases with this more flexible data structure?
These results were obtained by running cargo bench.
Linux 5.11
3900X (12 cores, 3.8-4.6 GHz)
32GB 3200 MHz RAM
rustc 1.55.0-nightly
default cargo bench options (optimized, no debug symbols as far as I can tell, etc.)
Edit
I changed the split_vecdeque method to use split_off instead of drain().collect() (see below). It looks like this method is guaranteed not to reallocate or shift anything around, instead just moving the head and tail pointers around; see the documentation and implementation. That, however, performs even worse than the original VecDeque, at 98.2k ns/iter. For larger values (ITERS_BEFORE_SPLIT = 50_000, ITERS_TIME = 5_000_000), though, performance (21.8m ns/iter) is better than drain (23.1m ns/iter) and still worse than Vec (19.1m ns/iter).
fn split_vecdeque<T>(mut t: VecDeque<T>) -> (VecDeque<T>, VecDeque<T>) {
    let a = t.split_off(t.len() / 2);
    (t, a)
}

A VecDeque is like a Vec but supports pushing and popping from both ends efficiently. In order to do this, it uses a single, contiguous buffer (just like a Vec), but treats it as two partitions; a head and a tail.
The structure is laid out like this:
pub struct VecDeque<T> {
    tail: usize,
    head: usize,
    buf: RawVec<T>,
}
Items in the buffer are ordered like this:
[[tail: 5, 6, 7] ...unused... [head: 1, 2, 3, 4]]
Adding an item to the end of the collection will append to the tail, using some of the unused space. Adding to the start of the collection will add to the start of the head, eating into the same space. When the head and tail meet in the middle, the VecDeque is full and will need to reallocate.
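To make that two-partition layout concrete, here is a small sketch of mine (not from the original answer) using as_slices, which exposes the two contiguous runs backing the single buffer:

use std::collections::VecDeque;

fn main() {
    let mut d: VecDeque<u32> = VecDeque::with_capacity(8);
    d.extend([1, 2, 3, 4]); // pushed at the back
    d.push_front(7);
    d.push_front(6);
    d.push_front(5); // pushed at the front, wrapping around in the buffer
    // The logical contents are 5, 6, 7, 1, 2, 3, 4, but they are stored as two
    // segments of one buffer; this typically prints ([5, 6, 7], [1, 2, 3, 4]).
    println!("{:?}", d.as_slices());
}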
Compared with Vec:
pub struct Vec<T> {
    buf: RawVec<T>,
    len: usize,
}
Which uses its buffer like this:
[1, 2, 3, 4, 5, 6, 7 ...unused...]
Adding an item at the end is fast, but adding an item at the start requires copying all of the existing items to make space.
Most operations on VecDeque are made more complicated by this layout and this will slightly reduce its performance. Even retrieving its length is more complicated:
pub fn len(&self) -> usize {
    count(self.tail, self.head, self.cap())
}
The whole point of VecDeque is to make certain operations faster, namely pushing and popping the start of the collection. Vec is very slow at this, especially if there are a lot of items, because it involves moving all of the other items to make space. The structure of VecDeque makes these operations fast but at the expense of performance of other operations in comparison to Vec.
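As a contrast illustrating that design goal, here is a separate sketch of mine (not part of the benchmark in the question): inserting at the front of a Vec shifts every existing element, while push_front on a VecDeque only adjusts an index.

use std::collections::VecDeque;

fn main() {
    const N: usize = 10_000;

    // Vec: inserting at index 0 shifts every existing element, so this loop is O(N^2).
    let mut v: Vec<usize> = Vec::new();
    for i in 0..N {
        v.insert(0, i);
    }

    // VecDeque: push_front just moves the head index, so this loop is O(N).
    let mut d: VecDeque<usize> = VecDeque::new();
    for i in 0..N {
        d.push_front(i);
    }

    assert_eq!(v.len(), d.len());
}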
Your tests don't appear to take advantage of VecDeque's design, since they are dominated by calls to insert, which involve the expensive copying of many items in both cases.

Related

What's the fastest sound way to take ownership of elements of a Vec

Suppose I have the following:
fn into_three_tuple(mut v: Vec<String>) -> (String, String, String) {
    if v.len() == 3 {
        // ???
    } else {
        panic!()
    }
}
What should I replace ??? with to achieve the best performance?
Possible solutions
Sure, I could do
...
if v.len() == 3 {
    let mut iter = v.into_iter();
    (iter.next().unwrap(), iter.next().unwrap(), iter.next().unwrap())
} else {
...
or similarly
if v.len() == 3 {
    let e2 = v.pop().unwrap();
    let e1 = v.pop().unwrap();
    let e0 = v.pop().unwrap();
    (e0, e1, e2)
} else {
...
Problems with those solutions
Both of these implementations use unwrap, which, if I understand correctly, performs a runtime check. But since we have the v.len() == 3 condition, we know the vector is guaranteed to have 3 elements, so the runtime check is unnecessary.
Also, the into_iter solution may introduce the additional overhead of creating the iterator, and the pop solution may introduce the additional overhead of decreasing v's internal len field, which seems silly since v will be dropped immediately after extracting the elements (so we don't care whether its len field is accurate).
Question
Is there some (possibly unsafe) way that's more efficient (e.g., some way to directly take ownership of elements at arbitrary indices)?
Or perhaps the compiler is already smart enough to skip these extraneous operations?
Or do I just have to live with suboptimal performance?
Note
In case you're wondering why I'm obsessing over such a tiny micro-optimization, I'm writing a fairly speed-critical application, and this function will be called a significant amount of times.
There is an operation provided in the standard library to extract a fixed number of items from a vector: converting it to an array.
use std::convert::TryInto; // needed only on pre-2021 editions (the 2021 edition, available since Rust 1.56, has TryInto in the prelude)

pub fn into_three_tuple(v: Vec<String>) -> (String, String, String) {
    let three_array: [String; 3] = v.try_into().unwrap();
    let [a, b, c] = three_array;
    (a, b, c)
}
I'm not really familiar with reading x86 assembly, but this does compile down to simpler code with fewer branches. I would generally expect that this is in most cases the fastest way to unpack a three-element vector; if it is reliably slower, then that would be a performance bug in the standard library which should be reported and fixed.
You should also consider using an array [String; 3] instead of a tuple in the rest of your program. The type is shorter to write, and they allow you to use array and slice operations to act on all three strings. Additionally, tuples do not have a guaranteed memory layout, and arrays do (even though practically they are likely to be identical in this case).
Changing the return type to be an array makes the function trivial — possibly useful for the type declaration, but not containing any interesting code:
pub fn into_three_array(v: Vec<String>) -> [String; 3] {
    v.try_into().unwrap()
}
Disclaimer: As mentioned in the comments, you should benchmark to check that any of this actually makes a difference in your program. The fact that you're using many 3-element vectors (which are heap-allocated and therefore comparatively inefficient) shows that you may be over-optimizing, or optimizing at the wrong place. Having said that...
the into_iter solution may introduce the additional overhead of creating the iterator
Note that the "iterator" is a tiny on-stack value entirely transparent to the compiler, which can proceed to inline/eliminate it entirely.
Or perhaps the compiler is already smart enough to skip these extraneous operations?
In many cases, a check for v.len() == <concrete number> is indeed sufficient for the compiler to omit bounds checking because it has proof of the vector size. However, that doesn't appear to work with the approaches you've tried. After modifying the code to std::process::exit() if v.len() != 3, so that the only panic is from the runtime checks, the runtime checks (as evidenced by calls to panic) are still not removed either with the .pop() or with the into_iter() approach.
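For concreteness, here is roughly what that modified check can look like (a sketch of the verification setup, not code from the question); with the panic!() arm replaced by an exit, any panic machinery left in the optimized output must come from the unwrap() calls:

// Sketch: the only panics the optimizer could leave in the output are the ones
// generated by unwrap().
pub fn into_three_tuple(v: Vec<String>) -> (String, String, String) {
    if v.len() == 3 {
        let mut iter = v.into_iter();
        (iter.next().unwrap(), iter.next().unwrap(), iter.next().unwrap())
    } else {
        std::process::exit(1)
    }
}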
Is there some (possibly unsafe) way that's more efficient (e.g., some way to directly take ownership of elements at arbitrary indices)?
Yes. One approach is to use unreachable_unchecked() to avoid the panic where we can prove the calls to next() will succeed:
use std::hint::unreachable_unchecked;
pub fn into_three_tuple(v: Vec<String>) -> (String, String, String) {
    if v.len() == 3 {
        let mut v = v.into_iter();
        unsafe {
            let e0 = v.next().unwrap_or_else(|| unreachable_unchecked());
            let e1 = v.next().unwrap_or_else(|| unreachable_unchecked());
            let e2 = v.next().unwrap_or_else(|| unreachable_unchecked());
            (e0, e1, e2)
        }
    } else {
        panic!()
    }
}
Modifying the code in the same way as the above shows no panic-related code.
Still, that relies on the compiler being smart enough. If you want to ensure the bound checks are not done, Rust unsafe allows you to do that as well. You can use as_ptr() to obtain a raw pointer to the elements stored in the vector, and read them from there directly. You need to call set_len() to prevent the vector from dropping the elements you've moved, but to still allow it to deallocate the storage.
pub fn into_three_tuple(mut v: Vec<String>) -> (String, String, String) {
    if v.len() == 3 {
        unsafe {
            v.set_len(0);
            let ptr = v.as_ptr();
            let e0 = ptr.read();
            let e1 = ptr.add(1).read();
            let e2 = ptr.add(2).read();
            (e0, e1, e2)
        }
    } else {
        panic!("expected Vec of length 3")
    }
}
The generated code again shows no bound-check related panics, which is expected because there are no calls to functions that perform a checked access to data.
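For completeness, a small usage sketch (mine, not part of the original answer); the length check still guards the unsafe block, so a wrong-sized input panics before any out-of-bounds read:

fn main() {
    let v = vec!["a".to_string(), "b".to_string(), "c".to_string()];
    let (e0, e1, e2) = into_three_tuple(v);
    assert_eq!((e0.as_str(), e1.as_str(), e2.as_str()), ("a", "b", "c"));
}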

How can you extend a collection in parallel?

I have a HashMap which I'd like to add elements to as fast as possible. I tried using par_extend, but it actually ended up being slower than the serial version. My guess is that it is evaluating the iterator in parallel, but extending the collection serially. Here's my code:
use std::collections::HashMap;
use rayon::prelude::*;
use time::Instant;

fn main() {
    let n = 1e7 as i64;

    // serial version
    let mut t = Instant::now();
    let mut m = HashMap::new();
    m.extend((1..n).map(|i| (i, i)));
    println!("Time in serial version: {}", t.elapsed().as_seconds_f64());

    // parallel version - slower
    t = Instant::now();
    let mut m2 = HashMap::new();
    m2.par_extend((1..n).into_par_iter().map(|i| (i, i)));
    println!("Time in parallel version: {}", t.elapsed().as_seconds_f64());
}
Is there a faster way to extend a HashMap that actually adds the elements in parallel? Or a similar data structure that can be extended in parallel? I know this would run faster with something like an FnvHashMap, but it seems like it should also be possible to speed this up with parallelism. (and yes, I'm compiling with --release)
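One avenue worth measuring (a sketch of mine, not a verified speedup for this exact workload): let rayon build the map with a parallel collect instead of calling par_extend on an existing map. For trivially cheap pairs like (i, i), the final merge into one HashMap can still dominate, so benchmark before relying on it.

use std::collections::HashMap;
use rayon::prelude::*;

fn main() {
    let n = 1e7 as i64;
    // FromParallelIterator for HashMap: rayon produces the pairs in parallel
    // and merges the per-thread results into the final map.
    let m: HashMap<i64, i64> = (1..n).into_par_iter().map(|i| (i, i)).collect();
    assert_eq!(m.len(), (n - 1) as usize);
}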

How to sort a Vector in descending order in Rust?

In Rust, the sorting methods of a Vec always arrange the elements from smallest to largest. What is a general-purpose way of sorting from largest to smallest instead?
If you have a vector of numbers, you can provide a key extraction function that "inverts" the number like this:
let mut numbers: Vec<u32> = vec![100_000, 3, 6, 2];
numbers.sort_by_key(|n| std::u32::MAX - n);
But that's not very clear, and it's not straightforward to extend that method to other types like strings.
There are at least three ways to do it.
Flipped comparison function
vec.sort_by(|a, b| b.cmp(a))
This switches around the order in which elements are compared, so that smaller elements appear larger to the sorting function and vice versa.
Wrapper with reverse Ord instance
use std::cmp::Reverse;
vec.sort_by_key(|w| Reverse(*w));
Reverse is a generic wrapper which has an Ord instance that is the opposite of the wrapped type's ordering.
If you try to return a Reverse containing a reference by removing the *, that results in a lifetime problem, same as when you return a reference directly inside sort_by_key (see also this question). Hence, this code snippet can only be used with vectors where the keys are Copy types.
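If the keys are not Copy (e.g. String), two ways around that restriction, shown here as a small sketch of mine rather than code from the answer:

use std::cmp::Reverse;

fn main() {
    let mut words = vec!["pear".to_string(), "zebra".to_string(), "apple".to_string()];

    // Option 1: flipped comparator, no clones required.
    words.sort_by(|a, b| b.cmp(a));
    assert_eq!(words, ["zebra", "pear", "apple"]);

    // Option 2: Reverse with an owned key; works for non-Copy types at the cost
    // of cloning each key when it is extracted.
    words.sort_by_key(|w| Reverse(w.clone()));
    assert_eq!(words, ["zebra", "pear", "apple"]);
}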
Sorting then reversing
vec.sort();
vec.reverse();
It initially sorts in the wrong order and then reverses all elements.
Performance
I benchmarked the three methods with criterion for a length 100_000 Vec<u64>. The timing results are listed in the order above. The left and right values show the lower and upper bounds of the confidence interval respectively, and the middle value is criterion's best estimate.
Performance is comparable, although the flipped comparison function seems to be a tiny bit slower:
Sorting/sort_1 time: [6.2189 ms 6.2539 ms 6.2936 ms]
Sorting/sort_2 time: [6.1828 ms 6.1848 ms 6.1870 ms]
Sorting/sort_3 time: [6.2090 ms 6.2112 ms 6.2138 ms]
To reproduce, save the following files as benches/sort.rs and Cargo.toml, then run cargo bench. There is an additional benchmark in there which checks that the cost of cloning the vector is irrelevant compared to the sorting; it only takes a few microseconds.
fn generate_shuffled_data() -> Vec<u64> {
    use rand::Rng;
    let mut rng = rand::thread_rng();
    (0..100000).map(|_| rng.gen::<u64>()).collect()
}

pub fn no_sort<T: Ord>(vec: Vec<T>) -> Vec<T> {
    vec
}

pub fn sort_1<T: Ord>(mut vec: Vec<T>) -> Vec<T> {
    vec.sort_by(|a, b| b.cmp(a));
    vec
}

pub fn sort_2<T: Ord + Copy>(mut vec: Vec<T>) -> Vec<T> {
    vec.sort_by_key(|&w| std::cmp::Reverse(w));
    vec
}

pub fn sort_3<T: Ord>(mut vec: Vec<T>) -> Vec<T> {
    vec.sort();
    vec.reverse();
    vec
}

use criterion::{black_box, criterion_group, criterion_main, Criterion};

fn comparison_benchmark(c: &mut Criterion) {
    let mut group = c.benchmark_group("Sorting");
    let data = generate_shuffled_data();
    group.bench_function("no_sort", |b| {
        b.iter(|| black_box(no_sort(data.clone())))
    });
    group.bench_function("sort_1", |b| {
        b.iter(|| black_box(sort_1(data.clone())))
    });
    group.bench_function("sort_2", |b| {
        b.iter(|| black_box(sort_2(data.clone())))
    });
    group.bench_function("sort_3", |b| {
        b.iter(|| black_box(sort_3(data.clone())))
    });
    group.finish()
}

criterion_group!(benches, comparison_benchmark);
criterion_main!(benches);
[package]
name = "sorting_bench"
version = "0.1.0"
authors = ["nnnmmm"]
edition = "2018"
[[bench]]
name = "sort"
harness = false
[dev-dependencies]
criterion = "0.3"
rand = "0.7.3"

Unable to collect a filtered Vec into itself [duplicate]

This question already has answers here: is it possible to filter on a vector in-place? (4 answers)
I've started working through the Project Euler problems in Rust and came across #3 where the easiest quick-approach would be to implement a Sieve of Eratosthenes.
In doing so, my algorithm creates an iterator over the vector, filters the non-primes out of it, and assigns the result back to the original vector, but I'm receiving an error that a Vec<u32> can't be built from an Iterator<Item = &u32>.
Code:
fn eratosthenes_sieve(limit: u32) -> Vec<u32> {
    let mut primes: Vec<u32> = Vec::new();
    let mut range: Vec<u32> = (2..=limit).collect();
    let mut length = range.len();
    loop {
        let p = range[0];
        primes.push(p);
        range = range.iter().filter(|&n| *n % p != 0).collect();
        if length == range.len() {
            break;
        }
        length = range.len();
    }
    primes
}
Error:
error[E0277]: a collection of type `std::vec::Vec<u32>` cannot be built from an iterator over elements of type `&u32`
--> src\main.rs:42:55
|
42 | range = range.iter().filter(|&n| *n % p != 0).collect();
| ^^^^^^^ a collection of type `std::vec::Vec<u32>` cannot be built from `std::iter::Iterator<Item=&u32>`
|
= help: the trait `std::iter::FromIterator<&u32>` is not implemented for `std::vec::Vec<u32>`
Why is the closure wrapping the values in extra borrows?
Explanation
According to the error message, the expression range.iter().filter(|&n| *n % p != 0) is an iterator over items of type &u32: a reference to an u32. You expected an iterator over u32 by value. So let's walk backwards:
The filter(...) part of the iterator chain actually has nothing to do with your problem. When we take a look at Iterator::filter, we see that it returns Filter<Self, P>. This type implements Iterator:
impl<I: Iterator, P> Iterator for Filter<I, P>
where
    P: FnMut(&I::Item) -> bool,
{
    type Item = I::Item;
    // ...
}
The important part here is that type Item = I::Item, meaning that the item type of I (the original iterator) is passed through exactly. No reference is added.
This leaves .iter(): that's the cause of the problem. Vec::iter returns slice::Iter<T> which implements Iterator:
impl<'a, T> Iterator for Iter<'a, T> {
    type Item = &'a T;
    // ...
}
And here we see that the item type is a reference to T (the element type of the vector).
Solutions
In general cases, you could call .cloned() on any iterator that iterates over references to get a new iterator that iterates over the items by value (by cloning each item). For types that implement Copy you can (and should) use .copied(). E.g. range.iter().filter(|&n| *n % p != 0).copied().collect().
However, in this case there is a better solution: since you don't need the vector afterwards anymore, you can just call into_iter() instead of iter() in order to directly get an iterator over u32 by value. This consumes the vector, making it inaccessible afterwards. But, as said, that's not a problem here.
range = range.into_iter().filter(|&n| n % p != 0).collect();
Also note that I removed the * in *n, as the dereference is not necessary anymore.
Other hints
Always reallocating a new vector is not very fast. The Sieve of Eratosthenes is classically implemented in a different way: instead of storing the numbers, one only stores Booleans to denote for each number if it's prime or not. The numbers are never stored explicitly, but implicitly by using the indices of the vector/array.
And to make it really fast, one should not use a Vec<bool> but instead a dedicated bitvec. Vec<bool> stores one byte per bool, although only one bit would be necessary. The de-facto crate that offers such a bit vector is bit-vec, which conveniently also shows an example implementation of Sieve of Eratosthenes in its documentation.
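A minimal sketch of that index-based sieve, using a plain Vec<bool> for clarity (swapping in bit-vec's BitVec would be the memory-friendly variant); this is my illustration, not code from the linked documentation:

fn sieve(limit: usize) -> Vec<usize> {
    // is_composite[n] becomes true once some prime has crossed n off.
    let mut is_composite = vec![false; limit + 1];
    let mut primes = Vec::new();
    for n in 2..=limit {
        if !is_composite[n] {
            primes.push(n);
            // Start at n * n: smaller multiples were already crossed off.
            let mut m = n * n;
            while m <= limit {
                is_composite[m] = true;
                m += n;
            }
        }
    }
    primes
}

fn main() {
    assert_eq!(sieve(30), [2, 3, 5, 7, 11, 13, 17, 19, 23, 29]);
}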

Benchmark of HashMap::get appears to be optimized out [duplicate]

I am trying to benchmark getting keys from a Rust hash map. I have the following benchmark:
#[bench]
fn rust_get(b: &mut Bencher) {
    let (hash, keys) =
        get_random_hash::<HashMap<String, usize>>(&HashMap::with_capacity, &rust_insert_fn);
    let mut keys = test::black_box(keys);
    b.iter(|| {
        for k in keys.drain(..) {
            hash.get(&k);
        }
    });
}
where get_random_hash is defined as:
fn get_random_hash<T>(
    new: &Fn(usize) -> T,
    insert: &Fn(&mut T, String, usize) -> (),
) -> (T, Vec<String>) {
    let mut keys = Vec::with_capacity(HASH_SIZE);
    let mut hash = new(HASH_CAPACITY);
    for i in 0..HASH_SIZE {
        let k: String = format!("{}", Uuid::new_v4());
        keys.push(k.clone());
        insert(&mut hash, k, i);
    }
    return (hash, keys);
}
and rust_insert_fn is:
fn rust_insert_fn(map: &mut HashMap<String, usize>, key: String, value: usize) {
    map.insert(key, value);
}
However, when I run the benchmark, it is clearly optimized out:
test benchmarks::benchmarks::rust_get ... bench: 1 ns/iter (+/- 0)
I thought test::black_box would solve the problem, but it doesn't look like it does. I have even tried wrapping the hash.get(&k) in the for loop with test::black_box, but the code still gets optimized out. How should I correctly get the code to run without being optimized out?
EDIT: Even the following still optimizes out the get operation:
#[bench]
fn rust_get(b: &mut Bencher) {
    let (hash, keys) =
        get_random_hash::<HashMap<String, usize>>(&HashMap::with_capacity, &rust_insert_fn);
    let mut keys = test::black_box(keys);
    b.iter(|| {
        let mut n = 0;
        for k in keys.drain(..) {
            hash.get(&k);
            n += 1;
        }
        return n;
    });
}
Interestingly, the following benchmarks work:
#[bench]
fn rust_get_random(b: &mut Bencher) {
    let (hash, _) =
        get_random_hash::<HashMap<String, usize>>(&HashMap::with_capacity, &rust_insert_fn);
    b.iter(|| {
        for _ in 0..HASH_SIZE {
            hash.get(&format!("{}", Uuid::new_v4()));
        }
    });
}

#[bench]
fn rust_insert(b: &mut Bencher) {
    b.iter(|| {
        let mut hash = HashMap::with_capacity(HASH_CAPACITY);
        for i in 0..HASH_SIZE {
            let k: String = format!("{}", Uuid::new_v4());
            hash.insert(k, i);
        }
    });
}
but this also does not:
#[bench]
fn rust_del(b: &mut Bencher) {
    let (mut hash, keys) =
        get_random_hash::<HashMap<String, usize>>(&HashMap::with_capacity, &rust_insert_fn);
    let mut keys = test::black_box(keys);
    b.iter(|| {
        for k in keys.drain(..) {
            hash.remove(&k);
        }
    });
}
Here is the full gist.
How does a compiler optimizer work?
An optimizer is nothing more than a pipeline of analyses and transformations. Each individual analysis or transformation is relatively simple, and the optimal order to apply them is unknown and generally determined by heuristics.
How does this affect my benchmark?
Benchmarks are complicated in that in general you wish to measure optimized code, but at the same time some analyses or transformations may remove the code you were interested in, rendering the benchmark useless.
It is therefore important to have a passing acquaintance with the analyses and transformation passes of the particular optimizer you are using so as to be able to understand:
which ones are undesirable,
how to foil them.
As mentioned, most passes are relatively simple, and therefore foiling them is relatively simple as well. The difficulty lies in the fact that there are a hundred or more of them and you have to know which one is kicking in to be able to foil it.
What optimizations am I running afoul of?
There are a few specific optimizations which very often play havoc with benchmarks (a tiny sketch illustrating constant propagation and dead code elimination follows the list):
Constant Propagation: allows evaluating part of the code at compile-time,
Loop Invariant Code Motion: allows lifting the evaluation of some piece of code outside the loop,
Dead Code Elimination: removes code that is not useful.
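Here is that sketch, my own purely illustrative example (not tied to the benchmark) of what these passes do to trivially analyzable code once optimizations are on:

fn add(a: u32, b: u32) -> u32 {
    a + b
}

fn main() {
    // Constant propagation: the optimizer can fold this call to the constant 5.
    let x = add(2, 3);
    // Dead code elimination: the result is never observed, so the call (and the
    // value) can be removed entirely.
    let _unused = add(x, 10);
    // Only this observable effect has to survive.
    println!("{}", x);
}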
What? How dare the optimizer mangle my code so?
The optimizer operates under the so-called as-if rule. This basic rule allows the optimizer to perform any transformation which does not change the output of the program. That is, it should not change the observable behavior of the program in general.
On top of that, a few changes are generally explicitly allowed. The most obvious is that the run-time is expected to shrink; this in turn means that thread interleaving may differ, and some languages give even more wiggle room.
I used black_box!
What is black_box? It's a function whose definition is specifically opaque to the optimizer. This has some implications on the optimizations the compiler is allowed to perform, since it may have side effects. This therefore means:
the transformed code must perform the very same number of calls to black_box as the original code,
the transformed code must perform said calls in the same order with regard to the passed in arguments,
no assumption can be made on the value returned by black_box.
Thus, surgical use of black_box can foil certain optimizations. Blind use, however, may not foil the right ones.
What optimizations am I running afoul of?
Let's start from the naive code:
#[bench]
fn rust_get(b: &mut Bencher) {
    let (hash, mut keys): (HashMap<String, usize>, _) =
        get_random_hash(&HashMap::with_capacity, &rust_insert_fn);
    b.iter(|| {
        for k in keys.drain(..) {
            hash.get(&k);
        }
    });
}
The assumption is that the loop inside b.iter() will iterate over all keys and perform a hash.get() for each of them:
The result of hash.get() is unused,
hash.get() is a pure function, meaning that it has no side effects.
Thus, this loop can be rewritten as:
b.iter(|| { for k in keys.drain(..) {} })
We are running afoul of Dead Code Elimination (or some variant): the code serves no purpose, thus it is eliminated.
It may even be that the compiler is smart enough to realize that for k in keys.drain(..) {} can be optimized into keys.clear().
A surgical application of black_box can, however, foil DCE:
b.iter(|| {
    for k in keys.drain(..) {
        black_box(hash.get(&k));
    }
});
As per the effects of black_box described above:
the loop can no longer be optimized out, as it would change the number of calls to black_box,
each call to black_box must be performed with the expected argument.
There is still one possible hurdle: Constant Propagation. Specifically if the compiler realizes that all keys yield the same value, it could optimize out hash.get(&k) and replace it by said value.
This can be achieved by obfuscating either the keys (let mut keys = black_box(keys);, as you did above) or the map. If you were to benchmark an empty map, the latter would be necessary; here they are equivalent.
We thus get:
#[bench]
fn rust_get(b: &mut Bencher) {
    let (hash, keys): (HashMap<String, usize>, _) =
        get_random_hash(&HashMap::with_capacity, &rust_insert_fn);
    let mut keys = test::black_box(keys);
    b.iter(|| {
        for k in keys.drain(..) {
            test::black_box(hash.get(&k));
        }
    });
}
A final tip.
Benchmarks are complicated enough that you should be extra careful to only benchmark what you wish to benchmark.
In this particular case, there are two method calls:
keys.drain(),
hash.get().
Since the benchmark name suggests to me that what you aim to measure is the performance of get, I can only assume that the call to keys.drain(..) is a mistake.
Thus, the benchmark really should be:
#[bench]
fn rust_get(b: &mut Bencher) {
    let (hash, keys): (HashMap<String, usize>, _) =
        get_random_hash(&HashMap::with_capacity, &rust_insert_fn);
    let keys = test::black_box(keys);
    b.iter(|| {
        for k in &keys {
            test::black_box(hash.get(k));
        }
    });
}
In this instance, this is even more critical in that the closure passed to b.iter() is expected to run multiple times: if you drain the keys the first time, what's left afterward? An empty Vec...
... which may actually be all that is really happening here; since b.iter() runs the closure until its time stabilizes, it may just be draining the Vec in the first run and then timing an empty loop.
