Multiplication algorithm slower than expected [duplicate] - performance

I am trying to benchmark getting keys from a Rust hash map. I have the following benchmark:
#[bench]
fn rust_get(b: &mut Bencher) {
    let (hash, keys) =
        get_random_hash::<HashMap<String, usize>>(&HashMap::with_capacity, &rust_insert_fn);
    let mut keys = test::black_box(keys);
    b.iter(|| {
        for k in keys.drain(..) {
            hash.get(&k);
        }
    });
}
where get_random_hash is defined as:
fn get_random_hash<T>(
    new: &Fn(usize) -> T,
    insert: &Fn(&mut T, String, usize) -> (),
) -> (T, Vec<String>) {
    let mut keys = Vec::with_capacity(HASH_SIZE);
    let mut hash = new(HASH_CAPACITY);
    for i in 0..HASH_SIZE {
        let k: String = format!("{}", Uuid::new_v4());
        keys.push(k.clone());
        insert(&mut hash, k, i);
    }
    (hash, keys)
}
and rust_insert_fn is:
fn rust_insert_fn(map: &mut HashMap<String, usize>, key: String, value: usize) {
    map.insert(key, value);
}
However, when I run the benchmark, it is clearly optimized out:
test benchmarks::benchmarks::rust_get ... bench: 1 ns/iter (+/- 0)
I thought test::black_box would solve the problem, but it doesn't look like it does. I have even tried wrapping the hash.get(&k) in the for loop with test::black_box, but the code still gets optimized away. How do I get the code to run without the get being optimized out?
EDIT - Even the following still optimizes out the get operation:
#[bench]
fn rust_get(b: &mut Bencher) {
    let (hash, keys) =
        get_random_hash::<HashMap<String, usize>>(&HashMap::with_capacity, &rust_insert_fn);
    let mut keys = test::black_box(keys);
    b.iter(|| {
        let mut n = 0;
        for k in keys.drain(..) {
            hash.get(&k);
            n += 1;
        }
        return n;
    });
}
Interestingly, the following benchmarks work:
#[bench]
fn rust_get_random(b: &mut Bencher) {
    let (hash, _) =
        get_random_hash::<HashMap<String, usize>>(&HashMap::with_capacity, &rust_insert_fn);
    b.iter(|| {
        for _ in 0..HASH_SIZE {
            hash.get(&format!("{}", Uuid::new_v4()));
        }
    });
}
#[bench]
fn rust_insert(b: &mut Bencher) {
    b.iter(|| {
        let mut hash = HashMap::with_capacity(HASH_CAPACITY);
        for i in 0..HASH_SIZE {
            let k: String = format!("{}", Uuid::new_v4());
            hash.insert(k, i);
        }
    });
}
but this also does not:
#[bench]
fn rust_del(b: &mut Bencher) {
    let (mut hash, keys) =
        get_random_hash::<HashMap<String, usize>>(&HashMap::with_capacity, &rust_insert_fn);
    let mut keys = test::black_box(keys);
    b.iter(|| {
        for k in keys.drain(..) {
            hash.remove(&k);
        }
    });
}
Here is the full gist.

How does a compiler optimizer work?
An optimizer is nothing more than a pipeline of analyses and transformations. Each individual analysis or transformation is relatively simple, and the optimal order to apply them is unknown and generally determined by heuristics.
How does this affect my benchmark?
Benchmarks are complicated: in general you wish to measure optimized code, but at the same time some analyses or transformations may remove the code you were interested in, rendering the benchmark useless.
It is therefore important to have a passing acquaintance with the analyses and transformation passes of the particular optimizer you are using so as to be able to understand:
which ones are undesirable,
how to foil them.
As mentioned, most passes are relatively simple, and therefore foiling them is relatively simple as well. The difficulty lies in the fact that there are a hundred or more of them and you have to know which one is kicking in to be able to foil it.
What optimizations am I running afoul of?
There are a few specific optimizations which very often play havoc with benchmarks:
Constant Propagation: allows evaluating part of the code at compile-time,
Loop Invariant Code Motion: allows lifting the evaluation of some piece of code outside the loop,
Dead Code Elimination: removes code that is not useful.
What? How dare the optimizer mangle my code so?
The optimizer operates under the so-called as-if rule. This basic rule allows the optimizer to perform any transformation which does not change the output of the program. That is, it should not change the observable behavior of the program in general.
On top of that, a few changes are generally explicitly allowed. The most obvious being that the run-time is expected to shrink, this in turn means that thread interleaving may differ, and some languages give even more wiggle room.
I used black_box!
What is black_box? It's a function whose definition is deliberately opaque to the optimizer. Since the function may have side effects, this constrains the optimizations the compiler is allowed to perform. It therefore means:
the transformed code must perform the very same number of calls to black_box as the original code,
the transformed code must perform said calls in the same order with regard to the passed-in arguments,
no assumption can be made about the value returned by black_box.
Thus, surgical use of black_box can foil certain optimizations. Blind use, however, may not foil the right ones.
So, which optimizations is my benchmark running afoul of?
Let's start from the naive code:
#[bench]
fn rust_get(b: &mut Bencher) {
    let (hash, mut keys): (HashMap<String, usize>, _) =
        get_random_hash(&HashMap::with_capacity, &rust_insert_fn);
    b.iter(|| {
        for k in keys.drain(..) {
            hash.get(&k);
        }
    });
}
The assumption is that the loop inside b.iter() will iterate over all keys and perform a hash.get() for each of them:
The result of hash.get() is unused,
hash.get() is a pure function, meaning that it has no side effects.
Thus, this loop can be rewritten as:
b.iter(|| { for k in keys.drain(..) {} })
We are running afoul of Dead Code Elimination (or some variant): the code serves no purpose, thus it is eliminated.
It may even be that the compiler is smart enough to realize that for k in keys.drain(..) {} can be optimized into drop(keys).
A surgical application of black_box can, however, foil DCE:
b.iter(|| {
    for k in keys.drain(..) {
        black_box(hash.get(&k));
    }
});
As per the effects of black_box described above:
the loop can no longer be optimized out, as it would change the number of calls to black_box,
each call to black_box must be performed with the expected argument.
There is still one possible hurdle: Constant Propagation. Specifically if the compiler realizes that all keys yield the same value, it could optimize out hash.get(&k) and replace it by said value.
This can be achieved by obfuscating either the keys (let mut keys = black_box(keys);, as you did above) or the map. If you were benchmarking an empty map, obfuscating the map would be necessary; here the two are equivalent.
We thus get:
#[bench]
fn rust_get(b: &mut Bencher) {
    let (hash, keys): (HashMap<String, usize>, _) =
        get_random_hash(&HashMap::with_capacity, &rust_insert_fn);
    let mut keys = test::black_box(keys);
    b.iter(|| {
        for k in keys.drain(..) {
            test::black_box(hash.get(&k));
        }
    });
}
A final tip.
Benchmarks are complicated enough that you should be extra careful to only benchmark what you wish to benchmark.
In this particular case, there are two method calls:
keys.drain(),
hash.get().
Since the benchmark name suggests, to me, that what you aim to measure is the performance of get, I can only assume that the call to keys.drain(..) is a mistake.
Thus, the benchmark really should be:
#[bench]
fn rust_get(b: &mut Bencher) {
    let (hash, keys): (HashMap<String, usize>, _) =
        get_random_hash(&HashMap::with_capacity, &rust_insert_fn);
    let keys = test::black_box(keys);
    b.iter(|| {
        for k in &keys {
            test::black_box(hash.get(k));
        }
    });
}
In this instance, this is even more critical in that the closure passed to b.iter() is expected to run multiple times: if you drain the keys the first time, what's left afterward? An empty Vec...
... which may actually be all that is really happening here; since b.iter() runs the closure until its time stabilizes, it may just be draining the Vec in the first run and then time an empty loop.


What's the fastest sound way to take ownership of elements of a Vec

Suppose I have the following:
fn into_three_tuple(mut v: Vec<String>) -> (String, String, String) {
    if v.len() == 3 {
        // ???
    } else {
        panic!()
    }
}
What should I replace ??? with to achieve the best performance?
Possible solutions
Sure, I could do
...
if v.len() == 3 {
    let mut iter = v.into_iter();
    (iter.next().unwrap(), iter.next().unwrap(), iter.next().unwrap())
} else {
...
or similarly
if v.len() == 3 {
    let e2 = v.pop().unwrap();
    let e1 = v.pop().unwrap();
    let e0 = v.pop().unwrap();
    (e0, e1, e2)
} else {
...
Problems with those solutions
Both of these implementations use unwrap, which, if I understand correctly, performs a runtime check. But since we have the v.len() == 3 condition, we know the vector is guaranteed to have 3 elements, so the runtime check is unnecessary.
Also, the into_iter solution may introduce the additional overhead of creating the iterator, and the pop solution may introduce the additional overhead of decreasing v's internal len field, which seems silly since v will be dropped immediately after extracting the elements (so we don't care whether its len field is accurate).
Question
Is there some (possibly unsafe) way that's more efficient (e.g., some way to directly take ownership of elements at arbitrary indices)?
Or perhaps the compiler is already smart enough to skip these extraneous operations?
Or do I just have to live with suboptimal performance?
Note
In case you're wondering why I'm obsessing over such a tiny micro-optimization, I'm writing a fairly speed-critical application, and this function will be called a significant amount of times.
There is an operation provided in the standard library to extract a fixed number of items from a vector: converting it to an array.
use std::convert::TryInto; // needed only on editions before 2021, which added TryInto to the prelude

pub fn into_three_tuple(v: Vec<String>) -> (String, String, String) {
    let three_array: [String; 3] = v.try_into().unwrap();
    let [a, b, c] = three_array;
    (a, b, c)
}
I'm not really familiar with reading x86 assembly, but this does compile down to simpler code with fewer branches. I would generally expect that this is in most cases the fastest way to unpack a three-element vector; if it is reliably slower, then that would be a performance bug in the standard library which should be reported and fixed.
You should also consider using an array [String; 3] instead of a tuple in the rest of your program. The type is shorter to write, and they allow you to use array and slice operations to act on all three strings. Additionally, tuples do not have a guaranteed memory layout, and arrays do (even though practically they are likely to be identical in this case).
Changing the return type to be an array makes the function trivial — possibly useful for the type declaration, but not containing any interesting code:
pub fn into_three_array(v: Vec<String>) -> [String; 3] {
    v.try_into().unwrap()
}
Disclaimer: As mentioned in the comments, you should benchmark to check that any of this actually makes a difference in your program. The fact that you're using many 3-element vectors (which are heap-allocated and therefore comparatively inefficient) shows that you may be over-optimizing, or optimizing at the wrong place. Having said that...
the into_iter solution may introduce the additional overhead of creating the iterator
Note that the "iterator" is a tiny on-stack value entirely transparent to the compiler, which can proceed to inline/eliminate it entirely.
Or perhaps the compiler is already smart enough to skip these extraneous operations?
In many cases, a check for v.len() == <concrete number> is indeed sufficient for the compiler to omit bounds checking because it has proof of the vector size. However, that doesn't appear to work with the approaches you've tried. After modifying the code to std::process::exit() if v.len() != 3 so the only panic is from the runtime checks, the runtime checks (as evidenced by calls to panic) are still not removed either with the .pop() or with the into_iter() approach.
Is there some (possibly unsafe) way that's more efficient (e.g., some way to directly take ownership of elements at arbitrary indices)?
Yes. One approach is to use unreachable_unchecked() to avoid the panic where we can prove the calls to next() will succeed:
use std::hint::unreachable_unchecked;

pub fn into_three_tuple(v: Vec<String>) -> (String, String, String) {
    if v.len() == 3 {
        let mut v = v.into_iter();
        unsafe {
            let e0 = v.next().unwrap_or_else(|| unreachable_unchecked());
            let e1 = v.next().unwrap_or_else(|| unreachable_unchecked());
            let e2 = v.next().unwrap_or_else(|| unreachable_unchecked());
            (e0, e1, e2)
        }
    } else {
        panic!()
    }
}
Modifying the code in the same way as the above shows no panic-related code.
Still, that relies on the compiler being smart enough. If you want to ensure the bound checks are not done, Rust unsafe allows you to do that as well. You can use as_ptr() to obtain a raw pointer to the elements stored in the vector, and read them from there directly. You need to call set_len() to prevent the vector from dropping the elements you've moved, but to still allow it to deallocate the storage.
pub fn into_three_tuple(mut v: Vec<String>) -> (String, String, String) {
    if v.len() == 3 {
        unsafe {
            v.set_len(0);
            let ptr = v.as_ptr();
            let e0 = ptr.read();
            let e1 = ptr.add(1).read();
            let e2 = ptr.add(2).read();
            (e0, e1, e2)
        }
    } else {
        panic!("expected Vec of length 3")
    }
}
The generated code again shows no bounds-check-related panics, which is expected because there are no calls to functions that perform a checked access to data.

Why is VecDeque slower than a Vec?

I'm beginning to optimize performance of a crate, and I swapped out a Vec for a VecDeque. The container maintains elements in sorted order (it's supposed to be fairly small, so I didn't yet bother trying a heap) and is occasionally split down the middle into two separate containers (another reason I haven't yet tried a heap) with drain.
I'd expect this second operation to be much faster: I can copy the first half of the collection out, then simply rotate and decrease the length of the original (now second) collection. However, when I run my #[bench] tests, performing the above operations a variable number of times (results below, in millions of ns/iter), I observed a performance decrease with the VecDeque:
          test a   test b   test c   test d
Vec       12.6     5.9      5.9      3.8
VecDeque  13.6     8.9      7.3      5.8
A reproducible example (gist):
#![feature(test)]
extern crate test;
use std::collections::VecDeque;

fn insert_in_sorted_order_vec<E: Ord + Eq>(t: &mut Vec<E>, k: E) {
    match t.binary_search(&k) {
        Ok(i) => t[i] = k,
        Err(i) => t.insert(i, k),
    }
}

fn insert_in_sorted_order_vecdeque<E: Ord + Eq>(t: &mut VecDeque<E>, k: E) {
    match t.binary_search(&k) {
        Ok(i) => t[i] = k,
        Err(i) => t.insert(i, k),
    }
}

fn split_vec<T>(mut t: Vec<T>) -> (Vec<T>, Vec<T>) {
    let a = t.drain(0..(t.len() / 2)).collect();
    (a, t)
}

fn split_vecdeque<T>(mut t: VecDeque<T>) -> (VecDeque<T>, VecDeque<T>) {
    let a = t.drain(0..(t.len() / 2)).collect();
    (a, t)
}
#[cfg(test)]
mod tests {
    use super::*;
    use test::Bencher;

    static ITERS_BEFORE_SPLIT: u32 = 50;
    static ITERS_TIME: u32 = 10_000;

    #[bench]
    fn vec_manip(b: &mut Bencher) {
        b.iter(|| {
            let mut v = Vec::new();
            for i in 0..(ITERS_TIME / ITERS_BEFORE_SPLIT) {
                for j in 1..(ITERS_BEFORE_SPLIT + 1) {
                    insert_in_sorted_order_vec(&mut v, i * j / (i + j)); // 'random'-ish illustrative number
                }
                v = split_vec(v).1;
            }
        });
    }

    #[bench]
    fn vecdeque_manip(b: &mut Bencher) {
        b.iter(|| {
            let mut v = VecDeque::new();
            for i in 0..(ITERS_TIME / ITERS_BEFORE_SPLIT) {
                for j in 1..(ITERS_BEFORE_SPLIT + 1) {
                    insert_in_sorted_order_vecdeque(&mut v, i * j / (i + j)); // 'random'-ish illustrative number
                }
                v = split_vecdeque(v).1;
            }
        });
    }
}
The Vec implementation took 69.2k ns/iter, and the VecDeque implementation took 91.8k.
I've repeated and verified these results a number of times - why is it that performance decreases with this more flexible data structure?
These results were obtained by running cargo bench.
Linux 5.11
3900X (12 cores, 3.8-4.6 GHz)
32GB 3200 MHz RAM
rustc 1.55.0-nightly
default cargo bench options (optimized, no debug symbols as far as I can tell, etc.)
Edit
I changed the split_vecdeque method to use split_off instead of drain().collect() (see below). This method is guaranteed not to reallocate or shift anything around; it just moves the head and tail pointers; see the documentation and implementation. That, however, performs even worse than the original VecDeque at 98.2k ns/iter. For larger values (ITERS_BEFORE_SPLIT = 50_000, ITERS_TIME = 5_000_000), performance (21.8m ns/iter) is better than drain (23.1m ns/iter) but still worse than Vec (19.1m ns/iter).
fn split_vecdeque<T>(mut t: VecDeque<T>) -> (VecDeque<T>, VecDeque<T>) {
    let a = t.split_off(t.len() / 2);
    (t, a)
}
A VecDeque is like a Vec but supports pushing and popping from both ends efficiently. In order to do this, it uses a single, contiguous buffer (just like a Vec), but treats it as two partitions: a head and a tail.
The structure is laid out like this:
pub struct VecDeque<T> {
    tail: usize,
    head: usize,
    buf: RawVec<T>,
}
Items in the buffer are ordered like this:
[[tail: 5, 6, 7] ...unused... [head: 1, 2, 3, 4]]
Adding an item to the end of the collection will append to the tail, using some of the unused space. Adding to the start of the collection will add to the start of the head, eating into the same space. When the head and tail meet in the middle, the VecDeque is full and will need to reallocate.
Compared with Vec:
pub struct Vec<T> {
    buf: RawVec<T>,
    len: usize,
}
Which uses its buffer like this:
[1, 2, 3, 4, 5, 6, 7 ...unused...]
Adding an item at the end is fast, but adding an item at the start requires copying all of the existing items to make space.
Most operations on VecDeque are made more complicated by this layout and this will slightly reduce its performance. Even retrieving its length is more complicated:
pub fn len(&self) -> usize {
    count(self.tail, self.head, self.cap())
}
The whole point of VecDeque is to make certain operations faster, namely pushing and popping the start of the collection. Vec is very slow at this, especially if there are a lot of items, because it involves moving all of the other items to make space. The structure of VecDeque makes these operations fast but at the expense of performance of other operations in comparison to Vec.
Your tests don't appear to take advantage of VecDeque's design, since they are dominated by calls to insert, which involves the expensive copying of many items in both cases.

How can you extend a collection in parallel?

I have a HashMap which I'd like to add elements to as fast as possible. I tried using par_extend, but it actually ended up being slower than the serial version. My guess is that it is evaluating the iterator in parallel, but extending the collection serially. Here's my code:
use std::collections::HashMap;
use rayon::prelude::*;
use time::Instant;

fn main() {
    let n = 1e7 as i64;

    // serial version
    let mut t = Instant::now();
    let mut m = HashMap::new();
    m.extend((1..n).map(|i| (i, i)));
    println!("Time in serial version: {}", t.elapsed().as_seconds_f64());

    // parallel version - slower
    t = Instant::now();
    let mut m2 = HashMap::new();
    m2.par_extend((1..n).into_par_iter().map(|i| (i, i)));
    println!("Time in parallel version: {}", t.elapsed().as_seconds_f64());
}
Is there a faster way to extend a HashMap that actually adds the elements in parallel? Or a similar data structure that can be extended in parallel? I know this would run faster with something like an FnvHashMap, but it seems like it should also be possible to speed this up with parallelism. (and yes, I'm compiling with --release)

How to sort a Vector in descending order in Rust?

In Rust, the sorting methods of a Vec always arrange the elements from smallest to largest. What is a general-purpose way of sorting from largest to smallest instead?
If you have a vector of numbers, you can provide a key extraction function that "inverts" the number like this:
let mut numbers: Vec<u32> = vec![100_000, 3, 6, 2];
numbers.sort_by_key(|n| std::u32::MAX - n);
But that's not very clear, and it's not straightforward to extend that method to other types like strings.
There are at least three ways to do it.
Flipped comparison function
vec.sort_by(|a, b| b.cmp(a))
This switches around the order in which elements are compared, so that smaller elements appear larger to the sorting function and vice versa.
Wrapper with reverse Ord instance
use std::cmp::Reverse;
vec.sort_by_key(|w| Reverse(*w));
Reverse is a generic wrapper which has an Ord instance that is the opposite of the wrapped type's ordering.
If you try to return a Reverse containing a reference by removing the *, that results in a lifetime problem, same as when you return a reference directly inside sort_by_key (see also this question). Hence, this code snippet can only be used with vectors where the keys are Copy types.
Sorting then reversing
vec.sort();
vec.reverse();
It initially sorts in the wrong order and then reverses all elements.
Performance
I benchmarked the three methods with criterion for a length 100_000 Vec<u64>. The timing results are listed in the order above. The left and right values show the lower and upper bounds of the confidence interval respectively, and the middle value is criterion's best estimate.
Performance is comparable, although the flipped comparison function seems to be a tiny bit slower:
Sorting/sort_1 time: [6.2189 ms 6.2539 ms 6.2936 ms]
Sorting/sort_2 time: [6.1828 ms 6.1848 ms 6.1870 ms]
Sorting/sort_3 time: [6.2090 ms 6.2112 ms 6.2138 ms]
To reproduce, save the following files as benches/sort.rs and Cargo.toml, then run cargo bench. There is an additional benchmark in there which checks that the cost of cloning the vector is irrelevant compared to the sorting, it only takes a few microseconds.
fn generate_shuffled_data() -> Vec<u64> {
    use rand::Rng;
    let mut rng = rand::thread_rng();
    (0..100000).map(|_| rng.gen::<u64>()).collect()
}

pub fn no_sort<T: Ord>(vec: Vec<T>) -> Vec<T> {
    vec
}

pub fn sort_1<T: Ord>(mut vec: Vec<T>) -> Vec<T> {
    vec.sort_by(|a, b| b.cmp(a));
    vec
}

pub fn sort_2<T: Ord + Copy>(mut vec: Vec<T>) -> Vec<T> {
    vec.sort_by_key(|&w| std::cmp::Reverse(w));
    vec
}

pub fn sort_3<T: Ord>(mut vec: Vec<T>) -> Vec<T> {
    vec.sort();
    vec.reverse();
    vec
}

use criterion::{black_box, criterion_group, criterion_main, Criterion};

fn comparison_benchmark(c: &mut Criterion) {
    let mut group = c.benchmark_group("Sorting");
    let data = generate_shuffled_data();
    group.bench_function("no_sort", |b| {
        b.iter(|| black_box(no_sort(data.clone())))
    });
    group.bench_function("sort_1", |b| {
        b.iter(|| black_box(sort_1(data.clone())))
    });
    group.bench_function("sort_2", |b| {
        b.iter(|| black_box(sort_2(data.clone())))
    });
    group.bench_function("sort_3", |b| {
        b.iter(|| black_box(sort_3(data.clone())))
    });
    group.finish()
}

criterion_group!(benches, comparison_benchmark);
criterion_main!(benches);
[package]
name = "sorting_bench"
version = "0.1.0"
authors = ["nnnmmm"]
edition = "2018"
[[bench]]
name = "sort"
harness = false
[dev-dependencies]
criterion = "0.3"
rand = "0.7.3"

How does iter() differ to a plain iteration over a HashMap? [duplicate]

I am doing the Rust by Example tutorial which has this code snippet:
// Vec example
let vec1 = vec![1, 2, 3];
let vec2 = vec![4, 5, 6];

// `iter()` for vecs yields `&i32`. Destructure to `i32`.
println!("2 in vec1: {}", vec1.iter().any(|&x| x == 2));
// `into_iter()` for vecs yields `i32`. No destructuring required.
println!("2 in vec2: {}", vec2.into_iter().any(|x| x == 2));

// Array example
let array1 = [1, 2, 3];
let array2 = [4, 5, 6];

// `iter()` for arrays yields `&i32`.
println!("2 in array1: {}", array1.iter().any(|&x| x == 2));
// `into_iter()` for arrays unusually yields `&i32`.
println!("2 in array2: {}", array2.into_iter().any(|&x| x == 2));
I am thoroughly confused — for a Vec the iterator returned from iter yields references and the iterator returned from into_iter yields values, but for an array these iterators are identical?
What is the use case/API for these two methods?
TL;DR:
The iterator returned by into_iter may yield any of T, &T or &mut T, depending on the context.
The iterator returned by iter will yield &T, by convention.
The iterator returned by iter_mut will yield &mut T, by convention.
The first question is: "What is into_iter?"
into_iter comes from the IntoIterator trait:
pub trait IntoIterator
where
    <Self::IntoIter as Iterator>::Item == Self::Item,
{
    type Item;
    type IntoIter: Iterator;

    fn into_iter(self) -> Self::IntoIter;
}
You implement this trait when you want to specify how a particular type is to be converted into an iterator. Most notably, if a type implements IntoIterator it can be used in a for loop.
For example, Vec implements IntoIterator... thrice!
impl<T> IntoIterator for Vec<T>
impl<'a, T> IntoIterator for &'a Vec<T>
impl<'a, T> IntoIterator for &'a mut Vec<T>
Each variant is slightly different.
This one consumes the Vec and its iterator yields values (T directly):
impl<T> IntoIterator for Vec<T> {
    type Item = T;
    type IntoIter = IntoIter<T>;

    fn into_iter(mut self) -> IntoIter<T> { /* ... */ }
}
The other two take the vector by reference (don't be fooled by the signature of into_iter(self) because self is a reference in both cases) and their iterators will produce references to the elements inside Vec.
This one yields immutable references:
impl<'a, T> IntoIterator for &'a Vec<T> {
    type Item = &'a T;
    type IntoIter = slice::Iter<'a, T>;

    fn into_iter(self) -> slice::Iter<'a, T> { /* ... */ }
}
While this one yields mutable references:
impl<'a, T> IntoIterator for &'a mut Vec<T> {
    type Item = &'a mut T;
    type IntoIter = slice::IterMut<'a, T>;

    fn into_iter(self) -> slice::IterMut<'a, T> { /* ... */ }
}
So:
What is the difference between iter and into_iter?
into_iter is a generic method to obtain an iterator, whether this iterator yields values, immutable references or mutable references is context dependent and can sometimes be surprising.
iter and iter_mut are ad-hoc methods. Their return type is therefore independent of the context, and will conventionally be iterators yielding immutable references and mutable references, respectively.
The author of the Rust by Example post illustrates the surprise coming from the dependence on the context (i.e., the type) on which into_iter is called, and is also compounding the problem by using the fact that:
IntoIterator is not implemented for [T; N], only for &[T; N] and &mut [T; N] -- it will be for Rust 2021.
When a method is not implemented for a value, it is automatically looked up on references to that value instead,
which is very surprising for into_iter since all types (except [T; N]) implement it for all 3 variations (value and references).
Arrays implement IntoIterator (in such a surprising fashion) to make it possible to iterate over references to them in for loops.
As of Rust 1.51, it's possible for the array to implement an iterator that yields values (via array::IntoIter), but the existing implementation of IntoIterator that automatically references makes it hard to implement by-value iteration via IntoIterator.
I came here from Google seeking a simple answer which wasn't provided by the other answers. Here's that simple answer:
iter() iterates over the items by reference
iter_mut() iterates over the items, giving a mutable reference to each item
into_iter() iterates over the items, moving them into the new scope
So for x in my_vec { ... } is essentially equivalent to my_vec.into_iter().for_each(|x| ... ) - both move the elements of my_vec into the ... scope.
If you just need to look at the data, use iter, if you need to edit/mutate it, use iter_mut, and if you need to give it a new owner, use into_iter.
This was helpful: http://hermanradtke.com/2015/06/22/effectively-using-iterators-in-rust.html
I think there's something to clarify a bit more. Collection types such as Vec<T> and VecDeque<T> have an into_iter method that yields T because they implement IntoIterator<Item=T>. There's nothing stopping us from creating a type Foo<T> which, when iterated over, yields not T but another type U. That is, Foo<T> implements IntoIterator<Item=U>.
In fact, there are some examples in std: &Path implements IntoIterator<Item=&OsStr> and &UnixListener implements IntoIterator<Item=Result<UnixStream>>.
The difference between into_iter and iter
Back to the original question on the difference between into_iter and iter. Similar to what others have pointed out, the difference is that into_iter is a required method of IntoIterator which can yield any type specified in IntoIterator::Item. Typically, if a type implements IntoIterator<Item=I>, by convention it has also two ad-hoc methods: iter and iter_mut which yield &I and &mut I, respectively.
What this implies is that we can write a function that accepts any iterable (i.e. any type that has an into_iter method) by using a trait bound:
fn process_iterable<I: IntoIterator>(iterable: I) {
    for item in iterable {
        // ...
    }
}
However, we can't* use a trait bound to require a type to have an iter or iter_mut method, because they're just conventions. We can say that into_iter is more widely usable than iter or iter_mut.
Alternatives to iter and iter_mut
Another interesting thing to observe is that iter is not the only way to get an iterator that yields &T. By convention (again), collection types SomeCollection<T> in std which have iter method also have their immutable reference types &SomeCollection<T> implement IntoIterator<Item=&T>. For example, &Vec<T> implements IntoIterator<Item=&T>, so it enables us to iterate over &Vec<T>:
let v = vec![1, 2];
// Below is equivalent to: `for item in v.iter() {`
for item in &v {
    println!("{}", item);
}
If v.iter() is equivalent to &v in that both implement IntoIterator<Item=&T>, why then does Rust provide both? It's for ergonomics. In for loops, it's a bit more concise to use &v than v.iter(); but in other cases, v.iter() is a lot clearer than (&v).into_iter():
let v = vec![1, 2];
let a: Vec<i32> = v.iter().map(|x| x * x).collect();
// Although above and below are equivalent, above is a lot clearer than below.
let b: Vec<i32> = (&v).into_iter().map(|x| x * x).collect();
Similarly, in for loops, v.iter_mut() can be replaced with &mut v:
let mut v = vec![1, 2];
// Below is equivalent to: `for item in v.iter_mut() {`
for item in &mut v {
    *item *= 2;
}
When to provide (implement) into_iter and iter methods for a type
If the type has only one "way" to be iterated over, we should implement both. However, if there are two or more ways it can be iterated over, we should instead provide an ad-hoc method for each way.
For example, String provides neither into_iter nor iter because there are two ways to iterate it: to iterate its representation in bytes or to iterate its representation in characters. Instead, it provides two methods: bytes for iterating the bytes and chars for iterating the characters, as alternatives to iter method.
* Well, technically we can do it by creating a trait. But then we need to impl that trait for each type we want to use. Meanwhile, many types in std already implement IntoIterator.
.into_iter() is not implemented for an array itself, but only for &[T]. Compare:
impl<'a, T> IntoIterator for &'a [T]
type Item = &'a T
with
impl<T> IntoIterator for Vec<T>
type Item = T
Since IntoIterator is defined only on &[T], a slice cannot be consumed the way a Vec is when you use the values (the values cannot be moved out).
Now, why that's the case is a different issue, and I'd like to learn the answer myself. Speculating: an array is the data itself, while a slice is only a view into it. In practice you cannot move the array as a value into another function, just pass a view of it, so you cannot consume it there either.
IntoIterator and Iterator are usually used like this: we implement IntoIterator for structures that have an inner/nested value (or are behind a reference) that either implements Iterator or has an intermediate "Iter" structure.
For example, lets create a "new" data structure:
struct List<T>;
// Looks something like this:
// - List<T>(Option<Box<ListCell<T>>>)
// - ListCell<T> { value: T, next: List<T> }
We want this List<T> to be iterable, so this should be a good place to implement Iterator right? Yes, we could do that, but that would limit us in certain ways.
Instead we create an intermediate "iterable" structure and implement the Iterator trait:
// NOTE: I have removed all lifetimes to make it less messy.
struct ListIter<T> { cursor: &List<T> }

impl<T> Iterator for ListIter<T> {
    type Item = &T;

    fn next(&mut self) -> Option<Self::Item> { ... }
}
So now we need to somehow connect List<T> and ListIter<T>. This can be done by implementing IntoIterator for List.
impl<T> IntoIterator for List<T> {
    type Item = T;
    type IntoIter = ListIter<Self::Item>;

    fn into_iter(self) -> Self::IntoIter { ListIter { cursor: &self } }
}
IntoIterator can also be implemented multiple times for a container struct if, for example, it contains different nested iterable fields or we have some higher-kinded type situation.
Lets say we have a Collection<T>: IntoIterator trait that will be implemented by multiple data structures, e.g. List<T>, Vector<T> and Tree<T> that also have their respective Iter; ListIter<T>, VectorIter<T> and TreeIter<T>. But what does this actually mean when we go from generic to specific code?
fn wrapper<C>(value: C) where C: Collection<i32> {
    let iter = value.into_iter(); // But what iterator are we?
    ...
}
This code is not 100% correct, lifetimes and mutability support are omitted.