How to benchmark programs in Rust? - time

Is it possible to benchmark programs in Rust? If yes, how? For example, how would I get the execution time of a program in seconds?

It might be worth noting two years later (to help any future Rust programmers who stumble on this page) that there are now tools to benchmark Rust code as a part of one's test suite.
(From the guide link below) Using the #[bench] attribute, one can use the standard Rust tooling to benchmark methods in their code.
#![feature(test)]
extern crate test;

use test::Bencher;

#[bench]
fn bench_xor_1000_ints(b: &mut Bencher) {
    b.iter(|| {
        // Use `test::black_box` to prevent compiler optimizations from disregarding
        // unused values
        test::black_box((0u32..1000).fold(0, |old, new| old ^ new));
    });
}
For the command cargo bench this outputs something like:
running 1 test
test bench_xor_1000_ints ... bench: 375 ns/iter (+/- 148)
test result: ok. 0 passed; 0 failed; 0 ignored; 1 measured
Links:
The Rust Book (section on benchmark tests)
"The Nightly Book" (section on the test crate)
test::Bencher docs

For measuring time without adding third-party dependencies, you can use std::time::Instant:
fn main() {
    use std::time::Instant;
    let now = Instant::now();

    // Code block to measure.
    {
        my_function_to_measure();
    }

    let elapsed = now.elapsed();
    println!("Elapsed: {:.2?}", elapsed);
}

There are several ways to benchmark your Rust program. For most real benchmarks, you should use a proper benchmarking framework as they help with a couple of things that are easy to screw up (including statistical analysis). Please also read the "Why writing benchmarks is hard" section at the very bottom!
Quick and easy: Instant and Duration from the standard library
To quickly check how long a piece of code runs, you can use the types in std::time. The module is fairly minimal, but it is fine for simple time measurements. You should use Instant instead of SystemTime as the former is a monotonically increasing clock and the latter is not. Example (Playground):
use std::time::Instant;
let before = Instant::now();
workload();
println!("Elapsed time: {:.2?}", before.elapsed());
The underlying platform-specific implementations of std's Instant are specified in the documentation. In short: currently (and probably forever) you can assume that it uses the best precision that the platform can provide (or something very close to it). From my measurements and experience, this is typically around 20 ns.
If std::time does not offer enough features for your case, you could take a look at chrono. However, for measuring durations, it's unlikely you need that external crate.
Using a benchmarking framework
Using frameworks is often a good idea, because they try to prevent you from making common mistakes.
Rust's built-in benchmarking framework (nightly only)
Rust has a convenient built-in benchmarking feature, which is unfortunately still unstable as of 2019-07. You have to add the #[bench] attribute to your function and make it accept one &mut test::Bencher argument:
#![feature(test)]
extern crate test;

use test::Bencher;

#[bench]
fn bench_workload(b: &mut Bencher) {
    b.iter(|| workload());
}
Executing cargo bench will print:
running 1 test
test bench_workload ... bench: 78,534 ns/iter (+/- 3,606)
test result: ok. 0 passed; 0 failed; 0 ignored; 1 measured; 0 filtered out
Criterion
The crate criterion is a framework that runs on stable, but it is a bit more complicated than the built-in solution. It does more sophisticated statistical analysis, offers a richer API, produces more information and can even automatically generate plots.
See the "Quickstart" section for more information on how to use Criterion.
Why writing benchmarks is hard
There are many pitfalls when writing benchmarks. A single mistake can make your benchmark results meaningless. Here is a list of important but commonly forgotten points:
Compile with optimizations: rustc -O or cargo build --release. When you execute your benchmarks with cargo bench, Cargo automatically enables optimizations. This step is important, as there are often large performance differences between optimized and unoptimized Rust code.
Repeat the workload: only running your workload once is almost always useless. There are many things that can influence your timing: overall system load, the operating system doing stuff, CPU throttling, file system caches, and so on. So repeat your workload as often as possible. For example, Criterion runs every benchmark for at least 5 seconds (even if the workload only takes a few nanoseconds). All measured times can then be analyzed, with mean and standard deviation being the standard tools.
Make sure your benchmark isn't completely removed: benchmarks are very artificial by nature. Usually, the result of your workload is not inspected as you only want to measure the duration. However, this means that a good optimizer could remove your whole benchmark because it does not have side-effects (well, apart from the passage of time). So to trick the optimizer, you have to somehow use your result value so that your workload cannot be removed. An easy way is to print the result. A better solution is something like black_box. This function basically hides a value from LLVM in that LLVM cannot know what will happen with the value. Nothing happens, but LLVM doesn't know. That is the point.
Good benchmarking frameworks use a black box in several situations. For example, the closure given to the iter method (for both the built-in and the Criterion Bencher) can return a value. That value is automatically passed into a black_box.
Beware of constant values: similarly to the point above, if you specify constant values in a benchmark, the optimizer might generate code specifically for that value. In extreme cases, your whole workload could be constant-folded into a single constant, meaning that your benchmark is useless. Pass all constant values through black_box to avoid LLVM optimizing too aggressively.
Beware of measurement overhead: measuring a duration takes time itself. That is usually only tens of nanoseconds, but can influence your measured times. So for all workloads that are faster than a few tens of nanoseconds, you should not measure each execution time individually. You could execute your workload 100 times and measure how long all 100 executions took. Dividing that by 100 gives you the average single time. The benchmarking frameworks mentioned above also use this trick. Criterion also has a few methods for measuring very short workloads that have side effects (like mutating something).
Many other things: sadly, I cannot list all difficulties here. If you want to write serious benchmarks, please read more online resources.
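To illustrate the batching trick from the "measurement overhead" point above, here is a rough sketch using only the standard library (workload is a placeholder for whatever you want to measure):
use std::hint::black_box;
use std::time::Instant;

// Placeholder workload: something far too fast to time individually.
fn workload(x: u64) -> u64 {
    x ^ (x << 13)
}

fn main() {
    const RUNS: u32 = 1_000_000;
    let start = Instant::now();
    let mut acc = 0u64;
    for i in 0..RUNS as u64 {
        // `black_box` keeps the input and result from being constant-folded away.
        acc = acc.wrapping_add(workload(black_box(i)));
    }
    let elapsed = start.elapsed();
    black_box(acc);
    // One measurement for all runs; divide to get the average per-run time.
    println!("average: {:?} per run", elapsed / RUNS);
}
Note that std::hint::black_box requires a reasonably recent stable Rust; on older toolchains you would have to use the nightly test::black_box or simply print the accumulated result.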

If you simply want to time a piece of code, you can use the time crate. The time crate has since been deprecated, though; a follow-up crate is chrono.
Add time = "*" to your Cargo.toml.
Add
extern crate time;
use time::PreciseTime;
before your main function and
let start = PreciseTime::now();
// whatever you want to do
let end = PreciseTime::now();
println!("{} seconds for whatever you did.", start.to(end));
Complete example
Cargo.toml
[package]
name = "hello_world" # the name of the package
version = "0.0.1" # the current version, obeying semver
authors = [ "you#example.com" ]
[[bin]]
name = "rust"
path = "rust.rs"
[dependencies]
rand = "*" # Or a specific version
time = "*"
rust.rs
extern crate rand;
extern crate time;
use rand::Rng;
use time::PreciseTime;
fn main() {
    // Create a vector of 10000000 random integers
    //let mut array: [i32; 10000000] = [0; 10000000];
    let n = 10000000;
    let mut array = Vec::new();

    // Fill the vector
    let mut rng = rand::thread_rng();
    for _ in 0..n {
        //array[i] = rng.gen::<i32>();
        array.push(rng.gen::<i32>());
    }

    // Sort
    let start = PreciseTime::now();
    array.sort();
    let end = PreciseTime::now();

    println!("{} seconds for sorting {} integers.", start.to(end), n);
}
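For reference, here is roughly the same measurement without the deprecated time crate, using std::time::Instant (a sketch that assumes a current rand release with thread_rng/gen, e.g. rand 0.8):
use rand::Rng;
use std::time::Instant;

fn main() {
    let n = 10_000_000;
    let mut rng = rand::thread_rng();

    // Fill a vector with random integers.
    let mut array: Vec<i32> = (0..n).map(|_| rng.gen()).collect();

    // Measure only the sort.
    let start = Instant::now();
    array.sort();
    println!("{:.2?} for sorting {} integers.", start.elapsed(), n);
}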

This answer is outdated! The time crate does not offer any advantages over std::time with regard to benchmarking. Please see the other answers for up-to-date information.
You might try timing individual components within the program using the time crate.

A quick way to find out the execution time of a program, regardless of implementation language, is to run time prog on the command line. For example:
~$ time sleep 4
real 0m4.002s
user 0m0.000s
sys 0m0.000s
The most interesting measurement is usually user, which measures the actual amount of work done by the program, regardless of what's going on in the system (sleep is a pretty boring program to benchmark). real measures the actual time that elapsed, and sys measures the amount of work done by the OS on behalf of the program.

Currently, there is no interface to any of the following Linux functions:
clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &ts)
getrusage
times (manpage: man 2 times)
The available ways to measure the CPU time and hotspots of a Rust program on Linux are:
/usr/bin/time program
perf stat program
perf record --freq 100000 program; perf report
valgrind --tool=callgrind program; kcachegrind callgrind.out.*
The output of perf report and valgrind depends on the availability of debugging information in the program, so they may not work without it.
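While the standard library exposes none of the calls listed above, you can go through the libc crate yourself if you want process CPU time from inside the program. A rough, Unix-only sketch (assuming libc is listed as a dependency):
// Query the process's user/system CPU time via libc::getrusage.
fn cpu_times() -> (f64, f64) {
    unsafe {
        let mut usage: libc::rusage = std::mem::zeroed();
        // RUSAGE_SELF = resource usage of the calling process.
        if libc::getrusage(libc::RUSAGE_SELF, &mut usage) != 0 {
            panic!("getrusage failed");
        }
        let user = usage.ru_utime.tv_sec as f64 + usage.ru_utime.tv_usec as f64 * 1e-6;
        let sys = usage.ru_stime.tv_sec as f64 + usage.ru_stime.tv_usec as f64 * 1e-6;
        (user, sys)
    }
}

fn main() {
    // Burn a little CPU so there is something to report.
    let mut x = 0u64;
    for i in 0..50_000_000u64 {
        x = x.wrapping_add(i);
    }
    let (user, sys) = cpu_times();
    println!("result: {}, user: {:.3}s, sys: {:.3}s", x, user, sys);
}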

I created a small crate for this (measure_time), which logs or prints the time until end of scope.
#[macro_use]
extern crate measure_time;
fn main() {
    print_time!("measure function");
    do_stuff();
}

Another solution for measuring execution time is to create a custom type, for example a struct, and implement the Drop trait for it.
For example:
struct Elapsed(&'static str, std::time::SystemTime);

impl Drop for Elapsed {
    fn drop(&mut self) {
        println!(
            "operation {} finished in {} ms",
            self.0,
            self.1.elapsed().unwrap_or_default().as_millis()
        );
    }
}

impl Elapsed {
    pub fn start(op: &'static str) -> Elapsed {
        let now = std::time::SystemTime::now();
        Elapsed(op, now)
    }
}
And using it in some function:
fn some_heavy_work() {
    let _exec_time = Elapsed::start("some_heavy_work_fn");
    // Here's some code.
}
When the function ends, the drop method for _exec_time will be called and the message will be printed.

Related

How to benchmark Rust performance? [duplicate]


How can this function in Haskell be optimised

As part of an advent of code challenge, I've written the following functions in Haskell:
simulateUntilRepeat_int a b i = if (a /= b) then (simulateUntilRepeat_int a (updateCycle b) (i+1)) else i
simulateUntilRepeat a = simulateUntilRepeat_int a (updateCycle a) 1
The purpose of this is to take a list of moons and simulate their movement until they resume their original position, returning the number of cycles it took for them to get there. (the function updateCycle does one iteration of the simulation). However, when I attempt to run this it uses all available memory and then gets killed by the operating system. The question does admit that this may take a very large number of cycles.
Googling around about this problem I find the usual fix is to make some of the parameters strict, but I think I've experimented with all possible permutations of strictness on the parameters to no avail. By the looks of this function, I'd have anticipated the compiler would be able to use the tail recursion optimisation and turn it into a loop, but this seems to not be happening somehow.
A friend of mine, who is knowledgeable in Haskell, suggested changing the form of the function to the following:
f a b0 = length (takeWhile (/= a) (iterate updateCycle b0))
But doing this didn't fix it either, leaving me out of ideas.
The comments are undoubtedly correct that your approach is not the intended solution method.
However, the functions you've posted would not, in and of themselves, cause a memory leak, fail to tail recurse, or lead to poor performance. Given your code above plus the definitions:
updateCycle 4686774942 = 0
updateCycle n = n+1

main = do
  print $ simulateUntilRepeat (0 :: Int)
and compiling with -O2, the program runs in constant memory on my laptop in about 30 seconds. Adding explicit type signatures to use Int in place of Integer for the iteration count:
simulateUntilRepeat_int :: Int -> Int -> Int -> Int
simulateUntilRepeat :: Int -> Int
it runs in about 2.4 seconds.
So, to understand why your program is gobbling all available memory or why your strictness annotations failed to make a difference, it would probably be necessary to see the whole working program (or preferably a minimal example that illustrates the performance problem). If the program is short, and the question is "why is the performance of this program totally unreasonable?" instead of "how can I optimize my program to run as fast as possible?", it might still be a good SO question. Otherwise, the Code Review site might be better -- you can post a larger program there and ask for general performance advice, and that's considered on-topic for that site.

Fastest way to generate random numbers

libc has random which
uses a nonlinear additive feedback random number generator employing a default table of size 31 long integers to return successive pseudo-random numbers
I'm looking for a random function written in Rust with the same speed. It doesn't need to be crypto-safe, pseudo-random is good enough.
Looking through the rand crate it seems that XorShiftRng is the closest fit to this need:
The Xorshift algorithm is not suitable for cryptographic purposes but is very fast
When I use it like this:
extern crate rand;

use rand::Rng;

let mut rng = rand::XorShiftRng::new_unseeded();
rng.next_u64();
it is about 33% slower than libc's random. (Sample code generating 8'000'000 random numbers).
In the end, I'll need i64 random numbers, so when I run rng.gen() it is already 100% slower than libc's random. When casting with rng.next_u64() as i64, it is 60% slower.
Is there any way to achieve the same speed in rust without using any unsafe code?
Be sure to compile the code you are measuring in release mode, otherwise your benchmark is not representative of Rust's performance.
In order to obtain meaningful numbers, you must also modify the benchmark to do something with the generated numbers, e.g. collect them into a vector [1]. Failing to do so can make the compiler optimize the whole loop away because it has no side effects. This is what happened in your second attempt that led you to conclude that XorShiftRng is 760,000 times faster than libc::random.
With the changed benchmark run in release mode, XorShiftRng ends up approximately 2x faster than libc::random:
PT0.101378490S seconds for libc::random
PT0.050827393S seconds for XorShiftRng
[1] The compiler could also be smart enough to realize that the vector is also unused and optimize it away as well, but current rustc does not do so, and storing elements into a vector is enough. A simple and future-proof way to ensure that the generation is not optimized away is to sum up the numbers and write out the result.
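A minimal sketch of that "use the results" idea, assuming the rand 0.4-era API from the question (rand::XorShiftRng::new_unseeded, next_u64) and a build in release mode:
extern crate rand;

use rand::Rng;
use std::time::Instant;

fn main() {
    let mut rng = rand::XorShiftRng::new_unseeded();
    let start = Instant::now();
    let mut sum: u64 = 0;
    for _ in 0..8_000_000 {
        // Summing (and printing below) keeps the loop from being optimized away.
        sum = sum.wrapping_add(rng.next_u64());
    }
    println!("sum = {}, elapsed = {:.2?}", sum, start.elapsed());
}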

Golang benchmarking b.StopTimer() hangs -- is it me?

I've run into a problem benchmarking my Go programs, and it looks to me like a Go bug. Please help me understand why it's not.
In a nutshell: benchmarking hangs when I call b.StopTimer() even if I then call b.StartTimer() afterwards. AFAICT I am using the functions as intended.
This is on Mac OSX 10.11.5 "El Capitan," with go version go1.6.1 darwin/amd64; everything else seems to be fine. I will try to check this on Linux later today and update the question.
Here's the code. Run with go test -bench=. as usual.
package bmtimer

import (
    "testing"
    "time"
)

// These appear to not matter; they can be very short.
var COSTLY_PREP_DELAY = time.Nanosecond
var RESET_STATE_DELAY = time.Nanosecond

// This however -- the measured function delay -- must not be too short.
// The shorter it is, the more iterations will be timed. If it's infinitely
// short then you will hang -- or perhaps *appear* to hang?
var FUNCTION_OF_INTEREST_DELAY = time.Microsecond

func CostlyPrep() { time.Sleep(COSTLY_PREP_DELAY) }
func ResetState() { time.Sleep(RESET_STATE_DELAY) }
func FunctionOfInterest() { time.Sleep(FUNCTION_OF_INTEREST_DELAY) }

func TestPlaceHolder(t *testing.T) {}

func BenchmarkSomething(b *testing.B) {
    CostlyPrep()
    b.ResetTimer()
    for n := 0; n < b.N; n++ {
        FunctionOfInterest()
        b.StopTimer()
        ResetState()
        b.StartTimer()
    }
}
Thanks in advance for any tips! I've come up short on Google as well as here.
Edit: It seems this only happens if FunctionOfInterest() is really fast; and yes, I specifically want to do this inside my benchmark loop. I've updated the sample code to make that more clear. And I think I have it figured out now.
TL;DR: it's not a bug, it's just a little tricky.
I'm pretty sure I know what's going on. The use of StopTimer and StartTimer inside the benchmarking loop is not the problem; the problem is that the function I do want to test is too fast for its own good. Or for my good anyway.
By adjusting the sleep time used by that function, we can pretty easily observe the benchmarking differences. At least for something this simple, Go is running more iterations the faster the measured code returns. This is known behavior and I should have thought of it. One finds output like this:
PASS
BenchmarkSomething-4 1000000 2079 ns/op
ok bmtimer 36.587s
When the FunctionOfInterest returns very fast -- or of course "infinitely" fast, as in func tooFast() {} -- then Go keeps running more iterations of it, trying to get a reliable benchmark. You might run out of patience. Or, your computer might:
*** Test killed with quit: ran too long (10m0s).
FAIL bmtimer 600.037s
(That was ten minutes at 100% CPU on one core.)
The good news of course is that it's not a bug; the bad news is it's still pretty annoying that it doesn't just stop at, say, a million iterations and tell you its best guess.
But then again, the other good news is that encountering this problem indicates your function is fast enough and probably doesn't need benchmarking.
The remaining mystery is: Why does this only happen when we stop and start the timer?
Without fiddling with the timer, I can call my noop function and it benchmarks it just fine, in a mere two billion iterations:
BenchmarkSomething-4 2000000000 0.33 ns/op
ok bmtimer 0.706s
My best guess is based on the comment from @JimB -- stopping the world takes at least a little time, and when you're running two billion iterations that can easily run over ten minutes. Like the man said: a billion here, a billion there, and pretty soon you're talking about real money.
(That's my guess anyway. Smarter folks than I may yet correct me.)
Thus the moral of our story is: yes by all means do take advantage of b.StopTimer() and b.StartTimer() in order to handle state if you have no other way to deal with it. But keep in mind that for very fast functions, it may be too much for the benchmarker.
I do wish I knew a better way to isolate the specific function benchmark for fast functions. My initial hack that I used before trying StopTimer was to have another benchmark function that benchmarks just the extra code, so you'd get two numbers: ns/op with everything, and ns/op without the function of interest. That works, and it's probably even more accurate than stopping the timer, but it does require advance knowledge of how to interpret the benchmark output.

Replicate() versus a for loop?

Does anyone know how the replicate() function works in R and how efficient it is relative to using a for loop?
For example, is there any efficiency difference between...
means <- replicate(100000, mean(rnorm(50)))
And...
means <- c()
for(i in 1:100000) {
  means <- c(means, mean(rnorm(50)))
}
(I may have typed something slightly off above, but you get the idea.)
You can just benchmark the code and get your answer empirically. Note that I also added a second for loop flavor which circumvents the growing vector problem by preallocating the vector.
library(rbenchmark)  # provides benchmark()

repl_function = function(no_rep) means <- replicate(no_rep, mean(rnorm(50)))

for_loop = function(no_rep) {
  means <- c()
  for(i in 1:no_rep) {
    means <- c(means, mean(rnorm(50)))
  }
  means
}

for_loop_prealloc = function(no_rep) {
  means <- vector(mode = "numeric", length = no_rep)
  for(i in 1:no_rep) {
    means[i] <- mean(rnorm(50))
  }
  means
}

no_loops = 50e3

benchmark(repl_function(no_loops),
          for_loop(no_loops),
          for_loop_prealloc(no_loops),
          replications = 3)
                         test replications elapsed relative user.self sys.self user.child sys.child
2          for_loop(no_loops)            3  18.886    6.274    17.803    0.894          0         0
3 for_loop_prealloc(no_loops)            3   3.209    1.066     3.189    0.000          0         0
1     repl_function(no_loops)            3   3.010    1.000     2.997    0.000          0         0
Looking at the relative column, the un-preallocated for loop is 6.2 times slower. However, the preallocated for loop is just as fast as replicate.
replicate is a wrapper for sapply, which itself is a wrapper for lapply. lapply is ultimately an .Internal function that is written in C and performs the looping in an optimised way, rather than through the interpreter. Its main advantages are efficient memory management, especially compared to the highly inefficient vector growing method you present above.
I have a very different experience with replicate, which also confuses me. It often happens that my R session crashes and my laptop hangs when I use replicate rather than for, and this surprises me because, for the reasons mentioned above, I expected a C-written function to outperform the for loop. For example, if you execute the functions below, you'll see that the for loop is faster than replicate:
system.time(for (i in 1:10) runif(1e7))
# user system elapsed
# 3.340 0.218 3.558
system.time(replicate(10, runif(1e7)))
# user system elapsed
# 4.622 0.484 5.109
So with 10 replicates, the for loop is clearly faster. If you repeat it with 100 replicates you get similar results. So I wonder if anyone can come up with an example that shows its practical advantages compared to for.
P.S. I also wrapped runif(1e7) in a function and that made no difference in the comparison. Basically, I failed to come up with any example that shows an advantage of replicate.
Vectorization is the key difference between them. I will try to explain this point. R is a high-level, interpreted computer language. It takes care of many basic computer tasks for you. When you write
x <- 2.0
you don’t have to tell your computer that
“2.0” is a floating-point number;
“x” should store numeric-type data;
it has to find a place in memory to put "2.0";
it has to register “x” as a pointer to a certain place in memory.
R figures these things by itself.
But this convenience comes at a price: it is slower than low-level languages.
In C or FORTRAN, much of this "test if" work would be accomplished during the compilation step, not during program execution. Those programs are translated into binary computer language (0/1) after they are written, BUT before they are run. This allows the compiler to organize the binary machine code in an optimal way for the computer to interpret.
What does this have to do with vectorization in R? Well, many R functions are actually written in a compiled language, such as C, C++, or FORTRAN, and have a small R "wrapper". This is the difference between your two approaches: for loops add further "test if" operations that the machine has to perform on the data, making them slower.
