I tried to duplicate the example in this famous question. My code looks like this:
#![feature(test)]
extern crate rand;
extern crate test;
use test::Bencher;
use rand::{thread_rng, Rng};
type ItemType = u8;
type SumType = u64;
const TEST_SIZE: usize = 32_768;
#[bench]
fn bench_train(b: &mut Bencher) {
let numbers = get_random_vec();
b.iter(|| calc_sum(&numbers));
}
#[bench]
fn bench_train_sort(b: &mut Bencher) {
let mut numbers = get_random_vec();
numbers.sort(); // <-- the magic difference
b.iter(|| calc_sum(&numbers));
}
fn get_random_vec() -> Vec<ItemType> {
thread_rng().gen_iter().take(TEST_SIZE).collect()
}
fn calc_sum(numbers: &Vec<ItemType>) -> SumType {
let mut sum = 0;
for &num in numbers {
if num < ItemType::max_value() / 2 {
sum += num.into();
}
}
sum
}
If I benchmark the exact code from above I get reasonable results (like in the linked question):
test bench_train ... bench: 148,611 ns/iter (+/- 8,445)
test bench_train_sort ... bench: 21,064 ns/iter (+/- 1,980)
However, if I change SumType to u8 both versions run equally fast and much faster overall:
test bench_train ... bench: 1,272 ns/iter (+/- 64)
test bench_train_sort ... bench: 1,280 ns/iter (+/- 170)
First of: of course, the sum will overflow all the time, but in release mode the overflow checks of Rust are disabled, so we just calculate a wrong result without panicking. Could this be the reason for the surprisingly short time?
Even stranger: when I change the implementation of calc_sum to something more idiomatic, the results change again. My second implementation:
fn calc_sum(numbers: &Vec<ItemType>) -> SumType {
numbers.iter()
.filter(|&&num| num < ItemType::max_value() / 2)
.fold(0, |acc, &num| acc + (num as SumType))
}
With this implementation the SumType doesn't matter anymore. With u8 as well as with u64 I get these results:
test bench_train ... bench: 144,411 ns/iter (+/- 12,533)
test bench_train_sort ... bench: 16,966 ns/iter (+/- 1,100)
So we again get the numbers we are expecting. So the question is:
What is the reason for the strange running times?
PS: I tested with cargo bench which compiles in release mode.
PPS: I just noticed that in the first implementation of calc_sum I use into() for casting, whereas I use as in the second example. When also using as in the first example, I get more strange numbers. With SumType = u64:
test bench_train ... bench: 39,850 ns/iter (+/- 2,355)
test bench_train_sort ... bench: 39,344 ns/iter (+/- 2,581)
With SumType = u8:
test bench_train ... bench: 1,184 ns/iter (+/- 339)
test bench_train_sort ... bench: 1,239 ns/iter (+/- 85)
I took a quick look at the assembler code, and it appears that if you use SumType = u8 then LLVM generates SSE2 instructions to do vector operations, which is much faster. In theory, LLVM should be able to optimize filter(...).fold(...) to the same code, but in practice it cannot always remove overhead of abstraction. I hope when MIR is added, Rust won't rely on LLVM to do Rust-specific optimizations.
Related
Consider this code.
type Int = i32;
const MAX_NUMBER: Int = 1_000_000;
fn main() {
let result1 = with_new();
let result2 = with_capacity();
assert_eq!(result1, result2)
}
fn with_new() -> Vec<Int> {
let mut result = Vec::new();
for i in 0..MAX_NUMBER {
result.push(i);
}
result
}
fn with_capacity() -> Vec<Int> {
let mut result = Vec::with_capacity(MAX_NUMBER as usize);
for i in 0..MAX_NUMBER {
result.push(i);
}
result
}
Both functions produce the same output. One uses Vec::new, the other uses Vec::with_capacity. For small values of MAX_NUMBER (like in the example), with_capacity is slower than new. Only for larger final vector lengths (e.g. 100 million) the version using with_capacity is as fast as using new.
Flamegraph for 1 million elements
Flamegraph for 100 million elements
It is my understanding that with_capacity should always be faster if the final length is known, because data on the heap is allocated once which should result in a single chunk. In contrast, the version with new grows the vector MAX_NUMBER times, which results in more allocations.
What am I missing?
Edit
The first section was compiled with the debug profile. If I use the release profile with the following settings in Cargo.toml
[package]
name = "vec_test"
version = "0.1.0"
edition = "2021"
[profile.release]
opt-level = 3
debug = 2
I still get the following result for a length of 10 million.
I was not able to reproduce this in a synthetic benchmark:
use criterion::{criterion_group, criterion_main, BenchmarkId, Criterion};
fn with_new(size: i32) -> Vec<i32> {
let mut result = Vec::new();
for i in 0..size {
result.push(i);
}
result
}
fn with_capacity(size: i32) -> Vec<i32> {
let mut result = Vec::with_capacity(size as usize);
for i in 0..size {
result.push(i);
}
result
}
pub fn with_new_benchmark(c: &mut Criterion) {
let mut group = c.benchmark_group("with_new");
for size in [100, 1_000, 10_000, 100_000, 1_000_000, 10_000_000].iter() {
group.bench_with_input(BenchmarkId::from_parameter(size), size, |b, &size| {
b.iter(|| with_new(size));
});
}
group.finish();
}
pub fn with_capacity_benchmark(c: &mut Criterion) {
let mut group = c.benchmark_group("with_capacity");
for size in [100, 1_000, 10_000, 100_000, 1_000_000, 10_000_000].iter() {
group.bench_with_input(BenchmarkId::from_parameter(size), size, |b, &size| {
b.iter(|| with_capacity(size));
});
}
group.finish();
}
criterion_group!(benches, with_new_benchmark, with_capacity_benchmark);
criterion_main!(benches);
Here's the output (with outliers and other benchmarking stuff removed):
with_new/100 time: [331.17 ns 331.38 ns 331.61 ns]
with_new/1000 time: [1.1719 us 1.1731 us 1.1745 us]
with_new/10000 time: [8.6784 us 8.6840 us 8.6899 us]
with_new/100000 time: [77.524 us 77.596 us 77.680 us]
with_new/1000000 time: [1.6966 ms 1.6978 ms 1.6990 ms]
with_new/10000000 time: [22.063 ms 22.084 ms 22.105 ms]
with_capacity/100 time: [76.796 ns 76.859 ns 76.926 ns]
with_capacity/1000 time: [497.90 ns 498.14 ns 498.39 ns]
with_capacity/10000 time: [5.0058 us 5.0084 us 5.0112 us]
with_capacity/100000 time: [50.358 us 50.414 us 50.470 us]
with_capacity/1000000 time: [1.0861 ms 1.0868 ms 1.0876 ms]
with_capacity/10000000 time: [10.644 ms 10.656 ms 10.668 ms]
The with_capacity runs were consistently faster than with_new. The closest ones being in the 10,000- to 1,000,000-element runs where with_capacity still only took ~60% of the time, while others had it taking half the time or less.
It crossed my mind that there could be some strange const-propagation behavior going on, but even with individual functions with hard-coded sizes (playground for brevity), the behavior didn't significantly change:
with_new/100 time: [313.87 ns 314.22 ns 314.56 ns]
with_new/1000 time: [1.1498 us 1.1505 us 1.1511 us]
with_new/10000 time: [7.9062 us 7.9095 us 7.9130 us]
with_new/100000 time: [77.925 us 77.990 us 78.057 us]
with_new/1000000 time: [1.5675 ms 1.5683 ms 1.5691 ms]
with_new/10000000 time: [20.956 ms 20.990 ms 21.023 ms]
with_capacity/100 time: [76.943 ns 76.999 ns 77.064 ns]
with_capacity/1000 time: [535.00 ns 536.22 ns 537.21 ns]
with_capacity/10000 time: [5.1122 us 5.1150 us 5.1181 us]
with_capacity/100000 time: [50.064 us 50.092 us 50.122 us]
with_capacity/1000000 time: [1.0768 ms 1.0776 ms 1.0784 ms]
with_capacity/10000000 time: [10.600 ms 10.613 ms 10.628 ms]
Your testing code only calls each strategy once, so its conceivable that your testing environment is changed after calling the first one (a potential culprit being heap fragmentation as suggested by #trent_formerly_cl in the comments, though there could be others: cpu boosting/throttling, spatial and/or temporal cache locality, OS behavior, etc.). A benchmarking framework like criterion helps avoid a lot of these problems by iterating each test multiple times (including warmup iterations).
I am trying to solve the leetcode problem distribute-candies. It is easy, just find out the minimum between the candies' kinds and candies half number.
Here's my solution (cost 48ms):
use std::collections::HashSet;
pub fn distribute_candies(candies: Vec<i32>) -> i32 {
let sister_candies = (candies.len() / 2) as i32;
let mut kind = 0;
let mut candies_kinds = HashSet::new();
for candy in candies.into_iter() {
if candies_kinds.insert(candy) {
kind += 1;
if kind > sister_candies {
return sister_candies;
}
}
}
kind
}
However, I found a solution using an iterator:
use std::collections::HashSet;
use std::cmp::min;
pub fn distribute_candies(candies: Vec<i32>) -> i32 {
min(candies.iter().collect::<HashSet<_>>().len(), candies.len() / 2) as i32
}
and it costs 36ms.
I can't quite understand why the iterator solution is faster than my for loop solution. Are there some magic optimizations that Rust is performing in the background?
The main difference is that the iterator version internally uses Iterator::size_hint to determine how much space to reserve in the HashSet before collecting into it. This prevents repeatedly having to reallocate as the set grows.
You can do the same using HashSet::with_capacity instead of HashSet::new:
let mut candies_kinds = HashSet::with_capacity(candies.len());
In my benchmark this single change makes your code significantly faster than the iterator. However, if I simplify your code to remove the early bailout optimisation, it runs in almost exactly the same time as the iterator version.
pub fn distribute_candies(candies: &[i32]) -> i32 {
let sister_candies = (candies.len() / 2) as i32;
let mut candies_kinds = HashSet::with_capacity(candies.len());
for candy in candies.into_iter() {
candies_kinds.insert(candy);
}
sister_candies.min(candies_kinds.len() as i32)
}
Timings:
test tests::bench_iter ... bench: 262,315 ns/iter (+/- 23,704)
test tests::bench_loop ... bench: 307,697 ns/iter (+/- 16,119)
test tests::bench_loop_with_capacity ... bench: 112,194 ns/iter (+/- 18,295)
test tests::bench_loop_with_capacity_no_bailout ... bench: 259,961 ns/iter (+/- 17,712)
This suggests to me that the HashSet preallocation is the dominant difference. Your additional optimisation also proves to be very effective - at least with the dataset that I happened to choose.
I am doing some problems on Project Euler. This challenge requires filtering prime numbers from an array. I was halfway near my solution when I found out that Rust was a bit slow. I added a progressbar to check the progress.
Here is the code:
extern crate pbr;
use self::pbr::ProgressBar;
pub fn is_prime(i: i32) -> bool {
for d in 2..i {
if i % d == 0 {
return false;
}
}
true
}
pub fn calc_sum_loop(max_num: i32) -> i32 {
let mut pb = ProgressBar::new(max_num as u64);
pb.format("[=>_]");
let mut sum_primes = 0;
for i in 1..max_num {
if is_prime(i) {
sum_primes += i;
}
pb.inc();
}
sum_primes
}
pub fn solve() {
println!("About to calculate sum of primes in the first 20000");
println!("When using a forloop {:?}", calc_sum_loop(400000));
}
I am calling the solve function from my main.rs file. Turns out that the number of iterations in my for loop is a lot faster in the beginning and a lot slower later.
➜ euler-rust git:(master) ✗ cargo run --release
Finished release [optimized] target(s) in 0.05s
Running `target/release/euler-rust`
About to calculate sum of primes..
118661 / 400000 [===========>__________________________] 29.67 % 48780.25/s 6s
...
...
400000 / 400000 [=======================================] 100.00 % 23725.24/s
I am sort of drawing a blank in what might be causing this slowdown. It feels like Rust should be able to be much faster than what I am currently seeing. Note that I am telling Cargo to build with the --release flag. I am aware that not doing this might slow things down even further.
The function which is slowing the execution is:
is_prime(i: i32)
You may consider to use more efficient crate like primes or you can check efficient prime number checking algorithms here
Reading the Zero-cost abstractions and looking at Introduction to rust: a low-level language with high-level abstractions I tried to compare two approaches to computing the dot product of a vector: one using a for loop and one using iterators.
#![feature(test)]
extern crate rand;
extern crate test;
use std::cmp::min;
fn dot_product_1(x: &[f64], y: &[f64]) -> f64 {
let mut result: f64 = 0.0;
for i in 0..min(x.len(), y.len()) {
result += x[i] * y[i];
}
return result;
}
fn dot_product_2(x: &[f64], y: &[f64]) -> f64 {
x.iter().zip(y).map(|(&a, &b)| a * b).sum::<f64>()
}
#[cfg(test)]
mod bench {
use test::Bencher;
use rand::{Rng,thread_rng};
use super::*;
const LEN: usize = 30;
#[test]
fn test_1() {
let x = [1.0, 2.0, 3.0];
let y = [2.0, 4.0, 6.0];
let result = dot_product_1(&x, &y);
assert_eq!(result, 28.0);
}
#[test]
fn test_2() {
let x = [1.0, 2.0, 3.0];
let y = [2.0, 4.0, 6.0];
let result = dot_product_2(&x, &y);
assert_eq!(result, 28.0);
}
fn rand_array(cnt: u32) -> Vec<f64> {
let mut rng = thread_rng();
(0..cnt).map(|_| rng.gen::<f64>()).collect()
}
#[bench]
fn bench_small_1(b: &mut Bencher) {
let samples = rand_array(2*LEN as u32);
b.iter(|| {
dot_product_1(&samples[0..LEN], &samples[LEN..2*LEN])
})
}
#[bench]
fn bench_small_2(b: &mut Bencher) {
let samples = rand_array(2*LEN as u32);
b.iter(|| {
dot_product_2(&samples[0..LEN], &samples[LEN..2*LEN])
})
}
}
The later of the links above claims that the version with the iterators should have similar performance "and actually be a little bit faster". However, when benchmarking the two, I get very different results:
running 2 tests
test bench::bench_small_loop ... bench: 20 ns/iter (+/- 1)
test bench::bench_small_iter ... bench: 24 ns/iter (+/- 2)
test result: ok. 0 passed; 0 failed; 0 ignored; 2 measured; 0 filtered out
So, where did the "zero-cost abstraction" go?
Update: Adding the foldr example provided by #wimh and using split_at instead of slices give the following result.
running 3 tests
test bench::bench_small_fold ... bench: 18 ns/iter (+/- 1)
test bench::bench_small_iter ... bench: 21 ns/iter (+/- 1)
test bench::bench_small_loop ... bench: 24 ns/iter (+/- 1)
test result: ok. 0 passed; 0 failed; 0 ignored; 3 measured; 0 filtered out
So it seems that the additional time comes directly or indirectly from constructing the slices inside the measured code. To check that it indeed was the case, I tried the following two approaches with the same result (here shown for foldr case and using map + sum):
#[bench]
fn bench_small_iter(b: &mut Bencher) {
let samples = rand_array(2 * LEN);
let s0 = &samples[0..LEN];
let s1 = &samples[LEN..2 * LEN];
b.iter(|| dot_product_iter(s0, s1))
}
#[bench]
fn bench_small_fold(b: &mut Bencher) {
let samples = rand_array(2 * LEN);
let (s0, s1) = samples.split_at(LEN);
b.iter(|| dot_product_fold(s0, s1))
}
It seems zero cost to me. I wrote your code slightly more idiomatically, using the same random values for both tests, and then tested multiple times:
fn dot_product_1(x: &[f64], y: &[f64]) -> f64 {
let mut result: f64 = 0.0;
for i in 0..min(x.len(), y.len()) {
result += x[i] * y[i];
}
result
}
fn dot_product_2(x: &[f64], y: &[f64]) -> f64 {
x.iter().zip(y).map(|(&a, &b)| a * b).sum()
}
fn rand_array(cnt: usize) -> Vec<f64> {
let mut rng = rand::rngs::StdRng::seed_from_u64(42);
rng.sample_iter(&rand::distributions::Standard).take(cnt).collect()
}
#[bench]
fn bench_small_1(b: &mut Bencher) {
let samples = rand_array(2 * LEN);
let (s0, s1) = samples.split_at(LEN);
b.iter(|| dot_product_1(s0, s1))
}
#[bench]
fn bench_small_2(b: &mut Bencher) {
let samples = rand_array(2 * LEN);
let (s0, s1) = samples.split_at(LEN);
b.iter(|| dot_product_2(s0, s1))
}
bench_small_1 20 ns/iter (+/- 6)
bench_small_2 17 ns/iter (+/- 1)
bench_small_1 19 ns/iter (+/- 3)
bench_small_2 17 ns/iter (+/- 2)
bench_small_1 19 ns/iter (+/- 5)
bench_small_2 17 ns/iter (+/- 3)
bench_small_1 19 ns/iter (+/- 3)
bench_small_2 24 ns/iter (+/- 7)
bench_small_1 28 ns/iter (+/- 1)
bench_small_2 24 ns/iter (+/- 1)
bench_small_1 27 ns/iter (+/- 1)
bench_small_2 25 ns/iter (+/- 1)
bench_small_1 28 ns/iter (+/- 1)
bench_small_2 25 ns/iter (+/- 1)
bench_small_1 28 ns/iter (+/- 1)
bench_small_2 25 ns/iter (+/- 1)
bench_small_1 28 ns/iter (+/- 0)
bench_small_2 25 ns/iter (+/- 1)
bench_small_1 28 ns/iter (+/- 1)
bench_small_2 17 ns/iter (+/- 1)
In 9 of the 10 runs, the idiomatic code was faster than the for loop. This was done on 2.9 GHz Core i9 (I9-8950HK) with 32 GB RAM, compiled with rustc 1.31.0-nightly (fc403ad98 2018-09-30).
For fun, I rewrote the benchmark to use criterion, a port of the Haskell benchmarking library.
Cargo.toml
[package]
name = "mats-zero-cost-rust"
version = "0.1.0"
authors = ["mats"]
[dev-dependencies]
criterion = "0.2"
rand = "0.4"
[[bench]]
name = "my_benchmark"
harness = false
benches/my_benchmark.rs
#[macro_use]
extern crate criterion;
extern crate rand;
use std::cmp::min;
use criterion::Criterion;
use rand::{thread_rng, Rng};
const LEN: usize = 30;
fn dot_product_loop(x: &[f64], y: &[f64]) -> f64 {
let mut result: f64 = 0.0;
for i in 0..min(x.len(), y.len()) {
result += x[i] * y[i];
}
return result;
}
fn dot_product_iter(x: &[f64], y: &[f64]) -> f64 {
x.iter().zip(y).map(|(&a, &b)| a * b).sum()
}
fn rand_array(cnt: u32) -> Vec<f64> {
let mut rng = thread_rng();
(0..cnt).map(|_| rng.gen()).collect()
}
fn criterion_loop_with_slice(c: &mut Criterion) {
c.bench_function("loop with slice", |b| {
let samples = rand_array(2 * LEN as u32);
b.iter(|| dot_product_loop(&samples[0..LEN], &samples[LEN..2 * LEN]))
});
}
fn criterion_loop_without_slice(c: &mut Criterion) {
c.bench_function("loop without slice", |b| {
let samples = rand_array(2 * LEN as u32);
let (s0, s1) = samples.split_at(LEN);
b.iter(|| dot_product_loop(s0, s1))
});
}
fn criterion_iter_with_slice(c: &mut Criterion) {
c.bench_function("iterators with slice", |b| {
let samples = rand_array(2 * LEN as u32);
b.iter(|| dot_product_iter(&samples[0..LEN], &samples[LEN..2 * LEN]))
});
}
fn criterion_iter_without_slice(c: &mut Criterion) {
c.bench_function("iterators without slice", |b| {
let samples = rand_array(2 * LEN as u32);
let (s0, s1) = samples.split_at(LEN);
b.iter(|| dot_product_iter(s0, s1))
});
}
criterion_group!(benches, criterion_loop_with_slice, criterion_loop_without_slice, criterion_iter_with_slice, criterion_iter_without_slice);
criterion_main!(benches);
I observe these results;
kolmodin#blin:~/code/mats-zero-cost-rust$ cargo bench
Compiling mats-zero-cost-rust v0.1.0 (/home/kolmodin/code/mats-zero-cost-rust)
Finished release [optimized] target(s) in 1.16s
Running target/release/deps/my_benchmark-6f00e042fd40bc1d
Gnuplot not found, disabling plotting
loop with slice time: [7.5794 ns 7.6843 ns 7.8016 ns]
Found 14 outliers among 100 measurements (14.00%)
1 (1.00%) high mild
13 (13.00%) high severe
loop without slice time: [24.384 ns 24.486 ns 24.589 ns]
Found 3 outliers among 100 measurements (3.00%)
2 (2.00%) low severe
1 (1.00%) low mild
iterators with slice time: [13.842 ns 13.852 ns 13.863 ns]
Found 6 outliers among 100 measurements (6.00%)
1 (1.00%) low mild
4 (4.00%) high mild
1 (1.00%) high severe
iterators without slice time: [13.420 ns 13.424 ns 13.430 ns]
Found 12 outliers among 100 measurements (12.00%)
1 (1.00%) low mild
1 (1.00%) high mild
10 (10.00%) high severe
Gnuplot not found, disabling plotting
Using rustc 1.30.0 (da5f414c2 2018-10-24) on an AMD Ryzen 7 2700X.
The iterator implementation gets similar results for using slice and split_at.
The loop implementation gets very different results. The version with slice is significantly faster.
I am doing some random number generation for my Lotto Simulation and was wondering why would it be MUCH slower when running multiple instances?
I am running this program under Ubuntu 15.04 (linux kernel 4.2). rustc 1.7.0-nightly (d5e229057 2016-01-04)
Overall CPU utilization is about 45% during these tests but each individual thread is taking up 100% of that thread.
Here is my script I am using to start multiple instances at the same time.
#!/usr/bin/env bash
pkill lotto_sim
for _ in `seq 1 14`;
do
./lotto_sim 15000000 1>> /var/log/syslog &
done
Output:
Took PT38.701900316S seconds to generate 15000000 random tickets
Took PT39.193917241S seconds to generate 15000000 random tickets
Took PT39.412279484S seconds to generate 15000000 random tickets
Took PT39.492940352S seconds to generate 15000000 random tickets
Took PT39.715433024S seconds to generate 15000000 random tickets
Took PT39.726609237S seconds to generate 15000000 random tickets
Took PT39.884151996S seconds to generate 15000000 random tickets
Took PT40.025874106S seconds to generate 15000000 random tickets
Took PT40.088332517S seconds to generate 15000000 random tickets
Took PT40.112601899S seconds to generate 15000000 random tickets
Took PT40.205958636S seconds to generate 15000000 random tickets
Took PT40.227956170S seconds to generate 15000000 random tickets
Took PT40.393753486S seconds to generate 15000000 random tickets
Took PT40.465173616S seconds to generate 15000000 random tickets
However, a single run gives this output:
$ ./lotto_sim 15000000
Took PT9.860698141S seconds to generate 15000000 random tickets
My understanding is that each process has it's own memory and doesn't share anything. Correct?
Here is the relevant code:
extern crate rand;
extern crate itertools;
extern crate time;
use std::env;
use rand::{Rng, Rand};
use itertools::Itertools;
use time::PreciseTime;
struct Ticket {
whites: Vec<u8>,
power_ball: u8,
is_power_play: bool,
}
const POWER_PLAY_PERCENTAGE: u8 = 15;
const WHITE_MIN: u8 = 1;
const WHITE_MAX: u8 = 69;
const POWER_BALL_MIN: u8 = 1;
const POWER_BALL_MAX: u8 = 26;
impl Rand for Ticket {
fn rand<R: Rng>(rng: &mut R) -> Self {
let pp_guess = rng.gen_range(0, 100);
let pp_value = pp_guess < POWER_PLAY_PERCENTAGE;
let mut whites_vec: Vec<_> = (0..).map(|_| rng.gen_range(WHITE_MIN, WHITE_MAX + 1))
.unique().take(5).collect();
whites_vec.sort();
let pb_value = rng.gen_range(POWER_BALL_MIN, POWER_BALL_MAX + 1);
Ticket { whites: whites_vec, power_ball: pb_value, is_power_play: pp_value}
}
}
fn gen_test(num_tickets: i64) {
let mut rng = rand::thread_rng();
let _: Vec<_> = rng.gen_iter::<Ticket>()
.take(num_tickets as usize)
.collect();
}
fn main() {
let args: Vec<_> = env::args().collect();
let num_tickets: i64 = args[1].parse::<i64>().unwrap();
let start = PreciseTime::now();
gen_test(num_tickets);
let end = PreciseTime::now();
println!("Took {} seconds to generate {} random tickets", start.to(end), num_tickets);
}
Edit:
Maybe a better question would be how do I debug and figure this out? Where would I look within the program or within my OS to find the performance hindrances? I am new to Rust and lower level programming like this that relies so heavily on the OS.