What part of my code to filter prime numbers causes it to slow down as it processes?

I am doing some problems on Project Euler. This challenge requires filtering prime numbers out of an array. I was halfway to my solution when I noticed that Rust was a bit slow, so I added a progress bar to check the progress.
Here is the code:
extern crate pbr;
use self::pbr::ProgressBar;

pub fn is_prime(i: i32) -> bool {
    for d in 2..i {
        if i % d == 0 {
            return false;
        }
    }
    true
}
pub fn calc_sum_loop(max_num: i32) -> i32 {
    let mut pb = ProgressBar::new(max_num as u64);
    pb.format("[=>_]");
    let mut sum_primes = 0;
    for i in 1..max_num {
        if is_prime(i) {
            sum_primes += i;
        }
        pb.inc();
    }
    sum_primes
}
pub fn solve() {
    println!("About to calculate sum of primes in the first 20000");
    println!("When using a forloop {:?}", calc_sum_loop(400000));
}
I am calling the solve function from my main.rs file. It turns out that the iterations of my for loop run a lot faster at the beginning and a lot slower later on.
➜ euler-rust git:(master) ✗ cargo run --release
Finished release [optimized] target(s) in 0.05s
Running `target/release/euler-rust`
About to calculate sum of primes..
118661 / 400000 [===========>__________________________] 29.67 % 48780.25/s 6s
...
...
400000 / 400000 [=======================================] 100.00 % 23725.24/s
I am sort of drawing a blank on what might be causing this slowdown. It feels like Rust should be much faster than what I am currently seeing. Note that I am telling Cargo to build with the --release flag; I am aware that not doing so would slow things down even further.

The function that is slowing the execution is:
is_prime(i: i32)
It does trial division by every d in 2..i, so the cost of each call grows linearly with i: candidates near the end of the range take far longer to test than the early ones, which is exactly why the progress bar's rate keeps dropping. You might consider using a more efficient crate like primes, or you can check efficient prime number checking algorithms here
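For illustration, a minimal sketch of the usual first optimisation (not from the original answer): any factor larger than sqrt(i) must pair with one smaller than sqrt(i), so trial division only needs to run while d * d <= i. Note that, unlike the original, this version also reports 1 as non-prime.
pub fn is_prime(i: i32) -> bool {
    if i < 2 {
        return false;
    }
    let mut d = 2;
    // checking divisors only up to sqrt(i) turns each call from O(i) into O(sqrt(i))
    while d * d <= i {
        if i % d == 0 {
            return false;
        }
        d += 1;
    }
    true
}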

Related

Basic F# / Rust performance comparison

This is a simplistic performance test based on https://www.youtube.com/watch?v=QlMLB2-G25c, which compares the performance of Rust vs WASM vs Python vs Go.
The original Rust program (from https://github.com/masmullin2000/random-sort-examples) is:
use rand::prelude::*;

fn main() {
    let vec = make_random_vec(1_000_000, 100);
    for _ in 0..250 {
        let mut v = vec.clone();
        // v.sort_unstable();
        v.sort(); // using stable sort as f# sort is a stable sort
    }
}

pub fn make_random_vec(sz: usize, modulus: i64) -> Vec<i64> {
    let mut v: Vec<i64> = Vec::with_capacity(sz);
    for _ in 0..sz {
        let x: i64 = random();
        v.push(x % modulus);
    }
    v
}
So I created the following F# program to compare against Rust:
open System

let rec cls (arr: int64 array) count =
    if count > 0 then
        let v1 = Array.copy arr
        let v2 = Array.sort v1
        cls arr (count - 1)
    else
        ()

let rnd = Random()
let rndArray = Array.init 1000000 (fun _ -> int64 (rnd.Next(100)))
cls rndArray 250 |> ignore
I was expecting F# to be slower (both running on WSL2), but I got the following times on my 8th-gen Core i7 laptop:
Rust - around 17 seconds
Rust (unstable sort) - around 2.7 seconds
F# - around 11 seconds
My questions:
Is the dotnet compiler doing some sort of optimisation that throws away some of the processing because the return values are not being used, resulting in the F# code running faster, or am I doing something wrong? (See the sketch below for one way to check this on the Rust side.)
Does F# have an unstable sort function that I can use to compare against the Rust unstable sort?
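One way to check the dead-code-elimination theory from the first question on the Rust side is to mark each sorted vector as observed with std::hint::black_box (stable in Rust since 1.66). This is a sketch, not part of the original benchmark:
use rand::prelude::*;
use std::hint::black_box;

fn main() {
    // same setup as above: one million i64s reduced modulo 100
    let vec: Vec<i64> = (0..1_000_000).map(|_| random::<i64>() % 100).collect();
    for _ in 0..250 {
        let mut v = vec.clone();
        v.sort();
        // black_box tells the optimizer the result is used,
        // so neither the clone nor the sort can be discarded as dead code
        black_box(&v);
    }
}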

Which Rust RNG should be used for multithreaded sampling?

I am trying to create a function in Rust which will sample from M normal distributions N times. I have the sequential version below, which runs fine. I am trying to parallelize it using Rayon, but am encountering the error
Rc<UnsafeCell<ReseedingRng<rand_chacha::chacha::ChaCha12Core, OsRng>>> cannot be sent between threads safely
It seems my rand::thread_rng does not implement the traits Send and Sync. I tried using StdRng and OsRng, which both do, to no avail, because then I get errors that the variables preds and rng cannot be borrowed as mutable because they are captured in a Fn closure.
This is the working code below. It errors when I change the first into_iter() to into_par_iter().
use rand_distr::{Normal, Distribution};
use std::time::Instant;
use rayon::prelude::*;

fn rprednorm(n: i32, means: Vec<f64>, sds: Vec<f64>) -> Vec<Vec<f64>> {
    let mut rng = rand::thread_rng();
    let mut preds = vec![vec![0.0; n as usize]; means.len()];
    (0..means.len()).into_iter().for_each(|i| {
        (0..n).into_iter().for_each(|j| {
            let normal = Normal::new(means[i], sds[i]).unwrap();
            preds[i][j as usize] = normal.sample(&mut rng);
        })
    });
    preds
}
fn main() {
    let means = vec![0.0; 67000];
    let sds = vec![1.0; 67000];
    let start = Instant::now();
    let preds = rprednorm(100, means, sds);
    let duration = start.elapsed();
    println!("{:?}", duration);
}
Any advice on how to make these two iterators parallel?
Thanks.
It seems my rand::thread_rng does not implement the traits Send and Sync.
Why are you trying to send a thread_rng? The entire point of thread_rng is that it's a per-thread RNG.
then I get errors that the variables preds and rng cannot be borrowed as mutable because they are captured in a Fn closure.
Well yes, you need to Clone the StdRng (or Copy the OsRng) into each closure. As for preds, that can't work for a similar reason: once you parallelise the loop, the compiler no longer knows that every i is distinct, so as far as it's concerned the write accesses through preds[i] could overlap (two iterations running in parallel could try to write to the same place at the same time), which is illegal.
The solution is to use rayon to iterate in parallel over the destination vector:
fn rprednorm(n: i32, means: Vec<f64>, sds: Vec<f64>) -> Vec<Vec<f64>> {
    let mut preds = vec![vec![0.0; n as usize]; means.len()];
    preds.par_iter_mut().enumerate().for_each(|(i, e)| {
        let mut rng = rand::thread_rng();
        (0..n).into_iter().for_each(|j| {
            let normal = Normal::new(means[i], sds[i]).unwrap();
            e[j as usize] = normal.sample(&mut rng);
        })
    });
    preds
}
Alternatively with OsRng, it's just a marker ZST, so you can refer to it as a value:
fn rprednorm(n: i32, means: Vec<f64>, sds: Vec<f64>) -> Vec<Vec<f64>> {
    let mut preds = vec![vec![0.0; n as usize]; means.len()];
    preds.par_iter_mut().enumerate().for_each(|(i, e)| {
        (0..n).into_iter().for_each(|j| {
            let normal = Normal::new(means[i], sds[i]).unwrap();
            e[j as usize] = normal.sample(&mut rand::rngs::OsRng);
        })
    });
    preds
}
StdRng doesn't seem very suitable to this use case: you'd either have to create one per top-level iteration to get different samplings, or initialise a base RNG and then clone it once per spark, in which case they'd all produce the same sequence (as they'd share a seed).
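For completeness, here is a sketch of the first option, one independently seeded StdRng per top-level iteration (assuming rand 0.8, where SeedableRng::from_entropy is available):
use rand::{rngs::StdRng, SeedableRng};
use rand_distr::{Distribution, Normal};
use rayon::prelude::*;

fn rprednorm(n: i32, means: Vec<f64>, sds: Vec<f64>) -> Vec<Vec<f64>> {
    let mut preds = vec![vec![0.0; n as usize]; means.len()];
    preds.par_iter_mut().enumerate().for_each(|(i, e)| {
        // each row gets its own entropy-seeded generator,
        // so no two rows share a sequence
        let mut rng = StdRng::from_entropy();
        let normal = Normal::new(means[i], sds[i]).unwrap();
        for x in e.iter_mut() {
            *x = normal.sample(&mut rng);
        }
    });
    preds
}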

Why is my for loop code slower than an iterator?

I am trying to solve the leetcode problem distribute-candies. It is easy: just find the minimum of the number of candy kinds and half the number of candies.
Here's my solution (48 ms):
use std::collections::HashSet;

pub fn distribute_candies(candies: Vec<i32>) -> i32 {
    let sister_candies = (candies.len() / 2) as i32;
    let mut kind = 0;
    let mut candies_kinds = HashSet::new();
    for candy in candies.into_iter() {
        if candies_kinds.insert(candy) {
            kind += 1;
            if kind > sister_candies {
                return sister_candies;
            }
        }
    }
    kind
}
However, I found a solution using an iterator:
use std::collections::HashSet;
use std::cmp::min;

pub fn distribute_candies(candies: Vec<i32>) -> i32 {
    min(candies.iter().collect::<HashSet<_>>().len(), candies.len() / 2) as i32
}
and it takes 36 ms.
I can't quite understand why the iterator solution is faster than my for loop solution. Are there some magic optimizations that Rust is performing in the background?
The main difference is that the iterator version internally uses Iterator::size_hint to determine how much space to reserve in the HashSet before collecting into it. This prevents repeatedly having to reallocate as the set grows.
You can do the same using HashSet::with_capacity instead of HashSet::new:
let mut candies_kinds = HashSet::with_capacity(candies.len());
In my benchmark this single change makes your code significantly faster than the iterator. However, if I simplify your code to remove the early bailout optimisation, it runs in almost exactly the same time as the iterator version.
pub fn distribute_candies(candies: &[i32]) -> i32 {
    let sister_candies = (candies.len() / 2) as i32;
    let mut candies_kinds = HashSet::with_capacity(candies.len());
    for candy in candies.into_iter() {
        candies_kinds.insert(candy);
    }
    sister_candies.min(candies_kinds.len() as i32)
}
Timings:
test tests::bench_iter ... bench: 262,315 ns/iter (+/- 23,704)
test tests::bench_loop ... bench: 307,697 ns/iter (+/- 16,119)
test tests::bench_loop_with_capacity ... bench: 112,194 ns/iter (+/- 18,295)
test tests::bench_loop_with_capacity_no_bailout ... bench: 259,961 ns/iter (+/- 17,712)
This suggests to me that the HashSet preallocation is the dominant difference. Your additional optimisation also proves to be very effective - at least with the dataset that I happened to choose.
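For reference, bench_loop_with_capacity above is simply the question's function with the preallocation applied while keeping the early bailout; it is reconstructed here from the two snippets above:
use std::collections::HashSet;

pub fn distribute_candies(candies: Vec<i32>) -> i32 {
    let sister_candies = (candies.len() / 2) as i32;
    let mut kind = 0;
    // preallocate for the worst case (all candies distinct) so the set never reallocates
    let mut candies_kinds = HashSet::with_capacity(candies.len());
    for candy in candies.into_iter() {
        if candies_kinds.insert(candy) {
            kind += 1;
            if kind > sister_candies {
                return sister_candies;
            }
        }
    }
    kind
}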

Why is the Rust random number generator slower with multiple instances running?

I am doing some random number generation for my Lotto Simulation and was wondering why it would be MUCH slower when running multiple instances.
I am running this program under Ubuntu 15.04 (Linux kernel 4.2), with rustc 1.7.0-nightly (d5e229057 2016-01-04).
Overall CPU utilization is about 45% during these tests, but each individual process runs its single thread at 100%.
Here is the script I am using to start multiple instances at the same time.
#!/usr/bin/env bash
pkill lotto_sim

for _ in `seq 1 14`; do
    ./lotto_sim 15000000 1>> /var/log/syslog &
done
Output:
Took PT38.701900316S seconds to generate 15000000 random tickets
Took PT39.193917241S seconds to generate 15000000 random tickets
Took PT39.412279484S seconds to generate 15000000 random tickets
Took PT39.492940352S seconds to generate 15000000 random tickets
Took PT39.715433024S seconds to generate 15000000 random tickets
Took PT39.726609237S seconds to generate 15000000 random tickets
Took PT39.884151996S seconds to generate 15000000 random tickets
Took PT40.025874106S seconds to generate 15000000 random tickets
Took PT40.088332517S seconds to generate 15000000 random tickets
Took PT40.112601899S seconds to generate 15000000 random tickets
Took PT40.205958636S seconds to generate 15000000 random tickets
Took PT40.227956170S seconds to generate 15000000 random tickets
Took PT40.393753486S seconds to generate 15000000 random tickets
Took PT40.465173616S seconds to generate 15000000 random tickets
However, a single run gives this output:
$ ./lotto_sim 15000000
Took PT9.860698141S seconds to generate 15000000 random tickets
My understanding is that each process has its own memory and doesn't share anything. Correct?
Here is the relevant code:
extern crate rand;
extern crate itertools;
extern crate time;

use std::env;
use rand::{Rng, Rand};
use itertools::Itertools;
use time::PreciseTime;

struct Ticket {
    whites: Vec<u8>,
    power_ball: u8,
    is_power_play: bool,
}

const POWER_PLAY_PERCENTAGE: u8 = 15;
const WHITE_MIN: u8 = 1;
const WHITE_MAX: u8 = 69;
const POWER_BALL_MIN: u8 = 1;
const POWER_BALL_MAX: u8 = 26;

impl Rand for Ticket {
    fn rand<R: Rng>(rng: &mut R) -> Self {
        let pp_guess = rng.gen_range(0, 100);
        let pp_value = pp_guess < POWER_PLAY_PERCENTAGE;
        let mut whites_vec: Vec<_> = (0..)
            .map(|_| rng.gen_range(WHITE_MIN, WHITE_MAX + 1))
            .unique()
            .take(5)
            .collect();
        whites_vec.sort();
        let pb_value = rng.gen_range(POWER_BALL_MIN, POWER_BALL_MAX + 1);
        Ticket { whites: whites_vec, power_ball: pb_value, is_power_play: pp_value }
    }
}
fn gen_test(num_tickets: i64) {
    let mut rng = rand::thread_rng();
    let _: Vec<_> = rng.gen_iter::<Ticket>()
        .take(num_tickets as usize)
        .collect();
}

fn main() {
    let args: Vec<_> = env::args().collect();
    let num_tickets: i64 = args[1].parse::<i64>().unwrap();
    let start = PreciseTime::now();
    gen_test(num_tickets);
    let end = PreciseTime::now();
    println!("Took {} seconds to generate {} random tickets", start.to(end), num_tickets);
}
Edit:
Maybe a better question would be: how do I debug and figure this out? Where would I look, within the program or within my OS, to find the performance hindrances? I am new to Rust and to lower-level programming like this that relies so heavily on the OS.

Why are these ASCII methods inconsistent?

When I look at the Rust ASCII operations, it feels like there is a consistency issue between
is_lowercase/is_uppercase:
pub fn is_uppercase(&self) -> bool {
    (self.chr - b'A') < 26
}
is_alphabetic:
pub fn is_alphabetic(&self) -> bool {
    (self.chr >= 0x41 && self.chr <= 0x5A) || (self.chr >= 0x61 && self.chr <= 0x7A)
}
Is there a good reason? Are the two methods totally equivalent or am I missing something?
All these functions are marked as stable, so I'm confused.
EDIT:
To make it clearer, what I would expect is to decide on the best implementation (in terms of performance/readability/common practice) for lower/upper, and then have:
pub fn is_alphabetic(&self) -> bool {
    self.is_lowercase() || self.is_uppercase()
}
Since the question changed to be about performance, I'll add a second answer.
To start, I created a clone of the Ascii module (playpen):
pub struct Alpha(u8);

impl Alpha {
    #[inline(never)]
    pub fn is_uppercase_sub(&self) -> bool {
        (self.0 - b'A') < 26
    }

    #[inline(never)]
    pub fn is_uppercase_range(&self) -> bool {
        self.0 >= 0x41 && self.0 <= 0x5A
    }
}

fn main() {
    let yes = Alpha(b'A');
    let no = Alpha(b'a');
    println!("{}, {}", yes.is_uppercase_sub(), yes.is_uppercase_range());
}
In the playpen, make sure that the optimization is set to -O2 and then click IR. This shows the LLVM Intermediate Representation. It's like a higher-level assembly, if you'd like.
There's lots of output, but look for the sections with fastcc. I've removed various bits to make this code clearer, but you can see that the exact same function is called, even though our code calls two different implementations, one with a subtraction and one with a range:
%3 = call fastcc zeroext i1 @_ZN5Alpha16is_uppercase_sub20h63aa0b11479803f4laaE
%5 = call fastcc zeroext i1 @_ZN5Alpha16is_uppercase_sub20h63aa0b11479803f4laaE
The LLVM optimizer can tell that these implementations are the same, so really it's up to the developer's preference. You might be able to get a commit into Rust to make them consistent, if you'd like! ^_^
Asking about is_alphabetic is harder; inlining will come into play here. If LLVM inlines is_upper and is_lower into is_alphabetic, then your suggested change would be better. If it doesn't, then potentially what was 1 function call is now 3! That could be really bad.
These types of questions are a lot harder to answer at this level; one would have to do some looking (and profiling!) at real Rust code in the large to understand how the optimizer behaves with regard to inlining.
They're equivalent. is_alphabetic could be written with byte literals instead of hex codes, making it more readable and matching the other functions:
pub fn is_alphabetic(&self) -> bool {
    (self.chr >= b'A' && self.chr <= b'Z') ||
    (self.chr >= b'a' && self.chr <= b'z')
}
The values in is_alphabetic certainly correspond to the appropriate ASCII values for the letters. You can validate this with:
println!("0x{:x} 0x{:x}", b'A', b'Z');
println!("0x{:x} 0x{:x}", b'a', b'z');
is_alphabetic relies on the fact that the ASCII uppercase letters are contiguous, and likewise the lowercase ones (though the two ranges are not contiguous with each other, unfortunately). It could have been written:
pub fn is_alphabetic(&self) -> bool {
    (self.chr >= b'A' && self.chr <= b'Z') || (self.chr >= b'a' && self.chr <= b'z')
}

// Or

pub fn is_alphabetic(&self) -> bool {
    self.is_upper() || self.is_lower()
}
is_lower and is_upper both rely on unsigned wrap-around to be correct. If a is 0x61 and z is 0x7A, and we subtract a from both, we get 0 and 25. However, for the value one less than a (0x60), the subtraction wraps around to 0xFF, and 0xFF is not < 26, so it fails that check, as desired.
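A small demonstration of that wrap-around (a sketch; note that in current Rust a plain - on u8 panics on underflow in debug builds, so wrapping_sub spells the intent out explicitly):
fn is_lowercase(chr: u8) -> bool {
    chr.wrapping_sub(b'a') < 26
}

fn main() {
    assert!(is_lowercase(b'a'));  // 0x61 - 0x61 == 0
    assert!(is_lowercase(b'z'));  // 0x7A - 0x61 == 25
    assert!(!is_lowercase(0x60)); // one less than 'a' wraps to 0xFF, and 0xFF >= 26
    println!("all checks passed");
}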
