I am writing a parser for some text. I need to support Unicode text, which is why I am using the String::chars iterator:
use std::time::Instant;

fn main() {
    let text = "a".repeat(10000);
    let mut timer1 = 0;
    let mut timer2 = 0;
    let start1 = Instant::now();
    for pos in 1..10000 {
        let start2 = Instant::now();
        let ch = text.chars().nth(pos).unwrap();
        timer2 += start2.elapsed().as_millis();
    }
    timer1 += start1.elapsed().as_millis();
    println!("timer1: {} timer2: {}", timer1, timer2);
}
Example output:
timer1: 4276 timer2: 133
Why is timer2 unbelievably less than timer1, when I believe they should be very close to each other?
P.S. I know already that .nth is slow, and shouldn't be used.
You are running into a resolution problem. The inside of the loop takes (on average) well under one millisecond to execute, so start2.elapsed().as_millis() usually evaluates to 0. To fix this, either do more work inside the loop so each iteration takes measurably long, or switch the resolution from milliseconds to something finer, like microseconds or nanoseconds.
Switching to microseconds yields a more consistent result:
use std::time::Instant;

fn main() {
    let text = "a".repeat(10000);
    let mut timer1 = 0;
    let mut timer2 = 0;
    let start1 = Instant::now();
    for pos in 1..10000 {
        let start2 = Instant::now();
        let ch = text.chars().nth(pos).unwrap();
        timer2 += start2.elapsed().as_micros();
    }
    timer1 += start1.elapsed().as_micros();
    println!("timer1: {} timer2: {}", timer1, timer2);
}
output
timer1: 3511812 timer2: 3499669
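Another option is to accumulate a Duration and convert to a unit only once at the end. Duration keeps full nanosecond precision, so nothing is truncated per iteration. A sketch of that approach:

```rust
use std::time::{Duration, Instant};

fn main() {
    let text = "a".repeat(10000);
    let mut inner = Duration::ZERO;
    let outer_start = Instant::now();
    for pos in 1..10000 {
        let start = Instant::now();
        let _ch = text.chars().nth(pos).unwrap();
        // Duration keeps nanosecond precision, so nothing is lost per iteration
        inner += start.elapsed();
    }
    let outer = outer_start.elapsed();
    // Convert to a unit only once, at the end
    println!("outer: {} µs, inner: {} µs", outer.as_micros(), inner.as_micros());
}
```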
This question was tagged performance, so I'd like to point out that using std::time::Instant is a very tedious way to measure performance. Better ways include criterion.rs, flamegraph and cargo bench.
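As an aside on the original parsing problem: .chars().nth(pos) rescans the string from the beginning on every call, so calling it once per position is quadratic overall. A single pass with char_indices visits each character exactly once; a minimal sketch:

```rust
fn main() {
    let text = "héllo wörld";
    // One pass over the string: each char is decoded exactly once,
    // and the byte index lets you slice back into the original &str.
    for (byte_idx, ch) in text.char_indices() {
        if ch == 'ö' {
            println!("found {:?} at byte offset {}", ch, byte_idx);
            // prints: found 'ö' at byte offset 8
        }
    }
}
```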
I have looked at multiple answers online to the same question but I cannot figure out why my program is so slow. I think it is the for loops but I am unsure.
P.S. I am quite new to Rust and am not very proficient in it yet. Any tips or tricks, or any good coding practices that I am not using are more than welcome :)
math.rs
pub fn number_to_vector(number: i32) -> Vec<i32> {
    let mut numbers: Vec<i32> = Vec::new();
    for i in 1..number + 1 {
        numbers.push(i);
    }
    return numbers;
}
user_input.rs
use std::io;

pub fn get_user_input(prompt: &str) -> i32 {
    println!("{}", prompt);
    let mut user_input: String = String::new();
    io::stdin().read_line(&mut user_input).expect("Failed to read line");
    let number: i32 = user_input.trim().parse().expect("Please enter an integer!");
    return number;
}
main.rs
mod math;
mod user_input;

fn main() {
    let user_input: i32 = user_input::get_user_input("Enter a positive integer: ");
    let mut numbers: Vec<i32> = math::number_to_vector(user_input);
    numbers.remove(numbers.iter().position(|x| *x == 1).unwrap());
    let mut numbers_to_remove: Vec<i32> = Vec::new();
    let ceiling_root: i32 = (user_input as f64).sqrt().ceil() as i32;
    for i in 2..ceiling_root + 1 {
        for j in i..user_input + 1 {
            numbers_to_remove.push(i * j);
        }
    }
    numbers_to_remove.sort_unstable();
    numbers_to_remove.dedup();
    numbers_to_remove.retain(|x| *x <= user_input);
    for number in numbers_to_remove {
        if numbers.iter().any(|&i| i == number) {
            numbers.remove(numbers.iter().position(|x| *x == number).unwrap());
        }
    }
    println!("Prime numbers up to {}: {:?}", user_input, numbers);
}
There are two main problems in your code: the i * j loop has the wrong upper limit for j, and the composites-removal loop uses O(n) operations per entry, making it quadratic overall.
The corrected code:
fn main() {
    let user_input: i32 = get_user_input("Enter a positive integer: ");
    let mut numbers: Vec<i32> = number_to_vector(user_input);
    numbers.remove(numbers.iter().position(|x| *x == 1).unwrap());
    let mut numbers_to_remove: Vec<i32> = Vec::new();
    let mut primes: Vec<i32> = Vec::new(); // new code
    let mut idx = 0; // new code
    let ceiling_root: i32 = (user_input as f64).sqrt().ceil() as i32;
    for i in 2..ceiling_root + 1 {
        for j in i..(user_input / i) + 1 { // FIX #1: user_input/i
            numbers_to_remove.push(i * j);
        }
    }
    numbers_to_remove.sort_unstable();
    numbers_to_remove.dedup();
    // numbers_to_remove.retain(|x| *x <= user_input); // not needed now
    for n in numbers { // FIX #2: two sorted sequences, merged in one linear pass
        if idx < numbers_to_remove.len() && n == numbers_to_remove[idx] {
            idx += 1; // n is the next composite: skip it and advance in unison
        } else {
            primes.push(n);
        }
    }
    println!("Last prime number up to {}: {:?}", user_input, primes.last());
    println!("Total prime numbers up to {}: {}", user_input, primes.len());
}
Your i * j loop was actually O(N^1.5), whereas your numbers-removal loop was actually quadratic -- remove is O(n) because it needs to move all the elements past the removed one back, so there is no gap.
The mended code now runs at ~N^1.05 empirically in the 10^6 .. 2*10^6 range, and is orders of magnitude faster in absolute terms as well.
Oh, and that's a sieve, but not the Sieve of Eratosthenes. To qualify as such, the i's should range over the primes only, not over all numbers.
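For reference, here is one way a textbook Sieve of Eratosthenes can look in Rust (a sketch; `sieve_of_eratosthenes` is just an illustrative name):

```rust
fn sieve_of_eratosthenes(limit: usize) -> Vec<usize> {
    let mut is_prime = vec![true; limit + 1];
    for flag in is_prime.iter_mut().take(2) {
        *flag = false; // 0 and 1 are not prime
    }
    let mut i = 2;
    while i * i <= limit {
        if is_prime[i] {
            // Cross off multiples of primes only, starting at i*i:
            // smaller multiples were already crossed off by smaller primes.
            let mut j = i * i;
            while j <= limit {
                is_prime[j] = false;
                j += i;
            }
        }
        i += 1;
    }
    (2..=limit).filter(|&n| is_prime[n]).collect()
}

fn main() {
    println!("{:?}", sieve_of_eratosthenes(30));
    // prints: [2, 3, 5, 7, 11, 13, 17, 19, 23, 29]
}
```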
As AKX commented, your function's big O is O(m * n); that's why it's slow.
To make this kind of "expensive" calculation run faster, you can use multithreading.
This part of the answer is not about choosing the right algorithm, but about code style (tips/tricks).
I think the idiomatic way to do this is with iterators (which are lazy); it makes the code more readable and simple, and in this case runs about twice as fast.
fn primes_up_to() {
    let num = get_user_input("Enter a positive integer greater than 2: ");
    let primes = (2..=num).filter(is_prime).collect::<Vec<i32>>();
    println!("{:?}", primes);
}

fn is_prime(num: &i32) -> bool {
    let bound = (*num as f32).sqrt() as i32;
    *num == 2 || !(2..=bound).any(|n| num % n == 0)
}
Edit: This style also makes it easy to switch to parallel iterators for "expensive" calculations with rayon.
Edit 2: Algorithm fix; the previous version used a quadratic algorithm. Thanks to @WillNess.
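Without pulling in rayon, the same split-the-range idea can be sketched with the standard library's scoped threads (the chunk sizing below is an illustrative choice, not what rayon does):

```rust
use std::thread;

fn is_prime(num: i32) -> bool {
    let bound = (num as f64).sqrt() as i32;
    num >= 2 && !(2..=bound).any(|n| num % n == 0)
}

fn main() {
    let num: i32 = 50;
    let n_threads = 4;
    let range: Vec<i32> = (2..=num).collect();
    // Ceil-divide so every element lands in exactly one chunk.
    let chunk_size = (range.len() + n_threads - 1) / n_threads;
    let primes: Vec<i32> = thread::scope(|s| {
        let handles: Vec<_> = range
            .chunks(chunk_size)
            .map(|chunk| {
                // Each thread filters its own chunk of the range.
                s.spawn(move || chunk.iter().copied().filter(|&n| is_prime(n)).collect::<Vec<i32>>())
            })
            .collect();
        // Joining in order keeps the concatenated result sorted.
        handles.into_iter().flat_map(|h| h.join().unwrap()).collect()
    });
    println!("{:?}", primes);
    // prints: [2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47]
}
```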
Here's a situation. I'm allocating memory using the following function
let addr = windows::Win32::System::Memory::VirtualAlloc(
    ptr::null_mut(),
    size,
    windows::Win32::System::Memory::MEM_RESERVE | windows::Win32::System::Memory::MEM_COMMIT,
    windows::Win32::System::Memory::PAGE_READWRITE,
);
Upon successful allocation, the resulting memory is cast to *mut u8 and everyone's happy until it's time to deallocate that same space. Here's how I approach it:
let result = System::Memory::VirtualFree(
    ptr as *mut c_void,
    size,
    windows::Win32::System::Memory::MEM_DECOMMIT,
).0;
The Win32 API docs state that upon successful reclamation of memory, VirtualFree returns a non-zero value, but in my case the return value turns out to be zero. I was quite dismayed at first, so I decided to get into the weeds and investigate. I found out that calling GetLastError would give me a more detailed explanation of what I might have done wrong; the value it returned was 0x57, i.e. ERROR_INVALID_PARAMETER.

As this issue has been a primary source of negative emotions for quite a while, I've had a lot of time to experiment with the input values to these precious functions. And here's the thing: the setup I started describing the problem with works perfectly when I'm running tests in release mode, but is completely off the table in debug mode. When I pass 0 as the second argument to VirtualFree and MEM_RELEASE as the third one, it ends up crashing in both modes.

So, how do I escape this nightmare and finally resolve the issue?
UPD
I apologize for the lack of context. So, the problem occurs when I'm running the following test
#[test]
fn stress() {
    let mut rng = rand::thread_rng();
    let seed: u64 = rng.gen();
    let seed = seed % 10000;
    run_stress(seed);
}

fn run_stress(seed: u64) {
    let mut a = Dlmalloc::new();
    println!("++++++++++++++++++++++ seed = {}\n", seed);
    let mut rng = StdRng::seed_from_u64(seed);
    let mut ptrs = Vec::new();
    let max = if cfg!(test_lots) { 1_000_000 } else { 10_000 };
    unsafe {
        for _k in 0..max {
            let free = !ptrs.is_empty()
                && ((ptrs.len() < 10_000 && rng.gen_bool(1f64 / 3f64)) || rng.gen());
            if free {
                let idx = rng.gen_range(0, ptrs.len());
                let (ptr, size, align) = ptrs.swap_remove(idx);
                println!("ptr: {:p}, size = {}", ptr, size);
                a.free(ptr, size, align); // crashes right after the call to this function
                continue;
            }
            if !ptrs.is_empty() && rng.gen_bool(1f64 / 100f64) {
                let idx = rng.gen_range(0, ptrs.len());
                let (ptr, size, align) = ptrs.swap_remove(idx);
                let new_size = if rng.gen() {
                    rng.gen_range(size, size * 2)
                } else if size > 10 {
                    rng.gen_range(size / 2, size)
                } else {
                    continue;
                };
                let mut tmp = Vec::new();
                for i in 0..cmp::min(size, new_size) {
                    tmp.push(*ptr.add(i));
                }
                let ptr = a.realloc(ptr, size, align, new_size);
                assert!(!ptr.is_null());
                for (i, byte) in tmp.iter().enumerate() {
                    assert_eq!(*byte, *ptr.add(i));
                }
                ptrs.push((ptr, new_size, align));
            }
            let size = if rng.gen() {
                rng.gen_range(1, 128)
            } else {
                rng.gen_range(1, 128 * 1024)
            };
            let align = if rng.gen_bool(1f64 / 10f64) {
                1 << rng.gen_range(3, 8)
            } else {
                8
            };
            let zero = rng.gen_bool(1f64 / 50f64);
            let ptr = if zero {
                a.calloc(size, align)
            } else {
                a.malloc(size, align)
            };
            for i in 0..size {
                if zero {
                    assert_eq!(*ptr.add(i), 0);
                }
                *ptr.add(i) = 0xce;
            }
            ptrs.push((ptr, size, align));
        }
    }
}
I should point out that it doesn't crash on a particular iteration -- this number always changes.
This is the excerpt from the dlmalloc-rust crate.
The crate I'm using for interacting with winapi is windows-rs
Here's an implementation of free
pub unsafe fn free(ptr: *mut u8, size: usize) -> bool {
    let result = System::Memory::VirtualFree(
        ptr as *mut c_void,
        0,
        windows::Win32::System::Memory::MEM_RELEASE,
    ).0;
    if result == 0 {
        let cause = windows::Win32::Foundation::GetLastError().0;
        dlverbose!("{}", cause);
    }
    result != 0
}
I have finished converting an application that I made in JavaScript to Rust for increased performance. I am learning to program, and all the application does is work out the multiplicative persistence of any number in a range. It multiplies all digits together to form a new number, then repeats until the number becomes less than 10.
My issue is, my program written in JavaScript is over 5 times faster than the same program in Rust. I must be doing something wrong with converting Strings to ints somewhere; I even tried swapping i128 to i64 and it made little difference.
If I run "cargo run --release" it is still slower!
Please can somebody look through my code to work out if there is any part of it that is causing the issues? Thank you in advance :)
fn multiplicative_persistence(mut user_input: i128) -> i128 {
    let mut steps: i128 = 0;
    let mut numbers: Vec<i128> = Vec::new();
    while user_input > 10 {
        let string_number: String = user_input.to_string();
        let digits: Vec<&str> = string_number.split("").collect();
        let mut sum: i128 = 1;
        let digits_count = digits.len();
        for number in 1..digits_count - 1 {
            sum *= digits[number].parse::<i128>().unwrap();
        }
        numbers.push(sum);
        steps += 1;
        user_input = sum;
    }
    return steps;
}
fn main() {
    // let _user_input: i128 = 277777788888899;
    let mut highest_steps_count: i128 = 0;
    let mut highest_steps_number: i128 = 0;
    let start: i128 = 77551000000;
    let finish: i128 = 1000000000000000;
    for number in start..=finish {
        // println!("{}: {}", number, multiplicative_persistence(number));
        if multiplicative_persistence(number) > highest_steps_count {
            highest_steps_count = multiplicative_persistence(number);
            highest_steps_number = number;
        }
        if number % 1000000 == 0 {
            println!("Upto {} so far: {}", number, highest_steps_number);
        }
    }
    println!("Highest step count: {} at {}", highest_steps_count, highest_steps_number);
}
I do plan to use the numbers variable in the function but I have not learnt enough to know how to properly return it as an associative array.
Maybe the issue is that converting a number to a string and then re-parsing it back into numbers is not that fast, and avoidable. You don't need this intermediate step:
fn step(mut x: i128) -> i128 {
    let mut result = 1;
    while x > 0 {
        result *= x % 10;
        x /= 10;
    }
    result
}

fn multiplicative_persistence(mut user_input: i128) -> i128 {
    let mut steps = 0;
    while user_input > 10 {
        user_input = step(user_input);
        steps += 1;
    }
    steps
}
EDIT: Just out of curiosity, I'd like to know whether the bottleneck is really the string conversion or the rest of the code being somehow wasteful. Here is a version that does not call .split(""), does not allocate the intermediate vector, and allocates the string only once rather than at each step.
#![feature(fmt_internals)]
use std::fmt::{Formatter, Display};

fn multiplicative_persistence(user_input: i128) -> i128 {
    let mut steps = 0;
    let mut current = user_input; // numeric value mirrored by `digits`, so the loop terminates
    let mut digits = user_input.to_string();
    while current > 10 {
        current = digits
            .chars()
            .map(|x| x.to_digit(10).unwrap() as i128) // widen: the digit product overflows u32 for large inputs
            .fold(1, |acc, i| acc * i);
        digits.clear();
        let mut formatter = Formatter::new(&mut digits);
        Display::fmt(&current, &mut formatter).unwrap();
        steps += 1;
    }
    steps
}
I have basically inlined the string conversion that would be performed by .to_string() in order to re-use the already-allocated buffer, instead of re-allocating one each iteration. You can try it out on the playground. Note that you need a nightly compiler because it makes use of an unstable feature.
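If you'd rather stay on stable Rust, write! from std::fmt::Write reuses the String's allocation in much the same way, without the unstable fmt_internals feature. A sketch (the loop tests the digit buffer's length instead of the numeric value):

```rust
use std::fmt::Write;

fn multiplicative_persistence(user_input: i128) -> i128 {
    let mut steps = 0;
    let mut digits = user_input.to_string();
    while digits.len() > 1 {
        // u64 is wide enough here: a 19-digit number's digit product is at most 9^19 < u64::MAX
        let product: u64 = digits
            .chars()
            .map(|c| c.to_digit(10).unwrap() as u64)
            .product();
        digits.clear();
        // write! appends into the existing buffer, reusing its allocation
        write!(digits, "{}", product).unwrap();
        steps += 1;
    }
    steps
}

fn main() {
    println!("{}", multiplicative_persistence(277_777_788_888_899)); // prints: 11
}
```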
#![feature(map_first_last)]
use num_cpus;
use std::collections::BTreeMap;
use ordered_float::OrderedFloat;
use std::sync::{Arc, Mutex};
use std::thread;
use std::time::Instant;

const MONO_FREQ: [f64; 26] = [
    8.55, 1.60, 3.16, 3.87, 12.1, 2.18, 2.09, 4.96, 7.33, 0.22, 0.81, 4.21, 2.53, 7.17, 7.47, 2.07,
    0.10, 6.33, 6.73, 8.94, 2.68, 1.06, 1.83, 0.19, 1.72, 0.11,
];

fn main() {
    let ciphertext: String = "helloworldthisisatest".to_string();
    concurrent(&ciphertext);
    parallel(&ciphertext);
}

fn concurrent(ciphertext: &String) {
    let start = Instant::now();
    for _ in 0..50000 {
        let mut best_fit: f64 = chi_squared(&ciphertext);
        let mut best_key: u8 = 0;
        for i in 1..26 {
            let test_fit = chi_squared(&decrypt(&ciphertext, i));
            if test_fit < best_fit {
                best_key = i;
                best_fit = test_fit;
            }
        }
    }
    let elapsed = start.elapsed();
    println!("Concurrent : {} ms", elapsed.as_millis());
}

fn parallel(ciphertext: &String) {
    let cpus = num_cpus::get() as u8;
    let start = Instant::now();
    for _ in 0..50000 {
        let mut best_result: f64 = chi_squared(&ciphertext);
        for i in (0..26).step_by(cpus.into()) {
            let results = Arc::new(Mutex::new(BTreeMap::new()));
            let mut threads = vec![];
            for ii in i..i + cpus {
                threads.push(thread::spawn({
                    let clone = Arc::clone(&results);
                    let test = OrderedFloat(chi_squared(&decrypt(&ciphertext, ii)));
                    move || {
                        let mut v = clone.lock().unwrap();
                        v.insert(test, ii);
                    }
                }));
            }
            for t in threads {
                t.join().unwrap();
            }
            let lock = Arc::try_unwrap(results).expect("Lock still has multiple owners");
            let hold = lock.into_inner().expect("Mutex cannot be locked");
            if hold.last_key_value().unwrap().0.into_inner() > best_result {
                best_result = hold.last_key_value().unwrap().0.into_inner();
            }
        }
    }
    let elapsed = start.elapsed();
    println!("Parallel : {} ms", elapsed.as_millis());
}

fn decrypt(ciphertext: &String, shift: u8) -> String {
    ciphertext.chars().map(|x| ((x as u8 + shift - 97) % 26 + 97) as char).collect()
}

pub fn chi_squared(text: &str) -> f64 {
    let mut result: f64 = 0.0;
    for (pos, i) in get_letter_counts(text).iter().enumerate() {
        let expected = MONO_FREQ[pos] * text.len() as f64 / 100.0;
        result += (*i as f64 - expected).powf(2.0) / expected;
    }
    return result;
}

fn get_letter_counts(text: &str) -> [u64; 26] {
    let mut results: [u64; 26] = [0; 26];
    for i in text.chars() {
        results[((i as u64) - 97) as usize] += 1;
    }
    return results;
}
Sorry to dump so much code, but I have no idea where the problem is; no matter what I try, the parallel code seems to be around 100x slower.
I think the problem may be in the chi_squared function, as I don't know if it is running in parallel.
I have tried Arc/Mutex, rayon and message passing, and all of them slow it down when they should speed it up. What could I do to make this faster?
Your code calculates the chi_squared function on the main thread; here is the corrected version.
for ii in i..i + cpus {
    let cp = ciphertext.clone();
    let clone = Arc::clone(&results);
    threads.push(thread::spawn(move || {
        let test = OrderedFloat(chi_squared(&decrypt(&cp, ii)));
        let mut v = clone.lock().unwrap();
        v.insert(test, ii);
    }));
}
Note that it does not matter whether chi_squared itself is calculated in parallel: spawning 50000*26 threads, plus the synchronization overhead between them, is what makes up the 100x difference in the first place. Using a thread-pool implementation would reduce the overhead, but the result would still be much slower than the single-threaded version. The only thing you can do is assign work in the outer loop (0..50000); however, I am guessing you are trying to parallelize inside the main loop.
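The outer-loop split suggested above can be sketched with the standard library's scoped threads. Here best_shift is a stand-in for the per-iteration work; the question's decrypt + chi_squared search would slot in there:

```rust
use std::thread;

// Stand-in for the per-iteration work: try all 26 shifts sequentially
// (the question's decrypt + chi_squared search would go here).
fn best_shift(ciphertext: &str) -> u8 {
    (0u8..26)
        .min_by_key(|&shift| {
            ciphertext
                .bytes()
                .map(|b| ((b - b'a' + shift) % 26) as u32)
                .sum::<u32>()
        })
        .unwrap()
}

fn main() {
    let ciphertext = "helloworldthisisatest";
    let iterations = 5_000; // scaled down from the question's 50_000 for this sketch
    let n_threads = 4;
    let per_thread = iterations / n_threads;
    // One thread per chunk of the *outer* loop: thread-spawn cost is paid
    // n_threads times in total instead of once per (iteration, shift) pair.
    thread::scope(|s| {
        for _ in 0..n_threads {
            s.spawn(|| {
                for _ in 0..per_thread {
                    let _ = best_shift(ciphertext);
                }
            });
        }
    });
    println!("done");
}
```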
While writing the A* algorithm, I tried to reverse a singly-linked list of actions and pack it into a Vec.
Here's the structure for my singly-linked list:
use std::rc::Rc;

struct FrontierElem<A> {
    prev: Option<Rc<FrontierElem<A>>>,
    action: A,
}
My first thought was to push actions into Vec then reverse the vector:
fn rev1<A>(fel: &Rc<FrontierElem<A>>) -> Vec<A>
where
    A: Clone,
{
    let mut cur = fel;
    let mut ret = Vec::new();
    while let Some(ref prev) = cur.prev {
        ret.push(cur.action.clone());
        cur = prev;
    } // First action (where cur.prev == None) is ignored by design
    ret.as_mut_slice().reverse();
    ret
}
I didn't find the SliceExt::reverse method at the time, so I proceeded to the second plan: fill the vector from the end to the start. I didn't find a way to do that safely.
/// Copies action fields from the singly-linked list to a vector, in reverse order.
/// `fel` stands for first element
fn rev2<A>(fel: &Rc<FrontierElem<A>>) -> Vec<A>
where
    A: Clone,
{
    let mut cnt = 0usize;
    // First pass. Let's find the length of the list `fel`
    {
        let mut cur = fel;
        while let Some(ref prev) = cur.prev {
            cnt = cnt + 1;
            cur = prev;
        }
    } // Lexical scoping to unborrow `fel`
    // Second pass. Create and fill the `ret` vector
    let mut ret = Vec::<A>::with_capacity(cnt);
    {
        let mut idx = cnt.wrapping_sub(1);
        let mut cur = fel;
        // I didn't find a safe and fast way to populate the vector from the end to the beginning.
        unsafe {
            ret.set_len(cnt); // unsafe: vector values aren't initialized
            while let Some(ref prev) = cur.prev {
                ret[idx] = cur.action.clone();
                // wrapping_sub: after the final element, idx goes from 0 to usize::MAX,
                // which a plain `idx - 1` would flag as overflow in debug builds
                idx = idx.wrapping_sub(1);
                cur = prev;
            }
        }
        assert_eq!(idx, std::usize::MAX);
    } // Lexical scoping to make `fel` usable again
    ret
}
While I was writing this, it occurred to me that I could also implement Iterator for the linked list and then use rev and from_iter to create the vector. Alas, this requires significant overhead, as I must implement the DoubleEndedIterator trait for rev to work.
At this point my question seems trivial, but I post it in hope that it will be of some use.
Benchmark:
running 2 tests
test bench_rev1 ... bench: 1537061 ns/iter (+/- 14466)
test bench_rev2 ... bench: 1556088 ns/iter (+/- 17165)
Fill the vector, then reverse it using .as_mut_slice().reverse().
fn rev1<A>(fel: &Rc<FrontierElem<A>>) -> Vec<A>
where
    A: Clone,
{
    let mut cur = fel;
    let mut ret = Vec::new();
    while let Some(ref prev) = cur.prev {
        ret.push(cur.action.clone());
        cur = prev;
    } // First action (where cur.prev == None) is ignored by design
    ret.as_mut_slice().reverse();
    ret
}
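For completeness, the Iterator route does not actually require DoubleEndedIterator: a plain forward iterator over the list can be collected and then reversed, the same two steps as rev1. A sketch (rev3 and Actions are illustrative names):

```rust
use std::rc::Rc;

struct FrontierElem<A> {
    prev: Option<Rc<FrontierElem<A>>>,
    action: A,
}

// Plain forward iterator over the list. It stops before the first element,
// matching rev1's "first action is ignored by design".
struct Actions<'a, A>(&'a FrontierElem<A>);

impl<'a, A: Clone> Iterator for Actions<'a, A> {
    type Item = A;
    fn next(&mut self) -> Option<A> {
        let cur: &'a FrontierElem<A> = self.0;
        let prev = cur.prev.as_deref()?;
        self.0 = prev;
        Some(cur.action.clone())
    }
}

fn rev3<A: Clone>(fel: &Rc<FrontierElem<A>>) -> Vec<A> {
    let mut v: Vec<A> = Actions(fel).collect();
    v.reverse(); // reverse() is available on Vec directly via Deref to slice
    v
}

fn main() {
    // Build the list 3 -> 2 -> 1, where 3 is the head and 1 the ignored first element.
    let first = Rc::new(FrontierElem { prev: None, action: 1 });
    let mid = Rc::new(FrontierElem { prev: Some(first), action: 2 });
    let head = Rc::new(FrontierElem { prev: Some(mid), action: 3 });
    println!("{:?}", rev3(&head)); // prints: [2, 3]
}
```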