Why does rayon-based parallel processing take more time than serial processing?

While learning Rayon, I wanted to compare the performance of parallel and serial calculation of the Fibonacci series. Here's my code:
use rayon;
use std::time::Instant;

fn main() {
    let nth = 30;

    let now = Instant::now();
    let fib = fibonacci_serial(nth);
    println!(
        "[s] The {}th number in the fibonacci sequence is {}, elapsed: {}",
        nth,
        fib,
        now.elapsed().as_micros()
    );

    let now = Instant::now();
    let fib = fibonacci_parallel(nth);
    println!(
        "[p] The {}th number in the fibonacci sequence is {}, elapsed: {}",
        nth,
        fib,
        now.elapsed().as_micros()
    );
}

fn fibonacci_parallel(n: u64) -> u64 {
    if n <= 1 {
        return n;
    }
    let (a, b) = rayon::join(|| fibonacci_parallel(n - 2), || fibonacci_parallel(n - 1));
    a + b
}

fn fibonacci_serial(n: u64) -> u64 {
    if n <= 1 {
        return n;
    }
    fibonacci_serial(n - 2) + fibonacci_serial(n - 1)
}
Run in Rust Playground
I expected the elapsed time of the parallel calculation to be smaller than that of the serial calculation, but the result was the opposite:
# `s` stands for serial calculation and `p` for parallel
[s] The 30th number in the fibonacci sequence is 832040, elapsed: 12127
[p] The 30th number in the fibonacci sequence is 832040, elapsed: 990379
My implementation of the serial/parallel calculation may well have flaws. But if not, why am I seeing these results?

I think the real reason is that you create an enormous number of parallel tasks, which is not good. Every call of fibonacci_parallel issues another rayon::join, and because the two closures call fibonacci_parallel again, each of those issues yet another pair of tasks, so the number of tiny work items grows exponentially with n.
This is utterly terrible for rayon's scheduler.
An approach to solve this problem could be this:
fn fibonacci_parallel(n: u64) -> u64 {
    fn inner(n: u64) -> u64 {
        if n <= 1 {
            return n;
        }
        inner(n - 2) + inner(n - 1)
    }

    if n <= 1 {
        return n;
    }
    let (a, b) = rayon::join(|| inner(n - 2), || inner(n - 1));
    a + b
}
Now only two parallel tasks are created, one per closure, and each just runs the serial inner function. With this change I get
op#VBOX /t/t/foo> cargo run --release 40
Finished release [optimized] target(s) in 0.03s
Running `target/release/foo 40`
[s] The 40th number in the fibonacci sequence is 102334155, elapsed: 1373741
[p] The 40th number in the fibonacci sequence is 102334155, elapsed: 847343
But as said, for low numbers parallel execution is not worth it:
op#VBOX /t/t/foo> cargo run --release 20
Finished release [optimized] target(s) in 0.02s
Running `target/release/foo 20`
[s] The 10th number in the fibonacci sequence is 6765, elapsed: 82
[p] The 10th number in the fibonacci sequence is 6765, elapsed: 241
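A common middle ground, not part of the answer above but worth sketching: keep using rayon::join near the top of the recursion tree and fall back to the serial function below a cutoff, so every spawned task does a meaningful amount of work. The CUTOFF value here is a made-up tuning knob, and the sketch reuses fibonacci_serial from the question:
fn fibonacci_cutoff(n: u64) -> u64 {
    // Below the cutoff the per-task overhead outweighs the work, so run serially.
    const CUTOFF: u64 = 20; // assumed value; tune for your machine
    if n <= CUTOFF {
        return fibonacci_serial(n);
    }
    let (a, b) = rayon::join(|| fibonacci_cutoff(n - 2), || fibonacci_cutoff(n - 1));
    a + b
}
This keeps the number of rayon tasks bounded by the top of the recursion tree instead of growing exponentially with n.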

Related

Rust sorting uses surprisingly few comparisons

I am currently learning Rust (using the Rust book), and one page mentions counting the number of times the sorting key is used while sorting an array. I modified the code in order to count this for arbitrary sizes, and here is the code:
fn main() {
    const MAX: i32 = 10000;
    for n in 1..MAX {
        let mut v: Vec<i32> = (1..n).collect();
        let mut ops = 0;
        v.sort_by(|x, y| {
            ops += 1;
            x.cmp(y)
        });
        if n - 2 >= 0 {
            assert_eq!(n - 2, ops);
        }
        // println!("A list of {n} elements is sorted in {ops} operations");
    }
}
However, it seems that in order to sort a vector of n elements, Rust only needs n-2 comparisons (the code above runs without panicking).
How can this be possible? Aren't sorts supposed to be in O(n*log(n))?
Is it because Rust somehow "noticed" that my input vector was already sorted?
Even in that case, how can a vector of length 2 be sorted without any comparisons? Shouldn't it be at least n-1?
The biggest misconception you have, I think, is this:
fn main() {
    const SIZE: i32 = 5;
    let v: Vec<i32> = (1..SIZE).collect();
    println!("{}", v.len());
}
4
The range 1..SIZE does not include SIZE and contains SIZE-1 elements.
Further, it will already be sorted, so it's as simple as iterating through it once.
See here:
fn main() {
    const SIZE: i32 = 5;
    let mut v: Vec<i32> = (1..SIZE).collect();
    let mut ops = 0;
    v.sort_by(|x, y| {
        ops += 1;
        let result = x.cmp(y);
        println!(" - cmp {} vs {} => {:?}", x, y, result);
        result
    });
    println!("Total comparisons: {}", ops);
}
- cmp 4 vs 3 => Greater
- cmp 3 vs 2 => Greater
- cmp 2 vs 1 => Greater
Total comparisons: 3
it seems that in order to sort a vector of n elements, Rust only needs n-2 comparisons
That is incorrect. In order to sort an already sorted vector (which yours are), Rust needs n-1 comparisons. It doesn't detect that; it's just an inherent property of the mergesort implementation that Rust uses.
If it isn't already sorted, it will be more:
fn main() {
    let mut v: Vec<i32> = vec![2, 4, 1, 3];
    let mut ops = 0;
    v.sort_by(|x, y| {
        ops += 1;
        let result = x.cmp(y);
        println!(" - cmp {} vs {} => {:?}", x, y, result);
        result
    });
    println!("Total comparisons: {}", ops);
}
- cmp 3 vs 1 => Greater
- cmp 1 vs 4 => Less
- cmp 3 vs 4 => Less
- cmp 1 vs 2 => Less
- cmp 3 vs 2 => Greater
Total comparisons: 5
FYI sort_by:
pub fn sort_by<F>(&mut self, mut compare: F)
where
    F: FnMut(&T, &T) -> Ordering,
{
    merge_sort(self, |a, b| compare(a, b) == Less);
}
and it actually invokes merge_sort:
/// This merge sort borrows some (but not all) ideas from TimSort, which is described in detail
/// [here](https://github.com/python/cpython/blob/main/Objects/listsort.txt).
///
/// The algorithm identifies strictly descending and non-descending subsequences, which are called
/// natural runs. There is a stack of pending runs yet to be merged. Each newly found run is pushed
/// onto the stack, and then some pairs of adjacent runs are merged until these two invariants are
/// satisfied:
///
/// 1. for every `i` in `1..runs.len()`: `runs[i - 1].len > runs[i].len`
/// 2. for every `i` in `2..runs.len()`: `runs[i - 2].len > runs[i - 1].len + runs[i].len`
///
/// The invariants ensure that the total running time is *O*(*n* \* log(*n*)) worst-case.
#[cfg(not(no_global_oom_handling))]
fn merge_sort<T, F>(v: &mut [T], mut is_less: F)
how can a vector of length 2 be sorted without any comparisons? Shouldn't it at least be n-1?
(1..2) yields a single element (it starts at 1 and stops before 2). So when n == 2 in your code, note that the vector actually has length one.
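As a quick check (my own snippet, not from the original answer), collecting 1..2 and sorting it shows that the comparator is never called:
fn main() {
    // This mirrors the questioner's loop at n == 2.
    let mut v: Vec<i32> = (1..2).collect();
    let mut ops = 0;
    v.sort_by(|x, y| {
        ops += 1;
        x.cmp(y)
    });
    println!("len = {}, comparisons = {}", v.len(), ops); // len = 1, comparisons = 0
}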
Let me demonstrate what actually happens inside merge_sort when the input slice has length 2 or less.
// MAX_INSERTION: 20
if len <= MAX_INSERTION {
    // If len is less than 2, the `is_less` closure is never called,
    // so your counting closure never sees a comparison.
    if len >= 2 {
        for i in (0..len - 1).rev() {
            insert_head(&mut v[i..], &mut is_less); // <- go into `insert_head`.
        }
    }
    return;
}
fn insert_head<T, F>(v: &mut [T], is_less: &mut F)
where
    F: FnMut(&T, &T) -> bool,
{
    if v.len() >= 2 && is_less(&v[1], &v[0]) // <- here it uses the closure to make comparison.
So if your input length is at most 20 (MAX_INSERTION), the slice gets sorted in place via insertion sort to avoid allocations.

Faster HashMap for sequential keys

Initially I was very surprised to find out that Rust's HashMap, even with the FNV hasher, was considerably slower than the equivalents in Java, .NET, and PHP. I am talking about optimized Release mode, not Debug mode. I did some calculations and realized the timings in Java/.NET/PHP were suspiciously low. Then it hit me: even though I was testing with a big hash table (millions of entries), I was reading mostly sequential key values (like 14, 15, 16, ...), which apparently resulted in lots of CPU cache hits, due to the way the standard hash tables (and hash-code functions for integers and short strings) in those languages are implemented, so that entries with nearby keys usually end up in nearby memory locations.
Rust's HashMap, on the other hand, uses the so-called SwissTable implementation, which apparently distributes values differently. When I tested reading by random keys, everything fell into place: the "competitors" scored behind Rust.
So if we are in a situation where we need to perform lots of gets sequentially, for example iterating some DB IDs that are ordered and mostly sequential (with not too many gaps), is there a good Rust hash map implementation that can compete with Java's HashMap or .NET's Dictionary?
P.S. As requested in the comments, I'm pasting an example here. I ran lots of tests, but here is a simple example that takes 75 ms in Rust (release mode) and 20 ms in Java:
In Rust:
let hm: FnvHashMap<i32, i32> = ...;
// Start timer here
let mut sum: i64 = 0;
for i in 0..1_000_000 {
    if let Some(x) = hm.get(&i) {
        sum += *x as i64;
    }
}
println!("The sum is: {}", sum);
In Java:
Map<Integer, Integer> hm = ...;
// Start timer here
long sum = 0;
for (int i = 0; i < 1_000_000; i++) {
    sum += hm.get(i);
}
With HashMap<i32, i32> and its default SipHash hasher it took 190 ms. I know why it's slower than FnvHashMap. I'm just mentioning that for completeness.
First, here is some runnable code to measure the efficiency of the different implementations:
use std::{collections::HashMap, time::Instant};

fn main() {
    let hm: HashMap<i32, i32> = (0..1_000_000).map(|i| (i, i)).collect();
    let t0 = Instant::now();
    let mut sum = 0;
    for i in 0..1_000_000 {
        if let Some(x) = hm.get(&i) {
            sum += x;
        }
    }
    let elapsed = t0.elapsed().as_secs_f64();
    println!("{} - The sum is: {}", elapsed, sum);
}
On the old desktop machine I'm writing this on, it reports 76 ms to run. Since the machine is 10+ years old, I find it baffling that your hardware would take 190 ms to run the same code, so I'm wondering how and what you're actually measuring. But let's ignore that and concentrate on the relative numbers.
When you care about hashmap efficiency in Rust, and when the keys don't come from an untrusted source, the first thing to try should always be to switch to a non-DOS-resistant hash function. One possibility is the FNV hash function from the fnv crate, which you can get by switching HashMap to fnv::FnvHashMap. That brings performance to 34 ms, i.e. a 2.2x speedup.
If this is not enough, you can try the hash from the rustc-hash crate (almost the same as fxhash, but allegedly better maintained), which uses the same function as the Rust compiler, adapted from the hash used by Firefox. Not based on any formal analysis, it performs badly on hash function test suites, but is reported to consistently outperform FNV. That's confirmed on the above example, where switching from FnvHashMap to rustc_hash::FxHashMap drops the time to 28 ms, i.e. a 2.7x speedup from the initial timing.
Finally, if you want to just imitate what C# and Java do, and don't care that certain patterns of inserted keys can lead to degraded performance, you can use the aptly-named nohash_hasher crate that gives you an identity hash. Changing HashMap<i32, i32> to HashMap<i32, i32, nohash_hasher::BuildNoHashHasher<i32>> drops the time to just under 4 ms, i.e. a whopping 19x speedup from the initial timing.
Since you report the Java example to be 9.5x faster than Rust, a 19x speedup should make your code approximately twice as fast as Java.
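For reference, here is roughly what each switch looks like (a sketch, assuming the fnv, rustc-hash, and nohash-hasher crates have been added to Cargo.toml); the lookup loop from the code above stays unchanged:
// Only the hasher changes; construction and get() calls are identical.
let fnv_map: fnv::FnvHashMap<i32, i32> = (0..1_000_000).map(|i| (i, i)).collect();
let fx_map: rustc_hash::FxHashMap<i32, i32> = (0..1_000_000).map(|i| (i, i)).collect();
let id_map: std::collections::HashMap<i32, i32, nohash_hasher::BuildNoHashHasher<i32>> =
    (0..1_000_000).map(|i| (i, i)).collect();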
Rust's HashMap by default uses an implementation of SipHash as the hash function. SipHash is designed to avoid denial-of-service attacks based on predicting hash collisions, which is an important security property for a hash function used in a hash map.
If you don't need this guarantee, you can use a simpler hash function. One option is using the fxhash crate, which should speed up reading integers from a HashMap<i32, i32> by about a factor of 3.
Other options are implementing your own trivial hash function (e.g. by simply using the identity function, which is a decent hash function for mostly consecutive keys), or using a vector instead of a hash map.
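To illustrate that last option (my own sketch, not from the answer): when the keys are dense non-negative integers, the key can simply become the index into a Vec, which removes hashing entirely.
// Dense, mostly consecutive keys: use the key itself as a Vec index.
let table: Vec<Option<i32>> = (0..1_000_000).map(Some).collect();
let mut sum: i64 = 0;
for i in 0..1_000_000usize {
    if let Some(x) = table[i] {
        sum += x as i64;
    }
}
println!("The sum is: {}", sum);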
.NET uses the identity function for hashes of Int32 by default, so it's not resistant to hash flooding attacks. Of course this is faster, but the downside is not even mentioned in the documentation of Dictionary. For what it's worth, I prefer Rust's "safe by default" approach over .NET's any day, since many developers aren't even aware of the problems predictable hash functions can cause. Rust still allows you to use a more performant hash function if you don't need the hash flooding protection, so to me personally this seems to be a strength of Rust compared to at least .NET rather than a weakness.
I decided to run some more tests, based on the suggestions by user4815162342. This time I used another machine with Ubuntu 20.04.
Rust code
println!("----- HashMap (with its default SipHash hasher) -----------");
let hm: HashMap<i32, i32> = (0..1_000_000).map(|i| (i, i)).collect();
for k in 0..6 {
let t0 = Instant::now();
let mut sum: i64 = 0;
for i in 0..1_000_000 {
if let Some(x) = hm.get(&i) {
sum += *x as i64;
}
}
let elapsed = t0.elapsed().as_secs_f64();
println!("The sum is: {}. Time elapsed: {:.3} sec", sum, elapsed);
}
println!("----- FnvHashMap (fnv 1.0.7) ------------------------------");
let hm: FnvHashMap<i32, i32> = (0..1_000_000).map(|i| (i, i)).collect();
for k in 0..6 {
let t0 = Instant::now();
let mut sum: i64 = 0;
for i in 0..1_000_000 {
if let Some(x) = hm.get(&i) {
sum += *x as i64;
}
}
let elapsed = t0.elapsed().as_secs_f64();
println!("The sum is: {}. Time elapsed: {:.3} sec", sum, elapsed);
}
println!("----- FxHashMap (rustc-hash 1.1.0) ------------------------");
let hm: FxHashMap<i32, i32> = (0..1_000_000).map(|i| (i, i)).collect();
for k in 0..6 {
let t0 = Instant::now();
let mut sum: i64 = 0;
for i in 0..1_000_000 {
if let Some(x) = hm.get(&i) {
sum += *x as i64;
}
}
let elapsed = t0.elapsed().as_secs_f64();
println!("The sum is: {}. Time elapsed: {:.3} sec", sum, elapsed);
}
println!("----- HashMap/BuildNoHashHasher (nohash-hasher 0.2.0) -----");
let hm: HashMap<i32, i32, nohash_hasher::BuildNoHashHasher<i32>> = (0..1_000_000).map(|i| (i, i)).collect();
for k in 0..6 {
let t0 = Instant::now();
let mut sum: i64 = 0;
for i in 0..1_000_000 {
if let Some(x) = hm.get(&i) {
sum += *x as i64;
}
}
let elapsed = t0.elapsed().as_secs_f64();
println!("The sum is: {}. Time elapsed: {:.3} sec", sum, elapsed);
}
BTW the last one can be replaced with this shorter type:
let hm: IntMap<i32, i32> = (0..1_000_000).map(|i| (i, i)).collect();
For those interested, this is IntMap's definition:
pub type IntMap<K, V> = std::collections::HashMap<K, V, BuildNoHashHasher<K>>;
Java code
On the same machine I tested a Java example. I don't have a JVM installed on it, so I used a Docker image adoptopenjdk/openjdk14 and directly pasted the code below in jshell> (not sure if that would hurt Java's timings). So this is the Java code:
Map<Integer, Integer> hm = new HashMap<>();
for (int i = 0; i < 1_000_000; i++) {
    hm.put(i, i);
}
for (int k = 0; k < 6; k++) {
    long t0 = System.currentTimeMillis();
    // Start timer here
    long sum = 0;
    for (int i = 0; i < 1_000_000; i++) {
        sum += hm.get(i);
    }
    System.out.println("The sum is: " + sum + ". Time elapsed: " + (System.currentTimeMillis() - t0) + " ms");
}
Results
Rust (release mode):
----- HashMap (with its default SipHash hasher) -----------
The sum is: 499999500000. Time elapsed: 0.149 sec
The sum is: 499999500000. Time elapsed: 0.140 sec
The sum is: 499999500000. Time elapsed: 0.167 sec
The sum is: 499999500000. Time elapsed: 0.150 sec
The sum is: 499999500000. Time elapsed: 0.261 sec
The sum is: 499999500000. Time elapsed: 0.189 sec
----- FnvHashMap (fnv 1.0.7) ------------------------------
The sum is: 499999500000. Time elapsed: 0.055 sec
The sum is: 499999500000. Time elapsed: 0.052 sec
The sum is: 499999500000. Time elapsed: 0.053 sec
The sum is: 499999500000. Time elapsed: 0.058 sec
The sum is: 499999500000. Time elapsed: 0.051 sec
The sum is: 499999500000. Time elapsed: 0.056 sec
----- FxHashMap (rustc-hash 1.1.0) ------------------------
The sum is: 499999500000. Time elapsed: 0.039 sec
The sum is: 499999500000. Time elapsed: 0.076 sec
The sum is: 499999500000. Time elapsed: 0.064 sec
The sum is: 499999500000. Time elapsed: 0.048 sec
The sum is: 499999500000. Time elapsed: 0.057 sec
The sum is: 499999500000. Time elapsed: 0.061 sec
----- HashMap/BuildNoHashHasher (nohash-hasher 0.2.0) -----
The sum is: 499999500000. Time elapsed: 0.004 sec
The sum is: 499999500000. Time elapsed: 0.003 sec
The sum is: 499999500000. Time elapsed: 0.003 sec
The sum is: 499999500000. Time elapsed: 0.003 sec
The sum is: 499999500000. Time elapsed: 0.003 sec
The sum is: 499999500000. Time elapsed: 0.003 sec
Java:
The sum is: 499999500000. Time elapsed: 49 ms // see notes below
The sum is: 499999500000. Time elapsed: 41 ms // see notes below
The sum is: 499999500000. Time elapsed: 18 ms
The sum is: 499999500000. Time elapsed: 29 ms
The sum is: 499999500000. Time elapsed: 19 ms
The sum is: 499999500000. Time elapsed: 23 ms
(With Java the first 1-2 runs are normally slower, as the JVM HotSpot still hasn't fully optimized the relevant piece of code.)
Try hashbrown.
It uses aHash, which has a full comparison with other hashing algorithms here.
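A minimal sketch of that, assuming the hashbrown crate is added to Cargo.toml; it mirrors std's HashMap API but ships with a faster default hasher (aHash in the versions this answer refers to):
use hashbrown::HashMap;

fn main() {
    // Same benchmark as above; only the HashMap type comes from hashbrown.
    let hm: HashMap<i32, i32> = (0..1_000_000).map(|i| (i, i)).collect();
    let mut sum: i64 = 0;
    for i in 0..1_000_000 {
        if let Some(x) = hm.get(&i) {
            sum += *x as i64;
        }
    }
    println!("The sum is: {}", sum);
}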

Using Grand Central Dispatch in Swift to parallelize and speed up "for" loops?

I am trying to wrap my head around how to use GCD to parallelize and speed up Monte Carlo simulations. Most/all simple examples are presented for Objective C and I really need a simple example for Swift since Swift is my first “real” programming language.
The minimal working version of a Monte Carlo simulation in Swift would be something like this:
import Foundation
import Cocoa

var winner = 0
var j = 0
var i = 0
var chance = 0
var points = 0
for j=1;j<1000001;++j{
    var ability = 500
    var player1points = 0
    for i=1;i<1000;++i{
        chance = Int(arc4random_uniform(1001))
        if chance<(ability-points) {++points}
        else{points = points - 1}
    }
    if points > 0{++winner}
}
println(winner)
The code works when pasted directly into a command-line program project in Xcode 6.1.
The innermost loop cannot be parallelized, because the new value of the variable "points" is used in the next iteration. But the outermost loop just runs the inner simulation 1,000,000 times, tallying up the results, and should be an ideal candidate for parallelization.
So my question is how to use GCD to parallelize the outermost for loop?
A "multi-threaded iteration" can be done with dispatch_apply():
let outerCount = 100    // # of concurrent block iterations
let innerCount = 10000  // # of iterations within each block

let the_queue = dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_DEFAULT, 0);
dispatch_apply(UInt(outerCount), the_queue) { outerIdx -> Void in
    for innerIdx in 1 ... innerCount {
        // ...
    }
}
(You have to figure out the best relation between outer and inner counts.)
There are two things to notice:
arc4random() uses an internal mutex, which makes it extremely slow when called from several threads in parallel; see Performance of concurrent code using dispatch_group_async is MUCH slower than single-threaded version. From the answers given there, rand_r() (with separate seeds for each thread) seems to be a faster alternative.
The result variable winner must not be modified from multiple threads simultaneously. You can use an array instead, where each thread updates its own element, and the results are added afterwards. A thread-safe method is described in https://stackoverflow.com/a/26790019/1187415.
Then it would roughly look like this:
let outerCount = 100    // # of concurrent block iterations
let innerCount = 10000  // # of iterations within each block

let the_queue = dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_DEFAULT, 0);
var winners = [Int](count: outerCount, repeatedValue: 0)
winners.withUnsafeMutableBufferPointer { winnersPtr -> Void in
    dispatch_apply(UInt(outerCount), the_queue) { outerIdx -> Void in
        var seed = arc4random() // seed for rand_r() in this "thread"
        for innerIdx in 1 ... innerCount {
            var points = 0
            var ability = 500
            for i in 1 ... 1000 {
                let chance = Int(rand_r(&seed) % 1001)
                if chance < (ability-points) { ++points }
                else { points = points - 1 }
            }
            if points > 0 {
                winnersPtr[Int(outerIdx)] += 1
            }
        }
    }
}

// Add results:
let winner = reduce(winners, 0, +)
println(winner)
Just to update this for contemporary syntax, we now use concurrentPerform (which replaces dispatch_apply).
So we can replace
for j in 0 ..< 1_000_000 {
    for i in 0 ..< 1000 {
        ...
    }
}
With
DispatchQueue.concurrentPerform(iterations: 1_000_000) { j in
    for i in 0 ..< 1000 {
        ...
    }
}
Note, parallelizing introduces a little overhead, both in the basic GCD dispatch mechanism and in the synchronization of the results. With 32 iterations in your parallel loop this would be inconsequential, but with a million iterations it starts to add up.
We generally solve this by “striding”: Rather than parallelizing 1 million iterations, you might only do 100 parallel iterations, doing 10,000 iterations each. E.g. something like:
let totalIterations = 1_000_000
let stride = 10_000
let (quotient, remainder) = totalIterations.quotientAndRemainder(dividingBy: stride)
let iterations = quotient + (remainder == 0 ? 0 : 1)

DispatchQueue.concurrentPerform(iterations: iterations) { iteration in
    for j in iteration * stride ..< min(totalIterations, (iteration + 1) * stride) {
        for i in 0 ..< 1000 {
            ...
        }
    }
}

Memoization done, what now?

I was trying to solve a puzzle in Haskell and had written the following code:
u 0 p = 0.0
u 1 p = 1.0
u n p = 1.0 + minimum [((1.0-q)*(s k p)) + (u (n-k) p) | k <-[1..n], let q = (1.0-p)**(fromIntegral k)]
s 1 p = 0.0
s n p = 1.0 + minimum [((1.0-q)*(s (n-k) p)) + q*((s k p) + (u (n-k) p)) | k <-[1..(n-1)], let q = (1.0-(1.0-p)**(fromIntegral k))/(1.0-(1.0-p)**(fromIntegral n))]
This code was terribly slow though. I suspect the reason for this is that the same things get calculated again and again. I therefore made a memoized version:
memoUa = array (0,10000) ((0,0.0):(1,1.0):[(k,mua k) | k<- [2..10000]])
mua n = (1.0) + minimum [((1.0-q)*(memoSa ! k)) + (memoUa ! (n-k)) | k <-[1..n], let q = (1.0-0.02)**(fromIntegral k)]
memoSa = array (0,10000) ((0,0.0):(1,0.0):[(k,msa k) | k<- [2..10000]])
msa n = (1.0) + minimum [((1.0-q) * (memoSa ! (n-k))) + q*((memoSa ! k) + (memoUa ! (n-k))) | k <-[1..(n-1)], let q = (1.0-(1.0-0.02)**(fromIntegral k))/(1.0-(1.0-0.02)**(fromIntegral n))]
This seems to be a lot faster, but now I get an out-of-memory error. I do not understand why this happens (the same strategy in Java, without recursion, has no problems). Could somebody point me in the right direction on how to improve this code?
EDIT: I am adding my java version here (as I don't know where else to put it). I realize that the code isn't really reader-friendly (no meaningful names, etc.), but I hope it is clear enough.
public class Main {
    public static double calc(double p) {
        double[] u = new double[10001];
        double[] s = new double[10001];
        u[0] = 0.0;
        u[1] = 1.0;
        s[0] = 0.0;
        s[1] = 0.0;
        for (int n = 2; n < 10001; n++) {
            double q = 1.0;
            double denom = 1.0;
            for (int k = 1; k <= n; k++) {
                denom = denom * (1.0 - p);
            }
            denom = 1.0 - denom;
            s[n] = (double) n;
            u[n] = (double) n;
            for (int k = 1; k <= n; k++) {
                q = (1.0 - p) * q;
                if (k < n) {
                    double qs = (1.0 - q) / denom;
                    double bs = (1.0 - qs) * s[n-k] + qs * (s[k] + u[n-k]) + 1.0;
                    if (bs < s[n]) {
                        s[n] = bs;
                    }
                }
                double bu = (1.0 - q) * s[k] + 1.0 + u[n-k];
                if (bu < u[n]) {
                    u[n] = bu;
                }
            }
        }
        return u[10000];
    }

    public static void main(String[] args) {
        double s = 0.0;
        int i = 2;
        //for (int i = 1; i < 51; i++) {
        s = s + calc(i * 0.01);
        //}
        System.out.println("result = " + s);
    }
}
I don't run out of memory when I run the compiled version, but there is a significant difference between how the Java version works and how the Haskell version works which I'll illustrate here.
The first thing to do is to add some important type signatures. In particular, you don't want Integer array indices, so I added:
memoUa :: Array Int Double
memoSa :: Array Int Double
I found these using ghc-mod check. I also added a main so that you can run it from the command line:
import System.Environment

main = do
    (arg:_) <- getArgs
    let n = read arg
    print $ mua n
Now to gain some insight into what's going on, we can compile the program using profiling:
ghc -O2 -prof memo.hs
Then when we invoke the program like this:
memo 1000 +RTS -s
we will get profiling output which looks like:
164.31333233347755
98,286,872 bytes allocated in the heap
29,455,360 bytes copied during GC
657,080 bytes maximum residency (29 sample(s))
38,260 bytes maximum slop
3 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 161 colls, 0 par 0.03s 0.03s 0.0002s 0.0011s
Gen 1 29 colls, 0 par 0.03s 0.03s 0.0011s 0.0017s
INIT time 0.00s ( 0.00s elapsed)
MUT time 0.21s ( 0.21s elapsed)
GC time 0.06s ( 0.06s elapsed)
RP time 0.00s ( 0.00s elapsed)
PROF time 0.00s ( 0.00s elapsed)
EXIT time 0.00s ( 0.00s elapsed)
Total time 0.27s ( 0.27s elapsed)
%GC time 21.8% (22.3% elapsed)
Alloc rate 468,514,624 bytes per MUT second
Productivity 78.2% of total user, 77.3% of total elapsed
Important things to pay attention to are:
maximum residency
Total time
%GC time (or Productivity)
Maximum residency is a measure of how much memory is needed by the program. %GC time is the proportion of the time spent in garbage collection, and Productivity is the complement (100% - %GC time).
If you run the program for various input values you will see a productivity of around 80%:
n Max Res. Prod. Time Output
2000 779,076 79.4% 1.10s 328.54535361588535
4000 1,023,016 80.7% 4.41s 657.0894961398351
6000 1,299,880 81.3% 9.91s 985.6071032981068
8000 1,539,352 81.5% 17.64s 1314.0968411684714
10000 1,815,600 81.7% 27.57s 1642.5891214360522
This means that about 20% of the run time is spent in garbage collection. Also, we see increasing memory usage as n increases.
It turns out we can dramatically improve productivity and memory usage by telling Haskell the order in which to evaluate the array elements instead of relying on lazy evaluation:
import Control.Monad (forM_)

main = do
    (arg:_) <- getArgs
    let n = read arg
    forM_ [1..n] $ \i -> mua i `seq` return ()
    print $ mua n
And the new profiling stats are:
n Max Res. Prod. Time Output
2000 482,800 99.3% 1.31s 328.54535361588535
4000 482,800 99.6% 5.88s 657.0894961398351
6000 482,800 99.5% 12.09s 985.6071032981068
8000 482,800 98.1% 21.71s 1314.0968411684714
10000 482,800 96.1% 34.58s 1642.5891214360522
Some interesting observations here: productivity is up, memory usage is down (now constant over the range of inputs), but run time is up. This suggests that we forced more computations than we needed to. In an imperative language like Java you have to give an evaluation order, so you know exactly which computations need to be performed. It would be interesting to see your Java code to see which computations it is performing.

Why is my recursive function so slow in R?

The following takes about 30 seconds to run whereas I would expect it to be nearly instant. Is there a problem with my code?
x <- fibonacci(35);

fibonacci <- function(seq) {
    if (seq == 1) return(1);
    if (seq == 2) return(2);
    return (fibonacci(seq - 1) + fibonacci(seq - 2));
}
Patrick Burns gives an example in R Inferno of one way to do memoization in R with local() and <<-. In fact, it's a fibonacci:
fibonacci <- local({
    memo <- c(1, 1, rep(NA, 100))
    f <- function(x) {
        if(x == 0) return(0)
        if(x < 0) return(NA)
        if(x > length(memo))
            stop("'x' too big for implementation")
        if(!is.na(memo[x])) return(memo[x])
        ans <- f(x-2) + f(x-1)
        memo[x] <<- ans
        ans
    }
})
That just provided a nice opportunity to plug Rcpp which allows us to add C++ functions easily to R.
So after fixing your code slightly, and using the packages inline (to easily compile, load and link short code snippets as dynamically loadable functions) as well as rbenchmark to time and compare functions, we end up with a stunning 700-fold increase in performance:
R> print(res)
test replications elapsed relative user.self sys.self
2 fibRcpp(N) 1 0.092 1.000 0.10 0
1 fibR(N) 1 65.693 714.054 65.66 0
R>
Here we see elapsed times of 92 milliseconds versus 65 seconds, for a relative ratio of 714. But by now everybody else has told you not to do this directly in R.... The code is below.
## inline to compile, load and link the C++ code
require(inline)

## we need a pure C/C++ function as the generated function
## will have a random identifier at the C++ level preventing
## us from direct recursive calls
incltxt <- '
int fibonacci(const int x) {
    if (x == 0) return(0);
    if (x == 1) return(1);
    return (fibonacci(x - 1)) + fibonacci(x - 2);
}'

## now use the snippet above as well as one argument conversion
## in as well as out to provide Fibonacci numbers via C++
fibRcpp <- cxxfunction(signature(xs="int"),
                       plugin="Rcpp",
                       incl=incltxt,
                       body='
    int x = Rcpp::as<int>(xs);
    return Rcpp::wrap( fibonacci(x) );
')

## for comparison, the original (but repaired with 0/1 offsets)
fibR <- function(seq) {
    if (seq == 0) return(0);
    if (seq == 1) return(1);
    return (fibR(seq - 1) + fibR(seq - 2));
}

## load rbenchmark to compare
library(rbenchmark)

N <- 35  ## same parameter as original post
res <- benchmark(fibR(N),
                 fibRcpp(N),
                 columns=c("test", "replications", "elapsed",
                           "relative", "user.self", "sys.self"),
                 order="relative",
                 replications=1)
print(res)  ## show result
And for completeness, the functions also produce the correct output:
R> sapply(1:10, fibR)
[1] 1 1 2 3 5 8 13 21 34 55
R> sapply(1:10, fibRcpp)
[1] 1 1 2 3 5 8 13 21 34 55
R>
Because you are using one of the worst algorithms in the world!
Its complexity is O(fibonacci(n)) = O((golden ratio)^n), and the golden ratio is 1.6180339887498948482…
:-) Because you use an exponential algorithm!!! For Fibonacci number N it has to call the function on the order of 2^N times, and 2^35 is a heck of a number.... :-)
Use linear algorithm:
fib = function (x)
{
    if (x == 0)
        return (0)
    if (x == 1)
        return (1)  # guard needed: in R, 1:(x-1) would be 1:0 and iterate twice
    n1 = 0
    n2 = 1
    for (i in 1:(x-1)) {
        sum = n1 + n2
        n1 = n2
        n2 = sum
    }
    n2
}
Sorry, edit: the complexity of the exponential recursive algorithm is not O(2^N) but O(fib(N)), as Martinho Fernandes greatly joked :-) Really a good note :-)
Because the memoise package was already mentioned, here is a reference implementation:
fib <- function(n) {
    if (n < 2) return(1)
    fib(n - 2) + fib(n - 1)
}
system.time(fib(35))
##    user  system elapsed
##   36.10    0.02   36.16

library(memoise)
fib2 <- memoise(function(n) {
    if (n < 2) return(1)
    fib2(n - 2) + fib2(n - 1)
})
system.time(fib2(35))
##    user  system elapsed
##       0       0       0
Source: Wickham, H.: Advanced R, p. 238.
In general memoization in computer science means that you save the results of a function so that when you call it again with the same arguments it returns the saved value.
A recursive implementation with linear cost:
fib3 <- function(n){
    fib <- function(n, fibm1, fibm2){
        if(n==1){return(fibm2)}
        if(n==2){return(fibm1)}
        if(n >2){
            fib(n-1, fibm1+fibm2, fibm1)
        }
    }
    fib(n, 1, 0)
}
Comparing with the recursive solution with exponential cost:
> system.time(fibonacci(35))
   user  system elapsed
 14.629   0.017  14.644
> system.time(fib3(35))
   user  system elapsed
  0.001   0.000   0.000
This solution can be vectorized with ifelse:
fib4 <- function(n){
    fib <- function(n, fibm1, fibm2){
        ifelse(n<=1, fibm2,
            ifelse(n==2, fibm1,
                Recall(n-1, fibm1+fibm2, fibm1)
            ))
    }
    fib(n, 1, 0)
}
fib4(1:30)
## [1] 0 1 1 2 3 5 8
## [8] 13 21 34 55 89 144 233
## [15] 377 610 987 1597 2584 4181 6765
## [22] 10946 17711 28657 46368 75025 121393 196418
## [29] 317811 514229
The only changes required are changing == to <= for the n==1 case, and changing each if block to the equivalent ifelse.
If you are truly looking to return Fibonacci numbers and aren't using this example to explore how recursion works then you can solve it non-recursively by using the following:
fib = function(n) {round((1.61803398875^n - (-0.61803398875)^n)/sqrt(5))}
