Faster HashMap for sequential keys - performance

Initially I was very surprised to find that Rust's HashMap, even with the FNV hasher, was considerably slower than the equivalents in Java, .NET, and PHP (in optimized release mode, not debug mode). I did some calculations and realized the timings in Java/.NET/PHP were suspiciously low. Then it hit me: even though I was testing with a big hash table (millions of entries), I was reading mostly sequential key values (like 14, 15, 16, ...), which apparently resulted in lots of CPU cache hits, because of the way the standard hash tables (and the hash-code functions for integers and short strings) in those languages are implemented, so that entries with nearby keys usually end up in nearby memory locations.
Rust's HashMap, on the other hand, uses the so-called SwissTable implementation, which apparently distributes entries differently. When I tested reading by random keys, everything fell into place: the "competitors" fell behind Rust.
So if we are in a situation where we need to perform lots of gets sequentially, for example iterating over DB IDs that are ordered and mostly sequential (without too many gaps), is there a good Rust hash map implementation that can compete with Java's HashMap or .NET's Dictionary?
P.S. As requested in the comments, I'm pasting an example here. I ran lots of tests, but here is a simple one that takes 75 ms in Rust (release mode) and 20 ms in Java:
In Rust:
let hm: FnvHashMap<i32, i32> = ...;
// Start timer here
let mut sum: i64 = 0;
for i in 0..1_000_000 {
    if let Some(x) = hm.get(&i) {
        sum += *x as i64;
    }
}
println!("The sum is: {}", sum);
In Java:
Map<Integer, Integer> hm = ...;
// Start timer here
long sum = 0;
for (int i = 0; i < 1_000_000; i++) {
    sum += hm.get(i);
}
With HashMap<i32, i32> and its default SipHash hasher it took 190 ms. I know why it's slower than FnvHashMap. I'm just mentioning that for completeness.

First, here is some runnable code to measure the efficiency of the different implementations:
use std::{collections::HashMap, time::Instant};

fn main() {
    let hm: HashMap<i32, i32> = (0..1_000_000).map(|i| (i, i)).collect();
    let t0 = Instant::now();
    let mut sum: i64 = 0; // i64, because the total overflows an i32
    for i in 0..1_000_000 {
        if let Some(x) = hm.get(&i) {
            sum += *x as i64;
        }
    }
    let elapsed = t0.elapsed().as_secs_f64();
    println!("{} - The sum is: {}", elapsed, sum);
}
On the old desktop machine I'm writing this on, it reports 76 ms to run. Since the machine is 10+ years old, I find it baffling that your hardware would take 190 ms to run the same code, so I'm wondering how and what you're actually measuring. But let's ignore that and concentrate on the relative numbers.
When you care about hashmap efficiency in Rust, and when the keys don't come from an untrusted source, the first thing to try should always be to switch to a non-DOS-resistant hash function. One possibility is the FNV hash function from the fnv crate, which you can get by switching HashMap to fnv::FnvHashMap. That brings performance to 34 ms, i.e. a 2.2x speedup.
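For reference, the change is only in the map's type; a minimal sketch, assuming fnv is listed as a dependency:
use fnv::FnvHashMap;

fn main() {
    // FnvHashMap is std's HashMap with FNV as the BuildHasher, so the rest of the API is unchanged.
    let hm: FnvHashMap<i32, i32> = (0..1_000_000).map(|i| (i, i)).collect();
    assert_eq!(hm.get(&42), Some(&42));
}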
If this is not enough, you can try the hasher from the rustc-hash crate (almost the same as fxhash, but allegedly better maintained), which uses the same function as the Rust compiler, adapted from the hash used by Firefox. It is not based on any formal analysis and performs badly on hash function test suites, but it is reported to consistently outperform FNV. That's confirmed on the above example, where switching from FnvHashMap to rustc_hash::FxHashMap drops the time to 28 ms, i.e. a 2.7x speedup from the initial timing.
Finally, if you want to just imitate what C# and Java do, and don't mind certain patterns of inserted numbers leading to degraded performance, you can use the aptly named nohash_hasher crate that gives you an identity hash. Changing HashMap<i32, i32> to HashMap<i32, i32, nohash_hasher::BuildNoHashHasher<i32>> drops the time to just under 4 ms, i.e. a whopping 19x speedup from the initial timing.
Since you report the Java example to be 9.5x faster than Rust, a 19x speedup should make your code approximately twice as fast as Java.

Rust's HashMap by default uses an implementation of SipHash as the hash function. SipHash is designed to avoid denial-of-service attacks based on predicting hash collisions, which is an important security property for a hash function used in a hash map.
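To make the guarantee concrete: the default hasher is randomly seeded, so the same key hashes to a different value on every program run, which is what prevents an attacker from precomputing colliding keys. A small illustration (sketch only):
use std::collections::hash_map::RandomState;
use std::hash::{BuildHasher, Hash, Hasher};

fn main() {
    // RandomState is the default BuildHasher of std's HashMap; its seed differs between runs.
    let state = RandomState::new();
    let mut hasher = state.build_hasher();
    42_i32.hash(&mut hasher);
    println!("hash of 42 in this run: {}", hasher.finish()); // changes from run to run
}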
If you don't need this guarantee, you can use a simpler hash function. One option is using the fxhash crate, which should speed up reading integers from a HashMap<i32, i32> by about a factor of 3.
Other options are implementing your own trivial hash function (e.g. by simply using the identity function, which is a decent hash function for mostly consecutive keys), or using a vector instead of a hash map.
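As an illustration of the first option, here is a minimal hand-rolled identity hasher for i32 keys (a sketch with made-up names; the nohash_hasher crate packages the same idea):
use std::collections::HashMap;
use std::hash::{BuildHasherDefault, Hasher};

#[derive(Default)]
struct IdentityHasher(u64);

impl Hasher for IdentityHasher {
    fn finish(&self) -> u64 {
        self.0
    }
    fn write(&mut self, _bytes: &[u8]) {
        // Only integer keys are supported; panicking here catches accidental use with other key types.
        unimplemented!("IdentityHasher only supports integer keys");
    }
    fn write_i32(&mut self, n: i32) {
        self.0 = n as u64; // the key is its own hash
    }
}

type IdentityMap<V> = HashMap<i32, V, BuildHasherDefault<IdentityHasher>>;

fn main() {
    let hm: IdentityMap<i32> = (0..1_000_000).map(|i| (i, i)).collect();
    assert_eq!(hm.get(&42), Some(&42));
}
For dense, mostly consecutive keys, the vector option is simpler still: index a Vec<Option<V>> (or a plain Vec<V>) by the key and skip hashing entirely.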
.NET uses the identity function for hashes of Int32 by default, so it's not resistant to hash flooding attacks. Of course this is faster, but the downside isn't even mentioned in the documentation of Dictionary. For what it's worth, I prefer Rust's "safe by default" approach over .NET's any day, since many developers aren't even aware of the problems predictable hash functions can cause. Rust still lets you use a more performant hash function when you don't need the hash flooding protection, so to me this seems to be a strength of Rust rather than a weakness, at least compared to .NET.

I decided to run some more tests, based on the suggestions by user4815162342. This time I used another machine with Ubuntu 20.04.
Rust code
println!("----- HashMap (with its default SipHash hasher) -----------");
let hm: HashMap<i32, i32> = (0..1_000_000).map(|i| (i, i)).collect();
for k in 0..6 {
let t0 = Instant::now();
let mut sum: i64 = 0;
for i in 0..1_000_000 {
if let Some(x) = hm.get(&i) {
sum += *x as i64;
}
}
let elapsed = t0.elapsed().as_secs_f64();
println!("The sum is: {}. Time elapsed: {:.3} sec", sum, elapsed);
}
println!("----- FnvHashMap (fnv 1.0.7) ------------------------------");
let hm: FnvHashMap<i32, i32> = (0..1_000_000).map(|i| (i, i)).collect();
for k in 0..6 {
let t0 = Instant::now();
let mut sum: i64 = 0;
for i in 0..1_000_000 {
if let Some(x) = hm.get(&i) {
sum += *x as i64;
}
}
let elapsed = t0.elapsed().as_secs_f64();
println!("The sum is: {}. Time elapsed: {:.3} sec", sum, elapsed);
}
println!("----- FxHashMap (rustc-hash 1.1.0) ------------------------");
let hm: FxHashMap<i32, i32> = (0..1_000_000).map(|i| (i, i)).collect();
for k in 0..6 {
let t0 = Instant::now();
let mut sum: i64 = 0;
for i in 0..1_000_000 {
if let Some(x) = hm.get(&i) {
sum += *x as i64;
}
}
let elapsed = t0.elapsed().as_secs_f64();
println!("The sum is: {}. Time elapsed: {:.3} sec", sum, elapsed);
}
println!("----- HashMap/BuildNoHashHasher (nohash-hasher 0.2.0) -----");
let hm: HashMap<i32, i32, nohash_hasher::BuildNoHashHasher<i32>> = (0..1_000_000).map(|i| (i, i)).collect();
for k in 0..6 {
let t0 = Instant::now();
let mut sum: i64 = 0;
for i in 0..1_000_000 {
if let Some(x) = hm.get(&i) {
sum += *x as i64;
}
}
let elapsed = t0.elapsed().as_secs_f64();
println!("The sum is: {}. Time elapsed: {:.3} sec", sum, elapsed);
}
BTW the last one can be replaced with this shorter type:
let hm: IntMap<i32, i32> = (0..1_000_000).map(|i| (i, i)).collect();
For those interested, this is IntMap's definition:
pub type IntMap<K, V> = std::collections::HashMap<K, V, BuildNoHashHasher<K>>;
Java code
On the same machine I tested a Java example. I don't have a JVM installed on it, so I used the Docker image adoptopenjdk/openjdk14 and pasted the code below directly at the jshell> prompt (not sure whether that hurts Java's timings). So this is the Java code:
Map<Integer, Integer> hm = new HashMap<>();
for (int i = 0; i < 1_000_000; i++) {
    hm.put(i, i);
}
for (int k = 0; k < 6; k++) {
    long t0 = System.currentTimeMillis();
    // Start timer here
    long sum = 0;
    for (int i = 0; i < 1_000_000; i++) {
        sum += hm.get(i);
    }
    System.out.println("The sum is: " + sum + ". Time elapsed: " + (System.currentTimeMillis() - t0) + " ms");
}
Results
Rust (release mode):
----- HashMap (with its default SipHash hasher) -----------
The sum is: 499999500000. Time elapsed: 0.149 sec
The sum is: 499999500000. Time elapsed: 0.140 sec
The sum is: 499999500000. Time elapsed: 0.167 sec
The sum is: 499999500000. Time elapsed: 0.150 sec
The sum is: 499999500000. Time elapsed: 0.261 sec
The sum is: 499999500000. Time elapsed: 0.189 sec
----- FnvHashMap (fnv 1.0.7) ------------------------------
The sum is: 499999500000. Time elapsed: 0.055 sec
The sum is: 499999500000. Time elapsed: 0.052 sec
The sum is: 499999500000. Time elapsed: 0.053 sec
The sum is: 499999500000. Time elapsed: 0.058 sec
The sum is: 499999500000. Time elapsed: 0.051 sec
The sum is: 499999500000. Time elapsed: 0.056 sec
----- FxHashMap (rustc-hash 1.1.0) ------------------------
The sum is: 499999500000. Time elapsed: 0.039 sec
The sum is: 499999500000. Time elapsed: 0.076 sec
The sum is: 499999500000. Time elapsed: 0.064 sec
The sum is: 499999500000. Time elapsed: 0.048 sec
The sum is: 499999500000. Time elapsed: 0.057 sec
The sum is: 499999500000. Time elapsed: 0.061 sec
----- HashMap/BuildNoHashHasher (nohash-hasher 0.2.0) -----
The sum is: 499999500000. Time elapsed: 0.004 sec
The sum is: 499999500000. Time elapsed: 0.003 sec
The sum is: 499999500000. Time elapsed: 0.003 sec
The sum is: 499999500000. Time elapsed: 0.003 sec
The sum is: 499999500000. Time elapsed: 0.003 sec
The sum is: 499999500000. Time elapsed: 0.003 sec
Java:
The sum is: 499999500000. Time elapsed: 49 ms // see notes below
The sum is: 499999500000. Time elapsed: 41 ms // see notes below
The sum is: 499999500000. Time elapsed: 18 ms
The sum is: 499999500000. Time elapsed: 29 ms
The sum is: 499999500000. Time elapsed: 19 ms
The sum is: 499999500000. Time elapsed: 23 ms
(With Java the first 1-2 runs are normally slower, as the JVM HotSpot still hasn't fully optimized the relevant piece of code.)

Try hashbrown.
It uses aHash, which has a full comparison with other hashing algorithms here.
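A sketch of what that swap looks like, assuming the hashbrown crate is added as a dependency (its default hasher was aHash-based at the time of these answers):
use hashbrown::HashMap;

fn main() {
    // Drop-in replacement for std::collections::HashMap; only the import changes.
    let hm: HashMap<i32, i32> = (0..1_000_000).map(|i| (i, i)).collect();
    assert_eq!(hm.get(&42), Some(&42));
}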

Related

Why is Vec::with_capacity slower than Vec::new for small final lengths?

Consider this code.
type Int = i32;
const MAX_NUMBER: Int = 1_000_000;

fn main() {
    let result1 = with_new();
    let result2 = with_capacity();
    assert_eq!(result1, result2)
}

fn with_new() -> Vec<Int> {
    let mut result = Vec::new();
    for i in 0..MAX_NUMBER {
        result.push(i);
    }
    result
}

fn with_capacity() -> Vec<Int> {
    let mut result = Vec::with_capacity(MAX_NUMBER as usize);
    for i in 0..MAX_NUMBER {
        result.push(i);
    }
    result
}
Both functions produce the same output. One uses Vec::new, the other uses Vec::with_capacity. For small values of MAX_NUMBER (like in the example), with_capacity is slower than new. Only for larger final vector lengths (e.g. 100 million) is the with_capacity version as fast as the new version.
Flamegraph for 1 million elements
Flamegraph for 100 million elements
It is my understanding that with_capacity should always be faster if the final length is known, because the data on the heap is allocated once, as a single chunk. In contrast, the version with new has to grow the vector repeatedly, reallocating whenever the capacity is exhausted, which results in more allocations.
What am I missing?
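For what it's worth, the growth behaviour can be observed directly by watching the capacity while pushing; a small sketch (illustrative only, not part of the measured code):
fn main() {
    let mut v: Vec<i32> = Vec::new();
    let mut grows = 0;
    let mut last_cap = v.capacity();
    for i in 0..1_000_000 {
        v.push(i);
        if v.capacity() != last_cap {
            grows += 1;
            last_cap = v.capacity();
        }
    }
    // With the current (unspecified) growth strategy this reports only a few dozen grows,
    // while Vec::with_capacity(1_000_000) would allocate exactly once up front.
    println!("grew {} times, final capacity {}", grows, v.capacity());
}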
Edit
The first section was compiled with the debug profile. If I use the release profile with the following settings in Cargo.toml
[package]
name = "vec_test"
version = "0.1.0"
edition = "2021"
[profile.release]
opt-level = 3
debug = 2
I still get the following result for a length of 10 million.
I was not able to reproduce this in a synthetic benchmark:
use criterion::{criterion_group, criterion_main, BenchmarkId, Criterion};

fn with_new(size: i32) -> Vec<i32> {
    let mut result = Vec::new();
    for i in 0..size {
        result.push(i);
    }
    result
}

fn with_capacity(size: i32) -> Vec<i32> {
    let mut result = Vec::with_capacity(size as usize);
    for i in 0..size {
        result.push(i);
    }
    result
}

pub fn with_new_benchmark(c: &mut Criterion) {
    let mut group = c.benchmark_group("with_new");
    for size in [100, 1_000, 10_000, 100_000, 1_000_000, 10_000_000].iter() {
        group.bench_with_input(BenchmarkId::from_parameter(size), size, |b, &size| {
            b.iter(|| with_new(size));
        });
    }
    group.finish();
}

pub fn with_capacity_benchmark(c: &mut Criterion) {
    let mut group = c.benchmark_group("with_capacity");
    for size in [100, 1_000, 10_000, 100_000, 1_000_000, 10_000_000].iter() {
        group.bench_with_input(BenchmarkId::from_parameter(size), size, |b, &size| {
            b.iter(|| with_capacity(size));
        });
    }
    group.finish();
}

criterion_group!(benches, with_new_benchmark, with_capacity_benchmark);
criterion_main!(benches);
Here's the output (with outliers and other benchmarking stuff removed):
with_new/100 time: [331.17 ns 331.38 ns 331.61 ns]
with_new/1000 time: [1.1719 us 1.1731 us 1.1745 us]
with_new/10000 time: [8.6784 us 8.6840 us 8.6899 us]
with_new/100000 time: [77.524 us 77.596 us 77.680 us]
with_new/1000000 time: [1.6966 ms 1.6978 ms 1.6990 ms]
with_new/10000000 time: [22.063 ms 22.084 ms 22.105 ms]
with_capacity/100 time: [76.796 ns 76.859 ns 76.926 ns]
with_capacity/1000 time: [497.90 ns 498.14 ns 498.39 ns]
with_capacity/10000 time: [5.0058 us 5.0084 us 5.0112 us]
with_capacity/100000 time: [50.358 us 50.414 us 50.470 us]
with_capacity/1000000 time: [1.0861 ms 1.0868 ms 1.0876 ms]
with_capacity/10000000 time: [10.644 ms 10.656 ms 10.668 ms]
The with_capacity runs were consistently faster than with_new. The closest ones being in the 10,000- to 1,000,000-element runs where with_capacity still only took ~60% of the time, while others had it taking half the time or less.
It crossed my mind that there could be some strange const-propagation behavior going on, but even with individual functions with hard-coded sizes (playground for brevity), the behavior didn't significantly change:
with_new/100 time: [313.87 ns 314.22 ns 314.56 ns]
with_new/1000 time: [1.1498 us 1.1505 us 1.1511 us]
with_new/10000 time: [7.9062 us 7.9095 us 7.9130 us]
with_new/100000 time: [77.925 us 77.990 us 78.057 us]
with_new/1000000 time: [1.5675 ms 1.5683 ms 1.5691 ms]
with_new/10000000 time: [20.956 ms 20.990 ms 21.023 ms]
with_capacity/100 time: [76.943 ns 76.999 ns 77.064 ns]
with_capacity/1000 time: [535.00 ns 536.22 ns 537.21 ns]
with_capacity/10000 time: [5.1122 us 5.1150 us 5.1181 us]
with_capacity/100000 time: [50.064 us 50.092 us 50.122 us]
with_capacity/1000000 time: [1.0768 ms 1.0776 ms 1.0784 ms]
with_capacity/10000000 time: [10.600 ms 10.613 ms 10.628 ms]
Your testing code only calls each strategy once, so it's conceivable that your testing environment is changed after calling the first one (a potential culprit being heap fragmentation, as suggested by @trent_formerly_cl in the comments, though there could be others: CPU boosting/throttling, spatial and/or temporal cache locality, OS behavior, etc.). A benchmarking framework like criterion helps avoid a lot of these problems by iterating each test multiple times (including warmup iterations).

Why does rayon-based parallel processing take more time than serial processing?

While learning Rayon, I wanted to compare the performance of parallel and serial calculation of the Fibonacci series. Here's my code:
use rayon;
use std::time::Instant;

fn main() {
    let nth = 30;

    let now = Instant::now();
    let fib = fibonacci_serial(nth);
    println!(
        "[s] The {}th number in the fibonacci sequence is {}, elapsed: {}",
        nth,
        fib,
        now.elapsed().as_micros()
    );

    let now = Instant::now();
    let fib = fibonacci_parallel(nth);
    println!(
        "[p] The {}th number in the fibonacci sequence is {}, elapsed: {}",
        nth,
        fib,
        now.elapsed().as_micros()
    );
}

fn fibonacci_parallel(n: u64) -> u64 {
    if n <= 1 {
        return n;
    }
    let (a, b) = rayon::join(|| fibonacci_parallel(n - 2), || fibonacci_parallel(n - 1));
    a + b
}

fn fibonacci_serial(n: u64) -> u64 {
    if n <= 1 {
        return n;
    }
    fibonacci_serial(n - 2) + fibonacci_serial(n - 1)
}
Run in Rust Playground
I expected the elapsed time of the parallel calculation to be smaller than that of the serial calculation, but the result was the opposite:
# `s` stands for serial calculation and `p` for parallel
[s] The 30th number in the fibonacci sequence is 832040, elapsed: 12127
[p] The 30th number in the fibonacci sequence is 832040, elapsed: 990379
My implementation of the serial/parallel calculation may have flaws. But if not, why am I seeing these results?
I think the real reason is that you spawn a huge number of parallel tasks, which is not good. Every call of fibonacci_parallel hands another pair of closures to rayon, and because each closure calls fibonacci_parallel again, each of those spawns yet another pair, so the number of tasks grows exponentially with n.
This is utterly terrible for the OS/rayon.
One approach to solve this problem could be this:
fn fibonacci_parallel(n: u64) -> u64 {
    fn inner(n: u64) -> u64 {
        if n <= 1 {
            return n;
        }
        inner(n - 2) + inner(n - 1)
    }

    if n <= 1 {
        return n;
    }
    let (a, b) = rayon::join(|| inner(n - 2), || inner(n - 1));
    a + b
}
You create only two parallel tasks, which both execute the serial inner function. With this change I get
op#VBOX /t/t/foo> cargo run --release 40
Finished release [optimized] target(s) in 0.03s
Running `target/release/foo 40`
[s] The 40th number in the fibonacci sequence is 102334155, elapsed: 1373741
[p] The 40th number in the fibonacci sequence is 102334155, elapsed: 847343
But as said, for low numbers parallel execution is not worth it:
op#VBOX /t/t/foo> cargo run --release 20
Finished release [optimized] target(s) in 0.02s
Running `target/release/foo 20`
[s] The 10th number in the fibonacci sequence is 6765, elapsed: 82
[p] The 10th number in the fibonacci sequence is 6765, elapsed: 241
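A common middle ground (sketched here as an illustration, with an assumed cutoff value to tune) is to keep the recursion parallel only above a threshold, so each rayon task does enough work to amortize its scheduling cost:
fn fib_serial(n: u64) -> u64 {
    if n <= 1 { n } else { fib_serial(n - 2) + fib_serial(n - 1) }
}

fn fib_cutoff(n: u64) -> u64 {
    const CUTOFF: u64 = 20; // assumed threshold: below this, spawning tasks costs more than it saves
    if n <= CUTOFF {
        return fib_serial(n);
    }
    let (a, b) = rayon::join(|| fib_cutoff(n - 2), || fib_cutoff(n - 1));
    a + b
}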

Akka - worse performance with more actors

I'm trying out some parallel programming with Scala and Akka, which I'm new to. I've got a pretty simple Monte Carlo Pi application (approximates pi in a circle) which I've built in several languages. However the performance of the version I've built in Akka is puzzling me.
I have a sequential version written in pure Scala that tends to take roughly 400ms to complete.
In comparison, with 1 worker actor the Akka version takes around 300-350 ms; however, as I increase the number of actors, that time increases dramatically. With 4 actors the time can be anywhere from 500 ms all the way up to 1200 ms or higher.
The iterations are divided up between the worker actors, so ideally performance should improve the more of them there are; currently it gets significantly worse.
My code is
import java.io.FileWriter

import akka.actor.{Actor, ActorSystem, Props}
import akka.routing.RoundRobinPool

object MCpi {
  //Declare initial values
  val numWorkers = 2
  val numIterations = 10000000

  //Declare messages that will be sent to actors
  sealed trait PiMessage
  case object Calculate extends PiMessage
  case class Work(iterations: Int) extends PiMessage
  case class Result(value: Int) extends PiMessage
  case class PiApprox(pi: Double, duration: Double)

  //Main method
  def main(args: Array[String]): Unit = {
    val system = ActorSystem("MCpi_System") //Create Akka system
    val master = system.actorOf(Props(new MCpi_Master(numWorkers, numIterations))) //Create Master Actor
    println("Starting Master")
    master ! Calculate //Run calculation
  }
}

import MCpi._ //Bring the message types into scope for the actor classes

//Master
class MCpi_Master(numWorkers: Int, numIterations: Int) extends Actor {
  var pi: Double = _ // Store pi
  var quadSum: Int = _ //the total number of points inside the quadrant
  var numResults: Int = _ //number of results returned
  val startTime: Double = System.currentTimeMillis() //calculation start time

  //Create a group of worker actors
  val workerRouter = context.actorOf(
    Props[MCpi_Worker].withRouter(RoundRobinPool(numWorkers)), name = "workerRouter")
  val listener = context.actorOf(Props[MCpi_Listener], name = "listener")

  def receive = {
    //Tell workers to start the calculation
    //For each worker a message is sent with the number of iterations it is to perform,
    //iterations are split up between the number of workers.
    case Calculate => for (i <- 0 until numWorkers) workerRouter ! Work(numIterations / numWorkers)
    //Receive the results from the workers
    case Result(value) =>
      //Add up the total number of points in the circle from each worker
      quadSum += value
      //Total up the number of results which have been received, this should be 1 for each worker
      numResults += 1
      if (numResults == numWorkers) { //Once all results have been collected
        //Calculate pi
        pi = (4.0 * quadSum) / numIterations
        //Send the results to the listener to output
        listener ! PiApprox(pi, duration = System.currentTimeMillis - startTime)
        context.stop(self)
      }
  }
}

//Worker
class MCpi_Worker extends Actor {
  //Performs the calculation
  def calculatePi(iterations: Int): Int = {
    val r = scala.util.Random // Create random number generator
    var inQuadrant: Int = 0 //Store number of points within circle
    for (i <- 0 to iterations) {
      //Generate random point
      val X = r.nextFloat()
      val Y = r.nextFloat()
      //Determine whether or not the point is within the circle
      if (((X * X) + (Y * Y)) < 1.0)
        inQuadrant += 1
    }
    inQuadrant //return the number of points within the circle
  }

  def receive = {
    //Starts the calculation then returns the result
    case Work(iterations) => sender ! Result(calculatePi(iterations))
  }
}

//Listener
class MCpi_Listener extends Actor { //Receives and prints the final result
  def receive = {
    case PiApprox(pi, duration) =>
      //Print the results
      println("\n\tPi approximation: \t\t%s\n\tCalculation time: \t%s".format(pi, duration))
      //Print to a CSV file
      val pw: FileWriter = new FileWriter("../../../..//Results/Scala_Results.csv", true)
      pw.append(duration.toString())
      pw.append("\n")
      pw.close()
      context.system.terminate()
  }
}
The plain Scala sequential version is
import java.io.FileWriter

object MCpi {
  def main(args: Array[String]): Unit = {
    //Define the number of iterations to perform
    val iterations = args(0).toInt
    val resultsPath = args(1)
    //Get the current time
    val start = System.currentTimeMillis()
    // Create random number generator
    val r = scala.util.Random
    //Store number of points within circle
    var inQuadrant: Int = 0
    for (i <- 0 to iterations) {
      //Generate random point
      val X = r.nextFloat()
      val Y = r.nextFloat()
      //Determine whether or not the point is within the circle
      if (((X * X) + (Y * Y)) < 1.0)
        inQuadrant += 1
    }
    //Calculate pi
    val pi = (4.0 * inQuadrant) / iterations
    //Get the total time
    val time = System.currentTimeMillis() - start
    //Output values
    println("Number of Iterations: " + iterations)
    println("Pi has been calculated as: " + pi)
    println("Total time taken: " + time + " (Milliseconds)")
    //Print to a CSV file
    val pw: FileWriter = new FileWriter(resultsPath + "/Scala_Results.csv", true)
    pw.append(time.toString())
    pw.append("\n")
    pw.close()
  }
}
Any suggestions as to why this is happening or how I can improve performance would be very welcome.
Edit: I'd like to thank all of you for your answers. This is my first question on this site, and all the answers are extremely helpful; I have plenty to look into now :)
You have a synchronisation issue around the Random instance you're using.
More specifically, this line
val r = scala.util.Random // Create random number generator
actually doesn't "create a random number generator", but picks up the singleton object that scala.util conveniently offers you. This means that all threads will share it, and will synchronise around its seed (see the code of java.util.Random.nextFloat for more info).
Simply by changing that line to
val r = new scala.util.Random // Create random number generator
you should get some parallelisation speed-up. As stated in the comments, the speed-up will depend on your architecture, etc. etc., but at least it will not be so badly biased by strong synchronisation.
Note that java.util.Random will use System.nanoTime as the seed of a newly created Random, so you need not worry about randomisation issues.
I think it's a great question worth digging into. Using the Akka actor system does come with some overhead, so I expect a performance gain to be seen only when the scale is large enough. I test-ran your two versions (non-Akka vs Akka) with minimal code changes. At 1 million or 10 million hits, as expected, there is hardly any performance difference, regardless of Akka vs non-Akka or the number of workers used. But at 100 million hits, you can see a consistent performance difference.
Besides scaling up the total hits to 100 million, the only code change I made was replacing scala.util.Random with java.util.concurrent.ThreadLocalRandom:
//val r = scala.util.Random // Create random number generator
def r = ThreadLocalRandom.current
...
//Generate random point
//val X = r.nextFloat()
//val Y = r.nextFloat()
val X = r.nextDouble(0.0, 1.0)
val Y = r.nextDouble(0.0, 1.0)
This was all done on an old MacBook Pro with a 2GHz quadcore CPU and 8GB of memory. Here are the test-run results at 100 million total hits:
Non-Akka app takes ~1720 ms
Akka app with 2 workers takes ~770 ms
Akka app with 4 workers takes ~430 ms
Individual test-runs below ...
Non-Akka
$ sbt "runMain calcpi.MCpi 100000000 /tmp"
[info] Loading project definition from /Users/leo/projects/scala/test/akka-calculate-pi/project
[info] Set current project to Akka Pi Calculation (in build file:/Users/leo/projects/scala/test/akka-calculate-pi/)
[info] Running calcpi.MCpi 100000000 /tmp
Number of Iterations: 100000000
Pi has been calculated as: 3.1415916
Total time taken: 1722 (Milliseconds)
[success] Total time: 2 s, completed Jan 20, 2017 3:26:20 PM
$ sbt "runMain calcpi.MCpi 100000000 /tmp"
[info] Loading project definition from /Users/leo/projects/scala/test/akka-calculate-pi/project
[info] Set current project to Akka Pi Calculation (in build file:/Users/leo/projects/scala/test/akka-calculate-pi/)
[info] Running calcpi.MCpi 100000000 /tmp
Number of Iterations: 100000000
Pi has been calculated as: 3.14159724
Total time taken: 1715 (Milliseconds)
[success] Total time: 2 s, completed Jan 20, 2017 3:28:17 PM
Using Akka
Number of Workers = 4:
$ sbt "runMain calcpi.MCpi 100000000 /tmp"
[info] Loading project definition from /Users/leo/projects/scala/test/akka-calculate-pi/project
[info] Set current project to Akka Pi Calculation (in build file:/Users/leo/projects/scala/test/akka-calculate-pi/)
[info] Running calcpi.MCpi 100000000 /tmp
Starting Master
Pi approximation: 3.14110116
Calculation time: 423.0
[success] Total time: 1 s, completed Jan 20, 2017 3:35:25 PM
$ sbt "runMain calcpi.MCpi 100000000 /tmp"
[info] Loading project definition from /Users/leo/projects/scala/test/akka-calculate-pi/project
[info] Set current project to Akka Pi Calculation (in build file:/Users/leo/projects/scala/test/akka-calculate-pi/)
[info] Running calcpi.MCpi 100000000 /tmp
Starting Master
Pi approximation: 3.14181316
Calculation time: 440.0
[success] Total time: 1 s, completed Jan 20, 2017 3:35:34 PM
Number of Workers = 2:
$ sbt "runMain calcpi.MCpi 100000000 /tmp"
[info] Loading project definition from /Users/leo/projects/scala/test/akka-calculate-pi/project
[info] Set current project to Akka Pi Calculation (in build file:/Users/leo/projects/scala/test/akka-calculate-pi/)
[info] Running calcpi.MCpi 100000000 /tmp
Starting Master
Pi approximation: 3.14162344
Calculation time: 766.0
[success] Total time: 2 s, completed Jan 20, 2017 3:36:34 PM
$ sbt "runMain calcpi.MCpi 100000000 /tmp"
[info] Loading project definition from /Users/leo/projects/scala/test/akka-calculate-pi/project
[info] Set current project to Akka Pi Calculation (in build file:/Users/leo/projects/scala/test/akka-calculate-pi/)
[info] Running calcpi.MCpi 100000000 /tmp
Starting Master
Pi approximation: 3.14182148
Calculation time: 787.0
[success] Total time: 2 s, completed Jan 20, 2017 3:36:43 PM
I think your issue is caused by executing heavy calculations in the body of the receive function. It may be the case that some of them run on one thread, so you are just adding actor-system overhead to your standard single-threaded computation, thus making it slower. From the Akka documentation:
Behind the scenes Akka will run sets of actors on sets of real threads, where typically many actors share one thread, and subsequent invocations of one actor may end up being processed on different threads. Akka ensures that this implementation detail does not affect the single-threadedness of handling the actor’s state.
I am not sure if this is the case, but you may try running your computation in a Future:
Future {
  //your code
}
To make it work you need to provide an implicit execution context. You can do this in many ways, but two are the easiest:
Import the global execution context
Import the execution context of the actor:
import context.dispatcher
The second one has to be used inside your actor class body.

Measure elapsed time in OS X

I need to measure elapsed time, in order to know when a certain period of time has been exceeded.
I used to use Ticks() and Microseconds() for this, but both functions are now deprecated.
CFAbsoluteTimeGetCurrent is not the right function to use, because it may run backwards, as explained in the docs:
Repeated calls to this function do not guarantee monotonically
increasing results. The system time may decrease due to
synchronization with external time references or due to an explicit
user change of the clock.
What else is there that's not deprecated and fairly future-proof?
One way, as explained in Q&A 1398, is to use mach_absolute_time as follows:
static mach_timebase_info_data_t sTimebaseInfo;
mach_timebase_info(&sTimebaseInfo); // Determines the time scale
uint64_t t1 = mach_absolute_time();
...
uint64_t t2 = mach_absolute_time();
uint64_t elapsedNano = (t2-t1) * sTimebaseInfo.numer / sTimebaseInfo.denom;
This may not be fool-proof either, though. The values could overflow in some cases, as pointed out in this answer.
Use NSTimeInterval:
Used to specify a time interval, in seconds.
Example:
- (void)loop {
    NSDate *startTime = [NSDate date];
    sleep(90); // sleep for 90 seconds
    [self elapsedTime:startTime];
}

- (void)elapsedTime:(NSDate *)startTime {
    NSTimeInterval elapsedTime = fabs([startTime timeIntervalSinceNow]);
    int intSeconds = (int) elapsedTime;
    int intMinutes = intSeconds / 60;
    intSeconds = intSeconds % 60;
    NSLog(@"Elapsed Time: %d minute(s) %d seconds", intMinutes, intSeconds);
}
Result:
Elapsed Time: 1 minute(s) 29 seconds
It's unclear what type of precision you are looking for, although NSTimeInterval can accommodate fractions of a second (e.g. tenths, hundredths, thousandths, etc.).

Memoization done, what now?

I was trying to solve a puzzle in Haskell and had written the following code:
u 0 p = 0.0
u 1 p = 1.0
u n p = 1.0 + minimum [((1.0-q)*(s k p)) + (u (n-k) p) | k <-[1..n], let q = (1.0-p)**(fromIntegral k)]
s 1 p = 0.0
s n p = 1.0 + minimum [((1.0-q)*(s (n-k) p)) + q*((s k p) + (u (n-k) p)) | k <-[1..(n-1)], let q = (1.0-(1.0-p)**(fromIntegral k))/(1.0-(1.0-p)**(fromIntegral n))]
This code was terribly slow though. I suspect the reason for this is that the same things get calculated again and again. I therefore made a memoized version:
import Data.Array

memoUa = array (0,10000) ((0,0.0):(1,1.0):[(k,mua k) | k<- [2..10000]])
mua n = (1.0) + minimum [((1.0-q)*(memoSa ! k)) + (memoUa ! (n-k)) | k <-[1..n], let q = (1.0-0.02)**(fromIntegral k)]
memoSa = array (0,10000) ((0,0.0):(1,0.0):[(k,msa k) | k<- [2..10000]])
msa n = (1.0) + minimum [((1.0-q) * (memoSa ! (n-k))) + q*((memoSa ! k) + (memoUa ! (n-k))) | k <-[1..(n-1)], let q = (1.0-(1.0-0.02)**(fromIntegral k))/(1.0-(1.0-0.02)**(fromIntegral n))]
This seems to be a lot faster, but now I get an out-of-memory error. I do not understand why this happens (the same strategy in Java, without recursion, has no problems). Could somebody point me in the right direction on how to improve this code?
EDIT: I am adding my Java version here (as I don't know where else to put it). I realize that the code isn't really reader-friendly (no meaningful names, etc.), but I hope it is clear enough.
public class Main {
    public static double calc(double p) {
        double[] u = new double[10001];
        double[] s = new double[10001];
        u[0] = 0.0;
        u[1] = 1.0;
        s[0] = 0.0;
        s[1] = 0.0;
        for (int n = 2; n < 10001; n++) {
            double q = 1.0;
            double denom = 1.0;
            for (int k = 1; k <= n; k++) {
                denom = denom * (1.0 - p);
            }
            denom = 1.0 - denom;
            s[n] = (double) n;
            u[n] = (double) n;
            for (int k = 1; k <= n; k++) {
                q = (1.0 - p) * q;
                if (k < n) {
                    double qs = (1.0 - q) / denom;
                    double bs = (1.0 - qs) * s[n - k] + qs * (s[k] + u[n - k]) + 1.0;
                    if (bs < s[n]) {
                        s[n] = bs;
                    }
                }
                double bu = (1.0 - q) * s[k] + 1.0 + u[n - k];
                if (bu < u[n]) {
                    u[n] = bu;
                }
            }
        }
        return u[10000];
    }

    public static void main(String[] args) {
        double s = 0.0;
        int i = 2;
        //for (int i = 1; i<51; i++) {
        s = s + calc(i * 0.01);
        //}
        System.out.println("result = " + s);
    }
}
I don't run out of memory when I run the compiled version, but there is a significant difference between how the Java version works and how the Haskell version works which I'll illustrate here.
The first thing to do is to add some important type signatures. In particular, you don't want Integer array indices, so I added:
memoUa :: Array Int Double
memoSa :: Array Int Double
I found these using ghc-mod check. I also added a main so that you can run it from the command line:
import System.Environment

main = do
  (arg:_) <- getArgs
  let n = read arg
  print $ mua n
Now to gain some insight into what's going on, we can compile the program using profiling:
ghc -O2 -prof memo.hs
Then when we invoke the program like this:
memo 1000 +RTS -s
we will get profiling output which looks like:
164.31333233347755
98,286,872 bytes allocated in the heap
29,455,360 bytes copied during GC
657,080 bytes maximum residency (29 sample(s))
38,260 bytes maximum slop
3 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 161 colls, 0 par 0.03s 0.03s 0.0002s 0.0011s
Gen 1 29 colls, 0 par 0.03s 0.03s 0.0011s 0.0017s
INIT time 0.00s ( 0.00s elapsed)
MUT time 0.21s ( 0.21s elapsed)
GC time 0.06s ( 0.06s elapsed)
RP time 0.00s ( 0.00s elapsed)
PROF time 0.00s ( 0.00s elapsed)
EXIT time 0.00s ( 0.00s elapsed)
Total time 0.27s ( 0.27s elapsed)
%GC time 21.8% (22.3% elapsed)
Alloc rate 468,514,624 bytes per MUT second
Productivity 78.2% of total user, 77.3% of total elapsed
Important things to pay attention to are:
maximum residency
Total time
%GC time (or Productivity)
Maximum residency is a measure of how much memory is needed by the program. %GC time is the proportion of time spent in garbage collection, and Productivity is the complement (100% - %GC time).
If you run the program for various input values you will see a productivity of around 80%:
n Max Res. Prod. Time Output
2000 779,076 79.4% 1.10s 328.54535361588535
4000 1,023,016 80.7% 4.41s 657.0894961398351
6000 1,299,880 81.3% 9.91s 985.6071032981068
8000 1,539,352 81.5% 17.64s 1314.0968411684714
10000 1,815,600 81.7% 27.57s 1642.5891214360522
This means that about 20% of the run time is spent in garbage collection. Also, we see increasing memory usage as n increases.
It turns out we can dramatically improve productivity and memory usage by telling Haskell the order in which to evaluate the array elements instead of relying on lazy evaluation:
import Control.Monad (forM_)

main = do
  (arg:_) <- getArgs
  let n = read arg
  forM_ [1..n] $ \i -> mua i `seq` return ()
  print $ mua n
And the new profiling stats are:
n Max Res. Prod. Time Output
2000 482,800 99.3% 1.31s 328.54535361588535
4000 482,800 99.6% 5.88s 657.0894961398351
6000 482,800 99.5% 12.09s 985.6071032981068
8000 482,800 98.1% 21.71s 1314.0968411684714
10000 482,800 96.1% 34.58s 1642.5891214360522
Some interesting observations here: productivity is up, memory usage is down (constant now over the range of inputs), but run time is up. This suggests that we forced more computations than we needed to. In an imperative language like Java you have to give an evaluation order, so you know exactly which computations need to be performed. It would be interesting to see your Java code to see which computations it is performing.
