Akka - worse performance with more actors

I'm trying out some parallel programming with Scala and Akka, which I'm new to. I've got a pretty simple Monte Carlo Pi application (it approximates pi by sampling random points and counting those that fall inside a quarter circle), which I've built in several languages. However, the performance of the version I've built with Akka is puzzling me.
I have a sequential version written in pure Scala that tends to take roughly 400ms to complete.
With 1 worker actor the Akka version takes around 300-350 ms; however, as I increase the number of actors, that time increases dramatically. With 4 actors the time can be anywhere from 500 ms all the way up to 1200 ms or higher.
The iterations are divided up between the worker actors, so ideally performance should improve as more are added; currently it gets significantly worse.
My code is:
import java.io.FileWriter
import akka.actor.{Actor, ActorSystem, Props}
import akka.routing.RoundRobinPool

object MCpi {
  //Declare initial values
  val numWorkers = 2
  val numIterations = 10000000
  //Declare messages that will be sent to actors
  sealed trait PiMessage
  case object Calculate extends PiMessage
  case class Work(iterations: Int) extends PiMessage
  case class Result(value: Int) extends PiMessage
  case class PiApprox(pi: Double, duration: Double)
  //Main method
  def main(args: Array[String]): Unit = {
    val system = ActorSystem("MCpi_System") //Create Akka system
    val master = system.actorOf(Props(new MCpi_Master(numWorkers, numIterations))) //Create master actor
    println("Starting Master")
    master ! Calculate //Run calculation
  }
}
//Master
class MCpi_Master(numWorkers: Int, numIterations: Int) extends Actor {
  import MCpi._
  var pi: Double = _ //Store pi
  var quadSum: Int = _ //The total number of points inside the quadrant
  var numResults: Int = _ //Number of results returned
  val startTime: Double = System.currentTimeMillis() //Calculation start time
  //Create a group of worker actors behind a round-robin router
  val workerRouter = context.actorOf(
    Props[MCpi_Worker].withRouter(RoundRobinPool(numWorkers)), name = "workerRouter")
  val listener = context.actorOf(Props[MCpi_Listener], name = "listener")
  def receive = {
    //Tell the workers to start the calculation.
    //Each worker is sent a message with the number of iterations it is to perform;
    //the iterations are split evenly between the workers.
    case Calculate =>
      for (i <- 0 until numWorkers) workerRouter ! Work(numIterations / numWorkers)
    //Receive the results from the workers
    case Result(value) =>
      //Add up the total number of points in the circle from each worker
      quadSum += value
      //Count the number of results received; this should be 1 per worker
      numResults += 1
      if (numResults == numWorkers) { //Once all results have been collected
        //Calculate pi
        pi = (4.0 * quadSum) / numIterations
        //Send the results to the listener to output
        listener ! PiApprox(pi, duration = System.currentTimeMillis - startTime)
        context.stop(self)
      }
  }
}
//Worker
class MCpi_Worker extends Actor {
  import MCpi._
  //Performs the calculation
  def calculatePi(iterations: Int): Int = {
    val r = scala.util.Random // Create random number generator
    var inQuadrant: Int = 0 //Store the number of points within the circle
    for (i <- 0 until iterations) {
      //Generate a random point
      val X = r.nextFloat()
      val Y = r.nextFloat()
      //Determine whether or not the point is within the circle
      if (((X * X) + (Y * Y)) < 1.0)
        inQuadrant += 1
    }
    inQuadrant //Return the number of points within the circle
  }
  def receive = {
    //Starts the calculation, then returns the result
    case Work(iterations) => sender ! Result(calculatePi(iterations))
  }
}
//Listener
class MCpi_Listener extends Actor { //Receives and prints the final result
  import MCpi._
  def receive = {
    case PiApprox(pi, duration) =>
      //Print the results
      println("\n\tPi approximation: \t\t%s\n\tCalculation time: \t%s".format(pi, duration))
      //Print to a CSV file
      val pw: FileWriter = new FileWriter("../../../..//Results/Scala_Results.csv", true)
      pw.append(duration.toString())
      pw.append("\n")
      pw.close()
      context.system.terminate()
  }
}
The plain Scala sequential version is:
import java.io.FileWriter

object MCpi {
  def main(args: Array[String]): Unit = {
    //Define the number of iterations to perform
    val iterations = args(0).toInt
    val resultsPath = args(1)
    //Get the current time
    val start = System.currentTimeMillis()
    //Create random number generator
    val r = scala.util.Random
    //Store the number of points within the circle
    var inQuadrant: Int = 0
    for (i <- 0 until iterations) {
      //Generate a random point
      val X = r.nextFloat()
      val Y = r.nextFloat()
      //Determine whether or not the point is within the circle
      if (((X * X) + (Y * Y)) < 1.0)
        inQuadrant += 1
    }
    //Calculate pi
    val pi = (4.0 * inQuadrant) / iterations
    //Get the total time
    val time = System.currentTimeMillis() - start
    //Output values
    println("Number of Iterations: " + iterations)
    println("Pi has been calculated as: " + pi)
    println("Total time taken: " + time + " (Milliseconds)")
    //Print to a CSV file
    val pw: FileWriter = new FileWriter(resultsPath + "/Scala_Results.csv", true)
    pw.append(time.toString())
    pw.append("\n")
    pw.close()
  }
}
Any suggestions as to why this is happening or how I can improve performance would be very welcome.
Edit: I'd like to thank all of you for your answers. This is my first question on this site and all the answers are extremely helpful; I have plenty to look into now :)

You have a synchronisation issue around the Random instance you're using.
More specifically, this line
val r = scala.util.Random // Create random number generator
actually doesn't "create a random number generator", but picks up the singleton object that scala.util conveniently offers you. This means that all threads will share it, and will synchronise around its seed (see the code of java.util.Random.nextFloat for more info).
Simply by changing that line to
val r = new scala.util.Random // Create random number generator
you should get some parallelisation speed-up. As stated in the comments, the speed-up will depend on your architecture, etc. etc., but at least it will not be so badly biased by strong synchronisation.
Note that java.util.Random will use System.nanoTime() as the seed of a newly created Random, so you need not worry about randomisation issues.
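The singleton nature of the object is easy to verify with a few lines (a standalone sketch; the object name is mine):

```scala
import scala.util.Random

object SingletonCheck {
  def main(args: Array[String]): Unit = {
    // Without `new`, every reference picks up the same singleton object,
    // so all worker threads share one generator (and its seed).
    val shared = scala.util.Random
    val alsoShared = scala.util.Random
    println(shared eq alsoShared) // true: one shared generator

    // With `new`, each worker gets its own independent generator,
    // so there is no shared state to synchronise on.
    val own = new Random()
    val another = new Random()
    println(own eq another) // false: independent instances
  }
}
```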

I think it's a great question worth digging into. Using the Akka actor system does come with some overhead, so I expect a performance gain only when the scale is large enough. I test-ran your two versions (non-Akka vs Akka) with minimal code changes. At 1 million or 10 million hits there is, as expected, hardly any performance difference, regardless of Akka vs non-Akka or the number of workers used. But at 100 million hits you can see a consistent performance difference.
Besides scaling up the total hits to 100 million, the only code change I made was replacing scala.util.Random with java.util.concurrent.ThreadLocalRandom:
//val r = scala.util.Random // Create random number generator
def r = ThreadLocalRandom.current
...
//Generate random point
//val X = r.nextFloat()
//val Y = r.nextFloat()
val X = r.nextDouble(0.0, 1.0)
val Y = r.nextDouble(0.0, 1.0)
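Pulled out of the actor for illustration, the modified sampling loop looks roughly like this (a standalone sketch; names are mine):

```scala
import java.util.concurrent.ThreadLocalRandom

object TlrPi {
  // ThreadLocalRandom.current() returns a per-thread generator, so
  // concurrent worker actors never contend on a shared seed.
  def countInQuadrant(iterations: Int): Int = {
    var inQuadrant = 0
    var i = 0
    while (i < iterations) {
      val x = ThreadLocalRandom.current().nextDouble(0.0, 1.0)
      val y = ThreadLocalRandom.current().nextDouble(0.0, 1.0)
      if (x * x + y * y < 1.0) inQuadrant += 1
      i += 1
    }
    inQuadrant
  }

  def main(args: Array[String]): Unit =
    println(4.0 * countInQuadrant(10000000) / 10000000) // prints an approximation of pi
}
```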
This was all done on an old MacBook Pro with a 2GHz quadcore CPU and 8GB of memory. Here are the test-run results at 100 million total hits:
Non-Akka app takes ~1720 ms
Akka app with 2 workers takes ~770 ms
Akka app with 4 workers takes ~430 ms
Individual test-runs below ...
Non-Akka
$ sbt "runMain calcpi.MCpi 100000000 /tmp"
[info] Loading project definition from /Users/leo/projects/scala/test/akka-calculate-pi/project
[info] Set current project to Akka Pi Calculation (in build file:/Users/leo/projects/scala/test/akka-calculate-pi/)
[info] Running calcpi.MCpi 100000000 /tmp
Number of Iterations: 100000000
Pi has been calculated as: 3.1415916
Total time taken: 1722 (Milliseconds)
[success] Total time: 2 s, completed Jan 20, 2017 3:26:20 PM
$ sbt "runMain calcpi.MCpi 100000000 /tmp"
[info] Loading project definition from /Users/leo/projects/scala/test/akka-calculate-pi/project
[info] Set current project to Akka Pi Calculation (in build file:/Users/leo/projects/scala/test/akka-calculate-pi/)
[info] Running calcpi.MCpi 100000000 /tmp
Number of Iterations: 100000000
Pi has been calculated as: 3.14159724
Total time taken: 1715 (Milliseconds)
[success] Total time: 2 s, completed Jan 20, 2017 3:28:17 PM
Using Akka
Number of Workers = 4:
$ sbt "runMain calcpi.MCpi 100000000 /tmp"
[info] Loading project definition from /Users/leo/projects/scala/test/akka-calculate-pi/project
[info] Set current project to Akka Pi Calculation (in build file:/Users/leo/projects/scala/test/akka-calculate-pi/)
[info] Running calcpi.MCpi 100000000 /tmp
Starting Master
Pi approximation: 3.14110116
Calculation time: 423.0
[success] Total time: 1 s, completed Jan 20, 2017 3:35:25 PM
$ sbt "runMain calcpi.MCpi 100000000 /tmp"
[info] Loading project definition from /Users/leo/projects/scala/test/akka-calculate-pi/project
[info] Set current project to Akka Pi Calculation (in build file:/Users/leo/projects/scala/test/akka-calculate-pi/)
[info] Running calcpi.MCpi 100000000 /tmp
Starting Master
Pi approximation: 3.14181316
Calculation time: 440.0
[success] Total time: 1 s, completed Jan 20, 2017 3:35:34 PM
Number of Workers = 2:
$ sbt "runMain calcpi.MCpi 100000000 /tmp"
[info] Loading project definition from /Users/leo/projects/scala/test/akka-calculate-pi/project
[info] Set current project to Akka Pi Calculation (in build file:/Users/leo/projects/scala/test/akka-calculate-pi/)
[info] Running calcpi.MCpi 100000000 /tmp
Starting Master
Pi approximation: 3.14162344
Calculation time: 766.0
[success] Total time: 2 s, completed Jan 20, 2017 3:36:34 PM
$ sbt "runMain calcpi.MCpi 100000000 /tmp"
[info] Loading project definition from /Users/leo/projects/scala/test/akka-calculate-pi/project
[info] Set current project to Akka Pi Calculation (in build file:/Users/leo/projects/scala/test/akka-calculate-pi/)
[info] Running calcpi.MCpi 100000000 /tmp
Starting Master
Pi approximation: 3.14182148
Calculation time: 787.0
[success] Total time: 2 s, completed Jan 20, 2017 3:36:43 PM

I think your issue is caused by executing the heavy calculations in the body of the receive function; it may be that they all end up running on one thread, so you are just adding actor-system overhead to your standard single-threaded computation, thus making it slower. From the Akka documentation:
Behind the scenes Akka will run sets of actors on sets of real threads, where typically many actors share one thread, and subsequent invocations of one actor may end up being processed on different threads. Akka ensures that this implementation detail does not affect the single-threadedness of handling the actor’s state.
I am not sure if this is the case, but you may try running your computation in a Future:
Future {
//your code
}
To make it work you need to provide an implicit execution context. You can do this in many ways, but two are the easiest:
Import the global execution context
Import the execution context of the actor:
import context.dispatcher
The second one has to be used inside your actor class body.
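Outside an actor, the shape of the idea is the following (a self-contained sketch using the global execution context and a placeholder workload; inside an actor you would use context.dispatcher for the implicit execution context and pipe the Future's result back with akka.pattern.pipe rather than blocking):

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

object FutureOffload {
  // Placeholder for the heavy Monte Carlo loop: it runs on a pool thread,
  // leaving the caller (the actor's message handler) free.
  def heavySum(n: Int): Long = (1L to n).sum

  def main(args: Array[String]): Unit = {
    val result: Future[Long] = Future { heavySum(1000000) }
    // Await is only for this demo; don't block inside an actor.
    println(Await.result(result, 10.seconds)) // prints 500000500000
  }
}
```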

Related

Faster HashMap for sequential keys

Initially I was very surprised to find out that Rust's HashMap, even with the FNV hasher, was considerably slower than the equivalents in Java, .NET, and PHP. I am talking about optimized Release mode, not Debug mode. I did some calculations and realized the timings in Java/.NET/PHP were suspiciously low. Then it hit me - even though I was testing with a big hash table (millions of entries), I was reading mostly sequential key values (like 14, 15, 16, ...), which apparently resulted in lots of CPU cache hits, due to the way the standard hash tables (and hash-code functions for integers and short strings) in those languages are implemented, so that entries with nearby keys usually end up in nearby memory locations.
Rust's HashMap, on the other hand, uses the so called SwissTable implementation, which apparently distributes values differently. When I tested reading by random keys, everything fell into place - the "competitors" scored behind Rust.
So if we are in a situation where we need to perform lots of gets sequentially, for example iterating some DB IDs that are ordered and mostly sequential (with not too many gaps), is there a good Rust hash map implementation that can compete with Java's HashMap or .NET's Dictionary?
P.S. As requested in the comments, I paste an example here. I ran lots of tests, but here is a simple example that takes 75 ms in Rust (release mode) and 20 ms in Java:
In Rust:
let hm: FnvHashMap<i32, i32> = ...;
// Start timer here
let mut sum: i64 = 0;
for i in 0..1_000_000 {
if let Some(x) = hm.get(&i) {
sum += *x as i64;
}
}
println!("The sum is: {}", sum);
In Java:
Map<Integer, Integer> hm = ...;
// Start timer here
long sum = 0;
for (int i = 0; i < 1_000_000; i++) {
sum += hm.get(i);
}
With HashMap<i32, i32> and its default SipHash hasher it took 190 ms. I know why it's slower than FnvHashMap. I'm just mentioning that for completeness.
First, here is some runnable code to measure the efficiency of the different implementations:
use std::{collections::HashMap, time::Instant};

fn main() {
    let hm: HashMap<i32, i32> = (0..1_000_000).map(|i| (i, i)).collect();
    let t0 = Instant::now();
    // Accumulate into i64: the sum of 0..1_000_000 overflows i32.
    let mut sum: i64 = 0;
    for i in 0..1_000_000 {
        if let Some(x) = hm.get(&i) {
            sum += *x as i64;
        }
    }
    let elapsed = t0.elapsed().as_secs_f64();
    println!("{} - The sum is: {}", elapsed, sum);
}
On the old desktop machine I'm writing this on, it reports 76 ms to run. Since the machine is 10+ years old, I find it baffling that your hardware would take 190 ms to run the same code, so I'm wondering how and what you're actually measuring. But let's ignore that and concentrate on the relative numbers.
When you care about hashmap efficiency in Rust, and when the keys don't come from an untrusted source, the first thing to try should always be to switch to a non-DOS-resistant hash function. One possibility is the FNV hash function from the fnv crate, which you can get by switching HashMap to fnv::FnvHashMap. That brings performance to 34 ms, i.e. a 2.2x speedup.
If this is not enough, you can try the hash from the rustc-hash crate (almost the same as fxhash, but allegedly better maintained), which uses the same function as the Rust compiler, adapted from the hash used by Firefox. Not based on any formal analysis, it performs badly on hash function test suites, but is reported to consistently outperform FNV. That's confirmed on the above example, where switching from FnvHashMap to rustc_hash::FxHashMap drops the time to 28 ms, i.e. a 2.7x speedup from the initial timing.
Finally, if you want to just imitate what C# and Java do, and don't care that certain patterns of inserted numbers can lead to degraded performance, you can use the aptly-named nohash_hasher crate that gives you an identity hash. Changing HashMap<i32, i32> to HashMap<i32, i32, nohash_hasher::BuildNoHashHasher<i32>> drops the time to just under 4 ms, i.e. a whopping 19x speedup from the initial timing.
Since you report the Java example to be 9.5x faster than Rust, a 19x speedup should make your code approximately twice as fast as Java.
Rust's HashMap by default uses an implementation of SipHash as the hash function. SipHash is designed to avoid denial-of-service attacks based on predicting hash collisions, which is an important security property for a hash function used in a hash map.
If you don't need this guarantee, you can use a simpler hash function. One option is using the fxhash crate, which should speed up reading integers from a HashMap<i32, i32> by about a factor of 3.
Other options are implementing your own trivial hash function (e.g. by simply using the identity function, which is a decent hash function for mostly consecutive keys), or using a vector instead of a hash map.
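The vector alternative is worth spelling out: with dense, mostly-sequential integer keys, a flat array is the degenerate hash map, with an identity hash and a lookup that is a single index with perfect cache locality. The point is language-agnostic; a quick sketch of the access pattern (in Scala rather than Rust, purely for illustration):

```scala
object DenseKeys {
  // Build a "map" from key i to value i as a flat array and sum the
  // values for keys 0 until n, mirroring the benchmark loop above.
  def sumByIndex(n: Int): Long = {
    val values = Array.tabulate(n)(i => i)
    var sum = 0L
    var i = 0
    while (i < n) {
      sum += values(i) // the "lookup" is just an index
      i += 1
    }
    sum
  }

  def main(args: Array[String]): Unit =
    println(sumByIndex(1000000)) // prints 499999500000, matching the hash-map runs
}
```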
.NET uses the identity function for hashes of Int32 by default, so it's not resistant to hash flooding attacks. Of course this is faster, but the downside is not even mentioned in the documentation of Dictionary. For what it's worth, I prefer Rust's "safe by default" approach over .NET's any day, since many developers aren't even aware of the problems predictable hash functions can cause. Rust still allows you to use a more performant hash function if you don't need the hash flooding protection, so to me personally this seems to be a strength of Rust compared to at least .NET rather than a weakness.
I decided to run some more tests, based on the suggestions by user4815162342. This time I used another machine with Ubuntu 20.04.
Rust code
println!("----- HashMap (with its default SipHash hasher) -----------");
let hm: HashMap<i32, i32> = (0..1_000_000).map(|i| (i, i)).collect();
for k in 0..6 {
let t0 = Instant::now();
let mut sum: i64 = 0;
for i in 0..1_000_000 {
if let Some(x) = hm.get(&i) {
sum += *x as i64;
}
}
let elapsed = t0.elapsed().as_secs_f64();
println!("The sum is: {}. Time elapsed: {:.3} sec", sum, elapsed);
}
println!("----- FnvHashMap (fnv 1.0.7) ------------------------------");
let hm: FnvHashMap<i32, i32> = (0..1_000_000).map(|i| (i, i)).collect();
for k in 0..6 {
let t0 = Instant::now();
let mut sum: i64 = 0;
for i in 0..1_000_000 {
if let Some(x) = hm.get(&i) {
sum += *x as i64;
}
}
let elapsed = t0.elapsed().as_secs_f64();
println!("The sum is: {}. Time elapsed: {:.3} sec", sum, elapsed);
}
println!("----- FxHashMap (rustc-hash 1.1.0) ------------------------");
let hm: FxHashMap<i32, i32> = (0..1_000_000).map(|i| (i, i)).collect();
for k in 0..6 {
let t0 = Instant::now();
let mut sum: i64 = 0;
for i in 0..1_000_000 {
if let Some(x) = hm.get(&i) {
sum += *x as i64;
}
}
let elapsed = t0.elapsed().as_secs_f64();
println!("The sum is: {}. Time elapsed: {:.3} sec", sum, elapsed);
}
println!("----- HashMap/BuildNoHashHasher (nohash-hasher 0.2.0) -----");
let hm: HashMap<i32, i32, nohash_hasher::BuildNoHashHasher<i32>> = (0..1_000_000).map(|i| (i, i)).collect();
for k in 0..6 {
let t0 = Instant::now();
let mut sum: i64 = 0;
for i in 0..1_000_000 {
if let Some(x) = hm.get(&i) {
sum += *x as i64;
}
}
let elapsed = t0.elapsed().as_secs_f64();
println!("The sum is: {}. Time elapsed: {:.3} sec", sum, elapsed);
}
BTW the last one can be replaced with this shorter type:
let hm: IntMap<i32, i32> = (0..1_000_000).map(|i| (i, i)).collect();
For those interested, this is IntMap's definition:
pub type IntMap<K, V> = std::collections::HashMap<K, V, BuildNoHashHasher<K>>;
Java code
On the same machine I tested a Java example. I don't have a JVM installed on it, so I used the Docker image adoptopenjdk/openjdk14 and pasted the code below directly into jshell> (not sure whether that hurts Java's timings). So this is the Java code:
Map<Integer, Integer> hm = new HashMap<>();
for (int i = 0; i < 1_000_000; i++) {
    hm.put(i, i);
}
for (int k = 0; k < 6; k++) {
    long t0 = System.currentTimeMillis(); // Start timer here
    long sum = 0;
    for (int i = 0; i < 1_000_000; i++) {
        sum += hm.get(i);
    }
    System.out.println("The sum is: " + sum + ". Time elapsed: " + (System.currentTimeMillis() - t0) + " ms");
}
Results
Rust (release mode):
----- HashMap (with its default SipHash hasher) -----------
The sum is: 499999500000. Time elapsed: 0.149 sec
The sum is: 499999500000. Time elapsed: 0.140 sec
The sum is: 499999500000. Time elapsed: 0.167 sec
The sum is: 499999500000. Time elapsed: 0.150 sec
The sum is: 499999500000. Time elapsed: 0.261 sec
The sum is: 499999500000. Time elapsed: 0.189 sec
----- FnvHashMap (fnv 1.0.7) ------------------------------
The sum is: 499999500000. Time elapsed: 0.055 sec
The sum is: 499999500000. Time elapsed: 0.052 sec
The sum is: 499999500000. Time elapsed: 0.053 sec
The sum is: 499999500000. Time elapsed: 0.058 sec
The sum is: 499999500000. Time elapsed: 0.051 sec
The sum is: 499999500000. Time elapsed: 0.056 sec
----- FxHashMap (rustc-hash 1.1.0) ------------------------
The sum is: 499999500000. Time elapsed: 0.039 sec
The sum is: 499999500000. Time elapsed: 0.076 sec
The sum is: 499999500000. Time elapsed: 0.064 sec
The sum is: 499999500000. Time elapsed: 0.048 sec
The sum is: 499999500000. Time elapsed: 0.057 sec
The sum is: 499999500000. Time elapsed: 0.061 sec
----- HashMap/BuildNoHashHasher (nohash-hasher 0.2.0) -----
The sum is: 499999500000. Time elapsed: 0.004 sec
The sum is: 499999500000. Time elapsed: 0.003 sec
The sum is: 499999500000. Time elapsed: 0.003 sec
The sum is: 499999500000. Time elapsed: 0.003 sec
The sum is: 499999500000. Time elapsed: 0.003 sec
The sum is: 499999500000. Time elapsed: 0.003 sec
Java:
The sum is: 499999500000. Time elapsed: 49 ms // see notes below
The sum is: 499999500000. Time elapsed: 41 ms // see notes below
The sum is: 499999500000. Time elapsed: 18 ms
The sum is: 499999500000. Time elapsed: 29 ms
The sum is: 499999500000. Time elapsed: 19 ms
The sum is: 499999500000. Time elapsed: 23 ms
(With Java the first 1-2 runs are normally slower, as the JVM HotSpot still hasn't fully optimized the relevant piece of code.)
Try hashbrown
It uses aHash, which has a full comparison with other HashMap algorithms here

Why is the Rust random number generator slower with multiple instances running?

I am doing some random number generation for my Lotto Simulation and was wondering why would it be MUCH slower when running multiple instances?
I am running this program under Ubuntu 15.04 (linux kernel 4.2). rustc 1.7.0-nightly (d5e229057 2016-01-04)
Overall CPU utilization is about 45% during these tests, but each individual process keeps its single thread at 100%.
Here is my script I am using to start multiple instances at the same time.
#!/usr/bin/env bash
pkill lotto_sim
for _ in `seq 1 14`;
do
./lotto_sim 15000000 1>> /var/log/syslog &
done
Output:
Took PT38.701900316S seconds to generate 15000000 random tickets
Took PT39.193917241S seconds to generate 15000000 random tickets
Took PT39.412279484S seconds to generate 15000000 random tickets
Took PT39.492940352S seconds to generate 15000000 random tickets
Took PT39.715433024S seconds to generate 15000000 random tickets
Took PT39.726609237S seconds to generate 15000000 random tickets
Took PT39.884151996S seconds to generate 15000000 random tickets
Took PT40.025874106S seconds to generate 15000000 random tickets
Took PT40.088332517S seconds to generate 15000000 random tickets
Took PT40.112601899S seconds to generate 15000000 random tickets
Took PT40.205958636S seconds to generate 15000000 random tickets
Took PT40.227956170S seconds to generate 15000000 random tickets
Took PT40.393753486S seconds to generate 15000000 random tickets
Took PT40.465173616S seconds to generate 15000000 random tickets
However, a single run gives this output:
$ ./lotto_sim 15000000
Took PT9.860698141S seconds to generate 15000000 random tickets
My understanding is that each process has its own memory and doesn't share anything. Correct?
Here is the relevant code:
extern crate rand;
extern crate itertools;
extern crate time;

use std::env;
use rand::{Rng, Rand};
use itertools::Itertools;
use time::PreciseTime;

struct Ticket {
    whites: Vec<u8>,
    power_ball: u8,
    is_power_play: bool,
}

const POWER_PLAY_PERCENTAGE: u8 = 15;
const WHITE_MIN: u8 = 1;
const WHITE_MAX: u8 = 69;
const POWER_BALL_MIN: u8 = 1;
const POWER_BALL_MAX: u8 = 26;

impl Rand for Ticket {
    fn rand<R: Rng>(rng: &mut R) -> Self {
        let pp_guess = rng.gen_range(0, 100);
        let pp_value = pp_guess < POWER_PLAY_PERCENTAGE;
        let mut whites_vec: Vec<_> = (0..).map(|_| rng.gen_range(WHITE_MIN, WHITE_MAX + 1))
            .unique().take(5).collect();
        whites_vec.sort();
        let pb_value = rng.gen_range(POWER_BALL_MIN, POWER_BALL_MAX + 1);
        Ticket { whites: whites_vec, power_ball: pb_value, is_power_play: pp_value }
    }
}

fn gen_test(num_tickets: i64) {
    let mut rng = rand::thread_rng();
    let _: Vec<_> = rng.gen_iter::<Ticket>()
        .take(num_tickets as usize)
        .collect();
}

fn main() {
    let args: Vec<_> = env::args().collect();
    let num_tickets: i64 = args[1].parse::<i64>().unwrap();
    let start = PreciseTime::now();
    gen_test(num_tickets);
    let end = PreciseTime::now();
    println!("Took {} seconds to generate {} random tickets", start.to(end), num_tickets);
}
Edit:
Maybe a better question would be: how do I debug and figure this out? Where would I look, within the program or within my OS, to find the performance hindrances? I am new to Rust and to lower-level programming like this that relies so heavily on the OS.

Using Grand Central Dispatch in Swift to parallelize and speed up “for" loops?

I am trying to wrap my head around how to use GCD to parallelize and speed up Monte Carlo simulations. Most/all simple examples are presented for Objective C and I really need a simple example for Swift since Swift is my first “real” programming language.
The minimal working version of a monte carlo simulation in Swift would be something like this:
import Foundation
import Cocoa

var winner = 0
var j = 0
var i = 0
var chance = 0
var points = 0
for j=1;j<1000001;++j{
    var ability = 500
    var player1points = 0
    for i=1;i<1000;++i{
        chance = Int(arc4random_uniform(1001))
        if chance<(ability-points) {++points}
        else{points = points - 1}
    }
    if points > 0{++winner}
}
println(winner)
The code works when pasted directly into a command-line program project in Xcode 6.1.
The innermost loop cannot be parallelized because the new value of the variable "points" is used in the next iteration. But the outermost loop just runs the inner simulation 1000000 times and tallies up the results, so it should be an ideal candidate for parallelization.
So my question is how to use GCD to parallelize the outermost for loop?
A "multi-threaded iteration" can be done with dispatch_apply():
let outerCount = 100 // # of concurrent block iterations
let innerCount = 10000 // # of iterations within each block
let the_queue = dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_DEFAULT, 0);
dispatch_apply(UInt(outerCount), the_queue) { outerIdx -> Void in
    for innerIdx in 1 ... innerCount {
        // ...
    }
}
(You have to figure out the best relation between outer and inner counts.)
There are two things to notice:
arc4random() uses an internal mutex, which makes it extremely slow when called from several threads in parallel; see Performance of concurrent code using dispatch_group_async is MUCH slower than single-threaded version. From the answers given there, rand_r() (with separate seeds for each thread) seems to be a faster alternative.
The result variable winner must not be modified from multiple threads simultaneously. You can use an array instead, where each thread updates its own element, and the results are added afterwards. A thread-safe method is described in https://stackoverflow.com/a/26790019/1187415.
Then it would roughly look like this:
let outerCount = 100 // # of concurrent block iterations
let innerCount = 10000 // # of iterations within each block
let the_queue = dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_DEFAULT, 0);
var winners = [Int](count: outerCount, repeatedValue: 0)
winners.withUnsafeMutableBufferPointer { winnersPtr -> Void in
    dispatch_apply(UInt(outerCount), the_queue) { outerIdx -> Void in
        var seed = arc4random() // seed for rand_r() in this "thread"
        for innerIdx in 1 ... innerCount {
            var points = 0
            var ability = 500
            for i in 1 ... 1000 {
                let chance = Int(rand_r(&seed) % 1001)
                if chance < (ability - points) { ++points }
                else { points = points - 1 }
            }
            if points > 0 {
                winnersPtr[Int(outerIdx)] += 1
            }
        }
    }
}
// Add results:
let winner = reduce(winners, 0, +)
println(winner)
Just to update this for contemporary syntax, we now use concurrentPerform (which replaces dispatch_apply).
So we can replace
for j in 0 ..< 1_000_000 {
    for i in 0 ..< 1000 {
        ...
    }
}
With
DispatchQueue.concurrentPerform(iterations: 1_000_000) { j in
    for i in 0 ..< 1000 {
        ...
    }
}
Note, parallelizing introduces a little overhead, in both the basic GCD dispatch mechanism, as well as the synchronization of the results. If you had 32 iterations in your parallel loop this would be inconsequential, but you have a million iterations, and it will start to add up.
We generally solve this by “striding”: Rather than parallelizing 1 million iterations, you might only do 100 parallel iterations, doing 10,000 iterations each. E.g. something like:
let totalIterations = 1_000_000
let stride = 10_000
let (quotient, remainder) = totalIterations.quotientAndRemainder(dividingBy: stride)
let iterations = quotient + (remainder == 0 ? 0 : 1)
DispatchQueue.concurrentPerform(iterations: iterations) { iteration in
    for j in iteration * stride ..< min(totalIterations, (iteration + 1) * stride) {
        for i in 0 ..< 1000 {
            ...
        }
    }
}

Algorithm to order 'tag line' campaigns based on resulting sales

I want to be able to introduce new 'tag lines' into a database that are shown 'randomly' to users. (These tag lines are shown as an introduction as animated text.)
Based upon the number of sales that result from those taglines I'd like the good ones to trickle to the top, but still show the others less frequently.
I could come up with a basic algorithm quite easily, but I want something that's a little more 'statistically accurate'.
I don't really know where to start. It's been a while since I've done anything more than basic statistics. My model would need to be sensitive to tolerances, but obviously it doesn't need to be worthy of a PhD.
Edit: I am currently tracking a 'conversion rate' - i.e. hits per order. This value would probably be best calculated as a cumulative 'all-time' conversion rate to be fed into the algorithm.
Looking at your problem, I would modify the requirements a bit -
1) The most popular one should be shown most often.
2) Taglines should "age", so one that got a lot of votes (purchase) in the past, but none recently should be shown less often
3) Brand new taglines should be shown more often during their first days.
If you agree on those, then an algorithm could be something like:
START:
x = random(1, 3);
if x = 3 goto NEW else goto NORMAL
NEW:
TagVec = Taglines.filterYounger(5 days); // I'm taking a LOT of liberties with the pseudo code,,,
x = random(1, TagVec.Length);
return tagVec[x-1]; // 0 indexed vectors even in made up language,
NORMAL:
// Similar to EBGREEN's answer
sum = 0;
ForEach(TagLine in TagLines) {
  sum += TagLine.noOfPurchases;
}
x = random(1, sum);
ForEach(TagLine in TagLines) {
  x -= TagLine.noOfPurchases;
  if (x <= 0) return TagLine; // Return the TagLine our random number landed on
}
Now, as a setup, I would give every new tagline 10 purchases, to avoid a single real purchase skewing the weighting too heavily.
For the aging process I would count a purchase older than a week as 0.8 purchases per week of age. So 1 week old gives 0.8 points, 2 weeks gives 0.8 * 0.8 = 0.64, and so forth.
You would have to play around with the initial-purchases parameter (10 in my example), the aging speed (1 week here) and the aging factor (0.8 here) to find something that suits you.
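A compact sketch of that scoring scheme (the names, and the assumption that the initial credit ages along with real purchases, are mine):

```scala
object TaglineAging {
  // Assumed scheme: every tagline starts with 10 virtual purchases,
  // and each week of age multiplies its score by the aging factor 0.8.
  def score(purchases: Int, ageWeeks: Int,
            initialCredit: Double = 10.0, agingFactor: Double = 0.8): Double =
    (initialCredit + purchases) * math.pow(agingFactor, ageWeeks)

  def main(args: Array[String]): Unit = {
    println(score(0, 0)) // brand new tagline: 10.0
    println(score(0, 1)) // one week old: 8.0
    println(score(0, 2)) // two weeks old: approximately 6.4
    println(score(5, 0)) // new tagline with 5 real purchases: 15.0
  }
}
```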
I would suggest randomly choosing with a weighting factor based on previous sales. So let's say you had this:
tag1 = 1 sale
tag2 = 0 sales
tag3 = 1 sale
tag4 = 2 sales
tag5 = 3 sales
A simple weighting formula would be 1 + number of sales, so this would be the probability of selecting each tag:
tag1 = 2/12 = 16.7%
tag2 = 1/12 = 8.3%
tag3 = 2/12 = 16.7%
tag4 = 3/12 = 25%
tag5 = 4/12 = 33.3%
You could easily change the weighting formula to get just the distribution that you want.
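That selection can be sketched directly (object and method names are mine; the tag data mirrors the example above):

```scala
import scala.util.Random

object WeightedPick {
  // "1 + number of sales" weights for tag1..tag5 with 1, 0, 1, 2, 3 sales.
  val weights: Seq[(String, Int)] =
    Seq("tag1" -> 1, "tag2" -> 0, "tag3" -> 1, "tag4" -> 2, "tag5" -> 3)
      .map { case (tag, sales) => (tag, 1 + sales) }

  val total: Int = weights.map(_._2).sum // 12 for the data above

  // Draw a number in [0, total) and walk the cumulative weights until
  // it falls inside a tag's bucket.
  def pick(r: Random): String = {
    var x = r.nextInt(total)
    var i = 0
    while (x >= weights(i)._2) {
      x -= weights(i)._2
      i += 1
    }
    weights(i)._1
  }

  def main(args: Array[String]): Unit = {
    val r = new Random(1)
    val counts = Seq.fill(12000)(pick(r)).groupBy(identity).map { case (t, v) => (t, v.size) }
    println(counts) // tag5 (weight 4/12) should lead with roughly a third of the picks
  }
}
```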
You have to come up with a weighting formula based on sales.
I don't think there's any such thing as a "statistically accurate" formula here - it's all based on your preference.
No one can say "this is the correct weighting and the other weighting is wrong" because there isn't a final outcome you are attempting to model - this isn't like trying to weigh responses to a poll about an upcoming election (where you are trying to model results to represent something that will happen in the future).
Here's an example in JavaScript. Note that I'm not suggesting running this client side...
Also, there is a lot of optimization that can be done.
Note: createMemberInNormalDistribution() is implemented here Converting a Uniform Distribution to a Normal Distribution
/*
* an example set of taglines
* hits are sales
* views are times its been shown
*/
var taglines = [
{"tag":"tagline 1","hits":1,"views":234},
{"tag":"tagline 2","hits":5,"views":566},
{"tag":"tagline 3","hits":3,"views":421},
{"tag":"tagline 4","hits":1,"views":120},
{"tag":"tagline 5","hits":7,"views":200}
];
/*set up our stat model for the tags*/
var TagModel = function(set){
var hits, views, sumOfDiff, sumOfSqDiff;
hits = views = sumOfDiff = sumOfSqDiff = 0;
/*find average*/
for (n in set){
hits += set[n].hits;
views += set[n].views;
}
this.avg = hits/views;
/*find standard deviation and variance*/
for (n in set){
var diff =((set[n].hits/set[n].views)-this.avg);
sumOfDiff += diff;
sumOfSqDiff += diff*diff;
}
this.variance = sumOfSqDiff/set.length;
this.std_dev = Math.sqrt(sumOfSqDiff/set.length);
/*return tag to use; fChooser determines likelihood of tag*/
this.getTag = function(fChooser){
var m = this;
set.sort(function(a,b){
return fChooser((a.hits/a.views),(b.hits/b.views), m);
});
return set[0];
};
};
var config = {
"uniformDistribution":function(a,b,model){
return Math.random()*b-Math.random()*a;
},
"normalDistribution":function(a,b,model){
var a1 = createMemberInNormalDistribution(model.avg,model.std_dev)* a;
var b1 = createMemberInNormalDistribution(model.avg,model.std_dev)* b;
return b1-a1;
},
//say weight = 10^n... higher n is the more even the distribution will be.
"weight": .5,
"weightedDistribution":function(a,b,model){
var a1 = createMemberInNormalDistribution(model.avg,model.std_dev*config.weight)* a;
var b1 = createMemberInNormalDistribution(model.avg,model.std_dev*config.weight)* b;
return b1-a1;
}
}
var model = new TagModel(taglines);
//to use
model.getTag(config.uniformDistribution).tag;
//running 10000 times: ({'tagline 4':836, 'tagline 5':7608, 'tagline 1':100, 'tagline 2':924, 'tagline 3':532})
model.getTag(config.normalDistribution).tag;
//running 10000 times: ({'tagline 4':1775, 'tagline 5':3471, 'tagline 1':1273, 'tagline 2':1857, 'tagline 3':1624})
model.getTag(config.weightedDistribution).tag;
//running 10000 times: ({'tagline 4':1514, 'tagline 5':5045, 'tagline 1':577, 'tagline 2':1627, 'tagline 3':1237})
config.weight = 2;
model.getTag(config.weightedDistribution).tag;
//running 10000 times: {'tagline 4':1941, 'tagline 5':2715, 'tagline 1':1559, 'tagline 2':1957, 'tagline 3':1828})

Calculating frames per second in a game

What's a good algorithm for calculating frames per second in a game? I want to show it as a number in the corner of the screen. If I just look at how long it took to render the last frame the number changes too fast.
Bonus points if your answer updates each frame and doesn't converge differently when the frame rate is increasing vs decreasing.
You need a smoothed average, the easiest way is to take the current answer (the time to draw the last frame) and combine it with the previous answer.
// eg.
float smoothing = 0.9; // larger=more smoothing
measurement = (measurement * smoothing) + (current * (1.0-smoothing))
By adjusting the 0.9 / 0.1 ratio you can change the 'time constant' - that is, how quickly the number responds to changes. A larger fraction in favour of the old answer gives a slower, smoother change; a larger fraction in favour of the new answer gives a quicker-changing value. Obviously the two factors must add to one!
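The same rule in Python, as a sanity check that it converges (the constants are just the ones from the snippet above):

```python
def smooth(measurement, current, smoothing=0.9):
    # larger smoothing = slower, smoother response to changes
    return measurement * smoothing + current * (1.0 - smoothing)

# Feeding a constant 16.7 ms frame time pulls the estimate toward 16.7:
m = 0.0
for _ in range(200):
    m = smooth(m, 16.7)
```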
This is what I have used in many games.
#define MAXSAMPLES 100
int tickindex=0;
int ticksum=0;
int ticklist[MAXSAMPLES];
/* need to zero out the ticklist array before starting */
/* average will ramp up until the buffer is full */
/* returns average ticks per frame over the MAXSAMPLES last frames */
double CalcAverageTick(int newtick)
{
    ticksum -= ticklist[tickindex];  /* subtract value falling off */
    ticksum += newtick;              /* add new value */
    ticklist[tickindex] = newtick;   /* save new value so it can be subtracted later */
    if (++tickindex == MAXSAMPLES)   /* inc buffer index */
        tickindex = 0;

    /* return average */
    return ((double)ticksum / MAXSAMPLES);
}
Well, certainly
frames / sec = 1 / (sec / frame)
But, as you point out, there's a lot of variation in the time it takes to render a single frame, and from a UI perspective updating the fps value at the frame rate is not usable at all (unless the number is very stable).
What you want is probably a moving average or some sort of binning / resetting counter.
For example, you could maintain a queue data structure which held the rendering times for each of the last 30, 60, 100, or what-have-you frames (you could even design it so the limit was adjustable at run-time). To determine a decent fps approximation you can determine the average fps from all the rendering times in the queue:
fps = # of rendering times in queue / total rendering time
When you finish rendering a new frame you enqueue a new rendering time and dequeue an old rendering time. Alternately, you could dequeue only when the total of the rendering times exceeded some preset value (e.g. 1 sec). You can maintain the "last fps value" and a last updated timestamp so you can trigger when to update the fps figure, if you so desire. Though with a moving average if you have consistent formatting, printing the "instantaneous average" fps on each frame would probably be ok.
Another method would be to have a resetting counter. Maintain a precise (millisecond) timestamp, a frame counter, and an fps value. When you finish rendering a frame, increment the counter. When the counter hits a pre-set limit (e.g. 100 frames) or when the time since the timestamp has passed some pre-set value (e.g. 1 sec), calculate the fps:
fps = # frames / (current time - start time)
Then reset the counter to 0 and set the timestamp to the current time.
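The resetting-counter variant above can be sketched in Python (the class name and the injectable clock are my own choices; the clock parameter just makes it testable):

```python
import time

class ResettingFpsCounter:
    def __init__(self, interval=1.0, clock=time.monotonic):
        self.interval = interval   # seconds between fps recalculations
        self.clock = clock
        self.frames = 0
        self.start = clock()
        self.fps = 0.0

    def frame_done(self):
        # Call once per rendered frame.
        self.frames += 1
        now = self.clock()
        elapsed = now - self.start
        if elapsed >= self.interval:
            self.fps = self.frames / elapsed  # frames / (current - start)
            self.frames = 0                   # reset counter
            self.start = now                  # reset timestamp
        return self.fps
```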
Increment a counter every time you render a screen and clear that counter for some time interval over which you want to measure the frame-rate.
I.e. every 3 seconds, get counter/3 and then clear the counter.
There are at least two ways to do it:
The first is the one others have mentioned here before me.
I think it's the simplest and preferred way. You just need to keep track of:
cn: counter of how many frames you've rendered
time_start: the time when you started counting
time_now: the current time
Calculating the fps in this case is as simple as evaluating this formula:
FPS = cn / (time_now - time_start).
Then there is the uber cool way you might like to use some day:
Let's say you have 'i' frames to consider. I'll use this notation: f[0], f[1],..., f[i-1] to describe how long it took to render frame 0, frame 1, ..., frame (i-1) respectively.
Example where i = 3
|f[0] |f[1] |f[2] |
+----------+-------------+-------+------> time
Then, the mathematical definition of fps after i frames would be
(1) fps[i] = i / (f[0] + ... + f[i-1])
And the same formula but only considering i-1 frames.
(2) fps[i-1] = (i-1) / (f[0] + ... + f[i-2])
Now the trick here is to modify the right side of formula (1) in such a way that it contains the right side of formula (2), and then substitute in its left side.
Like so (you should see it more clearly if you write it on a paper):
fps[i] = i / (f[0] + ... + f[i-1])
= i / ((f[0] + ... + f[i-2]) + f[i-1])
= (i/(i-1)) / ((f[0] + ... + f[i-2])/(i-1) + f[i-1]/(i-1))
= (i/(i-1)) / (1/fps[i-1] + f[i-1]/(i-1))
= ...
= (i*fps[i-1]) / (f[i-1] * fps[i-1] + i - 1)
So according to this formula (my math deriving skills are a bit rusty though), to calculate the new fps you need to know the fps from the previous frame, the duration it took to render the last frame and the number of frames you've rendered.
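A quick Python check of the final formula against the direct definition fps[i] = i / (f[0] + ... + f[i-1]) (the frame times here are made up):

```python
def fps_incremental(prev_fps, last_frame_time, i):
    # fps[i] = (i * fps[i-1]) / (f[i-1] * fps[i-1] + i - 1)
    return (i * prev_fps) / (last_frame_time * prev_fps + i - 1)

frame_times = [0.016, 0.020, 0.017, 0.033, 0.016]  # seconds per frame
fps = 1.0 / frame_times[0]  # fps[1] = 1 / f[0]
for i in range(2, len(frame_times) + 1):
    fps = fps_incremental(fps, frame_times[i - 1], i)
# fps now equals len(frame_times) / sum(frame_times), computed incrementally
```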
This might be overkill for most people, that's why I hadn't posted it when I implemented it. But it's very robust and flexible.
It stores a Queue with the last frame times, so it can accurately calculate an average FPS value much better than just taking the last frame into consideration.
It also allows you to ignore one frame, if you are doing something that you know is going to artificially screw up that frame's time.
It also allows you to change the number of frames to store in the Queue as it runs, so you can test it out on the fly what is the best value for you.
// Number of past frames to use for FPS smooth calculation - because
// Unity's smoothedDeltaTime, well - it kinda sucks
private int frameTimesSize = 60;
// A Queue is the perfect data structure for the smoothed FPS task;
// new values in, old values out
private Queue<float> frameTimes;
// Not really needed, but used for faster updating than processing
// the entire queue every frame
private float __frameTimesSum = 0;
// Flag to ignore the next frame when performing a heavy one-time operation
// (like changing resolution)
private bool _fpsIgnoreNextFrame = false;
//=============================================================================
// Call this after doing a heavy operation that will screw up with FPS calculation
void FPSIgnoreNextFrame() {
this._fpsIgnoreNextFrame = true;
}
//=============================================================================
// Smoothed FPS counter updating
void Update()
{
if (this._fpsIgnoreNextFrame) {
this._fpsIgnoreNextFrame = false;
return;
}
// While looping here allows the frameTimesSize member to be changed dynamically
while (this.frameTimes.Count >= this.frameTimesSize) {
this.__frameTimesSum -= this.frameTimes.Dequeue();
}
while (this.frameTimes.Count < this.frameTimesSize) {
this.__frameTimesSum += Time.deltaTime;
this.frameTimes.Enqueue(Time.deltaTime);
}
}
//=============================================================================
// Public function to get smoothed FPS values
public int GetSmoothedFPS() {
return (int)(this.frameTimesSize / this.__frameTimesSum * Time.timeScale);
}
Good answers here. Just how you implement it depends on what you need it for. I prefer the running average one myself ("time = time * 0.9 + last_frame * 0.1") from the guy above.
However, I personally like to weight my average more heavily towards newer data, because in a game it is SPIKES that are the hardest to squash and thus of most interest to me. So I would use something more like a .7/.3 split, which will make a spike show up much faster (though its effect will drop off-screen faster as well... see below).
If your focus is on RENDERING time, then the .9/.1 split works pretty nicely because it tends to be smoother. Though for gameplay/AI/physics, spikes are much more of a concern, as THAT is usually what makes your game look choppy (which is often worse than a low frame rate, assuming we're not dipping below 20 fps).
So, what I would do is also add something like this:
#define ONE_OVER_FPS (1.0f/60.0f)
static float g_SpikeGuardBreakpoint = 3.0f * ONE_OVER_FPS;
if(time > g_SpikeGuardBreakpoint)
DoInternalBreakpoint()
(fill in 3.0f with whatever magnitude you find to be an unacceptable spike)
This will let you find and thus solve FPS issues at the end of the frame they happen in.
A much better system than using a large array of old framerates is to just do something like this:
new_fps = old_fps * 0.99 + new_fps * 0.01
This method uses far less memory, requires far less code, and places more importance upon recent framerates than old framerates while still smoothing the effects of sudden framerate changes.
You could keep a counter, increment it after each frame is rendered, then reset the counter when you are on a new second (storing the previous value as the last second's # of frames rendered)
JavaScript:
// Set the end and start times
var start = (new Date).getTime(), end, FPS;
/* ...
* the loop/block you want to watch
* ...
*/
end = (new Date).getTime();
// since the times are by millisecond, use 1000 (1000ms = 1s)
// then multiply the result by (MaxFPS / 1000)
// FPS = (1000 - (end - start)) * (MaxFPS / 1000)
FPS = Math.round((1000 - (end - start)) * (60 / 1000));
Here's a complete example, using Python (but easily adapted to any language). It uses the smoothing equation in Martin's answer, so almost no memory overhead, and I chose values that worked for me (feel free to play around with the constants to adapt to your use case).
import time
SMOOTHING_FACTOR = 0.99
MAX_FPS = 10000
avg_fps = -1
last_tick = time.time()
while True:
    # <Do your rendering work here...>
    current_tick = time.time()
    # Ensure we don't get crazy large frame rates, by capping to MAX_FPS
    current_fps = 1.0 / max(current_tick - last_tick, 1.0/MAX_FPS)
    last_tick = current_tick
    if avg_fps < 0:
        avg_fps = current_fps
    else:
        avg_fps = (avg_fps * SMOOTHING_FACTOR) + (current_fps * (1-SMOOTHING_FACTOR))
    print(avg_fps)
Set the counter to zero. Each time you draw a frame, increment the counter. After each second, print the counter. Lather, rinse, repeat. If you want extra credit, keep a running counter and divide by the total number of seconds for a running average.
In (C++-like) pseudocode, these two are what I used in industrial image-processing applications that had to process images from a set of externally triggered cameras. Variations in "frame rate" had a different source (slower or faster production on the belt) but the problem is the same. (I assume that you have a simple timer.peek() call that gives you something like the number of msec (nsec?) since application start or the last call.)
Solution 1: fast but not updated every frame
do while (1)
{
ProcessImage(frame)
if (frame.framenumber%poll_interval==0)
{
new_time=timer.peek()
framerate=poll_interval/(new_time - last_time)
last_time=new_time
}
}
Solution 2: updated every frame, requires more memory and CPU
do while (1)
{
ProcessImage(frame)
new_time=timer.peek()
delta=new_time - last_time
last_time = new_time
total_time += delta
delta_history.push(delta)
framerate= delta_history.length() / total_time
while (delta_history.length() > avg_interval)
{
oldest_delta = delta_history.pop()
total_time -= oldest_delta
}
}
qx.Class.define('FpsCounter', {
extend: qx.core.Object
,properties: {
}
,events: {
}
,construct: function(){
this.base(arguments);
this.restart();
}
,statics: {
}
,members: {
restart: function(){
this.__frames = [];
}
,addFrame: function(){
this.__frames.push(new Date());
}
,getFps: function(averageFrames){
if(!averageFrames){
averageFrames = 2;
}
var time = 0;
var l = this.__frames.length;
var i = averageFrames;
while(i > 0){
if(l - i - 1 >= 0){
time += this.__frames[l - i] - this.__frames[l - i - 1];
}
i--;
}
var fps = averageFrames / time * 1000;
return fps;
}
}
});
How I do it!
boolean run = false;
int ticks = 0;
long tickstart;
int fps;
public void loop()
{
    if(this.ticks == 0)
    {
        this.tickstart = System.currentTimeMillis();
    }
    this.ticks++;
    long elapsed = System.currentTimeMillis() - this.tickstart;
    if(elapsed > 0)
    {
        // multiply by 1000 to convert elapsed milliseconds to frames per second
        this.fps = (int)(this.ticks * 1000L / elapsed);
    }
}
In words: a tick counter tracks frames. On the first call it stores the current time in 'tickstart'. After that, it sets 'fps' to the number of ticks divided by the time elapsed since the first tick.
Fps is an integer, hence "(int)".
Here's how I do it (in Java):
private static long ONE_SECOND = 1000000L * 1000L; //1 second = 1000 ms, and 1 ms = 1,000,000 ns
LinkedList<Long> frames = new LinkedList<>(); //List of frames within 1 second
public int calcFPS(){
long time = System.nanoTime(); //Current time in nano seconds
frames.add(time); //Add this frame to the list
while(true){
long f = frames.getFirst(); //Look at the first element in frames
if(time - f > ONE_SECOND){ //If it was more than 1 second ago
frames.remove(); //Remove it from the list of frames
} else break;
/*If it was within 1 second we know that all other frames in the list
* are also within 1 second
*/
}
return frames.size(); //Return the size of the list
}
In Typescript, I use this algorithm to calculate framerate and frametime averages:
let getTime = () => {
return new Date().getTime();
}
let frames: any[] = [];
let previousTime = getTime();
let framerate:number = 0;
let frametime:number = 0;
let updateStats = (samples:number=60) => {
samples = Math.max(samples, 1) >> 0;
if (frames.length === samples) {
let currentTime: number = getTime() - previousTime;
frametime = currentTime / samples;
framerate = 1000 * samples / currentTime;
previousTime = getTime();
frames = [];
}
frames.push(1);
}
usage:
updateStats();
// Print
stats.innerHTML = Math.round(framerate) + ' FPS ' + frametime.toFixed(2) + ' ms';
Tip: If samples is 1, the result is real-time framerate and frametime.
This is based on KPexEA's answer and gives the Simple Moving Average. Tidied and converted to TypeScript for easy copy and paste:
Variable declaration:
fpsObject = {
maxSamples: 100,
tickIndex: 0,
tickSum: 0,
tickList: []
}
Function:
calculateFps(currentFps: number): number {
this.fpsObject.tickSum -= this.fpsObject.tickList[this.fpsObject.tickIndex] || 0
this.fpsObject.tickSum += currentFps
this.fpsObject.tickList[this.fpsObject.tickIndex] = currentFps
if (++this.fpsObject.tickIndex === this.fpsObject.maxSamples) this.fpsObject.tickIndex = 0
const smoothedFps = this.fpsObject.tickSum / this.fpsObject.maxSamples
return Math.floor(smoothedFps)
}
Usage (may vary in your app):
this.fps = this.calculateFps(this.ticker.FPS)
I adapted #KPexEA's answer to Go, moved the globals into struct fields, allowed the number of samples to be configurable, and used time.Duration instead of plain integers and floats.
type FrameTimeTracker struct {
samples []time.Duration
sum time.Duration
index int
}
func NewFrameTimeTracker(n int) *FrameTimeTracker {
return &FrameTimeTracker{
samples: make([]time.Duration, n),
}
}
func (t *FrameTimeTracker) AddFrameTime(frameTime time.Duration) (average time.Duration) {
// algorithm adapted from https://stackoverflow.com/a/87732/814422
t.sum -= t.samples[t.index]
t.sum += frameTime
t.samples[t.index] = frameTime
t.index++
if t.index == len(t.samples) {
t.index = 0
}
return t.sum / time.Duration(len(t.samples))
}
The use of time.Duration, which has nanosecond precision, eliminates the need for floating-point arithmetic to compute the average frame time, but comes at the expense of needing twice as much memory for the same number of samples.
You'd use it like this:
// track the last 60 frame times
frameTimeTracker := NewFrameTimeTracker(60)
// main game loop
for frame := 0;; frame++ {
// ...
if frame > 0 {
// prevFrameTime is the duration of the last frame
avgFrameTime := frameTimeTracker.AddFrameTime(prevFrameTime)
fps := 1.0 / avgFrameTime.Seconds()
}
// ...
}
Since the context of this question is game programming, I'll add some more notes about performance and optimization. The above approach is idiomatic Go but always involves two heap allocations: one for the struct itself and one for the array backing the slice of samples. If used as indicated above, these are long-lived allocations so they won't really tax the garbage collector. Profile before optimizing, as always.
However, if performance is a major concern, some changes can be made to eliminate the allocations and indirections:
Change samples from a slice of []time.Duration to an array of [N]time.Duration where N is fixed at compile time. This removes the flexibility of changing the number of samples at runtime, but in most cases that flexibility is unnecessary.
Then, eliminate the NewFrameTimeTracker constructor function entirely and use a var frameTimeTracker FrameTimeTracker declaration (at the package level or local to main) instead. Unlike C, Go will pre-zero all relevant memory.
Unfortunately, most of the answers here don't provide either accurate enough or sufficiently "slow responsive" FPS measurements. Here's how I do it in Rust using a measurement queue:
use std::collections::VecDeque;
use std::time::{Duration, Instant};
pub struct FpsCounter {
sample_period: Duration,
max_samples: usize,
creation_time: Instant,
frame_count: usize,
measurements: VecDeque<FrameCountMeasurement>,
}
#[derive(Copy, Clone)]
struct FrameCountMeasurement {
time: Instant,
frame_count: usize,
}
impl FpsCounter {
pub fn new(sample_period: Duration, samples: usize) -> Self {
assert!(samples > 1);
Self {
sample_period,
max_samples: samples,
creation_time: Instant::now(),
frame_count: 0,
measurements: VecDeque::new(),
}
}
pub fn fps(&self) -> f32 {
match (self.measurements.front(), self.measurements.back()) {
(Some(start), Some(end)) => {
let period = (end.time - start.time).as_secs_f32();
if period > 0.0 {
(end.frame_count - start.frame_count) as f32 / period
} else {
0.0
}
}
_ => 0.0,
}
}
pub fn update(&mut self) {
self.frame_count += 1;
let current_measurement = self.measure();
let last_measurement = self
.measurements
.back()
.copied()
.unwrap_or(FrameCountMeasurement {
time: self.creation_time,
frame_count: 0,
});
if (current_measurement.time - last_measurement.time) >= self.sample_period {
self.measurements.push_back(current_measurement);
while self.measurements.len() > self.max_samples {
self.measurements.pop_front();
}
}
}
fn measure(&self) -> FrameCountMeasurement {
FrameCountMeasurement {
time: Instant::now(),
frame_count: self.frame_count,
}
}
}
How to use:
Create the counter:
let mut fps_counter = FpsCounter::new(Duration::from_millis(100), 5);
Call fps_counter.update() on every frame drawn.
Call fps_counter.fps() whenever you like to display current FPS.
Now, the key is in the parameters to the FpsCounter::new() method: sample_period controls how responsive fps() is to changes in framerate, and samples controls how quickly fps() ramps up or down to the actual framerate. So if you choose 10 ms and 100 samples, fps() would react almost instantly to any change in framerate - basically, the FPS value on the screen would jitter like crazy - but since it's 100 samples, it would take 1 second to match the actual framerate.
So my choice of 100 ms and 5 samples means that displayed FPS counter doesn't make your eyes bleed by changing crazy fast, and it would match your actual framerate half a second after it changes, which is sensible enough for a game.
Since sample_period * samples is averaging time span, you don't want it to be too short if you want a reasonably accurate FPS counter.
Store a start time and increment your frame counter once per loop? Every few seconds you could just print framecount/(Now - starttime) and then reinitialize them.
Edit: oops, double-ninja'ed.
