Using Grand Central Dispatch in Swift to parallelize and speed up “for" loops?

Using Grand Central Dispatch in Swift to parallelize and speed up “for" loops? - macos

I am trying to wrap my head around how to use GCD to parallelize and speed up Monte Carlo simulations. Most/all simple examples are presented for Objective C and I really need a simple example for Swift since Swift is my first “real” programming language.
The minimal working version of a monte carlo simulation in Swift would be something like this:
import Foundation
import Cocoa
var winner = 0
var j = 0
var i = 0
var chance = 0
var points = 0
for j=1;j<1000001;++j{
var ability = 500
var player1points = 0
for i=1;i<1000;++i{
chance = Int(arc4random_uniform(1001))
if chance<(ability-points) {++points}
else{points = points - 1}
}
if points > 0{++winner}
}
println(winner)
The code works directly pasted into a command line program project in xcode 6.1
The innermost loop cannot be parallelized because the new value of variable “points” is used in the next loop. But the outermost just run the innermost simulation 1000000 times and tally up the results and should be an ideal candidate for parallelization.
So my question is how to use GCD to parallelize the outermost for loop?

A "multi-threaded iteration" can be done with dispatch_apply():
let outerCount = 100 // # of concurrent block iterations
let innerCount = 10000 // # of iterations within each block
let the_queue = dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_DEFAULT, 0);
dispatch_apply(UInt(outerCount), the_queue) { outerIdx -> Void in
for innerIdx in 1 ... innerCount {
// ...
}
}
(You have to figure out the best relation between outer and inner counts.)
There are two things to notice:
arc4random() uses an internal mutex, which makes it extremely slow when called
from several threads in parallel, see Performance of concurrent code using dispatch_group_async is MUCH slower than single-threaded version. From the answers given there,
rand_r() (with separate seeds for each thread) seems to be faster alternative.
The result variable winner must not be modified from multiple threads simultaneously.
You can use an array instead where each thread updates its own element, and the results
are added afterwards. A thread-safe method has been described in https://stackoverflow.com/a/26790019/1187415.
Then it would roughly look like this:
let outerCount = 100 // # of concurrent block iterations
let innerCount = 10000 // # of iterations within each block
let the_queue = dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_DEFAULT, 0);
var winners = [Int](count: outerCount, repeatedValue: 0)
winners.withUnsafeMutableBufferPointer { winnersPtr -> Void in
dispatch_apply(UInt(outerCount), the_queue) { outerIdx -> Void in
var seed = arc4random() // seed for rand_r() in this "thread"
for innerIdx in 1 ... innerCount {
var points = 0
var ability = 500
for i in 1 ... 1000 {
let chance = Int(rand_r(&seed) % 1001)
if chance < (ability-points) { ++points }
else {points = points - 1}
}
if points > 0 {
winnersPtr[Int(outerIdx)] += 1
}
}
}
}
// Add results:
let winner = reduce(winners, 0, +)
println(winner)

Just to update this for contemporary syntax, we now use concurrentPerform (which replaces dispatch_apply).
So we can replace
for j in 0 ..< 1_000_000 {
for i in 0 ..< 1000 {
...
}
}
With
DispatchQueue.concurrentPerform(1_000_000) { j in
for i in 0 ..< 1000 {
...
}
}
Note, parallelizing introduces a little overhead, in both the basic GCD dispatch mechanism, as well as the synchronization of the results. If you had 32 iterations in your parallel loop this would be inconsequential, but you have a million iterations, and it will start to add up.
We generally solve this by “striding”: Rather than parallelizing 1 million iterations, you might only do 100 parallel iterations, doing 10,000 iterations each. E.g. something like:
let totalIterations = 1_000_000
let stride = 10_000
let (quotient, remainder) = totalIterations.quotientAndRemainder(dividingBy: stride)
let iterations = quotient + remainder == 0 ? 0 : 1
DispatchQueue.concurrentPerform(iterations: iterations) { iteration in
for j in iteration * stride ..< min(totalIterations, (iteration + 1) * stride) {
for i in 0 ..< 1000 {
...
}
}
}

Related

One coding problem two different solutions, how to prove is correct?

I have a coding problem:
The awards committee of your alma mater (i.e. your college/university) asked for your assistance with a budget allocation problem they’re facing. Originally, the committee planned to give N research grants this year. However, due to spending cutbacks, the budget was reduced to newBudget dollars and now they need to reallocate the grants. The committee made a decision that they’d like to impact as few grant recipients as possible by applying a maximum cap on all grants. Every grant initially planned to be higher than cap will now be exactly cap dollars. Grants less or equal to cap, obviously, won’t be impacted.
Given an array grantsArray of the original grants and the reduced budget newBudget, write a function findGrantsCap that finds in the most efficient manner a cap such that the least number of recipients is impacted and that the new budget constraint is met (i.e. sum of the N reallocated grants equals to newBudget).
Analyse the time and space complexities of your solution.
Example:
input: grantsArray = [2, 100, 50, 120, 1000], newBudget = 190
output: 47
The recommended solution is:
fun findCorrectGrantsCap(grantsArray: DoubleArray, newBudget: Double): Double {
grantsArray.sortDescending()
val grantsArray = grantsArray + 0.0
var surplus = grantsArray.sum() - newBudget
if (surplus <= 0)
return grantsArray[0]
var lastIndex = 0
for(i in 0 until grantsArray.lastIndex) {
lastIndex = i
surplus -= (i+1) * (grantsArray[i] - grantsArray[i+1])
if (surplus <= 0)
break
}
return grantsArray[lastIndex+1] + (-surplus / (lastIndex.toDouble()+1))
}
Compact and complexity is O(nlogn)
I came across with O(n) solution with tiny fractional part difference in the result between suggested solution and my one:
fun DoubleArray.calcSumAndCount(averageCap: Double, round: Boolean): Pair<Double, Int> {
var count = 0
var sum = 0.0
forEach {
if(round && it > round(averageCap))
count++
else if(!round && it > averageCap)
count++
else
sum+=it
}
return sum to count
}
fun Pair<Double, Int>.calcCap(budget: Double) =
(budget-first)/second
fun findGrantsCap(grantsArray: DoubleArray, newBudget: Double): Double {
if(grantsArray.isEmpty())
return 0.0
val averageCap = newBudget/grantsArray.size
if(grantsArray.sum() <= newBudget)
return grantsArray.maxOf { it }
var sumAndCount = grantsArray.calcSumAndCount(averageCap, false)
val cap = sumAndCount.calcCap(newBudget)
val finalSum = grantsArray.sumOf {
if(it > cap)
cap
else it
}
return if(finalSum == newBudget)
cap
else
grantsArray
.calcSumAndCount(averageCap, true)
.calcCap(newBudget)
}
I wonder if any test case to prove that my solution incorrect or vice versa is correct since provided approaches to solve this coding problem completely different.
Original source doesn't provide reach test cases.
UPDATE
As PaulHankin suggested I wrote simple test:
repeat(1000000) {
val grants = (0..Random.nextInt(6)).map { Random.nextDouble(0.0, 9000000000.0) }.toDoubleArray()
val newBudget = Random.nextDouble(0.0, 9000000000.0)
val cap1 = findCorrectGrantsCap(grants, newBudget)
val cap2 = findGrantsCap(grants, newBudget)
if (abs(cap1 - cap2) > .00001)
println("FAILED: $cap1 != $cap2 (${grants.joinToString()}), $newBudget")
}
And it's failed he is right. But when I redesigned my solution:
fun findGrantsCap(grantsArray: DoubleArray, newBudget: Double): Double {
if(grantsArray.isEmpty())
return 0.0
if(grantsArray.sum() <= newBudget)
return grantsArray.maxOf { it }
grantsArray.sort()
var size = grantsArray.size
var averageCap = newBudget/size
var tempBudget = newBudget
for(grant in grantsArray) {
if(grant <= averageCap) {
size--
tempBudget -= grant
averageCap = tempBudget/size
} else break
}
return averageCap
}
After that the test cases pass successfully, the only problem with double precision/overflow error if I use large Doubles if increase limits of input for grants and/or budget (it can be fixed using BigDecimal instead for large inputs).
So the latest solution is correct now? Or is still can be some test cases where it can be failed?

Efficient way to generate a seemingly random permutation from a very large set without repeating?

I have a very large set (billions or more, it's expected to grow exponentially to some level), and I want to generate seemingly random elements from it without repeating. I know I can pick a random number and repeat and record the elements I have generated, but that takes more and more memory as numbers are generated, and wouldn't be practical after couple millions elements out.
I mean, I could say 1, 2, 3 up to billions and each would be constant time without remembering all the previous, or I can say 1,3,5,7,9 and on then 2,4,6,8,10, but is there a more sophisticated way to do that and eventually get a seemingly random permutation of that set?
Update
1, The set does not change size in the generation process. I meant when the user's input increases linearly, the size of the set increases exponentially.
2, In short, the set is like the set of every integer from 1 to 10 billions or more.
3, In long, it goes up to 10 billion because each element carries the information of many independent choices, for example. Imagine an RPG character that have 10 attributes, each can go from 1 to 100 (for my problem different choices can have different ranges), thus there's 10^20 possible characters, number "10873456879326587345" would correspond to a character that have "11, 88, 35...", and I would like an algorithm to generate them one by one without repeating, but makes it looks random.

Thanks for the interesting question. You can create a "pseudorandom"* (cyclic) permutation with a few bytes using modular exponentiation. Say we have n elements. Search for a prime p that's bigger than n+1. Then find a primitive root g modulo p. Basically by definition of primitive root, the action x --> (g * x) % p is a cyclic permutation of {1, ..., p-1}. And so x --> ((g * (x+1))%p) - 1 is a cyclic permutation of {0, ..., p-2}. We can get a cyclic permutation of {0, ..., n-1} by repeating the previous permutation if it gives a value bigger (or equal) n.
I implemented this idea as a Go package. https://github.com/bwesterb/powercycle
package main
import (
"fmt"
"github.com/bwesterb/powercycle"
)
func main() {
var x uint64
cycle := powercycle.New(10)
for i := 0; i < 10; i++ {
fmt.Println(x)
x = cycle.Apply(x)
}
}
This outputs something like
0
6
4
1
2
9
3
5
8
7
but that might vary off course depending on the generator chosen.
It's fast, but not super-fast: on my five year old i7 it takes less than 210ns to compute one application of a cycle on 1000000000000000 elements. More details:
BenchmarkNew10-8 1000000 1328 ns/op
BenchmarkNew1000-8 500000 2566 ns/op
BenchmarkNew1000000-8 50000 25893 ns/op
BenchmarkNew1000000000-8 200000 7589 ns/op
BenchmarkNew1000000000000-8 2000 648785 ns/op
BenchmarkApply10-8 10000000 170 ns/op
BenchmarkApply1000-8 10000000 173 ns/op
BenchmarkApply1000000-8 10000000 172 ns/op
BenchmarkApply1000000000-8 10000000 169 ns/op
BenchmarkApply1000000000000-8 10000000 201 ns/op
BenchmarkApply1000000000000000-8 10000000 204 ns/op
Why did I say "pseudorandom"? Well, we are always creating a very specific kind of cycle: namely one that uses modular exponentiation. It looks pretty pseudorandom though.

I would use a random number and swap it with an element at the beginning of the set.
Here's some pseudo code
set = [1, 2, 3, 4, 5, 6]
picked = 0
Function PickNext(set, picked)
If picked > Len(set) - 1 Then
Return Nothing
End If
// random number between picked (inclusive) and length (exclusive)
r = RandomInt(picked, Len(set))
// swap the picked element to the beginning of the set
result = set[r]
set[r] = set[picked]
set[picked] = result
// update picked
picked++
// return your next random element
Return temp
End Function
Every time you pick an element there is one swap and the only extra memory being used is the picked variable. The swap can happen if the elements are in a database or in memory.
EDIT Here's a jsfiddle of a working implementation http://jsfiddle.net/sun8rw4d/
JavaScript
var set = [];
set.picked = 0;
function pickNext(set) {
if(set.picked > set.length - 1) { return null; }
var r = set.picked + Math.floor(Math.random() * (set.length - set.picked));
var result = set[r];
set[r] = set[set.picked];
set[set.picked] = result;
set.picked++;
return result;
}
// testing
for(var i=0; i<100; i++) {
set.push(i);
}
while(pickNext(set) !== null) { }
document.body.innerHTML += set.toString();
EDIT 2 Finally, a random binary walk of the set. This can be accomplished with O(Log2(N)) stack space (memory) which for 10billion is only 33. There's no shuffling or swapping involved. Using trinary instead of binary might yield even better pseudo random results.
// on the fly set generator
var count = 0;
var maxValue = 64;
function nextElement() {
// restart the generation
if(count == maxValue) {
count = 0;
}
return count++;
}
// code to pseudo randomly select elements
var current = 0;
var stack = [0, maxValue - 1];
function randomBinaryWalk() {
if(stack.length == 0) { return null; }
var high = stack.pop();
var low = stack.pop();
var mid = ((high + low) / 2) | 0;
// pseudo randomly choose the next path
if(Math.random() > 0.5) {
if(low <= mid - 1) {
stack.push(low);
stack.push(mid - 1);
}
if(mid + 1 <= high) {
stack.push(mid + 1);
stack.push(high);
}
} else {
if(mid + 1 <= high) {
stack.push(mid + 1);
stack.push(high);
}
if(low <= mid - 1) {
stack.push(low);
stack.push(mid - 1);
}
}
// how many elements to skip
var toMid = (current < mid ? mid - current : (maxValue - current) + mid);
// skip elements
for(var i = 0; i < toMid - 1; i++) {
nextElement();
}
current = mid;
// get result
return nextElement();
}
// test
var result;
var list = [];
do {
result = randomBinaryWalk();
list.push(result);
} while(result !== null);
document.body.innerHTML += '<br/>' + list.toString();
Here's the results from a couple of runs with a small set of 64 elements. JSFiddle http://jsfiddle.net/yooLjtgu/
30,46,38,34,36,35,37,32,33,31,42,40,41,39,44,45,43,54,50,52,53,51,48,47,49,58,60,59,61,62,56,57,55,14,22,18,20,19,21,16,15,17,26,28,29,27,24,25,23,6,2,4,5,3,0,1,63,10,8,7,9,12,11,13
30,14,22,18,16,15,17,20,19,21,26,28,29,27,24,23,25,6,10,8,7,9,12,13,11,2,0,63,1,4,5,3,46,38,42,44,45,43,40,41,39,34,36,35,37,32,31,33,54,58,56,55,57,60,59,61,62,50,48,49,47,52,51,53
As I mentioned in my comment, unless you have an efficient way to skip to a specific point in your "on the fly" generation of the set this will not be very efficient.

if it is enumerable then use a pseudo-random integer generator adjusted to the period 0 .. 2^n - 1 where the upper bound is just greater than the size of your set and generate pseudo-random integers discarding those more than the size of your set. Use those integers to index items from your set.

Pre- compute yourself a series of indices (e.g. in a file), which has the properties you need and then randomly choose a start index for your enumeration and use the series in a round-robin manner.
The length of your pre-computed series should be > the maximum size of the set.
If you combine this (depending on your programming language etc.) with file mappings, your final nextIndex(INOUT state) function is (nearly) as simple as return mappedIndices[state++ % PERIOD];, if you have a fixed size of each entry (e.g. 8 bytes -> uint64_t).
Of course, the returned value could be > your current set size. Simply draw indices until you get one which is <= your sets current size.
Update (In response to question-update):
There is another option to achieve your goal if it is about creating 10Billion unique characters in your RPG: Generate a GUID and write yourself a function which computes your number from the GUID. man uuid if you are are on a unix system. Else google it. Some parts of the uuid are not random but contain meta-info, some parts are either systematic (such as your network cards MAC address) or random, depending on generator algorithm. But they are very very most likely unique. So, whenever you need a new unique number, generate a uuid and transform it to your number by means of some algorithm which basically maps the uuid bytes to your number in a non-trivial way (e.g. use hash functions).

Making a more efficient monte carlo simulation

So, I've written this code that should effectively estimate the area under the curve of the function defined as h(x). My problem is that i need to be able to estimate the area to within 6 decimal places, but the algorithm i've defined in estimateN seems to be using too heavy for my machine. Essentially the question is how can i make the following code more efficient? Is there a way i can get rid of that loop?
h = function(x) {
return(1+(x^9)+(x^3))
}
estimateN = function(n) {
count = 0
k = 1
xpoints = runif(n, 0, 1)
ypoints = runif(n, 0, 3)
while(k <= n){
if(ypoints[k]<=h(xpoints[k]))
count = count+1
k = k+1
}
#because of the range that im using for y
return(3*(count/n))
}
#uses the fact that err<=1/sqrt(n) to determine size of dataset
estimate_to = function(i) {
n = (10^i)^2
print(paste(n, " repetitions: ", estimateN(n)))
}
estimate_to(6)

Replace this code:
count = 0
k = 1
while(k <= n){
if(ypoints[k]<=h(xpoints[k]))
count = count+1
k = k+1
}
With this line:
count <- sum(ypoints <= h(xpoints))

If it's truly efficiency you're striving for, integrate is several orders of magnitude faster (not to mention more memory efficient) for this problem.
integrate(h, 0, 1)
# 1.35 with absolute error < 1.5e-14
microbenchmark(integrate(h, 0, 1), estimate_to(3), times=10)
# Unit: microseconds
# expr min lq median uq max neval
# integrate(h, 0, 1) 14.456 17.769 42.918 54.514 83.125 10
# estimate_to(3) 151980.781 159830.956 162290.668 167197.742 174881.066 10

Why is my Scala tail-recursion faster than the while loop?

Here are two solutions to exercise 4.9 in Cay Horstmann's Scala for the Impatient: "Write a function lteqgt(values: Array[Int], v: Int) that returns a triple containing the counts of values less than v, equal to v, and greater than v." One uses tail recursion, the other uses a while loop. I thought that both would compile to similar bytecode but the while loop is slower than the tail recursion by a factor of almost 2. This suggests to me that my while method is badly written.
import scala.annotation.tailrec
import scala.util.Random
object PerformanceTest {
def main(args: Array[String]): Unit = {
val bigArray:Array[Int] = fillArray(new Array[Int](100000000))
println(time(lteqgt(bigArray, 25)))
println(time(lteqgt2(bigArray, 25)))
}
def time[T](block : => T):T = {
val start = System.nanoTime : Double
val result = block
val end = System.nanoTime : Double
println("Time = " + (end - start) / 1000000.0 + " millis")
result
}
#tailrec def fillArray(a:Array[Int], pos:Int=0):Array[Int] = {
if (pos == a.length)
a
else {
a(pos) = Random.nextInt(50)
fillArray(a, pos+1)
}
}
#tailrec def lteqgt(values: Array[Int], v:Int, lt:Int=0, eq:Int=0, gt:Int=0, pos:Int=0):(Int, Int, Int) = {
if (pos == values.length)
(lt, eq, gt)
else
lteqgt(values, v, lt + (if (values(pos) < v) 1 else 0), eq + (if (values(pos) == v) 1 else 0), gt + (if (values(pos) > v) 1 else 0), pos+1)
}
def lteqgt2(values:Array[Int], v:Int):(Int, Int, Int) = {
var lt = 0
var eq = 0
var gt = 0
var pos = 0
val limit = values.length
while (pos < limit) {
if (values(pos) > v)
gt += 1
else if (values(pos) < v)
lt += 1
else
eq += 1
pos += 1
}
(lt, eq, gt)
}
}
Adjust the size of bigArray according to your heap size. Here is some sample output:
Time = 245.110899 millis
(50004367,2003090,47992543)
Time = 465.836894 millis
(50004367,2003090,47992543)
Why is the while method so much slower than the tailrec? Naively the tailrec version looks to be at a slight disadvantage, as it must always perform 3 "if" checks for every iteration, whereas the while version will often only perform 1 or 2 tests due to the else construct. (NB reversing the order I perform the two methods does not affect the outcome).

Test results (after reducing array size to 20000000)
Under Java 1.6.22 I get 151 and 122 ms for tail-recursion and while-loop respectively.
Under Java 1.7.0 I get 55 and 101 ms
So under Java 6 your while-loop is actually faster; both have improved in performance under Java 7, but the tail-recursive version has overtaken the loop.
Explanation
The performance difference is due to the fact that in your loop, you conditionally add 1 to the totals, while for recursion you always add either 1 or 0. So they are not equivalent. The equivalent while-loop to your recursive method is:
def lteqgt2(values:Array[Int], v:Int):(Int, Int, Int) = {
var lt = 0
var eq = 0
var gt = 0
var pos = 0
val limit = values.length
while (pos < limit) {
gt += (if (values(pos) > v) 1 else 0)
lt += (if (values(pos) < v) 1 else 0)
eq += (if (values(pos) == v) 1 else 0)
pos += 1
}
(lt, eq, gt)
}
and this gives exactly the same execution time as the recursive method (regardless of Java version).
Discussion
I'm not an expert on why the Java 7 VM (HotSpot) can optimize this better than your first version, but I'd guess it's because it's taking the same path through the code each time (rather than branching along the if / else if paths), so the bytecode can be inlined more efficiently.
But remember that this is not the case in Java 6. Why one while-loop outperforms the other is a question of JVM internals. Happily for the Scala programmer, the version produced from idiomatic tail-recursion is the faster one in the latest version of the JVM.
The difference could also be occurring at the processor level. See this question, which explains how code slows down if it contains unpredictable branching.

The two constructs are not identical. In particular, in the first case you don't need any jumps (on x86, you can use cmp and setle and add, instead of having to use cmp and jb and (if you don't jump) add. Not jumping is faster than jumping on pretty much every modern architecture.
So, if you have code that looks like
if (a < b) x += 1
where you may add or you may jump instead, vs.
x += (a < b)
(which only makes sense in C/C++ where 1 = true and 0 = false), the latter tends to be faster as it can be turned into more compact assembly code. In Scala/Java, you can't do this, but you can do
x += if (a < b) 1 else 0
which a smart JVM should recognize is the same as x += (a < b), which has a jump-free machine code translation, which is usually faster than jumping. An even smarter JVM would recognize that
if (a < b) x += 1
is the same yet again (because adding zero doesn't do anything).
C/C++ compilers routinely perform optimizations like this. Being unable to apply any of these optimizations was not a mark in the JIT compiler's favor; apparently it can as of 1.7, but only partially (i.e. it doesn't recognize that adding zero is the same as a conditional adding one, but it does at least convert x += if (a<b) 1 else 0 into fast machine code).
Now, none of this has anything to do with tail recursion or while loops per se. With tail recursion it's more natural to write the if (a < b) 1 else 0 form, but you can do either; and with while loops you can also do either. It just so happened that you picked one form for tail recursion and the other for the while loop, making it look like recursion vs. looping was the change instead of the two different ways to do the conditionals.

Calculating frames per second in a game

What's a good algorithm for calculating frames per second in a game? I want to show it as a number in the corner of the screen. If I just look at how long it took to render the last frame the number changes too fast.
Bonus points if your answer updates each frame and doesn't converge differently when the frame rate is increasing vs decreasing.

You need a smoothed average, the easiest way is to take the current answer (the time to draw the last frame) and combine it with the previous answer.
// eg.
float smoothing = 0.9; // larger=more smoothing
measurement = (measurement * smoothing) + (current * (1.0-smoothing))
By adjusting the 0.9 / 0.1 ratio you can change the 'time constant' - that is how quickly the number responds to changes. A larger fraction in favour of the old answer gives a slower smoother change, a large fraction in favour of the new answer gives a quicker changing value. Obviously the two factors must add to one!

This is what I have used in many games.
#define MAXSAMPLES 100
int tickindex=0;
int ticksum=0;
int ticklist[MAXSAMPLES];
/* need to zero out the ticklist array before starting */
/* average will ramp up until the buffer is full */
/* returns average ticks per frame over the MAXSAMPLES last frames */
double CalcAverageTick(int newtick)
{
ticksum-=ticklist[tickindex]; /* subtract value falling off */
ticksum+=newtick; /* add new value */
ticklist[tickindex]=newtick; /* save new value so it can be subtracted later */
if(++tickindex==MAXSAMPLES) /* inc buffer index */
tickindex=0;
/* return average */
return((double)ticksum/MAXSAMPLES);
}

Well, certainly
frames / sec = 1 / (sec / frame)
But, as you point out, there's a lot of variation in the time it takes to render a single frame, and from a UI perspective updating the fps value at the frame rate is not usable at all (unless the number is very stable).
What you want is probably a moving average or some sort of binning / resetting counter.
For example, you could maintain a queue data structure which held the rendering times for each of the last 30, 60, 100, or what-have-you frames (you could even design it so the limit was adjustable at run-time). To determine a decent fps approximation you can determine the average fps from all the rendering times in the queue:
fps = # of rendering times in queue / total rendering time
When you finish rendering a new frame you enqueue a new rendering time and dequeue an old rendering time. Alternately, you could dequeue only when the total of the rendering times exceeded some preset value (e.g. 1 sec). You can maintain the "last fps value" and a last updated timestamp so you can trigger when to update the fps figure, if you so desire. Though with a moving average if you have consistent formatting, printing the "instantaneous average" fps on each frame would probably be ok.
Another method would be to have a resetting counter. Maintain a precise (millisecond) timestamp, a frame counter, and an fps value. When you finish rendering a frame, increment the counter. When the counter hits a pre-set limit (e.g. 100 frames) or when the time since the timestamp has passed some pre-set value (e.g. 1 sec), calculate the fps:
fps = # frames / (current time - start time)
Then reset the counter to 0 and set the timestamp to the current time.

Increment a counter every time you render a screen and clear that counter for some time interval over which you want to measure the frame-rate.
Ie. Every 3 seconds, get counter/3 and then clear the counter.

There are at least two ways to do it:
The first is the one others have mentioned here before me.
I think it's the simplest and preferred way. You just to keep track of
cn: counter of how many frames you've rendered
time_start: the time since you've started counting
time_now: the current time
Calculating the fps in this case is as simple as evaluating this formula:
FPS = cn / (time_now - time_start).
Then there is the uber cool way you might like to use some day:
Let's say you have 'i' frames to consider. I'll use this notation: f[0], f[1],..., f[i-1] to describe how long it took to render frame 0, frame 1, ..., frame (i-1) respectively.
Example where i = 3
|f[0] |f[1] |f[2] |
+----------+-------------+-------+------> time
Then, mathematical definition of fps after i frames would be
(1) fps[i] = i / (f[0] + ... + f[i-1])
And the same formula but only considering i-1 frames.
(2) fps[i-1] = (i-1) / (f[0] + ... + f[i-2])
Now the trick here is to modify the right side of formula (1) in such a way that it will contain the right side of formula (2) and substitute it for it's left side.
Like so (you should see it more clearly if you write it on a paper):
fps[i] = i / (f[0] + ... + f[i-1])
= i / ((f[0] + ... + f[i-2]) + f[i-1])
= (i/(i-1)) / ((f[0] + ... + f[i-2])/(i-1) + f[i-1]/(i-1))
= (i/(i-1)) / (1/fps[i-1] + f[i-1]/(i-1))
= ...
= (i*fps[i-1]) / (f[i-1] * fps[i-1] + i - 1)
So according to this formula (my math deriving skill are a bit rusty though), to calculate the new fps you need to know the fps from the previous frame, the duration it took to render the last frame and the number of frames you've rendered.

This might be overkill for most people, that's why I hadn't posted it when I implemented it. But it's very robust and flexible.
It stores a Queue with the last frame times, so it can accurately calculate an average FPS value much better than just taking the last frame into consideration.
It also allows you to ignore one frame, if you are doing something that you know is going to artificially screw up that frame's time.
It also allows you to change the number of frames to store in the Queue as it runs, so you can test it out on the fly what is the best value for you.
// Number of past frames to use for FPS smooth calculation - because
// Unity's smoothedDeltaTime, well - it kinda sucks
private int frameTimesSize = 60;
// A Queue is the perfect data structure for the smoothed FPS task;
// new values in, old values out
private Queue<float> frameTimes;
// Not really needed, but used for faster updating then processing
// the entire queue every frame
private float __frameTimesSum = 0;
// Flag to ignore the next frame when performing a heavy one-time operation
// (like changing resolution)
private bool _fpsIgnoreNextFrame = false;
//=============================================================================
// Call this after doing a heavy operation that will screw up with FPS calculation
void FPSIgnoreNextFrame() {
this._fpsIgnoreNextFrame = true;
}
//=============================================================================
// Smoothed FPS counter updating
void Update()
{
if (this._fpsIgnoreNextFrame) {
this._fpsIgnoreNextFrame = false;
return;
}
// While looping here allows the frameTimesSize member to be changed dinamically
while (this.frameTimes.Count >= this.frameTimesSize) {
this.__frameTimesSum -= this.frameTimes.Dequeue();
}
while (this.frameTimes.Count < this.frameTimesSize) {
this.__frameTimesSum += Time.deltaTime;
this.frameTimes.Enqueue(Time.deltaTime);
}
}
//=============================================================================
// Public function to get smoothed FPS values
public int GetSmoothedFPS() {
return (int)(this.frameTimesSize / this.__frameTimesSum * Time.timeScale);
}

Good answers here. Just how you implement it is dependent on what you need it for. I prefer the running average one myself "time = time * 0.9 + last_frame * 0.1" by the guy above.
however I personally like to weight my average more heavily towards newer data because in a game it is SPIKES that are the hardest to squash and thus of most interest to me. So I would use something more like a .7 \ .3 split will make a spike show up much faster (though it's effect will drop off-screen faster as well.. see below)
If your focus is on RENDERING time, then the .9.1 split works pretty nicely b/c it tend to be more smooth. THough for gameplay/AI/physics spikes are much more of a concern as THAT will usually what makes your game look choppy (which is often worse than a low frame rate assuming we're not dipping below 20 fps)
So, what I would do is also add something like this:
#define ONE_OVER_FPS (1.0f/60.0f)
static float g_SpikeGuardBreakpoint = 3.0f * ONE_OVER_FPS;
if(time > g_SpikeGuardBreakpoint)
DoInternalBreakpoint()
(fill in 3.0f with whatever magnitude you find to be an unacceptable spike)
This will let you find and thus solve FPS issues the end of the frame they happen.

A much better system than using a large array of old framerates is to just do something like this:
new_fps = old_fps * 0.99 + new_fps * 0.01
This method uses far less memory, requires far less code, and places more importance upon recent framerates than old framerates while still smoothing the effects of sudden framerate changes.

You could keep a counter, increment it after each frame is rendered, then reset the counter when you are on a new second (storing the previous value as the last second's # of frames rendered)

JavaScript:
// Set the end and start times
var start = (new Date).getTime(), end, FPS;
/* ...
* the loop/block your want to watch
* ...
*/
end = (new Date).getTime();
// since the times are by millisecond, use 1000 (1000ms = 1s)
// then multiply the result by (MaxFPS / 1000)
// FPS = (1000 - (end - start)) * (MaxFPS / 1000)
FPS = Math.round((1000 - (end - start)) * (60 / 1000));

Here's a complete example, using Python (but easily adapted to any language). It uses the smoothing equation in Martin's answer, so almost no memory overhead, and I chose values that worked for me (feel free to play around with the constants to adapt to your use case).
import time
SMOOTHING_FACTOR = 0.99
MAX_FPS = 10000
avg_fps = -1
last_tick = time.time()
while True:
# <Do your rendering work here...>
current_tick = time.time()
# Ensure we don't get crazy large frame rates, by capping to MAX_FPS
current_fps = 1.0 / max(current_tick - last_tick, 1.0/MAX_FPS)
last_tick = current_tick
if avg_fps < 0:
avg_fps = current_fps
else:
avg_fps = (avg_fps * SMOOTHING_FACTOR) + (current_fps * (1-SMOOTHING_FACTOR))
print(avg_fps)

Set counter to zero. Each time you draw a frame increment the counter. After each second print the counter. lather, rinse, repeat. If yo want extra credit, keep a running counter and divide by the total number of seconds for a running average.

In (c++ like) pseudocode these two are what I used in industrial image processing applications that had to process images from a set of externally triggered camera's. Variations in "frame rate" had a different source (slower or faster production on the belt) but the problem is the same. (I assume that you have a simple timer.peek() call that gives you something like the nr of msec (nsec?) since application start or the last call)
Solution 1: fast but not updated every frame
do while (1)
{
ProcessImage(frame)
if (frame.framenumber%poll_interval==0)
{
new_time=timer.peek()
framerate=poll_interval/(new_time - last_time)
last_time=new_time
}
}
Solution 2: updated every frame, requires more memory and CPU
do while (1)
{
ProcessImage(frame)
new_time=timer.peek()
delta=new_time - last_time
last_time = new_time
total_time += delta
delta_history.push(delta)
framerate= delta_history.length() / total_time
while (delta_history.length() > avg_interval)
{
oldest_delta = delta_history.pop()
total_time -= oldest_delta
}
}

qx.Class.define('FpsCounter', {
extend: qx.core.Object
,properties: {
}
,events: {
}
,construct: function(){
this.base(arguments);
this.restart();
}
,statics: {
}
,members: {
restart: function(){
this.__frames = [];
}
,addFrame: function(){
this.__frames.push(new Date());
}
,getFps: function(averageFrames){
debugger;
if(!averageFrames){
averageFrames = 2;
}
var time = 0;
var l = this.__frames.length;
var i = averageFrames;
while(i > 0){
if(l - i - 1 >= 0){
time += this.__frames[l - i] - this.__frames[l - i - 1];
}
i--;
}
var fps = averageFrames / time * 1000;
return fps;
}
}
});

How i do it!
boolean run = false;
int ticks = 0;
long tickstart;
int fps;
public void loop()
{
if(this.ticks==0)
{
this.tickstart = System.currentTimeMillis();
}
this.ticks++;
this.fps = (int)this.ticks / (System.currentTimeMillis()-this.tickstart);
}
In words, a tick clock tracks ticks. If it is the first time, it takes the current time and puts it in 'tickstart'. After the first tick, it makes the variable 'fps' equal how many ticks of the tick clock divided by the time minus the time of the first tick.
Fps is an integer, hence "(int)".

Here's how I do it (in Java):
private static long ONE_SECOND = 1000000L * 1000L; //1 second is 1000ms which is 1000000ns
LinkedList<Long> frames = new LinkedList<>(); //List of frames within 1 second
public int calcFPS(){
long time = System.nanoTime(); //Current time in nano seconds
frames.add(time); //Add this frame to the list
while(true){
long f = frames.getFirst(); //Look at the first element in frames
if(time - f > ONE_SECOND){ //If it was more than 1 second ago
frames.remove(); //Remove it from the list of frames
} else break;
/*If it was within 1 second we know that all other frames in the list
* are also within 1 second
*/
}
return frames.size(); //Return the size of the list
}

In Typescript, I use this algorithm to calculate framerate and frametime averages:
let getTime = () => {
return new Date().getTime();
}
let frames: any[] = [];
let previousTime = getTime();
let framerate:number = 0;
let frametime:number = 0;
let updateStats = (samples:number=60) => {
samples = Math.max(samples, 1) >> 0;
if (frames.length === samples) {
let currentTime: number = getTime() - previousTime;
frametime = currentTime / samples;
framerate = 1000 * samples / currentTime;
previousTime = getTime();
frames = [];
}
frames.push(1);
}
usage:
statsUpdate();
// Print
stats.innerHTML = Math.round(framerate) + ' FPS ' + frametime.toFixed(2) + ' ms';
Tip: If samples is 1, the result is real-time framerate and frametime.

This is based on KPexEA's answer and gives the Simple Moving Average. Tidied and converted to TypeScript for easy copy and paste:
Variable declaration:
fpsObject = {
maxSamples: 100,
tickIndex: 0,
tickSum: 0,
tickList: []
}
Function:
calculateFps(currentFps: number): number {
this.fpsObject.tickSum -= this.fpsObject.tickList[this.fpsObject.tickIndex] || 0
this.fpsObject.tickSum += currentFps
this.fpsObject.tickList[this.fpsObject.tickIndex] = currentFps
if (++this.fpsObject.tickIndex === this.fpsObject.maxSamples) this.fpsObject.tickIndex = 0
const smoothedFps = this.fpsObject.tickSum / this.fpsObject.maxSamples
return Math.floor(smoothedFps)
}
Usage (may vary in your app):
this.fps = this.calculateFps(this.ticker.FPS)

I adapted #KPexEA's answer to Go, moved the globals into struct fields, allowed the number of samples to be configurable, and used time.Duration instead of plain integers and floats.
type FrameTimeTracker struct {
samples []time.Duration
sum time.Duration
index int
}
func NewFrameTimeTracker(n int) *FrameTimeTracker {
return &FrameTimeTracker{
samples: make([]time.Duration, n),
}
}
func (t *FrameTimeTracker) AddFrameTime(frameTime time.Duration) (average time.Duration) {
// algorithm adapted from https://stackoverflow.com/a/87732/814422
t.sum -= t.samples[t.index]
t.sum += frameTime
t.samples[t.index] = frameTime
t.index++
if t.index == len(t.samples) {
t.index = 0
}
return t.sum / time.Duration(len(t.samples))
}
The use of time.Duration, which has nanosecond precision, eliminates the need for floating-point arithmetic to compute the average frame time, but comes at the expense of needing twice as much memory for the same number of samples.
You'd use it like this:
// track the last 60 frame times
frameTimeTracker := NewFrameTimeTracker(60)
// main game loop
for frame := 0;; frame++ {
// ...
if frame > 0 {
// prevFrameTime is the duration of the last frame
avgFrameTime := frameTimeTracker.AddFrameTime(prevFrameTime)
fps := 1.0 / avgFrameTime.Seconds()
}
// ...
}
Since the context of this question is game programming, I'll add some more notes about performance and optimization. The above approach is idiomatic Go but always involves two heap allocations: one for the struct itself and one for the array backing the slice of samples. If used as indicated above, these are long-lived allocations so they won't really tax the garbage collector. Profile before optimizing, as always.
However, if performance is a major concern, some changes can be made to eliminate the allocations and indirections:
Change samples from a slice of []time.Duration to an array of [N]time.Duration where N is fixed at compile time. This removes the flexibility of changing the number of samples at runtime, but in most cases that flexibility is unnecessary.
Then, eliminate the NewFrameTimeTracker constructor function entirely and use a var frameTimeTracker FrameTimeTracker declaration (at the package level or local to main) instead. Unlike C, Go will pre-zero all relevant memory.

Unfortunately, most of the answers here don't provide either accurate enough or sufficiently "slow responsive" FPS measurements. Here's how I do it in Rust using a measurement queue:
use std::collections::VecDeque;
use std::time::{Duration, Instant};
pub struct FpsCounter {
sample_period: Duration,
max_samples: usize,
creation_time: Instant,
frame_count: usize,
measurements: VecDeque<FrameCountMeasurement>,
}
#[derive(Copy, Clone)]
struct FrameCountMeasurement {
time: Instant,
frame_count: usize,
}
impl FpsCounter {
pub fn new(sample_period: Duration, samples: usize) -> Self {
assert!(samples > 1);
Self {
sample_period,
max_samples: samples,
creation_time: Instant::now(),
frame_count: 0,
measurements: VecDeque::new(),
}
}
pub fn fps(&self) -> f32 {
match (self.measurements.front(), self.measurements.back()) {
(Some(start), Some(end)) => {
let period = (end.time - start.time).as_secs_f32();
if period > 0.0 {
(end.frame_count - start.frame_count) as f32 / period
} else {
0.0
}
}
_ => 0.0,
}
}
pub fn update(&mut self) {
self.frame_count += 1;
let current_measurement = self.measure();
let last_measurement = self
.measurements
.back()
.copied()
.unwrap_or(FrameCountMeasurement {
time: self.creation_time,
frame_count: 0,
});
if (current_measurement.time - last_measurement.time) >= self.sample_period {
self.measurements.push_back(current_measurement);
while self.measurements.len() > self.max_samples {
self.measurements.pop_front();
}
}
}
fn measure(&self) -> FrameCountMeasurement {
FrameCountMeasurement {
time: Instant::now(),
frame_count: self.frame_count,
}
}
}
How to use:
Create the counter:
let mut fps_counter = FpsCounter::new(Duration::from_millis(100), 5);
Call fps_counter.update() on every frame drawn.
Call fps_counter.fps() whenever you like to display current FPS.
Now, the key is in parameters to FpsCounter::new() method: sample_period is how responsive fps() is to changes in framerate, and samples controls how quickly fps() ramps up or down to the actual framerate. So if you choose 10 ms and 100 samples, fps() would react almost instantly to any change in framerate - basically, FPS value on the screen would jitter like crazy, but since it's 100 samples, it would take 1 second to match the actual framerate.
So my choice of 100 ms and 5 samples means that displayed FPS counter doesn't make your eyes bleed by changing crazy fast, and it would match your actual framerate half a second after it changes, which is sensible enough for a game.
Since sample_period * samples is averaging time span, you don't want it to be too short if you want a reasonably accurate FPS counter.

store a start time and increment your framecounter once per loop? every few seconds you could just print framecount/(Now - starttime) and then reinitialize them.
edit: oops. double-ninja'ed

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio