Efficient way to skip code every X iterations? - performance

I'm using GameMaker Studio, which you can think of as one giant loop.
I use a counter variable step to keep track of the current frame.
I'd like to run some code only every Xth step, for efficiency.
if step mod 60 == 0 {
}
would run that block once every 60 steps (once per second at 60 fps).
My understanding is that modulus is a heavy operation, though, and with thousands of steps I imagine the computation could add up. Is there a more efficient way to do this?
Perhaps one involving a bitwise operator?
I know this can work for every other frame:
// Declare
counter = 0
// Step
counter = (counter + 1) & 1
if counter {
}
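I assume the same masking trick generalizes to any power-of-two interval, e.g. every 64 steps (a sketch, assuming GML's & behaves like the usual bitwise AND):
// Declare
counter = 0
// Step
counter = (counter + 1) & 63 // wraps 0..63
if counter == 0 {
    // runs once every 64 steps
}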
Or is the performance impact of modulus negligible at 60 fps, even with large numbers?

In essence:
i := 0
WHILE i < n/4
    do rest of stuff × 4
    do stuff that you want to do one time in four
    increment i
do rest of stuff n mod 4 times
The variant of this that takes the modulus and switches based on that is called Duff’s Device. Which is faster will depend on your architecture: on many RISC chips, mod executes in a single clock cycle, but on other CPUs, integer division might not even be a native instruction.
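For illustration, here's a minimal C++ sketch in the spirit of that variant, with the remainder handled up front by fall-through; do_common() and do_rare() are hypothetical stand-ins for "rest of stuff" and the one-time-in-four work:
void do_common(void);
void do_rare(void);

void run(int n)
{
    int i = n % 4;
    switch (i) {              // handle the n % 4 leftover iterations first
        case 3: do_common();  // fall through
        case 2: do_common();  // fall through
        case 1: do_common();  // fall through
        case 0: break;
    }
    while (i < n) {           // then n / 4 full groups of four
        do_common();
        do_common();
        do_common();
        do_common();
        do_rare();            // executed one time in four
        i += 4;
    }
}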
If you don’t have a loop counter per se, because it’s an event loop for example, you can always make one and reset it every four times in the if block where you execute your code:
i := 1
WHILE you loop
    do other stuff
    if i == 4
        do stuff
        i := 1
    else
        i := i + 1
Here’s an example of doing some stuff one time in two and stuff one time in three:
WHILE looping
    do stuff
    do stuff a second time
    do stuff B
    do stuff a third time
    do stuff C
    do stuff a fourth time
    do stuff B
    do stuff a fifth time
    do stuff a sixth time
    do stuff B
    do stuff C
Note that the stuff you do can include calling an event loop once.
Since this can get unwieldy, you can use template metaprogramming to write these loops for you in C++, something like:
constexpr unsigned a = 5, b = 7, LCM_A_B = 35;

template<unsigned N>
inline void do_stuff(void)
{
    do_stuff_always();
    if (N % a == 0)
        do_stuff_a(); // Since N is a compile-time constant, the compiler does not have to check this at runtime!
    if (N % b == 0)
        do_stuff_b();
    do_stuff<N-1>();
}

template<>
inline void do_stuff<1U>(void)
{
    do_stuff_always();
}

while (sentinel)
    do_stuff<LCM_A_B>();
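(With C++17 you could plausibly write the same thing without the explicit specialization, using if constexpr; a sketch assuming the same a, b, and do_stuff_* declarations as above:)
template<unsigned N>
inline void do_stuff()
{
    do_stuff_always();
    if constexpr (N % a == 0) // false branches are discarded at compile time
        do_stuff_a();
    if constexpr (N % b == 0)
        do_stuff_b();
    if constexpr (N > 1)
        do_stuff<N - 1>();
}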
In general, though, if you want to know whether your optimizations are helping, profile.

The most important part of the answer: that test probably takes so little time, in context, that it isn't worth the ions moving around your brain to think about it.
If it only costs 1%, it's almost certain there are bigger speedups you should be thinking about.
However, if the loop is fast, you could put in something like this:
if (--count < 0) {
    count = 59;
    // do your thing
}
On some hardware, that test comes down to a single decrement-and-branch-if-negative instruction.

Related

How to time something in Go, which takes less than a nanosecond?

How would I proceed if I want to compare the time of two functions, but the functions take less than a nanosecond?
t := time.Now()
_ = fmt.Sprint("Hello, World!")
d := time.Since(t)
d.Round(0)
fmt.Println(d.Nanoseconds()) // Prints 0
I could run the function a couple of times and divide the total time by the number of executions, but I would prefer a way to time a single execution. Is there a way to do this?
If timing a function yields less than a nanosecond, it usually means the code was optimized out.
Compilers can detect that some code has no side effects and decide "why would I do that?".
1/(1 ns) is 1 GHz. Modern desktop computers cap out at about 5 GHz, give or take. So for the measurement to come in under 1 ns, the operation plus the overhead of reading the clock has to take fewer than 5 CPU cycles. At that level of resolution, the CPU isn't doing one thing at a time: the start of an instruction and its end can be multiple cycles apart, with the operation pipelined in between.
So you have to look at the machine code and understand the architecture to determine the real cost, including which of the CPU's resources are used and how the operation would conflict with other operations you might want to do nearby.
So even "before" and "after" stop having reasonable meaning at sub-1-ns resolutions on most CPUs fast enough to run above 1 GHz.
Something you could do is answer the question "how many runs before this thing takes at least one nanosecond?":
package main

import (
    "fmt"
    "time"
)

func main() {
    d := 1
    for {
        t := time.Now()
        for e := d; e > 0; e-- {
            fmt.Sprint("Hello, World!")
        }
        if time.Since(t) > 0 {
            break
        }
        d *= 2
    }
    println(d)
}
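For what it's worth, Go's standard library automates exactly this run-it-many-times approach: testing.Benchmark grows b.N until the measurement is statistically meaningful, then reports an average. A minimal sketch:
package main

import (
    "fmt"
    "testing"
)

func main() {
    // testing.Benchmark runs the body with increasing b.N until the
    // timing is meaningful, then reports the average cost per call.
    res := testing.Benchmark(func(b *testing.B) {
        for i := 0; i < b.N; i++ {
            fmt.Sprint("Hello, World!")
        }
    })
    fmt.Println(res.NsPerOp(), "ns/op")
}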

GLSL: Why is a random write in local array significantly slower than a looped write?

Let's look at a simplified example function in GLSL:
void foo() {
    vec2 localData[16];
    // ...
    int i = ... // somehow dependent on dynamic data (not known at compile time)
    localData[i] = x; // THE IMPORTANT LINE
}
It writes some value x to a dynamically determined index in a local array.
Now, replacing the line localData[i] = x; with
for (int j = 0; j < 16; ++j)
    if (i == j)
        localData[j] = x;
makes the code significantly faster. In several tested examples (different shaders) the execution time almost halved, and there was much more going on than just this write.
For example: in an order-independent transparency shader which, among other things, fetches 16 texels, the timings are 39 ms with the direct write and 23 ms with the looped write. Nothing else changed!
The test hardware is a GTX 1080. The assembly returned by glGetProgramBinary is still too high-level: it contains one line in the first case, and a loop+if surrounding an identical line in the second.
Why does this performance issue happen?
Is this true for all vendors?
Guess: localData is stored in 8 vec4 registers (the assembly does not say anything about that). Further, I assume that registers cannot be addressed with an index. If both are true, then the final binary must use some branch construct. The loop variant might be unrolled and result in a switch-like pattern which is faster. But is that common to all vendors? And why can't the compiler use whatever the for loop compiles to as the default for such writes?
Further experiments have shown that the reason is the use of a different memory type for the array. The (unrolled) looped variant uses registers, while the random access variant switches to local memory.
Local memory is usually placed in global memory, but is private to each thread. Accesses to this local array are therefore likely to be cached (L2?).
The experiments to verify this reasoning were the following:
Manual versions of unrolled loops (measured in an insertion sort with 16 elements over 1M pixels):
Baseline: localData[i] = x                               33 ms
For loop: for j + if (i == j)                            16.8 ms
Switch: switch (i) { case 0: localData[0] = x; ... }     16.92 ms
If-else tree (splitting in halves)                       16.92 ms
If list (plain manual unroll)                            16.8 ms
=> All kinds of branch constructs result in more or less the same timings. So it is not bad branching behavior, as initially guessed.
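For concreteness, the switch variant measured above might look something like this (a sketch; u_index and u_value are hypothetical inputs standing in for the dynamically computed i and x, and switch needs GLSL 1.30+):
uniform int u_index;
uniform vec2 u_value;

void foo() {
    vec2 localData[16];
    // every write below uses a literal, compile-time-constant index,
    // which lets the compiler keep the array in registers
    switch (u_index) {
        case 0:  localData[0]  = u_value; break;
        case 1:  localData[1]  = u_value; break;
        case 2:  localData[2]  = u_value; break;
        case 3:  localData[3]  = u_value; break;
        case 4:  localData[4]  = u_value; break;
        case 5:  localData[5]  = u_value; break;
        case 6:  localData[6]  = u_value; break;
        case 7:  localData[7]  = u_value; break;
        case 8:  localData[8]  = u_value; break;
        case 9:  localData[9]  = u_value; break;
        case 10: localData[10] = u_value; break;
        case 11: localData[11] = u_value; break;
        case 12: localData[12] = u_value; break;
        case 13: localData[13] = u_value; break;
        case 14: localData[14] = u_value; break;
        case 15: localData[15] = u_value; break;
    }
}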
Multiple vs. one vs. no random accesses (32-element insertion sort):
2x localData[i] = x    47 ms
1x localData[i] = x    45 ms
0x localData[i] = x    16 ms
=> As long as there is at least one random access, the performance will be bad. This means there is a global decision changing the behavior of localData -- most likely the use of a different memory type. Using more than one random access does not make things much worse, because of caching.

How to compute blot exposure in backgammon efficiently

I am trying to implement an algorithm for backgammon similar to TD-Gammon, as described here.
As described in the paper, the initial version of TD-Gammon used only the raw board encoding in the feature space, which created a good playing agent, but to get a world-class agent you need to add some pre-computed features associated with good play. One of the most important of these features turns out to be blot exposure.
Blot exposure is defined here as:
For a given blot, the number of rolls out of 36 which would allow the opponent to hit the blot. The total blot exposure is the number of rolls out of 36 which would allow the opponent to hit any blot. Blot exposure depends on: (a) the locations of all enemy men in front of the blot; (b) the number and location of blocking points between the blot and the enemy men and (c) the number of enemy men on the bar, and the rolls which allow them to re-enter the board, since men on the bar must re-enter before blots can be hit.
I have tried various approaches to compute this feature efficiently but my computation is still too slow and I am not sure how to speed it up.
Keep in mind that the TD-Gammon approach evaluates every possible board position for a given dice roll, so each turn, for every player's dice roll, you would need to calculate this feature for every possible board position.
Some rough numbers: assuming there are approximately 30 board positions per turn and an average game lasts 50 turns, running 1,000,000 game simulations takes (x * 30 * 50 * 1,000,000) / (1000 * 60 * 60 * 24) days, where x is the number of milliseconds it takes to compute the feature. Putting x = 0.7, we get approximately 12 days to simulate 1,000,000 games.
I don't really know if that's reasonable timing but I feel there must be a significantly faster approach.
So here's what I've tried:
Approach 1 (By dice roll)
For every one of the 21 possible dice rolls, recursively check whether a hit occurs. Here's the main workhorse for this procedure:
private bool HitBlot(int[] dieValues, Checker.Color checkerColor, ref int depth)
{
    Moves legalMovesOfDie = new Moves();
    if (depth < dieValues.Length)
    {
        legalMovesOfDie = LegalMovesOfDie(dieValues[depth], checkerColor);
    }
    if (depth == dieValues.Length || legalMovesOfDie.Count == 0)
    {
        return false;
    }
    bool hitBlot = false;
    foreach (Move m in legalMovesOfDie.List)
    {
        if (m.HitChecker == true)
        {
            return true;
        }
        board.ApplyMove(m);
        depth++;
        hitBlot = HitBlot(dieValues, checkerColor, ref depth);
        board.UnapplyMove(m);
        depth--;
        if (hitBlot == true)
        {
            break;
        }
    }
    return hitBlot;
}
What this function does is take as input an array of die values (e.g. if the player rolls 1,1 the array would be [1,1,1,1]). The function then recursively checks whether there is a hit and, if so, exits with true. The function LegalMovesOfDie computes the legal moves for that particular die value.
Approach 2 (By blot)
With this approach I first find all the blots, and then for each blot I loop through every possible dice value and see if a hit occurs. The function is optimized so that once a dice value registers a hit, I don't use it again for the next blot. It is also optimized to only consider moves that are in front of the blot. My code:
public int BlotExposure2(Checker.Color checkerColor)
{
    if (DegreeOfContact() == 0 || CountBlots(checkerColor) == 0)
    {
        return 0;
    }
    List<Dice> unusedDice = Dice.GetAllDice();
    List<int> blotPositions = BlotPositions(checkerColor);
    int count = 0;
    for (int i = 0; i < blotPositions.Count; i++)
    {
        int blotPosition = blotPositions[i];
        for (int j = unusedDice.Count - 1; j >= 0; j--)
        {
            Dice dice = unusedDice[j];
            Transitions transitions = new Transitions(this, dice);
            bool hitBlot = transitions.HitBlot2(checkerColor, blotPosition);
            if (hitBlot == true)
            {
                unusedDice.Remove(dice);
                if (dice.ValuesEqual())
                {
                    count = count + 1;
                }
                else
                {
                    count = count + 2;
                }
            }
        }
    }
    return count;
}
The method transitions.HitBlot2 takes a blotPosition parameter, which ensures that the only moves considered are those in front of the blot.
Both of these implementations were very slow, and when I used a profiler I discovered that the recursion was the cause, so I then tried refactoring them as follows:
To use for loops instead of recursion (ugly code but it's much faster)
To use parallel.foreach so that instead of checking 1 dice value at a time I check these in parallel.
Here are the average timing results of my runs for 50,000 computations of the feature (note: the timings for each approach were taken on the same data):
Approach 1 using recursion: 2.28 ms per computation
Approach 2 using recursion: 1.1 ms per computation
Approach 1 using for loops: 1.02 ms per computation
Approach 2 using for loops: 0.57 ms per computation
Approach 1 using parallel.foreach: 0.75 ms per computation
Approach 2 using parallel.foreach: 0.75 ms per computation
I've found the timings to be quite volatile (maybe dependent on the random initialization of the neural network weights), but around 0.7 ms seems achievable, which, as computed above, leads to 12 days of training for 1,000,000 games.
My questions are: Does anyone know if this is reasonable? Is there a faster algorithm I am not aware of that can reduce training?
One last piece of info: I'm running on a fairly new machine, an Intel Core(TM) i7-5500U CPU @ 2.40 GHz.
Any more info required please let me know and I will provide.
Thanks,
Ofir
Yes, calculating these features makes for really hairy code. Look at the GNU Backgammon code: find eval.c and look at lines 1008 to 1267. Yes, it's 260 lines of code. That code calculates the number of rolls that hit at least one checker, and also the number of rolls that hit at least two checkers. As you can see, the code is hairy.
If you find a better way to calculate this, please post your results. To improve I think you have to look at the board representation. Can you represent the board in a different way that makes this calculation faster?
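To give a flavor of what a non-recursive, arithmetic-only formulation could look like, here is a minimal, hedged C# sketch of the single-blot counting. It ignores men on the bar and most of the real movement rules, and HitRolls/blocked are hypothetical helpers, not part of the poster's code or GNU Backgammon's:
using System;

static class BlotMath
{
    // Counts the rolls (out of 36) that hit a lone blot `distance` pips
    // away from a single enemy checker. blocked(p) reports whether the
    // point p pips beyond that checker is made and thus cannot be used
    // as an intermediate landing point.
    public static int HitRolls(int distance, Func<int, bool> blocked)
    {
        int count = 0;
        for (int a = 1; a <= 6; a++)
        {
            for (int b = a; b <= 6; b++)
            {
                bool hit = false;
                if (a == b)
                {
                    // Doubles allow up to four hops of length a; every
                    // intermediate landing point must be open.
                    for (int hops = 1; hops <= 4 && !hit; hops++)
                    {
                        if (hops * a != distance) continue;
                        hit = true;
                        for (int k = 1; k < hops; k++)
                            if (blocked(k * a)) { hit = false; break; }
                    }
                    if (hit) count += 1; // a double is 1 roll out of 36
                }
                else
                {
                    // Direct hit with either die, or an indirect hit using
                    // both dice via at least one open intermediate point.
                    hit = a == distance || b == distance
                       || (a + b == distance && (!blocked(a) || !blocked(b)));
                    if (hit) count += 2; // (a,b) and (b,a): 2 rolls out of 36
                }
            }
        }
        return count;
    }
}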

Why are Scala for loops so much faster than while loops for processing a sequence of distinct objects?

I recently read a post about for loops over a range of integers being slower than the corresponding while loops, which is true. But I wanted to see whether the same held for iterating over an existing sequence, and I was surprised to find the complete opposite, by a large margin.
First and foremost, I'm using the following function for timing:
def time[A](f: => A) = {
  val s = System.nanoTime
  val ret = f
  println("time: " + (System.nanoTime - s) / 1e6 + "ms")
  ret
}
and I'm using a simple sequence of Integers:
val seq = List.range(0, 10000)
(I also tried creating this sequence a few other ways in case the way this sequence was accessed affected the run time. Using the Range type certainly did. This should ensure that each item in the sequence is an independent object.)
I ran the following:
time {
  for (item <- seq) {
    println(item)
  }
}
and
time {
  var i = 0
  while (i < seq.size) {
    println(seq(i))
    i += 1
  }
}
I printed the results to ensure that we're actually accessing the values in both loops. The first code snippet runs in an average of 33 ms on my machine; the second takes an average of 305 ms.
I tried adding the mutable variable i to the for loop, but it only adds a few milliseconds. The map function gets similar performance to the for loop, as expected. For whatever reason, this doesn't seem to occur if I use an array (converting the seq defined above with seq.toArray). In that case, the for loop takes 90 ms and the while loop takes 40 ms.
What is the reason for this major performance difference?
The reason is complexity. seq(i) is Θ(i) for a List, which means your whole loop takes quadratic time. The foreach method, however, is linear.
If you compile with -optimize, the for loop version will likely be even faster, because List.foreach should be inlined, eliminating the cost of the lambda.
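In other words, if you want a while loop, iterate with an explicit Iterator so each step is O(1); a sketch using the same seq and time from the question:
time {
  // each next() is O(1) on List, so the whole loop is linear, like foreach
  val it = seq.iterator
  while (it.hasNext) {
    println(it.next())
  }
}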

Why are loops slow in R?

I know that loops are slow in R and that I should try to do things in a vectorised manner instead.
But why? Why are loops slow and apply fast? apply calls several sub-functions -- that doesn't seem fast.
Update: I'm sorry, the question was ill-posed. I was confusing vectorisation with apply. My question should have been,
"Why is vectorisation faster?"
It's not always the case that loops are slow and apply is fast. There's a nice discussion of this in the May 2008 issue of R News:
Uwe Ligges and John Fox. R Help Desk: How can I avoid this loop or make it faster? R News, 8(1):46-50, May 2008.
In the section "Loops!" (starting on p. 48), they say:
Many comments about R state that using loops is a particularly bad idea. This is not necessarily true. In certain cases, it is difficult to write vectorized code, or vectorized code may consume a huge amount of memory.
They further suggest:
Initialize new objects to full length before the loop, rather than increasing their size within the loop. Do not do things in a loop that can be done outside the loop. Do not avoid loops simply for the sake of avoiding loops.
They have a simple example where a for loop takes 1.3 sec but apply runs out of memory.
Loops in R are slow for the same reason any interpreted language is slow: every operation carries around a lot of extra baggage.
Look at R_execClosure in eval.c (this is the function called to call a user-defined function). It's nearly 100 lines long and performs all sorts of operations -- creating an environment for execution, assigning arguments into the environment, etc.
Think how much less happens when you call a function in C (push args onto the stack, jump, pop args).
So that is why you get timings like these (as joran pointed out in the comments, it's not actually apply that's being fast; it's the internal C loop in mean that's being fast. apply is just regular old R code):
A = matrix(as.numeric(1:100000))
Using a loop: 0.342 seconds:
system.time({
  Sum = 0
  for (i in seq_along(A)) {
    Sum = Sum + A[[i]]
  }
  Sum
})
Using sum: unmeasurably small:
sum(A)
It's a little disconcerting because, asymptotically, the loop is just as good as sum; there's no practical reason it should be slow; it's just doing more extra work each iteration.
So consider:
# 0.370 seconds
system.time({
I = 0
while (I < 100000) {
10
I = I + 1
}
})
# 0.743 seconds -- double the time just adding parentheses
system.time({
I = 0
while (I < 100000) {
((((((((((10))))))))))
I = I + 1
}
})
(That example was discovered by Radford Neal)
Why? Because ( in R is an operator, and actually requires a name lookup every time you use it:
> `(` = function(x) 2
> (3)
[1] 2
Or, in general, interpreted operations (in any language) have more steps. Of course, those steps provide benefits as well: you couldn't do that ( trick in C.
The only answer to the question posed is: loops are not slow if what you need to do is iterate over a set of data performing some function, and that function or the operation is not vectorized. A for() loop will, in general, be as quick as apply(), but possibly a little slower than an lapply() call. The last point is well covered on SO, for example in this answer, and applies if the code involved in setting up and operating the loop is a significant part of the overall computational burden of the loop.
Why many people think for() loops are slow is because they, the users, are writing bad code. In general (though there are several exceptions), if you need to expand or grow an object, that too will involve copying, so you have both the overhead of copying and of growing the object. This is not restricted to loops, but if you copy/grow at each iteration of a loop, of course the loop is going to be slow, because you are incurring many copy/grow operations.
The general idiom for using for() loops in R is that you allocate the storage you require before the loop starts, and then fill in the object thus allocated. If you follow that idiom, loops will not be slow. This is what apply() manages for you, but it is just hidden from view.
Of course, if a vectorised function exists for the operation you are implementing with the for() loop, don't do that. Likewise, don't use apply() etc. if a vectorised function exists (e.g. apply(foo, 2, mean) is better performed via colMeans(foo)).
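To make both points concrete, here is a small sketch of the preallocation idiom and the colMeans() substitution (variable names are illustrative):
n <- 100000
out <- numeric(n)            # allocate full length before the loop
for (i in seq_len(n)) {
  out[i] <- i * 2            # fill in place: no copy/grow per iteration
}

foo <- matrix(rnorm(1000 * 10), ncol = 10)
all.equal(apply(foo, 2, mean), colMeans(foo))  # TRUE, and colMeans() is much faster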
Just as a comparison (don't read too much into it!): I ran a (very) simple for loop in R and in JavaScript in Chrome and IE 8.
Note that Chrome does compilation to native code, and R with the compiler package compiles to bytecode.
# In R 2.13.1, this took 500 ms
f <- function() { sum <- 0.5; for (i in 1:1000000) sum <- sum + i; sum }
system.time(f())

# And the compiled version took 130 ms
library(compiler)
g <- cmpfun(f)
system.time(g())
@Gavin Simpson: btw, it took 1162 ms in S-PLUS...
And the "same" code as JavaScript:
// In IE8, this took 282 ms
// In Chrome 14.0, this took 4 ms
function f() {
    var sum = 0.5;
    for (var i = 1; i <= 1000000; ++i) sum = sum + i;
    return sum;
}

var start = new Date().getTime();
f();
var time = new Date().getTime() - start;
