I am working on a basic engine to draw sprites on screen, and I wanted to implement a frame-rate method that works as follows:
I have a background image stored in a 2D array (it never changes; if I need a different background I load a new image) that gets copied every time a new frame has to be made;
all sprites are contained in a list ordered in ascending z-order. I am as yet unsure whether to tap into multithreading as far as blitting is concerned. This list is mutable, so if a sprite is inserted, the list is locked and re-sorted after insertion;
the sprites are blitted onto the 2D array mentioned in point 1, and after that it is displayed on screen.
The question I have is: if I want to bump the framerate up, will this cause an out-of-memory exception or error of some sort? Basically, will this eat all my memory before the language has enough time to free any? I know this is the case for some applications in Java that have to deal with many small instances being created, where the program can just crash (at least, it happened to me a few times). Is there a better way, or is this good enough?
For clarification, my console/raster will never be wider or taller than a few hundred units, so about 100 * 80 is probably as big as it gets. Finally, my pixel class takes up 24 bytes (1 character, 1 background color, 1 foreground color).
In general, the garbage collector should be good enough to collect unused objects on the fly. It is hard to give any specific comment about how efficient this will be for your purposes - I think you just have to write a prototype to test the performance and see if that is good enough for you.
One important thing is that F# (and .NET) supports "value types" (structs), which are not going to be allocated as individual objects on the heap. When you create a (2D) array of value types, it will be just a single continuous block of memory (rather than 1 + 8,000 separate object instances for a 100 * 80 grid).
To define a value type, you can use the Struct attribute:
open System.Drawing

[<Struct>]
type Pixel(c:char, bg:Color, fg:Color) =
  member x.Char = c
  member x.Background = bg
  member x.Foreground = fg

let a2d =
  Array2D.init 100 80 (fun i j ->
    Pixel('a', Color.Red, Color.Black))
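On the allocation side of the original question: as long as the back buffer is allocated once and the background is copied into it each frame (rather than allocating a fresh array per frame), the GC has almost nothing to collect. A minimal sketch of that pattern, in C# for concreteness since this thread mixes languages (the same applies in F#), with a hypothetical Pixel struct mirroring the one above:

using System;

struct Pixel
{
    public char Char;
    public int Background;  // colors stored as ints to keep the sketch self-contained
    public int Foreground;
}

class Renderer
{
    readonly Pixel[,] background = new Pixel[100, 80]; // loaded once per background image
    readonly Pixel[,] frame = new Pixel[100, 80];      // reused every frame

    public void BeginFrame()
    {
        // Array.Copy treats a rectangular array as one linear block, so with a
        // struct Pixel this is a single block copy and no new allocation.
        Array.Copy(background, frame, background.Length);
    }
}

Sprites would then be blitted into frame before it is displayed.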
My friends and I are writing our own implementation of the Gamma Index algorithm. It should compute it within 1 s for standard-size 2D pictures (512 x 512), though it should also be able to calculate 3D pictures, and it should be portable and easy to install and maintain.
Gamma Index, in case you haven't come across this topic, is a method for comparing pictures. As input we provide two pictures (reference and target); every picture consists of points distributed over a regular fine grid, and every point has a location and a value. As output we receive a picture of Gamma Index values. For each point of the target picture we calculate some function (called gamma) against every point of the reference picture (in the original version), or against the points of the reference picture that are closest to the target point (in the version usually used in Gamma Index calculation software). The Gamma Index for a given target point is the minimum of the gamma values calculated for it.
So far we have tried the following ideas, with these results:
use a GPU - the calculation time decreased 10 times. The problem is that it's fairly difficult to install on machines with a non-nVidia graphics card
use a supercomputer or cluster - the problem is the maintenance of this solution. Plus, every picture has to be encrypted for travel through the network due to data sensitivity
iterate over points ordered by their distance to the target point, with some extra stop criterion - this way we got 15 seconds at best (which is actually not perfectly precise)
Currently we are writing in Python because of NumPy's awesome optimizations for matrix calculations, but we are open to other languages too.
Do you have any ideas how we can accelerate our algorithm(s) in order to meet the objectives? Do you think this level of performance is attainable?
Some more information about GI for anyone interested:
http://lcr.uerj.br/Manual_ABFM/A%20technique%20for%20the%20quantitative%20evaluation%20of%20dose%20distributions.pdf
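For reference, a brute-force sketch of the per-point search in C# (this thread spans several languages), assuming the usual definition from the linked Low et al. paper: gamma for a target point is the minimum over reference points of sqrt((distance/dta)^2 + (doseDifference/dd)^2). The search window plays the role of the stop criterion mentioned above, since beyond some radius the distance term alone exceeds the current minimum:

using System;

static class GammaIndex
{
    // refDose/tgtDose: dose grids on the same regular grid; spacing in mm.
    // dta: distance-to-agreement criterion (mm); dd: dose-difference criterion
    // (same units as dose). All names here are invented for the sketch.
    public static double[,] Compute(double[,] refDose, double[,] tgtDose,
                                    double spacing, double dta, double dd)
    {
        int h = tgtDose.GetLength(0), w = tgtDose.GetLength(1);
        var result = new double[h, w];
        int window = (int)Math.Ceiling(3 * dta / spacing); // pragmatic search radius

        for (int y = 0; y < h; y++)
        for (int x = 0; x < w; x++)
        {
            double best = double.MaxValue;
            for (int j = Math.Max(0, y - window); j <= Math.Min(h - 1, y + window); j++)
            for (int i = Math.Max(0, x - window); i <= Math.Min(w - 1, x + window); i++)
            {
                double dist2 = ((j - y) * (j - y) + (i - x) * (i - x))
                               * spacing * spacing / (dta * dta);
                if (dist2 >= best) continue; // distance term alone is already worse
                double doseDiff = refDose[j, i] - tgtDose[y, x];
                double g2 = dist2 + doseDiff * doseDiff / (dd * dd);
                if (g2 < best) best = g2;
            }
            result[y, x] = Math.Sqrt(best);
        }
        return result;
    }
}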
In summary, I'm looking for ways to deal with a situation where the very first step in the calculation is a conditional branch between two computationally expensive branches.
I'm essentially trying to implement a graphics filter that operates on an image and a mask - the mask is a bitmap array the same size as the image, and the filter performs different operations according to the value of the mask. So I basically want to do something like this for each pixel:
if(mask == 1) {
foo();
} else {
bar();
}
where both foo and bar are fairly expensive operations. As I understand it, when I run this code on the GPU it will have to calculate both branches for every pixel. (This gets even more expensive if there are more than two possible values for the mask.) Is there any way to avoid this?
One option I can think of would be to sort, in the host code, all the pixels into two 1-dimensional arrays based on the value of the mask at that point, and then run entirely different kernels on them; then reconstruct the image from the two datasets afterwards. The problem with this is that, in my case, I want to run the filter iteratively, and both the image and the mask change with each iteration (the mask is actually calculated from the image). If I'm splitting the image into two buckets in the host code, I have to transfer the image and mask from the GPU each iteration, and then transfer the new buckets back to the GPU, introducing a new bottleneck to replace the old one.
Is there any other way to avoid this bottleneck?
Another approach might be to do a simple bucket sort within each work-group using the mask.
So add a local memory array and an atomic counter for each value of the mask. First read a pixel (or a set of pixels might be better) for each work-item, increment the appropriate atomic counter and write the pixel address into that location in the array.
Then perform a work-group barrier.
Then, as a final stage, assign some set of work-items, maybe a multiple of the underlying vector size, to each of those arrays and iterate through them. Your operations will then be largely efficient, barring some loss at the ends, and if you look at enough pixels per work-item you may lose very little efficiency even if you assign the entire group to one mask value and then the other in turn.
Given that your description only has two values of the mask, fitting two arrays into local memory should be pretty simple and scale well.
Push a thread's demanding task to shared/local memory (synchronization slows the process) and execute the light ones until all light ones finish (so the slow sync latency is hidden by this), then execute the heavier ones.
if (mask == 1) {
    uploadFoo();   // heavy: upload the work to a __local object[] queue
} else {
    processBar();  // light: compute, then check local memory for any queued foo() work
    downloadFoo();
}
using a producer-consumer approach, maybe.
I would like to know the best way to manage a large 3D array, with something like:
x = 1000
y = 1000
z = 100
=> 100000000 objects
And each cell is an object with some amount of data.
Simple methods are very loooooong, even if all data are collapsed (I first tried an array of arrays of arrays of objects):
class Test
  def initialize
    @name = "Test"
  end
end

qtt = 1000 * 1000 * 100
# create one Test per cell; note that Array.new(qtt).each { |e| e = Test.new }
# would be a no-op, since assigning to the block variable stores nothing
a = Array.new(qtt) { Test.new }
I read somewhere that a DB could be a good thing for such cases.
What do you think about this ?
What am I trying to do ?
This "matrix" represents a world. And each element is a 1mx1mx2m block who could be a different kind (water, mud, stone, ...) Some block could be empty too.
But the user should be able to remove blocks everywhere and change everything around (if they where water behind, it will flow through the hole for exemple.
In fact what I wish to do is not Minecraft be a really small clone of DwarfFortress (http://www.bay12games.com/dwarves/)
Other interesting things
In my model the ground is at level 10, which means that layers [0,10] are empty sky in most cases.
Only hills and parts of mountains are present in those layers.
The underground is basically unknown and not dug, so we should not have to add instances for unused blocks.
What we should add to the model from the beginning: gems, gold, and water, which could be stored without having to store the adjacent stone/mud/earth blocks.
At the beginning of the game, 80% of the cube doesn't need to be loaded in memory.
Each time we dig, we create new blocks: the empty block we dug and the blocks around it.
The only things we should index are:
underground rivers
underground lakes
lava rivers
Holding that many objects in memory is never a good thing. A flat-file or database-centric approach would be a lot more efficient and easier to maintain.
What I would do - The object-oriented approach
Store the parameters of the blocks as simple data and construct the objects dynamically.
Create a Block class to represent a block in the game, and give it variables to hold the parameters of that particular block:
class Block
  # location of the Block
  attr_accessor :x, :y, :z
  # an individual id for the Block
  attr_accessor :id
  # to define the block type (rock, water etc.)
  attr_accessor :block_type
  # and add any other attributes of a Block...
end
I'd then create a few methods that would enable me to serialise/de-serialise the data to a file or database.
As you've stated it works on a board, you'd also need a Board class to represent it that would maintain the state of the game as well as perform actions on the Block objects. Using the x, y, z attributes from each Block you can determine its location within the game. Using this information you can then write a method in the Block class that locates those blocks adjacent to the current one. This would enable you to perform the "cascading" effects you talk about where one Block is affected by actions on another.
Accessing the data efficiently
This will rely entirely on how you choose to serialise the Block objects. I would probably choose a binary format to reduce unnecessary data reads and store the objects via their id parameter, and then use something like MMIO to quickly do random-access reads/writes on a large data file in an Array-like manner. This will allow you to access the data quickly and efficiently, without the memory overhead. How you read the data will relate to your adjacent blocks method above.
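To make the MMIO idea concrete, here is a sketch using .NET's memory-mapped files (C# for concreteness, since this thread spans languages; Ruby has comparable mmap bindings). The fixed-size BlockRecord layout and the id-to-offset scheme are assumptions of the sketch, not part of the answer above:

using System;
using System.IO.MemoryMappedFiles;
using System.Runtime.InteropServices;

[StructLayout(LayoutKind.Sequential)]
struct BlockRecord            // hypothetical fixed-size record
{
    public int X, Y, Z;
    public int BlockType;     // 0 = empty/unknown, 1 = rock, 2 = water, ...
}

class BlockStore : IDisposable
{
    readonly MemoryMappedFile file;
    readonly MemoryMappedViewAccessor view;
    static readonly int RecordSize = Marshal.SizeOf<BlockRecord>();

    public BlockStore(string path, long capacityInRecords)
    {
        file = MemoryMappedFile.CreateFromFile(path,
            System.IO.FileMode.OpenOrCreate, null, capacityInRecords * RecordSize);
        view = file.CreateViewAccessor();
    }

    // random-access read/write by block id; no per-block heap object is kept alive
    public BlockRecord Read(long id)
    {
        view.Read(id * RecordSize, out BlockRecord r);
        return r;
    }

    public void Write(long id, ref BlockRecord r) => view.Write(id * RecordSize, ref r);

    public void Dispose() { view.Dispose(); file.Dispose(); }
}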
You can of course also choose the DB storage route which will allow you to isolate the Blocks and do lookups on particular blocks in a higher-level manner, however that might give you a bit of extra overhead.
It sounds like an interesting project, I hope this helps a bit! :)
P.S. With regards to the comment above by @Linuxious about choosing a different language: yes, this might be true in some cases, but a skilled programmer never blames his tools. A program is only as efficient as the programmer makes it... unless you're writing it in Java ;)
Let's say I got a collection (simple grid) of invaders:
In this image, only invader type C can shoot.
Shots are fired, an invader gets destroyed:
Now, invader type B in the third column in the second row can fire as well. Note that there can only be three random invader shots on the screen at the same time. So only three of the invaders in the set {C, C, B, C, C, C} can shoot.
How would I go about implementing this? I am thinking of two solutions:
Use an array of arrays [][] (or [,]). When an invader gets shot, the place where the invader was is set to null. Then, when it's time for the invaders to fire, a loop goes over the first row. Encountering a null makes it check the space above the null. Is it null? Then do the same for the space above that. Is the space in the uppermost row null? Go to the next column in the first row.
Each invader type has a position (I use Point for that). Assign to each position the row number (the collection used will be some sort of dictionary). So, when looking at the image, all C's get a 1, all B's get a 2, and all A's get a 3.
In this picture, the C at position (2, 2) is destroyed. I should then subtract 1 from the Y value of its point, which gives (2, 1). If there's such a position in the collection, then assign the invader at that position (2, 1) to the position of the invader that got destroyed (2, 2). This way, I don't have to have a jagged array containing a bunch of nulls.
My thoughts about what it should look like: when the game starts the first set is {C C C C C C}, and then it will be {C C B C C C}. From this set, three will be randomly chosen to fire.
So, any thoughts?
I disagree with Mirkules. I would suggest you not keep a separate data structure for only the invaders that can shoot. In general it's a good idea to stick to the DRY principle to prevent logic issues later on. For a simple application where you can keep the entire program in your head, it's probably not a big deal. But when you start working on larger projects, it becomes more difficult to remember that you need to update multiple data structures whenever you modify any one of the associated structures.
Premature optimization is the root of all evil. You probably don't even need to worry about optimization at such a minuscule level. In my experience, when you spend a great deal of time working on these types of issues, you end up with good code, but you don't have much to show for it. Instead, I prefer to spend time getting my app to do what I intend, and then refactor it at a later date. Seeing my app work properly gives me the motivation to continue writing more code.
Good luck with your game. Xna is so much fun to write games in!
In game development, especially in a managed language like C# and especially on the Xbox 360, in general your first priority should be to avoid allocating memory while the game is running. Saving memory and reducing operation count is a secondary concern.
A null (in 32-bit, which XNA runs in) is just four bytes!
A 2D array ([,]) containing pointers to your invaders seems entirely appropriate. Especially as it allows you to make the location of each invader implicit by its location in the data structure. (Do you even need to create individual invader objects and point to them? Just use a number which indicates what "type" of invader they are.)
Looping through that data structure (in the manner you suggest) is going to be so amazingly fast that it may well be a "free" operation. Because the processor itself can process the data faster than you can bring it into the cache anyway.
AND you're not even doing it every frame - only when your invaders fire! I am willing to bet that it would be slower to calculate and store that data whenever an invader is destroyed and then load it when your invaders fire.
(Basically what you are proposing is caching/pre-computing that data. A useful performance optimisation technique - but only when it's actually necessary.)
You should be a lot more worried about costs that happen each frame, than ones that are only triggered occasionally by timers and user input.
Do not use a jagged array ([][]). This is basically an array of arrays. It uses more memory and involves an additional layer of indirection which in turn also has the effect of potentially reducing the locality of your data (meaning your data might not end up in the cache in a single hit - this is the "slow bit"). It also increases the number of objects the GC has to think about. Do not use a Dictionary for the same reasons.
In game development it helps to keep this kind of performance stuff at least in-mind when you work (anywhere else this would be completely premature optimisation).
But, for something as simple as Space Invaders, you can pretty much do whatever you like! So do the simplest thing that could possibly work.
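As an illustration of "the simplest thing", here is a hedged sketch of the bottom-up column scan from the question, over a plain byte[,] where 0 means empty (layout and names invented for the sketch):

using System.Collections.Generic;

static class InvaderScan
{
    // grid[row, col]: 0 = empty, otherwise the invader type.
    // Row 0 is taken to be the front row; higher rows are further back.
    public static List<(int row, int col)> FindShooters(byte[,] grid)
    {
        var shooters = new List<(int, int)>();
        for (int col = 0; col < grid.GetLength(1); col++)
        {
            for (int row = 0; row < grid.GetLength(0); row++)
            {
                if (grid[row, col] != 0)
                {
                    shooters.Add((row, col));
                    break; // only the front-most invader in a column can fire
                }
            }
        }
        return shooters;
    }
}

Picking the (up to) three invaders that actually fire is then just sampling three random entries from the returned list.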
I would keep a 2D array of all invaders. In addition, it would be a lot faster to maintain a separate data structure holding pointers only to the invaders that can fire, instead of looping through the entire array each time you have to fire (which would make your program slowest at the beginning, when there are the most invaders). So in your first diagram, your data structure would contain all the C's, and in the second diagram {C, C, B, C, C, C}. When it comes time to fire, you just reference this data structure, get the pointers, and call fire() on any of those invaders.
You didn't quite explain how the three invaders that can fire are selected, so my guess is they're chosen randomly -- in this case, all you would have to do is pick a random number between 0 (inclusive) and n-1 (where n is the number of invaders in your data structure, in this case 6).
Finally, when it comes time to destroy an invader, if you have the 2D array and you know the position, it will be really easy to pop the killed invader off the "firing squad" data structure and assign the one above him to the firing squad (i.e. invaderArray[KilledInvader.row-1][KilledInvader.column]).
Hope this makes sense.
Can an invader change column? And can it pass an invader that's in front of it?
Assuming that the answer to both questions is no, I would maintain a Queue/List for each column. The set of invaders that can fire is then the first element of each Queue. When an invader is destroyed, you just pop it off its queue. Again, this assumes only the front of a column can be destroyed.
Each invader would have to maintain a position for updating and drawing purposes.
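A quick sketch of that queue-per-column idea under those assumptions (type and member names invented for the sketch):

using System.Collections.Generic;
using System.Linq;

class Invader { public int Row, Col; public char Type; }

class InvaderColumns
{
    // one queue per column, filled front-to-back when the wave is created;
    // the front of each queue is the invader that can currently fire
    readonly Queue<Invader>[] columns;

    public InvaderColumns(int columnCount) =>
        columns = Enumerable.Range(0, columnCount)
                            .Select(_ => new Queue<Invader>()).ToArray();

    public void Add(int col, Invader invader) => columns[col].Enqueue(invader);

    public IEnumerable<Invader> Shooters() =>
        columns.Where(q => q.Count > 0).Select(q => q.Peek());

    // assuming only the front invader of a column can be destroyed
    public void DestroyFront(int col) => columns[col].Dequeue();
}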
To experiment, I've (long ago) implemented Conway's Game of Life (and I'm aware of this related question!).
My implementation worked by keeping 2 arrays of booleans, representing the 'last state', and the 'state being updated' (the 2 arrays being swapped at each iteration). While this is reasonably fast, I've often wondered about how to optimize this.
One idea, for example, would be to precompute at iteration N the zones that could be modified at iteration (N+1) (so that if a cell does not belong to such a zone, it won't even be considered for modification at iteration (N+1)). I'm aware that this is very vague, and I never took time to go into the details...
Do you have any ideas (or experience!) of how to go about optimizing (for speed) Game of Life iterations?
I am going to quote my answer from the other question, because the chapters I mention have some very interesting and fine-tuned solutions. Some of the implementation details are in C and/or assembly, yes, but for the most part the algorithms can work in any language:
Chapters 17 and 18 of Michael Abrash's Graphics Programmer's Black Book are one of the most interesting reads I have ever had. It is a lesson in thinking outside the box. The whole book is great really, but the final optimized solutions to the Game of Life are incredible bits of programming.
There are some super-fast implementations that (from memory) represent cells of 8 or more adjacent squares as bit patterns and use that as an index into a large array of precalculated values to determine in a single machine instruction if a cell is live or dead.
Check out here:
http://dotat.at/prog/life/life.html
Also XLife:
http://linux.maruhn.com/sec/xlife.html
You should look into Hashlife, the ultimate optimization. It uses the quadtree approach that skinp mentioned.
As mentioned in Abrash's Black Book, one of the simplest and most straightforward ways to get a huge speedup is to keep a change list.
Instead of iterating through the entire cell grid each time, keep a copy of all the cells that you change.
This will narrow down the work you have to do on each iteration.
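A sketch of the change-list idea in C#, assuming a double-buffered bool grid like the one in the question: only cells that flipped last generation, plus their neighbors, can possibly flip this generation, so those are the only cells examined:

using System;
using System.Collections.Generic;

static class ChangeListLife
{
    // one generation; 'changed' is the set of cells that flipped last generation
    public static HashSet<(int x, int y)> Step(bool[,] cur, bool[,] next,
                                               HashSet<(int x, int y)> changed)
    {
        Array.Copy(cur, next, cur.Length); // unexamined cells carry over unchanged

        var candidates = new HashSet<(int, int)>();
        foreach (var (x, y) in changed)
            for (int dy = -1; dy <= 1; dy++)
                for (int dx = -1; dx <= 1; dx++)
                    candidates.Add((x + dx, y + dy)); // the cell and its neighbors

        var flipped = new HashSet<(int x, int y)>();
        foreach (var (x, y) in candidates)
        {
            if (x < 1 || y < 1 || x >= cur.GetLength(1) - 1 || y >= cur.GetLength(0) - 1)
                continue; // border cells skipped for brevity
            int n = 0;
            for (int dy = -1; dy <= 1; dy++)
                for (int dx = -1; dx <= 1; dx++)
                    if ((dx != 0 || dy != 0) && cur[y + dy, x + dx]) n++;
            bool alive = n == 3 || (cur[y, x] && n == 2);
            next[y, x] = alive;
            if (alive != cur[y, x]) flipped.Add((x, y));
        }
        return flipped; // becomes 'changed' for the next generation, after swapping buffers
    }
}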
The algorithm itself is inherently parallelizable. Using the same double-buffered method in an unoptimized CUDA kernel, I'm getting around 25ms per generation in a 4096x4096 wrapped world.
Which algorithm is most efficient mainly depends on the initial state.
If the majority of cells are dead, you can save a lot of CPU time by skipping empty parts and not calculating things cell by cell.
In my opinion it makes sense to check for completely dead spaces first when your initial state is something like "random, but with a chance for life lower than 5%".
I would just divide the matrix up into halves and start checking the bigger ones first.
So if you have a field of 10,000 * 10,000, you'd first accumulate the states of the upper-left quarter of 5,000 * 5,000.
If the sum of states is zero in the first quarter, you can ignore this first quarter completely and check the upper-right 5,000 * 5,000 for life next.
If its sum of states is > 0, you divide the second quarter up into 4 pieces again, and repeat this check for life for each of these subspaces.
You could go down to subspaces of 8*8 or 10*10 (not sure what makes the most sense here).
Whenever you find life, you mark these subspaces as "has life".
Only spaces which "have life" need to be divided into smaller subspaces - the empty ones can be skipped.
When you have finished assigning the "has life" attribute to all possible subspaces, you end up with a list of subspaces which you now simply extend by +1 in each direction - with empty cells - and apply the regular (or modified) Game of Life rules to them.
You might think that dividing a 10,000 * 10,000 space up into subspaces of 8*8 is a lot of tasks, but accumulating their state values is in fact much, much less computing work than applying the GoL algorithm to each cell plus its 8 neighbours, comparing the numbers and storing the new state for the next iteration somewhere...
But as I said above, for a random initial state with 30% population this won't make much sense, as there will not be many completely dead 8*8 subspaces to find (let alone dead 256*256 subspaces).
And of course, the way to perfect optimisation will, last but not least, depend on your language.
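For illustration, a tiny recursive version of that subdivision check (C#, names invented for the sketch): regions whose population sums to zero are skipped whole, and live regions are subdivided down to a minimum size:

using System;
using System.Collections.Generic;

static class DeadSpaceSkipper
{
    // collect the small subspaces that contain life; dead regions are skipped whole
    public static void FindLive(bool[,] grid, int x, int y, int w, int h,
                                int minSize, List<(int x, int y, int w, int h)> live)
    {
        int sum = 0; // re-scanning at each level keeps the sketch short;
                     // a real version would accumulate sums bottom-up
        for (int j = y; j < y + h; j++)
            for (int i = x; i < x + w; i++)
                if (grid[j, i]) sum++;
        if (sum == 0) return;            // completely dead: ignore this region

        if (w <= minSize && h <= minSize)
        {
            live.Add((x, y, w, h));      // small enough: mark as "has life"
            return;
        }

        int hw = Math.Max(1, w / 2), hh = Math.Max(1, h / 2);
        FindLive(grid, x,      y,      hw,     hh,     minSize, live);
        FindLive(grid, x + hw, y,      w - hw, hh,     minSize, live);
        FindLive(grid, x,      y + hh, hw,     h - hh, minSize, live);
        FindLive(grid, x + hw, y + hh, w - hw, h - hh, minSize, live);
    }
}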
Two ideas:
(1) Many configurations are mostly empty space. Keep a linked list (not necessarily in order, that would take more time) of the live cells, and during an update, only update around the live cells (this is similar to your vague suggestion, OysterD :)
(2) Keep an extra array which stores the # of live cells in each row of 3 positions (left-center-right). Now when you compute the new dead/live value of a cell, you need only 4 read operations (top/bottom rows and the center-side positions), and 4 write operations (update the 3 affected row summary values, and the dead/live value of the new cell). This is a slight improvement from 8 reads and 1 write, assuming writes are no slower than reads. I'm guessing you might be able to be more clever with such configurations and arrive at an even better improvement along these lines.
If you don't want anything too complex, then you can use a grid to slice it up, and if that part of the grid is empty, don't try to simulate it (please view Tyler's answer). However, you could do a few optimizations:
Set different grid sizes depending on the amount of live cells, so if there's not a lot of live cells, that likely means they are in a tiny place.
When you randomize it, don't use the grid code until the user changes the data: I've personally tested randomizing it, and even after a long time it still fills most of the board (unless the grid is sufficiently small, at which point it won't help that much anymore).
If you are showing it on screen, don't use rectangles for pixel sizes 1 and 2: instead, set the pixels of the output directly. For any higher pixel size I find it's okay to use the native rectangle-filling code. Also, pre-set the background so you don't have to fill the rectangles for the dead cells (not the live ones, because live cells disappear pretty quickly).
I don't know exactly how this can be done, but I remember some of my friends had to represent this game's grid with a quadtree for an assignment. I'm guessing it's really good for optimizing the space of the grid, since you basically only represent the occupied cells. I don't know about execution speed, though.
It's a two dimensional automaton, so you can probably look up optimization techniques. Your notion seems to be about compressing the number of cells you need to check at each step. Since you only ever need to check cells that are occupied or adjacent to an occupied cell, perhaps you could keep a buffer of all such cells, updating it at each step as you process each cell.
If your field is initially empty, this will be much faster. You probably can find some balance point at which maintaining the buffer is more costly than processing all the cells.
There are table-driven solutions for this that resolve multiple cells in each table lookup. A google query should give you some examples.
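To make the table-driven idea concrete, a minimal (illustrative, nowhere near the super-fast versions) C# sketch: pack a cell's 3x3 neighborhood into 9 bits and use it as an index into a 512-entry table of precomputed next states:

static class LifeTable
{
    // next[bits] = next state of the center cell, where bits encodes the 3x3
    // neighborhood row by row (bit 4 is the center cell itself)
    static readonly byte[] next = BuildTable();

    static byte[] BuildTable()
    {
        var t = new byte[512];
        for (int bits = 0; bits < 512; bits++)
        {
            int center = (bits >> 4) & 1;
            int neighbors = 0;
            for (int i = 0; i < 9; i++)
                if (i != 4) neighbors += (bits >> i) & 1;
            t[bits] = (byte)(neighbors == 3 || (center == 1 && neighbors == 2) ? 1 : 0);
        }
        return t;
    }

    // gather the 9 bits around interior cell (x, y) and look up its next state
    public static int Step(byte[,] grid, int x, int y)
    {
        int bits = 0, i = 0;
        for (int dy = -1; dy <= 1; dy++)
            for (int dx = -1; dx <= 1; dx++, i++)
                bits |= grid[y + dy, x + dx] << i;
        return next[bits];
    }
}

The fast implementations go much further (wider lookups, incremental updates), but the lookup-per-index principle is the same.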
I implemented this in C#:
All cells have a location, a neighbor count, a state, and access to the rule.
Put all the live cells from array B into array A.
Have all the cells in array A add 1 to the neighbor count of their neighbors.
Have all the cells in array A put themselves and their neighbors into array B.
All the cells in array B update according to the rule and their state.
All the cells in array B set their neighbor counts back to 0.
Pros:
Ignores cells that don't need to be updated
Cons:
4 arrays: a 2D array for the grid, an array for the live cells, and an array for the active cells.
Can't process rule B0.
Processes cells one by one.
Cells aren't just booleans
Possible improvements:
Cells also have an "Updated" value; they are updated only if they haven't been updated in the current tick, removing the need for array B as mentioned above.
Instead of array B holding the cells with live neighbors, array B could hold the cells without, and those check for rule B0.
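A compact C# sketch of the scheme described above, with arrays A and B realized as a list and a set (names invented for the sketch):

using System.Collections.Generic;

class NeighborCountLife
{
    public bool[,] Alive;
    readonly int[,] neighborCount;
    readonly int width, height;

    public NeighborCountLife(int w, int h)
    {
        width = w; height = h;
        Alive = new bool[h, w];
        neighborCount = new int[h, w];
    }

    // one tick; 'live' plays the role of array A and is rebuilt for the next tick
    public void Step(List<(int x, int y)> live)
    {
        var active = new HashSet<(int x, int y)>(); // plays the role of array B

        // live cells add 1 to their neighbors' counts and register cell + neighbors
        foreach (var (x, y) in live)
            for (int dy = -1; dy <= 1; dy++)
                for (int dx = -1; dx <= 1; dx++)
                {
                    int nx = x + dx, ny = y + dy;
                    if (nx < 0 || ny < 0 || nx >= width || ny >= height) continue;
                    if (dx != 0 || dy != 0) neighborCount[ny, nx]++;
                    active.Add((nx, ny));
                }

        // active cells update by the rule, then reset their counts for the next tick
        live.Clear();
        foreach (var (x, y) in active)
        {
            int n = neighborCount[y, x];
            Alive[y, x] = n == 3 || (Alive[y, x] && n == 2);
            if (Alive[y, x]) live.Add((x, y));
            neighborCount[y, x] = 0;
        }
    }
}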