how to remove nested foreach loops for performance emprovement - performance

I have a performance-based question.
Is there a way to remove the nested foreach loops replacing them with something more performant ? Here is an example:
List<foo> foos = SelectAllfoos();
foreach(foo f in foos){
//dosomething
foreach(foo2 f2 in foo.GetFoos2()){
//dosomething
}
foreach(foo3 f3 in foo.GetFoos3()){
//dosomething
}
foreach(foo4 f4 in foo.GetFoos4()){
//dosomething
foreach(foo4_1 f4_1 in f4.GetFoos4_1()){
//dosomething
}
}
}
Obiouvsly it is a fake code I just invented for this example. But imagine you have something like that. How should you improve this method's performances?
PS: I already tried using System.Threading.Task.Parallel.ForEachand it improve performance, but I mean a better way to write this code.
PPS: this is written in C#, but my question regards a wider scope, something useful in all languages.

Since the question is rather general and only focused on loops which provide no information about the actual work being done, I can only provide a general answer.
The last thing you typically want to focus on are the loop mechanics themselves. These often yield little, if any, impact.
Typically if you have this kind of situation where algorithmic improvements are out (ex: sequential loops that cannot do better than linear-time complexity as they require traversing and doing something with every single element no matter what), then the two biggest improvements will often come from parallelization and memory optimization.
The latter one is unfortunately less discussed, especially in higher-level languages, but often carries just as much or more impact. It can improve execution times by orders of magnitudes, and is applicable regardless of the language. Concepts like cache efficiency are not language-dependent concepts, as the hardware remains the same no matter what programming language we use (though how we achieve it can vary considerably between languages).
Memory Access Patterns
For example, take an image processing algorithm. In that case, given two otherwise identical machine instructions (except for the fact that they are swapped), a memory access pattern accessing pixels one horizontal scanline at a time in the outer loop can significantly outperform a memory access pattern that accesses pixels one vertical column of pixels at a time. This would be true even with otherwise identical machine instructions that have the same total instruction-level cost (though instruction costs are variable), but merely access memory in a swapped order.
It's because, put crudely, computers fetch data from slower forms of memory into faster forms of memory in contiguous chunks (pages, cache lines). When you access pixels of an image horizontally, an adjacent, horizontal chunk of pixels might be fetched from a slower form of memory into a faster form, and you end up accessing all the neighboring pixels from the faster form of memory prior to moving on to the next series of pixels. When you access pixels of an image in a vertical fashion, you end up loading horizontal neighboring pixels into a faster form of memory only to use one pixel from that column. The result can significantly slow down the resulting image algorithm as a result of cache misses, since we're failing to use all the data available when it's loaded into a smaller but faster form of memory prior to it being evicted (we're basically wasting a lot of the benefits of that smaller but faster memory).
So typically if you want to make loops go faster, and algorithmic improvements are out, you want to analyze the way that memory is being accessed and potentially change even the memory layout of the data structures involved. Computers like it when you access contiguous data close together in memory, and don't like it so much when you're accessing memory in a chaotic way that's going all over the place. They like arrays which pack their memory contents tightly together a lot more than linked structures which scatter the memory all over the place (unless the linked structures or their memory allocators are carefully designed not to do that). Speedy loops don't come from changing the mechanics of the loop so much as what the loops are doing, but deeper than algorithmic improvements and perhaps even parallelization are those memory-related optimizations coming from a data-oriented design mindset. In languages like C#, one of the techniques to get better locality of reference out of your data structures is object pooling.
Loop Tiling/Blocking
Occasionally there are opportunities where you can improve the memory access patterns by simply changing the way you loop over the data without actually changing the way the data is represented. One such example is loop tiling (aka loop blocking): https://software.intel.com/en-us/articles/how-to-use-loop-blocking-to-optimize-memory-use-on-32-bit-intel-architecture. But again, here the speedup isn't coming from optimizing how you write the loop, per se, but optimizing the way you traverse the data in a way that exploits locality of reference. It's still entirely about memory access.
Profiling
All of these micro-level optimization techniques have a tendency to make your code harder to maintain, so they're almost always best applied in hindsight with plenty of profiling measurements in your hand. The first thing to learn about optimization in general is how to measure, to do it based on hard data rather than hunches. Beginners tend to want to optimize more, not less, because they're doing it based on guesses about what might be inefficient instead of hard data and proper measurements. It's easy to do this for glaring algorithmic bottlenecks, but anything else typically demands a profiler in your hand. A good optimizer is a sniper dispatching hotspots, not a grenadier blindly hurling grenades at anything that might slow things down. In fact, knowing how to prioritize optimizations properly and to make the proper measurements is probably even more important than understanding the inner workings of the machine. So probably beyond all this stuff, if you want to make your loops go faster, first grab a profiler and learn how to measure inefficiencies properly. The first thing to ask is not how to make things faster so much as what actually needs to be faster (and just as importantly if not more, what doesn't).

Related

can we improve dynamic array to make it more faster

I have learned about dynamic array (non-fixed size array) as dynamic array as vector in C++ and Arraylist in Java
And how can we implement it.
Basically when the array is full we create another array of doubled size and copy the old items to the new array
So can we implement an array of non-fixed size with random access as a vector and Arraylist without spending time copying the old elements?
In other word, Is there data structure like that (dynamic size and random access and no need for copy elements)??
Depending on what you mean by "like", this is trivially impossible to already exists.
First the trivially impossible. When we create an array, we mark a section of memory as being only for that array. If you have 3 such arrays that can grow without bound, one of them will eventually run into another. Given that we can actually create arrays that are bigger than available memory (it just pages to disk), we have to manage this risk, not avoid it.
But how big an issue is it? Copying data is O(1) per element, no matter how big it gets. And the overhead is low. The cost of this dynamicism is that you need to always check where the array starts. But that's a pretty fast check.
Alternately we can move to paged memory. Now an array access looks like, "Check what page it is on, then look at where it is in the page." Now your array can grow, but you never change where anything is. But if you want it to grow without bound, you have to add levels. We can implement it, and it does avoid copying, but this form of indirection has generally NOT proven worth it for general purpose programming. However paging is used in databases. And it is also used by operating systems to manage turning what the program thinks is the address of the data, to the actual address in memory. If you want to dive down that rabbit hole, TLB is worth looking at.
But there are other options that exist as well. Instead of fixed sized pages, we can have variable sized ones. This approach gets very complicated, very quickly. But the result is very useful. Look up ropes for more.
The browser that I wrote this on stores the text of what I wrote using a rope. This is how it can easily offer features like multi-level undo and editing in the middle of the document. However the raw performance of such schemes is significant. It is clearly worthwhile if you need the features, but otherwise we don't do it.
In short, every set of choices we make has tradeoffs. The one you'd like to optimize has what has proven to be the best tradeoff for offering dynamic size and raw performance. That's why it appears everywhere from Python lists to C++ vectors.

What is a coarse and fine grid search?

I was reading this answer
Efficient (and well explained) implementation of a Quadtree for 2D collision detection
and encountered this paragraph
All right, so actually quadtrees are not my favorite data structure for this purpose. I tend to prefer a grid hierarchy, like a coarse grid for the world, a finer grid for a region, and an even finer grid for a sub-region (3 fixed levels of dense grids, and no trees involved), with row-based optimizations so that a row that has no entities in it will be deallocated and turned into a null pointer, and likewise completely empty regions or sub-regions turned into nulls. While this simple implementation of the quadtree running in one thread can handle 100k agents on my i7 at 60+ FPS, I've implemented grids that can handle a couple million agents bouncing off each other every frame on older hardware (an i3). Also I always liked how grids made it very easy to predict how much memory they'll require, since they don't subdivide cells. But I'll try to cover how to implement a reasonably efficient quadtree.
This type of grid seems intuitive, it sort of sounds like a "N-order" grid, where instead of 4 child nodes, you have N child nodes per parent. N^3 can go much further than 4^3, which allows better precision with potentially (I guess) less branching (since there are many less nodes to branch).
I'm a little puzzled because I would intuitively use a single, or maybe 3 std::map with the proper < operator(), to reduce its memory footprint, but I'm not sure it would be so fast, since querying an AABB would mean stacking several accesses that are O(log n).
What exactly are those row-based optimizations he is talking about? Is this type of grid search common?
I have some understanding of a z order curve, and I'm not entirely satisfied with a quadtree.
It's my own quote. But that's based on a common pattern I encountered in my personal experience. Also, terms like "parent" and "child" are ones I'd largely discard when talking about grids. You just got a big 2-dimensional or N-dimensional table/matrix storing stuff. There's not really a hierarchy involved whatsoever -- these data structures are more comparable to arrays than trees.
"Coarse" and "Fine"
On "coarse" and "fine", what I meant there is that "coarse" search queries tend to be cheaper but give more false positives. A coarser grid would be one that is lower in grid resolution (fewer, larger cells). Coarse searches may involve traversing/searching fewer and larger grid cells. For example, say we want to see if an element intersects a point/dot in a gigantic cell (imagine just a 1x1 grid storing everything in the simulation). If the dot intersects the cell, we may get a whole lot of elements returned in that cell but maybe only one or none of them actually intersect the dot.
So a "coarse" query is broad and simple but not very precise at narrowing down the list of candidates (or "suspects"). It may return too many results and still leave a whole lot of processing required left to do to narrow down what actually intersects the search parameter*.
It's like in those detective shows when they search a database for a
possible killer, putting in "white male" might not require much
processing to list the results but might give way too many results to
properly narrow down the suspects. "Fine" would be the opposite and might require more processing of the database but narrow down the result to just one suspect.
This is a crude and flawed analogy but I hope it helps.
Often the key to broadly optimizing spatial indices before we get into things like memory optimizations whether we're talking spatial hashes or quadtrees is to find a nice balance between "coarse" and "fine". Too "fine" and we might spend too much time traversing the data structure (searching many small cells in a uniform grid, or spending too much time in tree traversal for adaptive grids like quadtrees). Too "coarse" and the spatial index might give back too many results to significantly reduce the amount of time required for further processing. For spatial hashes (a data structure I don't personally like very much but they're very popular in gamedev), there's often a lot of thought and experimentation and measuring involved in determining an optimal cell size to use.
With uniform NxM grids, how "coarse" or "fine" they are (big or small cells and high or low grid resolution) not only impacts search times for a query but can also impact insertion and removal times if the elements are larger than a point. If the grid is too fine, a single large or medium-sized element may have to be inserted into many tiny cells and removed from many tiny cells, using lots of extra memory and processing. If it's too coarse, the element may only have to be inserted and removed to/from one large cell but at costs to the data structure's ability to narrow down the number of candidates returned from a search query to a minimum. Without care, going too "fine/granular" can become very bottlenecky in practice and a developer might find his grid structure using gigabytes of RAM for a modest input size. With tree variants like quadtrees, a similar thing can happen if the maximum allowed tree depth is too high a value causing explosive memory use and processing when the leaf nodes of the quadtree store the tiniest cell sizes (we can even start running into floating-point precision bugs that wreck performance if the cells are allowed to be subdivided to too small a size in the tree).
The essence of accelerating performance with spatial indices is often this sort of balancing act. For example, we typically don't want to apply frustum culling to individual polygons being rendered in computer graphics because that's typically not only redundant with what the hardware already does at the clipping stage, but it's also too "fine/granular" and requires too much processing on its own compared to the time required to just request to render one or more polygons. But we might net huge performance improvements with something a bit "coarser", like applying frustum culling to an entire creature or space ship (an entire model/mesh), allowing us to avoid requesting to render many polygons at once with a single test. So I often use the terms, "coarse" and "fine/granular" frequently in these sorts of discussions (until I find better terminology that more people can easily understand).
Uniform vs. Adaptive Grid
You can think of a quadtree as an "adaptive" grid with adaptive grid cell sizes arranged hierarchically (working from "coarse' to "fine" as we drill down from root to leaf in a single smart and adaptive data structure) as opposed to a simple NxM "uniform" grid.
The adaptive nature of the tree-based structures is very smart and can handle a broad range of use cases (although typically requiring some fiddling of maximum tree depth and/or minimum cell size allowed and possibly how many maximum elements are stored in a cell/node before it subdivides). However, it can be more difficult to optimize tree data structures because the hierarchical nature doesn't lend itself so easily to the kind of contiguous memory layout that our hardware and memory hierarchy is so well-suited to traverse. So very often I find data structures that don't involve trees to be easier to optimize in the same sense that optimizing a hash table might be simpler than optimizing a red-black tree, especially when we can anticipate a lot about the type of data we're going to be storing in advance.
Another reason I tend to favor simpler, more contiguous data structures in lots of contexts is that the performance requirements of a realtime simulation often want not just fast frame rates, but consistent and predictable frame rates. The consistency is important because even if a video game has very high frame rates for most of the game but some part of the game causes the frame rates to drop substantially for even a brief period of time, the player may die and game over as a result of it. It was often very important in my case to avoid these types of scenarios and have data structures largely absent pathological worst-case performance scenarios. In general, I find it easier to predict the performance characteristics of lots of simpler data structures that don't involve an adaptive hierarchy and are kind of on the "dumber" side. Very often I find the consistency and predictability of frame rates to be roughly tied to how easily I can predict the data structure's overall memory usage and how stable that is. If the memory usage is wildly unpredictable and sporadic, I often (not always, but often) find the frame rates will likewise be sporadic.
So I often find better results using grids personally, but if it's tricky to determine a single optimal cell size to use for the grid in a particular simulation context, I just use multiple instances of them: one instance with larger cell sizes for "coarse" searches (say 10x10), one with smaller ones for "finer" searches (say 100x100), and maybe even one with even smaller cells for the "finest" searches (say 1000x1000). If no results are returned in the coarse search, then I don't proceed with the finer searches. I get some balance of the benefits of quadtrees and grids this way.
What I did when I used these types of representations in the past is not to store a single element in all three grid instances. That would triple the memory use of an element entry/node into these structures. Instead, what I did was insert the indices of the occupied cells of the finer grids into the coarser grids, as there are typically far fewer occupied cells than there are a total number of elements in the simulation. Only the finest, highest-resolution grid with the smallest cell sizes would store the element. The cells in the finest grid are analogous to the leaf nodes of a quadtree.
The "loose-tight double grid" as I'm calling it in one of the answers to that question is an expansion on this multi-grid idea. The difference is that the finer grid is actually loose and has cell sizes that expand and shrink based on the elements inserted to it, always guaranteeing that a single element, no matter how large or small, needs only be inserted to one cell in the grid. The coarser grid stores the occupied cells of the finer grid leading to two constant-time queries (one in the coarser tight grid, another into the finer loose grid) to return an element list of potential candidates matching the search parameter. It also has the most stable and predictable memory use (not necessarily the minimal memory use because the fine/loose cells require storing an axis-aligned bounding box that expands/shrinks which adds another 16 bytes or so to a cell) and corresponding stable frame rates because one element is always inserted to one cell and doesn't take any additional memory required to store it besides its own element data with the exception of when its insertion causes a loose cell to expand to the point where it has to be inserted to additional cells in the coarser grid (which should be a fairly rare-case scenario).
Multiple Grids For Other Purposes
I'm a little puzzled because I would intuitively use a single, or maybe 3 std::map with the proper operator(), to reduce its memory footprint, but I'm not sure it would be so fast, since querying an AABB would mean stacking several accesses that are O(log n).
I think that's an intuition many of us have and also probably a subconscious desire to just lean on one solution for everything because programming can get quite tedious and repetitive and it'd be ideal to just implement something once and reuse it for everything: a "one-size-fits-all" t-shirt. But a one-sized-fits-all shirt can be poorly tailored to fit our very broad and muscular programmer bodies*. So sometimes it helps to use the analogy of a small, medium, and large size.
This is a very possibly poor attempt at humor on my part to make my long-winded texts less boring to read.
For example, if you are using std::map for something like a spatial hash, then there can be a lot of thought and fiddling around trying to find an optimal cell size. In a video game, one might compromise with something like making the cell size relative to the size of your average human in the game (perhaps a bit smaller or bigger), since lots of the models/sprites in the game might be designed for human use. But it might get very fiddly and be very sub-optimal for teeny things and very sub-optimal for gigantic things. In that case, we might do well to resist our intuitions and desires to just use one solution and use multiple (it could still be the same code but just multiple instances of the same class instance for the data structure constructed with varying parameters).
As for the overhead of searching multiple data structures instead of a single one, that's something best measured and it's worth remembering that the input sizes of each container will be smaller as a result, reducing the cost of each search and very possibly improve locality of reference. It might exceed the benefits in a hierarchical structure that requires logarithmic search times like std::map (or not, best to just measure and compare), but I tend to use more data structures which do this in constant-time (grids or hash tables). In my cases, I find the benefits far exceeding the additional cost of requiring multiple searches to do a single query, especially when the element sizes vary radically or I want some basic thing resembling a hierarchy with 2 or more NxM grids that range from "coarse" to "fine".
Row-Based Optimizations
As for "row-based optimizations", that's very specific to uniform fixed-sized grids and not trees. It refers to using a separate variable-sized list/container per row instead of a single one for the entire grid. Aside from potentially reducing memory use for empty rows that just turn into nulls without requiring an allocated memory block, it can save on lots of processing and improve memory access patterns.
If we store a single list for the entire grid, then we have to constantly insert and remove from that one shared list as elements move around, particles are born, etc. That could lead to more heap allocations/deallocations growing and shrinking the container but also increases the average memory stride to get from one element in that list to the next which will tend to translate to more cache misses as a result of more irrelevant data being loaded into a cache line. Also these days we have so many cores so having a single shared container for the entire grid may reduce the ability to process the grid in parallel (ex: searching one row while simultaneously inserting to another). It can also lead to more net memory use for the structure since if we use a contiguous sequence like std::vector or ArrayList, those can often store the memory capacity of as many as twice the elements required to reduce the time of insertions to amortized constant time by minimizing the need to reallocate and copy the former elements in linear-time by keeping excess capacity.
By associating a separate medium-sized container per grid row or per column instead of gigantic one for the entire grid, we can mitigate these costs in some cases*.
This is the type of thing you definitely measure before and after though to make sure it actually improves overall frame rates, and probably attempt in response to a first attempt storing a single list for the entire grid revealing many non-compulsory cache misses in the profiler.
This might beg the question of why we don't use a separate teeny list container for every single cell in the grid. It's a balancing act. If we store that many containers (ex: a million instances of std::vector for a 1000x1000 grid possibly storing very few or no elements each), it would allow maximum parallelism and minimize the stride to get from one element in a cell to the next one in the cell, but that can be quite explosive in memory use and introduce a lot of extra processing and heap overhead.
Especially in my case, my finest grids might store a million cells or more, but I only require 4 bytes per cell. A variable-sized sequence per cell would typically explode to at least something like 24 bytes or more (typically far more) per cell on 64-bit architectures to store the container data (typically a pointer and a couple of extra integers, or three pointers on top of the heap-allocated memory block), but on top of that, every single element inserted to an empty cell may require a heap allocation and compulsory cache miss and page fault and very frequently due to the lack of temporal locality. So I find the balance and sweet spot to be one list container per row typically among my best-measured implementations.
I use what I call a "singly-linked array list" to store elements in a grid row and allow constant-time insertions and removals while still allowing some degree of spatial locality with lots of elements being contiguous. It can be described like this:
struct GridRow
{
struct Node
{
// Element data
...
// Stores the index into the 'nodes' array of the next element
// in the cell or the next removed element. -1 indicates end of
// list.
int next = -1;
};
// Stores all the nodes in the row.
std::vector<Node> nodes;
// Stores the index of the first removed element or -1
// if there are no removed elements.
int free_element = -1;
};
This combines some of the benefits of a linked list using a free list allocator but without the need to manage separate allocator and data structure implementations which I find to be too generic and unwieldy for my purposes. Furthermore, doing it this way allows us to halve the size of a pointer down to a 32-bit array index on 64-bit architectures which I find to be a big measured win in my use cases when the alignment requirements of the element data combined with the 32-bit index don't require an additional 32-bits of padding for the class/struct which is frequently the case for me since I often use 32-bit or smaller integers and 32-bit single-precision floating-point or 16-bit half-floats.
Unorthodox?
On this question:
Is this type of grid search common?
I am not sure! I tend to struggle a bit with terminology and I'll have to ask people's forgiveness and patience in communication. I started programming from early childhood in the 1980s before the internet was widespread, so I came to rely on inventing a lot of my own techniques and using my own crude terminology as a result. I got my degree in computer science about a decade and a half later when I reached my 20s and corrected some of my terminology and misconceptions but I've had many years just rolling my own solutions. So I am often not sure if other people have come across some of the same solutions or not, and if there are formal names and terms for them or not.
That makes communication with other programmers difficult and very frustrating for both of us at times and I have to ask for a lot of patience to explain what I have in mind. I've made it a habit in meetings to always start off showing something with very promising results which tends to make people more patient with my crude and long-winded explanations of how they work. They tend to give my ideas much more of a chance if I start off by showing results, but I'm often very frustrated with the orthodoxy and dogmatism that can be prevalent in this industry that can sometimes prioritize concepts far more than execution and actual results. I'm a pragmatist at heart so I don't think in terms of "what is the best data structure?" I think in terms of what we can effectively implement personally given our strengths and weaknesses and what is intuitive and counter-intuitive to us and I'm willing to endlessly compromise on the purity of concepts in favor of a simpler and less problematic execution. I just like good, reliable solutions that roll naturally off our fingertips no matter how orthodox or unorthodox they may be, but a lot of my methods may be unorthodox as a result (or not and I might just have yet to find people who have done the same things). I've found this site useful at rare times in finding peers who are like, "Oh, I've done that too! I found the best results if we do this [...]" or someone pointing out like, "What you are proposing is called [...]."
In performance-critical contexts, I kind of let the profiler come up with the data structure for me, crudely speaking. That is to say, I'll come up with some quick first draft (typically very orthodox) and measure it with the profiler and let the profiler's results give me ideas for a second draft until I converge to something both simple and performant and appropriately scalable for the requirements (which may become pretty unorthodox along the way). I'm very happy to abandon lots of ideas since I figure we have to weed through a lot of bad ideas in a process of elimination to come up with a good one, so I tend to cycle through lots of implementations and ideas and have come to become a really rapid prototyper (I have a psychological tendency to stubbornly fall in love with solutions I spent lots of time on, so to counter that I've learned to spend the absolute minimal time on a solution until it's very, very promising).
You can see my exact methodology at work in the very answers to that
question where I iteratively converged through lots of profiling and
measuring over the course of a few days and prototyping from a fairly orthodox quadtree to that
unorthodox "loose-tight double grid" solution that handled the largest
number of agents at the most stable frame rates and was, for me
anyway, much faster and simpler to implement than all the structures
before it. I had to go through lots of orthodox solutions and measure them though to generate the final idea for the unusual loose-tight variant. I always start off with and favor the orthodox solutions and start off inside the box because they're well-documented and understood and just very gently and timidly venture outside, but I do often find myself a bit outside the box when the requirements are steep enough. I'm no stranger to the steepest requirements since in my industry and as a fairly low-level type working on engines, the ability to handle more data at good frame rates often translates not only to greater interactivity for the user but also allows artists to create more detailed content of higher visual quality than ever before. We're always chasing higher and higher visual quality at good frame rates, and that often boils down to a combination of both performance and getting away with crude approximations whenever possible. This inevitably leads to some degree of unorthodoxy with lots of in-house solutions very unique to a particular engine, and each engine tends to have its own very unique strengths and weaknesses as you find comparing something like CryEngine to Unreal Engine to Frostbite to Unity.
For example, I have this data structure I've been using since childhood and I don't know the name of it. It's a straightforward concept and it's just a hierarchical bitset that allows set intersections of potentially millions of elements to be found in as little as a few iterations of simple work as well as traverse millions of elements occupying the set with just a few iterations (less than linear-time requirements to traverse everything in the set just through the data structure itself which returns ranges of occupied elements/set bits instead of individual elements/bit indices). But I have no idea what the name is since it's just something I rolled and I've never encountered anyone talking about it in computer science. I tend to refer to it as a "hierarchical bitset". Originally I called it a "sparse bitset tree" but that seems a tad verbose. It's not a particularly clever concept at all and I wouldn't be surprised or disappointed (actually quite happy) to find someone else discovering the same solution before me but just one I don't find people using or talking about ever. It just expands on the strengths of a regular, flat bitset in rapidly finding set intersections with bitwise OR and rapidly traverse all set bits using FFZ/FFS but reducing the linear-time requirements of both down to logarithmic (with the logarithm base being a number much larger than 2).
Anyway, I wouldn't be surprised if some of these solutions are very unorthodox, but also wouldn't be surprised if they are reasonably orthodox and I've just failed to find the proper name and terminology for these techniques. A lot of the appeal of sites like this for me is a lonely search for someone else who has used similar techniques and to try to find proper names and terms for them often to end in frustration. I'm also hoping to improve on my ability to explain them although I've always been so bad and long-winded here. I find using pictures helps me a lot because I find human language to be incredibly riddled with ambiguities. I'm also fond of deliberately imprecise figurative language which embraces and celebrates the ambiguities such as metaphor and analogy and humorous hyperbole, but I've not found it's the type of thing programmers tend to appreciate so much due to its lack of precision. But I've never found precision to be that important so long as we can convey the meaty stuff and what is "cool" about an idea while they can draw their own interpretations of the rest. Apologies for the whole explanation but hopefully that clears some things up about my crude terminology and the overall methodology I use to arrive at these techniques. English is also not my first language so that adds another layer of convolution where I have to sort of translate my thoughts into English words and struggle a lot with that.

CUDA parallel sorting algorithm vs single thread sorting algorithms

I have a large amount of data which i need to sort, several million array each with tens of thousand of values. What im wondering is the following:
Is it better to implement a parallel sorting algorithm, on the GPU, and run it across all the arrays
OR
implement a single thread algorithm, like quicksort, and assign each thread, of the GPU, a different array.
Obviously speed is the most important factor. For single thread sorting algorithm memory is a limiting factor. Ive already tried to implement a recursive quicksort but it doesnt seem to work for large amounts of data so im assuming there is a memory issue.
Data type to be sorted is long, so i dont believe a radix sort would be possible due to the fact that it a binary representation of the numbers would be too long.
Any pointers would be appreciated.
Sorting is an operation that has received a lot of attention. Writing your own sort isn't advisable if you are interested in high performance. I would consider something like thrust, back40computing, moderngpu, or CUB for sorting on the GPU.
Most of the above will be handling an array at a time, using the full GPU to sort an array. There are techniques within thrust to do a vectorized sort which can handle multiple arrays "at once", and CUB may also be an option for doing a "per-thread" sort (let's say, "per thread block").
Generally I would say the same thing about CPU sorting code. Don't write your own.
EDIT: I guess one more comment. I would lean heavily towards the first approach you mention (i.e. not doing a sort per thread.) There are two related reasons for this:
Most of the fast sorting work has been done along the lines of your first method, not the second.
The GPU is generally better at being fast when the work is well adapted for SIMD or SIMT. This means we generally want each thread to be doing the same thing and minimizing branching and warp divergence. This is harder to achieve (I think) in the second case, where each thread appears to be following the same sequence but in fact data dependencies are causing "algorithm divergence". On the surface of it, you might wonder if the same criticism might be levelled at the first approach, but since these libraries I mention arer written by experts, they are aware of how best to utilize the SIMT architecture. The thrust "vectorized sort" and CUB approaches will allow multiple sorts to be done per operation, while still taking advantage of SIMT architecture.

Performance Testing for Calculation-Heavy Programs

What are some good tips and/or techniques for optimizing and improving the performance of calculation heavy programs. I'm talking about things like complication graphics calculations or mathematical and simulation types of programming where every second saved is useful, as opposed to IO heavy programs where only a certain amount of speedup is helpful.
While changing the algorithm is frequently mentioned as the most effective method here,I'm trying to find out how effective different algorithms are in the first place, so I want to create as much efficiency with each algorithm as is possible. The "problem" I'm solving isn't something thats well known, so there are few if any algorithms on the web, but I'm looking for any good advice on how to proceed and what to look for.
I am exploring the differences in effectiveness between evolutionary algorithms and more straightforward approaches for a particular group of related problems. I have written three evolutionary algorithms for the problem already and now I have written an brute force technique that I am trying to make as fast as possible.
Edit: To specify a bit more. I am using C# and my algorithms all revolve around calculating and solving constraint type problems for expressions (using expression trees). By expressions I mean things like x^2 + 4 or anything else like that which would be parsed into an expression tree. My algorithms all create and manipulate these trees to try to find better approximations. But I wanted to put the question out there in a general way in case it would help anyone else.
I am trying to find out if it is possible to write a useful evolutionary algorithm for finding expressions that are a good approximation for various properties. Both because I want to know what a good approximation would be and to see how the evolutionary stuff compares to traditional methods.
It's pretty much the same process as any other optimization: profile, experiment, benchmark, repeat.
First you have to figure out what sections of your code are taking up the time. Then try different methods to speed them up (trying methods based on merit would be a better idea than trying things at random). Benchmark to find out if you actually did speed them up. If you did, replace the old method with the new one. Profile again.
I would recommend against a brute force approach if it's at all possible to do it some other way. But, here are some guidelines that should help you speed your code up either way.
There are many, many different optimizations you could apply to your code, but before you do anything, you should profile to figure out where the bottleneck is. Here are some profilers that should give you a good idea about where the hot spots are in your code:
GProf
PerfMon2
OProfile
HPCToolkit
These all use sampling to get their data, so the overhead of running them with your code should be minimal. Only GProf requires that you recompile your code. Also, the last three let you do both time and hardware performance counter profiles, so once you do a time (or CPU cycle) profile, you can zoom in on the hotter regions and find out why they might be running slow (cache misses, FP instruction counts, etc.).
Beyond that, it's a matter of thinking about how best to restructure your code, and this depends on what the problem is. It may be that you've just got a loop that the compiler doesn't optimize well, and you can inline or move things in/out of the loop to help the compiler out. Or, if you're running as fast as you can with basic arithmetic ops, you may want to try to exploit vector instructions (SSE, etc.) If your code is parallel, you might have load balance problems, and you may need to restructure your code so that data is better distributed across cores.
These are just a few examples. Performance optimization is complex, and it might not help you nearly enough if you're doing a brute force approach to begin with.
For more information on ways people have optimized things, there were some pretty good examples in the recent Why do you program in assembly? question.
If your optimization problem is (quasi-)convex or can be transformed into such a form, there are far more efficient algorithms than evolutionary search.
If you have large matrices, pay attention to your linear algebra routines. The right algorithm can make shave an order of magnitude off the computation time, especially if your matrices are sparse.
Think about how data is loaded into memory. Even when you think you're spending most of your time on pure arithmetic, you're actually spending a lot of time moving things between levels of cache etc. Do as much as you can with the data while it's in the fastest memory.
Try to avoid unnecessary memory allocation and de-allocation. Here's where it can make sense to back away from a purely OO approach.
This is more of a tip to find holes in the algorithm itself...
To realize maximum performance, simplify everything inside the most inner loop at the expense of everything else.
One example of keeping things simple is the classic bouncing ball animation. You can implement gravity by looking up the definition in your physics book and plugging in the numbers, or you can do it like this and save precious clock cycles:
initialize:
float y = 0; // y coordinate
float yi = 0; // incremental variable
loop:
y += yi;
yi += 0.001;
if (y > 10)
yi = -yi;
But now let's say you're having to do this with nested loops in an N-body simulation where every particle is attracted to every other particle. This can be an enormously processor intensive task when you're dealing with thousands of particles.
You should of course take the same approach as to simplify everything inside the most inner loop. But more than that, at the very simplest level you should also use data types wisely. For example, math operations are faster when working with integers than floating point variables. Also, addition is faster than multiplication, and multiplication is faster than division.
So with all of that in mind, you should be able to simplify the most inner loop using primarily addition and multiplication of integers. And then any scaling down you might need to do can be done afterwards. To take the y and yi example, if yi is an integer that you modify inside the inner loop then you could scale it down after the loop like this:
y += yi * 0.01;
These are very basic low-level performance tips, but they're all things I try to keep in mind whenever I'm working with processor intensive algorithms. Of course, if you then take these ideas and apply them to parallel processing on a GPU then you can take your algorithm to a whole new level. =)
Well how you do this depends the most on which language
you are using. Still, the key in any language
in the profiler. Profile your code. See which
functions/operations are taking the most time and then determine
if you can make these costly operations more efficient.
Standard bottlenecks in numerical algorithms are memory
usage (do you access matrices in the order which the elements
are stored in memory); communication overhead, etc. They
can be little different than other non-numerical programs.
Moreover, many other factors such as preconditioning, etc.
can lead to drastically difference performance behavior
of the SAME algorithm on the same problem. Make sure
you determine optimal parameters for your implementations.
As for comparing different algorithms, I recommend
reading the paper
"Benchmarking optimization software with performance profiles,"
Jorge Moré and Elizabeth D. Dolan, Mathematical Programming 91 (2002), 201-213.
It provides a nice, uniform way to compare different algorithms being
applied to the same problem set. It really should be better known
outside of the optimization community (in my not so humble opinion
at least).
Good luck!

Is it worthwhile to use a bit vector/array rather than a simple array of bools?

When I want an array of flags it has typically pained me to use an entire byte (or word) to store each one, as would be the result if I made an array of bools or some other numeric type that could be set to 0 or 1. But now I wonder whether using a structure that is more space-efficient is worth it given the (albeit hopefully very slight) additional overhead of shifting and bit testing.
In my company we use Rogue Wave tools (though hopefully not for much longer) and it's their RWBitVec that I've used for this purpose up until now.
It's mostly about saving memory. If your array of bools is large enough that a 8x improvement on storage space is meaningful, then by all means, use a bitarray.
Note that the memory access is pretty expensive compared to the shift/and, so the bitarray approach is slightly faster than the array-of-chars. Basically it comes down to memory versus programmer time. Remember that premature optimization is a waste of time. I'd use whichever approach is the easiest to develop, and then refactor only after it shows that it's a primary performance bottleneck.
Don't use vector<bool>, it's not really a Container:
http://www.informit.com/guides/content.aspx?g=cplusplus&seqNum=98
Use std::bitset (for fixed size bitsets) and boost::dynamic_bitset (for resizeable ones) where appropriate. They aren't Containers either, but they don't look as if they ought to be, so are less likely to cause confusion.
Whether the trade-off is worth it depends, obviously, on how big the arrays are in your program. I think you're right that the overhead of bit access is usually negligible, but if the memory overhead is negligible too then you've nothing to go on there either.
bitsets have the advantage that they do exactly what they say on the tin - none of this "declare an array of chars/ints, but the only legal values are 0 and 1" nonsense. Your code will read about the same as if you'd used an array.
I wrote some code once to unpack a bitmap image line into separate bytes per pixel, then pack it back again after processing. For the code I was benchmarking, it was actually faster to do it that way than to work at the bit level.
I've used a bit array for indexing a HUGE tree. The algorithm was:
Check bitarray if entry exists
if entry doesn't exists
return null
else do binary search in tree
return value
The advantage is that the Tree has huge enough that searching for a non existent entry would cause several cache misses before completing. Thus the algorithm was taking longer or not depending on the existence of the value.
However adding that initial bit array search meant I'd reduce cache misses, and would avoid searching the tree at all if the answer wasn't there. By adding this extra step the algorithm became much more robust (actual performance time on a Computer, became nearly linear although the Big-O would say differently), and overall performance increased by an order of magnitude.
Like they say sometimes taking hardware into consideration is more important than the "ideal" mathematical algorithm.
Modern computers have barrel shifters so that a shift of any number of bits up to 31 takes a few cycles (less than many other instructions). Compilers take advantage of this and bit operations are not only space efficient but in most cases time efficient.
But it really depends on how you're using and testing the bits - there are some inefficient methods that would make using a whole integer faster.
-Adam
Is it worth it? Only if you know that you have a problem with memory usage.
But unless you're either:
Working on an embedded processor with very limited resources, or
Storing an astronomical number of bools
then the answer is no. You'll have to work somewhat harder to achieve the same level of readability in your source by using a bitmap than you will using bools, and unless you're operating under either of the previous two conditions you'll likely find that it doesn't make any noticeable difference to your memory footprint.

Resources