What is data-oriented design? [closed]

I was reading this article, and this guy goes on talking about how everyone can greatly benefit from mixing in data oriented design with OOP. He doesn't show any code samples, however.
I googled this and couldn't find any real information as to what this is, let alone any code samples. Is anyone familiar with this term and can provide an example? Is this maybe a different word for something else?

First of all, don't confuse this with data-driven design.
My understanding of Data-Oriented Design is that it is about organizing your data for efficient processing, especially with respect to cache misses. Data-Driven Design, on the other hand, is about letting data control a lot of the behavior of your program (described very well by Andrew Keith's answer).
Say you have ball objects in your application with properties such as color, radius, bounciness, position, etc.
Object Oriented Approach
In OOP you would describe balls like this:
class Ball {
    Point position;
    Color color;
    double radius;
    void draw();
};
And then you would create a collection of balls like this:
vector<Ball> balls;
Data-Oriented Approach
In Data Oriented Design, however, you are more likely to write the code like this:
class Balls {
    vector<Point> position;
    vector<Color> color;
    vector<double> radius;
    void draw();
};
As you can see there is no single unit representing one Ball anymore. Ball objects only exist implicitly.
This can have many advantages, performance-wise. Usually, we want to do operations on many balls at the same time. The hardware usually wants large contiguous chunks of memory to operate efficiently.
Secondly, you often do operations that affect only part of a ball's properties. For example, if you combine the colors of all the balls in various ways, then you want your cache to contain only color information. However, when all ball properties are stored in one unit, you will pull in all the other properties of each ball as well, even though you don't need them.
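As a minimal sketch of that point (the field names mirror the Balls class above, while the Color layout and the fadeColors operation are invented here purely for illustration), an operation that only ever reads and writes the color array fills the cache with nothing but color data:
#include <vector>

struct Point { float x, y; };
struct Color { float r, g, b; };

struct Balls {
    std::vector<Point> position;
    std::vector<Color> color;
    std::vector<double> radius;

    // Touches only the color array; position and radius never enter the cache.
    void fadeColors(float factor) {
        for (Color &c : color) {
            c.r *= factor;
            c.g *= factor;
            c.b *= factor;
        }
    }
};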
Cache Usage Example
Say each ball takes up 64 bytes and a Point takes 4 bytes. A cache slot takes, say, 64 bytes as well. If I want to update the positions of 10 balls, I have to pull 10 x 64 = 640 bytes of memory into the cache and get 10 cache misses. If, however, I can work with the positions of the balls as separate units, that will only take 4 x 10 = 40 bytes. That fits in one cache fetch, so we only get 1 cache miss to update all 10 balls. These numbers are arbitrary; I assume a real cache block is bigger.
But it illustrates how memory layout can have a severe effect on cache hits and thus performance. This will only increase in importance as the difference between CPU and RAM speed widens.
How to lay out the memory
In my ball example, I simplified the issue a lot, because usually for any normal app you will likely access multiple variables together. E.g. position and radius will probably be used together frequently. Then your structure should be:
class Body {
    Point position;
    double radius;
};
class Balls {
    vector<Body> bodies;
    vector<Color> color;
    void draw();
};
The reason you should do this is that if data used together are placed in separate arrays, there is a risk that they will compete for the same slots in the cache. Thus loading one will throw out the other.
So compared to Object-Oriented programming, the classes you end up making are not related to the entities in your mental model of the problem. Since data is lumped together based on data usage, you won't always have sensible names to give your classes in Data-Oriented Design.
Relation to relational databases
The thinking behind Data-Oriented Design is very similar to how you think about relational databases. Optimizing a relational database can also involve using the cache more efficiently, although in this case, the cache is not CPU cache but pages in memory. A good database designer will also likely split out infrequently accessed data into a separate table rather than creating a table with a huge number of columns where only a few of the columns are ever used. He might also choose to denormalize some of the tables so that data don't have to be accessed from multiple locations on disk. Just like with Data-Oriented Design these choices are made by looking at what the data access patterns are and where the performance bottleneck is.

Mike Acton gave a public talk about data-oriented design recently.
My basic summary of it would be: if you want performance, then think about data flow, find the storage layer that is most likely to screw with you and optimize for it hard. Mike is focusing on L2 cache misses, because he's doing realtime, but I imagine the same thing applies to databases (disk reads) and even the Web (HTTP requests). It's a useful way of doing systems programming, I think.
Note that it doesn't absolve you from thinking about algorithms and time complexity, it just focuses your attention at figuring out the most expensive operation type that you then must target with your mad CS skills.

I just want to point out that Noel is talking specifically about some of the specific needs we face in game development. I suppose other sectors that are doing real-time soft simulation would benefit from this, but it is unlikely to be a technique that will show noticeable improvement to general business applications. This set up is for ensuring that every last bit of performance is squeezed out of the underlying hardware.

A data-oriented design is a design in which the logic of the application is built up from data sets instead of procedural algorithms. For example:
Procedural approach:
int animation; // this value is the animation index
if (animation == 0)
    PerformMoveForward();
else if (animation == 1)
    PerformMoveBack();
// ... etc
Data design approach:
typedef struct
{
    int Index;
    void (*Perform)();
} AnimationIndice;

// build my animation dictionary
AnimationIndice AnimationIndices[] =
{
    { 0, PerformMoveForward },
    { 1, PerformMoveBack }
};

// when it's time to run, I use my dictionary to find my logic
int animation; // this value is the animation index
AnimationIndices[animation].Perform();
Data designs like this promote the usage of data to build the logic of the application. It's easier to manage, especially in video games, which might have thousands of logic paths based on animation or some other factor.

If you want to take advantage of modern processor architecture, you need to lay out your data in memory in a certain way. CPUs are really good at processing simple types that are laid out sequentially in memory. Any other layout has a much higher processing cost.
In an object-oriented approach, you always think about one instance, and then you extend it to several instances by grouping objects into collections. But from the hardware's point of view, this comes with an added cost.
In data-oriented approach, you don't have an "instance" in the same way you have in object-oriented programming. Your instance can have an identifier, similar to data in relational databases, but apart from that, data related to your instance can be split over several tables (tables are implemented as vectors), to allow efficient processing.
An example: imagine you have class Student { int id; std::string name; float average; bool graduated; }. In the OOP case, you would put all your students in a single vector.
In data-oriented design, you will first ask yourself what kind of processing you want to do to this data. Say you want to calculate an average mark for all students that still haven't graduated. So you will create a table which contains only students that have graduated, and another that haven't. You won't keep the student name in that table since it is not used for processing. But you will keep a student ID and an average mark in the table.
Now calculating the average mark for non-graduated students means iterating through the non-graduated table and performing the calculation. Since the average marks are neighboring in memory, your CPU can use SIMD and process the data in the most efficient way possible, and since we never load the name or query the bool graduated to test whether a student has graduated, we don't waste cache space on data we never read.
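A rough sketch of what such a table might look like (the struct and function names are my own invention, not taken from the answer above):
#include <cstdint>
#include <vector>

// One "table" per group of students, keeping only the columns we process.
struct NonGraduatedStudents {
    std::vector<std::int32_t> id;      // row i of each column describes one student
    std::vector<float>        average;
};

// Average mark over all non-graduated students; only the 'average'
// column is streamed through the cache.
float averageMark(const NonGraduatedStudents &s) {
    if (s.average.empty())
        return 0.0f;
    float sum = 0.0f;
    for (float a : s.average)
        sum += a;
    return sum / s.average.size();
}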
This sounds nice in theory, but I have never done this kind of development on a real-world project. If anybody has any experience, please contact me; I have many questions.

I first heard about Data-Oriented Design in the Our Machinery podcast, episode "S3: EP4 Data-oriented Design". https://www.owltail.com/podcast/Atvr2-Our-Machinery
Maybe this has changed, but a while ago finding information about Data-Oriented Design was difficult. The only book I found is: https://www.manning.com/books/data-oriented-programming


What is a coarse and fine grid search?

I was reading this answer
Efficient (and well explained) implementation of a Quadtree for 2D collision detection
and encountered this paragraph
All right, so actually quadtrees are not my favorite data structure for this purpose. I tend to prefer a grid hierarchy, like a coarse grid for the world, a finer grid for a region, and an even finer grid for a sub-region (3 fixed levels of dense grids, and no trees involved), with row-based optimizations so that a row that has no entities in it will be deallocated and turned into a null pointer, and likewise completely empty regions or sub-regions turned into nulls. While this simple implementation of the quadtree running in one thread can handle 100k agents on my i7 at 60+ FPS, I've implemented grids that can handle a couple million agents bouncing off each other every frame on older hardware (an i3). Also I always liked how grids made it very easy to predict how much memory they'll require, since they don't subdivide cells. But I'll try to cover how to implement a reasonably efficient quadtree.
This type of grid seems intuitive; it sort of sounds like an "N-order" grid, where instead of 4 child nodes you have N child nodes per parent. N^3 can go much further than 4^3, which allows better precision with potentially (I guess) less branching (since there are many fewer nodes to branch into).
I'm a little puzzled because I would intuitively use a single, or maybe 3 std::map with the proper < operator(), to reduce its memory footprint, but I'm not sure it would be so fast, since querying an AABB would mean stacking several accesses that are O(log n).
What exactly are those row-based optimizations he is talking about? Is this type of grid search common?
I have some understanding of a z order curve, and I'm not entirely satisfied with a quadtree.
It's my own quote. But that's based on a common pattern I encountered in my personal experience. Also, terms like "parent" and "child" are ones I'd largely discard when talking about grids. You just got a big 2-dimensional or N-dimensional table/matrix storing stuff. There's not really a hierarchy involved whatsoever -- these data structures are more comparable to arrays than trees.
"Coarse" and "Fine"
On "coarse" and "fine", what I meant there is that "coarse" search queries tend to be cheaper but give more false positives. A coarser grid would be one that is lower in grid resolution (fewer, larger cells). Coarse searches may involve traversing/searching fewer and larger grid cells. For example, say we want to see if an element intersects a point/dot in a gigantic cell (imagine just a 1x1 grid storing everything in the simulation). If the dot intersects the cell, we may get a whole lot of elements returned in that cell but maybe only one or none of them actually intersect the dot.
So a "coarse" query is broad and simple but not very precise at narrowing down the list of candidates (or "suspects"). It may return too many results and still leave a whole lot of processing required left to do to narrow down what actually intersects the search parameter*.
It's like in those detective shows when they search a database for a possible killer: putting in "white male" might not require much processing to list the results but might give way too many results to properly narrow down the suspects. "Fine" would be the opposite and might require more processing of the database but narrow down the result to just one suspect.
This is a crude and flawed analogy but I hope it helps.
Often the key to broadly optimizing spatial indices before we get into things like memory optimizations whether we're talking spatial hashes or quadtrees is to find a nice balance between "coarse" and "fine". Too "fine" and we might spend too much time traversing the data structure (searching many small cells in a uniform grid, or spending too much time in tree traversal for adaptive grids like quadtrees). Too "coarse" and the spatial index might give back too many results to significantly reduce the amount of time required for further processing. For spatial hashes (a data structure I don't personally like very much but they're very popular in gamedev), there's often a lot of thought and experimentation and measuring involved in determining an optimal cell size to use.
With uniform NxM grids, how "coarse" or "fine" they are (big or small cells and high or low grid resolution) not only impacts search times for a query but can also impact insertion and removal times if the elements are larger than a point. If the grid is too fine, a single large or medium-sized element may have to be inserted into many tiny cells and removed from many tiny cells, using lots of extra memory and processing. If it's too coarse, the element may only have to be inserted and removed to/from one large cell but at costs to the data structure's ability to narrow down the number of candidates returned from a search query to a minimum. Without care, going too "fine/granular" can become very bottlenecky in practice and a developer might find his grid structure using gigabytes of RAM for a modest input size. With tree variants like quadtrees, a similar thing can happen if the maximum allowed tree depth is too high a value causing explosive memory use and processing when the leaf nodes of the quadtree store the tiniest cell sizes (we can even start running into floating-point precision bugs that wreck performance if the cells are allowed to be subdivided to too small a size in the tree).
The essence of accelerating performance with spatial indices is often this sort of balancing act. For example, we typically don't want to apply frustum culling to individual polygons being rendered in computer graphics because that's typically not only redundant with what the hardware already does at the clipping stage, but it's also too "fine/granular" and requires too much processing on its own compared to the time required to just request to render one or more polygons. But we might net huge performance improvements with something a bit "coarser", like applying frustum culling to an entire creature or space ship (an entire model/mesh), allowing us to avoid requesting to render many polygons at once with a single test. So I often use the terms, "coarse" and "fine/granular" frequently in these sorts of discussions (until I find better terminology that more people can easily understand).
Uniform vs. Adaptive Grid
You can think of a quadtree as an "adaptive" grid with adaptive grid cell sizes arranged hierarchically (working from "coarse" to "fine" as we drill down from root to leaf in a single smart and adaptive data structure) as opposed to a simple NxM "uniform" grid.
The adaptive nature of the tree-based structures is very smart and can handle a broad range of use cases (although typically requiring some fiddling of maximum tree depth and/or minimum cell size allowed and possibly how many maximum elements are stored in a cell/node before it subdivides). However, it can be more difficult to optimize tree data structures because the hierarchical nature doesn't lend itself so easily to the kind of contiguous memory layout that our hardware and memory hierarchy is so well-suited to traverse. So very often I find data structures that don't involve trees to be easier to optimize in the same sense that optimizing a hash table might be simpler than optimizing a red-black tree, especially when we can anticipate a lot about the type of data we're going to be storing in advance.
Another reason I tend to favor simpler, more contiguous data structures in lots of contexts is that the performance requirements of a realtime simulation often want not just fast frame rates, but consistent and predictable frame rates. The consistency is important because even if a video game has very high frame rates for most of the game but some part of the game causes the frame rates to drop substantially for even a brief period of time, the player may die and game over as a result of it. It was often very important in my case to avoid these types of scenarios and have data structures largely absent pathological worst-case performance scenarios. In general, I find it easier to predict the performance characteristics of lots of simpler data structures that don't involve an adaptive hierarchy and are kind of on the "dumber" side. Very often I find the consistency and predictability of frame rates to be roughly tied to how easily I can predict the data structure's overall memory usage and how stable that is. If the memory usage is wildly unpredictable and sporadic, I often (not always, but often) find the frame rates will likewise be sporadic.
So I often find better results using grids personally, but if it's tricky to determine a single optimal cell size to use for the grid in a particular simulation context, I just use multiple instances of them: one instance with larger cell sizes for "coarse" searches (say 10x10), one with smaller ones for "finer" searches (say 100x100), and maybe even one with even smaller cells for the "finest" searches (say 1000x1000). If no results are returned in the coarse search, then I don't proceed with the finer searches. I get some balance of the benefits of quadtrees and grids this way.
What I did when I used these types of representations in the past was not to store a single element in all three grid instances. That would triple the memory use of an element entry/node in these structures. Instead, what I did was insert the indices of the occupied cells of the finer grids into the coarser grids, as there are typically far fewer occupied cells than there are elements in the simulation. Only the finest, highest-resolution grid with the smallest cell sizes would store the element. The cells in the finest grid are analogous to the leaf nodes of a quadtree.
The "loose-tight double grid" as I'm calling it in one of the answers to that question is an expansion on this multi-grid idea. The difference is that the finer grid is actually loose and has cell sizes that expand and shrink based on the elements inserted to it, always guaranteeing that a single element, no matter how large or small, needs only be inserted to one cell in the grid. The coarser grid stores the occupied cells of the finer grid leading to two constant-time queries (one in the coarser tight grid, another into the finer loose grid) to return an element list of potential candidates matching the search parameter. It also has the most stable and predictable memory use (not necessarily the minimal memory use because the fine/loose cells require storing an axis-aligned bounding box that expands/shrinks which adds another 16 bytes or so to a cell) and corresponding stable frame rates because one element is always inserted to one cell and doesn't take any additional memory required to store it besides its own element data with the exception of when its insertion causes a loose cell to expand to the point where it has to be inserted to additional cells in the coarser grid (which should be a fairly rare-case scenario).
Multiple Grids For Other Purposes
I'm a little puzzled because I would intuitively use a single, or maybe 3 std::map with the proper operator(), to reduce its memory footprint, but I'm not sure it would be so fast, since querying an AABB would mean stacking several accesses that are O(log n).
I think that's an intuition many of us have and also probably a subconscious desire to just lean on one solution for everything because programming can get quite tedious and repetitive and it'd be ideal to just implement something once and reuse it for everything: a "one-size-fits-all" t-shirt. But a one-sized-fits-all shirt can be poorly tailored to fit our very broad and muscular programmer bodies*. So sometimes it helps to use the analogy of a small, medium, and large size.
This is a very possibly poor attempt at humor on my part to make my long-winded texts less boring to read.
For example, if you are using std::map for something like a spatial hash, then there can be a lot of thought and fiddling around trying to find an optimal cell size. In a video game, one might compromise with something like making the cell size relative to the size of your average human in the game (perhaps a bit smaller or bigger), since lots of the models/sprites in the game might be designed for human use. But it might get very fiddly and be very sub-optimal for teeny things and for gigantic things. In that case, we might do well to resist our intuitions and desires to just use one solution and use multiple instead (it could still be the same code, just multiple instances of the same data structure class constructed with varying parameters).
As for the overhead of searching multiple data structures instead of a single one, that's something best measured, and it's worth remembering that the input sizes of each container will be smaller as a result, reducing the cost of each search and very possibly improving locality of reference. The overhead might exceed the benefits with a hierarchical structure that requires logarithmic search times like std::map (or it might not; best to just measure and compare), but I tend to use more data structures which do this in constant time (grids or hash tables). In my cases, I find the benefits far exceed the additional cost of requiring multiple searches to do a single query, especially when the element sizes vary radically or I want some basic thing resembling a hierarchy with 2 or more NxM grids that range from "coarse" to "fine".
Row-Based Optimizations
As for "row-based optimizations", that's very specific to uniform fixed-sized grids and not trees. It refers to using a separate variable-sized list/container per row instead of a single one for the entire grid. Aside from potentially reducing memory use for empty rows that just turn into nulls without requiring an allocated memory block, it can save on lots of processing and improve memory access patterns.
If we store a single list for the entire grid, then we have to constantly insert and remove from that one shared list as elements move around, particles are born, etc. That could lead to more heap allocations/deallocations growing and shrinking the container, but it also increases the average memory stride to get from one element in that list to the next, which tends to translate to more cache misses as more irrelevant data is loaded into a cache line. Also, these days we have so many cores that a single shared container for the entire grid may reduce our ability to process the grid in parallel (ex: searching one row while simultaneously inserting to another). It can also lead to more net memory use for the structure, since contiguous sequences like std::vector or ArrayList often reserve capacity for as many as twice the elements actually stored in order to keep insertions amortized constant-time, avoiding frequent linear-time reallocation and copying of the former elements.
By associating a separate medium-sized container per grid row or per column instead of gigantic one for the entire grid, we can mitigate these costs in some cases*.
This is the type of thing you definitely measure before and after, though, to make sure it actually improves overall frame rates, and it's probably something to attempt only after a first version storing a single list for the entire grid reveals many non-compulsory cache misses in the profiler.
This might beg the question of why we don't use a separate teeny list container for every single cell in the grid. It's a balancing act. If we store that many containers (ex: a million instances of std::vector for a 1000x1000 grid possibly storing very few or no elements each), it would allow maximum parallelism and minimize the stride to get from one element in a cell to the next one in the cell, but that can be quite explosive in memory use and introduce a lot of extra processing and heap overhead.
Especially in my case, my finest grids might store a million cells or more, but I only require 4 bytes per cell. A variable-sized sequence per cell would typically explode to at least something like 24 bytes or more (typically far more) per cell on 64-bit architectures to store the container data (typically a pointer and a couple of extra integers, or three pointers on top of the heap-allocated memory block), but on top of that, every single element inserted to an empty cell may require a heap allocation and compulsory cache miss and page fault and very frequently due to the lack of temporal locality. So I find the balance and sweet spot to be one list container per row typically among my best-measured implementations.
I use what I call a "singly-linked array list" to store elements in a grid row and allow constant-time insertions and removals while still allowing some degree of spatial locality with lots of elements being contiguous. It can be described like this:
struct GridRow
{
    struct Node
    {
        // Element data
        ...
        // Stores the index into the 'nodes' array of the next element
        // in the cell or the next removed element. -1 indicates end of
        // list.
        int next = -1;
    };
    // Stores all the nodes in the row.
    std::vector<Node> nodes;
    // Stores the index of the first removed element or -1
    // if there are no removed elements.
    int free_element = -1;
};
This combines some of the benefits of a linked list using a free list allocator, but without the need to manage separate allocator and data structure implementations, which I find too generic and unwieldy for my purposes. Furthermore, doing it this way allows us to halve the size of a pointer down to a 32-bit array index on 64-bit architectures, which I find to be a big measured win in my use cases when the alignment requirements of the element data combined with the 32-bit index don't require an additional 32 bits of padding for the class/struct. That's frequently the case for me since I often use 32-bit or smaller integers and 32-bit single-precision floating-point or 16-bit half-floats.
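To make the free-list mechanics concrete, here is a hedged sketch of how insertion and removal might look (the helper names are mine, not from the answer above, the element payload is reduced to a single int, and linking the node into a particular cell's list is omitted):
#include <vector>

// Simplified version of the GridRow above: the element payload is a single int.
struct GridRow
{
    struct Node
    {
        int element;   // placeholder for the element data
        int next;      // next element in the cell, or next free slot; -1 = end
    };
    std::vector<Node> nodes;
    int free_element = -1;   // head of the free list of removed slots
};

// Insert an element, reusing a previously removed slot when one is available.
// Returns the node's index.
int insert_node(GridRow &row, int element)
{
    if (row.free_element != -1)
    {
        int index = row.free_element;            // pop a slot off the free list
        row.free_element = row.nodes[index].next;
        row.nodes[index] = {element, -1};
        return index;
    }
    row.nodes.push_back({element, -1});          // no free slot: append
    return static_cast<int>(row.nodes.size()) - 1;
}

// Remove a node by index; its slot is pushed onto the free list for reuse.
void remove_node(GridRow &row, int index)
{
    row.nodes[index].next = row.free_element;
    row.free_element = index;
}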
Unorthodox?
On this question:
Is this type of grid search common?
I am not sure! I tend to struggle a bit with terminology and I'll have to ask people's forgiveness and patience in communication. I started programming from early childhood in the 1980s before the internet was widespread, so I came to rely on inventing a lot of my own techniques and using my own crude terminology as a result. I got my degree in computer science about a decade and a half later when I reached my 20s and corrected some of my terminology and misconceptions but I've had many years just rolling my own solutions. So I am often not sure if other people have come across some of the same solutions or not, and if there are formal names and terms for them or not.
That makes communication with other programmers difficult and very frustrating for both of us at times and I have to ask for a lot of patience to explain what I have in mind. I've made it a habit in meetings to always start off showing something with very promising results which tends to make people more patient with my crude and long-winded explanations of how they work. They tend to give my ideas much more of a chance if I start off by showing results, but I'm often very frustrated with the orthodoxy and dogmatism that can be prevalent in this industry that can sometimes prioritize concepts far more than execution and actual results. I'm a pragmatist at heart so I don't think in terms of "what is the best data structure?" I think in terms of what we can effectively implement personally given our strengths and weaknesses and what is intuitive and counter-intuitive to us and I'm willing to endlessly compromise on the purity of concepts in favor of a simpler and less problematic execution. I just like good, reliable solutions that roll naturally off our fingertips no matter how orthodox or unorthodox they may be, but a lot of my methods may be unorthodox as a result (or not and I might just have yet to find people who have done the same things). I've found this site useful at rare times in finding peers who are like, "Oh, I've done that too! I found the best results if we do this [...]" or someone pointing out like, "What you are proposing is called [...]."
In performance-critical contexts, I kind of let the profiler come up with the data structure for me, crudely speaking. That is to say, I'll come up with some quick first draft (typically very orthodox) and measure it with the profiler and let the profiler's results give me ideas for a second draft until I converge to something both simple and performant and appropriately scalable for the requirements (which may become pretty unorthodox along the way). I'm very happy to abandon lots of ideas since I figure we have to weed through a lot of bad ideas in a process of elimination to come up with a good one, so I tend to cycle through lots of implementations and ideas and have come to become a really rapid prototyper (I have a psychological tendency to stubbornly fall in love with solutions I spent lots of time on, so to counter that I've learned to spend the absolute minimal time on a solution until it's very, very promising).
You can see my exact methodology at work in the very answers to that question, where I iteratively converged through lots of profiling and measuring over the course of a few days and prototyping from a fairly orthodox quadtree to that unorthodox "loose-tight double grid" solution that handled the largest number of agents at the most stable frame rates and was, for me anyway, much faster and simpler to implement than all the structures before it.
before it. I had to go through lots of orthodox solutions and measure them though to generate the final idea for the unusual loose-tight variant. I always start off with and favor the orthodox solutions and start off inside the box because they're well-documented and understood and just very gently and timidly venture outside, but I do often find myself a bit outside the box when the requirements are steep enough. I'm no stranger to the steepest requirements since in my industry and as a fairly low-level type working on engines, the ability to handle more data at good frame rates often translates not only to greater interactivity for the user but also allows artists to create more detailed content of higher visual quality than ever before. We're always chasing higher and higher visual quality at good frame rates, and that often boils down to a combination of both performance and getting away with crude approximations whenever possible. This inevitably leads to some degree of unorthodoxy with lots of in-house solutions very unique to a particular engine, and each engine tends to have its own very unique strengths and weaknesses as you find comparing something like CryEngine to Unreal Engine to Frostbite to Unity.
For example, I have this data structure I've been using since childhood and I don't know the name of it. It's a straightforward concept: it's just a hierarchical bitset that allows set intersections of potentially millions of elements to be found in as little as a few iterations of simple work, as well as traversal of the millions of elements occupying the set with just a few iterations (less than linear-time requirements to traverse everything in the set, since the data structure itself returns ranges of occupied elements/set bits instead of individual elements/bit indices). But I have no idea what the name is since it's just something I rolled and I've never encountered anyone talking about it in computer science. I tend to refer to it as a "hierarchical bitset". Originally I called it a "sparse bitset tree" but that seems a tad verbose. It's not a particularly clever concept at all and I wouldn't be surprised or disappointed (actually quite happy) to find someone else discovered the same solution before me, but it's just one I don't find people using or talking about ever. It just expands on the strengths of a regular, flat bitset, rapidly finding set intersections with bitwise AND and rapidly traversing all set bits using FFZ/FFS, while reducing the linear-time requirements of both down to logarithmic (with the logarithm base being a number much larger than 2).
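As a hedged guess at what the simplest form of such a structure might look like (this is my own reconstruction from the description above, not the author's code), a two-level version keeps one "summary" bit per 64-bit word of the flat bitset, so traversal and intersection tests can skip whole empty words at a time:
#include <cstddef>
#include <cstdint>
#include <vector>

// Two-level hierarchical bitset: bit b of top word t is set iff
// bottom[t * 64 + b] has at least one bit set.
struct HierarchicalBitset {
    std::vector<std::uint64_t> bottom; // the flat bitset, 64 bits per word
    std::vector<std::uint64_t> top;    // one summary bit per bottom word

    explicit HierarchicalBitset(std::size_t bits)
        : bottom((bits + 63) / 64, 0),
          top((bottom.size() + 63) / 64, 0) {}

    void set(std::size_t i) {
        bottom[i / 64] |= std::uint64_t(1) << (i % 64);
        top[i / 4096]  |= std::uint64_t(1) << ((i / 64) % 64);
    }

    // Visit every set bit, skipping whole empty 64-bit words (and whole empty
    // 4096-bit blocks at the top level). __builtin_ctzll is the GCC/Clang
    // "find first set" intrinsic, the FFS operation the answer mentions.
    template <typename F>
    void for_each_set(F visit) const {
        for (std::size_t t = 0; t < top.size(); ++t) {
            std::uint64_t tw = top[t];
            while (tw) {
                std::size_t w = t * 64 + __builtin_ctzll(tw); // non-empty bottom word
                std::uint64_t bw = bottom[w];
                while (bw) {
                    visit(w * 64 + __builtin_ctzll(bw));      // index of a set bit
                    bw &= bw - 1;                             // clear lowest set bit
                }
                tw &= tw - 1;
            }
        }
    }
};
Intersecting two such sets would then AND the top levels first and only AND the bottom words whose summary bits survive, which is where the less-than-linear behavior comes from.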
Anyway, I wouldn't be surprised if some of these solutions are very unorthodox, but also wouldn't be surprised if they are reasonably orthodox and I've just failed to find the proper name and terminology for these techniques. A lot of the appeal of sites like this for me is a lonely search for someone else who has used similar techniques and to try to find proper names and terms for them often to end in frustration. I'm also hoping to improve on my ability to explain them although I've always been so bad and long-winded here. I find using pictures helps me a lot because I find human language to be incredibly riddled with ambiguities. I'm also fond of deliberately imprecise figurative language which embraces and celebrates the ambiguities such as metaphor and analogy and humorous hyperbole, but I've not found it's the type of thing programmers tend to appreciate so much due to its lack of precision. But I've never found precision to be that important so long as we can convey the meaty stuff and what is "cool" about an idea while they can draw their own interpretations of the rest. Apologies for the whole explanation but hopefully that clears some things up about my crude terminology and the overall methodology I use to arrive at these techniques. English is also not my first language so that adds another layer of convolution where I have to sort of translate my thoughts into English words and struggle a lot with that.

How to remove nested foreach loops for performance improvement

I have a performance-based question.
Is there a way to remove the nested foreach loops, replacing them with something more performant? Here is an example:
List<foo> foos = SelectAllfoos();
foreach (foo f in foos) {
    // do something
    foreach (foo2 f2 in f.GetFoos2()) {
        // do something
    }
    foreach (foo3 f3 in f.GetFoos3()) {
        // do something
    }
    foreach (foo4 f4 in f.GetFoos4()) {
        // do something
        foreach (foo4_1 f4_1 in f4.GetFoos4_1()) {
            // do something
        }
    }
}
Obviously this is fake code I just invented for this example, but imagine you have something like that. How would you improve this method's performance?
PS: I already tried using System.Threading.Tasks.Parallel.ForEach and it improves performance, but I mean a better way to write this code.
PPS: this is written in C#, but my question regards a wider scope, something useful in all languages.
Since the question is rather general and only focused on loops which provide no information about the actual work being done, I can only provide a general answer.
The last thing you typically want to focus on are the loop mechanics themselves. These often yield little, if any, impact.
Typically if you have this kind of situation where algorithmic improvements are out (ex: sequential loops that cannot do better than linear-time complexity as they require traversing and doing something with every single element no matter what), then the two biggest improvements will often come from parallelization and memory optimization.
The latter one is unfortunately less discussed, especially in higher-level languages, but it often carries just as much or more impact. It can improve execution times by orders of magnitude, and it is applicable regardless of the language. Concepts like cache efficiency are not language-dependent, as the hardware remains the same no matter what programming language we use (though how we achieve it can vary considerably between languages).
Memory Access Patterns
For example, take an image processing algorithm. A memory access pattern that visits pixels one horizontal scanline at a time in the outer loop can significantly outperform one that visits pixels one vertical column at a time, even when the two versions execute otherwise identical machine instructions with the same total instruction-level cost (though instruction costs are variable) and merely access memory in a swapped order.
It's because, put crudely, computers fetch data from slower forms of memory into faster forms of memory in contiguous chunks (pages, cache lines). When you access pixels of an image horizontally, an adjacent, horizontal chunk of pixels might be fetched from a slower form of memory into a faster form, and you end up accessing all the neighboring pixels from the faster form of memory before moving on to the next series of pixels. When you access pixels of an image vertically, you end up loading horizontal neighboring pixels into a faster form of memory only to use one pixel from each fetched chunk. The resulting cache misses can significantly slow down the image algorithm, since we're failing to use all the data available when it's loaded into a smaller but faster form of memory prior to it being evicted (we're basically wasting a lot of the benefits of that smaller but faster memory).
So typically if you want to make loops go faster, and algorithmic improvements are out, you want to analyze the way that memory is being accessed and potentially change even the memory layout of the data structures involved. Computers like it when you access contiguous data close together in memory, and don't like it so much when you're accessing memory in a chaotic way that's going all over the place. They like arrays which pack their memory contents tightly together a lot more than linked structures which scatter the memory all over the place (unless the linked structures or their memory allocators are carefully designed not to do that). Speedy loops don't come from changing the mechanics of the loop so much as what the loops are doing, but deeper than algorithmic improvements and perhaps even parallelization are those memory-related optimizations coming from a data-oriented design mindset. In languages like C#, one of the techniques to get better locality of reference out of your data structures is object pooling.
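Going back to the image example, here is a small sketch of the scanline point (assuming a typical row-major layout where width * y + x indexes a pixel; the sizes and the brighten operation are arbitrary). Both loops do the same work, but the first streams through memory sequentially while the second jumps a whole row's worth of bytes between consecutive accesses:
#include <cstddef>
#include <vector>

void brighten_row_major(std::vector<unsigned char> &pixels,
                        std::size_t width, std::size_t height) {
    // Outer loop over rows: adjacent iterations touch adjacent bytes.
    for (std::size_t y = 0; y < height; ++y)
        for (std::size_t x = 0; x < width; ++x)
            pixels[y * width + x] += 10;
}

void brighten_column_major(std::vector<unsigned char> &pixels,
                           std::size_t width, std::size_t height) {
    // Outer loop over columns: each access is 'width' bytes away from the last,
    // so most of each fetched cache line goes unused before eviction.
    for (std::size_t x = 0; x < width; ++x)
        for (std::size_t y = 0; y < height; ++y)
            pixels[y * width + x] += 10;
}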
Loop Tiling/Blocking
Occasionally there are opportunities where you can improve the memory access patterns by simply changing the way you loop over the data without actually changing the way the data is represented. One such example is loop tiling (aka loop blocking): https://software.intel.com/en-us/articles/how-to-use-loop-blocking-to-optimize-memory-use-on-32-bit-intel-architecture. But again, here the speedup isn't coming from optimizing how you write the loop, per se, but optimizing the way you traverse the data in a way that exploits locality of reference. It's still entirely about memory access.
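A minimal sketch of the technique, assuming a square row-major matrix transpose as the workload and an arbitrary block size (dst must already be sized n * n); the tiled loops walk the data one cache-friendly block at a time so the source rows and destination columns being touched stay resident while a block is processed:
#include <algorithm>
#include <cstddef>
#include <vector>

// Transpose src (n x n, row-major) into dst using loop tiling/blocking.
void transpose_tiled(const std::vector<float> &src, std::vector<float> &dst,
                     std::size_t n, std::size_t block = 64) {
    for (std::size_t i0 = 0; i0 < n; i0 += block)
        for (std::size_t j0 = 0; j0 < n; j0 += block)
            // Process one block x block tile at a time.
            for (std::size_t i = i0; i < std::min(i0 + block, n); ++i)
                for (std::size_t j = j0; j < std::min(j0 + block, n); ++j)
                    dst[j * n + i] = src[i * n + j];
}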
Profiling
All of these micro-level optimization techniques have a tendency to make your code harder to maintain, so they're almost always best applied in hindsight with plenty of profiling measurements in your hand. The first thing to learn about optimization in general is how to measure, to do it based on hard data rather than hunches. Beginners tend to want to optimize more, not less, because they're doing it based on guesses about what might be inefficient instead of hard data and proper measurements. It's easy to do this for glaring algorithmic bottlenecks, but anything else typically demands a profiler in your hand. A good optimizer is a sniper dispatching hotspots, not a grenadier blindly hurling grenades at anything that might slow things down. In fact, knowing how to prioritize optimizations properly and to make the proper measurements is probably even more important than understanding the inner workings of the machine. So probably beyond all this stuff, if you want to make your loops go faster, first grab a profiler and learn how to measure inefficiencies properly. The first thing to ask is not how to make things faster so much as what actually needs to be faster (and just as importantly if not more, what doesn't).

Storing Large 2D Game Worlds

I've been experimenting with different ideas of how to store a 2D game world. I'm interested in hearing techniques of storing large quantities of objects while managing the set that's visible (let's say 100,000 tiles square). Obviously the techniques can vary based on how the game renders that space.
Let's assume that we're describing a scrolling 2D game world rather than a screen-based one, as you could fairly easily do screen-based rendering from such a setup, while the converse is a bit more messy.
Looking for language agnostic solutions here so it's more helpful to others.
Edit: I think a good answer here would be a general review of the ideas to consider when thinking about this, as some of the responders have attempted, but also begin to explain how different solutions would apply to those scenarios. It's a somewhat complex question, so I would expect a good answer to reflect that.
Quadtrees are a fairly efficient solution for storing data about a large 2-dimensional world and the objects within it.
You might get some ideas on how to implement this from some spatial data structures like range or kd trees.
However, the answer to this question would vary considerably depending exactly on how your game works.
Are we talking a 2D platformer with 10 enemies onscreen, 20 more offscreen but "active", and an unknown number more "inactive"? If so, you can probably store your whole level as an array of "screens" where you manipulate the ones closest to you.
Or do you mean a true 2D game with lots of up/down movement too? You might have to be a bit more careful here.
The platform is also of some importance. If you're implementing a simple platformer for desktop PCs, you probably wouldn't have to worry about performance as much as you would on an embedded device. This is no excuse to be naive about it, but you might not have to be terribly clever either.
This is a somewhat interesting question I think. Presumably someone smarter than I who has experience with implementing platformers has thought these things out already.
Break the world into smaller areas, and deal with them. Any solution to this problem is going to boil down to this concept (such as quadtrees, mentioned in another answer). The differences will be in how they subdivide the world.
How much data is stored per tile? How fast can players move across the world? What's the behavior of NPCs, etc., that are offscreen? Do they just reset when the player comes back (like old Zelda games)? Do they simply resume where they were? Do they do some kind of catch-up script?
How much different rendering data is going to be needed for different areas?
How much of the world can be seen at one time?
All of these questions are going to impact your solution, as well as the capabilities of your platform. Coming up with a general answer for these without having a reasonable idea of these parameters is going to be a bit difficult.
Assuming that your game will only update what is visible and some area around what is visible, just break the world into "screens" (a "screen" is a rectangular area on the tilemap that can fill the whole screen). Keep in memory the "screens" around the visible area (and some more if you want to update entities which are close to the character, though there is little reason to update an entity that is far away) and keep the rest on disk, with a cache to avoid loading/unloading of commonly visited areas as you move around. Some setup like:
+---+---+---+---+---+---+---+
|FFF|FFF|FFF|FFF|FFF|FFF|FFF|
+---+---+---+---+---+---+---+
|FFF|NNN|NNN|NNN|NNN|NNN|FFF|
+---+---+---+---+---+---+---+
|FFF|NNN|NNN|NNN|NNN|NNN|FFF|
+---+---+---+---+---+---+---+
|FFF|NNN|NNN|VVV|NNN|NNN|FFF|
+---+---+---+---+---+---+---+
|FFF|NNN|NNN|NNN|NNN|NNN|FFF|
+---+---+---+---+---+---+---+
|FFF|NNN|NNN|NNN|NNN|NNN|FFF|
+---+---+---+---+---+---+---+
|FFF|FFF|FFF|FFF|FFF|FFF|FFF|
+---+---+---+---+---+---+---+
Where "V" part is the "screen" where the center (hero or whatever) is, the "N" parts are those who are nearby and have active (updating) entities, are checked for collisions, etc and "F" parts are far parts which might get updated infrequently and are prone to be "swapped" out (stored to disk). Of course you might want to use more "N" screens than two :-).
Note, by the way, that since 2D games do not usually hold much data, instead of saving the far-away parts to disk you might want to just keep them in memory, compressed.
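As a rough sketch of the bookkeeping this implies (the names and the two-ring radius are my own choices for illustration), screens can be classified each frame by their distance, measured in screens, from the hero's screen:
#include <algorithm>
#include <cstdlib>

enum class ScreenState { Visible, Near, Far };

// Classify the screen at (sx, sy) relative to the hero's screen (hx, hy).
// Screens within 'near_rings' of the hero stay active; the rest can be
// swapped out to disk or kept compressed in memory.
ScreenState classify_screen(int sx, int sy, int hx, int hy, int near_rings = 2) {
    int dist = std::max(std::abs(sx - hx), std::abs(sy - hy)); // Chebyshev distance
    if (dist == 0)
        return ScreenState::Visible;
    if (dist <= near_rings)
        return ScreenState::Near;
    return ScreenState::Far;
}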
You probably want to use a single int or byte array that links to block types. If you need more optimization from there, then you'll want to link to more complicated data structures like octrees from your array. There is a good discussion on a Java game forum here: http://www.javagaming.org/index.php/topic,20505.30.html
Anything with links becomes very expensive because each pointer takes up something like 8 bytes, depending upon the language, so depending upon how populated your world is it can get expensive very quickly (8 pointers at 8 bytes each is 64 bytes per item, while a byte array is 1 byte per item). So unless your world is less than 1/64 occupied, a byte array is going to be a much better option. You're also going to need to spend a lot of time iterating down the tree whenever you're doing a lookup for collision or whatever else, whereas a byte array gives an instantaneous lookup.
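A hedged sketch of the flat byte-array layout being suggested (the struct name, dimensions, and block-type encoding are arbitrary):
#include <cstddef>
#include <cstdint>
#include <vector>

// World stored as one flat byte array, 1 byte of block type per tile.
struct TileWorld {
    std::size_t width, height;
    std::vector<std::uint8_t> blocks;  // size = width * height

    TileWorld(std::size_t w, std::size_t h) : width(w), height(h), blocks(w * h, 0) {}

    // Constant-time lookups; no tree to walk.
    std::uint8_t at(std::size_t x, std::size_t y) const { return blocks[y * width + x]; }
    void set(std::size_t x, std::size_t y, std::uint8_t type) { blocks[y * width + x] = type; }
};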
Hopefully that's detailed enough for you. :-)

Pattern name for flippable data structure?

I'm trying to think of a naming convention that accurately conveys what's going on within a class I'm designing. On a secondary note, I'm trying to decide between two almost-equivalent user APIs.
Here's the situation:
I'm building a scientific application, where one of the central data structures has three phases: 1) accumulation, 2) analysis, and 3) query execution.
In my case, it's a spatial modeling structure, internally using a KDTree to partition a collection of points in 3-dimensional space. Each point describes one or more attributes of the surrounding environment, with a certain level of confidence about the measurement itself.
After adding (a potentially large number of) measurements to the collection, the owner of the object will query it to obtain an interpolated measurement at a new data point somewhere within the applicable field.
The API will look something like this (the code is in Java, but that's not really important; the code is divided into three sections, for clarity):
// SECTION 1:
// Create the aggregation object, and get the zillion objects to insert...
ContinuousScalarField field = new ContinuousScalarField();
Collection<Measurement> measurements = getMeasurementsFromSomewhere();
// SECTION 2:
// Add all of the zillion objects to the aggregation object...
// Each measurement contains its xyz location, the quantity being measured,
// and a numeric value for the measurement. For example, something like
// "68 degrees F, plus or minus 0.5, at point 1.23, 2.34, 3.45"
for (Measurement m : measurements) {
    field.add(m);
}
// SECTION 3:
// Now the user wants to ask the model questions about the interpolated
// state of the model. For example, "what's the interpolated temperature
// at point (3, 4, 5)
Point3d p = new Point3d(3, 4, 5);
Measurement result = field.interpolateAt(p);
For my particular problem domain, it will be possible to perform a small amount of incremental work (partitioning the points into a balanced KDTree) during SECTION 2.
And there will be a small amount of work (performing some linear interpolations) that can occur during SECTION 3.
But there's a huge amount of work (constructing a kernel density estimator and performing a Fast Gauss Transform, using Taylor series and Hermite functions, but that's totally beside the point) that must be performed between sections 2 and 3.
Sometimes in the past, I've just used lazy-evaluation to construct the data structures (in this case, it'd be on the first invocation of the "interpolateAt" method), but then if the user calls the "field.add()" method again, I have to completely discard those data structures and start over from scratch.
In other projects, I've required the user to explicitly call an "object.flip()" method, to switch from "append mode" into "query mode". The nice thing about a design like this is that the user has better control over the exact moment when the hard-core computation starts. But it can be a nuisance for the API consumer to keep track of the object's current mode. And besides, in the standard use case, the caller never adds another value to the collection after starting to issue queries; data aggregation almost always fully precedes query preparation.
How have you guys handled designing a data structure like this?
Do you prefer to let an object lazily perform its heavy-duty analysis, throwing away the intermediate data structures when new data comes into the collection? Or do you require the programmer to explicitly flip the data structure from append-mode into query-mode?
And do you know of any naming convention for objects like this? Is there a pattern I'm not thinking of?
ON EDIT:
There seems to be some confusion and curiosity about the class I used in my example, named "ContinuousScalarField".
You can get a pretty good idea for what I'm talking about by reading these wikipedia pages:
http://en.wikipedia.org/wiki/Scalar_field
http://en.wikipedia.org/wiki/Vector_field
Let's say you wanted to create a topographical map (this is not my exact problem, but it's conceptually very similar). So you take a thousand altitude measurements over an area of one square mile, but your survey equipment has a margin of error of plus-or-minus 10 meters in elevation.
Once you've gathered all the data points, you feed them into a model which not only interpolates the values, but also takes into account the error of each measurement.
To draw your topo map, you query the model for the elevation of each point where you want to draw a pixel.
As for the question of whether a single class should be responsible for both appending and handling queries, I'm not 100% sure, but I think so.
Here's a similar example: HashMap and TreeMap classes allow objects to be both added and queried. There aren't separate interfaces for adding and querying.
Both classes are also similar to my example, because the internal data structures have to be maintained on an ongoing basis in order to support the query mechanism. The HashMap class has to periodically allocate new memory, re-hash all objects, and move objects from the old memory to the new memory. A TreeMap has to continually maintain tree balance, using the red-black-tree data structure.
The only difference is that my class will perform optimally if it can perform all of its calculations once it knows the data set is closed.
If an object has two modes like this, I would suggest exposing two interfaces to the client. If the object is in append mode, then you make sure that the client can only ever use the IAppendable implementation. To flip to query mode, you add a method to IAppendable such as AsQueryable. To flip back, call IQueryable.AsAppendable.
You can implement IAppendable and IQueryable on the same object, and keep track of the state in the same way internally, but having two interfaces makes it clear to the client what state the object is in, and forces the client to deliberately make the (expensive) switch.
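A minimal sketch of the two-interface idea (shown here in C++ rather than Java; the interface and method names echo the answer above, while the signatures and the supporting structs are my own guesses):
#include <memory>

struct Point3d { double x, y, z; };
struct Measurement { Point3d where; double value; };

class IAppendable;
class IQueryable;

// Append mode: the client can only add measurements.
class IAppendable {
public:
    virtual ~IAppendable() = default;
    virtual void add(const Measurement &m) = 0;
    // The deliberate, potentially expensive switch into query mode.
    virtual std::unique_ptr<IQueryable> asQueryable() = 0;
};

// Query mode: the client can only ask for interpolated values.
class IQueryable {
public:
    virtual ~IQueryable() = default;
    virtual Measurement interpolateAt(const Point3d &p) const = 0;
    // Flip back, discarding the precomputed query structures.
    virtual std::unique_ptr<IAppendable> asAppendable() = 0;
};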
I generally prefer to have an explicit change, rather than lazily recomputing the result. This approach makes the performance of the utility more predictable, and it reduces the amount of work I have to do to provide a good user experience. For example, if this occurs in a UI, where do I have to worry about popping up an hourglass, etc.? Which operations are going to block for a variable amount of time, and need to be performed in a background thread?
That said, rather than explicitly changing the state of one instance, I would recommend the Builder Pattern to produce a new object. For example, you might have an aggregator object that does a small amount of work as you add each sample. Then instead of your proposed void flip() method, I'd have a Interpolator interpolator() method that gets a copy of the current aggregation and performs all your heavy-duty math. Your interpolateAt method would be on this new Interpolator object.
If your usage patterns warrant, you could do simple caching by keeping a reference to the interpolator you create, and return it to multiple callers, only clearing it when the aggregator is modified.
This separation of responsibilities can help yield more maintainable and reusable object-oriented programs. An object that can return a Measurement at a requested Point is very abstract, and perhaps a lot of clients could use your Interpolator as one strategy implementing a more general interface.
I think that the analogy you added is misleading. Consider an alternative analogy:
Key[] data = new Key[...];
data[idx++] = new Key(...); /* Fast! */
...
Arrays.sort(data); /* Slow! */
...
boolean contains = Arrays.binarySearch(data, datum) >= 0; /* Fast! */
This can work like a set, and actually, it gives better performance than Set implementations (which are implemented with hash tables or balanced trees).
A balanced tree can be seen as an efficient implementation of insertion sort. After every insertion, the tree is in a sorted state. The predictable time requirements of a balanced tree are due to the fact the cost of sorting is spread over each insertion, rather than happening on some queries and not others.
The rehashing of hash tables does result in less consistent performance, and because of that, hash tables aren't appropriate for certain applications (perhaps a real-time microcontroller). But even the rehashing operation depends only on the load factor of the table, not the pattern of insertion and query operations.
For your analogy to hold strictly, you would have to "sort" (do the hairy math) your aggregator with each point you add. But it sounds like that would be cost prohibitive, and that leads to the builder or factory method patterns. This makes it clear to your clients when they need to be prepared for the lengthy "sort" operation.
Your objects should have one role and responsibility. In your case, should the ContinuousScalarField be responsible for interpolating?
Perhaps you might be better off doing something like:
IInterpolator interpolator = field.GetInterpolator();
Measurement measurement = interpolator.InterpolateAt(...);
I hope this makes sense, but without fully understanding your problem domain it's hard to give you a more coherent answer.
"I've just used lazy-evaluation to construct the data structures" -- Good
"if the user calls the "field.add()" method again, I have to completely discard those data structures and start over from scratch." -- Interesting
"in the standard use case, the caller never adds another value to the collection after starting to issue queries" -- Whoops, false alarm, actually not interesting.
Since lazy eval fits your use case, stick with it. That's a very heavily used model because it is so delightfully reliable and fits most use cases very well.
The only reason for rethinking this is (a) the use case change (mixed adding and interpolation), or (b) performance optimization.
Since use case changes are unlikely, you might consider the performance implications of breaking up interpolation. For example, during idle time, can you precompute some values? Or with each add is there a summary you can update?
Also, a highly stateful (and not very meaningful) flip method isn't so useful to clients of your class. However, breaking interpolation into two parts might still be helpful to them -- and help you with optimization and state management.
You could, for example, break interpolation into two methods.
public void interpolateAt( Point3d p );
public Measurement interpolatedMeasurement();
This borrows the Open and Fetch paradigm from relational databases. Opening a cursor may do a lot of preliminary work, and it may even start executing the query; you don't know. Fetching the first row may do all the work, or execute the prepared query, or simply return the first buffered row. You don't really know. You only know that it's a two-part operation. The RDBMS developers are free to optimize as they see fit.
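A minimal sketch of that two-part, open/fetch style in Java. The class name and buffering details are assumptions; Measurement and Point3d stand in for the question's types:
class FieldInterpolation {
    private Point3d pendingPoint;
    private Measurement result;                  // the "buffered row", computed lazily

    public void interpolateAt(Point3d p) {
        pendingPoint = p;                        // "open": record the request, maybe start some work
        result = null;
    }

    public Measurement interpolatedMeasurement() {
        if (result == null) {
            result = compute(pendingPoint);      // "fetch": do (or finish) the heavy-duty math
        }
        return result;
    }

    private Measurement compute(Point3d p) {
        // ...actual interpolation math would go here...
        return new Measurement();
    }
}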
Do you prefer to let an object lazily perform its heavy-duty analysis, throwing away the intermediate data structures when new data comes into the collection? Or do you require the programmer to explicitly flip the data structure from append-mode into query-mode?
I prefer using data structures that allow me to incrementally add to them with "a little more work" per addition, and to incrementally pull out the data I need with "a little more work" per extraction.
Perhaps if you do some "interpolate_at()" call in the upper-right corner of your region, you only need to do calculations involving the points in that upper-right corner, and it doesn't hurt anything to leave the other 3 quadrants "open" to new additions (and so on down the recursive KDTree).
Alas, that's not always possible -- sometimes the only way to add more data is to throw away all the previous intermediate and final results, and re-calculate everything again from scratch.
The people who use the interfaces I design -- in particular, me -- are human and fallible.
So I don't like using objects where those people must remember to do things in a certain way, or else things go wrong -- because I'm always forgetting those things.
If an object must be in the "post-calculation state" before getting data out of it, i.e. some "do_calculations()" function must run before interpolateAt() returns valid data, then I much prefer letting interpolateAt() check whether the object is already in that state, run "do_calculations()" and update the state if necessary, and then return the results I expected.
Sometimes I hear people describe such a data structure as one that "freezes" the data, "crystallizes" it, "compiles" it, or "puts the data into an immutable data structure".
One example is converting a (mutable) StringBuilder or StringBuffer into an (immutable) String.
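For example, in standard Java:
StringBuilder builder = new StringBuilder();
builder.append("mutable while you build it");
String frozen = builder.toString();   // an immutable snapshot of the accumulated data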
I can imagine that for some kinds of analysis, you expect to have all the data ahead of time, and pulling out some interpolated value before all the data has been put in would give wrong results. In that case, I'd prefer to set things up so that the "add_data()" function fails or throws an exception if it (incorrectly) gets called after any interpolateAt() call.
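A minimal sketch of both behaviours in Java: interpolateAt() lazily runs the heavy math and freezes the collection, and addData() throws if called afterwards. The names and internals are illustrative; Sample, Measurement, and Point3d are placeholder domain types.
import java.util.ArrayList;
import java.util.List;

class ScalarField {
    private final List<Sample> samples = new ArrayList<>();
    private boolean frozen = false;              // set once the heavy-duty math has run

    public void addData(Sample s) {
        if (frozen) {
            throw new IllegalStateException("cannot add data after interpolation has started");
        }
        samples.add(s);
    }

    public Measurement interpolateAt(Point3d p) {
        if (!frozen) {
            doCalculations();                    // lazily build the query structures
            frozen = true;                       // "crystallize" the collection
        }
        return lookup(p);
    }

    private void doCalculations() { /* ...KD-tree construction, etc... */ }

    private Measurement lookup(Point3d p) {
        // ...query the structures built by doCalculations()...
        return new Measurement();
    }
}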
I would consider defining a lazily-evaluated "interpolated_point" object that doesn't really evaluate the data right away, but only tells the program that the data at that point will be required sometime in the future. The collection isn't actually frozen, so it's OK to continue adding more data to it, up until the point something actually extracts the first real value from some "interpolated_point" object, which internally triggers the "do_calculations()" function and freezes the collection.
It might speed things up if you know not only all the data, but also all the points that need to be interpolated, all ahead of time.
Then you can throw away data that is "far away" from the interpolated points, and only do the heavy-duty calculations in regions "near" the interpolated points.
For other kinds of analysis, you do the best you can with the data you have, but when more data comes in later, you want to use that new data in your later analysis.
If the only way to do that is to throw away all the intermediate results and recalculate everything from scratch, then that's what you have to do.
(And it's best if the object automatically handled this, rather than requiring people to remember to call some "clear_cache()" and "do_calculations()" function every time).
You could have a state variable. Have a method for starting the high-level processing, which will only work if the state is SECTION-1. It sets the state to SECTION-2, and then to SECTION-3 when it has finished computing. If the program is asked to interpolate a given point, it checks whether the state is SECTION-3. If not, it first requests that the computations begin, and then interpolates the given data.
This way, you accomplish both - the program will perform its computations at the first request to interpolate a point, but can also be requested to do so earlier. This would be convenient if you wanted to run the computations overnight, for example, without needing to request an interpolation.
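A minimal sketch of that state machine in Java. The SECTION names mirror the description above; the computation itself is stubbed out, and Measurement/Point3d are placeholder domain types:
class FieldProcessor {
    enum State { SECTION_1, SECTION_2, SECTION_3 }

    private State state = State.SECTION_1;

    // Can be called explicitly (e.g. overnight), or implicitly by interpolate().
    public void startProcessing() {
        if (state != State.SECTION_1) {
            return;                              // only start from SECTION-1
        }
        state = State.SECTION_2;                 // computing
        runHeavyComputation();
        state = State.SECTION_3;                 // ready to answer queries
    }

    public Measurement interpolate(Point3d p) {
        if (state != State.SECTION_3) {
            startProcessing();                   // kick off the computation on first request
        }
        return lookup(p);
    }

    private void runHeavyComputation() { /* ... */ }

    private Measurement lookup(Point3d p) {
        // ...interpolate using the precomputed results...
        return new Measurement();
    }
}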

How can I make my applications scale well?

In general, what kinds of design decisions help an application scale well?
(Note: Having just learned about Big O Notation, I'm looking to gather more principles of programming here. I've attempted to explain Big O Notation by answering my own question below, but I want the community to improve both this question and the answers.)
Responses so far
1) Define scaling. Do you need to scale for lots of users, traffic, objects in a virtual environment?
2) Look at your algorithms. Will the amount of work they do scale linearly with the actual size of the input - i.e. the number of items to loop through, the number of users, etc.?
3) Look at your hardware. Is your application designed such that you can run it on multiple machines if one can't keep up?
Secondary thoughts
1) Don't optimize too much too soon - test first. Maybe bottlenecks will happen in unforeseen places.
2) Maybe the need to scale will not outpace Moore's Law, and maybe upgrading hardware will be cheaper than refactoring.
The only thing I would say is write your application so that it can be deployed on a cluster from the very start. Anything above that is a premature optimisation. Your first job should be getting enough users to have a scaling problem.
Build the code as simple as you can first, then profile the system second and optimise only when there is an obvious performance problem.
Often the figures from profiling your code are counter-intuitive; the bottle-necks tend to reside in modules you didn't think would be slow. Data is king when it comes to optimisation. If you optimise the parts you think will be slow, you will often optimise the wrong things.
OK, so you've hit on a key point in using "big O notation". That's one dimension that can certainly bite you in the rear if you're not paying attention. There are also other dimensions at play that some folks don't see through the "big O" glasses (but if you look closer, they really are there).
A simple example of that dimension is a database join. There are "best practices" in constructing, say, a complex join which will help the SQL execute more efficiently. If you break down the relational calculus, or even look at an explain plan (Oracle), you can easily see which indexes are being used in which order and whether any table scans or nested operations are occurring.
The concept of profiling is also key. You have to be instrumented thoroughly and at the right granularity across all the moving parts of the architecture in order to identify and fix any inefficiencies. Say for example you're building a 3-tier, multi-threaded, MVC2 web-based application with liberal use of AJAX and client side processing along with an OR Mapper between your app and the DB. A simplistic linear single request/response flow looks like:
browser -> web server -> app server -> DB -> app server -> XSLT -> web server -> browser JS engine execution & rendering
You should have some method for measuring performance (response times, throughput measured in "stuff per unit time", etc.) in each of those distinct areas, not only at the box and OS level (CPU, memory, disk i/o, etc.), but specific to each tier's service. So on the web server you'll need to know all the counters for the web server you're using. In the app tier, you'll need that plus visibility into whatever virtual machine you're using (jvm, clr, whatever). Most OR mappers manifest inside the virtual machine, so make sure you're paying attention to all the specifics if they're visible to you at that layer. Inside the DB, you'll need to know everything that's being executed and all the specific tuning parameters for your flavor of DB. If you have big bucks, BMC Patrol is a pretty good bet for most of it (with appropriate knowledge modules (KMs)). At the cheap end, you can certainly roll your own but your mileage will vary based on your depth of expertise.
Presuming everything is synchronous (no queue-based things going on that you need to wait for), there are tons of opportunities for performance and/or scalability issues. But since your post is about scalability, let's ignore the browser except for any remote XHR calls that will invoke another request/response from the web server.
So given this problem domain, what decisions could you make to help with scalability?
Connection handling. This is also bound to session management and authentication. That has to be as clean and lightweight as possible without compromising security. The metric is maximum connections per unit time.
Session failover at each tier. Necessary or not? We assume that each tier will be a cluster of boxes horizontally under some load balancing mechanism. Load balancing is typically very lightweight, but some implementations of session failover can be heavier than desired. Also whether you're running with sticky sessions can impact your options deeper in the architecture. You also have to decide whether to tie a web server to a specific app server or not. In the .NET remoting world, it's probably easier to tether them together. If you use the Microsoft stack, it may be more scalable to do 2-tier (skip the remoting), but you have to make a substantial security tradeoff. On the java side, I've always seen it at least 3-tier. No reason to do it otherwise.
Object hierarchy. Inside the app, you need the cleanest, lightest-weight object structure possible. Only bring the data you need when you need it. Viciously excise any unnecessary or superfluous fetching of data.
OR mapper inefficiencies. There is an impedance mismatch between object design and relational design. The many-to-many construct in an RDBMS is in direct conflict with object hierarchies (person.address vs. location.resident). The more complex your data structures, the less efficient your OR mapper will be. At some point you may have to cut bait in a one-off situation and do a more...uh...primitive data access approach (Stored Procedure + Data Access Layer) in order to squeeze more performance or scalability out of a particularly ugly module. Understand the cost involved and make it a conscious decision.
XSL transforms. XML is a wonderful, normalized mechanism for data transport, but man can it be a huge performance dog! Depending on how much data you're carrying around with you and which parser you choose and how complex your structure is, you could easily paint yourself into a very dark corner with XSLT. Yes, academically it's a brilliantly clean way of doing a presentation layer, but in the real world there can be catastrophic performance issues if you don't pay particular attention to this. I've seen a system consume over 30% of transaction time just in XSLT. Not pretty if you're trying to ramp up 4x the user base without buying additional boxes.
Can you buy your way out of a scalability jam? Absolutely. I've watched it happen more times than I'd like to admit. Moore's Law (as you already mentioned) is still valid today. Have some extra cash handy just in case.
Caching is a great tool to reduce the strain on the engine (increasing speed and throughput is a handy side-effect). It comes at a cost though in terms of memory footprint and complexity in invalidating the cache when it's stale. My decision would be to start completely clean and slowly add caching only where you decide it's useful to you. Too many times the complexities are underestimated and what started out as a way to fix performance problems turns out to cause functional problems. Also, back to the data usage comment. If you're creating gigabytes worth of objects every minute, it doesn't matter if you cache or not. You'll quickly max out your memory footprint and garbage collection will ruin your day. So I guess the takeaway is to make sure you understand exactly what's going on inside your virtual machine (object creation, destruction, GCs, etc.) so that you can make the best possible decisions.
Sorry for the verbosity. Just got rolling and forgot to look up. Hope some of this touches on the spirit of your inquiry and isn't too rudimentary a conversation.
Well, there's this blog called High Scalability that contains a lot of information on this topic. Some useful stuff.
Often the most effective way to do this is with a well-thought-through design where scaling is part of it from the start.
Decide what scaling actually means for your project. Is it an infinite number of users, is it being able to handle a slashdotting on a website, or is it development cycles?
Use this to focus your development efforts.
Jeff and Joel discuss scaling in the Stack Overflow Podcast #19.
FWIW, most systems will scale most effectively by ignoring this until it's a problem - Moore's law is still holding, and unless your traffic is growing faster than Moore's law does, it's usually cheaper to just buy a bigger box (at $2K or $3K a pop) than to pay developers.
That said, the most important place to focus is your data tier; that is the hardest part of your application to scale out, as it usually needs to be authoritative, and clustered commercial databases are very expensive - the open source variations are usually very tricky to get right.
If you think there is a high likelihood that your application will need to scale, it may be intelligent to look into systems like memcached or MapReduce relatively early in your development.
One good idea is to determine how much work each additional task creates. This can depend on how the algorithm is structured.
For example, imagine you have some virtual cars in a city. At any moment, you want each car to have a map showing where all the cars are.
One way to approach this would be:
for each car {
    determine my position;
    for each car {
        add my position to this car's map;
    }
}
This seems straightforward: look at the first car's position, add it to the map of every other car. Then look at the second car's position, add it to the map of every other car. Etc.
But there is a scalability problem. When there are 2 cars, this strategy takes 4 "add my position" steps; when there are 3 cars, it takes 9 steps. For each "position update," you have to cycle through the whole list of cars - and every car needs its position updated.
Ignoring how many other things must be done to each car (for example, it may take a fixed number of steps to calculate the position of an individual car), for N cars, it takes N² "visits to cars" to run this algorithm. This is no problem when you've got 5 cars and 25 steps. But as you add cars, you will see the system bog down. 100 cars will take 10,000 steps, and 101 cars will take 10,201 steps!
A better approach would be to undo the nesting of the for loops.
for each car {
    add my position to a list;
}
for each car {
    give me an updated copy of the master list;
}
With this strategy, the number of steps is a multiple of N, not of N². So 100 cars will take 100 times the work of 1 car - NOT 10,000 times the work.
This concept is sometimes expressed in "big O notation" - the number of steps needed is "big O of N" or "big O of N²."
Note that this concept is only concerned with scalability - not optimizing the number of steps for each car. Here we don't care if it takes 5 steps or 50 steps per car - the main thing is that N cars take (X * N) steps, not (X * N²).
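A small Java sketch of the two approaches, assuming a toy Car class (all names are illustrative). In the flattened version, every car shares one master list rather than receiving a per-car copy, which keeps the total work proportional to N:
import java.util.ArrayList;
import java.util.List;

class Car {
    double x, y;                                 // this car's current position
    List<double[]> map = new ArrayList<>();      // positions of all cars, as this car sees them
}

class Traffic {
    // Nested version: roughly N * N "visits to cars".
    static void updateMapsNested(List<Car> cars) {
        for (Car source : cars) {
            for (Car target : cars) {
                target.map.add(new double[] { source.x, source.y });
            }
        }
    }

    // Flattened version: roughly 2 * N "visits to cars" - build the master list once,
    // then point every car at the same shared list instead of copying it per car.
    static void updateMapsFlat(List<Car> cars) {
        List<double[]> master = new ArrayList<>();
        for (Car car : cars) {
            master.add(new double[] { car.x, car.y });
        }
        for (Car car : cars) {
            car.map = master;
        }
    }
}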

Resources