When to store quaternion vs matrix in static and dynamic objects (data structure design)

My question is about design and possible suggestions for the following scenario:
I am writing a 3d visualizer. For my renderable objects I would like to store the minimum data possible (so quaternions are naturally nice for rotation).
At some point I must extract a matrix for rendering, which requires computation and temporary storage on every frame update (even for objects that do not change spatially).
Given that many objects remain static and don't need to be rotated locally, would it make sense to store the matrix instead and thereby avoid the computation for each object each frame? Is there a best-practice approach to this, perhaps from a game-engine design point of view?
I am currently a bit torn between the two extremes of storing either position+quaternion or a 4x3/4x4 matrix. Looking at openFrameworks (which is not necessarily trying to achieve the same goal as me), they seem to use a hybrid where they store a quaternion AND a matrix (the matrix always reflecting the quaternion), so it's always ready when needed but must be updated along with every change to the quaternion.

The most compact storage requires 3 scalars, so Euler angles or exponential maps (Rodrigues) can be used. Quaternions are a good compromise between conversion-to-matrix speed and compactness.
From a design point of view, there is a good rule: "make all design decisions as LATE as possible". In your case, just encapsulate (isolate) the rotation (transformation) representation, so that in the future you can change the physical storage of the data in its different states (file, memory, rendering and more). This also enables platform-specific optimization, such as keeping the data on the GPU or the CPU.
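A minimal sketch of that encapsulation (C++ here, all names hypothetical): position and quaternion stay the authoritative state, and the matrix is a lazily rebuilt cache behind a dirty flag, so static objects pay the quaternion-to-matrix conversion only once rather than every frame.

```cpp
#include <array>

struct Quat { float w = 1, x = 0, y = 0, z = 0; }; // unit quaternion
struct Vec3 { float x = 0, y = 0, z = 0; };
using Mat4 = std::array<float, 16>;                // column-major

// Position + quaternion are the single source of truth; the matrix
// is a cache rebuilt only when the transform actually changed.
class Transform {
public:
    void setRotation(const Quat& q) { rot_ = q; dirty_ = true; }
    void setPosition(const Vec3& p) { pos_ = p; dirty_ = true; }

    // Static objects hit the cached path on every call.
    const Mat4& matrix() const {
        if (dirty_) { rebuild(); dirty_ = false; }
        return cache_;
    }

private:
    void rebuild() const {
        const Quat& q = rot_;
        // Standard unit-quaternion-to-rotation-matrix expansion,
        // with the position in the fourth column.
        cache_ = { 1-2*(q.y*q.y+q.z*q.z), 2*(q.x*q.y+q.w*q.z),   2*(q.x*q.z-q.w*q.y),   0,
                   2*(q.x*q.y-q.w*q.z),   1-2*(q.x*q.x+q.z*q.z), 2*(q.y*q.z+q.w*q.x),   0,
                   2*(q.x*q.z+q.w*q.y),   2*(q.y*q.z-q.w*q.x),   1-2*(q.x*q.x+q.y*q.y), 0,
                   pos_.x,                pos_.y,                pos_.z,                1 };
    }

    Quat rot_;
    Vec3 pos_;
    mutable Mat4 cache_{};
    mutable bool dirty_ = true;
};
```

This also matches the openFrameworks-style hybrid mentioned in the question, except the matrix is updated on demand instead of on every quaternion change.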

Been there.
First: keep in mind the omnipresent struggle of time against space (in computer science, processing time against memory requirements).
You said that you want to keep the minimum information possible at first (space), and then talked about some temporary matrix reflecting the quaternions, which is more of a time worry.
If you accept a tip, I would go for the matrices. They are generally the performance-wise standard for 3D graphics, and their size easily becomes irrelevant next to the object data itself.
Just to give an idea: on most GPUs, transforming a vector by the identity (no change) is actually faster than checking whether it needs transformation and then doing nothing.
As for engines, I can't think of one that does not apply the transformations to every vertex every frame. Even if the objects stay in place, their positions have to go through the projection and view matrices.
(Does this answer your question? Maybe I got you wrong.)

Related

Can a CNN recognize the difference in size if the images are the same?

Could a CNN tell the difference between different size ranges of the same organism? Like a puppy vs. an adult dog, or a child vs. an adult? Or, more like, a large fly vs. a small fly, where they look identical but one is just larger than the other?
This is a tricky question to answer, but in theory a CNN is able to do it. It mainly depends on the training data itself. In the case of child vs. adult, you can gather a dataset that includes a lot of variance in sizes and ages, in order to make sure you end up with a CNN model that is able to find patterns and generalize. In the end, the CNN will learn many other features that make the classification scale- or size-invariant (independent of size), such as shapes, colors, clothes, face features, etc. Such intra-class classification problems are not easily tackled with traditional supervised learning, and therefore some researchers are applying an approach called "Deep Metric Learning".
Metric learning is the task of learning a distance function over objects. A metric or distance function has to obey four axioms: non-negativity, identity of indiscernibles, symmetry and subadditivity (or the triangle inequality). In practice, metric learning algorithms ignore the condition of identity of indiscernibles and learn a pseudo-metric. (Wikipedia definition)
It would help to differentiate the two attributes you mention in the question: recognizing age and recognizing size are different tasks.
About the age: yes, it is doable. For a deep learning-based approach, you will need appropriate data. For a non-training-based approach (old-school image processing), you would need to create some metric for each object based on age (counting wrinkles, white hair, etc. for humans).
About the size: unfortunately, this is still under research, and it is not clear whether it is properly doable or not. Whenever we attempt object size recognition from a single image, there is more to consider. The first thing is perspective. If the object in the image is large with respect to the image coordinates, is it a tiny object close to the camera, and therefore showing up large, or is it really huge but far away from the camera? Such a problem may be overcome by knowing the object geometry beforehand and developing an algorithm based on that geometry along with deep learning. However, current deep learning technology is not yet accurate enough to distinguish dimensions and location, and hence object geometry, precisely.
Another alternative would be to control the environment. For example, if you know that both objects lie on the same plane (i.e. on the table, next to each other) in the real world, the rest is a trivial problem to resolve.

DrawPrimitives performance

I want to draw single faces instead of XNA models, because drawing whole models is too slow.
But I don't know what the difference is between
DrawPrimitives
DrawUserPrimitives
DrawIndexedPrimitives
DrawUserIndexedPrimitives
Which one is the fastest method? And what are the indices good for?
The simple answer to your question is that the "User" versions are a fair bit slower on the CPU because they have to transfer vertex data to the GPU (via the driver and the bus) each time they are called.
The non-User versions use vertex and index buffers that already exist on the GPU (you put them there at load time). They have considerably less data to transfer, so they are faster.
The "User" and "Indexed" versions will also each have a performance impact on the GPU. This impact is relatively tiny. Generally speaking you don't need to worry about it.
The User versions exist because they are faster when your data changes each frame. There is also DynamicVertexBuffer which can be used with the non-User version of the draw functions. I believe it is slightly faster than the User methods in cases where you can pre-allocate the buffer at the desired size.
The Indexed versions allow you to select vertices out of your vertex buffer using an index buffer (so triangles that you draw can choose vertices at any position in the vertex buffer). The alternative is that your vertex buffer is simply interpreted as a sequential list of triangle vertices (based on PrimitiveType). The main reason for the existence of index buffers is to remove the need for duplicate vertices in your vertex buffer (which would require additional memory and processing on the GPU).
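To make the deduplication concrete, here is an API-agnostic sketch (plain C++ containers, not XNA types): a quad drawn as a raw triangle list needs six vertices, while the indexed version stores four unique vertices plus six small indices.

```cpp
#include <cstdint>
#include <vector>

struct Vertex { float x, y, z; };

// Non-indexed triangle list for one quad: 6 vertices, two of them
// duplicates of data already in the buffer.
std::vector<Vertex> raw = {
    {0,0,0}, {1,0,0}, {1,1,0},    // triangle 1
    {0,0,0}, {1,1,0}, {0,1,0} };  // triangle 2 repeats two vertices

// Indexed version: 4 unique vertices plus 6 two-byte indices that
// select them, which is what an index buffer holds on the GPU.
std::vector<Vertex>   verts   = { {0,0,0}, {1,0,0}, {1,1,0}, {0,1,0} };
std::vector<uint16_t> indices = { 0, 1, 2,   0, 2, 3 };
```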
BUT...
XNA's Model class internally uses DrawIndexedPrimitives. Not only that, but it uses it correctly (ie: it doesn't draw single faces - but as many as it can at once - for the best performance). So if you are finding that it is slow, then your problem lies elsewhere.
I suggest trying to diagnose the reason why your game is performing poorly, before trying to select a "solution". Maybe ask for help doing that in a question here (or on https://gamedev.stackexchange.com/).
All in one call if you can: instanced drawing will always be better, but it requires you to supply all the textures at once! In my case, for example, I like to draw instanced objects with only one texture each... all the trees, all the ground, all the buildings, etc.

Proper Data Structure Choice for Collision System

I am looking to implement a 2D top-down collision system, and was hoping for some input as to the likely performance between a few different ideas. For reference I expect the number of moving collision objects to be in the dozens, and the static collision objects to be in the hundreds.
The first idea is border-line brute force (or maybe not so border-line). I would store two lists of collision objects in a collision system. One list would be dynamic objects, the other would include both dynamic and static objects (each dynamic would be in both lists). Each frame I would loop through the dynamic list and pass each object the larger list, so it could find anything it may run into. This will involve a lot of unnecessary calculations for any reasonably sized loaded area but I am using it as a sort of baseline because it would be very easy to implement.
The second idea is to have a single list of all collision objects, and a 2D array of either ints or floats representing the loaded area. Each element in the array would represent a physical location, and each object would have a size value. Each time an object moved, it would subtract its size value from its old location and add it to its new location. The objects would have to access elements in the array before they moved to make sure there was room in their new location, but that would be fairly simple to do. Besides the fact that I have a very public, very large array, I think it would perform fairly well. I could also implement with a boolean array, simply storing if a location is full or not, but I don't see any advantage to this over the numeric storage.
The third idea I had was less well formed. A month or two ago I read about a two-dimensional, rectangle-based data structure (it may have been a tree, I don't remember) that can keep elements sorted by position. Then I would only have to pass the dynamic objects their small neighborhood of objects for update. I was wondering if anyone had any idea what this data structure might be, so I could look more into it, and if so, how the per-frame sorting of it would affect performance relative to the other methods.
Really I am just looking for ideas on how these would perform, and any pitfalls I am likely overlooking in any of these. I am not so much worried about the actual detection, as the most efficient way to make the objects talk to one another.
You're not talking about a lot of objects in this case. Honestly, you could probably brute force it and probably be fine for your application, even in mobile game development. With that in mind, I'd recommend you keep it simple but throw a bit of optimization on top for gravy. Spatial hashing with a reasonable cell size is the way I'd go here -- relatively reasonable memory use, decent speedup, and not that bad as far as complexity of implementation goes. More on that in a moment!
You haven't said what the representation of your objects is, but in any case you're likely going to end up with a typical "broad phase" and "narrow phase" (like a physics engine) -- the "broad phase" consisting of a false-positives "what could be intersecting?" query and the "narrow phase" brute forcing out the resulting potential intersections. Unless you're using things like binary space partitioning trees for polygonal shapes, you're not going to end up with a one-phase solution.
As mentioned above, for the broad phase I'd use spatial hashing. Basically, you establish a grid and mark down what's in touch with each grid. (It doesn't have to be perfect -- it could be what axis-aligned bounding boxes are in each grid, even.) Then, later you go through the relevant cells of the grid and check if everything in each relevant cell is actually intersecting with anything else in the cell.
The trick is, instead of having an array, you have a hash table keyed by grid cell. That way you're only taking up space for cells that actually have something in them. (This is no substitute for a badly sized grid -- you want your grid to be coarse enough that an object doesn't land in a ridiculous number of cells, because that takes memory, but fine enough that not all objects land in a few cells, because that doesn't save much time.) Chances are that by visual inspection you'll be able to figure out a good grid size.
One additional step to spatial hashing... if you want to save memory, throw away the indices that you'd normally verify in a hash table. False positives only cost CPU time, and if you're hashing correctly, it's not going to turn out to be much, but it can save you a lot of memory.
So:
When you update objects, update which cells they're probably in. (Again, it's good enough to just use a bounding box, e.g. a square or rectangle around the object.)
Add the object to the hash table entry for each cell it's in. (E.g. if you're in cell 5,4 and that hashes to the 17th entry of the hash table, add the object to that entry and throw away the 5,4 data.)
Then, to test collisions, go through the relevant cells in the hash table (e.g. the entire screen's worth of cells, if that's what you're interested in) and see which objects inside each cell collide with the other objects inside that cell.
Compared to the solutions above:
Compared to brute forcing, it takes less time.
This has some commonality with the "2D array" method mentioned because, after all, we're imposing a "grid" (or 2D array) over the represented space, however we're doing it in a way less prone to accuracy errors (since it's only used for a broad-phase that is conservative). Additionally, the memory requirements are lessened by the zealous data reduction in hash tables.
kd-, sphere-, X-, BSP-, R- and other "TLA"-trees are almost always quite nontrivial to implement correctly and test, and even after all that effort they can end up being much slower than you'd expect. You normally don't need that sort of complexity for a few hundred objects.
Implementation note:
Each node in the spatial hash table will ultimately be a linked list. I recommend writing your own linked list with careful allocations. Each node need not take up more than 8 bytes (if you're using C/C++) and should use a pooled allocation scheme so you're almost never allocating or freeing memory. Relying on the built-in allocator will likely cripple performance.
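A minimal sketch of the broad phase described above (the cell size, key scheme and std::vector buckets are all placeholder choices; a real implementation would use the pooled list just mentioned):

```cpp
#include <cmath>
#include <cstdint>
#include <unordered_map>
#include <vector>

struct SpatialHash {
    float cellSize = 64.0f; // tune by visual inspection, as above
    std::unordered_map<uint64_t, std::vector<int>> buckets; // object ids per cell

    static int32_t cellOf(float v, float size) {
        return (int32_t)std::floor(v / size); // floor handles negative coordinates
    }
    static uint64_t key(int32_t cx, int32_t cy) {
        return (uint64_t(uint32_t(cx)) << 32) | uint32_t(cy);
    }

    // Broad-phase insert: mark every cell the AABB touches. False
    // positives are fine; the narrow phase sorts them out later.
    void insert(int id, float minX, float minY, float maxX, float maxY) {
        for (int32_t cy = cellOf(minY, cellSize); cy <= cellOf(maxY, cellSize); ++cy)
            for (int32_t cx = cellOf(minX, cellSize); cx <= cellOf(maxX, cellSize); ++cx)
                buckets[key(cx, cy)].push_back(id);
    }
};
```

Collision testing then iterates each bucket and runs the narrow phase on the pairs inside it, exactly as described in the steps above.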
First thing, I am but a noob; I am working my way through the 3DBuzz XNA Extreme 101 videos, and we are just now covering a system that uses static lists of each different type of object. When updating an object, you only check it against the list(s) of things it is supposed to collide with.
So you only check enemy collisions against the player or the player's bullets, not against other enemies, etc.
So there is a static list of each type of game object, and each game node has its own collision list (edit: a list of nodes) containing only the types it can hit.
Sorry if it's not clear what I mean; I'm still finding my feet.

Is there some standard functionality in XNA to efficiently implement a 2D Camera on a large world

I am making a 2D space game with many moving objects. I have already implemented a camera that can move around and draw the objects within the view. The problem now is that I have to check, for every object, whether it is within the rectangle of my view before I draw it (O(n) operations, where n is the number of objects in my world). I would like a more efficient way to acquire all the objects within the view so I only have to draw them.
I know a bunch of data structures that can achieve O(log n + k) query time for a two-dimensional range query, where k is the number of objects within the range. The problem is that all the objects are constantly moving. The update time on most of these data structures is O(log n) as well. This is pretty bad, because almost all objects are moving, so all will have to be updated, resulting in O(n log n) operations. With my current implementation (everything is just stored in a list), updating everything takes O(n) operations.
I am thinking that this problem must have been solved already, but I couldn't really find a solution that specifically considers my options. Most 2D camera examples just do it the way I am currently doing.
So my question basically consists out of two things:
Is there a more efficient way to do this than my current one (in general)?
Is there a more efficient way to do this than my current one (in XNA)?
On one hand I am thinking, O(n) + O(n) is better than O(log n) + O(n log n), but on the other hand I know that in many games they use all these data structures like BSPs etc. So I feel like I am missing some piece of the puzzle.
PS: I am a bit confused about whether I should post this on Stack Overflow, the Game Development Stack Exchange or the Computer Science Theory Stack Exchange... so please excuse me if it's a bit out of scope.
My first question would be: are you really going to have a world with a million (or even a billion!) objects?
This is a performance optimisation, so here's what I would do:
First of all: nothing. Just draw everything and update everything every frame, using a big list. Suitable for tens of objects.
If that is too slow, I would do some basic culling as I iterated the list. So for each object - if it is off-screen, don't draw it. If it is "unimportant" (eg: a particle system, other kinds of animation, etc) don't update it while off-screen either. Suitable for hundreds of objects.
(You need to be able to get an object's position and bounding box, and compare it with the screen's rectangle.)
And finally, if that is too slow, I would implement a bucketed data structure (constant update time, constant query time). In this case a 2D grid of lists that covers your world space. Suitable for thousands of objects.
If that ends up taking up too much memory - if your world is crazy-large and sparse - I would start looking into quadtrees, perhaps. And remember that you don't have to update the space-partitioning structure every time you update an object - only when an object's position changes significantly!
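A minimal sketch of that bucketed grid (all names hypothetical, and the world is assumed to start at the origin): updating an object is a constant-time remove/insert, and a camera query only visits the cells its rectangle overlaps.

```cpp
#include <algorithm>
#include <vector>

struct Grid {
    int cols, rows;
    float cellSize;
    std::vector<std::vector<int>> cells; // object ids per cell

    Grid(int c, int r, float s) : cols(c), rows(r), cellSize(s), cells(c * r) {}

    int indexOf(float x, float y) const {
        int cx = std::clamp(int(x / cellSize), 0, cols - 1);
        int cy = std::clamp(int(y / cellSize), 0, rows - 1);
        return cy * cols + cx;
    }

    // Collect everything potentially visible: only the cells the
    // camera rectangle overlaps are touched, not the whole world.
    void query(float x0, float y0, float x1, float y1,
               std::vector<int>& out) const {
        int cx0 = std::clamp(int(x0 / cellSize), 0, cols - 1);
        int cx1 = std::clamp(int(x1 / cellSize), 0, cols - 1);
        int cy0 = std::clamp(int(y0 / cellSize), 0, rows - 1);
        int cy1 = std::clamp(int(y1 / cellSize), 0, rows - 1);
        for (int cy = cy0; cy <= cy1; ++cy)
            for (int cx = cx0; cx <= cx1; ++cx)
                for (int id : cells[cy * cols + cx])
                    out.push_back(id);
    }
};
```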
Remember the idiom: Do The Simplest Thing That Could Possibly Work. Implement the simple case, and then actually see if your game is running slowly with a real-world number of objects, before you go implementing something complicated.
You mean to say that you check every object that you wish to put on the screen, and take action depending on whether or not it is inside the specified screen area?
XNA automatically takes care of objects which are off-screen (by not drawing them, of course); you simply specify the coordinates... Or did I misunderstand your question?
EDIT
Why not use a container sprite, draw everything you want inside it, and then draw that single main sprite onto the screen? Calling draw on that single sprite won't really involve drawing the parts which are outside.

Improving raytracer performance

I'm writing a comparatively straightforward raytracer/path tracer in D (http://dsource.org/projects/stacy), but even with full optimization it still needs several thousand processor cycles per ray. Is there anything else I can do to speed it up? More generally, do you know of good optimizations / faster approaches for ray tracing?
Edit: this is what I'm already doing.
Code is already running highly parallel
Temporary data is structured in a cache-efficient fashion and aligned to 16 bytes
Screen divided into 32x32-tiles
Destination array is arranged in such a way that all subsequent pixels in a tile are sequential in memory
Basic scene graph optimizations are performed
Common combinations of objects (plane-plane CSG as in boxes) are replaced with preoptimized objects
Vector struct capable of taking advantage of GDC's automatic vectorization support
Subsequent hits on a ray are found via lazy evaluation; this prevents needless calculations for CSG
Triangles are neither supported nor a priority: plain primitives only, plus CSG operations and basic material properties
Bounding is supported
The typical first order improvement of raytracer speed is some sort of spatial partitioning scheme. Based only on your project outline page, it seems you haven't done this.
Probably the most usual approach is an octree, but the best approach may well be a combination of methods (e.g. spatial partitioning trees plus things like mailboxing). Bounding box/sphere tests are a quick, cheap and nasty approach, but you should note two things: 1) they don't help much in many situations, and 2) if your objects are already simple primitives, you aren't going to gain much (and might even lose). A regular grid is easier to implement than an octree for spatial partitioning, but it will only work really well for scenes that are somewhat uniformly distributed (in terms of surface locations).
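For example, a cheap bounding-sphere reject is only a few multiplies (a conservative sketch; the Vec3 type and normalized direction are assumptions):

```cpp
struct Vec3 { float x, y, z; };

// Conservative bounding-sphere reject: if the ray cannot hit the
// sphere, it cannot hit the contents, so the expensive tests are
// skipped. dir must be normalized.
bool mayHit(const Vec3& orig, const Vec3& dir,
            const Vec3& center, float radius)
{
    Vec3 oc { center.x - orig.x, center.y - orig.y, center.z - orig.z };
    float t  = oc.x*dir.x + oc.y*dir.y + oc.z*dir.z;       // closest approach along the ray
    float d2 = (oc.x*oc.x + oc.y*oc.y + oc.z*oc.z) - t*t;  // squared miss distance
    // Reject if the line misses the sphere, or the sphere is entirely behind the ray.
    return d2 <= radius*radius && t + radius >= 0.0f;
}
```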
A lot depends on the complexity of the objects you represent, your internal design (i.e. do you allow local transforms, referenced copies of objects, implicit surfaces, etc), as well as how accurate you're trying to be. If you are writing a global illumination algorithm with implicit surfaces the tradeoffs may be a bit different than if you are writing a basic raytracer for mesh objects or whatever. I haven't looked at your design in detail so I'm not sure what, if any, of the above you've already thought about.
Like any performance optimization process, you're going to have to measure first to find where you're actually spending the time, then improving things (algorithmically by preference, then code bumming by necessity)
One thing I learned with my ray tracer is that a lot of the old rules don't apply anymore. For example, many ray tracing algorithms do a lot of testing to get an "early out" of a computationally expensive calculation. In some cases, I found it was much better to eliminate the extra tests and always run the calculation to completion. Arithmetic is fast on a modern machine, but a missed branch prediction is expensive. I got something like a 30% speed-up on my ray-polygon intersection test by rewriting it with minimal conditional branches.
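To illustrate the style of rewrite (this is the classic slab ray/AABB test, not the answerer's actual polygon code): all six plane distances are computed unconditionally and reduced with min/max, which typically compile to branchless instructions.

```cpp
#include <algorithm>

struct Vec3 { float x, y, z; };

// Slab test with no per-axis early-outs. invDir = 1/direction,
// precomputed once per ray.
bool hitAabb(const Vec3& orig, const Vec3& invDir,
             const Vec3& lo, const Vec3& hi, float tMax)
{
    float tx1 = (lo.x - orig.x) * invDir.x, tx2 = (hi.x - orig.x) * invDir.x;
    float ty1 = (lo.y - orig.y) * invDir.y, ty2 = (hi.y - orig.y) * invDir.y;
    float tz1 = (lo.z - orig.z) * invDir.z, tz2 = (hi.z - orig.z) * invDir.z;

    float tNear = std::max({ std::min(tx1, tx2), std::min(ty1, ty2), std::min(tz1, tz2) });
    float tFar  = std::min({ std::max(tx1, tx2), std::max(ty1, ty2), std::max(tz1, tz2) });
    return tNear <= tFar && tFar >= 0.0f && tNear <= tMax;
}
```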
Sometimes the best approach is counter-intuitive. For example, I found that many scenes with a few large objects ran much faster when I broke them down into a large number of smaller objects. Depending on the scene geometry, this can allow your spatial subdivision algorithm to throw out a lot of intersection tests. And let's face it, intersection tests can be made only so fast. You have to eliminate them to get a significant speed-up.
Hierarchical bounding volumes help a lot, but I finally grokked the kd-tree, and got a HUGE increase in speed. Of course, building the tree has a cost that may make it prohibitive for real-time animation.
Watch for synchronization bottlenecks.
You've got to profile to be sure to focus your attention in the right place.
Is there anything else I can do to speed it up?
D, depending on the implementation and compiler, puts forth reasonably good performance. As you haven't explained what ray tracing methods and optimizations you're using already, then I can't give you much help there.
The next step, then, is to run a timing analysis on the program, and recode, in assembly, the most frequently used code or the slowest code that impacts performance the most.
More generally, check out the resources in these questions:
Literature and Tutorials for Writing a Ray Tracer
Anyone know of a really good book about Ray Tracing?
Computer Graphics: Raytracing and Programming 3D Renders
raytracing with CUDA
I really like the idea of using a graphics card (a massively parallel computer) to do some of the work.
There are many other raytracing related resources on this site, some of which are listed in the sidebar of this question, most of which can be found in the raytracing tag.
I don't know D at all, so I'm not able to look at the code and find specific optimizations, but I can speak generally.
It really depends on your requirements. One of the simplest optimizations is just to reduce the number of reflections/refractions that any particular ray can follow, but then you start to lose out on the "perfect result".
Raytracing is also an "embarrassingly parallel" problem, so if you have the resources (such as a multi-core processor), you could look into calculating multiple pixels in parallel.
Beyond that, you'll probably just have to profile and figure out what exactly is taking so long, then try to optimize that. Is it the intersection detection? Then work on optimizing the code for that, and so on.
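A minimal sketch of exploiting that embarrassingly parallel structure (tracePixel is an assumed callback): rows are interleaved across hardware threads, and since no pixel depends on another, no locking is needed.

```cpp
#include <algorithm>
#include <thread>
#include <vector>

void renderParallel(int width, int height, void (*tracePixel)(int x, int y)) {
    unsigned n = std::max(1u, std::thread::hardware_concurrency());
    std::vector<std::thread> pool;
    for (unsigned t = 0; t < n; ++t)
        pool.emplace_back([=] {
            // Interleaved rows give each thread a similar workload.
            for (int y = int(t); y < height; y += int(n))
                for (int x = 0; x < width; ++x)
                    tracePixel(x, y);
        });
    for (auto& th : pool) th.join();
}
```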
Some suggestions.
Use bounding objects to fail fast.
Project the scene at a first step (as common graphic cards do) and use raytracing only for light calculations.
Parallelize the code.
Raytrace every other pixel and get the color in between by interpolation. If the colors vary greatly (you are on the edge of an object), raytrace the pixel in between as well. This is cheating, but on simple scenes it can almost double the performance while sacrificing some image quality (see the sketch after this list).
Render the scene on GPU, then load it back. This will give you the first ray/scene hit at GPU speeds. If you do not have many reflective surfaces in the scene, this would reduce most of your work to plain old rendering. Rendering CSG on GPU is unfortunately not completely straightforward.
Read source code of PovRay for inspiration. :)
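A grayscale sketch of the every-other-pixel suggestion above (trace() and the threshold are assumptions): trace the checkerboard, interpolate the gaps, and re-trace only where neighbours disagree.

```cpp
#include <cmath>

// img is a w*h grayscale buffer; trace() fires a real ray.
void renderSubsampled(int w, int h, float* img,
                      float (*trace)(int x, int y), float threshold = 0.05f)
{
    // Pass 1: trace every other pixel in a checkerboard pattern.
    for (int y = 0; y < h; ++y)
        for (int x = 0; x < w; ++x)
            if ((x + y) % 2 == 0) img[y * w + x] = trace(x, y);

    // Pass 2: fill the gaps. Strongly disagreeing neighbours mean
    // an edge, so those pixels get a real ray after all.
    for (int y = 0; y < h; ++y)
        for (int x = 0; x < w; ++x) {
            if ((x + y) % 2 == 0) continue;
            float left  = img[y * w + (x > 0     ? x - 1 : x + 1)];
            float right = img[y * w + (x < w - 1 ? x + 1 : x - 1)];
            img[y * w + x] = std::fabs(left - right) > threshold
                           ? trace(x, y)               // edge: real ray
                           : 0.5f * (left + right);    // flat: cheap blend
        }
}
```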
You first have to make sure that you use very fast algorithms (implementing them can be a real pain; what you want to do, how far you want to go and how fast it should be form a kind of trade-off).
Some more hints from me:
- Don't use mailboxing techniques; papers sometimes discuss that they don't scale that well on current architectures because of the counting overhead.
- Don't use BSP trees/octrees; they are relatively slow.
- Don't use the GPU for raytracing; it is far too slow for advanced effects like reflection, shadows, refraction, photon mapping and so on (I use it only for shading, but that's my own preference).
For a completely static scene, kd-trees are unbeatable, and for dynamic scenes there are clever algorithms that scale very well on a quad-core (I am not sure about performance beyond that).
And of course, for really good performance you need to use a lot of SSE code (with, of course, not too many jumps), but for "not that good" performance (I'm talking maybe 10-15% here) compiler intrinsics are enough to implement your SSE stuff.
And some decent papers about the algorithms I was talking about:
"Fast Ray/Axis-Aligned Bounding Box Overlap Tests using Ray Slopes"
(a very fast, very well parallelizable (SSE) AABB/ray hit test) (Note: the code in the paper isn't the complete code; just google for the title of the paper and you'll find it.)
http://graphics.tu-bs.de/publications/Eisemann07RS.pdf
"Ray Tracing Deformable Scenes using Dynamic Bounding Volume Hierarchies"
http://www.sci.utah.edu/~wald/Publications/2007///BVH/download//togbvh.pdf
If you know how the above algorithm works, then this is an even greater algorithm:
"The Use of Precomputed Triangle Clusters for Accelerated Ray Tracing in Dynamic Scenes"
http://garanzha.com/Documents/UPTC-ART-DS-8-600dpi.pdf
I'm also using the Plücker test to quickly determine (it's not that accurate, but well, you can't have everything) whether I hit a polygon; it works very nicely with SSE and above.
So my conclusion is that there are so many great papers out there about so many topics related to raytracing (how to build fast, efficient trees, how to shade (BRDF models), and so on) that it is a really amazing and interesting field to experiment in, but you also need a lot of spare time, because it is so damn complicated, yet fun.
My first question is: are you trying to optimize the tracing of one single still image, or is this about optimizing the tracing of multiple frames in order to render an animation?
Optimizing for a single shot is one thing; if you want to calculate successive frames of an animation, there are lots of new things to think about and optimize.
You could
use an SAH-optimized bounding volume hierarchy...
...possibly using packet traversal,
introduce importance sampling,
access the tiles ordered by Morton code for better cache coherency, and
much more - but those were the suggestions I could immediately think of. In more words:
You can build an optimized hierarchy based on statistics in order to quickly identify candidate nodes when intersecting geometry. In your case, you'll have to combine the automatic hierarchy with the modeling hierarchy; that is, either constrain the build or have it clone the modeling information as needed.
"Packet traversal" means you use SIMD instructions to compute 4 parallel scalars, each of an own ray for traversing the hierarchy (which is typically the hot spot) in order to squeeze the most performance out of the hardware.
You can perform some per-ray-statistics in order to control the sampling rate (number of secondary rays shot) based on the contribution to the resulting pixel color.
Using a space-filling curve on the tile allows you to decrease the average distance in space between successively traced pixels, and thus increase the probability that your performance benefits from cache hits.
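For reference, the Morton (Z-order) index mentioned above is just bit interleaving; a small sketch:

```cpp
#include <cstdint>

// Spread the lower 16 bits of v so that bit i lands at bit 2i.
uint32_t part1By1(uint32_t v) {
    v &= 0x0000ffff;
    v = (v | (v << 8)) & 0x00ff00ff;
    v = (v | (v << 4)) & 0x0f0f0f0f;
    v = (v | (v << 2)) & 0x33333333;
    v = (v | (v << 1)) & 0x55555555;
    return v;
}

// Tiles sorted by this key are visited in Z-order, so tiles that are
// close on screen tend to be close in memory and in time.
uint32_t mortonKey(uint32_t tileX, uint32_t tileY) {
    return part1By1(tileX) | (part1By1(tileY) << 1);
}
```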