How does SIMD behave in this case?

I am using an engine that lets me write SIMD code, and it performs well. But all of the code lives in a single block.
I understand that this code is run independently on each entity concurrently, but when only one value actually changes, is it still faster to calculate it for every entity regardless? Is that the idea behind SIMD parallelism?
For instance:
void simdFunction()
{
    center = mesh.center(); // always the same
    vert.pos.x = center.x;  // run on each vertex
}
In this case, the center is always the same, so will it be calculated for each vertex on SIMD? If so, is this still efficient?
Basically, does being able to run this in parallel outweigh the cost of calculating it redundantly, in the general SIMD programming sense?

this code is run independently on each entity concurrently
No, that's not how SIMD works.
With SIMD, all arithmetic units are working in lock-step, performing identical operations. There's no independence whatsoever.
Generally though, you're better off computing shared constants just once, in sequential code. That way the SIMD engine will spend less time on each slice of vertices.
The exception would be if the computation is short, the SIMD is on a co-processor (like GPGPU), and the data is already on that co-processor. Then computing it with SIMD might easily beat moving the data to the sequential processor and back.
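As a minimal sketch of that advice (hypothetical names, assuming SSE intrinsics and a flat array of x coordinates), the shared value is computed once in scalar code and then broadcast across the SIMD lanes:

#include <cstddef>
#include <xmmintrin.h> // SSE

struct Vec3 { float x, y, z; };
struct Mesh { Vec3 center() const { return {0.0f, 0.0f, 0.0f}; } }; // stand-in

void setVerticesToCenterX(float* xs, size_t count, const Mesh& mesh)
{
    float cx = mesh.center().x;        // computed once, sequentially
    __m128 center = _mm_set1_ps(cx);   // broadcast to all four lanes
    size_t i = 0;
    for (; i + 4 <= count; i += 4)
        _mm_storeu_ps(&xs[i], center); // four vertices per iteration
    for (; i < count; ++i)
        xs[i] = cx;                    // scalar tail
}

The hoisted version does the same per-vertex work as the original, minus the redundant center() calls.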

Related

Reflection.Emit Performance

Here's a simple question.
Let's say we want to unroll a looping method such as:
public int DoSum1(int n)
{
    int result = 0;
    for (int i = 1; i <= n; i++)
    {
        result += i;
    }
    return result;
}
Into a method performing simple additions only:
public int DoSum2()
{
    return 1+2+3+4+5+6+7+8+9+10+11+12+13+14+15+16+17+18+19+20;
}
http://etutorials.org/Programming/Programming+C.Sharp/Part+III+The+CLR+and+the+.NET+Framework/Chapter+18.+Attributes+and+Reflection/18.3+Reflection+Emit/
Logically, we're going to need code to create DoSum2 in IL at some point.
In this IL-generation code we will perform an actual loop with the same iteration count as the unoptimized method.
What's the point of creating a super-fast dynamic method if the code required to generate it takes a similar amount of time to execute?
Perhaps you can give an example of when it is worth using Emit in a case like this?
What's the point of creating a super-fast dynamic method if the code required to generate it takes a similar amount of time to execute
This isn't really specific to Reflection.Emit, but to runtime code generation in general, so I will answer accordingly.
First, I do not recommend using code generation simply to perform micro-optimizations that compilers normally perform, like loop unrolling. Let the JIT compiler do its job.
Second, you are right in that there is usually little point in generating code that will only execute once. The time required to emit and JIT compile the IL is not insubstantial. You should only bother generating code if it will be executed many times.
Now, there definitely are cases where runtime code generation can prove beneficial. In fact, it's a technique I leverage heavily. I work in an electronic trading environment where it is necessary to process very high volumes of dynamic data. This introduces several concerns, the most significant being memory usage and throughput.
Our trading application needs to keep a lot of data in memory, so the footprint of each record is critical. Dynamic data structures like maps/dictionaries are less efficient than "POCO" classes with optimized field layouts and, depending on the design, may require boxing some values. I avoid this overhead by generating client-side storage classes once the shape of the data is known. In effect, the memory layout is as it would have been had I known the shape of the data at compile time.
Throughput is a major issue as well; (de)serializing dynamic data often involves some additional introspection and extra layers of indirection. Need to serialize a record? OK, first you need to query what the fields are. Then, for each field, you need to determine its type, then select a serializer for that type, and then invoke the serializer. If your data structure has optional fields, you may need to do some additional pre-processing, like figuring out the size of a presence map, and which bits in the presence map correspond to which fields. If you need to process a ton of data, all that overhead becomes a real problem. I avoid this overhead by generating specialized (de)serializers on both the server side and client side. Since the serializers are generated on demand, they can know the exact shape of the data, and read/write that data as efficiently as a hand-optimized serializer. When you have a high volume of data updating at very high frequencies, this can make a huge difference.
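The code generation described here is C# Reflection.Emit, but the performance intuition can be sketched without it. In C++ terms (hypothetical types throughout), the generic path introspects the shape of each record on every call, while the specialized path, which is what generated code effectively gives you, has the layout baked in:

#include <cstdint>
#include <cstring>
#include <vector>

// Generic path: dispatch per field, per record, on every call.
enum class FieldType { I32, F64 };
struct FieldDesc { FieldType type; size_t offset; };

size_t serializeGeneric(const void* rec, const std::vector<FieldDesc>& fields, uint8_t* out)
{
    size_t n = 0;
    for (const FieldDesc& f : fields) {
        size_t sz = (f.type == FieldType::I32) ? 4 : 8;
        std::memcpy(out + n, static_cast<const uint8_t*>(rec) + f.offset, sz);
        n += sz;
    }
    return n;
}

// Specialized path: the shape is known, so the writes are direct.
struct Quote { int32_t id; double bid; double ask; };

size_t serializeQuote(const Quote& q, uint8_t* out)
{
    std::memcpy(out,      &q.id,  4);
    std::memcpy(out + 4,  &q.bid, 8);
    std::memcpy(out + 12, &q.ask, 8);
    return 20;
}

The generated serializer is the runtime analogue of serializeQuote: emitted once the shape of the data is known, then reused for every record.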
Now, keep in mind that we're something of an edge case. Most applications do not have the aggressive memory and throughput requirements that ours has, so runtime code generation isn't necessary. You should only go that route if you really need it, and you have exhausted all other possibilities. Although it can help with performance, generated code can be very difficult to debug and maintain.

GPU and determinism

I was thinking of off-loading some math operations to the GPU. As I'm already using D3D11, I'd use a compute shader to do the work. But the thing is I need the results to be the same for the same input, no matter what GPU the user might have (the only requirement being that it supports compute shader 4.0).
So is floating point math deterministic on GPUs?
If not, do GPUs support integer math?
I haven't used DirectCompute, only OpenCL.
GPUs definitely support integer math, both 32-bit and 64-bit integers. A couple of questions already have this discussion:
Integer Calculations on GPU
Performance of integer and bitwise operations on GPU
Basically, on modern GPUs 32-bit float and integer operations are equivalent in performance.
As for deterministic results, it depends on your code. For example, if you rely on multiple threads performing atomic operations on the same memory then reading that memory from other threads and performing operations depending on that value, then results may not be exactly the same every time.
From personal experience, I needed to generate random numbers but also required consistent results. So basically I had a largish array of seeds, one for each thread, and each thread was completely independent. Random number generators that rely on atomic operations and barriers would not have been deterministic.
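A minimal sketch of that seeds-per-thread approach (hypothetical; written here as a CUDA kernel, though the answer's own work was in OpenCL): each thread owns one xorshift state, touches no shared state, and therefore produces the same sequence regardless of scheduling.

__device__ unsigned int xorshift32(unsigned int& state)
{
    // Integer-only generator: bit-exact on any conformant hardware.
    state ^= state << 13;
    state ^= state >> 17;
    state ^= state << 5;
    return state;
}

__global__ void fillRandom(unsigned int* seeds, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    unsigned int s = seeds[i];               // one independent, nonzero seed per thread
    out[i] = xorshift32(s) / 4294967296.0f;  // no atomics, no cross-thread reads
    seeds[i] = s;                            // persist state for the next launch
}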
The other half of having deterministic results is getting the same result on different hardware. With integer operations you should be fairly safe. With floating point operations in OpenCL, avoiding the fast-relaxed-math option and the native variants of functions increases the chances of getting the same results on different hardware.

A couple of CUDA-performance questions

This is the first time I've asked a question here, so thanks very much in advance and please forgive my ignorance. Also, I've only just started CUDA programming.
Basically, I have a bunch of points, and I want to calculate all the pair-wise distances. Currently my kernel function just holds one point and iteratively reads in all the other points (from global memory) to do the calculation. Here are some of my confusions:
I'm using a Tesla M2050 with 448 cores, but my current parallel version (kernel<<<128,16,16>>>) achieves much higher parallelism than the core count would suggest (about 600x faster than kernel<<<1,1,1>>>). Is this due to multithreading or pipelining, or do those actually mean the same thing?
I want to improve performance further, so I figured I would use shared memory to hold some input points for each block. But the new code is just as fast. What's the possible cause? Could it be related to the fact that I set too many threads?
Or is it because I have an if-statement in the code? The thing is, I only consider and count the short distances, so I have a statement like if (dist < 200). How much should I worry about this one?
A million thanks!
Bin
Mark Harris has a very good presentation about optimizing CUDA: Optimizing Parallel Reduction in CUDA.
Algorithmic optimizations:
- changes to addressing, algorithm cascading
- 11.84x speedup, combined!

Code optimizations:
- loop unrolling
- 2.54x speedup, combined
Having an extra conditional statement does indeed cause problems, although it should be the last thing you optimize, if only because you need to know the final layout of your code before building in assumptions like that!
The problem you are working on sounds like the famous n-body problem; see Fast N-Body Simulation with CUDA.
An additional performance increase can be achieved if you can avoid doing the computation for every pair, for example when elements are too far apart to have an effect on each other. This applies to any relationship that can be expressed geometrically, whether it be pairwise costs or a physics simulation with springs. My favorite method is to divide the grid into boxes; each element puts itself into a box via division, and then you only evaluate pairwise relations between neighboring boxes. This can be called O(n*m), where m is the average number of elements in a neighborhood.
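A minimal sketch of that boxing step (hypothetical names; 2D, with a uniform cell size h at least as large as the interaction cutoff):

#include <cstdlib>

struct Cell { int x, y; };

// Each element computes its box via division (assumes non-negative
// coordinates; use floor() if positions can be negative).
Cell cellOf(float px, float py, float h)
{
    return { static_cast<int>(px / h), static_cast<int>(py / h) };
}

// With h >= the cutoff, any pair outside the 3x3 neighborhood can be skipped.
bool mightInteract(Cell a, Cell b)
{
    return std::abs(a.x - b.x) <= 1 && std::abs(a.y - b.y) <= 1;
}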
(1) The GPU runs many more threads in parallel than there are cores. This is because each core is pipelined. Operations take around 20 cycles on compute capability 2.0 (Fermi) architectures. So on each clock cycle, a core starts work on a new operation, returns the finished result of one operation, and moves all the other (around 18) operations one more step towards completion. So, to saturate the GPU, you might need something like 448 * 20 threads.
(2) It's probably because your values are getting cached in the L1 and L2 caches.
(3) It depends on how much work you're doing inside the if conditional. The GPU must run all 32 threads in a warp through all the code inside the if even if the condition is true for only one of those threads. If there is a lot of code in the conditional compared to the rest of your kernel, and relatively few threads go through that code path, it is likely that you will end up with low compute throughput.
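In the question's case the conditional body is tiny (counting dist < 200), so divergence should barely matter. Here is a sketch (a hypothetical kernel, one point per thread) that folds the branch into a predicated update and compares squared distances to avoid the sqrt entirely:

__global__ void countClose(const float2* pts, int n, int* counts)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float2 p = pts[i];
    int c = 0;
    for (int j = 0; j < n; ++j) {
        float dx = p.x - pts[j].x;
        float dy = p.y - pts[j].y;
        float d2 = dx * dx + dy * dy;
        // Predicated update: all 32 threads of a warp execute the same
        // instructions, so this condition causes no divergence.
        c += (d2 < 200.0f * 200.0f) ? 1 : 0;
    }
    counts[i] = c;  // note: includes the point itself (j == i)
}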

Can raymarching be accelerated under an SIMD architecture?

The answer would seem to be no, because raymarching is highly conditional, i.e. each ray follows a unique execution path, since at each step we check for opacity, termination, etc., all of which vary based on the direction of the individual ray.
So it would seem that SIMD would largely not be able to accelerate this; rather, MIMD would be required for acceleration.
Does this make sense? Or am I missing something(s)?
As stated already, you could probably get a speedup from implementing your vector math using SSE instructions (be aware of the effects discussed here, which also apply to the other approach). This approach would allow the code to stay concise and maintainable.
I assume, however, that your question is about "packet traversal" (or something like it), in other words processing multiple scalar values, each belonging to a different ray:
In principle it should be possible to defer the shading to another pass. The SIMD packet could be repopulated with a new ray once the bare marching pass terminates, with the temporary result stored as input for the shading pass. This will allow you to parallelize a certain, case-dependent percentage of your code, exploiting all four SIMD lanes.
Tiling the image and indexing the rays within it in Morton order might be a good idea too, in order to avoid cache pressure (unless your geometry is strictly procedural).
You won't know whether it pays off unless you try. My guess is that if it does, the amount of speedup might not be worth the complication of the code for just four lanes.
Have you considered using an SIMT architecture such as a programmable GPU? A somewhat up-to-date programmable graphics board allows you to perform raymarching at interactive rates (see it happen in your browser here).
Over the last few days I built a software-based raymarcher for a Menger sponge, at the moment without SIMD and without any special algorithm. I just trace from -1 to 1 in X and Y, which are U and V for the destination texture. Then I have a camera position and a destination, which I use to calculate the increment vector for the raymarch.
After that I run a constant number of iterations, in which only one branch decides whether there's an intersection with the fractal volume. So if my camera eye is E and my direction vector is D, I have to find the smallest t along E + t*D that hits the volume. If I find it, or reach a maximal distance, I break out of the loop. At the end I have t, and from that I calculate the fragment color.
In my opinion it should be possible to parallelize these operations with SSE1/2, because one can resolve the branch by null'ing the corresponding lane in the vector (__m64/__m128), so further SIMD operations simply won't apply to it. It really depends on what you raymarch/-cast, but if you just calculate a fragment color from a function (as with my fractal curve here) and don't access memory non-linearly, there are some tricks to make it possible.
Sure, this answer contains speculation, but I will keep you informed once I've parallelized this routine.
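A sketch of that lane-masking idea (assuming SSE intrinsics, four rays per packet, hypothetical names): compare all four distances at once, and use the resulting mask so retired lanes stop advancing while live ones continue.

#include <xmmintrin.h> // SSE

// March four rays in lock-step; 'active' masks out lanes that already hit.
void marchStep(__m128& t, __m128 dist, __m128 epsilon, __m128& active)
{
    __m128 hit  = _mm_cmplt_ps(dist, epsilon); // lanes whose ray just hit
    active      = _mm_andnot_ps(hit, active);  // retire those lanes
    __m128 step = _mm_and_ps(dist, active);    // zero the step for retired lanes
    t           = _mm_add_ps(t, step);         // live lanes advance by dist
}

Once _mm_movemask_ps(active) returns 0, all four rays are done and the marching loop can break.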
Only insofar as SSE, for instance, lets you do operations on vectors in parallel.

Is trigonometry computationally expensive?

I read in an article somewhere that trig calculations are generally expensive. Is this true? And if so, that's why they use trig-lookup tables right?
EDIT: Hmm, so if the only thing that changes is the degrees (accurate to 1 degree), would a look up table with 360 entries (for every angle) be faster?
Expensive is a relative term.
The mathematical operations that will perform fastest are those that can be performed directly by your processor. Certainly integer add and subtract will be among them. Depending upon the processor, there may be multiplication and division as well. Sometimes the processor (or a co-processor) can handle floating point operations natively.
More complicated things (e.g. square root) require a series of these low-level calculations to be performed. These things are usually accomplished using math libraries (written on top of the native operations your processor can perform).
All of this happens very very fast these days, so "expensive" depends on how much of it you need to do, and how quickly you need it to happen.
If you're writing real-time 3D rendering software, then you may need to use lots of clever math tricks and shortcuts to squeeze every bit of speed out of your environment.
If you're working on typical business applications, odds are that the mathematical calculations you're doing won't contribute significantly to the overall performance of your system.
On the Intel x86 processor, floating point addition or subtraction requires 6 clock cycles, multiplication requires 8 clock cycles, and division 30-44 clock cycles. But cosine requires between 180 and 280 clock cycles.
It's still very fast, since the x86 does these things in hardware, but it's much slower than the more basic math functions.
Since sin(), cos() and tan() are mathematical functions which are calculated by summing a series, developers will sometimes use lookup tables to avoid the expensive calculation.
The tradeoff is in accuracy and memory. The greater the need for accuracy, the greater the amount of memory required for the lookup table.
Take a look at the following table, accurate to 1 degree: http://www.analyzemath.com/trigonometry/trig_1.gif
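For the question's EDIT, a minimal sketch of such a 360-entry table (hypothetical names; built once, indexed by whole degrees):

#include <cmath>

static const double kPi = 3.14159265358979323846;
static float sinTable[360]; // one entry per degree; accuracy is limited to the table's resolution

void initSinTable()
{
    for (int d = 0; d < 360; ++d)
        sinTable[d] = static_cast<float>(std::sin(d * kPi / 180.0));
}

float fastSinDeg(int degrees)
{
    int d = degrees % 360;
    if (d < 0) d += 360;  // wrap negative angles into [0, 360)
    return sinTable[d];
}

Whether this beats the hardware instruction depends on the cache: a lookup that misses can cost more than the computation it replaces, so measure before committing.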
While the quick answer is that they are more expensive than the primitive math operations (addition, multiplication, subtraction, etc.), they are not "expensive" in terms of human time. Typically the reason people optimize them with lookup tables and approximations is that they are calling them potentially tens of thousands of times per second, and every microsecond could be valuable.
If you're writing a program and just need to call it a couple times a second the built-in functions are fast enough by far.
I would recommend writing a test program and timing them for yourself. Yes, they're slow compared to plus and minus, but they're still single processor instructions. It's unlikely to be an issue unless you're doing a very tight loop with millions of iterations.
Yes, (relative to other mathematical operations multiply, divide): if you're doing something realtime (matrix ops, video games, whatever), you can knock off lots of cycles by moving your trig calculations out of your inner loop.
If you're not doing something realtime, then no, they're not expensive (relative to operations such as reading a bunch of data from disk, generating a webpage, etc.). Trig ops are hopefully done in hardware by your CPU (which can do billions of floating point operations per second).
If you always know the angles you are computing, you can store the results in variables instead of calculating them every time. This also applies within your method/function where the angle is not going to change. You can be smart by using some identities (calculating sin(theta) from sin(theta/2), or knowing how often the values repeat: sin(theta + 2*pi*n) = sin(theta)) to reduce computation. See this Wikipedia article.
Yes, it is. Trig functions are computed by summing a series, so in general terms they are a lot more costly than a simple mathematical operation. The same goes for sqrt.
