GPU and determinism - gpgpu

I was thinking of off-loading some math operations to the GPU. As I'm already using D3D11, I'd use a compute shader to do the work. But the thing is, I need the results to be the same for the same input, no matter what GPU the user might have (the only requirement being that it supports compute shader 4.0).
So is floating point math deterministic on GPUs?
If not, do GPUs support integer math?

I haven't used DirectCompute, only OpenCL.
GPUs definitely support integer math, both 32-bit and 64-bit integers. A couple of questions already have this discussion:
Integer Calculations on GPU
Performance of integer and bitwise operations on GPU
Basically, on modern GPUs 32-bit float and integer operations are equivalent in performance.
As for deterministic results, it depends on your code. For example, if you rely on multiple threads performing atomic operations on the same memory then reading that memory from other threads and performing operations depending on that value, then results may not be exactly the same every time.
From personal experience, I needed to generate random numbers but also required consistent results. So basically I had a largish array of seeds, one for each thread, and each one was completely independent. Other random number generators which rely on atomic operations and barriers would not have been deterministic.
The other half of having deterministic results is having the same result given different hardware. With integer operations you should be fairly safe. With floating point operations in OpenCL, avoiding the fast relaxed math option and the native variants of functions would increase chances of getting the same results on different hardware.
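To illustrate the per-thread seed approach (a minimal C++ sketch of the idea, not the actual kernel code; the xorshift32 generator and the names here are my own choice): each thread owns one slot in the seed array and never reads or writes another thread's state, so the output cannot depend on scheduling order and uses only integer math.

    #include <cstddef>
    #include <cstdint>
    #include <vector>

    // One independent generator per thread: no atomics, no barriers, so the
    // result never depends on how the hardware schedules the threads.
    struct PerThreadRng {
        uint32_t state;                                // this thread's seed slot
        explicit PerThreadRng(uint32_t seed) : state(seed ? seed : 1u) {}

        // xorshift32: small, pure integer math, identical on any hardware.
        uint32_t next() {
            state ^= state << 13;
            state ^= state >> 17;
            state ^= state << 5;
            return state;
        }
    };

    // Host-side emulation of the kernel: "thread" i reads seeds[i], produces
    // its value, and writes back only to index i.
    void runThreads(std::vector<uint32_t>& seeds, std::vector<uint32_t>& out) {
        for (std::size_t i = 0; i < seeds.size(); ++i) {
            PerThreadRng rng(seeds[i]);
            out[i] = rng.next();
            seeds[i] = rng.state;                      // persist for the next pass
        }
    }

The same structure ports directly to a compute shader: the loop body becomes the kernel and the loop index becomes the thread ID.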

Related

How hard is it to implement the following operations/functions in VHDL for an FPGA?

As a firmware engineer, how would you compare the computation cost (or the amount of resources needed) of the following operations?
addition/subtraction
multiplication
division
trigonometric functions such as cosine
square root
Assume 32-bit floating-point calculations.
Answer from a hardware (not firmware) point of view: unfortunately there is no simple answer to your question. Each function you listed has many different hardware implementations, usually ranging between small-and-slow and large-and-fast. Moreover, it depends on your target FPGA, because some of them embed these functions as hard macros, so the question is no longer what they cost, but whether you have enough of them in your FPGA.
As a partial answer you can take this: with the most straightforward combinatorial implementation, not using any hard macro, an integer or fixed-point N-bit adder/subtractor costs O(N), while an N x N-bit multiplier costs O(N*N). Roughly.
For the other functions it is extremely difficult to answer; there are far too many variants to consider (fixed/floating point, latency, throughput, accuracy, range...). Assuming you make similar choices for all of them, I would say: add < mul < div < sqrt < trig. But honestly, do not take this for granted. If you are working with floating-point numbers, for instance, the adder might be close to or even larger than the multiplier because it requires mantissa alignment, that is, a barrel shifter.
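To make the rough scaling concrete with a software analogy (my own illustration, not HDL): a combinatorial N x N multiplier is essentially the shift-and-add algorithm unrolled in hardware, so it pays for up to N partial products, each folded in by an N-bit adder.

    #include <cstdint>

    // Functional model of shift-and-add multiplication for N = 32.
    // In hardware, each loop iteration corresponds to one row of the array:
    // N rows, each an O(N) addition, hence roughly O(N*N) cells overall,
    // versus a single O(N) ripple-carry chain for an adder.
    uint64_t shiftAndAddMultiply(uint32_t a, uint32_t b) {
        uint64_t acc = 0;
        for (int i = 0; i < 32; ++i) {                     // N partial products
            if ((b >> i) & 1u)
                acc += static_cast<uint64_t>(a) << i;      // each an N-bit add
        }
        return acc;
    }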
A firmware engineer with a little hardware design knowledge would likely use premade cores for each of those functions. Xilinx, Altera, and others provide parameterizable cores that can perform most of the functions that you specified. The parameters for the core will likely include input size and the number of cycles taken for the specified operation, among many others. The resources used by each of these cores will vary significantly based on the configuration and also on which vendor and FPGA family the core is being created for (for instance Altera Cyclone V or Xilinx Artix-7). The latency of the cores will need to be a factor in your decision as well; higher-speed FPGA families can operate at faster clock rates, decreasing the time that operations take.
The way to get real estimates is to install the vendor's tools (for instance Xilinx's Vivado or Altera's Quartus) and use them to generate each of the types of cores required. You will need to pick an exact FPGA part in the family that you are evaluating; choose one with many pins to avoid running out of I/O on the FPGA. Then synthesize and implement a design for each of the cores you are evaluating. The simplest design will contain just the core under evaluation, with its inputs and outputs mapped to the FPGA's I/O pins. The implementation results will give you the number of resources the design takes and also the maximum clock frequency it can run at.

What operations does FLOPS include?

FLOPS stands for FLoating-point Operations Per Second, and I have some idea what floating point is. I want to know what these operations are. Are +, -, *, / the only operations, or do operations like logarithm() and exponential() also count as floating-point operations?
Do + and * of two floats take the same time? And if they take different times, what interpretation should I draw from a statement like "performance is 100 FLOPS"? How many + and * are there in one second?
I am not a computer science guy, so kindly try to be less technical. Also let me know if I have understood it completely wrong.
Thanks
There is no specific set of operations that is included in FLOPS; it's measured using whatever operations each processor supports as a single instruction. The basic arithmetic operations are generally supported, but operations like logarithms are calculated using a series of simpler operations.
For modern computers, all the supported floating-point operations generally execute at a throughput of one per clock cycle or better. Even if the complexity differs a bit between operations, getting the data in and out of the processor is usually the bigger bottleneck.
The reason FLOPS is still a useful measure of computing speed is that CPUs are not specialized in floating-point calculations. Adding more floating-point units to the CPU would drive up the FLOPS, but there is no big market for CPUs that are only good at that.
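If you want to see what this means on your own machine, a rough micro-benchmark such as the sketch below (my own example in C++, timing one dependent multiply-add chain) gives an order-of-magnitude estimate; real benchmarks like LINPACK measure FLOPS far more carefully, with many independent operations in flight.

    #include <chrono>
    #include <cstdio>

    // Very rough FLOPS estimate: time a long chain of float multiplies and adds.
    int main() {
        const long iters = 200000000L;
        float a = 1.000001f, b = 0.999999f, x = 1.0f;

        auto t0 = std::chrono::steady_clock::now();
        for (long i = 0; i < iters; ++i)
            x = x * a + b;                       // one multiply + one add
        auto t1 = std::chrono::steady_clock::now();

        volatile float sink = x;                 // keep the loop from being optimized away
        (void)sink;

        double secs = std::chrono::duration<double>(t1 - t0).count();
        std::printf("~%.2f GFLOPS (single dependent chain)\n",
                    2.0 * iters / secs / 1e9);   // two FLOPs per iteration
        return 0;
    }

Because every iteration depends on the previous one, this measures latency rather than peak throughput, which is exactly why a processor's quoted FLOPS is much higher than what one serial chain achieves.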

What is the difference between a floating point merge and an integer merge?

In this paper, two cases have been considered for comparing algorithms - integers and floating points.
I understand the differences between these data types in terms of storage, but am not sure why there should be a performance difference between them.
Why is there a difference in performance between the following two cases?
Using merge sort on Integers
Using merge sort on Floating Points
I understand that it comes down to a speed comparison in both cases; the question is why these speeds might be different.
The paper states, in section 4, “Conclusion”, “the execution time for merging integers on the CPU is 2.5X faster than the execution time for floating point on the CPU”. This large a difference is surprising on the Intel Nehalem Xeon E5530 used in the measurements. However, the paper does not give information about source code, specific instructions or processor features used in the merge, compiler version, or other tools used. If the processor is used efficiently, there should be only very minor differences in the performance of an integer merge versus a floating-point merge. Thus, it seems likely that the floating-point code used in the test was inefficient and is an indicator of poor tools used rather than any shortcoming of the processor.
Merge sort has an inner loop with quite a few instructions. Comparing floats might be a little more expensive, but only by 1-2 cycles. You will not notice that difference amid the much larger amount of merge code.
Comparing floats is hardware accelerated and fast compared to everything else you are doing in that algorithm.
Also, the comparison likely can overlap other instructions so the difference in wall-clock time might be exactly zero (or not).
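To see why the element type should barely matter, note that the merge inner loop is type-agnostic; a sketch like the following templated C++ merge (my own example, not the paper's code) compiles to the same loads, stores, and branches for int and float, differing only in the comparison instruction used.

    #include <cstddef>

    // The merge inner loop does the same loads, branch, and stores for any
    // element type; only the comparison instruction differs (an integer
    // compare vs. a floating-point compare such as ucomiss on x86).
    template <typename T>
    void mergeRuns(const T* a, std::size_t na,
                   const T* b, std::size_t nb, T* out) {
        std::size_t i = 0, j = 0, k = 0;
        while (i < na && j < nb)
            out[k++] = (a[i] <= b[j]) ? a[i++] : b[j++];
        while (i < na) out[k++] = a[i++];
        while (j < nb) out[k++] = b[j++];
    }

    // Instantiated for both element types the paper compares.
    template void mergeRuns<int>(const int*, std::size_t,
                                 const int*, std::size_t, int*);
    template void mergeRuns<float>(const float*, std::size_t,
                                   const float*, std::size_t, float*);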

Why is Bresenham's line algorithm more efficient than the naive algorithm?

For my graphics course we were taught the naive line rasterizing algorithm and then Bresenham's line drawing algorithm. We were told computers are integer machines, which is why we should use the latter.
If we assume no optimization at the software level, is this true for modern CPUs with MMX and other instruction sets? Looking at Intel's 64-ia-32-architectures-optimization-manual.pdf, the latency for addition, subtraction, and multiplication is the same or better for float than for int with MMX.
If the algorithm is executed on the GPU, should this matter? According to the NVIDIA CUDA Programming Guide 1.0 (pdf), page 41, the clock cycles for int and float are the same.
How inefficient is casting float to int? Is a load-hit-store stall a real issue for us?
How efficient are the functions that round numbers up/down? (Think of the implementations in the C++ standard library.)
Is the efficiency gained from Bresenham's algorithm due to using addition instead of multiplication in the inner loop?
It's a little misleading to call computers integer machines, but the sentiment is mostly true: as far as I know, CPUs use integer registers to generate the memory addresses they read from and write to. Keeping line drawing in integer registers means you avoid the overhead of copying from other register sets into integer registers to generate the memory addresses to write pixels to during line drawing.
As for your specific questions:
Since you need to use the general-purpose registers to access memory, using SSE or the FPU to calculate memory offsets (pointers) will still have the overhead of transferring data from those registers to the general-purpose ones. So it depends on whether the overhead of transferring from one register set to another outweighs whatever performance you gain from a particular instruction set.
GPUs tend to have a unified register set, so it shouldn't matter nearly as much.
Casting a float to an int in and of itself is not expensive. The overhead comes from transferring data from one register set to another. Usually that has to be done through memory, and if your CPU has load-hit-store penalties, this transfer is a big source of them.
The performance of rounding up or down depends on the CPU and the compiler. On the slow end, MSVC used to use a function to round to zero which mucked with the FPU control word. On the fast end you have special CPU instructions that handle rounding directly.
Bresenham's line drawing algorithm is fast because it reduces determining where to draw points on a line from the naive y = m*x + b formula to an addition plus a branch (and the branch can be eliminated through well-known branchless integer techniques). The run-slice version of Bresenham's line drawing algorithm can be even faster, as it determines "runs" of pixels with the same component directly rather than iterating.
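For reference, here is a standard all-octant integer Bresenham (a common textbook formulation, sketched in C++ with a hypothetical setPixel callback): the inner loop contains only additions, comparisons, and a branch; the doubling of the error term is just a shift, and there is no floating point at all.

    #include <cstdlib>      // std::abs
    #include <functional>

    // Integer-only Bresenham line drawing. setPixel stands in for whatever
    // the framebuffer write actually is.
    void drawLine(int x0, int y0, int x1, int y1,
                  const std::function<void(int, int)>& setPixel) {
        int dx =  std::abs(x1 - x0), sx = x0 < x1 ? 1 : -1;
        int dy = -std::abs(y1 - y0), sy = y0 < y1 ? 1 : -1;
        int err = dx + dy;              // error term replaces y = m*x + b

        while (true) {
            setPixel(x0, y0);
            if (x0 == x1 && y0 == y1) break;
            int e2 = 2 * err;                        // just a shift in practice
            if (e2 >= dy) { err += dy; x0 += sx; }   // step in x
            if (e2 <= dx) { err += dx; y0 += sy; }   // step in y
        }
    }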

Is trigonometry computationally expensive?

I read in an article somewhere that trig calculations are generally expensive. Is this true? And if so, that's why they use trig-lookup tables right?
EDIT: Hmm, so if the only thing that changes is the angle in degrees (accurate to 1 degree), would a lookup table with 360 entries (one for every degree) be faster?
Expensive is a relative term.
The mathematical operations that will perform fastest are those that can be performed directly by your processor. Certainly integer add and subtract will be among them. Depending upon the processor, there may be multiplication and division as well. Sometimes the processor (or a co-processor) can handle floating point operations natively.
More complicated things (e.g. square root) require a series of these low-level calculations to be performed. These are usually accomplished using math libraries (written on top of the native operations your processor can perform).
All of this happens very very fast these days, so "expensive" depends on how much of it you need to do, and how quickly you need it to happen.
If you're writing real-time 3D rendering software, then you may need to use lots of clever math tricks and shortcuts to squeeze every bit of speed out of your environment.
If you're working on typical business applications, odds are that the mathematical calculations you're doing won't contribute significantly to the overall performance of your system.
On the Intel x86 processor, floating point addition or subtraction requires 6 clock cycles, multiplication requires 8 clock cycles, and division 30-44 clock cycles. But cosine requires between 180 and 280 clock cycles.
It's still very fast, since the x86 does these things in hardware, but it's much slower than the more basic math functions.
Since sin(), cos() and tan() are mathematical functions that are calculated by summing a series, developers will sometimes use lookup tables to avoid the expensive calculation.
The tradeoff is in accuracy and memory. The greater the need for accuracy, the greater the amount of memory required for the lookup table.
Take a look at the following table accurate to 1 degree.
http://www.analyzemath.com/trigonometry/trig_1.gif
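A 360-entry table like the one the question asks about is only a few lines of code. The sketch below (my own illustration in C++) trades the full sin() evaluation for an index wrap and an array load, at the cost of 1-degree accuracy and about 1.4 KB of memory.

    #include <array>
    #include <cmath>

    // 1-degree sine lookup table: 360 floats (~1.4 KB). Accuracy is limited
    // to whole degrees; interpolate between entries if you need better.
    struct SinTable {
        std::array<float, 360> table;

        SinTable() {
            for (int deg = 0; deg < 360; ++deg)
                table[deg] = static_cast<float>(
                    std::sin(deg * 3.14159265358979323846 / 180.0));
        }

        float sinDeg(int degrees) const {
            int d = degrees % 360;
            if (d < 0) d += 360;          // wrap negative angles into [0, 360)
            return table[d];
        }

        // Cosine for free via the phase shift cos(x) = sin(x + 90 degrees).
        float cosDeg(int degrees) const { return sinDeg(degrees + 90); }
    };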
While the quick answer is that they are more expensive than the primitive math functions (addition/multiplication/subtraction etc.), they are not -expensive- in terms of human time. Typically the reason people optimize them with lookup tables and approximations is that they are calling them potentially tens of thousands of times per second, and every microsecond could be valuable.
If you're writing a program and just need to call it a couple times a second the built-in functions are fast enough by far.
I would recommend writing a test program and timing them for yourself. Yes, they're slow compared to plus and minus, but they're still single processor instructions. It's unlikely to be an issue unless you're doing a very tight loop with millions of iterations.
Yes, (relative to other mathematical operations multiply, divide): if you're doing something realtime (matrix ops, video games, whatever), you can knock off lots of cycles by moving your trig calculations out of your inner loop.
If you're not doing something realtime, then no, they're not expensive (relative to operations such as reading a bunch of data from disk, generating a webpage, etc.). Trig ops are hopefully done in hardware by your CPU (which can do billions of floating point operations per second).
If you always know the angles you are computing, you can store the results in a variable instead of calculating them every time. This also applies within a method/function where the angle is not going to change. You can be smart by using some formulas (computing sin(theta) from sin(theta/2), knowing that the values repeat: sin(theta + 2*pi*n) = sin(theta)) and reducing the amount of computation. See this Wikipedia article.
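A common version of that trick is hoisting the trig calls out of a loop when the angle does not change, for example when rotating many points by the same fixed angle (an illustrative C++ sketch with names of my own choosing):

    #include <cmath>
    #include <vector>

    struct Point { float x, y; };

    // Rotate every point by the same angle: sin/cos are computed once,
    // outside the loop, instead of once per point.
    void rotateAll(std::vector<Point>& pts, float angleRadians) {
        const float s = std::sin(angleRadians);   // hoisted out of the loop
        const float c = std::cos(angleRadians);
        for (Point& p : pts) {
            const float x = p.x * c - p.y * s;    // only mul/add per point
            const float y = p.x * s + p.y * c;
            p.x = x;
            p.y = y;
        }
    }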
Yes, it is. Trig functions are computed by summing up a series, so in general terms they are a lot more costly than a simple mathematical operation. The same goes for sqrt.

Resources