What operations does FLOPS include?

FLOPS stands for FLoating-point Operations Per Second, and I have some idea of what floating point is. I want to know what these operations are. Are +, -, *, and / the only operations, or do operations like logarithm() and exponential() also count as FLOs?
Do + and * of two floats take the same time? And if they take different times, what interpretation should I draw from a statement like "performance is 100 FLOPS"? How many + and * operations happen in one second?
I am not a computer science guy, so kindly try to be less technical. Also let me know if I have understood it completely wrong.
Thanks

There is no specific set of operations included in FLOPS; it is measured using the operations that each processor supports as a single instruction. The basic arithmetic operations are generally supported, while operations like logarithms are calculated using a series of simpler operations.
On modern computers, the supported floating-point operations generally sustain a throughput of one or more per clock cycle. Even if the complexity differs a bit between operations, getting the data in and out of the processor is usually the bottleneck.
The reason FLOPS is still a useful measure of computing speed is that CPUs are not specialized for floating-point calculations. Adding more floating-point units to a CPU would drive up its FLOPS, but there is no big market for CPUs that are only good at that.


What is FLOPS in the field of deep learning?

What is FLOPS in the field of deep learning? Why don't we just use the term FLO?
We use the term FLOPS to measure the number of operations needed to run a frozen deep learning network.
According to Wikipedia, FLOPS = floating point operations per second. When we benchmark computing hardware, we should take time into account. But when measuring a deep learning network, how should I understand this notion of time? Shouldn't we just use the term FLO (floating point operations)?
Why do people use the term FLOPS? If there is anything I don't know, what is it?
==== attachment ===
The frozen deep learning networks I mentioned are just software; this is not about hardware. In the field of deep learning, people use the term FLOPS to measure how many operations are needed to run the network model. In this case, in my opinion, we should use the term FLO. I thought people were confused about the term FLOPS, and I want to know whether others think the same or whether I'm wrong.
Please look at these cases:
how to calculate a net's FLOPs in CNN
https://iq.opengenus.org/floating-point-operations-per-second-flops-of-machine-learning-models/
Confusingly, both FLOPs (floating point operations) and FLOPS (floating point operations per second) are used in reference to machine learning. FLOPs are often used to describe how many operations are required to run a single instance of a given model, like VGG19. This is the usage of FLOPs in both of the links you posted, though unfortunately the opengenus link mistakenly uses "floating point operations per second" to refer to FLOPs.
You will see FLOPS used to describe the computing power of given hardware like GPUs, which is useful when thinking about how powerful a piece of hardware is or, conversely, how long it may take to train a model on it.
Sometimes people write FLOPS when they mean FLOPs. It is usually clear from the context which one they mean.
I'm not sure my answer is 100% correct, but this is what I understand:
FLOPS = Floating point operations per second
FLOPs = Floating point operations
FLOPS is a unit of speed. FLOPs is a unit of amount.
What is FLOPS in the field of deep learning? Why don't we just use the term FLO?
FLOPS (Floating Point Operations Per Second) is the same in most fields: it's the (theoretical) maximum number of floating point operations per second that the hardware might, if you're extremely lucky, be capable of.
We don't use FLO because FLO would always be infinite (given an infinite amount of time, hardware can do an infinite number of floating point operations).
Note that one "floating point operation" is one multiplication, one division, one addition, and so on. Typically (for modern CPUs) FLOPS is calculated from repeated use of a "fused multiply then add" instruction, so that one instruction counts as 2 floating point operations. When combined with SIMD, a single instruction (doing 8 "multiply and add" operations in parallel) might count as 16 floating point operations. Of course this is a calculated theoretical value, ignoring things like memory accesses, branches, and IRQs, which is why theoretical FLOPS is almost never achievable in practice.
Why do people use the term FLOPS? If there is anything I don't know, what is it?
Primarily it's used to describe how powerful hardware is for marketing purposes (e.g. "Our new CPU is capable of 5 GFLOPS!").

GPU and determinism

I was thinking of off-loading some math operations to the GPU. As I'm already using D3D11, I'd use a compute shader to do the work. But the thing is, I need the results to be the same for the same input, no matter what GPU the user might have (the only requirement being that it supports compute shader 4.0).
So is floating point math deterministic on GPUs?
If not, do GPUs support integer math?
I haven't used DirectCompute, only OpenCL.
GPUs definitely support integer math, both 32-bit and 64-bit integers. A couple of questions already have this discussion:
Integer Calculations on GPU
Performance of integer and bitwise operations on GPU
Basically, on modern GPUs 32-bit float and integer operations are equivalent in performance.
As for deterministic results, it depends on your code. For example, if you rely on multiple threads performing atomic operations on the same memory then reading that memory from other threads and performing operations depending on that value, then results may not be exactly the same every time.
From personal experience, I needed to generate random numbers but also required consistent results. So basically I had a largish array of seeds, one for each thread, and each one was completely independent. Other random number generators which rely on atomic operations and barriers would not have been.
The other half of having deterministic results is having the same result given different hardware. With integer operations you should be fairly safe. With floating point operations in OpenCL, avoiding the fast relaxed math option and the native variants of functions would increase chances of getting the same results on different hardware.

What is the difference between a floating point merge and an integer merge?

In this paper, two cases are considered when comparing algorithms: integers and floating point numbers.
I understand the differences between these data types in terms of storage, but I am not sure why there would be a performance difference between them.
Why is there a difference in performance between the following two cases?
Using merge sort on integers
Using merge sort on floating point numbers
I understand that it comes down to a speed comparison in both cases; the question is why these speeds might be different.
The paper states, in section 4, “Conclusion”, “the execution time for merging integers on the CPU is 2.5X faster than the execution time for floating point on the CPU”. This large a difference is surprising on the Intel Nehalem Xeon E5530 used in the measurements. However, the paper does not give information about source code, specific instructions or processor features used in the merge, compiler version, or other tools used. If the processor is used efficiently, there should be only very minor differences in the performance of an integer merge versus a floating-point merge. Thus, it seems likely that the floating-point code used in the test was inefficient and is an indicator of poor tools used rather than any shortcoming of the processor.
Merge sort has an inner loop of quite a few instructions. Comparing floats might be a little more expensive, but only by 1-2 cycles; you will not notice that difference among the much larger amount of merge code.
Comparing floats is hardware accelerated and fast compared to everything else you are doing in that algorithm.
Also, the comparison can likely overlap other instructions, so the difference in wall-clock time might be exactly zero (or not).

Why is Bresenham's line algorithm more efficient than the naive algorithm?

For my graphics course we were taught the naive line rasterizing algorithm, then Bresenham's line drawing algorithm. We were told computers are integer machines, which is why we should use the latter.
If we assume no optimization at the software level, is this true for modern CPUs with MMX and other instruction sets? Looking at Intel's 64-ia-32-architectures-optimization-manual.pdf, the latency for addition, subtraction, and multiplication is the same or better for float compared to int under MMX.
If the algorithm is executed on the GPU, should this matter? Checking the NVIDIA CUDA Programming Guide 1.0 (pdf), page 41, the clock cycles for int and float are the same.
How inefficient is casting float to int? Is a load-hit-store stall a real issue for us?
How efficient are the functions that round numbers up or down? (We can think of the implementations in the C++ STL.)
Is the efficiency gained by Bresenham's algorithm due to using addition instead of the multiplication in the inner loop?
It's a little misleading to call computers integer machines, but the sentiment is mostly true: as far as I know, CPUs use integer registers to generate the memory addresses they read from and write to. Keeping line drawing in integer registers avoids the overhead of copying from other register sets into the integer registers that form the pixel addresses during line drawing.
As for your specific questions:
Since you need to use the general purpose registers to access memory, using SSE or the FPU to calculate memory offsets (pointers) will still have the overhead of transferring data from those registers to the general purpose ones. So it depends on whether the overhead of transferring from one register set to another is greater than the performance of using a particular instruction set.
GPUs tend to have a unified register set, so it shouldn't matter nearly as much.
Casting a float to an int in and of itself is not expensive. The overhead comes from transferring data from one register set to another. Usually that has to be done through memory, and if your CPU has load-hit-store penalties, this transfer is a big source of them.
The performance of rounding up or down depends on the CPU and the compiler. On the slow end, MSVC used to use a function to round to zero which mucked with the FPU control word. On the fast end you have special CPU instructions that handle rounding directly.
Bresenham's line drawing algorithm is fast because it reduces determining where to draw points on a line from the naive y = m*x + b formula to an addition plus a branch (and the branch can be eliminated through well-known branchless integer techniques). The run-slice version of Bresenham's line drawing algorithm can be even faster, as it determines "runs" of pixels with the same component directly rather than iterating.

Is trigonometry computationally expensive?

I read in an article somewhere that trig calculations are generally expensive. Is this true? And if so, that's why they use trig-lookup tables right?
EDIT: Hmm, so if the only thing that changes is the angle (accurate to 1 degree), would a lookup table with 360 entries (one for every degree) be faster?
Expensive is a relative term.
The mathematical operations that will perform fastest are those that can be performed directly by your processor. Certainly integer add and subtract will be among them. Depending upon the processor, there may be multiplication and division as well. Sometimes the processor (or a co-processor) can handle floating point operations natively.
More complicated things (e.g. square root) require a series of these low-level calculations to be performed. These are usually accomplished using math libraries (written on top of the native operations your processor can perform).
All of this happens very very fast these days, so "expensive" depends on how much of it you need to do, and how quickly you need it to happen.
If you're writing real-time 3D rendering software, then you may need to use lots of clever math tricks and shortcuts to squeeze every bit of speed out of your environment.
If you're working on typical business applications, odds are that the mathematical calculations you're doing won't contribute significantly to the overall performance of your system.
On a classic Intel x86 FPU, floating point addition or subtraction requires about 6 clock cycles, multiplication about 8, and division 30-44. Cosine, by contrast, requires between 180 and 280 clock cycles.
It's still very fast, since the x86 does these things in hardware, but it's much slower than the more basic math functions.
Since sin(), cos() and tan() are mathematical functions which are calculated by summing a series, developers will sometimes use lookup tables to avoid the expensive calculation.
The tradeoff is in accuracy and memory. The greater the need for accuracy, the greater the amount of memory required for the lookup table.
Take a look at the following table accurate to 1 degree.
http://www.analyzemath.com/trigonometry/trig_1.gif
While the quick answer is that they are more expensive than the primitive math operations (addition, multiplication, subtraction, etc.), they are not expensive in terms of human time. Typically the reason people optimize them with lookup tables and approximations is that they are calling them potentially tens of thousands of times per second, and every microsecond can be valuable.
If you're writing a program and just need to call it a couple times a second the built-in functions are fast enough by far.
I would recommend writing a test program and timing them for yourself. Yes, they're slow compared to plus and minus, but they're still single processor instructions. It's unlikely to be an issue unless you're doing a very tight loop with millions of iterations.
Yes, (relative to other mathematical operations multiply, divide): if you're doing something realtime (matrix ops, video games, whatever), you can knock off lots of cycles by moving your trig calculations out of your inner loop.
If you're not doing something realtime, then no, they're not expensive (relative to operations such as reading a bunch of data from disk, generating a webpage, etc.). Trig ops are hopefully done in hardware by your CPU (which can do billions of floating point operations per second).
If you always know the angles you are computing, you can store them in a variable instead of calculating them every time. This also applies within your method/function call where your angle is not going to change. You can be smart by using some formulas (calculating sin(theta) from sin(theta/2), knowing how often the values repeat - sin(theta + 2*pi*n) = sin(theta)) and reducing computation. See this wikipedia article
Yes, it is. Trig functions are computed by summing a series, so in general they are a lot more costly than a simple mathematical operation. The same goes for sqrt.
