What is FLOPS in the field of deep learning? Why don't we just use the term FLO?
We use the term FLOPS to measure the number of operations of a frozen deep learning network.
According to Wikipedia, FLOPS = floating point operations per second. When we benchmark computing hardware, we should take time into account. But when measuring a deep learning network, how should I understand this concept of time? Shouldn't we just use the term FLO (floating point operations)?
Why do people use the term FLOPS? Is there something I'm missing?
==== attachment ====
The frozen deep learning networks I mentioned are just software; this is not about hardware. In the field of deep learning, people use the term FLOPS to measure how many operations are needed to run the network model. In this case, in my opinion, we should use the term FLO. I thought people were confusing the term FLOPS, and I want to know whether others think the same or whether I'm wrong.
Please look at these cases:
how to calculate a net's FLOPs in CNN
https://iq.opengenus.org/floating-point-operations-per-second-flops-of-machine-learning-models/
Confusingly, both FLOPs (floating point operations) and FLOPS (floating point operations per second) are used in reference to machine learning. FLOPs are often used to describe how many operations are required to run a single instance of a given model, like VGG19. This is the usage of FLOPs in both of the links you posted, though unfortunately the opengenus link mistakenly uses 'Floating point operations per second' to refer to FLOPs.
You will see FLOPS used to describe the computing power of hardware such as GPUs, which is useful when thinking about how powerful a given piece of hardware is or, conversely, how long it may take to train a model on that hardware.
Sometimes people write FLOPS when they mean FLOPs. It is usually clear from the context which one they mean.
I'm not sure my answer is 100% correct, but this is what I understand.
FLOPS = Floating point operations per second
FLOPs = Floating point operations
FLOPS is a unit of speed. FLOPs is a unit of amount.
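As a rough illustration of counting FLOPs (the amount), here is a minimal sketch that estimates the FLOPs of a single fully connected layer and a single convolution layer, using the common convention that one multiply-accumulate counts as two floating point operations. The layer shapes below are made-up examples, not taken from any particular model.

```python
# Rough FLOPs (floating point operations) estimate for two common layer types.
# Convention assumed here: one multiply-accumulate (MAC) = 2 FLOPs.

def dense_flops(in_features: int, out_features: int) -> int:
    # Each output needs in_features multiplies and in_features adds.
    return 2 * in_features * out_features

def conv2d_flops(c_in: int, c_out: int, k: int, h_out: int, w_out: int) -> int:
    # Each output element needs k*k*c_in MACs.
    return 2 * c_in * c_out * k * k * h_out * w_out

if __name__ == "__main__":
    # Made-up example shapes:
    print(f"dense 4096->1000: {dense_flops(4096, 1000):,} FLOPs")
    print(f"conv 3->64, 3x3 kernel, 224x224 output: {conv2d_flops(3, 64, 3, 224, 224):,} FLOPs")
```

Note that this is a count of work (FLOPs), not a rate (FLOPS); dividing the count by how long the forward pass actually takes would give the achieved FLOPS on a given piece of hardware.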
What is FLOPS in the field of deep learning? Why don't we just use the term FLO?
FLOPS (Floating Point Operations Per Second) means the same thing in most fields: it's the (theoretical) maximum number of floating point operations per second that the hardware might (if you're extremely lucky) be capable of.
We don't use FLO because FLO would always be infinity (given an infinite amount of time, hardware is capable of doing an infinite number of floating point operations).
Note that one "floating point operation" is one multiplication, one division, one addition, ... Typically (for modern CPUs) FLOPS is calculated from repeated use of a "fused multiply-add" instruction, so that one instruction counts as 2 floating point operations. When combined with SIMD, a single instruction (doing 8 "multiply and add" operations in parallel) might count as 16 floating point operations. Of course this is a calculated theoretical value, so you ignore things like memory accesses, branches, IRQs, etc. This is why theoretical FLOPS is almost never achievable in practice.
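To make that calculation concrete, here is a small sketch of the usual back-of-the-envelope peak-FLOPS formula; the core count, clock rate, SIMD width, and FMA unit count are hypothetical numbers, not a specific CPU.

```python
# Theoretical peak FLOPS = cores * clock (Hz) * SIMD lanes * FMA units per core * 2
# (the trailing 2 is because one fused multiply-add counts as 2 floating point ops).

def peak_flops(cores: int, clock_hz: float, simd_lanes: int, fma_units_per_core: int) -> float:
    return cores * clock_hz * simd_lanes * fma_units_per_core * 2

# Hypothetical CPU: 8 cores, 3 GHz, 256-bit SIMD (8 fp32 lanes), 2 FMA units per core.
print(f"{peak_flops(8, 3e9, 8, 2) / 1e9:.0f} GFLOPS (fp32, theoretical)")
# Real workloads rarely get close, because of memory accesses, branches, etc.
```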
Why do people use the term FLOPS? Is there something I'm missing?
Primarily it's used to describe how powerful hardware is for marketing purposes (e.g. "Our new CPU is capable of 5 GFLOPS!").
Related
1-1 What are the differences in delay times of the basic logic gates?
I found that NAND and NOR gates are preferred in digital circuit design for their shorter delay times, and that AND and OR gates might even be implemented with NOT and NAND/NOR gates.
1-2 Are there set or known differences in delay time between AND, OR, and NOT gates?
For a typical FPGA (LUT-based logic elements) there's no difference at all.
A single cell can implement a complex function based on its resulting truth table, and multiple expressions might be folded into a single cell, so you wouldn't even find individual AND/OR/NOT "gates".
It might be different for an ASIC, I don't know. But in a typical FPGA you don't have gates; there are RAM-based lookup tables implementing complex functions of their inputs (4-6 inputs, not just 2).
You'll find that in a big enough design the routing costs are much higher than the delays in a single logic cell.
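To illustrate why a LUT-based cell has the same delay regardless of which function it implements, here is a toy model of a 4-input LUT in Python. This is only a sketch of the concept; it is not how any vendor tool actually represents logic cells.

```python
# Toy model of a 4-input LUT: any boolean function of 4 inputs is just a
# 16-entry truth table, so AND, OR, NOT, or something more complex all
# occupy one cell and incur the same cell delay.

def make_lut4(func):
    # Precompute the 16-entry truth table for the given function.
    table = [func((i >> 0) & 1, (i >> 1) & 1, (i >> 2) & 1, (i >> 3) & 1)
             for i in range(16)]
    def lut(a, b, c, d):
        return table[a | (b << 1) | (c << 2) | (d << 3)]
    return lut

and4 = make_lut4(lambda a, b, c, d: a & b & c & d)
complex_fn = make_lut4(lambda a, b, c, d: (a & b) | (c ^ d))

print(and4(1, 1, 1, 1), complex_fn(0, 0, 1, 0))  # both are a single table lookup
```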
If you look at how these different gates are constructed you can see some of the reasons for the differences. An inverter consists of one pull-up transistor and one pull-down transistor. This is the simplest gate and is therefore potentially the fastest. A NAND has two pull-down transistors in series and two pull-up transistors in parallel. The NOR is basically the opposite of the NAND. And yes: AND is usually just NAND + inverter.
The on-resistance of a path will be higher with two transistors in series (making it slower), and the number of transistors connected to a single node will increase the capacitive load (making it slower). You can make things faster by using larger transistors (with lower on-resistance), but that increases the load on whatever cell is driving it, which slows that cell down.
It is a big optimization problem which you probably shouldn't try to solve yourself. That is what the EDA tools are for.
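As a toy illustration of that reasoning: since AND is typically a NAND followed by an inverter, its delay is the sum of the two stages. The delay numbers below are made-up, normalized units, not real process data.

```python
# Hypothetical, normalized stage delays just to illustrate why AND tends to be
# slower than NAND: AND = NAND + inverter, so its delay is the sum of both stages.
INV_DELAY = 1.0   # made-up unit delay for an inverter
NAND_DELAY = 1.4  # made-up: series pull-down transistors make it a bit slower

def and_delay() -> float:
    return NAND_DELAY + INV_DELAY  # two stages in series

print(f"NAND: {NAND_DELAY}, AND (NAND + INV): {and_delay()}")
```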
Like most answers in life, it depends. There are many ways to build each type of logic gate, and different types of transistors can be used to make each type of gate. You can build all gates from universal gates like NAND and NOR, so gates built that way would have a larger delay time. BJT transistors will have a larger delay than MOSFET transistors. You can also use Schottky transistors to reduce delays compared to plain BJTs. If you use an IC there are lots of components within the chip, some of which may reduce delays and some that may increase delays. So you really have to compare what you are working with. Here is a video that shows the design of logic gates at the transistor level. https://youtu.be/nB6724G3b3E
As a firmware engineer, how do you compare the computation cost (say, the amount of resources needed) of the following operations?
addition/subtraction
multiplication
division
trigonometric functions such as cosine
square root
This is for 32-bit floating point calculations.
Answer from a hardware (not firmware) point of view: unfortunately there is no simple answer to your question. Each function you listed has many different hardware implementations, usually ranging from small-and-slow to large-and-fast. Moreover it depends on your target FPGA, because some of them embed these functions as hard macros, so the question is no longer what they cost, but whether you have enough of them in this FPGA.
As a partial answer you can take this: with the most straightforward combinational implementation, not using any hard macro, an integer or fixed-point N-bit adder/subtractor costs O(N), while an N x N-bit multiplier costs O(N^2). Roughly.
For the other functions it is extremely difficult to answer; there are far too many variants to consider (fixed/floating point, latency, throughput, accuracy, range...). Assuming you make similar choices for all of them, I would say: add < mul < div < sqrt < trig. But honestly, do not take this for granted. If you are working with floating point numbers, for instance, the adder might be close to or even larger than the multiplier, because it requires mantissa alignment, that is, a barrel shifter.
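To make the O(N) vs O(N^2) point concrete, here is a small sketch that counts one-bit full-adder evaluations for a ripple-carry adder and for a naive shift-and-add multiplier built from such adders. It is only an operation count under those simple assumptions, not a statement about any particular FPGA implementation.

```python
# Count 1-bit full-adder evaluations: a ripple-carry adder uses O(N) of them,
# while a naive shift-and-add multiplier uses roughly O(N^2).

def ripple_carry_add(a: int, b: int, n: int):
    """Add two n-bit numbers bit by bit; return (sum, number of full adders used)."""
    carry, result, fa_count = 0, 0, 0
    for i in range(n):
        ai, bi = (a >> i) & 1, (b >> i) & 1
        s = ai ^ bi ^ carry
        carry = (ai & bi) | (carry & (ai ^ bi))
        result |= s << i
        fa_count += 1
    return result + (carry << n), fa_count

def shift_and_add_multiply(a: int, b: int, n: int):
    """Multiply two n-bit numbers with n additions of shifted partial products."""
    product, fa_total = 0, 0
    for i in range(n):
        partial = (a << i) if (b >> i) & 1 else 0
        # Each partial-product addition uses a (2n)-bit adder in this naive model.
        product, fa = ripple_carry_add(product, partial, 2 * n)
        fa_total += fa
    return product, fa_total

n = 8
s, add_cells = ripple_carry_add(200, 55, n)
p, mul_cells = shift_and_add_multiply(200, 55, n)
print(f"add: {s} using {add_cells} full adders; multiply: {p} using {mul_cells} full adders")
```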
A firmware engineer with a little hardware design knowledge would likely use premade cores for each of those functions. Xilinx, Altera, and others provide parameterizable cores that can perform most of the functions you specified. The parameters for the core will likely include input size and the number of cycles taken for the specified operation, among many others. The resources used by each of these cores will vary significantly based on the configuration and also on the vendor and corresponding FPGA family that the core is being created for (for instance Altera Cyclone V or Xilinx Artix-7). The latency of the cores will need to be a factor in your decision as well; higher-speed families of FPGAs can operate at faster clock rates, decreasing the time that operations take.
The way to get real estimates is to install the vendor's tools (for instance Xilinx's Vivado or Altera's Quartus) and use those tools to generate each of the types of cores required. You will need to pick an exact FPGA part in the family that you are evaluating. Choose one with many pins to avoid running out of I/O on the FPGA. Then synthesize and implement a design for each of the cores you are evaluating. The simplest design will have just the core that you are evaluating, with its inputs and outputs mapped to the FPGA's I/O pins. The implementation results will give you the number of resources that the design takes and also the maximum clock frequency that the design can run at.
FLOPS stands for FLoating-point Operations Per Second, and I have some idea of what floating point is. I want to know what these operations are. Are +, -, *, / the only operations, or do operations like logarithm() and exponential() also count as floating point operations?
Do + and * of two floats take the same time? And if they take different amounts of time, what interpretation should I draw from the statement "performance is 100 FLOPS"? How many + and * operations are performed in one second?
I am not a computer science guy, so kindly try to be less technical. Also let me know if I have understood it completely wrong.
Thanks
There is no specific set of operations that are included in FLOPS; it's just measured using the operations that each processor supports as a single instruction. The basic arithmetic operations are generally supported, but operations like logarithms are calculated using a series of simpler operations.
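As a sketch of that "series of simpler operations" point, here is a natural logarithm computed from nothing but +, -, * and /, using the artanh series. Real math libraries use more sophisticated methods; this is only an illustration.

```python
import math

# ln(x) built only from +, -, * and /, via ln(x) = 2 * artanh((x - 1) / (x + 1))
# = 2 * (t + t^3/3 + t^5/5 + ...). Real libm implementations are more clever;
# this just shows a "complex" operation reduced to simple ones.

def ln_series(x: float, terms: int = 30) -> float:
    t = (x - 1.0) / (x + 1.0)
    t2 = t * t
    total, power = 0.0, t
    for k in range(terms):
        total += power / (2 * k + 1)
        power *= t2
    return 2.0 * total

print(ln_series(2.0), math.log(2.0))  # should agree to many decimal places
```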
For modern computers, all the supported floating point operations generally run in a single clock cycle or less. Even if the complexity differs a bit between operations, it's rather getting the data in and out of the processor that is the bottleneck.
The reason that FLOPS is still a useful measure of computing speed is that CPUs are not specialized in floating point calculations. Adding more floating point units to the CPU would drive up the FLOPS, but there is no big market for CPUs that are only good at that.
I am currently working on a project where I am trying to find which Algorithm is better when implemented in a Chip-Level design. I am pushing these to my FPGA board. I am writing the code in Verilog.
What I need is a rubric to compare
a) Time Complexity of the 2 functions.
b) Worst Case timings
c) Power Consumption
For example:
Method 1: prod = mult1 * mult2;
Where mult1 and mult2 are two 8 bit inputs.
Method 2: prod = ((mult1+mult2-100) * 100) + ((100-mult1) * (100-mult2))
Where mult1 and mult2 are two 8 bit inputs.
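As an aside, a quick brute-force check (a sketch in Python, over plain unbounded integers rather than fixed-width Verilog vectors) confirms that Method 2 computes the same product as Method 1, since ((m1 + m2 - 100) * 100) + ((100 - m1) * (100 - m2)) expands algebraically to m1 * m2:

```python
# Brute-force equivalence check of the two methods over all 8-bit input pairs.
# Note: this uses Python's unbounded integers; fixed-width Verilog arithmetic
# (intermediate wraparound, signedness) should still be checked in simulation.
for m1 in range(256):
    for m2 in range(256):
        method1 = m1 * m2
        method2 = ((m1 + m2 - 100) * 100) + ((100 - m1) * (100 - m2))
        assert method1 == method2, (m1, m2)
print("Method 1 and Method 2 agree for all 8-bit input pairs")
```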
So I am interested in knowing the total time it takes for the chip to compute the product, from the time the two inputs are passed in until the product is calculated.
The algorithms I am dealing with are both O(n), so I know it doesn't matter asymptotically. However, I am interested in knowing the exact worst-case timings when implemented on an FPGA or ASIC so that I can try to improve the functions. I also understand that these calculations take nanoseconds to compute; however, that is what I am trying to improve.
I saw a couple of journal publications which claimed to have a faster algorithm. However, when I implemented the same using Xilinx and using the Synthesis Reports, I was getting different results.
Is there a software that calculates the power consumption, and worst case timings? Or could anyone point me to some article that could help me out here?
Regarding the time complexity and worst-case timing, these will depend on exactly how you write the algorithm and on the target FPGA you are using.
The time it takes to execute a computation depends on the clock frequency of the FPGA and the number of clock cycles it takes to perform your algorithm. If your multiplication algorithm is single-cycle, then the compute time will simply be the clock period of the FPGA. Large multiplication circuits are typically not single-cycle, however, due to their complexity. Exactly what you get depends on your input code, the synthesizer, and the kind of FPGA you are using.
Typically your FPGA synthesis tools can tell you the worst-case timing of your design in the post-synthesis report. If you want to improve the worst-case timing, you can add pipeline stages to the multiplication to break up the work and increase the clock frequency.
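To make the clock-period arithmetic concrete, here is a tiny sketch with made-up numbers; the real cycle counts and achievable clock frequency come from your post-synthesis or implementation report.

```python
# Latency = clock cycles * clock period. Hypothetical numbers for illustration only.

def latency_ns(cycles: int, clock_mhz: float) -> float:
    return cycles * 1000.0 / clock_mhz  # period in ns = 1000 / f_MHz

print(latency_ns(1, 100))  # hypothetical single-cycle multiplier at 100 MHz -> 10.0 ns
print(latency_ns(3, 250))  # hypothetical 3-stage pipelined version at 250 MHz -> 12.0 ns latency,
                           # but it can accept a new input every 4 ns (higher throughput)
```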
Measuring power consumption will also be heavily dependent on the FPGA you use and the synthesized netlist that you are loading. Altera and Xilinx offer power estimators for their FPGAs (probably other vendors do as well), but I'm not sure if there are any vendor-agnostic power estimators out there.
Long story short, to get the metrics you want you'll need to work with the synthesis tools of the FPGA you are planning to use.
I read in an article somewhere that trig calculations are generally expensive. Is this true? And if so, is that why they use trig lookup tables?
EDIT: Hmm, so if the only thing that changes is the angle (accurate to 1 degree), would a lookup table with 360 entries (one for every degree) be faster?
Expensive is a relative term.
The mathematical operations that will perform fastest are those that can be performed directly by your processor. Certainly integer add and subtract will be among them. Depending upon the processor, there may be multiplication and division as well. Sometimes the processor (or a co-processor) can handle floating point operations natively.
More complicated things (e.g. square root) require a series of these low-level calculations to be performed. These things are usually accomplished using math libraries (written on top of the native operations your processor can perform).
All of this happens very very fast these days, so "expensive" depends on how much of it you need to do, and how quickly you need it to happen.
If you're writing real-time 3D rendering software, then you may need to use lots of clever math tricks and shortcuts to squeeze every bit of speed out of your environment.
If you're working on typical business applications, odds are that the mathematical calculations you're doing won't contribute significantly to the overall performance of your system.
On the Intel x86 processor, floating point addition or subtraction requires 6 clock cycles, multiplication requires 8 clock cycles, and division 30-44 clock cycles. But cosine requires between 180 and 280 clock cycles.
It's still very fast, since the x86 does these things in hardware, but it's much slower than the more basic math functions.
Since sin(), cos() and tan() are mathematical functions which are calculated by summing a series, developers will sometimes use lookup tables to avoid the expensive calculation.
The tradeoff is in accuracy and memory. The greater the need for accuracy, the greater the amount of memory required for the lookup table.
Take a look at the following table accurate to 1 degree.
http://www.analyzemath.com/trigonometry/trig_1.gif
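Regarding the EDIT about a 360-entry table accurate to 1 degree: here is a minimal sketch of that idea in Python. Whether it actually beats the built-in sin() depends heavily on the platform and language, so measure it.

```python
import math

# 360-entry sine lookup table, one entry per whole degree.
SIN_TABLE = [math.sin(math.radians(d)) for d in range(360)]

def sin_deg(angle_degrees: float) -> float:
    # Round to the nearest degree and wrap into [0, 360); accuracy is limited
    # to the 1-degree resolution of the table.
    return SIN_TABLE[int(round(angle_degrees)) % 360]

print(sin_deg(30), math.sin(math.radians(30)))  # table entry vs library value
print(sin_deg(30.4))                            # rounds to the 30-degree entry
```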
While the quick answer is that they are more expensive than the primitive math operations (addition, multiplication, subtraction, etc.), they are not expensive in terms of human time. Typically the reason people optimize them with lookup tables and approximations is that they are calling them potentially tens of thousands of times per second, and every microsecond can be valuable.
If you're writing a program and just need to call it a couple times a second the built-in functions are fast enough by far.
I would recommend writing a test program and timing them for yourself. Yes, they're slow compared to plus and minus, but they're still single processor instructions. It's unlikely to be an issue unless you're doing a very tight loop with millions of iterations.
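For example, a quick-and-dirty timing comparison in Python; the absolute numbers vary by machine and language, and in C/C++ the relative gap between a multiply and a library sin() call is typically larger.

```python
import timeit

# Rough comparison of a floating point multiplication versus a sin() call.
mul_time = timeit.timeit("x * 1.0001", setup="x = 0.7", number=1_000_000)
sin_time = timeit.timeit("math.sin(x)", setup="import math; x = 0.7", number=1_000_000)

print(f"multiply: {mul_time:.3f} s, sin: {sin_time:.3f} s for 1,000,000 calls")
```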
Yes, (relative to other mathematical operations multiply, divide): if you're doing something realtime (matrix ops, video games, whatever), you can knock off lots of cycles by moving your trig calculations out of your inner loop.
If you're not doing something realtime, then no, they're not expensive (relative to operations such as reading a bunch of data from disk, generating a webpage, etc.). Trig ops are hopefully done in hardware by your CPU (which can do billions of floating point operations per second).
If you always know the angles you are computing, you can store the results in a variable instead of calculating them every time. This also applies within your method/function call where your angle is not going to change. You can be smart by using some formulas (calculating sin(theta) from sin(theta/2), knowing how often the values repeat: sin(theta + 2*pi*n) = sin(theta)) and reducing computation. See this Wikipedia article.
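A small sketch of that idea: hoist the trig call out of the loop when the angle doesn't change, and use periodicity to avoid recomputing for equivalent angles. The loop and angles here are made-up examples.

```python
import math

# If the angle is fixed inside a loop, compute sin once and reuse it.
theta = 0.3
sin_theta = math.sin(theta)          # hoisted out of the loop
total = sum(sin_theta * i for i in range(1000))

# Periodicity: sin(theta + 2*pi*n) == sin(theta), so reduce the angle first.
big_angle = theta + 2 * math.pi * 12345
reduced = math.fmod(big_angle, 2 * math.pi)
print(math.sin(reduced), sin_theta)  # same value, up to floating point error
```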
Yes, it is. Trig functions are computed by summing up a series, so in general terms they will be a lot more costly than a simple mathematical operation. The same goes for sqrt.