Counting FLOPs of a batch normalization layer

Could you please let me know how I can count the number of FLOPs related to the batch normalization layer theoretically?
(Note on notation: in FLOPs the s is lowercase; it abbreviates FLoating point OPerations (the s marks the plural), i.e. the amount of floating point arithmetic, understood as the computation cost. It can be used to measure the complexity of a model. The complexity of a neural network model should be stated in FLOPs, not FLOPS, which means floating point operations per second.)
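There is no single standard convention, but one common back-of-the-envelope count (an assumption, not a canonical rule) treats inference-time batch normalization as elementwise work: subtract the mean, divide by the standard deviation, multiply by gamma, add beta, i.e. 4 FLOPs per output element, or 2 FLOPs per element once the transform is folded into a single scale-and-shift y = a*x + b. A minimal sketch under that convention:

```python
def batchnorm_flops(num_elements, folded=False):
    # Inference-time batch norm, per element:
    #   y = gamma * (x - mean) / std + beta  -> 4 FLOPs (sub, div, mul, add)
    # Folded into y = a * x + b (a, b precomputed) -> 2 FLOPs (mul, add)
    flops_per_element = 2 if folded else 4
    return num_elements * flops_per_element

# Example: a feature map of shape (N=1, C=64, H=56, W=56) -- shapes are illustrative.
n = 1 * 64 * 56 * 56
```

Training-time counts differ, since the batch mean and variance must also be computed (roughly another couple of FLOPs per element for the two reduction passes).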

Related

Why is exponentiation not atomic?

In calculating the efficiency of algorithms, I have read that the exponentiation operation is not considered to be an atomic operation (like multiplication).
Is it because exponentiation is the same as the multiplication operation repeated several times over?
In principle, you can pick any set of "core" operations on numbers that you consider to take a single time unit to evaluate. However, there are a couple of reasons why we typically don't count exponentiation as one of them.
Perhaps the biggest has to do with how large an output you produce. Suppose you have two numbers x and y that are each d digits long. Then their sum x + y has (at most) d + 1 digits - barely bigger than what we started with. Their product xy has at most 2d digits - larger than what we started with, but not by a huge amount. On the other hand, the value x^y has roughly yd digits, which can be significantly bigger than what we started with. (A good example of this: think about computing 100^100, which has about 200 digits!) This means that simply writing down the result of the exponentiation would require a decent amount of time to complete.
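These growth rates are easy to verify with Python's arbitrary-precision integers (the specific values are only illustrations):

```python
# Two 100-digit numbers
x = 10 ** 99
y = 10 ** 99

def digits(n):
    # Number of decimal digits of a positive integer
    return len(str(n))

print(digits(x + y))       # sum: at most one digit longer than the inputs
print(digits(x * y))       # product: at most double the input length
print(digits(100 ** 100))  # exponentiation: explodes from two 3-digit inputs
```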
This isn't to say that you couldn't consider exponentiation to be a constant-time operation. Rather, I've just never seen it done.
(Fun fact: some theory papers don't consider multiplication to be a constant-time operation, since the size of a hardware circuit that multiplies two b-bit numbers grows quadratically with b. And some theory papers don't consider addition to be constant-time either, especially when working with variable-length numbers! It's all about context. If you're dealing with "smallish" numbers that fit into machine words, then we can easily count addition and multiplication as taking constant time. If you have huge numbers - say, large primes for RSA encryption - then the size of the numbers starts to impact the algorithm's runtime and implementation.)
This is a matter of definition. For example, in hardware design and big-integer processing, multiplication is not considered an atomic operation (see, e.g., this analysis of the Karatsuba algorithm).
At the level relevant for general-purpose software design, on the other hand, multiplication can be considered a fairly fast operation on fixed-width numbers, since it is implemented in hardware. Exponentiation, in contrast, is rarely implemented in hardware, and an upper bound on its complexity can only be given in terms of the exponent rather than the number of digits.
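To make the last point concrete, here is the standard square-and-multiply method: it uses only O(log e) multiplications, yet its running time is still governed by the exponent, because the intermediate products keep growing toward the size of the final result.

```python
def power(x, e):
    # Square-and-multiply exponentiation: O(log e) multiplications,
    # but the operands grow toward the size of the final result,
    # which has roughly e * digits(x) digits.
    result = 1
    while e:
        if e & 1:          # current low bit of the exponent
            result *= x
        x *= x             # square for the next bit
        e >>= 1
    return result
```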

What is the meaning of "constant" in this context?

I am currently reading the Introduction to Algorithms book and I have a question in regard to analyzing an algorithm:
When analyzing merge sort, the book assumes that each word of data has c lg n bits, and it says that
We restrict c to be a constant so that the word size does not grow arbitrarily (If the word size could grow arbitrarily, we could store huge amounts of data in one word and operate on it all in constant time)
I do not understand the meaning of "constant" here. Could anyone explain clearly what this means?
Computational complexity in the study of algorithms deals with finding functions which provide upper and lower bounds for how much time (or space) the algorithm requires. Recall basic algebra in high school, where you learned the slope-intercept formula for a line? That formula, y = mx + b, provided two parameters, m (slope) and b (y-intercept), which described a line completely. Those constants (m, b) described where the line lay, and a larger slope meant that the line was steeper.
Algorithmic complexity is just a way to describe the upper (and possibly lower) bounds on how long an algorithm takes to run (and/or how much space it requires). With big-O (and big-Theta) notation, you are finding a function which provides upper (and lower) bounds on the algorithm's cost. The constants just shift the curve; they do not change its shape.
We restrict c to be a constant so that the word size does not grow arbitrarily (If the word size could grow arbitrarily, we could store huge amounts of data in one word and operate on it all in constant time)
On a physical computer, there is some maximum size to a machine word. On a 32-bit system, that would be 32 bits, and on a 64-bit system, it's probably 64 bits. Operations on machine words are (usually) assumed to take time O(1) even though they operate on lots of bits at the same time. For example, if you use a bitwise OR or bitwise AND on a machine word, you can think of it as performing 32 or 64 parallel OR or AND operations in a single unit of time.
When trying to build a theoretical model for a computing system, it's necessary to assume an upper bound on the maximum size of a machine word. If you don't do this, then you could claim that you could perform operations like "compute the OR of n values in time O(1)" or "add together two arbitrary-precision numbers in time O(1)," operations that you can't actually do on a real computer. Therefore, there's usually an assumption that the machine word has some maximum size so that if you do want to compute the OR of n values, you can still do so, but you can't do it instantaneously by packing all the values into one machine word and performing a single assembly instruction to get the result.
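The "parallel OR" picture can be illustrated by packing bits into a single integer (a toy sketch; real word-RAM models cap the word at O(log n) bits precisely so that the trick below cannot be scaled to arbitrary n):

```python
def pack(bits):
    # Pack a list of 0/1 values into one integer, bit i at position i.
    word = 0
    for i, b in enumerate(bits):
        word |= b << i
    return word

def unpack(word, n):
    # Recover the first n bits as a list of 0/1 values.
    return [(word >> i) & 1 for i in range(n)]

a = [0, 1, 0, 1]
b = [0, 0, 1, 1]
# A single machine-level OR acts on every packed position at once.
c = unpack(pack(a) | pack(b), 4)
```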
Hope this helps!

Why is division more expensive than multiplication?

I am not really trying to optimize anything, but I have heard this from programmers so often that I took it as truth. After all, they are supposed to know this stuff.
But I wonder, why is division actually slower than multiplication? Isn't division just a glorified subtraction, and multiplication a glorified addition? Mathematically, I don't see why going one way or the other should have very different computational costs.
Can anyone please clarify the reason/cause of this, so I know, instead of the answer I got from the other programmers I asked before, which was: "because".
The CPU's ALU (Arithmetic Logic Unit) executes algorithms, even though they are implemented in hardware. Classic multiplication algorithms include the Wallace tree and the Dadda tree. More information is available here. More sophisticated techniques are available in newer processors. Generally, processors strive to parallelize bit-pair operations in order to minimize the clock cycles required. Multiplication algorithms can be parallelized quite effectively (though more transistors are required).
Division algorithms can't be parallelized as efficiently. The most efficient division algorithms are quite complex (the Pentium FDIV bug demonstrates the level of complexity). Generally, they require more clock cycles per bit. If you're after more technical details, here is a nice explanation from Intel. Intel actually patented their division algorithm.
But I wonder why is division actually slower than multiplication? Isn't division just a glorified subtraction, and multiplication is a glorified addition?
The big difference is that in a long multiplication you just need to add up a bunch of numbers after shifting and masking. In a long division you have to test for overflow after each subtraction.
Let's consider a long multiplication of two n-bit binary numbers.
shift (no time)
mask (constant time)
add (naively looks like time proportional to n²)
But if we look closer it turns out we can optimise the addition by using two tricks (there are further optimisations but these are the most important).
We can add the numbers in groups rather than sequentially.
Until the final step we can add three numbers to produce two rather than adding two to produce one. While adding two numbers to produce one takes time proportional to n, adding three numbers to produce two can be done in constant time because we can eliminate the carry chain.
So now our algorithm looks like
shift (no time)
mask (constant time)
add numbers in groups of three to produce two until there are only two left (time proportional to log(n))
perform the final addition (time proportional to n)
In other words, we can build a multiplier for two n-bit numbers in time roughly proportional to n (and space roughly proportional to n²). As long as the CPU designer is willing to dedicate the logic, multiplication can be almost as fast as addition.
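The shift/mask/add scheme with the 3-to-2 compression can be sketched in software (a toy model of the hardware idea, not how a real multiplier is wired; `csa` plays the role of one carry-save layer in a Wallace-style tree):

```python
def csa(a, b, c):
    # Carry-save adder: compress three numbers into two (sum, carry)
    # with no carry chain -- constant depth in hardware.
    return a ^ b ^ c, ((a & b) | (b & c) | (a & c)) << 1

def multiply(x, y):
    # Partial products: x shifted left by each set-bit position of y
    # (the "shift and mask" step).
    partials = [x << i for i in range(y.bit_length()) if (y >> i) & 1]
    if not partials:
        return 0
    # Compress three numbers into two until only two remain (log-depth tree).
    while len(partials) > 2:
        a, b, c = partials.pop(), partials.pop(), partials.pop()
        partials.extend(csa(a, b, c))
    # One final carry-propagating addition.
    return sum(partials)
```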
In long division we need to know whether each subtraction overflowed before we can decide what inputs to use for the next one, so we can't apply the same parallelising tricks as we can with long multiplication.
There are methods of division that are faster than basic long division, but they are still slower than multiplication.
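For contrast, a restoring-division sketch makes the serial dependency explicit: each quotient bit depends on the outcome of the previous trial subtraction, which is exactly what blocks the parallel reduction available to multiplication (again a toy model, not production hardware):

```python
def divide(dividend, divisor, n_bits=32):
    # Restoring division: one trial subtraction per quotient bit; the
    # comparison result decides what the next step sees, so the steps
    # cannot be overlapped the way partial-product additions can.
    quotient, remainder = 0, 0
    for i in range(n_bits - 1, -1, -1):
        remainder = (remainder << 1) | ((dividend >> i) & 1)
        if remainder >= divisor:     # the per-step "overflow test"
            remainder -= divisor
            quotient |= 1 << i
    return quotient, remainder
```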

Fast algorithm to calculate Pi in parallel

I am starting to learn CUDA and I think calculating long digits of pi would be a nice, introductory project.
I have already implemented the simple Monte Carlo method, which is easily parallelizable. I simply have each thread randomly generate points on the unit square, figure out how many lie within the unit circle, and tally up the results using a reduction operation.
But that is certainly not the fastest algorithm for calculating the constant. Before, when I did this exercise on a single threaded CPU, I used Machin-like formulae to do the calculation for far faster convergence. For those interested, this involves expressing pi as the sum of arctangents and using Taylor series to evaluate the expression.
An example of such a formula is Machin's own: π/4 = 4 arctan(1/5) − arctan(1/239).
Unfortunately, I found that parallelizing this technique to thousands of GPU threads is not easy. The problem is that the majority of the operations are simply doing high precision math as opposed to doing floating point operations on long vectors of data.
So I'm wondering, what is the most efficient way to calculate arbitrarily long digits of pi on a GPU?
You should use the Bailey–Borwein–Plouffe formula.
Why? First of all, you need an algorithm that can be broken down. So, the first thing that came to my mind is having a representation of pi as an infinite sum. Then, each processor just computes one term, and you sum them all in the end.
Then, it is preferable that each processor manipulates small-precision values, as opposed to very high precision ones. For example, if you want one billion decimals and you use some of the expressions discussed here, like the Chudnovsky algorithm, each of your processors will need to manipulate a billion-digit number. That's simply not the appropriate method for a GPU.
So, all in all, the BBP formula will allow you to compute the digits of pi separately (the algorithm is very cool), and with "low precision" processors! Read the "BBP digit-extraction algorithm for π"
Advantages of the BBP algorithm for computing π
This algorithm computes π without requiring custom data types holding thousands or even millions of digits. The method calculates the nth hexadecimal digit without calculating the first n − 1 digits, and can use small, efficient data types.
The algorithm is the fastest way to compute the nth digit (or a few digits in a neighborhood of the nth), but π-computing algorithms using large data types remain faster when the goal is to compute all the digits from 1 to n.
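A minimal floating-point sketch of BBP hex-digit extraction, assuming the standard formula π = Σₖ 16⁻ᵏ (4/(8k+1) − 2/(8k+2) − 1/(8k+5) − 1/(8k+6)). Double precision limits this toy version to modest n; in a GPU setting each thread would simply get its own n:

```python
def bbp_hex_digit(n):
    # Hexadecimal digit of pi at position n after the point (n = 0 -> '2',
    # since pi = 3.243F6A88... in hex). Modular exponentiation keeps the
    # working precision small regardless of n.
    def series(j):
        # Fractional part of 16^n * sum_k 1 / (16^k * (8k + j))
        s = 0.0
        for k in range(n + 1):
            s = (s + pow(16, n - k, 8 * k + j) / (8 * k + j)) % 1.0
        k = n + 1
        while True:  # tail terms, each now smaller than 1
            term = 16.0 ** (n - k) / (8 * k + j)
            if term < 1e-17:
                break
            s += term
            k += 1
        return s % 1.0

    x = (4 * series(1) - 2 * series(4) - series(5) - series(6)) % 1.0
    return "%x" % int(x * 16)
```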

computing derivatives of discrete periodic data

I have an array y[x], x=0,1,2,...,10^6 describing a periodic signal with y(10^6)=y(0), and I want to compute its derivative dy/dx with a fast method.
I tried the spectral difference method, namely
dy/dx = inverse_fourier_transform( i*k * fourier_transform(y)[k] )    (1)
and the result is different from (y[x+1] - y[x-1])/2, i.e. the result suggested by the finite difference method.
Which of the two is more accurate, and which is faster? Are there other comparable methods?
Below is an effort to understand the difference between the results:
If one expands both the sum for the fourier_transform and that for the inverse_fourier_transform in (1), one can express dy/dx as a linear combination of the y[x] with coefficients a[x]. I computed these coefficients and they seem to decay as 1/n (as the length of the array goes to infinity), with n being the distance from the point where the derivative is evaluated. Compared to the finite difference method, which uses only the two neighboring points, the spectral method is highly non-local... Am I correct with this result, and if so, how should I understand it?
If you are sampling the signal above the Nyquist rate, then the Fourier method gives you an exact answer, because your data completely describe the signal (assuming no noise).
The central difference is a low-order approximation and so is not exact. Still, if you plot the two, they should show the same basic trends; if they look completely different, then you probably have an error somewhere.
However, an FFT is O(n log n) while finite differences are O(n), so the latter is faster (but not so much faster that it should be automatically preferred).
The Fourier approach is non-local in the sense that it reconstructs the whole signal exactly (and so uses all wavelengths).
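The two methods are easy to compare on a small band-limited periodic signal (NumPy sketch; the grid size and test frequency are arbitrary choices):

```python
import numpy as np

N = 64
x = np.arange(N)
y = np.sin(2 * np.pi * 3 * x / N)   # band-limited, periodic over the grid

# Spectral derivative: multiply each Fourier coefficient by i*k.
k = 2j * np.pi * np.fft.fftfreq(N)  # 2*pi*i times frequency in cycles/sample
dy_spec = np.fft.ifft(k * np.fft.fft(y)).real

# Central difference with periodic wrap-around.
dy_fd = (np.roll(y, -1) - np.roll(y, 1)) / 2

# Analytic derivative for comparison.
dy_exact = (2 * np.pi * 3 / N) * np.cos(2 * np.pi * 3 * x / N)
```

For this signal the spectral result matches the analytic derivative to machine precision, while the central difference is off by roughly κ³/6, with κ the signal's wavenumber per sample.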