Why is the addition operation faster than the multiplication operation? - cpu

Could someone please explain why the addition operation is faster than the multiplication operation?
For example, if we need to multiply 25 by 50,
will the compiler transform it into a for loop of additions?

Multiplication is a much more complex process, requiring more silicon (either as a multiplier circuit or as a lookup table) in order to reach the same level of performance as addition provides.

will the compiler transform it into a for loop of additions?
No, not if the processor has a multiply instruction, which most processors do.
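What a compiler typically does instead, when one factor is a constant, is strength reduction: it rewrites the multiply as a short sequence of shifts and adds. A minimal C sketch of the idea (the exact sequence a given gcc version picks will vary):

    #include <stdint.h>

    /* 25 * x = 16*x + 8*x + x, so a multiply by the constant 25 can
     * become two shifts and two adds: */
    uint32_t mul25(uint32_t x) {
        return (x << 4) + (x << 3) + x;
    }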

Related

Theoretically, is comparison between 0 and 255 faster than 0 and 1?

From the point of view of very low-level programming, how is the comparison between two numbers performed?
Using one byte, unsigned numbers 0, 1 and 255 are written:
0 -----> 00000000
1 -----> 00000001
255 ---> 11111111
Now, what happens during the comparison between these numbers?
Using my intuition as a human who has learned basic programming, I could imagine the following algorithm for implementing ==:
b = 0
while b < 8:
    if first_number[b] != second_number[b]:
        return False
    b += 1
return True
Basically this is like comparing each bit step by step, and stopping before the end if two bits differ.
Thus we note that the comparison stops at the first iteration when comparing 0 and 255, while it stops at the last iteration when comparing 0 and 1.
The first comparison would be 8 times faster than the second.
In practice, I doubt that is the case. But is this theoretically true?
If not, how does the computer work?
A comparison between integers is typically implemented by the CPU as a subtraction, whose result's sign tells which number is bigger.
While a naive implementation of subtraction processes one bit at a time (because every bit needs the carry from the preceding one), typical implementations use a carry-lookahead circuit that allows several result bits to be calculated at the same time.
So, the answer is: no, every comparison takes almost the same time for every possible input.
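To illustrate the point in software (a C sketch of the idea, not what any particular CPU literally does): the sign of the difference carries the ordering information, which is exactly what the flags register records after a subtraction.

    #include <stdint.h>

    /* Three-way compare via subtraction: the sign of (a - b) tells which
     * number is bigger. Widening to 64 bits avoids signed overflow. */
    int compare(int32_t a, int32_t b) {
        int64_t diff = (int64_t)a - (int64_t)b;
        return (diff > 0) - (diff < 0);   /* 1, 0, or -1 */
    }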
Hardware is fundamentally different from the dominant programming paradigms in that all logic gates (or circuits in general) always do their work independently, in parallel, at all times. There is no such thing as "do this, then do that", only "do this here, feed the result into the circuit over there". If there's a circuit on the chip with input A and output B, then the circuit always, continuously, updates B in accordance with the current values of A — regardless of whether the result is needed right now "in the big picture".
Your pseudocode algorithm doesn't even begin to map well to logic gates. Instead, a comparator looks like this in Verilog (ignoring that there's a built-in == operator):
assign not_equal = (a[0] ^ b[0]) | (a[1] ^ b[1]) | ...;
Where each XOR is a separate logic gate and hence works independently from the others. The results are "reduced" with a logical or, i.e. the output is 1 if any of the XORs produces a 1 (this too does some work in parallel, but the critical path is longer than one gate). Furthermore, all these gates exist in silicon regardless of the specific bit values, and the signal has to propagate through about (1 + log w) gates for a w-bit integer. This propagation delay is again independent of the intermediate and final results.
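A rough software analogue in C (a sketch only; the hardware does this with gates, not instructions): XORing whole words performs all the per-bit XORs at once, and the test against zero plays the role of the OR-reduction.

    #include <stdbool.h>
    #include <stdint.h>

    /* All 32 per-bit XORs happen in the single ^ operation; != 0 is the
     * OR-reduction over the resulting bits. */
    bool not_equal(uint32_t a, uint32_t b) {
        return (a ^ b) != 0;
    }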
On some CPU families, equality comparison is implemented by subtracting the two numbers and comparing the result to zero (using a circuit as described above), but the same principle applies. An adder/subtracter doesn't get slower or faster depending on the values.
Not to mention that instructions in a CPU can't take less than one clock cycle anyway, so even if the hardware would finish more quickly, the next instruction still wouldn't start until the next tick.
Now, some hardware operations can take a variable amount of time, but that's because they are state machines, i.e. sequential logic. Technically one could implement the moral equivalent of your algorithm with a state machine, but nobody does that, it's harder to implement than the naive, un-optimized combinatorial circuit above, and less efficient to boot.
State machine circuits are circuits with memory: They store their current state and always compute the outputs (depending on the current state) and the next state (depending on current state and inputs) each clock cycle. On some inputs they may go through N states until they produce an output, and N+x on other inputs. ALU operations generally don't do that though. Pipeline stalls, branch mispredictions, and cache misses are common reasons one instruction takes longer than usual in some circumstances. Properly reasoning about these in a way that helps programmers write faster code is hard though: You have to take into account all the tricks and quirks of real hardware, and there are a lot of those. Empirical evidence, i.e. benchmarking a real black box CPU, is vital.
When it gets down to the assembly, the cmp instruction is used regardless of the contents of the variables.
So there is no performance difference.

Ising 2D Optimization

I have implemented a MC-Simulation of the 2D Ising model in C99.
Compiling with gcc 4.8.2 on Scientific Linux 6.5.
When I scale up the grid the simulation time increases, as expected.
The implementation simply uses the Metropolis–Hastings algorithm.
I tried to find a way to speed up the algorithm, but I don't have any good ideas.
Are there any tricks for doing so?
As jimifiki wrote, try to do a profiling session.
In order to improve on the algorithmic side only, you could try the following:
Lookup Table:
When calculating the energy difference for the Metropolis criterion you need to evaluate the exponential exp(-K / T * dE), where K is your scaling constant (in units of Boltzmann's constant) and dE is the energy difference between the original state and the one after a spin flip.
Calculating exponentials is expensive.
So you simply build a table beforehand in which to look up the possible values. Each spin has four nearest neighbours, so after exploiting the problem's symmetry there are only five possible values for dE: 8, 4, 0, -4, -8 (in units of the coupling constant). Instead of calling the exp function, use the precalculated table.
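A minimal C sketch of such a table (function and variable names are mine, not from the question; K and T as defined above):

    #include <math.h>

    /* Boltzmann factors for the five possible energy differences
     * dE in {-8, -4, 0, 4, 8}; (dE + 8) / 4 maps them to slots 0..4. */
    static double boltzmann[5];

    void init_table(double K, double T) {
        for (int i = 0; i < 5; i++) {
            int dE = 4 * i - 8;              /* -8, -4, 0, 4, 8 */
            boltzmann[i] = exp(-K / T * dE);
        }
    }

    /* In the Metropolis step, instead of calling exp(-K / T * dE): */
    double acceptance(int dE) {
        return boltzmann[(dE + 8) / 4];
    }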
Parallelization:
As mentioned before, it is possible to parallelize the algorithm. To preserve physical correctness you have to use a so-called checkerboard scheme: consider the two-dimensional grid as a checkerboard and update all the white cells in parallel, then all the black ones. This works because the nearest-neighbour interaction only creates dependencies between cells of different colours; a sketch follows.
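A hypothetical C99 sketch of one checkerboard half-sweep (metropolis_update stands in for your own single-spin update; OpenMP is just one way to parallelize the loop):

    void metropolis_update(int i, int j, int L, int spin[L][L]);

    /* Update only the cells whose (i + j) parity matches `color`; these
     * share no nearest neighbours, so the loop body can run in parallel. */
    void sweep(int color, int L, int spin[L][L]) {
        #pragma omp parallel for
        for (int i = 0; i < L; i++)
            for (int j = (i + color) % 2; j < L; j += 2)
                metropolis_update(i, j, L, spin);
    }

    /* sweep(0, ...) updates the "white" cells, then sweep(1, ...) the "black". */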
Use GPGPU:
You can also implement the simulation on a GPGPU, e.g. using CUDA, if you're already working on C99.
Some tips:
- Don't forget to align C99 structs properly.
- Use linear arrays, not nested ones; properly aligned memory is normally faster to access.
- Let the compiler do loop unrolling, etc. (special gcc options, not enabled by default at -O2).
Some more information:
If you are looking for an efficient method to calculate the critical point of the system, the method of choice is finite-size scaling: simulate at different system sizes and different temperatures, then calculate a quantity that is system-size independent at the critical point, i.e. an intersection point of the corresponding curves (please see the theory for a detailed explanation).
I hope I was helpful.
Cheers...
It's normal that your simulation time scales at least with the square of the size, isn't it?
Here are some suggestions:
If you are concerned with thermalization issues, try to use parallel tempering. It can be of help.
The Metropolis-Hastings algorithm can be made parallel. You could try to do it.
Check you are not pessimizing the code.
Are your spin arrays arrays of ints? You could pack many spins into a single int, as sketched below. It's a lot of work, though.
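A hypothetical sketch of what that packing could look like (layout and names are mine):

    #include <stdint.h>

    /* 64 spins per uint64_t: bit set = spin up, bit clear = spin down. */
    static inline int get_spin(const uint64_t *s, int i) {
        return (int)((s[i / 64] >> (i % 64)) & 1u);
    }

    static inline void flip_spin(uint64_t *s, int i) {
        s[i / 64] ^= (uint64_t)1 << (i % 64);
    }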
Moreover, remember what Donald Knuth taught us:
premature optimisation is the root of all evil
Before optimising you should first understand where your program is slow. This is called profiling.

How is a prefix sum a bulk-synchronous algorithmic primitive?

Concerning NVIDIA GPUs, the author of the paper High Performance and Scalable GPU Graph Traversal says:
1. A sequence of kernel invocations is bulk-synchronous: each kernel is initially presented with a consistent view of the results from the previous.
2. Prefix-sum is a bulk-synchronous algorithmic primitive.
I am unable to understand these two points (I know GPU-based prefix sum, though). Can someone help me with this concept?
1. A sequence of kernel invocations is bulk-synchronous: each kernel is initially presented with a consistent view of the results from the previous.
It's about the parallel computation model: each processor has its own memory which is fast (like a cache in a CPU) and performs its computation using values stored there, without any synchronization. Then non-blocking communication takes place: each processor puts the data it has computed so far and gets data from its neighbours. Then there's another synchronization step, a barrier, which makes all of them wait for each other.
2. Prefix-sum is a bulk-synchronous algorithmic primitive.
I believe that's about the second step of BSP model - synchronization. That's the way processors store and get data for the next step.
The name of the model implies that it is highly concurrent (many, many processes that work synchronously relative to each other). And this is how we get to the second point.
Insofar as we want to live up to the name (be highly concurrent), we want to get rid of the sequential parts where possible. We can achieve that with a prefix sum.
Consider the prefix sum with the associative operator +. A scan of the sequence [5 2 0 3 1] then returns [0 5 7 7 10 11] (the exclusive prefix sums, with the total sum appended). So now we can replace this sequential pseudocode:
foreach i = 1...n
    foo[i] = foo[i-1] + bar(i);
with this pseudocode, which can now run in parallel(!):
foreach(i)
    baz[i] = bar(i);
scan(foo, baz);
That is a very naive version, but it's okay for explanation.
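For reference, here is what scan(foo, baz) is assumed to compute, as a sequential C sketch; on the GPU this loop is precisely what gets replaced by a parallel scan primitive:

    /* Exclusive scan: foo[i] becomes the sum of baz[0..i-1]. */
    void scan(int *foo, const int *baz, int n) {
        int sum = 0;
        for (int i = 0; i < n; i++) {
            foo[i] = sum;
            sum += baz[i];
        }
    }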

Time taken in executing % / * + - operations

Recently, I heard that the % operator is costly in terms of time.
So the question is: is there a way to find the remainder faster?
Also, your help will be appreciated if anyone can explain the difference in execution time of the % / * + - operations.
In some cases where you're using power-of-2 divisors you can do better with roll-your-own techniques for calculating remainder, but generally a halfway decent compiler will do the best job possible with variable divisors, or "odd" divisors that don't fit any pattern.
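For example (a sketch of the classic trick; a decent compiler already does this whenever the divisor is a compile-time constant), for unsigned operands a power-of-2 remainder is just a bit mask:

    #include <stdint.h>

    /* x % 8 == x & 7 for unsigned x, since the low three bits are the
     * remainder modulo 8. */
    uint32_t rem8(uint32_t x) {
        return x & 7u;
    }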
Note that a few CPUs don't even have a multiply operation, and so (on those) multiply is quite slow vs add (at least 64x for a 32-bit multiply). (But a smart compiler may improve on this if the multiplier is a literal.) A slightly larger number do not have a divide operation or have a pretty slow one. (On a CPU with a fast multiplier multiply may only be on the order of 4 times slower than add, but on "normal" hardware it's 16-32 times slower for a 32 bit operation. Divide is inherently 2-4x slower than multiply, but can be much slower on some hardware.)
The remainder operation is rarely implemented directly in hardware; normally A % B maps to something along the lines of A - ((A / B) * B) (a few extra operations may be required to fix up the sign, etc.).
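In C terms the identity is exact, because the language defines (A / B) * B + A % B == A for truncated division (real hardware and library code add the sign fix-ups mentioned above):

    /* Remainder expressed through division; b must be nonzero. */
    int remainder_via_div(int a, int b) {
        return a - (a / b) * b;
    }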
(I learned about this stuff while microprogramming the instruction set for the SUMC computer for RCA/NASA back in the early 70s.)
No, the compiler is going to implement % in the most efficient way possible.
In terms of speed, + and - are the fastest (and are equally fast, generally done by the same hardware).
*, /, and % are much slower. Multiplication is basically done by the method you learn in grade school: multiply the first number by every digit of the second number and add the results, with some shortcuts that binary makes possible (sketched below). As of a few years ago, multiply was about 3x slower than add. Division is similar in cost to multiplication, and remainder is similar to division (in fact the hardware generally calculates both at once).
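Here is the binary grade-school method as a C sketch (a software model; hardware performs the partial-product additions in parallel rather than in a loop):

    #include <stdint.h>

    /* For every set bit of b, add a correspondingly shifted copy of a. */
    uint32_t mul_shift_add(uint32_t a, uint32_t b) {
        uint32_t result = 0;
        while (b != 0) {
            if (b & 1u)
                result += a;   /* this binary "digit" of b is 1 */
            a <<= 1;
            b >>= 1;
        }
        return result;
    }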
Exact differences depend on the CPU type and exact model. You'd need to look up the latencies in the CPU spec sheets for your particular machine.

Is it possible to count the number of set bits in a number in O(1)? [duplicate]

I was asked the above question in an interview and the interviewer was very certain of the answer, but I am not sure. Can anyone help me here?
Sure. The obvious brute force method is just a big lookup table with one entry for every possible value of the input number. That's not very practical if the number is very big, but is still enough to prove it's possible.
Edit: the notion has been raised that this is complete nonsense, and the same could be said of essentially any algorithm.
To a limited degree, that's a fair statement -- but the limitations are so severe that for most algorithms it remains utterly meaningless.
My original point (at least as well as I remember it) was that population counting is about equivalent to many other operations like addition and subtraction that we normally assume are O(1).
At the hardware level, circuitry for a single-cycle POPCNT instruction is probably easier than for a single-cycle ADD instruction. Just for one example, for any practical size of data word, we can use table lookups on 4-bit chunks in parallel, then add the results from those pieces together. Even using fairly unlikely worst-case assumptions (e.g., separate storage for each of those tables) this would still be easy to implement in a modern CPU -- in fact, it's probably at least somewhat simpler than the single-cycle addition or subtraction mentioned above¹.
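A software analogue of that scheme in C (a sketch; the hardware does all the chunk lookups in parallel, while this loop does them one at a time):

    #include <stdint.h>

    /* Count set bits by looking up 4-bit chunks in a 16-entry table and
     * adding the partial counts together. */
    static const uint8_t nibble_bits[16] = {
        0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4
    };

    int popcount_table(uint64_t x) {
        int count = 0;
        for (int shift = 0; shift < 64; shift += 4)
            count += nibble_bits[(x >> shift) & 0xF];
        return count;
    }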
This is a decided contrast to many other algorithms. For one obvious example, let's consider sorting. For even the most trivial sort most people can imagine -- 2 items, 8 bits apiece, we're already at a 64 kilobyte lookup table to get constant complexity. Long before we can do even a fairly trivial sort (e.g., 100 items) we need a lookup table that contains far more data items than there are atoms in the universe.
Looking at it from the opposite direction, it's certainly true that at some point, essentially nothing is O(1) any more. Let's consider the most trivial operations possible. For an N-bit CPU, bitwise OR is normally implemented as a set of N OR gates in parallel. Unlike addition, there's no interaction between one bit and another, so for any practical size of CPU, this is easy to execute in a single instruction.
Nonetheless, if I specify a bit-wise OR in which each operand is 100 petabits, there's nothing even approaching a practical way to do the job with constant complexity. Using the usual method of parallel OR gates, we end up with (among other things) 300 petabits worth of input and output lines -- a number that completely dwarfs even the number of pins on the largest CPUs.
On reasonable hardware, doing a bitwise OR on 100 petabit operands is going to take a while (not to mention quite a bit of hard drive space). If we increase that to 200 petabit operands, the time is likely to (about) double -- so from that viewpoint, it's an O(N) operation. Obviously enough, the same is going to be true with the other "trivial" operations like addition, subtraction, bit-wise AND, bit-wise XOR, and so on.
Nonetheless, unless you have very specific instructions to say you're going to be dealing with utterly immense operands, you're typically going to treat every one of these as a constant complexity operation. Looked at in these terms, a POPCNT instruction falls about halfway between bit-wise AND/OR/XOR on one hand, and addition/subtraction on the other, in terms of the difficulty to execute in fixed time.
¹ You might wonder how it could possibly be simpler than an add when it actually includes an add after doing some other operations. If so, kudos -- it's an excellent question.
The answer is that it's because it only needs a much smaller adder. For example, a 64-bit CPU needs one half-adder and 63 full-adders. In the simple implementation, you carry out the addition bit-wise -- i.e., you add bit-0 of one operand to bit-0 of the other. That generates an output bit, and a carry bit. That carry bit becomes an input to the addition for the next pair of bits. There are some tricks to parallelize that to some degree, but the nature of the beast (so to speak) is bit-serial.
With a POPCNT instruction, we have an addition after doing the individual table lookups, but our result is limited to the size of the input words. Given the same size of inputs (64 bits) our final result can't be any larger than 64. That means we only need a 6-bit adder instead of a 64-bit adder.
Since, as outlined above, addition is basically bit-serial, this means that the addition at the end of the POPCNT instruction is fundamentally a lot faster than a normal add. To be specific, it's logarithmic on the operand size, whereas simple addition is roughly linear on the operand size.
If the bit size is fixed (e.g., the natural word size of a 32- or 64-bit machine), you can just iterate over the bits and count them directly in O(1) time (though there are certainly faster ways to do it). For arbitrary-precision numbers (BigInt, etc.), the answer must be no.
Some processors can do it in one instruction, obviously for integers of limited size. Look up the POPCNT mnemonic for further details.
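From C you don't even need to write the instruction yourself: GCC and Clang expose it as a builtin, which compiles to a single POPCNT on targets that support it (e.g. with -mpopcnt):

    #include <stdio.h>

    int main(void) {
        unsigned x = 0xF0F0u;
        printf("%d\n", __builtin_popcount(x));  /* prints 8 */
        return 0;
    }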
For integers of unlimited size obviously you need to read the whole input, so the lower bound is O(n).
The interviewer probably meant the bit counting trick (the first Google result follows): http://www.gamedev.net/topic/547102-bit-counting-trick---new-to-me/
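That is presumably the classic SWAR population count, which sums bits in progressively wider fields:

    #include <stdint.h>

    /* Sum bits in 2-bit fields, then 4-bit, then 8-bit fields; the final
     * multiply adds the four byte counts into the top byte. */
    int popcount32(uint32_t x) {
        x = x - ((x >> 1) & 0x55555555u);
        x = (x & 0x33333333u) + ((x >> 2) & 0x33333333u);
        x = (x + (x >> 4)) & 0x0F0F0F0Fu;
        return (int)((x * 0x01010101u) >> 24);
    }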
