In quantum computers, these two effects should be observable:
1) If an operator acts on an arbitrary QuBit Qn of a quantum system S consisting of several QuBits, then we get a new quantum system S' where the amplitudes of ALL QuBits have changed.
2) If an operator acts on one QuBit of a quantum system T consisting of two entangled QuBits, then both QuBits are affected.
So which one of these is the reason for the exponential speed-up expected from quantum computers?
Q : "...which one of these is the reason...?"
None.
As of 2020-Q2, the Q-devices available so far operate in constant [TIME], i.e. with an O(1) (in)-dependence of the TimeDOMAIN duration of operations on the problem complexity, so neither the isolated assumption 1) nor the isolated assumption 2) makes any difference on its own; both are simply part of the quantum Level-of-Detail-observed World as we know it, which is ( by design, intrinsically ) Q-[PARALLEL].
Actually, the World as we know it is the Q-SpaceTime continuum itself, internally discrete in both Time and Space at a LoD we will never have trouble with in any foreseeable Q-device computing, if we do not consider The Universe to be a Q-device of its own kind, which it obviously is ;)
All happens "now", not one step after another ( as if sequenced in some pipelined fashion ). ( The full depth of this topic far exceeds the format of this Q/A site. )
Sure, pedantic and Q-orthodox users may claim a need for some [SPACE]-Domain downscaling tricks due to the limits of current Q-devices' physical Q-engines, yet even these retain the O(1) (in)-dependence as the asymptotic-complexity model of QPU-based computing.
Q : "...the exponential speed-up expected from quantum computers?"
Given the above, there is but a market-making motivation to call Q-devices "expected" to have (cit.) an "exponential speed-up".
Given that the target Q-device operates with O(1) scaling, all previously known technologies are being compared against a constant-Time "processing".
In this context, the category of Speedup will show "the better factor of comparison" the worse the original processing is (or was).
The Q-device ( no matter how easy or bad the original processing was ) will be, and will always remain, a constant-Time Q-"processor".
The BONUS Part :
So, the O(1) Q-device will always show itself best against the worst-scaling original problem-processing strategy, and exponential scaling is not even our worst enemy in the Complexity ZOO. Similarly, the very same Q-device will show itself as a poor neighbour once compared to any currently known O(1) processing, which may and often will outperform any Q-device thanks to having small, if not altogether absent, initial-Setup / Result-detection / termination Gap(s) - the very overheads known as the Q-device's principal add-on processing latencies ( which play no significant role in the former case, where the nature of the paradigm shift in the field of complexity works strongly against any classical device(s) ).
Related
Can someone please explain the difference between a vector and an array processor, which one encounters when learning about the computer architecture involved in parallel programming?
One of the sources I referred to says that a vector processor is also called an array processor. It's a bit confusing. Thank you in advance!
See "Supercomputer Languages" by R H Perrott and A Zarea-Aliabadi, ACM Computing Surveys, Vol 18, No 1, March 1986, pp. 7-8.
Some consider vector processors as SIMD computers and some don't. If vector processors are SIMD computers, they are pipelined SIMD, whereas array processors are simultaneous SIMD.
Q : Can someone…explain the difference between a vector and an array processor…?
Yes. There is none.
Arrays are but human abstractions. Common computers "see" vectors: weakly organised sections of their physical address space, treated as a contiguous "block" space, in which a vector's data reside cell by cell and get stored into and fetched from memory. That is expensive; as of 2020/Q1 some CPUs may enjoy a few SIMD-instruction tricks to move data to/from memory at once in their hardware-SIMD-specific blocks ( vectors of a few data cells, currently not more than AVX-512 512-bit blocks, so at most eight float64-s or sixteen float32-s, etc. ).
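For illustration only, a minimal sketch of such a hardware-SIMD block move/operation, assuming an AVX-512F-capable CPU and compiler flag (the helper name add_blocks is made up for this example):

#include <immintrin.h>   // Intel SIMD intrinsics (assumes AVX-512F support, e.g. compile with -mavx512f)

// Hypothetical helper: add two blocks of sixteen float32 cells per step, moved to/from memory at once.
void add_blocks(const float* a, const float* b, float* out, int n)
{
    for (int i = 0; i + 16 <= n; i += 16)                  // one 512-bit block = 16 float32-s
    {
        __m512 va = _mm512_loadu_ps(a + i);                // fetch 16 cells in one go
        __m512 vb = _mm512_loadu_ps(b + i);
        _mm512_storeu_ps(out + i, _mm512_add_ps(va, vb));  // store 16 results in one go
    }
    // a real implementation would also handle the n % 16 tail elements
}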
So the data-"representation" is code-based, not hardware-based. Arrays, or tensors in a slightly more generalised view, are stored in linearly addressed memory just by convention - being either:
the Fortran convention (column-major, having the first index changing fastest), or
the C-language convention (row-major, having the last index changing fastest),
so such arrays end up as one-after-another aligned vector slices: weakly, code-controlled sequences of vectors ( each vector itself being but a weakly code-controlled, contiguous linear sequence of data cells in directly addressable memory ).
( not speaking here, knowingly, about sparse-{vector|array|tensor} representation formats, where the non-zero elements' addressing conventions - adopted for the sake of memory saving, at the cost of slower access and cache-line inefficiencies, and, due to the nature of sparsity, often relying on indirect addressing tricks - may go far too wild to touch them here in brief )
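As a minimal sketch of the two conventions above (the helper names are illustrative only), the linear offset of element (r, c) in an n_rows x n_cols array is:

#include <cstddef>

// C convention (row-major): the last index changes fastest in memory.
std::size_t index_c(std::size_t r, std::size_t c, std::size_t n_cols)
{
    return r * n_cols + c;
}

// Fortran convention (column-major): the first index changes fastest in memory.
std::size_t index_fortran(std::size_t r, std::size_t c, std::size_t n_rows)
{
    return c * n_rows + r;
}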
Known Exceptions :
If needed, SoC and FPGA devices may be designed to go the extra mile for indeed hardware-supported {matrix|tensor} processing, yet all of that comes at extra cost. Human invention plus clever, energy-efficient silicon-level designs for such processors' matrix-uops, together with the extremely increased on-die and off-die data-I/O bandwidths needed for matrix processing that is both reasonably large ( O(N^2) sizes ) and fast, are rather expensive. So do not expect any great magic even from these ultimately niche use-case approaches, where rather limited, almost ASIC-like product-market sizes restrict how such increased costs can be recovered ( consumer-electronics markets are not seeking these, while MIL / R&D / FinTech / other special segments can pay for such extreme, non-COTS solutions ).
Today, my colleague and I had a small argument about one particular code snippet. The code looks something like this. At least, this is what he imagined it to be.
for(int i = 0; i < n; i++) {
// Some operations here
}
for (int i = 0; i < m; i++) { // m is always small
// Some more operations here
}
He wanted me to remove the second loop, since it would cause performance issues.
However, I was sure that since there are no nested loops here, the complexity will always be O(n), no matter how many sequential loops I put (we only had 2).
His argument was that if n is 1,000,000 and the loop takes 5 seconds, my code will take 10 seconds, since it has 2 for loops. I was confused after this statement.
What I remember from my DSA lessons is that we ignore such constants while calculating Big Oh.
What am I missing here?
Yes,
complexity theory may help to compare two distinct methods of calculation in [?TIME][?SPACE],
but
Do not use [PTIME] complexity as an argument about poor efficiency
Fact #1: O( f(N) ) is relevant for comparing complexities in regions near N ~ INFTY, so that the principal limits of a process can be compared "there"
Fact #2: Given N ~ { 10k | 10M | 10G }, none of these cases meets the condition cited above
Fact #3: If the process ( algorithm ) allows the loops to be merged into a single pass without any side-effects ( on resources / blocking / etc. ), the single-loop processing may always benefit from the reduced looping overheads.
A micro benchmark will decide, not the O( f( N ) ) for N ~ INFTY
as many additional effects gain a stronger influence - better or worse cache-line alignment, the amount of possible L1/L2/L3-cache re-use, smarter or poorer harnessing of the CPU registers - all of which is driven by possible compiler optimisations and may further increase code-execution speed for small N-s, beyond any expectation from the above.
So,
do perform several scaling-dependent microbenchmarks before resorting to arguments about the limits of O( f( N ) )
Always do.
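A minimal sketch of such a scaling-dependent microbenchmark (a trivial summing loop stands in for the real loop bodies; real code would time its actual workload):

#include <chrono>
#include <cstdio>
#include <vector>

int main()
{
    // measure at several scales - the relative cost of the looping overhead changes with N
    for (std::size_t n : { 1000, 1000000, 10000000 })
    {
        std::vector<int> data(n, 1);
        volatile long long sink = 0;   // keeps the optimiser from deleting the loops

        auto t0 = std::chrono::steady_clock::now();
        long long s = 0;
        for (std::size_t i = 0; i < n; ++i) s += data[i];       // first loop
        for (std::size_t i = 0; i < n; ++i) s += data[i];       // second loop
        sink = s;
        auto t1 = std::chrono::steady_clock::now();

        s = 0;
        for (std::size_t i = 0; i < n; ++i) s += 2 * data[i];   // the same work merged into a single pass
        sink = s;
        auto t2 = std::chrono::steady_clock::now();

        std::printf("n = %zu  two loops: %lld us  merged: %lld us\n", n,
                    (long long)std::chrono::duration_cast<std::chrono::microseconds>(t1 - t0).count(),
                    (long long)std::chrono::duration_cast<std::chrono::microseconds>(t2 - t1).count());
    }
    return 0;
}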
In asymptotic notation, your code has time complexity O(n + n) = O(2n) =
O(n)
Side note:
If the first loop takes n iterations and the second loop m, then the time complexity would be O(n + m).
PS: I assume that the bodies of your for loops are not heavy enough to affect the overall complexity, as you mentioned too.
You may be confusing time complexity and performance. These are two different (but related) things.
Time complexity deals with comparing the rate of growth of algorithms and ignores constant factors and messy real-world conditions. These simplifications make it a valuable theoretical framework for reasoning about algorithm scalability.
Performance is how fast code runs on an actual computer. Unlike in Big O-land, constant factors exist and often play a dominant role in determining execution time. Your coworker is reasonable to acknowledge this. It's easy to forget that O(1000000n) is the same as O(n) in Big O-land, but to an actual computer, the constant factor is a very real thing.
The bird's-eye view that Big O provides is still valuable; it can help determine if your coworker is getting lost in the details and pursuing a micro-optimization.
Furthermore, your coworker considers simple instruction counting as a step towards comparing actual performance of these loop arrangements, but this is still a major simplification. Consider cache characteristics; out-of-order execution potential; friendliness to prefetching, loop unrolling, vectorization, branch prediction, register allocation and other compiler optimizations; garbage collection/allocation overhead and heap vs stack memory accesses as just a few of the factors that can make enormous differences in execution time beyond including simple operations in the analysis.
For example, if your code is something like
for (int i = 0; i < n; i++) {
foo(arr[i]);
}
for (int i = 0; i < m; i++) {
bar(arr[i]);
}
and n is large enough that arr doesn't fit neatly in the cache (perhaps elements of arr are themselves large, heap-allocated objects), you may find that the second loop has a dramatically harmful effect due to having to bring evicted blocks back into the cache all over again. Rewriting it as
for (int i = 0, end = max(n, m); i < end; i++) {
if (i < n) {
foo(arr[i]);
}
if (i < m) {
bar(arr[i]);
}
}
may have a disproportionate efficiency increase because blocks from arr are brought into the cache once. The if statements might seem to add overhead, but branch prediction may make the impact negligible, avoiding pipeline flushes.
Conversely, if arr fits in the cache, the second loop's performance impact may be negligible (particularly if m is bounded and, better still, small).
Yet again, what is happening in foo and bar could be a critical factor. There simply isn't enough information here to tell which is likely to run faster by looking at these snippets, simple as they are, and the same applies to the snippets in the question.
In some cases, the compiler may have enough information to generate the same code for both of these examples.
Ultimately, the only hope to settle debates like this is to write an accurate benchmark (not necessarily an easy task) that measures the code under its normal working conditions (not always possible) and evaluate the outcome against other constraints and metrics you may have for the app (time, budget, maintainability, customer needs, energy efficiency, etc...).
If the app meets its goals or business needs either way, it may be premature to debate performance. Profiling is a great way to determine if the code under discussion is even a problem. See Eric Lippert's Which is Faster?, which makes a strong case for (usually) not worrying about this sort of thing.
This is a benefit of Big O--if two pieces of code only differ by a small constant factor, there's a decent chance it's not worth worrying about until it proves to be worth attention through profiling.
Recently, I heard that the % operator is costly in terms of time.
So the question is: is there a way to find the remainder faster?
Also, your help will be appreciated if anyone can explain the difference in execution cost between the %, /, *, + and - operations.
In some cases where you're using power-of-2 divisors you can do better with roll-your-own techniques for calculating remainder, but generally a halfway decent compiler will do the best job possible with variable divisors, or "odd" divisors that don't fit any pattern.
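A minimal sketch of the roll-your-own power-of-2 trick mentioned above (valid only for unsigned values and a divisor that really is a power of two):

#include <cstdint>

// For a power-of-2 divisor d, x % d is the same as masking off the low bits.
uint32_t remainder_pow2(uint32_t x, uint32_t d /* must be a power of two */)
{
    return x & (d - 1);    // e.g. x % 8 == x & 7
}

A decent compiler already emits exactly this mask when the power-of-2 divisor is a compile-time constant, which is part of why it is hard to beat the compiler outside of special cases.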
Note that a few CPUs don't even have a multiply operation, and so (on those) multiply is quite slow vs add (at least 64x for a 32-bit multiply). (But a smart compiler may improve on this if the multiplier is a literal.) A slightly larger number do not have a divide operation or have a pretty slow one. (On a CPU with a fast multiplier multiply may only be on the order of 4 times slower than add, but on "normal" hardware it's 16-32 times slower for a 32 bit operation. Divide is inherently 2-4x slower than multiply, but can be much slower on some hardware.)
The remainder operation is rarely implemented in hardware, and normally A % B maps to something along the lines of A - ((A / B) * B) (a few extra operations may be required to assure the proper sign, et al).
(I learned about this stuff while microprogramming the instruction set for the SUMC computer for RCA/NASA back in the early 70s.)
No, the compiler is going to implement % in the most efficient way possible.
In terms of speed, + and - are the fastest (and are equally fast, generally done by the same hardware).
*, /, and % are much slower. Multiplication is basically done by the method you learn in grade school: multiply the first number by every digit of the second number and add the results, with some shortcuts made possible by binary. As of a few years ago, multiply was about 3x slower than add. Division should be similar to multiply. Remainder is similar to division (in fact, the hardware generally calculates both at once).
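Purely as an illustration of that grade-school method in binary (a sketch of the idea, not of how any particular CPU implements its multiplier):

#include <cstdint>

// Shift-and-add multiplication: for every set bit ("digit") of b,
// add a copy of a shifted to that bit's position.
uint32_t shift_add_multiply(uint32_t a, uint32_t b)
{
    uint32_t result = 0;
    while (b != 0)
    {
        if (b & 1u)          // current binary digit of b is 1
            result += a;     // add the correspondingly shifted copy of a
        a <<= 1;             // the next digit is worth twice as much
        b >>= 1;
    }
    return result;           // same (mod 2^32) result as a * b
}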
Exact differences depend on the CPU type and exact model. You'd need to look up the latencies in the CPU spec sheets for your particular machine.
I was asked in an interview whether the number of set bits in an integer can be counted in O(1) time, and the interviewer is very certain of the answer. But I am not sure. Can anyone help me here?
Sure. The obvious brute force method is just a big lookup table with one entry for every possible value of the input number. That's not very practical if the number is very big, but is still enough to prove it's possible.
Edit: the notion has been raised that this is complete nonsense, and the same could be said of essentially any algorithm.
To a limited degree, that's a fair statement -- but the limitations are so severe that for most algorithms it remains utterly meaningless.
My original point (at least as well as I remember it) was that population counting is about equivalent to many other operations like addition and subtraction that we normally assume are O(1).
At the hardware level, circuitry for a single-cycle POPCNT instruction is probably easier than for a single-cycle ADD instruction. Just for one example, for any practical size of data word, we can use table lookups on 4-bit chunks in parallel, then add the results from those pieces together. Even using fairly unlikely worst-case assumptions (e.g., separate storage for each of those tables) this would still be easy to implement in a modern CPU -- in fact, it's probably at least somewhat simpler than the single-cycle addition or subtraction mentioned above¹.
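In software terms, the 4-bit-chunk table-lookup idea described above looks roughly like this (a sketch of the principle, not of any CPU's actual circuitry):

#include <cstdint>

// Population count via 4-bit table lookups: look up each nibble's bit count,
// then add the sixteen small partial results together.
int popcount_nibble_table(uint64_t x)
{
    static const uint8_t table[16] = { 0,1,1,2, 1,2,2,3, 1,2,2,3, 2,3,3,4 };
    int total = 0;
    for (int shift = 0; shift < 64; shift += 4)
        total += table[(x >> shift) & 0xF];
    return total;            // never exceeds 64, so a 6-bit adder is enough in hardware
}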
This is a decided contrast to many other algorithms. For one obvious example, let's consider sorting. For even the most trivial sort most people can imagine -- 2 items, 8 bits apiece, we're already at a 64 kilobyte lookup table to get constant complexity. Long before we can do even a fairly trivial sort (e.g., 100 items) we need a lookup table that contains far more data items than there are atoms in the universe.
Looking at it from the opposite direction, it's certainly true that at some point, essentially nothing is O(1) any more. Let's consider the most trivial operations possible. For an N-bit CPU, bitwise OR is normally implemented as a set of N OR gates in parallel. Unlike addition, there's no interaction between one bit and another, so for any practical size of CPU, this is easy to execute in a single instruction.
Nonetheless, if I specify a bit-wise OR in which each operand is 100 petabits, there's nothing even approaching a practical way to do the job with constant complexity. Using the usual method of parallel OR gates, we end up with (among other things) 300 petabits worth of input and output lines -- a number that completely dwarfs even the number of pins on the largest CPUs.
On reasonable hardware, doing a bitwise OR on 100 petabit operands is going to take a while (not to mention quite a bit of hard drive space). If we increase that to 200 petabit operands, the time is likely to (about) double -- so from that viewpoint, it's an O(N) operation. Obviously enough, the same is going to be true with the other "trivial" operations like addition, subtraction, bit-wise AND, bit-wise XOR, and so on.
Nonetheless, unless you have very specific instructions to say you're going to be dealing with utterly immense operands, you're typically going to treat every one of these as a constant complexity operation. Looked at in these terms, a POPCNT instruction falls about halfway between bit-wise AND/OR/XOR on one hand, and addition/subtraction on the other, in terms of the difficulty to execute in fixed time.
¹ You might wonder how it could possibly be simpler than an add when it actually includes an add after doing some other operations. If so, kudos -- it's an excellent question.
The answer is that it's because it only needs a much smaller adder. For example, a 64-bit CPU needs one half-adder and 63 full-adders. In the simple implementation, you carry out the addition bit-wise -- i.e., you add bit-0 of one operand to bit-0 of the other. That generates an output bit, and a carry bit. That carry bit becomes an input to the addition for the next pair of bits. There are some tricks to parallelize that to some degree, but the nature of the beast (so to speak) is bit-serial.
With a POPCNT instruction, we have an addition after doing the individual table lookups, but our result is limited to the size of the input words. Given the same size of inputs (64 bits) our final result can't be any larger than 64. That means we only need a 6-bit adder instead of a 64-bit adder.
Since, as outlined above, addition is basically bit-serial, this means that the addition at the end of the POPCNT instruction is fundamentally a lot faster than a normal add. To be specific, it's logarithmic on the operand size, whereas simple addition is roughly linear on the operand size.
If the bit size is fixed (e.g. natural word size of a 32- or 64-bit machine), you can just iterate over the bits and count them directly in O(1) time (though there are certainly faster ways to do it). For arbitrary precision numbers (BigInt, etc.), the answer must be no.
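For a fixed word size, that direct iteration is simply (a trivial sketch for 32-bit words):

#include <cstdint>

// The loop always runs exactly 32 times, hence O(1) for this fixed word size.
int popcount_loop(uint32_t x)
{
    int count = 0;
    for (int i = 0; i < 32; ++i)
        count += (x >> i) & 1u;
    return count;
}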
Some processors can do it in one instruction, obviously for integers of limited size. Look up the POPCNT mnemonic for further details.
For integers of unlimited size obviously you need to read the whole input, so the lower bound is O(n).
The interviewer probably meant the bit counting trick (the first Google result follows): http://www.gamedev.net/topic/547102-bit-counting-trick---new-to-me/
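The trick behind that link is most likely the classic SWAR population count, sketched here for a 32-bit word (I cannot verify which exact variant the linked thread shows):

#include <cstdint>

// SWAR population count: a fixed number of shifts, masks, adds and one
// multiply - no loop over the individual bits.
int popcount_swar(uint32_t v)
{
    v = v - ((v >> 1) & 0x55555555u);                   // 2-bit partial counts
    v = (v & 0x33333333u) + ((v >> 2) & 0x33333333u);   // 4-bit partial counts
    v = (v + (v >> 4)) & 0x0F0F0F0Fu;                   // 8-bit partial counts
    return (v * 0x01010101u) >> 24;                     // sum the four bytes
}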
I have to solve a huge linear system for multiple right-hand sides (say 20 to 200). The matrix is stored in a sparse format and distributed over multiple MPI nodes (say 16 to 64). I run a CG solver on the rank-0 node. It is not possible to solve the system directly, because the system matrix would be dense (Sys = A^T * S * A).
The basic Matrix-Vector multiplication is implemented as:
broadcast x
y = A_part * x
reduce y
While the collective operations are reasonably fast (OpenMPI seems to use a binary-tree-like communication pattern + InfiniBand), they still account for quite a large part of the runtime. For performance reasons we already calculate 8 right-hand sides per iteration (basically SpM * DenseMatrix, just to be complete).
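A minimal sketch of that broadcast / local multiply / reduce step in MPI terms (the names distributed_matvec and spmv_local are placeholders, not part of the original code):

#include <mpi.h>
#include <vector>

// One distributed matrix-vector product: broadcast x from rank 0,
// multiply by the locally stored part of A, reduce the partial results back to rank 0.
void distributed_matvec(std::vector<double>& x, std::vector<double>& y, int n,
                        void (*spmv_local)(const double*, double*, int))   // placeholder local SpMV
{
    MPI_Bcast(x.data(), n, MPI_DOUBLE, 0, MPI_COMM_WORLD);   // rank 0 owns the CG iteration

    std::vector<double> y_part(n, 0.0);
    spmv_local(x.data(), y_part.data(), n);                  // y_part = A_part * x

    // y is only meaningful on rank 0 after the reduction
    MPI_Reduce(y_part.data(), y.data(), n, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
}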
I'm trying to come up with a good scheme to hide the communication latency, but I have not had a good idea yet. I also try to refrain from doing 1:n communication, although I have not yet measured whether scaling would be a problem.
Any suggestions are welcome!
If your matrix is already distributed, would it be possible to use a distributed sparse linear solver instead of running it only on rank 0 and then broadcasting the result (if I'm reading your description correctly)? There are plenty of libraries for that, e.g. SuperLU_DIST, MUMPS, PARDISO, Aztec(OO), etc.
The "multiple rhs" optimization is supported by at least SuperLU and MUMPS (haven't checked the others, but I'd be VERY surprised if they didn't support it!), since they solve AX=B where X and B are matrices with potentially > 1 column. That is, each "rhs" is stored as a column vector in B.
If you don't need the results of an old right-hand side before starting the next run, you could try to use non-blocking communication (ISend, IRecv) and communicate the result while already calculating the next right-hand side.
But make sure you call MPI_Wait before reading the content of the communicated array, in order to be sure you're not reading "old" data.
If the matrices are big enough (i.e. it takes long enough to calculate the matrix product), you don't have any communication delay at all with this approach.
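A sketch of that overlap, here using the MPI-3 non-blocking collective MPI_Ireduce instead of hand-rolled ISend/IRecv (compute_partial and the buffer names are placeholders for whatever the real code does):

#include <mpi.h>
#include <vector>

// Overlap the reduction of right-hand side k with the local work for k+1.
void pipelined_results(int n, int num_rhs,
                       void (*compute_partial)(int rhs, double* y_part, int n))
{
    std::vector<double> y_part(n), y_send(n), y_result(n);
    MPI_Request req = MPI_REQUEST_NULL;

    for (int k = 0; k < num_rhs; ++k)
    {
        compute_partial(k, y_part.data(), n);              // local work for right-hand side k

        if (req != MPI_REQUEST_NULL)
        {
            MPI_Wait(&req, MPI_STATUS_IGNORE);             // result k-1 has fully arrived only now
            // rank 0 would consume y_result (right-hand side k-1) here
        }

        y_send.swap(y_part);                               // hand the fresh partial result to the send buffer
        MPI_Ireduce(y_send.data(), y_result.data(), n, MPI_DOUBLE,
                    MPI_SUM, 0, MPI_COMM_WORLD, &req);     // overlaps with the next iteration's computation
    }

    if (req != MPI_REQUEST_NULL)
    {
        MPI_Wait(&req, MPI_STATUS_IGNORE);                 // flush the last outstanding reduction
        // rank 0 consumes the final y_result here
    }
}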