evaluating speedup of sum of products - performance

Assuming that each elementary operation and each interprocess communication takes the same amount of time, evaluate the speedup and efficiency of a full-mesh processor network in the following cases:
a) Sum of products of 800 elements in a network of 4 processors;
b) Sum of products of 800 elements in a network of 8 processors;
c) Sum of products of 8000 elements in a network of 4 processors;
d) Sum of products of 8000 elements in a network of 8 processors.
How can I do this? I'm having trouble evaluating the parallel computation times. I calculated the sequential computation times by adding the total number of sums to the total number of products (i.e., in the first case I computed it as 799t + 800t). Is this correct?
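A minimal sketch of one way to model this, assuming every multiplication, addition, and point-to-point message costs the same unit time t, and assuming the partial sums are simply gathered at one processor (that communication scheme is my assumption, not part of the problem statement):

def times(n, p, t=1.0):
    # Sequential: n products plus (n - 1) additions
    t_seq = (2 * n - 1) * t
    # Parallel: each processor multiplies and sums its n/p chunk locally,
    # then the p - 1 remote partial sums are sent to one node and added there
    t_local = (2 * (n // p) - 1) * t
    t_combine = (p - 1) * t + (p - 1) * t      # messages + additions
    t_par = t_local + t_combine
    speedup = t_seq / t_par
    efficiency = speedup / p
    return t_seq, t_par, speedup, efficiency

for n, p in [(800, 4), (800, 8), (8000, 4), (8000, 8)]:
    print(n, p, times(n, p))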

Related

OpenMP parallel "for" with "static" schedule

I have some confusion, or maybe a misunderstanding, about the behavior of parallel for with a static schedule and the default chunk size. For example, given the picture below, what I expected is that, yes, the master thread will take an extra iteration, but I expected it to be at index 8, not 2!
My understanding was that the static schedule with the default chunk size applies round robin on (#iterations / #threads), with two cases:
If #iterations is divisible by #threads, e.g. N = 8 and #threads = 4, each thread takes an equal number of iterations in round-robin fashion (the straightforward case).
If #iterations is not divisible by #threads, it takes the nearest integer to #iterations / #threads and does the same as above.
Case of N = 9 --> 8: it will divide 2 2 2 2 and 1.
Case of N = 11 --> 12: it will be divided 3 3 3 and 2.
The threads are 0 1 2 3.
When you use static scheduling, the OpenMP implementation has to ensure that all iterations are computed by some thread even if the number of threads does not evenly divide the number of iterations.
From a load-balancing perspective the compiler will try to allocate roughly the same number of iterations to each thread and to avoid one thread receiving all of the iterations left over from the even division. So, in your example with N = 11 and four threads, the remainder is 3, and the first three threads 0..2 each get one extra iteration instead of all 3 extra iterations being assigned to the last thread.
Your expectation about the iteration distribution for a no-chunk static schedule is wrong. Round-robin distribution is only specified for schedule(static, chunk), not for schedule(static), where "at most one chunk is distributed to each thread".
Iterations 0, 1 and 8 form two chunks because they are not consecutive, and thus cannot all be assigned to the same thread. Valid distributions of 9 iterations to 4 threads would be 3-2-2-2, 2-3-2-2, 2-2-3-2, 2-2-2-3, 3-3-2-1, etc.; even 3-3-3-0 would be valid. All the OpenMP specification says is that chunks should be approximately equal in size, without specifying an exact distribution algorithm.
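To make the distribution described above concrete, here is a small sketch of one scheme consistent with the specification (contiguous blocks, with the remainder spread one extra iteration over the first few threads); this is an illustration, not the mandated algorithm:

def static_no_chunk(n_iters, n_threads):
    # Even split; the first (n_iters % n_threads) threads take one extra iteration
    q, r = divmod(n_iters, n_threads)
    sizes = [q + 1 if t < r else q for t in range(n_threads)]
    assignment, start = {}, 0
    for t, size in enumerate(sizes):
        assignment[t] = list(range(start, start + size))
        start += size
    return assignment

print(static_no_chunk(9, 4))    # {0: [0, 1, 2], 1: [3, 4], 2: [5, 6], 3: [7, 8]}, i.e. 3-2-2-2
print(static_no_chunk(11, 4))   # 3-3-3-2, matching the N = 11 example above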

applying Amdahl’s and Gustafson’s law on matrix vector multiply

I read about these laws in many threads here but still could not figure out how to apply their formulas to matrix-vector multiplication (y = y + Ax). Here I will try to explain my algorithm with respect to time:
T1 (sequential): processor zero generates vectors y and x and broadcasts them.
T2 (parallel): the matrix size (n) is divided among the processors, and each processor generates its own portion and does the multiplication. All processors then send their results to processor zero.
T3 (sequential): processor zero collects the results, orders them, and prints them.
If I run this multiple times with different matrix sizes and numbers of processors, how can I apply Amdahl's and Gustafson's laws to the results?
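One hedged way to do it (my own sketch, with the phase names T1, T2, T3 taken from the description above): estimate the serial fraction from the measured phase times, then plug it into the two formulas. Amdahl's law bounds the speedup for a fixed problem size, Gustafson's gives the scaled speedup when the problem grows with p.

def apply_laws(T1, T2, T3, p):
    # T1, T3: measured serial phases; T2: measured parallel phase of one run.
    # Note Amdahl's law strictly wants the serial fraction of the sequential
    # run time; here s is taken from the measured run as an approximation.
    serial = T1 + T3
    s = serial / (serial + T2)           # serial fraction of this run
    amdahl = 1.0 / (s + (1.0 - s) / p)   # predicted fixed-size speedup
    gustafson = s + p * (1.0 - s)        # predicted scaled speedup
    return s, amdahl, gustafson

# Example with made-up timings (seconds)
print(apply_laws(T1=0.2, T2=3.0, T3=0.3, p=8))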

CUDA: Launching many parallel calls to cuBLAS on different subsections of a matrix, without serializing

In my application, I have a double-complex N*3 matrix (where N is several thousand) and a 3*1 vector, and I am forming an N*1 vector using zgemv.
The N*3 is a subsection of a larger M*3 matrix (where M is slightly larger than N, but of the same order of magnitude).
Each thread must perform a zgemv call to a different subsection of the larger matrix. That is, the N*3 is different for every thread. But all of the N*3 are formed from some portion of the larger M*3.
There isn't enough memory for each thread to store an independent N*3. Furthermore, the M*3 is too large to fit in shared memory. Thus each thread must pull its data from a single copy of the M*3. How can I do this without millions of threads serializing memory reads to the same memory locations in the M*3? Is there a more efficient way to approach this?
Probably, based on what I can gather so far, there are 2 types of optimizations I would want to consider:
convert operations that use the same N subset to a matrix-matrix multiply (zgemm), instead of multiple zgemv operations.
cache-block for the GPU L2 cache.
I'll discuss these in reverse order using these numbers for discussion:
M: ~10,000
N: ~3,000
cublas zgemv calls: ~1e6
"typical" Kepler L2: 1.5MB
An Nx3 matrix holds approximately 10,000 elements, each of which is 16 bytes, so let's call it 160K bytes. So we could store ~5-10 of these subsets in a memory size comparable to the L2 cache size (without taking into account overlap of subsets, which would increase the residency of subsets in L2).
There are (M-N) possible unique contiguous N-row subsets in the M matrix. There are 1e6 zgemv calls, so on average each subset gets re-used 1e6/(M-N) times, approximately 100-150 times each. We could store about 10 of these subsets in the proposed L2, so we could "chunk" our 1e6 calls into "chunks" of ~1,000 calls that all operate out of the same data set.
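The arithmetic above, spelled out (the numbers are the rough figures from this discussion, not measurements):

M, N, calls = 10000, 3000, 1000000
l2_bytes = 1.5 * 2**20                       # "typical" Kepler L2
subset_bytes = N * 3 * 16                    # N*3 double-complex elements, 16 bytes each
subsets_per_l2 = l2_bytes / subset_bytes     # roughly 10 subsets resident at once
avg_reuse = calls / (M - N)                  # roughly 100-150 zgemv calls per subset
calls_per_block = subsets_per_l2 * avg_reuse # on the order of 1,000 calls per cache block
print(subsets_per_l2, avg_reuse, calls_per_block)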
Therefore the process I would follow would be:
transfer the M*3 matrix to the device
predetermine the N*3 subset needed by each thread.
sort or otherwise group like subsets together
divide the sorted sets into cache-sized blocks
for each block, launch a CDP kernel that will spawn the necessary zgemv calls
repeat the above step until all blocks are processed.
One might also wonder if such a strategy could be extended (with considerably more complexity) to L1/Texture. Unfortunately, I think CDP would confound your efforts to achieve this. It's pretty rare that people want to invest the effort to cache-block for L1 anyway.
To extend the above strategy to the gemm case, once you sort your zgemv operations by the particular N subset they require, you will have grouped like operations together. If the above arithmetic is correct, you will have on average around 100-150 gemv operations needed for each particular N-subset. You should group the corresponding vectors for those gemv operations into a matrix, and convert the 100-150 gemv operations into a single gemm operation.
This reduces your ~1e6 zgemv operations to ~1e4 zgemm operations. You can then still cache-block however many of these zgemm operations will be "adjacent" in M and fit in a single cache-block, into a single CDP kernel call, to benefit from L2 cache reuse.
Given the operational intensity of GEMM vs. GEMV, it might make sense to dispense with the complexity of CDP altogether and simply run a host loop that dispatches the ZGEMM call for a particular N subset. That host loop would iterate at most M-N times, once per distinct N-row subset.
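As a rough illustration of the grouping step, here is the host-side logic sketched with NumPy as a stand-in (the real code would call cublasZgemm once per group; the sizes and variable names here are hypothetical):

import numpy as np

M, N = 10000, 3000
A = np.random.rand(M, 3) + 1j * np.random.rand(M, 3)        # the M*3 matrix
n_ops = 1000                                                 # stand-in for ~1e6 operations
offsets = np.random.randint(0, M - N, n_ops)                 # starting row of each N*3 subset
vecs = np.random.rand(n_ops, 3) + 1j * np.random.rand(n_ops, 3)  # one 3*1 vector per op

results = {}
for off in np.unique(offsets):            # group operations that use the same subset
    idx = np.nonzero(offsets == off)[0]
    X = vecs[idx].T                       # 3 x k matrix built from the grouped vectors
    Y = A[off:off + N, :] @ X             # one gemm replaces k separate gemv calls
    for j, op in enumerate(idx):
        results[op] = Y[:, j]             # N*1 result of operation 'op'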

Minimum # of resistors algorithm

I'm trying to design a simple algorithm that takes a vector of standard resistor values along with a desired resistance value, and then works through series and parallel combinations to figure out the minimum number of standard resistors required to achieve that equivalent resistance, using any combination of series and parallel resistors, whichever takes the fewest.
Anyone got any ideas? If I wanted parallel only or series only it would be a lot easier, but not sure how to combine the two for minimum total number of resistors.
FYI, in case you don't know: total series resistance R = S1 + S2 + ... + SN, and total parallel resistance R = (1/S1 + 1/S2 + ... + 1/SN)^-1.
Perhaps a genetic algorithm would be best? I don't know the exact big-O for this, but it looks exponential: O(cⁿ).
I found this comment on another site's post; it gives the number of variations that can be attained with resistors of different values (i.e., brute force):
Networks with 1 resistor: 1
Networks with 2 resistors: 2
Networks with 3 resistors: 10
Networks with 4 resistors: 68
Networks with 5 resistors: 558
Networks with 6 resistors: 5186
Networks with 7 resistors: 53805
A genetic algorithm would avoid brute force, possibly allowing you to come to an answer much sooner. Unfortunately, it cannot guarantee answers with the minimal number of resistors. It is likely to find close equivalent resistor values with much less work, and it can be weighted so that it favours the fewest possible resistors.
I will keep researching this and post anything else I find.
Create an object to hold a resistance value, plus two resistances from which it came, plus the operation used to obtain the value from the two previous values (series or parallel).
Use some collection data structure like a Set or an ArrayList to hold resistance objects. Your set S1 initially contains just the resistors you have (networks of 1 resistor). Now create a set S2 which is all combinations (series or parallel) of an element of S1 with an element of S1. S3 is combinations of S1 and S2. S4 is combinations of S1 and S3, plus combinations of S2 and S2. Continue until you have a member of Sk which is within tolerance (1%, 5%, or 10%, say) of your target value. The resulting resistance object can be unwrapped one step at a time to find the way it was built up.
One other thing you need to consider is how the tolerances combine. Errors will propagate, so you may need 1% resistors to start in order to achieve the resistance you want at the end to a 5% tolerance, say.
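A minimal sketch of the set-building approach described above (my own illustration; the standard values, tolerance, and size cap are made-up parameters):

def find_network(standard, target, tol=0.05, max_parts=4):
    series = lambda a, b: a + b
    parallel = lambda a, b: a * b / (a + b)
    # S[k] holds (value, description) pairs for every network of k resistors
    S = {1: [(float(r), str(r)) for r in standard]}
    for k in range(1, max_parts + 1):
        if k not in S:
            S[k] = []
            for i in range(1, k // 2 + 1):        # every split i + (k - i) = k
                for v1, d1 in S[i]:
                    for v2, d2 in S[k - i]:
                        S[k].append((series(v1, v2), "(" + d1 + " + " + d2 + ")"))
                        S[k].append((parallel(v1, v2), "(" + d1 + " || " + d2 + ")"))
        for v, desc in S[k]:                      # stop at the first size within tolerance
            if abs(v - target) <= tol * target:
                return k, v, desc
    return None

print(find_network([100, 220, 470, 1000], 152))   # e.g. (2, 149.86..., '(220 || 470)')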

Perfect powers of numbers which can fit in 64 bit size integer (using priority queues)

How can we print out all perfect powers that can be represented as 64-bit long integers: 4, 8, 9, 16, 25, 27, ...? A perfect power is a number that can be written as a^b for integers a and b ≥ 2.
It's not a homework problem; I found it in the job-interview-questions section of an algorithm design book. Hint: the chapter was based on priority queues.
Most of the ideas I have are quadratic in nature: they keep finding powers until they stop fitting in 64 bits, but that's not what an interviewer would look for. Also, I can't understand how PQs would help here.
Using a small priority queue, with one entry per power, is a reasonable way to list the numbers. See the following Python code.
import Queue  # in Python 3 say: queue

pmax, vmax = 10, 150
Q = Queue.PriorityQueue(pmax)

# Seed the queue with the powers of 2, as (value, base, exponent) triples
p = 2
for e in range(2, pmax):
    p *= 2
    Q.put((p, 2, e))

print 1, 1, 2
while not Q.empty():
    (v, b, e) = Q.get()
    if v < vmax:
        print v, b, e
        # Re-insert the next base with the same exponent
        b += 1
        Q.put((b**e, b, e))
With pmax, vmax as in the code above, it produces the following output. For the proposed problem, replace pmax and vmax with 64 and 2**64.
1 1 2
4 2 2
8 2 3
9 3 2
16 2 4
16 4 2
25 5 2
27 3 3
32 2 5
36 6 2
49 7 2
64 2 6
64 4 3
64 8 2
81 3 4
81 9 2
100 10 2
121 11 2
125 5 3
128 2 7
144 12 2
The complexity of this method is O(vmax^0.5 * log(pmax)). This is because the number of perfect squares is dominant over the number of perfect cubes, fourth powers, etc., and for each square we do O(log(pmax)) work for get and put queue operations. For higher powers, we do O(log(pmax)) work when computing b**e.
When pmax, vmax = 64, 2**64, there will be about 2*(2^32 + 2^21 + 2^16 + 2^12 + ...) queue operations, i.e. about 2^33 queue ops.
Added note: This note addresses cf16's comment, "one remark only, I don't think "the number of perfect squares is dominant over the number of perfect cubes, fourth powers, etc." they all are infinite. but yes, if we consider finite set". It is true that in the overall mathematical scheme of things, the cardinalities are the same. That is, if P(j) is the set of all j'th powers of integers, then the cardinality of P(j) equals the cardinality of P(k) for all integers j, k > 0. Elements of any two sets of powers can be put into 1-1 correspondence with each other.
Nevertheless, when computing perfect powers in ascending order, no matter how many are computed, finite or not, the work of delivering squares dominates that for any other power. For any given x, the density of perfect kth powers in the region of x declines exponentially as k increases: it is proportional to (x^(1/k))/x, hence third powers, fourth powers, etc. become vanishingly rare compared to squares as x increases.
As a concrete example, among perfect powers between 1e8 and 1e9 the number of (2; 3; 4; 5; 6)th powers is about (21622; 535; 77; 24; 10). There are more than 30 times as many squares between 1e8 and 1e9 than there are instances of any higher powers than squares. Here are ratios of the number of perfect squares between two numbers, vs the number of higher perfect powers: 10¹⁰–10¹⁵, r≈301; 10¹⁵–10²⁰, r≈2K; 10²⁰–10²⁵, r≈15K; 10²⁵–10³⁰, r≈100K. In short, as x increases, squares dominate more and more when perfect powers are delivered in ascending order.
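These counts are easy to check; a quick sketch that counts k-th powers strictly between two bounds:

def count_powers(lo, hi, k):
    # number of integers n >= 2 with lo < n**k < hi
    n, c = 2, 0
    while n ** k < hi:
        if n ** k > lo:
            c += 1
        n += 1
    return c

print([count_powers(10**8, 10**9, k) for k in range(2, 7)])
# -> [21622, 535, 77, 24, 10]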
A priority queue helps, for example, if you want to avoid duplicates in the output, or if you want to list the values in sorted order.
Priority queues can often be replaced by sorting, and vice versa. You could therefore generate all combinations of a^b, then sort the results and remove adjacent duplicates. In this application, this approach appears to be slightly but perhaps not dramatically memory-inefficient, as witnessed by one of the sister answers.
A priority queue can be superior to sorting, if you manage to remove duplicates as you go; or if you want to avoid storing and processing the whole result to be generated in memory. The other sister answer is an example of the latter but it could easily do both with a slight modification.
Here it makes the difference between an array taking up ~16 GB of RAM and a queue with fewer than 64 items taking up several kilobytes at worst. Such a huge difference in memory consumption also translates into the difference between RAM access time and cache access time, so the memory-lean algorithm may end up much faster, even though the underlying data structure incurs some overhead maintaining itself and needs more instructions than the naive algorithm that uses sorting.
Because the size of the input is fixed, it is not technically possible for the methods you thought of to be quadratic in nature. Having two nested loops does not make an algorithm quadratic unless you can say that the upper bound of each such loop is proportional to the input size (and often not even then). What really matters is how many times the innermost logic actually executes.
In this case the competition is between feasible constants and non-feasible constants.
The only way I can see the priority queue making much sense is if you want to print numbers as they become available, in strictly increasing order, and of course without printing any number twice. So you start off with a prime generator (one that uses the Sieve of Eratosthenes or some smarter technique to generate the sequence 2, 3, 5, 7, 11, ...). You start by putting a triple representing the fact that 2^2 = 4 onto the queue. Then you repeat a process of removing the smallest item (the triple with the smallest exponentiation result) from the queue, printing it, increasing the exponent by one, and putting it back onto the queue (with its priority determined by the result of the new exponentiation). You interleave this process with one that generates new primes as needed (sometime before p^2 is due to be output).
Since the largest base we can possibly have is 2^32 (because (2^32)^2 = 2^64), the number of elements on the queue shouldn't exceed the number of primes less than 2^32, which is evidently 203,280,221 - a tractable number, I guess.

Resources