OpenMP parallel "for" with "static" schedule

OpenMP parallel "for" with "static" schedule - parallel-processing

I've a confusion or maybe a misunderstanding of the parallel for behavior with a static schedule and default chunk size.
For example the below picture What I excepted to have is yes the
master thread will take an extra iteration but I excepted it would be
at index 8 not 2!
The static schedule algorithm with default chunk size applies the
round robin on the (#iterations / #threads) with 2 cases
If the #iterations is divisible by #threads like N=8 and #threads = 4. each thread will take an equal amount of iterations in round-robin fashion (straight forward case)
If the #iterations is not divisible by #threads. It will calculate the nearest integer of iterations divided by #threads and do the same as above
case of N=9 --> 8 it will divide 2 2 2 2 and 1
case of N=11 --> 12 it will be divided 3 3 3 and 2
threads are 0 1 2 3

When you use static scheduling, the OpenMP implementation will have to ensure that all iterations are computed by some thread if the number of threads does not evenly divide the number iterations.
From a load balancing perspective the compiler will try to allocate roughly the same number of iterations to each thread and to avoid that one thread receives all remaining iterations that are in excess of the division. So, in your example with N=11 and four threads, the remainder will be 3 and the first three threads 0..2 will get one extra iteration instead of assign 3 extra iterations to the last thread.

Your expectation on iterations distribution for no-chunk static schedule is wrong. Round robin distribution only specified for schedule(static,chunk), not for schedule(static) when "at most one chunk is distributed to each thread".
Iterations 0, 1 and 8 form two chunks because they are not consecutive, thus cannot be assigned to the same thread. Valid distributions of 9 iterations to 4 threads would be 3-2-2-2, 2-3-2-2, 2-2-3-2, 2-2-2-3, 3-3-2-1, etc, even 3-3-3-0 would be valid. All the OpenMP specification says is that chunks should be approximately equal in size, without specifying exact distribution algorithm.

Related

How MATLAB divide the number of iterations of "parfor" on the workers of one computer?

Please, I have a question, how MATLAB divide the number of iterations of "parfor" on the workers of the computer?
In the following example of Mathworks in the picture, as I understood, they mentioned when the number of iteration is 10 and the number of workers is 4, the first three workers take the 2 iterations equally then the remaining four iterations divide into the four. That is mean the first three workers take 3 iterations and the last worker take just one iteration.
Please, could anyone correct me if I am wrong ! And please, explain to me how MATLAB divide the number of iterations, i.e. even or odd ?
Lease, if I have this case how MATLAB divide the iterations ?
1. If the number of iteration is 40 and we have 4 workers.
2. If the number of iteration is 40 and we have 5 workers.
3. If the number of iteration is 40 and we have 8 workers.
4. If the number of iteration is 40 and we have 12 workers.
Kind regards
Ammar

Yes, I would like to know the answer as well. Because we do not know what exactly the idea of division the number of iterations on the number of workers.
Kind regards
Fahdi

Dynamic Programming optimal broadcast

I know how to solve this problem in an usual way, not using dynamic programming.
If you could be kind enough to explain to me the solution/give me a general idea/ pseudocode. Thanks a bunch.
The input consists of a sequence R = hR0, . . . ,Rni of non-negative integers, and an integer k. The number Ri represents the number of users requesting some particular piece of information at time i (say from a www server).
If the server broadcasts this information at some time t, the the requests of all the users who requested the information
strictly before time t are satisfied. The server can broadcast this information at most k times. The goal is to pick the k times to broadcast in order to minimize the total time (over all requests) that requests/users have to wait in order
to have their requests satisfied.
As an example, assume that the input was R = 3, 4, 0, 5, 2, 7 (so n = 6) and k = 3. Then one possible solution
(there is no claim that this is the optimal solution) would be to broadcast at times 2, 4, and 7 (note that it is obvious
that in every optimal schedule that there is a broadcast at time n + 1 if Rn 6= 0). The 3 requests at time 1 would
then have to wait 1 time unit. The 4 requests at time 2 would then have to wait 2 time units. The 5 requests at
time 4 would then have to wait 3 time units. The 2 requests at time 5 would then have to wait 2 time units. The
7 requests at time 6 would then have to wait 1 time units. Thus the total waiting time for this solution would be
3 × 1 + 4 × 2 + 5 × 3 + 2 × 2 + 7 × 1. .
I/O description. Input: n and k, separated by one space on the first line, then R on second line. Output: the
sequence of the k times.

Set the first broadcast at time i.
Solve the problem for R' = {Rj | j>=i} and k' = k-1
Of course, you will need to store all sub-solutions to make it a dynamic programming algorithm rather than plain recursion.
Note that you must begin with k-1 broadcast times, as the kth broadcast will always be after the last time with non-zero users.
The problem is with step 1. You can try every possible position (worst case time complexity will be n*k i think). I recommend you try this naive method first, test it on some data, and see if you can come up with a better way to find the position of the first broadcast.

Calculation time for addition of two 8 digits numbers to 4 million digits

I am hoping to find the calculation time of adding two 8 digits A and B, and keep adding B with sum of A+B, keep adding B, till result is a 4 million digits number.

Reaching a 4 million digit number means reaching/exceeding M=10^(4*10^6) which is the first number having 4 million digits. Any 8-digit number is between 10^8 and 10^9-1, so you will have to add B approximately 10^(4*10^6) / 10^8 times in order to reach M and because 8 (or 9) is so small compared to 4*10^6, you can ignore it and you get around 10^(4*10^6) additions. And now, if you consider that a standard PC executes around 10^9 instructions per seconds, it will take around 10^(4*10^6-9) seconds which again because 9 is small you get ~10^(4*10^6) seconds.
Note: it is about the complexity here, and not the programming language.

Perfect powers of numbers which can fit in 64 bit size integer (using priority queues)

How can we print out all perfect powers that can be represented as 64-bit long integers: 4, 8, 9, 16, 25, 27, .... A perfect power is a number that can be written as ab for integers a and b ≥ 2.
It's not a homework problem, I found it in job interview questions section of an algorithm design book. Hint, the chapter was based on priority queues.
Most of the ideas I have are quadratic in nature, that keep finding powers until they stop fitting 64 bit but that's not what an interviewer will look for. Also, I'm not able to understand how would PQ's help here.

Using a small priority queue, with one entry per power, is a reasonable way to list the numbers. See following python code.
import Queue # in Python 3 say: queue
pmax, vmax = 10, 150
Q=Queue.PriorityQueue(pmax)
p = 2
for e in range(2,pmax):
p *= 2
Q.put((p,2,e))
print 1,1,2
while not Q.empty():
(v, b, e) = Q.get()
if v < vmax:
print v, b, e
b += 1
Q.put((b**e, b, e))
With pmax, vmax as in the code above, it produces the following output. For the proposed problem, replace pmax and vmax with 64 and 2**64.
1 1 2
4 2 2
8 2 3
9 3 2
16 2 4
16 4 2
25 5 2
27 3 3
32 2 5
36 6 2
49 7 2
64 2 6
64 4 3
64 8 2
81 3 4
81 9 2
100 10 2
121 11 2
125 5 3
128 2 7
144 12 2
The complexity of this method is O(vmax^0.5 * log(pmax)). This is because the number of perfect squares is dominant over the number of perfect cubes, fourth powers, etc., and for each square we do O(log(pmax)) work for get and put queue operations. For higher powers, we do O(log(pmax)) work when computing b**e.
When vmax,pmax =64, 2**64, there will be about 2*(2^32 + 2^21 + 2^16 + 2^12 + ...) queue operations, ie about 2^33 queue ops.
Added note: This note addresses cf16's comment, “one remark only, I don't think "the number of perfect squares is dominant over the number of perfect cubes, fourth powers, etc." they all are infinite. but yes, if we consider finite set”. It is true that in the overall mathematical scheme of things, the cardinalities are the same. That is, if P(j) is the set of all j'th powers of integers, then the cardinality of P(j) == P(k) for all integers j,k > 0. Elements of any two sets of powers can be put into 1-1 correspondence with each other.
Nevertheless, when computing perfect powers in ascending order, no matter how many are computed, finite or not, the work of delivering squares dominates that for any other power. For any given x, the density of perfect kth powers in the region of x declines exponentially as k increases. As x increases, the density of perfect kth powers in the region of x is proportional to (x1/k)/x, hence third powers, fourth powers, etc become vanishingly rare compared to squares as x increases.
As a concrete example, among perfect powers between 1e8 and 1e9 the number of (2; 3; 4; 5; 6)th powers is about (21622; 535; 77; 24; 10). There are more than 30 times as many squares between 1e8 and 1e9 than there are instances of any higher powers than squares. Here are ratios of the number of perfect squares between two numbers, vs the number of higher perfect powers: 10¹⁰–10¹⁵, r≈301; 10¹⁵–10²⁰, r≈2K; 10²⁰–10²⁵, r≈15K; 10²⁵–10³⁰, r≈100K. In short, as x increases, squares dominate more and more when perfect powers are delivered in ascending order.

A priority queue helps, for example, if you want to avoid duplicates in the output, or if you want to list the values particularly sorted.
Priority queues can often be replaced by sorting and vice versa. You could therefore generate all combinations of ab, then sort the results and remove adjacent duplicates. In this application, this approach appears to be slightly but perhaps not drammatically memory-inefficient as witnessed by one of the sister answers.
A priority queue can be superior to sorting, if you manage to remove duplicates as you go; or if you want to avoid storing and processing the whole result to be generated in memory. The other sister answer is an example of the latter but it could easily do both with a slight modification.
Here it makes the difference between an array taking up ~16 GB of RAM and a queue with less than 64 items taking up several kilobytes at worst. Such a huge difference in memory consumption also translates to RAM access time versus cache access time difference, so the memory lean algorithm may end up much faster even if the underlying data structure incurs some overhead by maintaining itself and needs more instructions compared to the naive algorithm that uses sorting.
Because the size of the input is fixed, it is not technically possible that the methods you thought of have been quadratic in nature. Having two nested loops does not make an algorithm quadratic, until you can say that the upper bound of each such loop is proportional to input size, and often not even then). What really matters is how many times the innermost logic actually executes.
In this case the competition is between feasible constants and non-feasible constants.

The only way I can see the priority queue making much sense is that you want to print numbers as they become available, in strictly increasing order, and of course without printing any number twice. So you start off with a prime generator (that uses the sieve of eratosthenes or some smarter technique to generate the sequence 2, 3, 5, 7, 11, ...). You start by putting a triple representing the fact that 2^2 = 4 onto the queue. Then you repeat a process of removing the smallest item (the triple with the smallest exponentiation result) from the queue, printing it, increasing the exponent by one, and putting it back onto the queue (with its priority determined by the result of the new exponentiation). You interleave this process with one that generates new primes as needed (sometime before p^2 is output).
Since the largest exponent base we can possibly have is 2^32 (2^32)^2 = 2^64, the number of elements on the queue shouldn't exceed the number of primes less than 2^32, which is evidently 203,280,221, which I guess is a tractable number.

OpenMP Dynamic For Loop Chunk Remainders

When specifying a chunk size for a for loop in OpenMP, if there is a remainder, is it handled by the compiler? For example, if I am iterating through 13 points, with chunk size 4 and 3 threads, assuming that all threads are used, will one of them be given a 5th point, or do I need to specify this?

Yes OpenMP handles that for you. You don't have to specify anything.
I assume you talk about static scheduling here since for dynamic scheduling it seems rather evident.
For instance from Intel doc static scheduling
Divide the loop into equal-sized chunks or as equal as possible in the
case where the number of loop iterations is not evenly divisible by
the number of threads multiplied by the chunk size.
The remaining chunks are divided depending on the implementation.
If you want more details according to MSDN doc
For a team of p threads, let ceiling(n/p) be the integer q, which
satisfies n = p*q - r with 0 <= r < p. One implementation of the
static schedule for this example would assign q iterations to the
first p–1 threads, and q-r iterations to the last thread. Another
acceptable implementation would assign q iterations to the first p-r
threads, and q-1 iterations to the remaining r threads. This
illustrates why a program should not rely on the details of a
particular implementation.

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio