Amdahl's Law: matrix multiplication - parallel-processing

I'm trying to calculate the fraction P of my code which can be parallelized, to apply Amdahl's Law and observe the theoretical maximum speedup.
My code spends most of its time on multiplying matrices (using the library Eigen). Should I consider this part entirely parallelizable?

If your matrices are large enough, let's say larger than 60x60, then you can compile with OpenMP enabled (e.g., -fopenmp with gcc) and the products will be parallelized for you. However, it is often better to parallelize at the highest level possible, especially if the matrices are not very large. It then depends on whether you can identify independent tasks in your algorithm.

First, it would be appropriate to consider how the Eigen library is handling the matrix multiplication.
Then, a matrix(mxn)-vector(nx1) multiplication without Eigen could be written like this:
void mxv(int m, int n, double* a, double* b, double* c)
{   // a = b x c
    int i, j;

    for (i = 0; i < m; i++)
    {
        a[i] = 0.0;
        for (j = 0; j < n; j++)
            a[i] += b[i*n + j] * c[j];
    }
}
As you can see, no two iterations compute the same element of the result vector a[], and the order in which the elements a[i], i = 0..m-1, are calculated does not affect correctness. These computations can therefore be carried out independently over the index i, and a loop like this one is entirely parallelizable. Implementing this in parallel with OpenMP is relatively straightforward.

Fastest way to compute (n + 1)^j from (n^j)

I need to compute 0^j, 1^j, ..., k^j for some very large k and j (both in the order of a few millions). I am using GMP to handle the big integer numbers (yes, I need integer numbers as I need full precision). Now, I wonder, once I have gone through the effort of computing n^j, isn't there a way to speed up the computation of (n + 1)^j, instead of starting from scratch?
Here is the algorithm I am currently using to compute the power:
mpz_class pow(unsigned long int b, unsigned long int e)
{
    mpz_class res = 1;
    mpz_class m = b;
    while (e)
    {
        if (e & 1)
        {
            res *= m;
        }
        e >>= 1;
        m *= m;
    }
    return res;
}
As you can see, every time I start from scratch, and it takes a lot of time.
To compute n^j, why not find at least one factor of n, say k, and compute n^j = k^j * (n/k)^j? By the time n^j is being computed, both k^j and (n/k)^j should already be known.
However, finding a factor of n can take up to O(sqrt(n)) time. Alternatively, n^j can be computed independently in O(log(j)) multiplications by exponentiation by squaring, as in the code above.
So you could have a mix of the above depending on which is larger:
If n is much smaller than log(j), compute n^j by factorization.
Whenever n^j is known, compute {(2*n)^j, (3*n)^j, ..., ((n-1)*n)^j, (n*n)^j} and keep them in a lookup table.
If n is larger than log(j) and a ready computation as above is not possible, use the logarithmic method and then compute the other related powers like above.
If n is a pure power of 2 (possible const time computation), compute the jth power by shifting and calculate the related sums.
If n is even (const time computation again), use the factorization method and compute associated products.
The above should make it quite fast. For example, identifying even numbers by itself converts half of the power computations into multiplications. There could be many more rules of thumb regarding factorization that reduce the computation further (especially for divisibility by 3, 7, etc.).
You may want to use the binomial expansion of (n+1)^j as n^j + j*n^(j-1) + j(j-1)/2 * n^(j-2) + ... + 1, memoizing the lower powers already computed and reusing them to compute (n+1)^j in O(j) additions and multiplications. If you compute the coefficients j, j*(j-1)/2, ... incrementally while adding each term, that stays within O(j) as well.

A binomial random number generating algorithm that works when n*p is very small [duplicate]

This question already has answers here:
A efficient binomial random number generator code in Java
(2 answers)
Closed 8 years ago.
I need to generate binomial random numbers:
For example, consider binomial random numbers. A binomial random
number is the number of heads in N tosses of a coin with probability p
of a heads on any single toss. If you generate N uniform random
numbers on the interval (0,1) and count the number less than p, then
the count is a binomial random number with parameters N and p.
In my case, my N could range from 1*10^3 to 1*10^10. My p is around 1*10^(-7).
Often my n*p is around 1*10^(-3).
There is a trivial implementation to generate such binomial random number through loops:
public static int getBinomial(int n, double p) {
    int x = 0;
    for (int i = 0; i < n; i++) {
        if (Math.random() < p)
            x++;
    }
    return x;
}
This naive implementation is very slow. I tried the acceptance-rejection/inversion method [1] implemented in the Colt (http://acs.lbl.gov/software/colt/) lib. It is very fast, but the distribution of its generated numbers only agrees with the naive implementation when n*p is not very small. In my case, when n*p = 1*10^(-3), the naive implementation can still generate the number 1 after many runs, but the acceptance-rejection/inversion method never generates a 1 (it always returns 0).
Does anyone know what is the problem here? Or can you suggest a better binomial random number generating algorithm that can solve my case.
[1] V. Kachitvichyanukul, B.W. Schmeiser (1988): Binomial random variate generation, Communications of the ACM 31, 216-222.
If n*p is a fixed small number t, and n is much larger than 1/t, then the binomial distribution is very close to the Poisson distribution, which returns k with probability e^{-t} t^k/k!.
Here is some pseudocode
r = e^t * RandomReal[0,1];
s = 1;
k = 0;
While[s < r,
    (k++; s = s + t^k/k!;)]
Return k;
If t is really small, it will be pretty hard to tell the difference between this and a routine that just returns 0 with probability 1-t and t the rest of the time. For example, if t=0.001 and n is large, then the probabilities of various values of k are
k=0 0.9990005
k=1 0.0009990
k=2 0.0000005
k>2 1.7 * 10^{-10}
When n*p is very small, only the very smallest values of k are at all likely. You could work out the probabilities of those and then use the alias method (http://en.wikipedia.org/wiki/Alias_method). If you are feeling extra scrupulous, you could work out the probability that some value above those you are prepared to deal with occurs, and divert to a special-case method with that probability, such as generating a second alias table to cope with the most likely values above those your first alias table covered.

Algorithm on interview

Recently I was asked the following interview question:
You have two arrays of numbers of the same length N, for example A = [3, 5, 9] and B = [7, 5, 1]. Next, for each position i in the range 0..N-1, you can pick either number A[i] or B[i], so at the end you will have another array C of length N which consists of elements from A and B. If the sum of all elements in C is less than or equal to K, then such an array is good. Please write an algorithm to figure out the total number of good arrays for given arrays A, B and number K.
The only solution I've come up with is a dynamic programming approach, where we have a matrix of size NxK and M[i][j] represents how many combinations we can have after the first i choices if the current sum equals j. But it looks like they expected me to come up with a formula. Could you please help me with that? At least, what direction should I look in? I will appreciate any help. Thanks.
After some consideration, I believe this is an NP-complete problem. Consider:
A = [0, 0, 0, ..., 0]
B = [b1, b2, b3, ..., bn]
Note that every construction of the third set C = ( A[i] or B[i] for i = 0..n ) is just the union of some subset of A and some subset of B. In this case, since every subset of A sums to 0, the sum of C is the same as the sum of some subset of B.
Now your question "How many ways can we construct C with a sum less than K?" can be restated as "How many subsets of B sum to less than K?". Solving this problem for K = 1 and K = 0 yields the solution to the subset sum problem for B (the difference between the two solutions is the number of subsets that sum to 0).
By similar argument, even in the general case where A contains nonzero elements, we can construct an array S = [b1-a1, b2-a2, b3-a3, ..., bn-an], and the question becomes "How many subsets of S sum to less than K - sum(A)?"
Since the subset sum problem is NP-complete, this problem must be as well. So with that in mind, I would venture that the dynamic programming solution you proposed is the best you can do, and certainly no magic formula exists.
"Please write an algorithm to figure out the total number of good arrays by given arrays A, B and number K."
Is it not the goal?
int A[];
int B[];
int N;
int K;
int Solutions = 0;

void FindSolutions(int Depth, int theSumSoFar) {
    if (theSumSoFar > K) return;
    if (Depth >= N) {
        Solutions++;
        return;
    }
    FindSolutions(Depth + 1, theSumSoFar + A[Depth]);
    FindSolutions(Depth + 1, theSumSoFar + B[Depth]);
}
Invoke FindSolutions with both arguments set to zero. On return, Solutions will equal the number of good arrays.
This is how I would try to solve the problem (sorry if it's naive). Think of the arrays
A = [3, 5, 9, 8, 2]
B = [7, 5, 1, 8, 2]
With positions 0..N-1, the number of choices is 2^N. Set C1 = 0 and C2 = 0, and for every position where A[i] = B[i] (the choice there is forced), do
C1++
C2 += A[i]
Then create two new arrays without those positions:
A1 = [3, 5, 9]
B1 = [7, 5, 1]
In this example C2 is now 10 (8 + 2). The number of choices is reduced to 2^(N-C1), and you count the good arrays on the reduced arrays using K' = K - C2.
Unfortunately, no matter what method you use, you may still have to evaluate up to 2^(N-C1) sums.
So there's 2^N choices, since at each point you either pick from A or from B. In the specific example you give where N happens to be 3 there are 8. For discussion you can characterise each set of decisions as a bit pattern.
So a brute-force approach would try every single bit pattern.
But what should be obvious is that if the first few bits produce a number too large then every subsequent possible group of tail bits will also produce a number that is too large. So probably a better way to model it is a tree where you don't bother walking down the limbs that have already grown beyond your limit.
You can also compute the maximum totals that can be reached from each bit to the end of the table. If at any point your running total plus the maximum that you can obtain from here on down is less than K then every subtree from where you are is acceptable without any need for traversal. The case, as discussed in the comments, where every single combination is acceptable is a special case of this observation.
As pointed out by Serge below, a related observation is to use minimums and the converse logic to cancel whole subtrees without traversal.
A potential further optimisation rests behind the observation that, as long as we shuffle each in the same way, changing the order of A and B has no effect because addition is commutative. You can therefore make an effort to ensure either that the maximums grow as quickly as possible or the minimums grow as slowly as possible, to try to get the earliest possible exit from traversal. In practice you'd probably want to apply a heuristic comparing the absolute maximum and minimum (both of which you've computed anyway) to K.
That being the case, a recursive implementation is easiest, e.g. (in C)
/* assume A, B and N are known globals */
unsigned int numberOfGoodArraysFromBit(
    unsigned int bit,
    unsigned int runningTotal,
    unsigned int limit)
{
    // have we ended up in an unacceptable subtree?
    if (runningTotal > limit) return 0;

    // have we reached the leaf node without at any
    // point finding this subtree to be unacceptable?
    if (bit >= N) return 1;

    // maybe every subtree is acceptable?
    if (runningTotal + MAXV[bit] <= limit)
    {
        return 1 << (N - bit);
    }

    // maybe no subtrees are acceptable?
    if (runningTotal + MINV[bit] > limit)
    {
        return 0;
    }

    // if we can't prima facie judge the subtrees,
    // we'll need specifically to evaluate them
    return
        numberOfGoodArraysFromBit(bit + 1, runningTotal + A[bit], limit) +
        numberOfGoodArraysFromBit(bit + 1, runningTotal + B[bit], limit);
}

// work out the minimum and maximum values at each position
for (int i = 0; i < N; i++)
{
    MAXV[i] = MAX(A[i], B[i]);
    MINV[i] = MIN(A[i], B[i]);
}

// hence work out the cumulative totals from right to left
for (int i = N - 2; i >= 0; i--)
{
    MAXV[i] += MAXV[i + 1];
    MINV[i] += MINV[i + 1];
}

// to kick it off
printf("Total valid combinations is %u", numberOfGoodArraysFromBit(0, 0, K));
I'm just thinking extemporaneously; it's likely better solutions exist.

Why is my Strassen Matrix multiplier so fast?

As an experiment I implemented the Strassen matrix multiplication algorithm to see if it truly leads to faster code for large n.
https://github.com/wcochran/strassen_multiplier/blob/master/mm.c
To my surprise it was way faster for large n. For example, the n=1024 case
took 17.20 seconds using the conventional method whereas it only took 1.13 seconds
using the Strassen method (2x2.66 GHz Xeon). What -- a 15x speedup!? It should only be marginally faster. In fact, it seemed to be as good for even small 32x32 matrices!?
The only way I can explain this much of a speed-up is that my algorithm is more cache-friendly -- i.e., it focuses on small pieces of the matrices and thus the data is more localized. Maybe I should be doing all my matrix arithmetic piecemeal when possible.
Any other theories on why this is so fast?
The recursive nature of Strassen has better memory locality, so that may be a part of the picture. A recursive regular matrix multiplication is perhaps a reasonable thing to compare to.
First question is "are the results correct?". If so, it's likely that your "conventional" method is not a good implementation.
The conventional method is not to use 3 nested FOR loops to scan the inputs in the order you learned in math class. One simple improvement is to transpose the matrix on the right so that it sits in memory with columns being coherent rather than rows. Modify the multiply loop to use this alternate layout and it will run much faster on a large matrix.
The standard matrix libraries implement much more cache friendly methods that consider the size of the data cache.
You might also implement a recursive version of the standard matrix product (subdivide into 2x2 matrix of matricies that are half the size). This will give something closer to optimal cache performance, which strassen gets from being recursive.
So either you're doing it wrong, or your conventional code is not optimized.
What is the loop order in your conventional multiplication? If you have
for (int i = 0; i < new_height; ++i)
{
    for (int j = 0; j < new_width; ++j)
    {
        double sum = 0.0;
        for (int k = 0; k < common; ++k)
        {
            sum += lhs[i * common + k] * rhs[k * new_width + j];
        }
        product[i * new_width + j] = sum;
    }
}
then you're not being very nice to the cache, because you're accessing the right-hand-side matrix in a non-contiguous manner. After reordering to
for (int i = 0; i < new_height; ++i)
{
    for (int k = 0; k < common; ++k)
    {
        double const fixed = lhs[i * common + k];
        for (int j = 0; j < new_width; ++j)
        {
            product[i * new_width + j] += fixed * rhs[k * new_width + j];
        }
    }
}
accesses to two of the matrices in the inner-most loop are contiguous and one is even fixed (note that product must be zero-initialized before this version runs, since it accumulates). A good compiler would probably do this automatically, but I chose to pull it out explicitly for demonstration.
You didn't specify the language, but as for C++, advanced compilers even recognize the unfriendly loop order in some configurations and reorder them.

How do I generate a uniform random integer partition?

A Google search reveals plenty about generating all possible partitions of an integer n into m parts, but I haven't found anything about sampling a uniformly distributed random partition of n into m parts.
The title of this post is a bit misleading. A random integer partition is by default unrestricted, meaning it can have as many parts of any size. The specific question asked is about partitions of n into m parts, which is a type of restricted integer partition.
For generating unrestricted integer partitions, a very fast and simple algorithm is due to Fristedt, in a paper called The Structure of Random Partitions of Large Integers (1993). The algorithm is as follows:
Set x = exp(-pi/sqrt(6n) ).
Generate independent random variables Z(1), Z(2), ..., Z(n), where Z(i) is geometrically distributed with parameter 1-x^i.
IF sum i*Z(i) = n, where the sum is taken over all i=1,2,...,n, then STOP.
ELSE, repeat 2.
Once the algorithm stops, then Z(1) is the number of 1s, Z(2) is the number of 2s, etc., in a partition chosen uniformly at random. The probability of accepting a randomly chosen set of Z's is asymptotically 1/(94n^3)^(1/4), which means one would expect to run this algorithm O(n^(3/4)) times before accepting a single sample.
The reason I took the time to explain this algorithm is because it applies directly to the problem of generating a partition of n into exactly m parts. First, observe that
The number of partitions of n into exactly m parts is equal to the number of partitions of n with largest part equal to m.
Then we may apply Fristedt's algorithm directly, but instead of generating Z(1), Z(2), ..., Z(n), we can generate Z(1), Z(2), ..., Z(m-1), Z(m)+1 (the +1 here ensures that the largest part is exactly m, and 1+Z(m) is equal in distribution to Z(m) conditional on Z(m)>=1) and set all other Z(m+1), Z(m+2), ... equal to 0. Then once we obtain the target sum in step 3 we are also guaranteed to have an unbiased sample. To obtain a partition of n into exactly m parts simply take the conjugate of the partition generated.
The advantage this has over the recursive method of Nijenhuis and Wilf is that there is no memory requirements other than to store the random variables Z(1), Z(2), etc. Also, the value of x can be anything between 0 and 1 and this algorithm is still unbiased! Choosing a good value of x, however, can make the algorithm much faster, though the choice in Step 1 is nearly optimal for unrestricted integer partitions.
If n is really huge and Fristedt's algorithm takes too long (and table methods are out of the question), then there are other options, but they are a little more complicated; see my thesis https://sites.google.com/site/stephendesalvo/home/papers for more info on probabilistic divide-and-conquer and its applications.
Here is some code that does it. This is O(n^2) the first time you call it, but it builds a cache so that subsequent calls are O(n).
import random

cache = {}

def count_partitions(n, limit):
    if n == 0:
        return 1
    if (n, limit) in cache:
        return cache[n, limit]
    x = cache[n, limit] = sum(count_partitions(n - k, k) for k in range(1, min(limit, n) + 1))
    return x

def random_partition(n):
    a = []
    limit = n
    total = count_partitions(n, limit)
    which = random.randrange(total)
    while n:
        for k in range(1, min(limit, n) + 1):
            count = count_partitions(n - k, k)
            if which < count:
                break
            which -= count
        a.append(k)
        limit = k
        n -= k
    return a
How this works: we can calculate how many partitions of an integer n there are in O(n^2) time. As a side effect, this produces a table of size O(n^2) which we can then use to generate the kth partition of n, for any integer k, in O(n) time.
So let total = the number of partitions. Pick a random number k from 0 to total - 1. Generate the kth partition.
Another algorithm, from Combinatorial Algorithms page 52, "Random Generation of n into k parts":
Choose a1 < a2 < ... < a(k-1), a random (k-1)-subset of {1, 2, ..., n+k-1} (see 1., 2. below)
Set r1 = a1 - 1; rj = aj - a(j-1) - 1 for j = 2..k-1; rk = n + k - 1 - a(k-1)
The rj (j = 1..k) constitute the random partition of n into k parts
This algorithm for random compositions is based on the "balls-in-cells" model. Briefly, we choose the positions of the cell boundaries at random, then by differencing we find out how many balls are in each cell.
For efficiently generating a random subset of a set (steps 1. and 2. above), see the related answers on that topic.
update
Another approach using a single random number in [0,1] to uniformly generate a random partition (also called composition) is given in IVAN STOJMENOVIC, "ON RANDOM AND ADAPTIVE PARALLEL GENERATION OF COMBINATORIAL OBJECTS" (section 5, section 10)
Just one more version in c#.
using System;

namespace ConsoleApplication6
{
    class Program
    {
        static Random random = new Random();

        static void Main(string[] args)
        {
            PrintPartition(GetUniformPartition(24, 5));
            PrintPartition(GetUniformPartition(24, 5));
            PrintPartition(GetUniformPartition(24, 5));
            PrintPartition(GetUniformPartition(24, 5));
            PrintPartition(GetUniformPartition(24, 5));
            Console.ReadKey();
        }

        static int[] GetUniformPartition(int input, int parts)
        {
            if (input <= 0 || parts <= 0)
                throw new ArgumentException("invalid input or parts");
            if (input < MinUniformPartition(parts))
                throw new ArgumentException("input is too small");

            int[] partition = new int[parts];
            int sum = 0;
            for (int i = 0; i < parts - 1; i++)
            {
                int max = input - MinUniformPartition(parts - i - 1) - sum;
                partition[i] = random.Next(parts - i, max);
                sum += partition[i];
            }
            partition[parts - 1] = input - sum; // last
            return partition;
        }

        // sum of 1,2,3,4,..,n
        static int MinUniformPartition(int n)
        {
            return n * (n + 1) / 2;
        }

        static void PrintPartition(int[] p)
        {
            for (int i = 0; i < p.Length; i++)
            {
                Console.Write("{0},", p[i]);
            }
            Console.WriteLine();
        }
    }
}
This code produces output like the following:
5,8,7,2,2,
6,6,7,2,3,
5,7,6,2,4,
6,4,3,2,9,
7,8,4,4,1,
I have an evenly distributed partition generator.
Where n := the integer to be partitioned, r:= the number of slices:
The algorithm is a patched version of the naive method of simply inserting partings at random. The problem with that method, as it appeared to me when I looked at its output, is that scenarios where partings are placed in the same spot are less likely to occur. There is only one way to get {1,1,1}, while there are 3! ways of getting {2,4,9}: any of {4,2,9}, {2,4,9}, {9,4,2}, ... will lead to the same partition placement when sorted. This has been amended by providing additional explicit opportunities for repeats: for each parting insertion, there is a chance that the position of the parting won't be random, but will instead be selected as a repeat of a formerly selected value. This balances out the uneven probability distribution of the naive method.
I have proved by exhaustion that each partitioning is perfectly equally likely for r = 3, n = 2. I haven't proved it for higher values, but half-hearted ventures to do so found only promising signs. I also tested it on random input, finding that it is at least roughly even for every value I tried (and probably perfectly even).
here it is in C++11: [the output format is different to what you're expecting, it's the positions of the partings rather than the size of the space between them. The conversion is easy, though]
#include <vector>
#include <algorithm>
#include <random>
#include <cassert>
#include <cstdlib>

// nparts is the number of parts, i.e. one greater than the number of
// dividers listed in the output vector; bandw is the integer being partitioned.
template <typename Parting, typename Seed>
std::vector<Parting> partitionGen(unsigned nparts, unsigned bandw, Seed seed)
{
    assert(nparts > 0);
    std::vector<Parting> out(nparts - 1);
    srand(seed);
    unsigned genRange = bandw;
    for (auto i = out.begin(); i < out.end(); ++i, ++genRange) {
        unsigned gen = rand() % genRange;
        *i = ((gen < bandw) ?
              gen :                       // a fresh random position...
              *(i - (gen - bandw + 1)));  // ...or a repeat of an earlier one
    }
    std::sort(out.begin(), out.end(), std::less<Parting>());
    return out;
}
I don't like the fact that I have to sort it though. If Vlody's version has an even distribution, it appears that it'd be better.
After some googling I found an algorithm for this in the "Handbook of Applied Algorithms," which Google Books has indexed. The algorithm is given in section 1.12.2, on page 31.
I have implemented the above solution and found that it works very well if one wants to calculate integer partitions of n, but not with respect to m. When working with large n, recursion limits and call-stack sizes may need to be increased a lot.
However, you don't need the first function, because count_partitions(n, limit) actually equals the number of partitions of n+limit into exactly limit parts. Some mathematical software has very fast functions for finding the number of partitions of n into m parts.
I have recently derived a definitely unbiased, very simple, and very fast method (using memoization) to solve your exact question: An algorithm for randomly generating integer partitions of a particular length, in Python?
It's based on knowing something about lexically ordered partitions of n having m parts and uses a similar approach to well-accepted algorithms (e.g. Nijenhuis and Wilf 1978) that find random partitions of n, and is conceptually similar to the above.
In short, if there are x partitions of n with m parts, then we choose a random number between 1 and x. That random number will code for one and only one partition satisfying n and m. I hope this helps.