applying Amdahl’s and Gustafson’s law on matrix vector multiply - parallel-processing

I have read about these laws in many threads here but still cannot figure out how to apply their formulas to matrix-vector multiply (y = y + Ax). Here I will try to explain my algorithm with respect to time:
T1 (sequential): processor zero generates vectors y and x and broadcasts them.
T2 (parallel): the matrix (size n) is divided among the processors; each processor generates its own portion and does the multiplication. All processors then send their results to processor zero.
T3 (sequential): processor zero collects the results, orders them, and prints them.
If I run this multiple times with different matrix sizes and numbers of processors, how can I apply Amdahl’s and Gustafson’s laws to the results?
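One possible way to connect the measured T1, T2 and T3 to the two laws is sketched below in Python. The function names and the timings are hypothetical, chosen only for illustration. The sketch assumes that T1 + T3 is the serial part and T2 the parallel part, that the serial fraction for Amdahl's law comes from a one-processor run at a fixed matrix size, and that the serial fraction for Gustafson's law comes from the p-processor run itself. In practice the broadcast in T1 may grow with p, so the serial part is not perfectly constant.

```python
def serial_fraction(t1, t2, t3):
    """Fraction of the total run time spent in the sequential phases T1 and T3."""
    return (t1 + t3) / (t1 + t2 + t3)


def amdahl_speedup(s, p):
    """Amdahl: predicted speedup on p processors for a FIXED problem size,
    where s is the serial fraction measured on one processor."""
    return 1.0 / (s + (1.0 - s) / p)


def gustafson_speedup(s, p):
    """Gustafson: scaled speedup on p processors, where s is the serial
    fraction measured on the p-processor run (problem size grows with p)."""
    return p - s * (p - 1)


# Hypothetical timings in seconds, not measured data.
s_one = serial_fraction(0.5, 9.0, 0.5)   # one-processor run
s_par = serial_fraction(0.5, 2.25, 0.5)  # four-processor run

print("Amdahl prediction for p=4:   ", amdahl_speedup(s_one, 4))
print("Gustafson scaled speedup p=4:", gustafson_speedup(s_par, 4))
```

Comparing the Amdahl prediction with the measured speedup (one-processor time divided by p-processor time at the same n) shows how much is lost to communication. Repeating the Gustafson calculation while n grows with p shows how well the algorithm scales.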

Related

Cannon's Algorithm for Matrix Multiplication with small number of processors

While researching Cannon's algorithm, I keep finding examples of the same kind. For example, if matrices A and B are 3x3, the examples always use 9 processors. Each processor is responsible for its own cell and adds the value to the sum. The instructions also always contain the same expression: "Process P(i , j) initially stores A(i , j) and B(i , j) computes block C(i , j) of the result matrix".
I understand this case. I put an example case below. These are the initial matrices.
And this is the position of the processors:
As in this example, the number of processors is always chosen so that each one deals with only a 1x1 part of the matrix. I wonder what the situation would be if the number of processors were chosen so that each processor deals with a larger part.
For instance, what would happen if 4 processors were used to multiply 4x4 matrices? As I understand the instructions, each process would take a 2x2 part of the A and B matrices. In other words, as in the picture I put below, the first process would keep the elements of A and B at indices (0,0), (0,1), (1,0) and (1,1). The second process would keep the elements at indices (0,2), (0,3), (1,2) and (1,3).
How would changing the number of processes change the communication pattern or the number of steps required to complete the algorithm? For example, would you have to do more shifts in each step?
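One way to explore this is to simulate the block form of Cannon's algorithm. The sketch below is plain sequential Python with no real message passing. cannon_blocks and its helpers are hypothetical names, and it assumes the p processes form a q x q grid with q dividing n. Under those assumptions each process holds an (n/q) x (n/q) block, the loop does q block multiplications with q - 1 shifts between them (after the initial alignment), and each shift moves a whole block rather than a single element.

```python
def matmul(a, b):
    """Plain dense product of two list-of-lists matrices."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def add(a, b):
    """Element-wise sum of two equally sized matrices."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def get_block(m, bi, bj, bs):
    """Block (bi, bj) of size bs x bs taken from matrix m."""
    return [row[bj * bs:(bj + 1) * bs] for row in m[bi * bs:(bi + 1) * bs]]


def cannon_blocks(A, B, q):
    """Simulate Cannon's algorithm on a q x q process grid.
    Returns the product and the number of shift rounds after alignment."""
    n = len(A)
    bs = n // q
    # Initial alignment: block row i of A is shifted left by i,
    # block column j of B is shifted up by j.
    a = [[get_block(A, i, (i + j) % q, bs) for j in range(q)] for i in range(q)]
    b = [[get_block(B, (i + j) % q, j, bs) for j in range(q)] for i in range(q)]
    c = [[[[0] * bs for _ in range(bs)] for _ in range(q)] for _ in range(q)]
    shifts = 0
    for step in range(q):
        # Every "process" multiplies the two blocks it currently holds.
        for i in range(q):
            for j in range(q):
                c[i][j] = add(c[i][j], matmul(a[i][j], b[i][j]))
        if step < q - 1:
            # Shift A blocks one position left and B blocks one position up.
            a = [[a[i][(j + 1) % q] for j in range(q)] for i in range(q)]
            b = [[b[(i + 1) % q][j] for j in range(q)] for i in range(q)]
            shifts += 1
    # Reassemble the full result from the per-process blocks.
    C = [[0] * n for _ in range(n)]
    for i in range(q):
        for j in range(q):
            for r in range(bs):
                for s in range(bs):
                    C[i * bs + r][j * bs + s] = c[i][j][r][s]
    return C, shifts


A = [[i * 4 + j for j in range(4)] for i in range(4)]
B = [[(i + 2 * j) % 5 for j in range(4)] for i in range(4)]
for q in (1, 2, 4):
    C, shifts = cannon_blocks(A, B, q)
    print(q * q, "processes:", shifts, "shift rounds, correct =", C == matmul(A, B))
```

In this simulation, going from 16 processes to 4 on a 4x4 matrix lowers the number of shift rounds while each round carries a 2x2 block per process instead of one element.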

CUDA: Launching many parallel calls to cuBLAS on different subsections of a matrix, without serializing

In my application, I have a double complex N*3 matrix (where N is several thousand) and a 3*1 vector, and I am forming an N*1 vector using zgemv.
The N*3 is a subsection of a larger M*3 matrix (where M is slightly larger than N, but the same order of magnitude).
Each thread must perform a zgemv call to a different subsection of the larger matrix. That is, the N*3 is different for every thread. But all of the N*3 are formed from some portion of the larger M*3.
There isn't enough memory for each thread to store an independent N*3. Furthermore, the M*3 is too large to fit in shared memory. Thus each thread must pull its data from a single copy of the M*3. How can I do this without millions of threads serializing memory reads to the same memory locations in the M*3? Is there a more efficient way to approach this?
Probably, based on what I can gather so far, there are 2 types of optimizations I would want to consider:
1. convert operations that use the same N subset to a matrix-matrix multiply (zgemm), instead of multiple zgemv operations.
2. cache-block for the GPU L2 cache.
I'll discuss these in reverse order using these numbers for discussion:
M: ~10,000
N: ~3,000
cublas zgemv calls: ~1e6
"typical" Kepler L2: 1.5MB
An Nx3 matrix requires approximately 10,000 elements, each of which is 16 bytes, so let's call it 160K bytes. So we could store ~5-10 of these subsets in a memory size comparable to L2 cache size (without taking into account overlap of subsets - which would increase the residency of subsets in L2).
There are about (M-N) possible unique contiguous N-row subsets in the M matrix. There are 1e6 zgemv calls, so on average each subset gets reused 1e6/(M-N) times, approximately 100-150 times each. We could store about 10 of these subsets in the proposed L2, so we could "chunk" our 1e6 calls into "chunks" of ~1,000 calls that all operate out of the same data set.
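The estimate above can be checked with a few lines of arithmetic. This is a sketch in Python using the discussion numbers; the variable names are only for illustration, and the exact count of contiguous subsets is taken as M - N + 1.

```python
M = 10_000                       # rows in the large matrix
N = 3_000                        # rows in each subset
calls = 1_000_000                # number of zgemv calls
bytes_per_element = 16           # double complex
l2_bytes = 1.5 * 1024 * 1024     # "typical" Kepler L2

subset_bytes = N * 3 * bytes_per_element          # one N x 3 subset
subsets_in_l2 = int(l2_bytes // subset_bytes)     # ignoring overlap between subsets
unique_subsets = M - N + 1                        # contiguous N-row windows in M rows
reuse_per_subset = calls / unique_subsets         # average calls per subset
calls_per_chunk = subsets_in_l2 * reuse_per_subset

print("bytes per subset:     ", subset_bytes)
print("subsets fitting in L2:", subsets_in_l2)
print("reuse per subset:     ", round(reuse_per_subset))
print("calls per chunk:      ", round(calls_per_chunk))
```

The results land in the same range as the rounded figures in the text: roughly ten subsets per L2-sized block and on the order of a thousand calls per chunk.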
Therefore the process I would follow would be:
1. transfer the M*3 matrix to the device
2. predetermine the N*3 subset needed by each thread.
3. sort or otherwise group like subsets together
4. divide the sorted sets into cache-sized blocks
5. for each block, launch a CDP kernel that will spawn the necessary zgemv calls
6. repeat the above step until all blocks are processed.
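The host-side planning part of this process (predetermine, sort, divide into blocks) can be sketched as follows. This is Python for illustration only; plan_blocks is a hypothetical helper, each call is assumed to be described by the first row of its N-row subset, and a block is simply a fixed number of distinct subsets, which ignores the overlap between neighbouring subsets.

```python
from itertools import groupby


def plan_blocks(offsets, subsets_per_block):
    """Group call indices by subset start row, then split the sorted
    groups into blocks of at most subsets_per_block distinct subsets.

    offsets[i] is the first row of the subset used by call i (assumed input).
    Returns a list of blocks; each block is a list of (offset, [call ids]).
    """
    # Sort call indices so that calls using the same subset are adjacent.
    order = sorted(range(len(offsets)), key=lambda i: offsets[i])
    groups = [(off, list(ids))
              for off, ids in groupby(order, key=lambda i: offsets[i])]
    # Cut the sorted list of groups into cache-sized blocks.
    return [groups[k:k + subsets_per_block]
            for k in range(0, len(groups), subsets_per_block)]


# Six hypothetical calls, two subsets per block.
blocks = plan_blocks([5, 2, 5, 9, 2, 7], subsets_per_block=2)
for number, block in enumerate(blocks):
    print("block", number, block)
```

Each resulting block would then correspond to one kernel launch that works out of the same cache-resident rows.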
One might also wonder if such a strategy could be extended (with considerably more complexity) to L1/Texture. Unfortunately, I think CDP would confound your efforts to achieve this. It's pretty rare that people want to invest the effort to cache-block for L1 anyway.
To extend the above strategy to the gemm case, once you sort your zgemv operations by the particular N subset they require, you will have grouped like operations together. If the above arithmetic is correct, you will have on average around 100-150 gemv operations needed for each particular N-subset. You should group the corresponding vectors for those gemv operations into a matrix, and convert the 100-150 gemv operations into a single gemm operation.
This reduces your ~1e6 zgemv operations to ~1e4 zgemm operations. You can then still cache-block however many of these zgemm operations will be "adjacent" in M and fit in a single cache-block, into a single CDP kernel call, to benefit from L2 cache reuse.
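The gemv-to-gemm conversion relies on the fact that stacking the vectors as columns of a matrix gives, column by column, the same results as the individual matrix-vector products. A small pure-Python sketch with made-up complex values (gemv and gemm here are plain helper functions, not the cuBLAS routines):

```python
def gemv(A, x):
    """Matrix-vector product for a list-of-lists matrix."""
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]


def gemm(A, B):
    """Matrix-matrix product for list-of-lists matrices."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]


# A small stand-in for one N x 3 subset (N = 4 here), with complex entries.
A = [[1 + 1j, 2, 0],
     [0, 1j, 3],
     [2, 2, 2 - 1j],
     [1, 0, 1]]

# Three 3-element vectors that all need the same subset.
vectors = [[1, 0, 1j], [2, 1, 0], [0, 1j, 1]]

# Stack the vectors as the columns of a 3 x k matrix and multiply once.
X = [list(col) for col in zip(*vectors)]
Y = gemm(A, X)

for j, v in enumerate(vectors):
    column_j = [row[j] for row in Y]
    print(j, column_j == gemv(A, v))
```

On the GPU the same grouping would turn k zgemv calls on one subset into a single zgemm call with a 3 x k right-hand matrix.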
Given the operational intensity of GEMM vs. GEMV, it might make sense to dispense with the complexity of CDP altogether, and simply run a host loop that dispatches the ZGEMM call for a particular N subset. That host loop would run for about M-N iterations.

Parallel Matrix Multiplication using multi GPU

I have installed two GPUs (2x Nvidia Quadro 410) in my system in different PCI slots. To perform matrix multiplication on both of these GPUs, how can I split the input matrices so that each GPU computes a part of the output matrix and then returns it?
For example, for two matrices A and B, each of order 10x10, I want to compute the output matrix C = A x B such that, out of the 100 elements (10 x 10), 50 are calculated on the first GPU and the other 50 on the second GPU.
I am trying to implement this in OpenCL, but any algorithm that helps me come up with a solution is welcome.
In general, if you have matrices X (of size axb, rows first) and Y (of size bxc),
X * Y = vcat(X[0:a/2,0:b] * Y, X[a/2:a,0:b] * Y)
In this pseudocode, vcat is vertical concatenation (putting one matrix on top of the other, e.g. a 4x3 matrix concatenated with a 2x3 matrix will produce a 6x3 matrix), : denotes ranges and [] is indexing.
Both arguments to vcat can be computed on different GPUs, and the concatenation can be achieved just by pointing the output to different sub-regions of the output buffer (assuming we have C-ordered arrays). The initial splitting of X can be similarly achieved just by using different sub-regions (since it is split along a row).
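The row split can be checked on the CPU before moving it to OpenCL. The sketch below is plain Python; split_matmul is a hypothetical name, and the two calls to matmul stand in for the work that would be queued on each GPU.

```python
def matmul(X, Y):
    """Plain dense product of two list-of-lists matrices."""
    cols = list(zip(*Y))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in X]


def split_matmul(X, Y):
    """Compute X * Y as two independent row halves, then stack them."""
    half = len(X) // 2
    top = matmul(X[:half], Y)      # rows 0 .. a/2 - 1, e.g. on the first GPU
    bottom = matmul(X[half:], Y)   # rows a/2 .. a - 1, e.g. on the second GPU
    return top + bottom            # vertical concatenation


X = [[1, 2, 3], [4, 5, 6], [7, 8, 9], [1, 0, 1]]
Y = [[1, 0], [0, 1], [2, 2]]
print(split_matmul(X, Y) == matmul(X, Y))
```

Note that each half needs the whole of Y, so in a two-device setup Y would be copied to both devices while X and the output are split by rows.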

Efficient all-pairs set intersection on GPU

I have n sets, subsets of a finite universe. I want to calculate the n*n matrix in which the (I, J) entry contains the cardinality of the intersection of set I and set J. n is in the order of 50000.
My idea is to split the matrix into blocks sufficiently small so to have one thread per entry. Every thread should calculate the intersection using bitwise and.
Are there more efficient approaches to solve this problem?
I'm going to assume you want to compute it as you described: actually computing the intersection of every pair of sets, using bitwise and of bitsets.
With the right mathematical setup, you are literally computing the outer product of two vectors, so I will think in terms of high performance linear algebra.
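As a reference point for everything that follows, the baseline computation can be written in a few lines of sequential Python. This is only a correctness sketch, not a GPU implementation; to_bitset, popcount and all_pairs are hypothetical helper names, and Python integers stand in for the arrays of 32-bit words.

```python
def to_bitset(elements):
    """Encode a set of non-negative integers as a Python int bitset."""
    bits = 0
    for e in elements:
        bits |= 1 << e
    return bits


def popcount(x):
    """Number of set bits in x."""
    return bin(x).count("1")


def all_pairs(sets):
    """n x n matrix whose (i, j) entry is |sets[i] & sets[j]|."""
    bitsets = [to_bitset(s) for s in sets]
    return [[popcount(a & b) for b in bitsets] for a in bitsets]


example = [{0, 1, 2}, {1, 2, 3}, {5}]
for row in all_pairs(example):
    print(row)
```

A small result matrix like this is handy for validating any optimized kernel against.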
The key to performance is going to be reducing memory traffic, and that means holding things in registers when you can. The overwhelmingly most significant factor is that your elements are huge; it takes 6250 32-bit words to store a single set! An entire multiprocessor of CUDA compute capability 3.0, for example, can only hold a mere 10 sets in registers.
What you probably want to do is to spread each element out across an entire thread block. With 896 threads in a block and 7 registers per thread, you can store one set of 200704 elements. With CUDA compute capability 3.0, you will have 36 registers available per thread.
The simplest implementation would be to have each block own one row of the output matrix. It loads the corresponding element of the second vector and stores it in registers, and then iterates over all of the elements of the first vector, computing the intersection, computing and reducing the popcount, and then storing the result in the output matrix.
This optimization should reduce the overall number of memory reads by a factor of 2, and thus is likely to double performance.
Better would be to have each block own 3-4 rows of the output matrix at once, and load the corresponding 3-4 elements of the second vector into registers. Then the block iterates over all of the elements of the first vector, and for each it computes the 3-4 intersections it can, storing the results in the output matrix.
This optimization reduces the memory traffic by an additional factor of 3-4.
A completely different approach would be to work with each element of the universe individually: for each element of the universe, you compute which sets actually contain that element, and then (atomically) increment the corresponding entries of the output matrix.
Asymptotically, this should be much more efficient than computing the intersections of sets. Unfortunately, it sounds hard to implement efficiently.
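The per-element idea can be sketched sequentially to make the access pattern concrete. This is plain Python under the assumption that the universe is the integers 0 .. universe_size - 1; all_pairs_by_element is a hypothetical name, and the plain += stands in for what would be an atomic increment on a GPU.

```python
def all_pairs_by_element(sets, universe_size):
    """Accumulate intersection sizes one universe element at a time."""
    n = len(sets)
    out = [[0] * n for _ in range(n)]
    # For every element of the universe, list the sets that contain it.
    holders = [[] for _ in range(universe_size)]
    for index, s in enumerate(sets):
        for e in s:
            holders[e].append(index)
    # Each element contributes 1 to every pair of sets that share it.
    for members in holders:
        for i in members:
            for j in members:
                out[i][j] += 1   # would be an atomic add on a GPU
    return out


example = [{0, 1, 2}, {1, 2, 3}, {5}]
for row in all_pairs_by_element(example, universe_size=6):
    print(row)
```

The work here is proportional to the sum, over universe elements, of the squared number of sets containing that element, which is where the asymptotic advantage for sparse sets comes from.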
An improvement is to work with, say, 4 elements of the universe at a time. You split up all of your sets up into 16 buckets, depending on which of those 4 elements the set contains. Then, for each of the 16*16 possible pairs of buckets, you iterate through all pairs of vectors from the buckets and (atomically) update the corresponding entry of the matrix appropriately.
This should be even faster than the version described above, but it still may potentially be difficult to implement.
To reduce the difficulty of getting all of the synchronization worked out, you could partition all of input sets into k groups of n/k sets each. Then, the (i,j)-th thread (or warp or block) only does the updates for the corresponding block of the output matrix.
A different approach to breaking up the problem is to split the universe into smaller partitions of 1024 elements each, and compute just the size of the intersections in this part of the universe.
I'm not sure if I've described that well; basically you're computing
A[i,j] = sum((k in v[i]) * (k in w[j]) for k in the_universe)
where v and w are the two lists of sets, and k in S is 1 if true and 0 otherwise. The point is to permute the indices so that k is in the outer loop rather than the inner loop, although for efficiency you will have to work with many consecutive k at once, rather than one at a time.
That is, you initialize the output matrix to all zeroes, and for each block of 1024 universe elements, you compute the sizes of the intersections and accumulate the results into the output matrix.
I chose 1024 because I imagine you'll have a data layout where that's probably the smallest size where you can still get the full memory bandwidth when reading from device memory, and all of the threads in a warp work together. (Adjust this appropriately if you know better than me, or if you aren't using NVIDIA and whatever other GPUs you're using would work better with something else.)
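The partitioned accumulation can be sketched as follows. This is sequential Python, with Python integers as bitsets and all_pairs_chunked as a hypothetical name; the chunk size is a parameter so the idea can be tried on tiny inputs.

```python
def all_pairs_chunked(bitsets, universe_size, chunk_bits=1024):
    """Accumulate intersection sizes one slice of the universe at a time.

    bitsets is a list of Python ints; bit k of bitsets[i] is 1 when
    element k belongs to set i.
    """
    n = len(bitsets)
    out = [[0] * n for _ in range(n)]
    mask = (1 << chunk_bits) - 1
    for start in range(0, universe_size, chunk_bits):
        # Cut the same slice of the universe out of every set.
        chunk = [(b >> start) & mask for b in bitsets]
        # Add this slice's contribution to every entry of the output.
        for i in range(n):
            for j in range(n):
                out[i][j] += bin(chunk[i] & chunk[j]).count("1")
    return out


# Sets {0,1,2}, {1,2,3}, {5} over a universe of 8 elements, 4 bits per chunk.
bitsets = [0b00000111, 0b00001110, 0b00100000]
for row in all_pairs_chunked(bitsets, universe_size=8, chunk_bits=4):
    print(row)
```

The outer loop over slices corresponds to moving k to the outside of the sum, and the inner double loop is the small dense product that the linear algebra optimizations then target.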
Now that your elements are a reasonable size, you can now appeal to traditional linear algebra optimizations to compute this product. I would probably do the following:
Each warp is assigned a large number of rows of the output matrix. It reads the corresponding elements out of the second vector, and then iterates through the first vector, computing products.
You could have all of the warps operate independently, but it may be better to do the following:
All of the warps in a block work together to load some number of elements from the first vector
Each warp computes the intersections it can and writes the results to the output matrix
You could store the loaded elements in shared memory, but you might get better results holding them in registers. Each warp can only compute the intersections with the set elements it's holding onto, but after doing so the warps can all rotate which warps are holding which elements.
If you do enough optimizations along these lines, you will probably reach the point where you are no longer memory bound, which means you might not have to go so far as to do the most complicated optimizations (e.g. the shared memory approach described above might already be enough).

Difference between observations and variables in Matlab

I'm kind of ashamed to even ask this but here goes. In every Matlab help file where the input matrix is a NxD matrix X Matlab describes the matrix arrangement as
Data, specified as a numeric matrix. The rows of X correspond to
observations, and the columns correspond to variables.
Above taken from help of kmeans
I'm kind of confused as to what does Matlab mean by observations and variables.
Suppose I have a data matrix composed of 100 images. Each image is represented by a feature vector of size 128 x 1. So are the 100 images my observations and the 128 features my variables, or is it the other way around?
Will my data matrix be of size 128 x 100 or 100 x 128?
Eugene's explanation in a statistical and probability construct is great, but I would like to explain it more in the viewpoint of data analysis and image processing.
Think of an observation as one sample from your data set. In this case, one observation is one image. For each sample, it has some dimensionality associated to it or a number of variables used to represent such a sample.
For example, if we had a set of 100 2D Cartesian points, the number of observations is 100, while the dimensionality or the total number of variables used to describe each point is 2: we have an x coordinate and a y coordinate. As such, in the MATLAB universe, we'd place all of these data points into a single matrix. Each row of the matrix denotes one point in your data set. Therefore, the matrix you would create here is 100 x 2.
Now, go back to your problem. We have 100 images and each image can be expressed by 128 features. This suspiciously looks like you are trying to use SIFT or SURF to represent an image so think of this situation where each image can be described by a 128-dimensional vector, or a histogram with bins of 128 elements. Each feature is part of the dimensionality makeup that makes up the image. Therefore, you would have a 100 x 128 matrix. Each row represents one image, where each image is represented as a 1 x 128 feature vector.
In general, MATLAB's machine learning and data analysis algorithms assume that your matrix is M x N, where M is the total number of points that make up your data set while N is the dimensionality of one such point in your data set. In MATLAB's universe, the total number of observations is equal to the total number of points in your data set, while the total number of features / distinct attributes to represent one sample is the total number of variables.
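The layout can be illustrated with a tiny sketch. It is written in Python rather than MATLAB, uses placeholder zeros instead of real features, and the variable names are only for illustration.

```python
num_observations = 100   # one observation per image
num_variables = 128      # one variable per feature

# One row per image, one column per feature.
X = [[0.0] * num_variables for _ in range(num_observations)]

rows, cols = len(X), len(X[0])
print(rows, "x", cols)

first_image = X[0]                       # a single observation: 128 values
first_feature = [row[0] for row in X]    # a single variable: 100 values
print(len(first_image), len(first_feature))
```

Taking a row gives everything known about one image, while taking a column gives one feature across all images.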
tl;dr
Observation: One sample from your data set
Variable: One feature / attribute that helps describe an observation or sample in your data set.
Number of observations: Total number of points in your data set
Number of variables: Total number of features / attributes that make up an observation or sample in your data set.
It looks like you are talking about some specific statistical/probabilistic functions. In statistics or probability theory there are some random variables that are results of some kind of measurements/observations over time (or some other dimension). So such a matrix is just a collection of N measurements of D different random variables.

Resources