partition a 2D array row-wise and use allgather?

I have a loop that looks like this:
do j=1,100
  do i=1,1000
    combined_array(i,j) = combined_array(i,j-1)
    call foo(combined_array(i,j))
  enddo
enddo

subroutine foo(x)
  x = somefunction(x)
end subroutine foo
I want to split the computation, but there is a dependency in the columns. If it were not there, I could have split the task on the columns and used allgather (see: partition a 2D array column-wise and use allgather).
For this loop I can split the tasks on the rows, but how do I combine the results using allgather? The chunk each rank gets is not contiguous in memory.

MPI provides the strided vector type for this kind of problem. You would construct such a type using a stride equal to the total column height (the number of rows of the full array) and a block size equal to the number of rows in the subblock. You would also give this type a count equal to the number of columns. Here is a concrete example: let's say you have a matrix of REAL with nr rows and nc columns and each process holds a subblock with nr1 rows. Then you would do:
integer :: rowtype
call MPI_TYPE_VECTOR(nc, nr1, nr, MPI_REAL, rowtype, ierr)
call MPI_TYPE_COMMIT(rowtype, ierr)
Let's say you would like to receive some data into a subblock that starts at row myrow. Using this new type you can simply do:
call MPI_RECV(array(myrow,1), 1, rowtype, src, tag, &
MPI_COMM_WORLD, status, ierr)
It works like this: starting at the provided address of the top left element of the subblock (that is, array(myrow,1)), it will put nr1 elements from the received message, then skip nr - nr1 elements of array, then put nr1 more elements and skip nr - nr1 elements again, and so on, nc times.
But there is a problem here. The extent of the rowtype type would be nc*nr elements. You cannot use it like that with MPI_(ALL)GATHERV() since you would only be able to position the beginning of each piece at offsets that are multiples of the type extent, i.e. multiples of nc*nr. To overcome this limitation, MPI allows you to artificially change the extent of the type using MPI_TYPE_CREATE_RESIZED. It takes a type and builds a new one that has the same type map (i.e. it will lay out elements in memory using the same "recipe" as the old type), but when computing offsets and other things that depend on the extent of the type, MPI will use the user-provided value instead of the real one. What you need to do is to alter the extent of rowtype to be equal to the extent of nr1 elements of type MPI_REAL. It is done like this:
integer(kind = MPI_ADDRESS_KIND) :: lb, extent
integer :: rowtype_resized
call MPI_TYPE_GET_EXTENT(MPI_REAL, lb, extent, ierr)
extent = extent * nr1
call MPI_TYPE_CREATE_RESIZED(rowtype, 0_MPI_ADDRESS_KIND, extent, rowtype_resized, ierr)
call MPI_TYPE_COMMIT(rowtype_resized, ierr)
Now you can use rowtype_resized to receive subblocks of nr1 rows and nc columns, and you can position them so that they start at any row of the big array that is a multiple of nr1, not only at multiples of the total size of array. You can then proceed like this:
call MPI_ALLGATHER(smallarray, nr1*nc, MPI_REAL, &
array, 1, rowtype_resized, MPI_COMM_WORLD, ierr)
This will gather the content of the small arrays smallarray (each of nr1 rows and nc columns) into the big arrays array (each of nr1 * #processes rows and nc columns).
You can even be more flexible and register the vector type with a block length of 1 instead of nr1. This will allow you to send a single row. Then you can create a resized type with an extent of one MPI_REAL element and use MPI_(ALL)GATHERV to gather differently sized subblocks.
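For illustration, here is a rough sketch in C of how that MPI_(ALL)GATHERV variant could look (the Fortran calls are analogous); the helper name, the rows[]/displs[] arrays and the float element type are assumptions for the example, not part of the answer above:

#include <mpi.h>

/* Hypothetical setup: the full array is nr x nc, column-major; rank r holds
 * rows[r] rows starting at row displs[r], and both arrays are known on every
 * rank. */
void gather_variable_rows(const float *smallarray, float *array,
                          int nr, int nc,
                          const int *rows, const int *displs,
                          int myrank, MPI_Comm comm)
{
    MPI_Datatype bigrow, bigrow_res, locrow, locrow_res;

    /* One row of the big array: nc elements, one per column, stride nr,
       resized to the extent of a single float so consecutive rows can be
       placed one after another. */
    MPI_Type_vector(nc, 1, nr, MPI_FLOAT, &bigrow);
    MPI_Type_create_resized(bigrow, 0, sizeof(float), &bigrow_res);
    MPI_Type_commit(&bigrow_res);

    /* One row of the local block (leading dimension rows[myrank]). */
    MPI_Type_vector(nc, 1, rows[myrank], MPI_FLOAT, &locrow);
    MPI_Type_create_resized(locrow, 0, sizeof(float), &locrow_res);
    MPI_Type_commit(&locrow_res);

    /* Each rank sends rows[myrank] of its local rows and receives rows[r]
       rows from rank r, placed starting at row displs[r] of the big array. */
    MPI_Allgatherv(smallarray, rows[myrank], locrow_res,
                   array, rows, displs, bigrow_res, comm);

    MPI_Type_free(&locrow_res);  MPI_Type_free(&locrow);
    MPI_Type_free(&bigrow_res);  MPI_Type_free(&bigrow);
}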
I know it's a bit tricky, but in time one learns how to master the type system of MPI.

Related

Making a 'hash' of two floats

I have two 32-bit floating point numbers. I want to keep a count of how often any combination of the two occurs. I could technically do this by concatenating them into a string and using a regular hash map to keep track of the count, but the overhead of that is considerable in my application, so I was thinking whether there would be a better way. I don't need to keep the full precision of a 32-bit float, and I know that one number is never > 10 and the other never > 100. So I could technically multiply the first by 10000 and the second by 1000, cast the results to int to chop off anything after the decimal point, bit-shift the first number 16 bits and combine them into an integer. I could then allocate an array of MAX_INT elements and use the integer I just created as an index into that array.
However, that would leave me with a 2GB array, most of which would be empty, so I'd like to avoid that. I was wondering if there are any hashing algorithms that go about this in a more sophisticated way, or any data structures that work in a 'tiered' way, like a tree where a lookup is first done on a combination of the first digits of each number, then on the second digits, and so on, so that no room needs to be allocated for any combinations that aren't known yet. (There is probably a problem with this exact approach; it's just an example of the direction I'm thinking in.) Any other way is fine too - maybe a more sophisticated way of hashing two floats together, such that the result is scaled between 0 and some number, where the choice of that 'some number' would give me a way to tune the max size of the lookup table in memory.
Any ideas?
You could copy the floats into one buffer and then hash that buffer, like this (this is C/C++):
#include <string.h> // for memcpy

int hashNumbers(float a, float b) {
    char bytes[2 * sizeof(float)];
    memcpy(bytes, &a, sizeof(float));
    memcpy(bytes + sizeof(float), &b, sizeof(float));
    // I don't know your implementation of the actual hash function; I'm
    // assuming it takes an array of bytes with its size and returns an int.
    return hash(bytes, 2 * sizeof(float));
}
As for the hash function itself, you could use the modulo: if you take the int, which has 2^32 possible values, and do x = x % 100, you are left with 100 possible values of x.
If instead of 100 you take a number that is a power of 2 (2, 4, 8, 16, etc.), you can replace the modulo with a cheaper bitwise operation. For example, by doing x = x >> 24 you keep only the top 8 bits, leaving 256 possible values for x.
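To make that concrete, here is a small sketch; the answer leaves the actual hash function open, so FNV-1a below is just one possible choice, and the reduction to a power-of-two table size uses a bitwise AND instead of modulo:

#include <stdint.h>
#include <string.h>

// One possible byte hash (FNV-1a); purely an illustrative choice.
static uint32_t fnv1a(const char *bytes, size_t len) {
    uint32_t h = 2166136261u;
    for (size_t i = 0; i < len; ++i) {
        h ^= (unsigned char)bytes[i];
        h *= 16777619u;
    }
    return h;
}

// Reduce the 32-bit hash to a table index. With a power-of-two table
// size the modulo can be replaced by a cheap bitwise AND.
static uint32_t bucketOf(float a, float b, uint32_t table_size_pow2) {
    char bytes[2 * sizeof(float)];
    memcpy(bytes, &a, sizeof(float));
    memcpy(bytes + sizeof(float), &b, sizeof(float));
    uint32_t h = fnv1a(bytes, sizeof(bytes));
    return h & (table_size_pow2 - 1);   // e.g. 65536 buckets -> h & 0xFFFF
}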

Sorted Two-Way Tabulation of Many Values

I have a decent-sized dataset (about 18,000 rows). I have two variables that I want to tabulate, one taking on many string values, and the second taking on just 4 values. I want to tabulate the string values by the 4 categories. I need these sorted. I have tried several commands, including tabsort, which works, but only if I restrict the number of rows it uses to the first 603 (at least with the way it is currently sorted). If the number of rows is greater than this, then I get the r(134) error that there are too many values. Is there anything to be done? My goal is to create a table with the most common words and export it to LaTeX. Would it be a lot easier to try and do this in something like R?
Here's one way, via contract and texsave from SSC:
/* Fake Data */
set more off
clear
set matsize 5000
set seed 12345
set obs 1000
gen x = string(rnormal())
expand mod(_n,10)
gen y = mod(_n,4)
/* Collapse Data to Get Frequencies for Each x-y Cell */
preserve
contract x y, freq(N)
reshape wide N, i(x) j(y)
forvalues v=0/3 {
    lab var N`v' "`v'" // need this for labeling
    replace N`v'=0 if missing(N`v')
}
egen T = rowtotal(N*)
gsort -T x // sort by occurrence
keep if T > 0 // set occurrence threshold
capture ssc install texsave
texsave x N0 N1 N2 N3 using "tab_x_y.tex", varlabel replace title("tab x y")
restore
/* Check Calculations */
type "tab_x_y.tex"
tab x y, rowsort

Sorting and Counting Elements in OpenCL

I want to create an OpenCL kernel that sorts and counts millions of ulongs.
Is there a particular algorithm that fits my needs, or should I go for a hash table?
To be clear, given the following input:
[42, 13, 9, 42]
I would like to get an output like this:
[(9,1), (13,1), (42,2)]
My first idea was to modify Counting Sort - which already counts in order to sort - but because of the wide range of ulongs it requires too much memory. Bitonic or radix sort plus something to count elements could be a way, but I'm missing a fast way to count the elements. Any suggestions on this?
Extra notes:
I'm developing using an NVIDIA Tesla K40C GPU and a Terasic DE5-Net FPGA. So far the main goal is to make it work on the GPU but I'm also interested in solutions that might be a nice fit for FPGAs.
I know that some values inside the range of ulong aren't used so we can use them to mark invalid elements or duplicates.
I want to consume the output from the GPU using multiple threads on the CPU, so I would like to avoid any solution that requires post-processing (on the host side, I mean) with data dependencies spread around the output.
This solution requires two passes of bitonic sort to both count the duplicates and remove them (well, move them to the end of the array). Bitonic sort runs in O(log²(n)) parallel steps, so two passes give 2·O(log²(n)), which shouldn't be a problem unless you are running this in a loop.
Create a simple struct for each of the elements that includes the number of duplicates and whether the element has already been counted as a duplicate, something like:
// Note: If you are worried about space, or know that there
// will only be a few duplicates for each element, then
// make the count element smaller
typedef struct {
    cl_ulong value;
    cl_ulong count : 63;
    cl_ulong seen  : 1;
} Element;
Algorithm:
You can start by creating a comparison function which will move duplicates to the end and count the duplicates if they are yet to be added to the total count for their element. This is the logic behind the comparison function:
If one element is a duplicate and another is not, return that the non-duplicate element is smaller (regardless of the values), which will move all duplicates to the end.
If the elements are duplicates and the right element has not been marked a duplicate (seen=0), then add the right element's count to the left element's count and set the right element as a duplicate (seen=1). This has the effect of moving the total count of an element with a specific value to the leftmost element in the array with that value, and all duplicates with that value to the end.
Otherwise, return that the element with the smaller value is smaller.
The comparison function would look like:
bool compare(Element* E1, Element* E2) {
    // Note: not const - the function updates count/seen as a side effect
    if (!E1->seen && E2->seen) return true;  // E1 smaller
    if (!E2->seen && E1->seen) return false; // E2 smaller
    // If the elements are duplicates and the right element has
    // not yet been "seen" by an element with the same value
    if (E1->value == E2->value && !E2->seen) {
        E1->count += E2->count;
        E2->seen = 1;
        return true;
    }
    // They aren't duplicates, and either
    // neither has been seen, or both have
    return E1->value < E2->value;
}
Bitonic sort has a specific structure, which can be nicely illustrated with a diagram. In the diagram, each element is referred to by a 3-tuple (a,b,c) where a = value, b = count, and c = seen.
Each diagram shows one run of bitonic sort on the array (vertical lines denote a comparison between elements, and horizontal lines move right to the next stage of the bitonic sort). Using the diagram and the above comparison function and logic, you should be able to convince yourself that this does what is required.
Run 1: (diagram not reproduced here)
Run 2: (diagram not reproduced here)
At the end of run 2, all elements are arranged by value. Duplicates with seen = 1 are at the end, and duplicates with seen = 0 are in their correct place and count is the number of other elements with the same value.
Implementation:
The diagrams are color coded to illustrate the sub-processes of bitonic sort. I'll call the blue blocks a phase (there are three phases in each run in the diagrams). In general, there will be ceil(log(N)) phases for each run. Each phase consists of a number of green block (I'll call these out-in blocks, because the shape of the comparisons is out to in), and red blocks (I'll call these constant blocks, because the distance between elements to compare remains constant).
From the diagram, the out-in block size (elements in each block) starts at 2 and doubles with each phase. The constant block size for each phase starts at half the out-in block size (in the second (blue) phase there are 2 elements in each of the four red blocks, because the green blocks have a size of 4) and halves for each successive vertical line of red blocks within the phase. Also, the number of successive vertical lines of constant (red) blocks in a phase is always the same as the phase number with 0 indexing (0 vertical lines of red blocks for phase 0, 1 vertical line of red blocks for phase 1, and 2 vertical lines of red blocks for phase 2 -- each vertical line is an iteration of calling that kernel).
You can then make kernels for the out-in passes, and the constant passes, then invoke the kernels from the host side (because you need to constantly synchronise, which is a disadvantage, but you should still see large performance improvements over sequential implementations).
From the host side, the overall bitonic sort might look like:
cl_uint num_elements = 4; // Set number of elements
cl_uint phases = (cl_uint)ceil(log2((float)num_elements));
cl_uint out_in_block_size = 2;
cl_uint constant_block_size;

// Set the elements_buffer, which should have been created with
// clCreateBuffer, as the first kernel argument, and the
// number of elements as the second kernel argument
clSetKernelArg(out_in_kernel, 0, sizeof(cl_mem), (void*)(&elements_buffer));
clSetKernelArg(out_in_kernel, 1, sizeof(cl_uint), (void*)(&num_elements));
clSetKernelArg(constant_kernel, 0, sizeof(cl_mem), (void*)(&elements_buffer));
clSetKernelArg(constant_kernel, 1, sizeof(cl_uint), (void*)(&num_elements));

// For each phase
for (unsigned int phase = 0; phase < phases; ++phase) {
    // -------------------- Green Part ------------------------ //
    // Set the out_in_block size for the kernel
    clSetKernelArg(out_in_kernel, 2, sizeof(cl_uint), (void*)(&out_in_block_size));
    // Call the kernel - command_queue is the cl_command_queue
    // which should have been created during CL setup
    clEnqueueNDRangeKernel(command_queue    , // cl_command_queue
                           out_in_kernel    , // The kernel
                           1                , // Work dim = 1 since 1D array
                           NULL             , // No global offset
                           &global_work_size,
                           &local_work_size ,
                           0                ,
                           NULL             ,
                           NULL);
    clFinish(command_queue); // Synchronise on the host side
    // ---------------------- End Green Part -------------------- //

    // Set the block size for constant blocks based on the out_in_block_size
    constant_block_size = out_in_block_size / 2;

    // -------------------- Red Part ------------------------ //
    for (unsigned int i = 0; i < phase; ++i) {
        // Set the constant_block_size as a kernel argument
        clSetKernelArg(constant_kernel, 2, sizeof(cl_uint), (void*)(&constant_block_size));
        // Call the constant kernel
        clEnqueueNDRangeKernel(command_queue    , // cl_command_queue
                               constant_kernel  , // The kernel
                               1                , // Work dim = 1 since 1D array
                               NULL             , // No global offset
                               &global_work_size,
                               &local_work_size ,
                               0                ,
                               NULL             ,
                               NULL);
        clFinish(command_queue); // Synchronise on the host side
        // Update constant_block_size for next iteration
        constant_block_size /= 2;
    }
    // ------------------- End Red Part ---------------------- //

    // Double the out-in block size for the next phase
    out_in_block_size *= 2;
}
And then the kernels would look something like this (you also need to put the struct typedef in the kernel file so that the OpenCL compiler knows what 'Element' is):
__kernel void out_in_kernel(__global Element* elements, unsigned int num_elements, unsigned int block_size) {
    const unsigned int idx_upper = ...; // index of upper element in diagram
    const unsigned int idx_lower = ...; // index of lower element in diagram
    // Check that both indices are in range (this depends on thread mapping)
    if (idx_upper is in range && idx_lower is in range) {
        // Do the comparison
        if (!compare(elements + idx_upper, elements + idx_lower)) {
            // Swap the elements
        }
    }
}
The constant_kernel will look the same, but the thread mapping (how you determine idx_upper and idx_lower) will be different. There are many ways you can map the threads to the elements generally to mimic the diagrams (note that the number of threads required is half the total number of elements, since each thread can do one comparison).
Another consideration is how to make the thread mapping general (so that if you have a number of elements which is not a power of two the algorithm doesn't break).
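For illustration, here is one possible thread mapping for the out-in kernel (a sketch, not the only option); it assumes num_elements is a power of two and that compare is declared in the kernel file taking __global Element* arguments:

// One possible mapping: each of the N/2 work-items does one comparison.
__kernel void out_in_kernel(__global Element* elements,
                            unsigned int num_elements,
                            unsigned int block_size) {
    const unsigned int t    = get_global_id(0);
    const unsigned int half = block_size / 2;
    const unsigned int blk  = t / half;            // which out-in (green) block
    const unsigned int pos  = t % half;            // position inside that block

    const unsigned int idx_upper = blk * block_size + pos;
    const unsigned int idx_lower = blk * block_size + (block_size - 1 - pos);

    if (idx_lower < num_elements) {
        if (!compare(&elements[idx_upper], &elements[idx_lower])) {
            Element tmp         = elements[idx_upper];
            elements[idx_upper] = elements[idx_lower];
            elements[idx_lower] = tmp;
        }
    }
}

// The constant (red) kernel uses the same skeleton; only the pairing changes:
//   idx_upper = blk * block_size + pos;
//   idx_lower = idx_upper + half;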
How about boost.compute or VexCL? Both provide sorting algorithms.
Mergesort works quite well on GPUs and you could modify it to sort key+count pairs instead of keys only. During merging you would then also check whether two keys are identical and, if so, fuse them into a single key during the merge. (If you merge [9/c:1, 42/c:1] and [13/c:1, 42/c:1] you would get [9/c:1, 13/c:1, 42/c:2].)
You might have to use parallel prefix sum to remove the gaps caused by fusing keys.
Or: use a regular GPU sort first, mark all keys where the key to their right is different (this is only true at the last occurrence of each unique key), use a parallel prefix sum over these marks to get consecutive indices for all unique keys, and note their positions in the sorted array. Then you only need to subtract the index of the previous unique key to get the count.
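As a rough sequential C illustration of that mark / prefix-sum / subtract idea (on the GPU each step would be a kernel or a parallel scan; the Pair struct and function name are made up for the example):

#include <stdint.h>
#include <stddef.h>

typedef struct { uint64_t value; uint64_t count; } Pair;

/* keys must already be sorted. Returns the number of unique keys written
 * to out. */
size_t count_sorted(const uint64_t *keys, size_t n, Pair *out) {
    size_t unique = 0;
    size_t prev_end = 0;                /* index one past the previous unique run */
    for (size_t i = 0; i < n; ++i) {
        /* "mark" step: true only at the last occurrence of each key */
        int is_last = (i + 1 == n) || (keys[i] != keys[i + 1]);
        if (is_last) {
            /* the running number of marks seen so far plays the role of the
             * prefix sum: it gives the output slot; subtracting the previous
             * boundary gives the count */
            out[unique].value = keys[i];
            out[unique].count = (uint64_t)(i + 1 - prev_end);
            prev_end = i + 1;
            ++unique;
        }
    }
    return unique;   /* for [42, 13, 9, 42] sorted: (9,1), (13,1), (42,2) */
}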

Stacked MPI derived data types in Fortran

MPI2 allows us to create derived data types and send them by writing
call mpi_type_create_indexed_block(size,1,dspl_send,rtype,DerType,ierr)
call mpi_send(data,1,DerType,jRank,20,comm,ierr)
By doing this, the positions dspl_send of data(N) are sent by the MPI library.
Now, for a matrix data(M,N) we can send its position via the following code:
call mpi_type_create_indexed_block(size,M,dspl_send,rtype,DerTypeM,ierr)
call mpi_send(data,1,DerTypeM,jRank,20,comm,ierr)
That is, the entries data(i, dspl_send(j)) are sent.
My question concerns the role of the 1 in the subsequent mpi_send. Does it always have to be 1? Is another count possible? MPI derived data types are explained nicely in many documents on the internet, but the count in send/recv is always 1, without mention of whether another count is allowed and how it could be used.
If we want to work with matrices data(M,N) whose size M varies between calls, do we need to create a new derived data type every time? Is it impossible to use DerType for sending a matrix data(M,N) or data(N,M)?
Each MPI datatype has two properties: size and extent. The size is the actual number of bytes that the datatype represents, while the extent is the number of bytes that the datatype covers in memory. Some datatypes are not contiguous, which means that their size might be less than their extent, e.g. (shown here in pseudocode):
MPI_TYPE_VECTOR(count = 1,
                blocklength = 10,
                stride = 20,
                oldtype = MPI_INTEGER,
                newtype = newtype)
creates a datatype that takes the first 10 (blocklength) elements from a total of 20 (stride). This datatype has a size of 10 times the size of MPI_INTEGER, which comes to 40 bytes on most systems. Its extent is twice as large, or 80 bytes on most systems. If count were 2, then it would take 10 elements, skip the next 10, then take another 10 elements and once again skip the next 10. Consequently its size and its extent would be twice as large.
When you specify a certain element count in any MPI routine, e.g. MPI_SEND, MPI does something like this:
It initialises the internal data buffer with the address of the source buffer argument.
It consults the datatype's type map to decide how many bytes to take and from where, and appends them to the message being constructed. The number of bytes added equals the size of the datatype.
It increments the internal data pointer by the extent of the datatype.
It decrements the internal count and if it is still non-zero, repeats the previous two steps.
One nifty feature of MPI is that the extent of the datatype is not required to match its size (as shown in the vector example), and one can even bestow whatever extent one wants on the datatype using MPI_TYPE_CREATE_RESIZED. This allows very complex data access patterns to be created. For example, using MPI_SCATTERV to scatter a matrix by blocks that do not span entire rows (C) or columns (Fortran) requires the use of such resized types.
Back to the vector example. Whether you create a vector type with count = 1 and then call MPI_SEND with count = 2 or you create a vector type with count = 2 and then call MPI_SEND with count = 1, the end result is the same. Often one constructs a datatype that fully describes the object that one wants to send. In this case one gives count = 1 in the call to MPI_SEND. But there are cases when it might be more beneficial to create a datatype that describes only a portion of the object, for example a single part, and then call MPI_SEND with count set to the number of parts that one wants to send. Sometimes it is a matter of personal preferences, sometimes it is a matter of algorithmic requirements.
As to your last question, Fortran stores matrices in column-major order, which means that data(i,j) is next to data(i±1,j) in memory and not to data(i,j±1). Consequently, data(M,N) consists of N consecutive column vectors of M elements each. The distance between two elements such as data(1,1) and data(1,2) depends on M. That's why you supply M in the type constructor. Matrices with a different number of rows (i.e. different M) would not "fit" the type map of the created type, and the wrong elements would be used to construct the message.
The description about extent in https://stackoverflow.com/a/13802243/7784768 is not entirely correct, as the extent does not take into account the padding at the end of the datatype. MPI datatypes are defined by a typemap:
typemap = ((type_0, disp_0 ), ..., (type_n−1, disp_n−1 ))
Extent is then defined as
lb = min_j(disp_j)
ub = max_j(disp_j + sizeof(type_j)) + e
extent = ub - lb
where e can be non-zero due to alignment requirements.
This means that in the example
MPI_TYPE_VECTOR(count = 1,
                blocklength = 10,
                stride = 20,
                oldtype = MPI_INTEGER,
                newtype = newtype)
with count=1, typemap is
((int, 0), (int, 4), ... (int, 36))
and the extent on most systems is 40, not 80 (i.e. the stride has no effect on the typemap in this case). For count=2, the typemap would be
((int, 0), (int, 4), ... (int, 36), (int, 80), (int, 84), ... (int, 116))
and the extent 120 (40 bytes for the first block of 10 integers, 40 bytes for the stride, and 40 bytes for the second block of 10 integers; the remaining stride is not included in the extent). One can easily find out the extent with the MPI_Type_get_extent function.
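For example, a quick check in C of the count=1 case above (the Fortran routines MPI_TYPE_SIZE and MPI_TYPE_GET_EXTENT behave the same way) might look like this; the values in the comments assume 4-byte integers:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    MPI_Datatype vec;
    MPI_Type_vector(1, 10, 20, MPI_INT, &vec);
    MPI_Type_commit(&vec);

    int size;
    MPI_Aint lb, extent;
    MPI_Type_size(vec, &size);                 /* 10 * sizeof(int) = 40       */
    MPI_Type_get_extent(vec, &lb, &extent);    /* also 40 here: the trailing  */
                                               /* stride is not included      */
    printf("size = %d, lb = %ld, extent = %ld\n", size, (long)lb, (long)extent);

    MPI_Type_free(&vec);
    MPI_Finalize();
    return 0;
}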
Extent is quite a tricky concept, and it is easy to make mistakes when trying to communicate multiple elements of a derived datatype.

Is there an efficient data structure for row and column swapping?

I have a matrix of numbers and I'd like to be able to:
Swap rows
Swap columns
If I were to use an array of pointers to rows, then I can easily switch between rows in O(1), but swapping a column is O(N) where N is the number of rows.
I have a distinct feeling there isn't a win-win data structure that gives O(1) for both operations, though I'm not sure how to prove it. Or am I wrong?
Without having thought this entirely through:
I think your idea with the pointers to rows is the right start. Then, to be able to "swap" a column, I'd just have another array whose size is the number of columns, storing in each field the current physical position of that column.
m =
[0] -> 1 2 3
[1] -> 4 5 6
[2] -> 7 8 9
c[] {0,1,2}
Now, to exchange columns 1 and 2, you would just change c to {0,2,1}.
When you then want to read row 1, you'd do:
for (i = 0; i < colcount; i++) {
    print m[1][c[i]];
}
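Here is a tiny self-contained C version of the same idea (the names are just for illustration): rows are swapped by swapping row pointers, columns by swapping entries of the index array c.

#include <stdio.h>

#define ROWS 3
#define COLS 3

int main(void) {
    int storage[ROWS][COLS] = { {1, 2, 3}, {4, 5, 6}, {7, 8, 9} };
    int *m[ROWS] = { storage[0], storage[1], storage[2] };  /* row pointers */
    int c[COLS]  = { 0, 1, 2 };                             /* column map   */

    /* swap columns 1 and 2: only the map changes, O(1) */
    int tmp = c[1]; c[1] = c[2]; c[2] = tmp;

    /* swap rows 0 and 1: only the pointers change, O(1) */
    int *tmpRow = m[0]; m[0] = m[1]; m[1] = tmpRow;

    /* read (logical) row 1 -- prints 1 3 2 */
    for (int i = 0; i < COLS; i++)
        printf("%d ", m[1][c[i]]);
    printf("\n");
    return 0;
}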
Just a random thought here (no experience of how well this really works, and it's a late night without coffee):
What I'm thinking is for the internals of the matrix to be a hashtable as opposed to an array.
Every cell within the array has three pieces of information:
The row in which the cell resides
The column in which the cell resides
The value of the cell
In my mind, this is readily represented by the tuple ((i, j), v), where (i, j) denotes the position of the cell (i-th row, j-th column), and v denotes its value.
That would be a somewhat normal representation of a matrix. But let's abstract the idea here. Rather than i denoting the row as a position (i.e. 0 before 1 before 2 before 3, etc.), let's just consider i to be some sort of canonical identifier for its corresponding row. Let's do the same for j. (While in the most general case i and j could then be unrestricted, let's assume a simple case where they remain within the ranges [0..M] and [0..N] for an M x N matrix, but don't denote the actual coordinates of a cell.)
Now, we need a way to keep track of the identifier for a row, and the current index associated with the row. This clearly requires a key/value data structure, but since the number of indices is fixed (matrices don't usually grow/shrink), and only deals with integral indices, we can implement this as a fixed, one-dimensional array. For a matrix of M rows, we can have (in C):
int RowMap[M];
For the m-th row, RowMap[m] gives the identifier of the row in the current matrix.
We'll use the same thing for columns:
int ColumnMap[N];
where ColumnMap[n] is the identifier of the n-th column.
Now to get back to the hashtable I mentioned at the beginning:
Since we have complete information (the size of the matrix), we should be able to generate a perfect hashing function (without collision). Here's one possibility (for modestly-sized arrays):
int Hash(int row, int column)
{
    return row * N + column;
}
If this is the hash function for the hashtable, we should get zero collisions for most sizes of arrays. This allows us to read/write data from the hashtable in O(1) time.
The cool part is interfacing the index of each row/column with the identifiers in the hashtable:
// row and column are given in the usual way, in the range [0..M] and [0..N]
// These parameters are really just used as handles to the internal row and
// column indices
int MatrixLookup(int row, int column)
{
    // Get the canonical identifiers of the row and column, and hash them.
    int canonicalRow = RowMap[row];
    int canonicalColumn = ColumnMap[column];
    int hashCode = Hash(canonicalRow, canonicalColumn);

    return HashTableLookup(hashCode);
}
Now, since the interface to the matrix only uses these handles, and not the internal identifiers, a swap operation of either rows or columns corresponds to a simple change in the RowMap or ColumnMap array:
// This function simply swaps the values at
// RowMap[row1] and RowMap[row2]
void MatrixSwapRow(int row1, int row2)
{
    int canonicalRow1 = RowMap[row1];
    int canonicalRow2 = RowMap[row2];

    RowMap[row1] = canonicalRow2;
    RowMap[row2] = canonicalRow1;
}

// This function simply swaps the values at
// ColumnMap[column1] and ColumnMap[column2]
void MatrixSwapColumn(int column1, int column2)
{
    int canonicalColumn1 = ColumnMap[column1];
    int canonicalColumn2 = ColumnMap[column2];

    ColumnMap[column1] = canonicalColumn2;
    ColumnMap[column2] = canonicalColumn1;
}
So that should be it - a matrix with O(1) access and mutation, as well as O(1) row swapping and O(1) column swapping. Of course, even an O(1) hash access will be slower than the O(1) of array-based access, and more memory will be used, but at least rows and columns are treated symmetrically.
I tried to be as agnostic as possible about exactly how you implement your matrix, so I wrote some C. If you'd prefer another language, I can change it (it would be best if you understood it), but I think it's pretty self-descriptive, though I can't ensure its correctness as far as C goes, since I'm actually a C++ guy trying to act like a C guy right now (and did I mention I don't have coffee?). Personally, writing this in a full OO language would do the entire design more justice, and also give the code some beauty, but like I said, this was a quickly whipped-up implementation.
