How to map processes to a hypercube using MPI_Cart_create - sorting

I am trying to implement bitonic sort using MPI for 2^n processors.
I would like to use an n-dimensional hypercube to do so, for convenience. Using MPI_Cart_create I can create self-organising dimensions; doing so would maximise the efficiency of my process and also reduce the number of lines of code I have to write to get it done.
Googling and the literature always say the same thing:
Note that an n-dimensional hypercube is an n-dimensional torus with 2 processes per coordinate direction. Thus, special support for hypercube structures is not necessary.
I haven't seen a single example, and "an n-dimensional torus with 2 processes per coordinate direction" is nothing but a mystery to me. Does anyone have a suggestion?
Thanks,

Well, found it.
The code below would be for a 4-d hypercube; the pattern is pretty straightforward. In an n-dimensional hypercube each node has n neighbours, and all of them are represented in this code. Note that this code should be used instead of XOR-ing a bit mask, because MPI can reorder the processes to fit the physical layout of your cluster.
int rank, size; // I am process RANK and we are a total of SIZE
MPI_Init(&argc, &argv);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
MPI_Comm_size(MPI_COMM_WORLD, &size);
myFairShareOfNumber = totalNumber / size;

// Build a 4-dimensional torus with 2 processes per dimension, i.e. a
// 4-d hypercube; reorder=true lets MPI remap ranks onto the hardware.
MPI_Comm nthCube;
int nDim = 4;
int processPerDim[4] = {2, 2, 2, 2};
int period[4] = {1, 1, 1, 1}; // periodic in every dimension

MPI_Cart_create(MPI_COMM_WORLD, nDim, processPerDim, period, true, &nthCube);

int rankInDim;
MPI_Comm_rank(nthCube, &rankInDim);

// One shift per dimension yields the neighbour across each of the 4 edges.
int rank_source, rank_desta, rank_destb, rank_destc, rank_destd;
MPI_Cart_shift(nthCube, 0, 1, &rank_source, &rank_desta);
MPI_Cart_shift(nthCube, 1, 1, &rank_source, &rank_destb);
MPI_Cart_shift(nthCube, 2, 1, &rank_source, &rank_destc);
MPI_Cart_shift(nthCube, 3, 1, &rank_source, &rank_destd);

cerr << "I am known in the world as " << rankInDim << " my adjacents are -> "
     << rank_desta << "-" << rank_destb << "-" << rank_destc << "-" << rank_destd << "\n";


PyOpenCL - Multi-dimensional reduction kernel

I'm a total newbie to OpenCL.
I'm trying to code a reduction kernel that sums along one axis of a multi-dimensional array. I have stumbled upon this code, which comes from here: https://tmramalho.github.io/blog/2014/06/16/parallel-programming-with-opencl-and-python-parallel-reduce/
__kernel void reduce(__global float *a, __global float *r, __local float *b) {
    uint gid = get_global_id(0);
    uint wid = get_group_id(0);
    uint lid = get_local_id(0);
    uint gs = get_local_size(0);
    // stage this work-item's element in local memory
    b[lid] = a[gid];
    barrier(CLK_LOCAL_MEM_FENCE);
    // tree reduction: halve the active range at each step
    for(uint s = gs/2; s > 0; s >>= 1) {
        if(lid < s) {
            b[lid] += b[lid+s];
        }
        barrier(CLK_LOCAL_MEM_FENCE);
    }
    // work-item 0 writes the group's partial sum
    if(lid == 0) r[wid] = b[lid];
}
I don't understand the for-loop part. I get that uint s = gs/2 means that we split the array in half, but after that it is a complete mystery. Without understanding it, I can't really implement another version that takes the maximum of an array, for instance, let alone one for multi-dimensional arrays.
Furthermore, as far as I understand, the reduce kernel needs to be run again if "N is bigger than the number of cores in a single unit".
Could you give me further explanations of that whole piece of code? Or even guidance on how to implement it for taking the max of an array?
Complete code can be found here: https://github.com/tmramalho/easy-pyopencl/blob/master/008_localreduce.py
Your first question about the meaning of the for loop:
for(uint s = gs/2; s > 0; s >>= 1)
It means that you start with s equal to half the local size gs and keep halving it (the shift s >>= 1 is equivalent to s = s/2) while s > 0, in other words until the last pass runs with s = 1. On each pass, work-item lid adds the element s positions above it into its own slot, so the live portion of the buffer shrinks by half every step until the total is left in b[0]. This algorithm depends on your array's size being a power of 2; otherwise you'd have to deal separately with the elements beyond the largest power of 2, or pad your array with the reduction's neutral value (0 for a sum) up to a power-of-2 size.
About your second concern, when N is bigger than the capacity of your GPU: you are right, you have to run your reduction in portions that fit and then merge the partial results.
Finally, when you ask for guidance on how to implement a reduction to get the max of an array, I would suggest the following:
For a simple reduction like max or sum, try using numpy, especially if you are dealing with programming the reduction by axis.
If you think that the GPU would give you an advantage, try first using pyopencl's Multidimensional Array functionality, e.g. max.
If the reduction is more math intensive, try using pyopencl's Parallel Algorithms, e.g. reduction
I think that the whole point of using pyopencl is to avoid dealing with the underlying GPU's architecture. Otherwise, it is easier to deal with CUDA or HIP directly instead of OpenCL.

What is the best nearest neighbor algorithm for my case?

I have a predefined list of GPS positions which basically makes up a predefined car track. There are around 15000 points in the list. The whole list is known in advance; no points need to be inserted afterwards. Then I get around 1 million extra sampled GPS positions, for each of which I need to find the nearest neighbour in the predefined list. I need to process all 1 million items in a single pass, and I need to do it as quickly as possible. What would be the best nearest-neighbour algorithm for this case?
I can preprocess the predefined list as much as I need, but processing the 1 million items afterwards should be as quick as possible.
I have tested a KDTree C# implementation but the performance seemed poor; maybe there exists a more appropriate algorithm for my 2D data (the GPS altitude is ignored in my case).
Thank you for any suggestions!
CGAL has a 2D point library for nearest-neighbour and range searches, based on a Delaunay triangulation data structure.
Here is a benchmark of their library for your use case:
// file: cgal_benchmark_2dnn.cpp
#include <CGAL/Exact_predicates_inexact_constructions_kernel.h>
#include <CGAL/Point_set_2.h>
#include <chrono>
#include <iostream>
#include <list>
#include <random>

typedef CGAL::Exact_predicates_inexact_constructions_kernel K;
typedef CGAL::Point_set_2<K>::Vertex_handle Vertex_handle;
typedef K::Point_2 Point_2;

/**
 * @brief Time a lambda function.
 *
 * @param lambda - the function to execute and time
 *
 * @return the number of microseconds elapsed while executing lambda
 */
template <typename Lambda>
std::chrono::microseconds time_lambda(Lambda lambda) {
  auto start_time = std::chrono::high_resolution_clock::now();
  lambda();
  auto end_time = std::chrono::high_resolution_clock::now();
  return std::chrono::duration_cast<std::chrono::microseconds>(end_time -
                                                               start_time);
}

int main() {
  const int num_index_points = 15000;
  const int num_trials = 1000000;
  std::random_device rd;  // will be used to seed the random number engine
  std::mt19937 gen(rd()); // standard mersenne_twister_engine seeded with rd()
  std::uniform_real_distribution<> dis(-1., 1.);
  std::list<Point_2> index_point_list;
  {
    auto elapsed_microseconds = time_lambda([&] {
      for (int i = 0; i < num_index_points; ++i) {
        index_point_list.emplace_back(dis(gen), dis(gen));
      }
    });
    std::cout << " Generating " << num_index_points << " random points took "
              << elapsed_microseconds.count() << " microseconds.\n";
  }
  CGAL::Point_set_2<K> point_set;
  {
    auto elapsed_microseconds = time_lambda([&] {
      point_set.insert(index_point_list.begin(), index_point_list.end());
    });
    std::cout << " Building point set took " << elapsed_microseconds.count()
              << " microseconds.\n";
  }
  {
    auto elapsed_microseconds = time_lambda([&] {
      for (int j = 0; j < num_trials; ++j) {
        Point_2 query_point(dis(gen), dis(gen));
        Vertex_handle v = point_set.nearest_neighbor(query_point);
      }
    });
    auto rate = elapsed_microseconds.count() / static_cast<double>(num_trials);
    std::cout << " Querying " << num_trials << " random points took "
              << elapsed_microseconds.count()
              << " microseconds.\n >> Microseconds / query :" << rate << "\n";
  }
}
On my system (Ubuntu 18.04) this can be compiled with
g++ cgal_benchmark_2dnn.cpp -lCGAL -lgmp -O3
and when run yields the performance:
Generating 15000 random points took 1131 microseconds.
Building point set took 11469 microseconds.
Querying 1000000 random points took 2971201 microseconds.
>> Microseconds / query :2.9712
Which is pretty fast. Note, with N processors you could speed this up roughly N times.
Fastest possible implementation
If two or more of the following are true:
You have a small bounding box for the 15000 index points
You only care about precision up to a few decimal places (note that for lat & long coordinates, going much beyond 6 decimal places yields centimetre/millimetre-scale precision)
You have copious amounts of memory on your system
Then cache everything! You can pre-compute a grid of desired precision over your bounding box of index points. Map each grid cell to a unique address that can be indexed knowing the 2D coordinate of a query point.
Then simply use any nearest neighbour algorithm (such as the one I supplied) to map each grid cell to the nearest index point. Note this step only has to be done once to initialize the grid cells within the grid.
To run a query, this requires one 2D-coordinate-to-grid-cell calculation followed by one memory access, so you can't really hope for a faster approach (likely only a few CPU cycles per query).
I suspect (with some insight) this is how a giant corporation like Google or Facebook would approach the problem (since #3 is not a problem for them even for the entire world.) Even smaller non-profit organizations use schemes like this (like NASA.) Albeit, the scheme NASA uses is far more sophisticated with multiple scales of resolution/precision.
Clarification
From the comment below, it's clear the last section was not well understood, so I will include some more details.
Suppose your set of points is given by two vectors x and y which contain the x & y coordinates of your data (or lat & long or whatever you are using.)
Then the bounding box of your data is defined with dimension width = max(x)-min(x) & height=max(y)-min(y).
Now create a fine mesh grid of NxM points to represent the entire bounding box, using the following mapping for a query point (x_t, y_t):
u(x_t) = round((x_t - min(x)) / double(width) * N)
v(y_t) = round((y_t - min(y)) / double(height) * M)
Then simply use indices = grid[u(x_t),v(y_t)], where indices are the indices of the closest index points to [x_t,y_t] and grid is a precomputed lookup table that maps each item in the grid to the closest index point [x,y].
For example, suppose that your index points are [0,0] and [2,2] (in that order.) You can create the grid as
grid[0,0] = 0
grid[0,1] = 0
grid[0,2] = 0 // this is a tie
grid[1,0] = 0
grid[1,1] = 0 // this is a tie
grid[1,2] = 1
grid[2,0] = 1 // this is a tie
grid[2,1] = 1
grid[2,2] = 1
where the right-hand side above is either index 0 (which maps to the point [0,0]) or 1 (which maps to the point [2,2]). Note: due to the discrete nature of this approach you will have ties, where the distance to one index point is exactly equal to the distance to another; you will have to come up with some means of breaking these ties. Also note that the number of entries in the grid determines the precision you can reach; obviously in the example above the precision is terrible.
K-D trees are indeed well suited to the problem. You should first try again with known-good implementations, and if performance is not good enough, you can easily parallelize queries -- since each query is completely independent of others, you can achieve a speedup of N by working on N queries in parallel, if you have enough hardware.
I recommend OpenCV's implementation, as mentioned in this answer
Performance-wise, the ordering of the points that you insert can have a bearing on query times, since implementations may choose whether or not to rebalance unbalanced trees (and, for example, OpenCV's does not do so). A simple safeguard is to insert points in a random order: shuffle the list first, and then insert all points in the shuffled order. While not optimal, this ensures that, with overwhelming probability, the resulting order will not be pathological.

How can I print a formatted sparse matrix to the console with Eigen?

I am working with Eigen. I have a sparse matrix defined by a set of Triplets and I would like to print the matrix in a formatted way. I have seen that it is possible with an ordinary Matrix by doing Matrix.format(FORMAT_TYPE) (Eigen: IOFormat), but I cannot find a way to do the same for a sparse Matrix. I would like to obtain an output like Matlab's output for matrices.
Many thanks in advance.
To get nice formatting, you need to first convert it to a dense matrix:
SparseMatrix<double> spmat;
...
std::cout << MatrixXd(spmat) << std::endl;
Probably not of interest to the OP anymore, but I came here via Google and so others will maybe too...
It's not practical to print a whole sparse matrix directly, because they are usually very big. The block operator works for sparse matrices too, so you can do something like:
int nElements = 10;
std::cout << compMat.block( compMat.rows() - nElements, compMat.cols() - nElements,
                            nElements, nElements )
          << std::endl;
to print the last 10 elements in the bottom right corner of a square sparse matrix.
This takes 6ms in release mode on my machine.
The following code does the same on the full matrix with roughly 35000*35000 entries, but takes ~25000ms...
int nElements = 10;
std::cout << Eigen::MatrixXd( compMat ).block( compMat.rows() - nElements,
                                               compMat.cols() - nElements,
                                               nElements, nElements )
          << std::endl;

C++ Beginner Exercise

I'm working through a book and one of the assignments is to write a program that does this:
Prompts the user for values.
Stores the highest and lowest value.
Displays the highest and lowest value.
All using a while loop.
So I wrote this:
#include <iostream>

double length;
double length_highest = 0;
double length_lowest = 0;
bool first = true; // no value seen yet

int main()
{
    std::cout << "Please enter a length.\n";
    while (std::cin >> length) {
        if (first) { // the first value is both the highest and lowest so far
            length_lowest = length;
            length_highest = length;
            first = false;
        } else if (length < length_lowest) {
            length_lowest = length;
        } else if (length > length_highest) {
            length_highest = length;
        }
        std::cout << "The highest length is " << length_highest << ".\n";
        std::cout << "The lowest length is " << length_lowest << ".\n";
    }
}
Then, the book asks me to modify the program so that it will also accept the units of length of cm, m, ft, and in AND to take into account conversion factors. So, if a user entered in 10 cm, then one inch, the program would have to know that 10 cm > 1 inch. The program would have to store it AND display it WITH the correct unit that corresponds to it.
I've been trying to write this in for the past 3 days or so and all of my methods have failed so I kind of want to move on with the book at this point.
Any suggestions help.
Since it's an exercise I won't give you the code solution directly.
First of all, you will need to know which number goes with which unit, so you will have to store each number.
You could store each entry as a pair of two elements, the number and its unit. To do so, just parse the input.
Then, since you'll have to retrieve your elements from that collection, instead of storing the length itself as maxLength, store the index where it sits in the collection as maxIndex.
Then everything is easy: you know how to convert from cm to inches (basic maths), and you know how to retrieve the max and min lengths with their units.
Another piece of advice to help you: write functions. Easy and small functions.
Ideas of functions you could write:
InchToCm(length)
CmToInch(length)
isGreaterThan([length,units], [length,units])
Be creative :D
There are other ways to do this; this is just one.

Population segmentation algorithm

I have a population of 50 ordered integers (1,2,3,...,50) and I am looking for a generic way to slice it with n cutoff points (n ranging from 1 to 25) while maintaining the order of the elements.
For example, for n=1 (one cutoff point) there are 49 possible grouping alternatives ([1, 2-50], [1-2, 3-50], [1-3, 4-50], ...). For n=2 (two cutoff points), the grouping alternatives look like [1, 2, 3-50], [1, 2-3, 4-50], ...
Could you recommend any general-purpose algorithm to complete this task in an efficient way?
Thanks,
Chris
Thanks everyone for your feedback. I reviewed all your comments and I am working on a generic solution that will return all combinations (e.g., [1,2,3-50], [1,2-3,4-50],...) for all numbers of cutoff points.
Thanks again,
Chris
Let sequence length be N, and number of slices n.
That problem becomes easier when you notice that choosing a slicing into n slices is equivalent to choosing n - 1 of the N - 1 possible split points (there is a split point between every two numbers in the sequence). Hence there are (N - 1 choose n - 1) such slicings.
To generate all slicings (into n slices), you have to generate all (n - 1)-element subsets of the numbers from 1 to N - 1.
The exact algorithm for this problem is placed here: How to iteratively generate k elements subsets from a set of size n in java?
Do you need the cutoffs, or are you just counting them? If you're just counting, then it's simple:
1 cutoff = (n-1) options
2 cutoffs = (n-1)(n-2)/2 options
3 cutoffs = (n-1)(n-2)(n-3)/6 options
in general, k cutoffs give (n-1 choose k) options; you can see the pattern here.
If you actually need the cutoffs, then you have to do the loops, but since n is so small, Emilio is right, just brute-force it.
1 cutoff
for(i=1; i<n; ++i)
    cout << i;
2 cutoffs
for(i=1; i<n; ++i)
    for(j=i+1; j<n; ++j)
        cout << i << " " << j;
3 cutoffs
for(i=1; i<n; ++i)
    for(j=i+1; j<n; ++j)
        for(k=j+1; k<n; ++k)
            cout << i << " " << j << " " << k;
again, you can see the pattern
So you want to select 25 split points from 49 choices in all possible ways. There are a lot of well-known algorithms to do that.
I want to draw your attention to another side of this problem. There are 49!/(25!*(49-25)!) = 63,205,303,218,876 >= 2^45 ~= 10^13 different combinations. So if you want to store them, the required amount of memory is 32T * sizeof(Combination); I guess that it would pass the 1 PB mark.
Now let's assume that you want to process the generated data on the fly, and make the rather optimistic assumption that you can process 1 million combinations per second (with no parallelisation). This task would then take 10^13 / 10^6 = 10^7 seconds = 2777 hours = 115 days.
This problem is more complicated than it seems at first glance. If you want to solve it at home in reasonable time, my suggestion is to change the strategy or wait for the advance of quantum computers.
This will generate an array of all the slicings, but I warn you, it'll take tons of memory due to the large number of results (50 elements with 3 splits is 49*48*47 = 110544, since this approach generates each slicing once per ordering of its splits). This is the general algorithm I'd use.
typedef std::vector<int>::iterator iterator_t;
typedef std::pair<iterator_t, iterator_t> range_t;
typedef std::vector<range_t> slicing_t;  // one way to slice: a list of ranges
typedef std::vector<slicing_t> answer_t; // a collection of slicings

answer_t F(std::vector<int>& integers, int slices) {
    answer_t prev;    // things to slice more
    answer_t results; // the finished slicings
    // initialize results for 0 slices: one slicing holding the whole range
    results.push_back(slicing_t(1, range_t(integers.begin(), integers.end())));
    // while there's still more slicing to do
    while (slices--) {
        // move "results" to the "things to slice" pile
        prev.clear();
        prev.swap(results);
        // for each thing to slice
        for (std::size_t group = 0; group < prev.size(); ++group) {
            // for each range
            for (std::size_t crange = 0; crange < prev[group].size(); ++crange) {
                std::size_t len = prev[group][crange].second - prev[group][crange].first;
                // for each interior place in that range
                for (std::size_t newsplit = 1; newsplit < len; ++newsplit) {
                    // copy the "result"
                    slicing_t cur = prev[group];
                    // slice it
                    range_t L(cur[crange].first, cur[crange].first + newsplit);
                    range_t R(cur[crange].first + newsplit, cur[crange].second);
                    slicing_t::iterator loc = cur.erase(cur.begin() + crange);
                    loc = cur.insert(loc, R);
                    cur.insert(loc, L);
                    // add it to the results
                    results.push_back(cur);
                }
            }
        }
    }
    return results;
}
