parallel_for writing to an unsorted vector - c++11

I have a parallel_for loop that iterates over a large vector, analyses different portions of it (all possible contiguous subsequences: 1-2, 1-3, 1-4, 2-3, 2-4, etc.) and stores statistics about those portions in another, unsorted vector (using push_back). That vector is then sorted after the parallel_for loop ends. Shouldn't this be possible? But I get strange results: the size of the unsorted vector produced is not correct; about 2% of the necessary iterations are missing (everything is correct with a normal for loop). Part of the problem might be that the parallel_for loop has an unequal work load: for example, for a vector with 100 members, the first runs of the outer loop have to iterate through the entire 100 members, while the last runs only have to go through 98-100 and 99-100.
Here's a simplified version of the code (I use unsigned in the loops because I store them along with the index):
vector<patchIndex> indexList;
indexList.reserve(2000000);
parallel_for(unsigned(1), units.size(), [&](unsigned n)
{
    for (unsigned j = 0; j != (units.size() - n); j++)
    {
        patchIndex currIndex;
        for (auto it = units.begin() + n; it != units.begin() + (j + n + 1); it++)
        {
            //calculate an index from the (*it).something
        }
        //some more calculations of the index
        indexList.push_back(currIndex);
    }
});
sort(indexList.begin(), indexList.end(), [](patchIndex &a, patchIndex &b) {return a.index > b.index; });
// at this point Visual Studio says that "sort is ambiguous" but it compiles anyway
indexList.size() should be (units.size() + 1) * (units.size() / 2), but it is slightly less, and a bunch of the indexes are just zeros, which the algorithm cannot legitimately produce. So is it simply impossible to write to a shared vector in a parallel_for, as simple as that?

Along with @Yakk's suggestion and concurrent_vector, you could also do this.
combinable<vector<patchIndex>> indexList;
// In the loop.
indexList.local().push_back(currIndex);
// When out of the loop.
vector<patchIndex> result;
result.reserve(2000000);
// combine_each visits each thread-local vector, not individual elements.
indexList.combine_each([&](const vector<patchIndex> &local)
{
    result.insert(result.end(), local.begin(), local.end());
});
sort(result.begin(), result.end(), [](const patchIndex &a, const patchIndex &b) { return a.index > b.index; });
I haven't tested whether concurrent_vector or combinable performs better. The point is to understand that we can also use lock-free containers for this kind of work.
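For completeness, here is an untested sketch of the concurrent_vector alternative (patchIndex and units are the question's own names; everything else is illustrative). push_back on a concurrent_vector is thread-safe, so the loop body can stay as it was:

#include <ppl.h>
#include <concurrent_vector.h>
#include <algorithm>
#include <vector>
using namespace concurrency;

concurrent_vector<patchIndex> indexList;
parallel_for(size_t(1), units.size(), [&](size_t n)
{
    patchIndex currIndex;
    // ... compute currIndex as in the question ...
    indexList.push_back(currIndex); // thread-safe append
});
// gather into a plain vector once the parallel loop is done, then sort
std::vector<patchIndex> sorted(indexList.begin(), indexList.end());
std::sort(sorted.begin(), sorted.end(),
          [](const patchIndex &a, const patchIndex &b) { return a.index > b.index; });

The appends land in a non-deterministic order, but that does not matter here since the vector is sorted afterwards anyway.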

Related

Why is recursive Merge Sort preferred over iterative Merge Sort even though the latter has better auxiliary space complexity?

While studying the Merge Sort algorithm, I was curious to know if this sorting algorithm can be further optimised. I found out that there exists an iterative version of Merge Sort with the same time complexity but even better O(1) space complexity. And the iterative approach is always better than the recursive approach in terms of performance. Then why is it less common and rarely talked about in any regular algorithms course?
Here's the link to Iterative Merge Sort algorithm
If you think that it has O(1) space complexity, look again. They have the original array A of size n, and an auxiliary temp also of size n. (It actually only needs to be n/2 but they kept it simple.)
And the reason they need that second array is that when you merge, you copy the bottom region out to temp, then merge back starting where it was.
So the tradeoff is this: a recursive solution involves a lot fewer fiddly bits and makes the concepts clearer, but adds an O(log n) memory overhead on top of the O(n) memory overhead that both solutions share. When you're trying to communicate concepts, that's a straight win.
Furthermore in practice I believe that recursive is also a win.
In the iterative approach you keep making full passes through your entire array. Which, in the case of a large array, means that data comes into the cache for a pass, gets manipulated, and then falls out as you load the rest of the array. Only to have to be loaded again for the next pass.
In the recursive approach, by contrast, for the operations that are the equivalent of the first few passes you load them into cache, completely sort them, then move on. (How many passes you get this win for depends heavily on data type, memory layout, and the size of your CPU cache.) You are only loading/unloading from cache when you're merging too much data to fit into cache. Algorithms courses generally omit such low-level details, but they matter a lot to real-world performance.
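To make the contrast concrete, here is a minimal top-down sketch (mine, not from the linked article): the recursion contributes the O(log n) stack overhead discussed above, while both variants share the O(n) temp buffer.

#include <vector>
#include <cstddef>

// Sorts a[lo, hi) stably, merging through a caller-provided temp buffer.
// Call as: std::vector<int> tmp(a.size()); merge_sort(a, tmp, 0, a.size());
void merge_sort(std::vector<int>& a, std::vector<int>& tmp,
                std::size_t lo, std::size_t hi)
{
    if (hi - lo < 2)
        return;                         // runs of length 0 or 1 are sorted
    std::size_t mid = lo + (hi - lo) / 2;
    merge_sort(a, tmp, lo, mid);        // sort left half
    merge_sort(a, tmp, mid, hi);        // sort right half
    std::size_t l = lo, r = mid, o = lo;
    while (l < mid && r < hi)           // merge the two halves into tmp
        tmp[o++] = (a[l] <= a[r]) ? a[l++] : a[r++];
    while (l < mid) tmp[o++] = a[l++];  // drain left remainder
    while (r < hi)  tmp[o++] = a[r++];  // drain right remainder
    for (std::size_t i = lo; i < hi; ++i)
        a[i] = tmp[i];                  // copy merged run back
}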
Found out that there exists Iterative version of Merge Sort algorithm with same time complexity but even better O(1) space complexity
The iterative, bottom-up implementation of Merge Sort you linked to doesn't have O(1) space complexity. It maintains a copy of the array, which represents O(n) space complexity. Consequently, the additional O(log n) stack space of the recursive implementation is irrelevant for the total space complexity.
In the title of your question, and in some comments, you use the words "auxiliary space complexity". This is what we usually mean by space complexity, but you seem to suggest this term means constant space complexity. That is not true: "auxiliary" refers to the space other than the space used by the input. The term by itself tells us nothing about the actual complexity.
Recursive top-down merge sort is mostly educational. Most actual libraries use some variation of a hybrid of insertion sort and bottom-up merge sort, using insertion sort to create small sorted runs that will be merged in an even number of merge passes, so that merging back and forth between the original and temp arrays ends up with the sorted data in the original array (no copy operation in merge, other than for singleton runs at the end of an array, which can be avoided by choosing an appropriate initial run size for insertion sort; note that this is not done in my example code, which only uses run size 32 or 64, while a more advanced method like Timsort does choose an appropriate run size).
Bottom-up is slightly faster because the array pointers and indexes are kept in registers (assuming an optimizing compiler), while top-down keeps pushing and popping array pointers and indexes to and from the stack.
Although I'm not sure that the OP actually meant O(1) space complexity for a merge sort, it is possible, but it is about 50% slower than conventional merge sort with O(n) auxiliary space. It's mostly a research (educational) effort now, and the code is fairly complex. One of its options is no extra buffer at all. Its benchmark table is for a relatively small number of keys (the max is 32767 keys); for a large number of keys, this example ends up about 50% slower than an optimized insertion + bottom-up merge sort (std::stable_sort is generalized, e.g. using a pointer to a function for every compare, so it is not fully optimized). Link to example code:
https://github.com/Mrrl/GrailSort
Example hybrid insertion sort + bottom-up merge sort C++ code:
#include <cstddef>  // size_t
#include <utility>  // std::swap

void   MergeSort(int a[], size_t n);
void   MergeSort(int a[], int b[], size_t n);
void   Merge(int a[], int b[], size_t ll, size_t rr, size_t ee);
size_t GetPassCount(size_t n);

void MergeSort(int a[], size_t n)       // entry function
{
    if (n < 2)                          // if size < 2 return
        return;
    int *b = new int[n];
    MergeSort(a, b, n);
    delete[] b;
}

void MergeSort(int a[], int b[], size_t n)
{
    size_t s;                           // run size
    // choose the initial run size so that the number of merge passes is
    // even, leaving the sorted data in a[]
    s = ((GetPassCount(n) & 1) != 0) ? 32 : 64;
    {                                   // insertion sort
        size_t l, r;
        size_t i, j;
        int t;
        for (l = 0; l < n; l = r) {
            r = l + s;
            if (r > n)
                r = n;
            l--;                        // l = run start - 1, used as a sentinel
                                        // (relies on defined unsigned wraparound)
            for (j = l + 2; j < r; j++) {
                t = a[j];
                i = j - 1;
                while (i != l && a[i] > t) {
                    a[i + 1] = a[i];
                    i--;
                }
                a[i + 1] = t;
            }
        }
    }
    while (s < n) {                     // while not done
        size_t ee = 0;                  // reset end index
        size_t ll;
        size_t rr;
        while (ee < n) {                // merge pairs of runs
            ll = ee;                    // ll = start of left run
            rr = ll + s;                // rr = start of right run
            if (rr >= n) {              // if only a left run:
                rr = n;                 //   copy it
                while (ll < rr) {
                    b[ll] = a[ll];
                    ll++;
                }
                break;                  //   end of pass
            }
            ee = rr + s;                // ee = end of right run
            if (ee > n)
                ee = n;
            Merge(a, b, ll, rr, ee);
        }
        std::swap(a, b);                // swap a and b
        s <<= 1;                        // double the run size
    }
}

void Merge(int a[], int b[], size_t ll, size_t rr, size_t ee)
{
    size_t o = ll;                      // b[] index
    size_t l = ll;                      // a[] left index
    size_t r = rr;                      // a[] right index
    while (1) {                         // merge data
        if (a[l] <= a[r]) {             // if a[l] <= a[r]
            b[o++] = a[l++];            //   copy a[l]
            if (l < rr)                 //   if not end of left run
                continue;               //     continue (back to while)
            while (r < ee)              //   else copy rest of right run
                b[o++] = a[r++];
            break;                      //   and return
        } else {                        // else a[l] > a[r]
            b[o++] = a[r++];            //   copy a[r]
            if (r < ee)                 //   if not end of right run
                continue;               //     continue (back to while)
            while (l < rr)              //   else copy rest of left run
                b[o++] = a[l++];
            break;                      //   and return
        }
    }
}

size_t GetPassCount(size_t n)           // return # of merge passes
{
    size_t i = 0;
    for (size_t s = 1; s < n; s <<= 1)
        i += 1;
    return i;
}

Eigen - return type of .cwiseProduct?

I am writing a function in RcppEigen for weighted covariances. In one of the steps I want to take column i and column j of a matrix, X, and compute the cwiseProduct, which should return some kind of vector. The output of cwiseProduct will go into an intermediate variable which can be reused many times. From the docs it seems cwiseProduct returns a CwiseBinaryOp, which itself takes two types. My cwiseProduct operates on two column vectors, so I thought the correct return type should be Eigen::CwiseBinaryOp<Eigen::ColXpr, Eigen::ColXpr>, but I get the error no member named ColXpr in namespace Eigen
#include <RcppEigen.h>
// [[Rcpp::depends(RcppEigen)]]
Rcpp::List Crossprod_sparse(Eigen::MappedSparseMatrix<double> X, Eigen::Map<Eigen::MatrixXd> W) {
    int K = W.cols();
    int p = X.cols();
    Rcpp::List crossprods(W.cols());
    for (int i = 0; i < p; i++) {
        for (int j = i; j < p; j++) {
            Eigen::CwiseBinaryOp<Eigen::ColXpr, Eigen::ColXpr> prod = X.col(i).cwiseProduct(X.col(j));
            for (int k = 0; k < K; k++) {
                //double out = prod.dot(W.col(k));
            }
        }
    }
    return crossprods;
}
I have also tried saving into a SparseVector
Eigen::SparseVector<double> prod = X.col(i).cwiseProduct(X.col(j));
as well as computing, but not saving at all
X.col(i).cwiseProduct(X.col(j));
If I don't save the product at all, the functions returns very quickly, hinting that cwiseProduct is not an expensive function. When I save it into a SparseVector, the function is extremely slow, making me think that SparseVector is not the right return type and Eigen is doing extra work to get it into that type.
Recall that Eigen relies on expression templates, so if you don't assign an expression, that expression is essentially a no-op. In your case, assigning it to a SparseVector is the right thing to do. Regarding speed, make sure to compile with compiler optimizations on (e.g. -O3).
Nonetheless, I believe there is a faster way to write your overall computation. For instance, are you sure that all X.col(i).cwiseProduct(X.col(j)) are non-empty? If not, the second loop should be rewritten to iterate over the sparse set of overlapping columns only. The loops could also be interchanged to leverage efficient matrix products.
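As a rough illustration of that advice, reusing the question's names (X, W, p, K), the inner loops could look like this (untested sketch; only standard Eigen calls are used):

for (int i = 0; i < p; i++) {
    for (int j = i; j < p; j++) {
        // assigning forces one evaluation; the expression itself is lazy
        Eigen::SparseVector<double> prod = X.col(i).cwiseProduct(X.col(j));
        if (prod.nonZeros() == 0)
            continue;                         // skip empty overlaps entirely
        for (int k = 0; k < K; k++) {
            double out = prod.dot(W.col(k));  // sparse-dense dot product
            // ... accumulate out into the result for (i, j, k) ...
        }
    }
}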

Selection Sort in Cuda

So, I'm trying to implement selection sort in Cuda, but so far I haven't been as successful.
__device__ void selection_sort(int *data, int left, int right)
{
    for (int i = left; i <= right; ++i) {
        int min_val = data[i];
        int min_idx = i;
        // Find the smallest value in the range [left, right].
        for (int j = i + 1; j <= right; ++j) {
            int val_j = data[j];
            if (val_j < min_val) {
                min_idx = j;
                min_val = val_j;
            }
        }
        // Swap the values.
        if (i != min_idx) {
            data[min_idx] = data[i];
            data[i] = min_val;
        }
    }
}
My main aim here is to find the minimum and parallelize the solution. Now, I realize the code looks very C++-ish, but I'm nowhere near as skilled in CUDA.
Is there a way to parallelize the solution? Are there any more additions to be made?
Selection sort algorithm for N numbers can be roughly described as:
for i from N-1 down to 0
    find the maximum element among data[0] ~ data[i]
    swap that maximum element with data[i] within the data array
The first part (finding the maximum element) falls into a widely known and well documented class of problems called reduction. However, to perform the second part (swapping), you must track the index of the maximum element while comparing the values, and it is not so natural to do that while performing a reduction. This is one of the reasons why selection sort does not port well to parallel architectures.
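For illustration, here is an untested sketch of how the index can be carried alongside the value through a warp reduction (warpArgMax is a hypothetical helper; it uses the same pre-CUDA 9 __shfl_xor intrinsic as the code below, which newer toolkits replace with __shfl_xor_sync):

// Butterfly reduction over a full warp: every lane ends up holding the
// warp-wide maximum value together with the index it came from.
// Call with idx initialized to the lane's (or thread's) own index.
__inline__ __device__ void warpArgMax(int &val, int &idx)
{
    for (int mask = 16; mask > 0; mask /= 2) {
        int other_val = __shfl_xor(val, mask, 32);
        int other_idx = __shfl_xor(idx, mask, 32);
        if (other_val > val) {  // keep the larger value and its origin
            val = other_val;
            idx = other_idx;
        }
    }
}

The extra shuffle and branch per step is exactly the bookkeeping that a plain value-only reduction avoids.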
Also, you can see that the problem size diminishes by one on each loop iteration, and this is another aspect of the selection sort algorithm that does not map well to parallel architectures. In the case of CUDA, 32 threads form a warp, which executes at the same time. Although you can tell an arbitrary number of threads to run within a warp, it is generally not recommended, because it wastes computing power.
I've tried to build a CUDA version of selection sort myself, but I stopped doing it because it seems there are better algorithms well suited for CUDA. But I'll just show you what I've done so far to illustrate why selection sort is not good for CUDA.
Firstly, start from a small and simple problem: sorting 32 elements. Since 32 threads form a warp, you can use shuffle instructions to find maximum value. (Full code)
// Finds the maximum element within a warp and gives the maximum element to
// the thread with lane id 0. Note that other elements do not get lost but
// their positions are shuffled.
__inline__ __device__ int warpMax(int data, unsigned int threadId)
{
    for (int mask = 16; mask > 0; mask /= 2) {
        int dual_data = __shfl_xor(data, mask, 32);
        if (threadId & mask)
            data = min(data, dual_data);
        else
            data = max(data, dual_data);
    }
    return data;
}

__global__ void selection32(int* d_data, int* d_data_sorted)
{
    unsigned int threadId = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned int laneId = threadIdx.x % 32;

    int n = N;
    while (n-- > 0) {
        // get the maximum element among d_data and put it in d_data_sorted[n]
        int data = d_data[threadId];
        data = warpMax(data, threadId);
        d_data[threadId] = data;

        // now the maximum element is in d_data[0]
        if (laneId == 0) {
            d_data_sorted[n] = d_data[0];
            d_data[0] = INT_MIN; // this element is ignored from now on
        }
    }
}

int main()
{
    // ... build data and transfer to d_data ...
    selection32<<<1, 32>>>(d_data, d_data_sorted);
    // ... get the sorted array stored at d_data_sorted ...
}
(Some may argue that this is not exactly a selection sort since 1) the array elements of the unsorted area keep shuffling, and 2) it is not an in-place sort. Please note that I'm just trying to show that selection sort does not fit in for CUDA. Also, note that warpMax has highly divergent branches, making it less optimal for CUDA.)
The case with only 1 warp of elements may look parallel-ish, but things get worse when the problem size increases to multiple warps. Let's look at the case of 1024 elements. (I've chosen the number 1024 because it is the maximum number of threads in a block.) Now there are 32 warps, and after calling warpMax for each warp, we must compare the maximum elements of the warps to get the maximum element among all 1024 elements. This problem of comparing 32 warp-maximum values cannot be done with warpMax, because we need to track which warp the maximum value came from in order to swap it with the last element of the data array. One way I can think of for doing this is using one single thread to compare the warp-maximum values. This is not a good implementation for CUDA, because the other 1023 threads in the block become idle.
Furthermore, if the problem size grows beyond what a single block can cover, we need to compare the maximum values of each block, implying that we have to launch separate kernels, since we need to synchronize between blocks. Needless to say, we also have to keep track of which block the maximum value came from. All of this just tells us that implementing selection sort for CUDA is not a good idea.

Resolve 16-Queens Problem in 1 second only

I need to solve the 16-Queens problem in 1 second.
I used a backtracking algorithm, shown below.
This code is enough to solve the N-Queens problem in 1 second when N is smaller than 13,
but it takes a long time when N is bigger than 13.
How can I improve it?
#include <stdio.h>
#include <stdlib.h>

int n;
int arr[100] = {0,};
int solution_count = 0;

int check(int i)
{
    int k = 1, ret = 1;
    while (k < i && ret == 1) {
        if (arr[i] == arr[k] ||
            abs(arr[i] - arr[k]) == abs(i - k))
            ret = 0;
        k++;
    }
    return ret;
}

void backtrack(int i)
{
    if (check(i)) {
        if (i == n) {
            solution_count++;
        } else {
            for (int j = 1; j <= n; j++) {
                arr[i + 1] = j;
                backtrack(i + 1);
            }
        }
    }
}

int main()
{
    scanf("%d", &n);
    backtrack(0);
    printf("%d", solution_count);
}
Your algorithm is almost fine. A small change will probably give you enough time improvement to produce a solution much faster. In addition, there is a data structure change that should let you reduce the time even further.
First, tweak the algorithm a little: rather than checking a queen only after it has been placed, check early: every time you are about to place a new queen, check whether another queen is already occupying the same column or the same diagonal before making the arr[i+1] = j; assignment. This will save you a lot of CPU cycles.
Now you need to speed up the checking of the next queen. In order to do that, you have to change your data structure so that you can do all your checks without any loops. Here is how to do it:
You have N rows
You have N columns
You have 2N-1 ascending diagonals
You have 2N-1 descending diagonals
Since no two queens can take the same spot in any of the four "dimensions" above, you need an array of boolean values for the last three things; the rows are guaranteed to be different, because the i parameter of backtrack, which represents the row, is guaranteed to be different.
With N up to 16, 2N-1 goes up to 31, so you can use uint32_t for your bit arrays. Now you can check whether a column c is taken by applying bitwise AND (&) to the columns bit mask and 1 << c. The same goes for the diagonal bit masks.
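Here is a minimal sketch of that scheme (mine, untested): the row is implicit in the recursion depth, the descending diagonal index is shifted by N-1 to keep it non-negative, and the three bit masks replace the check() loop entirely.

#include <stdio.h>
#include <stdint.h>

int n;
long long solution_count = 0;
uint32_t cols, diag1, diag2;  /* bit set = line already taken */

void backtrack(int row)
{
    if (row == n) { solution_count++; return; }
    for (int c = 0; c < n; c++) {
        uint32_t cbit = 1u << c;
        uint32_t d1 = 1u << (row + c);              /* ascending diagonal  */
        uint32_t d2 = 1u << (row - c + n - 1);      /* descending diagonal */
        if ((cols & cbit) || (diag1 & d1) || (diag2 & d2))
            continue;                               /* square attacked: prune early */
        cols |= cbit; diag1 |= d1; diag2 |= d2;     /* place the queen */
        backtrack(row + 1);
        cols &= ~cbit; diag1 &= ~d1; diag2 &= ~d2;  /* remove it again */
    }
}

int main(void)
{
    scanf("%d", &n);
    backtrack(0);
    printf("%lld\n", solution_count);
    return 0;
}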
Note: doing the 16-Queens problem in under a second would be rather tricky. A very highly optimized program does it in 23 seconds on an 800 MHz PC. A 3.2 GHz CPU should give you a speed-up of about 4 times, but that would still be about 8 seconds to get a solution.
I would change while (k < i && ret == 1) { to while (k < i) {
and instead of ret = 0; do return 0;.
(This saves a check on every iteration. It might be that your compiler does this anyway, or applies some other performance trick, but it might help a bit.)

Efficiently implementing erode/dilate

Normally, and very inefficiently, a min/max filter is implemented using four nested for loops.
for( index1 < dy ) { // y loop
    for( index2 < dx ) { // x loop
        for( index3 < StructuringElement.dy() ) { // kernel y
            for( index4 < StructuringElement.dx() ) { // kernel x
                pixel = src(index3+index4);
                val = (pixel > val) ? pixel : val; // max
            }
        }
        dst(index2, index1) = val;
    }
}
However, this approach is damn inefficient, since it checks previously checked values again. So I am wondering what methods there are to implement this by reusing previously checked values on the next iteration.
Any assumptions regarding structuring element size/point of origin can be made.
Update: I am especially keen to hear any insights into this kind of implementation: http://dl.acm.org/citation.cfm?id=2114689
I have been following this question for some time, hoping someone would write a fleshed-out answer, since I am pondering the same problem.
Here is my own attempt so far; I have not tested this, but I think you can do repeated dilation and erosion with any structuring element, by only accessing each pixel twice:
Assumptions: Assume the structuring element/kernel is a KxL rectangle and the image is an NxM rectangle. Assume that K and L are odd.
The basic approach you outlined has four for loops and takes O(K*L*N*M) time to complete.
Often you want to dilate repeatedly with the same kernel, so the time is again multiplied by the desired number of dilations.
I have three basic ideas for speeding up the dilation:
dilation by a KxL kernel is equal to dilation by a Kx1 kernel followed by dilation by a 1xL kernel. You can do both of these dilations with only three for loops, in O(KNM) and O(LNM)
However you can do a dilation with a Kx1 kernel much faster: You only need to access each pixel once. For this you need a particular data structure, explained below. This allows you to do a single dilation in O(N*M), regardless of the kernel size
repeated dilation by a Kx1 kernel is equal to a single dilation by a larger kernel. If you dilate P times with a Kx1 kernel, this is equal to a single dilation with a ((K-1)*P + 1) x 1 kernel.
So you can do repeated dilation with any kernel size in a single pass, in O(N*M) time.
Now for a detailed description of step 2.
You need a queue with the following properties:
push an element to the back of the queue in constant time.
pop an element from the front of the queue in constant time.
query the current smallest or largest element in the queue in constant time.
How to build such a queue is described in this stackoverflow answer: Implement a queue in which push_rear(), pop_front() and get_min() are all constant time operations.
Unfortunately it doesn't give much pseudocode, but the basic idea seems sound.
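Here is a hedged sketch of such a max-queue (mine, untested; the class and method names simply mirror the Push/Pop/GetMaximum calls used below): a plain FIFO plus a monotonically decreasing deque whose front is always the current maximum.

#include <deque>
#include <cstdint>

class MaxQueue {
    std::deque<uint8_t> fifo;  // actual window contents, oldest first
    std::deque<uint8_t> desc;  // decreasing subsequence; front == maximum
public:
    void Push(uint8_t v) {
        fifo.push_back(v);
        while (!desc.empty() && desc.back() < v)
            desc.pop_back();   // strictly smaller values can never be the max again
        desc.push_back(v);
    }
    void Pop() {
        if (fifo.front() == desc.front())
            desc.pop_front();  // the maximum is the element leaving the window
        fifo.pop_front();
    }
    uint8_t GetMaximum() const { return desc.front(); }
    void Clear() { fifo.clear(); desc.clear(); }
};

Each element enters and leaves each deque at most once, so Push, Pop and GetMaximum are all O(1) amortized.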
Using such a queue, you can calculate a Kx1 dilation in a single pass:
Assert(StructuringElement.dy()==1);
int kernel_half = (StructuringElement.dx()-1) /2;
for( y < dy ) { // y loop
for( x <= kernel_half ) { // initialize the queue
queue.Push(src(x, y));
}
for( x < dx ) { // x loop
// get the current maximum of all values in the queue
dst(x, y) = queue.GetMaximum();
// remove the first pixel from the queue
if (x > kernel_half)
queue.Pop();
// add the next pixel to the queue
if (x < dx - kernel_half)
queue.Push(src(x + kernel_half, y));
}
}
The only approach I can think of is to buffer the maximum pixel values and the rows in which they are found, so that you only have to do the full iteration over a kernel-sized row/column when the maximum is no longer under it.
In the following C-like pseudo code, I have assumed signed integers, 2D row-major arrays for the source and destination, and a rectangular kernel over [±dx, ±dy].
//initialise the maxima and their row positions
for(x=0; x < nx; ++x)
{
    row[x] = -1;
    buf[x] = 0;
}
for(sy=0; sy < ny; ++sy)
{
    //update the maxima and their row positions
    for(x=0; x < nx; ++x)
    {
        if(row[x] < max(sy-dy, 0))
        {
            //maximum out of scope, search column
            row[x] = max(sy-dy, 0);
            buf[x] = src[row[x]][x];
            for(y=row[x]+1; y <= min(sy+dy, ny-1); ++y)
            {
                if(src[y][x] >= buf[x])
                {
                    row[x] = y;
                    buf[x] = src[y][x];
                }
            }
        }
        else
        {
            //maximum in scope, check latest value
            y = min(sy+dy, ny-1);
            if(src[y][x] >= buf[x])
            {
                row[x] = y;
                buf[x] = src[y][x];
            }
        }
    }
    //initialise maximum column position
    col = -1;
    for(sx=0; sx < nx; ++sx)
    {
        //update maximum column position
        if(col < max(sx-dx, 0))
        {
            //maximum out of scope, search buffer
            col = max(sx-dx, 0);
            for(x=col+1; x <= min(sx+dx, nx-1); ++x)
            {
                if(buf[x] >= buf[col]) col = x;
            }
        }
        else
        {
            //maximum in scope, check latest value
            x = min(sx+dx, nx-1);
            if(buf[x] >= buf[col]) col = x;
        }
        //assign maximum to destination
        dest[sy][sx] = buf[col];
    }
}
The worst case performance occurs when the source goes smoothly from a maximum at the top left to a minimum at the bottom right, forcing a full row or column scan at each step (although it's still more efficient than the original nested loops).
I would expect average case performance to be much better though, since regions containing increasing values (both row and column wise) will update the maximum before a scan is required.
That said, not having actually tested it I'd recommend that you run a few benchmarks rather than trust my gut feeling!
A theoretical way of improving the complexity would be to maintain a BST of the KxK pixels under the kernel: delete the previous Kx1 column of pixels and add the next Kx1 column as the window slides. The cost of this operation is 2K log K, and it is repeated NxN times, so the overall computation time becomes NxNxKxlog K instead of NxNxKxK.
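A hedged sketch of that idea using std::multiset as the balanced tree (mine, untested; the names are illustrative and image borders are ignored), computing one output row of a KxK max filter:

#include <set>
#include <vector>
#include <cstdint>

void dilate_row_bst(const std::vector<std::vector<uint8_t>>& src,
                    std::vector<uint8_t>& dst_row, int y0, int K)
{
    const int nx = (int)src[0].size();
    std::multiset<uint8_t> window;
    for (int x = 0; x < K; ++x)               // fill the first KxK window
        for (int ky = 0; ky < K; ++ky)
            window.insert(src[y0 + ky][x]);
    for (int x = 0; x + K <= nx; ++x) {
        dst_row[x] = *window.rbegin();        // KxK maximum in O(1)
        if (x + K == nx)
            break;
        for (int ky = 0; ky < K; ++ky) {      // K deletions + K insertions: O(K log K)
            window.erase(window.find(src[y0 + ky][x]));  // column leaving the window
            window.insert(src[y0 + ky][x + K]);          // column entering the window
        }
    }
}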
The same kind of optimization is used in "non maximum suppression" algorithms, e.g.:
http://www.vision.ee.ethz.ch/publications/papers/proceedings/eth_biwi_00446.pdf
In 1D, using the morphological wavelet transform, in O(N):
https://gist.github.com/matovitch/11206318
You could get O(N * M) in 2D. HugoRune's solution is simpler and probably faster (though this one could probably be improved).
