How to partly sort arrays on CUDA? - sorting

Problem
Provided I have two arrays:
const int N = 1000000;
float A[N];
myStruct *B[N];
The numbers in A can be positive or negative (e.g. A[N]={3,2,-1,0,5,-2}). How can I partly sort A on the GPU so that all non-negative values come first (in any order), followed by the negative values (e.g. A[N]={3,2,5,0,-1,-2} or A[N]={5,2,3,0,-2,-1})? The array B should be rearranged in the same way as A (A holds the keys, B the values).
Since A and B can be very large, I think the algorithm should run on the GPU (specifically with CUDA, which is the platform I use). I know thrust::sort_by_key can do this, but it does much extra work, since I do not need A and B to be fully sorted.
Has anyone come across this kind of problem?
Thrust example
thrust::sort_by_key(thrust::device_ptr<float> (A),
thrust::device_ptr<float> ( A + N ),
thrust::device_ptr<myStruct> ( B ),
thrust::greater<float>() );

Thrust's documentation on GitHub is not up to date. As @JaredHoberock said, thrust::partition is the way to go, since it now supports stencils. You may need to get a copy from the GitHub repository:
git clone git://github.com/thrust/thrust.git
Then run scons doc in the Thrust folder to build up-to-date documentation, and use these updated Thrust sources when compiling your code (nvcc -I/path/to/thrust ...). With the new stencil partition, you can do:
#include <thrust/partition.h>
#include <thrust/execution_policy.h>
#include <thrust/iterator/zip_iterator.h>
#include <thrust/tuple.h>
struct is_positive
{
__host__ __device__
bool operator()(const int &x)
{
return x >= 0;
}
};
thrust::partition(thrust::host, // if you want to test on the host
thrust::make_zip_iterator(thrust::make_tuple(keyVec.begin(), valVec.begin())),
thrust::make_zip_iterator(thrust::make_tuple(keyVec.end(), valVec.end())),
keyVec.begin(),
is_positive());
This returns:
Before:
keyVec = 0 -1 2 -3 4 -5 6 -7 8 -9
valVec = 0 1 2 3 4 5 6 7 8 9
After:
keyVec = 0 2 4 6 8 -5 -3 -7 -1 -9
valVec = 0 2 4 6 8 5 3 7 1 9
Note that the 2 partitions are not necessarily sorted. Also, the order may differ between the original vectors and the partitions. If this is important to you, you can use thrust::stable_partition:
stable_partition differs from partition in that stable_partition is
guaranteed to preserve relative order. That is, if x and y are
elements in [first, last), such that pred(x) == pred(y), and if x
precedes y, then it will still be true after stable_partition that x
precedes y.
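To check the intended semantics on the host before moving to the GPU, a minimal sketch with the standard library's std::stable_partition on key/value pairs may help. This is a CPU analogue, not the Thrust code above; the helper name partition_by_sign and the pair layout (float key, int value) are assumptions made for illustration.

```cpp
#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical helper (not part of Thrust): moves every pair whose key is
// non-negative in front of the pairs with negative keys, keeping the
// relative order inside each group. Returns the number of non-negative keys.
std::size_t partition_by_sign(std::vector<std::pair<float, int>>& kv)
{
    auto mid = std::stable_partition(
        kv.begin(), kv.end(),
        [](const std::pair<float, int>& p) { return p.first >= 0.0f; });
    return static_cast<std::size_t>(mid - kv.begin());
}
```

Calling it on the keys {3, 2, -1, 0, 5, -2} from the question leaves the four non-negative keys first, in their original order, followed by -1 and -2. The returned count is the index where the negative block starts.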
If you want a complete example, here it is:
#include <iostream>
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/partition.h>
#include <thrust/iterator/zip_iterator.h>
#include <thrust/tuple.h>
struct is_positive
{
__host__ __device__
bool operator()(const int &x)
{
return x >= 0;
}
};
void print_vec(const thrust::host_vector<int>& v)
{
for(size_t i = 0; i < v.size(); i++)
std::cout << " " << v[i];
std::cout << "\n";
}
int main ()
{
const int N = 10;
thrust::host_vector<int> keyVec(N);
thrust::host_vector<int> valVec(N);
int sign = 1;
for(int i = 0; i < N; ++i)
{
keyVec[i] = sign * i;
valVec[i] = i;
sign *= -1;
}
// Copy host to device
thrust::device_vector<int> d_keyVec = keyVec;
thrust::device_vector<int> d_valVec = valVec;
std::cout << "Before:\n keyVec = ";
print_vec(keyVec);
std::cout << " valVec = ";
print_vec(valVec);
// Partition key-val on device
thrust::partition(thrust::make_zip_iterator(thrust::make_tuple(d_keyVec.begin(), d_valVec.begin())),
thrust::make_zip_iterator(thrust::make_tuple(d_keyVec.end(), d_valVec.end())),
d_keyVec.begin(),
is_positive());
// Copy result back to host
keyVec = d_keyVec;
valVec = d_valVec;
std::cout << "After:\n keyVec = ";
print_vec(keyVec);
std::cout << " valVec = ";
print_vec(valVec);
}
UPDATE
I made a quick comparison with the thrust::sort_by_key version, and the thrust::partition implementation does seem to be faster (which is what we could naturally expect). Here is what I obtain on NVIDIA Visual Profiler, with N = 1024 * 1024, with the sort version on the left, and the partition version on the right. You may want to do the same kind of tests on your own.

How about this?
1. Count the positive numbers to determine the inflexion point (the index where the negative values will start).
2. Evenly divide each side of the inflexion point into groups. The negative groups all have the same length, which may differ from the length of the positive groups. These groups are the memory chunks for the results.
3. Use one kernel call (one thread) per chunk pair.
4. Each kernel swaps any out-of-place elements in its input groups into the desired output groups. You will need to flag any chunks that have more swaps than the maximum so that you can fix them during subsequent iterations.
5. Repeat until done.
Memory traffic is swaps only (from original element position to sorted position). I don't know if this algorithm sounds like anything already defined...

You should be able to achieve this in thrust simply with a modification of your comparison operator:
struct my_compare
{
__device__ __host__ bool operator()(const float x, const float y) const
{
// strict weak ordering: x precedes y only if x is non-negative and y is negative
return (x >= 0.0f) && (y < 0.0f);
}
};
thrust::sort_by_key(thrust::device_ptr<float> (A),
thrust::device_ptr<float> ( A + N ),
thrust::device_ptr<myStruct> ( B ),
my_compare() );

Related

Get a number value from Vector positions

I'm new here, and I have the following problem:
I get an input vector of any size, but for this case, let's take this one:
vetor = {1, 2, 3, 4}
Now, all I want to do is take these numbers, weight each one by its place value (units, tens, hundreds, thousands), sum them, and store the result in an integer variable, in this case 'int vec_value'.
Considering the vector stated above, the answer should be: vec_value = 4321.
I will leave the main.cpp attached to the post; first, here is how I calculated the result, which the program gets wrong.
vetor[0] = 1
vetor[1] = 2
vetor[2] = 3
vetor[3] = 4
the result should be (1*10^0)+(2*10^1)+(3*10^2)+(4*10^3) = 1 + 20 + 300 + 4000 = 4321.
The program gives me 4320 instead, and if I change the values randomly, the answer follows the new values but is still wrong.
If anyone could take a look at my code to see what I'm doing wrong, I'd appreciate it a lot!
Thanks.
There's a link to a picture at the end of the post showing an example of a wrong result.
Keep in mind that sometimes the program gives me the right answer (which confuses me even more).
Code:
#include <iostream>
#include <ctime>
#include <cstdlib>
#include <vector>
#include <cmath>
using namespace std;
int main()
{
vector<int> vetor;
srand(time(NULL));
int lim = rand() % 2 + 3; //the minimum size must be 3 and the maximum must be 4
int value;
for(int i=0; i<lim; i++)
{
value = rand() % 8 + 1; // I'm giving random values to each position of the vector
vetor.push_back(value);
cout << "\nPos [" << i << "]: " << vetor[i]; //just to keep in mind what are the elements inside the vector
}
int vec_value=0;
for(int i=0; i<lim; i++)
{
vec_value += vetor[i] * pow(10, i); //here i wrote the formula to sum each element of the vector with the correspondent unity, tens, hundreds or thousands
}
cout << "\n\nValor final: " << vec_value; //to see what result the program will give me
return 0;
}
Example of the program
Try this for the main loop:
int power = 1;
for(int i=0; i<lim; i++)
{
vec_value += vetor[i] * power;
power *= 10;
}
This way, all the computations are done in integers, so you are not affected by floating-point rounding.
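As a self-contained sketch of the same idea, the loop can be wrapped in a small function. The name digits_to_number and the long long return type are assumptions for illustration, not part of the original program. The likely reason the pow version misbehaves is that on some implementations pow(10, i) returns a value slightly below the exact power, and the conversion back to int then truncates it.

```cpp
#include <cassert>
#include <vector>

// Hypothetical helper: treats digits[0] as the units digit, digits[1] as the
// tens digit, and so on, using only integer arithmetic.
long long digits_to_number(const std::vector<int>& digits)
{
    long long value = 0;
    long long power = 1;              // holds 10^i as an exact integer
    for (int d : digits) {
        assert(d >= 0 && d <= 9);     // each entry is assumed to be one decimal digit
        value += d * power;
        power *= 10;
    }
    return value;
}
```

For the vector {1, 2, 3, 4} from the question, this returns 4321.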

Qsort comparison

I'm converting C++ code to Go, but I have difficulties in understanding this comparison function:
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <iostream>
using namespace std;
typedef struct SensorIndex
{ double value;
int index;
} SensorIndex;
int comp(const void *a, const void* b)
{ SensorIndex* x = (SensorIndex*)a;
SensorIndex* y = (SensorIndex*)b;
return abs(y->value) - abs(x->value);
}
int main(int argc , char *argv[])
{
SensorIndex *s_tmp;
s_tmp = (SensorIndex *)malloc(sizeof(SensorIndex)*200);
double q[200] = {8.48359,8.41851,-2.53585,1.69949,0.00358129,-3.19341,3.29215,2.68201,-0.443549,-0.140532,1.64661,-1.84908,0.643066,1.53472,2.63785,-0.754417,0.431077,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256};
for( int i=0; i < 200; ++i ) {
s_tmp[i].value = q[i];
s_tmp[i].index = i;
}
qsort(s_tmp, 200, sizeof(SensorIndex), comp);
for( int i=0; i<200; i++)
{
cout << s_tmp[i].index << " " << s_tmp[i].value << endl;
}
}
I expected the "comp" function to sort from the highest (absolute) value to the lowest, but in my environment (gcc 32 bit) the result is:
1 8.41851
0 8.48359
2 -2.53585
3 1.69949
11 -1.84908
5 -3.19341
6 3.29215
7 2.68201
10 1.64661
14 2.63785
12 0.643066
13 1.53472
4 0.00358129
9 -0.140532
8 -0.443549
15 -0.754417
16 0.431077
17 -0.123256
18 -0.123256
19 -0.123256
20 -0.123256
...
Moreover one thing that seems strange to me is that by executing the same code with online services I get different values (cpp.sh, C++98):
0 8.48359
1 8.41851
5 -3.19341
6 3.29215
2 -2.53585
7 2.68201
14 2.63785
3 1.69949
10 1.64661
11 -1.84908
13 1.53472
4 0.00358129
8 -0.443549
9 -0.140532
12 0.643066
15 -0.754417
16 0.431077
17 -0.123256
18 -0.123256
19 -0.123256
20 -0.123256
...
Any help?
This behavior is caused by using abs, a function that works with int, and passing it double arguments. The doubles are being implicitly cast to int, truncating the decimal component before comparing them. Essentially, this means you take the original number, strip off the sign, and then strip off everything to the right of the decimal and compare those values. So 8.123 and -8.9 are both converted to 8, and compare equal. Since the inputs are reversed for the subtraction, the ordering is in descending order by magnitude.
Your cpp.sh output reflects this; all the values with a magnitude between 8 and 9 appear first, then 3-4s, then 2-3s, 1-2s and less than 1 values.
If you wanted to fix this to actually sort in descending order in general, you'd need a comparison function that properly used the double-friendly fabs function, e.g.
int comp(const void *a, const void* b)
{ SensorIndex* x = (SensorIndex*)a;
SensorIndex* y = (SensorIndex*)b;
double diff = fabs(y->value) - fabs(x->value);
if (diff < 0.0) return -1;
return diff > 0;
}
Update: On further reading, it looks like std::abs from <cmath> has worked with doubles for a long time, but std::abs for doubles was only added to <cstdlib> (where the integer abs functions dwell) in C++17. And the implementers got this stuff wrong all the time, so different compilers would behave differently at random. In any event, both the answers given here are right; if you haven't included <cmath> and you're on pre-C++17 compilers, you should only have access to integer based versions of std::abs (or ::abs from math.h), which would truncate each value before the comparison. And even if you were using the correct std::abs, returning the result of double subtraction as an int would drop fractional components of the difference, making any values with a magnitude difference of less than 1.0 appear equal. Worse, depending on specific comparisons performed and their ordering (since not all values are compared to each other), the consequences of this effect could chain, as comparison ordering changes could make 1.0 appear equal to 1.6 which would in turn appear equal to 2.5, even though 1.0 would be correctly identified as less than 2.5 if they were compared to each other; in theory, as long as each number is within 1.0 of every other number, the comparisons might evaluate as if they're all equal to each other (pathological case yes, but smaller runs of such errors would definitely happen).
Point is, the only way to figure out the real intent of this code is to figure out the exact compiler version and C++ standard it was originally compiled under and test it there.
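To make the two effects concrete, here is a small sketch that contrasts the truncating comparison with a proper three-way comparison on plain doubles. Both function names are made up for illustration, and compare_truncated only models the case where the integer abs overload ends up being used.

```cpp
#include <cmath>
#include <cstdlib>

// Three-way comparison in the style qsort expects: negative if a should come
// first, positive if b should come first, zero on a tie. Larger magnitudes
// come first.
int compare_abs_desc(double a, double b)
{
    const double fa = std::fabs(a);
    const double fb = std::fabs(b);
    return (fa < fb) - (fa > fb);
}

// Models the original comparator when the integer abs is selected: both
// values are truncated to int before the subtraction.
int compare_truncated(double a, double b)
{
    return std::abs(static_cast<int>(b)) - std::abs(static_cast<int>(a));
}
```

With the first two inputs from the question, compare_truncated(8.48359, 8.41851) is 0, so the sort is free to leave them in either order, while compare_abs_desc returns a negative value and puts 8.48359 first.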
There is a bug in your comparison function. You return an int, which means you lose the distinction between element values whose absolute difference is less than 1!
int comp(const void* a, const void* b)
{
SensorIndex* x = (SensorIndex*)a;
SensorIndex* y = (SensorIndex*)b;
// what about differences between 0.0 and 1.0?
return abs(y->value) - abs(x->value);
}
You can fix it like this:
int comp(const void* a, const void* b)
{ SensorIndex* x = (SensorIndex*)a;
SensorIndex* y = (SensorIndex*)b;
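if(std::abs(y->value) < std::abs(x->value))
return -1;
if(std::abs(y->value) > std::abs(x->value))
return 1;
return 0; // equal magnitudes must compare equal for qsort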
}
A more modern (and safer) way to do this would be to use std::vector and std::sort:
// use a vector for dynamic arrays
std::vector<SensorIndex> s_tmp;
for(int i = 0; i < 200; ++i) {
s_tmp.push_back({q[i], i});
}
// use std::sort
std::sort(std::begin(s_tmp), std::end(s_tmp), [](SensorIndex const& a, SensorIndex const& b){
return std::abs(b.value) < std::abs(a.value);
});

Efficiently count occurrences of each element from given ranges

So I have some ranges like these:
2 4
1 9
4 5
4 7
For this the result should be
1 -> 1
2 -> 2
3 -> 2
4 -> 4
5 -> 3
6 -> 2
7 -> 2
8 -> 1
9 -> 1
The naive approach is to loop through every element of every range, but that is inefficient; the worst case takes O(n * n).
What would be an efficient approach, ideally in O(n) or O(log(n))?
Here's the solution, in O(n) for n ranges, plus one pass over the covered interval:
The idea is to record a range [a, b] as a +1 at a and a -1 just after b (at b+1). After adding all the ranges, compute the running (prefix) sums of that array and display them.
If you need to perform queries while adding the values, a better choice would be to use a Binary Indexed Tree, but your question doesn't seem to require this, so I left it out.
#include <iostream>
#include <algorithm>
#define MAX 1000
using namespace std;
int T[MAX];
int main() {
int a, b;
int min_index = 0x1f1f1f1f, max_index = 0;
while(cin >> a >> b) {
T[a] += 1;
T[b+1] -= 1;
min_index = min(min_index, a);
max_index = max(max_index, b);
}
int running = 0; // running sum of the +1/-1 marks; avoids reading T[-1] when min_index is 0
for(int i=min_index; i<=max_index; i++) {
running += T[i];
cout << i << " -> " << running << endl;
}
}
UPDATE: Based on the "provocations" (in a good sense) by גלעד ברקן, you can also do this in O(n log n):
#include <iostream>
#include <map>
#define ull unsigned long long
#define miit map<ull, int>::iterator
using namespace std;
map<ull, int> T;
int main() {
ull a, b;
while(cin >> a >> b) {
T[a] += 1;
T[b+1] -= 1;
}
ull last;
int count = 0;
for(miit it = T.begin(); it != T.end(); it++) {
if (count > 0)
for(ull i=last; i<it->first; i++)
cout << i << " " << count << endl;
count += it->second;
last = it->first;
}
}
The advantage of this solution is being able to support ranges with much larger values (as long as the output isn't so large).
The solution would be pretty simple:
Generate two sorted collections, one with the starting indices and one with the ending indices of the ranges.
Keep a counter for the number of ranges that cover the current index. Start at the first index that is in any range and iterate over all indices up to the last one that is in any range. For each range that starts at the current index, add 1 to the counter before recording it; for each range that ends at the current index, subtract 1 from the counter after recording it.
Implementation:
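#include <set>
#include <vector>
using namespace std;
// ranges[k][0] is the start and ranges[k][1] the (inclusive) end of range k
vector<int> count(int** ranges , int rangecount , int rangemin , int rangemax)
{
vector<int> res;
multiset<int> open, close; // multisets, so that repeated endpoints are all counted
for(int** r = ranges ; r < ranges + rangecount ; r++)
{
open.insert((*r)[0]);
close.insert((*r)[1]);
}
int rc = 0;
for(int i = rangemin ; i <= rangemax ; i++)
{
rc += open.count(i); // ranges starting at i
res.push_back(rc);
rc -= close.count(i); // ranges ending at i
}
return res;
}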
Paul's answer still counts from "the first item that is at any range and iterate[s] over all numbers to the last element that is in any range." But what if we could aggregate overlapping counts? For example, if we have three (or say a very large number of) overlapping ranges [(2,6),(1,6),(2,8)], the count for the section (2,6) depends only on the number of ranges covering it, so we could label the overlaps with their counts: [(1),3(2,6),(7,8)].
Using binary search (once for the start and a second time for the end of each interval), we could split the intervals and aggregate the counts in O(n * log m * l) time, where n is our number of given ranges and m is the number of resulting groups in the total range and l varies as the number of disjoint updates required for a particular overlap (the number of groups already within that range). Notice that at any time, we simply have a sorted list grouped as intervals with labeled count.
2 4
1 9
4 5
4 7
=>
(2,4)
(1),2(2,4),(5,9)
(1),2(2,3),3(4),2(5),(6,9)
(1),2(2,3),4(4),3(5),2(6,7),(8,9)
So you want the output to be an array, where the value of each element is the number of input ranges that include it?
Yeah, the obvious solution would be to increment every element in the range by 1, for each range.
I think you can get more efficient if you sort the input ranges by start (primary), end (secondary). So for 32bit start and end, start:end can be a 64bit sort key. Actually, just sorting by start is fine, we need to sort the ends differently anyway.
Then you can see how many ranges you enter for an element, and (with a pqueue of range-ends) see how many you already left.
# pseudo-code with possible bugs.
# TODO: peek or put-back the element from ranges / ends
# that made the condition false.
pqueue ends; // priority queue
int depth = 0; // how many ranges contain this element
for i in output.len {
while (r = ranges.next && r.start <= i) {
ends.push(r.end);
depth++;
}
while (ends.pop < i) {
depth--;
}
output[i] = depth;
}
assert ends.empty();
Actually, we can just sort the starts and ends separately into two separate priority queues. There's no need to build the pqueue on the fly. (Sorting an array of integers is more efficient than sorting an array of structs by one struct member, because you don't have to copy around as much data.)
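The two-sorted-lists variant described above could look roughly like the following sketch. It uses plain sorted vectors with two cursors instead of priority queues. The function name coverage and the explicit lo/hi bounds are assumptions for illustration, and ranges are taken to be inclusive, as in the question.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical helper: for inclusive integer ranges, returns how many ranges
// cover each value lo, lo+1, ..., hi. Starts and ends are sorted separately
// and consumed with two cursors.
std::vector<int> coverage(const std::vector<std::pair<int, int>>& ranges, int lo, int hi)
{
    assert(lo <= hi);
    std::vector<int> starts, ends;
    for (const auto& r : ranges) {
        starts.push_back(r.first);
        ends.push_back(r.second);
    }
    std::sort(starts.begin(), starts.end());
    std::sort(ends.begin(), ends.end());

    std::vector<int> out;
    std::size_t s = 0, e = 0;
    int depth = 0;  // number of ranges containing the current value
    for (int i = lo; i <= hi; ++i) {
        while (s < starts.size() && starts[s] <= i) { ++depth; ++s; }  // ranges entered
        while (e < ends.size() && ends[e] < i) { --depth; ++e; }       // ranges left
        out.push_back(depth);
    }
    return out;
}
```

The cost is the two sorts plus one pass over the output interval, since each cursor only ever moves forward.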

Parallel multiplication of many small matrices by fixed vector

Situation is the following: I have a number (1000s) of elements which are given by small matrices of dimensions 4x2, 9x3 ... you get the idea. All matrices have the same dimension.
I want to multiply each of these matrices with a fixed vector of precalculated values. In short:
for(i = 1...n)
X[i] = M[i] . N;
What is the best approach to do this in parallel using Thrust? How do I lay out my data in memory?
NB: There might be specialized, more suitable libraries to do this on GPUs. I'm interested in Thrust because it allows me to deploy to different backends, not just CUDA.
One possible approach:
flatten the arrays (matrices) into a single data vector. This is an advantageous step for enabling general thrust processing anyway.
use a strided range mechanism to take your scaling vector and extend it to the overall length of your flattened data vector
use thrust::transform with thrust::multiplies to multiply the two vectors together.
If you need to access the matrices later out of your flattened data vector (or result vector), you can do so with pointer arithmetic, or a combination of fancy iterators.
If you need to re-use the extended scaling vector, you may want to use the method outlined in step 2 exactly (i.e. create an actual vector using that method, length = N matrices, repeated). If you are only doing this once, you can achieve the same effect with a counting iterator, followed by a transform iterator (modulo the length of your matrix in elements), followed by a permutation iterator, to index into your original scaling vector (length = 1 matrix).
The following example implements the above, without using the strided range iterator method:
#include <iostream>
#include <stdlib.h>
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
#include <thrust/functional.h>
#include <thrust/iterator/permutation_iterator.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/iterator/transform_iterator.h>
#include <thrust/transform.h>
#define N_MAT 1000
#define H_MAT 4
#define W_MAT 3
#define RANGE 1024
struct my_modulo_functor : public thrust::unary_function<int, int>
{
__host__ __device__
int operator() (int idx) {
return idx%(H_MAT*W_MAT);}
};
int main(){
thrust::host_vector<int> data(N_MAT*H_MAT*W_MAT);
thrust::host_vector<int> scale(H_MAT*W_MAT);
// synthetic; instead flatten/copy matrices into data vector
for (int i = 0; i < N_MAT*H_MAT*W_MAT; i++) data[i] = rand()%RANGE;
for (int i = 0; i < H_MAT*W_MAT; i++) scale[i] = rand()%RANGE;
thrust::device_vector<int> d_data = data;
thrust::device_vector<int> d_scale = scale;
thrust::device_vector<int> d_result(N_MAT*H_MAT*W_MAT);
thrust::transform(d_data.begin(), d_data.end(), thrust::make_permutation_iterator(d_scale.begin(), thrust::make_transform_iterator(thrust::counting_iterator<int>(0), my_modulo_functor())) ,d_result.begin(), thrust::multiplies<int>());
thrust::host_vector<int> result = d_result;
for (int i = 0; i < N_MAT*H_MAT*W_MAT; i++)
if (result[i] != data[i] * scale[i%(H_MAT*W_MAT)]) {std::cout << "Mismatch at: " << i << " cpu result: " << (data[i] * scale[i%(H_MAT*W_MAT)]) << " gpu result: " << result[i] << std::endl; return 1;}
std::cout << "Success!" << std::endl;
return 0;
}
EDIT: Responding to a question below:
The benefit of fancy iterators (i.e. transform(numbers, iterator)) is that they often allow for elimination of extra data copies/data movement, as compared to assembling other numbers (which requires extra steps and data movement) and then passing it to transform(numbers, other numbers). If you're only going to use other numbers once, then the fancy iterators will generally be better. If you're going to use other numbers again, then you may want to assemble it explicitly. This presentation is instructive, in particular "Fusion".
For a one-time use of other numbers the overhead of assembling it on the fly using fancy iterators and the functor is generally lower than explicitly creating a new vector, and then passing that new vector to the transform routine.
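The index mapping that the permutation iterator performs in the Thrust example can be mimicked on the host with a plain loop, which may make the flattened layout easier to see. This is only a CPU sketch; scale_batch is a made-up name, and scale.size() stands for the number of elements in one matrix.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical host-side analogue: multiplies every element of a flattened
// batch of matrices by the matching element of a single scale vector. The
// modulo plays the role of the counting -> transform -> permutation
// iterator chain, so the scale vector is never physically replicated.
std::vector<int> scale_batch(const std::vector<int>& data, const std::vector<int>& scale)
{
    assert(!scale.empty());
    std::vector<int> result(data.size());
    for (std::size_t i = 0; i < data.size(); ++i)
        result[i] = data[i] * scale[i % scale.size()];
    return result;
}
```

Matrix k then occupies the elements from k * scale.size() up to (k + 1) * scale.size() of both the input and the result.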
When looking for a software library which is specifically made for multiplying small matrices, one may have a look at https://github.com/hfp/libxsmm. Below, the code requests a specialized matrix kernel according to the typical GEMM parameters (please note that some limitations apply).
double alpha = 1, beta = 1;
const char transa = 'N', transb = 'N';
int flags = LIBXSMM_GEMM_FLAGS(transa, transb);
int prefetch = LIBXSMM_PREFETCH_AUTO;
libxsmm_blasint m = 23, n = 23, k = 23;
libxsmm_dmmfunction xmm = NULL;
xmm = libxsmm_dmmdispatch(m, n, k,
&m/*lda*/, &k/*ldb*/, &m/*ldc*/,
&alpha, &beta, &flags, &prefetch);
Given the above code, one can proceed and run "xmm" for an entire series of (small) matrices without a particular data structure (below code also uses "prefetch locations").
if (0 < n) { /* check that n is at least 1 */
#pragma omp parallel for private(i)
for (i = 0; i < (n - 1); ++i) {
const double *const ai = a + i * asize;
const double *const bi = b + i * bsize;
double *const ci = c + i * csize;
xmm(ai, bi, ci, ai + asize, bi + bsize, ci + csize);
}
xmm(a + (n - 1) * asize, b + (n - 1) * bsize, c + (n - 1) * csize,
/* pseudo prefetch for last element of batch (avoids page fault) */
a + (n - 1) * asize, b + (n - 1) * bsize, c + (n - 1) * csize);
}
In addition to the manual loop control as shown above, libxsmm_gemm_batch (or libxsmm_gemm_batch_omp) can be used (see ReadTheDocs). The latter is useful if data structures exist that describe the series of operands (A, B, and C matrices).
There are two reasons why this library gives superior performance: (1) on-the-fly code specialization using an in-memory code generation technique, and (2) loading the next matrix operands while calculating the current product.
( Given one is looking for something that blends well with C/C++, this library supports it. However, it does not aim for CUDA/Thrust. )

Concurrently sorting many arrays with CUDA Thrust

I need to sort 20+ arrays, already on the GPU, each of the same length, by the same keys. I can not use sort_by_key() directly since it sorts the keys as well (making them useless to sort the next array). Here is what I tried instead:
thrust::device_vector<int> indices(N);
thrust::sequence(indices.begin(),indices.end());
thrust::sort_by_key(keys.begin(),keys.end(),indices.begin());
thrust::gather(indices.begin(),indices.end(),a_01,a_01);
thrust::gather(indices.begin(),indices.end(),a_02,a_02);
...
thrust::gather(indices.begin(),indices.end(),a_20,a_20);
This does not seem to work since gather() expects a different array for the output than for the input, i.e. this works:
thrust::gather(indices.begin(),indices.end(),a_01,o_01);
...
However, I would prefer not to allocate 20+ extra arrays for this task. I know that there is a solution using thrust::tuple, thrust::zip_iterator and thrust::sort_by_key(), similar to here. However, I can only combine up to 10 arrays in a tuple, so I would need to duplicate the key vector again. How would you tackle this task?
I think that the classical way to sort multiple arrays is the so-called back-to-back approach, which uses thrust::stable_sort_by_key twice. You need to create a keys vector such that elements within the same array have the same key. For example:
Elements: 10.5 4.3 -2.3 0. 55. 24. 66.
Keys: 0 0 0 1 1 1 1
In this case we have two arrays, the first with 3 elements and the second with 4 elements.
You first need to call thrust::stable_sort_by_key with the matrix values as the keys:
thrust::stable_sort_by_key(d_matrix.begin(),
d_matrix.end(),
d_keys.begin(),
thrust::less<float>());
After that, you have
Elements: -2.3 0 4.3 10.5 24. 55. 66.
Keys: 0 1 0 0 1 1 1
which means that the array elements are ordered, while the keys are not. Then you need a second call to thrust::stable_sort_by_key
thrust::stable_sort_by_key(d_keys.begin(),
d_keys.end(),
d_matrix.begin(),
thrust::less<int>());
which sorts according to the keys. After that step, you have
Elements: -2.3 4.3 10.5 0 24. 55. 66.
Keys: 0 0 0 1 1 1 1
which is the final desired result.
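The same two-pass idea can be tried on the host with std::stable_sort, which may help when checking results from the GPU version. This is a CPU sketch under assumptions, not the Thrust code: segmented_sort is a made-up name, and it sorts an index array twice instead of swapping the roles of keys and values.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical host-side analogue of the two stable_sort_by_key calls: sorts
// the values inside each segment, where seg[i] is the segment id of values[i].
void segmented_sort(std::vector<float>& values, std::vector<int>& seg)
{
    assert(values.size() == seg.size());
    std::vector<std::size_t> order(values.size());
    std::iota(order.begin(), order.end(), std::size_t{0});
    // Pass 1: order by value.
    std::stable_sort(order.begin(), order.end(),
        [&](std::size_t a, std::size_t b) { return values[a] < values[b]; });
    // Pass 2: order by segment id; stability keeps the value order within a segment.
    std::stable_sort(order.begin(), order.end(),
        [&](std::size_t a, std::size_t b) { return seg[a] < seg[b]; });
    // Apply the final order to both arrays.
    std::vector<float> v(values.size());
    std::vector<int> s(seg.size());
    for (std::size_t i = 0; i < order.size(); ++i) {
        v[i] = values[order[i]];
        s[i] = seg[order[i]];
    }
    values.swap(v);
    seg.swap(s);
}
```

The second pass must be stable, otherwise the value order established by the first pass is not guaranteed to survive inside each segment.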
Below, a full working example which considers the following problem: separately order each row of a matrix. This is a particular case in which all the arrays have the same length, but the approach works with arrays having possibly different lengths.
#include <cublas_v2.h>
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/generate.h>
#include <thrust/sort.h>
#include <thrust/functional.h>
#include <thrust/random.h>
#include <thrust/sequence.h>
#include <stdio.h>
#include <iostream>
#include <thrust/transform.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/iterator/constant_iterator.h>
/**************************************************************/
/* CONVERT LINEAR INDEX TO ROW INDEX - NEEDED FOR APPROACH #1 */
/**************************************************************/
template <typename T>
struct linear_index_to_row_index : public thrust::unary_function<T,T> {
T Ncols; // --- Number of columns
__host__ __device__ linear_index_to_row_index(T Ncols) : Ncols(Ncols) {}
__host__ __device__ T operator()(T i) { return i / Ncols; }
};
/********/
/* MAIN */
/********/
int main()
{
const int Nrows = 5; // --- Number of rows
const int Ncols = 8; // --- Number of columns
// --- Random uniform integer distribution between 10 and 99
thrust::default_random_engine rng;
thrust::uniform_int_distribution<int> dist(10, 99);
// --- Matrix allocation and initialization
thrust::device_vector<float> d_matrix(Nrows * Ncols);
for (size_t i = 0; i < d_matrix.size(); i++) d_matrix[i] = (float)dist(rng);
// --- Print result
printf("Original matrix\n");
for(int i = 0; i < Nrows; i++) {
std::cout << "[ ";
for(int j = 0; j < Ncols; j++)
std::cout << d_matrix[i * Ncols + j] << " ";
std::cout << "]\n";
}
/*************************/
/* BACK-TO-BACK APPROACH */
/*************************/
thrust::device_vector<int> d_keys(Nrows * Ncols); // integer row index of each element
// --- Generate row indices
thrust::transform(thrust::make_counting_iterator(0),
thrust::make_counting_iterator(Nrows*Ncols),
thrust::make_constant_iterator(Ncols),
d_keys.begin(),
thrust::divides<int>());
// --- Back-to-back approach
thrust::stable_sort_by_key(d_matrix.begin(),
d_matrix.end(),
d_keys.begin(),
thrust::less<float>());
thrust::stable_sort_by_key(d_keys.begin(),
d_keys.end(),
d_matrix.begin(),
thrust::less<int>());
// --- Print result
printf("\n\nSorted matrix\n");
for(int i = 0; i < Nrows; i++) {
std::cout << "[ ";
for(int j = 0; j < Ncols; j++)
std::cout << d_matrix[i * Ncols + j] << " ";
std::cout << "]\n";
}
return 0;
}
Well, you really only need to allocate one extra array if you are OK with manipulating pointers to device_vector instead:
thrust::device_vector<int> indices(N);
thrust::sequence(indices.begin(),indices.end());
thrust::sort_by_key(keys.begin(),keys.end(),indices.begin());
thrust::device_vector<int> temp(N);
thrust::device_vector<int> *sorted = &temp;
thrust::device_vector<int> *pa_01 = &a_01;
thrust::device_vector<int> *pa_02 = &a_02;
...
thrust::device_vector<int> *pa_20 = &a_20;
thrust::gather(indices.begin(), indices.end(), pa_01->begin(), sorted->begin());
pa_01 = sorted; sorted = &a_01;
thrust::gather(indices.begin(), indices.end(), pa_02->begin(), sorted->begin());
pa_02 = sorted; sorted = &a_02;
...
thrust::gather(indices.begin(), indices.end(), pa_20->begin(), sorted->begin());
pa_20 = sorted; sorted = &a_20;
Or something like that should work anyway. You would need to fix it so the temp device vector is not automatically deallocated when it goes out of scope -- I suggest allocating the CUDA device pointers using cudaMalloc and then wrapping them with device_ptr instead of using automatic device_vectors.
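A host-side sketch of the same scheme (sort an index array once, then gather each array through it using a single reusable scratch buffer) might look like this. The function names are made up for illustration. Swapping two std::vectors only exchanges their internal buffers, and thrust::device_vector has a swap member as well, which may serve the same purpose on the device.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical helper: returns the permutation that sorts keys (stable).
std::vector<std::size_t> sort_permutation(const std::vector<int>& keys)
{
    std::vector<std::size_t> idx(keys.size());
    std::iota(idx.begin(), idx.end(), std::size_t{0});
    std::stable_sort(idx.begin(), idx.end(),
        [&](std::size_t a, std::size_t b) { return keys[a] < keys[b]; });
    return idx;
}

// Gathers arr through idx into scratch, then swaps the two buffers so that
// arr holds the permuted data and scratch can be reused for the next array.
void apply_permutation(const std::vector<std::size_t>& idx,
                       std::vector<int>& arr, std::vector<int>& scratch)
{
    assert(idx.size() == arr.size());
    scratch.resize(arr.size());
    for (std::size_t i = 0; i < idx.size(); ++i)
        scratch[i] = arr[idx[i]];
    arr.swap(scratch);
}
```

Only one extra buffer of length N is needed no matter how many arrays are permuted, because after each swap the old storage of the array just processed becomes the scratch space for the next one.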
