Concurrently sorting many arrays with CUDA Thrust

I need to sort 20+ arrays, already on the GPU, each of the same length, by the same keys. I cannot use sort_by_key() directly since it sorts the keys as well (making them useless for sorting the next array). Here is what I tried instead:
thrust::device_vector<int> indices(N);
thrust::sequence(indices.begin(),indices.end());
thrust::sort_by_key(keys.begin(),keys.end(),indices.begin());
thrust::gather(indices.begin(),indices.end(),a_01,a_01);
thrust::gather(indices.begin(),indices.end(),a_02,a_02);
...
thrust::gather(indices.begin(),indices.end(),a_20,a_20);
This does not seem to work since gather() expects a different array for the output than for the input, i.e. this works:
thrust::gather(indices.begin(),indices.end(),a_01,o_01);
...
However, I would prefer not to allocate 20+ extra arrays for this task. I know that there is a solution using thrust::tuple, thrust::zip_iterator and thrust::sort_by_key(), similar to here. However, I can only combine up to 10 arrays in a tuple, so I would need to duplicate the key vector again. How would you tackle this task?

I think that the classical way to sort multiple arrays is the so-called back-to-back approach, which uses thrust::stable_sort_by_key two times. You need to create a keys vector such that elements within the same array have the same key. For example:
Elements: 10.5 4.3 -2.3 0. 55. 24. 66.
Keys: 0 0 0 1 1 1 1
In this case we have two arrays, the first with 3 elements and the second with 4 elements.
You first need to call thrust::stable_sort_by_key with the matrix values as the keys, like this:
thrust::stable_sort_by_key(d_matrix.begin(),
d_matrix.end(),
d_keys.begin(),
thrust::less<float>());
After that, you have
Elements: -2.3 0 4.3 10.5 24. 55. 66.
Keys: 0 1 0 0 1 1 1
which means that the array elements are ordered, while the keys are not. Then you need a second call to thrust::stable_sort_by_key
thrust::stable_sort_by_key(d_keys.begin(),
d_keys.end(),
d_matrix.begin(),
thrust::less<int>());
which sorts according to the keys. Because the sort is stable, the value order established by the first call is preserved within each key. After that step, you have
Elements: -2.3 4.3 10.5 0 24. 55. 66.
Keys: 0 0 0 1 1 1 1
which is the final desired result.
Below is a full working example for the following problem: separately order each row of a matrix. This is a particular case in which all the arrays have the same length, but the approach also works with arrays of different lengths.
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/generate.h>
#include <thrust/sort.h>
#include <thrust/functional.h>
#include <thrust/random.h>
#include <thrust/sequence.h>
#include <thrust/transform.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/iterator/constant_iterator.h>
#include <stdio.h>
#include <iostream>
/**************************************************************/
/* CONVERT LINEAR INDEX TO ROW INDEX - NEEDED FOR APPROACH #1 */
/**************************************************************/
template <typename T>
struct linear_index_to_row_index : public thrust::unary_function<T,T> {
T Ncols; // --- Number of columns
__host__ __device__ linear_index_to_row_index(T Ncols) : Ncols(Ncols) {}
__host__ __device__ T operator()(T i) { return i / Ncols; }
};
/********/
/* MAIN */
/********/
int main()
{
const int Nrows = 5; // --- Number of rows
const int Ncols = 8; // --- Number of columns
// --- Random uniform integer distribution between 10 and 99
thrust::default_random_engine rng;
thrust::uniform_int_distribution<int> dist(10, 99);
// --- Matrix allocation and initialization
thrust::device_vector<float> d_matrix(Nrows * Ncols);
for (size_t i = 0; i < d_matrix.size(); i++) d_matrix[i] = (float)dist(rng);
// --- Print result
printf("Original matrix\n");
for(int i = 0; i < Nrows; i++) {
std::cout << "[ ";
for(int j = 0; j < Ncols; j++)
std::cout << d_matrix[i * Ncols + j] << " ";
std::cout << "]\n";
}
/*************************/
/* BACK-TO-BACK APPROACH */
/*************************/
thrust::device_vector<int> d_keys(Nrows * Ncols); // --- Integer keys, matching thrust::less<int> below
// --- Generate row indices
thrust::transform(thrust::make_counting_iterator(0),
thrust::make_counting_iterator(Nrows*Ncols),
thrust::make_constant_iterator(Ncols),
d_keys.begin(),
thrust::divides<int>());
// --- Back-to-back approach
thrust::stable_sort_by_key(d_matrix.begin(),
d_matrix.end(),
d_keys.begin(),
thrust::less<float>());
thrust::stable_sort_by_key(d_keys.begin(),
d_keys.end(),
d_matrix.begin(),
thrust::less<int>());
// --- Print result
printf("\n\nSorted matrix\n");
for(int i = 0; i < Nrows; i++) {
std::cout << "[ ";
for(int j = 0; j < Ncols; j++)
std::cout << d_matrix[i * Ncols + j] << " ";
std::cout << "]\n";
}
return 0;
}
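The two-pass idea can also be checked on the host without a GPU. The following is a minimal sketch, assuming plain std::vector storage and std::stable_sort standing in for thrust::stable_sort_by_key; the name segmented_sort is hypothetical and not part of Thrust.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

typedef std::pair<float, int> Item;  // (value, segment id)

// Sort each segment of `values` independently. segment_ids[i] names the
// segment that values[i] belongs to. Both vectors are reordered in place.
void segmented_sort(std::vector<float>& values, std::vector<int>& segment_ids) {
    assert(values.size() == segment_ids.size());
    std::vector<Item> items(values.size());
    for (std::size_t i = 0; i < values.size(); ++i)
        items[i] = Item(values[i], segment_ids[i]);

    // Pass 1: order everything by value; the segment ids travel along.
    std::stable_sort(items.begin(), items.end(),
                     [](const Item& a, const Item& b) { return a.first < b.first; });
    // Pass 2: order by segment id; stability keeps the values sorted
    // inside each segment.
    std::stable_sort(items.begin(), items.end(),
                     [](const Item& a, const Item& b) { return a.second < b.second; });

    for (std::size_t i = 0; i < items.size(); ++i) {
        values[i] = items[i].first;
        segment_ids[i] = items[i].second;
    }
}
```

Calling it on the seven-element example from the explanation (values 10.5 4.3 -2.3 0 55 24 66 with keys 0 0 0 1 1 1 1) should reproduce the final arrangement shown there.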

Well, you really only need to allocate one extra array if you are OK with manipulating pointers to device_vector instead:
thrust::device_vector<int> indices(N);
thrust::sequence(indices.begin(),indices.end());
thrust::sort_by_key(keys.begin(),keys.end(),indices.begin());
thrust::device_vector<int> temp(N);
thrust::device_vector<int> *sorted = &temp;
thrust::device_vector<int> *pa_01 = &a_01;
thrust::device_vector<int> *pa_02 = &a_02;
...
thrust::device_vector<int> *pa_20 = &a_20;
thrust::gather(indices.begin(), indices.end(), pa_01->begin(), sorted->begin());
pa_01 = sorted; sorted = &a_01;
thrust::gather(indices.begin(), indices.end(), pa_02->begin(), sorted->begin());
pa_02 = sorted; sorted = &a_02;
...
thrust::gather(indices.begin(), indices.end(), pa_20->begin(), sorted->begin());
pa_20 = sorted; sorted = &a_20;
Something like that should work, anyway. After these calls each pa_XX points at the buffer holding the sorted data for that array, and the last input buffer is left over as scratch space. You would need to make sure the temp device vector is not automatically deallocated when it goes out of scope; I suggest allocating the device memory with cudaMalloc and wrapping the pointers with device_ptr instead of using automatic device_vectors.
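The same "sort the indices once, then gather every array through one scratch buffer" pattern can be sketched on the host. This is only an illustration, assuming std::vector in place of device_vector and a buffer swap in place of the pointer juggling; the function names sorting_permutation and apply_permutation are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Build the permutation that sorts `keys`, leaving `keys` itself untouched.
std::vector<std::size_t> sorting_permutation(const std::vector<int>& keys) {
    std::vector<std::size_t> indices(keys.size());
    std::iota(indices.begin(), indices.end(), std::size_t(0));
    std::stable_sort(indices.begin(), indices.end(),
                     [&keys](std::size_t a, std::size_t b) { return keys[a] < keys[b]; });
    return indices;
}

// Reorder `data` by gathering into `scratch`, then exchange the two buffers.
// After the call `data` is permuted and `scratch` holds the old contents,
// ready to be reused for the next array.
void apply_permutation(const std::vector<std::size_t>& indices,
                       std::vector<int>& data,
                       std::vector<int>& scratch) {
    assert(data.size() == indices.size());
    scratch.resize(data.size());
    for (std::size_t i = 0; i < indices.size(); ++i)
        scratch[i] = data[indices[i]];  // gather: output must not alias input
    data.swap(scratch);                 // constant-time exchange, no copy
}
```

Only one extra buffer is ever alive, however many arrays are reordered. Whether device_vector's own swap member gives the same constant-time exchange on the GPU is an assumption worth checking against your Thrust version.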


Get a number value from Vector positions

I'm new here, and I've got a problem that goes like this:
I get a vector of any size as input, but for this case, let's take this one:
vetor = {1, 2, 3, 4}
Now, all I want to do is take these numbers, sum them with each one weighted by its place value (units, tens, hundreds, thousands), and store the result in an integer variable, in this case 'int vec_value'.
Considering the vector stated above, the answer should be: vec_value = 4321.
I will leave the main.cpp attached to the post, however I will tell you how I calculated the result, but it gave me the wrong answer.
vetor[0] = 1
vetor[1] = 2
vetor[2] = 3
vetor[3] = 4
the result should be = (1*10^0)+(2*10^1)+(3*10^2)+(4*10^3) = 1 + 20 + 300 + 4000 = 4321.
The program is giving me the solution as 4320, and if I change the values randomly, the answer follows the new values, but with wrong numbers still.
If anyone could take a look at my code to see what I'm doing wrong I'd appreciate it a lot!
Thanks..
There's a link to a picture at the end of the post showing an example of wrong result.
Keep in mind that sometimes the program gives me the right answer (which leaves me more confused).
Code:
#include <iostream>
#include <ctime>
#include <cstdlib>
#include <vector>
#include <cmath>
using namespace std;
int main()
{
vector<int> vetor;
srand(time(NULL));
int lim = rand() % 2 + 3; //the minimum size must be 3 and the maximum must be 4
int value;
for(int i=0; i<lim; i++)
{
value = rand() % 8 + 1; // I'm giving random values to each position of the vector
vetor.push_back(value);
cout << "\nPos [" << i << "]: " << vetor[i]; //just to keep in mind what are the elements inside the vector
}
int vec_value=0;
for(int i=0; i<lim; i++)
{
vec_value += vetor[i] * pow(10, i); //here i wrote the formula to sum each element of the vector with the correspondent unity, tens, hundreds or thousands
}
cout << "\n\nValor final: " << vec_value; //to see what result the program will give me
return 0;
}
(Image: example of the program output)
Try this for the main loop:
int power = 1;
for(int i=0; i<lim; i++)
{
vec_value += vetor[i] * power;
power *= 10;
}
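This way, all the computations are done in integers, so you are not affected by floating-point rounding. pow() returns a double; if that result lands slightly below the exact power of ten, the conversion back to int truncates it, which is how 4321 can turn into 4320.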
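Wrapped in a function, the integer-only loop might look like the sketch below. The name digits_to_int is hypothetical, and the sketch assumes the vector is short enough (at most nine digits for a 32-bit int) that neither the running power of ten nor the sum overflows.

```cpp
#include <vector>

// Interpret digits[i] as the coefficient of 10^i (least significant first).
int digits_to_int(const std::vector<int>& digits) {
    int value = 0;
    int power = 1;  // 10^i, kept as an exact integer
    for (int d : digits) {
        value += d * power;
        power *= 10;
    }
    return value;
}
```

With the vector from the question, digits_to_int({1, 2, 3, 4}) returns 4321.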

How does std::distance() work?

I am very new to C++11 and am learning about the STL. I have written code like this:
#include <bits/stdc++.h>
#include <vector>
#include <algorithm>
#include <iterator>
using namespace std;
void Print( const vector<int> &arrays )
{
for ( int x : arrays ) cout << x << ' ';
}
int main() {
int citys, cityPairs, fv, lv, w;
vector <int> fvarr;
vector <int> lvarr;
vector <int> warr;
vector <int> warr_temp;
vector <int> disjoint_pairs;
scanf("%d%d", &citys, &cityPairs);
for(int nr = 0; nr < cityPairs; nr++){
scanf("%d%d%d", &fv, &lv, &w);
fvarr.push_back(fv);
lvarr.push_back(lv);
warr.push_back(w);
warr_temp = warr;
}
for (int j = 0; j < citys; j++){
auto result = min_element(begin(warr_temp), end(warr_temp));
auto pos_temp = distance(begin(warr_temp), result);
cout << pos_temp;
auto pos = distance(begin(warr), result);
cout << pos;
disjoint_pairs.push_back(fvarr[pos]);
disjoint_pairs.push_back(lvarr[pos]);
warr_temp.erase(warr_temp.begin() + pos_temp);
}
// Print(disjoint_pairs);
}
What I am doing in this code is taking 3 vectors, plus 1 vector that is a copy of the last one (warr_temp = warr;). Then I find the minimum value in vector warr_temp and store its index in pos_temp; next I store that minimum value's index in vector warr in pos.
Now the problem is that the first cout, which prints pos_temp, gives me correct values, but the second one, which prints pos, gives me output like this:
-61-62-63-64
Why is this happening? What are these numbers? Are they pointers? I know that distance is a template, so what is the right way to implement this?
If anyone can clear up my doubts, that would be very helpful.
Sorry if this is a stupid question!
The root cause of the problem is the line auto pos = distance(begin(warr), result);. It gives unpredictable results because result and begin(warr) belong to different vectors.
result is an iterator pointing to an element of warr_temp; it cannot be mixed with iterators pointing to elements of warr, such as begin(warr).
To get the element's position in the warr vector, use std::find(begin(warr), end(warr), *result) instead:
auto warr_res = std::find(begin(warr), end(warr), *result);
auto pos = distance(begin(warr), warr_res);
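As a self-contained sketch of that lookup, the helper below finds the minimum in one vector and reports where the same value sits in another. The name position_of_min and the -1 "not found" convention are assumptions made for illustration.

```cpp
#include <algorithm>
#include <cstddef>
#include <iterator>
#include <vector>

// Return the index in `original` of the smallest value found in `working`,
// or -1 if `working` is empty or the value does not occur in `original`.
std::ptrdiff_t position_of_min(const std::vector<int>& original,
                               const std::vector<int>& working) {
    auto result = std::min_element(working.begin(), working.end());
    if (result == working.end()) return -1;
    // Look the value up in `original`, so that both iterators handed to
    // std::distance belong to the same container.
    auto it = std::find(original.begin(), original.end(), *result);
    if (it == original.end()) return -1;
    return std::distance(original.begin(), it);
}
```

Note that std::find returns the first match, so if the same weight occurs more than once the earliest index is reported each time.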

vector accessing non zero elements but output as zero

I wrote this program, which is supposed to save (string, int) pairs in a vector and print the strings that carry the maximum number in the vector.
But when I try to find these strings, nothing appears. So I tried printing all the int values in the vector, and although the maximum was found to be 10, every value in the vector printed as 0. Can someone explain why this happened and how I can access the values, please?
#include <iostream>
#include <utility>
#include <vector>
#include <string>
#include <algorithm>
using namespace std;
typedef vector<pair<string,int>> vsi;
bool paircmp(const pair<string,int>& firste,const pair<string,int>& seconde );
int main(int argc, char const *argv[]) {
vsi v(10);
string s;
int n,t;
cin>>t;
for (size_t i = 0;i < t;i++) {
for (size_t j = 0; j < 10; j++) {
cin>>s>>n;
v.push_back(make_pair(s,n));
}
sort(v.begin(),v.end(),paircmp);
int ma=v[v.size()-1].second;
cout<<ma<<endl;
for (size_t j = 0; j < 10; j++) {
cout << v.at(j).second <<endl;
if(v[j].second == ma)
cout<<v[j].first<<endl;
}
}
return 0;
}
bool paircmp(const pair<string,int>& firste,const pair<string,int>& seconde ){
return firste.second < seconde.second;
}
This line
vsi v(10);
creates a std::vector filled with 10 default-constructed std::pair<std::string, int>s. That is, each holds an empty string and zero.
You then push_back other values onto your vector, but they end up sorted after those ten initial elements, probably because they all have positive ints in them.
Therefore, the first ten elements are still the defaults: printing their first member prints ten empty strings, and printing their second member prints ten zeros.
This is all I can guess from what you have provided. I don't know what you are trying to accomplish with this code.
Try something like
for (const auto& item : v)
{
std::cout << "{ first: '" << item.first << "', "
<< "second: " << item.second << " }\n";
}
to print all elements of the vector v.
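The difference between constructing the vector with a size and merely reserving room can be seen in a small sketch. The helper names make_sized and make_reserved are hypothetical and exist only for this comparison.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

typedef std::vector<std::pair<std::string, int> > vsi;

// Sized constructor: the vector already contains n default pairs ("" and 0),
// so later push_back calls append after them.
vsi make_sized(std::size_t n) { return vsi(n); }

// Empty vector with capacity for n elements; size() is still 0, so the
// first push_back lands at index 0.
vsi make_reserved(std::size_t n) {
    vsi v;
    v.reserve(n);
    return v;
}
```

If the intent was only to avoid reallocations, declaring the vector empty (optionally followed by reserve) keeps the default pairs out of the data.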

"Warning : Non-POD class type passed through ellipsis" for simple thrust program

In spite of reading many answers on the same kind of questions on SO I am not able to figure out solution in my case. I have written the following code to implement a thrust program. Program performs simple copy and display operation.
#include <stdio.h>
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
int main(void)
{
// H has storage for 4 integers
thrust::host_vector<int> H(4);
H[0] = 14;
H[1] = 20;
H[2] = 38;
H[3] = 46;
// H.size() returns the size of vector H
printf("\nSize of vector : %d",H.size());
printf("\nVector Contents : ");
for (int i = 0; i < H.size(); ++i) {
printf("\t%d",H[i]);
}
thrust::device_vector<int> D = H;
printf("\nDevice Vector Contents : ");
for (int i = 0; i < D.size(); i++) {
printf("%d",D[i]); //This is where I get the warning.
}
return 0;
}
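Thrust implements certain operations to facilitate using elements of a device_vector in host code, but this apparently isn't one of them. D[i] does not yield a plain int: it yields a thrust::device_reference<int>, which is a class type, and printf receives its arguments through an ellipsis, hence the warning.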
There are many approaches to addressing this issue. The following code demonstrates 3 possible approaches:
1. Explicitly copy D[i] to a host variable; Thrust has an appropriate method defined for that.
2. Copy the Thrust device_vector back to a host_vector before printing.
3. Use thrust::copy to copy the elements of the device_vector directly to a stream.
Code:
#include <stdio.h>
#include <iostream>
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/copy.h>
#include <iterator> // std::ostream_iterator, used in method 3
int main(void)
{
// H has storage for 4 integers
thrust::host_vector<int> H(4);
H[0] = 14;
H[1] = 20;
H[2] = 38;
H[3] = 46;
// H.size() returns the size of vector H
printf("\nSize of vector : %zu",H.size()); // size() returns an unsigned size type, so %zu rather than %d
printf("\nVector Contents : ");
for (int i = 0; i < H.size(); ++i) {
printf("\t%d",H[i]);
}
thrust::device_vector<int> D = H;
printf("\nDevice Vector Contents : ");
//method 1
for (int i = 0; i < D.size(); i++) {
int q = D[i];
printf("\t%d",q);
}
printf("\n");
//method 2
thrust::host_vector<int> Hnew = D;
for (int i = 0; i < Hnew.size(); i++) {
printf("\t%d",Hnew[i]);
}
printf("\n");
//method 3
thrust::copy(D.begin(), D.end(), std::ostream_iterator<int>(std::cout, ","));
std::cout << std::endl;
return 0;
}
Note that for methods like these, Thrust generates various kinds of device-to-host copy operations to facilitate the use of device_vector in host code. This has performance implications, so for large vectors you might want to use the defined copy operations.

How to partly sort arrays on CUDA?

Problem
Provided I have two arrays:
const int N = 1000000;
float A[N];
myStruct *B[N];
The numbers in A can be positive or negative (e.g. A[N]={3,2,-1,0,5,-2}). How can I make the array A partly sorted on the GPU (all positive values first, not necessarily sorted, then the negative values), e.g. A[N]={3,2,5,0,-1,-2} or A[N]={5,2,3,0,-2,-1}? The array B should be changed according to A (A is keys, B is values).
Since A and B can be very large, I think the sort algorithm should be implemented on the GPU (specifically in CUDA, because I use this platform). I know thrust::sort_by_key can do this work, but it does much extra work since I do not need the arrays A and B to be sorted entirely.
Has anyone come across this kind of problem?
Thrust example
thrust::sort_by_key(thrust::device_ptr<float> (A),
thrust::device_ptr<float> ( A + N ),
thrust::device_ptr<myStruct> ( B ),
thrust::greater<float>() );
Thrust's documentation on Github is not up-to-date. As @JaredHoberock said, thrust::partition is the way to go since it now supports stencils. You may need to get a copy from the Github repository:
git clone git://github.com/thrust/thrust.git
Then run scons doc in the Thrust folder to build the updated documentation, and use these updated Thrust sources when compiling your code (nvcc -I/path/to/thrust ...). With the new stencil partition, you can do:
#include <thrust/partition.h>
#include <thrust/execution_policy.h>
#include <thrust/iterator/zip_iterator.h>
#include <thrust/tuple.h>
struct is_positive
{
__host__ __device__
bool operator()(const int &x)
{
return x >= 0;
}
};
thrust::partition(thrust::host, // if you want to test on the host
thrust::make_zip_iterator(thrust::make_tuple(keyVec.begin(), valVec.begin())),
thrust::make_zip_iterator(thrust::make_tuple(keyVec.end(), valVec.end())),
keyVec.begin(),
is_positive());
This returns:
Before:
keyVec = 0 -1 2 -3 4 -5 6 -7 8 -9
valVec = 0 1 2 3 4 5 6 7 8 9
After:
keyVec = 0 2 4 6 8 -5 -3 -7 -1 -9
valVec = 0 2 4 6 8 5 3 7 1 9
Note that the 2 partitions are not necessarily sorted. Also, the order may differ between the original vectors and the partitions. If this is important to you, you can use thrust::stable_partition:
stable_partition differs from partition in that stable_partition is
guaranteed to preserve relative order. That is, if x and y are
elements in [first, last), such that pred(x) == pred(y), and if x
precedes y, then it will still be true after stable_partition that x
precedes y.
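The order-preserving behaviour described in that passage can be tried on the host with the standard library. This is a minimal sketch, assuming std::stable_partition over (key, value) pairs as a stand-in for the Thrust call; the name partition_nonnegative_first is hypothetical.

```cpp
#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

typedef std::pair<int, int> KeyVal;  // (key, value)

// Move pairs with non-negative keys to the front while keeping the relative
// order inside both groups. Returns the index of the first pair with a
// negative key.
std::size_t partition_nonnegative_first(std::vector<KeyVal>& items) {
    auto middle = std::stable_partition(
        items.begin(), items.end(),
        [](const KeyVal& kv) { return kv.first >= 0; });
    return static_cast<std::size_t>(middle - items.begin());
}
```

Because the keys and values travel together in one pair, no separate stencil is needed in this host version.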
If you want a complete example, here it is:
#include <iostream> // std::cout, used by print_vec and main
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/partition.h>
#include <thrust/iterator/zip_iterator.h>
#include <thrust/tuple.h>
struct is_positive
{
__host__ __device__
bool operator()(const int &x)
{
return x >= 0;
}
};
void print_vec(const thrust::host_vector<int>& v)
{
for(size_t i = 0; i < v.size(); i++)
std::cout << " " << v[i];
std::cout << "\n";
}
int main ()
{
const int N = 10;
thrust::host_vector<int> keyVec(N);
thrust::host_vector<int> valVec(N);
int sign = 1;
for(int i = 0; i < N; ++i)
{
keyVec[i] = sign * i;
valVec[i] = i;
sign *= -1;
}
// Copy host to device
thrust::device_vector<int> d_keyVec = keyVec;
thrust::device_vector<int> d_valVec = valVec;
std::cout << "Before:\n keyVec = ";
print_vec(keyVec);
std::cout << " valVec = ";
print_vec(valVec);
// Partition key-val on device
thrust::partition(thrust::make_zip_iterator(thrust::make_tuple(d_keyVec.begin(), d_valVec.begin())),
thrust::make_zip_iterator(thrust::make_tuple(d_keyVec.end(), d_valVec.end())),
d_keyVec.begin(),
is_positive());
// Copy result back to host
keyVec = d_keyVec;
valVec = d_valVec;
std::cout << "After:\n keyVec = ";
print_vec(keyVec);
std::cout << " valVec = ";
print_vec(valVec);
}
UPDATE
I made a quick comparison with the thrust::sort_by_key version, and the thrust::partition implementation does seem to be faster (which is what we could naturally expect). Here is what I obtain on NVIDIA Visual Profiler, with N = 1024 * 1024, with the sort version on the left, and the partition version on the right. You may want to do the same kind of tests on your own.
How about this?
1. Count the positive numbers to determine the inflexion point.
2. Evenly divide each side of the inflexion point into groups (the negative groups all have the same length, which may differ from the length of the positive groups; these groups are the memory chunks for the results).
3. Use one kernel call (one thread) per chunk pair.
4. Each kernel swaps any out-of-place elements in the input groups into the desired output groups. You will need to flag any chunks that have more swaps than the maximum so that you can fix them during subsequent iterations.
5. Repeat until done.
Memory traffic is swaps only (from original element position to sorted position). I don't know if this algorithm sounds like anything already defined...
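The core of this idea, count first and then swap only the misplaced elements, can be sketched sequentially on the host. The chunking and the per-chunk kernels are left out, so this shows the data movement only and is not the parallel algorithm itself. The name partition_by_swaps is hypothetical, and zero is treated as belonging to the positive side, as in the question's example.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Put non-negative keys first by swapping misplaced elements only.
// `values` is permuted together with `keys`. Returns the boundary index.
std::size_t partition_by_swaps(std::vector<float>& keys, std::vector<int>& values) {
    assert(keys.size() == values.size());
    // Step 1: the count of non-negative keys fixes the boundary.
    const std::size_t boundary = static_cast<std::size_t>(
        std::count_if(keys.begin(), keys.end(), [](float x) { return x >= 0.0f; }));
    // Step 2: pair each negative key left of the boundary with a
    // non-negative key right of it and swap them. Both kinds of misplaced
    // element occur equally often, so the two scans finish together.
    std::size_t left = 0, right = boundary;
    for (;;) {
        while (left < boundary && keys[left] >= 0.0f) ++left;
        while (right < keys.size() && keys[right] < 0.0f) ++right;
        if (left == boundary || right == keys.size()) break;
        std::swap(keys[left], keys[right]);
        std::swap(values[left], values[right]);
        ++left;
        ++right;
    }
    return boundary;
}
```

Elements that already sit on the correct side are never touched, which is the "memory traffic is swaps only" property described above.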
You should be able to achieve this in thrust simply with a modification of your comparison operator:
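struct my_compare
{
// Strict weak ordering: x precedes y only if x is non-negative and y is negative,
// so my_compare()(x, x) is always false, as sort requires.
__device__ __host__ bool operator()(const float x, const float y) const
{
return (x >= 0.0f) && (y < 0.0f);
}
};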
thrust::sort_by_key(thrust::device_ptr<float> (A),
thrust::device_ptr<float> ( A + N ),
thrust::device_ptr<myStruct> ( B ),
my_compare() );
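A comparator handed to a sort has to be a strict weak ordering, which among other things means comp(x, x) must be false for every x. The following host-side sketch shows one comparator with that property and uses std::stable_sort on (key, value) pairs as a stand-in for sort_by_key; the names nonnegative_first and sort_pairs_nonnegative_first are hypothetical.

```cpp
#include <algorithm>
#include <utility>
#include <vector>

typedef std::pair<float, int> KeyValue;  // (key, value)

// x goes before y only when x is non-negative and y is negative.
// Two keys on the same side compare as equivalent.
struct nonnegative_first {
    bool operator()(float x, float y) const { return x >= 0.0f && y < 0.0f; }
};

// Sort the pairs by key only, using the comparator above.
void sort_pairs_nonnegative_first(std::vector<KeyValue>& items) {
    nonnegative_first comp;
    std::stable_sort(items.begin(), items.end(),
                     [comp](const KeyValue& a, const KeyValue& b) {
                         return comp(a.first, b.first);
                     });
}
```

Since all keys on one side are equivalent under this ordering, the sort degenerates into a two-way split, which is why a partition is usually the cheaper tool for this job.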
