Insert into host_vector using thrust - insert

I'm trying to insert one value into the third location in a host_vector using thrust.
static thrust::host_vector <int *> bins;
int * p;
bins.insert(3, 1, p);
But I am getting errors:
error: no instance of overloaded function "thrust::host_vector<T, Alloc>::insert [with T=int *, Alloc=std::allocator<int *>]" matches the argument list
argument types are: (int, int, int *)
object type is: thrust::host_vector<int *, std::allocator<int *>>
Has anyone seen this before, and how can I solve this? I want to use a vector to pass information into the GPU. I was originally trying to use a vector of vectors to represent spatial cells that hold different numbers of data, but learned that wasn't possible with thrust. So instead, I'm using a vector bins that holds my data, sorted by the spatial cell (first 3 values might correspond to the first cell, the next 2 to the second cell, the next 0 to the third cell, etc.). The values held are pointers to particles, and represent the numbers of particles in the spatial cell (which is not known before runtime).

As noted in comments, thrust::host_vector is modelled directly on std::vector, and the insert operation you are trying to use requires an iterator for the position argument, not an integer index, which is why you get a compilation error. You can see this if you consult the relevant documentation.
A complete working example of the code snippet you showed would look like this:
#include <iostream>
#include <thrust/host_vector.h>

int main()
{
    thrust::host_vector <int *> bins(10, reinterpret_cast<int *>(0));
    int * p = reinterpret_cast<int *>(0xdeadbeef);
    // insert() takes an iterator for the position, not an integer index
    bins.insert(bins.begin()+3, 1, p);
    auto it = bins.begin();
    for(int i=0; it != bins.end(); ++it, i++) {
        int* v = *it;
        std::cout << i << " " << v << std::endl;
    }
    return 0;
}
Note that this requires that C++11 language features are enabled in nvcc (so use CUDA 8.0):
~/SO$ nvcc -std=c++11 -arch=sm_52
~/SO$ ./a.out
0 0
1 0
2 0
3 0xdeadbeef
4 0
5 0
6 0
7 0
8 0
9 0
10 0
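Since thrust::host_vector mirrors the std::vector interface, the same rule can be tried on the host without CUDA. The following is a minimal sketch using std::vector as a stand-in; insert_at is a hypothetical helper (not part of Thrust or the standard library) that turns an integer index into the iterator insert expects.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: insert `count` copies of `value` before index `pos`.
// The index is converted to an iterator, which is what insert() requires.
template <typename T>
void insert_at(std::vector<T>& v, std::size_t pos, std::size_t count, const T& value)
{
    assert(pos <= v.size());  // the position must lie inside [begin, end]
    v.insert(v.begin() + static_cast<std::ptrdiff_t>(pos), count, value);
}
```

Calling insert_at(bins, 3, 1, p) on a std::vector would then correspond to bins.insert(bins.begin()+3, 1, p) above.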


Sorting multiple arrays using CUDA/Thrust

I have a large array that I need to sort on the GPU. The array itself is a concatenation of multiple smaller subarrays that satisfy the condition that given i < j, the elements of the subarray i are smaller than the elements of the subarray j. An example of such an array would be {5, 3, 4, 2, 1, 6, 9, 8, 7, 10, 11},
where the elements of the first subarray of 5 elements are smaller than the elements of the second subarray of 6 elements. The array I need is {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11}. I know the position where each subarray starts in the large array.
I know I can simply use thrust::sort on the whole array, but I was wondering if it's possible to launch multiple concurrent sorts, one for each subarray. I'm hoping to get a performance improvement by doing that. My assumption is that it would be faster to sort multiple smaller arrays than one large array with all the elements.
I'd appreciate if someone could give me a way to do that or correct my assumption in case it's wrong.
A way to do multiple concurrent sorts (a "vectorized" sort) in thrust is to mark each sub-array with a key and provide a custom functor that works like an ordinary thrust sort functor but orders elements by their sub-array key first and by value second.
Another possible method is to use back-to-back thrust::stable_sort_by_key as described here.
As you have pointed out, another method in your case is just to do an ordinary sort, since that is ultimately your objective.
However, I think it's unlikely that any of the thrust sort methods will give a significant speed-up over a pure sort, although you can try it. Thrust has a fast-path radix sort which it will use in certain situations, which the pure sort method could probably use in your case. (In other cases, e.g. when you provide a custom functor, thrust will often use a slower merge-sort method.)
If the sizes of the sub arrays are within certain ranges, I think you're likely to get much better results (performance-wise) with block radix sort in cub, one block per sub-array.
Here is an example that uses specific sizes (since you've given no indication of size ranges and other details), comparing a thrust "pure sort" to a thrust segmented sort with functor, to the cub block sort method. For this particular case, the cub sort is fastest:
$ cat
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
#include <thrust/sort.h>
#include <thrust/scan.h>
#include <thrust/equal.h>
#include <cstdlib>
#include <iostream>
#include <time.h>
#include <sys/time.h>
#define USECPSEC 1000000ULL
const int num_blocks = 2048;
const int items_per = 4;
const int nTPB = 512;
const int block_size = items_per*nTPB; // must be a whole-number multiple of nTPB;
typedef float mt;
unsigned long long dtime_usec(unsigned long long start){
  timeval tv;
  gettimeofday(&tv, 0);
  return ((tv.tv_sec*USECPSEC)+tv.tv_usec)-start;
}

// Orders zipped (value, segment key) tuples by segment key first, then by value
struct my_sort_functor
{
  template <typename T, typename T2>
  __host__ __device__
  bool operator()(T t1, T2 t2){
    if (thrust::get<1>(t1) < thrust::get<1>(t2)) return true;
    if (thrust::get<1>(t1) > thrust::get<1>(t2)) return false;
    // same segment: use a strict comparison so equal elements never compare "less"
    return thrust::get<0>(t1) < thrust::get<0>(t2);}
};
// from:
#define CUB_STDERR
#include <stdio.h>
#include <iostream>
#include <algorithm>
#include <cub/block/block_load.cuh>
#include <cub/block/block_store.cuh>
#include <cub/block/block_radix_sort.cuh>
using namespace cub;
// Globals, constants and typedefs
bool g_verbose = false;
bool g_uniform_keys;
// Kernels
template <
    typename Key,
    int BLOCK_THREADS,
    int ITEMS_PER_THREAD>
__launch_bounds__ (BLOCK_THREADS)
__global__ void BlockSortKernel(
    Key *d_in, // Tile of input
    Key *d_out) // Tile of output
{
    enum { TILE_SIZE = BLOCK_THREADS * ITEMS_PER_THREAD };
    // Specialize BlockLoad type for our thread block (uses warp-striped loads for coalescing, then transposes in shared memory to a blocked arrangement)
    typedef BlockLoad<Key, BLOCK_THREADS, ITEMS_PER_THREAD, BLOCK_LOAD_WARP_TRANSPOSE> BlockLoadT;
    // Specialize BlockRadixSort type for our thread block
    typedef BlockRadixSort<Key, BLOCK_THREADS, ITEMS_PER_THREAD> BlockRadixSortT;
    // Shared memory
    __shared__ union TempStorage
    {
        typename BlockLoadT::TempStorage load;
        typename BlockRadixSortT::TempStorage sort;
    } temp_storage;
    // Per-thread tile items
    Key items[ITEMS_PER_THREAD];
    // Our current block's offset
    int block_offset = blockIdx.x * TILE_SIZE;
    // Load items into a blocked arrangement
    BlockLoadT(temp_storage.load).Load(d_in + block_offset, items);
    // Barrier for smem reuse
    __syncthreads();
    // Sort keys
    BlockRadixSortT(temp_storage.sort).SortBlockedToStriped(items);
    // Store output in striped fashion
    StoreDirectStriped<BLOCK_THREADS>(threadIdx.x, d_out + block_offset, items);
}
int main(){
const int ds = num_blocks*block_size;
thrust::host_vector<mt> data(ds);
thrust::host_vector<int> keys(ds);
for (int i = block_size; i < ds; i+=block_size) keys[i] = 1; // mark beginning of blocks
thrust::device_vector<int> d_keys = keys;
for (int i = 0; i < ds; i++) data[i] = (rand()%block_size) + (i/block_size)*block_size; // populate data
thrust::device_vector<mt> d_data = data;
thrust::inclusive_scan(d_keys.begin(), d_keys.end(), d_keys.begin()); // fill out keys array 000111222...
thrust::device_vector<mt> d1 = d_data; // make a copy of unsorted data
unsigned long long os = dtime_usec(0);
thrust::sort(d1.begin(), d1.end()); // ordinary sort
os = dtime_usec(os);
thrust::device_vector<mt> d2 = d_data; // make a copy of unsorted data
unsigned long long ss = dtime_usec(0);
thrust::sort(thrust::make_zip_iterator(thrust::make_tuple(d2.begin(), d_keys.begin())), thrust::make_zip_iterator(thrust::make_tuple(d2.end(), d_keys.end())), my_sort_functor());
ss = dtime_usec(ss);
if (!thrust::equal(d1.begin(), d1.end(), d2.begin())) {std::cout << "oops1" << std::endl; return 0;}
std::cout << "ordinary thrust sort: " << os/(float)USECPSEC << "s " << "segmented sort: " << ss/(float)USECPSEC << "s" << std::endl;
thrust::device_vector<mt> d3(ds);
unsigned long long cs = dtime_usec(0);
BlockSortKernel<mt, nTPB, items_per><<<num_blocks, nTPB>>>(thrust::raw_pointer_cast(d_data.data()), thrust::raw_pointer_cast(d3.data()));
cudaDeviceSynchronize(); // kernel launches are asynchronous; wait before stopping the timer
cs = dtime_usec(cs);
if (!thrust::equal(d1.begin(), d1.end(), d3.begin())) {std::cout << "oops2" << std::endl; return 0;}
std::cout << "cub sort: " << cs/(float)USECPSEC << "s" << std::endl;
return 0;
}
$ nvcc -o t1
$ ./t1
ordinary thrust sort: 0.001652s segmented sort: 0.00263s
cub sort: 0.000265s
(CUDA 10.2.89, Tesla V100, Ubuntu 18.04)
I have no doubt that your sizes and array dimensions don't correspond to mine. The purpose here is to illustrate some possible methods, not a black-box solution that works for your particular case. You probably should do benchmark comparisons of your own. I also acknowledge that the block radix sort method for cub expects equal-sized sub-arrays, which you may not have. It may not be a suitable method for you, or you may wish to explore some kind of padding arrangement. There's no need to ask this question of me; I won't be able to answer it based on the information in your question.
I don't claim correctness for this code or any other code that I post. Anyone using any code I post does so at their own risk. I merely claim that I have attempted to address the questions in the original posting, and provide some explanation thereof. I am not claiming my code is defect-free, or that it is suitable for any particular purpose. Use it (or not) at your own risk.
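For readers who want to see the segment-marking idea without a GPU, here is a host-only sketch in standard C++. It is an illustration under stated assumptions, not the Thrust code above: segment_keys and segmented_sort are hypothetical names, std::partial_sum stands in for thrust::inclusive_scan, and a vector of pairs stands in for the zip iterator.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <utility>
#include <vector>

// Build one key per element: 0 for the first sub-array, 1 for the second, ...
// `starts` holds the index where each sub-array begins (ascending, starts[0] == 0).
std::vector<int> segment_keys(const std::vector<std::size_t>& starts, std::size_t n)
{
    std::vector<int> keys(n, 0);
    for (std::size_t s = 1; s < starts.size(); ++s) {
        assert(starts[s] < n);
        keys[starts[s]] = 1;                                     // mark beginning of each sub-array
    }
    std::partial_sum(keys.begin(), keys.end(), keys.begin());   // 000111222...
    return keys;
}

// Sort every sub-array independently with one sort over (key, value) pairs.
std::vector<float> segmented_sort(const std::vector<float>& data,
                                  const std::vector<std::size_t>& starts)
{
    const std::vector<int> keys = segment_keys(starts, data.size());
    std::vector<std::pair<int, float> > zipped(data.size());
    for (std::size_t i = 0; i < data.size(); ++i) zipped[i] = std::make_pair(keys[i], data[i]);

    // Order by segment key first, then by value (a strict comparison in both parts).
    std::sort(zipped.begin(), zipped.end(),
              [](const std::pair<int, float>& a, const std::pair<int, float>& b) {
                  if (a.first != b.first) return a.first < b.first;
                  return a.second < b.second;
              });

    std::vector<float> out(data.size());
    for (std::size_t i = 0; i < data.size(); ++i) out[i] = zipped[i].second;
    return out;
}
```

Because the key is compared first, elements never cross a sub-array boundary, even when the sub-arrays are not globally ordered.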

Qsort comparison

I'm converting C++ code to Go, but I have difficulties in understanding this comparison function:
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <iostream>
using namespace std;
typedef struct SensorIndex
{ double value;
int index;
} SensorIndex;
int comp(const void *a, const void* b)
{ SensorIndex* x = (SensorIndex*)a;
  SensorIndex* y = (SensorIndex*)b;
  return abs(y->value) - abs(x->value);
}
int main(int argc , char *argv[])
{
SensorIndex *s_tmp;
s_tmp = (SensorIndex *)malloc(sizeof(SensorIndex)*200);
double q[200] = {8.48359,8.41851,-2.53585,1.69949,0.00358129,-3.19341,3.29215,2.68201,-0.443549,-0.140532,1.64661,-1.84908,0.643066,1.53472,2.63785,-0.754417,0.431077,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256};
for( int i=0; i < 200; ++i ) {
  s_tmp[i].value = q[i];
  s_tmp[i].index = i;
}
qsort(s_tmp, 200, sizeof(SensorIndex), comp);
for( int i=0; i<200; i++)
  cout << s_tmp[i].index << " " << s_tmp[i].value << endl;
}
I expected that the "comp" function would sort from the highest (absolute) value to the lowest, but in my environment (gcc 32 bit) the result is:
1 8.41851
0 8.48359
2 -2.53585
3 1.69949
11 -1.84908
5 -3.19341
6 3.29215
7 2.68201
10 1.64661
14 2.63785
12 0.643066
13 1.53472
4 0.00358129
9 -0.140532
8 -0.443549
15 -0.754417
16 0.431077
17 -0.123256
18 -0.123256
19 -0.123256
20 -0.123256
Moreover, one thing that seems strange to me is that by executing the same code with online services I get different values (C++98):
0 8.48359
1 8.41851
5 -3.19341
6 3.29215
2 -2.53585
7 2.68201
14 2.63785
3 1.69949
10 1.64661
11 -1.84908
13 1.53472
4 0.00358129
8 -0.443549
9 -0.140532
12 0.643066
15 -0.754417
16 0.431077
17 -0.123256
18 -0.123256
19 -0.123256
20 -0.123256
Any help?
This behavior is caused by using abs, a function that works with int, and passing it double arguments. The doubles are being implicitly cast to int, truncating the decimal component before comparing them. Essentially, this means you take the original number, strip off the sign, and then strip off everything to the right of the decimal and compare those values. So 8.123 and -8.9 are both converted to 8, and compare equal. Since the inputs are reversed for the subtraction, the ordering is in descending order by magnitude.
Your output reflects this; all the values with a magnitude between 8 and 9 appear first, then 3-4s, then 2-3s, 1-2s and less than 1 values.
If you wanted to fix this to actually sort in descending order in general, you'd need a comparison function that properly used the double-friendly fabs function, e.g.
int comp(const void *a, const void* b)
{ SensorIndex* x = (SensorIndex*)a;
  SensorIndex* y = (SensorIndex*)b;
  double diff = fabs(y->value) - fabs(x->value);
  if (diff < 0.0) return -1;
  return diff > 0;
}
Update: On further reading, it looks like std::abs from <cmath> has worked with doubles for a long time, but std::abs for doubles was only added to <cstdlib> (where the integer abs functions dwell) in C++17. And the implementers got this stuff wrong all the time, so different compilers would behave differently at random. In any event, both the answers given here are right; if you haven't included <cmath> and you're on pre-C++17 compilers, you should only have access to integer based versions of std::abs (or ::abs from math.h), which would truncate each value before the comparison. And even if you were using the correct std::abs, returning the result of double subtraction as an int would drop fractional components of the difference, making any values with a magnitude difference of less than 1.0 appear equal. Worse, depending on specific comparisons performed and their ordering (since not all values are compared to each other), the consequences of this effect could chain, as comparison ordering changes could make 1.0 appear equal to 1.6 which would in turn appear equal to 2.5, even though 1.0 would be correctly identified as less than 2.5 if they were compared to each other; in theory, as long as each number is within 1.0 of every other number, the comparisons might evaluate as if they're all equal to each other (pathological case yes, but smaller runs of such errors would definitely happen).
Point is, the only way to figure out the real intent of this code is to figure out the exact compiler version and C++ standard it was originally compiled under and test it there.
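To make the two failure modes concrete, here is a small sketch in standard C++. It assumes the SensorIndex layout from the question; comp_abs_desc, comp_truncating and sort_by_magnitude are hypothetical names introduced for illustration. The first comparator derives its result from comparisons only; the second reproduces the double-to-int conversion of the difference, for contrast.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdlib>

struct SensorIndex {
    double value;
    int index;
};

// Three-way comparison for qsort: larger magnitude sorts first.
// The result comes from comparisons, never from a double-to-int conversion.
int comp_abs_desc(const void* a, const void* b)
{
    const SensorIndex* x = static_cast<const SensorIndex*>(a);
    const SensorIndex* y = static_cast<const SensorIndex*>(b);
    const double ax = std::fabs(x->value);
    const double ay = std::fabs(y->value);
    return (ax < ay) - (ax > ay);  // 1: x goes after y, -1: x goes before y, 0: tie
}

// For contrast only: converting the difference to int drops the fraction,
// so magnitudes that differ by less than 1.0 look equal.
int comp_truncating(const void* a, const void* b)
{
    const SensorIndex* x = static_cast<const SensorIndex*>(a);
    const SensorIndex* y = static_cast<const SensorIndex*>(b);
    return static_cast<int>(std::fabs(y->value) - std::fabs(x->value));
}

// Sort n records in place by descending magnitude.
void sort_by_magnitude(SensorIndex* s, std::size_t n)
{
    std::qsort(s, n, sizeof(SensorIndex), comp_abs_desc);
}
```

With the truncating version, 8.4 and 8.5 compare as equal, which is why their final order depends on the qsort implementation.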
There is a bug in your comparison function. You return an int, which means you lose the distinction between element values whose absolute difference is less than 1!
int comp(const void* a, const void* b)
{
  SensorIndex* x = (SensorIndex*)a;
  SensorIndex* y = (SensorIndex*)b;
  // what about differences between 0.0 and 1.0?
  return abs(y->value) - abs(x->value);
}
You can fix it like this:
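int comp(const void* a, const void* b)
{ SensorIndex* x = (SensorIndex*)a;
  SensorIndex* y = (SensorIndex*)b;
  if(std::abs(y->value) < std::abs(x->value))
    return -1;
  if(std::abs(y->value) > std::abs(x->value))
    return 1;
  return 0; // equal magnitudes must compare equal for qsort
}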
A more modern (and safer) way to do this would be to use std::vector and std::sort:
// use a vector for dynamic arrays
std::vector<SensorIndex> s_tmp;
for(int i = 0; i < 200; ++i) {
  s_tmp.push_back({q[i], i});
}
// use std::sort
std::sort(std::begin(s_tmp), std::end(s_tmp), [](SensorIndex const& a, SensorIndex const& b){
  return std::abs(b.value) < std::abs(a.value);
});

Iterate over first N elements of c++11 std::array

I am using a std::array (C++11). I am choosing to use a std::array because I want the size to be fixed at compile time (as opposed to runtime). Is there any way I can iterate over the first N elements only? I.e., something like:
std::array<int,6> myArray = {0,0,0,0,0,0};
std::find_if(myArray.begin(), myArray.begin() + 4, [](int x){return (x%2==1);});
This is not the best example because find_if returns an iterator marking the FIRST odd number, but you get the idea (I only want to consider the first N, in this case N=4, elements of my std::array).
Note: There are questions similar to this one, but the answer always involves using a different container (vector or valarray), which is not what I want. As I described earlier, I want the size of the container to be fixed at compile time.
Thank you in advance!!
From the way you presented your question, I assume that you say "iterate over", but actually mean "operate on with an algorithm".
The behaviour is not specific to a container, but to the iterator type of the container.
std::array::iterator satisfies RandomAccessIterator, the same as the iterators of std::vector and std::deque.
That means that, given
std::array<int,6> myArray = {0,0,0,0,0,0};
auto end = myArray.begin() // ...
you can add a number n to it...
auto end = myArray.begin() + 4;
...resulting in an iterator to one element beyond the nth element in the array. As that is the very definition for an end iterator for the sequence,
std::find_if(myArray.begin(), myArray.begin() + 4, ... )
works just fine. A somewhat more intuitive example:
#include <algorithm>
#include <array>
#include <iostream>
#define N 4
int main()
{
    std::array<char, 6> myArray = { 'a', 'b', 'c', 'd', 'e', 'f' };
    auto end = myArray.begin() + N;
    if ( std::find( myArray.begin(), end, 'd' ) != end )
    {
        std::cout << "Found.\n";
    }
    return 0;
}
This finds the 4th element in the array, and prints "Found."
Change #define N 4 to #define N 3, and it prints nothing.
Of course, this assumes that your array has at least N elements. If you aren't sure, check N <= myArray.size() first and use myArray.end() instead if required.
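One way to fold that size check into the code is sketched below. first_n_end and contains_in_first_n are hypothetical helper names, not standard library functions; the sketch clamps N to the array size so the computed end iterator never runs past end().

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical helper: end iterator for the first n elements, clamped to the array size.
template <typename T, std::size_t Size>
typename std::array<T, Size>::const_iterator
first_n_end(const std::array<T, Size>& arr, std::size_t n)
{
    return arr.begin() + static_cast<std::ptrdiff_t>(std::min(n, Size));
}

// Returns true if `value` occurs among the first n elements of arr.
template <typename T, std::size_t Size>
bool contains_in_first_n(const std::array<T, Size>& arr, std::size_t n, const T& value)
{
    const auto end = first_n_end(arr, n);
    return std::find(arr.begin(), end, value) != end;
}
```

With the six-character array from the example, searching for 'd' succeeds for n = 4 and fails for n = 3, and an oversized n simply searches the whole array.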
For completeness:
A BidirectionalIterator (list, set, multiset, map, multimap) only supports ++ and --.
A ForwardIterator (forward_list, unordered_set, unordered_multiset, unordered_map, unordered_multimap) only supports ++.
An InputIterator also only supports ++, and is single-pass: once it has been incremented, copies of its previous value may no longer be dereferenced.
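For iterators that lack operator+, std::next can produce the same "first N" end iterator by stepping forward one element at a time. The sketch below assumes a container that provides size() (for example std::list); sum_first_n is a hypothetical name used only for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <iterator>
#include <list>
#include <numeric>

// Sum the first n elements of a container whose iterators are at least forward iterators.
// std::next advances step by step when operator+ is unavailable.
template <typename Container>
long sum_first_n(const Container& c, std::size_t n)
{
    const std::size_t count = n < c.size() ? n : c.size();   // clamp to the container size
    auto last = std::next(c.begin(), static_cast<std::ptrdiff_t>(count));
    return std::accumulate(c.begin(), last, 0L);
}
```

For random access iterators std::next is constant time, so the same helper also works for std::array and std::vector without a penalty.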
If you want to iterate over the first N numbers of a std::array, just do something like:
#include <iostream>
#include <array>
int main() {
    constexpr int N = 4;
    std::array<int, 6> arr{ 0, 1, 2, 3, 4, 5 };
    for (auto it = std::begin(arr); it != std::begin(arr) + N && it != std::end(arr); ++it)
        std::cout << *it << std::endl;
}
With C++20, a std::span can be used to create a subset view of a std::array much like std::string_view does for std::string. The span replaces maintaining the variable 'N' for the number of sub-elements.
auto part = std::span(myArray).first(4);
std::find_if(part.begin(), part.end(), [](int x) {return (x % 2 == 1); });
A std::span offers many other benefits. It can be used in range-based for loops. And by using std::span::subspan, a span can view any range of elements, not just the first N. A span can also be used not just with std::array, but also with C arrays, std::vector, and other contiguous containers.

Parallel multiplication of many small matrices by fixed vector

Situation is the following: I have a number (1000s) of elements which are given by small matrices of dimensions 4x2, 9x3 ... you get the idea. All matrices have the same dimension.
I want to multiply each of these matrices with a fixed vector of precalculated values. In short:
for(i = 1...n)
X[i] = M[i] . N;
What is the best approach to do this in parallel using Thrust? How do I lay out my data in memory?
NB: There might be specialized, more suitable libraries to do this on GPUs. I'm interested in Thrust because it allows me to deploy to different backends, not just CUDA.
One possible approach:
1. Flatten the arrays (matrices) into a single data vector. This is an advantageous step for enabling general thrust processing anyway.
2. Use a strided range mechanism to take your scaling vector and extend it to the overall length of your flattened data vector.
3. Use thrust::transform with thrust::multiplies to multiply the two vectors together.
If you need to access the matrices later out of your flattened data vector (or result vector), you can do so with pointer arithmetic, or a combination of fancy iterators.
If you need to re-use the extended scaling vector, you may want to use the method outlined in step 2 exactly (i.e. create an actual vector using that method, length = N matrices, repeated). If you are only doing this once, you can achieve the same effect with a counting iterator, followed by a transform iterator (modulo the length of your matrix in elements), followed by a permutation iterator, to index into your original scaling vector (length = 1 matrix).
The following example implements the above, without using the strided range iterator method:
#include <iostream>
#include <stdlib.h>
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
#include <thrust/functional.h>
#include <thrust/iterator/permutation_iterator.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/iterator/transform_iterator.h>
#include <thrust/transform.h>
#define N_MAT 1000
#define H_MAT 4
#define W_MAT 3
#define RANGE 1024
struct my_modulo_functor : public thrust::unary_function<int, int>
{
  __host__ __device__
  int operator() (int idx) {
    return idx%(H_MAT*W_MAT);}
};
int main(){
thrust::host_vector<int> data(N_MAT*H_MAT*W_MAT);
thrust::host_vector<int> scale(H_MAT*W_MAT);
// synthetic; instead flatten/copy matrices into data vector
for (int i = 0; i < N_MAT*H_MAT*W_MAT; i++) data[i] = rand()%RANGE;
for (int i = 0; i < H_MAT*W_MAT; i++) scale[i] = rand()%RANGE;
thrust::device_vector<int> d_data = data;
thrust::device_vector<int> d_scale = scale;
thrust::device_vector<int> d_result(N_MAT*H_MAT*W_MAT);
thrust::transform(d_data.begin(), d_data.end(), thrust::make_permutation_iterator(d_scale.begin(), thrust::make_transform_iterator(thrust::counting_iterator<int>(0), my_modulo_functor())) ,d_result.begin(), thrust::multiplies<int>());
thrust::host_vector<int> result = d_result;
for (int i = 0; i < N_MAT*H_MAT*W_MAT; i++)
if (result[i] != data[i] * scale[i%(H_MAT*W_MAT)]) {std::cout << "Mismatch at: " << i << " cpu result: " << (data[i] * scale[i%(H_MAT*W_MAT)]) << " gpu result: " << result[i] << std::endl; return 1;}
std::cout << "Success!" << std::endl;
return 0;
}
EDIT: Responding to a question below:
The benefit of fancy iterators (i.e. transform(numbers, iterator)) is that they often allow for elimination of extra data copies/data movement, as compared to assembling the other numbers into a vector (which requires extra steps and data movement) and then passing it to transform(numbers, other numbers). If you're only going to use the other numbers once, then the fancy iterators will generally be better. If you're going to use the other numbers again, then you may want to assemble them explicitly. This preso is instructive, in particular "Fusion".
For a one-time use of other numbers the overhead of assembling it on the fly using fancy iterators and the functor is generally lower than explicitly creating a new vector, and then passing that new vector to the transform routine.
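The index arithmetic behind the counting/transform/permutation iterator chain can be sketched on the host in plain C++. This is only an illustration of the indexing, not the Thrust implementation; scale_flattened is a hypothetical name.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Host-side sketch: element i of the flattened data is multiplied by
// scale[i % scale.size()], without materializing the repeated scale vector.
std::vector<int> scale_flattened(const std::vector<int>& data, const std::vector<int>& scale)
{
    // the flattened data must hold a whole number of matrices
    assert(!scale.empty() && data.size() % scale.size() == 0);
    std::vector<int> result(data.size());
    for (std::size_t i = 0; i < data.size(); ++i)
        result[i] = data[i] * scale[i % scale.size()];
    return result;
}
```

The modulo plays the role of my_modulo_functor, and indexing scale with its result corresponds to the permutation iterator.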
When looking for a software library that is purpose-built for multiplying small matrices, one may have a look at LIBXSMM. Below, the code requests a specialized matrix kernel according to the typical GEMM parameters (please note that some limitations apply).
double alpha = 1, beta = 1;
const char transa = 'N', transb = 'N';
int flags = LIBXSMM_GEMM_FLAGS(transa, transb);
libxsmm_blasint m = 23, n = 23, k = 23;
libxsmm_dmmfunction xmm = NULL;
xmm = libxsmm_dmmdispatch(m, n, k,
&m/*lda*/, &k/*ldb*/, &m/*ldc*/,
&alpha, &beta, &flags, &prefetch);
Given the above code, one can proceed and run "xmm" for an entire series of (small) matrices without a particular data structure (below code also uses "prefetch locations").
if (0 < n) { /* check that n is at least 1 */
# pragma omp parallel for private(i)
  for (i = 0; i < (n - 1); ++i) {
    const double *const ai = a + i * asize;
    const double *const bi = b + i * bsize;
    double *const ci = c + i * csize;
    xmm(ai, bi, ci, ai + asize, bi + bsize, ci + csize);
  }
  xmm(a + (n - 1) * asize, b + (n - 1) * bsize, c + (n - 1) * csize,
  /* pseudo prefetch for last element of batch (avoids page fault) */
    a + (n - 1) * asize, b + (n - 1) * bsize, c + (n - 1) * csize);
}
In addition to the manual loop control as shown above, libxsmm_gemm_batch (or libxsmm_gemm_batch_omp) can be used (see ReadTheDocs). The latter is useful if data structures exist that describe the series of operands (A, B, and C matrices).
There are two reasons why this library gives superior performance: (1) on-the-fly code specialization using an in-memory code generation technique, and (2) loading the next matrix operands while calculating the current product.
( Given one is looking for something that blends well with C/C++, this library supports it. However, it does not aim for CUDA/Thrust. )

How to partly sort arrays on CUDA?

Provided I have two arrays:
const int N = 1000000;
float A[N];
myStruct *B[N];
The numbers in A can be positive or negative (e.g. A[N]={3,2,-1,0,5,-2}), how can I make the array A partly sorted (all positive values first, not need to be sorted, then negative values)(e.g. A[N]={3,2,5,0,-1,-2} or A[N]={5,2,3,0,-2,-1}) on the GPU? The array B should be changed according to A (A is keys, B is values).
Since A and B can be very large, I think the sort algorithm should be implemented on the GPU (specifically with CUDA, because I use this platform). I know thrust::sort_by_key can do this work, but it does much extra work, since I do not need the arrays A and B to be sorted entirely.
Has anyone come across this kind of problem?
Thrust example
thrust::sort_by_key(thrust::device_ptr<float> (A),
thrust::device_ptr<float> ( A + N ),
thrust::device_ptr<myStruct> ( B ),
thrust::greater<float>() );
Thrust's documentation on GitHub is not up-to-date. As @JaredHoberock said, thrust::partition is the way to go since it now supports stencils. You may need to get a copy from the GitHub repository:
git clone git://
Then run scons doc in the Thrust folder to get updated documentation, and use these updated Thrust sources when compiling your code (nvcc -I/path/to/thrust ...). With the new stencil partition, you can do:
#include <thrust/partition.h>
#include <thrust/execution_policy.h>
#include <thrust/iterator/zip_iterator.h>
#include <thrust/tuple.h>
struct is_positive
{
    __host__ __device__
    bool operator()(const int &x)
    {
        return x >= 0;
    }
};

thrust::partition(thrust::host, // if you want to test on the host
                  thrust::make_zip_iterator(thrust::make_tuple(keyVec.begin(), valVec.begin())),
                  thrust::make_zip_iterator(thrust::make_tuple(keyVec.end(), valVec.end())),
                  keyVec.begin(), // stencil: the keys decide which partition each pair goes to
                  is_positive());
This returns:
Before:
keyVec = 0 -1 2 -3 4 -5 6 -7 8 -9
valVec = 0 1 2 3 4 5 6 7 8 9
After:
keyVec = 0 2 4 6 8 -5 -3 -7 -1 -9
valVec = 0 2 4 6 8 5 3 7 1 9
Note that the 2 partitions are not necessarily sorted. Also, the order may differ between the original vectors and the partitions. If this is important to you, you can use thrust::stable_partition:
stable_partition differs from partition in that stable_partition is
guaranteed to preserve relative order. That is, if x and y are
elements in [first, last), such that pred(x) == pred(y), and if x
precedes y, then it will still be true after stable_partition that x
precedes y.
If you want a complete example, here it is:
#include <iostream>
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/partition.h>
#include <thrust/iterator/zip_iterator.h>
#include <thrust/tuple.h>

struct is_positive
{
    __host__ __device__
    bool operator()(const int &x)
    {
        return x >= 0;
    }
};

void print_vec(const thrust::host_vector<int>& v)
{
    for(size_t i = 0; i < v.size(); i++)
        std::cout << " " << v[i];
    std::cout << "\n";
}

int main ()
{
    const int N = 10;
    thrust::host_vector<int> keyVec(N);
    thrust::host_vector<int> valVec(N);
    int sign = 1;
    for(int i = 0; i < N; ++i)
    {
        keyVec[i] = sign * i;
        valVec[i] = i;
        sign *= -1;
    }
    // Copy host to device
    thrust::device_vector<int> d_keyVec = keyVec;
    thrust::device_vector<int> d_valVec = valVec;
    std::cout << "Before:\n keyVec = ";
    print_vec(keyVec);
    std::cout << " valVec = ";
    print_vec(valVec);
    // Partition key-val on device, using the keys as the stencil
    thrust::partition(thrust::make_zip_iterator(thrust::make_tuple(d_keyVec.begin(), d_valVec.begin())),
                      thrust::make_zip_iterator(thrust::make_tuple(d_keyVec.end(), d_valVec.end())),
                      d_keyVec.begin(),
                      is_positive());
    // Copy result back to host
    keyVec = d_keyVec;
    valVec = d_valVec;
    std::cout << "After:\n keyVec = ";
    print_vec(keyVec);
    std::cout << " valVec = ";
    print_vec(valVec);
    return 0;
}
I made a quick comparison with the thrust::sort_by_key version, and the thrust::partition implementation does seem to be faster (which is what we could naturally expect). Here is what I obtain on NVIDIA Visual Profiler, with N = 1024 * 1024, with the sort version on the left, and the partition version on the right. You may want to do the same kind of tests on your own.
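The difference between partition and stable_partition can be tried on the host with the standard library equivalents. The sketch below is an analogy, not the Thrust code: it uses std::stable_partition on a vector of (key, value) pairs in place of the zip iterator, and partition_by_sign is a hypothetical name.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

typedef std::pair<int, int> KeyVal;  // (key, value)

// Move pairs with non-negative keys to the front, keeping relative order
// inside both groups. Returns the index of the first pair with a negative key.
std::size_t partition_by_sign(std::vector<KeyVal>& kv)
{
    std::vector<KeyVal>::iterator mid =
        std::stable_partition(kv.begin(), kv.end(),
                              [](const KeyVal& p) { return p.first >= 0; });
    return static_cast<std::size_t>(mid - kv.begin());
}
```

The returned index marks the boundary between the two groups, which plain sorting would not report directly.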
How about this?:
1. Count how many positive numbers there are to determine the inflexion point.
2. Evenly divide each side of the inflexion point into groups (negative-groups are all the same length, but a different length from positive-groups; these groups are the memory chunks for the results).
3. Use one kernel call (one thread) per chunk pair.
4. Each kernel swaps any out-of-place elements in the input groups into the desired output groups. You will need to flag any chunks that have more swaps than the maximum so that you can fix them during subsequent iterations.
5. Repeat until done.
Memory traffic is swaps only (from original element position, to sorted position). I don't know if this algorithm sounds like anything already defined...
You should be able to achieve this in thrust simply with a modification of your comparison operator:
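struct my_compare
{
    // x goes before y only if x is non-negative and y is negative;
    // this is a strict ordering, so an element never compares less than itself
    __device__ __host__ bool operator()(const float x, const float y) const
    {
        return (x >= 0.0f) && (y < 0.0f);
    }
};
thrust::sort_by_key(thrust::device_ptr<float> (A),
thrust::device_ptr<float> ( A + N ),
thrust::device_ptr<myStruct> ( B ),
my_compare() );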
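A host-only sketch of the same comparator idea, using the standard library instead of Thrust, is shown below. The names nonnegative_first and sort_keys_with_values are hypothetical, and std::stable_sort over zipped pairs stands in for sort_by_key. Elements within the same sign group compare as equivalent, so they are grouped but not ordered.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// x goes before y only when x is non-negative and y is negative.
struct nonnegative_first {
    bool operator()(float x, float y) const { return x >= 0.0f && y < 0.0f; }
};

// Reorder keys so that non-negative keys come first, moving values along with them.
void sort_keys_with_values(std::vector<float>& keys, std::vector<int>& values)
{
    assert(keys.size() == values.size());
    std::vector<std::pair<float, int> > zipped(keys.size());
    for (std::size_t i = 0; i < keys.size(); ++i) zipped[i] = std::make_pair(keys[i], values[i]);

    const nonnegative_first cmp = nonnegative_first();
    std::stable_sort(zipped.begin(), zipped.end(),
                     [cmp](const std::pair<float, int>& a, const std::pair<float, int>& b) {
                         return cmp(a.first, b.first);   // compare on the key only
                     });

    for (std::size_t i = 0; i < keys.size(); ++i) {
        keys[i] = zipped[i].first;
        values[i] = zipped[i].second;
    }
}
```

For the keys {3, 2, -1, 0, 5, -2} from the question this yields {3, 2, 0, 5, -1, -2}, with the values permuted the same way.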
