Sort one array by another on the GPU

I have code which looks like this:
...
const int N = 10000;
std::array<std::pair<int,int>, N> nnt;
bool compar(std::pair<int,int> i, std::pair<int,int> j)
{ return i.second > j.second; }
int main(int argc, char **argv)
{
#pragma acc data create(...,nnt)
{
#pragma acc parallel loop
{...}
//the nnt array is filled here
//here I need to sort nnt, allocated on the GPU, using
//the comparator compar()
}
}
So I need to sort an array of pairs allocated on the GPU by means of CUDA or OpenACC.
As far as I understand, it is unlikely that I will be able to sort a std::array of std::pairs on the GPU.
Actually, I need to sort one array allocated on the GPU by another one also allocated on the GPU, i.e. if there are
int a[N];
int b[N];
which are allocated or copied to the GPU by means of CUDA or OpenACC, I need to sort the array a by the values of the array b, and I need this sort to be done on the GPU. Maybe there are some CUDA functions that will help, or the CUDA Thrust sort functions could be used (like thrust::stable_sort), I don't know. Is there a way to do it?

Is there a way to do it?

Yes, one possible method would be to use thrust::sort_by_key, which allows you to sort device data through a device pointer.
This blog post explains how to interface Thrust with OpenACC, including passing a deviceptr between routines.
This example code may also be of interest; in particular, the hash example gives a fully worked example of calling thrust::sort_by_key from OpenACC.
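For illustration, a minimal standalone sketch (plain CUDA/Thrust, without the OpenACC interop from the blog post) of sorting one array by the values of another with thrust::sort_by_key; thrust::greater<int>() reproduces the descending order of compar():

#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <thrust/functional.h>

int main()
{
    const int N = 5;
    int a_host[N] = {10, 20, 30, 40, 50}; // values to reorder
    int b_host[N] = { 3,  1,  4,  0,  2}; // keys to sort by

    thrust::device_vector<int> a(a_host, a_host + N);
    thrust::device_vector<int> b(b_host, b_host + N);

    // Sorts b in place and applies the same permutation to a,
    // entirely on the GPU.
    thrust::sort_by_key(b.begin(), b.end(), a.begin(), thrust::greater<int>());
    return 0;
}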

Related

How to change a boost::multiprecision::cpp_int from big endian to little endian

I have a boost::multiprecision::cpp_int in big endian and have to change it to little endian. How can I do that? I tried with boost::endian::conversion but that did not work.
boost::multiprecision::cpp_int bigEndianInt("0xe35fa931a0000");
boost::multiprecision::cpp_int littleEndianInt;
littleEndianInt = boost::endian::endian_reverse(bigEndianInt);
The memory layout of Boost.Multiprecision types is an implementation detail, so you cannot assume much about it anyway (the types are not supposed to be bitwise serializable).
Just read a random section of the docs:
MinBits
Determines the number of Bits to store directly within the object before resorting to dynamic memory allocation. When zero, this field is determined automatically based on how many bits can be stored in union with the dynamic storage header: setting a larger value may improve performance as larger integer values will be stored internally before memory allocation is required.
It's not immediately clear that you have any chance at some level of "normal int behaviour" in memory layout. The only exception would be when MinBits==MaxBits.
Indeed, we can static_assert that the sizes of cpp_int with such backend configs match the corresponding byte sizes.
It turns out there's even a promising tag in the backend base class that indicates "triviality": trivial_tag, so let's use it:
Live On Coliru
#include <boost/multiprecision/cpp_int.hpp>

namespace mp = boost::multiprecision;

template <int bits> using simple_be =
    mp::cpp_int_backend<bits, bits, mp::unsigned_magnitude>;

template <int bits> using my_int = mp::number<simple_be<bits>, mp::et_off>;

using my_int8_t   = my_int<8>;
using my_int16_t  = my_int<16>;
using my_int32_t  = my_int<32>;
using my_int64_t  = my_int<64>;
using my_int128_t = my_int<128>;
using my_int192_t = my_int<192>;
using my_int256_t = my_int<256>;

template <typename Num>
constexpr bool is_trivial_v = Num::backend_type::trivial_tag::value;

int main() {
    static_assert(sizeof(my_int8_t)   == 1);
    static_assert(sizeof(my_int16_t)  == 2);
    static_assert(sizeof(my_int32_t)  == 4);
    static_assert(sizeof(my_int64_t)  == 8);
    static_assert(sizeof(my_int128_t) == 16);

    static_assert(is_trivial_v<my_int8_t>);
    static_assert(is_trivial_v<my_int16_t>);
    static_assert(is_trivial_v<my_int32_t>);
    static_assert(is_trivial_v<my_int64_t>);
    static_assert(is_trivial_v<my_int128_t>);

    // however, it doesn't scale
    static_assert(sizeof(my_int192_t) != 24);
    static_assert(sizeof(my_int256_t) != 32);
    static_assert(not is_trivial_v<my_int192_t>);
    static_assert(not is_trivial_v<my_int256_t>);
}
Concluding: you can have a trivial int representation up to a certain point, after which you get the allocator-based dynamic-limb implementation no matter what.
Note that using unsigned_packed instead of unsigned_magnitude representation never leads to a trivial backend implementation.
Note that triviality might depend on compiler/platform choices (it's likely that cpp_128_t uses some builtin compiler/standard-library support on GCC, e.g.).
Given this, you MIGHT be able to pull off what you wanted to do with hacks IF your backend configuration supports triviality. Sadly, I think it requires you to manually overload endian_reverse for the 128-bit case, because the GCC builtins do not include a __builtin_bswap128, nor does Boost Endian define one.
I'd suggest working off the information here How to make GCC generate bswap instruction for big endian store without builtins?
Final Demo (not complete)
#include <boost/multiprecision/cpp_int.hpp>
#include <boost/endian/buffers.hpp>
#include <iostream>

namespace mp = boost::multiprecision;
namespace be = boost::endian;

template <int bits> void check() {
    using T = mp::number<mp::cpp_int_backend<bits, bits, mp::unsigned_magnitude>, mp::et_off>;
    static_assert(sizeof(T) == bits / 8);
    static_assert(T::backend_type::trivial_tag::value);

    be::endian_buffer<be::order::big, T, bits, be::align::no> buf;
    buf = T("0x0102030405060708090a0b0c0d0e0f00");
    std::cout << std::hex << buf.value() << "\n";
}

int main() {
    check<128>();
}
(Changing be::order::big to be::order::native obviously makes it compile. The other way to complete it would be to have an ADL-accessible overload of endian_reverse for your int type.)
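For completeness, here is a sketch of what such an overload might look like. This is a hypothetical hack, only meaningful under the assumptions established above (trivial backend; it additionally assumes a little-endian host), and it has to live in the type's namespace to be ADL-accessible:

#include <boost/multiprecision/cpp_int.hpp>
#include <algorithm>
#include <cstring>
#include <iterator>

namespace boost { namespace multiprecision {
    // Hypothetical: byte-swap the object representation of a trivial
    // fixed-size type. Only valid when the backend is trivial and the
    // host stores limbs least-significant-byte first.
    template <unsigned bits>
    number<cpp_int_backend<bits, bits, unsigned_magnitude>, et_off>
    endian_reverse(number<cpp_int_backend<bits, bits, unsigned_magnitude>, et_off> v) noexcept
    {
        unsigned char bytes[sizeof(v)];
        std::memcpy(bytes, &v, sizeof(bytes));          // copy object representation
        std::reverse(std::begin(bytes), std::end(bytes)); // reverse byte significance
        std::memcpy(&v, bytes, sizeof(bytes));
        return v;
    }
}}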
This is both trivial and, in the general case, unanswerable. Let me explain:
For a general N-bit integer, where N is a large number, there is unlikely to be any well-defined byte order; indeed, even for 64- and 128-bit integers there are more than 2 possible orders in use: https://en.wikipedia.org/wiki/Endianness#Middle-endian.
On any platform, with any native endianness, you can always extract the bytes of a cpp_int; the first example here: https://www.boost.org/doc/libs/1_73_0/libs/multiprecision/doc/html/boost_multiprecision/tut/import_export.html#boost_multiprecision.tut.import_export.examples shows you how. When exporting bytes like this, they always come out most significant byte first, so you can subsequently rearrange them however you wish. You should not, however, rearrange them and load them back into a cpp_int, as the class won't know what to do with the result!
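For concreteness, a minimal sketch of that export-and-rearrange approach, using the export_bits routine from the linked examples:

#include <boost/multiprecision/cpp_int.hpp>
#include <algorithm>
#include <iostream>
#include <iterator>
#include <vector>

int main() {
    boost::multiprecision::cpp_int v("0x0102030405060708");

    std::vector<unsigned char> bytes;
    // A chunk size of 8 exports one byte at a time, most significant first.
    export_bits(v, std::back_inserter(bytes), 8);

    // The byte order is now ours to rearrange (e.g. reverse for little
    // endian) -- but, as noted above, don't import the result back as a value.
    std::reverse(bytes.begin(), bytes.end());

    for (unsigned b : bytes)
        std::cout << std::hex << b << ' ';
}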
If you know that the value is small enough to fit into a native integer type, then you can simply cast to the native integer and use a system API on the result. As in endian_reverse(static_cast<int64_t>(my_cpp_int)). Again, don't assign the result back into a cpp_int as it requires native byte order.
If you wish to check whether a value is small enough to fit in an N-bit integer for the approach above, you can use the msb function, which returns the index of the most significant bit in the cpp_int; add one to that to obtain the number of bits used, and filter out the zero case. The code looks like:
unsigned bits_used = my_cpp_int.is_zero() ? 0 : msb(my_cpp_int) + 1;
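Putting those pieces together, a minimal sketch (the literal value is only an example):

#include <boost/multiprecision/cpp_int.hpp>
#include <boost/endian/conversion.hpp>
#include <cstdint>
#include <iostream>

int main() {
    boost::multiprecision::cpp_int v("0x11223344AABBCCDD");

    // How many bits does the value actually use?
    unsigned bits_used = v.is_zero() ? 0 : msb(v) + 1;

    if (bits_used <= 64) {
        // Small enough: cast to a native integer and byte-swap that.
        std::uint64_t reversed =
            boost::endian::endian_reverse(static_cast<std::uint64_t>(v));
        // Use 'reversed' as raw bytes -- but don't assign it back into a
        // cpp_int, which expects a value in native byte order.
        std::cout << std::hex << reversed << "\n"; // ddccbbaa44332211
    }
}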
Note that all of the above use completely portable code - no hacking of the underlying implementation is required.

Pass a vector starting from index i by reference

I am writing a function in C++
int maxsubarray(vector<int>& nums)
say I have a vector
v={1,2,3,4,5}
I want to pass
{3,4,5}
to the function, i.e. pass the vector starting from index 2. In C I know I can call maxsubarray(v+2),
but in C++ it doesn't work. I could of course modify the function by adding a start-index parameter, but I'd like to know: can I do it without modifying my original function?
Thanks!
You will have to create a temporary vector with the part you want to pass:
std::vector<int> v = {1,2,3,4,5};
std::vector<int> v2(v.begin() + 2, v.end());
maxsubarray(v2);
The obvious solution is to make a new vector and pass that one instead. I definitely do not recommend that. The most idiomatic way is to make your function take iterators:
template <typename It>
typename It::value_type maxsubarray(It begin, It end) { ... }
and then use it like this:
std::vector<int> nums(...);
auto max = maxsubarray(begin(nums) + 2, end(nums));
Anything else involving copies is just inefficient and unnecessary.
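To make that concrete, here is a self-contained sketch of the iterator version; the Kadane-style body is only a placeholder (the original function body wasn't shown) so that the example runs:

#include <algorithm>
#include <iostream>
#include <iterator>
#include <vector>

template <typename It>
typename std::iterator_traits<It>::value_type maxsubarray(It begin, It end)
{
    // Placeholder max-subarray-sum implementation; assumes a non-empty range.
    auto best = *begin, current = *begin;
    for (It it = std::next(begin); it != end; ++it) {
        current = std::max(*it, current + *it); // extend or restart the run
        best = std::max(best, current);
    }
    return best;
}

int main()
{
    std::vector<int> v = {1, 2, 3, 4, 5};
    // Operates on {3, 4, 5} without copying anything.
    std::cout << maxsubarray(v.begin() + 2, v.end()) << "\n"; // prints 12
}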
Not without constructing another vector.
You can either build a new vector and pass it by reference to the function (though this might not be ideal from a performance point of view: you generally pass by reference precisely to avoid unnecessary copies), or use pointers:
//copy the vector
std::vector<int> copy(v.begin()+2, v.end());
maxsubarray(copy);
//pass a pointer to the first element of interest
int maxsubarray(int *nums) //the function would then also need an element count
maxsubarray(&v[2]);
You could try calling it with a temporary:
int myMax = maxsubarray(vector<int>(v.begin() + 2, v.end()));
That might require changing the function signature to
int maxsubarray(const vector<int> &nums);
since temporaries can't bind to non-const lvalue references, but that change should be preferred here anyway if maxsubarray won't modify nums.

Best way to maintain an RNG state across multiple devices in OpenCL

So I'm trying to make use of this custom RNG library for OpenCL:
http://cas.ee.ic.ac.uk/people/dt10/research/rngs-gpu-mwc64x.html
The library defines a state struct:
//! Represents the state of a particular generator
typedef struct{ uint x; uint c; } mwc64x_state_t;
And in order to generate a random uint, you pass in the state into the following function:
uint MWC64X_NextUint(mwc64x_state_t *s)
which updates the state, so that when you pass it into the function again, the next "random" number in the sequence will be generated.
For the project I am creating I need to be able to generate random numbers not just in different work-groups/items but also across multiple devices simultaneously, and I'm having trouble figuring out the best way to design this. Should I create one mwc64x_state_t object per device/command queue and pass that state in as a global variable? Or is it possible to create one state object for all devices at once?
Or should I not pass it in as a global variable at all, and instead declare a new state locally within each kernel function?
The library also comes with this function:
void MWC64X_SeedStreams(mwc64x_state_t *s, ulong baseOffset, ulong perStreamOffset)
which is supposed to split the RNG into multiple "streams", but including it in my kernel makes it incredibly slow. For instance, if I do something very simple like the following:
__kernel void myKernel()
{
    mwc64x_state_t rng;
    MWC64X_SeedStreams(&rng, 0, 10000);
}
Then the kernel call becomes around 40x slower.
The library does come with some source code that serves as example usage, but the examples are limited and not especially helpful.
So if anyone is familiar with RNGs in OpenCL, or if you've used this particular library before, I'd very much appreciate your advice.
The MWC64X_SeedStreams function is indeed relatively slow, at least in comparison to the MWC64X_NextUint call, but this is true of most parallel RNGs that try to split a large global stream into many sub-streams that can be used in parallel. The assumption is that you'll be calling NextUint many times within the kernel (e.g. a hundred or more), while SeedStreams appears only once at the top.
This is an annotated version of the EstimatePi example that comes with the library (mwc64x/test/estimate_pi.cpp and mwc64x/test/test_mwc64x.cl):
__kernel void EstimatePi(ulong n, ulong baseOffset, __global ulong *acc)
{
    // One RNG state per work-item
    mwc64x_state_t rng;

    // This calculates the number of samples that each work-item uses
    ulong samplesPerStream = n / get_global_size(0);

    // Then skip this work-item ahead to its part of the stream, which
    // runs from stream offset:
    //   baseOffset + 2*samplesPerStream*get_global_id(0)
    // up to (but not including):
    //   baseOffset + 2*samplesPerStream*(get_global_id(0)+1)
    MWC64X_SeedStreams(&rng, baseOffset, 2*samplesPerStream);

    // Now use the numbers
    uint count = 0;
    for (uint i = 0; i < samplesPerStream; i++) {
        ulong x = MWC64X_NextUint(&rng);
        ulong y = MWC64X_NextUint(&rng);
        ulong x2 = x*x;
        ulong y2 = y*y;
        // The 64-bit sum wraps exactly when x*x + y*y >= 2^64, so this
        // condition counts points inside the quarter circle of radius 2^32.
        if (x2 + y2 >= x2)
            count++;
    }
    acc[get_global_id(0)] = count;
}
So the intent is that n should be large and should grow as the number of work-items grows, so that samplesPerStream remains around a hundred or more.
If you want multiple kernels on multiple devices, then you need to add another level of hierarchy to the stream splitting. For example, if you have:
K : number of devices (possibly on parallel machines)
W : number of work-items per device
C : number of calls to NextUint per work-item
you end up with N = K*W*C total calls to NextUint across all work-items. If your devices are identified as k = 0..(K-1), then within each kernel you would do:
MWC64X_SeedStreams(&rng, W*C*k, C);
The indices within the stream would then be:
[ 0 .. N )                         : parts of the stream used across all devices
[ k*(W*C) .. (k+1)*(W*C) )         : used within device k
[ k*(W*C)+i*C .. k*(W*C)+(i+1)*C ) : used by work-item i in device k
It is fine if each kernel uses fewer than C samples; you can over-estimate C if necessary.
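For example, a hypothetical kernel skeleton under that scheme, with k, W and C passed in as arguments (the names and the XOR body are illustrative only; mwc64x.cl is assumed to be included):

__kernel void myKernel(ulong k, ulong W, ulong C, __global uint *out)
{
    mwc64x_state_t rng;
    // Device k gets the slice [k*W*C .. (k+1)*W*C) of the global stream;
    // SeedStreams then advances each work-item by C within that slice.
    MWC64X_SeedStreams(&rng, k*W*C, C);

    uint acc = 0;
    for (ulong j = 0; j < C; j++)
        acc ^= MWC64X_NextUint(&rng);   // stand-in for real work

    out[get_global_id(0)] = acc;
}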
(I'm the author of the library).

VS2013: Potential issue with optimizing move semantics for classes with vector members?

I compiled the following code on VS2013 (using "Release" mode optimization) and was dismayed to find that the assembly for std::swap(v1, v2) was not the same as the assembly for std::swap(v3, v4).
#include <vector>
#include <iterator>
#include <algorithm>
#include <cstdlib>

template <class T>
class WRAPPED_VEC
{
public:
    typedef T value_type;
    void push_back(T value) { m_vec.push_back(value); }

    WRAPPED_VEC() = default;
    WRAPPED_VEC(WRAPPED_VEC&& other) : m_vec(std::move(other.m_vec)) {}
    WRAPPED_VEC& operator=(WRAPPED_VEC&& other)
    {
        m_vec = std::move(other.m_vec);
        return *this;
    }

private:
    std::vector<T> m_vec;
};

int main(int, char *[])
{
    WRAPPED_VEC<int> v1, v2;
    std::generate_n(std::back_inserter(v1), 10, std::rand);
    std::generate_n(std::back_inserter(v2), 10, std::rand);
    std::swap(v1, v2);

    std::vector<int> v3, v4;
    std::generate_n(std::back_inserter(v3), 10, std::rand);
    std::generate_n(std::back_inserter(v4), 10, std::rand);
    std::swap(v3, v4);

    return 0;
}
The std::swap(v3, v4) statement turns into "perfect" assembly. How can I achieve the same efficiency for std::swap(v1, v2)?
There are a couple of points to be made here.
1. If you don't know for absolutely certain that your way of calling swap is equivalent to the "correct" way of calling swap, you should always use the "correct" way:
using std::swap;
swap(v1, v2);
2. A really convenient way to look at the assembly for something like calling swap is to put the call by itself in a test function. That makes it easy to isolate the assembly:
void test1(WRAPPED_VEC<int>& v1, WRAPPED_VEC<int>& v2)
{
    using std::swap;
    swap(v1, v2);
}

void test2(std::vector<int>& v1, std::vector<int>& v2)
{
    using std::swap;
    swap(v1, v2);
}
As it stands, test1 will call std::swap which looks something like:
template <class T>
inline void
swap(T& x, T& y) noexcept(is_nothrow_move_constructible<T>::value &&
                          is_nothrow_move_assignable<T>::value)
{
    T t(std::move(x));
    x = std::move(y);
    y = std::move(t);
}
And this is fast. It will use WRAPPED_VEC's move constructor and move assignment operator.
However, vector's swap is even faster: it swaps the vector's three pointers, and, if std::allocator_traits<std::vector<T>::allocator_type>::propagate_on_container_swap::value is true (for std::allocator it is not), also swaps the allocators. If it is false (here it is), and if the two allocators are equal (here they are), then everything is OK. Otherwise undefined behavior happens.
To make test1 identical to test2 performance-wise you need:
friend void swap(WRAPPED_VEC<int>& v1, WRAPPED_VEC<int>& v2)
{
    using std::swap;
    swap(v1.m_vec, v2.m_vec);
}
One interesting thing to point out:
In your case, where you are always using std::allocator<T>, the friend function is always a win. However, if your code allowed other allocators (possibly stateful ones, which might compare unequal, and which might have propagate_on_container_swap::value false, as std::allocator<T> does), then these two implementations of swap for WRAPPED_VEC diverge somewhat:
1. If you rely on std::swap, then you take a performance hit, but you will never have the possibility to get into undefined behavior. Move construction on vector is always well-defined and O(1). Move assignment on vector is always well-defined and can be either O(1) or O(N), and either noexcept(true) or noexcept(false).
If propagate_on_container_move_assignment::value is false, and if the two allocators involved in a move assignment are unequal, vector move assignment will become O(N) and noexcept(false). Thus a swap using vector move assignment will inherit these characteristics. However, no matter what, the behavior is always well-defined.
2. If you overload swap for WRAPPED_VEC, thus relying on the swap overload for vector, then you expose yourself to the possibility of undefined behavior if the allocators compare unequal and have propagate_on_container_swap::value equal to false. But you pick up a potential performance win.
As always, there are engineering tradeoffs to be made. This post is meant to alert you to the nature of those tradeoffs.
PS: The following comment is purely stylistic: all-capital names for class types are generally considered poor style, as tradition reserves all-capital names for macros.
The reason for this is that std::swap has an optimized overload for std::vector<T> (see right-click -> Go to Definition). To make this code work fast for your wrapper, follow the instructions found on cppreference.com for std::swap:
std::swap may be specialized in namespace std for user-defined types, but such specializations are not found by ADL (the namespace std is not the associated namespace for the user-defined type). The expected way to make a user-defined type swappable is to provide a non-member function swap in the same namespace as the type: see Swappable for details.
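A minimal sketch of that advice (assuming the wrapper lives in some namespace mylib and exposes a member swap; the names are illustrative):

#include <utility>
#include <vector>

namespace mylib {
    template <class T>
    class WRAPPED_VEC {
    public:
        void swap(WRAPPED_VEC& other) noexcept { m_vec.swap(other.m_vec); }
    private:
        std::vector<T> m_vec;
    };

    // Non-member swap in the same namespace as the type: found by ADL.
    template <class T>
    void swap(WRAPPED_VEC<T>& a, WRAPPED_VEC<T>& b) noexcept { a.swap(b); }
}

int main() {
    mylib::WRAPPED_VEC<int> v1, v2;
    using std::swap; // fall back to std::swap for types without their own
    swap(v1, v2);    // ADL selects mylib::swap, which forwards to vector::swap
}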

C/OpenMP - issue with threadprivate and vectors of pointers

I'm new to the world of parallel programming and OpenMP, so this may be a naive question, but I can't really come up with a good explanation for what I'm experiencing, so I hope someone can shed some light on the matter.
What I am trying to achieve is to have a private copy of a dynamically allocated matrix (of integers) for every thread that will handle the following parallel section, but as soon as the flow of execution enters said region the reference to the supposedly private matrix holds a null value.
Is there any limitation of this directive I'm not aware of? Everything seems to work just fine with one-dimensional dynamic arrays.
A snippet of the code:
#define n 10000
int **matrix;
#pragma omp threadprivate(matrix)

int main()
{
    int i;
    matrix = (int**) calloc(n, sizeof(int*));
    for (i = 0; i < n; i++)
        matrix[i] = (int*) calloc(n, sizeof(int));
    AdjacencyMatrix(n, matrix);
    ...

    /* Explicitly turn off dynamic threads */
    omp_set_dynamic(0);

    #pragma omp parallel
    {
        // From now on, matrix is NULL...
        executor_p(matrix, n);
    }
    ....
Look at the OpenMP documentation regarding what happens with the threadprivate clause:
On first entry to a parallel region, data in THREADPRIVATE variables and common blocks should be assumed undefined, unless a COPYIN clause is specified in the PARALLEL directive
There's no guarantee of what value is going to be stored in the matrix variable in the parallel region.
OpenMP can privatise only variables with a known storage size. That is, you can have a private copy of an array if it was defined like double matrix[N][M]. In your case, not only is the storage size unknown (a pointer doesn't store the number of elements it points to), but your matrix also isn't a contiguous area of memory, just a pointer to a list of dynamically allocated rows.
What you end up with is a private copy of the top-level pointer, not a private copy of the matrix data itself.
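One common way out (a sketch, not the only option; shared_matrix is a hypothetical name for data built before the parallel region, and memcpy needs <string.h>): let each thread allocate and fill its own copy inside the parallel region, since the threadprivate pointer itself is per-thread:

#pragma omp parallel
{
    int i;
    /* per-thread allocation: each thread's threadprivate 'matrix' now
       points at its own storage */
    matrix = (int**) calloc(n, sizeof(int*));
    for (i = 0; i < n; i++) {
        matrix[i] = (int*) calloc(n, sizeof(int));
        /* copy the shared data into this thread's private copy */
        memcpy(matrix[i], shared_matrix[i], n * sizeof(int));
    }
    executor_p(matrix, n);
}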
