Optimizing a threaded simultaneous check

I have a device function that checks a byte array using threads: each thread checks a different byte in the array for a certain value and returns bool true or false.
How can I efficiently decide whether all of the checks have returned true?

// returns true if predicate is true for all threads in a block
__device__ bool unanimous(bool predicate) { ... }
__device__ bool all_the_same(unsigned char* bytes, unsigned char value, int n) {
    return unanimous(bytes[threadIdx.x] == value);
}
The implementation of unanimous() depends on the compute capability of your hardware. For compute capability 2.0 or higher devices, it is trivial:
__device__ bool unanimous(bool predicate) { return __syncthreads_and(predicate); }
For compute capability 1.0 and 1.1 devices, you will need to implement an AND reduction (exercise for the reader, since it's well documented). For the special case of compute capability 1.3, you can optimize the AND reduction using warp vote instructions, using the __all() intrinsic function provided in the CUDA headers.
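For reference, a minimal shared-memory AND reduction along those lines might look like this (a sketch only, assuming blockDim.x is a power of two no larger than 512, the thread limit on those devices):
__device__ bool unanimous(bool predicate) {
    __shared__ bool votes[512];
    votes[threadIdx.x] = predicate;
    __syncthreads();
    // classic tree reduction: halve the active range each step
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (threadIdx.x < stride)
            votes[threadIdx.x] = votes[threadIdx.x] && votes[threadIdx.x + stride];
        __syncthreads();
    }
    return votes[0];
}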
edit:
OK, since gamerx is asking in the comments: on sm_13 hardware, you can do this.
// returns true if predicate is true for all threads in a block
// note: supports a maximum of 1024 threads per block as written
__device__ bool unanimous(bool predicate) {
    __shared__ bool warp_votes[32];
    // preset votes for warps the block may not contain
    if (threadIdx.x < warpSize)
        warp_votes[threadIdx.x] = true;
    __syncthreads();
    warp_votes[threadIdx.x / warpSize] = __all(predicate);
    __syncthreads();
    // the first warp reduces the per-warp votes
    if (threadIdx.x < warpSize)
        warp_votes[0] = __all(warp_votes[threadIdx.x]);
    __syncthreads();
    return warp_votes[0];
}


How are priority queue and heap similar? [duplicate]

It seems that a priority queue is just a heap with normal queue operations like insert, delete, top, etc. Is this the correct way to interpret a priority queue? I know you can build priority queues in different ways but if I were to build a priority queue from a heap is it necessary to create a priority queue class and give instructions for building a heap and the queue operations or is it not really necessary to build the class?
What I mean is: if I have a function to build a heap and functions to do operations like insert and delete, do I need to put all these functions in a class, or can I just use them by calling them in main?
I guess my question is whether having a collection of functions is equivalent to storing them in some class and using them through a class or just using the functions themselves.
What I have below is all the methods for a priority queue implementation. Is this sufficient to call it an implementation or do I need to put it in a designated priority queue class?
#ifndef MAX_PRIORITYQ_H
#define MAX_PRIORITYQ_H
#include <iostream>
#include <deque>
#include <limits>
#include <stdexcept>
#include "print.h"
#include "random.h"

// exchange() is assumed to come from the helper headers above.

int parent(int i)
{
    return (i - 1) / 2;
}

int left(int i)
{
    return 2 * i + 1;   // 0-based indexing
}

int right(int i)
{
    return 2 * i + 2;   // 0-based indexing
}

void max_heapify(std::deque<int> &A, int i, int heapsize)
{
    int largest;
    int l = left(i);
    int r = right(i);
    if (l <= heapsize && A[l] > A[i])
        largest = l;
    else
        largest = i;
    if (r <= heapsize && A[r] > A[largest])
        largest = r;
    if (largest != i) {
        exchange(A, i, largest);
        max_heapify(A, largest, heapsize);
    }
}

void build_max_heap(std::deque<int> &A)
{
    int heapsize = A.size() - 1;   // heapsize is the index of the last element
    for (int i = (A.size() - 1) / 2; i >= 0; i--)
        max_heapify(A, i, heapsize);
}

int heap_maximum(std::deque<int> &A)
{
    return A[0];
}

int heap_extract_max(std::deque<int> &A, int heapsize)
{
    if (heapsize < 0)
        throw std::out_of_range("heap underflow");
    int max = A.front();
    A[0] = A[heapsize];   // move the last element to the root
    A.pop_back();
    max_heapify(A, 0, heapsize - 1);
    return max;
}

void heap_increase_key(std::deque<int> &A, int i, int key)
{
    if (key < A[i])
        std::cerr << "New key is smaller than current key" << std::endl;
    else {
        A[i] = key;
        while (i > 0 && A[parent(i)] < A[i]) {   // 0 is the root in 0-based indexing
            exchange(A, i, parent(i));
            i = parent(i);
        }
    }
}

void max_heap_insert(std::deque<int> &A, int key)
{
    A.push_back(std::numeric_limits<int>::min());
    heap_increase_key(A, A.size() - 1, key);
}
#endif
A priority queue is an abstract datatype. It is a shorthand way of describing a particular interface and behavior, and says nothing about the underlying implementation.
A heap is a data structure. It is a name for a particular way of storing data that makes certain operations very efficient.
It just so happens that a heap is a very good data structure to implement a priority queue, because the operations made efficient by the heap data structure are exactly the operations the priority queue interface needs.
Having a class with exactly the interface you need (just insert and pop-max?) has its advantages.
You can exchange the implementation (list instead of heap, for example) later.
Someone reading the code that uses the queue doesn't need to understand the more difficult interface of the heap data structure.
I guess my question is whether having a collection of functions is
equivalent to storing them in some class and using them through a
class or just using the functions themselves.
It's mostly equivalent if you just think in terms of "how does my program behave?". But it's not equivalent in terms of "how easy is my program to understand for a human reader?".
The term priority queue refers to the general data structure useful for ordering the priorities of its elements. There are multiple ways to achieve that, e.g., various ordered tree structures (a splay tree works reasonably well, for example) as well as various heaps, e.g., d-heaps or Fibonacci heaps. Conceptually, a heap is a tree structure where the weight of every node is not less than the weight of any node in the subtree rooted at that node.
The C++ Standard Template Library provides the make_heap, push_heap
and pop_heap algorithms for heaps (usually implemented as binary
heaps), which operate on arbitrary random access iterators. It treats
the iterators as a reference to an array, and uses the array-to-heap
conversion. It also provides the container adaptor priority_queue,
which wraps these facilities in a container-like class. However, there
is no standard support for the decrease/increase-key operation.
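For illustration, a minimal sketch showing both facilities side by side (illustrative only):
#include <algorithm>
#include <queue>
#include <vector>
#include <iostream>

int main()
{
    std::vector<int> v{3, 1, 4, 1, 5, 9};

    // Heap algorithms operating directly on the vector:
    std::make_heap(v.begin(), v.end());  // arrange v as a max-heap
    std::pop_heap(v.begin(), v.end());   // move the maximum to the back
    int top = v.back();
    v.pop_back();
    std::cout << top << '\n';            // prints 9

    // The same behavior through the container adaptor:
    std::priority_queue<int> pq(v.begin(), v.end());
    std::cout << pq.top() << '\n';       // prints 5
    pq.pop();
}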
priority_queue refers to an abstract data type defined entirely by the operations that may be performed on it. In the C++ STL, priority_queue is thus one of the sequence adaptors (adaptors of the basic containers; vector, list and deque are basic because they cannot be built from each other without loss of efficiency), defined in the <queue> header (<bits/stl_queue.h> in my case, actually). As Bjarne Stroustrup says, a
container adapter provides a restricted interface to a container. In
particular, adapters do not provide iterators; they are intended to be
used only through their specialized interfaces.
In my implementation, priority_queue is described as
/**
 * @brief A standard container automatically sorting its contents.
 *
 * @ingroup sequences
 *
 * This is not a true container, but an @e adaptor. It holds
 * another container, and provides a wrapper interface to that
 * container. The wrapper is what enforces priority-based sorting
 * and %queue behavior. Very few of the standard container/sequence
 * interface requirements are met (e.g., iterators).
 *
 * The second template parameter defines the type of the underlying
 * sequence/container. It defaults to std::vector, but it can be
 * any type that supports @c front(), @c push_back, @c pop_back,
 * and random-access iterators, such as std::deque or an
 * appropriate user-defined type.
 *
 * The third template parameter supplies the means of making
 * priority comparisons. It defaults to @c less<value_type> but
 * can be anything defining a strict weak ordering.
 *
 * Members not found in "normal" containers are @c container_type,
 * which is a typedef for the second Sequence parameter, and @c
 * push, @c pop, and @c top, which are standard %queue operations.
 * @note No equality/comparison operators are provided for
 * %priority_queue.
 * @note Sorting of the elements takes place as they are added to,
 * and removed from, the %priority_queue using the
 * %priority_queue's member functions. If you access the elements
 * by other means, and change their data such that the sorting
 * order would be different, the %priority_queue will not re-sort
 * the elements for you. (How could it know to do so?)
 */
template<typename _Tp, typename _Sequence = vector<_Tp>,
         typename _Compare = less<typename _Sequence::value_type> >
class priority_queue
{
In contrast to this, heap describes how its elements are fetched and stored in memory. It is a (tree-based) data structure; others are, e.g., array, hash table, struct, union, set. It additionally satisfies the heap property: all nodes are either [greater than or equal to] or [less than or equal to] each of their children, according to a comparison predicate defined for the heap.
So in my heap header I find no heap container, but rather a set of algorithms
/**
 * @defgroup heap_algorithms Heap Algorithms
 * @ingroup sorting_algorithms
 */
like:
__is_heap_until
__is_heap
__push_heap
__adjust_heap
__pop_heap
make_heap
sort_heap
all of them (excluding __is_heap, commented as "This function is an extension, not part of the C++ standard") are described as
* @ingroup heap_algorithms
*
* This operation... (what it does)
Not really. The "priority" in the name stems from a priority value for the entries in the queue, defining their ... of course: priority. There are many ways to implement such a PQ, however.
A priority queue is an abstract data structure that can be implemented in many ways (unsorted array, sorted array, heap). It is like an interface: it gives you the signature of a priority queue:
class PriorityQueue {
    top() → element
    peek() → element
    insert(element, priority)
    remove(element)
    update(element, newPriority)
    size() → int
}
A heap is a concrete implementation of the priority queue using an array (it can conceptually be represented as a particular kind of binary tree) to hold elements and specific algorithms to enforce invariants. Invariants are internal properties that always hold true throughout the life of the data structure.
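For instance, the invariant on the implicit array layout can be checked like this (is_max_heap is a hypothetical helper written out for clarity; C++11 also provides std::is_heap):
#include <vector>

// Check the max-heap invariant on the array representation:
// node i has children 2*i+1 and 2*i+2, and must not be smaller than either.
bool is_max_heap(const std::vector<int>& a)
{
    for (std::size_t i = 0; 2 * i + 1 < a.size(); ++i) {
        if (a[2 * i + 1] > a[i]) return false;
        if (2 * i + 2 < a.size() && a[2 * i + 2] > a[i]) return false;
    }
    return true;
}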
(A performance comparison chart of priority queue implementations followed here in the original answer.)

Optimizing bit-waste for custom data encoding

I was wondering what's a good solution to make a custom data structure take the least amount of space possible, and I've been searching around without finding anything.
The general idea is that I may have some kind of data structure with a lot of different variables: integers, booleans, etc. With booleans, it's fairly easy to use bitmasks/flags. For integers, perhaps one integer only needs 10 of its values and another needs 50. I would like to have some function encode the structure without wasting any bits. Ideally I would be able to pack the values side by side in an array, without any padding.
I have a vague idea that I would have to have a way of enumerating all the possible permutations of values of all the variables, but I'm unsure where to start with this.
Additionally, though this may be a bit more complicated, what if I have a bunch of restrictions, such as not caring about certain variables when other variables meet certain criteria? This reduces the number of permutations, so there should be a way of saving some bits there as well.
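For what it's worth, the enumeration idea corresponds to mixed-radix encoding: treat each variable as a digit whose base is the number of values it can take, and combine digits by multiply-and-add. A rough sketch (pack/unpack are hypothetical names, and a single uint64_t is assumed to be large enough to hold the product of all ranges):
#include <cstdint>
#include <vector>

// Pack values with arbitrary ranges into one integer by treating each
// value as a digit in a mixed-radix number. ranges[i] is the number of
// distinct values field i can take; values[i] must lie in [0, ranges[i]).
uint64_t pack(const std::vector<uint32_t>& values,
              const std::vector<uint32_t>& ranges) {
    uint64_t encoded = 0;
    for (std::size_t i = 0; i < values.size(); ++i)
        encoded = encoded * ranges[i] + values[i];
    return encoded;
}

std::vector<uint32_t> unpack(uint64_t encoded,
                             const std::vector<uint32_t>& ranges) {
    std::vector<uint32_t> values(ranges.size());
    // Extract digits in reverse order of packing.
    for (std::size_t i = ranges.size(); i-- > 0; ) {
        values[i] = encoded % ranges[i];
        encoded /= ranges[i];
    }
    return values;
}
A boolean is simply a digit of base 2, and a variable with 50 possible values costs log2(50) ≈ 5.64 bits instead of the rounded-up 6 a bit field would use. Conditional variables can be handled by multiplying in their range only when the governing condition holds, which captures the "fewer permutations" saving described above.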
Example: Say I have a server for an online game, containing many players. Each player is represented by a struct that stores a lot of different variables: level, stats, and a bunch of flags for which quests the player has cleared.
struct Player {
    int level;        // max is 100
    int strength;     // max is
    int intelligence; // max is 500
    /* ... */
    bool questFlag30;
    bool questFlag31;
    bool questFlag32;
    /* ... */
};
and I want to have an encoding function, something like encodedData encode(std::vector<Player> players), and a decodeData function which returns a vector of Players from the encoded data.
This is what I came up with; it's not perfect, but it's something:
#include <vector>
#include <iostream>
#include <bitset>
#include <cmath>
#include <cstdint>
#include <cassert>

/* Data structure for packing multiple variables, without padding */
struct compact_collection {
    std::vector<bool> data;

    /* Returns a uint32_t since we don't want to store the length of each variable */
    uint32_t query_bits(int index, int length) {
        std::bitset<32> temp;
        for (int i = index; i < index + length; i++)
            temp[i - index] = data[i];
        return temp.to_ulong();
    };

    /* Append the low-order `bits` bits of `value`, least significant first */
    void add_bits(int32_t value, int32_t bits) {
        assert(std::pow(2, bits) > value);
        auto a = std::bitset<32>(value).to_string();
        for (int i = 31; i >= 32 - bits; i--)
            data.push_back(a[i] == '1');
    };
};

int main() {
    compact_collection myCollection;
    myCollection.add_bits(45, 6);
    std::cout << myCollection.query_bits(0, 6);
    std::cin.get();
    return 0;
}

STL algorithm for splitting a vector into multiple smaller ones based on lambda

Let's say I have a vector containing a struct with a member describing its target vector.
struct Foo
{
    int target;
    static const int A = 0;
    static const int B = 1;
    static const int C = 2;
};
std::vector<Foo> elements;
std::vector<Foo> As;
std::vector<Foo> Bs;
std::vector<Foo> Cs;
std::vector<Foo> others;
Now I want to move each Foo into one of the four other vectors based on the value of target.
For example
auto elements = std::vector<Foo>{ {Foo::A}, {Foo::A}, {Foo::B} };
should result in two elements in As, one in Bs, and none in Cs or others. elements should be empty afterwards.
I could as well do it myself, but I wonder if there is an STL algorithm I could use to do its job.
Standard algorithms usually don't operate on multiple output destinations, so it's hard to come up with a suitable solution here when you want to abstract away the destination containers through output iterators. What might come closest is std::copy_if. This could look like
// Helper for predicate creation:
auto pred = [](int target){ return [target](const Foo& f){ return f.target == target; }; };
std::copy_if(elements.begin(), elements.end(), std::back_inserter(As), pred(Foo::A));
std::copy_if(elements.begin(), elements.end(), std::back_inserter(Bs), pred(Foo::B));
std::copy_if(elements.begin(), elements.end(), std::back_inserter(Cs), pred(Foo::C));
std::copy_if(elements.begin(), elements.end(), std::back_inserter(others),
             [](const Foo& f){ return f.target != Foo::A && f.target != Foo::B
                                      && f.target != Foo::C; });
elements.clear();
If copying is more expensive than move-construction, you should pass std::make_move_iterator(elements.begin()) and the same for elements.end() to the algorithm. The issue here is that this doesn't scale. std::copy_if linearly traverses the input range, and the above has to do this four times. One traversal can be obtained e.g. like the following.
auto doTheWork = [&As, &Bs, &Cs, &others](const Foo& foo) {
    if (foo.target == Foo::A)
        As.push_back(foo);
    else if (foo.target == Foo::B)
        Bs.push_back(foo);
    else if (foo.target == Foo::C)
        Cs.push_back(foo);
    else
        others.push_back(foo);
};
std::for_each(elements.begin(), elements.end(), doTheWork);
In this scenario, we have at least employed a standard algorithm, but shifted the logic into a rather ugly lambda. Note that the above lambda always copies its argument; it needs some adjustments to work properly with std::move_iterator.
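One such adjustment could look like this (a sketch: take the element by rvalue reference and pair the traversal with move iterators):
// Move-aware variant: move each element into its destination vector.
auto doTheWork = [&As, &Bs, &Cs, &others](Foo&& foo) {
    if (foo.target == Foo::A)
        As.push_back(std::move(foo));
    else if (foo.target == Foo::B)
        Bs.push_back(std::move(foo));
    else if (foo.target == Foo::C)
        Cs.push_back(std::move(foo));
    else
        others.push_back(std::move(foo));
};
std::for_each(std::make_move_iterator(elements.begin()),
              std::make_move_iterator(elements.end()), doTheWork);
elements.clear(); // drop the moved-from shells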
Sometimes, a good old range based for loop is the most readable solution.

Warp-shuffle reduction of arrays of any length

I am working on a Cuda kernel which performs a vector dot product (A x B). I assumed that the length of each vector is a multiple of 32 (32, 64, ...) and defined the block size to be equal to the length of the array. Each thread in the block multiplies one element of A with the corresponding element of B (thread i ==> psum = A[i] x B[i]). After the multiplication, I used the following functions, which use the warp-shuffle technique to perform the reduction and compute the sum of all the products.
__inline__ __device__
float warpReduceSum(float val) {
    for (int offset = warpSize / 2; offset > 0; offset /= 2)
        val += __shfl_down(val, offset);
    return val;
}

__inline__ __device__
float blockReduceSum(float val) {
    static __shared__ float shared[32]; // shared mem for 32 partial sums (float, to match val)
    int lane = threadIdx.x % warpSize;
    int wid = threadIdx.x / warpSize;
    val = warpReduceSum(val);     // each warp performs a partial reduction
    if (lane == 0)
        shared[wid] = val;        // write the reduced value to shared memory
    __syncthreads();              // wait for all partial reductions
    // read from shared memory only if that warp existed
    val = (threadIdx.x < blockDim.x / warpSize) ? shared[lane] : 0;
    if (wid == 0)
        val = warpReduceSum(val); // final reduce within the first warp
    return val;
}
I simply call blockReduceSum(psum), where psum is the product of the two elements handled by one thread.
This approach doesn't work when the length of the array is not a multiple of 32, so my question is: can we change this code so that it also works for any length? Or is it impossible, because if the length of the array is not a multiple of 32, some warps have elements belonging to more than one array?
First of all, depending on the GPU you are using, performing a dot product with just 1 block will probably not be very efficient (as long as you are not batching several dot products in 1 kernel, each done by a single block).
To answer your question: you can reuse the code you have written by calling your kernel with the number of threads being the closest multiple of 32 above N (the length of the array), and introducing an if statement before the call to blockReduceSum, like this:
__global__ void kernel(float* A, float* B, int N) {
    float psum = 0;
    // threadIdx.x works because you are using a single block; change it to a
    // more general id once you move to multiple blocks
    if (threadIdx.x < N)
        psum = A[threadIdx.x] * B[threadIdx.x];
    float sum = blockReduceSum(psum);
    // the rest of the computation uses sum
}
That way, threads that do not have an array element associated with them, but that need to be there due to the use of __shfl, will contribute 0 to the sum.
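As an aside: on CUDA 9 and newer, __shfl_down is deprecated in favor of the *_sync intrinsics, which take an explicit lane mask. The warp reduction above would become something like:
__inline__ __device__
float warpReduceSum(float val) {
    // 0xffffffff: all 32 lanes of the warp participate in the shuffle
    for (int offset = warpSize / 2; offset > 0; offset /= 2)
        val += __shfl_down_sync(0xffffffff, val, offset);
    return val;
}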

Are std::get<> and std::tuple<> slower than raw pointers?

I have a C++11 application where I commonly iterate over several different structures of arrays for various algorithms. Raw CPU performance is important for this app.
The array elements are fundamental types (int, double, ...) or a simple struct. The arrays are typically tens of thousands of elements long. I often need to iterate over several arrays at once in a given loop, so typically I need one pointer for each array of whatever type. Sometimes I need to increment five individual pointers, which is verbose.
Based on these answers about tuples,
Why is std::pair faster than std::tuple
C++11 tuple performance
I hoped there was no overhead to using tuples to pack the pointers together into a single object.
I thought it might be nice to implement a cursor-like object to assist in iterating, since missing the increment on a particular pointer would be an annoying bug.
auto pts = std::make_tuple(p1, p2, p3...);
allows you to bundle a bunch of variables together in a typesafe way. Then you can implement a variadic template function to increment each pointer in the tuple, also in a typesafe way.
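Such an increment function could look roughly like this (a C++14 sketch; advance_all is a hypothetical name, not a library facility):
#include <tuple>
#include <utility>
#include <initializer_list>

// Increment every pointer held in a tuple, in a type-safe way.
template <typename Tuple, std::size_t... Is>
void advance_all_impl(Tuple& t, std::index_sequence<Is...>)
{
    // expander trick: evaluate ++std::get<Is>(t) for every index
    (void)std::initializer_list<int>{ ((void)++std::get<Is>(t), 0)... };
}

template <typename... Ptrs>
void advance_all(std::tuple<Ptrs...>& t)
{
    advance_all_impl(t, std::index_sequence_for<Ptrs...>{});
}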
However...
When I measure performance, the tuple version is slower than using raw pointers. And when I look at the generated assembly, I see additional mov instructions in the tuple loop increment. Maybe due to the fact that std::get<> returns a reference? I had hoped that would be compiled away...
Am I missing something, or are raw pointers just going to beat tuples when used like this? Here is a simple test harness. I threw away the fancy cursor code and just use a std::tuple<> for this test.
On my machine, the tuple loop is consistently twice as slow as the raw pointer version for various data sizes.
My system config is Visual C++ 2013 x64 on Windows 8 with a release build. I did try turning on various optimizations in Visual Studio, such as
Inline Function Expansion : Any Suitable (/Ob2)
but it did not seem to change the time result for my case.
I did need to do two extra things to avoid aggressive optimization by VS
1) I forced the test data array to be allocated on the heap, not the stack. That made a big difference when I timed things, possibly due to memory cache effects.
2) I forced a side effect by writing to a static variable at the end so the compiler would not just skip my loop.
#include <cstdio>
#include <tuple>

struct forceHeap
{
    __declspec(noinline) int* newData(int M)
    {
        int* data = new int[M];
        return data;
    }
};

void timeSumCursor()
{
    static int gIntStore;
    int maxCount = 20;
    int M = 10000000;
    // compiler might place the array on the stack, which changes the timing
    // int* data = new int[M];
    forceHeap fh;
    int* data = fh.newData(M);
    int* front = data;
    int* end = data + M;
    int j = 0;
    for (int* p = front; p < end; ++p)
    {
        *p = (++j) % 1000;
    }
    {
        BEGIN_TIMING_BLOCK("raw pointer loop", maxCount);
        int sum = 0;
        int* cursor = front;
        while (++cursor != end) // note: skips the first element, in both loops
        {
            sum += *cursor;
        }
        gIntStore = sum; // force a side effect
        END_TIMING_BLOCK();
    }
    printf("%d\n", gIntStore);
    {
        // just use a simple tuple to show the issue
        // rather than a full blown cursor object
        BEGIN_TIMING_BLOCK("tuple loop", maxCount);
        int sum = 0;
        auto cursor = std::make_tuple(front);
        while (++std::get<0>(cursor) != end)
        {
            sum += *std::get<0>(cursor);
        }
        gIntStore = sum; // force a side effect
        END_TIMING_BLOCK();
    }
    printf("%d\n", gIntStore);
    delete[] data;
}
