Generate functions at compile time - c++11

I have an image. Every pixel contains information about RGB intensity. Now I want to sum intenity these channels, but I also want to choose which channels intensity to sum. Straightforwad implementation of this would look like this:
int intensity(const unsiged char* pixel, bool red, bool green, bool blue){
return 0 + (red ? pixel[0] : 0) + (green ? pixel[1] : 0) + (blue ? pixel[2] : 0);
}
Because I will call this function for every pixel in image I want to discard all conditions If I can. So I guess I have to have a function for every case:
std::function<int(const unsigned char* pixel)> generateIntensityAccumulator(
const bool& accumulateRChannel,
const bool& accumulateGChannel,
const bool& accumulateBChannel)
{
if (accumulateRChannel && accumulateGChannel && accumulateBChannel){
return [](const unsigned char* pixel){
return static_cast<int>(pixel[0]) + static_cast<int>(pixel[1]) + static_cast<int>(pixel[2]);
};
}
if (!accumulateRChannel && accumulateGChannel && accumulateBChannel){
return [](const unsigned char* pixel){
return static_cast<int>(pixel[1]) + static_cast<int>(pixel[2]);
};
}
if (!accumulateRChannel && !accumulateGChannel && accumulateBChannel){
return [](const unsigned char* pixel){
return static_cast<int>(pixel[2]);
};
}
if (!accumulateRChannel && !accumulateGChannel && !accumulateBChannel){
return [](const unsigned char* pixel){
return 0;
};
}
if (accumulateRChannel && !accumulateGChannel && !accumulateBChannel){
return [](const unsigned char* pixel){
return static_cast<int>(pixel[0]);
};
}
if (!accumulateRChannel && accumulateGChannel && !accumulateBChannel){
return [](const unsigned char* pixel){
return static_cast<int>(pixel[1]);
};
}
if (accumulateRChannel && !accumulateGChannel && accumulateBChannel){
return [](const unsigned char* pixel){
return static_cast<int>(pixel[0]) + static_cast<int>(pixel[2]);
};
}
if (accumulateRChannel && accumulateGChannel && !accumulateBChannel){
return [](const unsigned char* pixel){
return static_cast<int>(pixel[0]) + static_cast<int>(pixel[1]);
};
}
}
Now I can use this generator before entering image loop and use function without any conditions:
...
auto accumulator = generateIntensityAccumulator(true, false, true);
for(auto pixel : pixels){
auto intensity = accumulator(pixel);
}
...
But it is a lot of writting for such simple task and I have a feeling that there is a better way to accomplish this: for example make compiler to do a dirty work for me and generate all above cases. Can someone point me in the right direction?

Using a std::function like this will cost you dear, because you dont let a chance for the compiler to optimize by inlining what it can.
What you are trying to do is a good job for templates. And since you use integral numbers, the expression itself may be optimized away, sparing you the need to write a specialization of each version. Look at this example :
#include <array>
#include <chrono>
#include <iostream>
#include <random>
#include <vector>
template <bool AccumulateR, bool AccumulateG, bool AccumulateB>
inline int accumulate(const unsigned char *pixel) {
static constexpr int enableR = static_cast<int>(AccumulateR);
static constexpr int enableG = static_cast<int>(AccumulateG);
static constexpr int enableB = static_cast<int>(AccumulateB);
return enableR * static_cast<int>(pixel[0]) +
enableG * static_cast<int>(pixel[1]) +
enableB * static_cast<int>(pixel[2]);
}
int main(void) {
std::vector<std::array<unsigned char, 3>> pixels(
1e7, std::array<unsigned char, 3>{0, 0, 0});
// Fill up with randomness
std::random_device rd;
std::uniform_int_distribution<unsigned char> dist(0, 255);
for (auto &pixel : pixels) {
pixel[0] = dist(rd);
pixel[1] = dist(rd);
pixel[2] = dist(rd);
}
// Measure perf
using namespace std::chrono;
auto t1 = high_resolution_clock::now();
int sum1 = 0;
for (auto const &pixel : pixels)
sum1 += accumulate<true, true, true>(pixel.data());
auto t2 = high_resolution_clock::now();
int sum2 = 0;
for (auto const &pixel : pixels)
sum2 += accumulate<false, true, false>(pixel.data());
auto t3 = high_resolution_clock::now();
std::cout << "Sum 1 " << sum1 << " in "
<< duration_cast<milliseconds>(t2 - t1).count() << "ms\n";
std::cout << "Sum 2 " << sum2 << " in "
<< duration_cast<milliseconds>(t3 - t2).count() << "ms\n";
}
Compiled with Clang 3.9 with -O2, yields this result on my CPU:
Sum 1 -470682949 in 7ms
Sum 2 1275037960 in 2ms
Please notice the fact that we have an overflow here, you may need to use something bigger than an int. A uint64_t might do. If you inspect assembly code, you will see that the two versions of the function are inlined and optimized differently.

First things first. Don't write a std::function that takes a single pixel; write one that takes a contiguous range of pixels (a scanline of pixels).
Second, you want to write a template version of intensity:
template<bool red, bool green, bool blue>
int intensity(const unsiged char* pixel){
return (red ? pixel[0] : 0) + (green ? pixel[1] : 0) + (blue ? pixel[2] : 0);
}
pretty simple, eh? That will optimize down to your hand-crafted version.
template<std::size_t index>
int intensity(const unsiged char* pixel){
return intensity< index&1, index&2, index&4 >(pixel);
}
this one maps from the bits of index to which of the intensity<bool, bool, bool> to call. Now for the scanline version:
template<std::size_t index, std::size_t pixel_stride=3>
int sum_intensity(const unsiged char* pixel, std::size_t count){
int value = 0;
while(count--) {
value += intensity<index>(pixel);
pixel += pixel_stride;
}
return value;
}
We can now generate our scanline intensity calculator:
int(*)( const unsigned char* pel, std::size_t pixels )
scanline_intensity(bool red, bool green, bool blue) {
static const auto table[] = {
sum_intensity<0b000>, sum_intensity<0b001>,
sum_intensity<0b010>, sum_intensity<0b011>,
sum_intensity<0b100>, sum_intensity<0b101>,
sum_intensity<0b110>, sum_intensity<0b111>,
};
std::size_t index = red + green*2 + blue*4;
return sum_intensity[index];
}
and done.
These techniques can be made generic, but you don't need the generic ones.
If your pixel stride is not 3 (say there is an alpha channel), sum_intensity needs to be passed it (as a template parameter ideally).

Related

Effective implementation of conversion small string to uint64_t

#include <cstdint>
#include <cstring>
template<typename T>
T oph_(const char *s){
constexpr std::size_t MAX = sizeof(T);
const std::size_t size = strnlen(s, MAX);
T r = 0;
for(auto it = s; it - s < size; ++it)
r = r << 8 | *it;
return r;
}
inline uint64_t oph(const char *s){
return oph_<uint64_t>(s);
}
int main(){
uint64_t const a = oph("New York City");
uint64_t const b = oph("Boston International");
return a > b;
}
I want to convert first 8 characters from const char * to uint64_t so I can easily compare if two strings are greater / lesser.
I am aware that equals will semi-work.
However I am not sure if this is most efficient implementation.
I want the implementation to work on both little and big endian machines.
This is a C implementation, that should be faster that your implementation, but I still need to use strncpy which should be the bottleneck
#include <string.h>
#include <stdio.h>
#include <stdint.h>
#include <byteswap.h>
union small_str {
uint64_t v;
char buf[8];
};
static uint64_t fill_small_str(const char *str)
{
union small_str ss = { 0 };
strncpy(ss.buf, str, 8);
#if defined(__BYTE_ORDER__) && (__BYTE_ORDER__ == __ORDER_BIG_ENDIAN__)
return ss.v;
#else
return bswap_64(ss.v);
#endif
}
int main(void)
{
uint64_t const a = fill_small_str("Aew York City");
uint64_t const b = fill_small_str("Boston International");
printf("%lu ; %lu ; %d\n", a, b, (a < b));
return 0;
}

Variadic Template Recursion

I am trying to use recursion to solve this problem where if i call
decimal<0,0,1>();
i should get the decimal number (4 in this case).
I am trying to use recursion with variadic templates but cannot get it to work.
Here's my code;
template<>
int decimal(){
return 0;
}
template<bool a,bool...pack>
int decimal(){
cout<<a<<"called"<<endl;
return a*2 + decimal<pack...>();
};
int main(int argc, char *argv[]){
cout<<decimal<0,0,1>()<<endl;
return 0;
}
What would be the best way to solve this?
template<typename = void>
int decimal(){
return 0;
}
template<bool a,bool...pack>
int decimal(){
cout<<a<<"called"<<endl;
return a + 2*decimal<pack...>();
};
The problem was with the recursive case, where it expects to be able to call decltype<>(). That is what I have defined in the first overload above. You can essentially ignore the typename=void, the is just necessary to allow the first one to compile.
A possible solution can be the use of a constexpr function (so you can use it's values it's value run-time, when appropriate) where the values are argument of the function.
Something like
#include <iostream>
constexpr int decimal ()
{ return 0; }
template <typename T, typename ... packT>
constexpr int decimal (T const & a, packT ... pack)
{ return a*2 + decimal(pack...); }
int main(int argc, char *argv[])
{
constexpr int val { decimal(0, 0, 1) };
static_assert( val == 2, "!");
std::cout << val << std::endl;
return 0;
}
But I obtain 2, not 4.
Are you sure that your code should return 4?
-- EDIT --
As pointed by aschepler, my example decimal() template function return "eturns twice the sum of its arguments, which is not" what do you want.
Well, with 0, 1, true and false you obtain the same; with other number, you obtain different results.
But you can modify decimal() as follows
template <typename ... packT>
constexpr int decimal (bool a, packT ... pack)
{ return a*2 + decimal(pack...); }
to avoid this problem.
This is a C++14 solution. It is mostly C++11, except for std::integral_sequence nad std::index_sequence, both of which are relatively easy to implement in C++11.
template<bool...bs>
using bools = std::integer_sequence<bool, bs...>;
template<std::uint64_t x>
using uint64 = std::integral_constant< std::uint64_t, x >;
template<std::size_t N>
constexpr uint64< ((std::uint64_t)1) << (std::uint64_t)N > bit{};
template<std::uint64_t... xs>
struct or_bits : uint64<0> {};
template<std::int64_t x0, std::int64_t... xs>
struct or_bits<x0, xs...> : uint64<x0 | or_bits<xs...>{} > {};
template<bool...bs, std::size_t...Is>
constexpr
uint64<
or_bits<
uint64<
bs?bit<Is>:std::uint64_t(0)
>{}...
>{}
>
from_binary( bools<bs...> bits, std::index_sequence<Is...> ) {
(void)bits; // suppress warning
return {};
}
template<bool...bs>
constexpr
auto from_binary( bools<bs...> bits={} )
-> decltype( from_binary( bits, std::make_index_sequence<sizeof...(bs)>{} ) )
{ return {}; }
It generates the resulting value as a type with a constexpr conversion to scalar. This is slightly more powerful than a constexpr function in its "compile-time-ness".
It assumes that the first bit is the most significant bit in the list.
You can use from_binary<1,0,1>() or from_binary( bools<1,0,1>{} ).
Live example.
This particular style of type-based programming results in code that does all of its work in its signature. The bodies consist of return {};.

iterator over non-existing sequence

I have K objects (K is small, e.g. 2 or 5) and I need to iterate over them N times in random order where N may be large. I need to iterate in a foreach loop and for this I should provide an iterator.
So far I created a std::vector of my K objects copied accordingly, so the size of vector is N and now I use begin() and end() provided by that vector. I use std::shuffle() to randomize the vector and this takes up to 20% of running time. I think it would be better (and more elegant, anyways) to write a custom iterator that returns one of my object in random order without creating the helping vector of size N. But how to do this?
It is obvious that your iterator must:
Store pointer to original vector or array: m_pSource
Store the count of requests (to be able to stop): m_nOutputCount
Use random number generator (see random): m_generator
Some iterator must be treated as end iterator: m_nOutputCount == 0
I've made an example for type int:
#include <iostream>
#include <random>
class RandomIterator: public std::iterator<std::forward_iterator_tag, int>
{
public:
//Creates "end" iterator
RandomIterator() : m_pSource(nullptr), m_nOutputCount(0), m_nCurValue(0) {}
//Creates random "start" iterator
RandomIterator(const std::vector<int> &source, int nOutputCount) :
m_pSource(&source), m_nOutputCount(nOutputCount + 1),
m_distribution(0, source.size() - 1)
{
operator++(); //make new random value
}
int operator* () const
{
return m_nCurValue;
}
RandomIterator operator++()
{
if (m_nOutputCount == 0)
return *this;
--m_nOutputCount;
static std::default_random_engine generator;
static bool bWasGeneratorInitialized = false;
if (!bWasGeneratorInitialized)
{
std::random_device rd; //expensive calls
generator.seed(rd());
bWasGeneratorInitialized = true;
}
m_nCurValue = m_pSource->at(m_distribution(generator));
return *this;
}
RandomIterator operator++(int)
{ //postincrement
RandomIterator tmp = *this;
++*this;
return tmp;
}
int operator== (const RandomIterator& other) const
{
if (other.m_nOutputCount == 0)
return m_nOutputCount == 0; //"end" iterator
return m_pSource == other.m_pSource;
}
int operator!= (const RandomIterator& other) const
{
return !(*this == other);
}
private:
const std::vector<int> *m_pSource;
int m_nOutputCount;
int m_nCurValue;
std::uniform_int_distribution<std::vector<int>::size_type> m_distribution;
};
int main()
{
std::vector<int> arrTest{ 1, 2, 3, 4, 5 };
std::cout << "Original =";
for (auto it = arrTest.cbegin(); it != arrTest.cend(); ++it)
std::cout << " " << *it;
std::cout << std::endl;
RandomIterator rndEnd;
std::cout << "Random =";
for (RandomIterator it(arrTest, 15); it != rndEnd; ++it)
std::cout << " " << *it;
std::cout << std::endl;
}
The output is:
Original = 1 2 3 4 5
Random = 1 4 1 3 2 4 5 4 2 3 4 3 1 3 4
You can easily convert it into a template. And make it to accept any random access iterator.
I just want to increment Dmitriy answer, because reading your question, it seems that you want that every time that you iterate your newly-created-and-shuffled collection the items should not repeat and Dmitryi´s answer does have repetition. So both iterators are useful.
template <typename T>
struct RandomIterator : public std::iterator<std::forward_iterator_tag, typename T::value_type>
{
RandomIterator() : Data(nullptr)
{
}
template <typename G>
RandomIterator(const T &source, G& g) : Data(&source)
{
Order = std::vector<int>(source.size());
std::iota(begin(Order), end(Order), 0);
std::shuffle(begin(Order), end(Order), g);
OrderIterator = begin(Order);
OrderIteratorEnd = end(Order);
}
const typename T::value_type& operator* () const noexcept
{
return (*Data)[*OrderIterator];
}
RandomIterator<T>& operator++() noexcept
{
++OrderIterator;
return *this;
}
int operator== (const RandomIterator<T>& other) const noexcept
{
if (Data == nullptr && other.Data == nullptr)
{
return 1;
}
else if ((OrderIterator == OrderIteratorEnd) && (other.Data == nullptr))
{
return 1;
}
return 0;
}
int operator!= (const RandomIterator<T>& other) const noexcept
{
return !(*this == other);
}
private:
const T *Data;
std::vector<int> Order;
std::vector<int>::iterator OrderIterator;
std::vector<int>::iterator OrderIteratorEnd;
};
template <typename T, typename G>
RandomIterator<T> random_begin(const T& v, G& g) noexcept
{
return RandomIterator<T>(v, g);
}
template <typename T>
RandomIterator<T> random_end(const T& v) noexcept
{
return RandomIterator<T>();
}
whole code at
http://coliru.stacked-crooked.com/a/df6ce482bbcbafcf or
https://github.com/xunilrj/sandbox/blob/master/sources/random_iterator/source/random_iterator.cpp
Implementing custom iterators can be very tricky so I tried to follow some tutorials, but please let me know if something have passed:
http://web.stanford.edu/class/cs107l/handouts/04-Custom-Iterators.pdf
https://codereview.stackexchange.com/questions/74609/custom-iterator-for-a-linked-list-class
Operator overloading
I think that the performance is satisfactory:
On the Coliru:
<size>:<time for 10 iterations>
1:0.000126582
10:3.5179e-05
100:0.000185914
1000:0.00160409
10000:0.0161338
100000:0.180089
1000000:2.28161
Off course it has the price to allocate a whole vector with the orders, that is the same size of the original vector.
An improvement would be to pre-allocate the Order vector if for some reason you have to random iterate very often and allow the iterator to use this pre-allocated vector, or some form of reset() in the iterator.

Conditional reduction in CUDA

I need to sum about 100000 values stored in an array, but with conditions.
Is there a way to do that in CUDA to produce fast results?
Can anyone post a small code to do that?
I think that, to perform conditional reduction, you can directly introduce the condition as a multiplication by 0 (false) or 1 (true) to the addends. In other words, suppose that the condition you would like to meet is that the addends be smaller than 10.f. In this case, borrowing the first code at Optimizing Parallel Reduction in CUDA by M. Harris, then the above would mean
__global__ void reduce0(int *g_idata, int *g_odata) {
extern __shared__ int sdata[];
// each thread loads one element from global to shared mem
unsigned int tid = threadIdx.x;
unsigned int i = blockIdx.x*blockDim.x + threadIdx.x;
sdata[tid] = g_idata[i]*(g_data[i]<10.f);
__syncthreads();
// do reduction in shared mem
for(unsigned int s=1; s < blockDim.x; s *= 2) {
if (tid % (2*s) == 0) {
sdata[tid] += sdata[tid + s];
}
__syncthreads();
}
// write result for this block to global mem
if (tid == 0) g_odata[blockIdx.x] = sdata[0];
}
If you wish to use CUDA Thrust to perform conditional reduction, you can do the same by using thrust::transform_reduce. Alternatively, you can create a new vector d_b copying in that all the elements of d_a satisfying the predicate by thrust::copy_if and then applying thrust::reduce on d_b. I haven't checked which solution performs the best. Perhaps, the second solution will perform better on sparse arrays. Below is an example with an implementation of both the approaches.
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/reduce.h>
#include <thrust/count.h>
#include <thrust/copy.h>
// --- Operator for the first approach
struct conditional_operator {
__host__ __device__ float operator()(const float a) const {
return a*(a<10.f);
}
};
// --- Operator for the second approach
struct is_smaller_than_10 {
__host__ __device__ bool operator()(const float a) const {
return (a<10.f);
}
};
void main(void)
{
int N = 20;
// --- Host side allocation and vector initialization
thrust::host_vector<float> h_a(N,1.f);
h_a[0] = 20.f;
h_a[1] = 20.f;
// --- Device side allocation and vector initialization
thrust::device_vector<float> d_a(h_a);
// --- First approach
float sum = thrust::transform_reduce(d_a.begin(), d_a.end(), conditional_operator(), 0.f, thrust::plus<float>());
printf("Result = %f\n",sum);
// --- Second approach
int N_prime = thrust::count_if(d_a.begin(), d_a.end(), is_smaller_than_10());
thrust::device_vector<float> d_b(N_prime);
thrust::copy_if(d_a.begin(), d_a.begin() + N, d_b.begin(), is_smaller_than_10());
sum = thrust::reduce(d_b.begin(), d_b.begin() + N_prime, 0.f);
printf("Result = %f\n",sum);
getchar();
}

Boost.Variant Vs Virtual Interface Performance

I'm trying to measure a performance difference between using Boost.Variant and using virtual interfaces. For example, suppose I want to increment different types of numbers uniformly, using Boost.Variant I would use a boost::variant over int and float and a static visitor which increments each one of them. Using class interfaces I would use a pure virtual class number and number_int and number_float classes which derive from it and implement an "increment" method.
From my testing, using interfaces is far faster than using Boost.Variant.
I ran the code at the bottom and received these results:
Virtual: 00:00:00.001028
Variant: 00:00:00.012081
Why do you suppose this difference is? I thought Boost.Variant would be a lot faster.
** Note: Usually Boost.Variant uses heap allocations to guarantee that the variant would always be non-empty. But I read on the Boost.Variant documentation that if boost::has_nothrow_copy is true then it doesn't use heap allocations which should make things significantly faster. For int and float boost::has_nothrow_copy is true.
Here is my code for measuring the two approaches against each other.
#include <iostream>
#include <boost/variant/variant.hpp>
#include <boost/variant/static_visitor.hpp>
#include <boost/variant/apply_visitor.hpp>
#include <boost/date_time/posix_time/ptime.hpp>
#include <boost/date_time/posix_time/posix_time_types.hpp>
#include <boost/date_time/posix_time/posix_time_io.hpp>
#include <boost/format.hpp>
const int iterations_count = 100000;
// a visitor that increments a variant by N
template <int N>
struct add : boost::static_visitor<> {
template <typename T>
void operator() (T& t) const {
t += N;
}
};
// a number interface
struct number {
virtual void increment() = 0;
};
// number interface implementation for all types
template <typename T>
struct number_ : number {
number_(T t = 0) : t(t) {}
virtual void increment() {
t += 1;
}
T t;
};
void use_virtual() {
number_<int> num_int;
number* num = &num_int;
for (int i = 0; i < iterations_count; i++) {
num->increment();
}
}
void use_variant() {
typedef boost::variant<int, float, double> number;
number num = 0;
for (int i = 0; i < iterations_count; i++) {
boost::apply_visitor(add<1>(), num);
}
}
int main() {
using namespace boost::posix_time;
ptime start, end;
time_duration d1, d2;
// virtual
start = microsec_clock::universal_time();
use_virtual();
end = microsec_clock::universal_time();
// store result
d1 = end - start;
// variant
start = microsec_clock::universal_time();
use_variant();
end = microsec_clock::universal_time();
// store result
d2 = end - start;
// output
std::cout <<
boost::format(
"Virtual: %1%\n"
"Variant: %2%\n"
) % d1 % d2;
}
For those interested, after I was a bit frustrated, I passed the option -O2 to the compiler and boost::variant was way faster than a virtual call.
Thanks
This is obvious that -O2 reduces the variant time, because that whole loop is optimized away. Change the implementation to return the accumulated result to the caller, so that the optimizer wouldn't remove the loop, and you'll get the real difference:
Output:
Virtual: 00:00:00.000120 = 10000000
Variant: 00:00:00.013483 = 10000000
#include <iostream>
#include <boost/variant/variant.hpp>
#include <boost/variant/static_visitor.hpp>
#include <boost/variant/apply_visitor.hpp>
#include <boost/date_time/posix_time/ptime.hpp>
#include <boost/date_time/posix_time/posix_time_types.hpp>
#include <boost/date_time/posix_time/posix_time_io.hpp>
#include <boost/format.hpp>
const int iterations_count = 100000000;
// a visitor that increments a variant by N
template <int N>
struct add : boost::static_visitor<> {
template <typename T>
void operator() (T& t) const {
t += N;
}
};
// a visitor that increments a variant by N
template <typename T, typename V>
T get(const V& v) {
struct getter : boost::static_visitor<T> {
T operator() (T t) const { return t; }
};
return boost::apply_visitor(getter(), v);
}
// a number interface
struct number {
virtual void increment() = 0;
};
// number interface implementation for all types
template <typename T>
struct number_ : number {
number_(T t = 0) : t(t) {}
virtual void increment() { t += 1; }
T t;
};
int use_virtual() {
number_<int> num_int;
number* num = &num_int;
for (int i = 0; i < iterations_count; i++) {
num->increment();
}
return num_int.t;
}
int use_variant() {
typedef boost::variant<int, float, double> number;
number num = 0;
for (int i = 0; i < iterations_count; i++) {
boost::apply_visitor(add<1>(), num);
}
return get<int>(num);
}
int main() {
using namespace boost::posix_time;
ptime start, end;
time_duration d1, d2;
// virtual
start = microsec_clock::universal_time();
int i1 = use_virtual();
end = microsec_clock::universal_time();
// store result
d1 = end - start;
// variant
start = microsec_clock::universal_time();
int i2 = use_variant();
end = microsec_clock::universal_time();
// store result
d2 = end - start;
// output
std::cout <<
boost::format(
"Virtual: %1% = %2%\n"
"Variant: %3% = %4%\n"
) % d1 % i1 % d2 % i2;
}

Resources