std::bind - vector as an argument in bound function - performance

What is the best way to forward a vector to a bound function?
Below is code with two approaches. In production code the vector will contain a huge amount of data, and I would like to avoid copying it as much as possible.
#include <cstdint>
#include <iostream>
#include <vector>
#include <functional>
void foo(const std::vector<uint16_t>& v)
{
for(const auto& c : v)
{
std::cout << c;
}
std::cout << std::endl;
}
int main()
{
std::vector<uint16_t> vv{1, 2, 3, 4, 5, 6, 7, 8, 9, 0};
auto f1 = std::bind(&foo, vv); // 1)
auto f2 = std::bind(&foo, std::move(vv)); // 2)
f1();
f2();
}

It really depends on what you want to do with the bound functions.
If they are going to be copied or passed around (beyond the lifetime of vv), this is correct (and it is going to copy vv):
auto f1 = std::bind(&foo, vv); // 1)
This is also correct (vv is not copied, at least not initially):
auto f2 = std::bind(&foo, std::move(vv)); // 2)
but you will not be able to use the contents of vv after that point.
However, this is the most likely scenario that I can deduce from your example:
if the bound function is used locally while vv is still alive, you most likely want f3 to hold a "reference" to vv. This is done with the std::ref convention:
auto f3 = std::bind(&foo, std::ref(vv));


Boost Geometry: segments intersection not yet implemented?

I am trying a simple test: compute the intersection of 2 segments with Boost Geometry. It does not compile. I also tried some variations (int points instead of float points, 2D instead of 3D) with no improvement.
Is it really possible that Boost doesn't implement segment intersection? Or what did I do wrong? Am I missing some hpp? Am I confusing the algorithms "intersects" and "intersection"?
The code is very basic:
#include <boost/geometry.hpp>
#include <boost/geometry/geometries/point.hpp>
#include <boost/geometry/geometries/segment.hpp>
#include <boost/geometry/algorithms/intersection.hpp>
typedef boost::geometry::model::point<float, 3, boost::geometry::cs::cartesian> testPoint;
typedef boost::geometry::model::segment<testPoint> testSegment;
testSegment s1(
testPoint(-1.f, 0.f, 0.f),
testPoint(1.f, 0.f, 0.f)
);
testSegment s2(
testPoint(0.f, -1.f, 0.f),
testPoint(0.f, 1.f, 0.f)
);
std::vector<testPoint> output;
bool intersectionExists = boost::geometry::intersects(s1, s2, output);
But I get the following errors at compile time from Visual Studio:
- Error C2039 'apply' is not a member of 'boost::geometry::dispatch::disjoint<Geometry1,Geometry2,3,boost::geometry::segment_tag,boost::geometry::segment_tag,false>' CDCadwork C:\Program Files\Boost\boost_1_75_0\boost\geometry\algorithms\detail\disjoint\interface.hpp 54
- Error C2338 This operation is not or not yet implemented. CDCadwork C:\Program Files\Boost\boost_1_75_0\boost\geometry\algorithms\not_implemented.hpp 47
There are indeed two problems:
- You're intersecting 3D geometries. That's not implemented. Instead, you can do the same operation on a projection.
- You're passing an "output" geometry to intersects (which only returns the true/false value, as your chosen name intersectionExists suggests). When a third parameter is present, it is used as a Strategy, a concept that output obviously doesn't satisfy.
Note that intersection always returns true (see "What does boost::geometry::intersection return"), although that's not part of the documented interface.
Since your geometries are trivially projected onto the 2D plane Z=0:
#include <boost/geometry.hpp>
#include <boost/geometry/geometries/point.hpp>
#include <boost/geometry/geometries/segment.hpp>
#include <iostream>
#include <vector>
namespace bg = boost::geometry;
namespace bgm = bg::model;
using Point = bgm::point<float, 2, bg::cs::cartesian>;
using Segment = bgm::segment<Point>;
int main() {
Segment s1{{-1, 0}, {1, 0}};
Segment s2{{0, -1}, {0, 1}};
bool exists = bg::intersects(s1, s2);
std::vector<Point> output;
/*bool alwaysTrue = */ bg::intersection(s1, s2, output);
std::cout << bg::wkt(s1) << "\n";
std::cout << bg::wkt(s2) << "\n";
for (auto& p : output) {
std::cout << bg::wkt(p) << "\n";
}
return exists? 0:1;
}
Prints
LINESTRING(-1 0,1 0)
LINESTRING(0 -1,0 1)
POINT(0 0)

Sorting multiple arrays using CUDA/Thrust

I have a large array that I need to sort on the GPU. The array itself is a concatenation of multiple smaller subarrays that satisfy the condition that, given i < j, the elements of subarray i are smaller than the elements of subarray j. An example of such an array would be {5, 3, 4, 2, 1, 6, 9, 8, 7, 10, 11},
where the elements of the first subarray of 5 elements are smaller than the elements of the second subarray of 6 elements. The array I need is {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11}. I know the position where each subarray starts in the large array.
I know I can simply use thrust::sort on the whole array, but I was wondering if it's possible to launch multiple concurrent sorts, one for each subarray. I'm hoping to get a performance improvement by doing that. My assumption is that it would be faster to sort multiple smaller arrays than one large array with all the elements.
I'd appreciate if someone could give me a way to do that or correct my assumption in case it's wrong.
One way to do multiple concurrent sorts (a "vectorized" sort) in thrust is to mark the sub-arrays with a key and to provide a custom functor: an ordinary thrust sort functor that also orders the sub-arrays by their key.
Another possible method is to use back-to-back thrust::stable_sort_by_key calls.
As you have pointed out, another method in your case is just to do an ordinary sort, since that is ultimately your objective.
However, I think it's unlikely that any of these thrust sort methods will give a significant speed-up over a pure sort, although you can try it. Thrust has a fast-path radix sort which it will use in certain situations, and which the pure sort method could probably use in your case. (In other cases, e.g. when you provide a custom functor, thrust will often use a slower merge-sort method.)
If the sizes of the sub arrays are within certain ranges, I think you're likely to get much better results (performance-wise) with block radix sort in cub, one block per sub-array.
Here is an example that uses specific sizes (since you've given no indication of size ranges and other details), comparing a thrust "pure sort" to a thrust segmented sort with functor, to the cub block sort method. For this particular case, the cub sort is fastest:
$ cat t1.cu
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
#include <thrust/sort.h>
#include <thrust/scan.h>
#include <thrust/equal.h>
#include <cstdlib>
#include <iostream>
#include <time.h>
#include <sys/time.h>
#define USECPSEC 1000000ULL
const int num_blocks = 2048;
const int items_per = 4;
const int nTPB = 512;
const int block_size = items_per*nTPB; // must be a whole-number multiple of nTPB;
typedef float mt;
unsigned long long dtime_usec(unsigned long long start){
timeval tv;
gettimeofday(&tv, 0);
return ((tv.tv_sec*USECPSEC)+tv.tv_usec)-start;
}
struct my_sort_functor
{
template <typename T, typename T2>
__host__ __device__
bool operator()(T t1, T2 t2){
if (thrust::get<1>(t1) < thrust::get<1>(t2)) return true;
if (thrust::get<1>(t1) > thrust::get<1>(t2)) return false;
return thrust::get<0>(t1) < thrust::get<0>(t2);} // strict "less than": equal elements must compare false
};
// from: https://nvlabs.github.io/cub/example_block_radix_sort_8cu-example.html#_a0
#define CUB_STDERR
#include <stdio.h>
#include <iostream>
#include <algorithm>
#include <cub/block/block_load.cuh>
#include <cub/block/block_store.cuh>
#include <cub/block/block_radix_sort.cuh>
using namespace cub;
//---------------------------------------------------------------------
// Globals, constants and typedefs
//---------------------------------------------------------------------
bool g_verbose = false;
bool g_uniform_keys;
//---------------------------------------------------------------------
// Kernels
//---------------------------------------------------------------------
template <
typename Key,
int BLOCK_THREADS,
int ITEMS_PER_THREAD>
__launch_bounds__ (BLOCK_THREADS)
__global__ void BlockSortKernel(
Key *d_in, // Tile of input
Key *d_out) // Tile of output
{
enum { TILE_SIZE = BLOCK_THREADS * ITEMS_PER_THREAD };
// Specialize BlockLoad type for our thread block (uses warp-striped loads for coalescing, then transposes in shared memory to a blocked arrangement)
typedef BlockLoad<Key, BLOCK_THREADS, ITEMS_PER_THREAD, BLOCK_LOAD_WARP_TRANSPOSE> BlockLoadT;
// Specialize BlockRadixSort type for our thread block
typedef BlockRadixSort<Key, BLOCK_THREADS, ITEMS_PER_THREAD> BlockRadixSortT;
// Shared memory
__shared__ union TempStorage
{
typename BlockLoadT::TempStorage load;
typename BlockRadixSortT::TempStorage sort;
} temp_storage;
// Per-thread tile items
Key items[ITEMS_PER_THREAD];
// Our current block's offset
int block_offset = blockIdx.x * TILE_SIZE;
// Load items into a blocked arrangement
BlockLoadT(temp_storage.load).Load(d_in + block_offset, items);
// Barrier for smem reuse
__syncthreads();
// Sort keys
BlockRadixSortT(temp_storage.sort).SortBlockedToStriped(items);
// Store output in striped fashion
StoreDirectStriped<BLOCK_THREADS>(threadIdx.x, d_out + block_offset, items);
}
int main(){
const int ds = num_blocks*block_size;
thrust::host_vector<mt> data(ds);
thrust::host_vector<int> keys(ds);
for (int i = block_size; i < ds; i+=block_size) keys[i] = 1; // mark beginning of blocks
thrust::device_vector<int> d_keys = keys;
for (int i = 0; i < ds; i++) data[i] = (rand()%block_size) + (i/block_size)*block_size; // populate data
thrust::device_vector<mt> d_data = data;
thrust::inclusive_scan(d_keys.begin(), d_keys.end(), d_keys.begin()); // fill out keys array 000111222...
thrust::device_vector<mt> d1 = d_data; // make a copy of unsorted data
cudaDeviceSynchronize();
unsigned long long os = dtime_usec(0);
thrust::sort(d1.begin(), d1.end()); // ordinary sort
cudaDeviceSynchronize();
os = dtime_usec(os);
thrust::device_vector<mt> d2 = d_data; // make a copy of unsorted data
cudaDeviceSynchronize();
unsigned long long ss = dtime_usec(0);
thrust::sort(thrust::make_zip_iterator(thrust::make_tuple(d2.begin(), d_keys.begin())), thrust::make_zip_iterator(thrust::make_tuple(d2.end(), d_keys.end())), my_sort_functor());
cudaDeviceSynchronize();
ss = dtime_usec(ss);
if (!thrust::equal(d1.begin(), d1.end(), d2.begin())) {std::cout << "oops1" << std::endl; return 0;}
std::cout << "ordinary thrust sort: " << os/(float)USECPSEC << "s " << "segmented sort: " << ss/(float)USECPSEC << "s" << std::endl;
thrust::device_vector<mt> d3(ds);
cudaDeviceSynchronize();
unsigned long long cs = dtime_usec(0);
BlockSortKernel<mt, nTPB, items_per><<<num_blocks, nTPB>>>(thrust::raw_pointer_cast(d_data.data()), thrust::raw_pointer_cast(d3.data()));
cudaDeviceSynchronize();
cs = dtime_usec(cs);
if (!thrust::equal(d1.begin(), d1.end(), d3.begin())) {std::cout << "oops2" << std::endl; return 0;}
std::cout << "cub sort: " << cs/(float)USECPSEC << "s" << std::endl;
}
$ nvcc -o t1 t1.cu
$ ./t1
ordinary thrust sort: 0.001652s segmented sort: 0.00263s
cub sort: 0.000265s
$
(CUDA 10.2.89, Tesla V100, Ubuntu 18.04)
I have no doubt that your sizes and array dimensions don't correspond to mine. The purpose here is to illustrate some possible methods, not a black-box solution that works for your particular case. You probably should do benchmark comparisons of your own. I also acknowledge that the block radix sort method for cub expects equal-sized sub-arrays, which you may not have. It may not be a suitable method for you, or you may wish to explore some kind of padding arrangement. There's no need to ask this question of me; I won't be able to answer it based on the information in your question.
I don't claim correctness for this code or any other code that I post. Anyone using any code I post does so at their own risk. I merely claim that I have attempted to address the questions in the original posting, and provide some explanation thereof. I am not claiming my code is defect-free, or that it is suitable for any particular purpose. Use it (or not) at your own risk.

Conversion of data type using auto in C++

I have 2 vector containers which hold 2 different kinds of values, both with data type uint32_t. I want to print both of them together.
This is what I have:
vector<uint32_t> data1;
vector<uint32_t> data2;
I know a method for a single vector, like below:
for(auto const& d1: data1)
cout<< d1 << endl;
But I want to print both vectors together, like this:
cout<< d1 << "\t" << d2 << endl;
How can I do this using auto? (Here d2 is the auto-deduced value from data2.)
You could use a normal for loop over the index (where n is the common size of the two vectors):
for (auto i = 0u; i != n; ++i)
std::cout << data1[i] << "\t" << data2[i] << "\n";
Edit: if you want to convert the uint32_t to an int, for example, you could do:
auto d1 = static_cast<int>(data1[i]);
but it is up to you to ensure the conversion is safe, i.e. that the value fits in the target type.
Use the Boost Zip Iterator, which will let you have a range of pairs rather than two ranges of the vectors' data types. Something along the lines of:
#include <boost/iterator/zip_iterator.hpp>
#include <boost/range.hpp>
#include <stdint.h>
#include <vector>
#include <iostream>
template <typename... TContainer>
auto zip(TContainer&... containers) -> boost::iterator_range<boost::zip_iterator<decltype(boost::make_tuple(std::begin(containers)...))>> {
auto zip_begin = boost::make_zip_iterator(boost::make_tuple(std::begin(containers)...));
auto zip_end = boost::make_zip_iterator(boost::make_tuple(std::end(containers)...));
return boost::make_iterator_range(zip_begin, zip_end);
}
int main()
{
std::vector<uint32_t> data1( { 11, 22, 33 } );
std::vector<uint32_t> data2( { 44, 55, 66 } );
for (auto t : zip(data1, data2)) {
std::cout << boost::get<0>(t) << "\t" << boost::get<1>(t) << "\n";
}
}
The zip() function comes from another question; you can put it in a separate header file since it's not specific to your case.
If possible (and plausible for your use case): work with a container of pairs
If your application is not in a bind w.r.t. computer resources, and you know that you will be working with the values of your two containers as pairs (assuming same-length containers, as in your example), it might be useful to actually work with a container of pairs, which also eases the use of the neat range-based for loops (>= C++11).
#include <algorithm>
#include <cstdint>
#include <iostream>
#include <iterator>
#include <utility>
#include <vector>
int main()
{
std::vector<uint32_t> data1 = {1, 2, 3};
std::vector<uint32_t> data2 = {4, 5, 6};
// construct container of (int, int) pairs
std::vector<std::pair<int, int>> data;
data.reserve(data1.size());
std::transform(data1.begin(), data1.end(), data2.begin(), std::back_inserter(data),
[](uint32_t first, uint32_t second) {
return std::make_pair(static_cast<int>(first), static_cast<int>(second));
}); /* as noted in accepted answer: you're responsible for
ensuring that the conversion here is safe */
// easily use range-based for loops to traverse the
// pairs of your container
for(const auto& pair: data) {
std::cout << pair.first << " " << pair.second << "\n";
} /* 1 4
2 5
3 6 */
return 0;
}

how to erase from vector in range-based loop?

I simply want to erase the specified elements in a range-based loop:
vector<int> vec = { 3, 4, 5, 6, 7, 8 };
for (auto & i:vec)
{
if (i>5)
vec.erase(&i);
}
what's wrong?
You can't erase elements of a std::vector by value (or by pointer): erase expects an iterator. Since a range-based loop exposes the values directly, your code doesn't make sense (vec.erase(&i)).
The main problem is that a std::vector invalidates its iterators when you erase an element.
Since the range-based loop is basically implemented as
auto begin = vec.begin();
auto end = vec.end();
for (auto it = begin; it != end; ++it) {
..
}
erasing an element would invalidate it (and the cached end) and break the subsequent iterations.
If you really want to remove an element while iterating you must take care of updating the iterator correctly:
for (auto it = vec.begin(); it != vec.end(); /* NOTHING */)
{
if ((*it) > 5)
it = vec.erase(it);
else
++it;
}
Removing elements from a vector that you're iterating over is generally a bad idea. In your case you're most likely skipping the 7. A much better way would be using std::remove_if for it:
vec.erase(std::remove_if(vec.begin(), vec.end(),
[](const int& i){ return i > 5; }),
vec.end());
std::remove_if shifts the elements that should be kept to the front of the container and returns an iterator to the new logical end. You then only have to erase from that iterator up to the end.
It's quite simple: don't use a range-based loop. These loops are intended as a concise form for sequentially iterating over all the values in a container. If you want something more complicated (such as erasing or generally access to iterators), do it the explicit way:
for (auto it = begin(vec); it != end(vec);) {
if (*it > 5)
it = vec.erase(it);
else
++it;
}
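Actually it IS possible, despite what the other answers say. Be aware, though, that erase invalidates the iterators the range-based loop keeps internally, so the standard does not guarantee that this works.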
#include <vector>
#include <iostream>
#include <algorithm>
using namespace std;
int main() {
vector<int> ints{1,2,3,4};
for (auto it = ints.begin(); auto& i: ints) { // you can create the iterator here in C++20
if (i == 3)
ints.erase(it--); // Decrement after erasing a single element, and it preserves the iterator
it++;
}
for_each(ints.cbegin(), ints.cend(),
[] (int i) {cout << i << " ";}
);
}
In C++20 you can just call std::erase_if(ints, [](const int i){return i==3;});

use of for_each in a partial copy

I have some old C code that still runs very fast. One of the things it does is store the part of an array for which a condition holds (a 'masked' copy)
So the C code is:
int *msk;
int msk_size;
double *ori;
double out[msk_size];
...
for ( int i=0; i<msk_size; i++ )
out[i] = ori[msk[i]];
When I was 'modernising' this code, I figured that there would be a way to do this in C++11 with iterators that don't need to use index counters. But there does not seem to be a shorter way to do this with std::for_each or even std::copy.
Is there a way to write this up more concisely in C++11? Or should I stop looking and leave the old code in?
I think you are looking for std::transform.
std::array<int, msk_size> msk;
std::array<double, msk_size> out;
double *ori;
....
std::transform(std::begin(msk), std::end(msk),
std::begin(out),
[&](int i) { return ori[i]; });
In case you only want to modernize the loop, and keep the ori and msk data around, use @YuxiuLi's solution. If you also want to modernize the generation of the msk data, you can use std::copy_if with a predicate (here: a lambda that keeps only the negative numbers) to filter the elements directly.
#include <algorithm>
#include <vector>
#include <iostream>
#include <iterator>
int main()
{
auto ori = std::vector<double> { 0.1, -1.2, 2.4, 3.4, -7.1 };
std::vector<double> out;
std::copy_if(begin(ori), end(ori), std::back_inserter(out), [&](double d) { return d < 0.0; });
std::copy(begin(out), end(out), std::ostream_iterator<double>(std::cout, ","));
}
This avoids the intermediate storage of msk.
