I have some old C code that still runs very fast. One of the things it does is store the part of an array for which a condition holds (a 'masked' copy)
So the C code is:
int *msk;
int msk_size;
double *ori;
double out[msk_size];
...
for ( int i=0; i<msk_size; i++ )
out[i] = ori[msk[i]];
When I was 'modernising' this code, I figured that there would be a way to do this in C++11 with iterators that don't need to use index counters. But there does not seem to be a shorter way to do this with std::for_each or even std::copy.
Is there a way to write this up more concisely in C++11? Or should I stop looking and leave the old code in?
I think you are looking for std::transfrom.
std::array<int, msk_size> msk;
std::array<double, msk_size> out;
double *ori;
....
std::transform(std::begin(msk), std::end(msk),
std::begin(out),
[&](int i) { return ori[i]; });
In case you only want to modernize the loop, and keep the ori and msk data around, use #YuxiuLi's solution. If you also want to modernize the generation of the msk data, you can use std::copy_if with a predicate (here: a lambda that keeps only the negative numbers) to filter the elements directly.
#include <algorithm>
#include <vector>
#include <iostream>
#include <iterator>
int main()
{
auto ori = std::vector<double> { 0.1, -1.2, 2.4, 3.4, -7.1 };
std::vector<double> out;
std::copy_if(begin(ori), end(ori), std::back_inserter(out), [&](double d) { return d < 0.0; });
std::copy(begin(out), end(out), std::ostream_iterator<double>(std::cout, ","));
}
Live Example. This saves an intermediate storage of msk.
Related
I have a large array that I need to sort on the GPU. The array itself is a concatenation of multiple smaller subarrays that satisfy the condition that given i < j, the elements of the subarray i are smaller than the elements of the subarray j. An example of such array would be {5 3 4 2 1 6 9 8 7 10 11},
where the elements of the first subarray of 5 elements are smaller than the elements of the second subarray of 6 elements. The array I need is {1, 2, 3, 4, 5, 6, 7, 10, 11}. I know the position where each subarray starts in the large array.
I know I can simply use thrust::sort on the whole array, but I was wondering if it's possible to launch multiple concurrent sorts, one for each subarray. I'm hoping to get a performance improvement by doing that. My assumption is that it would be faster to sort multiple smaller arrays than one large array with all the elements.
I'd appreciate if someone could give me a way to do that or correct my assumption in case it's wrong.
A way to do multiple concurrent sorts (a "vectorized" sort) in thrust is via the marking of the sub arrays, and providing a custom functor that is an ordinary thrust sort functor that also orders the sub arrays by their key.
Another possible method is to use back-to-back thrust::stable_sort_by_key as described here.
As you have pointed out, another method in your case is just to do an ordinary sort, since that is ultimately your objective.
However I think its unlikely that any of the thrust sort methods will give a signficant speed-up over a pure sort, although you can try it. Thrust has a fast-path radix sort which it will use in certain situations, which the pure sort method could probably use in your case. (In other cases, e.g. when you provide a custom functor, thrust will often use a slower merge-sort method.)
If the sizes of the sub arrays are within certain ranges, I think you're likely to get much better results (performance-wise) with block radix sort in cub, one block per sub-array.
Here is an example that uses specific sizes (since you've given no indication of size ranges and other details), comparing a thrust "pure sort" to a thrust segmented sort with functor, to the cub block sort method. For this particular case, the cub sort is fastest:
$ cat t1.cu
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
#include <thrust/sort.h>
#include <thrust/scan.h>
#include <thrust/equal.h>
#include <cstdlib>
#include <iostream>
#include <time.h>
#include <sys/time.h>
#define USECPSEC 1000000ULL
const int num_blocks = 2048;
const int items_per = 4;
const int nTPB = 512;
const int block_size = items_per*nTPB; // must be a whole-number multiple of nTPB;
typedef float mt;
unsigned long long dtime_usec(unsigned long long start){
timeval tv;
gettimeofday(&tv, 0);
return ((tv.tv_sec*USECPSEC)+tv.tv_usec)-start;
}
struct my_sort_functor
{
template <typename T, typename T2>
__host__ __device__
bool operator()(T t1, T2 t2){
if (thrust::get<1>(t1) < thrust::get<1>(t2)) return true;
if (thrust::get<1>(t1) > thrust::get<1>(t2)) return false;
if (thrust::get<0>(t1) > thrust::get<0>(t2)) return false;
return true;}
};
// from: https://nvlabs.github.io/cub/example_block_radix_sort_8cu-example.html#_a0
#define CUB_STDERR
#include <stdio.h>
#include <iostream>
#include <algorithm>
#include <cub/block/block_load.cuh>
#include <cub/block/block_store.cuh>
#include <cub/block/block_radix_sort.cuh>
using namespace cub;
//---------------------------------------------------------------------
// Globals, constants and typedefs
//---------------------------------------------------------------------
bool g_verbose = false;
bool g_uniform_keys;
//---------------------------------------------------------------------
// Kernels
//---------------------------------------------------------------------
template <
typename Key,
int BLOCK_THREADS,
int ITEMS_PER_THREAD>
__launch_bounds__ (BLOCK_THREADS)
__global__ void BlockSortKernel(
Key *d_in, // Tile of input
Key *d_out) // Tile of output
{
enum { TILE_SIZE = BLOCK_THREADS * ITEMS_PER_THREAD };
// Specialize BlockLoad type for our thread block (uses warp-striped loads for coalescing, then transposes in shared memory to a blocked arrangement)
typedef BlockLoad<Key, BLOCK_THREADS, ITEMS_PER_THREAD, BLOCK_LOAD_WARP_TRANSPOSE> BlockLoadT;
// Specialize BlockRadixSort type for our thread block
typedef BlockRadixSort<Key, BLOCK_THREADS, ITEMS_PER_THREAD> BlockRadixSortT;
// Shared memory
__shared__ union TempStorage
{
typename BlockLoadT::TempStorage load;
typename BlockRadixSortT::TempStorage sort;
} temp_storage;
// Per-thread tile items
Key items[ITEMS_PER_THREAD];
// Our current block's offset
int block_offset = blockIdx.x * TILE_SIZE;
// Load items into a blocked arrangement
BlockLoadT(temp_storage.load).Load(d_in + block_offset, items);
// Barrier for smem reuse
__syncthreads();
// Sort keys
BlockRadixSortT(temp_storage.sort).SortBlockedToStriped(items);
// Store output in striped fashion
StoreDirectStriped<BLOCK_THREADS>(threadIdx.x, d_out + block_offset, items);
}
int main(){
const int ds = num_blocks*block_size;
thrust::host_vector<mt> data(ds);
thrust::host_vector<int> keys(ds);
for (int i = block_size; i < ds; i+=block_size) keys[i] = 1; // mark beginning of blocks
thrust::device_vector<int> d_keys = keys;
for (int i = 0; i < ds; i++) data[i] = (rand()%block_size) + (i/block_size)*block_size; // populate data
thrust::device_vector<mt> d_data = data;
thrust::inclusive_scan(d_keys.begin(), d_keys.end(), d_keys.begin()); // fill out keys array 000111222...
thrust::device_vector<mt> d1 = d_data; // make a copy of unsorted data
cudaDeviceSynchronize();
unsigned long long os = dtime_usec(0);
thrust::sort(d1.begin(), d1.end()); // ordinary sort
cudaDeviceSynchronize();
os = dtime_usec(os);
thrust::device_vector<mt> d2 = d_data; // make a copy of unsorted data
cudaDeviceSynchronize();
unsigned long long ss = dtime_usec(0);
thrust::sort(thrust::make_zip_iterator(thrust::make_tuple(d2.begin(), d_keys.begin())), thrust::make_zip_iterator(thrust::make_tuple(d2.end(), d_keys.end())), my_sort_functor());
cudaDeviceSynchronize();
ss = dtime_usec(ss);
if (!thrust::equal(d1.begin(), d1.end(), d2.begin())) {std::cout << "oops1" << std::endl; return 0;}
std::cout << "ordinary thrust sort: " << os/(float)USECPSEC << "s " << "segmented sort: " << ss/(float)USECPSEC << "s" << std::endl;
thrust::device_vector<mt> d3(ds);
cudaDeviceSynchronize();
unsigned long long cs = dtime_usec(0);
BlockSortKernel<mt, nTPB, items_per><<<num_blocks, nTPB>>>(thrust::raw_pointer_cast(d_data.data()), thrust::raw_pointer_cast(d3.data()));
cudaDeviceSynchronize();
cs = dtime_usec(cs);
if (!thrust::equal(d1.begin(), d1.end(), d3.begin())) {std::cout << "oops2" << std::endl; return 0;}
std::cout << "cub sort: " << cs/(float)USECPSEC << "s" << std::endl;
}
$ nvcc -o t1 t1.cu
$ ./t1
ordinary thrust sort: 0.001652s segmented sort: 0.00263s
cub sort: 0.000265s
$
(CUDA 10.2.89, Tesla V100, Ubuntu 18.04)
I have no doubt that your sizes and array dimensions don't correspond to mine. The purpose here is to illustrate some possible methods, not a black-box solution that works for your particular case. You probably should do benchmark comparisons of your own. I also acknowledge that the block radix sort method for cub expects equal-sized sub-arrays, which you may not have. It may not be a suitable method for you, or you may wish to explore some kind of padding arrangement. There's no need to ask this question of me; I won't be able to answer it based on the information in your question.
I don't claim correctness for this code or any other code that I post. Anyone using any code I post does so at their own risk. I merely claim that I have attempted to address the questions in the original posting, and provide some explanation thereof. I am not claiming my code is defect-free, or that it is suitable for any particular purpose. Use it (or not) at your own risk.
I have a std::set<vector<int>> from which I would like to move (not copy) elements to a std::vector<vector<int>>. How do I do this?
I tried using move (which I use to move between vectors) with the following piece of code but it won't compile.
#include<iostream>
#include<set>
#include<vector>
using namespace std;
int main(){
set<vector<int>> res;
vector<int> first = {1,2,3};
vector<int> second = {4,5,6};
res.insert(first);
res.insert(second);
vector<vector<int>> ans;
for(auto i : res){
ans.emplace_back(ans.end(),move(i));
}
return 0;
}
A set<T> does not contain Ts; it contains const Ts. As such, you cannot simply move objects out of it.
This is one of reasons why we still may need const_cast sometimes:
for(auto& i : res){
ans.emplace_back(ans.end(),move(const_cast<T&>(i)));
}
No point to do it for int elements though.
Is it possible to put the iterator of list in to set:
I wrote codes as follows :
It failed on VS2015 but run smoothly on g++
And I also tried to use std::hash to calculate a hash value of std::list::iterator
but failed again, it has no hash func for iterator.
And one can help ? Or it's impossible .....
#include <set>
#include <list>
#include <cstring>
#include <cassert>
// like std::less
struct myless
{
typedef std::list<int>::iterator first_argument_type;
typedef std::list<int>::iterator second_argument_type;
typedef bool result_type;
bool operator()(const std::list<int>::iterator& x,const std::list<int>::iterator& y) const
{
return memcmp(&x, &y, sizeof(std::list<int>::iterator)) < 0; // using memcmp
}
};
int main()
{
std::list<int> lst = {1,2,3,4,5};
std::set<std::list<int>::iterator,myless> test;
auto it = lst.begin();
test.insert(it++);
test.insert(it++);
assert(test.find(lst.begin()) != test.end()); // fail on vs 2015
auto it1 = lst.end();
auto it2 = lst.end();
assert(memcmp(&it1,&it2,sizeof(it1)) == 0); // fail on vs 2015
system("pause");
return 0;
}
Yes, you can put std::list<T>::iterator in a std::set, if you tell std::set what order they should be in. A reasonable order could be std::less<T>, i.e. you sort the iterators by the values they point to (obviously you then can't insert an std::list::end iterator). Any other order is also OK.
However, you tried to use memcmp, and that is wrong. The predicate used by set requires that equal values compare equal, and there is no guarantee that equal iterators (as defined by list::iterator::operator==) also compare equal using memcmp.
I find a way to do this as like but not for the end iterator
bool operator<(const T& x, const T& y)
{
return &*x < &*y;
}
#include <iostream>
#include <vector>
#include <algorithm>
#include <tuple>
using namespace std;
typedef long long ll;
vector < tuple <ll,ll,ll> > a;
int main()
{
ll t;
cin>>t;
ll id,z,p,l,c,s,newz;
while(t--)
{
cin>>id>>z>>p>>l>>c>>s;
newz=p*50+l*5+c*10+s*20;
a.push_back(make_tuple(z-newz,id,newz));
}
sort(a.begin(),a.end());
for(int i=0;i<5;i++)
{
tie(ignore,id,z)=a[i];
cout<<id<<" "<<z<<endl;
}
return 0;
}
I want the sort on the vector to happen on the basis of first element of the tuple but only when there is a tie then the smallest of the second element of the tuple must be chosen to order the elements with the same first value.
Also specify what should be done, if at the time of a tie the order should be maintain on the basis of greater element of the second element of the tuple(instead of the first).
A custom function as a third parameter sorted my way out perfectly.
bool cmp( tuple <ll,ll,ll> const &s, tuple <ll,ll,ll> const &r)
{
if(get<0>(s)==get<0>(r))
{
return (get<1>(s))>(get<1>(r));
}
else
return (get<0>(s))<(get<0>(r));
}
sort(a.begin(),a.end(),cmp);// Call to sort will change like this.
I want to use GSL's uniform random number generator. On their website, they include this sample code:
#include <stdio.h>
#include <gsl/gsl_rng.h>
int
main (void)
{
const gsl_rng_type * T;
gsl_rng * r;
int i, n = 10;
gsl_rng_env_setup();
T = gsl_rng_default;
r = gsl_rng_alloc (T);
for (i = 0; i < n; i++)
{
double u = gsl_rng_uniform (r);
printf ("%.5f\n", u);
}
gsl_rng_free (r);
return 0;
}
However, this does not rely on any seed and so, the same random numbers will be produced each time.
They also specify the following:
The generator itself can be changed using the environment variable GSL_RNG_TYPE. Here is the output of the program using a seed value of 123 and the multiple-recursive generator mrg,
$ GSL_RNG_SEED=123 GSL_RNG_TYPE=mrg ./a.out
But I don't understand how to implement this. Any ideas as to what modifications I can make to the above code to incorporate the seed?
The problem is that a new seed is not being generated. If you just want a function that returns a darn random number, and care nothing about the sticky details of how it's generated, try this. Assumes that you have the GSL installed.
#include <iostream>
#include <gsl/gsl_math.h>
#include <gsl/gsl_rng.h>
#include <sys/time.h>
float keithRandom() {
// Random number function based on the GNU Scientific Library
// Returns a random float between 0 and 1, exclusive; e.g., (0,1)
const gsl_rng_type * T;
gsl_rng * r;
gsl_rng_env_setup();
struct timeval tv; // Seed generation based on time
gettimeofday(&tv,0);
unsigned long mySeed = tv.tv_sec + tv.tv_usec;
T = gsl_rng_default; // Generator setup
r = gsl_rng_alloc (T);
gsl_rng_set(r, mySeed);
double u = gsl_rng_uniform(r); // Generate it!
gsl_rng_free (r);
return (float)u;
}
Read 18.6 Random number environment variables to see what that gsl_rng_env_setup() function is doing. It is getting a generator type and seed from environment variables.
Then see 18.3 Random number generator initialization - if you don't want to get the seed from an environment variable, you can use gsl_rng_set() to set the seed.
A complete answer to this question with a sample code can be seen in in this link.
Just for completeness I am putting a copy of the code for a function to create a seed here. It is written by Robert G. Brown: http://www.phy.duke.edu/~rgb/ .
#include <stdio.h>
#include <sys/time.h>
unsigned long int random_seed()
{
unsigned int seed;
struct timeval tv;
FILE *devrandom;
if ((devrandom = fopen("/dev/random","r")) == NULL) {
gettimeofday(&tv,0);
seed = tv.tv_sec + tv.tv_usec;
} else {
fread(&seed,sizeof(seed),1,devrandom);
fclose(devrandom);
}
return(seed);
}
But from my own experience with this function, I would say that the dev/random solution is very time consuming compared to the gettimeofday(), you can check it out. So, the gettimeofday() solution, might be better for you if its level of accuracy is enough:
#include <stdio.h>
#include <sys/time.h>
unsigned long int random_seed()
{
struct timeval tv;
gettimeofday(&tv,0);
return (tv.tv_sec + tv.tv_usec);
}