I know how to do this the long way, so to speak:
#include <vector>
int main() {
    // Simple vector of ints, resized to 1k elements
    std::vector<int> ints;
    ints.resize( 1000 ); // Easy enough
    // Nested vector of ints: 1k vectors, each with 1k elements
    std::vector<std::vector<int>> vecInts;
    vecInts.resize( 1000 );
    for ( auto& a : vecInts ) {
        a.resize( 1000 );
    }
    // Again easy enough.
}
Now, instead of typing it out like that, I would like to use typedefs:
#include <vector>
typedef std::vector<int> Ints;
typedef std::vector<Ints> vecInts;
int main() {
    vecInts a;
    a.resize( 1000 ); // seems okay
    // Assuming this would work...
    for ( auto& n : a ) {
        n.resize( 1000 );
    }
}
My question is: does the second snippet do what is expected, and is it equivalent to the first snippet, or am I missing something?
Second quick question: does 1k * 1k exceed the size limits of std::vector?
Yes, the two snippets do the same thing. But you can also write it as a one-liner: std::vector has a constructor (constructor (2) on that page) that takes a count and a value, and copy-constructs each element from that value.
vecInts a(1000, Ints(1000));
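As a minimal sketch of the equivalence, the two approaches can be wrapped in helper functions and compared. The names make_with_resize, make_with_constructor and within_limits are hypothetical and only serve the illustration. On the size question: each individual vector here holds only 1000 elements, so the per-vector limit reported by max_size() is what matters, and on typical platforms that limit is far larger than 1000.

```cpp
#include <cstddef>
#include <vector>

typedef std::vector<int> Ints;
typedef std::vector<Ints> vecInts;

// The "long way": resize the outer vector, then every inner vector.
vecInts make_with_resize(std::size_t rows, std::size_t cols) {
    vecInts a;
    a.resize(rows);
    for (auto& n : a) {
        n.resize(cols);
    }
    return a;
}

// The one-liner: count + value constructor, each row copy-constructed from Ints(cols).
vecInts make_with_constructor(std::size_t rows, std::size_t cols) {
    return vecInts(rows, Ints(cols));
}

// Checks the requested dimensions against what each vector type reports as its limit.
bool within_limits(std::size_t rows, std::size_t cols) {
    return rows <= vecInts().max_size() && cols <= Ints().max_size();
}
```

Both helpers produce rows of value-initialized (zero) ints, so comparing their results with operator== should yield true.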
I have a large array that I need to sort on the GPU. The array itself is a concatenation of multiple smaller subarrays that satisfy the condition that, given i < j, the elements of subarray i are smaller than the elements of subarray j. An example of such an array would be {5, 3, 4, 2, 1, 6, 9, 8, 7, 10, 11}, where the elements of the first subarray of 5 elements are smaller than the elements of the second subarray of 6 elements. The array I need is {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11}. I know the position where each subarray starts in the large array.
I know I can simply use thrust::sort on the whole array, but I was wondering if it's possible to launch multiple concurrent sorts, one for each subarray. I'm hoping to get a performance improvement by doing that. My assumption is that it would be faster to sort multiple smaller arrays than one large array with all the elements.
I'd appreciate it if someone could show me a way to do that, or correct my assumption in case it's wrong.
One way to do multiple concurrent sorts (a "vectorized" sort) in thrust is to mark the sub-arrays with keys and provide a custom functor: an ordinary thrust sort functor that also orders the elements by their sub-array key.
Another possible method is to use back-to-back thrust::stable_sort_by_key as described here.
As you have pointed out, another method in your case is just to do an ordinary sort, since that is ultimately your objective.
However, I think it's unlikely that any of the thrust sort methods will give a significant speed-up over a pure sort, although you can try it. Thrust has a fast-path radix sort which it will use in certain situations, and which the pure sort method could probably use in your case. (In other cases, e.g. when you provide a custom functor, thrust will often use a slower merge-sort method.)
If the sizes of the sub arrays are within certain ranges, I think you're likely to get much better results (performance-wise) with block radix sort in cub, one block per sub-array.
Here is an example that uses specific sizes (since you've given no indication of size ranges and other details), comparing a thrust "pure sort" to a thrust segmented sort with functor, to the cub block sort method. For this particular case, the cub sort is fastest:
$ cat t1.cu
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
#include <thrust/sort.h>
#include <thrust/scan.h>
#include <thrust/equal.h>
#include <cstdlib>
#include <iostream>
#include <time.h>
#include <sys/time.h>
#define USECPSEC 1000000ULL
const int num_blocks = 2048;
const int items_per = 4;
const int nTPB = 512;
const int block_size = items_per*nTPB; // must be a whole-number multiple of nTPB;
typedef float mt;
unsigned long long dtime_usec(unsigned long long start){
    timeval tv;
    gettimeofday(&tv, 0);
    return ((tv.tv_sec*USECPSEC)+tv.tv_usec)-start;
}
// Orders by sub-array key (tuple element 1) first, then by value (tuple element 0)
struct my_sort_functor
{
    template <typename T, typename T2>
    __host__ __device__
    bool operator()(T t1, T2 t2){
        if (thrust::get<1>(t1) < thrust::get<1>(t2)) return true;
        if (thrust::get<1>(t1) > thrust::get<1>(t2)) return false;
        // strict comparison, so that equal elements never compare as "less"
        return thrust::get<0>(t1) < thrust::get<0>(t2);
    }
};
// from: https://nvlabs.github.io/cub/example_block_radix_sort_8cu-example.html#_a0
#define CUB_STDERR
#include <stdio.h>
#include <iostream>
#include <algorithm>
#include <cub/block/block_load.cuh>
#include <cub/block/block_store.cuh>
#include <cub/block/block_radix_sort.cuh>
using namespace cub;
// Globals, constants and typedefs
bool g_verbose = false;
bool g_uniform_keys;
// Kernels
template <
    typename Key,
    int BLOCK_THREADS,
    int ITEMS_PER_THREAD>
__launch_bounds__ (BLOCK_THREADS)
__global__ void BlockSortKernel(
    Key *d_in, // Tile of input
    Key *d_out) // Tile of output
{
    enum { TILE_SIZE = BLOCK_THREADS * ITEMS_PER_THREAD };
    // Specialize BlockLoad type for our thread block (uses warp-striped loads for coalescing, then transposes in shared memory to a blocked arrangement)
    typedef BlockLoad<Key, BLOCK_THREADS, ITEMS_PER_THREAD, BLOCK_LOAD_WARP_TRANSPOSE> BlockLoadT;
    // Specialize BlockRadixSort type for our thread block
    typedef BlockRadixSort<Key, BLOCK_THREADS, ITEMS_PER_THREAD> BlockRadixSortT;
    // Shared memory
    __shared__ union TempStorage
    {
        typename BlockLoadT::TempStorage load;
        typename BlockRadixSortT::TempStorage sort;
    } temp_storage;
    // Per-thread tile items
    Key items[ITEMS_PER_THREAD];
    // Our current block's offset
    int block_offset = blockIdx.x * TILE_SIZE;
    // Load items into a blocked arrangement
    BlockLoadT(temp_storage.load).Load(d_in + block_offset, items);
    // Barrier for smem reuse
    __syncthreads();
    // Sort keys
    BlockRadixSortT(temp_storage.sort).SortBlockedToStriped(items);
    // Store output in striped fashion
    StoreDirectStriped<BLOCK_THREADS>(threadIdx.x, d_out + block_offset, items);
}
int main(){
const int ds = num_blocks*block_size;
thrust::host_vector<mt> data(ds);
thrust::host_vector<int> keys(ds);
for (int i = block_size; i < ds; i+=block_size) keys[i] = 1; // mark beginning of blocks
thrust::device_vector<int> d_keys = keys;
for (int i = 0; i < ds; i++) data[i] = (rand()%block_size) + (i/block_size)*block_size; // populate data
thrust::device_vector<mt> d_data = data;
thrust::inclusive_scan(d_keys.begin(), d_keys.end(), d_keys.begin()); // fill out keys array 000111222...
thrust::device_vector<mt> d1 = d_data; // make a copy of unsorted data
unsigned long long os = dtime_usec(0);
thrust::sort(d1.begin(), d1.end()); // ordinary sort
os = dtime_usec(os);
thrust::device_vector<mt> d2 = d_data; // make a copy of unsorted data
unsigned long long ss = dtime_usec(0);
thrust::sort(thrust::make_zip_iterator(thrust::make_tuple(d2.begin(), d_keys.begin())), thrust::make_zip_iterator(thrust::make_tuple(d2.end(), d_keys.end())), my_sort_functor());
ss = dtime_usec(ss);
if (!thrust::equal(d1.begin(), d1.end(), d2.begin())) {std::cout << "oops1" << std::endl; return 0;}
std::cout << "ordinary thrust sort: " << os/(float)USECPSEC << "s " << "segmented sort: " << ss/(float)USECPSEC << "s" << std::endl;
thrust::device_vector<mt> d3(ds);
unsigned long long cs = dtime_usec(0);
BlockSortKernel<mt, nTPB, items_per><<<num_blocks, nTPB>>>(thrust::raw_pointer_cast(d_data.data()), thrust::raw_pointer_cast(d3.data()));
cudaDeviceSynchronize(); // kernel launches are asynchronous; wait for completion before stopping the timer
cs = dtime_usec(cs);
if (!thrust::equal(d1.begin(), d1.end(), d3.begin())) {std::cout << "oops2" << std::endl; return 0;}
std::cout << "cub sort: " << cs/(float)USECPSEC << "s" << std::endl;
return 0;
}
$ nvcc -o t1 t1.cu
$ ./t1
ordinary thrust sort: 0.001652s segmented sort: 0.00263s
cub sort: 0.000265s
(CUDA 10.2.89, Tesla V100, Ubuntu 18.04)
I have no doubt that your sizes and array dimensions don't correspond to mine. The purpose here is to illustrate some possible methods, not a black-box solution that works for your particular case. You probably should do benchmark comparisons of your own. I also acknowledge that the block radix sort method for cub expects equal-sized sub-arrays, which you may not have. It may not be a suitable method for you, or you may wish to explore some kind of padding arrangement. There's no need to ask this question of me; I won't be able to answer it based on the information in your question.
I don't claim correctness for this code or any other code that I post. Anyone using any code I post does so at their own risk. I merely claim that I have attempted to address the questions in the original posting, and provide some explanation thereof. I am not claiming my code is defect-free, or that it is suitable for any particular purpose. Use it (or not) at your own risk.
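For readers who want to see the key-marking idea without a GPU, here is a host-side sketch in plain C++ using std::sort. It is only an analogue of the thrust functor approach, not a replacement for it, and it says nothing about GPU performance. The function name segmented_sort and the offsets parameter (the start index of each sub-array, with offsets[0] assumed to be 0) are hypothetical.

```cpp
#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

// Sorts every sub-array independently by tagging each value with the index of
// its sub-array and ordering by (sub-array key, value).
// Assumption: offsets is sorted ascending and offsets[0] == 0.
std::vector<int> segmented_sort(const std::vector<int>& data,
                                const std::vector<std::size_t>& offsets) {
    typedef std::pair<int, int> Tagged; // (sub-array key, value)
    std::vector<Tagged> tagged;
    tagged.reserve(data.size());
    std::size_t seg = 0;
    for (std::size_t i = 0; i < data.size(); ++i) {
        // Advance to the sub-array that contains index i.
        while (seg + 1 < offsets.size() && i >= offsets[seg + 1]) ++seg;
        tagged.push_back(Tagged(static_cast<int>(seg), data[i]));
    }
    // Same ordering as the functor: key first, then value (strictly less).
    std::sort(tagged.begin(), tagged.end(),
              [](const Tagged& a, const Tagged& b) {
                  if (a.first != b.first) return a.first < b.first;
                  return a.second < b.second;
              });
    std::vector<int> out;
    out.reserve(tagged.size());
    for (const Tagged& t : tagged) out.push_back(t.second);
    return out;
}
```

Because the sub-arrays in the question are already ordered relative to each other, the segmented result coincides with an ordinary sort of the whole array, which is why the plain thrust::sort is a valid alternative there.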
I was wondering what's a good solution to make it so that a custom data structure took the least amount of space possible, and I've been searching around without finding anything.
The general idea is that I may have some kind of data structure with a lot of different variables: integers, booleans, etc. With booleans, it's fairly easy to use bitmasks/flags. For integers, perhaps one of them only ever takes 10 distinct values, and another one 50. I would like to have some function encode the structure without wasting any bits. Ideally I would be able to pack the structures side by side in an array, without any padding.
I have a vague idea that I would need a way of enumerating all the possible combinations of values of all the variables, but I'm unsure where to start with this.
Additionally, though this may be a bit more complicated, what if I have a bunch of restrictions, such as not caring about certain variables if other variables meet certain criteria? This reduces the number of combinations, so there should be a way of saving some bits here as well.
Example: Say I have a server for an online game with many players. The Player struct stores a lot of different variables: level, stats, and a bunch of flags for which quests the player has cleared.
struct Player {
    int level; // max is 100
    int strength; // max is
    int intelligence; // max is 500
    /* ... */
    bool questFlag30;
    bool questFlag31;
    bool questFlag32;
    /* ... */
};
I want to have a function encodedData encode(std::vector<Player> players) that takes a vector of Players, and a function decodeData that returns a vector of Players from the encoded data.
This is what I came up with; it's not perfect, but it's something:
#include <vector>
#include <iostream>
#include <bitset>
#include <cassert>
#include <cstdint>
/* Data structure for packing multiple variables, without padding */
struct compact_collection {
    std::vector<bool> data;
    /* Returns a uint32_t since we don't want to store the length of each variable */
    uint32_t query_bits(int index, int length) {
        std::bitset<32> temp;
        for (int i = index; i < index + length; i++) temp[i - index] = data[i];
        return temp.to_ulong();
    }
    /* Prepends the low `bits` bits of value, so the newest value always starts at index 0 */
    void add_bits(int32_t value, int32_t bits) {
        assert(bits > 0 && bits < 32 && value >= 0 && value < (int32_t(1) << bits));
        auto a = std::bitset<32>(value).to_string();
        for (int i = 32 - bits; i < 32; i++) data.insert(data.begin(), (a[i] == '1'));
    }
};
int main() {
    compact_collection myCollection;
    myCollection.add_bits(45, 6); // store something first: querying an empty collection reads out of bounds
    std::cout << myCollection.query_bits(0,6); // prints 45
    return 0;
}
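The "enumerate all combinations" idea from the question can be sketched as a mixed-radix encoding: if field i takes values in [0, radix[i]), the whole record maps to a single integer below the product of all radices. This is a minimal sketch under the assumption that this product fits in 64 bits; the names pack and unpack are hypothetical. With the question's example of one field with 10 values and one with 50, there are 500 combinations, which need 9 bits, whereas separate bit fields would need 4 + 6 = 10 bits.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Encodes values[i] in [0, radix[i]) as v0 + r0 * (v1 + r1 * (v2 + ...)).
// Assumption: the product of all radices fits in 64 bits.
std::uint64_t pack(const std::vector<std::uint32_t>& values,
                   const std::vector<std::uint32_t>& radix) {
    std::uint64_t code = 0;
    for (std::size_t i = values.size(); i-- > 0;) {
        code = code * radix[i] + values[i];
    }
    return code;
}

// Inverse of pack: peels the fields off again, starting with field 0.
std::vector<std::uint32_t> unpack(std::uint64_t code,
                                  const std::vector<std::uint32_t>& radix) {
    std::vector<std::uint32_t> values(radix.size());
    for (std::size_t i = 0; i < radix.size(); ++i) {
        values[i] = static_cast<std::uint32_t>(code % radix[i]);
        code /= radix[i];
    }
    return values;
}
```

The resulting code can then be stored with a fixed number of bits per record, for example through the add_bits function of the collection above. Restrictions between variables would shrink the set of valid combinations further, but exploiting them requires a custom enumeration that this sketch does not attempt.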
Is it possible to put a std::list iterator into a std::set?
I wrote the code below.
It fails on VS2015 but runs fine with g++.
I also tried to use std::hash to compute a hash value for a std::list::iterator,
but that failed as well, because there is no hash function for iterators.
Can anyone help? Or is it impossible?
#include <set>
#include <list>
#include <cstring>
#include <cassert>
// like std::less
struct myless
{
    typedef std::list<int>::iterator first_argument_type;
    typedef std::list<int>::iterator second_argument_type;
    typedef bool result_type;
    bool operator()(const std::list<int>::iterator& x,const std::list<int>::iterator& y) const
    {
        return memcmp(&x, &y, sizeof(std::list<int>::iterator)) < 0; // using memcmp
    }
};
int main()
{
    std::list<int> lst = {1,2,3,4,5};
    std::set<std::list<int>::iterator,myless> test;
    auto it = lst.begin();
    test.insert(it);
    assert(test.find(lst.begin()) != test.end()); // fail on vs 2015
    auto it1 = lst.end();
    auto it2 = lst.end();
    assert(memcmp(&it1,&it2,sizeof(it1)) == 0); // fail on vs 2015
    return 0;
}
Yes, you can put std::list<T>::iterator in a std::set, if you tell std::set what order they should be in. A reasonable order could be std::less<T>, i.e. you sort the iterators by the values they point to (obviously you then can't insert an std::list::end iterator). Any other order is also OK.
However, you tried to use memcmp, and that is wrong. The predicate used by set requires that equal values compare equal, and there is no guarantee that equal iterators (as defined by list::iterator::operator==) also compare equal using memcmp.
I found a way to do this, shown below, but it does not work for the end iterator:
template <typename T>
bool operator<(const T& x, const T& y)
{
    return &*x < &*y;
}
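A minimal sketch of the same idea packaged as a comparator for std::set, assuming a std::list<int>. It compares the addresses of the referenced elements through std::less, which guarantees a total order on pointers, whereas a raw < between pointers to unrelated objects does not. The name by_address is hypothetical, and as with the snippet above, end() must never be inserted or looked up because it cannot be dereferenced.

```cpp
#include <functional>
#include <list>
#include <set>

// Orders list iterators by the address of the element they refer to.
// Element addresses in a std::list stay stable until the element is erased.
struct by_address {
    typedef std::list<int>::iterator Iter;
    bool operator()(const Iter& x, const Iter& y) const {
        return std::less<const int*>()(&*x, &*y);
    }
};

// Collects an iterator to every element of lst into a set.
std::set<std::list<int>::iterator, by_address> index_all(std::list<int>& lst) {
    std::set<std::list<int>::iterator, by_address> result;
    for (auto it = lst.begin(); it != lst.end(); ++it) {
        result.insert(it);
    }
    return result;
}
```

Unlike ordering by the pointed-to values, this keeps iterators to equal values as distinct set entries.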
I have some old C code that still runs very fast. One of the things it does is store the part of an array for which a condition holds (a 'masked' copy)
So the C code is:
int *msk;
int msk_size;
double *ori;
double out[msk_size];
for ( int i=0; i<msk_size; i++ )
out[i] = ori[msk[i]];
When I was 'modernising' this code, I figured that there would be a way to do this in C++11 with iterators that don't need to use index counters. But there does not seem to be a shorter way to do this with std::for_each or even std::copy.
Is there a way to write this up more concisely in C++11? Or should I stop looking and leave the old code in?
I think you are looking for std::transform.
std::array<int, msk_size> msk;
std::array<double, msk_size> out;
double *ori;
std::transform(std::begin(msk), std::end(msk), std::begin(out),
               [&](int i) { return ori[i]; });
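A self-contained sketch of this gather with std::vector, for the case where the sizes are only known at run time. The function name gather is hypothetical, and the sketch assumes every index in msk is a valid index into ori.

```cpp
#include <algorithm>
#include <iterator>
#include <vector>

// Returns ori[msk[0]], ori[msk[1]], ... in that order.
// Assumption: all entries of msk are valid indices into ori.
std::vector<double> gather(const std::vector<double>& ori,
                           const std::vector<int>& msk) {
    std::vector<double> out;
    out.reserve(msk.size());
    std::transform(msk.begin(), msk.end(), std::back_inserter(out),
                   [&ori](int i) { return ori[i]; });
    return out;
}
```

Whether this reads better than the original index loop is a matter of taste; both perform the same element-wise copy.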
In case you only want to modernize the loop, and keep the ori and msk data around, use @YuxiuLi's solution. If you also want to modernize the generation of the msk data, you can use std::copy_if with a predicate (here: a lambda that keeps only the negative numbers) to filter the elements directly.
#include <algorithm>
#include <vector>
#include <iostream>
#include <iterator>
int main()
{
    auto ori = std::vector<double> { 0.1, -1.2, 2.4, 3.4, -7.1 };
    std::vector<double> out;
    std::copy_if(begin(ori), end(ori), std::back_inserter(out), [&](double d) { return d < 0.0; });
    std::copy(begin(out), end(out), std::ostream_iterator<double>(std::cout, ","));
}
This avoids storing the intermediate msk array.
I'm trying to use a set of dynamic_bitset objects, but I'm getting an assertion failure at runtime:
a.out: boost/dynamic_bitset/dynamic_bitset.hpp:1291:
bool boost::operator<(const boost::dynamic_bitset<Block, Allocator>&,
const boost::dynamic_bitset<Block, Allocator>&)
[with Block = long unsigned int,
Allocator = std::allocator<long unsigned int>]:
Assertion `a.size() == b.size()' failed.
Here is the code:
#include <iostream>
#include <set>
#include <boost/dynamic_bitset.hpp>
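int main() {
    typedef boost::dynamic_bitset<> bitset;
    std::set<bitset> myset;
    bitset x(2, 0);
    bitset y(3, 1);
    myset.insert(x);
    myset.insert(y); // compares y against x: the sizes differ, so the assertion fires
    return 0;
}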
I'm wondering why the same size for the inserted dynamic_bitset objects is required. For the operator< to work, couldn't it assume that the most significant bits in the shorter bitset are implicitly filled with zeros?
Is there any way to get that set of dynamic_bitsets to work?
I've also tried an unordered_set because it doesn't need the operator< but it can't compile because dynamic_bitset doesn't have a hash_value and I'm not sure how to write that without using its to_ulong member function, which would work only for short bitsets.
The reason for the assertion is the way the operator< is implemented:
for (size_type ii = a.num_blocks(); ii > 0; --ii)
Only the block count of the first operand is used to iterate through the bitsets.
If the size of the first bitset is larger, it would access the second bitset out of bounds.
You can define and use your own comparator with std::set and handle the comparison of differently sized bitsets as you see fit:
struct my_less {
    bool operator()(const boost::dynamic_bitset<>& lhs,
                    const boost::dynamic_bitset<>& rhs) const
    {
        //TODO: implement custom comparison for lhs < rhs
        return false;
    }
};
typedef boost::dynamic_bitset<> bitset;
std::set<bitset,my_less> myset;
myset.insert( bitset(2, 0) );
myset.insert( bitset(3, 1) );
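One possible way to fill in the TODO, sketched under the assumption from the question that the shorter bitset should be treated as if its missing high bits were zero. To keep the sketch free of external dependencies, the comparator is a template over any type with size() and a const operator[] returning a bool-convertible bit at position i (bit 0 being the least significant), which boost::dynamic_bitset<> provides; std::vector<bool> serves as a stand-in for trying it out. The name zero_extended_less is hypothetical.

```cpp
#include <algorithm>
#include <cstddef>
#include <set>
#include <vector>

// Compares two bit sequences as unsigned numbers, treating the shorter one as
// zero-extended. Sequences with equal numeric value are ordered by length, so
// that e.g. "01" and "001" remain distinct entries in a std::set.
struct zero_extended_less {
    template <typename Bits>
    bool operator()(const Bits& lhs, const Bits& rhs) const {
        const std::size_t n = std::max<std::size_t>(lhs.size(), rhs.size());
        for (std::size_t i = n; i-- > 0;) {
            const bool l = i < lhs.size() && lhs[i];
            const bool r = i < rhs.size() && rhs[i];
            if (l != r) return r; // the most significant differing bit decides
        }
        return lhs.size() < rhs.size(); // tie-break on length
    }
};

// Stand-in for a set of dynamic bitsets, with index 0 as the least significant bit.
typedef std::set<std::vector<bool>, zero_extended_less> BitsetSet;
```

With boost available, std::set<boost::dynamic_bitset<>, zero_extended_less> should work the same way. Dropping the length tie-break (returning false instead) would make equal-valued bitsets of different lengths equivalent, so the set would keep only one of them.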