The output of the below code is 16. Why so? even without initializing with the length of array of the class the size is 16 and with initializing the length with the 2nd constructor, it is the same size i.e. 16. Any explanation?
#include <iostream>
#include <string>
using namespace std;
template <class T>
class array1{
T * arr;
int l;
public:
array1(){
arr = 0; l=0;
}
array1(int x){
l = x;
arr = new T[l];
}
~array1(){
delete[] arr;
}
void display(){
cout << "array1 is :"<<endl;
for (int i=0; i<l; i++)
cout << arr[i] << " ";
cout << endl;
}
};
int main()
{
array1<int> a1;
cout << "size of int arr is " << sizeof(a1);
return 0;
}
It is because of Data Structure alignment. In your system it is being aligned to 8 bytes word. Print the sizeof(T*) and sizeof(int) it will output 8 and 4 respectively in the constructor of array class. But when output together it takes 16 bytes.
It appears your integer and pointer types are both 8 bytes/64 bits. Also, just a heads up, sizeof is a compile-time operator, meaning that even if an object of your type allocates memory on the heap with the operator new[], sizeof will still return 16 bytes for any object of type array1.
Additionally, no matter what type you have for T, sizeof(array1<T>) will always be 16 bytes. (I am making the assumption that you are not compiling on an atypical target.)
Your class has two member variables. A pointer and an int. I suspect that both are eight bytes in size for your platform.
If so, then: 8 + 8 = 16.
(Your class has no virtual methods -- so no vtable overhead).
Related
I have a class named Foo that is privately nothing more than 4-byte int. If I return its value as an 8-byte size_t, am I going to be screwing up unordered_map<> or anything else? I could fill all bits with something like return foo + foo << 32;. Would that be better, or would it be worse as all hashes are now multiples of 0x100000001? Or how about return ~foo + foo << 32; which would use all 64 bits and also not have a common factor?
namespace std {
template<> struct hash<MyNamespace::Foo> {
typedef size_t result_type;
typedef MyNamespace::Foo argument_tupe;
size_t operator() (const MyNamespace::Foo& f ) const { return (size_t) f.u32InternalValue; }
};
}
An incremental uint32_t key converted to uint64_t works well
unordered_map will reserve space for the hash-table incrementally.
The less significant bits of the key is used to determine the bucket position, in an example for 4 entries/buckets, the less significant 2 bits are used.
Elements with a key giving the same bucket (multiple of the number of buckets) are chained in a linked list. This carry the concept of load-factor.
// 4 Buckets example
******** ******** ******** ******** ******** ******** ******** ******XX
bucket 00 would contains keys like {0, 256, 200000 ...}
bucket 01 would contains keys like {1, 513, 4008001 ...}
bucket 10 would contains keys like {2, 130, 10002 ...}
bucket 11 would contains keys like {3, 259, 1027, 20003, ...}
If you try to save an additional values in a bucket, and it load factor goes over the limit, the table is resized (e.g. you try to save a 5th element in a 4-bucket table with load_factor=1.0).
Consequently:
Having a uint32_t or a uint64_t key will have little impact until you reach 2^32-elements hash-table.
Would that be better, or would it be worse as all hashes are now multiples of 0x100000001?
This will have no impact until you reach 32-bits overflow (2^32) hash-table.
Good key conversion between incremental uint32_t and uint64_t:
key64 = static_cast<uint64>(key32);
Bad key conversion between incremental uint32_t and uint64_t:
key64 = static_cast<uint64>(key32)<<32;
The best is to keep the keys as even as possible, avoiding hashes with the same factor again and again. E.g. in the code below, keys with all factor 7 would have collision until resized to 16 buckets.
https://onlinegdb.com/r1N7TNySv
#include <iostream>
#include <unordered_map>
using namespace std;
// Print to std output the internal structure of an unordered_map.
template <typename K, typename T>
void printMapStruct(unordered_map<K, T>& map)
{
cout << "The map has " << map.bucket_count()<<
" buckets and max load factor: " << map.max_load_factor() << endl;
for (size_t i=0; i< map.bucket_count(); ++i)
{
cout << " Bucket " << i << ": ";
for (auto it=map.begin(i); it!=map.end(i); ++it)
{
cout << it->first << " ";
}
cout << endl;
}
cout << endl;
}
// Print the list of bucket sizes by this implementation
void printMapResizes()
{
cout << "Map bucket counts:"<< endl;
unordered_map<size_t, size_t> map;
size_t lastBucketSize=0;
for (size_t i=0; i<1024*1024; ++i)
{
if (lastBucketSize!=map.bucket_count())
{
cout << map.bucket_count() << " ";
lastBucketSize = map.bucket_count();
}
map.emplace(i,i);
}
cout << endl;
}
int main()
{
unordered_map<size_t,size_t> map;
printMapStruct(map);
map.emplace(0,0);
map.emplace(1,1);
printMapStruct(map);
map.emplace(72,72);
map.emplace(17,17);
printMapStruct(map);
map.emplace(7,7);
map.emplace(14,14);
printMapStruct(map);
printMapResizes();
return 0;
}
Note over the bucket count:
In the above example, the bucket count is as follow:
1 3 7 17 37 79 167 337 709 1493 3209 6427 12983 26267 53201 107897 218971 444487 902483 1832561
This seems to purposely follow a series of prime numbers (minimizing collisions). I am not aware of the function behind.
std::unordered_map<> bucket_count() after default rehash
I' did this program what suppose save pairs of string ,int on one vector and print the strings of the maximum number on vector
but when i try to find this strings don't appears nothing so I try print all values of int's on vector and although was finding the maximum of 10 all values in the vector was printing as 0. Someone can explain was it occurred and how I can access the values , please.
#include <iostream>
#include <utility>
#include <vector>
#include <string>
#include <algorithm>
using namespace std;
typedef vector<pair<string,int>> vsi;
bool paircmp(const pair<string,int>& firste,const pair<string,int>& seconde );
int main(int argc, char const *argv[]) {
vsi v(10);
string s;
int n,t;
cin>>t;
for (size_t i = 0;i < t;i++) {
for (size_t j = 0; j < 10; j++) {
cin>>s>>n;
v.push_back(make_pair(s,n));
}
sort(v.begin(),v.end(),paircmp);
int ma=v[v.size()-1].second;
cout<<ma<<endl;
for (size_t j = 0; j < 10; j++) {
cout << v.at(j).second <<endl;
if(v[j].second == ma)
cout<<v[j].first<<endl;
}
}
return 0;
}
bool paircmp(const pair<string,int>& firste,const pair<string,int>& seconde ){
return firste.second < seconde.second;
}
This line
vsi v(10);
creates you a std::vector filled with 10 default-constructed std::pair<std::string, int>s. That is, an empty string and zero.
You then push_back other values to your vector but they happen to be sorted after those ten initial elements, probably because they all have positive ints in them.
Therefore, printing the first member of the first ten elements prints ten empty strings.
This is all I can guess from what you have provided. I don't know what you are trying to accomplish with this code.
Try something like
for (const auto& item : v)
{
std::cout << "{ first: '" << item.first << "', "
<< "second: " << item.second << " }\n";
}
to print all elements of the vector v.
I am a lazy programmer. I want to use C++ vector to create a multidimensional array. For example, this code create a 3x2 2D array:
int nR = 3;
int nC = 2;
vector<vector<double> > array2D(nR);
for(int c = 0; c < nC; c++)
array2D.resize(nC, 0);
However, I am too lazy to
declare array2D's data type: vector<vector<double> >
C++ auto could solve this problem.
However, I am too lazy to
write loop(s) to allocate the space(s) for each object like array2D.
Writing a function could solve this problem.
However, I am too lazy to
write each function for each N-dimensional array.
write nested N-1 loops for allocating spaces.
wirte each function for each data type.
The C++11 variadic template with function recursion could solve this problem.
Is it possible ...?
This is what you want. (Tested on Microsoft Visual C++ 2013 Update 1)
#include <iostream>
#include <vector>
using namespace std;
template<class elemType> inline vector<elemType> getArrayND(int dim) {
// Allocate space and initialize all elements to 0s.
return vector<elemType>(dim, 0);
}
template<class elemType, class... Dims> inline auto getArrayND(
int dim, Dims... resDims
) -> vector<decltype(getArrayND<elemType>(resDims...))> {
// Allocate space for this dimension.
auto parent = vector<decltype(getArrayND<elemType>(resDims...))>(dim);
// Recursive to next dimension.
for (int i = 0; i < dim; i++) {
parent[i] = getArrayND<elemType>(resDims...);
}
return parent;
}
int main() {
auto test3D = getArrayND<double>(2, 3, 4);
auto test4D = getArrayND<double>(2, 3, 4, 2);
test3D[0][0][1] = 3;
test4D[1][2][3][1] = 5;
cout << test3D[0][0][1] << endl;
cout << test4D[1][2][3][1] << endl;
return 0;
}
Suppose we have;
struct collapsed {
char **seq;
int num;
};
...
__device__ *collapsed xdev;
...
collapsed *x_dev
cudaGetSymbolAddress((void **)&x_dev, xdev);
cudaMemcpyToSymbol(x_dev, x, sizeof(collapsed)*size); //x already defined collapsed * , this line gives ERROR
Whay do you think I am getting error at the last line : invalid device symbol ??
The first problem here is that x_dev isn't a device symbol. It might contain an address in a device memory, but that address cannot be passed to cudaMemcpyToSymbol. The call should just be:
cudaMemcpyToSymbol(xdev, ......);
Which brings up the second problem. Doing this:
cudaMemcpyToSymbol(xdev, x, sizeof(collapsed)*size);
would be illegal. xdev is a pointer, so the only valid value you can copy to xdev is a device address. If x is the address of a struct collapsed in device memory, then the only valid version of this memory transfer operation is
cudaMemcpyToSymbol(xdev, &x, sizeof(collapsed *));
ie. x must have previously have been set to the address of memory allocated in the device, something like
collapsed *x;
cudaMalloc((void **)&x, sizeof(collapsed)*size);
cudaMemcpy(x, host_src, sizeof(collapsed)*size, cudaMemcpyHostToDevice);
As promised, here is a complete working example. First the code:
#include <cstdlib>
#include <iostream>
#include <cuda_runtime.h>
struct collapsed {
char **seq;
int num;
};
__device__ collapsed xdev;
__global__
void kernel(const size_t item_sz)
{
if (threadIdx.x < xdev.num) {
char *p = xdev.seq[threadIdx.x];
char val = 0x30 + threadIdx.x;
for(size_t i=0; i<item_sz; i++) {
p[i] = val;
}
}
}
#define gpuQ(ans) { gpu_assert((ans), __FILE__, __LINE__); }
void gpu_assert(cudaError_t code, const char *file, const int line)
{
if (code != cudaSuccess)
{
std::cerr << "gpu_assert: " << cudaGetErrorString(code) << " "
<< file << " " << line << std::endl;
exit(code);
}
}
int main(void)
{
const int nitems = 32;
const size_t item_sz = 16;
const size_t buf_sz = size_t(nitems) * item_sz;
// Gpu memory for sequences
char *_buf;
gpuQ( cudaMalloc((void **)&_buf, buf_sz) );
gpuQ( cudaMemset(_buf, 0x7a, buf_sz) );
// Host array for holding sequence device pointers
char **seq = new char*[nitems];
size_t offset = 0;
for(int i=0; i<nitems; i++, offset += item_sz) {
seq[i] = _buf + offset;
}
// Device array holding sequence pointers
char **_seq;
size_t seq_sz = sizeof(char*) * size_t(nitems);
gpuQ( cudaMalloc((void **)&_seq, seq_sz) );
gpuQ( cudaMemcpy(_seq, seq, seq_sz, cudaMemcpyHostToDevice) );
// Host copy of the xdev structure to copy to the device
collapsed xdev_host;
xdev_host.num = nitems;
xdev_host.seq = _seq;
// Copy to device symbol
gpuQ( cudaMemcpyToSymbol(xdev, &xdev_host, sizeof(collapsed)) );
// Run Kernel
kernel<<<1,nitems>>>(item_sz);
// Copy back buffer
char *buf = new char[buf_sz];
gpuQ( cudaMemcpy(buf, _buf, buf_sz, cudaMemcpyDeviceToHost) );
// Print out seq values
// Each string should be ASCII starting from ´0´ (0x30)
char *seq_vals = buf;
for(int i=0; i<nitems; i++, seq_vals += item_sz) {
std::string s;
s.append(seq_vals, item_sz);
std::cout << s << std::endl;
}
return 0;
}
and here it is compiled and run:
$ /usr/local/cuda/bin/nvcc -arch=sm_12 -Xptxas=-v -g -G -o erogol erogol.cu
./erogol.cu(19): Warning: Cannot tell what pointer points to, assuming global memory space
ptxas info : 8 bytes gmem, 4 bytes cmem[14]
ptxas info : Compiling entry function '_Z6kernelm' for 'sm_12'
ptxas info : Used 5 registers, 20 bytes smem, 4 bytes cmem[1]
$ /usr/local/cuda/bin/cuda-memcheck ./erogol
========= CUDA-MEMCHECK
0000000000000000
1111111111111111
2222222222222222
3333333333333333
4444444444444444
5555555555555555
6666666666666666
7777777777777777
8888888888888888
9999999999999999
::::::::::::::::
;;;;;;;;;;;;;;;;
<<<<<<<<<<<<<<<<
================
>>>>>>>>>>>>>>>>
????????????????
################
AAAAAAAAAAAAAAAA
BBBBBBBBBBBBBBBB
CCCCCCCCCCCCCCCC
DDDDDDDDDDDDDDDD
EEEEEEEEEEEEEEEE
FFFFFFFFFFFFFFFF
GGGGGGGGGGGGGGGG
HHHHHHHHHHHHHHHH
IIIIIIIIIIIIIIII
JJJJJJJJJJJJJJJJ
KKKKKKKKKKKKKKKK
LLLLLLLLLLLLLLLL
MMMMMMMMMMMMMMMM
NNNNNNNNNNNNNNNN
OOOOOOOOOOOOOOOO
========= ERROR SUMMARY: 0 errors
Some notes:
To simplify things a bit, I have only used a single memory allocation _buf to hold all of the string data. Each value of seq is set to a different address within _buf. This is functionally equivalent to running a separate cudaMalloc call for each pointer, but much faster.
The key concept is to assemble a copy of the structure you wish to access on the device in host memory, then copy that to the device. All of the pointers in my xdev_host are device pointers. The CUDA API doesn't have any sort of deep copy or automatic pointer translation facility, so it is the programmer's responsibility to make sure this is correct.
Each thread in the kernel just fills its sequence with a difference ASCII character. Note that I have declared my xdev as a structure, rather than pointer to structure and copy values rather than a reference to the __device__ symbol (again to simplify things slightly). But otherwise the sequence of operations is what you would need to make your design pattern work.
Because I only have access to a compute 1.x device, the compiler issues a warning. One compute 2.x and 3.x this won't happen because of the improved memory model in those devices. The warning is normal and can be safely ignored.
Because each sequence is just written into a different part of _buf, I can transfer all the sequences back to the host with a single cudaMemcpy call.
I’m looking for a sorting algorithm on CUDA that can sort an array A of elements (double) and returns an array of keys B for that array A.
I know the sort_by_key function in the Thrust library but I want my array of elements A to remain unchanged.
What can I do?
My code is:
void sortCUDA(double V[], int P[], int N) {
real_t *Vcpy = (double*) malloc(N*sizeof(double));
memcpy(Vcpy,V,N*sizeof(double));
thrust::sort_by_key(V, V + N, P);
free(Vcpy);
}
i'm comparing the thrust algorithm against others that i have on sequencial cpu
N mergesort sortCUDA
113 0.000008 0.000010
226 0.000018 0.000016
452 0.000036 0.000020
905 0.000061 0.000034
1810 0.000135 0.000071
3621 0.000297 0.000156
7242 0.000917 0.000338
14484 0.001421 0.000853
28968 0.003069 0.001931
57937 0.006666 0.003939
115874 0.014435 0.008025
231749 0.031059 0.016718
463499 0.067407 0.039848
926999 0.148170 0.118003
1853998 0.329005 0.260837
3707996 0.731768 0.544357
7415992 1.638445 1.073755
14831984 3.668039 2.150179
115035495 39.276560 19.812200
230070990 87.750377 39.762915
460141980 200.940501 74.605219
Thrust performance is not bad, but I think if I use OMP can probably get easily a better CPU time
I think this is because to memcpy
SOLUTION:
void thrustSort(double V[], int P[], int N)
{
thrust::device_vector<int> d_P(N);
thrust::device_vector<double> d_V(V, V + N);
thrust::sequence(d_P.begin(), d_P.end());
thrust::sort_by_key(d_V.begin(), d_V.end(), d_P.begin());
thrust::copy(d_P.begin(),d_P.end(),P);
}
where V is a my double values to sort
You can modify comparison operator to sort keys instead of values. #Robert Crovella correctly pointed that a raw device pointer cannot be assigned from the host. The modified algorithm is below:
struct cmp : public binary_function<int,int,bool>
{
cmp(const double *ptr) : rawA(ptr) { }
__host__ __device__ bool operator()(const int i, const int j) const
{return rawA[i] > rawA[j];}
const double *rawA; // an array in global mem
};
void sortkeys(double *A, int n) {
// move data to the gpu
thrust::device_vector<double> devA(A, A + n);
double *rawA = thrust::raw_pointer_cast(devA.data());
thrust::device_vector<int> B(n);
// initialize keys
thrust::sequence(B.begin(), B.end());
thrust::sort(B.begin(), B.end(), cmp(rawA));
// B now contains the sorted keys
}
And here is alternative with arrayfire. Though I am not sure which one is more efficient since arrayfire solution uses two additional arrays:
void sortkeys(double *A, int n) {
af::array devA(n, A, af::afHost);
af::array vals, indices;
// sort and populate vals/indices arrays
af::sort(vals, indices, devA);
std::cout << devA << "\n" << indices << "\n";
}
How large is this array? The most efficient way, in terms of speed, will likely be to just duplicate the original array before sorting, if the memory is available.
Building on the answer provided by #asm (I wasn't able to get it working), this code seemed to work for me, and does sort only the keys. However, I believe it is limited to the case where the keys are in sequence 0, 1, 2, 3, 4 ... corresponding to the (double) values. Since this is a "index-value" sort, it could be extended to the case of an arbitrary sequence of keys, perhaps by doing an indexed copy. However I'm not sure the process of generating the index sequence and then rearranging the original keys will be any faster than just copying the original value data to a new vector (for the case of arbitrary keys).
#include <iostream>
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
#include <thrust/sort.h>
using namespace std;
__device__ double *rawA; // an array in global mem
struct cmp : public binary_function<int, int, bool>
{
__host__ __device__ bool operator()(const int i, const int j) const
{return ( rawA[i] < rawA[j]);}
};
void sortkeys(double *A, int n) {
// move data to the gpu
thrust::device_vector<double> devA(A, A + n);
// rawA = thrust::raw_pointer_cast(&(devA[0]));
double *test = raw_pointer_cast(devA.data());
cudaMemcpyToSymbol(rawA, &test, sizeof(double *));
thrust::device_vector<int> B(n);
// initialize keys
thrust::sequence(B.begin(), B.end());
thrust::sort(B.begin(), B.end(), cmp());
// B now contains the sorted keys
thrust::host_vector<int> hostB = B;
for (int i=0; i<hostB.size(); i++)
std::cout << hostB[i] << " ";
std::cout<<std::endl;
for (int i=0; i<hostB.size(); i++)
std::cout << A[hostB[i]] << " ";
std::cout<<std::endl;
}
int main(){
double C[] = {0.7, 0.3, 0.4, 0.2, 0.6, 1.2, -0.5, 0.5, 0.0, 10.0};
sortkeys(C, 9);
std::cout << std::endl;
return 0;
}