If we ignore the aspects of bad practice, unjustified need, and probably others: what are the risks of this program (run-time: crash, undefined behavior, segmentation fault; implementation-defined behavior: wrong address generation), as long as the address remains in the interval [INT_MIN, INT_MAX]?
#include <iostream>
#include <sstream>
#include <string>
using namespace std;

#define TAB_SIZE 2

void UseIntAsAdress (unsigned int i)
{
    int *pTab = (int*) i;
    for (int i = 0; i < TAB_SIZE; i++)
        cout << "tab [" << i << "] = " << pTab[i] << endl;
}

int main()
{
    int *pTab = new int [TAB_SIZE];
    for (int i = 0; i < TAB_SIZE; i++)
        pTab[i] = i;
    std::stringstream streamAdr;
    streamAdr << pTab;
    std::string name = streamAdr.str();
    unsigned int i = stoi(name.c_str(), 0, 16);
    UseIntAsAdress(i);
    delete [] pTab;
    return 0;
}
Your program has implementation-defined behavior. Both the result of streamAdr << pTab; and the result of (int*) i are implementation-defined.
So you need to look at the documentation of your particular compiler to figure out whether this program behaves in the way you expect it to or not.
There is no general guarantee that this will behave correctly.
The cast from pointer to integer can be done much simpler as well:
reinterpret_cast<std::intptr_t>(pTab)
This is assuming your implementation supports std::intptr_t. Otherwise (in particular pre-C++11) you can try one of the standard integer types. Compilation should fail if the type used is too small to hold the pointer values; otherwise it will work the same as std::intptr_t.
If then the value resulting from this cast isn't narrowed by conversion to int, the result of casting back to int* will behave as expected (i.e. you get a pointer to the first element of the array back), otherwise it will still have implementation-defined behavior.
I have a large array that I need to sort on the GPU. The array itself is a concatenation of multiple smaller subarrays that satisfy the condition that, given i < j, the elements of subarray i are smaller than the elements of subarray j. An example of such an array would be {5, 3, 4, 2, 1, 6, 9, 8, 7, 10, 11},
where the elements of the first subarray of 5 elements are smaller than the elements of the second subarray of 6 elements. The array I need is {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11}. I know the position where each subarray starts in the large array.
I know I can simply use thrust::sort on the whole array, but I was wondering if it's possible to launch multiple concurrent sorts, one for each subarray. I'm hoping to get a performance improvement by doing that. My assumption is that it would be faster to sort multiple smaller arrays than one large array with all the elements.
I'd appreciate if someone could give me a way to do that or correct my assumption in case it's wrong.
A way to do multiple concurrent sorts (a "vectorized" sort) in thrust is to mark the sub-arrays with keys and provide a custom functor: an ordinary thrust sort functor that additionally orders the sub-arrays by their key.
Another possible method is to use back-to-back thrust::stable_sort_by_key as described here.
As you have pointed out, another method in your case is just to do an ordinary sort, since that is ultimately your objective.
However I think it's unlikely that any of the thrust segmented sort methods will give a significant speed-up over a pure sort, although you can try it. Thrust has a fast-path radix sort which it will use in certain situations, and which the pure sort method can probably use in your case. (In other cases, e.g. when you provide a custom functor, thrust will often use a slower merge-sort method.)
If the sizes of the sub arrays are within certain ranges, I think you're likely to get much better results (performance-wise) with block radix sort in cub, one block per sub-array.
Here is an example that uses specific sizes (since you've given no indication of size ranges and other details), comparing a thrust "pure sort" to a thrust segmented sort with functor, to the cub block sort method. For this particular case, the cub sort is fastest:
$ cat t1.cu
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
#include <thrust/sort.h>
#include <thrust/scan.h>
#include <thrust/equal.h>
#include <cstdlib>
#include <iostream>
#include <time.h>
#include <sys/time.h>

#define USECPSEC 1000000ULL

const int num_blocks = 2048;
const int items_per  = 4;
const int nTPB       = 512;
const int block_size = items_per*nTPB; // must be a whole-number multiple of nTPB
typedef float mt;

unsigned long long dtime_usec(unsigned long long start){
    timeval tv;
    gettimeofday(&tv, 0);
    return ((tv.tv_sec*USECPSEC)+tv.tv_usec)-start;
}

struct my_sort_functor
{
    template <typename T, typename T2>
    __host__ __device__
    bool operator()(T t1, T2 t2){
        // order by segment key first, then by value (a strict weak ordering)
        if (thrust::get<1>(t1) < thrust::get<1>(t2)) return true;
        if (thrust::get<1>(t1) > thrust::get<1>(t2)) return false;
        return thrust::get<0>(t1) < thrust::get<0>(t2);}
};

// from: https://nvlabs.github.io/cub/example_block_radix_sort_8cu-example.html#_a0
#define CUB_STDERR
#include <stdio.h>
#include <iostream>
#include <algorithm>
#include <cub/block/block_load.cuh>
#include <cub/block/block_store.cuh>
#include <cub/block/block_radix_sort.cuh>
using namespace cub;

//---------------------------------------------------------------------
// Globals, constants and typedefs
//---------------------------------------------------------------------
bool g_verbose = false;
bool g_uniform_keys;

//---------------------------------------------------------------------
// Kernels
//---------------------------------------------------------------------
template <
    typename Key,
    int      BLOCK_THREADS,
    int      ITEMS_PER_THREAD>
__launch_bounds__ (BLOCK_THREADS)
__global__ void BlockSortKernel(
    Key *d_in,   // Tile of input
    Key *d_out)  // Tile of output
{
    enum { TILE_SIZE = BLOCK_THREADS * ITEMS_PER_THREAD };
    // Specialize BlockLoad type for our thread block (uses warp-striped loads for coalescing, then transposes in shared memory to a blocked arrangement)
    typedef BlockLoad<Key, BLOCK_THREADS, ITEMS_PER_THREAD, BLOCK_LOAD_WARP_TRANSPOSE> BlockLoadT;
    // Specialize BlockRadixSort type for our thread block
    typedef BlockRadixSort<Key, BLOCK_THREADS, ITEMS_PER_THREAD> BlockRadixSortT;
    // Shared memory
    __shared__ union TempStorage
    {
        typename BlockLoadT::TempStorage      load;
        typename BlockRadixSortT::TempStorage sort;
    } temp_storage;
    // Per-thread tile items
    Key items[ITEMS_PER_THREAD];
    // Our current block's offset
    int block_offset = blockIdx.x * TILE_SIZE;
    // Load items into a blocked arrangement
    BlockLoadT(temp_storage.load).Load(d_in + block_offset, items);
    // Barrier for smem reuse
    __syncthreads();
    // Sort keys
    BlockRadixSortT(temp_storage.sort).SortBlockedToStriped(items);
    // Store output in striped fashion
    StoreDirectStriped<BLOCK_THREADS>(threadIdx.x, d_out + block_offset, items);
}

int main(){
    const int ds = num_blocks*block_size;
    thrust::host_vector<mt>  data(ds);
    thrust::host_vector<int> keys(ds);
    for (int i = block_size; i < ds; i+=block_size) keys[i] = 1; // mark beginning of blocks
    thrust::device_vector<int> d_keys = keys;
    for (int i = 0; i < ds; i++) data[i] = (rand()%block_size) + (i/block_size)*block_size; // populate data
    thrust::device_vector<mt> d_data = data;
    thrust::inclusive_scan(d_keys.begin(), d_keys.end(), d_keys.begin()); // fill out keys array 000111222...
    thrust::device_vector<mt> d1 = d_data; // make a copy of unsorted data
    cudaDeviceSynchronize();
    unsigned long long os = dtime_usec(0);
    thrust::sort(d1.begin(), d1.end()); // ordinary sort
    cudaDeviceSynchronize();
    os = dtime_usec(os);
    thrust::device_vector<mt> d2 = d_data; // make a copy of unsorted data
    cudaDeviceSynchronize();
    unsigned long long ss = dtime_usec(0);
    thrust::sort(thrust::make_zip_iterator(thrust::make_tuple(d2.begin(), d_keys.begin())), thrust::make_zip_iterator(thrust::make_tuple(d2.end(), d_keys.end())), my_sort_functor());
    cudaDeviceSynchronize();
    ss = dtime_usec(ss);
    if (!thrust::equal(d1.begin(), d1.end(), d2.begin())) {std::cout << "oops1" << std::endl; return 0;}
    std::cout << "ordinary thrust sort: " << os/(float)USECPSEC << "s " << "segmented sort: " << ss/(float)USECPSEC << "s" << std::endl;
    thrust::device_vector<mt> d3(ds);
    cudaDeviceSynchronize();
    unsigned long long cs = dtime_usec(0);
    BlockSortKernel<mt, nTPB, items_per><<<num_blocks, nTPB>>>(thrust::raw_pointer_cast(d_data.data()), thrust::raw_pointer_cast(d3.data()));
    cudaDeviceSynchronize();
    cs = dtime_usec(cs);
    if (!thrust::equal(d1.begin(), d1.end(), d3.begin())) {std::cout << "oops2" << std::endl; return 0;}
    std::cout << "cub sort: " << cs/(float)USECPSEC << "s" << std::endl;
}
$ nvcc -o t1 t1.cu
$ ./t1
ordinary thrust sort: 0.001652s segmented sort: 0.00263s
cub sort: 0.000265s
$
(CUDA 10.2.89, Tesla V100, Ubuntu 18.04)
I have no doubt that your sizes and array dimensions don't correspond to mine. The purpose here is to illustrate some possible methods, not a black-box solution that works for your particular case. You probably should do benchmark comparisons of your own. I also acknowledge that the block radix sort method for cub expects equal-sized sub-arrays, which you may not have. It may not be a suitable method for you, or you may wish to explore some kind of padding arrangement. There's no need to ask this question of me; I won't be able to answer it based on the information in your question.
I don't claim correctness for this code or any other code that I post. Anyone using any code I post does so at their own risk. I merely claim that I have attempted to address the questions in the original posting, and provide some explanation thereof. I am not claiming my code is defect-free, or that it is suitable for any particular purpose. Use it (or not) at your own risk.
I just finished my first exam session, and passed (thanks to you).
I have one more question for you: I have to find the max of an array of structs and then printf the element of the array that has the max value in it, using a recursive algorithm. I've been smashing my head on the keyboard for about a week trying to solve this, but I can't seem to do it. Can you help me?
Here's the code (please don't mind the strcpy calls):
#include <stdio.h>
#include <stdlib.h>

typedef struct
{
    char  autori[100];
    char  titolo[100];
    int   anno;
    int   codice;
    float prezzo;
    char  stato[50];
} libro;

int   massimo(int m, int n);
libro ricorsivo(libro a[], int len);

void main()
{
    libro max;
    int len = 30;
    libro elenco[100];
    strcpy(elenco[0].autori, "Angelo Ciaramella e Giulio Giunta");
    strcpy(elenco[0].titolo, "Manuale di Programmazione in C");
    elenco[0].anno = 2009;
    elenco[0].codice = 0;
    elenco[0].prezzo = 0.0;
    strcpy(elenco[0].stato, "Disponibile");
    max = ricorsivo(elenco, len);
    printf("il massimo vale %d", max);
}

libro ricorsivo(libro a[], int len)
{
    if (len == 1)
        return a[0];
    else
        return massimo(a.prezzo[len-1], ricorsivo(a, len-1));
}

int massimo(int m, int n)
{
    if (n > m)
        return n;
    else if (m > n)
        return m;
}
The algorithm is incomplete, I know, but the most problematic parts are those functions. I hope you can help me, thank you.
Here are some hints that should help you fix the code:
Firstly, your massimo (max) function is defined incorrectly. In the case where m == n this function returns nothing, which is not allowed. What you want is: if n > m return n, otherwise simply return m, i.e.
int max(int m, int n) { return n > m ? n : m; }
Next, in your recursive function you have a few type errors in this line:
return massimo(a.prezzo[len-1], ricorsivo(a, len-1))
a is not of type libro, it is of type libro[] so it is not going to have a prezzo field.
If a were of type libro, and you accessed its prezzo field, then that has type float, and so it would be incorrect to perform an array index on it.
If a.prezzo[len-1] did produce the prezzo value of the len-1th element of the array, then that would have type float, but your massimo function accepts only ints.
ricorsivo(a, len-1) returns a libro and it is being passed into massimo which takes an int.
To fix these issues try the following:
Remove your massimo function, you don't need it.
In the recursive case of your recursive function
Call it recursively to get the max from the rest of the array.
Compare the prezzo value from the max of the rest of the array, with the prezzo value from the element at len-1.
Return the libro with the larger prezzo.
You should be able to translate the above into some pretty straightforward C code.
[ The Challenge is Over ]
Problem:
Given an array of positive elements, Deepu wants to reduce the elements of the array. He calls a function Hit(X) which reduces all the elements in the array that are greater than X by 1. He will call this function many times. Print the array after all the calls to Hit(X).
Input:
n ----- number of elements in the array, up to 10^5.
n elements ----- 1 <= element <= 10^9.
x ----- number of calls to Hit(X); x elements ----- 1 <= element <= 10^9.
Output:
Print the array after calling Hit(X) x times.
Time limit: 5 secs.
My solution gave Time Limit Exceeded.
My approach:
1. Keep the original array.
2. Create a vector of pairs of array elements and their index in the array, and sort the vector elements in ascending order.
3. Use lower_bound() from the C++ STL to find the position in the vector of the first element equal to the given element x.
4. From that position to the end of the vector, decrease by 1 every element greater than x, and decrease the corresponding element of the original array using the index stored in the pair.
5. Repeat steps 3 & 4 for every x.
6. Print the original array.
I think my solution has complexity O(n^2).
Can someone give me an optimized solution?
Thanks.
My code:
#define _CRT_DISABLE_PERFCRIT_LOCKS
// lower_bound/upper_bound example
#include <iostream>     // std::cout
#include <algorithm>    // std::lower_bound, std::upper_bound, std::sort
#include <vector>       // std::vector
#include <utility>
using namespace std;

bool pairCompare(const std::pair<long long int, unsigned int>& firstElem, const std::pair<long long int, unsigned int>& secondElem) {
    return firstElem.first < secondElem.first;
}

int main() {
    ios_base::sync_with_stdio(false);
    cin.tie(NULL);
    unsigned int n, m;
    long long int arr[100000], x, temp;
    vector<pair<long long int, unsigned int> > vect(100000);
    cin >> n;
    for (unsigned int i = 0; i < n; i++)
    {
        cin >> temp;
        arr[i] = temp;
        vect[i].first = temp;
        vect[i].second = i;
    }
    sort(vect.begin(), vect.begin() + n, pairCompare);
    cin >> m;
    vector<pair<long long int, unsigned int> >::iterator low;
    while (m--)
    {
        cin >> x;
        low = lower_bound(vect.begin(), vect.begin() + n, make_pair(x, 2), pairCompare);
        if (low != vect.begin() + n)
        {
            for (unsigned int i = low - vect.begin(); i < n; i++)
            {
                if (vect[i].first != x)
                {
                    vect[i].first -= 1;
                    arr[vect[i].second] -= 1;
                }
            }
        }
    }
    for (unsigned int i = 0; i < n; i++)
    {
        cout << arr[i] << " ";
    }
    return 0;
}
First sort the input array in non-decreasing order. The input array will remain sorted after each update operation, because we only decrement elements greater than x; the worst that can happen is that some elements become equal to x after the operation, so the array is still sorted.
You can update a range quickly by using a lazy segment tree update. You have to remember the original positions so that you can print the array at the end.
This is my code in C++ (I have used C++11). <chrono> is used to measure time in microseconds. My merge sort takes about 24 seconds to sort a randomly generated number array of size 10 million. But when I refer to my friends' results, they got about 3 seconds. My code seems correct; the difference between mine and theirs is that they used the clock in <ctime> to measure time instead of <chrono>. Will this affect the deviation of my result? Please answer!
This is my code:
#include <iostream>
#include <climits>
#include <cstdlib>
#include <ctime>
#include <chrono>
using namespace std;

void merge_sort(long* inputArray, long low, long high);
void merge(long* inputArray, long low, long mid, long high);

int main(){
    srand(time(NULL));
    int n = 1000;
    long* inputArray = new long[n];
    for (long i = 0; i < n; i++){   // initialize the array of size n with n random numbers
        inputArray[i] = rand();     // generate a random number
    }
    auto Start = std::chrono::high_resolution_clock::now();   // set the start time for merge_sort
    merge_sort(inputArray, 0, n-1);   // sort the array of size n (high is inclusive, so n-1, not n)
    auto End = std::chrono::high_resolution_clock::now();     // set the end time for merge_sort
    cout << endl << endl << "Time taken for Merge Sort = "
         << std::chrono::duration_cast<std::chrono::microseconds>(End-Start).count()
         << " microseconds";   // display the time taken for merge sort
    delete [] inputArray;
    return 0;
}

void merge_sort(long* inputArray, long low, long high){
    if (low < high){
        long mid = (low + high) / 2;
        merge_sort(inputArray, low, mid);
        merge_sort(inputArray, mid+1, high);
        merge(inputArray, low, mid, high);
    }
    return;
}

void merge(long* inputArray, long low, long mid, long high){
    long n1 = mid - low + 1;
    long n2 = high - mid;
    long *L = new long[n1+1];
    long *R = new long[n2+1];
    for (long i = 0; i < n1; i++){
        L[i] = inputArray[low+i];
    }
    for (long j = 0; j < n2; j++){
        R[j] = inputArray[mid+j+1];
    }
    L[n1] = LONG_MAX;   // sentinels so neither run is exhausted before the merge loop ends
    R[n2] = LONG_MAX;
    long i = 0;
    long j = 0;
    for (long k = low; k <= high; k++){
        if (L[i] <= R[j]){
            inputArray[k] = L[i];
            i = i + 1;
        }
        else{
            inputArray[k] = R[j];
            j = j + 1;
        }
    }
    delete[] L;
    delete[] R;
}
The two time measurements themselves cannot possibly account for 20 seconds.
As others pointed out, results are heavily dependent on platform and compiler optimization (a debug build can be much slower than a release build), and so on.
If you have the same setup as your friends and still have a performance gap, you may want to use a profiler to see where your code is spending its time. You can use a profiler if you are on Linux; otherwise Visual Studio on Windows is a good candidate.
What is the best algorithm to extract the unique words from a list of more than 10 million words? We need the best technique in terms of execution time.
There are two simple approaches I remember using:
Add all the items to a data structure that folds duplicates (generally a hash, but you can also try a balanced tree or a trie).
Sort the list, then run over it copying out all elements that are non-equal to the previous element.
Roughly speaking, and subject to the usual fudges, the hash table and the trie give you expected O(n), the balanced tree and the sort give you expected O(n log n). It is not necessarily true that the O(n) solutions are faster than the O(n log n) solutions for your particular data.
All the options in (1) may have the disadvantage of doing a lot of small memory allocations for nodes in a data structure, which can be slow unless you use a special-purpose allocator. So in my experience it's worth testing the sort on the size of data you actually care about, before embarking on anything that requires you to write significant code.
Depending what language you're using, some of these approaches might be easier to test than others. For example in Python if you have a list of strings then the hashtable approach is just set(my_strings). In C, there is no standard hashtable, so you're either writing one or looking for a library.
Of course ease of writing has no direct effect on execution time, so if (as you claim) your programmer time is immaterial and all that matters is execution speed, then you should have no problems spending a few weeks getting familiar with the best available literature on sorting and hash tables. You'd be far better able to answer the question than I am.
Just add them to a hash table: constant-time insert. I don't believe you can do better than O(n). Red-black trees can be faster on small data sets (it can be faster to traverse the tree than to compute the hash), but your data set is large.
Spoiler:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

struct somehash {
    struct somehash *next;
    unsigned hash;
    char *mem;
};

#define THE_SIZE (10*1000*1000)

struct somehash *table[THE_SIZE] = { NULL, };

struct somehash **some_find(char *str, unsigned len);
static unsigned some_hash(char *str, unsigned len);

int main (void)
{
    char buffer[100];
    struct somehash **pp;
    size_t len;

    while (fgets(buffer, sizeof buffer, stdin)) {
        len = strlen(buffer);
        pp = some_find(buffer, len);
        if (*pp) { /* found */
            fprintf(stderr, "Duplicate:%s", buffer);
        }
        else { /* not found: create one */
            fprintf(stdout, "%s", buffer);
            *pp = malloc(sizeof **pp);
            (*pp)->next = NULL;
            (*pp)->hash = some_hash(buffer, len);
            (*pp)->mem = malloc(1+len);
            memcpy((*pp)->mem, buffer, 1+len);
        }
    }
    return 0;
}

struct somehash **some_find(char *str, unsigned len)
{
    unsigned hash;
    size_t slot;   /* was unsigned short, which truncated the slot index */
    struct somehash **hnd;

    hash = some_hash(str, len);
    slot = hash % THE_SIZE;
    for (hnd = &table[slot]; *hnd; hnd = &(*hnd)->next) {
        if ((*hnd)->hash != hash) continue;
        if (strcmp((*hnd)->mem, str)) continue;
        break;
    }
    return hnd;
}

static unsigned some_hash(char *str, unsigned len)
{
    unsigned val;
    unsigned idx;

    if (!len) len = strlen(str);
    val = 0;
    for (idx = 0; idx < len; idx++) {
        val ^= (val >> 2) ^ (val << 5) ^ (val << 13) ^ str[idx] ^ 0x80001801;
    }
    return val;
}