Best sorting algorithm [closed] - algorithm

What is the best algorithm to extract the unique words from a list of more than 10 million words? We need the best technique in terms of execution time.

There are two simple approaches I remember using:
1. Add all the items to a data structure that folds duplicates (generally a hash, but you can also try a balanced tree or a trie).
2. Sort the list, then run over it copying out all elements that are non-equal to the previous element.
Roughly speaking, and subject to the usual fudges, the hash table and the trie give you expected O(n), the balanced tree and the sort give you expected O(n log n). It is not necessarily true that the O(n) solutions are faster than the O(n log n) solutions for your particular data.
All the options in (1) may have the disadvantage of doing a lot of small memory allocations for nodes in a data structure, which can be slow unless you use a special-purpose allocator. So in my experience it's worth testing the sort on the size of data you actually care about, before embarking on anything that requires you to write significant code.
Depending on what language you're using, some of these approaches might be easier to test than others. For example, in Python, if you have a list of strings then the hashtable approach is just set(my_strings). In C, there is no standard hashtable, so you're either writing one or looking for a library.
Of course ease of writing has no direct effect on execution time, so if (as you claim) your programmer time is immaterial and all that matters is execution speed, then you should have no problems spending a few weeks getting familiar with the best available literature on sorting and hash tables. You'd be far better able to answer the question than I am.
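The two approaches described above can be sketched in C++ roughly as follows. This is a minimal sketch for benchmarking on your own data, not a tuned implementation; the function names unique_by_hash and unique_by_sort are made up for illustration.

```cpp
#include <algorithm>
#include <string>
#include <unordered_set>
#include <vector>

// Approach 1: insert everything into a hash set, which folds duplicates.
// Expected O(n) insertions; the order of the result is unspecified.
std::vector<std::string> unique_by_hash(const std::vector<std::string>& words) {
    std::unordered_set<std::string> seen;
    seen.reserve(words.size());  // avoid rehashing while inserting
    for (const std::string& w : words) seen.insert(w);
    return std::vector<std::string>(seen.begin(), seen.end());
}

// Approach 2: sort, then drop every element equal to its predecessor.
// O(n log n) comparisons; the result comes out sorted.
std::vector<std::string> unique_by_sort(std::vector<std::string> words) {
    std::sort(words.begin(), words.end());
    words.erase(std::unique(words.begin(), words.end()), words.end());
    return words;
}
```

Both functions return the same set of words, so timing them against each other on a realistic input is a cheap way to find out which one wins for your data.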

Just add them to a hash. Constant time insert. I don't believe you can do better than order n. Red black trees can be faster on small data sets (faster to traverse the tree than to compute the hash), but your data set is large.

Spoiler:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
/* One chained hash-table entry: the cached hash value plus a copy of the word. */
struct somehash {
    struct somehash *next;
    unsigned hash;
    char *mem;
};
#define THE_SIZE (10*1000*1000)
struct somehash *table[THE_SIZE] = { NULL,};
struct somehash **some_find(char *str, unsigned len);
static unsigned some_hash(char *str, unsigned len);
int main (void)
{
    char buffer[100];
    struct somehash **pp;
    size_t len;
    while (fgets(buffer, sizeof buffer, stdin)) {
        len = strlen(buffer);
        pp = some_find(buffer, len);
        if (*pp) { /* found */
            fprintf(stderr, "Duplicate:%s", buffer);
        }
        else { /* not found: create one */
            fprintf(stdout, "%s", buffer);
            *pp = malloc(sizeof **pp);
            (*pp)->next = NULL;
            (*pp)->hash = some_hash(buffer,len);
            (*pp)->mem = malloc(1+len);
            memcpy((*pp)->mem , buffer, 1+len);
        }
    }
    return 0;
}
/* Returns the address of the pointer to the entry for str, or the address of
   the NULL pointer at the end of the chain if str is not in the table yet. */
struct somehash **some_find(char *str, unsigned len)
{
    unsigned hash;
    unsigned slot; /* was unsigned short, which cannot index all THE_SIZE slots */
    struct somehash **hnd;
    hash = some_hash(str,len);
    slot = hash % THE_SIZE;
    for (hnd = &table[slot]; *hnd ; hnd = &(*hnd)->next ) {
        if ( (*hnd)->hash != hash) continue;
        if ( strcmp((*hnd)->mem , str) ) continue;
        break;
    }
    return hnd;
}
static unsigned some_hash(char *str, unsigned len)
{
    unsigned val;
    unsigned idx;
    if (!len) len = strlen(str);
    val = 0;
    for(idx=0; idx < len; idx++ ) {
        val ^= (val >> 2) ^ (val << 5) ^ (val << 13) ^ str[idx] ^ 0x80001801;
    }
    return val;
}

Related

Why redis command ‘LLEN’ has constant time complexity instead of O(n)?

I know that a Redis list is implemented as a linked list under the hood. Shouldn't computing the length of the list therefore be O(n)?
You can find the declaration of the list type at https://github.com/redis/redis/blob/unstable/src/adlist.h. If you look at the section around line 50 you find:
typedef struct list {
    listNode *head;
    listNode *tail;
    void *(*dup)(void *ptr);
    void (*free)(void *ptr);
    int (*match)(void *ptr, void *key);
    unsigned long len;
} list;
Note the unsigned long len that stores the length of the list. That is why it is O(1).
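The same idea in a few lines of C++: a minimal sketch of a singly linked list that updates a stored counter on every insertion and removal. CountedList and its members are hypothetical names for illustration, not Redis code.

```cpp
#include <cstddef>
#include <memory>
#include <utility>

// Hypothetical minimal list, not Redis's adlist: it only illustrates keeping
// the length up to date as the list changes.
struct Node {
    int value;
    std::unique_ptr<Node> next;
};

class CountedList {
public:
    void push_front(int value) {
        std::unique_ptr<Node> node(new Node{value, std::move(head_)});
        head_ = std::move(node);
        ++len_;  // bookkeeping done once per insertion
    }
    bool pop_front() {
        if (!head_) return false;
        head_ = std::move(head_->next);
        --len_;
        return true;
    }
    // O(1): just reads the stored counter.
    std::size_t length() const { return len_; }
    // O(n): walks every node, which is what a list without a counter must do.
    std::size_t walk_length() const {
        std::size_t n = 0;
        for (const Node* p = head_.get(); p != nullptr; p = p->next.get()) ++n;
        return n;
    }

private:
    std::unique_ptr<Node> head_;
    std::size_t len_ = 0;
};
```

The cost of counting is paid in small constant increments by each modifying operation, so asking for the length never needs a traversal.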

Sorting multiple arrays using CUDA/Thrust

I have a large array that I need to sort on the GPU. The array itself is a concatenation of multiple smaller subarrays that satisfy the condition that, given i < j, the elements of subarray i are smaller than the elements of subarray j. An example of such an array would be {5 3 4 2 1 6 9 8 7 10 11}, where the elements of the first subarray of 5 elements are smaller than the elements of the second subarray of 6 elements. The array I need is {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11}. I know the position where each subarray starts in the large array.
I know I can simply use thrust::sort on the whole array, but I was wondering if it's possible to launch multiple concurrent sorts, one for each subarray. I'm hoping to get a performance improvement by doing that. My assumption is that it would be faster to sort multiple smaller arrays than one large array with all the elements.
I'd appreciate if someone could give me a way to do that or correct my assumption in case it's wrong.
One way to do multiple concurrent sorts (a "vectorized" sort) in thrust is to mark the sub arrays with keys and provide a custom functor that works like an ordinary thrust sort functor but orders elements by sub-array key first and by value second.
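To make the marking-plus-comparator idea concrete, here is a CPU-side analogue that uses std::sort instead of thrust. It is only a sketch of the ordering logic: segmented_sort and the starts parameter are assumed names, and starts is assumed to hold the increasing start index of every sub array, beginning with 0.

```cpp
#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

// CPU-side illustration only (std::sort, not thrust).
std::vector<int> segmented_sort(const std::vector<int>& data,
                                const std::vector<std::size_t>& starts) {
    // Tag every element with the id of the sub array it belongs to.
    std::vector<std::pair<int, int>> tagged;  // (segment id, value)
    tagged.reserve(data.size());
    std::size_t seg = 0;
    for (std::size_t i = 0; i < data.size(); ++i) {
        while (seg + 1 < starts.size() && i >= starts[seg + 1]) ++seg;
        tagged.emplace_back(static_cast<int>(seg), data[i]);
    }
    // Order by segment id first, then by value. Elements never cross a
    // segment boundary, so each sub array ends up sorted where it is.
    std::sort(tagged.begin(), tagged.end(),
              [](const std::pair<int, int>& a, const std::pair<int, int>& b) {
                  if (a.first != b.first) return a.first < b.first;
                  return a.second < b.second;
              });
    std::vector<int> out;
    out.reserve(tagged.size());
    for (const auto& t : tagged) out.push_back(t.second);
    return out;
}
```

Note that the comparator returns false for equal elements, which is what a strict weak ordering requires.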
Another possible method is to use back-to-back thrust::stable_sort_by_key as described here.
As you have pointed out, another method in your case is just to do an ordinary sort, since that is ultimately your objective.
However, I think it's unlikely that any of the thrust sort methods will give a significant speed-up over a pure sort, although you can try it. Thrust has a fast-path radix sort which it will use in certain situations, which the pure sort method could probably use in your case. (In other cases, e.g. when you provide a custom functor, thrust will often use a slower merge-sort method.)
If the sizes of the sub arrays are within certain ranges, I think you're likely to get much better results (performance-wise) with block radix sort in cub, one block per sub-array.
Here is an example that uses specific sizes (since you've given no indication of size ranges and other details), comparing a thrust "pure sort" to a thrust segmented sort with functor, to the cub block sort method. For this particular case, the cub sort is fastest:
$ cat t1.cu
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
#include <thrust/sort.h>
#include <thrust/scan.h>
#include <thrust/equal.h>
#include <cstdlib>
#include <iostream>
#include <time.h>
#include <sys/time.h>
#define USECPSEC 1000000ULL
const int num_blocks = 2048;
const int items_per = 4;
const int nTPB = 512;
const int block_size = items_per*nTPB; // must be a whole-number multiple of nTPB;
typedef float mt;
unsigned long long dtime_usec(unsigned long long start){
timeval tv;
gettimeofday(&tv, 0);
return ((tv.tv_sec*USECPSEC)+tv.tv_usec)-start;
}
struct my_sort_functor
{
    // Orders by sub-array key first, then by value. Equal elements compare
    // false so that the functor is a strict weak ordering.
    template <typename T, typename T2>
    __host__ __device__
    bool operator()(T t1, T2 t2){
        if (thrust::get<1>(t1) < thrust::get<1>(t2)) return true;
        if (thrust::get<1>(t1) > thrust::get<1>(t2)) return false;
        return thrust::get<0>(t1) < thrust::get<0>(t2);}
};
// from: https://nvlabs.github.io/cub/example_block_radix_sort_8cu-example.html#_a0
#define CUB_STDERR
#include <stdio.h>
#include <iostream>
#include <algorithm>
#include <cub/block/block_load.cuh>
#include <cub/block/block_store.cuh>
#include <cub/block/block_radix_sort.cuh>
using namespace cub;
//---------------------------------------------------------------------
// Globals, constants and typedefs
//---------------------------------------------------------------------
bool g_verbose = false;
bool g_uniform_keys;
//---------------------------------------------------------------------
// Kernels
//---------------------------------------------------------------------
template <
typename Key,
int BLOCK_THREADS,
int ITEMS_PER_THREAD>
__launch_bounds__ (BLOCK_THREADS)
__global__ void BlockSortKernel(
Key *d_in, // Tile of input
Key *d_out) // Tile of output
{
enum { TILE_SIZE = BLOCK_THREADS * ITEMS_PER_THREAD };
// Specialize BlockLoad type for our thread block (uses warp-striped loads for coalescing, then transposes in shared memory to a blocked arrangement)
typedef BlockLoad<Key, BLOCK_THREADS, ITEMS_PER_THREAD, BLOCK_LOAD_WARP_TRANSPOSE> BlockLoadT;
// Specialize BlockRadixSort type for our thread block
typedef BlockRadixSort<Key, BLOCK_THREADS, ITEMS_PER_THREAD> BlockRadixSortT;
// Shared memory
__shared__ union TempStorage
{
typename BlockLoadT::TempStorage load;
typename BlockRadixSortT::TempStorage sort;
} temp_storage;
// Per-thread tile items
Key items[ITEMS_PER_THREAD];
// Our current block's offset
int block_offset = blockIdx.x * TILE_SIZE;
// Load items into a blocked arrangement
BlockLoadT(temp_storage.load).Load(d_in + block_offset, items);
// Barrier for smem reuse
__syncthreads();
// Sort keys
BlockRadixSortT(temp_storage.sort).SortBlockedToStriped(items);
// Store output in striped fashion
StoreDirectStriped<BLOCK_THREADS>(threadIdx.x, d_out + block_offset, items);
}
int main(){
const int ds = num_blocks*block_size;
thrust::host_vector<mt> data(ds);
thrust::host_vector<int> keys(ds);
for (int i = block_size; i < ds; i+=block_size) keys[i] = 1; // mark beginning of blocks
thrust::device_vector<int> d_keys = keys;
for (int i = 0; i < ds; i++) data[i] = (rand()%block_size) + (i/block_size)*block_size; // populate data
thrust::device_vector<mt> d_data = data;
thrust::inclusive_scan(d_keys.begin(), d_keys.end(), d_keys.begin()); // fill out keys array 000111222...
thrust::device_vector<mt> d1 = d_data; // make a copy of unsorted data
cudaDeviceSynchronize();
unsigned long long os = dtime_usec(0);
thrust::sort(d1.begin(), d1.end()); // ordinary sort
cudaDeviceSynchronize();
os = dtime_usec(os);
thrust::device_vector<mt> d2 = d_data; // make a copy of unsorted data
cudaDeviceSynchronize();
unsigned long long ss = dtime_usec(0);
thrust::sort(thrust::make_zip_iterator(thrust::make_tuple(d2.begin(), d_keys.begin())), thrust::make_zip_iterator(thrust::make_tuple(d2.end(), d_keys.end())), my_sort_functor());
cudaDeviceSynchronize();
ss = dtime_usec(ss);
if (!thrust::equal(d1.begin(), d1.end(), d2.begin())) {std::cout << "oops1" << std::endl; return 0;}
std::cout << "ordinary thrust sort: " << os/(float)USECPSEC << "s " << "segmented sort: " << ss/(float)USECPSEC << "s" << std::endl;
thrust::device_vector<mt> d3(ds);
cudaDeviceSynchronize();
unsigned long long cs = dtime_usec(0);
BlockSortKernel<mt, nTPB, items_per><<<num_blocks, nTPB>>>(thrust::raw_pointer_cast(d_data.data()), thrust::raw_pointer_cast(d3.data()));
cudaDeviceSynchronize();
cs = dtime_usec(cs);
if (!thrust::equal(d1.begin(), d1.end(), d3.begin())) {std::cout << "oops2" << std::endl; return 0;}
std::cout << "cub sort: " << cs/(float)USECPSEC << "s" << std::endl;
}
$ nvcc -o t1 t1.cu
$ ./t1
ordinary thrust sort: 0.001652s segmented sort: 0.00263s
cub sort: 0.000265s
$
(CUDA 10.2.89, Tesla V100, Ubuntu 18.04)
I have no doubt that your sizes and array dimensions don't correspond to mine. The purpose here is to illustrate some possible methods, not a black-box solution that works for your particular case. You probably should do benchmark comparisons of your own. I also acknowledge that the block radix sort method for cub expects equal-sized sub-arrays, which you may not have. It may not be a suitable method for you, or you may wish to explore some kind of padding arrangement. There's no need to ask this question of me; I won't be able to answer it based on the information in your question.
I don't claim correctness for this code or any other code that I post. Anyone using any code I post does so at their own risk. I merely claim that I have attempted to address the questions in the original posting, and provide some explanation thereof. I am not claiming my code is defect-free, or that it is suitable for any particular purpose. Use it (or not) at your own risk.

Use an integer (int) as a pointer argument [closed]

If we ignore the aspects:
Bad practice
Unjustified need
and probably others ...
What are the risks (at run time: crash, undefined behavior, segmentation fault; implementation-defined behavior: wrong address generation) of this program, as long as the address stays between INT_MIN and INT_MAX:
#include <iostream>
using namespace std;
#include <sstream>
#include <string>
#define TAB_SIZE 2
void UseIntAsAdress (unsigned int i)
{
int *pTab = (int*) i;
for (int i=0; i< TAB_SIZE; i++)
cout << "tab ["<<i<<"] = "<< pTab[i] <<endl;
}
int main()
{
int *pTab = new int [TAB_SIZE];
for ( int i=0; i<TAB_SIZE; i++)
pTab [i] = i;
std::stringstream streamAdr;
streamAdr << pTab;
std::string name = streamAdr.str();
unsigned int i = stoi(name.c_str(), 0, 16);
UseIntAsAdress (i);
delete [] pTab;
return 0;
}
Your program has implementation-defined behavior. Both the result of streamAdr << pTab; and the result of (int*) i are implementation-defined.
So you need to look at the documentation of your particular compiler to figure out whether this program behaves in the way you expect it to or not.
There is no general guarantee that this will behave correctly.
The cast from pointer to integer can also be done much more simply:
reinterpret_cast<std::intptr_t>(pTab)
This is assuming your implementation supports std::intptr_t. Otherwise (in particular pre-C++11) you can try one of the standard integer types. Compilation should fail if the type used is too small to hold the pointer values; otherwise it will work the same as std::intptr_t.
If then the value resulting from this cast isn't narrowed by conversion to int, the result of casting back to int* will behave as expected (i.e. you get a pointer to the first element of the array back), otherwise it will still have implementation-defined behavior.
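A minimal sketch of that round trip, assuming the implementation provides the optional type std::intptr_t; the helper names to_integer and to_pointer are invented for this example.

```cpp
#include <cstdint>

// Converts a pointer to an integer type that is wide enough to hold it.
// Assumption: std::intptr_t exists on this implementation (it is optional).
std::intptr_t to_integer(const int* p) {
    return reinterpret_cast<std::intptr_t>(p);
}

// Converts the integer back to a pointer. The result equals the original
// pointer only because the value was never narrowed in between.
const int* to_pointer(std::intptr_t value) {
    return reinterpret_cast<const int*>(value);
}
```

Storing the intermediate value in an int or unsigned int instead, as the program in the question does, is where the guarantee is lost on platforms whose pointers are wider than int.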

Optimizing bit-waste for custom data encoding

I was wondering what's a good solution to make it so that a custom data structure took the least amount of space possible, and I've been searching around without finding anything.
The general idea is that I may have some kind of data structure with a lot of different variables: integers, booleans, etc. With booleans, it's fairly easy to use bitmasks/flags. For integers, perhaps I only need 10 distinct values for one of the integers, and 50 for another. I would like to have some function encode the structure without wasting any bits. Ideally I would be able to pack them side by side in an array, without any padding.
I have a vague idea that I would have to have way of enumerating all the possible permutations of values of all the variables, but I'm unsure where to start with this.
Additionally, though this may be a bit more complicated, what if I have a bunch of restrictions such as not caring about certain variables if other variables meet certain criteria. This reduces the amount of permutations, so there should be a way of saving some bits here as well?
Example: Say I have a server for an online game, containing many players. The Player struct stores a lot of different variables: level, stats, and a bunch of flags for which quests the player has cleared.
struct Player {
int level; //max is 100
int strength //max is
int int // max is 500
/* ... */
bool questFlag30;
bool questFlag31;
bool questFlag32;
/* ... */
};
and I want to have a function encodedData encode(std::vector<Player> players) that takes a vector of Players, and a function decodeData which returns a vector from the encoded data.
This is what I came up with; it's not perfect, but it's something:
#include <vector>
#include <iostream>
#include <bitset>
#include <cstdint>
#include <assert.h>
/* Data structure for packing multiple variables, without padding */
struct compact_collection {
    std::vector<bool> data;
    /* Returns a uint32_t since we don't want to store the length of each variable */
    uint32_t query_bits(int index, int length) {
        std::bitset<32> temp;
        for (int i = index; i < index + length; i++) temp[i - index] = data[i];
        return temp.to_ulong();
    }
    /* Prepends the low `bits` bits of value, so the most recently added value
       starts at index 0 with its least significant bit first */
    void add_bits(int32_t value, int32_t bits) {
        /* value must fit in `bits` bits */
        assert(bits > 0 && bits <= 32);
        assert(value >= 0 && static_cast<uint64_t>(value) < (uint64_t{1} << bits));
        auto a = std::bitset<32>(value).to_string();
        for (int i = 32 - bits; i < 32; i++) data.insert(data.begin(), (a[i] == '1'));
    }
};
int main() {
    compact_collection myCollection;
    myCollection.add_bits(45,6);
    std::cout << myCollection.query_bits(0,6);
    std::cin.get();
    return 0;
}
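The idea of enumerating every combination of values, mentioned in the question, corresponds to a mixed-radix encoding: treat each field as a digit whose base is the number of values it can take. With the 10-value and 50-value fields from the question there are 500 combinations, which fit in 9 bits, whereas storing the fields separately takes 4 + 6 = 10 bits. The sketch below is one possible way to do this; pack_mixed_radix and unpack_mixed_radix are hypothetical names, and the code assumes the product of all radices fits in 64 bits.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: packs fields[i], each in [0, radices[i]), into one integer.
// Assumes fields and radices have the same size and that the product of all
// radices fits in 64 bits.
std::uint64_t pack_mixed_radix(const std::vector<std::uint32_t>& fields,
                               const std::vector<std::uint32_t>& radices) {
    std::uint64_t code = 0;
    // Walk from the last field to the first so that fields[0] becomes the
    // least significant "digit".
    for (std::size_t i = fields.size(); i-- > 0;) {
        code = code * radices[i] + fields[i];
    }
    return code;
}

// Inverse of pack_mixed_radix: peels the digits off again, first field first.
std::vector<std::uint32_t> unpack_mixed_radix(std::uint64_t code,
                                              const std::vector<std::uint32_t>& radices) {
    std::vector<std::uint32_t> fields(radices.size());
    for (std::size_t i = 0; i < radices.size(); ++i) {
        fields[i] = static_cast<std::uint32_t>(code % radices[i]);
        code /= radices[i];
    }
    return fields;
}
```

For records with more combinations than 64 bits can hold, the same scheme would need arbitrary-precision arithmetic or an entropy coder, which is beyond this sketch.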

Space efficiency of algorithms

It seems like none of the algorithm textbooks says much about space efficiency, so I don't really understand what is expected when I encounter questions asking for an algorithm that requires only constant memory.
What would be a few examples of algorithms that use constant memory, and of algorithms that don't?
If an algorithm:
a) recurses a number of levels deep which depends on N, or
b) allocates an amount of memory which depends on N
then it is not constant memory. Otherwise it probably is: formally it is constant-memory if there is a constant upper bound on the amount of memory which the algorithm uses, no matter what the size/value of the input. The memory occupied by the input is not included, so sometimes to be clear you talk about constant "extra" memory.
So, here's a constant-memory algorithm to find the maximum of an array of integers in C:
#include <limits.h> /* for INT_MIN */
int max(int *start, int *end) {
    int result = INT_MIN;
    while (start != end) {
        if (*start > result) result = *start;
        ++start;
    }
    return result;
}
Here's a non-constant memory algorithm, because it uses stack space proportional to the number of elements in the input array. However, it could become constant-memory if the compiler is somehow capable of optimising it to a non-recursive equivalent (which C compilers don't usually bother with except sometimes with a tail-call optimisation, which wouldn't do the job here):
int max(int *start, int *end) {
    if (start == end) return INT_MIN;
    int tail = max(start+1, end);
    return (*start > tail) ? *start : tail;
}
Here is a constant-space sort algorithm (in C++ this time), which is O(N!) time or thereabouts (maybe O(N*N!)):
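#include <algorithm> // for std::next_permutation
void sort(int *start, int *end) {
    // next_permutation returns false once it wraps around to sorted order
    while (std::next_permutation(start,end));
}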
Here is an O(N) space sort algorithm, which is O(N^2) time:
void sort(int *start, int *end) {
    std::vector<int> work;
    for (int *current = start; current != end; ++current) {
        work.insert(
            std::upper_bound(work.begin(), work.end(), *current),
            *current
        );
    }
    std::copy(work.begin(), work.end(), start);
}
Very easy example: counting a number of characters in a string. It can be iterative:
int length( const char* str )
{
    int count = 0;
    while( *str != 0 ) {
        str++;
        count++;
    }
    return count;
}
or recursive:
int length( const char* str )
{
    if( *str == 0 ) {
        return 0;
    }
    return 1 + length( str + 1 );
}
The first variant only uses a couple of local variables regardless of the string length, so its space complexity is O(1). The second, if executed without recursion elimination, requires a separate stack frame for storing the return address and local variables at each depth level, so its space complexity is O(n), where n is the string length.
Take sorting algorithms on an array, for example. You can either use a new array of the same length as the original, into which you put the sorted elements (Θ(n) extra space), or sort the array in place, using just one additional temporary variable for swapping two elements (Θ(1) extra space).
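A sketch of that contrast in C++, with two deliberately simple O(n^2)-time sorts so that the only difference that matters is the extra memory. The function names are invented for this example, and neither is meant as a practical sort.

```cpp
#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

// Θ(1) extra space: selection sort only swaps elements inside the input array.
void sort_in_place(std::vector<int>& a) {
    for (std::size_t i = 0; i + 1 < a.size(); ++i) {
        std::size_t smallest = i;
        for (std::size_t j = i + 1; j < a.size(); ++j) {
            if (a[j] < a[smallest]) smallest = j;
        }
        std::swap(a[i], a[smallest]);
    }
}

// Θ(n) extra space: computes the final position of each element and writes it
// into a second array of the same length, leaving the input untouched.
std::vector<int> sort_into_copy(const std::vector<int>& a) {
    std::vector<int> out(a.size());
    for (std::size_t i = 0; i < a.size(); ++i) {
        std::size_t rank = 0;
        for (std::size_t j = 0; j < a.size(); ++j) {
            // Smaller elements go first; equal elements keep their input order.
            if (a[j] < a[i] || (a[j] == a[i] && j < i)) ++rank;
        }
        out[rank] = a[i];
    }
    return out;
}
```

Both functions produce the same ordering; sort_into_copy simply pays for it with a second array whose size grows with the input.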
