Efficiently sorting many strings in parallel for presentation - algorithm

I am confronted with a problem where I have a massive list of information (287,843 items) that must be sorted for display. Which is more efficient: using a self-balancing red-black tree to keep the items sorted, or building an array and then sorting it? My keys are strings, if that helps. The algorithm should make use of multiple processor cores.
Thank you!

This really depends on the particulars of your setup. If you have a multicore machine, you can probably sort the strings extremely quickly by using a parallel version of quicksort, in which each recursive call is executed in parallel with each other call. With many cores, this can take the already fast quicksort and make it substantially faster. Other sorting algorithms like merge sort can also be parallelized, though parallel quicksort has the advantage of requiring less extra memory. Since you know that you're sorting strings, you may also want to look into parallel radix sort, which could potentially be extremely fast.
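As a concrete illustration (not from the original answer), here is a minimal sketch of the parallel-quicksort idea in C++ using std::async; the cutoff value and the three-way partition are my own choices for the example, and on C++17 a one-line `std::sort(std::execution::par, ...)` would be an even shorter route.

```cpp
// Minimal sketch of parallel quicksort over string keys using std::async.
// The sequential cutoff is an arbitrary tuning value chosen for illustration.
#include <algorithm>
#include <future>
#include <string>
#include <vector>

template <typename It>
void parallel_quicksort(It first, It last) {
    const std::ptrdiff_t cutoff = 10000;        // below this, spawning a task isn't worth it
    if (last - first < cutoff) {
        std::sort(first, last);
        return;
    }
    auto pivot = *(first + (last - first) / 2); // copy the pivot value
    // Partition into (< pivot) | (== pivot) | (> pivot) so equal keys can't cause
    // unbounded recursion.
    It mid1 = std::partition(first, last, [&](const auto& x) { return x < pivot; });
    It mid2 = std::partition(mid1,  last, [&](const auto& x) { return !(pivot < x); });
    // Sort the left part on another thread while this thread handles the right part.
    auto left = std::async(std::launch::async, parallel_quicksort<It>, first, mid1);
    parallel_quicksort(mid2, last);
    left.get();                                 // wait for the left half to finish
}

int main() {
    std::vector<std::string> keys = {"pear", "apple", "fig", "banana"};
    parallel_quicksort(keys.begin(), keys.end());
}
```

With roughly 288k strings and a cutoff of 10,000, the recursion spawns only a few dozen tasks before falling back to sequential sorting, which keeps thread overhead modest.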
Most binary search trees cannot easily be multithreaded, because rebalance operations often require changing multiple parts of the tree at once, so a balanced red/black tree may not be the best approach here. However, you may want to look into a concurrent skiplist, which is a data structure that can be made to work efficiently in parallel. There are some newer binary search trees designed for parallelism that sometimes outperform the skiplist (here is one such data structure), though I expect that there will be fewer existing implementations and discussion of these newer structures.
If the elements are not changing frequently or you only need sorted order once, then just sorting once with parallel quicksort is probably the best bet. If the elements are changing frequently, then a concurrent data structure like the parallel skiplist will probably be a better bet.
Hope this helps!

Assuming that you're reading that list from a file or some other data source, it seems quite right to read all that into an array, and then sort it. If you have a GUI of some sort, it seems even more feasible to do both reading and sorting in a thread, while having the GUI in a "waiting to complete" state. Keeping a tree of the values sounds feasible only if you're going to do a lot of deletions/insertions, which would make an array less usable in this case.
When it comes to multi-core sorting, I believe merge sort is the easiest to parallelize. But I'm no expert on this, so don't take my word as a definitive answer.
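To make that concrete, here is a rough sketch (my own, not from the answer) of a top-down parallel merge sort using std::async; the depth limit that controls how many tasks get spawned is an arbitrary choice, and the merge step itself stays sequential at each level.

```cpp
// Rough sketch of top-down parallel merge sort: sort the two halves on separate
// threads, then merge them on this one. The depth limit caps the number of tasks.
#include <algorithm>
#include <future>
#include <string>
#include <vector>

void parallel_merge_sort(std::vector<std::string>& v,
                         std::size_t lo, std::size_t hi, int depth) {
    if (depth <= 0 || hi - lo < 2) {
        std::sort(v.begin() + lo, v.begin() + hi);  // small or deep enough: stay sequential
        return;
    }
    std::size_t mid = lo + (hi - lo) / 2;
    auto left = std::async(std::launch::async, parallel_merge_sort,
                           std::ref(v), lo, mid, depth - 1);
    parallel_merge_sort(v, mid, hi, depth - 1);
    left.get();
    // Merging is still sequential here; a parallel merge would be needed to scale further.
    std::inplace_merge(v.begin() + lo, v.begin() + mid, v.begin() + hi);
}

// Usage: parallel_merge_sort(keys, 0, keys.size(), 3);  // up to 2^3 = 8 leaf tasks
```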

Related

What is the utility of the treap data structure?

I am currently studying advanced data structures and I came across a weird data structure called a treap. I understand what a treap is, but I can't seem to find its utility in a valid use-case scenario.
Why should you use such a data structure, and in what types of problems/conditions are treaps best used?
I find myself much more comfortable using hash maps, min/max heaps, binary search trees, or balanced binary search trees, but I can't tell why you should use a treap.
They are easier to implement and, more importantly, that makes them easier to modify/maintain in the future if you want to make slight variations on them or change them in some way. They also allow for efficient parallel versions of the set operations union/intersect/difference, which is extremely valuable. Using them simultaneously as a heap and a binary tree isn't really very handy unless the values you use for priorities happen to be really nicely randomly distributed/permuted. I suppose there might be a case where that would be handy, but it seems really unlikely. Values that randomly distributed are usually more like hash keys, which typically aren't useful as ordered data. How often do you want to pull people out in order of their SSNs? I guess it's possible but unlikely.
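For a feel of why treaps are considered easy to implement, here is a minimal sketch (mine; names are illustrative) of a treap over string keys written with split/merge, the same primitives that underlie the parallel union/intersect/difference operations mentioned above; priorities simply come from rand().

```cpp
// Minimal treap sketch: BST order on the key, heap order on a random priority.
// Insert is expressed with split/merge rather than rotations.
#include <cstdlib>
#include <string>
#include <utility>

struct Node {
    std::string key;
    int priority;                       // random; balances the tree in expectation
    Node *left = nullptr, *right = nullptr;
    explicit Node(std::string k) : key(std::move(k)), priority(std::rand()) {}
};

// Split t into keys < key (lo) and keys >= key (hi).
void split(Node* t, const std::string& key, Node*& lo, Node*& hi) {
    if (!t) { lo = hi = nullptr; return; }
    if (t->key < key) { split(t->right, key, t->right, hi); lo = t; }
    else              { split(t->left,  key, lo, t->left);  hi = t; }
}

// Merge two treaps where every key in a is <= every key in b.
Node* merge(Node* a, Node* b) {
    if (!a) return b;
    if (!b) return a;
    if (a->priority > b->priority) { a->right = merge(a->right, b); return a; }
    else                           { b->left  = merge(a, b->left);  return b; }
}

Node* insert(Node* root, std::string key) {
    Node *lo = nullptr, *hi = nullptr;
    split(root, key, lo, hi);
    return merge(merge(lo, new Node(std::move(key))), hi);
}
```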

CUDA parallel sorting algorithm vs single thread sorting algorithms

I have a large amount of data that I need to sort: several million arrays, each with tens of thousands of values. What I'm wondering is the following:
Is it better to implement a parallel sorting algorithm on the GPU and run it across all the arrays
OR
implement a single-threaded algorithm, like quicksort, and assign each GPU thread a different array?
Obviously speed is the most important factor. For a single-threaded sorting algorithm, memory is a limiting factor. I've already tried to implement a recursive quicksort, but it doesn't seem to work for large amounts of data, so I'm assuming there is a memory issue.
The data type to be sorted is long, so I don't believe a radix sort would be possible, since the binary representation of the numbers would be too long.
Any pointers would be appreciated.
Sorting is an operation that has received a lot of attention. Writing your own sort isn't advisable if you are interested in high performance. I would consider something like thrust, back40computing, moderngpu, or CUB for sorting on the GPU.
Most of the above will be handling an array at a time, using the full GPU to sort an array. There are techniques within thrust to do a vectorized sort which can handle multiple arrays "at once", and CUB may also be an option for doing a "per-thread" sort (let's say, "per thread block").
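To sketch both options with thrust (my own illustrative function names; assumes a CUDA toolkit with thrust available), the "vectorized sort" over many segments is commonly expressed as two back-to-back stable sorts by key:

```cpp
// Sketch of sorting on the GPU with thrust.
// Option 1: one big sort across a whole device array.
// Option 2: the common "vectorized" trick of sorting many segments at once.
#include <thrust/device_vector.h>
#include <thrust/sort.h>

void sort_one_big_array(thrust::device_vector<long>& data) {
    thrust::sort(data.begin(), data.end());   // the full GPU works on a single array
}

// values holds all segments back to back; segment_ids[i] records which array
// values[i] belongs to. After the two stable sorts, each segment is internally sorted.
void sort_many_segments(thrust::device_vector<long>& values,
                        thrust::device_vector<int>& segment_ids) {
    thrust::stable_sort_by_key(values.begin(), values.end(), segment_ids.begin());
    thrust::stable_sort_by_key(segment_ids.begin(), segment_ids.end(), values.begin());
}
```

The second sort groups elements back into their segments, and because it is stable it preserves the value ordering established by the first pass, so every segment ends up sorted.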
Generally I would say the same thing about CPU sorting code. Don't write your own.
EDIT: I guess one more comment. I would lean heavily towards the first approach you mention (i.e. not doing a sort per thread). There are two related reasons for this:
Most of the fast sorting work has been done along the lines of your first method, not the second.
The GPU is generally better at being fast when the work is well adapted for SIMD or SIMT. This means we generally want each thread to be doing the same thing and minimizing branching and warp divergence. This is harder to achieve (I think) in the second case, where each thread appears to be following the same sequence but in fact data dependencies are causing "algorithm divergence". On the surface of it, you might wonder if the same criticism might be levelled at the first approach, but since the libraries I mention are written by experts, they are aware of how best to utilize the SIMT architecture. The thrust "vectorized sort" and CUB approaches will allow multiple sorts to be done per operation, while still taking advantage of the SIMT architecture.

Is hash the best for application requesting high lookup speed?

I have always kept in mind that a hash would be the first thing to reach for when writing an application that requires high lookup speed, and that no other data structure could guarantee that.
But I got confused when I saw so many posts saying otherwise, recommending structures such as suffix trees and tries, to name a few.
So I wonder: is a hash always the best choice for high-speed lookup? What if I want both high lookup speed and low space cost?
Is there any material (books or papers) covering data structures or algorithms for high-speed lookup and space efficiency? Anything of this kind would be highly appreciated.
So I wonder: is a hash always the best choice for high-speed lookup?
No. As stated in comments:
There is never such a thing as the best data structure for [some generic issue]. Everything is case-dependent. Tries and radix trees might be great for strings, since you need to read the string anyway. Arrays allow simplicity and great cache efficiency, and are usually best for small-scale static information.
I once answered a related question about cases where a tree might be better than a hash table: Hash Table v/s Trees
What if I want both high lookup speed and less space cost?
The two goals can pull against each other. Take even the simple example of a hash table of size X vs. a hash table of size 2*X: the bigger hash table is less likely to encounter collisions, and is thus expected to be faster than the smaller one.
Is there any material (books or papers) covering data structures or algorithms for high-speed lookup and space efficiency?
Introduction to Algorithms provides a good walkthrough of the main data structures in use. Every algorithm tries to provide good space and time efficiency, but as said, there is a trade-off, and some algorithms are better suited to specific cases than others.
Choosing the right algorithm/data structure/design for the specific problem is what engineering is about, isn't it?
I assume you are talking about strings here, and the answer is "no", hashes are not the fastest or most space efficient way to look up strings, tries are. Of course, writing a hashing algorithm is much, much easier than writing a trie.
One thing you won't find on Wikipedia or in books about tries is that if you naively implement them with one node per letter, you end up with large numbers of inefficient, one-child nodes. To make a trie that really flies you have to implement nodes so that they can hold a variable number of characters. This, of course, is even harder than writing a plain trie.
I have written trie implementations that handle over a billion entries and I can tell you that if done properly it is insanely fast, nothing else compares.
One other issue with tries is that you have to write a custom heap, because if you just use some kind of generic memory management it will be slow. So in addition to implementing the trie, you have to implement the heap that the trie runs on. Pretty freakin complicated, but if you do it, you get batshit crazy speed.
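As a toy illustration only (not the author's implementation), here is the naive one-node-per-character layout the answer warns about, with a simple pool of nodes standing in for the custom-heap idea; a serious implementation would compress single-child chains into multi-character nodes.

```cpp
// Toy trie over lowercase ASCII keys: one node per character, nodes carved out of
// a pool instead of individually heap-allocated. A production trie would compress
// runs of one-child nodes into variable-length nodes as described above.
#include <array>
#include <cstdint>
#include <deque>
#include <string>

struct TrieNode {
    std::array<std::int32_t, 26> child;   // index into the pool, -1 = no child
    bool terminal = false;
    TrieNode() { child.fill(-1); }
};

class Trie {
public:
    Trie() { pool_.emplace_back(); }      // node 0 is the root

    void insert(const std::string& key) {
        std::int32_t cur = 0;
        for (char c : key) {
            int slot = c - 'a';
            if (pool_[cur].child[slot] < 0) {
                pool_[cur].child[slot] = static_cast<std::int32_t>(pool_.size());
                pool_.emplace_back();     // allocate from the pool, not the general heap
            }
            cur = pool_[cur].child[slot];
        }
        pool_[cur].terminal = true;
    }

    bool contains(const std::string& key) const {
        std::int32_t cur = 0;
        for (char c : key) {
            cur = pool_[cur].child[c - 'a'];
            if (cur < 0) return false;
        }
        return pool_[cur].terminal;
    }

private:
    std::deque<TrieNode> pool_;           // stable element addresses, chunked storage
};
```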
Only a good implementation of a hash will give you good performance, and you cannot compare a hash with a trie for all situations. In situations where a trie is applicable it is fast, but it can be costly in terms of memory (again, depending on the implementation).
But have you measured performance, or is it unnecessary optimization you are looking for? Did the map fail you?
That might also depend on the actual number of elements.
In complexity-theoretic terms a hash is not bad, but asymptotic analysis only pays off once the actual number of elements is bigger than some threshold.
For example, if you have only 2 elements, there is a faster method than a hash ;-)
Hash tables are a good general purpose structure but they can fail spectacularly if the hash function doesn't suit the input data. Worst case lookup is O(n). They also waste some space as you mentioned. Other general-purpose structures like balanced binary search trees have worse average case but better worst case performance than a hash table. This is important for real-time applications. A trie is a more special-purpose structure tailored to string lookup.

Optimal physical orderings of nodes

First, read this:
TPT paper
I was wondering what other options might exist for arranging nodes to boost performance. Anything from post-parent order in a byte array, like TPTs, to something more like a k-order B-tree; I'm wondering what good options are known at the moment?
A bit more on the problem:
I have an extremely fast way of finding elements within a sparse set, given some concept of adjacency to a given pointer. I was wondering how I could best take advantage of this in storing a patricia trie.
You can make assumptions about whether the trie will be random-access, read only, write-seldom, or add-only. Please note them if you do, but I've actually used a TPT and the gains were pretty significant so I'm willing to consider certain constraints.
Update
I guess in some senses this was a little unclear. What I'm looking for here is ways of arranging things in memory that optimize one performance metric or another. The TPTs, through some tricks, use node order to optimize disk reads and space-per-node. I'm curious about:
Total deletion, where the structure is removed from memory entirely.
Inserts, particularly in densely populated structures.
Deletes, again, particularly in densely populated structures.
A DAWG or a minimal DFA (see this question or the paper "How to squeeze a lexicon") may be even better than a TPT because the total size is smaller.

Which sorting method is most suitable for parallel processing?

I am now looking at an old school assignment of mine and want to find the solution to a question.
Which sorting method is most suitable for parallel processing?
Bubble sort
Quick sort
Merge sort
Selection sort
I guess quick sort (or merge sort?) is the answer.
Am I correct?
Like merge sort, quicksort can also be easily parallelized due to its divide-and-conquer nature. Individual in-place partition operations are difficult to parallelize, but once divided, different sections of the list can be sorted in parallel.
One advantage of parallel quicksort over other parallel sort algorithms is that no synchronization is required. A new thread is started as soon as a sublist is available for it to work on and it does not communicate with other threads. When all threads complete, the sort is done.
http://en.wikipedia.org/wiki/Quicksort
It depends completely on the method of parallelization. For multithreaded general computing, a merge sort provides pretty reliable load-balancing and memory-locality properties. For a large sorting network in hardware, a form of Batcher, bitonic, or Shell sort is actually best if you want good O(log² n) performance.
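For reference, here is a sketch (mine) of the iterative form of the bitonic sorting network, assuming a power-of-two input length; every compare-exchange in the innermost loop is independent of the others, which is what makes these O(log² n)-depth networks attractive in hardware and on SIMD/SIMT machines.

```cpp
// Sketch of Batcher's bitonic sorting network for a power-of-two-sized array.
// All n compare-exchanges in the innermost loop are data-independent, so one
// pass of that loop corresponds to one parallel stage of the network.
#include <algorithm>
#include <cstddef>
#include <vector>

void bitonic_sort(std::vector<int>& a) {
    const std::size_t n = a.size();               // assumed to be a power of two
    for (std::size_t k = 2; k <= n; k <<= 1) {
        for (std::size_t j = k >> 1; j > 0; j >>= 1) {
            for (std::size_t i = 0; i < n; ++i) {
                std::size_t partner = i ^ j;      // the element this one is compared with
                if (partner > i) {
                    bool ascending = ((i & k) == 0);
                    if (ascending ? a[i] > a[partner] : a[i] < a[partner])
                        std::swap(a[i], a[partner]);
                }
            }
        }
    }
}
```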
I think merge sort.
You can divide the dataset and perform parallel operations on the parts.
I think merge sort would be the best answer here, because the basic idea behind merge sort is to divide the problem into subproblems, solve them, and merge them.
That's what we actually do in parallel processing too: divide the whole problem into small units to compute in parallel and then join the results.
Thanks
Just a couple of random remarks:
Many discussions of how easy it is to parallelize quicksort ignore the pivot selection. If you traverse the array to find it, you've introduced a linear time sequential component.
Quicksort is not easy to implement at all in distributed memory. There is a discussion in the Kumar book.
Yeah, I know, one should not use bubble sort. But "odd-even transposition sort", which is more or less equivalent, is actually a pretty good parallel programming exercise. In particular for distributed memory parallelism. It is the easiest example of a sorting network, which is very doable in MPI and such.
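A small sketch of odd-even transposition sort (my own; the OpenMP pragma is just one way to express the per-phase parallelism): each of the n phases compares disjoint neighbouring pairs, so all compare-exchanges within a phase can run simultaneously, and in MPI each rank would simply exchange with its left or right neighbour.

```cpp
// Sketch of odd-even transposition sort: n phases that alternate between comparing
// (even, even+1) and (odd, odd+1) pairs. Within a phase the pairs are disjoint,
// so the compare-exchanges are independent (expressed here with OpenMP).
#include <algorithm>
#include <cstddef>
#include <vector>

void odd_even_transposition_sort(std::vector<int>& a) {
    const std::ptrdiff_t n = static_cast<std::ptrdiff_t>(a.size());
    for (std::ptrdiff_t phase = 0; phase < n; ++phase) {
        const std::ptrdiff_t start = phase % 2;   // even phases compare (0,1),(2,3)...; odd phases (1,2),(3,4)...
        #pragma omp parallel for
        for (std::ptrdiff_t p = 0; p < n / 2; ++p) {
            const std::ptrdiff_t i = 2 * p + start;
            if (i + 1 < n && a[i] > a[i + 1])
                std::swap(a[i], a[i + 1]);
        }
    }
}
```

Without OpenMP the pragma is simply ignored and the function still sorts correctly, which makes it a convenient starting point for the kind of exercise described above.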
It is merge sort, since the sorting is done on two subarrays which are then compared and merged at the end; the two subarrays can be sorted in parallel.
