Given a cloud storage folder with say 1PB of data in it, what would be the quickest way to sort all of that data? It's easy to sort small chunks of it, but then merging them into a larger sorted output will take longer since at some point a single process will have to merge the whole thing. I would like to avoid this, and have a fully distributed solution, is there a way? If so, is there any implementation that would be suitable for using to sort data in S3?
Since the amount of data you need to sort exceeds RAM (by a lot), the only reasonable way (to my knowledge) is to sort chunks first and then merge them together.
Merge Sort is the best way to accomplish this task. You can sort separate chunks of data at the same time with parallel processes, which should speed up your sort.
The thing is, once you are done sorting the chunks, you don't need a single process doing all of the merging; several processes can merge different chunks at the same time:
This algorithm uses a parallel merge algorithm to not only parallelize the recursive division of the array, but also the merge operation. It performs well in practice when combined with a fast stable sequential sort, such as insertion sort, and a fast sequential merge as a base case for merging small arrays.
Here is a link that gives a bit more info about Merge Algorithm (just in case).
Bad news: you cannot avoid a k-way merge of multiple sorted files.
The good news is that you can do some of the operations in parallel.
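As a minimal sketch of such a k-way merge, assuming the sorted chunks are available as Python iterables (the chunk contents and the name k_way_merge are made up for illustration), the standard library's heapq.merge lazily merges any number of sorted inputs:

```python
import heapq

def k_way_merge(sorted_chunks):
    """Lazily merge any number of already-sorted iterables into one sorted stream."""
    # heapq.merge keeps a small heap with one pending item per input,
    # so memory use grows with the number of chunks, not their total size.
    return heapq.merge(*sorted_chunks)

# Hypothetical chunks standing in for sorted files.
chunks = [[1, 4, 9], [2, 3, 10], [0, 5, 6]]
merged = list(k_way_merge(chunks))
```

Disjoint groups of chunks can be merged by separate processes at the same time, but combining their outputs is itself another merge.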
Sorting techniques fall into two types according to memory usage: internal and external.
Insertion, selection, and exchange sorts are internal sorts, meaning they are carried out entirely in main memory.
But what about merge sort?
You can certainly write a completely internal merge sort. See https://www.geeksforgeeks.org/merge-sort/ for an example.
People often talk about an "external merge sort", which usually works out to a two-pass sorting technique. In the first pass, you successively load small portions of a large file into memory, sort them, and write them to disk. In the second pass, you merge those multiple portions into a single sorted file. See https://en.wikipedia.org/wiki/External_sorting for details.
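A rough sketch of that two-pass technique, assuming the input is an iterable of newline-terminated text lines and that chunk_size lines fit in memory (the function name external_sort and the use of temporary files are illustrative choices, not a reference implementation):

```python
import heapq
import os
import tempfile

def external_sort(lines, chunk_size):
    """Two-pass external sort of an iterable of text lines (each ending in a newline)."""
    run_paths = []
    chunk = []

    def flush():
        # Pass 1: sort one in-memory chunk and write it out as a sorted run.
        chunk.sort()
        fd, path = tempfile.mkstemp(suffix=".run", text=True)
        with os.fdopen(fd, "w") as f:
            f.writelines(chunk)
        run_paths.append(path)
        chunk.clear()

    for line in lines:
        chunk.append(line)
        if len(chunk) >= chunk_size:
            flush()
    if chunk:
        flush()

    # Pass 2: k-way merge of the sorted runs, streaming line by line.
    files = [open(p) for p in run_paths]
    try:
        yield from heapq.merge(*files)
    finally:
        for f in files:
            f.close()
        for p in run_paths:
            os.remove(p)
```

Because the function is a generator, the merged output is streamed and never has to fit in memory all at once.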
I have two large csv files presorted by one of the columns. Is there a way to use the fact that they are already sorted to get a new sorted RDD faster, without full sorting again?
The short answer: No, there is no way to leverage the fact that two input RDDs are already sorted when using the sort facilities offered by Apache Spark.
The long answer: Under certain conditions, there might be a better way than using sortBy or sortByKey.
The most obvious case is when the input RDDs are already sorted and represent distinct ranges. In this case, simply using rdd1.union(rdd2) is the fastest (virtually zero cost) way for combining the input RDDs, assuming that all elements in rdd1 come before all elements in rdd2 (according to the chosen ordering).
When the ranges of the input RDDs overlap, things get more tricky. Assuming that the target RDD shall only have a single partition, it might be efficient to use toLocalIterator on both RDDs and then do a merge manually. If the result has to be an RDD, one could do this inside the compute method of a custom RDD type, processing the input RDDs and generating the outputs.
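The manual merge for the single-partition case could look roughly like this. It is a plain Python sketch, not Spark code: ordinary lists stand in for the iterators that toLocalIterator would return, and the name merge_sorted is hypothetical.

```python
def merge_sorted(left, right):
    """Yield the items of two sorted iterables in sorted order, in one linear pass."""
    left_it, right_it = iter(left), iter(right)
    sentinel = object()  # marks an exhausted input
    a = next(left_it, sentinel)
    b = next(right_it, sentinel)
    while a is not sentinel and b is not sentinel:
        if a <= b:
            yield a
            a = next(left_it, sentinel)
        else:
            yield b
            b = next(right_it, sentinel)
    # At most one input still has items; drain it.
    while a is not sentinel:
        yield a
        a = next(left_it, sentinel)
    while b is not sentinel:
        yield b
        b = next(right_it, sentinel)
```

Each element is read once, so the merge is linear in the total input size, but all of the work happens in a single process.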
When the inputs are large and thus consist of many partitions, things get even trickier. In this case, you probably want multiple partitions in the output RDD as well. You could use the custom RDD mentioned earlier, but create multiple partitions (using a RangePartitioner). Each partition would cover a distinct range of elements (in the optimal case, these ranges would cover roughly equally sized parts of the output).
The tricky part is avoiding processing the complete input RDDs multiple times inside compute. This can be done efficiently using filterByRange from OrderedRDDFunctions when the input RDDs are using a RangePartitioner. When they are not using a RangePartitioner, but you know that partitions are ordered internally and also have a global order, you would first need to find out the effective ranges covered by these partitions by actually probing into the data.
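To illustrate the idea of per-range output partitions outside of Spark, here is a small Python sketch. It assumes both inputs are sorted in-memory lists and that the range boundaries are already known; merge_by_ranges and its parameters are hypothetical names, and binary search plays the role that filterByRange plays in the text above.

```python
import bisect
import heapq

def merge_by_ranges(a, b, boundaries):
    """Merge sorted lists a and b into len(boundaries) + 1 sorted partitions.

    Partition i holds the keys k with boundaries[i-1] <= k < boundaries[i].
    """
    partitions = []
    a_lo = b_lo = 0
    for bound in list(boundaries) + [None]:
        # Binary search finds where this key range ends in each input,
        # so no partition ever scans data outside its own range.
        a_hi = len(a) if bound is None else bisect.bisect_left(a, bound)
        b_hi = len(b) if bound is None else bisect.bisect_left(b, bound)
        partitions.append(list(heapq.merge(a[a_lo:a_hi], b[b_lo:b_hi])))
        a_lo, b_lo = a_hi, b_hi
    return partitions
```

The loop runs sequentially here, but each partition only touches its own slices of the inputs, so in principle the partitions could be built by independent tasks.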
As the multiple partition case is rather complex, I would check whether the custom-made sort is really faster than simply using sortBy or sortByKey. The logic for sortBy and sortByKey is highly optimized regarding the shuffling process (transferring data between nodes). For this reason, it might well be that for many cases these methods are faster than the custom-made logic, even though the custom-made logic could be O(n) while sortBy / sortByKey can be O(n log(n)) at best.
If you are interested in learning more about the shuffling logic used by Apache Spark, there is an article explaining the basic concept.
I am confronted with a problem where I have a massive list of information (287,843 items) that must be sorted for display. Which is more efficient, to use a self-organizing red-black binary tree to keep them sorted or to build an array and then sort? My keys are strings, if that helps. This algorithm should make use of multiple processor cores.
Thank you!
This really depends on the particulars of your setup. If you have a multicore machine, you can probably sort the strings extremely quickly by using a parallel version of quicksort, in which each recursive call is executed in parallel with each other call. With many cores, this can take the already fast quicksort and make it substantially faster. Other sorting algorithms like merge sort can also be parallelized, though parallel quicksort has the advantage of requiring less extra memory. Since you know that you're sorting strings, you may also want to look into parallel radix sort, which could potentially be extremely fast.
Most binary search trees cannot easily be multithreaded, because rebalance operations often require changing multiple parts of the tree at once, so a balanced red/black tree may not be the best approach here. However, you may want to look into a concurrent skiplist, which is a data structure that can be made to work efficiently in parallel. There are some newer binary search trees designed for parallelism that sometimes outperform the skiplist (here is one such data structure), though I expect that there will be fewer existing implementations and discussion of these newer structures.
If the elements are not changing frequently or you only need sorted order once, then just sorting once with parallel quicksort is probably the best bet. If the elements are changing frequently, then a concurrent data structure like the parallel skiplist will probably be a better bet.
Hope this helps!
Assuming that you're reading that list from a file or some other data source, it seems quite right to read all that into an array, and then sort it. If you have a GUI of some sort, it seems even more feasible to do both reading and sorting in a thread, while having the GUI in a "waiting to complete" state. Keeping a tree of the values sounds feasible only if you're going to do a lot of deletions/insertions, which would make an array less usable in this case.
When it comes to multi-core sorting, I believe the merge sort is the easiest to parallelize. But I'm no expert when it comes to this, so don't take my word for a definite answer.
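A small sketch of why merge sort splits up so naturally: sort the chunks independently, then merge the results. The name parallel_merge_sort and the workers parameter are made up for this example. Note the assumption that threads are used only to keep the example short; in standard CPython the GIL prevents threads from sorting in parallel, so a real speedup would need a process pool or another runtime.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor

def parallel_merge_sort(items, workers=4):
    """Sort chunks concurrently, then merge the sorted chunks."""
    if not items:
        return []
    size = -(-len(items) // workers)  # ceiling division
    chunks = [items[i:i + size] for i in range(0, len(items), size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Each chunk is sorted independently of the others.
        sorted_chunks = list(pool.map(sorted, chunks))
    # The final merge is sequential in this sketch.
    return list(heapq.merge(*sorted_chunks))
```

For example, parallel_merge_sort(words, workers=3) splits a list of strings into at most three chunks and returns the same result as sorted(words).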
Say I have 50 million features, each feature comes from disk.
At the beginning of my program, I handle each feature and, depending on some conditions, I apply modifications to some of them.
At this point in my program, I am reading a feature from disk, processing it, and writing it back, because I don't have enough RAM to open all 50 million features at once.
Now say I want to sort these 50 million features. Is there an optimal algorithm to do this, given that I can't load everything at the same time?
Like a partial sorting algorithm or something like that?
In general, the class of algorithms you're looking for is called external sorting. Perhaps the most widely known example of such a sorting algorithm is merge sort.
The idea of this algorithm (the external version) is that you split the data into pieces that you can sort in memory (say 100 thousand elements each) and sort each block independently (using some standard algorithm such as quicksort). Then you take the blocks and merge them (so you merge two 100k blocks into one 200k block), which can be done by reading elements from both blocks into buffers and writing out the smaller one each time (this works because the blocks are already sorted). In the final step, you merge the last two blocks into one block that contains all the elements in the right order.
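The pairwise merging rounds can be sketched as follows. This is only an illustration: in-memory lists stand in for the sorted blocks on disk, and the name merge_rounds is hypothetical.

```python
import heapq

def merge_rounds(blocks):
    """Repeatedly merge sorted blocks in pairs until a single block remains."""
    blocks = [list(b) for b in blocks]
    while len(blocks) > 1:
        next_round = []
        for i in range(0, len(blocks), 2):
            pair = blocks[i:i + 2]
            # A leftover block with no partner passes through unchanged.
            next_round.append(list(heapq.merge(*pair)))
        blocks = next_round
    return blocks[0] if blocks else []
```

Each round halves the number of blocks and doubles their size, so k initial blocks need about log2(k) rounds.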
If you are on Unix, use sort ;)
It may seem stupid but the command-line tool has been programmed to handle this case and you won't have to reprogram it.
I am now looking at my old school assignment and want to find the solution of a question.
Which sorting method is most suitable for parallel processing?
Bubble sort
Quick sort
Merge sort
Selection sort
I guess quick sort (or merge sort?) is the answer.
Am I correct?
Like merge sort, quicksort can also be easily parallelized due to its divide-and-conquer nature. Individual in-place partition operations are difficult to parallelize, but once divided, different sections of the list can be sorted in parallel.
One advantage of parallel quicksort over other parallel sort algorithms is that no synchronization is required. A new thread is started as soon as a sublist is available for it to work on and it does not communicate with other threads. When all threads complete, the sort is done.
http://en.wikipedia.org/wiki/Quicksort
It depends completely on the method of parallelization. For multithreaded general computing, a merge sort provides pretty reliable load balancing and memory localization properties. For a large sorting network in hardware, a form of Batcher, Bitonic, or Shell sort is actually best if you want good O(log² n) performance.
I think merge sort:
you can divide the dataset and operate on the parts in parallel.
I think merge sort would be the best answer here, because the basic idea behind merge sort is to divide the problem into independent subproblems, solve them, and merge the results.
That's what we actually do in parallel processing too: divide the whole problem into small units that can be computed in parallel, and then join the results.
Thanks
Just a couple of random remarks:
Many discussions of how easy it is to parallelize quicksort ignore the pivot selection. If you traverse the array to find it, you've introduced a linear time sequential component.
Quicksort is not at all easy to implement in distributed memory. There is a discussion in the Kumar book.
Yeah, I know, one should not use bubble sort. But "odd-even transposition sort", which is more or less equivalent, is actually a pretty good parallel programming exercise. In particular for distributed memory parallelism. It is the easiest example of a sorting network, which is very doable in MPI and such.
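For reference, a sequential Python sketch of odd-even transposition sort (the function name is my own; a real distributed version would assign elements or blocks to processes and exchange neighbors over something like MPI, which this sketch does not attempt):

```python
def odd_even_transposition_sort(values):
    """Sort a list with n phases of neighbor compare-and-swap steps."""
    data = list(values)
    n = len(data)
    for phase in range(n):
        # Even phases compare pairs (0,1), (2,3), ...; odd phases (1,2), (3,4), ...
        # The pairs within one phase are disjoint, so they could run in parallel.
        for i in range(phase % 2, n - 1, 2):
            if data[i] > data[i + 1]:
                data[i], data[i + 1] = data[i + 1], data[i]
    return data
```

The inner loop is the part that parallelizes: every compare-and-swap within a phase touches a different pair of positions.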
It is merge sort, since the two sub-arrays are sorted separately and then merged at the end. The sub-array sorts can be done in parallel.