Say I have 50 million features, each of which is read from disk.
At the beginning of my program, I go over each feature and, depending on some conditions, apply modifications to some of them.
At this point in my program, I am reading a feature from disk, processing it, and writing it back, because I don't have enough RAM to hold all 50 million features at once.
Now say I want to sort these 50 million features; is there an optimal algorithm to do this, given that I can't load them all at the same time?
Like a partial sorting algorithm or something like that?
In general, the class of algorithms you're looking for is called external sorting. Perhaps the most widely known example of such an algorithm is merge sort.
The idea of this algorithm (the external version) is that you split the data into pieces you can sort in place in memory (say 100 thousand elements) and sort each block independently (using some standard algorithm such as quicksort). Then you take the blocks and merge them (so you merge two 100k blocks into one 200k block), which can be done by reading elements from both blocks into buffers (since the blocks are already sorted). At the end, you merge the two remaining blocks into one block containing all the elements in the right order.
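A minimal sketch of that merge step in Python, assuming each sorted block has already been written to disk one record per line (the file names and line-oriented format are just placeholders for the example):

    def merge_two_runs(path_a, path_b, path_out):
        # Merge two sorted run files into one, holding only one record per run in memory.
        with open(path_a) as a, open(path_b) as b, open(path_out, "w") as out:
            line_a, line_b = a.readline(), b.readline()
            while line_a and line_b:
                if line_a <= line_b:  # plain string comparison; use a key for numeric records
                    out.write(line_a)
                    line_a = a.readline()
                else:
                    out.write(line_b)
                    line_b = b.readline()
            # One run is exhausted; drain whatever is left of the other.
            while line_a:
                out.write(line_a)
                line_a = a.readline()
            while line_b:
                out.write(line_b)
                line_b = b.readline()

Only two records are in memory at any moment, which is why the merge phase works no matter how large the runs are.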
If you are on Unix, use sort ;)
It may seem stupid, but the command-line tool has been programmed to handle this case, and you won't have to reprogram it.
Actually, sorting techniques come in two types according to memory usage: internal and external.
Insertion, selection, and exchange sorts are internal sorts; that means they are processed entirely in internal memory.
But what about merge sort?
You can certainly write a completely internal merge sort. See https://www.geeksforgeeks.org/merge-sort/ for an example.
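For reference, a completely in-memory merge sort is only a few lines of Python (a sketch, not tuned for speed):

    def merge_sort(items):
        # Base case: a list of 0 or 1 elements is already sorted.
        if len(items) <= 1:
            return items
        mid = len(items) // 2
        left, right = merge_sort(items[:mid]), merge_sort(items[mid:])
        # Merge the two sorted halves.
        merged, i, j = [], 0, 0
        while i < len(left) and j < len(right):
            if left[i] <= right[j]:
                merged.append(left[i]); i += 1
            else:
                merged.append(right[j]); j += 1
        return merged + left[i:] + right[j:]

    print(merge_sort([5, 2, 9, 1, 5, 6]))  # [1, 2, 5, 5, 6, 9]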
People often talk about an "external merge sort", but that often works out to a two-pass sorting technique where you successively load small portions of a large file into memory, sort them, and write them to disk. In the second pass, you merge those multiple portions into a single sorted file. See https://en.wikipedia.org/wiki/External_sorting for details.
Given a cloud storage folder with, say, 1PB of data in it, what would be the quickest way to sort all of that data? It's easy to sort small chunks of it, but merging them into a larger sorted output will take longer, since at some point a single process has to merge the whole thing. I would like to avoid this and have a fully distributed solution. Is there a way? If so, is there any implementation that would be suitable for sorting data in S3?
Since the amount of data you need to sort exceeds RAM (by a lot), the only reasonable way (to my knowledge) is to sort chunks first and then merge them together.
Merge Sort is the best way to accomplish this task. You can sort separate chunks of data at the same time with parallel processes, which should speed up your sort.
The thing is, after you're done sorting chunks, you don't have to have a single process do all of the merging; you can have several processes merging different chunks at the same time:
This algorithm uses a parallel merge algorithm to not only parallelize the recursive division of the array, but also the merge operation. It performs well in practice when combined with a fast stable sequential sort, such as insertion sort, and a fast sequential merge as a base case for merging small arrays.
Here is a link that gives a bit more info about Merge Algorithm (just in case).
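As a sketch of that idea (assuming the sorted chunks are already on disk, one record per line; merge_pair and the path handling are made up for the illustration), you could merge pairs of runs in parallel rounds with Python's multiprocessing:

    import heapq
    from multiprocessing import Pool

    def merge_pair(paths):
        # Merge two sorted run files into a new run file and return its path.
        a_path, b_path = paths
        out_path = a_path + ".merged"
        with open(a_path) as a, open(b_path) as b, open(out_path, "w") as out:
            out.writelines(heapq.merge(a, b))  # both inputs are already sorted
        return out_path

    def parallel_merge(runs, workers=4):
        # Repeatedly merge pairs of runs in parallel until a single run remains.
        with Pool(workers) as pool:
            while len(runs) > 1:
                pairs = [runs[i:i + 2] for i in range(0, len(runs) - 1, 2)]
                leftover = [runs[-1]] if len(runs) % 2 else []
                runs = pool.map(merge_pair, pairs) + leftover
        return runs[0]

Each round halves the number of runs, and the merges within a round are independent, so they can run on different cores (or, with the same structure, on different machines). On platforms that spawn worker processes, call parallel_merge from under an if __name__ == "__main__": guard.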
Bad news: you cannot avoid a k-way merge of multiple sorted files.
The good news is that you can do some of the operations in parallel.
I have a large amount of data which I need to sort: several million arrays, each with tens of thousands of values. What I'm wondering is the following:
Is it better to implement a parallel sorting algorithm on the GPU and run it across all the arrays
OR
implement a single-threaded algorithm, like quicksort, and assign each GPU thread a different array.
Obviously speed is the most important factor. For a single-threaded sorting algorithm, memory is a limiting factor. I've already tried to implement a recursive quicksort, but it doesn't seem to work for large amounts of data, so I'm assuming there is a memory issue.
The data type to be sorted is long, so I don't believe a radix sort would be possible, since the binary representation of the numbers would be too long.
Any pointers would be appreciated.
Sorting is an operation that has received a lot of attention. Writing your own sort isn't advisable if you are interested in high performance. I would consider something like thrust, back40computing, moderngpu, or CUB for sorting on the GPU.
Most of the above will be handling an array at a time, using the full GPU to sort an array. There are techniques within thrust to do a vectorized sort which can handle multiple arrays "at once", and CUB may also be an option for doing a "per-thread" sort (let's say, "per thread block").
Generally I would say the same thing about CPU sorting code. Don't write your own.
EDIT: I guess one more comment. I would lean heavily towards the first approach you mention (i.e., not doing a sort per thread). There are two related reasons for this:
Most of the fast sorting work has been done along the lines of your first method, not the second.
The GPU is generally better at being fast when the work is well adapted for SIMD or SIMT. This means we generally want each thread to be doing the same thing and to minimize branching and warp divergence. This is harder to achieve (I think) in the second case, where each thread appears to be following the same sequence but in fact data dependencies are causing "algorithm divergence". On the surface of it, you might wonder if the same criticism could be levelled at the first approach, but since the libraries I mention are written by experts, they are aware of how best to utilize the SIMT architecture. The thrust "vectorized sort" and CUB approaches will allow multiple sorts to be done per operation, while still taking advantage of the SIMT architecture.
I'm trying to sort an array of 8-bit numbers using VHDL.
I'm trying to find one method that optimises delay and another that uses less hardware.
The size of the array is fixed. But I'm also interested to extend the functionality to variable lengths.
I've come across 3 algorithms so far:
Batcher parallel method
Green sort
Van Voorhis sort
Which of these will do the best job? Are there any other methods I should be looking at?
Thanks.
There are a lot of research articles on the matter; you could try searching the web for them. I did a search for "Sorting Networks" and came up with a lot of comparisons of different algorithms and how well they fit into an FPGA.
The algorithm you choose will greatly depend on which parameter is most important to optimize for, i.e. latency, area, etc. Another important factor is where the values are stored at the beginning and end of the sort. If they are stored in registers, all might be accessed at once, but if you have to read them from a memory with a limited width, you should consider that in your implementation as well, because then you will have to sort values in a stream, and rearrange that stream before saving it back to memory.
Personally, I'd consider something time-constant like merge-sort, which has a constant time to sort, so you could easily schedule the sort for a fixed size array. I'm however not sure how well this scales or works with arbitrary sized arrays. You'd probably have to set an upper limit on array size, and also this approach works best if all data is stored in registers.
I read about this in Knuth's book, and according to it, Batcher's parallel merge sort is the fastest algorithm and also the most hardware-efficient.
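Before committing to an HDL implementation, it can help to enumerate the compare-exchange pairs of Batcher's odd-even mergesort in software; the sketch below (Python, the standard odd-even mergesort construction, power-of-two sizes only) prints the comparator list you would instantiate as compare-swap units in VHDL:

    def batcher_comparators(n):
        # Compare-exchange pairs of Batcher's odd-even mergesort (n must be a power of two).
        pairs = []
        p = 1
        while p < n:
            k = p
            while k >= 1:
                for j in range(k % p, n - k, 2 * k):
                    for i in range(min(k, n - j - k)):
                        if (i + j) // (2 * p) == (i + j + k) // (2 * p):
                            pairs.append((i + j, i + j + k))
                k //= 2
            p *= 2
        return pairs

    # For 4 inputs this yields the classic 5-comparator network:
    # [(0, 1), (2, 3), (0, 2), (1, 3), (1, 2)]
    print(batcher_comparators(4))

Comparators that touch disjoint indices within the same stage can be evaluated in the same clock cycle, which is what gives the network its low latency in hardware.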
Hi, I saw this as an interview question and thought it was interesting, but I'm not sure about the answer.
What would be the best way ?
Assuming *nix:
system("sort <input_file >output_file")
"sort" can use temporary files to work with input files larger than memory. It has switches to tune the amount of main memory and the number of temporary files it will use, if needed.
If not *nix, or the interviewer frowns because of the sideways answer, then I'll code an external merge sort. See #psyho's answer for a good summary of an external sorting algorithm.
Put them in a database and let the database worry about it.
One way to do this is to use an external sorting algorithm (a rough code sketch follows the steps):
1. Read a chunk of the file into memory
2. Sort that chunk using any regular sorting algorithm (like quicksort)
3. Output the sorted strings into a temporary file
4. Repeat steps 1-3 until you have processed the whole file
5. Apply the merge-sort algorithm by reading the temporary files line by line
6. Profit!
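A compact sketch of those steps in Python, using the standard library's heapq.merge for the final merge (the chunk size and the assumption of newline-terminated records are placeholders for the example):

    import heapq
    import os
    import tempfile
    from itertools import islice

    def external_sort(in_path, out_path, chunk_lines=1_000_000):
        # Steps 1-4: read, sort, and spill fixed-size chunks as sorted run files.
        run_paths = []
        with open(in_path) as src:
            while True:
                chunk = list(islice(src, chunk_lines))
                if not chunk:
                    break
                chunk.sort()
                fd, run_path = tempfile.mkstemp(text=True)
                with os.fdopen(fd, "w") as run:
                    run.writelines(chunk)
                run_paths.append(run_path)
        # Step 5: k-way merge of the sorted runs, reading them line by line.
        runs = [open(p) for p in run_paths]
        with open(out_path, "w") as out:
            out.writelines(heapq.merge(*runs))
        for f in runs:
            f.close()
        for p in run_paths:
            os.remove(p)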
Well, this is an interesting interview question... almost all questions of this kind are meant to test your skills and, fortunately, don't directly apply to real-life examples. This looks like one of those, so let's get into the puzzle.
When your interviewer asks for "best", I believe he/she talks about performance only.
Answer 1
30GB of strings is a lot of data. All compare-swap algorithms are Omega(n log n), so it will take a long time. While there are O(n) algorithms, such as counting sort, they are not in-place, so you would be multiplying the 30GB while you have only 4GB of RAM (consider the amount of swapping...), so I would go with quicksort.
Answer 2 (partial)
Start thinking about counting sort. You may want to first split the strings into groups (using a radix sort approach), one for each letter. You may want to scan the file and, for each initial letter, move the string (so copy and delete, no space wasted) into a temporary file. You may want to repeat the process for the first 2, 3 or 4 chars of each string. Then, in order to reduce the complexity of sorting lots of files, you can sort the strings within each file separately (using quicksort now) and finally merge all files in order. This way you'll still have O(n log n), but on a far lower n.
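A sketch of that first split in Python (the prefix length, paths, and bucket naming are arbitrary choices for the example; it assumes the number of distinct first letters is small enough to keep one file handle open per bucket, and that each bucket fits in memory for its own sort):

    import os

    def bucket_by_prefix(in_path, bucket_dir, prefix_len=1):
        # Pass 1: stream the big file once, appending each string to its prefix bucket.
        os.makedirs(bucket_dir, exist_ok=True)
        handles = {}  # prefix -> open bucket file
        with open(in_path) as src:
            for line in src:
                key = line[:prefix_len]
                if key not in handles:
                    path = os.path.join(bucket_dir, "bucket_%05d" % len(handles))
                    handles[key] = open(path, "w")
                handles[key].write(line)
        paths = {key: h.name for key, h in handles.items()}
        for h in handles.values():
            h.close()
        return paths

    def sort_buckets(paths, out_path):
        # Pass 2: sort each small bucket in memory and emit the buckets in prefix order.
        with open(out_path, "w") as out:
            for key in sorted(paths):
                with open(paths[key]) as bucket:
                    out.writelines(sorted(bucket))

Concatenating the per-prefix buckets in prefix order gives the fully sorted output, because every string in an earlier bucket sorts before every string in a later one (the buckets are keyed on a prefix of the string itself).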
Database systems already handle this particular problem well.
A good answer is to use the merge-sort algorithm, adapting it to spool data to and from disk as needed for the merge steps. This can be done with minimal demands on memory.