Radix sort using a CSV file - sorting

I've got an assignment today and I have no idea how to sort a CSV file using radix sort.
The question is:
Performance of Radix Sort
Memory Space Requirements
Time in Seconds
Time Complexity O(??)
How many data movements/swaps were there?
How many comparisons were performed?

Related

Sorting technique - most efficient

What sorting technique would you use to sort 10,000 items using just 1000 available slots in your RAM?
Heap Sort
Quick Sort
Bubble Sort
Merge Sort
I am confused between quick sort and merge sort. Both have an average time complexity of O(n log n), but then heap sort also has the same complexity. Any input would be appreciated!
Time complexity won't help you here - what the question is looking for is space complexity. Just as a hint, n = 10000 and you have only 1000 available spaces, so you need to pick an algorithm that is better than O(n) space complexity even in the worst case.
This seems like an HW question, so I'd prefer not to answer directly. In general, though, since your RAM is small and your list is big, you'll do best with something like a cache oblivious algorithm.

Sorting an array consisting of random numbers [closed]

In this sorting animation, I saw that heap sort and merge sort work best for an array containing random numbers. But what if we compare these sorting algorithms with radix sort and introsort?
In short, which type of sorting algorithm is best to sort an array consisting of random numbers?
Thanks
For an array of random numbers, a least-significant-digit-first counting variation of radix sort is normally fastest for smaller arrays that fit within cache. For larger arrays, using one most-significant-digit-first pass to split the array into smaller sub-arrays that fit in cache will be faster. Since the data is random, the main time overhead for a radix sort is the randomly distributed writes, which are not cache friendly if the array is significantly larger than cache. If the original and working arrays fit within cache, the random-access writes don't incur a significant time penalty on most systems.
There is also a choice for the base used in a radix sort. For example 32 bit numbers can be sorted in 4 passes if using base 256 (8 bit "digits"). Using base 65536 (16 bit "digits") usually exceeds the size of the L1 and/or L2 caches, so it's not faster in most cases, even though it only takes two passes. For 64 bit numbers, four 11 bit "digits" and two 10 bit "digits" could be used to sort in 6 passes, instead of using eight 8 bit "digits" to sort in 8 passes. However, the 11/10 bit digit variation isn't going to be faster unless the array is large enough and the distribution of the random numbers is uniform enough to use up most of the storage used to hold the counts / indexes.
Link to a prior thread about radix sort variations:
Radix Sort Optimization
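To make the digit/pass trade-off concrete, here is a minimal C++ sketch (not from the linked thread) of a least-significant-digit-first counting radix sort using base 256: four passes over 32-bit keys, one per 8-bit "digit". Note that it performs no key comparisons, and each pass moves every element exactly once.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// LSD counting radix sort for 32-bit unsigned keys, base 256:
// one counting-sort pass per 8-bit "digit", four passes in total.
void radix_sort_u32(std::vector<uint32_t>& a) {
    std::vector<uint32_t> tmp(a.size());
    for (int shift = 0; shift < 32; shift += 8) {
        std::size_t count[256] = {0};
        for (uint32_t x : a)                       // histogram of the current digit
            ++count[(x >> shift) & 0xFF];
        std::size_t index[256];
        std::size_t sum = 0;
        for (int d = 0; d < 256; ++d) {            // prefix sums give the write positions
            index[d] = sum;
            sum += count[d];
        }
        for (uint32_t x : a)                       // stable scatter into the working array
            tmp[index[(x >> shift) & 0xFF]++] = x;
        a.swap(tmp);                               // result of this pass becomes the next input
    }
}
```

Switching to 16-bit digits would halve the pass count to two, but the count/index tables grow from 256 to 65536 entries, which is exactly the cache pressure described above.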
merge sort "best case"
For a standard merge sort, the number of moves is always the same, but if the data is already sorted, only half the number of compares is done.
quick sort / intro sort "best case"
The best case for quick sort is random data. Using the middle value for the partition doesn't make much difference for random data, but if the data is already sorted, it ends up being the best case. Intro sort generally adds extra code to check whether the recursion is getting too deep, at which point it switches to heap sort. For random data this shouldn't happen, so the check is just extra overhead.
Here you can see time complexities of various sorting algorithms in best, average and worst cases: http://bigocheatsheet.com/
As you want to compare the time complexities of sorting algorithms with random numbers, we can simply compare their average-case time complexities.
You can further look into their algorithms and analyze their time complexities.
https://www.geeksforgeeks.org/merge-sort/
https://www.geeksforgeeks.org/radix-sort/
https://www.geeksforgeeks.org/heap-sort/
https://www.geeksforgeeks.org/know-your-sorting-algorithm-set-2-introsort-cs-sorting-weapon/
Merge sort is a widely used sorting algorithm in the sort implementations of various libraries.
Merge sort sorts in O(n log n) time and O(n) space.
Heap sort sorts in O(n log n) time and O(1) space.
Radix sort sorts in O(nk) time and O(n + k) space.
Intro sort sorts in O(n log n) time and O(log n) space.
Intro sort is a mix of quick sort, insertion sort and heap sort.
Intro Sort is probably the best one.
There is no perfect algorithm, different algorithms have a different set of advantages and disadvantages.
Note: All time complexities are average case.
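As a practical reference point (worth verifying for your particular toolchain): std::sort in mainstream C++ standard libraries is typically implemented as an introsort, so a single library call is usually all that's needed. A minimal usage sketch:

```cpp
#include <algorithm>
#include <iostream>
#include <random>
#include <vector>

int main() {
    std::vector<int> v(100000);
    std::mt19937 rng(42);                               // fixed seed: reproducible "random" input
    std::uniform_int_distribution<int> dist(0, 1000000);
    for (int& x : v) x = dist(rng);

    std::sort(v.begin(), v.end());                      // typically introsort under the hood

    std::cout << std::boolalpha
              << std::is_sorted(v.begin(), v.end()) << '\n';  // prints: true
}
```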

Why is it said that heap sort is best for external sorting?

While studying sorting algorithms, I read that heap sort is used for external sorting.
I am not able to figure out how it differs as a sorting technique when we deal with external storage. Or what is it that heap sort does uniquely that makes it useful for external sorting?
Could someone explain this?
The external part of the sort is a k-way merge sort. Blocks or files of data on external media, such as hard drives, are repeatedly merged "k" at a time until a single sorted file is produced.
A min-heap is a common way to implement the internal portion of a k-way merge.
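A minimal in-memory sketch of that idea, assuming each already-sorted run is held in a std::vector<int>; a real external sort would stream the runs from disk files instead of vectors:

```cpp
#include <cstddef>
#include <functional>
#include <queue>
#include <utility>
#include <vector>

// k-way merge driven by a min-heap: the heap always exposes the smallest
// element among the heads of all runs, so popping it yields the output in order.
std::vector<int> kway_merge(const std::vector<std::vector<int>>& runs) {
    using Entry = std::pair<int, std::pair<std::size_t, std::size_t>>; // (value, (run, position))
    std::priority_queue<Entry, std::vector<Entry>, std::greater<Entry>> heap;

    for (std::size_t r = 0; r < runs.size(); ++r)
        if (!runs[r].empty())
            heap.push({runs[r][0], {r, 0}});            // seed the heap with each run's head

    std::vector<int> out;
    while (!heap.empty()) {
        auto [value, pos] = heap.top();
        heap.pop();
        out.push_back(value);
        auto [r, i] = pos;
        if (i + 1 < runs[r].size())                      // advance within the run we popped from
            heap.push({runs[r][i + 1], {r, i + 1}});
    }
    return out;                                          // fully merged, sorted output
}
```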
The initial pass to create the blocks or files of data could be just about any internal sort, one that is stable if stability is needed. In the case of sorting records, merge sort can be used to sort an array of pointers to the records, which reduces the space requirement since only the array of pointers requires a second array, as opposed to a second array for the records. It should be noted that sorting the pointers can be slower than sorting records, since sorting via pointers ends up random accessing records for compares, which isn't cache friendly.
Gnu sort for large text files is an example of an external sort. It reads a "chunk" of lines at a time, creating pointers to the lines, and uses merge sort on the pointers, then creates a temporary file for each chunk sorted. It then does a 16-way (16 is the default) merge on the temporary files until it reaches the final merge step where the final merge goes to the specified output file.
Link to source. It's a big program, partially because it has so many options.
http://git.savannah.gnu.org/cgit/coreutils.git/tree/src/sort.c
Let's take an example from the Linux kernel code:
This function does a heapsort on the given array. Sorting time is O(n log n) both on average and worst-case. While qsort is about 20% faster on average, it suffers from exploitable O(n*n) worst-case behavior and extra memory requirements that make it less suitable for kernel use.
From Wikipedia:
Heapsort also competes with merge sort, which has the same time bounds. Merge sort requires Ω(n) auxiliary space, but heapsort requires only a constant amount. Heapsort typically runs faster in practice on machines with small or slow data caches, and does not require as much external memory.
It's not that
heap sort is best for creation of initial runs in external sorting,
but that using a heap to create the initial runs leads to an expected initial run length of twice the heap size (for a uniform distribution of keys), and consequently to half as many initial runs as any method that sorts each batch of records and writes it out as a run (using the same amount of RAM).
With two-way merging, half as many initial runs saves an entire pass. With more advanced merging schemes, a high merge degree (the number of runs merged into one), or a small number of passes (the ratio of data size to RAM size), this advantage loses impact.
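A minimal in-memory sketch of the heap-based run creation described above (assuming a fixed budget of `memory` records in RAM); a real external sort would read records from the input stream and write each finished run to disk:

```cpp
#include <cstddef>
#include <functional>
#include <queue>
#include <utility>
#include <vector>

// Create initial runs with a min-heap: output the smallest record that can
// still extend the current run; records smaller than the last one written
// are set aside for the next run. For uniformly random keys the expected
// run length is about twice the heap (RAM) size.
std::vector<std::vector<int>> make_initial_runs(const std::vector<int>& input, std::size_t memory) {
    std::priority_queue<int, std::vector<int>, std::greater<int>> heap, next;
    std::size_t in = 0;
    while (in < input.size() && heap.size() < memory)
        heap.push(input[in++]);                          // fill the heap from the input

    std::vector<std::vector<int>> runs;
    std::vector<int> run;
    while (!heap.empty()) {
        int x = heap.top();
        heap.pop();
        run.push_back(x);                                // x extends the current run
        if (in < input.size()) {
            int y = input[in++];
            if (y >= x) heap.push(y);                    // y can still join the current run
            else        next.push(y);                    // y must wait for the next run
        }
        if (heap.empty()) {                              // current run is finished
            runs.push_back(std::move(run));
            run.clear();
            std::swap(heap, next);                       // start the next run
        }
    }
    return runs;
}
```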

Is it faster to read then sort or to sort while reading an array?

I'm programming in Pascal.
Is it faster to read an array, and then sort it (using, say, quick sort), or to sort it while reading it? I won't need the unsorted array anymore, so I can change the order of array as I read it.
Sorting the entire array at the end should be the default choice.
You gain absolutely nothing by incrementally sorting the array as you're reading it. What you're losing, however, is the flexibility of choosing the best sorting algorithm for the job.
See this: Average time complexity of quicksort vs insertion sort
If you sort while reading, you are effectively doing an insertion sort, so the question is which sort is faster.
Reading only takes O(n) time, which is less than the sorting complexity of O(n log n), thus making the reading time insignificant.
Read the whole array, then sort it. Otherwise you will need to use a heap, or some other structure that takes more space and time, to keep the data sorted while reading.
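The question is about Pascal, but as an illustration of the pattern recommended above (read everything first, sort once at the end), here is the same idea sketched in C++:

```cpp
#include <algorithm>
#include <iostream>
#include <vector>

int main() {
    std::vector<double> data;
    double x;
    while (std::cin >> x)                    // O(n): just read, in whatever order the data arrives
        data.push_back(x);

    std::sort(data.begin(), data.end());     // O(n log n): one sort at the end, algorithm of your choice

    for (double v : data)
        std::cout << v << '\n';
}
```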

What sorting algorithm fit this 'stream-like' condition?

I have a buffer receiving data, which means the data arrive like a 'stream' and there is latency in the 'IO'. What I am doing now is: when the buffer is full, I use qsort to sort the buffer and write the result to disk. But there is obvious latency while the qsort runs, so I am looking for other sorting algorithms that can start sorting while the data is being added to the buffer, in order to reduce the time consumed overall.
I don't know if I have made myself clear; leave any comments if needed, thanks.
Heap sort keeps the data permanently in a partially sorted condition and so is comparable to insertion sort. But it is substantially quicker and has a worst case of O(n log n) compared with O(n²) for insertion sort.
How is this going to work? Presumably at some point you have to stop reading from the stream, store what you have sorted, and start reading a new set of data?
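As a rough sketch of how the heap-based answer above could work in practice (an assumption, not the answerer's code): push each arriving item into a min-heap, and whenever the buffer fills, pop everything out in ascending order as one sorted block. Writing to std::cout stands in for the disk write.

```cpp
#include <cstddef>
#include <functional>
#include <iostream>
#include <queue>
#include <vector>

class StreamSorter {
public:
    explicit StreamSorter(std::size_t capacity) : capacity_(capacity) {}

    void push(int value) {
        heap_.push(value);                   // O(log n) per item, done while waiting on the stream
        if (heap_.size() == capacity_)
            flush();                         // buffer full: emit one sorted block
    }

    void flush() {
        while (!heap_.empty()) {             // pops come out smallest-first
            std::cout << heap_.top() << '\n';
            heap_.pop();
        }
    }

private:
    std::size_t capacity_;
    std::priority_queue<int, std::vector<int>, std::greater<int>> heap_;
};

int main() {
    StreamSorter sorter(4);                  // tiny buffer, just for demonstration
    int incoming[] = {5, 1, 4, 2, 9, 7, 3, 8};
    for (int v : incoming)
        sorter.push(v);
    sorter.flush();                          // flush whatever remains at end of stream
}
```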
I think merge sort or tree sort can be of great help. Look up why on Wikipedia.
When you can cut the huge input into reasonably large blocks, merge sort is more appropriate.
When you insert small pieces at a time, tree sort is more appropriate.
You want to implement an online sorting algorithm, i.e. an algorithm which runs while receiving the data in a streaming fashion. Search for online algorithms on the web and you may find other nice algorithms.
In your case I would use tree sort. It doesn't have better complexity than quicksort (both are O(n log n) most of the time and O(n²) in a few bad cases), but it amortizes the cost over each input. This means the delay after the last data item is added is not on the order of O(n log n), but O(log n).
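A minimal sketch of the tree-sort idea using std::multiset (a balanced search tree): each arriving element costs O(log n) to insert, and once the last element has arrived only an in-order traversal remains.

```cpp
#include <iostream>
#include <set>

int main() {
    std::multiset<int> tree;                 // stays sorted as elements arrive; duplicates allowed
    int incoming[] = {5, 1, 4, 1, 9, 7, 3, 8};
    for (int v : incoming)
        tree.insert(v);                      // O(log n) per incoming element

    for (int v : tree)                       // already in order: just walk the tree
        std::cout << v << '\n';
}
```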
You can try my Link Array structure. It should be OK for sequentially adding random data while keeping it sorted (look at the numbers in the table). It is a variation of the skip list approach but with easier implementation and logic (although the performance of a skip list should be better).
