So, I know that if I have n data elements and p processors, an algorithm like merge sort will split the data into n/p elements per processor, run the sequential sorting algorithm on each subsection, and then merge the sorted pieces back together.
What I am wondering is whether this is always faster than simply running the sequential merge sort on all n elements.
I see the time complexity decreasing, since each processor operates on only n/p elements, but also increasing, since we have the additional time spent on the parallel coordination and the final merge...
I am thinking that if I set up the two equations, they should look something like this:
(N/p) * log2(N/p) + 2N - 2 = N * log2(N)
where the left side is the parallel algorithm and the right side the non-parallel one. Is this correct, and how would I go about solving it?
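For what it's worth, taking that cost model at face value (and ignoring the thread-management overhead it leaves out), the comparison can be rearranged rather than solved exactly; a sketch:

    % Parallel wins whenever T_par(N, p) < T_seq(N):
    \[
    \frac{N}{p}\log_2\frac{N}{p} + 2N - 2 \;<\; N \log_2 N .
    \]
    % Divide by N and drop the small -2/N term:
    \[
    \frac{\log_2 N - \log_2 p}{p} + 2 \;<\; \log_2 N
    \quad\Longleftrightarrow\quad
    \left(1 - \tfrac{1}{p}\right)\log_2 N \;>\; 2 - \frac{\log_2 p}{p} .
    \]
    % For p = 2 this reduces to (1/2) log_2 N > 3/2, i.e. N > 8, so under this
    % model the parallel version wins for any non-trivial N; in practice the
    % unmodelled overhead is what pushes the break-even point higher.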
Take the simple case of 2 processors. For simplicity (and since you don't say otherwise), I'll assume shared memory.
The sequential case will, in sequence:
1. Sort one half
2. Sort the other half
3. Merge the two halves
The parallel case will perform the same steps, but do steps 1 & 2 in parallel; the only added work is any overhead in managing the processes/threads.
As the number of processors goes up, the time it takes to do the parallel part diminishes, speeding things up until the gain is overtaken by the overhead of managing the parallel parts.
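As a concrete illustration of the two-processor case, here is a minimal sketch (shared memory assumed; std::async runs one half on a second thread, and std::sort stands in for the per-half sequential sort):

    #include <algorithm>
    #include <future>
    #include <vector>

    // Sort the two halves in parallel, then merge them sequentially.
    // The only extra work versus the sequential version is launching
    // and joining one asynchronous task.
    void parallel_merge_sort_2way(std::vector<int>& data) {
        const std::size_t mid = data.size() / 2;

        // Steps 1 & 2 in parallel: one half on another thread, one half here.
        auto left_done = std::async(std::launch::async, [&] {
            std::sort(data.begin(), data.begin() + mid);
        });
        std::sort(data.begin() + mid, data.end());
        left_done.wait();                               // synchronize before merging

        // Step 3: the sequential merge of the two sorted halves.
        std::inplace_merge(data.begin(), data.begin() + mid, data.end());
    }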
Assuming we are given k sorted arrays (each of size n), in which case is using a priority heap better than a traditional merge (similar to the one used in merge-sort) and vice-versa?
Priority Queue Approach: In this approach, we have a min heap of size k (initially, the first element from each of the arrays is added to the heap). We now remove the min element (from one of the input arrays), put this in the final array and insert a new element from that same input array. This approach takes O(kn log k) time and O(kn) space. Note: It takes O(kn) space because that's the size of the final array and this dominates the size of the heap while calculating the asymptotic space complexity.
Traditional Merge: In this approach, we merge the first 2 arrays to get a sorted array of size 2n. We repeat this for all the input arrays and after the first pass, we obtain k/2 sorted arrays each of size 2n. We repeat this process until we get the final array. Each pass has a time complexity of O(kn) since one element will be added to the corresponding output array after each comparison. And we have log k passes. So, the total time complexity is O(kn log k). And since we can delete the input arrays after each pass, the space used at any point is O(kn).
As we can see, the asymptotic time and space complexities are exactly the same in both the approaches. So, when exactly do we prefer one over the other? I understand that for an external sort the Priority Queue approach is better because you only need O(k) in-memory space and you can read and write each element from and back to disk. But how do these approaches stack up against each other when we have enough memory?
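For reference, the priority-queue approach described above could be sketched like this (a rough illustration, not a tuned implementation; the heap holds one (value, array, index) entry per input array):

    #include <functional>
    #include <queue>
    #include <tuple>
    #include <vector>

    // Merge k sorted arrays using a min-heap of size k.
    // Each pop/push is O(log k) and is done k*n times in total: O(kn log k).
    std::vector<int> k_way_merge(const std::vector<std::vector<int>>& arrays) {
        using Entry = std::tuple<int, std::size_t, std::size_t>;   // (value, array, index)
        std::priority_queue<Entry, std::vector<Entry>, std::greater<Entry>> heap;

        for (std::size_t a = 0; a < arrays.size(); ++a)
            if (!arrays[a].empty())
                heap.emplace(arrays[a][0], a, 0);       // first element of each array

        std::vector<int> result;
        while (!heap.empty()) {
            auto [value, a, i] = heap.top();
            heap.pop();
            result.push_back(value);
            if (i + 1 < arrays[a].size())               // refill from the same array
                heap.emplace(arrays[a][i + 1], a, i + 1);
        }
        return result;
    }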
The total number of operations, compares + moves, is about the same either way. A k-way merge does more compares but fewer moves. My system has an 8-way cache (Intel 3770K, 3.5 GHz), which in the case of a 4-way merge sort allows 4 lines of cache for the 4 input runs and 1 line of cache for the merged output run. In 64-bit mode there are 16 registers that can be used for working variables, 8 of them used as pointers to the current and end positions of each "run" (with compiler optimization).
On my system I compared a 4-way merge (no heap, ~3 compares per element moved) against a 2-way merge (~1 compare per move, but twice as many passes). The 4-way merge has 1.5 times the number of compares but 0.5 times the number of moves, so essentially the same number of operations, but the 4-way version is about 15% faster due to cache effects.
I don't know whether 16 registers are enough for a 6-way merge to be a tiny bit faster, and 16 registers are not enough for an 8-way merge (some of the working variables would be memory/cache based). Trying to use a heap probably wouldn't help, as the heap would be memory/cache based (not register based).
A k-way merge is mostly useful for external sorts, where compare time is ignored due to the much larger overhead of moves.
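A stripped-down illustration of the no-heap 4-way merge idea (just the comparison structure; none of the register-level tuning described above, so it is a sketch rather than the code actually benchmarked):

    #include <array>
    #include <vector>

    // Merge four sorted runs by picking the smallest head element each time:
    // roughly 3 comparisons per element moved, but only one pass over the data.
    std::vector<int> merge_4way(const std::array<std::vector<int>, 4>& runs) {
        std::array<std::size_t, 4> pos{};        // current index into each run
        std::vector<int> out;

        while (true) {
            int best = -1;                       // run with the smallest head element
            for (int r = 0; r < 4; ++r) {
                if (pos[r] >= runs[r].size()) continue;              // run exhausted
                if (best < 0 || runs[r][pos[r]] < runs[best][pos[best]])
                    best = r;
            }
            if (best < 0) break;                 // all runs exhausted
            out.push_back(runs[best][pos[best]++]);
        }
        return out;
    }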
The number of operations required to implement merge sort is:
6n(log n + 1) = 6n log n + 6n
log n + 1 is the number of levels in merge sort. What is the 6n here?
In the case of a crude merge sort: two reads to compare two elements, one read and one write to copy the smaller element to a working array, then later another read and another write to copy elements back to the original array, for a total of 6 memory accesses per element (except for boundary cases like reaching the end of a run, in which case the remainder of the other run is just copied without compares). A more optimized merge sort avoids the copy back step by alternating the direction of merge depending on the merge pass if bottom up, or the recursion level if top down, reducing the 6 to a 4. If an element fits in a register, then after a compare, the element will be in a register and will not have to be re-read, reducing the 6 to a 3.
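A crude single merge step with the memory accesses called out in comments might look like this (a sketch of the copy-to-working-array-and-back variant described above):

    #include <vector>

    // Merge data[lo..mid) and data[mid..hi) through a scratch buffer, then copy back.
    // Per element (ignoring run boundaries): 2 reads for the compare, 1 read + 1 write
    // to move it to the scratch buffer, 1 read + 1 write to copy it back: 6 accesses.
    void crude_merge(std::vector<int>& data,
                     std::size_t lo, std::size_t mid, std::size_t hi) {
        std::vector<int> scratch;
        scratch.reserve(hi - lo);

        std::size_t i = lo, j = mid;
        while (i < mid && j < hi) {
            if (data[i] <= data[j])               // 2 reads to compare
                scratch.push_back(data[i++]);     // 1 read + 1 write to move
            else
                scratch.push_back(data[j++]);
        }
        while (i < mid) scratch.push_back(data[i++]);   // leftover run: copied, no compares
        while (j < hi)  scratch.push_back(data[j++]);

        for (std::size_t k = 0; k < scratch.size(); ++k)
            data[lo + k] = scratch[k];            // 1 read + 1 write to copy back
    }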
I'm not sure what you mean by "what is 6n"? If you are asking about the complexity of your algorithm (merge sort), it can be reduced to n log(n). You can ignore the coefficients in your problem, as they are negligible when accounting for big-O complexity. When calculating n log(n) + n, you can also ignore the n term, as it grows at a much slower rate than n log(n). This leaves you with a complexity of n log(n).
I am really confused by Big(O) notation. Is Big(O) machine-dependent or machine-independent? (Machine in the sense of the computer on which we run the algorithm.)
Will sorting 1000 numbers using quicksort take the same time on an i3 processor and an i7 processor? Why don't we consider the machine and its processor speed when calculating time complexity? I am a neophyte in this stuff.
Big-O is a measure of scalability, not of speed. It shows you what effect on time and memory it has when you e.g. double the amount of data: does the execution time double, or quadruple?
Whether you use i7 or i3, double is double. Whether a linear algorithm is fast or slow, double is double.
This also has another implication many people ignore. A complex algorithm such as O(n^3) can be faster than a simple algorithm such as O(n) for a given n that is below a certain limit. Example:
loop n times:
    loop n times:
        loop n times:
            sleep 1 second
is O(n^3), as it has 3 nested loops.
loop n times:
    sleep 10 seconds
is O(n), as it only has one loop. For n = 10 the first program executes for 1000 seconds, and the second one executes for only 100. "So, O(n) is good!" one would be tempted to say. But if you have n = 2, the first, complex program executes in only 8 seconds, while the second, simpler one executes for 20! Even for n = 3, the first executes in 27 seconds, the second one in 30. So while n is low, a complex program might be able to outperform the simpler one. It's just that as n rises, the complex program gets slower much faster (if that makes sense) than a simple one. For n = 1000, the simple code has risen to only 10000 seconds, but the complex one is now at 1000000000 seconds!
Also, this clearly shows you that complexity is not processor-dependent. A second is a second.
EDIT: Also, you might want to read this question, where Big-O is explained in a number of very high-quality answers.
Big(O) Notation is the method of calculating the complexity of an algorithm, and hence the relative time it will take to run. The same algorithm, for the same data, will run faster on a faster processor, but will still take the same number of operations. It's used as a way of evaluating the relative efficiency of different algorithms to achieve the same result.
Big O notation is not architecture-dependent in any way; it is a mathematical construct. It is a very limited measure of algorithmic complexity: it only gives you a rough upper bound for how performance changes with data size.
Big(O) is algorithm-dependent. Its job is to help compare the relative costs of various algorithms, without the need to consider machine dependencies.
A linear search through an array will, on average, look at about half of the elements before finding a match (if one exists). For all practical purposes that is O(N/2), which is the same as O(1/2 * N). For comparison purposes you toss away the coefficient, hence it is O(N) in use.
A binary tree can hold N elements for searching as well. On average it will look through about log base 2 of N elements to find something, hence you will see its cost described as O(log2(N)).
Plug in small values of N and there isn't a whole lot of difference between the algorithms. Plug in a large value of N and it becomes clear that the binary tree lookup is much faster.
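To make the difference concrete, here is a small sketch that counts comparisons for both lookups on the same sorted data (the counts grow like N and like log2(N) respectively):

    #include <cstdio>
    #include <vector>

    // Count probes for a linear search and a binary search over the
    // same sorted array, searching for the last element in each case.
    int main() {
        for (std::size_t n : {16u, 1024u, 1u << 20}) {
            std::vector<int> a(n);
            for (std::size_t i = 0; i < n; ++i) a[i] = static_cast<int>(i);
            const int target = static_cast<int>(n) - 1;

            std::size_t linear = 0;
            for (std::size_t i = 0; i < n; ++i) { ++linear; if (a[i] == target) break; }

            std::size_t binary = 0, lo = 0, hi = n;
            while (lo < hi) {
                ++binary;
                std::size_t mid = lo + (hi - lo) / 2;
                if (a[mid] == target) break;
                if (a[mid] < target) lo = mid + 1; else hi = mid;
            }
            std::printf("n=%zu  linear probes=%zu  binary probes=%zu\n", n, linear, binary);
        }
    }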
Big(O) is not machine-dependent. It is mathematical notation used to denote the complexity of an algorithm. Usually we use these notations in theory to compare the performance of algorithms.
I want to implement a fast algorithm for a homework assignment, using parallel processing for this task. I heard that the parallel version of quicksort is the best choice, but I'm not sure of this... maybe heapsort is a good idea. Which algorithm do you think is the best one for a parallelized environment, and why?
Quick sort can split the unsorted list into two halves, but unfortunately, the halves aren't guaranteed to be anywhere near even. So one machine (or half of a cluster of machines) could get 20 entries, and the other half could get 20 billion.
I can't think of a good way to make heapsort work in parallel. It can be done, but man, that feels really counterintuitive.
Merge sort is the one I think you want.
Each split is exactly 50% of the list, so it's easy to split between processors.
You can implement merge sort on two sets of tape drives, which means that it doesn't require the whole list to be in memory at one time. For large lists, especially those that are larger than the memory you have available, that's a must-have.
Merge sort is also stable in parallel implementations, if it matters.
Merge sort is a great first parallel sorting technique. The best sort is always machine dependent and generally involves a combination of sorting techniques for different size inputs.
As Dean J mentions, merge sort is a good candidate. But it has the disadvantage of requiring a synchronization step when both threads are done (the merging process).
Though quicksort has the disadvantage of being unpredictable while partitioning, you can choose the first partition (which decides the per-processor load) deliberately so that it divides the load more or less evenly, and then let the algorithm take its course.
The advantage is that you don't need any synchronization after the processors are done with their work. Once they finish, you have the sorted array ready, without the need for an extra merging step, which might be costly.
How about thinking about this in two steps.
Step 1. Break my data down into N chunks, where N is my number of processors/nodes/cores. Sort each chunk.
Step 2. Combine my N chunks together.
For sorting the N chunks, you can use whatever you want, based on your data: quicksort, heapsort, I don't care. For step 2, merge sort's merge handles combining two sorted lists really well, so that is probably your best bet.
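A rough sketch of that two-step structure, assuming shared memory (each chunk is sorted with std::sort on its own task, then neighbouring chunks are merged pairwise, doubling the merged width each round):

    #include <algorithm>
    #include <future>
    #include <vector>

    // Step 1: sort n_chunks contiguous chunks in parallel.
    // Step 2: combine them with pairwise merges, log2(n_chunks) rounds in total.
    void chunk_sort(std::vector<int>& data, std::size_t n_chunks) {
        const std::size_t n = data.size();
        auto bound = [&](std::size_t c) { return std::min(n, c * n / n_chunks); };

        std::vector<std::future<void>> tasks;
        for (std::size_t c = 0; c < n_chunks; ++c)
            tasks.push_back(std::async(std::launch::async, [&, c] {
                std::sort(data.begin() + bound(c), data.begin() + bound(c + 1));
            }));
        for (auto& t : tasks) t.wait();

        // Merge chunk 0 with chunk 1, chunk 2 with chunk 3, ... then widths 2, 4, ...
        for (std::size_t width = 1; width < n_chunks; width *= 2)
            for (std::size_t c = 0; c + width < n_chunks; c += 2 * width)
                std::inplace_merge(data.begin() + bound(c),
                                   data.begin() + bound(c + width),
                                   data.begin() + bound(std::min(c + 2 * width, n_chunks)));
    }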
You should consider Bitonic Sort:
This algorithm is somewhat similar to merge sort, but it has an interesting twist: instead of sorting both halves of the array from lower to upper and then merging, you sort one half of the array in the opposite direction, obtaining a bitonic array, i.e. one comprising two monotonic parts running in opposite directions.
Bitonic arrays can be merged into sorted arrays in a highly parallelizable way: although the network performs O(n log^2(n)) comparisons in total (slightly more than merge sort), the choice of which elements to compare never depends on previous comparison results, unlike the usual merge. Consequently, it admits full parallelization.
This Youtube video demonstrates a bitonic sort.
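For reference, a serial sketch of the algorithm (assuming the array length is a power of two, the usual textbook restriction; all the compare/swaps at one level of bitonic_merge are independent, which is where the parallelism would go):

    #include <algorithm>
    #include <vector>

    // Compare/swap the two halves of a bitonic sequence, then recurse into each half.
    // All the swaps in the for-loop are independent and could run in parallel.
    void bitonic_merge(std::vector<int>& a, std::size_t lo, std::size_t n, bool ascending) {
        if (n <= 1) return;
        std::size_t m = n / 2;
        for (std::size_t i = lo; i < lo + m; ++i)
            if ((a[i] > a[i + m]) == ascending)
                std::swap(a[i], a[i + m]);
        bitonic_merge(a, lo, m, ascending);
        bitonic_merge(a, lo + m, m, ascending);
    }

    // Sort one half ascending and the other descending to form a bitonic sequence,
    // then merge it into a fully sorted run. Requires n to be a power of two.
    void bitonic_sort(std::vector<int>& a, std::size_t lo, std::size_t n, bool ascending = true) {
        if (n <= 1) return;
        std::size_t m = n / 2;
        bitonic_sort(a, lo, m, true);
        bitonic_sort(a, lo + m, m, false);
        bitonic_merge(a, lo, n, ascending);
    }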
PS - I'm guessing the asker's homework is already due... 3 years ago.
Quicksort is recursive. A simple way to make any recursive algorithm parallel (only if it involves two or more recursive calls, as quicksort does) is to spawn two new threads for the recursive calls, wait until they are done, and then finish your function. This is by no means optimal, but it is a fairly quick and dirty way of parallelizing recursive calls.
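A quick-and-dirty sketch of exactly that: one of the two recursive calls runs as an asynchronous task while the current thread handles the other, with a depth cutoff to keep the number of threads bounded (the three-way partition and the cutoff are added here for illustration, not part of the answer above):

    #include <algorithm>
    #include <future>
    #include <vector>

    // Quicksort where one recursive call runs as an asynchronous task.
    // 'depth' limits thread creation; below the cutoff we use plain recursion.
    void parallel_quicksort(std::vector<int>& a, std::ptrdiff_t lo, std::ptrdiff_t hi, int depth) {
        if (hi - lo < 2) return;

        const int pivot = a[lo + (hi - lo) / 2];             // middle element as pivot
        auto mid1 = std::partition(a.begin() + lo, a.begin() + hi,
                                   [pivot](int x) { return x < pivot; });
        auto mid2 = std::partition(mid1, a.begin() + hi,
                                   [pivot](int x) { return x == pivot; });
        const std::ptrdiff_t left_end = mid1 - a.begin();    // a[lo..left_end)   < pivot
        const std::ptrdiff_t right_begin = mid2 - a.begin(); // a[right_begin..hi) > pivot

        if (depth > 0) {
            auto task = std::async(std::launch::async, [&] {
                parallel_quicksort(a, lo, left_end, depth - 1);
            });
            parallel_quicksort(a, right_begin, hi, depth - 1);
            task.wait();                                     // join before returning
        } else {
            parallel_quicksort(a, lo, left_end, 0);
            parallel_quicksort(a, right_begin, hi, 0);
        }
    }

    // Usage: parallel_quicksort(v, 0, (std::ptrdiff_t)v.size(), 2);  // ~4 concurrent branches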
I actually worked on a parallel sorting algorithm for a parallelization library a while back and came to the conclusion that it's not worth doing. For small datasets the cost of even a few synchronization primitives makes the parallel sort slower than a regular sort. For large datasets, you're mostly bound by shared memory bandwidth and you get minimal speedups. For the case of sorting a large number (I think 10 million) integers, I was only able to get <1.5x speedup on a dual core using a parallel quick sort IIRC.
Edit:
Most of the programming I do is number crunching, so I tend to think in terms of sorting simple primitives. I still think a parallel sort is a bad idea for these cases. If you're sorting things that are expensive to compare, though, this answer doesn't apply.
How are algorithms analyzed? What makes quicksort have an O(n^2) worst-case performance while merge sort has an O(n log(n)) worst-case performance?
That's a topic for an entire semester. Ultimately, we are talking about the upper bound on the number of operations that must be completed before the algorithm finishes, as a function of the size of the input. We do not include the coefficients (e.g. the 10 in 10N or the 4 in 4N^2) because for N large enough, they don't matter anymore.
Proving the big-O of an algorithm can be quite difficult. It requires a formal proof, and there are many techniques. Often a good ad hoc way is to just count how many passes over the data the algorithm makes. For instance, if your algorithm has nested for loops, then for each of N items you must operate N times. That is generally O(N^2).
As for merge sort, you split the data in half over and over. That gives log2(n) levels, and for each level you make a pass over the data, which gives N log(n).
Quicksort is a bit trickier because in the average case it is also n log(n). You have to imagine what happens if your partition splits the data such that every time you get only one element on one side of the partition. Then you need to split the data n times instead of log(n) times, which makes it N^2. The advantage of quicksort is that it can be done in place, and that in practice we usually get close to N log(n) performance.
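The usual way to make that precise is with recurrences; a sketch, with all constants absorbed into c:

    % Merge sort: two half-size subproblems plus a linear-time merge.
    \[
    T(n) = 2\,T(n/2) + c\,n
    \;\Longrightarrow\;
    T(n) = c\,n\log_2 n + n\,T(1) = O(n\log n).
    \]
    % Quicksort, worst case: each partition peels off a single element.
    \[
    T(n) = T(n-1) + c\,n
    \;\Longrightarrow\;
    T(n) = c\sum_{k=1}^{n} k + T(0) = O(n^2).
    \]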
This is introductory analysis of algorithms course material.
An operation is defined (e.g. multiplication), and the analysis is performed in terms of either space or time.
This operation is counted in terms of space or time. Typically, analyses treat time as the dependent variable and input size as the independent variable.
Example pseudocode:
foreach $elem in #list
    op();
endfor
There will be n operations performed, where n is the size of #list. Count it yourself if you don't believe me.
Analyzing quicksort and merge sort requires a decent level of what is known as mathematical sophistication. Loosely, you solve a recurrence relation (the discrete analogue of a differential equation) derived from the recursive structure of the algorithm.
Both quicksort and merge sort split the array into two, sort each part recursively, then combine the results. Quicksort splits by choosing a "pivot" element and partitioning the array into elements smaller than and greater than the pivot. Merge sort splits arbitrarily down the middle and then merges the results in linear time. In both cases a single level of work is O(n), and if the array size halves each time this gives a logarithmic number of levels. So we would expect O(n log(n)).
However, quicksort has a worst case where the split is always uneven, so you don't get a number of levels proportional to the logarithm of n, but a number of levels proportional to n itself. Merge sort splits exactly into two halves (or as close as possible), so it doesn't have this problem.
Quicksort has many variants, depending on pivot selection.
Let's assume we always select the first item in the array as the pivot.
If the input array is already sorted, then quicksort degenerates into a kind of selection sort!
You are not really dividing the array; you are only peeling off one item in each cycle (see the sketch below).
On the other hand, merge sort will always divide the input array in the same manner, regardless of its content!
Also note: divide and conquer performs best when the divisions are of nearly equal length!
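To see the degenerate case concretely, here is a minimal first-element-pivot partition (a sketch for illustration); on an already sorted array nothing is smaller than the pivot, so every call splits off an empty left side and a right side of size n-1, and the recursion depth becomes n:

    #include <vector>

    // Lomuto-style partition using the FIRST element as the pivot.
    // Returns the pivot's final index.
    std::size_t partition_first(std::vector<int>& a, std::size_t lo, std::size_t hi) {
        const int pivot = a[lo];
        std::size_t store = lo;
        for (std::size_t i = lo + 1; i < hi; ++i)
            if (a[i] < pivot)
                std::swap(a[++store], a[i]);
        std::swap(a[lo], a[store]);
        return store;      // on sorted input this is always lo: the worst-case split
    }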
Analysing algorithms is a painstaking effort, and it is error-prone. I would compare it to a question like: what are my chances of being dealt two aces in a bridge game? One has to carefully consider all possibilities and must not overlook that the aces can arrive in any order.
So what one does to analyse these algorithms is go through the actual pseudocode of the algorithm and add up what a worst-case situation would cost. In the following I will paint with a broad brush.
For quicksort one has to choose a pivot to split the set. In a case of dramatically bad luck the set splits into a set of n-1 and a set of 1 each time, for n steps, where each step means inspecting up to n elements. This arrives at N^2.
For merge sort one starts by splitting the sequence into already-ordered runs. Even in the worst case that means at most n runs. Those can be combined two by two, then the larger sets combined two by two, and so on. Each combining pass touches about n elements, and there are about log(n) such passes: the (at most) n/2 first combinations deal with extremely small subsets, while only the last pass deals with subsets of about size n. This arrives at N log(N).