merge sort algorithm for stream data - algorithm

I am reading about merge sort at the link below:
http://www.eternallyconfuzzled.com/tuts/algorithms/jsw_tut_sorting.aspx
Merge sort's claim to fame is that it can easily be modified to handle
sequential data such as from a stream or a generator. Another huge
benefit is that when written carefully, merge sort does not require
that all items be present. It can sort an unknown number of items
coming in from a stream or generator, which is a very useful property.
My questions are:
1. My understanding is that merge sort requires the complete array, because we have to divide the array in the middle, sort each half independently, and then merge. How does merge sort work if not all items are present?
2. Can you give an algorithm, in simple terms, for how merge sort is used for items coming in from a stream?

The answers to 1 and 2 are somewhat related.
1. You can still perform a mergesort with an incomplete array, which would leave you with a sorted, partially complete array. The running time would still be O(n lg n). As for inserting the remaining items, you could either merge the partial array with the new items, or insert the new items one at a time, q.v. part 2 below. Inserting the remaining items one at a time would work best if the original array is nearly complete.
2. Assuming you are starting with a sorted array of numbers, as new numbers come in from the stream one by one, you would not have to run another mergesort. Instead, for each new item coming from the stream, you could simply walk through the sorted array in O(n) time and insert it at the right position.
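As a rough illustration of inserting streamed items into an already sorted array, here is a minimal Python sketch. The function name insert_from_stream is made up for this example. It uses the standard bisect.insort, which finds the slot by binary search, although the shift inside the list still costs O(n) per item.

```python
import bisect

def insert_from_stream(sorted_items, stream):
    """Insert each item from `stream` into the already-sorted list, in place."""
    for item in stream:
        # Binary search finds the slot in O(log n); shifting the tail is still O(n).
        bisect.insort(sorted_items, item)
    return sorted_items

existing = [1, 4, 9, 16]
insert_from_stream(existing, iter([10, 2, 7]))
print(existing)
```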

Merge sort works by dividing the list into subsets, sorting the subsets then putting it all back together. If the subset is one element long (the item being streamed in) then it is already sorted and just needs to be put in with the existing set of elements. Technically, I would call this insertion sort. The hard part here is determining where to put the new element. You could use an array which makes finding the right place quite easy but adding new items requires moving data around to make space (sometimes reallocating the array). Alternatively, you could store the data as a linked list so adding items is trivial but determining where to put new items is trickier. Swings and roundabouts.
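One possible way to keep this a real merge sort rather than an insertion sort is to treat every incoming item as a sorted run of length one and merge runs of similar size as they pile up, so the total work stays O(n log n) without knowing n in advance. The sketch below is only an illustration of that idea, not the tutorial's code; the names merge and stream_merge_sort are invented here.

```python
def merge(left, right):
    """Stable merge of two sorted lists; ties are taken from `left` first."""
    out = []
    i = j = 0
    while i < len(left) and j < len(right):
        if right[j] < left[i]:
            out.append(right[j])
            j += 1
        else:
            out.append(left[i])
            i += 1
    out.extend(left[i:])
    out.extend(right[j:])
    return out

def stream_merge_sort(stream):
    """Sort items from an iterable of unknown length."""
    runs = []  # stack of sorted runs, oldest at the bottom
    for item in stream:
        run = [item]  # a single item is a sorted run
        # Merge with the run on top of the stack while it is not longer than ours.
        while runs and len(runs[-1]) <= len(run):
            run = merge(runs.pop(), run)
        runs.append(run)
    # The stream has ended: merge whatever runs are left, newest first.
    result = []
    while runs:
        result = merge(runs.pop(), result)
    return result

print(stream_merge_sort(x for x in [5, 2, 9, 1, 5, 6, 3]))
```

The stack of runs behaves like a binary counter, so each item takes part in O(log n) merges.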

Related

What is the best algorithm to sort almost sorted books in a library

Suppose there is a library where all books are already sorted except for one book that is not stored in the place where it should be. Which algorithm should I use to sort this library in the shortest time? Currently I think it would be merge sort with O(n log n). Is any algorithm faster than this?
1. Find the location where the book belongs (an index i) with a linear scan
2. Shift all the books with index j>i one place to the right
3. Place the book in the newly freed space
This is done in O(n) with only 2 passes on the data, and can be optimized to be done in one pass only, by combining steps (1) and (2).
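The combined one-pass version could look roughly like the Python sketch below. It assumes the misplaced book has already been taken off the shelf, and place_book is a hypothetical name. Scanning from the right lets the search and the shift happen in the same loop.

```python
def place_book(shelf, book):
    """Insert `book` into the sorted list `shelf` in one right-to-left pass."""
    shelf.append(book)  # make room at the end
    i = len(shelf) - 1
    # Shift larger books one slot to the right until the correct slot is found.
    while i > 0 and shelf[i - 1] > book:
        shelf[i] = shelf[i - 1]
        i -= 1
    shelf[i] = book
    return shelf

print(place_book([1, 3, 5, 7], 4))
```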
Also note:
This is basically doing mergesort with two arrays (but with O(1) additional space for this variant):
(1) Original (sorted array)
(2) New book (which is also a sorted array, of size 1)
Since array (1), the original, is already sorted, you can skip the recursive calls on it and go right to the merge step of mergesort.
If you're sure that it's almost sorted, then insertion sort or even bubble sort would be the best. Merge sort wouldn't be the best, since it makes too many order checks.

Efficiently Filtering out Sorted Data With A Second Predicate

Let's say that I have a list (or a hashmap etc., whatever makes this the fastest) of objects that contain the following fields: name, time added, and time removed. The list given to me is already sorted by time removed. Now, given a time T, I want to filter out (remove from the list) all objects of the list where:
the time T is greater than an object's time removed OR T is less than an object's time added.
So after processing, the list should only contain objects where T falls in the range specified by time added and time removed.
I know I can do this easily in O(n) time by going through each individual object, but I was wondering if there was a more efficient way considering the list was already sorted by the first predicate (time removed).
*Also, I know I can easily remove all objects with time removed less than T because the list is presorted (possibly in O(log n) time, since I can do a binary search to find the first element whose time removed is not less than T and then remove the first part of the list up to that object).
(Irrelevant additional info: I will be using C++ for any code that I write)
Unfortunately you are stuck with O(n) being your fastest option. That is, unless there are hidden requirements about the difference between time added and time removed (such as a max time span) that can be exploited.
As you said, you can start the search at the first object whose time removed equals (or is the first greater than) T. Unfortunately you'll need to go through the rest of the list to see if time added is less than your time.
Because a comparative sort is at best O(n*log(n)) you cannot sort the objects again to improve your performance.
One thing, based on the heuristics of the application it may be beneficial to receive the data in order of date added but that is between you and wherever you get the data from.
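To make that concrete, here is a small Python sketch (the question mentions C++, so treat this only as an illustration). The Entry record and the function names first_not_before and active_at are assumptions made for the example. A binary search skips everything whose time removed is before T, and the rest still needs a linear scan on time added.

```python
from collections import namedtuple

# Hypothetical record type; field names follow the question's description.
Entry = namedtuple("Entry", "name added removed")

def first_not_before(entries, t):
    """Index of the first entry with removed >= t (entries sorted by removed)."""
    lo, hi = 0, len(entries)
    while lo < hi:
        mid = (lo + hi) // 2
        if entries[mid].removed < t:
            lo = mid + 1
        else:
            hi = mid
    return lo

def active_at(entries, t):
    """Entries with added <= t <= removed."""
    start = first_not_before(entries, t)  # O(log n) on the sorted field
    # The added field is unsorted, so the remainder is scanned linearly.
    return [e for e in entries[start:] if e.added <= t]

entries = [Entry("a", 0, 2), Entry("b", 1, 5), Entry("c", 6, 8), Entry("d", 3, 9)]
print([e.name for e in active_at(entries, 4)])
```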
Let's examine the data structures you offered:
A list (usually implemented as a linked list, or a dynamic array), or a hash map.
Linked List: Cannot do binary search, finding the first occurrence of an element (even if the list is sorted) is done in O(n), so no benefit from the fact the data is sorted.
Dynamic Array: Removing a single element (or more) from arbitrary location requires shifting all the following elements to the left, and thus is O(n). You cannot remove elements from the list better than O(n), so no gain here from the fact the DS is sorted.
HashMap: is unsorted by definition. Also, removing k elements is O(k), no way to go around this.
So, you cannot even improve performance from O(n) to O(logn) for the same field the list was sorted by.
Some data structures such as B+ trees do allow efficient range queries, and you can pretty efficiently [O(logn)] remove a range of elements from the tree.
However, it does not help you to filter the data of the 2nd field, which the tree is unsorted by, and to filter according to it (unless there is some correlation you can exploit) - will still need O(n) time.
If all you are going to do is to later on iterate the new list, you can push the evaluation to the iteration step, but there won't be any real benefit from it - only delaying the processing to when it's needed, and avoiding it, if it is not needed.

Should you sort a list when getting or setting it?

A decision I often run into is when to sort a list of items. When an item is added, keeping the list sorted at all times, or when the list is accessed.
Is there a best practice for better performance, or is it just the matter of saying: if the list is mostly accessed, sort it when it is changed or vice versa.
Sorting the list at every access is a bad idea. Instead, keep a flag which you set when the collection is modified. Only if this flag is set do you need to sort on access, and then reset the flag.
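A minimal sketch of that flag idea in Python could look like this. The class name LazySortedList and its methods are invented for illustration; the point is only that the sort runs at most once per batch of modifications.

```python
class LazySortedList:
    """Sorts only on access, and only if something changed since the last sort."""

    def __init__(self):
        self._items = []
        self._dirty = False  # set whenever the collection is modified

    def add(self, item):
        self._items.append(item)
        self._dirty = True

    def items(self):
        if self._dirty:
            self._items.sort()
            self._dirty = False  # reset until the next modification
        return self._items

lst = LazySortedList()
for value in (3, 1, 2):
    lst.add(value)
print(lst.items())
```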
But the best is if you have a data structure which is per definition always sorted. That means, if you insert a new element, the element is automatically inserted at the right index, thus keeping the collection sorted.
I don't know which platform / framework you are using. I know .NET provides a SortedList class which manages that kind of insertion-sort algorithm for you.
The answer is a big "it depends". You should profile and apply the strategy that is best for your case.
If you want performance on accessing or finding elements, a good decision is to keep the list sorted using insertion sort (http://en.wikipedia.org/wiki/Insertion_sort).
Sorting the list on access may be an option only in some very particular scenarios, when there are many insertions, few accesses, and performance is not very important.
But there are many other options, like maintaining a variable that says "list is sorted" and sorting at every n-th insertion, on idle, or on access (if you need to).
I usually think about it this way:
If the list is filled all at once and only after this is read, then add elements in non-sorted order and sort it just at the end of filling (in complexity terms it requires O(n log n) plus the complexity of filling, and that's usually faster than sorting while adding elements)
Conversely, if the list needs to be read before it is completely filled, then you have to add elements in sorted order (maybe using some special data structure doing the work for you, like sortedlist, red-black tree etc.)

Bucket sort for integers

Could anybody help me with the bucket sort algorithm for integers? Often people mistakenly say they have this algorithm, but actually have a counting sort! Maybe it works similarly, but it is something different.
I hope you will help me find the right way, because now I have no idea (Cormen's book and Wikipedia are not so helpful).
Thanks in advance for all your responses.
Bucket sort can be seen as a generalization of counting sort; in fact, if each bucket has size 1 then bucket sort degenerates to counting sort.
Counting sort will only work on integers, while bucket sort can work on anything with a value. Also, the final loop is a bit different.
Counting sort maintains an additional array of ints, which counts how many times each number occurs, and then recreates the numbers when it goes through the additional array in the final loop. What I mean by this is that, in an OOP way of looking at it, the result is not the same object, but a new object with an identical value.
Then, we have bucket sort. Bucket sort goes through the array, but instead of just going ++ in the relevant place in the array, it inserts the item into a list of some kind (I like to use a queue, that way it's a stable sort). Then, in the final loop, the algorithm goes through the entire additional array, and dequeues the elements in each bucket into the array. That way it's the same object.
If you're sorting anything and you know that the range of numbers is smaller than n log n, it's simple: use counting sort if it's integers, and bucket sort if the object has some additional data. You can use bucket sort for integers, sure, but counting sort will take much less space.
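To show the difference side by side, here is a rough Python sketch of both as described in this answer, with one bucket per key value and a queue per bucket. The function names and the key/max_key parameters are assumptions for the example. Note that general bucket sort may instead let each bucket cover a range of values and sort within the bucket.

```python
from collections import deque

def bucket_sort(items, key, max_key):
    """Stable sort of items whose integer key(item) lies in 0..max_key."""
    buckets = [deque() for _ in range(max_key + 1)]
    for item in items:
        buckets[key(item)].append(item)  # enqueue the object itself
    out = []
    for bucket in buckets:
        while bucket:
            out.append(bucket.popleft())  # dequeue in arrival order (stable)
    return out

def counting_sort(values, max_value):
    """Sort plain integers in 0..max_value by counting occurrences."""
    counts = [0] * (max_value + 1)
    for v in values:
        counts[v] += 1
    out = []
    for v, c in enumerate(counts):
        out.extend([v] * c)  # values are recreated, not moved
    return out

pairs = [(3, "c"), (1, "a"), (3, "b"), (0, "z")]
print(bucket_sort(pairs, key=lambda p: p[0], max_key=3))
print(counting_sort([3, 1, 3, 0], max_value=3))
```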

Inserting items in a list that is frequently insertion sorted

I have a list that is frequently insertion sorted. Is there a good position (other than the end) for adding to this list to minimize the work that the insertion sort has to do?
The best place to insert would be where the element belongs in the sorted list. This would be similar to preemptively insertion sorting.
Your question doesn't make sense. If the list is insertion sorted, you can't append to the end by definition; the element will still end up in the place where it belongs. Otherwise, the list wouldn't be sorted.
If you have to add lots of elements, then the best solution is to clone the list, add all elements, sort the new list once and then replace the first list with the clone.
[EDIT] In reply to your comments: After doing a couple of appends, you must sort the list before you can do the next sorted insertion. So the question isn't how you can make the sorted insertion cheaper but the sort between appends and sorted insertions.
The answer is that most sorting algorithms do pretty well with partially sorted lists. The questions you need to ask are: What sorting algorithm is used, what properties does it have and, most importantly, why should you care.
The last question means that you should measure performance before you do any kind of optimization because you have a 90% chance that it will hurt more than it helps unless it's based on actual numbers.
Back to the sorting. Java uses a version of quicksort to sort collections. Quicksort will select a pivot element to partition the collection. This selection is crucial for the performance of the algorithm. For best performance, the pivot element should be as close to the element in the middle of the result as possible. Usually, quicksort uses an element from the middle of the current partition as a pivot element. Also, quicksort will start processing the list with the small indexes.
So adding the new elements at the end might not give you good performance. It won't affect the pivot element selection but quicksort will look at the new elements after it has checked all the sorted elements already. Adding the new elements in the middle will affect the pivot selection and we can't really tell whether that will have an influence on the performance or not. My instinctive guess is that the pivot element will be better if quicksort finds sorted elements in the middle of the partitions.
That leaves adding new elements at the beginning. This way, quicksort will usually find a perfect pivot element (since the middle of the list will be sorted) and it will pick up the new elements first. The drawback is that you must copy the whole array for every insert. There are two ways to avoid that: a) As I said elsewhere, today's PCs copy huge amounts of RAM in almost no time at all, so you can just ignore this small performance hit. b) You can use a second ArrayList, put all the new elements in it and then use addAll(). Java will do some optimizations internally for this case and just move the existing elements once.
[EDIT2] I completely misunderstood your question. For the insertion sort algorithm, the best place is probably somewhere in the middle. This should halve the chances that you have to move an element through the whole list. But since I'm not 100% sure, I suggest creating a couple of small tests to verify this.
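One such small test could be sketched in Python as below. It is only an experiment under assumed conditions (distinct integers, one new element at a time, a textbook insertion sort), and the helper names are invented. It counts how many element shifts insertion sort performs when the new element is dropped at the start, the middle, or the end of a sorted list.

```python
def insertion_sort_shifts(items):
    """Insertion-sort a copy of `items` and return the number of element shifts."""
    a = list(items)
    shifts = 0
    for i in range(1, len(a)):
        x = a[i]
        j = i
        while j > 0 and a[j - 1] > x:
            a[j] = a[j - 1]  # move the larger element one slot right
            j -= 1
            shifts += 1
        a[j] = x
    return shifts

def shifts_when_added_at(sorted_items, new_item, position):
    """Shifts needed to re-sort after placing `new_item` at `position`."""
    trial = list(sorted_items)
    trial.insert(position, new_item)
    return insertion_sort_shifts(trial)

base = list(range(0, 100, 2))  # 50 sorted even numbers
for label, pos in [("start", 0), ("middle", len(base) // 2), ("end", len(base))]:
    # Try every odd value from 1 to 99 as the new element and total the shifts.
    total = sum(shifts_when_added_at(base, x, pos) for x in range(1, 100, 2))
    print(label, total)
```

In this particular setup the middle position needs about half as many shifts in total as either end, which is consistent with the guess above, but real data may behave differently.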
