The question is like this:
Assume we have N machines, and each machine stores and can manipulate N elements. How can we find the median of all N^2 elements at the lowest cost?
It has really been bothering me; I hope to get an answer from you guys, thanks!
Sorry, I wrote it down too simply. The elements stored on each machine are random and in no particular order. The cost includes I/O cost as well as communication between machines, RAM, and time; everything should be considered. I just want to find the most efficient way to get the median.
These are some solutions I have come up with:
Use an external sort, like merge sort or something else, and find the median.
Use bucket sort: divide all the elements into X consecutive buckets according to their values, so we can decide which bucket the median is in, then scan that bucket to get the median.
I think the O(N) algorithm for finding the kth element in "Introduction to Algorithms" should work here?
But still, all of these solutions need an extra machine to do the job. I'm wondering whether there is a way to get the median using only these N machines?
Thanks!
You'll need to have a process that counts all the values (the total across all the stores). Pick the middle index. Adjust the index to be an offset from the start of the items on the appropriate machine. Ask that machine to sort its items and return the value at that index.
Step 1: Sort the numbers at each machine individually
Step 2: Send each machine's median to a central place
Step 3: Sort the medians and send them to each machine
Step 4: For each element in the sorted medians, calculate its rank at the machine level
Step 5: Calculate the rank of each element over all machines (just sum the per-machine ranks)
Step 6: Find the two elements in the sorted medians between which the global median lies
Step 7: For the next iteration, consider only the elements between those two medians and repeat the whole thing again
In the worst case all the remaining elements in the second iteration will be on a single machine.
Complexity: pretty sure it is O(n log n) (i.e., counting the work across all machines it can be O(n^2 log n)). A single-process sketch of the idea follows below.
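For concreteness, here is a minimal single-process simulation of this idea in Python (the language, the function name distributed_select, and the pivot rule "median of the local medians" are my choices, not part of the answer above); each machine is just a list, and every cross-machine "message" is a count or a filter:

```python
def distributed_select(machines, k):
    # Return the element of global 0-based rank k across all machines.
    # Each round: machines report the median of their remaining candidates,
    # the coordinator picks the median of those medians as a pivot, gathers
    # per-machine ranks of the pivot, and discards the side that cannot
    # contain rank k.
    remaining = [sorted(m) for m in machines]          # local sorts (Step 1)
    while True:
        local_medians = [r[len(r) // 2] for r in remaining if r]
        pivot = sorted(local_medians)[len(local_medians) // 2]
        below = sum(sum(1 for v in r if v < pivot) for r in remaining)
        equal = sum(sum(1 for v in r if v == pivot) for r in remaining)
        if below <= k < below + equal:
            return pivot                               # pivot has global rank k
        if k < below:                                  # target is below the pivot
            remaining = [[v for v in r if v < pivot] for r in remaining]
        else:                                          # target is above the pivot
            k -= below + equal
            remaining = [[v for v in r if v > pivot] for r in remaining]

# machines = [[...], [...], ...]   # N lists of N unsorted numbers
# total = sum(len(m) for m in machines)
# median = distributed_select(machines, (total - 1) // 2)
```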
Can you estimate it rather than get it exactly?
If so, pick a constant K and fit a K-coefficient polynomial to the data on each machine, send the coefficients to a central machine that adds them and then finds the median by
Integrating the curve over the range to find the area under the curve
Doing a root-finding algorithm to find the point that splits the area in half.
The bigger K is, the less error there will be. The smaller K is, the more efficient it will be.
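A rough sketch of this estimation idea in Python; NumPy, the histogram step, and the parameters K=8 and bins=64 are assumptions of mine, and the bisection at the end assumes the summed fit stays monotone (non-negative density) over the range:

```python
import numpy as np

def approx_median(machines, K=8, bins=64):
    # Each "machine" fits a degree-K polynomial to a histogram of its values
    # and ships only the coefficients; the coordinator sums the polynomials,
    # integrates, and bisects for the point that splits the area in half.
    lo = min(min(m) for m in machines)
    hi = max(max(m) for m in machines)
    total = np.zeros(K + 1)
    for m in machines:
        counts, edges = np.histogram(m, bins=bins, range=(lo, hi))
        centers = (edges[:-1] + edges[1:]) / 2
        total += np.polyfit(centers, counts, K)        # K+1 coefficients per machine
    cdf = np.poly1d(total).integ()                     # area under the summed curve
    half = (cdf(hi) - cdf(lo)) / 2
    a, b = lo, hi
    for _ in range(100):                               # root-finding by bisection
        mid = (a + b) / 2
        if cdf(mid) - cdf(lo) < half:
            a = mid
        else:
            b = mid
    return (a + b) / 2
```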
This scenario was on my final exam, but I couldn't write up an algorithm for it.
Everfresh Cattle Farm has its annual Big Cattle Contest. Because Bilal Haleem's son Ali is majoring in computer science, the county hires him to computerize the Big Cattle judging. Each cattle's name (string) and weight (integer) are to be read in from the keyboard. The county expects 500 entries this year. The output needed is a listing of the ten heaviest cattle, sorted from biggest to smallest. Because Ali has just learned some sorting methods in school, he feels up to the task of writing this "pork-gram". He writes a program to read all the entries into an array of records, then uses a selection sort to put the entire array in order based on the weight member. He then prints the ten largest values from the array. Can you think of a more efficient way to write this program? If so, write the algorithm.
The solution is not to sort the 500 entries: you only have to keep track of the top 10 encountered so far, re-sorting it whenever a new value is larger than best10[9].
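A minimal sketch of that approach in Python (the function name and the (name, weight) pair layout are just for illustration):

```python
def ten_heaviest(entries):
    # entries: iterable of (name, weight) pairs.
    # Keep only the ten heaviest seen so far, biggest first; re-sort only
    # when a new entry beats the current 10th place (best10[9]).
    best10 = []
    for name, weight in entries:
        if len(best10) < 10:
            best10.append((weight, name))
            best10.sort(reverse=True)
        elif weight > best10[9][0]:
            best10[9] = (weight, name)
            best10.sort(reverse=True)
    return [(name, weight) for weight, name in best10]
```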
First of all, I would not say that this is a C++ question; it actually sounds quite similar to some interview questions. I will assume you care about worst-case running times, so I will use Big-Oh notation, and we will consider the general case where n is the number of cattle and k is the number of entries we want.
Start by looking at the running time of the algorithm they used, which is selection sort. The entire array is sorted, so the running time will be O(n^2 + k). Since k < n in this case, we get O(n^2).
The first observation to make is that we don't need to sort the entire array, only to get the k largest elements. In fact, if we run selection sort to select maximums, we have them after only k iterations of the algorithm, giving a running time of O(kn). This may have been enough for the exam question, but we can simplify further.
The next observation extends from the fact that we only care about the k largest elements we have encountered so far. Say we have this collection of elements and we encounter a new element; we need to determine whether it belongs in the collection. It suffices to check whether it is greater than the smallest element of the collection: if it is, then we know it must be in the collection.
Now the question becomes: which data structure should we use to store this? Since we care about fast insertion, deletion of the smallest, and retrieval of the smallest, the natural choice is a min-heap. Performing at most n operations on this heap, each costing log k (the maximum size of the heap), gives us O(n log k) time.
There are some further minor improvements we can make which actually reduce this to O((n - k) log k), though I will leave it as an exercise to you to work out why this is. You should also note that the question asks for the output in sorted order, so you would have to deconstruct the heap at the end taking an additional O(k log k) steps.
In terms of the code, this will also be left to you, so best of luck.
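That said, here is a bare-bones illustration of the min-heap approach in Python (a sketch under the assumptions above, not a finished solution; heapq and the function name are my choices):

```python
import heapq

def k_largest(values, k):
    # Min-heap of the k largest seen so far: the root is the smallest of the
    # current winners, so a new value only enters if it beats the root.
    heap = []
    for v in values:
        if len(heap) < k:
            heapq.heappush(heap, v)        # O(log k)
        elif v > heap[0]:
            heapq.heapreplace(heap, v)     # pop the root, push v: O(log k)
    return sorted(heap, reverse=True)      # the extra O(k log k) for sorted output
```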
p.s. homework / exam questions aren't the most welcome here and there are plenty of sources.
I've been learning data structures and algorithms from a book, which compares time efficiency in terms of the number of steps taken by various sorting algorithms. I'm confused as to what we define as one step while doing this.
While counting the number of steps, we consider worst-case scenarios. I understood how we come up with the number of steps for bubble sort, but for selection sort I am confused about the part where we compare every element with the current lowest value.
For example, take the worst-case array, say 5, 4, 3, 2, 1, and say we are in the first passthrough. When we start, 5 is the current lowest value. When we move to 4 and compare it to 5, we change the current lowest value to 4.
Why isn't this action of changing the current lowest value to 4 counted as a swap or an additional step? It is a step separate from the comparison step. The book I am referring to states that in the first passthrough the number of comparisons is n-1 but the number of swaps is only 1, even in the worst case, for an array of size n. They are assuming that the step of changing the current lowest value is part of the comparison step, which I think is not a valid assumption, since there can be an array in which you compare but don't need to change the current lowest value, so your number of steps would end up lower. The point being: we can't assume that the number of steps in the first passthrough of selection sort in the worst case is (n-1) comparisons + 1 swap; it should be more than (n-1) + 1.
I understand that both selection sort and bubble sort fall in the same time-complexity class under big-O methodology, but the book goes on to claim that selection sort takes fewer steps than bubble sort in worst-case scenarios, and I'm doubting that. This is what the book says: https://ibb.co/dxFP0
Generally in these kinds of exercises you’re interested in whether the algorithm is O(1), O(n), O(n^2) or something higher. You’re generally not interested in O(1) vs O(2) or in O(3n) vs O(5n) because for sufficiently large n only the power of n matters.
To put it another way, small differences in the cost of each step, maybe factors of 2 or 3 or even 10, don't matter compared with choosing an algorithm that does a factor of n = 300 or more additional work.
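If you want to settle the counting question concretely, here is a small Python sketch (my own, not from the book) that tallies comparisons, "current lowest" updates, and swaps separately for the 5, 4, 3, 2, 1 example:

```python
def selection_sort_counts(a):
    # Tally three kinds of steps separately: comparisons, updates of the
    # "current lowest" index, and actual swaps.
    a = list(a)
    comparisons = updates = swaps = 0
    for i in range(len(a) - 1):
        lowest = i
        for j in range(i + 1, len(a)):
            comparisons += 1
            if a[j] < a[lowest]:
                lowest = j
                updates += 1
        if lowest != i:
            a[i], a[lowest] = a[lowest], a[i]
            swaps += 1
    return comparisons, updates, swaps

print(selection_sort_counts([5, 4, 3, 2, 1]))  # (10, 6, 2): the first pass alone
# does 4 comparisons, 4 "current lowest" updates, and 1 swap.
```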
I've been comparing the run times of various pivot selection algorithms. Surprisingly, the simplest one, where the first element is always chosen, is the fastest. This may be because I'm filling the array with random data.
If the array has been randomized (shuffled), does it matter? For example, picking the median of 3 as the pivot is always(?) better than picking the first element as the pivot, but this isn't what I've noticed. Is it because, if the array is already randomized, there is no reason to assume sortedness, and using the median assumes there is some degree of sortedness?
The worst-case runtime of quicksort is O(n²); quicksort is a fast sorting algorithm only in the average case.
To reach an average runtime of O(n log n), you have to choose a random pivot element.
But instead of choosing a random pivot element, you can shuffle the list and choose the first element.
To see that this holds, look at it this way: say all the elements are in a specific order. Shuffling applies a random permutation to the list of elements, so a random element will end up at the first position (and at every other position). You can also see it by shuffling the list yourself: randomly choose one of all the elements for the first position, then randomly choose one of the remaining (not yet chosen) elements for the second position, and so on.
If your list is already a randomly generated list, you can directly choose the first element as the pivot without shuffling again.
So choosing the first element is the fastest because of the randomly generated input, but choosing the third or the last element would be just as fast as choosing the first.
All other ways to choose a pivot element have to compute something (a median or a random number or something like this), but they have no advantage over a random choice.
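A small Python sketch of "shuffle once, then always take the first element of the slice as the pivot" (the Lomuto partition and the helper names are my choices, purely illustrative):

```python
import random

def quicksort(values):
    # Shuffle once up front, then always take the first element of the
    # current slice as the pivot; the shuffle makes that a random pivot.
    a = list(values)
    random.shuffle(a)
    _qsort(a, 0, len(a) - 1)
    return a

def _qsort(a, lo, hi):
    if lo >= hi:
        return
    a[lo], a[hi] = a[hi], a[lo]            # move the first-element pivot aside
    pivot = a[hi]
    i = lo
    for j in range(lo, hi):                # Lomuto partition
        if a[j] < pivot:
            a[i], a[j] = a[j], a[i]
            i += 1
    a[i], a[hi] = a[hi], a[i]              # pivot into its final position
    _qsort(a, lo, i - 1)
    _qsort(a, i + 1, hi)
```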
A substantially late response, but I believe it will add some additional info.
"Surprisingly the simplest one where the first element is always chosen is the fastest."
This is actually not surprising at all, since you mentioned that you test the algorithm with random data. In reality, the share of almost-sorted and sorted data is much greater than would statistically be expected. Take chronological data, for example: when you collect it into a log file, some elements can be out of order, but most of them are already sorted. Unfortunately, a quicksort implementation that takes the first (or last) element as the pivot is vulnerable to such input, and it degenerates into O(n^2) complexity, because in the partition step you divide your array into two parts of size 1 and n-1 and therefore get n partitions instead of log n on average.
That's why people decided to add some sort of randomization that makes the probability of getting the problematic input as small as possible. There are three well-known approaches:
shuffle the input - to quote Robert Sedgewick, "the probability of getting O(n^2) performance with such approach is lower than the probability that you will be hit by a thunderstrike" :)
choose the pivot element randomly - Wikipedia says the expected number of comparisons in this case is 1.386 n log n on average
choose the pivot element as the median of three - Wikipedia says the expected number of comparisons in this case is 1.188 n log n on average
However, randomization has a cost. If you shuffle the input array, that is O(n), which is dominated by O(n log n), but you need to take into account the cost of invoking the random(..) method n times. With your simple approach, that is avoided, and it is thus faster.
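For completeness, the median-of-three rule is just a small helper on top of whatever partition scheme you use; a Python sketch (names are illustrative):

```python
def median_of_three(a, lo, hi):
    # Return the index of the median of a[lo], a[mid], a[hi]; using it as the
    # pivot is cheap insurance against sorted or reverse-sorted input.
    mid = (lo + hi) // 2
    candidates = sorted([(a[lo], lo), (a[mid], mid), (a[hi], hi)])
    return candidates[1][1]
```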
See also:
Worst case for Quicksort - when can it occur?
I'm trying to balance a set of 3D points (a million plus) using a KD-tree, and I have two ways of doing it.
Way 1:
Use an O(n) algorithm to find the arraysize/2-th largest element along a given axis and store it at the current node
Iterate over all the elements in the vector and for each, compare them to the element I just found and put those smaller in newArray1, and those larger in newArray2
Recurse
Way 2:
Use quicksort (O(n log n)) to sort all the elements in the array along a given axis, take the element at position arraysize/2, and store it in the current node
Then put all the elements from index 0 to arraysize/2-1 in newArray1, and those from arraysize/2 to arraysize-1 in newArray2
Recurse
Way 2 seems more "elegant", but Way 1 seems faster, since the median search and the iteration are both O(n), so I get O(2n), which reduces to O(n). At the same time, even though Way 2 spends O(n log n) on sorting, splitting the array in two can then be done in constant time; does that make up for the O(n log n) sorting time?
What should I do? Or is there an even better way to do this that I'm not even seeing?
How about Way 3:
Use an O(n) algorithm such as QuickSelect to ensure that the element at position length/2 is the correct element, all elements before it are smaller, and all elements after it are larger (without sorting them completely!) - this is probably the algorithm you used in step 1 of your Way 1 anyway...
Recurse into each half (except middle element) and repeat with next axis.
Note that you do not actually need to make "node" objects; you can keep the tree in one large array. When searching, start at length/2 with the first axis.
I've seen this trick being used by ELKI. It uses very little memory and code, which makes the tree quite fast.
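A compact Python sketch of Way 3 (for readability it builds explicit node dictionaries rather than the in-array layout mentioned above, copies the halves instead of recursing on index ranges, and assumes 3D points as tuples):

```python
import random

def build_kdtree(points, depth=0):
    # Way 3: move the median (by the current axis) to the middle position
    # with a quickselect-style partial partition, then recurse on each half.
    if not points:
        return None
    axis = depth % 3
    mid = len(points) // 2
    _select(points, mid, axis)                 # in place: points[mid] is the median
    return {
        "point": points[mid],
        "axis": axis,
        "left": build_kdtree(points[:mid], depth + 1),
        "right": build_kdtree(points[mid + 1:], depth + 1),
    }

def _select(pts, k, axis):
    # Quickselect: afterwards pts[k] holds the element that would sit at
    # index k if pts were sorted by the given axis (no full sort).
    lo, hi = 0, len(pts) - 1
    while lo < hi:
        p = random.randint(lo, hi)
        pts[p], pts[hi] = pts[hi], pts[p]
        pivot = pts[hi][axis]
        i = lo
        for j in range(lo, hi):
            if pts[j][axis] < pivot:
                pts[i], pts[j] = pts[j], pts[i]
                i += 1
        pts[i], pts[hi] = pts[hi], pts[i]
        if k == i:
            return
        if k < i:
            hi = i - 1
        else:
            lo = i + 1
```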
Another way:
Sort the points along each of the K dimensions: O(K N log N). This is performed only once; we will reuse the per-dimension sorted lists.
For the current dimension, find the median in O(1) time, split around the median in O(N) time, split the sorted arrays for each of the dimensions as well in O(KN) time, and recurse on the next dimension.
That way, you perform the sorts only at the beginning, and then (K+1) splits/filterings for each subtree, around a known value. For small K, this approach should be faster than the other approaches.
Note: The additional space needed for the algorithm can be decreased by the tricks pointed out by Anony-Mousse.
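A Python sketch of this presorted-lists approach, under the simplifying assumption that coordinates are distinct along each axis (ties would need an explicit tie-breaking rule):

```python
def build_presorted(sorted_lists, depth=0):
    # sorted_lists[d] holds the same point set sorted by coordinate d.
    # The median on the current axis is the middle of that axis's list (O(1));
    # each split then filters all K lists, preserving their order (O(KN)).
    k = len(sorted_lists)
    n = len(sorted_lists[0])
    if n == 0:
        return None
    axis = depth % k
    median = sorted_lists[axis][n // 2]
    cut = median[axis]
    left = [[p for p in lst if p[axis] < cut] for lst in sorted_lists]
    right = [[p for p in lst if p[axis] > cut] for lst in sorted_lists]
    return {
        "point": median,
        "axis": axis,
        "left": build_presorted(left, depth + 1),
        "right": build_presorted(right, depth + 1),
    }

# points = [(x, y, z), ...]
# sorted_lists = [sorted(points, key=lambda p: p[d]) for d in range(3)]  # K N log N, once
# tree = build_presorted(sorted_lists)
```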
Notice that if the query hyper-rectangle contains many points (all of them for example) it does not matter if the tree is balanced or not. A balanced tree is useful if the query hyper-rects are small.
I have a file that has 1,000,000 float values in it. I need to find the 10,000 largest values.
I was thinking of:
Reading the file
Converting the strings to floats
Placing the floats into a max-heap (a heap where the largest value is the root)
After all values are in the heap, removing the root 10,000 times and adding those values to a list/arraylist.
I know I will have
1,000,000 inserts into the heap
10,000 removals from the heap
10,000 inserts into the return list
Would this be a good solution? This is for a homework assignment.
Your solution is mostly good. It's basically a heapsort that stops after extracting K elements, which improves the running time from O(N log N) (for a full sort) to O(N + K log N). Here N = 1,000,000 and K = 10,000.
However, you should not do N inserts into the heap initially, as this would take O(N log N); instead, use a heapify operation, which turns an array into a heap in linear time.
If the K numbers don't need to be sorted, you can find the Kth largest number in linear time using a selection algorithm and then output all the numbers larger than it. This gives an O(N) solution.
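A short Python sketch of both suggestions (heapq stands in for "a heap" and numpy.partition for "a selection algorithm"; reading and parsing the file is left out):

```python
import heapq

def top_k(values, k=10_000):
    # Heapify all N values in O(N) (instead of N pushes at O(N log N)),
    # then pop the root K times: O(N + K log N) overall.
    heap = [-v for v in values]        # negate: heapq is a min-heap
    heapq.heapify(heap)
    return [-heapq.heappop(heap) for _ in range(k)]

# If the K values need not be sorted, a linear-time selection also works,
# e.g. numpy.partition(values, len(values) - k)[-k:] leaves the K largest
# (in arbitrary order) in the last K slots.
```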
How about using mergesort (O(n log n) operations in the worst case) to sort the 1,000,000 values into an array and then get the last 10,000 directly?
Sorting is expensive, and your input set is not small. Fortunately, you don't care about order. All you need is to know that you have the top X numbers. So, don't sort.
How would you do this problem if, instead of looking for the top 10,000 out of 1,000,000, you were looking for the top 1 (i.e. the single largest value) out of 100? You'd only need to keep track of the largest value you'd seen so far, and compare it to the next number and the next one until you found a larger one or you ran out of input. Could you expand that idea back out to the input size you're looking at? What would be the big-O (hint: you'd only be looking at each input number one time)?
Final note since you said this was homework: if you've just been learning about heaps in class, and you think your teacher/professor is looking for a heap solution, then yes, your idea is good.
Could you merge sort the values in the array after you have read them all in? That is a fast way to sort the values. Then you could take the entries from your_array[arraysize - 10000] to the end and know they are the 10,000 largest. Merge sort sounds like what you want. Also, if you really need speed, you could look into formatting your values for a radix sort; that would take a bit of work, but it could be the fastest way to solve this problem.