sorting with minimum comparison

Give a strategy to sort four distinct integers a, b, c, d in increasing order that minimizes the number of pairwise comparisons needed to sort any permutation of a, b, c, d.
I think that since there are 4 integers there are 4! = 24 possible orderings, and since 2^4 = 16 < 24 <= 2^5 = 32 we need at least 5 comparisons.
But what strategy should I use here?
Can I use counting sort, a BBST, or anything else to minimize it?

A sorting network will sort 4 numbers with 5 compare/swaps. The first two compare/swaps are independent of each other, as are the next two, so each of those pairs can be performed in parallel in a hardware-type implementation. It's possible that if the potentially parallel compare/swaps are done using 4 separate registers, some processors might optimize this by at least overlapping the compare/swaps.
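A minimal sketch of such a network in Python, assuming the standard 5-comparator layout (0,1), (2,3), (0,2), (1,3), (1,2); the names `sort4` and `compare_swap` are made up for illustration:

```python
def compare_swap(x, y):
    """One comparator: return the pair in increasing order."""
    return (x, y) if x <= y else (y, x)


def sort4(a, b, c, d):
    """Sort four values with exactly five comparisons."""
    # Layer 1: these two comparators touch disjoint inputs.
    a, b = compare_swap(a, b)
    c, d = compare_swap(c, d)
    # Layer 2: again disjoint; a becomes the minimum, d the maximum.
    a, c = compare_swap(a, c)
    b, d = compare_swap(b, d)
    # Layer 3: order the two middle values.
    b, c = compare_swap(b, c)
    return a, b, c, d
```

Every call performs the same five comparisons regardless of the input order, which matches the lower bound of 5 from the question.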

Related

Partitioning N arrays into K groups with constraints

I have been stuck on this problem and can't find an efficient solution.
I have N (up to 10 million) arrays of at most 100 elements each. These arrays contain numbers from 1 to 10000.
My problem is to partition these arrays into K groups such that I minimize the duplicates across all the arrays. For example, given an array containing 1, 4, 10, 100 and another containing 1, 100, I would like them to go into the same group because that minimizes duplication. My problem has two constraints:
I don't want the number of unique elements in a group of arrays to grow beyond 110. So if I have an array of size 100 and another array of size 100 that is a 60% match, I would rather create a new group, because combining them increases the number of unique elements to 140, and this will keep increasing.
The number of vectors in the groups should be uniformly distributed.
So far I have tried grouping these arrays by size in decreasing order, then finding unique vectors using hashing and applying a greedy algorithm of maximum match under the constraints. But the greedy approach doesn't seem to work well, because the result depends entirely on the partitions I pick first. I couldn't figure out how DP could be applied, because the number of combinations given the total number of vectors is just huge. I am not sure what methodology I should take.
One failure case of my algorithm: say there are two vectors that are mutually exclusive, but if I formed a group with them, a third vector would match that group 100%. Instead, the third vector matched only 30% with some other group, was added there, and filled that group. This increases my duplication, because the third vector should have formed a group with the first two vectors.

A simple approach, though intensive in computation and memory: for each array, iterate over all 10 million arrays to find the one with the maximum number of matching elements. Store the matched numbers in a new array, then find matches among those arrays in the same way, with the criterion that a match should be at least 60%.

Find medians in multiple sub ranges of an unordered list

E.g. given an unordered list of N elements, find the medians for sub ranges 0..100, 25..200, 400..1000, 10..500, ...
I don't see any better way than going through each sub range and running a standard median-finding algorithm.
A simple example: [5 3 6 2 4]
The median for 0..3 is 5 . (Not 4, since we are asking the median of the first three elements of the original list)

INTEGER ELEMENTS:
If your elements are integers, the best way is to have a bucket for each number that lies in any of your sub-ranges, where each bucket counts how many times its associated integer occurs in your input (for example, bucket[100] stores how many 100s there are in your input sequence). Basically you can achieve it in the following steps:
Create buckets for each number that lies in any of your sub-ranges.
Iterate through all elements; for each number n, if we have bucket[n], then bucket[n]++.
Compute the medians based on the aggregated values stored in your buckets.
To put it another way, suppose you have a sub-range [0, 10] and you would like to compute the median. The bucket approach basically computes how many 0s there are in your input, how many 1s, and so on. Suppose n numbers lie in the range [0, 10]; then the median is the (n/2)-th smallest element, which can be identified by finding the i such that bucket[0] + bucket[1] + ... + bucket[i] is greater than or equal to n/2 but bucket[0] + ... + bucket[i - 1] is less than n/2.
The nice thing about this is that even if your input elements are stored on multiple machines (i.e., the distributed case), each machine can maintain its own buckets, and only the aggregated values need to pass through the network.
You can also use hierarchical buckets, which involves multiple passes. In each pass, bucket[i] counts the number of input elements that lie in a specific range (for example, [i * 2^K, (i+1) * 2^K)); you then narrow down the problem space by identifying which bucket the median lies in, decrease K by 1 in the next pass, and repeat until you can correctly identify the median.
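A rough sketch of the single-pass bucket idea in Python. It follows this answer's reading of a sub-range as a range of values [low, high], and it picks the lower median (rank ceil(n/2)) when the count is even; both choices, and the name `bucket_median`, are assumptions made for illustration.

```python
from collections import Counter


def bucket_median(values, low, high):
    """Median of the integers in `values` that fall in [low, high]."""
    # One counting bucket per integer value inside the range.
    buckets = Counter(v for v in values if low <= v <= high)
    n = sum(buckets.values())
    if n == 0:
        return None  # nothing falls inside the range
    target = (n + 1) // 2  # 1-based rank of the lower median
    running = 0
    for v in range(low, high + 1):
        running += buckets[v]  # missing values count as 0
        if running >= target:
            return v
```

In the distributed case each machine would build its own `Counter`, and the counters can be added together before the final scan.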
FLOATING-POINT ELEMENTS:
All of the elements fit into memory:
If all of your elements fit into memory, first sorting the N elements and then finding the median for each sub-range is the best option. The linear-time heap solution also works well in this case if the number of your sub-ranges is less than log N.
The elements do not fit into memory but are stored on a single machine:
Generally, an external sort requires three disk scans. Therefore, if the number of your sub-ranges is greater than or equal to 3, first sorting the N elements and then finding the median for each sub-range by loading only the necessary elements from disk is the best choice. Otherwise, simply performing a scan for each sub-range and picking up the elements in that sub-range is better.
The elements are stored on multiple machines:
Since finding a median is a holistic operator, meaning you cannot derive the final median of the entire input from the medians of several parts of the input, this is a hard problem whose solution cannot be described in a few sentences, but there is research (see this as an example) focused on this problem.

I think that as the number of sub ranges increases you will very quickly find that it is quicker to sort and then retrieve the element numbers you want.
In practice, because there will be highly optimized sort routines you can call.
In theory, and perhaps in practice too, because, since you are dealing with integers, you need not pay n log n for a sort - see http://en.wikipedia.org/wiki/Integer_sorting.
If your data are in fact floating point and not NaNs, then a little bit-twiddling will in fact allow you to use integer sort on them. From http://en.wikipedia.org/wiki/IEEE_754-1985#Comparing_floating-point_numbers: "The binary representation has the special property that, excluding NaNs, any two numbers can be compared like sign and magnitude integers (although with modern computer processors this is no longer directly applicable): if the sign bit is different, the negative number precedes the positive number (except that negative zero and positive zero should be considered equal), otherwise, relative order is the same as lexicographical order but inverted for two negative numbers; endianness issues apply."
So you could check for NaNs and other funnies, pretend the floating point numbers are sign + magnitude integers, subtract when negative to correct the ordering for negative numbers, and then treat as normal 2s complement signed integers, sort, and then reverse the process.
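One way to sketch this in Python, using a closely related variant rather than the exact subtract-and-reinterpret steps above: the 64-bit pattern of each double is mapped to an unsigned integer whose natural order matches the numeric order. The name `float_sort_key` is hypothetical, and NaNs are rejected rather than ordered.

```python
import struct


def float_sort_key(x):
    """Map a non-NaN float to an unsigned 64-bit integer with the same order."""
    if x != x:
        raise ValueError("NaN has no position in the ordering")
    # Reinterpret the IEEE 754 double as a big-endian unsigned 64-bit integer.
    (bits,) = struct.unpack(">Q", struct.pack(">d", x))
    if bits >> 63:
        # Negative: flip every bit so larger magnitudes give smaller keys.
        return bits ^ 0xFFFFFFFFFFFFFFFF
    # Non-negative: set the sign bit so these keys sit above all negatives.
    return bits | (1 << 63)
```

With this key, -0.0 lands immediately before +0.0 instead of comparing equal to it. The resulting integers can be fed to any integer sort (radix sort, for example).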

My idea:
Sort the list into an array (using any appropriate sorting algorithm)
For each range, find the indices of the start and end of the range using binary search
Find the median by simply adding their indices and dividing by 2 (i.e. median of range [x,y] is arr[(x+y)/2])
Preprocessing time: O(n log n) for a generic sorting algorithm (like quick-sort) or the running time of the chosen sorting routine
Time per query: O(log n)
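The three steps can be sketched in Python with the standard `bisect` module. As in the steps above, a range is taken to be a range of values [low, high] located by binary search in the sorted array; `range_median` is a made-up name, and for an even count it returns the lower of the two middle elements.

```python
from bisect import bisect_left, bisect_right


def range_median(sorted_values, low, high):
    """Median of the elements of `sorted_values` lying in [low, high]."""
    first = bisect_left(sorted_values, low)        # index of first element >= low
    last = bisect_right(sorted_values, high) - 1   # index of last element <= high
    if first > last:
        return None  # no element falls inside the range
    return sorted_values[(first + last) // 2]
```

Sorting happens once up front, for example `arr = sorted(data)`, after which each query costs two binary searches.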
Dynamic list:
The above assumes that the list is static. If elements can freely be added or removed between queries, a modified Binary Search Tree could work, with each node keeping a count of the number of descendants it has. This will allow the same running time as above with a dynamic list.

The answer is ultimately going to be "it depends". There are a variety of approaches, any one of which will probably be suitable under most of the cases you may encounter. The problem is that each is going to perform differently for different inputs. Where one may perform better for one class of inputs, another will perform better for a different class of inputs.
As an example, the approach of sorting and then performing a binary search on the extremes of your ranges and then directly computing the median will be useful when the number of ranges you have to test is greater than log(N). On the other hand, if the number of ranges is smaller than log(N) it may be better to move elements of a given range to the beginning of the array and use a linear time selection algorithm to find the median.
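For the second case, a sketch using quickselect, which runs in expected (not worst-case) linear time and stands in here for a linear-time selection algorithm. It treats a sub-range as a slice of positions `start:stop` in the original list, as in the question's example, and assumes the slice is non-empty; the function names are illustrative only.

```python
import random


def select(items, k):
    """Return the k-th smallest (0-based) element of a non-empty sequence."""
    items = list(items)
    while True:
        pivot = random.choice(items)
        smaller = [x for x in items if x < pivot]
        equal_count = sum(1 for x in items if x == pivot)
        if k < len(smaller):
            items = smaller
        elif k < len(smaller) + equal_count:
            return pivot
        else:
            # Discard everything up to and including the pivot's copies.
            k -= len(smaller) + equal_count
            items = [x for x in items if x > pivot]


def slice_median(data, start, stop):
    """Lower median of data[start:stop], without sorting the whole list."""
    window = data[start:stop]
    return select(window, (len(window) - 1) // 2)
```

For the question's list [5 3 6 2 4], `slice_median(data, 0, 3)` looks only at 5, 3, 6.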
All of this boils down to profiling to avoid premature optimization. If the approach you implement turns out to not be a bottleneck for your system's performance, figuring out how to improve it isn't going to be a useful exercise relative to streamlining those portions of your program which are bottlenecks.

Sort an array of tuples by all elements

I would like to sort an array of tuples by all elements (like if they were in a trie). If the input is (1,2,5), (1,2,3), (1,1,4), (2,8,9), the corresponding output would be (1,1,4), (1,2,3), (1,2,5),(2,8,9). The corresponding trie would be:
        root
       /    \
      1      2
     / \     |
    1   2    8
    |  / \   |
    4 3   5  9
I was thinking about using a search tree for each position in the tuples. There is also the obvious naive way (sort by first position, then sort by second position, etc.). Does anybody see a better way?

The trie-based approach that you have outlined above is extremely similar to doing a most-significant digit radix sort on the tuples. You essentially are distributing them into buckets based on their first digit, then recursively subdividing the buckets into smaller groups based on the remaining digits. You might want to consider explicitly performing the MSD radix sort rather than building the trie, since tries can be memory-inefficient when the data are sparse, while MSD radix sort has reasonably good memory usage (especially if you implement everything implicitly).
In the example you gave above, all of the numbers in the tuples were single digits. If this is the case, you can have at most 10 × 10 × 10 = 1000 possible distinct tuples, which isn't very large. In that case, you might want to consider just using a standard sorting algorithm with a custom comparator, since the benefits of a more optimized sort probably won't be all that apparent at that scale. On the other hand, if your tuples have many more entries in them, then it might be worth investing in a more clever sort, like MSD radix sort.
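A small sketch of the MSD idea in Python, using a dictionary of buckets per position instead of an explicit trie. `msd_sort` is a hypothetical name, and the bucket keys at each level are ordered with the built-in `sorted`, so this illustrates the structure rather than an optimized radix sort.

```python
from collections import defaultdict


def msd_sort(tuples, pos=0):
    """Sort tuples lexicographically by bucketing on position `pos`, then recursing."""
    if len(tuples) <= 1:
        return list(tuples)
    # Tuples that have run out of positions sort before longer ones.
    result = [t for t in tuples if len(t) == pos]
    buckets = defaultdict(list)
    for t in tuples:
        if len(t) > pos:
            buckets[t[pos]].append(t)
    # Visit buckets in key order; each bucket is one branch of the trie.
    for key in sorted(buckets):
        result.extend(msd_sort(buckets[key], pos + 1))
    return result
```

In Python specifically, plain `sorted(tuples)` already compares tuples element by element, which is the "standard sorting algorithm with a custom comparator" option mentioned above.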
Hope this helps!

Radix sort is to a trie as merge sort is to a binary tree.

How about keeping things simple and treating each tuple as a single number, with its elements as the digits in some base, say 10?
Then (1,2,5) becomes 125, and so on. This only works if every element is smaller than the base.
Then just sort the resulting numbers with any simple comparison sort, like heap sort.
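A sketch of that encoding in Python, assuming all tuples have the same length and every element is a non-negative integer smaller than the chosen base; `encode` is a made-up helper name.

```python
def encode(t, base=10):
    """Read the tuple's elements as digits of one number in the given base."""
    value = 0
    for digit in t:
        if not 0 <= digit < base:
            raise ValueError("element does not fit in one digit of this base")
        value = value * base + digit
    return value
```

The tuples can then be ordered with `sorted(tuples, key=encode)` or by sorting the encoded numbers directly.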

minimum number of comparisons needed

What is the minimum number of comparisons needed to find the largest element from 4 distinct elements? I know for 5 distinct numbers it is 6, floor(5/2) * 3; this is from the CLRS book. But I know there is no one general formula for finding this, or is there?
Edit (clarification):
These 4 elements could be in any order (for all permutations of these 4 elements). I'm not interested in a counting technique to keep track of the largest element as you traverse the elements, only in comparisons like > or <.

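For 4 elements the minimum number of comparisons is 3.
In general, to find the largest of N elements you need N-1 comparisons. This gives you 4 for 5 numbers, not 6 (the 3 * floor(n/2) bound in CLRS is for finding the minimum and the maximum together).
Proof:
There is always a solution with N-1 comparisons: just compare the first two, select the larger and compare it with the next one, select the larger and compare it with the next one, etc.
There cannot be a shorter solution, because every element other than the largest must lose at least one comparison, and each comparison produces only one loser; with fewer than N-1 comparisons some element would never have been ruled out.
QED.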
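The constructive half of the proof, as a short Python sketch that also counts the comparisons it makes (`largest` is an illustrative name; the input is assumed to be non-empty):

```python
def largest(items):
    """Return (maximum, number of comparisons used) for a non-empty iterable."""
    iterator = iter(items)
    best = next(iterator)
    comparisons = 0
    for candidate in iterator:
        comparisons += 1  # exactly one comparison per remaining element
        if candidate > best:
            best = candidate
    return best, comparisons
```

For any 4 elements the counter ends at 3, and for 5 elements at 4.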

I know it does not answer the original question, but I enjoyed reading this not-so-intuitive post on the minimum number of comparisons needed to find the smallest AND the largest number from an unsorted array (with proof).

Think of it as a competition. By comparing two elements you have a loser and a winner.
So if you have n elements and need 1 final winner you need n-1 comparisons to rule out the other ones.

for elements a,b,c,d
if a>b+c+d, then it only required one comparison to know that a is the biggest.
You do have to get lucky though.

Sort numbers by sum algorithm

I have a language-agnostic question about an algorithm.
This comes from a (probably simple) programming challenge I read. The problem is, I'm too stupid to figure it out, and curious enough that it is bugging me.
The goal is to sort a list of integers to ascending order by swapping the positions of numbers in the list. Each time you swap two numbers, you have to add their sum to a running total. The challenge is to produce the sorted list with the smallest possible running total.
Examples:
3 2 1 - 4
1 8 9 7 6 - 41
8 4 5 3 2 7 - 34
Though you are free to just give the answer if you want, if you'd rather offer a "hint" in the right direction (if such a thing is possible), I would prefer that.

Only read the first two paragraphs if you just want a hint. There is an efficient solution to this (unless I made a mistake, of course). First sort the list. Now we can write the original list as a product of disjoint cycles.
For example, 5,3,4,2,1 has two cycles, (5,1) and (3,4,2). The second cycle can be thought of as starting at 3: 4 is in 3's spot, 2 is in 4's spot, and 3 is in 2's spot. The end goal is 1,2,3,4,5 or (1)(2)(3)(4)(5), five disjoint cycles.
If we switch two elements from different cycles, say 1 and 3 then we get: 5,1,4,2,3 and in cycle notation (1,5,3,4,2). The two cycles are joined into one cycle, this is the opposite of what we want to do.
If we switch two elements from the same cycle, say 3 and 4 then we get: 5,4,3,2,1 in cycle notation (5,1)(2,4)(3). The one cycle is split into two smaller cycles. This gets us closer to the goal of all cycles of length 1. Notice that any switch of two elements in the same cycle splits the cycle into two cycles.
If we can figure out the optimal algorithm for switching one cycle, we can apply that to all cycles and get an optimal algorithm for the entire sort. One algorithm is to take the minimum element in the cycle and switch it with the element whose position it is in. So for (3,4,2) we would switch 2 with 4. This leaves us with a cycle of length 1 (the element just switched into the correct position) and a cycle of size one smaller than before. We can then apply the rule again. This algorithm switches the smallest element (cycle length - 1) times and every other element once.
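To transform a cycle of length n into cycles of length 1 takes n - 1 operations. Each element must be operated on at least once (think about each element to be sorted, it has to be moved to its correct position). The algorithm I proposed operates on each element once, which all algorithms must do, then every other operation is done on the minimal element. Among algorithms that only swap within a cycle, none can do better. Note that this argument does not cover swaps that bring in an element from another cycle: for 1 8 9 7 6 the rule above costs 8 + 9 + 7 + 3 * 6 = 42, while the expected total of 41 is reached by first swapping the 1 into the cycle and using it as the placeholder.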
This algorithm takes O(n log n) to sort then O(n) to mess with cycles. Solving one cycle takes O(cycle length), the total length of all cycles is n so cost of the cycle operations is O(n). The final run time is O(n log n).
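A sketch of the cycle approach in Python that computes only the total cost. It assumes distinct integers and, going beyond the per-cycle rule described above, also considers borrowing the globally smallest element as a placeholder for a cycle (the idea other answers here use for 1 8 9 7 6). The name `min_swap_cost` is hypothetical.

```python
def min_swap_cost(values):
    """Smallest total of swapped pairs' sums needed to sort distinct integers."""
    target = sorted(values)
    final_index = {v: i for i, v in enumerate(target)}
    global_min = target[0] if target else 0
    seen = [False] * len(values)
    total = 0
    for start in range(len(values)):
        if seen[start]:
            continue
        # Walk one cycle, collecting its sum, minimum and length.
        cycle = []
        i = start
        while not seen[i]:
            seen[i] = True
            cycle.append(values[i])
            i = final_index[values[i]]
        length = len(cycle)
        if length < 2:
            continue  # element already in place
        cycle_sum, cycle_min = sum(cycle), min(cycle)
        # Option 1: the cycle's own minimum does every swap.
        inside = cycle_sum + (length - 2) * cycle_min
        # Option 2: swap the global minimum in, let it do the work, swap it back.
        borrowed = cycle_sum + cycle_min + (length + 1) * global_min
        total += min(inside, borrowed)
    return total
```

On the three examples from the question this gives 4, 41 and 34; for 1 8 9 7 6 the borrowed option (41) beats the in-cycle option (42).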

I'm assuming memory is free and you can simulate the sort before performing it on the real objects.
One approach (that is likely not the fastest) is to maintain a priority queue. Each node in the queue is keyed by the swap cost to get there and it contains the current item ordering and the sequence of steps to achieve that ordering. For example, initially it would contain a 0-cost node with the original data ordering and no steps.
Run a loop that dequeues the lowest-cost queue item, and enqueues all possible single-swap steps starting at that point. Keep running the loop until the head of the queue has a sorted list.
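A sketch of that loop in Python using `heapq`, simplified to track only the cost of each ordering rather than the sequence of steps. It assumes non-negative integers (so that the cheapest-first order is valid) and is only practical for short lists, since the number of orderings grows factorially. `cheapest_sort_cost` is an illustrative name.

```python
import heapq
from itertools import combinations


def cheapest_sort_cost(values):
    """Uniform-cost search over orderings; returns the minimal swap total."""
    start = tuple(values)
    goal = tuple(sorted(values))
    best = {start: 0}          # cheapest known cost to reach each ordering
    queue = [(0, start)]       # priority queue keyed by cost so far
    while queue:
        cost, state = heapq.heappop(queue)
        if state == goal:
            return cost
        if cost > best[state]:
            continue  # stale queue entry, a cheaper route was found later
        # Enqueue every ordering reachable by one swap.
        for i, j in combinations(range(len(state)), 2):
            swapped = list(state)
            swapped[i], swapped[j] = swapped[j], swapped[i]
            nxt = tuple(swapped)
            new_cost = cost + state[i] + state[j]
            if new_cost < best.get(nxt, float("inf")):
                best[nxt] = new_cost
                heapq.heappush(queue, (new_cost, nxt))
    return None
```

To recover the actual swaps, each queue entry would also need to carry the list of steps taken so far, as described above.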

I did a few attempts at solving one of the examples by hand:
1 8 9 7 6
6 8 9 7 1 (+6+1=7)
6 8 1 7 9 (7+1+9=17)
6 8 7 1 9 (17+1+7=25)
6 1 7 8 9 (25+1+8=34)
1 6 7 8 9 (34+1+6=41)
Since you needed to displace the 1, it seems that you may have to do an exhaustive search to complete the problem - the details of which were already posted by another user. Note that you will encounter problems if the dataset is large when doing this method.
If the problem allows for "close" answers, you can simply make a greedy algorithm that puts the largest item into position - either doing so directly, or by swapping the smallest element into that slot first.

Comparisons and traversals apparently come for free, you can pre-calculate the "distance" a number must travel (and effectively the final sort order). The puzzle is the swap algorithm.
Minimizing overall swaps is obviously important.
Minimizing swaps of larger numbers is also important.
I'm pretty sure an optimal swap process cannot be guaranteed by evaluating each ordering in a stateless fashion, although you might frequently come close (not the challenge).

I think there is no trivial solution to this problem, and my approach is likely no better than the priority queue approach.
Find the smallest number, N.
Any pairs of numbers that occupy each others' desired locations should be swapped, except for N.
Assemble (by brute force) a collection of every set of numbers that can be mutually swapped into their desired locations, such that the cost of sorting the set amongst itself is less than the cost of swapping every element of the set with N.
These sets will comprise a number of cycles. Swap within those cycles in such a way that the smallest number is swapped twice.
Swap all remaining numbers, which comprise a cycle including N, using N as a placeholder.

As a hint, this reeks of dynamic programming; that might not be precise enough a hint to help, but I'd rather start with too little!

You are charged by the number of swaps, not by the number of comparisons. Nor did you mention being charged for keeping other records.
