Fastest gap sequence for shell sort? - algorithm

According to Marcin Ciura's Optimal (best known) sequence of increments for shell sort algorithm,
the best sequence for shellsort is 1, 4, 10, 23, 57, 132, 301, 701...,
but how can I generate such a sequence?
In Marcin Ciura's paper, he said:
Both Knuth’s and Hibbard’s sequences
are relatively bad, because they are
defined by simple linear recurrences.
but most algorithm books I found tend to use Knuth’s sequence: k = 3k + 1, because it's easy to generate. What's your way of generating a shellsort sequence?

Ciura's paper generates the sequence empirically -- that is, he tried a bunch of combinations and this was the one that worked the best. Generating an optimal shellsort sequence has proven to be tricky, and the problem has so far been resistant to analysis.
The best known increment is Sedgewick's, which you can read about here (see p. 7).

If your data set has a definite upper bound in size, then you can hardcode the step sequence. You should probably only worry about generality if your data set is likely to grow without an upper bound.
The sequence shown seems to grow roughly as an exponential series, albeit with quirks. There seems to be a majority of prime numbers, but with non-primes in the mix as well. I don't see an obvious generation formula.
A valid question, assuming you must deal with arbitrarily large sets, is whether you need to emphasise worst-case performance, average-case performance, or almost-sorted performance. If the latter, you may find that a plain insertion sort using a binary search for the insertion step might be better than a shellsort. If you need good worst-case performance, then Sedgewick's sequence appears to be favoured. The sequence you mention is optimised for average-case performance, where the number of comparisons outweighs the number of moves.

I would not be ashamed to take the advice given in Wikipedia's Shellsort article,
With respect to the average number of comparisons, the best known gap
sequences are 1, 4, 10, 23, 57, 132, 301, 701 and similar, with gaps
found experimentally. Optimal gaps beyond 701 remain unknown, but good
results can be obtained by extending the above sequence according to
the recursive formula h_k = \lfloor 2.25 h_{k-1} \rfloor.
Tokuda's sequence [1, 4, 9, 20, 46, 103, ...], defined by the simple formula h_k = \lceil h'_k \rceil,
where h'_k = 2.25 h'_{k-1} + 1, h'_1 = 1, can be recommended for
practical applications.
Guessing from the pseudonym, it seems Marcin Ciura edited the Wikipedia article himself.
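As a rough illustration of the Tokuda recurrence quoted above, here is a minimal sketch. The function name tokuda_gaps and its limit parameter are my own choices, and exact fractions are used only to avoid floating-point drift in the recurrence.

```python
import math
from fractions import Fraction


def tokuda_gaps(limit):
    """Return Tokuda's gaps h_k = ceil(h'_k) that are smaller than limit.

    Uses the recurrence h'_k = 2.25 * h'_{k-1} + 1 with h'_1 = 1.
    (Hypothetical helper name; not taken from the quoted article.)
    """
    gaps = []
    h = Fraction(1)
    while math.ceil(h) < limit:
        gaps.append(math.ceil(h))
        h = Fraction(9, 4) * h + 1  # 2.25 expressed exactly as 9/4
    return gaps


print(tokuda_gaps(200))
```

For limit = 200 this yields the six terms listed in the quote.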

The sequence is 1, 4, 10, 23, 57, 132, 301, 701, 1750. For each gap after 1750, multiply the previous gap by 2.25 and round down.
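A minimal sketch of that rule, together with a plain shellsort that consumes the gaps. The names ciura_gaps and shell_sort are illustrative, and the 2.25 factor is written as integer arithmetic (9h // 4) so the rounding down is exact.

```python
def ciura_gaps(n):
    """Gaps smaller than n, largest first: Ciura's values, then floor(2.25 * h)."""
    gaps = [1, 4, 10, 23, 57, 132, 301, 701, 1750]
    while gaps[-1] < n:
        gaps.append(gaps[-1] * 9 // 4)  # floor(2.25 * previous gap)
    return [g for g in reversed(gaps) if g < n] or [1]


def shell_sort(items):
    """Sort the list in place with gapped insertion sorts."""
    for gap in ciura_gaps(len(items)):
        for i in range(gap, len(items)):
            item = items[i]
            j = i
            # Shift larger elements of this gap-chain one step to the right.
            while j >= gap and items[j - gap] > item:
                items[j] = items[j - gap]
                j -= gap
            items[j] = item


data = [5, 2, 9, 1, 5, 6, 0, 3]
shell_sort(data)
print(data)
```

The extension rule gives 3937, 8858 and 19930 as the next gaps after 1750.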

Sedgewick observes that coprimality is good. This rings true: if there are separate ‘streams’ not much cross-compared until the gap is small, and one stream contains mostly smalls and one mostly larges, then the small gap might need to move elements far. Coprimality maximises cross-stream comparison.
Gonnet and Baeza-Yates advise growth by a factor of about 2.2; Tokuda by 2.25. It is well known that if there is a mathematical constant between 2⅕ and 2¼ then it must† be precisely √5 ≈ 2.236.
So start {1, 3}, and then each subsequent is the integer closest to previous·√5 that is coprime to all previous except 1. This sequence can be pre-calculated and embedded in code. There follow the values up to 2⁶⁴ ≈ eighteen quintillion.
{1, 3, 7, 16, 37, 83, 187, 419, 937, 2099, 4693, 10499, 23479, 52501, 117391, 262495, 586961, 1312481, 2934793, 6562397, 14673961, 32811973, 73369801, 164059859, 366848983, 820299269, 1834244921, 4101496331, 9171224603, 20507481647, 45856123009, 102537408229, 229280615033, 512687041133, 1146403075157, 2563435205663, 5732015375783, 12817176028331, 28660076878933, 64085880141667, 143300384394667, 320429400708323, 716501921973329, 1602147003541613, 3582509609866643, 8010735017708063, 17912548049333207, 40053675088540303, 89562740246666023, 200268375442701509, 447813701233330109, 1001341877213507537, 2239068506166650537, 5006709386067537661, 11195342530833252689}
(Obviously, omit those that would overflow the relevant array index type. So if that is a signed long long, omit the last.)
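A sketch of how I read the construction rule above: start from {1, 3}, then repeatedly take the integer nearest to previous·√5 that is coprime to every earlier gap except 1. The function name is hypothetical, and because it uses floating-point √5 it should only be trusted for the early, small gaps; the large values in the list would need exact arithmetic.

```python
import math


def sqrt5_coprime_gaps(count):
    """First `count` gaps of the sqrt(5) coprime construction (float-based sketch)."""
    gaps = [1, 3]
    while len(gaps) < count:
        target = gaps[-1] * math.sqrt(5)
        base = round(target)
        offset = 0
        found = None
        while found is None:
            # Both candidates at this offset are closer to the target than
            # any candidate at a larger offset, so scan outwards.
            candidates = sorted({base - offset, base + offset},
                                key=lambda c: abs(c - target))
            for c in candidates:
                if all(math.gcd(c, g) == 1 for g in gaps[1:]):
                    found = c
                    break
            offset += 1
        gaps.append(found)
    return gaps[:count]


print(sqrt5_coprime_gaps(10))
```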
On average these have ≈1.96 distinct prime factors and ≈2.07 non-distinct prime factors; 19/55 ≈ 35% are prime; and all but three are square-free (2⁴, 13·19² = 4693, 3291992692409·23³ ≈ 4.0·10¹⁶).
I would welcome formal reasoning about this sequence.
† There’s a little mischief in this “well known … must”. Choosing ∉ℚ guarantees that the closest number that is coprime cannot be a tie, but rational with odd denominator would achieve same. And I like the simplicity of √5, though other possibilities include e^⅘, 11^⅓, π/√2, and √π divided by the Chow-Robbins constant. Simplicity favours √5.

I've found this sequence similar to Marcin Ciura's sequence:
1, 4, 9, 23, 57, 138, 326, 749, 1695, 3785, 8359, 18298, 39744, etc.
For example, Ciura's sequence is:
1, 4, 10, 23, 57, 132, 301, 701, 1750
Apart from the leading 1, each term is the integer part of the mean of the first 2^i prime numbers (i = 2, 3, ...). Python code to compute these means is here:
import numpy as np

def isprime(n):
    ''' Check if integer n is a prime '''
    n = abs(int(n))  # n is a positive integer
    if n < 2:  # 0 and 1 are not primes
        return False
    if n == 2:  # 2 is the only even prime number
        return True
    if not n & 1:  # all other even numbers are not primes
        return False
    # Range starts with 3 and only needs to go up the square root
    # of n for all odd numbers
    for x in range(3, int(n**0.5)+1, 2):
        if n % x == 0:
            return False
    return True

# To apply a function to a numpy array, one has to vectorize the function
vectorized_isprime = np.vectorize(isprime)
a = np.arange(10000000)
primes = a[vectorized_isprime(a)]
#print(primes)
for i in range(2,20):
    print(primes[0:2**i].mean())
The output is:
4.25
9.625
23.8125
57.84375
138.953125
326.1015625
749.04296875
1695.60742188
3785.09082031
8359.52587891
18298.4733887
39744.887085
85764.6216431
184011.130096
392925.738174
835387.635033
1769455.40302
3735498.24225
The ratio between consecutive terms slowly decreases from about 2.5 towards 2.
Maybe this association could improve the Shellsort in the future.

I discussed this question here yesterday including the gap sequences I have found work best given a specific (low) n.
In the middle I write
A nasty side-effect of shellsort is that when using a set of random
combinations of n entries (to save processing/evaluation time) to test
gaps you may end up with either the best gaps for n entries or the
best gaps for your set of combinations - most likely the latter.
The problem lies in testing the proposed gaps such that valid conclusions can be drawn. Obviously, testing the gaps against all n! orderings that a set of n unique values can be expressed as is unfeasible. Testing in this manner for n=16, for example, means that 20,922,789,888,000 different combinations of n values must be sorted to determine the exact average, worst and reverse-sorted cases - just to test one set of gaps and that set might not be the best. 2^(16-2) sets of gaps are possible for n=16, the first being {1} and the last {15,14,13,12,11,10,9,8,7,6,5,4,3,2,1}.
To illustrate how using random combinations might give incorrect results, assume n=3, which can take six different orderings: 012, 021, 102, 120, 201 and 210. You produce a set of two random sequences to test the two possible gap sets, {1} and {2,1}. Assume that these sequences turn out to be 021 and 201. For {1}, 021 can be sorted with three comparisons (02, 21 and 01) and 201 with three (20, 21 and 01), giving a total of six comparisons; divide by two and voilà, an average of 3 and a worst case of 3. Using {2,1} gives (01, 02, 21 and 01) for 021 and (21, 10 and 12) for 201: seven comparisons, with a worst case of 4 and an average of 3.5. The actual average and worst case for {1} are 8/3 and 3, respectively. For {2,1} the values are 10/3 and 4. The averages were too high in both cases and the worst cases were correct. Had 012 been one of the cases, {1} would have given a 2.5 average, which is too low.
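For small n the exact figures can be checked by brute force over all n! orderings. The sketch below assumes the usual gapped insertion sort and counts one comparison per element pair examined; count_comparisons and exact_stats are names I made up for illustration.

```python
from fractions import Fraction
from itertools import permutations


def count_comparisons(values, gaps):
    """Shellsort a copy of values with the given gaps; return comparisons made."""
    a = list(values)
    comparisons = 0
    for gap in gaps:
        for i in range(gap, len(a)):
            item = a[i]
            j = i
            while j >= gap:
                comparisons += 1
                if a[j - gap] <= item:
                    break
                a[j] = a[j - gap]
                j -= gap
            a[j] = item
    return comparisons


def exact_stats(n, gaps):
    """Exact (average, worst case) comparison counts over all n! orderings."""
    counts = [count_comparisons(p, gaps) for p in permutations(range(n))]
    return Fraction(sum(counts), len(counts)), max(counts)


print(exact_stats(3, [1]))     # average 8/3, worst case 3
print(exact_stats(3, [2, 1]))  # average 10/3, worst case 4
```

This is only practical for very small n, which is exactly the difficulty described above.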
Now extend this to finding a set of random sequences for n=16 such that no set of gaps tested will be favored in comparison with the others and the result close (or equal) to the true values, all the while keeping processing to a minimum. Can it be done? Possibly. After all, everything is possible - but is it probable? I think that for this problem random is the wrong approach. Selecting the sequences according to some system may be less bad and might even be good.

More information regarding jdaw1's post:
Gonnet and Baeza-Yates advise growth by a factor of about 2.2; Tokuda by 2.25. It is well known that if there is a mathematical constant between 2⅕ and 2¼ then it must† be precisely √5 ≈ 2.236.
Since √5 · √5 = 5, I think every other gap should increase by a factor of five. So the first gap is 1 (plain insertion sort), the second is 3, and from there every second gap is five times the one two places before it. There follow the values up to 2⁶⁴ ≈ eighteen quintillion.
{1, 3,, 15,, 75,, 375,, 1 875,, 9 375,, 46 875,, 234 375,, 1 171 875,, 5 859 375,, 29 296 875,, 146 484 375,, 732 421 875,, 3 662 109 375,, 18 310 546 875,, 91 552 734 375,, 457 763 671 875,, 2 288 818 359 375,, 11 444 091 796 875,, 57 220 458 984 375,, 286 102 294 921 875,, 1 430 511 474 609 375,, 7 152 557 373 046 875,, 35 762 786 865 234 375,, 178 813 934 326 171 875,, 894 069 671 630 859 375,, 4 470 348 358 154 296 875,}
The missing values can be calculated by multiplying the preceding value by √5 and rounding to the nearest whole number, which gives the following array (using 2.2360679775 * 5 ^ n * 3):
{1, 3, 7, 15, 34, 75, 168, 375, 839, 1 875, 4 193, 9 375, 20 963, 46 875, 104 816, 234 375, 524 078, 1 171 875, 2 620 392, 5 859 375, 13 101 961, 29 296 875, 65 509 804, 146 484 375, 327 549 020, 732 421 875, 1 637 745 101, 3 662 109 375, 8 188 725 504, 18 310 546 875, 40 943 627 518, 91 552 734 375, 204 718 137 589, 457 763 671 875, 1 023 590 687 943, 2 288 818 359 375, 5 117 953 439 713, 11 444 091 796 875, 25 589 767 198 563, 57 220 458 984 375, 127 948 835 992 813, 286 102 294 921 875, 639 744 179 964 066, 1 430 511 474 609 375, 3 198 720 899 820 328, 7 152 557 373 046 875, 15 993 604 499 101 639, 35 762 786 865 234 375, 79 968 022 495 508 194, 178 813 934 326 171 875, 399 840 112 477 540 970, 894 069 671 630 859 375, 1 999 200 562 387 704 849, 4 470 348 358 154 296 875, 9 996 002 811 938 524 246}
(Obviously, omit those that would overflow the relevant array index type. So if that is a signed long long, omit the last.)
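A small sketch of this construction, with the caveat that it rounds 3·5^n·√5 exactly using integer square roots rather than the decimal constant 2.2360679775, so I have only checked it against the first few entries of the array above. The function name is my own.

```python
import math


def alternating_sqrt5_gaps(count):
    """1, then pairs (3 * 5**n, round(3 * 5**n * sqrt(5))) for n = 0, 1, 2, ..."""
    gaps = [1]
    base = 3
    while len(gaps) < count:
        gaps.append(base)
        # Nearest integer to base * sqrt(5) = sqrt(5 * base**2), computed exactly.
        x = 5 * base * base
        root = math.isqrt(x)
        if x - root * root > root:  # x lies above (root + 0.5)**2
            root += 1
        gaps.append(root)
        base *= 5
    return gaps[:count]


print(alternating_sqrt5_gaps(10))
```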

Related

Something I don't understand about median of medians algorithm

There is something I don't understand about the algorithm of median of medians.
One key step about this algorithm is to find an approximate median, and according to Wikipedia, we have the guarantee that this approximate median is greater than 30% of elements of the initial set.
To find this approximate median, we compute the median of each group of 5 elements, gather these medians in a new set, and recompute the medians until the resulting set has fewer than 5 elements. We then take the median of that set. (See the Wikipedia page if my explanation is not clear.)
But, consider the following set of 125 elements :
1 2 3 1001 1002
4 5 6 1003 1004
7 8 9 1005 1006
1020 1021 1022 1023 1034
1025 1026 1027 1028 1035
10 11 12 1007 1008
13 14 15 1009 1010
16 17 18 1011 1013
1029 1030 1031 1032 1033
1036 1037 1038 1039 1040
19 20 21 1014 1015
22 23 24 1016 1017
25 26 27 1018 1019
1041 1042 1043 1044 1045
1046 1047 1048 1049 1050
1051 1052 1053 1054 1055
1056 1057 1058 1059 1060
1061 1062 1063 1064 1065
1066 1067 1068 1069 1070
1071 1072 1073 1074 1075
1076 1077 1078 1079 1080
1081 1082 1083 1084 1085
1086 1087 1088 1089 1090
1091 1092 1093 1094 1095
1096 1097 1098 1099 1100
So we divide the set into groups of 5 elements, compute and gather the medians, and obtain the following set:
3 6 9 1022 1027
12 15 18 1031 1038
21 24 27 1043 1048
1053 1058 1063 1068 1073
1078 1083 1088 1093 1098
We redo the same algorithm, and we obtain the following set :
9 18 27 1063 1068
So we obtain that the approximate median is 27. But this number is greater than or equal to only 27 elements, and 27/125 = 21.6% < 30%!
So my question is: where am I wrong? Why is the approximate median in my case not greater than 30% of the elements?
Thank you for your replies!!
The cause of your confusion about the median-of-medians algorithm is that, while median-of-medians returns an approximate result within 20% of the actual median, at some stages in the algorithm we also need to calculate exact medians. If you mix up the two, you will not get the expected result, as demonstrated in your example.
Median-of-medians uses three functions as its building blocks:
medianOfFive(array, first, last) {
    // ...
    return median;
}
This function returns the exact median of five (or fewer) elements from (part of) an array. There are several ways to code this, based on e.g. a sorting network or insertion sort. The details are not important for this question, but it is important to note that this function returns the exact median, not an approximation.
medianOfMedians(array, first, last) {
    // ...
    return median;
}
This function returns an approximation of the median from (part of) an array, which is guaranteed to be larger than the 30% smallest elements, and smaller than the 30% largest elements. We'll go into more detail below.
select(array, first, last, n) {
    // ...
    return element;
}
This function returns the n-th smallest element from (part of) an array. This function too returns an exact result, not an approximation.
At its most basic, the overall algorithm works like this:
medianOfMedians(array, first, last) {
    call medianOfFive() for every group of five elements
    fill an array with these medians
    call select() for this array to find the middle element
    return this middle element (i.e. the median of medians)
}
So this is where your calculation went wrong. After creating an array with the median-of-fives, you then used the median-of-medians function again on this array, which gives you an approximation of the median (27), but here you need the actual median (1038).
This all sounds fairly straightforward, but where it becomes complicated is that the function select() calls medianOfMedians() to get a first estimate of the median, which it then uses to calculate the exact median, so you get a two-way recursion where two functions call each other. This recursion stops when medianOfMedians() is called for 25 elements or fewer, because then there are only 5 medians, and instead of using select() to find their median, it can use medianOfFive().
The reason why select() calls medianOfMedians() is that it uses partitioning to split (part of) the array into two parts of close to equal size, and it needs a good pivot value to do that. After it has partitioned the array into two parts with the elements which are smaller and larger than the pivot, it then checks which part the n-th smallest element is in, and recurses with this part. If the size of the part with the smaller values is n-1, the pivot is the n-th value, and no further recursion is needed.
select(array, first, last, n) {
    call medianOfMedians() to get approximate median as pivot
    partition (the range of) the array into smaller and larger than pivot
    if part with smaller elements is size n-1, return pivot
    call select() on the part which contains the n-th element
}
As you see, the select() function recurses (unless the pivot happens to be the n-th element), but on ever smaller ranges of the array, so at some point (e.g. two elements) finding the n-th element will become trivial, and recursing further is no longer needed.
So finally we get, in some more detail:
medianOfFive(array, first, last) {
    // some algorithmic magic ...
    return median;
}
medianOfMedians(array, first, last) {
    if 5 elements or fewer, call medianOfFive() and return result
    call medianOfFive() for every group of five elements
    store the results in an array medians[]
    if medians[] has 5 elements or fewer, call medianOfFive() on it and return result
    call select(medians[]) to find the middle element
    return the result (i.e. the median of medians)
}
select(array, first, last, n) {
    if 2 elements, compare and return n-th element
    if 5 elements or fewer, call medianOfFive() to get median as pivot
    else call medianOfMedians() to get approximate median as pivot
    partition (the range of) the array into smaller and larger than pivot
    if part with smaller elements is size n-1, return pivot
    if n-th value is in part with larger values, recalculate value of n
    call select() on the part which contains the n-th element
}
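The pseudocode above can be turned into a compact Python sketch. This is a simplification under my own assumptions, not a faithful port: it copies sublists instead of partitioning in place, uses sorted() for groups of five or fewer, takes the lower median for even-sized groups, and the function names are only illustrative.

```python
def median_of_five(values):
    """Exact (lower) median of a list of at most five elements."""
    ordered = sorted(values)
    return ordered[(len(ordered) - 1) // 2]


def median_of_medians(values):
    """Approximate median, used as a pivot by select()."""
    if len(values) <= 5:
        return median_of_five(values)
    medians = [median_of_five(values[i:i + 5])
               for i in range(0, len(values), 5)]
    if len(medians) <= 5:
        return median_of_five(medians)
    # The exact median of the medians is needed here, hence select().
    return select(medians, (len(medians) + 1) // 2)


def select(values, n):
    """Exact n-th smallest element (n is 1-based)."""
    if len(values) <= 5:
        return sorted(values)[n - 1]
    pivot = median_of_medians(values)
    smaller = [v for v in values if v < pivot]
    larger = [v for v in values if v > pivot]
    equal = len(values) - len(smaller) - len(larger)
    if n <= len(smaller):
        return select(smaller, n)
    if n <= len(smaller) + equal:
        return pivot
    return select(larger, n - len(smaller) - equal)
```

The two functions call each other, as described above: select() asks median_of_medians() for a pivot, and median_of_medians() asks select() for the exact median of the group medians.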
EXAMPLE
Input array (125 values, 25 groups of five):
#1 #2 #3 #4 #5 #6 #7 #8 #9 #10 #11 #12 #13 #14 #15 #16 #17 #18 #19 #20 #21 #22 #23 #24 #25
1 4 7 1020 1025 10 13 16 1029 1036 19 22 25 1041 1046 1051 1056 1061 1066 1071 1076 1081 1086 1091 1096
2 5 8 1021 1026 11 14 17 1030 1037 20 23 26 1042 1047 1052 1057 1062 1067 1072 1077 1082 1087 1092 1097
3 6 9 1022 1027 12 15 18 1031 1038 21 24 27 1043 1048 1053 1058 1063 1068 1073 1078 1083 1088 1093 1098
1001 1003 1005 1023 1028 1007 1009 1011 1032 1039 1014 1016 1018 1044 1049 1054 1059 1064 1069 1074 1079 1084 1089 1094 1099
1002 1004 1006 1034 1035 1008 1010 1013 1033 1040 1015 1017 1019 1045 1050 1055 1060 1065 1070 1075 1080 1085 1090 1095 1100
Medians of groups of five (25 values):
3, 6, 9, 1022, 1027, 12, 15, 18, 1031, 1038, 21, 24, 27, 1043,
1048, 1053, 1058, 1063, 1068, 1073, 1078, 1083, 1088, 1093, 1098
Groups of five for approximate median:
#1 #2 #3 #4 #5
3 12 21 1053 1078
6 15 24 1058 1083
9 18 27 1063 1088
1022 1031 1043 1068 1093
1027 1038 1048 1073 1098
Medians of five for approximate median:
9, 18, 27, 1063, 1088
Approximate median as pivot:
27
Medians of five partitioned with pivot 27 (depends on method):
small: 3, 6, 9, 24, 21, 12, 15, 18
pivot: 27
large: 1031, 1038, 1027, 1022, 1043, 1048, 1053, 1058,
1063, 1068, 1073, 1078, 1083, 1088, 1093, 1098
The smaller group has 8 elements, the larger group 16 elements. We were looking for the middle 13th element out of 25, so now we look for the 13 - 8 - 1 = 4th element out of 16:
Groups of five:
#1 #2 #3 #4
1031 1048 1073 1098
1038 1053 1078
1027 1058 1083
1022 1063 1088
1043 1068 1093
Medians of groups of five:
1031, 1058, 1083, 1098
Approximate median as pivot:
1058
Range of medians of five partitioned with pivot 1058 (depends on method):
small: 1031, 1038, 1027, 1022, 1043, 1048, 1053
pivot: 1058
large: 1063, 1068, 1073, 1078, 1083, 1088, 1093, 1098
The smaller group has 7 elements. We were looking for the 4th element of 16, so now we look for the 4th element out of 7:
Groups of five:
#1 #2
1031 1048
1038 1053
1027
1022
1043
Medians of groups of five:
1031, 1048
Approximate median as pivot:
1031
Range of medians of five partitioned with pivot 1031 (depends on method):
small: 1022, 1027
pivot: 1031
large: 1038, 1043, 1048, 1053
The smaller part has 2 elements, and the larger has 4, so now we look for the 4 - 2 - 1 = 1st element out of 4:
Median of five as pivot:
1043
Range of medians of five partitioned with pivot 1043 (depends on method):
small: 1038
pivot: 1043
large: 1048, 1053
The smaller part has only one element, and we were looking for the first element, so we can return the small element 1038.
As you will see, 1038 is the exact median of the original 25 median-of-fives, and there are 62 smaller values in the original array of 125:
1 ~ 27, 1001 ~ 1011, 1013 ~ 1023, 1025 ~ 1037
which not only puts it in the 30~70% range, but means it is actually the exact median (note that this is a coincidence of this particular example).
I'm completely with your analysis up through the point where you get the medians of each of the blocks of five elements, when you're left with this collection of elements:
3 6 9 1022 1027 12 15 18 1031 1038 21 24 27 1043 1048 1053 1058 1063 1068 1073 1078 1083 1088 1093 1098
You are correct that, at this point, we need to get the median of this collection of elements. However, the way that the median-of-medians algorithm accomplishes this is different than what you've proposed.
When you were working through your analysis, you attempted to get the median of this set of values by, once again, splitting the input into blocks of size five and taking the median of each. However, that approach won't actually give you the median of the medians. (You can see this by noting that you got back 27, which isn't the true median of that collection of values).
The way that the median-of-medians algorithm actually gets back the median of the medians is by recursively invoking the overall algorithm to obtain the median of those elements. This is subtly different from just repeatedly breaking things apart into blocks and computing the medians of each block. In particular, each recursive call will
get an estimate of the pivot by using the groups-of-five heuristic,
recursively invoke the function on itself to find the median of those medians, then
apply a partitioning step on that median and use that to determine how to proceed from there.
This algorithm is, in my opinion, something that's way too complicated to actually trace through by hand. You really need to trust that, since each recursive call you're making works on a smaller array than what you started with, each recursive call will indeed do what it says to do. So when you're left with the medians of each group, as you were before, you should just trust that when you need to get the median by a recursive call, you end up with the true median.
If you look at the true median of the medians that you've generated in the first step, you'll find that it indeed will be between the 30th and 70th percentiles of the original data set.
If this seems confusing, don't worry - you're in really good company. This algorithm is famously tricky to understand. For me, the easiest way to understand it is to just trust that recursion works and to trace through it only one layer deep, working under the assumption that all the recursive calls work, rather than trying to walk all the way down to the bottom of the recursion tree.

The 1000th element which is product of 2, 3, 5

There is a sequence S.
Every element of S is a product of powers of 2, 3 and 5 only.
S = {2, 3, 4, 5, 6, 8, 9, 10, 12, 15, 16, 18, 20, 24 ...}
How to get the 1000th element in this sequence efficiently?
I check each number from 1, but this method is too slow.
A geometric approach:
Let s = 2^i . 3^j . 5^k, where the triple (i, j, k) belongs to the first octant of a 3D state space.
Taking the logarithm,
ln(s) = i.ln(2) + j.ln(3) + k.ln(5)
so that in the state space the iso-s surfaces are planes, which intersect the first octant along a triangle. On the other hand, the feasible solutions are the nodes of a square grid.
If one wants to produce the s-values in increasing order, one can keep a list of the grid nodes closest to the current s-plane*, on its "greater than" side.
If I am right, to move from one s-value to the next, it suffices to discard the current (i, j, k) and replace it by the three triples (i+1, j, k), (i, j+1, k) and (i, j, k+1), unless they are already there, and pick the next smallest s.
An efficient implementation will be by storing the list as a binary tree with the log(s)-value as the key.
If you are asking for the first N values, you will explore a pyramidal volume of state-space of height O(³√N), and base area O(³√N²), which is the number of tree nodes, hence the spatial complexity. Every query in the tree will take O(log(N)) comparisons (and O(1) operations to fetch the minimum), for a total of O(N.log(N)).
*More precisely, the list will contain all triples on the "greater than" side and such that no index can be decreased without getting on the other side of the plane.
Here is Python code that implements these ideas.
You will notice that the logarithms are converted to fixed point (7 decimals) to avoid floating-point inaccuracies that could result in the log(s)-values not being found equal. This causes the s values being inexact in the last digits, but this does not matter as long as the ordering of the values is preserved. Recomputing the s-values from the indexes yields exact values.
import math
import bintrees

# Constants: logarithms scaled to fixed point (7 decimals)
ln2 = round(10000000 * math.log(2))
ln3 = round(10000000 * math.log(3))
ln5 = round(10000000 * math.log(5))

# Initial list: key is the scaled log(s), value is the triple (i, j, k)
t = bintrees.FastAVLTree()
t.insert(0, (0, 0, 0))

# Find the N first products
N = 100
for _ in range(N):
    # Current s: smallest key, with its exponent triple
    key, (i, j, k) = t.pop_min()
    # Recompute s exactly from the exponents
    print(2**i * 3**j * 5**k)
    # Update the list
    if key + ln2 not in t:
        t.insert(key + ln2, (i + 1, j, k))
    if key + ln3 not in t:
        t.insert(key + ln3, (i, j + 1, k))
    if key + ln5 not in t:
        t.insert(key + ln5, (i, j, k + 1))
The 100 first values are
1 2 3 4 5 6 8 9 10 12
15 16 18 20 24 25 27 30 32 36
40 45 48 50 54 60 64 72 75 80
81 90 96 100 108 120 125 128 135 144
150 160 162 180 192 200 216 225 240 243
250 256 270 288 300 320 324 360 375 384
400 405 432 450 480 486 500 512 540 576
600 625 640 648 675 720 729 750 768 800
810 864 900 960 972 1000 1024 1080 1125 1152
1200 1215 1250 1280 1296 1350 1440 1458 1500 1536
The plot of the number of tree nodes confirms the O(³√N²) spatial behavior.
Update:
When there is no risk of overflow, a much simpler version (not using logarithms) is possible:
import bintrees

# Initial list
t = bintrees.FastAVLTree()
t[1] = None

# Find the N first products
N = 100
for _ in range(N):
    # Current s (the stored value is unused)
    s, _unused = t.pop_min()
    print(s)
    # Update the list; re-assigning an existing key is harmless
    t[2 * s] = None
    t[3 * s] = None
    t[5 * s] = None
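If bintrees is not available, the same idea can be sketched with the standard library. This swaps the AVL tree for a binary heap plus a set to suppress duplicates, so it is a variation on the approach above rather than the code the answer describes; smooth_numbers is a hypothetical name.

```python
import heapq


def smooth_numbers(count, factors=(2, 3, 5)):
    """First `count` products of the given factors, in increasing order (1 included)."""
    heap = [1]
    seen = {1}
    result = []
    while len(result) < count:
        s = heapq.heappop(heap)  # smallest value not yet emitted
        result.append(s)
        for f in factors:
            candidate = s * f
            if candidate not in seen:
                seen.add(candidate)
                heapq.heappush(heap, candidate)
    return result


print(smooth_numbers(14))
```

Python integers do not overflow, so no logarithm trick is needed here. Note that this list starts at 1, whereas the sequence S in the question starts at 2.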
Simply put, you just have to generate each ith number consecutively. Let's call the set {2, 3, 5} Z. At the ith iteration, assume you have all (i-1) values generated in the previous iterations. To generate the next one, you try every element of Z and, for each of them, generate the least element it can form that is larger than the element generated at the (i-1)th iteration. Then you simply take the smallest one among them as the ith value. A simple and not so efficient implementation is given below.
def generate_simple(N, Z):
    generated = [1]
    for i in range(1, N+1):
        minFound = -1
        minElem = -1
        for j in range(0, len(Z)):
            for k in range(0, len(generated)):
                candidateVal = Z[j] * generated[k]
                if candidateVal > generated[-1]:
                    if minFound == -1 or minFound > candidateVal:
                        minFound = candidateVal
                        minElem = j
                    break
        generated.append(minFound)
    return generated[-1]
As you may observe, this approach has a time complexity of O(N² * |Z|). An improvement in terms of efficiency would be to store, for each element of Z, where we left off scanning the array of generated values, in a second array, indicesToStart. Then, for each element, we would scan all N values of the generated array only once over the whole algorithm, which means the time complexity after such an improvement would be O(N * |Z|).
A simple implementation of the improvement based on the simple version provided above, is given below.
def generate_improved(N, Z):
    generated = [1]
    indicesToStart = [0] * len(Z)
    for i in range(1, N+1):
        minFound = -1
        minElem = -1
        for j in range(0, len(Z)):
            for k in range(indicesToStart[j], len(generated)):
                candidateVal = Z[j] * generated[k]
                if candidateVal > generated[-1]:
                    if minFound == -1 or minFound > candidateVal:
                        minFound = candidateVal
                        minElem = j
                    break
                indicesToStart[j] += 1
        generated.append(minFound)
        indicesToStart[minElem] += 1
    return generated[-1]
If you have a hard time understanding how complexity decreases with this algorithm, try looking into the difference in time complexity of any graph traversal algorithm when an adjacency list is used, and when an adjacency matrix is used. The improvement adjacency lists help achieve is almost exactly the same kind of improvement we get here. In a nutshell, you have an index for each element and instead of starting to scan from the beginning you continue from wherever you left the last time you scanned the generated array for that element. Consequently, even though there are N iterations in the algorithm(i.e. the outermost loop) the overall number of operations you make is O(N * |Z|).
Important Note: All the code above is a simple implementation for demonstration purposes, and you should consider it just as a pseudocode you can test. While implementing this in real life, based on the programming language you choose to use, you will have to consider issues like integer overflow when computing candidateVal.

LMC program to find the difference between double the median and the smallest of 3 inputs?

I want to write an LMC program to find the difference between twice the median and the smallest of 3 distinct inputs efficiently. I would like some help in figuring out an algorithm for this.
Here is what I have so far:
INPUT 901 - Input first
STO 399 - Store in 99 (a)
INPUT 901 - Input second
STO 398 - Store in 98 (b)
INPUT 901 - Input third
STO 397 - Store in 97 (c)
LOAD 597 - Load 97 (a)
SUB 298 - Subtract 97 - 98 (a - b)
BRP 8xx - If value positive go to xx (if value is positive a > b else b > a)
LOAD 598 - Load 98 (b)
SUB 299 - Subtract 98 - 99 (b - c)
BRP 8xx - If value positive go to xx (if value is positive b > c else c > b)
LOAD 598 - Load 98 (b) which is the median
ADD 198 - Double to get "twice the median"
I realized at the end of the snippet I didn't know which input was the smallest and was assuming the inputs were already sorted (which they aren't).
I think I will need to somehow sort the inputs from smallest to largest to do this efficiently and determine the smallest input and the median within the same branch.
I don't know little-man-computer language, but it doesn't matter, it's an algorithm question.
First of all, you mixed up the names of the three parameters (first you said that 99 was a, then you said 97 was a).
You must load the three parameters in 99, 98, 97 (say a, b, c).
Then, you load 99 (a) and subtract 98 (b) from it.
If the result is positive (99 is greater than 98), you have to swap 98 and 99, so the smaller of the two is in location 99.
Now load 98 and subtract 97 from it. If the result is positive, swap 97 and 98, so the smaller of the two is in location 98.
Finally, you have the two smallest numbers in locations 98 and 99, that is, the smallest and the median.
Load 99 and subtract 98 from it. If the result is positive, 99 contains the median and 98 the smallest; otherwise it is the other way round.
Now you can double the median and calculate the difference between this number and the smallest.
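Since this is an algorithm question, here is the same sequence of compare-and-swap steps sketched in Python rather than LMC. The variables a, b and c stand in for mailboxes 99, 98 and 97, and the function name is my own.

```python
def twice_median_minus_smallest(a, b, c):
    """Follow the three comparisons described above on distinct inputs."""
    # Step 1: make sure the smaller of the first two is in a (mailbox 99).
    if a > b:
        a, b = b, a
    # Step 2: make sure the smaller of b and c is in b (mailbox 98).
    if b > c:
        b, c = c, b
    # a and b now hold the two smallest values; c holds the largest.
    if a > b:
        median, smallest = a, b
    else:
        median, smallest = b, a
    return 2 * median - smallest


print(twice_median_minus_smallest(5, 1, 3))  # median 3, smallest 1
```

In LMC each `if` would become a SUB followed by a BRP, with the swaps done through a spare mailbox.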

Need to find lowest differences between first line of an array and the rest ones

I've been given a number of pairs (s, h), where each pair places an element h on the s-th row of a 2D array. The lines need not have the same number of elements; all that is known is that no line has more than N elements.
What I want to find is the lowest biggest difference (!) between some element of the first line and the other lines.
Thus, if I have 3 lines with (101,92) (100,25,95,52,101) (93,108,0,65,200), what I want to find is 3, because I have to choose 92 and I have 95-92=3 from first to second and 93-92=1 from first to third.
I have reached a point where it is certain that if I have s lines with n(i) elements each and i=0..s, then n0<=n1<=...<=ns, so as to have a good average-case performance when picking the best fit from the first line towards the others.
However, I cannot think of a way better than O(n²), or maybe even O(n³) in some cases. Does anyone have a suggestion for an improved way to do this?
Combine all lines into a single list, keeping track of which line each element comes from.
Sort this list.
Keep a last-value variable for each line.
For each item in the sorted list, update the last-value variable of the line it came from. If not all lines have a last-value set yet, do nothing more. Otherwise:
  If it's an element from the first line: recalculate the biggest difference between it and the last-value variables of all the other lines. Store this difference.
  If it's an element from any other line: if this element is the one that completes the set of last-values, calculate the biggest difference. Otherwise, update the biggest difference to account for the new last-value (replacing an old last-value can shrink the biggest difference as well as grow it, so in the worst case this means recalculating it over all the lines). Store this difference.
The smallest stored difference is the desired value.
Example:
Lists: (101,92) (100,25,95,52,101) (93,108,0,65,200)
Sorted    0   25  52  65  92  93  95  100 101 101 108 200
Source    2   1   1   2   0   2   1   1   0   1   2   2
Last[0]   -   -   -   -   92  92  92  92  101 101 101 101
Last[1]   -   25  52  52  52  52  95  100 100 101 101 101
Last[2]   0   0   0   65  65  93  93  93  93  93  108 200
Diff      -   -   -   -   40  40  3   8   8   8   7   99
Best      -   -   -   -   40  40  3   3   3   3   3   3
Best = 3 as required. Storing the actual items or finding them afterwards should be easy enough.
Complexity:
Let n be the total number of items and k be the number of lists.
O(n log n) for the combine + sort.
O(nk) (worst case) for the scan through, since we're checking n items and, at each item, we do maximum O(k) work.
So O(n log n + nk).
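The scan described above can be sketched in a few lines of Python. This is a minimal illustration rather than a tuned implementation: the function name is made up, and for simplicity it recalculates the biggest difference against the first line's last-value at every step, which stays within the O(nk) bound.

```python
def lowest_biggest_difference(lines):
    """lines[0] is the first line; the others are compared against it."""
    # Combine all lines into one list of (value, source line) and sort it.
    tagged = sorted((value, src)
                    for src, line in enumerate(lines)
                    for value in line)

    last = [None] * len(lines)   # last-value variable for each line
    unset = len(lines)           # how many lines have no last-value yet
    best = None

    for value, src in tagged:
        if last[src] is None:
            unset -= 1
        last[src] = value
        if unset:
            continue             # not every line has been seen yet

        # Biggest difference between the first line's last-value and
        # the last-values of all the other lines: O(k) work per item.
        diff = max((abs(last[0] - v) for v in last[1:]), default=0)
        if best is None or diff < best:
            best = diff

    return best


lines = [[101, 92], [100, 25, 95, 52, 101], [93, 108, 0, 65, 200]]
print(lowest_biggest_difference(lines))  # 3, as in the example
```

Ties in the sorted list are broken by line index, so the 101 from the first line is processed before the 101 from the second, matching the Source row of the example.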

Image Segmentation using Mean Shift explained

Could anyone please help me understand how Mean Shift segmentation actually works?
Here is an 8x8 matrix that I just made up:
103 103 103 103 103 103 106 104
103 147 147 153 147 156 153 104
107 153 153 153 153 153 153 107
103 153 147 96 98 153 153 104
107 156 153 97 96 147 153 107
103 153 153 147 156 153 153 101
103 156 153 147 147 153 153 104
103 103 107 104 103 106 103 107
Using the matrix above is it possible to explain how Mean Shift segmentation would separate the 3 different levels of numbers?
The basics first:
The Mean Shift segmentation is a local homogenization technique that is very useful for damping shading or tonality differences in localized objects.
An example is better than many words:
Action: replaces each pixel with the mean of the pixels in a range-r neighborhood whose values are within a distance d of its own value.
Mean Shift usually takes 3 inputs:
A distance function for measuring distances between pixels. Usually the Euclidean distance, but any other well-defined distance function could be used. The Manhattan distance is another useful choice sometimes.
A radius. All pixels within this radius (measured according to the above distance) will be taken into account in the calculation.
A value difference. From all pixels inside radius r, we will take only those whose values are within this difference for calculating the mean.
Please note that the algorithm is not well defined at the borders, so different implementations will give you different results there.
I'll NOT discuss the gory mathematical details here, as they are impossible to show without proper mathematical notation, not available in StackOverflow, and also because they can be found from good sources elsewhere.
Let's look at the center of your matrix:
153 153 153 153
147 96 98 153
153 97 96 147
153 153 147 156
With reasonable choices for radius and distance, the four center pixels will get the value of 97 (their mean, 96.75, rounded) and will be different from the adjacent pixels.
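As a rough cross-check of that claim, here is a single filtering pass in plain Python. It is a sketch under stated assumptions, not a reimplementation of Mathematica's MeanShiftFilter: the function name is made up, the neighborhood is a square window (Chebyshev distance), the window is simply clipped at the borders, and the radius of 1 and value difference of 3 are illustrative choices.

```python
a = [
    [103, 103, 103, 103, 103, 103, 106, 104],
    [103, 147, 147, 153, 147, 156, 153, 104],
    [107, 153, 153, 153, 153, 153, 153, 107],
    [103, 153, 147,  96,  98, 153, 153, 104],
    [107, 156, 153,  97,  96, 147, 153, 107],
    [103, 153, 153, 147, 156, 153, 153, 101],
    [103, 156, 153, 147, 147, 153, 153, 104],
    [103, 103, 107, 104, 103, 106, 103, 107],
]


def mean_shift_pass(img, radius, value_diff):
    """Replace each pixel by the mean of the nearby pixels with similar values."""
    rows, cols = len(img), len(img[0])
    out = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            center = img[r][c]
            # Pixels inside the (clipped) square window whose value is
            # within value_diff of the center pixel's value.
            similar = [
                img[rr][cc]
                for rr in range(max(0, r - radius), min(rows, r + radius + 1))
                for cc in range(max(0, c - radius), min(cols, c + radius + 1))
                if abs(img[rr][cc] - center) <= value_diff
            ]
            # The center pixel always qualifies, so the list is never empty.
            out[r][c] = sum(similar) / len(similar)
    return out


out = mean_shift_pass(a, radius=1, value_diff=3)
print(out[3][3], out[3][4], out[4][3], out[4][4])  # all four are 96.75
```

Each of the four center pixels only finds the other three within the value difference, so all of them end up at the same mean.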
Let's calculate it in Mathematica. Instead of showing the actual numbers, we will display a color coding, so it's easier to understand what is happening:
The color coding for your matrix is:
Then we take a reasonable Mean Shift:
MeanShiftFilter[a, 3, 3]
And we get:
Where all center elements are equal (to 97, BTW).
You may iterate several times with Mean Shift, trying to get a more homogeneous coloring. After a few iterations, you arrive at a stable non-isotropic configuration:
At this time, it should be clear that you can't select how many "colors" you get after applying Mean Shift. So, let's show how to do it, because that is the second part of your question.
To set the number of output clusters in advance, you need something like k-means clustering.
It runs this way for your matrix:
b = ClusteringComponents[a, 3]
{{1, 1, 1, 1, 1, 1, 1, 1},
{1, 2, 2, 3, 2, 3, 3, 1},
{1, 3, 3, 3, 3, 3, 3, 1},
{1, 3, 2, 1, 1, 3, 3, 1},
{1, 3, 3, 1, 1, 2, 3, 1},
{1, 3, 3, 2, 3, 3, 3, 1},
{1, 3, 3, 2, 2, 3, 3, 1},
{1, 1, 1, 1, 1, 1, 1, 1}}
Or:
Which is very similar to our previous result, but as you can see, now we have only three output levels.
HTH!
A Mean-Shift segmentation works something like this:
The image data is converted into feature space
In your case, all you have are intensity values, so feature space will only be one-dimensional. (You might compute some texture features, for instance, and then your feature space would be two dimensional – and you’d be segmenting based on intensity and texture)
Search windows are distributed over the feature space
The number of windows, window size, and initial locations are arbitrary for this example – something that can be fine-tuned depending on specific applications
Mean-Shift iterations:
1.) The MEANs of the data samples within each window are computed
2.) The windows are SHIFTed to the locations equal to their previously computed means
Steps 1.) and 2.) are repeated until convergence, i.e. all windows have settled on final locations
The windows that end up on the same locations are merged
The data is clustered according to the window traversals
... e.g. all data that was traversed by windows that ended up at, say, location “2”, will form a cluster associated with that location.
So, this segmentation will (coincidentally) produce three groups. Viewing those groups in the original image format might look something like the last picture in belisarius' answer. Choosing different window sizes and initial locations might produce different results.
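The steps above can be sketched for the one-dimensional intensity case as follows. Everything specific here is an assumption made for illustration: the function name, the flat (unweighted) window, the window half-width of 4, the five starting locations, and the final step, which simplifies "clustered according to the window traversals" to assigning each value to its nearest final window location.

```python
def mean_shift_1d(values, window, starts, tol=1e-6, max_iter=100):
    centers = list(starts)
    for _ in range(max_iter):
        shifted = []
        for ctr in centers:
            # 1.) MEAN of the samples inside the window around ctr
            inside = [v for v in values if abs(v - ctr) <= window]
            # 2.) SHIFT the window to that mean (empty windows stay put)
            shifted.append(sum(inside) / len(inside) if inside else ctr)
        converged = all(abs(s - c) <= tol for s, c in zip(shifted, centers))
        centers = shifted
        if converged:
            break

    # Merge windows that ended up on the same location.
    modes = []
    for ctr in sorted(centers):
        if not modes or ctr - modes[-1] > tol:
            modes.append(ctr)

    # Simplified clustering: label each value by its nearest mode.
    labels = [min(range(len(modes)), key=lambda i: abs(v - modes[i]))
              for v in values]
    return modes, labels


pixels = [
    103, 103, 103, 103, 103, 103, 106, 104,
    103, 147, 147, 153, 147, 156, 153, 104,
    107, 153, 153, 153, 153, 153, 153, 107,
    103, 153, 147,  96,  98, 153, 153, 104,
    107, 156, 153,  97,  96, 147, 153, 107,
    103, 153, 153, 147, 156, 153, 153, 101,
    103, 156, 153, 147, 147, 153, 153, 104,
    103, 103, 107, 104, 103, 106, 103, 107,
]
modes, labels = mean_shift_1d(pixels, window=4, starts=[95, 100, 105, 150, 155])
print(len(modes), [labels.count(i) for i in range(len(modes))])
```

With these particular choices the five windows settle on three locations, one per intensity level; other window sizes or starting points can merge or split the groups.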

Resources