I have the following pandas dataframe
import numpy as np
import pandas as pd
df = pd.DataFrame({"first_element":[20, 125, 156, 211, 227, 220, 230, 472, 4765], "second_element":[35, 145, 178, 233, 321, 234, 231, 498, 8971], "next":[0.32, 0.04, 0.59, 0.103, 0.37, 0.92, 0.81, 0.24, 0.77]})
df = df[["first_element", "second_element", "next"]]
print(df)
### print(df) outputs:
first_element second_element next
0 20 35 0.320
1 125 145 0.040
2 156 178 0.590
3 211 233 0.103
4 227 321 0.370
5 220 234 0.920
6 230 231 0.810
7 472 498 0.240
8 4765 8971 0.770
In this DataFrame, each row is considered an "interval" along a real line, [first_element, second_element], e.g. 20 to 35, 125 to 145.
If I wished to sort df based on both columns, I would use .sort_values(), i.e.
sorted_df = df.sort_values(["first_element", "second_element"], ascending=[True, False])
which outputs
print(sorted_df)
first_element second_element next
0 20 35 0.320
1 125 145 0.040
2 156 178 0.590
3 211 233 0.103
5 220 234 0.920
4 227 321 0.370
6 230 231 0.810
7 472 498 0.240
8 4765 8971 0.770
There are several intervals which intersect/overlap, namely [211, 233], [220, 234], [227, 321], [230, 231]. Because [230, 231] is a subset of [211, 233], there are several ways to order these two.
My goal is to (1) write a function that finds all overlapping "intervals" (the values in the two columns first_element and second_element) and (2) randomly shuffle these intervals.
Goal (2) sounds very tricky, because one would need to separately shuffle/re-order multiple "groups" of overlapping intervals. For example, let's say our dataframe was larger, and had the following overlapping intervals:
[211, 233], [220, 234], [227, 321], [230, 231], [5550, 5879], [5400, 5454]
I would want to separately re-shuffle [211, 233], [220, 234], [227, 321], [230, 231] and [5550, 5879], [5400, 5454], not mix up the subsets of overlapping intervals.
There are several ways to shuffle rows with pandas, e.g. shuffle by the index
import random

def shuffle_by_index(df):
    index = list(df.index)
    random.shuffle(index)
    df = df.loc[index]  # .ix is deprecated; .loc does the same here
    df = df.reset_index(drop=True)
    return df
or use sklearn
import sklearn.utils
shuffled = sklearn.utils.shuffle(df)
shuffled = shuffled.reset_index(drop=True)
but (1) how does one search for all overlapping intervals in a pythonic/pandas way and (2) how do I select these subsets of overlapping intervals and only shuffle those individually?
This is not the best way to solve it, but it gives your desired results. I have left the second part for you.
import numpy as np
import pandas as pd
df = pd.DataFrame({"first_element":[20, 125, 156, 211, 227, 220, 230, 472, 4765], "second_element":[35, 145, 178, 233, 321, 234, 231, 498, 8971], "next":[0.32, 0.04, 0.59, 0.103, 0.37, 0.92, 0.81, 0.24, 0.77]})
df = df[["first_element", "second_element", "next"]]
sorted_df = df.sort_values(["first_element", "second_element"], ascending=[True, False])
sorted_df.reset_index(inplace=True)

prev_min = sorted_df.first_element.iloc[0]
prev_max = sorted_df.second_element.iloc[0]
labels = []
label_counter = 1
labels.append(label_counter)
for row_index in range(1, sorted_df.shape[0]):
    row = sorted_df.iloc[row_index]
    if row.first_element > prev_max:
        # disjoint from the current group: start a new group
        prev_min = row.first_element
        prev_max = row.second_element
        label_counter += 1
        labels.append(label_counter)
    elif row.first_element >= prev_min:
        # overlaps the current group: extend it
        prev_max = max(prev_max, row.second_element)
        labels.append(label_counter)
sorted_df['overlapping_index'] = labels
# group sorted_df by overlapping_index, then randomly shuffle within each group
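To complete the part left to the reader, here is a minimal sketch (my addition, not part of the original answer) that shuffles rows within each overlapping_index group while leaving the groups themselves in order:

# Shuffle rows inside each group of overlapping intervals; groups with a
# single interval are unaffected. sample(frac=1) permutes a group's rows.
shuffled_df = (
    sorted_df.groupby('overlapping_index', group_keys=False)
             .apply(lambda g: g.sample(frac=1))
             .reset_index(drop=True)
)
print(shuffled_df)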
I'm trying to have the "range" of compass headings over the last X seconds. Example: Over the last minute, my heading has been between 120deg and 140deg on the compass. Easy enough right? I have an array with the compass headings over the time period, say 1 reading every second.
[ 125, 122, 120, 125, 130, 139, 140, 138 ]
I can take the minimum and maximum values and there you go. My range is from 120 to 140.
Except it's not that simple. Take for example if my heading has shifted from 10 degrees to 350 degrees (i.e. it "passed" through North, changing by -20 degrees).
Now my array might look something like this:
[ 9, 10, 6, 3, 358, 355, 350, 353 ]
Now the min is 3 and the max is 358, which is not what I need :( I'm looking for the most "right hand" (clockwise) value, and the most "left hand" (counter-clockwise) value.
Only way I can think of, is finding the largest arc along the circle that includes none of the values in my array, but I don't even know how I would do that.
Would really appreciate any help!
Problem Analysis
To summarize the problem, it sounds like you want to find both of the following:
The two readings that are closest together (for simplicity: in a clockwise direction) AND
Contain all of the other readings between them.
So in your second example, 9 and 10 are only 1° apart, but they do not contain all the other readings. Conversely, traveling clockwise from 10 to 9 would contain all of the other readings, but they are 359° apart in that direction, so they are not closest.
In this case, I'm not sure if using the minimum and maximum readings will help. Instead, I'd recommend sorting all of the readings. Then you can more easily check the two criteria specified above.
Here's the second example you provided, sorted in ascending order:
[ 3, 6, 9, 10, 350, 353, 355, 358 ]
If we start from the beginning, we know that traveling from reading 3 to reading 358 will encompass all of the other readings, but they are 358 - 3 = 355° apart. We can continue scanning the results progressively. Note that once we circle around, we have to add 360 to properly calculate the degrees of separation.
[ 3, 6, 9, 10, 350, 353, 355, 358 ]
*--------------------------> 358 - 3 = 355° separation
[ 3, 6, 9, 10, 350, 353, 355, 358 ]
-> *----------------------------- (360 + 3) - 6 = 357° separation
[ 3, 6, 9, 10, 350, 353, 355, 358 ]
----> *-------------------------- (360 + 6) - 9 = 357° separation
[ 3, 6, 9, 10, 350, 353, 355, 358 ]
-------> *----------------------- (360 + 9) - 10 = 359° separation
[ 3, 6, 9, 10, 350, 353, 355, 358 ]
----------> *------------------- (360 + 10) - 350 = 20° separation
[ 3, 6, 9, 10, 350, 353, 355, 358 ]
--------------> *-------------- (360 + 350) - 353 = 357° separation
[ 3, 6, 9, 10, 350, 353, 355, 358 ]
-------------------> *--------- (360 + 353) - 355 = 358° separation
[ 3, 6, 9, 10, 350, 353, 355, 358 ]
------------------------> *---- (360 + 355) - 358 = 357° separation
Pseudocode Solution
Here's a pseudocode algorithm for determining the minimum degree range of reading values. There are definitely ways it could be optimized if performance is a concern.
// Somehow, we need to get our reading data into the program, sorted
// in ascending order.
// If readings are always whole numbers, you can use an int[] array
// instead of a double[] array. If we use an int[] array here, change
// the "minimumInclusiveReadingRange" variable below to be an int too.
double[] readings = populateAndSortReadingsArray();
if (readings.length == 0)
{
    // Handle case where no readings are provided. Show a warning,
    // throw an error, or whatever the requirement is.
}
else
{
    // We want to track the endpoints of the smallest inclusive range.
    // These values will be overwritten each time a better range is found.
    int minimumInclusiveEndpointIndex1;
    int minimumInclusiveEndpointIndex2;
    double minimumInclusiveReadingRange; // This is convenient, but not necessary.
                                         // We could determine it using the
                                         // endpoint indices instead.

    // Check the range of the greatest and least readings first. Since
    // the readings are sorted, the greatest reading is the last element.
    // The least reading is the first element.
    minimumInclusiveReadingRange = readings[readings.length - 1] - readings[0];
    minimumInclusiveEndpointIndex1 = 0;
    minimumInclusiveEndpointIndex2 = readings.length - 1;

    // Potential to skip some processing. If the ends are 180 or less
    // degrees apart, they represent the minimum inclusive reading range.
    // The for loop below could be skipped.
    for (int i = 1; i < readings.length; i++)
    {
        if ((360.0 + readings[i-1]) - readings[i] < minimumInclusiveReadingRange)
        {
            minimumInclusiveReadingRange = (360.0 + readings[i-1]) - readings[i];
            minimumInclusiveEndpointIndex1 = i;
            minimumInclusiveEndpointIndex2 = i - 1;
        }
    }

    // Most likely, there will be some different readings, but there is an
    // edge case of all readings being the same:
    if (minimumInclusiveReadingRange == 0.0)
    {
        print("All readings were the same: " + readings[0]);
    }
    else
    {
        print("The range of compass readings was: " + minimumInclusiveReadingRange +
              " spanning from " + readings[minimumInclusiveEndpointIndex1] +
              " to " + readings[minimumInclusiveEndpointIndex2]);
    }
}
There is one additional edge case that this pseudocode algorithm does not cover, and that is the case where there are multiple minimum inclusive ranges...
Example 1: [0, 90, 180, 270] which has a range of 270 (90 to 0/360, 180 to 90, 270 to 180, and 0 to 270).
Example 2: [85, 95, 265, 275] which has a range of 190 (85 to 275 and 265 to 95)
If it's necessary to report each possible pair of endpoints that create the minimum inclusive range, this edge case would increase the complexity of the logic a bit. If all that matters is determining the value of the minimum inclusive range or it is sufficient to report just one pair that represents the minimum inclusive range, the provided algorithm should suffice.
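For reference, here is a compact Python translation of the pseudocode above (my sketch, not part of the original answer); it assumes a non-empty list of headings in degrees:

def min_inclusive_range(readings):
    """Return (range_in_degrees, start, end) of the smallest clockwise
    arc from start to end that contains every reading."""
    r = sorted(readings)
    best = (r[-1] - r[0], r[0], r[-1])      # non-wrapping candidate
    for i in range(1, len(r)):
        sep = (360.0 + r[i - 1]) - r[i]     # candidate wrapping past North
        if sep < best[0]:
            best = (sep, r[i], r[i - 1])
    return best

print(min_inclusive_range([9, 10, 6, 3, 358, 355, 350, 353]))  # (20.0, 350, 10)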
I have a set of numbers, e.g. [100, 90, 80, 70, 60, 50], and want to find all combinations of size r=3, but in order of decreasing sum.
Arranging the numbers in decreasing order does not work, e.g.
(100, 90, 80) 270
(100, 90, 70) 260
(100, 90, 60) 250
(100, 90, 50) **240**
(100, 80, 70) **250**
(100, 80, 60) 240
How can I go about finding such a combination set in decreasing order of sum?
Here's the code:
import itertools

array = [100, 90, 80, 70, 60, 50]
size = 3
answer = []  # to store all combinations
order = []   # to store order according to sum

for number, comb in enumerate(itertools.combinations(array, size)):
    answer.append(comb)
    order.append([sum(comb), number])  # store sum and index

order.sort(reverse=True)  # sort in decreasing order of sum
for key in order:
    print(key[0], answer[key[1]])  # key[0] is the sum of the combination
Output for the above code is
270 (100, 90, 80)
260 (100, 90, 70)
250 (100, 80, 70)
250 (100, 90, 60)
240 (90, 80, 70)
240 (100, 80, 60)
240 (100, 90, 50)
230 (90, 80, 60)
230 (100, 70, 60)
230 (100, 80, 50)
220 (90, 70, 60)
220 (90, 80, 50)
220 (100, 70, 50)
210 (80, 70, 60)
210 (90, 70, 50)
210 (100, 60, 50)
200 (80, 70, 50)
200 (90, 60, 50)
190 (80, 60, 50)
180 (70, 60, 50)
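The ordering above can also be produced more concisely; a sketch (my addition) that sorts the combinations directly by their sum (ties may come out in a different relative order than above):

import itertools

array = [100, 90, 80, 70, 60, 50]
for comb in sorted(itertools.combinations(array, 3), key=sum, reverse=True):
    print(sum(comb), comb)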
The first (and naive) solution is to iterate over all possible combinations and save them in a heap keyed by sum; at the end, just remove all sets one by one (use a max-heap, or negate the sums in a min-heap, to get decreasing order).
Run time: let x = (n choose r); then O(x log x).
The second one is a little more complicated:
* You need to keep track of the minimum sum found so far.
* You iterate exactly like in your example, with one change: to determine the next set to move to, you try replacing each number in the current set with the next number in the array, and pick the option with the largest sum that is still less than the saved minimum. And of course set the minimum to the new minimum.
Run time: O((n choose r) * r)
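A minimal sketch of that heap-based first approach in Python (my addition; heapq is a min-heap, so sums are negated to pop in decreasing order):

import heapq
import itertools

def combos_by_decreasing_sum(array, r):
    # Push every combination keyed by negated sum, then pop them off.
    heap = [(-sum(c), c) for c in itertools.combinations(array, r)]
    heapq.heapify(heap)
    while heap:
        neg, comb = heapq.heappop(heap)
        yield -neg, comb

for total, comb in combos_by_decreasing_sum([100, 90, 80, 70, 60, 50], 3):
    print(total, comb)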
I need to render a horizontal calendar and render events on it. So I get two dates and the width in pixels. I want to distribute the days between the two provided dates over those pixels and maintain a minimum distance between the visual points.
For instance, I have 365 days (each day should consume at least 10 pixels) and I need to distribute them over 300 pixels. So I need to "pack" them in groups so each pixel would represent multiple dates. How can I calculate this, mathematically speaking?
i.e.
(days)
1/1 8/1 16/1 24/1 2/2 10/2 18/2 ......
in the above example for instance, how can I calculate that I need to "pack/skip" the 7 days?
What I need in the end is to produce an array with the dates (days) and the x offset where it should be positioned in the horizontal axis.
i.e.
1/1/2013 = 0
2/1/2013 = 0
3/1/2013 = 0
4/1/2013 = 0
5/1/2013 = 0
6/1/2013 = 0
7/1/2013 = 0
8/1/2013 = 10
9/1/2013 = 10
10/1/2013 = 10
....
You have 300 pixels to use. Each 'package' should be at least 10 pixels. This means you should have 300/10 = 30 packages. You have 365 days, which should be distributed over those 30 packages, so that's 365/30 ≈ 12.17 days per package. Or simply 12.
The same logic can be used to calculate the amount of days needed in a package if you have a different amount of pixels to use.
I hope that this was what you were asking for.
Jannes
Edit: I have just read your edit so I will alter my reply a bit here.
If you have converted your date to a number between 1 and 365 you can simply calculate each element of your array days like this.
days[i]=floor(i/12)*10
Where the 12 came from above calculations.
date_width = 10
display_width = 300
date_range = 365
num_of_dates = display_width // date_width
date_offsets = [x * date_range // num_of_dates for x in range(num_of_dates)]
This gives the dates for every 10 "pixels":
[0, 12, 24, 36, 48, 60, 73, 85, 97, 109, 121, 133, 146, 158, 170, 182, 194, 206, 219, 231, 243, 255, 267, 279, 292, 304, 316, 328, 340, 352]
If, seeing that you have about 12 days between data points, you want to shift it up to 2 weeks:
date_offset = 14
date_offsets = [x * date_offset for x in range(date_range//date_offset)]
date_positions = [display_width * o // date_range for o in date_offsets]
Given the four digit number 1234, there are six possible two digit subsequences (12, 13, 14, 23, 24, 34). Given some of the subsequences, is it possible to recover the original number?
Here's some example data. Each line lists some 3 digit subsequences of a different 6 digit number (to be found)
528, 508, 028, 502, 058, 528, 028, 528, 552, 050
163, 635, 635, 130, 163, 633, 130, 330, 635, 135
445, 444, 444, 444, 454, 444, 445,
011, 350, 601, 651, 601, 511, 511, 360, 601, 351
102, 021, 102, 221, 102, 100, 002, 021, 021, 121
332, 111, 313, 311, 132, 113, 132, 111, 112
362, 650, 230, 172, 120, 165, 372, 202, 702
103, 038, 138, 150, 110, 518, 510, 538, 108
343, 231, 431, 341, 203, 203, 401, 303, 031, 233
Edit: Sometimes the solution might not be unique (more than one number could have given the subsequences). In that case, it would be good to return one of them, or maybe even a list.
What you want to do is to find the Shortest common supersequence of all your subsequences. Clearly if you have all subsequences, including the original number, the SCS will be what you are looking for. Otherwise it can't be guaranteed, but there's a good chance.
Unfortunately there isn't a nice polynomial algorithm for this problem, but if you Google it you'll find there are a lot of approximation algorithms available, e.g. An ACO Algorithm for the Shortest Common Supersequence Problem, which mentions three overall approaches:
Dynamic Programming or Branch'n'Bound. These are usually too slow except for very few strings or small alphabets.
Finding the SCS of the strings pairwise using Dynamic Programming, using heuristics to choose which strings to 'merge'.
The Majority Merge heuristic, which might be the nicest one for your case.
And then there is the approach described in the paper itself.
Here is another nice article about the problem: http://www.update.uu.se/~shikaree/Westling/
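For illustration, a minimal Python sketch (my addition) of the Majority Merge heuristic mentioned above; being a heuristic, the supersequence it returns is not guaranteed to be shortest (on the example below it yields 6 characters where 5 suffice):

from collections import Counter

def majority_merge(strings):
    # Repeatedly emit the symbol that currently heads the most strings,
    # then strip it from the front of every string it heads.
    rest = [s for s in strings if s]
    out = []
    while rest:
        sym = Counter(s[0] for s in rest).most_common(1)[0][0]
        out.append(sym)
        rest = [s[1:] if s[0] == sym else s for s in rest]
        rest = [s for s in rest if s]
    return "".join(out)

print(majority_merge(["528", "508", "028", "502", "058"]))  # e.g. "502858"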
Build a directed graph with each digit connected to the digit following it in each sequence.
Dealing with cycles:
A cycle implies an impossible scenario - the same character cannot have 2 locations (there can be characters with the same values at multiple positions, but not the exact same character - as a metaphor, you can have many people named Bob, but any given Bob can only be in one place). Some node must be split into multiple nodes. The chosen node should be split such that all incoming edges are in one of the new nodes and all outgoing edges are in the other, with a connection between the two.
There may be multiple nodes which can be picked to be split, with possibly only one choice being correct, so you may need to explore all possibilities until you find one that works. If one doesn't work, you'll get a longer string than is allowed somewhere down the line.
It might be a better idea to leave getting rid of cycles completely until right before the topological sort (resolving them in the same way).
Dealing with nodes with the same value (as a result of cycle resolution):
If there are multiple nodes with the same value that can be chosen, let the outgoing edges go from the first one (the one that has a directed path to all the others) and the incoming edges go to the last one (the one which all the other ones have a directed path to). This obviously needs to be slightly modified if there are multiple digits with the same value in the same sequence.
Finding the actual string:
To determine the string, do a topological sort on the graph.
Example:
Assume we're looking for a 5-digit number and the input is:
528, 508, 028, 502, 058, 058
I know the duplication of 058 is somewhat trivial, but it's just for illustration.
For 528, create nodes for 5, 2 and 8, and connect 5 and 2 and 2 and 8.
5 -> 2 -> 8
For 508, create 0, connect 5 and 0 and 0 and 8.
5 -> 2 -> 8
 \        ^
  > 0 ----+
For 028, connect 0 and 2.
5 ------> 2 -> 8
 \        ^    ^
  > 0 ----+----+
For 502, all connections are already there.
For 058, we get a cycle (5->0->5), so we have 2 choices:
Split 0 into 2 nodes:
+--------------+----+
|              v    v
0 -> 5 ------> 2 -> 8
      \
       > 0
Split 5 into 2 nodes:
+--------------+
|              v
5 ------> 2 -> 5 -> 8
 \        ^    ^
  > 0 ----+----+
Let's assume we go with the latter.
For 058, we need an edge into the last 5 (the right 5 in this case) from the 0, and an outgoing edge from that 5 to the 8. These edges (0->5 and 5->8) already exist, so there's nothing to do.
A topological sort will give us 50258, which is our number.
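As an illustration of the cycle-free core of this approach, here is a small Python sketch (my addition, using Python 3.9+'s graphlib); it skips the node-splitting step, so it only handles inputs whose precedence graph is already acyclic, i.e. where no digit needs to appear twice:

from graphlib import TopologicalSorter

def recover_acyclic(subsequences):
    # Build the precedence graph and topologically sort it.
    ts = TopologicalSorter()
    for s in subsequences:
        for a, b in zip(s, s[1:]):
            ts.add(b, a)  # a must come before b
    return "".join(ts.static_order())

# A cycle-free subset of the example above:
print(recover_acyclic(["528", "508", "028", "502"]))  # one valid order, e.g. "5028"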
Let logic programming do the work for you.
This is via core.logic in clojure.
Define what it means to be a subsequence
;; assumes core.logic is loaded, e.g. (use 'clojure.core.logic)
(defne subseqo [s1 s2]
  ([(h . t1) (h . t2)] (subseqo t1 t2))
  ([(h1 . t1) (h2 . t2)] (!= h1 h2) (subseqo s1 t2))
  ([() _]))
Run the constraints through the solver.
(defn recover6 [input-string]
  (run* [q]
    (fresh [a b c d e f]
      (== q [a b c d e f])
      (everyg (fn [s] (subseqo (seq s) q))
              (re-seq #"\d+" input-string)))))
Examples (results are perceptually instantaneous at the REPL):
(recover6 "528, 508, 028, 502, 058, 528, 028, 528, 552, 050")
;=> ([\5 \0 \5 \2 \8 \0]
[\5 \0 \5 \2 \0 \8]
[\5 \0 \5 \0 \2 \8]
[\0 \5 \0 \5 \2 \8]
[\0 \5 \5 \0 \2 \8])
(recover6 "163, 635, 635, 130, 163, 633, 130, 330, 635, 135")
;=> ([\1 \6 \3 \5 \3 \0]
[\1 \6 \3 \3 \5 \0]
[\1 \6 \3 \3 \0 \5])
(recover6 "445, 444, 444, 444, 454, 444, 445")
;=> ([\4 \4 \5 \4 _0 _1]
... and many more
In the last example, the underscores indicate that _0 and _1 are free variables. They have not been constrained. It is easy enough to constrain any free variables to the set of digits.
According to Marcin Ciura's Optimal (best known) sequence of increments for shell sort algorithm,
the best sequence for shellsort is 1, 4, 10, 23, 57, 132, 301, 701...,
but how can I generate such a sequence?
In Marcin Ciura's paper, he said:
Both Knuth’s and Hibbard’s sequences
are relatively bad, because they are
defined by simple linear recurrences.
but most algorithm books I found tend to use Knuth’s sequence: k = 3k + 1, because it's easy to generate. What's your way of generating a shellsort sequence?
Ciura's paper generates the sequence empirically -- that is, he tried a bunch of combinations and this was the one that worked the best. Generating an optimal shellsort sequence has proven to be tricky, and the problem has so far been resistant to analysis.
The best known increment sequence is Sedgewick's, which you can read about here (see p. 7).
If your data set has a definite upper bound in size, then you can hardcode the step sequence. You should probably only worry about generality if your data set is likely to grow without an upper bound.
The sequence shown seems to grow roughly as an exponential series, albeit with quirks. There seems to be a majority of prime numbers, but with non-primes in the mix as well. I don't see an obvious generation formula.
A valid question, assuming you must deal with arbitrarily large sets, is whether you need to emphasise worst-case performance, average-case performance, or almost-sorted performance. If the latter, you may find that a plain insertion sort using a binary search for the insertion step might be better than a shellsort. If you need good worst-case performance, then Sedgewick's sequence appears to be favoured. The sequence you mention is optimised for average-case performance, where the number of comparisons outweighs the number of moves.
I would not be ashamed to take the advice given in Wikipedia's Shellsort article:

With respect to the average number of comparisons, the best known gap
sequences are 1, 4, 10, 23, 57, 132, 301, 701 and similar, with gaps
found experimentally. Optimal gaps beyond 701 remain unknown, but good
results can be obtained by extending the above sequence according to
the recursive formula h_k = ⌊2.25 h_{k−1}⌋.
Tokuda's sequence [1, 4, 9, 20, 46, 103, ...], defined by the simple
formula h_k = ⌈h′_k⌉, where h′_k = 2.25 h′_{k−1} + 1, h′_1 = 1, can be
recommended for practical applications.
Guessing from the pseudonym, it seems Marcin Ciura edited the WP article himself.
The sequence is 1, 4, 10, 23, 57, 132, 301, 701, 1750. For each number after 1750, multiply the previous number by 2.25 and round down.
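A tiny sketch (my addition) extending the sequence by that rule:

from math import floor

gaps = [1, 4, 10, 23, 57, 132, 301, 701, 1750]
while gaps[-1] < 10**9:
    gaps.append(floor(2.25 * gaps[-1]))  # h_k = floor(2.25 * h_{k-1})
print(gaps)  # ..., 701, 1750, 3937, 8858, 19930, 44842, ...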
Sedgewick observes that coprimality is good. This rings true: if there are separate ‘streams’ not much cross-compared until the gap is small, and one stream contains mostly smalls and one mostly larges, then the small gap might need to move elements far. Coprimality maximises cross-stream comparison.
Gonnet and Baeza-Yates advise growth by a factor of about 2.2; Tokuda by 2.25. It is well known that if there is a mathematical constant between 2⅕ and 2¼ then it must† be precisely √5 ≈ 2.236.
So start {1, 3}, and then each subsequent is the integer closest to previous·√5 that is coprime to all previous except 1. This sequence can be pre-calculated and embedded in code. There follow the values up to 2⁶⁴ ≈ eighteen quintillion.
{1, 3, 7, 16, 37, 83, 187, 419, 937, 2099, 4693, 10499, 23479, 52501, 117391, 262495, 586961, 1312481, 2934793, 6562397, 14673961, 32811973, 73369801, 164059859, 366848983, 820299269, 1834244921, 4101496331, 9171224603, 20507481647, 45856123009, 102537408229, 229280615033, 512687041133, 1146403075157, 2563435205663, 5732015375783, 12817176028331, 28660076878933, 64085880141667, 143300384394667, 320429400708323, 716501921973329, 1602147003541613, 3582509609866643, 8010735017708063, 17912548049333207, 40053675088540303, 89562740246666023, 200268375442701509, 447813701233330109, 1001341877213507537, 2239068506166650537, 5006709386067537661, 11195342530833252689}
(Obviously, omit those that would overflow the relevant array index type. So if that is a signed long long, omit the last.)
On average these have ≈1.96 distinct prime factors and ≈2.07 non-distinct prime factors; 19/55 ≈ 35% are prime; and all but three are square-free (2⁴, 13·19² = 4693, 3291992692409·23³ ≈ 4.0·10¹⁶).
I would welcome formal reasoning about this sequence.
† There’s a little mischief in this “well known … must”. Choosing ∉ℚ guarantees that the closest number that is coprime cannot be a tie, but rational with odd denominator would achieve same. And I like the simplicity of √5, though other possibilities include e^⅘, 11^⅓, π/√2, and √π divided by the Chow-Robbins constant. Simplicity favours √5.
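Here is a small Python sketch (my addition) of the construction just described, searching outward from the nearest integer and using exact integer arithmetic for the √5 comparison so it stays correct beyond double precision:

from math import gcd, isqrt

def next_gap(gaps):
    # Integer closest to gaps[-1]*sqrt(5) that is coprime to all previous
    # gaps (coprimality with the leading 1 is automatic), searching outward.
    t = 5 * gaps[-1] ** 2              # (prev * sqrt5)^2, exactly
    lo, hi = isqrt(t), isqrt(t) + 1    # floor/ceil of prev * sqrt5
    while True:
        if t - lo * lo <= hi * hi - t: # lo is the closer remaining candidate
            n, lo = lo, lo - 1
        else:                          # hi is the closer remaining candidate
            n, hi = hi, hi + 1
        if all(gcd(n, g) == 1 for g in gaps[1:]):
            return n

gaps = [1, 3]
while gaps[-1] < 2**64:
    gaps.append(next_gap(gaps))
print(gaps[:12])  # [1, 3, 7, 16, 37, 83, 187, 419, 937, 2099, 4693, 10499]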
I've found a sequence similar to Marcin Ciura's sequence:
1, 4, 9, 23, 57, 138, 326, 749, 1695, 3785, 8359, 18298, 39744, etc.
For example, Ciura's sequence is:
1, 4, 10, 23, 57, 132, 301, 701, 1750
It is a mean of prime numbers. Python code to find the mean of prime numbers is here:
import numpy as np

def isprime(n):
    '''Check if integer n is a prime'''
    n = abs(int(n))  # n is a positive integer
    if n < 2:        # 0 and 1 are not primes
        return False
    if n == 2:       # 2 is the only even prime number
        return True
    if not n & 1:    # all other even numbers are not primes
        return False
    # Range starts with 3 and only needs to go up to the square root
    # of n, for all odd numbers
    for x in range(3, int(n**0.5) + 1, 2):
        if n % x == 0:
            return False
    return True

# To apply a function to a numpy array, one has to vectorize the function
vectorized_isprime = np.vectorize(isprime)

a = np.arange(10000000)
primes = a[vectorized_isprime(a)]
#print(primes)
for i in range(2, 20):
    print(primes[0:2**i].mean())
The output is:
4.25
9.625
23.8125
57.84375
138.953125
326.1015625
749.04296875
1695.60742188
3785.09082031
8359.52587891
18298.4733887
39744.887085
85764.6216431
184011.130096
392925.738174
835387.635033
1769455.40302
3735498.24225
The ratio between consecutive terms of the sequence slowly decreases from about 2.5 towards 2.
Maybe this association could improve the Shellsort in the future.
I discussed this question here yesterday, including the gap sequences I have found to work best given a specific (low) n.
In the middle I write
A nasty side-effect of shellsort is that when using a set of random
combinations of n entries (to save processing/evaluation time) to test
gaps you may end up with either the best gaps for n entries or the
best gaps for your set of combinations - most likely the latter.
The problem lies in testing the proposed gaps such that valid conclusions can be drawn. Obviously, testing the gaps against all n! orderings that a set of n unique values can be expressed as is unfeasible. Testing in this manner for n=16, for example, means that 20,922,789,888,000 different combinations of n values must be sorted to determine the exact average, worst and reverse-sorted cases - just to test one set of gaps and that set might not be the best. 2^(16-2) sets of gaps are possible for n=16, the first being {1} and the last {15,14,13,12,11,10,9,8,7,6,5,4,3,2,1}.
To illustrate how using random combinations might give incorrect results, assume n=3, which can assume six different orderings: 012, 021, 102, 120, 201 and 210. You produce a set of two random sequences to test the two possible gap sets, {1} and {2,1}. Assume that these sequences turn out to be 021 and 201. For {1}, 021 can be sorted with three comparisons (02, 21 and 01) and 201 with three (20, 21, 01), giving a total of six comparisons; divide by two and voilà, an average of 3 and a worst case of 3. Using {2,1} gives (01, 02, 21 and 01) for 021 and (21, 10 and 12) for 201: seven comparisons with a worst case of 4 and an average of 3.5. The actual average and worst case for {1} are 8/3 and 3, respectively. For {2,1} the values are 10/3 and 4. The averages were too high in both cases and the worst cases were correct. Had 012 been one of the cases, {1} would have given a 2.5 average - too low.
Now extend this to finding a set of random sequences for n=16 such that no set of gaps tested will be favored in comparison with the others and the result close (or equal) to the true values, all the while keeping processing to a minimum. Can it be done? Possibly. After all, everything is possible - but is it probable? I think that for this problem random is the wrong approach. Selecting the sequences according to some system may be less bad and might even be good.
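To make the exhaustive-testing idea concrete, here is a small Python sketch (my addition) that counts shellsort comparisons over all n! orderings for n=3, reproducing the exact figures quoted above (average 8/3 and worst case 3 for {1}; average 10/3 and worst case 4 for {2,1}):

from itertools import permutations

def shellsort_comparisons(seq, gaps):
    # Count the comparisons shellsort makes with the given gap sequence.
    a = list(seq)
    count = 0
    for gap in gaps:
        for i in range(gap, len(a)):
            j = i
            while j >= gap:
                count += 1
                if a[j - gap] > a[j]:
                    a[j - gap], a[j] = a[j], a[j - gap]
                    j -= gap
                else:
                    break
    return count

for gaps in ([1], [2, 1]):
    counts = [shellsort_comparisons(p, gaps) for p in permutations(range(3))]
    print(gaps, sum(counts) / len(counts), max(counts))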
More information regarding jdaw1's post:
Gonnet and Baeza-Yates advise growth by a factor of about 2.2; Tokuda by 2.25. It is well known that if there is a mathematical constant between 2⅕ and 2¼ then it must† be precisely √5 ≈ 2.236.
Since √5 · √5 is 5, every other entry should increase by a factor of five: with the first entry being 1 (plain insertion sort) and the second being 3, every second subsequent entry grows by a factor of 5. There follow those values up to 2⁶⁴ ≈ eighteen quintillion.
{1, 3, 15, 75, 375, 1 875, 9 375, 46 875, 234 375, 1 171 875, 5 859 375, 29 296 875, 146 484 375, 732 421 875, 3 662 109 375, 18 310 546 875, 91 552 734 375, 457 763 671 875, 2 288 818 359 375, 11 444 091 796 875, 57 220 458 984 375, 286 102 294 921 875, 1 430 511 474 609 375, 7 152 557 373 046 875, 35 762 786 865 234 375, 178 813 934 326 171 875, 894 069 671 630 859 375, 4 470 348 358 154 296 875}
The in-between values can simply be calculated by taking the preceding value, multiplying by √5 and rounding to the nearest whole number, giving the resulting array (using round(2.2360679775 · 3 · 5^n)):
{1, 3, 7, 15, 34, 75, 168, 375, 839, 1 875, 4 193, 9 375, 20 963, 46 875, 104 816, 234 375, 524 078, 1 171 875, 2 620 392, 5 859 375, 13 101 961, 29 296 875, 65 509 804, 146 484 375, 327 549 020, 732 421 875, 1 637 745 101, 3 662 109 375, 8 188 725 504, 18 310 546 875, 40 943 627 518, 91 552 734 375, 204 718 137 589, 457 763 671 875, 1 023 590 687 943, 2 288 818 359 375, 5 117 953 439 713, 11 444 091 796 875, 25 589 767 198 563, 57 220 458 984 375, 127 948 835 992 813, 286 102 294 921 875, 639 744 179 964 066, 1 430 511 474 609 375, 3 198 720 899 820 328, 7 152 557 373 046 875, 15 993 604 499 101 639, 35 762 786 865 234 375, 79 968 022 495 508 194, 178 813 934 326 171 875, 399 840 112 477 540 970, 894 069 671 630 859 375, 1 999 200 562 387 704 849, 4 470 348 358 154 296 875, 9 996 002 811 938 524 246}
(Obviously, omit those that would overflow the relevant array index type. So if that is a signed long long, omit the last.)
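A one-screen Python sketch (my addition) of the interleaved construction this answer describes, with the √5 rounding done in exact integer arithmetic:

from math import isqrt

def nearest_sqrt5_multiple(m):
    # Nearest integer to m*sqrt(5), computed exactly.
    t = 5 * m * m
    lo = isqrt(t)                                  # floor(m*sqrt5)
    return lo if t - lo * lo <= (lo + 1) ** 2 - t else lo + 1

gaps, k = [1], 0
while 3 * 5 ** k < 2 ** 64:
    base = 3 * 5 ** k                              # 3, 15, 75, ...
    gaps.append(base)
    gaps.append(nearest_sqrt5_multiple(base))      # 7, 34, 168, ...
    k += 1
print(gaps[:9])  # [1, 3, 7, 15, 34, 75, 168, 375, 839]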