Creating combinations that have no more than one intersecting element - algorithm
I am looking to create a special type of combination in which no two sets have more than one intersecting element. Let me explain with an example:
Let us say we have a 9-letter set containing A, B, C, D, E, F, G, H, and I.
If you create the standard non-repeating combinations of three letters you will have 9C3 sets.
These will contain sets like ABC, ABD, BCD, etc. I am looking to create sets that have at most 1 letter in common. So in this example, we would get the following sets:
ABC, ADG, AEI, AFH, BEH, BFG, BDI, CFI, CDH, CEG, DEF, and GHI - note that if you take any two of these sets, they have no more than 1 letter in common.
What would be a good way to generate such sets? It should be a scalable solution, so that I can do it for a set of 1000 letters with a subset size of 4.
Any help is highly appreciated.
Thanks
Say you have n letters (or students, or whatever), and every week you want to partition them into subsets of size k (for a total of n/k subsets every week). This method will generate almost n/k subsets every week - I show below how to extend it to generate exactly n/k subsets.
Generating the Subsets (no partitioning)
First pick p, the largest prime <= n/k.
Let's consider every ordered pair (a,b) such that
0 <= a < k
0 <= b < p
We can map each pairing to one of our letters; thus, we can map p*k <= n letters this way (again, I show below how to map exactly n letters):
(0,0) => 'A'
(0,1) => 'B'
...
(0,p-1) => 'F'
(1,0) => 'G'
(1,1) => 'H'
...
(k-1,p-1) => 's'
Now, given
0 <= w < p
0 <= i < p
We can create a set Sw(i) of our ordered pairs. Each pairing in Sw(i) will represent one letter (according to our mapping above), and the set Sw(i) itself represents one "grouping of letters" aka. one subset of size k.
The formula for Sw(i) is
Sw(i) = {(0,i mod p), (1,(w+i) mod p), (2,(2w+i) mod p),..., ((k-1),((k-1)*w+i) mod p)}
= { (x,y) | 0 <= x < k and y = w*x + i (mod p)}
If we vary w and i over all possible values, we get p^2 total sets. When we take any two of these sets, they will have at most one intersecting element.
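As a concrete sketch, here is what the construction looks like in Python (the helper names and the numeric letter encoding are my own illustration, not part of the method itself):

    from itertools import combinations

    def largest_prime_leq(m):
        # Largest prime p <= m (trial division; fine for small m).
        for p in range(m, 1, -1):
            if all(p % d for d in range(2, int(p**0.5) + 1)):
                return p
        raise ValueError("no prime <= %d" % m)

    def generate_subsets(n, k):
        # Yield all p^2 subsets S_w(i); letters are numbered 0..p*k-1,
        # with the pair (x, y) mapped to the number x*p + y.
        p = largest_prime_leq(n // k)
        for w in range(p):
            for i in range(p):
                yield frozenset(x * p + (w * x + i) % p for x in range(k))

    # For the 9-letter example (n = 9, k = 3): p = 3, giving 9 subsets,
    # any two of which share at most one letter.
    sets = list(generate_subsets(9, 3))
    assert all(len(a & b) <= 1 for a, b in combinations(sets, 2))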
How it works
Say we have two sets Sw1(i1) and Sw2(i2). If Sw1(i1) and Sw2(i2) have more than one element in common, then there exist at least two values of x such that
w1*x + i1 = w2*x + i2 (mod p)
(w1-w2)*x + (i1-i2) = 0 (mod p)
However, anyone who's taken modular arithmetic knows that when p is prime, this congruence has exactly one solution x if w1 =/= w2, and no solution at all if w1 = w2 but i1 =/= i2; it can hold for more than one x only when w1 = w2 and i1 = i2, i.e. when the two sets are the same. Thus, two distinct sets Sw1(i1) and Sw2(i2) can have at most one intersecting element.
Analysis
Since p is the largest prime <= n/k, by Chebyshev's theorem (which states there is a prime between x and 2x for x > 3) we have
n/2k < p <= n/k
Thus, this method generates at least (n/2k)^2 subsets of letters, though in practice p will be nearer to n/k, so the number will be nearer to (n/k)^2. Since a simple upper bound on the maximum possible number of such subsets is n(n-1)/(k(k-1)) (see BlueRaja's comment below), the algorithm is asymptotically optimal, and it will generate close to the optimal number of sets (even in the worst case, it won't generate less than about a quarter of the optimal amount; see again the comment below).
Partitioning
You now want to group the letters into partitions each week: each week, all letters are included in exactly one group.
We do this by letting w be fixed to a certain value (representing the week) and letting i vary from 0 to p-1.
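In code terms, reusing the hypothetical numeric encoding from the sketch above, a week's partition is just the p sets with w fixed:

    def week_partition(w, p, k):
        # The p pairwise-disjoint subsets S_w(0), ..., S_w(p-1) for week w.
        return [frozenset(x * p + (w * x + i) % p for x in range(k))
                for i in range(p)]

    # With p = 5, k = 3 (the example below), week_partition(0, 5, 3) gives
    # [{0, 5, 10}, {1, 6, 11}, ...], i.e. AFK, BGL, ... as listed later.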
Proof
Consider the groups we created:
Sw(i) = { (x,y) | 0 <= x < k and y = w*x + i (mod p)}
Let's say w is fixed and i varies from 0 to p-1. Then we get p sets:
Sw(0), Sw(1), ..., Sw(p-1)
Now suppose Sw(i1) and Sw(i2) (with i1 =/= i2) intersect; then for some x
w*x + i1 = w*x + i2 (mod p)
and hence i1 = i2, a contradiction. Thus, Sw(i1) and Sw(i2) don't intersect.
Since no two of our sets intersect, and there are exactly p of them (each with k elements), our sets form a partition of the k*p letters.
Generating n/k Subsets Each Week
The biggest disadvantage of this method is that it generates sets for p*k letters, rather than n letters. If the last letters can't be left out (as in your case, where the letters are students), there are two ways to generate exactly n/k subsets each week:
Find a set of prime numbers p1, p2, p3, ... which sum to exactly n/k. Then we can treat each group of p_i*k letters as an independent alphabet, so that rather than finding subsets of p*k letters, we find one group of subsets for p1*k letters, another group of subsets for p2*k letters, and so on.
This has the disadvantage that letters from one group will never be paired with letters from another group, reducing the total number of subsets generated. Luckily, if n/k is even, by Goldbach's conjecture† you will need at most two groups (if n/k is odd, at most three).
This method guarantees subsets of size k, but doesn't generate as many subsets.
† Though unproven, Goldbach's conjecture has been verified far beyond any number you are likely to encounter in this problem.
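Here is a sketch of finding such a split in Python (the helper names are mine; picking the primes as close together as possible makes both alphabets last the same number of weeks):

    def is_prime(m):
        return m > 1 and all(m % d for d in range(2, int(m**0.5) + 1))

    def balanced_prime_split(target):
        # Primes p1 + p2 == target, as close to equal as possible.
        # By Goldbach's conjecture this succeeds for even target >= 4.
        for p1 in range(target // 2, 1, -1):
            if is_prime(p1) and is_prime(target - p1):
                return p1, target - p1
        return None

    print(balanced_prime_split(250))   # (113, 137), as used in the example below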
The other option is to use the smallest prime p >= n/k. This will give you p*k >= n letters; after the subsets have been generated, simply throw out the extra letters. This means that, in the end, some subsets will have size < k. Assuming k divides n evenly (i.e. n/k is an integer), you could take the smaller subsets and mix them up by hand to make subsets of size k, but you risk having some overlap with past/future subsets this way.
This method generates at least as many subsets as the original method, but some may have size < k
Example
Take n = 15, k = 3; i.e., there are 15 students and we are making groups of three.
To begin with, we pick the largest prime p <= n/k. Here n/k = 5 is itself prime (lucky us!), so p = 5.
We map the 15 students into the ordered pairs (a,b) described above, giving us (each letter is a student):
(0,0) = A
(0,1) = B
(0,2) = C
(0,3) = D
(0,4) = E
(1,0) = F
(1,1) = G
(1,2) = H
(1,3) = I
(1,4) = J
(2,0) = K
(2,1) = L
(2,2) = M
(2,3) = N
(2,4) = O
The method generates 25 groups of three. Thus, since we need to schedule n/k = 5 groups each week, we can schedule 5 weeks of activities (5 groups a week * 5 weeks = 25 groups).
For week 0, we generate the partition as
S0(i), for i = 0 to 4.
S0(0) = { (0,0), (1,0), (2,0) } = AFK
S0(1) = { (0,1), (1,1), (2,1) } = BGL
S0(2) = { (0,2), (1,2), (2,2) } = CHM
S0(3) = { (0,3), (1,3), (2,3) } = DIN
S0(4) = { (0,4), (1,4), (2,4) } = EJO
For week 4 it will be
S4(i) for i = 0 to 4.
S4(0) = { (0,0), (1, (4*1 + 0) mod 5), (2, (4*2 + 0) mod 5) }
= { (0,0), (1,4), (2,3) }
= AJN
S4(1) = { (0,1), (1, (4*1 + 1) mod 5), (2, (4*2 + 1) mod 5) }
= { (0,1), (1,0), (2,4) }
= BFO
S4(2) = { (0,2), (1, (4*1 + 2) mod 5), (2, (4*2 + 2) mod 5) }
= { (0,2), (1,1), (2,0) }
= CGK
S4(3) = { (0,3), (1, (4*1 + 3) mod 5), (2, (4*2 + 3) mod 5) }
= { (0,3), (1,2), (2,1) }
= DHL
S4(4) = { (0,4), (1, (4*1 + 4) mod 5), (2, (4*2 + 4) mod 5) }
= { (0,4), (1,3), (2,2) }
= EIM
Here's the schedule for all 5 weeks:
Week: 0
S0(0) ={(0,0) (1,0) (2,0) } = AFK
S0(1) ={(0,1) (1,1) (2,1) } = BGL
S0(2) ={(0,2) (1,2) (2,2) } = CHM
S0(3) ={(0,3) (1,3) (2,3) } = DIN
S0(4) ={(0,4) (1,4) (2,4) } = EJO
Week: 1
S1(0) ={(0,0) (1,1) (2,2) } = AGM
S1(1) ={(0,1) (1,2) (2,3) } = BHN
S1(2) ={(0,2) (1,3) (2,4) } = CIO
S1(3) ={(0,3) (1,4) (2,0) } = DJK
S1(4) ={(0,4) (1,0) (2,1) } = EFL
Week: 2
S2(0) ={(0,0) (1,2) (2,4) } = AHO
S2(1) ={(0,1) (1,3) (2,0) } = BIK
S2(2) ={(0,2) (1,4) (2,1) } = CJL
S2(3) ={(0,3) (1,0) (2,2) } = DFM
S2(4) ={(0,4) (1,1) (2,3) } = EGN
Week: 3
S3(0) ={(0,0) (1,3) (2,1) } = AIL
S3(1) ={(0,1) (1,4) (2,2) } = BJM
S3(2) ={(0,2) (1,0) (2,3) } = CFN
S3(3) ={(0,3) (1,1) (2,4) } = DGO
S3(4) ={(0,4) (1,2) (2,0) } = EHK
Week: 4
S4(0) ={(0,0) (1,4) (2,3) } = AJN
S4(1) ={(0,1) (1,0) (2,4) } = BFO
S4(2) ={(0,2) (1,1) (2,0) } = CGK
S4(3) ={(0,3) (1,2) (2,1) } = DHL
S4(4) ={(0,4) (1,3) (2,2) } = EIM
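As a quick sanity check, here is a small script (my own, not part of the method) that re-verifies the two properties the proofs above guarantee for this schedule:

    from itertools import combinations

    schedule = [
        ["AFK", "BGL", "CHM", "DIN", "EJO"],   # week 0
        ["AGM", "BHN", "CIO", "DJK", "EFL"],   # week 1
        ["AHO", "BIK", "CJL", "DFM", "EGN"],   # week 2
        ["AIL", "BJM", "CFN", "DGO", "EHK"],   # week 3
        ["AJN", "BFO", "CGK", "DHL", "EIM"],   # week 4
    ]
    groups = [set(g) for week in schedule for g in week]
    # No two groups ever share more than one student:
    assert all(len(a & b) <= 1 for a, b in combinations(groups, 2))
    # Every week is a partition of all 15 students:
    assert all(set("".join(week)) == set("ABCDEFGHIJKLMNO") for week in schedule)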
More Practical Example
In your case, n = 1000 students and k = 4 in each group. Thus, we pick p as the largest prime <= n/k = 1000/4 = 250, so p = 241. Without the alterations above under "Generating n/k Subsets Each Week", this method will generate a schedule for p*k = 964 students lasting 241 weeks.
(An upper bound for the maximum number of subsets possible is 1000*999/(4*3) = 83250, though the actual maximum is likely less than that. Even so, this method generates 241^2 = 58081 subsets, about 70% of that bound!)
If we use the first method above to generate a schedule for exactly 1000 students, we take p1 = 113, p2 = 137 (so that p1 + p2 = n/k). Thus, we can generate (113)^2 + (137)^2 = 31,538 subsets of students, enough to last 113 weeks.
If we use the second method above to generate a schedule for exactly 1000 students, we take p = 251. This will give us a schedule for 1004 students for 251 weeks; we remove the 4 phantom students from the schedule each week. Usually, this will result in four groups of 3 every week (though unlikely, it is also possible to have for example one group of 2 and two groups of 3). The groups with < 4 students will always have a multiple-of-4 total number of students, so you could manually place those students into groups of 4, at the risk of potentially having two of those students together again later in another group.
Final thoughts
One flaw of this algorithm is that it's not really flexible: if a student drops out, we are forever stuck with a phantom student. Also, there is no way to add new students to the schedule midway through the year (unless we allow for them by initially creating phantom students).
This problem falls under the category of Restricted Set Systems in combinatorics. See this paper for more information, especially Chapters 1 and 2. Since it is a postscript file, you will need gsview or something to view it.
I had to add another answer as the other one was too long already.
If you have the following constraints:
1) You need groups of 4 every week.
2) Each group in a given week is disjoint from the others, and each student is in exactly one group.
3) If two students have been in the same group, they cannot be in the same group again in the future.
Construct a graph G as follows:
1) Each student is a node.
2) Two students are joined by an edge iff they haven't been together in a group before.
With students dropping out and joining arbitrarily, this becomes a hard problem! Even though you start out with a complete graph initially, after some weeks the graph can become quite unpredictable.
Your problem: you need to find a spanning subgraph of G which is a union of copies of K_4, in other words a partition into K_4s.
Unfortunately, it looks like this problem is NP-hard: Exact Cover by 4-Sets (which is NP-complete) can be reduced to your problem (just as Exact Cover by 3-Sets can be reduced to Partition into Triangles).
Perhaps this will help give some approximation algorithms: http://www.siam.org/proceedings/soda/2010/SODA10_122_chany.pdf
(Your problem can be reduced to Hypergraph matching and so algorithms for that can be used for your problem).
Exact Cover: http://en.wikipedia.org/wiki/Exact_cover
Partition into triangles: https://noppa.tkk.fi/noppa/kurssi/t-79.5103/viikkoharjoitukset/T-79_5103_solutions_5.pdf
Exact Cover by 4-Sets: Given a set S of size 4m and a collection C of 4-element subsets of S, does there exist a subset C' of C such that each element of S appears precisely once in C'?
Unfortunately, seems like you might have to change some constraints.
Here is an outline of the algorithm.
First find all the pairs:
AB BC CD DE EF FG GH HI
AC BD CE DF EG FH GI
AD BE CF DG EH FI
AE BF CG DH EI
AF BG CH DI
AG BH CI
AH BI
AI
These can be stored in an array of size n(n-1)/2.
Now start attempting to combine consecutive pairs using the following rules:
a. Two pairs can be combined only when they have a common character.
b. The combination is possible only when the pair formed by the remaining two characters is also available, e.g. if we want to combine AB and AC then we need to check that BC is also available.
c. When the above rules are satisfied, we combine the two pairs into a triple (e.g. AB and AC merge to form ABC) and mark all three pairs AB, AC and BC as unavailable.
Continue looking for available pairs in the array and merging them into triples, marking the used pairs unavailable, until no available pairs remain or no more triples can be formed.
Example:
1. combine AB + AC --> ABC; Mark AB, AC and BC unavailable.
2. combine AD + AE --> ADE; Mark AD, AE and DE unavailable.
3. combine AF + AG --> AFG; Mark AF, AG and FG unavailable.
..
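Here is a short Python sketch of this greedy merging (my own illustrative implementation of rules a-c above; being greedy, it is not guaranteed to find the maximum possible number of triples):

    from itertools import combinations

    def greedy_triples(letters):
        # Rules a/b: a triple may be formed only if all three of its pairs
        # are still available; rule c: forming it consumes those pairs.
        available = set(combinations(sorted(letters), 2))
        triples = []
        for a, b, c in combinations(sorted(letters), 3):
            if {(a, b), (a, c), (b, c)} <= available:
                triples.append(a + b + c)
                available -= {(a, b), (a, c), (b, c)}
        return triples

    print(greedy_triples("ABCDEFGHI"))   # e.g. ['ABC', 'ADE', 'AFG', 'AHI', ...]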
Here's an approach which will satisfy the requirements you have stated. Whether it does what you want I don't know.
On a large sheet of paper draw a regular grid with at least 250 squares.
Label the sides of the squares with the letters of your alphabet (250 squares x 4 sides == 1000).
Each square defines one of your subsets. Each square shares one side (i.e. one letter) only with each of its (up to) 4 neighbours. No side (i.e. letter) is shared by more than 2 squares (subsets).
I'll leave it up to you to turn this into working code, but I don't think it should be too difficult, and it should scale well. Note that it should also work for any other subset size: you can tile the plane with irregular n-gons for any n, though it might get difficult.
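For what it's worth, here is a sketch of the grid idea in Python (my own illustration, not the poster's code). One caveat: because interior sides are shared between neighbouring squares, an R x C grid yields R(C+1) + C(R+1) distinct letters for its R*C subsets, somewhat fewer than 4 letters per square:

    from itertools import combinations

    def grid_subsets(rows, cols):
        # Each square's subset is the IDs of its four bounding edges;
        # an edge ID is a tuple ("h"/"v", row, col).
        return [frozenset([("h", r, c), ("h", r + 1, c),
                           ("v", r, c), ("v", r, c + 1)])
                for r in range(rows) for c in range(cols)]

    squares = grid_subsets(10, 25)   # 250 subsets of size 4
    # Adjacent squares share exactly one edge; all others share none:
    assert all(len(a & b) <= 1 for a, b in combinations(squares, 2))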
The other approach I thought of is:
1. On a large sheet of paper, draw N dots, where N is the size of your alphabet. Label each dot with one of the letters of the alphabet.
2. Connect any n dots, where n is the size of the required subsets. That's your first subset.
3. Choose one of the already-connected dots and connect it to (n-1) more unconnected dots. That's your next subset.
4. Repeat step 3 until you are finished.
This requires a bit more book-keeping, and there are a number of corner cases to deal with, but again it shouldn't be too difficult to code up.
EDIT: It's easy to transform my first approach into a form which is more obviously an algorithm on a graph. Model each subset as a node, and each letter in the alphabet as an edge. Construct a graph where each node has degree n (number of elements in the subset) and each letter (edge) is used once.
@khuss:
The same method can be generalized, but the algorithm is not linear and may be exponential.
For example, when the subset size is 4, we pick 2 or more pairs covering 4 unique characters,
e.g. "AB and CD" or "AB, AC & AD", only if the following conditions are satisfied:
1. All the pairs formed by the characters of the 4-tuple are available. E.g. if we want to form ABCD using AB, AC & AD, then all the pairs formed out of A, B, C & D, i.e. AB, AC, AD, BC, BD, CD, must all be available.
2. If condition 1 is satisfied, we form ABCD and mark the C(4,2) = 6 pairs as unavailable.
We continue as before until no more 4-tuples can be formed or no more pairs are available.
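Sketched the same way as the triple case (again just my own illustration of the stated conditions):

    from itertools import combinations

    def greedy_ktuples(letters, k=4):
        # Form a k-tuple only if all C(k,2) pairs of its characters are
        # available, then mark those pairs unavailable.
        available = set(combinations(sorted(letters), 2))
        result = []
        for combo in combinations(sorted(letters), k):
            pairs = set(combinations(combo, 2))
            if pairs <= available:
                result.append("".join(combo))
                available -= pairs
        return result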
Related
probability maximize expectation
Given 3n people, where the i-th person passes a test with probability p_i, you are required to divide them into n groups of 3 people each. The score of a group is 1 if at least two of its members pass the test, and 0 otherwise. How do you group them to maximize the expected total score? I've thought about this problem for a bit, and intuitively it makes sense to group two large p_i with a small p_i. I've also observed that in the optimal arrangement, swapping any two p_i from different groups should lower the expectation. I can write out the difference in expectation when swapping two of the students mathematically, but it doesn't seem to give any obvious result. I've hit a wall.
Interesting problem. It feels hard to me, since three tends to be the magic number for NP-hardness, and I don't see any kind of convex structure. I can suggest the following large-neighborhood local search strategy. If we were just trying to match pairs with singles to form groups of three, then the optimal strategy would be to sort the pairs by how likely they are to have exactly one pass, sort the singles by how likely they are to pass, and match them accordingly. To do local search, form initial groups, then repeatedly split each group uniformly at random into a pair and a single, then rematch optimally as above.

Some very rough Python:

    import random

    def quality(groups):
        return sum(a * b + a * c + b * c - 2 * a * b * c for [a, b, c] in groups)

    def main():
        n = 10
        groups = [[random.random() for j in range(3)] for i in range(n)]
        print(groups)
        print(quality(groups))
        for k in range(1000):
            choices = [random.randrange(3) for i in range(n)]
            pairs = [[group[j - 1], group[j - 2]] for (group, j) in zip(groups, choices)]
            pairs.sort(key=lambda pair: pair[0] + pair[1] - 2 * pair[0] * pair[1])
            singles = [group[j] for (group, j) in zip(groups, choices)]
            singles.sort()
            groups = [pair + [single] for (pair, single) in zip(pairs, singles)]
        print(quality(groups))
        print(groups)

    main()
Find optimal points to cut a set of intervals
Given a set of intervals on the real line and a parameter d > 0, find a sequence of points with gaps between neighbors less than or equal to d, such that the number of intervals that contain any of the points is minimized. To prevent trivial solutions we ask that the first point of the sequence lies before the first interval and the last point lies after the last interval. The intervals can be thought of as right-open.

Does this problem have a name? Maybe even an algorithm and a complexity bound?

Some background: this is motivated by a question from topological data analysis, but it seems so general that it could be interesting for other topics, e.g. task scheduling (given a factory that has to shut down at least once a year and wants to minimize the number of tasks interrupted by the maintenance...). We were thinking of integer programming and minimum cuts, but the d-parameter does not quite fit. We also implemented approximate greedy solutions in n^2 and n*log n time, but they can run into very bad local optima.

Show me a picture

[Diagram lost in extraction: seven intervals drawn as horizontal lines, with d chosen so that you have to cut at least every fourth character. Of the two candidate solutions marked at the bottom, x cuts through the four intervals at the top while y cuts through the three intervals at the bottom; y is optimal.]

Show me some code: how should we define fun in the following snippet?

    intervals = [(0, 1), (0.5, 1.5), (0.5, 1.5)]
    d = 1.1
    fun(intervals, d)
    >>> [-0.55, 0.45, 1.55]  # Or something close to it

In this small example the optimal solution will cut the first interval, but not the second and third. Obviously, the algorithm should work with more complicated examples as well.

A tougher test can be the following: given a uniform distribution of interval start times on [0, 100] and lengths uniform on [0, d], one can compute the expected number of cuts by a regular grid [0, d, 2d, 3d, ...] to be slightly below 0.5*n, and the optimal solution should be better:

    n = 10000
    delta = 1
    starts = np.random.uniform(low=0., high=99, size=n)
    lengths = np.random.uniform(low=0., high=1, size=n)
    rand_intervals = np.array([starts, starts + lengths]).T
    regular_grid = np.arange(0, 101, 1)
    optimal_grid = fun(rand_intervals, delta)

    # This computes the number of intervals being cut by one of the points
    def cuts(intervals, grid):
        bins = np.digitize(intervals, grid)
        return sum(bins[:, 0] != bins[:, 1])

    cuts(rand_intervals, regular_grid)
    >>> 4987  # Expected to be slightly below 0.5*n
    assert cuts(rand_intervals, optimal_grid) <= cuts(rand_intervals, regular_grid)
You can solve this optimally through dynamic programming by maintaining an array S[k], where S[k] is the best solution (covering the largest amount of space) that has k intervals containing a point. Then you repeatedly remove your lowest S[k], extend it in all possible ways (limiting yourself to the relevant endpoints of intervals plus the last point in S[k] + delta), and update S with those new possible solutions. When the lowest S[k] in your table covers the entire range, you are done.

A Python 3 solution using intervaltree from pip:

    from intervaltree import Interval, IntervalTree

    def optimal_points(intervals, d, epsilon=1e-9):
        intervals = [Interval(lr[0], lr[1]) for lr in intervals]
        tree = IntervalTree(intervals)
        start = min(iv.begin for iv in intervals)
        stop = max(iv.end for iv in intervals)

        # The best partial solution with k intervals containing a point.
        # We also store the intervals that these points are contained in as a set.
        sols = {0: ([start], set())}

        while True:
            lowest_k = min(sols.keys())
            s, contained = sols.pop(lowest_k)
            # print(lowest_k, s[-1])  # For tracking progress in slow instances.
            if s[-1] >= stop:
                return s
            relevant_intervals = tree[s[-1]:s[-1] + d]
            relevant_points = [iv.begin - epsilon for iv in relevant_intervals]
            relevant_points += [iv.end + epsilon for iv in relevant_intervals]
            extensions = {s[-1] + d} | {p for p in relevant_points
                                        if s[-1] < p < s[-1] + d}
            for ext in sorted(extensions, reverse=True):
                new_s = s + [ext]
                new_contained = set(tree[ext]) | contained
                new_k = len(new_contained)
                if new_k not in sols or new_s[-1] > sols[new_k][0][-1]:
                    sols[new_k] = (new_s, new_contained)
If the range and precision are feasible to iterate over, we can first merge and count the intervals. For example:

    [(0, 1), (0.5, 1.5), (0.5, 1.5)] -> [(0, 0.5, 1), (0.5, 1, 3), (1, 1.5, 2)]

Now let f(n, k) represent the optimal solution with k points up to position n on the number line. Then:

    f(n, k) = min( num_intervals(n) + f(n - i, k - 1) )

num_intervals(n) is known in O(1) from a pointer into the merged interval list. n - i does not range over every precision point up to n; rather, it is every point not more than d back that marks a change from one merged interval to the next as we move back from our current pointer in the merged-interval list.

One issue to note is that we need to store the distance between the rightmost and previous point for any optimal f(n, k). This is to avoid joining to an f(n - i, k - 1) whose second-to-rightmost point would be less than d away from our current n, making the new middle point, n - i, superfluous and invalidating this solution. (I'm not sure I've thought this issue through enough. Perhaps someone could point out something that's amiss.)

How would we know k is high enough? Given that the optimal solution may use fewer points than the current k, we assume that the recurrence prevents us from finding an instance based on the idea in the above paragraph. For the 7-interval diagram in the question (with d = 4 on positions 0..8):

    merged list: [(1, 3, 2), (3, 4, 5), (4, 5, 3), (5, 6, 5), (6, 8, 2)]

    f(4, 2) = (3, 0)  // (intersections, previous point)
    f(8, 3) = (3, 4)

There are no valid solutions for f(8, 4), since the break point we might consider between interval changes in the merged list lies before the second-to-last point of f(8, 3).
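A sketch of the merge-and-count preprocessing this answer relies on (my own illustration; it reproduces the merged list from the example above):

    def merged_counts(intervals):
        # Split the line at every endpoint and record, for each resulting
        # segment, how many intervals cover it.
        points = sorted({p for iv in intervals for p in iv})
        segments = []
        for lo, hi in zip(points, points[1:]):
            cover = sum(1 for a, b in intervals if a <= lo and b >= hi)
            if cover:
                segments.append((lo, hi, cover))
        return segments

    print(merged_counts([(0, 1), (0.5, 1.5), (0.5, 1.5)]))
    # -> [(0, 0.5, 1), (0.5, 1, 3), (1, 1.5, 2)]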
Permutations unrank
I know of an algorithm (it can be found online) to rank a permutation, i.e. given a permutation, return its integer index into the list of lexicographically sorted permutations, but I don't know an unranking algorithm that does the opposite: given an index i, return the i-th permutation in that lexicographic order. Since I couldn't find any, can somebody please shed some light?
Let's say you are permuting the letters (a, b, c). There are 3×2×1 = 6 permutations. Of these, a third starts with a and lexicographically precedes another third starting with b, which precedes the last third starting with c. For each of these thirds there are two halves, one starting with the first letter left after choosing the first, and the other with the second. Each of these halves has only one element (the last letter).

So, given a set of three elements and an index between zero and five (let's say 3), we can divide (with remainder) by the size of each "third" to get the first letter. Now:

- the set has size n = 3
- there are n! = 6 permutations
- there are n = 3 groups of permutations, one starting with each of the n elements
- each group has size n!/n = (n-1)! = 6/3 = 2 elements

To determine the index of the first element, we divide by 2 with remainder: 3 ÷ 2 = 1 rem 1. Since our set is (a, b, c), this tells us that the first letter is b.

Now we can remove the letter b from the set and use the remainder as the new index. We get the set (a, c) and index 1. Re-applying the algorithm:

- the set has size n = 2
- there are n! = 2 permutations
- there are n = 2 groups of permutations, one starting with each of the n elements
- each group has size n!/n = (n-1)! = 2/2 = 1 element

To determine the index of the next element, we divide by 1 with remainder: 1 ÷ 1 = 1 rem 0. Since our set is (a, c), this tells us that the second letter is c. The set is then reduced to the singleton a, and that's our third letter.

The permutation with index 3 is b, c, a. Let's check it:

    0 abc
    1 acb
    2 bac
    3 bca  <-- correct!
    4 cab
    5 cba

Putting this into a real algorithm and generalizing (C#):

    static int Factorial(int m)
    {
        int f = 1;
        for (int i = 2; i <= m; i++) f *= i;
        return f;
    }

    public string NthPerm(string set, int n)
    {
        var res = "";
        while (set.Length > 0)
        {
            var setSize = Factorial(set.Length - 1);
            var index = n / setSize;
            res += set[index];
            set = set.Remove(index, 1);  // drop the chosen letter from the set
            n = n % setSize;
        }
        return res;
    }
segment overlapping regions into disjoint regions
Given a set of closed regions [a,b], where a and b are integers, I need to find another set of regions that cover the same numbers but are disjoint. I suppose it is possible to do naively by iterating through the set several times, but I am looking for a recommendation of a good algorithm for this. Please help.

EDIT: to clarify, the resulting regions cannot be larger than the original ones; I have to come up with disjoint regions that are contained in the original ones. In other words, I need to split the original regions on the boundaries where they overlap.

Example:

    3,8
    1,4
    7,9
    11,14

Result:

    1,3
    3,4
    4,7
    7,8
    8,9
    11,14
Just sort all endpoints left to right (remembering their type: start or end), then sweep left to right. Keep a counter, starting at 0, that always equals the number of intervals covering the gap between the previous point and the current one. At each point, before updating the counter, check: if the counter is greater than zero and the last two points are different (this prevents empty ranges), add the interval between the last two points to the output. Then increment the counter if the point is a start, or decrement it if it is an end (note that the counter is always at least 0).

Pseudocode:

    points = all interval endpoints, tagged start or end
    sort(points)
    previous = points[0]
    counter = 0
    for (int i = 0; i < #points; i++) {
        current = points[i]
        if (counter > 0 and previous != current)
            add (previous, current) to output
        if (current is a start point) counter++
        else counter--
        previous = current
    }
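For reference, a runnable Python version of the sweep (a sketch; it performs the emit check before updating the counter, as in the pseudocode above):

    def disjoint_regions(intervals):
        # Tag each endpoint: +1 for a start, -1 for an end.
        points = sorted([(a, +1) for a, b in intervals] +
                        [(b, -1) for a, b in intervals])
        result = []
        counter = 0                 # intervals covering (previous, current)
        previous = points[0][0]
        for current, delta in points:
            if counter > 0 and previous != current:
                result.append((previous, current))
            counter += delta
            previous = current
        return result

    print(disjoint_regions([(3, 8), (1, 4), (7, 9), (11, 14)]))
    # -> [(1, 3), (3, 4), (4, 7), (7, 8), (8, 9), (11, 14)]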
(This is a modification of an answer that I posted earlier today, which I deleted after I discovered it had a logic error. I later realized that I could modify Vincent van der Weele's elegant idea of using parenthesis depth to fix the bug.)

On edit: modified to accept intervals of length 0.

Call an interval [a,a] of length 0 essential if a doesn't also appear as an endpoint of any interval of length > 0. For example, in [1,3], [2,2], [3,3], [4,4] the 0-length intervals [2,2] and [4,4] are essential but [3,3] isn't. Inessential 0-length intervals are redundant and thus need not appear in the final output. When the list of intervals is initially scanned (loading the basic data structures), points corresponding to 0-length intervals are recorded, as are endpoints of intervals of length > 0. When the scan is completed, two instances of each point corresponding to an essential 0-length interval are added into the list of endpoints, which is then sorted. The resulting data structure is a multiset where the only repetitions correspond to essential 0-length intervals.

For every endpoint in the list, define the pdelta (parenthesis delta) of the endpoint as the number of times that point appears as a left endpoint minus the number of times it appears as a right endpoint. Store these in a dictionary keyed by the endpoints.

[a,b], where a and b are the first two elements of the list of endpoints, is the first interval in the list of disjoint intervals. Define the parenthesis depth of b to be the sum of pdelta[a] and pdelta[b]. We loop through the rest of the endpoints as follows:

In each pass through the loop, look at the parenthesis depth of b. If it is not 0, then b is still needed for one more interval: let a = b, let the new b be the next value in the list, adjust the parenthesis depth by the pdelta of the new b, and add [a,b] to the list of disjoint intervals. Otherwise (if the parenthesis depth of b is 0), let the next [a,b] be the next two points in the list and adjust the parenthesis depth accordingly.

Here is a Python implementation:

    def disjointify(intervals):
        if len(intervals) == 0:
            return []
        pdelta = {}
        ends = set()
        disjoints = []
        onePoints = set()  # one-point intervals
        for (a, b) in intervals:
            if a == b:
                onePoints.add(a)
                if not a in pdelta:
                    pdelta[a] = 0
            else:
                ends.add(a)
                ends.add(b)
                pdelta[a] = pdelta.setdefault(a, 0) + 1
                pdelta[b] = pdelta.setdefault(b, 0) - 1
        onePoints.difference_update(ends)
        ends = list(ends)
        for a in onePoints:
            ends.extend([a, a])
        ends.sort()
        a = ends[0]
        b = ends[1]
        pdepth = pdelta[a] + pdelta[b]
        i = 1
        disjoints.append((a, b))
        while i < len(ends) - 1:
            if pdepth != 0:
                a = b
                b = ends[i + 1]
                pdepth += pdelta[b]
                i += 1
            else:
                a = ends[i + 1]
                b = ends[i + 2]
                pdepth += (pdelta[a] + pdelta[b])
                i += 2
            disjoints.append((a, b))
        return disjoints

Sample output which illustrates various edge cases:

    >>> example = [(1,1), (1,4), (2,2), (4,4), (5,5), (6,8), (7,9), (10,10)]
    >>> disjointify(example)
    [(1, 2), (2, 2), (2, 4), (5, 5), (6, 7), (7, 8), (8, 9), (10, 10)]
    >>> disjointify([(1,1), (2,2)])
    [(1, 1), (2, 2)]

(I am using Python tuples to represent the closed intervals, even though this has the minor drawback of looking like the standard mathematical notation for open intervals.)

A final remark: referring to the result as a collection of disjoint intervals might not be accurate, since some of these intervals have nonempty, albeit 1-point, intersections.
generate sequence with all permutations
How can I generate the shortest sequence which contains all possible permutations? Example: for length 2 the answer is 121, because this sequence contains 12 and 21, which are all the possible permutations. For length 3 the answer is 123121321, because this sequence contains all possible permutations as substrings: 123, 231, 312, 121 (invalid), 213, 132, 321. Each number may only occur once within a given permutation.
This greedy algorithm produces fairly short minimal sequences.

UPDATE: Note that for n ≥ 6, this algorithm does not produce the shortest possible string!

1. Make a collection of all permutations.
2. Remove the first permutation from the collection. Let a = the first permutation.
3. Find the sequence in the collection that has the greatest overlap with the end of a. If there is a tie, choose the sequence that comes first in lexicographic order. Remove the chosen sequence from the collection and add the non-overlapping part to the end of a. Repeat this step until the collection is empty.

The curious tie-breaking step is necessary for correctness; breaking the tie at random instead seems to result in longer strings.

I verified (by writing a much longer, slower program) that the answer this algorithm gives for length 4, 123412314231243121342132413214321, is indeed the shortest answer. However, for length 6 it produces an answer of length 873, which is longer than the shortest known solution. The algorithm is O((n!)^2).

An implementation in Python:

    import itertools

    def costToAdd(a, b):
        for i in range(1, len(b)):
            if a.endswith(b[:-i]):
                return i
        return len(b)

    def stringContainingAllPermutationsOf(s):
        perms = set(''.join(tpl) for tpl in itertools.permutations(s))
        perms.remove(s)
        a = s
        while perms:
            cost, next = min((costToAdd(a, x), x) for x in perms)
            perms.remove(next)
            a += next[-cost:]
        return a

The lengths of the strings generated by this function are 1, 3, 9, 33, 153, 873, 5913, ..., which appears to be this integer sequence. I have a hunch you can do better than O((n!)^2).
Create all permutations. Let each permutation represent a node in a graph. Now, for any two permutations, add an edge with weight 1 if they share n-1 digits (counted from the end of the source and from the beginning of the target), weight 2 if they share n-2 digits, and so on. Now you are left to find the shortest path visiting all n! vertices.
Here is a fast algorithm that produces a short string containing all permutations. I am pretty sure it produces the shortest possible answer, but I don't have a complete proof in hand.

Explanation. Below is a tree of all permutations. The picture is incomplete; imagine that the tree goes on forever to the right.

    1 --+-- 12 --+-- 123 ...
        |        |
        |        +-- 231 ...
        |        |
        |        +-- 312 ...
        |
        +-- 21 --+-- 213 ...
                 |
                 +-- 132 ...
                 |
                 +-- 321 ...

The nodes at level k of this tree are all the permutations of length k. Furthermore, the permutations are in a particular order with a lot of overlap between each permutation and its neighbors above and below.

To be precise, each node's first child is found by simply adding the next symbol to the end. For example, the first child of 213 would be 2134. The rest of the children are found by rotating the first child to the left one symbol at a time. Rotating 2134 would produce 1342, 3421, 4213.

Taking all the nodes at a given level and stringing them together, overlapping as much as possible, produces the strings 1, 121, 123121321, etc. The length of the nth string in that sequence is the sum for x = 1 to n of x!. (You can prove this by observing how much non-overlap there is between neighboring permutations: siblings overlap in all but 1 symbol; first cousins overlap in all but 2 symbols; and so on.)

Sketch of proof. I haven't completely proved that this is the best solution, but here's a sketch of how the proof would proceed. First show that any string containing n distinct permutations has length ≥ 2n - 1. Then show that any string containing n + 1 distinct permutations has length ≥ 2n + 1; that is, adding one more permutation will cost you two digits. Proceed by calculating the minimum length of strings containing nPr and nPr + 1 distinct permutations, up to n!. In short, this sequence is optimal because you can't make it worse somewhere in the hope of making it better someplace else. It's already locally optimal everywhere. All the moves are forced.

Algorithm. Given all this background, the algorithm is very simple. Walk this tree to the desired depth and string together all the nodes at that depth. Fortunately we do not actually have to build the tree in memory.

    def build(node, s):
        """String together all descendants of the given node at the target depth."""
        d = len(node)  # depth of this node. depth of "213" is 3.
        n = len(s)     # target depth
        if d == n - 1:
            return node + s[n - 1] + node  # children of 213 join to make "2134213"
        else:
            c0 = node + s[d]  # first child node
            children = [c0[i:] + c0[:i] for i in range(d + 1)]  # all child nodes
            strings = [build(c, s) for c in children]  # recurse to the desired depth
            for j in range(1, d + 1):
                strings[j] = strings[j][d:]  # cut off overlap with previous sibling
            return ''.join(strings)  # join what's left

    def stringContainingAllPermutationsOf(s):
        return build(s[:1], s)

Performance. The above code is already much faster than my other solution, and it does a lot of cutting and pasting of large strings that you can optimize away. The algorithm can be made to run in time and memory proportional to the size of the output.
For n = 3, a chain of length 8 suffices: 12312132. It seems to me we are working with a cyclic system - a ring, in other words - but treating it as a chain. As a chain the answer really is 123121321, of length 9, but as a ring 12312132 has length 8: the last permutation 321 wraps around, taking its final 1 from the beginning of the sequence 12312132[1].
These are called (minimal-length) superpermutations (cf. Wikipedia). Interest in this was re-sparked when an anonymous user posted a new lower bound on 4chan. (See Wikipedia and many other web pages for the history.)

AFAIK, as of today we know only the following:

- Their length is A180632(n) ≤ A007489(n) = Sum_{k=1..n} k!, but this bound is only sharp for n ≤ 5, i.e., we have equality for n ≤ 5 but strictly less for n > 5.
- There's a very simple recursive algorithm, given below, producing a superpermutation of length A007489(n), which is always palindromic (but, as said above, not of minimal length for n > 5).
- For n ≥ 7 we have the better upper bound n! + (n−1)! + (n−2)! + (n−3)! + n − 3.
- For n ≤ 5 all minimal SPs are known; for n > 5 we don't know which SP is minimal.
- For n = 1, 2, 3, 4 the minimal SPs are unique (up to renaming the symbols), given by 1, 121, 123121321, 123412314231243121342132413214321, of lengths A007489(1..4) = (1, 3, 9, 33).
- For n = 5 there are 8 inequivalent SPs of minimal length 153 = A007489(5); the palindromic one produced by the algorithm below is the 3rd in lexicographic order.
- For n = 6 Houston produced thousands of superpermutations of the smallest known length 872 = A007489(6) − 1, but AFAIK we still don't know whether this is minimal.
- For n = 7 Egan produced one of length 5906 (one less than the better upper bound given above), but again we don't know whether that's minimal.

I've written a very short PARI/GP program (you can paste it into the PARI/GP web site to run it) which implements the standard algorithm producing a palindromic superpermutation of length A007489(n):

    extend(S, n=vecmax(s))={
      my(t);
      concat([ if(#Set(s)<n, [],   /* discard if not a permutation */
          s=concat([s, n+1, s]);   /* Now merge with preceding segment: */
          forstep(i=min(#s, #t)-1, 0, -1,
            if(s[1..1+i]==t[#t-i..#t], s=s[2+i..-1]; break));
          t=s                      /* store as previous for next */
        )/*endif*/
      | s <- [ S[i+1..i+n] | i <- [0..#S-n] ]])
    }

    SSP=vector(6, n, s=if(n>1, extend(s), [1])); \\ gives the first 6, the 6th being non-minimal

I think that translates easily to any other language. (For non-PARI-speaking persons: "| x <-" means "for x in".)