efficiently finding overlapping segments from a set of lists - algorithm

Suppose I have the following lists:
[1, 2, 3, 20, 23, 24, 25, 32, 31, 30, 29]
[1, 2, 3, 20, 23, 28, 29]
[1, 2, 3, 20, 21, 22]
[1, 2, 3, 14, 15, 16]
[16, 17, 18]
[16, 17, 18, 19, 20]
Order matters here. These are the nodes resulting from a depth-first search in a weighted graph. What I want to do is break down the lists into unique paths (where a path has at least 2 elements). So, the above lists would return the following:
[1, 2, 3]
[20, 23]
[24, 25, 32, 31, 30, 29]
[28, 29]
[20, 21, 22]
[14, 15, 16]
[16, 17, 18]
[19, 20]
The general idea I have right now is this:
Step 1: Look through all pairs of lists to build the set of overlapping segments that appear at the beginning of the lists. For the example above, this step would output:
[1, 2, 3, 20, 23]
[1, 2, 3, 20]
[1, 2, 3]
[16, 17, 18]
Step 2: Reduce that set by the same procedure, keeping only the prefixes its lists share. The next output would be this:
[1, 2, 3]
[16, 17, 18]
Step 3: Once I have the lists from step 2, I look through each input list and chop off the front if it matches one of the lists from step 2. The new lists look like this:
[20, 23, 24, 25, 32, 31, 30, 29]
[20, 23, 28, 29]
[20, 21, 22]
[14, 15, 16]
[19, 20]
I then go back and apply step 1 to the truncated lists from step 3. When step 1 doesn't output any overlapping lists, I'm done.
Step 2 is the tricky part here. What's silly is it's actually equivalent to solving the original problem, although on smaller lists.
What's the most efficient way to solve this problem? Looking at all pairs obviously requires O(N^2) time, and step 2 seems wasteful since I need to run the same procedure to solve these smaller lists. I'm trying to figure out if there's a smarter way to do this, and I'm stuck.
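For concreteness, here is a minimal Python sketch of step 1 (my own illustration, not a full solution): take the longest common prefix of every pair of lists and keep the distinct ones with at least 2 elements. On the six input lists above it produces exactly the four prefixes shown.

def common_prefix_len(a, b):
    # number of leading elements on which a and b agree
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def step1(lists):
    # distinct overlapping prefixes (>= 2 elements) over all pairs
    out = set()
    for i in range(len(lists)):
        for j in range(i + 1, len(lists)):
            k = common_prefix_len(lists[i], lists[j])
            if k >= 2:
                out.add(tuple(lists[i][:k]))
    return out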

It seems like the solution is to modify a trie to serve the purpose. Trie compression gives clues, but the kind of compression needed here won't yield any performance benefits.
The first list you add becomes its own node (rather than k nodes). If there is any overlap, nodes split, but they never get smaller than holding two elements of the array.
A simple example of the graph structure looks like this:
insert (1,2,3,4,5)
graph: (1,2,3,4,5)->None
insert (1,2,3)
graph: (1,2,3)->(4,5), (4,5)->None
insert (3,32)
graph: (1,2,3)->(4,5), (4,5)->None, (3,32)->None
segments:
output: (1,2,3), (4,5), (3,32)
The child nodes should also be stored in an actual trie, at least once there are enough of them, to avoid a linear search when adding to or removing from the data structure, which could otherwise increase the runtime by a factor of N. With that in place, the data structure has the same big-O performance as a trie, with somewhat higher hidden constants: building it takes O(L*N), where L is the average length of a list and N is the number of lists. Obtaining the segments is linear in the number of segments.
The final data structure, basically a directed graph, for your example would look like the one below, with the start node at the bottom.
Note that this data structure can be built as you run the DFS rather than afterwards.
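To make that concrete, here is a minimal Python sketch of the structure (my reconstruction of the description above, not the original code): each node holds a run of elements, inserting a list splits a node where the run and the list diverge, and collecting the runs, with a lone element merged down into its children as described, yields the segments. On the question's six lists it reproduces the eight expected segments, starting with [1, 2, 3].

class Node:
    # a radix-tree node: a run of elements plus children keyed by the
    # first element of each child's run
    def __init__(self, seg):
        self.seg = seg          # tuple with at least one element
        self.children = {}      # first element of a child's run -> Node

def insert(roots, seq):
    # insert seq into a forest of radix-tree roots (a dict like children)
    nodes, seq = roots, tuple(seq)
    while seq:
        node = nodes.get(seq[0])
        if node is None:
            nodes[seq[0]] = Node(seq)   # whole remainder becomes one node
            return
        k = 0                           # how far seq matches this node's run
        while k < len(node.seg) and k < len(seq) and node.seg[k] == seq[k]:
            k += 1
        if k < len(node.seg):           # split: keep the match, push tail down
            tail = Node(node.seg[k:])
            tail.children = node.children
            node.children = {node.seg[k]: tail}
            node.seg = node.seg[:k]
        seq = seq[k:]
        nodes = node.children

def segments(nodes, out, carry=()):
    # collect every run; a single-element run with children is merged
    # into each child (duplicating the element) instead of emitted alone
    for node in nodes.values():
        run = carry + node.seg
        if len(run) == 1 and node.children:
            segments(node.children, out, run)
        else:
            if len(run) >= 2:
                out.append(list(run))
            segments(node.children, out)
    return out

roots = {}
for lst in ([1, 2, 3, 4, 5], [1, 2, 3], [3, 32]):   # the example inserts above
    insert(roots, lst)
print(segments(roots, []))   # [[1, 2, 3], [4, 5], [3, 32]]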

I ended up solving this by thinking about the problem slightly differently. Instead of thinking about sequences of nodes (where an edge is implicit between each successive pair of nodes), I'm thinking about sequences of edges. I basically use the algorithm I posted originally. Step 2 is simply an iterative step where I repeatedly identify prefixes until there are no more prefixes left to identify. This is pretty quick, and dealing with edges instead of nodes really simplified everything.
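For reference, here is a compact Python sketch of that repeated prefix identification (a reconstruction, not my exact code): group lists by their first element, peel off the longest prefix shared by the whole group when it has at least two elements, and recurse on the remainders; when the shared prefix is a single node, keep it attached and split the group on the second element instead. On the six input lists it reproduces the eight segments from the question, in the same order.

def unique_segments(lists):
    out = []

    def lcp(group):
        # length of the longest prefix shared by every list in the group
        k = min(len(l) for l in group)
        for i in range(k):
            if len({l[i] for l in group}) > 1:
                return i
        return k

    def split(group):
        if len(group) == 1:
            if len(group[0]) >= 2:
                out.append(list(group[0]))
            return
        k = lcp(group)
        if k >= 2:
            out.append(list(group[0][:k]))
            regroup([l[k:] for l in group if len(l) > k])
        else:
            # a single shared node is too short to be a path: keep it
            # attached and split on the second element instead
            sub = {}
            for l in group:
                if len(l) >= 2:
                    sub.setdefault(l[1], []).append(l)
            for g in sub.values():
                split(g)

    def regroup(ls):
        groups = {}
        for l in ls:
            if l:
                groups.setdefault(l[0], []).append(l)
        for g in groups.values():
            split(g)

    regroup([tuple(l) for l in lists])
    return out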
Thanks for everyone's help!

Related

How to solve combinations in card game for n people in r rounds (just one encounter)

There is a famous card game in Germany called "Doppelkopf".
Usually, you play "Doppelkopf" with 4 players, but you can also play it with a table of 5 players, where one player is just watching.
(Where everyone "has the cards" once in a round, meaning everyone has the right to play the first card once every round.)
Every year, my family organizes a "Doppelkopf" tournament with 3 rounds (r).
Depending on the availability of my relatives, the number of participants varies every year.
Expecting a minimum of 16 participants, the number (n) in this experiment can grow without limit (as can the number of rounds r).
Naturally, my relatives do not want to be paired with someone twice, since they want to exchange gossip most efficiently!
There we have:
n - Participants
r- Rounds
t_total = n // 4            # total tables (floor division)
t_5 = n % 4                 # tables of 5
t_4 = t_total - t_5         # tables of 4
pos_pair = n * (n - 1) // 2           # possible pairs, i.e. C(n, 2)
nec_pair = (t_5 * 10 + t_4 * 6) * r   # necessary pairs: C(5, 2) = 10, C(4, 2) = 6 per table
The aim I was given is to minimize repeat encounters (if possible, to have every pair meet at most once)!
Since I do not want to solve the problem only for P{n={16, ..., 32}, r=3} (which I did for some cases), but for any given P{n∈N, r∈N}, there is a discrepancy between my abilities and the requirements for a solution!
Therefore, I would like to ask the community to help me solve this problem for any given P{n∈N, r∈N}!
And also to prove when this problem is not solvable for a given P{n∈N, r∈N}, which is certainly the case if pos_pair < nec_pair.
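For concreteness, a small sketch of this counting bound (note that pos_pair >= nec_pair is only a necessary condition, not a sufficient one):

def counting_bound(n, r):
    t_5 = n % 4                           # tables of 5
    t_4 = n // 4 - t_5                    # tables of 4
    pos_pair = n * (n - 1) // 2           # possible pairs, C(n, 2)
    nec_pair = (t_5 * 10 + t_4 * 6) * r   # 10 = C(5, 2), 6 = C(4, 2)
    return pos_pair, nec_pair

print(counting_bound(20, 3))   # (190, 90)  -> the bound allows a solution
print(counting_bound(16, 5))   # (120, 120) -> tight: every pair must meet exactly once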
Here are two solutions for P{n=20, r=3}, which very much solve my "Doppelkopf" tournament problem:
('Best result was ', [[[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16], [17, 18, 19, 20]], [[16, 12, 8, 18], [13, 1, 5, 9], [15, 4, 17, 6], [2, 19, 7, 10], [3, 11, 20, 14]], [[14, 9, 17, 7], [13, 20, 8, 2], [5, 4, 12, 19], [6, 16, 11, 1], [15, 18, 10, 3]]])
('Best result was ', [[[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16], [17, 18, 19, 20]], [[19, 11, 13, 3], [2, 15, 9, 8], [1, 16, 18, 6], [14, 7, 17, 10], [4, 12, 20, 5]], [[17, 8, 3, 12], [20, 9, 16, 7], [15, 11, 6, 4], [2, 13, 10, 18], [1, 19, 14, 5]]])
But in order to solve this problem for arbitrary n and r, I have come to no conclusion.
In my opinion, there are three ways to go about this problem in a computational solution or approximation.
First, you can iterate over rounds and assign every player to a table without collisions, remembering pairs and appearances in total (so as not to exceed the total number of rounds); a sketch of this appears below.
Second, you can iterate over tables, which seems to be helpful when the number of participants is a multiple of 2 (see P{n=16, r=5} at https://matheplanet.com/default3.html?call=viewtopic.php?topic=85206&ref=https%3A%2F%2Fwww.google.com%2F); here too you remember pairs and appearances, but mainly follow a certain pattern as described in the link, which I somehow cannot scale to other numbers!
Third, there may be a mathematical way to describe this procedure and conclude a solution from it.
Even though this is more of a mathematical question (and I don't know where to ask such questions), I am interested in the algorithmic solution!
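As a starting point for the first approach, here is a hedged Python sketch (my own, and only a heuristic: it redraws random seatings until a round has no repeated pair, so it can fail and return None on tight instances where backtracking or a constraint solver would be needed):

import random
from itertools import combinations

def random_schedule(n, r, tries_per_round=20000):
    t_5 = n % 4                        # tables of 5, as above
    t_4 = n // 4 - t_5                 # tables of 4
    sizes = [5] * t_5 + [4] * t_4
    seen = set()                       # pairs already seated together
    rounds = []
    for _ in range(r):
        for _ in range(tries_per_round):
            players = list(range(1, n + 1))
            random.shuffle(players)
            tables, pairs, i = [], set(), 0
            for s in sizes:
                table = players[i:i + s]
                i += s
                pairs.update(frozenset(p) for p in combinations(table, 2))
                tables.append(table)
            if seen.isdisjoint(pairs):  # no pair meets twice
                seen.update(pairs)
                rounds.append(tables)
                break
        else:
            return None                 # this round failed; rerun or use a solver
    return rounds

print(random_schedule(20, 3))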

How to calculate the optimal time interval in multiple time series forecasts?

First things first: I am new to the world of statistics.
Problem statement:
I have three predicted time series. They represent three independent scores, and I want to select the timeslot over which their sum is minimized. The length of the timeslot is already provided. I have read that there is confidence-based selection of a predicted interval for such problems, but I used an LSTM to predict the time series, which may keep me from using that approach; I also think calculating a prediction interval applies to a single time series.
e.g: Consider below arrays represent the three predicted time series.
arr1 = [23, 34, 16, 5, 45, 10, 2, 34, 56, 11]
arr2 = [123, 100, 124, 245, 125, 120, 298, 124, 175, 200]
arr3 = [1, 3, 10, 7, 2, 2, 10, 7, 8, 12]
time slot length = 3
As you can see, the optimal timeslot for arr1 is [5, 7], for arr2 it is [0, 2], and for arr3 it is [3, 5], but I need a single timeslot for all three time series.
Questions:
Which error paradigm should I employ to select the optimal time slot?
I am also given weights (positive real numbers in [0, 1]) which represent the importance of each time series in deciding the timeslot. How do I employ them in the error paradigm?
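To make the setup concrete, here is a small Python sketch (my own; the weights are placeholders for the given importances) that scores every window of the given length by the weighted sum of the three series and picks the smallest:

import numpy as np

arr1 = np.array([23, 34, 16, 5, 45, 10, 2, 34, 56, 11])
arr2 = np.array([123, 100, 124, 245, 125, 120, 298, 124, 175, 200])
arr3 = np.array([1, 3, 10, 7, 2, 2, 10, 7, 8, 12])
w1, w2, w3 = 1.0, 1.0, 1.0           # placeholder weights in [0, 1]
L = 3                                 # given time slot length

combined = w1 * arr1 + w2 * arr2 + w3 * arr3
window_sums = np.convolve(combined, np.ones(L), mode="valid")  # sum of each window
start = int(np.argmin(window_sums))
print([start, start + L - 1])         # inclusive timeslot

With equal weights this picks [0, 2], because arr2's larger scale dominates the sum; normalizing each series (or folding the scale into the weights) is worth considering before choosing an error paradigm.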

How to Generate N random numbers from a SHA-256 Hash

I'm working on a "provably fair" site where, let's say, X participants enter into a drawing and we need to pick 1 overall winner, but ideally we also want to pick N sub-winners out of the X total.
(for the curious, the SHA-256 Hash will be the merkle tree root of a Bitcoin block at a pre-specified time)
So, given a SHA-256 hash, how do we generate N random numbers?
I think I know how to generate 1 random number (within ruby's Fixnum range). According to this article: http://patshaughnessy.net/2014/1/9/how-big-is-a-bignum
The maximum Fixnum integer is: 4611686018427387903
Let's pluck the first Y characters of the SHA-256 hash. We can generate a hash ourselves, instead of relying on a Bitcoin merkle root, with:
d = Digest::SHA256.hexdigest('hello')
> "2cf24dba5fb0a30e26e83b2ac5b9e29e1b161e5c1fa7425e73043362938b9824"
Let's take the first 6 characters, or: 2cf24d
Convert this to base 10:
'2cf24d'.to_i(16)
> 2945613
We now have a unique Fixnum based on our merkle root.
With X participants, let's say 17, we decide the winner with:
2945613 % 17
> 6
So assuming all entries know their order of entry, the sixth entrant can prove that they should be the winner.
Now -- what would be the best way to similarly pick N sub-winners? Let's say each of these entrants should get a smaller but still somewhat valuable prize.
Why not just use the hash for the seed?
[*1..17].shuffle(random: Random.new(0x2cf24d))
# => [15, 5, 9, 7, 14, 3, 16, 12, 2, 1, 17, 4, 6, 13, 11, 10, 8]
EDIT: This is dependent on Ruby version though - I believe shuffle is different between JRuby and MRI, even though Random produces the same sequence. You could circumvent this by implementing shuffle yourself. See this question for more details. This workaround works consistently for me in both JRuby and MRI:
r = Random.new(0x2cf24d)
[*1..17].sort_by { r.rand }
# => [14, 11, 4, 10, 1, 3, 9, 13, 16, 17, 12, 5, 8, 2, 6, 7, 15]
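For what it's worth, the sort-by-random workaround ports to other languages with a stable seeded generator; here is a rough Python sketch of the same idea (my own; Python documents that Random.random() reproduces the same sequence for a given seed):

import hashlib
import random

seed = int(hashlib.sha256(b'hello').hexdigest()[:6], 16)   # 0x2cf24d
r = random.Random(seed)
order = [x for _, x in sorted((r.random(), x) for x in range(1, 18))]
print(order)   # a deterministic ordering; take the first N as sub-winners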

Order of Elements in comprehension

I have a Python expression that returns the squared values of a given set.
print(str({x*x for x in {1,2,3,4,5}}))
The values are correct but not in the same order as submitted. Why is this?
{16, 1, 4, 25, 9}
You've created a set, rather than a list, and the order in which a set's elements are presented is arbitrary.
Compare:
>>> print([x*x for x in [1,2,3,4,5]])
[1, 4, 9, 16, 25]
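If you want the squares of a set in a predictable order, sort the results:
>>> print(sorted(x*x for x in {1,2,3,4,5}))
[1, 4, 9, 16, 25]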

How to solve this variation of Kirkman's schoolgirls

I am trying to implement an app which assigns s students to l labs in g lab groups. The constraints are:
1: students shall work with new students in every lab.
2: all students shall be lab leader once.
Constraint 2 is not solvable if the students can't be divided evenly into the lab groups. Therefore it is acceptable if the "odd" students never get to be lab leader.
I have tried two approaches, but I am not happy with either yet:
Tabu search, which solves 1 but has problems solving 2 (I actually first solve 1 and then try to solve 2, which might be the wrong approach; any suggestions?)
A simple solution where I divide the students into rows [0..6][7..13][14..20], then rotate (with increments 0, 1, 2) and transpose the matrix, repeating this for the number of labs with incremented rotations (1, 2, 4) and (2, 4, 6). For 21 students in 3 labs with 7 lab groups per lab, the result looks like this:
lab 1: [0, 7, 14], [1, 8, 15], [2, 9, 16], [3, 10, 17], [4, 11, 18], [5, 12, 19], [6, 13, 20]
lab 2: [6, 12, 18], [0, 13, 19], [1, 7, 20], [2, 8, 14], [3, 9, 15], [4, 10, 16], [5, 11, 17]
lab 3: [5, 10, 15], [6, 11, 16], [0, 12, 17], [1, 13, 18], [2, 7, 19], [3, 8, 20], [4, 9, 14]
the lab leaders are the first column for lab 1, the second for lab 2 ...
This solution works decently, but it fails, for instance, for 12 students in 3 labs, or 150 students in 6 labs. Any suggestions?
Approach 2 seems to handle the same number of cases or combinations, and is lightning fast compared to approach 1. Maybe I should get a Nobel Prize :-)
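For what it's worth, the rotation scheme can be written compactly; the following Python sketch is my reconstruction from the tables above (the index arithmetic is inferred, so treat it as an illustration): for lab l, row r is shifted by l*(r+1) positions before reading the groups off the columns. It reproduces labs 1-3 above and, like the original, breaks when the shifts collide modulo the row length (e.g. 12 students in 3 labs).

def rotation_schedule(n_labs, n_rows, row_len):
    # students 0 .. n_rows*row_len - 1, arranged in rows of length row_len
    rows = [list(range(r * row_len, (r + 1) * row_len)) for r in range(n_rows)]
    labs = []
    for l in range(n_labs):
        groups = [[rows[r][(j - l * (r + 1)) % row_len] for r in range(n_rows)]
                  for j in range(row_len)]
        labs.append(groups)
    return labs

for l, lab in enumerate(rotation_schedule(3, 3, 7), start=1):
    print("lab", l, ":", lab)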
Constraint #1 alone is usually referred to as the social golfer problem. (Let parameters g be the number of groups and s be the size of each group and w be the number of weeks. A grouping is a partition of g * s golfers into g groups of size s. Determine whether w groupings can be found such that each pair of golfers are grouped together at most once.) The social golfer problem has been studied in the combinatorial optimization literature, and the approaches are of three types (you can use your favorite search engine to find the research articles):
Local search. This is effective when w is well below its maximum feasible value. Dotú and Van Hentenryck have a paper applying tabu search to the social golfer problem.
Complete search. This is necessary when w is above or just below its maximum feasible value but it does not scale very well.
Algebraic constructions. This is how the notorious g=8 s=4 w=10 instance was solved. Unfortunately, for many parameter sets there is no construction known.
To assign lab leaders, compute a maximum matching between students and lab groups, where there is an edge between a student and a lab group if that student belongs to that lab group.
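The matching needs no heavy machinery; here is a short sketch of Kuhn's augmenting-path algorithm in Python (my own illustration; any maximum bipartite matching routine would do). The left side is the lab groups across all labs, the right side is the students, and a matched edge makes that student the group's leader.

def max_matching(adj, n_right):
    # adj[g] lists the students belonging to lab group g; returns
    # (size, leader) where leader[s] == g means student s leads group g
    leader = [-1] * n_right

    def augment(g, visited):
        for s in adj[g]:
            if s not in visited:
                visited.add(s)
                if leader[s] == -1 or augment(leader[s], visited):
                    leader[s] = g
                    return True
        return False

    size = sum(augment(g, set()) for g in range(len(adj)))
    return size, leader

# toy usage with three groups taken from the tables above
groups = [[0, 7, 14], [1, 8, 15], [0, 13, 19]]
size, leader = max_matching(groups, 21)
print(size, [(s, g) for s, g in enumerate(leader) if g != -1])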
