Solutions to group elements in different pairs - data-structures

I am working on grouping elements in a large dataset in the least amount of time.
Below is a small example dataset:
Input:
pairs = [[1,2],[8,10],[4,8],[3,4],[2,3],[5,6],[6,7],[9,4],[11,12],[13,10],[16,1],[78,79],[61,3],[93,94]]
How the output is derived:
If any element of a pair is also present in another pair, the two pairs can be merged into one group. Traversing all pairs this way produces the different groups.
Output:
[{1, 2, 3, 4, 8, 9, 10, 13, 16, 61}, {5, 6, 7}, {11, 12}, {78, 79}, {93, 94}]
Question:
What is the best way to solve this problem,
e.g., a graph database (Neo4j) or a specific, readily available algorithm?
Are there any other alternatives not listed in the example solutions (Neo4j, an algorithm) above?
The output below is what is expected, and the solution should take as little time as possible on a large dataset.
Expected Output:
[{1, 2, 3, 4, 8, 9, 10, 13, 16, 61}, {5, 6, 7}, {11, 12}, {78, 79}, {93, 94}]
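This is the classic connected-components problem, and a disjoint-set union (union-find) structure handles it in near-linear time, with no graph database required. Below is a minimal Python sketch; the function name group_pairs and the helper layout are my own illustration:

def group_pairs(pairs):
    """Group pairs into connected components with union-find."""
    parent = {}

    def find(x):
        # Find the root, then compress the path for speed.
        root = x
        while parent[root] != root:
            root = parent[root]
        while parent[x] != root:
            parent[x], x = root, parent[x]
        return root

    def union(a, b):
        parent.setdefault(a, a)
        parent.setdefault(b, b)
        parent[find(a)] = find(b)

    for a, b in pairs:
        union(a, b)

    groups = {}
    for x in parent:
        groups.setdefault(find(x), set()).add(x)
    return list(groups.values())

pairs = [[1,2],[8,10],[4,8],[3,4],[2,3],[5,6],[6,7],[9,4],
         [11,12],[13,10],[16,1],[78,79],[61,3],[93,94]]
print(group_pairs(pairs))
# [{1, 2, 3, 4, 8, 9, 10, 13, 16, 61}, {5, 6, 7}, {11, 12}, {78, 79}, {93, 94}]
# (group order may vary)

With path compression, n union/find operations take effectively linear time (O(n * alpha(n))), so this scales to very large pair lists.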

Related

How to solve combinations in card game for n people in r rounds (just one encounter)

There is a famous card game in Germany called "Doppelkopf".
Usually you play "Doppelkopf" with 4 players, but you can also play it at a table of 5 players, where one player is just watching.
(In each round, everyone "has the cards" once, meaning everyone has the right to play the first card once per round.)
Every year, my family organizes a "Doppelkopf" tournament with 3 rounds (r).
Depending on the availability of my relatives, the number of participants varies from year to year.
Expecting a minimum of 16 participants, the number (n) in this experiment can grow without limit (as can the number of rounds r).
Naturally, my relatives do not want to be paired with someone twice, since they want to exchange gossip most efficiently!
There we have:
n - Participants
r - Rounds
t_total = n // 4                      # total tables (floor of n / 4)
t_5 = n % 4                           # tables of 5
t_4 = t_total - t_5                   # tables of 4
pos_pair = n * (n - 1) / 2            # possible pairs (n choose 2)
nec_pair = (t_5 * 10 + t_4 * 6) * r   # necessary pairs
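For example, for the P{n=20, r=3} instances below these formulas give t_total = 5, t_5 = 0, t_4 = 5, pos_pair = 190, and nec_pair = 90, so the necessary condition pos_pair >= nec_pair is satisfied.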
I was instructed to minimize the encounters (if possible, to have encounters == 1 for every pair)!
Since I do not want to solve the problem just for P{n={16, ..., 32}, r=3} (which I did for some cases), but for any given P{n∈N, r∈N}, there is a discrepancy between my abilities and the requirements for a solution!
Therefore, I would like to ask the community to help me solve this problem for any given P{n∈N, r∈N},
and also to prove that the problem is not solvable for a given P{n∈N, r∈N} whenever pos_pair < nec_pair.
Here are two solutions for P{n=20, r=3}, which very much solve my "Doppelkopf" tournament problem:
('Best result was ', [[[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16], [17, 18, 19, 20]], [[16, 12, 8, 18], [13, 1, 5, 9], [15, 4, 17, 6], [2, 19, 7, 10], [3, 11, 20, 14]], [[14, 9, 17, 7], [13, 20, 8, 2], [5, 4, 12, 19], [6, 16, 11, 1], [15, 18, 10, 3]]])
('Best result was ', [[[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16], [17, 18, 19, 20]], [[19, 11, 13, 3], [2, 15, 9, 8], [1, 16, 18, 6], [14, 7, 17, 10], [4, 12, 20, 5]], [[17, 8, 3, 12], [20, 9, 16, 7], [15, 11, 6, 4], [2, 13, 10, 18], [1, 19, 14, 5]]])
But in order to solve this problem for arbitrary n and r, I have come to no conclusion.
In my opinion, there are three ways to approach this problem computationally, exactly or as an approximation.
First, you can iterate over rounds and assign every player to a table without collisions, remembering all pairs and appearances so far (so as not to exceed the total rounds); see the sketch at the end of this question.
Secondly, you can iterate over tables, which seems to be helpful when the number of participants is a multiple of 2 (see P{n=16, r=5} at https://matheplanet.com/default3.html?call=viewtopic.php?topic=85206&ref=https%3A%2F%2Fwww.google.com%2F); again, remember pairs and appearances, but mainly follow a certain pattern as described in the link, which I somehow cannot scale to other numbers!
Thirdly, there may be a mathematical way to describe this procedure and derive a solution.
Even though this is more of a mathematical question (and I don't know where to ask such questions), I am interested in the algorithmic solution!
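Here is a minimal sketch of the first approach, assuming the table split from the formulas above; the function names and retry limit are my own illustration, not the asker's code. It seats players first-fit in a random order and reshuffles on collision, so it can fail near the pos_pair == nec_pair boundary, where backtracking or a combinatorial design would be needed (this is essentially the social golfer problem):

import random
from itertools import combinations

def make_round(players, sizes, seen, tries=10000):
    # Seat all players with no repeated pair, first-fit over a random
    # order; reshuffle and retry if someone cannot be seated.
    for _ in range(tries):
        order = random.sample(players, len(players))
        tables = [[] for _ in sizes]
        for p in order:
            for table, size in zip(tables, sizes):
                if len(table) < size and all(
                        frozenset((p, q)) not in seen for q in table):
                    table.append(p)
                    break
            else:
                break          # p could not be seated anywhere: reshuffle
        else:
            return tables      # every player found a seat
    return None

def greedy_schedule(n, r):
    t_5 = n % 4                # tables of 5 (formulas from above)
    t_4 = n // 4 - t_5         # tables of 4
    sizes = [5] * t_5 + [4] * t_4
    seen, rounds = set(), []
    for _ in range(r):
        tables = make_round(list(range(1, n + 1)), sizes, seen)
        if tables is None:
            return None        # greedy got stuck for this round
        for table in tables:
            seen.update(frozenset(p) for p in combinations(table, 2))
        rounds.append(tables)
    return rounds

print(greedy_schedule(20, 3))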

Optimal strategy for two player coin games

Two players take turns choosing one of the outer coins. At the end, we calculate the difference
between the scores the two players get, given that they play optimally.
The greedy strategy of taking the coin with the maximum value often does not lead to the best result in my case.
Now I developed an algorithm:
Sample: {9, 1, 15, 22, 4, 8}
1. We calculate the sum of the coins at even indices and the sum of the coins at odd indices.
2. Compare the two sums: (9+15+4) < (1+22+8), so the odd sum is greater.
3. We then pick the coin at an odd index; in our sample that is 8. The opponent, who plays optimally, will try to pick the greater coin, e.g. 9.
4. There is always a coin at an odd index after the opponent has moved, so we keep picking the coins at odd indices; that would be 1.
5. Looping the above steps, we end up with a difference of (8+1+22) - (9+15+4) = 3.
6. Vice versa if the sum of evens is greater in step 2.
I have compared the results generated by my algorithm with a second algorithm, similar to the one here: https://www.geeksforgeeks.org/optimal-strategy-for-a-game-set-2/?ref=rp
The results agreed, until my test generated a long random array:
[6, 14, 6, 8, 6, 3, 14, 5, 18, 6, 19, 17, 10, 11, 14, 16, 15, 18, 7, 8, 6, 9, 0, 15, 7, 4, 19, 9, 5, 2, 0, 18, 2, 8, 19, 14, 4, 8, 11, 2, 6, 16, 16, 13, 10, 19, 6, 17, 13, 13, 15, 3, 18, 2, 14, 13, 3, 4, 2, 13, 17, 14, 3, 4, 14, 1, 15, 10, 2, 19, 2, 6, 16, 7, 16, 14, 7, 0, 9, 4, 9, 6, 15, 9, 3, 15, 11, 19, 7, 3, 18, 14, 11, 10, 2, 3, 7, 3, 18, 7, 7, 14, 6, 4, 6, 12, 4, 19, 15, 19, 17, 3, 3, 1, 9, 19, 12, 6, 7, 1, 6, 6, 19, 7, 15, 1, 1, 6]
My algorithm produced 26 as the result, while the second algorithm produced 36.
Mine uses no dynamic programming and requires less memory, whereas I implemented the second one with memoization.
This is confusing, since mine matches on most arrays but not on this one.
Any help would be appreciated!
If the array is of even length, your algorithm tries to produce a guaranteed win. You can prove that quite easily. But it doesn't necessarily produce the optimal win. In particular it won't find strategies where you want some coins that are on even indexes and others on odd indexes.
The following short example illustrates the point.
[10, 1, 1, 20, 1, 1]
Your algorithm will look at evens vs odds, realize that 10+1+1 < 1+20+1 and take the last element first. Guaranteeing a win by 10.
But you want both the 10 and the 20. Therefore the optimal strategy is to take the 10, leaving 1, 1, 20, 1, 1. Whichever side the other person takes, you take the other end, getting to 1, 20, 1; then whichever side they take, you take the middle. This results in you getting 10, 1, 20 and the other person getting 1, 1, 1, guaranteeing a win by 28.
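For reference, the true optimum is what the linked GeeksforGeeks solution computes with an interval dynamic program. A minimal sketch of that classic recurrence (my own wording, not the asker's code):

def optimal_difference(coins):
    # dp[i][j] = best achievable (my score - opponent score) for the
    # player to move when only coins[i..j] remain.
    n = len(coins)
    dp = [[0] * n for _ in range(n)]
    for i in range(n):
        dp[i][i] = coins[i]
    for length in range(2, n + 1):
        for i in range(n - length + 1):
            j = i + length - 1
            # Take the left or the right coin; the opponent then plays
            # optimally on the rest, so subtract their best difference.
            dp[i][j] = max(coins[i] - dp[i + 1][j],
                           coins[j] - dp[i][j - 1])
    return dp[0][n - 1]

print(optimal_difference([10, 1, 1, 20, 1, 1]))  # 28, as argued above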

Find Top N Most Frequent Sequence of Numbers in List of a Billion Sequences

Let's say I have the following list of lists:
x = [[1, 2, 3, 4, 5, 6, 7], # sequence 1
[6, 5, 10, 11], # sequence 2
[9, 8, 2, 3, 4, 5], # sequence 3
[12, 12, 6, 5], # sequence 4
[5, 8, 3, 4, 2], # sequence 5
[1, 5], # sequence 6
[2, 8, 8, 3, 5, 9, 1, 4, 12, 5, 6], # sequence 7
[7, 1, 7, 3, 4, 1, 2], # sequence 8
[9, 4, 12, 12, 6, 5, 1], # sequence 9
]
Essentially, for any list that contains the target number 5 (i.e., target=5) anywhere within the list, what are the top N=2 most frequently observed subsequences with length M=4?
So, the conditions are:
if target doesn't exist in the list then we ignore that list completely
if the list length is less than M then we ignore the list completely
if the list is exactly length M but target is not in the Mth position then we ignore it (but we count it if target is in the Mth position)
if the list length, L, is longer than M and target is at position i = M (or position i = M+1, or i = M+2, ..., or i = L), then we count the length-M subsequence where target is in the final position of the subsequence
So, using our list-of-lists example, we'd count the following subsequences:
subseqs = [[2, 3, 4, 5], # taken from sequence 1
[2, 3, 4, 5], # taken from sequence 3
[12, 12, 6, 5], # taken from sequence 4
[8, 8, 3, 5], # taken from sequence 7
[1, 4, 12, 5], # taken from sequence 7
[12, 12, 6, 5], # taken from sequence 9
]
Of course, what we want are the top N=2 subsequences by frequency. So, [2, 3, 4, 5] and [12, 12, 6, 5] are the top two most frequent sequences by count. If N=3 then all of the subsequences (subseqs) would be returned since there is a tie for third.
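A brute-force baseline (my own illustrative sketch, nowhere near scalable to billions of lists) helps pin down these counting rules before optimizing:

from collections import Counter

def top_n_subseqs(sequences, target, M, N):
    counts = Counter()
    for seq in sequences:
        # Length-M windows ending at index i; this encodes all the
        # conditions above: too-short lists yield no windows, and only
        # windows ending in `target` are counted.
        for i in range(M - 1, len(seq)):
            if seq[i] == target:
                counts[tuple(seq[i - M + 1 : i + 1])] += 1
    # Caveat: most_common(N) cuts ties arbitrarily; the N=3 tie case
    # above would need grouping by count instead.
    return counts.most_common(N)

print(top_n_subseqs(x, target=5, M=4, N=2))   # x as defined above
# [((2, 3, 4, 5), 2), ((12, 12, 6, 5), 2)]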
Important
This is super simplified but, in reality, my actual list-of-sequences
consists of a few billion lists of positive integers (between 1 and 10,000)
each list can be as short as 1 element or as long as 500 elements
N and M can be as small as 1 or as big as 100
My questions are:
Is there an efficient data structure that would allow for fast queries assuming that N and M will always be less than 100?
Are there known algorithms for performing this kind of analysis for various combinations of N and M? I've looked at suffix trees but I'd have to roll my own custom version to even get close to what I need.
For the same dataset, I need to repeatedly query it for various values or combinations of target, N, and M (where target <= 10,000, N <= 100, and M <= 100). How can I do this efficiently?
Extending on my comment, here is a sketch of how you could approach this using an out-of-the-box suffix array:
1) Reverse each list and concatenate them all, separated by a stop symbol (I used 0 here).
[7, 6, 5, 4, 3, 2, 1, 0, 11, 10, 5, 6, 0, 5, 4, 3, 2, 8, 9, 0, 5, 6, 12, 12, 0, 2, 4, 3, 8, 5, 0, 5, 1, 0, 6, 5, 12, 4, 1, 9, 5, 3, 8, 8, 2, 0, 2, 1, 4, 3, 7, 1, 7, 0, 1, 5, 6, 12, 12, 4, 9]
2) Build a suffix array
[53, 45, 24, 30, 12, 19, 33, 7, 32, 6, 47, 54, 51, 38, 44, 5, 46, 25, 16, 4, 15, 49, 27, 41, 37, 3, 14, 48, 26, 59, 29, 31, 40, 2, 13, 10, 20, 55, 35, 11, 1, 34, 21, 56, 52, 50, 0, 43, 28, 42, 17, 18, 39, 60, 9, 8, 23, 36, 58, 22, 57]
3) Build the LCP array. The LCP array tells you how many numbers a suffix has in common with its neighbour in the suffix array; however, you need to stop counting when you encounter a stop symbol.
[0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 1, 2, 1, 1, 0, 2, 1, 1, 2, 0, 1, 3, 2, 2, 1, 0, 1, 1, 1, 4, 1, 2, 4, 1, 0, 1, 2, 1, 3, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 0, 0, 0, 1, 2, 1, 2, 0]
4) When a query comes in (target = 5, M = 4), you search for the first occurrence of your target in the suffix array and scan the corresponding LCP array until the starting number of the suffixes changes. Below is the part of the LCP array that corresponds to all suffixes starting with 5.
[..., 1, 1, 1, 4, 1, 2, 4, 1, 0, ...]
This tells you that there are two sequences of length 4 that occur twice each. Glossing over some details: using the indexes, you can find the sequences and reverse them back to get your final results.
Complexity
Building the suffix array is O(n) time and O(n) space, where n is the total number of elements in all lists.
Building the LCP array is also O(n) in both time and space.
Searching for a target number in the suffix array is O(log n) on average.
The cost of scanning through the relevant suffixes is linear in the number of times the target occurs, which should be a fraction of about 1/10,000 of all elements on average, given your parameters.
The first two steps happen offline. Querying is technically O(n) (due to step 4), but with a small constant (0.0001).
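A small sketch of steps 1) and 2) (my own illustration; the quadratic suffix-array construction is for clarity only, and a linear-time builder such as SA-IS would be needed at billions of elements):

def build_corpus(lists, stop=0):
    # Step 1: reverse each list and join with a stop symbol; assumes
    # `stop` never occurs in the data (values here are >= 1).
    corpus = []
    for seq in lists:
        if corpus:
            corpus.append(stop)
        corpus.extend(reversed(seq))
    return corpus

def naive_suffix_array(a):
    # Step 2, naive O(n^2 log n) version: sort suffix start positions
    # by suffix content.
    return sorted(range(len(a)), key=lambda i: a[i:])

corpus = build_corpus(x)          # x = the list of lists from the question
sa = naive_suffix_array(corpus)   # reproduces the arrays shown above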

Flaw in this maximum clique polynomial time approach?

I have been trying to solve the maximum clique problem with the algorithm below, and so far I have not been able to find a case in which it fails.
Algorithm:
For a given graph, with nodes numbered from 1 to N:
1. Consider a node as the permanent node, and form the set of nodes that are each connected to this permanent node (the set includes the permanent node as well).
2. Now form the subgraph of the original graph that contains all the nodes in this set, and only those edges that run between nodes in the set.
3. Find the degree of each node.
4. If all the nodes have the same degree, then we have a clique.
5. Else, delete the least-degree node from this subgraph and repeat from step 3.
6. Repeat steps 1-5 for all the nodes in the graph.
Can anyone point out a flaw in this algorithm?
Here is my code http://pastebin.com/tN149P9m.
Here's a family of counterexamples. Start with a k-clique. For each node in this clique, connect it to each node of a fresh copy of K_{k-1,k-1}, i.e., the complete bipartite graph on k-1 plus k-1 nodes. For every permanent node in the clique, the residual graph is its copy of K_{k-1,k-1} and the clique. The nodes in K_{k-1,k-1} have degree k and the other clique nodes have degree k - 1, so the latter get deleted.
Here's a 16-node counterexample, obtained by setting k = 4 and identifying parts of the K_{3,3}s in a ring:
{0: {1, 2, 3, 4, 5, 6, 7, 8, 9},
1: {0, 2, 3, 7, 8, 9, 10, 11, 12},
2: {0, 1, 3, 10, 11, 12, 13, 14, 15},
3: {0, 1, 2, 4, 5, 6, 13, 14, 15},
4: {0, 3, 7, 8, 9, 13, 14, 15},
5: {0, 3, 7, 8, 9, 13, 14, 15},
6: {0, 3, 7, 8, 9, 13, 14, 15},
7: {0, 1, 4, 5, 6, 10, 11, 12},
8: {0, 1, 4, 5, 6, 10, 11, 12},
9: {0, 1, 4, 5, 6, 10, 11, 12},
10: {1, 2, 7, 8, 9, 13, 14, 15},
11: {1, 2, 7, 8, 9, 13, 14, 15},
12: {1, 2, 7, 8, 9, 13, 14, 15},
13: {2, 3, 4, 5, 6, 10, 11, 12},
14: {2, 3, 4, 5, 6, 10, 11, 12},
15: {2, 3, 4, 5, 6, 10, 11, 12}}
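To see the failure concretely, here is a sketch of the proposed steps 1-5 (my own transcription of the algorithm, breaking degree ties by smallest node id), run against the counterexample above:

def peel(adj, v):
    # Keep v's closed neighborhood, then repeatedly drop a least-degree
    # node until all degrees match. Note that equal degrees do not in
    # general guarantee a clique.
    nodes = set(adj[v]) | {v}
    while True:
        deg = {u: len(adj[u] & nodes) for u in nodes}
        if len(set(deg.values())) == 1:
            return nodes
        nodes.remove(min(nodes, key=lambda u: (deg[u], u)))

graph = {
    0: {1, 2, 3, 4, 5, 6, 7, 8, 9},       1: {0, 2, 3, 7, 8, 9, 10, 11, 12},
    2: {0, 1, 3, 10, 11, 12, 13, 14, 15}, 3: {0, 1, 2, 4, 5, 6, 13, 14, 15},
    4: {0, 3, 7, 8, 9, 13, 14, 15},       5: {0, 3, 7, 8, 9, 13, 14, 15},
    6: {0, 3, 7, 8, 9, 13, 14, 15},       7: {0, 1, 4, 5, 6, 10, 11, 12},
    8: {0, 1, 4, 5, 6, 10, 11, 12},       9: {0, 1, 4, 5, 6, 10, 11, 12},
    10: {1, 2, 7, 8, 9, 13, 14, 15},      11: {1, 2, 7, 8, 9, 13, 14, 15},
    12: {1, 2, 7, 8, 9, 13, 14, 15},      13: {2, 3, 4, 5, 6, 10, 11, 12},
    14: {2, 3, 4, 5, 6, 10, 11, 12},      15: {2, 3, 4, 5, 6, 10, 11, 12},
}
best = max((peel(graph, v) for v in graph), key=len)
print(len(best), sorted(best))   # reports a 3-clique, but {0,1,2,3} is a 4-clique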
What you propose looks very much like the following sorting algorithm combined with a greedy clique search:
Consider a simple undirected graph G=(V,E)
Initial sorting
Pick the vertex with minimum degree and place it first in the new list L. From the remaining vertices pick the vertex with minimum degree and place it in the second position in L. Repeat the operations until all vertices in V are in L.
Find cliques greedily
Start from the last vertex in L and move in reverse order. For each vertex v in L, compute cliques like this:
1. Add v to the new clique C
2. Compute the neighbor set of v in L: N(v)
3. Pick the last vertex w in N(v)
4. Set v = w and L = L ∩ N(v)
5. Repeat steps 1 to 4
Actually, the proposed initial sorting is called a degeneracy ordering and decomposes G into k-cores (see Batagelj et al. 2002). A k-core is a maximal subgraph such that all its vertices have degree at least k. The initial sorting leaves the highest cores (those with largest k) at the end. When vertices are picked in reverse order, you are picking vertices in the highest cores first (similar to your step 4) and trying to find cliques there. There are a number of other possibilities for finding cliques greedily based on k-cores, but you can never guarantee an optimum unless you do full enumeration.
The proposed initial sorting is used, for example, when searching for an exact maximum clique, and has been described in many research papers, such as [Carraghan and Pardalos 90].
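For completeness, a naive sketch of that initial sorting (my own illustration; Batagelj et al. 2002 give an O(V+E) bucket-queue version, whereas this one is quadratic):

def degeneracy_ordering(adj):
    # Repeatedly remove a minimum-degree vertex; the removal order is a
    # degeneracy ordering, and each vertex's degree at removal time
    # tells you which k-core it belongs to.
    adj = {v: set(ns) for v, ns in adj.items()}
    order = []
    while adj:
        v = min(adj, key=lambda u: len(adj[u]))
        order.append(v)
        for u in adj.pop(v):
            adj[u].discard(v)
    return order

Traversing the returned order in reverse visits the highest cores first, as described above.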

How to create a Mathematica list of fewer columns from larger list of many columns

I have a list "data1":
{{1, 6, 4.5, 1, 141.793, 2.31634, 27.907},
 {2, 7, 4.5, 1, 133.702, 2.28725, 26.7442},
 {3, 5, 5, 1, 136.546, 2.33522, 25.5814},
 {4, 8, 5, 1, 104.694, 2.27871, 24.4186}}
What I would like to do is create a new list with only the first two columns of each row. So my new list would be:
{{1,6},{2,7},{3,5},{4,8}}
I tried
data1[[All, 1][All, 2]]
and other variations but I am not understanding how to capture the desired fields. Thank you for your help.
Just have a range or list of the indices you want as the second argument, like so:
In[71]:= data1[[All, {1, 2}]]
Out[71]= {{1, 6}, {2, 7}, {3, 5}, {4, 8}}
