Efficient data structure for a list of index sets - algorithm

Let me try to explain by example:
Imagine a list of numbered elements E = [elem0, elem1, elem2, ...].
One index set could now be {42, 66, 128}, referring to elements in E. The ordering in this set is not important, so {42, 66, 128} == {66, 128, 42}, but each element appears at most once in any given index set (so it is an actual set).
What I want now is a space-efficient data structure that gives me another ordered list M containing index sets that refer to elements in E. Each index set in M will only occur once (so M is a set in this regard), but M must be indexable itself (so M is a list in this sense, whereby the precise index is not important). If necessary, the index sets can be forced to all contain the same number of elements.
For example, M could look like:
0: {42, 66, 128}
1: {42, 66, 9999}
2: {1, 66, 9999}
I could now do the following:
for(i in M[2]) { element = E[i]; /* do something with E[1],E[66],and E[9999] */ }
You probably see where this is going: you may now have another map M2 that is an ordered list of sets pointing into M, which ultimately point to elements in E.
As you can see in this example, index sets can be relatively similar (M[0] and M[1] share the first two entries, M[1] and M[2] share the last two), which makes me think that there must be something more efficient than the naive array-of-sets approach. However, I may not be able to come up with a good global ordering of index entries that guarantees good "sharing".
I can imagine anything from representing M as a tree (where M's index comes from a depth-first-search ordering or something like that) to hash maps of union-find structures (no idea how that would work, though).
Pointers to any textbook data structure for something like this are highly welcome (is there anything in the world of databases?), but I would also appreciate a "self-made" solution or even random ideas.
Space efficiency is important to me because E may contain thousands or even a few million elements, (some) index sets are potentially large, similarities between at least some index sets should be substantial, and there may be multiple layers of mappings.
Thanks a ton!

You can combine all numbers from the sets in M, remove duplicates, and call the result UniqueM.
Each collection M[X] then converts to a bit mask over UniqueM. For example, an int value can store 32 numbers (to support an unlimited count you can store an array of ints; with an array of size 10 we can cover 320 different elements in total), and a long type can store 64 bits.
E: {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15}
M[0]: {6, 8, 1}
M[1]: {2, 8, 1}
M[2]: {6, 8, 5}
Will be converted to (one bit per UniqueM entry, each mask written least-significant bit first):
UniqueM: {6, 8, 1, 2, 5}
M[0]: 11100 {this is 7}
M[1]: 01110 {this is 14}
M[2]: 11001 {this is 19}
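A minimal sketch of this conversion in Python (my naming; Python's arbitrary-precision ints stand in for the int/long array):

    def build_unique(ms):
        # Collect every index that appears in any M[x], in first-seen order.
        unique, seen = [], set()
        for m in ms:
            for i in m:
                if i not in seen:
                    seen.add(i)
                    unique.append(i)
        return unique

    def to_mask(m, position):
        # Bit k of the mask is set iff unique[k] is in the set.
        mask = 0
        for i in m:
            mask |= 1 << position[i]
        return mask

    ms = [[6, 8, 1], [2, 8, 1], [6, 8, 5]]        # the Ms as lists of indexes
    unique = build_unique(ms)                     # [6, 8, 1, 2, 5]
    position = {v: k for k, v in enumerate(unique)}
    print([to_mask(m, position) for m in ms])     # [7, 14, 19]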
Note:
You can also combine my approach with ring0's: instead of rearranging E, build the new UniqueM and use intervals inside it.

It will be pretty hard to beat an index. You could save some space by using the right data type (e.g. in GNU C, short if there are fewer than 64K elements in E, int if fewer than 4G, ...).
Besides, since you say the order in E is not important, you could sort E in a way that maximizes runs of consecutive elements matching the Ms as much as possible.
For instance,
E: { 1,2,3,4,5,6,7,8 }
0: {1,3,5,7}
1: {1,3,5,8}
2: {3,5,7,8}
By re-arranging E
E: { 1,3,5,7,8,2,4,6 }
and using E indexes, not values, you could define the Ms based on subsets of E, giving indexes
0: {0-3} // E[0]: 1, E[1]: 3, E[2]: 5, E[3]: 7 etc...
1: {0-2,4}
2: {1-3,4}
This way:
- you use indexes instead of the raw numbers (indexes are usually smaller, and never negative),
- the Ms are made of sub-sets, 0-3 meaning 0,1,2,3 (see the encoding sketch below).
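A tiny Python sketch (my naming) of encoding such an index set as (start, end) runs:

    def to_intervals(indexes):
        # Consecutive indexes collapse into a single (start, end) pair,
        # e.g. [0, 1, 2, 3, 5] -> [(0, 3), (5, 5)].
        runs = []
        for i in sorted(indexes):
            if runs and i == runs[-1][1] + 1:
                runs[-1] = (runs[-1][0], i)   # extend the current run
            else:
                runs.append((i, i))           # start a new run
        return runs

    print(to_intervals([0, 1, 2, 3]))   # [(0, 3)]          -> M 0 above
    print(to_intervals([0, 1, 2, 4]))   # [(0, 2), (4, 4)]  -> M 1
    print(to_intervals([1, 2, 3, 4]))   # [(1, 4)]          -> M 2 (1-3 and 4 merge)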
The difficult part is the algorithm to re-arrange E so that you maximize the subset sizes, i.e. minimize the sizes of the Ms.
E rearrangement algo suggestion
sort all Ms
process all Ms: build a map which gives, for an element 'x', its list of neighbors 'y', along with a score 'z', the number of times 'y' appears just after 'x':
Map map (x,y) -> z
for m in Ms
    for e,f in m                   // e and f are consecutive elements
        if ( ! map(e,f) ) map(e,f) = 1
        else map(e,f)++
    rof
rof
Get E rearranged
ER = {}                        // E rearranged
Map mas = sort_map(map)        // mas(x) -> list(y), the 'y' sorted desc based on 'z'
e = get_min_elem(mas)          // init with lowest element (regardless of its 'z' scores)
while (mas has elements)
    ER += e                    // add element e to ER
    if (empty(mas(e)))
        e = get_min_elem(mas)  // get next lowest remaining value
    else
        f = mas(e)[0]          // get most likely neighbor of e, ie first in the list
        delete mas(e)[0]       // consume it; the next neighbor moves up in line
        e = f
    fi
elihw
The algo (the map) should be O(n*m) in space, with n elements in E and m elements across all the Ms.
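A runnable Python sketch of this heuristic (naming and tie-breaking are mine; elements of E that appear in no M can simply be appended at the end):

    from collections import defaultdict

    def rearrange(ms):
        # Score how often f directly follows e across all (sorted) Ms.
        score = defaultdict(lambda: defaultdict(int))
        elements = set()
        for m in ms:
            m = sorted(m)
            elements.update(m)
            for e, f in zip(m, m[1:]):
                score[e][f] += 1

        # mas(x) -> neighbors of x, sorted by descending score.
        mas = {e: sorted(score[e], key=score[e].get, reverse=True)
               for e in elements}

        er, placed = [], set()
        e = min(elements)                  # init with the lowest element
        while len(placed) < len(elements):
            er.append(e)
            placed.add(e)
            # Follow the best not-yet-placed neighbor, else jump to the
            # lowest remaining element.
            nxt = next((f for f in mas[e] if f not in placed), None)
            if nxt is None:
                remaining = elements - placed
                if not remaining:
                    break
                nxt = min(remaining)
            e = nxt
        return er

    print(rearrange([{1, 3, 5, 7}, {1, 3, 5, 8}, {3, 5, 7, 8}]))
    # [1, 3, 5, 7, 8] -- matching the hand-made E above (2, 4, 6 are unused)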

Bit arrays may be used. They are arrays of elements a[i] which are 1 if i is in the set and 0 if it is not. So every set would occupy exactly size(E) bits, even if it contains few or no members. Not so space efficient on its own, but if you compress this array with some compression algorithm it will be much smaller (possibly approaching the entropy limit). You could try a dynamic Markov coder, RLE, or Huffman coding, and choose whichever is most efficient for you. Iteration would then include on-the-fly decompression followed by a linear scan for 1 bits. For long runs of 0s you could modify the decompression algorithm to detect such cases (RLE is the simplest case for this).
If you find sets with a small difference, you may store A and A xor B instead of A and B, saving space on the common parts. In this case, to iterate over B you will have to unpack both A and A xor B, then xor them.
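A minimal sketch of the xor trick, using Python's big ints as the bit arrays (the compression step itself is omitted):

    def to_bits(s):
        # Bit i is 1 iff i is in the set.
        bits = 0
        for i in s:
            bits |= 1 << i
        return bits

    def from_bits(bits):
        # Recover the index set by scanning for 1 bits.
        out, i = set(), 0
        while bits:
            if bits & 1:
                out.add(i)
            bits >>= 1
            i += 1
        return out

    a = {42, 66, 128}
    b = {42, 66, 9999}
    bits_a = to_bits(a)
    delta = bits_a ^ to_bits(b)            # store A and (A xor B)
    assert from_bits(bits_a ^ delta) == b  # unpack both, xor, iterate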

Another useful solution:
E: {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15}
M[0]: {1, 2, 3, 4, 5, 10, 14, 15}
M[1]: {1, 2, 3, 4, 5, 11, 14, 15}
M[2]: {1, 2, 3, 4, 5, 12, 13}
Cache frequently used items:
Cache[1] = {1, 2, 3, 4, 5}
Cache[2] = {14, 15}
Cache[3] = {-2, 7, 8, 9} // Not used, just an example.
M[0]: {-1, 10, -2}
M[1]: {-1, 11, -2}
M[2]: {-1, 12, 13}
Links to a cached list are marked as negative numbers.
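A small Python sketch of decoding such entries (assuming a reference to Cache[x] is stored as -x):

    cache = {
        1: [1, 2, 3, 4, 5],
        2: [14, 15],
    }

    m = [[-1, 10, -2],
         [-1, 11, -2],
         [-1, 12, 13]]

    def expand(entry, cache):
        # Negative values are links into the cache; everything else is a
        # plain index into E.
        out = []
        for v in entry:
            out.extend(cache[-v] if v < 0 else [v])
        return out

    print([expand(entry, cache) for entry in m])
    # [[1, 2, 3, 4, 5, 10, 14, 15], [1, 2, 3, 4, 5, 11, 14, 15],
    #  [1, 2, 3, 4, 5, 12, 13]]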

Related

Algorithm for obtaining from a set all sequences of size N that include a given subset

To be more specific, the output sequences may have the elements of the input subset in a different order than in the input subset.
EXAMPLE: Let's say we have the sequence {1, 9, 1, 5, 6, 9, 0, 9, 9, 1, 10} and an input subset {9, 1}.
We want to find all contiguous subsequences of size 3 that include all elements of the input subset.
This would return the sequences {1, 9, 1}, {9, 1, 5}, {9, 9, 1}, {9, 1, 10}.
Of course, the algorithm with the lowest possible complexity is preferred.
EDIT: Edited with better terminology. Also, here's what I considered, in pseudocode:
1. For each sequence s of size N in the input, do the following:
2. Create a list l that is a copy of the input subset i.
3. j = 0
4. For each element e in s, do the following:
5. Check whether e is in l. If it is, remove that element from l.
6. If l is empty, add s to the sequences to return and skip to the next sequence (go to step 1).
7. Increment j.
8. If j == N, go to the next sequence (go to step 1).
This is what I came up with, but it takes an awful amount of time, because we consider EVERY sequence with no memory whatsoever of previously scanned ones. I really have no idea how to implement such a memory; this is all very new to me.
You could simply find all occurrences of that subset and then produce all lists that contain that combination.
In Python, for example:

    def get_Allsubsets_withSubset(original_list, N, subset):
        subset_len = len(subset)
        list_len = len(original_list)
        overflow = N - subset_len
        # Find every index where the subset occurs as a contiguous run.
        index_of_occurrence = []
        for index in range(list_len - subset_len + 1):
            if original_list[index: index + subset_len] == subset:
                index_of_occurrence.append(index)
        # Slide a window of size N over each occurrence, staying in bounds.
        final_result = []
        for index in index_of_occurrence:
            for value in range(overflow + 1):
                i = index - value
                if 0 <= i and i + N <= list_len:
                    final_result.append(original_list[i:i + N])
        return final_result
It's not beautiful, but I would say it's a start.
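With the example from the question (note that this version only finds places where the subset occurs as a contiguous run in the given order, which happens to cover all four expected results here):

    print(get_Allsubsets_withSubset([1, 9, 1, 5, 6, 9, 0, 9, 9, 1, 10], 3, [9, 1]))
    # [[9, 1, 5], [1, 9, 1], [9, 1, 10], [9, 9, 1]]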

Intersecting sets such that the result is a set of sets with collectively unique elements

Let's say I have the following sets:
X -> {1, 2, 3}
Y -> {1, 4, 7}
Z -> {1, 4, 5}
I'm looking to find the combination of intersections that produce a number of sets where each element is unique among them all (really a set of hashes where each element refers back to the sets it intersects):
A -> {2, 3}: {X}
B -> {7}: {Y}
C -> {5}: {Z}
D -> {4}: {Y, Z}
E -> {1}: {X, Y, Z}
Boiling the problem down, the following conditions have to be met:
- For each initial set, each element will be in a resulting set created by the intersection of the maximum number of initial sets. Meaning, each element in an initial set needs to be in exactly one resulting set.
- The sets are realistically infinite, meaning stepping through all valid elements isn't feasible, but set operations are fine.
- All resulting sets containing no elements can be disregarded.
The brute force approach is to loop over the powerset of the initial set in reverse order, intersect each set, then find the difference of this resulting set and all other intersections tested:
    resulting_sets = {}
    for sets in powerset(S):
        s = intersection(sets)
        for rs in resulting_sets.keys():
            s -= rs
        if not s.empty():
            resulting_sets[s] = sets  # realistically some kind of reference to sets
Of course the above is pretty inefficient, at O(2^n * 2^(n/2)) set operations (and for my purposes it may run up to n^2 times already). Is there a better solution for this type of problem?
UPDATE: this doesn't iterate over any set; it only uses set operations.
This algorithm builds the result sets constructively, i.e. we modify the existing unique-element sets and/or add new ones every time we see a new source set.
The idea is that every new set can be split into two parts: one with values already seen, and one with new unique values. The first part is further split into various subsets (up to the number of subsets in the powerset of the seen source sets) by the current result sets. Each such subset again splits into two parts: one that intersects with the new source set and one that does not. The job is to update the result sets for each of these categories.
In terms of set operations, the complexity should be O(n*2^n). For the solution posted by the OP, I think the complexity is O(2^(2n)), because len(resulting_sets) has up to 2^n elements in the worst case.
    def solution(sets):
        result_sets = []  # list of (unique element set, membership) tuples
        for sid, s in enumerate(sets):
            new_sets = []
            for unique_elements, membership in result_sets:
                # The intersecting part has wider membership, while the other
                # part has fewer unique elements (maybe none).
                # A wider membership cannot have been seen before, so add it as new.
                intersect = unique_elements & s
                # Special case: if all unique elements exist in s, update the
                # membership in place.
                if len(intersect) == len(unique_elements):
                    membership.append(sid)
                elif len(intersect) != 0:
                    unique_elements -= intersect
                    new_sets.append((intersect, membership + [sid]))
                s -= intersect
                if len(s) == 0:
                    break
            else:
                # Special syntax for Python (for-else): only reached without
                # break, i.e. there are remaining elements in s. This is the
                # part of unseen elements: add them as a new result set.
                new_sets.append((s, [sid]))
            result_sets.extend(new_sets)
        print(result_sets)
sets = [{1, 2, 3}, {1, 4, 7}, {1, 4, 5}]
solution(sets)
# output:
# [(set([2, 3]), [0]), (set([1]), [0, 1, 2]), (set([7]), [1]), (set([4]), [1, 2]), (set([5]), [2])]
--------------- original answer below ---------------
The idea is to find the "membership" of each unique element, i.e. which sets it belongs to. Then we create a dictionary to group all elements by their membership, generating the requested sets. The complexity is O(n*len(sets)), or O(n^2) in the worst case.
    def solution(sets):
        union = set().union(*sets)
        numSets = len(sets)
        numElements = len(union)
        memberships = {}
        for e in union:
            membership = tuple(i for i, s in enumerate(sets) if e in s)
            if membership not in memberships:
                memberships[membership] = []
            memberships[membership].append(e)
        print(memberships)

    sets = [{1, 2, 3}, {1, 4, 7}, {1, 4, 5}]
    solution(sets)
# output:
# {(0, 1, 2): [1], (1, 2): [4], (0,): [2, 3], (1,): [7], (2,): [5]}

How to find the dimensions of a (2-, 3-, or if possible n-)dimensional slice in Go, and verify whether it is a matrix?

For example: [][]float64{{11, 5, 14, 1}, {11, 5, 14, 1}} has dimensions [2,4].
If this is passed to a function then what is the most efficient way to find the dimension here?
Thanks
The outer dimension is just len(x) where x is the slice of slices you pass to the function (your example [][]float64{{11, 5, 14, 1}, {11, 5, 14, 1}}).
However, the inner dimensions are not guaranteed to be equal so you will have to go through each element and check what len they have.
If you have the guarantee that each element of x has the same number of elements, just take len(x[0]) if len(x) > 0.
Go only provides 1-dimensional arrays and slices. N-dimensional arrays can be emulated by using arrays of arrays, which is close to what you're doing: you have a 1-dimensional slice, which contains two 1-dimensional slices.
This is problematic, because slices are not of a defined length. So you could end up with slices of wildly different lengths:
[][]float64{
{1, 2, 3, 4},
{5, 6, 7, 8, 9, 10},
}
If you use actual arrays, this is simpler, because arrays have a fixed length:
[2][4]float64
You can extend this to as many dimensions as you want:
[2][4][8]float64 provides three dimensions, with respective depths of 2, 4, and 8.
Then you can tell the capacity of each dimension by using the built-in len() function on any of the elements:
foo := [2][4][8]float64{}
x := len(foo)       // x == 2
y := len(foo[0])    // y == 4
z := len(foo[0][0]) // z == 8

Finding a group of subsets that does not overlap

I am reviewing for an upcoming programming contest and was working on the following problem:
Given a list of integers, an integer t, an integer r, and an integer p, determine if the list contains t sets of 3, r runs of 3, and p pairs of numbers. For each of these subsets, the numbers must be adjacent and any given number can only exist in one subset, if any at all.
Currently, I am solving the problem by simply finding all sets of 3, runs of 3, and pairs and then checking all permutations until finding one which has no overlapping subsets. This seems inefficient, however, and I was wondering if there was a better solution to the problem.
Here are two examples of the problem:
{1, 1, 1, 2, 3, 4, 4, 4, 5, 5, 1, 0}, t = 1, r = 1, p = 2.
This works because we have the triple {4 4 4}, the run {1 2 3}, and the pairs {1 1} and {5 5}
{1, 1, 1, 2, 3, 3}, t = 1, r = 1, p = 1
This does not work because the only triple is {1 1 1} and the only run is {1 2 3} and the two overlap (They share a 1).
I am looking for a more efficient approach to this problem.
There is probably a faster way, but you can solve this with dynamic programming. Compute a recursive function F(t,r,p,n) which decides whether it is possible to have t triples, r runs, and p pairs in the sequence starting at position 1 and ending at position n, storing the last subset of the solution ending at position n if it is possible. If you can have a triple, run, or pair ending at position n, then you have a recursive case: either F(t-1,r,p,n-3), F(t,r-1,p,n-3), or F(t,r,p-1,n-2), with the last subset stored; otherwise you have the recursive case F(t,r,p,n-1). This looks like fourth-power complexity, but it really isn't, because the value of n is always decreasing, so the complexity is actually O(n + TRP), where T is the total desired number of triples, R is the total desired number of runs, and P is the total desired number of pairs. So O(n^3) in the worst case.
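A memoized Python sketch of this recursion (naming is mine; I assume a "run" means three adjacent, consecutively increasing values, as in the {1 2 3} example, and that every subset uses adjacent positions, as the question states):

    from functools import lru_cache

    def check(nums, t, r, p):
        # f(t, r, p, n): can the first n elements supply t triples,
        # r runs of 3, and p pairs, with no element reused?
        @lru_cache(maxsize=None)
        def f(t, r, p, n):
            if t == 0 and r == 0 and p == 0:
                return True                      # nothing left to find
            if n <= 0:
                return False                     # ran out of elements
            if f(t, r, p, n - 1):                # last position used by nothing
                return True
            a = nums
            # Triple ending here: three equal adjacent values.
            if t and n >= 3 and a[n-1] == a[n-2] == a[n-3] and f(t - 1, r, p, n - 3):
                return True
            # Run ending here: three adjacent, consecutively increasing values.
            if r and n >= 3 and a[n-1] == a[n-2] + 1 == a[n-3] + 2 and f(t, r - 1, p, n - 3):
                return True
            # Pair ending here: two equal adjacent values.
            if p and n >= 2 and a[n-1] == a[n-2] and f(t, r, p - 1, n - 2):
                return True
            return False
        return f(t, r, p, len(nums))

    print(check([1, 1, 1, 2, 3, 4, 4, 4, 5, 5, 1, 0], 1, 1, 2))  # True
    print(check([1, 1, 1, 2, 3, 3], 1, 1, 1))                    # False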

Split set of numbers into as fewest subsets as possible of defined "interval length" algorithm

Consider that I have a set of numbers, for example {1, 2, 3, 7, 8, 9, 11, 15, 16}. What I need is to split this set into as few subsets as possible, such that in each subset the difference between the lowest and the highest number is less than 9.
From my example set it would be for example:
{1, 2, 3, 7, 8}, {9, 11, 15, 16}
I need this to optimize the number of "read multiple registers" requests through the Modbus.
I already tried splitting the set into subsets of consecutive numbers and then merging them, but it's not ideal, because it returns this:
{1, 2, 3}, {7, 8, 9, 11}, {15, 16} or {1, 2, 3}, {7, 8, 9}, {11, 15, 16}
As you can see this approach gives me three subsets instead of two.
So is there any usable algorithm?
Thanks
What about the greedy approach: insert elements from the left into a set, and when you would exceed the desired difference, create a new set.
So, for 1, 2, 3, 7, 8, 9, 11, 15, 16:
You'd start off with 1.
2-1 = 1 < 9, so add 2.
3-1 = 2 < 9, so add 3.
7-1 = 6 < 9, so add 7.
8-1 = 7 < 9, so add 8.
9-1 = 8 < 9, so add 9.
11-1 = 10 > 9, so create a new set.
15-11 = 4 < 9, so add 15.
16-11 = 5 < 9, so add 16.
Output: {1, 2, 3, 7, 8, 9}, {11, 15, 16}.
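A short Python sketch of this greedy pass (my naming; the input is sorted first, per the note below):

    def split_ranges(nums, max_diff=9):
        # Start a new subset whenever the current element is too far
        # from the first element of the current subset.
        groups = []
        for x in sorted(nums):
            if groups and x - groups[-1][0] < max_diff:
                groups[-1].append(x)
            else:
                groups.append([x])
        return groups

    print(split_ranges([1, 2, 3, 7, 8, 9, 11, 15, 16]))
    # [[1, 2, 3, 7, 8, 9], [11, 15, 16]]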
Note:
If the elements aren't necessarily ordered, we can simply sort them first.
If these subsets don't have to be continuous, it doesn't make a difference, as selecting continuous sets from the ordered input is always better than non-continuous sets.
If the elements aren't necessarily ordered and the subsets must be continuous, that will change the problem a bit.
Proof of optimality:
Let S be the set assignment produced by this algorithm for some arbitrary input.
Take any set assignment T.
Let Si be the i-th set of S.
Let Ti be the i-th set of T.
Let Si-size be the size of the i-th set of S.
Let Ti-size be the size of the i-th set of T.
Assume S and T are different. Then for some Si and Ti, Si-size != Ti-size; specifically, choose the first set i where the two differ.
Since all earlier sets agree, Si and Ti start at the same element, and this algorithm takes as many elements as possible from the start, so Si-size < Ti-size is impossible. Hence Si-size > Ti-size.
In that case, the first element of Si+1 will be greater than the first element of Ti+1, and since the algorithm is greedy, Si+1 will include at least all the elements of Ti+1 not already included by Si.
Because of the above, the first element of Si+2 will similarly be greater than the first element of Ti+2, thus Si+2 will include at least all the elements of Ti+2 not already included by previous sets of S. Similarly for the rest of the sets of S and T. Thus there are at least as many sets in T as there are in S.
Thus the set assignment produced by this algorithm can't be any worse than any other assignment. Thus it is optimal.
