Data structure for finding set containing element

What is a good data structure for finding which set an element belongs to, with N items grouped into M different sets? For example, if the sets are {A,B}, {C,D,E}, {F,G}, how can I find the set containing "D"? The sets are hash sets, so a contains query within a single set is O(1).
If I just have the sets in a list of sets,
[{A,B}, {C,D,E}, {F,G}]
I can look up an item by asking each set in the list whether it contains it. This is simple to implement, but the run time is linear in the number of sets.
A faster approach is storing all sets in a hash table, keyed on every item in each set. That is:
[A -> {A, B},
B -> {A, B},
C -> {C, D, E},
D -> {C, D, E},
E -> {C, D, E},
F -> {F, G},
G -> {F, G}]
That structure lets me retrieve the correct set in O(1) time, but it feels inefficient and ugly. Is there a better data structure that allows for an O(1) lookup of the correct set? Should I make a lookup key by combining hashes like a kind of Bloom filter? Other ideas?

You can implement it this way:
Keep an array of trees; the index is the set number.
Maintain a hash table storing an (element, index) pair for every element across all the sets.
Represent each set as a tree: a unique identifier is the root, and the elements of the set are connected to the root.
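A minimal sketch of that element-to-index idea in Python, using plain hash sets in place of the trees (the names sets, index_of, and set_containing are mine):

# List of sets; the position in the list is the set number.
sets = [{"A", "B"}, {"C", "D", "E"}, {"F", "G"}]

# Hash table mapping every element to the index of its set: O(1) lookup.
index_of = {elem: i for i, s in enumerate(sets) for elem in s}

def set_containing(elem):
    return sets[index_of[elem]]

print(set_containing("D"))  # {'C', 'D', 'E'}

Keying the table on an index rather than on the set object itself keeps the per-element entries small and leaves the sets themselves in one array, as the answer describes.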

Related

Algorithm for allowing concurrent walk of a graph

In a directed acyclic graph describing a set of tasks to process, I need to find all tasks that can be processed concurrently. The graph has no loops and is quite small (~1000 nodes, ~2000 edges); performance is not a primary concern.
Examples with desired result:
[] is a group. All tasks in a group must be processed before continuing
[x & y] means x and y can be processed concurrently (x and y in parallel)
x -> y means x and y must be processed sequentially (x before y)
Example 1: a -> [b & c] -> c
Example 2: [a & e] -> b -> c -> [d & f]
Example 3: [ [a -> b] & [e -> f] ] -> [ [c -> d] & g ]
I do not want to actually execute the graph, but rather build a data structure that is as parallel as possible while maintaining the order. The nomenclature and names of algorithms are not that familiar to me, so I'm having a hard time trying to find similar problems/solutions online.
Mathematically, I would frame this problem as finding a minimally defined series-parallel partial order extending the given partial order.
I would start by transitively reducing the graph and then repeatedly applying two heuristics:
1. If x has one dependent y, and y has one dependency x, merge them into a new node z = [x → y].
2. If x and y have the same dependencies and dependents, merge them into a new node z = [x & y].
Now, if the input is already series-parallel, the result will be a single node. In general, however, this will leave a graph that embeds an N-shaped structure like b → c, b → g, f → g from the last example in the question. Such a structure must be addressed by adding one or more of b → f, c → f, c → g, f → b, f → c, g → c; but in another instance, adding such an edge would in turn create new N-shaped structures. There's no obvious notion of a closure, which is why this problem feels hard to me.
Some of these choices seem worse than others. For example, c → f forces the sequence b → c → f → g, whereas f → c is the only choice that doesn't increase the length of the critical path.
I guess what I'd try is this: if heuristics 1 and 2 have no targets, form a graph with edges x--y if and only if x and y have either a common dependent or a common dependency, compute the connected components of this graph, and &-merge the smallest component that isn't a singleton, followed by another transitive reduction.
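For concreteness, here is a rough Python sketch of heuristics 1 and 2 only (the transitive reduction and the component heuristic are omitted; the edge-set representation and the node-naming scheme are my own, not part of the answer):

from collections import defaultdict

def merge_step(edges):
    # Try one series (heuristic 1) or parallel (heuristic 2) merge on a
    # transitively reduced DAG given as a set of (u, v) edges.
    # Returns (merged_node_name, new_edge_set), or None if nothing applies.
    parents, children, nodes = defaultdict(set), defaultdict(set), set()
    for u, v in edges:
        children[u].add(v)
        parents[v].add(u)
        nodes.update((u, v))

    def contract(x, y, z, drop=frozenset()):
        # Rename x and y to z everywhere, dropping the edges in `drop`.
        return {(z if u in (x, y) else u, z if v in (x, y) else v)
                for (u, v) in edges if (u, v) not in drop}

    for x in nodes:  # heuristic 1: y is x's only dependent, x is y's only dependency
        if len(children[x]) == 1:
            (y,) = children[x]
            if parents[y] == {x}:
                z = f"[{x} -> {y}]"
                return z, contract(x, y, z, drop={(x, y)})
    for x in nodes:  # heuristic 2: identical dependencies and dependents
        for y in nodes:
            if x < y and parents[x] == parents[y] and children[x] == children[y]:
                z = f"[{x} & {y}]"
                return z, contract(x, y, z)
    return None

# Example 2 from the question: [a & e] -> b -> c -> [d & f]
edges = {("a", "b"), ("e", "b"), ("b", "c"), ("c", "d"), ("c", "f")}
while (step := merge_step(edges)) is not None:
    z, edges = step
    print("merged:", z)  # the final merge printed is the full decomposition

On this series-parallel input the merges terminate with a single node, printed as the last line; on a non-series-parallel input the loop stops with the residual N-shaped structure left in edges.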
Here's a solution I came up with (pseudocode):
sequence = []
for each (node, depth) in depthFirstSearch(graph):
    sequence[depth].push(node)
return sequence
The sequence defines the order to process the graph. If an item in it contains more than one node, they can be processed concurrently.
While this allows for some concurrency, it does not advance as fast as it could. For example, f in the third example in the question would have to wait for a to complete (f will be at depth 1, while a and e are at depth 0). Ideally, work on f could start as soon as e is done.
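Here is a runnable Python version of this level-grouping idea; since raw DFS depth depends on visit order, this computes each node's longest distance from a root instead, which is the depth the pseudocode needs. It reproduces the behavior described above, limitation included (the dict-of-children graph encoding is my assumption):

from collections import defaultdict

def level_groups(graph):
    # Group nodes of a DAG by depth (longest path from any root).
    # `graph` maps each node to the list of nodes that depend on it.
    indegree = defaultdict(int)
    for node, children in graph.items():
        for c in children:
            indegree[c] += 1
    depth = {n: 0 for n in graph if indegree[n] == 0}  # roots at depth 0
    order = list(depth)  # process nodes in topological order
    for n in order:
        for c in graph.get(n, []):
            depth[c] = max(depth.get(c, 0), depth[n] + 1)
            indegree[c] -= 1
            if indegree[c] == 0:
                order.append(c)
    groups = defaultdict(list)
    for n, d in depth.items():
        groups[d].append(n)
    return [groups[d] for d in sorted(groups)]

# Example 2 from the question: [a & e] -> b -> c -> [d & f]
g = {"a": ["b"], "e": ["b"], "b": ["c"], "c": ["d", "f"], "d": [], "f": []}
print(level_groups(g))  # [['a', 'e'], ['b'], ['c'], ['d', 'f']]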

Calculation of combinations/cartesian product of sets (without duplicates and without order restrictions)

I have a combinatorial problem that can be solved inefficiently using the cartesian product of multiple sets. Concretely, I have multiple items and multiple elements that satisfy each item. The problem consists of finding all possible combinations of elements that satisfy all items. For example:
items -> elements
------------------------
1 -> {a,b} // a and b cover the item 1
2 -> {a,b} // a and b cover the item 2
3 -> {a,b,c} // a, b and c cover the item 3
4 -> {a,b,e,f} // a, b, e, f cover the item 4
Alternative representation:
element -> items covered
------------------------
a -> {1,2,3,4}
b -> {1,2,3,4}
c -> {3}
e -> {4}
f -> {4}
The goal is to find all combinations that cover items 1,2,3,4.
Valid solutions are:
{a},{a,b},{a,c},{a,e},{a,f},{a,b,c},{a,b,e},{a,b,f},{a,b,c,e},{a,b,c,f}
{b},{b,c},{b,e},{b,f},{b,c,e},{b,c,f}
Note that the order is not important, so {a,b} = {b,a} ({a,b} x {c,d} = {c,d} x {a,b}).
Also, note that {a,a,a,a}, {a,a,a,b}... are redundant combinations.
As you can see, this problem is similar to the set cover problem, where the universe of elements for this example is the set of items U = {1,2,3,4} and the collection of subsets of U is S = {ab = {1,2,3,4}, c = {3}, ef = {4}}, where {1,2,3,4} is the set of items covered by the elements a and b, {3} is the set of items covered by c, and {4} is the set of items covered by the elements e and f. However, the goal here is not finding the minimal combination of sets from S that covers all of U, but finding all combinations of elements {a,b,c,e,f} that cover all items {1,2,3,4}.
A naïve implementation could be done by performing a cartesian product between the sets for 1, 2, 3 and 4, and then filtering out the combinations that are redundant. However, this approach is very inefficient. Suppose I have this situation:
1 -> {a,b,c,d,e,f,g,h}
2 -> {a,b,c,d,e,f,g,h}
3 -> {a,b,c,d,e,f,g,h}
4 -> {a,b,c,d,e,f,g,h}
5 -> {a,b,c,d,e,f,g,h}
6 -> {a,b,c,d,e,f,g,h,i}
A cartesian product of the six sets results in 8^5*9 = 294912 combinations, when there are actually many fewer distinct ones: the subsets of {a,b,c,d,e,f,g,h}, each taken either alone or together with i.
Another way to solve this problem is to enumerate over the elements, skipping combinations that are equivalent to ones previously generated, and also skipping repeated elements. This is fairly easy to compute and can be implemented as an iterator that returns one combination at a time, but I don't know if there is a better way to solve this problem, or if this problem has been studied before.
How would you solve this problem?
First, realize that if a set of elements does not satisfy all items, neither does any of its subsets.
Second, realize that if a set satisfies all items, so do all its supersets.
Now, all you have to do is:
Let S be the set of all elements.
Let R be the empty set.
Define a function find(s, r) which does:
If r already contains s, return r.
If s does not satisfy all items, return r.
Otherwise, add s to r.
For every element e in s:
    let s' be s - e
    let r be find(s', r)
Return r.
Just call find(S,R) and you have your answer.
This method performs some duplicate tests, but it kills a branch as soon as the branch is identified as failing or already explored. This leads to a lot of pruning on a large set of elements.
Both lookup of whether r includes a particular set of elements and the check if s satisfies all items can be made very fast at the expense of extra memory.
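A minimal Python rendering of this search, using frozensets so that r can store sets of sets; the satisfies predicate and the example encoding are my assumptions:

def find_all(satisfies, s, results=None):
    # Enumerate all subsets of s that satisfy the predicate, pruning any
    # branch whose set already fails (all of its subsets must fail too).
    if results is None:
        results = set()
    if s in results or not satisfies(s):
        return results
    results.add(s)
    for e in s:
        find_all(satisfies, s - {e}, results)
    return results

# Example from the question: items covered by each element.
covers = {"a": {1, 2, 3, 4}, "b": {1, 2, 3, 4}, "c": {3}, "e": {4}, "f": {4}}
items = {1, 2, 3, 4}
sat = lambda s: items <= set().union(*(covers[e] for e in s)) if s else False

res = find_all(sat, frozenset("abcef"))
print(sorted("".join(sorted(s)) for s in res))  # 24 satisfying subsets in all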
What if you did this:
1 -> {a,b}
2 -> {b,c}
3 -> {a,b,c}
4 -> {a,e,f}
=>
a -> [1,3,4]
b -> [1,2,3]
c -> [2,3]
e -> [4]
f -> [4]
Then enumerate the combinations of the left side that provide (at least) [1,2,3,4]
For each set in the collection of all-satisfying sets, enumerate combinations with the other elements.
All-Satisfying-Sets: {{a,b},{b,e},{b,f}}
Combinations within All-Satisfying-Sets: {{a,b,e},{a,b,f},{b,e,f},{a,b,e,f}}
Others: {c}
Combinations with Others: {{a,b,c},{b,e,c},{b,f,c},{a,b,e,c},{a,b,f,c},{b,e,f,c},{a,b,e,f,c}}
Or you could do this in Haskell:
import Data.List (union, subsequences, sort)

example1 = [(["a"],[1,2,3,4])
           ,(["b"],[1,2,3,4])
           ,(["c"],[3])
           ,(["e"],[4])
           ,(["f"],[4])]

example2 = [(["a"],[1,2,3,4,5,6])
           ,(["b"],[1,2,3,4,5,6])
           ,(["c"],[1,2,3,4,5,6])
           ,(["e"],[1,2,3,4,5,6])
           ,(["f"],[1,2,3,4,5,6])
           ,(["g"],[1,2,3,4,5,6])
           ,(["h"],[1,2,3,4,5,6])
           ,(["i"],[6])]

combs items list =
  let unify (a,b) (a',b') = (sort (a ++ a'), sort (union b b'))
  in map fst
   . filter ((== items) . snd)
   . map (foldr unify ([],[]))
   . subsequences
   $ list
OUTPUT:
*Main> combs [1..4] example1
[["a"],["b"],["a","b"],["a","c"],["b","c"],["a","b","c"],["a","e"],["b","e"],
["a","b","e"],["a","c","e"],["b","c","e"],["a","b","c","e"],["a","f"],["b","f"],
["a","b","f"],["a","c","f"],["b","c","f"],["a","b","c","f"],["a","e","f"],
["b","e","f"],["a","b","e","f"],["a","c","e","f"],["b","c","e","f"],
["a","b","c","e","f"]]

How do I generate set partitions of a certain size?

I would like to generate partitions of a set in a specific way: I need to filter out all partitions that are not of size N while the partitions are being generated. The general solution is "Generate all “unique” subsets of a set (not a powerset)".
For the set S with the following subsets:
[a,b,c]
[a,b]
[c]
[d,e,f]
[d,f]
[e]
and the following 'unique' elements:
a, b, c, d, e, f
the result of the function/method running with the argument N = 2 should be:
[[a,b,c], [d,e,f]]
While the following partitions should be filtered out by the function/method:
[[a,b,c], [d,f], [e]]
[[a,b], [c], [d,e,f]]
[[a,b], [c], [d,f], [e]]
The underlying data structure is not important and could be arrays, sets or whatever.
Reason: I need to filter some partitions out before I have the full set of all partitions, because the function/method which generates all partitions is rather computationally intensive.
According to "Generating the Partitions of a Set", the number of possible partitions can be huge: 44152005855084346 for 23 elements. My data is 50-300 elements in the starting set, so I definitely need to filter out partitions that have size not equal to N before I save them anywhere.
Once you have the partitions, as given by the answer from Frederick Cheung that you linked, do:
partitions.select{|partition| partition.length == 2}
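That filters after the fact; given the questioner's concern about generating everything first, here is a sketch (in Python rather than Ruby, and not part of the original answer) that generates only the partitions with exactly N blocks, pruning partial partitions that cannot end up with N blocks:

def partitions_into(elements, n):
    # Yield partitions of `elements` into exactly n non-empty blocks.
    def rec(i, blocks):
        remaining = len(elements) - i
        # Prune: too many blocks already, or too few elements left to reach n.
        if len(blocks) > n or len(blocks) + remaining < n:
            return
        if i == len(elements):
            yield [list(b) for b in blocks]
            return
        x = elements[i]
        for b in blocks:                # put x into an existing block
            b.append(x)
            yield from rec(i + 1, blocks)
            b.pop()
        blocks.append([x])              # or let x start a new block
        yield from rec(i + 1, blocks)
        blocks.pop()
    yield from rec(0, [])

for p in partitions_into(list("abcd"), 2):
    print(p)                            # the 7 two-block partitions of {a,b,c,d}

Because each element either joins an existing block or opens a new one in order, every size-N partition is produced exactly once and nothing larger is ever materialized.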

Using Mathematica to Automatically Reveal Matrix Structure

I spend a lot of time looking at larger matrices (10x10, 20x20, etc) which usually have some structure, but it is difficult to quickly determine the structure of them as they get larger. Ideally, I'd like to have Mathematica automatically generate some representation of a matrix that will highlight its structure. For instance,
(A = {{1, 2 + 3 I}, {2 - 3 I, 4}}) // StructureForm
would give
{{a, b}, {Conjugate[b], c}}
or even
{{a, b + c I}, {b - c I, d}}
is acceptable. A somewhat naive implementation
StructureForm[M_?MatrixQ] :=
 MatrixForm@Module[
   {pos, chars},
   pos = Reap[
      Map[Sow[Position[M, #1], #1] &, M, {2}], _,
      Union[Flatten[#2, 1]] &
      ][[2]]; (* establishes equality relationship *)
   chars = CharacterRange["a", "z"][[;; Length@pos]];
   SparseArray[Flatten[Thread /@ Thread[pos -> chars]], Dimensions[M]]
   ]
works only for real numeric matrices, e.g.
StructureForm@{{1, 2}, {2, 3}} == {{a, b}, {b, c}}
Obviously, I need to define what relationships I think may exist (equality, negation, conjugate, negative conjugate, etc.), but I'm not sure how to establish that these relationships exist, at least in a clean manner. And, once I have the relationships, the next question is how to determine which is the simplest, in some sense? Any thoughts?
One possibility that comes to mind is for each pair of elements generate a triple relating their positions, like {{1,2}, Conjugate, {2,1}} for A, above, then it becomes amenable to graph algorithms.
Edit: Incidentally, my inspiration is from the Matrix Algorithms series (1, 2) by Stewart.
We can start by defining the relationships that we want to recognize:
ClearAll@relationship
relationship[a_ -> sA_, b_ -> sB_] /; b == a := b -> sA
relationship[a_ -> sA_, b_ -> sB_] /; b == -a := b -> -sA
relationship[a_ -> sA_, b_ -> sB_] /; b == Conjugate[a] := b -> SuperStar[sA]
relationship[a_ -> sA_, b_ -> sB_] /; b == -Conjugate[a] := b -> -SuperStar[sA]
relationship[_, _] := Sequence[]
The form in which these relationships are expressed is convenient for the definition of structureForm:
ClearAll@structureForm
structureForm[matrix_?MatrixQ] :=
  Module[{values, rules, pairs, inferences}
  , values = matrix // Flatten // DeleteDuplicates
  ; rules = Thread[Rule[values, CharacterRange["a", "z"][[;; Length@values]]]]
  ; pairs = rules[[#]]& /@ Select[Tuples[Range[Length@values], 2], #[[1]] < #[[2]]&]
  ; inferences = relationship @@@ pairs
  ; matrix /. inferences ~Join~ rules
  ]
In a nutshell, this function checks each possible pair of values in the matrix inferring a substitution rule whenever a pair matches a defined relationship. Note how the relationship definitions are expressed in terms of pairs of substitution rules in the form value -> name. Matrix values are assigned letter names, proceeding from left-to-right, top-to-bottom. Redundant inferred relationships are ignored assuming a precedence in that same order.
Beware that the function will run out of names after it finds 26 distinct values -- an alternate name-assignment strategy will be needed if that is an issue. Also, the names are being represented as strings instead of symbols. This conveniently dodges any unwanted bindings of the single-letter symbols names. If symbols are preferred, it would be trivial to apply the Symbol function to each name.
Here are some sample uses of the function:
In[31]:= structureForm@{{1, 2 + 3 I}, {2 - 3 I, 4}}
Out[31]= {{"a", "b"}, {SuperStar["b"], "d"}}
In[32]:= $m = a + b I /. a | b :> RandomInteger[{-2, 2}, {10, 10}];
$m // MatrixForm
$m // structureForm // MatrixForm
Have you tried looking at the eigenvalues? The eigenvalues reveal a great deal of information about the structure and symmetry of matrices and are standard in statistical analysis of datasets. For example:
Hermitian/symmetric matrices have real eigenvalues.
Positive semi-definite matrices have non-negative eigenvalues, and vice versa.
Rotation matrices have complex eigenvalues.
Circulant matrices have eigenvalues that are simply the DFT of the first row (see the quick check below). The beauty of circulant matrices is that every circulant matrix has the same set of eigenvectors. In some cases, these results can be extended to Toeplitz matrices.
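A quick numerical check of the circulant claim, as a hypothetical NumPy example (the test matrix is my own):

import numpy as np

row = np.array([4.0, 1.0, 2.0, 3.0])
C = np.array([np.roll(row, k) for k in range(len(row))])  # circulant matrix
print(np.allclose(np.sort_complex(np.linalg.eigvals(C)),
                  np.sort_complex(np.fft.fft(row))))      # True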
If you're dealing with matrices that are random (an experimental observation can be modeled as a random matrix), you could also read up on random matrix theory, which relates the distributions of eigenvalues to the underlying symmetries in the matrix and the statistical distributions of elements. Specifically,
The eigenvalue distribution of symmetric/Hermitian Gaussian matrices is a semicircle (Wigner's semicircle law).
Eigenvalue distributions of Wishart matrices (if A is a random Gaussian matrix, W = AA' is a Wishart matrix) are given by the Marcenko-Pastur distribution.
The differences (spacings) between the eigenvalues also convey information about the matrix.
I'm not sure if the structure that you're looking for is like a connected graph within the matrix or something similar... I presume random matrix theory (which is more general and vast than those links will ever tell you) has some results in this regard.
Perhaps this is not really what you were looking for, but as far as I know, there is no one-stop solution to getting the structure of a matrix. You'll have to use multiple tools to nail it down, and if I were to do it, eigenvalues would be my first pick.

Problems with a simple dependency algorithm

In my webapp, we have many fields that sum up other fields, and those fields sum up more fields. I know that this is a directed acyclic graph.
When the page loads, I calculate values for all of the fields. What I'm really trying to do is to convert my DAG into a one-dimensional list which would contain an efficient order to calculate the fields in.
For example:
A = B + D, D = B + C, B = C + E
Efficient calculation order: E -> C -> B -> D -> A
Right now my algorithm just does simple inserts into a list iteratively, but I've run into some situations where that starts to break. I'm thinking what would be needed instead is to work out all the dependencies into a tree structure, and from there convert that into the one-dimensional form. Is there a simple algorithm for converting such a tree into an efficient ordering?
Are you looking for topological sort? This imposes an ordering (a sequence or list) on a DAG. It's used by, for example, spreadsheets, to figure out dependencies between cells for calculations.
What you want is a depth-first search.
function ExamineField(Field F)
{
    if (F.already_in_list)
        return
    foreach C child of F
    {
        call ExamineField(C)
    }
    AddToList(F)
}
Then just call ExamineField() on each field in turn, and the list will be populated in an optimal ordering according to your spec.
Note that if the fields are cyclic (that is, you have something like A = B + C, B = A + D) then the algorithm must be modified so that it doesn't go into an endless loop.
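A compact Python rendering of this approach, with the cycle guard just mentioned; here deps maps each field to the fields it depends on (the encoding and names are my assumptions):

def calc_order(deps):
    # Post-order DFS: a field is appended only after everything it
    # depends on, giving an efficient calculation order.
    order, done, visiting = [], set(), set()
    def visit(f):
        if f in done:
            return                      # already in list, nothing happens
        if f in visiting:
            raise ValueError(f"cyclic dependency involving {f!r}")
        visiting.add(f)
        for child in deps.get(f, []):
            visit(child)
        visiting.discard(f)
        done.add(f)
        order.append(f)
    for f in deps:
        visit(f)
    return order

# The question's example: A = B + D, D = B + C, B = C + E
deps = {"A": ["B", "D"], "D": ["B", "C"], "B": ["C", "E"]}
print(calc_order(deps))  # ['C', 'E', 'B', 'D', 'A']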
For your example, the calls would go:
ExamineField(A)
ExamineField(B)
ExamineField(C)
AddToList(C)
ExamineField(E)
AddToList(E)
AddToList(B)
ExamineField(D)
ExamineField(B)
(already in list, nothing happens)
ExamineField(C)
(already in list, nothing happens)
AddToList(D)
AddToList(A)
ExamineField(B)
(already in list, nothing happens)
ExamineField(C)
(already in list, nothing happens)
ExamineField(D)
(already in list, nothing happens)
ExamineField(E)
(already in list, nothing happens)
And the list would end up C, E, B, D, A.
