How do I minimally represent the check state of a check tree? - algorithm

If I have a tree structure whose nodes can have zero to many children, with each node holding some data value along with a boolean switch, how do I minimally represent the state of this tree for nodes with a particular switch value?
For example, say that my tree looks something like:
A[0] -> B[1] -> C[1]
|-----> D[1]
|-----> E[1]
Here we have a state where 4 nodes are checked, is there a way to represent this state in a concise manner? The naive approach would be to list the four nodes as being checked, but what if node B had 100 children instead of just four?
My current line of thinking is to store each node's ancestor in the data component and describe the checked state in terms of the set of ancestors that minimize the data required to represent a state. In the tree below, an ancestor of node N is represented as n'. So the above tree would now look something like:
A[0, {a}] -> B[1, {a', b}] -> C[1, {a' b' c}]
|--------------> D[1, {a' b' d}]
|--------------> E[1, {a' b' e}]
Now you can analyze the tree and see that all of node A's children are checked, and describe the state simply as "the nodes with data element a' are set to 1", or just [a']. If node D's state switched to 0, then you could describe the tree state as [a' not d].
Are there data structures or algorithms that can be used to solve a problem of this type? Any thoughts on a better approach? Any thoughts on the analysis algorithm?
Thanks

Use a preorder tree traversal starting from the root. If a node is checked, don't traverse its children. For each traversed node, store its checked state (0/1) in a bitmap (8 bits per byte). Finally, compress the result with zip/bzip or any other compression technique.
When you reconstruct the state, first decompress, then use the same preorder traversal and set each node based on the stored state; if the state is checked, set all of the node's descendants to checked and skip them.
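A minimal sketch of that encode/decode round trip might look like the following. It assumes, as the answer does, that a checked node implies its whole subtree is checked; the `Node` class and the function names are hypothetical, and packing the bits into bytes and compressing them is left out.

```python
class Node:
    """Hypothetical tree node: a name, a checked flag and a list of children."""

    def __init__(self, name, checked=False, children=None):
        self.name = name
        self.checked = checked
        self.children = children or []


def encode(root):
    """Preorder walk; emit one bit per visited node, skipping checked subtrees."""
    bits = []
    stack = [root]
    while stack:
        node = stack.pop()
        bits.append(1 if node.checked else 0)
        if not node.checked:
            # Push children in reverse so they are popped left to right.
            stack.extend(reversed(node.children))
    return bits


def _check_subtree(node):
    """Mark a node and all of its descendants as checked."""
    node.checked = True
    for child in node.children:
        _check_subtree(child)


def decode(root, bits):
    """Replay the same preorder walk, restoring the checked flags in place."""
    remaining = iter(bits)

    def visit(node):
        if next(remaining):
            _check_subtree(node)  # assumption: checked covers the whole subtree
        else:
            node.checked = False
            for child in node.children:
                visit(child)

    visit(root)
```

For the example tree in the question (A unchecked, B checked with children C, D and E), `encode` yields `[0, 1]`, and that stays two bits no matter how many children B has.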

In general, no technique can always store the checked elements in fewer than n bits of space, where n is the number of elements in the tree. The rationale is that there are 2^n different possible check states, so you need at least 2^n different encodings, and at least one of them must be n bits long or longer, since there are only 2^n - 1 bit strings shorter than n bits.
Given this, if you really want to minimize space usage, I would suggest going with an encoding like the one @yi_H suggests. It uses at most n bits for each encoding. You might be able to compress most of the encodings by applying a standard compression algorithm to the bits, which for practical sets of checked nodes might do quite well, but which degrades gracefully in the worst case.
Hope this helps!

Related

when exactly should root split in a B Tree

I learned B trees recently and from what I understand a node can have minimum t-1 keys and maximum 2t-1 keys given minimum degree t. Exception being root can have even 1 key.
Here is the example from CLRS 3rd edition Fig 18.7 (Page 498) where t=3
min keys = 3-1 = 2
max keys = 2*3-1 = 5
In example (d), when L is inserted, why is the root split when it doesn't violate the B-tree properties at that moment (it has 5 keys, which is the maximum allowed)?
Why isn't inserting L into [J K L] without splitting [G M P T X] considered?
Should I always split the root when it reaches the maximum?
There are several variants of the insertion algorithm for B-trees. In this case the insertion algorithm is the "single pass down the tree" variant.
The background for this variant is given on page 493:
Since we cannot insert a key into a leaf node that is full, we introduce an operation that splits a full node 𝑦 (having 2𝑡 − 1 keys) around its median key 𝑦.key_𝑡 into two nodes having only 𝑡 − 1 keys each. The median key moves up into 𝑦’s parent to identify the dividing point between the two new trees. But if 𝑦’s parent is also full, we must split it before we can insert the new key, and thus we could end up splitting full nodes all the way up the tree.
As with a binary search tree, we can insert a key into a B-tree in a single pass down the tree from the root to a leaf. To do so, we do not wait to find out whether we will actually need to split a full node in order to do the insertion. Instead, as we travel down the tree searching for the position where the new key belongs, we split each full node we come to along the way (including the leaf itself). Thus whenever we want to split a full node 𝑦, we are assured that its parent is not full.
In other words, this insertion algorithm will split a node earlier than might be strictly needed, in order to avoid having to split nodes while backtracking out of the recursion.
This algorithm is further described on page 495 with pseudo code.
This explains why at the insertion of L the root node is split immediately before any recursive call is made.
Alternative algorithms would not do this, and would delay the split up to the point when it is inevitable.
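To make the "split on the way down" behaviour concrete, here is a rough Python sketch of single-pass insertion. It is a simplified illustration in the spirit of the textbook procedure, not a transcription of it; the class and method names are my own, and the demo root is a leaf rather than the internal node shown in the figure.

```python
import bisect


class BTreeNode:
    def __init__(self, leaf=True):
        self.keys = []
        self.children = []
        self.leaf = leaf


class BTree:
    def __init__(self, t):
        self.t = t  # minimum degree: at most 2t - 1 keys per node
        self.root = BTreeNode()

    def _is_full(self, node):
        return len(node.keys) == 2 * self.t - 1

    def _split_child(self, parent, i):
        """Split the full child parent.children[i] around its median key."""
        t = self.t
        y = parent.children[i]
        z = BTreeNode(leaf=y.leaf)
        median = y.keys[t - 1]
        z.keys = y.keys[t:]
        y.keys = y.keys[:t - 1]
        if not y.leaf:
            z.children = y.children[t:]
            y.children = y.children[:t]
        parent.keys.insert(i, median)
        parent.children.insert(i + 1, z)

    def insert(self, key):
        # A full root is split *before* descending, even if the key
        # would have fitted into a leaf further down.
        if self._is_full(self.root):
            new_root = BTreeNode(leaf=False)
            new_root.children.append(self.root)
            self.root = new_root
            self._split_child(new_root, 0)
        node = self.root
        while not node.leaf:
            i = bisect.bisect_right(node.keys, key)
            if self._is_full(node.children[i]):
                self._split_child(node, i)
                if key > node.keys[i]:
                    i += 1
            node = node.children[i]
        bisect.insort(node.keys, key)
```

With `t = 3`, inserting G, M, P, T, X fills the root; inserting L then splits the root around P first and only afterwards places L, leaving `[P]` at the root with children `[G, L, M]` and `[T, X]`.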

Special Augmented Red-Black Tree

I'm looking for some help on a specific augmented red-black binary tree. My goal is to make every single operation run in O(log(n)) in the worst case. The nodes of the tree will have an integer as their key. This integer cannot be negative, and the tree should be sorted by a simple compare function on this integer. Additionally, each node will also store another value: its power. (Note that this has nothing to do with mathematical exponents.) Power is a floating point value. Both power and key are always non-negative. The tree must be able to provide these operations in O(log(n)) runtime:
insert(key, power): Insert into the tree. The node in the tree should also store the power, and any other variables needed to augment the tree in such a way that all other operations are also O(log(n)). You can assume that there is no node in the tree which already has the same key.
get(key): Return the power of the node identified by the key.
delete(key): Delete the node with key (assume that the key does exist in the tree prior to the delete).
update(key,power): Update the power at the node given by key.
Here is where it gets interesting:
highestPower(key1, key2): Return the maximum power of all nodes with key k in the range key1 <= k <= key2. That is, all keys from key1 to key2, inclusive on both ends.
powerSum(key1, key2): Return the sum of the powers of all nodes with key k in the range key1 <= k <= key2. That is, all keys from key1 to key2, inclusive on both ends.
The main thing I would like to know is what extra variables I should store at each node. Then I need to work out how to use each one of these in each of the above functions so that the tree stays balanced and all operations can run in O(log(n)). My original thought was to store the following:
highestPowerLeft: The highest power of all child nodes to the left of this node.
highestPowerRight: The highest power of all child nodes to the right of this node.
powerSumLeft: The sum of the powers of all child nodes to the left of this node.
powerSumRight: The sum of the powers of all child nodes to the right of this node.
Would just this extra information work? If so, I'm not sure how to deal with it in the functions that are required. Frankly, my knowledge of red-black trees isn't great, because I feel like every explanation of them gets convoluted really fast, and all the rotations and things confuse the hell out of me. Thanks to anyone willing to attempt helping here; I know what I'm asking is far from simple.
A very interesting problem! For the sum, your proposed method should work (it should be enough to only store the sum of the powers to the left of the current node, though; this technique is called prefix sum). For the max, it doesn't work, since if both max values are equal, that value is outside of your interval, so you have no idea what the max value in your interval is. My only idea is to use a segment tree (in which the leaves are the nodes of your red-black tree), which lets you answer the question "what is the maximal value within the given range?" in logarithmic time, and also lets you update individual values in logarithmic time. However, since you need to insert new values into it, you need to keep it balanced as well.
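As a rough illustration of the segment tree idea, the sketch below supports point updates and inclusive range-maximum queries in logarithmic time. It assumes a fixed number of slots indexed by position 0..n-1, so it does not address inserting new keys or keeping the structure balanced; the class name `MaxSegmentTree` is hypothetical.

```python
class MaxSegmentTree:
    """Iterative segment tree over a fixed-size list of powers."""

    def __init__(self, values):
        self.n = len(values)
        self.tree = [float("-inf")] * (2 * self.n)
        # Leaves live at indices n .. 2n-1; internal nodes at 1 .. n-1.
        self.tree[self.n:] = list(values)
        for i in range(self.n - 1, 0, -1):
            self.tree[i] = max(self.tree[2 * i], self.tree[2 * i + 1])

    def update(self, pos, value):
        """Set the power at position pos and fix the maxima above it."""
        i = pos + self.n
        self.tree[i] = value
        i //= 2
        while i >= 1:
            self.tree[i] = max(self.tree[2 * i], self.tree[2 * i + 1])
            i //= 2

    def query(self, lo, hi):
        """Return the maximum over positions lo..hi, inclusive on both ends."""
        result = float("-inf")
        lo += self.n
        hi += self.n + 1  # switch to a half-open interval [lo, hi)
        while lo < hi:
            if lo & 1:
                result = max(result, self.tree[lo])
                lo += 1
            if hi & 1:
                hi -= 1
                result = max(result, self.tree[hi])
            lo //= 2
            hi //= 2
        return result
```

Both `update` and `query` touch O(log n) tree cells. Replacing `max` with addition (and `-inf` with 0) would give range sums with the same structure.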

Efficient algorithm for eliminating nodes in "graph"?

Suppose I have a graph with 2^N - 1 nodes, numbered 1 to 2^N - 1. Node i "depends on" node j if all the bits in the binary representation of j that are 1 are also 1 in the binary representation of i. So, for instance, if N=3, then node 7 depends on all other nodes. Node 6 depends on nodes 4 and 2.
The problem is eliminating nodes. I can eliminate a node if no other nodes depend on it. No nodes depend on 7; so I can eliminate 7. After eliminating 7, I can eliminate 6, 5, and 3, etc. What I'd like is to find an efficient algorithm for listing all the possible unique elimination paths. (that is, 7-6-5 is the same as 7-5-6, so we only need to list one of the two). I have a dumb algorithm already, but I think there must be a better way.
I have three related questions:
Does this problem have a general name?
What's the best way to solve it?
Is there a general formula for the number of unique elimination paths?
Edit: I should note that a node cannot depend on itself, by definition.
Edit2: Let S = {s_1, s_2, s_3,...,s_m} be the set of all m valid elimination paths. s_i and s_j are "equivalent" (for my purposes) iff the two eliminations s_i and s_j would lead to the same graph after elimination. I suppose to be clearer I could say that what I want is the set of all unique graphs resulting from valid elimination steps.
Edit3: Note that elimination paths may be different lengths. For N=2, the 5 valid elimination paths are (),(3),(3,2),(3,1),(3,2,1). For N=3, there are 19 unique paths.
Edit4: Re: my application - the application is in statistics. Given N factors, there are 2^N - 1 possible terms in a statistical model (see http://en.wikipedia.org/wiki/Analysis_of_variance#ANOVA_for_multiple_factors) that can contain the main effects (the factors alone) and various (2,3,... way) interactions between the factors. But an interaction can only be present in a model if all sub-interactions (or main effects) are present. For three factors a, b, and c, for example, the 3 way interaction a:b:c can only be present if all the constituent two-way interactions (a:b, a:c, b:c) are present (and likewise for the two-ways). Thus, the model a + b + c + a:b + a:b:c would not be allowed. I'm looking for a quick way to generate all valid models.
It seems easier to think about this in terms of sets: you are looking for families of subsets of {1, ..., N} such that for each set in the family, all its subsets are also present. Each such family is determined by its inclusion-wise maximal sets, none of which can contain another. Families of sets in which no set contains another are called Sperner families (or antichains). So you are looking for Sperner families, plus all the subsets of the sets in each family. Known algorithms for enumerating Sperner families or antichains in general may be useful; without knowing what you actually want to do with them, it's hard to tell.
Thanks to @FalkHüffner's answer, I saw that what I wanted to do was equivalent to finding monotonic Boolean functions for N arguments. If you look at the figure on the Wikipedia page for Dedekind numbers (http://en.wikipedia.org/wiki/Dedekind_number), it expresses the problem graphically. There is an algorithm for generating monotonic Boolean functions (http://www.mathpages.com/home/kmath094.htm) and it is quite simple to construct.
For my purposes, I use the algorithm, then eliminate the first column and last row of the resulting binary arrays. Starting from the top row down, each row has a 1 in the ith column if one can eliminate the ith node.
Thanks!
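For small N, the counts quoted in the question can be checked with a brute-force search over the graphs reachable by valid eliminations. This is only a sanity-check sketch, not the monotonic Boolean function algorithm linked above; the function name `reachable_graphs` is hypothetical.

```python
def reachable_graphs(n):
    """Return every set of remaining nodes reachable by valid eliminations."""

    def depends_on(i, j):
        # i depends on j if every 1-bit of j is also set in i (and i != j).
        return i != j and (i & j) == j

    start = frozenset(range(1, 2 ** n))
    seen = {start}
    stack = [start]
    while stack:
        remaining = stack.pop()
        for j in remaining:
            # j may be eliminated only if no remaining node depends on it.
            if not any(depends_on(i, j) for i in remaining):
                smaller = remaining - {j}
                if smaller not in seen:
                    seen.add(smaller)
                    stack.append(smaller)
    return seen


if __name__ == "__main__":
    for n in (2, 3):
        print(n, len(reachable_graphs(n)))
```

This gives 5 graphs for N=2 and 19 for N=3, matching the question, and each count is one less than the corresponding Dedekind number (6 and 20). The search grows very quickly with N, so it is only useful as a check.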
You can build a "heap", in which at depth X are all the nodes with X zeros in their binary representation.
Then, starting from the bottom layer, connect each item to a random parent at the layer above, until you get a single-component graph.
Note that this graph is a tree, i.e., each node except for the root has exactly one parent.
Then, traverse the tree (starting from the root) and count the total number of paths in it.
UPDATE:
The method above is bad, because you cannot just pick a random parent for a given item - you have a limited number of items from which you can pick a "legal" parent... But I'm leaving this method here for other people to give their opinion (perhaps it is not "that bad").
In any case, why don't you take your graph, extract a spanning tree (you can use Prim's or Kruskal's algorithm for finding a minimum spanning tree), and then count the number of paths in it?

Checking if A is a part of binary tree B

Let's say I have binary trees A and B and I want to know if A is a "part" of B. I am not only talking about subtrees. What I want to know is if B has all the nodes and edges that A does.
My thought was that, since a tree is essentially a graph, I could view this question as a subgraph isomorphism problem (i.e. checking whether A is a subgraph of B). But according to Wikipedia this is an NP-complete problem.
http://en.wikipedia.org/wiki/Subgraph_isomorphism_problem
I know that you can check if A is a subtree of B or not with O(n) algorithms (e.g. using preorder and inorder traversals to flatten the trees to strings and checking for substrings). I was trying to modify this a little to see if I can also test for just "parts" as well, but to no avail. This is where I'm stuck.
Are there any other ways to view this problem other than using subgraph isomorphism? I'm thinking there must be faster methods since binary trees are much more restricted and simpler versions of graphs.
Thanks in advance!
EDIT: I realized that even a brute-force method for my question would only take O(m * n) in the worst case, which is polynomial. So I guess this isn't an NP-complete problem after all. My next question, then: is there an algorithm that is faster than O(m * n)?
I would approach this problem in two steps:
1. Find the root of A in B (using either BFS or DFS).
2. Verify that A is contained in B (given that starting node), using a recursive algorithm, as below. (I concocted some crazy pseudo-language, because you didn't specify the language. I think this should be understandable, no matter your background.) Note that a is a node from A (initially the root) and b is a node from B (initially the node found in step 1).
function checkTrees(node a, node b) returns boolean
    if a does not exist or b does not exist then
        // base of the recursion
        return false
    else if a is different from b then
        // compare the current nodes
        return false
    else
        // check the children of a
        boolean leftFound = true
        boolean rightFound = true
        if a.left exists then
            // try to match the left child of a with
            // every possible neighbor of b
            leftFound = checkTrees(a.left, b.left)
                        or checkTrees(a.left, b.right)
                        or checkTrees(a.left, b.parent)
        if a.right exists then
            // try to match the right child of a with
            // every possible neighbor of b
            rightFound = checkTrees(a.right, b.left)
                         or checkTrees(a.right, b.right)
                         or checkTrees(a.right, b.parent)
        return leftFound and rightFound
About the running time: let m be the number of nodes in A and n be the number of nodes in B. The search in the first step takes O(n) time. The running time of the second step depends on one crucial assumption I made, but that might be wrong: I assumed that every node of A is equal to at most one node of B. If that is the case, the running time of the second step is O(m) (because you can never search too far in the wrong direction). So the total running time would be O(m + n).
While writing down my assumption, I start to wonder whether that's not oversimplifying your case...
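For readers who prefer something executable, the two steps could be sketched in Python roughly as follows. This relies on the same assumption stated above (every node value occurs at most once in each tree); the `Node` class with a `parent` pointer and the names `find`, `check_trees` and `is_part_of` are hypothetical.

```python
class Node:
    """Hypothetical binary tree node that also records its parent."""

    def __init__(self, value, left=None, right=None):
        self.value = value
        self.left = left
        self.right = right
        self.parent = None
        for child in (left, right):
            if child is not None:
                child.parent = self


def find(root, value):
    """Step 1: depth-first search for the node of B holding value."""
    stack = [root]
    while stack:
        node = stack.pop()
        if node is None:
            continue
        if node.value == value:
            return node
        stack.extend((node.left, node.right))
    return None


def check_trees(a, b):
    """Step 2: every edge below a must map to an edge touching b."""
    if a is None or b is None:
        return False
    if a.value != b.value:
        return False
    neighbors = (b.left, b.right, b.parent)
    for child in (a.left, a.right):
        if child is not None and not any(
            check_trees(child, nb) for nb in neighbors
        ):
            return False
    return True


def is_part_of(a_root, b_root):
    """Assumes node values are unique within each tree."""
    return check_trees(a_root, find(b_root, a_root.value))
```

Like the pseudocode, this treats edges as undirected: a child in A may correspond to the parent of the matched node in B.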
You could compare the trees bottom-up as follows:
For each leaf in tree A, identify the corresponding node in tree B.
Start a parallel traversal towards the root in both trees from the nodes just matched.
Specifically, move to the parent of a node in A and subsequently move towards the root in B until you encounter either the corresponding node in B (proceed), a marked node in A (see below; if a match in B is found, proceed, else fail), or the root of B (fail).
Mark all nodes visited in A.
You succeed if you haven't failed ;-).
The main part of the algorithm runs in O(e_B): in the worst case, all edges in B are visited a constant number of times. The leaf node matching will run in O(n_A * log n_B) if the B vertices are sorted, and in O(n_A * log n_A + n_B * log n_B + n) = O(n_B * log n_B) otherwise (sort each node set, then linearly scan the results).
EDIT:
Re-reading your question, step 2 above is even easier: for matching nodes in A and B, their parents must match too (otherwise there would be a mismatch between the edge sets). No effect on the worst-case run time, of course.

Data structure supporting Add and Partial-Sum

Let A[1..n] be an array of real numbers. Design an algorithm to perform any sequence of the following operations:
Add(i,y) -- Add the value y to the ith number.
Partial-sum(i) -- Return the sum of the first i numbers, i.e. A[1] + A[2] + ... + A[i].
There are no insertions or deletions; the only change is to the values of the numbers. Each operation should take O(log n) steps. You may use one additional array of size n as a work space.
How do I design a data structure for the above operations?
Construct a balanced binary tree with n leaves; stick the elements along the bottom of the tree in their original order.
Augment each node in the tree with "sum of leaves of subtree"; a binary tree with n leaves has n-1 internal nodes, so this takes O(n) setup time (which we have).
Querying a partial-sum goes like this: Descend the tree towards the query (leaf) node, but whenever you descend right, add the subtree-sum on the left plus the element you just visited, since those elements are in the sum.
Modifying a value goes like this: Find the query (leaf) node. Calculate the difference you added. Travel to the root of the tree; as you travel to the root, update each node you visit by adding in the difference (you may need to visit adjacent nodes, depending on whether you're storing "sum of leaves of subtree" or "sum of left-subtree plus myself" or some variant); the main idea is that you appropriately update all the augmented branch data that needs updating, and that data will be on the root path or adjacent to it.
The two operations take O(log(n)) time (that's the height of a tree), and you do O(1) work at each node.
You can probably use any search tree (e.g. a self-balancing binary search tree might allow for insertions, others for quicker access) but I haven't thought that one through.
You may use a Fenwick tree.
See this question
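A minimal Fenwick tree (binary indexed tree) sketch for the two operations might look like this. It uses a single extra array of size n + 1 and 1-based indices as in the problem statement; the class and method names are hypothetical, and the constructor shown here builds the structure in O(n log n) for simplicity.

```python
class FenwickTree:
    def __init__(self, values):
        self.n = len(values)
        self.tree = [0.0] * (self.n + 1)  # slot 0 is unused
        for i, v in enumerate(values, start=1):
            self.add(i, v)

    def add(self, i, y):
        """Add(i, y): add y to the i-th number (1-based)."""
        while i <= self.n:
            self.tree[i] += y
            i += i & -i  # jump to the next node responsible for index i

    def partial_sum(self, i):
        """Partial-sum(i): return A[1] + ... + A[i]."""
        total = 0.0
        while i > 0:
            total += self.tree[i]
            i -= i & -i  # strip the lowest set bit
        return total
```

Each loop runs once per set bit position of the index, so both operations take O(log n) steps.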
