Given an n-ary tree of integers, the task is to find the maximum sum of a subsequence with the constraint that no 2 numbers in the sequence should share a common edge in the tree.
Example:
1
/ \
2 5
/ \
3 4
Maximum non adjacent sum = 3 + 4 + 5 = 12
The following is the faulty extension of the algorithm outlined in http://www.geeksforgeeks.org/maximum-sum-such-that-no-two-elements-are-adjacent?
def max_sum(node, inc_sum, exc_sum):
    for child in node.children:
        exc_new = max(inc_sum, exc_sum)
        inc_sum = exc_sum + child.val
        exc_sum = exc_new
        inc_sum, exc_sum = max(max_sum(child, inc_sum, exc_sum),
                               max_sum(child, inc_sum, inc_sum - node.val))
    return exc_sum, inc_sum
But I wasn't sure whether swapping exc_sum and inc_sum on return is the right way to get the result. I also don't see how to keep track of the candidate sums that can lead to the maximum: in this example, the maximum sum in the left subtree is (1+3+4), whereas the sum that leads to the final maximum is (3+4+5). How should (3+4) be tracked? Should all the intermediate sums be stored in a table?
Let's say dp[u][select] stores the answer: the maximum subsequence sum, with no two chosen nodes sharing an edge, considering only the subtree rooted at node u (with u either selected or not, according to select). You can then write a recursive program in which the state of each call is (u, select), where u is the root of the subtree being considered and select says whether or not we select node u. This gives the following pseudocode:
/* Initialize dp[][] to be -1 for all values (u,select) */
/* Select is 0 or 1 for false/true respectively */
int func(int node, int select)
{
    if (dp[node][select] != -1) return dp[node][select];
    int ans = 0, i;
    // assuming value of node is same as node number
    if (select) ans = node;
    // edges[i] stores children of node i
    for (i = 0; i < edges[node].size(); i++)
    {
        if (select) ans = ans + func(edges[node][i], 1 - select);
        else ans = ans + max(func(edges[node][i], 0), func(edges[node][i], 1));
    }
    dp[node][select] = ans;
    return ans;
}
// from main call, root is root of tree and answer is
// your final answer
answer = max(func(root, 0), func(root, 1));
We have used memoization in addition to recursion to reduce the time complexity. It's O(V+E) in both space and time. You can see a working version of the code here: Code. Click on the fork button in the top right corner to run it on the test case
4 1
1 2
1 5
2 3
2 4
It gives output 12 as expected.
The input format is specified in comments in the code, along with other clarifications. It's in C++, but there are no significant changes if you want it in Python once you understand the code. Do post in the comments if you have any doubts regarding the code.
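For readers who prefer Python, here is a minimal sketch of the same include/exclude recursion. The Node class and the function name max_non_adjacent_sum are hypothetical names chosen for illustration, not part of the linked code; the sketch also assumes node values are non-negative, and it returns both states from one call instead of memoizing in a dp table.

```python
class Node:
    """Hypothetical n-ary tree node used only for this sketch."""
    def __init__(self, val, children=None):
        self.val = val
        self.children = children or []


def max_non_adjacent_sum(root):
    """Best sum of node values such that no two chosen nodes share an edge."""
    def solve(node):
        inc = node.val   # best sum in this subtree when node is selected
        exc = 0          # best sum in this subtree when node is not selected
        for child in node.children:
            c_inc, c_exc = solve(child)
            inc += c_exc               # a selected node forbids its children
            exc += max(c_inc, c_exc)   # otherwise each child is free to choose
        return inc, exc

    return max(solve(root))


# The example tree from the question: 1 -> (2 -> (3, 4), 5)
tree = Node(1, [Node(2, [Node(3), Node(4)]), Node(5)])
print(max_non_adjacent_sum(tree))
```

Because each node is visited once and returns both states together, no table is needed here. For very deep trees the recursion would have to be replaced by an explicit stack.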
Steps to build Huffman Tree
Input is array of unique characters along with their frequency of occurrences and output is Huffman Tree.
1. Create a leaf node for each unique character and build a min heap of all leaf nodes. (The min heap is used as a priority queue. The value of the frequency field is used to compare two nodes in the min heap. Initially, the least frequent character is at the root.)
2. Extract two nodes with the minimum frequency from the min heap.
3. Create a new internal node with frequency equal to the sum of the two nodes' frequencies. Make the first extracted node its left child and the other extracted node its right child. Add this node to the min heap.
4. Repeat steps #2 and #3 until the heap contains only one node. The remaining node is the root node and the tree is complete.
In a heap, a node can have at most 2 children, right?
So if we would like to generalize the Huffman algorithm for coded words in ternary system (i.e. coded words using the symbols 0 , 1 and 2 ) what could we do? Do we have to create a tree all the nodes of which have 3 children?
EDIT:
I think that it would be as follows.
Steps to build Huffman Tree
Input is array of unique characters along with their frequency of occurrences and output is Huffman Tree.
1. Create a leaf node for each unique character and build a min heap of all leaf nodes.
2. Extract three nodes with the minimum frequency from the min heap.
3. Create a new internal node with frequency equal to the sum of the three nodes' frequencies. Make the first extracted node its left child, the second extracted node its middle child and the third extracted node its right child. Add this node to the min heap.
4. Repeat steps #2 and #3 until the heap contains only one node. The remaining node is the root node and the tree is complete.
How can we prove that the algorithm yields optimal ternary codes?
EDIT 2: Suppose that we have the frequencies 5,9,12,13,16,45.
Their number is even, so we add a dummy node with frequency 0. Do we put this at the end of the array? So, will it be as follows?
Then will we have the following heap?
Then:
Then:
Or have I understood it wrong?
Yes! You have to create all nodes with 3 children. Why only 3? You can also have n-ary Huffman coding using nodes with n children. The tree will look something like this (for n=3):
      *
    / | \
   *  *  *
  /|\
 * * *
Huffman Algorithm for Ternary Codewords
I am giving the algorithms for easy reference.
HUFFMAN_TERNARY(C)
{
    IF |C|=EVEN
        THEN ADD DUMMY CHARACTER Z WITH FREQUENCY 0.
    N=|C|
    Q=C; //WE ARE BASICALLY HEAPIFYING THE CHARACTERS
    FOR I=1 TO floor(N/2)
    {
        ALLOCATE NEW_NODE;
        LEFT[NEW_NODE]= U= EXTRACT_MIN(Q)
        MID[NEW_NODE] = V= EXTRACT_MIN(Q)
        RIGHT[NEW_NODE]=W= EXTRACT_MIN(Q)
        F[NEW_NODE]=F[U]+F[V]+F[W];
        INSERT(Q,NEW_NODE);
    }
    RETURN EXTRACT_MIN(Q);
} //END-OF-ALGO
Why are we adding an extra node? To make the number of nodes odd. (Why?) Because we want to leave the for loop with just one node in Q.
Why floor(N/2)?
At first we take 3 nodes and replace them with 1 node, leaving N-2 nodes.
After that we always take 3 nodes and replace them with 1 (thanks to the dummy node, the queue can never be left holding exactly 2 nodes). Each iteration reduces the count by 2 nodes, which is why we use floor(N/2).
Check it yourself on paper using some sample character set. You will understand.
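To make the pseudocode concrete, here is a rough Python sketch of the same procedure using the standard heapq module. The function name ternary_huffman and the representation of nodes as tuples are choices made for this illustration only; symbols are assumed to be strings, and the dummy character is represented by None.

```python
import heapq
from itertools import count


def ternary_huffman(freqs):
    """Map each symbol in freqs (symbol -> frequency) to a codeword over 0, 1, 2."""
    tie = count()  # unique tie-breaker so heap entries never compare payloads
    heap = [(f, next(tie), sym) for sym, f in freqs.items()]
    if len(heap) % 2 == 0:
        heap.append((0, next(tie), None))  # dummy character with frequency 0
    heapq.heapify(heap)

    # Each round removes three nodes and adds one, so the count drops by 2.
    while len(heap) > 1:
        u = heapq.heappop(heap)
        v = heapq.heappop(heap)
        w = heapq.heappop(heap)
        heapq.heappush(heap, (u[0] + v[0] + w[0], next(tie), (u, v, w)))

    codes = {}

    def walk(entry, prefix):
        payload = entry[2]
        if isinstance(payload, tuple):        # internal node: three children
            for digit, child in zip("012", payload):
                walk(child, prefix + digit)
        elif payload is not None:             # real leaf (skip the dummy)
            codes[payload] = prefix or "0"

    walk(heap[0], "")
    return codes


freqs = {"a": 5, "b": 9, "c": 12, "d": 13, "e": 16, "f": 45}
codes = ternary_huffman(freqs)
cost = sum(freqs[s] * len(codes[s]) for s in freqs)
print(codes, cost)
```

The weighted cost equals the sum of the internal node frequencies, which gives a quick way to check a hand-built tree against the program.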
CORRECTNESS
I am taking reference here from "Introduction to Algorithms" by Cormen, Leiserson, Rivest and Stein.
Proof: The step by step mathematical proof is too long to post here but it is quite similar to the proof given in the book.
Idea
Any optimal tree has the three lowest frequencies at the lowest level. (We have to prove this, using contradiction.) Suppose that is not the case; then we could switch a leaf with a higher frequency at the lowest level with one of the three lowest-frequency leaves and obtain a lower average length. Without loss of generality, we can assume that the three lowest frequencies are all children of the same node (if they are at the same level, the average length does not change irrespective of where the frequencies are). Their codewords then differ only in the last digit (which is 0, 1 or 2).
As with binary codewords, we contract the three nodes and make a new character out of them whose frequency is the total of the three characters' frequencies. As with binary Huffman codes, the cost of the optimal tree is the sum of the cost of the tree with the three symbols contracted and the cost of the eliminated subtree that held the nodes before contraction. Since it has been shown that this subtree has to be present in the final optimal tree, we can optimize on the tree with the newly created contracted node.
Example
Suppose the character set contains frequencies 5,9,12,13,16,45.
Now N=6-> even. So add dummy character with freq=0
N=7 now and freq in C are 0,5,9,12,13,16,45
Now using min priority queue get 3 values. 0 then 5 then 9.
Add them and insert a new character with freq=0+5+9=14 into the priority queue. Continue this way.
The tree will be like this
              100
            /  |  \
           /   |   \
          /    |    \
        39     16    45     step-3
      /  |  \
    14   12   13            step-2
  /  |  \
 0   5   9                  step-1
Finally Prove it
I will now go to straight forward mimic of the proof of Cormen.
Lemma 1. Let C be an alphabet in which each character c belonging to C has frequency c.freq. Let
x, y and z be three characters in C having the lowest frequencies. Then there exists
an optimal prefix code for C in which the codewords for x, y and z have the same
length and differ only in the last digit.
Proof:
Idea
First consider any tree T generating arbitrary optimal prefix code.
Then we will modify it to make a tree representing another optimal prefix code such that the characters x, y, z appear as sibling leaves at the maximum depth.
If we can construct such a tree, then the codewords for x, y and z will have the same length and differ only in the last digit.
Proof--
Let a,b,c be three characters that are sibling leaves of maximum depth in T .
Without loss of generality, we assume that a.freq <= b.freq <= c.freq and x.freq <= y.freq <= z.freq.
Since x.freq, y.freq and z.freq are the 3 lowest leaf frequencies, in order (meaning there are no frequencies between them), and a.freq,
b.freq and c.freq are three arbitrary frequencies, in order, we have x.freq <= a.freq,
y.freq <= b.freq and z.freq <= c.freq.
In the remainder of the proof we can have x.freq=a.freq or y.freq=b.freq or z.freq=c.freq.
But if x.freq=b.freq or x.freq=c.freq
or y.freq=c.freq
then all of them are same. WHY??
Let's see. Suppose x!=y, y!=z, x!=z but z=c, with x<y<z in order and a<b<c.
Also x!=a. --> x<a
y!=b. --> y<b
z!=c. --> z<c but z=c is given. This contradicts our assumption. (Thus proves).
The lemma would be trivially true. Thus we will assume
x!=b and x!=c.
T1
* |
/ | \ |
* * x +---d(x)
/ | \ |
y * z +---d(y) or d(z)
/|\ |
a b c +---d(a) or d(b) or d(c) actually d(a)=d(b)=d(c)
T2
*
/ | \
* * a
/ | \
y * z
/|\
x b c
T3
*
/ | \
* * a
/ | \
b * z
/|\
x y c
T4
*
/ | \
* * a
/ | \
b * c
/|\
x y z
In case of T1 costt1= x.freq*d(x)+ cost_of_other_nodes + y.freq*d(y) + z.freq*d(z) + d(a)*a.freq + b.freq*d(b) + c.freq*d(c)
In case of T2 costt2= x.freq*d(a)+ cost_of_other_nodes + y.freq*d(y) + z.freq*d(z) + d(x)*a.freq + b.freq*d(b) + c.freq*d(c)
costt1-costt2= x.freq*[d(x)-d(a)]+0 + 0 + 0 + a.freq[d(a)-d(x)]+0 + 0
= (a.freq-x.freq)*(d(a)-d(x))
>= 0
So costt1>=costt2. --->(1)
Similarly we can show costt2 >= costt3--->(2)
And costt3 >= costt4--->(3)
From (1),(2) and (3) we get
costt1>=costt4.-->(4)
But T1 is optimal.
So costt1<=costt4 -->(5)
From (4) and (5) we get costt1=costt4.
SO, T4 is an optimal tree in which x,y,and z appears as sibling leaves at maximum depth, from which the lemma follows.
Lemma-2
Let C be a given alphabet with frequency c.freq defined for each character c belonging to C.
Let x, y, z be three characters in C with minimum frequency. Let C1 be the
alphabet C with the characters x, y and z removed and a new character z1 added,
so that C1 = C - {x,y,z} union {z1}. Define f for C1 as for C, except that
z1.freq=x.freq+y.freq+z.freq. Let T1 be any tree representing an optimal prefix code
for the alphabet C1. Then the tree T, obtained from T1 by replacing the leaf node
for z1 with an internal node having x, y and z as children, represents an optimal prefix
code for the alphabet C.
Proof.:
Look we are making a transition from T1-> T.
So we must find a way to express the T i.e, costt in terms of costt1.
* *
/ | \ / | \
* * * * * *
/ | \ / | \
* * * ----> * z1 *
/|\
x y z
For c belonging to (C-{x,y,z}), dT(c)=dT1(c). [depth corresponding to T and T1 tree]
Hence c.freq*dT(c)=c.freq*dT1(c).
Since dT(x)=dT(y)=dT(z)=dT1(z1)+1
So we have `x.freq*dT(x)+y.freq*dT(y)+z.freq*dT(z)=(x.freq+y.freq+z.freq)(dT1(z1)+1)`
= `z1.freq*dT1(z1)+x.freq+y.freq+z.freq`
Adding both side the cost of other nodes which is same in both T and T1.
x.freq*dT(x)+y.freq*dT(y)+z.freq*dT(z)+cost_of_other_nodes= z1.freq*dT1(z1)+x.freq+y.freq+z.freq+cost_of_other_nodes
So costt=costt1+x.freq+y.freq+z.freq
or equivalently
costt1=costt-x.freq-y.freq-z.freq ---->(1)
We now prove the lemma by contradiction. Suppose that T does not represent
an optimal prefix code for C. Then there exists an optimal tree T2 such that
costt2 < costt. Without loss of generality (by Lemma 1), T2 has x and y and z as
siblings.
Let T3 be the tree T2 with the common parent of x and y and z replaced by a
leaf z1 with frequency z1.freq=x.freq+y.freq+z.freq Then
costt3 = costt2-x.freq-y.freq-z.freq
< costt-x.freq-y.freq-z.freq
= costt1 (From 1)
yielding a contradiction to the assumption that T1 represents an optimal prefix code
for C1. Thus, T must represent an optimal prefix code for the alphabet C.
-Proved.
Procedure HUFFMAN produces an optimal prefix code.
Proof: Immediate from Lemmas 1 and 2.
NOTE: Terminology is from Introduction to Algorithms, 3rd edition, by Cormen, Leiserson, Rivest and Stein.
Given a fixed number of keys or values (stored either in an array or in some other data structure) and the order of a B-tree, can we determine the sequence of key insertions that would generate a space-efficient B-tree?
To illustrate, consider b-tree of order 3. Let the keys be {1,2,3,4,5,6,7}. Inserting elements into tree in the following order
for(int i=1; i<8; ++i)
{
    tree.push(i);
}
would create a tree like this
4
2 6
1 3 5 7
see http://en.wikipedia.org/wiki/B-tree
But inserting elements in this way
flag = true;
for(int i=1,j=7; i<8; ++i,--j)
{
    if(flag)
    {
        tree.push(i);
        flag = false;
    }
    else
    {
        tree.push(j);
        flag = true;
    }
}
creates a tree like this
3 5
1 2 4 6 7
where we can see there is a decrease in the number of levels.
So is there a particular way to determine sequence of insertion which would reduce space consumption?
The following trick should work for most ordered search trees, assuming the data to insert are the integers 1..n.
Consider the binary representation of your integer keys - for 1..7 (with dots for zeros) that's...
Bit : 210
1 : ..1
2 : .1.
3 : .11
4 : 1..
5 : 1.1
6 : 11.
7 : 111
Bit 2 changes least often, Bit 0 changes most often. That's the opposite of what we want, so what if we reverse the order of those bits, then sort our keys in order of this bit-reversed value...
Bit : 210 Rev
4 : 1.. -> ..1 : 1
------------------
2 : .1. -> .1. : 2
6 : 11. -> .11 : 3
------------------
1 : ..1 -> 1.. : 4
5 : 1.1 -> 1.1 : 5
3 : .11 -> 11. : 6
7 : 111 -> 111 : 7
It's easiest to explain this in terms of an unbalanced binary search tree, growing by adding leaves. The first item is dead centre - it's exactly the item we want for the root. Then we add the keys for the next layer down. Finally, we add the leaf layer. At every step, the tree is as balanced as it can be, so even if you happen to be building an AVL or red-black balanced tree, the rebalancing logic should never be invoked.
[EDIT I just realised you don't need to sort the data based on those bit-reversed values in order to access the keys in that order. The trick is to notice that bit-reversing is its own inverse. As well as mapping keys to positions, it maps positions to keys. So if you loop through from 1..n, you can use this bit-reversed value to decide which item to insert next: for the first insert use the 4th item, for the second insert use the second item, and so on. One complication: you have to round n upwards to one less than a power of two (7 is OK, but use 15 instead of 8) and you have to bounds-check the bit-reversed values. The reason is that bit-reversing can move some in-bounds positions out-of-bounds and vice versa.]
Actually, for a red-black tree some rebalancing logic will be invoked, but it should just be re-colouring nodes - not rearranging them. However, I haven't double checked, so don't rely on this claim.
For a B tree, the height of the tree grows by adding a new root. Proving this works is, therefore, a little awkward (and it may require a more careful node-splitting than a B tree normally requires) but the basic idea is the same. Although rebalancing occurs, it occurs in a balanced way because of the order of inserts.
This can be generalised for any set of known-in-advance keys because, once the keys are sorted, you can assign suitable indexes based on that sorted order.
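As a rough illustration of the bit-reversal trick, the following Python sketch generates the insertion order for keys 1..n. The function name bit_reversed_order is made up for this example; it follows the description above by rounding n up to one less than a power of two and skipping bit-reversed values that fall out of bounds.

```python
def bit_reversed_order(n):
    """Return the keys 1..n in bit-reversed insertion order."""
    bits = n.bit_length()          # 2**bits - 1 is the rounded-up bound
    order = []
    for position in range(1, 2 ** bits):
        # Reverse the fixed-width binary representation of the position.
        reversed_value = int(format(position, f"0{bits}b")[::-1], 2)
        if 1 <= reversed_value <= n:   # bounds check after reversing
            order.append(reversed_value)
    return order


print(bit_reversed_order(7))   # matches the table above: 4, 2, 6, 1, 5, 3, 7
print(bit_reversed_order(8))
```

For a general set of sorted keys, the returned values would be used as 1-based indexes into the sorted list rather than as the keys themselves.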
WARNING - This isn't an efficient way to construct a perfectly balanced tree from known already-sorted data.
If you have your data already sorted, and know its size, you can build a perfectly balanced tree in O(n) time. Here's some pseudocode...
if size is zero, return null
from the size, decide which index should be the (subtree) root
recurse for the left subtree, giving that index as the size (assuming 0 is a valid index)
take the next item to build the (subtree) root
recurse for the right subtree, giving (size - (index + 1)) as the size
add the left and right subtree results as the child pointers
return the new (subtree) root
Basically, this decides the structure of the tree based on the size and traverses that structure, building the actual nodes along the way. It shouldn't be too hard to adapt it for B Trees.
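The pseudocode above might be sketched in Python as follows, for a plain binary search tree. The Node class and the helper names are hypothetical, and choosing size // 2 as the root index is just one reasonable way to "decide which index should be the root"; other choices give differently shaped but equally balanced trees.

```python
class Node:
    """Hypothetical binary tree node used only for this sketch."""
    def __init__(self, key, left=None, right=None):
        self.key = key
        self.left = left
        self.right = right


def build_balanced(sorted_keys):
    """Build a balanced BST from already-sorted keys in O(n) time."""
    items = iter(sorted_keys)

    def build(size):
        if size == 0:
            return None
        root_index = size // 2               # index of the subtree root
        left = build(root_index)             # consumes the first root_index items
        key = next(items)                    # next item becomes the root
        right = build(size - (root_index + 1))
        return Node(key, left, right)

    return build(len(sorted_keys))


def inorder(node):
    return [] if node is None else inorder(node.left) + [node.key] + inorder(node.right)


def height(node):
    return 0 if node is None else 1 + max(height(node.left), height(node.right))


root = build_balanced([1, 2, 3, 4, 5, 6, 7])
print(root.key, inorder(root), height(root))
```

The keys are consumed strictly in order, so the input can be any iterable of known length, not only a list.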
This is how I would add elements to b-tree.
Thanks to Steve314, for giving me the start with binary representation,
Given are n elements to add, in order. We have to add them to an m-order B-tree. Take their indexes (1...n) and convert them to radix m. The main idea of this insertion is to insert the numbers with the highest radix-m digit first and keep them above the numbers with lesser radix-m digits that are added to the tree afterwards, despite the splitting of nodes.
1,2,3.. are indexes so you actually insert the numbers they point to.
For example, order-4 tree
4 8 12 highest radix bit numbers
1,2,3 5,6,7 9,10,11 13,14,15
Now depending on order median can be:
order is even -> number of keys are odd -> median is middle (mid median)
order is odd -> number of keys are even -> left median or right median
The choice of median (left/right) to be promoted will decide the order in which I should insert elements. This has to be fixed for the b-tree.
I add elements to trees in buckets. First I add bucket elements then on completion next bucket in order. Buckets can be easily created if median is known, bucket size is order m.
I take left median for promotion. Choosing bucket for insertion.
| 4 | 8 | 12 |
1,2,|3 5,6,|7 9,10,|11 13,14,|15
3 2 1 Order to insert buckets.
For left-median choice I insert buckets to the tree starting from right side, for right median choice I insert buckets from left side. Choosing left-median we insert median first, then elements to left of it first then rest of the numbers in the bucket.
Example
Bucket median first
12,
Add elements to left
11,12,
Then after all elements inserted it looks like,
| 12 |
|11 13,14,|
Then I choose the bucket left to it. And repeat the same process.
Median
12
8,11 13,14,
Add elements to left first
12
7,8,11 13,14,
Adding rest
8 | 12
7 9,10,|11 13,14,
Similarly keep adding all the numbers,
4 | 8 | 12
3 5,6,|7 9,10,|11 13,14,
At the end add numbers left out from buckets.
| 4 | 8 | 12 |
1,2,|3 5,6,|7 9,10,|11 13,14,|15
For mid-median (even order b-trees) you simply insert the median and then all the numbers in the bucket.
For right-median I add buckets from the left. For elements within the bucket I first insert median then right elements and then left elements.
Here we are adding the highest m-radix numbers, and in the process I added numbers with immediate lesser m-radix bit, making sure the highest m-radix numbers stay at top. Here I have only two levels, for more levels I repeat the same process in descending order of radix bits.
Last case is when remaining elements are of same radix-bit and there is no numbers with lesser radix-bit, then simply insert them and finish the procedure.
I would give an example for 3 levels, but it is too long to show. So please try with other parameters and tell if it works.
Unfortunately, all trees exhibit their worst case scenario running times, and require rigid balancing techniques when data is entered in increasing order like that. Binary trees quickly turn into linked lists, etc.
For typical B-Tree use cases (databases, filesystems, etc), you can typically count on your data naturally being more distributed, producing a tree more like your second example.
Though if it is really a concern, you could hash each key, guaranteeing a wider distribution of values.
for( i=1; i<8; ++i )
tree.push(hash(i));
To build a particular B-tree using Insert() as a black box, work backward. Given a nonempty B-tree, find a node with more than the minimum number of children that's as close to the leaves as possible. The root is considered to have minimum 0, so a node with the minimum number of children always exists. Delete a value from this node to be prepended to the list of Insert() calls. Work toward the leaves, merging subtrees.
For example, given the 2-3 tree
8
4 c
2 6 a e
1 3 5 7 9 b d f,
we choose 8 and do merges to obtain the predecessor
4 c
2 6 a e
1 3 5 79 b d f.
Then we choose 9.
4 c
2 6 a e
1 3 5 7 b d f
Then a.
4 c
2 6 e
1 3 5 7b d f
Then b.
4 c
2 6 e
1 3 5 7 d f
Then c.
4
2 6 e
1 3 5 7d f
Et cetera.
So is there a particular way to determine sequence of insertion which would reduce space consumption?
Edit note: since the question was quite interesting, I try to improve my answer with a bit of Haskell.
Let k be the Knuth order of the B-Tree and list a list of keys
The minimization of space consumption has a trivial solution:
-- won't use point free notation to ease haskell newbies
trivial k list = concat $ reverse $ chunksOf (k-1) $ sort list
Such algorithm will efficiently produce a time-inefficient B-Tree, unbalanced on the left but with minimal space consumption.
A lot of non trivial solutions exist that are less efficient to produce but show better lookup performance (lower height/depth). As you know, it's all about trade-offs!
A simple algorithm that minimizes both the B-Tree depth and the space consumption (but it doesn't minimize lookup performance!), is the following
-- Sort the list in increasing order and call sortByBTreeSpaceConsumption
-- with the result
smart k list = sortByBTreeSpaceConsumption k $ sort list
-- Sort list so that inserting in a B-Tree with Knuth order = k
-- will produce a B-Tree with minimal space consumption minimal depth
-- (but not best performance)
sortByBTreeSpaceConsumption :: Ord a => Int -> [a] -> [a]
sortByBTreeSpaceConsumption _ [] = []
sortByBTreeSpaceConsumption k list
    | k - 1 >= numOfItems = list  -- this will be a leaf
    | otherwise = heads ++ tails ++ sortByBTreeSpaceConsumption k remainder
    where requiredLayers = minNumberOfLayersToArrange k list
          numOfItems = length list
          capacityOfInnerLayers = capacityOfBTree k $ requiredLayers - 1
          blockSize = capacityOfInnerLayers + 1
          blocks = chunksOf blockSize balanced
          heads = map last blocks
          tails = concat $ map (sortByBTreeSpaceConsumption k . init) blocks
          balanced = take (numOfItems - (mod numOfItems blockSize)) list
          remainder = drop (numOfItems - (mod numOfItems blockSize)) list
-- Capacity of a layer n in a B-Tree with Knuth order = k
layerCapacity k 0 = k - 1
layerCapacity k n = k * layerCapacity k (n - 1)
-- Infinite list of capacities of layers in a B-Tree with Knuth order = k
capacitiesOfLayers k = map (layerCapacity k) [0..]
-- Capacity of a B-Tree with Knuth order = k and l layers
capacityOfBTree k l = sum $ take l $ capacitiesOfLayers k
-- Infinite list of capacities of B-Trees with Knuth order = k
-- as the number of layers increases
capacitiesOfBTree k = map (capacityOfBTree k) [1..]
-- compute the minimum number of layers in a B-Tree of Knuth order k
-- required to store the items in list
minNumberOfLayersToArrange k list = 1 + f k
    where numOfItems = length list
          f = length . takeWhile (< numOfItems) . capacitiesOfBTree
With this smart function given a list = [21, 18, 16, 9, 12, 7, 6, 5, 1, 2] and a B-Tree with knuth order = 3 we should obtain [18, 5, 9, 1, 2, 6, 7, 12, 16, 21] with a resulting B-Tree like
[18, 21]
/
[5 , 9]
/ | \
[1,2] [6,7] [12, 16]
Obviously this is suboptimal from a performance point of view, but should be acceptable, since obtaining a better one (like the following) would be far more expensive (computationally and economically):
[7 , 16]
/ | \
[5,6] [9,12] [18, 21]
/
[1,2]
If you want to run it, put the previous code in a Main.hs file and compile it with ghc after prepending
import Data.List (sort)
import Data.List.Split
import System.Environment (getArgs)
main = do
    args <- getArgs
    let knuthOrder = read $ head args
    let keys = (map read $ tail args) :: [Int]
    putStr "smart: "
    putStrLn $ show $ smart knuthOrder keys
    putStr "trivial: "
    putStrLn $ show $ trivial knuthOrder keys
Input: string S = AAGATATGATAGGAT.
Output: Maximal repeats such as GATA (as in positions 3 and 8), GAT (as in position 3, 8 and 13) and so on...
A maximal repeat is a substring t that occurs k>1 times in S such that, if t is extended to the left or right, it occurs fewer than k times.
An internal node’s leaf descendants are suffixes, each of which has a left character.
If the left characters of the leaf descendants are not all identical, the node is called a “left-diverse” node.
Maximal repeats correspond to left-diverse internal nodes.
Overall idea:
Build a suffix tree and then do a DFS (depth first search) on the tree
For each leaf, label it with its left character
For each internal node:
If at least one child is labelled with *, then label it with *
Else if its children’s labels are diverse, label with *.
Else then all children have same label, copy it to current node
Is the above idea correct? What would the pseudocode look like? Then I can try to write the program myself.
Your idea is good, but with a suffix tree you can do something even easier.
Let T be the suffix tree of your sequence .
Let x be a node in T, T_x is the subtree of T with root x.
Let N_x be the number of leaf in T_x
Let word(x) be the word created by traversing T from root to node x
Now using the definition of a suffix tree we get :
Number of repeats of word(x) = N_x and the position of this words are the label of each leaf
The algorithm for this is a basic tree traversal: for each internal node in the tree, calculate N_x; if N_x >= 2, add the node to your result (if you want the positions too, you can add the label of each leaf).
Pseudo code :
input :
mySequence
output:
Result (list of word that repeat with count and position)
Algorithm :
T = suffixTree(mySequence)
For each internal node X in T:
    T_X = subTree(T, X)
    N_X = Number of leaves (T_X)
    if N_X >= 2:
        Result.add( [word(X), N_X, list(label of leaves)] )
return Result
Example :
let's take the wikipedia example for suffix trees : "BANANA" :
we get :
N_A = 3 so "A" repeats 3 times in position {1,3,5}
N_N=2 so "N" repeats 2 times in position {2,4}
N_NA=2 so "NA" repeats 2 times in position {2,4}
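Counting leaves reports every repeated substring, not only the maximal ones. As a way to cross-check a suffix-tree implementation on small inputs, here is a brute-force Python sketch that applies the definition of a maximal repeat from the question directly. It is not a suffix-tree algorithm and needs far more time and memory, so it is only suitable for short strings; the function name maximal_repeats is invented for this example, and positions are 0-based.

```python
from collections import defaultdict


def maximal_repeats(s):
    """Return {substring: [0-based start positions]} for all maximal repeats of s."""
    n = len(s)
    occurrences = defaultdict(list)
    for i in range(n):
        for j in range(i + 1, n + 1):
            occurrences[s[i:j]].append(i)

    result = {}
    for t, starts in occurrences.items():
        if len(starts) < 2:
            continue
        # Character just before / just after each occurrence; None marks a string boundary.
        lefts = {s[i - 1] if i > 0 else None for i in starts}
        rights = {s[i + len(t)] if i + len(t) < n else None for i in starts}
        # If every occurrence shares the same neighbour, extending t keeps the
        # count unchanged, so t is not maximal in that direction.
        if len(lefts) > 1 and len(rights) > 1:
            result[t] = starts
    return result


repeats = maximal_repeats("AAGATATGATAGGAT")
print(repeats["GATA"], repeats["GAT"])
```

The left-character test here is the same "left-diverse" condition described in the question, applied to explicit occurrence lists instead of suffix-tree leaves.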
I found this paper that seems to treat your problem the same way you're doing, so yes, I think your method is right:
Spelling approximate repeated or common motifs using a suffix tree
Extract
We present in this paper two algorithms. The first one extracts
repeated motifs from a sequence defined over an alphabet Sigma. For
instance, Sigma may be equal to {A, C, G, T} and the sequence
represent an encoding of a DNA macromolecule. The motifs searched
correspond to words over the same alphabet which occur a minimum
number q of times in the sequence with at most e mismatches each time
(q is called the quorum constraint).[...]
You can download it and have a look at it , the author gives pseudo code for your algorithm.
Hope this helps
I have multiple binary trees stored as an array. In each slot is either nil (or null; pick your language) or a fixed tuple storing two numbers: the indices of the two "children". No node will have only one child -- it's either none or two.
Think of each slot as a binary node that only stores pointers to its children, and no inherent value.
Take this system of binary trees:
0 1
/ \ / \
2 3 4 5
/ \ / \
6 7 8 9
/ \
10 11
The associated array would be:
0 1 2 3 4 5 6 7 8 9 10 11
[ [2,3] , [4,5] , [6,7] , nil , nil , [8,9] , nil , [10,11] , nil , nil , nil , nil ]
I've already written simple functions to find direct parents of nodes (simply by searching from the front until there is a node that contains the child)
Furthermore, let us say that at relevant times, all trees are anywhere between a few and a few thousand levels deep.
I'd like to find a function
P(m,n)
to find the lowest common ancestor of m and n. To put it more formally, the LCA is defined as the "lowest", or deepest, node that has both m and n as descendants (children, or children of children, etc.). If there is none, nil would be a valid return value.
Some examples, given our given tree:
P( 6,11) # => 2
P( 3,10) # => 0
P( 8, 6) # => nil
P( 2,11) # => 2
The main method I've been able to find is one that uses an Euler trace, which turns the given tree (Adding node A as the invisible parent of 0 and 1, with a "value" of -1), into:
A-0-2-6-2-7-10-7-11-7-2-0-3-0-A-1-4-1-5-8-5-9-5-1-A
And from that, simply find the node between your given m and n that has the lowest number; For example, to find P(6,11), look for a 6 and an 11 on the trace. The number between them that is the lowest is 2, and that's your answer. If A (-1) is in between them, return nil.
-- Calculating P(6,11) --
A-0-2-6-2-7-10-7-11-7-2-0-3-0-A-1-4-1-5-8-5-9-5-1-A
^ ^ ^
| | |
m lowest n
Unfortunately, I do believe that finding the Euler trace of a tree that can be several thousand levels deep is a bit machine-taxing... and because my tree is constantly being changed throughout the course of the program, every time I wanted to find the LCA I'd have to re-calculate the Euler trace and hold it in memory.
Is there a more memory-efficient way, given the framework I'm using? One that maybe iterates upwards? One way I could think of would be to "count" the generation/depth of both nodes, climb from the deeper node until it matches the depth of the shallower one, and then move both up until they meet.
But that'd involve climbing up from level, say, 3025, back to 0, twice, to count the generation, and using a terribly inefficient climbing-up algorithm in the first place, and then re-climbing back up.
Are there any other better ways?
Clarifications
In the way this system is built, every child will have a number greater than their parents.
This does not guarantee that if n is in generation X, there are no nodes in generation (X-1) that are greater than n. For example:
0
/ \
/ \
/ \
1 2 6
/ \ / \ / \
2 3 9 10 7 8
/ \ / \
4 5 11 12
is a valid tree system.
Also, an artifact of the way the trees are built are that the two immediate children of the same parent will always be consecutively numbered.
Are the nodes in order like in your example where the children have a larger id than the parent? If so, you might be able to do something similar to a merge sort to find them.. for your example, the parent tree of 6 and 11 are:
6 -> 2 -> 0
11 -> 7 -> 2 -> 0
So perhaps the algorithm would be:
left = left_start
right = right_start
while left > 0 and right > 0
    if left = right
        return left
    else if left > right
        left = parent(left)
    else
        right = parent(right)
Which would run as:
left right
---- -----
6 11 (right -> 7)
6 7 (right -> 2)
6 2 (left -> 2)
2 2 (return 2)
Is this correct?
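One possible Python rendering of this idea is sketched below. It relies on the property stated in the question that every child has a larger index than its parent. The names find_parent and lowest_common_ancestor are placeholders, and the linear-scan find_parent mirrors the simple parent search the question describes. The loop stops when a parent lookup returns None rather than when an index reaches 0, so that root 0 can still be returned as an answer.

```python
def find_parent(tree, child):
    """Linear scan for the slot that lists child; None if child is a root."""
    for index, slot in enumerate(tree):
        if slot is not None and child in slot:
            return index
    return None


def lowest_common_ancestor(tree, m, n):
    """Climb from the larger index until both meet or one runs out of parents."""
    while m is not None and n is not None:
        if m == n:
            return m
        if m > n:
            m = find_parent(tree, m)
        else:
            n = find_parent(tree, n)
    return None


# The array from the question, with None standing in for nil.
forest = [[2, 3], [4, 5], [6, 7], None, None, [8, 9],
          None, [10, 11], None, None, None, None]
print(lowest_common_ancestor(forest, 6, 11))
print(lowest_common_ancestor(forest, 8, 6))
```

No extra memory is needed beyond the two indices, although each parent lookup costs a scan of the array unless parents are cached.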
Maybe this will help: Dynamic LCA Queries on Trees.
Abstract:
Richard Cole, Ramesh Hariharan
We show how to maintain a data
structure on trees which allows for
the following operations, all in
worst-case constant time. 1. Insertion
of leaves and internal nodes. 2.
Deletion of leaves. 3. Deletion of
internal nodes with only one child. 4.
Determining the Least Common Ancestor
of any two nodes.
Conference: Symposium on Discrete
Algorithms - SODA 1999
I've solved your problem in Haskell. Assuming you know the roots of the forest, the solution takes time linear in the size of the forest and constant additional memory. You can find the full code at http://pastebin.com/ha4gqU0n.
The solution is recursive, and the main idea is that you can call a function on a subtree which returns one of four results:
The subtree contains neither m nor n.
The subtree contains m but not n.
The subtree contains n but not m.
The subtree contains both m and n, and the index of their least common ancestor is k.
A node without children may contain m, n, or neither, and you simply return the appropriate result.
If a node with index k has two children, you combine the results as follows:
join :: Int -> Result -> Result -> Result
join _ (HasBoth k) _ = HasBoth k
join _ _ (HasBoth k) = HasBoth k
join _ HasNeither r = r
join _ r HasNeither = r
join k HasLeft HasRight = HasBoth k
join k HasRight HasLeft = HasBoth k
After computing this result you have to check the index k of the node itself; if k is equal to m or n, you will "extend" the result of the join operation.
My code uses algebraic data types, but I've been careful to assume you need only the following operations:
Get the index of a node
Find out if a node is empty, and if not, find its two children
Since your question is language-agnostic I hope you'll be able to adapt my solution.
There are various performance tweaks you could put in. For example, if you find a root that has exactly one of the two nodes m and n, you can quit right away, because you know there's no common ancestor. Also, if you look at one subtree and it has the common ancestor, you can ignore the other subtree (that one I get for free using lazy evaluation).
Your question was primarily about how to save memory. If a linear-time solution is too slow, you'll probably need an auxiliary data structure. Space-for-time tradeoffs are the bane of our existence.
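For comparison with the Haskell, a rough Python sketch of the same top-down search follows. It is not a translation of the linked code: the function names are invented, the four results are encoded as a (has_m, has_n, lca) triple, and the roots of the forest are assumed to be known. Python's default recursion limit would need raising, or the recursion replacing with an explicit stack, for trees thousands of levels deep.

```python
def search(tree, k, m, n):
    """Return (has_m, has_n, lca) for the subtree rooted at index k."""
    has_m, has_n = (k == m), (k == n)
    if tree[k] is not None:
        for child in tree[k]:
            child_m, child_n, lca = search(tree, child, m, n)
            if lca is not None:
                return True, True, lca      # answer found deeper down
            has_m = has_m or child_m
            has_n = has_n or child_n
    # k is the lowest node seeing both targets if none of its subtrees did alone.
    return has_m, has_n, (k if has_m and has_n else None)


def lca_in_forest(tree, roots, m, n):
    for root in roots:
        has_m, has_n, lca = search(tree, root, m, n)
        if lca is not None:
            return lca
        if has_m or has_n:
            return None   # only one target in this tree: no common ancestor
    return None


forest = [[2, 3], [4, 5], [6, 7], None, None, [8, 9],
          None, [10, 11], None, None, None, None]
print(lca_in_forest(forest, [0, 1], 3, 10))
```

The early return when a tree contains exactly one of the two nodes corresponds to the first performance tweak mentioned above.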
I think that you can simply loop backwards through the array, always replacing the higher of the two indices by its parent, until they are either equal or no further parent is found:
(defun lowest-common-ancestor (array node-index-1 node-index-2)
  (cond ((or (null node-index-1)
             (null node-index-2))
         nil)
        ((= node-index-1 node-index-2)
         node-index-1)
        ((< node-index-1 node-index-2)
         (lowest-common-ancestor array
                                 node-index-1
                                 (find-parent array node-index-2)))
        (t
         (lowest-common-ancestor array
                                 (find-parent array node-index-1)
                                 node-index-2))))