Chi-Square Test algorithm

I'm randomly selecting nodes from a Binary Tree and I have to build a black-box test that proves all nodes have nearly the same probability of being selected.
I'm adapting the Chi-Square test algorithm based on this article http://en.wikibooks.org/wiki/Algorithm_Implementation/Pseudorandom_Numbers/Chi-Square_Test
but I'm a bit confused about what 'r' should be.
Just as a side question, do you think this is an appropriate algorithm to prove randomness in a set of results?
Thank you,
Diogo

The answer is right there in the citation. Scroll down to the javadocs:
* @param r upper bound for the random range
It sounds to me like r ought to be the number of nodes in the tree. Do you agree?
"I'm randomly selecting nodes from a Binary Tree and I have to build a black-box test that proves all nodes have nearly the same probability of being selected."
I don't understand this. Do you? It seems less a test of the tree data structure and more about the random algorithm you use to choose a node to be selected. What does this outcome prove?
Walk me through what you're doing, please. Here's what I'm imagining:
You have a list of nodes in your tree.
You use a random algorithm somehow to choose a node to select from your tree.
You iterate through your tree to find that node.
If there are r nodes in your tree, each one should have 1/r probability of being selected. It's a multi-dimensional, r-sided coin or die. Right?
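For what it's worth, here is a minimal sketch of the statistic itself in Python, assuming each of the r nodes is identified by an index from 0 to r-1 and that `samples` holds the indices your black-box selector returned. The names `chi_square_statistic` and `looks_uniform`, and the three-standard-deviation tolerance, are my own choices for illustration and are not taken from the cited article. Under uniform selection the statistic is approximately chi-square distributed with r-1 degrees of freedom, so its mean is r-1 and its variance is 2(r-1).

```python
import math
from collections import Counter


def chi_square_statistic(samples, r):
    """Pearson chi-square statistic for samples drawn from range(r)."""
    expected = len(samples) / r          # expected hits per node if uniform
    counts = Counter(samples)
    return sum((counts.get(i, 0) - expected) ** 2 / expected for i in range(r))


def looks_uniform(samples, r, tolerance=3.0):
    """Heuristic check (assumption): statistic within `tolerance` standard
    deviations of its mean under the uniform hypothesis."""
    dof = r - 1                          # degrees of freedom
    stat = chi_square_statistic(samples, r)
    return abs(stat - dof) <= tolerance * math.sqrt(2 * dof)
```

A check like this can only fail to reject uniformity, it cannot prove it, and it needs a sample count several times larger than r before the result means much.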
The tree could bring another element into the mix: the probabilities are changed if the chances of being selected depend on where you are in the tree and whether or not you're allowed to backtrack. If that's the case, the chances of being selected are different at each node. Starting at the root means you can get to all child nodes. Being on the first level and not being able to backtrack eliminates the root and all other first level nodes from consideration, and so on.
Which problem are you trying to solve?

Related

What is a zip tree, and how does it work?

I've heard of a new balanced BST data structure called a zip tree. What is the zip tree? How does it work?
At a high level, a zip tree is a
randomized balanced binary search tree,
that is a way of encoding a skiplist as a BST, and
that uses a pair of operations called zipping and unzipping rather than tree rotations.
The first bullet point - that zip trees are randomized, balanced BSTs - gives a feel for what a zip tree achieves at a high level. It's a type of balanced binary search tree that, like treaps and unlike red/black trees, uses randomization to balance the tree. In that sense, a zip tree isn't guaranteed to be a balanced tree, but rather has a very high probability of being balanced.
The second bullet point - that zip trees are encodings of skiplists - shows where zip trees come from and why, intuitively, they're balanced. You can think of a zip tree as a way of taking the randomized skiplist data structure, which supports all major operations in expected time O(log n), and representing it as a binary search tree. This provides the intuition for where zip trees come from and why we'd expect them to be so fast.
The third bullet point - zip trees use zipping and unzipping rather than tree rotations - accounts for the name of the zip tree and what it feels like to code one up. Zip trees differ from other types of balanced trees (say, red/black trees or AVL trees) in that nodes are moved around the tree not through rotations, but through a pair of operations that convert a larger chain of nodes into two smaller chains or vice-versa.
The rest of this answer dives deeper into where zip trees come from, how they work, and how they're structured.
Review: Skip Lists
To understand where zip trees come from, let's begin with a review of another data structure, the skiplist. A skiplist is a data structure that, like a binary search tree, stores a collection of elements in sorted order. Skiplists, however, aren't tree structures. Rather, a skiplist works by storing elements in sorted order through several layers of linked lists. A sample skiplist is shown here:
As you can see, the elements are represented in sorted order. Each element has an associated height, and is part of a number of linked lists equal to its height. All of the elements of the skiplist participate in the bottom layer. Ideally, roughly half of the nodes will be in the layer above that, roughly a quarter of the nodes will be in the layer above that, roughly an eighth of the nodes will be in the layer above that, etc. (More on how this works later on.)
To do a lookup in a skiplist, we begin in the topmost layer. We walk forward in the skiplist until either (1) we find the element we're looking for, (2) we find an element bigger than the one we're looking for, or (3) we hit the end of the list. In the first case, we uncork the champagne and celebrate because we discovered the item we were searching for and there's nothing more to do. In the second or third case, we've "overshot" the element that we're looking for. But that's nothing to worry about - in fact, that's helpful because it means that what we're looking for must be between the node we hit that "overshot" and the node that comes before it. So we'll go to the previous node, drop down one layer, and pick up our search from there.
For example, here's how we'd do a search for 47:
Here, the blue edges indicate the links followed where we moved forward, and the red edges indicate where we overshot and decided to descend down a layer.
A powerful intuition for how skiplists work - which we'll need later on as we transition to zip trees - is that the topmost layer of the skiplist partitions the remaining elements of the skiplist into different ranges. You can see this here:
Intuitively, a skiplist search will be "fast" if we're able to skip looking at most of the elements. Imagine, for example, that the second-to-last layer of the skiplist only stores every other element of the skiplist. In that case, traversing the second-to-last layer is twice as fast as traversing the bottom layer, so we'd expect a lookup starting in the second-to-last layer to take half as much time as a lookup starting in the bottom layer. Similarly, imagine that the layer above that one only stores every other element from the layer below it. Then searching in that layer will take roughly half as much time as searching the layer below it. More generally, if each layer only stores roughly half the elements of the layer below it, then we could skip past huge amounts of the elements in the skiplist during a search, giving us good performance.
The skiplist accomplishes this by using the following rule: whenever we insert an element into the skiplist, we flip a coin until we get heads. We then set the height of the newly-inserted node to be the number of coins that we ended up tossing. This means it has a 50% chance to stay in its current layer and a 50% chance to move to the layer above it, which means, in aggregate, that roughly half the nodes will only be in the bottom layer, roughly half of what's left will be one layer above that, roughly half of what's left will be one layer above that, etc.
(For those of you with a math background, you could also say that the height of each node in the skiplist is a Geom(1/2) random variable.)
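As a small illustration of the coin-flipping rule, here is a sketch in Python. The function name `random_height` and the `rng` parameter are my own; I'm assuming a call to `rng.random()` below 0.5 counts as tails.

```python
import random


def random_height(rng=random):
    """Flip a fair coin until heads; return the total number of flips."""
    height = 1                      # the flip that finally lands heads
    while rng.random() < 0.5:       # tails: flip again, one layer higher
        height += 1
    return height
```

Averaged over many calls the result should come out close to 2, which is the mean of a Geom(1/2) variable.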
Here's an example of inserting 42 into the skiplist shown above, using a height of 1:
Deletion from a skiplist is also a fairly simple operation: we simply splice it out of whatever linked lists it happens to be in. That means that if we were to delete the 42 we just inserted from the above list, we'd end up with the same skiplist that we started with.
It can be shown that the expected cost of an insertion, deletion, or lookup in a skiplist is O(log n), based on the fact that the number of items in each list is roughly half the number of items in the one below it. (That means we'd expect to see O(log n) layers, and only take a constant number of steps in each layer.)
From Skiplists to Zip Trees
Now that we've reviewed skiplists, let's talk about where the zip tree comes from.
Let's imagine that you're looking at the skiplist data structure. You really like the expected O(log n) performance of each operation, and you like how conceptually simple it is. There's just one problem - you really don't like linked lists, and the idea of building something with layers upon layers of linked lists doesn't excite you. On the other hand, you really love binary search trees. They've got a really simple structure - each node has just two pointers leaving it, and there's a simple rule about where everything gets placed. This question then naturally arises: could you get all the benefits of a skiplist, except in BST form?
It turns out that there's a really nice way to do this. Let's imagine that you have the skiplist shown here:
Now, imagine you perform a lookup in this skiplist. How would that search work? Well, you'd always begin by scanning across the top layer of the skiplist, moving forward until you found a key that was bigger than the one you were looking for, or until you hit the end of the list and found that there were no more nodes at the top level. From there, you'd then "descend" one level into a sub-skiplist containing only the keys between the last node you visited and the one that overshot.
It's possible to model this exact same search as a BST traversal. Specifically, here's how we might represent the top layer of that skiplist as a BST:
Notice that all these nodes chain to the right, with the idea being that "scanning forward in the skiplist" corresponds to "visiting larger and larger keys." In a BST, moving from one node to a larger node corresponds to moving right, hence the chain of nodes to the right.
Now, each node in a BST can have up to two children, and in the picture shown above each node has either zero children or one child. If we fill in the missing children by marking what ranges they correspond to, we get this.
And hey, wait a minute! It sure looks like the BST is partitioning the space of keys the same way that the skiplist is. That's promising, since it suggests that we're on to something here. Plus, it gives us a way to fill in the rest of the tree: we can recursively convert the subranges of the skiplist into their own BSTs and glue the whole thing together. If we do that, we get this tree encoding the skiplist:
We now have a way of representing a skiplist as a binary search tree. Very cool!
Now, could we go the other way around? That is, could we go from a BST to a skiplist? In general, there's no one unique way to do this. After all, when we converted the skiplist to a BST, we did lose some information. Specifically, each node in the skiplist has an associated height, and while each node in our BST has a height as well it's not closely connected to the skiplist node heights. To address this, let's tag each BST node with the height of the skiplist node that it came from. This is shown here:
Now, some nice patterns emerge. For starters, notice that each node's associated number is bigger than its left child's number. That makes sense, since each step to the left corresponds to descending into a subrange of the skiplist, where nodes will have lower heights. Similarly, each node's associated number is greater than or equal to the number of its right child. And that again makes sense - moving to the right either means
continuing forward at the same level that we were already on, in which case the height remains the same, or
hitting the end of a range and descending into a subrange, in which case the height decreases.
Can we say more about the shape of the tree? Sure we can! For example, in a skiplist, each node's height is picked by flipping coins until we get heads, then counting how many total coins we flipped. (Or, as before, it's geometrically distributed with probability 1/2). So if we were to imagine building a BST that corresponded to a skiplist, we'd want the numbers assigned to the nodes to work out the same way.
Putting these three rules together, we get the following, which defines the shape of our tree, the zip tree!
A zip tree is a binary search tree where
Each node has an associated number called its rank. Ranks are assigned randomly to each node by flipping coins until heads is flipped, then counting how many total coins were tossed.
Each node's rank is strictly greater than its left child's rank.
Each node's rank is greater than or equal to its right child's rank.
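To make those rules concrete, here is a sketch of a checker in Python. The `Node` class and the name `is_zip_tree` are hypothetical, introduced only for illustration, and I'm assuming keys are distinct. It checks the BST ordering plus the two rank rules; it can't check that ranks were chosen by coin flips.

```python
class Node:
    def __init__(self, key, rank, left=None, right=None):
        self.key = key
        self.rank = rank
        self.left = left
        self.right = right


def is_zip_tree(node, low=None, high=None):
    """True if the subtree obeys BST order and the zip tree rank rules."""
    if node is None:
        return True
    # BST ordering: every key must lie strictly between the inherited bounds.
    if low is not None and node.key <= low:
        return False
    if high is not None and node.key >= high:
        return False
    # Rank rule for the left child: strictly smaller rank.
    if node.left is not None and node.left.rank >= node.rank:
        return False
    # Rank rule for the right child: smaller or equal rank.
    if node.right is not None and node.right.rank > node.rank:
        return False
    return (is_zip_tree(node.left, low, node.key)
            and is_zip_tree(node.right, node.key, high))
```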
It's amazing how something like a skiplist can be represented as a BST by writing out such simple rules!
Inserting Elements: Unzipping
Let's suppose you have a zip tree. How would you insert a new element into it?
We could in principle answer this question by looking purely at the rules given above, but I think it's a lot easier to figure this out by remembering that zip trees are skiplists in disguise. For example, here's the above zip tree, with its associated skiplist:
Now, suppose we want to insert 18 into this zip tree. To see how this might play out, imagine that we decide to give 18 a rank of 2. Rather than looking at the zip tree, let's look at what would happen if we did the insertion into the skiplist. That would give rise to this skiplist:
If we were to take this skiplist and encode it as a zip tree, we'd get the following result:
What's interesting about this is that we can see what the tree needs to look like after the insertion, even if we don't know how to perform the insertion. We can then try to figure out what the insertion logic needs to look like by reverse-engineering it from these "before" and "after" pictures.
Let's think about what change this insertion made to our zip tree. To begin with, let's think back to our intuition for how we encode skiplists as zip trees. Specifically, chains of nodes at the same level in a skiplist with no intervening "higher" elements map to chains of nodes in the zip tree that lean to the right. Inserting an element into the skiplist corresponds to adding some new element into one of the levels, which has the effect of (1) adding in something new into some level of the skiplist, and (2) taking chains of elements in the skiplist that previously were adjacent at some level, then breaking those connections.
For example, when we inserted 18 into the skiplist shown here, we added something new into the blue chain highlighted here, and we broke all of the red chains shown here:
What is that going to translate into in our zip tree? Well, we can highlight the blue link where our item was inserted here, as well as the red links that were cut:
Let's see if we can work out what's going on here. The blue link here is, fortunately, pretty easy to find. Imagine we do a regular BST insertion to add 18 into our tree. As we're doing so, we'll pause when we reach this point:
Notice that we've hit a key with the same rank as us. That means that, if we were to keep moving to the right, we'd trace out this region of the skiplist:
To find the blue edge - the place where we go - we just need to walk down through this chain of nodes until we find one bigger than us. The blue edge - our insertion point - is then given by the edge between that node and the one above it.
We can identify this location in a different way: we've found the blue edge - our insertion point - when we've reached a point where the node to insert (1) has a bigger rank than the node to the left, (2) has a rank that's greater than or equal to the node on the right, and (3) if the node to the right has the same rank, our new item to insert is less than the item to the right. The first two rules ensure that we're inserting into the right level of the skiplist, and the last rule ensures that we insert into the right place in that level of the skiplist.
Now, where are our red edges? Intuitively, these are the edges that were "cut" because 18 has been added into the skiplist. Those would be items that previously were between the two nodes on opposite ends of the blue edge, but which now need to get partitioned into the new ranges defined by the split version of that blue edge.
Fortunately, those edges appear in really nice places. Here's where they map to:
(In this picture, I've placed the new node 18 in the middle of the blue edge that we identified in the skiplist. This causes the result not to remain a BST, but we'll fix that in a minute.)
Notice that these are the exact same edges that we'd encounter if we were to finish doing our regular BST insertion - it's the path traced out by looking for 18! And something really nice happens here. Notice that
each time we move to the right, the node, when cut, goes to the right of 18, and
each time we move to the left, the node, when cut, goes to the left of 18.
In other words, once we find the blue edge where we get inserted, we keep walking as though we were doing our insertion as usual, keeping track of the nodes where we went left and the nodes where we went right. We can then chain together all the nodes where we went left and chain together all the nodes where we went right, gluing the results together under our new node. That's shown here:
This operation is called unzipping, and it's where we get the name "zip tree" from. The name kinda makes sense - we're taking two interleaved structures (the left and right chains) and splitting them apart into two simpler linear chains.
To summarize:
Inserting x into a zip tree works as follows:
Assign a random rank to x by flipping coins and counting how many flips were needed to get heads.
Do a search for x. Stop the search once you reach a node where
the node's left child has a lower rank than x,
the node's right child has a rank less than or equal to x, and
the node's right child, if it has the same rank as x, has a larger key than x.
Perform an unzip. Specifically:
Continue the search for x as before, recording when we move left and when we move right.
Chain all the nodes together where we went left by making each the left child of the previously-visited left-moving node.
Chain all the nodes together where we went right by making each the right child of the previously-visited right-moving node.
Make those two chains the children of the node x.
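The steps above can be sketched in Python roughly as follows. This is one possible rendering under my own assumptions, not a reference implementation: the `Node` class and function names are hypothetical, keys are assumed distinct, the rank is passed in rather than drawn from coin flips so the example is reproducible, and the unzip is written recursively instead of with explicit chains.

```python
class Node:
    def __init__(self, key, rank):
        self.key = key
        self.rank = rank
        self.left = None
        self.right = None


def unzip(node, key):
    """Split a subtree into (keys smaller than key, keys larger than key)."""
    if node is None:
        return None, None
    if node.key < key:
        # Search would go right here, so this node joins the "smaller" chain.
        smaller, larger = unzip(node.right, key)
        node.right = smaller
        return node, larger
    # Search would go left here, so this node joins the "larger" chain.
    smaller, larger = unzip(node.left, key)
    node.left = larger
    return smaller, node


def insert(root, key, rank):
    """Insert key with the given rank; returns the new subtree root."""
    if root is None:
        return Node(key, rank)
    # Stop where x outranks the current node, or ties with a larger key.
    if rank > root.rank or (rank == root.rank and key < root.key):
        x = Node(key, rank)
        x.left, x.right = unzip(root, key)
        return x
    if key < root.key:
        root.left = insert(root.left, key, rank)
    else:
        root.right = insert(root.right, key, rank)
    return root
```

For example, inserting (10, rank 1), (20, rank 3), (5, rank 2) and then (15, rank 3) in that order ends with 15 at the root, 5 as its left child with 10 below it to the right, and 20 as its right child.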
You might notice that this "unzipping" procedure is equivalent to what you'd get if you performed a different operation. You could achieve the same result by inserting x as usual, then using tree rotations to pull x higher and higher in the tree until it came to rest in the right place. This is a perfectly valid alternative strategy for doing insertions, though it's a bit slower because two passes over the tree are required (a top-down pass to insert at a leaf, then a bottom-up pass to do the rotations).
Removing Elements: Zipping
Now that we've seen how to insert elements, how do we remove them?
Let's begin with a helpful observation: if we insert an item into a zip tree and then remove it, we should end up with the exact same tree that we started with. To see why this is, we can point back to a skiplist. If you add and then remove something from a skiplist, then you end up with the same skiplist that you would have had before. So that means that the zip tree needs to end up looking identical to how it started after we add and then remove an element.
To see how to do this, we'd need to perform two steps:
Undo the unzip operation, converting the two chains of nodes formed back into a linear chain of nodes.
Undo the break of the blue edge, restoring the insertion point of x.
Let's begin with how to undo an unzip operation. This, fortunately, isn't too bad. We can identify the chains of nodes that we made with the unzip operation when we inserted x into the zip tree fairly easily - we simply look at the left and right children of x, then move, respectively, purely to the left and purely to the right.
Now, we know that these nodes used to be linked together in a chain. What order do we reassemble them into? As an example, take a look at this part of a zip tree, where we want to remove 53. The chains to the left and right of 53 are highlighted:
If we look at the nodes making up the left and right chains, we can see that there's only one way to reassemble them. The topmost node of the reassembled chain must be 67, since it has rank 3 and will outrank all other items. After that, the next node must be 41, because it's the smaller of the rank-2 elements and elements with the same rank have smaller items on top. By repeating this process, we can reconstruct the chain of nodes, as shown here, simply by using the rules for how zip trees have to be structured:
This operation, which interleaves two chains together into one, is called zipping.
To summarize, here's how a deletion works:
Deleting a node x from a zip tree works as follows:
Find the node x in the tree.
Perform a zip of its left and right subtrees. Specifically:
Maintain "lhs" and "rhs" pointers, initially to the left and right subtrees.
While both those pointers aren't null:
If lhs has a rank greater than or equal to that of rhs, make lhs's right child rhs, then advance lhs to what used to be lhs's right child.
Otherwise, make rhs's left child lhs, then advance rhs to point to what used to be rhs's left child.
Rewire x's parent to point to the result of the zip operation rather than x.
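Here is a sketch of the deletion procedure in Python, written recursively rather than with explicit lhs/rhs pointers. The `Node` class and function names are hypothetical and keys are assumed distinct. Note that when the two ranks tie I put the left-hand node on top, which is what the rank rules require (a right child may share its parent's rank, a left child may not).

```python
class Node:
    def __init__(self, key, rank, left=None, right=None):
        self.key = key
        self.rank = rank
        self.left = left
        self.right = right


def zip_trees(lhs, rhs):
    """Merge two subtrees where every key in lhs is below every key in rhs."""
    if lhs is None:
        return rhs
    if rhs is None:
        return lhs
    if lhs.rank >= rhs.rank:
        # lhs stays on top; rhs is zipped into its right spine.
        lhs.right = zip_trees(lhs.right, rhs)
        return lhs
    # rhs strictly outranks lhs; lhs is zipped into its left spine.
    rhs.left = zip_trees(lhs, rhs.left)
    return rhs


def delete(root, key):
    """Remove key if present; returns the new subtree root."""
    if root is None:
        return None
    if key < root.key:
        root.left = delete(root.left, key)
    elif key > root.key:
        root.right = delete(root.right, key)
    else:
        return zip_trees(root.left, root.right)
    return root
```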
More to Explore
To recap our main points: we saw how to represent a skiplist as a BST by using the idea of ranks. That gave rise to the zip tree, which uses ranking rules to determine parent/child relationships. Those rules are maintained using the zip and unzip operations, hence the name.
Doing a full analysis of a zip tree is basically done by reasoning by analogy to a skiplist. We can show, for example, that the expected runtime of an insertion or deletion is O(log n) by pointing at the equivalent skiplist and noting that the time complexity of the equivalent operations there is O(log n). And we can similarly show that these aren't just expected time bounds, but bounds that hold with high probability.
There's a question of how to actually store the information needed to maintain a zip tree. One option would be to simply write the rank of each item down in the nodes themselves. That works, though since ranks are very unlikely to exceed O(log n) due to the nature of geometric random variables, that would waste a lot of space. Another alternative would be to use a hash function on node addresses to generate a random, uniformly-distributed integer in some range, then find the position of the least-significant 1 bit to simulate our coin tosses. That increases the costs of insertions and deletions due to the overhead of computing the hash codes, but also decreases the space usage.
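A rough Python sketch of the hashing idea follows. Python has no stable node addresses to hash, so I'm hashing the key's `repr` with SHA-256 as a stand-in, which is purely my assumption, as are the function names and the 64-bit width. The position of the lowest set bit (counting from 1) plays the role of the number of coin flips.

```python
import hashlib


def rank_from_bits(bits, width=64):
    """1-based position of the least-significant 1 bit in `bits`."""
    if bits == 0:
        return width + 1             # all "tails": cap the rank
    return (bits & -bits).bit_length()   # bits & -bits isolates the lowest 1


def rank_for_key(key, width=64):
    """Derive a repeatable rank from a hash of the key (assumed stand-in)."""
    digest = hashlib.sha256(repr(key).encode("utf-8")).digest()
    bits = int.from_bytes(digest[: width // 8], "big")
    return rank_from_bits(bits, width)
```

Because the rank can be recomputed from the key whenever it is needed, nothing extra has to be stored in the node.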
Zip trees aren't the first data structure to map skiplists and BSTs together. Dean and Jones developed an alternative presentation of this idea in 2007. There's also another way to exploit this connection. Here, we started with a randomized skiplist, and used it to derive a randomized BST. But we can run this in reverse as well - we can start with a deterministic balanced BST and use that to derive a deterministic skiplist. Munro, Papadakis, and Sedgewick found a way to do this by connecting 2-3-4 trees and skiplists.
And zip trees aren't the only randomized balanced BST. The treap was the first structure to do this, and with a little math you can show that treaps tend to have slightly lower expected heights than zip trees. The tradeoff, though, is that you need more random bits per node than in a zip tree.
Hope this helps!

What is the number of nodes at a particular level in a balanced binary search tree?

I was asked this question in a phone screen interview and I was not able to answer it. For example, in a BST, I know that the maximum number of nodes at a level is given by 2^h (assuming the root node is at height = 0).
I wanted to ask, is there a similar mathematical outcome for a balanced binary search tree as well (For AVL, Red Black trees?), i.e. the number of nodes at a particular level k.
Thanks!
A balanced binary tree starts with one node, which has two descendants. Each of those then has two descendants again. So there will be 1, 2, 4, 8 and so on nodes per level.
As a formula you can use 2^(level-1), counting the root as level 1. The last row might not be completely full, so it can have fewer elements.
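As a quick sketch of that formula in Python: this assumes a complete tree, meaning every level is full except possibly the last, which is a stronger condition than what AVL or red-black trees guarantee. The function name `nodes_at_level` is my own, and levels are counted from 1 at the root.

```python
def nodes_at_level(total_nodes, level):
    """Nodes on a given level (root = level 1) of a complete binary tree."""
    full = 2 ** (level - 1)          # capacity of this level
    above = full - 1                 # nodes in all levels above it
    return max(0, min(full, total_nodes - above))
```

With 10 nodes this gives 1, 2, 4 and 3 nodes on levels 1 through 4, and 0 on any deeper level.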
As the balancing step is costly, implementations usually do not rebalance after every mutation of the tree. They will rather apply a heuristic to find out when a rebalancing will make the most sense. So in practice, levels might have fewer nodes than if the tree were perfectly balanced, and there might be additional levels from nodes being inserted in the wrong places.

Non-recursive 0/1 Knapsack algorithm using Breadth-first Search

I came across an interesting problem called Knapsack. You have a list of items, which all have a value and a weight. Then you have to find the combination of items that maximizes the summed value of the objects and stays within a certain weight limit. I saw somewhere that this is a search problem for which you could use different search algorithms. Now I am trying to implement it with breadth-first.
The pseudo algorithm for BFS found on wikipedia is as follows:
Breadth-First-Search(Graph, root):
    create empty set S
    create empty queue Q
    root.parent = NIL
    add root to S
    Q.enqueue(root)
    while Q is not empty:
        current = Q.dequeue()
        if current is the goal:
            return current
        for each node n that is adjacent to current:
            if n is not in S:
                add n to S
                n.parent = current
                Q.enqueue(n)
I have really tried to understand how to apply this to the knapsack problem.
As much as I understand, it's about building a tree. I need to expand and explore each node of one level at a time. For BFS I need a FIFO queue. For each item selected, I have two choices: Either I take the item or not.
Anyway, to be specific: What I do not understand, in my context, with the pseudo code above are:
When I select an item, do I push it twice to the queue and mark one of them as used and one as not used?
How do I know if the current is the goal? I assume it's something like when there are no more nodes to explore, which means we are at a leaf node. But there will be many leaf nodes, so which one do I choose and how?
What is meant with adjacent to current? If I only have a list, or an array of items (items have an ID, a weight, and a value), how do I know which is adjacent?
Say you have 4 different items. Then the graph you are searching is a hypercube like this (image by Yury Chebiryak):
The binary numbers at the nodes are all of the possible knapsacks, with a 0 in the nth place meaning item n is not in the knapsack, and a 1 meaning it is, so for example 0000 means an empty knapsack, 1001 means the knapsack containing the first item and the 4th item, and so on.
At each step you remove the current node from the queue, and if it isn't the goal, construct the adjacent nodes by finding all of the knapsacks differing from the current one by 1 item that you haven't visited already. So for example, if the current node is 1001, you would construct nodes 0001, 1101, 1011, and 1000. You then add these nodes to the queue.
The goal only has a meaning if you are looking for a "good enough" solution, rather than the best solution. To establish whether the current node is the goal you simply work out the value of the current knapsack and compare it to the goal value.
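To make this concrete, here is a sketch in Python of the search just described, with each knapsack stored as a bitmask. The function name, the `(weight, value)` item format, and the goal test (fits within `capacity` and reaches `goal_value`) are my assumptions for illustration. Bit i stands for item i, so the lowest bit is the first item, which is the reverse of how the strings above are written.

```python
from collections import deque


def bfs_knapsack(items, capacity, goal_value):
    """Breadth-first search over knapsacks; items is a list of (weight, value).

    Returns the first bitmask found that fits and reaches goal_value,
    or None if no knapsack does.
    """
    n = len(items)

    def totals(mask):
        chosen = [items[i] for i in range(n) if (mask >> i) & 1]
        return sum(w for w, _ in chosen), sum(v for _, v in chosen)

    start = 0                        # the empty knapsack, 0000
    seen = {start}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        weight, value = totals(current)
        if weight <= capacity and value >= goal_value:
            return current
        for i in range(n):
            neighbour = current ^ (1 << i)   # add or remove item i
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return None
```

Note that this sketch still expands knapsacks that are over the weight limit, since removing an item from them may lead somewhere useful; it visits at most 2^n nodes.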
If you want the best solution, then breadth first search is not helping you because you need to explore every node in the graph. Dynamic programming or backtracking (which is a kind of Depth First Search) would allow you to reduce the search space.
If you want a "good enough" solution, then FIFO branch-and-bound or hill climbing (starting from a random knapsack) are effective ways of using breadth-first search.

counting leaf nodes in a tree

Assume we have a tree where every node has pre-decided set of outgoing nodes. Is it possible to come up with a fast way/optimizations to count the number of leaf nodes given a level value? Would be great if someone could suggest any ideas/links/resources to do the same.
No, you'd still have to traverse the entire tree. There's no way of predicting the precise structure - or approximating it - from only the number of child nodes of each node of the tree.
Apart from that: just keep a counter and update it on each insertion. Far simpler and wouldn't change time-complexity of any operation, except for counting leaves, which would be reduced to O(1).
This can actually get pretty tough, as it depends on the programming language, the input data structure, whether the tree is binary or general (arbitrary number of children), and the size of the tree.
The most general idea is to run a DFS or BFS, starting from the root, to get every node's level and then make a list of sets where each set contains the nodes of a single level. The set can be any structure; a standard list is fine.
Let's say you are working in C++, which is a good, if not the best, practical choice if you need performance (even better than C).
Let's say we have a general tree and the input structure is adjacency list as you mentioned.
//nodes are numbered from zero to N-1
vector<vector<int>> adjList;
Then you run either a BFS or DFS, either will do for a tree, keeping a level for each node. The level of the next node is the level of its parent plus one.
Once you discover the level, you put the node in like this.
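// one bucket per level; a tree with nodeCount nodes has at most nodeCount levels
vector<vector<int>> nodesPartitionedByLevels(nodeCount);
// run BFS here
// inside it, once a node's level is known, call
nodesPartitionedByLevels[level].push_back(node);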
That's about it.
Then when you have the levels, you iterate over all the nodes on that level and check the adjacency list to see whether it contains any nodes.
Basically you call adjList[node].empty(). If true, then that is a leaf node.
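The same idea as a compact, runnable sketch, written in Python rather than C++ just to keep it short. I'm assuming `adj_list[node]` lists only the children of `node` (no edge back to the parent) and that node 0 is the root; the function name is my own.

```python
from collections import deque


def leaves_per_level(adj_list, root=0):
    """Return a list where entry k is the number of leaves on level k."""
    counts = []
    queue = deque([(root, 0)])
    while queue:
        node, level = queue.popleft()
        if level == len(counts):     # first node seen on a new level
            counts.append(0)
        if not adj_list[node]:       # no children: this is a leaf
            counts[level] += 1
        for child in adj_list[node]:
            queue.append((child, level + 1))
    return counts
```

For the tree `[[1, 2], [3], [], []]` this returns `[0, 1, 1]`: no leaves at the root level, node 2 on level 1, and node 3 on level 2.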

Split 2-3 tree into less-than and greater-than given value X

I need to write a function that receives a key x and splits a 2-3 tree into two 2-3 trees. The first tree contains all nodes that are bigger than x, and the second all nodes that are smaller. I need to do it in O(log n). Thanks in advance for any ideas.
I thought about finding key x in the tree, then splitting its two subtrees (bigger and smaller, if they exist) into 2 trees, and then going up, each time checking the subtrees I haven't checked yet and joining them to one of the trees. My problem is that all leaves must be at the same level.
Walk from the root down to your key, splitting each node on the path in two: one part points at the nodes larger than the key, the other at the rest. Make each larger part a part of your larger tree, say by having the leftmost node one level higher point at it (don't fix the tree yet, do that at the end). Once you reach the key you will have your two trees. Then you just need to fix both trees along the path you used (note that the same path exists in both trees).
Assuming you have covered 2-3-4 trees in the lecture already, here is a hint: see whether you can apply the same insertion algorithm for 2-3 trees also. In particular, make insertions always start in the leaf, and then restructure the tree appropriately. When done, determine the complexity of the algorithm you got.
