Concatenation of iterators - algorithm

I saw this example in "Programming in Scala" chapter 24 "Collections in depth". This example shows two alternative ways to implement a tree:
by extending Traversable[Int] - here the complexity of def foreach[U](f: Int => U): Unit would be O(N).
by extending Iterable[Int] - here the complexity of def iterator: Iterator[Int] would be O(N log(N)).
This is to demonstrate why it would be helpful to have two separate traits, Traversable and Iterable.
sealed abstract class Tree
case class Branch(left: Tree, right: Tree) extends Tree
case class Node(elem: Int) extends Tree
sealed abstract class Tree extends Traversable[Int] {
  def foreach[U](f: Int => U) = this match {
    case Node(elem)   => f(elem)
    case Branch(l, r) => l foreach f; r foreach f
  }
}
sealed abstract class Tree extends Iterable[Int] {
  def iterator: Iterator[Int] = this match {
    case Node(elem)   => Iterator.single(elem)
    case Branch(l, r) => l.iterator ++ r.iterator
  }
}
Regarding the implementation of foreach they say:
traversing a balanced tree takes time proportional to the number of
elements in the tree. To see this, consider that for a balanced tree
with N leaves you will have N - 1 interior nodes of class Branch. So
the total number of steps to traverse the tree is N + N - 1.
That makes sense. :)
However, they mention that with the concatenated iterators in the iterator method, producing each element costs log(N) indirections, so the total complexity of the method would be N log(N):
Every time an element is produced by a concatenated iterator such as
l.iterator ++ r.iterator, the computation needs to follow one
indirection to get at the right iterator (either l.iterator, or
r.iterator). Overall, that makes log(N) indirections to get at a leaf
of a balanced tree with N leaves. So the cost of visiting all elements of a tree went up from about 2N for the foreach traversal method to N log(N) for the traversal with iterator.
????
Why does the computation of the concatenated iterator need to get at a leaf of the left or right iterator?

The pun on "collections in depth" is apt. The depth of the data structure matters.
When you invoke top.iterator.next(), each interior Branch delegates to the iterator of the Branch or Node below it, a call chain which is log(N).
You incur that call chain on every next().
Using foreach, you visit each Branch or Node just once.
Edit: Not sure if this helps, but here is an example of eagerly locating the leaves but lazily producing the values. It would stack-overflow or run more slowly on older versions of Scala, but the implementation of chained ++ was improved; now it's a flat chain that gets shorter as it's consumed.
sealed abstract class Tree extends Iterable[Int] {
  def iterator: Iterator[Int] = {
    def leafIterator(t: Tree): List[Iterator[Int]] = t match {
      case Node(_)             => t.iterator :: Nil
      case Branch(left, right) => leafIterator(left) ::: leafIterator(right)
    }
    this match {
      case n @ Node(_) => Iterator.fill(1)(n.value)
      case Branch(left @ Node(_), right @ Node(_)) => left.iterator ++ right.iterator
      case b @ Branch(_, _) =>
        leafIterator(b).foldLeft(Iterator[Int]())((all, it) => all ++ it)
    }
  }
}
case class Branch(left: Tree, right: Tree) extends Tree {
  override def toString = s"Branch($left, $right)"
}
case class Node(elem: Int) extends Tree {
  def value = {
    Console println "An expensive leaf calculation"
    elem
  }
  override def toString = s"Node($elem)"
}
object Test extends App {
  // many leaves
  val n = 1024 * 1024
  val ns: List[Tree] = (1 to n).map(Node(_)).toList
  var b = ns
  while (b.size > 1) {
    b = b.grouped(2).map { case left :: right :: Nil => Branch(left, right) }.toList
  }
  Console println s"Head: ${b.head.iterator.take(3).toList}"
}

In this implementation, the topmost branch does NOT know how many elements there are in its left and right sub-branches.
Therefore, the iterator is built recursively in divide-and-conquer fashion, which is clearly visible in the iterator method: you descend into each branch (case Branch), produce the iterator of each single node (case Node), and then join them.
Without descending into each and every node, it would not know which elements there are or how the tree is structured (whether odd branches are allowed or not, etc.).
EDIT:
Let's have a look inside the ++ method on Iterator.
def ++[B >: A](that: => GenTraversableOnce[B]): Iterator[B] = new Iterator.JoinIterator(self, that)
and then at Iterator.JoinIterator
private[scala] final class JoinIterator[+A](lhs: Iterator[A], that: => GenTraversableOnce[A]) extends Iterator[A] {
  private[this] var state = 0 // 0: lhs not checked, 1: lhs has next, 2: switched to rhs
  private[this] lazy val rhs: Iterator[A] = that.toIterator
  def hasNext = state match {
    case 0 =>
      if (lhs.hasNext) {
        state = 1
        true
      } else {
        state = 2
        rhs.hasNext
      }
    case 1 => true
    case _ => rhs.hasNext
  }
  def next() = state match {
    case 0 =>
      if (lhs.hasNext) lhs.next()
      else {
        state = 2
        rhs.next()
      }
    case 1 =>
      state = 0
      lhs.next()
    case _ =>
      rhs.next()
  }
  override def ++[B >: A](that: => GenTraversableOnce[B]) =
    new ConcatIterator(this, Vector(() => that.toIterator))
}
From this we can see that joining iterators just creates a recursive structure in the rhs field. Let's focus on it a bit more.
Consider a balanced tree with the structure: level 1 [A]; level 2 [B][C]; level 3 [D][E][F][G].
When you call JoinIterator on the iterator, the existing lhs iterator is preserved, but .toIterator is always called on rhs. This means that for each subsequent level, the rhs part is reconstructed. So for B ++ C you get something that looks like A.lhs (standing for B) and A.rhs (standing for C.toIterator), where C.toIterator in turn stands for C.lhs and C.rhs, and so on. Hence the added complexity.
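To see that laziness in action (a small sketch against the JoinIterator shown above, not from the answer): the right-hand side of ++ is passed by name and only forced once the left side is exhausted.
val it = Iterator(1, 2) ++ { println("rhs forced"); Iterator(3, 4) }
println(it.next()) // 1
println(it.next()) // 2
println(it.next()) // prints "rhs forced", then prints 3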
I hope this answers your question.

Related

How would I write a method to turn a binary search tree (BST) into a sorted list for the values in the BST?

So I have a binary search tree and need to produce a list with the BSTtoList method, but I'm not sure what the general steps are or what I have to do.
class BinarySearchTree[A](comparator: (A, A) => Boolean) {
  var root: BinaryTreeNode[A] = null

  def search(a: A): BinaryTreeNode[A] = {
    searchHelper(a, this.root)
  }

  def searchHelper(a: A, node: BinaryTreeNode[A]): BinaryTreeNode[A] = {
    if (node == null) {
      null
    } else if (comparator(a, node.value)) {
      searchHelper(a, node.left)
    } else if (comparator(node.value, a)) {
      searchHelper(a, node.right)
    } else {
      node
    }
  }

  def BSTtoList: List[A] = {
    var sortedList = List()
    if (root.left != null) {
      sortedList :+ searchHelper(root.value, root.left).value
    } else if (root.right != null) {
      sortedList :+ searchHelper(root.value, root.right).value
    }
    sortedList
  }
}
Let's first think about how a BST works. At any given node, say with value x, all the nodes in the left subtree have values < x and all nodes in the right subtree have values > x. Thus, the sorted list of the subtree rooted at x is [sorted list of left subtree] + [x] + [sorted list of right subtree], so you just have to call BSTtoList recursively on the left and right subtrees and return the list described above. From there you just have to handle the base case of returning an empty list at a NULL node.
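For illustration (a sketch, not part of the original answer), here is that recursive algorithm in Scala, using a hypothetical immutable node type rather than the question's BinaryTreeNode:
// Hypothetical minimal tree type for the sketch.
sealed trait BST[+A]
case object Leaf extends BST[Nothing]
case class BNode[A](left: BST[A], value: A, right: BST[A]) extends BST[A]

// In-order traversal: sorted left subtree, then the value, then sorted right subtree.
def bstToList[A](t: BST[A]): List[A] = t match {
  case Leaf => Nil
  case BNode(l, v, r) => bstToList(l) ::: (v :: bstToList(r))
}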
The above algorithm is O(N^2) time, and there's a better solution using tail recursion that runs in O(N) time, pseudocode for which:
def BSTtoList(root, accumulator):
    if root == NULL:
        return accumulator
    else:
        return BSTtoList(root.left_child, [root.value] + BSTtoList(root.right_child, accumulator))
Where BSTtoList is initially called with an empty list as the accumulator. This second solution works similarly to the first but is optimized by minimizing array merges (this version works best if the language used has O(1) insertion into the front of a list; implementation is a bit different if the language allows O(1) insertion into the back).
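A hedged Scala rendering of that pseudocode, reusing the hypothetical BST type from the sketch above; List prepending is O(1), so each element is consed exactly once:
def bstToListAcc[A](t: BST[A], acc: List[A] = Nil): List[A] = t match {
  case Leaf => acc
  case BNode(l, v, r) => bstToListAcc(l, v :: bstToListAcc(r, acc))
}

bstToListAcc(BNode(BNode(Leaf, 1, Leaf), 2, BNode(Leaf, 3, Leaf))) // List(1, 2, 3)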

Efficiently randomly sampling List while maintaining order

I would like to take random samples from very large lists while maintaining the order. I wrote the script below, but it requires .map(idx => ls(idx)) which is very wasteful. I can see a way of making this more efficient with a helper function and tail recursion, but I feel that there must be a simpler solution that I'm missing.
Is there a clean and more efficient way of doing this?
import scala.util.Random
def sampledList[T](ls: List[T], sampleSize: Int) = {
  Random
    .shuffle(ls.indices.toList)
    .take(sampleSize)
    .sorted
    .map(idx => ls(idx))
}
val sampleList = List("t","h","e"," ","q","u","i","c","k"," ","b","r","o","w","n")
// imagine the list is much longer though
sampledList(sampleList, 5) // List(e, u, i, r, n)
EDIT:
It appears I was unclear: I am referring to maintaining the order of the values, not the original List collection.
If by
maintaining the order of the values
you mean keeping the elements of the sample in the same order as in the ls list, then with a small modification to your original solution the performance can be greatly improved:
import scala.util.Random
def sampledList[T](ls: List[T], sampleSize: Int) = {
  Random.shuffle(ls.zipWithIndex).take(sampleSize).sortBy(_._2).map(_._1)
}
This solution has a complexity of O(n + k*log(k)), where n is the list's size, and k is the sample size, while your solution is O(n + k * log(k) + n*k).
Here is a (more complex) alternative that has O(n) complexity. You can't get any better in terms of complexity (though you could get better performance by using another collection, in particular a collection that has a constant-time size implementation). I did a quick benchmark which indicated that the speedup is very substantial.
import scala.util.Random
import scala.annotation.tailrec
def sampledList[T](ls: List[T], sampleSize: Int) = {
  @tailrec
  def rec(list: List[T], listSize: Int, sample: List[T], sampleSize: Int): List[T] = {
    require(listSize >= sampleSize,
      s"listSize must be >= sampleSize, but got listSize=$listSize and sampleSize=$sampleSize"
    )
    list match {
      case hd :: tl =>
        if (Random.nextInt(listSize) < sampleSize)
          rec(tl, listSize - 1, hd :: sample, sampleSize - 1)
        else rec(tl, listSize - 1, sample, sampleSize)
      case Nil =>
        require(sampleSize == 0, // Should never happen
          s"sampleSize must be zero at the end of processing, but got $sampleSize"
        )
        sample
    }
  }
  rec(ls, ls.size, Nil, sampleSize).reverse
}
The above implementation simply iterates over the list and keeps (or not) the current element according to a probability designed to give each element the same chance. My logic may have a flaw, but at first blush it seems sound to me.
Here's another O(n) implementation that should have a uniform probability for each element:
import scala.util.Random
import scala.collection.mutable.ListBuffer

implicit class SampleSeqOps[T](s: Seq[T]) {
  def sample(n: Int, r: Random = Random): Seq[T] = {
    assert(n >= 0 && n <= s.length)
    val res = ListBuffer[T]()
    val length = s.length
    var samplesNeeded = n
    for { (e, i) <- s.zipWithIndex } {
      // Probability of taking element i, given how many samples are still needed.
      val p = samplesNeeded.toDouble / (length - i)
      if (p >= r.nextDouble()) {
        res += e
        samplesNeeded -= 1
      }
    }
    res.toSeq
  }
}
I'm using it frequently with collections > 100'000 elements and the performance seems reasonable.
It's probably the same idea as in Régis Jean-Gilles's answer but I think the imperative solution is slightly more readable in this case.
Perhaps I don't quite understand, but since Lists are immutable you don't really need to worry about 'maintaining the order' since the original List is never touched. Wouldn't the following suffice?
def sampledList[T](ls: List[T], sampleSize: Int) =
  Random.shuffle(ls).take(sampleSize)
While my previous answer has linear complexity, it does have the drawback of requiring two passes, the first corresponding to the need to compute the length before doing anything else. Besides affecting the running time, we might want to sample a very large collection for which it is neither practical nor efficient to load the whole collection in memory at once, in which case we'd like to be able to work with a simple iterator.
As it happens, we don't need to invent anything to fix this. There is simple and clever algorithm called reservoir sampling which does exactly this (building a sample as we iterate over a collection, all in one pass). With a minor modification we can also preserve the order, as required:
import scala.util.Random
def sampledList[T](ls: TraversableOnce[T], sampleSize: Int, preserveOrder: Boolean = false, rng: Random = new Random): Iterable[T] = {
  val result = collection.mutable.Buffer.empty[(T, Int)]
  for ((item, n) <- ls.toIterator.zipWithIndex) {
    if (n < sampleSize) result += (item -> n)
    else {
      // nextInt(n + 1): the item at 0-based index n must be kept with
      // probability sampleSize / (n + 1) for the sample to be uniform.
      val s = rng.nextInt(n + 1)
      if (s < sampleSize) {
        result(s) = (item -> n)
      }
    }
  }
  if (preserveOrder) {
    result.sortBy(_._2).map(_._1)
  }
  else result.map(_._1)
}
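A quick usage sketch (the sampled values below are illustrative, not actual output):
// Sample 5 of a million elements in a single pass, keeping encounter order.
val sample = sampledList(Iterator.range(0, 1000000), 5, preserveOrder = true)
println(sample) // e.g. ArrayBuffer(137224, 273879, 622548, 771639, 903855)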

Understanding a recursive function involving generators

I've come across the following recursive algorithm, written here in Swift, that given an array produces a generator that generates sub-arrays one element shorter than the original array. The sub-arrays are created by removing one element at each index.
i.e. input [1,2,3] would return a generator that generates [1,2], [2,3] and [1,3].
The algorithm works, but I'm having real trouble understanding how. Could someone explain what's happening, or offer advice on how to analyze or understand it? Thanks in advance
// Main algorithm
func smaller1<T>(xs: [T]) -> GeneratorOf<[T]> {
    if let (head, tail) = xs.decompose {
        var gen1: GeneratorOf<[T]> = one(tail)
        var gen2: GeneratorOf<[T]> = map(smaller1(tail)) { smallerTail in
            return [head] + smallerTail
        }
        return gen1 + gen2
    }
    return one(nil)
}

// Auxiliary functions used
func map<A, B>(var generator: GeneratorOf<A>, f: A -> B) -> GeneratorOf<B> {
    return GeneratorOf {
        return generator.next().map(f)
    }
}

func one<X>(x: X?) -> GeneratorOf<X> {
    return GeneratorOf(GeneratorOfOne(x))
}
The code is taken from the book 'Functional Programming in Swift' by Chris Eidhof, Florian Kugler, and Wouter Swierstra
Given an array [a_1,…,a_n], the code:
Generates the sub-array [a_2,…,a_n];
For each sub-array B of [a_2,…,a_n] (generated recursively), generates [a_1] + B.
For example, given the array [1,2,3], we:
Generate [2,3];
For each sub-array B of [2,3] (namely, [3] and [2]), generate [1] + B (this generates [1,3] and [1,2]).
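For readers more at home in Scala, here is a rough analogue of smaller1 sketched with Iterator (a sketch, not from the book):
def smaller[T](xs: List[T]): Iterator[List[T]] = xs match {
  case Nil => Iterator.empty
  case head :: tail =>
    // First the array without its head, then (lazily) every smaller version
    // of the tail with the head stuck back on the front.
    Iterator.single(tail) ++ smaller(tail).map(head :: _)
}

smaller(List(1, 2, 3)).toList // List(List(2, 3), List(1, 3), List(1, 2))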

what does <=> mean in grails sort

I know this seems to be a stupid question but I cannot figure out what
def Nodes = Node.findAllByParent(theNode).sort{ a, b -> a.label <=> b.label }
does. The Node class contains label and other attributes. I want to know what the sort closure in the above line does; theNode is a parent node which has children. And how is it different from
def Nodes = Node.findAllByParent(theNode,sort['label'])
a <=> b
is shorthand for
a.compareTo(b)
which itself is equivalent to:
if (a > b) {
    return 1
} else if (a < b) {
    return -1
} else {
    // a and b are equal
    return 0
}
The difference between
def Nodes = Node.findAllByParent(theNode).sort{ a, b -> a.label <=> b.label }
and
def Nodes = Node.findAllByParent(theNode,sort['label'])
is that the first one does the sorting in-memory, whereas in the second case the nodes are returned in sorted order by the query. In general you should let the database do the sorting where possible.
By the way, I think the second parameter above should be [sort: "label"] rather than sort['label'].
The first sort is done as a Groovy sort on the collection, whereas the second uses the sorting capabilities of the data source (e.g. the database's ORDER BY).
The <=> is known as the spaceship operator. It is another way of invoking the compareTo method of the Comparable interface. This means we can implement the compareTo method in our own classes, which allows us to use the <=> operator with them. And of course all classes that already implement compareTo can be used with the spaceship operator. The operator makes for readable sort methods.
For example:
class Person implements Comparable {
    String username
    String email

    int compareTo(other) {
        this.username <=> other.username
    }
}

assert -1 == ('a' <=> 'b')
assert 0 == (42 <=> 42)
assert -1 == (new Person([username: 'foo', email: 'test@email.com']) <=> new Person([username: 'zebra', email: 'tester@email.com']))
assert [1, 2, 3, 4] == [4, 2, 1, 3].sort { a, b -> a <=> b }

Pseudocode to compare two trees

This is a problem I've encountered a few times, and haven't been convinced that I've used the most efficient logic.
As an example, presume I have two trees: one is a folder structure, the other is an in-memory 'model' of that folder structure. I wish to compare the two trees, and produce a list of nodes that are present in one tree and not the other - and vice versa.
Is there an accepted algorithm to handle this?
Seems like you just want to do a pre-order traversal, essentially, where "visiting" a node means checking for children that are in one version but not the other.
More precisely: start at the root. At each node, get a set of items in each of the two versions of the node. The symmetric difference of the two sets contains the items in one but not the other. Print/output those. The intersection contains the items that are common to both. For each item in the intersection (I assume you aren't going to look further into the items that are missing from one tree), call "visit" recursively on that node to check its contents. It's an O(n) operation, with a little recursion overhead.
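A sketch of that traversal in Scala (Dir is a hypothetical stand-in for a node with named children, i.e. a folder or its in-memory model):
case class Dir(name: String, children: Map[String, Dir])

def diff(a: Dir, b: Dir, path: String = ""): Unit = {
  val here = s"$path/${a.name}"
  val (ka, kb) = (a.children.keySet, b.children.keySet)
  // Symmetric difference: children present in one tree but not the other.
  (ka diff kb).foreach(n => println(s"only in first:  $here/$n"))
  (kb diff ka).foreach(n => println(s"only in second: $here/$n"))
  // Intersection: common children, visited recursively.
  (ka intersect kb).foreach(n => diff(a.children(n), b.children(n), here))
}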
public boolean compareTrees(TreeNode root1, TreeNode root2) {
    if ((root1 == null && root2 != null) ||
        (root1 != null && root2 == null)) {
        return false;
    }
    if (root1 == null && root2 == null) {
        return true;
    }
    if (root1.data != root2.data) {
        return false;
    }
    return compareTrees(root1.left, root2.left) &&
           compareTrees(root1.right, root2.right);
}
If you use a sorted search tree, like an AVL tree, you can also traverse it efficiently in-order. That will return your paths in sorted order from "low" to "high".
Then you can sort your directory array (e.g. using quicksort) with the same compare method as you use in your tree algorithm.
Then start comparing the two side by side, advancing to the next item by traversing your tree in-order and checking the next item in your sorted directory array.
This should be more efficient in practice, but only benchmarking can tell.
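For concreteness, a rough Scala sketch of that side-by-side walk, assuming both sides have already been flattened into sorted lists of path strings (names are illustrative):
import scala.annotation.tailrec

// Walk two sorted lists in lockstep, collecting entries unique to each side.
@tailrec
def diffSorted(xs: List[String], ys: List[String],
               onlyXs: List[String] = Nil, onlyYs: List[String] = Nil): (List[String], List[String]) =
  (xs, ys) match {
    case (Nil, rest) => (onlyXs.reverse, onlyYs.reverse ::: rest)
    case (rest, Nil) => (onlyXs.reverse ::: rest, onlyYs.reverse)
    case (x :: xt, y :: yt) =>
      if (x == y) diffSorted(xt, yt, onlyXs, onlyYs)          // in both trees
      else if (x < y) diffSorted(xt, ys, x :: onlyXs, onlyYs) // only in first
      else diffSorted(xs, yt, onlyXs, y :: onlyYs)            // only in second
  }

diffSorted(List("a", "b", "d"), List("a", "c", "d")) // (List(b), List(c))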
A simple example in Python.
class Node(object):
    def __init__(self, val):
        self.val = val
        self.child = {}

    def get_left(self):
        # if 'left' is not in the child dictionary, this node has no left child
        if 'left' in self.child:
            return self.child['left']
        else:
            return None

    def get_right(self):
        # if 'right' is not in the child dictionary, this node has no right child
        if 'right' in self.child:
            return self.child['right']
        else:
            return None


def traverse_tree(a):
    if a is not None:
        print('current_node : %s' % a.val)
        if 'left' in a.child:
            traverse_tree(a.child['left'])
        if 'right' in a.child:
            traverse_tree(a.child['right'])


def compare_tree(a, b):
    if (a is not None and b is None) or (a is None and b is not None):
        return 0
    elif a is not None and b is not None:
        print(a.val, b.val)
        # print('currently comparing a : %s, b : %s' % (a.val, b.val))
        if a.val == b.val and compare_tree(a.get_left(), b.get_left()) and compare_tree(a.get_right(), b.get_right()):
            return 1
        else:
            return 0
    else:
        return 1


# Example
a = Node(1)
b = Node(0)
a.child['left'] = Node(2)
a.child['right'] = Node(3)
a.child['left'].child['left'] = Node(4)
a.child['left'].child['right'] = Node(5)
a.child['right'].child['left'] = Node(6)
a.child['right'].child['right'] = Node(7)
b.child['left'] = Node(2)
b.child['right'] = Node(3)
b.child['left'].child['left'] = Node(4)
# b.child['left'].child['right'] = Node(5)
b.child['right'].child['left'] = Node(6)
b.child['right'].child['right'] = Node(7)

if compare_tree(a, b):
    print('trees are equal')
else:
    print('trees are unequal')

# DFS traversal
traverse_tree(a)
Also pasted an example that you can run.
You may also want to have a look at how git does it. Essentially whenever you do a git diff, under the hood a tree comparison is done.
