trie or balanced binary search tree to store dictionary?

I have a simple requirement (perhaps hypothetical):
I want to store an English word dictionary (n words), and given a word (of character length m), the dictionary should be able to tell whether the word exists in it or not.
What would be an appropriate data structure for this?
a balanced binary search tree, as used by the C++ STL associative containers set and map?
or
a trie on strings
Some complexity analysis:
In a balanced BST, lookup takes O(m log n) time (comparing two strings takes O(m) time, character by character).
In a trie, if we could branch out of each node in O(1) time, lookup would take O(m). But the assumption of O(1) branching at each node is not free: each node has up to 26 possible branches. To get O(1) branching we would keep a small array indexed by character at each node, which blows up the space. After a few levels in the trie the branching factor drops, so it may be better to keep a linked list of (character, child pointer) pairs at each node.
What looks more practical? Are there any other trade-offs?
Thanks,

I'd say use a Trie, or better yet use its more space efficient cousin the Directed Acyclic Word Graph (DAWG).
It has the same runtime characteristics (insert, look up, delete) as a Trie but overlaps common suffixes as well as common prefixes which can be a big saving on space.
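For reference, here is a minimal trie sketch in C++ (a plain trie, not a DAWG; the 26-slot child array assumes lowercase a-z and trades space for O(1) branching per node, exactly the trade-off discussed in the question):

#include <array>
#include <memory>
#include <string>

struct TrieNode {
    bool is_word = false;
    std::array<std::unique_ptr<TrieNode>, 26> child{};  // one slot per letter a-z
};

void insert(TrieNode& root, const std::string& word) {
    TrieNode* node = &root;
    for (char c : word) {
        auto& next = node->child[c - 'a'];
        if (!next) next = std::make_unique<TrieNode>();
        node = next.get();
    }
    node->is_word = true;
}

bool contains(const TrieNode& root, const std::string& word) {
    const TrieNode* node = &root;
    for (char c : word) {
        node = node->child[c - 'a'].get();
        if (!node) return false;
    }
    return node->is_word;
}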

If this is C++, you should also consider std::tr1::unordered_set. (If you have C++0x, you can use std::unordered_set.)
This just uses a hash table internally, which I would wager will out-perform any tree-like structure in practice. It is also trivial to implement because you have nothing to implement.
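For example, a sketch of the whole dictionary with std::unordered_set (the file name is a placeholder and error handling is omitted):

#include <fstream>
#include <string>
#include <unordered_set>

int main() {
    std::unordered_set<std::string> dict;
    std::ifstream in("words.txt");          // hypothetical one-word-per-line file
    for (std::string word; std::getline(in, word); )
        dict.insert(word);

    bool found = dict.count("example") > 0; // expected O(m) lookup on average
    return found ? 0 : 1;
}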

Binary search is going to be easier to implement, and it only involves about log2(n) string comparisons, i.e. a few tens at most for a typical dictionary. Given you know the data up front, you can build a balanced binary tree, so performance is going to be predictable and easily understood.
With that in mind, I'd use a standard binary tree (probably using set from C++ since that's typically implemented as a tree).

A simple solution is to store the dict as sorted, \n-separated words on disk, load it into memory and do a binary search. The only non-standard part here is that you have to scan backwards for the start of a word when you're doing the binary search.
Here's some code! (It assumes globals wordlist pointing to the loaded dict, and wordlist_end which points to just after the end of the loaded dict.)
// Return >0 if word > word at position p.
// Return <0 if word < word at position p.
// Return 0 if word == word at position p.
static int cmp_word_at_index(size_t p, const char *word) {
    while (p > 0 && wordlist[p - 1] != '\n') {
        p--;
    }
    while (1) {
        if (wordlist[p] == '\n') {
            if (*word == '\0') return 0;
            else return 1;
        }
        if (*word == '\0') {
            return -1;
        }
        int char0 = toupper(*word);
        int char1 = toupper(wordlist[p]);
        if (char0 != char1) {
            return (int)char0 - (int)char1;
        }
        ++p;
        ++word;
    }
}
// Test if a word is in the dictionary.
int is_word(const char* word_to_find) {
    size_t index_min = 0;
    size_t index_max = wordlist_end - wordlist;
    while (index_min < index_max - 1) {
        size_t index = (index_min + index_max) / 2;
        int c = cmp_word_at_index(index, word_to_find);
        if (c == 0) return 1; // Found word.
        if (c < 0) {
            index_max = index;
        } else {
            index_min = index;
        }
    }
    return 0;
}
A huge benefit of this approach is that the dict is stored in a human-readable way on disk, and that you don't need any fancy code to load it (allocate a block of memory and read() it in in one go).
If you want to use a trie, you could use a packed and suffix-compressed representation. Here's a link to one of Donald Knuth's students, Franklin Liang, who wrote about this trick in his thesis.
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.123.7018&rep=rep1&type=pdf
It uses half the storage of the straightforward textual dict representation, gives you the speed of a trie, and you can (like the textual dict representation) store the whole thing on disk and load it in one go.
The trick it uses is to pack all the trie nodes into a single array, interleaving them where possible. As well as a new pointer (and an end-of-word marker bit) in each array location like in a regular trie, you store the letter that this node is for -- this lets you tell if the node is valid for your state or if it's from an overlapping node. Read the linked doc for a fuller and clearer explanation, as well as an algorithm for packing the trie into this array.
Implementing the suffix compression and the greedy packing algorithm described there takes a bit of work, but it's manageable.

Industry standard is to store the dictionary in a hash table and get amortized O(1) lookup time. Space is less of a concern in industry these days, especially given the advances in distributed computing.
A hash table is also how Google implements its autocomplete feature: store every prefix of a word as a key and put the word as the value in the hash table.
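A sketch of that prefix-keyed hash table idea in C++ (the container choice and function name are illustrative assumptions, not a description of Google's actual implementation):

#include <string>
#include <unordered_map>
#include <vector>

// Map every prefix of every word to the words that start with it.
std::unordered_map<std::string, std::vector<std::string>>
buildPrefixIndex(const std::vector<std::string>& words) {
    std::unordered_map<std::string, std::vector<std::string>> index;
    for (const auto& w : words)
        for (std::size_t len = 1; len <= w.size(); ++len)
            index[w.substr(0, len)].push_back(w);
    return index;
}

// Autocomplete is then a single O(1) average-case hash lookup:
// auto it = index.find("pre"); if (it != index.end()) { /* use it->second */ }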

Related

Find the Most Frequent Ordered Word Pair In a Document

This is a problem from S. Skiena's "Algorithm Design Manual"; the problem statement is:
Give an algorithm for finding an ordered word pair (e.g. "New York")
occurring with the greatest frequency in a given webpage.
Which data structure would you use? Optimize both time and space.
One obvious solution is inserting each ordered pair into a hash map and then iterating over all of them to find the most frequent one. However, there should be a better way; can anyone suggest anything?
In a text with n words, we have exactly n - 1 ordered word pairs (not distinct of course). One solution is to use a max priority queue; we simply insert each pair in the max PQ with frequency 1 if not already present. If present, we increment the key. However, if we use a Trie, we don't need to represent all n - 1 pairs separately. Take for example the following text:
A new puppy in New York is happy with it's New York life.
The resulting Trie would look like the following:
If we store the number of occurrences of a pair in the leaf nodes, we could easily compute the maximum occurrence in linear time. Since we need to look at each word, that's the best we can do, time wise.
Working Scala code below. The official site has a solution in Python.
import scala.annotation.tailrec
import scala.collection.mutable.{ListBuffer, Map => MutableMap}

class TrieNode(val parent: Option[TrieNode] = None,
               val children: MutableMap[Char, TrieNode] = MutableMap.empty,
               var n: Int = 0) {

  def add(c: Char): TrieNode = {
    val child = children.getOrElseUpdate(c, new TrieNode(parent = Some(this)))
    child.n += 1
    child
  }

  def letter(node: TrieNode): Char = {
    node.parent
      .flatMap(_.children.find(_._2 eq node))
      .map(_._1)
      .getOrElse('\u0000')
  }

  override def toString: String = {
    Iterator
      .iterate((ListBuffer.empty[Char], Option(this))) {
        case (buffer, node) =>
          node
            .filter(_.parent.isDefined)
            .map(letter)
            .foreach(buffer.prepend(_))
          (buffer, node.flatMap(_.parent))
      }
      .dropWhile(_._2.isDefined)
      .take(1)
      .map(_._1.mkString)
      .next()
  }
}

def mostCommonPair(text: String): (String, Int) = {
  val root = new TrieNode()

  @tailrec
  def loop(s: String,
           mostCommon: TrieNode,
           count: Int,
           parent: TrieNode): (String, Int) = {
    s.split("\\s+", 2) match {
      case Array(head, tail @ _*) if head.nonEmpty =>
        val word = head.foldLeft(parent)((tn, c) => tn.add(c))
        val (common, n, p) =
          if (parent eq root) (mostCommon, count, word.add(' '))
          else if (word.n > count) (word, word.n, root)
          else (mostCommon, count, root)
        loop(tail.headOption.getOrElse(""), common, n, p)
      case _ => (mostCommon.toString, count)
    }
  }

  loop(text, new TrieNode(), -1, root)
}
Inspired by the question here.
I think the first point to note is that finding the most frequent ordered word pair is no more (or less) difficult than finding the most frequent word. The only difference is that instead of words made up of the letters a..z+A..Z separated by punctuation or spaces, you are looking for word pairs made up of the letters a..z+A..Z+exactly_one_space, similarly separated by punctuation or spaces.
If your web page has n words then there are only n-1 word pairs. So hashing each word pair and then iterating over the hash table will be O(n) in both time and memory. This should be pretty quick to do even if n is ~10^6 (i.e. the length of an average novel). I can't imagine anything more efficient unless n is fairly small, in which case the memory savings resulting from constructing an ordered list of word pairs (instead of a hash table) might outweigh the cost of increasing time complexity to O(n log n).
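For illustration, a sketch of this hash-map counting approach in C++ (tokenization is deliberately simplistic; real code would strip punctuation and normalize case):

#include <iostream>
#include <sstream>
#include <string>
#include <unordered_map>

int main() {
    std::istringstream text("a new puppy in new york is happy with its new york life");
    std::unordered_map<std::string, int> pairCount;
    std::string prev, word, best;
    int bestCount = 0;
    while (text >> word) {
        if (!prev.empty()) {
            int c = ++pairCount[prev + " " + word];  // count this ordered pair
            if (c > bestCount) { bestCount = c; best = prev + " " + word; }
        }
        prev = word;
    }
    std::cout << best << " : " << bestCount << "\n"; // prints "new york : 2"
}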
Why not keep all the ordered pairs in an AVL tree, with a 10-element array to track the top 10 ordered pairs? In the AVL tree we would keep all the ordered pairs with their occurrence counts, and the top 10 would be kept in the array. That way searching for any ordered pair would be O(log N) and traversing would be O(N).
I think we could not do better than O(n) in terms of time, as one would have to look at each element at least once. So time complexity cannot be optimised further.
But we can use a trie to optimise the space used. In a page there are often words which are repeated, so this might lead to a significant reduction in space usage. The leaf nodes in the trie could store the frequency of the ordered pair, and we can use two pointers to iterate over the text, one pointing at the current word and the other at the previous word.

Design of a data structure that can search over objects with 2 attributes

I'm trying to think of a way to design a data structure that I can efficiently insert into, remove from and search in.
The catch is that the search function gets a similar object as input, with 2 attributes, and I need to find an object in my dataset such that both the 1st and 2nd attributes of the object in my dataset are equal to or bigger than those of the search function's input.
So for example, if I send as input, the following object:
object[a] = 9; object[b] = 14
Then a valid found object could be:
object[a] = 9; object[b] = 79
but not:
object[a] = 8; object[b] = 28
Is there any way to store the data such that the search complexity is better than linear?
EDIT:
I forgot to include this in my original question: the search has to return the smallest possible object in the dataset, by multiplication of the 2 attributes.
Meaning that the value of object[a]*object[b] of an object that fits the original condition is smaller than that of any other object in the dataset that also fits.
You may want to use the k-d tree data structure, which is typically used to index k-dimensional points. A search operation like the one you describe requires O(log n) on average.
This post may help when the attributes are hierarchically linked, like name and forename. For points in a 2D space, a k-d tree is better suited, as explained by fajarkoe.
class Person {
    string name;
    string forename;
    ... other non key attributes
}
You have to write a comparator function which takes two objects of class X as input and returns -1, 0 or +1 for the <, = and > cases.
Libraries like glibc, with qsort() and bsearch(), or higher-level languages like Java with its java.util.Comparator interface and java.util.SortedMap containers (implementation: java.util.TreeMap), use comparators.
Other languages use the equivalent concept.
The comparator method may be written following your spec like:
int compare( Person left, Person right ) {
    if( left.name < right.name ) {
        return -1;
    }
    if( left.name > right.name ) {
        return +1;
    }
    if( left.forename < right.forename ) {
        return -1;
    }
    if( left.forename > right.forename ) {
        return +1;
    }
    return 0;
}
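The same two-key ordering expressed in C++ (a sketch using std::set with a custom comparator; the field names follow the Person example above):

#include <set>
#include <string>

struct Person {
    std::string name;
    std::string forename;
};

struct PersonLess {
    bool operator()(const Person& l, const Person& r) const {
        if (l.name != r.name) return l.name < r.name;   // primary key: name
        return l.forename < r.forename;                 // secondary key: forename
    }
};

// std::set keeps the elements sorted by (name, forename);
// insert, erase and find are all O(log n) comparisons.
using PersonSet = std::set<Person, PersonLess>;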
Complexity of qsort()
Quicksort, or partition-exchange sort, is a sorting algorithm developed by Tony Hoare that, on average, makes O(n log n) comparisons to sort n items. In the worst case, it makes O(n^2) comparisons, though this behavior is rare. Quicksort is often faster in practice than other O(n log n) algorithms. Additionally, quicksort's sequential and localized memory references work well with a cache. Quicksort is a comparison sort and, in efficient implementations, is not a stable sort. Quicksort can be implemented with an in-place partitioning algorithm, so the entire sort can be done with only O(log n) additional space used by the stack during the recursion.
Complexity of bsearch()
If the list to be searched contains more than a few items (a dozen, say) a binary search will require far fewer comparisons than a linear search, but it imposes the requirement that the list be sorted. Similarly, a hash search can be faster than a binary search but imposes still greater requirements. If the contents of the array are modified between searches, maintaining these requirements may even take more time than the searches. And if it is known that some items will be searched for much more often than others, and it can be arranged so that these items are at the start of the list, then a linear search may be the best.

Efficient datastructure for pooling integers

I'm looking for a data structure to help me manage a pool of integers. It's a pool in that I remove integers from the pool for a short while then put them back with the expectation that they will be used again. It has some other odd constraints however, so a regular pool doesn't work well.
Hard requirements:
constant time access to what the largest in use integer is.
the sparseness of the integers needs to be bounded (even if only in principle).
I want the integers to be close to each other so I can quickly iterate over them with minimal unused integers in the range.
Use these if they help with selecting a data structure, otherwise ignore them:
Integers in the pool are 0 based and contiguous.
The pool can be constant sized.
Integers from the pool are only used for short periods with a high churn rate.
I have a working solution but it feels inelegant.
My (sub-optimal) Solution
Constant sized pool.
Put all available integers into a sorted set (free_set).
When a new integer is requested retrieve the smallest from the free_set.
Put all in-use integers into another sorted set (used_set).
When the largest is requested, retrieve the largest from the used_set.
There are a few optimizations that may help with my particular solution (priority queue, memoization, etc.). But my whole approach seems wasteful.
I'm hoping there's some esoteric data structure that fits my problem perfectly. Or at least a better pooling algorithm.
pseudo class:
class IntegerPool {
    int size = 0;
    Set<int> free_set = new Set<int>();

    public int Acquire() {
        if(!free_set.IsEmpty()) {
            return free_set.RemoveSmallest();
        } else {
            return size++;
        }
    }

    public void Release(int i) {
        if(i == size - 1) {
            size--;
        } else {
            free_set.Add(i);
        }
    }

    public int GetLargestUsedInteger() {
        return size;
    }
}
Edit
RemoveSmallest isn't useful at all. RemoveWhatever is good enough. So Set<int> can be replaced by LinkedList<int> as a faster alternative (or even Stack<int>).
Why not use a balanced binary search tree? You can store a pointer/iterator to the min element and access it for free, and updating it after an insert/delete is an O(1) operation. If you use a self balancing tree, insert/delete is O(log(n)). To elaborate:
insert : Just compare new element to previous min; if it is better make the iterator point to the new min.
delete : If min was deleted, then before removing find the successor (which you can do by just walking the iterator forward 1 step), and then take that guy to be the new min.
While it is theoretically possible to do slightly better using some kind of sophisticated uber-heap data structure (ie Fibonacci heaps), in practice I don't think you would want to deal with implementing something like that just to save a small log factor. Also, as a bonus you get fast in-order traversal for free -- not to mention that most programming languages these days^ come with fast implementations of self-balancing binary search trees out of the box (like red-black trees/avl etc.).
^ with the exception of javascript :P
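A sketch of that suggestion applied to the pool above, in C++ (std::set is a red-black tree in common implementations; the rightmost element gives the largest in-use integer without keeping a separate iterator):

#include <set>

// Sketch of the balanced-BST pool: both sets are red-black trees,
// so Acquire/Release are O(log n) and the extremes sit at the ends.
class IntegerPool {
    std::set<int> free_set;   // integers available for reuse
    std::set<int> used_set;   // integers currently handed out
    int next = 0;             // next never-used integer

public:
    int Acquire() {
        int i;
        if (!free_set.empty()) {
            i = *free_set.begin();             // smallest free integer keeps the range dense
            free_set.erase(free_set.begin());
        } else {
            i = next++;
        }
        used_set.insert(i);
        return i;
    }

    void Release(int i) {
        used_set.erase(i);
        free_set.insert(i);
    }

    // Largest in-use integer; -1 if none. Found at the rightmost tree node.
    int LargestUsed() const {
        return used_set.empty() ? -1 : *used_set.rbegin();
    }
};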
EDIT: Thought of an even better answer.

Hashing a Tree Structure

I've just come across a scenario in my project where I need to compare different tree objects for equality with already-known instances, and have considered that some sort of hashing algorithm that operates on an arbitrary tree would be very useful.
Take for example the following tree:
             O
            / \
           /   \
          O     O
         /|\    |
        / | \   |
       O  O  O  O
               / \
              /   \
             O     O
Where each O represents a node of the tree, is an arbitrary object, and has an associated hash function. So the problem reduces to: given the hash codes of the nodes of the tree structure, and a known structure, what is a decent algorithm for computing a (relatively) collision-free hash code for the entire tree?
A few notes on the properties of the hash function:
The hash function should depend on the hash code of every node within the tree as well as its position.
Reordering the children of a node should distinctly change the resulting hash code.
Reflecting any part of the tree should distinctly change the resulting hash code
If it helps, I'm using C# 4.0 here in my project, though I'm primarily looking for a theoretical solution, so pseudo-code, a description, or code in another imperative language would be fine.
UPDATE
Well, here's my own proposed solution. It has been helped much by several of the answers here.
Each node (sub-tree/leaf node) has the following hash function:
public override int GetHashCode()
{
    int hashCode = unchecked((this.Symbol.GetHashCode() * 31 +
                              this.Value.GetHashCode()));
    for (int i = 0; i < this.Children.Count; i++)
        hashCode = unchecked(hashCode * 31 + this.Children[i].GetHashCode());
    return hashCode;
}
The nice thing about this method, as I see it, is that hash codes can be cached and only recalculated when the node or one of its descendants changes. (Thanks to vatine and Jason Orendorff for pointing this out).
Anyway, I would be grateful if people could comment on my suggested solution here - if it does the job well, then great, otherwise any possible improvements would be welcome.
If I were to do this, I'd probably do something like the following:
For each leaf node, compute the concatenation of 0 and the hash of the node data.
For each internal node, compute the concatenation of 1 and the hash of any local data (NB: may not be applicable) and the hash of the children from left to right.
This will lead to a cascade up the tree every time you change anything, but that MAY be a low enough overhead to be worthwhile. If changes are relatively infrequent compared to how often the hash is needed, it may even make sense to go for a cryptographically secure hash.
Edit1: There is also the possibility of adding a "hash valid" flag to each node and simply propagate a "false" up the tree (or "hash invalid" and propagate "true") up the tree on a node change. That way, it may be possible to avoid a complete recalculation when the tree hash is needed and possibly avoid multiple hash calculations that are not used, at the risk of slightly less predictable time to get a hash when needed.
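A sketch of the invalidation-flag idea in C++ (the 31-multiplier combine step is only for illustration; any of the mixing schemes discussed here would do):

#include <cstddef>
#include <vector>

struct Node {
    std::size_t data_hash = 0;             // hash of the node's own payload
    Node* parent = nullptr;
    std::vector<Node*> children;

    mutable std::size_t cached = 0;
    mutable bool valid = false;

    void invalidate() {                    // propagate "hash invalid" up the tree,
        for (Node* n = this; n && n->valid; n = n->parent)
            n->valid = false;              // stopping early at already-invalid ancestors
    }

    std::size_t hash() const {             // recompute lazily, only where invalid
        if (!valid) {
            std::size_t h = data_hash;
            for (const Node* c : children)
                h = h * 31 + c->hash();
            cached = h;
            valid = true;
        }
        return cached;
    }
};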
Edit3: The hash code suggested by Noldorin in the question looks like it would have a chance of collisions, if the result of GetHashCode can ever be 0. Essentially, there is no way of distinguishing a tree composed of a single node, with "symbol hash" 30 and "value hash" 25 and a two-node tree, where the root has a "symbol hash" of 0 and a "value hash" of 30 and the child node has a total hash of 25. The examples are entirely invented, I don't know what expected hash ranges are so I can only comment on what I see in the presented code.
Using 31 as the multiplicative constant is good, in that it will cause any overflow to happen on a non-bit boundary, although I am thinking that, with sufficient children and possibly adversarial content in the tree, the hash contribution from items hashed early MAY be dominated by later hashed items.
However, if the hash performs decently on expected data, it looks as if it will do the job. It's certainly faster than using a cryptographic hash (as done in the example code listed below).
Edit2: As for specific algorithms and minimum data structure needed, something like the following (Python, translating to any other language should be relatively easy).
#! /usr/bin/env python
import Crypto.Hash.SHA

class Node:
    def __init__(self, parent=None, contents="", children=None):
        self.valid = False
        self.hash = None
        self.parent = parent
        self.contents = contents
        self.children = children if children is not None else []

    def append_child(self, child):
        self.children.append(child)
        self.invalidate()

    def invalidate(self):
        self.valid = False
        if self.parent:
            self.parent.invalidate()

    def gethash(self):
        if self.valid:
            return self.hash
        digester = Crypto.Hash.SHA.new()
        digester.update(self.contents)
        if self.children:
            for child in self.children:
                digester.update(child.gethash())
            self.hash = "1" + digester.hexdigest()
        else:
            self.hash = "0" + digester.hexdigest()
        self.valid = True
        return self.hash

    def setcontents(self, contents):
        self.contents = contents
        self.invalidate()
Okay, after your edit where you've introduced the requirement that the hashing result should be different for different tree layouts, you're only left with the option to traverse the whole tree and write its structure to a single array.
That's done like this: you traverse the tree and dump the operations you do. For an original tree that could be (for a left-child-right-sibling structure):
[1, child, 2, child, 3, sibling, 4, sibling, 5, parent, parent, //we're at root again
sibling, 6, child, 7, child, 8, sibling, 9, parent, parent]
You may then hash the list (that is, effectively, a string) any way you like. As another option, you may even return this list as the result of the hash function, so it becomes a collision-free tree representation.
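A sketch of that serialize-then-hash idea in C++ for a left-child-right-sibling tree (std::hash stands in for whatever string hash you prefer; the brackets keep the structure unambiguous):

#include <cstddef>
#include <functional>
#include <string>

struct Node {
    int value;
    Node* firstChild = nullptr;   // left child
    Node* nextSibling = nullptr;  // right sibling
};

// Dump the structure as a bracketed string, e.g. "(1(2()3()))".
void dump(const Node* n, std::string& out) {
    out += '(';
    for (; n; n = n->nextSibling) {
        out += std::to_string(n->value);
        dump(n->firstChild, out);
    }
    out += ')';
}

std::size_t treeHash(const Node* root) {
    std::string s;
    dump(root, s);
    return std::hash<std::string>{}(s);  // or return s itself as a collision-free representation
}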
But adding precise information about the whole structure is not what hash functions usually do. The approach just proposed has to compute the hash of every node as well as traverse the whole tree, so you may consider the other ways of hashing described below.
If you don't want to traverse the whole tree:
One algorithm that immediately came to my mind is like this. Pick a large prime number H (that's greater than maximal number of children). To hash a tree, hash its root, pick a child number H mod n, where n is the number of children of root, and recursively hash the subtree of this child.
This seems to be a bad option if trees differ only deeply near the leaves. But at least it should run fast for not very tall trees.
If you want to hash less elements but go through the whole tree:
Instead of hashing a subtree, you may want to hash layer-wise: hash the root first, then one of the nodes that are its children, then one of the children of the children, etc. So you cover the whole tree instead of one specific path. This makes the hashing procedure slower, of course.
------------ O --------------- layer 0, n=1
            / \
           /   \
--------- O --- O ------------ layer 1, n=2
         /|\    |
        / | \   |
       /  |  \  |
----- O - O - O O ------------ layer 2, n=4
               / \
              /   \
------------ O --- O --------- layer 3, n=2
A node from a layer is picked with H mod n rule.
The difference between this version and the previous one is that a tree would have to undergo quite an illogical transformation to end up with the same hash value.
The usual technique of hashing any sequence is combining the values (or hashes thereof) of its elements in some mathematical way. I don't think a tree would be any different in this respect.
For example, here is the hash function for tuples in Python (taken from Objects/tupleobject.c in the source of Python 2.6):
static long
tuplehash(PyTupleObject *v)
{
    register long x, y;
    register Py_ssize_t len = Py_SIZE(v);
    register PyObject **p;
    long mult = 1000003L;
    x = 0x345678L;
    p = v->ob_item;
    while (--len >= 0) {
        y = PyObject_Hash(*p++);
        if (y == -1)
            return -1;
        x = (x ^ y) * mult;
        /* the cast might truncate len; that doesn't change hash stability */
        mult += (long)(82520L + len + len);
    }
    x += 97531L;
    if (x == -1)
        x = -2;
    return x;
}
It's a relatively complex combination with constants experimentally chosen for best results for tuples of typical lengths. What I'm trying to show with this code snippet is that the issue is very complex and very heuristic, and the quality of the results probably depends on the more specific aspects of your data - i.e. domain knowledge may help you reach better results. However, for good-enough results you shouldn't look too far. I would guess that taking this algorithm and combining all the nodes of the tree instead of all the tuple elements, plus adding their position into play, will give you a pretty good algorithm.
One option of taking the position into account is the node's position in an inorder walk of the tree.
Any time you are working with trees recursion should come to mind:
public override int GetHashCode() {
    int hash = 5381;
    foreach (var node in this.BreadthFirstTraversal()) {
        hash = 33 * hash + node.GetHashCode();
    }
    return hash;
}
The hash function should depend on the hash code of every node within the tree as well as its position.
Check. We are explicitly using node.GetHashCode() in the computation of the tree's hash code. Further, because of the nature of the algorithm, a node's position plays a role in the tree's ultimate hash code.
Reordering the children of a node should distinctly change the resulting hash code.
Check. They will be visited in a different order in the traversal, leading to a different hash code. (Note that if there are two children with the same hash code you will end up with the same hash code upon swapping the order of those children.)
Reflecting any part of the tree should distinctly change the resulting hash code
Check. Again the nodes would be visited in a different order leading to a different hash code. (Note that there are circumstances where the reflection could lead to the same hash code if every node is reflected into a node with the same hash code.)
The collision-free property of this will depend on how collision-free the hash function used for the node data is.
It sounds like you want a system where the hash of a particular node is a combination of the child node hashes, where order matters.
If you're planning on manipulating this tree a lot, you may want to pay the price in space of storing the hashcode with each node, to avoid the penalty of recalculation when performing operations on the tree.
Since the order of the child nodes matters, a method which might work here would be to combine the node data and children using prime number multiples and addition modulo some large number.
To go for something similar to Java's String hashcode:
Say you have n child nodes.
hash(node) = hash(nodedata) +
             hash(childnode[0]) * 31^(n-1) +
             hash(childnode[1]) * 31^(n-2) +
             <...> +
             hash(childnode[n-1]) * 31^0
Some more detail on the scheme used above can be found here: http://computinglife.wordpress.com/2008/11/20/why-do-hash-functions-use-prime-numbers/
I can see that if you have a large set of trees to compare, then you could use a hash function to retrieve a set of potential candidates, then do a direct comparison.
One encoding that would work: just use Lisp syntax to put brackets around the tree and write out the identifier of each node in pre-order. But this is computationally equivalent to a pre-order comparison of the tree, so why not just do that?
I've given 2 solutions: one is for comparing the two trees when you're done (needed to resolve collisions) and the other to compute the hashcode.
TREE COMPARISON:
The most efficient way to compare will be to simply recursively traverse each tree in a fixed order (pre-order is simple and as good as anything else), comparing the node at each step.
So, just create a Visitor pattern that successively returns the next node in pre-order for a tree; i.e. its constructor can take the root of the tree.
Then, just create two instances of the Visitor that act as generators for the next node in preorder, i.e. Visitor v1 = new Visitor(root1), Visitor v2 = new Visitor(root2).
Write a comparison function that can compare itself to another node.
Then just visit each node of the trees, comparing, and returning false if comparison fails. i.e.
Module
    Function Compare(Node root1, Node root2)
        Visitor v1 = new Visitor(root1)
        Visitor v2 = new Visitor(root2)
        loop
            Node n1 = v1.next
            Node n2 = v2.next
            if (n1 == null) and (n2 == null) then
                return true
            if (n1 == null) or (n2 == null) then
                return false
            if n1.compare(n2) != 0 then
                return false
        end loop
        // unreachable
    End Function
End Module
HASH CODE GENERATION:
if you want to write out a string representation of the tree, you can use the lisp syntax for a tree, then sample the string to generate a shorter hashcode.
Module
    Function TreeToString(Node n1) : String
        if n1 == null
            return ""
        String s1 = "(" + n1.toString()
        for each child of n1
            s1 = s1 + TreeToString(child)
        return s1 + ")"
    End Function
End Module
The node.toString() can return the unique label/hash code/whatever for that node. Then you can just do a substring comparison on the strings returned by the TreeToString function to determine whether the trees are equivalent. For a shorter hash code, just sample the TreeToString output, i.e. take every 5th character.
I think you could do this recursively: Assume you have a hash function h that hashes strings of arbitrary length (e.g. SHA-1). Now, the hash of a tree is the hash of a string that is created as a concatenation of the hash of the current element (you have your own function for that) and hashes of all the children of that node (from recursive calls of the function).
For a binary tree you would have:
Hash( h(node->data) || Hash(node->left) || Hash(node->right) )
You may need to carefully check if tree geometry is properly accounted for. I think that with some effort you could derive a method for which finding collisions for such trees could be as hard as finding collisions in the underlying hash function.
A simple enumeration (in any deterministic order) together with a hash function that depends on when the node is visited should work.
int hash(Node root) {
    ArrayList<Node> worklist = new ArrayList<Node>();
    worklist.add(root);
    int h = 0;
    int n = 0;
    while (!worklist.isEmpty()) {
        Node x = worklist.remove(worklist.size() - 1);
        worklist.addAll(x.children());
        h ^= place_hash(x.hash(), n);
        n++;
    }
    return h;
}

int place_hash(int hash, int place) {
    return (Integer.toString(hash) + "_" + Integer.toString(place)).hashCode();
}
class TreeNode
{
    public static int QualityAgainstPerformance = 3; // tune this for your needs
    public static int PositionMarkConstant = 23498735; // just anything
    public object TargetObject; // the subject of this TreeNode, which has to add its hash code

    IEnumerable<TreeNode> GetChildParticipiants()
    {
        yield return this;
        foreach (var child in Children)
        {
            yield return child;
            foreach (var grandchild in child.GetChildParticipiants())
                yield return grandchild;
        }
    }

    IEnumerable<TreeNode> GetParentParticipiants()
    {
        TreeNode parent = Parent;
        do
            yield return parent;
        while ((parent = parent.Parent) != null);
    }

    public override int GetHashCode()
    {
        int computed = 0;
        var nodesToCombine =
            (Parent != null ? Parent : this).GetChildParticipiants()
                .Take(QualityAgainstPerformance / 2)
                .Concat(GetParentParticipiants().Take(QualityAgainstPerformance / 2));
        foreach (var node in nodesToCombine)
        {
            if (ReferenceEquals(node, this))
                computed = AddToMix(computed, PositionMarkConstant);
            computed = AddToMix(computed, node.GetPositionInParent());
            computed = AddToMix(computed, node.TargetObject.GetHashCode());
        }
        return computed;
    }
}
AddToMix is a function which combines the two hash codes, so the sequence matters.
I don't know exactly what it should be, but you can figure it out; some bit shifting, rounding, you know...
The idea is that you have to analyse some environment of the node, depending on the quality you want to achieve.
I have to say that your requirements are somewhat against the entire concept of hash codes.
A hash function's computational complexity should be very limited.
Its computational complexity should not depend linearly on the size of the container (the tree), otherwise it totally breaks the hashcode-based algorithms.
Considering the position as a major property of the node's hash function also somewhat goes against the concept of the tree, but it is achievable if you relax the requirement that it HAS to depend on the position.
The overall principle I would suggest is replacing MUST requirements with SHOULD requirements.
That way you can come up with an appropriate and efficient algorithm.
For example, consider building a limited sequence of integer hashcode tokens, and add what you want to this sequence, in the order of preference.
Order of the elements in this sequence is important, it affects the computed value.
For example, for each node you want to compute:
add the hash code of the underlying object
add the hash codes of the underlying objects of the nearest siblings, if available. I think even the single left sibling would be enough.
add the hash code of the underlying object of the parent and its nearest siblings, like for the node itself, same as 2.
repeat this with the grandparents to a limited depth.
//--------5------- ancestor depth 2 and its left sibling;
//-------/|------- ;
//------4-3------- ancestor depth 1 and its left sibling;
//-------/|------- ;
//------2-1------- this;
The fact that you are adding a direct sibling's underlying object's hash code gives a positional property to the hash function.
If this is not enough, add the children:
You shouldn't add every child, just some, to get a decent hash code.
Add the first child, and its first child, and its first child... limit the depth to some constant, and do not compute anything recursively - just the underlying node object's hash code.
//----- this;
//-----/--;
//----6---;
//---/--;
//--7---;
This way the complexity is linear in the depth of the underlying tree, not the total number of elements.
Now you have a sequence of integers; combine them with a known algorithm, like Ely suggests above.
1,2,...7
This way, you will have a lightweight hash function with a positional property, not dependent on the total size of the tree, not even dependent on the tree depth, and not requiring the hash of the entire tree to be recomputed when you change the tree structure.
I bet these 7 numbers would give a hash distribution close to perfect.
Writing your own hash function is almost always a bug, because you basically need a degree in mathematics to do it well. Hash functions are incredibly nonintuitive, and have highly unpredictable collision characteristics.
Don't try directly combining hashcodes for Child nodes -- this will magnify any problems in the underlying hash functions. Instead, concatenate the raw bytes from each node in order, and feed this as a byte stream to a tried-and-true hash function. All the cryptographic hash functions can accept a byte stream. If the tree is small, you may want to just create a byte array and hash it in one operation.

In-Place Radix Sort

This is a long text. Please bear with me. Boiled down, the question is: Is there a workable in-place radix sort algorithm?
Preliminary
I've got a huge number of small fixed-length strings that only use the letters “A”, “C”, “G” and “T” (yes, you've guessed it: DNA) that I want to sort.
At the moment, I use std::sort which uses introsort in all common implementations of the STL. This works quite well. However, I'm convinced that radix sort fits my problem set perfectly and should work much better in practice.
Details
I've tested this assumption with a very naive implementation and for relatively small inputs (on the order of 10,000) this was true (well, at least more than twice as fast). However, runtime degrades abysmally when the problem size becomes larger (N > 5,000,000).
The reason is obvious: radix sort requires copying the whole data (more than once in my naive implementation, actually). This means that I've put ~ 4 GiB into my main memory which obviously kills performance. Even if it didn't, I can't afford to use this much memory since the problem sizes actually become even larger.
Use Cases
Ideally, this algorithm should work with any string length between 2 and 100, for DNA as well as DNA5 (which allows an additional wildcard character “N”), or even DNA with IUPAC ambiguity codes (resulting in 16 distinct values). However, I realize that all these cases cannot be covered, so I'm happy with any speed improvement I get. The code can decide dynamically which algorithm to dispatch to.
Research
Unfortunately, the Wikipedia article on radix sort is useless. The section about an in-place variant is complete rubbish. The NIST-DADS section on radix sort is next to nonexistent. There's a promising-sounding paper called Efficient Adaptive In-Place Radix Sorting which describes the algorithm “MSL”. Unfortunately, this paper, too, is disappointing.
In particular, there are the following things.
First, the algorithm contains several mistakes and leaves a lot unexplained. In particular, it doesn’t detail the recursion call (I simply assume that it increments or reduces some pointer to calculate the current shift and mask values). Also, it uses the functions dest_group and dest_address without giving definitions. I fail to see how to implement these efficiently (that is, in O(1); at least dest_address isn’t trivial).
Last but not least, the algorithm achieves in-place-ness by swapping array indices with elements inside the input array. This obviously only works on numerical arrays. I need to use it on strings. Of course, I could just screw strong typing and go ahead assuming that the memory will tolerate my storing an index where it doesn’t belong. But this only works as long as I can squeeze my strings into 32 bits of memory (assuming 32 bit integers). That's only 16 characters (let's ignore for the moment that 16 > log(5,000,000)).
Another paper by one of the authors gives no accurate description at all, but it gives MSL’s runtime as sub-linear which is flat out wrong.
To recap: Is there any hope of finding a working reference implementation or at least a good pseudocode/description of a working in-place radix sort that works on DNA strings?
Well, here's a simple implementation of an MSD radix sort for DNA. It's written in D because that's the language that I use most and therefore am least likely to make silly mistakes in, but it could easily be translated to some other language. It's in-place but requires 2 * seq.length passes through the array.
void radixSort(string[] seqs, size_t base = 0) {
    if(seqs.length == 0)
        return;

    size_t TPos = seqs.length, APos = 0;
    size_t i = 0;
    while(i < TPos) {
        if(seqs[i][base] == 'A') {
            swap(seqs[i], seqs[APos++]);
            i++;
        }
        else if(seqs[i][base] == 'T') {
            swap(seqs[i], seqs[--TPos]);
        } else i++;
    }

    i = APos;
    size_t CPos = APos;
    while(i < TPos) {
        if(seqs[i][base] == 'C') {
            swap(seqs[i], seqs[CPos++]);
        }
        i++;
    }
    if(base < seqs[0].length - 1) {
        radixSort(seqs[0..APos], base + 1);
        radixSort(seqs[APos..CPos], base + 1);
        radixSort(seqs[CPos..TPos], base + 1);
        radixSort(seqs[TPos..seqs.length], base + 1);
    }
}
Obviously, this is kind of specific to DNA, as opposed to being general, but it should be fast.
Edit:
I got curious whether this code actually works, so I tested/debugged it while waiting for my own bioinformatics code to run. The version above now is actually tested and works. For 10 million sequences of 5 bases each, it's about 3x faster than an optimized introsort.
I've never seen an in-place radix sort, and from the nature of radix sort I doubt that it is much faster than an out-of-place sort as long as the temporary array fits into memory.
Reason:
The sorting does a linear read on the input array, but all writes will be nearly random. From a certain N upwards this boils down to a cache miss per write. This cache miss is what slows down your algorithm. Whether it's in place or not will not change this effect.
I know that this will not answer your question directly, but if sorting is a bottleneck you may want to have a look at near sorting algorithms as a preprocessing step (the wiki-page on the soft-heap may get you started).
That could give a very nice cache locality boost. A text-book out-of-place radix sort will then perform better. The writes will still be nearly random but at least they will cluster around the same chunks of memory and as such increase the cache hit ratio.
I have no idea if it works out in practice though.
Btw: If you're dealing with DNA strings only, you can compress a char into two bits and pack your data quite a lot. This will cut the memory requirement down by a factor of four over a naive representation. Addressing becomes more complex, but the ALU of your CPU has lots of time to spend during all the cache misses anyway.
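A sketch of the two-bit packing in C++ (assumes the input contains only the characters A, C, G and T):

#include <cstdint>
#include <string>
#include <vector>

// Pack a DNA string into 2 bits per base: A=0, C=1, G=2, T=3.
std::vector<std::uint8_t> pack(const std::string& seq) {
    std::vector<std::uint8_t> out((seq.size() + 3) / 4, 0);
    for (std::size_t i = 0; i < seq.size(); ++i) {
        std::uint8_t code = 0;
        switch (seq[i]) {
            case 'A': code = 0; break;
            case 'C': code = 1; break;
            case 'G': code = 2; break;
            case 'T': code = 3; break;
        }
        out[i / 4] |= code << (2 * (i % 4));   // four bases per byte
    }
    return out;
}

// Base i can be read back with: (out[i / 4] >> (2 * (i % 4))) & 3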
You can certainly drop the memory requirements by encoding the sequence in bits.
You are looking at permutations so, for length 2, with "ACGT" that's 16 states, or 4 bits.
For length 3, that's 64 states, which can be encoded in 6 bits. So it looks like 2 bits for each letter in the sequence, or about 32 bits for 16 characters like you said.
If there is a way to reduce the number of valid 'words', further compression may be possible.
So for sequences of length 3, one could create 64 buckets, maybe sized uint32, or uint64.
Initialize them to zero.
Iterate through your very very large list of 3 char sequences, and encode them as above.
Use this as a subscript, and increment that bucket.
Repeat this until all of your sequences have been processed.
Next, regenerate your list.
Iterate through the 64 buckets in order; for the count found in each bucket, generate that many instances of the sequence represented by that bucket.
When all of the buckets have been iterated, you have your sorted array.
A sequence of 4, adds 2 bits, so there would be 256 buckets.
A sequence of 5, adds 2 bits, so there would be 1024 buckets.
At some point the number of buckets will approach your limits.
If you read the sequences from a file, instead of keeping them in memory, more memory would be available for buckets.
I think this would be faster than doing the sort in situ as the buckets are likely to fit within your working set.
Here is a hack that shows the technique
#include <iostream>
#include <iomanip>
#include <math.h>
#include <string.h>
#include <stdlib.h>

using namespace std;

const int width = 3;
const int bucketCount = exp(width * log(4)) + 1;
int *bucket = NULL;

const char charMap[4] = {'A', 'C', 'G', 'T'};

void setup(void)
{
    bucket = new int[bucketCount];
    memset(bucket, '\0', bucketCount * sizeof(bucket[0]));
}

void teardown(void)
{
    delete[] bucket;
}

void show(int encoded)
{
    int z;
    int y;
    int j;
    for (z = width - 1; z >= 0; z--)
    {
        int n = 1;
        for (y = 0; y < z; y++)
            n *= 4;
        j = encoded % n;
        encoded -= j;
        encoded /= n;
        cout << charMap[encoded];
        encoded = j;
    }
    cout << endl;
}

int main(void)
{
    // Sort this sequence
    const char *testSequence = "CAGCCCAAAGGGTTTAGACTTGGTGCGCAGCAGTTAAGATTGTTT";
    size_t testSequenceLength = strlen(testSequence);

    setup();

    // load the sequences into the buckets
    size_t z;
    for (z = 0; z < testSequenceLength; z += width)
    {
        int encoding = 0;
        size_t y;
        for (y = 0; y < width; y++)
        {
            encoding *= 4;
            switch (*(testSequence + z + y))
            {
                case 'A' : encoding += 0; break;
                case 'C' : encoding += 1; break;
                case 'G' : encoding += 2; break;
                case 'T' : encoding += 3; break;
                default : abort();
            };
        }
        bucket[encoding]++;
    }

    /* show the sorted sequences */
    for (z = 0; z < bucketCount; z++)
    {
        while (bucket[z] > 0)
        {
            show(z);
            bucket[z]--;
        }
    }

    teardown();
    return 0;
}
If your data set is so big, then I would think that a disk-based buffer approach would be best:
sort(List<string> elements, int prefix)
    if (elements.Count < THRESHOLD)
        return InMemoryRadixSort(elements, prefix)
    else
        return DiskBackedRadixSort(elements, prefix)

DiskBackedRadixSort(elements, prefix)
    DiskBackedBuffer<string>[] buckets
    foreach (element in elements)
        buckets[element.MSB(prefix)].Add(element);

    List<string> ret
    foreach (bucket in buckets)
        ret.Add(sort(bucket, prefix + 1))

    return ret
I would also experiment with grouping into a larger number of buckets, for instance, if your string was:
GATTACA
the first MSB call would return the bucket for GATT (256 total buckets), that way you make fewer branches of the disk based buffer. This may or may not improve performance, so experiment with it.
I'm going to go out on a limb and suggest you switch to a heap/heapsort implementation. This suggestion comes with some assumptions:
You control the reading of the data
You can do something meaningful with the sorted data as soon as you 'start' getting it sorted.
The beauty of the heap/heap-sort is that you can build the heap while you read the data, and you can start getting results the moment you have built the heap.
Let's step back. If you are so fortunate that you can read the data asynchronously (that is, you can post some kind of read request and be notified when some data is ready), then you can build a chunk of the heap while you are waiting for the next chunk of data to come in - even from disk. Often, this approach can bury most of the cost of half of your sorting behind the time spent getting the data.
Once you have the data read, the first element is already available. Depending on where you are sending the data, this can be great. If you are sending it to another asynchronous reader, or some parallel 'event' model, or UI, you can send chunks and chunks as you go.
That said - if you have no control over how the data is read, and it is read synchronously, and you have no use for the sorted data until it is entirely written out - ignore all this. :(
See the Wikipedia articles:
Heapsort
Binary heap
"Radix sorting with no extra space" is a paper addressing your problem.
Performance-wise you might want to look at more general string-comparison sorting algorithms.
Currently you wind up touching every character of every string, but you can do better!
In particular, a burst sort is a very good fit for this case. As a bonus, since burstsort is based on tries, it works ridiculously well for the small alphabet sizes used in DNA/RNA, since you don't need to build any sort of ternary search node, hash or other trie node compression scheme into the trie implementation. The tries may be useful for your suffix-array-like final goal as well.
A decent general purpose implementation of burstsort is available on source forge at http://sourceforge.net/projects/burstsort/ - but it is not in-place.
For comparison purposes, The C-burstsort implementation covered at http://www.cs.mu.oz.au/~rsinha/papers/SinhaRingZobel-2006.pdf benchmarks 4-5x faster than quicksort and radix sorts for some typical workloads.
You'll want to take a look at Large-scale Genome Sequence Processing by Drs. Kasahara and Morishita.
Strings comprised of the four nucleotide letters A, C, G, and T can be specially encoded into Integers for much faster processing. Radix sort is among many algorithms discussed in the book; you should be able to adapt the accepted answer to this question and see a big performance improvement.
You might try using a trie. Sorting the data is simply iterating through the dataset and inserting it; the structure is naturally sorted, and you can think of it as similar to a B-Tree (except instead of making comparisons, you always use pointer indirections).
Caching behavior will favor all of the internal nodes, so you probably won't improve upon that; but you can fiddle with the branching factor of your trie as well (ensure that every node fits into a single cache line, allocate trie nodes similar to a heap, as a contiguous array that represents a level-order traversal). Since tries are also digital structures (O(k) insert/find/delete for elements of length k), you should have competitive performance to a radix sort.
I would burstsort a packed-bit representation of the strings. Burstsort is claimed to have much better locality than radix sorts, keeping the extra space usage down with burst tries in place of classical tries. The original paper has measurements.
It looks like you've solved the problem, but for the record, it appears that one version of a workable in-place radix sort is the "American Flag Sort". It's described here: Engineering Radix Sort. The general idea is to do 2 passes on each character - first count how many of each you have, so you can subdivide the input array into bins. Then go through again, swapping each element into the correct bin. Now recursively sort each bin on the next character position.
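A sketch of one such pass in C++, specialized to the four DNA letters (count, compute bin boundaries, then cycle elements into their bins; recursing on each bin with pos + 1 would complete the sort):

#include <array>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// One American-flag-sort pass on character position pos, for strings over A, C, G, T.
void flagPass(std::vector<std::string>& a, std::size_t pos) {
    auto bin = [](char c) { return c == 'A' ? 0 : c == 'C' ? 1 : c == 'G' ? 2 : 3; };

    std::array<std::size_t, 4> count{};                  // pass 1: count each bin
    for (const auto& s : a) ++count[bin(s[pos])];

    std::array<std::size_t, 5> start{};                  // bin boundaries
    for (int b = 0; b < 4; ++b) start[b + 1] = start[b] + count[b];
    std::array<std::size_t, 4> next = {start[0], start[1], start[2], start[3]};

    for (int b = 0; b < 4; ++b) {                        // pass 2: swap into place
        while (next[b] < start[b + 1]) {
            int d = bin(a[next[b]][pos]);
            if (d == b) ++next[b];                       // already in its bin
            else std::swap(a[next[b]], a[next[d]++]);    // send it to its bin's fill point
        }
    }
    // Recurse on each [start[b], start[b+1]) range with pos + 1 to finish the sort.
}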
Radix-Sort is not cache conscious and is not the fastest sort algorithm for large sets.
You can look at:
ti7qsort. ti7qsort is the fastest sort for integers (can be used for small-fixed size strings).
Inline QSORT
String sorting
You can also use compression and encode each letter of your DNA into 2 bits before storing into the sort array.
dsimcha's MSB radix sort looks nice, but Nils gets closer to the heart of the problem with the observation that cache locality is what's killing you at large problem sizes.
I suggest a very simple approach:
Empirically estimate the largest size m for which a radix sort is efficient.
Read blocks of m elements at a time, radix sort them, and write them out (to a memory buffer if you have enough memory, but otherwise to file), until you exhaust your input.
Mergesort the resulting sorted blocks.
Mergesort is the most cache-friendly sorting algorithm I'm aware of: "Read the next item from either array A or B, then write an item to the output buffer." It runs efficiently on tape drives. It does require 2n space to sort n items, but my bet is that the much-improved cache locality you'll see will make that unimportant -- and if you were using a non-in-place radix sort, you needed that extra space anyway.
Please note finally that mergesort can be implemented without recursion, and in fact doing it this way makes clear the true linear memory access pattern.
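A sketch of the final merge step in C++ using a k-way merge with a priority queue (the blocks are assumed to be already sorted and in memory; a file-backed variant would stream items instead):

#include <functional>
#include <queue>
#include <string>
#include <utility>
#include <vector>

// K-way merge of already-sorted blocks into one sorted output.
std::vector<std::string> mergeBlocks(const std::vector<std::vector<std::string>>& blocks) {
    using Item = std::pair<std::string, std::pair<std::size_t, std::size_t>>; // value, (block, index)
    std::priority_queue<Item, std::vector<Item>, std::greater<Item>> heap;    // min-heap on value

    for (std::size_t b = 0; b < blocks.size(); ++b)
        if (!blocks[b].empty())
            heap.push({blocks[b][0], {b, 0}});

    std::vector<std::string> out;
    while (!heap.empty()) {
        auto [value, pos] = heap.top();
        heap.pop();
        out.push_back(value);
        auto [b, i] = pos;
        if (i + 1 < blocks[b].size())             // pull the next item from the same block
            heap.push({blocks[b][i + 1], {b, i + 1}});
    }
    return out;
}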
First, think about the coding of your problem. Get rid of the strings, replace them by a binary representation. Use the first byte to indicate length+encoding. Alternatively, use a fixed length representation at a four-byte boundary. Then the radix sort becomes much easier. For a radix sort, the most important thing is to not have exception handling at the hot spot of the inner loop.
OK, I thought a bit more about the 4-nary problem. You want a solution like a Judy tree for this. The next solution can handle variable length strings; for fixed length just remove the length bits, that actually makes it easier.
Allocate blocks of 16 pointers. The least significant bit of the pointers can be reused, as your blocks will always be aligned. You might want a special storage allocator for it (breaking up large storage into smaller blocks). There are a number of different kinds of blocks:
Encoding with 7 length bits of variable-length strings. As they fill up, you replace them by:
Position encodes the next two characters, you have 16 pointers to the next blocks, ending with:
Bitmap encoding of the last three characters of a string.
For each kind of block, you need to store different information in the LSBs. As you have variable length strings you need to store end-of-string too, and the last kind of block can only be used for the longest strings. The 7 length bits should be replaced by less as you get deeper into the structure.
This provides you with a reasonably fast and very memory efficient storage of sorted strings. It will behave somewhat like a trie. To get this working, make sure to build enough unit tests. You want coverage of all block transitions. You want to start with only the second kind of block.
For even more performance, you might want to add different block types and a larger size of block. If the blocks are always the same size and large enough, you can use even fewer bits for the pointers. With a block size of 16 pointers, you already have a byte free in a 32-bit address space. Take a look at the Judy tree documentation for interesting block types. Basically, you add code and engineering time for a space (and runtime) trade-off
You probably want to start with a 256 wide direct radix for the first four characters. That provides a decent space/time tradeoff. In this implementation, you get much less memory overhead than with a simple trie; it is approximately three times smaller (I haven't measured). O(n) is no problem if the constant is low enough, as you noticed when comparing with the O(n log n) quicksort.
Are you interested in handling duplicates? With short sequences, there are going to be some. Adapting the blocks to handle counts is tricky, but it can be very space-efficient.
While the accepted answer perfectly answers the description of the problem, I've reached this place looking in vain for an algorithm to partition inline an array into N parts. I've written one myself, so here it is.
Warning: this is not a stable partitioning algorithm, so for multilevel partitioning, one must repartition each resulting partition instead of the whole array. The advantage is that it is inline.
The way it helps with the question posed is that you can repeatedly partition inline based on a letter of the string, then sort the partitions when they are small enough with the algorithm of your choice.
function partitionInPlace(input, partitionFunction, numPartitions, startIndex=0, endIndex=-1) {
    if (endIndex === -1) endIndex = input.length;
    const starts = Array.from({ length: numPartitions + 1 }, () => 0);
    for (let i = startIndex; i < endIndex; i++) {
        const val = input[i];
        const partByte = partitionFunction(val);
        starts[partByte]++;
    }
    let prev = startIndex;
    for (let i = 0; i < numPartitions; i++) {
        const p = prev;
        prev += starts[i];
        starts[i] = p;
    }
    const indexes = [...starts];
    starts[numPartitions] = prev;
    let bucket = 0;
    while (bucket < numPartitions) {
        const start = starts[bucket];
        const end = starts[bucket + 1];
        if (end - start < 1) {
            bucket++;
            continue;
        }
        let index = indexes[bucket];
        if (index === end) {
            bucket++;
            continue;
        }
        let val = input[index];
        let destBucket = partitionFunction(val);
        if (destBucket === bucket) {
            indexes[bucket] = index + 1;
            continue;
        }
        let dest;
        do {
            dest = indexes[destBucket] - 1;
            let destVal;
            let destValBucket = destBucket;
            while (destValBucket === destBucket) {
                dest++;
                destVal = input[dest];
                destValBucket = partitionFunction(destVal);
            }
            input[dest] = val;
            indexes[destBucket] = dest + 1;
            val = destVal;
            destBucket = destValBucket;
        } while (dest !== index)
    }
    return starts;
}
