Finding all subsequences from dictionary - algorithm

In a program I need to efficiently answer queries of the following form:
Given a set of strings A and a query string q return all s ∈ A such that s is a subsequence of q
For example, given A = {"abc", "aaa", "abd"} and q = "abcd", "abc" and "abd" should be returned.
Is there any better way than iterating each element of A and checking if it is a subsequence of q?
NOTE: I have a STRIPS planner (an automated planner) in mind. Each state in a STRIPS planner is a set of propositions like {"(room rooma)", "(at-robby rooma)", "(at ball1 rooma)"}. I want to find all ground actions applicable to a given state. Actions in a STRIPS planner basically consist of two parts, preconditions and effects (which are not really relevant here). Preconditions are a set of propositions that must be true to apply an action to a state. For example, to apply the action "(move rooma roomb)", its preconditions, {"(room rooma)", "(room roomb)", "(at-robby rooma)"}, must all be true in the state.

If your set A is large and you have many queries, you could implement a trie-like structure, where level n refers to character n in a string. In your example:
trie = {
    a: {
        a: {
            a: { value: "aaa" }
        },
        b: {
            c: { value: "abc" },
            d: { value: "abd" }
        }
    }
}
That would enable you to look up matches in a forked path through the trie:
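function query(trie, q) {
    s = Set();
    if (q.isEmpty()) {
        if (trie.value) s.add(trie.value);
    } else {
        // Either skip the first character of q ...
        s = s.union(query(trie, substr(q, 1)));
        c = substr(q, 0, 1);
        if (trie[c]) {
            // ... or consume it by descending into the matching child.
            s = s.union(query(trie[c], substr(q, 1)));
        }
    }
    return s;
}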
Effectively, you will generate all 2^m subsequences of the query string of m characters, but in practice, the trie is sparse and you end up checking fewer paths.
The speed payoff comes with many lookups. Building the trie is more costly than doing a single brute-force lookup. But if you build the trie only once, or have a means to update the trie when you update the set A, you will get good lookup performance.
The actual data structure for the trie nodes depends on how many possible elements the items can have. In your example, only four letters are used. If you have a limited range of "letters", you can use an array. Otherwise you might need a sort of dictionary, which might make the tree quite big in memory.
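As a rough illustration, here is a minimal Python sketch of a trie-based lookup. It is a variant of the idea above rather than a transcription of the pseudocode: instead of forking on every character of q, it walks the trie's children and jumps to the earliest remaining occurrence of each child's character in q, so each trie node is visited at most once. The names build_trie and subsequences_in, and the "$" end-of-word marker, are assumptions made for this example (words containing "$" would need a different marker).

```python
def build_trie(words):
    """Build a nested-dict trie; the key "$" (an arbitrary marker) holds a complete word."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node["$"] = word
    return root


def subsequences_in(trie, q):
    """Return every word stored in the trie that is a subsequence of q."""
    found = set()

    def walk(node, start):
        if "$" in node:
            found.add(node["$"])
        for ch, child in node.items():
            if ch == "$":
                continue
            # Taking the earliest occurrence of ch leaves the most room
            # for the remaining characters of the word.
            pos = q.find(ch, start)
            if pos != -1:
                walk(child, pos + 1)

    walk(trie, 0)
    return found


print(subsequences_in(build_trie(["abc", "aaa", "abd"]), "abcd"))
```

For long query strings, the repeated q.find calls could be replaced by a precomputed "next occurrence" table, at the cost of extra memory.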


Find the Most Frequent Ordered Word Pair In a Document

This is a problem from S. Skiena's book "The Algorithm Design Manual"; the problem statement is:
Give an algorithm for finding an ordered word pair (e.g. "New York")
occurring with the greatest frequency in a given webpage.
Which data structure would you use? Optimize both time and space.
One obvious solution is inserting each ordered pair in a hash-map and then iterating over all of them, to find the most frequent one, however, there definitely should be a better way, can anyone suggest anything?
In a text with n words, we have exactly n - 1 ordered word pairs (not distinct of course). One solution is to use a max priority queue; we simply insert each pair in the max PQ with frequency 1 if not already present. If present, we increment the key. However, if we use a Trie, we don't need to represent all n - 1 pairs separately. Take for example the following text:
A new puppy in New York is happy with it's New York life.
The resulting Trie would look like the following:
If we store the number of occurrences of a pair in the leaf nodes, we could easily compute the maximum occurrence in linear time. Since we need to look at each word, that's the best we can do, time wise.
Working Scala code below. The official site has a solution in Python.
import scala.annotation.tailrec
import scala.collection.mutable.{ListBuffer, Map => MutableMap}

class TrieNode(val parent: Option[TrieNode] = None,
               val children: MutableMap[Char, TrieNode] = MutableMap.empty,
               var n: Int = 0) {
  // Returns the child for c (creating it if needed) and counts one more visit.
  def add(c: Char): TrieNode = {
    val child = children.getOrElseUpdate(c, new TrieNode(parent = Some(this)))
    child.n += 1
    child
  }
  // Looks up the character on the edge from node's parent to node.
  def letter(node: TrieNode): Char = {
    node.parent
      .flatMap(_.children.find(_._2 eq node))
      .map(_._1)
      .getOrElse('\u0000')
  }
  // Rebuilds the text stored on the path from the root down to this node.
  override def toString: String = {
    Iterator
      .iterate((ListBuffer.empty[Char], Option(this))) {
        case (buffer, node) =>
          node.filter(_.parent.isDefined).foreach(nd => letter(nd) +=: buffer)
          (buffer, node.flatMap(_.parent))
      }
      .dropWhile(_._2.isDefined)
      .next()._1.mkString
  }
}

def mostCommonPair(text: String): (String, Int) = {
  val root = new TrieNode()
  @tailrec
  def loop(s: String,
           mostCommon: TrieNode,
           count: Int,
           parent: TrieNode): (String, Int) = {
    s.split("\\s+", 2) match {
      case Array(head, tail @ _*) if head.nonEmpty =>
        val word = head.foldLeft(parent)((tn, c) => tn.add(c))
        val (common, n, p) =
          if (parent eq root) (mostCommon, count, word.add(' '))
          else if (word.n > count) (word, word.n, root)
          else (mostCommon, count, root)
        loop(tail.headOption.getOrElse(""), common, n, p)
      case _ => (mostCommon.toString, count)
    }
  }
  loop(text, new TrieNode(), -1, root)
}
Inspired by the question here.
I think the first point to note is that finding the most frequent ordered word pair is no more (or less) difficult than finding the most frequent word. The only difference is that instead of words made up of the letters a..z+A..Z separated by punctuation or spaces, you are looking for word-pairs made up of the letters a..z+A..Z+exactly_one_space, similarly separated by punctuation or spaces.
If your web-page has n words then there are only n-1 word-pairs. So hashing each word-pair and then iterating over the hash table will be O(n) in both time and memory. This should be pretty quick to do even if n is ~10^6 (i.e. the length of an average novel). I can't imagine anything more efficient unless n is fairly small, in which case the memory savings resulting from constructing an ordered list of word pairs (instead of a hash table) might outweigh the cost of increasing the time complexity to O(n log n).
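A minimal Python sketch of the hash-table approach follows. It assumes that words are simply whitespace-separated tokens (punctuation handling is left out), and the function name most_common_pair is made up for this example.

```python
from collections import Counter


def most_common_pair(text):
    """Return (pair, count) for the most frequent ordered word pair, or None."""
    words = text.split()
    # zip(words, words[1:]) yields the n - 1 adjacent ordered pairs.
    pairs = Counter(zip(words, words[1:]))
    if not pairs:
        return None
    (first, second), count = pairs.most_common(1)[0]
    return f"{first} {second}", count


print(most_common_pair("A new puppy in New York is happy with it's New York life."))
```

Both the counting pass and the final scan for the maximum are linear in the number of words, matching the O(n) estimate above.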
Why not keep all the ordered pairs in an AVL tree, together with a 10-element array that tracks the top 10 ordered pairs? The AVL tree would hold every ordered pair with its occurrence count, and the array would hold the current top 10. This way, searching for any ordered pair would be O(log N) and traversing would be O(N).
I think we could not do better than O(n) in terms of time as one would have to see at least each element once. So time complexity cannot be optimised further.
But we can use a trie to optimise the space used. In a page, words are often repeated, so this might lead to a significant reduction in space usage. The leaf nodes in the trie could store the frequency of the ordered pair, and two pointers could be used to iterate over the text, one pointing at the current word and the other at the previous word.

Given a dictionary and a list of letters find all valid words that can be built with the letters

The brute force way can solve the problem in O(n!), basically calculating all the permutations and checking the results in a dictionary. I am looking for ways to improve the complexity. I can think of building a tree out of the dictionary but still checking all letters permutations is O(n!). Are there better ways to solve this problem?
Letters can have duplicates.
The api for the function looks like this:
List<String> findValidWords(Dict dict, char letters[])
Assume that letters only contains letters from a to z.
Use an integer array to count the number of occurrence of a character in letters.
For each word in the dictionary, check if there is a specific character in the word that appears more than allowed, if not, add this word into result.
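List<String> findValidWords(List<String> dict, char letters[]){
    // Count how many copies of each letter are available.
    int []avail = new int[26];
    for(char c : letters){
        int index = c - 'a';
        avail[index]++;
    }
    List<String> result = new ArrayList();
    for(String word: dict){
        int []count = new int[26];
        boolean ok = true;
        for(char c : word.toCharArray()){
            int index = c - 'a';
            count[index]++;
            // The word needs more copies of this letter than we have.
            if(count[index] > avail[index]){
                ok = false;
                break;
            }
        }
        if(ok){
            result.add(word);
        }
    }
    return result;
}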
So the time complexity is O(m*k), where m is the number of words in the dictionary and k is the maximum number of characters in a word.
You can sort the letters of each word in your dictionary alphabetically, and then build a trie out of the sorted words, where each node contains a list of all words that can be made out of exactly the letters on the path to it. This takes linear time in the total letter length of the dictionary. Then, given a set of query letters, sort the letters the same way and proceed through the trie using depth-first search in all possible directions that use a subset of your letters from left to right. Any time you reach a node in the trie that contains words, output those words. Each path you explore can be charged to at least one word in the dictionary, so the worst-case complexity to find all nodes that contain words you can make is O(kn), where n is the number of words in the dictionary and k is the maximum number of letters in a word. However, for somewhat restricted sets of query letters, the running time should be much faster per query.
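The sorted-letter trie can be sketched in Python roughly as follows. This is only one possible reading of the description; the dictionary-based node layout and the names build_sorted_trie and buildable_words are assumptions made for the example.

```python
def build_sorted_trie(words):
    """Trie keyed on each word's letters in sorted order."""
    root = {"words": [], "next": {}}
    for word in words:
        node = root
        for ch in sorted(word):
            node = node["next"].setdefault(ch, {"words": [], "next": {}})
        node["words"].append(word)
    return root


def buildable_words(trie, letters):
    """Depth-first search using the sorted query letters from left to right."""
    letters = sorted(letters)
    out = []

    def dfs(node, start):
        out.extend(node["words"])
        prev = None
        for i in range(start, len(letters)):
            ch = letters[i]
            if ch == prev:
                continue  # an equal letter at this depth was already tried
            prev = ch
            child = node["next"].get(ch)
            if child is not None:
                dfs(child, i + 1)

    dfs(trie, 0)
    return out


trie = build_sorted_trie(["ape", "apples", "lap", "soap", "bar", "sole", "elapse"])
print(sorted(buildable_words(trie, "apleso")))
```

Because the letters are sorted and duplicates at the same depth are skipped, each trie node is reached at most once per query.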
Here is the algorithm that will find all words that can be formed from a set of letters in O(1). We will represent words with their spectra and store them in a prefix tree (aka trie).
General Description
The spectrum of a word W is an array S of size N, such that S(i) is the number of occurrences (aka frequency) of an A(i) letter in the word W, where A(i) is the i-th letter of a chosen alphabet and N is its size.
For example, in the English alphabet, A(0) is A, A(1) is B, ... , A(25) is Z. A spectrum of the word aha is <2,0,0,0,0,0,0,1,0,...,0>.
We will store the dictionary in a prefix trie, using spectrum as a key. The first token of a key is the frequency of letter A, the second is the frequency of letter B and so on. (From here and below we will use the English alphabet as an example).
Once formed, our dictionary will be a tree with the height 26 and width that varies with each level, depending on a popularity of the letter. Basically, each layer will have a number of subtrees that is equal to the maximum word frequency of this letter in the provided dictionary.
Since our task is not only to decide whether we can build a word from the provided set of characters but also to find these words (a search problem), then we need to attach the words to their spectra (as spectral transformation is not invertible, consider spectra of words read and dear). We will attach a word to the end of each path that represents its spectrum.
To find whether we can build a word from a provided set, we will build a spectrum of the set and find all paths in the prefix trie with the frequencies bounded by the corresponding frequencies of the set's spectrum. (Note that we do not require all letters from the set to be used, so if a word uses fewer letters, we can still build it. Basically, our requirement is that for every letter in the word, the frequency of the letter must be less than or equal to the frequency of the same letter in the provided set).
The complexity of the search procedure doesn't depend on the length of the dictionary or the length of the provided set. On average, it is equal to 26 times the average frequency of a letter. Given the English alphabet, it is a quite small constant factor. For other alphabets, it might not be the case.
Reference implementation
I will provide a reference implementation of an algorithm in OCaml.
The dictionary data type is recursive:
type t = {
  dict : t Int.Map.t;
  data : string list;
}
(Note: this is not the best representation; it is probably better to represent it as a sum type, e.g., type t = Dict of t Int.Map.t | Data of string list, but I found it easier to implement with the above representation).
We can generalize the algorithm by a spectrum function, either using a functor, or by just storing the spectrum function in the dictionary, but for the simplicity, we will just hardcode the English alphabet in the ASCII representation,
let spectrum word =
let index c = Char.(to_int (uppercase c) - to_int 'A') in
let letters = Char.(to_int 'Z' - to_int 'A' + 1) in
Array.init letters ~f:(fun i ->
String.count word ~f:(fun c -> index c = i))
Next, we will define the add_word function of type dict -> string -> dict, that will add a new path to our dictionary, by decomposing a word to its spectrum, and adding each constituent. Each addition will require exactly 26 iterations, not including the spectrum computation. Note, the implementation is purely functional, and doesn't use any imperative features. Every time the function add_word returns a new data structure.
let add_word dict word =
  let count = spectrum word in
  let rec add {dict; data} i =
    if i < Array.length count then {
      data;
      dict = Map.update dict count.(i) ~f:(function
          | None -> add empty (i+1)
          | Some sub -> add sub (i+1))
    } else {empty with data = word :: data} in
  add dict 0
We are using the following definition of the empty value in the add function:
let empty = {dict = Int.Map.empty; data=[]}
Now let's define the is_buildable function of type dict -> string -> bool that will decide whether the given set of characters can be used to build any word in the dictionary. Although we could express it via the search, by checking the size of the found set, we would still prefer to have a specialized implementation, as it is more efficient and easier to understand. The definition of the function closely follows the general description provided above. Basically, for every character in the alphabet, we check whether there is an entry in the dictionary with a frequency that is less than or equal to the frequency in the building set. If we get through all the letters, then we have proved that we can build at least one word with the given set.
let is_buildable dict set =
let count = spectrum set in
let rec find {dict} i =
i >= Array.length count ||
Sequence.range 0 count.(i) ~stop:`inclusive |>
Sequence.exists ~f:(fun cnt -> match Map.find dict cnt with
| None -> false
| Some dict -> find dict (i+1)) in
find dict 0
Now, let's actually find the set of all words, that are buildable from the provided set:
let build dict set =
let count = spectrum set in
let rec find {dict; data} i =
if i < Array.length count then
Sequence.range 0 count.(i) ~stop:`inclusive |>
Sequence.concat_map ~f:(fun cnt -> match Map.find dict cnt with
| None -> Sequence.empty
| Some dict -> find dict (i+1))
else Sequence.of_list data in
find dict 0
We will basically follow the structure of the is_buildable function, except that instead of proving that such a frequency exists for each letter, we will collect all the proofs by reaching the end of the path and grabbing the set of words attached to it.
Testing and example
For the sake of completeness, we will test it by creating a small program that reads a dictionary, with each word on a separate line, and interacts with a user by asking for a set and printing the resulting set of words that can be built from it.
module Test = struct
  let run () =
    let dict =
      In_channel.(with_file Sys.argv.(1)
                    ~f:(fold_lines ~init:empty ~f:add_word)) in
    let prompt () =
      printf "Enter characters and hit enter (or Ctrl-D to stop): %!" in
    prompt ();
    In_channel.iter_lines stdin ~f:(fun set ->
        build dict set |> Sequence.iter ~f:print_endline;
        prompt ())
end
Here is an example of an interaction that uses the /usr/share/dict/american-english dictionary available on my machine (Ubuntu Trusty).
./scrabble.native /usr/share/dict/american-english
Enter characters and hit enter (or Ctrl-D to stop): read
Enter characters and hit enter (or Ctrl-D to stop):
(Yep, the dictionary contains words like r and d that are probably not true English words. In fact, the dictionary has a word for each letter, so we can basically build a word from any non-empty set of alphabet letters).
The full implementation along with the building instructions can be found on Gist
A better way to do this is to loop through all the words in the dictionary and see if the word can be built with the letters in the array.
"Sign" the letters available by sorting them in order; that's O(m log m), where m is the number of letters.
"Sign" each word in the dictionary by sorting the letters of the word in order; that's O(k log k), where k is the length of the word.
Compare the letter signature to each word signature; that's O(min(m, k) * n), where n is the number of words in the dictionary. Output any word that matches.
Assuming an English word list of approximately a quarter-million words, and no more than about half a dozen letters, that should be nearly instantaneous.
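The signature comparison could look roughly like this in Python. The sketch assumes that "matches" means the word's signature fits inside the letters' signature (so that words using only some of the letters are reported too); the names sign, fits and find_valid_words are made up for this example.

```python
def sign(s):
    """Signature of a string: its characters in sorted order."""
    return "".join(sorted(s))


def fits(word_sig, letter_sig):
    """True if the sorted word signature is a sub-multiset of the sorted letters."""
    i = 0
    for ch in letter_sig:
        if i == len(word_sig):
            break
        if word_sig[i] == ch:
            i += 1
        elif word_sig[i] < ch:
            # Both signatures are sorted, so the needed letter cannot appear later.
            return False
    return i == len(word_sig)


def find_valid_words(words, letters):
    letter_sig = sign(letters)
    return [w for w in words if fits(sign(w), letter_sig)]


print(find_valid_words(["ape", "apples", "lap", "soap", "bar"], "apleso"))
```

In a real program the word signatures would be computed once and stored alongside the dictionary instead of being recomputed for every query.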
I was recently asked the same question in BankBazaar interview. I was given the option to (he said that in a very subtle manner) pre-process the dictionary in any way I want.
My first thought was to arrange the dictionary in a trie or ternary search tree, and make all the words from the letters given. However optimized, that would take n! + (n-1)! + (n-2)! + (n-3)! + ... + n word checks (n being the number of letters) in the worst case, which was not acceptable.
The other way could be to check, for every dictionary word, whether it can be made from the given letters. However optimized, this would take noOfDictionaryWords(m) * average size of dictionary words(k) in the worst case, which was again not acceptable.
Now I have n! + (n-1)! + (n-2)! + ... + n words which I have to check in the dictionary, and I don't want to check them all. So in which situations do I only have to check a subset of them, and how should they be grouped?
If I have to check only combinations and not permutations, the count drops to 2^n.
So I have to pre-process the dictionary words in such a way that if I pass a combination, all its anagrams are printed.
A data structure something like this:
A hashvalue made by the letters(irrespective of its positions and permutation), pointing to list containing all the words made by those letters, then we only need to check that hashvalue.
I gave the answer to make the hash value by assigning a prime value to all the alphabets and while calculating the hash value of a word, multiply all the assigned values. This will create a problem of having really big hash values given that 26th prime is 101, and many null values in the map taking space. We could optimize it a bit by rather than starting lexicographically with a = 2, b = 3, c = 5, d = 7.... z = 101, we search for the most used alphabets and assign them small values, like vowels, and 's', 't' etc.
The interviewer accepted it, but was not expecting the answer, so there is definitely another answer, for better or worse but there is.
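A small Python sketch of the prime-product idea, for illustration only. It uses the plain lexicographic assignment a = 2, b = 3, ..., z = 101 and relies on Python's arbitrary-precision integers, so the "really big hash values" concern does not arise here; the names prime_key, build_index and find_words are assumptions for the example.

```python
from itertools import combinations

# The first 26 primes, assigned to 'a'..'z' in alphabetical order.
PRIMES = [2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41,
          43, 47, 53, 59, 61, 67, 71, 73, 79, 83, 89, 97, 101]


def prime_key(chars):
    """Product of the primes of all characters; identical for all anagrams."""
    key = 1
    for ch in chars:
        key *= PRIMES[ord(ch) - ord("a")]
    return key


def build_index(words):
    """Map each key to the list of dictionary words that share it."""
    index = {}
    for word in words:
        index.setdefault(prime_key(word), []).append(word)
    return index


def find_words(index, letters):
    """Look up every combination of the letters (up to 2^n keys)."""
    found = set()
    for size in range(1, len(letters) + 1):
        for combo in combinations(letters, size):
            found.update(index.get(prime_key(combo), ()))
    return found


index = build_index(["ape", "pea", "lap", "apples"])
print(sorted(find_words(index, "aple")))
```

Since the product of primes is unique for each multiset of letters, two words share a key exactly when they are anagrams of each other.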
Swift 4
func findValidWords(in dictionary: [String], with letters: [Character]) -> [String] {
    var validWords = [String]()
    for word in dictionary {
        var temp = word
        for char in letters {
            // Strip every occurrence of this letter from the word.
            temp = temp.filter({ $0 != char })
            if temp.isEmpty {
                validWords.append(word)
                break
            }
        }
    }
    return validWords
}
print(findValidWords(in: ["ape", "apples", "orange", "elapse", "lap", "soap", "bar", "sole"], with: ["a","p","l","e","s","o"]))
Output => ["ape", "apples", "elapse", "lap", "soap", "sole"]
My English is not good so try to understand.
My approach is using bit/bitwise to increase speed. Still bruteforce, though.
We only consider distinct character in each word and mark its existence. English has 26 characters, so we need 26 bits. Integer is 32 bits. That's enough.
Now encode each words in dictionary to an integer number.
abcdddffg -> 123444667 -> 123467 (only distinct characters) -> 1111011 (bits) -> 123 (decimal number)
So 2,000,000 words will be converted into 2,000,000 integer numbers.
Now let say you have this set of letters: a,b,c,d,e
abcde -> 12345 -> 1111100 (bits)
Do AND operation and we have:
1111100 (abcde)
1111011 (abcdddffg, no e)
1111000 (result) => result != abcdddffg => word cannot be created
Other example with a,b,c,d,e,f,g,h:
11111111 (abcdefgh)
11110110 (abcdddffg, no e and h)
11110110 (result) => result == abcdddffg => word can be created
While converting a word to a number, also store its letter counts. If the first step finds a match, we go on to check that there are enough copies of each letter too.
Depending on the requirements, you might not need this second step.
O(n) to convert word to number and store letters count. Only need to do this once.
O(n) for each search query.
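Here is a rough Python sketch of the two-step check. Note that it puts 'a' in the lowest bit, whereas the examples above write 'a' as the leftmost bit; the result of the AND test is the same either way. The names mask, preprocess and search are made up for this example.

```python
from collections import Counter


def mask(chars):
    """26-bit mask with one bit per distinct letter ('a' is the lowest bit)."""
    m = 0
    for ch in chars:
        m |= 1 << (ord(ch) - ord("a"))
    return m


def preprocess(words):
    """Done once: store each word with its bit mask and its letter counts."""
    return [(w, mask(w), Counter(w)) for w in words]


def search(table, letters):
    letter_mask = mask(letters)
    letter_counts = Counter(letters)
    result = []
    for word, word_mask, word_counts in table:
        # Step 1: every distinct letter of the word must be available.
        if word_mask & letter_mask != word_mask:
            continue
        # Step 2: there must be enough copies of each letter.
        if all(letter_counts[ch] >= n for ch, n in word_counts.items()):
            result.append(word)
    return result


table = preprocess(["abcdddffg", "bad", "cab", "deed"])
print(search(table, "abcde"))
```

The mask test is a cheap filter; the count comparison only runs for the words that survive it.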
The following is a more efficient way:
1. Use counting sort to count all letters appearing in a word in the dictionary.
2. Do a counting sort on the collection of letters that you are given.
3. Compare: if the counts are the same, then the word can be made.
4. Do this for all words in the dictionary.
This will be inefficient for multiple such queries, so you can do the following:
1. Make a tuple for each word using counting sort.
2. Put the tuple in a tree or hash map with the count entries.
3. When a query is given, do a counting sort and look up the tuple in the hash map.
Time complexity:
The above method gives O(1) time complexity for a query and O(N) time complexity for hash table construction, where N is the number of words in the dictionary.
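A minimal Python sketch of the count-tuple lookup, assuming lowercase a to z only. As described, the lookup finds the words whose counts are the same as the query's, that is, the words that use exactly the given letters. The names count_key, build_index and lookup are made up for this example.

```python
def count_key(chars):
    """Tuple of 26 letter counts (the result of a counting sort)."""
    counts = [0] * 26
    for ch in chars:
        counts[ord(ch) - ord("a")] += 1
    return tuple(counts)


def build_index(words):
    """Hash map from count tuple to the words that have those counts."""
    index = {}
    for word in words:
        index.setdefault(count_key(word), []).append(word)
    return index


def lookup(index, letters):
    return index.get(count_key(letters), [])


index = build_index(["read", "dear", "dare", "red"])
print(lookup(index, "ader"))
```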
(cf. anagram search, e.g. using primes looks cleaner for a signature based approach - collect for all non-equivalent "substrings of letters")
Given the incentive, I'd (pre)order Dict by (set of characters that make up each word, increasing length) and loop over the subsets from letters checking validity of each word until too long.
Alternatively, finding the set of words from dict out of chars from letters can be considered a multi-dimensional range query: with "eeaspl" specifying letters, valid words have zero to two "e", one or none of a, s, p, l, and no other characters at all - bounds on word length (no longer than letters, lower bound to taste) blend in nicely.
Then again, data structures like k-d-trees do well with few, selective dimensions.
(Would-be comment: you do not mention alphabet cardinality, whether "valid" depends on capitalisation or diacritics, "complexity" includes programmer effort or preprocessing of dict - the latter may be difficult to amortise if dict is immutable.)
Swift 3
func findValidWords(wordsList: [String] , string: String) -> [String]{
    let charCountsDictInTextPassed = getCharactersCountIn(string: string)
    var wordsArrayResult: [String] = []
    for word in wordsList {
        var canBeProduced = true
        let currentWordCharsCount = getCharactersCountIn(string: word)
        for (char, count) in currentWordCharsCount {
            if let charCountInTextPassed = charCountsDictInTextPassed[char], charCountInTextPassed >= count {
                continue
            } else {
                // The letter is missing or there are not enough copies of it.
                canBeProduced = false
                break
            }
        }// end for
        if canBeProduced {
            wordsArrayResult.append(word)
        }//end if
    }//end for
    return wordsArrayResult
}
// Get the count of each character in the string
func getCharactersCountIn(string: String) -> [String: Int]{
    var charDictCount:[String: Int] = [:]
    for char in string.characters {
        if let count = charDictCount[String(char)] {
            charDictCount[String(char)] = count + 1
        } else {
            charDictCount[String(char)] = 1
        }
    }//end for
    return charDictCount
}
If letters can be repeated, that means that a word can be infinitely long. You would obviously cap this at the length of the longest word in the dictionary, but there are still too many words to check. Like nmore suggested, you'd rather iterate over the dictionary to do this.
List<String> findAllValidWords(Set<String> dict, char[] letters) {
    List<String> result = new LinkedList<>();
    Set<Character> charSet = new HashSet<>();
    for (char letter : letters) {
        charSet.add(letter);
    }
    for (String word : dict) {
        if (isPossible(word, charSet)) {
            result.add(word);
        }
    }
    return result;
}
boolean isPossible(String word, Set<Character> charSet) {
    // A word is possible if all its letters are contained in the given letter set
    for (int i = 0; i < word.length(); i++) {
        if (!charSet.contains(word.charAt(i))) {
            return false;
        }
    }
    return true;
}

Efficient algorithm to find a maximum common subset of two sets?

Each set contains bunch of checksums. For example:
Set A:
Set B:
The maximum common subset of A and B is:
A lot of these operations will be performed, so I'm looking for an efficient algorithm to do so.
Thanks for your help.
Put one of the sets in a hash table and iterate through the other, discarding elements that aren't in the hash. Alternatively, sort both and iterate through them simultaneously, as in merge sort.
EDIT: The latter method creates a sorted result. I should add that if the sets are of widely disparate sizes and they're presorted (say because you're doing a bunch of intersections), then you can realize a large performance improvement by using "unbounded" binary search to skip ahead in the large list.
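The "unbounded" binary search mentioned in the edit can be sketched in Python as follows. It assumes both inputs are already sorted lists without duplicates; the names first_at_least and intersect_sorted are made up for this example.

```python
from bisect import bisect_left


def first_at_least(seq, x, start):
    """Index of the first element >= x at or after start, found by galloping."""
    n = len(seq)
    if start >= n or seq[start] >= x:
        return start
    # Double the step until we overshoot x or run off the end.
    bound = 1
    while start + bound < n and seq[start + bound] < x:
        bound *= 2
    # The answer lies after the last probe known to be < x.
    return bisect_left(seq, x, start + bound // 2 + 1, min(start + bound, n))


def intersect_sorted(small, large):
    """Intersection of two sorted lists, skipping ahead in the large one."""
    result = []
    pos = 0
    for x in small:
        pos = first_at_least(large, x, pos)
        if pos == len(large):
            break
        if large[pos] == x:
            result.append(x)
            pos += 1
    return result


print(intersect_sorted([3, 50, 51, 99], list(range(0, 100, 3))))
```

Each skip costs time logarithmic in the distance skipped, which is where the gain over a plain merge comes from when one list is much larger than the other.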
Stick them in a hashtable and note the exact collisions.
Add Set A to a structure where you can find if a checksum exists.
Loop Set B, check if element exists in Set A, if it exists, add to Set C
Set C is your common subset.
Make ordered vector/list A from Set A
Make ordered vector/list B from Set B
Iterate over the ordered A and B, advancing whichever currently points at the smaller element; if the elements are identical, add one to the result and advance both.
When the underlying set structure is already ordered (a common case is some kind of tree: BST, AVL, etc.), you only need to perform the last step.
To make the last step clear, here is its pseudocode:
a = A.begin(); b = B.begin();
while(a!=A.end() && b!=B.end()){
    if(*a == *b){
        result.add(*a); ++a; ++b;
    } else if(*a < *b) {
        ++a;
    } else {
        ++b;
    }
}

String Tiling Algorithm

I'm looking for an efficient algorithm to do string tiling. Basically, you are given a list of strings, say BCD, CDE, ABC, A, and the resulting tiled string should be ABCDE, because BCD aligns with CDE yielding BCDE, which is then aligned with ABC yielding the final ABCDE.
Currently, I'm using a slightly naïve algorithm, that works as follows. Starting with a random pair of strings, say BCD and CDE, I use the following (in Java):
public static String tile(String first, String second) {
    for (int i = 0; i < first.length() || i < second.length(); i++) {
        // "right" tile (e.g., "BCD" and "CDE")
        String firstTile = first.substring(i);
        // "left" tile (e.g., "CDE" and "BCD")
        String secondTile = second.substring(i);
        if (second.contains(firstTile)) {
            return first.substring(0, i) + second;
        } else if (first.contains(secondTile)) {
            return second.substring(0, i) + first;
        }
    }
    return EMPTY;
}
System.out.println(tile("CDE", "ABCDEF")); // ABCDEF
System.out.println(tile("BCD", "CDE")); // BCDE
System.out.println(tile("CDE", "ABC")); // ABCDE
System.out.println(tile("ABC", tile("BCX", "XYZ"))); // ABCXYZ
Although this works, it's not very efficient, as it iterates over the same characters over and over again.
So, does anybody know a better (more efficient) algorithm to do this? This problem is similar to a DNA sequence alignment problem, so any advice from someone in this field (and others, of course) is very much welcome. Also note that I'm not looking for an alignment, but a tiling, because I require a full overlap of one of the strings over the other.
I'm currently looking for an adaptation of the Rabin-Karp algorithm, in order to improve the asymptotic complexity of the algorithm, but I'd like to hear some advice before delving any further into this matter.
Thanks in advance.
For situations where there is ambiguity -- e.g., {ABC, CBA} which could result in ABCBA or CBABC --, any tiling can be returned. However, this situation seldom occurs, because I'm tiling words, e.g. {This is, is me} => {This is me}, which are manipulated so that the aforementioned algorithm works.
Similar question: Efficient Algorithm for String Concatenation with Overlap
Order the strings by the first character, then length (smallest to largest), and then apply the adaptation to KMP found in this question about concatenating overlapping strings.
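For reference, a KMP-style overlap computation can be sketched in Python as below. This is not the code from the linked question; it is a generic prefix-function approach that handles only the case where a suffix of the left string overlaps a prefix of the right string (containment in the middle is not covered). The name overlap_concat is made up for this example.

```python
def overlap_concat(left, right):
    """Join left and right, sharing the longest suffix of left that is a prefix of right."""
    # KMP failure table for right: fail[i] is the length of the longest
    # proper prefix of right[:i + 1] that is also a suffix of it.
    fail = [0] * len(right)
    k = 0
    for i in range(1, len(right)):
        while k and right[i] != right[k]:
            k = fail[k - 1]
        if right[i] == right[k]:
            k += 1
        fail[i] = k

    # Scan left once; k tracks how much of right's prefix currently
    # matches a suffix of the part of left read so far.
    k = 0
    for ch in left:
        while k and (k == len(right) or ch != right[k]):
            k = fail[k - 1]
        if k < len(right) and ch == right[k]:
            k += 1
    return left + right[k:]


print(overlap_concat("BCD", "CDE"))
print(overlap_concat("ABC", "BCDE"))
```

The whole computation is linear in the combined length of the two strings, since neither string is rescanned from the start after a mismatch.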
I think this should work for the tiling of two strings, and be more efficient than your current implementation using substring and contains. Conceptually I loop across the characters in the 'left' string and compare them to a character in the 'right' string. If the two characters match, I move to the next character in the right string. Depending on which string's end is reached first, and on whether the last compared characters match, one of the possible tiling cases is identified.
I haven't thought of anything to improve the time complexity of tiling more than two strings. As a small note for multiple strings, the algorithm below is easily extended to checking the tiling of a single 'left' string with multiple 'right' strings at once, which might prevent extra looping over the strings a bit if you're trying to find out whether to do ("ABC", "BCX", "XYZ") or ("ABC", "XYZ", "BCX") by just trying all the possibilities. A bit.
string Tile(string a, string b)
{
    // Try both orderings of a and b,
    // since TileLeftToRight is not commutative.
    string ab = TileLeftToRight(a, b);
    if (ab != "")
        return ab;
    return TileLeftToRight(b, a);
    // Alternatively you could return whichever
    // of the two results is longest, for cases
    // like ("ABC" "BCABC").
}
string TileLeftToRight(string left, string right)
{
    int i = 0;
    int j = 0;
    while (true)
    {
        if (left[i] != right[j])
        {
            // Mismatch: restart the comparison one position
            // after the place where the current attempt began.
            i = i - j + 1;
            j = 0;
            if (i >= left.Length)
                return "";
        }
        else
        {
            i++;
            j++;
            if (i >= left.Length)
                return left + right.Substring(j);
            if (j >= right.Length)
                return left;
        }
    }
}
If Open Source code is acceptable, then you should check the genome benchmarks in Stanford's STAMP benchmark suite: it does pretty much exactly what you're looking for. Starting with a bunch of strings ("genes"), it looks for the shortest string that incorporates all the genes. So for example if you have ATGC and GCAA, it'll find ATGCAA. There's nothing about the algorithm that limits it to a 4-character alphabet, so this should be able to help you.
The first thing to ask is whether you want to find the tiling of {CDB, CDA}. There is no single tiling.
Interesting problem. You need some kind of backtracking. For example if you have:
Combining DBC with BCD results in:
Which is not solvable. But combining ABC with BCD results in:
Which can be combined to:

Hashing a Tree Structure

I've just come across a scenario in my project where I need to compare different tree objects for equality with already known instances, and have considered that some sort of hashing algorithm that operates on an arbitrary tree would be very useful.
Take for example the following tree:
        O
       / \
      /   \
     O     O
    /|\    |
   / | \   |
  O  O  O  O
          / \
         /   \
        O     O
Where each O represents a node of the tree, is an arbitrary object, and has an associated hash function. So the problem reduces to: given the hash codes of the nodes of a tree structure, and a known structure, what is a decent algorithm for computing a (relatively) collision-free hash code for the entire tree?
A few notes on the properties of the hash function:
The hash function should depend on the hash code of every node within the tree as well as its position.
Reordering the children of a node should distinctly change the resulting hash code.
Reflecting any part of the tree should distinctly change the resulting hash code
If it helps, I'm using C# 4.0 here in my project, though I'm primarily looking for a theoretical solution, so pseudo-code, a description, or code in another imperative language would be fine.
Well, here's my own proposed solution. It has been helped much by several of the answers here.
Each node (sub-tree/leaf node) has the following hash function:
public override int GetHashCode()
{
    int hashCode = unchecked((this.Symbol.GetHashCode() * 31 +
        this.Value.GetHashCode()));
    for (int i = 0; i < this.Children.Count; i++)
        hashCode = unchecked(hashCode * 31 + this.Children[i].GetHashCode());
    return hashCode;
}
The nice thing about this method, as I see it, is that hash codes can be cached and only recalculated when the node or one of its descendants changes. (Thanks to vatine and Jason Orendorff for pointing this out).
Anyway, I would be grateful if people could comment on my suggested solution here - if it does the job well, then great, otherwise any possible improvements would be welcome.
If I were to do this, I'd probably do something like the following:
For each leaf node, compute the concatenation of 0 and the hash of the node data.
For each internal node, compute the concatenation of 1 and the hash of any local data (NB: may not be applicable) and the hash of the children from left to right.
This will lead to a cascade up the tree every time you change anything, but that MAY be low enough of an overhead to be worthwhile. If changes are relatively infrequent compared to the number of times the hash is read, it may even make sense to go for a cryptographically secure hash.
Edit1: There is also the possibility of adding a "hash valid" flag to each node and simply propagating a "false" (or "hash invalid" and propagating "true") up the tree on a node change. That way, it may be possible to avoid a complete recalculation when the tree hash is needed and possibly avoid multiple hash calculations that are not used, at the risk of a slightly less predictable time to get a hash when needed.
Edit3: The hash code suggested by Noldorin in the question looks like it would have a chance of collisions, if the result of GetHashCode can ever be 0. Essentially, there is no way of distinguishing a tree composed of a single node, with "symbol hash" 30 and "value hash" 25 and a two-node tree, where the root has a "symbol hash" of 0 and a "value hash" of 30 and the child node has a total hash of 25. The examples are entirely invented, I don't know what expected hash ranges are so I can only comment on what I see in the presented code.
Using 31 as the multiplicative constant is good, in that it will cause any overflow to happen on a non-bit boundary, although I am thinking that, with sufficient children and possibly adversarial content in the tree, the hash contribution from items hashed early MAY be dominated by later hashed items.
However, if the hash performs decently on expected data, it looks as if it will do the job. It's certainly faster than using a cryptographic hash (as done in the example code listed below).
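The collision described in Edit3 can be checked numerically. The sketch below re-implements the question's combining step in Python; the function name `combine` and the 32-bit mask are assumptions made for illustration and are not part of the original C# code.

```python
def combine(symbol_hash, value_hash, child_hashes=()):
    """Mimic the question's scheme: (symbol * 31 + value), then fold in children."""
    h = (symbol_hash * 31 + value_hash) & 0xFFFFFFFF  # emulate 32-bit wraparound
    for child in child_hashes:
        h = (h * 31 + child) & 0xFFFFFFFF
    return h

# One node: symbol hash 30, value hash 25.
single_node = combine(30, 25)
# Two nodes: root with symbol hash 0 and value hash 30, one child whose total hash is 25.
two_nodes = combine(0, 30, [25])
print(single_node, two_nodes)  # both are 955
```

Both trees end up at 30 * 31 + 25, which is why a zero symbol hash makes the two shapes indistinguishable.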
Edit2: As for specific algorithms and minimum data structure needed, something like the following (Python, translating to any other language should be relatively easy).
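#! /usr/bin/env python
import Crypto.Hash.SHA

class Node:
    def __init__(self, parent=None, contents="", children=None):
        self.parent = parent
        self.valid = False
        self.hash = False
        self.contents = contents
        # Avoid sharing one mutable default list between all nodes.
        self.children = children if children is not None else []

    def append_child(self, child):
        child.parent = self
        self.children.append(child)
        self.invalidate()

    def invalidate(self):
        # Mark this node and all of its ancestors as stale.
        self.valid = False
        if self.parent:
            self.parent.invalidate()

    def gethash(self):
        if self.valid:
            return self.hash
        digester = Crypto.Hash.SHA.new()
        digester.update(self.contents.encode())
        if self.children:
            for child in self.children:
                digester.update(child.gethash().encode())
            self.hash = "1" + digester.hexdigest()
        else:
            self.hash = "0" + digester.hexdigest()
        self.valid = True
        return self.hash

    def setcontents(self, contents):
        self.contents = contents
        self.invalidate()
        return self.contents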
Okay, after your edit, which introduced the requirement that the hash should differ for different tree layouts, the only option left is to traverse the whole tree and write its structure to a single array.
That's done like this: you traverse the tree and dump the operations you do. For an original tree that could be (for a left-child-right-sibling structure):
[1, child, 2, child, 3, sibling, 4, sibling, 5, parent, parent, //we're at root again
sibling, 6, child, 7, child, 8, sibling, 9, parent, parent]
You may then hash the list (which is, effectively, a string) any way you like. As another option, you may even return this list as the result of the hash function, so it becomes a collision-free representation of the tree.
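As a rough illustration of the dump, here is a minimal Python sketch for a left-child-right-sibling structure. The class and its field names (`label`, `child`, `sibling`) are hypothetical, and hashing the `repr` of the list with SHA-1 is just one possible way to reduce it to a hash.

```python
import hashlib

class Node:
    """Left-child-right-sibling node; the field names are illustrative."""
    def __init__(self, label, child=None, sibling=None):
        self.label = label
        self.child = child
        self.sibling = sibling

def dump(node):
    """Record the labels plus the moves made while walking the structure."""
    ops = []
    while node is not None:
        ops.append(node.label)
        if node.child is not None:
            ops.append("child")
            ops.extend(dump(node.child))
            ops.append("parent")
        if node.sibling is not None:
            ops.append("sibling")
        node = node.sibling
    return ops

# The structure from the example: 1 -> 2 -> (3, 4, 5), then sibling 6 -> 7 -> (8, 9).
tree = Node(1,
            child=Node(2, child=Node(3, sibling=Node(4, sibling=Node(5)))),
            sibling=Node(6, child=Node(7, child=Node(8, sibling=Node(9)))))
ops = dump(tree)
digest = hashlib.sha1(repr(ops).encode()).hexdigest()
```

For this tree, `ops` reproduces the operation list shown above.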
But adding precise information about the whole structure is not what hash functions usually do. The approach above has to compute the hash of every node as well as traverse the whole tree. So you may consider other ways of hashing, described below.
If you don't want to traverse the whole tree:
One algorithm that immediately came to my mind is this. Pick a large prime number H (greater than the maximal number of children). To hash a tree, hash its root, pick child number H mod n, where n is the number of children of the root, and recursively hash the subtree of this child.
This seems to be a bad option if trees differ only deeply near the leaves. But at least it should run fast for not very tall trees.
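A minimal sketch of this single-path idea, assuming trees are given as `(label, children)` pairs with integer labels standing in for node hashes. The value of `H`, the multiplier 31 and the 32-bit mask are illustrative choices.

```python
H = 1_000_003  # a prime, assumed to exceed any node's child count

def path_hash(node):
    """Hash only one root-to-leaf path, choosing child number H mod n at each level."""
    label, children = node
    h = label & 0xFFFFFFFF
    while children:
        label, children = children[H % len(children)]
        h = (h * 31 + label) & 0xFFFFFFFF
    return h

a = (1, [(2, [(4, [])]), (3, [(5, [])])])
b = (1, [(2, [(9, [])]), (3, [(5, [])])])  # differs only off the chosen path
c = (1, [(2, [(4, [])]), (3, [(6, [])])])  # differs on the chosen path
```

Here `a` and `b` hash identically because the differing node is never visited, while `c` gets a different hash. That is the weakness mentioned above.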
If you want to hash fewer elements but go through the whole tree:
Instead of hashing a subtree, you may want to hash layer-wise, i.e. hash the root first, then hash one of the nodes that are its children, then one of the children of the children, etc. So you cover the whole tree instead of one specific path. This makes the hashing procedure slower, of course.
    --- O ------- layer 0, n=1
       / \
      /   \
 --- O --- O ----- layer 1, n=2
    /|\    |
   / | \   |
  /  |  \  |
 O - O - O O------ layer 2, n=4
          / \
         /   \
 ------ O --- O -- layer 3, n=2
A node from a layer is picked with the H mod n rule.
The difference between this version and the previous one is that a tree would have to undergo quite an unnatural transformation to keep the same hash value.
The usual technique of hashing any sequence is combining the values (or hashes thereof) of its elements in some mathematical way. I don't think a tree would be any different in this respect.
For example, here is the hash function for tuples in Python (taken from Objects/tupleobject.c in the source of Python 2.6):
static long
tuplehash(PyTupleObject *v)
{
    register long x, y;
    register Py_ssize_t len = Py_SIZE(v);
    register PyObject **p;
    long mult = 1000003L;
    x = 0x345678L;
    p = v->ob_item;
    while (--len >= 0) {
        y = PyObject_Hash(*p++);
        if (y == -1)
            return -1;
        x = (x ^ y) * mult;
        /* the cast might truncate len; that doesn't change hash stability */
        mult += (long)(82520L + len + len);
    }
    x += 97531L;
    if (x == -1)
        x = -2;
    return x;
}
It's a relatively complex combination with constants experimentally chosen for best results for tuples of typical lengths. What I'm trying to show with this code snippet is that the issue is very complex and very heuristic, and the quality of the results probably depends on the more specific aspects of your data, i.e. domain knowledge may help you reach better results. However, for good-enough results you shouldn't need to look too far. I would guess that taking this algorithm and combining all the nodes of the tree instead of all the tuple elements, while also bringing their position into play, will give you a pretty good algorithm.
One option of taking the position into account is the node's position in an inorder walk of the tree.
Any time you are working with trees recursion should come to mind:
public override int GetHashCode() {
    int hash = 5381;
    foreach(var node in this.BreadthFirstTraversal()) {
        hash = 33 * hash + node.GetHashCode();
    }
    return hash;
}
The hash function should depend on the hash code of every node within the tree as well as its position.
Check. We are explicitly using node.GetHashCode() in the computation of the tree's hash code. Further, because of the nature of the algorithm, a node's position plays a role in the tree's ultimate hash code.
Reordering the children of a node should distinctly change the resulting hash code.
Check. They will be visited in a different order during the traversal, leading to a different hash code. (Note that if there are two children with the same hash code you will end up with the same hash code upon swapping the order of those children.)
Reflecting any part of the tree should distinctly change the resulting hash code
Check. Again the nodes would be visited in a different order leading to a different hash code. (Note that there are circumstances where the reflection could lead to the same hash code if every node is reflected into a node with the same hash code.)
The collision-free property of this will depend on how collision-free the hash function used for the node data is.
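The traversal-based scheme can be sketched in Python as follows. Trees are assumed to be `(label, children)` pairs with integer labels standing in for node hash codes, and the 32-bit mask is an illustrative stand-in for integer overflow.

```python
from collections import deque

def tree_hash(root):
    """Fold node hashes together in breadth-first order."""
    h = 5381
    queue = deque([root])
    while queue:
        label, children = queue.popleft()
        h = (33 * h + label) & 0xFFFFFFFF
        queue.extend(children)
    return h

original = (1, [(2, []), (3, [])])
swapped = (1, [(3, []), (2, [])])
same_hash_children = (1, [(2, []), (2, [])])
chain = (1, [(2, [(3, [])])])
```

Swapping the two distinct children changes the result, as claimed. Note, however, that in this sketch `chain` and `original` produce the same hash, because both yield the breadth-first sequence 1, 2, 3; if the shape must matter, something like the child count would have to be mixed in as well.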
It sounds like you want a system where the hash of a particular node is a combination of the child node hashes, where order matters.
If you're planning on manipulating this tree a lot, you may want to pay the price in space of storing the hashcode with each node, to avoid the penalty of recalculation when performing operations on the tree.
Since the order of the child nodes matters, a method which might work here would be to combine the node data and children using prime number multiples and addition modulo some large number.
To go for something similar to Java's String hashcode:
Say you have n child nodes.
hash(node) = hash(nodedata) +
             hash(childnode[0]) * 31^(n-1) +
             hash(childnode[1]) * 31^(n-2) +
             <...> +
             hash(childnode[n-1])
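A minimal Python sketch of this weighting, assuming `(label, children)` pairs with integer labels standing in for hash(nodedata). The 32-bit mask and the use of Horner's rule to obtain the powers of 31 are illustrative choices.

```python
def node_hash(node):
    """hash(data) plus child hashes weighted by descending powers of 31."""
    label, children = node
    acc = 0
    for child in children:
        # Horner's rule: after the loop, child i carries the weight 31^(n-1-i).
        acc = (acc * 31 + node_hash(child)) & 0xFFFFFFFF
    return (label + acc) & 0xFFFFFFFF

print(node_hash((1, [(2, []), (3, [])])))  # 1 + 2*31 + 3 = 66
print(node_hash((1, [(3, []), (2, [])])))  # 1 + 3*31 + 2 = 96
```

Swapping the two children changes the result because each child's weight depends on its position.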
Some more detail on the scheme used above can be found here:
I can see that if you have a large set of trees to compare, then you could use a hash function to retrieve a set of potential candidates, then do a direct comparison.
A string that would work: use Lisp syntax to put brackets around the tree and write out the identifier of each node in pre-order. But this is computationally equivalent to a pre-order comparison of the trees, so why not just do that?
I've given 2 solutions: one is for comparing the two trees when you're done (needed to resolve collisions) and the other to compute the hashcode.
The most efficient way to compare will be to simply recursively traverse each tree in a fixed order (pre-order is simple and as good as anything else), comparing the node at each step.
So, just create a Visitor pattern that successively returns the next node in pre-order for a tree, i.e. its constructor can take the root of the tree.
Then, just create two instances of the Visitor that act as generators for the next node in pre-order, i.e. Visitor v1 = new Visitor(root1), Visitor v2 = new Visitor(root2)
Write a comparison function that can compare itself to another node.
Then just visit each node of the trees, comparing, and returning false if comparison fails. i.e.
Function Compare(Node root1, Node root2)
    Visitor v1 = new Visitor(root1)
    Visitor v2 = new Visitor(root2)
    loop
        Node n1 = v1.next()
        Node n2 = v2.next()
        if (n1 == null) and (n2 == null) then
            return true
        if (n1 == null) or (n2 == null) then
            return false
        if n1.compare(n2) != 0 then
            return false
    end loop
    // unreachable
End Function
If you want to write out a string representation of the tree, you can use the Lisp syntax for a tree, then sample the string to generate a shorter hashcode.
Function TreeToString(Node n1) : String
    if n1 == null
        return ""
    String s1 = "(" + n1.toString()
    for each child of n1
        s1 = s1 + TreeToString(child)
    return s1 + ")"
End Function
The node.toString() can return the unique label/hash code/whatever for that node. Then you can just do a string comparison on the strings returned by the TreeToString function to determine if the trees are equivalent. For a shorter hashcode, just sample the output of TreeToString, i.e. take every fifth character.
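The same idea in runnable form, as a small Python sketch. Trees are assumed to be `(label, children)` pairs, and labels are assumed not to contain parentheses, so the output stays unambiguous.

```python
def tree_to_string(node):
    """Render a tree in Lisp-style parenthesised pre-order."""
    if node is None:
        return ""
    label, children = node
    return "(" + str(label) + "".join(tree_to_string(c) for c in children) + ")"

s = tree_to_string(("a", [("b", []), ("c", [("d", [])])]))
sampled = s[::5]  # every fifth character, starting with the first
print(s)        # (a(b)(c(d)))
print(sampled)  # (()
```

Sampling shortens the string but discards most of it, so distinct trees can easily share a sampled value; a full comparison is still needed to confirm equality.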
I think you could do this recursively: Assume you have a hash function h that hashes strings of arbitrary length (e.g. SHA-1). Now, the hash of a tree is the hash of a string that is created as a concatenation of the hash of the current element (you have your own function for that) and hashes of all the children of that node (from recursive calls of the function).
For a binary tree you would have:
Hash( h(node->data) || Hash(node->left) || Hash(node->right) )
You may need to carefully check if tree geometry is properly accounted for. I think that with some effort you could derive a method for which finding collisions for such trees could be as hard as finding collisions in the underlying hash function.
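A minimal sketch of this recursive construction using Python's standard `hashlib`. Nodes are assumed to be `None` or `(data_bytes, left, right)` triples, and the `EMPTY` marker for a missing child is an assumption added so that a left-only and a right-only child are told apart.

```python
import hashlib

EMPTY = hashlib.sha1(b"empty").digest()  # stand-in hash for a missing child (assumption)

def merkle_hash(node):
    """Hash( h(data) || Hash(left) || Hash(right) ) for a binary tree."""
    if node is None:
        return EMPTY
    data, left, right = node
    h = hashlib.sha1()
    h.update(hashlib.sha1(data).digest())  # h(node->data)
    h.update(merkle_hash(left))            # Hash(node->left)
    h.update(merkle_hash(right))           # Hash(node->right)
    return h.digest()

leaf = (b"x", None, None)
left_only = (b"r", leaf, None)
right_only = (b"r", None, leaf)
```

Because every component is a fixed-length digest, the concatenation is unambiguous, and the mirrored trees `left_only` and `right_only` get different hashes.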
A simple enumeration (in any deterministic order) together with a hash function that depends on when the node is visited should work.
int hash(Node root) {
    ArrayList<Node> worklist = new ArrayList<Node>();
    worklist.add(root);
    int h = 0;
    int n = 0;
    while (!worklist.isEmpty()) {
        Node x = worklist.remove(worklist.size() - 1);
        worklist.addAll(x.children()); // children() is assumed to return the child list
        h ^= place_hash(x.hash(), n);
        n++;
    }
    return h;
}
int place_hash(int hash, int place) {
    return (Integer.toString(hash) + "_" + Integer.toString(place)).hashCode();
}
class TreeNode
{
    public static int QualityAgainstPerformance = 3; // tune this for your needs
    public static int PositionMarkConstant = 23498735; // just anything
    public object TargetObject; // the subject of this TreeNode, which has to contribute its hashcode

    IEnumerable<TreeNode> GetChildParticipiants()
    {
        yield return this;
        foreach (var child in Children)
        {
            // the recursive call yields the child itself first, then its descendants
            foreach (var descendant in child.GetChildParticipiants())
                yield return descendant;
        }
    }

    IEnumerable<TreeNode> GetParentParticipiants()
    {
        TreeNode parent = Parent;
        if (parent == null) yield break;
        do
        {
            yield return parent;
        }
        while ((parent = parent.Parent) != null);
    }

    public override int GetHashCode()
    {
        int computed = 0;
        // Limiting both sequences with the tuning constant is an assumption.
        var nodesToCombine =
            (Parent != null ? Parent : this).GetChildParticipiants()
                .Take(QualityAgainstPerformance)
                .Concat(GetParentParticipiants().Take(QualityAgainstPerformance));
        foreach (var node in nodesToCombine)
        {
            if (ReferenceEquals(node, this))
                computed = AddToMix(computed, PositionMarkConstant);
            else
                computed = AddToMix(computed, node.GetPositionInParent());
            computed = AddToMix(computed, node.TargetObject.GetHashCode());
        }
        return computed;
    }
}
AddToMix is a function that combines two hashcodes in such a way that the order matters.
I don't know what it is, but you can figure out. Some bit shifting, rounding, you know...
The idea is that you have to analyse some environment of the node, depending on the quality you want to achieve.
I have to say that your requirements are somewhat against the entire concept of hashcodes.
The computational complexity of a hash function should be very limited.
Its computational complexity should not depend linearly on the size of the container (the tree), otherwise it totally breaks hashcode-based algorithms.
Treating position as a major property of the node's hash function also goes somewhat against the concept of a tree, but it is achievable if you relax the requirement that the hash HAS to depend on the position.
The overall principle I would suggest is replacing MUST requirements with SHOULD requirements.
That way you can come up with an appropriate and efficient algorithm.
For example, consider building a limited sequence of integer hashcode tokens, and add what you want to this sequence, in the order of preference.
Order of the elements in this sequence is important, it affects the computed value.
For example, for each node you want to compute:
1. add the hashcode of the underlying object
2. add the hashcodes of the underlying objects of the nearest siblings, if available. I think even the single left sibling would be enough.
3. add the hashcode of the underlying object of the parent and its nearest siblings, as for the node itself, same as 2.
4. repeat this with the grandparents, to a limited depth.
//--------5------- ancestor depth 2 and it's left sibling;
//-------/|------- ;
//------4-3------- ancestor depth 1 and it's left sibling;
//-------/|------- ;
//------2-1------- this;
The fact that you are adding the hashcode of a direct sibling's underlying object gives the hash function a positional property.
If this is not enough, add the children:
You should not add every child, just some, to give a decent hashcode.
Add the first child, and its first child, and its first child, and so on. Limit the depth to some constant, and do not compute anything recursively: just use the hashcode of each node's underlying object.
//----- this;
This way the complexity is linear to the depth of the underlying tree, not the total number of elements.
Now you have a sequence of integers; combine them with a known algorithm, like Ely suggests above.
This way, you will have a lightweight hash function, with a positional property, not dependent on the total size of the tree, and even not dependent on the tree depth, and not requiring to recompute hash function of the entire tree when you change the tree structure.
I bet these 7 numbers would give a hash distribution that is near perfect.
Writing your own hash function is almost always a bug, because you basically need a degree in mathematics to do it well. Hash functions are incredibly nonintuitive and have highly unpredictable collision characteristics.
Don't try directly combining hashcodes for Child nodes -- this will magnify any problems in the underlying hash functions. Instead, concatenate the raw bytes from each node in order, and feed this as a byte stream to a tried-and-true hash function. All the cryptographic hash functions can accept a byte stream. If the tree is small, you may want to just create a byte array and hash it in one operation.
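As a rough sketch of this byte-stream approach using Python's standard `hashlib`: trees are assumed to be `(data_bytes, children)` pairs. The length prefix and the child count are additions made here (not part of the advice above) so that the stream also captures the tree's layout; SHA-256 is just one choice of established hash.

```python
import hashlib

def feed(node, digest):
    """Stream a tree into `digest` in pre-order."""
    data, children = node
    digest.update(len(data).to_bytes(4, "big"))      # length prefix keeps fields unambiguous
    digest.update(data)
    digest.update(len(children).to_bytes(4, "big"))  # child count encodes the shape
    for child in children:
        feed(child, digest)

def tree_digest(root):
    digest = hashlib.sha256()
    feed(root, digest)
    return digest.hexdigest()

flat = (b"a", [(b"b", []), (b"c", [])])
chain = (b"a", [(b"b", [(b"c", [])])])
```

Here `flat` and `chain` contain the same node data in the same pre-order, yet they get different digests because the child counts differ.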
