I've been studying trees for a few days, and I'm a little confused about how they are sorted. The basic node is easy enough:
template <typename TYPE>
struct Node {
    TYPE Data;
    Node<TYPE>* Left;
    Node<TYPE>* Right;
};
In terms of sorting, I understand that it is a simple comparison of a node's data (lower on the left, higher on the right), and I can see how this works for built-in types (int, double, float, char). What I'm confused about is how this is done with user-defined types and objects. What exactly is/should be compared? Is this simply a case-by-case answer, or is there a general method that can be used?
Any information that could help clear this up in my head would be greatly appreciated.
I don't think there is a single true answer to this question.
What exactly is/should be compared?
It depends on the data type! If you have strings you may want to sort them in alphabetical order. If the string represents a role, you may want to sort by importance. What about a color? You might sort by hue or by brightness. There is no universal answer.
If something is not comparable at all (or a comparison doesn't make sense, say for objects that represent clothes), maybe you shouldn't use a sorted tree!
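To make that concrete, here is a minimal, hypothetical sketch (the Employee type and its fields are my own invention): the tree only needs some operator< that the author of the type chooses, and the insert logic never has to know what the comparison means.

#include <string>

// Hypothetical example: a user-defined type that chooses its own ordering.
// Here an Employee is compared by (lastName, firstName); the BST code only
// needs "a < b" to compile, it does not care what that comparison means.
struct Employee {
    std::string firstName;
    std::string lastName;
    double salary;

    bool operator<(const Employee& other) const {
        if (lastName != other.lastName) return lastName < other.lastName;
        return firstName < other.firstName;   // tie-break on first name
    }
};

// The insert logic of the tree stays the same for any TYPE with operator<:
//     if (value < node->Data) go left; else go right.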
I can think of only a few, for example the zero-length list or set, or the zero-length string. How about empty matrices or tensors? How about parallelograms with all angles of zero degrees? How about a rectangle with two sides of zero length? Or a triangle with one angle of 180 degrees and the other two of zero? Can we keep going with many-sided polygons? Nah, that doesn't feel right. But I do believe there are similar degenerate shapes in 3-space.
But I am not much interested in those. I'm looking for some common math functions often used in programming which have well-known degenerate cases. I do lots of Mathematica and some JavaScript programming, but the actual programming language doesn't really matter, as this is more of a computer science question.
There are some interesting examples of degenerate data structures:
Degenerate binary tree - a binary tree in which every parent has only one child, so it degenerates into a linked list (see the sketch after this list).
Hash table with a constant hash function - hash table collisions can be handled in two main ways:
Chaining - every cell of the array links to a linked list, and elements with the same hash value are chained together in this list. So, when the hash function is constant, all elements have the same hash value and they all end up in the same chain; here, the hash table degenerates into a linked list.
Probing - here, if an element has the same hash as another one, I simply look for an empty space. Now, when the probing sequence is linear (so if cell i is occupied I'll look at cell i+1) and the hash value is always the same, every insertion generates only collisions, each element is put into the first empty space after the previous ones, and the table degenerates into yet another linked list.
Classes with no methods - A class without methods, so written like this:
class Fraction {
    int numerator;
    int denominator;
};
It degenerates into a struct, so like this:
struct Fraction {
    int numerator;
    int denominator;
};
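To make the first item concrete, here is a minimal sketch (my own example, assuming plain int keys) showing how inserting already-sorted keys into an unbalanced BST produces the degenerate, linked-list-shaped tree:

#include <cstdio>

// Unbalanced BST: inserting keys that are already sorted gives every node
// only a right child, so the "tree" is really a linked list and search
// degrades from O(log n) to O(n).
struct BstNode {
    int key;
    BstNode* left;
    BstNode* right;
    BstNode(int k) : key(k), left(nullptr), right(nullptr) {}
};

BstNode* insert(BstNode* root, int key) {
    if (!root) return new BstNode(key);
    if (key < root->key) root->left = insert(root->left, key);
    else                 root->right = insert(root->right, key);
    return root;
}

int main() {
    BstNode* root = nullptr;
    for (int k = 1; k <= 5; ++k) root = insert(root, k);   // sorted input
    int chain = 0;
    for (BstNode* n = root; n; n = n->right) ++chain;      // walk the right spine
    std::printf("chain length: %d\n", chain);              // prints 5
    return 0;
}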
And so on. Obviously, there are many other examples of degenerate cases for data structures or functions (in graph theory for example).
I hope this can help.
The problem
I am given N arrays of C booleans. I want to organize these into a data structure that allows me to do the following operation as fast as possible: given a new array, return true if this array is a "superset" of any of the stored arrays. By superset I mean this: A is a superset of B if A[i] is true for every i where B[i] is true. If B[i] is false, then A[i] can be anything.
Or, in terms of sets instead of arrays:
Store N sets (each with C possible elements) in a data structure so you can quickly look up whether a given set is a superset of any of the stored sets.
Building the data structure can take as long as necessary, but the lookup should be as efficient as possible, and the data structure can't take too much space.
Some context
I think this is an interesting problem on its own, but for the thing I'm really trying to solve, you can assume the following:
N = 10000
C = 1000
The stored arrays are sparse
The looked up arrays are random (so not sparse)
What I've come up with so far
For O(NC) lookup: Just iterate all the arrays. This is just too slow though.
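For reference, a sketch of this baseline using std::bitset (the function name and the fixed C are my own choices); the test (s & ~query).none() is exactly the superset condition described above:

#include <bitset>
#include <vector>

constexpr int C = 1000;   // positions per array, from the problem statement

// The query is a superset of a stored array exactly when the stored array
// has no 1-bit outside the query.
bool supersetOfAny(const std::bitset<C>& query,
                   const std::vector<std::bitset<C>>& stored) {
    for (const auto& s : stored)
        if ((s & ~query).none()) return true;   // O(C) per stored array
    return false;
}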
For O(C) lookup: I had a long description here, but as Amit pointed out in the comments, it was basically a BDD. While this has great lookup speed, it has an exponential number of nodes. With N and C so large, this takes too much space.
I hope that in between this O(N*C) and O(C) solution, there's maybe an O(log(N)*C) solution that doesn't require an exponential amount of space.
EDIT: A new idea I've come up with
For O(sqrt(N)C) lookup: Store the arrays as a prefix trie. When looking up an array A, go to the appropriate subtree if A[i]=0, but visit both subtrees if A[i]=1.
My intuition tells me that this should make the (average) complexity of the lookup O(sqrt(N)C), if you assume that the stored arrays are random. But: 1. they're not, the arrays are sparse. And 2. it's only intuition, I can't prove it.
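A rough, naive sketch of what this trie lookup could look like (my own types and names; it stores full depth-C paths, so a real version would want to cut each path short after the last 1-bit of a sparse stored array):

#include <bitset>
#include <memory>

constexpr int C = 1000;   // positions per array, from the problem statement

// Each trie level corresponds to one position i; a stored array follows the
// 0-child where its bit is 0 and the 1-child where it is 1. "terminal" marks
// the end of a stored array's path.
struct TrieNode {
    std::unique_ptr<TrieNode> child[2];
    bool terminal = false;
};

void insert(TrieNode& root, const std::bitset<C>& stored) {
    TrieNode* node = &root;
    for (int i = 0; i < C; ++i) {
        int b = stored[i] ? 1 : 0;
        if (!node->child[b]) node->child[b] = std::make_unique<TrieNode>();
        node = node->child[b].get();
    }
    node->terminal = true;
}

// True if some stored array reachable from this node is a subset of the query.
bool containsSubsetOf(const TrieNode* node, const std::bitset<C>& query, int i) {
    if (!node) return false;
    if (node->terminal) return true;
    if (i == C) return false;
    // A stored 0-bit is always compatible with the query, so always recurse there.
    if (containsSubsetOf(node->child[0].get(), query, i + 1)) return true;
    // A stored 1-bit is only compatible where the query also has a 1.
    return query[i] && containsSubsetOf(node->child[1].get(), query, i + 1);
}

// Usage: TrieNode root; insert(root, stored); containsSubsetOf(&root, query, 0);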
I will try out both this new idea and the BDD method, and see which of the two works out best.
But in the meantime, doesn't this problem occur more often? Doesn't it have a name? Hasn't there been previous research? It really feels like I'm reinventing the wheel here.
Just to add some background information to the prefix trie solution, recently I found the following paper:
I. Savnik: Index data structure for fast subset and superset queries. CD-ARES, IFIP LNCS, 2013.
The paper proposes the set-trie data structure (container) which provides support for efficient storage and querying of sets of sets using the trie data structure, supporting operations like finding all the supersets/subsets of a given set from a collection of sets.
For any python users interested in an actual implementation, I came up with a python3 package based partly on the above paper. It contains a trie-based container of sets and also a mapping container where the keys are sets. You can find it on github.
I think prefix trie is a great start.
Since your arrays are sparse, I would additionally test them in bulk. If (B1 ∪ B2) ⊂ A, both are included. So the idea is to OR-pack the arrays in pairs, and to repeat until there is only one "root" array (it would take only about twice as much space). It allows you to answer 'Yes' to your question earlier, which is mainly useful if you don't need to know which array is actually contained.
Independently, you can apply to each array a hash function that preserves ordering,
i.e. B ⊂ A ⇒ h(B) ≺ h(A).
ORing the bits together is such a function, but you can also count the 1-bits in suitable partitions of the array. Here, you can eliminate candidates faster (answering 'No' for a particular array).
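A sketch of that counting idea (the partition count and names are my own arbitrary choices): if B ⊂ A, then B's 1-bit count in every fixed partition is at most A's, so a mismatching count gives a fast 'No' before the exact bit test.

#include <array>
#include <bitset>

constexpr int C = 1000;
constexpr int PARTS = 16;                  // arbitrary number of partitions
constexpr int PART_LEN = (C + PARTS - 1) / PARTS;

// Count the 1-bits in each fixed partition of the array.
std::array<int, PARTS> partitionCounts(const std::bitset<C>& s) {
    std::array<int, PARTS> counts{};
    for (int i = 0; i < C; ++i)
        if (s[i]) ++counts[i / PART_LEN];
    return counts;
}

bool mayBeSubset(const std::array<int, PARTS>& b, const std::array<int, PARTS>& a) {
    for (int p = 0; p < PARTS; ++p)
        if (b[p] > a[p]) return false;     // definite 'No' for this candidate
    return true;                           // inconclusive: fall back to the exact check
}

bool isSubset(const std::bitset<C>& b, const std::bitset<C>& a) {
    return (b & ~a).none();                // exact check: every 1-bit of b is set in a
}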
You can simplify the problem by first reducing your list of sets to "minimal" sets: keep only those sets which are not supersets of any other stored set. The problem remains the same, because if some input set A is a superset of a set B you removed, then it is also a superset of at least one "minimal" subset C of B which was not removed. The advantage of doing this is that you tend to eliminate large sets, which makes the problem less expensive.
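A rough O(N^2) sketch of that preprocessing step (the function name is mine); since building the structure is allowed to be slow, a quadratic one-time pass over the stored sets should be acceptable:

#include <bitset>
#include <cstddef>
#include <vector>

constexpr int C = 1000;

// Keep only the "minimal" stored sets: drop any set that is a superset of
// another stored set (keeping one copy of exact duplicates).
std::vector<std::bitset<C>> minimalSets(const std::vector<std::bitset<C>>& sets) {
    std::vector<std::bitset<C>> result;
    for (std::size_t i = 0; i < sets.size(); ++i) {
        bool keep = true;
        for (std::size_t j = 0; j < sets.size(); ++j) {
            if (i == j) continue;
            bool jSubsetOfI = (sets[j] & ~sets[i]).none();
            bool equal = sets[j] == sets[i];
            if (jSubsetOfI && (!equal || j < i)) { keep = false; break; }
        }
        if (keep) result.push_back(sets[i]);
    }
    return result;
}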
From there I would use some kind of ID3 or C4.5 algorithm.
Building on the trie solution and the paper mentioned by @mmihaltz, it is also possible to implement a method for finding subsets by using already existing, efficient trie implementations for Python. Below I use the package datrie. The only downside is that the keys must be converted to strings, which can be done with "".join(chr(i) for i in myset). This, however, limits the range of elements to about 110000.
from datrie import BaseTrie, BaseState

def existsSubset(trie, setarr, trieState=None):
    if trieState is None:
        trieState = BaseState(trie)
    trieState2 = BaseState(trie)
    trieState.copy_to(trieState2)
    for i, elem in enumerate(setarr):
        if trieState2.walk(elem):
            if trieState2.is_terminal() or existsSubset(trie, setarr[i:], trieState2):
                return True
        trieState.copy_to(trieState2)
    return False
The trie can be used like a dictionary, but the range of possible elements has to be provided at the beginning:
alphabet = "".join(chr(i) for i in range(100))
trie = BaseTrie(alphabet)
for subset in sets:
trie["".join(chr(i) for i in subset)] = 0 # the assigned value does not matter
Note that the trie implementation above works only with keys larger than (and not equal to) 0. Otherwise, the integer to character mapping does not work properly. This problem can be solved with an index shift.
A cython implementation that also covers the conversion of elements can be found here.
Here's a restatement of the rather cryptic title question:
Suppose we have a Prototype tree that has already been built, containing all the information about the structure of the tree and a generic description of each node. Now we want to create instances of this tree whose elements contain extra, unique data. Let's call these Concrete trees.
The only difference between Concrete and Prototype trees is the extra data in the nodes of the Concrete tree. Supposing each node of a Concrete tree has a pointer/link to the corresponding element in the Prototype tree for generic information about the node, but no parent/child information of its own:
Is it possible to traverse the Concrete tree?
In particular, given a starting node in the Concrete tree, and a path through the Prototype tree, is it possible to efficiently get the corresponding node in the Concrete tree? There can be many Concrete trees, so a link back from Prototype tree is not possible.
Even though I might not need to optimize things to such an extent in my code, this is still an interesting problem!
Thanks in advance!
NOTE: There are no restrictions on the branching factor of the tree- a node can have between one and hundreds of children.
Extra ramblings/ideas:
The reason I ask, is that it seems like it would be a waste to copy parent/child information each time a new instance of a Concrete tree is created, since this structure is identical to the Prototype tree. In my particular case, children are identified by string names, so I have to store a string-to-pointer hash at each node. There can be many instances of Concrete trees, and duplicating this hash seems like a huge waste of space.
As a first idea, perhaps the path could be somehow hashed into an int or something that compactly identifies an element (not a string, since that's too big), which is then used to look up concrete elements in hashes for each Concrete tree?
Once created, will the prototype tree ever change (i.e. will nodes ever be inserted or removed)?
If not, you could consider array-backed trees (i.e. child/parent links are represented by array indices, not raw pointers), and use consistent indexing for your concrete trees. That way, it's trivial to map from concrete to prototype, and vice versa.
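A hedged sketch of what that consistent-indexing idea could look like (the types and names are my own): the prototype owns the structure and the name-to-child hash exactly once, while each concrete tree is just a flat vector of payloads indexed the same way, so mapping between the two directions is simply reusing the index.

#include <string>
#include <unordered_map>
#include <vector>

// Prototype: owns the structure (parent/child links as indices) and the
// generic per-node description.
struct PrototypeTree {
    struct Node {
        std::string name;                               // generic description
        int parent = -1;                                // index into nodes, -1 for the root
        std::unordered_map<std::string, int> children;  // child name -> index
    };
    std::vector<Node> nodes;

    // Follow a path of child names starting from node index `from`; -1 if absent.
    int walk(int from, const std::vector<std::string>& path) const {
        int cur = from;
        for (const auto& step : path) {
            auto it = nodes[cur].children.find(step);
            if (it == nodes[cur].children.end()) return -1;
            cur = it->second;
        }
        return cur;
    }
};

// Concrete: no structure of its own, just per-node data in the same order.
template <typename Data>
struct ConcreteTree {
    const PrototypeTree* proto;              // shared structure
    std::vector<Data> data;                  // data[i] belongs to proto->nodes[i]

    Data& at(int nodeIndex) { return data[nodeIndex]; }   // same index as in the prototype
};

Looking up the node that corresponds to a prototype path in a particular concrete tree is then concrete.at(proto->walk(start, path)), after checking that walk did not return -1.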
You could have a concrete leaf for each prototype node, but you'd need to do some kind of hashing per tree (as you suggest) to keep different concrete trees separate. At this point you've incurred the same storage cost as a completely separate tree with redundant child/parent pointers. You definitely want a link from the prototype tree to the concrete trees.
I can see this approach being useful if you want to make structural changes to the prototype tree affect all linked concrete trees. Shuffling nodes would instantly affect all concrete trees. You may incur extra cost since it will be impossible to transmit a single concrete tree without either sending every concrete tree or doing some extract operation to rip one tree out.
In general you will not be able to encode a path uniquely in an int.
Just store the parent/child relationships in the concrete tree and forget about it. At best it's a single pointer value, at worst it's two pointer values. You would need at least that much to keep links between the prototype tree and the concrete tree anyway.
It's possible when there's a known dependency between the addresses of the nodes in both trees. Basically it means that the nodes have to be fixed-size and allocated all at once.
Sure, it's also possible to use a hash table for mapping addresses of first-tree nodes to second-tree nodes, but such a hash table has to have at least 10x more entries than the first tree has nodes, otherwise the mapping would be too slow.
#include <stdio.h>

typedef unsigned char byte;

// "Prototype" tree: structure only (child links).
struct Node1 {
    Node1* child[2];
    Node1() { child[0]=child[1]=0; }
};

// "Concrete" tree: payload only, no child links of its own.
struct Node2 {
    int N;
    Node2() { N=0; }
};

int main( void ) {
    int i,j,k,N = 256;
    Node1* p = new Node1[2*N];   // structure nodes, allocated all at once
    Node2* q = new Node2[2*N];   // data nodes, same indexing as p[]

    // insert: build an 8-level binary trie over the bits of i
    for( i=0,k=1; i<N; i++ ) {
        Node1* root = &p[0];
        Node1** r = &root;
        for( j=7;; j-- ) {
            if( r[0]==0 ) r[0]=&p[k++];
            if( j<0 ) break;
            r = &r[0]->child[(i>>j)&1];
        }
        q[r[0]-p].N = byte(i+123);
        //  ^^^^^^^ - mapping from p[] to q[] via pointer arithmetic
    }

    // check: walk the structure in p[], read the data from q[]
    for( i=N-1; i>=0; i-- ) {
        Node1* r = &p[0];
        for( j=7; j>=0; j-- ) r = r->child[(i>>j)&1];
        if( q[r-p].N != byte(i+123) ) printf( "error!\n" );
    }
    return 0;
}
I think you can do what you describe, but I don't believe it constitutes an optimisation (for the kind of reasons referred to by @Dave). The key to doing so lies in tying the pointers back to the prototype in such a way that they also act as identifiers. In addition, the major traversals through the prototype tree would need to be pre-calculated: a breadth-first and a depth-first traversal.
The pre-calculated traversals are likely to use a stack or queue, depending on the particular traversal. In addition, as the traversals are done, an indexed linked list needs to be built in traversal order (or, as @Oli suggests, an indexed array). The data in the linked list is the identifier (see below) of the node. Each prototype tree and each prototype node needs an identifier (which could be an address, or an arbitrary identifier). Each concrete tree has its own identifier. Each concrete node is given the SAME identifier as its corresponding node in the prototype tree. Then, to follow a partial traversal, you find the node identifier in the linked list and use this as the identifier of the concrete node.
In essence you are creating a link between the prototype and the concrete nodes by using the equivalence of the identifiers as the pointer (a sort of "ghost" pointer). It does require a number of supporting mechanisms, and these are likely to mean this route is not an actual optimisation.
Consider a class containing two doubles:
class path_cost {
    double length;
    double time;
};
If I want to lexicographically order a list of path_costs, I have a problem. Read on :)
If I use exact equality for the equality test, like so:
bool operator<(const path_cost& rhs) const {
    if (length == rhs.length) return time < rhs.time;
    return length < rhs.length;
}
the resulting order is likely to be wrong, because a small deviation (e.g. due to numerical inaccuracies in the calculation of the length) may cause the length test to fail, so that e.g.
{ 231.00000000000001, 40 } < { 231.00000000000002, 10 }
erroneously holds.
If I alternatively use a tolerance like so
bool operator<(const path_cost& rhs) const {
    if (std::fabs(length - rhs.length) < 1e-6) return time < rhs.time;
    return length < rhs.length;
}
then the sorting algorithm may horribly fail since the <-operator is no longer transitive (that is, if a < b and b < c then a < c may not hold)
Any ideas? Solutions? I have thought about partitioning the real line so that numbers within each partition are considered equal, but that still leaves too many cases where the equality test fails but should not.
(UPDATE by James Curran, hopefully explaining the problem):
Given the numbers:
A = {231.0000001200, 10}
B = {231.0000000500, 40}
C = {231.0000000100, 60}
A.Length and B.Length differ by 7e-7, so we use time, and A < B.
B.Length and C.Length differ by 4e-7, so we use time, and B < C.
A.Length and C.Length differ by 1.1e-6, so we use length, and A > C.
(Update by Esben Mose Hansen)
This is not purely theoretical. The standard sort algorithms tend to crash, or worse, when given a non-transitive sort operator. And this is exactly what I have been contending with (and boy, was that fun to debug ;) )
Do you really want just a compare function?
Why don't you sort by length first, then group the pairs into what you think are the same length and then sort within each group by time?
Once sorted by length, you can apply whatever heuristic you need, to determine 'equality' of lengths, to do the grouping.
I don't think you are going to be able to do what you want. Essentially you seem to be saying that in certain cases you want to ignore the fact that a > b and pretend a = b. I'm pretty sure you can construct a proof that says: if a and b are equivalent whenever their difference is smaller than a certain value, then a and b are equivalent for all values of a and b. Something along the lines of:
For a tolerance of C and two numbers A and B where, without loss of generality, A > B, there exist D(n) = B + n*(C/10) for 0 <= n <= 10*(A-B)/C, such that trivially D(n) is within the tolerance of D(n-1) and D(n+1) and therefore equivalent to them. Also D(0) = B and D(10*(A-B)/C) = A, so A and B can be said to be equivalent.
I think the only way you can solve that problem is with a partitioning method. Something like multiplying by 10^6 and then converting to an int should partition pretty well, but it will mean that if you have 1.00001*10^-6 and 0.999999*10^-6 then they will come out in different partitions, which may not be desired.
The problem then becomes looking at your data to work out how to best partition it which I can't help with since I don't know anything about your data. :)
P.S. Do the algorithms actually crash when given such a comparison function, or only when they encounter specific unsolvable cases?
I can think of two solutions.
You could carefully choose a sorting algorithm that does not fail when the comparisons are intransitive. For example, quicksort shouldn't fail, at least if you implement it yourself. (If you are worried about the worst case behavior of quicksort, you can first randomize the list, then sort it.)
Or you could extend your tolerance patch so that it becomes an equivalence relation and you restore transitivity. There are standard union-find algorithms to complete any relation to an equivalence relation. After applying union-find, you can replace the length in each equivalence class with a consensus value (such as the average, say) and then do the sort that you wanted to do. It feels a bit strange to doctor floating point numbers to prevent spurious reordering, but it should work.
Actually, Moron makes a good point. Instead of union and find, you can sort by length first, then link together neighbors that are within tolerance, then do a subsort within each group on the second key. That has the same outcome as my second suggestion, but it is a simpler implementation.
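A sketch of that simpler implementation (the tolerance is an assumed parameter; pick it from the known error of your length computation): sort by length with an exact, transitive comparison, then walk the sorted range, chain neighbours whose lengths differ by less than the tolerance into one group, and subsort each group by time.

#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

struct path_cost {
    double length;
    double time;
};

void tolerant_sort(std::vector<path_cost>& v, double eps = 1e-6) {
    // 1. Total order on length alone (exact comparison, so it is transitive).
    std::sort(v.begin(), v.end(),
              [](const path_cost& a, const path_cost& b) { return a.length < b.length; });

    // 2. Chain neighbours whose lengths are within eps into one group and
    //    sort each group by time.
    std::size_t begin = 0;
    for (std::size_t i = 1; i <= v.size(); ++i) {
        if (i == v.size() || std::fabs(v[i].length - v[i - 1].length) >= eps) {
            std::sort(v.begin() + begin, v.begin() + i,
                      [](const path_cost& a, const path_cost& b) { return a.time < b.time; });
            begin = i;
        }
    }
}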
I'm not familiar with your application, but I'd be willing to bet that the differences in distance between points in your graph are many orders of magnitude larger than the rounding errors on floating point numbers. Therefore, if two entries differ by only the round-off error, they are essentially the same, and it makes no difference in which order they appear in your list. From a common-sense perspective, I see no reason to worry.
You will never get 100% precision with ordinary doubles. You say that you are afraid that using tolerances will affect the correctness of your program. Have you actually tested this? What level of precision does your program actually need?
In most common applications I find a tolerance of something like 1e-9 suffices. Of course it all depends on your application. You can estimate the level of accuracy you need and just set the tolerance to an acceptable value.
If even that fails, it means that double is simply inadequate for your purposes. This scenario is highly unlikely, but can arise if you need very high precision calculations. In that case you have to use an arbitrary precision package (e.g. BigDecimal in Java or something like GMP for C). Again, only choose this option when there is no other way.
Say I have a binary tree with the following definition for a node.
struct node
{
    int key1;
    int key2;
};
The binary search tree is created on the basis of key1. Now, is it possible to rearrange the binary search tree on the basis of key2 in O(1) space? (I can do this in variable space using an array of pointers to nodes.)
The actual problem where I require this is "counting the number of occurrences of unique words in a file and displaying the result in decreasing order of frequency."
Here, a BST node is
{
    char *word;
    int freq;
}
The BST is first created on the basis of the alphabetical order of the words, and finally I want it ordered on the basis of freq.
Am I wrong in my choice of data structure, i.e. a BST?
I think you can create a new tree sorted by freq and push all the elements into it, popping them from the old tree.
That could be O(1), though likely more like O(log N), which isn't big anyway.
Also, I don't know what you call it in C#, but in Python you can use a list and sort it in place by two different keys.
A map or BST is good if you need sorted output for your dictionary.
It is also good if you need to mix add, remove and lookup operations.
I don't think that is your need here. You load the dictionary, sort it, then only do lookups in it, right?
In this case a sorted array is probably a better container. (See Item 23 of Effective STL by Scott Meyers.)
(Update: simply consider that a map can generate more memory cache misses than a sorted array, since an array keeps its data contiguous in memory, while each node in a map contains two pointers to other nodes in the map. When your objects are simple and do not take much space in memory, a sorted vector is probably a better option. I warmly recommend that you read that item from Meyers's book.)
For the kind of sort you are talking about, you will need this algorithm from the STL:
stable_sort.
The idea is to sort the dictionary by word, then run stable_sort() with a comparison on the frequency key.
It will give something like this (not actually tested, but you get the idea):
#include <algorithm>
#include <string>
#include <vector>

struct Node
{
    char* word;
    int key;   // frequency
};

bool operator<(const Node& l, const Node& r)
{
    return std::string(l.word) < std::string(r.word);
}

bool freq_comp(const Node& l, const Node& r)
{
    return l.key > r.key;   // higher frequency first (decreasing order)
}

std::vector<Node> my_vector;
... // loading elements
std::sort(my_vector.begin(), my_vector.end());
std::stable_sort(my_vector.begin(), my_vector.end(), freq_comp);
Using a Hashtable (Java) or Dictionary (.NET) or equivalent data structure in your language of choice (hash_set or hash_map in the STL) will give you O(1) inserts during the counting phase, unlike the binary search tree, which would be somewhere from O(log n) to O(n) per insert depending on whether it balances itself. If performance is really that important, just make sure you initialize your hash table with a large enough capacity that it won't need to resize itself dynamically, which can be expensive.
As for listing by frequency, I can't immediately think of a tricky way to do that without involving a sort, which would be O(n log n).
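A sketch of that counting-then-sorting approach in C++ (my own function, using std::unordered_map as the hash container): O(1) average insert while counting, and one O(n log n) sort at the end for the frequency-ordered output.

#include <algorithm>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

std::vector<std::pair<std::string, int>> countWords(const std::vector<std::string>& words) {
    std::unordered_map<std::string, int> freq;
    freq.reserve(words.size());        // avoid rehashing during the counting phase
    for (const auto& w : words) ++freq[w];

    std::vector<std::pair<std::string, int>> result(freq.begin(), freq.end());
    std::sort(result.begin(), result.end(),
              [](const auto& a, const auto& b) { return a.second > b.second; });  // decreasing frequency
    return result;
}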
Here is my suggestion for re-balancing the tree based off of the new keys (well, I have 2 suggestions).
The first and more direct one is to somehow adapt heapsort's "bubble-up" function (to use Sedgewick's name for it); Wikipedia calls it "sift-up". It is not designed for an entirely unbalanced tree (which is what you'd need), but I believe it demonstrates the basic flow of an in-place reordering of a tree. It may be a bit hard to follow because the tree is in fact stored in an array rather than as a tree (though the logic in a sense treats it as a tree). Perhaps, though, you'll find such an array-based representation is best! Who knows.
The more crazy-out-there suggestion of mine is to use a splay tree. I think they're nifty, and here's the wiki link. Basically, whichever element you access is "bubbled up" to the top, but it maintains the BST invariants. So you maintain the original Key1 for building the initial tree, but hopefully most of the "higher-frequency" values will also be near the top. This may not be enough (as all it will mean is that higher-frequency words will be "near" the top of the tree, not necessarily ordered in any fashion), but if you do happen to have or find or make a tree-balancing algorithm, it may run a lot faster on such a splay tree.
Hope this helps! And thank you for an interesting riddle, this sounds like a good Haskell project to me..... :)
You can easily do this in O(1) space, but not in O(1) time ;-)
Even though re-arranging a whole tree recursively until it is sorted again seems possible, it is probably not very fast; it may be O(n) at best, and probably worse in practice. So you might get a better result by adding all the nodes to an array once you are done with the tree and just sorting this array by frequency with quicksort (which will be O(n log n) on average). At least that's what I would do. Even though it takes extra space, it sounds more promising to me than re-arranging the tree in place.
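A rough sketch of that flatten-and-sort idea (the node layout is the one from the question, extended with the usual left/right child pointers, and std::sort stands in for a hand-rolled quicksort):

#include <algorithm>
#include <vector>

struct node {
    char* word;
    int freq;
    node* left;
    node* right;
};

// In-order walk that collects pointers to every node of the existing BST.
void collect(node* t, std::vector<node*>& out) {
    if (!t) return;
    collect(t->left, out);
    out.push_back(t);
    collect(t->right, out);
}

// Flatten the tree into a vector, then sort by decreasing frequency:
// O(n) extra space, O(n log n) time.
std::vector<node*> byFrequency(node* root) {
    std::vector<node*> out;
    collect(root, out);
    std::sort(out.begin(), out.end(),
              [](const node* a, const node* b) { return a->freq > b->freq; });
    return out;
}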
One approach you could consider is to build two trees. One indexed by word, one indexed by freq.
As long as the tree nodes contain a pointer to the data node, you could access it via the word-based tree to update the info, but later access it via the freq-based tree for output.
Although, if speed is really that important, I'd be looking to get rid of the string as a key. String comparisons are notoriously slow.
If speed is not important, I think your best bet is to gather the data based on word and re-sort based on freq as yves has suggested.